Welcome back. I was reading through some High Availability features and came across a good discussion of NSR (Non-Stop Routing), NSF (Non-Stop Forwarding) and GR (Graceful Restart), so I thought of sharing it in simple words. But first, let's understand why these features are needed.
Most high-end routers separate the control plane from the forwarding plane, giving each its own components, viz. memory and processors. The control plane runs the routing protocols, maintains the databases necessary for route processing and derives the FIB (Forwarding Information Base), which is handed to the forwarding plane to make the forwarding decisions. This is done so that the control plane can keep processing new routing information even if the forwarding plane becomes too busy. Conversely, if the control plane becomes very busy with new route information, it doesn't affect the ability of the forwarding plane to continue forwarding traffic at a high rate.
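To make the split a bit more concrete, here is a rough Python sketch of a control plane that derives a FIB and a forwarding plane that only ever consults its own copy. The class names, the metric-based route selection and the method names are all my own assumptions for illustration, not any vendor's implementation.

```python
import ipaddress


class ControlPlane:
    """Hypothetical control plane: keeps candidate routes and derives the FIB."""

    def __init__(self):
        # prefix -> list of (metric, next_hop) candidates learned from protocols
        self.rib = {}

    def learn_route(self, prefix, next_hop, metric):
        network = ipaddress.ip_network(prefix)
        self.rib.setdefault(network, []).append((metric, next_hop))

    def derive_fib(self):
        # Assumption: the best route is simply the one with the lowest metric.
        return {net: min(candidates)[1] for net, candidates in self.rib.items()}


class ForwardingPlane:
    """Hypothetical forwarding plane: holds its own FIB copy and does lookups."""

    def __init__(self):
        self.fib = {}

    def install_fib(self, fib):
        # Copy the table so later control plane activity cannot touch it.
        self.fib = dict(fib)

    def lookup(self, destination):
        address = ipaddress.ip_address(destination)
        matches = [net for net in self.fib if address in net]
        if not matches:
            return None
        # Longest prefix match: the most specific matching prefix wins.
        return self.fib[max(matches, key=lambda net: net.prefixlen)]


if __name__ == "__main__":
    cp = ControlPlane()
    cp.learn_route("10.0.0.0/8", "router-A", metric=20)
    cp.learn_route("10.1.0.0/16", "router-C", metric=5)
    fp = ForwardingPlane()
    fp.install_fib(cp.derive_fib())
    print(fp.lookup("10.1.2.3"))
```

The point of the sketch is that `lookup` never calls into the control plane, so a busy control plane does not slow down forwarding.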
NSF is basically the ability of the forwarding plane to continue forwarding traffic even if the control plane stops. Now, this could also cause issues. Consider a situation where the network information changes while the control plane is down. This makes the FIB on the forwarding plane invalid, since nothing is updating it. To overcome this issue, we have redundant control planes. NSF allows you to move from the primary to the backup control plane without disrupting forwarding. The FIB can still become invalid while the switchover from the primary to the backup control plane is happening, but the risk is quite acceptable. The shorter the switchover time, the smaller this risk. So, if the backup maintains a copy of the active configuration and the current state of the system components, the switchover from the primary to the backup control plane becomes much faster. This concept is known as Stateful Switchover (SSO).
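The NSF and SSO behaviour described above can be sketched roughly as follows. This is a toy model under my own assumptions: the names (`Router`, `fail_active`, `switchover`) are hypothetical, and the FIB is keyed by exact prefix string rather than doing a real longest-prefix lookup.

```python
class ControlPlane:
    """Hypothetical control plane instance (primary or backup)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.routes = {}  # prefix -> next hop


class Router:
    """Toy router with two control planes and one forwarding table."""

    def __init__(self):
        self.active = ControlPlane("primary")
        self.standby = ControlPlane("backup")
        self.fib = {}  # owned by the forwarding plane

    def learn_route(self, prefix, next_hop):
        if not self.active.alive:
            raise RuntimeError("no running control plane to process the update")
        self.active.routes[prefix] = next_hop
        # SSO: checkpoint the new state to the standby as it changes.
        if self.standby.alive:
            self.standby.routes[prefix] = next_hop
        self.fib[prefix] = next_hop

    def forward(self, prefix):
        # NSF: forwarding uses only the FIB, not the control plane.
        return self.fib.get(prefix)

    def fail_active(self):
        self.active.alive = False

    def switchover(self):
        # The standby already holds the state, so it takes over immediately.
        self.active, self.standby = self.standby, self.active


if __name__ == "__main__":
    router = Router()
    router.learn_route("10.0.0.0/8", "router-A")
    router.fail_active()
    print(router.forward("10.0.0.0/8"))  # still forwards during the outage
    router.switchover()
    print(router.active.name, router.active.routes)
```

Between `fail_active()` and `switchover()` the FIB keeps serving traffic but cannot be updated, which is exactly the window in which it can go stale.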
The most confusing part of this discussion is NSR (Non-Stop Routing). The one big problem with the control plane switchover described so far is that, even if we minimize the switchover time using the stateful procedures, the routing protocol adjacencies are broken during the switchover. When the primary control plane goes down, any neighboring router that had a peering session with it sees the peering session fail. The adjacency is established again when the backup control plane takes over as primary, but in the interim the neighbor has advertised to its own neighbors that router 'X' is no longer a valid next hop to any destination beyond it and that they should find another path. The neighbors then have to recalculate the path again once the adjacency with the backup control plane is back up. All these frequent changes can be quite disruptive to the network. The main objective of NSR is to prevent, or at least minimize, the effect of the broken peering sessions.
One way of handling the broken adjacencies during control plane switchovers is the Graceful Restart (GR) protocol extensions. Each routing protocol has its own specific GR extensions, but they all work in pretty much the same way. When router X's control plane goes down, its neighbors, instead of reporting to their own neighbors that router X is unavailable, wait for a certain amount of time (which we can call a grace period). If router X comes back up before the grace period expires, the devices beyond the connected neighbor are not impacted by the temporarily broken session.
Points to remember for GR:
1. Neighbors must support the GR protocol extensions. Control plane switchovers can be highly disruptive on PE devices, as they have the largest number of adjacencies.
2. If there is a complete control plane failure or router failure rather than just a switchover, the GR grace period can slow down network convergence.
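Here is a small Python sketch of the neighbor's side of GR, mainly to show how the grace period cuts both ways. It is not modelled on any specific protocol's GR extension: the class name, the methods and the 120-second timer are all assumptions, and time is passed in as plain numbers so the behaviour is easy to follow.

```python
class GracefulRestartHelper:
    """Hypothetical neighbor of router X that honors a GR grace period."""

    def __init__(self, grace_period):
        self.grace_period = grace_period
        self.session_lost_at = {}  # peer -> time the session went down
        self.withdrawn = set()     # peers already reported as unavailable

    def session_down(self, peer, now):
        # Do not tell anyone yet; just start the grace timer.
        self.session_lost_at[peer] = now

    def session_up(self, peer, now):
        # The peer is back: cancel the timer and treat it as usable again.
        self.session_lost_at.pop(peer, None)
        self.withdrawn.discard(peer)

    def tick(self, now):
        """Return the peers whose grace period has just expired."""
        expired = [
            peer
            for peer, lost_at in self.session_lost_at.items()
            if now - lost_at >= self.grace_period
        ]
        for peer in expired:
            del self.session_lost_at[peer]
            self.withdrawn.add(peer)
        return expired

    def is_usable(self, peer):
        return peer not in self.withdrawn


if __name__ == "__main__":
    helper = GracefulRestartHelper(grace_period=120)
    helper.session_down("X", now=0)
    helper.session_up("X", now=90)     # switchover finished in time
    print(helper.tick(now=200))        # nothing to withdraw

    helper.session_down("X", now=300)  # this time X is really dead
    print(helper.tick(now=419))        # still waiting
    print(helper.tick(now=420))        # only now is X reported as down
```

In the second case the rest of the network only hears about the failure once the grace period has run out, which is the slower convergence mentioned in point 2.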
The newer generation of NSR uses internal processes to keep the backup control plane aware of routing protocol state and adjacency maintenance activities, so that after the switchover the backup control plane can take charge of the existing peering sessions rather than having to establish new ones. And since it's all internal to the router, there is no need for the neighbors to support any kind of protocol extensions. With NSR, the switchover is pretty much transparent to the neighbors, lowering the possibility of service disruptions.
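As a very simplified illustration of the NSR idea, the sketch below replicates per-neighbor session state to the backup on every change, so the backup can carry on with the same sessions after a switchover. Everything here is hypothetical (the `Session` fields, the `last_seq` counter, the class names); real implementations replicate far more protocol state than this.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical per-neighbor protocol state."""
    peer: str
    state: str = "Established"
    last_seq: int = 0  # assumed counter of updates processed from this peer


class RoutingEngine:
    def __init__(self, name):
        self.name = name
        self.sessions = {}  # peer -> Session


class NsrRouter:
    """Toy router that mirrors session state to its backup routing engine."""

    def __init__(self):
        self.active = RoutingEngine("primary")
        self.backup = RoutingEngine("backup")

    def establish(self, peer):
        self.active.sessions[peer] = Session(peer)
        self._replicate(peer)

    def receive_update(self, peer):
        self.active.sessions[peer].last_seq += 1
        self._replicate(peer)

    def _replicate(self, peer):
        # Internal sync: the neighbors are not involved in this at all.
        src = self.active.sessions[peer]
        self.backup.sessions[peer] = Session(src.peer, src.state, src.last_seq)

    def switchover(self):
        # The backup continues the existing sessions instead of starting over.
        self.active, self.backup = self.backup, RoutingEngine("standby")


if __name__ == "__main__":
    router = NsrRouter()
    router.establish("neighbor-1")
    router.receive_update("neighbor-1")
    router.switchover()
    print(router.active.name, router.active.sessions["neighbor-1"])
```

Compare this with the GR approach: here the neighbor's code does not change at all, because from its point of view the session never went down.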
Hope this was kind of helpful.
Please let me know your inputs.