9.5 Recovery and Types of Recovery

RECOVERY AND RECOVERY TECHNIQUES 

Recovery

Recovery in distributed systems is crucial for maintaining data integrity and system availability after failures. This involves detecting failures, restoring the system to a consistent state, and resuming normal operations. Effective recovery strategies are essential to handle various types of failures, including hardware crashes, software bugs, and network issues.

Importance of Recovery Techniques

  • Effective recovery in distributed systems is crucial for ensuring system reliability, availability, and fault tolerance.
  • When a component fails or an error occurs, the system must recover quickly and correctly to minimize downtime and data loss. 
  • Effective recovery mechanisms, such as checkpointing, rollback, and forward recovery, help maintain system consistency, prevent cascading failures, and ensure that the system can continue to function even in the presence of faults.

Recovery Techniques

 

  1. Checkpointing: Periodically saving the system’s state to stable storage so that, in the event of a failure, the system can be restored to the last known good state. Checkpointing is a key aspect of backward recovery.
  2. Rollback Recovery: Involves reverting the system to a previous checkpointed state upon detecting an error. This technique is useful for undoing the effects of errors and is often combined with checkpointing.
  3. Forward Recovery: Instead of reverting to a previous state, forward recovery attempts to move the system from an erroneous state to a new, correct state. 
  4. Logging and Replay: Keeping logs of system operations and replaying them from a certain point to recover the system’s state. This is useful in scenarios where a complete rollback might not be feasible.
  5. Replication: Maintaining multiple copies of data or system components across different nodes. If one component fails, another can take over, ensuring continuity of service.
  6. Error Detection and Correction: Incorporating mechanisms that detect errors and automatically correct them before they lead to system failure. This is a proactive approach that enhances system resilience.
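
As a rough illustration of how checkpointing, rollback, and logging with replay fit together, the following is a minimal in-memory sketch. The class name CheckpointedStore and its methods are hypothetical, and a Python dictionary stands in for stable storage; a real system would write the checkpoint and the log to disk or to other nodes.

```python
import copy


class CheckpointedStore:
    """Toy key-value store (hypothetical) with checkpoint, rollback, and replay."""

    def __init__(self):
        self.state = {}        # volatile state, lost on a crash
        self._checkpoint = {}  # assumption: stands in for stable storage
        self._log = []         # operations applied since the last checkpoint

    def put(self, key, value):
        # Log the operation first, then apply it to the volatile state.
        self._log.append((key, value))
        self.state[key] = value

    def checkpoint(self):
        # Save a snapshot; operations before it no longer need to be logged.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def rollback(self):
        # Backward recovery: discard everything since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)
        self._log.clear()

    def recover_with_replay(self):
        # Restore the checkpoint, then replay the logged operations in order.
        self.state = copy.deepcopy(self._checkpoint)
        for key, value in self._log:
            self.state[key] = value


store = CheckpointedStore()
store.put("x", 1)
store.checkpoint()
store.put("x", 2)
store.state = {}             # simulate a crash that wipes volatile state
store.recover_with_replay()  # checkpoint plus replay restores x = 2
```

Calling store.rollback() instead of recover_with_replay() would return the store to the checkpointed value x = 1, which is the difference between pure rollback and checkpoint-plus-replay.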

TYPES OF RECOVERY

Forward Recovery

Forward recovery, also known as roll-forward recovery, involves moving the system forward to a new consistent state after a failure. This is typically achieved by applying the necessary operations to transition the system from its current, possibly inconsistent state to a consistent one.

Example of Forward Recovery

  • Database Systems:
    • After a crash, a database system may use redo logs to reapply committed transactions that were not written to the main database storage before the crash.
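
The redo step described above can be sketched as follows. This is a simplified illustration, not how any particular database implements it: the log record layout (tuples of "write" and "commit") and the function name redo_recovery are assumptions made for this example.

```python
def redo_recovery(disk_state, redo_log):
    """Reapply the writes of committed transactions in log order.

    Assumed (hypothetical) record formats:
      ("write", txn_id, key, new_value)
      ("commit", txn_id)
    """
    # First find which transactions reached their commit record.
    committed = {rec[1] for rec in redo_log if rec[0] == "commit"}

    recovered = dict(disk_state)  # start from what survived on disk
    for rec in redo_log:
        if rec[0] == "write" and rec[1] in committed:
            _, _txn, key, new_value = rec
            recovered[key] = new_value  # roll the state forward
    return recovered


log = [
    ("write", "T1", "a", 10),
    ("commit", "T1"),
    ("write", "T2", "b", 20),  # T2 never committed, so it is not redone
]
disk = {"a": 1, "b": 2}        # T1's write had not reached disk before the crash
recovered = redo_recovery(disk, log)
```

Here the committed write of T1 is reapplied, while the write of the uncommitted transaction T2 is ignored.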

Backward Recovery

Backward recovery, also known as rollback recovery, returns the system to a consistent state that existed before the failure occurred. This approach typically undoes the changes made since that state in order to eliminate inconsistencies.

Example of Backward Recovery

  • Database Systems:
    • After a transaction failure, a database system may use undo logs to roll back changes made by the transaction.
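
The undo step described above can be sketched with a small in-memory transaction. The class UndoTransaction and the MISSING sentinel are hypothetical names introduced for this example; the idea is only that each write records the old value (the "before-image") so that a rollback can restore it.

```python
MISSING = object()  # sentinel: the key did not exist before the transaction


class UndoTransaction:
    """Hypothetical transaction that keeps an undo log of before-images."""

    def __init__(self, data):
        self.data = data
        self.undo_log = []  # (key, old_value) pairs in the order of the writes

    def write(self, key, value):
        # Record the before-image, then apply the change.
        self.undo_log.append((key, self.data.get(key, MISSING)))
        self.data[key] = value

    def rollback(self):
        # Undo in reverse order so the oldest before-image is restored last.
        for key, old_value in reversed(self.undo_log):
            if old_value is MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old_value
        self.undo_log.clear()

    def commit(self):
        # Changes are kept; the before-images are no longer needed.
        self.undo_log.clear()


data = {"balance": 100}
txn = UndoTransaction(data)
txn.write("balance", 50)
txn.write("balance", 20)
txn.write("fee", 5)
txn.rollback()  # data returns to its state before the transaction
```

Undoing in reverse order matters when the same key is written more than once: the last restore applied is the value from before the first write.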

Example of Recovery in Fault Tolerance 

Fault-tolerant distributed systems use several techniques to ensure that undo and redo actions are applied correctly, preserving both consistency and availability.

  • Many distributed systems, especially distributed databases, record every action (or transaction) in a write-ahead log (WAL) before applying it to the main data store. This log is maintained to facilitate recovery after crashes or failures.
     
    • Undo Operation: If a failure happens after a transaction has been initiated but before it is fully committed, the system will use the log to undo the transaction (rolling back to the previous consistent state).
       
    • Redo Operation: After a failure or crash, the system can replay the log (or parts of it) to redo any operations that were successfully committed but not yet fully replicated across all nodes or systems.
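
The undo and redo operations above can be combined into a single recovery routine over a write-ahead log. The sketch below is a simplified illustration under several assumptions: the LogRecord layout and the recover function are hypothetical, each update record carries both a before-image and an after-image, every key already exists before it is updated, and an uncommitted transaction is the last writer of any key it touched (as under strict locking). Real recovery protocols handle many more cases.

```python
from typing import Any, NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical WAL record: an update carries before- and after-images."""
    txn: str
    kind: str          # "update" or "commit"
    key: str = ""
    before: Any = None
    after: Any = None


def recover(disk, wal):
    """Redo committed updates, then undo uncommitted ones."""
    committed = {r.txn for r in wal if r.kind == "commit"}
    state = dict(disk)

    # Redo pass (forward): reapply after-images of committed transactions.
    for r in wal:
        if r.kind == "update" and r.txn in committed:
            state[r.key] = r.after

    # Undo pass (backward): restore before-images of uncommitted transactions.
    for r in reversed(wal):
        if r.kind == "update" and r.txn not in committed:
            state[r.key] = r.before

    return state


wal = [
    LogRecord("T1", "update", "x", before=0, after=1),
    LogRecord("T1", "commit"),
    LogRecord("T2", "update", "y", before=5, after=9),  # T2 never committed
]
# At the crash, T1's committed write had not reached disk,
# but T2's uncommitted write had.
disk = {"x": 0, "y": 9}
state = recover(disk, wal)
```

After recovery, x holds the committed value written by T1 and y is restored to the value it had before T2 started.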