Nonintrusive Failure Detection and Recovery for Internet Services Using Backdoors. Florin Sultan, Aniruddha Bohra, Yufei Pan, Stephen Smaldone, Iulian Neamtiu, Pascal Gallard, and Liviu Iftode. Rutgers University, Department of Computer Science Technical Report DCS-TR-524, December 2003.

We describe an architecture for nonintrusive failure detection and recovery in a cluster of Internet servers in which nodes mutually monitor their liveness and recover client sessions from failed nodes. The system is based on Backdoors, a novel architectural approach for remote healing of computer systems. Backdoors enables monitoring and recovery/repair of state in a computer system through remote access to its resources (memory, I/O devices) without using its processors. Backdoors allows remote actions to be performed with no overhead, and even when the processors (but not the memory) of a machine are unavailable. We have implemented a Backdoors prototype by modifying the FreeBSD kernel and using Myrinet NICs for remote access. The system uses remote DMA operations to perform monitoring, detect failures, and extract OS and application state from a failed machine. We have used our system to run several open-source Internet servers and a complex multi-tier e-commerce application. The system tolerates multiple node failures while providing correct and continuous service to ongoing sessions, with negligible disruption.
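
The monitoring idea in the abstract can be illustrated with a small simulation. The sketch below is not the report's implementation: it assumes, purely for illustration, that each monitored node keeps a progress counter in memory and that a peer reads it remotely without involving the monitored node's processors. The names `LivenessMonitor`, `SimulatedNode`, `polls_until_detection`, and the `max_stalled_reads` threshold are hypothetical, and the remote DMA read is replaced by an ordinary Python callable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LivenessMonitor:
    """Hypothetical monitor: declares a peer failed when its counter stalls."""
    read_remote_counter: Callable[[], int]  # stand-in for a remote DMA read
    max_stalled_reads: int = 3              # assumed threshold, not from the report
    last_value: int = -1
    stalled_reads: int = 0

    def poll(self) -> bool:
        """Read the peer's counter once; return True while the peer looks alive."""
        value = self.read_remote_counter()
        if value != self.last_value:
            # The counter advanced, so the peer made progress since the last read.
            self.last_value = value
            self.stalled_reads = 0
        else:
            self.stalled_reads += 1
        return self.stalled_reads < self.max_stalled_reads


class SimulatedNode:
    """In-process stand-in for the memory of a monitored machine."""

    def __init__(self) -> None:
        self.counter = 0
        self.running = True

    def tick(self) -> None:
        # A healthy node advances its counter; a hung node leaves it unchanged.
        if self.running:
            self.counter += 1


def polls_until_detection(node: SimulatedNode, monitor: LivenessMonitor,
                          hang_after: int, limit: int = 100) -> int:
    """Return the poll number at which the failure is detected, or -1 if never."""
    for poll_number in range(1, limit + 1):
        if poll_number > hang_after:
            node.running = False  # simulate the node's processors hanging
        node.tick()
        if not monitor.poll():
            return poll_number
    return -1
```

In this simulation the monitored node does no work on behalf of the monitor, which mirrors the nonintrusive property the abstract describes; the price is a detection delay of `max_stalled_reads` polling intervals after the node hangs.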

@TECHREPORT{MT-TR,
  AUTHOR = {Florin Sultan and Aniruddha Bohra and Yufei Pan and Stephen Smaldone and Iulian Neamtiu and Pascal Gallard and Liviu Iftode},
  TITLE = {Nonintrusive Failure Detection and Recovery for Internet Services Using Backdoors},
  INSTITUTION = {Rutgers University, Department of Computer Science},
  NUMBER = {DCS-TR-524},
  YEAR = 2003,
  MONTH = DEC
}
