This thesis designs and implements a shadow server that provides fault tolerance for the meta server in SANfs.
SANfs is a shared file system in which multiple hosts share multiple disks over a storage area network (SAN). In SANfs, a file manager called the meta server manages the file system meta data and the lock state, and provides clients with the information they need to access disk blocks. However, because SANfs has only one meta server, that server is a single point of failure. A shadow server for the meta server is therefore required.
The shadow server takes over from the meta server as soon as the heartbeat of the shadow server detects that the meta server has failed. The heartbeat is a program that detects server failure. The shadow server can take over quickly because its in-memory meta data state is always kept identical to that of the meta server; the two states are synchronized using a leader/follower consistency protocol. The heartbeats of the two servers monitor each other using UDP PING signals, so lost network packets can cause false failure detections. To avoid these errors, we propose a double check mechanism: after the heartbeat checks the UDP PING signals, it rechecks whether the meta server has failed by connecting to the service port of the meta server.
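The double check mechanism described above can be illustrated with a minimal Python sketch. It is not the implementation from this thesis. The function names, the timeouts, and the `max_missed` threshold are assumptions made for illustration, and the UDP probe assumes a peer that replies to each datagram.

```python
import socket
from typing import Callable


def udp_ping(host: str, port: int, timeout: float = 1.0) -> bool:
    """First check: send one UDP datagram and wait for any reply.

    Returns False on timeout or socket error, so a single lost packet
    looks the same as a dead server. That ambiguity motivates the recheck.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"PING", (host, port))
            sock.recvfrom(64)
            return True
        except OSError:
            return False


def service_port_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Second check: try to open a TCP connection to the service port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def meta_server_failed(
    ping: Callable[[], bool],
    recheck: Callable[[], bool],
    max_missed: int = 3,  # hypothetical threshold, not taken from the thesis
) -> bool:
    """Declare failure only if the pings go unanswered AND the recheck fails."""
    for _ in range(max_missed):
        if ping():
            return False  # a PING was answered, so the server is alive
    # Every PING was missed: this may be packet loss, so confirm over TCP.
    return not recheck()
```

In this sketch, a heartbeat loop would call `meta_server_failed` once per heartbeat period with the two probes bound to the address of the meta server, and would start the takeover only when it returns `True`. Passing the probes as callables keeps the decision logic separate from the network code.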
The shadow server has been implemented and integrated with a SANfs file system on Linux 2.2.12. The experimental results report the performance overhead and the recovery time, and show that file operations continue despite a failure of the meta server. Running both servers adds no performance overhead to the meta data management of the meta server. The recovery time ranged from 3.5 seconds to 12 seconds, depending on the heartbeat period.