Cluster client migration prototype


As part of CITI's work with IBM, we looked at some of the issues involved with NFSv4 client migration and developed an initial prototype. Our setup involved a cluster of equivalent NFS servers attached to a GFS2 disk array, with each server exporting the same directory from the GFS2 filesystem. The intent was to provide an interface by which an administrator could selectively migrate NFSv4 clients from one server to another (e.g., to take a server down for maintenance).


Prototype overview

The prototype is a proof of concept: the "right way" to migrate a client would be to transfer all of the client-related state from one server to another and then have the client reorient to the new server and continue without interruption; instead, this prototype leverages parts of the existing reboot-recovery process. To briefly explain reboot-recovery: when a Linux NFSv4 server starts, it enters a roughly 90-second phase called a grace period, during which eligible clients may contact the server and reclaim state for open files and locks they were holding prior to a server crash/reboot. To allow clients to reclaim state without conflicts, new opens and other state-establishing operations are disallowed during the grace period.
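
In rough terms, grace-period handling amounts to a time window plus an eligibility check: reclaims from known clients are honored while the window is open, and anything that would establish new state is refused. The following is a minimal sketch of that logic, not the actual knfsd code; the names and the exact error handling are assumptions.

  #include <stdbool.h>
  #include <time.h>

  #define GRACE_SECONDS 90          /* matches the ~90-second window described above */

  static time_t grace_start;        /* set when the server (re)starts */

  static bool in_grace(void)
  {
      return time(NULL) < grace_start + GRACE_SECONDS;
  }

  /* During grace, only reclaims from clients recorded in stable storage
   * are honored; brand-new opens are turned away so that reclaimed state
   * cannot conflict with freshly established state. */
  int check_open(bool is_reclaim, bool client_on_stable_list)
  {
      if (in_grace()) {
          if (is_reclaim && client_on_stable_list)
              return 0;             /* reclaim allowed */
          return -1;                /* client would see, e.g., NFS4ERR_GRACE */
      }
      return is_reclaim ? -1 : 0;   /* reclaims are rejected once grace ends */
  }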

Migration overview

During a migration, the cluster is put into an artificial grace period and the target-server is notified that a new client is eligible to perform reclaims. When the client contacts the source-server, it receives an error message saying that the file system has moved and sees that it should migrate to the target-server. The client establishes a connection to the target-server and reclaims its state almost exactly as it would after a server reboot. Shortly thereafter, the grace period expires, the client is purged from the source-server, and then it's business as usual.
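
The client-side reaction can be summarized as: notice the "moved" error, ask the old server for the filesystem's new location, reconnect, and reclaim. The sketch below is a hypothetical rendering of that control flow; none of these helper functions are real kernel APIs, and the stubs merely stand in for the actual RPCs.

  #include <stdio.h>

  /* Stand-in type and stubbed RPC steps; illustration only. */
  struct fs_location { const char *server_addr; };

  static int get_fs_locations(struct fs_location *loc)
  {
      loc->server_addr = "141.211.133.213";   /* the target-server in the trace below */
      return 0;
  }
  static int reconnect(const char *addr) { printf("connecting to %s\n", addr); return 0; }
  static int reclaim_all_state(void)     { printf("reclaiming state\n");      return 0; }

  /* On the "file system has moved" error: learn the new location from the
   * source-server, connect to the target-server, then reclaim. */
  int handle_fs_moved(void)
  {
      struct fs_location loc;
      if (get_fs_locations(&loc))
          return -1;
      if (reconnect(loc.server_addr))
          return -1;
      return reclaim_all_state();
  }

  int main(void) { return handle_fs_moved(); }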

Statetransfer daemon

To go into a bit more detail, the migration prototype is based on a redesigned approach to reboot-recovery that Andy Adamson developed, wherein a new userspace daemon (so far named rpc.stransd) takes over some responsibilities previously handled within the kernel. For the most part, the daemon is responsible for keeping track of the clientids of legitimate NFS clients that have established state on the server; the daemon records these clientids in stable storage.
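
Recording clientids in stable storage can be as simple as appending each one to a file and forcing it to disk, so the list survives a crash or reboot. A minimal sketch of that idea follows; the path and record format are assumptions, not what rpc.stransd actually uses.

  #include <stdio.h>
  #include <stdint.h>
  #include <unistd.h>

  /* Hypothetical stable-storage location; the real daemon's layout may differ. */
  #define STATE_FILE "/var/lib/nfs/stransd/clientids"

  /* Append one clientid and fsync() so it survives a server crash. */
  int record_clientid(uint64_t clientid, const char *client_ip)
  {
      FILE *f = fopen(STATE_FILE, "a");
      if (!f)
          return -1;
      fprintf(f, "%llx %s\n", (unsigned long long)clientid, client_ip);
      fflush(f);
      fsync(fileno(f));   /* "stable storage" means surviving a reboot */
      return fclose(f);
  }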

For migration, the administrator runs a client program (so far called rpc.strans) that contacts the source-server's rpc.stransd and sends the IP address of the client to migrate. rpc.stransd looks up all clientids associated with that IP address and sends them to the target-server's rpc.stransd, which saves them in stable storage and notifies its (the target-server's) knfsd that the clientids are eligible for reclaim. Then, when the client has received the error message that the file system has moved, it sends an FS_LOCATIONS request to the source-server to find out where it should go next, and receives a reply containing the target-server's IP address. Since it is migrating, the client reuses its existing clientid (already in the target-server's eligible-to-reclaim list) when it contacts the target-server instead of creating a new one, and thereafter proceeds to reclaim its state.
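
The daemon-to-daemon exchange can be pictured as two small messages: the migrate request naming the client and target, and the clientid transfer pushed from source to target. The layouts below are invented for illustration; the prototype's actual protocol is in the userland source.

  #include <stdint.h>
  #include <netinet/in.h>

  /* Hypothetical wire formats for the rpc.strans/rpc.stransd exchange. */
  struct strans_migrate_req {
      struct in_addr client_ip;     /* client to migrate */
      struct in_addr target_ip;     /* server it should migrate to */
  };

  struct strans_clientid_xfer {
      struct in_addr client_ip;     /* owner of these clientids */
      uint32_t       count;         /* number of clientids that follow */
      uint64_t       clientids[];   /* entries for the eligible-to-reclaim list */
  };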

Going forward

The mechanism by which one rpc.stransd transfers clientids to another will be expanded so that all client open/lock/delegation state held on the source-server can be directly sent to the target-server and loaded into memory. In order to facilitate that, the underlying cluster filesystem will also need to transfer its own bookkeeping of opens/locks/leases from the source node to the target node. By directly transferring the state instead of relying on reclaims, the invasive and problematic cluster-wide grace period can be avoided entirely.
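
Transferring state directly implies serializing each open/lock/delegation record on the source node and reinstantiating it on the target, alongside the cluster filesystem's own lock/lease handoff. One guess at what such a serialized record might minimally carry (entirely hypothetical; this describes the planned direction, not existing code):

  #include <stdint.h>

  /* Hypothetical serialized record for a reclaim-free state handoff. */
  enum state_kind { ST_OPEN, ST_LOCK, ST_DELEGATION };

  struct xfer_state_record {
      uint64_t clientid;             /* owner of this piece of state */
      uint32_t kind;                 /* enum state_kind */
      uint8_t  stateid[16];          /* NFSv4 stateid as held by the source */
      uint8_t  filehandle[128];      /* file the state applies to */
      uint32_t fh_len;
      uint32_t share_access;         /* open mode / lock type, as applicable */
      uint64_t lock_start, lock_len; /* byte range, for lock state */
  };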

Limitations

The existing prototype is limited in many ways: for ease of integration, only the creation of a symlink completes a migration event on the client; there is no security associated with the triggering of a migration; the GFS2 and dlm code in the kernel version used in the prototype are quite fragile; the list goes on. Nevertheless, we have migrated clients at CITI that are able to -- to the extent that the maturity of that kernel version permits -- continue functioning normally after a migration.

Prototype code

As a compromise between the setup of the original prototype and the relative stability of GFS2 exported by NFS, the current code is based on the 2.6.19.7 Linux kernel (http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.19.7.tar.bz2). Until proper git repositories are online, there is a patch for the kernel (http://www.citi.umich.edu/u/richterd/strans-kernel-for-2.6.19.7.diff) and a tarball of the source for the userland components (http://www.citi.umich.edu/u/richterd/strans-userland-for-2.6.19.7.tar.gz).

Some instructions on how to test the setup and how to work around some cluster-related kinks are in the README file in the userland tarball. Once the kernels have been built on the nfs servers and the client, and once the userland components are built on the servers, my basic steps are:

  • boot the cluster, bring up cman and clvmd everywhere
  • mount the gfs2 filesystem on the nfs servers
  • cat the files that'll be involved in the reclaims on each of the nfs servers (see the README)
  • start up nfs on the servers, making sure that rpc.stransd is running by the time knfsd starts up
  • start wireshark on the client
  • have the client mount the source-server and hold a file open with, e.g., less(1)
  • arrange the migration: $ rpc.strans -m <clientIP> <target-serverIP> <source-serverIP>
  • in a second shell on the client, try to create a symlink over nfs -- it should fail and the client should migrate
  • the logs, wireshark, netstat, etc., should show that the client has migrated, and the client should be able to keep going (but again, functionality is limited -- reading files works). Note that mount(1) will continue to show the source-server, even though the client is actually talking to the target-server.


A network trace of the client 141.211.133.86 migrating from server 141.211.133.212 to 141.211.133.213 is available from CITI's website (http://www.citi.umich.edu/u/richterd/migration-moved-and-good-open-reclaim-3--apikia-rhcl1-rhcl2.pcap). Packets 104/106 show a file initially being opened; then the migration was triggered; then, packets 128/130 show the client trying to make a symlink and getting a "moved" error; packets 140/142 show the client making contact with the target server; packets 156/158 show the client reclaiming state for the file it had open; and finally, packets 239/241 show subsequent "normal" operation as another file is read after the artificial grace period expired.
