Nfsd4 server recovery

From Linux NFS

Revision as of 23:54, 27 September 2010

This incorporates revisions based on comments on the original document posted at [1].

The Linux server's reboot recovery code has long-standing architectural problems, fails to adhere to the specifications in some cases, and does not yet handle NFSv4.1 reboot recovery. An overhaul has been on the to-do list for a long time.

This is my attempt to state the problem and a rough solution.

Requirements

Requirements, as compared to current code:

       * Correctly implements the algorithm described in section 8.6.3
         of rfc 3530, and eliminates known race conditions on recovery.
       * Does not attempt to manage files and directories directly from
         inside the kernel.

Requirements, in more detail:

A server can go down and come back up again for any number of reasons:

       * The server may crash.
       * Power may go out.
       * The administrator may reboot the server.
       * The administrator may manually stop and restart the NFS server
         without stopping other services on the machine, for example
         using:
               service nfs stop
               service nfs start
         (where the details may vary from one distribution to another).

We will call any of these events a "restart".

A "server instance" is the lifetime from start to shutdown of a server; a restart ends one server instance and starts another. Normally a server instance consists of a grace period followed by a period of normal operation. However, a server could go down before the grace period completes. Call a server instance that completes the grace period "full", and one that does not "partial".


Call a client "active" if it holds unexpired state on the server. Then:

       * An NFSv4.0 client becomes active as soon as it successfully
         performs its first OPEN_CONFIRM, or its first reclaim OPEN.
       * An NFSv4.1 client becomes active when it successfully performs
         a RECLAIM_COMPLETE.
       * Active clients become inactive when they expire.  (Or when
         they are revoked--but the Linux server does not currently
         support revocation.)
       * On startup all clients are initially inactive.

On startup the server needs access to the list of clients which are permitted to reclaim state. That list is exactly the list of clients that were active at the end of the most recent full server instance.
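The rule above can be sketched as a small model. This is only an illustration: the `ServerInstance` type and the `reclaim_list` function are hypothetical names, not part of the design, and the history of instances is kept in memory rather than in stable storage.

```python
from dataclasses import dataclass, field


@dataclass
class ServerInstance:
    # True if the grace period completed (a "full" instance).
    grace_completed: bool
    # Client owners that were active when the instance ended.
    active_at_end: frozenset = field(default_factory=frozenset)


def reclaim_list(history):
    """Return the owners allowed to reclaim at the next startup.

    `history` is ordered oldest to newest.  Partial instances are
    skipped, so the list comes from the most recent full instance.
    """
    for instance in reversed(history):
        if instance.grace_completed:
            return set(instance.active_at_end)
    return set()  # no full instance yet: nobody may reclaim
```

For example, a full instance that ended with clients A and B active, followed by a partial instance in which only A had reclaimed, still yields both A and B: the partial instance does not shrink the list.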

To maintain such a list, we need records to be stored in stable storage. Whenever a client changes from inactive to active, or active to inactive, stable storage must be updated, and until the update has completed the server must do nothing that acknowledges the new state. So:

       * When a new client becomes active, a record for that client
         must be created in stable storage before responding to the rpc
         in question (OPEN, OPEN_CONFIRM, or RECLAIM_COMPLETE).
       * When a client expires, the record must be removed (or
         otherwise marked expired) before responding to any requests
         for locks or other state which would conflict with state held
         by the expiring client.
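One common way to meet the "stable storage before responding" requirement on a Linux filesystem is to write a temporary file, fsync it, rename it into place, and fsync the directory. The sketch below assumes that approach; the function names and the `.tmp` suffix are illustrative and are not taken from the design.

```python
import os


def _sync_directory(directory):
    # Flush the directory entry itself, so a create, rename, or unlink
    # survives a crash.  Opening a directory read-only works on Linux.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def store_record(directory, name, data):
    """Write `data` (bytes) to directory/name; return once it is durable."""
    tmp_path = os.path.join(directory, name + ".tmp")
    final_path = os.path.join(directory, name)
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # file contents reach the disk
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    _sync_directory(directory)
    return final_path


def remove_record(directory, name):
    """Remove directory/name; return once the removal is durable."""
    os.unlink(os.path.join(directory, name))
    _sync_directory(directory)
```

A caller would send the reply to the kernel only after `store_record` or `remove_record` has returned.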


Updates must be made by upcalls to userspace; the kernel will not be directly involved in managing stable storage. The upcall interface should be extensible.

The records must include the client owner name, to allow identifying clients on restart. The protocol allows client owner names to consist of up to 1024 bytes of binary data. (This is the client-supplied long form, not the server-generated shorthand clientid; for 4.1 it is co_ownerid.)

Also desirable, but not absolutely required in the first implementation:

       * We should not take the state lock while waiting for records to
         be stored.  (Doing so blocks all other stateful operations
         while we wait for disk.)
       * The server should be able to end the grace period early when
         the list of clients allowed to reclaim is empty, or when they
         are all 4.1 clients, after all have sent RECLAIM_COMPLETE.
       * We should allow pluggable methods for storage of reboot recovery
         records, as the NFSv2 and NFSv3 code currently does.  These may be
         used by some high-availability systems.

Possibly also desirable:

       * Record the principal that originally created the client, and
         whether it had EXCHGID4_FLAG_BIND_PRINC_STATEID (see rfc 5661
         section 8.4.2.1).

Draft design

We will write a new userspace daemon to manage state in userspace. The new daemon will be written with the possibility in mind of later combining it with one of the other existing daemons (such as idmapd), but it may stand alone at first.

Previous prototype code from CITI will be considered as a starting point.

Kernel<->user communication will use four files in the "nfsd" filesystem. All of them will use the encoding used for rpc cache upcalls and downcalls, which consist of whitespace-separated fields escaped as necessary to allow binary data.
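To illustrate how binary client owners can travel in whitespace-separated fields, here is a minimal sketch of one possible escaping. It is an assumption for illustration only: it escapes every byte that is not printable ASCII, plus the backslash, as a three-digit octal sequence, and it does not claim to reproduce the exact rules of the rpc cache code.

```python
def escape_field(data):
    """Encode bytes as text with no whitespace (illustrative scheme)."""
    out = []
    for byte in data:
        if 0x21 <= byte <= 0x7E and byte != 0x5C:  # printable, not backslash
            out.append(chr(byte))
        else:
            out.append("\\%03o" % byte)            # e.g. space -> \040
    return "".join(out)


def unescape_field(text):
    """Inverse of escape_field."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            out.append(int(text[i + 1:i + 4], 8))  # three octal digits
            i += 4
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def encode_message(fields):
    # Fields are separated by single spaces and the message ends with a
    # newline.  Empty fields are not representable in this sketch.
    return " ".join(escape_field(f) for f in fields) + "\n"


def decode_message(line):
    return [unescape_field(token) for token in line.split()]
```

Because spaces, newlines, and backslashes inside a field are always escaped, a 1024-byte binary client owner round-trips through a single field.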

Three of them will be used for upcalls; the daemon reads requests from them, and writes responses back:

create_client:

       * given a client owner, replies with an error code.  Does not
         reply until a new record has safely been recorded on disk.
         The kernel will call this on the first reclaim OPEN or
         OPEN_CONFIRM (for v4.0 clients) or on RECLAIM_COMPLETE (for
         4.1 clients).

grace_done:

       * request and reply are both empty; the daemon returns only
         after it has recorded to disk the fact that the grace period
         completed.  The kernel will not allow any non-reclaim opens
         until this returns.

expire_client:

       * given a client owner, replies with an empty reply.  Replies
         only after it has recorded to disk the fact that the client
         has expired.  The kernel will call this when a client loses
         its lease, before removing its locks and opens (and allowing
         potentially conflicting operations).

One additional file will be used for a downcall:


allow_client:

       * before starting the server, the daemon will open this file,
         write a newline-separated list of client owners permitted to
         recover, then close the file.  If no clients are allowed to
         recover, it will still open and close the file.
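The four files can be modelled roughly as follows. This is a sketch under several assumptions: stable storage is replaced by an in-memory stand-in, client owners are taken to be already-encoded strings, and the reply values (including "0" for success) are placeholders, since the design does not fix them.

```python
class InMemoryStore:
    """Stand-in for stable storage; a real daemon would sync to disk."""

    def __init__(self):
        self.clients = set()
        self.grace_completed = False


def handle_upcall(store, name, owner=None):
    """Apply one upcall, then return the reply text.

    The reply is produced only after the store has been updated, which
    mirrors the rule that the daemon answers once the change is recorded.
    """
    if name == "create_client":
        store.clients.add(owner)
        return "0"   # hypothetical encoding of "no error"
    if name == "expire_client":
        store.clients.discard(owner)
        return ""    # empty reply
    if name == "grace_done":
        store.grace_completed = True
        return ""    # empty reply
    raise ValueError("unknown upcall: %r" % (name,))


def allow_client_payload(owners):
    """Build the newline-separated list written to allow_client."""
    return "".join(owner + "\n" for owner in owners)
```

With no clients allowed to recover, `allow_client_payload([])` is the empty string, which matches the case where the daemon opens and closes the file without writing anything.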

The daemon will use the presence of these upcalls to determine whether the server supports the new recovery mechanism (and may just exit if it does not). Also, nfsd may use the daemon's open of allow_client to decide whether userspace supports the new mechanism. This allows a mismatched kernel and userspace to still maintain reboot recovery records.

In addition, we could support seamless reboot recovery across the transition to the new system by making the daemon convert between on-disk formats. However, for simplicity's sake we plan for the server to refuse all reclaims on the first boot after the transition.

By default, the daemon will store records as files in the directory /var/lib/nfs/v4clients. The file name will be a hash of the client_owner, and the contents will consist of two newline-separated fields:

       * The client owner, encoded as in the upcall.
       * A timestamp.

More fields may be added in the future.
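A record in this layout might be produced and parsed as sketched below. The design does not name the hash or the timestamp format, so SHA-256 and ASCII unix seconds are assumptions made here for illustration, and the function names are hypothetical.

```python
import hashlib


def record_name(owner):
    """File name for a client record: a hash of the raw owner bytes.

    SHA-256 is an assumption; the design only says "a hash".
    """
    return hashlib.sha256(owner).hexdigest()


def format_record(encoded_owner, timestamp):
    """Two newline-separated fields: encoded owner, then timestamp."""
    return "%s\n%d\n" % (encoded_owner, timestamp)


def parse_record(text):
    """Return (encoded_owner, timestamp), ignoring any later fields."""
    fields = text.split("\n")
    return fields[0], int(fields[1])
```

Ignoring everything after the second field lets an older daemon read records written by a newer one that has appended more fields.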

Before starting the server, and writing to allow_client, the daemon will manage boot times and old clients using files in /var/lib/nfs:

       If boot_time exists:
               * It will be read, and the contents interpreted as an
                 ascii-encoded unix time in seconds.
               * All client records older than that time will be removed.
               * The current boot_time will be recorded to
                 new_boot_time (replacing any existing such file).
               * All remaining clients will be written to allow_client.
       If boot_time does not exist, an empty /var/lib/nfs/v4clients/ is
               created if necessary, but nothing else is done.
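The startup steps above could look roughly like this. The sketch makes several assumptions: a record's age is judged by the timestamp stored in its second field, "the current boot_time" means the time of the present start, and the function returns the surviving owners instead of writing them to allow_client. Syncing to disk is omitted for brevity, and `prepare_for_start` is a hypothetical name.

```python
import os
import time


def prepare_for_start(state_dir, now=None):
    """Prune old client records; return owners to write to allow_client."""
    now = int(time.time()) if now is None else now
    clients_dir = os.path.join(state_dir, "v4clients")
    boot_path = os.path.join(state_dir, "boot_time")

    # Creating the directory in both cases is harmless and keeps the
    # listing below from failing if it was removed by hand.
    os.makedirs(clients_dir, exist_ok=True)
    if not os.path.exists(boot_path):
        return []  # nothing else is done

    with open(boot_path) as f:
        boot_time = int(f.read().strip())  # ascii unix time in seconds

    allowed = []
    for name in sorted(os.listdir(clients_dir)):
        path = os.path.join(clients_dir, name)
        with open(path) as f:
            owner, stamp = f.read().split("\n")[:2]
        if int(stamp) < boot_time:
            os.unlink(path)        # record older than boot_time
        else:
            allowed.append(owner)  # still permitted to reclaim

    # Replaces any existing new_boot_time.
    with open(os.path.join(state_dir, "new_boot_time"), "w") as f:
        f.write("%d\n" % now)
    return allowed
```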

The daemon will then wait for create_client, expire_client, and grace_done calls. On grace_done, it will rename boot_time to old_boot_time, and new_boot_time to boot_time.
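The two renames on grace_done can be sketched as follows. The text does not say what should happen when one of the files is missing (for example on the very first start), so skipping that rename is an assumption of this sketch, as is the final directory fsync; `on_grace_done` is a hypothetical name.

```python
import os


def on_grace_done(state_dir):
    """Rotate boot_time files once the grace period has completed."""
    boot = os.path.join(state_dir, "boot_time")
    old = os.path.join(state_dir, "old_boot_time")
    new = os.path.join(state_dir, "new_boot_time")

    # Assumption: a missing source file means that rename is skipped.
    if os.path.exists(boot):
        os.replace(boot, old)  # boot_time -> old_boot_time
    if os.path.exists(new):
        os.replace(new, boot)  # new_boot_time -> boot_time

    # Make the renames durable before replying to the kernel.
    dir_fd = os.open(state_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Until these renames happen, boot_time still names the previous full instance, so a restart during the grace period leaves the set of clients allowed to reclaim unchanged.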
