NFS lock recovery notes
Here's my attempt at summarizing how this works, with references to the RFCs at the end.
So, to start off, I'm going to assume NFS version >= 4.1, and I'd recommend that any new implementation do the same, because:
- 4.0 and 4.1 are more different here than you might expect, and 4.0 in particular has some messy problems in this area.
- >=4.1 clients are already widely available, and they'll only be more so by the time this work is done.
- 4.2 and higher aren't expected to change the 4.1 model.
That said, the basics are the same for v2 through v4.2, so hopefully once 4.1 is working, that work will carry over to the other versions with little effort.
So the first time an NFSv4.1 client talks to a server:
- EXCHANGE_ID is the first rpc that introduces a client to a server. It includes a client_owner field (up to 1k) which should uniquely identify that client forever.
- The server returns a 64-bit clientid in the EXCHANGE_ID reply. Every subsequent request from that client will have either that clientid or some id derived from it, so the server always knows which client it's talking to.
- The client sends a RECLAIM_COMPLETE rpc. Before replying to that, the server writes a record containing that client's client_owner to stable storage.
- The client queries the server's "lease_time" attribute. (Default 90s on Linux.)
- Normal filesystem activity starts. The client ensures that it sends at least one rpc (which may be some sort of no-op) every lease_time, to reassure the server that it's still active.
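The first-contact steps above can be sketched as a toy in-memory model. This is only an illustration, not how any real server is structured: `ToyServer`, its method names, the `now` parameter, and the use of a Python set to stand in for stable storage are all assumptions made for the example.

```python
import itertools
import time


class ToyServer:
    """Toy model of first contact; names and structure are hypothetical."""

    LEASE_TIME = 90        # seconds; the Linux default mentioned above
    MAX_OWNER_LEN = 1024   # client_owner is opaque data of up to 1k

    def __init__(self):
        self._ids = itertools.count(1)
        self.owners = {}             # clientid -> client_owner (memory only)
        self.last_seen = {}          # clientid -> time of the last rpc
        self.stable_storage = set()  # stands in for records kept on disk

    def exchange_id(self, client_owner, now=None):
        """Introduce a client and hand back a 64-bit clientid."""
        if len(client_owner) > self.MAX_OWNER_LEN:
            raise ValueError("client_owner too long")
        clientid = next(self._ids) & 0xFFFFFFFFFFFFFFFF
        self.owners[clientid] = client_owner
        self.renew(clientid, now)
        return clientid

    def renew(self, clientid, now=None):
        """Any rpc from the client counts as a sign of life."""
        self.last_seen[clientid] = time.monotonic() if now is None else now

    def reclaim_complete(self, clientid, now=None):
        """Record the client_owner durably before 'replying'."""
        self.stable_storage.add(self.owners[clientid])
        self.renew(clientid, now)

    def lease_expired(self, clientid, now):
        """True once a full lease_time has passed without any rpc."""
        return now - self.last_seen[clientid] > self.LEASE_TIME


server = ToyServer()
cid = server.exchange_id(b"client-a", now=0)
server.reclaim_complete(cid, now=1)
print(b"client-a" in server.stable_storage)
print(server.lease_expired(cid, now=60), server.lease_expired(cid, now=200))
```

The point of the sketch is the ordering: the client_owner only reaches stable storage at RECLAIM_COMPLETE, not at EXCHANGE_ID, and everything keyed by clientid lives in memory and is lost on a restart.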
Then some day the server crashes and restarts. It starts a grace period, of length equal to the lease period (so, 90s by default).
Once the server starts responding again, the first response will be an error like NFS4ERR_BADSESSION or NFS4ERR_STALE_CLIENTID, indicating the client's state has been forgotten.
The client responds by sending a new EXCHANGE_ID to the server (with the same client_owner as before) and gets a new clientid. It then sends OPEN and LOCK requests with "reclaim" bits set. The server allows those requests after checking stable storage and seeing that it has a record for this client_owner. When the client has finished reclaiming, it again sends a RECLAIM_COMPLETE, and the server stores an updated record for this client in its stable storage.
The server returns NFS4ERR_GRACE to any non-reclaim OPEN and LOCK requests as long as the grace period is in force.
At the end of the grace period, the server:
- starts accepting non-reclaim OPEN and LOCK requests. Further reclaim attempts get NFS4ERR_NO_GRACE.
- purges any stable storage records for clients that haven't yet established a new clientid and sent RECLAIM_COMPLETE.
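As a rough sketch of the grace-period rules just described, the two helpers below decide what status an OPEN or LOCK gets and which stable storage records survive the end of the grace period. The function names and the string return values are assumptions for illustration (the strings are the full protocol names of the errors), and the stable storage check applied to reclaims is deliberately left out here.

```python
def check_request(is_reclaim, in_grace):
    """Status a toy server gives an OPEN or LOCK request."""
    if in_grace:
        # Only reclaims are allowed while the grace period is in force.
        return "NFS4_OK" if is_reclaim else "NFS4ERR_GRACE"
    # After the grace period it is the other way around.
    return "NFS4ERR_NO_GRACE" if is_reclaim else "NFS4_OK"


def end_grace(stable_storage, reclaim_completed):
    """Keep only records for owners that sent RECLAIM_COMPLETE this boot."""
    return stable_storage & reclaim_completed


print(check_request(is_reclaim=False, in_grace=True))
print(check_request(is_reclaim=True, in_grace=False))
print(end_grace({b"client-a", b"client-b"}, {b"client-a"}))
```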
Also, at any point a client may expire, and lose that stable storage record, either because it sends an explicit DESTROY_CLIENTID on unmount, or because a crash or network problem prevents it from sending its once-per-lease_time ping.
If a client that previously lost its state for some reason attempts to reclaim after a later server reboot, the server denies it with an NFS4ERR_RECLAIM_BAD error. Note this is necessary even though we trust clients, because a client might just not know whether its state is still good after a network partition. (E.g., the server's been unresponsive for 5 minutes, and now it's giving me STALE_CLIENTID--did it spend that whole time rebooting, or did it reboot quickly, end its grace period, and then reboot again?)
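The double-reboot case in that parenthetical can be walked through with a small model of the stable storage records. This is a sketch under assumptions: `RecoveryRecords` and its methods are hypothetical names, and a real server would of course keep `on_disk` on disk.

```python
class RecoveryRecords:
    """Toy stable-storage bookkeeping across server reboots (hypothetical)."""

    def __init__(self, owners=()):
        self.on_disk = set(owners)  # survives reboots
        self.completed = set()      # owners that sent RECLAIM_COMPLETE this boot

    def reboot(self):
        self.completed = set()      # in-memory state is lost

    def try_reclaim(self, owner):
        # Reclaims are only honoured for owners with a record on disk.
        return "NFS4_OK" if owner in self.on_disk else "NFS4ERR_RECLAIM_BAD"

    def reclaim_complete(self, owner):
        self.on_disk.add(owner)
        self.completed.add(owner)

    def end_grace(self):
        # Purge records for clients that never came back during grace.
        self.on_disk &= self.completed


records = RecoveryRecords([b"client-a", b"client-b"])

records.reboot()                       # first reboot; client-a is partitioned
records.try_reclaim(b"client-b")
records.reclaim_complete(b"client-b")
records.end_grace()                    # client-a's record is purged here

records.reboot()                       # second reboot; client-a is back
print(records.try_reclaim(b"client-a"))
print(records.try_reclaim(b"client-b"))
```

From client-a's side both reboots look like one long outage; only the server's records can tell it that its old locks may have been handed to someone else in between.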
Also, the client_owner has an associated "verifier" field that changes each time the client reboots, which lets the server detect client crashes: when a client it already knows about sends an EXCHANGE_ID with a new verifier, the server knows that client rebooted and can blow away its previous state.
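A minimal sketch of that verifier check, assuming the server keeps one dictionary of known verifiers and one of per-client state. The function name, both dictionaries, and the boolean return value are inventions for the example, not part of the protocol.

```python
def handle_exchange_id(known_verifiers, client_state, owner, verifier):
    """Toy EXCHANGE_ID handling; returns True if old state was discarded.

    known_verifiers: dict mapping client_owner -> last verifier seen
    client_state:    dict mapping client_owner -> opens/locks held
    """
    previous = known_verifiers.get(owner)
    known_verifiers[owner] = verifier
    if previous is not None and previous != verifier:
        # Same owner, new verifier: the client rebooted, so whatever
        # it held before the reboot can be thrown away.
        client_state.pop(owner, None)
        return True
    return False


verifiers = {}
state = {}
handle_exchange_id(verifiers, state, b"client-a", b"boot-1")
state[b"client-a"] = ["lock on /export/f"]
print(handle_exchange_id(verifiers, state, b"client-a", b"boot-1"), state)
print(handle_exchange_id(verifiers, state, b"client-a", b"boot-2"), state)
```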
Any number of optimizations are possible, most not currently taken advantage of by the Linux server:
- If the newly restarted server knows something about previously held locks (say, because it's part of a cluster that knows this), then it could permit non-reclaim operations that it knows won't conflict.
- In fact, as long as it's remembering all the locks, it could just remember all the associated protocol data (clientids and various other per-open/lock ids), skip the whole recovery protocol, and pretend nothing happened. Solaris people have done this to implement "transparent migration" between Solaris servers.
- Since the server has a list of all the clients, it knows when they've all sent RECLAIM_COMPLETE, and can end the grace period early if that happens before the 90s is up. (Recent Linux does actually do this.)
- Instead of purging a dead client's records right away, the server could be generous and wait till a conflicting lock request forces it to do so.
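Of these, the early end of the grace period is the easiest to sketch. The helper below is a hypothetical illustration of the idea, not the Linux server's actual logic: the grace period ends when the timer runs out or when every client recorded in stable storage has sent RECLAIM_COMPLETE, whichever comes first.

```python
def grace_over(elapsed, lease_time, recorded_owners, completed_owners):
    """Decide whether a toy server may leave its grace period.

    recorded_owners:  set of client_owners found in stable storage at boot
    completed_owners: set of client_owners that sent RECLAIM_COMPLETE since
    """
    if elapsed >= lease_time:
        return True
    # Early exit: nobody is left who could still want to reclaim.
    return recorded_owners <= completed_owners


recorded = {b"client-a", b"client-b"}
print(grace_over(10, 90, recorded, {b"client-a"}))
print(grace_over(10, 90, recorded, {b"client-a", b"client-b"}))
print(grace_over(90, 90, recorded, set()))
```

One consequence of writing it this way is that a server with no recorded clients leaves the grace period immediately, since the empty set is a subset of anything.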
The authoritative source for the 4.1 stuff is [1], especially chapters 8 and 9.
NFSv4.0 is different because it doesn't have RECLAIM_COMPLETE, and because NFSv4.0 clients are required to present different client_owners to different server IP addresses. That turned out not to work well for migration, so there's a draft sorting that all out: [2].
And then the source for NFSv4.0 in general is [3].
NFSv3 also has the notion of grace periods and reclaim flags on lock requests. The main differences are:
- There's no lease time, and clients aren't required to poll the server regularly. Instead, there's a bidirectional NSM (Network Status Monitor) protocol, and clients and servers send each other notifications when they restart.
- Locking and lock recovery aren't part of the NFS protocol proper; they're done in two sideband protocols (NLM and NSM).
- The NLM protocol has no open call, only lock. (Actually, that's not completely true: there's a share lock rpc that I think is rarely used.)
References are:
There's unfortunately also some poorly documented lore, especially in the v2/v3 cases.
