CITI Experience with Directory Delegations
From Linux NFS
NOTE: this is a rough work-in-progress; please send criticism to richterd at (nospam) citi.umich.edu thank you.
Directory Delegations Background
NFSv4.1 introduces read-only directory delegations, a protocol addition intended to let clients cache more aggressively. More specifically, the goal is to let clients avoid excess GETATTR, ACCESS, and LOOKUP calls to the server by making three kinds of client-side caching more reliable: directory entry caching (READDIR), name caching (LOOKUP), and directory metadata caching (ACCESS and GETATTR).
The following quoted subsections are from Section 11 of the NFSv4.1 minor version draft:
NFSv4 client caching behavior
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed."
NFSv4.1 delegations extensions
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments."
Furthermore, analysis of NFSv3 network traces (NFSv4 mirrors NFSv3's client cache semantics) by Brian Wickman at the University of Michigan (FIXME: need link to a copy of his prelim) shows that a surprisingly large share of NFS traffic consists of the periodic GETATTRs that clients send when a timeout triggers a cache revalidation.
At CITI, we are in the process of implementing directory delegations as described in Section 11 of the minor version draft, although we are not at this time implementing the notifications extension also described therein. The following are some specific aspects of the work.
Delegations and the Linux VFS Lease Subsystem
Directory delegations are implemented on the server with extensions to the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease type is about to occur. The caller whose operation is causing the lease to break blocks until the lease-holder signals that it has finished cleaning up (or until the lease is forcefully broken after a certain timeout).
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files in the directory, for example, will not break directory leases.
However, for the very near term, only NFS protocol operations break directory leases. A couple of operations involve tricky locking issues in the VFS, which will be addressed later. The difficulty is that, because breaking a lease involves blocking the caller, one must ensure that no important locks -- like a directory inode's i_sem -- are held while the calling kernel thread blocks.
Leases are usually acquired via the fcntl(2) call, and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. NFS leases are all acquired and revoked in-kernel.
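As a rough userspace illustration of the fcntl(2) lease interface mentioned above, the following Python sketch takes a read lease on a scratch file, asks the kernel which lease is held, and releases it. The helper names (try_read_lease, demo_read_lease) are ours and not part of any API. The sketch assumes a Linux system and a filesystem that support F_SETLEASE; where that assumption fails, it simply returns None. A real lease-holder would also install a handler for the break signal (SIGIO by default) and release or downgrade the lease from there.

```python
import fcntl
import os
import tempfile


def try_read_lease(path):
    """Take, query, and release a read lease on `path`.

    Returns the lease type reported by F_GETLEASE while the lease is held,
    or None if leases are not available (non-Linux, unsupported filesystem,
    or insufficient privilege).
    """
    setlease = getattr(fcntl, "F_SETLEASE", None)
    getlease = getattr(fcntl, "F_GETLEASE", None)
    if setlease is None or getlease is None:
        return None  # the lease commands are Linux-specific

    # A read lease can only be placed on a descriptor opened read-only.
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.fcntl(fd, setlease, fcntl.F_RDLCK)   # acquire the lease
        held = fcntl.fcntl(fd, getlease)           # ask what we hold
        fcntl.fcntl(fd, setlease, fcntl.F_UNLCK)   # "cleanup finished"
        return held
    except OSError:
        return None
    finally:
        os.close(fd)


def demo_read_lease():
    """Run try_read_lease() on a scratch file that nobody has open for writing."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # a writer holding the file open would make F_RDLCK fail
    try:
        return try_read_lease(path)
    finally:
        os.unlink(path)
```

Calling demo_read_lease() should return fcntl.F_RDLCK where leases work and None elsewhere. The in-kernel NFS path does not go through fcntl(2) at all, so this only shows the userspace side of the same subsystem.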
UPDATE
The approach we're currently taking to integrate NFS delegations with Linux VFS leases (i.e., all directory-mutating operations, whether local to the server or over NFS, will break directory leases/delegations on the server) goes something like this. When breaking a lease (regardless of whether the caller is local or over NFS):
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), try break_lease() with O_NONBLOCK. This avoids blocking while locks are held.
2) If there was no lease, you're all done.
3) If there was a lease, break_lease() will send the break signal(s) and return -EAGAIN.
3-a) If the breaking operation came over NFS, nfsd will also fail immediately and the client gets NFS4ERR_DELAY (and should retry). We don't want to block an nfsd kernel thread while we wait for the breaks to finish (even if no locks would be held while blocking).
3-b) If the breaking operation was server-local, lease semantics dictate that we block the breaker until leases are returned. If any locks (other than the BKL) are held, we drop them, call break_lease(), and let it block. Then we re-acquire any locks we need and restart the operation (e.g., we may need to repeat a lookup to make sure the thing is still there).
The approach in 3-b is good in that the operation always succeeds in one pass. The restarting-the-operation bit is tricky and hasn't been ironed out yet. Some operations are fairly easy, others aren't.
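The control flow in the steps above can be modeled in a few lines of Python. This is a toy simulation, not kernel code: the Directory class, create_entry(), and holders_return_leases() are hypothetical names, the model's break_lease() only imitates the return values described above (the real function has a different signature), "blocking" is modeled as the holders returning their leases immediately, and locks appear only as log entries. NFS4ERR_DELAY uses the numeric value assigned by the NFSv4 protocol.

```python
import errno

NFS4_OK = 0
NFS4ERR_DELAY = 10008  # protocol-defined value of NFS4ERR_DELAY


class Directory:
    """Toy stand-in for a leased directory on the server."""

    def __init__(self):
        self.entries = set()
        self.lease_holders = set()
        self.break_sent = False

    def break_lease(self, nonblock):
        """Imitate the return values described in steps 2 and 3."""
        if not self.lease_holders:
            return 0                      # step 2: no lease, nothing to do
        self.break_sent = True            # step 3: break signal(s) go out
        if nonblock:
            return -errno.EAGAIN
        self.holders_return_leases()      # model: blocking == waiting for this
        return 0

    def holders_return_leases(self):
        self.lease_holders.clear()
        self.break_sent = False


def create_entry(directory, name, via_nfs, log):
    """Add `name` to `directory`, breaking leases as in steps 1 to 3-b."""
    while True:
        log.append("lookup")                        # dentry now available
        rc = directory.break_lease(nonblock=True)   # step 1: never blocks
        if rc == 0:
            directory.entries.add(name)
            return NFS4_OK
        if via_nfs:
            return NFS4ERR_DELAY                    # step 3-a: client retries
        log.append("drop locks")                    # step 3-b
        directory.break_lease(nonblock=False)       # block until returned
        log.append("retake locks")
        # falling through restarts the operation, repeating the lookup
```

In this model an NFS caller sees NFS4ERR_DELAY on its first attempt and succeeds on a retry once the holders have returned their leases, while a local caller succeeds in a single call at the price of a second lookup.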
Using Directory Delegations
While a client holds a delegation on a directory, it is assured that the directory will not be mutated without the delegation first being synchronously recalled. However, the notion that a directory delegation is "read-only" has a special case associated with it: a given client's mutating operation on a directory delegated to it will not trigger a recall of that client's delegation (obviously, all other delegations on that directory will have to be recalled before the mutating operation succeeds). Again, from Section 11 of the minor version draft:
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."
Note that a client's ability to mutate a directory without triggering a recall of its own delegation is not at all a write delegation.
if i understand you correctly, the server currently synchronously revokes the delegation from each client. (synchronously, but hopefully not serially.) you are proposing that the server skip over the mutating client, which sounds fine. perhaps it should automatically grant a fresh delegation along with its ack to the client update.
(Re: first paragraph) Yes, before the directory-mutating operation succeeds, all outstanding delegations will be recalled (in parallel) -- except for a delegation held by the directory-mutating client, if such a delegation exists. I don't quite understand automatically granting a fresh delegation to the mutating owner-client -- do you mean renew the client's lease (not a Linux VFS lease; the NFS lease that determines if a client is non-responsive)? Can you help me understand? The client's existing delegation should suffice and be basically indistinguishable from a new one. thanks, -d
my further understanding is that at the time that the server is busily revoking delegations, it does not know which is the mutating client, and that sessions is the magic bullet that solves the problem. yes?
(Re: second paragraph) Yup. For the time-being, I have a hack that conflates a client's IP address with that client's clientid -- when the server receives SETCLIENTID_CONFIRM, it logs the client's IP for later use. When a delegation-breaking operation comes in, the requester's IP is used to find the clientid. This is a hack for prototyping's sake, because in practice multiple clientids can be associated with a given IP. With Sessions, we'll have a session-related ID that will properly map to the requester's clientid.
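To make the IP-to-clientid hack from the discussion above concrete, here is a small Python model of it. Everything in it (DelegationServer, its methods, the example addresses and clientids) is made up for illustration and does not correspond to nfsd data structures. It only shows the bookkeeping: remember the IP at SETCLIENTID_CONFIRM, then skip the requester's own delegation when recalling.

```python
class DelegationServer:
    """Toy model of the prototype's IP -> clientid bookkeeping."""

    def __init__(self):
        self.clientid_by_ip = {}   # the hack: at most one clientid per IP
        self.delegations = {}      # directory path -> set of clientids

    def setclientid_confirm(self, ip, clientid):
        # Log the client's IP for later use.
        self.clientid_by_ip[ip] = clientid

    def grant(self, directory, clientid):
        self.delegations.setdefault(directory, set()).add(clientid)

    def recall_for_mutation(self, directory, requester_ip):
        """Return the clientids whose delegations must be recalled.

        The requester's own delegation (found via its IP) is left alone;
        every other delegation on the directory is recalled.
        """
        requester = self.clientid_by_ip.get(requester_ip)
        holders = self.delegations.get(directory, set())
        to_recall = {c for c in holders if c != requester}
        self.delegations[directory] = holders - to_recall
        return sorted(to_recall)
```

With three clients holding a delegation on the same directory, a mutation from the first client's address recalls only the other two. The model also shows why this is only good for prototyping: if a second clientid confirms from the same IP, the mapping is overwritten and the first client's own delegation ends up on the recall list.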
Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases
The NFS delegation leases and existing non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type. (Here I will informally refer to leases used to implement delegations as "NFS leases" and all other leases as "non-NFS leases".)
With NFS leases, in order to support directory delegation semantics, the caller a) will not have its own directory lease broken, and b) will block until all other leases on that directory have been returned to the server, at which point c) the operation should succeed.
UPDATE (please weigh-in..)
OK, I guess my understanding here was wrong -- Bruce advises that we basically fail the NFS operation just as we would with non-NFS leases. We could maybe hang around for 100ms or something just in case it's fast. This simplifies some locking headaches on the server but I need to look at how this changes things. I guess I was sticking too much with the wording ("synchronously"; "have to wait") from the draft:
"The delegation covers directory attributes and all entries in the directory. If either of these change the delegation will be recalled synchronously. The operation causing the recall will have to wait before the recall is complete."
With non-NFS leases, the caller a) will have its lease revoked, and b) will not block while any other lease-holders return their leases, with the latter implying that c) the operation fails with -EAGAIN.
Negative Caching
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. Currently, close-to-open consistency requires that, e.g., all OPENs are sent to the server (i.e., negative caching provides no benefit in that case). With directory delegations, one is assured that no entries have been added or removed while a delegation is in effect; this implies that negative dentries in a delegated directory actually can be "trusted".
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH. Knowing just when to acquire those delegations may be a matter to address in client-side policy.
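The PATH-search case can be put in numbers with a toy model. The Python sketch below counts LOOKUP calls for repeated searches of a three-directory PATH, with and without delegations on the directories that miss. LookupServer, CachingClient, and search_path are hypothetical names, the directory contents are invented, and the model deliberately sends every positive lookup to the server so that only the effect of trusted negative entries is measured.

```python
class LookupServer:
    """Toy server that only answers LOOKUP and counts the calls."""

    def __init__(self, dirs):
        self.dirs = dirs          # directory path -> set of entry names
        self.lookups = 0

    def lookup(self, directory, name):
        self.lookups += 1
        return name in self.dirs[directory]


class CachingClient:
    """Client that trusts negative entries only in delegated directories."""

    def __init__(self, server, delegated=()):
        self.server = server
        self.delegated = set(delegated)
        self.negative = set()     # (directory, name) pairs known to be absent

    def exists(self, directory, name):
        if directory in self.delegated and (directory, name) in self.negative:
            return False          # trusted miss: no RPC needed
        found = self.server.lookup(directory, name)
        if not found:
            self.negative.add((directory, name))
        return found

    def recall(self, directory):
        """Delegation recalled: negative entries there are no longer trusted."""
        self.delegated.discard(directory)
        self.negative = {(d, n) for (d, n) in self.negative if d != directory}


def search_path(client, path_dirs, name):
    """Return the first directory in path_dirs containing name, else None."""
    for directory in path_dirs:
        if client.exists(directory, name):
            return directory
    return None
```

For a PATH of /usr/local/bin, /usr/bin, /bin with the target only in /bin, ten searches cost 30 LOOKUPs without delegations and 12 with delegations on the first two directories (three for the first search, then one per search). After a recall, the client falls back to asking the server again.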
Status
At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms of OP-counts); pynfs tests are also being written.
The client:
- The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory.
- As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
- .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR on the wire).
- TODO: need to teach the client to trust negative dentries in delegated directories.
- TODO: also -- ..
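The client behavior listed above can be summarized in a small Python model that records which operations would go on the wire. DirClient and its methods are hypothetical, the server's directory is just a set of names, and GET_DIR_DELEGATION is used as the label for the delegation request on the assumption that the operation carries the name it has in the NFSv4.1 specification. The model covers only READDIR and the client's own CREATE.

```python
class DirClient:
    """Toy model of the client's READDIR caching under a directory delegation."""

    def __init__(self):
        self.delegated = False
        self.cache = None     # cached directory listing, or None if invalid
        self.wire = []        # operations that would be sent to the server

    def readdir(self, server_entries):
        if self.delegated and self.cache is not None:
            return self.cache                     # served locally, no RPC
        if not self.delegated:
            # Request a delegation just before READDIR on an undelegated dir.
            self.wire.append("GET_DIR_DELEGATION")
            self.delegated = True
        self.wire.append("READDIR")
        self.cache = sorted(server_entries)
        return self.cache

    def create(self, server_entries, name):
        # Our own mutation does not break our delegation, but the cached
        # listing is stale, so it is deliberately invalidated.
        self.wire.append("CREATE")
        server_entries.add(name)
        self.cache = None
```

Two back-to-back listings produce one delegation request and one READDIR; after the client creates a file, the next listing sends a fresh READDIR while the delegation stays in place.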
The server:
- The following NFS operations currently break directory delegations: CREATE (nfsd_create() and nfsd_symlink()), LINK (nfsd_link()), REMOVE (nfsd_unlink()), and RENAME (nfsd_rename()).
- OPEN (with create) is tied up: parent-directory delegations are now broken correctly in nfsd4_open(). Breaking file delegations on OPEN (for write) is itself broken: nfsd_open() tries it a) under statelock and b) I think usually fails because of O_NONBLOCK. nfsd4_truncate() has a similar issue.
- SETATTR(on the directory itself) is a pretty bad snarl.
- TODO: get all conflicting VFS-level operations to break our delegations, not just NFS operations. Tricky.
- TODO: also -- policy, look at dir deleg/file deleg interactions, ..