CITI Experience with Directory Delegations
NOTE: this is a rough work-in-progress; please send criticism to richterd at (nospam) citi.umich.edu. Thank you.
[2006-8-2: I've added some rough, preliminary numbers of opcounts from doing compiles with/without directory delegations]
Directory Delegations Background
NFSv4.1 introduces read-only directory delegations, a protocol addition enabling clients to cache more aggressively. More specifically, the goal is to allow clients to avoid excess GETATTR, ACCESS, and LOOKUP calls to the server by increasing the reliability of directory entry caching (READDIR), name caching (LOOKUP), and directory metadata caching (ACCESS and GETATTR).
The following quoted subsections are from Section 11 of the NFSv4.1 minor version draft:
NFSv4 client caching behavior
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed."
NFSv4.1 delegations extensions
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments."
Furthermore, analysis of NFSv3 network traces by Brian Wickman at the University of Michigan (FIXME: need link to a copy of his prelim) shows that a surprising amount of NFS traffic is made up of the periodic GETATTRs that clients send when a timeout triggers a cache revalidation.
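To make the cost of timeout-driven revalidation concrete, here is a minimal user-space sketch of a client attribute cache. It is a toy model, not the Linux client's code: the class name `AttrCache`, the `timeout` knob, and the helper `count_getattrs` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttrCache:
    """Toy client-side attribute cache with timeout-based revalidation."""
    timeout: float                                   # seconds an entry is trusted (hypothetical knob)
    fetched_at: dict = field(default_factory=dict)   # path -> time of last fetch
    getattr_calls: int = 0                           # GETATTRs that would go on the wire

    def stat(self, path, now):
        last = self.fetched_at.get(path)
        if last is None or now - last >= self.timeout:
            # Entry missing or expired: revalidate with the server.
            self.getattr_calls += 1
            self.fetched_at[path] = now


def count_getattrs(timeout, accesses):
    """Replay (time, path) accesses and count the GETATTRs they trigger."""
    cache = AttrCache(timeout=timeout)
    for now, path in accesses:
        cache.stat(path, now)
    return cache.getattr_calls
```

With a 3-second timeout, ten accesses to the same unchanged directory at one-second intervals still cost four GETATTRs in this model; the revalidations happen whether or not anything changed on the server.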
At CITI, we are implementing directory delegations as described in Section 11 of the minor version draft. (But note that section 11 also describes a directory notification extension that we are ignoring for now.)
Using Directory Delegations
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.
Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new sessions extensions in NFSv4.1 to identify the client.
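The recall rule, including the special case for the modifying client, can be sketched as a small table keyed by directory. This is an illustrative model only: `DirDelegations` and its methods are hypothetical names, and the client id is an opaque string standing in for whatever identity the server uses (an IP address today, a session later).

```python
from collections import defaultdict


class DirDelegations:
    """Toy server-side table of read-only directory delegations."""

    def __init__(self):
        # directory path -> set of client ids holding a delegation on it
        self.holders = defaultdict(set)

    def grant(self, directory, client):
        self.holders[directory].add(client)

    def return_delegation(self, directory, client):
        self.holders[directory].discard(client)

    def modify(self, directory, client):
        """Return the set of clients to recall before `client` may modify.

        An empty set means the modification may proceed now. The modifying
        client's own delegation is deliberately left in place.
        """
        return self.holders[directory] - {client}
```

For example, if clients "A" and "B" both hold a delegation on a directory and "A" wants to create a file there, `modify` returns `{"B"}`; once "B" has returned its delegation the result is empty and "A" still holds its own.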
Delegations and the Linux VFS Lease Subsystem
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files within the directory will not break directory leases.
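The break conditions for directory leases described above amount to a simple classification of changes. The sketch below encodes it; the `Change` enum and `breaks_directory_lease` are hypothetical names used for illustration, not kernel identifiers.

```python
from enum import Enum, auto


class Change(Enum):
    ADD_ENTRY = auto()       # new name created in the directory
    REMOVE_ENTRY = auto()    # name deleted from the directory
    RENAME_ENTRY = auto()    # name renamed within, into, or out of the directory
    DIR_METADATA = auto()    # the directory's own metadata, e.g. chown on it
    EXISTING_FILE = auto()   # change to a file that already exists in the directory


# Changes that mutate the directory itself; changes to existing files do not.
_BREAKING = {
    Change.ADD_ENTRY,
    Change.REMOVE_ENTRY,
    Change.RENAME_ENTRY,
    Change.DIR_METADATA,
}


def breaks_directory_lease(change):
    """Return True if this kind of change should break a directory lease."""
    return change in _BREAKING
```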
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. However, it is still possible for a local process on the server to modify a directory without breaking directory leases.
The final implementation will also ensure that operations by local processes break directory leases. This will require addressing some tricky VFS locking issues: the difficulty is that, given that breaking a lease involves blocking the caller, one must ensure that no important locks -- like a directory inode's i_mutex -- are held while the calling kernel thread blocks.
At this point, we are testing general VFS-level directory lease-breaking -- i.e., both NFS and non-NFS operations will break leases. Our approach is described in the next section.
Leases are usually acquired via the fcntl(2) call, and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. NFS leases are all acquired and revoked in-kernel.
Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases
The NFS delegation leases and existing non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type. (Here I will informally refer to leases used to implement delegations as "NFS leases" and all other leases as "non-NFS leases".)
With NFS leases, in order to support directory delegation semantics, the caller a) will not have its own directory lease broken, and b) will not block while waiting for all other leases on that directory to be returned to the server, so c) the operation fails with NFS4ERR_DELAY, which indicates that the client should retry the operation. Bruce mentions that we could, perhaps, stall the operation for something like 100ms just in case the delegations are returned very quickly.
With non-NFS leases, the caller a) will have its lease revoked, and b) will not block while any other lease-holders return their leases, with the latter implying that c) the operation fails with -EAGAIN.
The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory mutating operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:
When breaking a lease where the call is coming over NFS:
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try break_lease() with O_NONBLOCK. This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially) long periods.
2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.
3) If there was a lease, break_lease() will send the break signal(s) and nfsd will also fail (re-enabling lease-granting on the inode first) and the client gets NFS4ERR_DELAY (and should retry).
The downside to this is that a pathological case could arise wherein we break a lease, return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up with a cycle.
When breaking a lease where the call is server-local:
1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.
2) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned, after which the breaker is unblocked and its operation succeeds.
3) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present. If break_lease() returns -EWOULDBLOCK, drop the locks and call break_lease() and allow it to block. Once the caller unblocks, restart the operation by reacquiring the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s). Since lease-granting was disabled early-on, the operation will succeed in one pass.
4) Regardless of whether 2) or 3) happened, at the end lease-granting is naturally re-enabled for the inode(s) in question.
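The two lease-breaking paths described above can be traced with a small user-space simulation. This is a sketch under loud assumptions, not kernel code: the `Inode` class, the `granting_enabled` flag, and the Python `break_lease` function are toy stand-ins, "blocking" is modelled by simply clearing the lease set, and the special case that lets a caller keep its own lease is not modelled.

```python
import errno


class Inode:
    """Toy inode carrying only the lease state this sketch needs."""

    def __init__(self, leases=()):
        self.leases = set(leases)      # current lease holders
        self.granting_enabled = True   # may new leases be granted?
        self.break_sent = False        # have holders been told to give up?


def break_lease(inode, nonblock):
    """Toy stand-in for break_lease(); returns 0 or -EWOULDBLOCK."""
    if not inode.leases:
        return 0
    inode.break_sent = True            # notify the holders
    if nonblock:
        return -errno.EWOULDBLOCK
    inode.leases.clear()               # model: wait until every holder returns
    return 0


def nfs_modify(inode):
    """NFS path: never block; report NFS4ERR_DELAY if a lease was present."""
    inode.granting_enabled = False
    try:
        if break_lease(inode, nonblock=True) != 0:
            return "NFS4ERR_DELAY"     # client is expected to retry
        return "OK"
    finally:
        inode.granting_enabled = True


def local_modify(inode, locks_held, log):
    """Server-local path: block, but never while holding locks."""
    inode.granting_enabled = False
    try:
        if not locks_held:
            break_lease(inode, nonblock=False)
        elif break_lease(inode, nonblock=True) == -errno.EWOULDBLOCK:
            log.append("drop locks")
            break_lease(inode, nonblock=False)
            log.append("reacquire locks, redo lookup")
        log.append("perform operation")
        return "OK"
    finally:
        inode.granting_enabled = True
```

In this model an NFS caller hitting a leased directory gets NFS4ERR_DELAY with the break already signalled, while a local caller holding locks goes through the drop, block, reacquire sequence exactly once because lease-granting stays disabled until the end.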
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. Currently, close-to-open consistency requires that, e.g., all OPENs are sent to the server (i.e., negative caching provides no benefit in that case). With directory delegations, one is assured that no new entries or removals have occurred while a delegation is in-effect; this implies that negative dentries in a delegated directory actually can be "trusted".
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH. Knowing just when to acquire those delegations may be a matter to address in client-side policy.
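The effect of trusting negative dentries can be illustrated with a toy lookup cache that counts the LOOKUPs it would send. The names `LookupCache` and `count_lookups` are hypothetical, and the model is deliberately narrow: positive entries are not cached at all, so the only difference between the two runs is whether a remembered miss may be trusted.

```python
class LookupCache:
    """Toy client name cache that counts LOOKUPs sent to the server."""

    def __init__(self, server_entries, delegated):
        self.server_entries = set(server_entries)  # names that exist on the server
        self.delegated = delegated                 # do we hold a directory delegation?
        self.negative = set()                      # names known not to exist
        self.lookups_sent = 0

    def exists(self, name):
        if self.delegated and name in self.negative:
            # Under a delegation no entry can have appeared, so trust the miss.
            return False
        self.lookups_sent += 1                     # ask the server
        found = name in self.server_entries
        if not found:
            self.negative.add(name)
        return found


def count_lookups(delegated, probes=5):
    """Probe repeatedly for a name that does not exist; return LOOKUPs sent."""
    cache = LookupCache({"ls"}, delegated)
    for _ in range(probes):
        cache.exists("gcc")
    return cache.lookups_sent
```

Five probes for a missing name cost five LOOKUPs without a delegation and one with it, which is the pattern a PATH or LD_LIBRARY_PATH search would repeat for every directory that does not contain the target.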
At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms of opcounts); pynfs tests are also being written.
- The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span.
- As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
- .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR on the wire).
- TODO: get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)
- TODO: redo existing opcount tests and instead tally bandwidth savings ...
- getting real NFSv4 workload network traces would be great -- can you help? (richterd AT citi.umich.edu)
- When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?
- Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).
- The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.
- An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.
- The corresponding VFS-level operations also break delegations and are being tested.
- How to acknowledge/when to act upon resource pressures? E.g., after compiling the linux kernel, a client holds ~750 delegations; that is roughly 50KB of state on the server, and nearly as much on the client.
- TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.
- TODO: also -- policy, look at dir deleg/file deleg interactions, ..
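The client request policy from the first item in the list above (ask just before a READDIR on an undelegated directory, or after "a few" revalidations that found the directory unchanged) could be sketched as follows. The threshold value and the names `DirState`, `note_revalidation`, and `should_request_delegation` are assumptions made for illustration; the text does not say what "a few" is.

```python
REVALIDATIONS_BEFORE_REQUEST = 3   # hypothetical value for "a few"


class DirState:
    """Toy per-directory client state used by the request policy."""

    def __init__(self):
        self.delegated = False
        self.unchanged_revalidations = 0


def note_revalidation(state, changed):
    """Record the outcome of one revalidation of the directory."""
    if changed:
        state.unchanged_revalidations = 0
    else:
        state.unchanged_revalidations += 1


def should_request_delegation(state, about_to_readdir):
    """Decide whether the client should ask for a directory delegation now."""
    if state.delegated:
        return False
    return (about_to_readdir
            or state.unchanged_revalidations >= REVALIDATIONS_BEFORE_REQUEST)
```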
Some preliminary numbers
We have some very rough numbers in terms of opcounts with vs. without directory (not file) delegations enabled. We used a very naive client policy of simply requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context. Again, these numbers are rough, but indicate that compilation environments stand to benefit from directory delegations.
Doing make(1) on cscope-15.5 (first without, then with directory delegations):
Operation        Without     With   Reduction
READ                 136      124
WRITE                137      136
OPEN                1576     1576
ACCESS              1169      161   86%
GETATTR              903      628   30%
LOOKUP              1494      496   67%
GET_DIR_DELEG          -        7
DELEGRETURN            -        1
Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):
Operation        Without     With   Reduction
READ               19803    19892
WRITE              21921    21869
OPEN              497472   494648
ACCESS             20638     3406   83.5%
GETATTR            41794    24563   41.0%
LOOKUP             45063    17447   61.3%
READDIR             1016      884   13.0%
GET_DIR_DELEG          -      750
DELEGRETURN            -     none
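For reference, the reduction figures are the relative drop in opcount between the two runs. A minimal sketch of that arithmetic, using the cscope ACCESS, GETATTR, and LOOKUP counts from above as input (the function name `reduction_percent` and the `CSCOPE` table are just illustrative):

```python
def reduction_percent(without, with_deleg):
    """Relative drop in opcount, rounded to the nearest whole percent."""
    return round(100.0 * (without - with_deleg) / without)


# (without delegations, with delegations) from the cscope-15.5 run
CSCOPE = {
    "ACCESS": (1169, 161),
    "GETATTR": (903, 628),
    "LOOKUP": (1494, 496),
}

for op, (before, after) in CSCOPE.items():
    print(f"{op}: {reduction_percent(before, after)}% reduction")
```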