
Background

To improve performance and reliability, NFSv4.1 introduces read-only directory delegations, a protocol extension that allows consistent caching of directory contents. CITI is implementing directory delegations as described in Section 11 of the NFSv4.1 Internet Draft.

Directory Caching in NFSv4

NFSv4 allows clients to cache directory contents:

READDIR uses a directory entry cache
LOOKUP uses the name cache
ACCESS and GETATTR use a directory metadata cache.

To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically.

"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." NFSv4.1 Internet Draft

Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.
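
As a rough illustration of the time-bounded model, the sketch below counts the GETATTRs a client sends for a directory that never changes. The class name, the 30-second timeout, and the access pattern are assumptions chosen for illustration; they are not taken from any client implementation.

```python
class TimedDirCache:
    """Client-side directory attribute cache with a fixed revalidation timeout."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds for which cached attributes are trusted
        self.validated_at = None    # time of the last revalidation
        self.getattrs = 0           # number of revalidation RPCs sent

    def access(self, now):
        """Use the cache at time `now`, revalidating first if the timeout expired."""
        if self.validated_at is None or now - self.validated_at >= self.timeout:
            self.getattrs += 1      # stands in for a GETATTR round trip
            self.validated_at = now


cache = TimedDirCache(timeout=30)   # hypothetical timeout
for t in range(0, 600, 5):          # one access every 5 seconds for 10 minutes
    cache.access(t)
print(cache.getattrs)
```

Every one of these revalidations is sent even though the directory is unchanged, and between two revalidations the client cannot tell whether its cached view is stale.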

Analysis of network traces at the University of Michigan (FIXME: need link to a copy of Brian Wickman's prelim) shows that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.

How Directory Delegations Can Help

"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." NFSv4.1 Internet Draft

A common "high miss" case involves shell PATH lookups. To execute a program, the shell walks down the list of directories specified in the user's $PATH environment variable and tries to locate the executable file in each directory. It is not uncommon to find a large number of directories in the list. When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories be revalidated, in case the file appears at some point.

Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted. With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as negative dentry caching. With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.
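
The effect on a $PATH search can be sketched with a small model. Everything here is hypothetical (the function, the directory names, and the one-RPC-per-directory cost); the model only tracks misses, so the directory that actually contains the file still costs one round trip.

```python
def run_command(path_dirs, name, contents, delegated, negative_cache):
    """Return the number of server round trips needed to locate `name`.

    path_dirs:      directories searched in order, as with $PATH
    contents:       dict mapping directory -> set of entry names on the server
    delegated:      directories for which the client holds a delegation
    negative_cache: set of (directory, name) pairs known not to exist
    """
    rpcs = 0
    for d in path_dirs:
        if d in delegated and (d, name) in negative_cache:
            continue                    # trusted miss: no server traffic
        rpcs += 1                       # lookup or revalidation goes to the server
        if name in contents[d]:
            return rpcs
        negative_cache.add((d, name))   # remember the miss
    return rpcs


dirs = ["/usr/local/bin", "/usr/bin", "/bin", "/opt/tools/bin"]
contents = {d: set() for d in dirs}
contents["/opt/tools/bin"].add("mytool")

# Without delegations a cached miss cannot be trusted, so every run pays in full.
no_deleg = [run_command(dirs, "mytool", contents, set(), set()) for _ in range(2)]

# With delegations on the first three directories, the second run skips the misses.
neg = set()
with_deleg = [run_command(dirs, "mytool", contents, set(dirs[:3]), neg) for _ in range(2)]
print(no_deleg, with_deleg)
```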

Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations. The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.

Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic consists of the periodic GETATTRs sent by clients when an attribute timeout triggers a cache revalidation. But a delegated directory need not be revalidated unless the directory is modified.

* Should reference Wickman and ... um ... CMU?  Ousterhout?
* From which we can make "a great deal" more specific?

Directory Delegation Operations

An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation. Granting a delegation request is solely at the server's discretion, and the delegation may be recalled at any time.

Upon receiving an operation that conflicts with an existing delegation, the server must first recall from all of its clients any delegations on the directory (or directories) being mutated. When a client receives that CB_RECALL callback operation, it relinquishes the delegation in question by responding to the server using the DELEGRETURN operation. When all of the requisite delegations have been returned (or forcefully timed-out), the server allows the conflicting operation to proceed.
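
The recall sequence can be modelled as a small state machine. This is a simplified sketch, not the server code: the class and method names are hypothetical, timeouts are omitted, and refusing new grants while a recall is outstanding is an assumed policy (the server is free to decide otherwise).

```python
class DirDelegationServer:
    """Toy model of granting and recalling directory delegations."""

    def __init__(self):
        self.holders = {}   # directory -> set of clients holding a delegation
        self.pending = {}   # directory -> set of clients with an outstanding recall
        self.log = []       # protocol events, in order

    def get_dir_delegation(self, client, directory):
        if self.pending.get(directory):
            return False    # assumed policy: no new grants during a recall
        self.holders.setdefault(directory, set()).add(client)
        return True

    def modify(self, client, directory):
        """Attempt a directory-mutating operation; return True if it proceeded."""
        holders = self.holders.get(directory, set())
        if holders:
            outstanding = self.pending.setdefault(directory, set())
            for h in sorted(holders - outstanding):
                self.log.append(("CB_RECALL", h, directory))
                outstanding.add(h)
            return False    # the conflicting operation must wait
        self.log.append(("MODIFY", client, directory))
        return True

    def delegreturn(self, client, directory):
        self.holders.get(directory, set()).discard(client)
        self.pending.get(directory, set()).discard(client)


srv = DirDelegationServer()
srv.get_dir_delegation("A", "/src")
srv.get_dir_delegation("B", "/src")
first = srv.modify("C", "/src")                   # conflict: recalls go out
refused = srv.get_dir_delegation("D", "/src")     # recall in progress
srv.delegreturn("A", "/src")
srv.delegreturn("B", "/src")
second = srv.modify("C", "/src")                  # all returned: proceeds
```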

Although NFS clients and servers have knowledge of the acquisition and recall of directory delegations, delegation state is opaque to applications.

Notifications

After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used. For a large directory, this cost, which comes on top of the two RPCs needed for the recall, can be high. If the directory also happens to be a popular one, with multiple clients holding delegations, the performance impact on the server can be considerable.

To reduce the impact of a directory modification when the change is small, the NFSv4.1 Internet Draft defines an extension to delegations called notifications. When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.

By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.

Notifications are motivated by some common cases. For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory. Other examples include program compilation and CVS updates, which also quickly create and destroy files.

In the proposal for notifications, a client can request notifications on directory entry and directory attribute changes, as well as directory entry attribute changes. To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously. In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.

* ref ousterhout

Issues with notifications

Notifications require state on the NFS server to keep track of them, and work to deliver them. Wickman's simulator work at CITI found that in some cases, the cost of dispatching the notifications needed to support a directory delegation can exceed the cost of simply not using a delegation at all. A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.

Notifications also introduce a level of "fairness" to maintain, in terms of deciding how to allot notifications among multiple clients, given limited server resources.

Notifications can be sent asynchronously, at a rate negotiated by the client and server. This allows the server to batch several notifications and to prune self-cancelling notifications (e.g., "CREATE foo ... REMOVE foo"). Indeed, Wickman found that for certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50. For instance, lock files in mail boxes often have a lifetime under 10 seconds, so addition/deletion notifications can be pruned. However, there is a trade-off between the batching delay and client cache consistency.
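
Pruning within one batch can be sketched as follows. The event representation and the function name are assumptions made for illustration; the draft does not prescribe how a server prunes its queue.

```python
def prune_batch(events):
    """Drop CREATE/REMOVE pairs for the same name that fall within one batch.

    events: list of (op, name) tuples in arrival order,
            with op being "CREATE" or "REMOVE".
    """
    pending_create = {}   # name -> index in `kept` of an unmatched CREATE
    kept = []
    for op, name in events:
        if op == "REMOVE" and name in pending_create:
            kept[pending_create.pop(name)] = None   # cancel the earlier CREATE
            continue                                # and drop this REMOVE too
        if op == "CREATE":
            pending_create[name] = len(kept)
        kept.append((op, name))
    return [e for e in kept if e is not None]


batch = [
    ("CREATE", "inbox.lock"),
    ("CREATE", "msg.42"),
    ("REMOVE", "inbox.lock"),   # short-lived lock file: cancels out
    ("REMOVE", "msg.17"),       # created before this batch: must be sent
]
print(prune_batch(batch))
```

A longer batching window gives more pairs the chance to cancel, at the price of a client cache that lags further behind the server.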

Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing notifications at this time.

Using Directory Delegations

While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.

However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)

Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:

"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."

Note that in order to make the special exception that allows a client to modify a directory without recalling its own delegation, we must know which client is performing the operation.

Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.
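
The following sketch shows why the choice of identity matters. The client records, the session strings, and the helper are hypothetical; the point is only that comparing IP addresses exempts every client behind a shared address, while a per-client identity exempts just the modifier.

```python
def delegations_to_recall(holders, modifier, identify):
    """Return the holders whose delegations must be recalled before `modifier` proceeds.

    holders:  list of client records holding a delegation on the directory
    identify: function mapping a client record to the identity the server compares
    """
    me = identify(modifier)
    return [h for h in holders if identify(h) != me]


clients = [
    {"name": "A", "ip": "10.0.0.5", "session": "s1"},
    {"name": "B", "ip": "10.0.0.5", "session": "s2"},   # shares A's address
    {"name": "C", "ip": "10.0.0.9", "session": "s3"},
]

# Client A modifies the directory.
by_ip = delegations_to_recall(clients, clients[0], lambda c: c["ip"])
by_session = delegations_to_recall(clients, clients[0], lambda c: c["session"])

print([c["name"] for c in by_ip])        # B is wrongly allowed to keep its delegation
print([c["name"] for c in by_session])
```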

Negative Caching

One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. Close-to-open consistency mandates that even when previous LOOKUPs or OPENs for a given file have recently or repeatedly failed, each subsequent attempt must revalidate the parent directory with a GETATTR in case the file has appeared. With directory delegations, the client is assured that no entries have been added or removed while a delegation is in effect; this implies that negative dentries in a delegated directory can actually be "trusted".

This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g., when searching for a header file in include paths or a shared library in LD_LIBRARY_PATH (see the "Some preliminary numbers" section for more details). Knowing just when to acquire those delegations may be a matter to address in client-side policy.

Delegations and the Linux VFS Lease Subsystem

We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).

Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. Leases used by NFS are all acquired and revoked in-kernel.

The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., via chown(1)). Note that changes to existing files in the directory will not break directory leases.

Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases. Our approach is described in the next section.

Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases

In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".

NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type, as described above.

Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:

1) Delay responding to the original operation until all recalls are complete.
2) Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.
3) Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.

For now, we have implemented option number 2.
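
From the client's side, the option we implemented amounts to a polling loop. The sketch below is illustrative only: the function name, the attempt limit, and the pluggable back-off are assumptions, and a real client would sleep between attempts. The numeric value of NFS4ERR_DELAY is the one assigned in RFC 3530.

```python
NFS4_OK = 0
NFS4ERR_DELAY = 10008   # value assigned in RFC 3530


def call_with_retry(send, max_attempts=5, backoff=lambda attempt: None):
    """Resend an operation for as long as the server answers NFS4ERR_DELAY.

    send:    callable that issues the operation and returns its status
    backoff: called between attempts; a real client would sleep here
    """
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status != NFS4ERR_DELAY:
            return status, attempt
        backoff(attempt)
    return NFS4ERR_DELAY, max_attempts


# A stand-in server that needs two polls before the delegations are back.
replies = iter([NFS4ERR_DELAY, NFS4ERR_DELAY, NFS4_OK])
status, attempts = call_with_retry(lambda: next(replies))
print(status, attempts)
```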

 The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory-mutating 
 operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:
 
 When breaking a lease where the call is coming over NFS:
 1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try         
    break_lease() with O_NONBLOCK.  This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially)
    long periods.
 
 2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.
 
 3) If there was a lease, break_lease() will send the break signal(s), and nfsd will fail the operation (re-enabling lease-granting on the
    inode first) so that the client gets NFS4ERR_DELAY (and should retry).  The downside is that a pathological case could arise wherein
    we break a lease and return NFS4ERR_DELAY, but another client acquires a lease before the first client retries the operation, and we
    could end up with a cycle.
 
 
 When breaking a lease where the call is server-local:
 1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.
 
 2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,
    after which the breaker is unblocked and its operation succeeds.
 
 2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present.  If break_lease() returns
    -EWOULDBLOCK, drop the locks and call break_lease() and allow it to block.  Once the caller unblocks, restart the operation by reacquiring
    the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s).  Since lease-granting was disabled early-on, 
    the operation will succeed in one pass.
 
 3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.
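
The NFS-side steps can be modelled in a few lines. This is a user-space sketch of the control flow only: the Inode class, the grant counter, and the function names are hypothetical stand-ins for the kernel structures, and no real locking or signalling is involved.

```python
from errno import EWOULDBLOCK


class Inode:
    def __init__(self):
        self.leases = set()       # current lease holders
        self.grant_disabled = 0   # > 0 while a mutating operation is in flight


def try_grant(inode, holder):
    """Grant a lease unless lease-granting is currently disabled."""
    if inode.grant_disabled:
        return False
    inode.leases.add(holder)
    return True


def break_lease_nonblock(inode, breaks_sent):
    """Model of break_lease() with O_NONBLOCK: notify holders, never wait."""
    if not inode.leases:
        return 0
    breaks_sent.extend(sorted(inode.leases))
    return -EWOULDBLOCK


def nfsd_mutate(inode, do_op, breaks_sent):
    """Steps 1-3 for a call arriving over NFS."""
    inode.grant_disabled += 1             # step 1: disable lease-granting
    try:
        if break_lease_nonblock(inode, breaks_sent) != 0:
            return "NFS4ERR_DELAY"        # step 3: fail; the client retries
        do_op()                           # step 2: no lease, finish the operation
        return "NFS4_OK"
    finally:
        inode.grant_disabled -= 1         # re-enable lease-granting


ino, sent, done = Inode(), [], []
try_grant(ino, "A")
r1 = nfsd_mutate(ino, lambda: done.append("create"), sent)   # lease present
ino.leases.discard("A")                                      # A returns its lease
try_grant(ino, "B")                                          # B slips in before the retry
r2 = nfsd_mutate(ino, lambda: done.append("create"), sent)   # the cycle from step 3
ino.leases.discard("B")
r3 = nfsd_mutate(ino, lambda: done.append("create"), sent)
print(r1, r2, r3)
```

The second call shows the pathological case: because lease-granting is re-enabled before the retry arrives, another holder can acquire a lease in the interim.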

Policy (partial)

client: prior to a READDIR, request.

client: if we've sent 3 or 5 revalidations and a directory hasn't changed, request.

client: when to voluntarily surrender? e.g., after a kernel-compile, I hold hundreds of delegations.

server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones.

server: will need to ID "misbehaving" clients and cordon them off.

server: when to preemptively recall? --> server load metric
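
Two of these policies are simple enough to sketch. The class names, the threshold of three revalidations, and the five-minute quiet period are assumptions picked from the ranges mentioned above, not settled values.

```python
class ClientRequestPolicy:
    """Request a delegation after `threshold` consecutive unchanged revalidations."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.unchanged = {}   # directory -> consecutive unchanged revalidations

    def on_revalidate(self, directory, changed):
        """Record a revalidation result; return True if a request is warranted."""
        if changed:
            self.unchanged[directory] = 0
            return False
        self.unchanged[directory] = self.unchanged.get(directory, 0) + 1
        return self.unchanged[directory] >= self.threshold


class ServerGrantPolicy:
    """Refuse delegations on directories recalled within the last `quiet` seconds."""

    def __init__(self, quiet=300):
        self.quiet = quiet
        self.last_recall = {}   # directory -> time of the most recent recall

    def on_recall(self, directory, now):
        self.last_recall[directory] = now

    def may_grant(self, directory, now):
        last = self.last_recall.get(directory)
        return last is None or now - last >= self.quiet


client = ClientRequestPolicy(threshold=3)
history = [False, False, True, False, False, False]   # True = directory changed
decisions = [client.on_revalidate("/usr/include", c) for c in history]

server = ServerGrantPolicy(quiet=300)
server.on_recall("/tmp", now=100)
print(decisions, server.may_grant("/tmp", 200), server.may_grant("/tmp", 400))
```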

(simulator)

Previous work at CITI by Brian Wickman consisted of prototyping and analyzing file and directory delegations, based on recorded network traces of NFSv3 use in college environments. Because NFSv3 is stateless, the traces had to be instrumented with, e.g., OPEN and CLOSE operations. However, in the absence of delegations, NFSv4 client-side cache validation closely mimics that of NFSv3, so enough information was available to get an overall impression of the state of the clients' caches. Wickman wrote a simulator that uses the instrumented traces to test different delegation models and policies.

We now want to use real-world NFSv4 network traces with the simulator, but given the current absence of wide-scale mainstream deployment of NFSv4, we need to find such traces of representative workloads. Actual NFSv4 traffic will give a more accurate picture of client-cache state and will more clearly identify operations obviated by delegations, both because the traces will not need to be instrumented and because NFSv3 lacks the COMPOUND operation, with which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator will allow us to develop client- and server-side policies for requesting and granting delegations.

Some preliminary numbers

A significant demonstration of the benefits of negative dentry caching is software compilation. For instance, when building software using make(1), various directories are repeatedly searched for header files. Since header files tend only to be located in one of the directories, and since many object files depend on the same headers, there are a great number of unnecessary re-checks. By caching negative dentries, a significant number of NFS operations can be avoided.

We have some rough numbers in terms of opcounts, both with and without directory (and not file) delegations enabled. We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context. Again, these numbers are rough, but indicate that compilation environments stand to benefit from directory delegations.

Doing make(1) on cscope-15.5 (first without, then with directory delegations):

Operation    Without      With
READ:            136       124
WRITE:           137       136
OPEN:           1576      1576
ACCESS:         1169       161  (86% reduction)
GETATTR:         903       628  (30% reduction)
LOOKUP:         1494       496  (67% reduction)
GET_DIR_DELEG:               7
DELEGRETURN:                 1

Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):

Operation    Without      With
READ:          19803     19892
WRITE:         21921     21869
OPEN:         497472    494648
ACCESS:        20638      3406  (83.5% reduction)
GETATTR:       41794     24563  (41.2% reduction)
LOOKUP:        45063     17447  (61.3% reduction)
READDIR:        1016       884  (13.0% reduction)
GET_DIR_DELEG:             750
DELEGRETURN:              none
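
The reductions can be recomputed directly from the opcounts. The helper below is only a convenience for checking the arithmetic; the counts are copied from the two runs above.

```python
def reduction(without, with_):
    """Percentage of operations eliminated, rounded to one decimal place."""
    return round(100.0 * (without - with_) / without, 1)


# (without delegations, with delegations)
cscope = {"ACCESS": (1169, 161), "GETATTR": (903, 628), "LOOKUP": (1494, 496)}
kernel = {"ACCESS": (20638, 3406), "GETATTR": (41794, 24563),
          "LOOKUP": (45063, 17447), "READDIR": (1016, 884)}

for label, counts in (("cscope-15.5", cscope), ("linux-2.6.16", kernel)):
    for op, (before, after) in counts.items():
        print(f"{label:13s} {op:8s} {reduction(before, after):5.1f}%")
```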

Status

We are currently developing reasonably representative tests that show the benefits of directory delegations (in terms of opcounts); pynfs tests are also being written.

The client

The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that the directory has not changed during that span.
As long as the client holds such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory. In some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR to the server).
README: any suggestions here? --> TODO: get more opcounts! (hosting a webserver's docroot off an NFS mount? PATH or LD_LIBRARY_PATH stuff?)
TODO: redo existing opcount tests and instead tally bandwidth savings.
TODO: getting real NFSv4 workload network traces would be great (can you help? --> email nfsv4@linux-nfs.org).
When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?

The server

Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).
The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.
An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.
The corresponding VFS-level operations also break delegations and are being tested.
How to acknowledge/when to act upon resource pressures? --> e.g., after compiling the linux kernel, a client holds ~750 delegations -- that's roughly 50KB of state on the server, and nearly as much on the client.
TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.
TODO: also -- policy, look at dir deleg/file deleg interactions, ..

CITI Experience with Directory Delegations