CITI Experience with Directory Delegations

From Linux NFS

(Difference between revisions)
(rewrite operations section)
(Negative Caching)
 
(36 intermediate revisions not shown)
Line 1: Line 1:
-
'''NOTE: this is a rough work-in-progress; please send criticism to richterd at (nospam) citi.umich.edu thank you.'''
+
=Background=
-
'''[2006-8-2:''' ''I've added some rough, preliminary numbers of opcounts from doing compiles with/without directory delegations''''']'''
+
To improve performance and reliability, NFSv4.1 introduces read-only '''directory delegations''', a protocol extension that allows consistent caching of directory contents. 
 +
CITI is implementing directory delegations as described in Section 11 of the [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft].
-
=Background=
+
==Directory Caching in NFSv4==
NFSv4 allows clients to cache directory contents:
Line 11: Line 12:
* ACCESS and GETATTR use a directory metadata cache.
-
To limit the use of stale cached information, [http://www.ietf.org/rfc/rfc3530.txt?number=3530 RFC 3530 ] suggests a time-bounded consistency model, which forces the client to revalidate cached directory information.   
+
To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically.
"Directory caching for the NFS version 4 protocol is similar to previous versions.  Clients typically cache directory information for a duration determined by the client.  At the end of a predefined timeout, the client will query the server to see if the directory has been updated.  By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes.  Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call.  By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]
Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.
-
 
-
"Directory caching for the NFS version 4 protocol is similar to previous versions.  Clients typically cache directory information for a duration determined by the client.  At the end of a predefined timeout, the client will query the server to see if the directory has been updated.  By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes.  Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call.  By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]
 
Analysis of network traces at the University of Michigan ('''FIXME''': need link to a copy of Brian Wickman's prelim) shows that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.
 +
 +
==How Directory Delegations Can Help==
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments.  However, it does not address environments where there are numerous queries for files that do not exist.  In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries.  Examples of high miss activity are compilation in software development environments.  The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]
-
To improve performance and reliability, NFSv4.1 introduces read-only '''directory delegations''', a protocol extension that allows consistent caching of directory contents.
+
A common "high miss" case involves shell PATH lookups.
-
CITI is implementing directory delegations as described in Section 11 of the minor version draft.  (Section 11 also describes a directory notification extension that CITI is not implementing.)
+
To execute a program, the shell walks down a list of directories specified in a user's $PATH
-
   
+
environment variable and tries to locate the executable file in each directory. 
 +
It is not uncommon to find a large number of directories in the list.  When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories be revalidated, in case the file has appeared in the meantime.
 +
 
 +
Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted.    With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as '''negative dentry caching'''.  With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.
 +
 
 +
Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations.
 +
The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.
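As a rough illustration of the "high miss" cost described above, the following Python sketch counts server round trips for repeated $PATH searches with and without delegations. It is a toy model, not NFS client code: the <code>ToyClient</code> class, the one-RPC-per-lookup accounting, and the directory names are assumptions made for this example, and positive entries are deliberately not cached so that only the effect of negative caching shows up.

```python
class ToyClient:
    """Toy model of a client resolving names in directories held on a server."""

    def __init__(self, server_dirs, delegated=False):
        self.server_dirs = server_dirs  # directory name -> set of entries on the server
        self.delegated = delegated      # True if delegations are held on every directory
        self.negative = set()           # (directory, name) pairs known to be absent
        self.rpcs = 0                   # round trips made to the server

    def lookup(self, directory, name):
        # A negative dentry can only be trusted while a delegation is held.
        if self.delegated and (directory, name) in self.negative:
            return False
        self.rpcs += 1
        found = name in self.server_dirs[directory]
        if not found:
            self.negative.add((directory, name))
        return found

    def search_path(self, path, name):
        # Walk the path in order, as a shell does, and stop at the first hit.
        for directory in path:
            if self.lookup(directory, name):
                return directory
        return None


def count_rpcs(delegated, runs=3):
    """Run the same PATH search several times and report the RPC count."""
    path = ["/usr/local/bin", "/usr/bin", "/bin", "/opt/tools/bin"]
    server_dirs = {d: set() for d in path}
    server_dirs["/opt/tools/bin"].add("mytool")  # hypothetical program name
    client = ToyClient(server_dirs, delegated)
    for _ in range(runs):
        client.search_path(path, "mytool")
    return client.rpcs
```

In this model, three runs over the four-entry path cost 12 lookups without delegations and 6 with them, since the three misses are paid for only once.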
 +
 
 +
Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic
 +
consists of the periodic GETATTRs sent by clients when an attribute timeout
 +
triggers a cache revalidation.  But a delegated directory need not be revalidated unless the directory is modified.
 +
 
 +
  * Should reference Wickman and ... um ... CMU?  Ousterhout?
 +
* From which we can make "a great deal" more specific?
 +
 
==Directory Delegation Operations==
An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation.
-
Granting a delegation request is entirely at the
+
Granting a delegation request is solely at the server's discretion, and the delegation may be
-
server's discretion.
+
recalled at any time.
-
Upon receipt of an operation that conflicts with an existing delegation, the server recalls the delegation from all
+
Upon receiving an operation that conflicts with an existing delegation, the server must first
-
clients holding the delegation by issuing them the XXX callback operation.
+
recall from all of its clients any delegations on the directory (or directories) being mutated.  
-
When a client receives a recall request, it relinquishes the delegation and responds to the server with the DELEGRETURN operation,
+
When a client receives that CB_RECALL callback operation, it relinquishes the delegation in
-
When all the clients have returned the delegation, the server proceeds with the conflicting operation.
+
question by responding to the server using the DELEGRETURN operation.
 +
When all of the requisite delegations have been returned (or forcibly timed out), the server
 +
allows the conflicting operation to proceed.
-
Although NFS clients and servers have knowledge of the acquisition and recall of directory delegations, delegation state  is opaque to applications.
+
Although NFS clients and servers have knowledge of the acquisition and recall of directory  
 +
delegations, delegation state  is opaque to applications.
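The recall sequence described in this section can be sketched as a small Python model. This is only an illustration under simplifying assumptions: <code>ToyServer</code> and its method names are hypothetical, the server always grants a request, every client returns its delegation immediately, and timeouts are not modeled.

```python
class ToyServer:
    """Toy model of one delegated directory on a server."""

    def __init__(self):
        self.entries = set()   # names currently in the directory
        self.holders = set()   # clients holding a delegation on it
        self.log = []          # protocol operations, in order

    def get_dir_delegation(self, client):
        # Granting is at the server's discretion; this toy server always grants.
        self.holders.add(client)
        self.log.append(("GET_DIR_DELEGATION", client))
        return True

    def delegreturn(self, client):
        self.holders.discard(client)
        self.log.append(("DELEGRETURN", client))

    def create(self, client, name):
        # A conflicting operation: recall every outstanding delegation first.
        for holder in sorted(self.holders):
            self.log.append(("CB_RECALL", holder))
            self.delegreturn(holder)  # a well-behaved client returns at once
        # Only now may the conflicting operation proceed.
        self.entries.add(name)
        self.log.append(("CREATE", client, name))
```

For example, if clients A and B hold delegations and client C creates a file, the log shows a CB_RECALL and a DELEGRETURN for each of A and B before the CREATE is recorded.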
-
==(problems and solutions)==
+
==Notifications==
-
Common examples of the previously mentioned "high miss" cases involve the PATH
+
After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used.   
-
shell variable and the loading of shared libraries.  When a user executes a
+
For a large directory, this cost, which is above and beyond the two RPCs needed for the recall, can be quite expensive.
-
program, the user's shell examines the list of directories in the PATH
+
If the directory also happens to be a popular one, with multiple clients holding delegations, the performance impact on the server can be considerable.
-
environment variable and looks for the program binary in each of those
+
-
directories, in turn, until the program is found.  Often there can be 5 to 10
+
-
(or more) PATH entries, and normally a given program binary is in only one of
+
-
those directories.  Even when the client is searching for repeatedly-absent
+
-
files, it must nevertheless check with the server in case they have appeared.
+
-
A similar situation arises during software compilation, when the include
+
To reduce the impact of a directory modification when the change is small,
-
paths are repeatedly serially searched for header files. Given that header files are
+
the NFSv4.1 Internet Draft defines an extension to delegations called ''notifications.''
-
generally in only one of those directories, this results in a high miss-rate.
+
When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.
-
With respect to the PATH and shared library cases (where no directory-mutating
+
By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.
-
operations are being performed), directory delegations provide a significant
+
-
advantage.  This stems from "negative dentry caching" -- that is, the caching
+
-
of information about non-existent directory entries.  In the absence of
+
-
directory delegations, if a client attempts to OPEN a non-existent file,
+
-
close-to-open consistency semantics require that the operation be sent to the
+
-
server, regardless of whether the client has a negative dentry cached.
+
-
However, if a client holds a delegation on the directory and has a negative
+
-
dentry stored for the missing file, it can "trust" that the file has not
+
-
appeared, which obviates the need for the OPEN.
+
-
Another example is if a client performs an 'ls' or a 'stat' on a non-existent
+
Notifications are motivated by some common cases.  For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory.  Other examples include program compilation and CVS updates, which also quickly create and destroy files.
-
file, three separate RPC calls are made to service an ACCESS, a LOOKUP, and a
+
-
GETATTR -- only to find that the file still does not exist.  If the directory
+
-
were delegated and the client has a negative dentry for the non-existent file,
+
-
however, the client once again is assured that the file has not appeared.
+
-
Beyond just these "high miss" cases, analysis of NFSv3 (whose client cache
+
In the proposal for notifications, a client can request notifications on
-
revalidation semantics NFSv4 roughly mirrors) network traces by Brian Wickman
+
directory entry and directory attribute changes, as well as directory entry
-
at the University of Michigan shows that a significant amount of NFS traffic
+
attribute changes.  To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously.  In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.
-
consists of the periodic GETATTRs which clients send when an attribute timeout
+
-
triggers a cache revalidation.  Naturally, if a directory is delegated, it
+
-
need not be revalidated until the directory is mutated.
+
-
==(notifications)==
+
  * ref ousterhout
-
Another aspect concerning directory delegations in the minor version draft is
+
-
an extension called notifications. The intent behind notifications is to
+
-
avoid having to revoke a delegation and force the client to refetch the
+
-
contents of a directory when only a relatively small change has been made.  If
+
-
a delegated directory is very large, it can be expensive to return a
+
-
delegation -- which involves two RPCs -- and subsequently refetch the
+
-
directory's entire contents, particularly if multiple clients have delegations
+
-
on that directory.
+
-
Notifications mitigate this circumstance by allowing a client to request that
+
===Issues with notifications===
-
the server merely send a message describing the change; this avoids having to
+
Notifications require state on the NFS server to keep track of them and work to deliver them.
-
revoke the delegation and refetch the directory. In environments where files
+
Wickman's simulator work at CITI
-
are created and deleted with some moderate degree of frequency, notifications
+
found that in some
-
could conceivably provide significant benefits where plain directory
+
cases, the number of
-
delegations alone would result in a prohibitive number of recalls and
+
notifications dispatched to support a directory delegation can exceed
-
directory refreshes.  Examples might include directories where lockfiles are
+
the cost of simply not using a delegation at all.   
-
used, or where a few new files are created or deleted periodically, as with
+
A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.
-
some compilation or when doing CVS updates.
+
 
-
+
Notifications also introduce a level of "fairness" to maintain, in terms of deciding how to
-
In the proposed model, a client would be able to request notifications on
+
allot notifications among multiple clients, given limited server resources.
-
directory entry and directory attribute changes, as well as directory entry
+
 
-
attribute changes. Enabling a server to track that would involve a lot of
+
Notifications can be sent asynchronously, at a rate negotiated by the client and server.
-
extra state.  Furthermore, the client and server negotiate a rate at which
+
This allows the server to batch several notifications
-
notifications are sent, which allows the server to batch several notifications
+
and to prune self-cancelling
-
and deliver them asynchronously and conceivably even prune self-cancelling
+
notifications (e.g., "CREATE foo ...  REMOVE foo").
-
notifications (e.g., "CREATE foo ...  REMOVE foo"). Notifications would also
+
Indeed, Wickman found that for
-
introduce another level of "fairness" to maintain, in terms of deciding how to
+
certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50.
-
allot notifications among multiple clients.  Wickman's simulator work at CITI
+
For instance, lock files in mail boxes often have a lifetime
-
investigated some aspects of enabling notifications and found that in some
+
under 10 seconds, so addition/deletion notifications can be pruned.   
-
cases, certainly with directory entry attribute changes, the number of
+
However, there
-
notifications dispatched to support the directory delegation far outweighed
+
is a trade-off between the batching delay and client
-
the cost of simply not using a delegation at all.  He also found that for
+
cache consistency.   
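The pruning of self-cancelling notifications within one batch can be sketched as follows. This is a minimal Python illustration, not the draft's notification format: the <code>(op, name)</code> tuples and the <code>prune_batch</code> helper are assumptions made for the example, and only CREATE and REMOVE are considered.

```python
def prune_batch(events):
    """Drop CREATE/REMOVE pairs for the same name within one batch.

    events is a list of (op, name) tuples in the order they occurred.
    A REMOVE cancels an earlier CREATE of the same name that is still
    pending; a REMOVE of a file that existed before the batch is kept.
    """
    pending = []
    for op, name in events:
        if op == "REMOVE" and ("CREATE", name) in pending:
            pending.remove(("CREATE", name))  # the pair cancels out
        else:
            pending.append((op, name))
    return pending
```

A short-lived lockfile that is created and removed inside one batching window therefore generates no notification at all, while the longer the window, the longer a client may go without learning of changes that do survive pruning.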
-
certain workloads, if a server batched notifications for a long time (>20
+
 
-
seconds, sometimes >50 seconds), a significant reduction (5x-50x) in traffic
+
Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing
-
could be achieved.  For instance, lockfiles in mailboxes often have a lifetime
+
notifications at this time.
-
under 10 seconds, so addition/deletion notifications could be pruned.  There
+
-
is, however, a direct trade-off between the batching delay and the client's
+
-
cache consistency.  A lesser version of notifications -- wherein only
+
-
directory-mutating operations would generate notifications -- has been loosely
+
-
proposed and would involve much less server state, but seems not to be going
+
-
anywhere.  Primarily because of the complexity of implementation and the open
+
-
questions of how best to benefit from notifications, we are not implementing
+
-
them at this time.
+
=Using Directory Delegations=
Line 125: Line 106:
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled.  The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.
-
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory.  (Obviously, other client's delegations on that directory must still be recalled.)
+
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory.  (Obviously, other clients' delegations on that directory must still be recalled.)
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:
Line 133: Line 114:
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.
-
Currently we are using the client's IP address for this.  However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address.  The final code will instead use the new sessions extensions in NFSv4.1 to identify the client.
+
Currently we are using the client's IP address for this.  However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address.  The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.
=Negative Caching=
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client.   
-
Currently, close-to-open consistency requires that, e.g.,  all OPENs are sent to the server (i.e., negative caching provides no benefit in that case).
+
Close-to-open consistency mandates that, even when previous LOOKUPs or OPENs for a given file have recently or repeatedly failed, subsequent attempts revalidate the parent directory with a GETATTR in case the file has appeared. With directory delegations, the client is assured that no entries have been added or removed while a delegation is in effect; this implies that negative dentries in a delegated directory actually can be "trusted".
-
With directory delegations, one is assured that no new entries or removals have occurred while a delegation is in-effect; this implies that
+
-
negative dentries in a delegated directory actually can be "trusted".   
+
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for  
-
an executable in PATH or a shared library in LD_LIBRARY_PATH.  Knowing just when to acquire those delegations may be a matter to address in  
+
a header file in include paths or a shared library in LD_LIBRARY_PATH ''(See the '''Some preliminary numbers''' section for more details)''.  Knowing just when to acquire those delegations may be a matter to address in client-side policy.
-
client-side policy.
+
=Delegations and the Linux VFS Lease Subsystem=
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem.  A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning up (or the lease is forcibly broken after a timeout).
 +
 +
Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call.  Leases used by NFS are all acquired and revoked in-kernel.
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated.  In order to implement  
directory delegations, we have added support for directory leases.  These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)).  Note that changes to existing files, for example, will not break directory leases.
-
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases.  However, it is still possible for a local process on the server to modify a directory without breaking directory leases.
+
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases.  We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases.  Our approach is described in the next section.
-
 
+
-
The final implementation will also ensure that operations by local processes break directory leases.
+
-
This will require addressing some tricky VFS locking issues: the difficulty is that, given that breaking a lease involves blocking the caller, one must ensure that no important locks -- like a directory inode's ''i_mutex'' -- are held while the calling kernel thread blocks.
+
-
 
+
-
==UPDATE==
+
-
 
+
-
At this point, we are testing general VFS-level directory lease-breaking -- i.e., both NFS and non-NFS operations will break leases.  Our approach is described in the next section.
+
-
 
+
-
Leases are usually acquired via the fcntl(2) call, and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call.  NFS leases are all acquired and revoked in-kernel.
+
=Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases=
-
In the following I will refer to the leases used to implement delegations as "NFS leases" and all other leases as "non-NFS leases".
+
In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".
-
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is '''also''' the caller performing an operation that conflicts with the lease-type, as described above.
+
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is '''''also''''' the caller performing an operation that conflicts with the lease-type, as described above.
Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned.  There are a number of different ways to do this:
Line 176: Line 147:
# Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.
-
For now, we have implemented option number 1.
+
For now, we have implemented option number 2.
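The approach of delaying the response briefly before falling back to NFS4ERR_DELAY can be sketched in Python as below. This is a hedged illustration only, not the server code: the <code>wait_for_returns</code> helper, the use of a <code>threading.Event</code> to stand for "all delegations returned", and the grace period are assumptions made for the example. The numeric value of NFS4ERR_DELAY is the one assigned in RFC 3530.

```python
import threading

NFS4_OK = 0
NFS4ERR_DELAY = 10008  # error value assigned in RFC 3530


def wait_for_returns(all_returned, grace_seconds):
    """Wait briefly for delegation returns, else tell the caller to retry.

    all_returned is a threading.Event that some other thread sets once
    every recalled delegation has come back (hypothetical bookkeeping).
    """
    if all_returned.wait(timeout=grace_seconds):
        return NFS4_OK        # quick return: the operation can proceed now
    return NFS4ERR_DELAY      # too slow: the client should retry later
```

The grace period trades a held server thread against an extra round trip for the client, so a real server would presumably keep it short.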
-
==UPDATE==
+
   The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory-mutating  
-
   The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory mutating  
+
   operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:
    
    
Line 198: Line 168:
   1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.
    
    
-
   2) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,
+
   2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,
     after which the breaker is unblocked and its operation succeeds.
    
    
-
   3) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present.  If break_lease() returns
+
   2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present.  If break_lease() returns
     -EWOULDBLOCK, drop the locks and call break_lease() and allow it to block.  Once the caller unblocks, restart the operation by reacquiring
     the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s).  Since lease-granting was disabled early-on, 
     the operation will succeed in one pass.
    
    
-
   4) Regardless of whether 2) or 3) happened, at the end lease-granting is naturally re-enabled for the inode(s) in question.
+
   3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.
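The control flow of the steps above, for the case where a lock is held, can be modeled in user-space Python. This is a sketch of the pattern only and not kernel code: <code>ToyInode</code>, its fields, and <code>mutate_directory</code> are hypothetical, the blocking path returns at once where a real kernel would sleep until the lease-holder responds, and the lookup that would be redone after the lock is retaken is only indicated by a comment.

```python
import errno
import threading


class ToyInode:
    """User-space stand-in for a directory inode with one possible lease."""

    def __init__(self, lease_held=False):
        self.i_mutex = threading.Lock()
        self.lease_held = lease_held
        self.grants_enabled = True
        self.blocking_breaks = 0

    def break_lease(self, nonblock):
        """Return 0 once no lease remains, or -EWOULDBLOCK if nonblock and a lease exists."""
        if not self.lease_held:
            return 0
        if nonblock:
            return -errno.EWOULDBLOCK
        # Blocking path: a real kernel would sleep here until the lease is returned.
        self.blocking_breaks += 1
        self.lease_held = False
        return 0


def mutate_directory(inode, operation):
    inode.grants_enabled = False  # step 1: no new leases from here on
    try:
        inode.i_mutex.acquire()
        # Step 2b: the lock is held, so first try without blocking.
        if inode.break_lease(nonblock=True) == -errno.EWOULDBLOCK:
            inode.i_mutex.release()            # never block while holding the lock
            inode.break_lease(nonblock=False)  # may block until leases are returned
            inode.i_mutex.acquire()            # restart: retake the lock, redo lookups
        try:
            return operation()
        finally:
            inode.i_mutex.release()
    finally:
        inode.grants_enabled = True  # step 3: re-enable lease granting
```

Because lease-granting is disabled before the first break attempt, no new lease can appear while the lock is dropped, which is why a single retry suffices in this model.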
-
 
+
=Policy (partial)=
Line 243: Line 212:
=Some preliminary numbers=
A significant demonstration of the benefits of negative dentry
A significant demonstration of the benefits of negative dentry
-
caching is with software compilation.  For instance, when
+
caching is software compilation.  For instance, when
-
building software using the make(1) program, various directories are
+
building software using make(1), various directories are
repeatedly searched for header files.  Since header files tend only to be
repeatedly searched for header files.  Since header files tend only to be
located in one of the directories, and since many object files depend on the
located in one of the directories, and since many object files depend on the
-
same headers, there are a great number of unnecessary checks.  By caching
+
same headers, there are a great number of unnecessary re-checks.  By caching
-
negative dentries, a significant number of NFS operations are obviated.
+
negative dentries, a significant number of NFS operations can be avoided.
-
We have some very rough numbers in terms of opcounts with vs. without directory (not file) delegations enabled.  We used a very naive client policy of simply requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own).  ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context.  Again, these numbers are ''rough'', but indicate that compilation environments stand to benefit from directory delegations.
+
We have some rough numbers in terms of opcounts, both with and without directory (and not file) delegations enabled.  We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own).  ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context.  Again, these numbers are ''rough'', but indicate that compilation environments stand to benefit from directory delegations.
   
   
''Doing make(1) on cscope-15.5 (first without, then with directory delegations):''
''Doing make(1) on cscope-15.5 (first without, then with directory delegations):''
Line 284: Line 253:
* The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span.  
* The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span.  
* As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
* As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
-
* .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR on the wire).
+
* .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR to the server).
-
* TODO: get more opcounts!  (hosting a webserver's docroot off an nfs mount?  PATH or LD_LIBRARY_PATH stuff?)
+
* '''README:  any suggestions here?  —> TODO:''' get more opcounts!  (hosting a webserver's docroot off an nfs mount?  PATH or LD_LIBRARY_PATH stuff?)
* TODO: redo existing opcount tests and instead tally bandwidth savings ...
* TODO: redo existing opcount tests and instead tally bandwidth savings ...
-
** getting real NFSv4 workload network traces would be great -- can you help?  (richterd AT citi.umich.edu)
+
** getting ''real'' NFSv4 workload network traces would be great...  '''(can you help?  —>   email nfsv4@linux-nfs.org)'''
* When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?
* When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?

Latest revision as of 20:55, 16 January 2008

Background

To improve performance and reliability, NFSv4.1 introduces read-only directory delegations, a protocol extension that allows consistent caching of directory contents. CITI is implementing directory delegations as described in Section 11 of NFSv4.1 Internet Draft.

Directory Caching in NFSv4

NFSv4 allows clients to cache directory contents:

  • READDIR uses a directory entry cache
  • LOOKUP uses the name cache
  • ACCESS and GETATTR use a directory metadata cache.

To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically.

"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." NFSv4.1 Internet Draft

Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.

Analysis of network traces at the University of Michigan (FIXME: need link to a copy of Brian Wickman's prelim) shows that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.

How Directory Delegations Can Help

"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." NFSv4.1 Internet Draft

A common "high miss" case involves shell PATH lookups. To execute a program, the shell walks down a list of directories specified in a user's $PATH environment variable and tries to locate the executable file in each directory. It is not uncommon to find a large number of directories in the list. When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories are revalidated, in case the file appears at some point.

Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted. With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as negative dentry caching. With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.

Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations. The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.

Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic consists of the periodic GETATTRs sent by clients when an attribute timeout triggers a cache revalidation. But a delegated directory need not be revalidated unless the directory is modified.

* Should reference Wickman and ... um ... CMU?  Ousterhout?
* From which we can make "a great deal" more specific?

Directory Delegation Operations

An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation. Granting a delegation request is solely at the server's discretion, and the delegation may be recalled at any time.

Upon receiving an operation that conflicts with an existing delegation, the server must first recall from all of its clients any delegations on the directory (or directories) being mutated. When a client receives that CB_RECALL callback operation, it relinquishes the delegation in question by responding to the server using the DELEGRETURN operation. When all of the requisite delegations have been returned (or forcefully timed-out), the server allows the conflicting operation to proceed.

Although NFS clients and servers have knowledge of the acquisition and recall of directory delegations, delegation state is opaque to applications.

Notifications

After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used. For a large directory, this cost, which comes on top of the two RPCs needed for the recall, can be high. If the directory also happens to be a popular one, with multiple clients holding delegations, the performance impact on the server can be considerable.

To reduce the impact of a directory modification when the change is small, the NFSv4.1 Internet Draft defines an extension to delegations called notifications. When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.

By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.

Notifications are motivated by some common cases. For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory. Other examples include program compilation and CVS updates, which also quickly create and destroy files.

In the proposal for notifications, a client can request notifications on directory entry and directory attribute changes, as well as directory entry attribute changes. To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously. In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.

* ref ousterhout

Issues with notifications

Notifications require state on the NFS server to keep track of them and work to deliver them. Wickman's simulator work at CITI found that in some cases, the number of notifications dispatched to support a directory delegation can exceed the cost of simply not using a delegation at all. A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.

Notifications also introduce a level of "fairness" to maintain, in terms of deciding how to allot notifications among multiple clients, given limited server resources.

Notifications can be sent asynchronously, at a rate negotiated by the client and server. This allows the server to batch several notifications and to prune self-cancelling notifications (e.g., "CREATE foo ... REMOVE foo"). Indeed, Wickman found that for certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50. For instance, lock files in mail boxes often have a lifetime under 10 seconds, so addition/deletion notifications can be pruned. However, there is a trade-off between the batching delay and client cache consistency.
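The pruning of self-cancelling notifications can be illustrated with a minimal sketch. It assumes a batch is simply an ordered list of (operation, name) pairs; the function name prune_batch and the event representation are hypothetical and are not taken from the NFSv4.1 draft or from CITI's code. Only the "CREATE foo ... REMOVE foo" case is pruned; every other sequence is kept verbatim.

```python
def prune_batch(events):
    """Drop CREATE/REMOVE pairs for the same name from one batch of events.

    `events` is an ordered list of (op, name) tuples collected during one
    batching interval (a hypothetical representation for this sketch).
    """
    kept = []
    for op, name in events:
        # Index of the most recent still-pending event for this name, if any.
        idx = next(
            (i for i in range(len(kept) - 1, -1, -1) if kept[i][1] == name),
            None,
        )
        if op == "REMOVE" and idx is not None and kept[idx][0] == "CREATE":
            # Created and removed within the same interval: the client
            # never needs to hear about either event.
            del kept[idx]
        else:
            kept.append((op, name))
    return kept


batch = [("CREATE", "foo.lock"), ("CREATE", "bar"), ("REMOVE", "foo.lock")]
print(prune_batch(batch))
```

A longer batching interval gives more pairs a chance to cancel, at the price of the client's cache lagging further behind the server.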

Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing notifications at this time.

Using Directory Delegations

While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.

However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)
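As a rough illustration of this special case, the sketch below computes which delegations would have to be recalled before a modification is allowed to proceed. The client identifiers are plain strings and the function name delegations_to_recall is hypothetical; this is not the server's actual data structure.

```python
def delegations_to_recall(holders, modifier):
    """Return the clients whose delegations must be recalled.

    `holders` is the set of clients holding a delegation on the directory,
    and `modifier` is the client performing the modifying operation. The
    modifier is allowed to keep its own delegation.
    """
    return sorted(client for client in holders if client != modifier)


holders = {"client-a", "client-b", "client-c"}
print(delegations_to_recall(holders, "client-b"))
```

The operation can proceed once every client in the returned list has returned its delegation (or has been timed out).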

Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:

"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."

Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.

Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.

Negative Caching

One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. Close-to-open consistency mandates that even in a case where previous LOOKUPs or OPENs for a given file have recently or repeatedly failed, subsequent attempts require that the parent directory is revalidated with a GETATTR in case the file appears. With directory delegations, the client is assured that no entries have been added or removed while a delegation is in effect; this implies that negative dentries in a delegated directory actually can be "trusted".

This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for a header file in include paths or a shared library in LD_LIBRARY_PATH (See the Some preliminary numbers section for more details). Knowing just when to acquire those delegations may be a matter to address in client-side policy.
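The effect on RPC counts can be sketched with a toy client-side cache. The class DirCache, its fields, and the lookup_on_server callback are hypothetical stand-ins, not the Linux client's dcache. As a simplification, the sketch sends every lookup to the server when no delegation is held, where a real client would use timeout-based revalidation.

```python
class DirCache:
    """Toy model of one directory's name cache on an NFS client."""

    def __init__(self, lookup_on_server):
        # callable(name) -> bool; stands in for a LOOKUP sent to the server
        self.lookup_on_server = lookup_on_server
        self.delegated = False
        self.entries = {}   # name -> True (exists) or False (negative entry)
        self.rpcs = 0

    def lookup(self, name):
        if self.delegated and name in self.entries:
            # Under a delegation both positive and negative entries can be
            # trusted, so no RPC is needed.
            return self.entries[name]
        self.rpcs += 1
        exists = self.lookup_on_server(name)
        self.entries[name] = exists
        return exists

    def recall(self):
        # After a recall the cached entries can no longer be trusted.
        self.delegated = False
        self.entries.clear()


server_dir = {"stdio.h"}
cache = DirCache(lambda name: name in server_dir)
cache.delegated = True
for _ in range(5):
    cache.lookup("missing.h")
print(cache.rpcs)
```

With the delegation held, the five lookups of the nonexistent name cost a single RPC; without it, each lookup in this model goes to the server.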

Delegations and the Linux VFS Lease Subsystem

We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).

Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. Leases used by NFS are all acquired and revoked in-kernel.

The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files, for example, will not break directory leases.

Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases. Our approach is described in the next section.

Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases

In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".

NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type, as described above.

Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:

  1. Delay responding to the original operation until all recalls are complete.
  2. Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.
  3. Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.

For now, we have implemented option number 2.

 The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory-mutating 
 operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:
 
 When breaking a lease where the call is coming over NFS:
 1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try         
    break_lease() with O_NONBLOCK.  This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially)
    long periods.
 
 2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.
 
 3) If there was a lease, break_lease() will send the break signal(s) and nfsd will fail the operation (re-enabling lease-granting on the inode first)
    and the client gets NFS4ERR_DELAY (and should retry).  The downside to this is that a pathological case could arise wherein we break a lease,
    return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up 
    with a cycle.
 
 
 When breaking a lease where the call is server-local:
 1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.
 
 2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,
    after which the breaker is unblocked and its operation succeeds.
 
 2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present.  If break_lease() returns
    -EWOULDBLOCK, drop the locks and call break_lease() and allow it to block.  Once the caller unblocks, restart the operation by reacquiring
    the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s).  Since lease-granting was disabled early-on, 
    the operation will succeed in one pass.
 
 3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.
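
The NFS-side steps above can be modelled in user space as follows. This is only a sketch of the control flow, not kernel code: the Inode class, break_lease_nonblock(), and nfsd_mutate() are hypothetical stand-ins, and the status values are plain strings rather than protocol constants.

```python
import errno


class Inode:
    """Toy stand-in for a directory inode with lease bookkeeping."""

    def __init__(self, leases=0):
        self.leases = leases            # outstanding leases on the directory
        self.granting_enabled = True    # may new leases be granted?
        self.break_sent = False         # have holders been asked to return?


def break_lease_nonblock(inode):
    """Model of break_lease() called with O_NONBLOCK."""
    if inode.leases == 0:
        return 0
    inode.break_sent = True             # notify holders, but do not wait
    return -errno.EWOULDBLOCK


def nfsd_mutate(inode, operation):
    """Model of a directory-mutating operation arriving over NFS."""
    inode.granting_enabled = False      # step 1: disable lease-granting
    try:
        if break_lease_nonblock(inode) == -errno.EWOULDBLOCK:
            return "NFS4ERR_DELAY"      # step 3: client should retry later
        operation()                     # step 2: no lease, just do the work
        return "NFS4_OK"
    finally:
        inode.granting_enabled = True   # re-enabled on both paths


done = []
directory = Inode(leases=1)
print(nfsd_mutate(directory, lambda: done.append("create")))
directory.leases = 0                    # the holder returned its lease
print(nfsd_mutate(directory, lambda: done.append("create")))
```

The pathological cycle mentioned in step 3 corresponds to another lease being granted between the two calls, which this model permits because lease-granting is re-enabled before the client retries.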

Policy (partial)

client: prior to a READDIR, request.

client: if we've sent 3 or 5 revalidations and a directory hasn't changed, request.

client: when to voluntarily surrender? e.g., after a kernel-compile, i hold hundreds of delegations.

server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones.

server: will need to ID "misbehaving" clients and cordon them off.

server: when to preemptively recall? --> server load metric
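
Two of the policy ideas above can be written down as simple predicates. The thresholds are placeholders: REVALIDATION_THRESHOLD picks the lower of the "3 or 5" mentioned above, and RECALL_QUIET_PERIOD is an arbitrary choice for the unspecified "N minutes". The function names are hypothetical.

```python
REVALIDATION_THRESHOLD = 3       # placeholder; the notes above say "3 or 5"
RECALL_QUIET_PERIOD = 5 * 60.0   # placeholder for "N minutes", in seconds


def client_should_request(about_to_readdir, unchanged_revalidations):
    """Client side: ask for a delegation before a READDIR, or after several
    revalidations that found the directory unchanged."""
    return about_to_readdir or unchanged_revalidations >= REVALIDATION_THRESHOLD


def server_may_grant(last_recall_time, now):
    """Server side: refuse new delegations on a directory whose delegation
    was recalled recently. Times are in seconds; None means never recalled."""
    if last_recall_time is None:
        return True
    return now - last_recall_time >= RECALL_QUIET_PERIOD


print(client_should_request(False, 3))
print(server_may_grant(last_recall_time=1000.0, now=1100.0))
```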

(simulator)

Previous work at CITI by Brian Wickman consisted of prototyping and analyzing file and directory delegations, based on recorded network traces of NFSv3 use in college environments. The stateless nature of NFSv3 required the instrumentation of OPEN and CLOSE operations into the traces, e.g., but given that in the absence of delegations, NFSv4 client-side cache validation closely mimics that of NFSv3, enough information was available to get an overall impression of the state of the clients' caches. Wickman wrote a simulator to use the instrumented traces to test different delegation models and policies. We now want to use real-world NFSv4 network traces with the simulator, but given the current absence of widescale mainstream deployment of NFSv4, we need to find such traces of representative workloads. Using actual NFSv4 traffic will give a more accurate picture of client-cache state and will more clearly identify operations obviated by delegations; this is both because the traces will not need to be instrumented, and because NFSv3 lacks the COMPOUND operation, with which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator will allow us to develop client- and server-side policies for requesting and granting delegations.

Some preliminary numbers

A significant demonstration of the benefits of negative dentry caching is software compilation. For instance, when building software using make(1), various directories are repeatedly searched for header files. Since header files tend only to be located in one of the directories, and since many object files depend on the same headers, there are a great number of unnecessary re-checks. By caching negative dentries, a significant number of NFS operations can be avoided.

We have some rough numbers in terms of opcounts, both with and without directory (and not file) delegations enabled. We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context. Again, these numbers are rough, but indicate that compilation environments stand to benefit from directory delegations.

Doing make(1) on cscope-15.5 (first without, then with directory delegations):

READ:            136       124
WRITE:           137       136
OPEN:           1576      1576
ACCESS:         1169       161  (86% reduction)
GETATTR:         903       628  (30% reduction)
LOOKUP:         1494       496  (67% reduction)
GET_DIR_DELEG:               7
DELEGRETURN:                 1

Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):

READ:          19803     19892
WRITE:         21921     21869
OPEN:         497472    494648
ACCESS:        20638      3406  (83.5% reduction)
GETATTR:       41794     24563  (41.2% reduction)
LOOKUP:        45063     17447  (61.3% reduction)
READDIR:        1016       884  (13.0% reduction)
GET_DIR_DELEG:             750
DELEGRETURN:              none
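
The reduction percentages in these tables are computed as (without - with) / without. A short sketch that recomputes them for the cscope build, using the counts from the first table; the helper name percent_reduction is ours:

```python
def percent_reduction(before, after):
    """Percentage of operations avoided relative to the baseline count."""
    return 100.0 * (before - after) / before


# (without delegations, with delegations), taken from the cscope-15.5 table
cscope = {
    "ACCESS": (1169, 161),
    "GETATTR": (903, 628),
    "LOOKUP": (1494, 496),
}

for op, (before, after) in cscope.items():
    print(f"{op}: {percent_reduction(before, after):.1f}% reduction")
```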

Status

At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms of opcounts); pynfs tests are also being written.

The client

  • The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span.
  • As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...
  • .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR to the server).
  • README: any suggestions here? —> TODO: get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)
  • TODO: redo existing opcount tests and instead tally bandwidth savings ...
    • getting real NFSv4 workload network traces would be great... (can you help? —>  email nfsv4@linux-nfs.org)
  • When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?

The server

  • Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).
  • The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.
  • An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.
  • The corresponding VFS-level operations also break delegations and are being tested.
  • How to acknowledge/when to act upon resource pressures? --> e.g., after compiling the linux kernel, a client holds ~750 delegations -- that's like 50KB of state on the server, and nearly as much on the client.
  • TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.
  • TODO: also -- policy, look at dir deleg/file deleg interactions, ..