CITI Experience with Directory Delegations

From Linux NFS

Revision as of 23:50, 28 April 2006

NOTE: this is a rough work-in-progress and will be fleshed out over the next few days; please send criticism to richterd at (nospam) citi.umich.edu. Thank you.

Directory Delegations Background

NFSv4.1 introduces read-only directory delegations, a protocol addition intended to enable clients to perform more-aggressive caching. More specifically, the goal is essentially to allow clients to avoid excess GETATTR, ACCESS, and LOOKUP calls to the server by increasing the reliability with which clients are able to do directory entry caching (READDIR), name caching (LOOKUP), and directory metadata caching (ACCESS and GETATTR).

The following quoted subsections are from Section 11 of the NFSv4.1 minor version draft:

NFSv4 client caching behavior

"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed."

NFSv4.1 delegations extensions

"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments."

Furthermore, analysis of NFSv3 network traces (NFSv4 mirrors NFSv3's client cache semantics) by Brian Wickman at the University of Michigan (FIXME: need link to a copy of his prelim) shows that a surprisingly large share of NFS traffic consists of the periodic GETATTRs that clients send when a timeout triggers a cache revalidation.

At CITI, we are in the process of implementing directory delegations as described in Section 11 of the minor version draft, although we are not at this time implementing the notifications extension also described therein. The following are some specific aspects of the work.


Delegations and the Linux VFS Lease Subsystem

Directory delegations are implemented on the server with extensions to the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a certain timeout).

The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files within the directory will not break directory leases.

However, for the very near-term, only NFS protocol operations break directory leases. A couple of operations involve some tricky locking issues in the VFS, which will be addressed. The difficulty is that, because breaking a lease involves blocking the caller, one must ensure that no important locks -- like a directory inode's i_sem -- are held while the calling kernel thread blocks.
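The break rules for directory leases described above can be written down as a small predicate. This is only an illustrative sketch: the enum and the function name are hypothetical and are not taken from the kernel or from our patches, and treating read-type operations as non-conflicting is an assumption based on the delegation being read-only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation kinds, named for illustration only. */
enum dir_op {
    DIR_OP_ADD_ENTRY,      /* a new entry is added to the directory */
    DIR_OP_REMOVE_ENTRY,   /* an entry is deleted from the directory */
    DIR_OP_RENAME_ENTRY,   /* an entry is renamed */
    DIR_OP_SETATTR_DIR,    /* the directory's own metadata changes, e.g. chown */
    DIR_OP_MODIFY_FILE,    /* a change to an existing file in the directory */
    DIR_OP_READ_DIR        /* read-type access (assumed non-conflicting) */
};

/* Returns true if the operation should break a lease on the directory. */
bool dir_lease_must_break(enum dir_op op)
{
    switch (op) {
    case DIR_OP_ADD_ENTRY:
    case DIR_OP_REMOVE_ENTRY:
    case DIR_OP_RENAME_ENTRY:
    case DIR_OP_SETATTR_DIR:
        return true;
    case DIR_OP_MODIFY_FILE:
    case DIR_OP_READ_DIR:
        return false;
    }
    assert(0 && "unknown dir_op");
    return false;
}
```

Note that the predicate says nothing about where the break may safely happen; the locking concern (not blocking while holding something like i_sem) is a separate matter.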

Leases are usually acquired via the fcntl(2) call, and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. NFS leases are all acquired and revoked in-kernel.
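For reference, here is a minimal userland sketch of the non-NFS path: take a read lease on a file with fcntl(2), ask the kernel which lease is held, and give it back. F_SETLEASE, F_GETLEASE, F_RDLCK, and F_UNLCK are the real fcntl(2) names; the helper name lease_roundtrip is made up for this example. The sketch does not wait for a break signal (SIGIO by default); a real lease-holder would do its cleanup in response to that signal before releasing. Whether the lease is granted depends on the kernel, the filesystem, and file ownership, so the helper reports refusal rather than assuming success.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes F_SETLEASE / F_GETLEASE in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallbacks in case <fcntl.h> was already included without _GNU_SOURCE;
 * these are the Linux ABI values (F_LINUX_SPECIFIC_BASE + 0 and + 1). */
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#endif
#ifndef F_GETLEASE
#define F_GETLEASE 1025
#endif

/*
 * Take a read lease on `path`, ask the kernel which lease we hold, then
 * release it.  Returns the held lease type (F_RDLCK) on success, or -1 if
 * the file cannot be opened or the lease is refused.
 */
int lease_roundtrip(const char *path)
{
    assert(path != NULL);

    /* A read lease can only be placed on a descriptor opened read-only. */
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        close(fd);
        return -1;
    }

    int held = fcntl(fd, F_GETLEASE);

    /* The second fcntl(2) call mentioned above: give the lease back. */
    fcntl(fd, F_SETLEASE, F_UNLCK);
    close(fd);
    return held;
}
```

None of this applies to NFS leases, which never pass through fcntl(2) or signals at all.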


Using Directory Delegations

While a client holds a delegation on a directory, it is assured that the directory will not be mutated without the delegation first being synchronously recalled. However, the notion that a directory delegation is "read-only" has a special case associated with it: a given client's mutating operation on a directory delegated to it will not trigger a recall of that client's delegation (obviously, all other delegations on that directory will have to be recalled before the mutating operation succeeds). Again, from Section 11 of the minor version draft:

"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."

Note that a client's ability to mutate a directory without triggering a recall of its own delegation is not at all a write delegation.
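The recall rule can be sketched as a filter over the current delegation holders: everyone except the mutating client gets recalled. The function name and the use of plain integer client IDs are assumptions made for this example, not anything defined by the draft or by our server code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * `holders` lists the (hypothetical) IDs of clients holding a delegation on
 * one directory; `mutator` is the client about to modify that directory.
 * Copies into `out` the clients whose delegations must be recalled first,
 * i.e. every holder except the mutator.  `out` must have room for n entries.
 * Returns the number of entries written.
 */
size_t delegations_to_recall(const int *holders, size_t n, int mutator, int *out)
{
    assert(holders != NULL || n == 0);
    assert(out != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (holders[i] != mutator)
            out[count++] = holders[i];
    }
    return count;
}
```

The mutating operation may only proceed once every delegation in the returned set has been recalled; the mutator's own delegation stays in place throughout.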


Recalling NFS Delegations vs. Breaking (Non-NFS) Linux VFS Leases

The NFS delegation leases and existing non-NFS leases differ in how they handle the case where a lease-holder is also the caller performing an operation that conflicts with the lease-type. (Here I will informally refer to leases used to implement delegations as "NFS leases" and all other leases as "non-NFS leases".)

With NFS leases, in order to support directory delegation semantics, the caller a) will not have its directory lease broken, and b) it will block until all other leases on that directory have been returned to the server.

With non-NFS leases, the caller a) will have its lease revoked, and b) will not block while any other lease-holders return their leases.
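Side by side, the two behaviours look like the sketch below. The type and function names are hypothetical labels for this comparison only; nothing here corresponds to an actual kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Informal labels, matching the "NFS lease" / "non-NFS lease" usage above. */
enum lease_kind { LEASE_NFS, LEASE_NON_NFS };

/* What happens when a lease-holder itself performs a conflicting operation. */
struct self_conflict_outcome {
    bool own_lease_broken;   /* does the caller lose its own lease? */
    bool waits_for_others;   /* does the caller block until other holders return theirs? */
};

struct self_conflict_outcome self_conflict(enum lease_kind kind)
{
    assert(kind == LEASE_NFS || kind == LEASE_NON_NFS);

    struct self_conflict_outcome o;
    if (kind == LEASE_NFS) {
        o.own_lease_broken = false;  /* a) the caller keeps its directory lease */
        o.waits_for_others = true;   /* b) the caller blocks until the others are returned */
    } else {
        o.own_lease_broken = true;   /* a) the caller's lease is revoked */
        o.waits_for_others = false;  /* b) the caller does not block */
    }
    return o;
}
```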

... a serious problem:

  The reason for the caller not blocking if it has a non-NFS lease on the file system object in question stems from the asynchronous mechanism of    
  using signals to communicate to lease-holders that their lease is being broken.  A signal would be queued for the caller and would be delivered 
  when the system call returns.  However, the caller would have been blocked by then, which means it would not be able to receive and respond to 
  the signal.  Eventually, the kernel would forcibly break the lease, but that delay can be quite long (~45 seconds by default, I believe).
  
  This will be an issue when we work on causing all VFS-level conflicting operations to break NFS leases (instead of just NFS protocol
  operations, as things stand right now).  The specific difficulty is that, because the caller's operation is allowed to succeed immediately,
  there is no guarantee that the other lease-holders will have returned their leases before the file system object is modified (yes, this race
  is basically a bug in the current design).
  
  
  That race condition would essentially mean that your lease would really just be a notification mechanism, not a locking one!
  
  
  * One possible solution would be to change the semantics of non-NFS leases so that a lease-holder's conflicting operation would not break his own   
    lease; then the lease-holder could safely be made to block until all other leases have been returned.  The difficulty of the long timeout would 
    be avoided because the lease-holder would not have to receive/respond-to a lease-breaking signal.  I cannot imagine that this would ever happen, 
    since applications could have already been written that rely on the current semantics, intentionally or not.
  
  * Since the only way to avoid the race is to ensure that the lease-holder's conflicting operation doesn't succeed until all leases, NFS or not, 
    have been returned (or forcibly timed-out), perhaps the lease-holder could be set to block until all others have been returned, and only  
    thereafter break his lease.  More specifically, after all the others are finished, queue the signal for the caller, return from __break_lease(), 
    return from the lease-breaking syscall; when the process runs again, it should (?) receive and have to handle the signal immediately, before 
    any userland processing subsequent to the lease-breaking syscall proceeds.  Gotta test that.
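The ordering in the second idea can be sketched as a plan of steps: wait for every other holder first, and only then queue the signal for the caller and return from the syscall. This is a toy model under stated assumptions, not kernel code; the step names, the integer holder IDs, and plan_self_last_break are all invented for illustration, and the model does not answer the open question of when the signal is actually delivered.

```c
#include <assert.h>
#include <stddef.h>

enum step_kind {
    STEP_WAIT_FOR_HOLDER,  /* block until this holder returns its lease (or times out) */
    STEP_QUEUE_SIGNAL,     /* queue the lease-break signal for the caller */
    STEP_RETURN_TO_USER    /* return from the lease-breaking syscall */
};

struct step {
    enum step_kind kind;
    int holder;            /* hypothetical holder ID the step applies to */
};

/*
 * Builds the step sequence for a conflicting operation issued by `caller`.
 * `plan` must have room for at least n + 2 steps.  Returns the step count.
 */
size_t plan_self_last_break(const int *holders, size_t n, int caller,
                            struct step *plan)
{
    assert(plan != NULL);
    assert(holders != NULL || n == 0);

    size_t k = 0;
    int caller_holds = 0;

    /* First, every lease other than the caller's own must come back. */
    for (size_t i = 0; i < n; i++) {
        if (holders[i] == caller) {
            caller_holds = 1;
            continue;
        }
        plan[k].kind = STEP_WAIT_FOR_HOLDER;
        plan[k].holder = holders[i];
        k++;
    }

    /* Only thereafter is the caller's own lease broken. */
    if (caller_holds) {
        plan[k].kind = STEP_QUEUE_SIGNAL;
        plan[k].holder = caller;
        k++;
    }

    plan[k].kind = STEP_RETURN_TO_USER;
    plan[k].holder = caller;
    k++;
    return k;
}
```

The point of the ordering is that the caller is never blocked with an unhandled signal pending, so the long forced-break timeout should not come into play for the caller's own lease.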