Cluster Coherent NFSv4 and Delegations
From Linux NFS
Revision as of 18:22, 5 April 2006
Background
NFSv4 adds a new protocol feature, delegations. From RFC 3530:
The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client is guaranteed certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file.
Delegations can be recalled by the server. If another client requests access to the file in such a way that the access conflicts with the granted delegation, the server is able to notify the initial client and recall the delegation. This requires that a callback path exist between the server and client. If this callback path does not exist, then delegations can not be granted. The essence of a delegation is that it allows the client to locally service operations such as OPEN, CLOSE, LOCK, LOCKU, READ, WRITE without immediate interaction with the server.
Linux NFSv4 Delegation Support for Cluster Filesystems
The Linux NFSv4 server delegation implementation uses the lease extensions to the VFS lock subsystem, so each delegation is represented by a lease. Using the lease subsystem coordinates local access with NFSv4 delegations. The VFS lease subsystem has an fcntl() interface to set and get a lease (F_SETLEASE and F_GETLEASE), and a break_lease function.
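As a rough illustration of the fcntl() side of the lease interface, the sketch below takes a read lease on a file from user space, queries it, and releases it. The helper names lease_name and demo_read_lease are made up for this example. It assumes a Linux system and a file owned by the caller, since an unprivileged process can only lease files it owns and a read lease needs a read-only descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes F_SETLEASE / F_GETLEASE in <fcntl.h> */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback in case _GNU_SOURCE was defined too late for <fcntl.h> (Linux values). */
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#define F_GETLEASE 1025
#endif

/* Map a lease type, as returned by F_GETLEASE, to a printable name. */
const char *lease_name(int type)
{
    switch (type) {
    case F_RDLCK: return "read";
    case F_WRLCK: return "write";
    case F_UNLCK: return "none";
    default:      return "unknown";
    }
}

/*
 * Take a read lease on `path`, print what F_GETLEASE reports, then drop it.
 * Returns 0 on success, -1 if the file could not be opened or leased.
 */
int demo_read_lease(const char *path)
{
    int fd = open(path, O_RDONLY); /* read leases require a read-only fd */
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        perror("F_SETLEASE");
        close(fd);
        return -1;
    }
    printf("lease held: %s\n", lease_name(fcntl(fd, F_GETLEASE)));
    fcntl(fd, F_SETLEASE, F_UNLCK); /* release the lease */
    close(fd);
    return 0;
}
```

A conflicting open by another process would cause the kernel to signal the lease holder (SIGIO by default), which is the local analogue of the delegation recall described here.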
The open syscall provides the opportunity for NFSD to hand out a delegation. A conflicting open forces a delegation recall. The conflicting open could come from local access, NFS access, Samba access, etc. Once a file has been delegated to any client, every OPEN must first check whether a delegation recall related to the requested OPEN access is in progress (in which case the OPEN fails with NFSERR_DELAY) before the OPEN can be granted.
If the requested OPEN access forces a delegation recall, NFSD initiates a CB_RECALL on all conflicting delegations. This is currently implemented using the VFS layer break_lease call, which notifies lease holders when a conflicting OPEN has occurred. The VFS layer makes this determination without consulting the underlying file system.
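The OPEN-time checks described above can be sketched as a small user-space model. All type and function names here are hypothetical and are not the kernel's identifiers. The sketch also assumes that the OPEN which triggers a recall is itself answered with a delay, and it tracks only a single outstanding delegation per file.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; not kernel or RFC identifiers. */
enum deleg_type { DELEG_NONE, DELEG_READ, DELEG_WRITE };
enum open_result { OPEN_GRANTED, OPEN_DELAY };

#define ACCESS_READ  0x1u
#define ACCESS_WRITE 0x2u

struct file_state {
    enum deleg_type held;     /* delegation outstanding to another client */
    bool recall_in_progress;  /* a recall was started and not yet answered */
};

/* A read delegation conflicts with writers; a write delegation with anyone. */
bool open_conflicts(enum deleg_type held, unsigned access)
{
    switch (held) {
    case DELEG_READ:
        return (access & ACCESS_WRITE) != 0;
    case DELEG_WRITE:
        return (access & (ACCESS_READ | ACCESS_WRITE)) != 0;
    default:
        return false;
    }
}

/* Decide what to do with an OPEN from a client other than the holder. */
enum open_result check_open(struct file_state *f, unsigned access)
{
    if (!open_conflicts(f->held, access))
        return OPEN_GRANTED;
    if (!f->recall_in_progress)
        f->recall_in_progress = true; /* stands in for sending CB_RECALL */
    return OPEN_DELAY;                /* opener is told to retry later */
}

/* Called when the holder returns the delegation. */
void delegation_returned(struct file_state *f)
{
    f->held = DELEG_NONE;
    f->recall_in_progress = false;
}
```

In a cluster, the `held` and `recall_in_progress` state would have to reflect delegations granted by other nodes as well, which is exactly the information the VFS alone cannot supply.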
Finally, NFSD determines whether it can hand out a delegation on the file for the requested OPEN. The VFS lease subsystem does this by examining in-memory inode fields to determine whether there are any writers (to grant a READ delegation) or any readers or writers (to grant a WRITE delegation). On a cluster file system these fields only reflect opens on the local node, so the underlying file system will need to be consulted to make this determination.
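The grant rule can be written down in a few lines. This is only a sketch: the struct and function names are invented, the counts are assumed to exclude the OPEN being processed, and the `remote` argument stands in for whatever answer a cluster file system would give about opens on other nodes.

```c
#include <stdbool.h>

/* Hypothetical names; the kernel keeps this information in inode fields. */
enum deleg_type { DELEG_READ, DELEG_WRITE };

struct open_counts {
    unsigned readers; /* other opens for read  */
    unsigned writers; /* other opens for write */
};

/*
 * READ delegation:  allowed when nobody else has the file open for write.
 * WRITE delegation: allowed when nobody else has the file open at all.
 * `local` is what the VFS on this node sees; `remote` must come from the
 * cluster file system.
 */
bool can_grant(enum deleg_type want,
               struct open_counts local, struct open_counts remote)
{
    unsigned readers = local.readers + remote.readers;
    unsigned writers = local.writers + remote.writers;

    if (want == DELEG_READ)
        return writers == 0;
    return readers == 0 && writers == 0;
}
```

With `remote` always zero this reduces to the purely local check, which is why a local-only check is not sufficient for cluster coherence.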
If NFSD decides to grant a delegation, it needs to tell the underlying file system so that the file system can notify NFSD to recall the delegation at a later time.
Tasks
* Ask the file system to check for a delegation recall in progress prior to granting an OPEN, granting a delegation, or initiating a recall.
* Set up a callback from the file system to notify an NFSv4 server to perform a CB_RECALL upon a conflicting OPEN from another node.
* Ask the file system if a delegation can be granted.
* Tell the file system that the VFS on a node has detected a lease conflict (rename, unlink, etc.) and that any delegations should be recalled.
Proposed Implementation
Extend the setlease/getlease/break_lease interfaces to service cluster file systems. The extensions will resemble the POSIX locking extensions (callbacks, etc.).
What we probably need is new inode operations:
* break_lease(inode, mode)
* setlease(filp, mode)
* getlease(filp, &mode)
Where mode can be one of read, write, or unlock. Should we also allow the mode to be OR'ed with a nonblocking flag?
Note that the current setlease and getlease functions take a struct file_lock instead of (or in addition to) the mode. Do we need that?
Also, setlease and getlease could be file operations instead of inode operations. This is probably a fairly arbitrary choice.
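One way to picture the proposed operations is as a table of function pointers that a file system fills in. The following is a user-space mock, not kernel code: the struct inode and struct file here are toy stand-ins, the LEASE_* constants and the lease_operations name are invented for this sketch, and the toy implementation tracks only a single lease per inode on a single node. It takes the mode-based variant of the interface, with the nonblocking flag OR'ed into the mode.

```c
#include <errno.h>

/* Hypothetical mode encoding: a lease type plus an optional flag. */
#define LEASE_READ      0x0
#define LEASE_WRITE     0x1
#define LEASE_UNLOCK    0x2
#define LEASE_TYPE_MASK 0x0f
#define LEASE_NONBLOCK  0x10

/* Toy stand-ins for the kernel structures of the same name. */
struct inode { int lease; };          /* current lease type on the file */
struct file  { struct inode *inode; };

/* The three proposed operations, gathered into one table. */
struct lease_operations {
    int (*break_lease)(struct inode *inode, int mode);
    int (*setlease)(struct file *filp, int mode);
    int (*getlease)(struct file *filp, int *mode);
};

static int toy_setlease(struct file *filp, int mode)
{
    int type = mode & LEASE_TYPE_MASK;
    int held = filp->inode->lease;

    /* A write lease cannot coexist with any other lease. */
    if (type != LEASE_UNLOCK && held != LEASE_UNLOCK &&
        (held == LEASE_WRITE || type == LEASE_WRITE))
        return -EAGAIN;
    filp->inode->lease = type;
    return 0;
}

static int toy_getlease(struct file *filp, int *mode)
{
    *mode = filp->inode->lease;
    return 0;
}

/* `mode` describes the conflicting access (read or write). */
static int toy_break_lease(struct inode *inode, int mode)
{
    int type = mode & LEASE_TYPE_MASK;

    if (inode->lease == LEASE_UNLOCK)
        return 0;                     /* nothing to break */
    if (inode->lease == LEASE_READ && type == LEASE_READ)
        return 0;                     /* readers do not conflict */
    if (mode & LEASE_NONBLOCK)
        return -EWOULDBLOCK;          /* caller refuses to wait */
    inode->lease = LEASE_UNLOCK;      /* toy: break completes at once */
    return 0;
}

const struct lease_operations toy_lease_ops = {
    .break_lease = toy_break_lease,
    .setlease    = toy_setlease,
    .getlease    = toy_getlease,
};
```

A cluster file system would supply its own table whose functions consult the other nodes. Whether the entries belong with the inode operations or the file operations does not change the shape of the calls.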
break_lease, setlease, getlease, etc. might block even in the absence of contention. To handle that possibility, we might want to allow an -EINPROGRESS return to be followed by a callback, e.g. break_lease_result(inode, stat), where stat might be -EAGAIN (we're waiting for the lease to be broken) or OK (it was immediately broken, or there never was one).
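The -EINPROGRESS idea can be mocked up in user space as follows. Everything except the names break_lease and break_lease_result is an assumption of this sketch: the toy struct inode, the cluster_reply function that simulates the answer arriving from other nodes, and the choice to always return -EINPROGRESS rather than answer immediately when possible.

```c
#include <errno.h>

/* Toy inode for this sketch; not the kernel structure. */
struct inode {
    int remote_lease;  /* 1 if another node holds a lease (learned later) */
    int query_pending; /* a break_lease request is awaiting the cluster   */
    int last_stat;     /* result most recently delivered to the caller    */
};

/*
 * Callback delivered to the caller (e.g. NFSD) once the file system knows:
 *   -EAGAIN  a lease exists and is being broken, keep waiting
 *   0        the lease was broken immediately, or there never was one
 */
void break_lease_result(struct inode *inode, int stat)
{
    inode->last_stat = stat;
}

/* Never blocks: starts the cluster query and returns at once. */
int break_lease(struct inode *inode, int mode)
{
    (void)mode; /* the conflicting access type is ignored in this toy */
    inode->query_pending = 1;
    return -EINPROGRESS;
}

/* Simulates the cluster's answer arriving some time later. */
void cluster_reply(struct inode *inode)
{
    if (!inode->query_pending)
        return;
    inode->query_pending = 0;
    break_lease_result(inode, inode->remote_lease ? -EAGAIN : 0);
}
```

The caller treats -EINPROGRESS as "no answer yet" and defers its decision until break_lease_result runs, so the thread that issued the request is never held up by cluster communication.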
Status