Cluster Coherent NFS and Byte Range Locking
Clustered filesystems with exports to NFS clients face several issues with providing byte-range locking over NFS.
NFS advisory locking is performed by LOCKD or the NFSv4 server on the exporting node. In the current implementation, LOCKD calls the VFS POSIX locking layer even if the underlying filesystem provides its own ->lock() locking routine. This is because LOCKD is single-threaded and therefore cannot block while waiting on communication with another cluster node.
The VFS posix locking layer provides an asynchronous lock manager callback, fl_notify(), that allows LOCKD to queue blocking lock requests and continue to service other client requests.
The NFSv4 server simply treats all blocking locks as non-blocking, choosing not to implement another lock request queue.
NFSv4 Blocking Locks
The NFSv4 server needs to implement blocking locks. Unlike NLM clients, NFSv4 clients do not register a blocking lock callback with the server. Instead, they poll the server to see if the blocked lock is available. This presents a fairness problem, and the NFSv4 spec suggests that the server should maintain an ordered list of pending blocking locks. To really solve the fairness problem, all consumers of a lock (local lock, LOCKD, and NFSv4 server lock requests) should share such an ordered list.
* Implement a shared blocking lock fair queue
* Implement the NFSv4 server fl_notify and use the fair queue
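To make the idea of a shared ordered list concrete, here is a minimal userspace sketch. It is a toy model, not kernel code: `struct fair_lock`, `fair_lock_try`, and `fair_lock_unlock` are hypothetical names, owners are plain non-negative integers, and there is a single lock rather than byte ranges. A request is granted only when the lock is free and no earlier requester is still queued, so a client that polls early cannot overtake one that asked first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 16

/* Hypothetical model: one lock plus one ordered list of waiting owners. */
struct fair_lock {
    int holder;                /* owner id holding the lock, or -1 if free */
    int waiters[MAX_WAITERS];  /* pending owner ids (>= 0), oldest first */
    size_t nwaiters;
};

static void fair_lock_init(struct fair_lock *l)
{
    l->holder = -1;
    l->nwaiters = 0;
}

/* Return the queue position of owner, or l->nwaiters if it is not queued. */
static size_t fair_lock_find(const struct fair_lock *l, int owner)
{
    size_t i;

    for (i = 0; i < l->nwaiters; i++)
        if (l->waiters[i] == owner)
            break;
    return i;
}

/*
 * One lock attempt: an NFSv4 poll, a LOCKD retry, or a local request.
 * Granted only if the lock is free and nobody is queued ahead of owner.
 */
static bool fair_lock_try(struct fair_lock *l, int owner)
{
    size_t pos = fair_lock_find(l, owner);
    size_t i;

    if (l->holder == owner)
        return true;                 /* already held by this owner */

    if (l->holder == -1 && (l->nwaiters == 0 || pos == 0)) {
        if (l->nwaiters > 0) {       /* owner is the queue head: dequeue it */
            for (i = 1; i < l->nwaiters; i++)
                l->waiters[i - 1] = l->waiters[i];
            l->nwaiters--;
        }
        l->holder = owner;
        return true;
    }

    /* Denied: record the request order (not recorded if the queue is full). */
    if (pos == l->nwaiters && l->nwaiters < MAX_WAITERS)
        l->waiters[l->nwaiters++] = owner;
    return false;
}

static void fair_lock_unlock(struct fair_lock *l, int owner)
{
    assert(l->holder == owner);      /* only the holder may release */
    l->holder = -1;
}
```

In this sketch, if owner 1 holds the lock and owners 2 and 3 are denied in that order, then after the unlock a poll from owner 3 is still refused until owner 2 has taken its turn. The model deliberately ignores a waiter that never polls again; a real queue would need some way to expire such entries.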
We investigated changing the semantics of the existing file_lock->fl_block queue to make it more 'fair'. This queue holds all blocking locks in request order; new blockers are added to the tail.
The existing fl_block semantics:
When the lock is released, traverse the fl_block list and wake each blocker, resulting in a 'scrum' to get the lock. The winner then places all losers on its fl_block list. So, this queue is 'fair' in the sense that the blockers wake in order. It is not fair in the sense that LOCKD has bookkeeping tasks to perform before actually grabbing the lock, which ensures that a local blocker will always win the scrum.
The new 'fair' fl_block semantics:
When the lock is released, traverse the fl_block list and wake blockers in order until one claims the lock. We added a lock to protect the fl_block list from change during this processing. This proved to be problematic for two reasons:
* Claiming the lock means calling posix_lock_file, which calls kmalloc, which can sleep, a no-no when under a spinlock; so we'd have to use a semaphore or mutex; but
* For the purposes of mandatory lock checking, this new lock must be obtained in the read/write path to check for lock compliance, and adding a semaphore or mutex to the performance-critical read/write path is thought to be inefficient.
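The difference between the two wake-up policies can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: `struct blocker` and both functions are hypothetical, and the "scrum" is approximated by letting the woken blocker with the least bookkeeping work win, which mirrors the observation above that LOCKD loses to local blockers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical blocker record; the fields are illustrative, not kernel fields. */
struct blocker {
    int id;            /* non-negative identifier */
    int bookkeeping;   /* work needed after wake-up before it can claim */
    bool still_wants;  /* false if the request was cancelled while queued */
};

/*
 * Existing semantics (approximated): wake every blocker at once; the one
 * that is ready soonest takes the lock. Ties go to the earlier entry.
 * Returns the winner's id, or -1 if nobody wants the lock.
 */
static int scrum_winner(const struct blocker *q, size_t n)
{
    int winner = -1;
    int best = 0;
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!q[i].still_wants)
            continue;
        if (winner == -1 || q[i].bookkeeping < best) {
            winner = q[i].id;
            best = q[i].bookkeeping;
        }
    }
    return winner;
}

/*
 * 'Fair' semantics: wake blockers in queue order and stop at the first
 * one that actually claims the lock.
 */
static int fair_winner(const struct blocker *q, size_t n)
{
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (q[i].still_wants)
            return q[i].id;
    return -1;
}
```

With a queue holding a LOCKD request (bookkeeping cost 5) followed by a local request (cost 0), the scrum picks the local request even though it arrived second, while the in-order policy picks the LOCKD request.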
Currently, we are investigating removing the semaphore and depending instead on the combination of the BKL (big kernel lock), held by the unlock that released the lock, and a flag indicating that our processing is in progress.
We are also considering adding NFSv4 blocking lock processing to the LOCKD queue, providing fair locking over NFS.
One problem to solve in the NFSv4 case is that since clients poll for locks, and since NFSv4 has no equivalent to the (race-prone) cancel or grant callbacks, it is not possible for NFSv4 to acquire a lock on a client's behalf; it must wait for the client to poll again before granting the lock. If it grants the lock early, and the client chooses not to poll again, then there is no way for the server to cancel the lock that it has already granted. (If the lock has downgraded or coalesced existing locks, then it may not be possible to undo its effect with a simple unlock.)
Correct support for blocking NFSv4 locks will therefore require the ability to apply a new kind of byte-range lock to the backend filesystem that allows us to temporarily block other lock requests, but that does not downgrade or coalesce with existing posix locks, to allow us to later remove the lock safely if the client does not return.
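The following sketch shows why such a lock needs to stay separate from ordinary locks. It is a userspace toy under stated assumptions: all names (`struct lock_table`, `toy_add_lock`, `toy_remove_provisional`) are hypothetical, every lock is exclusive, and only coalescing is modeled, not downgrading. Ordinary locks from one owner merge when their ranges overlap or touch; a provisional lock blocks other owners but is never merged, so it can be removed without disturbing anything else.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCKS 32

struct toy_lock {
    bool used;
    bool provisional;  /* hypothetical "reserve but do not coalesce" kind */
    int owner;
    long start, end;   /* inclusive byte range */
};

struct lock_table {
    struct toy_lock l[MAX_LOCKS];
};

static bool ranges_overlap(long s1, long e1, long s2, long e2)
{
    return s1 <= e2 && s2 <= e1;
}

/* True if the ranges overlap or are directly adjacent. */
static bool ranges_touch(long s1, long e1, long s2, long e2)
{
    return s1 <= e2 + 1 && s2 <= e1 + 1;
}

/* Returns the slot used, or -1 on conflict with another owner or a full table. */
static int toy_add_lock(struct lock_table *t, int owner,
                        long start, long end, bool provisional)
{
    bool merged;
    int i;

    for (i = 0; i < MAX_LOCKS; i++)
        if (t->l[i].used && t->l[i].owner != owner &&
            ranges_overlap(start, end, t->l[i].start, t->l[i].end))
            return -1;

    if (!provisional) {
        /* Ordinary lock: absorb the owner's touching ordinary locks. */
        do {
            merged = false;
            for (i = 0; i < MAX_LOCKS; i++) {
                struct toy_lock *o = &t->l[i];

                if (o->used && !o->provisional && o->owner == owner &&
                    ranges_touch(start, end, o->start, o->end)) {
                    if (o->start < start)
                        start = o->start;
                    if (o->end > end)
                        end = o->end;
                    o->used = false;
                    merged = true;
                }
            }
        } while (merged);
    }

    for (i = 0; i < MAX_LOCKS; i++) {
        if (!t->l[i].used) {
            t->l[i] = (struct toy_lock){ true, provisional, owner, start, end };
            return i;
        }
    }
    return -1;
}

/* Undo a provisional lock exactly; existing locks are left untouched. */
static void toy_remove_provisional(struct lock_table *t, int slot)
{
    assert(slot >= 0 && slot < MAX_LOCKS);
    assert(t->l[slot].used && t->l[slot].provisional);
    t->l[slot].used = false;
}

static int toy_count(const struct lock_table *t)
{
    int i, n = 0;

    for (i = 0; i < MAX_LOCKS; i++)
        if (t->l[i].used)
            n++;
    return n;
}
```

If an owner holding bytes 0-9 takes an ordinary lock on 10-19, the table ends up with a single 0-19 entry and the original boundary is lost. Taking 10-19 provisionally instead keeps two entries, so dropping the reservation when the client fails to return restores the table to exactly its earlier state.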
As we turned our attention to the VFS posix locking code, we found and fixed many bugs and races. We also reviewed and applied bug fixes from the community.
Cluster Filesystem ->lock() Interface
There is currently a filesystem ->lock() method, but it is defined only by a few filesystems that are not exported via NFS. So none of the lock routines that are used by LOCKD or the NFSv4 server bother to call those methods. Cluster filesystems would like NFS to call their own lock methods, which keep a consistent view of a lock across cluster filesystem nodes. But the current ->lock() interface is not suitable for cluster filesystems in a couple of ways.
* We'd rather not block the NFSv4 server or LOCKD threads for longer than necessary, so it'd be nice to have a way to make lock requests asynchronously. This is particularly helpful for non-blocking locks, which do not have the option of returning a temporary "blocked" response and then responding with a granted callback later.
* Given that in the blocking case we want the filesystem to be able to return from ->lock() without having necessarily acquired the lock, we need to be able to handle the case where a process on the client is interrupted and the client cancels the lock.
* Design and implement an asynchronous ->lock() interface
* Have LOCKD and the NFSv4 server test for and call the new ->lock()
Since acquiring a filesystem lock may require communication with remote hosts, and to avoid blocking lock manager threads during such communication, we allow the results to be returned asynchronously.
When a filesystem ->lock() call needs to block due to a delay in satisfying a non-blocking lock request, the file system will return -EINPROGRESS, and then later return the results with a callback registered via the lock_manager_operations struct.
An FL_CANCEL flag is added to the struct file_lock to indicate to the file system that the caller wants to cancel the provided lock.
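The deferred-result and cancel behavior described above can be sketched in userspace as follows. This is a toy model, not the submitted patches: `struct toy_request`, `toy_fs_lock`, `toy_fs_remote_reply`, and `TOY_FL_CANCEL` are hypothetical stand-ins, the callback field `on_result` merely plays the role of the callback registered through lock_manager_operations, and the filesystem tracks at most one outstanding request. The error codes other than -EINPROGRESS are choices made for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FL_CANCEL 0x1u   /* stands in for the FL_CANCEL flag */

struct toy_request;
typedef void (*toy_result_cb)(struct toy_request *req, int result);

struct toy_request {
    int owner;
    unsigned flags;
    toy_result_cb on_result;  /* hypothetical lock manager callback */
    bool completed;
    int result;
};

struct toy_cluster_fs {
    struct toy_request *pending;  /* request awaiting a remote node, if any */
};

/* Model of an asynchronous ->lock(): it never sleeps. */
static int toy_fs_lock(struct toy_cluster_fs *fs, struct toy_request *req)
{
    assert(req->on_result != NULL);

    if (req->flags & TOY_FL_CANCEL) {
        if (fs->pending != req)
            return -ENOENT;       /* nothing outstanding to cancel */
        fs->pending = NULL;
        return 0;
    }
    if (fs->pending)
        return -EAGAIN;           /* toy limitation: one request at a time */

    fs->pending = req;            /* a message to the remote node goes here */
    return -EINPROGRESS;          /* the result will arrive via on_result */
}

/* Called when the remote node answers. */
static void toy_fs_remote_reply(struct toy_cluster_fs *fs, int result)
{
    struct toy_request *req = fs->pending;

    if (req == NULL)
        return;                   /* request was cancelled: drop the reply */
    fs->pending = NULL;
    req->on_result(req, result);
}

/* Lock manager side: record the outcome so a reply can be sent later. */
static void toy_record_result(struct toy_request *req, int result)
{
    req->completed = true;
    req->result = result;
}
```

The lock manager thread returns to other work as soon as it sees -EINPROGRESS. If the client gives up in the meantime, the same request is passed again with the cancel flag set, and a late reply from the remote node is then ignored.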
New routines vfs_lock_file, vfs_test_lock, and vfs_cancel_lock replace posix_lock_file, posix_test_lock, and posix_cancel_lock in LOCKD and the NFSv4 server. They call the new filesystem ->lock() method if it exists; otherwise they call the POSIX counterparts.
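The dispatch pattern these wrappers follow is simple enough to show in a few lines. The sketch below is a userspace illustration only: `struct toy_file` and the `toy_` functions are hypothetical, and the real routines take kernel file and file_lock structures rather than the simplified arguments used here.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;
typedef int (*toy_lock_fn)(struct toy_file *file, int cmd);

struct toy_file {
    toy_lock_fn fs_lock;  /* NULL when the filesystem has no ->lock() method */
    int fs_calls;         /* counters so the chosen path can be observed */
    int posix_calls;
};

/* Stand-in for the generic POSIX locking path. */
static int toy_posix_lock_file(struct toy_file *file, int cmd)
{
    (void)cmd;
    file->posix_calls++;
    return 0;
}

/* Stand-in for a cluster filesystem's own lock method. */
static int toy_cluster_lock(struct toy_file *file, int cmd)
{
    (void)cmd;
    file->fs_calls++;
    return 0;
}

/* Prefer the filesystem's method; fall back to the generic path. */
static int toy_vfs_lock_file(struct toy_file *file, int cmd)
{
    assert(file != NULL);
    if (file->fs_lock)
        return file->fs_lock(file, cmd);
    return toy_posix_lock_file(file, cmd);
}
```

A file on a local filesystem leaves `fs_lock` as NULL and takes the generic path, while a file on a cluster filesystem sets it (here to `toy_cluster_lock`) so that the lock decision is made with cluster-wide knowledge.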
Our solution has been tested with the GPFS file system. The relevant patches have been submitted to the Linux community, and we are responding to comments.
A major issue for acceptance is the lack of a consumer in the Linux kernel - e.g. a cluster file system with byte-range locking.