Cluster Coherent NFSv4 and Share Reservations
From Linux NFS
NFSv4 share reservations control the concurrent sharing of files at the time they are opened. Share reservations come in two flavors, ACCESS and DENY. There are three types of ACCESS reservations: READ, WRITE, and BOTH; and four types of DENY reservations: NONE, READ, WRITE, and BOTH.
ACCESS reservations are familiar to Linux users because they correspond to the access modes of POSIX open(): NFSv4 ACCESS shares of READ, WRITE, and BOTH map directly to O_RDONLY, O_WRONLY, and O_RDWR, respectively.
NFSv4 DENY reservations act as a type of whole-file lock applied when a file is opened. NFSv4 DENY shares of READ, WRITE, and BOTH prevent other opens for read access, write access, or either kind of access, respectively, from succeeding. DENY NONE allows other opens to proceed.
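The ACCESS mapping and the DENY rule described above can be sketched in a few lines of C. This is an illustration only, not code from any NFS implementation: the bit values follow the NFSv4 protocol definitions, but the shortened macro names, struct share, access_from_open_flags(), and share_conflicts() are hypothetical names chosen for this example.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Bit values match the NFSv4 protocol; the short names are hypothetical. */
#define SHARE_ACCESS_READ  0x1u
#define SHARE_ACCESS_WRITE 0x2u
#define SHARE_ACCESS_BOTH  0x3u
#define SHARE_DENY_NONE    0x0u
#define SHARE_DENY_READ    0x1u
#define SHARE_DENY_WRITE   0x2u
#define SHARE_DENY_BOTH    0x3u

/* One open's share reservation (hypothetical type). */
struct share {
    unsigned access; /* SHARE_ACCESS_* bits */
    unsigned deny;   /* SHARE_DENY_* bits */
};

/* Map the POSIX access mode in open() flags to an ACCESS share. */
static unsigned access_from_open_flags(int flags)
{
    switch (flags & O_ACCMODE) {
    case O_RDONLY: return SHARE_ACCESS_READ;
    case O_WRONLY: return SHARE_ACCESS_WRITE;
    case O_RDWR:   return SHARE_ACCESS_BOTH;
    default:       return 0; /* unknown access mode */
    }
}

/*
 * A requested open conflicts with a held one if it asks for access the
 * holder denies, or denies access the holder already has.
 */
static bool share_conflicts(struct share held, struct share req)
{
    return (req.access & held.deny) != 0 || (req.deny & held.access) != 0;
}
```

Under this sketch, an open held with ACCESS READ and DENY WRITE blocks a later open for WRITE but not a later open for READ with DENY NONE.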
The Linux system call interface for open() follows the POSIX standard, which does not include support for share reservations. In particular, there is no direct analog in POSIX for an application to request DENY READ, WRITE, or BOTH shares. Consequently, Linux NFSv4 clients always use DENY NONE.
The mismatch between POSIX and NFSv4 shares is also reflected on an NFSv4 server. When the Linux NFSv4 server receives DENY reservations from clients that can express them (in practice, Windows clients), it does the appropriate bookkeeping and enforcement among NFSv4 opens, but the local file system is unable to enforce DENY shares against local access on the server.
When a cluster file system is exported with NFSv4, multiple NFSv4 servers export a common back-end file system, so ACCESS and DENY reservations must be distributed to take into account shares from other NFSv4 servers. In other words, the NFSv4 server has to ask the cluster file system if an incoming OPEN share can be granted.
Adding DENY share support to the Linux kernel faces several obstacles:
- DENY shares are alien to POSIX, the Linux model for file systems.
- There are currently no open Linux file systems that support DENY shares.
- Linux and all other UNIX-like NFSv4 clients currently work correctly because they never request DENY access.
- DENY shares do not meet the NFSv4 access needs of Linux clients, just Windows clients.
- Not even off-the-shelf Windows clients benefit, as NFSv4 for Windows is a third-party add-on (from Hummingbird).
- The user-level Samba server implements DENY shares with open and flock (albeit with the obvious race conditions), which obviates kernel support.
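To make the race in the open-and-flock approach concrete, here is a minimal user-space sketch. It is not Samba's code; open_then_lock() is a hypothetical helper, and an exclusive advisory flock() only approximates a DENY share among cooperating processes.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper: open the file, then try to take an exclusive,
 * non-blocking advisory lock on it. Returns the descriptor on success,
 * or -1 with errno set on failure.
 */
static int open_then_lock(const char *path, int flags)
{
    int fd = open(path, flags);
    if (fd < 0)
        return -1;

    /*
     * Race window: open() and flock() are two separate system calls, so
     * another process can open (and lock) the file between them.
     */
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        int saved = errno; /* keep flock()'s error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A second call to open_then_lock() on the same file fails while the first descriptor still holds the lock, but nothing stops a process that never calls flock() from opening the file.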
Enforcing open share DENY access across the cluster back end is complicated, since an open with DENY must atomically look up, (possibly) create, open, and lock the target file.
The Linux client atomically joins lookup, create, and open with lookup intents; the back end may have to do the same thing. The Linux client must also make the open and the lock a single atomic operation, but there is a problem: you can't lock a file that doesn't exist, so you must first create it. But as soon as the file is created, some other application might find it and lock it. Returning an error from an open that succeeded in creating a file is unexpected behavior.
Applying restrictive mode bits to the create won't always work, either, because another application might relax the mode restrictions and open the file.
This suggests that we add the share lock to the open call instead of making it a separate operation.
One approach: new flags for open()
- Use existing O_RDONLY, O_WRONLY and O_RDWR open flags to implement O_ACCESS_READ, O_ACCESS_WRITE, and O_ACCESS_BOTH, respectively.
- Add two open flags: O_DENY_READ and O_DENY_WRITE.
- Propagate O_DENY flags to the intent structure.
- Add operation adjust_share(file, flags). The file system should be allowed to refuse changes that could not result from an open or a close, that is, any change that neither only turns bits on nor only turns them off.
* Is this a new kernel operation? Who is supposed to call it? This needs a little better explanation.
Is there a race here? E.g., say we open+create with a share lock. How do we decide whether to treat it as an upgrade or an open?
* This issue needs to be explained a little better.
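The rule proposed for adjust_share (accept only changes that purely set bits or purely clear them) can be expressed as a small predicate. This is a sketch under assumptions: share_change_allowed() is a hypothetical name, and the share state is assumed to be a plain bitmask of access and deny bits.

```c
#include <stdbool.h>

/*
 * Hypothetical check for an adjust_share-style operation: a change from
 * old_bits to new_bits is acceptable only if it could result from an
 * open (bits are only turned on) or a close (bits are only turned off).
 */
static bool share_change_allowed(unsigned old_bits, unsigned new_bits)
{
    /* Every old bit survives: the change only turns bits on. */
    bool only_sets = (old_bits & new_bits) == old_bits;
    /* Every new bit was already set: the change only turns bits off. */
    bool only_clears = (old_bits & new_bits) == new_bits;

    return only_sets || only_clears;
}
```

For example, going from READ (0x1) to BOTH (0x3) or back is allowed, while going directly from READ (0x1) to WRITE (0x2) is refused because it turns one bit off and another on in the same step.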
Another approach: best attempt
- Issue a lookup. If the file exists, then upgrade.
* Someone please clarify "upgrade."
- Otherwise open with implicit create. If we get an error indicating a share conflict, retry the lookup.
* But the subsequent upgrade (?) might fail. Then what?
This is obviously not ideal.
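The shape of the retry loop can be sketched with a user-space analog. This is not the proposed kernel logic: POSIX has no share-conflict error, so the sketch substitutes O_CREAT | O_EXCL and EEXIST for "open with implicit create" and "share conflict", and it treats "upgrade" as simply opening the existing file. The function lookup_or_create() and its parameters are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical best-attempt loop: try to open an existing file first;
 * if it does not exist, try to create it exclusively; if someone else
 * created it in the meantime, go back to the first step.
 * Sets *created to 1 if this call created the file, 0 otherwise.
 * Returns a descriptor, or -1 with errno set.
 */
static int lookup_or_create(const char *path, int access_flags,
                            int max_tries, int *created)
{
    for (int i = 0; i < max_tries; i++) {
        /* "Lookup": succeeds only if the file already exists. */
        int fd = open(path, access_flags);
        if (fd >= 0) {
            *created = 0;
            return fd;
        }
        if (errno != ENOENT)
            return -1;

        /* "Open with create": fails with EEXIST if we lost the race. */
        fd = open(path, access_flags | O_CREAT | O_EXCL, 0600);
        if (fd >= 0) {
            *created = 1;
            return fd;
        }
        if (errno != EEXIST)
            return -1;
        /* Another opener created the file first; retry the lookup. */
    }
    errno = EAGAIN; /* gave up after max_tries rounds */
    return -1;
}
```

The loop has no guaranteed bound on retries without max_tries, which is one reason the approach is less attractive than a single atomic open.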
- Would it help to get a reference on the dentry before trying the open?
- Is there currently a lookup/open race if the back end is a distributed file system? One way of looking at it is "that's up to them." The client just needs to look at how we implement open and make sure it handles the lookup intents correctly.
* A brief glance suggests that we probably don't.
An alternative might be to expose something along the lines of the open owner to the VFS and let it decide (by comparing open owners) whether a given open is an upgrade or a new open.
Implementation awaits resolution of these issues.