NFS re-export

The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.

Some known issues:

fsid= required

The "fsid=" option is required on any export of an NFS filesystem.

reboot recovery

NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)

But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.

Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that lets a client tell the server it has lost its protocol state and wants to reclaim? (The re-export server would then issue reclaims to the original server as reclaims arrived from its own clients.)

Maybe the re-export server could take the stateids returned from the original server and hand them back to its own clients, avoiding the need for it to keep very much state itself.

filehandle limits

NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server, and there's no guarantee the result will still fit within the protocol's limit. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting an already re-exported filesystem. But there are no guarantees.

If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.

The wrapping is needed so that the server can identify which export a filehandle refers to, even after it has long forgotten about that particular filehandle, so that it can direct the operation to the correct underlying filesystem or server and enforce export permissions.
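
As a purely conceptual sketch (this is not the actual knfsd filehandle layout, and every name in it is made up for illustration), the size problem looks roughly like this:

 /* Conceptual sketch only -- not the real Linux nfsd filehandle format. */
 #include <stdint.h>
 
 #define NFS3_FHSIZE 64                  /* protocol maximum for an NFSv3 filehandle */
 
 struct reexport_fh {                    /* what the re-export server hands to its clients */
         uint32_t export_id;             /* hypothetical field identifying the export */
         uint8_t  orig_fh_len;           /* length of the wrapped original filehandle */
         uint8_t  orig_fh[];             /* the opaque filehandle from the original server */
 };
 
 /* The wrapped handle is sizeof(struct reexport_fh) + orig_fh_len bytes: if the
  * original server already hands out filehandles near the 64-byte NFSv3 limit,
  * the wrapped handle exceeds it, and NFSv2's fixed 32-byte handles leave
  * almost no room for wrapping at all. */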

If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the filehandles it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container only handles one.

Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each other's filehandles. Possibly that could be standardized if it proved useful.

errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients

When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. The workaround is to configure the re-export server to be reluctant to evict inodes from its cache.
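
One way to do that (this is an assumption about what "reluctant to evict inodes" means in practice, and the right value depends on your workload) is to lower the kernel's cache-reclaim pressure on the re-export server:

 # make the kernel much less willing to reclaim dentry/inode caches
 sysctl -w vm.vfs_cache_pressure=1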

This may be due to the lack of open-by-filehandle in NFSv4.0, and if so, I suspect it's unfixable. Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases discussed there are fixed by patches probably headed for 5.11.

unnecessary GETATTRs

We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.

Incorrect pre/post-operation attributes

Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. This could cause clients to incorrectly cache stale file data or directory contents. This is fixable but we don't currently have patches.

locking crash

Connectathon locking tests are currently triggering some kind of memory corruption; still investigating.

re-export not reading more than 128K at a time

For some reason, when a client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. The workaround is to manually increase the readahead on the re-export server's mount of the original server.
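
For example (the mount point and the 1MB value are just illustrations), readahead for the re-export server's mount of the original server can be raised through that mount's BDI:

 # on the re-export server, for its mount of the original server
 echo 1024 > /sys/class/bdi/$(findmnt -n -o MAJ:MIN /srv/reexport)/read_ahead_kb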

Known problems that we've fixed
