NFS re-export

The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.

You'll need at least nfs-utils 1.3.5 (specifically, commit 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely fail with "exportfs: <path> does not support NFS export".
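
A quick way to check which nfs-utils you have (the package names below are the usual ones on common distributions, listed here as assumptions rather than anything this page specifies):

    # Debian/Ubuntu
    dpkg -s nfs-kernel-server | grep Version
    # Fedora/RHEL
    rpm -q nfs-utils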

The "fsid=" option is required on any export of an NFS filesystem.

For now you should probably also mount read-only and with -o nolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.
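
A minimal sketch of a re-export setup following the advice above. The server name origin, the mount point /srv/reexport, the client network, and the fsid number are all made up for illustration:

    # On the re-export server: mount the original export read-only and,
    # since this example uses NFSv3, without locking.
    mount -t nfs -o ro,vers=3,nolock origin:/export /srv/reexport

    # /etc/exports on the re-export server; fsid= is required for any NFS re-export.
    /srv/reexport  192.168.1.0/24(ro,no_subtree_check,fsid=1000)

    # Apply the export table.
    exportfs -ra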

known issues

fsid= required, crossmnt broken

The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.

Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.

That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.
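
To make that concrete, here is a hedged sketch. Suppose the original server's /export has two separate filesystems mounted at /export/a and /export/b, and the re-export server mounts the whole tree at /srv/reexport (paths and fsid numbers are invented for illustration). Exporting only the parent with crossmnt fails as described above; explicitly exporting each child with its own unique fsid= is the usual way around it, though this page doesn't spell that out:

    # Not sufficient: fsid= applies only to the parent, and crossmnt
    # cannot invent unique fsids for the child NFS filesystems.
    /srv/reexport    *(ro,crossmnt,fsid=1000)

    # Workaround: list every re-exported NFS filesystem explicitly,
    # each with its own fsid=.
    /srv/reexport    *(ro,crossmnt,fsid=1000)
    /srv/reexport/a  *(ro,fsid=1001)
    /srv/reexport/b  *(ro,fsid=1002)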

In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.

One idea might be an NFS proxy-only mode where a server is dedicated to re-exporting the filesystems of exactly *one* other server, as-is.

reboot recovery

NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)

But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.

Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to declare that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: reboot recovery for re-export servers.

Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.

filehandle limits

NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. More details at https://www.kernel.org/doc/html/latest/filesystems/nfs/reexport.html#filehandle-limits.
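
As a rough worked example (the per-layer overhead below is an assumed figure for illustration only; the real number depends on the kernel and export options):

    original NFSv3 filehandle from the first server:  28 bytes   (assumed)
    + wrapping added by one layer of re-export:      +24 bytes   (assumed)
                                                     = 52 bytes  -> still fits NFSv3's 64-byte limit
    + a second layer of re-export:                   +24 bytes
                                                     = 76 bytes  -> too big for NFSv3; NFSv4 (128 bytes) only
    (NFSv2's 32-byte limit is already exceeded after the first layer)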

The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions. Note that filehandle lifetimes are limited only by the lifetime of the object they point to; they are still expected to work after the inode has dropped out of the server's cache, or after the server has rebooted.

One solution might be an NFS proxy-only mode, where a server would be dedicated to re-exporting a single original NFS server, but it's not clear how to implement that.

filehandles not portable across servers

Given multiple servers re-exporting a single filesystem, it might be expected that a client could easily migrate between them. That's not necessarily true, since filehandles aren't necessarily portable across servers.

If the servers are all Linux servers, though, it should be sufficient to make sure re-exports of the same filesystem all get the same fsid= option. (Note that filehandles still won't be portable between the re-exports and the original server.)
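
Concretely (a sketch; the fsid number is arbitrary as long as it's unique within each server and identical across them): if two Linux re-export servers both re-export origin:/export, their /etc/exports entries should agree on the fsid:

    # on re-export server A
    /srv/reexport  *(ro,fsid=1000)

    # on re-export server B -- same fsid= for the same underlying filesystem
    /srv/reexport  *(ro,fsid=1000)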

Some infrastructure to make this coordination easier might be useful.

errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients

When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by a (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.

It's best not to do this: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.

If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.
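
The page doesn't name a specific knob; one commonly used setting that biases the kernel toward keeping dentries and inodes cached is vm.vfs_cache_pressure (the value below is just a starting point, not a recommendation from this page):

    # values below the default of 100 make the kernel more reluctant
    # to reclaim dentry and inode caches
    sysctl -w vm.vfs_cache_pressure=10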

Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.

Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.

unnecessary GETATTRs

We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.

Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.

re-export not reading more than 128K at a time

For some reason, when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. A workaround is to manually increase client readahead (on the re-export server's mount of the original server); see https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/
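
The linked thread adjusts readahead through the BDI of the re-export server's NFS mount of the original server; a hedged sketch of that kind of tweak (mount point and value are illustrative):

    # find the device number (major:minor) of the NFS mount on the re-export server
    mountpoint -d /srv/reexport        # prints something like 0:52

    # raise readahead for that BDI (value is in KiB)
    echo 1024 > /sys/class/bdi/0:52/read_ahead_kb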

open DENY bits ignored

Since NFSv4, NFS supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode that forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.

This is probably not too hard to fix, but also probably not a high priority.

Known problems that we've fixed

  • Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)
  • Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.
  • File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)
  • Delegations and leases should work; this could probably use some testing.

Use cases

Scaling read bandwidth

You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.
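
A hedged sketch of enabling fscache on a re-export server, assuming the cachefilesd package is installed and its cache directory configured in /etc/cachefilesd.conf (these names are the usual defaults, not something this page specifies):

    # start the local cache back end
    systemctl enable --now cachefilesd

    # mount the original server with the fsc option so reads are cached on local disk
    mount -t nfs -o ro,fsc origin:/export /srv/reexport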

Hiding latency of distant servers

You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. CernVM-FS occupies a similar niche. fscache should help here too.

NFS version support

It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.
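
For example (a sketch with made-up names): if the original server only speaks NFSv3, a re-export server can mount it with vers=3 and serve the same data to NFSv4.2 clients:

    # re-export server: mount the v3-only original server
    mount -t nfs -o ro,vers=3,nolock origin:/export /srv/reexport

    # /etc/exports on the re-export server (fsid= still required)
    /srv/reexport  *(ro,fsid=2000)

    # a client can now mount over NFSv4.2 from the re-export server
    mount -t nfs -o ro,vers=4.2 reexport:/srv/reexport /mnt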
