This page tracks some of the obstacles that might keep an AFS user from using NFS instead.

= Missing Features =

In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system, it is more flexible about delegating administrative rights to users.

== replication and migration ==

AFS supports fast clones using copy-on-write (COW), along with complete copies on other machines.

Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)

There can also be a 'backup' volume, which is just a temporary read-only snapshot of a RW volume (taken daily, say) and has to be located on the same machine.

When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice, as the volumes don't go offline - they simply see all the changes happen at once. There's coordination to handle the case where one or more of the fileservers or the Volume Location servers are offline.

Volumes can be migrated between machines while in active use, in theory without users noticing anything.

For NFS migration we need to preserve filehandles, so we need to migrate at the block level or use fs-specific send/receive. The protocol side can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.

A few Linux options for send/receive:

* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.

* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option (a rough sketch follows this list).

* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) This looks promising; the basic kernel interfaces to find shared extents and so on are there, but a lot of userland code remains to be written.

* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?

* lvmsync: looks possibly unmaintained? We wouldn't want to depend on it, but it could possibly serve as a proof of concept or starting point.
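
To make the btrfs option concrete, here is a minimal sketch of pushing an exported volume to a second server with btrfs send/receive. It assumes the export lives on a btrfs subvolume at /export/vol1, that the replica is reachable over ssh as replica1, and that the NFS server runs under systemd as nfs-server; all names are illustrative and error handling is omitted.

<pre>
# On the read-write server: take a read-only snapshot of the exported
# subvolume and stream it to the replica.  (For later updates,
# "btrfs send -p <previous snapshot>" sends an incremental stream instead.)
btrfs subvolume snapshot -r /export/vol1 /export/vol1@2022-01-27
btrfs send /export/vol1@2022-01-27 | ssh replica1 btrfs receive /srv/replicas

# On the replica: briefly stop the server, point the export at the new
# snapshot, and restart; clients should see only a short delay, much as after
# a server reboot.
systemctl stop nfs-server
umount /export/vol1
mount --bind /srv/replicas/vol1@2022-01-27 /export/vol1
exportfs -r
systemctl start nfs-server
</pre>

One question a prototype would have to answer is whether filehandles really survive this: btrfs send/receive replays file operations rather than copying blocks, so the receive side's inode numbers and filehandle encoding need checking.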

Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.

We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha] and [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).

See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes].

A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and to scale bandwidth to read-mostly data by taking load off the original server.

One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.

A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).

Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand than with AFS-like read-only replicas.

AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?

My impression is that AFS doesn't reliably prevent this problem, so AFS administrators work around it instead, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.

Possible approaches to fix the problem if we wanted to:
* Provide some protocol which tracks which files may be open on read-only replicas, so that we know not to free those files when they're unlinked.
* When we distribute new versions, allow servers to keep older versions around and serve files from them in case filehandle lookups against the new copy fail, removing them only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.

If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. [[Server-side silly rename]] would be a more complete solution.

== volume location database and global namespace ==

On an AFS client, by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.

NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling kerberos users across domains would be interesting.

Within one domain, AFS has a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose: give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas.

With NFS, within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database. That just provides namespace-management facilities. If it were combined with a kerberized distributed volume manager built on top of LVM, that might serve as a more complete AFS VLDB replacement.
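
For illustration, a referral on a Linux knfsd server can be expressed in /etc/exports with the refer= export option; the hostnames and paths below are hypothetical:

<pre>
# /etc/exports on the namespace server.  The referral point /export/projects
# must exist locally (a bind mount of the directory onto itself is enough);
# clients that walk into it are redirected to the server holding the volume.
/export           *.example.edu(ro,fsid=0)
/export/projects  *.example.edu(ro,refer=/projects@fs1.example.edu)
</pre>

FedFS junctions would play a similar role, with the location information kept in a central database rather than in each server's export list.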

== PAGS ==

PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html

So, for example, if you have multiple kerberos identities that you use to access AFS, you can pick which one you want to use at a given time, or even use both, each in a different window. We'd like this for NFS as well.

Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring and give it new tokens. systemd kind of stuck a spike in that, though, by doing their own incompatible thing with their user manager service....

NFS would need to do what the in-kernel AFS client does and call request_key() on entry to each filesystem method that doesn't take a file* and use that to cache the credentials it is using. If there is no key, it can make one up on the spot and stick the uid/gid/groups in there. This would then need to be handed down to the sunrpc protocol to define the security creds to use.

The key used to open a file would then need to be cached in the file struct private data."

So, we have a lot of good kernel infrastructure in place which is designed to do this, but (despite an attempt or two) nobody has managed to quite make it work for NFS yet.
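
For a rough feel for the user-level side, the keyctl utility from keyutils can already start a shell with its own session keyring, much as AFS's pagsh creates a new PAG; what's missing is the NFS plumbing described above that would actually consult those keys. A hypothetical session:

<pre>
# Start a shell with a fresh anonymous session keyring - roughly what pagsh
# does for AFS.
keyctl session - bash

# Inside that shell, keys added to the session keyring (for example Kerberos
# tickets kept in a KEYRING: credential cache) are private to this session,
# and can differ from what the same user's other windows are using.
keyctl show @s
</pre>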

== ACLs ==

NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls"), but it hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.

There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.

AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.

To do:

* make NFSv4 ACL tools more usable:
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases.
** Look for other opportunities to simplify display and editing of NFSv4 ACLs.
** Add NFSv4 ACL support to graphical file managers like GNOME Files.
** Adopt a commandline interface that's more similar to the posix acl utilities.
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.
* For AFS->NFS transition:
** Write code that translates AFS ACLs to NFSv4 ACLs (a rough sketch of the per-right mapping follows this list). It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare the results, and summarize the conversion in a way that's usable even for large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].

* more ambitious options:
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.
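
As a starting point for that translation code, here is a minimal, untested Python sketch of the per-right mapping. The correspondence shown is a first guess rather than a worked-out specification: the AFS "k" (lock) right has no direct NFSv4 equivalent, and real code would also need to emit inheritance flags (since AFS ACLs are per-directory), handle AFS negative rights, and map AFS user and group names to NFSv4 principals.

<pre>
# Rough sketch: map an AFS rights string (e.g. "rlidwk") to the set of NFSv4
# ACE mask bit names a converter might set on the corresponding directory ACE.

AFS_TO_NFS4 = {
    "r": {"READ_DATA"},
    "l": {"LIST_DIRECTORY", "READ_ATTRIBUTES", "READ_ACL"},
    "i": {"ADD_FILE", "ADD_SUBDIRECTORY"},
    "d": {"DELETE_CHILD"},
    "w": {"WRITE_DATA", "APPEND_DATA", "WRITE_ATTRIBUTES"},
    "k": set(),   # no direct NFSv4 equivalent; locking follows READ/WRITE
    "a": {"WRITE_ACL"},
}

def afs_rights_to_nfs4_mask(rights):
    """Translate an AFS rights string like "rlidwka" into NFSv4 mask bit names."""
    mask = set()
    for right in rights:
        if right not in AFS_TO_NFS4:
            raise ValueError("unknown AFS right: %r" % right)
        mask |= AFS_TO_NFS4[right]
    return mask

if __name__ == "__main__":
    # Example: the common AFS "write" bundle of rights.
    print(sorted(afs_rights_to_nfs4_mask("rlidwk")))
</pre>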

== user and group management ==

AFS has a "protection server", and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command], which allows you to set up users and groups and add ACEs for machines.

Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.

Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.

Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].
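
For a rough side-by-side feel, the commands below show an ordinary AFS user creating and populating a group with pts, and an administrator doing the closest FreeIPA equivalent with the ipa client; the group and user names are illustrative.

<pre>
# AFS: an ordinary user can create a group (prefixed with their own name)
# and manage its membership.
pts creategroup alice:nfs-testers
pts adduser -user bob -group alice:nfs-testers
pts membership alice:nfs-testers

# FreeIPA: group creation normally requires an administrator, though
# management of an existing group can then be delegated.
ipa group-add nfs-testers --desc="NFS test users"
ipa group-add-member nfs-testers --users=bob
</pre>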

== quotas ==

AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories) and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.

We could get similar functionality with LVM thin provisioning or XFS project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)
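
For example, XFS project quotas can already cap a directory tree within a shared filesystem, which is roughly what an AFS per-volume quota provides. A hypothetical setup, assuming the filesystem backing /export is XFS mounted with the prjquota option (paths, names, and limits are illustrative):

<pre>
# Define project 42, named "home_alice", rooted at one user's home directory.
echo "42:/export/home/alice" >> /etc/projects
echo "home_alice:42" >> /etc/projid

# Mark the directory tree as belonging to the project, set a block limit,
# and check usage.
xfs_quota -x -c 'project -s home_alice' /export
xfs_quota -x -c 'limit -p bhard=10g home_alice' /export
xfs_quota -x -c 'report -p' /export
</pre>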

Note that NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].

= migrating existing AFS installations to NFS =

Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.

There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].
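
For reference, producing a dump to examine looks something like this (the volume name is illustrative):

<pre>
# Dump a full copy of a volume to a file ("-time 0" requests a full dump).
vos dump -id user.alice -time 0 -file user.alice.dump

# The result is in the standard AFS dump format; tools such as afsdump_scan
# from cmu-dumpscan can parse it, and a migration tool could walk it to
# recreate the tree and translate the ACLs on an NFS server.
</pre>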

See also [[AFS to NFSv4 ACL conversion]].
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha] and [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages: we have already seen reports of some success here using the NFS re-export code together with fscache, and I think there are a lot of opportunities for incremental progress by fixing problems in existing NFS code, rather than taking on larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or entirely delete files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them when filehandle lookups against the new copy fail, removing the old versions only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose, give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas.<br />
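<br />
For comparison, the AFS volume workflow looks roughly like this (servers, partitions, and names are examples):<br />
<br />
 # create a volume with a quota, then later move it to another server<br />
 vos create fs1.example.com /vicepa user.alice -maxquota 5000000<br />
 vos move user.alice fs1.example.com /vicepa fs2.example.com /vicepb<br />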
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
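<br />
A hedged sketch of referral-based namespace assembly using knfsd's "refer=" export option (hosts and paths are examples; each referral point just needs some local mountpoint, e.g. a bind mount):<br />
<br />
 # /etc/exports on the namespace server<br />
 /export           *(ro,fsid=0,crossmnt)<br />
 /export/projects  *(ro,refer=/projects@fs1.example.com)<br />
 /export/home      *(ro,refer=/home@fs2.example.com)<br />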
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
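<br />
For illustration of the analogy (not something NFS does today), compare the AFS and keyring ways of giving a group of processes its own credential scope:<br />
<br />
 # AFS: start a shell in a fresh PAG and get tokens visible only to it<br />
 pagsh<br />
 aklog && tokens<br />
 <br />
 # Linux keyrings: start a shell with its own session keyring<br />
 keyctl session - bash<br />
 keyctl show @s<br />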
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities (a side-by-side example follows this list).<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them against what was intended, and summarize the outcome in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
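<br />
The side-by-side comparison referenced above, showing how the NFSv4 tooling currently differs from the POSIX utilities (principal and path are examples):<br />
<br />
 setfacl -m u:alice:rwX /mnt/export/dir                      # POSIX ACL tools<br />
 nfs4_setfacl -a A::alice@example.com:rwx /mnt/export/dir    # nfs4-acl-tools<br />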
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
<br />
Note NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2022-01-19T22:43:08Z<p>Bfields: /* filehandle limits */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
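<br />
A minimal sketch of such a setup, assuming example hostnames and paths and the caveats above:<br />
<br />
 # on the re-export server: mount the original server read-only, without locking<br />
 mount -t nfs -o ro,nolock origin.example.com:/export /srv/origin<br />
 # /etc/exports on the re-export server; fsid= is mandatory for NFS re-exports<br />
 /srv/origin  *(ro,fsid=1234,no_subtree_check)<br />
 exportfs -ra<br />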
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
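<br />
One workaround, sketched here with example paths, is to skip crossmnt and list each nested NFS mount explicitly, giving each its own unique fsid:<br />
<br />
 /srv/origin        *(ro,fsid=1000,no_subtree_check)<br />
 /srv/origin/sub1   *(ro,fsid=1001,no_subtree_check)<br />
 /srv/origin/sub2   *(ro,fsid=1002,no_subtree_check)<br />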
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. More details on [https://www.kernel.org/doc/html/latest/filesystems/nfs/reexport.html#filehandle-limits filehandle limits].<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions. Note that filehandle lifetimes are limited only by the lifetime of the object they point to; they are still expected to work after the inode has dropped out of the server's cache, or after the server has rebooted.<br />
<br />
One solution might be an [[NFS_proxy-only_mode]], where a server would be dedicated to re-exporting a single original NFS server, but it's not clear how to implement that.<br />
<br />
== filehandles not portable across servers ==<br />
<br />
Given multiple servers re-exporting a single filesystem, it might be expected that a client could easily migrate between them. That's not necessarily true, since filehandles aren't necessarily portable across servers.<br />
<br />
If the servers are all Linux servers, though, it should be sufficient to make sure reexports of the same filesystem all get the same fsid= option. (Note filehandles still won't be portable between reexports and the original server, though.)<br />
<br />
Some infrastructure to make this coordination easier might be useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably due to the fact that an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
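<br />
One knob that can make the kernel hold on to inodes and dentries longer is vfs_cache_pressure; the value below is only an example, and this is a system-wide tuning, so treat it as a sketch rather than a recommendation:<br />
<br />
 sysctl -w vm.vfs_cache_pressure=10<br />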
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
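<br />
A hedged example of that readahead bump on the re-export server (mount path and size are examples; the BDI is looked up from the mount's device number):<br />
<br />
 bdi=$(mountpoint -d /srv/origin)             # e.g. "0:52"<br />
 echo 1024 > /sys/class/bdi/$bdi/read_ahead_kb<br />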
<br />
== open DENY bits ignored ==<br />
<br />
NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
* delegations and leases should work; this could probably use some testing.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_proxy-only_modeNFS proxy-only mode2022-01-19T22:42:53Z<p>Bfields: </p>
<hr />
<div>It could be useful to have a mode where an NFS server is dedicated to reexporting all the exports from *one* other NFS server. It would have no other exports whatsoever.<br />
<br />
This would allow the re-export server to support crossmount-like behavior, skip adding its own filesystem identifier to each filehandle (fixing problems with filehandle length limits), and avoid the need for manual assignment of filesystem identifiers with the fsid= option.<br />
<br />
Containers or virtualization could still allow a single physical machine to handle multiple exports even if desired.<br />
<br />
Possible implementation (needs more details; v4 only for now?):<br />
<br />
- Create a new /proc/fs/nfsd/proxy_only file. Before starting the server, mount "/" on the original nfs server, then write the path to the mount to /proc/fs/nfsd/proxy_only. This interface is per-container. It would also work for v3, which wouldn't currently be possible with in-kernel mounting, though this feature is not as useful in that case, as nested v3 mounts are rarer.<br />
<br />
- the NFS mount can't allow redirection to other servers, unless those servers observe all the same filehandles.<br />
<br />
- Given a filehandle, map to an export using a GETATTR to the server to get at least fsid, fileid, and file type. If it's a directory, it should be possible to connect it up to the pseudoroot using LOOKUPP. Find or create an export from the resulting struct path, cloning the parameters of the root export.<br />
<br />
- If it's *not* a directory, and not already cached, then create a temporary vfsmount and export rooted at that one file. If you've never seen this fsid before, you'll also have to create a superblock. As far as I can tell, s_root on a given nfs superblock is not important, so it's OK for it to point at this file, even as it later accumulates the rest of the filesystem? But I don't think that's true for export and vfsmount, hence the temporary objects. I'm unclear on how to handle these "disconnected" vfsmounts.<br />
<br />
- In theory, this could work with a filesystem other than NFS, if there was a filesystem or group of filesystems that coordinated their filehandles.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2022-01-19T16:43:21Z<p>Bfields: /* known issues */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. See [https://www.kernel.org/doc/html/latest/filesystems/nfs/reexport.html#filehandle-limits filehandle limits] for more details.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container only handles one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably due to the fact that an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
<br />
== open DENY bits ignored ==<br />
<br />
NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
* delegations and leases should work; this could probably use some testing.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_renameServer-side silly rename2022-01-18T20:27:57Z<p>Bfields: </p>
<hr />
<div>The NFSv3 protocol has no way to say "I'm unlinking this file, but please keep it around because I have an application that's still using it". So if the client wants to provide unix-like semantics, it has to resort to this hack (called "silly rename") on unlink of an open file. See also [http://nfs.sourceforge.net/#section_d], or the earliest description I'm aware of, in [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473 "Design and Implementation of the Sun Network Filesystem" (1985)]:<br />
<br />
<blockquote><br />
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.<br />
<p><br />
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.</p><br />
</blockquote><br />
<br />
Silly rename is indeed an imperfect solution. Another case when users sometimes notice the ".nfsXXXX" files is when they try to remove a directory that contains them. Also, it doesn't help if a file is unlinked by a different client than the one that holds it open.<br />
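<br />
The effect is easy to reproduce from a shell on an NFS client (path is an example):<br />
<br />
 exec 3</mnt/nfs/scratch/somefile     # hold the file open<br />
 rm /mnt/nfs/scratch/somefile<br />
 ls -a /mnt/nfs/scratch/              # a .nfsXXXXXXXX entry lingers<br />
 exec 3<&-                            # close it; the .nfs file is then removed<br />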
<br />
NFSv4 actually does have open and close calls, and our server won't free a file until last close--unless the server reboots, at which point the file will disappear even if an application on the client is still using it. NFS is supposed to keep working normally across server reboots, so the client still does silly rename even in the v4 case.<br />
<br />
We could move the responsibility for silly rename to the server--the server could keep a hardlink to the file after unlink, and<br />
that would preserve the file after reboot as well. (And it could use a separate directory for the purpose, and avoid the rmdir). We even added a bit to the NFSv4.1 protocol so that the server can tell the client it does this, allowing the client to skip sillyrename (see references to OPEN4_RESULT_PRESERVE_UNLINKED in [https://tools.ietf.org/html/rfc8881].)<br />
<br />
I suspect the client side implementation of this wouldn't be hard--it'd need to watch for the OPEN4_RESULT_PRESERVE_UNLINKED flag and skip silly rename in its presence. (Update: see [https://lore.kernel.org/linux-nfs/20220118190251.55526-1-olga.kornievskaia@gmail.com/T/#u].)<br />
<br />
The server side looks harder.<br />
<br />
One complication is that knfsd doesn't get exclusive use of exported filesystems: other applications may also be using them. A file opened by an NFS client could be unlinked by a local application, and we'd like the file not to disappear after reboot in that case. That said, the current behavior doesn't handle that case--it doesn't even handle the case when the unlink is done by a different client than the open--so for a first implementation I think it'd be fine to ignore that case.<br />
<br />
My rough plan for knfsd is to create a hidden directory in the root of the exported filesystem and modify nfsd4_remove() to check whether the file to be unlinked is open by an NFSv4 client, and if so to instead rename it to that hidden directory. The name shouldn't matter--just use a counter or something.<br />
<br />
I think we can use something like the logic at the start of nfsd4_process_open2 to look up a struct nfs4_file from the filehandle, and then use that to check for nfsv4 opens. We also need to prevent the race where a new open comes in after we decide to unlink the file but before we're done unlinking it--I'm not sure how. Also we need to think about the possibility of filehandle aliasing, in which case there may exist two nfs4_files for a given file.<br />
<br />
Then we need the close code to check whether we're closing one of these files and, if so, to also unlink it from the hidden directory.<br />
<br />
And, finally the laundromat code, after it ends the grace period, needs to walk through the hidden directory and remove any files that haven't been opened. Maybe code like that in nfsd4_recdir_purge_old() would work. This is usually the kind of thing we try not to do from the kernel, but I don't see a clean way to do it from userspace.<br />
<br />
That done, if we wanted to also make this work for unlinks by non-NFSv4 clients, we'd need some way to intercept all the unlinks to a given filesystem. We might need to modify the individual exported filesystems.<br />
<br />
We may want to think about how exactly to hide that directory. Maybe we could get some kind of help from the filesystem.<br />
<br />
The extra hidden link will mean that the st_nlink (for local users) and the numlinks attribute (for NFSv4 GETATTR callers) are wrong. We could fix up the latter, at least, by checking for this specific case.<br />
<br />
Approaches I (bfields) considered and rejected for now:<br />
<br />
* Create a link in the new directory on every open, and remove it on every close. But open may be a frequent operation, and we'd need to actually sync that link to disk on every operation, so it could be pretty slow. But maybe, with cooperation of the filesystem, we could *just* do the link on open, and delay waiting for the sync until there's an unlink.<br />
* Filesystems already have to deal with the case where the system crashes while there are unlinked open files. I believe they keep a list of such files so they can free them in fsck or next mount. I considered hooking into that process somehow--perhaps the server could be given an interface allowing it to discover those orphaned files. It would require nfsd to be involved in the mount process (currently we mount first, then export). And we'd have to figure out how to perform clean shutdowns without losing those files. And we'd have to worry about losing them any time an administrator fsck'd or mounted without running nfsd. So in the end maybe it wouldn't work.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_renameServer-side silly rename2022-01-18T20:27:38Z<p>Bfields: </p>
<hr />
<div>The NFSv3 protocol has no way to say "I'm unlinking this file, but please keep it around because I have an application that's still using it". So if the client wants to provide unix-like semantics, it has to resort to this hack (called "silly rename") on unlink of an open file. See also [http://nfs.sourceforge.net/#section_d], or the earliest description I'm aware of, in [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473 "Design and Implementation of the Sun Network Filesystem" (1985)]:<br />
<br />
<blockquote><br />
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.<br />
<p><br />
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.</p><br />
</blockquote><br />
<br />
Silly rename is indeed an imperfect solution. Another case when users sometimes notice the ".nfsXXXX" files is when they try to remove a directory that contains them. Also, it doesn't help if a file is unlinked by a different client than the one that holds it open.<br />
<br />
NFSv4 actually does have open and close calls, and our server won't free a file until last close--unless the server reboots, at which point the file will disappear even if an application on the client is still using it. NFS is supposed to keep working normally across server reboots, so the client still does silly rename even in the v4 case.<br />
<br />
We could move the responsibility for silly rename to the server--the server could keep a hardlink to the file after unlink, and<br />
that would preserve the file after reboot as well. (And it could use a separate directory for the purpose, and avoid the rmdir). We even added a bit to the NFSv4.1 protocol so that the server can tell the client it does this, allowing the client to skip sillyrename (see references to OPEN4_RESULT_PRESERVE_UNLINKED in [https://tools.ietf.org/html/rfc8881].)<br />
<br />
I suspect the client side implementation of this wouldn't be hard--it'd need to watch for the OPEN4_RESULT_PRESERVE_UNLINKED flag and skip silly rename in its presence. (Update: see [https://lore.kernel.org/linux-nfs/20220118190251.55526-1-olga.kornievskaia@gmail.com/T/#u].)<br />
<br />
The server side looks harder.<br />
<br />
One complication is that knfsd doesn't get exclusive use of exported filesystems: other applications may also be using them. A file opened by an NFS client could be unlinked by a local application, and we'd like the file not to disappear after reboot in that case. That said, the current behavior doesn't handle that case--it doesn't even handle the case when the unlink is done by a different client than the open--so for a first implementation I think it'd be fine to ignore that case.<br />
<br />
My rough plan for knfsd is to create a hidden directory in the root of the exported filesystem and modify nfsd4_remove() to check whether the file to be unlinked is open by an NFSv4 client, and if so to instead rename it to that hidden directory. The name shouldn't matter--just use a counter or something.<br />
<br />
I think we can use something like the logic at the start of nfsd4_process_open2 to look up a struct nfs4_file from the filehandle, and then use that to check for nfsv4 opens. We also need to prevent the race where a new open comes in after we decide to unlink the file but before we're done unlinking it--I'm not sure how. Also we need to think about the possibility of filehandle aliasing, in which case there may exist two nfs4_files for a given file.<br />
<br />
Then we need the close code to check whether we're closing one of these files and, if so, to also unlink it from the hidden directory.<br />
<br />
And, finally the laundromat code, after it ends the grace period, needs to walk through the hidden directory and remove any files that haven't been opened. Maybe code like that in nfsd4_recdir_purge_old() would work. This is usually the kind of thing we try not to do from the kernel, but I don't see a clean way to do it from userspace.<br />
<br />
That done, if we wanted to also make this work for unlinks by non-NFSv4 clients, we'd need some way to intercept all the unlinks to a given filesystem. We might need to modify the individual exported filesystems.<br />
<br />
We may want to think about how exactly to hide that directory. Maybe we could get some kind of help from the filesystem.<br />
<br />
The extra hidden link will mean that the st_nlink (for local users) and the numlinks attribute (for NFSv4 GETATTR callers) are wrong. We could fix up the latter, at least, by checking for this specific case.<br />
<br />
Approaches I (bfields) considered and rejected for now:<br />
<br />
- Create a link in the new directory on every open, and remove it on every close. But open may be a frequent operation, and we'd need to actually sync that link to disk on every operation, so it could be pretty slow. But maybe, with cooperation of the filesystem, we could *just* do the link on open, and delay waiting for the sync until there's an unlink.<br />
- Filesystems already have to deal with the case where the system crashes while there are unlinked open files. I believe they keep a list of such files so they can free them in fsck or next mount. I considered hooking into that process somehow--perhaps the server could be given an interface allowing it to discover those orphaned files. It would require nfsd to be involved in the mount process (currently we mount first, then export). And we'd have to figure out how to perform clean shutdowns without losing those files. And we'd have to worry about losing them any time an administrator fsck'd or mounted without running nfsd. So in the end maybe it wouldn't work.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_renameServer-side silly rename2022-01-18T20:26:38Z<p>Bfields: </p>
<hr />
<div>The NFSv3 protocol has no way to say "I'm unlinking this file, but please keep it around because I have an application that's still using it". So if the client wants to provide unix-like semantics, it has to resort to this hack (called "silly rename") on unlink of an open file. See also [http://nfs.sourceforge.net/#section_d], or the earliest description I'm aware of, in [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473 "Design and Implementation of the Sun Network Filesystem" (1985)]:<br />
<br />
<blockquote><br />
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.<br />
<p><br />
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.</p><br />
</blockquote><br />
<br />
Silly rename is indeed an imperfect solution. Another case when users sometimes notice the ".nfsXXXX" files is when they try to remove a directory that contains them. Also, it doesn't help if a file is unlinked by a different client than the one that holds it open.<br />
<br />
NFSv4 actually does have open and close calls, and our server won't free a file until last close--unless the server reboots, at which point the file will disappear even if an application on the client is still using it. NFS is supposed to keep working normally across server reboots, so the client still does silly rename even in the v4 case.<br />
<br />
We could move the responsibility for silly rename to the server--the server could keep a hardlink to the file after unlink, and<br />
that would preserve the file after reboot as well. (And it could use a separate directory for the purpose, and avoid the rmdir). We even added a bit to the NFSv4.1 protocol so that the server can tell the client it does this, allowing the client to skip sillyrename (see references to OPEN4_RESULT_PRESERVE_UNLINKED in [https://tools.ietf.org/html/rfc8881].)<br />
<br />
I suspect the client side implementation of this wouldn't be hard--it'd need to watch for the OPEN4_RESULT_PRESERVE_UNLINKED flag and skip silly rename in its presence. (Update: see [https://lore.kernel.org/linux-nfs/20220118190251.55526-1-olga.kornievskaia@gmail.com/T/#u].)<br />
<br />
The server side looks harder.<br />
<br />
One complication is that knfsd doesn't get exclusive use of exported filesystems: other applications may also be using them. A file opened by an NFS client could be unlinked by a local application, and we'd like the file not to disappear after reboot in that case. That said, the current behavior doesn't handle that case--it doesn't even handle the case when the unlink is done by a different client than the open--so for a first implementation I think it'd be fine to ignore that case.<br />
<br />
My rough plan for knfsd is to create a hidden directory in the root of the exported filesystem and modify nfsd4_remove() to check whether the file to be unlinked is open by an NFSv4 client, and if so to instead rename it to that hidden directory. The name shouldn't matter--just use a counter or something.<br />
<br />
I think we can use something like the logic at the start of nfsd4_process_open2 to look up a struct nfs4_file from the filehandle, and then use that to check for nfsv4 opens. We also need to prevent the race where a new open comes in after we decide to unlink the file but before we're done unlinking it--I'm not sure how. Also we need to think about the possibility of filehandle aliasing, in which case there may exist two nfs4_files for a given file.<br />
<br />
Then we need the close code to check whether we're closing one of these files and, if so, to also unlink it from the hidden directory.<br />
<br />
And, finally the laundromat code, after it ends the grace period, needs to walk through the hidden directory and remove any files that haven't been opened. Maybe code like that in nfsd4_recdir_purge_old() would work. This is usually the kind of thing we try not to do from the kernel, but I don't see a clean way to do it from userspace.<br />
<br />
That done, if we wanted to also make this work for unlinks by non-NFSv4 clients, we'd need some way to intercept all the unlinks to a given filesystem. We might need to modify the individual exported filesystems.<br />
<br />
We may want to think about how exactly to hide that directory. Maybe we could get some kind of help from the filesystem.<br />
<br />
The extra hidden link will mean that the st_nlink (for local users) and the numlinks attribute (for NFSv4 GETATTR callers) are wrong. We could fix up the latter, at least, by checking for this specific case.<br />
<br />
--<br />
<br />
Another possibility I considered was just creating a link in the new directory on every open, and removing it on every close. But open may be a frequent operation, and we'd need to actually sync that link to disk on every operation, so it could be pretty slow. But maybe, with cooperation of the filesystem, we could *just* do the link on open, and delay waiting for the sync until there's an unlink.<br />
<br />
--<br />
<br />
Another possibility: filesystems already have to deal with the case where the system crashes while there are unlinked open files. I believe they keep a list of such files so they can free them in fsck or next mount. I considered hooking into that process somehow--perhaps the server could be given an interface allowing it to discover those orphaned files. It would require nfsd to be involved in the mount process (currently we mount first, then export). And we'd have to figure out how to perform clean shutdowns without losing those files. And we'd have to worry about losing them any time an administrator fsck'd or mounted without running nfsd. So in the end maybe it wouldn't work.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_renameServer-side silly rename2022-01-18T19:13:05Z<p>Bfields: </p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount the NFS filesystem read-only and with -o nolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
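<br />
For illustration only, a conservative setup along those lines might look like this (hostnames, paths, and the fsid number are placeholders):<br />
<br />
<pre>
# On the re-export server: mount the original server read-only, without NLM locking
mount -t nfs -o ro,nolock origin.example.com:/export /srv/reexport

# /etc/exports on the re-export server: fsid= is mandatory when re-exporting NFS
/srv/reexport   *(ro,fsid=1000,no_subtree_check)
</pre>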
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to re-exporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that lets the client declare that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned by the original server and hand them out to its own clients, avoiding the need for it to keep much state itself.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
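<br />
Conceptually, a re-exported filehandle looks something like the following. This is just an illustration of why the handles grow, not the actual knfsd filehandle layout:<br />
<br />
<pre>
/* Illustration only -- not knfsd's real filehandle format. */
struct reexport_fh {
        u32 export_id;          /* identifies the export (cf. fsid=) */
        u8  inner_len;          /* length of the original server's handle */
        u8  inner_fh[64];       /* the original server's handle, carried verbatim */
};
/* The wrapped handle is the inner handle plus a few bytes of header, so a
 * maximum-size 64-byte NFSv3 handle from the original server can no longer
 * fit within the 64-byte NFSv3 limit on the re-export server. */
</pre>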
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the filehandles it gets from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container only handles one export.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO to them, but an NFSv3 client provides only filehandles, and NFSv4.0 cannot open by filehandle (it can only open by a (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
The best approach is to avoid this combination: use NFSv4.1 or NFSv4.2 when mounting the original server, or have the clients use NFSv4.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details are at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note that some other cases described there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on the NFS versions on the originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. That requires a special knfsd<->nfs interface, but should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason, when a client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. The workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
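<br />
For example, something along these lines on the re-export server should raise readahead for its NFS client mount; the mount path and the readahead value are illustrative only:<br />
<br />
<pre>
# Find the bdi id (major:minor) for the NFS mount on the re-export server.
bdi=$(mountpoint -d /srv/reexport)              # e.g. "0:52"

# Raise readahead for that mount (value in KB; 16384 is just an example).
echo 16384 > /sys/class/bdi/$bdi/read_ahead_kb
</pre>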
<br />
== open DENY bits ignored ==<br />
<br />
Since NFSv4, the protocol has supported ALLOW and DENY bits taken from Windows, which let you, for example, open a file in a mode that forbids other read or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server also won't pass them along to the original server, so they won't be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
* delegations and leases should work; this could probably use some testing.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-08-18T20:03:14Z<p>Bfields: /* Delegations unsupported */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the filehandles it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container handles only one export.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
The best approach is to avoid this combination: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
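<br />
One knob that may help here is vm.vfs_cache_pressure, which controls how eagerly the kernel reclaims dentry and inode caches. This is a general VM tunable rather than anything specific to NFS re-export, so treat the following as a starting point, not a recommendation:<br />
<pre>
# Default is 100; lower values make the kernel much more reluctant to
# reclaim dentry/inode caches.  The value here is only an example.
echo 'vm.vfs_cache_pressure = 10' > /etc/sysctl.d/90-nfs-reexport.conf
sysctl --system
</pre>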
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
Patches are available and, with luck, may be included in 5.15. Lock recovery will remain an issue.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason, when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. The workaround is to manually increase readahead on the re-export server's client-side mount; see<br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
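<br />
For example (the device number and readahead value below are placeholders; the linked thread has the details), readahead for the client-side mount on the re-export server can be raised through its backing-device entry in sysfs:<br />
<pre>
# On the re-export server: find the bdi device number of the NFS mount,
# then raise its readahead.
mountpoint -d /srv/nfs/vol1            # prints e.g. 0:52
echo 1024 > /sys/class/bdi/0:52/read_ahead_kb
</pre>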
<br />
== open DENY bits ignored ==<br />
<br />
Since NFSv4, NFS supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode that forbids other read or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
A re-export server originally wouldn't give out delegations to its clients at all (if you're looking at the code: the nfs filesystem set its setlease method to simple_nosetlease). That was correct but probably suboptimal.<br />
<br />
Delegations on re-exports should work as of 5.14; this could probably use some testing.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
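<br />
For instance (the hostname and paths are placeholders, and cachefilesd must be configured separately), fscache is enabled on a re-export server's client-side mount with the "fsc" option:<br />
<pre>
# On each re-export server: mount the original server with local caching enabled.
mount -t nfs -o vers=4.2,fsc origin.example.com:/export /srv/nfs/vol1
</pre>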
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Cluster_Coherent_NFSv4_and_Share_ReservationsCluster Coherent NFSv4 and Share Reservations2021-08-18T15:16:06Z<p>Bfields: /* One approach: new flags for open() */</p>
<hr />
<div>=Background=<br />
<br />
NFSv4 share reservations control the concurrent sharing of files at the time they are opened. Share reservations come in two flavors, ACCESS and DENY. There are three types of ACCESS reservations: READ, WRITE, and BOTH; and four types of DENY reservations: NONE, READ, WRITE, and BOTH. <br />
<br />
ACCESS reservations are familiar to Linux users, as they map directly to posix open() flags. NFSv4 ACCESS shares of READ, WRITE, and BOTH map directly to O_RDONLY, O_WRONLY and O_RDWR, respectively.<br />
<br />
NFSv4 DENY reservations act as a type of whole file lock applied when a file is opened. NFSv4 DENY shares of READ, WRITE, and BOTH prevent other opens with read, write, or any access from succeeding. DENY NONE allows other opens to proceed.<br />
<br />
The Linux system call interface for open() follows the posix standard, which does not include support for share reservations. In particular, there is no direct analog in posix for an application to request DENY READ, WRITE, or BOTH shares. Consequently, Linux NFSv4 clients always use DENY NONE.<br />
<br />
The mismatch between posix and NFSv4 shares is also reflected on an NFSv4 server. When the Linux NFSv4 server receives DENY reservations from clients that can express them (in practice, Windows clients), it does the appropriate bookkeeping and enforcement, but the local filesystem is unable to enforce DENY shares for local access on the server.<br />
<br />
When a cluster file system is exported with NFSv4, multiple NFSv4 servers export a common back-end file system, so ACCESS and DENY reservations must be distributed to take into account shares from other NFSv4 servers. In other words, the NFSv4 server has to ask the cluster file system if an incoming OPEN share can be granted.<br />
<br />
==DENY Share Support in Linux==<br />
<br />
Adding DENY share support to the Linux kernel faces several obstacles:<br />
<br />
* DENY shares are alien to posix, the Linux model for file systems.<br />
* There are currently no open Linux file systems that support DENY shares.<br />
* Linux and all other UNIX-like NFSv4 clients currently work correctly because they never request DENY access.<br />
* DENY shares do not meet the NFSv4 access needs of Linux clients, just Windows clients.<br />
* Not even off-the-shelf Windows clients benefit, as NFSv4 for Windows is a third-party add-on (from Hummingbird).<br />
* The user-level Samba server implements DENY shares with open and flock (albeit with the obvious race conditions), which obviates the need for kernel support.<br />
<br />
=Implementation Issues=<br />
<br />
Enforcing DENY shares at open time across the cluster back end is complicated, since an open with DENY must atomically look up, (possibly) create, open, and lock the target file.<br />
<br />
The Linux client atomically joins lookup, create, and open with [[lookup intents]]; the back end may have to do the same thing. The Linux client must also make the open and lock an atomic operation, but there is a problem: you can't lock a file that doesn't exist, so you must first create it. But as soon as the file is created, some other application might find it and lock it. Returning an error from an open that succeeded in creating a file is unexpected behavior.<br />
<br />
Applying restrictive mode bits to the create won't always work, either, because another application might relax the mode restrictions and open the file. <br />
<br />
This suggests that we add the share lock to the open call instead of making it a separate operation.<br />
<br />
==One approach: new flags for open()==<br />
<br />
* Use existing O_RDONLY, O_WRONLY and O_RDWR open flags to implement O_ACCESS_READ, O_ACCESS_WRITE, and O_ACCESS_BOTH, respectively.<br />
* Add two open flags: O_DENY_READ and O_DENY_WRITE (see the sketch at the end of this section).<br />
* Propagate O_DENY flags to the intent structure.<br />
* Add operation adjust_share(file, flags). The file system should be allowed to refuse operations that could not result from open or close. (So, anything that doesn't only turn bits on or only turn them off.) <br />
<br />
* Is this a new kernel operation? Who is supposed to call it? This needs a little better explanation.<br />
<br />
Is there a race here? E.g., say we open+create with a share lock. How do we decide whether to treat it as an upgrade or an open?<br />
<br />
* This issue needs to be explained a little better.<br />
<br />
Note patches were posted for this at one point by Pavel Shilovsky; see https://lwn.net/Articles/581005/. He gave up and as of this writing nobody's taken up the task since.<br />
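<br />
To make the proposal above concrete, here is a minimal sketch of what application code might look like if such flags existed. O_DENY_READ and O_DENY_WRITE are not in mainline Linux; they are defined as 0 below purely so the example builds, which makes it behave as an ordinary open:<br />
<pre>
/* Sketch only: the O_DENY_* flags are the proposal from this page, not a
 * mainline API.  Placeholder definitions keep the example compilable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_DENY_READ
#define O_DENY_READ  0   /* placeholder: a real patch would assign unused flag bits */
#define O_DENY_WRITE 0   /* placeholder */
#endif

int main(void)
{
    /* O_RDWR corresponds to an ACCESS_BOTH share; O_DENY_WRITE would ask the
     * filesystem (and, via NFSv4, the server) to refuse other write opens
     * while this descriptor stays open. */
    int fd = open("shared.dat", O_RDWR | O_CREAT | O_DENY_WRITE, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* ... I/O by the exclusive writer would go here ... */
    close(fd);
    return 0;
}
</pre>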
<br />
==Another approach: best attempt==<br />
<br />
* Issue a lookup. If the file exists, then upgrade.<br />
<br />
* Someone please clarify "upgrade."<br />
<br />
* Otherwise open with implicit create. If we get an error indicating a share conflict, retry the lookup.<br />
<br />
* But the subsequent upgrade (?) might fail. Then what?<br />
<br />
This is obviously not ideal.<br />
<br />
* Would it help to get a reference on the dentry before trying the open?<br />
* Is there currently a lookup/open race if the backend is a distributed filesystem? One way of looking at it is "that's up to them." The client just needs to look at how we implement open and make sure it does the intent stuff right. <br />
<br />
* A brief glance suggests that we probably don't.<br />
<br />
An alternative might be to expose something along the lines of the [[open owner]] to the VFS and let it decide (by comparing open owners) whether a given open is an upgrade or a new open.<br />
<br />
=Status=<br />
<br />
Implementation awaits resolution of these issues.</div>Bfields