Linux NFS wiki — NFS for AFS users (last edited by Bfields, 2022-01-27)<br />
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system, it's more flexible about delegating administrative rights to users.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
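The release step described above is essentially a two-phase commit across the read-only replicas: push the new snapshot everywhere first, then flip all replicas to it at once. A rough simulation of that coordination (all names hypothetical, not the actual AFS protocol):<br />
<br />
```python
# Hypothetical sketch of an AFS-style volume release: stage the new
# snapshot on every reachable replica first, then flip them all to the
# new version, so clients never see a mix of old and new data.

class Replica:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.version = 1        # snapshot currently being served
        self.staged = None      # snapshot received but not yet live

    def stage(self, version):
        if not self.online:
            return False
        self.staged = version   # the actual data transfer happens here
        return True

    def commit(self):
        self.version, self.staged = self.staged, None

def release(replicas, new_version):
    """Two-phase release: stage everywhere reachable, then commit.

    Offline replicas are skipped; real AFS records that in the VLDB
    and re-releases to them when the server comes back.
    """
    staged = [r for r in replicas if r.stage(new_version)]
    for r in staged:
        r.commit()              # phase 2: all replicas flip together
    return [r.name for r in replicas if r.version != new_version]

replicas = [Replica("ro1"), Replica("ro2"), Replica("ro3", online=False)]
stale = release(replicas, new_version=2)   # stale == ["ro3"]
```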
<br />
Volumes can be migrated between machines while in active use, in theory without users noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so we need to migrate at the block level or using fs-specific send/receive. On the protocol side, we can handle migration by moving only entire servers or containers, so that a migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
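One plausible heuristic for the replica choice above is to probe each server from the fs_locations-style list and prefer the lowest latency, keeping the rest as failover candidates. A sketch, where the probe function is a stand-in for something like a NULL RPC ping (server names are made up):<br />
<br />
```python
# Sketch of a client-side replica-selection heuristic: probe each
# candidate server and order reachable ones best-first by latency.

def pick_replicas(servers, probe):
    """Return reachable servers sorted best-first by probed latency."""
    timings = []
    for s in servers:
        rtt = probe(s)          # None means unreachable
        if rtt is not None:
            timings.append((rtt, s))
    return [s for _, s in sorted(timings)]

# Fake probe results for illustration; a real client would measure
# round-trip times of an RPC ping to each server.
fake_rtts = {"near.example.com": 0.002, "far.example.com": 0.080}
order = pick_replicas(
    ["near.example.com", "down.example.com", "far.example.com"],
    probe=fake_rtts.get,
)
# order[0] is the preferred mount; order[1:] are failover candidates.
```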
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep older versions around and serve files from them in case filehandle lookups against the new copy fail, removing them only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
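The second idea above amounts to a fallback chain of retained snapshots: a lookup tries the current generation first and only returns ESTALE if the filehandle resolves in none of them. A toy model, with snapshots represented as filehandle-to-data dicts (the real data structures would of course live in the server's filesystem):<br />
<br />
```python
# Toy model of serving unlinked-but-in-use files from retained older
# read-only snapshots instead of returning ESTALE.

ESTALE = object()   # sentinel standing in for the NFS error

def lookup(snapshots, fh):
    """Try the newest snapshot first, then retained older generations."""
    for snap in snapshots:      # ordered newest first
        if fh in snap:
            return snap[fh]
    return ESTALE

old = {0x1: "libfoo.so.1", 0x2: "app-v1"}
new = {0x1: "libfoo.so.1", 0x3: "app-v2"}   # 0x2 deleted by the release

snapshots = [new, old]
# A process still holding filehandle 0x2 keeps working after the release:
assert lookup(snapshots, 0x2) == "app-v1"
```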
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. [[Server-side silly rename]] would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, AFS has a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
With NFS, within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database. That just provides namespace-management facilities. If it were combined with a kerberized distributed volume manager built on top of LVM, that might serve as a more complete AFS VLDB replacement.<br />
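To make concrete what a VLDB adds over bare referrals: it is one database mapping volume names to (server, partition) locations, consulted when resolving a mount or referral. A toy model (all names and the schema are made up; AFS's actual VLDB records more than this):<br />
<br />
```python
# Toy model of an AFS-style volume location database: volume name ->
# read-write location plus read-only replica locations. An NFS
# referral (fs_locations) answer could be generated from a lookup.

vldb = {
    "user.alice": {
        "rw": ("fs1.example.com", "/vicepa"),
        "ro": [("fs2.example.com", "/vicepb"),
               ("fs3.example.com", "/vicepa")],
    },
}

def locate(volume, writable=False):
    """Return candidate (server, partition) locations for a volume."""
    entry = vldb[volume]
    return [entry["rw"]] if writable else entry["ro"]

# Writes must go to the single RW site; reads can use any RO replica.
rw_sites = locate("user.alice", writable=True)
ro_sites = locate("user.alice")
```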
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
So, for example, if you have multiple kerberos identities that you use to access AFS, you can pick which one you want to use at a given time, or even use both, each in a different window. We'd like this for NFS as well.<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
So, we have a lot of good kernel infrastructure in place which is designed to do this, but (despite an attempt or two) nobody has managed to quite make it work for NFS yet.<br />
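The keyring lookup Howells describes can be simulated in a few lines: credentials attach to a session keyring rather than to the uid, so two shells of the same user can present different identities, and a missing key falls back to one synthesized from the uid/gids. This models only the lookup logic, not the kernel keyring API:<br />
<br />
```python
# Simulation of PAG-style session keyrings: each session carries its
# own token, so the same uid can act as different Kerberos identities
# in different windows; with no token, a uid-based key is made up on
# the spot, as in the quote above.

class Session:
    def __init__(self, keyring=None):
        self.keyring = keyring or {}

def request_key(session, uid):
    """Return the session's filesystem token, else synthesize one."""
    if "fs_token" in session.keyring:
        return session.keyring["fs_token"]
    return f"uid:{uid}"   # stand-in for a key built from uid/gid/groups

window1 = Session({"fs_token": "alice@EXAMPLE.COM"})
window2 = Session({"fs_token": "alice@OTHER.EDU"})   # same uid, other realm
plain = Session()                                    # no token at all
```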
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls"), but it hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that's more similar to the posix ACL utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
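The bit-grouping idea in the first to-do item can be sketched directly: collapse an NFSv4 ACE access mask to posix-style rwx for display, and report whether the collapse was lossless. The bit values are the real ACE4_* constants from RFC 7530; the grouping itself is just one plausible choice:<br />
<br />
```python
# Sketch: display an NFSv4 ACE access mask as rwx, flagging loss of
# information. ACE4_* bit values are from RFC 7530 / RFC 5661.

ACE4_READ_DATA    = 0x0001
ACE4_WRITE_DATA   = 0x0002
ACE4_APPEND_DATA  = 0x0004
ACE4_EXECUTE      = 0x0020
ACE4_DELETE_CHILD = 0x0040

GROUPS = {"r": ACE4_READ_DATA,
          "w": ACE4_WRITE_DATA | ACE4_APPEND_DATA,
          "x": ACE4_EXECUTE}

def to_rwx(mask):
    """Return (rwx string, lossless?) for an NFSv4 access mask."""
    rwx = "".join(c if mask & bits else "-" for c, bits in GROUPS.items())
    covered = 0
    for bits in GROUPS.values():
        if mask & bits:
            covered |= bits
    # Lossy if the mask has bits outside the displayed groups, or only
    # part of a group (e.g. append without write).
    lossless = (mask & ~covered) == 0 and all(
        (mask & bits) in (0, bits) for bits in GROUPS.values())
    return rwx, lossless

# read+write+append maps cleanly to "rw-"; append alone does not.
simple = to_rwx(ACE4_READ_DATA | ACE4_WRITE_DATA | ACE4_APPEND_DATA)
```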
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
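The accounting behind that scheme is worth spelling out: per-volume quotas backed by a shared thin pool may legitimately add up to more than the pool, and what matters operationally is actual usage versus pool size. A small model (numbers and names are illustrative):<br />
<br />
```python
# Model of per-volume quotas over a shared thin-provisioned pool:
# quota overcommit is fine; running out of real space is not.

def pool_report(pool_size, volumes):
    """volumes: {name: (quota, used)} -> (overcommit, used, healthy)."""
    total_quota = sum(q for q, _ in volumes.values())
    total_used = sum(u for _, u in volumes.values())
    return total_quota - pool_size, total_used, total_used <= pool_size

# Two 100 GB home-directory volumes on a 150 GB pool:
volumes = {"home.alice": (100, 40), "home.bob": (100, 30)}
over, used, ok = pool_report(150, volumes)
# Quotas overcommit the pool by 50 GB, but only 70 GB is in use.
```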
<br />
Note that NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas; that's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. [[Server-side silly rename]] would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, AFS has a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
With NFS, within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database. That just provides namespace-management facilities. If it were combined with a kerberized distributed volume manager built on top of LVM, that might server as a more complete AFS VLDB replacement.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
<br />
Note NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2022-01-27T20:27:26Z<p>Bfields: /* replication and migration */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
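To make the btrfs option above concrete, here is a hedged sketch of pushing an incremental update to a read-only replica with send/receive. All hosts, paths, and snapshot names (replica1, /srv/vol, release-1/2) are hypothetical, and the script only echoes the privileged commands it would run.<br />

```shell
#!/bin/sh
# Sketch: updating a read-only replica with incremental btrfs send/receive.
# Hosts and paths are hypothetical. 'run' only echoes each command so the
# sketch is safe to read and test; drop the echo to execute for real.
run() { echo "+ $*"; }

OLD=/srv/vol@release-1     # snapshot already present on both servers
NEW=/srv/vol@release-2     # new snapshot to publish

# 1. Take a new read-only snapshot of the read-write volume.
run btrfs subvolume snapshot -r /srv/vol "$NEW"

# 2. Ship only the delta between the two snapshots to the replica.
run "btrfs send -p $OLD $NEW | ssh replica1 btrfs receive /srv"

# 3. On the replica, bind-mount the new snapshot over the exported path
#    (keeping the same path and fsid so filehandles stay valid), treating
#    the brief interruption like a server reboot.
run ssh replica1 "umount /export/vol"
run ssh replica1 "mount -o bind,ro $NEW /export/vol"
run ssh replica1 "exportfs -ra"
```

The same shape would apply to the thin_delta option, with the send/receive step replaced by a metadata diff plus data copy.<br />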
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha] and [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems in existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
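As a hedged sketch of what such a proxy looks like in practice (host names and paths are hypothetical, and the script only echoes the privileged commands):<br />

```shell
#!/bin/sh
# Sketch: an NFS proxy using the re-export code plus fscache.
# 'origin', 'proxy', and all paths are hypothetical; 'run' only echoes.
run() { echo "+ $*"; }

# 1. On the proxy, start cachefilesd so the 'fsc' mount option has a
#    backing cache on local disk.
run systemctl start cachefilesd

# 2. Mount the origin server with client-side caching enabled.
run mount -t nfs -o vers=4.2,fsc origin:/export /srv/origin

# 3. Re-export the NFS mount; a re-exported mount needs an explicit
#    fsid since the proxy can't derive one from a local device.
run exportfs -o rw,fsid=1000,crossmnt "*:/srv/origin"

# 4. Far-flung clients then mount the proxy instead of the origin.
run mount -t nfs proxy:/srv/origin /mnt
```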
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. [[Server-side silly rename]] would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
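For the referral mechanism, knfsd already supports the refer= export option (see exports(5)). A hedged illustration, with hypothetical host and volume names, of how a namespace root might point at a volume hosted on another server:<br />

```
# /etc/exports on the namespace root server. A lookup under
# /export/homes sends the client an NFSv4 referral to the server
# actually hosting the volume -- roughly an AFS mount point plus
# VLDB entry rolled into one line.
/export        *(ro,fsid=0,crossmnt)
/export/homes  *(ro,refer=/export/homes@homeserver.example.com)
```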
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring and give it new tokens. systemd kind of stuck a spike in that, though, by doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key() on entry to each filesystem method that doesn't take a file* and use that to cache the credentials it is using. If there is no key, it can make one up on the spot and stick the uid/gid/groups in there. This would then need to be handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct private data."<br />
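The closest existing analogue to AFS's pagsh is starting a process tree in a fresh session keyring with keyctl(1). A hedged sketch (the principal and export path are hypothetical, and the script only echoes the commands):<br />

```shell
#!/bin/sh
# Sketch: PAG-like identity isolation using a session keyring.
# 'run' only echoes; drop it to execute for real.
run() { echo "+ $*"; }

# Start a shell joined to a fresh anonymous session keyring,
# roughly what AFS's pagsh does with a PAG:
run keyctl session - /bin/bash

# Inside that shell, keep Kerberos credentials in the keyring too,
# so they are shared by exactly this process group:
run export KRB5CCNAME=KEYRING:session:krb5cc
run kinit alice@EXAMPLE.COM

# Accesses to a krb5-secured mount now use this session's tickets:
run ls /mnt/krb5-export
```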
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems support only "POSIX" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but it hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that's more similar to the POSIX ACL utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
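As a starting point for the conversion work above, here is a hedged sketch of mapping AFS directory rights (rlidwka) to an nfs4_setfacl permission string. The specific bit mapping is an illustrative assumption, not a finished design:<br />

```shell
#!/bin/sh
# Sketch: translate AFS rights letters to NFSv4 ACE permission letters
# as used by nfs4_setfacl. The mapping below is an assumption for
# illustration; a real converter needs review against both specs.
afs_to_nfs4() {
    perms=""
    case $1 in *r*) perms="${perms}rtnc" ;; esac  # read data/attrs/ACL
    case $1 in *l*) perms="${perms}x"    ;; esac  # lookup -> traverse
    case $1 in *i*) perms="${perms}w"    ;; esac  # insert -> add entries
    case $1 in *w*) perms="${perms}aT"   ;; esac  # write -> append, set attrs
    case $1 in *d*) perms="${perms}dD"   ;; esac  # delete self/children
    case $1 in *a*) perms="${perms}Co"   ;; esac  # administer -> ACL/owner
    case $1 in *k*) echo "note: no NFSv4 ACE bit for AFS 'k' (lock)" >&2 ;; esac
    echo "$perms"
}

# Converting one AFS ACE for user alice might then look like:
#   nfs4_setfacl -a "A::alice@example.com:$(afs_to_nfs4 rlidwka)" /export/dir
```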
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
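A hedged sketch of the XFS side (the device, paths, and project id are hypothetical, and the script only echoes the privileged commands):<br />

```shell
#!/bin/sh
# Sketch: an AFS-volume-like quota using an XFS project quota.
# Device, paths, and project id 42 are hypothetical; 'run' only echoes.
run() { echo "+ $*"; }

# The filesystem must be mounted with project quotas enabled:
run mount -o prjquota /dev/vg0/export /export

# Tag one user's tree as project 42 and cap it at 10GB, so it behaves
# like an AFS volume with a quota:
run xfs_quota -x -c "project -s -p /export/alice 42" /export
run xfs_quota -x -c "limit -p bhard=10g 42" /export

# As with AFS volumes, project limits may add up to more than the
# underlying storage:
run xfs_quota -x -c "report -p" /export
```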
<br />
Note that NFS, ext4, XFS, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
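For example, the extraction side might start from a plain vos dump (hedged sketch; the volume name is hypothetical and afs2nfs-restore is an imaginary converter that would have to be written):<br />

```shell
#!/bin/sh
# Sketch: dump an AFS volume for conversion to NFS.
# 'run' only echoes; the converter named below does not exist.
run() { echo "+ $*"; }

# Full dump of one volume in the standard AFS dump format:
run vos dump -id user.alice -file user.alice.dump

# A (hypothetical, to-be-written) converter would walk the dump and
# recreate files, symlinks, and ACLs under the NFS export:
run afs2nfs-restore user.alice.dump /export/homes/alice
```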
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. [[Server-side silly-rename]] would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
<br />
Note NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2022-01-27T20:23:46Z<p>Bfields: /* replication and migration */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose: give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
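A referral in /etc/exports can splice another server's export into the namespace. The hostname and paths below are made up; see the "refer=" option in exports(5):<br />

```
# /etc/exports on the namespace server (illustrative names):
# clients that walk into /export/home are referred to fileserver2
/export       *(ro,fsid=0,crossmnt)
/export/home  *(refer=/vol/home@fileserver2)
```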
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
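The session-keyring analogue can be tried today with the keyutils tools; a shell started this way gets its own session keyring, which roughly plays the role of a PAG:<br />

```shell
# start a shell with a fresh session keyring (roughly a PAG):
keyctl session - /bin/bash

# inside that shell, keys added are private to this session:
keyctl show        # display the new session keyring and its contents
```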
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
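The first to-do item above (collapsing groups of NFSv4 permission bits to read, write, and execute) can be sketched as follows. The ACE mask bit values are from RFC 8881; the grouping into r/w/x sets is our own guess at a reasonable display mapping, not a standard one:<br />

```python
# NFSv4 ACE access-mask bits (values from RFC 8881)
ACE4_READ_DATA   = 0x0001
ACE4_WRITE_DATA  = 0x0002
ACE4_APPEND_DATA = 0x0004
ACE4_EXECUTE     = 0x0020

# Illustrative grouping into posix-style permission classes
READ_BITS  = ACE4_READ_DATA
WRITE_BITS = ACE4_WRITE_DATA | ACE4_APPEND_DATA
EXEC_BITS  = ACE4_EXECUTE

def simplify_mask(mask: int) -> str:
    """Collapse an ACE mask into an "rwx"-style string for display."""
    out = "r" if mask & READ_BITS == READ_BITS else "-"
    out += "w" if mask & WRITE_BITS == WRITE_BITS else "-"
    out += "x" if mask & EXEC_BITS == EXEC_BITS else "-"
    return out

print(simplify_mask(ACE4_READ_DATA | ACE4_EXECUTE))  # r-x
```

A real tool would fall back to showing the raw bits when a mask doesn't fit this simple model.<br />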
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
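As a rough sketch, delegated group management looks like this (group and user names are made up; unlike AFS's pts creategroup, group-add requires an administrator or delegated permission, and member managers need a recent FreeIPA):<br />

```shell
# create a group and populate it:
ipa group-add research --desc="research group"
ipa group-add-member research --users=alice,bob

# delegate membership management to alice, roughly analogous to
# AFS group ownership:
ipa group-add-member-manager research --users=alice
```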
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
<br />
Note that NFS, ext4, XFS, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
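An AFS-volume-like tree with its own quota could be set up with XFS project quotas roughly as follows (paths, names, and sizes are illustrative; the filesystem must be mounted with the prjquota option):<br />

```shell
# map a directory tree to project id 42, then cap it at 10G:
echo "42:/export/alice"  >> /etc/projects
echo "alice:42"          >> /etc/projid
xfs_quota -x -c 'project -s alice' /export
xfs_quota -x -c 'limit -p bhard=10g alice' /export
xfs_quota -x -c 'report -p' /export    # check usage against the quota
```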
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>
Bfields
http://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS re-export 2022-01-19T22:43:08Z <p>Bfields: /* filehandle limits */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
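A minimal configuration along those lines might look like this (hostnames and paths are made up):<br />

```shell
# on the re-export server: mount the original server read-only, no locking
mount -t nfs -o ro,nolock originalserver:/export /srv/reexport

# /etc/exports on the re-export server -- fsid= is mandatory here
/srv/reexport  *(ro,fsid=1000,no_subtree_check)
```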
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. More details on [https://www.kernel.org/doc/html/latest/filesystems/nfs/reexport.html#filehandle-limits filehandle limits].<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions. Note that filehandle lifetimes are limited only by the lifetime of the object they point to; they are still expected to work after the inode has dropped out of the server's cache, or after the server has rebooted.<br />
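As a toy illustration of the growth per hop (the 8-byte wrapper format here is made up, not knfsd's real encoding):<br />

```python
import struct

NFS2_FHSIZE = 32    # protocol maximum for NFSv2
NFS3_FHSIZE = 64    # protocol maximum for NFSv3

def reexport_wrap(orig_fh: bytes, fsid: int) -> bytes:
    """Hypothetical wrapper: prefix a 4-byte fsid and 4-byte length."""
    return struct.pack("!II", fsid, len(orig_fh)) + orig_fh

fh = bytes(28)           # a typical fixed-size handle from the original server
for hop in range(3):     # re-export three times in a chain
    fh = reexport_wrap(fh, fsid=hop)
    print(len(fh))       # 36, 44, 52 -- each hop adds 8 bytes

print(len(fh) <= NFS3_FHSIZE)                          # True: still fits NFSv3
print(len(reexport_wrap(bytes(28), 0)) <= NFS2_FHSIZE) # False: one hop already breaks NFSv2
```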
<br />
One solution might be a [[NFS_proxy-only_mode]], where a server would be dedicated to re-exporting a single original NFS server, but it's not clear how to implement that.<br />
<br />
== filehandles not portable across servers ==<br />
<br />
Given multiple servers re-exporting a single filesystem, it might be expected that a client could easily migrate between them. That's not necessarily true, since filehandles aren't necessarily portable across servers.<br />
<br />
If the servers are all Linux servers, though, it should be sufficient to make sure reexports of the same filesystem all get the same fsid= option. (Note filehandles still won't be portable between reexports and the original server, though.)<br />
<br />
Some infrastructure to make this coordination easier might be useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is to avoid this combination: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason, when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. The workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
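That looks something like the following (the bdi device id varies per mount):<br />

```shell
# find the backing-device id for the NFS mount (prints e.g. "0:52"):
mountpoint -d /srv/reexport

# then raise readahead for that device to 1M:
echo 1024 > /sys/class/bdi/0:52/read_ahead_kb
```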
<br />
== open DENY bits ignored ==<br />
<br />
NFSv4 and later support ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode that forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
* delegations and leases should work; this could probably use some testing.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_proxy-only_modeNFS proxy-only mode2022-01-19T22:42:53Z<p>Bfields: </p>
<hr />
<div>It could be useful to have a mode where an NFS server is dedicated to reexporting all the exports from *one* other NFS server. It would have no other exports whatsoever.<br />
<br />
This would allow the re-export server to support crossmount-like behavior, skip adding its own filesystem identifier to each filehandle (fixing problems with filehandle length limits), and avoid the need for manual assignment of filesystem identifiers with the fsid= option.<br />
<br />
Containers or virtualization could still allow a single physical machine to handle multiple exports even if desired.<br />
<br />
Possible implementation (needs more details). v4 only for now?:<br />
<br />
- Create a new /proc/fs/nfsd/proxy_only file. Before starting the server, mount "/" on the original nfs server, then write the path to the mount to /proc/fs/nfsd/proxy_only. This interface is per-container. It also works for v3, which wouldn't currently be possible with in-kernel mounting, though this feature is not as useful in that case, as nested v3 mounts are rarer.<br />
<br />
- the NFS mount can't allow redirection to other servers, unless those servers observe all the same filehandles.<br />
<br />
- Given a filehandle, map to an export using a GETATTR to the server to get at least fsid, fileid, and file type. If it's a directory, it should be possible to connect it up to the pseudoroot using LOOKUPP. Find or create an export from the resulting struct path, cloning the parameters of the root export.<br />
<br />
- If it's *not* a directory, and not already cached, then create a temporary vfsmount and export rooted at that one file. If you've never seen this fsid before, you'll also have to create a superblock. As far as I can tell, s_root on a given nfs superblock is not important, so it's OK for it to point at this file, even as it later accumulates the rest of the filesystem? But I don't think that's true for export and vfsmount, hence the temporary objects. I'm unclear on how to handle these "disconnected" vfsmounts.<br />
<br />
- In theory, this could work with a filesystem other than NFS, if there was a filesystem or group of filesystems that coordinated their filehandles.</div>
Bfields
http://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_rename Server-side silly rename 2022-01-18T20:27:57Z <p>Bfields: </p>
<hr />
<div>The NFSv3 protocol has no way to say "I'm unlinking this file, but please keep it around because I have an application that's still using it". So if the client wants to provide unix-like semantics, it has to resort to this hack (called "silly rename") on unlink of an open file. See also [http://nfs.sourceforge.net/#section_d], or the earliest description I'm aware of, in [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473 "Design and Implementation of the Sun Network Filesystem" (1985)]:<br />
<br />
<blockquote><br />
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.<br />
<p><br />
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.</p><br />
</blockquote><br />
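The semantics being emulated are easy to demonstrate: on a local filesystem, the kernel itself keeps an unlinked-but-open file alive until last close:<br />

```python
import os, tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"still here")
os.unlink(path)                 # no name anywhere in the filesystem...
os.lseek(fd, 0, os.SEEK_SET)
print(os.read(fd, 32))          # b'still here' -- the open fd still works
os.close(fd)                    # storage is reclaimed on last close
```

On NFS (without server-side help) the client has to fake this with silly rename instead.<br />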
<br />
Silly rename is indeed an imperfect solution. Another case when users sometimes notice the ".nfsXXXX" files is when they try to remove a directory that contains them. Also, it doesn't help if a file is unlinked by a different client than the one that holds it open.<br />
<br />
NFSv4 actually does have open and close calls, and our server won't free a file until last close--unless the server reboots, at which point the file will disappear even if an application on the client is still using it. NFS is supposed to keep working normally across server reboots, so the client still does silly rename even in the v4 case.<br />
<br />
We could move the responsibility for silly rename to the server--the server could keep a hardlink to the file after unlink, and<br />
that would preserve the file after reboot as well. (And it could use a separate directory for the purpose, and avoid the rmdir). We even added a bit to the NFSv4.1 protocol so that the server can tell the client it does this, allowing the client to skip sillyrename (see references to OPEN4_RESULT_PRESERVE_UNLINKED in [https://tools.ietf.org/html/rfc8881].)<br />
<br />
I suspect the client side implementation of this wouldn't be hard--it'd need to watch for the OPEN4_RESULT_PRESERVE_UNLINKED flag and skip silly rename in its presence. (Update: see [https://lore.kernel.org/linux-nfs/20220118190251.55526-1-olga.kornievskaia@gmail.com/T/#u].)<br />
<br />
The server side looks harder.<br />
<br />
One complication is that knfsd doesn't get exclusive use of exported filesystems: other applications may also be using them. A file opened by an NFS client could be unlinked by a local application, and we'd like the file not to disappear after reboot in that case. That said, the current behavior doesn't handle that case--it doesn't even handle the case when the unlink is done by a different client than the open--so for a first implementation I think it'd be fine to ignore that case.<br />
<br />
My rough plan for knfsd is to create a hidden directory in the root of the exported filesystem and modify nfsd4_remove() to check whether the file to be unlinked is open by an NFSv4 client, and if so to instead rename it to that hidden directory. The name shouldn't matter--just use a counter or something.<br />
<br />
I think we can use something like the logic at the start of nfsd4_process_open2() to look up a struct nfs4_file from the filehandle, and then use that to check for NFSv4 opens. We also need to prevent the race where a new open comes in after we decide to unlink the file but before we're done unlinking it--I'm not sure how. And we need to think about the possibility of filehandle aliasing, in which case there may exist two nfs4_files for a given file.<br />
<br />
Then we need the close code to check whether we're closing one of these files and, if so, to also unlink it from the hidden directory.<br />
<br />
And finally, the laundromat code, after it ends the grace period, needs to walk through the hidden directory and remove any files that haven't been opened. Maybe code like that in nfsd4_recdir_purge_old() would work. This is usually the kind of thing we try not to do from the kernel, but I don't see a clean way to do it from userspace.<br />
<br />
That done, if we wanted to also make this work for unlinks by non-NFSv4 clients, we'd need some way to intercept all the unlinks to a given filesystem. We might need to modify the individual exported filesystems.<br />
<br />
We may want to think about how exactly to hide that directory. Maybe we could get some kind of help from the filesystem.<br />
<br />
The extra hidden link will mean that st_nlink (as seen by local users) and the numlinks attribute (as returned to NFSv4 GETATTR callers) are off by one. We could fix up the latter, at least, by checking for this specific case.<br />
<br />
Approaches I (bfields) considered and rejected for now:<br />
<br />
* Create a link in the new directory on every open, and remove it on every close. But open may be a frequent operation, and we'd need to actually sync that link to disk on every operation, so it could be pretty slow. But maybe, with cooperation of the filesystem, we could *just* do the link on open, and delay waiting for the sync until there's an unlink.<br />
* Filesystems already have to deal with the case where the system crashes while there are unlinked open files. I believe they keep a list of such files so they can free them in fsck or next mount. I considered hooking into that process somehow--perhaps the server could be given an interface allowing it to discover those orphaned files. It would require nfsd to be involved in the mount process (currently we mount first, then export). And we'd have to figure out how to perform clean shutdowns without losing those files. And we'd have to worry about losing them any time an administrator fsck'd or mounted without running nfsd. So in the end maybe it wouldn't work.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Server-side_silly_renameServer-side silly rename2022-01-18T19:07:06Z<p>Bfields: </p>
<hr />
<div>The NFSv3 protocol has no way to say "I'm unlinking this file, but please keep it around because I have an application that's still using it". So if the client wants to provide unix-like semantics, it has to resort to this hack (called "silly rename") on unlink of an open file. See also [http://nfs.sourceforge.net/#section_d], or the earliest description I'm aware of, in [https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.473 "Design and Implementation of the Sun Network Filesystem" (1985)]:<br />
<br />
<blockquote><br />
We tried very hard to make the NFS client obey UNIX filesystem semantics without modifying the server or the protocol. In some cases this was hard to do. For example, UNIX allows removal of open files. A process can open a file, then remove the directory entry for the file so that it has no name anywhere in the filesystem, and still read and write the file. This is a disgusting bit of UNIX trivia and at first we were just not going to support it, but it turns out that all of the programs that we didn't want to have to fix (csh, sendmail, etc.) use this for temporary files.<br />
<p><br />
What we did to make open file removal work on remote files was check in the client VFS remove operation if the file is open, and if so rename it instead of removing it. This makes it (sort of) invisible to the client and still allows reading and writing. The client kernel then removes the new name when the vnode becomes inactive. We call this the 3/4 solution because if the client crashes between the rename and remove a garbage file is left on the server. An entry to cron can be added to clean up on the server.</p><br />
</blockquote><br />
<br />
Silly rename is indeed an imperfect solution. Another case when users sometimes notice the ".nfsXXXX" files is when they try to remove a directory that contains them. Also, it doesn't help if a file is unlinked by a different client than the one that holds it open.<br />
<br />
NFSv4 actually does have open and close calls, and our server won't free a file until last close--unless the server reboots, at which point the file will disappear even if an application on the client is still using it. NFS is supposed to keep working normally across server reboots, so the client still does silly rename even in the v4 case.<br />
<br />
We could move the responsibility for silly rename to the server--the server could keep a hard link to the file after unlink, and that would preserve the file after reboot as well. (It could also use a separate directory for the purpose, avoiding the rmdir problem.) We even added a bit to the NFSv4.1 protocol so that the server can tell the client it does this, allowing the client to skip silly rename (see the references to OPEN4_RESULT_PRESERVE_UNLINKED in [https://tools.ietf.org/html/rfc8881]).<br />
<br />
I suspect the client side implementation of this wouldn't be hard--it'd need to watch for the OPEN4_RESULT_PRESERVE_UNLINKED flag and skip silly rename in its presence. (Update: see https://lore.kernel.org/linux-nfs/20220118190251.55526-1-olga.kornievskaia@gmail.com/T/#u .)<br />
<br />
The server side looks harder.<br />
<br />
One complication is that knfsd doesn't get exclusive use of exported filesystems: other applications may also be using them. A file opened by an NFS client could be unlinked by a local application, and we'd like the file not to disappear after reboot in that case. That said, the current behavior doesn't handle that case--it doesn't even handle the case when the unlink is done by a different client than the open--so for a first implementation I think it'd be fine to ignore that case.<br />
<br />
My rough plan for knfsd is to create a hidden directory in the root of the exported filesystem and modify nfsd4_remove() to check whether the file to be unlinked is open by an NFSv4 client, and if so to instead rename it to that hidden directory. The name shouldn't matter--just use a counter or something.<br />
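That plan might look something like the following sketch--plain Python standing in for kernel code, with an in-memory set standing in for the nfs4_file lookup and ".nfsd-unlinked" as a made-up name for the hidden directory:

```python
import os

class SillyRenameServer:
    """Toy model of the proposed nfsd4_remove() change: if the file
    being unlinked is open by an NFSv4 client, rename it into a hidden
    directory instead of removing it."""

    def __init__(self, export_root):
        # hypothetical hidden directory in the root of the export
        self.hidden = os.path.join(export_root, ".nfsd-unlinked")
        os.makedirs(self.hidden, exist_ok=True)
        self.open_files = set()   # stands in for the nfs4_file lookup
        self.counter = 0

    def nfsv4_open(self, path):
        self.open_files.add(path)

    def remove(self, path):
        if path in self.open_files:
            # open by an NFSv4 client: rename into the hidden
            # directory; the name doesn't matter, so use a counter
            hidden_name = os.path.join(self.hidden, str(self.counter))
            self.counter += 1
            os.rename(path, hidden_name)
            self.open_files.discard(path)
            self.open_files.add(hidden_name)
            return hidden_name
        os.unlink(path)
        return None

    def close(self, path):
        # the close code also unlinks preserved files
        self.open_files.discard(path)
        if os.path.dirname(path) == self.hidden:
            os.unlink(path)
```

The race mentioned below (a new open arriving mid-unlink) is exactly what this single-threaded toy glosses over.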
<br />
I think we can use something like the logic at the start of nfsd4_process_open2 to look up a struct nfs4_file from the filehandle, and then use that to check for NFSv4 opens. We also need to prevent the race where a new open comes in after we decide to unlink the file but before we're done unlinking it--I'm not sure how. We also need to think about the possibility of filehandle aliasing, in which case there may be two nfs4_files for a given file.<br />
<br />
Then we need the close code to check whether we're closing one of these files and, if so, to also unlink it from the hidden directory.<br />
<br />
And, finally the laundromat code, after it ends the grace period, needs to walk through the hidden directory and remove any files that haven't been opened. Maybe nfsd4_recdir_purge_old() would do. This is usually the kind of thing we try not to do from the kernel, but I don't see a clean way to do it from userspace.<br />
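The laundromat pass could amount to something like this (again a sketch; in reality the walk would be kernel code and "still open" would be determined from the reclaimed nfs4_files):

```python
import os

def purge_unlinked(hidden_dir, still_open):
    """After the grace period ends, remove any preserved file in the
    hidden directory that was not re-opened (reclaimed) by a client.
    still_open is the set of paths with live NFSv4 opens."""
    removed = []
    for name in os.listdir(hidden_dir):
        path = os.path.join(hidden_dir, name)
        if path not in still_open:
            os.unlink(path)
            removed.append(name)
    return sorted(removed)
```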
<br />
That done, if we wanted to also make this work for unlinks by non-NFSv4 clients, we'd need some way to intercept all the unlinks to a given filesystem. We might need to modify the individual exported filesystems.<br />
<br />
We may want to think about how exactly to hide that directory. Maybe we could get some kind of help from the filesystem.<br />
<br />
The extra hidden link will mean that the st_nlink (for local users) and the numlinks attribute (for NFSv4 GETATTR callers) are wrong. We could fix up the latter, at least, by checking for this specific case.<br />
<br />
--<br />
<br />
Another possibility I considered was just creating a link in the new directory on every open, and removing it on every close. But open may be a frequent operation, and we'd need to actually sync that link to disk on every operation, so it could be pretty slow. But maybe, with cooperation of the filesystem, we could *just* do the link on open, and delay waiting for the sync until there's an unlink.<br />
<br />
--<br />
<br />
Another possibility: filesystems already have to deal with the case where the system crashes while there are unlinked open files. I believe they keep a list of such files so they can free them in fsck or next mount. I considered hooking into that process somehow--perhaps the server could be given an interface allowing it to discover those orphaned files. It would require nfsd to be involved in the mount process (currently we mount first, then export). And we'd have to figure out how to perform clean shutdowns without losing those files. And we'd have to worry about losing them any time an administrator fsck'd or mounted without running nfsd. So in the end maybe it wouldn't work.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-09-10T18:38:31Z<p>Bfields: </p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
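Concretely, a conservative setup might look like this (the paths, hostname, and fsid number are placeholders, and the NFS version is an assumption--adjust for your environment):

```
# on the re-export server: mount the original server read-only, no locking
mount -t nfs -o ro,nolock origserver:/export /srv/reexport

# /etc/exports on the re-export server: fsid= is mandatory for NFS re-exports
/srv/reexport  *(ro,no_subtree_check,fsid=1000)
```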
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option; any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
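So each NFS filesystem in the tree currently needs its own explicit export entry with its own fsid=, rather than relying on crossmnt. A sketch /etc/exports (paths and fsid numbers made up):

```
# crossmnt alone won't propagate fsid=, so list every NFS mount explicitly
/srv/reexport        *(ro,no_subtree_check,fsid=1000)
/srv/reexport/proj1  *(ro,no_subtree_check,fsid=1001)
/srv/reexport/proj2  *(ro,no_subtree_check,fsid=1002)
```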
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
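The wrapping and the size problem can be illustrated with a toy model (this is not knfsd's actual filehandle layout; the 4-byte export-id prefix is an invented example):

```python
import struct

NFSV3_MAX_FH = 64
NFSV4_MAX_FH = 128

def wrap_filehandle(export_id, original_fh, max_size=NFSV4_MAX_FH):
    """Prefix the original server's filehandle with bytes identifying
    the export, as a re-export server must; fail if the result would
    exceed the protocol's filehandle size limit."""
    wrapped = struct.pack(">I", export_id) + original_fh
    if len(wrapped) > max_size:
        raise ValueError("wrapped filehandle too large: %d > %d"
                         % (len(wrapped), max_size))
    return wrapped

def unwrap_filehandle(wrapped):
    """Recover the export id and the original server's filehandle."""
    export_id, = struct.unpack(">I", wrapped[:4])
    return export_id, wrapped[4:]
```

Each level of re-export adds another prefix, which is why repeated re-export (or NFSv2's 32-byte limit) runs out of room first.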
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the filehandle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container only handles one each.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO to them, but an NFSv3 client provides only filehandles, and NFSv4.0 cannot open by filehandle (it can open only by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
The best option is to avoid this combination: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
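One knob that may help with that (an assumption that it's sufficient for your workload--test before relying on it) is lowering vm.vfs_cache_pressure, which makes the kernel more reluctant to reclaim dentry and inode caches:

```
# /etc/sysctl.d/90-nfs-reexport.conf (example file name)
# Lower values make the kernel keep dentry/inode caches longer.
vm.vfs_cache_pressure = 10
```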
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason, when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. The workaround is to manually increase the client's readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
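Per that thread, the readahead bump is done through the NFS mount's backing-device-info entry in sysfs; the device number below is an example and varies per mount:

```
# find the NFS mount's device number (e.g. 0:52): it's the third field
# of the matching line in /proc/self/mountinfo
grep /srv/reexport /proc/self/mountinfo

# then raise readahead on the matching BDI (value in KB; 1024 = 1M)
echo 1024 > /sys/class/bdi/0:52/read_ahead_kb
```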
<br />
== open DENY bits ignored ==<br />
<br />
Since NFSv4, NFS has supported ALLOW and DENY bits, taken from Windows, which allow you, for example, to open a file in a mode that forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
* delegations and leases should work; this could probably use some testing.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-09-10T18:36:21Z<p>Bfields: </p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: it's purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers, a single physical machine could still handle multiple exports even if each container handles only one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO on them, but an NFSv3 client provides only filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
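<br />
One knob that fits this description is the VM's cache-pressure sysctl. A sketch (the value and the drop-in file name are our choices, not a tested recommendation; a low value trades memory for cache retention):<br />
<br />
```shell
# Make the VM reclaim dentries/inodes much less aggressively (default is 100):
sysctl -w vm.vfs_cache_pressure=1

# To persist across reboots (file name is arbitrary):
echo 'vm.vfs_cache_pressure = 1' >> /etc/sysctl.d/90-reexport.conf
```
<br />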
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
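<br />
A sketch of the readahead workaround (the mount path, bdi id, and 16MB value are examples; pick a value for your workload):<br />
<br />
```shell
# Find the backing-device (bdi) id for the NFS mount:
mountpoint -d /mnt/reexport        # prints something like 0:52

# Raise readahead on that bdi, e.g. to 16MB:
echo 16384 > /sys/class/bdi/0:52/read_ahead_kb
```
<br />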
<br />
== open DENY bits ignored ==<br />
<br />
NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
(Delegations on re-exports should work as of 5.14; could probably use some testing.)<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
* File locking crashes should be fixed as of 5.15. (But note reboot recovery is still unsupported.)<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-08-18T20:03:14Z<p>Bfields: /* Delegations unsupported */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
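<br />
Putting the above together, a minimal sketch of a re-export setup (hostnames, paths, and the fsid value are made up):<br />
<br />
```shell
# On the re-export server: mount the original server read-only, without NLM:
mount -t nfs -o ro,nolock origserver:/export /srv/reexport

# /etc/exports -- "fsid=" is mandatory when re-exporting NFS:
#   /srv/reexport  *(ro,fsid=1000,no_subtree_check)
exportfs -ra
```
<br />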
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
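<br />
In practice that means listing each NFS mount explicitly, each with its own unique fsid. A sketch with made-up paths and ids:<br />
<br />
```shell
# /etc/exports -- no crossmnt; one entry (and one distinct fsid) per NFS mount:
/srv/reexport        *(ro,fsid=1000,no_subtree_check)
/srv/reexport/sub1   *(ro,fsid=1001,no_subtree_check)
/srv/reexport/sub2   *(ro,fsid=1002,no_subtree_check)
```
<br />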
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers, a single physical machine could still handle multiple exports even if each container handles only one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO on them, but an NFSv3 client provides only filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
Patches are available, with luck may be included in 5.15. Lock recovery will remain an issue.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
<br />
== open DENY bits ignored ==<br />
<br />
NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
(Delegations on re-exports should work as of 5.14; could probably use some testing.)<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-08-18T20:01:59Z<p>Bfields: /* broken file locking */</p>
<hr />
<div>(Earlier revision; body nearly identical to the preceding NFS re-export entry.)</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/Cluster_Coherent_NFSv4_and_Share_ReservationsCluster Coherent NFSv4 and Share Reservations2021-08-18T15:16:06Z<p>Bfields: /* One approach: new flags for open() */</p>
<hr />
<div>=Background=<br />
<br />
NFSv4 share reservations control the concurrent sharing of files at the time they are opened. Share reservations come in two flavors, ACCESS and DENY. There are three types of ACCESS reservations: READ, WRITE, and BOTH; and four types of DENY reservations: NONE, READ, WRITE, and BOTH. <br />
<br />
ACCESS reservations are familiar to Linux users, as they map directly to posix open() flags. NFSv4 ACCESS shares of READ, WRITE, and BOTH map directly to O_RDONLY, O_WRONLY and O_RDWR, respectively.<br />
<br />
NFSv4 DENY reservations act as a type of whole file lock applied when a file is opened. NFSv4 DENY shares of READ, WRITE, and BOTH prevent other opens with read, write, or any access from succeeding. DENY NONE allows other opens to proceed.<br />
<br />
The Linux system call interface for open() follows the posix standard, which does not include support for share reservations. In particular, there is no direct analog in posix for an application to request DENY READ, WRITE, or BOTH shares. Consequently, Linux NFSv4 clients always use DENY NONE.<br />
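<br />
The conflict rule itself is simple. A sketch in Python (the bitmask encoding and function name are ours for illustration, not from the RFC):<br />
<br />
```python
# Sketch of NFSv4 share-reservation conflict checking.
# ACCESS and DENY are bitmasks: READ = 1, WRITE = 2, BOTH = 3; DENY NONE = 0.
READ, WRITE = 1, 2

def conflicts(existing_opens, new_access, new_deny):
    """A new open conflicts if it denies what an existing open uses,
    or uses what an existing open denies."""
    for access, deny in existing_opens:
        if (new_deny & access) or (deny & new_access):
            return True
    return False

opens = [(READ, 0)]                       # one reader, DENY NONE
print(conflicts(opens, READ, 0))          # False: another plain reader is fine
print(conflicts(opens, WRITE, READ))      # True: new open denies existing reader
opens = [(WRITE, READ | WRITE)]           # writer holding DENY BOTH
print(conflicts(opens, READ, 0))          # True: existing open denies all access
```
<br />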
<br />
The mismatch between posix and NFSv4 shares is also reflected on an NFSv4 server. When the Linux NFSv4 server receives DENY reservations from clients that can express them (in practice, Windows clients), it does the appropriate bookkeeping and enforcement, but the local filesystem is unable to enforce DENY shares against local access on the server.<br />
<br />
When a cluster file system is exported with NFSv4, multiple NFSv4 servers export a common back-end file system, so ACCESS and DENY reservations must be distributed to take into account shares from other NFSv4 servers. In other words, the NFSv4 server has to ask the cluster file system if an incoming OPEN share can be granted.<br />
<br />
==DENY Share Support in Linux==<br />
<br />
Adding DENY share support to the Linux kernel faces several obstacles:<br />
<br />
* DENY shares are alien to posix, the Linux model for file systems.<br />
* There are currently no open Linux file systems that support DENY shares.<br />
* Linux and all other UNIX-like NFSv4 clients currently work correctly because they never request DENY access.<br />
* DENY shares do not meet the NFSv4 access needs of Linux clients, just Windows clients.<br />
* Not even off-the-shelf Windows clients benefit as NFSv4 for Windows is a third-party add-on (from Hummingbird).<br />
* The user-level Samba server implements DENY shares with open and flock (albeit with the obvious race conditions), which obviates kernel support.<br />
<br />
=Implementation Issues=<br />
<br />
Enforcing open-share DENY access across the cluster back end is complicated, since an open with DENY must atomically lookup, (possibly) create, open, and lock the target file.<br />
<br />
The Linux client atomically joins lookup, create, and open with [[lookup intents]]; the back end may have to do the same thing. The Linux client must also make the open and lock an atomic operation, but there is a problem: you can't lock a file that doesn't exist, so you must first create it. But as soon as the file is created, some other application might find it and lock it. Returning an error to an open that succeeded in creating a file is unexpected behavior. <br />
<br />
Applying restrictive mode bits to the create won't always work, either, because another application might relax the mode restrictions and open the file. <br />
<br />
This suggests that we add the share lock to the open call instead of making it a separate operation.<br />
<br />
==One approach: new flags for open()==<br />
<br />
* Use existing O_RDONLY, O_WRONLY and O_RDWR open flags to implement O_ACCESS_READ, O_ACCESS_WRITE, and O_ACCESS_BOTH, respectively.<br />
* Add two open flags: O_DENY_READ and O_DENY_WRITE.<br />
* Propagate O_DENY flags to the intent structure.<br />
* Add operation adjust_share(file, flags). The file system should be allowed to refuse operations that could not result from open or close. (So, anything that doesn't only turn bits on or only turn them off.) <br />
<br />
* Is this a new kernel operation? Who is supposed to call it? This needs a little better explanation.<br />
<br />
Is there a race here? E.g., say we open+create with a share lock. How do we decide whether to treat it as an upgrade or an open?<br />
<br />
* This issue needs to be explained a little better.<br />
<br />
Note patches were posted for this at one point by Pavel Shilovsky; see https://lwn.net/Articles/581005/. He gave up and as of this writing nobody's taken up the task since.<br />
<br />
==Another approach: best attempt==<br />
<br />
* Issue a lookup. If the file exists, then upgrade.<br />
<br />
* Someone please clarify "upgrade."<br />
<br />
* Otherwise open with implicit create. If we get an error indicating a share conflict, retry the lookup.<br />
<br />
* But the subsequent upgrade (?) might fail. Then what?<br />
<br />
This is obviously not ideal.<br />
<br />
* Would it help to get a reference on the dentry before trying the open?<br />
* Is there currently a lookup/open race if the backend is a distributed filesystem? One way of looking at it is "that's up to them." The client just needs to look at how we implement open and make sure it does the intent stuff right. <br />
<br />
* A brief glance suggests that we probably don't.<br />
<br />
An alternative might be to expose something along the lines of the [[open owner]] to the VFS and let it decide (by comparing open owners) whether a given open is an upgrade or a new open.<br />
<br />
=Status=<br />
<br />
Implementation awaits resolution of these issues.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/AFS_to_NFSv4_ACL_conversionAFS to NFSv4 ACL conversion2021-08-04T19:05:42Z<p>Bfields: </p>
<hr />
<div>If filesystems are migrated from AFS to NFSv4, one of the challenges will be mapping file permissions.<br />
<br />
Below we describe an algorithm that maps an AFS ACL to an NFSv4 ACL. It is necessarily imperfect but I think it gets pretty close. If the resulting NFSv4 ACL is stored on a Linux server, the result will be further translated to a POSIX ACL. That process will lose more information. The result may still be adequate for some cases.<br />
<br />
I expect the bigger challenge to be transitioning users. Our NFSv4 ACL tools and documentation both need work.<br />
<br />
Sources:<br />
* https://docs.openafs.org/UserGuide/HDRWQ46.html<br />
* https://www.auristor.com/documentation/man/linux/7/auristorfs_acls.html<br />
<br />
AFS ACL permissions affecting directories are:<br />
(l)ookup<br />
(i)nsert<br />
(d)elete<br />
(a)dminister<br />
AFS ACL permissions affecting files are:<br />
(r)ead<br />
(w)rite<br />
loc(k)<br />
<br />
Permissions are set only on directories in AFS; AuriStor also allows them on files, symlinks, and mount points. The r/w/k permissions on a directory control access to files in that directory that don't have an ACL of their own.<br />
<br />
So in auristor it's possible for a file to not have any ACL set.<br />
<br />
I assume that directories do have ACLs set, and that new subdirectories inherit the same permissions as their parents.<br />
<br />
Both also allow 8 additional application-defined permissions A-H which I think we'll ignore. (If necessary, write a v4 protocol extension.)<br />
<br />
"Negative permissions" always take precedence.<br />
<br />
Auristor docs state that "i" is equivalent to "w" permissions when applied to a file. That's a rather strange rule given the way that AFS otherwise segregates directory and file permissions, so I'm going to ignore it until I have a chance to ask some questions.<br />
<br />
Here's a first attempt at an AFS->NFSv4 translation algorithm:<br />
<br />
Given a directory: Start with an empty NFSv4 ACL. Iterate through the ACEs of the AFS ACL one at a time, negative ACEs first. For each ACE:<br />
<br />
* If the AFS ACE is a negative ACE, any new NFSv4 ACEs generated in this step should be ACCESS_DENIED_ACE, otherwise ACCESS_ALLOWED_ACE.<br />
* If this is a directory, create two NFSv4 ACEs, one with ACE4_DIRECTORY_INHERIT_ACE set, the other with NFS4_ACE_INHERIT_ONLY_ACE and NFS4_ACE_FILE_INHERIT_ACE set. On the first ACE, convert mode bits as follows:<br />
l->ACE4_EXECUTE|ACE4_READ_DATA<br />
i->ACE4_WRITE_DATA|ACE4_APPEND_DATA<br />
d->ACE4_DELETE_CHILD<br />
a->ACE4_WRITE_ACL<br />
* On the second ACE, convert mode bits as follows:<br />
r->ACE4_READ_DATA<br />
w->ACE4_WRITE_DATA|ACE4_APPEND_DATA<br />
a->ACE4_WRITE_ACL.<br />
* Given a file: the same, but create only one ACE, with no inheritance bits set. If the source file has no AFS ACL, then the ACL on the destination file should be exactly what it would have been had it been inherited from its parent. If the source file has an ACL, then determine the access bits on each destination ACE as above, ignoring any l, i, or d bits. In addition, if the destination NFSv4 server supports the dacl attribute, set ACE4_INHERITED_ACE on all ACEs when the source file has no AFS ACL, and set the ACL4_PROTECTED flag on the ACL of any file for which the source file has an AFS ACL set.<br />
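As a concrete sketch of the directory half of this translation (the bit values below are illustrative placeholders rather than the on-the-wire NFSv4 constants, and ACE flags are abbreviated to strings):<br />

```python
# AFS -> NFSv4 access-mask translation for a single directory ACE, following
# the algorithm above.  Bit values are placeholders for illustration only.
ACE4_READ_DATA = 0x1
ACE4_WRITE_DATA = 0x2
ACE4_APPEND_DATA = 0x4
ACE4_DELETE_CHILD = 0x10
ACE4_EXECUTE = 0x20
ACE4_WRITE_ACL = 0x40000

DIR_MAP = {
    "l": ACE4_EXECUTE | ACE4_READ_DATA,
    "i": ACE4_WRITE_DATA | ACE4_APPEND_DATA,
    "d": ACE4_DELETE_CHILD,
    "a": ACE4_WRITE_ACL,
}
FILE_MAP = {
    "r": ACE4_READ_DATA,
    "w": ACE4_WRITE_DATA | ACE4_APPEND_DATA,
    "a": ACE4_WRITE_ACL,
}

def translate_dir_ace(who, rights, negative=False):
    """Translate one AFS directory ACE into the two NFSv4 ACEs described
    above: one applying to the directory itself (with DIRECTORY_INHERIT)
    and one inherit-only ACE carrying the file permissions.  Negative AFS
    ACEs become DENY ACEs; the caller must emit those first, since
    negative permissions always take precedence."""
    ace_type = "DENY" if negative else "ALLOW"
    dir_mask = 0
    file_mask = 0
    for r in rights:
        dir_mask |= DIR_MAP.get(r, 0)
        file_mask |= FILE_MAP.get(r, 0)
    return [
        (ace_type, who, "DIRECTORY_INHERIT", dir_mask),
        (ace_type, who, "INHERIT_ONLY|FILE_INHERIT", file_mask),
    ]
```

For example, the common AFS "all" rights string "rlidwka" yields a directory mask combining l, i, d, and a, and an inherit-only mask combining r, w, and a; loc(k) is dropped, as discussed below.<br />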
<br />
I don't think there's any way to handle loc(k) correctly. It should just be ignored, and the omission documented. ACE4_SYNCHRONIZE is vaguely similar, but in practice it won't do the right thing on any NFSv4 server: a Linux server will ignore it, and a server that supports it probably uses read or write access to decide lock permissions, as is traditional for Linux/Unix.<br />
<br />
Failures on mapping to NFSv4 ACLs:<br />
* Without server (and client userspace tool) support for the NFSv4.1 DACL attribute and "automatic" inheritance, changes to directories won't automatically propagate to children. This is different from AFS, where the directory ACL controls permissions on all clients (and AuriStor, where that's still the default behavior although files may optionally have their own ACLs). The NFSv4.1 DACL attribute partly compensates for this, though note that its "automatic" inheritance isn't quite the same, as it depends on user tools to propagate changes to directory permissions.<br />
* loc(k) bit, as noted above.<br />
<br />
Failures on mapping to POSIX ACLs as implemented on Linux:<br />
<br />
* POSIX ACLs are unable to distinguish between insert and delete. The mapping should default to erring on the side of restrictiveness, and deny directory write permissions when insert and delete are not both permitted.<br />
* POSIX ACLs inherit parent permissions only on creation; there's no way to emulate AFS behavior.<br />
* POSIX ACL inheritance is very different: directories have an "access ACL" (which determines permissions) and a "default ACL" (which is inherited by children). The default ACL is inherited by both files and directories, and POSIX ACLs use the same bit for, for example, directory modification and file write. This means, for example, it's not possible to set an inheritable ACL giving child files write permission but denying modification permissions to child directories. In such cases by default we should err on the side of denying both permissions.<br />
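A minimal sketch of the restrictive insert/delete collapse described above (a hypothetical helper, not part of any real tool):<br />

```python
def afs_dir_to_posix_write(rights):
    """Collapse AFS insert/delete into the single POSIX directory 'w' bit,
    erring on the side of restrictiveness: grant write only when both
    insert and delete are permitted."""
    return "i" in rights and "d" in rights
```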
<br />
Whenever we're unable to map permissions exactly, we should default to erring on the side of denying permissions, and warning the user. A conversion tool may provide ways to override that default and the warnings. Care must be taken to prevent warnings from being unmanageable. When converting large filesystems, we should provide a readable summary and give the user a way to get the details if necessary.<br />
<br />
Some remaining questions:<br />
<br />
* interactions with mode bits: in the past I believe AFS simply ignored mode bits, allowing us to ignore them on conversion. I think that may no longer be true for AFS (or at least for Auristor).<br />
* user & group naming<br />
* file owners and groups</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/AFS-like_group_management_with_FreeIPAAFS-like group management with FreeIPA2021-08-04T18:59:56Z<p>Bfields: </p>
<hr />
<div>AFS allows any ordinary user amy to create groups named "amy:groupname": https://docs.openafs.org/Reference/1/pts_creategroup.html. AFS has per-user quotas limiting the number of such groups created.<br />
<br />
It would be possible to emulate this in FreeIPA by creating a permission, privilege, and role for each individual user, though it's a little cumbersome. For example:<br />
<br />
ipa permission-add "create amy groups" --type=group --right=add --filter="(cn=amy-*)"<br />
ipa privilege-add "create amy groups"<br />
ipa privilege-add-permission --permissions="create amy groups" "create amy groups"<br />
ipa role-add role-manage-amy-groups<br />
ipa role-add-member --users=amy role-manage-amy-groups<br />
ipa role-add-privilege --privileges="create amy groups" role-manage-amy-groups<br />
<br />
FreeIPA permits an administrator to grant a user the right to modify membership of a given group (see the member and membermanager attributes), or delegate the right to create groups to certain users.<br />
<br />
You can view and modify group membership with "ipa group-show" and "ipa group-add-member".<br />
<br />
There's no way to enforce quotas. This would require someone writing a new plugin. We're not aware of anyone working on it.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProjectFedFsUtilsProject2021-07-27T15:46:35Z<p>Bfields: /* FedFS introduction */</p>
<hr />
<div>== Project: fedfs-utils == <br />
[<br />
[[FedFsUtilsProject|Project Home]] |<br />
[[FedFsUtilsNews|News]] |<br />
[[FedFsUtilsDownloads|Downloads]] |<br />
[[FedFsUtilsDocs|Docs]] |<br />
[[FedFsUtilsMailingLists|Mailing Lists]] |<br />
[[FedFsUtilsSourceControl|Source Control]] |<br />
[[FedFsUtilsIssues|Issues]]<br />
]<br />
----<br />
<br />
'''Project Description:''' Linux implementation of Federated File System standard (RFC 7532 and RFC 7533)<br />
<br />
''' License:''' <span class="plainlinks">[http://oss.oracle.com/licenses/GPL-2 GPL-2]</span><br />
<br />
'''End-Of-Life Notice:''' This project is now at the end of its lifespan. New features are no longer being developed. The 0.10 branch of fedfs-utils will continue to be updated as security or compatibility issues arise. There is a [[FedFsUtilsTransitionPlan|plan]] for transitioning some components to other projects.<br />
<br />
== FedFS introduction ==<br />
RFC 5716 introduces the Federated File System (FedFS, for short). FedFS is an extensible standardized mechanism by which system administrators construct a coherent namespace across multiple file servers using ''file system referrals.''<br />
<br />
A file system referral is like a symbolic link to another file system share, but it is not visible to applications. It behaves like an automounted directory where a new file system mount is done when an application first accesses that directory. Today, file system referral mechanisms exist in several network file system protocols.<br />
<br />
Thus FedFS does not require any change to file system protocols or client implementations. FedFS provides its namespace features using referral mechanisms already built in to network file system protocols.<br />
<br />
As a result, FedFS provides network file system namespace configuration to file system clients via network file systems themselves, rather than via side-band protocols like NIS. Clients automatically share a common view of the network file system namespace with no need for individual configuration on each client.<br />
<br />
Currently, the Linux FedFS implementation supports only NFS version 4 referrals. More on NFS version 4 referrals can be found in RFC 7530 and RFC 5661. FedFS may support other network file system protocols in the future.<br />
<br />
== Package Overview ==<br />
''The code provided in this package is a technology preview. The intent is to provide a full and supported Linux FedFS client and server implementation based on this code. Programming and user interfaces may change significantly for the next few releases.''<br />
<br />
The components in this package are used for managing file system referrals in order to create a global network file system namespace. Installable components include:<br />
<br />
* Components to enable Linux NFS clients to discover and mount FedFS domain roots<br />
* Components to enable Linux NFS servers to participate in FedFS domains<br />
* Tools to manage NFSv4 referrals on Linux NFS servers<br />
* Tools to administer FedFS domains<br />
<br />
The INSTALL file in this distribution explains more about how to build these components, and which of these components to install on what systems.<br />
<br />
== Distribution Information ==<br />
<br />
* [https://fedoraproject.org/wiki/Features/FedFS Fedora Project FedFS page]</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_proxy-only_modeNFS proxy-only mode2021-05-13T19:12:01Z<p>Bfields: </p>
<hr />
<div>It could be useful to have a mode where an NFS server is dedicated to reexporting all the exports from *one* other NFS server. It would have no other exports whatsoever.<br />
<br />
This would allow the re-export server to support crossmount-like behavior, skip adding its own filesystem identifier to each filehandle (fixing problems with filehandle length limits), and avoid the need for manual assignment of filesystem identifiers with the fsid= option.<br />
<br />
Possible implementation (needs more details; NFSv4 only for now?):<br />
<br />
- Create a new /proc/fs/nfsd/proxy_only file. Before starting the server, mount "/" from the original NFS server, then write the path to the mount to /proc/fs/nfsd/proxy_only. This interface is per-container. It also works for v3, which wouldn't currently be possible with in-kernel mounting, though the feature is not as useful in that case since nested v3 mounts are rarer.<br />
<br />
- the NFS mount can't allow redirection to other servers, unless those servers observe all the same filehandles.<br />
<br />
- Given a filehandle, map it to an export using a GETATTR to the server to get at least the fsid, fileid, and file type. If it's a directory, it should be possible to connect it up to the pseudoroot using LOOKUPP. Find or create an export from the resulting struct path, cloning the parameters of the root export.<br />
<br />
- If it's *not* a directory, and not already cached, then create a temporary vfsmount and export rooted at that one file. If you've never seen this fsid before, you'll also have to create a superblock. As far as I can tell, s_root on a given nfs superblock is not important, so it's OK for it to point at this file, even as it later accumulates the rest of the filesystem? But I don't think that's true for export and vfsmount, hence the temporary objects.<br />
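The filehandle-to-export mapping sketched above might look roughly like this (a Python sketch; origin stands in for the kernel's NFS-client interface to the original server, and every call on it is hypothetical):<br />

```python
def resolve_export(fh, origin, export_cache):
    """Map an incoming filehandle to an export on the proxy server.
    Directories are connected to the pseudoroot by walking up with
    LOOKUPP; non-directories get a temporary export rooted at that
    one file, keyed by (fsid, fileid)."""
    attrs = origin.getattr(fh)            # need at least fsid, fileid, type
    key = (attrs["fsid"], attrs["fileid"])
    if key in export_cache:
        return export_cache[key]
    if attrs["type"] == "DIR":
        # Walk up with LOOKUPP until the pseudoroot, then build an export
        # for the resulting path, cloning the root export's parameters.
        path = []
        cur = fh
        while not origin.is_pseudoroot(cur):
            cur, name = origin.lookupp(cur)
            path.append(name)
        export = ("export", tuple(reversed(path)))
    else:
        # Non-directory: temporary vfsmount/export rooted at this one file.
        export = ("temp-export", key)
    export_cache[key] = export
    return export
```

The cache models the "if not already cached" step; a real implementation would hold struct path/vfsmount objects rather than tuples.<br />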
<br />
- In theory, this could work with a filesystem other than NFS, if there was a filesystem or group of filesystems that coordinated their filehandles.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-04-19T15:37:30Z<p>Bfields: /* fsid= required, crossmnt broken */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
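For example, a minimal setup following the advice above might look like this (server name and paths are illustrative):<br />

```shell
# On the re-export server: mount the original server read-only, without NLM.
mount -t nfs -o ro,nolock origin.example.com:/export /srv/reexport

# /etc/exports on the re-export server; the fsid= option is mandatory here.
/srv/reexport  *(ro,fsid=1000,no_subtree_check)
```

After editing /etc/exports, re-read it with "exportfs -ra" as usual.<br />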
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [[NFS proxy-only mode]] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
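To make the arithmetic concrete, here's a hedged sketch of the wrapping and the size check (the header layout is invented for illustration; knfsd's actual filehandle encoding differs):<br />

```python
NFS2_FHSIZE, NFS3_FHSIZE, NFS4_FHSIZE = 32, 64, 128

def wrap_filehandle(origin_fh, export_id, vers=4):
    """Wrap the original server's filehandle with a local export identifier,
    as a re-export server must, and fail when the result would exceed the
    protocol's filehandle size limit.  The 6-byte header is hypothetical."""
    header = b"RX" + export_id.to_bytes(4, "big")
    wrapped = header + origin_fh
    limit = {2: NFS2_FHSIZE, 3: NFS3_FHSIZE, 4: NFS4_FHSIZE}[vers]
    if len(wrapped) > limit:
        raise ValueError("wrapped filehandle exceeds %d bytes" % limit)
    return wrapped
```

This shows why re-exporting a maximal 64-byte NFSv3 filehandle to NFSv3 clients cannot work, while the same handle still fits comfortably under the NFSv4 limit.<br />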
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the filehandle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports, even if each container handles only one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO on them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
Best is not to do this; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
<br />
== open DENY bits ignored ==<br />
<br />
Since NFSv4, NFS supports the ALLOW and DENY share bits taken from Windows, which allow you, for example, to open a file in a mode that forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-04-19T15:37:17Z<p>Bfields: /* fsid= required, crossmnt broken */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: it's purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest.<br />
<br />
One idea might be an [NFS proxy-only mode] where a server is dedicated to reexporting the filesystems of exactly *one* other server, as is.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
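To illustrate the arithmetic (the fixed 8-byte header below is purely hypothetical; knfsd's actual wrapping format differs), a sketch of why repeated re-export overflows the protocol limits:

```python
# Hypothetical sketch of re-export filehandle wrapping, NOT knfsd's real format.
NFS_FH_MAX = {2: 32, 3: 64, 4: 128}  # per-version filehandle size limits, bytes

def wrap_fh(original_fh: bytes, export_id: int, overhead: int = 8) -> bytes:
    """Prepend a fixed-size header identifying the export on this server."""
    return export_id.to_bytes(overhead, "big") + original_fh

fh = bytes(28)                        # a typical fixed-size v3 filehandle
assert len(wrap_fh(fh, 1)) <= NFS_FH_MAX[3]   # one re-export still fits

big = bytes(50)                       # a server already near the limit...
twice = wrap_fh(wrap_fh(big, 1), 2)   # ...re-exported through two servers
print(len(twice) > NFS_FH_MAX[3])     # True: 66 bytes exceeds the 64-byte v3 limit
```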
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container handles only one export.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably due to the fact that an NFSv4 client has to open files to perform IO to them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
It's best to avoid this combination; use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
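One knob for that (an untested assumption on my part, not a verified fix): lowering vm.vfs_cache_pressure biases the kernel against reclaiming dentry/inode caches relative to the page cache:

```shell
# /etc/sysctl.d/90-nfs-reexport.conf (hypothetical file name):
# values below the default of 100 make the kernel more reluctant to
# evict inodes and dentries from cache
vm.vfs_cache_pressure = 10
```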
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
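Sketch of that workaround (the BDI id and readahead value below are examples; the id is per-mount):

```shell
# Find the NFS mount's backing-device id in the DEV column:
cat /proc/fs/nfsfs/volumes

# Then raise readahead for that BDI (0:52 and 16384 are example values):
echo 16384 > /sys/class/bdi/0:52/read_ahead_kb
```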
<br />
== open DENY bits ignored ==<br />
<br />
Since NFSv4, NFS supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFSv4_quota_supportNFSv4 quota support2021-04-01T21:03:10Z<p>Bfields: Created page with "NFSv3 had the RQUOTA protocol. NFSv4 has three read-only attributes (see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following) that support quotas, but Linux has n..."</p>
<hr />
<div>NFSv3 had the RQUOTA protocol. NFSv4 has three read-only attributes (see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following) that support quotas, but Linux has not implemented them.<br />
<br />
The linux implementation of RQUOTA seems to live along with the local filesystem quota utilities at https://sourceforge.net/projects/linuxquota/. The same "quota" tool can make either local filesystem calls or RQUOTA calls to the server. People can continue using RQUOTA with NFSv4, but one of the goals of the NFSv4 protocol was to move all functionality into the single NFS protocol (among other reasons, to simplify firewall traversal), so ideally we should make it possible to query quotas over NFSv4 as well.<br />
<br />
I can't find any real documentation of the RQUOTA protocol, only the .x file: https://git.kernel.org/pub/scm/utils/quota/quota-tools.git/tree/rquota.x<br />
<br />
Some local filesystems' quotas seem to be managed using an actual on-disk file whose format is understood by both the kernel filesystem code and the userspace "quota" tool. That isn't an appropriate interface for the NFS filesystem. But filesystems are also able to hide the on-disk quota information behind system calls to get and set quotas. Perhaps NFS could use some subset of that interface (see quotactl(2)), or perhaps it needs to define its own simpler calls.<br />
<br />
The NFSv4 attributes take a filehandle, and are meant to report quotas associated with the given filehandle (exactly what "associated with" means is left unspecified). By contrast, RQUOTA and quotactl can take things like uids, gids, and project ids, and both get and set a wider variety of quotas (like number of files, for example).<br />
<br />
Given that, I wonder whether the NFSv4 attributes are worth implementing, or (if we decide the feature is necessary), if we'd be better off defining additional operations closer to the ones in RQUOTA.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-04-01T19:56:05Z<p>Bfields: /* quotas */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system, it's more flexible about delegating administrative rights to users.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
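For example, the btrfs-send/btrfs-receive option above might look like this (volume names, dates, and hosts are hypothetical):

```shell
# Read-write server: take a new read-only snapshot of the exported volume,
# then send only the incremental difference against the previous release
btrfs subvolume snapshot -r /export/vol /export/vol@2021-04
btrfs send -p /export/vol@2021-03 /export/vol@2021-04 | \
    ssh replica btrfs receive /export/snapshots

# Replica: stop nfsd, point the export at the new snapshot, restart;
# as noted above, clients should then see only a brief delay.
```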
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or entirely delete files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
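For the referral-based approach, knfsd already has the refer= export option (see exports(5)); a sketch with hypothetical hosts and paths:

```shell
# /etc/exports on the namespace server: clients walking into /export/projects
# receive an NFSv4 referral to the server actually hosting that volume
/export           *(ro,fsid=0,crossmnt)
/export/projects  *(ro,refer=/export/projects@fileserver2.example.com)
```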
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
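For a sense of the usability gap the to-do items above are addressing, compare the POSIX and NFSv4 command-line tools today (principal names are hypothetical):

```shell
# POSIX ACLs: grant a user read/write with the familiar setfacl syntax
setfacl -m u:alice:rw project.txt

# NFSv4 ACLs via nfs4-acl-tools: the ACE spelling is type:flags:principal:perms
nfs4_setfacl -a A::alice@example.com:RW project.txt
nfs4_getfacl project.txt
```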
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
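As a sketch of the XFS project-quota route (ids, names, and paths are hypothetical, and the filesystem must be mounted with the prjquota option):

```shell
# Define a project covering one user's tree:
echo "42:/srv/nfs/alice" >> /etc/projects
echo "alice_home:42"     >> /etc/projid

# Initialize it and set a hard block limit, AFS-volume-quota style:
xfs_quota -x -c 'project -s alice_home' /srv/nfs
xfs_quota -x -c 'limit -p bhard=10g alice_home' /srv/nfs
```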
<br />
Note that NFS, ext4, XFS, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on [[NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-04-01T19:55:47Z<p>Bfields: /* quotas */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGs ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring and give it new tokens. systemd kind of stuck a spike in that, though, by doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key() on entry to each filesystem method that doesn't take a file* and use that to cache the credentials it is using. If there is no key, it can make one up on the spot and stick the uid/gid/groups in there. This would then need to be handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct private data."<br />
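<br />
A rough illustration of the session-keyring analogue to a PAG, using keyctl(1) from keyutils (the principal name is illustrative):<br />

```
keyctl session -          # start a shell with a fresh session keyring (the "PAG")
kinit alice@EXAMPLE.ORG   # credentials acquired now are scoped to that keyring
keyctl show @s            # list keys attached to the session keyring
```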
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems support only POSIX ACLs. An attempt was made to support NFSv4 ACLs ("richacls"), but it wasn't accepted upstream, so knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that's more similar to the POSIX ACL utilities (getfacl/setfacl).<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them against what was requested, and summarize the results in a way that's usable even when converting large numbers of files. I believe setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch should show the results of any server-side mapping; but test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
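<br />
As a concrete sketch of the first to-do item above, mapping groups of NFSv4 permission bits down to read/write/execute: the bit values below are from RFC 5661, but the constant and function names are made up for illustration and are not part of nfs4-acl-tools:<br />

```python
# Collapse an NFSv4 allow-ACE access mask into POSIX-style rwx,
# reporting any bits that don't map cleanly.

ACE4_READ_DATA       = 0x00000001
ACE4_WRITE_DATA      = 0x00000002
ACE4_APPEND_DATA     = 0x00000004
ACE4_EXECUTE         = 0x00000020
ACE4_DELETE_CHILD    = 0x00000040
ACE4_READ_ATTRIBUTES = 0x00000080
ACE4_READ_ACL        = 0x00020000

# Bits we treat as implied by plain "r", "w", or "x".
READ_BITS  = ACE4_READ_DATA | ACE4_READ_ATTRIBUTES | ACE4_READ_ACL
WRITE_BITS = ACE4_WRITE_DATA | ACE4_APPEND_DATA
EXEC_BITS  = ACE4_EXECUTE

def mask_to_rwx(mask):
    """Return (rwx string, leftover bits) for an allow-ACE mask."""
    rwx = ("r" if mask & READ_BITS else "-") \
        + ("w" if mask & WRITE_BITS else "-") \
        + ("x" if mask & EXEC_BITS else "-")
    leftover = mask & ~(READ_BITS | WRITE_BITS | EXEC_BITS)
    return rwx, leftover

print(mask_to_rwx(ACE4_READ_DATA | ACE4_EXECUTE))        # ('r-x', 0)
print(mask_to_rwx(ACE4_WRITE_DATA | ACE4_DELETE_CHILD))  # ('-w-', 64)
```

Whatever ends up in the leftover mask (ACE4_DELETE_CHILD above) is exactly what a simplified display would still have to surface to the user somehow.<br />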
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
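<br />
For comparison with pts creategroup/pts adduser, the basic FreeIPA operations look like this (group and user names illustrative; unlike pts, these need admin or delegated privilege):<br />

```
ipa group-add webvol-users --desc="users of the web volume"
ipa group-add-member webvol-users --users=alice --users=bob
```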
<br />
== quotas ==<br />
<br />
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.<br />
<br />
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)<br />
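<br />
A sketch of the XFS project-quota variant (device, paths, and project ID are illustrative):<br />

```
# Export filesystem mounted with project quotas enabled:
mount -o prjquota /dev/vg0/exports /export
# Mark a subtree as project 42 and cap it at 10G, volume-quota style:
xfs_quota -x -c 'project -s -p /export/alice 42' /export
xfs_quota -x -c 'limit -p bhard=10g 42' /export
```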
<br />
Note that NFS, ext4, XFS, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some [[notes on NFSv4 quota support]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate a particular installation over.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
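<br />
For reference, producing such a dump is a one-liner on the AFS side (volume and file names illustrative; -time 0 requests a full dump):<br />

```
vos dump -id user.alice -time 0 -file user.alice.dump
```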
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfields
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them against what was requested, and summarize the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping; but test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
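<br />
The core of the AFS-to-NFSv4 ACL translation above would be a table mapping AFS's "rlidwka" rights onto NFSv4 ACE mask bits. Here is a sketch: the ACE4_* values are taken from RFC 5661, but the mapping table itself is only a proposal and would need review (for example, 'k'/lock has no direct NFSv4 mask bit and is simply dropped here):<br />
<br />
```python
# NFSv4 ACE mask bits (RFC 5661 section 6.2.1.3.1)
ACE4 = {
    "READ_DATA": 0x0001, "LIST_DIRECTORY": 0x0001,
    "WRITE_DATA": 0x0002, "ADD_FILE": 0x0002,
    "APPEND_DATA": 0x0004, "ADD_SUBDIRECTORY": 0x0004,
    "EXECUTE": 0x0020, "DELETE_CHILD": 0x0040,
    "READ_ATTRIBUTES": 0x0080, "WRITE_ATTRIBUTES": 0x0100,
    "READ_ACL": 0x20000, "WRITE_ACL": 0x40000, "WRITE_OWNER": 0x80000,
}

# Proposed AFS right -> NFSv4 mask mapping for directory ACEs.
# This table is a sketch, not a standard; 'k' (lock) is dropped.
AFS_TO_NFS4 = {
    "r": ACE4["READ_DATA"] | ACE4["READ_ATTRIBUTES"],
    "l": ACE4["LIST_DIRECTORY"] | ACE4["EXECUTE"] | ACE4["READ_ACL"],
    "i": ACE4["ADD_FILE"] | ACE4["ADD_SUBDIRECTORY"],
    "d": ACE4["DELETE_CHILD"],
    "w": ACE4["WRITE_DATA"] | ACE4["APPEND_DATA"] | ACE4["WRITE_ATTRIBUTES"],
    "a": ACE4["WRITE_ACL"] | ACE4["WRITE_OWNER"],
}

def afs_rights_to_nfs4_mask(rights: str) -> int:
    """Translate an AFS rights string (e.g. "rl") into one NFSv4 ACE mask."""
    mask = 0
    for r in rights:
        mask |= AFS_TO_NFS4.get(r, 0)  # unknown rights silently dropped
    return mask
```
<br />
A converter would then emit one ALLOW ACE per AFS ACL entry with the resulting mask, and (per the to-do above) fetch the ACL back to report what the server actually kept.<br />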
<br />
== user and group management ==<br />
<br />
AFS has a "protection server", which you communicate with using the [https://docs.openafs.org/Reference/1/pts.html pts command]; it allows you to set up users and groups and to add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
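<br />
To get a feel for what delegated group management looks like against FreeIPA, here is a sketch that only builds JSON-RPC payloads for the group_add and group_add_member API methods (the group name, description, and user are made up; actually sending these requires an authenticated session against the server's /ipa/session/json endpoint):<br />
<br />
```python
def ipa_request(method, args, options):
    """Build a FreeIPA JSON-RPC payload: params is [positional args, options].
    Sending it requires an authenticated session (Kerberos or form login)."""
    return {"method": method, "params": [args, options], "id": 0}

# Create a group, then add a member -- roughly what AFS's
# "pts creategroup" plus "pts adduser" give you.
create = ipa_request("group_add", ["web-admins"],
                     {"description": "Web server administrators"})
add_member = ipa_request("group_add_member", ["web-admins"],
                         {"user": ["alice"]})
```
<br />
The delegation question is then who is allowed to issue the second call for a given group, which FreeIPA controls with permissions rather than AFS-style group ownership.<br />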
<br />
== quotas ==<br />
<br />
NFSv3 deployments relied on the separate RQUOTA protocol. NFSv4 has three quota attributes (quota_avail_hard, quota_avail_soft, and quota_used; see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following), but Linux has not implemented them. Maybe we should.<br />
<br />
The Linux implementation of RQUOTA seems to live alongside the local filesystem quota utilities at https://sourceforge.net/projects/linuxquota/. The same "quota" tool can either query local filesystems directly or make RQUOTA calls to the server. One of the goals of the NFSv4 protocol was to move all functionality into the single NFS protocol (among other reasons, to simplify firewall traversal), so we should really be shifting people to the NFSv4 attributes instead of the separate RQUOTA protocol. https://sourceforge.net/p/linuxquota/code/ci/master/tree/doc/quotadoc.sgml looks like one place to start understanding how local filesystems deal with quotas.<br />
<br />
Some local filesystems' quotas seem to be managed using an actual on-disk file whose format is understood by both the kernel filesystem code and the userspace "quota" tool. That isn't an appropriate interface for the NFS filesystem. But filesystems can also hide the on-disk quota information behind system calls that get and set quotas. Perhaps NFS could use some subset of that interface (see quotactl(2)), or perhaps it needs to define its own simpler calls.<br />
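<br />
For reference, the quotactl(2) interface that NFS might reuse a subset of looks roughly like this ctypes sketch (the constants and struct layout follow &lt;linux/quota.h&gt; and glibc's &lt;sys/quota.h&gt;; get_user_quota is a hypothetical helper, and the call will simply fail on filesystems without quotas enabled):<br />
<br />
```python
import ctypes

USRQUOTA = 0           # quota type: per-user
Q_GETQUOTA = 0x800007  # from <linux/quota.h>

def QCMD(cmd, qtype):
    # quotactl packs the subcommand and quota type into one word
    return (cmd << 8) | qtype

class Dqblk(ctypes.Structure):
    # struct dqblk, per glibc's <sys/quota.h>
    _fields_ = [
        ("dqb_bhardlimit", ctypes.c_uint64),
        ("dqb_bsoftlimit", ctypes.c_uint64),
        ("dqb_curspace",   ctypes.c_uint64),
        ("dqb_ihardlimit", ctypes.c_uint64),
        ("dqb_isoftlimit", ctypes.c_uint64),
        ("dqb_curinodes",  ctypes.c_uint64),
        ("dqb_btime",      ctypes.c_uint64),
        ("dqb_itime",      ctypes.c_uint64),
        ("dqb_valid",      ctypes.c_uint32),
    ]

def get_user_quota(blockdev, uid):
    """Hypothetical helper: fetch a user's quota from a block device.
    Returns a Dqblk, or None if quotas aren't enabled / we lack permission."""
    libc = ctypes.CDLL(None, use_errno=True)
    dq = Dqblk()
    ret = libc.quotactl(QCMD(Q_GETQUOTA, USRQUOTA), blockdev.encode(),
                        uid, ctypes.byref(dq))
    return dq if ret == 0 else None
```
<br />
Note how little of this is filesystem-specific: the dqblk limits and usage counters are close to what the three NFSv4 quota attributes carry, which suggests the mapping would be straightforward.<br />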
<br />
AFS appears to only have quotas for volumes; per-user quotas are implemented using per-user volumes: https://docs.openafs.org/AdminGuide/HDRWQ234.html<br />
<br />
So, really, NFS/xfs/ext4 quotas may not be what's needed. Instead, we can probably get similar functionality with thin provisioning and "df".<br />
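<br />
The thin-provisioning idea: give each user a thin volume sized to their limit, and ordinary "df"/statvfs() on that volume then reports usage against the limit, much as AFS volume quotas do. A sketch of the reporting side (path is a placeholder; the interesting part is that no quota machinery is involved at all):<br />
<br />
```python
import os

def df_like(path):
    """Report (total, available) bytes for the filesystem holding path,
    the same numbers df prints. On a thin volume sized to a user's limit,
    'total' effectively plays the role of an AFS volume quota."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    avail = st.f_bavail * st.f_frsize
    return total, avail
```
<br />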
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-04-01T01:31:09Z<p>Bfields: /* quotas */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
NFSv3 had the RQUOTA protocol. NFSv4 has three attributes (see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following) that support quotas, but Linux has not implemented them. Maybe we should.<br />
<br />
The linux implementation of RQUOTA seems to live along with the local filesystem quota utilites at https://sourceforge.net/projects/linuxquota/. The same "quota" tool can make calls to either local filesystems or RQUOTA calls to the server. One of the goals of the NFSv4 protocol was to move all functionality into the single NFS protocol (among other reasons, to simplify firewall traversal), so we should really be shifting people to using the NFSv4 attributes instead of the separate RQUOTA protocol. https://sourceforge.net/p/linuxquota/code/ci/master/tree/doc/quotadoc.sgml looks like one place to start understanding how local filesystems deal with quotas.<br />
<br />
Some local filesystems quotas seem to be managed using an actual on-disk file whose format is understood by both the kernel filesystem code and the userspace "quota" tool. That isn't an appropriate interface for the NFS filesystem. But filesystems are also able to hide the on-disk quota information and system calls to get and set quotas. Perhaps NFS could use some subset of that interface (see quotactl(2)), or perhaps it needs to define its own simpler calls.<br />
<br />
AFS appears to only have quotas for volumes; per-user quotas are implemented using per-quota volumes: https://docs.openafs.org/AdminGuide/HDRWQ234.html<br />
<br />
So, really, NFS/xfs/ext4 quotas may not be what's needed. Instead, we can probably get similar functionality with thin provisioning and "df".<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-04-01T00:26:53Z<p>Bfields: /* quotas */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
NFSv3 had the RQUOTA protocol. NFSv4 has three attributes (see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following) that support quotas, but Linux has not implemented them. Maybe we should.<br />
<br />
The Linux implementation of RQUOTA lives alongside the local filesystem quota utilities at https://sourceforge.net/projects/linuxquota/. The same "quota" tool can make either local filesystem calls or RQUOTA calls to the server. One of the goals of the NFSv4 protocol was to move all functionality into the single NFS protocol (among other reasons, to simplify firewall traversal), so we should really be shifting people to the NFSv4 attributes instead of the separate RQUOTA protocol. https://sourceforge.net/p/linuxquota/code/ci/master/tree/doc/quotadoc.sgml looks like one place to start understanding how local filesystems deal with quotas.<br />
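For instance, the same quota(1) front end can be pointed at either kind of filesystem; whether it issues local quotactl() calls or RQUOTA RPCs to the server's rquotad is decided per mount:<br />

```shell
# Hedged sketch: report the invoking user's quota in human-readable units.
# For NFS mounts the tool contacts the server's rquotad; for local
# filesystems it uses the quotactl() system call.
quota -s

# Querying another user typically requires privilege (user name hypothetical):
quota -s -u alice
```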
<br />
Hooking up local filesystem quota support to NFSv4, and then figuring out how to expose that stuff on the client side, could be a reasonable project. We would also need to look at AFS filesystem quotas and compare.<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-03-26T21:47:01Z<p>Bfields: /* known issues */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
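Putting those recommendations together, a minimal setup might look like the following; hostnames, paths, and the fsid value are placeholders:<br />

```shell
# On the re-export server: mount the origin read-only with locking disabled.
mount -t nfs -o ro,nolock origin:/export /srv/reexport

# Re-export it; fsid= is mandatory for any export of an NFS filesystem.
exportfs -o ro,no_subtree_check,fsid=100 '*:/srv/reexport'

# Equivalent /etc/exports entry:
#   /srv/reexport  *(ro,no_subtree_check,fsid=100)
```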
<br />
= known issues =<br />
<br />
== fsid= required, crossmnt broken ==<br />
<br />
The re-export server needs to encode into each filehandle something that identifies the specific filesystem being exported. Otherwise it's stuck when it gets a filehandle back from the client--the operation it uses to map the incoming filehandle to a dentry can't even work without a superblock. The usual ways of identifying a filesystem don't work for the case of NFS, so we require the "fsid=" export option on any re-export of an NFS filesystem.<br />
<br />
Note also that normally you can export a tree of filesystems by exporting only the parent with the "crossmnt" option, and any filesystems underneath are then automatically exported with the same options. However, that doesn't apply to the fsid= option: its purpose is to provide a unique identifier for each export, so it can't be automatically copied to the child filesystems.<br />
<br />
That means that re-exporting a tree of NFS filesystems in that way won't work--clients will be able to access the top-level export, but attempts to traverse mountpoints underneath will just result in IO errors.<br />
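The workaround is to skip "crossmnt" and list every nested NFS filesystem explicitly, each with a distinct fsid (paths hypothetical):<br />

```shell
# Hedged /etc/exports sketch: one entry per NFS filesystem in the tree,
# each with its own unique fsid, instead of a single crossmnt parent.
cat >> /etc/exports <<'EOF'
/srv/reexport           *(ro,no_subtree_check,fsid=100)
/srv/reexport/projects  *(ro,no_subtree_check,fsid=101)
EOF
exportfs -ra    # reapply the export table
```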
<br />
In theory, if the server could at least determine that the filehandle is for an object on an NFS filesystem, and figure out which server the filesystem's from, it could (given some new interface) ask the NFS client to work out the rest. I've got only vague ideas here and no real plan for a fix yet.<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container handles only one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO on them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
The best approach is to avoid this combination: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
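One generic knob for that (not NFS-specific, and worth benchmarking before deploying) is the VFS cache pressure sysctl:<br />

```shell
# Bias the kernel toward retaining inode/dentry caches, so the NFS client on
# the re-export server holds on to inodes (and their NFSv4 opens) longer.
sysctl -w vm.vfs_cache_pressure=1

# Persist across reboots by adding "vm.vfs_cache_pressure = 1" to /etc/sysctl.conf.
```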
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
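Concretely, the readahead of the client-side mount on the re-export server can be raised through its backing device entry in sysfs (mount point hypothetical):<br />

```shell
# Find the device number (major:minor) backing the NFS mount...
bdi=$(mountpoint -d /srv/reexport)

# ...and raise its readahead from the typical 128K default to 1M, so large
# client reads are forwarded to the original server intact.
echo 1024 > /sys/class/bdi/$bdi/read_ahead_kb
```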
<br />
== open DENY bits ignored ==<br />
<br />
NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode which forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server will also not pass them along to the original server, so they will not be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
<br />
== NFS version support ==<br />
<br />
Re-export is also being used as a way to add support for all NFS versions to servers that support only a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/AFS-like_group_management_with_FreeIPAAFS-like group management with FreeIPA2021-03-03T23:10:24Z<p>Bfields: </p>
<hr />
<div>AFS allows any ordinary user, for example amy, to create groups named "amy:groupname": https://docs.openafs.org/Reference/1/pts_creategroup.html. AFS has per-user quotas limiting the number of such groups a user may create.<br />
<br />
It would be possible to emulate this in FreeIPA by creating a permission, privilege, and role for each individual user, though it's a little cumbersome. For example:<br />
<br />
ipa permission-add "create amy groups" --type=group --right=add --filter="(cn=amy-*)"<br />
ipa privilege-add "create amy groups"<br />
ipa privilege-add-permission "create amy groups" --permissions="create amy groups"<br />
ipa role-add role-manage-amy-groups<br />
ipa role-add-member --users=amy role-manage-amy-groups<br />
ipa role-add-privilege --privileges="create amy groups" role-manage-amy-groups<br />
<br />
FreeIPA permits an administrator to give a given user the right to modify the membership of a given group (see the member and membermanager attributes), or to delegate the right to create groups to certain users.<br />
<br />
You can view and modify group membership with "ipa group-add-member" and "ipa group-show".<br />
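For example (group and user names hypothetical):<br />

```shell
ipa group-add amy-widgets --desc="amy's widgets group"   # create the group
ipa group-add-member amy-widgets --users=bob             # add a member
ipa group-show amy-widgets                               # display membership
```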
<br />
There's no way to enforce quotas. This would require someone writing a new plugin. We're not aware of anyone working on it.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-02-23T19:51:17Z<p>Bfields: /* Missing Features */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
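As an illustration, an AFS-style "release" built on the btrfs option might look like this; hosts, paths, and snapshot names are hypothetical:<br />

```shell
# On the read-write server: take a new read-only snapshot of the volume.
btrfs subvolume snapshot -r /srv/volumes/proj /srv/volumes/proj@rel2

# Send only the delta against the previously released snapshot to a replica.
btrfs send -p /srv/volumes/proj@rel1 /srv/volumes/proj@rel2 \
    | ssh replica1 btrfs receive /srv/volumes

# The replica's NFS server would then be switched to export proj@rel2; making
# that switch atomic for clients is the part that still needs tooling.
```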
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. And there seem to be a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than undertaking larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify, or delete entirely, files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
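With knfsd the building block here is the "refer=" export option, which answers lookups with an NFSv4 referral to another server (paths and hostname hypothetical):<br />

```shell
# Hedged /etc/exports sketch: clients walking into /export/projA on this
# server are referred to the same path on server2 via an NFSv4 referral.
cat >> /etc/exports <<'EOF'
/export/projA  *(ro,refer=/export/projA@server2)
EOF
exportfs -ra
```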
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
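The session-keyring mechanism Howells describes can be tried today with keyctl(1); a PAG-like sketch:<br />

```shell
# Start a shell whose processes share a fresh anonymous session keyring,
# analogous to an AFS PAG: keys added here are invisible outside it.
keyctl session - /bin/bash

# Inside the new shell, inspect the session keyring:
keyctl show @s
```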
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing posix ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But, test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server" and you can communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command] which allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== quotas ==<br />
<br />
NFSv3 had the RQUOTA protocol. NFSv4 has three attributes (see https://tools.ietf.org/html/rfc5661#section-5.8.2.28 and following) that support quotas, but Linux has not implemented them. Maybe we should.<br />
<br />
The Linux implementation of RQUOTA lives alongside the local filesystem quota utilities at https://sourceforge.net/projects/linuxquota/. https://sourceforge.net/p/linuxquota/code/ci/master/tree/doc/quotadoc.sgml looks like one place to start understanding how local filesystems deal with quotas.<br />
<br />
Hooking up local filesystem quota support to NFSv4, and then figuring out how to expose that stuff on the client side, could be a reasonable project. We would also need to look at AFS filesystem quotas and compare.<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_re-exportNFS re-export2021-02-23T18:17:07Z<p>Bfields: /* Use cases */</p>
<hr />
<div>The Linux NFS server can export an NFS mount, but that isn't something we currently recommend unless you've done some careful research and are prepared for problems.<br />
<br />
You'll need nfs-utils at least 1.3.5 (specifically, 3f520e8f6f5 "exportfs: Make sure pass all valid export flags to nfsd"). Otherwise, on recent kernels, attempts to re-export NFS will likely result in "exportfs: <path> does not support NFS export".<br />
<br />
The "fsid=" option is required on any export of an NFS filesystem.<br />
<br />
For now you should probably also mount readonly and with -onolock (and don't depend on working file locking), and don't allow the re-exporting server to reboot.<br />
<br />
= known issues =<br />
<br />
== reboot recovery ==<br />
<br />
NFS is designed to keep operating through server reboots, whether planned or the result of a crash or power outage. Client applications will see a delay while the server's down, but as soon as it's back up, normal operation resumes. Opens and file locks held across the reboot will all work correctly. (The only exception is unlinked but still open files, which may disappear after a reboot.)<br />
<br />
But the protocol's normal reboot recovery mechanisms don't work for the case when the re-export server reboots. The re-export server is both an NFS client and an NFS server, and the protocol's equipped to deal with the loss of the server's state, but not with the loss of the client's state.<br />
<br />
Maybe we could keep the client state on low-latency stable storage somehow? Maybe we could add a mechanism to the protocol that allows the client to state that it has lost its protocol state and wants to reclaim? (And then the client would issue reclaims as reclaims from the re-export server's clients came in.) Tentative plan: [[reboot recovery for re-export servers]]<br />
<br />
Maybe the re-export server could take the stateids returned from the server and return them to its clients, avoiding the need for it to keep very much state.<br />
<br />
== filehandle limits ==<br />
<br />
NFS filehandle sizes are limited (to 32 bytes for NFSv2, 64 bytes for NFSv3, and 128 bytes for NFSv4). When we re-export, we take the filehandle returned from the original server and wrap it with some more bytes of our own to create the filehandle we return to clients. That means the filehandles we give out will be larger than the filehandles we receive from the original server. There's no guarantee this will work. In practice most servers give out filehandles of a fixed size that's less than the maximum, so you *probably* won't run into this problem unless you're re-exporting with NFSv2, or re-exporting repeatedly. But there are no guarantees.<br />
<br />
If re-export servers could reuse filehandles from the original server, that'd solve the problem. It would also make it easier for clients to migrate between the original server and other re-export servers, which could be useful.<br />
<br />
The wrapping is needed so that the server can identify, even after it may have long forgotten about that particular filehandle, which export the filehandle refers to, so it can refer the operation to the correct underlying filesystem or server, and so it can enforce export permissions.<br />
<br />
If a server exports only a single NFS filesystem, then there'd be no problem with it reusing the file handle it got from the original server. Possibly that's a common enough use case to be helpful? With containers we could still allow a single physical machine to handle multiple exports even if each container handles only one.<br />
<br />
Cooperating servers could agree on the structure of filehandles in a way that allowed them to reuse each others' filehandles. Possibly that could be standardized if it proved useful.<br />
<br />
== errors on re-exports of NFSv4.0 filesystems to NFSv2/3 clients ==<br />
<br />
When re-exporting NFSv4.0 filesystems, IO errors have been seen after dropping caches on the re-export server. This is probably because an NFSv4 client has to open files to perform IO on them, but an NFSv3 client only provides filehandles, and NFSv4.0 cannot open by filehandle (it can only open by (parent filehandle, filename) pair). NFSv4.1 allows open by filehandle.<br />
<br />
The best approach is to avoid this combination: use NFSv4.1 or NFSv4.2 on the original server, or NFSv4 on the clients.<br />
<br />
If that's not possible, a workaround is to configure the re-export server to be reluctant to evict inodes from cache.<br />
<br />
Some more details at https://lore.kernel.org/linux-nfs/635679406.70384074.1603272832846.JavaMail.zimbra@dneg.com/. Note some other cases there (NFSv3 re-exports of NFSv3) are fixed by patches probably headed for 5.11.<br />
<br />
Maybe the NFSv4.0 client could also be made to support open-by-filehandle by skipping the open and using special stateids instead? I'm not sure.<br />
<br />
== unnecessary GETATTRs ==<br />
<br />
We see unnecessary cache invalidations on the re-export servers; we have some patches in progress that should make it for 5.11 or so (https://lore.kernel.org/linux-nfs/20201120223831.GB7705@fieldses.org/). It looks like they help but don't address every case.<br />
<br />
Also, depending on NFS versions on originating and re-exporting servers, we could probably save some GETATTRs, and set the atomic bit in some cases, if we passed along wcc information from the original server. Requires a special knfsd<->nfs interface. Should be doable.<br />
<br />
== broken file locking ==<br />
<br />
Connectathon locking tests over v4 are currently triggering some kind of memory corruption; still investigating.<br />
<br />
I haven't tested NFSv2/v3 (NLM) file locking yet, but I bet it's broken too.<br />
<br />
== re-export not reading more than 128K at a time ==<br />
<br />
For some reason when the client issues 1M reads to the re-export server, the re-export server breaks them up into 128K reads to the original server. Workaround is to manually increase client readahead; see <br />
https://lore.kernel.org/linux-nfs/1688437957.87985749.1605554507783.JavaMail.zimbra@dneg.com/<br />
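For example (the mount point /mnt/origin is hypothetical; an NFS client mount exposes its backing device under /sys/class/bdi, keyed by the id that `mountpoint -d` prints):<br />

```shell
# On the re-export server, raise readahead for the client mount of the
# original server so 1M reads aren't split into 128K reads upstream.
DEV_ID=$(mountpoint -d /mnt/origin)              # e.g. prints "0:52"
echo 1024 > /sys/class/bdi/$DEV_ID/read_ahead_kb # default is often 128
```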
<br />
== open DENY bits ignored ==<br />
<br />
NFSv4 and later support ALLOW and DENY bits taken from Windows, which allow you, for example, to open a file in a mode that forbids other read opens or write opens. The Linux client doesn't use them, and the server's support has always been incomplete: they are enforced only against other NFS users, not against processes accessing the exported filesystem locally. A re-export server also won't pass them along to the original server, so they won't be enforced between clients of different re-export servers.<br />
<br />
This is probably not too hard to fix, but also probably not a high priority.<br />
<br />
== Delegations unsupported ==<br />
<br />
Currently a re-export server simply won't give out delegations to its clients (if you're looking at the code: this is because the nfs filesystem sets its setlease method to simple_nosetlease). This is correct but probably suboptimal.<br />
<br />
= Known problems that we've fixed =<br />
<br />
* Problems with sporadic stale filehandles should be fixed by https://lore.kernel.org/linux-nfs/20201019175330.595894-1-trondmy@kernel.org/ (queued for 5.11?)<br />
* Pre/post-operation attributes are incorrectly returned as if they were atomic in cases when they aren't. We have fixes for 5.11.<br />
<br />
= Use cases =<br />
<br />
== Scaling read bandwidth ==<br />
<br />
You should be able to scale bandwidth by adding more re-export servers; fscache on the re-export servers should also help.<br />
<br />
== Hiding latency of distant servers ==<br />
<br />
You should also be able to hide latency when the original server is far away. AFS read-only replication is an interesting precedent here, often used to distribute software that is rarely updated. [https://cernvm.cern.ch/fs/ CernVM-FS] occupies a similar niche. fscache should help here too.<br />
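A sketch of combining re-export with fscache (hostname and paths are hypothetical; assumes the cachefilesd daemon is installed):<br />

```shell
# Mount the distant original server with the 'fsc' option so reads are
# cached on local disk by cachefiles, then export that mount to nearby clients.
systemctl enable --now cachefilesd
mount -t nfs -o vers=4.2,fsc origin.example.com:/export /srv/reexport
```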
<br />
== NFS version support ==<br />
<br />
It's also being used as a way to add support for all NFS versions to servers that only support a subset. Careful attention to filehandle limits is required.</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-02-04T14:52:31Z<p>Bfields: /* replication and migration */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional Unix system, it is more flexible about delegating administrative rights to users.<br />
<br />
== replication and migration ==<br />
<br />
AFS supports fast clones using COW, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. In theory users don't notice, since the volumes never go offline; they simply see all the changes appear at once. There's coordination to handle the case when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines while in active use, in theory without users noticing anything.<br />
<br />
For NFS migration we need to preserve filehandles, so we need to migrate at the block level or use fs-specific send/receive. The protocol side can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few Linux options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) This looks promising: the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction above the tools listed here. But that might be the layer we actually want to interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on it, but it could serve as a proof of concept or starting point.<br />
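As a sketch of the btrfs route (volume names and hosts are made up; the final export swap would need the brief server pause described in the thin_delta item):<br />

```shell
# On the read-write server: take a read-only snapshot and send only the
# delta against the previously released snapshot.
btrfs subvolume snapshot -r /vols/proj /vols/proj@new
btrfs send -p /vols/proj@prev /vols/proj@new | ssh replica btrfs receive /vols
# On the replica: stop nfsd, repoint the exported path at /vols/proj@new,
# and restart; clients should see only a brief delay.
```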
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but at a minimum there's a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that proxying gives up infrastructure AFS users seem to like (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or outright delete files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep older versions around and serve files from them in case filehandle lookups against the new copy fail; the old versions would be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
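The second idea can be sketched as a toy model: each released read-only version becomes a "generation", lookups fall back from the newest generation to older ones, and old generations are pruned once no open file still resolves only there. All names and structures here are hypothetical illustrations, not existing knfsd code:<br />
<br />
```python
class ReplicaServer:
    """Toy read-only replica that keeps older released versions around
    so filehandles stay resolvable while the files are still open."""

    def __init__(self):
        self.generations = []   # oldest .. newest, each {filehandle: data}

    def release(self, volume):
        """A new read-only version arrives (AFS 'vos release' analogue)."""
        self.generations.append(dict(volume))

    def lookup(self, fh):
        """Resolve fh in the newest generation, falling back to older
        ones so files removed by a release stay readable while open."""
        for gen in reversed(self.generations):
            if fh in gen:
                return gen[fh]
        raise FileNotFoundError(f"stale filehandle {fh!r}")

    def prune(self, open_fhs):
        """Drop old generations nothing still needs: keep an old one
        only if some open filehandle resolves there but not in the
        newest release."""
        if not self.generations:
            return
        newest = self.generations[-1]
        keep = [g for g in self.generations[:-1]
                if any(fh in g and fh not in newest for fh in open_fhs)]
        self.generations = keep + [newest]
```
<br />
This is where the space-efficiency hope comes in: each generation would really be a dm snapshot, not a full copy.<br />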
<br />
If we use NFSv4 proxies instead, the proxies will hold opens or delegations on the files on the original server, which will prevent them from being deleted while in use. The remaining problem is server reboots. That's partially worked around with silly-rename; server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose: give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas.<br />
<br />
On the NFS side, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
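For the referral approach specifically, knfsd can already hand out NFSv4 referrals through the refer= export option documented in exports(5). A hypothetical /etc/exports fragment stitching two volume servers into one namespace (all hostnames and paths are made-up examples) might look like:<br />
<br />
```
# Namespace root server: clients walking into /export/vol1 or
# /export/vol2 get NFSv4 referrals to the servers actually
# hosting those volumes.
/export         *(ro,fsid=0)
/export/vol1    *(ro,refer=/export/vol1@nfs1.example.com)
/export/vol2    *(ro,refer=/export/vol2@nfs2.example.com)
```
<br />
The client follows the referral transparently on lookup; FedFS would sit above this, generating such referral points from its database.<br />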
<br />
== PAGs ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support POSIX ACLs. An attempt was made to support NFSv4 ACLs ("richacls"), but it hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that's more similar to the POSIX ACL utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details on [[AFS to NFSv4 ACL conversion]].<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
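To make the first two to-do items concrete, here's a Python sketch: it collapses an NFSv4 access mask to a familiar rwx string for display, and gives one possible AFS "rlidwka"-to-NFSv4 translation table. The mask constants follow RFC 7530; the rwx grouping and the AFS mapping are assumptions to be validated against real servers, not an established standard:<br />
<br />
```python
# NFSv4 ACE access-mask bits (RFC 7530, section 6.2.1.3.1).
READ_DATA        = 0x00000001   # LIST_DIRECTORY on directories
WRITE_DATA       = 0x00000002   # ADD_FILE on directories
APPEND_DATA      = 0x00000004   # ADD_SUBDIRECTORY on directories
EXECUTE          = 0x00000020
DELETE_CHILD     = 0x00000040
READ_ATTRIBUTES  = 0x00000080
WRITE_ATTRIBUTES = 0x00000100
READ_ACL         = 0x00020000
WRITE_ACL        = 0x00040000

def summarize_mask(mask):
    """Collapse an NFSv4 access mask to "rwx" for simple display.
    This grouping is one plausible choice, not a standard."""
    r = bool(mask & READ_DATA)
    w = bool(mask & (WRITE_DATA | APPEND_DATA))
    x = bool(mask & EXECUTE)
    return "".join(c if ok else "-" for c, ok in
                   (("r", r), ("w", w), ("x", x)))

# Guessed AFS "rlidwka" -> NFSv4 mask translation for directory ACEs;
# AFS "k" (lock) has no direct NFSv4 mask bit.
AFS_TO_NFS4 = {
    "r": READ_DATA,
    "l": READ_DATA | READ_ATTRIBUTES | READ_ACL,   # lookup/list
    "i": WRITE_DATA | APPEND_DATA,                 # insert: add file/subdir
    "d": DELETE_CHILD,
    "w": WRITE_DATA | APPEND_DATA | WRITE_ATTRIBUTES,
    "k": 0,
    "a": WRITE_ACL,
}

def afs_rights_to_mask(rights):
    """Translate an AFS rights string like "rlidwka" to an NFSv4 mask."""
    mask = 0
    for c in rights:
        mask |= AFS_TO_NFS4[c]
    return mask
```
<br />
Running such a translation both ways against a real server (set, fetch back, compare) is exactly what the conversion-testing item above calls for.<br />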
<br />
== user and group management ==<br />
<br />
AFS has a "protection server"; you communicate with it using the [https://docs.openafs.org/Reference/1/pts.html pts command], which lets you set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional Unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting with it is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfields
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS has a "Volume Location Database" that tracks where (machine and partition) a volume is located.<br />
<br />
Fast clones using COW are supported, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers.<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the RO volumes, all the RO volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There is coordination handling for when one or more of the fileservers or the Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.<br />
<br />
There are moves afoot to add multi-hosted RW volumes, but I'm not sure how that'll work, and may involve Ceph integration. But it's not there yet.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
Advantages are that we already have seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify or delete entirely files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain, in theory we could use that. Handling kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a volume location database that keeps track of volumes and where they're hosted. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.<br />
<br />
Within a given domain, We can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support "posix" ACLs. An attempt was made to support NFSv4 ACLs ("richacls") but hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and posix ACLs. Posix ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs, unfortunately, are yet again a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a commandline interface that's more similar to the posix acl utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
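The first to-do item above — collapsing groups of NFSv4 permission bits into r/w/x for display — can be sketched quickly. The bit values below are from RFC 7530, but the particular groupings chosen are an assumption about what a friendlier tool might do:<br />

```python
# Sketch of the "display simpler bits" idea: collapse an NFSv4 ACE
# mask to r/w/x where the mask covers a standard grouping, falling
# back to showing leftover bits in hex. Groupings are illustrative.

READ_SET  = 0x00001 | 0x00080 | 0x20000  # read_data, read_attrs, read_acl
WRITE_SET = 0x00002 | 0x00004 | 0x00100  # write_data, append, write_attrs
EXEC_SET  = 0x00020                      # execute

def summarize(mask):
    out, rest = "", mask
    for letter, group in (("r", READ_SET), ("w", WRITE_SET), ("x", EXEC_SET)):
        if mask & group == group:
            out += letter
            rest &= ~group
        else:
            out += "-"
    # Bits that don't fit the simple model are shown raw, so nothing
    # is silently hidden from the user.
    return out if rest == 0 else f"{out}+{rest:#x}"

print(summarize(READ_SET | EXEC_SET))  # "r-x"
print(summarize(READ_SET | 0x40000))   # read plus write_acl: "r--+0x40000"
```

The fallback path matters: the whole reason NFSv4 ACLs exist is the extra bits, so a simplified display has to degrade visibly rather than lossily.<br />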
<br />
== user and group management ==<br />
<br />
AFS has a "protection server", which you communicate with using the [https://docs.openafs.org/Reference/1/pts.html pts command]; it allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-02-04T14:46:03Z<p>Bfields: /* global namespace */</p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== replication and migration ==<br />
<br />
AFS has a "Volume Location Database" that tracks where (machine and partition) a volume is located.<br />
<br />
Fast clones using COW are supported, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers.<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the RO volumes, all the RO volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There is coordination to handle cases where one or more of the fileservers or Volume Location servers are offline.<br />
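The release semantics can be modeled as a two-phase switch: every replica stages the new read-only contents, and only after all have staged do they flip an "active generation" pointer, so readers see either the whole old tree or the whole new tree. This is a toy Python model of that idea, not AFS's actual protocol:<br />

```python
# Toy model of an atomic release to read-only replicas: stage the new
# contents everywhere (slow), then flip each replica's active-generation
# pointer (cheap), so no reader ever sees a mix of old and new.

class Replica:
    def __init__(self, contents):
        self.generations = [contents]  # history of released trees
        self.active = 0                # index readers currently see
        self.staged = None

    def stage(self, contents):
        self.staged = contents

    def flip(self):
        self.generations.append(self.staged)
        self.active = len(self.generations) - 1

def release(replicas, contents):
    for r in replicas:
        r.stage(contents)  # phase 1: distribute data, may take a while
    for r in replicas:
        r.flip()           # phase 2: the atomic switch

a, b = Replica({"motd": "old"}), Replica({"motd": "old"})
release([a, b], {"motd": "new"})
print(a.generations[a.active]["motd"], b.generations[b.active]["motd"])
```

A real implementation also has to cope with replicas that fail mid-release, which is what the coordination for offline fileservers and VL servers is about.<br />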
<br />
Volumes can be migrated between machines while in active use, in theory without the user noticing anything.<br />
<br />
There are moves afoot to add multi-hosted RW volumes, possibly involving Ceph integration, but I'm not sure how that will work, and it's not there yet.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction above the options listed here. But that might be the layer we actually want to interact with.<br />
<br />
* lvmsync: appears to be unmaintained, so we wouldn't want to depend on it, but it could serve as a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but at a minimum there's a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
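The "heuristics to make the right choice" could be as simple as ordering the fs_locations replica list by measured round-trip time and failing over down the list. A toy sketch, with invented server names and RTTs:<br />

```python
# Toy replica-selection heuristic for the client-side choice described
# above: given the servers returned via fs_locations, prefer live
# replicas with the lowest measured round-trip time. All names and
# numbers here are invented for illustration.

def order_replicas(replicas):
    # replicas: list of (server, rtt_ms, up)
    usable = [r for r in replicas if r[2]]
    return [server for server, rtt, up in sorted(usable, key=lambda r: r[1])]

replicas = [("nfs1.example.com", 40.0, True),
            ("nfs2.example.com", 2.5, True),
            ("nfs3.example.com", 11.0, False)]  # currently down

preference = order_replicas(replicas)
print(preference[0])  # nearest live replica: nfs2.example.com
```

Failover then means retrying against `preference[1]` when the first choice stops responding; the hard parts in practice are detecting "down" quickly and re-probing so a recovered nearby replica gets used again.<br />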
<br />
We also have [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha] and [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. And I think there are a lot of opportunities for incremental progress by fixing problems in existing NFS code, rather than undertaking larger, riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify, or delete entirely, files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now.<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep older versions around and serve files from them in case filehandle lookups against the new copy fail, removing them only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
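The second approach amounts to a generation-aware filehandle lookup: try the newest released tree first, then fall back through retained older snapshots. A toy model, with generations represented as plain dicts:<br />

```python
# Sketch of keeping old released versions around: a replica resolves a
# filehandle against the newest generation first, then falls back to
# retained older generations (e.g. dm snapshots), so files deleted by
# a release stay reachable while still in use.

def lookup(generations, fh):
    # generations: newest first
    for gen in generations:
        if fh in gen:
            return gen[fh]
    raise FileNotFoundError(fh)

old = {"fh-1": "v1 of binary", "fh-2": "helper"}
new = {"fh-3": "v2 of binary"}  # release replaced fh-1 and dropped fh-2

# A process still holding fh-2 keeps working after the release:
print(lookup([new, old], "fh-2"))  # -> "helper"
```

The unsolved part is the retention policy: without the open-tracking protocol from the first bullet, the server can only guess when applications have stopped referencing an old generation.<br />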
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly-rename would be a more complete solution.<br />
<br />
== PAGS ==<br />
<br />
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems only support POSIX ACLs. An attempt was made to add NFSv4-style ACLs ("richacls"), but it was never accepted upstream. So knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
<br />
AFS ACLs are, unfortunately, a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and overall closer to NFSv4 ACLs.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that's more similar to the POSIX ACL utilities (getfacl/setfacl).<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
<br />
== user and group management ==<br />
<br />
AFS has a "protection server", which you communicate with using the [https://docs.openafs.org/Reference/1/pts.html pts command]; it allows you to set up users and groups and add ACEs for machines.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
<br />
== volume location database and global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.<br />
<br />
Within one domain, there's a volume location database that keeps track of volumes and where they're hosted. You can make a volume for a particular purpose: give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS] which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database.<br />
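The volume-location indirection is simple to model: mount points name volumes, and a VLDB-like table maps each volume to the servers currently hosting it, so migration or replication only changes the table. A toy sketch with invented names:<br />

```python
# Toy model of a volume location database plus referral-style mount
# points: resolving a path goes path -> volume -> current server list.
# Migrating a volume means updating one VLDB entry; clients following
# referrals pick up the new location. All names are invented.

VLDB = {
    "home.alice": [("nfs2.example.com", "/export/home.alice")],
    "proj.www":   [("nfs1.example.com", "/export/proj.www"),
                   ("nfs3.example.com", "/export/proj.www")],  # replicas
}

MOUNTS = {"/home/alice": "home.alice", "/proj/www": "proj.www"}

def resolve(path):
    volume = MOUNTS[path]
    return VLDB[volume]  # migration only ever touches this mapping

print(resolve("/home/alice"))
```

This is roughly the abstraction FedFS provides for NFS: the namespace table lives in a database, and referrals carry clients from mount point to wherever the volume currently is.<br />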
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>Bfieldshttp://wiki.linux-nfs.org/wiki/index.php/NFS_for_AFS_usersNFS for AFS users2021-02-04T14:38:06Z<p>Bfields: </p>
<hr />
<div>This page tracks some of the obstacles that might keep an AFS user from using NFS instead.<br />
<br />
= Missing Features =<br />
<br />
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system it's more flexible about delegating rights to users to do stuff.<br />
<br />
== volume management and migration ==<br />
<br />
AFS has a "Volume Location Database" that tracks where (machine and partition) a volume is located.<br />
<br />
Fast clones using COW are supported, along with complete copies on other machines.<br />
<br />
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers.<br />
<br />
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.<br />
<br />
When a RW volume is "released" (snapshotted) to the RO volumes, all the RO volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There is coordination to handle cases where one or more of the fileservers or Volume Location servers are offline.<br />
<br />
Volumes can be migrated between machines while in active use, in theory without the user noticing anything.<br />
<br />
There are moves afoot to add multi-hosted RW volumes, possibly involving Ceph integration, but I'm not sure how that will work, and it's not there yet.<br />
<br />
Logical volumes are something AFS users particularly like. You can make a volume for a particular purpose: give particular people access to it, give it some storage, expand and contract it, and move it around. Volumes have quotas built in.<br />
<br />
For NFS migration we need to preserve filehandles, so need to migrate at the block level or using fs-specific send/receive. The protocol can be handled by migrating only entire servers or containers, so that migration can be treated as a server reboot.<br />
<br />
A few options for send/receive:<br />
<br />
* thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the receiving machine. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.<br />
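The "send" half of that sketch might start like this (volume group, pool, and device ids are made up; thin_delta only emits an XML description of which blocks differ, so extracting the data itself remains the missing piece):<br />
<br />
```shell
# on the RW server: take a new snapshot of the exported thin LV
lvcreate -s -n export_snap2 vg/export

# metadata-level diff between the two snapshots' thin device ids
# (ids can be found via 'lvs -o+thin_id'; output is XML listing
# mappings present in one device but not the other)
thin_delta --thin1 1 --thin2 2 /dev/mapper/vg-pool_tmeta
```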
<br />
* btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.<br />
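A minimal sketch of how btrfs send/receive could feed a read-only replica (hostnames and paths are assumptions):<br />
<br />
```shell
# initial full copy of the exported subvolume to the replica
btrfs subvolume snapshot -r /export/vol /export/vol.snap1
btrfs send /export/vol.snap1 | ssh replica btrfs receive /srv/ro

# later releases send only the delta relative to the previous snapshot
btrfs subvolume snapshot -r /export/vol /export/vol.snap2
btrfs send -p /export/vol.snap1 /export/vol.snap2 | ssh replica btrfs receive /srv/ro
```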
<br />
* xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) Looks promising, the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.<br />
<br />
* stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?<br />
<br />
* lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.<br />
<br />
Between LVM and (container-respecting) knfsd, we have a lot of the necessary pieces, but there's at a minimum a lot of tooling and documentation to write before this is usable.<br />
<br />
Clients could be configured to mount particular servers by hand, or they could mount any server and then use [https://tools.ietf.org/html/rfc5661#section-11.9 fs_locations], [https://tools.ietf.org/html/rfc5661#section-11.10 fs_locations_info], or maybe even [https://datatracker.ietf.org/doc/rfc8435/ pnfs flexfiles] to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.<br />
<br />
There is also [https://github.com/nfs-ganesha/nfs-ganesha/wiki Ganesha], and [https://docs.ceph.com/docs/master/cephfs/nfs/ Ganesha/Ceph] (which [https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/ may be capable of multiple read/write servers now]).<br />
<br />
See also [https://docs.openafs.org/AdminGuide/HDRWQ177.html AFS Administrator's guide, Chapter 5: Managing Volumes]<br />
<br />
A partial alternative may be [https://wiki.linux-nfs.org/wiki/index.php/NFS_re-export NFS proxying]. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.<br />
<br />
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. There are also many opportunities for incremental progress by fixing problems with existing NFS code, rather than larger and riskier projects that build new infrastructure.<br />
<br />
A disadvantage may be that AFS users seem to like that infrastructure (the volume abstraction and the VLDB).<br />
<br />
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.<br />
<br />
AFS-like volume replication has a problem: when new read-only versions are released, they may modify, or delete entirely, files that are in use by running processes. I'd expect application crashes. I wonder how AFS administrators deal with that now?<br />
<br />
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions).<br />
<br />
Possible approaches to fix the problem if we wanted to:<br />
* Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.<br />
* When we distribute new versions, allow servers to keep around older versions and serve files from them in the case filehandle lookups against the new copy fail, to be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.<br />
<br />
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent them from being deleted while in use. The remaining problem is server reboots. That is partially worked around by client-side silly-rename; server-side silly-rename would be a more complete solution.<br />
<br />
== PAGS ==<br />
<br />
PAGs (Process Authentication Groups): AFS allows a group of processes to share a common identity, distinct from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html<br />
<br />
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring<br />
and give it new tokens. systemd kind of stuck a spike in that, though, by<br />
doing their own incompatible thing with their user manager service....<br />
<br />
NFS would need to do what the in-kernel AFS client does and call request_key()<br />
on entry to each filesystem method that doesn't take a file* and use that to<br />
cache the credentials it is using. If there is no key, it can make one up on<br />
the spot and stick the uid/gid/groups in there. This would then need to be<br />
handed down to the sunrpc protocol to define the security creds to use.<br />
<br />
The key used to open a file would then need to be cached in the file struct<br />
private data."<br />
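From userspace, the session-keyring mechanism Howells describes looks like this (realm and principal are made up):<br />
<br />
```shell
# run a shell in a fresh anonymous session keyring (a PAG-like container);
# credentials acquired inside it are invisible to processes outside it
keyctl session - /bin/bash

# inside the new session: obtain Kerberos credentials, then inspect
kinit alice@EXAMPLE.COM
keyctl show @s        # list keys attached to this session keyring
```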
<br />
== ACLs ==<br />
<br />
NFSv4 has ACLs, but Linux filesystems support only POSIX ACLs. An attempt to support NFSv4 ACLs ("richacls") was not accepted upstream, so knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are coarser-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.<br />
<br />
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.<br />
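For reference, today's nfs4-acl-tools interface exposes raw NFSv4 ACE strings (the path and group principal below are invented):<br />
<br />
```shell
# show the NFSv4 ACL on a file
nfs4_getfacl /mnt/export/report.txt

# add an Allow ACE granting a group read-data (r), read-attrs (t),
# and read-ACL (c)
nfs4_setfacl -a A:g:staff@example.com:rtc /mnt/export/report.txt
```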
<br />
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are finer-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.<br />
<br />
To do:<br />
<br />
* make NFSv4 ACL tools more usable:<br />
** Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases<br />
** Look for other opportunities to simplify display and editing of NFSv4 ACLs<br />
** Add NFSv4 ACL support to graphical file managers like GNOME Files<br />
** Adopt a command-line interface that is more similar to the POSIX ACL utilities.<br />
** Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.<br />
** In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.<br />
* For AFS->NFS transition:<br />
** Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.<br />
** For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare them, and summarize the results of the conversion in a way that is usable even when converting large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping; but test this to make sure. More details on [[AFS to NFSv4 ACL conversion]]<br />
<br />
* more ambitious options:<br />
** Try reviving [https://lwn.net/Articles/661357/ Rich ACLs]. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.<br />
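As a toy sketch of the AFS-to-NFSv4 translation mentioned above, the following maps an AFS rights string (rlidwka) onto nfs4_setfacl-style permission letters. The specific bit choices are assumptions for illustration, not a worked-out specification:<br />
<br />
```shell
afs_to_nfs4() {
    rights=$1
    out=""
    case $rights in *r*) out="${out}rt" ;; esac  # read  -> read-data + read-attrs
    case $rights in *l*) out="${out}xc" ;; esac  # lookup -> traverse + read-ACL
    case $rights in *i*) out="${out}w"  ;; esac  # insert -> write (add entries)
    case $rights in *d*) out="${out}D"  ;; esac  # delete -> delete-child
    case $rights in *w*) out="${out}aT" ;; esac  # write  -> append + write-attrs
    case $rights in *a*) out="${out}Co" ;; esac  # administer -> write-ACL + chown
    # 'k' (lock) has no NFSv4 ACE letter here; locking is protocol-level
    printf '%s\n' "$out"
}

afs_to_nfs4 rlidwka    # prints rtxcwDaTCo
```

A real tool would additionally need to map AFS's per-directory ACLs onto per-file NFSv4 ACLs, and handle negative AFS ACL entries.<br />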
<br />
== user and group management ==<br />
<br />
AFS has a "protection server", which you communicate with using the [https://docs.openafs.org/Reference/1/pts.html pts command]; it lets you set up users and groups, including entries for machines that can then appear in ACLs.<br />
<br />
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.<br />
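For example, an unprivileged user can create and manage a self-owned group (user and group names here are invented):<br />
<br />
```shell
# alice creates a group she owns -- no root or admin rights needed
pts creategroup -name alice:friends -owner alice

# she can then manage its membership herself
pts adduser -user bob -group alice:friends
pts membership alice:friends
```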
<br />
Our equivalent to the AFS protection server is [https://www.freeipa.org/page/Main_Page FreeIPA]. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.<br />
<br />
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on [[AFS-like group management with FreeIPA]].<br />
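A hedged sketch of the FreeIPA equivalent, assuming a FreeIPA version with membership-manager support (group and user names are made up):<br />
<br />
```shell
# an admin creates the group, then delegates membership management
ipa group-add project-x --desc="Project X"
ipa group-add-member-manager project-x --users=alice

# alice can now add and remove members without admin rights
ipa group-add-member project-x --users=bob
```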
<br />
== global namespace ==<br />
<br />
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.<br />
<br />
We have automounting support, and NFS has standards for DNS-based discovery of servers, so in theory this is all possible. Handling Kerberos users across domains would be interesting.<br />
<br />
Within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there is also [https://wiki.linux-nfs.org/wiki/index.php/FedFsUtilsProject FedFS], which stores the namespace information in a database and provides common protocols for administration tools to manipulate it.<br />
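With knfsd, a referral is a one-line export option; an /etc/exports fragment on one server might look like this (hostnames and paths are invented):<br />
<br />
```shell
# /etc/exports on server1: clients descending into /export/data
# are referred to the copy hosted on server2
/export       *(rw,sec=krb5)
/export/data  *(refer=/export/data@server2)
```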
<br />
= migrating existing AFS installations to NFS =<br />
<br />
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.<br />
<br />
There's a standard AFS dump format (used by [https://docs.openafs.org/AdminGuide/HDRWQ240.html vos dump/vos restore]) that might be worth looking at. It looks simple enough. Maybe also look at [https://github.com/openafs-contrib/cmu-dumpscan cmu-dumpscan].<br />
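The dump side of a migration would start from commands like these (volume name and dates are made up):<br />
<br />
```shell
# full dump of a volume to a file (-time 0 means everything)
vos dump -id home.alice -time 0 -file home.alice.dump

# incremental dump of changes since a given date
vos dump -id home.alice -time "01/15/2022" -file home.alice.incr
```

A migration tool would then need to parse the dump format and recreate the tree (plus translated ACLs) on the NFS server.<br />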
<br />
See also [[AFS to NFSv4 ACL conversion]].</div>