NFS for AFS users
From Linux NFS
Latest revision as of 20:35, 27 January 2022
This page tracks some of the obstacles that might keep an AFS user from using NFS instead.
Missing Features
In general: AFS is administered by a consistent set of commands (fs, pts, vos, uss, bos, backup, fstrace, etc.) which work from any client and identify the user with Kerberos. Compared to a traditional unix system, it is more flexible about delegating administrative rights to ordinary users.
replication and migration
AFS supports fast clones using COW, along with complete copies on other machines.
Currently there can be only one writeable version of a volume, but multiple read-only versions (which all have to be identical). They can be on different servers. (There's also an effort to support multiple writeable volumes, possibly using Ceph, but that's not done yet.)
There can also be a 'backup' volume which is just, say, a daily temporary read-only snapshot of a RW volume and has to be located on the same machine.
When a RW volume is "released" (snapshotted) to the read-only volumes, all the read-only volumes update simultaneously and atomically. The users, in theory, don't notice as the volumes don't go offline - and then they see all the changes happen at once. There's coordination to handle when one or more of the fileservers or the Volume Location servers are offline.
Volumes can be migrated between machines whilst in active use without the user in theory noticing anything.
For NFS migration we need to preserve filehandles, so we need to migrate at the block level or use fs-specific send/receive. At the protocol level, this can be handled by migrating only entire servers or containers, so that clients can treat a migration like a server reboot.
A few Linux options for send/receive:
- thin_delta (from device-mapper-persistent-data) can calculate a metadata-level diff between two volumes. Additional work would be needed to extract the actual data and produce a diff; that would complete the "send" side. We'd also need a "receive" side that could apply the diff and reconstitute the snapshot on the other side. This is being actively worked on. For NFS, on the read-write server we would take a snapshot of the exported volume before sending. On the receive side, after creating the updated snapshot, we would stop the server, unmount the old snapshot, mount the new one, and restart; clients should see only a brief delay.
- btrfs-send/btrfs-receive: this is probably the best-tested send/receive functionality currently available, so if we wanted to start work on a prototype right now, this might be an option.
- xfs volumes loopback-mounted on a backing xfs filesystem, using reflink for snapshots. (See https://lwn.net/Articles/747633/ for some background.) This looks promising: the basic kernel interfaces to find shared extents and such are there, but a lot of userland code remains to be written.
- stratis: this operates at a layer of abstraction over the above. But that might be the layer we want to actually interact with?
- lvmsync: looks possibly unmaintained? We wouldn't want to depend on this. But possibly it could be a proof of concept or starting point.
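To make the send/receive idea concrete, here is a minimal sketch of a block-level incremental transfer. It is not how thin_delta or btrfs-send work internally; it just compares two snapshot images held in memory. The names send, receive, and BLOCK_SIZE are made up for this illustration, and a real tool would read changed extents from snapshot metadata rather than compare every block.

```python
from typing import Dict

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


def send(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block_index: new_data} for every block that differs."""
    delta: Dict[int, bytes] = {}
    nblocks = (max(len(old), len(new)) + block_size - 1) // block_size
    for i in range(nblocks):
        start = i * block_size
        old_block = old[start:start + block_size]
        new_block = new[start:start + block_size]
        if old_block != new_block:
            delta[i] = new_block
    return delta


def receive(old: bytes, delta: Dict[int, bytes], new_len: int,
            block_size: int = BLOCK_SIZE) -> bytes:
    """Apply a delta to a copy of the old snapshot and return the new image."""
    # Start from the old image, truncated or zero-padded to the new length.
    image = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for i, block in delta.items():
        start = i * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

On the receiving server, the reconstructed image would play the role of the updated snapshot that gets mounted in place of the old one.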
Clients could be configured to mount particular servers by hand, or they could mount any server and then use fs_locations (https://tools.ietf.org/html/rfc5661#section-11.9), fs_locations_info (https://tools.ietf.org/html/rfc5661#section-11.10), or maybe even pNFS flexfiles (https://datatracker.ietf.org/doc/rfc8435/) to get lists of servers hosting replicas and pick one. They would need some heuristics to make the right choice. It would also be nice if clients could fail over to a different replica when one goes down.
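One possible heuristic, sketched below, is to probe each replica and prefer the one with the lowest round-trip time, switching only when the current replica stops answering. The function names and the latency table are hypothetical; nothing here reflects what the Linux NFS client actually does with fs_locations today.

```python
from typing import Dict, Optional


def pick_replica(latency_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Pick the reachable replica with the lowest measured latency.

    latency_ms maps a server name to its round-trip time in milliseconds,
    or to None if the last probe of that server failed.
    """
    reachable = {s: t for s, t in latency_ms.items() if t is not None}
    if not reachable:
        return None
    # Break ties by server name so the choice is deterministic.
    return min(reachable, key=lambda s: (reachable[s], s))


def fail_over(current: str, latency_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Stay on the current replica while it answers; otherwise pick another."""
    if latency_ms.get(current) is not None:
        return current
    return pick_replica(latency_ms)
```

Staying on a working replica even when a faster one appears avoids bouncing between servers; whether that is the right trade-off is an open question.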
We also have Ganesha and Ganesha/Ceph (which may be capable of multiple read/write servers now).
See also the AFS Administrator's Guide, Chapter 5: Managing Volumes (https://docs.openafs.org/AdminGuide/HDRWQ177.html).
A partial alternative may be NFS proxying. Like read-only replicas, proxies should be able to hide latency by moving cached data closer to far-flung clients, and scale bandwidth to read-mostly data by taking load off the original server.
One advantage is that we have already seen reports of some success here, using the NFS re-export code together with fscache. I also think there are a lot of opportunities for incremental progress by fixing problems with existing NFS code, rather than taking on larger and riskier projects that build new infrastructure.
A disadvantage may be that AFS users seem to like the infrastructure they already have (the volume abstraction and the VLDB).
Latency-hiding may be particularly tricky; delegation and caching policies may need rethinking. Performance will be more complicated to understand compared to AFS-like read-only replicas.
AFS-like volume replication has a problem: when new read-only versions are released, they may delete files that are in use by running processes. Applications probably don't expect ESTALE on in-use files; I'd expect application crashes. I wonder how AFS administrators deal with that now?
My impression is that AFS doesn't reliably prevent this problem, so instead AFS administrators work around it, for example by keeping old versions of binaries in place (and using symlinks to direct users to the newest versions). So maybe NFS doesn't need to solve this problem either.
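The symlink workaround can be sketched as follows: each version is installed under its own name, and a symlink is switched to the newest one with an atomic rename, so files from older versions are never removed while processes may still be using them. The directory layout and the publish_version helper are hypothetical, and the sketch assumes a POSIX system.

```python
import os


def publish_version(root: str, name: str, version: str) -> None:
    """Point root/name at root/name-version, leaving older versions in place."""
    target = f"{name}-{version}"                 # relative symlink target
    tmp_link = os.path.join(root, f".{name}.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)                      # clean up a stale temporary link
    os.symlink(target, tmp_link)                 # create the new link under a temporary name
    # rename() replaces the old symlink atomically; it does not follow the link.
    os.replace(tmp_link, os.path.join(root, name))
```

In the AFS setting this would be done in the read-write volume before releasing it; old versions can be cleaned up much later, once nothing is expected to be running them.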
Possible approaches to fix the problem if we wanted to:
- Provide some protocol which tracks which files may be open on read-only replicas so that we know not to free those files when they're unlinked.
- When we distribute new versions, allow servers to keep around older versions and serve files from them in case filehandle lookups against the new copy fail. The older versions would be removed only after applications stop referencing them. Hopefully this can be done space-efficiently if the different versions on the replica servers can be represented as dm snapshots.
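A toy model of the second approach, combined with the open-file tracking from the first, might look like the sketch below. Each released version is a dictionary from filehandle to contents; lookups try the newest version first and fall back to older ones, and an old version is dropped once no open filehandle depends on it. The ReplicaServer class and its methods are invented for illustration; a real server would work on snapshots, not dictionaries.

```python
from typing import Dict, List, Optional


class ReplicaServer:
    """Toy read-only replica that keeps old versions while they are in use."""

    def __init__(self) -> None:
        self.versions: List[Dict[str, bytes]] = []  # oldest first, newest last
        self.open_counts: Dict[str, int] = {}       # filehandle -> number of opens

    def release(self, snapshot: Dict[str, bytes]) -> None:
        """Install a new version, then drop old versions nobody needs."""
        self.versions.append(dict(snapshot))
        self._prune()

    def lookup(self, fh: str) -> Optional[bytes]:
        """Try the newest version first, then fall back to older ones."""
        for version in reversed(self.versions):
            if fh in version:
                return version[fh]
        return None  # a real server would return ESTALE here

    def open(self, fh: str) -> None:
        if self.lookup(fh) is None:
            raise FileNotFoundError(fh)
        self.open_counts[fh] = self.open_counts.get(fh, 0) + 1

    def close(self, fh: str) -> None:
        self.open_counts[fh] -= 1
        if self.open_counts[fh] == 0:
            del self.open_counts[fh]
        self._prune()

    def _prune(self) -> None:
        """Keep the newest version plus any older one an open file still needs."""
        if not self.versions:
            return
        newest = self.versions[-1]
        covered = set(newest)
        kept = [newest]
        for version in reversed(self.versions[:-1]):
            if any(fh in version and fh not in covered for fh in self.open_counts):
                kept.append(version)
                covered |= set(version)
        kept.reverse()
        self.versions = kept
```

For example, a file that is open when a release deletes it stays readable from the previous version until it is closed, after which that version is discarded.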
If we use NFSv4 proxies instead, proxies will hold opens or delegations on the files on the original server, which will prevent their being deleted while in use. The problem is server reboots. That's partially worked around with silly-rename. Server-side silly rename would be a more complete solution.
volume location database and global namespace
On an AFS client by default you can look up something like /afs/umich.edu/... and reach files kept in AFS anywhere.
NFS has standards for DNS discovery of a server from a domain; in theory we could use that. Handling Kerberos users across domains would be interesting.
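As a rough illustration of the DNS side: as far as I understand RFC 6641, a client looks up an SRV record named _nfs-domainroot._tcp under the domain to find the servers exporting that domain's root. The sketch below only builds the query name and orders already-fetched records; the SrvRecord type and both functions are hypothetical, and the ordering is a simplification (real SRV handling picks randomly by weight within a priority).

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def domainroot_query_name(domain: str) -> str:
    """Build the SRV query name for a domain's NFSv4 root."""
    return f"_nfs-domainroot._tcp.{domain.strip('.').lower()}"


def order_servers(records: List[SrvRecord]) -> List[SrvRecord]:
    """Order candidates: lowest priority value first, then highest weight."""
    return sorted(records, key=lambda r: (r.priority, -r.weight, r.target))
```

The actual DNS query is left out; any resolver library that can fetch SRV records would do.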
Within one domain, AFS has a "Volume Location Database" that keeps track of volumes and where (machine and partition) they're located. You can make a volume for a purpose; give particular people access to it, give it some storage, expand and contract it and move it around. Volumes have quotas.
With NFS, within a given domain, we can assemble a namespace out of volumes using referrals. For a higher-level approach more similar to AFS's, there's also FedFS, which stores the namespace information in a database and provides common protocols for administration tools to manipulate the database. That just provides namespace-management facilities. If it were combined with a kerberized distributed volume manager built on top of LVM, that might serve as a more complete AFS VLDB replacement.
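To show what such a VLDB replacement would have to track at a minimum, here is a small sketch of a volume location table: volume name to server, partition, and quota, with operations to create, locate, move, and resize. All names are invented for illustration; this is neither the AFS VLDB format nor the FedFS schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Location:
    server: str
    partition: str
    quota_kb: int


class VolumeLocationDB:
    """Toy volume location database keyed by volume name."""

    def __init__(self) -> None:
        self._volumes: Dict[str, Location] = {}

    def create(self, name: str, server: str, partition: str, quota_kb: int) -> None:
        if name in self._volumes:
            raise ValueError(f"volume {name!r} already exists")
        self._volumes[name] = Location(server, partition, quota_kb)

    def locate(self, name: str) -> Location:
        """Answer the question a client asks: where does this volume live?"""
        return self._volumes[name]

    def move(self, name: str, server: str, partition: str) -> None:
        """Record a migration; the name and quota stay the same."""
        loc = self._volumes[name]
        loc.server, loc.partition = server, partition

    def set_quota(self, name: str, quota_kb: int) -> None:
        self._volumes[name].quota_kb = quota_kb
```

Because clients resolve volumes by name, a move only changes the table entry; with NFS the equivalent effect would have to come from updated referrals.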
PAGS
PAGs: AFS allows a group of processes to share a common identity, different from the local uid, for the purposes of accessing an AFS filesystem: https://docs.openafs.org/AdminGuide/HDRWQ63.html
So, for example, if you have multiple Kerberos identities that you use to access AFS, you can pick which one you want to use at a given time, or even use both, each in a different window. We'd like this for NFS as well.
Dave Howells says: "This is why I added session keyrings. You can run a process in a new keyring and give it new tokens. systemd kind of stuck a spike in that, though, by doing their own incompatible thing with their user manager service....
NFS would need to do what the in-kernel AFS client does and call request_key() on entry to each filesystem method that doesn't take a file* and use that to cache the credentials it is using. If there is no key, it can make one up on the spot and stick the uid/gid/groups in there. This would then need to be handed down to the sunrpc protocol to define the security creds to use.
The key used to open a file would then need to be cached in the file struct private data."
So, we have a lot of good kernel infrastructure in place which is designed to do this, but (despite an attempt or two) nobody has managed to quite make it work for NFS yet.
ACLs
NFSv4 has ACLs, but Linux filesystems only support POSIX ACLs. An attempt was made to support NFSv4 ACLs ("richacls"), but it hasn't been accepted upstream. So knfsd is stuck mapping between NFSv4 and POSIX ACLs. POSIX ACLs are more coarse-grained than NFSv4 ACLs, so information can be lost when a user on an NFSv4 client sets an ACL. This makes ACLs confusing and less useful.
There are other servers that support full NFSv4 ACLs, so users of those servers are better off. Our client-side tools could still use some improvements for those users, though.
AFS ACLs, unfortunately, are yet a third style of ACL, incompatible with both POSIX and NFSv4 ACLs. They are more fine-grained than POSIX ACLs and probably closer to NFSv4 ACLs overall.
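To give a feel for how close the two models are, here is a rough sketch that maps the seven AFS directory rights (rlidwka) onto NFSv4 permission letters in the style used by nfs4-acl-tools. The mapping table is my own guess for illustration, not an agreed conversion; in particular it assumes that the AFS lock right (k) has no direct NFSv4 equivalent and simply reports it as lost.

```python
from typing import Dict, Tuple

# AFS right -> (NFSv4 bits for the directory, NFSv4 bits for files inside it).
# Hypothetical mapping for illustration only.
AFS_TO_NFS4: Dict[str, Tuple[str, str]] = {
    "r": ("", "r"),    # read file contents
    "l": ("rx", ""),   # list and traverse the directory
    "i": ("wa", ""),   # add files and subdirectories
    "d": ("D", ""),    # delete entries from the directory
    "w": ("", "wa"),   # write and append to files
    "a": ("C", ""),    # change the ACL
}

NFS4_ORDER = "rwaxdDtTnNcCoy"  # display order for permission letters


def translate(afs_rights: str) -> Tuple[str, str, str]:
    """Return (directory bits, file bits, AFS rights that could not be mapped)."""
    dir_bits, file_bits, lost = set(), set(), []
    for right in afs_rights:
        if right in AFS_TO_NFS4:
            d, f = AFS_TO_NFS4[right]
            dir_bits |= set(d)
            file_bits |= set(f)
        else:
            lost.append(right)

    def ordered(bits: set) -> str:
        return "".join(c for c in NFS4_ORDER if c in bits)

    return ordered(dir_bits), ordered(file_bits), "".join(lost)
```

Since AFS ACLs are set on directories and govern the files inside them, a real converter would also need NFSv4 inheritance flags, which this sketch ignores.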
To do:
- make NFSv4 ACL tools more usable:
- Map groups of NFSv4 permission bits to read, write, and execute permissions so we only have to display the simpler bits in common cases
- Look for other opportunities to simplify display and editing of NFSv4 ACLs
- Add NFSv4 ACL support to graphical file managers like GNOME Files
- Adopt a command-line interface that's more similar to the POSIX ACL utilities.
- Perhaps also look into https://github.com/kvaneesh/richacl-tools as an alternative starting point to nfs4-acl-tools.
- In general, try to make NFSv4 ACL management more similar to management of existing POSIX ACLs.
- For AFS->NFS transition:
- Write code that translates AFS ACLs to NFSv4 ACLs. It should be possible to do this with little or no loss of information for servers with full NFSv4 ACL support.
- For migrations to Linux knfsd, this will effectively translate AFS ACLs to POSIX ACLs, and information will be lost. Test this case. The conversion tool should be able to fetch the ACLs after setting them, compare results, and summarize the results of the conversion in a way that's usable even for conversions of large numbers of files. I believe that setting an ACL is enough to invalidate the client's ACL cache, so a subsequent fetch of an ACL should show the results of any server-side mapping. But test this to make sure. More details are on the AFS to NFSv4 ACL conversion page.
- more ambitious options:
- Try reviving Rich ACLs. Maybe we could convince people this time. Or maybe there's a different approach that would work. Maybe we could find a more incremental route, e.g. by adding some features of richacls to POSIX ACLs, such as the separation of directory write permissions into add and delete, and of file write permissions into modify and append.
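For the first to-do item above (mapping groups of NFSv4 permission bits to read, write, and execute), a display helper could look like the sketch below. The three groups follow the generic R, W, and X aliases as I recall them from the nfs4_setfacl man page (rntcy, watTNcCy, xtcy); treat them as an assumption and check against the tools before relying on them. The summarize function is hypothetical.

```python
# Generic permission groups, assumed to match the nfs4_setfacl R/W/X aliases.
GENERIC = {
    "r": set("rntcy"),
    "w": set("watTNcCy"),
    "x": set("xtcy"),
}


def summarize(perms: str) -> str:
    """Show e.g. 'r-x' when perms is exactly a union of generic groups.

    Anything that does not decompose cleanly is returned unchanged, so no
    information is hidden from the user.
    """
    bits = set(perms)
    chosen = [k for k in "rwx" if GENERIC[k] <= bits]
    covered = set()
    for k in chosen:
        covered |= GENERIC[k]
    if chosen and covered == bits:
        return "".join(k if k in chosen else "-" for k in "rwx")
    return perms
```

Falling back to the raw letters for unusual combinations keeps the simple display honest: it only appears when it loses nothing.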
user and group management
AFS has a "protection server", and you can communicate with it using the pts command, which allows you to set up users and groups and add ACEs for machines.
Compared to traditional unix, it allows wider delegation of management. For example, group creation doesn't require root: https://docs.openafs.org/Reference/1/pts_creategroup.html. Groups have owners, and you can delegate management of group membership: https://docs.openafs.org/Reference/1/pts_adduser.html.
Our equivalent to the AFS protection server is FreeIPA. See also https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_identity_management/index. Installing FreeIPA and experimenting is also useful.
Unlike AFS, FreeIPA doesn't seem to make it easy for ordinary users to create groups. It does allow delegating group management (including adding and removing users). More details on AFS-like group management with FreeIPA.
quotas
AFS has per-volume quotas. There are no per-user quotas that I can see; instead, AFS administrators create volumes for individual users (e.g., for individual home directories), and set quotas on those. Volumes can share the same storage, and it's fine for quotas on volumes to add up to more than the available storage.
We could get similar functionality with LVM thin provisioning or XFS with project quotas. (There is some work needed there to treat projects as separate exports, but that's very doable.)
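The over-commit behavior is easy to state in code. In this sketch, volumes share one pool, each has its own quota, and the quotas may add up to more than the pool holds; a write fails when either the volume's quota or the pool itself would be exceeded. The StoragePool class is a made-up model, not an interface of LVM or XFS.

```python
from typing import Dict


class StoragePool:
    """Toy model of volumes with individual quotas sharing one pool of storage."""

    def __init__(self, capacity_kb: int) -> None:
        self.capacity_kb = capacity_kb
        self.quota_kb: Dict[str, int] = {}
        self.used_kb: Dict[str, int] = {}

    def add_volume(self, name: str, quota_kb: int) -> None:
        # No check against capacity: quotas are allowed to over-commit the pool.
        self.quota_kb[name] = quota_kb
        self.used_kb[name] = 0

    def write(self, name: str, kb: int) -> bool:
        """Account for a write; refuse it if the quota or the pool would overflow."""
        if self.used_kb[name] + kb > self.quota_kb[name]:
            return False
        if sum(self.used_kb.values()) + kb > self.capacity_kb:
            return False
        self.used_kb[name] += kb
        return True

    def overcommit_ratio(self) -> float:
        """Sum of all quotas divided by the real capacity (above 1.0 means over-committed)."""
        return sum(self.quota_kb.values()) / self.capacity_kb
```

With thin provisioning the second failure mode (the pool filling up before any quota is hit) is the one administrators have to monitor for.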
Note that NFS, ext4, xfs, and other filesystems all support per-user (and other) quotas. That's not something AFS has, as far as I know. Some notes on NFSv4 quota support.
migrating existing AFS installations to NFS
Once NFS does everything AFS does, there's still the question of how you'd migrate over a particular installation.
There's a standard AFS dump format (used by vos dump/vos restore) that might be worth looking at. It looks simple enough. Maybe also look at cmu-dumpscan.
See also AFS to NFSv4 ACL conversion.