Pseudofilesystem improvements
From Linux NFS
Latest revision as of 22:37, 23 August 2010
The Problem
NOTE: all of this has since been mostly fixed, so this page is out of date.
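See also this bugzilla bug report (http://bugzilla.linux-nfs.org/show_bug.cgi?id=75), this Red Hat bugzilla report (http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=237108), or a possibly relevant mail thread (http://marc.info?l=linux-nfs&m=117408234516807&w=2).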
While NFSv2 and NFSv3 use a separate mount protocol to discover a server's exported filesystems, NFSv4 uses the same standard filesystem protocol (lookup, readdir, etc.) that is used to traverse within filesystems.
This gives the impression that these filesystems are all mounted on top of a top-level "pseudofilesystem".
Rather than constructing the pseudofilesystem from the list of exports in the /etc/exports file, the nfsd server just uses a real filesystem as the pseudofilesystem, and allows the administrator to mount and export filesystems underneath it. So that the server knows which exported filesystem to use as the pseudofilesystem (the filesystem that NFSv4 clients will see as "/"), that filesystem is marked with the export option "fsid=0".
This system was relatively simple to implement, but has led to severe problems for automount users, and for anyone attempting to migrate from NFSv2/v3 to v4, because v4 clients see different paths than mountd clients.
For example, to quote Trond:
"the current system means that if your export file looks like this:

    /export/home myclient(rw,sync,no_subtree_check,fsid=0)

then that means that an NFSv4 fstab entry on 'myclient' will look like

    myserver:/ /mnt nfs4 rw,hard,intr 0 0

whereas an NFSv3 entry would look like

    myserver:/export/home /mnt nfs rw,hard,intr 0 0

This difference in path semantics means that there is no way we could have 'mount' try NFSv4 first, then automatically fall back to NFSv3 if the server doesn't support NFSv4.

What we ought to do (what Solaris, Netapp,... all do) is for the NFSv4 server to have a pseudo-fs that contains the entries '/', '/export', and '/export/home' so that the NFSv4 client can mount the directory /export/home instead of '/'."
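The path mismatch in the quote can be illustrated with a small sketch. The helper name v4_visible_path is hypothetical and is not part of nfs-utils; it only models the rule described above, namely that under the fsid=0 scheme an NFSv4 client sees paths relative to the pseudofilesystem root, while a MOUNT protocol client sees the server's real path.

```python
import posixpath


def v4_visible_path(pseudo_root, exported_path):
    """Return the path an NFSv4 client would use for exported_path when
    pseudo_root is the export marked fsid=0 (hypothetical helper)."""
    root = posixpath.normpath(pseudo_root)
    path = posixpath.normpath(exported_path)
    # The export must be the pseudo root itself or live underneath it.
    if path != root and not path.startswith(root.rstrip("/") + "/"):
        raise ValueError(f"{path} is not under the pseudofilesystem {root}")
    rel = posixpath.relpath(path, root)
    return "/" if rel == "." else "/" + rel


# Trond's example: /export/home is itself the fsid=0 export, so a v4
# client mounts myserver:/ while a v3 client mounts myserver:/export/home.
print(v4_visible_path("/export/home", "/export/home"))
```

Only when the pseudofilesystem root corresponds to the server's "/" do the v3 and v4 paths coincide, which is what the pseudo-fs proposed in the quote would achieve.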
You could try to solve this problem in this example by creating a separate pseudofilesystem at /var/lib/nfs/v4root, mount --bind'ing /export/home on /var/lib/nfs/v4root/export/home, and creating another export for /var/lib/nfs/v4root/export/home. Then it will be possible to mount myserver:/export/home using either v3 or v4. Unfortunately, anyone using showmount or an automounter will now see a list of exports that looks like
    /export/home
    /var/lib/nfs/v4root
    /var/lib/nfs/v4root/export/home
Also, setting up the pseudofilesystem and creating these extra exports is tedious work for the administrator.
One solution is to modify mountd so that it creates those new exports itself and hides them from the MOUNT protocol.
Solving the problem in mountd
One possible solution can be implemented entirely in mountd, without changing the kernel or any interfaces:
First, if we find "fsid=0" in the /etc/exports file anywhere, then we fall back on the current behavior, to preserve backwards compatibility.
If the file lacks any "fsid=0", then we automatically construct a pseudofilesystem in mountd:
- As above, create a new filesystem at /var/lib/nfs/v4root/ to use as a pseudofilesystem; you'll probably need to loopback-mount a file so the user doesn't have to set aside a separate partition for this.
- For each export in the export file, create a corresponding path under the pseudofilesystem.
- Create a new fsid=0,ro export for the pseudofilesystem.
- For each export in the export file, create a corresponding export for the path under the pseudofilesystem, with the same client and the same options.
- Mark all of these automatically created exports specially so that mountd knows to use them only for answering upcalls from the kernel, and not for responding to MOUNT protocol requests.
The end result is an automatically built filesystem and a set of "shadow" exports that are visible to NFSv4 clients but not to anyone using the MOUNT protocol (NFSv2/3 clients or automounters), so that everyone sees the same export paths.
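The steps above can be sketched roughly as follows. This is only an illustration of the bookkeeping, not mountd code: the Export class, its shadow flag, and the functions build_shadow_exports and mount_protocol_view are hypothetical names, and creating one fsid=0,ro export per client is an assumption, since the list above does not say which clients the pseudofilesystem export should name. Creating the directories and bind mounts is left out.

```python
from dataclasses import dataclass
import posixpath

V4ROOT = "/var/lib/nfs/v4root"  # pseudofilesystem location used in the text


@dataclass(frozen=True)
class Export:
    path: str
    client: str
    options: tuple
    shadow: bool = False  # hypothetical marker: hidden from the MOUNT protocol


def build_shadow_exports(exports, v4root=V4ROOT):
    """Return the automatically created exports for an export list."""
    # An explicit fsid=0 anywhere means: keep the current behavior.
    if any(opt == "fsid=0" for e in exports for opt in e.options):
        return []
    shadows = []
    # Assumption: one read-only fsid=0 export of the pseudo root per client.
    for client in sorted({e.client for e in exports}):
        shadows.append(Export(v4root, client, ("ro", "fsid=0"), shadow=True))
    # Mirror each real export under the pseudo root, same client and options.
    for e in exports:
        shadow_path = posixpath.join(v4root, e.path.lstrip("/"))
        shadows.append(Export(shadow_path, e.client, e.options, shadow=True))
    return shadows


def mount_protocol_view(all_exports):
    """Exports that showmount and automounters would be shown."""
    return [e for e in all_exports if not e.shadow]


real = [Export("/export/home", "myclient", ("rw", "sync", "no_subtree_check"))]
for e in build_shadow_exports(real):
    print(e.path, e.client, ",".join(e.options))
```

With this split, kernel upcalls would be answered from the real and shadow exports together, while MOUNT requests would only ever consult mount_protocol_view.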
Note some care has to be taken when reexporting, modifying the export list, etc., not to modify paths in the pseudofilesystem if not necessary; we'd rather not give clients unnecessary STALE errors. Also we should probably save the pseudofilesystem across reboots to prevent filehandles from changing after a reboot.
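One way to avoid touching paths unnecessarily is to compare the directories the old and new export lists require and act only on the difference. The sketch below assumes the pseudofilesystem needs one directory per export path plus its ancestors; required_dirs and plan_pseudofs_update are hypothetical names, and nothing here actually modifies a filesystem.

```python
import posixpath


def required_dirs(export_paths):
    """Every directory the pseudofilesystem needs: each export path and
    all of its ancestors, excluding the root itself."""
    dirs = set()
    for p in export_paths:
        p = posixpath.normpath(p)
        while p not in ("/", ""):
            dirs.add(p)
            p = posixpath.dirname(p)
    return dirs


def plan_pseudofs_update(old_exports, new_exports):
    """Return (create, remove, keep) lists of pseudofilesystem paths."""
    old, new = required_dirs(old_exports), required_dirs(new_exports)
    create = sorted(new - old)                # parents sort before children
    remove = sorted(old - new, reverse=True)  # children sort before parents
    keep = sorted(old & new)                  # left alone, so no STALE errors
    return create, remove, keep


create, remove, keep = plan_pseudofs_update(
    ["/export/home", "/export/scratch"],
    ["/export/home", "/srv/data"],
)
print(create, remove, keep)
```

Paths in keep are never recreated, so filehandles that clients already hold for them would stay valid as long as the pseudofilesystem itself is preserved.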
Also, we should probably hide all the automatically created mountpoints under /var/lib/nfs/v4root/ from other processes on the server; this means mountd should be run in its own namespace (see CLONE_NEWNS in "man 2 clone"). I believe all the lookups done by nfsd are actually done in downcalls that are performed in the context of the downcaller (mountd), so mountd's namespace should be the one nfsd ends up seeing.
For now we probably shouldn't perform all the above steps by default; we could give mountd an extra command-line option to enable them.
Other solutions
The purely-mountd solution does seem a little complicated. We could build the pseudofilesystem entirely in the kernel, but I think that would require new kernel code and kernel interfaces. Also it might not fit well with the current export table architecture, where only mountd ever knows the complete list of exports, and the kernel just requests information about particular exports as needed.
Other ideas?