MountNotes
From Linux NFS
Chucklever (Talk | contribs) |
Latest revision as of 22:04, 21 August 2007
Initial impressions
Should the kernel mount client be smart enough to sniff the remote server and tell what options are supported before trying to mount?
Passing just a string should be pretty darn easy. All that's needed is to drop in an "addr=" option -- mount.c already gets rid of the "MS_" related options for us.
TODO:
- break-back retries
- bg retries
- Support for IPv6
- Support for server failover options
- Better error reporting
- Mount server connection caching
- Remount processing
Does "mount.nfs ... -o defaults" work? Do we need "mount.nfs -a" to work? Check with mount.ocfs2.
And does 'mount.nfs' support single-parameter mounts such as "mount.nfs /home"? There is logic in there to do this, but is it working right?
When does the mount command fail immediately, and when does it background itself? If "bg" is specified, do all errors cause the mount command to go into the background, even permanent errors? Is there a class of errors that should always fail immediately?
Should I implement the fallback logic first, before I construct the "bg" logic? If I don't, then a bad set of mount options will force a background mount that can't ever be satisfied.... But maybe that's the way it works already.
Obviously, the legacy mount will sort out bad mount options first, and not even try the mount request. Now that mount option parsing is in the kernel, the kernel has to return some error indicating that the mount options are bad, and that the mount shouldn't be retried. The kernel needs to distinguish between a retry-able and a non-retry-able mount failure. I wonder if Trond will object to return codes from mount(2) that are not listed in the man page? What does CIFS do?
Why isn't "bg" implemented for mount.nfs4?
Setting our own connect timeout
You need to call connect on a socket set to non-blocking mode with fcntl, and then use select with a timeout to limit the amount of time you will wait for the connect to complete. If select returns because you timed out, then close the socket and return an error. If select returns because of an event on the socket, you use getsockopt to determine if the connect succeeded or not.
See Stevens, Unix Network Programming Vol 1 for details. Comments in the code I'm looking at say page 411.
This is a non-bug of sorts: user-space TCP connects time out after 75 seconds. It would be nicer if they timed out sooner, say after 15 seconds.
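The non-blocking connect() technique described in this section can be sketched roughly as follows. This is only an illustration, not code from nfs-utils: the helper name connect_with_timeout and its signature are assumptions, and it leaves closing the socket after a failure to the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and signature are assumptions): connect fd
 * to addr, waiting at most timeout_secs seconds for the connect to finish.
 * Returns 0 on success, or -1 with errno set (ETIMEDOUT on a timeout).
 * The caller is expected to close fd when this fails.
 */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
                                socklen_t addrlen, int timeout_secs)
{
    struct timeval tv = { .tv_sec = timeout_secs, .tv_usec = 0 };
    socklen_t len;
    fd_set wset;
    int flags, ret, saved_errno, so_error = 0;

    /* Put the socket into non-blocking mode with fcntl. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    ret = connect(fd, addr, addrlen);
    if (ret == 0 || errno != EINPROGRESS)
        goto out;               /* connected already, or failed outright */

    /* The socket becomes writable once the connect succeeds or fails. */
    FD_ZERO(&wset);
    FD_SET(fd, &wset);
    ret = select(fd + 1, NULL, &wset, NULL, &tv);
    if (ret == 0) {             /* select timed out */
        errno = ETIMEDOUT;
        ret = -1;
        goto out;
    }
    if (ret < 0)
        goto out;

    /* An event occurred: ask the socket how the connect turned out. */
    len = sizeof(so_error);
    ret = getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len);
    if (ret == 0 && so_error != 0) {
        errno = so_error;
        ret = -1;
    }
out:
    /* Restore the original file status flags without clobbering errno. */
    saved_errno = errno;
    (void)fcntl(fd, F_SETFL, flags);
    errno = saved_errno;
    return ret;
}
```

A caller would pass something like 15 for timeout_secs to get the shorter timeout wished for above. The sketch does not restart select() after EINTR and assumes fd is below FD_SETSIZE.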
Reference implementation
I took a look at Solaris network behavior, just as a reference point. I specified "-o proto=tcp,vers=3".
- It always uses UDP for GETPORT requests, for both MNT and NFS, mount and umount;
- It always uses rpcbind version 2 for IPv4 bind requests;
- It always uses UDP for MNT protocol requests, for both mount and umount;
- It does a MNT NULL request before the actual MNT call, for both mount and umount;
- It does two separate NFS pings, on two separate TCP connections; probably one is from the mount command, and one from the kernel? Both use an ephemeral port rather than a privileged one.
- The Solaris kernel appears to cache TCP connections to the server, so if there's already one, it will use it instead of opening a fresh one. I didn't see a NULL request on this connection (either when it already existed, or when the kernel had to create one).
Copy support for other mount options (quiet/loud, quota, user[s]) to kernel mount client.
The version/transport break-back code is not working. Need to poke at it more. Should it break back if GETPORT says the service is there but the server isn't responding, or should it break back only if GETPORT says use another version?
Also I should check why umount hangs when the server goes down. Is lazy umounting working? What does the --force option do exactly?
Why does a failed umount report the same error twice?
    [root@monet ~]# umount /mnt -O mountport=891,proto=tcp
    umount.nfs: Server failed to unmount 'ingres:/export/fast'
    umount.nfs: Server failed to unmount 'ingres:/export/fast'
    [root@monet ~]
Developing some other ideas
- (generic NFS) Somehow, fail new RPCs immediately if the transport is in a state where it can't connect (ECONNREFUSED or EHOSTUNREACH).
- (generic RPC) A control-C isn't cancelling all transport state. An interrupted "mount -o tcp" blocks a subsequent "mount -o udp" until the failed TCP connection attempt times out and clears. Probably what's happening here is that the RPC client's connect logic is attempting to re-use the port, then the ->connect() call is just going on with TCP again. The RPC client should force a different port if the new connect request doesn't use the same transport.
- I should fix up rpcb_getport_sync() to use only UDP. Except, umount needs to work somehow through firewalls. That's fixed... but maybe GETPORT should try UDP first, then if it times out, try TCP.
- Break-back should be done by looking at portmapper's whole database and figuring out which transports, versions, and programs are available. Steve says some Cisco routers depend on a real GETPORT to determine which ports to open.
- If we absolutely need to do a GETPORT over TCP, why not do multiple GETPORTs on the same connection? Because you have to know what GETPORTs you want to do all at once... the RPC library isn't re-entrant; you can't leave a CLIENT open and open a second one.
- Use the select() on a non-blocking connect() method described above to shorten the TCP connect time out in get_socket().
- Support for user-only mount options in the kernel option parser -- [no]quota, [no]user, [no]users, and so on. See utils/mount/mount.c for more. Hmm. Maybe this isn't needed -- looks like mount.c already strips those off before sending the option string to the kernel. Maybe a better strategy would be to remove support for the user-only options (like fg/bg) from the kernel, and make sure they are purged from the options string before I send them down.
- Add a t/ directory under utils/mount/ that contains a suite of tests similar to the eponymous directory in the git distribution. The tests can be done against an NFS server running on the same system. That way the tests can start and stop the server and issue iptables commands, without adding a local/remote complication. Maybe I could get Bull or CITI interested?
- Mount support for nfs:// URLs
- Implement a long option for mount.nfs for forcing string-ified mounts.
Rewriting nfs(5)
The purpose of rewriting nfs(5) is several-fold:
- Provide correct and clear user documentation for NFS mount options,
- Review the behavior of each mount option to make sure we agree on what each option does and why, in order to provide an opportunity for discussion and change of said behavior,
- Act as a design specification process for both the user space and string-ified NFS mount process, and
- Modernize the use of the markup macros and address typographic inconsistencies
Should add a "DISCUSSION" section to the man page that presents some background about how mount options interact with each other. What is a foreground mount versus a background mount? What does the v2/v3 mount process look like (GETPORT, MNT, NFS)? It might also be cool to cover how locking, open options such as O_DIRECT and O_SYNC, and ac/cto behave on NFS compared to local file systems. Should also carefully describe the behavior of sharedcache and nosharedcache. A discussion of security flavors...
Also expand the "EXAMPLES" section to provide recommendations for various scenarios. One example might be "noauto,users,nosuid".
Need to test mount.nfs's retry= behavior, as documented in nfs(5).
Need to check how nfs and nfs4 mounts behave for all combinations when the server's portmapper is unavailable, or when the port isn't in the portmapper database.
Improving error reporting
Mount's error messages just suck. One problem is that the messages are simply wrong. Another is that errors are reported at too low a level: reporting that an RPC program/version mismatch occurred is nonsense when the real error is that "proto=udp" is not supported.
Perhaps a clear error message can be reported to the command line, and a lot of detail should be reported in the system log? Well, that's easy enough with in-kernel mount option parsing!
mount(2) API return codes
The mount.nfs program needs to distinguish between temporary problems and permanent errors in order to determine whether it's worth retrying the mount request in the background. I'm still unsure whether the version/protocol fallback mechanism should occur in user space or in the kernel -- certainly policy would be easier to set and implement in user space, but then the kernel would need to provide specific information about how a mount request failed so that user space could make an appropriate choice about the next step to try.
The current mount(2) API is described in a man page. The man page describes a set of generic error return codes, which we excerpt here. It also suggests that we can add specific error codes for NFS mounts.
    RETURN VALUE
           On success, zero is returned.  On error, -1 is returned, and errno is
           set appropriately.

    ERRORS
           The error values given below result from filesystem type independent
           errors.  Each filesystem type may have its own special errors and its
           own special behavior.  See the kernel source code for details.

           EACCES A component of a path was not searchable.  (See also
                  path_resolution(2).)  Or, mounting a read-only filesystem was
                  attempted without giving the MS_RDONLY flag.  Or, the block
                  device source is located on a filesystem mounted with the
                  MS_NODEV option.

           EAGAIN A call to umount2() specifying MNT_EXPIRE successfully marked an
                  unbusy file system as expired.

           EBUSY  source is already mounted.  Or, it cannot be remounted read-only,
                  because it still holds files open for writing.  Or, it cannot be
                  mounted on target because target is still busy (it is the
                  working directory of some task, the mount point of another
                  device, has open files, etc.).  Or, it could not be unmounted
                  because it is busy.

           EFAULT One of the pointer arguments points outside the user address
                  space.

           EINVAL source had an invalid superblock.  Or, a remount (MS_REMOUNT)
                  was attempted, but source was not already mounted on target.
                  Or, a move (MS_MOVE) was attempted, but source was not a mount
                  point, or was ’/’.  Or, an unmount was attempted, but target was
                  not a mount point.  Or, umount2() was called with MNT_EXPIRE and
                  either MNT_DETACH or MNT_FORCE.

           ELOOP  Too many link encountered during pathname resolution.  Or, a
                  move was attempted, while target is a descendant of source.

           EMFILE (In case no block device is required:) Table of dummy devices is
                  full.

           ENAMETOOLONG
                  A pathname was longer than MAXPATHLEN.

           ENODEV filesystemtype not configured in the kernel.

           ENOENT A pathname was empty or had a nonexistent component.

           ENOMEM The kernel could not allocate a free page to copy filenames or
                  data into.

           ENOTBLK
                  source is not a block device (and a device was required).

           ENOTDIR
                  The second argument, or a prefix of the first argument, is not a
                  directory.

           ENXIO  The major number of the block device source is out of range.

           EPERM  The caller does not have the required privileges.
Here are some additional return codes I recommend for NFS mounts, just as a start. These should allow a calling program to report a reasonably specific error message, and decide whether and how to retry the request.
           EBADF  The mount option string was not able to be parsed, or an
                  unrecognized option was specified, or a keyword option was
                  specified with a value that is out of range.
This is a permanent mount error. The calling program should not retry this request with the same options.
           ESTALE The server denied access to the requested share.

           ETIMEDOUT
                  The kernel's mount attempt timed out after n seconds (I think n
                  is 15).
These are temporary errors. The calling program may choose to retry this request using the same options, or fail immediately.
           EPROTONOSUPPORT
                  The server reports that the program, version, or transport
                  protocol is not currently available.

           ECONNREFUSED
                  The kernel's mount connection attempt was refused by the server
                  at the network transport layer.
These are temporary errors. The calling program can attempt to recover by adjusting the options and retrying the request.
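As a rough sketch of how a calling program such as mount.nfs might act on the proposed codes, the three classes above could be mapped to a retry decision like this. The enum, its member names, and the function name are all hypothetical, and treating every other errno as a plain failure is an assumption rather than something these notes settle.

```c
#include <errno.h>

/* Hypothetical classification of a failed mount(2) call for NFS. */
enum mount_retry_action {
    MOUNT_FAIL_PERMANENT,   /* bad options: never retry with the same options */
    MOUNT_RETRY_SAME,       /* temporary: may retry with the same options */
    MOUNT_RETRY_ADJUSTED,   /* temporary: may retry after adjusting the options */
    MOUNT_FAIL_OTHER        /* generic mount(2) error: assumed not retry-able */
};

/* Map an errno value from mount(2) to one of the classes proposed above. */
static enum mount_retry_action classify_mount_errno(int err)
{
    switch (err) {
    case EBADF:             /* option string could not be parsed */
        return MOUNT_FAIL_PERMANENT;
    case ESTALE:            /* server denied access to the share */
    case ETIMEDOUT:         /* kernel's mount attempt timed out */
        return MOUNT_RETRY_SAME;
    case EPROTONOSUPPORT:   /* program, version, or transport unavailable */
    case ECONNREFUSED:      /* connection refused at the transport layer */
        return MOUNT_RETRY_ADJUSTED;
    default:
        return MOUNT_FAIL_OTHER;
    }
}
```

With "bg" specified, only MOUNT_RETRY_SAME and MOUNT_RETRY_ADJUSTED would justify backgrounding the mount; MOUNT_FAIL_PERMANENT would fail immediately, which avoids a background mount that can never be satisfied.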
i18n
Internationalization references and hints: