- B1. What can I do to improve NFS performance in general?
- A. Review the performance section of the NFS Howto doc and then look at several things:
- How fast is the disk IO speed on your server(s)? That will have a big impact on overall NFS performance for both Version 2 and Version 3.
- Does your application open its files with the O_SYNC option? That will force NFS Version 3 to behave exactly like (synchronous) NFS Version 2.
- UDP requires IP fragment reassembly. If you see fragmentation errors in the output of <tt>netstat -s</tt>, you may want to increase the size of your socket buffers.
- Have you started enough NFS daemons? Review the contents of <tt>/proc/net/rpc/nfsd</tt>, especially the line that begins with "th". The first number on that line is the total number of NFS server threads that are started and waiting for NFS requests. The second number indicates whether at any time all of the threads were running at once. The remaining numbers are a thread count time histogram. See the NFS How-to for details on tuning your server based on the data in this histogram, and the example after this list for one way to read it.
- Do your NICs and switches/hubs/routers autonegotiate down to 10baseT or half duplex? Half duplex will give you many more network collisions, which are the worst thing possible for NFS performance over UDP.
- Are you running ext3 or ReiserFS? You might look at placing the journal on a separate disk, or on NVRAM. As of January 2002, ext3 allows this, and ReiserFS has a patch available.
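For illustration, here is one way to perform the fragmentation and thread-count checks mentioned above; the exact counter names in the <tt>netstat -s</tt> output vary between kernel and net-tools versions.
<pre>
# On the server: the line beginning with "th" lists the thread count,
# the number of times all threads were busy at once, and the
# thread-utilization histogram described above.
grep ^th /proc/net/rpc/nfsd

# On the client or server: look for IP fragmentation/reassembly failures.
netstat -s | egrep -i 'fragment|reassembl'
</pre>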
- B2. Everything seems so slow and I think the default rsize and wsize are set to 1024 - what's going on?
- A. Normally, the Linux NFS client uses read-ahead and delayed writes to hide the latency of NFS read and write operations. However, the client can cache only a single read or write request per page. Thus, if reading or writing a whole page requires more than one on-the-wire read or write operation (which it certainly does if rsize or wsize is 1024), each of these operations must complete before the next one can be issued. In the case of small NFS Version 3 write operations, the write must be FILE_SYNC because the client must fully complete each write before it issues the next one.
Note that this limitation becomes especially significant for hardware that supports larger pages. For instance, many distributors provide a Linux kernel built for Itanium processors that uses 16KB pages rather than the 4KB pages normally found on 32-bit x86 systems. On such a system, if wsize is smaller than 16KB, the client always sends write operations serially if they fall within the same page.
Finally, note that the maximum transfer size permitted by the Linux server (<tt>NFSSVC_MAXBLKSIZE</tt>) is 32KB once all of the patches that implement NFS over TCP have been applied to a 2.4 kernel. The latest 2.4 kernels have TCP support integrated and allow transfer sizes of up to 32KB.
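If you suspect your transfer sizes are small, one way to check and adjust them is sketched below; the server name and export path are placeholders, and values above 8KB require a kernel with the larger transfer sizes noted above.
<pre>
# Show the mount options (including rsize and wsize) in effect;
# depending on your nfs-utils and kernel versions, "nfsstat -m" or
# /proc/mounts will list them.
nfsstat -m

# Remount with explicit 8KB transfer sizes.
umount /mnt/nfs
mount -t nfs -o rsize=8192,wsize=8192 server:/export /mnt/nfs
</pre>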
- B3. Why can't I mount more than 255 NFS file systems on my client? Why is it sometimes even less than 255?
- A. On Linux, each mounted file system is assigned a major number, which indicates what file system type it is (e.g. ext3, nfs, isofs), and a minor number, which makes it unique among the file systems of the same type. In kernels prior to 2.6, Linux major and minor numbers have only 8 bits, so they may range numerically from zero to 255. Because a minor number has only 8 bits, a system can mount only 255 file systems of the same type. So a system can mount up to 255 NFS file systems, another 255 ext3 file systems, 255 more isofs file systems, and so on. Kernels 2.6 and later have 20-bit-wide minor numbers, which alleviates this restriction.
For the Linux NFS client, however, the problem is somewhat worse because NFS is an anonymous file system. Local disk-based file systems have a block device associated with them, but anonymous file systems do not. <tt>/proc</tt>, for example, is an anonymous file system, and so are other network file systems like AFS. All anonymous file systems share the same major number, so there can be a maximum of only 255 anonymous file systems mounted on a single host.
Usually you won't need more than ten or twenty total NFS mounts on any given client. In some large enterprises, though, your work and users might be spread across hundreds of NFS file servers. To work around the limitation on the number of NFS file systems you can mount on a single host, we recommend that you set up and run one of the automounter daemons for Linux. An automounter finds and mounts file systems as they are needed, and unmounts any that it finds are inactive. You can find more information on Linux automounters here.
You may also run into a limit on the number of privileged network ports on your system. The NFS client uses a unique socket with its own port number for each NFS mount point. Using an automounter helps address the limited number of available ports by automatically unmounting file systems that are not in use, thus freeing their network ports. NFS version 4 support in the Linux NFS client uses a single socket per client-server pair, which also helps increase the allowable number of NFS mount points on a client.
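As an illustration of the automounter approach, a minimal autofs configuration might look like the following; the mount point, map file, server, and export names are all hypothetical.
<pre>
# /etc/auto.master: manage mounts under /nfs using the map below,
# unmounting anything idle for more than 600 seconds
/nfs    /etc/auto.nfs   --timeout=600

# /etc/auto.nfs: each key becomes a directory under /nfs that is
# mounted on first access
data    -rw,hard,intr   fileserver1:/export/data
home    -rw,hard,intr   fileserver2:/export/home
</pre>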
- B4. Why does NFS Version 2 seem so much faster than Version 3?
- A. There are actually two problems here, plus a feature. First, some background: the NFS Version 2 protocol specification requires a server to record each write to permanent storage before it sends a reply to a client. This makes server and client reboot recovery very simple, and provides a good guarantee that data sent to the server is permanently stored. Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in <tt>/etc/exports</tt>. The name of this export option is "[a]sync" (note that there is also a client-side mount option by the same name, but it has a different function and does not defeat NFS protocol compliance).
When set to "sync," the Linux server's behavior strictly conforms to the NFS protocol. This is the default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery.
- First problem:<br /> The default value of this export option on Linux NFS servers before nfs-utils-1.0.1 was "async". If a system administrator did not specify either "sync" or "async" in <tt>/etc/exports</tt>, <tt>exportfs</tt> used "async" by default. This allowed the server to reply to Version 2 write operations and metadata update operations (such as CREATE or MKDIR) before the requested data was written to the server's disk, thereby greatly improving the performance of write operations as well as introducing the possibility of undetectable data corruption. Releases of nfs-utils starting with version 1.0.1 use a default value of "sync," which causes the Linux server to conform properly to the NFS protocol specification.
- Second problem:<br /> Support for NFS Version 3 in Linux 2.2's NFS server does not honor the "async" export option. Thus, by default on a system running Linux 2.2 with an old version of the nfs-utils package, NFS Version 2 writes are fast and unsafe, but Version 3 write and commit operations are safe, although slower, since they always follow the client's request for either UNSTABLE or FILE_SYNC (see question A1).
- Feature:<br /> When you use the <tt>exportfs</tt> command with its verbose option set, it displays the various export options in effect for each exported file system. If the "async" export option is set, it appears in the option list, but if "sync" is requested, it will not appear in the exportfs parameter list. This reflects the common usage of "sync" as the default on other platforms, but can be somewhat confusing. See the example below.
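For example, protocol-conformant behavior might be requested explicitly in <tt>/etc/exports</tt> as follows; the path and client specification are placeholders, and <tt>exportfs -v</tt> then shows the options in effect as described above.
<pre>
# /etc/exports: export /srv/data, explicitly requesting synchronous
# (protocol-conformant) behavior
/srv/data   *.example.com(rw,sync)

# Re-export and list the options currently in effect
exportfs -ra
exportfs -v
</pre>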
- B5. Why does default NFS Version 2 performance seem equivalent to NFS Version 3 performance in 2.4 kernels?
- A. See B4 for background information on how export options affect the Linux NFS server's write behavior.
Since Linux 2.4, the NFS Version 3 server recognizes the "async" export option. When this option is set, the server replies to clients before data has been written to permanent storage. The server also sends a FILE_SYNC response to the client, indicating that the client need not retain buffered data or send a subsequent COMMIT operation. This exposes the client to the same undetectable corruption as exists for NFS Version 2 (with "async") if the server crashes before it has actually written data to stable storage. (See question B6 for further discussion of this behavior and its consequences.) Note that even if a client sends a Version 3 COMMIT operation, the server replies immediately if the file system has been exported with the "async" option.
Conversely, when the "sync" export option is used on a Linux 2.4 server, both Version 2 and Version 3 writes behave as required by the NFS protocol specification. In this case, NFS Version 3 has a performance advantage over NFS Version 2, while maintaining data resilience during a server crash.
Note well that "[a]sync" also affects some metadata operations on the server.
- B6. Why is the "async" export option unsafe, and is that really a serious problem?
- A. The biggest problem is not just that the option is unsafe, but that any resulting corruption may go undetected.
In the Linux implementation of NFS Version 2, when the "async" export option is in effect, a Linux NFS server may crash before posting all NFS write requests to disk. A Version 2 client, however, always assumes data is permanently written to stable storage, and that it is safe to discard buffers containing the written data.
After a server crash, the Version 2 client cannot know that unwritten data is lost; this is why Version 2 writes are supposed to be permanent before the server replies. Even if a client still has the modified data in its cache, the data on the server no longer matches what is cached on the client (since some or all of the writes did not complete before the server crashed). This may cause applications to make future decisions based on data cached by the client rather than what is on the server, thus further corrupting the file.
For the Linux implementation of NFS Version 3, using the "async" export option to allow faster writes is no longer necessary. NFS Version 3 explicitly allows a server to reply before writing data to disk, under controlled circumstances. It allows clients and servers to communicate about the disposition of written data so that in the event of a server reboot, a Version 3 client can detect the reboot and resend the data.
In summary, be sure all exports on your Linux NFS servers use the "sync" option by setting it explicitly or by upgrading your nfs-utils package to version 1.0.1 or later. If you need fast writes, be sure your clients mount using NFS Version 3. You may also improve write performance by adding the "wdelay" option to your exports.
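As a concrete (hypothetical) illustration of that advice: mark the export "sync" (and, if desired, "wdelay") on the server, and request Version 3 explicitly on the client. Host names and paths below are placeholders.
<pre>
# Server side, /etc/exports: synchronous export with write gathering
/srv/data   *.example.com(rw,sync,wdelay)

# Client side: mount using NFS Version 3
mount -t nfs -o nfsvers=3 server:/srv/data /mnt/data
</pre>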
- B7. I have achieved pretty fast speeds in some client benchmarks, but when my client is heavily loaded, it slows down considerably. Why does that happen?
- A. The Linux client limits the total number of pending read or write operations per mount point. This prevents the client from exhausting its memory with cached read or write requests when the network or server is slow. The hard limit is 256 outstanding read or write operations per mount point. When that limit is reached, the client does not issue a new read or write operation until at least one outstanding read or write operation completes, thus serializing all reads and writes on that mount point until load is reduced.
Two ways of mitigating this effect are to:
- Increase rsize and wsize on your client's mount points. This increases the amount of data that can be involved in outstanding reads or writes at any given time.
- Mount the same server partition multiple times on your clients, and spread your applications among the mount points (see the sketch below).
This limit has been removed in 2.6 and later kernels.
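On pre-2.6 kernels, a rough sketch of the two mitigations might look like this; the host and path names are placeholders, and each mount point gets its own 256-request limit.
<pre>
# Larger transfer sizes let each outstanding operation move more data.
# Mounting the same export twice doubles the number of requests that
# can be outstanding; spread applications across the mount points.
mount -t nfs -o rsize=8192,wsize=8192 server:/export /mnt/data-a
mount -t nfs -o rsize=8192,wsize=8192 server:/export /mnt/data-b
</pre>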
- B8. Why won't my client let me use rsize or wsize larger than 8KB when I mount my Linux NFS server?
- A. NFS Version 2 supports up to 8KB reads and writes. NFS Version 3 allows larger reads and writes (see question A1). Stock 2.4 kernels earlier than 2.4.20 do not support read or write operations larger than 8192 bytes for either NFS Version 2 or 3. Server-side TCP support, introduced as an experimental compile-time option in 2.4.20, increases the server's maximum I/O size to 32KB by increasing the value of <tt>NFSSVC_MAXBLKSIZE</tt> (see question B2).
When a client mounts a file server, the file server advertises the largest number of bytes it can read or write in a single operation. Clients always use the smaller of the server's maximum and the rsize and wsize values specified by the client in the mount command.
Large values of rsize and wsize may inhibit performance when using UDP. UDP datagrams must be separated into fragments that fit within your network's Maximum Transmission Unit (MTU). The loss of any of these fragments requires retransmission of the whole datagram. This may have a particularly adverse impact on client performance if your network is congested. TCP is considerably better at recovering one or two lost segments and managing network congestion, so larger I/O operations are usually more effective at reliably boosting performance when using NFS over TCP.
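For instance, with a 2.4.20 or later server built with the TCP option and a client that supports NFS over TCP, a mount requesting 32KB transfers could look like the following (host and path are placeholders); the client will fall back to the server's advertised maximum if it is smaller.
<pre>
# NFS Version 3 over TCP with 32KB reads and writes
mount -t nfs -o tcp,nfsvers=3,rsize=32768,wsize=32768 server:/export /mnt/nfs
</pre>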
- B9. I use the "sync" or "noac" mount options. I've increased my wsize, but write throughput is lower than I expect. Why is this?
- A. Normally, an NFS client delays sending application write requests, allowing application processing to overlap with NFS write operations. An NFS client only causes an application to wait for writes to complete when the application closes or flushes a file. When a client sends write operations synchronously, however, the client causes applications to wait for each write operation to complete at the server. This results in much lower performance.
The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect. Applications enable synchronous writes for a single file by opening a file with the O_SYNC or O_DSYNC flags. System administrators enable synchronous writes for all files in a local file system by mounting that file system with the "sync" option. The "noac" mount option also enables synchronous writes. If it didn't, applications running on other clients would have a difficult time retrieving file modifications if a client delayed writes.
Currently the Linux NFS client has a limitation which prevents it from safely generating large synchronous writes. The client breaks large write requests into on-the-wire write operations that are no larger than a single page to guarantee that write requests arrive on the server's disk in byte order (some applications depend on this behavior). Even if you set wsize larger than a page, the client will break any application write request into page-sized NFS write operations to meet this guarantee.
In addition, if the server's page size is larger than the client's page size, the server is forced to do additional work when the client writes in small chunks. NFS clients normally align reads and writes to their own page size, which then may be unaligned on the server if it uses larger pages. Depending on the server OS and filesystem, this could result in a number of performance limiting problems.
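One rough way to observe the cost of synchronous writes is to time the same bulk write with and without the "sync" mount option; the mount point, export, and file name below are placeholders, and the absolute numbers depend entirely on your network and server.
<pre>
# Baseline: normal delayed (asynchronous) client writeback
time dd if=/dev/zero of=/mnt/nfs/testfile bs=65536 count=1024

# Remount with "sync" and repeat; each page-sized write must now
# complete at the server before the next one is sent
umount /mnt/nfs
mount -t nfs -o sync server:/export /mnt/nfs
time dd if=/dev/zero of=/mnt/nfs/testfile bs=65536 count=1024
</pre>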
- B10. Sometimes my server gets slow or becomes unresponsive, then comes back to life. I'm using NFS over UDP, and I've noticed a lot of IP fragmentation on my network. Is there anything I can do?
- A. UDP datagrams larger than the IP Maximum Transmission Unit (MTU) must be divided into pieces that are small enough to be transmitted. If, for example, your network's MTU is 1500 bytes, the Linux IP layer must break UDP datagrams larger than 1500 bytes into separate packets, all of which must be no larger than the MTU. These separated packets are called fragments.
The Linux IP layer transmits each fragment as it is breaking up a UDP datagram, encoding enough information in each fragment so that the receiving end can reassemble the individual fragments into the original UDP datagram. If something happens that prevents a client from continuing to fragment a packet (e.g., the output socket buffer space in the IP layer is exceeded), the IP layer stops sending fragments. In this case, the receiving end has a set of fragments that is incomplete, and after a certain time window, it will drop the fragments if it does not receive enough to assemble a complete datagram. When this occurs, the UDP datagram is lost. Clients detect this loss when they have not received a reply from the server after a certain time interval, and recover by retransmitting the datagram.
Under heavy write loads, the Linux NFS client can generate many large UDP datagrams. This can quickly exhaust output socket buffer space on the client. If this occurs many times in a short time, the client sends the server a large number of fragments, but almost never gets a whole datagram's worth of fragments to the server. This fills the server's IP reassembly queue, causing it to become unreachable via UDP until it expels the useless fragments from the queue.
Note that the same thing can occur on servers that are under a heavy read load. If the server's output socket buffers are too small, large reads will cause them to overflow during IP fragmentation. The client's IP reassembly queue then fills with worthless fragments, and little UDP traffic can get to the client.
Here are some symptoms of this problem:
- You use NFS over UDP with a large wsize (relative to the network's MTU) and a write-intensive application workload, or with a large rsize and a read-intensive workload.
- You may see many fragmentation errors on your server or clients (<tt>netstat -s</tt> will tell the story).
- Your server may periodically become very slow or unreachable.
- Increasing the number of threads on your server has no effect on performance.
- One or a small number of clients seem to make the server unusable.
- The network path between your client and server may have a router or switch with small port buffers, or the path may contain links that run at different speeds (100Mb/s and GbE).
The fix is to make Linux's IP fragmentation logic continue fragmenting a datagram even when output socket buffer space is over its limit. This fix appears in kernels newer than 2.4.20. You can work around this problem in one of several ways:
- Use NFS over TCP. TCP does not use fragmentation, so it does not suffer from this problem. Using TCP may not be possible with older Linux NFS clients and servers that only support NFS over UDP.
- If you can't use NFS over TCP, upgrade your clients to 2.4.20 or later.
- If you can't upgrade your clients, increase the default size of your client's socket buffers (see below). 2.4.20 and later kernels do this automatically for the NFS client's socket buffers. See Section 5.3ff of the NFS How-To for more information.
- If your rsize or wsize is very large, reduce it. This will reduce the load on your client's and server's output socket buffers.
- Reduce network congestion by ensuring your GbE links use full flow control, that your switch and router ports use adequate buffer sizes, and that all links are negotiating their fastest settings.
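For the socket buffer workaround mentioned in the list above, the commands below enlarge the default and maximum socket buffer sizes; 262144 bytes is only an illustrative figure, and 2.4.20 and later clients already size their NFS socket buffers automatically.
<pre>
# Enlarge default and maximum socket buffer sizes (values in bytes)
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/wmem_default
echo 262144 > /proc/sys/net/core/wmem_max
</pre>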
- B11. Why does my server see so many ACCESS calls when using Linux clients?
- A. Default NFS server behavior is to prevent root on client machines from having privileged access to exported files. Servers do this by mapping the "root" user to some unprivileged user (usually the user "nobody") on the server side. This is known as root squashing. Most servers, including the Linux NFS server, provide an export option to disable this behaviour and allow root on selected clients to enjoy full root privileges on exported file systems.
Unfortunately, an NFS client has no way to determine that a server is squashing root. Thus the Linux client uses NFS Version 3 ACCESS operations when an application is running on a client as root. If an application runs as a normal user, a client uses its own authentication checking, and doesn't bother to contact the server.
The Linux NFS client should cache the results of these ACCESS operations. In fact, in the 2.6 kernels it does: it caches these results and extends ACCESS checking to all users to allow for generic uid/gid mapping on the server. This also enables proper support for Access Control Lists in the server's local file system. In pre-2.6 kernels, the stock NFS client does not cache the results of ACCESS operations.
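For reference, root squashing is controlled per export in <tt>/etc/exports</tt>; the entries below use hypothetical paths and hosts to show the default behavior and how it is disabled for a single trusted host.
<pre>
# Default: root on clients is mapped to an unprivileged user
/srv/data    *.example.com(rw,sync,root_squash)

# Give root on one trusted host full privileges on this export
/srv/admin   adminhost.example.com(rw,sync,no_root_squash)
</pre>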