NFS Howto Optimization
Optimizing NFS Performance
Careful analysis of your environment, both from the client and from the server point of view, is the first step necessary for optimal NFS performance. The first sections will address issues that are generally important to the client. Later (Section 5.3 and beyond), server side issues will be discussed. In both cases, these issues will not be limited exclusively to one side or the other, but it is useful to separate the two in order to get a clearer picture of cause and effect.
Aside from the general network configuration - appropriate network capacity, faster NICs, full duplex settings in order to reduce collisions, agreement in network speed among the switches and hubs, etc. - among the most important client optimization settings are the NFS data transfer buffer sizes, specified by the mount command options rsize and wsize.
Setting Block Size to Optimize Transfer Speeds
The mount command options rsize and wsize specify the size of the chunks of data that the client and server pass back and forth to each other. If no rsize and wsize options are specified, the default varies by which version of NFS we are using. The most common default is 4K (4096 bytes), although for TCP-based mounts in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the server specifies the default block size.
The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the limit is specific to the server. On the Linux server, the maximum block size is defined by the value of the kernel constant NFSSVC_MAXBLKSIZE, found in the Linux kernel source file ./include/linux/nfsd/const.h. The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes), but the patch set implementing NFS over TCP/IP transport in the 2.4 series, as of this writing, uses a value of 32K (defined in the patch as 32*1024) for the maximum block size.
All 2.4 clients currently support up to 32K block transfer sizes, allowing the standard 32K block transfers across NFS mounts from other servers, such as Solaris, without client modification.
The defaults may be too big or too small, depending on the specific combination of hardware and kernels. On the one hand, some combinations of Linux kernels and network cards (largely on older machines) cannot handle blocks that large. On the other hand, if they can handle larger blocks, a bigger size might be faster.
You will want to experiment and find an rsize and wsize that works and is as fast as possible. You can test the speed of your options with some simple commands, if your network environment is not heavily used. Note that your results may vary widely unless you resort to using more complex benchmarks, such as Bonnie, Bonnie++, or IOzone.
The first of these commands transfers 16384 blocks of 16k each from the special file /dev/zero (which if you read it just spits out zeros really fast) to the mounted partition. We will time it to see how long it takes. So, from the client machine, type:
# time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384
This creates a 256 MB file of zeroed bytes. In general, you should create a file that's at least twice as large as the system RAM on the server, but make sure you have enough disk space! Then read back the file into the great black hole on the client machine (/dev/null) by typing the following:
# time dd if=/mnt/home/testfile of=/dev/null bs=16k
Repeat this a few times and average how long it takes. Be sure to unmount and remount the filesystem each time (both on the client and, if you are zealous, locally on the server as well), which should clear out any caches.
Then unmount, and mount again with a larger and smaller block size. They should be multiples of 1024, and not larger than the maximum block size allowed by your system. Note that NFS Version 2 is limited to a maximum of 8K, regardless of the maximum block size defined by NFSSVC_MAXBLKSIZE; Version 3 will support up to 64K, if permitted. The block size should be a power of two since most of the parameters that would constrain it (such as file system block sizes and network packet size) are also powers of two. However, some users have reported better successes with block sizes that are not powers of two but are still multiples of the file system block size and the network packet size.
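The constraints above on candidate block sizes can be checked before mounting. The following is a minimal sketch only; check_blocksize is a hypothetical helper, not part of any NFS tool, and the maximum you pass in is whatever your client and server actually allow.

```shell
#!/bin/sh
# Sketch: sanity-check a candidate rsize/wsize before mounting.
# check_blocksize is a hypothetical helper, not part of any NFS tool.
# Usage: check_blocksize SIZE MAX   (both in bytes)

check_blocksize() {
    size=$1
    max=$2
    if [ "$size" -le 0 ] || [ $((size % 1024)) -ne 0 ]; then
        echo "invalid: not a positive multiple of 1024"
        return 1
    fi
    if [ "$size" -gt "$max" ]; then
        echo "invalid: larger than maximum $max"
        return 1
    fi
    # A power of two has exactly one bit set, so size & (size - 1) is 0.
    if [ $((size & (size - 1))) -ne 0 ]; then
        echo "ok: multiple of 1024, but not a power of two"
        return 0
    fi
    echo "ok: power of two"
}
```

For example, check_blocksize 8192 8192 would accept the largest size usable with NFS Version 2, while check_blocksize 12288 32768 passes with a reminder that the value is not a power of two.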
Directly after mounting with a larger size, cd into the mounted file system and do things like ls, explore the filesystem a bit to make sure everything is as it should be. If the rsize/wsize is too large the symptoms are very odd and not 100% obvious. A typical symptom is incomplete file lists when doing ls, and no error messages, or reading files failing mysteriously with no error messages. After establishing that the given rsize/wsize works you can do the speed tests again. Different server platforms are likely to have different optimal sizes.
Remember to edit /etc/fstab to reflect the rsize/wsize you found to be the most desirable.
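The manual procedure described in this section (mount with a given block size, write, remount, read, unmount) could be automated roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the export and mount point are placeholders you supply, and timing uses whole seconds from date rather than the time command. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: repeat the dd write/read test for several rsize/wsize values.
# The export, mount point and sizes passed in are hypothetical examples.
# Must run as root. With DRY_RUN=1, commands are printed, not executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run one command and report the elapsed wall-clock time in seconds.
timed() {
    label=$1
    shift
    start=$(date +%s)
    run "$@"
    end=$(date +%s)
    echo "$label: $((end - start)) s"
}

# Usage: bench_blocksizes SERVER:/EXPORT MOUNTPOINT SIZE...
bench_blocksizes() {
    export_path=$1
    mount_point=$2
    shift 2
    for size in "$@"; do
        echo "== rsize=$size wsize=$size =="
        opts="rsize=$size,wsize=$size"
        run mount -t nfs -o "$opts" "$export_path" "$mount_point"
        timed write dd if=/dev/zero of="$mount_point/testfile" bs=16k count=16384
        # Remount between write and read so the read is not served from cache.
        run umount "$mount_point"
        run mount -t nfs -o "$opts" "$export_path" "$mount_point"
        timed read dd if="$mount_point/testfile" of=/dev/null bs=16k
        run umount "$mount_point"
    done
}
```

A call such as bench_blocksizes server:/home /mnt/home 4096 8192 16384 32768 would run the test once per size; repeat it a few times and average the results, as described above.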
If your results seem inconsistent, or doubtful, you may need to analyze your network more extensively while varying the rsize and wsize values. In that case, here are several pointers to benchmarks that may prove useful:
The easiest benchmark with the widest coverage, including an extensive spread of file sizes, and of IO types - reads and writes, rereads and rewrites, random access, etc. - seems to be IOzone. A recommended invocation of IOzone (for which you must have root privileges) includes unmounting and remounting the directory under test, in order to clear out the caches between tests, and including the file close time in the measurements. Assuming you've already exported /tmp to everyone from the server foo, and that you've installed IOzone in the local directory, this should work:
# echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0" >> /etc/fstab
# mkdir /mnt/foo
# mount /mnt/foo
# ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile > logfile
The benchmark should take 2-3 hours at most, but of course you will need to run it for each value of rsize and wsize that is of interest. The web site gives full documentation of the parameters, but the specific options used above are:
- -a: Full automatic mode, which tests file sizes of 64K to 512M, using record sizes of 4K to 16M
- -R: Generate report in Excel spreadsheet form (The "surface plot" option for graphs is best)
- -c: Include the file close time in the tests, which will pick up the NFS version 3 commit time
- -U: Use the given mount point to unmount and remount between tests; it clears out caches
- -f: When using unmount, you have to locate the test file in the mounted file system
Packet Size and Network Drivers
While many Linux network card drivers are excellent, some are quite shoddy, including a few drivers for some fairly standard cards. It is worth experimenting with your network card directly to find out how it can best handle traffic.
Try pinging back and forth between the two machines with large packets using the -f and -s options with ping (see ping(8) for more details) and see if a lot of packets get dropped, or if they take a long time for a reply. If so, you may have a problem with the performance of your network card.
For a more extensive analysis of NFS behavior in particular, use the nfsstat command to look at nfs transactions, client and server statistics, network statistics, and so forth. The "-o net" option will show you the number of dropped packets in relation to the total number of transactions. In UDP transactions, the most important statistic is the number of retransmissions, due to dropped packets, socket buffer overflows, general server congestion, timeouts, etc. This will have a tremendously important effect on NFS performance, and should be carefully monitored. Note that nfsstat does not yet implement the -z option, which would zero out all counters, so you must look at the current nfsstat counter values prior to running the benchmarks.
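Because the counters cannot be zeroed, one way to track retransmissions is to record them before and after a benchmark run. The sketch below reads the client RPC counters directly; it assumes the "rpc" line of /proc/net/rpc/nfs holds the call count followed by the retransmission count, and retrans_percent is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report client RPC retransmissions as a percentage of calls.
# Assumes the "rpc" line of /proc/net/rpc/nfs holds: calls retrans authrefresh.
# retrans_percent is a hypothetical helper; the file argument is optional.

retrans_percent() {
    stats_file=${1:-/proc/net/rpc/nfs}
    awk '$1 == "rpc" {
        calls = $2
        retrans = $3
        if (calls > 0) {
            pct = sprintf("%.2f", 100 * retrans / calls)
            print calls " calls, " retrans " retransmissions (" pct "%)"
        } else {
            print "no RPC calls recorded"
        }
    }' "$stats_file"
}
```

Running retrans_percent before and after a test and comparing the two lines gives the number of retransmissions caused by that test alone.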
To correct network problems, you may wish to reconfigure the packet size that your network card uses. Very often there is a constraint somewhere else in the network (such as a router) that causes a smaller maximum packet size between two machines than what the network cards on the machines are actually capable of. TCP should autodiscover the appropriate packet size for a network, but UDP will simply stay at a default value. So determining the appropriate packet size is especially important if you are using NFS over UDP.
You can test for the network packet size using the tracepath command: From the client machine, just type tracepath server 2049 and the path MTU should be reported at the bottom. You can then set the MTU on your network card equal to the path MTU, by using the MTU option to ifconfig, and see if fewer packets get dropped. See the ifconfig man pages for details on how to reset the MTU.
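The path MTU can also be picked out of the tracepath output by a script. This is only a sketch: it assumes the summary line has the form "Resume: pmtu 1500 hops 2 back 2", which may differ between tracepath versions, and path_mtu, "server" and eth0 are placeholders.

```shell
#!/bin/sh
# Sketch: pull the path MTU out of tracepath output read from stdin.
# Assumes the summary line looks like "Resume: pmtu 1500 hops 2 back 2".
# path_mtu is a hypothetical helper; "server" and eth0 below are placeholders.

path_mtu() {
    awk '/Resume:/ {
        for (i = 1; i < NF; i++)
            if ($i == "pmtu") { print $(i + 1); exit }
    }'
}

# Example (as root):
#   mtu=$(tracepath server 2049 | path_mtu)
#   ifconfig eth0 mtu "$mtu"
```

If the function prints nothing, the summary line was not found and the MTU should be read from the tracepath output by hand.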
In addition, netstat -s will give the statistics collected for traffic across all supported protocols. You may also look at /proc/net/snmp for information about current network behavior; see the next section for more details.
Overflow of Fragmented Packets
Using an rsize or wsize larger than your network's MTU (often 1500 bytes) will cause IP packet fragmentation when using NFS over UDP. IP packet fragmentation and reassembly require a significant amount of CPU resource at both ends of a network connection. In addition, packet fragmentation also exposes your network traffic to greater unreliability, since a complete RPC request must be retransmitted if a UDP packet fragment is dropped for any reason. Any increase of RPC retransmissions, along with the possibility of increased timeouts, is the single worst impediment to performance for NFS over UDP.
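To get a feel for how much fragmentation a given block size causes, the number of fragments can be estimated with a little arithmetic. The sketch below assumes a 20-byte IPv4 header without options and an 8-byte UDP header, and ignores the RPC and NFS header bytes, so real counts may be slightly higher; udp_fragments is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: estimate how many IP fragments one UDP datagram is split into.
# Assumes a 20-byte IPv4 header without options and an 8-byte UDP header;
# RPC and NFS header bytes are ignored, so real counts can be slightly higher.
# udp_fragments is a hypothetical helper. Usage: udp_fragments BYTES MTU

udp_fragments() {
    payload=$(( $1 + 8 ))                 # data plus UDP header
    per_frag=$(( ($2 - 20) / 8 * 8 ))     # fragment data is a multiple of 8 bytes
    echo $(( (payload + per_frag - 1) / per_frag ))
}
```

Under these assumptions, udp_fragments 8192 1500 reports 6 and udp_fragments 32768 1500 reports 23; losing any one of those fragments forces the whole request to be sent again.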
Packets may be dropped for many reasons. If your network topology is complex, fragment routes may differ, and the fragments may not all arrive at the server for reassembly. NFS server capacity may also be an issue, since the kernel has a limit on how many fragments it can buffer before it starts throwing away packets. With kernels that support the /proc filesystem, you can monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the memory used by unprocessed fragments reaches the value specified by ipfrag_high_thresh (in bytes), the kernel will simply start throwing away fragmented packets until the amount of buffered fragment data falls to the value specified by ipfrag_low_thresh.
Another counter to monitor is IP: ReasmFails in the file /proc/net/snmp; this is the number of fragment reassembly failures. If it goes up too quickly during heavy file activity, you may have a problem.
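The reassembly failure counter can be read with a short script. The sketch below assumes /proc/net/snmp contains a pair of lines beginning with "Ip:", the first naming the counters and the second giving their values; reasm_fails is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print the current ReasmFails counter.
# Assumes /proc/net/snmp has two "Ip:" lines: counter names, then values.
# reasm_fails is a hypothetical helper; the file argument is optional.

reasm_fails() {
    snmp_file=${1:-/proc/net/snmp}
    awk '$1 == "Ip:" {
        if (!col) {
            # Header line: find the column named ReasmFails.
            for (i = 2; i <= NF; i++)
                if ($i == "ReasmFails") col = i
        } else {
            # Value line: print the matching field.
            print $col
            exit
        }
    }' "$snmp_file"
}

# Example: compare the counter before and after heavy file activity.
#   before=$(reasm_fails); sleep 60; after=$(reasm_fails)
#   echo "reassembly failures in the last minute: $((after - before))"
```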
NFS over TCP
A new feature, available for both 2.4 and 2.5 kernels but not yet integrated into the mainstream kernel at the time of this writing, is NFS over TCP. Using TCP has a distinct advantage and a distinct disadvantage over UDP. The advantage is that it works far better than UDP on lossy networks. When using TCP, a single dropped packet can be retransmitted, without the retransmission of the entire RPC request, resulting in better performance on lossy networks. In addition, TCP will handle network speed differences better than UDP, due to the underlying flow control at the network level.
The disadvantage of using TCP is that it is not a stateless protocol like UDP. If your server crashes in the middle of a packet transmission, the client will hang and any shares will need to be unmounted and remounted.
The overhead incurred by the TCP protocol will result in somewhat slower performance than UDP under ideal network conditions, but the cost is not severe, and is often not noticeable without careful measurement. If you are using gigabit ethernet from end to end, you might also investigate the usage of jumbo frames, since the high speed network may allow the larger frame sizes without encountering increased collision rates, particularly if you have set the network to full duplex.
Timeout and Retransmission Values
Two mount command options, timeo and retrans, control the behavior of UDP requests when encountering client timeouts due to dropped packets, network congestion, and so forth. The -o timeo option allows designation of the length of time, in tenths of seconds, that the client will wait until it decides it will not get a reply from the server, and must try to send the request again. The default value is 7 tenths of a second. The -o retrans option allows designation of the number of timeouts allowed before the client gives up, and displays the Server not responding message. The default value is 3 attempts. Once the client displays this message, it will continue to try to send the request, but only once before displaying the error message if another timeout occurs. When the client reestablishes contact, it will fall back to using the correct retrans value, and will display the Server OK message.
If you are already encountering excessive retransmissions (see the output of the nfsstat command), or want to increase the block transfer size without encountering timeouts and retransmissions, you may want to adjust these values. The specific adjustment will depend upon your environment, and in most cases, the current defaults are appropriate.
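When choosing values, it can help to estimate how long the client waits before reporting Server not responding. The sketch below rests on an assumption that is not stated in this document: that for UDP mounts the timeout doubles after each timeout, up to a ceiling of 60 seconds, as the nfs(5) man page describes. Check your own man page before relying on it; major_timeout_tenths is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: total wait, in tenths of a second, before "Server not responding".
# ASSUMPTION: the timeout doubles after each timeout, capped at 60 seconds
# (600 tenths); verify this against nfs(5) on your system.
# Usage: major_timeout_tenths TIMEO RETRANS

major_timeout_tenths() {
    interval=$1
    tries=$2
    total=0
    while [ "$tries" -gt 0 ]; do
        total=$(( total + interval ))
        interval=$(( interval * 2 ))
        if [ "$interval" -gt 600 ]; then
            interval=600
        fi
        tries=$(( tries - 1 ))
    done
    echo "$total"
}

# Example mount with adjusted values (server and paths are placeholders):
#   mount -t nfs -o timeo=14,retrans=5 server:/home /mnt/home
```

Under that assumption, the defaults (timeo=7, retrans=3) give 49 tenths of a second, a little under five seconds, before the message appears.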
Number of Instances of the NFSD Server Daemon
Most startup scripts, Linux and otherwise, start 8 instances of nfsd. In the early days of NFS, Sun decided on this number as a rule of thumb, and everyone else copied. There are no good measures of how many instances are optimal, but a more heavily-trafficked server may require more. You should use at the very least one daemon per processor, but four to eight per processor may be a better rule of thumb. If you are using a 2.4 or higher kernel and you want to see how heavily each nfsd thread is being used, you can look at the file /proc/net/rpc/nfsd. The last ten numbers on the line beginning with "th" in that file indicate the number of seconds that the thread usage was at that percentage of the maximum allowable. If you have a large number in the top three deciles, you may wish to increase the number of nfsd instances. This is done upon starting nfsd using the number of instances as the command line option, and is specified in the NFS startup script (/etc/rc.d/init.d/nfs on Red Hat) as RPCNFSDCOUNT. See the nfsd(8) man page for more information.
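The top three deciles can be summed with a one-line awk script. This sketch assumes the "th" line ends with the ten histogram values described above, lowest decile first; busy_seconds is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: seconds during which nfsd thread usage was in the top three deciles.
# Assumes the "th" line of /proc/net/rpc/nfsd ends with ten histogram values,
# lowest decile first. busy_seconds is a hypothetical helper.

busy_seconds() {
    nfsd_file=${1:-/proc/net/rpc/nfsd}
    awk '$1 == "th" { print $(NF - 2) + $(NF - 1) + $NF }' "$nfsd_file"
}

# If the result is large, start more threads, for example (as root):
#   rpc.nfsd 16
```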