General troubleshooting recommendations
From Linux NFS
Latest revision as of 17:36, 15 October 2012
Depending on your configuration, there are a number of ways that NFS can fail to work, and it can be difficult to determine exactly why it is not working. This page describes some general techniques for diagnosing the issue.
If you cannot resolve your problem and plan to report it to the developer, see Reporting bugs.
General NFSv4 Issues
Check that the "rpc_pipefs" and "nfsd" (on the server side) filesystems are both mounted somewhere. (Your distro should do this for you.)
Check that idmapping is configured. (/etc/idmapd.conf should set the same NFSv4 domain for client and server.)
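Both checks above can be scripted. The following is only a sketch: the function names are made up for illustration, and it assumes the usual Linux locations (/proc/mounts and /etc/idmapd.conf), which may differ on your distro.

```shell
# Sketch only: function names and default paths are assumptions.

# Succeeds if a filesystem of type $1 appears in the mount table $2
# (default /proc/mounts). The third field of each line is the fs type.
fs_type_mounted() {
    local fstype="$1" table="${2:-/proc/mounts}"
    awk -v t="$fstype" '$3 == t { found = 1 } END { exit !found }' "$table"
}

# Prints the first uncommented Domain setting from an idmapd.conf-style
# file (default /etc/idmapd.conf); prints nothing if it is not set.
idmap_domain() {
    local conf="${1:-/etc/idmapd.conf}"
    sed -n 's/^[[:space:]]*Domain[[:space:]]*=[[:space:]]*//p' "$conf" | head -n 1
}
```

For example, `fs_type_mounted rpc_pipefs || echo "rpc_pipefs is not mounted"` on either machine, and `fs_type_mounted nfsd` on the server. Run `idmap_domain` on both client and server and compare the two values by eye.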
Check server's NFSv4 capability
Make sure your server has NFSv4 available:
rpcinfo -p `hostname`
That should show the versions of NFS available. Also check that the client and server are running the appropriate NFSv4 processes:
ps aux | grep rpc
As a minimum, the server should show:
rpc.mountd
rpc.idmapd
rpc.nfsd
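If you would rather not read the rpcinfo listing by eye, the version check can be sketched as a small filter. The function name is hypothetical; the only facts it relies on are that NFS is RPC program 100003 and that `rpcinfo -p` prints the program number and version in its first two columns.

```shell
# Sketch only: has_nfs_version is a made-up helper name.
# Reads `rpcinfo -p` output on stdin and succeeds if the NFS program
# (RPC program number 100003) is registered with the given version.
has_nfs_version() {
    awk -v v="$1" '$1 == 100003 && $2 == v { found = 1 } END { exit !found }'
}

# Usage (not run here):
#   rpcinfo -p "$(hostname)" | has_nfs_version 4 && echo "NFSv4 is registered"
```

A miss is a hint rather than proof: NFSv4 itself does not depend on the portmapper, so a server can in principle offer NFSv4 without registering it.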
Check server's exports
Double-check that your server is exporting what you think it is. On the server, run the command:
exportfs -v
If you need to make modifications, edit /etc/exports and re-export using the command
exportfs -r
Note that recent Linux NFS servers no longer require special treatment for NFSv4 exports; they are configured in the same way NFSv2 and NFSv3 exports always have been.
Check server mount functionality
Try mounting the nfs4 export on the server itself by mounting localhost:/. This will isolate whether the problem is with the server configuration.
Check client mount functionality
Verify your client has NFSv4 capability. One way to do this:
fgrep nfs4 /proc/kallsyms
You should see a long list of nfs4_ symbols. If not, then check your kernel config and rebuild it.
Note that older Linux NFSv4 servers, depending on how you configured them, could require the client to mount different paths depending on whether the client was using NFSv4 or an older version. This is no longer true of newer Linux servers.
Getting detailed debug output of the client/server interactions
NFS and RPC Trace Debugging
You can capture more information about exactly what the client or server thinks is going on by enabling trace debugging. Trace debugging puts messages on the console and in /var/log/messages as the client or server goes through its paces so you can track progress and have some idea what request is being processed.
The debugging value is a bit mask that indicates which types of events you'd like to see traced. For information on the flag values, look in include/linux/nfs_fs.h, include/linux/lockd/debug.h, include/linux/sunrpc/debug.h, or include/linux/nfsd/debug.h.
To set the debugging value, you use a sysctl like so:
sudo sysctl -w sunrpc.nfs_debug=1023
and to turn off debugging, just do this:
sudo sysctl -w sunrpc.nfs_debug=0
See also sunrpc.nfsd_debug, sunrpc.rpc_debug, and sunrpc.nlm_debug.
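Since the value is a bit mask, 1023 (0x3ff) simply means "the lowest ten flags are all on". If you want something narrower, the arithmetic can be sketched as below. The helper names are made up, and the flag values in the usage line are placeholders; take the real ones from the header files listed above.

```shell
# Sketch only: helper names are assumptions, not part of any NFS tool.

# OR together any number of flag values (decimal or 0x-prefixed hex)
# and print the resulting mask in decimal, ready for sysctl -w.
debug_mask() {
    local mask=0 flag
    for flag in "$@"; do
        mask=$(( mask | $flag ))
    done
    echo "$mask"
}

# Print each bit that is set in a mask, one hex value per line.
mask_bits() {
    local mask=$(( $1 )) bit=1
    while [ "$bit" -le "$mask" ]; do
        if [ $(( mask & bit )) -ne 0 ]; then
            printf '0x%04x\n' "$bit"
        fi
        bit=$(( bit << 1 ))
    done
}

# Usage (placeholder flag values, not run here):
#   sudo sysctl -w sunrpc.nfs_debug="$(debug_mask 0x0001 0x0010)"
```

`mask_bits 1023` is a quick way to see which individual flag values a mask you found somewhere actually enables.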
Sometimes this kind of tracing can produce voluminous output. To ensure that your system log daemon can handle the traffic, make these adjustments:
- When you build your kernel, set the CONFIG_LOG_BUF_SHIFT option to a larger value than is recommended for your hardware. That will allow the kernel to buffer more log messages.
- Edit /etc/syslog.conf and place a "-" in front of "/var/log/messages" -- so you get "-/var/log/messages". That will switch syslogd into async mode to allow it to keep up.
You may also consider enabling serial console support. This will cause all printk()'s to be delayed by the time it takes to write the message on the serial port. While this means that kernel logging can now easily keep up with trace message logging, it will also introduce a significant change in timing that may cause your problem to become unreproducible!
Capturing a Network Trace
If you suspect the problem may involve some sort of miscommunication between the client and server, it can be useful for debugging purposes to dump the communication stream:
Start `tcpdump -s 9000 -w /tmp/dump.out port 2049` on the client, then conduct the client/server interaction. Review the /tmp/dump.out file (or include it with your bug report).
Useful tips:
- If you build your own kernels, enable CONFIG_PACKET_MMAP (Under Device Drivers --> Networking Support --> Network Options) to help tcpdump to keep up with traffic.
- Use a tmpfs file system for the tcpdump output file. tcpdump will keep up more easily, especially with gigabit speed transfer rates.
- Capture a trace on both ends if you suspect a network problem. Comparing the traces will show what each side of the communication is seeing.
- Leave off the "port 2049" to capture DNS, NIS, LDAP, or Kerberos traffic, if you suspect one of these auxiliary protocols is causing misbehavior.
- Don't forget about the command-line tools tcpslice and tethereal (renamed tshark when Ethereal became Wireshark) if you have a really big trace and you need to split it into manageable chunks.
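A couple of the tips above (tmpfs output, dropping the port filter) only change the arguments to tcpdump, so they are easy to wrap. This sketch only prints the command line rather than running it; the function name is made up, and /dev/shm is assumed to be a tmpfs mount, which is common but not guaranteed.

```shell
# Sketch only: nfs_capture_cmd is a made-up helper that prints, but does
# not run, a tcpdump command line for an NFS capture.
#   $1  output file, ideally on a tmpfs mount
#   $2  optional capture filter; pass "" to capture all traffic
nfs_capture_cmd() {
    local out="$1" filter="${2-port 2049}"
    printf 'tcpdump -s 9000 -w %s' "$out"
    if [ -n "$filter" ]; then
        printf ' %s' "$filter"
    fi
    printf '\n'
}

# Usage (not run here):
#   nfs_capture_cmd /dev/shm/dump.out        # NFS traffic only
#   nfs_capture_cmd /dev/shm/dump.out ""     # everything, incl. DNS/Kerberos
```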
Kernel Stack Traceback
If you have hung processes, capture a stack traceback to show where the processes are waiting in the kernel. You will need to build your kernel with the CONFIG_MAGIC_SYSRQ option (under Kernel Hacking) to enable stack traceback.
First, look in /etc/sysctl.conf to see if kernel.sysrq is set to 1. If not, then run this command:
echo 1 > /proc/sys/kernel/sysrq
Next, trigger a stack traceback via this command:
echo t > /proc/sysrq-trigger
Look on your console or in /var/log/messages for the output.
Another option, which doesn't require rebuilding your kernel, is to grab the contents of /proc/<pid>/wchan for every process on your system. This doesn't give a full traceback, but it will show where each process is waiting, which is sometimes useful. A simple bash script to do this might look like this:
for i in /proc/*/wchan
do
    echo "Process $i"
    cat "$i"
    echo " "
done
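The loop above prints every process, which is a lot of output on a busy machine. Hung processes usually sit in uninterruptible sleep (state D), so a variant that prints only those is sketched here. The function name and the optional proc-root argument are assumptions added for illustration; it reads the Name: and State: lines of /proc/<pid>/status.

```shell
# Sketch only: blocked_procs is a made-up helper name.
# Lists processes in uninterruptible sleep (state D) as "pid name wchan".
# $1 is the proc root (default /proc); the parameter exists so the
# function can be tried against a saved copy or a test tree.
blocked_procs() {
    local root="${1:-/proc}" dir state name
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/status" ] || continue
        state=$(awk '$1 == "State:" { print $2 }' "$dir/status")
        [ "$state" = "D" ] || continue
        name=$(awk '$1 == "Name:" { print $2 }' "$dir/status")
        printf '%s %s %s\n' "${dir##*/}" "$name" "$(cat "$dir/wchan" 2>/dev/null)"
    done
}
```

Run it a few times a second or two apart: a process that shows up every time with the same wait channel is a better suspect than one that appears once.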
Making Sense of a Kernel Oops report
Tip: to get a clean oops report, make sure you've enabled the CONFIG_FRAME_POINTER option under Kernel Hacking when you build your kernel. Then, when you install, copy the System.map file from your build to your boot directory and name it "System.map-`uname -r`" so that the kernel can find it to resolve symbols properly.
"Reboot" the NFSv4 server without shutting down the machine
Just shut down rpc.nfsd and start it again.
Comparing results when mounting via NFSv3 and NFSv4
Find a file that is differing between v3 and v4, and look at the output from the `stat` utility.
Or use `ls -lid --time-style=full-iso` and `ls -lid --time=ctime --time-style=full-iso` if you don't have stat.
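If you have the same export mounted twice, once with each version, the comparison can be sketched as follows. This assumes GNU stat (for the `-c` format option); the function names are made up, and the mount points in the usage line are placeholders. Inode numbers are deliberately left out of the comparison.

```shell
# Sketch only: helper names are assumptions; requires GNU stat.

# Print the attributes worth comparing for one path.
stat_fields() {
    stat -c 'type=%F mode=%a uid=%u gid=%g size=%s mtime=%Y ctime=%Z' "$1"
}

# Succeeds if both paths report identical attributes; otherwise prints
# both attribute lines and returns 1. Returns 2 if stat itself fails.
compare_attrs() {
    local a b
    a=$(stat_fields "$1") || return 2
    b=$(stat_fields "$2") || return 2
    [ "$a" = "$b" ] && return 0
    printf '%s: %s\n%s: %s\n' "$1" "$a" "$2" "$b"
    return 1
}

# Usage (placeholder mount points, not run here):
#   compare_attrs /mnt/v3/some/file /mnt/v4/some/file
```

A uid or gid that differs only on the NFSv4 side often points back at the idmapping configuration.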
Kerberos issues
Check hostnames
Kerberos requires that the hostname and domain name used in the keytab be correct. Run `hostname` and look in /etc/hosts to double-check that they are set properly. Compare with what you've listed in your keytab file.
Check keytabs
Run the following command to check your keytab:
klist -k
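To tie this back to the hostname check, you can look for the expected service principal in the listing. This is a sketch with a made-up function name; it assumes plain `klist -k` output (KVNO in the first column, principal in the second) and that your NFS principals are named nfs/<fully-qualified-hostname>@REALM.

```shell
# Sketch only: keytab_has_principal is a made-up helper name.
# Reads plain `klist -k` output on stdin; succeeds if a principal for
# the given service ($1) and host ($2) is listed in the second column.
keytab_has_principal() {
    awk -v p="$1/$2@" 'index($2, p) == 1 { found = 1 } END { exit !found }'
}

# Usage (not run here):
#   klist -k | keytab_has_principal nfs "$(hostname -f)" \
#       || echo "no nfs/ principal for this hostname in the keytab"
```

The match is case-sensitive on purpose, since a keytab entry that differs from the hostname only in case is still a mismatch worth noticing.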
Check krb5 ccache file
If you see log messages regarding something like 'FILE:/tmp/krb5cc_machine_FOO.BAR.AD.ROOT', you can review the file after trying to do the mount via:
klist -e -f -c /tmp/krb5cc_machine_FOO.BAR.AD.ROOT
This will list info about your principals such as the valid/expire dates, encryption types, etc.