NetworkTracing

Introduction

This article provides general advice on how to use a network dump to troubleshoot NFS problems. The article introduces the basic elements of the RPC and NFS protocols as they appear on the network. Then, it follows up with how these "upper level" protocols change depending on whether the underlying transport is datagram (UDP) or stream (TCP) oriented. Afterwards, the article reviews various tools that can be used to capture and analyze network traces, and discusses some of their features and limitations. Finally, it introduces some simple techniques for organizing your problem determination work flow when faced with a mountain of trace data.

There are some items that are not included in this discussion. The article doesn't cover client mount options or server export options, for example. For a complete overview of NFS, we recommend Callaghan's NFS Illustrated (see the Resources section at the end of this article).

RPC protocol basics

"Remote Prodedure Call", or RPC, is a framework that an application uses to invoke remote services via a standard procedure call. A developer can use RPC write what appears to be a normal function call, and, transparently to the application, the function may be executed in another address space, or on another host entirely.

An RPC client is a library of services that:

  • locates remote services,
  • converts parameters to a standard network format,
  • sends the arguments to the remote service while causing the caller to wait,
  • receives the results,
  • converts the results back to a locally recognized format,
  • and finally wakes the caller and returns the results to the caller.

An RPC server is a program that:

  • advertises a service,
  • receives requests from clients in a standard network format,
  • converts the requests to a locally recognized format, and calls a local procedure to process them,
  • receives the results and converts them to standard network format,
  • and finally sends the results back to the client.

An RPC client can send queued requests in any order, and a server can reply to the requests in any order.

Notice that there is no mechanism for guaranteeing that a procedure call is executed once and only once on the server. If a request gets dropped on its way to the server, or a reply is dropped on its way back to the client, the client can recover only by retransmitting the request after a timeout. In the second case (the reply was lost), the request has already been executed once on the server, and the client's retransmission will cause the request to be executed again. This has ramifications for NFS, as we will see later.

Because the application that handles the procedure call is possibly remote, the RPC protocol must also provide an authentication service. This identifies the user of the calling application to the remote service, so the service can decide whether it is authorized to act on that user's behalf.

Just like when they program in C, application developers are responsible for defining the procedure names, the arguments and their types, and the results and their types, then feeding these definitions to a compiler to build the code that handles the actual RPC procedure calls. A special language that a developer uses to define all these things is converted by "rpcgen" into C code which can be compiled normally and linked with the developer's application.

XDR in 5 seconds

Converting the arguments and results between network format and a locally recognized format is called "marshalling" and "unmarshalling". A protocol known as "eXternal Data Representation," or XDR, describes the types and encoding methods for RPC arguments and results. Network tracing tools automatically and transparently handle XDR, so we won't be too concerned about it here.

In brief, each argument or result takes up a multiple of 4 bytes on the wire. For example, if a character string is to be passed, the length of the string takes up 4 bytes, and the string itself is padded with zeroes so that the next XDR data field starts at a 4-byte offset. Using this definition of a string, "Hello, World!" would be represented on the network by three bytes of zeros and a byte containing 13 (the length), followed by the 13 characters of the string itself, followed by three zeroes to pad the string data to 16 bytes.

RPC fields on the wire are big-endian, and the local and remote hosts both convert big endian to local formats before any real work is done. XDR can also pass unconverted data, which is called "opaque"; in this case, neither the RPC client nor RPC server touch the data as it is passed between applications.
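
For example, here is how those 20 bytes of encoding could be written out literally in C. This is an illustrative sketch of the wire format only, not code from any XDR library.

 #include <stdint.h>

 /*
  * Illustrative sketch: the XDR encoding of the string "Hello, World!"
  * (13 characters) as it appears on the wire.  A 4-byte big-endian length
  * is followed by the string data, zero-padded to a 4-byte boundary.
  */
 static const unsigned char xdr_hello[20] = {
     0x00, 0x00, 0x00, 0x0d,                 /* length = 13 */
     'H', 'e', 'l', 'l', 'o', ',', ' ',      /* 13 bytes of string data */
     'W', 'o', 'r', 'l', 'd', '!',
     0x00, 0x00, 0x00                        /* pad to a 4-byte boundary */
 };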

RPC header, exploded

An RPC header starts every RPC request or reply on the wire. The header is actually pretty simple, once you understand what the fields mean. In fact, while reading traces, you will be mostly concerned with the NFS header and data, so this section is presented merely so you understand how the NFS protocol works on the wire.

The first field in an RPC header contains a four-byte value called the transaction ID, or XID. This value is used by RPC clients to pair an incoming reply with its corresponding request. There are few constraints on what values an XID can take, but the client must ensure that each unique RPC call from applications uses a unique XID so that the client can properly distinguish the replies that come back from the server.

The second field contains a four-byte zero value if this is a request from an RPC client, or a four-byte one value if this is a reply from an RPC server.

RPC request

After the first two fields, each RPC request contains fields that specify the program number, authentication credentials, and so on. The first of these is the RPC version field, which is always 2; all RPC implementations in use today speak RPC protocol version 2.

Following that is the RPC program number. This represents the RPC "application" that is being invoked on the server. For example, to invoke an NFS server, this field would contain 100003. These numbers are centrally assigned (originally by Sun Microsystems, now by IANA). An RPC program is also versioned, like the protocol itself, and that version number is contained in the next field. For NFSv3, then, the program field would contain 100003, and the version field would contain 3.

The values in these fields are used during RPC portmapping to locate RPC services. The RPC client asks the portmapper daemon running on RPC servers which network port currently provides the service represented by the program and version number. The RPC client can then connect to the port provided by the server's portmapper and send it RPC requests.

Following these two fields is the procedure number. Each procedure in an RPC program has a unique number that distinguishes it from other procedures. An NFSv3 READ request uses the number 6 in this field. There are 22 procedures in the NFSv3 program.

After the procedure number field come two opaque fields which are used to convey authentication information. Finally, the program-specific header appears. This is covered below in the "NFS protocol basics" section.
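
As an illustration, the fixed fields at the front of every RPC call can be sketched as a C structure. The field names are approximations of those in RFC 1831, not a structure from any real RPC library; every integer field is a 4-byte big-endian value on the wire.

 #include <stdint.h>

 /* Illustrative sketch of the fixed fields at the front of an RPC call
  * (RFC 1831).  Every integer field is a 4-byte big-endian value on the
  * wire; this is not a structure from any real RPC library. */
 struct rpc_call_header {
     uint32_t xid;        /* transaction ID, chosen by the client */
     uint32_t msg_type;   /* 0 = call */
     uint32_t rpcvers;    /* RPC protocol version, always 2 */
     uint32_t prog;       /* program number, e.g. 100003 for NFS */
     uint32_t vers;       /* program version, e.g. 3 for NFSv3 */
     uint32_t proc;       /* procedure number, e.g. 6 for NFSv3 READ */
     /* two variable-length opaque authentication fields (credential and
      * verifier) follow, then the program-specific arguments */
 };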

RPC reply

The first two fields of an RPC reply are the same as the first two fields in an RPC request: an RPC XID followed by a field that indicates this is a reply.

There are two types of RPC reply: an "accepted" reply and a "rejected" reply. A "rejected" reply indicates either that the RPC version was incorrect, or that the RPC server did not recognize the authentication credentials. In the version-mismatch case, the reply carries a pair of values indicating the lowest and highest RPC versions the server supports.

The "accepted" reply can indicate a successful or failed execution of a request. A failed request can be because the program, version, or procedure number is out of range, or the server couldn't parse the arguments in the request for some reason.

A successful reply finishes with the marshalled results of the RPC request. Note that an NFS server can return an error status in a reply, even though the RPC reply is labeled "successful."
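
A similar sketch of an "accepted" reply follows. Again this is illustrative only; because the verifier is variable-length, a real decoder reads these fields one at a time rather than overlaying a structure on the buffer.

 #include <stdint.h>

 /* Illustrative sketch of an "accepted" RPC reply (RFC 1831). */
 struct rpc_reply_prefix {
     uint32_t xid;          /* matches the XID of the original call */
     uint32_t msg_type;     /* 1 = reply */
     uint32_t reply_stat;   /* 0 = accepted, 1 = rejected */
 };
 /* An accepted reply continues with:
  *   - a variable-length opaque verifier,
  *   - a 4-byte accept status (0 = success; other values mean the program,
  *     version, or procedure was out of range, or the arguments could not
  *     be parsed),
  *   - and, on success, the marshalled results of the call. */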

RPC transports

RPC on the network operates on top of a lower-level network transport. To date, the two most commonly used transports for RPC have been UDP and TCP.

RPC over UDP datagrams

RPC over UDP is the traditional way in which RPC requests and replies move between local and remote hosts. Each RPC request and each RPC reply fits into a single UDP datagram, making it a simple, low-overhead protocol. The largest UDP datagram is 64KB, so the largest RPC request or reply over UDP is 64KB.

UDP is an unreliable transport, thus the RPC layer is responsible for congestion control and managing lost requests and replies. If a server drops a request or the network drops a datagram, the RPC client can do nothing else but retransmit after a time out. Thus RPC over UDP has some difficulty guaranteeing that an RPC request is processed exactly once on the server. For example, if the network drops a reply, then the RPC client has to retransmit the request. The server sees the request twice and may process it twice if it has no other way to determine that this is a duplicate.

The server acknowledges the receipt of an RPC request by returning a reply. There is no other way to determine whether a server has seen a particular request that was sent via UDP. Thus it is very difficult for an RPC client to tell whether it is encountering network congestion or server congestion, or a temporary glitch, when a request is lost.

UDP datagrams are broken into MTU-sized IP fragments when transmitted. The standard Ethernet MTU is 1500 bytes, so large UDP datagrams sent on Ethernet links are broken into fragments of at most 1500 bytes. If one of these fragments is lost, there is no way to recover the whole IP packet, thus the whole UDP datagram is lost. So as UDP datagrams get larger, the likelihood that a lost fragment causes a dropped UDP datagram increases.
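
For example, an NFS READ reply carrying 32KB of data becomes a single UDP datagram a little larger than 32KB, which on an Ethernet link with a 1500-byte MTU is split into more than twenty IP fragments; if any one of those frames is lost, the entire datagram, and with it the whole RPC reply, is discarded and must eventually be retransmitted.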

The host that receives the fragments must reassemble them into a full datagram. This can require an unbounded amount of memory because fragments can arrive out of order. The receiver must time out incomplete IP packets (usually after 45 seconds) to conserve its resources.

RPC over streams

Although there is somewhat more overhead on the network, sending RPC requests over TCP is considerably more robust than sending requests via UDP datagrams. A TCP stream is a reliable connection, so the sender can always depend on the network layer to manage congestion and packet loss. In almost every network scenario, it is much more rare for a server to drop a request than for a network packet to be lost.

Because a TCP connection is a byte stream, and not a series of individual datagrams, RPC over streams adds a new field before the XID field in the RPC header to separate each RPC request and reply. This is called a "record marker" and contains the length of the RPC request or reply to follow. This allows RPC over TCP to send much larger requests and replies: a single record can carry up to 2^31 - 1 bytes.
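
A receiver peels the record marker off the stream before handing the rest to the RPC layer. Strictly speaking (per the record-marking scheme in RFC 1831), the marker describes one fragment of a record: its high-order bit flags the last fragment, and the remaining 31 bits give the length, which is where the 2^31 - 1 limit comes from. A minimal sketch of decoding it:

 #include <stdint.h>
 #include <arpa/inet.h>

 /* Minimal sketch of decoding the 4-byte record marker (RFC 1831 record
  * marking).  The high-order bit flags the last fragment of a record; the
  * low 31 bits give the fragment length. */
 static void parse_record_marker(uint32_t marker_on_wire,
                                 int *last_fragment, uint32_t *length)
 {
     uint32_t marker = ntohl(marker_on_wire);   /* wire format is big-endian */

     *last_fragment = (marker & 0x80000000u) != 0;
     *length        =  marker & 0x7fffffffu;
 }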

TCP has its own quirks, however. If a network partition occurs or a server or client reboots, the TCP connection state must be re-established, and some RPC retransmission does occur during such events. In addition, many NFS servers implement a form of back pressure on clients by artificially closing the TCP receive window to slow clients down. This back pressure isn't possible with some other transports, such as UDP, and is not required by any RPC or NFS standard.

Finally, in cases where only a single RPC request is needed (for example, when performing an NFS mount), TCP's three-way handshake is significantly less efficient than a single UDP request and reply packet.

NFS protocol basics

The NFS protocol implements a distributed file system on top of RPC. Thus, an NFS client translates system calls like open(2), read(2), and write(2) into one or more remote procedure calls. An NFS server translates remote procedure calls into accesses to a local physical file system.

A file (usually /etc/exports) that contains a list of local file systems on the NFS server controls which file systems are visible to NFS clients. NFS clients cannot cross local mount points on NFS servers. (Note: NFS version 4 changes that).

NFS data types

Files and directories in NFS are represented by "NFS file handles." These are opaque multi-byte objects that do not change during the lifetime of a file (i.e., from when the file is created to when the file is deleted, even across reboots of the server or client). NFSv2 file handles are always 32 bytes, but NFSv3 file handles are variable length, so a 4-byte length field is included at the start of each NFSv3 file handle. Servers usually use the same file handle length for all files.

Files and directories each have a set of attributes. The attributes include:

  • the file handle (up to 64 bytes),
  • access, modification, and change time stamps (64 bit values)
  • permission bits and file type (24 bits)
  • number of links (32 bits)
  • owner and group Id (each 32 bits)
  • file size (64 bits)

and so on. These are almost always passed in a single bundle inside requests and replies. Sometimes the protocol allows the attributes to be present or absent. In this case, a four-byte value precedes the attributes; if it is a one, then the attributes are present; if it is a zero, the attributes are not present.

Timestamp fields are 8-byte values, broken into two 4-byte values representing seconds and nanoseconds since midnight, January 1, 1970 (NFS version 2 uses microseconds instead of nanoseconds). Most servers don't store nanosecond-resolution timestamps in their file systems, so often the lowest order bits in the nanosecond field are zeroes. Some servers, such as Linux, don't support subsecond resolution timestamps, thus the entire low-order 4-byte field is zero.
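
As a sketch of how these appear on the wire, here are the NFSv3 timestamp and "optional attributes" encodings, roughly following the names used in RFC 1813. This is illustrative only, not code from any implementation.

 #include <stdint.h>

 /* Sketch of the NFSv3 timestamp encoding (RFC 1813). */
 struct nfstime3 {
     uint32_t seconds;    /* seconds since midnight, January 1, 1970 */
     uint32_t nseconds;   /* nanoseconds within that second */
 };

 /* "Optional" attributes: a 4-byte flag says whether attributes follow. */
 struct post_op_attr {
     uint32_t attributes_follow;   /* 1 = attributes present, 0 = absent */
     /* if attributes_follow is 1, the full attribute bundle (type, mode,
      * link count, owner, group, size, timestamps, and so on) follows */
 };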

Examples of NFS requests

Version 3 of the NFS protocol has 22 different procedures, including NULL (which, as you might expect, does nothing), FSSTAT, WRITE, ACCESS, and others. The calls and replies for each of these 22 procedures are different, so we'll provide just a couple of samples here to get you started.

The NFSv3 GETATTR procedure

The RPC header contains the program number 100003, the version number 3, and the procedure number 1, which together identify the NFS version 3 GETATTR procedure. The request contains a single argument which follows the RPC header. This argument is an NFS version 3 file handle, which is a 4-byte length followed by up to 64 bytes of file handle.

The reply is similar to the request, and contains two fields: the 4-byte NFS status field, and possibly a set of attributes. The status field contains a value that is similar to an errno value that describes the general result of the operation. If the status is zero, the GETATTR was successful, and the requested attributes follow. If the status is not zero, nothing else is returned.
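
Putting it together, the GETATTR call and reply bodies (everything after the RPC header) can be sketched like this. The layout follows RFC 1813; the names are ours, not code from an implementation.

 #include <stdint.h>

 /* Call body: a single argument, the file handle. */
 struct getattr3_args {
     uint32_t      fh_len;        /* file handle length, up to 64 */
     unsigned char fh_data[64];   /* only fh_len bytes (padded to a 4-byte
                                     boundary) actually appear on the wire */
 };

 /* Reply body: a status, then attributes only on success. */
 struct getattr3_result {
     uint32_t status;             /* 0 = success, otherwise an NFS error code */
     /* if status is 0, a full attribute bundle follows */
 };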

The NFSv3 READ procedure

The READ procedure is somewhat more complicated than the GETATTR procedure. However, it starts the same: an RPC header. The procedure number is 6, and the first NFS argument is an NFS version 3 file handle. READ requests also contain an 8-byte "offset" and 4-byte "count" field. The offset field tells the server where to start reading in the file, and the count tells the server how many bytes to return.

The reply to a READ request is more interesting. Like the GETATTR reply, the first field is an NFS status field. If the status field is zero, a special type of file attributes is returned, followed by a 4-byte count of how many bytes were actually read and returned by the server, then a 4-byte value that indicates whether the server reached the end of the file during this read, and finally the requested file data. If the status is not zero, only the special attributes are returned.
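
A rough sketch of the READ call and reply bodies, again following RFC 1813 and again illustrative only; the variable-length pieces are shown as comments.

 #include <stdint.h>

 struct read3_args {
     /* file handle: a 4-byte length plus up to 64 bytes of opaque data */
     uint64_t offset;   /* where in the file to start reading */
     uint32_t count;    /* how many bytes to return */
 };

 struct read3_result {
     uint32_t status;   /* 0 = success */
     /* post-op attributes: a 4-byte "present" flag, then the attributes */
     uint32_t count;    /* bytes actually read (success only) */
     uint32_t eof;      /* 1 if this read reached the end of the file */
     /* the file data itself: a 4-byte length, the data, then padding */
 };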

These special attributes are returned in several other procedures, so they're worth a closer look. The attributes returned during an NFSv3 READ request are known as post-op attributes. In NFS version 2, these are no different than normal file attributes. However, NFS version 3 adds a new optional feature called "weak cache consistency," or WCC, which will be discussed in a later section of this document.

To implement WCC, the server can return a short set of attributes that describe the state of the file before the request was carried out, in addition to the normal set of attributes that are always returned. These pre-op attributes contain the file's size, mtime, and ctime as recorded just before the request.

Since servers are not required to return pre-op attributes, an additional field appears just before the pre-op attributes. If the field is a one, the pre-op attributes are present. If it is a zero, the pre-op attributes are not present; only the normal set of attributes are present.

Protocol trickiness

The basics of the NFS protocol are fairly straightforward. However there are some corner cases that need explanation because they introduce ambiguous and non-deterministic behavior.

Idempotency

An idempotent NFS request is a request that has no lasting side-effects in the file system. A GETATTR request, for instance, is idempotent, because it doesn't change the data or attributes of a file. If a client sends two GETATTR requests to a server that target the same file, the server can execute them in either order and the results will be identical, assuming the file is not being otherwise changed.

A non-idempotent NFS request causes changes. It matters in which order the server performs non-idempotent requests. If two WRITE requests target the same file and offset but contain different data, the file will look different depending on which order the requests are handled on the server.

Normally clients and servers are careful to avoid conflicting non-idempotent requests. However, there are some cases that can't be handled. If the server crashes and reboots, the clients resend outstanding requests to make sure the server has seen the requests. This can result in WRITE requests being re-applied in a different order. Usually, any situation that requires a retransmit introduces such non-determinism.

To mitigate the effects of retransmitted non-idempotent requests, NFS servers cache the results of previous requests for a short period. Using this reply cache, the server can detect most retransmitted requests, and return the cached results instead of re-applying the requests.
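
The reply cache (often called a duplicate request cache) is a server implementation detail rather than part of the protocol. As a rough illustration, an entry might be keyed and stored along these lines; the names are hypothetical, not code from any real server.

 #include <stdint.h>
 #include <stddef.h>
 #include <time.h>
 #include <netinet/in.h>

 /* Hypothetical sketch of one duplicate reply cache entry.  A retransmitted
  * request matches an entry when its XID, source address, and procedure all
  * match; the server then resends cached_reply instead of re-executing. */
 struct reply_cache_entry {
     uint32_t            xid;            /* XID of the original request */
     struct sockaddr_in  client;         /* client address and port */
     uint32_t            prog, vers, proc;
     unsigned char      *cached_reply;   /* reply sent the first time */
     size_t              cached_len;
     time_t              timestamp;      /* old entries expire after a short period */
 };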

Stable v. unstable writes

NFS version 2 used a simple write model so that servers did not need to maintain any state, making server crash recovery very simple. A client could safely assume that once it received a reply to a WRITE request, the server had committed the data to permanent storage. Unfortunately, this meant that writes had to go to disk immediately before the server could reply, which presented a performance dilemma.

To address this dilemma, protocol designers introduced a new write model in NFS version 3, known as "unstable" writes. The NFS version 2 model became known as "stable" writes: data had to be on stable storage before the server could reply. NFS version 3 allows a new state for WRITE requests, in which the server replies that the write was received but the data is not yet on stable storage. At some later point, the client sends a COMMIT request to ask that the data be made permanent.

The window between the server's unstable reply and the client's subsequent COMMIT request allows the server to receive multiple writes and schedule them efficiently to be written to permanent storage, making NFS version 3 servers far more scalable than version 2 servers.
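
The desired stability level is carried in the WRITE request itself, and both the WRITE and COMMIT replies carry a "write verifier" that the client uses to detect a server reboot. The following sketch summarizes the relevant NFSv3 definitions from RFC 1813; it is illustrative, not code from an implementation.

 #include <stdint.h>

 /* Stability levels a client can request in an NFSv3 WRITE (RFC 1813). */
 enum stable_how {
     UNSTABLE  = 0,  /* server may reply before the data is on stable storage */
     DATA_SYNC = 1,  /* file data must be stable; metadata may lag */
     FILE_SYNC = 2   /* data and metadata must both be stable (the v2 model) */
 };

 /* Both the WRITE and COMMIT replies include an 8-byte write verifier.
  * If the verifier changes between an unstable WRITE and the later COMMIT,
  * the server rebooted in between and the client must resend the data. */
 typedef uint8_t writeverf3[8];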

Data and attribute caching

Any file system would be terribly slow if every application request caused some kind of disk access, and NFS is no exception. To improve performance, file systems store oft-used data and attributes in memory.

NFS is somewhat different than local file systems, however, in that files can be accessed on more than one client simultaneously. In such cases, a client needs to know when its caches become stale because an application running on some other client has modified a file.

NFS clients can use a file's mtime and ctime to detect changes to a file. Using attributes that file systems already store means that a file system doesn't have to be modified or re-implemented to be used with NFS.

A server returns file attribute information in the reply of most types of NFS requests. If a client notices that the mtime in the reply doesn't match the mtime it has cached, it knows that its cache is no longer fresh, and purges cached data for that file.

mtime time stamp resolution

Using a file's mtime and ctime works well if the timestamps are guaranteed to be unique after any change to the file. Unfortunately, many Linux file systems don't make that guarantee. Ext3, for example, stores only 32 bits for each time stamp, so its timestamps have a resolution of one second. If more than one change occurs to a file within the same second, NFS clients can't detect the later changes.

To work around this limitation, the Linux NFS client watches file size as well as mtime and ctime to detect when its caches become stale. This is not a 100% solution (especially if applications are writing, but not extending a file), but works in most cases.

weak cache consistency

Often a client will target the same file with many requests at the same time. If some of the requests are non-idempotent, and the replies come back in a different order, the mtime in the replies will often not be precisely the same as it was when the requests were made, and that will cause the client to invalidate its cache unnecessarily. This often has a negative performance impact.

NFS version 3 adds a special set of attributes to its replies, called "weak cache consistency," or WCC, attributes, which supply additional information a client can use to avoid unnecessary cache invalidation.

To implement WCC, the server can return a short set of attributes that describe the state of the file before the request was carried out, in addition to the normal set of attributes that are always returned. These "pre-op" attributes contain the file's size, mtime, and ctime as recorded just before the request.

Since servers are not required to return pre-op attributes, an additional field appears just before the pre-op attributes. If the field is a one, the pre-op attributes are present. If it is a zero, the pre-op attributes are not present; only the normal set of attributes are present.

A client can compare the size and timestamps returned in the WCC pre-op attributes to its cached attributes. If they match, then the client can be fairly certain that no other client changed the file, and its data and attribute cache remains fresh.
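
As a rough sketch of that comparison (hypothetical names and types, not the actual Linux client code):

 #include <stdint.h>

 /* Pre-op attributes carried in WCC data: size, mtime, and ctime as they
  * were just before the server processed the request. */
 struct preop_attrs {
     uint64_t size;
     uint32_t mtime_sec, mtime_nsec;
     uint32_t ctime_sec, ctime_nsec;
 };

 /* If the file looked the same just before the server handled our request
  * as it does in our cache, no other client changed it, so the cached data
  * and attributes are still fresh. */
 static int cache_still_fresh(const struct preop_attrs *preop,
                              const struct preop_attrs *cached)
 {
     return preop->size       == cached->size       &&
            preop->mtime_sec  == cached->mtime_sec  &&
            preop->mtime_nsec == cached->mtime_nsec &&
            preop->ctime_sec  == cached->ctime_sec  &&
            preop->ctime_nsec == cached->ctime_nsec;
 }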

close-to-open cache consistency

Using mtime to detect outside changes is a low-overhead way in which NFS clients can maintain fresh caches. However, if the NFS client checked a file's mtime every time it accessed a file, it would soon overwhelm servers with GETATTR requests. So clients cache attributes for 30 seconds to a minute, during which they don't check with the server to see whether files have changed. (This attribute caching is defeated by the "noac" mount option, by the way.)

In addition, clients assume a serial sharing model. NFS is optimized for a common workflow where client A opens a file, writes to it, closes it, then sometime later, client B opens a file, reads it, then closes it. Usually, A and B will not access the file at the same time. (If they do, it's time to use file locking or uncached I/O to prevent data corruption).

To maintain cache coherency, then, it is usually a simple matter to flush any changes when an application closes a file, and force a GETATTR during an open(2) to check whether the file has changed. This is known as "close-to-open cache coherency." It's the reason why clients always send a GETATTR whenever a file is opened.

client cache implications of using file locks

As mentioned above, applications that share files concurrently should use some kind of synchronization mechanism to ensure proper operation. NFS provides a mechanism that can synchronize accesses (and client caches) so that applications operate as if they are running on the same client even if they aren't. This mechanism is called advisory file locking. Advisory file locking means that the synchronizing mechanism does not prevent reads or writes outside of the locks.

The Linux NFS client treats a file lock or unlock request as a cache consistency check point. Locking a file usually means someone recently made some changes that you want a look at, so the client purges its cache to make sure read(2) gets the very latest data. Unlocking a file usually means that you have made some changes that you want others to view, so the client flushes any writes back to the server to make sure that subsequent lockers can see your very latest changes.

Commonly-used tools

Here are some tools commonly used to capture network traffic. You can use any of these to capture traffic for later analysis. Most of these tools can read and write the same "pcap" network trace file format (snoop, described below, uses its own format).

All of these tools require root privileges when capturing live network traffic in order to put the local Ethernet device into "promiscuous" mode. Usually Ethernet interfaces pass up to the host only traffic whose destination address matches the interface's address. In "promiscuous" mode, the device passes all traffic to the host, regardless of its destination address.

The analysis tools can be run as a normal user, however, as long as they have privileges to access the dump file.

In the following discussion, "frame" refers to a single Ethernet frame.

tcpdump

Tcpdump is the mother of all network analysis tools. It's a command-line tool that has been around since the dinosaur era of computing, and these days it is maintained as an open source application.

It doesn't have a lot of NFS smarts, however, so generally this is a tool that should be used to capture network traffic to a file for later analysis by a tool like Wireshark that can dissect RPC and NFS traffic more completely.

snoop

Snoop is commonly found on Solaris systems. It's a command-line tool that behaves much like tcpdump, but the syntax is just different enough to be confusing. It can parse RPC and NFS headers, but the only option is to display everything in blinding detail, which can often be useless.

Snoop uses its own capture file format which must be converted before it can be read by tcpslice or tcpdump.

tethereal and wireshark

Ethereal, recently renamed Wireshark, is a GUI network capture and analysis tool that can be found on Windows, Unix, Linux, and even Macintosh (under the Mac's X11 environment). Ethereal is the preferred trace analysis tool, but it is limited to smaller traces (roughly 100MB, depending on which preferences are selected).

Tethereal is a command-line tool that accompanies ethereal. It can replace tcpdump, and uses roughly the same filter syntax as ethereal. It can often handle trace files that are much larger than ethereal can.

tcpslice

Tcpslice is a little-known tool that is almost as ancient as tcpdump. It can split very large trace files into smaller files based on the time stamps of the frames in the files. Although its user interface is about as dumb as a box of hair, it can be really useful when you have gigabytes of trace data in a single file that you want to view in ethereal.

pktt

NetApp filers have a built-in trace capture tool called pktt. It is a simplified form of tcpdump that allows you to capture network data and dump it into a file in the filer's root volume. It is a capture-only tool; it has no analysis capabilities. Pktt captures data in the standard "pcap" file format.

Capturing network traces

Capturing useful network traces requires some care. Here are some tips common to all of the tools we've mentioned above for capturing a clean trace that contains all the data needed for analysis.

Reducing your capture data rate

Today's networks typically burst at close to a gigabit per second. Over the minutes (or even hours) sometimes required to capture the network traffic associated with an abnormal event, you can expect to see an enormous amount of traffic dumped into your capture files. During bursty traffic, your CPU or disks may not be able to keep up, dropping some of the incoming frames; this leaves missing data in your capture file, or changes application timing enough to prevent you from reproducing your problem.

Here are some tips on reducing the data rate during your capture session to help capture a complete trace.

The snaplen option

Network capture tools can reduce the maximum number of bytes captured per frame, called the "snaplen". By default, tcpdump uses 96 bytes, which is barely enough to capture all the bytes in the transport headers. It's usually useful to set the maximum number of captured bytes explicitly. Tcpdump uses the "-s" option to set the snaplen in bytes, and the other capture tools have similar options.

For NFS over UDP, you only need to capture the first 300 or so bytes of each frame. Since a standard Ethernet frame carries up to 1500 bytes of payload, you can see that this eliminates a significant fraction of the bytes on the wire, while preserving the IP, transport, RPC, and NFS header information.

For NFS over TCP, capture the full MTU of each frame. The standard Ethernet MTU is 1500 bytes, so a snaplen of around 1514 (a full frame including the 14-byte Ethernet header) is enough, and keeps the capture tool from allocating buffers that are larger than necessary. Larger buffers mean fewer buffers.

Filtering

The capture tools have sophisticated filtering capabilities that allow you to specify which frames are interesting, and which can be left out of the capture file. This can reduce your capture file size considerably.

The simplest and most common filter is by host. With tcpdump, you specify "host name" where "name" is the domain name or IP address of the host whose data you want to include in your capture file. This is typically used to limit the capture to the NFS client and server you are testing.

Filters can get more sophisticated with Wireshark and tethereal, allowing you to filter on RPC or NFS header fields to get exactly the frames you want. See the Wireshark documentation for more information on how to use filters.

Memory file systems

On most recent Red Hat Linux systems, /dev/shm is a tmpfs file system. You can write your capture files into this file system (as long as they aren't too large) to prevent disk bottlenecks from causing your capture tools to drop frames.

If /dev/shm isn't present, you can set up a tmpfs file system to handle the capture file.

Reproducing on slow networks

To allow the capture tool to keep up with network traffic, you might consider reproducing your problem on a system with slower networking (like 100Mb/s instead of gigabit Ethernet). Some timing issues may not be reproducible on slower networks.

Splitting large trace files

In order to reduce the amount of data that your analysis tool has to digest, you can split your capture files.

Switch capture files

For long-running captures, you can reduce the size of your capture files by periodically stopping and restarting your capture tool using a script. You can save the capture files to a file system with more capacity, or you can simply delete each file until you or your script has detected the problem you were trying to reproduce.

Use tcpslice

As mentioned before, the tcpslice tool can nondestructively split arbitrarily large capture files into smaller ones. The tool reports the timestamps of the earliest and latest frames in a capture file. Then you specify the range of timestamps you want and redirect the output into another file.

The man page is frustratingly short and vague.

Examples

The tcpdump command is the most commonly installed network capture tool on Linux systems, and is somewhat easier to use than tethereal. Let's look at a few ways to use tcpdump for capturing NFS activity on a local area network. Remember that you can experiment with combining these options in many ways to obtain exactly the capture behavior you want.

Typical example

 tcpdump -s0 -w /tmp/dump host server.example.com

Here, we capture all traffic between the local host and the NFS server server.example.com, and dump it into the file /tmp/dump in pcap format. The "-s0" option asks for all of the bytes in every frame. Notice that we didn't specify an Ethernet device on our local host; that's usually OK to do when your local host has only one Ethernet device.

The tcpdump command must run as a root user in order to capture traffic. Capturing traffic must put the local Ethernet device into "promiscuous" mode, which only a root user can do. In this mode, the capture can pick up any packets on the local area network that is attached to the host.

Another example

 tcpdump -s300 -w /tmp/dump port 2049

If you know your client or server is hosting traffic other than NFS, it helps to limit the capture to traffic on port 2049 (the standard NFS port). Unless it has been specially configured, an NFS server always listens on port 2049, and responds to clients from port 2049. Clients, on the other hand, may choose any port to send and receive. In this example, the capture tool still sees all traffic on the client's LAN interface, but only frames to or from port 2049 on any host are written to the capture file.

Finally, the "-s300" option will pass only the first 300 bytes of every frame that matches the port filter. This is useful if you know the NFS traffic is UDP-based, and you need to restrict the capture stream to keep your capture file managebly small.

Re-reading trace files for additional filtering

It is sometimes useful to cull capture files even further before analysis. You can use tcpdump to do some simple filtering after you've already completed a capture of NFS traffic.

 tcpdump -r /tmp/dump -w /tmp/smaller tcp and src host 10.0.7.39

In this example, the tcpdump command reads a previously captured trace from the file /tmp/dump and applies two additional filters. The first filter eliminates any non-TCP-based traffic, and the second filter passes only traffic from 10.0.7.39. The output of the filters is written to the file /tmp/smaller in pcap format.

The tcp filter can be useful in cases where you know there is significant non-TCP traffic (for example, the LAN supports Netware hosts or carries router table traffic, or you want to eliminate UDP-based NFS traffic).

Basics of analyzing network traces

setting Wireshark preferences

starting with a hypothesis

counting NFS ops

displaying RPC round trip latencies

read and write length histograms

understanding portmap and mount

good and bad TCP behavior

good and bad UDP behavior

Resources

  1. RFCs 1094, 1813, 3530: The 3 versions of the NFS protocol
  2. RFCs 1831, 1832: The ONC RPC version 2 and XDR protocols
  3. Callaghan, Brent, "NFS Illustrated", Addison-Wesley Professional, 1st edition, 1999
  4. Linux NFS Frequently-Asked Questions