RpcClientTransportSwitch
Revision as of 01:44, 24 August 2007 by Chucklever (Talk | contribs)
Linux 2.6 RPC Transport Switch: Design & Implementation
AUTHOR Chuck Lever
VERSION Sat Feb 26 13:18:44 PST 2005
Purpose
We document the design for a transport switch in the Linux 2.6 RPC client.
Introduction
Today's RPC client and server in the Linux kernel use a socket-based transport layer API. This works well for existing network transport technologies such as IPv4 TCP over gigabit Ethernet.
In the near future, alternate transport technologies will appear which may be difficult to mate with the socket abstraction. Examples of such new technologies include transports that support direct data placement and TCP offload devices accessed directly rather than through the Linux kernel's network layer.
Additionally, other new technologies such as IPv6 and new stream protocols such as SCTP will require significant changes to the socket-based infrastructure in the RPC client and server, but may have little if any effect on other areas.
Finally, security mechanisms such as IPsec and Kerberos 5 privacy may have special buffer management requirements in the transport layer in order to provide as efficient an implementation as possible.
In the following text, we refer to today's RPC client and server, which do not have a generic transport switch implementation, as the "pre-switch" versions of the client and server.
Specification
Our final goal is an implementation that facilitates integration of alternate transports while retaining or improving the stability, performance, and maintainability of the pre-switch RPC client with socket-based transports. In other words, we want to have no negative impact on the performance or stability of the existing IPv4 socket-based transport as we add a transport switch capability. Toward that end, we will introduce as little new functionality to existing support as possible for IPv4 socket transports; we are simply moving code and data structures. When complete, the IPv4 socket transport implementation will act as a reference for new transport implementations.
A "transport implementation" provides the code base that supports a particular transport mechanism, such as "IPv4 socket." Eventually transport implementations will be contained in loadable kernel modules. As they are loaded, they will register with the RPC client and server. Each transport implementation provides a vector of procedures that create, bind, and connect a new transport instance, provide auxiliary services such as portmapping, and configure, send and receive data on, or destroy such instances.
Each transport connection between the client and server using a particular transport implementation is known as a "transport instance." Such an instance is identified by its transport implementation, and by the endpoint addresses of the client and server, and is represented by an rpc_xprt struct. For the "IPv4 socket" transport implementation, a transport instance is a single IPv4 socket connection that uses either the UDP or TCP network protocol. Note, for example, that a single transport instance might also consist of multiple sockets that share a workload, or an RDMA link with a passive failover IP socket, depending on how the instance's transport is implemented.
The transport API now contains methods to access various fields in the rpc_xprt struct. A transport-private data structure contains fields that are specific to a particular transport instance.
When the API is complete, transport endpoint addresses will be contained in a sockaddr_storage structure and an API method will be provided to retrieve the value of the remote peer's endpoint address. Setting the remote address will only be allowed during transport instance creation.
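The address handling described above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct demo_xprt`, `demo_set_peer`, and `demo_peeraddr` are hypothetical names, not part of the RPC client, and the retrieval helper returns a byte count purely to make the sketch easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical stand-in for a transport instance: the peer address lives
 * in a sockaddr_storage so any address family fits. */
struct demo_xprt {
	struct sockaddr_storage addr;
	size_t addrlen;
};

/* Set the remote address; in the design this happens only at creation. */
static int demo_set_peer(struct demo_xprt *x, const struct sockaddr *sa,
			 size_t len)
{
	assert(x != NULL && sa != NULL);
	if (len > sizeof(x->addr))
		return -1;
	memset(&x->addr, 0, sizeof(x->addr));
	memcpy(&x->addr, sa, len);
	x->addrlen = len;
	return 0;
}

/* Copy the remote address into a caller-supplied buffer, truncating if
 * the buffer is too small. Returns the number of bytes copied. */
static size_t demo_peeraddr(const struct demo_xprt *x, struct sockaddr *buf,
			    size_t bufsize)
{
	size_t n = x->addrlen < bufsize ? x->addrlen : bufsize;

	memcpy(buf, &x->addr, n);
	return n;
}
```

Keeping the setter separate from the getter mirrors the rule that the address is written once and only read afterwards.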
A transport implementation will usually include its own mechanism for RPC portmapping. For example, IPv4 sockets will use the standard RPC portmapper. IPv6 sockets may use rpcbind. Some implementations will not need any kind of port mapping; such implementations can provide the portmap methods as no-ops.
We defer the introduction of mechanisms by which user space, and subsequently the NFS client and server, specify which transport to use and parameters specific to a particular transport implementation. New mount options that control aspects of transport operation and changes to the mount_data structure will be considered on a case by case basis.
Support for the NFS version 4 session model
The pre-existing RPC client transport model includes a capability to send RPC requests and receive replies from servers via a single transport instance. NFS version 4 (RFC3530) introduces the concept of a callback channel to support RPC requests sent by NFS servers and received by clients. The primary use of this channel is to support NFS version 4 read and write delegation. Typically it uses a separate RPC server instance on the client supported by a separate transport instance to service callback RPC requests.
In the near future, a minor revision of NFS version 4 will require the ability to combine the normal RPC request channel with the callback channel on a single transport instance (also known as the NFS version 4 session layer). To support bi-directional RPC communications on a single transport instance, additional transport methods will be required.
We do not yet understand what will be required, in addition to the methods described below, to support callbacks on the same transport instance as the forward RPC request channel.
API Specification
The generic functionality of all RPC transports (i.e., congestion control, request queuing, retransmit timeouts, and so on) will remain in xprt.c. All API methods must be present in all transport implementations.
We define fifteen transport methods:
struct rpc_xprt_ops {
	void	(*setbufsize)(struct rpc_xprt *, size_t, size_t);
	void	(*print_addr)(struct rpc_xprt *, size_t, char *, int);
	int	(*is_bound)(struct rpc_xprt *);
	void	(*rpcbind)(struct rpc_task *, struct rpc_clnt *);
	void	(*set_port)(struct rpc_xprt *, unsigned short);
	void	(*connect)(struct rpc_task *);
	int	(*aux_protocol)(struct rpc_xprt *);
	void *	(*buf_alloc)(struct rpc_task *, size_t);
	void	(*buf_free)(struct rpc_task *);
	int	(*send_request)(struct rpc_task *);
	void	(*set_receive_timeout)(struct rpc_task *);
	int	(*is_congested)(struct rpc_xprt *);
	void	(*timeout)(struct rpc_xprt *);
	void	(*close)(struct rpc_xprt *);
	void	(*destroy)(struct rpc_xprt *);
};
The following type defines a single transport implementation. It provides a name that functions only as an eye-catcher; the address of the transport implementation's kernel module structure; a family and protocol; and the address of the function that the generic layer can use to set up a new transport instance. The address of this structure is passed to the generic layer when the transport implementation initializes.
struct xprt_type {
	struct list_head	list;
	char			name[32];
	struct module *		owner;
	unsigned short		family;
	int			protocol;
	int			(*setup)(struct rpc_xprt *, struct rpc_timeout *);
};
The setup function is responsible for initializing a number of fields in the rpc_xprt structure it is passed, in addition to possibly allocating and initializing a private area for the transport instance.
tsh_size: the size, in 8-bit bytes, of a transport-specific header to be placed before the RPC header when building each RPC request.
cwnd: the initial size of the congestion window.
resvport: a boolean which, if true, means this transport needs a reserved port.
max_payload: the size, in 8-bit bytes, of the largest payload a single RPC request can contain on this transport.
bind_timeout: number of jiffies to wait for a bind request to complete before timing it out.
connect_timeout: number of jiffies to wait for a transport connect request to complete before timing it out.
reestablish_timeout: number of jiffies to wait after a transport is remotely disconnected before attempting to reestablish a connection.
idle_timeout: number of jiffies to wait after a transport becomes idle before disconnecting.
ops: the address of this transport instance's operations vector.
max_reqs: the maximum number of concurrent requests this transport instance can support.
A (void *) pointer field is made available in the rpc_xprt structure to reference an implementation-private area where instance variables specific to a transport implementation can be maintained.
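As an illustration of the setup responsibilities listed above, here is a minimal user-space sketch. The structures are cut-down stand-ins for the kernel's, and `demo_setup`, `demo_private`, `demo_ops`, `DEMO_HZ`, and the `private_data` field name are hypothetical. Every numeric value is illustrative only, and returning a negative errno on failure is an assumed convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Cut-down stand-ins: only the fields named in the text are modeled. */
struct rpc_xprt;
struct rpc_timeout;
struct rpc_xprt_ops {
	void (*destroy)(struct rpc_xprt *);
};
struct rpc_xprt {
	size_t tsh_size;
	unsigned long cwnd;
	int resvport;
	size_t max_payload;
	unsigned long bind_timeout;
	unsigned long connect_timeout;
	unsigned long reestablish_timeout;
	unsigned long idle_timeout;
	const struct rpc_xprt_ops *ops;
	unsigned int max_reqs;
	void *private_data;	/* hypothetical name for the (void *) field */
};

#define DEMO_HZ 1000UL		/* assumed number of jiffies per second */

struct demo_private {
	int fd;			/* instance state owned by this transport */
};

/* Opposite of setup: release the private area. */
static void demo_destroy(struct rpc_xprt *xprt)
{
	free(xprt->private_data);
	xprt->private_data = NULL;
}

static const struct rpc_xprt_ops demo_ops = { .destroy = demo_destroy };

/* Fill in the generic fields and attach a private area. */
static int demo_setup(struct rpc_xprt *xprt, struct rpc_timeout *to)
{
	struct demo_private *priv;

	assert(xprt != NULL);
	(void)to;		/* transport-specific options unused here */

	priv = calloc(1, sizeof(*priv));
	if (priv == NULL)
		return -ENOMEM;
	priv->fd = -1;

	xprt->tsh_size = 4;	/* e.g. room for a 4-byte record marker */
	xprt->cwnd = 1;
	xprt->resvport = 1;
	xprt->max_payload = 32768;
	xprt->bind_timeout = 60 * DEMO_HZ;
	xprt->connect_timeout = 60 * DEMO_HZ;
	xprt->reestablish_timeout = 3 * DEMO_HZ;
	xprt->idle_timeout = 300 * DEMO_HZ;
	xprt->ops = &demo_ops;
	xprt->max_reqs = 16;
	xprt->private_data = priv;
	return 0;
}
```

Because the generic layer reaches the instance only through `ops` and the generic fields, the layout of `demo_private` stays entirely inside the transport implementation.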
Procedure syntax and functional descriptions
"setup"
This external function is provided by the transport implementation for initializing a new transport instance, setting the remote peer address, and providing some transport-specific parameters, such as request timeout values. This function also initializes the vector of API methods with which the generic layer can manipulate the new transport instance. The function takes two arguments: the address of a freshly allocated rpc_xprt structure, and the address of a structure containing transport-specific options. The "addr" field of the rpc_xprt structure is initialized with the remote endpoint address before "setup" is invoked. The return value is an errno value if problems were encountered, or zero on success. This function is called from a user process context, so it may sleep. It does not depend on any external locks being held.
"setbufsize"
This API method is invoked following the creation of a new transport instance to initialize transport layer buffer parameters. The function takes three arguments: the address of the rpc_xprt structure to configure, and two unsigned integers giving the desired sizes of the transport's send and receive buffers, in bytes. It returns nothing. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function is called from a user process context, so it may sleep. It does not depend on any external locks being held.
"print_addr"
This API method stuffs a buffer with a formatted string representing the remote peer's address. It is useful for building hash functions or for error, warning, and trace messages. The function takes four arguments: the address of the rpc_xprt structure containing the remote address, the size in bytes and the address of a buffer to stuff, and a set of flags that determine which address fields are to be formatted. It returns nothing.
The caller must ensure that the xprt's reference count is greater than one when calling this function. This function is called from a user process context, so it may sleep. It does not depend on any external locks being held.
"is_bound"
This API method is invoked to determine whether a bind operation is required before a connection is made. The function takes a single argument, which is the address of the rpc_xprt structure being tested. It returns true if the transport is already bound, and false if a bind operation is necessary before proceeding. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"rpcbind"
This API method is invoked before a connect to allow portmapping to occur. If ports are not supported by the underlying transport mechanism, this method can be a no-op. The function takes two arguments: the address of the rpc_task structure for the current RPC request, and the address of the rpc_clnt structure associated with this task. It returns nothing. This operation starts the bind operation asynchronously, and the caller sleeps using the RPC client's scheduling primitives. The caller is awoken automatically when the bind is complete, and can check the status of the bind operation using "is_bound." This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"set_port"
This API method is invoked to change the bound port number for a transport. It is generally invoked only during a bind operation. The function takes two arguments: the address of an rpc_xprt structure to update, and an unsigned 16-bit integer which is the new port number. It returns nothing. The caller must ensure that the xprt's reference count is greater than one when calling this function.
This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"connect"
This API method is invoked to connect a transport when the generic transport layer recognizes the need to connect a transport instance. The generic layer serializes transport reads and writes with the connect operation on this transport. Calling this function starts the connection, but the transport may or may not be connected when it returns. The generic layer uses the RPC client's scheduler primitives to wait safely until the connection operation is complete, and to allow only one connection attempt at a time. The details of whether a transport is connection-oriented or datagram-oriented can be well hidden in the transport implementation itself. The RPC client's finite state engine automatically detects whether a transport is connected before sending each request; if it is not, it will invoke this method automatically. The function takes one argument, which is the address of an rpc_task structure which can be used for scheduling the connection and sleeping. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep.
"aux_protocol"
This API method returns the protocol number to be used to set up auxiliary transports. An auxiliary transport is an additional transport instance that connects the same endpoints, but carries a different RPC program. NLM, NSM, and NFSACL would use an auxiliary transport to connect to servers. The function takes one argument, which is the address of an rpc_xprt structure. It returns an integer. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep.
"buf_alloc"
This API method returns an area of memory in which to construct an outgoing RPC and to contain its reply.
The memory can be a dynamically allocated buffer, or it can provide the address of an existing memory area where the construction can occur. The function takes two arguments: the address of the rpc_task structure associated with the current request, and a requested size of the memory area, in bytes. It returns an address of a usable area of memory, or NULL in case no area is currently available. The RPC client will retry if a NULL is returned. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"buf_free"
This API method is invoked when an rpc_task is finished and must free a memory area allocated via buf_alloc. The function takes one argument: the address of the rpc_task structure associated with the current request. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"send_request"
This API method is invoked to send a single RPC request over the transport, after taking the transport's write lock to serialize with other write or connect operations. This method must not sleep or block. This method adds any transport-specific headers that are required before the request is transmitted. The transport implementation exports the byte size of the space required in the buffer where requests are assembled so that the generic logic may leave that space available for transport-specific header information. The function takes one argument: the address of the rpc_task structure associated with the current request. The request has already been completely specified in the task's associated rq_rqst. If the transport is unable to write the complete request, this function places the task on a sleep queue and returns EAGAIN. The transport implementation will wake the task when the send operation can make forward progress. The generic layer calls this method again when the task is awakened.
The generic layer does not release the write lock until the current request has been completely sent. If the transport requires a "connect" operation, this function returns ENOTCONN. If any other error occurs, that error is returned. If the send operation is entirely successful, this method returns zero. This function can be called from asynchronous RPC tasks so it must not sleep. The generic layer serializes transport reads and writes with the connect operation on this transport. Calling this function starts the write operation, but the write may not be complete when it returns. The generic layer uses the RPC client's scheduler primitives to wait safely until the reply to this request is received.
"set_receive_timeout"
The generic transport layer invokes this API method after a message has been sent successfully on a transport. Each transport implementation provides its own RPC retransmit logic via this method. It sets the RPC task timeout values so that the task is automatically awakened if no server reply is received. The timer callout is always xprt_timer. The function takes one argument: the address of the rpc_task structure associated with the current request. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. The caller must acquire the transport_lock and the write lock while calling this function.
"is_congested"
This API method is invoked to determine whether a transport is congested. If the transport indicates that it is congested, the generic transport layer puts the current request to sleep. The function takes one argument: the address of the rpc_xprt structure to check. It returns a zero value if the transport is not congested, and a nonzero value if the current request should be delayed. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep.
It does not depend on any external locks being held.
"timeout"
This API method is invoked when the RPC client detects a major retransmit timeout on this transport. The transport implementation can use this to record statistics, adjust timeout values, or mark a connection for reconnection. The function takes one argument: the address of the rpc_xprt structure that experienced the retransmit timeout. It returns nothing. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"close"
This API method is invoked to close a transport connection. It is the opposite of the "connect" method. The function takes one argument: the address of an rpc_xprt structure to close. It returns nothing. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks or tasklets, so it must not sleep. It does not depend on any external locks being held.
"destroy"
This API method is invoked when a transport will no longer be used. It is the opposite of the "setup" external function. The function takes one argument: the address of an rpc_xprt structure to destroy. It returns nothing. The caller must ensure that the xprt's reference count is positive when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
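The return-value contract of "send_request" described above can be illustrated with a small user-space simulation. This is a sketch under assumptions, not the generic layer's code: the structures are stand-ins, `demo_send_request`, `demo_transmit`, the `room` field, and the `demo_action` values are hypothetical, and errors are returned as negative errno values by assumed convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in transport: "room" is how many bytes one send call accepts. */
struct rpc_xprt {
	int connected;
	size_t room;
};

/* Stand-in task: a request of "len" bytes, "sent" of which are written. */
struct rpc_task {
	struct rpc_xprt *xprt;
	size_t len;
	size_t sent;
};

/* Zero when the whole request is written, -EAGAIN on a partial write,
 * -ENOTCONN when a connect is needed first. Never blocks. */
static int demo_send_request(struct rpc_task *task)
{
	struct rpc_xprt *xprt = task->xprt;
	size_t left, n;

	if (!xprt->connected)
		return -ENOTCONN;
	left = task->len - task->sent;
	n = left < xprt->room ? left : xprt->room;
	task->sent += n;
	return task->sent == task->len ? 0 : -EAGAIN;
}

/* What a caller might do next, given the method's result. */
enum demo_action {
	DEMO_WAIT_REPLY,	/* request fully sent */
	DEMO_SLEEP_RETRY,	/* sleep, then call send_request again */
	DEMO_CONNECT,		/* invoke the connect method first */
	DEMO_FAIL		/* any other error is passed up */
};

static enum demo_action demo_transmit(struct rpc_task *task)
{
	assert(task != NULL && task->xprt != NULL);

	switch (demo_send_request(task)) {
	case 0:
		return DEMO_WAIT_REPLY;
	case -EAGAIN:
		return DEMO_SLEEP_RETRY;
	case -ENOTCONN:
		return DEMO_CONNECT;
	default:
		return DEMO_FAIL;
	}
}
```

In the design the write lock stays held across the retries, which is why a partially sent request can simply resume where it left off.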
Procedure syntax and functional descriptions (external functions)
"rpc_peeraddr"
This external function is a convenient way to invoke a transport's peer_addr method. The function takes three arguments: the address of the rpc_clnt structure to be queried, the address of a buffer into which to copy the endpoint address, and the size of that buffer. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"rpc_print_peeraddr"
This external function provides a way to format remote peer addresses for printing or for use in a hash function. The function takes four arguments: the address of the rpc_clnt structure containing the address of interest, the address and size of a buffer, and a set of flags that determine which parts of the address are formatted. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"xprt_tsh_size"
This external function returns the number of bytes to be left before the RPC header is inserted into the transmission buffer. The generic transport layer uses this value when constructing each RPC request to leave room for transport-specific and protocol-specific headers. This function takes one argument: the address of the rpc_xprt structure that will be used to transmit the current request. It returns the size of any protocol-specific header, in bytes, or zero if no space for a protocol-specific header is required. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"xprt_is_bound"
This external function is a convenient way to invoke a transport's is_bound method. The function takes a single argument, which is the address of the rpc_xprt structure being tested.
It returns true if the transport is already bound, and false if a bind operation is necessary before proceeding. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"xprt_connected"
This external function is a convenient way to determine whether a transport is connected. The function takes one argument: the address of the rpc_xprt structure that represents the transport instance to check. It returns a truth value. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"rpc_max_payload"
This external function reports the maximum number of bytes of payload that a single RPC can carry on a given transport protocol. The function takes one argument, which is the address of an rpc_clnt structure created by rpc_create. It returns a size_t value. This function is called from a user process context, so it may sleep. It does not depend on any external locks being held.
"rpc_force_rebind"
This external function allows applications to request that the RPC client rebind the transport. The function takes one argument: the address of the rpc_clnt structure to rebind. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
"rpc_aux_protocol"
This external function reports what transport protocol to use when connecting auxiliary services, such as NLM or NFSACL, based on the protocol used on the main forward channel. The function takes one argument: the address of the rpc_clnt structure to query. It returns an integer. This function can be called from asynchronous RPC tasks so it must not sleep.
It does not depend on any external locks being held.
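Several of the external functions above are thin wrappers over the operations vector or over fields of the rpc_xprt structure. The following user-space sketch shows one way such wrappers might look. The structures are reduced stand-ins, and all `demo_` names, along with the `port` and `connected` fields, are hypothetical. It also shows a transport without ports reporting itself as always bound, so that the bind step becomes a no-op.

```c
#include <assert.h>
#include <stddef.h>

struct rpc_xprt;
struct rpc_xprt_ops {
	int (*is_bound)(struct rpc_xprt *);
};
struct rpc_xprt {
	const struct rpc_xprt_ops *ops;
	size_t tsh_size;
	unsigned short port;	/* hypothetical: 0 means not yet bound */
	int connected;		/* hypothetical connection flag */
};

/* Socket-style transport: bound once a nonzero port is known. */
static int demo_sock_is_bound(struct rpc_xprt *xprt)
{
	return xprt->port != 0;
}

/* Port-less transport: never needs a bind step. */
static int demo_portless_is_bound(struct rpc_xprt *xprt)
{
	(void)xprt;
	return 1;
}

static const struct rpc_xprt_ops demo_sock_ops = {
	.is_bound = demo_sock_is_bound,
};
static const struct rpc_xprt_ops demo_portless_ops = {
	.is_bound = demo_portless_is_bound,
};

/* Wrapper: dispatch through the instance's operations vector. */
static int demo_xprt_is_bound(struct rpc_xprt *xprt)
{
	assert(xprt != NULL && xprt->ops != NULL);
	return xprt->ops->is_bound(xprt);
}

/* Wrappers that only read generic fields. */
static size_t demo_xprt_tsh_size(const struct rpc_xprt *xprt)
{
	return xprt->tsh_size;
}

static int demo_xprt_connected(const struct rpc_xprt *xprt)
{
	return xprt->connected != 0;
}
```

Callers that use only the wrappers never need to know which transport implementation is behind a given instance.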
Procedure syntax and functional descriptions (generic functions)
In addition to the above API, transport implementations may also need to invoke functions that are a part of the generic RPC client. These functions are:
void rpc_getport(struct rpc_task *task, struct rpc_clnt *clnt)
This interface provides portmapping for IPv4 sockets. The function takes two arguments: the address of the rpc_task structure for the current RPC request, and the address of the rpc_clnt structure associated with this task. It returns nothing. This operation starts the bind operation asynchronously, and the caller sleeps using the RPC client's scheduling primitives. The caller is awoken automatically when the bind is complete, and can check the status of the bind operation using "is_bound." This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
void * rpc_malloc(struct rpc_task *task, size_t size)
This interface allocates a buffer from the rpc_buffer slab cache. These buffers are generally used to contain the RPC header for each RPC request. The function takes two arguments: the address of the rpc_task structure associated with the current request, and a requested size of the new buffer, in bytes. It returns an address of a usable area of memory, or NULL in case no buffer is currently available. The RPC client will retry if a NULL is returned. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
void rpc_free(struct rpc_task *task)
Buffers allocated via rpc_malloc are freed via this interface. The function takes one argument: the address of the rpc_task structure associated with the current request. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
void xdr_partial_copy_from_skb(struct xdr_buf *xdr, unsigned int base, skb_reader_t *desc, skb_read_actor_t copy_actor)
This interface is used by datagram socket transports to copy data from an incoming skb to an xdr_buf. It is used by both the client and server RPC implementations. The function takes four arguments: the address of a standard xdr_buf structure containing data to be copied; the base offset where the copy operation should begin; the address of the read operation descriptor, and the address of a copy actor function. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
int csum_partial_copy_to_xdr(struct xdr_buf *xdr, struct sk_buff *skb)
This interface provides a checksum copy function that copies data from an skb to an xdr_buf. It is used by both the client and server RPC implementations. The function takes two arguments: the address of a standard xdr_buf structure that acts as the destination of the copy operation, and the address of an skbuff structure containing data to be copied. It returns the number of bytes that were copied. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
void rpc_init_rtt(struct rpc_rtt *rt, unsigned long timeo)
A transport implementation can invoke this function to initialize an rpc_rtt structure. The function takes two arguments: the address of an rpc_rtt structure to initialize, and the number of jiffies to use as the initial timeout value. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
void rpc_update_rtt(struct rpc_rtt *rt, unsigned timer, long m)
Transport implementations use this function to update an rpc_rtt structure when an RPC request has completed.
The function takes three arguments: the address of the rpc_rtt structure to update; the index of the timer to update; and the number of jiffies that have passed since the RPC request was started. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. The transport_lock must be held before calling this function.
unsigned long rpc_calc_rto(struct rpc_rtt *rt, unsigned timer)
This interface returns a value suitable for use as a retransmission timeout, in jiffies, based on the context data contained in an rpc_rtt structure. The function takes two arguments: the address of the rpc_rtt structure that contains the data to use for the calculation, and the index of the timer to use. It returns the number of jiffies to use for the retransmit timer. This function can be called from asynchronous RPC tasks so it must not sleep. The transport_lock must be held before calling this function.
int xprt_register(struct xprt_type *transport)
int xprt_unregister(struct xprt_type *transport)
Transport implementations use these interfaces to register and withdraw their presence with the generic transport layer. The transport layer will not use a transport implementation for new RPC connections until the transport implementation has registered via xprt_register. Both functions take a single argument: the address of an xprt_type structure representing the transport implementation to register or unregister. Both functions return zero on success, and an errno-type value on failure. These functions are called from a user process context, so they may sleep. They do not depend on any external locks being held.
void xprt_adjust_cwnd(struct rpc_rqst *req, int result)
Transport implementations that need congestion control invoke this function to adjust their congestion window.
The function takes two arguments: the address of an rpc_rqst structure representing the request that has caused the change in the transport's congestion window, and an integer containing an errno value indicating why the window needs to be adjusted. It returns nothing. This function can be called from asynchronous RPC tasks so it must not sleep. The transport_lock must be held before calling this function.
void xprt_disconnect(struct rpc_xprt *xprt)
Callers use this interface to mark a transport as disconnected. The generic layer will subsequently terminate the transport connection when it is safe to do so. The function takes a single argument: the address of an rpc_xprt structure representing the transport instance to mark disconnected. It returns nothing. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. It does not depend on any external locks being held.
struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, u32 xid)
When an RPC reply is first received, the transport implementation invokes this function to map the received XID to a pending rpc_rqst. The function takes two arguments: the address of an rpc_xprt structure on which a reply has just arrived, and a 32-bit value representing the XID of the request to look up. The caller must ensure that the xprt's reference count is greater than one when calling this function. This function can be called from asynchronous RPC tasks so it must not sleep. The transport_lock must be held before calling this function.
void xprt_complete_rqst(struct rpc_rqst *req, size_t copied)
A transport implementation invokes this function to signal that a complete RPC reply has been received, and that the RPC client may begin decoding the reply.
This function takes two arguments: the address of an rpc_rqst structure representing the request that is being completed, and an integer containing the number of payload bytes that were just copied by the request. This function can be called from asynchronous RPC tasks so it must not sleep. The transport_lock must be held before calling this function.
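The registration interface described above can be pictured with a small user-space sketch of a registry keyed on address family and protocol. This is an illustration under assumptions rather than the generic layer's code: `struct demo_xprt_type` and the `demo_` functions are hypothetical, a singly linked list stands in for the kernel's list_head, and the specific error codes are assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced stand-in for struct xprt_type. */
struct demo_xprt_type {
	struct demo_xprt_type *next;
	char name[32];
	unsigned short family;
	int protocol;
};

static struct demo_xprt_type *demo_types;	/* head of the registry */

/* Find the implementation registered for a family/protocol pair. */
static struct demo_xprt_type *demo_xprt_find(unsigned short family,
					     int protocol)
{
	struct demo_xprt_type *t;

	for (t = demo_types; t != NULL; t = t->next)
		if (t->family == family && t->protocol == protocol)
			return t;
	return NULL;
}

/* Register; refuse a second implementation for the same pair. */
static int demo_xprt_register(struct demo_xprt_type *type)
{
	assert(type != NULL);
	if (demo_xprt_find(type->family, type->protocol) != NULL)
		return -EEXIST;
	type->next = demo_types;
	demo_types = type;
	return 0;
}

/* Unregister; fail if the implementation was never registered. */
static int demo_xprt_unregister(struct demo_xprt_type *type)
{
	struct demo_xprt_type **pp;

	for (pp = &demo_types; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == type) {
			*pp = type->next;
			type->next = NULL;
			return 0;
		}
	}
	return -ENOENT;
}
```

A real registry would also need locking and module reference counting, both of which this sketch leaves out.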
Procedure syntax and functional descriptions (create)
The transport switch replaces the two functions that were formerly used to create a new rpc_clnt, xprt_create_proto and rpc_create_client, with a single function call that hides the details of the transport from RPC applications.
To create a new rpc_clnt structure, an application will fill in this structure, and pass it to the new rpc_create function:
struct rpc_create_args {
	int			protocol;
	struct sockaddr *	address;
	size_t			addrsize;
	struct rpc_timeout *	timeout;
	char *			servername;
	struct rpc_program *	program;
	u32			version;
	rpc_authflavor_t	authflavor;
	unsigned long		behavior;
};
This structure contains all the same parameters that the xprt_create_proto and rpc_create_client function calls used. In addition, a "behavior" field contains bits that enable specific behaviors in the new rpc_clnt instance.
#define RPC_CLNT_SOFTRTRY	(1UL << 0)
#define RPC_CLNT_INTR		(1UL << 1)
#define RPC_CLNT_CHATTY		(1UL << 2)
#define RPC_CLNT_AUTOBIND	(1UL << 3)
#define RPC_CLNT_DROPPRIV	(1UL << 4)
#define RPC_CLNT_ONESHOT	(1UL << 5)
#define RPC_CLNT_RESVPORT	(1UL << 6)
int rpc_create(struct rpc_create_args *);
This function is invoked by applications to create a new rpc_clnt structure. The function takes a single argument: the address of the rpc_create_args structure that provides the parameters for the new rpc_clnt instance. This function is called from a user process context, so it may sleep. It does not depend on any external locks being held.
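To show how an application might prepare the argument structure, here is a user-space sketch that fills in an rpc_create_args for an IPv4 TCP endpoint and combines behavior bits. The `u32` and `rpc_authflavor_t` typedefs are stand-ins, `demo_fill_args` and `demo_check_args` are hypothetical helpers, and the chosen address, version, and flags are illustrative only. The sketch does not call rpc_create itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

typedef uint32_t u32;			/* stand-in for the kernel type */
typedef u32 rpc_authflavor_t;		/* stand-in */
struct rpc_timeout;
struct rpc_program;

#define RPC_CLNT_SOFTRTRY	(1UL << 0)
#define RPC_CLNT_INTR		(1UL << 1)
#define RPC_CLNT_CHATTY		(1UL << 2)
#define RPC_CLNT_AUTOBIND	(1UL << 3)
#define RPC_CLNT_DROPPRIV	(1UL << 4)
#define RPC_CLNT_ONESHOT	(1UL << 5)
#define RPC_CLNT_RESVPORT	(1UL << 6)

struct rpc_create_args {
	int protocol;
	struct sockaddr *address;
	size_t addrsize;
	struct rpc_timeout *timeout;
	char *servername;
	struct rpc_program *program;
	u32 version;
	rpc_authflavor_t authflavor;
	unsigned long behavior;
};

/* Fill args for a TCP connection to the loopback address, port 2049. */
static void demo_fill_args(struct rpc_create_args *args,
			   struct sockaddr_in *sin, char *servername)
{
	memset(args, 0, sizeof(*args));
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_port = htons(2049);
	sin->sin_addr.s_addr = htonl(INADDR_LOOPBACK);

	args->protocol = IPPROTO_TCP;
	args->address = (struct sockaddr *)sin;
	args->addrsize = sizeof(*sin);
	args->servername = servername;
	args->version = 3;
	args->authflavor = 1;		/* AUTH_SYS flavor number */
	args->behavior = RPC_CLNT_SOFTRTRY | RPC_CLNT_INTR |
			 RPC_CLNT_RESVPORT;
}

/* Hypothetical sanity check a creator might apply before proceeding. */
static int demo_check_args(const struct rpc_create_args *args)
{
	assert(args != NULL);
	if (args->address == NULL || args->addrsize == 0)
		return -EINVAL;
	return 0;
}
```

Gathering every parameter into one structure means new options can be added later without changing the signature of the creation call.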
Conclusion
With the implementation of an RPC transport switch, we hope to facilitate the introduction of significant new technology into the Linux kernel RPC implementation. Not only will the RPC transport switch enable new transport technologies such as high-performance TCP offload, but it will also ease enhancements such as multiple sockets per client-server pair, the elimination of the RPC slot table, and the removal of the global kernel lock from the RPC client and server.