CITI ASC status

=University of Michigan/CITI NFSv4 ASC alliance=
 
Status of October 2006
 
==Task 1.  Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==
 
===Development===
 
We updated the Linux '''pNFS''' client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19.
 
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.
 
To make a clean separation of the common NFS v2/3/4/4.1 code from code specific to pNFS, we rewrote the Linux pNFS client to use its own set of RPC operations.
 
Four client layout modules are in development.
 
* File layout driver (CITI, Network Appliance, and IBM Almaden).
* PVFS2 layout driver (CITI).
* Object layout driver (Panasas).
* Block layout driver (CITI under contract with EMC).
 
To accommodate the requirements of the multiple layout drivers, we expanded the layout operation policy interfaces between the layout driver and generic pNFS client.
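
For illustration, the sketch below shows the kind of per-driver policy table this interface amounts to: the generic client consults the driver about whether to request a layout and how to route the I/O.  The structure and names are hypothetical, not the actual Linux pNFS code.

<pre>
/* Illustrative sketch only: a toy version of the kind of policy table a
 * pNFS layout driver might register with the generic client.  All names
 * are hypothetical; this is not the actual Linux pNFS interface. */
#include <stdio.h>
#include <stdint.h>

struct io_request { uint64_t offset, count; int is_write; };

struct layoutdriver_policy {
    const char *name;
    int use_page_cache;                  /* buffered vs. direct I/O    */
    uint64_t preferred_layout_size;      /* LAYOUTGET range to ask for */
    int (*want_layout)(const struct io_request *req);
    int (*do_io)(const struct io_request *req);
};

/* A toy "file layout" driver: always wants a layout and just logs the I/O. */
static int file_want_layout(const struct io_request *req) { (void)req; return 1; }

static int file_do_io(const struct io_request *req)
{
    printf("file layout: %s %llu bytes at offset %llu\n",
           req->is_write ? "write" : "read",
           (unsigned long long)req->count, (unsigned long long)req->offset);
    return 0;
}

static const struct layoutdriver_policy file_driver = {
    .name = "files", .use_page_cache = 1, .preferred_layout_size = UINT64_MAX,
    .want_layout = file_want_layout, .do_io = file_do_io,
};

/* The generic client consults the driver's policy before asking for a layout. */
static void generic_client_io(const struct layoutdriver_policy *drv,
                              struct io_request *req)
{
    if (drv->want_layout(req))
        printf("generic client: LAYOUTGET for %s driver (up to %llu bytes)\n",
               drv->name, (unsigned long long)drv->preferred_layout_size);
    drv->do_io(req);
}

int main(void)
{
    struct io_request req = { .offset = 0, .count = 65536, .is_write = 0 };
    generic_client_io(&file_driver, &req);
    return 0;
}
</pre>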
 
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.
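
The sketch below gives the flavor of such a cache: a per-inode list of layout segments, each covering a byte range with an I/O mode, and a lookup that falls back to LAYOUTGET when no cached segment covers the request.  The structures are hypothetical, not the code under development.

<pre>
/* Illustrative sketch only: a per-inode cache of layout segments, each
 * covering a byte range with an I/O mode, replacing a single cached
 * layout per inode.  Hypothetical structures, not the CITI code. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

enum iomode { IOMODE_READ = 1, IOMODE_RW = 2 };

struct layout_segment {
    uint64_t offset, length;            /* byte range this segment covers */
    enum iomode mode;
    struct layout_segment *next;
};

struct layout_cache {                   /* one per inode */
    struct layout_segment *segments;
};

/* Insert a segment returned by LAYOUTGET (no merging in this sketch). */
static void cache_insert(struct layout_cache *c, uint64_t off, uint64_t len,
                         enum iomode mode)
{
    struct layout_segment *s = malloc(sizeof(*s));
    if (!s)
        exit(1);
    s->offset = off; s->length = len; s->mode = mode;
    s->next = c->segments;
    c->segments = s;
}

/* Find a cached segment that covers the given offset with a sufficient mode. */
static struct layout_segment *cache_lookup(struct layout_cache *c,
                                           uint64_t off, enum iomode mode)
{
    struct layout_segment *s;
    for (s = c->segments; s; s = s->next)
        if (off >= s->offset && off - s->offset < s->length &&
            (s->mode == IOMODE_RW || s->mode == mode))
            return s;
    return NULL;                        /* caller must issue LAYOUTGET */
}

int main(void)
{
    struct layout_cache cache = { NULL };
    cache_insert(&cache, 0, 1 << 20, IOMODE_READ);      /* first 1 MB, read  */
    cache_insert(&cache, 1 << 20, 1 << 20, IOMODE_RW);  /* second 1 MB, r/w  */
    printf("write at 1.5 MB: %s\n",
           cache_lookup(&cache, 3 << 19, IOMODE_RW) ? "cached" : "need LAYOUTGET");
    printf("write at 0.5 MB: %s\n",
           cache_lookup(&cache, 1 << 19, IOMODE_RW) ? "cached" : "need LAYOUTGET");
    return 0;
}
</pre>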
 
We improved the interface to the underlying file system on the Linux pNFS server.  The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server.
 
We are reworking the pNFS layout management service and file system interfaces on the Linux pNFS server to keep more complete bookkeeping, so that we can generalize the layout recall implementation, which is currently limited to a single layout.
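
As an illustration of the bookkeeping this requires, the sketch below records which client holds a layout over which byte range, so that a recall can be sent only to clients whose layouts overlap a conflicting range.  The structures and names are hypothetical, not the actual server code.

<pre>
/* Illustrative sketch only: per-file bookkeeping of granted layouts on the
 * server, so a recall can target every client whose layout overlaps a range
 * rather than assuming a single outstanding layout. */
#include <stdio.h>
#include <stdint.h>

struct layout_grant {
    uint32_t clientid;
    uint64_t offset, length;
};

static int ranges_overlap(uint64_t off1, uint64_t len1,
                          uint64_t off2, uint64_t len2)
{
    return off1 < off2 + len2 && off2 < off1 + len1;
}

/* Send CB_LAYOUTRECALL to every client holding a layout that overlaps
 * [off, off+len); here we just print instead of issuing the callback. */
static void recall_layouts(const struct layout_grant *grants, int n,
                           uint64_t off, uint64_t len)
{
    for (int i = 0; i < n; i++)
        if (ranges_overlap(grants[i].offset, grants[i].length, off, len))
            printf("CB_LAYOUTRECALL -> client %u (holds %llu+%llu)\n",
                   (unsigned)grants[i].clientid,
                   (unsigned long long)grants[i].offset,
                   (unsigned long long)grants[i].length);
}

int main(void)
{
    /* Two clients hold layouts on the same file. */
    struct layout_grant grants[] = {
        { 1, 0,       1 << 20 },        /* client 1: first 1 MB  */
        { 2, 1 << 20, 1 << 20 },        /* client 2: second 1 MB */
    };
    /* Recall only the first megabyte, e.g. because of a conflicting
     * request; only client 1 should receive a callback. */
    recall_layouts(grants, 2, 0, 1 << 20);
    return 0;
}
</pre>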
 
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server.  The layout driver I/O interface now supports the original direct access method, the Linux page cache access method (which uses the NFSv4 writeback cache and readahead algorithm), and the O_DIRECT access method.  In addition, PVFS2 now supports the pNFS file-based layout, giving pNFS clients two choices in how they access the file system.
 
We developed prototype implementations of the following pNFS operations (a typical client call sequence using them is sketched after the list):
 
* OP_GETDEVICELIST
* OP_GETDEVICEINFO
* OP_LAYOUTGET
* OP_LAYOUTCOMMIT
* OP_LAYOUTRETURN
* OP_CB_LAYOUTRECALL
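
The sketch below shows the order in which a client would typically use these operations for a write.  Each stub just prints the step rather than issuing a real RPC; it illustrates the flow, not the prototype's exact behavior.

<pre>
/* Illustrative only: a typical pNFS write path expressed as stub calls.
 * Each stub prints the step instead of sending an RPC. */
#include <stdio.h>

static void getdevicelist(void) { puts("GETDEVICELIST: list the data-server devices for this FS"); }
static void getdeviceinfo(void) { puts("GETDEVICEINFO: map a device ID to addresses and striping"); }
static void layoutget(void)     { puts("LAYOUTGET:     obtain a layout covering the write range"); }
static void write_to_ds(void)   { puts("WRITE:         direct I/O to the data servers"); }
static void layoutcommit(void)  { puts("LAYOUTCOMMIT:  publish the new size/change to the MDS"); }
static void layoutreturn(void)  { puts("LAYOUTRETURN:  give the layout back"); }

int main(void)
{
    getdevicelist();
    getdeviceinfo();
    layoutget();
    write_to_ds();
    layoutcommit();
    layoutreturn();
    /* CB_LAYOUTRECALL is the server-to-client callback that asks for an
     * early LAYOUTRETURN; it can arrive any time after LAYOUTGET. */
    return 0;
}
</pre>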
 
We continue to test the ability of our prototype to send direct I/O data to data servers.
 
===Milestones===
 
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS-capable underlying file systems.
 
We demonstrated the Linux pNFS client's support for multiple layouts by copying files between multiple pNFS back ends.
 
The following pNFS implementations were tested.
 
''File layout''
* Clients: Linux, Solaris
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris

''Object layout''
* Clients: Linux
* Servers: Linux, Panasas

''Block layout''
* Clients: Linux
* Server: EMC
 
===Activities===
 
Our current Linux pNFS implementation uses a single whole-file layout.  We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges.
 
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.
 
We continue to measure I/O performance.
 
We joined the [http://www.ultralight.org Ultralight project] and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE.
 
The Linux pNFS client is included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.
 
==Task 2.  Migration of client from one mount/metadata server to another to be demonstrated.  This demonstration may be replicated at LANL depending on success of this work. ==
 
When a file system moves, the old server notifies clients with NFS4ERR_MOVED.  Clients then reclaim state held on the old server by engaging in reboot recovery with the new server.  For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. 
 
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. 
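
One simple way to guarantee this, sketched below, is to embed a per-server node identifier in the clientid each server hands out, so that server-local counters can never produce the same value on two servers.  The bit layout shown is hypothetical; the actual scheme may differ.

<pre>
/* Illustrative sketch only: avoid clientid collisions between servers
 * exporting the same cluster file system by reserving the high bits of
 * the clientid for a per-server node ID.  Hypothetical layout. */
#include <stdio.h>
#include <stdint.h>

#define NODE_BITS 8                     /* up to 256 servers in the cluster */

static uint64_t make_clientid(uint8_t node_id, uint64_t local_counter)
{
    /* High bits identify the issuing server; low bits are server-local. */
    return ((uint64_t)node_id << (64 - NODE_BITS)) |
           (local_counter & ((UINT64_C(1) << (64 - NODE_BITS)) - 1));
}

static uint8_t clientid_node(uint64_t clientid)
{
    return (uint8_t)(clientid >> (64 - NODE_BITS));
}

int main(void)
{
    /* Two servers can both hand out "counter 42" without colliding. */
    uint64_t a = make_clientid(1, 42);
    uint64_t b = make_clientid(2, 42);
    printf("server 1 clientid %#llx (node %u)\n",
           (unsigned long long)a, clientid_node(a));
    printf("server 2 clientid %#llx (node %u)\n",
           (unsigned long long)b, clientid_node(b));
    return 0;
}
</pre>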
 
As presently implemented, clients save the old server's state in stable storage and pass the state information to the new server as part of the recovery operation.  We are rewriting that interface to also support server-to-server state transfer.
 
''(Please check that sentence.)''
 
It remains to inform clients that state established with the old server remains valid on the new server.  The IETF NFSv4 working group is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.
 
==Task 3.  Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==
 
We have set up test machines and begun planning for tests.  We have some immediate concerns over the memory footprint imposed by server lock structures.
 
==Task 4.  Analysis of directory delegations – how well does it work and when, when does it totally not work.==
 
===Background===
 
'''Directory delegations''' promise to extend the usefulness of dentry caching in two ways.  First, the client is no longer forced to revalidate the dentry cache after a timeout.  Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics.  Directory delegations allow the client to cache negative results.
 
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails.  But a subsequent open of the same file might succeed if the file is created in the interim.  Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result cannot be cached.  Consequently, subsequent opens of the same non-existent file also require OPEN RPCs to be sent to the server.  This example is played out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.
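
The little user-space program below makes the PATH case concrete: it probes each PATH directory for an executable, and on an NFS mount every miss is a lookup the client cannot answer from its cache without a directory delegation.  It is a demonstration aid, not part of the delegation code.

<pre>
/* User-space illustration of the PATH-search pattern described above:
 * every directory in PATH is probed for the executable, and each miss is
 * a lookup the NFS client cannot cache negatively without a delegation. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>

int main(int argc, char **argv)
{
    const char *name = argc > 1 ? argv[1] : "no-such-program";
    char *path = getenv("PATH");
    if (!path)
        return 1;
    path = strdup(path);                 /* strtok modifies its argument */

    for (char *dir = strtok(path, ":"); dir; dir = strtok(NULL, ":")) {
        char candidate[PATH_MAX];
        snprintf(candidate, sizeof(candidate), "%s/%s", dir, name);
        /* On an NFS mount, each miss here is a round trip to the server;
         * with a directory delegation the client could answer from cache. */
        if (access(candidate, X_OK) == 0) {
            printf("found: %s\n", candidate);
            free(path);
            return 0;
        }
    }
    printf("%s not found in any PATH directory\n", name);
    free(path);
    return 1;
}
</pre>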
 
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.
 
===Status===
 
We implemented directory delegations in the Linux NFSv4 client and server. 
 
Our server implementation follows the file delegations architecture.  We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.
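
For orientation, the program below takes a read lease on a regular file through the existing Linux lease API (fcntl F_SETLEASE); the server work described here extends the in-kernel side of this machinery so that read-only leases can also cover directories.  The program is only a user-space illustration of the existing API, not of that extension.

<pre>
/* User-space illustration of the existing Linux lease API on a regular
 * file.  The directory-delegation work extends the in-kernel lease code;
 * this program does not demonstrate that extension. */
#define _GNU_SOURCE                     /* F_SETLEASE is Linux-specific */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file you own>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Take a read lease; the kernel breaks it (SIGIO by default) when
     * another process opens the file for writing. */
    if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
        perror("F_SETLEASE");
        close(fd);
        return 1;
    }
    printf("holding a read lease on %s for 10 seconds\n", argv[1]);
    sleep(10);

    fcntl(fd, F_SETLEASE, F_UNLCK);     /* release the lease */
    close(fd);
    return 0;
}
</pre>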
 
We implemented a '''/proc''' interface on the server to enable or disable directory delegation at run time.  At startup, the client queries the server for directory delegation support.
 
The server has hooks for a policy layer to control the granting of directory delegations.  (No policy is implemented yet.)  When and whether to acquire delegations is also a client concern.
 
===Testing===
 
We are testing delegation grant and recall in a test rig with one or two clients.  Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.
 
Tests range from simple UNIX utilities (ls, find, touch) to hosting a CVS repository or compiling with shared libraries and header files stored on NFS servers.
 
We have extended PyNFS to support directory delegations.  So far,  the support is basic and the tests are trivial.  Tests will become more specific.
 
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.
 
==Task 5.  How do you specify/measure NFS Server load.==
 
Assume you had two symmetric servers with a cluster file system back end.  How would you compare load on the two to decide whether there would be a benefit to migrating a client from one to the other?  It may be easiest to compare load not directly, but through some model of the factors that determine the load.  What are those factors?
 
Write a tool that identifies "load" and bottlenecks from the nfsd side, and show that it predicts performance in a wide range of cases.
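
A rough sketch of the flavor of tool intended is below: it samples a couple of the candidate indicators listed under "Possible input into our load measurement" from /proc and reports them.  A real tool would also sample /proc/diskstats and /proc/net/rpc/nfsd and combine the factors into a model; the code here only shows the sampling.

<pre>
/* Rough sketch only: sample two candidate load indicators from /proc.
 * A real tool would sample more sources and combine them into a model. */
#include <stdio.h>
#include <unistd.h>

/* Read cumulative CPU jiffies from the first line of /proc/stat. */
static int read_cpu(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long user, nice, sys, idle, iowait, irq, softirq;
    FILE *f = fopen("/proc/stat", "r");
    if (!f || fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                     &user, &nice, &sys, &idle, &iowait, &irq, &softirq) != 7) {
        if (f) fclose(f);
        return -1;
    }
    fclose(f);
    *busy  = user + nice + sys + irq + softirq;
    *total = *busy + idle + iowait;
    return 0;
}

int main(void)
{
    unsigned long long busy1, total1, busy2, total2;
    double load1;
    FILE *f = fopen("/proc/loadavg", "r");

    if (f && fscanf(f, "%lf", &load1) == 1)
        printf("1-minute load average: %.2f\n", load1);
    if (f) fclose(f);

    if (read_cpu(&busy1, &total1) == 0) {
        sleep(1);                               /* sample interval */
        if (read_cpu(&busy2, &total2) == 0 && total2 > total1)
            printf("CPU utilization over 1s: %.1f%%\n",
                   100.0 * (busy2 - busy1) / (total2 - total1));
    }
    return 0;
}
</pre>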
 
===Specific goal===
 
The tool should identify whether the server is the performance bottleneck, or whether the bottleneck is elsewhere (probably the client).  In the former case the problem may be solved by upgrading some server hardware, by adding a server (in the cluster case), or by moving a client to a different server.  The point when server load hits 100% is the point when aggregate performance stops increasing as more clients are added; so if removing a client still leaves load at 100%, removing that client won't decrease aggregate performance, whereas adding it to another server running at less than 100% will increase aggregate performance.
 
===Possible input into our load measurement===
 
* disk bandwidth to exported filesystems
* disk seek rate
* CPU load
* network load (how?)
* thread usage
* interrupt rate
* rate at which cached file data is evicted from the page cache due to memory pressure?  (Not sure exactly what the right measure is here.)
 
====How to measure each of these?====
 
See iostat(1) (for example, <code>iostat -x</code>) for detailed per-device disk statistics.
 
====What to report for each?====
 
Average load over past time intervals (1ms, 10ms, 100ms, 1s, 10s, 100s,...)
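
One straightforward way to maintain such a set of averages, sketched below, is an exponentially weighted moving average per interval, updated at a fixed sampling rate; the kernel's 1-, 5-, and 15-minute load averages are computed in the same spirit.  The window lengths and the simulated load here are arbitrary.

<pre>
/* Sketch of one way to keep "average load over the past N seconds" for
 * several values of N at once: an exponentially weighted moving average
 * per window, updated at a fixed sampling rate.  Compile with -lm. */
#include <stdio.h>
#include <math.h>

#define NWINDOWS 4
static const double window_secs[NWINDOWS] = { 0.1, 1, 10, 100 };
static const double sample_period = 0.1;        /* seconds between samples */

static double avg[NWINDOWS];                    /* current averages */

/* Fold one instantaneous sample (e.g. nfsd busy fraction over the last
 * period) into each window's running average. */
static void update(double sample)
{
    for (int i = 0; i < NWINDOWS; i++) {
        double decay = exp(-sample_period / window_secs[i]);
        avg[i] = avg[i] * decay + sample * (1.0 - decay);
    }
}

int main(void)
{
    /* Simulate 10 seconds of a bursty load: busy for 1s out of every 2s. */
    for (int t = 0; t < 100; t++)
        update((t / 10) % 2 ? 0.0 : 1.0);
    for (int i = 0; i < NWINDOWS; i++)
        printf("avg over last %5.1fs: %.2f\n", window_secs[i], avg[i]);
    return 0;
}
</pre>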
 
====How do we check usefulness of this information?====
 
Boot with reduced resources somehow, then see whether increasing resources increases performance as predicted?
 
===Disk bandwidth===
 
Vary the size of RAID arrays or the bandwidth of the disk interfaces?

Or run another process that soaks up some percentage of the bandwidth?
 
===CPU load===
 
CPU throttling?

Just try a variety of machines?  Vary the workload?  How do we get a light vs. heavy workload?

How do we measure the performance of each?  Increasing the number of clients until we see performance degradation due to server bottlenecks would be the obvious thing to do.
 
===Measures of load===
 
What do we use to determine whether our measure of load is correct?
 
* single RPC latency measured from a client?
* time to complete some other task, measured from a single client (not actually involved in loading the server)?
* RPCs per second?
 
===Configuration parameters on server that can be varied===
 
* number of server threads
* number of connections per server thread
* request queue lengths (number of bytes waiting in the TCP socket)
 
===Some special situations that can be problems (from Chuck)===
 
* reboot recovery: everyone is recovering at once.
 
* mount storms: a lab full of clients may all mount at once, or a cluster job may trigger automount from all clients at once.
 
===Possible benchmark sources, for this and locking scalability===
 
====postmark====
 
Looks pretty primitive: a mixture of reads, writes, creates, and unlinks.  No locks.
 
====filebench====
 
Also no locking.  We haven't figured out exactly what the various loads do.  Is there actually an active developer community?
 
====See Bull.net's list?====
 
* Bonnie++
 
* FStress
 
* dbench: simulates the filesystem activity created by a Samba server running the proprietary SMB benchmark NetBench.  Maybe not so useful.
 
* Do it ourselves: modify postmark or filebench?  Set up a mail server (for example) and send it fake mail.  Get traces from working servers.
 
