P2P Design Specification

From Linux NFS

(Difference between revisions)
(39 intermediate revisions not shown)
Line 5: Line 5:
|}
|}
-
= Overview =
+
== Overview ==
-
<font color="blue">''The design specification covers the internal details of a module.  This includes anything that doesn’t have an effect on the interaction model presented by the Functional Spec (FS) or Architecture Spec (AS).</font>
+
Peer-to-peer pNFS is designed to solve the "boot storm" problem that happens when several clients in a cluster boot and attempt to read the same set of files from a single NFS server all at the same time.  This could overload the server's bandwidth, slowing down operations on most client machines.  The idea behind p2p NFS is to allow clients to act as an adhoc read-only pNFS data server that serves files out of their data cache.  This should spread out network usage across all machines, rather than focusing all activity on a single node.  Server and desired DS machines will need to be modified but any
 +
pNFS-enabled client already has the code required to read from adhoc DSs.
-
<font color="blue">''The target audience for this document is:</font>
+
== Related Documents ==
-
* <font color="blue">''Development – Current and future: be thinking of the new engineer who’s been assigned a burt in this module</font>
+
-
* <font color="blue">''QA – Given this DS, QA should understand the design enough to be able to create white-box type tests for the various parts.</font>
+
-
 
+
-
<font color=blue>
+
-
''Describe the work concisely but well enough that a reader not on your team will understand at a high level what you're doing, how you're doing it, why you're doing it, who should care enough to read further and why.  Be sure to highlight any key interactions with other components of the system.''
+
-
</font>
+
-
 
+
-
<font color="blue">''Provide enough context to make the rest of this document meaningful.</font>
+
-
 
+
-
= Related Documents =
+
* draft-myklebust-nfsv4-pnfs-backend-protocol-01.txt
* draft-myklebust-nfsv4-pnfs-backend-protocol-01.txt
* [http://tools.ietf.org/html/rfc5661 RFC 5661]
* [http://tools.ietf.org/html/rfc5661 RFC 5661]
-
= Dependencies =
+
 
-
== This design needs the following from others: ==
+
== Dependencies ==
 +
=== This design needs the following from others: ===
{| class=wikitable style="width:100%"
{| class=wikitable style="width:100%"
Line 32: Line 24:
|- style="height:25px"
|- style="height:25px"
| 1
| 1
-
| This is the reference implementation that the p2p work goes on top of.  git://git.linux-nfs.org/projects/bhalevy/linux-pnfs.git
+
| [http://git.linux-nfs.org/?p=bhalevy/linux-pnfs.git;a=summary Linux pNFS server development code]
| Bryan Schumaker
| Bryan Schumaker
| Benny Halevy
| Benny Halevy
-
|-style="height:25px"
+
|-
| 2
| 2
-
| pNFS nfs utils needs to be installed on the NFSD server so it can export a filesystem over pNFS. git://git.linux-nfs.org/projects/bhalevy/pnfs-nfs-utils.git
+
| [http://git.linux-nfs.org/?p=bhalevy/pnfs-nfs-utils.git;a=summary pNFS nfs utils] needs to be installed on the NFSD server so it can export a filesystem over pNFS.
| Bryan Schumaker
| Bryan Schumaker
| Benny Halevy
| Benny Halevy
|}
|}
-
== Assumptions ==
+
=== Assumptions ===
-
* Enable the following .config options:
+
* Workload with large number of read-only files
 +
* Enable the following .config options for the pNFS client and pNFS ds machines:
** CONFIG_NFS_V4_1
** CONFIG_NFS_V4_1
** CONFIG_PNFS_FILE_LAYOUT
** CONFIG_PNFS_FILE_LAYOUT
 +
* Enable the following .config option for the pNFS ds machine:
** CONFIG_NFS_P2P
** CONFIG_NFS_P2P
 +
* Enable the following .config options for the pNFS server and pNFS ds machines:
** CONFIG_PNFSD
** CONFIG_PNFSD
** CONFIG_PNFSD_LOCAL_EXPORT
** CONFIG_PNFSD_LOCAL_EXPORT
** CONFIG_PNFSD_P2P
** CONFIG_PNFSD_P2P
* Install pnfs-nfs-utils on the pNFS server
* Install pnfs-nfs-utils on the pNFS server
-
* Add "pnfs" to the export options of a local filesystem
+
* Add "pnfs" to the export options of a local filesystem on the pNFS server
 +
* pNFS DS should have nfsd running, but does not need to edit /etc/exports to share files
 +
* pnfsd needs to add "pnfs" export option to /etc/exports
 +
* pnfsd also needs to have "fsid=0" as an export option, otherwise the path walking code will trigger an early UNREGISTER_DS.
-
= Design =
+
== Design ==
-
<font color="blue">'''''DESCRIBE YOUR DESIGN IN THIS SECTION'''''</font>
+
=== REGISTER_DS ===
 +
* Server
 +
** Only implemented REGISTER_DS_ALL
 +
** Create a new struct pnfs_p2p_client to store information about the adhoc DS:
 +
*** p2p client stateid
 +
*** netid
 +
*** ip address
 +
*** MDS identifier
 +
** Store structure as part of the nfs4_client
 +
** Encode p2p client stateid as reply to client
 +
* Client
 +
** Send REGISTER_DS call as part of nfs4_remote_mount()
 +
*** Use REGISTER_DS_ALL so server knows we'll cache everything
 +
*** Generate MDS identifier using cl_cb_ident and a static u32 counter
-
<font color="blue">''This section is typically the largest section. Since designs are highly specific, the template cannot provide much in the way of guidelines here.  Information which is relevant to the sections below should not be discussed here. </font>
+
=== UNREGISTER_DS ===
 +
* Server
 +
** Check that the nfs4_client has an associated pnfs_p2p_client
 +
** Check that the nfs4_client is using the correct stateid
 +
** Free memory allocated for struct pnfs_p2p_client structure during REGISTER_DS
 +
** Free pnfs_p2p_po_stids associated with the DS
 +
** Set pnfs_p2p_client pointer in nfs4_client to NULL
 +
* Client
 +
** Send UNREGISTER_DS as part of nfs4_destroy_server()
-
'''<font color="blue">''This is the main place where customizing the template for each particular team can really pay off.  Teams are encouraged to add a section for the design considerations their own particular area needs to address.'''</font>
+
=== PROXY_OPEN ===
 +
* Server
 +
** Introduce a pnfs_p2p_po_stid to track what DS the client was referred to
 +
** Strip MDS ID from the filehandle
 +
** Add stateid to list stored in the pnfs_p2p_client for the DS
 +
** Add stateid to list stored in the nfs4_client for the client
 +
** Initialize a callback workqueue structure for PROXY_REVOKE
 +
* Client
 +
** Check if we have already called PROXY_OPEN for this (filehandle, stateid)
 +
** Check that we still have a delegation for the file
 +
** Use MDS identifier from filehandle to find the correct nfs_server structure
 +
** Use server to call an nfs4_proc_proxy_open()
 +
*** Pass filehandle and read stateid
 +
*** Use the compound: [SEQUENCE, PUTFH, PROXY_OPEN,GETFH] to look up the actual filehandle and get a proxy revoke stateid
 +
** Store both filehandles, read stateid and revoke stateid in a pnfs_po_state structure
 +
*** Store this in the pnfs_layout_hdr
 +
** Pass resulting filehandle to nfs_delegation_find_inode() to find inode
 +
** Use d_find_any_alias() on the inode to find and return a dentry to the server
-
<font color="blue">''The Design specification describes how the functionality is implemented. Intended readers are:</font>
+
=== CB_PROXY_REVOKE ===
-
* <font color="blue">''Engineering (current and future)</font>
+
* Server
-
* <font color="blue">''QA; given this spec, QA should understand the design enough to be able to create white-box type tests for the various parts.</font>
+
** Call when client expires on server
 +
** Remove pnfs_p2p_po_stid from lists, but don't free until proxy_revoke_release()
 +
* Client
 +
** Use the filehandle and stateid to find associated layout
 +
** Free that pnfs_po_stid
-
* <font color="blue">''Overall design<br>This document should describe:</font>
+
=== LAYOUTGET ===
-
** <font color="blue">''How it works, in detail.  </font>
+
* Server
-
** <font color="blue">''Module breakdown</font>
+
** Edit pnfs_lexp_layout_get()
-
** <font color="blue">''Major data paths through the code. (Referring to the use cases might be useful here)</font>
+
** Set device id field in the layout to the clientid of the machine acting as the DS
-
** <font color="blue">''Process structure.</font>
+
** If we are not using p2p for the file, instead continue to return 1 as the devid
-
** <font color="blue">''Major data structures. </font>
+
** Encode a filehandle with the DSs MDS ID prepended in filelayout_encode_layout()
-
** <font color="blue">''Concurrency, parallelism, and mutual exclusion.</font>
+
-
** <font color="blue">''Class hierarchy, if your design uses object-oriented notions of inheritance and polymorphism.  This applies to, but is not limited to, development done in object-oriented languages such as C++ and Java.</font>
+
-
**<font color="blue">''A UML diagram may be the easiest and most precise way of describing the relationship between the various abstractions supported by your design.</font>
+
-
** <font color="blue">''Any state machines.</font>
+
-
** <font color="blue">''What persistent storage is used?  For Data ONTAP this might be files in the root, rdb databases, registry entries, and the like.  For other products, it might be a client filesystem, a NetApp system somewhere, or dedicated hardware.  What happens when (not if) these are lost due to failure or hardware replacement?</font>
+
-
** <font color="blue">''Resources used, how they’re controlled, what we do when we run out, recovery steps</font>
+
-
** <font color="blue">''What languages are involved.</font>
+
-
** <font color="blue">''Document how the consistency model is maintained. (NG, CFO, consistency points, etc.)</font>
+
-
* <font color="blue">''Licenses</font>
+
=== LAYOUTRETURN ===
-
** <font color="blue">''Describe how licenses are used, especially if the license checking must be done before <font color="blue">the licensing infrastructure is initialized</font>.</font>
+
* Server
 +
** Add to pnfs_lexp_layout_return()
 +
** Check nfs4_client for files opened on a DS
 +
*** Send CB_PROXY_REVOKE
 +
** Also check the pnfs_p2p_client structure for files cached as a DS
 +
*** Free these stateids directly
 +
* Client
 +
** Free up pnfs_po_state stored in the pnfs_layout_hdr
-
* <font color="blue">''Upgrade/revert</font>
+
=== GETDEVICEINFO ===
-
** <font color="blue">''Describe how upgrade and revert work.</font>
+
* Server
-
** <font color="blue">''Discuss how <font color="blue">these modules interact with CFO and SFO, data motion, and data replication</font>.</font>
+
** If we are given a device id of 1 continue using the non-p2p code
 +
** Edit pnfsd_lexp_get_device_info() to fill out pnfs_filelayout_devaddr structure with DS information
 +
** Translate deviceid back to clientid to look up the DS
 +
** Fill out netid and ip address information using data in the pnfs_p2p_client structure
-
* <font color="blue">''Install/uninstall</font>
+
=== PUTFH ===
-
** <font color="blue">''Describe how the product is installed and uninstalled.</font>
+
* Server
 +
** If this is a p2p filehandle then skip some of the state checking stuff because we won't have a dentry until after calling PROXY_OPEN
 +
** Check if a filehandle is p2p by looking at the length (p2p: 36 bytes, normal: 28 bytes)
-
* <font color="blue">''Versioning/compatibility</font>
+
=== READ ===
-
** <font color="blue">''Describe how the versioning checks are implemented.</font>
+
* Server
-
** <font color="blue">''If wire- or disk-layout is important, discuss tools (like IDL’s) used to achieve that.</font>
+
** Call into the NFS client module to perform PROXY_OPEN and return the associated dentry for p2p filehandles
-
* <font color="blue">''Internationalization/language support</font>
+
=== OPEN ===
 +
* Server
 +
** Introduce a vfs_find_any_mount() to look up any mount structure for a dentry
 +
*** This is a hack, but we don't care which mount structure as long as we get the file data!
-
* <font color="blue">''Branding and brand or vendor-neutral implementation.''</font>
+
=== Other Notes ===
 +
* free_p2p_po_stid()
 +
** Remove from lists first before either freeing or calling CB_PROXY_REVOKE to prevent accidental double frees
 +
* DS expires on server
 +
** Treat as if the client had called unregister_ds()
-
* <font color="blue">''Configurations</font>
+
=== Data Structures ===
-
** <font color="blue">''Describe algorithms related to the platform or architecture type.</font>
+
==== Server ====
-
** <font color="blue">''Describe algorithms affected by user configuration.</font>
+
* p2p client information
 +
struct pnfs_p2p_client {
 +
        struct nfs4_stid p2p_stid;
 +
        u64 p2p_mds_id;
 +
        char *p2p_netid;
 +
        char *p2p_addr;
 +
        struct list_head p2p_ds_files;
 +
};
 +
* p2p proxy open stateid
 +
struct pnfs_p2p_po_stid {
 +
        struct nfs4_stid  po_stid;
 +
        struct knfsd_fh  po_fh;
 +
        struct list_head  po_ds_list;
 +
        struct list_head  po_cl_list;
 +
        struct nfsd4_callback po_cb;
 +
};
-
* <font color="blue">''Packaging</font>
+
==== Client ====
-
** <font color="blue">''Does it change the build/release/install process in any way (e.g. adds new build types, new build steps, new build files, new files to be shipped in the tar bundle, etc.) If so, describe how these are implemented.</font>
+
* NFSv4 Proxy Open
 +
struct pnfs_po_state {
 +
        nfs4_stateid  read_stateid;
 +
        nfs4_stateid  revoke_stateid;
 +
        struct nfs_fh fh;
 +
        struct list_head list;
 +
};
-
* <font color="blue">''Online documentation</font>
+
=== Compatibility ===
-
** <font color="blue">''Describe implementations of documentation of any form (for example, tools which process commentary and create other documents)</font>
+
* Any v4.1 / pNFS enabled client should be able to make use of adhoc data servers already, and not need special p2p extensions.
 +
* Clients wishing to act as a data server need CONFIG_NFS_P2P enabled
 +
* Servers wishing to track adhoc DSs need CONFIG_PNFSD_P2P enabled
-
=Multitenancy Considerations=
+
=== Documentation ===
-
<font color="blue">
+
* I can write a Documentation/filesystems/nfs/peer_to_peer.txt file to give a brief overview of how p2pNFS is supposed to work and how users can configure it.
-
''Are there any Secure Multi-tenancy (Vserver and Delegated administration)design considerations for this feature? If so, consider which services, protocols, policies, schedules or manageable objects will need to be Vserverized. It is recommended that security implications also be considered.'' </font>
+
* I can also copy the page to linux-nfs.org for "online documentation"
-
=Feature Interaction Dependencies and Impacts=
+
-
<font color="blue">
+
-
''Please review the [[TechnicalAdvisoryBoard/Feature_Reference_List |'''Feature Reference List''']] and denote any dependencies in this section.'' </font>
+
-
= Performance =
+
== Feature Interaction Dependencies and Impacts ==
-
<font color="blue">''Describe what if any aspects of the design impact the performance?</font>
+
* nfsd <-> nfs
-
* <font color="blue">''What bottlenecks, limitations, or unpredictable performance effects may result from the design, and why?</font>
+
** The machine acting as a pNFS DS needs to be running both the nfs server and the nfs client.
-
* <font color="blue">''Discuss resource limitations and sizing issues as they apply to performance.</font>
+
* Made changes to putfh
 +
** Check filehandle length since p2p filehandles are longer
 +
** Call the original version of the function if we are using a normal fh
 +
* Made changes to nfsd4_read
 +
** Call original read function if this isn't p2p, call proxy open otherwise to get data from client
 +
* nfsd_open needs to lookup mount structure without using an exportops structure for NFS
 +
* filelayout_encode_layout needs to be able to encode p2p filehandles and normal filehandles
 +
* pnfs_p2p_mark_fh increases filehandle size, server needs to know to use the mds id for bigger filehandles
 +
* nfsd4_proc_compound needs to know if a filehandle is a p2p fh since the dentry will be looked up later for reads
-
=Scalability=
+
== Performance ==
-
<font color="blue"> ''Provide details about how scalability goals identified in the related Architecture and Functional Specifications will be met.''
+
* Keep a per-file LRU list of clients that currently have the file cached to avoid redirecting all p2p activity to the same client for that file.
-
''Provide descriptions of data structures, algorithms, and programmatic interfaces between Data ONTAP components, or between client and server, which are needed to achieve a scalable solution.''
+
== Scalability ==
 +
The hope is that p2p NFS scales to hundreds and thousands of clients better
 +
than straight pnfs does.  This can be tested by comparing read times for files
 +
of varying sizes both with and without p2p enabled.  A handful of DSs and a
 +
large number of clients should be used to get a feel for how this would work
 +
in a data center.
 +
* An LRU list of clients should help load balance traffic to each DS
 +
** Make use of already existing nfs4_file->fi_delegations list, move a DSs delegation to the end when referring
 +
* I take the state lock (global mutex) when accessing file or client state
 +
* I created a p2p spinlock for accessing p2p state
-
''For example, fast lookup of a logical object may involve replacing use of a linear based search, with use
+
== Testing ==
-
of a hash table or btree based search.'' </font>
+
* Basic proof-of-concept tests
 +
** 1 client, 1 DS, 1 server
 +
** Have DS and client rsync files from server
 +
** Maybe do a `git clone linux-src` instead?
 +
** Try exporting a /lib partition
 +
* In-depth testing
 +
** NFSv4root with varying numbers of clients
 +
*** NFSv4root doesn't work right now due to idmapping issues
 +
** More rsyncs / git clones with more clients
-
= Open Issues =
+
== Open Issues ==
-
 
+
-
<font color="blue">''Record in this section issues that you are aware of, but which are not yet resolved in the specification. If you discover issues after the specification is approved, you may record them here, and then re-review the specification after you address the issues.''</font>
+
{| class=wikitable width="100%"
{| class=wikitable width="100%"
Line 142: Line 236:
|-
|-
| 1
| 1
-
| Date the issue was raised.
+
| 12/11/2012
-
| Who raised it?
+
| Bryan
-
| Describe the issue.
+
| Client needs to mount server with the public filehandle, otherwise the path walking code will trigger an early UNREGISTER_DS.
-
| Describe what you did to resolve the issue.
+
| [NONE]
-
| Date
+
| [NONE]
|}
|}
-
 
+
== Approvals ==
-
=Revision History=
+
=== Approvers ===
-
<font color="blue">''The entries below are for this template itself.  Replace them with the history of changes to your specification.</font>
+
-
 
+
-
{| class=wikitable width="100%"
+
-
|-
+
-
! Version
+
-
! Date
+
-
! Name
+
-
! Change
+
-
|-
+
-
| style="height:25px" |1.0 ||3 Feb 2006 ||Garth Rodericks || Initial version adapted from ONTAP template
+
-
|-
+
-
| style="height:25px" |2.0 ||24 Feb 2006 ||Becca Beaman || Integrated comments from first review.
+
-
|-
+
-
| style="height:25px" |2.1 ||8 Mar 2006 ||Becca Beaman || Added note on customization to main section moved Revision History to end.
+
-
|-
+
-
| style="height:25px" |2.2 ||14 Apr 2006 ||Garth Rodericks || Updated formatting to match other docs. Eliminated title page and TOC.
+
-
|-
+
-
| style="height:25px" |2.3 ||3 May 2006 ||Garth Rodericks || Added TOC to section.
+
-
|-
+
-
| style="height:25px" |2.4 ||22 Jun 2006 ||Garth Rodericks || Moved related docs to follow overview.
+
-
|-
+
-
| style="height:25px" |2.5 ||9 Oct 2006 ||Becca Beaman || Changed “Features” section to “Design,” Added text “DESCRIBE YOUR DESIGN IN THIS SECTION.”
+
-
|-
+
-
| style="height:25px" |2.6 ||13 Dec 2006 ||Garth Rodericks || Updated instruction text formatting to show in blue italics and added missing Revision History table.
+
-
|-
+
-
| style="height:25px" |2.7 ||23 Feb 2007 ||Charlie Hedstrom|| Update copyright year, correct WIKI instructions.
+
-
|-
+
-
| style="height:25px" |2.8 ||12 Jun 2007 ||Kim Merriman|| Added Document Status section
+
-
|-
+
-
| style="height:25px" |2.9 ||22 August 2007 ||Becca Beaman|| Added pointer to "RAS Basics" in RAS section.
+
-
|-
+
-
| style="height:25px" |2.10 ||10 September 2007 ||Brian Hackworth|| Removed the unneeded "Document Status" section. Split Dependencies into two tables: incoming and outgoing.
+
-
|-
+
-
| style="height:25px" |2.11 ||10 September 2007 ||Brian Hackworth|| Added "Assumptions" sub-section in Dependencies.
+
-
|-
+
-
| style="height:25px" |2.12 ||12 September 2007 ||Garth Rodericks|| Updated formatting at top of document, updated link to authoritative Word version, and eliminated duplicate text from top of document that's already in Objective section.
+
-
|-
+
-
| style="height:25px" |2.13 ||27 September 2007 ||Brian Hackworth|| Added some more detailed things to think about with respect to memory budgets in Data ONTAP.
+
-
|-
+
-
| style="height:25px" | 2.14 || 2 May 2008 || Brian Hackworth || Added to the Approvals section a description of "required reviewers" as distinct from approvers.
+
-
|-
+
-
| style="height:25px" | 2.15 || 9 October 2008 || Garth Rodericks || Added information classification and keyword to comply with NetApp Information Security policy.
+
-
|-
+
-
| 2.16 || 2 October 2009 || Brian Hackworth
+
-
| Added Open Issues section and Target Approval Date.
+
-
|-
+
-
| 2.17 || 8 April 2010 || Brian Hackworth
+
-
| Added mention of branding in Design section.
+
-
|-
+
-
| 2.18 || 9 December 2010 || Joe CaraDonna, Eric Hamilton
+
-
| Rework for Data ONTAP 8 and TAB++ review process.
+
-
|-
+
-
| 2.19 || 23 December 2010 || Joe CaraDonna
+
-
| Further refined ONTAP Resource Requirements section.
+
-
|-
+
-
| 2.20 || 16 March 2011 || Eric Hamilton
+
-
| Added Snaplock section.
+
-
|-
+
-
| 2.21 || 17 May 2011 || Kathy Coencas
+
-
|Added instruction for Spec Tool users to replace the approver table with a link to the approver list in Spec Tool
+
-
|-
+
-
| 2.22 || 19 May 2011 || Kathy Coencas
+
-
| Added link to Spec Tool to instruction for Spec Tool users to replace the approver table with a link to the approver list in Spec Tool
+
-
|-
+
-
| 2.23 || 05 July 2011 || Kathy Coencas
+
-
| Added Scalability sections/instructions and highlighted instructions to add a link to approvers in the spec tool
+
-
|-
+
-
| 2.24 || 21 December 2011 || Kathy Coencas
+
-
| Removed IE as a mandatory approver of the Design Spec
+
-
|-
+
-
| 2.25 || 25 January 2012 || Kathy Coencas and Vanesa Knisley
+
-
| Updated section 9 based on approved change request, to add Feature Interaction Dependencies and Impacts.
+
-
|-
+
-
| 2.26 || 19 March 2012 || Vanesa Knisley
+
-
| Updated Section 8 based on approved change request, to add more detail
+
-
|-
+
-
| 2.27 || 9 August 2012 || Vanesa Knisley
+
-
| Added Section 9 based on approved change request, to match AS and FS
+
-
|-
+
-
|}
+
-
 
+
-
= Approvals =
+
-
 
+
-
== Approvers==
+
-
 
+
-
<font color="blue">''Record here the names of the individuals who must approve the specification.  When they approve the specification, add the date of their approval in the last column. '''If your specification is in [http://spectool.eng.netapp.com/index.php Spec Tool], replace the table below with a link to the approver list in Spec Tool.''' </font>
+
-
 
+
-
<font color="blue">''Guidelines for approvers:</font>
+
-
* <font color="blue">''A Technical Director for your project area is the primary approver, and verifies that the specification is complete, adequately addresses the problem space, is consistent with the architecture for the product, is consistent with existing products and features, and adequately addresses dependencies with other projects. Additionally, the TD should verify that the specification has been reviewed both within the project team and with any other teams with dependencies, and that comments raised during reviews have been incorporated into the specification.</font>
+
-
* <font color="blue">''If the specification calls out Dependencies with other groups, include the Technical Directors (or delegates) from those groups as approvers.</font>
+
-
* <font color="blue">''The Product Manager verifies that the specification adequately addresses the requirements from the Engineering Requirements and Response Document.</font>
+
-
* <font color="blue">''The Quality Assurance approver verifies that the specification defines the features and behaviors in sufficient detail to begin work on test planning. This should include the format and content of all inputs and outputs that are user visible.</font>
+
-
* <font color="blue">'' For any Data ONTAP design spec, the Technical Advisory Board uses [[User:Erich/template_rvw/tab_checklist | this checklist]] to verify that the design is consistent with the overall architecture of Data ONTAP. The TAB may also identify additional reviewers and groups that should be consulted.</font>
+
-
* <font color="blue">''The Resource Requirement Reviewer verifies that the specification defines the resource requirements of the feature in sufficient detail to assess whether the target platforms can support it, and assists in planning for system growth. A Resource Requirement Reviewer can be assigned by contacting dl-resource-review.</font>
+
-
* <font color="blue">''If there are other areas of expertise that the project team desires input from (for example, a review of the User Interface sections by someone with UI expertise), or if the specification is complex, feel free to include additional approvers as needed.</font>
+
{| class=wikitable width=100%
{| class=wikitable width=100%
Line 254: Line 253:
! width=175 | Approval Date
! width=175 | Approval Date
|-
|-
-
| Name || Technical Director, or delegate || Date || Date
+
| Trond Myklebust || NFS Client Maintainer || Date || Date
|-
|-
-
| Name || Technical Director, or delegate for any dependent groups || Date || Date
 
-
|-
 
-
| Name || Product Manager || Date || Date
 
-
|-
 
-
| Name || Quality Assurance || Date || Date
 
-
|-
 
-
| Name || Resource Requirement Reviewer || Date || Date
 
-
|-
 
-
| Name || Technical Advisory Board Member || Date || Date
 
|}
|}
-
==Reviewers==
+
=== Reviewers ===
-
 
+
-
<font color="blue">''Reviewers are those people who should be informed of the feature, but who are not required to officially approve it. Normally, these are people you depend on, or who depend on you, and are called out here to make sure they're aware of the dependency. Record here the names of the individuals who should review the specification, and upon completion add the date in the last column. If your specification is in Spec Tool, replace the table below with a link to the approver list in Spec Tool.</font>
+
{| class=wikitable width=100%
{| class=wikitable width=100%
Line 278: Line 266:
! width=175 | Approval Date
! width=175 | Approval Date
|-
|-
-
| Name || - || Date || Date
+
| Jeffrey Heller || Bryan's Manager || Date || Date
-
|-
+
-
| Name || - || Date || Date
+
|-
|-
-
| Name || - || Date || Date
 
|}
|}

Latest revision as of 19:23, 18 January 2013

3 December 2012 DRAFT bjschuma@netapp.com

Overview

Peer-to-peer (p2p) pNFS is designed to solve the "boot storm" problem that occurs when several clients in a cluster boot and attempt to read the same set of files from a single NFS server at the same time. This can saturate the server's bandwidth and slow down operations on most client machines. The idea behind p2p NFS is to allow clients to act as ad hoc, read-only pNFS data servers (DSs) that serve files out of their data cache. This should spread network usage across all machines rather than concentrating it on a single node. The server and any machine that will act as a DS need to be modified, but any pNFS-enabled client already has the code required to read from ad hoc DSs.

Related Documents

  • draft-myklebust-nfsv4-pnfs-backend-protocol-01.txt
  • RFC 5661

Dependencies

This design needs the following from others:

Item | Description of Dependency or Issue | Affected Group | Contact
1 | Linux pNFS server development code | Bryan Schumaker | Benny Halevy
2 | pNFS nfs utils needs to be installed on the NFSD server so it can export a filesystem over pNFS. | Bryan Schumaker | Benny Halevy

Assumptions

  • The workload consists of a large number of read-only files
  • Enable the following .config options for the pNFS client and pNFS DS machines:
    • CONFIG_NFS_V4_1
    • CONFIG_PNFS_FILE_LAYOUT
  • Enable the following .config option for the pNFS DS machine:
    • CONFIG_NFS_P2P
  • Enable the following .config options for the pNFS server and pNFS DS machines:
    • CONFIG_PNFSD
    • CONFIG_PNFSD_LOCAL_EXPORT
    • CONFIG_PNFSD_P2P
  • Install pnfs-nfs-utils on the pNFS server
  • Add "pnfs" to the export options of a local filesystem in /etc/exports on the pNFS server (pnfsd)
  • The pNFS DS should have nfsd running, but does not need to edit /etc/exports to share files
  • The pNFS server export also needs "fsid=0" as an export option; otherwise the path walking code will trigger an early UNREGISTER_DS

Design

REGISTER_DS

  • Server
    • Only REGISTER_DS_ALL is implemented
    • Create a new struct pnfs_p2p_client to store information about the ad hoc DS:
      • p2p client stateid
      • netid
      • IP address
      • MDS identifier
    • Store structure as part of the nfs4_client
    • Encode p2p client stateid as reply to client
  • Client
    • Send REGISTER_DS call as part of nfs4_remote_mount()
      • Use REGISTER_DS_ALL so server knows we'll cache everything
      • Generate MDS identifier using cl_cb_ident and a static u32 counter
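
The identifier generation step can be sketched in user-space C as follows. The spec only says that the identifier is built from cl_cb_ident and a static u32 counter and is stored as a u64; the bit layout (cl_cb_ident in the upper 32 bits, counter in the lower 32) and the name sketch_make_mds_id() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 64-bit MDS identifier from the client's
 * callback ident and a module-wide counter. The bit layout is an
 * assumption, not taken from the patches.
 */
static uint64_t sketch_make_mds_id(uint32_t cl_cb_ident)
{
	static uint32_t counter; /* zero-initialized; no locking in this sketch */

	return ((uint64_t)cl_cb_ident << 32) | counter++;
}
```

With this layout, two mounts that share a cl_cb_ident would still get distinct identifiers, which is what the DS needs in order to map a p2p filehandle back to the correct nfs_server.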

UNREGISTER_DS

  • Server
    • Check that the nfs4_client has an associated pnfs_p2p_client
    • Check that the nfs4_client is using the correct stateid
    • Free the struct pnfs_p2p_client that was allocated during REGISTER_DS
    • Free pnfs_p2p_po_stids associated with the DS
    • Set pnfs_p2p_client pointer in nfs4_client to NULL
  • Client
    • Send UNREGISTER_DS as part of nfs4_destroy_server()
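
The server-side checks and teardown might look roughly like the sketch below. All types are simplified user-space stand-ins for nfs4_client, pnfs_p2p_client and pnfs_p2p_po_stid, and the error codes and function name are hypothetical. The sketch clears the pointer before freeing, which is one safe ordering of the steps listed above and not necessarily the order used in the real code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel types; names are illustrative. */
struct stateid { unsigned char data[16]; };

struct po_stid {
	struct po_stid *next;
};

struct p2p_client {
	struct stateid stid;       /* stateid handed out by REGISTER_DS */
	struct po_stid *ds_files;  /* proxy open stateids for this DS */
};

struct client {
	struct p2p_client *p2p;    /* NULL when the client is not a DS */
};

enum { P2P_OK = 0, P2P_ERR_NOT_DS = -1, P2P_ERR_BAD_STATEID = -2 };

static int sketch_unregister_ds(struct client *clp, const struct stateid *stid)
{
	struct p2p_client *p2p = clp->p2p;

	if (!p2p)
		return P2P_ERR_NOT_DS;
	if (memcmp(&p2p->stid, stid, sizeof(*stid)) != 0)
		return P2P_ERR_BAD_STATEID;

	clp->p2p = NULL; /* unhook first so nothing else can reach it */

	while (p2p->ds_files) {
		struct po_stid *po = p2p->ds_files;

		p2p->ds_files = po->next;
		free(po);
	}
	free(p2p);
	return P2P_OK;
}
```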

PROXY_OPEN

  • Server
    • Introduce a pnfs_p2p_po_stid to track which DS the client was referred to
    • Strip MDS ID from the filehandle
    • Add stateid to list stored in the pnfs_p2p_client for the DS
    • Add stateid to list stored in the nfs4_client for the client
    • Initialize a callback workqueue structure for PROXY_REVOKE
  • Client
    • Check if we have already called PROXY_OPEN for this (filehandle, stateid)
    • Check that we still have a delegation for the file
    • Use MDS identifier from filehandle to find the correct nfs_server structure
    • Use that nfs_server to call nfs4_proc_proxy_open()
      • Pass filehandle and read stateid
      • Use the compound [SEQUENCE, PUTFH, PROXY_OPEN, GETFH] to look up the actual filehandle and get a proxy revoke stateid
    • Store both filehandles, read stateid and revoke stateid in a pnfs_po_state structure
      • Store this in the pnfs_layout_hdr
    • Pass resulting filehandle to nfs_delegation_find_inode() to find inode
    • Use d_find_any_alias() on the inode to find and return a dentry to the server
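
The first client step, checking whether PROXY_OPEN has already been issued for a (filehandle, stateid) pair, amounts to a lookup over the stored proxy open state. A minimal user-space sketch follows, assuming a singly linked list in place of the kernel list_head and simplified filehandle and stateid types; po_state_find() and fh_equal() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FH_MAXSIZE 128 /* NFSv4 filehandles are at most 128 bytes */

struct fh { unsigned short size; unsigned char data[SKETCH_FH_MAXSIZE]; };
struct stateid { unsigned char data[16]; };

/* Stand-in for pnfs_po_state, with a plain next pointer for the list. */
struct po_state {
	struct stateid read_stateid;
	struct stateid revoke_stateid;
	struct fh fh;
	struct po_state *next;
};

static int fh_equal(const struct fh *a, const struct fh *b)
{
	return a->size == b->size && memcmp(a->data, b->data, a->size) == 0;
}

/* Return the existing state for (fh, read stateid), or NULL if none. */
static struct po_state *po_state_find(struct po_state *head,
				      const struct fh *fh,
				      const struct stateid *read_stateid)
{
	struct po_state *po;

	for (po = head; po; po = po->next) {
		if (fh_equal(&po->fh, fh) &&
		    memcmp(&po->read_stateid, read_stateid,
			   sizeof(*read_stateid)) == 0)
			return po;
	}
	return NULL;
}
```

A NULL result would mean the DS has to send the PROXY_OPEN compound; a hit lets it reuse the stored state and skip the round trip to the server.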

CB_PROXY_REVOKE

  • Server
    • Sent when a client expires on the server
    • Remove pnfs_p2p_po_stid from lists, but don't free until proxy_revoke_release()
  • Client
    • Use the filehandle and stateid to find associated layout
    • Free the associated pnfs_po_state

LAYOUTGET

  • Server
    • Edit pnfs_lexp_layout_get()
    • Set device id field in the layout to the clientid of the machine acting as the DS
    • If p2p is not being used for the file, continue to return 1 as the device id
    • Encode a filehandle with the DS's MDS ID prepended in filelayout_encode_layout()
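
Prepending the MDS ID can be sketched as below. The 8-byte prefix is inferred from the u64 p2p_mds_id field and from the 28-byte and 36-byte filehandle lengths quoted under PUTFH; the big-endian byte order and the name sketch_mark_fh() are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	NORMAL_FH_LEN = 28,                      /* normal filehandle */
	MDS_ID_LEN    = 8,                       /* sizeof(u64) */
	P2P_FH_LEN    = NORMAL_FH_LEN + MDS_ID_LEN /* 36 bytes */
};

/*
 * Write mds_id followed by the normal filehandle into out.
 * Returns the encoded length, or 0 if the input is not a normal
 * filehandle or the output buffer is too small.
 */
static size_t sketch_mark_fh(uint64_t mds_id,
			     const unsigned char *fh, size_t fh_len,
			     unsigned char *out, size_t out_len)
{
	size_t i;

	if (fh_len != NORMAL_FH_LEN || out_len < P2P_FH_LEN)
		return 0;

	/* Assumed byte order: most significant byte first. */
	for (i = 0; i < MDS_ID_LEN; i++)
		out[i] = (unsigned char)(mds_id >> (8 * (MDS_ID_LEN - 1 - i)));
	memcpy(out + MDS_ID_LEN, fh, fh_len);
	return P2P_FH_LEN;
}
```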

LAYOUTRETURN

  • Server
    • Add to pnfs_lexp_layout_return()
    • Check nfs4_client for files opened on a DS
      • Send CB_PROXY_REVOKE
    • Also check the pnfs_p2p_client structure for files cached as a DS
      • Free these stateids directly
  • Client
    • Free up pnfs_po_state stored in the pnfs_layout_hdr

GETDEVICEINFO

  • Server
    • If we are given a device id of 1 continue using the non-p2p code
    • Edit pnfsd_lexp_get_device_info() to fill out pnfs_filelayout_devaddr structure with DS information
    • Translate deviceid back to clientid to look up the DS
    • Fill out netid and ip address information using data in the pnfs_p2p_client structure
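
The device id handling described for LAYOUTGET and GETDEVICEINFO could be modelled like this. It is a sketch under stated assumptions: device ids and clientids are both reduced to a uint64_t, struct ds_entry and devid_to_ds() are hypothetical names standing in for pnfs_p2p_client and its lookup, and the sketch assumes no real clientid is ever equal to 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEVID_NON_P2P 1ULL /* from the text: device id 1 selects the non-p2p code */

/* Hypothetical, simplified stand-in for pnfs_p2p_client. */
struct ds_entry {
        uint64_t    clientid; /* doubles as the device id handed out by LAYOUTGET */
        const char *netid;    /* e.g. "tcp" */
        const char *addr;     /* universal address of the DS */
};

/*
 * Translate a device id back to the registered DS.  Returns NULL when the
 * caller should use the non-p2p path or when no DS matches.
 */
static const struct ds_entry *devid_to_ds(const struct ds_entry *tbl, size_t n,
                                          uint64_t devid)
{
        assert(tbl != NULL || n == 0);
        if (devid == DEVID_NON_P2P)
                return NULL;
        for (size_t i = 0; i < n; i++)
                if (tbl[i].clientid == devid)
                        return &tbl[i];
        return NULL;
}
```

The netid and addr fields of the returned entry are what would be copied into the pnfs_filelayout_devaddr structure.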

PUTFH

  • Server
    • If this is a p2p filehandle, skip part of the state checking, because we won't have a dentry until after calling PROXY_OPEN
    • Check if a filehandle is p2p by looking at the length (p2p: 36 bytes, normal: 28 bytes)
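
The length check, together with the "strip MDS ID" step from PROXY_OPEN, might look like the sketch below. The sizes come from the text above; treating the 8 extra bytes as the u64 MDS ID prepended to a normal filehandle is an assumption, and fh_is_p2p() and fh_strip_mds_id() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH_NORMAL_LEN 28
#define FH_P2P_LEN    36
#define FH_MDS_ID_LEN (FH_P2P_LEN - FH_NORMAL_LEN) /* assumed to be the u64 MDS ID */

/* A filehandle is treated as p2p purely by its length. */
static bool fh_is_p2p(size_t fh_len)
{
        return fh_len == FH_P2P_LEN;
}

/*
 * Split a p2p filehandle into its MDS ID prefix and the normal filehandle
 * that follows it.  Returns false (and writes nothing) for non-p2p input.
 * "out" must have room for FH_NORMAL_LEN bytes.
 */
static bool fh_strip_mds_id(const unsigned char *fh, size_t fh_len,
                            uint64_t *mds_id, unsigned char *out)
{
        assert(FH_MDS_ID_LEN == sizeof(*mds_id));
        if (!fh_is_p2p(fh_len))
                return false;
        /* Copied in host byte order; a real encoding would fix the endianness. */
        memcpy(mds_id, fh, FH_MDS_ID_LEN);
        memcpy(out, fh + FH_MDS_ID_LEN, FH_NORMAL_LEN);
        return true;
}
```

One consequence of keying on length alone is that any other filehandle type that happens to be 36 bytes long would be misclassified, so a real implementation may want an additional marker.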

READ

  • Server
    • Call into the NFS client module to perform PROXY_OPEN and return the associated dentry for p2p filehandles

OPEN

  • Server
    • Introduce a vfs_find_any_mount() to look up any mount structure for a dentry
      • This is a hack, but we don't care which mount structure as long as we get the file data!

Other Notes

  • free_p2p_po_stid()
    • Remove from lists first before either freeing or calling CB_PROXY_REVOKE to prevent accidental double frees
  • DS expires on server
    • Treat as if the client had called unregister_ds()
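
The "remove from lists first" rule for free_p2p_po_stid() can be illustrated with a small userspace model. This is only a sketch: struct node mimics the kernel's list_head, struct po_stid keeps just the two list links of pnfs_p2p_po_stid, po_stid_unhash() is a hypothetical helper name, and locking (the p2p spinlock in the real code) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node { struct node *prev, *next; };

static void node_init(struct node *n) { n->prev = n->next = n; }
static bool node_linked(const struct node *n) { return n->next != n; }

static void node_add_tail(struct node *n, struct node *head)
{
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
}

/* Unlink and re-initialise, so a second call is harmless. */
static void node_del_init(struct node *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
        node_init(n);
}

/* Simplified pnfs_p2p_po_stid: only the two list links are kept. */
struct po_stid {
        struct node ds_link; /* on the DS's list */
        struct node cl_link; /* on the requesting client's list */
};

/*
 * Unlink the stateid from both lists.  Returns true only for the caller
 * that actually removed it; that caller alone may go on to free it or
 * send CB_PROXY_REVOKE, which is what prevents a double free.
 */
static bool po_stid_unhash(struct po_stid *s)
{
        assert(s != NULL);
        if (!node_linked(&s->ds_link) && !node_linked(&s->cl_link))
                return false;
        node_del_init(&s->ds_link);
        node_del_init(&s->cl_link);
        return true;
}
```

Because the stateid is no longer reachable from either list once po_stid_unhash() returns true, a concurrent LAYOUTRETURN or client expiry cannot find it and free it a second time.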

Data Structures

Server

  • p2p client information
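struct pnfs_p2p_client {
       struct nfs4_stid p2p_stid;
       u64 p2p_mds_id;                 /* MDS ID prepended to filehandles for this DS */
       char *p2p_netid;                /* netid returned by GETDEVICEINFO */
       char *p2p_addr;                 /* address returned by GETDEVICEINFO */
       struct list_head p2p_ds_files;  /* proxy open stateids for this DS */
};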
  • p2p proxy open stateid
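struct pnfs_p2p_po_stid {
       struct nfs4_stid  po_stid;
       struct knfsd_fh   po_fh;
       struct list_head  po_ds_list;   /* entry in the DS's pnfs_p2p_client list */
       struct list_head  po_cl_list;   /* entry in the client's nfs4_client list */
       struct nfsd4_callback po_cb;    /* callback used for CB_PROXY_REVOKE */
};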

Client

  • NFSv4 Proxy Open
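struct pnfs_po_state {
       nfs4_stateid  read_stateid;     /* read stateid passed to PROXY_OPEN */
       nfs4_stateid  revoke_stateid;   /* proxy revoke stateid returned by PROXY_OPEN */
       struct nfs_fh fh;
       struct list_head list;          /* entry in the list kept in the pnfs_layout_hdr */
};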

Compatibility

  • Any v4.1 / pNFS enabled client should be able to make use of adhoc data servers already, and not need special p2p extensions.
  • Clients wishing to act as a data server need CONFIG_NFS_P2P enabled
  • Servers wishing to track adhoc DSs need CONFIG_PNFSD_P2P enabled

Documentation

  • I can write a Documentation/filesystems/nfs/peer_to_peer.txt file to give a brief overview of how p2pNFS is supposed to work and how users can configure it.
  • I can also copy the page to linux-nfs.org for "online documentation"

Feature Interaction Dependencies and Impacts

  • nfsd <-> nfs
    • The machine acting as a pNFS DS needs to be running both the nfs server and the nfs client.
  • Made changes to putfh
    • Check filehandle length since p2p filehandles are longer
    • Call the original version of the function if we are using a normal fh
  • Made changes to nfsd4_read
    • Call the original read function if this isn't p2p; otherwise call proxy open to get data from the client
  • nfsd_open needs to look up the mount structure without using an exportops structure for NFS
  • filelayout_encode_layout needs to be able to encode p2p filehandles and normal filehandles
  • pnfs_p2p_mark_fh increases filehandle size; the server needs to know to use the MDS ID for bigger filehandles
  • nfsd4_proc_compound needs to know if a filehandle is a p2p fh since the dentry will be looked up later for reads

Performance

  • Keep a per-file LRU list of clients that currently have the file cached to avoid redirecting all p2p activity to the same client for that file.
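
The LRU idea can be sketched as a simple rotation: refer the next reader to the DS at the front of the list, then move that DS to the back. The example below is a userspace illustration only. It uses a fixed-size array of DS ids instead of the nfs4_file->fi_delegations list, and struct ds_lru, MAX_DS and ds_lru_pick() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DS 8 /* arbitrary bound for this sketch */

/* Per-file list of DS ids, least recently referred first. */
struct ds_lru {
        unsigned long long ids[MAX_DS];
        size_t count;
};

/*
 * Pick the least recently referred DS and move it to the end of the list,
 * so that consecutive referrals rotate through all DSs caching the file.
 * Returns 0 when no DS is available (0 is assumed not to be a valid id).
 */
static unsigned long long ds_lru_pick(struct ds_lru *l)
{
        unsigned long long id;

        assert(l->count <= MAX_DS);
        if (l->count == 0)
                return 0;
        id = l->ids[0];
        memmove(&l->ids[0], &l->ids[1], (l->count - 1) * sizeof(l->ids[0]));
        l->ids[l->count - 1] = id;
        return id;
}
```

With three DSs registered for a file, four consecutive referrals go to the first, second, third and then the first DS again.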

Scalability

The hope is that p2p NFS scales to hundreds or thousands of clients better than plain pNFS does. This can be tested by comparing read times for files of varying sizes, both with and without p2p enabled. A handful of DSs and a large number of clients should be used to get a feel for how this would work in a data center.

  • An LRU list of clients should help load balance traffic to each DS
    • Make use of the already existing nfs4_file->fi_delegations list; move a DS's delegation to the end when referring
  • I take the state lock (global mutex) when accessing file or client state
  • I created a p2p spinlock for accessing p2p state

Testing

  • Basic proof-of-concept tests
    • 1 client, 1 DS, 1 server
    • Have DS and client rsync files from server
    • Maybe do a `git clone linux-src` instead?
    • Try exporting a /lib partition
  • In-depth testing
    • NFSv4root with varying numbers of clients
      • NFSv4root doesn't work right now due to idmapping issues
    • More rsyncs / git clones with more clients

Open Issues

Item | Date | Name | Issue | Resolution | Date Resolved
1 | 12/11/2012 | Bryan | Client needs to mount server with the public filehandle, otherwise the path walking code will trigger an early UNREGISTER_DS. | [NONE] | [NONE]

Approvals

Approvers

Name | Role | Target Approval Date | Approval Date
Trond Myklebust | NFS Client Maintainer | Date | Date

Reviewers

Name | Role | Target Approval Date | Approval Date
Jeffrey Heller | Bryan's Manager | Date | Date