P2P Design Specification

From Linux NFS

Latest revision as of 19:23, 18 January 2013

3 December 2012 DRAFT bjschuma@netapp.com

Overview

Peer-to-peer pNFS is designed to solve the "boot storm" problem, which occurs when many clients in a cluster boot and try to read the same set of files from a single NFS server at the same time. This can saturate the server's bandwidth and slow down operations on most client machines. The idea behind p2p NFS is to let clients act as ad hoc read-only pNFS data servers (DSs) that serve files out of their own data cache. This should spread network usage across all machines rather than concentrating it on a single node. The server and the machines that will act as DSs need to be modified, but any pNFS-enabled client already has the code required to read from ad hoc DSs.

Related Documents

  • draft-myklebust-nfsv4-pnfs-backend-protocol-01.txt
  • RFC 5661

Dependencies

This design needs the following from others:

Item | Description of Dependency or Issue | Affected Group | Contact
1 | Linux pNFS server development code | Bryan Schumaker | Benny Halevy
2 | pNFS nfs utils needs to be installed on the NFSD server so it can export a filesystem over pNFS. | Bryan Schumaker | Benny Halevy

Assumptions

  • Workload with large number of read-only files
  • Enable the following .config options for the pNFS client and pNFS DS machines:
    • CONFIG_NFS_V4_1
    • CONFIG_PNFS_FILE_LAYOUT
  • Enable the following .config option for the pNFS DS machine:
    • CONFIG_NFS_P2P
  • Enable the following .config options for the pNFS server and pNFS DS machines:
    • CONFIG_PNFSD
    • CONFIG_PNFSD_LOCAL_EXPORT
    • CONFIG_PNFSD_P2P
  • Install pnfs-nfs-utils on the pNFS server
  • Add "pnfs" to the export options of a local filesystem on the pNFS server
  • pNFS DS should have nfsd running, but does not need to edit /etc/exports to share files
  • pnfsd needs to add "pnfs" export option to /etc/exports
  • pnfsd also needs to have "fsid=0" as an export option, otherwise the path walking code will trigger an early UNREGISTER_DS.

Design

REGISTER_DS

  • Server
    • Only REGISTER_DS_ALL is implemented
    • Create a new struct pnfs_p2p_client to store information about the ad hoc DS:
      • p2p client stateid
      • netid
      • ip address
      • MDS identifier
    • Store structure as part of the nfs4_client
    • Encode p2p client stateid as reply to client
  • Client
    • Send REGISTER_DS call as part of nfs4_remote_mount()
      • Use REGISTER_DS_ALL so server knows we'll cache everything
      • Generate MDS identifier using cl_cb_ident and a static u32 counter
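
A minimal user-space sketch of the identifier generation described above. The notes only say that the identifier comes from cl_cb_ident and a static u32 counter; the function name p2p_make_mds_id() and the bit layout (ident in the high 32 bits, counter in the low 32 bits) are assumptions made for illustration, chosen so the result fits the u64 p2p_mds_id field.

```c
#include <assert.h>
#include <stdint.h>

/* Module-wide counter, standing in for the "static u32 counter" in the notes. */
static uint32_t p2p_mds_counter;

/*
 * Hypothetical helper: build a 64-bit MDS identifier from the client's
 * callback ident (high 32 bits) and the counter (low 32 bits).
 * The bit layout is an assumption of this sketch.
 */
uint64_t p2p_make_mds_id(uint32_t cl_cb_ident)
{
    /* A wrapped counter would hand out a previously used identifier. */
    assert(p2p_mds_counter != UINT32_MAX);
    return ((uint64_t)cl_cb_ident << 32) | p2p_mds_counter++;
}
```

The sketch is not thread-safe; kernel code would need to protect the counter with a lock or use an atomic increment.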

UNREGISTER_DS

  • Server
    • Check that the nfs4_client has an associated pnfs_p2p_client
    • Check that the nfs4_client is using the correct stateid
    • Free the struct pnfs_p2p_client allocated during REGISTER_DS
    • Free pnfs_p2p_po_stids associated with the DS
    • Set pnfs_p2p_client pointer in nfs4_client to NULL
  • Client
    • Send UNREGISTER_DS as part of nfs4_destroy_server()

PROXY_OPEN

  • Server
    • Introduce a pnfs_p2p_po_stid to track which DS the client was referred to
    • Strip MDS ID from the filehandle
    • Add stateid to list stored in the pnfs_p2p_client for the DS
    • Add stateid to list stored in the nfs4_client for the client
    • Initialize a callback workqueue structure for PROXY_REVOKE
  • Client
    • Check if we have already called PROXY_OPEN for this (filehandle, stateid)
    • Check that we still have a delegation for the file
    • Use MDS identifier from filehandle to find the correct nfs_server structure
    • Use that nfs_server to call nfs4_proc_proxy_open()
      • Pass filehandle and read stateid
      • Use the compound [SEQUENCE, PUTFH, PROXY_OPEN, GETFH] to look up the actual filehandle and get a proxy revoke stateid
    • Store both filehandles, read stateid and revoke stateid in a pnfs_po_state structure
      • Store this in the pnfs_layout_hdr
    • Pass resulting filehandle to nfs_delegation_find_inode() to find inode
    • Use d_find_any_alias() on the inode to find and return a dentry to the server
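
The first client-side step, checking whether PROXY_OPEN was already called for a (filehandle, stateid) pair, could look roughly like the sketch below. It is user-space C with stand-in types: struct fh, struct stateid, struct po_state and po_state_find() are hypothetical names, a plain next pointer replaces the kernel's list_head, and the entry keeps two filehandles because the notes say both are stored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FH_MAXSIZE   128   /* NFSv4 filehandles are at most 128 bytes */
#define STATEID_SIZE 16    /* NFSv4 stateids are 16 bytes */

struct fh      { size_t size; unsigned char data[FH_MAXSIZE]; };
struct stateid { unsigned char data[STATEID_SIZE]; };

/* User-space stand-in for the client-side pnfs_po_state. */
struct po_state {
    struct stateid   read_stateid;
    struct stateid   revoke_stateid;
    struct fh        p2p_fh;    /* filehandle the reading client sent */
    struct fh        real_fh;   /* filehandle returned by GETFH */
    struct po_state *next;
};

static int fh_equal(const struct fh *a, const struct fh *b)
{
    return a->size == b->size && memcmp(a->data, b->data, a->size) == 0;
}

/* Return the cached PROXY_OPEN result for (fh, read stateid), or NULL. */
struct po_state *po_state_find(struct po_state *head, const struct fh *fh,
                               const struct stateid *sid)
{
    assert(fh != NULL && sid != NULL);
    for (; head != NULL; head = head->next) {
        if (fh_equal(&head->p2p_fh, fh) &&
            memcmp(head->read_stateid.data, sid->data, STATEID_SIZE) == 0)
            return head;
    }
    return NULL;
}
```

A NULL result means the DS still has to send the PROXY_OPEN compound; a hit lets it reuse the stored filehandle and revoke stateid.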

CB_PROXY_REVOKE

  • Server
    • Send when the client expires on the server
    • Remove pnfs_p2p_po_stid from lists, but don't free until proxy_revoke_release()
  • Client
    • Use the filehandle and stateid to find associated layout
    • Free the associated pnfs_po_state

LAYOUTGET

  • Server
    • Edit pnfs_lexp_layout_get()
    • Set device id field in the layout to the clientid of the machine acting as the DS
    • If we are not using p2p for the file, instead continue to return 1 as the devid
    • Encode a filehandle with the DS's MDS ID prepended in filelayout_encode_layout()
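
A sketch of the filehandle marking step, in user-space C. The name p2p_mark_fh() and struct fh are illustrative stand-ins (the notes mention a pnfs_p2p_mark_fh function but not its signature). The 8-byte prefix is an assumption that matches both the u64 p2p_mds_id field and the 28-byte versus 36-byte filehandle lengths quoted elsewhere in this document; copying the ID in host byte order is a further assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FH_MAXSIZE 128                /* NFSv4 filehandles are at most 128 bytes */
#define MDS_ID_LEN sizeof(uint64_t)   /* 8-byte prefix: 28 bytes become 36 */

struct fh { size_t size; unsigned char data[FH_MAXSIZE]; };

/*
 * Hypothetical helper: write mds_id followed by the original filehandle
 * into out. Returns 0 on success, -1 if the result would not fit.
 */
int p2p_mark_fh(struct fh *out, const struct fh *in, uint64_t mds_id)
{
    assert(out != NULL && in != NULL && out != in);
    if (in->size + MDS_ID_LEN > FH_MAXSIZE)
        return -1;
    memcpy(out->data, &mds_id, MDS_ID_LEN);            /* host byte order */
    memcpy(out->data + MDS_ID_LEN, in->data, in->size);
    out->size = in->size + MDS_ID_LEN;
    return 0;
}
```

Putting the ID at the front means the receiving side can recover the original filehandle by skipping a fixed-size prefix.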

LAYOUTRETURN

  • Server
    • Add to pnfs_lexp_layout_return()
    • Check nfs4_client for files opened on a DS
      • Send CB_PROXY_REVOKE
    • Also check the pnfs_p2p_client structure for files cached as a DS
      • Free these stateids directly
  • Client
    • Free up pnfs_po_state stored in the pnfs_layout_hdr

GETDEVICEINFO

  • Server
    • If we are given a device id of 1 continue using the non-p2p code
    • Edit pnfsd_lexp_get_device_info() to fill out pnfs_filelayout_devaddr structure with DS information
    • Translate deviceid back to clientid to look up the DS
    • Fill out netid and ip address information using data in the pnfs_p2p_client structure
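
The device lookup above can be sketched as follows. This is a user-space illustration, not the pnfsd code: struct p2p_ds, struct devaddr and p2p_get_device_info() are hypothetical names, the registered DSs live in a flat array, and the device ID is treated as a 64-bit value equal to the DS's clientid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NON_P2P_DEVID 1   /* devid 1 selects the existing non-p2p path */

/* Stand-in for the fields of pnfs_p2p_client used by GETDEVICEINFO. */
struct p2p_ds  { uint64_t clientid; const char *netid; const char *addr; };
struct devaddr { const char *netid; const char *addr; };

/*
 * Returns 0 and fills out for a known p2p DS, 1 if the caller should
 * fall back to the non-p2p code, -1 if no DS matches the device ID.
 */
int p2p_get_device_info(const struct p2p_ds *table, size_t n,
                        uint64_t devid, struct devaddr *out)
{
    assert(out != NULL);
    if (devid == NON_P2P_DEVID)
        return 1;
    for (size_t i = 0; i < n; i++) {
        if (table[i].clientid == devid) {   /* deviceid maps back to clientid */
            out->netid = table[i].netid;
            out->addr  = table[i].addr;
            return 0;
        }
    }
    return -1;
}
```

Because the reserved value is checked first, this scheme relies on no DS ever being assigned a clientid of 1.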

PUTFH

  • Server
    • If this is a p2p filehandle, skip some of the state checks because we won't have a dentry until after calling PROXY_OPEN
    • Check if a filehandle is p2p by looking at the length (p2p: 36 bytes, normal: 28 bytes)
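
A sketch of the length check and of recovering the MDS ID from a p2p filehandle. The two lengths are the ones quoted above; fh_is_p2p() and p2p_split_fh() are hypothetical names, and the sketch assumes the 8 extra bytes are a u64 MDS ID stored at the front in host byte order. Filehandle sizes can vary with the exported filesystem, so a pure length test should be read as a heuristic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NORMAL_FH_LEN 28
#define P2P_FH_LEN    36
#define MDS_ID_LEN    (P2P_FH_LEN - NORMAL_FH_LEN)   /* 8 bytes, one uint64_t */

/* A filehandle is treated as p2p purely by its length. */
int fh_is_p2p(size_t len)
{
    return len == P2P_FH_LEN;
}

/*
 * Split a p2p filehandle into the MDS ID and a pointer to the original
 * filehandle bytes. Returns 0 on success, -1 for a non-p2p filehandle.
 */
int p2p_split_fh(const unsigned char *fh, size_t len,
                 uint64_t *mds_id, const unsigned char **orig_fh)
{
    assert(fh != NULL && mds_id != NULL && orig_fh != NULL);
    if (!fh_is_p2p(len))
        return -1;
    memcpy(mds_id, fh, MDS_ID_LEN);   /* host byte order, as when marking */
    *orig_fh = fh + MDS_ID_LEN;       /* remaining NORMAL_FH_LEN bytes */
    return 0;
}
```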

READ

  • Server
    • Call into the NFS client module to perform PROXY_OPEN and return the associated dentry for p2p filehandles

OPEN

  • Server
    • Introduce a vfs_find_any_mount() to look up any mount structure for a dentry
      • This is a hack, but we don't care which mount structure as long as we get the file data!

Other Notes

  • free_p2p_po_stid()
    • Remove from both lists before either freeing or calling CB_PROXY_REVOKE, to prevent accidental double frees
  • DS expires on server
    • Treat as if the client had called unregister_ds()
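
The unlink-before-free rule for free_p2p_po_stid() can be illustrated with a small user-space sketch. struct list_node and its helpers imitate the kernel's list_head API, and struct po_stid and po_stid_unhash() are hypothetical names; the point is only that unlinking is idempotent, so the caller whose unhash call succeeds is the single owner of the free or the CB_PROXY_REVOKE.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static int  list_empty(const struct list_node *n) { return n->next == n; }

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);   /* leave the node self-linked so a second delete is harmless */
}

/* Stand-in for pnfs_p2p_po_stid: one entry linked on two lists. */
struct po_stid {
    struct list_node ds_list;   /* on the DS's pnfs_p2p_client list */
    struct list_node cl_list;   /* on the reading client's nfs4_client list */
};

/*
 * Unlink from both lists. Returns 1 if this call did the unlinking,
 * 0 if the entry was already off both lists.
 */
int po_stid_unhash(struct po_stid *po)
{
    assert(po != NULL);
    if (list_empty(&po->ds_list) && list_empty(&po->cl_list))
        return 0;
    list_del_init(&po->ds_list);
    list_del_init(&po->cl_list);
    return 1;
}
```

In the kernel this would run under the lock that protects p2p state; the sketch leaves locking out.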

Data Structures

Server

  • p2p client information
struct pnfs_p2p_client {
       struct nfs4_stid p2p_stid;
       u64 p2p_mds_id;
       char *p2p_netid;
       char *p2p_addr;
       struct list_head p2p_ds_files;
};
  • p2p proxy open stateid
struct pnfs_p2p_po_stid {
       struct nfs4_stid  po_stid;
       struct knfsd_fh   po_fh;
       struct list_head  po_ds_list;
       struct list_head  po_cl_list;
       struct nfsd4_callback po_cb;
};

Client

  • NFSv4 Proxy Open
struct pnfs_po_state {
       nfs4_stateid  read_stateid;
       nfs4_stateid  revoke_stateid;
       struct nfs_fh fh;
       struct list_head list;
};

Compatibility

  • Any v4.1 / pNFS-enabled client should already be able to use ad hoc data servers without special p2p extensions.
  • Clients wishing to act as a data server need CONFIG_NFS_P2P enabled
  • Servers wishing to track ad hoc DSs need CONFIG_PNFSD_P2P enabled

Documentation

  • I can write a Documentation/filesystems/nfs/peer_to_peer.txt file to give a brief overview of how p2pNFS is supposed to work and how users can configure it.
  • I can also copy the page to linux-nfs.org for "online documentation"

Feature Interaction Dependencies and Impacts

  • nfsd <-> nfs
    • The machine acting as a pNFS DS needs to be running both the nfs server and the nfs client.
  • Made changes to putfh
    • Check filehandle length since p2p filehandles are longer
    • Call the original version of the function if we are using a normal fh
  • Made changes to nfsd4_read
    • Call original read function if this isn't p2p, call proxy open otherwise to get data from client
  • nfsd_open needs to look up the mount structure without using an exportops structure for NFS
  • filelayout_encode_layout needs to be able to encode p2p filehandles and normal filehandles
  • pnfs_p2p_mark_fh increases the filehandle size; the server needs to know to use the MDS ID for the bigger filehandles
  • nfsd4_proc_compound needs to know if a filehandle is a p2p fh since the dentry will be looked up later for reads

Performance

  • Keep a per-file LRU list of clients that currently have the file cached to avoid redirecting all p2p activity to the same client for that file.

Scalability

The hope is that p2p NFS scales to hundreds or thousands of clients better than plain pNFS does. This can be tested by comparing read times for files of varying sizes with and without p2p enabled. A handful of DSs and a large number of clients should be used to get a feel for how this would work in a data center.

  • An LRU list of clients should help load balance traffic to each DS
    • Make use of the already existing nfs4_file->fi_delegations list, moving a DS's delegation to the end when referring a client to it
  • I take the state lock (global mutex) when accessing file or client state
  • I created a p2p spinlock for accessing p2p state
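
The move-to-end idea for the delegation list can be sketched in user-space C as below. struct list_node imitates the kernel's list_head, and struct deleg and p2p_pick_ds() are hypothetical names. The sketch assumes every entry on the list belongs to a registered DS; the real fi_delegations list could also hold delegations for clients that are not acting as a DS, which would have to be skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Stand-in for a delegation held by a DS on one file. */
struct deleg {
    int              clientid;
    struct list_node fi_link;   /* position on the per-file list */
};

/*
 * Pick the least recently referred DS (the list head) and move it to
 * the tail so the next referral goes to a different DS.
 */
struct deleg *p2p_pick_ds(struct list_node *fi_delegations)
{
    struct list_node *first;

    assert(fi_delegations != NULL);
    first = fi_delegations->next;
    if (first == fi_delegations)
        return NULL;                       /* nobody has the file cached */
    list_del(first);
    list_add_tail(first, fi_delegations);
    return (struct deleg *)((char *)first - offsetof(struct deleg, fi_link));
}
```

With three DSs on the list, successive calls hand out the first, second and third DS and then return to the first.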

Testing

  • Basic proof-of-concept tests
    • 1 client, 1 DS, 1 server
    • Have DS and client rsync files from server
    • Maybe do a `git clone linux-src` instead?
    • Try exporting a /lib partition
  • In-depth testing
    • NFSv4root with varying numbers of clients
      • NFSv4root doesn't work right now due to idmapping issues
    • More rsyncs / git clones with more clients

Open Issues

Item | Date | Name | Issue | Resolution | Date Resolved
1 | 12/11/2012 | Bryan | Client needs to mount server with the public filehandle, otherwise the path walking code will trigger an early UNREGISTER_DS. | [NONE] | [NONE]

Approvals

Approvers

Name | Role | Target Approval Date | Approval Date
Trond Myklebust | NFS Client Maintainer | Date | Date

Reviewers

Name | Role | Target Approval Date | Approval Date
Jeffrey Heller | Bryan's Manager | Date | Date