User:Peterhoneyman/sandbox
(→Legend) |
m (purple -> orange) |
(6 intermediate revisions not shown) | |||
Line 1: | Line 1: | ||
- | |||
- | |||
- | |||
- | |||
This document enumerates the pNFS functionality targeted for integration into the upstream Linux kernel. The first wave of patches will implement the minimum set of functionality required to support the Files Layout. These items are denoted as Priority A. Subsequent waves of patches will address functionality that builds on top of the minimum required set as well as implement additional Layout Types. | This document enumerates the pNFS functionality targeted for integration into the upstream Linux kernel. The first wave of patches will implement the minimum set of functionality required to support the Files Layout. These items are denoted as Priority A. Subsequent waves of patches will address functionality that builds on top of the minimum required set as well as implement additional Layout Types. | ||
Line 8: | Line 4: | ||
Note: The labeling still needs to be reviewed by the v4.1 Linux community. | Note: The labeling still needs to be reviewed by the v4.1 Linux community. | ||
* <font color="red">Issues labeled in red need to be addressed as part of the minimum pNFS functionality patches</font> | * <font color="red">Issues labeled in red need to be addressed as part of the minimum pNFS functionality patches</font> | ||
- | * <font color=" | + | * <font color="orange">Issues labeled in orange can be deferred for now</font> |
* <font color="green">Issues labeled in green can be deferred indefinitely</font> | * <font color="green">Issues labeled in green can be deferred indefinitely</font> | ||
The priority list was initially reviewed during Connectathon 2010. | The priority list was initially reviewed during Connectathon 2010. | ||
Line 14: | Line 10: | ||
== General == | == General == | ||
=== Data Structure Integration === | === Data Structure Integration === | ||
- | * <font color="red">Review impact to struct nfs_client</font> | + | * <font color="red">Review impact to struct nfs_client</font> Batsakis |
- | ** <font color="red">Ensure layouts are cleaned-up in the right order when the client is destroyed</font> | + | ** <font color="red">Ensure layouts are cleaned-up in the right order when the client is destroyed</font> |
- | * <font color="red">Review impact to struct nfs_server</font> | + | * <font color="red">Review impact to struct nfs_server</font> Batsakis |
- | * <font color="red">Review impact to struct nfs4_session</font> | + | * <font color="red">Review impact to struct nfs4_session</font> Batsakis |
- | * <font color="red">Determine if there is a need for the DS to have a struct nfs_server</font> | + | * <font color="red">Determine if there is a need for the DS to have a struct nfs_server</font> Batsakis |
- | * <font color="red">Ability to tell client not to use pNFS against a server which may support it</font> | + | * <font color="red">Ability to tell client not to use pNFS against a server which may support it</font> |
- | ** <font color="red">Black list the layout module so that capability is not available | + | ** <font color="red">Black list the layout module so that capability is not available</font> |
- | ** Disable pNFS per mount | + | ** <font color="orange">Disable pNFS per mount</font> |
- | ** Define I/O threshold to override attributes and other policy on the client | + | ** <font color="green">Define I/O threshold to override attributes and other policy on the client</font> |
- | * <font color="red">Layout Drivers should be automatically loaded (Using request module call)</font> | + | * <font color="red">Layout Drivers should be automatically loaded (Using request module call)</font> |
* Ability to have multiple layouts loaded | * Ability to have multiple layouts loaded | ||
- | ** <font color="red">One layout type per filesystem</font> | + | ** <font color="red">One layout type per filesystem</font> |
- | ** Multiple layouts per filesystem | + | ** <font color="green">Multiple layouts per filesystem</font> |
- | * <font color="red">Data should survive data server filehandle invalidation</font> | + | * <font color="red">Data should survive data server filehandle invalidation</font> |
** Client cache maps DS filehandle to MDS filehandle, and the MDS filehandle to cached data (13.1) | ** Client cache maps DS filehandle to MDS filehandle, and the MDS filehandle to cached data (13.1) | ||
* Lease timeout determination | * Lease timeout determination | ||
- | ** <font color="red">EXCHGID4_FLAG_USE_PNFS_DS vs MDS or PNFS (13.1.1)</font> | + | ** <font color="red">EXCHGID4_FLAG_USE_PNFS_DS vs MDS or PNFS (13.1.1)</font> |
- | * Support Direct I/O | + | * <font color="orange">Support Direct I/O</font> |
** Consult with list, is there customer demand for holding off the first integration? | ** Consult with list, is there customer demand for holding off the first integration? | ||
** Dean can volunteer to implement. Shares same RPC calls as buffered I/O - callbacks are slightly different | ** Dean can volunteer to implement. Shares same RPC calls as buffered I/O - callbacks are slightly different | ||
** Determine when to trigger the layoutget | ** Determine when to trigger the layoutget | ||
- | * <font color="red">Support Buffered I/O (Page based)</font> | + | * <font color="red">Support Buffered I/O (Page based)</font> |
* Session Implications | * Session Implications | ||
** Support dual DS/MDS Personality (13.1) | ** Support dual DS/MDS Personality (13.1) | ||
- | *** <font color="red">Each personality with its own clientid and session</font> | + | *** <font color="red">Each personality with its own clientid and session</font> |
- | *** Reuse DS clientid/session if we already have one | + | *** <font color="orange">Reuse DS clientid/session if we already have one</font> |
- | * <font color="red">Remove PNFS_CONFIG Flag</font> | + | * <font color="red">Remove PNFS_CONFIG Flag</font> |
** Check with Fedora | ** Check with Fedora | ||
*** As long as there is a way to specifically prevent the use of pNFS | *** As long as there is a way to specifically prevent the use of pNFS | ||
=== DeviceID Management === | === DeviceID Management === | ||
- | * Add, Remove, Locate | + | * <font color="red">Add, Remove, Locate</font> |
- | ** Policy to prune unused device info ( | + | ** <font color="orange">Policy to prune unused device info (elevate?)</font> |
- | ** Umount should clean device table | + | ** <font color="red">Umount should clean device table</font> |
*** XXX Not sure this is correct, since the scope of a deviceID is the clientID/layouttype - not the filesystem | *** XXX Not sure this is correct, since the scope of a deviceID is the clientID/layouttype - not the filesystem | ||
- | *** Careful handling of lease renewals | + | *** <font color="red">Careful handling of lease renewals</font> |
- | * DeviceInfo Mappings | + | * <font color="red">DeviceInfo Mappings</font> |
- | * Multipath support for each DS | + | * <font color="orange">Multipath support for each DS</font> |
** How does the MDS represent a DS with IPv4 and IPv6 addresses? | ** How does the MDS represent a DS with IPv4 and IPv6 addresses? | ||
** Revisit when generic support for replicated servers is implemented | ** Revisit when generic support for replicated servers is implemented | ||
* Policy | * Policy | ||
** What happens if the device is down? | ** What happens if the device is down? | ||
- | *** Give up and I/O through MDS | + | *** <font color="red">Give up and I/O through MDS</font> |
- | *** Reattempt through DS? | + | *** <font color="orange">Reattempt through DS?</font> |
**** Revisit when generic support for replicated servers is implemented | **** Revisit when generic support for replicated servers is implemented | ||
* Recalls (See callbacks) | * Recalls (See callbacks) | ||
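The add/remove/locate and pruning items above could be modeled roughly as follows. This is a minimal sketch, not the kernel data structure: the `DeviceCache` and `DeviceInfo` names, the `refcount` field, and the prune policy are all assumptions made for illustration. The cache key follows the note above that a deviceID is scoped to the clientID and layout type rather than to the filesystem.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical key: (clientid, layout type, deviceid), per the scoping note above.
DeviceKey = Tuple[int, int, bytes]


@dataclass
class DeviceInfo:
    addresses: List[str]   # data server addresses from GETDEVICEINFO
    refcount: int = 0      # number of layouts currently using this device (assumed field)


class DeviceCache:
    def __init__(self) -> None:
        self._devices: Dict[DeviceKey, DeviceInfo] = {}

    def add(self, key: DeviceKey, info: DeviceInfo) -> DeviceInfo:
        # Keep the existing entry if one is already cached for this key.
        return self._devices.setdefault(key, info)

    def locate(self, key: DeviceKey) -> Optional[DeviceInfo]:
        return self._devices.get(key)

    def remove(self, key: DeviceKey) -> bool:
        # Returns True only if an entry was actually removed.
        return self._devices.pop(key, None) is not None

    def prune_unused(self) -> int:
        # Assumed policy: drop every device that no layout references.
        unused = [k for k, d in self._devices.items() if d.refcount == 0]
        for k in unused:
            del self._devices[k]
        return len(unused)
```

Keying by client and layout type means that an unmount alone would not empty the table, which is the concern raised in the XXX note above.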
=== State/connection management === | === State/connection management === | ||
- | * Discuss with server implementers about need for state renewal daemon on DS | + | * <font color="red">Discuss with server implementers about need for state renewal daemon on DS</font> |
** Is there really a need to keep the lease alive? Can we get away without per-DS renewal? | ** Is there really a need to keep the lease alive? Can we get away without per-DS renewal? | ||
Line 68: | Line 64: | ||
* Layout Driver (See above) | * Layout Driver (See above) | ||
* Add, Remove, Locate | * Add, Remove, Locate | ||
- | ** Return layouts if they have not been used within certain time to avoid running out of state on server | + | ** <font color="orange">Return layouts if they have not been used within certain time to avoid running out of state on server</font> |
- | * Caching beyond CLOSE | + | * <font color="orange">Caching beyond CLOSE</font> |
- | * Whole file layouts | + | * <font color="red">Whole file layouts</font> |
- | * Segment layouts | + | * <font color="orange">Segment layouts</font> |
- | ** Merge Overlapping Layouts | + | ** <font color="orange">Merge Overlapping Layouts</font> |
*** Revisit when we study the layout design | *** Revisit when we study the layout design | ||
- | * Should allow layouts of differing iomode for the same range | + | * <font color="red">Should allow layouts of differing iomode for the same range</font> |
* Stateid/Seqid management | * Stateid/Seqid management | ||
- | ** OLD and BAD stateid error handling in layout operations | + | ** <font color="red">OLD and BAD stateid error handling in layout operations</font> |
- | * Check current Referring Tuple Handling works with pNFS callbacks | + | * <font color="red">Check current Referring Tuple Handling works with pNFS callbacks</font> |
- | === Interaction with Delegations | + | === <font color="red">Interaction with Delegations</font>=== |
* Verify proper use of delegation stateid on layoutget | * Verify proper use of delegation stateid on layoutget | ||
* If no delegation use open stateid | * If no delegation use open stateid | ||
Line 86: | Line 82: | ||
== Metadata Server Operations == | == Metadata Server Operations == | ||
=== EXCHANGE_ID === | === EXCHANGE_ID === | ||
- | * Handle EXCHGID4_FLAG_USE_NON_PNFS/ EXCHGID4_FLAG_USE_PNFS_MDS/ EXCHGID4_FLAG_USE_PNFS_DS combinations | + | * <font color="red">Handle EXCHGID4_FLAG_USE_NON_PNFS/ EXCHGID4_FLAG_USE_PNFS_MDS/ EXCHGID4_FLAG_USE_PNFS_DS combinations</font> |
- | ** If client doesn't specify pNFS and server does, client needs to not do it | + | ** <font color="red">If client doesn't specify pNFS and server does, client needs to not do it</font> |
* Remember server response to determine: | * Remember server response to determine: | ||
- | ** If we need to send GETATTR asking for layout type | + | ** <font color="red">If we need to send GETATTR asking for layout type</font> |
** To determine if we should specify a layout hint during create (Priority?) | ** To determine if we should specify a layout hint during create (Priority?) | ||
- | * EXCHGID4_FLAG_BIND_PRINC_STATEID | + | * <font color="green">EXCHGID4_FLAG_BIND_PRINC_STATEID</font> |
- | * Separate nfs_client for MDS/DS dual personality | + | * <font color="red">Separate nfs_client for MDS/DS dual personality</font> |
** Make sure the client owner is different for each | ** Make sure the client owner is different for each | ||
- | === GETDEVICEINFO | + | === <font color="red">GETDEVICEINFO</font>=== |
- | * Request Device notifications | + | * <font color="orange">Request Device notifications</font> |
** NOTIFY_DEVICEID4_CHANGE | ** NOTIFY_DEVICEID4_CHANGE | ||
** NOTIFY_DEVICEID4_DELETE | ** NOTIFY_DEVICEID4_DELETE | ||
- | * Determine best GETDEVICEINFO_ARGS gdia_maxcount limits | + | * <font color="red">Determine best GETDEVICEINFO_ARGS gdia_maxcount limits</font> |
- | ** XDR across page boundaries is problematic today but should be addressed | + | ** <font color="red">XDR across page boundaries is problematic today but should be addressed</font> |
- | * Handle NFS4ERR_TOOSMALL | + | * <font color="red">Handle NFS4ERR_TOOSMALL</font> |
- | ** Turn off pNFS | + | ** <font color="red">Turn off pNFS</font> |
* Determine where to invoke it | * Determine where to invoke it | ||
- | ** Invoke from the state manager | + | ** <font color="red">Invoke from the state manager</font> |
- | === GETDEVICELIST (Opt) | + | === <font color="green">GETDEVICELIST (Opt)</font>=== |
- | === LAYOUTGET | + | === <font color="red">LAYOUTGET</font>=== |
- | * Determine where to invoke it | + | * <font color="red">Determine where to invoke it</font> |
** Acquire layout as close to the actual I/O? | ** Acquire layout as close to the actual I/O? | ||
** For files layout, acquiring the layout at open makes sense - good enough reason to have it as well? | ** For files layout, acquiring the layout at open makes sense - good enough reason to have it as well? | ||
- | ** Minimize sprinkling pNFS calls throughout the call | + | ** <font color="red">Minimize sprinkling pNFS calls throughout the call</font> |
- | ** Minimize number of layout reference/ dereference (number of layout gets per I/O) | + | ** <font color="red">Minimize number of layout reference/ dereference (number of layout gets per I/O)</font> |
** read, write, mmap, splice_read, splice_write ? | ** read, write, mmap, splice_read, splice_write ? | ||
** readpages, writepages error recovery (invoke the state manager?) | ** readpages, writepages error recovery (invoke the state manager?) | ||
- | ** Specify smart minimum and a reasonable size | + | ** <font color="red">Specify smart minimum and a reasonable size</font> |
- | ** nfs_wait_on_sequence to serialize the gets, returns, and recalls | + | ** <font color="orange">nfs_wait_on_sequence to serialize the gets, returns, and recalls</font> |
- | * Support layout range that does not match request | + | * <font color="red">Support layout range that does not match request</font> |
- | * Forgetful Model (12.5.5.1) | + | * <font color="red">Forgetful Model (12.5.5.1)</font> |
** Makes the layoutreturn/ cb_recall simpler | ** Makes the layoutreturn/ cb_recall simpler | ||
* Error handling | * Error handling | ||
- | ** I/O through MDS | + | ** <font color="red">I/O through MDS</font> |
- | ** Timer to retry layout | + | ** <font color="orange">Timer to retry layout</font> |
- | ** Mark inode to not request layout until all dirty pages are flushed | + | ** <font color="orange">Mark inode to not request layout until all dirty pages are flushed</font> |
* Handle NFS4ERR_RECALLCONFLICT AND NFS4ERR_RETURNCONFLICT (12.5.5.2) | * Handle NFS4ERR_RECALLCONFLICT AND NFS4ERR_RETURNCONFLICT (12.5.5.2) | ||
* Handle NFS4ERR_GRACE | * Handle NFS4ERR_GRACE | ||
Line 134: | Line 130: | ||
* Handle NFS4ERR_BADIOMODE | * Handle NFS4ERR_BADIOMODE | ||
* Handle NFS4ERR_LOCKED | * Handle NFS4ERR_LOCKED | ||
- | * Obey stripe unit size and commit through MDS bits | + | * <font color="red">Obey stripe unit size and commit through MDS bits</font> |
* FileHandle Determination (13.3) | * FileHandle Determination (13.3) | ||
- | ** DS Filehandle same as MDS | + | ** <font color="red">DS Filehandle same as MDS</font> |
- | ** Same DS Filehandle for every data server | + | ** <font color="red">Same DS Filehandle for every data server</font> |
*** Not sure if we handle it | *** Not sure if we handle it | ||
- | ** Unique Filehandle for each data server | + | ** <font color="red">Unique Filehandle for each data server</font> |
- | * Specify intended IO Mode in Layout | + | * <font color="red">Specify intended IO Mode in Layout</font> |
- | * More than one striping pattern: logr_layout array > 1 | + | * <font color="orange">More than one striping pattern: logr_layout array > 1</font> |
- | * Able to handle different iomode from what was requested | + | * <font color="red">Able to handle different iomode from what was requested</font> |
- | * Handle layouts of length NFS4_UINT64_MAX (various rules) (18.43.3) | + | * <font color="red">Handle layouts of length NFS4_UINT64_MAX (various rules) (18.43.3)</font> |
- | * Obey logr_return_on_close | + | * <font color="red">Obey logr_return_on_close</font> XXX Study XXX |
** What if you have multiple opens on the same file? | ** What if you have multiple opens on the same file? | ||
- | ** What's the implication on the forgetful model | + | ** <font color="red">What's the implication on the forgetful model</font> |
- | * Layout read(write)-ahead | + | * <font color="orange">Layout read(write)-ahead</font> |
- | ** Files Layout will request entire file ( | + | ** <font color="red">Files Layout will request entire file</font> |
+ | This makes it impossible (or unfeasible) to extend files in block layout | ||
- | === LAYOUTCOMMIT | + | === <font color="red">LAYOUTCOMMIT</font>=== |
- | * Include last_write_offset, offset, length | + | * <font color="red">Include last_write_offset, offset, length</font> |
- | * Include mtime | + | * <font color="green">Include mtime</font> |
- | ** getattr after LAYOUTCOMMIT to update cached attributes | + | ** <font color="red">getattr after LAYOUTCOMMIT to update cached attributes</font> |
* Keep layoutcommit data until return value is received so that you can reissue request in case of GRACE for example | * Keep layoutcommit data until return value is received so that you can reissue request in case of GRACE for example | ||
XXX What about FILE_SYNC vs DATA_SYNC? Trond had some questions XXX | XXX What about FILE_SYNC vs DATA_SYNC? Trond had some questions XXX | ||
- | * Determine where to invoke it | + | * <font color="red">Determine where to invoke it</font> |
** Issue layoutcommit in write_inode() and nfs_revalidate_inode() | ** Issue layoutcommit in write_inode() and nfs_revalidate_inode() | ||
** Issue layoutcommit before data commits | ** Issue layoutcommit before data commits | ||
- | * Support sub-range layouts | + | * <font color="red">Support sub-range layouts</font> |
** Do we really know any servers that will do this at this time? | ** Do we really know any servers that will do this at this time? | ||
** Belongs in the layout opaque structure? XXX Need to review XXX | ** Belongs in the layout opaque structure? XXX Need to review XXX | ||
- | * Recover from MDS reboot | + | * <font color="red">Recover from MDS reboot</font> |
** Issue LAYOUTCOMMIT with reclaim bit set | ** Issue LAYOUTCOMMIT with reclaim bit set | ||
** Handle NFS4ERR_NO_GRACE | ** Handle NFS4ERR_NO_GRACE | ||
* Handle NFS4ERR_BADLAYOUT | * Handle NFS4ERR_BADLAYOUT | ||
- | ** Check we have a layout and correct I/O mode before issuing layoutcommit | + | ** <font color="red">Check we have a layout and correct I/O mode before issuing layoutcommit</font> |
- | ** | + | ** <font color="orange">Fred's bug of hole in the layout range</font> Subset of layout segments |
- | * Handle NFS4ERR_RECLAIM_BAD | + | * <font color="red">Handle NFS4ERR_RECLAIM_BAD</font> |
* Attribute caching: loca_time_modify specified - follow with GETATTR | * Attribute caching: loca_time_modify specified - follow with GETATTR | ||
- | === LAYOUTRETURN | + | === <font color="red">LAYOUTRETURN</font>=== |
- | * Forgetful Model | + | * <font color="red">Forgetful Model</font> |
- | * On CB_LAYOUTRECALL always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) | + | * <font color="red">On CB_LAYOUTRECALL always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)</font> |
- | * On CB_RECALL_ANY return LAYOUTRETURN4_ALL | + | * <font color="red">On CB_RECALL_ANY return LAYOUTRETURN4_ALL</font> |
- | * Return all subfile ranges on CB_RECALL of entire file layout (12.5.5.1) | + | * <font color="green">Return all subfile ranges on CB_RECALL of entire file layout (12.5.5.1)</font> |
- | * Return full range specified by the layout recall (12.5.5.1) | + | * <font color="green">Return full range specified by the layout recall (12.5.5.1)</font> |
- | * Ability to return chunks of layouts for huge files to show progress | + | * <font color="green">Ability to return chunks of layouts for huge files to show progress</font> |
- | * Return entire range layout as final LAYOUTRETURN | + | * <font color="green">Return entire range layout as final LAYOUTRETURN</font> |
- | * Return NFS4ERR_NOMATCHING_LAYOUT if none is found | + | * <font color="green">Return NFS4ERR_NOMATCHING_LAYOUT if none is found</font> |
- | * Bulk Return | + | * <font color="green">Bulk Return</font> |
** LAYOUTRETURN4_FSID | ** LAYOUTRETURN4_FSID | ||
** LAYOUTRETURN4_ALL | ** LAYOUTRETURN4_ALL | ||
- | ** sync with nfs_wait_on_sequence() | + | ** <font color="green">sync with nfs_wait_on_sequence()</font> |
*** The seqid affinity is associated with the filehandle | *** The seqid affinity is associated with the filehandle | ||
- | * Serialize operations resulting from intersecting CB_LAYOUTRECALLs (18.44.4) | + | * <font color="green">Serialize operations resulting from intersecting CB_LAYOUTRECALLs (18.44.4)</font> |
- | ** Forgetful model always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) | + | ** <font color="red">Forgetful model always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)</font> |
- | ** Serialization later | + | ** <font color="green">Serialization later</font> |
- | ** Return NFS4ERR_DELAY? | + | ** <font color="orange">Return NFS4ERR_DELAY?</font> |
- | * Error Recovery | + | * <font color="red">Error Recovery</font> |
** Handle NFS4ERR_OLD_STATEID | ** Handle NFS4ERR_OLD_STATEID | ||
- | ** Handle NFS4ERR_BAD_STATEID | + | ** <font color="green">Handle NFS4ERR_BAD_STATEID</font> stateid's seqid() |
** Handle NFS4ERR_NO_GRACE | ** Handle NFS4ERR_NO_GRACE | ||
** Handle NFS4ERR_INVAL | ** Handle NFS4ERR_INVAL | ||
=== I/O through the MDS === | === I/O through the MDS === | ||
- | * Error fallback on I/O error | + | * <font color="red">Error fallback on I/O error</font> |
** Including NFS4ERR_BAD_STATEID as returned by DS resulting from DS fencing the I/O after a recall of the layout | ** Including NFS4ERR_BAD_STATEID as returned by DS resulting from DS fencing the I/O after a recall of the layout | ||
- | === SECINFO_NO_NAME (Req) | + | === <font color="green">SECINFO_NO_NAME (Req)</font> === |
* Required only for the server | * Required only for the server | ||
=== OPEN === | === OPEN === | ||
- | * LayoutHint attribute | + | * <font color="green">LayoutHint attribute</font> |
- | ** Need to define a user/programmable interface? | + | ** <font color="green">Need to define a user/programmable interface?</font> |
- | * GETATTR follows OPEN to determine layout type | + | * <font color="green">GETATTR follows OPEN to determine layout type</font> |
- | * Support GUARDED during create | + | * <font color="red">Support GUARDED during create</font> |
=== SETATTR === | === SETATTR === | ||
Line 214: | Line 211: | ||
=== COMMIT === | === COMMIT === | ||
- | * Compare commit verifier to each of the DS write verifiers | + | * <font color="orange">Compare commit verifier to each of the DS write verifiers</font> XXX Review section 13.7 XXX |
* We keep the commit verifier per page | * We keep the commit verifier per page | ||
- | * Keep data until return value is received so that you can reissue request in case of error | + | * <font color="red">Keep data until return value is received so that you can reissue request in case of error</font> |
== Callback Service Operations == | == Callback Service Operations == | ||
- | === CB_LAYOUTRECALL | + | === <font color="red">CB_LAYOUTRECALL</font>=== |
- | * Forgetful client behavior | + | * <font color="red">Forgetful client behavior</font> |
** NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) | ** NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) | ||
* Bulk Recall | * Bulk Recall | ||
- | ** LAYOUTRECALL4_FSID | + | ** <font color="orange">LAYOUTRECALL4_FSID</font> |
- | ** LAYOUTRECALL4_ALL | + | ** <font color="orange">LAYOUTRECALL4_ALL</font> |
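The forgetful client behavior listed above can be illustrated with a small model. This is a sketch under stated assumptions, not the client implementation: the `ForgetfulClient` class and its methods are hypothetical, error codes are represented as strings, and in-flight I/O is ignored entirely (the list raises NFS4ERR_DELAY as an open question for that case). On a file recall the client drops any overlapping layout segments locally and replies NFS4ERR_NOMATCHING_LAYOUT without sending LAYOUTRETURN.

```python
from typing import Dict, List, Tuple

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF          # a length of this value means "to end of file"
NFS4ERR_NOMATCHING_LAYOUT = "NFS4ERR_NOMATCHING_LAYOUT"

Segment = Tuple[int, int]                     # (offset, length)


def _end(seg: Segment) -> float:
    offset, length = seg
    return float("inf") if length == NFS4_UINT64_MAX else offset + length


def overlaps(a: Segment, b: Segment) -> bool:
    # Half-open interval intersection test.
    return a[0] < _end(b) and b[0] < _end(a)


class ForgetfulClient:
    def __init__(self) -> None:
        self.layouts: Dict[bytes, List[Segment]] = {}   # filehandle -> held segments

    def add_layout(self, fh: bytes, offset: int, length: int) -> None:
        self.layouts.setdefault(fh, []).append((offset, length))

    def cb_layoutrecall_file(self, fh: bytes, offset: int, length: int) -> str:
        # Forget every segment that intersects the recalled range.
        recalled = (offset, length)
        kept = [s for s in self.layouts.get(fh, []) if not overlaps(s, recalled)]
        if kept:
            self.layouts[fh] = kept
        else:
            self.layouts.pop(fh, None)
        # Forgetful model: always answer as if no matching layout were held.
        return NFS4ERR_NOMATCHING_LAYOUT
```

Because the reply is the same whether or not a layout was held, the callback path needs no coordination with a LAYOUTRETURN, which is the simplification the list attributes to this model.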
- | === CB_RECALL_ANY (Req) | + | === <font color="red">CB_RECALL_ANY (Req)</font>=== |
- | * Client issues LAYOUTRETURN(ALL) due to forgetful client model | + | * <font color="red">Client issues LAYOUTRETURN(ALL) due to forgetful client model</font> |
- | === CB_RECALLABLE_OBJ_AVAIL | + | === <font color="green">CB_RECALLABLE_OBJ_AVAIL</font>=== |
- | * Set loga_signal_layout_avail on LAYOUTGET to FALSE | + | * <font color="red">Set loga_signal_layout_avail on LAYOUTGET to FALSE</font> |
- | === CB_NOTIFY_DEVICEID (Opt) | + | === <font color="green">CB_NOTIFY_DEVICEID (Opt)</font>=== |
- | * Indicate no interest in notification | + | * <font color="red">Indicate no interest in notification</font> |
- | * Detect race with GETDEVICEINFO | + | * <font color="orange">Detect race with GETDEVICEINFO</font> |
** If layouts using deviceID, then issue TEST_STATEID | ** If layouts using deviceID, then issue TEST_STATEID | ||
*** If valid layout in use, then issue GETDEVICEINFO | *** If valid layout in use, then issue GETDEVICEINFO | ||
- | === CB_WANTS_CANCELLED (Req) | + | === <font color="green">CB_WANTS_CANCELLED (Req)</font>=== |
- | * Specify no interest if needed | + | * <font color="red">Specify no interest if needed</font> |
== Data Server Operations == | == Data Server Operations == | ||
Line 245: | Line 242: | ||
=== EXCHANGE_ID === | === EXCHANGE_ID === | ||
- | === SECINFO_NO_NAME | + | === <font color="green">SECINFO_NO_NAME</font>=== |
=== I/O === | === I/O === | ||
- | * Review Data distribution algorithm: (which DS, offset, length) | + | * <font color="red">Review Data distribution algorithm: (which DS, offset, length)</font> |
- | * Sparse | + | * <font color="red">Sparse</font> |
- | * Dense | + | * <font color="green">Dense</font> |
- | ** Stash existing code | + | ** <font color="red">Stash existing code</font> |
* WRITE | * WRITE | ||
- | ** Cache all data in range until successful LAYOUTCOMMIT(1st) and COMMIT (2nd) for unstable data | + | ** <font color="red">Cache all data in range until successful LAYOUTCOMMIT(1st) and COMMIT (2nd) for unstable data</font> |
*** How is it that files does not need this for proper recovery? (12.7.4, top of page 306) | *** How is it that files does not need this for proper recovery? (12.7.4, top of page 306) | ||
* READ | * READ | ||
- | ** Zero byte & EOF handling on reads with holes handled locally (13.10) | + | ** <font color="red">Zero byte & EOF handling on reads with holes handled locally (13.10)</font> |
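For the data distribution review item above (which DS, offset, length), the following sketch shows one reading of the files layout striping arithmetic for sparse and dense packing. It is illustrative only: the `FilesLayout` class and its field names are hypothetical stand-ins for the layout fields, and the formulas should be checked against the files layout section of the NFSv4.1 specification before being relied on.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FilesLayout:
    stripe_unit: int              # bytes per stripe unit
    stripe_indices: List[int]     # each entry selects a data server
    first_stripe_index: int = 0
    pattern_offset: int = 0       # file offset at which the striping pattern starts
    dense: bool = False           # False: sparse packing, True: dense packing


def map_offset(layout: FilesLayout, file_offset: int) -> Tuple[int, int]:
    """Return (data server index, offset within that data server's file)."""
    rel = file_offset - layout.pattern_offset
    if rel < 0:
        raise ValueError("offset precedes the striping pattern")
    count = len(layout.stripe_indices)
    unit = rel // layout.stripe_unit                        # stripe unit number
    ds = layout.stripe_indices[(unit + layout.first_stripe_index) % count]
    if layout.dense:
        # Dense: each data server stores only its own stripe units, packed together.
        ds_offset = (unit // count) * layout.stripe_unit + rel % layout.stripe_unit
    else:
        # Sparse: the data server file uses the same offsets as the logical file.
        ds_offset = file_offset
    return ds, ds_offset


def split_io(layout: FilesLayout, offset: int, length: int) -> List[Tuple[int, int, int]]:
    """Split a request into (data server, data server offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        rel = offset - layout.pattern_offset
        unit_end = layout.pattern_offset + (rel // layout.stripe_unit + 1) * layout.stripe_unit
        chunk = min(end, unit_end) - offset                 # stay inside one stripe unit
        ds, ds_offset = map_offset(layout, offset)
        pieces.append((ds, ds_offset, chunk))
        offset += chunk
    return pieces
```

With a 4-byte stripe unit over three data servers, an 8-byte request at offset 2 touches three stripe units and is therefore split into three pieces, one per data server.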
=== COMMIT === | === COMMIT === | ||
- | * Commit through MDS | + | * <font color="red">Commit through MDS</font> |
- | * Commit through DS | + | * <font color="red">Commit through DS</font> |
== Metadata/ Attribute Handling == | == Metadata/ Attribute Handling == | ||
* pNFS related attributes | * pNFS related attributes | ||
- | ** layout_hint | + | ** <font color="green">layout_hint</font> |
- | ** layout_type | + | ** <font color="orange">layout_type</font> |
- | ** mdsthreshold | + | ** <font color="orange">mdsthreshold</font> |
- | ** fs_layout_type | + | ** <font color="red">fs_layout_type</font> |
- | ** layout_alignment | + | ** <font color="orange">layout_alignment</font> |
- | ** layout_blksize | + | ** <font color="orange">layout_blksize</font> |
== Locking == | == Locking == | ||
- | * Mandatory Locking | + | * <font color="orange">Mandatory Locking</font> |
** Use Lock StateID | ** Use Lock StateID | ||
- | ** Handle NFS4ERR_LOCKED | + | ** <font color="orange">Handle NFS4ERR_LOCKED</font> Check with Windows (Tom Talpey) to see if there's a server in the future |
== Error Handling == | == Error Handling == | ||
- | * Handle I/O errors due to fencing | + | * <font color="red">Handle I/O errors due to fencing</font> |
- | * Due to Layout Revocation | + | * <font color="red">Due to Layout Revocation</font> |
- | * NFS4ERR_GRACE handling | + | * <font color="red">NFS4ERR_GRACE handling</font> |
- | * State recovery through the State Manager only | + | * <font color="red">State recovery through the State Manager only</font> |
** Recover state and mark as I/O for MDS for example | ** Recover state and mark as I/O for MDS for example | ||
* When do we retry again to the DS | * When do we retry again to the DS | ||
- | ** Retry pNFS on remount | + | ** <font color="red">Retry pNFS on remount</font> |
- | ** Timer? | + | ** <font color="orange">Timer?</font> |
- | ** Clear error state once there are no more dirty pages? | + | ** <font color="orange">Clear error state once there are no more dirty pages?</font> |
- | ** Fail to MDS on first error - keep it simple | + | ** <font color="red">Fail to MDS on first error - keep it simple</font> |
- | ** Retry pNFS after X condition/time | + | ** <font color="orange">Retry pNFS after X condition/time</font> |
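The two red retry items above (fail to the MDS on first error, retry pNFS on remount) amount to a very small per-mount state machine. The sketch below is a simplified model with hypothetical names (`PnfsMount`, `ds_io`, `mds_io`); it treats any `OSError` from the data server path as the triggering error and leaves out the timer and dirty-page conditions that the list marks as deferred.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class PnfsMount:
    def __init__(self) -> None:
        self.use_pnfs = True      # cleared on the first data server error

    def do_io(self, ds_io: Callable[[], T], mds_io: Callable[[], T]) -> T:
        if self.use_pnfs:
            try:
                return ds_io()
            except OSError:
                # Keep it simple: one failure disables pNFS for this mount.
                self.use_pnfs = False
        # Fall back to (or continue) I/O through the metadata server.
        return mds_io()

    def remount(self) -> None:
        # pNFS is only retried after a remount.
        self.use_pnfs = True
```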
== Security == | == Security == | ||
- | * DS ACL related errors? | + | * <font color="red">DS ACL related errors?</font> |
== Multiple Layout Type Support == | == Multiple Layout Type Support == | ||
- | * Different Layout types for different files | + | * <font color="green">Different Layout types for different files</font> |
== Recovery == | == Recovery == | ||
* DS Lease Expiration on the Client (12.7.2) (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED) | * DS Lease Expiration on the Client (12.7.2) (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED) | ||
- | ** Write through MDS | + | ** <font color="red">Write through MDS</font> |
- | ** Redo Session/Layout setup, reissue I/O to DSs | + | ** <font color="orange">Redo Session/Layout setup, reissue I/O to DSs</font> |
- | === Lease Move (11.7.7.1) (Low Priority) | + | === <font color="green">Lease Move (11.7.7.1) (Low Priority)</font>=== |
=== Loss of Layout State on Metadata Server === | === Loss of Layout State on Metadata Server === | ||
- | * Handle fencing error | + | * <font color="red">Handle fencing error</font> |
=== Metadata Server Restart === | === Metadata Server Restart === | ||
- | * SEQ4_STATUS_RESTART_RECLAIM_NEEDED, NFS4ERR_BAD_SESSION/ NFS4ERR_STALE_CLIENTID | + | * <font color="red">SEQ4_STATUS_RESTART_RECLAIM_NEEDED, NFS4ERR_BAD_SESSION/ NFS4ERR_STALE_CLIENTID</font> |
* Server out of Grace | * Server out of Grace | ||
- | ** I/O through MDS | + | ** <font color="red">I/O through MDS</font> |
- | ** Redo Session/Layout setup, reissue I/O to DSs | + | ** <font color="orange">Redo Session/Layout setup, reissue I/O to DSs</font> |
* Server in Grace | * Server in Grace | ||
- | ** LAYOUTCOMMIT in reclaim mode | + | ** <font color="red">LAYOUTCOMMIT in reclaim mode</font> |
- | ** Redo Session/Layout setup, reissue I/O to DSs | + | ** <font color="orange">Redo Session/Layout setup, reissue I/O to DSs</font> |
== Data Server Multipathing (13.5) == | == Data Server Multipathing (13.5) == | ||
- | * Bandwidth Scaling | + | * <font color="orange">Bandwidth Scaling</font> |
- | ** Session Trunking | + | ** <font color="green">Session Trunking</font> |
* Higher Availability | * Higher Availability | ||
- | ** multipath_list4 | + | ** <font color="orange">multipath_list4</font> |
- | ** Replacement DeviceID-to-Device address mapping | + | ** <font color="orange">Replacement DeviceID-to-Device address mapping</font> |
- | * Replacement DeviceID | + | * <font color="orange">Replacement DeviceID</font> |
== IPv6 == | == IPv6 == |
Latest revision as of 17:08, 1 April 2010
This document enumerates the pNFS functionality targeted for integration into the upstream Linux kernel. The first wave of patches will implement the minimum set of functionality required to support the Files Layout. These items are denoted as Priority A. Subsequent waves of patches will address functionality that builds on top of the minimum required set as well as implement additional Layout Types.
Legend
Note: The labeling still needs to be reviewed by the v4.1 Linux community.
- Issues labeled in red need to be addressed as part of the minimum pNFS functionality patches
- Issues labeled in orange can be deferred for now
- Issues labeled in green can be deferred indefinitely
The priority list was initially reviewed during Connectathon 2010.
General
Data Structure Integration
- Review impact to struct nfs_client Batsakis
- Ensure layouts are cleaned-up in the right order when the client is destroyed
- Review impact to struct nfs_server Batsakis
- Review impact to struct nfs4_session Batsakis
- Determine if there is a need for the DS to have a struct nfs_server Batsakis
- Ability to tell client not to use pNFS against a server which may support it
- Blacklist the layout module so that the capability is not available
- Disable pNFS per mount
- Define I/O threshold to override attributes and other policy on the client
- Layout Drivers should be loaded automatically (using a request_module() call)
- Ability to have multiple layouts loaded
- One layout type per filesystem
- Multiple layouts per filesystem
- Data should survive data server filehandle invalidation
- Client cache maps DS filehandle to MDS filehandle, and the MDS filehandle to cached data (13.1)
- Lease timeout determination
- EXCHGID4_FLAG_USE_PNFS_DS vs MDS or PNFS (13.1.1)
- Support Direct I/O
- Consult with list, is there customer demand for holding off the first integration?
- Dean can volunteer to implement. Shares same RPC calls as buffered I/O - callbacks are slightly different
- Determine when to trigger the layoutget
- Support Buffered I/O (Page based)
- Session Implications
- Support dual DS/MDS Personality (13.1)
- Each personality with its own clientid and session
- Reuse DS clientid/session if we already have one
- Remove PNFS_CONFIG Flag
- Check with Fedora
- As long as there is a way to specifically prevent the use of pNFS
DeviceID Management
- Add, Remove, Locate
- Policy to prune unused device info (elevate?)
- Umount should clean device table
- XXX Not sure this is correct, since the scope of a deviceID is the clientID/layouttype - not the filesystem
- Careful handling of lease renewals
- DeviceInfo Mappings
- Multipath support for each DS
- How does the MDS represent a DS with IPv4 and IPv6 addresses?
- Revisit when generic support for replicated servers is implemented
- Policy
- What happens if the device is down?
- Give up and I/O through MDS
- Reattempt through DS?
- Revisit when generic support for replicated servers is implemented
- Recalls (See callbacks)
State/connection management
- Discuss with server implementers about need for state renewal daemon on DS
- Is there really a need to keep the lease alive? Can we get away without a renewal per DS?
Layout Management
- Layout Driver (See above)
- Add, Remove, Locate
- Return layouts if they have not been used within certain time to avoid running out of state on server
- Caching beyond CLOSE
- Whole file layouts
- Segment layouts
- Merge Overlapping Layouts
- Revisit when we study the layout design
- Should allow layouts of differing iomode for the same range
- Stateid/Seqid management
- OLD and BAD stateid error handling in layout operations
- Check current Referring Tuple Handling works with pNFS callbacks
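The segment-merging items above can be illustrated with a small sketch. It is written in Python for brevity rather than as kernel code, and the function name and the (offset, length, iomode) tuple shape are hypothetical. It assumes finite lengths (NFS4_UINT64_MAX handling is ignored) and merges overlapping or adjacent segments only when they share an iomode, so a READ and an RW layout over the same range stay separate.

```python
def merge_segments(segments):
    """Merge overlapping or adjacent layout segments that share an iomode.

    segments: iterable of (offset, length, iomode) tuples (hypothetical shape).
    Returns a sorted list of merged (offset, length, iomode) tuples.
    """
    # Group byte ranges by iomode so that different iomodes never merge.
    by_mode = {}
    for offset, length, iomode in segments:
        by_mode.setdefault(iomode, []).append((offset, offset + length))

    merged = []
    for iomode, ranges in by_mode.items():
        ranges.sort()
        cur_start, cur_end = ranges[0]
        for start, end in ranges[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current segment.
                cur_end = max(cur_end, end)
            else:
                merged.append((cur_start, cur_end - cur_start, iomode))
                cur_start, cur_end = start, end
        merged.append((cur_start, cur_end - cur_start, iomode))
    return sorted(merged)
```

For example, two READ segments covering bytes 0-99 and 50-149 collapse into a single READ segment of length 150, while an RW segment over bytes 0-99 is left untouched.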
Interaction with Delegations
- Verify proper use of delegation stateid on layoutget
- If no delegation use open stateid
- If mandatory locking then use lock stateid (Priority?)
Metadata Server Operations
EXCHANGE_ID
- Handle EXCHGID4_FLAG_USE_NON_PNFS/ EXCHGID4_FLAG_USE_PNFS_MDS/ EXCHGID4_FLAG_USE_PNFS_DS combinations
- If the client doesn't specify pNFS and the server does, the client must not use pNFS
- Remember server response to determine:
- If we need to send GETATTR asking for layout type
- To determine if we should specify a layout hint during create (Priority?)
- EXCHGID4_FLAG_BIND_PRINC_STATEID
- Separate nfs_client for MDS/DS dual personality
- Make sure the client owner is different for each
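The flag handling listed above could look roughly like the following sketch. It is Python for illustration only, not the client implementation. The flag values are given as I recall them from RFC 5661 and should be checked against the specification; the function name, the client_wants_pnfs parameter, and the returned dictionary are hypothetical. The validity rules assumed here are that the server sets at least one role bit and never combines USE_NON_PNFS with USE_PNFS_MDS.

```python
# Role bits from the EXCHANGE_ID reply (values assumed from RFC 5661).
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000
EXCHGID4_FLAG_MASK_PNFS = (EXCHGID4_FLAG_USE_NON_PNFS
                           | EXCHGID4_FLAG_USE_PNFS_MDS
                           | EXCHGID4_FLAG_USE_PNFS_DS)


def server_roles(reply_flags, client_wants_pnfs=True):
    """Decide how to treat a server based on its EXCHANGE_ID reply flags.

    Hypothetical helper: returns {"use_pnfs": bool, "is_ds": bool}.
    """
    roles = reply_flags & EXCHGID4_FLAG_MASK_PNFS
    if roles == 0:
        raise ValueError("server set no pNFS role bit")
    if (roles & EXCHGID4_FLAG_USE_NON_PNFS) and (roles & EXCHGID4_FLAG_USE_PNFS_MDS):
        raise ValueError("USE_NON_PNFS and USE_PNFS_MDS are mutually exclusive")
    return {
        # Only ask for layouts if the server is an MDS and the client opted in.
        "use_pnfs": bool(roles & EXCHGID4_FLAG_USE_PNFS_MDS) and client_wants_pnfs,
        # The same server may also act as a data server (dual personality).
        "is_ds": bool(roles & EXCHGID4_FLAG_USE_PNFS_DS),
    }
```

Remembering the result per server is what lets the client skip the GETATTR for the layout type when use_pnfs is false.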
GETDEVICEINFO
- Request Device notifications
- NOTIFY_DEVICEID4_CHANGE
- NOTIFY_DEVICEID4_DELETE
- Determine best GETDEVICEINFO_ARGS gdia_maxcount limits
- XDR across page boundaries is problematic today but should be addressed
- Handle NFS4ERR_TOOSMALL
- Turn off pNFS
- Determine where to invoke it
- Invoke from the state manager
GETDEVICELIST (Opt)
LAYOUTGET
- Determine where to invoke it
- Acquire layout as close to the actual I/O?
- For the files layout, LAYOUTGET at open makes sense - is that a good enough reason to support it as well?
- Minimize sprinkling pNFS calls throughout the code
- Minimize number of layout reference/ dereference (number of layout gets per I/O)
- read, write, mmap, splice_read, splice_write ?
- readpages, writepages error recovery (invoke the state manager?)
- Specify smart minimum and a reasonable size
- nfs_wait_on_sequence to serialize the gets, returns, and recalls
- Support layout range that does not match request
- Forgetful Model (12.5.5.1)
- Makes the layoutreturn/ cb_recall simpler
- Error handling
- I/O through MDS
- Timer to retry layout
- Mark inode to not request layout until all dirty pages are flushed
- Handle NFS4ERR_RECALLCONFLICT and NFS4ERR_RETURNCONFLICT (12.5.5.2)
- Handle NFS4ERR_GRACE
- Handle NFS4ERR_LAYOUTTRYLATER
- Handle NFS4ERR_INVAL
- Handle NFS4ERR_TOOSMALL
- Handle NFS4ERR_LAYOUTUNAVAILABLE
- Handle NFS4ERR_UNKNOWN_LAYOUTTYPE
- Handle NFS4ERR_BADIOMODE
- Handle NFS4ERR_LOCKED
- Obey stripe unit size and commit through MDS bits
- FileHandle Determination (13.3)
- DS Filehandle same as MDS
- Same DS Filehandle for every data server
- Not sure if we handle it
- Unique Filehandle for each data server
- Specify intended IO Mode in Layout
- More than one striping pattern: logr_layout array > 1
- Able to handle different iomode from what was requested
- Handle layouts of length NFS4_UINT64_MAX (various rules) (18.43.3)
- Obey logr_return_on_close XXX Study XXX
- What if you have multiple opens on the same file?
- What's the implication on the forgetful model
- Layout read(write)-ahead
- Files Layout will request entire file
This makes it impossible (or unfeasible) to extend files in block layout
LAYOUTCOMMIT
- Include last_write_offset, offset, length
- Include mtime
- getattr after LAYOUTCOMMIT to update cached attributes
- Keep layoutcommit data until the return value is received so that the request can be reissued, for example after NFS4ERR_GRACE
XXX What about FILE_SYNC vs DATA_SYNC? Trond had some questions XXX
- Determine where to invoke it
- Issue layoutcommit in write_inode() and nfs_revalidate_inode()
- Issue layoutcommit before data commits
- Support sub-range layouts
- Do we really know any servers that will do this at this time?
- Belongs in the layout opaque structure? XXX Need to review XXX
- Recover from MDS reboot
- Issue LAYOUTCOMMIT with reclaim bit set
- Handle NFS4ERR_NO_GRACE
- Handle NFS4ERR_BADLAYOUT
- Check we have a layout and correct I/O mode before issuing layoutcommit
- Fred's bug: hole in the layout range (subset of layout segments)
- Handle NFS4ERR_RECLAIM_BAD
- Attribute caching: loca_time_modify specified - follow with GETATTR
LAYOUTRETURN
- Forgetful Model
- On CB_LAYOUTRECALL always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)
- On CB_RECALL_ANY return LAYOUTRETURN4_ALL
- Return all subfile ranges on CB_RECALL of entire file layout (12.5.5.1)
- Return full range specified by the layout recall (12.5.5.1)
- Ability to return chunks of layouts for huge files to show progress
- Return entire range layout as final LAYOUTRETURN
- Return NFS4ERR_NOMATCHING_LAYOUT if none is found
- Bulk Return
- LAYOUTRETURN4_FSID
- LAYOUTRETURN4_ALL
- sync with nfs_wait_on_sequence()
- The seqid affinity is associated with the filehandle
- Serialize operations resulting from intersecting CB_LAYOUTRECALLs (18.44.4)
- Forgetful model always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)
- Serialization later
- Return NFS4ERR_DELAY?
- Error Recovery
- Handle NFS4ERR_OLD_STATEID
- Handle NFS4ERR_BAD_STATEID stateid's seqid()
- Handle NFS4ERR_NO_GRACE
- Handle NFS4ERR_INVAL
I/O through the MDS
- Error fallback on I/O error
- Including NFS4ERR_BAD_STATEID as returned by DS resulting from DS fencing the I/O after a recall of the layout
SECINFO_NO_NAME (Req)
- Required only for the server
OPEN
- LayoutHint attribute
- Need to define a user/programmable interface?
- GETATTR follows OPEN to determine layout type
- Support GUARDED during create
SETATTR
- Changing size may trigger server to recall layout
- No impact on Forgetful client since there is nothing to return
- Same applies to open with truncate
COMMIT
- Compare commit verifier to each of the DS write verifiers XXX Review section 13.7 XXX
- We keep the commit verifier per page
- Keep data until the return value is received so that the request can be reissued in case of error
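The verifier comparison described above can be sketched as follows. This is an illustrative Python sketch, not the client code; the function name and data shapes are hypothetical. It assumes each page remembers which DS it was written to and the write verifier that DS returned, and that each COMMIT returns one verifier per DS. Any page whose stored verifier differs from the commit verifier of its DS has to be written again, which is why the data must be kept until the reply arrives.

```python
def pages_to_resend(commit_verifiers, pages):
    """Return indices of pages whose unstable WRITE must be reissued.

    commit_verifiers: dict mapping DS name -> verifier bytes from COMMIT.
    pages: iterable of (page_index, ds_name, write_verifier) tuples.
    """
    resend = []
    for page_index, ds_name, write_verifier in pages:
        # A missing commit verifier means the COMMIT to that DS did not
        # succeed, so the page is treated the same as a mismatch.
        if commit_verifiers.get(ds_name) != write_verifier:
            resend.append(page_index)
    return sorted(resend)
```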
Callback Service Operations
CB_LAYOUTRECALL
- Forgetful client behavior
- NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)
- Bulk Recall
- LAYOUTRECALL4_FSID
- LAYOUTRECALL4_ALL
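The forgetful client behavior listed above can be illustrated with a short sketch. It is Python for illustration only; the class, method names, and data shapes are hypothetical, and the error is represented by its name rather than its numeric value. The sketch drops whatever cached layout segments the recall covers and always answers NFS4ERR_NOMATCHING_LAYOUT. It deliberately ignores in-flight I/O, which a real client would have to drain or fence before forgetting a layout.

```python
from enum import Enum

NFS4ERR_NOMATCHING_LAYOUT = "NFS4ERR_NOMATCHING_LAYOUT"


class RecallType(Enum):
    FILE = "LAYOUTRECALL4_FILE"
    FSID = "LAYOUTRECALL4_FSID"
    ALL = "LAYOUTRECALL4_ALL"


class ForgetfulLayoutCache:
    """Client-side layout cache that forgets layouts on recall."""

    def __init__(self):
        # (fsid, filehandle) -> list of (offset, length) segments
        self._layouts = {}

    def add(self, fsid, fh, offset, length):
        self._layouts.setdefault((fsid, fh), []).append((offset, length))

    def segments(self, fsid, fh):
        return list(self._layouts.get((fsid, fh), []))

    def handle_recall(self, recall_type, fsid=None, fh=None, offset=0, length=None):
        if recall_type is RecallType.ALL:
            self._layouts.clear()
        elif recall_type is RecallType.FSID:
            for key in [k for k in self._layouts if k[0] == fsid]:
                del self._layouts[key]
        else:
            # length=None stands for "to end of file" in this sketch.
            end = float("inf") if length is None else offset + length
            kept = [(o, l) for (o, l) in self._layouts.get((fsid, fh), [])
                    if not (o < end and offset < o + l)]
            if kept:
                self._layouts[(fsid, fh)] = kept
            else:
                self._layouts.pop((fsid, fh), None)
        # Forgetful model: nothing is returned with LAYOUTRETURN.
        return NFS4ERR_NOMATCHING_LAYOUT
```

Because every recall gets the same reply, the bulk recall cases (FSID and ALL) need no extra protocol handling beyond clearing the right part of the cache.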
CB_RECALL_ANY (Req)
- Client issues LAYOUTRETURN(ALL) due to forgetful client model
CB_RECALLABLE_OBJ_AVAIL
- Set loga_signal_layout_avail on LAYOUTGET to FALSE
CB_NOTIFY_DEVICEID (Opt)
- Indicate no interest in notification
- Detect race with GETDEVICEINFO
- If layouts using deviceID, then issue TEST_STATEID
- If valid layout in use, then issue GETDEVICEINFO
CB_WANTS_CANCELLED (Req)
- Specify no interest if needed
Data Server Operations
EXCHANGE_ID
SECINFO_NO_NAME
I/O
- Review Data distribution algorithm: (which DS, offset, length)
- Sparse
- Dense
- Stash existing code
- WRITE
- Cache all data in range until successful LAYOUTCOMMIT(1st) and COMMIT (2nd) for unstable data
- How is it that the files layout does not need this for proper recovery? (12.7.4, top of page 306)
- READ
- Zero byte & EOF handling on reads with holes handled locally (13.10)
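The data distribution question above (which DS, offset, length) can be sketched for the sparse and dense cases. This is an illustrative Python sketch, not kernel code, following the offset calculations as I read them in RFC 5661 section 13.4; the class and function names are hypothetical. The returned index is a position in the stripe index list, which the device info then maps to an actual data server. In the sparse case the DS file offset equals the logical file offset; in the dense case stripe units are packed back to back on each DS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeLayout:
    stripe_unit: int             # bytes per stripe unit
    stripe_count: int            # number of entries in the stripe index list
    first_stripe_index: int = 0
    pattern_offset: int = 0
    dense: bool = False


def map_offset(layout, file_offset):
    """Map a logical file offset to (stripe index, offset in the DS file)."""
    rel = file_offset - layout.pattern_offset
    if rel < 0:
        raise ValueError("offset precedes the striping pattern")
    stripe_unit_number = rel // layout.stripe_unit
    idx = (stripe_unit_number + layout.first_stripe_index) % layout.stripe_count
    if layout.dense:
        stripe_width = layout.stripe_unit * layout.stripe_count
        ds_offset = (rel // stripe_width) * layout.stripe_unit + rel % layout.stripe_unit
    else:
        ds_offset = file_offset
    return idx, ds_offset


def split_io(layout, offset, length):
    """Split one I/O request into per-stripe-unit (idx, ds_offset, length) chunks."""
    chunks = []
    end = offset + length
    while offset < end:
        rel = offset - layout.pattern_offset
        # Bytes left before the next stripe unit boundary.
        room = layout.stripe_unit - rel % layout.stripe_unit
        n = min(room, end - offset)
        idx, ds_offset = map_offset(layout, offset)
        chunks.append((idx, ds_offset, n))
        offset += n
    return chunks
```

With a 4096-byte stripe unit over three stripe indices, a 200-byte request at offset 4000 splits into 96 bytes on the first index and 104 bytes on the second.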
COMMIT
- Commit through MDS
- Commit through DS
Metadata/ Attribute Handling
- pNFS related attributes
- layout_hint
- layout_type
- mdsthreshold
- fs_layout_type
- layout_alignment
- layout_blksize
Locking
- Mandatory Locking
- Use Lock StateID
- Handle NFS4ERR_LOCKED. Check with Windows (Tom Talpey) to see whether a server is planned
Error Handling
- Handle I/O errors due to fencing
- Due to Layout Revocation
- NFS4ERR_GRACE handling
- State recovery through the State Manager only
- Recover state and, for example, mark I/O to go through the MDS
- When do we retry I/O to the DS?
- Retry pNFS on remount
- Timer?
- Clear error state once there are no more dirty pages?
- Fail to MDS on first error - keep it simple
- Retry pNFS after X condition/time
Security
- DS ACL related errors?
Multiple Layout Type Support
- Different Layout types for different files
Recovery
- DS Lease Expiration on the Client (12.7.2) (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED)
- Write through MDS
- Redo Session/Layout setup, reissue I/O to DSs
Lease Move (11.7.7.1) (Low Priority)
Loss of Layout State on Metadata Server
- Handle fencing error
Metadata Server Restart
- SEQ4_STATUS_RESTART_RECLAIM_NEEDED, NFS4ERR_BAD_SESSION/ NFS4ERR_STALE_CLIENTID
- Server out of Grace
- I/O through MDS
- Redo Session/Layout setup, reissue I/O to DSs
- Server in Grace
- LAYOUTCOMMIT in reclaim mode
- Redo Session/Layout setup, reissue I/O to DSs
Data Server Multipathing (13.5)
- Bandwidth Scaling
- Session Trunking
- Higher Availability
- multipath_list4
- Replacement DeviceID-to-Device address mapping
- Replacement DeviceID