Client pNFS Requirements
Client pNFS Deliverables
This document enumerates the pNFS functionality targeted for integration into the upstream Linux kernel. The first wave of patches will implement the minimum set of functionality required to support the Files Layout. These items are denoted as Priority A. Subsequent waves of patches will address functionality that builds on top of the minimum required set as well as implement additional Layout Types.
Legend
Note: The labeling still needs to be reviewed by the v4.1 Linux community.
- An (A) indicates the issue needs to be addressed as part of the minimum pNFS functionality patches
- A (B) indicates the issue can be deferred for a subsequent wave of patches
- A (C) indicates the issue can be indefinitely deferred as there is no clear requirement for it
The priority list was initially reviewed during Connectathon 2010.
General
Data Structure Integration
- Review impact to struct nfs_client (A) Batsakis
- Ensure layouts are cleaned up in the right order when the client is destroyed (A)
- Review impact to struct nfs_server (A) Batsakis
- Review impact to struct nfs4_session (A) Batsakis
- Determine if there is a need for the DS to have a struct nfs_server (A) Batsakis
- Ability to tell client not to use pNFS against a server which may support it (A)
- Blacklist the layout module so that the capability is not available (A)
- Disable pNFS per mount (B)
- Define I/O threshold to override attributes and other policy on the client (C)
- Layout drivers should be loaded automatically (using a request_module() call) (A)
- Ability to have multiple layouts loaded
- One layout type per filesystem (A)
- Multiple layouts per filesystem (C-)
- Data should survive data server filehandle invalidation (A)
- Client cache maps DS filehandle to MDS filehandle, and the MDS filehandle to cached data (13.1)
- Lease timeout determination
- EXCHGID4_FLAG_USE_PNFS_DS vs MDS or PNFS (13.1.1) (A)
- Support Direct I/O (B?)
- Consult with the list: is there customer demand for holding off the first integration?
- Dean can volunteer to implement. Shares same RPC calls as buffered I/O - callbacks are slightly different
- Determine when to trigger the layoutget
- Support Buffered I/O (Page based) (A)
- Session Implications
- Support dual DS/MDS Personality (13.1)
- Each personality with its own clientid and session (A)
- Reuse DS clientid/session if we already have one (B)
- Remove PNFS_CONFIG Flag (A)
- Check with Fedora
- As long as there is a way to specifically prevent the use of pNFS
DeviceID Management
- Add, Remove, Locate (A)
- Policy to prune unused device info (B+)
- Umount should clean device table (A)
- XXX Not sure this is correct, since the scope of a deviceID is the clientID/layouttype - not the filesystem
- Careful handling of lease renewals (A)
- DeviceInfo Mappings (A)
- Multipath support for each DS (B)
- How does the MDS represent a DS with IPv4 and IPv6 addresses?
- Revisit when generic support for replicated servers is implemented
- Policy
- What happens if the device is down?
- Give up and I/O through MDS (A)
- Reattempt through DS? (B)
- Revisit when generic support for replicated servers is implemented
- Recalls (See callbacks)
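The add, remove, and locate operations above can be modelled as a small cache. The following is only an illustrative Python sketch, not kernel code: `DeviceCache` and its methods are hypothetical names, and keying entries by (clientid, layout type, deviceid) is an assumption taken from the XXX note above about deviceID scope.

```python
from typing import Dict, Optional, Tuple

# Key: (clientid, layout type, deviceid). Scoping by clientid and layout
# type is an assumption based on the XXX note in the list above.
DeviceKey = Tuple[int, int, bytes]


class DeviceCache:
    """Hypothetical in-memory model of the client's device table."""

    def __init__(self) -> None:
        self._devices: Dict[DeviceKey, dict] = {}

    def add(self, clientid: int, layout_type: int, deviceid: bytes, info: dict) -> None:
        self._devices[(clientid, layout_type, deviceid)] = info

    def locate(self, clientid: int, layout_type: int, deviceid: bytes) -> Optional[dict]:
        return self._devices.get((clientid, layout_type, deviceid))

    def remove(self, clientid: int, layout_type: int, deviceid: bytes) -> bool:
        # Returns True only if an entry was actually dropped.
        return self._devices.pop((clientid, layout_type, deviceid), None) is not None

    def purge_client(self, clientid: int) -> int:
        """Drop every entry for one clientid; returns how many were removed."""
        stale = [key for key in self._devices if key[0] == clientid]
        for key in stale:
            del self._devices[key]
        return len(stale)
```

With this keying, cleanup follows the lifetime of the clientid rather than a single mount, which is one way to address the concern raised about cleaning the table at umount.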
State/connection management
- Discuss with server implementers about need for state renewal daemon on DS (A)
- Is there really a need to keep the lease alive? Can we get away without per-DS renewal?
Layout Management
- Layout Driver (See above)
- Add, Remove, Locate
- Return layouts if they have not been used within certain time to avoid running out of state on server (B)
- Caching beyond CLOSE (B)
- Whole file layouts (A)
- Segment layouts (B?)
- Merge Overlapping Layouts (B)
- Revisit when we study the layout design
- Should allow layouts of differing iomode for the same range (A)
- Stateid/Seqid management
- OLD and BAD stateid error handling in layout operations (A)
- Check current Referring Tuple Handling works with pNFS callbacks (A)
Interaction with Delegations (A)
- Verify proper use of delegation stateid on layoutget
- If no delegation use open stateid
- If mandatory locking then use lock stateid (Priority?)
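The stateid choice listed above could be sketched as follows. This is a minimal Python illustration with hypothetical names; the relative order of the lock and open stateids under mandatory locking is an assumption, since the list leaves its priority open.

```python
from typing import Optional


def choose_layoutget_stateid(
    delegation_stateid: Optional[bytes],
    lock_stateid: Optional[bytes],
    open_stateid: Optional[bytes],
    mandatory_locking: bool = False,
) -> bytes:
    """Pick the stateid to send with the first LAYOUTGET for a file."""
    # A delegation stateid wins whenever the client holds a delegation.
    if delegation_stateid is not None:
        return delegation_stateid
    # Assumption: the lock stateid is preferred only under mandatory locking.
    if mandatory_locking and lock_stateid is not None:
        return lock_stateid
    if open_stateid is not None:
        return open_stateid
    raise ValueError("no stateid available for LAYOUTGET")
```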
Metadata Server Operations
EXCHANGE_ID
- Handle EXCHGID4_FLAG_USE_NON_PNFS/ EXCHGID4_FLAG_USE_PNFS_MDS/ EXCHGID4_FLAG_USE_PNFS_DS combinations (A)
- If the client does not request pNFS but the server advertises it, the client must not use pNFS (A)
- Remember server response to determine:
- If we need to send GETATTR asking for layout type (A)
- To determine if we should specify a layout hint during create (Priority?)
- EXCHGID4_FLAG_BIND_PRINC_STATEID (C)
- Separate nfs_client for MDS/DS dual personality (A)
- Make sure the client owner is different for each
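As a rough illustration of the flag handling above, the sketch below checks a server's EXCHANGE_ID reply and decides whether pNFS may be used. The flag values are those defined for NFSv4.1; the function names are hypothetical, and the validity rule reflects one reading of section 13.1 (the MDS and non-pNFS roles are mutually exclusive, and at least one role must be set).

```python
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000

# Local helper mask covering the three role bits.
PNFS_ROLE_MASK = (
    EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_PNFS_DS
)


def server_roles_valid(eir_flags: int) -> bool:
    """Reject replies with no role bit, or with both MDS and non-pNFS set."""
    roles = eir_flags & PNFS_ROLE_MASK
    if roles == 0:
        return False
    both = EXCHGID4_FLAG_USE_PNFS_MDS | EXCHGID4_FLAG_USE_NON_PNFS
    return (roles & both) != both


def may_use_pnfs(requested_flags: int, eir_flags: int) -> bool:
    """pNFS is used only if the client asked for it and the server is an MDS."""
    if not server_roles_valid(eir_flags):
        return False
    asked = bool(requested_flags & EXCHGID4_FLAG_USE_PNFS_MDS)
    offered = bool(eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)
    return asked and offered
```

The result of `may_use_pnfs` is the kind of remembered server response that could later gate the GETATTR for the layout type.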
GETDEVICEINFO (A)
- Request Device notifications (B)
- NOTIFY_DEVICEID4_CHANGE
- NOTIFY_DEVICEID4_DELETE
- Determine best GETDEVICEINFO_ARGS gdia_maxcount limits (A)
- XDR across page boundaries is problematic today but should be addressed (A?)
- Handle NFS4ERR_TOOSMALL (A)
- Turn off pNFS (A)
- Determine where to invoke it
- Invoke from the state manager (A)
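The NFS4ERR_TOOSMALL handling and the gdia_maxcount limit above might fit together as in the sketch below. It assumes a hypothetical `send` callable that issues GETDEVICEINFO with a given gdia_maxcount and returns the status, the device info, and the gdir_mincount the server reports on NFS4ERR_TOOSMALL; the hard limit is a placeholder for whatever cap is eventually chosen.

```python
from typing import Callable, Optional, Tuple

NFS4_OK = "NFS4_OK"
NFS4ERR_TOOSMALL = "NFS4ERR_TOOSMALL"

# Hypothetical transport hook: send(maxcount) -> (status, info, mincount).
SendFn = Callable[[int], Tuple[str, Optional[bytes], Optional[int]]]


def get_device_info(send: SendFn, initial_maxcount: int, hard_limit: int) -> Optional[bytes]:
    """Return device info, or None if the caller should turn off pNFS."""
    maxcount = initial_maxcount
    while True:
        status, info, mincount = send(maxcount)
        if status == NFS4_OK:
            return info
        # Retry only if the server asks for more than we offered and the
        # requested size stays under our own cap; maxcount grows each pass,
        # so the loop always terminates.
        if (
            status == NFS4ERR_TOOSMALL
            and mincount is not None
            and maxcount < mincount <= hard_limit
        ):
            maxcount = mincount
            continue
        return None
```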
GETDEVICELIST (Opt) (C)
LAYOUTGET (A)
- Determine where to invoke it (A)
- Acquire layout as close to the actual I/O?
- For the files layout, acquiring the layout at open makes sense; is that a good enough reason to support it as well?
- Minimize sprinkling pNFS calls throughout the call path (A)
- Minimize number of layout reference/ dereference (number of layout gets per I/O) (A)
- read, write, mmap, splice_read, splice_write ?
- readpages, writepages error recovery (invoke the state manager?)
- Specify smart minimum and a reasonable size (A)
- nfs_wait_on_sequence to serialize the gets, returns, and recalls (B)
- Support layout range that does not match request (A)
- Forgetful Model (12.5.5.1) (A)
- Makes the layoutreturn/ cb_recall simpler
- Error handling
- I/O through MDS (A)
- Timer to retry layout (B?)
- Mark inode to not request layout until all dirty pages are flushed (B?)
- Handle NFS4ERR_RECALLCONFLICT and NFS4ERR_RETURNCONFLICT (12.5.5.2)
- Handle NFS4ERR_GRACE
- Handle NFS4ERR_LAYOUTTRYLATER
- Handle NFS4ERR_INVAL
- Handle NFS4ERR_TOOSMALL
- Handle NFS4ERR_LAYOUTUNAVAILABLE
- Handle NFS4ERR_UNKNOWN_LAYOUTTYPE
- Handle NFS4ERR_BADIOMODE
- Handle NFS4ERR_LOCKED
- Obey stripe unit size and commit through MDS bits (A)
- FileHandle Determination (13.3)
- DS Filehandle same as MDS (A)
- Same DS Filehandle for every data server (A)
- Not sure if we handle it
- Unique Filehandle for each data server (A)
- Specify intended IO Mode in Layout (A)
- More than one striping pattern: logr_layout array > 1 (B)
- Able to handle different iomode from what was requested (A)
- Handle layouts of length NFS4_UINT64_MAX (various rules) (18.43.3) (A)
- Obey logr_return_on_close (A?) XXX Study XXX
- What if you have multiple opens on the same file?
- What's the implication on the forgetful model (A)
- Layout read(write)-ahead (B)
- Files Layout will request entire file (A)
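Several items above (a returned range that does not match the request, a returned iomode that differs from the one requested, and layouts of length NFS4_UINT64_MAX) come down to checking whether a held layout segment can serve a given I/O. The sketch below is a simplified Python model with hypothetical names; it treats a length of NFS4_UINT64_MAX as "to end of file" and lets a read/write layout satisfy reads, and it ignores the other rules in 18.43.3.

```python
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
LAYOUTIOMODE4_READ = 1
LAYOUTIOMODE4_RW = 2


@dataclass(frozen=True)
class LayoutSegment:
    offset: int
    length: int
    iomode: int

    def end(self) -> int:
        """Exclusive end offset; an all-ones length extends to end of file."""
        if self.length == NFS4_UINT64_MAX:
            return NFS4_UINT64_MAX
        return self.offset + self.length


def segment_covers(seg: LayoutSegment, io_offset: int, io_count: int, want_iomode: int) -> bool:
    """True if the segment can serve the I/O without another LAYOUTGET."""
    # A read-only layout cannot serve writes; a read/write layout serves both.
    if want_iomode == LAYOUTIOMODE4_RW and seg.iomode != LAYOUTIOMODE4_RW:
        return False
    return seg.offset <= io_offset and io_offset + io_count <= seg.end()
```

For the whole-file case targeted first, a single segment with offset 0 and length NFS4_UINT64_MAX covers every request of a matching iomode.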
LAYOUTCOMMIT (A)
- Include last_write_offset, offset, length (A)
- Include mtime (C)
- getattr after LAYOUTCOMMIT to update cached attributes (A)
- Keep LAYOUTCOMMIT data until the return value is received so that the request can be reissued, for example after NFS4ERR_GRACE
XXX What about FILE_SYNC vs DATA_SYNC? Trond had some questions XXX
- Determine where to invoke it (A?)
- Issue layoutcommit in write_inode() and nfs_revalidate_inode()
- Issue layoutcommit before data commits
- Support sub-range layouts (A)
- Do we really know any servers that will do this at this time?
- Belongs in the layout opaque structure? XXX Need to review XXX
- Recover from MDS reboot (A)
- Issue LAYOUTCOMMIT with the reclaim bit set
- Handle NFS4ERR_NO_GRACE
- Handle NFS4ERR_BADLAYOUT
- Check we have a layout and correct I/O mode before issuing layoutcommit (A)
- Fred’s bug of hole in the layout range (B) Subset of layout segments
- Handle NFS4ERR_RECLAIM_BAD (A)
- Attribute caching: loca_time_modify specified - follow with GETATTR
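One way to picture the bookkeeping above (tracking last_write_offset, offset, and length, and keeping the data until the reply arrives) is the sketch below. All names are hypothetical, and deriving offset and length from the span of recorded writes is an assumption made for illustration, not a statement of what the arguments must contain.

```python
from typing import Optional


class PendingLayoutCommit:
    """Hypothetical per-inode record of writes awaiting LAYOUTCOMMIT."""

    def __init__(self) -> None:
        self.first_offset: Optional[int] = None
        self.last_write_offset: Optional[int] = None  # offset of last byte written

    def record_write(self, offset: int, count: int) -> None:
        if count <= 0:
            return
        last = offset + count - 1
        if self.first_offset is None or offset < self.first_offset:
            self.first_offset = offset
        if self.last_write_offset is None or last > self.last_write_offset:
            self.last_write_offset = last

    def needs_commit(self) -> bool:
        return self.last_write_offset is not None

    def build_args(self) -> dict:
        """Snapshot of the arguments for one LAYOUTCOMMIT request."""
        if self.first_offset is None or self.last_write_offset is None:
            raise ValueError("nothing to commit")
        return {
            "offset": self.first_offset,
            "length": self.last_write_offset - self.first_offset + 1,
            "last_write_offset": self.last_write_offset,
        }

    def complete(self, status: str, sent_args: dict) -> None:
        """Forget the record only on success with no newer writes recorded."""
        if status == "NFS4_OK" and self.build_args() == sent_args:
            self.first_offset = None
            self.last_write_offset = None
```

Any non-OK status leaves the record intact, so the same request can be rebuilt and reissued once the server leaves its grace period.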
LAYOUTRETURN (A)
- Forgetful Model (A)
- On CB_LAYOUTRECALL always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) (A)
- On CB_RECALL_ANY return LAYOUTRETURN4_ALL (A)
- Return all subfile ranges on CB_RECALL of entire file layout (12.5.5.1) (C)
- Return full range specified by the layout recall (12.5.5.1) (C)
- Ability to return chunks of layouts for huge files to show progress (C)
- Return entire range layout as final LAYOUTRETURN (C)
- Return NFS4ERR_NOMATCHING_LAYOUT if none is found (C)
- Bulk Return (C)
- LAYOUTRETURN4_FSID
- LAYOUTRETURN4_ALL
- sync with nfs_wait_on_sequence() (C)
- The seqid affinity is associated with the filehandle
- Serialize operations resulting from intersecting CB_LAYOUTRECALLs (18.44.4) (C)
- Forgetful model: always return NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1) (A)
- Serialization later (C)
- Return NFS4ERR_DELAY? (B)
- Error Recovery (A)
- Handle NFS4ERR_OLD_STATEID
- Handle NFS4ERR_BAD_STATEID (C) stateid's seqid()
- Handle NFS4ERR_NO_GRACE
- Handle NFS4ERR_INVAL
I/O through the MDS
- Error fallback on I/O error (A)
- Including NFS4ERR_BAD_STATEID as returned by DS resulting from DS fencing the I/O after a recall of the layout
SECINFO_NO_NAME (Req) (C)
- Required only for the server
OPEN
- LayoutHint attribute (C)
- Need to define a user/programmable interface? (C)
- GETATTR follows OPEN to determine layout type (C)
- Support GUARDED during create (A)
SETATTR
- Changing size may trigger server to recall layout
- No impact on Forgetful client since there is nothing to return
- Same applies to open with truncate
COMMIT
- Compare commit verifier to each of the DS write verifiers (B) XXX Review section 13.7 XXX
- We keep the commit verifier per page
- Keep data until the return value is received so that the request can be reissued in case of error (A)
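The verifier comparison above can be illustrated as follows. This is a minimal sketch assuming the client records, per page, which data server took the unstable WRITE and the write verifier it returned; the function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple


def pages_to_resend(
    page_verifiers: Dict[int, Tuple[str, bytes]],
    commit_verifiers: Dict[str, bytes],
) -> List[int]:
    """Return page indexes whose data must be written again.

    page_verifiers maps page index -> (data server name, write verifier).
    commit_verifiers maps data server name -> verifier returned by COMMIT.
    """
    resend = []
    for page, (ds, write_verf) in sorted(page_verifiers.items()):
        # A mismatch means the server may have lost the unstable data.
        # Assumption: a data server with no COMMIT reply is treated the same.
        if commit_verifiers.get(ds) != write_verf:
            resend.append(page)
    return resend
```

Because the cached pages are kept until the COMMIT reply is checked, the pages returned here can simply be rewritten.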
Callback Service Operations
CB_LAYOUTRECALL (A)
- Forgetful client behavior (A)
- NFS4ERR_NOMATCHING_LAYOUT (12.5.5.1)
- Bulk Recall
- LAYOUTRECALL4_FSID (B)
- LAYOUTRECALL4_ALL (B)
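The forgetful behaviour for a single-file recall could look roughly like the sketch below: the client drops whatever layout it has cached for the file and answers NFS4ERR_NOMATCHING_LAYOUT, so no LAYOUTRETURN is needed. The class and method names are hypothetical, and the sketch deliberately ignores in-flight I/O that may still be using the layout.

```python
from typing import Dict

NFS4ERR_NOMATCHING_LAYOUT = "NFS4ERR_NOMATCHING_LAYOUT"


class ForgetfulClient:
    """Hypothetical model of a client following the forgetful model."""

    def __init__(self) -> None:
        # filehandle -> cached layout (opaque for this sketch)
        self.layouts: Dict[bytes, object] = {}

    def cb_layoutrecall_file(self, filehandle: bytes) -> str:
        # Forget any cached layout, then claim there was nothing to recall.
        self.layouts.pop(filehandle, None)
        return NFS4ERR_NOMATCHING_LAYOUT
```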
CB_RECALL_ANY (Req) (A)
- Client issues LAYOUTRETURN(ALL) due to forgetful client model (A)
CB_RECALLABLE_OBJ_AVAIL (C)
- Set loga_signal_layout_avail on LAYOUTGET to FALSE (A)
CB_NOTIFY_DEVICEID (Opt) (C)
- Indicate no interest in notification (A)
- Detect race with GETDEVICEINFO (B)
- If layouts using deviceID, then issue TEST_STATEID
- If valid layout in use, then issue GETDEVICEINFO
CB_WANTS_CANCELLED (Req) (C)
- Specify no interest if needed (A)
Data Server Operations
EXCHANGE_ID
SECINFO_NO_NAME (C)
I/O
- Review Data distribution algorithm: (which DS, offset, length) (A)
- Sparse (A)
- Dense (C)
- Stash existing code (A)
- WRITE
- Cache all data in range until successful LAYOUTCOMMIT(1st) and COMMIT (2nd) for unstable data (A)
- How is it that the files layout does not need this for proper recovery? (12.7.4, top of page 306)
- READ
- Zero byte & EOF handling on reads with holes handled locally (13.10) (A)
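For the data distribution review (which DS, offset, length), the sketch below shows one reading of the files-layout striping rules for sparse and dense packing. It is a simplified illustration with hypothetical names that assumes a pattern offset of zero; `stripe_indices` stands in for the stripe index list from the device info and `first_stripe_index` for the value carried in the layout.

```python
from typing import List, Tuple


def map_io(
    file_offset: int,
    stripe_unit: int,
    stripe_indices: List[int],
    first_stripe_index: int = 0,
    dense: bool = False,
) -> Tuple[int, int]:
    """Return (data server index, offset to use on that data server)."""
    stripe_count = len(stripe_indices)
    su_number = file_offset // stripe_unit
    ds_index = stripe_indices[(su_number + first_stripe_index) % stripe_count]
    if not dense:
        # Sparse packing: the data server sees the same offset as the MDS file.
        return ds_index, file_offset
    # Dense packing: each data server stores only its own stripe units,
    # packed back to back.
    stripe_no = file_offset // (stripe_unit * stripe_count)
    return ds_index, stripe_no * stripe_unit + file_offset % stripe_unit
```

An I/O that crosses a stripe unit boundary would be split by the caller so that each piece maps to a single data server.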
COMMIT
- Commit through MDS (A)
- Commit through DS (A)
Metadata/ Attribute Handling
- pNFS related attributes
- layout_hint (C)
- layout_type (B)
- mdsthreshold (B)
- fs_layout_type (A)
- layout_alignment (B)
- layout_blksize (B)
Locking
- Mandatory Locking (B)
- Use Lock StateID
- Handle NFS4ERR_LOCKED (B) Check with Windows (Tom Talpey) to see if there’s a server in the future
Error Handling
- Handle I/O errors due to fencing (A)
- Due to Layout Revocation (A)
- NFS4ERR_GRACE handling (A)
- State recovery through the State Manager only (A)
- Recover state and, for example, mark I/O to go through the MDS
- When do we retry against the DS?
- Retry pNFS on remount (A)
- Timer? (B)
- Clear error state once there are no more dirty pages? (B)
- Fail to MDS on first error - keep it simple (A)
- Retry pNFS after X condition/time (B)
Security
- DS ACL related errors? (A)
Multiple Layout Type Support
- Different Layout types for different files (C)
Recovery
- DS Lease Expiration on the Client (12.7.2) (SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED, SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED, SEQ4_STATUS_ADMIN_STATE_REVOKED)
- Write through MDS (A)
- Redo Session/Layout setup, reissue I/O to DSs (B)
Lease Move (11.7.7.1) (Low Priority) (C)
Loss of Layout State on Metadata Server
- Handle fencing error (A)
Metadata Server Restart
- SEQ4_STATUS_RESTART_RECLAIM_NEEDED, NFS4ERR_BADSESSION/ NFS4ERR_STALE_CLIENTID (A?)
- Server out of Grace
- I/O through MDS (A)
- Redo Session/Layout setup, reissue I/O to DSs (B)
- Server in Grace
- LAYOUTCOMMIT in reclaim mode (A)
- Redo Session/Layout setup, reissue I/O to DSs (B)
Data Server Multipathing (13.5)
- Bandwidth Scaling (B)
- Session Trunking (C)
- Higher Availability
- multipath_list4 (B?)
- Replacement DeviceID-to-Device address mapping (B?)
- Replacement DeviceID (B?)