= Proposed Device Management Design =

== Rules from RFC 5661 ==
# Device IDs are not guaranteed to be valid across metadata server restarts.
# A device ID is unique per client ID and layout type.
# Device ID to device address mappings are not leased, and can be changed at any time. (Note that while device ID to device address mappings are likely to change after the metadata server restarts, the server is not required to change the mappings.)
# The NFSv4.1 protocol has no optimal way to recall all layouts that refer to a particular device ID.
# It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or GETDEVICELIST. The analysis of the race leverages the fact that the server MUST NOT delete a device ID that is referred to by a layout the client has.
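
A minimal illustration of rule 2: because a device ID is only unique per (client ID, layout type) pair, any client cache that spans layout types has to key entries on the full triple rather than on the device ID alone. The struct below is a hypothetical sketch, not the kernel's actual definition.

<pre>
/* Hypothetical cache key: a deviceid4 is opaque and only unique
 * within a (client ID, layout type) pair, so both must be part
 * of the lookup key in a cache shared across layout types. */
struct pnfs_deviceid_key {
	uint64_t clientid;          /* lease-wide client identifier */
	uint32_t layout_type;       /* e.g. LAYOUT4_NFSV4_1_FILES */
	unsigned char deviceid[16]; /* opaque 128-bit deviceid4 */
};
</pre>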

== Overview of current design ==

The read/write/commit pNFS paths look up a layout matching the request range and call LAYOUTGET if the range is not satisfied by the per-inode layout cache. Upon return, the file layout driver checks the returned layout for validity prior to inserting the layout (segment) into the layout cache. This includes looking up the device ID in the device ID cache.

If the device ID is not found, GETDEVICEINFO is called as a synchronous RPC with a max count of PAGE_SIZE. If the call fails, the LAYOUTGET fails, unless NFS4ERR_TOO_SMALL is returned, in which case a single retry with a max count of up to 6 * PAGE_SIZE is sent.
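
A hedged sketch of that retry rule follows; the function and error names are illustrative stand-ins, not the actual kernel symbols.

<pre>
/* Sketch: one synchronous GETDEVICEINFO, with a single retry at a
 * larger maxcount if the server says the reply buffer was too small. */
static int fetch_deviceinfo(struct nfs_server *server,
			    struct pnfs_deviceid *dev_id)
{
	unsigned int maxcount = PAGE_SIZE;
	int status;

	status = getdeviceinfo_rpc(server, dev_id, maxcount);
	if (status == -NFS4ERR_TOO_SMALL) {
		maxcount = 6 * PAGE_SIZE;	/* single retry, larger buffer */
		status = getdeviceinfo_rpc(server, dev_id, maxcount);
	}
	return status;	/* any remaining error fails the LAYOUTGET */
}
</pre>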

Upon a successful return, the device ID cache is searched again for the device ID. If the device ID is found (e.g., a race with another process for the same device ID), the GETDEVICEINFO result is discarded. Otherwise, the result is added to the device ID cache, and the data server cache is searched for each returned data server. If a data server is found, a reference count is incremented. If a data server is not found, an EXCHANGE_ID and a CREATE_SESSION are sent, and if successful, the data server is inserted into the data server cache.
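
The re-check on insert is the classic double-checked cache insertion. The sketch below (hypothetical names, kernel-style C) shows the shape of it: search again under the write lock, and discard the new result if a racing thread got there first.

<pre>
/* Sketch of double-checked insertion into the device ID cache. */
static struct nfs4_deviceid_node *
deviceid_cache_insert(struct nfs4_deviceid_cache *cache,
		      struct nfs4_deviceid_node *new)
{
	struct nfs4_deviceid_node *found;

	write_lock(&cache->dc_lock);
	found = deviceid_cache_find_locked(cache, &new->de_id);
	if (found) {
		/* lost the race: keep the cached entry, drop ours */
		write_unlock(&cache->dc_lock);
		deviceid_free(new);
		return found;
	}
	hlist_add_head(&new->de_node, &cache->dc_deviceids);
	write_unlock(&cache->dc_lock);
	return new;
}
</pre>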

Only valid layout segments (including resolved device IDs) are added to the layout cache. Only connected data servers (with an established session) are added to the data server cache.

The layout is returned to the (application context) process, which continues on to perform pNFS I/O. This includes identifying the correct data server(s) to perform I/O for a given range; the layout and associated device ID are consulted. This code could also call GETDEVICEINFO if the device ID was not found, a historical remnant of the pre-layout-validation code.

A single rw spinlock protects both the per-mounted-filesystem (in struct nfs_server) file layout specific device ID and data server caches.

== Summary of design changes ==

* Change the scope of the deviceID/data-server cache from per mounted file system to per clientid.
: Allows sharing of device IDs and storage devices.
: Add reference counting to the device ID for each layout that references it.
: Reap the device ID upon last reference (a sketch of such a reference-counted entry follows this item).
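
A hypothetical sketch of the per-clientid, reference-counted entry, with illustrative names: each layout that names the device ID holds one reference, and the entry is reaped when the last reference is dropped.

<pre>
/* Sketch: reference-counted device ID entry in a per-clientid cache. */
struct nfs4_deviceid_node {
	struct hlist_node de_node;	/* linkage in the nfs_client cache */
	struct nfs4_deviceid de_id;
	atomic_t de_ref;		/* one reference per layout */
};

static void deviceid_put(struct nfs4_deviceid_node *d)
{
	if (atomic_dec_and_test(&d->de_ref))
		deviceid_reap(d);	/* unhash and free the entry */
}
</pre>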

* Change from rw spinlocks to RCU (a sketch of the RCU-side lookup follows this item).
: As per the kernel Documentation, which requests no new rwlocks.
: Share the device ID cache with all layout types.
:: Move device ID lookup and update into the generic client so that the RCU code is written once.
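
A sketch of what the RCU conversion could look like, again with illustrative names: lookups run lock-free under rcu_read_lock(), while updates serialize on a spinlock and publish entries with the RCU list primitives.

<pre>
/* Sketch: lock-free device ID lookup under RCU. */
static struct nfs4_deviceid_node *
deviceid_lookup_rcu(struct nfs_client *clp, const struct nfs4_deviceid *id)
{
	struct nfs4_deviceid_node *d, *found = NULL;

	rcu_read_lock();
	hlist_for_each_entry_rcu(d, &clp->cl_deviceids, de_node) {
		if (deviceid_equal(&d->de_id, id) &&
		    atomic_inc_not_zero(&d->de_ref)) {	/* skip dying entries */
			found = d;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}
</pre>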

* Move the data server cache to a stand-alone cache.
:: The data server cache is only updated on a GETDEVICEINFO call or at umount. The I/O paths find the appropriate data server via array index lookups in the deviceid structure (see the sketch below), so there is no need for RCU, rw spinlocks, or an hlist.
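
Illustrative only: once GETDEVICEINFO has filled an array of data servers in the device ID structure, the I/O path can select one with plain arithmetic and no locking. The stripe math below is a plausible example, not the prototype's exact layout code.

<pre>
/* Sketch: lockless data server selection by array index. */
struct nfs4_file_layout_dsaddr {
	uint32_t ds_count;		/* filled at GETDEVICEINFO time */
	struct nfs4_pnfs_ds *ds_list[];
};

static inline struct nfs4_pnfs_ds *
select_ds(struct nfs4_file_layout_dsaddr *dsaddr,
	  uint64_t offset, uint32_t stripe_unit)
{
	uint32_t idx = (offset / stripe_unit) % dsaddr->ds_count;
	return dsaddr->ds_list[idx];
}
</pre>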

* Only call get_device_info from filelayout_check, which performs a device ID cache lookup (read lock) at the end of each LAYOUTGET prior to inserting the layout segment into the layout cache.
: Assumes the layoutget code only caches layouts with resolved device IDs.
: Device IDs are only reaped when the nfs_client expires or all layouts referencing the device ID are returned.

* Only attach to data servers when first required for I/O, not upon the GETDEVICEINFO return (see the sketch below).
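
A hedged sketch of the deferred attach, with hypothetical names: the EXCHANGE_ID/CREATE_SESSION handshake with a data server is postponed until the first I/O that needs it.

<pre>
/* Sketch: connect to a data server lazily, on first I/O. */
static int prepare_ds_for_io(struct nfs4_pnfs_ds *ds)
{
	if (ds->ds_clp)		/* session already established */
		return 0;
	/* first use: EXCHANGE_ID + CREATE_SESSION to the data server */
	return nfs4_ds_connect(ds);
}
</pre>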

* Only cache the first data server in the multipath_list4 array.

* Handle GETDEVICEINFO session-level errors (and perhaps others) via nfs4_handle_exception.
: Some GETDEVICEINFO errors result in failing the LAYOUTGET via filelayout_check.

----

= pNFS =

'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.

== General Information ==

* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]

* [[pNFS Setup Instructions]] - Basic pNFS setup instructions.

* [[GFS2 Setup Notes - cluster3, 2.6.27 kernel]]

* [[Older GFS2 Setup Notes - first pass, in VMWare, and upgrading from cluster2 to cluster3]]

* [[pNFS Block Server Setup Instructions]] - Basic pNFS block server setup instructions.

==== Filing Bugs ====
* [http://bugzilla.linux-nfs.org linux-nfs.org Bugzilla] - Read/write access by "NFSv4.1 related bugs" group members
** Use the keywords "NFSv4.1" and "pNFS".
** The "NFSv4.1 related bugs" group is used to track our bugs. You'll need a user account on [http://bugzilla.linux-nfs.org bugzilla]; after that, send an email to Trond to add you to the group.

== Development Resources ==

* [[pNFS Development Git tree|pNFS Development Git tree]]

* [[pNFS Git tree recipies|pNFS Git tree recipes]]

* [[Wireshark Patches|Wireshark Patches]]

== Current Issues ==
* [[Client_sessions_Implementation_Issues|Client Sessions Implementation Issues]]

* [[Client_pnfs_deliverables|Client pNFS Prioritized Deliverables]]
** [[pNFS Client Review for Kernel Submission]] - Review and redesign of the pNFS client for submission to the kernel.

* [[pNFS Todo List|pNFS Todo List]]

* [[pNFS Implementation Issues|pNFS Implementation Issues]]

* [[Bakeathon 2007 Issues List|Bakeathon 2007 Issues List]]

* [[pNFS Development Road Map]]

* [[pNFS File-based Stateid Distribution]]

== Old Issues ==
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]

* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]

* [[pNFS todo List 2007|pNFS todo List July 2007]]

----

= pNFS Client Review for Kernel Submission =

== pNFS Client Submission Review ==

The current Linux pNFS client is divided into a generic section, which handles the non-layout-specific portions of the pNFS protocol, and three layout driver modules, one each for the file, object, and block layouts.

The Linux pNFS client kernel code is being reviewed for submission to the kernel. The submission will occur in several stages. Each stage will be RFC 5661 compliant, implementing all mandatory features, but with a minimalistic approach.

* The first stage will include the generic pNFS client features needed to support a file layout driver, and the file layout driver implementation.
: [[Proposed Device Management Design]]
* The second stage will include additional generic client features needed to support an object layout driver, and the object layout driver implementation.
* A future stage will include additional generic client features needed to support a block layout driver, and the block layout driver implementation.

----

== pNFS Git Tree Recipes ==

Please use git version 1.5.0.2.

 git clone git://linux-nfs.org/linux-pnfs.git

Then edit .git/config and change the URL to ssh.

We want to continue development on the prototype while keeping the sessions/pnfs split. Here are some basic git recipes for doing that.

Say your git tree looks like this:

 git branch -r
 origin/4.1-sessions
 origin/HEAD
 origin/master

The idea is to create your own parallel branches for 4.1-sessions and master to do your work in, and each day to update from origin/4.1-sessions and origin/master.

Contents:
: 1) working on your own copy of origin/master (pnfs + sessions).
: 2) working on your own copy of origin/4.1-sessions, and merging results into master for testing.
: 3) updating your tree with patches committed by CITI.

1) working on your own copy of origin/master (pnfs + sessions)

Run "git fetch origin" first to make sure origin/master is up to date, then:

 git checkout -b my-master origin/master

To be sure:

 git branch
 * my-master

Then make changes to existing files. If you add a file:

 git add <filename>

If you remove a file:

 git rm <filename>

When done:

 git commit -a

(Note: it will give you a chance to edit the commit message. The first line should be a *short* description of the patch (it will be used as the email subject line); skip a blank line, then write at length with any other comments about the branch.) This commits the changes to your local tree.

To show the last commit (and review the patch):

 git show

Compile, test.

Create a patch for review:

 git format-patch -n origin/4.1-sessions

(This tells it to produce patches for all commits on your current branch ("my-sessions") that aren't in origin/4.1-sessions -- so that's all the commits you've made. Maybe just one in the example above.)

NOTE: SAVE THOSE PATCHES!

Mail to the list:

 git send-email --to pnfs@linux-nfs.org --from <yourself> <filelist from format-patch>

Note: the <filelist from format-patch> is usually 00*.

Compile, test.

Create a patch for review (everything from the previous commit to this latest commit in my-master, diffed against origin/master):

 git format-patch -n origin/master

NOTE: SAVE YOUR PATCHES!

Mail to the list:

 git send-email --to pnfs@linux-nfs.org --from <yourself> <filelist from format-patch>

Note: the <filelist from format-patch> is usually 00*.

2) work on your own copy of origin/4.1-sessions

Run "git fetch origin" to make sure origin/4.1-sessions is up to date, then:

 git checkout -b my-sessions origin/4.1-sessions

To be sure:

 git branch
 * my-sessions

Then make changes to existing files, git add, git commit -a, and write a commit message as in step 1.

Compile, test with NFSv4.1 (no pnfs).

Create patches for review, save them, and mail them to the list as above.

Next, to merge your changes with your local origin/master (e.g. the pnfs branch):

 git checkout -b my-master origin/master
 git merge 4.1-sessions

If you have conflicts due to the merge, it will tell you the file names. The conflicts will show up in the files as arrows. Fix the conflicts, then:

 git commit -a

(Note: git will automatically produce a commit message for you in this case. You can add comments if you want, but usually the message it creates is fine on its own.)

Please, if there were non-trivial conflicts, note the merge changes and send them to the list to help us repeat the merge on the CITI repo:

 git show > <file>

and email <file> to the list. (git format-patch doesn't deal with merge commits.)

The command

 gitk v2.6.18.3.. &

will bring up a nice little browser and show the merge.

----

== pNFS todo List ==
NOTE: This list was last updated 2/28/2007.

Upcoming event: June 11-15, Austin - NFSv4.1

* Implement server session slots and sequence counting
* Implement client session slots and sequence counting
* Implement NFSv4.1 callbacks
* Update to draft-ietf-nfsv4-minorversion1-09
* Separate sessions branch for the pnfs git tree
* Upgrade the pnfs git base to the latest Linus git tree

----

= University of Michigan/CITI NFSv4 ASC alliance =
Status as of October 2006

== Task 1. pNFS Demonstration ==
Demonstration of pNFS with multiple back-end methods (PVFS2 and file), including layout recall. LANL will replicate this demonstration at LANL, working with CITI remotely.

=== Development ===
We updated the Linux pNFS client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19.

We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.

The Linux client separates common NFS code from NFSv2/3/4 code by using version-specific operations. We rewrote the Linux pNFS client to use its own set of version-specific operations. This provides a controlled interface to the pNFS code and eases updating the code to new kernel versions.

Four client layout modules are in development:
* File layout driver (CITI, Network Appliance, and IBM Almaden).
* PVFS2 layout driver (CITI).
* Object layout driver (Panasas).
* Block layout driver (CITI).

To accommodate the requirements of the multiple layout drivers, we expanded the policy interface between the layout driver and the generic pNFS client. This interface allows layout drivers to set the following policies:
* stripe size
* writeback cache gathering policies
* blocksize
* read and write threshold
* timing of layoutget invocation
* whether I/O uses the page cache or the direct method
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.<br />
<br />
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server. <br />
<br />
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.<br />
<br />
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server. The layout driver I/O interface supports direct access, page cache access with NFSv4 readahead and writeback, and the O_DIRECT access method. In addition, PVFS2 now supports the pNFS file-based layout, which lets pNFS clients choose how they access the file system.<br />
<br />
We demonstrated how pNFS can improve the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and the NFSv4 storage protocol for small write requests. To switch between them, we added a write threshold to the layout driver. Write requests smaller than the threshold follow the slower NFSv4 data path. Write requests larger than the threshold follow the faster layout driver data path.<br />
D. Hildebrand, L. Ward, and P. Honeyman, "Large Files, Small Writes, and pNFS," in ''Proc. of the 20th ACM International Conf. on Supercomputing, Cairns, Australia, 2006.<br />
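
A minimal sketch of that dispatch rule, with illustrative names:

<pre>
/* Sketch: route small writes over NFSv4, large writes through the
 * layout driver's direct, parallel path. */
static ssize_t pnfs_file_write(struct inode *inode, const char *buf,
			       size_t count, loff_t pos)
{
	if (count < layout_write_threshold(inode))
		return nfs_write(inode, buf, count, pos);  /* NFSv4 path */
	return layoutdriver_write(inode, buf, count, pos); /* direct path */
}
</pre>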

We improved the performance and scalability of pNFS file-based access with parallel file systems. Our design, named Direct-pNFS, augmented the file-based architecture to enable file-based pNFS clients to bypass intermediate data servers and access heterogeneous data stores directly. Direct access is possible by ensuring file-based layouts match the data layout in the underlying file system and giving pNFS clients the tools to effectively interpret and utilize this information. Experiments with Direct-pNFS demonstrate I/O throughput that equals or outperforms the exported parallel file system across a range of workloads.
: D. Hildebrand and P. Honeyman, "Direct-pNFS: Simple, Transparent, and Versatile Access to Parallel File Systems," ''CITI Technical Report 06-8'', October 2006.

We developed prototype implementations of the pNFS operations:
* OP_GETDEVICELIST
* OP_GETDEVICEINFO
* OP_LAYOUTGET
* OP_LAYOUTCOMMIT
* OP_LAYOUTRETURN
* OP_CB_LAYOUTRECALL

We continue to test the ability of our prototype to send direct I/O data to data servers.
===Milestones===<br />
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS-capable underlying file systems.<br />
<br />
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
<br />
''File Layout''<br />
* Clients: Linux, Solaris<br />
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris, PVFS2<br />
<br />
''Object layout''<br />
* Client: Linux<br />
* Servers: Linux, Panasas<br />
<br />
''Block layout''<br />
* Client: Linux<br />
* Server: EMC<br />
<br />
''PVFS2 layout''<br />
* Client: Linux<br />
* Server: Linux<br />
===Activities===<br />
Our current Linux pNFS implementation uses a single whole-file layout. We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges. <br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the [http://www.ultralight.org Ultralight project] and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE.<br />
The Linux pNFS client is included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.<br />
<br />
We are demonstrating the pNFS file layout at SC06. The demonstration will include multiple 10 GbE NIC clients on the SC06 demonstration floor accessing data across the Ultralight network on an IBM GPFS cluster at CITI.<br />
<br />
==Task 2. Client Migration==<br />
Migration of a client from one mount/metadata server to another is to be demonstrated. This demonstration may be replicated at LANL depending on the success of this work.<br />
===Status===<br />
When a file system moves, the old server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the old server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. <br />
<br />
Server reboot recovery requires servers to save the clientid of active clients in stable storage. The present server implementation does this by writing directly to a file system via the VFS layer. A new server instance reads the state from stable storage, again directly via the VFS. We are rewriting this implementation to use a pipefs upcall/downcall interface instead of using the VFS layer directly, and are expanding the interface to support an upcall/downcall of all of a client's in-memory state. The userland daemon can then support server-to-server state transfer to the corresponding daemon on a new server. We have a prototype of the new upcall/downcall interface, but have yet to prototype the server-to-server state transfer.<br />
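<br />
A rough sketch of what a message on this interface might carry; the layout and all names below are assumptions for illustration, not the actual nfsd pipefs message format.<br />
<pre><br />
#include <stdint.h><br />
<br />
/* Hypothetical kernel/daemon message for saving, reading, or<br />
 * transferring a client's state; purely illustrative. */<br />
struct nfsd_state_msg {<br />
    uint32_t msg_type;       /* store, read, or transfer state */<br />
    uint32_t msg_status;     /* filled in by the downcall reply */<br />
    uint64_t clientid;       /* clientid being saved or recovered */<br />
    uint32_t state_len;      /* length of the opaque state below */<br />
    unsigned char state[];   /* serialized in-memory client state */<br />
};<br />
</pre><br />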
<br />
It remains to inform clients that state established with the old server remains valid on the new server. The IETF NFSv4 working group is considering solutions for the NFSv4.1 protocol, but NFSv4.0 clients will not have support for this feature. We will therefore need to provide Linux-specific implementation support: perhaps a mount option or a /proc flag, or simply trying an old clientid against the new server on migration.<br />
==Task 3. Lock Analysis==<br />
Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).<br />
===Background===<br />
The NFSv4 protocol supports three different lock-like operations: opens, byte-range locks, and delegations.<br />
====Opens====<br />
Unlike previous versions of NFS, NFSv4 has an on-the-wire OPEN operation.<br />
<br />
The OPEN call includes the expected access mode, which may be read, write, or both. It also includes a "deny" mode, which may be read, write, both, or none. The server fails any open whose access mode overlaps the deny mode of an existing open, or whose deny mode overlaps the access mode of an existing open.<br />
<br />
Deny modes are not currently used by UNIX-like clients, our main focus, so we don't study this case.<br />
<br />
However, all clients still perform an OPEN each time an application opens a file: to ensure correct behavior in the presence of Windows clients, to request delegations, and to establish the state necessary to acquire POSIX byte-range locks, among other reasons.<br />
<br />
All versions of NFS also tie data caching to open and close: data is flushed before close, and attributes are revalidated before open, in such a way as to guarantee that the data seen after an open reflects all writes performed by any other client using file descriptors closed before the open.<br />
====Byte-range locks====<br />
POSIX byte-range locks are managed by applications using fcntl(). Each lock request has a byte-range and a type of read or write. Read locks conflict only with write locks, whereas write locks conflict with any other locks. Applications may perform read locks only on files which they have open for read, and write locks only on files which they have open for write.<br />
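<br />
For reference, a minimal example of the fcntl() interface just described: it takes a read lock on the first 100 bytes of a file, then releases it. The file name is arbitrary.<br />
<pre><br />
#include <fcntl.h><br />
#include <stdio.h><br />
#include <unistd.h><br />
<br />
int main(void)<br />
{<br />
    struct flock fl = {<br />
        .l_type   = F_RDLCK,    /* read lock; F_WRLCK for a write lock */<br />
        .l_whence = SEEK_SET,<br />
        .l_start  = 0,          /* offset of the locked range */<br />
        .l_len    = 100,        /* length; 0 would mean "to EOF" */<br />
    };<br />
    int fd = open("testfile", O_RDONLY);<br />
<br />
    if (fd < 0)<br />
        return 1;<br />
    if (fcntl(fd, F_SETLKW, &fl) < 0)   /* F_SETLKW waits on conflict */<br />
        perror("lock");<br />
    fl.l_type = F_UNLCK;<br />
    if (fcntl(fd, F_SETLK, &fl) < 0)<br />
        perror("unlock");<br />
    close(fd);<br />
    return 0;<br />
}<br />
</pre><br />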
<br />
Byte-range locks are normally advisory; that is, they do not conflict with I/O operations. Mandatory locking, which does conflict with I/O, is supported by many UNIX-like operating systems but appears to be rarely used.<br />
<br />
The NLM sideband protocol provides byte-range locks for versions of NFS earlier than NFSv4. NFSv4 incorporates byte-range locking into the main protocol. This makes it possible to support mandatory byte-range locking, but the Linux implementation does not support mandatory byte-range locking over NFSv4, and no support is planned at this time.<br />
<br />
As with opens, byte-range locks also affect data caching: an unlock is not allowed to succeed until modified data in the locked range is written to the server, and a lock must revalidate file data. Thus writes performed under a lock that has since been released will be visible to any reader that locks the region after the unlock.<br />
====Delegations====<br />
A server may optionally return a "delegation" with the response to any open call. Delegations may be of type read or write. Servers must guarantee that no client ever holds a read delegation on a file that another client has open for write or holds a write delegation for. Similarly, no client may hold a write delegation on a file that another client has open for read.<br />
<br />
A server is never required to give out a delegation. Also, it may ask for the delegation back at any time, at which point the client is required to do what is necessary to establish on the server any opens or locks which it has performed locally before returning the delegation. Once returned, the client cannot regain the delegation without performing another open.<br />
<br />
An NFS client is not normally synchronously notified of changes performed by another client, but as long as a client holds a delegation, the above rules guarantee that it will be. In theory, applications might take advantage of this increased cache consistency, but it is of little use in practice: a server is never required to give out a delegation, and can ask for one back at any time.<br />
<br />
Thus clients do not expose the existence of delegations to applications the way they do opens and locks. Instead, clients use delegations to improve performance: delegations allow clients to perform open and lock calls locally. In the case of a read delegation, read opens and read locks may be performed without contacting the server; in the case of a write delegation, any opens and locks may be performed without contacting the server. This also relieves the client of the responsibility to flush dirty data and revalidate data caches.<br />
<br />
When a server recalls a delegation, the client is required to perform opens, locks, and writes to the server as necessary to inform the server of any state that the client has established only locally. Conflicting opens will be delayed until this process is completed.<br />
===Lock performance test harness===<br />
In the performance measurements that follow, we used a single client and server. We also ran the experiments on the client hardware with the local file system for comparison.<br />
<br />
'''Client'''<br />
<br />
* IBM/Lenovo Thinkpad T43<br />
* 2GHz Pentium M CPU<br />
* 512 MB RAM<br />
* 1000bT NIC<br />
* 5400 RPM Ultra-ATA 80GB HD<br />
* running 2.6.17-CITI<br />
<br />
'''Server'''<br />
* 1GHz Athlon 64 3000+ CPU<br />
* 512 MB RAM<br />
* 1000bT NIC<br />
* 7200 RPM SATA-II 80GB HD<br />
===File lock measurements===<br />
Lacking examples of real-world lock-intensive workloads, we have performed a few microbenchmarks to measure such things as the cost of acquiring a single lock with and without a delegation.<br />
<br />
To measure the performance of whole-file locking, we use a benchmark that opens N files, then obtains a lock on each file; a sketch follows the list below. We measure the elapsed time of the loop that obtains the locks. We ran the microbenchmark on three configurations:<br />
* Local (reiserFS) file system<br />
* NFS without delegations<br />
* NFS with delegations<br />
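<br />
The loop we time looks roughly like the sketch below; the file names and the fixed N are assumptions of the sketch, not the exact harness we ran.<br />
<pre><br />
#include <fcntl.h><br />
#include <stdio.h><br />
#include <sys/time.h><br />
#include <unistd.h><br />
<br />
#define NFILES 100                      /* N, the number of files */<br />
<br />
int main(void)<br />
{<br />
    static int fds[NFILES];<br />
    struct timeval start, end;<br />
    char name[64];<br />
    int i;<br />
<br />
    for (i = 0; i < NFILES; i++) {      /* open N files up front */<br />
        snprintf(name, sizeof(name), "f%04d", i);<br />
        fds[i] = open(name, O_RDWR | O_CREAT, 0644);<br />
        if (fds[i] < 0)<br />
            return 1;<br />
    }<br />
<br />
    gettimeofday(&start, NULL);         /* time only the lock loop */<br />
    for (i = 0; i < NFILES; i++) {<br />
        struct flock fl = {<br />
            .l_type = F_WRLCK, .l_whence = SEEK_SET,<br />
            .l_start = 0, .l_len = 0,   /* 0 length = whole file */<br />
        };<br />
        if (fcntl(fds[i], F_SETLKW, &fl) < 0)<br />
            perror("fcntl");<br />
    }<br />
    gettimeofday(&end, NULL);<br />
<br />
    printf("%.1f usec per lock\n",<br />
           ((end.tv_sec - start.tv_sec) * 1e6 +<br />
            (end.tv_usec - start.tv_usec)) / NFILES);<br />
    return 0;<br />
}<br />
</pre><br />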
<br />
To test with no delegations, we disabled file leasing, which disables delegations as a side effect.<br />
<br />
For most cases, we ran the test 10 times and averaged the results. Variance was not negligible, so we show standard deviations.<br />
<br />
With delegations enabled and 1,000 files, we average the result of two runs, not 10, because the server limited the number of delegations to something less than 3,000.<br />
====File lock measurement results====<br />
All times are in &mu;sec.<br />
<br />
'''Local file system'''<br />
<pre><br />
--- 1 lock, 10 runs ---<br />
24 : mean time per lock to lock 1 file<br />
27 : median<br />
0.8 : std dev<br />
<br />
--- 10 locks, 10 runs ---<br />
7.2 : mean time per lock to lock 10 files<br />
6.9 : median<br />
0.7 : std dev<br />
<br />
--- 100 locks, 10 runs ---<br />
6.5 : mean time per lock to lock 100 files<br />
6.4 : median<br />
0.09: std dev<br />
<br />
--- 1000 locks, 10 runs ---<br />
8.3 : mean time per lock to lock 1,000 files<br />
6.6 : median<br />
2.1 : std dev<br />
</pre><br />
<br />
'''NFS no read delegation'''<br />
<pre><br />
--- 1 lock, 10 runs ---<br />
511 : mean time per lock to lock 1 file<br />
303 : median<br />
534 : std dev<br />
<br />
--- 10 locks, 10 runs ---<br />
283 : mean time per lock to lock 10 files<br />
267 : median<br />
34.2 : std dev<br />
<br />
--- 100 locks, 10 runs ---<br />
269 : mean time per lock to lock 100 files<br />
266 : median<br />
11.1 : std dev<br />
<br />
--- 1000 locks, 10 runs ---<br />
305 : mean time per lock to lock 1,000 files<br />
296 : median<br />
34.4 : std dev<br />
</pre><br />
<br />
'''NFS w/ read delegation'''<br />
<pre><br />
--- 1 lock, 10 runs ---<br />
2.8 : mean time per lock to lock 1 file<br />
3.0 : median<br />
6 : std dev<br />
<br />
--- 10 locks, 10 runs ---<br />
9.1 : mean time per lock to lock 10 files<br />
9.1 : median<br />
0.0 : std dev<br />
<br />
--- 100 locks, 10 runs ---<br />
7.6 : mean time per lock to lock 100 files<br />
7.2 : median<br />
1.1 : std dev<br />
<br />
--- 1,000 locks, 2 runs ---<br />
8.0 : mean time per lock to lock 1,000 files<br />
8.0 : median<br />
0.049: std dev<br />
</pre><br />
====File lock performance discussion====<br />
The pattern that emerges suggests that delegations improve the performance of whole-file locking. The cost per lock in reiserFS and in NFS with delegations is six to eight &mu;sec. Without delegations, the cost is greater by two orders of magnitude.<br />
<br />
Further investigation is under way, examining the reduction in server load and in the number of RPCs when delegations are enabled.<br />
===Byte-range lock measurements===<br />
To measure the cost of byte-range locks, we focused on the cost of splitting and joining locks. (All locks discussed in this section are POSIX byte-range locks.) <br />
<br />
For lock splitting, we created a 30 MB file, locked the entire file, then unlocked non-contiguous ranges. Each unlock operation splits the initial lock.<br />
<br />
To measure the cost of lock joining, we ran a complementary test: we locked non-contiguous regions of the file, then locked the entire range. The ranges are non-contiguous to avoid coalescing locks as we proceed.<br />
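<br />
A sketch of the splitting test under assumed segment geometry: unlocking only odd-numbered segments keeps each unlocked range interior to a locked run, so every unlock splits a lock rather than trimming one. Harness details are assumptions.<br />
<pre><br />
#include <fcntl.h><br />
#include <sys/types.h><br />
#include <unistd.h><br />
<br />
static int set_lock(int fd, short type, off_t start, off_t len)<br />
{<br />
    struct flock fl = {<br />
        .l_type = type, .l_whence = SEEK_SET,<br />
        .l_start = start, .l_len = len,<br />
    };<br />
    return fcntl(fd, F_SETLKW, &fl);<br />
}<br />
<br />
static void split_test(int fd, long nsegs, long seglen)<br />
{<br />
    long i;<br />
<br />
    set_lock(fd, F_WRLCK, 0, 0);        /* 0 length = whole file */<br />
    for (i = 1; i < nsegs; i += 2)      /* interior, non-adjacent */<br />
        set_lock(fd, F_UNLCK, i * seglen, seglen);<br />
}<br />
</pre><br />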
<br />
We measured performance for five segment sizes:<br />
<br />
* 3 segments, each 10,000,000 bytes<br />
* 30 segments, each 1,000,000 bytes<br />
* 300 segments, each 100,000 bytes<br />
* 3,000 segments, each 10,000 bytes<br />
* 29,971 segments, each 1,000 bytes<br />
<br />
As before, we ran each test 10 times and averaged the results. Variance was negligible. Between runs, we unmounted and remounted the server to ensure nothing was cached across runs.<br />
<br />
We ran the join and split tests on the local file system, on NFS with no delegations, and again on NFS with delegations. After each test with delegations, we verified that the delegation was still in place by opening the file for writing at the end of the run and observing the delegation recall and ensuing DELEGRETURN.<br />
====Byte-range lock split results====<br />
'''Local file system'''<br />
<pre><br />
0.000016 secs - lock whole file<br />
0.000021 secs - unlock 3 10000000-byte regions (split)<br />
<br />
0.000016 secs - lock whole file<br />
0.000190 secs - unlock 30 1000000-byte regions (split)<br />
<br />
0.000015 secs - lock whole file<br />
0.002985 secs - unlock 300 100000-byte regions (split)<br />
<br />
0.000015 secs - lock whole file<br />
0.177999 secs - unlock 3000 10000-byte regions (split)<br />
<br />
0.000016 secs - lock whole file<br />
23.569079 secs - unlock 29971 1000-byte regions (split)<br />
</pre><br />
<br />
'''NFS, no delegations'''<br />
<pre><br />
0.000276 secs - lock whole file<br />
0.000704 secs - unlock 3 10000000-byte regions (split) <br />
<br />
0.000325 secs - lock whole file<br />
0.007067 secs - unlock 30 1000000-byte regions (split)<br />
<br />
0.000276 secs - lock whole file<br />
0.073822 secs - unlock 300 100000-byte regions (split)<br />
<br />
0.000271 secs - lock whole file<br />
1.099715 secs - unlock 3000 10000-byte regions (split)<br />
<br />
0.000289 secs - lock whole file<br />
74.407294 secs - unlock 29971 1000-byte regions (split)<br />
</pre><br />
<br />
'''NFS with delegations'''<br />
<pre><br />
0.000016 secs - lock whole file<br />
0.000026 secs - unlock 3 10000000-byte regions (split)<br />
<br />
0.000016 secs - lock whole file<br />
0.000248 secs - unlock 30 1000000-byte regions (split)<br />
<br />
0.000017 secs - lock whole file<br />
0.004350 secs - unlock 300 100000-byte regions (split)<br />
<br />
0.000016 secs - lock whole file<br />
0.225387 secs - unlock 3000 10000-byte regions (split)<br />
<br />
0.000017 secs - lock whole file<br />
22.961603 secs - unlock 29971 1000-byte regions (split)<br />
</pre><br />
====Byte-range lock join results====<br />
'''Local file system'''<br />
<pre><br />
0.000028 secs - lock 3 10000000-byte regions<br />
0.000010 secs - lock whole file (join)<br />
<br />
0.000212 secs - lock 30 1000000-byte regions<br />
0.000031 secs - lock whole file (join)<br />
<br />
0.004222 secs - lock 300 100000-byte regions<br />
0.000407 secs - lock whole file (join)<br />
<br />
0.369237 secs - lock 3000 10000-byte regions<br />
0.004469 secs - lock whole file (join)<br />
<br />
43.966929 secs - lock 29971 1000-byte regions<br />
0.030219 secs - lock whole file (join)<br />
</pre><br />
<br />
'''NFS, no delegations'''<br />
<pre><br />
0.000750 secs - lock 3 10000000-byte regions<br />
0.000246 secs - lock whole file (join)<br />
<br />
0.007616 secs - lock 30 1000000-byte regions<br />
0.000307 secs - lock whole file (join)<br />
<br />
0.081856 secs - lock 300 100000-byte regions<br />
0.001215 secs - lock whole file (join)<br />
<br />
1.548707 secs - lock 3000 10000-byte regions<br />
0.011581 secs - lock whole file (join)<br />
<br />
133.975178 secs - lock 29971 1000-byte regions<br />
0.120294 secs - lock whole file (join)<br />
</pre><br />
<br />
'''NFS with delegations'''<br />
<pre><br />
0.000032 secs - lock 3 10000000-byte regions<br />
0.000012 secs - lock whole file (join)<br />
<br />
0.000284 secs - lock 30 1000000-byte regions<br />
0.000046 secs - lock whole file (join)<br />
<br />
0.006794 secs - lock 300 100000-byte regions<br />
0.000558 secs - lock whole file (join)<br />
<br />
0.347239 secs - lock 3000 10000-byte regions<br />
0.002566 secs - lock whole file (join)<br />
<br />
42.846999 secs - lock 29971 1000-byte regions<br />
0.029043 secs - lock whole file (join)<br />
</pre><br />
====Discussion====<br />
Delegations work as advertised: they make the cost of lock splitting and joining approximately the same as the cost in the local file system.<br />
<br />
Future testing should examine the cost of splits and joins over multiple clients, and the cost of random ordering of lock requests.<br />
<br />
Although the lock join test requires non-contiguous ranges to avoid coalescing, the lock split test does not, and should be re-run with contiguous ranges.<br />
===Delegation recall with byte-range locks===<br />
Earlier, we saw the performance advantage for a client holding delegations when acquiring locks. Now we examine the performance penalty to the server when recalling a delegation in the face of numerous client locks.<br />
<br />
To test the performance of delegation recall, the client opens a 30 MB file, acquires a number of byte-range locks, then idles. The server then opens the file for writing, which induces a delegation recall callback. The client must establish its locally held locks on the server and then relinquish the delegation. We measure the elapsed time for the server to process the open call.<br />
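<br />
On the server side, the measurement reduces to timing the conflicting open, roughly as sketched below; the exported path name is an example only.<br />
<pre><br />
#include <fcntl.h><br />
#include <stdio.h><br />
#include <time.h><br />
#include <unistd.h><br />
<br />
int main(void)<br />
{<br />
    struct timespec t0, t1;<br />
    int fd;<br />
<br />
    clock_gettime(CLOCK_MONOTONIC, &t0);<br />
    fd = open("/export/testfile", O_WRONLY);  /* triggers the recall */<br />
    clock_gettime(CLOCK_MONOTONIC, &t1);<br />
    if (fd >= 0)<br />
        close(fd);<br />
<br />
    printf("open took %.6f sec\n",<br />
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);<br />
    return 0;<br />
}<br />
</pre><br />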
<br />
We vary the number of locks, testing performance with 0, 1, 2, 3, 4, 5, 10, 25, 50, 100, 250, 500, 1,000, 2,500, 5,000, 10,000, 15,000, 25,000, and 50,000 locks.<br />
<br />
The following table shows the total open time measured on the server and the time normalized by the number of locks recalled.<br />
<br />
Total times are measured in seconds. Normalized times are shown in msec.<br />
<br />
<table border="1" cellpadding="5"><br />
<tr><td align="center">''n''</td><td align="center">''total (sec)''</td><td align="center">''per lock (msec)''</td></tr><br />
<tr><td align="right">Local</td><td align="right">0.000051</td><td align="right">0.051</td></tr><br />
<tr><td align="right">0</td><td align="right">0.000051</td><td align="right">0.051</td></tr><br />
<tr><td align="right">1</td><td align="right">0.001521</td><td align="right">1.52</td></tr><br />
<tr><td align="right">2</td><td align="right">0.001726</td><td align="right">0.86</td></tr><br />
<tr><td align="right">3</td><td align="right">0.002064</td><td align="right">0.69</td></tr><br />
<tr><td align="right">4</td><td align="right">0.002235</td><td align="right">0.56</td></tr><br />
<tr><td align="right">5</td><td align="right">0.002482</td><td align="right">0.50</td></tr><br />
<tr><td align="right">10</td><td align="right">0.003648</td><td align="right">0.36</td></tr><br />
<tr><td align="right">25</td><td align="right">0.007320</td><td align="right">0.29</td></tr><br />
<tr><td align="right">50</td><td align="right">0.013309</td><td align="right">0.27</td></tr><br />
<tr><td align="right">100</td><td align="right">0.025317</td><td align="right">0.25</td></tr><br />
<tr><td align="right">250</td><td align="right">0.063221</td><td align="right">0.25</td></tr><br />
<tr><td align="right">500</td><td align="right">0.128633</td><td align="right">0.26</td></tr><br />
<tr><td align="right">1000</td><td align="right">0.295346</td><td align="right">0.30</td></tr><br />
<tr><td align="right">2500</td><td align="right">0.842576</td><td align="right">0.34</td></tr><br />
<tr><td align="right">5000</td><td align="right">2.358167</td><td align="right">0.47</td></tr><br />
<tr><td align="right">10000</td><td align="right">7.409892</td><td align="right">0.74</td></tr><br />
<tr><td align="right">15000</td><td align="right">14.412268</td><td align="right">0.96</td></tr><br />
<tr><td align="right">25000</td><td align="right">36.535290</td><td align="right">1.5</td></tr><br />
<tr><td align="right">50000</td><td align="right">90.007199</td><td align="right">1.8</td></tr><br />
</table><br />
<br />
We are investigating the unexpected nonlinear behavior in the cost per lock.<br />
==Task 4. Directory Delegations==<br />
Analysis of directory delegations – how well does it work and when, when does it totally not work.<br />
===Background===<br />
Directory delegations promise to extend the usefulness of dentry caching in two ways. First, the client is no longer forced to revalidate the dentry cache after a timeout. Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics. <br />
Directory delegations allow the client to cache negative results. <br />
<br />
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails. But a subsequent open of the same file might succeed if the file is created in the interim. Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result cannot be cached. Consequently, subsequent opens of the same non-existent file also require OPEN RPC calls to be sent to the server. This example plays out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.<br />
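<br />
A small illustration of this pattern: on NFS, each failed probe below costs an OPEN RPC unless the negative result can be cached. The directory list and tool name are examples only.<br />
<pre><br />
#include <fcntl.h><br />
#include <stdio.h><br />
#include <unistd.h><br />
<br />
int main(void)<br />
{<br />
    const char *dirs[] = { "/usr/local/bin", "/usr/bin", "/bin" };<br />
    char path[256];<br />
    size_t i;<br />
<br />
    for (i = 0; i < sizeof(dirs) / sizeof(dirs[0]); i++) {<br />
        snprintf(path, sizeof(path), "%s/mytool", dirs[i]);<br />
        int fd = open(path, O_RDONLY);  /* an OPEN RPC on NFS */<br />
        if (fd >= 0) {<br />
            printf("found %s\n", path);<br />
            close(fd);<br />
            return 0;<br />
        }<br />
    }<br />
    return 1;                           /* every probe failed */<br />
}<br />
</pre><br />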
<br />
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.<br />
===Status===<br />
We implemented directory delegations in the Linux NFSv4 client and server. <br />
<br />
Our server implementation follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a '''/proc''' interface on the server to enable or disable directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
<br />
A [[CITI_Experience_with_Directory_Delegations|comprehensive report]] on CITI experience with directory delegations is under preparation. We will send that report when it is complete.<br />
<br />
==Task 5. NFS Server Load==<br />
How do you specify/measure NFS Server load?<br />
===Status===<br />
To frame the task, consider identical symmetric servers with a cluster file system back end and a task running on one of them.<br />
Can we compare the load on the servers to determine whether there would be a benefit to migrating a client from one to the other? <br />
<br />
Answering this question requires that we define a model of load based on measurable quantities.<br />
<br />
Given a model, the next step is to write a tool that collects the factors that influence load, and to measure how accurately the model predicts performance.<br />
===Goals===<br />
If an application is running at less than peak performance, the load model should tell us whether the bottleneck is in the server, the client, or elsewhere.<br />
<br />
If the bottleneck is in the server, one option for improving application performance is replacing server components with faster ones.<br />
Another option is to add servers. A third option is to migrate the application to a lightly-loaded server.<br />
===Factors that influence server load===<br />
The hardware characteristics that influence server capacity include disk bandwidth, CPU speed, interrupt rate, memory availability, and network bandwidth.<br />
====Disks====<br />
The rate at which a single file in a server file system can be read from or written to depends on many factors, including characteristics of the disk hardware (rotation speed, access latency, etc.), the disk controller, the bus, the layout of the file on disk, the size of the transfer, and the degree of caching. The overall bandwidth of a file system also depends on the degree of striping and distribution of requests across disks. <br />
<br />
The ''iostat'' command can reveal a bottleneck due to server disks if seek or transfer rates approach maximum values. For a given server configuration, these values can be measured directly. It might be possible to predict these values for a given hardware ensemble.<br />
<br />
====CPU====<br />
Server threads compete with one another and with the operating system for access to the CPU. Excess offered load can exhaust the availability of server threads.<br />
<br />
* how would we know if this were to happen?<br />
* would it suffice to simply allocate more threads?<br />
* or are there pathological cases to consider?<br />
<br />
Overall CPU utilization can be measured, also with ''iostat'', but there may be other factors influencing the allocation of CPU to server threads. For example, excessive pressure on the memory or interrupt subsystem can force the operating system to intervene.<br />
<br />
Note that this conflates thread usage and CPU usage. The two are not the same, since nfsd threads can block on I/O: a sudden storm of COMMIT requests, for example (which require data to be synced to disk), could tie up every nfsd thread while leaving the CPU idle.<br />
<br />
====Interrupts====<br />
Interrupt rates can be measured with the ''vmstat'' command.<br />
<br />
For a given hardware configuration, a threshold can be measured experimentally.<br />
<br />
====Memory====<br />
The memory subsystem is complex and varies among operating systems. Applications compete with one another for virtual memory. Often, they also compete with the file system, which uses the virtual memory subsystem for its in-memory cache.<br />
<br />
Often, excess demand for memory is reflected by early eviction of pages in virtual memory. The ''vmstat'' command shows the pageout rate, which does not measure early eviction, but does reflect overall memory pressure.<br />
====Network====<br />
Network utilization is the ratio of delivered bandwidth to maximum available bandwidth. Maximum available bandwidth is a property of network hardware. Delivered bandwidth can be measured with the ''netstat'' command.<br />
<br />
Full-duplex network technologies can deliver the maximum bandwidth in both directions simultaneously, while half-duplex technologies are limited to the maximum bandwidth summed over both directions.<br />
<br />
===Measuring load===<br />
The overall performance of a server can be tested by measuring NFS performance directly with microbenchmarks. Candidate microbenchmarks include NULL RPC, small READ RPC, large READ RPC, small WRITE RPC, and large WRITE RPC. Both the latency of a single operation and the maximum rate at which repeated operations can be performed are of interest.<br />
<br />
Macrobenchmarks, such as an application that performs specific tasks, might also be useful. Many popular benchmarks unpack and build an application such as SSH or the Linux kernel.<br />
<br />
The usefulness of a measured value can be tested by comparing microbenchmark performance as the resource is consumed. For example:<br />
<br />
* We might vary the size of a RAID array to reduce the overall disk bandwidth.<br />
* We can run a process that consumes varying amounts of CPU time.<br />
* We can vary the number of threads in the server pool.<br />
* We can vary the number of connections that can be served by a single thread.<br />
* We can run a pair of processes that use varying amounts of network bandwidth.<br />
<br />
Each measured value can be expressed as a ratio between 0 (idle) and 1 (at capacity). For each value, there is a program that consumes the corresponding resource. We can then compare the measured server performance as the amount of the resource is varied.<br />
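<br />
As one example of such a program, the sketch below busy-loops for a chosen fraction of each 100 ms period, consuming roughly that share of one CPU; analogous generators can be written for disk, memory, and network bandwidth.<br />
<pre><br />
#include <time.h><br />
#include <unistd.h><br />
<br />
/* Consume roughly `util' (0.0 to 1.0) of one CPU by spinning for a<br />
 * fraction of each 100 ms period and sleeping for the remainder. */<br />
static void burn_cpu(double util)<br />
{<br />
    const long period_us = 100000;      /* 100 ms duty cycle */<br />
    long busy_us = (long)(util * period_us);<br />
<br />
    for (;;) {<br />
        struct timespec start, now;<br />
<br />
        clock_gettime(CLOCK_MONOTONIC, &start);<br />
        do {<br />
            clock_gettime(CLOCK_MONOTONIC, &now);<br />
        } while ((now.tv_sec - start.tv_sec) * 1000000L +<br />
                 (now.tv_nsec - start.tv_nsec) / 1000L < busy_us);<br />
        usleep(period_us - busy_us);<br />
    }<br />
}<br />
</pre><br />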
<br />
It is also useful to sample the instantaneous values, and to track them over time with a damping function that shows the averages over the last second, minute, five minutes, etc.<br />
====Special situations====<br />
Some special situations might be used to measure server performance:<br />
<br />
* Reboot recovery: measure elapsed time when a number of clients are simultaneously recovering.<br />
* Mount storms: measure the elapsed time when a large number of clients simultaneously attempt to mount a server. This might arise when a cluster job triggers an automount from all clients at once.<br />
===Possible benchmark sources===<br />
Some macrobenchmarks are already in common use.<br />
* '''Postmark''' is a popular macrobenchmark that simulates the operation of a mail server with a mix of reads, writes, creates, and unlinks. Postmark does not exercise locks.<br />
* '''Filebench''' also lacks locks. We haven't figured out exactly what the various loads do and are trying to find the relevant developer community.<br />
* '''IOzone''' can use locks.<br />
* [http://www.llnl.gov/icc/lc/siop/downloads/download.html LLNL] and other labs have some metadata stress tests, e.g., '''IOR''' and '''mdtest'''. IOR places enormous stress on a server, and scales to thousands of clients.<br />
* [http://www.llnl.gov/asci/purple/benchmarks/limited/ ASCI Purple Benchmark Codes]<br />
* [http://www.cs.dartmouth.edu/pario/examples.html Resources at Dartmouth]<br />
* Bull.net uses '''Bonnie++''', '''FStress''', and '''dbench''' in its NFS load testing. dbench simulates the file system activity created by a Samba server running the proprietary SMB benchmark '''netbench'''.</div>
<hr />
<div>=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of October 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS2 and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development===<br />
We updated the Linux '''pNFS''' client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19. <br />
<br />
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.<br />
<br />
The linux client separates common NFS code from NFSv2/3/4 code by using version specific operations. We rewrote the Linux pNFS client to use its own set of version specfic operations. This provides a controlled interface to the pNFS code, and eases updating the code to new kernel versions.<br />
<br />
Four client layout modules are in development. <br />
* File layout driver (CITI, Network Appliance, and IBM Almaden).<br />
* PVFS2 layout driver (CITI).<br />
* Object layout driver (Panasas).<br />
* Block layout driver (CITI).<br />
<br />
To accommodate the requirements of the multiple layout drivers, we expanded the layout operation policy interfaces between the layout driver and generic pNFS client.<br />
<br />
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.<br />
<br />
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server. <br />
<br />
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.<br />
<br />
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server. The layout driver I/O interface supports direct access, page cache access with NFSv4 readahead and writeback, and the O_DIRECT access method. In addition, PVFS2 now supports the pNFS file-based layout, which lets pNFS clients choose how they access the file system.<br />
<br />
We demonstrated how pNFS can improve the overall write performance of parallel file systems by using direct, parallel I/O for large write requests and the NFSv4 storage protocol for small write requests. To switch between them, we added a write threshold to the layout driver. Write requests smaller than the threshold follow the slower NFSv4 data path. Write requests larger than the threshold follow the faster layout driver data path.<br />
D. Hildebrand, L. Ward, and P. Honeyman, "Large Files, Small Writes, and pNFS," in ''Proc. of the 20th ACM International Conf. on Supercomputing, Cairns, Australia, 2006.<br />
<br />
We improved the performance and scalability of pNFS file-based access with parallel file systems. Our design, named Direct-pNFS, augmented the file-based architecture to enable file-based pNFS clients to bypass intermediate data servers and access heterogeneous data stores directly. Direct access is possible by ensuring file-based layouts match the data layout in the underlying file system and giving pNFS clients the tools to effectively interpret and utilize this information. Experiments with Direct-pNFS demonstrate I/O throughput that equals or outperforms the exported parallel file system across a range of workloads.<br />
D. Hildebrand and P. Honeyman, "Direct-pNFS: Simple, Transparent, and Versatile Access to Parallel File Systems," ''CITI Technical Report 06-8'', October 2006.<br />
<br />
We developed prototype implementations of pNFS operations:<br />
* OP_GETDEVICELIST,<br />
* OP_GETDEVICEINFO, <br />
* OP_LAYOUTGET,<br />
* OP_LAYOUTCOMMIT,<br />
* OP_LAYOUTRETURN and<br />
* OP_CB_LAYOUTRECALL<br />
<br />
We continue to test the ability of our prototype to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS capable underlying file systems. <br />
<br />
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
<br />
''File Layout''<br />
* Clients: Linux, Solaris<br />
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris, PVFS2<br />
<br />
''Object layout''<br />
* Client: Linux<br />
* Servers: Linux, Panasas<br />
<br />
''Block layout''<br />
* Client: Linux<br />
* Server: EMC<br />
<br />
''PVFS2 layout''<br />
* Client: Linux<br />
* Server: Linux<br />
<br />
===Activities===<br />
Our current Linux pNFS implementation uses a single whole file layout. We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges. <br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the [http://www.ultralight.org Ultralight project] and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE.<br />
The Linux pNFS client included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.<br />
<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
<br />
When a file system moves, the old server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the old server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. <br />
<br />
Server reboot recovery requires servers to save the clientid of active clients in stable storage. The present server implementation does this by writing directly to a filesystem via the vfs layer. A new server instance reads the state from stable storage, again directly via the vfs. We are rewriting this implementation to use a pipefs upcall/downcall interface instead of directly using the vfs layer, and are expanding the interface to support an upcall/downcall of all a clients in-memory state. The userland daemon can then support server-to-server state transfer to the cooresponding daemon on a new server. We have a prototype of the new upcall/down call interface, and have yet to prototype the server-to-server state transfer.<br />
<br />
It remains to inform clients that state established with the old server remains valid on the new server. The IETF NFSv4 working group is considering solutions for the NFSv4.1 protocol, but NFSv4.0 clients will not have support for this feature. We will therefore need to provide Linux specific implementation support - perhaps a mount option or a /proc flag, or simply to try to use an old clientid against a new server on migration.<br />
<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Background===<br />
'''Directory delegations''' promise to extend the usefulness of dentry caching in two ways. First, the client is no longer forced to revalidate the dentry cache after a timeout. Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics. <br />
Directory delegations allow the client to cache negative results. <br />
<br />
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails. But a subsequent open of the same file might succeed, if the file is created in the interim. Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result can not be cached. Consequently, subsequent opens of the same non-existent file also require OPEN RPC calls being sent to the server. This example is played out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.<br />
<br />
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.<br />
===Status===<br />
We implemented directory delegations in the Linux NFSv4 client and server. <br />
<br />
Our server implementation follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a '''/proc''' interface on the server to enable or disable directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation-counts when directory delegations is enabled or disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
<br />
To frame the task, consider identical symmetric servers with a cluster file system back end and a task running on one of them.<br />
Can we compare the load on the servers to determine whether there would be a benefit to migrating a client from one to the other? <br />
<br />
Answering this question requires that we define a model of load based on measurable quanta.<br />
<br />
Given a model, the next step is to write a tool that collects the factors that influence load and to measure how well the model accurately predicts performance.<br />
<br />
===Goals===<br />
If an application is running at less than peak performance, the load model should tell us whether the bottleneck is in the server, the client, or elsewhere.<br />
<br />
If the bottleneck is in the server, one option for improving application performance is replacing server components with faster ones.<br />
Another option is to add servers. A third option is to migrate the application to a lightly-loaded server.<br />
<br />
* Actually, the second option is fruitless without the third.<br />
<br />
===Factors that influence server load===<br />
<br />
====Disks====<br />
The rate at which a single file in a server file system can be depends on many factors, including characteristics of the disk hardware (rotation speed, access latency, etc.), the disk controller, the bus, the layout of files on the disk, the size of the transfer, and the degree of caching. The overall bandwidth of a file system also depends on the degree of striping and distribution of requests across disks. <br />
<br />
The ''iostat'' command can reveal a bottleneck due to server disks if seek or transfer rates approach maximum values. For a given server configuration, these values can be measured directly. It might be possible to predict these values for a given hardware ensemble.<br />
<br />
====CPU====<br />
Server threads compete with one another and with the operating system for access to the CPU. Excess offered load can exhaust the availability of server threads.<br />
<br />
* how would we know if this were to happen?<br />
* would it suffice to simply allocate more threads?<br />
* or are there pathological cases to consider?<br />
<br />
Overall CPU utilization can be measured, also with ''iostat'', but there may be other factors influencing the allocation of CPU to server threads. For example, excessive pressure on the memory or interrupt subsystem can force the operating system to intervene.<br />
<br />
====Interrupts====<br />
<br />
Interrupt rates can be measured with <br />
<br />
* i forget :-(<br />
<br />
For a given hardware configuration, a threshold can be measured experimentally.<br />
<br />
====Memory====<br />
The memory subsystem is complex and varies among operating systems. Applications compete with one another for virtual memory. Often, they also compete with the file system, which uses the virtual memory subsystem for its in-memory cache.<br />
<br />
Often, excess demand for memory is reflected by early eviction of pages in virtual memory. The ''vmstat'' command shows the pageout rate, which does not measure early eviction, but does reflect overall memory pressure.<br />
<br />
====Network====<br />
Network utilization is the ratio of delivered bandwidth to maximum available bandwidth. Maximum available bandwidth is a property of network hardware. Delivered bandwidth can be measured with the ''netstat'' command.<br />
<br />
Full-duplex network technologies can deliver maximal bandwidth in both directions, while half-duplex network technologies are limited to delivering the sum of the two directions.<br />
<br />
* i believe that ius a true statement ...<br />
<br />
===Measuring load===<br />
Each measured value can be expressed as a ratio between 0 (idle) and 1 (at capacity). For each value, there is a program that consumes the corresponding resource.<br />
<br />
The overall performance of a server can be tested by measuring NFS performance directly with microbenchmarks. Candidate microbenchmarks include NULL RPC, and small READ RPC, large READ RPC, small WRITE RPC, and large WRITE RPC.<br />
<br />
The usefulness of a measured value can be tested by comparing microbenchmark performance as the resource is consumed.<br />
<br />
It is useful to sample the instantaneous values, and to track them over time with a damping function that shows the averages over the last second, minute, five minutes, etc.<br />
<br />
====How do we check usefulness of this information?====<br />
boot with reduced resources somehow, see if increasing resources increases performance as predicted?<br />
<br />
====Disk bandwidth====<br />
vary size of raid arrays, bandwidth of disk interfaces? <br />
<br />
Or run another process that soaks up some percentage of bandwidth??<br />
<br />
====CPU load====<br />
CPU throttling??<br />
<br />
Just try different totally random machines? Vary workload? How do we get a light vs. heavy workload?<br />
<br />
How do we measure performance of each? Increasing clients until we see performance degredation due to server bottlenecks would be obvious thing to do....)<br />
<br />
===Measures of load===<br />
what do we use to determine if our measure of load is correct?<br />
<br />
* single rpc latency measured from a client?<br />
* time to complete some other task, measured from a single client (not actually involved in loading the server)?<br />
* rpc's per second?<br />
<br />
===Configuration parameters on server that can be varied===<br />
* number of server threads<br />
* number of connections per server thread<br />
* request queue lengths (# of bytes waiting in tcp socket)<br />
<br />
===Some special situations that can be problems (from Chuck)===<br />
* reboot recovery: everyone is recovering at once.<br />
* mount storms: a lab full of clients may all mount at once, or a cluster job may trigger automount from all clients at once.<br />
<br />
===Possible benchmark sources, for this and locking scalability===<br />
====postmark====<br />
looks pretty primitive: mixture of reads, writes, creates, unlinks. No locks.<br />
====filebench====<br />
also no locking. Haven't figured out exactly what the various loads do. Is there actually an active developer community?<br />
====See Bull.net's list?====<br />
* Bonnie++<br />
* FStress<br />
* dbench: simulates filesystem activity created by a samba server running the proprietary SMB benchmark "netbench". Maybe not so useful.<br />
* Do-it-ourselves modify postmark or filebench? set up a mailserver (e.g.), send it fake mail. get traces from working servers</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-18T14:32:59Z<p>Andros: /* Milestones */</p>
<hr />
<div>=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of October 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development===<br />
We updated the Linux '''pNFS''' client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19. <br />
<br />
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.<br />
<br />
The linux client separates common NFS code from NFSv2/3/4 code by using version specific operations. We rewrote the Linux pNFS client to use its own set of version specfic operations. This provides a controlled interface to the pNFS code, and eases updating the code to new kernel versions.<br />
<br />
Four client layout modules are in development. <br />
* File layout driver (CITI, Network Appliance, and IBM Almaden).<br />
* PVFS2 layout driver (CITI).<br />
* Object layout driver (Panasas).<br />
* Block layout driver (CITI under contract with EMC).<br />
<br />
To accommodate the requirements of the multiple layout drivers, we expanded the layout operation policy interfaces between the layout driver and generic pNFS client.<br />
<br />
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.<br />
<br />
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server. <br />
<br />
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.<br />
<br />
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server. The layout driver I/O interface supports direct access, page cache access through with NFSv4 readahead and writeback, and the O_DIRECT access method. In addition, PVFS2 now supports the pNFS file-based layout, which lets pNFS clients choose how they access the file system.<br />
<br />
We developed prototype implementations of pNFS operations:<br />
* OP_GETDEVICELIST,<br />
* OP_GETDEVICEINFO, <br />
* OP_LAYOUTGET,<br />
* OP_LAYOUTCOMMIT,<br />
* OP_LAYOUTRETURN and<br />
* OP_CB_LAYOUTRECALL<br />
<br />
We continue to test the ability of our prototype to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS capable underlying file systems. <br />
<br />
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
<br />
''File Layout''<br />
* Clients: Linux, Solaris<br />
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris, pVFS2<br />
<br />
''Object layout''<br />
* Client: Linux<br />
* Servers: Linux, Panasas<br />
<br />
''Block layout''<br />
* Client: Linux<br />
* Server: EMC<br />
<br />
''pVFS2layout''<br />
* Client: Linux<br />
* Server: Linux<br />
<br />
===Activities===<br />
Our current Linux pNFS implementation uses a single whole file layout. We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges. <br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the [http://www.ultralight.org Ultralight project] and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE.<br />
The Linux pNFS client included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.<br />
<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
<br />
When a file system moves, the old server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the old server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. <br />
<br />
As presently implemented, clients save the old server's state in stable storage and pass the state information to the new server as part of the recovery operation. We are rewriting that interface to also support server-to-server state transfer.<br />
<br />
* Please check that sentence<br />
<br />
It remains to inform clients that state established with the old server remains valid on the new server. The IETF NFSv4 working group is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Background===<br />
'''Directory delegations''' promise to extend the usefulness of dentry caching in two ways. First, the client is no longer forced to revalidate the dentry cache after a timeout. Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics. <br />
Directory delegations allow the client to cache negative results. <br />
<br />
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails. But a subsequent open of the same file might succeed, if the file is created in the interim. Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result can not be cached. Consequently, subsequent opens of the same non-existent file also require OPEN RPC calls being sent to the server. This example is played out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.<br />
<br />
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.<br />
===Status===<br />
We implemented directory delegations in the Linux NFSv4 client and server. <br />
<br />
Our server implementation follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a '''/proc''' interface on the server to enable or disable directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository and compiling against shared libraries and header files stored on NFS servers.<br />
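<br />
One way to take the before-and-after operation counts on a Linux client is to snapshot '''/proc/net/rpc/nfs''' (the statistics behind ''nfsstat'') around a workload; a minimal sketch, with a placeholder mount point:<br />
<pre>
import subprocess

def nfs4_ops():
    """Return the NFSv4 per-operation counters from /proc/net/rpc/nfs."""
    with open("/proc/net/rpc/nfs") as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == "proc4":
                return [int(n) for n in fields[2:]]  # fields[1] = count
    return []

before = nfs4_ops()
subprocess.run(["ls", "-lR", "/mnt/nfs"], stdout=subprocess.DEVNULL)
after = nfs4_ops()
print("NFSv4 ops during workload:",
      sum(b - a for a, b in zip(before, after)))
</pre>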
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial; we plan to make both more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
<br />
To frame the task, consider identical symmetric servers with a cluster file system back end and a task running on one of them.<br />
Can we compare the load on the servers to determine whether there would be a benefit to migrating a client from one to the other? <br />
<br />
Answering this question requires that we define a model of load based on measurable quantities.<br />
<br />
Given a model, the next step is to write a tool that collects the factors that influence load, and then to measure how accurately the model predicts performance.<br />
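<br />
As a starting point, the tool can be a simple sampler that reads one number per factor each tick; a minimal sketch with only a run-queue reader wired in (the disk, CPU, interrupt, and memory readers sketched in the subsections below would slot in alongside it):<br />
<pre>
import time

def run_queue():
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])   # 1-minute load average

def collect(factors, interval=1.0, ticks=5):
    samples = []
    for _ in range(ticks):
        samples.append({name: fn() for name, fn in factors.items()})
        time.sleep(interval)
    return samples

print(collect({"runq": run_queue}))
</pre>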
<br />
===Goals===<br />
If an application is running at less than peak performance, the load model should tell us whether the bottleneck is in the server, the client, or elsewhere.<br />
<br />
If the bottleneck is in the server, one option for improving application performance is to replace server components with faster ones. Another option is to add servers, and a third is to migrate the application to a lightly loaded server. Note that adding servers is fruitless without the ability to migrate load onto them.<br />
<br />
===Factors that influence server load===<br />
<br />
====Disks====<br />
The rate at which a single file in a server file system can be read or written depends on many factors, including characteristics of the disk hardware (rotation speed, access latency, etc.), the disk controller, the bus, the layout of files on the disk, the size of the transfer, and the degree of caching. The overall bandwidth of a file system also depends on the degree of striping and the distribution of requests across disks. <br />
<br />
The ''iostat'' command can reveal a bottleneck due to server disks if seek or transfer rates approach maximum values. For a given server configuration, these values can be measured directly. It might be possible to predict these values for a given hardware ensemble.<br />
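<br />
For scripted collection, the busy fraction that ''iostat'' reports as %util can be computed from '''/proc/diskstats''': the field at index 12 (counting major, minor, and device name) is cumulative milliseconds spent doing I/O. A sketch, with a placeholder device name:<br />
<pre>
import time

def io_ms(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 12 and fields[2] == device:
                return int(fields[12])   # ms spent doing I/O
    raise ValueError("no such device: " + device)

def utilization(device, interval=1.0):
    before = io_ms(device)
    time.sleep(interval)
    return (io_ms(device) - before) / (interval * 1000.0)

print("sda busy fraction:", utilization("sda"))
</pre>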
<br />
====CPU====<br />
Server threads compete with one another and with the operating system for access to the CPU. Excess offered load can exhaust the availability of server threads.<br />
<br />
* How would we know if this happened?<br />
* Would it suffice to allocate more threads?<br />
* Or are there pathological cases to consider?<br />
<br />
Overall CPU utilization can also be measured with ''iostat'', but other factors may influence the allocation of CPU to server threads. For example, excessive pressure on the memory or interrupt subsystems can force the operating system to intervene.<br />
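<br />
Overall utilization can also be computed directly from the cumulative per-state tick counts on the aggregate "cpu" line of '''/proc/stat'''; a sketch:<br />
<pre>
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(v) for v in f.readline().split()[1:]]

def cpu_busy_fraction(interval=1.0):
    before = cpu_times()
    time.sleep(interval)
    after = cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    idle = deltas[3] + deltas[4]        # idle + iowait
    return 1.0 - idle / sum(deltas)

print("CPU busy fraction:", cpu_busy_fraction())
</pre>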
<br />
====Interrupts====<br />
<br />
Interrupt rates can be measured with ''vmstat'' (the ''in'' column reports interrupts per second) or by sampling the counters in '''/proc/interrupts'''.<br />
<br />
For a given hardware configuration, a threshold can be measured experimentally.<br />
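<br />
For scripted sampling, the "intr" line of '''/proc/stat''' begins with the total number of interrupts serviced since boot, so a delta over an interval gives the rate; a sketch:<br />
<pre>
import time

def total_interrupts():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("intr "):
                return int(line.split()[1])
    raise RuntimeError("no intr line in /proc/stat")

def interrupt_rate(interval=1.0):
    before = total_interrupts()
    time.sleep(interval)
    return (total_interrupts() - before) / interval

print("interrupts/sec:", interrupt_rate())
</pre>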
<br />
====Memory====<br />
The memory subsystem is complex and varies among operating systems. Applications compete with one another for virtual memory. Often, they also compete with the file system, which uses the virtual memory subsystem for its in-memory cache.<br />
<br />
Excess demand for memory is often reflected in early eviction of pages. The ''vmstat'' command shows the pageout rate, which does not measure early eviction directly but does reflect overall memory pressure.<br />
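<br />
For scripted sampling of the same signal, '''/proc/vmstat''' exposes the cumulative counters behind ''vmstat'' (pswpout for pages swapped out, pgpgout for data paged out to block devices); a sketch:<br />
<pre>
import time

def vmstat_counter(key):
    with open("/proc/vmstat") as f:
        for line in f:
            name, _, value = line.partition(" ")
            if name == key:
                return int(value)
    raise KeyError(key)

def rate(key, interval=1.0):
    before = vmstat_counter(key)
    time.sleep(interval)
    return (vmstat_counter(key) - before) / interval

print("pages swapped out/sec:", rate("pswpout"))
</pre>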
<br />
====Network====<br />
Network utilization can be measured by sampling per-interface byte and packet counters, e.g., from '''/proc/net/dev'''. Throughput approaching the line rate of the server's interface indicates a network bottleneck.<br />
<br />
===What to report for each?===<br />
For each factor, report the average load over a ladder of past time intervals (1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s, ...).<br />
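<br />
One lightweight way to maintain such a ladder of averages without storing history is a bank of exponentially weighted moving averages, one per horizon (the same idea behind the kernel's 1/5/15-minute load averages); a sketch:<br />
<pre>
import math

class MultiHorizonAverage:
    """One exponentially weighted moving average per reporting horizon."""
    def __init__(self, horizons_s=(0.001, 0.01, 0.1, 1, 10, 100)):
        self.values = {h: 0.0 for h in horizons_s}

    def update(self, sample, dt):
        for h, v in self.values.items():
            alpha = 1.0 - math.exp(-dt / h)   # forget at the horizon
            self.values[h] = v + alpha * (sample - v)

avg = MultiHorizonAverage()
for _ in range(200):
    avg.update(sample=1.0, dt=0.05)   # feed a constant load of 1.0
print({h: round(v, 3) for h, v in avg.values.items()})
</pre>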
<br />
===How do we check usefulness of this information?===<br />
Boot the server with artificially reduced resources and see whether increasing a resource improves performance as the model predicts?<br />
<br />
====Disk bandwidth====<br />
Vary the size of RAID arrays or the bandwidth of disk interfaces? <br />
<br />
Or run another process that soaks up a known percentage of disk bandwidth?<br />
<br />
====CPU load====<br />
Throttle the CPU?<br />
<br />
Or simply try machines with different hardware? Vary the workload? How do we construct a light versus a heavy workload?<br />
<br />
How do we measure performance in each case? The obvious approach is to add clients until we see performance degradation due to server bottlenecks.<br />
<br />
===Measures of load===<br />
What do we use to determine whether our measure of load is correct? Some candidates (see the probe sketch below):<br />
<br />
* latency of a single RPC, measured from a client?<br />
* time to complete some other task, measured from a single client not involved in loading the server?<br />
* RPCs per second?<br />
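<br />
The simplest such probe is a lightly loaded observer client that times a cheap NFS operation; a sketch that times stat() calls on a placeholder path (attribute caching must be defeated, e.g., with a noac mount, for each stat to correspond to an RPC):<br />
<pre>
import os
import time

def median_stat_latency_ms(path, n=100):
    times = []
    for _ in range(n):
        start = time.perf_counter()
        os.stat(path)
        times.append((time.perf_counter() - start) * 1000.0)
        time.sleep(0.01)
    return sorted(times)[n // 2]

print("median stat latency (ms):",
      median_stat_latency_ms("/mnt/nfs/probe"))
</pre>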
<br />
===Configuration parameters on server that can be varied===<br />
* number of server threads<br />
* number of connections per server thread<br />
* request queue lengths (number of bytes waiting in the TCP socket)<br />
<br />
===Some special situations that can be problems (from Chuck)===<br />
* reboot recovery: everyone is recovering at once.<br />
* mount storms: a lab full of clients may all mount at once, or a cluster job may trigger automount from all clients at once.<br />
<br />
===Possible benchmark sources, for this and locking scalability===<br />
====postmark====<br />
Looks pretty primitive: a mixture of reads, writes, creates, and unlinks. No locks.<br />
====filebench====<br />
Also no locking. We have not yet figured out exactly what the various loads do. Is there an active developer community?<br />
====See Bull.net's list?====<br />
* Bonnie++<br />
* FStress<br />
* dbench: simulates the file system activity generated by a Samba server running the proprietary SMB benchmark "netbench". Maybe not so useful.<br />
* Do it ourselves: modify postmark or filebench? Set up a mail server, for example, and send it fake mail? Get traces from working servers?</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-18T14:28:45Z<p>Andros: /* Development */</p>
<hr />
<div>=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of October 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development===<br />
We updated the Linux '''pNFS''' client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19. <br />
<br />
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.<br />
<br />
The Linux client separates common NFS code from NFSv2/3/4 code by using version-specific operations. We rewrote the Linux pNFS client to use its own set of version-specific operations. This provides a controlled interface to the pNFS code and eases updating the code to new kernel versions.<br />
<br />
Four client layout modules are in development. <br />
* File layout driver (CITI, Network Appliance, and IBM Almaden).<br />
* PVFS2 layout driver (CITI).<br />
* Object layout driver (Panasas).<br />
* Block layout driver (CITI under contract with EMC).<br />
<br />
To accommodate the requirements of the multiple layout drivers, we expanded the layout operation policy interfaces between the layout driver and generic pNFS client.<br />
<br />
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.<br />
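<br />
A minimal model of the replacement cache (an illustration of the data structure, not the kernel code): layouts become a per-inode list of byte-range segments, looked up by I/O range.<br />
<pre>
class LayoutSegment:
    def __init__(self, offset, length, iomode, device_id):
        self.offset, self.length = offset, length
        self.iomode, self.device_id = iomode, device_id

    def covers(self, offset, length):
        return (self.offset <= offset and
                offset + length <= self.offset + self.length)

class LayoutCache:
    """Toy per-inode cache of multiple layout segments."""
    def __init__(self):
        self.segments = []               # kept sorted by offset

    def find(self, offset, length, iomode):
        for seg in self.segments:
            if seg.iomode == iomode and seg.covers(offset, length):
                return seg
        return None                      # miss: caller sends LAYOUTGET

    def insert(self, seg):
        self.segments.append(seg)
        self.segments.sort(key=lambda s: s.offset)
</pre>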
<br />
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server, the IBM GPFS server, and the PVFS2 server. <br />
<br />
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.<br />
<br />
We have continued to develop the PVFS2 layout driver and PVFS2 support in the pNFS server. The layout driver I/O interface supports direct access, page cache access with NFSv4 readahead and writeback, and the O_DIRECT access method. In addition, PVFS2 now supports the pNFS file-based layout, which lets pNFS clients choose how they access the file system.<br />
<br />
We developed prototype implementations of the pNFS operations:<br />
* OP_GETDEVICELIST<br />
* OP_GETDEVICEINFO<br />
* OP_LAYOUTGET<br />
* OP_LAYOUTCOMMIT<br />
* OP_LAYOUTRETURN<br />
* OP_CB_LAYOUTRECALL<br />
<br />
We continue to test the ability of our prototype to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS capable underlying file systems. <br />
<br />
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
<br />
''File Layout''<br />
* Clients: Linux, Solaris<br />
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris<br />
<br />
''Object layout''<br />
* Clients: Linux<br />
* Servers: Linux, Panasas<br />
<br />
''Block layout''<br />
* Clients: Linux<br />
* Server: EMC<br />
<br />
===Activities===<br />
Our current Linux pNFS implementation uses a single whole file layout. We are extending the layout cache on the client and layout management on the server to support multiple layouts and small byte ranges. <br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the [http://www.ultralight.org Ultralight project] and are testing pNFS I/O using pNFS clients on 10 GbE against pNFS clusters on 1 GbE.<br />
The Linux pNFS client is included in the Ultralight kernel and distributed to Ultralight sites, providing opportunities for future long-haul WAN testing.<br />
<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
<br />
When a file system moves, the old server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the old server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. <br />
<br />
As presently implemented, clients save the old server's state in stable storage and pass that state to the new server as part of the recovery operation. We are rewriting this interface to also support direct server-to-server state transfer.<br />
<br />
We still need a way to inform clients that state established with the old server remains valid on the new server. The IETF NFSv4 working group is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code, NFS4ERR_MOVED_DATA_AND_STATE.<br />
<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Background===<br />
'''Directory delegations''' promise to extend the usefulness of dentry caching in two ways. First, the client is no longer forced to revalidate the dentry cache after a timeout. Second, while positive caching can be treated as a hint, negative caching without cache invalidation violates open-to-close semantics. <br />
Directory delegations allow the client to cache negative results. <br />
<br />
For example, if a client opens a file that does not exist, it issues an OPEN RPC that fails. But a subsequent open of the same file might succeed if the file is created in the interim. Open-to-close semantics requires that the newly created file be seen by the client, so the earlier negative result cannot be cached. Consequently, subsequent opens of the same non-existent file also require OPEN RPCs to be sent to the server. This example is played out repeatedly when the shell searches for executables in PATH or when the linker searches for shared libraries in LD_LIBRARY_PATH.<br />
<br />
With directory delegations, the server callback mechanism can guarantee that no entries have been added or modified in a cached directory, which allows consistent negative caching and eliminates repeated checks for non-existent files.<br />
===Status===<br />
We implemented directory delegations in the Linux NFSv4 client and server. <br />
<br />
Our server implementation follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a '''/proc''' interface on the server to enable or disable directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository and compiling against shared libraries and header files stored on NFS servers.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial; we plan to make both more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
<br />
To frame the task, consider identical symmetric servers with a cluster file system back end and a task running on one of them.<br />
Can we compare the load on the servers to determine whether there would be a benefit to migrating a client from one to the other? <br />
<br />
Answering this question requires that we define a model of load based on measurable quantities.<br />
<br />
Given a model, the next step is to write a tool that collects the factors that influence load, and then to measure how accurately the model predicts performance.<br />
<br />
===Goals===<br />
If an application is running at less than peak performance, the load model should tell us whether the bottleneck is in the server, the client, or elsewhere.<br />
<br />
If the bottleneck is in the server, one option for improving application performance is to replace server components with faster ones. Another option is to add servers, and a third is to migrate the application to a lightly loaded server. Note that adding servers is fruitless without the ability to migrate load onto them.<br />
<br />
===Factors that influence server load===<br />
<br />
====Disks====<br />
The rate at which a single file in a server file system can be read or written depends on many factors, including characteristics of the disk hardware (rotation speed, access latency, etc.), the disk controller, the bus, the layout of files on the disk, the size of the transfer, and the degree of caching. The overall bandwidth of a file system also depends on the degree of striping and the distribution of requests across disks. <br />
<br />
The ''iostat'' command can reveal a bottleneck due to server disks if seek or transfer rates approach maximum values. For a given server configuration, these values can be measured directly. It might be possible to predict these values for a given hardware ensemble.<br />
<br />
====CPU====<br />
Server threads compete with one another and with the operating system for access to the CPU. Excess offered load can exhaust the availability of server threads.<br />
<br />
* How would we know if this happened?<br />
* Would it suffice to allocate more threads?<br />
* Or are there pathological cases to consider?<br />
<br />
Overall CPU utilization can also be measured with ''iostat'', but other factors may influence the allocation of CPU to server threads. For example, excessive pressure on the memory or interrupt subsystems can force the operating system to intervene.<br />
<br />
====Interrupts====<br />
<br />
Interrupt rates can be measured with ''vmstat'' (the ''in'' column reports interrupts per second) or by sampling the counters in '''/proc/interrupts'''.<br />
<br />
For a given hardware configuration, a threshold can be measured experimentally.<br />
<br />
====Memory====<br />
The memory subsystem is complex and varies among operating systems. Applications compete with one another for virtual memory. Often, they also compete with the file system, which uses the virtual memory subsystem for its in-memory cache.<br />
<br />
Excess demand for memory is often reflected in early eviction of pages. The ''vmstat'' command shows the pageout rate, which does not measure early eviction directly but does reflect overall memory pressure.<br />
<br />
====Network====<br />
Network utilization can be measured by sampling per-interface byte and packet counters, e.g., from '''/proc/net/dev'''. Throughput approaching the line rate of the server's interface indicates a network bottleneck.<br />
<br />
===What to report for each?===<br />
For each factor, report the average load over a ladder of past time intervals (1 ms, 10 ms, 100 ms, 1 s, 10 s, 100 s, ...).<br />
<br />
===How do we check usefulness of this information?===<br />
Boot the server with artificially reduced resources and see whether increasing a resource improves performance as the model predicts?<br />
<br />
====Disk bandwidth====<br />
Vary the size of RAID arrays or the bandwidth of disk interfaces? <br />
<br />
Or run another process that soaks up a known percentage of disk bandwidth?<br />
<br />
====CPU load====<br />
Throttle the CPU?<br />
<br />
Or simply try machines with different hardware? Vary the workload? How do we construct a light versus a heavy workload?<br />
<br />
How do we measure performance in each case? The obvious approach is to add clients until we see performance degradation due to server bottlenecks.<br />
<br />
===Measures of load===<br />
What do we use to determine whether our measure of load is correct?<br />
<br />
* latency of a single RPC, measured from a client?<br />
* time to complete some other task, measured from a single client not involved in loading the server?<br />
* RPCs per second?<br />
<br />
===Configuration parameters on server that can be varied===<br />
* number of server threads<br />
* number of connections per server thread<br />
* request queue lengths (number of bytes waiting in the TCP socket)<br />
<br />
===Some special situations that can be problems (from Chuck)===<br />
* reboot recovery: everyone is recovering at once.<br />
* mount storms: a lab full of clients may all mount at once, or a cluster job may trigger automount from all clients at once.<br />
<br />
===Possible benchmark sources, for this and locking scalability===<br />
====postmark====<br />
Looks pretty primitive: a mixture of reads, writes, creates, and unlinks. No locks.<br />
====filebench====<br />
Also no locking. We have not yet figured out exactly what the various loads do. Is there an active developer community?<br />
====See Bull.net's list?====<br />
* Bonnie++<br />
* FStress<br />
* dbench: simulates the file system activity generated by a Samba server running the proprietary SMB benchmark "netbench". Maybe not so useful.<br />
* Do-it-ourselves modify postmark or filebench? set up a mailserver (e.g.), send it fake mail. get traces from working servers</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-17T19:52:01Z<p>Andros: /* Activities */</p>
<hr />
<div>=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of October 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We updated the Linux pNFS client and server to the 2.6.17 kernel level, and are preparing to rebase again for 2.6.19. <br />
<br />
We updated the pNFS code base to draft-ietf-nfsv4-minorversion1-05. Testing identified multiple bugs, which we fixed.<br />
<br />
To make a clean separation of the common NFS v2/3/4/4.1 code from code specific to pNFS, we rewrote the Linux pNFS client to use its own set of RPC operations.<br />
<br />
Four client layout modules are in development. <br />
* File layout driver (CITI, Network Appliance, and IBM Almaden).<br />
* PVFS2 layout driver (CITI).<br />
* Object layout driver (Panasas).<br />
* Block layout driver (CITI under contract with EMC).<br />
<br />
To accommodate the requirements of the multiple layout drivers, we expanded the layout operation policy interfaces between the layout driver and generic pNFS client.<br />
<br />
We are designing and coding a pNFS client layout cache to replace the current implementation, which supports only a single layout per inode.<br />
<br />
* Andy: check wording<br />
<br />
We improved the interface to the underlying file system on the Linux pNFS server. The new interface is being used by the Panasas object layout server and the IBM GPFS server. <br />
<br />
We are coding the pNFS layout management service and file system interfaces on the Linux pNFS server to do a better job of bookkeeping so that we can extend the layout recall implementation, which is limited to a single layout.<br />
<br />
We've continued to developed the pVFS2 layout and pVFS2 pNFS server.<br />
<br />
* Dean, can you throw some text here?<br />
<br />
We developed prototype implementations of pNFS operations:<br />
* OP_GETDEVICELIST,<br />
* OP_GETDEVICEINFO, <br />
* OP_LAYOUTGET,<br />
* OP_LAYOUTCOMMIT,<br />
* OP_LAYOUTRETURN and<br />
* OP_CB_LAYOUTRECALL<br />
<br />
We continue to test the ability of our prototype to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September 2006 NFSv4 Bake-a-thon, hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and the ability of CITI's Linux pNFS server to export pNFS capable underlying file systems. <br />
<br />
We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
<br />
''File Layout''<br />
* Clients: Linux, Solaris<br />
* Servers: Network Appliance, Linux IBM GPFS, DESY dCache, Solaris<br />
<br />
''Object layout''<br />
* Clients: Linux<br />
* Servers: Linux, Panasas<br />
<br />
''Block layout''<br />
* Clients: Linux<br />
* Server: EMC<br />
<br />
===Activities===<br />
We are expanding our simple single whole file layout implementation to include multiple small byte range layouts which requires a new layout cache implementation on the client and a new layout management implementation on the server.<br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the (http://www.ultralight.org) Ultralight project and are testing pNFS I/O using 10G pNFS clients against 1G pNFS clusters.<br />
The Linux pNFS client is included in the Ultralight kernel which is distributed to ultralight sites providing opportunities for large distance WAN testing.<br />
<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface that clients use when saving NFSv4 server state in stable storage to also support the server-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). To give an example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching requires that subsequent opens of the same non-existent file yield repeated OPEN RPC calls being sent to the server. This example is played out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation-counts when directory delegations is enabled or disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-12T18:00:57Z<p>Andros: /* Activities */</p>
<hr />
<div>I started with the May 2006 report, which we can bring up to date for the October 2006 report.<br />
<br />
=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of May 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We've updated the pNFS client and server to the 2.6.17 kernel level, and will rebase again for 2.6.19. We've updated the pNFS codebase to the draft-ietf-nfsv4-minorversion1-05. Through testing we've identified and fixed multiple bugs.<br />
<br />
We rewrote the Linux pNFS client to use it's own set of rpc operations to cleanly separate the common NFS v2/3/4/4.1 code from the pNFS specific code.<br />
We now have four client layout modules under development. The file layout driver is being jointly developed by CITI, Network Appliance, and IBM Almaden. The CITI pVFS2 layout driver from Dean Hildebrand. The object layout driver from Panasas. The block layout driver is being developed at CITI under contract from EMC.<br />
We've expanded the layout operation interface and the layout policy interface between the layout driver and generic pNFS client to accommodate the requirements of the multiple layout drivers.<br />
We are designing and coding a pNFS client layout cache to replace the current simple single layout per inode implementation.<br />
<br />
We've improved the Linux pNFS server to underlying file system interface which is now used by the Panasas object layout server as well as the IBM GPFS server. We are currently coding the server pNFS layout management service and file system interfaces to bookeep layouts in order to expand the current simple single layout recall implementation.<br />
<br />
We've continued to developed the pVFS2 layout and pVFS2 pNFS server. (XXX Dean)<br />
<br />
We developed prototype implementations of pNFS operations:<br />
o OP_GETDEVICELIST,<br />
o OP_GETDEVICEINFO, <br />
o OP_LAYOUTGET,<br />
o OP_LAYOUTCOMMIT,<br />
o OP_LAYOUTRETURN and<br />
o OP_CB_LAYOUTRECALL<br />
<br />
<br />
We continue testing the prototype’s ability to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September NFSv4 bakeathon hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and CITI's Linux pNFS server to export pNFS capable underlying file systems. We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
File Layout<br />
<br />
Linux and Solaris client<br />
Network Appliance, Linux IBM GPFS, DESY dCache, Solaris server<br />
<br />
Object layout<br />
<br />
Linux client<br />
Linux Panasas server<br />
<br />
Block layout<br />
<br />
Linux client<br />
EMC server<br />
<br />
===Activities===<br />
We are expanding our simple single whole file layout implementation to include multiple small byte range layouts which requires a new layout cache implementation on the client and a new layout management implementation on the server.<br />
<br />
In cooperation with EMC, we continue to develop a block layout driver module for the generic pNFS client.<br />
<br />
We continue to measure I/O performance.<br />
<br />
We joined the (http://www.ultralight.org) Ultralight project and are testing pNFS I/O using 10G pNFS clients against 1G pNFS clusters.<br />
The Linux pNFS client is included in the Ultralight kernel which is distributed to ultralight sites providing opportunities for large distance WAN testing.<br />
<br />
DESY (?)<br />
<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface that clients use when saving NFSv4 server state in stable storage to also support the server-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). To give an example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching requires that subsequent opens of the same non-existent file yield repeated OPEN RPC calls being sent to the server. This example is played out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation-counts when directory delegations is enabled or disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-12T17:41:24Z<p>Andros: /* Milestones */</p>
<hr />
<div>I started with the May 2006 report, which we can bring up to date for the October 2006 report.<br />
<br />
=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of May 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We've updated the pNFS client and server to the 2.6.17 kernel level, and will rebase again for 2.6.19. We've updated the pNFS codebase to the draft-ietf-nfsv4-minorversion1-05. Through testing we've identified and fixed multiple bugs.<br />
<br />
We rewrote the Linux pNFS client to use it's own set of rpc operations to cleanly separate the common NFS v2/3/4/4.1 code from the pNFS specific code.<br />
We now have four client layout modules under development. The file layout driver is being jointly developed by CITI, Network Appliance, and IBM Almaden. The CITI pVFS2 layout driver from Dean Hildebrand. The object layout driver from Panasas. The block layout driver is being developed at CITI under contract from EMC.<br />
We've expanded the layout operation interface and the layout policy interface between the layout driver and generic pNFS client to accommodate the requirements of the multiple layout drivers.<br />
We are designing and coding a pNFS client layout cache to replace the current simple single layout per inode implementation.<br />
<br />
We've improved the Linux pNFS server to underlying file system interface which is now used by the Panasas object layout server as well as the IBM GPFS server. We are currently coding the server pNFS layout management service and file system interfaces to bookeep layouts in order to expand the current simple single layout recall implementation.<br />
<br />
We've continued to developed the pVFS2 layout and pVFS2 pNFS server. (XXX Dean)<br />
<br />
We developed prototype implementations of pNFS operations:<br />
o OP_GETDEVICELIST,<br />
o OP_GETDEVICEINFO, <br />
o OP_LAYOUTGET,<br />
o OP_LAYOUTCOMMIT,<br />
o OP_LAYOUTRETURN and<br />
o OP_CB_LAYOUTRECALL<br />
<br />
<br />
We continue testing the prototype’s ability to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the September NFSv4 bakeathon hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and CITI's Linux pNFS server to export pNFS capable underlying file systems. We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
File Layout<br />
<br />
Linux and Solaris client<br />
Network Appliance, Linux IBM GPFS, DESY dCache, Solaris server<br />
<br />
Object layout<br />
<br />
Linux client<br />
Linux Panasas server<br />
<br />
Block layout<br />
<br />
Linux client<br />
EMC server<br />
<br />
===Activities===<br />
We are rewriting the pNFS client, beginning to measure I/O performance over pNFS, and designing OP_LAYOUTRETURN and recall.<br />
<br />
In cooperation with EMC, we are developing a block layout driver module for the generic pNFS client.<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface that clients use when saving NFSv4 server state in stable storage to also support the server-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). To give an example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching requires that subsequent opens of the same non-existent file yield repeated OPEN RPC calls being sent to the server. This example is played out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation-counts when directory delegations is enabled or disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-12T17:41:05Z<p>Andros: /* Milestones */</p>
<hr />
<div>I started with the May 2006 report, which we can bring up to date for the October 2006 report.<br />
<br />
=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of May 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We've updated the pNFS client and server to the 2.6.17 kernel level, and will rebase again for 2.6.19. We've updated the pNFS codebase to the draft-ietf-nfsv4-minorversion1-05. Through testing we've identified and fixed multiple bugs.<br />
<br />
We rewrote the Linux pNFS client to use it's own set of rpc operations to cleanly separate the common NFS v2/3/4/4.1 code from the pNFS specific code.<br />
We now have four client layout modules under development. The file layout driver is being jointly developed by CITI, Network Appliance, and IBM Almaden. The CITI pVFS2 layout driver from Dean Hildebrand. The object layout driver from Panasas. The block layout driver is being developed at CITI under contract from EMC.<br />
We've expanded the layout operation interface and the layout policy interface between the layout driver and generic pNFS client to accommodate the requirements of the multiple layout drivers.<br />
We are designing and coding a pNFS client layout cache to replace the current simple single layout per inode implementation.<br />
<br />
We've improved the Linux pNFS server to underlying file system interface which is now used by the Panasas object layout server as well as the IBM GPFS server. We are currently coding the server pNFS layout management service and file system interfaces to bookeep layouts in order to expand the current simple single layout recall implementation.<br />
<br />
We've continued to developed the pVFS2 layout and pVFS2 pNFS server. (XXX Dean)<br />
<br />
We developed prototype implementations of pNFS operations:<br />
o OP_GETDEVICELIST,<br />
o OP_GETDEVICEINFO, <br />
o OP_LAYOUTGET,<br />
o OP_LAYOUTCOMMIT,<br />
o OP_LAYOUTRETURN and<br />
o OP_CB_LAYOUTRECALL<br />
<br />
<br />
We continue testing the prototype’s ability to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the Feb/March 2006 Connectathon, we tested the ability of CITI’s Linux pNFS client to operate with multiple layouts. We configured a Linux pNFS client with (1) a pVFS2 layout to access direct I/O on pVFS2 servers, and (2) a file layout to access data striped across Network Appliance servers. W<br />
<br />
<br />
At the September NFSv4 bakeathon hosted by CITI, we continued to test the ability of CITI's Linux pNFS client to operate with multiple layouts, and CITI's Linux pNFS server to export pNFS capable underlying file systems. We demonstrated the Linux pNFS client support for multiple layouts by copying files between multiple pNFS back ends.<br />
<br />
The following pNFS implementations were tested.<br />
File Layout<br />
<br />
Linux and Solaris client<br />
Network Appliance, Linux IBM GPFS, DESY dCache, Solaris server<br />
<br />
Object layout<br />
<br />
Linux client<br />
Linux Panasas server<br />
<br />
Block layout<br />
<br />
Linux client<br />
EMC server<br />
<br />
===Activities===<br />
We are rewriting the pNFS client, beginning to measure I/O performance over pNFS, and designing OP_LAYOUTRETURN and recall.<br />
<br />
In cooperation with EMC, we are developing a block layout driver module for the generic pNFS client.<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface that clients use when saving NFSv4 server state in stable storage to also support the server-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). To give an example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching requires that subsequent opens of the same non-existent file yield repeated OPEN RPC calls being sent to the server. This example is played out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation-counts when directory delegations is enabled or disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-12T17:28:37Z<p>Andros: /* Development. */</p>
<hr />
<div>I started with the May 2006 report, which we can bring up to date for the October 2006 report.<br />
<br />
=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of May 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We've updated the pNFS client and server to the 2.6.17 kernel level, and will rebase again for 2.6.19. We've updated the pNFS codebase to the draft-ietf-nfsv4-minorversion1-05. Through testing we've identified and fixed multiple bugs.<br />
<br />
We rewrote the Linux pNFS client to use it's own set of rpc operations to cleanly separate the common NFS v2/3/4/4.1 code from the pNFS specific code.<br />
We now have four client layout modules under development. The file layout driver is being jointly developed by CITI, Network Appliance, and IBM Almaden. The CITI pVFS2 layout driver from Dean Hildebrand. The object layout driver from Panasas. The block layout driver is being developed at CITI under contract from EMC.<br />
We've expanded the layout operation interface and the layout policy interface between the layout driver and generic pNFS client to accommodate the requirements of the multiple layout drivers.<br />
We are designing and coding a pNFS client layout cache to replace the current simple single layout per inode implementation.<br />
<br />
We've improved the Linux pNFS server to underlying file system interface which is now used by the Panasas object layout server as well as the IBM GPFS server. We are currently coding the server pNFS layout management service and file system interfaces to bookeep layouts in order to expand the current simple single layout recall implementation.<br />
<br />
We've continued to developed the pVFS2 layout and pVFS2 pNFS server. (XXX Dean)<br />
<br />
We developed prototype implementations of pNFS operations:<br />
o OP_GETDEVICELIST,<br />
o OP_GETDEVICEINFO, <br />
o OP_LAYOUTGET,<br />
o OP_LAYOUTCOMMIT,<br />
o OP_LAYOUTRETURN and<br />
o OP_CB_LAYOUTRECALL<br />
<br />
<br />
We continue testing the prototype’s ability to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the Feb/March 2006 Connectathon, we tested the ability of CITI’s Linux pNFS client to operate with multiple layouts. We configured a Linux pNFS client with (1) a pVFS2 layout to access direct I/O on pVFS2 servers, and (2) a file layout to access data striped across Network Appliance servers. We demonstrated support for multiple layouts by copying files between the pVFS2 and Network Appliance data domains,.<br />
===Activities===<br />
We are rewriting the pNFS client, beginning to measure I/O performance over pNFS, and designing OP_LAYOUTRETURN and recall.<br />
<br />
In cooperation with EMC, we are developing a block layout driver module for the generic pNFS client.<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface that clients use when saving NFSv4 server state in stable storage to also support the server-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). To give an example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching requires that subsequent opens of the same non-existent file yield repeated OPEN RPC calls being sent to the server. This example is played out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/CITI_ASC_statusCITI ASC status2006-10-12T17:27:23Z<p>Andros: </p>
<hr />
<div>I started with the May 2006 report, which we can bring up to date for the October 2006 report.<br />
<br />
=University of Michigan/CITI NFSv4 ASC alliance=<br />
Status of May 2006<br />
==Task 1. Demonstration of pNFS with multiple back end methods (PVFS and File) including layout recall — LANL will replicate this demonstration at LANL working with CITI remotely.==<br />
===Development.===<br />
We've updated the pNFS client and server to the 2.6.17 kernel level, and will rebase again for 2.6.19. We've updated the pNFS codebase to draft-ietf-nfsv4-minorversion1-05. Through testing, we've identified and fixed multiple bugs.<br />
<br />
We rewrote the Linux pNFS client to use its own set of rpc operations to cleanly separate the common NFS v2/3/4/4.1 code from the pNFS-specific code.<br />
We now have four client layout modules under development: the file layout driver, jointly developed by CITI, Network Appliance, and IBM Almaden; the pVFS2 layout driver, from Dean Hildebrand at CITI; the object layout driver, from Panasas; and the block layout driver, being developed at CITI under contract from EMC.<br />
We've expanded the layout operation interface and the layout policy interface between the layout driver and the generic pNFS client to accommodate the requirements of multiple layout drivers.<br />
We are designing and coding a pNFS client layout cache to replace the current simple single-layout-per-inode implementation.<br />
<br />
We've improved the interface between the Linux pNFS server and the underlying file system, which is now used by the Panasas object layout server as well as the IBM GPFS server. We are currently coding the server pNFS layout management service and file system interfaces to bookkeep layouts, in order to expand the current simple single-layout-recall implementation.<br />
<br />
We've continued to develop the pVFS2 layout driver and the pVFS2 pNFS server. (XXX Dean)<br />
<br />
We developed prototype implementations of the following pNFS operations (a sketch of the LAYOUTGET arguments appears after the list):<br />
* OP_GETDEVICELIST<br />
* OP_GETDEVICEINFO<br />
* OP_LAYOUTGET<br />
* OP_LAYOUTCOMMIT<br />
* OP_LAYOUTRETURN<br />
* OP_CB_LAYOUTRECALL<br />
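<br />
For reference, the following is a minimal sketch of the LAYOUTGET arguments, loosely following the minorversion1 draft. The structure name, field names, and typedefs are illustrative stand-ins (so the sketch compiles outside the kernel), not the definitions in our tree.<br />
<br />
/* Stand-in types so this sketch compiles outside the kernel. */<br />
typedef unsigned int u32;<br />
typedef unsigned long long u64;<br />
typedef struct { char data[16]; } nfs4_stateid;<br />
<br />
/* Illustrative OP_LAYOUTGET arguments, loosely following the draft. */<br />
struct example_layoutget_args {<br />
        u32          lg_layout_type; /* file, object, or block */<br />
        u32          lg_iomode;      /* read-only or read/write */<br />
        u64          lg_offset;      /* start of the requested range */<br />
        u64          lg_length;      /* length of the requested range */<br />
        u64          lg_minlength;   /* minimum usable length */<br />
        nfs4_stateid lg_stateid;     /* stateid covering the open file */<br />
        u32          lg_maxcount;    /* maximum reply size, in bytes */<br />
};<br />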
<br />
The following pNFS implementations were tested at the September NFSv4 bakeathon hosted by CITI.<br />
* File layout: Linux and Solaris clients; Network Appliance, Linux IBM GPFS, DESY dCache, and Solaris servers<br />
* Object layout: Linux client; Linux Panasas server<br />
* Block layout: Linux client; EMC server<br />
<br />
<br />
We continue testing the prototype’s ability to send direct I/O data to data servers.<br />
<br />
===Milestones===<br />
At the Feb/March 2006 Connectathon, we tested the ability of CITI’s Linux pNFS client to operate with multiple layouts. We configured a Linux pNFS client with (1) a pVFS2 layout to access direct I/O on pVFS2 servers, and (2) a file layout to access data striped across Network Appliance servers. We demonstrated support for multiple layouts by copying files between the pVFS2 and Network Appliance data domains.<br />
===Activities===<br />
We are rewriting the pNFS client, beginning to measure I/O performance over pNFS, and designing OP_LAYOUTRETURN and recall.<br />
<br />
In cooperation with EMC, we are developing a block layout driver module for the generic pNFS client.<br />
==Task 2. Migration of client from one mount/metadata server to another to be demonstrated. This demonstration may be replicated at LANL depending on success of this work. ==<br />
When a file system moves, the former server notifies clients with NFS4ERR_MOVED. Clients then reclaim state held on the former server by engaging in reboot recovery with the new server. For cluster file systems, server-to-server state transfer lets clients avoid the reclaim. <br />
<br />
We redesigned state bookkeeping to ensure that state created on NFSv4 servers exporting the same cluster file system will not collide. We are rewriting the interface for saving NFSv4 server state in stable storage to also support server-to-server state transfer.<br />
<br />
It remains to inform clients that state established with the former server remains valid on the new server. The IETF is considering solutions, e.g., augmented FS_LOCATIONS information or a new error code, NFS4ERR_MOVED_DATA_AND_STATE.<br />
==Task 3. Analysis of caching and lock coherency, demonstration of caching and lock performance with scaling, under various levels of conflict, using byte range locks (looking at lock splitting issues etc.).==<br />
We have set up test machines and begun planning for tests. We have some immediate concerns over the memory footprint imposed by server lock structures.<br />
==Task 4. Analysis of directory delegations – how well does it work and when, when does it totally not work.==<br />
===Development===<br />
We have implemented directory delegations in the Linux client and server. Our server implementation of directory delegations follows the file delegations architecture. We extended the lease API in the Linux VFS to support read-only leases on directories and NFS-specific lease-breaking semantics.<br />
<br />
We implemented a /proc interface for enabling or disabling directory delegation at run time. At startup, the client queries the server for directory delegation support.<br />
<br />
Directory delegations promise to extend the usefulness of negative dentry caching on the client. Negative caching is unsafe without cache invalidation (positive caching can be treated as a hint). For example, opening a file that does not exist produces an OPEN RPC that fails. Open-to-close semantics and the lack of consistent negative caching require that every subsequent open of the same non-existent file send another OPEN RPC to the server. This pattern plays out frequently when searching for an executable in PATH or a shared library in LD_LIBRARY_PATH.<br />
<br />
Directory delegation enables negative caching by assuring that no entries have been added or modified in a cached directory. This should markedly decrease unnecessary repeated checks for non-existent files. We are testing this use case.<br />
<br />
The server has hooks for a policy layer to control the granting of directory delegations. (No policy is implemented yet.) When and whether to acquire delegations is also a client concern.<br />
===Testing===<br />
We are testing delegation grant and recall in a test rig with one or two clients. Testing consists mostly of comparing NFS operation counts with directory delegations enabled and disabled.<br />
<br />
Tests range from simple UNIX utilities — ls, find, touch — to hosting a CVS repository or compiling with shared libraries and header files on NFS servers. Tests will become more specific.<br />
<br />
We have extended PyNFS to support directory delegations. So far, the support is basic and the tests are trivial. Tests will become more specific.<br />
<br />
We are designing mechanisms that allow simulation experiments to compare delegation policies on NFSv4 network traces.<br />
==Task 5. How do you specify/measure NFS Server load.==<br />
We have no progress to report on this task.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/Linux_pnfs_client_rewrite_may_2006Linux pnfs client rewrite may 20062006-05-09T20:37:13Z<p>Andros: </p>
<hr />
<div>== pNFS Client Rewrite May 2006 ==<br />
NOTE: This patch set has been applied (05-09-2006) to the pnfs-2-6-16 CVS tree. The patches are left online for their comments.<br />
<br />
*[http://www.citi.umich.edu/projects/asci/pnfs/pnfs-client-rewrite-05-2006/ pNFS client rewrite patches against pNFS CVS 2.6.16]<br />
<br />
The current pNFS 2.6.16 CVS kernel client code embeds pNFS processing in the NFSv4 code path, resulting in many #ifdefs and pNFS-specific switching. The main purpose of this rewrite is to separate the pNFS code path from the NFSv2/v3/v4 code path. Following the method used to separate the NFSv2/v3/v4 code paths from each other, I created a new rpc_ops structure for pNFS and moved all pNFS processing into the appropriate routines. New rpc_ops were created where necessary. Existing nfs functions were split when necessary.<br />
<br />
The rpc_ops are set at mount. The NFSv4 client uses the nfs_v4_clientops rpc_ops as usual. If a server supports pNFS and a layout driver has been negotiated and initialized, the nfs_v4_clientops are replaced with the new pnfs_v4_clientops (see set_pnfs_layoutdriver()). The pnfs_v4_clientops also contain a reference to the new pnfs_file_operations, which are now set via the rpc_ops.<br />
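<br />
The following self-contained model shows the shape of the switch; the structures are pared down to one field each and the negotiation result is faked, so this is a sketch of the control flow, not the actual code.<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
/* Pared-down stand-ins for the real ops tables and server structure. */<br />
struct rpc_ops { const char *name; };<br />
<br />
static const struct rpc_ops nfs_v4_clientops  = { "nfs_v4_clientops" };<br />
static const struct rpc_ops pnfs_v4_clientops = { "pnfs_v4_clientops" };<br />
<br />
struct nfs_server {<br />
        const struct rpc_ops *rpc_ops;<br />
        int layoutdriver_ok; /* 1 if negotiated and initialized */<br />
};<br />
<br />
static void set_pnfs_layoutdriver(struct nfs_server *server)<br />
{<br />
        /* Swap ops tables only after the layout driver is ready;<br />
         * otherwise the client keeps the plain NFSv4 ops. */<br />
        if (server->layoutdriver_ok)<br />
                server->rpc_ops = &pnfs_v4_clientops;<br />
}<br />
<br />
int main(void)<br />
{<br />
        struct nfs_server s = { &nfs_v4_clientops, 1 };<br />
        set_pnfs_layoutdriver(&s);<br />
        printf("using %s\n", s.rpc_ops->name);<br />
        return 0;<br />
}<br />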
<br />
Moving pNFS processing into its own rpc_ops allows rpc_ops calling routines to return errors that the normal NFS code path ignores but pNFS requires. In the RPC-based NFS read path, for example, the only error is -ENOMEM from allocating pages; all other errors are detected in the RPC path. pNFS has other possible errors, such as LAYOUTGET failing or non-RPC-based I/O failing.<br />
<br />
Four new rpc_ops allow pNFS to switch between the normal server read/write sizes (server->rsize,rpages,wsize,wpages) and the data server read/write sizes (server->ds_rsize,ds_rpages,ds_wsize,ds_wpages) without if statements in the normal NFS code path. Note that there is still a chicken-and-egg problem due to this decision being made prior to the request size being calculated. A simplified model follows the prototypes below.<br />
<br />
rsize(struct inode *, struct nfs_read_data *)<br />
wsize(struct inode *, struct nfs_write_data *)<br />
rpages(struct inode *, unsigned int *)<br />
wpages(struct inode *, unsigned int *)<br />
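<br />
A self-contained model of the size switch; the real ops take an inode and a read/write data structure as shown above, but the simplified signatures here make the point: the pNFS op returns the data server size, the default op the MDS size.<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
/* Only the two size fields matter for this sketch. */<br />
struct nfs_server { unsigned int rsize, ds_rsize; };<br />
<br />
static unsigned int nfs_rsize(const struct nfs_server *s)<br />
{<br />
        return s->rsize;    /* MDS read size */<br />
}<br />
<br />
static unsigned int pnfs_rsize(const struct nfs_server *s)<br />
{<br />
        return s->ds_rsize; /* data server read size */<br />
}<br />
<br />
int main(void)<br />
{<br />
        struct nfs_server s = { 32768, 1048576 };<br />
        /* In the real code the ops table supplies the function, so the<br />
         * common path never tests "is this pNFS?" */<br />
        printf("mds read size: %u\n", nfs_rsize(&s));<br />
        printf("ds read size:  %u\n", pnfs_rsize(&s));<br />
        return 0;<br />
}<br />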
<br />
Two new rpc_ops allow isolation of pNFS processing from normal NFS processing in the paging setup for read and write.<br />
<br />
pagein_one(struct list_head *, struct inode *)<br />
flush_one(struct inode *, struct list_head *, int, int)<br />
<br />
Finally, non-RPC-based I/O drivers can use the page setup routines. At the conclusion of I/O, the pages need to be returned. This code exists in the RPC callback routines, which were called in the old pNFS client code.<br />
This resulted in many if (pnfs_XXX) switches around the RPC-related portions of the callback code. I split the callbacks into functions and created new pNFS_xxx_norpc() callbacks that call only the portions of the RPC callbacks that apply.<br />
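<br />
A minimal model of the split (the function names are illustrative, not the actual callbacks): the RPC path and the norpc path share only the page-release step, which is what removes the if (pnfs_XXX) tests.<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
static void readpage_release_pages(void)<br />
{<br />
        puts("release pages"); /* shared by both paths */<br />
}<br />
<br />
static void readpage_rpc_accounting(void)<br />
{<br />
        puts("rpc accounting"); /* RPC-only work */<br />
}<br />
<br />
/* Callback used by the RPC-based I/O path. */<br />
static void readpage_result(void)<br />
{<br />
        readpage_rpc_accounting();<br />
        readpage_release_pages();<br />
}<br />
<br />
/* norpc variant: non-RPC layout drivers skip the RPC-only portion. */<br />
static void readpage_result_norpc(void)<br />
{<br />
        readpage_release_pages();<br />
}<br />
<br />
int main(void)<br />
{<br />
        readpage_result();<br />
        readpage_result_norpc();<br />
        return 0;<br />
}<br />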
<br />
The above process has been applied to the read/write/commit code paths. The directIO code path has yet to be examined.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_designPNFS prototype design2006-05-03T19:14:25Z<p>Andros: </p>
<hr />
<div>= pNFS =<br />
<br />
'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]<br />
<br />
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]<br />
<br />
* [[linux pnfs client rewrite may 2006|Linux pNFS Client Re-write patches May 2006]]</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/Pnfs_client_rewrite_may_2006Pnfs client rewrite may 20062006-05-03T19:10:04Z<p>Andros: </p>
<hr />
<div>== pNFS ==<br />
<br />
The main purpose of this rewrite is to cleanly separate the pNFS code path from the NFSv2/3/4 code path, getting rid of as many pNFS #ifdefs as possible along the way. To that end, I followed the method used to separate the NFSv2/v3/v4 specific code paths - I created a pNFS rpc_ops structure and appropriate operations.<br />
<br />
The NFSv4 client starts out using the normal nfs_v4_clientops. The client switches to the new pnfs_v4_clientops in set_pnfs_layoutdriver, after the layout driver has been negotiated with the server and successfully initialized. The pnfs_v4_clientops also contain a reference to the new pnfs_file_operations.<br />
<br />
Four new rpc_ops are introduced to handle which read and write sizes should be used when preparing pages for I/O. These new rpc_ops allow the pNFS code path to choose between the MDS read/write sizes and the data server read/write sizes. There is still a chicken-and-egg problem in the pNFS code path due to this choice being made prior to the request size being known.<br />
<br />
rsize(struct inode *, struct nfs_read_data *)<br />
wsize(struct inode *, struct nfs_write_data *)<br />
rpages(struct inode *, unsigned int *)<br />
wpages(struct inode *, unsigned int *)<br />
<br />
Two new rpc_ops are introduced to separate NFSv2/v3/v4 processing from pNFS processing in the paging setup for read and write:<br />
<br />
pagein_one(struct list_head *, struct inode *)<br />
flush_one(struct inode *, struct list_head *, int, int)<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/pnfs-client-rewrite-05-2006/ Linux pNFS client rewrite patches]</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_MigrationNFS Recovery and Client Migration2006-04-12T20:46:15Z<p>Andros: </p>
<hr />
<div>'''NFS Client Migration and Server Recovery on Clusters'''<br />
<br />
''Background''<br />
<br />
By exporting a shared cluster filesystem using multiple NFS servers, we can provide increased performance and availability through load-balancing and failover. NFSv4 provides some minimal protocol features to allow migration and failover, but there are some implementation challenges, and some small protocol extensions required.<br />
<br />
Consider a few scenarios, in order of increasing complexity:<br />
<br />
1. Server 1 and server 2 share the same cluster filesystem. Server 1 runs an NFS server, <br />
but server 2 doesn't. When server 1 fails or is shut down, an NFS service is started <br />
on server 2, which also takes over server 1's IP address.<br />
2. Server 1, 2, and 3 share the same cluster filesystem. Everything is as in the previous <br />
scenario, except that server 3 is also running a live NFS server throughout; so we need <br />
to ensure that server 3 is not allowed to acquire any locks it shouldn't during the <br />
period that server 2 is taking over.<br />
3. Server 1, 2, and 3 share the same cluster filesystem. Everything is as in the previous <br />
scenario, except that server 2 is already running a live nfs server. We must handle the <br />
failover with minimal interruption to the preexisting clients of server number 2.<br />
4. As in the previous scenario, except that we expect to keep server 1 running throughout, <br />
and migrate possibly only some of its clients. This allows us to do load-balancing. <br />
<br />
The implementation may choose to block essentially all locking activity during the transition, possibly for as long as a grace period, which is on the order of a minute. This may be simpler to implement and may be adequate for some applications. However, we prefer to implement the transition period in such a way that applications see no significant delay. A variety of intermediate behaviors (e.g., limiting delays to certain files) are also possible.<br />
<br />
Finally, the implementation may allow the client to continue to use all its existing state on the new server, or may require the client to go through the same state recovery process it would go through on server reboot. The latter approach requires less intrusive modifications to the nfs server and can be done without requiring locking activity to cease, but there are still optimizations possible using the former method that may reduce latency for the migrating client.<br />
<br />
'''Progress'''<br />
<br />
We are exploring most of these possibilities as part of our Linux NFSv4 server implementation effort.<br />
<br />
'''Protocol issues'''<br />
<br />
In the process of designing the migration implementation for Linux, we have identified two small deficiencies in the NFSv4 protocol that limit the migration scenarios that an NFSv4 implementation can reliably support.<br />
<br />
'''Migration to a live server'''<br />
<br />
Scenarios 3 and 4 above involve migrating clients to an NFSv4 server that is already serving clients of its own. This causes some problems, which we need a little background to explain.<br />
<br />
To manage client state, we first need a reliable way to identify individual clients; ideally, it should allow us to identify clients even across client reboots. Thanks to NAT, DHCP, and userspace NFS clients, the IP address is not a reliable way to identify clients. Therefore the NFSv4 protocol uses a client-generated "client identifier" for this purpose.<br />
<br />
However, to avoid some potential problems caused by servers that have multiple IP addresses, the NFSv4 spec requires the client to calculate the client identifier in such a way that it is always different for different server IP addresses.<br />
<br />
This creates some confusion during migration--the client identifier which the client will present to the new server will differ from the one it presented to the old server, so we do not have a way to track the client across the migration.<br />
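<br />
A self-contained sketch of the effect; the identifier format below is an assumption for illustration, not the format mandated by the spec or used by any particular client:<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
/* Build a client identifier that mixes in the server address, since the<br />
 * spec requires identifiers to differ per server IP. Format is made up. */<br />
static void make_clientid(char *buf, size_t len,<br />
                          const char *client_host, const char *server_ip)<br />
{<br />
        snprintf(buf, len, "Linux NFSv4 %s/%s", client_host, server_ip);<br />
}<br />
<br />
int main(void)<br />
{<br />
        char id1[64], id2[64];<br />
<br />
        make_clientid(id1, sizeof(id1), "client.example.org", "10.0.0.1");<br />
        make_clientid(id2, sizeof(id2), "client.example.org", "10.0.0.2");<br />
        /* Same client, two identifiers: the new server cannot tell that<br />
         * the migrating client is one it should recognize. */<br />
        printf("%s\n%s\n", id1, id2);<br />
        return 0;<br />
}<br />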
<br />
The problem can be avoided in scenarios 1 and 2 by allowing the new server to take over the old server IP address.<br />
<br />
The problem has been discussed in the NFSv4 ietf working group, but a solution has not yet been agreed on. Nevertheless, we expect the problem to be solved in NFSv4.1.<br />
<br />
'''Transparent state migration'''<br />
<br />
We have identified one small protocol change necessary to support transparent migration of state--that is, migration that doesn't require the client to perform lock reclaims as it would on server reboot.<br />
<br />
The problem is that a client, when migrating, does not know whether the server to which it is migrating wants it to continue using the state it acquired on the previous server or wants it to reacquire that state. The client could attempt to find out by trying to use its current state with the new server and seeing what kind of error it gets back. However, there's no guarantee this will work--accidental collisions in the stateids handed out by the two servers may mean, for example, that the server cannot return a helpful error in this case.<br />
<br />
The current NFSv4.1 draft partially solves this problem by defining a new error (NFS4ERR_MOVED_DATA_AND_STATE) that the server can return to simultaneously trigger a client migration and to indicate to the client that the new server is prepared to accept state handed out by the old server. (This solution is only partial because it doesn't help with failover--in that case, it's too late for the old server to return any errors to the client.) The final 4.1 specification will probably contain a more comprehensive solution, so at this point we're confident that the problem will be solved.<br />
<br />
'''Linux implementation issues'''<br />
<br />
'''Scenario 1 and NFSv4 reboot recovery'''<br />
<br />
The Linux implementation currently supports the above scenario 1 with NFSv2/v3. (See, for example, Falko Timme's [http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat Setting up a high-availability NFS server], which actually shares data at the block level instead of a cluster filesystem.) It is possible to support the same scenario in NFSv4 as long as the directory where the NFSv4 server stores its reboot recovery information (/var/lib/nfs/v4recovery/ by default) is located on shared storage. However, there is a regression compared to v2/v3, because v2/v3 also provide synchronous callouts to arbitrary scripts whenever that information changes, so that it can be shared using methods other than shared storage. The NFSv4 reboot recovery information is currently under redesign, and one of the side effects of the new design will be to allow such callouts. This work is not yet completed.<br />
<br />
'''Scenario 2 and grace period control'''<br />
<br />
The simplest way to support scenario 2 is to require clients of node 1 to recover their state on node 2 using the current server reboot recovery mechanism, by forcing both servers 2 and 3 to observe a grace period.<br />
<br />
We have patches to implement this approach available from the "server-state-recovery" branch of our public git repository; see a [http://www.linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=summary browsable version of the repository].<br />
<br />
This approach is unsatisfactory because it forces the NFS server to observe a grace period for all exported filesystems, even those that aren't affected (or even shared across nodes), and because it blocks all locking activity across the cluster for the duration of the grace period.<br />
<br />
Therefore, we have a second design that allows us to limit the impact: instead of simply forcing servers 2 and 3 into grace, we remove all grace period checking from the nfs server itself and allow the underlying filesystem to enforce the grace period when it is called to perform locks. (This new behavior is enabled only for filesystems, such as cluster filesystems, that define lock methods; behavior for other filesystems is unchanged.) In order to enable the filesystem to make correct grace period decisions, we also need to distinguish between "normal" lock operations and reclaims; we accomplish this by flagging reclaim locks when they are passed to the filesystem. [[Client migration via reboot|More detail on this design]]<br />
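<br />
A self-contained model of this division of labor; FL_RECLAIM here is a hypothetical stand-in for whatever flag we end up passing, and the return values merely mark the two denial cases:<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
#define FL_RECLAIM 0x1 /* hypothetical "this lock is a reclaim" flag */<br />
<br />
static int fs_in_grace = 1; /* the filesystem tracks its own grace period */<br />
<br />
/* The filesystem's lock method decides; nfsd no longer checks grace. */<br />
static int clusterfs_lock(int flags)<br />
{<br />
        if (fs_in_grace && !(flags & FL_RECLAIM))<br />
                return -1; /* deny new locks during grace */<br />
        if (!fs_in_grace && (flags & FL_RECLAIM))<br />
                return -2; /* deny reclaims once grace has ended */<br />
        return 0;          /* grant */<br />
}<br />
<br />
int main(void)<br />
{<br />
        printf("new lock during grace: %d\n", clusterfs_lock(0));<br />
        printf("reclaim during grace:  %d\n", clusterfs_lock(FL_RECLAIM));<br />
        return 0;<br />
}<br />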
<br />
The cluster filesystem may then choose how to handle the migration; it may choose to continue to enforce a grace period globally across the whole filesystem, but it is now given the information to enable it to make more sophisticated decisions if it prefers.<br />
<br />
The patches in our git repository, referenced above, include incomplete support for this new design.<br />
<br />
Due to the minimal support for cluster filesystems in the mainstream Linux kernel, these patches (like the byte-range locking patches) are unlikely to be accepted for the time being.<br />
<br />
While implementing this, we also noticed a problem with our current NFSv4 implementation: its grace period is not necessarily synchronized with the grace period used by lockd, which can cause problems in a mixed NFSv2/v3/v4 environment. Patches fixing this are available from our git repository, and we expect them to proceed upstream normally.<br />
<br />
'''Scenario 3 and 4, and reboot recovery interfaces'''<br />
<br />
Like statd, the NFSv4 server is required to maintain information in stable storage in order to keep track of which clients have successfully established state with it. This solves certain problems identified in RFC 3530, where combinations of server reboots and network partitions can otherwise lead to situations where neither client nor server can determine whether the client should still be allowed to reclaim state after a reboot.<br />
<br />
We are in the process of redesigning the linux implementation of this system, partly for reasons given under "Scenario 1 and NFSv4 reboot recovery" above.<br />
<br />
As part of this new design, we need a way for a userland program to tell the NFS server on startup which clients have valid state with the server (after that program has retrieved this information from stable storage).<br />
<br />
Scenarios 3 and 4 require a kernel interface allowing an administrator to migrate particular clients from and to NFS servers. To this end, we plan to use the same reboot recovery interface.<br />
<br />
The interface will consist of a call that takes a client identifier and a status (as an integer).<br />
<br />
For normal nfsd startup, one call will be made for each known client, with a status of 0.<br />
<br />
Similarly, to inform a server that a new client is migrating to it (and hence that it should allow lock reclaims from that client), we will again make one call for that client with status 0.<br />
<br />
To tell a client to initiate a migration event (by returning NFS4ERR_MOVED to it), we'll make a call for that client with status NFS4ERR_MOVED.<br />
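<br />
A sketch of how a userland recovery program might drive this interface; nfsd_set_client_status() is a hypothetical stand-in for whatever the actual call (syscall, nfsdfs file, or upcall) turns out to be:<br />
<br />
#include &lt;stdio.h&gt;<br />
<br />
#define NFS4ERR_MOVED 10019 /* per RFC 3530 */<br />
<br />
/* Hypothetical entry point: one call per client, identifier plus status. */<br />
static int nfsd_set_client_status(const char *clientid, int status)<br />
{<br />
        printf("client %s -> status %d\n", clientid, status);<br />
        return 0;<br />
}<br />
<br />
int main(void)<br />
{<br />
        /* Normal startup: announce each client found in stable storage. */<br />
        nfsd_set_client_status("client-a", 0);<br />
        nfsd_set_client_status("client-b", 0);<br />
<br />
        /* Migration target: allow reclaims from an incoming client. */<br />
        nfsd_set_client_status("client-c", 0);<br />
<br />
        /* Migration source: tell a client to move. */<br />
        nfsd_set_client_status("client-a", NFS4ERR_MOVED);<br />
        return 0;<br />
}<br />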
<br />
This interface may also be extended in the future to allow, for example, administrative revocation of locks.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/NFS_Recovery_and_Client_MigrationNFS Recovery and Client Migration2006-04-12T20:45:42Z<p>Andros: </p>
<hr />
<div>'''NFS Client Migration and Server Recovery on Clusters'''<br />
<br />
''Background''<br />
<br />
By exporting a shared cluster filesystem using multiple NFS servers, we can provide increased performance and availability through load-balancing and failover. NFSv4 provides some minimal protocol features to allow migration and failover, but there are some implementation challenges, and some small protocol extensions required.<br />
<br />
Consider a few scenarios, in order of increasing complexity:<br />
<br />
1. Server 1 and server 2 share the same cluster filesystem. Server 1 runs an NFS server, <br />
but server 2 doesn't. When server 1 fails or is shut down, an NFS service is started <br />
on server 2, which also takes over server 1's IP address.<br />
2. Server 1, 2, and 3 share the same cluster filesystem. Everything is as in the previous <br />
scenario, except that server 3 is also running a live NFS server throughout; so we need <br />
to ensure that server 3 is not allowed to acquire any locks it shouldn't during the <br />
period that server 2 is taking over.<br />
3. Server 1, 2, and 3 share the same cluster filesystem. Everything is as in the previous <br />
scenario, except that server 2 is already running a live nfs server. We must handle the <br />
failover with minimal interruption to the preexisting clients of server number 2.<br />
4. As in the previous scenario, except that we expect to keep server 1 running throughout, <br />
and migrate possibly only some of its clients. This allows us to do load-balancing. <br />
<br />
The implementation may choose to block essentially all locking activity during the transition, possibly for as long as a grace period, which is on the order of a minute. This may be simpler to implement and may be adequate for some applications. However, we prefer to implement the transition period in such a way that applications see no significant delay. A variety of intermediate behaviors (e.g., limiting delays to certain files) are also possible.<br />
<br />
Finally, the implementation may allow the client to continue to use all its existing state on the new server, or may require the client to go through the same state recovery process it would go through on server reboot. The latter approach requires less intrusive modifications to the nfs server and can be done without requiring locking activity to cease, but there are still optimizations possible using the former method that may reduce latency for the migrating client.<br />
<br />
'''Progress'''<br />
<br />
We are exploring most of these possibilities as part of our Linux NFSv4 server implementation effort.<br />
<br />
'''Protocol issues'''<br />
<br />
In the process of designing the migration implementation for Linux, we have identified two small deficiencies in the NFSv4 protocol that limit the migration scenarios that an NFSv4 implementation can reliably support.<br />
<br />
'''Migration to a live server'''<br />
<br />
Scenarios 3 and 4 above involve migrating clients to an NFSv4 server that is already serving clients of its own. This causes some problems, which require a little background to explain.<br />
<br />
To manage client state, we first need a reliable way to identify individual clients; ideally, it should allow us to identify clients even across client reboots. Thanks to NAT, DHCP, and userspace NFS clients, the IP address is not a reliable way to identify clients. Therefore the NFSv4 protocol uses a client-generated "client identifier" for this purpose.<br />
<br />
However, to avoid some potential problems caused by servers that have multiple IP addresses, the NFSv4 spec requires the client to calculate the client identifier in such a way that it is always different for different server IP addresses.<br />
<br />
This creates some confusion during migration--the client identifier which the client will present to the new server will differ from the one it presented to the old server, so we do not have a way to track the client across the migration.<br />
<br />
The problem can be avoided in scenarios 1 and 2 by allowing the new server to take over the old server's IP address.<br />
<br />
The problem has been discussed in the NFSv4 IETF working group, but a solution has not yet been agreed on. Nevertheless, we expect the problem to be solved in NFSv4.1.<br />
<br />
'''Transparent state migration'''<br />
<br />
We have identified one small protocol change necessary to support transparent migration of state--that is, migration that doesn't require the client to perform lock reclaims as it would on server reboot.<br />
<br />
The problem is that a client, when migrating, does not know whether the server to which it is migrating wants it to continue to use the state it acquired on the previous server, or wants it to reacquire its state. The client could attempt to find out by trying to use its current state with the new server and seeing what kind of error it gets back. However, there's no guarantee this will work--accidental collisions in the stateids handed out by the two servers may mean, for example, that the server cannot return a helpful error in this case.<br />
<br />
The current NFSv4.1 draft partially solves this problem by defining a new error (NFS4ERR_MOVED_DATA_AND_STATE) that the server can return both to trigger a client migration and to indicate to the client that the new server is prepared to accept state handed out by the old server. (This solution is only partial because it doesn't help with failover--in that case, it's too late for the old server to return any errors to the client.) The final 4.1 specification will probably contain a more comprehensive solution, so at this point we're confident that the problem will be solved.<br />
<br />
'''Linux implementation issues'''<br />
<br />
'''Scenario 1 and NFSv4 reboot recovery'''<br />
<br />
The Linux implementation currently supports scenario 1 above with NFSv2/v3. (See, for example, Falko Timme's [http://www.howtoforge.com/high_availability_nfs_drbd_heartbeat Setting up a high-availability NFS server], which actually shares data at the block level instead of using a cluster filesystem.) It is possible to support the same scenario in NFSv4 as long as the directory where the NFSv4 server stores its reboot recovery information (/var/lib/nfs/v4recovery/ by default) is located on shared storage. However, there is a regression compared to v2/v3, because the v2/v3 code also provides synchronous callouts to arbitrary scripts whenever that information changes, so that it can be shared using methods other than shared storage. The NFSv4 reboot recovery information is currently under redesign, and one of the side effects of the new design will be to allow such callouts. This work is not yet complete.<br />
<br />
'''Scenario 2 and grace period control'''<br />
<br />
The simplest way to support scenario 2 is to require clients of node 1 to recover their state on node 2 using the current server reboot recovery mechanism, by forcing both servers 2 and 3 to observe a grace period.<br />
<br />
We have patches to implement this approach available from the "server-state-recovery" branch of our public git repository; see a [http://www.linux-nfs.org/cgi-bin/gitweb.cgi?p=bfields-2.6.git;a=summary browsable version of the repository].<br />
<br />
This approach is unsatisfactory because it forces the NFS server to observe a grace period for all exported filesystems, even those that aren't affected (or even shared across nodes), and because it blocks all locking activity across the cluster for the duration of the grace period.<br />
<br />
Therefore, we have a second design that allows us to limit the impact: instead of simply forcing servers 2 and 3 into grace, we remove all grace period checking from the NFS server itself, allowing the underlying filesystem to enforce the grace period when it is called to perform lock operations. (This new behavior is enabled only for filesystems, such as cluster filesystems, that define lock methods; behavior for other filesystems is unchanged.) In order to enable the filesystem to make correct grace period decisions, we also need to distinguish between "normal" lock operations and reclaims; we accomplish this by flagging reclaim locks when they are passed to the filesystem.<br />
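<br />
As a rough illustration of the filesystem side of this design, consider a minimal sketch of a cluster filesystem lock method; examplefs_in_grace() is a hypothetical helper, FL_RECLAIM stands for whatever flag the final patches use to mark reclaims, and posix_lock_file() stands in for the filesystem's real cluster-wide lock processing:<br />
<br />
 /* Sketch only: examplefs_in_grace() and FL_RECLAIM are assumptions. */<br />
 static int examplefs_lock(struct file *filp, int cmd, struct file_lock *fl)<br />
 {<br />
         int reclaim = fl->fl_flags & FL_RECLAIM;<br />
 <br />
         if (examplefs_in_grace(filp)) {<br />
                 /* During the grace period, only reclaims may proceed. */<br />
                 if (!reclaim)<br />
                         return -EAGAIN;<br />
         } else if (reclaim) {<br />
                 /* Outside the grace period, it is too late to reclaim. */<br />
                 return -EAGAIN;<br />
         }<br />
         /* Ordinary processing; a real cluster fs would act node-wide. */<br />
         return posix_lock_file(filp, fl);<br />
 }<br />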
<br />
[[Client migration via reboot|More detail on this design]]<br />
<br />
The cluster filesystem may then choose how to handle the migration; it may choose to continue to enforce a grace period globally across the whole filesystem, but it is now given the information to enable it to make more sophisticated decisions if it prefers.<br />
<br />
The patches in our git repository, referenced above, include incomplete support for this new design.<br />
<br />
Due to the minimal support for cluster filesystems in the mainstream Linux kernel, these patches (like the byte-range locking patches) are unlikely to be accepted for the time being.<br />
<br />
While implementing this we also noticed a problem with our current NFSv4 implementation: its grace period is not necessarily synchronized with the grace period used by lockd, which can cause problems in a mixed NFSv2/v3/v4 environment. Patches fixing this are available from our git repository, and we expect them to proceed upstream normally.<br />
<br />
'''Scenario 3 and 4, and reboot recovery interfaces'''<br />
<br />
Like statd, the NFSv4 server is required to maintain information in stable storage in order to keep track of which clients have successfully established state with it. This solves certain problems identified in RFC 3530, where combinations of server reboots and network partitions can lead to situations where neither client nor server could otherwise determine whether the client should still be allowed to reclaim state after a reboot.<br />
<br />
We are in the process of redesigning the linux implementation of this system, partly for reasons given under "Scenario 1 and NFSv4 reboot recovery" above.<br />
<br />
As part of this new design, we need a way for a userland program to tell the NFS server on startup which clients have valid state with it (after that program has retrieved this information from stable storage).<br />
<br />
Scenarios 3 and 4 require a kernel interface allowing an administrator to migrate particular clients from and to NFS servers. To this end, we plan to use the same reboot recovery interface.<br />
<br />
The interface will consist of a call that takes a client identifier and a status (as an integer).<br />
<br />
For normal nfsd startup, one call will be made for each known client, with a status of 0.<br />
<br />
Similarly, to inform a server that a new client is migrating to it (and hence that it should allow lock reclaims from that client), we will again make one call for that client with status 0.<br />
<br />
To tell a server that it should trigger a migration event for a client, by returning NFS4ERR_MOVED to that client, we'll make a call for that client with status NFS4ERR_MOVED.<br />
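<br />
A minimal sketch of how this interface might be used, assuming a hypothetical nfsd_set_client_status() entry point (the final name and calling convention are not yet settled):<br />
<br />
 /* Hypothetical sketch only; the real interface may differ. */<br />
 int nfsd_set_client_status(const char *client_id, int status);<br />
 <br />
 /* Normal nfsd startup: one call per client found in stable storage. */<br />
 nfsd_set_client_status(clid, 0);<br />
 <br />
 /* Migration target: allow lock reclaims from the incoming client. */<br />
 nfsd_set_client_status(clid, 0);<br />
 <br />
 /* Migration source: begin returning NFS4ERR_MOVED to this client. */<br />
 nfsd_set_client_status(clid, NFS4ERR_MOVED);<br />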
<br />
This interface may also be extended in the future to allow for, for example, administrative revocation of locks.</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/Client_migration_via_rebootClient migration via reboot2006-04-12T20:44:21Z<p>Andros: </p>
<hr />
<div>= Cluster Coherent NFS =<br />
<br />
'''NFSv4 Client Migration via Reboot''' <br />
<br />
<br />
<br />
* [[share state|NFSv4 Open Share State]]<br />
* [[byte range locking|NFS Byte Range Locking]]</div>Androshttps://wiki.linux-nfs.org/wiki/index.php/Cluster_Coherent_NFSv4_and_DelegationsCluster Coherent NFSv4 and Delegations2006-04-05T19:52:21Z<p>Andros: </p>
<hr />
<div>'''Cluster Coherent NFSv4 and Delegations'''<br />
<br />
''Background''<br />
<br />
NFSv4 adds a new protocol feature, Delegations. From RFC 3530:<br />
<br />
"The major addition to NFS version 4 in the area of caching is the ability of the server to delegate certain responsibilities to the client. When the server grants a delegation for a file to a client, the client is guaranteed certain semantics with respect to the sharing of that file with other clients. At OPEN, the server may provide the client either a read or write delegation for the file. If the client is granted a read delegation, it is assured that no other client has the ability to write to the file for the duration of the delegation. If the client is granted a write delegation, the client is assured that no other client has read or write access to the file."<br />
<br />
"Delegations can be recalled by the server. If another client requests access to the file in such a way that the access conflicts with the granted delegation, the server is able to notify the initial client and recall the delegation. This requires that a callback path exist between the server and client. If this callback path does not exist, then delegations can not be granted. The essence of a delegation is that it allows the client to locally service operations such as OPEN, CLOSE, LOCK, LOCKU, READ, WRITE without immediate interaction with the server."<br />
<br />
'''Linux NFSv4 Delegation Support for Cluster Filesystems'''<br />
<br />
The Linux NFSv4 server delegation implementation uses the lease extensions to the VFS lock subsystem (so a lease equals a delegation). Use of the lease subsystem coordinates local access and NFSv4 delegations. The VFS lease subsystem has an fcntl() interface to set and get a lease, and a break_lease function is called in the VFS layer to recall a lease upon a conflicting open (this call still needs to be added to the VFS rename and unlink paths).<br />
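<br />
For reference, the existing fcntl() lease interface looks like this from userspace; F_SETLEASE and F_GETLEASE are the real Linux fcntl commands, and only the file name is invented:<br />
<br />
 #define _GNU_SOURCE<br />
 #include <fcntl.h><br />
 #include <stdio.h><br />
 <br />
 int main(void)<br />
 {<br />
         int fd = open("/tmp/leasedemo", O_RDONLY);<br />
 <br />
         if (fd < 0)<br />
                 return 1;<br />
         if (fcntl(fd, F_SETLEASE, F_RDLCK) == -1)   /* take a read lease */<br />
                 perror("F_SETLEASE");<br />
         printf("current lease type: %d\n", fcntl(fd, F_GETLEASE));<br />
         fcntl(fd, F_SETLEASE, F_UNLCK);             /* drop the lease */<br />
         return 0;<br />
 }<br />
<br />
A lease holder is notified of a conflicting open by a signal (SIGIO by default), the userspace analogue of the break_lease recall described here.<br />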
<br />
The open syscall gives NFSD the opportunity to hand out a delegation. A conflicting open forces a delegation recall. The conflicting open could come from local access, NFS access, Samba access, etc. Once a file has been delegated to any client, every OPEN must check whether a delegation recall related to the requested OPEN access is in progress (returning NFS4ERR_DELAY if so) before the OPEN is granted.<br />
<br />
If the requested OPEN access forces a delegation recall, NFSD initiates a CB_RECALL on all conflicting delegations. This is currently implemented using the VFS layer break_lease call, which notifies lease holders when a conflicting OPEN has occurred. The VFS layer makes this determination without consulting the underlying file system.<br />
<br />
Finally, NFSD determines if it can hand out a delegation on the file for the requested OPEN. The VFS lease subsystem does this by examining in-memory inode fields to determine if there are any writers (to grant a READ delegation) or any readers or writers (to grant a WRITE delegation). The underlying file system will need to be consulted to make this determination.<br />
<br />
If NFSD decides to grant a delegation, it needs to tell the underlying file system so that the file system can notify NFSD to recall the delegation at a later time.<br />
<br />
'''Tasks'''<br />
<br />
* Ask file system to check for delegation recall in progress prior to granting an OPEN, <br />
granting a delegation, or initiating a recall.<br />
* Set up a callback from the file system to notify an NFSv4 server to perform a CB_RECALL <br />
upon a conflicting OPEN from another node.<br />
* Ask the file system if a delegation can be granted.<br />
* Tell the file system that the VFS on a node has detected a lease conflict (rename, <br />
unlink, etc) and that any delegations should be recalled.<br />
<br />
'''Proposed Implementation'''<br />
<br />
Extend the set/get/breaklease interfaces to support cluster file systems. The extensions will resemble the POSIX locking extensions (callbacks, etc.).<br />
<br />
What we probably need is new inode operations:<br />
<br />
* break_lease(inode, mode)<br />
* setlease(filp, mode)<br />
* getlease(filp, &mode)<br />
<br />
Here mode can be one of read, write, or unlock. Should we also allow the mode to be OR'ed with a nonblocking flag?<br />
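<br />
As a purely illustrative sketch (the constant names and values are invented here, and the placement and argument questions discussed below remain open):<br />
<br />
 /* Illustrative only: names, values, and placement are undecided. */<br />
 #define LEASE_RDLCK     1    /* read lease */<br />
 #define LEASE_WRLCK     2    /* write lease */<br />
 #define LEASE_UNLCK     3    /* remove lease */<br />
 #define LEASE_NONBLOCK  0x10 /* possibly OR'ed into the mode */<br />
 <br />
 struct inode_operations {<br />
         /* ... existing operations ... */<br />
         int (*break_lease)(struct inode *inode, int mode);<br />
         int (*setlease)(struct file *filp, int mode);<br />
         int (*getlease)(struct file *filp, int *mode);<br />
 };<br />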
<br />
The VFS lease subsystem includes a series of lock manager callbacks. Will these be sufficient for the cluster filesystem case?<br />
<br />
Actually, the current setlease and getlease functions use a struct file_lock instead of (or in addition to) the mode. Do we need that?<br />
<br />
Also, setlease and getlease could be file operations instead of inode operations. This is probably a fairly arbitrary choice.<br />
<br />
To handle the possibility that break_lease, setlease, getlease, etc. might block, even in the absence of contention, we might want to allow an -EINPROGRESS return to be followed by a callback, e.g. break_lease_result(inode, stat), where stat might be -EAGAIN (we're waiting for the lease to be broken) or OK (it was immediately broken, or there never was one).<br />
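<br />
A sketch of that asynchronous variant, with all names here illustrative:<br />
<br />
 /* break_lease may complete immediately, or defer to a callback. */<br />
 int examplefs_break_lease(struct inode *inode, int mode)<br />
 {<br />
         if (!examplefs_lease_conflict(inode, mode))<br />
                 return 0;                    /* no lease to break */<br />
         examplefs_start_recall(inode, mode); /* recall on other nodes */<br />
         return -EINPROGRESS;                 /* result comes via callback */<br />
 }<br />
 <br />
 /* Provided by the caller (e.g. nfsd): stat is -EAGAIN while we are<br />
  * still waiting for the lease to be broken, or 0 once it is gone<br />
  * (or there never was one). */<br />
 void break_lease_result(struct inode *inode, int stat);<br />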
<br />
'''Status'''</div>Andros