= GFS2 Setup Notes - cluster3, 2.6.27 kernel =
With the release of Fedora 10 in early October 2008 ('''Update:''' since delayed until December), Red Hat's newest version of their cluster suite ("'''cluster3'''") will go prime-time. In the last couple of months, one of cluster3's main dependencies split into two parts, '''corosync''' and '''openAIS''', and building has been problematic at times. These are my notes from my latest GFS2 setup, and the first time I'm moving everything to 2.6.27.

==The parts:==
As of this moment, cluster3's dependencies aren't yet packaged as RPMs, so building from source is a must. The particular revisions are changing almost every day, so you'll have to consult the [http://sources.redhat.com/cluster/wiki/ cluster project wiki] for current versions.

Even overnight, this just changed for me. Yikes! I'm now basing off of cluster3 '''cluster-2.99.10''', '''corosync svn r1667''', and '''openAIS svn r1651'''.

* I used yum to get: '''libvolume_id-devel''', '''libxml2''', '''libxml2-devel''', '''openldap''', '''openldap-devel''', '''readline''' (likely installed already), and '''readline-devel'''.
** (Note: I only needed the readline stuff when I went back to <tt>LVM2.2.02.39</tt>, as ...40 was broken.)
* get [ftp://sources.redhat.com/pub/dm/ the latest device-mapper]
* use svn to check out the corosync repository
** <tt>$ svn checkout --revision 1667 http://svn.fedorahosted.org/svn/corosync</tt>
*** or <tt>svn update --revision ...</tt>
** <tt>$ cd corosync && svn export . ../corosync-checkout/</tt>
** the build stuff is in the "trunk" directory.
* use svn to check out the openAIS repository
** <tt>$ svn checkout --revision 1651 http://svn.fedorahosted.org/svn/openais</tt>
** <tt>$ cd openais && svn export . ../openais-checkout/</tt>
** the build stuff is in the "trunk" directory.
* use git to clone my cluster3 repository
** <tt>$ git-clone git://git.linux-nfs.org/projects/richterd/cluster.git</tt>
** my "pnfs-gfs2-dev" branch is where current development goes.
* get [ftp://sources.redhat.com/pub/lvm2 the latest LVM2] (a fetch/unpack sketch for the two tarballs follows this list)
** '''note''': <tt>LVM2.2.02.40</tt> is broken (?!?) -- a minor build issue, but I can't believe they released something that has an undefined symbol.
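For the two tarball dependencies, fetching and unpacking is just wget and tar. The exact filenames below are only examples (they're the versions linked elsewhere in these notes) and will have moved on by the time you read this:

<pre>
# Fetch and unpack the tarball dependencies (filenames are examples; grab whatever is current)
wget ftp://sources.redhat.com/pub/dm/device-mapper.1.02.27.tgz
wget ftp://sources.redhat.com/pub/lvm2/LVM2.2.02.39.tgz
tar xzf device-mapper.1.02.27.tgz
tar xzf LVM2.2.02.39.tgz
</pre>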

==The build:==
* build/install the device-mapper
* build/install corosync (shouldn't even need to configure it; this is nice and easy now!)
* build/install openAIS (shouldn't even need to configure it)
* before doing the cluster3 stuff, you'll need to be running a 2.6.27-based kernel and have its sources available
* build/install the cluster3 stuff
** I had to point it at openAIS, and I generally disable the <tt>rgmanager</tt> stuff. Also now disabling the perl/python bindings, and I build the kernel module separately anyway.
** <tt>$ ./configure --openaislibdir=/usr/lib/openais --openaisincdir=/usr/include --without_rgmanager --without_bindings --without_kernel_modules</tt>
* build/install LVM2, making sure to specify the <tt>clvmd</tt> type
** <tt>$ ./configure --with-lvm1=none --with-clvmd=cman --prefix=/usr</tt>
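Putting the above together, the whole pass looks roughly like the sketch below. It's a sketch, not a recipe: the directory names are placeholders for whatever you checked out or unpacked, and the only configure flags I'm sure about are the ones already quoted above.

<pre>
# Rough build order (directory names are placeholders; adjust to your checkouts/tarballs)
cd device-mapper.1.02.xx   && ./configure && make && sudo make install && cd ..
cd corosync-checkout/trunk && make && sudo make install && cd ../..     # no configure step needed for me
cd openais-checkout/trunk  && make && sudo make install && cd ../..     # ditto
cd cluster && ./configure --openaislibdir=/usr/lib/openais --openaisincdir=/usr/include \
      --without_rgmanager --without_bindings --without_kernel_modules && \
      make && sudo make install && cd ..
cd LVM2.2.02.39 && ./configure --with-lvm1=none --with-clvmd=cman --prefix=/usr && \
      make && sudo make install && cd ..
</pre>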

===The rest of the cluster:===
So far, I'd built all of this on a single cluster node. I set that node up as an NFS server and exported my top-level build directory. Then, on each cluster node, I mounted the export and just did the <tt>make install</tt> steps.
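In concrete terms that amounts to something like the following; the hostnames and build path here are made up, so substitute your own:

<pre>
# On the build node (say "guest1", with everything built under /home/dmr/build)
echo '/home/dmr/build  guest2(ro,no_root_squash) guest3(ro,no_root_squash)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each of the other nodes
sudo mkdir -p /mnt/build
sudo mount guest1:/home/dmr/build /mnt/build
cd /mnt/build/cluster && sudo make install    # repeat for device-mapper, corosync, openAIS, LVM2
</pre>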

==Shared storage:==
For testing, I use ATA over Ethernet (AoE) and have had fairly good results with it.
* yum-installed the AoE initiator (client) '''<tt>aoetools-23-1</tt>''' across the cluster
* downloaded the AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on a separate host (<tt>rhclhead</tt>).
** with <tt>dd</tt>, I created a 1GB empty file, <tt>AOE_SHARED_STORAGE</tt>
** .. and exported it: <tt>$ sudo vbladed 2 3 eth0 AOE_SHARED_STORAGE</tt> (shelf 2, slot 3)
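Spelled out, the target-side setup is just a couple of commands. The <tt>dd</tt> invocation is one way to make the backing file; the size and interface name are whatever suits your setup:

<pre>
# On the AoE target host (rhclhead): create a 1GB backing file and export it as shelf 2, slot 3
dd if=/dev/zero of=AOE_SHARED_STORAGE bs=1M count=1024
sudo vbladed 2 3 eth0 AOE_SHARED_STORAGE

# On each cluster node: load the initiator; the export shows up as /dev/etherd/e2.3
sudo modprobe aoe
</pre>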

==Creating the filesystem:==
* prep the volume with LVM2 metadata: <tt>$ sudo pvcreate -M 2 /dev/etherd/e2.3</tt>
* create the volume group '''DMRVolGroup''': <tt>$ sudo vgcreate -M 2 -s 1m -c y DMRVolGroup /dev/etherd/e2.3</tt>
* edit <tt>/etc/lvm/lvm.conf</tt> across the cluster and set the locking type to DLM (<tt>locking_type = 3</tt>).
* make sure you have a properly configured <tt>/etc/cluster/cluster.conf</tt> also set up across the cluster ('''DMRCluster''', in my case); a rough sample is sketched below, after this list.
* now, bring up the cluster: <tt>pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt>
* create the logical volume '''DMRVolume''': <tt>$ sudo lvcreate -n DMRVolume -l 100%VG DMRVolGroup</tt>
* create the GFS2 filesystem '''DMRFS''': <tt>$ sudo gfs2_mkfs -j 4 -p lock_dlm -t DMRCluster:DMRFS /dev/DMRVolGroup/DMRVolume</tt>
** note: the <tt>-j</tt> ("number of journals") argument needs to be appropriate for your cluster size.
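For reference, here is roughly what a minimal three-node <tt>cluster.conf</tt> could look like. This is a sketch from memory, not my actual config -- the node names are placeholders, no real fencing is configured, and you should check the cluster project wiki for the exact schema your cluster3 snapshot expects. Remember to bump <tt>config_version</tt> whenever you edit it.

<pre>
# Hypothetical minimal /etc/cluster/cluster.conf for a 3-node cluster named DMRCluster
sudo tee /etc/cluster/cluster.conf >/dev/null <<'EOF'
<?xml version="1.0"?>
<cluster name="DMRCluster" config_version="1">
  <clusternodes>
    <clusternode name="guest1" nodeid="1"/>
    <clusternode name="guest2" nodeid="2"/>
    <clusternode name="guest3" nodeid="3"/>
  </clusternodes>
  <fencedevices/>
</cluster>
EOF
</pre>

Once the cluster is up and the filesystem exists, mounting it on a node is just <tt>sudo mount -t gfs2 /dev/DMRVolGroup/DMRVolume /mnt/dmrfs</tt> (mount point of your choosing).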
= pNFS =

'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.

== General Information ==

* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]

* [[pNFS Setup Instructions]] - Basic pNFS setup instructions.

* [[GFS2 Setup Notes - cluster3, 2.6.27 kernel]]

* [[Older GFS2 Setup Notes - first pass, in VMWare, and upgrading from cluster2 to cluster3]]

== Development Resources ==

* [[pNFS Development Git tree]]

* [[pNFS Git tree recipies|pNFS Git tree recipes]]

* [[Wireshark Patches]]

== Current Issues ==
* [[pNFS Todo List]]

* [[pNFS Implementation Issues]]

* [[Bakeathon 2007 Issues List]]

* [[pNFS Development Road Map]]

* [http://spreadsheets.google.com/pub?key=pGVvgce8dC-WWbowI9TSmEg Linux pNFS Development Gantt Chart]

== Old Issues ==
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]

* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]
* [[pNFS todo List 2007|pNFS todo List July 2007]]

= Older GFS2 Setup Notes - first pass, in VMWare, and upgrading from cluster2 to cluster3 =
* [[GFS2 Setup Notes]] are basic install notes from setting up a small cluster (perhaps useful for the GFS2 MDS work).

* [[GFS2 Cluster in VMware]] is a follow-up where I quickly set up a 3-node cluster on my laptop for use at Connectathon.

* [[GFS2 cluster3 userland notes]] are rough notes from my first stab at upgrading the GFS2 userland from cluster2 to cluster3.
= GFS2 cluster3 userland notes =
==='''Update'''===
<tt>'''August 6, 2008'''</tt>: the current cluster3 code from the cluster.git repo only builds against revision 1579 of openAIS. So, to get that:
* <tt>$ svn checkout --revision 1579 http://svn.fedorahosted.org/svn/openais</tt>
Um, this might not even be true tomorrow :) The reason is that openAIS is being split into two pieces, with "corosync" becoming the core system and the new "openAIS" component handling the SA Forum APIs.

==Purpose==
IBM and CITI are working to integrate GFS2 with pNFS, with the purpose of demonstrating that an in-kernel cluster filesystem can be successfully exported over pNFS and take advantage of pNFS's capabilities.

Part of the work involves extending existing GFS2 userland tools and daemons to handle pNFS requests for state information and the like. That task requires developing an out-of-band GFS2-specific control channel so that pNFS servers exporting GFS2 can issue and process these requests during the course of normal NFS processing.

The version of the GFS2 userland that was current when the GFS2/pNFS work began is referred to as "cluster2"; however, as work was getting under way, David Teigland at Red Hat (lead developer of the cluster suite) suggested that new development be integrated with the next version of the cluster suite.

==Background==
There are 3 versions of the GFS cluster suite that Red Hat ships, referred to simply as cluster1, cluster2, and cluster3.
* cluster1 (RHEL4-ish, IIRC) was mostly (all?) implemented in-kernel; it was tricky, and was redesigned for a variety of reasons.
* cluster2 (RHEL5, Fedora 9) moves several of the daemons into userland and makes use of [http://www.openais.org OpenAIS], a big, powerful framework beyond the scope of these notes. One of the main daemons became an OpenAIS plugin; Red Hat is making a deliberate effort to use things from and give things back to the open source community, rather than sticking to building everything in-house.
* cluster3 (Fedora 10, ..) continues the progression, integrating things more closely with OpenAIS and removing a bunch of code that cluster2 used to bridge between existing daemons and OpenAIS. Although cluster3 is still under active development, it is going to be in the wild around early October when Fedora 10 is released; that makes cluster3 the place to focus. However, things like build and configuration setups are still sketchy -- and their development repo is updated many times a day -- so a little persistence is required.

==Setup==
====No cluster2====
First off, you can save yourself a lot of hassle by '''''not''''' starting out with an existing cluster2 install; I bet this whole thing would've been pretty easy otherwise. I made that mistake and consequently spent a lot of time picking things apart. If these things are lurking around on your system, you'll probably want to remove them first:
* /sbin/gfs_controld, /sbin/gfs_tool, /etc/init.d/cman, /etc/cluster/cluster.conf, /etc/init.d/clvmd
** for ease of removal, you can find the original RPM package names like this: <tt>$ sudo rpm -q --whatprovides /etc/cluster/cluster.conf</tt>
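The query-then-erase pattern looks like this in full; the package name in the second command is only an example, so remove whatever packages the query actually reports:

<pre>
# Find which package owns a leftover cluster2 file, then remove that package
sudo rpm -q --whatprovides /etc/cluster/cluster.conf
sudo rpm -e cman            # example package name; substitute what the query printed
</pre>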

====The parts====
Get the newest versions of things:
* you'll need <tt>'''libvolume_id-devel'''</tt>, but that's okay to get from an RPM.
* [http://sources.redhat.com/dm/ latest device-mapper source]
* use <tt>svn</tt> to check out the openAIS repo:
** <tt>$ svn checkout http://svn.osdl.org/openais</tt>
** <tt>$ cd openais && svn export . ../openais-checkout/</tt>
** the build stuff is in the "<tt>trunk</tt>" subdirectory.
* use <tt>git</tt> to clone the cluster3 repo:
** <tt>$ git-clone http://sources.redhat.com/git/cluster.git</tt>
** their branch "master" is their ongoing cluster3 development.
* [ftp://sources.redhat.com/pub/lvm2 latest LVM2 source]

====The build====
* build the device-mapper first; shouldn't be a problem.
* next, openAIS; I keep having to futz with the <tt>DESTDIR</tt> string in their <tt>Makefile.inc</tt> -- it's not playing correctly with the <tt>--prefix</tt> option.
* before you can build cluster3, you need to already be running a 2.6.26-based kernel and have its build sources available, so snag/build/boot the 2.6.26-based pnfs-gfs2 kernel.
* cluster3 took me several tries -- but it seems like nearly every problem was related to the existing cluster2 install.
** <tt>$ ./configure --openaislibdir=/usr/lib/openais --openaisincdir=/usr/include --dlmincdir=/lib/modules/2.6.26-pnfs/source/include</tt>
* last, build LVM2. Make sure to specify the <tt>clvmd</tt> type; I always disable LVM1 compatibility:
** <tt>$ ./configure --with-lvm1=none --with-clvmd=cman --prefix=/usr</tt>

Huh.. I had more notes on the afternoon it took me to sort out and finally get cluster3 working, but I'm not seeing them. In the end, I had to run around with <tt>ldd</tt> and verify that everything really was linking the right ways. Maybe this will all be really easy for everyone else and I just got unlucky <shrug>.
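A couple of spot-checks with <tt>ldd</tt> go a long way here. The binary paths below are a guess at where a source install with these configure options lands; adjust them for your prefix and for whichever daemons you actually built:

<pre>
# Verify that the freshly installed binaries picked up the freshly built libraries
# (paths depend on your configure prefix; these are just the usual suspects)
ldd /usr/sbin/clvmd
ldd /usr/sbin/gfs_controld
</pre>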

====Other bits====
* you'll need a [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/cluster.conf /etc/cluster/cluster.conf]. This is a sample for a 3-node cluster.
* you'll also need a [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/lvm.conf /etc/lvm/lvm.conf], but there's not really any tweaking you'll need to do other than make sure that the <tt>locking_type</tt> is set to DLM (3, IIRC).
* also, the <tt>/etc/init.d/</tt> scripts for starting/stopping the services. I've hacked on them some so they work in a cluster3 setup, but they are by no means perfect: [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/cman.init cman.init] and [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/clvmd.init clvmd.init]
* '''Note''': cluster3 has two different modes of operation -- one which is back-compatible with a cluster2 environment and one which only works with other cluster3 members. We want the new, cleaner code paths, and so we run in the cluster3-only mode. You can set this up two different ways (note that my init scripts and sample cluster.conf above do both, meh):
** in <tt>/etc/cluster/cluster.conf</tt>, add the entry <tt><group groupd_compat="0"/></tt>
** when starting the daemons, start them all with <tt>-g0</tt>
* once you've brought up the cluster, you can then go create a gfs2 filesystem and be on your way.
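With the init scripts in place, bringing a node into the cluster is the usual service dance (the same commands the other setup pages use):

<pre>
# On each node, once cluster.conf and lvm.conf are in place
sudo service cman start
sudo service clvmd start
</pre>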

.... more will be added here as work progresses. In particular, there'll be a writeup all about the addition of the pNFS control channel to cluster3.

==Troubleshooting==
I wish I'd kept more careful notes about the things that went wrong. I'll spool future things into here.
====LVM can't see my volume group any longer??====
During the upgrade from cluster2 to cluster3, one of the machines somehow lost sight of the ATA-over-Ethernet device that I'm using for the cluster's shared storage. The problem wasn't with the <tt>aoe</tt> module, though; <tt>lvscan</tt> simply never saw the device, even though the other two nodes ''could'' see it.

Turns out that LVM actually got confused somehow -- I'd been under the impression that, sure, while it does maintain a cache of devices (<tt>/etc/lvm/cache/.cache</tt>), it'd nevertheless grok new ones one way or another. And it always had, until now -- it wasn't until I edited that cache file by hand and added the AoE device's <tt>/dev/</tt> entry that <tt>lvscan</tt> was able to see it. Good thing to keep in mind for future debugging: apparently it ''is'' possible for LVM's device cache to go stale, and I didn't see anything in any manpages about how to poke it with <tt>lvm</tt> or something.
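For future reference, this is roughly what I'd try first next time. The hand-edit step is whatever editor you like (the cache file has its own internal format, so match the existing entries); the <tt>vgscan</tt> alternative is an assumption on my part -- it rescans devices, but I haven't confirmed it repopulates this cache in every case:

<pre>
# See whether LVM's device cache knows about the AoE device at all
grep etherd /etc/lvm/cache/.cache || echo "not cached"

# If it's missing: either add it by hand with an editor (match the existing entry format),
# or drop the cache and let a rescan rebuild it (unverified that this covers every case)
sudo rm /etc/lvm/cache/.cache
sudo vgscan
sudo lvscan
</pre>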
= GFS2 Cluster in VMware =
==VMware==
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt> (running openSuSE 10.2, 2GB of RAM).
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt>
* yum-installed '''libvolume_id-devel''' ''(I also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels -- '''stick with source''')''
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''
** [ftp://sources.redhat.com/pub/dm/device-mapper.1.02.27.tgz device-mapper.1.02.27.tgz]
** [ftp://ftp%40openais%2Eorg:downloads@openais.org/downloads/openais-0.80.3/openais-0.80.3.tar.gz openais-0.80.3.tar.gz]
** [ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.04.tar.gz cluster-2.03.04.tar.gz]
** [ftp://sources.redhat.com/pub/lvm2/LVM2.2.02.39.tgz LVM2.2.02.39.tgz]

==ATA over Ethernet (for guest cluster shared storage)==
* yum-installed the AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt>
* downloaded the AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt>
* I set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt>
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(shelf 0, slot 1)''
** <tt>[guest1] $ sudo modprobe aoe</tt>
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and it immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!
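If a device doesn't show up, aoetools has a couple of quick checks (standard aoetools commands; the device name is the one from this setup):

<pre>
# On guest1: list discovered AoE targets and poke the network for new ones
sudo aoe-stat
sudo aoe-discover
ls -l /dev/etherd/
</pre>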

==LVM and GFS2 setup==
* prep the physical volume for LVM:
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt>
* create the volume group '''GuestVolGroup''' and add the whole AoE "device" to it:
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt>
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set the locking_type to DLM
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.
* edit <tt>/etc/cluster/cluster.conf</tt>, name the cluster <tt>'''GuestCluster'''</tt>, and set up the three nodes with manual (read: ignored) fencing.
* bring up the cluster:
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt>
* create the logical volume '''GuestVolume''' and assign the full volume group to it:
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt>
* .. and make a GFS2 fs therein:
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt>
* restart the daemons, then mount, and your VMware GFS2 cluster should be good to go! <tt>:)</tt>
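The mount step itself is the standard one on each guest; the mount point below is just a placeholder:

<pre>
# On each guest, once cman and clvmd are happy
sudo mkdir -p /mnt/guestfs
sudo mount -t gfs2 /dev/GuestVolGroup/GuestVolume /mnt/guestfs
</pre>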
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM) </tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)<br />
<br />
==Update: reactions from Connectathon '08==<br />
The purpose of this entire VMware/GFS2 setup in the first place was so I could work on a pNFS/GFS2 MDS at Connectathon '08 with Frank Filz, Dean Hildebrand, and Marc Eshel (all gentlemen from IBM). <br />
<br />
On the one hand, once I had a primary guest system set up and could just clone it to make a cluster, it was very easy to make kernel changes, rebuild, push things out to the cluster, and reboot.<br />
<br />
The downside came during testing, when we tried doing pNFS writes of several KB or more -- the RPC layer would barf on the packet with a message like "Error: bad tcp reclen". Fortunately, Dean recalled that Ricardo Labiaga had had a similar problem with KVM (or UML?) at the fall 2007 CITI Bakeathon, so we started to suspect VMware. I quickly set up two laptops to act as GFS2 nodes, accessing shared storage with AoE. I shut down the VMware cluster, reconfigured things so that one VMware node and the two new laptops formed a 3-node GFS2 cluster, and brought up the new cluster. Then, using the node in VMware as a pNFS MDS and the two laptops as DSes, we almost immediately were able to pass the Connectathon test suite.<br />
<br />
'''The verdict''': VMware Workstation 6 still totally impresses me, but it's probably better to do cluster work on an actual cluster. That said, my I/O troubles may just stem from my laptop, or my particular NIC driver, or whatever -- I can't imagine that there aren't ways to resolve that somehow.<br />
<br />
==Tidbits==<br />
Here are some links to basic things that you might not get "for free" if you build from source (I dunno).<br />
* [http://www.citi.umich.edu/u/richterd/Oliber/cluster.conf /etc/cluster/cluster.conf] is the main thing to configure, but easy for simple setups<br />
* [http://www.citi.umich.edu/u/richterd/Oliber/lvm.conf /etc/lvm/lvm.conf] is normally fine as-is; just make sure that <tt>locking_type</tt> is set to 3 (DLM).<br />
* [http://www.citi.umich.edu/u/richterd/Oliber/cman /etc/init.d/cman] init script for bringing up the cluster suite<br />
* [http://www.citi.umich.edu/u/richterd/Oliber/clvmd /etc/init.d/clvmd] init script for bringing up the cluster-savvy LVM2 daemon.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_cluster3_userland_notesGFS2 cluster3 userland notes2008-07-25T19:09:42Z<p>Richterd: /* LVM can't see my volume group any longer?? */</p>
<hr />
<div>(7/24/2008 - still being written..)<br />
==Purpose==<br />
IBM and CITI are working to integrate GFS2 with pNFS, with the goal of demonstrating that an in-kernel cluster filesystem can be successfully exported over pNFS and take advantage of pNFS's capabilities.<br />
<br />
Part of the work involves extending existing GFS2 userland tools and daemons to handle pNFS requests for state information and the like. That task requires developing an out-of-band GFS2-specific control channel so that pNFS servers exporting GFS2 can issue and process these requests during the course of normal NFS processing.<br />
<br />
The version of the GFS2 userland that was current when the GFS2/pNFS work began is referred to as "cluster2"; however, as work was getting under way, David Teigland at Red Hat (lead developer of the cluster suite) suggested that new development be integrated with the next version of the cluster suite.<br />
<br />
==Background==<br />
There are 3 versions of the GFS cluster suite that Red Hat ships, referred to simply as cluster1, cluster2, and cluster3.<br />
* cluster1 (RHEL4-ish, IIRC) was mostly (all?) implemented in-kernel; it was tricky, and it was redesigned for a variety of reasons.<br />
* cluster2 (RHEL5, Fedora 9) moves several of the daemons into userland and makes use of [http://www.openais.org OpenAIS], a big powerful framework beyond the scope of these notes. One of the main daemons became an OpenAIS plugin; Red Hat is making a deliberate effort to use things from and give things back to the open source community, rather than sticking to building everything in-house.<br />
* cluster3 (Fedora 10, ..) continues the progression, integrating things more closely with OpenAIS and removing a bunch of code that cluster2 used to bridge between existing daemons and OpenAIS. Although cluster3 is still under active development, it will be in the wild around early October when Fedora 10 is released; that makes cluster3 the place to focus. However, things like build and configuration setups are still sketchy -- and their development repo is updated many times a day -- so a little persistence is required.<br />
<br />
==Setup==<br />
====No cluster2====<br />
First off, you can save yourself a lot of hassle by '''''not''''' starting out with an existing cluster2 install; I bet this whole thing would've been pretty easy otherwise. I made that mistake and consequently spent a lot of time picking things apart. If these things are lurking around on your system, you'll probably want to remove them first:<br />
* /sbin/gfs_controld, /sbin/gfs_tool, /etc/init.d/cman, /etc/cluster/cluster.conf, /etc/init.d/clvmd<br />
** for ease of removal, you can find the original RPM package names like this: <tt>$ sudo rpm -q --whatprovides /etc/cluster/cluster.conf</tt><br />
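For instance, to find and then remove the package owning one of those files (the package name is whatever <tt>rpm</tt> reports back, not something to copy literally):<br />
 $ sudo rpm -q --whatprovides /sbin/gfs_controld
 $ sudo rpm -e <package-reported-above>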
<br />
====The parts====<br />
Get the newest versions of things:<br />
* you'll need <tt>'''libvolume_id-devel'''</tt>, but that's okay to get from an RPM.<br />
* [http://sources.redhat.com/dm/ latest device-mapper source]<br />
* use <tt>svn</tt> to clone the openAIS repo: <br />
** <tt>$ svn checkout http://svn.osdl.org/openais</tt><br />
** <tt>$ cd openais && svn export . ../openais-checkout/</tt><br />
** the build stuff is in the "<tt>trunk</tt>" subdirectory.<br />
* use <tt>git</tt> to clone the cluster3 repo:<br />
** <tt>$ git-clone http://sources.redhat.com/git/cluster.git</tt><br />
** their branch "master" is their ongoing cluster3 development.<br />
* [ftp://sources.redhat.com/pub/lvm2 latest LVM2 source]<br />
<br />
====The build====<br />
* build the device-mapper first, shouldn't be a problem. <br />
* next, openAIS; I keep having to futz with the <tt>DESTDIR</tt> string in their <tt>Makefile.inc</tt> -- it doesn't play nicely with the <tt>--prefix</tt> option.<br />
* before you can build cluster3, you need to already be running a 2.6.26-based kernel and have its build sources available, so snag/build/boot the 2.6.26-based pnfs-gfs2 kernel.<br />
* cluster3 took me several tries -- but it seems like nearly every problem was related to the existing cluster2 install.<br />
** <tt>$ ./configure --openaislibdir=/usr/lib/openais --openaisincdir=/usr/include --dlmincdir=/lib/modules/2.6.26-pnfs/source/include</tt><br />
* last, build LVM2. Make sure to specify the <tt>clvmd</tt> type, and I always disable LVM1 compatibility:<br />
** <tt>$ ./configure --with-lvm1=none --with-clvmd=cman --prefix=/usr</tt><br />
<br />
Huh.. I had more notes on the afternoon it took me to sort out and finally get cluster3 working, but I'm not seeing them. In the end, I had to run around with <tt>ldd</tt> and verify that everything really was linking the right way. Maybe this will all be really easy for everyone else and I just got unlucky <shrug>.<br />
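A quick way to do that linkage check is to point <tt>ldd</tt> at the freshly-built binaries and make sure every library resolves to the copy you just installed (no "not found" lines, and the directories you expect), e.g.:<br />
 $ ldd $(which clvmd)            # or the full path to wherever you installed it
 $ ldd $(which gfs_controld)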
<br><br />
====Other bits====<br />
* you'll need a [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/cluster.conf /etc/cluster/cluster.conf]. This is a sample for a 3-node cluster.<br />
* you'll also need a [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/lvm.conf /etc/lvm/lvm.conf], but there's not really any tweaking you'll need to do other than make sure that the <tt>locking_type</tt> is set to DLM (3, IIRC).<br />
* also, the <tt>/etc/init.d/</tt> scripts for starting/stopping the services. I've hacked on them some so they work in a cluster3 setup, but are by no means perfect. [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/cman.init cman.init] and [http://www.citi.umich.edu/u/richterd/pnfs-gfs2/clvmd.init clvmd.init]<br />
* '''Note''': cluster3 has two different modes of operation -- one that is backward-compatible with a cluster2 environment and one that only works with other cluster3 members. We want the new, cleaner code paths, and so we run in cluster3-only mode. You can set this up in two different ways (note that my init scripts and sample cluster.conf above do both, meh -- see the small fragment sketched after this list):<br />
** in <tt>/etc/cluster/cluster.conf</tt>, add the entry <tt><group groupd_compat="0"/></tt><br />
** when starting the daemons, start them all with <tt>-g0</tt><br />
* once you've brought up the cluster, you can then go create a gfs2 filesystem and be on your way.<br />
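To make the <tt>groupd_compat</tt> bit concrete, the entry goes at the top level of <tt>cluster.conf</tt>, directly inside the <tt><cluster></tt> element -- the name and version attributes below are just placeholders:<br />
 <cluster name="mycluster" config_version="1">
   <group groupd_compat="0"/>
   ... clusternodes, fencedevices, etc. ...
 </cluster>
The <tt>-g0</tt> alternative just means passing that flag when you start each daemon by hand, e.g. <tt>$ sudo gfs_controld -g0</tt> (and likewise for the other daemons).<br />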
<br />
<br />
.... more will be added here as work progresses. In particular, there'll be a writeup all about the addition of the pNFS control channel to cluster3.<br />
<br />
<br />
==Troubleshooting==<br />
I wish I'd kept more careful notes about the things that went wrong. I'll spool future things into here.<br />
====LVM can't see my volume group any longer??====<br />
During the upgrade from cluster2 to cluster3, one of the machines somehow lost sight of the ATA-over-Ethernet device that I'm using for the cluster's shared storage. The problem wasn't with the <tt>aoe</tt> module, though -- <tt>lvscan</tt> simply never saw the device, even though the other two nodes ''could'' see it.<br />
<br />
Turns out that LVM actually got confused somehow -- I'd been under the impression that, sure, while it does maintain a cache of devices (<tt>/etc/lvm/cache/.cache</tt>), it'd nevertheless grok new ones one way or another. And it always had, until now -- it wasn't until I edited that cache file by hand and added the AoE device's <tt>/dev/</tt> entry that <tt>lvscan</tt> was able to see it. Good thing to keep in mind for future debugging: apparently it ''is'' possible for LVM's device cache to go stale, and I didn't see anything in any manpages about how to poke it with <tt>lvm</tt> or something.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_designPNFS prototype design2008-07-25T17:56:13Z<p>Richterd: </p>
<hr />
<div>= pNFS =<br />
<br />
'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.<br />
<br />
== General Information ==<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]<br />
<br />
* [[pNFS Setup Instructions]] - Basic pNFS setup instructions.<br />
<br />
* [[GFS2 Setup Notes]] are basic install notes from setting up a small cluster (perhaps useful for the GFS2 MDS work).<br />
<br />
* [[GFS2 Cluster in VMware]] are a follow-up where I quickly set up a 3-node cluster on my laptop for use at Connectathon.<br />
<br />
* [[GFS2 cluster3 userland notes]] are rough notes from my first stab at upgrading the GFS2 userland from cluster2 to cluster3.<br />
<br />
== Current Issues ==<br />
* [[pNFS Todo List|pNFS Todo List]]<br />
<br />
* [[pNFS Implementation Issues|pNFS Implementation Issues]]<br />
<br />
* [[Bakeathon 2007 Issues List|Bakeathon 2007 Issues List]]<br />
<br />
* [[pNFS Development Road Map]]<br />
<br />
* [http://spreadsheets.google.com/pub?key=pGVvgce8dC-WWbowI9TSmEg Linux pNFS Development Gantt Chart]<br />
<br />
* [[pNFS Git tree recipies|pNFS Git tree recipes]]<br />
<br />
* [[pNFS Development Git tree|pNFS Development Git tree]]<br />
<br />
* [[Wireshark Patches|Wireshark Patches]]<br />
<br />
== Old Issues ==<br />
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]<br />
<br />
* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]<br />
<br />
* [[pNFS todo List|pNFS todo List July 2007]]</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Setup_NotesGFS2 Setup Notes2008-07-10T15:35:26Z<p>Richterd: /* Upgrading GFS2 userland for kernels >2.6.18 */</p>
<hr />
<div>==Initial install==<br />
===Basics===<br />
Started with fresh installs of RHEL5.0 on 4 nodes of mixed hardware, all attached to a shared MSA-1000 fibre channel 8-disk array (in two sets of 4, ~550GB total).<br />
<br />
* installed cluster and update RPMs from Wendy Cheng:<br />
** <tt> cman-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> cman-devel-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> device-mapper-1.02.13-1.el5.x86_64.rpm </tt><br />
** <tt> gfs-utils-0.1.11-3.el5.x86_64.rpm </tt><br />
** <tt> gfs2-utils-0.1.25-1.el5.x86_64.rpm </tt><br />
** <tt> gnbd-1.1.5-1.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> kmod-gfs-0.1.16-5.2.6.18_8.1.4.el5.x86_64.rpm </tt><br />
** <tt> kmod-gnbd-0.1.3-4.2.6.18_8.1.4.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> lvm2-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> lvm2-cluster-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> openais-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> openais-devel-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> system-config-cluster-1.0.39-1.0.noarch.rpm </tt> ''(just a python frontend for several <tt>vg*</tt>, <tt>lv*</tt>, and <tt>pv*</tt> commands)''<br />
<br />
<br />
===Configuring <tt>cman</tt> and <tt>clvmd</tt>===<br />
* '''cman''': at first I tried using <tt>system-config-cluster</tt> to set up <tt>cman</tt>, but given that I didn't have any complicated fencing or quorum-related needs, I basically just took a generic <tt>cluster.conf</tt> and edited it. My <tt>[http://www.citi.umich.edu/u/richterd/gfs2/cluster.conf cluster.conf]</tt> is real basic and has manual fencing set up to be a no-op (I'd get complaints from the daemons if I didn't have any fencing setup).<br />
** distribute the new <tt>cluster.conf</tt> to all nodes; on the first run, you can just use <tt>scp</tt> or whatever.<br />
** once the cluster's up, though, propagating and setting changes on all nodes takes two steps. From the node with the updated configuration, do:<br />
*** <tt>$ sudo ccs_tool update /path/to/new/cluster.conf</tt> ''(pushes to all nodes listed in conf file)''<br />
*** <tt>$ sudo cman_tool version -r <new-version-number></tt> ''(a generation number to keep the nodes synched)''<br />
<br />
* '''clvmd''': as before, I tried using <tt>system-config-lvm</tt> to set up <tt>clvmd</tt>, but it's not quite "there yet" -- it'd get wedged or go blind to clustered volumes at strange times. Again, tweaking a mostly-templated (and very well-commented) stock conf file wasn't hard; my <tt>[http://www.citi.umich.edu/u/richterd/gfs2/lvm.conf lvm.conf]</tt> is real simple. ''Note:'' btw, in my setup the MSA-1000 disk array is initially set up to do raid0 on the 8 disks in two groups of 4; my machines see 2 block devices, each with a capacity of ~270GB. <br />
** create 1 physical linux (0x83) partition each, using whole "disk"; repeat for <tt>/dev/sdc</tt><br />
*** <tt>$ sudo fdisk /dev/sdb</tt><br />
** create physical volumes with LVM2 metadata<br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdb1</tt><br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdc1</tt><br />
** create a clustered volume group and add <tt>/dev/sdb1</tt> to it<br />
*** <tt>$ sudo vgcreate -M 2 -l 256 -p 256 -s 4m -c y VolGroupCluster /dev/sdb1</tt><br />
*** <tt>$ sudo pvscan</tt> ''# (verify it worked)''<br />
** edit <tt>lvm.conf</tt> and make sure that "<tt>locking_type</tt>" is set to 3 (<tt>DLM</tt>).<br />
** distribute <tt>lvm.conf</tt> to all the nodes<br />
** start up both <tt>cman</tt> and <tt>clvmd</tt> everywhere. ''Note:'' fwiw, I use [https://computing.llnl.gov/linux/pdsh.html pdsh], the parallel distributed shell, to communicate to all nodes at once; I have mine use <tt>ssh</tt> for transport. E.g., from my .bashrc:<br />
*** <tt> $ alias start-cluster='for svc in cman clvmd ; do pdsh -w node[1-4] sudo service $svc start; done'</tt><br />
** add <tt>/dev/sdc1</tt> to the existing volume group (needs the daemons running)<br />
*** <tt>$ sudo vgextend VolGroupCluster /dev/sdc1</tt><br />
*** <tt>$ sudo vgs</tt> ''# (verify that the "clustering" flag is set on the volgroup)''<br />
** create a logical volume using the whole volgroup<br />
*** <tt>$ sudo lvcreate -n ClusterVolume -l 138924 VolGroupCluster</tt><br />
*** <tt>$ sudo lvdisplay -c -a</tt> ''# (verify that it worked)''<br />
** create a GFS2 filesystem therein<br />
*** <tt>$ sudo gfs2_mkfs -j 4 -p lock_dlm -t GFS2_Cluster:ClusterFS -O /dev/VolGroupCluster/ClusterVolume</tt><br />
** edit <tt>/etc/fstab</tt> to add a mountpoint, restart the daemons, and mount!<br />
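For reference, the <tt>/etc/fstab</tt> line for the filesystem above could look something like the following (the <tt>/mnt/gfs2</tt> mountpoint is just an example; <tt>noauto</tt> because the cluster daemons have to be up before a GFS2 mount will work):<br />
 /dev/VolGroupCluster/ClusterVolume  /mnt/gfs2  gfs2  defaults,noauto  0 0
..then, once <tt>cman</tt> and <tt>clvmd</tt> are running everywhere, a plain <tt>$ sudo mount /mnt/gfs2</tt> on each node should do it.<br />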
<br />
<br />
===Custom kernels===<br />
Once the basics were going, I built some kernels and things more or less worked -- except I had a heck of a time getting the <tt>Qlogic</tt> firmware to load properly. I'm fine with building the <tt>initramfs</tt> "initrds" by hand, ''except'' for the firmware in this setup; I don't know, I guess I'm a <tt>udev</tt> idiot or something. What I ended up doing was bogarting a vendor patch from Red Hat (bless their hearts ;) that side-stepped the issue and just built the blobs into the GFS kernel module. A [http://www.citi.umich.edu/u/richterd/gfs2/add-qlogic-firmware-blob--2.6.22.19.diff slightly-updated version against 2.6.22.19] is available.<br />
<br />
<br />
==Upgrading GFS2 userland for kernels >2.6.18==<br />
Not too long after the initial install (which came with a 2.6.18-based kernel), I found that the in-kernel <tt>DLM</tt> (distributed lock manager) stuff changed recently and required a corresponding update to userspace <tt>LVM2</tt> (logical volume manager) tools.<br />
<br />
While Wendy Cheng had gotten things off the ground by giving me the bag of RPMs, we didn't get any RHN entitlements, so no updates = pain in the neck. I did finally manage to find a way to sneak RHEL5 packages out of RHN despite the lack of entitlement, but I had to do it by hand and I had to re-login for each package. Worse, when I finally did get the newest RPMs, they weren't even new enough anyway. Lesson learned: build from source. <br />
<br />
I wasn't sure that it was the best idea, but since I already had GFS2 working with the stock userland, I was skittish and didn't want to clobber the system RPMs so I installed under my home directory; worked fine.<br />
<br />
* got the newest packages:<br />
** [http://sources.redhat.com/dm/ device-mapper.1.02.22]<br />
** [http://www.openais.org/ openAIS] ''(get the stable/"whitetank" release)''<br />
** [ftp://sources.redhat.com/pub/cluster/releases/ cluster-2.01.00 tools]<br />
** [http://sources.redhat.com/lvm2/ LVM2.2.02.28]<br />
** <tt>libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt> ''(bogarted from RHN)''<br />
<br />
* <tt>export CLUSTER=/home/richterd/projects/nfs/CLUSTER; cd $CLUSTER</tt><br />
* <tt>mkdir device-mapper-OBJ cluster-OBJ LVM2-OBJ</tt><br />
<br />
* device-mapper:<br />
** <tt>./configure --prefix=$CLUSTER/device-mapper-OBJ && make && sudo make install</tt><br />
** add <tt>$CLUSTER/device-mapper-OBJ/lib</tt> to <tt>/etc/ld.so.conf</tt> and rerun <tt>ldconfig</tt><br />
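In case the <tt>ld.so.conf</tt> step is unfamiliar, it's just something like this (the path is the prefix from the configure line above):<br />
 $ echo "$CLUSTER/device-mapper-OBJ/lib" | sudo tee -a /etc/ld.so.conf
 $ sudo ldconfig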
<br />
* openAIS:<br />
** edit the Makefile; set <tt>DESTDIR</tt> to the empty string<br />
** <tt>make && sudo make install</tt> -- at some point, this clobbered some of the RPM stuff; meh.<br />
** added <tt>/usr/lib64/openais</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt><br />
** '''update''': when I came back to this and was building on Fedora 9, I got complaints about <tt>struct ucred</tt> not being defined (see [http://sourceware.org/bugzilla/show_bug.cgi?id=6545 this bugreport]). I edited $OPENAIS/exec/Makefile and added <tt>-D_GNU_SOURCE</tt> to its CFLAGS and things seem copacetic.<br />
<br />
* libvolume_id-devel:<br />
** <tt>sudo rpm -ivh libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt><br />
<br />
* cluster tools:<br />
** <tt>./configure --prefix=$CLUSTER/cluster-OBJ --openaislibdir=/usr/lib64/openais --dlmincdir=/lib/modules/<kernel>/source/include</tt><br />
*** '''update''': I'm now omitting a couple of things (gfs doesn't build right any longer anyway): <tt>./configure --without_gnbd --without_gfs</tt><br />
** edit <tt>dlm/lib/Makefile</tt> and add: <tt>CFLAGS += -I$(dlmincdir)</tt><br />
*** '''update''': on a different install, <tt>libdlm.h</tt> kept hiding; I edited <tt>make/defines.mk</tt> and added something like <tt>CFLAGS += -I$(SRCDIR)/dlm/lib</tt>.<br />
** since I was doing my "trial" install, I added <tt>$CLUSTER/cluster-OBJ/usr/lib</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt>. I anticipate going back and installing things in real system locations now that I know things worked <tt>:)</tt><br />
** <tt>make && sudo make install</tt><br />
<br />
* LVM2:<br />
** <tt>./configure --prefix=$CLUSTER/LVM2-OBJ --with-lvm1=none --with-dmdir=$CLUSTER/device-mapper-OBJ --with-clvmd=cman</tt><br />
** edit <tt>make.tmpl</tt> and look for where the above <tt>dmdir</tt> is set; my <tt>configure</tt> screwed up and appended <tt>"/ioctl"</tt> to the end and I had to trim it.<br />
*** '''fix''': better yet, trim it from <tt>make.tmpl.in</tt> first, since that's where it originates (for whatever reason)<br />
** <tt>make && sudo make install</tt><br />
<br />
.. at this point, I had a <tt>clvmd</tt> that linked against the right shared libraries and that could deal with the kernel's modified <tt>DLM</tt> setup.<br />
<br />
==Troubleshooting the clustering flag==<br />
'''Problem 1:''' LVM changes don't appear to "take". Quoting from an email I found online (XXX: cite):<br />
Why aren't changes to my logical volume being picked up by the rest of the cluster?<br />
<br />
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. <br />
If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with <br />
lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when <br />
they try to access the volume.<br />
<br />
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows <br />
a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the <br />
"Attr" column shows something like "wz--nc" the clustering flag is on.<br />
<br />
To set the clustering flag on, use this command: <tt>vgchange -cy</tt><br />
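Concretely, with the clustered volume group from this setup, checking and setting the flag looks like this (run from any node with the cluster daemons up):<br />
 $ sudo vgs                              # a 'c' at the end of the Attr column means the flag is set
 $ sudo vgchange -cy VolGroupCluster     # turn the clustering flag on for the shared VG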
<br />
'''Problem 2:''' In the midst of adding a new node to the cluster, <tt>clvmd</tt> wouldn't start on the other nodes or recognize the disk array.<br />
<br />
I tried the above <tt>vgchange -cy</tt> thing and screwed it up by making the ''local disk's'' VG '''clustered''' (ugh). [http://kbase.redhat.com/faq/FAQ_96_11024.shtm The problem] made sense, but the temporarily-changing-the-locking-type trick was what I was missing when I tried to undo my mistake. <br />
<br />
'''The fix''': make sure uniform <tt>lvm.conf</tt>s are tweaked as per the link above and distributed to the cluster; start <tt>cman</tt>/<tt>clvmd</tt> everywhere; ''then'' use <tt>vgchange -cn VolGroup00</tt> to remove the clustering flag (''<tt>VolGroup00</tt> is the local disk's VG, set up during the RHEL install''); ''then'' set the <tt>lvm.conf</tt> locking stuff back to "clustered" and redistribute it to the cluster; ''then'' restart the daemons, mount, and declare victory. A command-level sketch of that sequence is below.<br />
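Roughly, the recovery sequence above boils down to something like this (it assumes <tt>pdsh</tt> and the <tt>node[1-4]</tt> names from earlier; the <tt>lvm.conf</tt> tweak itself is whatever the linked FAQ prescribes):<br />
 # 1. tweak /etc/lvm/lvm.conf as per the FAQ, push it to every node, then bring up the daemons
 $ pdsh -w node[1-4] sudo service cman start
 $ pdsh -w node[1-4] sudo service clvmd start
 # 2. clear the clustering flag on the *local* volume group only, and verify
 $ sudo vgchange -cn VolGroup00
 $ sudo vgs                              # VolGroup00's Attr column should no longer show a 'c'
 # 3. set locking_type back to 3 in lvm.conf, redistribute it, restart cman/clvmd, and remount
</div>Richterd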
<hr />
<div>==Initial install==<br />
===Basics===<br />
Started with fresh installs of RHEL5.0 on 4 nodes of mixed hardware, all attached to a shared MSA-1000 fibre channel 8-disk array (in two sets of 4, ~550GB total).<br />
<br />
* installed cluster and update RPMs from wendy cheng:<br />
** <tt> cman-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> cman-devel-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> device-mapper-1.02.13-1.el5.x86_64.rpm </tt><br />
** <tt> gfs-utils-0.1.11-3.el5.x86_64.rpm </tt><br />
** <tt> gfs2-utils-0.1.25-1.el5.x86_64.rpm </tt><br />
** <tt> gnbd-1.1.5-1.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> kmod-gfs-0.1.16-5.2.6.18_8.1.4.el5.x86_64.rpm </tt><br />
** <tt> kmod-gnbd-0.1.3-4.2.6.18_8.1.4.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> lvm2-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> lvm2-cluster-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> openais-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> openais-devel-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> system-config-cluster-1.0.39-1.0.noarch.rpm </tt> ''(just a python frontend for several <tt>vg*</tt>, <tt>lv*</tt>, and <tt>pv*</tt> commands)''<br />
<br />
<br />
===Configuring <tt>cman</tt> and <tt>clvmd</tt>===<br />
* '''cman''': at first I tried using <tt>system-config-cluster</tt> to set up <tt>cman</tt>, but given that I didn't have any complicated fencing or quorum-related needs, I basically just took a generic <tt>cluster.conf</tt> and edited it. My <tt>[http://www.citi.umich.edu/u/richterd/gfs2/cluster.conf cluster.conf]</tt> is real basic and has manual fencing set up to be a no-op (I'd get complaints from the daemons if I didn't have any fencing setup).<br />
** distribute the new <tt>cluster.conf</tt> to all nodes; on the first run, you can just use <tt>scp</tt> or whatever.<br />
** once the cluster's up, though, propagating and setting changes on all nodes takes two steps. From the node with the updated configuration, do:<br />
*** <tt>$ sudo ccs_tool update /path/to/new/cluster.conf</tt> ''(pushes to all nodes listed in conf file)''<br />
*** <tt>$ sudo cman_tool version -r <new-version-number></tt> ''(a generation number to keep the nodes synched)''<br />
<br />
* '''clvmd''': as before, I tried using <tt>system-config-lvm</tt> to set up <tt>clvmd</tt>, but it's not quite "there yet" -- it'd get wedged or go blind to clustered volumes at strange times. Again, tweaking a mostly-templated (and very well-commented) stock conf file wasn't hard; my <tt>[http://www.citi.umich.edu/u/richterd/gfs2/lvm.conf lvm.conf]</tt> is real simple. ''Note:'' btw, in my setup the MSA-1000 disk array is initially set up to do raid0 on the 8 disks in two groups of 4; my machines see 2 block devices, each with a capacity of ~270GB. <br />
** create 1 physical linux (0x83) partition each, using whole "disk"; repeat for <tt>/dev/sdc</tt><br />
*** <tt>$ sudo fdisk /dev/sdb</tt><br />
** create physical volumes with LVM2 metadata<br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdb1</tt><br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdc1</tt><br />
** create a clustered volume group and add <tt>/dev/sdb1</tt> to it<br />
*** <tt>$ sudo vgcreate -M 2 -l 256 -p 256 -s 4m -c y VolGroupCluster /dev/sdb1</tt><br />
*** <tt>$ sudo pvscan</tt> ''# (verify it worked)''<br />
** edit <tt>lvm.conf</tt> and make sure that "<tt>locking_type</tt>" is set to 3 (<tt>DLM</tt>).<br />
** distribute <tt>lvm.conf</tt> to all the nodes<br />
** start up both <tt>cman</tt> and <tt>clvmd</tt> everywhere. ''Note:'' fwiw, I use [https://computing.llnl.gov/linux/pdsh.html pdsh], the parallel distributed shell, to communicate to all nodes at once; I have mine use <tt>ssh</tt> for transport. E.g., from my .bashrc:<br />
*** <tt> $ alias start-cluster='for svc in cman clvmd ; do pdsh -w node[1-4] sudo service $svc start; done'</tt><br />
** add <tt>/dev/sdc1</tt> to the existing volume group (needs the daemons running)<br />
*** <tt>$ sudo vgextend VolGroupCluster /dev/sdc1</tt><br />
*** <tt>$ sudo vgs</tt> ''# (verify that the "clustering" flag is set on the volgroup)''<br />
** create a logical volume using the whole volgroup<br />
*** <tt>$ sudo lvcreate -n ClusterVolume -l 138924 VolGroupCluster</tt><br />
*** <tt>$ sudo lvdisplay -c -a</tt> ''# (verify that it worked)''<br />
** create a GFS2 filesystem therein<br />
*** <tt>$ sudo gfs2_mkfs -j 4 -p lock_dlm -t GFS2_Cluster:ClusterFS -O /dev/VolGroupCluster/ClusterVolume</tt><br />
** edit <tt>/etc/fstab</tt> to add a mountpoint, restart the daemons, and mount!<br />
<br />
<br />
===Custom kernels===<br />
Once the basics were going, I built some kernels and things more or less worked -- except I had a heck of a time getting the <tt>Qlogic</tt> firmware to load properly. I'm fine with building the <tt>initcramfs "initrds"</tt> by hand, ''but'' for the firmware in this setup; I don't know, I guess I'm a <tt>udev</tt> idiot or something. What I ended up doing was bogarting a vendor patch from Red Hat (bless their hearts ;) that side-stepped the issue and just built the blobs into the GFS kernel module. A [http://www.citi.umich.edu/u/richterd/gfs2/add-qlogic-firmware-blob--2.6.22.19.diff slightly-updated version against 2.6.22.19] is available.<br />
<br />
<br />
==Upgrading GFS2 userland for kernels >2.6.18==<br />
Not too long after the initial install (which came with a 2.6.18-based kernel), I found that the in-kernel <tt>DLM</tt> (distributed lock manager) interfaces had changed and required a corresponding update to the userspace <tt>LVM2</tt> (logical volume manager) tools.<br />
<br />
While Wendy Cheng had gotten things off the ground by giving me the bag of RPMs, we didn't get any RHN entitlements, so no updates -- a pain in the neck. I did finally find a way to sneak RHEL5 packages out of RHN despite the lack of entitlement, but I had to do it by hand and re-login for each package. Worse, when I finally did get the newest RPMs, they weren't even new enough anyway. Lesson learned: build from source. <br />
<br />
I wasn't sure it was the best idea, but since I already had GFS2 working with the stock userland, I was skittish about clobbering the system RPMs, so I installed under my home directory instead; it worked fine.<br />
<br />
* got the newest packages:<br />
** [http://sources.redhat.com/dm/ device-mapper.1.02.22]<br />
** [http://www.openais.org/ openAIS] ''(get the stable/"whitetank" release)''<br />
** [ftp://sources.redhat.com/pub/cluster/releases/ cluster-2.01.00 tools]<br />
** [http://sources.redhat.com/lvm2/ LVM2.2.02.28]<br />
** <tt>libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt> ''(bogarted from RHN)''<br />
<br />
* <tt>export CLUSTER=/home/richterd/projects/nfs/CLUSTER; cd $CLUSTER</tt><br />
* <tt>mkdir device-mapper-OBJ cluster-OBJ LVM2-OBJ</tt><br />
<br />
* device-mapper:<br />
** <tt>./configure --prefix=$CLUSTER/device-mapper-OBJ && make && sudo make install</tt><br />
** add <tt>$CLUSTER/device-mapper-OBJ/lib</tt> to <tt>/etc/ld.so.conf</tt> and rerun <tt>ldconfig</tt><br />
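Just a convenience note: with the <tt>$CLUSTER</tt> variable exported above, the <tt>ld.so.conf</tt> step can be done in one shot, e.g.:<br />
 $ echo "$CLUSTER/device-mapper-OBJ/lib" | sudo tee -a /etc/ld.so.conf<br />
 $ sudo ldconfig<br />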
<br />
* openAIS:<br />
** edit the Makefile; set <tt>DESTDIR</tt> to the empty string<br />
** <tt>make && sudo make install</tt> -- at some point, this clobbered some of the RPM stuff; meh.<br />
** added <tt>/usr/lib64/openais</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt><br />
** '''update''': when I came back to this and was building on Fedora 9, I got complaints about <tt>struct ucred</tt> not being defined (see [http://sourceware.org/bugzilla/show_bug.cgi?id=6545 this bug report]). I edited <tt>$OPENAIS/exec/Makefile</tt>, added <tt>-D_GNU_SOURCE</tt> to its CFLAGS, and things seem copacetic.<br />
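In other words, the edit to <tt>$OPENAIS/exec/Makefile</tt> boils down to something like the following (the exact CFLAGS line varies between checkouts; the point is just to get the define onto the compile line):<br />
 # in $OPENAIS/exec/Makefile<br />
 CFLAGS += -D_GNU_SOURCE<br />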
<br />
* libvolume_id-devel:<br />
** <tt>sudo rpm -ivh libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt><br />
<br />
* cluster tools:<br />
** <tt>./configure --prefix=$CLUSTER/cluster-OBJ --openaislibdir=/usr/lib64/openais --dlmincdir=/lib/modules/<kernel>/source/include</tt> ''(see the note after this list for one way to fill in <kernel>)''<br />
*** '''update''': I now omit a couple of things, since gfs no longer builds properly anyway -- add <tt>--without_gnbd --without_gfs</tt> to the configure line above.<br />
** edit <tt>dlm/lib/Makefile</tt> and add: <tt>CFLAGS += -I$(dlmincdir)</tt><br />
** since I was doing my "trial" install, I added <tt>$CLUSTER/cluster-OBJ/usr/lib</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt>. I anticipate going back and installing things in real system locations now that I know things worked <tt>:)</tt><br />
** <tt>make && sudo make install</tt><br />
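Aside: if the kernel you are building against is also the one currently running (and its module directory has the usual <tt>source</tt> symlink), the <kernel> placeholder can be filled in automatically -- purely a convenience, not something the configure line requires:<br />
 $ ./configure --prefix=$CLUSTER/cluster-OBJ --openaislibdir=/usr/lib64/openais \<br />
               --dlmincdir=/lib/modules/$(uname -r)/source/include<br />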
<br />
* LVM2:<br />
** <tt>./configure --prefix=$CLUSTER/LVM2-OBJ --with-lvm1=none --with-dmdir=$CLUSTER/device-mapper-OBJ --with-clvmd=cman</tt><br />
** edit <tt>make.tmpl</tt> and look for where the above <tt>dmdir</tt> is set; my <tt>configure</tt> screwed up and appended <tt>"/ioctl"</tt> to the end, which I had to trim.<br />
*** '''fix''': better yet, trim it from <tt>make.tmpl.in</tt> first, since that's where it originates (for whatever reason)<br />
** <tt>make && sudo make install</tt><br />
<br />
.. at this point, I had a <tt>clvmd</tt> that linked against the right shared libraries and that could deal with the kernel's modified <tt>DLM</tt> setup.<br />
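A quick sanity check, if you want one: point <tt>ldd</tt> at the freshly installed <tt>clvmd</tt> and make sure the device-mapper and dlm libraries resolve to the new copies rather than the system ones (the exact path under the prefix may differ):<br />
 $ ldd $CLUSTER/LVM2-OBJ/sbin/clvmd | grep -E 'devmapper|dlm'<br />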
<br />
==Troubleshooting the clustering flag==<br />
'''Problem 1:''' LVM changes don't appear to "take". Quoting from an email I found online (XXX: cite):<br />
Why aren't changes to my logical volume being picked up by the rest of the cluster?<br />
<br />
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. <br />
If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with <br />
lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when <br />
they try to access the volume.<br />
<br />
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows <br />
a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the <br />
"Attr" column shows something like "wz--nc" the clustering flag is on.<br />
<br />
To set the clustering flag on, use this command: <tt>vgchange -cy</tt><br />
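Concretely, for the shared volume group used above, checking and setting the flag looks like this (run on a node with the cluster daemons up and the shared storage visible):<br />
 $ sudo vgs -o vg_name,vg_attr<br />
 $ sudo vgchange -cy VolGroupCluster   # set the clustering flag<br />
 $ sudo vgchange -cn VolGroupCluster   # ...or clear it again<br />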
<br />
'''Problem 2:''' In the midst of adding a new node to the cluster, <tt>clvmd</tt> wouldn't start on other nodes and recognize the disk array.<br />
<br />
I tried the above <tt>vgchange -cy</tt> trick and screwed it up by making the ''local disk's'' VG '''clustered''' (ugh). [http://kbase.redhat.com/faq/FAQ_96_11024.shtm The problem] made sense, but temporarily changing the locking type was the step I was missing when I tried to undo my mistake. <br />
<br />
The fix: make sure uniform <tt>lvm.conf</tt>s are tweaked as per the link above and distributed to the cluster; start <tt>cman</tt>/<tt>clvmd</tt> everywhere; ''then'' use <tt>vgchange -cn VolGroup00</tt> to remove the clustering flag (''<tt>VolGroup00</tt> is the local disk's VG, set up during the RHEL install''); ''then'' set the <tt>lvm.conf</tt> locking type back to "clustered" and redistribute it to the cluster; ''then'' restart the daemons, mount, and declare victory.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-07-09T18:14:09Z<p>Richterd: /* VMware */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6 and installed it on my T-43 Thinkpad <tt>"atro"</tt> (running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(I also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out of date relative to the stock Fedora kernel, and so are certainly too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
** [ftp://sources.redhat.com/pub/dm/device-mapper.1.02.27.tgz device-mapper.1.02.27.tgz]<br />
** [ftp://ftp%40openais%2Eorg:downloads@openais.org/downloads/openais-0.80.3/openais-0.80.3.tar.gz openais-0.80.3.tar.gz]<br />
** [ftp://sources.redhat.com/pub/cluster/releases/cluster-2.03.04.tar.gz cluster-2.03.04.tar.gz]<br />
** [ftp://sources.redhat.com/pub/lvm2/LVM2.2.02.39.tgz LVM2.2.02.39.tgz]<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* I set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(AoE shelf 0, slot 1 -- i.e., major 0, minor 1 -- hence <tt>e0.1</tt> below)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
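A quick way to double-check what the initiator found (<tt>aoe-stat</tt> ships with aoetools; output omitted here):<br />
 [guest1] $ sudo aoe-stat      # lists discovered AoE targets and their state<br />
 [guest1] $ ls /dev/etherd/    # the e0.1 device node should show up here<br />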
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add the whole AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure <tt>locking_type</tt> is set to 3 (<tt>DLM</tt>)<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt>, name the cluster <tt>'''GuestCluster'''</tt>, and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount (an example is sketched below), and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
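The mount itself might look like the following -- <tt>/mnt/gfs2</tt> is just an example mountpoint, and you'd repeat it (or add a matching <tt>/etc/fstab</tt> entry) on each guest:<br />
 [guest1] $ sudo mkdir -p /mnt/gfs2<br />
 [guest1] $ sudo mount -t gfs2 /dev/GuestVolGroup/GuestVolume /mnt/gfs2<br />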
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely assumed that 4GB of disk space per guest (which Fedora sets up under LVM as <tt>VolGroup00</tt>) would be sufficient, I <tt>git-clone</tt>d my repo and promptly ran out of space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device), created a (virtual) SCSI disk -- 3GB, allocated on demand -- and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk /dev/sdb</tt> ''(interactive -- '''NB:''' make sure the partition type is '''0x8e''', Linux LVM; see the keystroke sketch after this list)''<br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolGroup00/LogVol00</tt><br />
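For the fdisk step near the top of this list, the interactive sequence is roughly the following -- shown only as a reminder of the keystrokes, so double-check each prompt against your own disk before writing the table:<br />
 [guest1] $ fdisk /dev/sdb<br />
   n  (new partition)  ->  p (primary)  ->  1  ->  accept the default first and last cylinders<br />
   t  (change type)    ->  8e  (Linux LVM)<br />
   w  (write the table and exit)<br />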
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)<br />
<br />
==Update: reactions from Connectathon '08==<br />
The purpose of this entire VMware/GFS2 setup was to let me work on a pNFS/GFS2 MDS at Connectathon '08 with Frank Filz, Dean Hildebrand, and Marc Eshel (all gentlemen from IBM). <br />
<br />
On the one hand, once I had a primary guest system set up and could just clone it to make a cluster, it was very easy to make kernel changes, rebuild, push things out to the cluster, and reboot.<br />
<br />
The downside came during testing, when we tried doing pNFS writes of several KB or more -- the RPC layer would barf on the packet with a message like "Error: bad tcp reclen". Fortunately, Dean recalled that Ricardo Labiaga had had a similar problem with KVM (or UML?) at the fall 2007 CITI Bakeathon, so we started to suspect VMware. I quickly set up two laptops to act as GFS2 nodes, accessing shared storage with AoE. I shut down the VMware cluster, configured it so that only one VMware node and the two new laptops would form a 3-node GFS2 cluster, and brought up the new cluster. Then, using the node in VMware as a pNFS MDS and the two laptops as DSes, we almost immediately were able to pass the Connectathon test suite.<br />
<br />
'''The verdict''': VMware Workstation 6 still totally impresses me, but it's probably better to do cluster work on an actual cluster. That said, my I/O troubles may just stem from my laptop, or my particular NIC driver, or whatever -- I can't imagine that there aren't ways to resolve that somehow.</div>Richterd
<hr />
<div>==Initial install==<br />
===Basics===<br />
Started with fresh installs of RHEL5.0 on 4 nodes of mixed hardware, all attached to a shared MSA-1000 fibre channel 8-disk array (in two sets of 4, ~550GB total).<br />
<br />
* installed cluster and update RPMs from wendy cheng:<br />
** <tt> cman-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> cman-devel-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> device-mapper-1.02.13-1.el5.x86_64.rpm </tt><br />
** <tt> gfs-utils-0.1.11-3.el5.x86_64.rpm </tt><br />
** <tt> gfs2-utils-0.1.25-1.el5.x86_64.rpm </tt><br />
** <tt> gnbd-1.1.5-1.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> kmod-gfs-0.1.16-5.2.6.18_8.1.4.el5.x86_64.rpm </tt><br />
** <tt> kmod-gnbd-0.1.3-4.2.6.18_8.1.4.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> lvm2-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> lvm2-cluster-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> openais-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> openais-devel-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> system-config-cluster-1.0.39-1.0.noarch.rpm </tt> ''(just a python frontend for several <tt>vg*</tt>, <tt>lv*</tt>, and <tt>pv*</tt> commands)''<br />
<br />
<br />
===Configuring <tt>cman</tt> and <tt>clvmd</tt>===<br />
* '''cman''': at first I tried using <tt>system-config-cluster</tt> to set up <tt>cman</tt>, but given that I didn't have any complicated fencing or quorum-related needs, I basically just took a generic <tt>cluster.conf</tt> and edited it. My <tt>[http://www.citi.umich.edu/u/richterd/gfs2/cluster.conf cluster.conf]</tt> is real basic and has manual fencing set up to be a no-op (I'd get complaints from the daemons if I didn't have any fencing setup).<br />
** distribute the new <tt>cluster.conf</tt> to all nodes; on the first run, you can just use <tt>scp</tt> or whatever.<br />
** once the cluster's up, though, propagating and setting changes on all nodes takes two steps. From the node with the updated configuration, do:<br />
*** <tt>$ sudo ccs_tool update /path/to/new/cluster.conf</tt> ''(pushes to all nodes listed in conf file)''<br />
*** <tt>$ sudo cman_tool version -r <new-version-number></tt> ''(a generation number to keep the nodes synched)''<br />
<br />
* '''clvmd''': as before, I tried using <tt>system-config-lvm</tt> to set up <tt>clvmd</tt>, but it's not quite "there yet" -- it'd get wedged or go blind to clustered volumes at strange times. Again, tweaking a mostly-templated (and very well-commented) stock conf file wasn't hard; my <tt>[http://www.citi.umich.edu/u/richterd/gfs2/lvm.conf lvm.conf]</tt> is real simple. ''Note:'' btw, in my setup the MSA-1000 disk array is initially set up to do raid0 on the 8 disks in two groups of 4; my machines see 2 block devices, each with a capacity of ~270GB. <br />
** create 1 physical linux (0x83) partition each, using whole "disk"; repeat for <tt>/dev/sdc</tt><br />
*** <tt>$ sudo fdisk /dev/sdb</tt><br />
** create physical volumes with LVM2 metadata<br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdb1</tt><br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdc1</tt><br />
** create a clustered volume group and add <tt>/dev/sdb1</tt> to it<br />
*** <tt>$ sudo vgcreate -M 2 -l 256 -p 256 -s 4m -c y VolGroupCluster /dev/sdb1</tt><br />
*** <tt>$ sudo pvscan</tt> ''# (verify it worked)''<br />
** edit <tt>lvm.conf</tt> and make sure that "<tt>locking_type</tt>" is set to 3 (<tt>DLM</tt>).<br />
** distribute <tt>lvm.conf</tt> to all the nodes<br />
** start up both <tt>cman</tt> and <tt>clvmd</tt> everywhere. ''Note:'' fwiw, I use [https://computing.llnl.gov/linux/pdsh.html pdsh], the parallel distributed shell, to communicate to all nodes at once; I have mine use <tt>ssh</tt> for transport. E.g., from my .bashrc:<br />
*** <tt> $ alias start-cluster='for svc in cman clvmd ; do pdsh -w node[1-4] sudo service $svc start; done'</tt><br />
** add <tt>/dev/sdc1</tt> to the existing volume group (needs the daemons running)<br />
*** <tt>$ sudo vgextend VolGroupCluster /dev/sdc1</tt><br />
*** <tt>$ sudo vgs</tt> ''# (verify that the "clustering" flag is set on the volgroup)''<br />
** create a logical volume using the whole volgroup<br />
*** <tt>$ sudo lvcreate -n ClusterVolume -l 138924 VolGroupCluster</tt><br />
*** <tt>$ sudo lvdisplay -c -a</tt> ''# (verify that it worked)''<br />
** create a GFS2 filesystem therein<br />
*** <tt>$ sudo gfs2_mkfs -j 4 -p lock_dlm -t GFS2_Cluster:ClusterFS -O /dev/VolGroupCluster/ClusterVolume</tt><br />
** edit <tt>/etc/fstab</tt> to add a mountpoint, restart the daemons, and mount!<br />
<br />
<br />
===Custom kernels===<br />
Once the basics were going, I built some kernels and things more or less worked -- except I had a heck of a time getting the <tt>Qlogic</tt> firmware to load properly. I'm fine with building the <tt>initcramfs "initrds"</tt> by hand, ''but'' for the firmware in this setup; I don't know, I guess I'm a <tt>udev</tt> idiot or something. What I ended up doing was bogarting a vendor patch from Red Hat (bless their hearts ;) that side-stepped the issue and just built the blobs into the GFS kernel module. A [http://www.citi.umich.edu/u/richterd/gfs2/add-qlogic-firmware-blob--2.6.22.19.diff slightly-updated version against 2.6.22.19] is available.<br />
<br />
<br />
==Upgrading GFS2 userland for kernels >2.6.18==<br />
Not too long after the initial install (which came with a 2.6.18-based kernel), I found that the in-kernel <tt>DLM</tt> (distributed lock manager) stuff changed recently and required a corresponding update to userspace <tt>LVM2</tt> (logical volume manager) tools.<br />
<br />
While Wendy Cheng had gotten things off the ground by giving me the bag of RPMs, we didn't get any RHN entitlements, so no updates = pain in the neck. I did finally manage to find a way to sneak RHEL5 packages out of RHN despite the lack of entitlement, but I had to do it by hand and I had to re-login for each package. Worse, when I finally did get the newest RPMs, they weren't even new enough anyway. Lesson learned: build from source. <br />
<br />
I wasn't sure that it was the best idea, but since I already had GFS2 working with the stock userland, I was skittish and didn't want to clobber the system RPMs so I installed under my home directory; worked fine.<br />
<br />
* got the newest packages:<br />
** [http://sources.redhat.com/dm/ device-mapper.1.02.22]<br />
** [http://www.openais.org/ openAIS] ''(get the stable/"whitetank" release)''<br />
** [ftp://sources.redhat.com/pub/cluster/releases/ cluster-2.01.00 tools]<br />
** [http://sources.redhat.com/lvm2/ LVM2.2.02.28]<br />
** <tt>libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt> ''(bogarted from RHN)''<br />
<br />
* <tt>export CLUSTER=/home/richterd/projects/nfs/CLUSTER; cd $CLUSTER</tt><br />
* <tt>mkdir device-mapper-OBJ cluster-OBJ LVM2-OBJ</tt><br />
<br />
* device-mapper:<br />
** <tt>./configure --prefix=$CLUSTER/device-mapper-OBJ && make && sudo make install</tt><br />
** add <tt>$CLUSTER/device-mapper-OBJ/lib</tt> to <tt>/etc/ld.so.conf</tt> and rerun <tt>ldconfig</tt><br />
<br />
* openAIS:<br />
** edit the Makefile; set <tt>DESTDIR</tt> to the empty string<br />
** <tt>make && sudo make install</tt> -- at some point, this clobbered some of the RPM stuff; meh.<br />
** added <tt>/usr/lib64/openais</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt><br />
<br />
* libvolume_id-devel:<br />
** <tt>sudo rpm -ivh libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt><br />
<br />
* cluster tools:<br />
** <tt>./configure --prefix=$CLUSTER/cluster-OBJ --openaislibdir=/usr/lib64/openais --dlmincdir=/lib/modules/<kernel>/source/include</tt><br />
*** <tt>--without_gnbd</tt><br />
** edit <tt>dlm/lib/Makefile</tt> and add: <tt>CFLAGS += -I$(dlmincdir)</tt><br />
** since I was doing my "trial" install, I added <tt>$CLUSTER/cluster-OBJ/usr/lib</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt>. I anticipate going back and installing things in real system locations now that I know things worked <tt>:)</tt><br />
** <tt>make && sudo make install</tt><br />
<br />
* LVM2:<br />
** <tt>./configure --prefix=$CLUSTER/LVM2-OBJ --with-lvm1=none --with-dmdir=$CLUSTER/device-mapper-OBJ --with-clvmd=cman</tt><br />
** edit <tt>make.tmpl</tt> and look for where the above <tt>dmdir</tt> is set; my <tt>configure</tt> screwed up and appended <tt>"/ioctl"</tt> to the end and I had to trim it.<br />
*** '''fix''': rather, first trim from <tt>make.tmpl.in</tt>, where it originates for whatever reason<br />
** <tt>make && sudo make install</tt><br />
<br />
.. at this point, I had a <tt>clvmd</tt> that linked against the right shared libraries and that could deal with the kernel's modified <tt>DLM</tt> setup.<br />
<br />
==Troubleshooting the clustering flag==<br />
'''Problem 1:''' LVM changes don't appear to "take". Quoting from an email I found online (XXX: cite):<br />
Why aren't changes to my logical volume being picked up by the rest of the cluster?<br />
<br />
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. <br />
If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with <br />
lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when <br />
they try to access the volume.<br />
<br />
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows <br />
a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the <br />
"Attr" column shows something like "wz--nc" the clustering flag is on.<br />
<br />
To set the clustering flag on, use this command: <tt>vgchange -cy</tt><br />
<br />
'''Problem 2:''' In the midst of adding a new node to the cluster, <tt>clvmd</tt> wouldn't start on other nodes and recognize the disk array.<br />
<br />
I tried the above <tt>vgchange -cy</tt> thing and screwed it up by making the ''local disk's'' VG '''clustered''' (ugh). [http://kbase.redhat.com/faq/FAQ_96_11024.shtm The problem] made sense, but the temporarily-changing-the-locking-type was what I was missing when I tried to undo my mistake. <br />
<br />
The fix: make sure uniform <tt>lvm.conf</tt>s are tweaked as per the link above and distributed to the cluster; start <tt>cman/clvmd</tt> everywhere; ''then'' use <tt>vgchange -cn VolGroup00</tt> to remove clustering flag (''<tt>VolGroup00</tt> is the local disk's VG, set up during the RHEL install''); ''then'' set the <tt>lvm.conf</tt> locking stuff back to "clustered" and redistribute to the cluster; ''then'' restart the daemons, mount, declare victory.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-20T16:07:16Z<p>Richterd: </p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM) </tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolumeGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)<br />
<br />
==Update: reactions from Connectathon '08==<br />
The purpose of this entire VMware/GFS2 setup in the first place was so I could work on a pNFS/GFS2 MDS at Connectathon '08 with Frank Filz, Dean Hildebrand, and Marc Eshel (all gentlemen from IBM). <br />
<br />
On the one hand, once I had a primary guest system set up and could just clone it to make a cluster, it was very easy to make kernel changes, rebuild, push things out the cluster, and reboot.<br />
<br />
The downside came during testing, when we tried doing pNFS writes of several KB or more -- the RPC layer would barf on the packet with a message like "Error: bad tcp reclen". Fortunately, Dean recalled that Ricardo Labiaga had had a similar problem with KVM (or UML?) at the fall 2007 CITI Bakeathon, so we started to suspect VMware. I quick set up two laptops to act as GFS2 nodes, accessing shared storage with AoE. I shut down the VMware cluster, configured it so that only one VMware node and the two new laptops would be a 3-node GFS2 cluster, and brought up the new cluster. Then, using the node in VMware as a pNFS MDS and the two laptops as DSes, we almost immediately were able to pass the Connectathon test suite.<br />
<br />
'''The verdict''': VMware Workstation 6 still totally impresses me, but it's probably better to do cluster work on an actual cluster. That said, my I/O troubles may just stem from my laptop, or my particular NIC driver, or whatever -- I can't imagine that there aren't ways to resolve that somehow.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-13T18:22:47Z<p>Richterd: /* LVM and GFS2 setup */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM) </tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolumeGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-12T18:00:54Z<p>Richterd: /* Adding disk space to an LVM'ed VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM) </tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolumeGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-09T18:14:21Z<p>Richterd: /* Adding disk space to a VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to an LVM'ed VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM) </tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVOl00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolumeGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-08T23:36:31Z<p>Richterd: /* Adding disk space to a VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing; a minimal sketch follows this list.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
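For the <tt>cluster.conf</tt> step above, I'm not reproducing my exact file here, but the general shape is something like the following; node names, nodeids, and the fencing stanza are illustrative, so double-check the exact schema against your cluster suite's documentation. Once the daemons are up, <tt>cman_tool nodes</tt> is a quick way to confirm all three guests joined:<br />
 <?xml version="1.0"?><br />
 <cluster name="GuestCluster" config_version="1"><br />
   <clusternodes><br />
     <clusternode name="guest1" nodeid="1"><br />
       <fence><method name="human"><device name="manual" nodename="guest1"/></method></fence><br />
     </clusternode><br />
     <!-- guest2 and guest3 entries look the same, with their own names and nodeids --><br />
   </clusternodes><br />
   <fencedevices><br />
     <fencedevice name="manual" agent="fence_manual"/><br />
   </fencedevices><br />
 </cluster><br />
 <br />
 [guest1] $ sudo cman_tool nodes     # all three guests should show up as members<br />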
<br />
==Adding disk space to a VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk /dev/sdb # etc etc '''NB:''' make sure that the partition type is '''0x8e''' (Linux LVM)</tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt><br />
* extend the filesystem itself within the logical volume (it can handle online resizing):<br />
** <tt>[guest1] $ resize2fs /dev/VolGroup00/LogVol00</tt><br />
<br />
At this point, hopefully <tt>df -k</tt> should show you a larger volume :)</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-08T23:03:59Z<p>Richterd: /* Adding disk space to a VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to a VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] $ fdisk # etc etc</tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt> -- should be good. <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-08T23:03:41Z<p>Richterd: /* Adding disk space to a VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to a VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device:<br />
** <tt>[guest1] fdisk # etc etc</tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt> -- should be good. <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-08T23:03:19Z<p>Richterd: /* Adding disk space to a VMware guest */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to a VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device<br />
** <tt>[guest1] fdisk # etc etc</tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
* Inspect things with <tt>pvs</tt>, <tt>vgs</tt>, and <tt>lvs</tt> -- should be good. <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-08T23:02:06Z<p>Richterd: </p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt><br />
<br />
==Adding disk space to a VMware guest==<br />
Having blithely thought that 4GB of disk space per guest (which Fedora LVMs as <tt>VolGroup00</tt>) would be sufficient, I then <tt>git-clone</tt>d my repo and then didn't have enough space to build my kernels; gak. (Since I'm building things on just one guest and then cloning it, I'm hoping that maybe I can somehow shrink the cloned guests' disks back down to just 4GB.)<br />
* in VMware, I went to Edit Virtual Machine Settings -> Add (device). I created a (virtual) SCSI disk, 3GB, allocate on-demand, and added it to my guest.<br />
** after starting the guest, the disk appeared as <tt>/dev/sdb</tt> <br />
* create a single partition using the entire device<br />
** <tt>[guest1] fdisk # etc etc</tt><br />
* make a single LVM physical volume on it:<br />
** <tt>[guest1] $ pvcreate -M 2 /dev/sdb1</tt><br />
* extend the existing volume group by adding the prepped physical volume:<br />
** <tt>[guest1] $ vgextend VolGroup00 /dev/sdb1</tt><br />
* extend the logical volume to use the entire (now-larger) volume group:<br />
** <tt>[guest1] $ lvextend -l +100%FREE /dev/VolGroup00/LogVol00</tt><br />
Le voila.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-05T11:04:26Z<p>Richterd: maybe "guest1" is more useful to the reader than the hostname "fatsuit"..</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"guest1"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>guest1</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[guest1] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[guest1] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[guest1] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>guest1</tt> twice: as <tt>guest2</tt> and <tt>guest3</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w guest[1-3] sudo service cman start && pdsh -w guest[1-3] sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[guest1] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[guest1] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-04T16:26:25Z<p>Richterd: </p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"fatsuit"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>fatsuit</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[fatsuit] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[fatsuit] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[fatsuit] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>fatsuit</tt> twice: as <tt>hagbard</tt> and <tt>wingnut</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w fatsuit,hagbard,wingnut sudo service cman start && pdsh -w fatsuit,hagbard,wingnut sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[fatsuit] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[fatsuit] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-04T16:26:13Z<p>Richterd: /* ATA over Ethernet (for guest "cluster" shared storage) */</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"fatsuit"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
<br />
==ATA over Ethernet (for guest cluster shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>fatsuit</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE:<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** <tt>[fatsuit] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[fatsuit] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[fatsuit] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>fatsuit</tt> twice: as <tt>hagbard</tt> and <tt>wingnut</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w fatsuit,hagbard,wingnut sudo service cman start && pdsh -w fatsuit,hagbard,wingnut sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[fatsuit] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[fatsuit] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_designPNFS prototype design2008-05-04T16:13:49Z<p>Richterd: /* General Information */</p>
<hr />
<div>= pNFS =<br />
<br />
'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.<br />
<br />
== General Information ==<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]<br />
<br />
* [[pNFS Setup Instructions]] - Basic pNFS setup instructions.<br />
<br />
* [[GFS2 Setup Notes]] are basic install notes from setting up a small cluster (perhaps useful for the GFS2 MDS work).<br />
<br />
* [[GFS2 Cluster in VMware]] is a follow-up where I quickly set up a 3-node cluster on my laptop for use at Connectathon.<br />
<br />
== Current Issues ==<br />
* [[pNFS Todo List|pNFS Todo List]]<br />
<br />
* [[pNFS Implementation Issues|pNFS Implementation Issues]]<br />
<br />
* [[Bakeathon 2007 Issues List|Bakeathon 2007 Issues List]]<br />
<br />
* [[pNFS Development Road Map]]<br />
<br />
* [http://spreadsheets.google.com/pub?key=pGVvgce8dC-WWbowI9TSmEg Linux pNFS Development Gantt Chart]<br />
<br />
* [[pNFS Git tree recipies|pNFS Git tree recipes]]<br />
<br />
* [[pNFS Development Git tree|pNFS Development Git tree]]<br />
<br />
* [[Wireshark Patches|Wireshark Patches]]<br />
<br />
== Old Issues ==<br />
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]<br />
<br />
* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]<br />
<br />
* [[pNFS todo List|pNFS todo List July 2007]]</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Cluster_in_VMwareGFS2 Cluster in VMware2008-05-04T16:12:35Z<p>Richterd: New page: ==VMware== * bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM). * made a new virtual machine: '''OS''': Linux, '''V...</p>
<hr />
<div>==VMware==<br />
* bought a copy of VMware Workstation 6, installed it on my T-43 Thinkpad <tt>"atro"</tt>(running openSuSE 10.2, 2GB of RAM).<br />
* made a new virtual machine: '''OS''': Linux, '''Version''': "Other Linux 2.6.x kernel", '''Networking''': Bridged, '''Disk''': 4GB, split into 2GB files, '''RAM''': 256MB<br />
* installed Fedora 8 in it -- even X worked well with only 256MB of RAM(!) -- guest is named <tt>"fatsuit"</tt><br />
* yum-installed '''gfs2-utils''' and '''libvolume_id-devel''' ''(i also tried cman, cman-devel, openais, openais-devel, and lvm2-cluster, but even '''they''' were out-of-date with the stock Fedora kernel, and so are also too old for the pNFS kernels)''<br />
* downloaded and installed '''device-mapper-1.02.22''', '''openais-0.80.3''', '''cluster-2.01.00''', and '''lvm2-2.02.28'''<br />
<br />
<br />
==ATA over Ethernet (for guest "cluster" shared storage)==<br />
* yum-installed AoE initiator (client) '''aoetools-18-1''' on <tt>fatsuit</tt><br />
* downloaded AoE target (server) [http://internap.dl.sourceforge.net/sourceforge/aoetools/vblade-15.tgz vblade-15.tgz] and installed it on <tt>atro</tt><br />
* i set aside a spare partition on <tt>atro</tt> to export as a block device over AoE<br />
** <tt>[atro] $ sudo ln -s /dev/sda6 /dev/AoE</tt><br />
** <tt>[atro] $ sudo vbladed 0 1 eth0 /dev/AoE</tt> ''(major dev num 0, minor 1)''<br />
** added the following to <tt>/etc/fstab</tt> so i can access the data from both host and guest OSes:<br />
*** <tt>/dev/AoE /mnt/AoE ext3 noauto,acl,user_xattr 1 1</tt><br />
** <tt>[fatsuit] $ sudo modprobe aoe</tt><br />
*** .. AoE discovers all exported devices on the LAN; mine was the only one, and immediately appeared as <tt>/dev/etherd/e0.1</tt>. Mounting it "just worked"; props to AoE!<br />
<br />
<br />
==LVM and GFS2 setup==<br />
* prep physical volume for LVM:<br />
** <tt>[fatsuit] $ sudo pvcreate -M 2 /dev/etherd/e0.1</tt><br />
* create the volume group '''GuestVolGroup''' and add all of the AoE "device" to it:<br />
** <tt>[fatsuit] $ sudo vgcreate -M 2 -s 1m -c y GuestVolGroup /dev/etherd/e0.1</tt><br />
* edit <tt>/etc/lvm/lvm.conf</tt> and make sure to set locking_type to DLM<br />
* before further stuff can proceed, the cluster needs to be up and <tt>clvmd</tt> needs to be running everywhere. So, in VMware I cloned <tt>fatsuit</tt> twice: as <tt>hagbard</tt> and <tt>wingnut</tt>.<br />
* edit <tt>/etc/cluster.conf</tt> and name the cluster <tt>'''GuestCluster'''</tt> and set up the three nodes with manual (read: ignored) fencing.<br />
* bring up the cluster: <br />
** <tt>$ pdsh -w fatsuit,hagbard,wingnut sudo service cman start && pdsh -w fatsuit,hagbard,wingnut sudo service clvmd start</tt><br />
* create the logical volume '''GuestVolume''' and assign the full volume group to it:<br />
** <tt>[fatsuit] $ sudo lvcreate -n GuestVolume -l 100%VG GuestVolGroup</tt><br />
* .. and make a GFS2 fs therein:<br />
** <tt>[fatsuit] $ sudo gfs2_mkfs -j 3 -p lock_dlm -t GuestCluster:GuestFS /dev/GuestVolGroup/GuestVolume</tt><br />
* restart the daemons, then mount and your VMware GFS2 cluster should be good to go! <tt>:)</tt></div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_designPNFS prototype design2008-05-04T16:11:04Z<p>Richterd: </p>
<hr />
<div>= pNFS =<br />
<br />
'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.<br />
<br />
== General Information ==<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]<br />
<br />
* [[pNFS Setup Instructions]] - Basic pNFS setup instructions.<br />
<br />
* [[GFS2 Setup Notes]] are basic install notes from setting up a small cluster (perhaps useful for the GFS2 MDS work).<br />
<br />
* [[GFS2 Cluster in VMware]] is a follow-up where I set up a 3-node cluster on my laptop for use at Connectathon.<br />
<br />
== Current Issues ==<br />
* [[pNFS Todo List|pNFS Todo List]]<br />
<br />
* [[pNFS Implementation Issues|pNFS Implementation Issues]]<br />
<br />
* [[Bakeathon 2007 Issues List|Bakeathon 2007 Issues List]]<br />
<br />
* [[pNFS Development Road Map]]<br />
<br />
* [http://spreadsheets.google.com/pub?key=pGVvgce8dC-WWbowI9TSmEg Linux pNFS Development Gantt Chart]<br />
<br />
* [[pNFS Git tree recipies|pNFS Git tree recipes]]<br />
<br />
* [[pNFS Development Git tree|pNFS Development Git tree]]<br />
<br />
* [[Wireshark Patches|Wireshark Patches]]<br />
<br />
== Old Issues ==<br />
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]<br />
<br />
* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]<br />
<br />
* [[pNFS todo List|pNFS todo List July 2007]]</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/GFS2_Setup_NotesGFS2 Setup Notes2008-04-07T22:28:08Z<p>Richterd: New page: ==Initial install== ===Basics=== Started with fresh installs of RHEL5.0 on 4 nodes of mixed hardware, all attached to a shared MSA-1000 fibre channel 8-disk array (in two sets of 4, ~550GB...</p>
<hr />
<div>==Initial install==<br />
===Basics===<br />
Started with fresh installs of RHEL5.0 on 4 nodes of mixed hardware, all attached to a shared MSA-1000 fibre channel 8-disk array (in two sets of 4, ~550GB total).<br />
<br />
* installed cluster and update RPMs from wendy cheng:<br />
** <tt> cman-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> cman-devel-2.0.64-1.el5.x86_64.rpm </tt><br />
** <tt> device-mapper-1.02.13-1.el5.x86_64.rpm </tt><br />
** <tt> gfs-utils-0.1.11-3.el5.x86_64.rpm </tt><br />
** <tt> gfs2-utils-0.1.25-1.el5.x86_64.rpm </tt><br />
** <tt> gnbd-1.1.5-1.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> kmod-gfs-0.1.16-5.2.6.18_8.1.4.el5.x86_64.rpm </tt><br />
** <tt> kmod-gnbd-0.1.3-4.2.6.18_8.1.4.el5.x86_64.rpm </tt> ''(unused?)''<br />
** <tt> lvm2-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> lvm2-cluster-2.02.16-3.el5.x86_64.rpm </tt><br />
** <tt> openais-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> openais-devel-0.80.2-1.el5.x86_64.rpm </tt><br />
** <tt> system-config-cluster-1.0.39-1.0.noarch.rpm </tt> ''(just a python frontend for several <tt>vg*</tt>, <tt>lv*</tt>, and <tt>pv*</tt> commands)''<br />
<br />
<br />
===Configuring <tt>cman</tt> and <tt>clvmd</tt>===<br />
* '''cman''': at first I tried using <tt>system-config-cluster</tt> to set up <tt>cman</tt>, but given that I didn't have any complicated fencing or quorum-related needs, I basically just took a generic <tt>cluster.conf</tt> and edited it. My <tt>[http://www.citi.umich.edu/u/richterd/gfs2/cluster.conf cluster.conf]</tt> is real basic and has manual fencing set up to be a no-op (I'd get complaints from the daemons if I didn't have any fencing setup).<br />
** distribute the new <tt>cluster.conf</tt> to all nodes; on the first run, you can just use <tt>scp</tt> or whatever.<br />
** once the cluster's up, though, propagating and setting changes on all nodes takes two steps. From the node with the updated configuration, do:<br />
*** <tt>$ sudo ccs_tool update /path/to/new/cluster.conf</tt> ''(pushes to all nodes listed in conf file)''<br />
*** <tt>$ sudo cman_tool version -r <new-version-number></tt> ''(a generation number to keep the nodes synched)''<br />
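Concretely, the two propagation steps look like this; the number handed to <tt>cman_tool version -r</tt> has to match the <tt>config_version</tt> attribute you bumped in the conf file (the path and version number here are only illustrative):<br />
 # in the new cluster.conf, bump the generation number first, e.g.:<br />
 #   <cluster name="GFS2_Cluster" config_version="3"> ... </cluster><br />
 $ sudo ccs_tool update /path/to/new/cluster.conf<br />
 $ sudo cman_tool version -r 3<br />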
<br />
* '''clvmd''': as before, I tried using <tt>system-config-lvm</tt> to set up <tt>clvmd</tt>, but it's not quite "there yet" -- it'd get wedged or go blind to clustered volumes at strange times. Again, tweaking a mostly-templated (and very well-commented) stock conf file wasn't hard; my <tt>[http://www.citi.umich.edu/u/richterd/gfs2/lvm.conf lvm.conf]</tt> is real simple. ''Note:'' btw, in my setup the MSA-1000 disk array is initially set up to do raid0 on the 8 disks in two groups of 4; my machines see 2 block devices, each with a capacity of ~270GB. <br />
** create a single Linux (0x83) partition on each "disk", using the whole device; repeat for <tt>/dev/sdc</tt><br />
*** <tt>$ sudo fdisk /dev/sdb</tt><br />
** create physical volumes with LVM2 metadata<br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdb1</tt><br />
*** <tt>$ sudo pvcreate -M 2 /dev/sdc1</tt><br />
** create a clustered volume group and add <tt>/dev/sdb1</tt> to it<br />
*** <tt>$ sudo vgcreate -M 2 -l 256 -p 256 -s 4m -c y VolGroupCluster /dev/sdb1</tt><br />
*** <tt>$ sudo pvscan</tt> ''# (verify it worked)''<br />
** edit <tt>lvm.conf</tt> and make sure that "<tt>locking_type</tt>" is set to 3 (<tt>DLM</tt>).<br />
** distribute <tt>lvm.conf</tt> to all the nodes<br />
** start up both <tt>cman</tt> and <tt>clvmd</tt> everywhere. ''Note:'' fwiw, I use [https://computing.llnl.gov/linux/pdsh.html pdsh], the parallel distributed shell, to communicate to all nodes at once; I have mine use <tt>ssh</tt> for transport. E.g., from my .bashrc:<br />
*** <tt> $ alias start-cluster='for svc in cman clvmd ; do pdsh -w node[1-4] sudo service $svc start; done'</tt><br />
** add <tt>/dev/sdc1</tt> to the existing volume group (needs the daemons running)<br />
*** <tt>$ sudo vgextend VolGroupCluster /dev/sdc1</tt><br />
*** <tt>$ sudo vgs</tt> ''# (verify that the "clustering" flag is set on the volgroup)''<br />
** create a logical volume using the whole volgroup<br />
*** <tt>$ sudo lvcreate -n ClusterVolume -l 138924 VolGroupCluster</tt><br />
*** <tt>$ sudo lvdisplay -c -a</tt> ''# (verify that it worked)''<br />
** create a GFS2 filesystem therein<br />
*** <tt>$ sudo gfs2_mkfs -j 4 -p lock_dlm -t GFS2_Cluster:ClusterFS -O /dev/VolGroupCluster/ClusterVolume</tt><br />
** edit <tt>/etc/fstab</tt> to add a mountpoint, restart the daemons, and mount!<br />
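For the fstab step, one plausible entry looks like the following; the mount point is my own pick and the options are just a sane starting point:<br />
 /dev/VolGroupCluster/ClusterVolume  /mnt/gfs2  gfs2  defaults,noatime  0 0<br />
 $ sudo mount /mnt/gfs2 && df -h /mnt/gfs2<br />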
<br />
<br />
===Custom kernels===<br />
Once the basics were going, I built some kernels and things more or less worked -- except I had a heck of a time getting the <tt>Qlogic</tt> firmware to load properly. I'm fine with building the <tt>initramfs</tt> "initrds" by hand, ''but'' not for the firmware in this setup; I don't know, I guess I'm a <tt>udev</tt> idiot or something. What I ended up doing was bogarting a vendor patch from Red Hat (bless their hearts ;) that side-stepped the issue and just built the blobs into the GFS kernel module. A [http://www.citi.umich.edu/u/richterd/gfs2/add-qlogic-firmware-blob--2.6.22.19.diff slightly-updated version against 2.6.22.19] is available.<br />
<br />
<br />
==Upgrading GFS2 userland for kernels >2.6.18==<br />
Not too long after the initial install (which came with a 2.6.18-based kernel), I found that the in-kernel <tt>DLM</tt> (distributed lock manager) stuff changed recently and required a corresponding update to userspace <tt>LVM2</tt> (logical volume manager) tools.<br />
<br />
While Wendy Cheng had gotten things off the ground by giving me the bag of RPMs, we didn't get any RHN entitlements, so no updates = pain in the neck. I did finally manage to find a way to sneak RHEL5 packages out of RHN despite the lack of entitlement, but I had to do it by hand and I had to re-login for each package. Worse, when I finally did get the newest RPMs, they weren't even new enough anyway. Lesson learned: build from source. <br />
<br />
I wasn't sure that it was the best idea, but since I already had GFS2 working with the stock userland, I was skittish and didn't want to clobber the system RPMs so I installed under my home directory; worked fine.<br />
<br />
* got the newest packages:<br />
** [http://sources.redhat.com/dm/ device-mapper.1.02.22]<br />
** [http://www.openais.org/ openAIS] ''(get the stable/"whitetank" release)''<br />
** [ftp://sources.redhat.com/pub/cluster/releases/ cluster-2.01.00 tools]<br />
** [http://sources.redhat.com/lvm2/ LVM2.2.02.28]<br />
** <tt>libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt> ''(bogarted from RHN)''<br />
<br />
* <tt>export CLUSTER=/home/richterd/projects/nfs/CLUSTER; cd $CLUSTER</tt><br />
* <tt>mkdir device-mapper-OBJ cluster-OBJ LVM2-OBJ</tt><br />
<br />
* device-mapper:<br />
** <tt>./configure --prefix=$CLUSTER/device-mapper-OBJ && make && sudo make install</tt><br />
** add <tt>$CLUSTER/device-mapper-OBJ/lib</tt> to <tt>/etc/ld.so.conf</tt> and rerun <tt>ldconfig</tt><br />
<br />
* openAIS:<br />
** edit the Makefile; set <tt>DESTDIR</tt> to the empty string<br />
** <tt>make && sudo make install</tt> -- at some point, this clobbered some of the RPM stuff; meh.<br />
** added <tt>/usr/lib64/openais</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt><br />
<br />
* libvolume_id-devel:<br />
** <tt>sudo rpm -ivh libvolume_id-devel-095-14.5.el5.x86_64.rpm</tt><br />
<br />
* cluster tools:<br />
** <tt>./configure --prefix=$CLUSTER/cluster-OBJ --openaislibdir=/usr/lib64/openais --dlmincdir=/lib/modules/<kernel>/source/include</tt><br />
** edit <tt>dlm/lib/Makefile</tt> and add: <tt>CFLAGS += -I$(dlmincdir)</tt><br />
** since I was doing my "trial" install, I added <tt>$CLUSTER/cluster-OBJ/usr/lib</tt> to <tt>ld.so.conf</tt> and reran <tt>ldconfig</tt>. I anticipate going back and installing things in real system locations now that I know things worked <tt>:)</tt><br />
** <tt>make && sudo make install</tt><br />
<br />
* LVM2:<br />
** <tt>./configure --prefix=$CLUSTER/LVM2-OBJ --with-lvm1=none --with-dmdir=$CLUSTER/device-mapper-OBJ --with-clvmd=cman</tt><br />
** edit <tt>make.tmpl</tt> and look for where the above <tt>dmdir</tt> is set; my <tt>configure</tt> screwed up and appended <tt>"/ioctl"</tt> to the end and I had to trim it.<br />
*** '''fix''': rather, first trim from <tt>make.tmpl.in</tt>, where it originates for whatever reason<br />
** <tt>make && sudo make install</tt><br />
<br />
.. at this point, I had a <tt>clvmd</tt> that linked against the right shared libraries and that could deal with the kernel's modified <tt>DLM</tt> setup.<br />
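One quick way to verify that claim is to point <tt>ldd</tt> at the freshly built <tt>clvmd</tt> and make sure the cluster and device-mapper libraries resolve to the new locations rather than the stock RPM ones. The path below just follows the $CLUSTER layout used above (adjust it if <tt>make install</tt> put <tt>clvmd</tt> somewhere else under your prefix), and the exact library names will vary with the build:<br />
 $ ldd $CLUSTER/cluster-OBJ/usr/sbin/clvmd | egrep 'dlm|cman|devmapper|openais'<br />
 # each match should point at $CLUSTER/... or /usr/lib64/openais, not the old system paths<br />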
<br />
<br />
==Troubleshooting the clustering flag==<br />
'''Problem 1:''' LVM changes don't appear to "take". Quoting from an email I found online (XXX: cite):<br />
Why aren't changes to my logical volume being picked up by the rest of the cluster?<br />
<br />
There's a little-known "clustering" flag for volume groups that should be set on when a cluster uses a shared volume. <br />
If that bit is not set, you can see strange lvm problems on your cluster. For example, if you extend a volume with <br />
lvresize and gfs_grow, the other nodes in the cluster will not be informed of the resize, and will likely crash when <br />
they try to access the volume.<br />
<br />
To check if the clustering flag is on for a volume group, use the "vgs" command and see if the "Attr" column shows <br />
a "c". If the attr column shows something like "wz--n-" the clustering flag is off for the volume group. If the <br />
"Attr" column shows something like "wz--nc" the clustering flag is on.<br />
<br />
To set the clustering flag on, use this command: <tt>vgchange -cy</tt><br />
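Concretely, the check and the fix from that email look like this; the volume group name is from my setup and the output is abbreviated/illustrative:<br />
 $ sudo vgs -o vg_name,vg_attr<br />
   VG              Attr<br />
   VolGroupCluster wz--nc        <-- trailing 'c' means the clustering flag is set<br />
 $ sudo vgchange -cy VolGroupCluster     # set the flag if it's missing<br />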
<br />
'''Problem 2:''' In the midst of adding a new node to the cluster, <tt>clvmd</tt> wouldn't start on other nodes and recognize the disk array.<br />
<br />
I tried the above <tt>vgchange -cy</tt> thing and screwed it up by making the ''local disk's'' VG '''clustered''' (ugh). [http://kbase.redhat.com/faq/FAQ_96_11024.shtm The problem] made sense, but the trick of temporarily changing the locking type was what I was missing when I tried to undo my mistake.<br />
<br />
The fix: make sure uniform <tt>lvm.conf</tt>s are tweaked as per the link above and distributed to the cluster; start <tt>cman/clvmd</tt> everywhere; ''then'' use <tt>vgchange -cn VolGroup00</tt> to remove clustering flag (''<tt>VolGroup00</tt> is the local disk's VG, set up during the RHEL install''); ''then'' set the <tt>lvm.conf</tt> locking stuff back to "clustered" and redistribute to the cluster; ''then'' restart the daemons, mount, declare victory.</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_designPNFS prototype design2008-04-07T22:26:26Z<p>Richterd: </p>
<hr />
<div>= pNFS =<br />
<br />
'''pNFS''' is part of the first NFSv4 minor version. This space is used to track and share Linux pNFS implementation ideas and issues.<br />
<br />
== General Information ==<br />
<br />
* [http://www.citi.umich.edu/projects/asci/pnfs/linux/ Linux pNFS Implementation Homepage]<br />
<br />
* [[GFS2 Setup Notes]] are basic install notes from setting up a small cluster (perhaps useful for the GFS2 MDS work).<br />
<br />
== Current Issues ==<br />
* [[pNFS Todo List|pNFS Todo List]]<br />
<br />
* [[pNFS Implementation Issues|pNFS Implementation Issues]]<br />
<br />
* [[Bakeathon 2007 Issues List|Bakeathon 2007 Issues List]]<br />
<br />
* [[pNFS Development Road Map]]<br />
<br />
* [http://spreadsheets.google.com/pub?key=pGVvgce8dC-WWbowI9TSmEg Linux pNFS Development Gantt Chart]<br />
<br />
* [[pNFS Git tree recipies|pNFS Git tree recipes]]<br />
<br />
* [[pNFS Development Git tree|pNFS Development Git tree]]<br />
<br />
* [[Wireshark Patches|Wireshark Patches]]<br />
<br />
== Old Issues ==<br />
* [[Cthon06 Meeting Notes|Connectathon 2006 Linux pNFS Implementation Meeting Notes]]<br />
<br />
* [[linux pnfs client rewrite may 2006|Linux pNFS Client Internal Reorg patches May 2006 - For Display Purposes Only - Do Not Use]]<br />
<br />
* [[pNFS todo List|pNFS todo List July 2007]]</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/CITI_Experience_with_Directory_DelegationsCITI Experience with Directory Delegations2008-01-16T20:55:17Z<p>Richterd: /* Negative Caching */</p>
<hr />
<div>=Background=<br />
<br />
To improve performance and reliability, NFSv4.1 introduces read-only '''directory delegations''', a protocol extension that allows consistent caching of directory contents. <br />
CITI is implementing directory delegations as described in Section 11 of [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft].<br />
<br />
==Directory Caching in NFSv4==<br />
<br />
NFSv4 allows clients to cache directory contents:<br />
<br />
* READDIR uses a directory entry cache<br />
* LOOKUP uses the name cache<br />
* ACCESS and GETATTR use a directory metadata cache.<br />
<br />
To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically. <br />
<br />
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.<br />
<br />
Analysis of network traces at the University of Michigan ('''FIXME''': need link to a copy of Brian Wickman's prelim) shows that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.<br />
<br />
==How Directory Delegations Can Help==<br />
<br />
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
A common "high miss" case involves shell PATH lookups.<br />
To execute a program, the shell walks down a list of directories specified in a user's $PATH<br />
environment variable and tries to locate the executable file in each directory. <br />
It is not uncommon to find a large number of directories in the list. When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories are revalidated, in case the file appears at some point. <br />
<br />
Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted. With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as '''negative dentry caching'''. With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.<br />
<br />
Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations.<br />
The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.<br />
<br />
Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic<br />
consists of the periodic GETATTRs sent by clients when an attribute timeout<br />
triggers a cache revalidation. But a delegated directory need not be revalidated unless the directory is modified. <br />
<br />
* Should reference Wickman and ... um ... CMU? Ousterhout?<br />
* From which we can make "a great deal" more specific?<br />
<br />
==Directory Delegation Operations==<br />
An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation.<br />
Granting a delegation request is solely at the server's discretion, and the delegation may be <br />
recalled at any time.<br />
<br />
Upon receiving an operation that conflicts with an existing delegation, the server must first <br />
recall from all of its clients any delegations on the directory (or directories) being mutated. <br />
When a client receives that CB_RECALL callback operation, it relinquishes the delegation in <br />
question by responding to the server using the DELEGRETURN operation.<br />
When all of the requisite delegations have been returned (or forcefully timed-out), the server<br />
allows the conflicting operation to proceed.<br />
<br />
Although NFS clients and servers have knowledge of the acquisition and recall of directory <br />
delegations, delegation state is opaque to applications.<br />
<br />
==Notifications==<br />
After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used. <br />
For a large directory, this cost, which is above and beyond the two RPCs needed for the recall, can be quite expensive.<br />
If the directory also happens to be a popular one — with multiple clients holding delegations — the performance impact on the server can be considerable.<br />
<br />
To reduce the impact of a directory modification when the change is small,<br />
the NFSv4.1 Internet Draft defines an extension to delegations called ''notifications.''<br />
When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.<br />
<br />
By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.<br />
<br />
Notifications are motivated by some common cases. For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory. Other examples include program compilation and CVS updates, which also quickly create and destroy files.<br />
<br />
In the proposal for notifications, a client can request notifications on<br />
directory entry and directory attribute changes, as well as directory entry<br />
attribute changes. To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously. In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.<br />
<br />
* ref ousterhout<br />
<br />
===Issues with notifications===<br />
Notifications require state on the NFS server to keep track of them and work to deliver them.<br />
Wickman's simulator work at CITI<br />
found that in some<br />
cases, the cost of the<br />
notifications dispatched to support a directory delegation can exceed<br />
the cost of simply not using a delegation at all.<br />
A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.<br />
<br />
Notifications also raise a fairness question: given limited server resources, how should notifications be allotted among multiple clients?<br />
<br />
Notifications can be sent asynchronously, at a rate negotiated by the client and server.<br />
This allows the server to batch several notifications<br />
and to prune self-cancelling<br />
notifications (e.g., "CREATE foo ... REMOVE foo").<br />
Indeed, Wickman found that for<br />
certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50.<br />
For instance, lock files in mail boxes often have a lifetime<br />
under 10 seconds, so addition/deletion notifications can be pruned. <br />
However, there<br />
is a trade-off between the batching delay and client<br />
cache consistency. <br />
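<br />
As a concrete illustration of the pruning idea, here is a small, self-contained sketch (the types and names are ours, purely for illustration) that drops "CREATE foo ... REMOVE foo" pairs from a pending batch before it is sent:<br />
<br />
 #include <string.h><br />
 <br />
 enum ntype { N_CREATE, N_REMOVE };<br />
 <br />
 struct notification {<br />
         enum ntype type;<br />
         char       name[256];<br />
 };<br />
 <br />
 /* Remove self-cancelling CREATE/REMOVE pairs in place; returns the new count. */<br />
 static int prune_batch(struct notification *batch, int n)<br />
 {<br />
         int i, j, k;<br />
 <br />
         for (i = 0; i < n; i++) {<br />
                 if (batch[i].type != N_CREATE)<br />
                         continue;<br />
                 for (j = i + 1; j < n; j++) {<br />
                         if (batch[j].type == N_REMOVE &&<br />
                             strcmp(batch[i].name, batch[j].name) == 0) {<br />
                                 for (k = j; k < n - 1; k++)     /* drop the REMOVE */<br />
                                         batch[k] = batch[k + 1];<br />
                                 n--;<br />
                                 for (k = i; k < n - 1; k++)     /* drop the CREATE */<br />
                                         batch[k] = batch[k + 1];<br />
                                 n--;<br />
                                 i--;            /* re-examine the slot we shifted into */<br />
                                 break;<br />
                         }<br />
                 }<br />
         }<br />
         return n;<br />
 }<br />
<br />
A real implementation would of course also have to decide what to do with any intervening notifications that refer to the pruned entry.<br />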
<br />
Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing<br />
notifications at this time.<br />
<br />
=Using Directory Delegations=<br />
<br />
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.<br />
<br />
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)<br />
<br />
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:<br />
<br />
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."<br />
<br />
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.<br />
<br />
Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.<br />
<br />
=Negative Caching=<br />
<br />
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. <br />
Close-to-open consistency requires that even when previous LOOKUPs or OPENs for a given file have recently and repeatedly failed, subsequent attempts revalidate the parent directory with a GETATTR in case the file has since appeared. With directory delegations, the client is assured that no new entries or removals have occurred while a delegation is in effect; this implies that negative dentries in a delegated directory can indeed be "trusted". <br />
<br />
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for <br />
a header file in include paths or a shared library in LD_LIBRARY_PATH ''(See the '''Some preliminary numbers''' section for more details)''. Knowing just when to acquire those delegations may be a matter to address in client-side policy.<br />
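<br />
Conceptually (this is a sketch, not the actual fs/nfs code; nfs_have_dir_delegation() is a hypothetical helper), the client-side shortcut amounts to something like:<br />
<br />
 /* While the parent directory is delegated, a negative dentry can be trusted as-is. */<br />
 static int negative_dentry_still_valid(struct dentry *dentry)<br />
 {<br />
         struct inode *dir = dentry->d_parent->d_inode;<br />
 <br />
         /* Negative dentry + delegated parent => no LOOKUP/GETATTR round-trip needed. */<br />
         return dentry->d_inode == NULL && nfs_have_dir_delegation(dir);<br />
 }<br />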
<br />
=Delegations and the Linux VFS Lease Subsystem=<br />
<br />
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).<br />
<br />
Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. Leases used by NFS are all acquired and revoked in-kernel.<br />
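<br />
For readers unfamiliar with leases, here is a minimal user-space sketch of taking and releasing a read lease with fcntl(2); the in-kernel leases used for NFS delegations follow the same break/return protocol but never involve signals or user space. (Error checking is omitted.)<br />
<br />
 #define _GNU_SOURCE             /* for F_SETLEASE */<br />
 #include <fcntl.h><br />
 #include <signal.h><br />
 #include <unistd.h><br />
 <br />
 static volatile sig_atomic_t lease_breaking;<br />
 <br />
 static void on_break(int sig)<br />
 {<br />
         lease_breaking = 1;     /* the kernel wants the lease back */<br />
 }<br />
 <br />
 int main(void)<br />
 {<br />
         int fd = open("/tmp/somefile", O_RDONLY);<br />
 <br />
         signal(SIGIO, on_break);                /* SIGIO is the default lease-break signal */<br />
         fcntl(fd, F_SETLEASE, F_RDLCK);         /* take a read lease; fd must be open read-only */<br />
 <br />
         while (!lease_breaking)                 /* work with cached data until the break arrives */<br />
                 pause();<br />
 <br />
         fcntl(fd, F_SETLEASE, F_UNLCK);         /* cleanup done; relinquish the lease */<br />
         close(fd);<br />
         return 0;<br />
 }<br />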
<br />
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement <br />
directory delegations, we have added support for directory leases. These break when a leased directory is mutated by additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files (e.g., writing to a file within the directory) do not break directory leases.<br />
<br />
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases. Our approach is described in the next section.<br />
<br />
=Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases=<br />
<br />
In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".<br />
<br />
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is '''''also''''' the caller performing an operation that conflicts with the lease-type, as described above.<br />
<br />
Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:<br />
<br />
# Delay responding to the original operation until all recalls are complete.<br />
# Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.<br />
# Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.<br />
<br />
For now, we have implemented option number 2.<br />
<br />
Our current approach to integrating NFS delegations with Linux VFS leases (i.e., ensuring that all directory-mutating operations, whether local to the server or arriving over NFS, break directory leases/delegations on the server) goes something like this; a rough code sketch follows each case:<br />
<br />
''When breaking a lease where the call is coming over NFS:''<br />
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try <br />
break_lease() with O_NONBLOCK. This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially)<br />
long periods.<br />
<br />
2) If there was no lease, finish the operation, re-enable lease-granting on the inode, and we're done.<br />
<br />
3) If there was a lease, break_lease() will send the break signal(s) and nfsd will also fail (re-enabling lease-granting on the inode first)<br />
and the client gets NFS4ERR_DELAY (and should retry). The downside to this is that a pathological case could arise wherein we break a lease,<br />
return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up <br />
with a cycle.<br />
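<br />
A rough sketch of this NFS-side path is below. break_lease() is the real VFS helper and nfserr_jukebox is nfsd's name for the status that reaches the client as NFS4ERR_DELAY; the lease-granting helpers stand in for the CITI changes and are hypothetical, as is the exact mode argument used for directory leases.<br />
<br />
 /* Rough sketch only -- not the actual nfsd code. */<br />
 static __be32 nfsd_break_dir_lease(struct inode *dir)<br />
 {<br />
         int err;<br />
 <br />
         disable_lease_granting(dir);                            /* step 1: stop new leases */<br />
         err = break_lease(dir, FMODE_WRITE | O_NONBLOCK);       /* never blocks an nfsd thread */<br />
         if (err == -EWOULDBLOCK) {                              /* step 3: a lease was present */<br />
                 enable_lease_granting(dir);<br />
                 return nfserr_jukebox;                          /* client sees NFS4ERR_DELAY and retries */<br />
         }<br />
         return nfs_ok;          /* step 2: no lease; caller finishes the op, then re-enables granting */<br />
 }<br />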
<br />
<br />
''When breaking a lease where the call is server-local:''<br />
1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.<br />
<br />
2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,<br />
after which the breaker is unblocked and its operation succeeds.<br />
<br />
2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common case is that no lease is present. If break_lease() returns<br />
-EWOULDBLOCK, drop the locks, call break_lease() again, and allow it to block. Once the caller unblocks, restart the operation by reacquiring<br />
the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s). Since lease-granting was disabled early-on, <br />
the operation will succeed in one pass.<br />
<br />
3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.<br />
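<br />
And a similarly rough sketch of the server-local path, showing the locks-held (2b) case; again, the lease-granting helpers are hypothetical stand-ins for the CITI changes, and the mode flags are illustrative only:<br />
<br />
 /* Rough sketch only -- not the actual patch. */<br />
 static int mutate_directory_locally(struct inode *dir)<br />
 {<br />
         int err;<br />
 <br />
         disable_lease_granting(dir);                            /* step 1 */<br />
 retry:<br />
         mutex_lock(&dir->i_mutex);<br />
         err = break_lease(dir, FMODE_WRITE | O_NONBLOCK);       /* step 2b: don't sleep with locks held */<br />
         if (err == -EWOULDBLOCK) {<br />
                 mutex_unlock(&dir->i_mutex);<br />
                 break_lease(dir, FMODE_WRITE);                  /* block until the lease is returned */<br />
                 goto retry;                                     /* reacquire locks, redo lookups */<br />
         }<br />
 <br />
         /* ... perform the directory modification ... */<br />
 <br />
         mutex_unlock(&dir->i_mutex);<br />
         enable_lease_granting(dir);                             /* step 3 */<br />
         return err;<br />
 }<br />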
<br />
=Policy (partial)=<br />
client: request a delegation just prior to a READDIR.<br />
<br />
client: request a delegation once we've sent several (say, 3 to 5) revalidations and the directory hasn't changed.<br />
<br />
client: when to voluntarily surrender delegations? e.g., after a kernel compile, a client holds hundreds of them.<br />
<br />
server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones.<br />
<br />
server: will need to identify "misbehaving" clients and cordon them off.<br />
<br />
server: when to preemptively recall? --> tie this to a server load metric.<br />
<br />
==(simulator)==<br />
Previous work at CITI by Brian Wickman consisted of prototyping and analyzing<br />
file and directory delegations, based on recorded network traces of NFSv3 use in<br />
college environments. The stateless nature of NFSv3 required, for example, that OPEN and CLOSE operations be inserted into the traces; but because,<br />
in the absence of delegations, NFSv4 client-side cache validation closely<br />
mimics that of NFSv3, enough information was available to get an overall<br />
impression of the state of the clients' caches. Wickman wrote a simulator to<br />
use the instrumented traces to test different delegation models and policies.<br />
We now want to use real-world NFSv4 network traces with the simulator; but given<br />
the current lack of widespread mainstream deployment of NFSv4, we need to find<br />
such traces of representative workloads. Using actual NFSv4 traffic will give a<br />
more accurate picture of client-cache state and will more clearly identify<br />
operations obviated by delegations; this is both because the traces will not<br />
need to be instrumented, and because NFSv3 lacks the COMPOUND operation, with<br />
which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator<br />
will allow us to develop client- and server-side policies for requesting and<br />
granting delegations.<br />
<br />
=Some preliminary numbers=<br />
A significant demonstration of the benefits of negative dentry<br />
caching is software compilation. For instance, when<br />
building software using make(1), various directories are<br />
repeatedly searched for header files. Since header files tend to be<br />
located in only one of those directories, and since many object files depend on the<br />
same headers, there are a great number of unnecessary re-checks. By caching<br />
negative dentries, a significant number of NFS operations can be avoided.<br />
<br />
We have some rough numbers in terms of opcounts, with and without directory delegations enabled (file delegations were disabled in both cases). We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are included only for context. Again, these numbers are ''rough'', but they indicate that compilation environments stand to benefit from directory delegations.<br />
<br />
''Doing make(1) on cscope-15.5 (first without, then with directory delegations):''<br />
<br />
                 without   with<br />
 READ:               136    124<br />
 WRITE:              137    136<br />
 OPEN:              1576   1576<br />
 ACCESS:            1169    161   (86% reduction)<br />
 GETATTR:            903    628   (30% reduction)<br />
 LOOKUP:            1494    496   (67% reduction)<br />
 GET_DIR_DELEG:        -      7<br />
 DELEGRETURN:          -      1<br />
<br />
''Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):''<br />
<br />
                 without     with<br />
 READ:             19803    19892<br />
 WRITE:            21921    21869<br />
 OPEN:            497472   494648<br />
 ACCESS:           20638     3406   (83.5% reduction)<br />
 GETATTR:          41794    24563   (41.0% reduction)<br />
 LOOKUP:           45063    17447   (61.3% reduction)<br />
 READDIR:           1016      884   (13.0% reduction)<br />
 GET_DIR_DELEG:        -      750<br />
 DELEGRETURN:          -     none<br />
<br />
=Status=<br />
<br />
At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms<br />
of op-counts); pynfs tests are also being written.<br />
<br />
==The client==<br />
<br />
* The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has performed "a few" parent-directory revalidations and noticed that the directory hasn't changed during that span. <br />
* As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...<br />
* ... in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the new file, the client must revalidate its pagecache and send a READDIR to the server).<br />
* '''README: any suggestions here? —> TODO:''' get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)<br />
* TODO: redo existing opcount tests and instead tally bandwidth savings ...<br />
** getting ''real'' NFSv4 workload network traces would be great... '''(can you help? —>&nbsp; email nfsv4@linux-nfs.org)'''<br />
* When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?<br />
<br />
==The server== <br />
<br />
* Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).<br />
* The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.<br />
* An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.<br />
* The corresponding VFS-level operations also break delegations and are being tested.<br />
<!-- <br />
.. CREATE (nfsd_create() and nfsd_symlink()), LINK (nfsd_link()), REMOVE (nfsd_unlink()), and RENAME (nfsd_rename()).<br />
* OPEN(w/create) is tied-up: parent-directory delegs are now broken OK in nfsd4_open(). Breaking file-delegs on OPEN(write) is broken: <br />
nfsd_open() tries '''a)''' under statelock and '''b)''' I think usually fails bc. of O_NONBLOCK. nfsd4_truncate() has similar issue. <br />
--><br />
* How and when to respond to resource pressure? --> e.g., after compiling the linux kernel, a client holds ~750 delegations -- that's roughly 50KB of state on the server, and nearly as much on the client.<br />
* TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.<br />
* TODO: also -- policy; look at directory-delegation/file-delegation interactions.</div>
<hr />
<div>=Background=<br />
<br />
To improve performance and reliability, NFSv4.1 introduces read-only '''directory delegations''', a protocol extension that allows consistent caching of directory contents. <br />
CITI is implementing directory delegations as described in Section 11 of [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft].<br />
<br />
==Directory Caching in NFSv4==<br />
<br />
NFSv4 allows clients to cache directory contents:<br />
<br />
* READDIR uses a directory entry cache<br />
* LOOKUP uses the name cache<br />
* ACCESS and GETATTR use a directory metadata cache.<br />
<br />
To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically. <br />
<br />
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.<br />
<br />
Analysis of network traces at the University of Michigan ('''FIXME''': need link to a copy of Brian Wickman's prelim) show that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.<br />
<br />
==How Directory Delegations Can Help==<br />
<br />
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
A common "high miss" case involves shell PATH lookups.<br />
To execute a program, the shell walks down a list of directories specified in a user's $PATH<br />
environment variable and tries to locate the executable file in each directory. <br />
It is not uncommon to find a large number of directories in the list. When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories are revalidated, in case the file appears at some point. <br />
<br />
Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted. With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as '''negative dentry caching'''. With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.<br />
<br />
Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations,.<br />
The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.<br />
<br />
Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic<br />
consists of the periodic GETATTRs sent by clients when an attribute timeout<br />
triggers a cache revalidation. But a delegated directory need not be revalidated unless the directory is modified. <br />
<br />
* Should reference Wickman and ... um ... CMU? Ousterhout?<br />
* From which we can make "a great deal" more specific?<br />
<br />
==Directory Delegation Operations==<br />
An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation.<br />
Granting a delegation request is solely at the server's discretion, and the delegation may be <br />
recalled at any time.<br />
<br />
Upon receiving an operation that conflicts with an existing delegation, the server must first <br />
recall from all of its clients any delegations on the directory (or directories) being mutated. <br />
When a client receives that CB_RECALL callback operation, it relinquishes the delegation in <br />
question by responding to the server using the DELEGRETURN operation.<br />
When all of the requisite delegations have been returned (or forcefully timed-out), the server<br />
allows the conflicting operation to proceed.<br />
<br />
Although NFS clients and servers have knowledge of the acquisition and recall of directory <br />
delegations, delegation state is opaque to applications.<br />
<br />
==Notifications==<br />
After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used. <br />
For a large directory, this cost, which is above and beyond the two RPCs needed for the recall, can be quite expensive.<br />
If the directory also happens to be a popular one — with multiple clients holding delegations — the performance impact on the server can be considerable.<br />
<br />
To reduce the impact of a directory modification when the change is small,<br />
the NFSv4.1 Internet Draft defines an extension to delegations called ''notifications.''<br />
When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.<br />
<br />
By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.<br />
<br />
Notifications are motivated by some common cases. For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory. Other examples include program compilation and CVS updates, which also quickly create and destroy files.<br />
<br />
In the proposal for notifications, a client can request notifications on<br />
directory entry and directory attribute changes, as well as directory entry<br />
attribute changes. To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously. In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.<br />
<br />
* ref ousterhout<br />
<br />
===Issues with notifications===<br />
Notifications require state on the NFS server to keep track of them and work to deliver them.<br />
Wickman's simulator work at CITI<br />
found that in some<br />
cases, the number of<br />
notifications dispatched to support a directory delegation can exceed<br />
the cost of simply not using a delegation at all. <br />
A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.<br />
<br />
Notifications also introduce a level of "fairness" to maintain, in terms of deciding how to<br />
allot notifications among multiple clients, given limited server resources.<br />
<br />
Notifications can be sent asynchronously, at a rate negotiated by the client and server.<br />
This allows the server to batch several notifications<br />
and to prune self-cancelling<br />
notifications (e.g., "CREATE foo ... REMOVE foo").<br />
Indeed, Wickman found that for<br />
certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50.<br />
For instance, lock files in mail boxes often have a lifetime<br />
under 10 seconds, so addition/deletion notifications can be pruned. <br />
However, there<br />
is a trade-off between the batching delay and client<br />
cache consistency. <br />
<br />
Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing<br />
notifications at this time.<br />
<br />
=Using Directory Delegations=<br />
<br />
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.<br />
<br />
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)<br />
<br />
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:<br />
<br />
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."<br />
<br />
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.<br />
<br />
Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.<br />
<br />
=Negative Caching=<br />
<br />
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. <br />
Close-to-open consistency mandates that even in a case where previous LOOKUPs or OPENs for a given file have recently or repeatedly failed, subsequent attempts require that the parent directory is revalidated in case the file appears. With directory delegations, the client is assured that no new entries or removals have occurred while a delegation is in-effect; this implies that negative dentries in a delegated directory actually can be "trusted". <br />
<br />
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for <br />
a header file in include paths or a shared library in LD_LIBRARY_PATH ''(See the '''Some preliminary numbers''' section for more details)''. Knowing just when to acquire those delegations may be a matter to address in client-side policy.<br />
<br />
=Delegations and the Linux VFS Lease Subsystem=<br />
<br />
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).<br />
<br />
Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. Leases used by NFS are all acquired and revoked in-kernel.<br />
<br />
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement <br />
directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files, e.g., will not break directory leases.<br />
<br />
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases. Our approach is described in the next section.<br />
<br />
=Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases=<br />
<br />
In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".<br />
<br />
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is '''''also''''' the caller performing an operation that conflicts with the lease-type, as described above.<br />
<br />
Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:<br />
<br />
# Delay responding to the original operation until all recalls are complete.<br />
# Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.<br />
# Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.<br />
<br />
For now, we have implemented option number 2.<br />
<br />
The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory-mutating <br />
operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:<br />
<br />
''When breaking a lease where the call is coming over NFS:''<br />
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try <br />
break_lease() with O_NONBLOCK. This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially)<br />
long periods.<br />
<br />
2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.<br />
<br />
3) If there was a lease, break_lease() will send the break signal(s) and nfsd will also fail (re-enabling lease-granting on the inode first)<br />
and the client gets NFS4ERR_DELAY (and should retry). The downside to this is that a pathological case could arise wherein we break a lease,<br />
return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up <br />
with a cycle.<br />
<br />
<br />
''When breaking a lease where the call is server-local:''<br />
1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.<br />
<br />
2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,<br />
after which the breaker is unblocked and its operation succeeds.<br />
<br />
2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present. If break_lease() returns<br />
-EWOULDBLOCK, drop the locks and call break_lease() and allow it to block. Once the caller unblocks, restart the operation by reacquiring<br />
the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s). Since lease-granting was disabled early-on, <br />
the operation will succeed in one pass.<br />
<br />
3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.<br />
<br />
=Policy (partial)=<br />
client: prior to a READDIR, request. <br />
<br />
client: if we've sent 3 or 5 revalidations and a directory hasn't changed, request.<br />
<br />
client: when to voluntarily surrender? e.g., after a kernel-compile, i hold hundreds of delegations.<br />
<br />
server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones.<br />
<br />
server: will need to ID "misbehaving" clients and cordon them off.<br />
<br />
server: when to preemptively recall? --> server load metric<br />
<br />
==(simulator)==<br />
Previous work at CITI by Brian Wickman consisted of prototyping and analyzing<br />
file and directory delegations, based on recorded network traces of NFSv3 use in<br />
college environments. The stateless nature of NFSv3 required the<br />
instrumentation of OPEN and CLOSE operations into the traces, e.g., but given<br />
that in the absence of delegations, NFSv4 client-side cache validation closely<br />
mimics that of NFSv3, enough information was available to get an overall<br />
impression of the state of the clients' caches. Wickman wrote a simulator to<br />
use the instrumented traces to test different delegation models and policies.<br />
We now want to use real-world NFSv4 network traces with the simulator, but given<br />
the current absence of widescale mainstream deployment of NFSv4, we need to find<br />
such traces of representative workloads. Using actual NFSv4 traffic will give a<br />
more accurate picture of client-cache state and will more clearly identify<br />
operations obviated by delegations; this is both because the traces will not<br />
need to be instrumented, and because NFSv3 lacks the COMPOUND operation, with<br />
which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator<br />
will allow us to develop client- and server-side policies for requesting and<br />
granting delegations.<br />
<br />
=Some preliminary numbers=<br />
A significant demonstration of the benefits of negative dentry<br />
caching is software compilation. For instance, when<br />
building software using make(1), various directories are<br />
repeatedly searched for header files. Since header files tend only to be<br />
located in one of the directories, and since many object files depend on the<br />
same headers, there are a great number of unnecessary re-checks. By caching<br />
negative dentries, a significant number of NFS operations can be avoided.<br />
<br />
We have some rough numbers in terms of opcounts, both with and without directory (and not file) delegations enabled. We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are just included for context. Again, these numbers are ''rough'', but indicate that compilation environments stand to benefit from directory delegations.<br />
<br />
''Doing make(1) on cscope-15.5 (first without, then with directory delegations):''<br />
<br />
READ: 136 124<br />
WRITE: 137 136<br />
OPEN: 1576 1576<br />
ACCESS: 1169 161 (86% reduction)<br />
GETATTR: 903 628 (30% reduction)<br />
LOOKUP: 1494 496 (67% reduction)<br />
GET_DIR_DELEG: 7<br />
DELEGRETURN: 1<br />
<br />
''Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):''<br />
<br />
READ: 19803 19892<br />
WRITE: 21921 21869<br />
OPEN: 497472 494648<br />
ACCESS: 20638 3406 (83.5% reduction)<br />
GETATTR: 41794 24563 (41.0% reduction)<br />
LOOKUP: 45063 17447 (61.3% reduction)<br />
READDIR: 1016 884 (13.0% reduction)<br />
GET_DIR_DELEG: 750<br />
DELEGRETURN: none<br />
<br />
=Status=<br />
<br />
At the moment, working on coming up with reasonably representative tests that show the benefits of directory delegations (in terms<br />
of OP-counts); pynfs tests are also being written.<br />
<br />
==The client==<br />
<br />
* The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that it hasn't changed during that span. <br />
* As long as the client has such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...<br />
* .. in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the file, the client must revalidate its pagecache and send a READDIR to the server).<br />
* '''README: any suggestions here? —> TODO:''' get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)<br />
* TODO: redo existing opcount tests and instead tally bandwidth savings ...<br />
** getting ''real'' NFSv4 workload network traces would be great... '''(can you help? —>&nbsp; email nfsv4@linux-nfs.org)'''<br />
* When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?<br />
<br />
==The server== <br />
<br />
* Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).<br />
* The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.<br />
* An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.<br />
* The corresponding VFS-level operations also break delegations and are being tested.<br />
<!-- <br />
.. CREATE (nfsd_create() and nfsd_symlink()), LINK (nfsd_link()), REMOVE (nfsd_unlink()), and RENAME (nfsd_rename()).<br />
* OPEN(w/create) is tied-up: parent-directory delegs are now broken OK in nfsd4_open(). Breaking file-delegs on OPEN(write) is broken: <br />
nfsd_open() tries '''a)''' under statelock and '''b)''' I think usually fails bc. of O_NONBLOCK. nfsd4_truncate() has similar issue. <br />
--><br />
* How to acknowledge/when to act upon resource pressures? --> e.g., after compiling the linux kernel, a client holds ~750 delegations -- that's like 50KB of state on the server, and nearly as much on the client.<br />
* TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.<br />
* TODO: also -- policy, look at dir deleg/file deleg interactions, ..</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/CITI_Experience_with_Directory_DelegationsCITI Experience with Directory Delegations2008-01-16T20:54:07Z<p>Richterd: /* How Directory Delegations Can Help */</p>
<hr />
<div>=Background=<br />
<br />
To improve performance and reliability, NFSv4.1 introduces read-only '''directory delegations''', a protocol extension that allows consistent caching of directory contents. <br />
CITI is implementing directory delegations as described in Section 11 of [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft].<br />
<br />
==Directory Caching in NFSv4==<br />
<br />
NFSv4 allows clients to cache directory contents:<br />
<br />
* READDIR uses a directory entry cache<br />
* LOOKUP uses the name cache<br />
* ACCESS and GETATTR use a directory metadata cache.<br />
<br />
To limit the use of stale cached information, RFC 3530 suggests a time-bounded consistency model, which forces the client to revalidate cached directory information periodically. <br />
<br />
"Directory caching for the NFS version 4 protocol is similar to previous versions. Clients typically cache directory information for a duration determined by the client. At the end of a predefined timeout, the client will query the server to see if the directory has been updated. By caching attributes, clients reduce the number of GETATTR calls made to the server to validate attributes. Furthermore, frequently accessed files and directories, such as the current working directory, have their attributes cached on the client so that some NFS operations can be performed without having to make an RPC call. By caching name and inode information about most recently looked up entries in DNLC (Directory Name Lookup Cache), clients do not need to send LOOKUP calls to the server every time these files are accessed." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
Revalidation of directory information is wasteful and opens a window during which a client might use stale cached directory information.<br />
<br />
Analysis of network traces at the University of Michigan ('''FIXME''': need link to a copy of Brian Wickman's prelim) show that a surprising amount of NFSv3 traffic is due to GETATTRs triggered by client directory cache revalidation.<br />
<br />
==How Directory Delegations Can Help==<br />
<br />
"[The NFSv4] caching approach works reasonably well at reducing network traffic in many environments. However, it does not address environments where there are numerous queries for files that do not exist. In these cases of "misses", the client must make RPC calls to the server in order to provide reasonable application semantics and promptly detect the creation of new directory entries. Examples of high miss activity are compilation in software development environments. The current behavior of NFS limits its potential scalability and wide-area sharing effectiveness in these types of environments." [http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-02.txt NFSv4.1 Internet Draft]<br />
<br />
A common "high miss" case involves shell PATH lookups.<br />
To execute a program, the shell walks down a list of directories specified in a user's $PATH<br />
environment variable and tries to locate the executable file in each directory. <br />
It is not uncommon to find a large number of directories in the list.<br />
<br />
When an executable whose parent directory is located far down the $PATH list is invoked, it causes a "miss" in each of the directories that precede the parent directory. Even though the directories along the path might be cached, running the program more than once still requires that the $PATH directories are revalidated, in case the file appears at some point.<br />
<br />
Directory delegations improve matters enormously because the client is assured that the directory has not been modified since the delegation was granted. With directory delegations, once a nonexistent file has been searched for, the client can trust that it won't appear while the delegation is in effect; this is referred to as '''negative dentry caching'''. With it, searching for a nonexistent file in a cached and delegated directory can proceed locally, without having to check back with the server.<br />
<br />
Program compilation, which induces repeated misses along the paths for header files and library modules, also benefits from directory delegations,.<br />
The savings are potentially even greater for repeated 'ls' or 'stat' requests on non-existent files; each such request requires three separate RPC calls -- ACCESS, LOOKUP, and GETATTR -- to discover that a file does not exist.<br />
<br />
Beyond these "high miss" cases, analysis of NFSv3 network traces shows that a great deal of NFS traffic<br />
consists of the periodic GETATTRs sent by clients when an attribute timeout<br />
triggers a cache revalidation. But a delegated directory need not be revalidated unless the directory is modified. <br />
<br />
* Should reference Wickman and ... um ... CMU? Ousterhout?<br />
* From which we can make "a great deal" more specific?<br />
<br />
==Directory Delegation Operations==<br />
An NFSv4.1 client requests a directory delegation with the GET_DIR_DELEGATION operation.<br />
Granting a delegation request is solely at the server's discretion, and the delegation may be <br />
recalled at any time.<br />
<br />
Upon receiving an operation that conflicts with an existing delegation, the server must first <br />
recall from all of its clients any delegations on the directory (or directories) being mutated. <br />
When a client receives that CB_RECALL callback operation, it relinquishes the delegation in <br />
question by responding to the server using the DELEGRETURN operation.<br />
When all of the requisite delegations have been returned (or forcefully timed-out), the server<br />
allows the conflicting operation to proceed.<br />
<br />
Although NFS clients and servers have knowledge of the acquisition and recall of directory <br />
delegations, delegation state is opaque to applications.<br />
<br />
==Notifications==<br />
After a delegation recall, a client is forced to refetch a directory in its entirety the next time it is used. <br />
For a large directory, this cost, which is above and beyond the two RPCs needed for the recall, can be quite expensive.<br />
If the directory also happens to be a popular one — with multiple clients holding delegations — the performance impact on the server can be considerable.<br />
<br />
To reduce the impact of a directory modification when the change is small,<br />
the NFSv4.1 Internet Draft defines an extension to delegations called ''notifications.''<br />
When a client requests a delegation, it can also request that certain changes be conveyed in the form of a notification instead of a recall.<br />
<br />
By sending a description of the change instead of recalling the delegation, the server allows the client to maintain a consistent cache without imposing the cost (to the client and to itself) of a recall and refetch.<br />
<br />
Notifications are motivated by some common cases. For example, some applications use ephemeral lockfiles for concurrency control by quickly creating and destroying a file in a directory. Other examples include program compilation and CVS updates, which also quickly create and destroy files.<br />
<br />
In the proposal for notifications, a client can request notifications on<br />
directory entry and directory attribute changes, as well as directory entry<br />
attribute changes. To reduce the cost of issuing notifications, the client and server negotiate the rate at which notifications are sent, allowing the server to "batch" notifications and send them asynchronously. In some common cases, delaying a notification can obviate its delivery altogether, e.g., when a file is quickly created and destroyed.<br />
<br />
* ref ousterhout<br />
<br />
===Issues with notifications===<br />
Notifications require state on the NFS server to keep track of them and work to deliver them.<br />
Wickman's simulator work at CITI<br />
found that in some<br />
cases, the number of<br />
notifications dispatched to support a directory delegation can exceed<br />
the cost of simply not using a delegation at all. <br />
A restricted version of notifications that sends only directory creates, unlinks, and renames would use much less server state.<br />
<br />
Notifications also introduce a level of "fairness" to maintain, in terms of deciding how to<br />
allot notifications among multiple clients, given limited server resources.<br />
<br />
Notifications can be sent asynchronously, at a rate negotiated by the client and server.<br />
This allows the server to batch several notifications<br />
and to prune self-cancelling<br />
notifications (e.g., "CREATE foo ... REMOVE foo").<br />
Indeed, Wickman found that for<br />
certain workloads, batching notifications for 20 to 50 seconds reduces notification traffic by a factor of 5 to 50.<br />
For instance, lock files in mail boxes often have a lifetime<br />
under 10 seconds, so addition/deletion notifications can be pruned. <br />
However, there<br />
is a trade-off between the batching delay and client<br />
cache consistency. <br />
<br />
Because of the complexity of implementation and questions of how best to benefit from them, CITI is not implementing<br />
notifications at this time.<br />
<br />
=Using Directory Delegations=<br />
<br />
While a client holds a delegation on a directory, it is assured that the directory will not be modified without the delegation first being recalled. The server must delay any operation that modifies a directory until all the clients holding delegations on that directory have returned their delegations.<br />
<br />
However, as a special case, the server may allow the client that is modifying a directory to keep its own delegation on that directory. (Obviously, other clients' delegations on that directory must still be recalled.)<br />
<br />
Note that even though we may permit a client to modify a directory while it holds a read delegation, this is not the same as providing that client with an exclusive (write) delegation; a write delegation would also allow the client to modify the directory locally, and this is explicitly forbidden in section 11 of the minor version draft:<br />
<br />
"The delegation is read-only and the client may not make changes to the directory other than by performing NFSv4 operations that modify the directory or the associated file attributes so that the server has knowledge of these changes."<br />
<br />
Note that in order to make the special exception that allows a client to modify a directory without recalling its own lease, we must know which client is performing the operation.<br />
<br />
Currently we are using the client's IP address for this. However, the NFSv4 protocol does not prohibit the client from changing IP addresses, and does not prohibit multiple clients from sharing an IP address. The final code will instead use the new Sessions extensions in NFSv4.1 to identify the client.<br />
<br />
=Negative Caching=<br />
<br />
One opportunity offered by directory delegations is the chance to significantly extend the usefulness of negative dentry caching on the client. <br />
Close-to-open consistency mandates that even in a case where previous LOOKUPs or OPENs for a given file have recently or repeatedly failed, subsequent attempts require that the parent directory is revalidated in case the file appears. With directory delegations, the client is assured that no new entries or removals have occurred while a delegation is in-effect; this implies that negative dentries in a delegated directory actually can be "trusted". <br />
<br />
This could translate into a marked decrease in the number of unnecessary and repeated checks for non-existent files, e.g. when searching for <br />
a header file in include paths or a shared library in LD_LIBRARY_PATH ''(See the '''Some preliminary numbers''' section for more details)''. Knowing just when to acquire those delegations may be a matter to address in client-side policy.<br />
<br />
=Delegations and the Linux VFS Lease Subsystem=<br />
<br />
We have implemented directory delegations on the server by extending the Linux VFS file lease subsystem. A lease is a type of lock that gives the lease-holder the chance to perform any necessary tasks (e.g., flushing data) when an operation that conflicts with the lease-type is about to occur -- the caller who is causing the lease to break will block until the lease-holder signals that it is finished cleaning-up (or the lease is forcefully broken after a timeout).<br />
<br />
Leases are usually acquired via fcntl(2), and a lease-holder usually receives a signal from the kernel when a lease is being broken; the lease-holder indicates that any cleanup is finished with another fcntl(2) call. Leases used by NFS are all acquired and revoked in-kernel.<br />
<br />
The existing lease subsystem only works on files, and leases are only broken when a file is opened for writing or is truncated. In order to implement <br />
directory delegations, we have added support for directory leases. These will break when a leased directory is mutated by any additions, deletions, or renames, or when the directory's own metadata changes (e.g., chown(1)). Note that changes to existing files, e.g., will not break directory leases.<br />
<br />
Our current implementation modifies the NFS server so that NFS protocol operations will break directory leases. We are testing general VFS-level directory lease-breaking -- i.e., both NFS and local operations will break leases. Our approach is described in the next section.<br />
<br />
=Recalling NFS Delegations vs. Breaking Linux VFS (Non-NFS) Leases=<br />
<br />
In the following I will refer to the leases used to implement NFS delegations as "NFS leases" and all other leases as "non-NFS leases".<br />
<br />
NFS leases and non-NFS leases differ in how they handle the case where a lease-holder is '''''also''''' the caller performing an operation that conflicts with the lease-type, as described above.<br />
<br />
Any operation that breaks a lease, and hence requires delegation recalls, has to wait for delegations to be returned. There are a number of different ways to do this:<br />
<br />
# Delay responding to the original operation until all recalls are complete.<br />
# Immediately return NFS4ERR_DELAY to the client; the process on the client will then block while the client polls on its behalf.<br />
# Delay the response from the server for a little while, to handle the (probably common) case of a quick delegation return, and only return NFS4ERR_DELAY if the delegations aren't returned quickly enough.<br />
<br />
For now, we have implemented option number 2.<br />
<br />
The approach we're currently taking to tackle the issues of integrating NFS delegations with Linux VFS leases (i.e., all directory-mutating <br />
operations, whether locally on the server or over NFS, will break directory leases/delegations on the server) goes something like this:<br />
<br />
''When breaking a lease where the call is coming over NFS:''<br />
1) During processing, whenever the directory's dentry becomes available (e.g., after a lookup), disable lease-granting for its inode and try <br />
break_lease() with O_NONBLOCK. This will avoid blocking while locks are held, as well as avoid tying up server threads for (potentially)<br />
long periods.<br />
<br />
2) If there was not a lease, finish the operation, re-enable lease-granting on the inode, and we're done.<br />
<br />
3) If there was a lease, break_lease() will send the break signal(s) and nfsd will also fail (re-enabling lease-granting on the inode first)<br />
and the client gets NFS4ERR_DELAY (and should retry). The downside to this is that a pathological case could arise wherein we break a lease,<br />
return NFS4ERR_DELAY, then the client retries the operation -- but another client has acquired a lease in the interim, and we could end up <br />
with a cycle.<br />
<br />
<br />
''When breaking a lease where the call is server-local:''<br />
1) Again, whenever a directory's dentry becomes available, disable lease-granting for its inode.<br />
<br />
2a) If locks (e.g., an i_mutex) are not held, call break_lease() and, as per normal lease-semantics, block the breaker until leases are returned,<br />
after which the breaker is unblocked and its operation succeeds.<br />
<br />
2b) If locks are held, call break_lease() with O_NONBLOCK; we assume the common-case to be that no lease is present. If break_lease() returns<br />
-EWOULDBLOCK, drop the locks and call break_lease() and allow it to block. Once the caller unblocks, restart the operation by reacquiring<br />
the locks and, e.g., redoing a lookup to make sure the file system object(s) still exist(s). Since lease-granting was disabled early-on, <br />
the operation will succeed in one pass.<br />
<br />
3) Regardless of whether 2a) or 2b) happened, at the end lease-granting is re-enabled for the inode(s) in question.<br />
<br />
=Policy (partial)=<br />
client: request a delegation prior to a READDIR.<br />
<br />
client: if we've sent 3 to 5 revalidations and a directory hasn't changed, request a delegation.<br />
<br />
client: when to voluntarily surrender delegations? e.g., after a kernel compile, a client may hold hundreds of delegations.<br />
<br />
server: if a directory's delegation has been recalled in the last N minutes, don't grant new ones (see the sketch below).<br />
<br />
server: will need to identify "misbehaving" clients and cordon them off.<br />
<br />
server: when to preemptively recall? e.g., based on a server load metric.<br />
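<br />
A purely illustrative sketch of the recall-cooloff note above ("if a directory's delegation has been recalled in the last N minutes, don't grant new ones"); the structure, field names, helper name, and cooloff length are all made up for the example and are not part of the current implementation.<br />
<br />
 #define DIR_DELEG_COOLOFF      (5 * 60 * HZ)    /* "N minutes"; arbitrarily 5 */<br />
 <br />
 struct dir_deleg_info {<br />
         unsigned long   last_recall;             /* jiffies of most recent recall */<br />
         unsigned int    recall_count;<br />
 };<br />
 <br />
 static int should_grant_dir_deleg(const struct dir_deleg_info *info)<br />
 {<br />
         /* If this directory's delegation was recalled recently,<br />
          * refuse to hand out new ones for a while. */<br />
         if (info->recall_count &&<br />
             time_before(jiffies, info->last_recall + DIR_DELEG_COOLOFF))<br />
                 return 0;<br />
         return 1;<br />
 }<br />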
<br />
==(simulator)==<br />
Previous work at CITI by Brian Wickman prototyped and analyzed file and directory delegations based on recorded network traces of NFSv3 use in<br />
college environments. Because NFSv3 is stateless, OPEN and CLOSE operations had to be inferred and instrumented into the traces; but given<br />
that, in the absence of delegations, NFSv4 client-side cache validation closely mimics that of NFSv3, enough information was available to get<br />
an overall impression of the state of the clients' caches. Wickman wrote a simulator that uses the instrumented traces to test different<br />
delegation models and policies.<br />
We now want to use real-world NFSv4 network traces with the simulator, but given the current absence of widespread mainstream deployment of<br />
NFSv4, we need to find such traces of representative workloads. Using actual NFSv4 traffic will give a more accurate picture of client-cache<br />
state and will more clearly identify operations obviated by delegations, both because the traces will not need to be instrumented and because<br />
NFSv3 lacks the COMPOUND operation, with which NFSv4 coalesces groups of commands. NFSv4 traces used with the simulator will allow us to<br />
develop client- and server-side policies for requesting and granting delegations.<br />
<br />
=Some preliminary numbers=<br />
A significant demonstration of the benefits of negative dentry<br />
caching is software compilation. For instance, when<br />
building software using make(1), various directories are<br />
repeatedly searched for header files. Since header files tend to be<br />
located in only one of those directories, and since many object files depend on the<br />
same headers, there are a great number of unnecessary re-checks. By caching<br />
negative dentries, a significant number of NFS operations can be avoided.<br />
<br />
We have some rough numbers in terms of opcounts, both with and without directory (but not file) delegations enabled. We used a simple client policy of requesting delegations prior to a READDIR (note that make(1) periodically calls getdents(2) on its own). ACCESS, GETATTR, and LOOKUP are where the real savings are; the other opcounts are included just for context. Again, these numbers are ''rough'', but indicate that compilation environments stand to benefit from directory delegations.<br />
<br />
''Doing make(1) on cscope-15.5 (first without, then with directory delegations):''<br />
<br />
READ: 136 124<br />
WRITE: 137 136<br />
OPEN: 1576 1576<br />
ACCESS: 1169 161 (86% reduction)<br />
GETATTR: 903 628 (30% reduction)<br />
LOOKUP: 1494 496 (67% reduction)<br />
GET_DIR_DELEG: 7<br />
DELEGRETURN: 1<br />
<br />
''Doing make(1) on the 2.6.16 linux kernel (first without, then with directory delegations):''<br />
<br />
READ: 19803 19892<br />
WRITE: 21921 21869<br />
OPEN: 497472 494648<br />
ACCESS: 20638 3406 (83.5% reduction)<br />
GETATTR: 41794 24563 (41.2% reduction)<br />
LOOKUP: 45063 17447 (61.3% reduction)<br />
READDIR: 1016 884 (13.0% reduction)<br />
GET_DIR_DELEG: 750<br />
DELEGRETURN: none<br />
<br />
=Status=<br />
<br />
At the moment, we are working on reasonably representative tests that show the benefits of directory delegations (in terms<br />
of opcounts); pynfs tests are also being written.<br />
<br />
==The client==<br />
<br />
* The client currently requests a delegation just prior to issuing a READDIR on an undelegated directory, or when it has done "a few" parent directory revalidations and noticed that the directory hasn't changed during that span (a rough sketch of this trigger follows this list). <br />
* As long as the client holds such a delegation, it will generally refrain from issuing ACCESS, GETATTR, and READDIR calls on the directory (see below) ...<br />
* ... in some cases, though, the client's cache(s) may be deliberately invalidated and require a refresh (e.g., a client creates a file in a directory delegated to it, which won't break its delegation; however, in order to see the new file, the client must revalidate its pagecache and send a READDIR to the server).<br />
* '''README: any suggestions here? —> TODO:''' get more opcounts! (hosting a webserver's docroot off an nfs mount? PATH or LD_LIBRARY_PATH stuff?)<br />
* TODO: redo existing opcount tests and instead tally bandwidth savings ...<br />
** getting ''real'' NFSv4 workload network traces would be great... '''(can you help? —>&nbsp; email nfsv4@linux-nfs.org)'''<br />
* When should/can we decide to voluntarily return delegations (other than when we have no more active open-state)?<br />
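<br />
A rough sketch of the client-side request trigger described in the first bullet above; the structure, field names, and the "a few" threshold are hypothetical and only illustrate the "request before a READDIR, or after a few unchanged revalidations" idea.<br />
<br />
 #define DIR_REVALS_BEFORE_DELEG 4                /* "a few"; arbitrary */<br />
 <br />
 struct dir_deleg_policy {<br />
         unsigned int    unchanged_revals;        /* revalidations with no change */<br />
         int             delegated;               /* do we already hold one? */<br />
 };<br />
 <br />
 static int want_dir_delegation(struct dir_deleg_policy *p, int dir_changed,<br />
                                int about_to_readdir)<br />
 {<br />
         if (p->delegated)<br />
                 return 0;<br />
         if (about_to_readdir)<br />
                 return 1;                        /* request just before READDIR */<br />
         if (dir_changed)<br />
                 p->unchanged_revals = 0;<br />
         else if (++p->unchanged_revals >= DIR_REVALS_BEFORE_DELEG)<br />
                 return 1;                        /* directory looks stable */<br />
         return 0;<br />
 }<br />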
<br />
==The server== <br />
<br />
* Differentiate between turning file/directory delegations on/off at runtime (done) and enabling/disabling the capability itself (not done; would prevent our client from ever asking for delegations in the first place, independent of its requesting policy).<br />
* The following NFS operations currently break directory delegations: CREATE, LINK, REMOVE, RENAME, and OPEN(w/create). SETATTR on directories is pending.<br />
* An NFS SETATTR breaks file delegations when the file size is changing. Breaking on metadata changes is pending.<br />
* The corresponding VFS-level operations also break delegations and are being tested.<br />
<!-- <br />
.. CREATE (nfsd_create() and nfsd_symlink()), LINK (nfsd_link()), REMOVE (nfsd_unlink()), and RENAME (nfsd_rename()).<br />
* OPEN(w/create) is tied-up: parent-directory delegs are now broken OK in nfsd4_open(). Breaking file-delegs on OPEN(write) is broken: <br />
nfsd_open() tries '''a)''' under statelock and '''b)''' I think usually fails bc. of O_NONBLOCK. nfsd4_truncate() has similar issue. <br />
--><br />
* How to acknowledge and when to act upon resource pressures? e.g., after compiling the Linux kernel, a client holds ~750 delegations -- roughly 50 KB of state on the server, and nearly as much on the client.<br />
* TODO: get NFSv2/NFSv3 operations to break (file and directory) delegations at all of the right times, too.<br />
* TODO: also -- policy, look at dir deleg/file deleg interactions, ..</div>Richterdhttps://wiki.linux-nfs.org/wiki/index.php/Cluster_client_migration_prototypeCluster client migration prototype2008-01-11T17:23:55Z<p>Richterd: </p>
<hr />
<div>As part of CITI's work with IBM, we looked at some of the issues involved with NFSv4 client migration and developed an initial prototype. Our setup involved a cluster of equivalent NFS servers attached to a GFS2 disk array, with each server exporting the same directory from the GFS2 filesystem. The intent was to provide an interface by which an administrator could selectively migrate NFSv4 clients from one server to another (e.g., to take a server down for maintenance).<br />
<br />
<br />
== Prototype overview ==<br />
The prototype is a proof-of-concept: the "right way" to migrate a client would be to transfer all of the client-related state from one server to another and then have the client reorient to the new server and continue without interruption; instead, this prototype leverages parts of the existing reboot-recovery process. To briefly explain reboot-recovery: when a Linux NFSv4 server starts, it enters a grace period of roughly 90 seconds; during this time, eligible clients may contact the server and reclaim state for open files and locks they were holding prior to a server crash/reboot. In order to allow clients to reclaim state without conflicts, new opens and similar operations are disallowed during the grace period.<br />
<br />
=== Migration overview ===<br />
During a migration, the cluster is put into an artificial grace period and the target-server is notified that a new client is eligible to perform reclaims. When the client contacts the source-server, it receives an error message saying that the file system has moved and sees that it should migrate to the target-server. The client establishes a connection to the target-server and reclaims its state almost identically to how it would after a server reboot. Shortly thereafter, the grace period expires, the client is purged from the source-server, and then it's business as usual.<br />
<br />
=== Statetransfer daemon ===<br />
To go into a bit more detail, the migration prototype is based on a redesigned approach to reboot-recovery that Andy Adamson developed, in which a new userspace daemon (so far named <tt>rpc.stransd</tt>) takes over some responsibilities previously handled within the kernel. For the most part, the daemon keeps track of the clientids of legitimate NFS clients that have established state on the server, recording these clientids in stable storage.<br />
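As an illustration of that bookkeeping, the sketch below shows one way a clientid/IP pair could be appended to a stable-storage file; the record layout and the flat-file approach are assumptions made for the example, not the actual <tt>rpc.stransd</tt> format:<br />
<pre>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk record -- the real rpc.stransd format may differ. */
struct clientid_record {
    uint64_t clientid;          /* clientid handed out by knfsd      */
    char     client_ip[46];     /* client's IP address, as a string  */
};

/* Append one record to a (hypothetical) stable-storage file. */
int record_clientid(const char *path, uint64_t clientid, const char *ip)
{
    struct clientid_record rec = { .clientid = clientid };
    FILE *f;

    strncpy(rec.client_ip, ip, sizeof(rec.client_ip) - 1);

    f = fopen(path, "ab");
    if (!f)
        return -1;
    if (fwrite(&rec, sizeof(rec), 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fflush(f);                  /* a real daemon would also fsync() here */
    fclose(f);
    return 0;
}
</pre>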
<br />
For migration, the administrator runs a client program (so far called <tt>rpc.strans</tt>) that contacts the source-server's <tt>rpc.stransd</tt> and sends the IP address of the client to migrate. <tt>rpc.stransd</tt> looks up all clientids associated with that IP address and sends them to the target-server's <tt>rpc.stransd</tt>, which saves them in stable storage and notifies its (the target-server's) <tt>knfsd</tt> that those clientids are eligible for reclaim. Then, when the client receives the error message that the file system has moved, it sends an FS_LOCATIONS request to the source-server to find out where it should go next and receives a reply containing the target-server's IP address. Since it is migrating, the client reuses its existing clientid (already in the target-server's eligible-to-reclaim list) when it contacts the target-server instead of creating a new one, and thereafter proceeds to reclaim its state.<br />
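Seen from the source server's daemon, the flow above might look roughly like the sketch below; the lookup and transport functions are stubs standing in for the real <tt>rpc.stransd</tt> internals (which live in the userland tarball), so treat this as an outline rather than the actual interface:<br />
<pre>
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Placeholder hooks -- the real lookup and transport live in rpc.stransd;
 * these stubs only make the flow below self-contained.
 */
static size_t lookup_clientids_by_ip(const char *ip, uint64_t *out, size_t max)
{
    (void)ip; (void)max;
    out[0] = 0x1234;                    /* pretend one clientid was found */
    return 1;
}

static int send_reclaim_list(const char *target_ip, const uint64_t *ids, size_t n)
{
    (void)ids;
    printf("telling %s that %zu clientid(s) may reclaim\n", target_ip, n);
    return 0;
}

static int mark_fs_moved(const char *client_ip)
{
    printf("knfsd will now answer %s with a 'moved' error\n", client_ip);
    return 0;
}

/*
 * Source-server side of "rpc.strans -m <clientIP> <targetIP> <sourceIP>":
 * gather the client's clientids, make them reclaim-eligible on the target,
 * then have knfsd steer the client away with a "filesystem moved" error.
 */
int migrate_client(const char *client_ip, const char *target_ip)
{
    uint64_t ids[64];
    size_t n = lookup_clientids_by_ip(client_ip, ids, 64);

    if (n == 0 || send_reclaim_list(target_ip, ids, n) < 0)
        return -1;
    return mark_fs_moved(client_ip);
}
</pre>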
<br />
=== Going forward ===<br />
The mechanism by which one <tt>rpc.stransd</tt> transfers clientids to another will be expanded so that all client open/lock/delegation state held on the source-server can be directly sent to the target-server and loaded into memory. In order to facilitate that, the underlying cluster filesystem will also need to transfer its own bookkeeping of opens/locks/leases from the source node to the target node. By directly transferring the state instead of relying on reclaims, the invasive and problematic cluster-wide grace period can be avoided entirely.<br />
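As a rough idea of what such a direct transfer might carry per client, the structure below sketches a possible serialized open-state record; the fields are an assumption for discussion and not a format the prototype implements:<br />
<pre>
#include <stdint.h>

/*
 * Hypothetical record for shipping one piece of open state from the
 * source server to the target server -- not implemented by the prototype.
 */
struct xfer_open_state {
    uint64_t clientid;              /* clientid the state belongs to      */
    uint32_t stateid_seq;           /* stateid sequence number            */
    uint8_t  stateid_other[12];     /* opaque remainder of the stateid    */
    uint32_t share_access;          /* read/write access requested        */
    uint32_t share_deny;            /* deny modes                         */
    uint32_t fh_len;                /* length of the file handle below    */
    uint8_t  fh[128];               /* file handle the open refers to     */
};
</pre>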
<br />
=== Limitations ===<br />
The existing prototype is limited in many ways: for ease of integration, only the creation of a symlink completes a migration event on the client; there is no security associated with the triggering of a migration; the GFS2 and dlm code in the kernel version used in the prototype are quite fragile; the list goes on. Nevertheless, we have migrated clients at CITI that are able to -- to the extent that the maturity of that kernel version permits -- continue functioning normally after a migration.<br />
<br />
== Prototype code ==<br />
As a compromise between the setup of the original prototype and the relative stability of GFS2 exported by NFS, the current code is based on the [http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.19.7.tar.bz2 2.6.19.7 Linux kernel]. Until proper git repositories are online, there is [http://www.citi.umich.edu/u/richterd/strans-kernel-for-2.6.19.7.diff a patch] for the kernel and [http://www.citi.umich.edu/u/richterd/strans-userland-for-2.6.19.7.tar.gz a tarball] of the source for the userland components.<br />
<br />
Some instructions on how to test the setup and how to work around some cluster-related kinks are in the README file in the userland tarball. Once the kernels have been built on the nfs servers and the client, and once the userland components are built on the servers, my basic steps are:<br />
<br />
* boot the cluster, bring up <tt>cman</tt> and <tt>clvmd</tt> everywhere<br />
* mount the gfs2 filesystem on the nfs servers<br />
* cat the files that'll be involved in the reclaims on each of the nfs servers (see the README)<br />
* then start up nfs on the servers, making sure that <tt>rpc.stransd</tt> is running by the time <tt>knfsd</tt> starts up<br />
* start wireshark on the client<br />
* have the client mount the source-server and hold a file open with, e.g., <tt>less(1)</tt><br />
* arrange the migration: <tt> $ rpc.strans -m <clientIP> <target-serverIP> <source-serverIP></tt><br />
* in a second shell on the client, try to create a symlink over nfs -- it should fail and the client should migrate<br />
* the logs, wireshark, netstat, etc., should show that the client has migrated, and the client should be able to keep going (but again, functionality is limited -- reading files works). Note that <tt>mount(1)</tt> will continue to show the source-server, even though the client is now actually talking to the target-server.<br />
<br />
<br />
A [http://www.citi.umich.edu/u/richterd/migration-moved-and-good-open-reclaim-3--apikia-rhcl1-rhcl2.pcap network trace] of the client <tt>141.211.133.'''86'''</tt> migrating from server <tt>141.211.133.'''212'''</tt> to <tt>141.211.133.'''213'''</tt> is available from CITI's website. Packets 104/106 show a file initially being opened; then the migration was triggered; then, packets 128/130 show the client trying to make a symlink and getting a "moved" error; packets 140/142 show the client making contact with the target server; packets 156/158 show the client reclaiming state for the file it had open; and finally, packets 239/241 show subsequent "normal" operation as another file is read after the artificial grace period expired.</div>Richterd