PNFS File-based Stateid Distribution Design

From Linux NFS

Revision as of 23:40, 28 February 2009 by Dhildebz (Talk | contribs)
Updated Stateid Distribution Information
========================================

The main point of this email is to discuss a new export op on the MDS, change_state, that
will revoke stateids cached on the DS.  For reference, my previous email discussing this
topic is at the bottom of this email.

Issue 1: Do we need it?
=======================
We need some mechanism to ensure correctness of the share reservations and stateids on the
DS.   E.g., if a share reservation is upgraded, then the access/deny bits on the DS need to
be updated to ensure that I/O is rejected properly.
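
To make the failure mode concrete, here is a minimal user-space sketch of the check a DS
would perform against its cached copy of the share access bits.  The struct and function
names are hypothetical and are not taken from the nfsd code; only the READ/WRITE bit values
follow the NFSv4 protocol.

```c
#include <stdbool.h>
#include <stdint.h>

/* Share access bits as defined by the NFSv4 protocol. */
#define SHARE_ACCESS_READ  0x1u
#define SHARE_ACCESS_WRITE 0x2u

/* Hypothetical per-stateid record cached on the DS (illustration only). */
struct ds_cached_state {
    uint32_t share_access;   /* access bits as last fetched from the MDS */
};

/* Allow the I/O only if the cached access bits cover the request. */
static bool ds_io_allowed(const struct ds_cached_state *st, bool is_write)
{
    uint32_t need = is_write ? SHARE_ACCESS_WRITE : SHARE_ACCESS_READ;

    return (st->share_access & need) != 0;
}
```

If the MDS narrows the open to read-only (e.g., after an OPEN_DOWNGRADE) and the DS still
holds the old READ|WRITE bits, this check keeps admitting writes that the MDS would reject.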


Issue 2: What should it do?
==========================
We need to ensure that both stateids and share reservations are updated properly.  Ideally,
it would update the stateid on each DS with the latest info, but synchronous state
updates on every change from the MDS to DS would be prohibitively slow.  So I suggest that
it simply deletes state on the DSs, allowing each DS to retrieve the updated state info
at its leisure. (In the future we could asynchronously update/add stateids.)


Issue 3: When do we call it?
=============================
Updating the DSs every time the stateid changes would be prohibitively slow.  In addition,
it is often unnecessary since a stateid could change 3 times
before a single I/O is sent to the DS (e.g., open->deleg->lock).
The spec says that the DS must allow/reject I/O which the MDS would allow/reject.  This
means that on OPEN/LOCK/LOCKU, we can continue to let the DS retrieve state laissez-faire.
But on CLOSE, OPEN_DOWNGRADE and a 2nd OPEN that upgrades the share reservation, the MDS
must call change_state to synchronously revoke the state.
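
The rule above could be sketched as a simple predicate on the MDS.  This is an illustration
only: the enum and function names are hypothetical, and where such a check would actually
live in the MDS code is not decided here.

```c
#include <stdbool.h>

/* Hypothetical tags for the state-changing operations discussed above. */
enum mds_state_op {
    MDS_OP_OPEN,            /* first OPEN of a file by this owner */
    MDS_OP_OPEN_UPGRADE,    /* 2nd OPEN that upgrades the share reservation */
    MDS_OP_OPEN_DOWNGRADE,
    MDS_OP_CLOSE,
    MDS_OP_LOCK,
    MDS_OP_LOCKU,
};

/*
 * Return true if the MDS must synchronously revoke the state cached on
 * the DSs before replying; false if the DS can keep fetching state lazily.
 */
static bool mds_op_needs_ds_revoke(enum mds_state_op op)
{
    switch (op) {
    case MDS_OP_CLOSE:
    case MDS_OP_OPEN_DOWNGRADE:
    case MDS_OP_OPEN_UPGRADE:
        return true;
    case MDS_OP_OPEN:
    case MDS_OP_LOCK:
    case MDS_OP_LOCKU:
        return false;
    }
    return false;
}
```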

Issue 4: Arguments
==================
Some suggestions are:
- stateid (boot, gen, file, owner)
- Share reservation (access/deny bits)
- Revoke/Update flag
- inode (may be necessary since deleg id doesn't have a fileid)
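
As a strawman, the suggested arguments could be bundled into one struct.  Every name below
is hypothetical and the layout is only a sketch to discuss; it is not an existing nfsd
interface.

```c
#include <stdint.h>

/* Hypothetical mirror of the four stateid components listed above. */
struct sketch_stateid {
    uint32_t boot;
    uint32_t gen;
    uint32_t file;
    uint32_t owner;
};

/* Revoke/Update flag; only revoke is proposed for now. */
enum change_state_action {
    CHANGE_STATE_REVOKE,
    CHANGE_STATE_UPDATE,
};

/* Hypothetical argument block for the change_state export op. */
struct change_state_args {
    struct sketch_stateid stateid;
    uint32_t share_access;              /* share reservation access bits */
    uint32_t share_deny;                /* share reservation deny bits */
    enum change_state_action action;
    uint64_t ino;                       /* identifies the file when the stateid cannot */
};

/* Build the arguments for a plain revoke; the share bits are unused. */
static struct change_state_args
change_state_revoke_args(struct sketch_stateid sid, uint64_t ino)
{
    struct change_state_args args = {
        .stateid = sid,
        .share_access = 0,
        .share_deny = 0,
        .action = CHANGE_STATE_REVOKE,
        .ino = ino,
    };

    return args;
}
```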


Issue 5: Cleaning up old stateids on DS
=======================================
Since several stateids may be cached for a single file, e.g., open->lock->open, *old*
stateids currently stick around until nfsd exits.  In order to be more proactive and memory
conscious, we can use the laundromat thread along with an LRU list and limit the number of
stateids.
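
A minimal user-space sketch of the "LRU plus limit" idea follows.  It uses a small fixed
array and a logical clock instead of a real list, and it evicts at insert time rather than
from the laundromat thread; the names and the cap of 4 are assumptions made purely for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DS_STATE_MAX 4              /* hypothetical cap on cached stateids */

struct ds_state_entry {
    uint32_t owner;                 /* identity of the cached stateid */
    uint32_t file;
    uint64_t last_used;             /* logical time of the last lookup or insert */
    bool in_use;
};

struct ds_state_cache {
    struct ds_state_entry slot[DS_STATE_MAX];
    uint64_t clock;                 /* bumped on every lookup hit and insert */
};

/* Find a cached stateid and mark it as recently used; NULL if absent. */
static struct ds_state_entry *
ds_state_lookup(struct ds_state_cache *c, uint32_t owner, uint32_t file)
{
    for (int i = 0; i < DS_STATE_MAX; i++) {
        struct ds_state_entry *e = &c->slot[i];

        if (e->in_use && e->owner == owner && e->file == file) {
            e->last_used = ++c->clock;
            return e;
        }
    }
    return NULL;
}

/* Cache a stateid, reusing a free slot or evicting the least recently used. */
static struct ds_state_entry *
ds_state_insert(struct ds_state_cache *c, uint32_t owner, uint32_t file)
{
    struct ds_state_entry *victim = &c->slot[0];

    for (int i = 0; i < DS_STATE_MAX; i++) {
        struct ds_state_entry *e = &c->slot[i];

        if (!e->in_use) {
            victim = e;             /* a free slot always wins */
            break;
        }
        if (e->last_used < victim->last_used)
            victim = e;
    }
    victim->owner = owner;
    victim->file = file;
    victim->in_use = true;
    victim->last_used = ++c->clock;
    return victim;
}
```

Evicting a stateid that is still valid is harmless under the laissez-faire scheme, since the
DS simply asks the MDS for it again on the next I/O.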

So in summary, we would add code so that the MDS would revoke stateids on DSs at
close/open_{down/up} to ensure share reservations are properly checked.  In addition, we
would have the laundromat thread clean up old stateids on the DSs.


Background Information
======================
Current Code
============
a) Laissez-faire retrieval of stateids on DS from the MDS
  - Export functions: DS: get_state   MDS: cb_get_state
  - If a stateid changes, e.g., open->deleg, the new stateid is
retrieved and the old one is left alone.  Old stateids are not cleaned
up by cb_change_state and are only cleaned up when nfsd exits.
b) Ability for file system to update/revoke the state on the DS (right
now, we only revoke)
  - Export function: DS: cb_change_state

Issues
======
a) New export op on MDS, change_state, that the MDS can use to
revoke/update the stateid on the DS at certain points.  We definitely
don't want to do this every time the stateid changes, since this would
be way too often and unnecessary since stateids could change 3 times
before a single I/O is sent to the DS (e.g., open->deleg->lock).  

b) Once a stateid changes from one type to another, there is currently no way in the
Linux implementation to link the old and new stateids together.  How do we identify all
the stateids in the change_state export op?  Do we need to pass an array to
change_state, or do we need multiple calls, or is there a simpler identifier?

c) In section 13.9.1., it says,
- "The stateid sent to the data server MUST be sent with the seqid set
to zero, indicating the most current version of that stateid, rather
than indicating a specific non-zero seqid value."

In the Linux code, the seqid maps to the si_generation field of the
stateid.  We need to ensure that the DS never compares the si_generation field.
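
One way to picture this is a DS-side match that looks at every stateid component except the
generation.  The struct below is a hypothetical stand-in with generically named fields; only
si_generation is a name taken from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the stateid; not the kernel's definition. */
struct sketch_stateid {
    uint32_t boot;
    uint32_t owner;
    uint32_t file;
    uint32_t si_generation;     /* seqid; clients send 0 to the DS */
};

/*
 * Match a stateid received on the DS against a cached one.
 * si_generation is deliberately ignored, so a stateid sent with a
 * seqid of zero still matches a cached copy with a non-zero value.
 */
static bool ds_stateid_match(const struct sketch_stateid *rcvd,
                             const struct sketch_stateid *cached)
{
    return rcvd->boot == cached->boot &&
           rcvd->owner == cached->owner &&
           rcvd->file == cached->file;
}
```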

For background info, currently, the MDS bumps the seqid field
(si_generation) via update_stateid in the following cases:
- open upgrade/downgrade
- open confirm
- close (just for return args, but it is then released and we would need
to call the change_state export op)
- lock
- unlock

d) At all costs, I would like to avoid synchronous pushes of stateids to
the DSs every time the stateid changes.  One option would be to
asynchronously push the stateid from the MDS to the DS, and simply
resolve the timing issue when the DS is asking the MDS.  Under the
banner of KISS, let's avoid this until we determine that laissez-faire
stateid retrieval is causing problems.

Dean Hildebrand