From Linux NFS

Revision as of 23:22, 3 March 2014


Simplifying NFS/RDMA Client Memory Registration Modes

The RPC/RDMA transport uses a variety of memory registration strategies to mark areas of the NFS client's memory that are eligible for RDMA. At mount time, one registration mode is chosen based on the provider and HCA used to connect to the NFS server. By default, RPC/RDMA attempts to use FRMR because it is fast and generally safe.

However, not all providers and HCAs support FRMR. The RPC/RDMA client transport therefore maintains seven different memory registration modes, which adds considerable complexity to the code base and to our test matrix. Over time we would like to remove some of these modes to simplify the code and reduce the testing burden.
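The mount-time fallback described above can be sketched roughly as follows. This is a simplified userspace model, not the actual xprtrdma code: the capability-flag names mirror those used in the matrix below, and the exact fallback order in the kernel may differ.

```python
def choose_memreg_mode(device):
    """Pick a registration mode based on device capability flags.

    `device` is a dict of booleans, e.g.:
        {"mem_mgt_ext": True, "loc_dma_key": True,
         "alloc_fmr": False, "reg_phys_mr": True}
    This is an illustrative model of the selection logic, not the
    kernel's implementation.
    """
    # Preferred: FRMR, because it is fast and generally safe.
    if device.get("mem_mgt_ext") and device.get("loc_dma_key"):
        return "RPCRDMA_FRMR"
    # Next: FMRs, where the provider implements the alloc_fmr verb.
    if device.get("alloc_fmr"):
        return "RPCRDMA_MTHCAFMR"
    # Next: per-I/O physical registration via reg_phys_mr.
    if device.get("reg_phys_mr"):
        return "RPCRDMA_REGISTER"
    # Last resort: expose physical memory directly.
    return "RPCRDMA_ALLPHYSICAL"
```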

Matrix of Registration Mode Support

The following table is based on a code audit of RDMA providers in the 3.13 kernel and one (the Soft RoCE driver, rxe) that is only in OFED.

Key to the memreg mode columns:
  • REGISTER: reg_phys_mr verb present
  • MEMWINDOWS / MEMWINDOWS_ASYNC: device mem_window flag set
  • MTHCAFMR: alloc_fmr verb present
  • FRMR: device mem_mgt_ext and loc_dma_key flags set

provider  BOUNCEBUFFERS  REGISTER  MEMWINDOWS  MEMWINDOWS_ASYNC  MTHCAFMR  FRMR  ALLPHYSICAL
amso1100  yes            yes       yes         yes               no        no    yes
cxgb3     yes            yes       yes         yes               no        yes   yes
cxgb4     yes            yes       yes         yes               no        yes   yes
ehca      yes            yes       no          no                yes?      no    yes
ipath     yes            yes       no          no                yes       no    yes
mlx4      yes            no        yes         yes               yes       yes   yes
mlx5      yes            no        no          no                no        yes   yes
mthca     yes            yes       no          no                yes       no    yes
nes       yes            yes       yes         yes               no        yes   yes
ocrdma    yes            yes       no          no                no        yes   yes
qib       yes            yes       no          no                yes       no    yes
rxe       yes            yes       no          no                yes       no    yes
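The matrix above can be encoded as data for quick lookups. This is an illustrative helper, not part of the kernel; the "yes?" entry for ehca/MTHCAFMR is recorded as supported here.

```python
# Modes in the order of the table columns.
MODES = ("BOUNCEBUFFERS", "REGISTER", "MEMWINDOWS", "MEMWINDOWS_ASYNC",
         "MTHCAFMR", "FRMR", "ALLPHYSICAL")

# One row per provider, 1 = supported, 0 = not supported.
SUPPORT = {
    "amso1100": (1, 1, 1, 1, 0, 0, 1),
    "cxgb3":    (1, 1, 1, 1, 0, 1, 1),
    "cxgb4":    (1, 1, 1, 1, 0, 1, 1),
    "ehca":     (1, 1, 0, 0, 1, 0, 1),
    "ipath":    (1, 1, 0, 0, 1, 0, 1),
    "mlx4":     (1, 0, 1, 1, 1, 1, 1),
    "mlx5":     (1, 0, 0, 0, 0, 1, 1),
    "mthca":    (1, 1, 0, 0, 1, 0, 1),
    "nes":      (1, 1, 1, 1, 0, 1, 1),
    "ocrdma":   (1, 1, 0, 0, 0, 1, 1),
    "qib":      (1, 1, 0, 0, 1, 0, 1),
    "rxe":      (1, 1, 0, 0, 1, 0, 1),
}

def supported_modes(provider):
    """Return the set of registration modes a provider supports."""
    return {m for m, s in zip(MODES, SUPPORT[provider]) if s}
```

For example, this makes it easy to see that the mlx4 and mlx5 providers are the only ones that cannot fall back to REGISTER.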

Summary of memory registration modes

From Tom Talpey.

RPCRDMA_BOUNCEBUFFERS

Pros:
  • No dynamic memory registration
  • No remote RDMA permissions granted
    • Safe - no client-side RDMA exposure
    • No server-side RDMA support needed
  • Supported by any provider
  • Efficient operation for an all-small-message workload

Cons:
  • No RDMA
  • Increased memory footprint per connection
  • Limited I/O size
  • Increased client CPU consumption for data copies
  • Not intended for operational use

RPCRDMA_REGISTER

Pros:
  • RDMA transfers supported
    • Safe - byte-granular registrations on a per-I/O basis
  • Supported by "most" providers

Cons:
  • Inefficient memory registration
    • Increased client CPU
    • Increased latency and overhead
  • Unsupported by the mlx4 and mlx5 providers

RPCRDMA_MEMWINDOWS

Pros:
  • RDMA transfers supported
    • Safe - byte-granular registrations on a per-I/O basis
  • Supported by iWARP providers

Cons:
  • RDMA pipeline bubbles
    • Memory window operations may stall the send queue
  • Kernel virtual addressing requires mapped pages
  • Not supported by some IB and RoCE providers
    • May be inefficient on those that do support it

RPCRDMA_MEMWINDOWS_ASYNC

Pros:
  • RDMA transfers supported
    • Slightly less safe than RPCRDMA_MEMWINDOWS
      • Windows remain open after completion
    • Slightly more efficient than RPCRDMA_MEMWINDOWS
      • Window invalidations are made nonblocking
  • Supported by iWARP providers

Cons:
  • Same as RPCRDMA_MEMWINDOWS

RPCRDMA_MTHCAFMR

Pros:
  • RDMA transfers supported
    • 4KB page-granular protection only
Cons:
  • Supported only by certain Mellanox providers
  • Page granularity risks RDMA exposure beyond the I/O buffer
  • Not highly efficient, or tested/testable

Note: we strongly suggest not relying on mthca except for the purpose of supporting antique Mellanox adapters.

RPCRDMA_FRMR

Pros:
  • RDMA transfers supported
    • Byte-granular protection
    • Physical addresses - no translation or mapping
    • Highly efficient
  • Well supported by "modern" providers

Cons:
  • Not supported by older providers and non-hardware implementations
  • The FRMR page list limit may result in additional RDMA segmentation

RPCRDMA_ALLPHYSICAL

Pros:
  • RDMA transfers supported
    • Physical addresses - no translation or mapping
    • Highly efficient
  • No RDMA invalidation, ever
    • Reduced client overhead and fewer interrupts/context switches
  • Supported by all current providers

Cons:
  • No RDMA protection AT ALL
    • All client physical memory exposed for remote write
  • Each physical segment requires a separate RDMA operation
    • Significant impact on NFS WRITE operations
    • Increased number of RDMA work requests overall
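The FRMR page list limit mentioned above translates into extra segmentation in a predictable way: an I/O spanning more pages than one FRMR's page list can hold must be split into multiple RDMA segments. A small sketch of that arithmetic, assuming a 4KB page size and an illustrative page list limit of 64 (a real device advertises its own maximum):

```python
import math

def frmr_segments(io_bytes, page_size=4096, page_list_len=64):
    """Number of RDMA segments needed when each FRMR can map at most
    `page_list_len` pages.  The default limit of 64 is illustrative,
    not taken from any particular device."""
    pages = math.ceil(io_bytes / page_size)
    return max(1, math.ceil(pages / page_list_len))
```

With these assumptions a 256KB I/O fits exactly one FRMR (64 pages), while a 1MB I/O (256 pages) must be split into four segments.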

Long-term Plans

After some discussion on linux-rdma and linux-nfs, we might be able to justify the following changes.

Mode                       Long-term               Comments
RPCRDMA_BOUNCEBUFFERS      Deprecate, then remove  Never intended for production
RPCRDMA_REGISTER           Deprecate, then remove  Safe, but not fast
RPCRDMA_MEMWINDOWS         Deprecate, then remove  No value in keeping this
RPCRDMA_MEMWINDOWS_ASYNC   Deprecate, then remove  Generally unsafe
RPCRDMA_MTHCAFMR           Keep                    Known user; the page-alignment requirement makes it somewhat unsafe
RPCRDMA_FRMR               Keep                    Generally fast and safe; should remain our default
RPCRDMA_ALLPHYSICAL        Keep                    Known user