NfsRdmaClient/MemRegModes
From Linux NFS
Latest revision as of 13:57, 30 May 2014
Simplifying NFS/RDMA Client Memory Registration Modes
The RPC/RDMA transport uses a variety of memory registration strategies to mark areas of the NFS client's memory that are eligible for RDMA. At mount time, one registration mode is chosen based on the provider and HCA used to connect to the NFS server. By default, RPC/RDMA attempts to use FRMR because it is fast and generally safe.
However, not all providers and HCAs support FRMR. Thus the RPC/RDMA client transport code maintains seven different memory registration modes, which adds considerable complexity to the code base and to our test matrix. Over time we would like to remove some of these modes to simplify the code base and reduce testing requirements.
Matrix of Registration Mode Support
The following table is based on a code audit of the RDMA providers in the 3.13 kernel, plus one provider (the Soft RoCE driver, rxe) that is available only in OFED.
memreg mode/provider | BOUNCEBUFFERS | REGISTER (reg_phys_mr verb present) | MEMWINDOWS (device mem_window flag set) | MEMWINDOWS_ASYNC (device mem_window flag set) | MTHCAFMR (alloc_fmr verb present) | FRMR (device mem_mgt_ext and loc_dma_key flags set) | ALLPHYSICAL |
---|---|---|---|---|---|---|---|
amso1100 | yes | yes | yes | yes | no | no | yes |
cxgb3 | yes | yes | yes | yes | no | yes | yes |
cxgb4 | yes | yes | yes | yes | no | yes | yes |
ehca | yes | yes | no | no | yes | no | yes |
ipath | yes | yes | no | no | yes | no | yes |
mlx4 | yes | no | yes | yes | yes | yes | yes |
mlx5 | yes | no | no | no | no | yes | yes |
mthca | yes | yes | no | no | yes | no | yes |
nes | yes | yes | yes | yes | no | yes | yes |
ocrdma | yes | yes | no | no | no | yes | yes |
qib | yes | yes | no | no | yes | no | yes |
rxe | yes | yes | no | no | yes | no | yes |
Summary of memory registration modes
From Tom Talpey.
RPCRDMA_BOUNCEBUFFERS
Pros:
* No dynamic memory registration
* No remote RDMA permissions granted
* [...]
* Supported by any provider
* Efficient operation for all-small-message workload
Cons:
* No RDMA
* Increased memory footprint per connection
* [...]
* Increased client CPU consumption for data copies
* Not intended for operational use
RPCRDMA_REGISTER
Pros:
* RDMA transfers supported
** Safe - byte-granular registrations on a per-I/O basis
* Supported by "most" providers
Cons:
* Inefficient memory registration
** Increased client CPU
** Increased latency and overhead
* Unsupported by mlx4 and mlx5 providers
RPCRDMA_MEMWINDOWS
Pros:
* RDMA transfers supported
** Safe - byte-granular registrations on a per-I/O basis
* Supported by iWARP providers
Cons:
* RDMA pipeline bubbles
** Memory window operations may stall send queue
* [...]
* Not supported by some IB and RoCE providers
** May be inefficient, if so
RPCRDMA_MEMWINDOWS_ASYNC
Pros:
* RDMA transfers supported
** Slightly less safe than memreg 2 (RPCRDMA_MEMWINDOWS)
** [...]
*** Window invalidations are made nonblocking
* Supported by iWARP providers
Cons:
* Same as memreg 2 (RPCRDMA_MEMWINDOWS)
RPCRDMA_MTHCAFMR
Pros:
* RDMA transfers supported
** 4KB page-granular protection only
Cons:
* Supported only by certain Mellanox providers
* Page-granularity risks RDMA exposure beyond I/O buffer
* [...]
Strongly suggest not relying on mthca except for the purpose of supporting antique Mellanox adapters.
RPCRDMA_FRMR
Pros:
* RDMA transfers supported
** Byte-granular protection
* [...]
** Highly efficient
* Well-supported by "modern" providers
Cons:
* Not supported by older providers and non-hardware implementations
* FRMR page list limit may result in additional RDMA segmentation
RPCRDMA_ALLPHYSICAL
Pros:
* RDMA transfers supported
** Physical addresses - no translation or mapping
* [...]
** Reduced client overhead and interrupts/context switches
* Supported by all current providers
Cons:
* No RDMA protection AT ALL
** All client physical memory exposed for remote write
* [...]
** Significant impact on NFS write operations
** Increased number of all RDMA work requests
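The page-granularity risk noted for RPCRDMA_MTHCAFMR can be illustrated with a little arithmetic. The following is a minimal sketch, not kernel code: it assumes 4KB pages, and the helper name `exposed_extra_bytes` is hypothetical. It computes how many bytes outside an I/O buffer become remotely accessible when a registration must cover whole pages.

```python
PAGE_SIZE = 4096  # assumption: 4KB pages, matching the MTHCAFMR entry


def exposed_extra_bytes(offset: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes outside the I/O buffer that a page-granular registration exposes.

    `offset` is the buffer's starting byte address and `length` its size in
    bytes. (Hypothetical helper for illustration; not a kernel function.)
    """
    if length <= 0:
        return 0
    # Round the start down and the end up to page boundaries.
    first = (offset // page_size) * page_size
    last = ((offset + length + page_size - 1) // page_size) * page_size
    return (last - first) - length


if __name__ == "__main__":
    for offset, length in [(0, 4096), (100, 1000), (4000, 200)]:
        print(offset, length, exposed_extra_bytes(offset, length))
```

In this model a page-aligned, page-sized buffer exposes nothing extra, while a small buffer that straddles a page boundary exposes nearly two full pages. A byte-granular mode such as RPCRDMA_FRMR would correspond to zero extra bytes in every case.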
Long-term Plans
After some discussion on linux-rdma and linux-nfs, we might be able to justify removing unsafe and deprecated memory registration modes.
Mode | Safe | Fast | Long-term | Comments |
---|---|---|---|---|
RPCRDMA_BOUNCEBUFFERS | yes | no | Remove | Never intended for production |
RPCRDMA_REGISTER | yes | no | Remove | Safe, but not fast |
RPCRDMA_MEMWINDOWS | yes | no | Remove | No value to keeping this |
RPCRDMA_MEMWINDOWS_ASYNC | no | yes | Remove | Generally unsafe |
RPCRDMA_MTHCAFMR | no | yes | Keep | Supported in Xen guests; RDMA protection on page boundaries only |
RPCRDMA_FRMR | yes | yes | Keep | Generally fast and safe, should remain our default |
RPCRDMA_ALLPHYSICAL | no | yes | Keep | Can be unsafe, but is broadly compatible |
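The capability checks named in the column headers of the support matrix, combined with the "Keep" rows of this table, suggest a simple selection procedure. The sketch below is illustrative only: the `DeviceCaps` class, the function names, and the fallback order after FRMR (MTHCAFMR, then ALLPHYSICAL) are assumptions and are not taken from the transport code. Only the per-mode conditions and the FRMR default come from this page.

```python
from dataclasses import dataclass

# Mode names, listed in the order used by the tables on this page.
MODES = (
    "RPCRDMA_BOUNCEBUFFERS",
    "RPCRDMA_REGISTER",
    "RPCRDMA_MEMWINDOWS",
    "RPCRDMA_MEMWINDOWS_ASYNC",
    "RPCRDMA_MTHCAFMR",
    "RPCRDMA_FRMR",
    "RPCRDMA_ALLPHYSICAL",
)

# Modes marked "Keep" in the long-term plan. FRMR first because it is the
# default; the order of the remaining two is an assumption.
KEEP_ORDER = ("RPCRDMA_FRMR", "RPCRDMA_MTHCAFMR", "RPCRDMA_ALLPHYSICAL")


@dataclass(frozen=True)
class DeviceCaps:
    """Hypothetical summary of what a provider/HCA offers."""
    reg_phys_mr: bool   # reg_phys_mr verb present
    mem_window: bool    # device mem_window flag set
    alloc_fmr: bool     # alloc_fmr verb present
    mem_mgt_ext: bool   # device mem_mgt_ext flag set
    loc_dma_key: bool   # device loc_dma_key flag set


def supported_modes(caps: DeviceCaps) -> list:
    """Apply the conditions from the support matrix column headers."""
    ok = {
        "RPCRDMA_BOUNCEBUFFERS": True,   # "yes" for every provider audited
        "RPCRDMA_REGISTER": caps.reg_phys_mr,
        "RPCRDMA_MEMWINDOWS": caps.mem_window,
        "RPCRDMA_MEMWINDOWS_ASYNC": caps.mem_window,
        "RPCRDMA_MTHCAFMR": caps.alloc_fmr,
        "RPCRDMA_FRMR": caps.mem_mgt_ext and caps.loc_dma_key,
        "RPCRDMA_ALLPHYSICAL": True,     # "yes" for every provider audited
    }
    return [mode for mode in MODES if ok[mode]]


def choose_mode(caps: DeviceCaps) -> str:
    """Pick the first "Keep" mode that the device supports."""
    available = supported_modes(caps)
    for mode in KEEP_ORDER:
        if mode in available:
            return mode
    raise RuntimeError("unreachable: ALLPHYSICAL is always available here")


if __name__ == "__main__":
    # Flag values chosen to reproduce the mlx5 and mthca rows of the matrix.
    mlx5 = DeviceCaps(False, False, False, True, True)
    mthca = DeviceCaps(True, False, True, False, False)
    print(choose_mode(mlx5), choose_mode(mthca))
```

With only the three "Keep" modes left, every provider in the matrix still resolves to some mode, because ALLPHYSICAL is listed as supported everywhere.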