
NVDIMM: classification and advantages (Part 2)

Analysis of the main NVDIMM storage devices and their features.
In the first installment, which you can find here, we covered the main storage devices and their performance characteristics, focusing on NVDIMMs (Non-Volatile Dual In-line Memory Modules), devices housed in the RAM slots, and we provided a classification of them.
Now the question is: do NVDIMMs perform as well as claimed? And how complicated are they to use?
Let's try to find some answers, starting with a few simple high-level checks and postponing, for the time being, their use in real and interesting application scenarios.
Intel DCPMM module configuration
The configuration is quite simple and can be done with different tools: at system startup using the BIOS configuration utility, from a UEFI shell, or with dedicated tools available for both Linux and Windows.
Let's choose the last option, since it is the most independent from the underlying platform; the modules are managed through two open-source tools, ipmctl (https://github.com/intel/ipmctl) and ndctl (https://github.com/pmem/ndctl), which are generally included in the repositories of the major Linux distributions.
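As a quick sketch of the installation (package names may differ depending on the distribution and release):
# RHEL/CentOS/Fedora family
[root@pmserver ~]# yum install ipmctl ndctl
# on Debian/Ubuntu the equivalent would be: apt install ipmctl ndctl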
We can start by simply verifying the presence of the DCPMMs on the system in question, equipped in this case with twelve 256 GB modules, and their location in the dedicated slots.
[root@pmserver ~]# ipmctl show -topology
DimmID | MemoryType | Capacity | PhysicalID | DeviceLocator
==============================================================================
0x0011 | Logical Non-Volatile Device | 252.4 GiB | 0x0028 | CPU1_DIMM_B2
0x0021 | Logical Non-Volatile Device | 252.4 GiB | 0x002a | CPU1_DIMM_C2
0x0001 | Logical Non-Volatile Device | 252.4 GiB | 0x0026 | CPU1_DIMM_A2
0x0111 | Logical Non-Volatile Device | 252.4 GiB | 0x002e | CPU1_DIMM_E2
0x0121 | Logical Non-Volatile Device | 252.4 GiB | 0x0030 | CPU1_DIMM_F2
0x0101 | Logical Non-Volatile Device | 252.4 GiB | 0x002c | CPU1_DIMM_D2
0x1011 | Logical Non-Volatile Device | 252.4 GiB | 0x0034 | CPU2_DIMM_B2
0x1021 | Logical Non-Volatile Device | 252.4 GiB | 0x0036 | CPU2_DIMM_C2
0x1001 | Logical Non-Volatile Device | 252.4 GiB | 0x0032 | CPU2_DIMM_A2
0x1111 | Logical Non-Volatile Device | 252.4 GiB | 0x003a | CPU2_DIMM_E2
0x1121 | Logical Non-Volatile Device | 252.4 GiB | 0x003c | CPU2_DIMM_F2
0x1101 | Logical Non-Volatile Device | 252.4 GiB | 0x0038 | CPU2_DIMM_D2
N/A | DDR4 | 32.0 GiB | 0x0025 | CPU1_DIMM_A1
N/A | DDR4 | 32.0 GiB | 0x0027 | CPU1_DIMM_B1
N/A | DDR4 | 32.0 GiB | 0x0029 | CPU1_DIMM_C1
N/A | DDR4 | 32.0 GiB | 0x002b | CPU1_DIMM_D1
N/A | DDR4 | 32.0 GiB | 0x002d | CPU1_DIMM_E1
N/A | DDR4 | 32.0 GiB | 0x002f | CPU1_DIMM_F1
N/A | DDR4 | 32.0 GiB | 0x0031 | CPU2_DIMM_A1
N/A | DDR4 | 32.0 GiB | 0x0033 | CPU2_DIMM_B1
N/A | DDR4 | 32.0 GiB | 0x0035 | CPU2_DIMM_C1
N/A | DDR4 | 32.0 GiB | 0x0037 | CPU2_DIMM_D1
N/A | DDR4 | 32.0 GiB | 0x0039 | CPU2_DIMM_E1
N/A | DDR4 | 32.0 GiB | 0x003b | CPU2_DIMM_F1
Another useful command gives us information on how the modules are being used: initially we see the full available capacity (3029.4 GiB, or about 3 TiB), currently not allocated:
[root@pmserver ~]# ipmctl show -memoryresources
Capacity=3029.4 GiB
MemoryCapacity=0.0 GiB
AppDirectCapacity=0.0 GiB
UnconfiguredCapacity=3029.4 GiB
InaccessibleCapacity=0.0 GiB
ReservedCapacity=0.0 GiB
We can now assign the available capacity for use in Memory Mode or App Direct Mode, with the interesting option of splitting it between the two modes: there is no need to dedicate modules exclusively to one or the other.
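As a hedged sketch, a mixed configuration that assigns, say, half of the capacity to volatile memory and the remainder to App Direct could be requested with a goal such as the following (the 50/50 split is purely illustrative; see the ipmctl documentation for the exact goal properties):
[root@pmserver ~]# ipmctl create -goal memorymode=50 persistentmemorytype=appdirect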
Provisioning – Memory Mode
Suppose we want to create a fat memory node, for processing a large dataset or for hosting numerous virtual machines with a generous amount of memory, by allocating all the available capacity as system memory:
[root@pmserver ~]# ipmctl create -goal memorymode=100
The following configuration will be applied:
SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
0x0000 | 0x0011 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0021 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0001 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0111 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0121 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0101 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1011 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1021 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1001 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1111 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1121 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1101 | 252.0 GiB | 0.0 GiB | 0.0 GiB
Do you want to continue? [y/n]
Created following region configuration goal
SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
0x0000 | 0x0011 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0021 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0001 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0111 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0121 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0000 | 0x0101 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1011 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1021 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1001 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1111 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1121 | 252.0 GiB | 0.0 GiB | 0.0 GiB
0x0001 | 0x1101 | 252.0 GiB | 0.0 GiB | 0.0 GiB
A reboot is required to process new memory allocation goals.
As the tool reminds us, a reboot is required to put the DCPMMs into Memory Mode, an activity that must therefore be planned so as not to interrupt the system workload. After the reboot:
[root@pmserver ~]# ipmctl show -memoryresources
Capacity=3029.4 GiB
MemoryCapacity=3029.4 GiB
AppDirectCapacity=0.0 GiB
UnconfiguredCapacity=0.0 GiB
InaccessibleCapacity=0.0 GiB
ReservedCapacity=0.0 GiB
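At the operating-system level the effect is immediately visible: the DCPMM capacity is now exposed as ordinary system RAM, while the DDR4 modules act as a transparent cache and are no longer counted. A quick sanity check (exact figures will obviously depend on the platform):
# the OS should now report roughly 3 TiB of total system memory
[root@pmserver ~]# free -h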
Benchmarking – Memory Mode
With a simple, synthetic test commonly used in High-Performance Computing (the STREAM memory benchmark, https://www.cs.virginia.edu/stream/) we can conduct a quick performance check, running it first on the traditional DIMMs and then on the NVDIMMs.
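For reference, a sketch of how a STREAM binary matching the array size used below could be built (the compiler and flags are assumptions; any OpenMP-capable compiler will do):
# build STREAM with OpenMP and a 120,000,000-element array, as in the runs below
[root@pmserver stream]# gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=120000000 stream.c -o stream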
Volatile DIMM:
# 1 thread per physical core
[root@pmserver stream]# export OMP_NUM_THREADS=48
[root@pmserver stream]# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 120000000 (elements), Offset = 0 (elements)
Memory per array = 915.5 MiB (= 0.9 GiB).
Total memory required = 2746.6 MiB (= 2.7 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 48
Number of Threads counted = 48
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 160374.9 0.012769 0.011972 0.016018
Scale: 129198.4 0.015654 0.014861 0.017818
Add: 148268.7 0.020355 0.019424 0.024177
Triad: 149501.8 0.020302 0.019264 0.024210
Non Volatile DIMM:
Using the same execution parameters (1 thread per physical core available on the system, 2.7 GiB allocation):
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 102722.9 0.020470 0.018691 0.022534
Scale: 122823.8 0.016362 0.015632 0.018160
Add: 135089.8 0.022264 0.021319 0.025406
Triad: 136973.9 0.021946 0.021026 0.025303
-------------------------------------------------------------
At first glance the DCPMMs seem markedly less efficient, with a performance degradation ranging from -36% for the Copy function to about -8% for the Triad function. However, we must keep in mind the different operating frequencies of the two types of memory: the DCPMMs under test run at 2666 MT/s, while the traditional modules are DDR4 @ 2933 MT/s. Comparing the two technologies at the same frequency, we can therefore say that DCPMM performance is in line with that of traditional memory modules. A memory area nearly an order of magnitude larger (3029 GiB of NVDIMM vs. 384 GiB of DDR4) with similar performance is a dream come true.
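As a quick cross-check against the two tables above: for Copy the degradation is (160374.9 − 102722.9) / 160374.9 ≈ 36%, while for Triad it is (149501.8 − 136973.9) / 149501.8 ≈ 8.4%.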
Let's now take a quick look at using App Direct mode for storage, in order to create a fast block device.
Provisioning – Storage over App Direct
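Since in the previous step all of the capacity was assigned to Memory Mode, part of it (or all of it) must first be re-provisioned as persistent memory, followed by another reboot. A minimal sketch of this preliminary step, assuming for illustration that the modules of one socket are dedicated to App Direct (socket targeting and goal property names as described in the ipmctl documentation):
# re-provision the modules of socket 1 as App Direct, then reboot
[root@pmserver ~]# ipmctl create -goal -socket 1 persistentmemorytype=appdirect
[root@pmserver ~]# reboot
# after the reboot, the resulting persistent-memory region should be visible
[root@pmserver ~]# ipmctl show -region
[root@pmserver ~]# ndctl list -R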
To create a block device in a very simple way, we can use the following command, which allocates to the device all the capacity available in the modules connected to one socket (two 128 GB modules in the configuration used here):
[root@pmserver ~]# ndctl create-namespace -m fsdax
Let's verify the creation of the device, its capacity (about 256 GB) and the assigned name (pmem1):
[root@pmserver ~]# ndctl list -N
[
{
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":266352984064,
"uuid":"5d0eb50d-6b79-4acb-b36e-32ac535ac440",
"sector_size":512,
"align":2097152,
"blockdev":"pmem1"
}
]
Then we format and mount the created device:
[root@pmserver ~]# mkfs.xfs /dev/pmem1
meta-data=/dev/pmem1 isize=512 agcount=4, agsize=16256896 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=65027584, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=31751, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
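If the mount point does not already exist, it must be created first (path taken from the mount command below):
[root@pmserver ~]# mkdir -p /mnt/nvdimm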
[root@pmserver ~]# mount -o dax,noatime /dev/pmem1 /mnt/nvdimm
Benchmarking – Storage over App Direct
A simple test, representative of one of the most penalizing cases [1] in terms of performance, confirms the low latency of these devices: around 6.05 µs (microseconds), a value one or more orders of magnitude lower than that of typical SSD devices.
[root@pmserver ~]# fio --section=randwrite_8files ../worst_case.fio.job
randwrite_8files: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B,
(T) 4096B-4096B, ioengine=libpmem,
iodepth=1
...
fio-3.18
Starting 4 processes
Jobs: 1 (f=1): [_(3),w(1)][96.2%][w=636MiB/s][w=163k IOPS][eta 00m:01s]
randwrite_8files: (groupid=0, jobs=4): err= 0: pid=231388: Tue Oct 13 14:55:43 2020
write: IOPS=422k, BW=1647MiB/s (1727MB/s)(40.0GiB/24873msec)
clat (usec): min=2, max=413, avg= 6.01, stdev= 2.74
lat (usec): min=2, max=413, avg= 6.05, stdev= 2.74
clat percentiles (nsec):
| 1.00th=[ 2448], 5.00th=[ 2992], 10.00th=[ 3312], 20.00th=[ 3856],
| 30.00th=[ 4320], 40.00th=[ 4512], 50.00th=[ 5088], 60.00th=[ 5856],
| 70.00th=[ 6944], 80.00th=[ 8256], 90.00th=[10048], 95.00th=[11456],
| 99.00th=[14016], 99.50th=[15040], 99.90th=[17536], 99.95th=[18816],
| 99.99th=[24448]
lat percentiles (nsec):
| 1.00th=[ 2480], 5.00th=[ 3024], 10.00th=[ 3376], 20.00th=[ 3888],
| 30.00th=[ 4320], 40.00th=[ 4576], 50.00th=[ 5088], 60.00th=[ 5920],
| 70.00th=[ 6944], 80.00th=[ 8256], 90.00th=[10048], 95.00th=[11456],
| 99.00th=[14016], 99.50th=[15040], 99.90th=[17536], 99.95th=[18816],
| 99.99th=[24448]
bw ( MiB/s): min= 1320, max= 3015, per=100.00%, avg=2178.24, stdev=149.69, samples=152
iops : min=338056, max=771888, avg=557629.57, stdev=38320.01, samples=152
lat (usec) : 4=23.50%, 10=66.24%, 20=10.22%, 50=0.03%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%
cpu : usr=56.82%, sys=43.16%, ctx=909, majf=0, minf=10488802
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10485760,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1647MiB/s (1727MB/s), 1647MiB/s-1647MiB/s (1727MB/s-1727MB/s),
io=40.0GiB (42.9GB), run=24873-24873msec
Disk stats (read/write):
pmem1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
Conclusions
Intel DCPMMs, which are in effect an implementation of NVDIMM-P, live up to expectations. They are able to provide both large volatile memory areas, with performance comparable to classic RAM modules, and block storage areas with latencies comparable to the best NVMe devices available on the market. They therefore lend themselves to a very wide range of application scenarios!
FOOTNOTES
1. Parameters used in the FIO (https://github.com/axboe/fio) job file:
blocksize=4k
sync=1
direct=1
buffered=0
ioengine=libpmem
group_reporting=1
directory=/mnt/nvdimm/
lat_percentiles=1
iodepth=1
loops=10
[randread_8files]
rw=randread
filesize=1G
numjobs=4
[randwrite_8files]
rw=randwrite
filesize=1G
numjobs=4
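For completeness, the random-read section of the same job file can be run in exactly the same way as the write test shown above:
[root@pmserver ~]# fio --section=randread_8files ../worst_case.fio.job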