NVDIMM: classification and advantages (Part 2)

14 October 2020


Analysis of the main NVDIMM storage devices and their features.

In the first installment, which you can find here, we discussed the main storage devices and their performance characteristics, focusing on NVDIMMs (Non-Volatile Dual In-line Memory Modules), devices housed in the RAM slots, and we provided a classification of them.

Now, the question is: do NVDIMMs really perform as well as claimed? And how complicated are they to use?

Let's try to find some answers, starting with a few simple high-level checks and postponing, for the time being, their use in real application scenarios.

Intel DCPMM module configuration

Configuration is quite simple and can be done with several tools: at system startup through the BIOS configuration utility, from a UEFI shell, or with dedicated tools available for both Linux and Windows.

Let's choose the last option, since it is the most independent from the underlying platform; the modules are managed through two open-source tools, ipmctl (https://github.com/intel/ipmctl) and ndctl (https://github.com/pmem/ndctl), which are generally included in the repositories of the major Linux distributions.
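On an Enterprise Linux system, for example, they can usually be installed straight from the standard repositories (package names are assumed to follow the upstream projects):

# Install the DCPMM management tools from the distribution repositories (package names may vary by distro)
[root@pmserver ~]# dnf install -y ipmctl ndctl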

We can easily verify the presence of the DCPMMs on the system in question, equipped in this case with twelve 256 GB modules, and their placement in the dedicated slots.

[root@pmserver ~]# ipmctl show -topology

 DimmID | MemoryType                  | Capacity  | PhysicalID| DeviceLocator
==============================================================================
 0x0011 | Logical Non-Volatile Device | 252.4 GiB | 0x0028    | CPU1_DIMM_B2
 0x0021 | Logical Non-Volatile Device | 252.4 GiB | 0x002a    | CPU1_DIMM_C2
 0x0001 | Logical Non-Volatile Device | 252.4 GiB | 0x0026    | CPU1_DIMM_A2
 0x0111 | Logical Non-Volatile Device | 252.4 GiB | 0x002e    | CPU1_DIMM_E2
 0x0121 | Logical Non-Volatile Device | 252.4 GiB | 0x0030    | CPU1_DIMM_F2
 0x0101 | Logical Non-Volatile Device | 252.4 GiB | 0x002c    | CPU1_DIMM_D2
 0x1011 | Logical Non-Volatile Device | 252.4 GiB | 0x0034    | CPU2_DIMM_B2
 0x1021 | Logical Non-Volatile Device | 252.4 GiB | 0x0036    | CPU2_DIMM_C2
 0x1001 | Logical Non-Volatile Device | 252.4 GiB | 0x0032    | CPU2_DIMM_A2
 0x1111 | Logical Non-Volatile Device | 252.4 GiB | 0x003a    | CPU2_DIMM_E2
 0x1121 | Logical Non-Volatile Device | 252.4 GiB | 0x003c    | CPU2_DIMM_F2
 0x1101 | Logical Non-Volatile Device | 252.4 GiB | 0x0038    | CPU2_DIMM_D2
 N/A    | DDR4                        | 32.0 GiB  | 0x0025    | CPU1_DIMM_A1
 N/A    | DDR4                        | 32.0 GiB  | 0x0027    | CPU1_DIMM_B1
 N/A    | DDR4                        | 32.0 GiB  | 0x0029    | CPU1_DIMM_C1
 N/A    | DDR4                        | 32.0 GiB  | 0x002b    | CPU1_DIMM_D1
 N/A    | DDR4                        | 32.0 GiB  | 0x002d    | CPU1_DIMM_E1
 N/A    | DDR4                        | 32.0 GiB  | 0x002f    | CPU1_DIMM_F1
 N/A    | DDR4                        | 32.0 GiB  | 0x0031    | CPU2_DIMM_A1
 N/A    | DDR4                        | 32.0 GiB  | 0x0033    | CPU2_DIMM_B1
 N/A    | DDR4                        | 32.0 GiB  | 0x0035    | CPU2_DIMM_C1
 N/A    | DDR4                        | 32.0 GiB  | 0x0037    | CPU2_DIMM_D1
 N/A    | DDR4                        | 32.0 GiB  | 0x0039    | CPU2_DIMM_E1
 N/A    | DDR4                        | 32.0 GiB  | 0x003b    | CPU2_DIMM_F1

Another useful command gives us information on how the capacity is being used: initially, we see the available capacity (3029 GiB, about 3 TiB), still unallocated:

[root@pmserver ~]# ipmctl show -memoryresources

Capacity=3029.4 GiB
MemoryCapacity=0.0 GiB
AppDirectCapacity=0.0 GiB
UnconfiguredCapacity=3029.4 GiB
InaccessibleCapacity=0.0 GiB
ReservedCapacity=0.0 GiB

We can now assign the available capacity for use in memory mode or App Direct mode, with the interesting option of splitting it between the two; in fact, there is no need to dedicate the modules exclusively to one mode or the other.
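For illustration, before moving on to the actual provisioning, a mixed configuration, with half of the capacity used as volatile memory and the remainder reserved for App Direct, could be requested with a goal like the following (the 50% split is purely indicative):

# Illustrative mixed goal: 50% of the DCPMM capacity as volatile memory mode, the rest as App Direct
[root@pmserver ~]# ipmctl create -goal MemoryMode=50 PersistentMemoryType=AppDirect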

Provisioning – memory mode

Suppose we want to create a fat-memory node, to process a large dataset or to host numerous virtual machines with a generous amount of memory, by allocating all the available capacity as system memory:

[root@pmserver ~]# ipmctl create -goal memorymode=100
The following configuration will be applied:
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0011 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0021 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0001 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0111 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0121 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0101 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1011 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1021 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1001 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1111 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1121 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1101 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
Do you want to continue? [y/n]
Created following region configuration goal
 SocketID | DimmID | MemorySize | AppDirect1Size | AppDirect2Size
==================================================================
 0x0000   | 0x0011 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0021 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0001 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0111 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0121 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0000   | 0x0101 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1011 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1021 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1001 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1111 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1121 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
 0x0001   | 0x1101 | 252.0 GiB  | 0.0 GiB        | 0.0 GiB
A reboot is required to process new memory allocation goals.

As the tool reminds us, a reboot is required to switch the DCPMMs into memory mode, an activity that must therefore be planned so as not to interrupt the system workload. After the reboot:

[root@pmserver ~]# ipmctl show -memoryresources

Capacity=3029.4 GiB
MemoryCapacity=3029.4 GiB
AppDirectCapacity=0.0 GiB
UnconfiguredCapacity=0.0 GiB
InaccessibleCapacity=0.0 GiB
ReservedCapacity=0.0 GiB
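At the operating system level the DCPMM capacity is now exposed as ordinary system RAM, with the DDR4 modules acting as a transparent cache; a quick sanity check (command only) is:

# Total system RAM should now report roughly 3 TiB; the 384 GiB of DDR4 serves as a near-memory cache
[root@pmserver ~]# free -h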

Benchmarking – memory mode

Using a simple synthetic test commonly employed in High-Performance Computing (the STREAM memory benchmark, https://www.cs.virginia.edu/stream/), we can perform a quick performance check, running it first on the traditional DIMMs and then on the NVDIMMs.
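For reference, STREAM is typically built with OpenMP support; a plausible build line, matching the array size reported in the runs below, is:

# Build STREAM with OpenMP; the array size matches the one reported in the output below
[root@pmserver stream]# gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=120000000 stream.c -o stream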

Volatile DIMM:

# 1 thread per physical core
[root@pmserver stream]# export OMP_NUM_THREADS=48
[root@pmserver stream]# ./stream

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 120000000 (elements), Offset = 0 (elements)
Memory per array = 915.5 MiB (= 0.9 GiB).
Total memory required = 2746.6 MiB (= 2.7 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 48
Number of Threads counted = 48
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          160374.9     0.012769     0.011972     0.016018
Scale:         129198.4     0.015654     0.014861     0.017818
Add:           148268.7     0.020355     0.019424     0.024177
Triad:         149501.8     0.020302     0.019264     0.024210

Non-Volatile DIMM:

Using the same execution parameters (1 thread per physical core available on the system, 2.7 GiB allocation):

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          102722.9     0.020470     0.018691     0.022534
Scale:         122823.8     0.016362     0.015632     0.018160
Add:           135089.8     0.022264     0.021319     0.025406
Triad:         136973.9     0.021946     0.021026     0.025303
-------------------------------------------------------------

At first glance the DCPMMs seem significantly less efficient, with a performance degradation ranging from -36% on the Copy kernel to -9% on the Triad kernel. However, we must keep in mind the different operating frequencies of the two memory types: the DCPMMs under test run at 2666 MT/s, while the traditional modules are DDR4 at 2933 MT/s. Comparing the two technologies at the same frequency, we can say that the DCPMMs perform broadly in line with traditional memory modules. A memory area almost eight times larger (3029 GiB of NVDIMM vs 384 GiB of DDR4) with similar performance is a dream come true.

Let's now take a quick look at using the modules as storage, in order to create a fast block device.

Provisioning – Storage over App Direct
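In general, a namespace can only be created on an App Direct region, so after the memory-mode configuration used above the capacity would first have to be re-provisioned with a new goal and another reboot; a minimal sketch:

# Re-provision the DCPMM capacity as interleaved App Direct regions; a reboot is required afterwards
[root@pmserver ~]# ipmctl create -goal PersistentMemoryType=AppDirect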

To create a block device in a very simple way, we can use the following command, which will allocate to the device all the capacity available in the modules connected to one socket (two 128 GB modules in the case in question):

[root@pmserver ~]# ndctl create-namespace -m fsdax

Let's verify the creation of the device, its capacity (about 248 GiB usable out of the nominal 256 GB) and the assigned name (pmem1):

[root@pmserver ~]# ndctl list -N
[
  {
    "dev":"namespace1.0",
    "mode":"fsdax",
    "map":"dev",
    "size":266352984064,
    "uuid":"5d0eb50d-6b79-4acb-b36e-32ac535ac440",
    "sector_size":512,
    "align":2097152,
    "blockdev":"pmem1"
  }
]
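Before formatting, we can also confirm that the kernel exposes the namespace as a block device (an optional, quick check):

# Optional: confirm that the namespace is visible as a block device
[root@pmserver ~]# lsblk /dev/pmem1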

Then we format and mount the created device:

[root@pmserver ~]# mkfs.xfs /dev/pmem1
meta-data=/dev/pmem1             isize=512    agcount=4, agsize=16256896 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=65027584, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=31751, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@pmserver ~]# mount -o dax,noatime /dev/pmem1 /mnt/nvdimm
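We can optionally verify that the DAX option is actually active on the mounted filesystem; the dax flag should appear among the mount options:

# The dax flag should be listed among the mount options of /mnt/nvdimm
[root@pmserver ~]# mount | grep pmem1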

Benchmarking – Storage over App Direct

A simple test, representative of one of the most performance-penalizing cases [1], confirms the low latency of these devices, around 6.05 µs (microseconds), a value one or more orders of magnitude lower than that of typical SSD devices.

[root@pmserver ~]# fio --section=randwrite_8files ../worst_case.fio.job
randwrite_8files: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, 
(T) 4096B-4096B, ioengine=libpmem, 
iodepth=1
...
fio-3.18
Starting 4 processes
Jobs: 1 (f=1): [_(3),w(1)][96.2%][w=636MiB/s][w=163k IOPS][eta 00m:01s]
randwrite_8files: (groupid=0, jobs=4): err= 0: pid=231388: Tue Oct 13 14:55:43 2020
  write: IOPS=422k, BW=1647MiB/s (1727MB/s)(40.0GiB/24873msec)
    clat (usec): min=2, max=413, avg= 6.01, stdev= 2.74
     lat (usec): min=2, max=413, avg= 6.05, stdev= 2.74
    clat percentiles (nsec):
     |  1.00th=[ 2448],  5.00th=[ 2992], 10.00th=[ 3312], 20.00th=[ 3856],
     | 30.00th=[ 4320], 40.00th=[ 4512], 50.00th=[ 5088], 60.00th=[ 5856],
     | 70.00th=[ 6944], 80.00th=[ 8256], 90.00th=[10048], 95.00th=[11456],
     | 99.00th=[14016], 99.50th=[15040], 99.90th=[17536], 99.95th=[18816],
     | 99.99th=[24448]
    lat percentiles (nsec):
     |  1.00th=[ 2480],  5.00th=[ 3024], 10.00th=[ 3376], 20.00th=[ 3888],
     | 30.00th=[ 4320], 40.00th=[ 4576], 50.00th=[ 5088], 60.00th=[ 5920],
     | 70.00th=[ 6944], 80.00th=[ 8256], 90.00th=[10048], 95.00th=[11456],
     | 99.00th=[14016], 99.50th=[15040], 99.90th=[17536], 99.95th=[18816],
     | 99.99th=[24448]
   bw (  MiB/s): min= 1320, max= 3015, per=100.00%, avg=2178.24, stdev=149.69, samples=152
   iops        : min=338056, max=771888, avg=557629.57, stdev=38320.01, samples=152
  lat (usec)   : 4=23.50%, 10=66.24%, 20=10.22%, 50=0.03%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%
  cpu          : usr=56.82%, sys=43.16%, ctx=909, majf=0, minf=10488802
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10485760,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1647MiB/s (1727MB/s), 1647MiB/s-1647MiB/s (1727MB/s-1727MB/s), 
io=40.0GiB (42.9GB), run=24873-24873msec

Disk stats (read/write):
  pmem1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

Conclusions

Intel DCPMMs, which are in fact an implementation of NVDIMM-P, live up to expectations. They can provide both large volatile memory areas, with performance comparable to classic RAM modules, and block storage areas with latencies comparable to the best NVMe devices on the market. They therefore lend themselves to a very wide range of application scenarios!

FOOTNOTES

[1] Parameters used in the fio (https://github.com/axboe/fio) job file:

blocksize=4k
sync=1
direct=1
buffered=0
ioengine=libpmem
group_reporting=1
directory=/mnt/nvdimm/
lat_percentiles=1
iodepth=1
loops=10

[randread_8files]
rw=randread
filesize=1G
numjobs=4

[randwrite_8files]
rw=randwrite
filesize=1G
numjobs=4

Filed under: HPC

By Simone Tinti

Recent Posts

23 January 2023

Cognitive Signal Classifier: improving the RF spectrum awareness with artificial intelligence

Aerospace & Defence

8 July 2022

E4 Computer Engineering accelerates Arm-based solutions for HPC and AI

Artificial Intelligence, HPC, Press

22 May 2022

Our first 20 years!

E4 Various

3 May 2022

E4 Computer Engineering joins RISC-V International

HPC, Press

15 April 2022

Digital Transformation: next years scenario

Artificial Intelligence, HPC

PreviousNext
