Saturday, September 28, 2019

Benchmark: Cross-NUMA remote memory access cost and real world app

This post investigates the cross-NUMA (remote) memory access cost of a dual-CPU Intel Xeon E5-2650 v3 computer.



Coreinfo


The cross-NUMA memory access cost was measured with Sysinternals Coreinfo.



C:\apps\SysinternalsSuite>coreinfo

Coreinfo v2.11 - Dump information on system CPU and memory topology
Copyright (C) 2008-2010 Mark Russinovich
Sysinternals - www.sysinternals.com

(snip)

Approximate Cross-NUMA Node Access Cost (relative to fastest):
       00  01
00: 1.0 1.3
01: 1.4 1.1

According to coreinfo, the cross-NUMA access cost is 1.3 to 1.4, i.e. remote memory access is 30 to 40% slower than the fastest local memory access.
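The percentage figures follow directly from the matrix. The short Python sketch below shows the arithmetic; the function name `remote_penalty_percent` is only an illustrative choice, and the values are copied from the coreinfo output above.

```python
def remote_penalty_percent(cost):
    """Return {(row, col): percent slower than the fastest entry} for off-diagonal cells."""
    fastest = min(min(row) for row in cost)
    result = {}
    for i, row in enumerate(cost):
        for j, value in enumerate(row):
            if i != j:  # off-diagonal cells are the cross-node accesses
                result[(i, j)] = round((value / fastest - 1.0) * 100.0, 1)
    return result


# Relative costs as printed by coreinfo (node 00 and node 01).
cost = [[1.0, 1.3],
        [1.4, 1.1]]

penalties = remote_penalty_percent(cost)
print(penalties)
```

Note that coreinfo normalizes to the fastest entry, so the 1.1 on the diagonal means that local access on node 01 was itself measured slightly slower than on node 00.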

WWAudioFilter 1.0.54



Test setup: add jitter with convolution length = 65537 to an 1800-second, 44100 Hz mono WAV file.
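To get a feel for the size of this task, the following sketch counts the multiply-add operations, assuming a direct (non-FFT) convolution that evaluates all 65537 taps for every output sample. Whether WWAudioFilter works exactly this way is an assumption made here for the estimate.

```python
SAMPLE_RATE_HZ = 44100
DURATION_S = 1800
CONVOLUTION_LENGTH = 65537

# One output sample per input sample.
num_samples = SAMPLE_RATE_HZ * DURATION_S

# Assumption: every output sample needs CONVOLUTION_LENGTH multiply-adds.
multiply_adds = num_samples * CONVOLUTION_LENGTH

print(f"samples:        {num_samples:,}")
print(f"multiply-adds:  {multiply_adds:,}")
```

Under that assumption the task is on the order of 5 x 10^12 double-precision multiply-adds.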



Result

  • Allocate memory on NUMA0, process on NUMA0 CPU: 52 min 29.6 sec
  • Allocate memory on NUMA0, process on NUMA1 CPU: 52 min 31.2 sec
  • Using all resources: 26 min 52.5 sec
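The differences between these timings can be worked out as follows. This is only arithmetic on the three numbers above; the helper `to_seconds` is an illustrative name.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min sec' timing to seconds."""
    return minutes * 60 + seconds


local_s = to_seconds(52, 29.6)    # memory on NUMA0, compute on NUMA0
remote_s = to_seconds(52, 31.2)   # memory on NUMA0, compute on NUMA1
all_s = to_seconds(26, 52.5)      # using all resources

remote_slowdown_percent = (remote_s / local_s - 1.0) * 100.0
speedup_all = local_s / all_s

print(f"remote vs local: {remote_slowdown_percent:.2f}% slower")
print(f"all resources:   {speedup_all:.2f}x faster than one node")
```

The remote run is about 0.05% slower (1.6 seconds out of more than 52 minutes), far below the 30 to 40% that coreinfo reports for raw memory access, and using all resources is about 1.95 times faster than a single node.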

Thoughts


  • According to coreinfo, remote memory access on this NUMA computer is 30 to 40 percent slower than local memory access.
  • WWAudioFilter 1.0.54 shows no cross-NUMA performance degradation. Perhaps this is because all the input data needed for processing stays in the CPU cache, so the cores do not stall waiting for data to arrive. The task is also double-precision math that takes a long time to compute, so the workload is compute-bound rather than memory-bound.
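The cache explanation can be checked roughly against the cache sizes listed in Appendix A. The sketch below assumes samples are held as 8-byte doubles and that each output sample reads a sliding window of 65537 input samples; both are assumptions about WWAudioFilter's internals, not facts taken from its source.

```python
BYTES_PER_DOUBLE = 8
CONVOLUTION_LENGTH = 65537
NUM_SAMPLES = 44100 * 1800

L2_BYTES = 256 * 1024        # per-core L2, from the coreinfo cache map
L3_BYTES = 25 * 1024 * 1024  # per-CPU L3, from the coreinfo cache map

# Assumption: one convolution window of doubles is the hot working set.
window_bytes = CONVOLUTION_LENGTH * BYTES_PER_DOUBLE
# Assumption: the whole signal is stored as doubles.
signal_bytes = NUM_SAMPLES * BYTES_PER_DOUBLE

print(f"window: {window_bytes / 1024:.0f} KiB, fits in L3: {window_bytes < L3_BYTES}")
print(f"signal: {signal_bytes / 1e6:.0f} MB, fits in L3: {signal_bytes < L3_BYTES}")
```

Under these assumptions the whole signal (about 635 MB) does not fit in cache, but one window (about 512 KiB) fits easily in the 25 MB L3. Moving the window forward by one sample brings in only one new value for roughly 65537 multiply-adds, so very little memory traffic is needed per unit of compute.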


Computing on NUMA0 cores: processing load graph per NUMA node.

Computing on NUMA0 cores: processing load graph per logical core.

Computing on NUMA1 cores: processing load graph per NUMA node.

Appendix A. Full coreinfo result

This computer has 2 CPUs with 10 cores each (40 logical processors with Hyper-Threading). Windows places all the logical processors in a single processor group, so apps can easily scale across all of them. The last-level CPU cache size is 25 MB per CPU.

C:\apps\SysinternalsSuite>coreinfo

Coreinfo v2.11 - Dump information on system CPU and memory topology
Copyright (C) 2008-2010 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**-------------------------------------- Physical Processor 0 (Hyperthreaded)
--**------------------------------------ Physical Processor 1 (Hyperthreaded)
----**---------------------------------- Physical Processor 2 (Hyperthreaded)
------**-------------------------------- Physical Processor 3 (Hyperthreaded)
--------**------------------------------ Physical Processor 4 (Hyperthreaded)
----------**---------------------------- Physical Processor 5 (Hyperthreaded)
------------**-------------------------- Physical Processor 6 (Hyperthreaded)
--------------**------------------------ Physical Processor 7 (Hyperthreaded)
----------------**---------------------- Physical Processor 8 (Hyperthreaded)
------------------**-------------------- Physical Processor 9 (Hyperthreaded)
--------------------**------------------ Physical Processor 10 (Hyperthreaded)
----------------------**---------------- Physical Processor 11 (Hyperthreaded)
------------------------**-------------- Physical Processor 12 (Hyperthreaded)
--------------------------**------------ Physical Processor 13 (Hyperthreaded)
----------------------------**---------- Physical Processor 14 (Hyperthreaded)
------------------------------**-------- Physical Processor 15 (Hyperthreaded)
--------------------------------**------ Physical Processor 16 (Hyperthreaded)
----------------------------------**---- Physical Processor 17 (Hyperthreaded)
------------------------------------**-- Physical Processor 18 (Hyperthreaded)
--------------------------------------** Physical Processor 19 (Hyperthreaded)

Logical Processor to Socket Map:
********************-------------------- Socket 0
--------------------******************** Socket 1

Logical Processor to NUMA Node Map:
********************-------------------- NUMA Node 0
--------------------******************** NUMA Node 1

Approximate Cross-NUMA Node Access Cost (relative to fastest):
00 01
00: 1.0 1.3
01: 1.4 1.1

Logical Processor to Cache Map:
**-------------------------------------- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-------------------------------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-------------------------------------- Unified Cache 0, Level 2, 256 KB, Assoc 8, LineSize 64
********************-------------------- Unified Cache 1, Level 3, 25 MB, Assoc 20, LineSize 64
--**------------------------------------ Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--**------------------------------------ Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--**------------------------------------ Unified Cache 2, Level 2, 256 KB, Assoc 8, LineSize 64
----**---------------------------------- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
----**---------------------------------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
----**---------------------------------- Unified Cache 3, Level 2, 256 KB, Assoc 8, LineSize 64
------**-------------------------------- Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
------**-------------------------------- Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
------**-------------------------------- Unified Cache 4, Level 2, 256 KB, Assoc 8, LineSize 64
--------**------------------------------ Data Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
--------**------------------------------ Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
--------**------------------------------ Unified Cache 5, Level 2, 256 KB, Assoc 8, LineSize 64
----------**---------------------------- Data Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
----------**---------------------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
----------**---------------------------- Unified Cache 6, Level 2, 256 KB, Assoc 8, LineSize 64
------------**-------------------------- Data Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------**-------------------------- Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------**-------------------------- Unified Cache 7, Level 2, 256 KB, Assoc 8, LineSize 64
--------------**------------------------ Data Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------**------------------------ Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------**------------------------ Unified Cache 8, Level 2, 256 KB, Assoc 8, LineSize 64
----------------**---------------------- Data Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**---------------------- Instruction Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**---------------------- Unified Cache 9, Level 2, 256 KB, Assoc 8, LineSize 64
------------------**-------------------- Data Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**-------------------- Instruction Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**-------------------- Unified Cache 10, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------**------------------ Data Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------**------------------ Instruction Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------**------------------ Unified Cache 11, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------******************** Unified Cache 12, Level 3, 25 MB, Assoc 20, LineSize 64
----------------------**---------------- Data Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------**---------------- Instruction Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------**---------------- Unified Cache 13, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------**-------------- Data Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------**-------------- Instruction Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------**-------------- Unified Cache 14, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------**------------ Data Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------**------------ Instruction Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------**------------ Unified Cache 15, Level 2, 256 KB, Assoc 8, LineSize 64
----------------------------**---------- Data Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------**---------- Instruction Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------**---------- Unified Cache 16, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------------**-------- Data Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------**-------- Instruction Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------**-------- Unified Cache 17, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------------**------ Data Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------**------ Instruction Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------**------ Unified Cache 18, Level 2, 256 KB, Assoc 8, LineSize 64
----------------------------------**---- Data Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------------**---- Instruction Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------------**---- Unified Cache 19, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------------------**-- Data Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------------**-- Instruction Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------------**-- Unified Cache 20, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------------------** Data Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------------** Instruction Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------------** Unified Cache 21, Level 2, 256 KB, Assoc 8, LineSize 64

Logical Processor to Group Map:
**************************************** Group 0
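The map rows above can be turned into processor affinity bitmasks, which is one common way to restrict a process to a single NUMA node. The sketch below assumes the leftmost character of each row corresponds to logical processor 0; the function name `map_to_mask` is illustrative, and the post does not say how the runs were actually pinned.

```python
def map_to_mask(row):
    """Convert a coreinfo-style map row ('*' = member) to an affinity bitmask.

    Assumption: the leftmost character is logical processor 0 (bit 0).
    """
    mask = 0
    for index, ch in enumerate(row):
        if ch == "*":
            mask |= 1 << index
    return mask


# Rows from "Logical Processor to NUMA Node Map" (20 + 20 logical processors).
node0_row = "*" * 20 + "-" * 20
node1_row = "-" * 20 + "*" * 20

node0_mask = map_to_mask(node0_row)
node1_mask = map_to_mask(node1_row)

# A Windows processor group holds at most 64 logical processors.
fits_one_group = len(node0_row) <= 64

print(f"NUMA node 0 mask: {node0_mask:#x}")
print(f"NUMA node 1 mask: {node1_mask:#x}")
print(f"fits in one processor group: {fits_one_group}")
```

On Windows, masks like these can be given to, for example, the `/AFFINITY` option of the `start` command (which also has a `/NODE` option).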

Monday, September 23, 2019

Intel i7-8700K vs AMD TR 2990WX energy efficiency comparison

Processing time and energy efficiency are compared between two PCs: an Intel i7-8700K and an AMD TR 2990WX.

The task compared is the WWAudioFilter 1.0.54 jitter-add with FIR window size = 65537, applied to an 1800-second, 44100 Hz mono PCM signal. It is a double-precision 1D convolution.

Result
  • The AMD TR 2990WX is 3 times faster than the Intel i7-8700K.
  • Energy efficiency increases when TDP is reduced (the CPU is downclocked).
  • The AMD TR 2990WX at TDP 250W has roughly the same energy efficiency as the Intel i7-8700K at TDP 65W.
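As a reminder of why a lower TDP can improve efficiency, the energy used for a fixed task is average power multiplied by processing time. The sketch below uses purely hypothetical numbers that are not measurements from this post: if power drops by more than the processing time grows, the energy per task goes down.

```python
def energy_joules(average_power_w, processing_time_s):
    """Energy for one run of the task: power (W) x time (s) = joules."""
    return average_power_w * processing_time_s


# Hypothetical values for illustration only, NOT measured data.
stock_energy = energy_joules(100.0, 1000.0)    # higher power, shorter run
reduced_energy = energy_joules(65.0, 1300.0)   # lower power, longer run

print(f"stock:   {stock_energy:.0f} J")
print(f"reduced: {reduced_energy:.0f} J")
print(f"reduced TDP is more efficient: {reduced_energy < stock_energy}")
```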

Processing performance comparison


Energy efficiency comparison