Saturday, September 28, 2019

Benchmark: Cross-NUMA remote memory access cost and real world app

This post investigates the cross-NUMA (remote) memory access cost of a dual-CPU Intel Xeon E5-2650 v3 computer.



Coreinfo


The cross-NUMA memory access cost is measured with Sysinternals Coreinfo:



C:\apps\SysinternalsSuite>coreinfo

Coreinfo v2.11 - Dump information on system CPU and memory topology
Copyright (C) 2008-2010 Mark Russinovich
Sysinternals - www.sysinternals.com

(snip)

Approximate Cross-NUMA Node Access Cost (relative to fastest):
       00  01
00: 1.0 1.3
01: 1.4 1.1

According to Coreinfo, the cross-NUMA access cost is 1.3 to 1.4 relative to the fastest access, i.e. remote memory is 30 to 40 % slower than local memory.

WWAudioFilter 1.0.54



Test setup: add jitter, with convolution length = 65537, to an 1800-second 44100Hz mono WAV file.



Result

  • Allocate memory on NUMA0, process on NUMA0 CPU: 52 min 29.6 sec
  • Allocate memory on NUMA0, process on NUMA1 CPU: 52 min 31.2 sec
  • Using all resources: 26 min 52.5 sec

Thoughts


  • On this NUMA computer, Coreinfo reports that remote memory access is 30 to 40 percent slower than local memory access.
  • WWAudioFilter 1.0.54 shows no cross-NUMA performance degradation. Perhaps this is because all input data for the processing stays cached on the CPU, so the cores do not stall waiting for data to arrive. Also, the task is double-precision math that takes time to compute, so the workload is compute-bound rather than memory-bound.


Computing on NUMA0 cores: processing load graph per NUMA node.

Computing on NUMA0 cores: processing load graph per logical core.

Computing on NUMA1 cores: processing load graph per NUMA node.






Appendix A. Full coreinfo result

This computer has 2 CPUs with 10 cores each. Windows places all the cores in a single processor group, so it is very easy for any app to scale across them. The last-level cache is 25MB per CPU.

C:\apps\SysinternalsSuite>coreinfo

Coreinfo v2.11 - Dump information on system CPU and memory topology
Copyright (C) 2008-2010 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**-------------------------------------- Physical Processor 0 (Hyperthreaded)
--**------------------------------------ Physical Processor 1 (Hyperthreaded)
----**---------------------------------- Physical Processor 2 (Hyperthreaded)
------**-------------------------------- Physical Processor 3 (Hyperthreaded)
--------**------------------------------ Physical Processor 4 (Hyperthreaded)
----------**---------------------------- Physical Processor 5 (Hyperthreaded)
------------**-------------------------- Physical Processor 6 (Hyperthreaded)
--------------**------------------------ Physical Processor 7 (Hyperthreaded)
----------------**---------------------- Physical Processor 8 (Hyperthreaded)
------------------**-------------------- Physical Processor 9 (Hyperthreaded)
--------------------**------------------ Physical Processor 10 (Hyperthreaded)
----------------------**---------------- Physical Processor 11 (Hyperthreaded)
------------------------**-------------- Physical Processor 12 (Hyperthreaded)
--------------------------**------------ Physical Processor 13 (Hyperthreaded)
----------------------------**---------- Physical Processor 14 (Hyperthreaded)
------------------------------**-------- Physical Processor 15 (Hyperthreaded)
--------------------------------**------ Physical Processor 16 (Hyperthreaded)
----------------------------------**---- Physical Processor 17 (Hyperthreaded)
------------------------------------**-- Physical Processor 18 (Hyperthreaded)
--------------------------------------** Physical Processor 19 (Hyperthreaded)

Logical Processor to Socket Map:
********************-------------------- Socket 0
--------------------******************** Socket 1

Logical Processor to NUMA Node Map:
********************-------------------- NUMA Node 0
--------------------******************** NUMA Node 1

Approximate Cross-NUMA Node Access Cost (relative to fastest):
       00  01
00: 1.0 1.3
01: 1.4 1.1

Logical Processor to Cache Map:
**-------------------------------------- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-------------------------------------- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
**-------------------------------------- Unified Cache 0, Level 2, 256 KB, Assoc 8, LineSize 64
********************-------------------- Unified Cache 1, Level 3, 25 MB, Assoc 20, LineSize 64
--**------------------------------------ Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--**------------------------------------ Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
--**------------------------------------ Unified Cache 2, Level 2, 256 KB, Assoc 8, LineSize 64
----**---------------------------------- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
----**---------------------------------- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
----**---------------------------------- Unified Cache 3, Level 2, 256 KB, Assoc 8, LineSize 64
------**-------------------------------- Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
------**-------------------------------- Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
------**-------------------------------- Unified Cache 4, Level 2, 256 KB, Assoc 8, LineSize 64
--------**------------------------------ Data Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
--------**------------------------------ Instruction Cache 4, Level 1, 32 KB, Assoc 8, LineSize 64
--------**------------------------------ Unified Cache 5, Level 2, 256 KB, Assoc 8, LineSize 64
----------**---------------------------- Data Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
----------**---------------------------- Instruction Cache 5, Level 1, 32 KB, Assoc 8, LineSize 64
----------**---------------------------- Unified Cache 6, Level 2, 256 KB, Assoc 8, LineSize 64
------------**-------------------------- Data Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------**-------------------------- Instruction Cache 6, Level 1, 32 KB, Assoc 8, LineSize 64
------------**-------------------------- Unified Cache 7, Level 2, 256 KB, Assoc 8, LineSize 64
--------------**------------------------ Data Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------**------------------------ Instruction Cache 7, Level 1, 32 KB, Assoc 8, LineSize 64
--------------**------------------------ Unified Cache 8, Level 2, 256 KB, Assoc 8, LineSize 64
----------------**---------------------- Data Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**---------------------- Instruction Cache 8, Level 1, 32 KB, Assoc 8, LineSize 64
----------------**---------------------- Unified Cache 9, Level 2, 256 KB, Assoc 8, LineSize 64
------------------**-------------------- Data Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**-------------------- Instruction Cache 9, Level 1, 32 KB, Assoc 8, LineSize 64
------------------**-------------------- Unified Cache 10, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------**------------------ Data Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------**------------------ Instruction Cache 10, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------**------------------ Unified Cache 11, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------******************** Unified Cache 12, Level 3, 25 MB, Assoc 20, LineSize 64
----------------------**---------------- Data Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------**---------------- Instruction Cache 11, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------**---------------- Unified Cache 13, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------**-------------- Data Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------**-------------- Instruction Cache 12, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------**-------------- Unified Cache 14, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------**------------ Data Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------**------------ Instruction Cache 13, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------**------------ Unified Cache 15, Level 2, 256 KB, Assoc 8, LineSize 64
----------------------------**---------- Data Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------**---------- Instruction Cache 14, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------**---------- Unified Cache 16, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------------**-------- Data Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------**-------- Instruction Cache 15, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------**-------- Unified Cache 17, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------------**------ Data Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------**------ Instruction Cache 16, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------**------ Unified Cache 18, Level 2, 256 KB, Assoc 8, LineSize 64
----------------------------------**---- Data Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------------**---- Instruction Cache 17, Level 1, 32 KB, Assoc 8, LineSize 64
----------------------------------**---- Unified Cache 19, Level 2, 256 KB, Assoc 8, LineSize 64
------------------------------------**-- Data Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------------**-- Instruction Cache 18, Level 1, 32 KB, Assoc 8, LineSize 64
------------------------------------**-- Unified Cache 20, Level 2, 256 KB, Assoc 8, LineSize 64
--------------------------------------** Data Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------------** Instruction Cache 19, Level 1, 32 KB, Assoc 8, LineSize 64
--------------------------------------** Unified Cache 21, Level 2, 256 KB, Assoc 8, LineSize 64

Logical Processor to Group Map:
**************************************** Group 0














Monday, September 23, 2019

Intel i7-8700K vs AMD TR 2990WX energy efficiency comparison

Processing time and energy efficiency are compared between two PCs: an Intel i7-8700K and an AMD TR 2990WX.

The task compared is the WWAudioFilter 1.0.54 jitter-add filter with FIR window size = 65537, applied to an 1800-second 44100Hz mono PCM signal. It is a double-precision 1D convolution.
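For a sense of scale, a direct-form double-precision FIR convolution can be sketched as follows. This is a generic textbook loop under the assumption of a fixed coefficient set, not WWAudioFilter's actual implementation; `ConvolveDirect` and `MacCount` are names invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR convolution: y[n] = sum over k of h[k] * x[n - k].
// Samples before the start of x are treated as zero; the output has the
// same length as the input.
std::vector<double> ConvolveDirect(const std::vector<double>& x,
                                   const std::vector<double>& h) {
    std::vector<double> y(x.size(), 0.0);
    if (h.empty()) {
        return y;
    }
    for (std::size_t n = 0; n < x.size(); ++n) {
        // Only taps that land inside the input contribute.
        const std::size_t lastTap = (n < h.size() - 1) ? n : h.size() - 1;
        double acc = 0.0;
        for (std::size_t k = 0; k <= lastTap; ++k) {
            acc += h[k] * x[n - k];
        }
        y[n] = acc;
    }
    return y;
}

// Upper bound on multiply-accumulate operations for a direct convolution.
constexpr unsigned long long MacCount(unsigned long long samples,
                                      unsigned long long taps) {
    return samples * taps;
}
```

For the test signal, `MacCount(1800ULL * 44100ULL, 65537ULL)` is about 5.2 x 10^12 multiply-accumulates, which is consistent with the task being compute-heavy.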

Result
  • The AMD TR 2990WX is 3 times faster than the Intel i7-8700K.
  • Energy efficiency improves when TDP is reduced (the CPU is downclocked).
  • The AMD TR 2990WX at TDP 250W has roughly the same energy efficiency as the Intel i7-8700K at TDP 65W.

Processing performance comparison


Energy efficiency comparison

Thursday, August 1, 2019

String Vibration problem

Problem definition

The string vibration is expressed as a function of space and time, $y(x,t)$, written in separated form: $y(x,t) = u(x)\varphi(t)$, where $t$ is time, $x$ is the position along the string and $y$ is the displacement (amplitude).
The initial string shape is given: $y(x,0) = g(x)$.
The initial string velocity is zero: $\frac{\partial y(x,0)}{\partial t} = 0$.
Both ends of the string are fixed: $y(0,t) = 0$, $y(1,t) = 0$.
The function $y(x,t)$ obeys the following partial differential equation: $u \frac{d^2\varphi}{dt^2} = c^2 \varphi \frac{d^2 u}{dx^2}$, where $c$ is a constant and $c^2 = \frac{\text{tension}}{\text{density}}$.

Solution


The problem above was solved by Jean d'Alembert in 1747. The following more explicit solution was found by Daniel Bernoulli in 1755: $y(x,t) = \sum_{n=1}^{\infty} A_n \sin(n\pi x)\cos(n\pi c t)$, with $g(x) = \sum_{n=1}^{\infty} A_n \sin(n\pi x)$. The coefficients $A_n$ of $g(x)$ are calculated using the Discrete Sine Transform (although the Fourier Transform was not known at that time).
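The series solution can be evaluated numerically with a few lines of code. The sketch below is not taken from the WWStringVibration source; it assumes g(x) is sampled at N+1 equally spaced points on [0, 1] and uses a plain O(N^2) discrete sine transform. The function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// g holds samples g(j / N) for j = 0..N, with fixed ends g[0] = g[N] = 0.
// Returns a vector a where a[n] approximates A_n for n = 1..N-1
// (a[0] is unused and left at zero).
std::vector<double> SineCoefficients(const std::vector<double>& g) {
    assert(g.size() >= 2);
    const std::size_t N = g.size() - 1;
    std::vector<double> a(N, 0.0);
    for (std::size_t n = 1; n < N; ++n) {
        double sum = 0.0;
        for (std::size_t j = 1; j < N; ++j) {
            sum += g[j] * std::sin(kPi * n * j / N);
        }
        a[n] = 2.0 * sum / N;
    }
    return a;
}

// Evaluate y(x, t) = sum over n of A_n sin(n pi x) cos(n pi c t).
double StringDisplacement(const std::vector<double>& a,
                          double x, double t, double c) {
    double y = 0.0;
    for (std::size_t n = 1; n < a.size(); ++n) {
        y += a[n] * std::sin(kPi * n * x) * std::cos(kPi * n * c * t);
    }
    return y;
}
```

At t = 0 the sum reproduces the sampled g(x) at the sample points, and for the fundamental mode g(x) = sin(πx) the shape is inverted after t = 1/c, half of the vibration period.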

Windows App

Windows App: https://sourceforge.net/projects/playpcmwin/files/others/WWStringVibration103.zip/download
Source code : https://sourceforge.net/p/playpcmwin/code/HEAD/tree/PlayPcmWin/WWStringVibration/

How to install

Extract the Zip file to create a folder that contains WWStringVibration.exe and the accompanying DLLs.

How to use

Run WWStringVibration.exe. Edit g(x) with the mouse. Press Start to simulate the string vibration.

How to uninstall

Just delete downloaded files.

License

MIT License. 

Saturday, May 11, 2019

I built small desktop PC with Nvidia Titan V and Intel x550-T1

I put an Nvidia Titan V and a 10 gigabit Ethernet adapter into a small Mini-ITX case.

Component list:

  • Case : Fractal Design Node 202
  • CPU : Intel Core i7-8700K 
  • Memory : DDR4 2666 DIMM  8GB x2
  • Motherboard : Asus Rog Z370-i Mini-ITX
  • M.2 NVMe SSD : Samsung 960 Evo 500GB
  • GPU : Nvidia Titan V
  • LAN card : Intel x550-T1 10 gigabit Ethernet adapter
  • PSU : Corsair SF-750 SFX PSU
  • M.2 to PCIex4 adapter : Mintcell M.2 M NGFF to PCIe 4x with 4 pin Molex
  • PCIe x4 flexible riser cable : SourcingBay 20cm PCIe 4x flex riser cable
  • Backpanel LAN port: Neutrik NE8FDX-P6 
  • CPU Fan : Cooler Master i70c
  • Case fan: Noctua NF-F12 Industrial PPC-3000 PWM x2
  • Y cable for 4pin PWM connector: Noctua NA-SYC1 
  • Category 6 50cm Ethernet cable (for internal cabling)


 Build

The bracket is removed from the Intel x550-T1. Kapton tape is also applied to its back side to prevent short circuits.


Ethernet port is added on the back panel.


Samsung 960 Evo M.2 NVMe SSD is put onto back side of the motherboard.



Some parts are assembled. This M.2 to PCIe x4 adapter needs power from a 4-pin Molex connector. Don't forget to feed power to this device.



The Titan V is mounted on the PCIe x16 riser cage (the riser cage is included with the Fractal Design Node 202 case).

The PCIe x4 flexible riser connector did not fit in the optimal place, so I enlarged the hole a bit.


All parts are fitted.


Case cover is installed. CPU cooler height is just right for this case.



Titan V and x550-T1 are successfully recognized on Device Manager.

Some thoughts


All back-panel USB ports of this motherboard are USB 3.0 (USB 3.1 Gen1). A USB 3.1 Gen2 header exists near the CPU fan header, but it is not connected to anything. A USB 3.1 Gen2 port would be nice to have, so it should somehow be routed to the outside of the case.

The Intel 10GbE LAN adapter, the M.2 NVMe SSD and the USB ports are all connected to the CPU via DMI, which has the bandwidth of a PCIe x4 link. The DMI bus may therefore become saturated in I/O-intensive applications.
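A rough way to reason about this is to add up the nominal device rates and compare the sum with the link. The sketch below assumes the DMI link behaves like PCIe 3.0 x4 (8 GT/s per lane, 128b/130b encoding) and ignores protocol overhead; the SSD throughput is left as a parameter because it was not measured here, and all function names are hypothetical.

```cpp
#include <cassert>

// Nominal one-direction bandwidth of a PCIe 3.0 link in GB/s:
// 8 GT/s per lane with 128b/130b encoding, 8 bits per byte.
constexpr double Pcie3BandwidthGBps(int lanes) {
    return lanes * 8.0 * (128.0 / 130.0) / 8.0;
}

// Ethernet line rate in GB/s for a given rate in Gbit/s (no framing overhead).
constexpr double EthernetLineRateGBps(double gigabitsPerSecond) {
    return gigabitsPerSecond / 8.0;
}

// True when the combined demand of the devices exceeds the shared link.
bool IsOversubscribed(double linkGBps, double demandGBps) {
    assert(linkGBps > 0.0);
    return demandGBps > linkGBps;
}
```

Under these assumptions the x4 link offers a little under 4 GB/s, and 10GbE at line rate takes 1.25 GB/s of it. As an illustration only, `IsOversubscribed(Pcie3BandwidthGBps(4), EthernetLineRateGBps(10.0) + 3.0)` returns true for a hypothetical SSD streaming 3.0 GB/s, while the same call with 2.0 GB/s returns false.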

Graphics adapter is connected to CPU via dedicated PCIe x16 connection (PEG) so there is no CPU to GPU transfer slowdown on this build.

Two 12cm chassis fans at 3,000rpm are a bit overkill for this build. 2,000rpm is sufficient.

CPU fan and chassis fan speed can be controlled using Asus Fan Xpert. Fan is very quiet when idle.


I restricted the maximum CPU frequency to 4.0GHz to prevent thermal throttling: open Intel Extreme Tuning Utility and, on the Basic Tuning tab under Step 2, set Processor Core Ratio to 40x and Processor Cache Ratio to 40x, then press Apply and Save. I then ran the Stress Test to confirm that thermal throttling does not kick in.

The CPU fan LED is a bit too bright and washes out the motherboard RGB LED effect, but it is not a big problem.

Sunday, January 6, 2019

File transfer speed comparison: USB3.1 Gen2 Type-C and USB3.1 Gen2 Type-A

I got a USB3.1 Gen2 Type-C to Type-A adapter, the Sanwa Supply AD-USB29CFA (Fig.1).











Fig.1: AD-USB29CFA.

I attached it to a USB 3.1 Gen2 Type-C storage device and ran CrystalDiskMark.

Computer:
AMD Threadripper 2990WX
Asus Zenith Extreme, Firmware 1601

USB Mass Storage Device:
ENCU3NV-JO1 M.2 to USB3.1 Gen2 Type-C converter.
Samsung 970 Pro 512GB M.2 NVMe SSD inside.


NTFS formatted, NTFS Allocation unit size=64KB.














Fig.2:  ENCU3NV-JO1 M.2 to USB3.1 Gen2 Type-C converter.

Motherboard backpanel USB3.1Gen2 ports are used to connect USB mass storage device (Fig.3).

















Fig.3: USB ports used (marked SS10, colored magenta). The upper port is Type-A and the lower port is Type-C. Both are USB 3.1 Gen2 (SuperSpeed+ 10Gbps) ports.













Fig.4: WWShowUsbDeviceTree shows the USB storage connected to the Type-C port as "TypeC". The magenta line means the device is linked at SuperSpeed+.














Fig.5: WWShowUsbDeviceTree shows "TypeA" for the USB storage connected to the Type-A port through the AD-USB29CFA. The magenta line means the device is linked at SuperSpeed+.

CrystalDiskMark Results


















Fig.6: CrystalDiskMark result of Type-C port.



















Fig.7: CrystalDiskMark result of Type-A port.

















Fig.8: Comparison graph.

 

Conclusion


There is no significant speed difference between the Type-C USB3.1 Gen2 port and the Type-A USB3.1 Gen2 port. Both work at SuperSpeed+ speed.

This confirms that the AD-USB29CFA is a USB3.1 Gen2 SuperSpeed+ capable adapter. It is recommended.






















Tuesday, January 1, 2019

Modifying D3D12HelloFrameBuffering desktop to support fullscreen

DirectX-Graphics-Samples-master\Samples\Desktop\D3D12HelloWorld\src\HelloFrameBuffering

There is a D3D12Fullscreen desktop project, but it is a little complicated, so I modified the D3D12HelloFrameBuffering desktop project to add fullscreen capability.

D3D12HelloFrameBuffering project settings

Right-click the D3D12HelloFrameBuffering project to open the D3D12HelloFrameBuffering property pages.
Set the configuration to "All configurations".
Under Configuration Properties > Manifest Tool > All Options, set DPI Awareness to Per Monitor High DPI Aware.

Win32Application class


Copy the WM_SIZE handler from the D3D12Fullscreen Win32Application class into WindowProc().

DXSample class


Add the OnSizeChanged() pure virtual function declaration to the DXSample class:
 virtual void OnSizeChanged(UINT width, UINT height, bool minimized) = 0;

Copy SetWindowBounds() from the D3D12Fullscreen DXSample class.
Add a RECT m_windowBounds; member.


D3D12HelloFrameBuffering class


Comment out the following line to enable ALT+Enter:

factory->MakeWindowAssociation(Win32Application::GetHwnd(), DXGI_MWA_NO_ALT_ENTER)


Add bool m_windowedMode variable to D3D12HelloFrameBuffering class.


Copy these functions from D3D12Fullscreen to D3D12HelloFrameBuffering:
 void LoadSizeDependentResources();
 void UpdatePostViewAndScissor();
 void LoadSceneResolutionDependentResources();

 virtual void OnSizeChanged(UINT width, UINT height, bool minimized);


 PopulateCommandList() is unchanged.

This is my UpdatePostViewAndScissor() implementation:

void D3D12HelloFrameBuffering::UpdatePostViewAndScissor()
{
    float x = 1.0f;
    float y = 1.0f;

    m_viewport.TopLeftX = m_width * (1.0f - x) / 2.0f;
    m_viewport.TopLeftY = m_height * (1.0f - y) / 2.0f;
    m_viewport.Width = x * m_width;
    m_viewport.Height = y * m_height;

    m_scissorRect.left = static_cast<LONG>(m_viewport.TopLeftX);
    m_scissorRect.right = static_cast<LONG>(m_viewport.TopLeftX + m_viewport.Width);
    m_scissorRect.top = static_cast<LONG>(m_viewport.TopLeftY);
    m_scissorRect.bottom = static_cast<LONG>(m_viewport.TopLeftY + m_viewport.Height);
}
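The same arithmetic can be checked outside Direct3D. The sketch below restates it as a free function of the window size and two scale factors; `Viewport` and `Rect` are simplified stand-ins for D3D12_VIEWPORT and D3D12_RECT, and the function names are hypothetical.

```cpp
#include <cassert>

// Simplified stand-ins for D3D12_VIEWPORT and D3D12_RECT.
struct Viewport { float TopLeftX, TopLeftY, Width, Height; };
struct Rect { long left, top, right, bottom; };

// Viewport covering the fraction (scaleX, scaleY) of the window, centered.
// With scaleX = scaleY = 1 it covers the whole window, as in the code above.
Viewport CenteredViewport(unsigned width, unsigned height,
                          float scaleX, float scaleY) {
    assert(scaleX > 0.0f && scaleX <= 1.0f);
    assert(scaleY > 0.0f && scaleY <= 1.0f);
    Viewport v;
    v.TopLeftX = width * (1.0f - scaleX) / 2.0f;
    v.TopLeftY = height * (1.0f - scaleY) / 2.0f;
    v.Width = scaleX * width;
    v.Height = scaleY * height;
    return v;
}

// Scissor rectangle that matches the viewport, truncated to integers.
Rect ScissorFor(const Viewport& v) {
    Rect r;
    r.left = static_cast<long>(v.TopLeftX);
    r.top = static_cast<long>(v.TopLeftY);
    r.right = static_cast<long>(v.TopLeftX + v.Width);
    r.bottom = static_cast<long>(v.TopLeftY + v.Height);
    return r;
}
```

For a 3840x2160 window, a scale of 1.0 gives a viewport at (0, 0) of size 3840x2160, and a scale of 0.5 gives a 1920x1080 viewport whose top-left corner is at (960, 540).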


m_resolutionOptions[], m_postViewport, m_postScissorRect, m_postCommandList and LoadSceneResolutionDependentResources() are not strictly necessary.

Call LoadSizeDependentResources() at the end of LoadAssets().

Run the app and press ALT+Enter to switch to fullscreen.

Screen shot of D3D12HelloFrameBuffering, 3840x2160 fullscreen mode


Studying D3D12HelloConstBuffers desktop sample


DirectX-Graphics-Samples-master\Samples\Desktop\D3D12HelloWorld\src\HelloConstBuffers

• This sample describes how to pass a constant buffer to shaders.

Diff from D3D12HelloTriangle.h


struct SceneConstantBuffer
{
    XMFLOAT4 offset;
};
SceneConstantBuffer m_constantBufferData;
// constant buffer view (CBV) descriptor heap.
ComPtr<ID3D12DescriptorHeap> m_cbvHeap;
ComPtr<ID3D12Resource> m_constantBuffer;
UINT8* m_pCbvDataBegin;

D3D12HelloConstBuffers objects and their relations






























In OnInit(), m_constantBuffer->Map() is called to get the mapped pointer m_pCbvDataBegin, and m_constantBuffer is never Unmap()ed. Keeping the constant buffer mapped is OK.

In OnUpdate(), the constant buffer data is updated and memcpy()ed to m_pCbvDataBegin.
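The map-once, copy-every-frame pattern can be illustrated without a GPU. In the sketch below a byte vector stands in for the mapped upload-heap resource and `Float4` stands in for XMFLOAT4. The `padding` member is an assumption added here because Direct3D 12 requires constant buffer views to be sized in multiples of 256 bytes; the class name, the `Update` parameters and the wrap-around rule are illustrative only.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct Float4 { float x, y, z, w; };  // stand-in for DirectX::XMFLOAT4

struct SceneConstantBuffer {
    Float4 offset;
    float padding[60];  // assumption: pad the struct to 256 bytes
};
static_assert(sizeof(SceneConstantBuffer) % 256 == 0,
              "constant buffer size must be a multiple of 256 bytes");

// Simulates a constant buffer that is mapped once and stays mapped.
class MappedConstantBuffer {
public:
    MappedConstantBuffer()
        : storage_(sizeof(SceneConstantBuffer), 0),
          mapped_(storage_.data()) {}  // plays the role of Map()
    MappedConstantBuffer(const MappedConstantBuffer&) = delete;
    MappedConstantBuffer& operator=(const MappedConstantBuffer&) = delete;

    // Per-frame update: change the CPU-side copy, then copy it to the mapping.
    void Update(float speed, float bounds) {
        data_.offset.x += speed;
        if (data_.offset.x > bounds) {
            data_.offset.x = -bounds;
        }
        std::memcpy(mapped_, &data_, sizeof(data_));
    }

    // Reads offset.x back from the mapped bytes (what the shader would see).
    float ReadBackOffsetX() const {
        Float4 f;
        std::memcpy(&f, mapped_, sizeof(f));
        return f.x;
    }

private:
    SceneConstantBuffer data_{};
    std::vector<std::uint8_t> storage_;
    std::uint8_t* mapped_;
};
```

Each call to `Update()` moves the offset and overwrites the mapped bytes, which is all the CPU has to do between frames when the buffer lives in an upload heap.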

m_commandList->SetDescriptorHeaps() and m_commandList->SetGraphicsRootDescriptorTable() are called to set m_cbvHeap.

In Shaders.hlsl, the constant buffer is exposed at register(b0):

cbuffer SceneConstantBuffer : register(b0)
{
    float4 offset;
};