Monday, July 20, 2020

Running Simple DirectX12 Compute Shader: Looking into Dispatch xyz, numthreads xyz, SV_GroupIndex and SV_GroupID xyz

There are several parameters involved in launching a compute shader, and several argument values are passed to the compute shader's main function.

In this post, a compute shader is executed with
  • Dispatch(4,1,1)
  • numthreads(3,1,1)
to see what argument values are passed to the compute shader's main function.

Source code

Compute shader: https://sourceforge.net/p/playpcmwin/code/HEAD/tree/PlayPcmWin/WWDirectCompute12Test2019/Sandbox.hlsl

C++ program to run the compute shader: https://sourceforge.net/p/playpcmwin/code/HEAD/tree/PlayPcmWin/WWDirectCompute12Test2019/TestSandboxShader.cpp

Compute shader to run on the GPU

Sandbox.hlsl : this compute shader is called with Dispatch(4,1,1)
RWStructuredBuffer<float> g_output   : register(u0);

[numthreads(3, 1, 1)]
void
CSMain(
    uint tid : SV_GroupIndex,        // 0 <= tid < 3          ← numthreads(3,1,1)
    uint3 groupIdXYZ : SV_GroupID)   // 0 <= groupIdXYZ.x < 4 ← Dispatch(4,1,1)
{
    int idx = tid + groupIdXYZ.x * 5;
    g_output[idx] = 1;
}

Shader setup

Please refer to TestSandboxShader.cpp. It compiles Sandbox.hlsl as a compute shader, prepares a GPU buffer of 4096 bytes and sets an Unordered Access View on it, creates the compute pipeline state, calls Dispatch(4,1,1), and copies the GPU buffer contents back to a float array in CPU memory.
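For reference, here is a minimal sketch (not the actual TestSandboxShader.cpp source) of how the dispatch and the read-back copy can be recorded on a D3D12 command list. The names cmdList, rootSig, pso, uavHeap, gpuBuf and readbackBuf are my own placeholders and assume the corresponding objects were already created, with root parameter 0 being a UAV descriptor table:

// Sketch of the command-list recording (assumed names, not the actual source).
cmdList->SetComputeRootSignature(rootSig);
cmdList->SetPipelineState(pso);
ID3D12DescriptorHeap* heaps[] = { uavHeap };
cmdList->SetDescriptorHeaps(1, heaps);
cmdList->SetComputeRootDescriptorTable(0, uavHeap->GetGPUDescriptorHandleForHeapStart());

// 4 thread groups x numthreads(3,1,1) = 12 CSMain invocations.
cmdList->Dispatch(4, 1, 1);

// Make the UAV writes copyable, then copy the buffer to a READBACK heap resource.
D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
barrier.Transition.pResource   = gpuBuf;
barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_COPY_SOURCE;
cmdList->ResourceBarrier(1, &barrier);
cmdList->CopyResource(readbackBuf, gpuBuf);
// After Close(), ExecuteCommandLists() and a fence wait, Map() readbackBuf to read the floats.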


Compute Shader resources, shader main function arguments and thread group

The Unordered Access View u0 is visible from the compute shader. The shader can read from and write to this buffer.
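As a rough sketch (again, not the actual code), the 4096-byte buffer can be exposed to the shader as the u0 RWStructuredBuffer<float> with a UAV descriptor like the following; device, gpuBuf and uavHeap are assumed to exist (gpuBuf created on a DEFAULT heap with D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS, uavHeap a shader-visible CBV/SRV/UAV descriptor heap):

// Sketch: describe the 4096-byte buffer as a structured buffer of floats (u0).
D3D12_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
uavDesc.Format = DXGI_FORMAT_UNKNOWN;              // structured buffers use UNKNOWN format
uavDesc.ViewDimension = D3D12_UAV_DIMENSION_BUFFER;
uavDesc.Buffer.FirstElement = 0;
uavDesc.Buffer.NumElements = 4096 / sizeof(float); // 1024 float elements
uavDesc.Buffer.StructureByteStride = sizeof(float);
device->CreateUnorderedAccessView(gpuBuf, nullptr, &uavDesc,
    uavHeap->GetCPUDescriptorHandleForHeapStart());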

The CSMain function is called 4 × 3 = 12 times in total (4 thread groups from Dispatch(4,1,1), 3 threads per group from numthreads(3,1,1)). The function arguments of each call are as follows:
CSMain(tid=0, groupIdXYZ=0,0,0)
CSMain(tid=1, groupIdXYZ=0,0,0)
CSMain(tid=2, groupIdXYZ=0,0,0)

CSMain(tid=0, groupIdXYZ=1,0,0)
CSMain(tid=1, groupIdXYZ=1,0,0)
CSMain(tid=2, groupIdXYZ=1,0,0)

CSMain(tid=0, groupIdXYZ=2,0,0)
CSMain(tid=1, groupIdXYZ=2,0,0)
CSMain(tid=2, groupIdXYZ=2,0,0)

CSMain(tid=0, groupIdXYZ=3,0,0)
CSMain(tid=1, groupIdXYZ=3,0,0)
CSMain(tid=2, groupIdXYZ=3,0,0)
Three consecutive calls share the same groupIdXYZ, and those 3 calls are executed "simultaneously": the GPU has several hundred cores, so the 3 tasks are assigned to 3 individual GPU cores and run in parallel (see the following image). In a more practical compute shader, it is important to run 128 or more threads in parallel, i.e. something like numthreads(128,1,1), to utilize the GPU cores fully.
The 3 function calls that share the same groupIdXYZ are called a thread group. Calls in the same thread group can share thread group shared memory (TGSM), which is significantly faster than UAV memory, although TGSM size is limited to about 32 KB. Utilizing TGSM is one of the key techniques for accelerating GPU computation.

In this Sandbox compute shader, the threads write to adjacent GPU memory positions simultaneously. This slows down the write operation. It is better for the threads of a thread group to write to memory positions farther apart from each other, so the data is written more quickly.

Values written to u0 GPU memory

The Sandbox.hlsl shader writes the following values to the GPU buffer memory u0:g_output.
    i, g_output[i], shader invocation that wrote this value
    0, 1.000000,   <== CSMain(tid=0, groupIdXYZ=0,0,0)
    1, 1.000000,   <== CSMain(tid=1, groupIdXYZ=0,0,0)
    2, 1.000000,   <== CSMain(tid=2, groupIdXYZ=0,0,0)
    3, 0.000000,
    4, 0.000000,
    5, 1.000000,   <== CSMain(tid=0, groupIdXYZ=1,0,0)
    6, 1.000000,   <== CSMain(tid=1, groupIdXYZ=1,0,0)
    7, 1.000000,   <== CSMain(tid=2, groupIdXYZ=1,0,0)
    8, 0.000000,
    9, 0.000000,
   10, 1.000000,   <== CSMain(tid=0, groupIdXYZ=2,0,0)
   11, 1.000000,   <== CSMain(tid=1, groupIdXYZ=2,0,0)
   12, 1.000000,   <== CSMain(tid=2, groupIdXYZ=2,0,0)
   13, 0.000000,
   14, 0.000000,
   15, 1.000000,   <== CSMain(tid=0, groupIdXYZ=3,0,0)
   16, 1.000000,   <== CSMain(tid=1, groupIdXYZ=3,0,0)
   17, 1.000000,   <== CSMain(tid=2, groupIdXYZ=3,0,0)
   18, 0.000000,
   19, 0.000000,
   20, 0.000000,
   21, 0.000000,
   22, 0.000000,
   23, 0.000000,
   24, 0.000000,
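The write pattern can be reproduced on the CPU with the same index formula, idx = tid + groupIdXYZ.x * 5. The following small C++ program is my own illustration (not part of the test code) and prints the same table:

// Reproduces the write pattern of Sandbox.hlsl on the CPU:
// 4 thread groups (Dispatch(4,1,1)) x 3 threads per group (numthreads(3,1,1)),
// each invocation writing g_output[tid + groupId * 5] = 1.
#include <cstdio>

int main() {
    float output[25] = {};                           // first 25 elements of the buffer
    for (int groupId = 0; groupId < 4; ++groupId) {  // Dispatch x = 4
        for (int tid = 0; tid < 3; ++tid) {          // numthreads x = 3
            output[tid + groupId * 5] = 1.0f;
        }
    }
    for (int i = 0; i < 25; ++i) {
        printf("%2d, %f\n", i, output[i]);
    }
    return 0;
}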
 


Saturday, July 11, 2020

Benchmark: maxCache accelerated HDD RAID10 array

I tried to accelerate an HDD RAID array by connecting a SAS 12G SSD to the Adaptec RAID controller as a maxCache 4.0 device.

Connecting the SSD to the Controller

Set the connector to HBA mode on maxView.
On Linux, type dmesg to find the drive's device file name:

 # dmesg
           ...
[  817.729986] smartpqi 0000:41:00.0: added 10:0:-:- 5000c5003e8f4f3d Direct-Access     SEAGATE  XS960SE70004     AIO+ qd=64

[  817.732450] scsi 10:0:1:0: Direct-Access     SEAGATE  XS960SE70004     0004 PQ: 0 ANSI: 7
[  817.733837] sd 10:0:1:0: Attached scsi generic sg5 type 0
[  817.736622] sd 10:0:1:0: [sdd] 1875385008 512-byte logical blocks: (960 GB/894 GiB)
[  817.736625] sd 10:0:1:0: [sdd] 4096-byte physical blocks
[  817.737389] sd 10:0:1:0: [sdd] Write Protect is off
[  817.737391] sd 10:0:1:0: [sdd] Mode Sense: dd 00 10 08
[  817.738855] sd 10:0:1:0: [sdd] Write cache: disabled, read cache: enabled, supports DPO and FUA
[  817.754703] sd 10:0:1:0: [sdd] Attached SCSI disk

It seems my drive is /dev/sdd and /dev/sg5, and the drive's logical sector size is 512 bytes, so it is ready to be used as a maxCache drive. Set the connector back to RAID mode on maxView.

Using it as a maxCache drive

Enable maxCache on the HDDx8 RAID10 array; the steps are similar to those in the last article.

Test setup  

  • AMD ThreadRipper 2990WX
  • Memory 64GB
  • Microsemi Adaptec SmartRaid 3154-8i16e (memory cache is enabled)
  • HDD: WDC WD40EZRZ-00G 3.5inch SATA HDD x8 RAID10
  • SSD for MaxCache: Seagate Nytro XS960SE70004 SAS12G SSD
  • Windows 10 x64 version 2004

Benchmark results

CrystalDiskMark 7.0.0

Fig. 1 HDDx8 RAID10 without maxCache

Fig. 2 HDDx8 RAID10 with maxCache of 1 SSD (unsafe)

Fig. 3 HDDx8 RAID10 with maxCache of 2 SSD RAID1

Some thoughts about the result


It seems maxCache 4.0 works well. Read performance in particular is accelerated; the HDD array performs like an SSD.

Write performance is better than that of an ordinary HDD even without maxCache. I suppose this is thanks to the 4 GB memory cache of the RAID controller.

Disabling maxCache

On maxView, wait until the supercapacitor is charged. Select the RAID array and press the Set Properties button in the Logical Device ribbon group, move to the maxCache tab, and set "Write cache policy preferred" to Write Through. Wait until the SSD cache data is flushed to the HDDs, then select maxCacheDevice → deviceName → Cache for ArrayName and press the Delete maxCache button in the maxCache ribbon group.

Disk Provisioning

The Seagate Nytro 3331 SSD accepts the provision command of the SeaChest_Basics utility. I reduced the drive capacity from 960 GB to 800 GB (1562500000 LBAs) to increase the drive's write endurance (I hope).

Set the controller connector connected to the Nytro SSD to HBA mode, then run the following commands on the Linux console:
# cd SeaChest/Linux/Lin64/
# ./SeaChest_Basics_280_11923_64 --scan --onlySeagate

to find your Seagate SSD. In my case it is /dev/sg1.

# ./SeaChest_Basics_280_11923_64 -d /dev/sg1 -i

to see the MaxLBA. My drive's MaxLBA is 1875385007 (approx. 960 GB).

Reduce the drive capacity to 800 GB (1562500000 LBAs):


# ./SeaChest_Basics_280_11923_64 -d /dev/sg1 --provision 1562500000 

Then reboot the computer.

Set the controller connector connected to the Nytro SSD back to RAID mode to use it as a maxCache drive.

Current Storage Configuration


The following maxView screen shows the current configuration of my storage array.