Hi. I have a question about compute shader performance when using it for general-purpose calculation.
I ran into a problem while converting my CPU-based algorithm into a CPU-GPU algorithm using a compute shader: the calculation becomes extremely slow when Dispatch is called in a loop inside Update().
I tested a simple point-plane distance calculation with and without looping. Below is my shader code.
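#pragma kernel MyVisible

struct MyData
{
    float3 pt;   // input point
    float dist;  // output: signed distance from the point to the plane
};

float3 normal;   // plane normal
float3 ptontri;  // a point on the plane (on the triangle)
RWStructuredBuffer<MyData> dataBuffer;

[numthreads(1024,1,1)]
void MyVisible (int3 dispatchID : SV_DispatchThreadID, int3 groupID : SV_GroupID)
{
    // With Dispatch(numgroup, 1, 1), dispatchID.y and groupID.y are always 0,
    // so id is effectively dispatchID.x.
    int id = dispatchID.x + dispatchID.y*1 + groupID.y*1024;
    float3 pl = dataBuffer[id].pt - ptontri;
    float d = dot(pl, normal);
    dataBuffer[id].dist = d;
}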
And below is the looping part, abbreviated.
void Update()
{
    // initialization is done in Start()
    for (int i = 0; i < 300; i++)
    {
        // upload the points for this iteration
        buffer = new ComputeBuffer(numgroup * threadcount, single_stride, ComputeBufferType.Default);
        buffer.SetData(subdata);

        // bind the buffer and plane parameters, then run the kernel
        int kernel = shader.FindKernel("MyVisible");
        shader.SetBuffer(kernel, "dataBuffer", buffer);
        shader.SetFloats("normal", new float[] { normal.x, normal.y, normal.z });
        shader.SetFloats("ptontri", new float[] { ptontri.x, ptontri.y, ptontri.z });
        shader.Dispatch(kernel, numgroup, 1, 1);

        // read the results back to the CPU; they are needed by the next iteration
        MyData[] data = new MyData[numgroup * threadcount];
        buffer.GetData(data);
        buffer.Release();
    }
}
In short, the looping version is about 30 times slower.
That is, calculating 300 points 300 times per frame is about 30 times slower than calculating 90,000 points once per frame. I understand that the latter parallelizes better, but I cannot avoid the loop because of how the algorithm itself is designed, and the gap between the two timings seems unreasonably wide.
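For comparison, the non-looping case (all 90,000 points in one dispatch) can be sketched roughly as below. This is only a minimal sketch and not the exact test code: the class name BatchedDistance, the array allPoints, and the 16-byte stride are assumptions made for illustration, while the kernel name, buffer name, and uniform names are taken from the shader above.

```csharp
using UnityEngine;

// Hypothetical component illustrating the single-dispatch (non-looping) case.
public class BatchedDistance : MonoBehaviour
{
    // Mirrors MyData in the shader: float3 + float = 16 bytes.
    struct MyData
    {
        public Vector3 pt;
        public float dist;
    }

    public ComputeShader shader;            // assigned in the Inspector
    public Vector3 normal = Vector3.up;     // plane normal
    public Vector3 ptontri = Vector3.zero;  // a point on the plane

    const int ThreadCount = 1024;           // must match [numthreads(1024,1,1)]
    const int PointCount = 90000;           // 300 points x 300 iterations

    void Update()
    {
        // Round up so every point is covered: 88 groups of 1024 threads.
        int numgroup = (PointCount + ThreadCount - 1) / ThreadCount;
        int count = numgroup * ThreadCount;

        // Assumed input: real code would fill the first PointCount entries;
        // the padding entries at the end are left at zero.
        var allPoints = new MyData[count];

        var buffer = new ComputeBuffer(count, sizeof(float) * 4, ComputeBufferType.Default);
        buffer.SetData(allPoints);

        int kernel = shader.FindKernel("MyVisible");
        shader.SetBuffer(kernel, "dataBuffer", buffer);
        shader.SetFloats("normal", new float[] { normal.x, normal.y, normal.z });
        shader.SetFloats("ptontri", new float[] { ptontri.x, ptontri.y, ptontri.z });

        // One upload, one dispatch, and one readback per frame.
        shader.Dispatch(kernel, numgroup, 1, 1);
        buffer.GetData(allPoints);
        buffer.Release();
    }
}
```

The point of the comparison is that this version pays for one SetData/Dispatch/GetData round trip per frame, whereas the loop pays for 300 of them.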
When I measured with System.Diagnostics.Stopwatch, the main bottleneck was the GetData() call inside the loop. A similar problem is reported in another thread (in that case, no looping was involved). The answer there suggests doing some other calculation after Dispatch, which is not applicable in my case because each iteration needs the result of the previous one.
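A measurement of this kind could look roughly like the sketch below. It is an illustration under assumptions, not the original test code: the class name DispatchTiming, the three stopwatch names, and the single-group buffer size are hypothetical, and the plane uniforms are left at their defaults to keep it short. Note that Dispatch only queues the work, so the time recorded around GetData() also includes waiting for the GPU to finish the kernel.

```csharp
using System.Diagnostics;
using UnityEngine;

// Hypothetical component that times each phase of the looped dispatch.
public class DispatchTiming : MonoBehaviour
{
    // Mirrors MyData in the shader: float3 + float = 16 bytes.
    struct MyData
    {
        public Vector3 pt;
        public float dist;
    }

    public ComputeShader shader;   // assigned in the Inspector
    const int ThreadCount = 1024;  // must match [numthreads(1024,1,1)]

    void Update()
    {
        int kernel = shader.FindKernel("MyVisible");
        var data = new MyData[ThreadCount];

        // One accumulating stopwatch per phase.
        var upload = new Stopwatch();
        var dispatch = new Stopwatch();
        var readback = new Stopwatch();

        for (int i = 0; i < 300; i++)
        {
            var buffer = new ComputeBuffer(ThreadCount, sizeof(float) * 4);

            upload.Start();
            buffer.SetData(data);
            upload.Stop();

            shader.SetBuffer(kernel, "dataBuffer", buffer);

            dispatch.Start();
            shader.Dispatch(kernel, 1, 1, 1);  // returns once the work is queued
            dispatch.Stop();

            readback.Start();
            buffer.GetData(data);              // blocks until the GPU is done
            readback.Stop();

            buffer.Release();
        }

        // Fully qualified because System.Diagnostics also defines a Debug class.
        UnityEngine.Debug.Log(
            "SetData: " + upload.Elapsed.TotalMilliseconds + " ms, " +
            "Dispatch: " + dispatch.Elapsed.TotalMilliseconds + " ms, " +
            "GetData: " + readback.Elapsed.TotalMilliseconds + " ms");
    }
}
```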
So, I would like to know whether there is a way to improve performance when a compute shader has to be dispatched in a loop like this.