Performance of looping compute shader in Update()

Hi. I have a question about compute shader performance when using one for general-purpose calculation.

I ran into a problem while converting my CPU-based algorithm into a CPU-GPU algorithm using a compute shader: the calculation becomes extremely slow when I dispatch in a loop inside Update().

I’ve tested a simple point-plane distance calculation with and without looping. Below is my shader code.

#pragma kernel MyVisible

struct MyData
{
	float3 pt;
	float dist;
};

float3 normal;
float3 ptontri;
 
RWStructuredBuffer<MyData> dataBuffer;
 
[numthreads(1024,1,1)]
void MyVisible (int3 dispatchID : SV_DispatchThreadID,int3 groupID : SV_GroupID)
{
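	// With a 1D dispatch and numthreads(1024,1,1), dispatchID.y and groupID.y are always 0, so id == dispatchID.x.
	int id = dispatchID.x + dispatchID.y*1 + groupID.y*1024;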
	float3 pl = dataBuffer[id].pt - ptontri;
	float d = dot(pl,normal);
	dataBuffer[id].dist = d;
}

And below is the looping part, briefly.

void Update()
{
    //initialization is done in Start()

    for (int i = 0; i < 300; i++)
    {
        buffer = new ComputeBuffer(numgroup * threadcount, single_stride, ComputeBufferType.Default);
        buffer.SetData(subdata);
        int kernel = shader.FindKernel("MyVisible");
        shader.SetBuffer(kernel, "dataBuffer", buffer);
        shader.SetFloats("normal", new float[] { normal.x, normal.y, normal.z });
        shader.SetFloats("ptontri", new float[] { ptontri.x, ptontri.y, ptontri.z });

        shader.Dispatch(kernel, numgroup, 1, 1);

        MyData[] data = new MyData[numgroup * threadcount];

        buffer.GetData(data);
        buffer.Release();
    }
}

In short, the looped version is about 30 times slower.

That is, calculating 300 points 300 times per frame is 30 times slower than calculating 90000 points once per frame. I understand that the latter parallelizes better, but I cannot avoid looping because of how the algorithm itself is designed, and the gap between the two timings is surprisingly wide.

When I measured with System.Diagnostics.Stopwatch, the main bottleneck in the loop was GetData(). A similar problem has been reported elsewhere (in that case, no looping was involved). The answer there suggests doing some other calculation after the dispatch, which is not applicable in my case since we need the result value in the next loop iteration.
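For anyone wanting to reproduce that measurement, here is a minimal sketch of a Stopwatch helper. The names StageTimer and TimeMs are made up for illustration and are not part of Unity. One caveat, as far as I understand it: Dispatch() mostly just queues work for the GPU, while GetData() has to wait until that work has finished, so the time reported for GetData() likely includes the GPU execution time and the synchronization stall, not only the copy itself.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not a Unity API): times a single stage of the loop.
public static class StageTimer
{
    // Runs the action once and returns the elapsed wall-clock time in milliseconds.
    public static double TimeMs(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

Inside the loop it could be used as `double readMs = StageTimer.TimeMs(() => buffer.GetData(data));`, with the same wrapper around Dispatch() and SetData() to compare the stages.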

So, I want to know if there’s a way to improve performance when using a compute shader in a loop.

This is an older question, but in case anyone comes across it: you need a completely different way of approaching the problem. Using GetData() even once or twice pretty much defeats the purpose of hardware acceleration, and you’re doing it 300 times per frame. Once you get your data on the GPU, you really should keep it there.

I don’t know what kind of algorithm you’re working with, so I’m going to say this with regard to the test code you posted. Instead of getting your result (very slowly!) and then feeding it back into the compute shader, you should create two separate buffers when your script starts. Then, when you run your loop, set buffer1 as read and buffer2 as write, and swap them at the end of each iteration.

Pseudocode:

private ComputeBuffer buffer1;
private ComputeBuffer buffer2;

void Start(){
	// Allocate both buffers once instead of once per loop iteration.
	buffer1 = new ComputeBuffer(length, stride, type);
	buffer2 = new ComputeBuffer(length, stride, type);
}

void Update(){
	for(var i = 0; i < 300; i++){
		// The kernel reads the previous result from buffer1 and writes the new one to buffer2.
		shader.SetBuffer(kernelID, "readBuffer", buffer1);
		shader.SetBuffer(kernelID, "writeBuffer", buffer2);
		shader.Dispatch(kernelID, numGroup, 1, 1);

		// Swap so that this iteration's output becomes the next iteration's input.
		Swap(ref buffer1, ref buffer2);
	}
}

void Swap(ref ComputeBuffer b1, ref ComputeBuffer b2) {
	ComputeBuffer bTemp = b1;
	b1 = b2;
	b2 = bTemp;
}
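If the CPU does need the final values, one option is to read them back a single time after the loop rather than on every iteration. Below is a rough sketch of that, under several assumptions that are not in the original posts: the class name PingPongExample is invented, the kernel is assumed to have been changed to declare StructuredBuffer<MyData> readBuffer and RWStructuredBuffer<MyData> writeBuffer in place of dataBuffer, and the whole per-iteration dependency is assumed to be expressible on the GPU.

```csharp
using UnityEngine;

// Hypothetical component: ping-pongs two buffers and reads back once per frame.
public class PingPongExample : MonoBehaviour
{
    public ComputeShader shader;   // assumed to contain the "MyVisible" kernel
    public int numGroup = 1;       // assumed number of thread groups

    const int ThreadCount = 1024;              // matches [numthreads(1024,1,1)]
    const int Stride = sizeof(float) * 4;      // float3 pt + float dist = 16 bytes

    struct MyData { public Vector3 pt; public float dist; }

    ComputeBuffer buffer1;
    ComputeBuffer buffer2;
    MyData[] results;
    int kernelID;

    void Start()
    {
        int count = numGroup * ThreadCount;
        buffer1 = new ComputeBuffer(count, Stride);
        buffer2 = new ComputeBuffer(count, Stride);
        results = new MyData[count];
        kernelID = shader.FindKernel("MyVisible");

        // Upload the input once; placeholder data stands in for the real points.
        buffer1.SetData(new MyData[count]);
    }

    void Update()
    {
        for (int i = 0; i < 300; i++)
        {
            shader.SetBuffer(kernelID, "readBuffer", buffer1);
            shader.SetBuffer(kernelID, "writeBuffer", buffer2);
            shader.Dispatch(kernelID, numGroup, 1, 1);

            // Swap: this iteration's output becomes the next iteration's input.
            ComputeBuffer tmp = buffer1;
            buffer1 = buffer2;
            buffer2 = tmp;
        }

        // After the final swap, buffer1 refers to the most recently written buffer.
        // This is the only GPU-to-CPU synchronization point in the frame.
        buffer1.GetData(results);
    }

    void OnDestroy()
    {
        // ComputeBuffers are not garbage collected; release them explicitly.
        buffer1.Release();
        buffer2.Release();
    }
}
```

This still stalls once per frame on GetData(), but it replaces 300 allocations, uploads, and readbacks with a single readback.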

I’d like to know the same.