# of-graphics
p
Hello! I'm curious if anyone here has a good idea about interleaving work between a compute shader and a fragment shader. Some relevant details:
• My app is built with Rust and wgpu, and I'm running on an M1 MacBook Pro.
• I have a single encoder with a compute pipeline and a render pipeline.
• The compute shader writes to a storage buffer defined like this:
@group(0) @binding(2) var<storage, read_write> output: array<vec4<f32>>;
• The fragment shader reads from the same buffer. Basically, each fragment is just one element of the vec4<f32>. The fragment shader is very simple and doesn't touch anything else in the storage buffer.
I've added timestamp queries to the pipeline, and what I'm seeing is this:
Duration #1: 47.800208ms
Duration #2: 47.809876ms
Frame time: 51.2545ms
Duration #1 is computed from the compute shader timestamps (the time between the beginning and end of the compute pass), and Duration #2 is the duration of the render pass, computed the same way. Frame time is measured on the CPU.
I expected the compute and render durations to add up (approximately) to the frame time, but they don't, and I'm confused about why! Could it be due to interleaving of the compute pass and render pass? If so, I'm curious how the synchronization works. How does the GPU figure out the dependency between the writer (a compute shader invocation) and the reader (a fragment shader invocation)? I don't have any explicit synchronization, but I'm also not seeing any tearing or anything else that would indicate a data race between the shaders.
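For reference, the per-frame encoding is shaped roughly like this. This is a simplified sketch against a wgpu 0.19-era API, not the real code; the resource names, workgroup counts, and clear color are invented:

fn encode_frame(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    compute_pipeline: &wgpu::ComputePipeline,
    render_pipeline: &wgpu::RenderPipeline,
    bind_group: &wgpu::BindGroup,
    query_set: &wgpu::QuerySet, // 4 timestamp slots: compute begin/end, render begin/end
    frame_view: &wgpu::TextureView,
) {
    let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor {
        label: Some("frame encoder"),
    });

    {
        // Compute pass: fills the storage buffer, with timestamps at entry and exit.
        let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor {
            label: Some("compute pass"),
            timestamp_writes: Some(wgpu::ComputePassTimestampWrites {
                query_set,
                beginning_of_pass_write_index: Some(0),
                end_of_pass_write_index: Some(1),
            }),
        });
        cpass.set_pipeline(compute_pipeline);
        cpass.set_bind_group(0, bind_group, &[]);
        cpass.dispatch_workgroups(64, 64, 1); // workgroup counts are placeholders
    }

    {
        // Render pass: the fragment shader reads the same storage buffer.
        let mut rpass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
            label: Some("render pass"),
            color_attachments: &[Some(wgpu::RenderPassColorAttachment {
                view: frame_view,
                resolve_target: None,
                ops: wgpu::Operations {
                    load: wgpu::LoadOp::Clear(wgpu::Color::BLACK),
                    store: wgpu::StoreOp::Store,
                },
            })],
            depth_stencil_attachment: None,
            timestamp_writes: Some(wgpu::RenderPassTimestampWrites {
                query_set,
                beginning_of_pass_write_index: Some(2),
                end_of_pass_write_index: Some(3),
            }),
            occlusion_query_set: None,
        });
        rpass.set_pipeline(render_pipeline);
        rpass.set_bind_group(0, bind_group, &[]);
        rpass.draw(0..3, 0..1); // fullscreen triangle
    }

    // Both passes go into the same command buffer and are submitted together.
    queue.submit(Some(encoder.finish()));
}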
d
those durations are suspiciously close for being times of completely different processes! plus, is it possible that the compute for frame N was run in frame N-1?
p
I agree. My colleague’s theory was that the two passes are interleaved: both running at the same time but somehow synchronized by the runtime. I’m skeptical that we’d get that for free.
I also wondered whether they are somehow running in parallel but with the compute shader one frame ahead, but I don’t see how that’s possible in the code. Both pipelines are added to the same encoder, so they’re definitely submitted together.
I asked on another forum, and the conclusion was that the timestamps may not be reliable. There are a bunch of wgpu issues related to timestamps on Metal and TBDR (tile-based deferred rendering) architectures.
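In case the tick-to-milliseconds conversion is part of the problem, the readback side is shaped roughly like this (same caveats as the encoding sketch above: invented names, a wgpu 0.19-era API, and TIMESTAMP_QUERY enabled on the device):

fn read_timestamps(device: &wgpu::Device, queue: &wgpu::Queue, readback_buf: &wgpu::Buffer) -> (f64, f64) {
    // Assumes the frame encoder already did, before queue.submit():
    //   encoder.resolve_query_set(&query_set, 0..4, &resolve_buf, 0);
    //   encoder.copy_buffer_to_buffer(&resolve_buf, 0, readback_buf, 0, 32);
    // with readback_buf created with usage MAP_READ | COPY_DST.
    readback_buf.slice(..).map_async(wgpu::MapMode::Read, |r| r.unwrap());
    device.poll(wgpu::Maintain::Wait);

    let ticks: Vec<u64> = {
        let bytes = readback_buf.slice(..).get_mapped_range();
        bytes
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect()
    };
    readback_buf.unmap();

    // Raw timestamps are in ticks; the queue reports how many nanoseconds one tick is.
    let ns_per_tick = queue.get_timestamp_period() as f64;
    let compute_ms = ticks[1].wrapping_sub(ticks[0]) as f64 * ns_per_tick / 1.0e6; // Duration #1
    let render_ms = ticks[3].wrapping_sub(ticks[2]) as f64 * ns_per_tick / 1.0e6; // Duration #2
    (compute_ms, render_ms)
}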
j
I also vote “unreliable timestamps”
s
I did a bit of digging in the Metal docs and they do mention some kind of automatic dependency tracking between passes:
Metal automatically tracks dependencies between the compute and render passes. When the sample sends the command buffer to be executed, Metal detects that the compute pass writes to the output texture and the render pass reads from it, and makes sure the GPU finishes the compute pass before starting the render pass.
https://developer.apple.com/documentation/metal/compute_passes/processing_a_texture_in_a_compute_function
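If that's what's happening, then on the wgpu side the whole "contract" is presumably just that the buffer is bound as storage in both passes; there's no barrier for the app to write. A minimal sketch of what I mean (buffer name and size invented):

// wgpu/Metal derive the compute-write -> fragment-read ordering between the two
// passes from this usage and binding info; the app records no explicit sync.
let output = device.create_buffer(&wgpu::BufferDescriptor {
    label: Some("output"),
    size: 1920 * 1080 * 16, // one vec4<f32> (16 bytes) per fragment; resolution made up
    usage: wgpu::BufferUsages::STORAGE,
    mapped_at_creation: false,
});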
s
There are pretty good graphics debugging tools in Xcode/Instruments. If you can somehow hook them up to the Metal part of whatever you’re running, they would give you detailed information about what’s running where and when. https://developer.apple.com/documentation/xcode/metal-debugger/
Perhaps this gets you started: https://developer.apple.com/documentation/xcode/capturing-a-metal-workload-programmatically In particular: “Alternatively, in macOS 14 and later, you can set the environment variable on your Metal app: MTL_CAPTURE_ENABLED=1.” I’d assume that the library you’re using would likely expose any debugging facilities as well…?
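If setting that in the shell is awkward for your setup, the same switch can probably be flipped from inside the Rust app, as long as it happens before anything initializes Metal. A sketch, assuming a pre-2024 Rust edition where std::env::set_var is a safe call:

fn main() {
    // Equivalent to `export MTL_CAPTURE_ENABLED=1`; must run before wgpu
    // creates the Metal device so the capture layer can pick it up.
    std::env::set_var("MTL_CAPTURE_ENABLED", "1");
    // ... create the wgpu instance/adapter/device and run the app as usual ...
}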
Then you should be able to replay a captured trace like this: https://developer.apple.com/documentation/xcode/replaying-a-gpu-trace-file
p
@Sam Gentle That's interesting, thanks! Though I'm still skeptical that it could automatically determine the fine-grained dependencies between individual invocations of the compute/render passes. From the wording it sounds more like it would wait until the compute pass is complete before starting the render pass.
@Stefan Good call, I should see what Xcode will tell me. I do know how to capture a GPU trace; the only issue is that when I last tried it, it seemed that wgpu doesn't generate Metal debug symbols, so it's somewhat limited. But I could always try debugging a standalone pure Metal example that replicates the same access patterns. I found this, which seems like it could be helpful: https://developer.apple.com/documentation/xcode/analyzing-resource-dependencies