Dispatching a long-running compute shader causes system hang or abnormal behavior #6660

notogawa · 2025-02-10T08:06:59Z

Describe the bug

When we dispatch a shader program using ioctl(SUBMIT_CSD) on a Raspberry Pi 5, if the shader program’s execution time exceeds 500 ms, ioctl(WAIT_BO) returns "Timer expired" or the system hangs.

Once "Timer expired" occurs, even subsequent shader programs that should complete within 500 ms also result in "Timer expired."

When the system hangs, I can’t do anything. Pressing the power button has no effect, and the LED stays green (on).

I suspect this line. Is there any difficulty in relaxing this limit? I think it is too tight for GPGPU.

Steps to reproduce the behaviour

Here is an example program that reproduces the problem. The shader in this example is a busy nop loop.

$ git clone https://gist.github.com/notogawa/4dcebe6db14f5898dee85babb85f7d37
$ cd 4dcebe6db14f5898dee85babb85f7d37
$ gcc -o main main.c
$ ./main N (N is nop-loop count)

Case 1: Normal

$ ./main 1000000
[loop:1000000]
0.008614 sec
$ ./main 1000000
[loop:1000000]
0.008624 sec
$ ./main 64000000
[loop:64000000]
0.271148 sec

Case 2: Timer expired

$ ./main 128000000
[loop:128000000]
wait_bo: Timer expired <- displayed after 10 sec
$ ./main 1000000
[loop:1000000]
wait_bo: Timer expired

Case 3: System hang

$ ./main 128000000
[loop:128000000]
(hang.)

This example is a minimal reproducible program, so it’s just a no-op loop. In reality, however, we’re submitting programs like massive matrix–matrix multiplications.
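As a rough sanity check on where the limit bites, the two Case 1 timings can be extrapolated linearly. This is only a sketch, assuming run time scales linearly with the loop count; `estimate_seconds` and the constant names are made up for illustration.

```c
#include <assert.h>

/* Two timings taken from the Case 1 runs above. */
static const double N0 = 1000000.0,  T0 = 0.008614;  /* loops, seconds */
static const double N1 = 64000000.0, T1 = 0.271148;  /* loops, seconds */

/* Linear extrapolation through the two measured points.
 * Assumption: run time grows linearly with the loop count. */
static double estimate_seconds(double loops)
{
    assert(loops >= 0.0);
    double slope = (T1 - T0) / (N1 - N0);  /* seconds per iteration */
    return T0 + slope * (loops - N0);
}
```

Under that assumption, `estimate_seconds(128000000.0)` comes out at roughly 0.54 s, just past the 500 ms mark, while 64000000 iterations stay well under it.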

Device (s)

Raspberry Pi 5

System

$ cat /etc/rpi-issue
Raspberry Pi reference 2024-11-19
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 891df1e21ed2b6099a2e6a13e26c91dea44b34d4, stage2
$ vcgencmd version
2024/09/23 14:02:56
Copyright (c) 2012 Broadcom
version 26826259 (release) (embedded)
$ uname -a
Linux pi5 6.6.62+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.62-1+rpt1 (2024-11-25) aarch64 GNU/Linux

Logs

No response

Additional context

No response

notogawa · 2025-02-14T00:17:01Z

Could anyone follow up on this issue?
Please let me know if you are unable to reproduce the problem.

popcornmix · 2025-02-14T11:11:12Z

I can reproduce. Once we have had a timeout, it looks like all subsequent jobs are dead.
So two issues:
1: a timeout should be recoverable and future smaller jobs should work
2: perhaps the timeout should be increased to allow larger jobs

@mairacanal could you have a look at the timeout/reset code? Here are some of the complaints in dmesg:

[62974.906781] v3d 1002000000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[62974.906789] v3d 1002000000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[62975.154380] v3d 1002000000.v3d: MMUC flush wait idle failed
[62975.154384] v3d 1002000000.v3d: MMU flush timeout
[62975.418810] Unable to handle kernel NULL pointer dereference at virtual address 00000000000005c7
[62975.427600] Mem abort info:
[62975.430384]   ESR = 0x0000000096000005
[62975.434126]   EC = 0x25: DABT (current EL), IL = 32 bits
[62975.439432]   SET = 0, FnV = 0
[62975.442477]   EA = 0, S1PTW = 0
[62975.445609]   FSC = 0x05: level 1 translation fault
[62975.450479] Data abort info:
[62975.453352]   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
[62975.458830]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[62975.463873]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[62975.469177] user pgtable: 16k pages, 47-bit VAs, pgdp=00000001c0e08000
[62975.475697] [00000000000005c7] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[62975.484395] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
[62975.490654] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device binfmt_misc spidev brcmfmac_wcc aes_ce_blk aes_ce_cipher ghash_ce gf128mul libaes sha2_ce sha256_arm64 vc4 sha1_ce brcmfmac sha1_generic brcmutil snd_soc_hdmi_codec drm_display_helper raspberrypi_hwmon cfg80211 cec drm_client_lib drm_dma_helper snd_soc_core v3d snd_compress snd_pcm_dmaengine i2c_brcmstb spi_bcm2835 gpu_sched rpivid_hevc(C) rfkill snd_pcm pisp_be v4l2_mem2mem snd_timer drm_shmem_helper videobuf2_dma_contig rp1_pio snd drm_kms_helper gpio_keys videobuf2_memops videobuf2_v4l2 videodev raspberrypi_gpiomem rp1_adc rp1 rp1_mailbox videobuf2_common mc nvmem_rmem uio_pdrv_genirq uio i2c_dev fuse drm drm_panel_orientation_quirks backlight dm_mod ip_tables x_tables ipv6
[62975.557569] CPU: 0 UID: 0 PID: 17494 Comm: kworker/0:1 Tainted: G         C         6.14.0-rc2-v8-16k #15
[62975.567128] Tainted: [C]=CRAP
[62975.570086] Hardware name: Raspberry Pi 5 Model B Rev 1.1 (DT)
[62975.575909] Workqueue: events drm_sched_job_timedout [gpu_sched]
[62975.581912] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[62975.588865] pc : v3d_job_start_stats.isra.0+0x48/0xd8 [v3d]
[62975.594432] lr : v3d_job_start_stats.isra.0+0x2c/0xd8 [v3d]
[62975.599998] sp : ffffc00084fcbc40
[62975.603303] x29: ffffc00084fcbc40 x28: ffff800080ca0d68 x27: ffff800040a22b80
[62975.610430] x26: ffff800080ca0c78 x25: 00000000ffffffff x24: 00000000000005ef
[62975.617558] x23: 0000000000000001 x22: ffff800040a5fd80 x21: ffff800080ca0000
[62975.624684] x20: ffffffffffffffff x19: 0000000000000003 x18: 00000000ffffffff
[62975.631811] x17: 203a544154535f52 x16: ffffd06fce586a98 x15: 524f5252452a205d
[62975.638937] x14: 5d6433765b207465 x13: 3030303130303030 x12: 7830203a54415453
[62975.646064] x11: 5f5252455f443356 x10: ffffd06fcfc87160 x9 : ffffd06fae8c7afc
[62975.653191] x8 : ffff800040a5fe00 x7 : 0000000000000000 x6 : ffff800080ca0408
[62975.660317] x5 : 0000000000000002 x4 : 0000000002800001 x3 : ffff800003e863c0
[62975.667444] x2 : 0000000000000003 x1 : 000000000000005f x0 : 000039469b77b3a3
[62975.674572] Call trace:
[62975.677009]  v3d_job_start_stats.isra.0+0x48/0xd8 [v3d] (P)
[62975.682576]  v3d_csd_job_run+0xbc/0x2a8 [v3d]
[62975.686926]  drm_sched_resubmit_jobs+0x98/0x238 [gpu_sched]
[62975.692492]  v3d_gpu_reset_for_timeout+0x84/0xd8 [v3d]
[62975.697624]  v3d_csd_job_timedout+0x68/0x80 [v3d]
[62975.702321]  drm_sched_job_timedout+0x7c/0x120 [gpu_sched]
[62975.707799]  process_one_work+0x15c/0x3c0
[62975.711803]  worker_thread+0x2e4/0x3f0
[62975.715545]  kthread+0x138/0x1f0
[62975.718765]  ret_from_fork+0x10/0x20
[62975.722334] Code: b9000861 d37b7e61 2a1303e2 8b010281 (b9456824) 
[62975.728418] ---[ end trace 0000000000000000 ]---

(I happen to be on a 6.14 kernel, but the OP reported it on 6.6.)

pelwell · 2025-02-14T11:13:03Z

I feel there should be a timeout - what is a sensible maximum?

popcornmix · 2025-02-14T12:25:18Z

> I feel there should be a timeout - what is a sensible maximum?

In the desktop world, which is likely also using the 3D hardware, you don't want it too high, as a long-running job will stall the GUI.
For non-desktop use (or a desktop without 3D acceleration), a longer timeout seems more acceptable.
Possibly an override (cmdline.txt/sysfs) would be a reasonable compromise.

With the timeout handling in its current broken state, the GUI will be dead anyway, so the longer the timeout the better.

Even if the timeout code is fixed so that subsequent jobs continue to work, I wonder how many clients of this interface will be able to handle a timeout gracefully (resubmitting a job that timed out seems likely to time out again).
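On the client side, a timeout can at least be told apart from other failures. "Timer expired" is the message glibc prints for `ETIME`, so a wrapper around the wait call can branch on that value. A minimal sketch: `classify_wait` and the `wait_result` enum are hypothetical helpers, and the `ret`/`err` arguments stand for the return value of the ioctl call and the `errno` observed right after it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result codes for a client-side wait wrapper. */
enum wait_result { WAIT_OK, WAIT_TIMED_OUT, WAIT_FAILED };

/* ret: return value of the wait ioctl; err: errno captured right after it. */
static enum wait_result classify_wait(int ret, int err)
{
    if (ret >= 0)
        return WAIT_OK;         /* the buffer became idle in time */
    if (err == ETIME)
        return WAIT_TIMED_OUT;  /* reported as "Timer expired" */
    return WAIT_FAILED;         /* any other error */
}
```

A client that sees `WAIT_TIMED_OUT` would probably want to avoid blindly resubmitting the same job, for the reason given above.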

mairacanal · 2025-02-24T10:10:07Z

@popcornmix, I have already taken a look into this issue and can divide it into three separate issues.

1. The scheduler loops resubmitting the same guilty job repeatedly.
2. The hang limit timeout is too small for long compute shader jobs.
3. After the reset, the GPU hangs, making it impossible to complete jobs from any of the queues and freezing the GUI.

Although providing a fix to (1) and (2) was quick, (3) involved some debugging and tracing to understand why this is failing on the RPi 5. In the end, we understood the issue and fixed it.

I'll work on upstreaming the fixes in the next two days (also downstream as I had to make some DTB changes). Thanks for the complete report of the issue and an easy reproducible example!

notogawa · 2025-02-24T10:18:04Z

👍

popcornmix · 2025-02-24T11:49:49Z

Thanks Maira!

mairacanal · 2025-02-28T00:37:23Z

I sent the patches upstream [1] and opened the PR with the fixes. There is just one thing that @notogawa might have missed: increasing the hang limit.

After an analysis, I decided not to increase the hang limit. The limit doesn't mean that a CSD job longer than 500 ms will cause a GPU reset. It means that a single compute batch longer than 500 ms will cause a GPU reset, and it is pretty unlikely for a batch to take that long to run. The v3d driver checks whether we are making progress with the batches before resetting and, if so, it skips the reset.
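The progress check described above can be modeled in a few lines. This is a simplified user-space sketch, not the driver's code: `csd_job_model` and `timeout_needs_reset` are hypothetical names, and the batch counter is passed in as a plain integer where the driver would obtain it from the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-job state kept between timer expirations. */
struct csd_job_model {
    uint32_t timedout_batches;  /* batch counter seen at the previous expiration */
};

/* Called each time the hang timer fires.
 * Returns true when a reset is warranted, false when the job made progress. */
static bool timeout_needs_reset(struct csd_job_model *job, uint32_t current_batches)
{
    assert(job != NULL);
    if (job->timedout_batches != current_batches) {
        /* The counter moved since the last check: remember it and skip the reset. */
        job->timedout_batches = current_batches;
        return false;
    }
    /* No batch completed during the whole timeout period. */
    return true;
}
```

In this model, a job whose counter moves between two expirations is left alone, while a job that sits on the same batch across an expiration is reset. A dispatch made of one long batch never advances the counter, so the model resets it at the first expiration.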

The issue here is that your application uses a 1x1x1 workgroup, which is unusual. I tested other GPGPU applications, computing FFTs, and matrix multiplications, and I didn't have issues. My recommendation would be to split the work into smaller batches.

[1] https://lore.kernel.org/dri-devel/[email protected]/T/

notogawa · 2025-02-28T04:57:14Z

Thank you @mairacanal .

I know that splitting the workload into smaller batches would relax the constraints, but there is another reason why this is difficult: "the overhead of ioctl calls."

Our compiler generates a single compute shader that fuses the entire deep learning computation graph, allowing us to run inference with just "one submit_csd call" from start to finish. A "usual" approach is to generate separate, smaller batched compute shaders for each layer (convolution, activation functions like ReLU, matrix multiplication, and so on) and then submit them sequentially with submit_csd, one after another. However, this approach incurs significant overhead from the multiple submit_csd calls, which hurts inference performance. For the best performance, we must reject the "usual" approach and accept the "unusual" one.

With this technique, we have achieved the inference speed shown in this YouTube video on a Pi Zero with a VC4 (though not specifically on v3d). We have also applied the same technique to VC6 (with v3d), and we aim to do the same for VC7.

Workloads offloaded to the GPU are becoming increasingly heavy, with LLMs being a prime example, even though the CPU on a Pi 5 is theoretically about twice as fast as the GPU. For such computations, a small hang limit is becoming an even stricter constraint.

mairacanal · 2025-02-28T12:28:52Z

> I know that splitting the workload into smaller batches would relax the constraints, but there is another reason why this is difficult: "the overhead of ioctl calls."

No, it would be just one ioctl. Please, read the Mesa code to check the difference between jobs and batches [1]. You can use just one compute shader but will need to configure the CSD job differently.

[1] https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/broadcom/vulkan/v3dv_cmd_buffer.c?ref_type=heads#L4309

notogawa · 2025-02-28T14:37:04Z

I already checked that part five years ago. I know how to implement a single type of computation, such as "one large matrix multiplication", using a single submit_csd with workgroups, number of batches, uniforms, threading flag, compute shader payload register files, and so on. What I need is the ability to implement a sequence of different types of computation (convolution, activation functions, matrix multiplication, resizing, transpose, ...) using a single submit_csd with the best performance.

> The issue here is that your application uses a 1x1x1 workgroup, which is unusual.

The reproducible code I provided is merely a simplified minimal implementation. The way we actually use CSD in our app differs from this code.

mairacanal · 2025-03-01T14:05:03Z

Going through the repository you pointed to, I noticed that the number of batches isn't calculated as in Mesa [1]. Here is a snippet of the Mesa code:

   uint32_t batches_per_sg = DIV_ROUND_UP(wgs_per_sg * wg_size, 16);
   uint32_t whole_sgs = num_wgs / wgs_per_sg;
   uint32_t rem_wgs = num_wgs - whole_sgs * wgs_per_sg;
   uint32_t num_batches = batches_per_sg * whole_sgs +
                          DIV_ROUND_UP(rem_wgs * wg_size, 16);

We improved CSD handling about 3 years ago (2021) [2] with good performance improvements. Let me know if this snippet helps you. In case it doesn't, could you send me instructions on how to reproduce your application? This way I might provide a more precise answer on what could be done.

[1] https://github.com/Idein/py-videocore6/blob/f14c853f5bf4bd22420fdd56558e38ebb5a1c097/videocore6/driver.py#L126

[2] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10541
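For experimenting with the Mesa fragment quoted above, it can be wrapped into a standalone function. This is a sketch only: `csd_num_batches` is a name invented here, the `DIV_ROUND_UP` definition is the conventional integer round-up macro rather than a copy of Mesa's, and the constant 16 is carried over from the snippet unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Conventional integer round-up division (assumed, not copied from Mesa). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* num_wgs:    total number of workgroups in the dispatch
 * wg_size:    invocations per workgroup
 * wgs_per_sg: workgroups packed into one supergroup */
static uint32_t csd_num_batches(uint32_t num_wgs, uint32_t wg_size,
                                uint32_t wgs_per_sg)
{
    assert(wgs_per_sg > 0);
    uint32_t batches_per_sg = DIV_ROUND_UP(wgs_per_sg * wg_size, 16);
    uint32_t whole_sgs = num_wgs / wgs_per_sg;
    uint32_t rem_wgs = num_wgs - whole_sgs * wgs_per_sg;
    /* Whole supergroups plus the batches needed by the leftover workgroups. */
    return batches_per_sg * whole_sgs +
           DIV_ROUND_UP(rem_wgs * wg_size, 16);
}
```

For example, a single workgroup of one invocation (the 1x1x1 case) yields one batch, while 64 workgroups of 256 invocations with one workgroup per supergroup yield 1024 batches.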

notogawa · 2025-03-01T22:17:36Z

Yes, I also use the Mesa code only as a reference, and in practice, I determine the best parameters for us through experiments.

The reproducible code is a compute shader with approximately 60,000 instructions, which we cannot publish due to confidentiality. Therefore, I would like to ask a question by giving an example of what is possible with our code. For instance, if you were to execute the entire computation described in this paper [1] within a single submit_csd, how would you configure the CSD parameters?

[1] https://arxiv.org/pdf/1905.02244

pelwell · 2025-03-02T07:10:43Z

Are you saying that you can't create an example of a compute shader large enough to demonstrate the problem (and the code to launch it) that isn't confidential?

notogawa · 2025-03-02T08:08:10Z

Yes. This is obvious from the discussion so far, but it's because "why the hang limit matters" is directly connected to "how to use the QPU(s) for maximum performance" in the type of computations described at #6660 (comment) .

Of course, if it is obvious how to specify workgroups, the number of batches, etc., to fit a computation such as the neural network model architecture in #6660 (comment) within a single submit_csd, then I would appreciate your guidance. In that case, provided that the performance remains comparable, the hang limit might not need to be as long.

mairacanal · 2025-03-02T20:45:57Z

I'm sorry @notogawa but I don't have experience with neural networks, so I don't think I have the background needed to implement a compute shader for the paper you suggested. Also, because of confidentiality, I don't think I can help you configure the GPU in the best way for your case, as it would include knowledge about the GPU's inner workings and configuration.

As a kernel maintainer of the upstream driver, I appreciate that you found the issues related to the GPU reset (to which I sent patches to fix), but I wouldn't be comfortable increasing the timeout for the CSD job (or using a kernel param for it). Our kernel driver ensures that the GPU won't reset if we are making progress with the batches. From my point of view, I see that your application is very specific [1], so I'd recommend you change the hang limit locally. But, I'd appreciate hearing @pelwell and @popcornmix opinions.

[1] And I can't provide useful help without seeing the code or configuration

notogawa · 2025-03-03T00:14:48Z

Thank you for your consideration. That's unfortunate, but I understand.

Personally, I hope that in the future, changes that break compatibility, such as preventing code that previously worked from user space from working, will not be introduced (Pi 4/VC6 is also affected by this issue). This point could potentially influence hardware or OS selection.

If there is no further action to be taken, please close this issue on your end.

pelwell · 2025-03-03T14:36:59Z

If you can come up with something more persuasive than expecting us to understand the implications of a research paper then you might have a chance, but until then it's a no from me.

In addition to the standard reset controller, V3D 7.x requires configuring the V3D_SMS registers for proper power on/off and reset. Add the new registers to `v3d_regs.h` and ensure they are properly configured during device probing, removal, and reset.

This change fixes GPU reset issues on the Raspberry Pi 5 (BCM2712). Without exposing these registers, a GPU reset causes the GPU to hang, stopping any further job execution and freezing the desktop GUI. The same issue occurs when unloading and loading the v3d driver.

Link: raspberrypi#6660
Signed-off-by: Maíra Canal <[email protected]>

mairacanal mentioned this issue Feb 28, 2025

drm/v3d: Fix GPU reset issues on the Raspberry Pi 5 #6692

Merged

pelwell closed this as completed in #6692 Mar 7, 2025

popcornmix mentioned this issue Mar 12, 2025

drm/v3d: Fix GPU reset issues on the Raspberry Pi 5 for 6.14 #6715

Merged

popcornmix mentioned this issue Mar 12, 2025

drm/v3d: Fix GPU reset issues on the Raspberry Pi 5 for 6.12 #6716

Merged
