6.12: drm_open_helper RIP #712

Open
2 tasks
ptr1337 opened this issue Sep 30, 2024 · 7 comments
Labels
bug Something isn't working NV-Triaged An NVBug has been created for dev to investigate

Comments

@ptr1337

ptr1337 commented Sep 30, 2024

NVIDIA Open GPU Kernel Modules Version

ed4be64

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

CachyOS (ArchLinux)

Kernel Release

6.12.0rc1

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4070 SUPER (UUID: GPU-8c5baf85-cb1f-fe26-95d5-ff3fd51249bb)

Describe the bug

Since the 6.12.0-rc1 release, the kernel DRM helper has been crashing with the 560.35.03 driver.

The following patches, extracted from the 550.120 release, were pulled in to make the driver compatible with 6.12:
drm_fbdev fixup for 6.11+: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0004-6.11-Add-fix-for-fbdev.patch
drm_outpull_pill for 6.12: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0005-6.12-drm_outpull_pill-changed-check.patch

An additional patch is needed for the module to compile (the breaking change was introduced in commit torvalds/linux@32f51ea):

diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
                 continue;
             }
 
-            if (PageSwapCache(src_page)) {
+            if (folio_test_swapcache(page_folio(src_page))) {
                 // TODO: Bug 4050579: Remove this when swap cached pages can be
                 // migrated.
                 status = NV_WARN_MISMATCHED_TARGET;

With these patches, DKMS compilation succeeds and the driver works fine with the 6.11.x kernel.

Booting into 6.12.0-rc1 makes the driver crash at drm_open_helper, and there is no graphical interface available anymore. The tty still works fine.
The following is visible in the dmesg log:

[    5.090174] Console: switching to colour frame buffer device 240x67
[    5.090176] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[    5.096243] ------------[ cut here ]------------
[    5.096244] WARNING: CPU: 0 PID: 453 at drivers/gpu/drm/drm_file.c:312 drm_open_helper+0x135/0x150
[    5.096249] Modules linked in: nvidia_uvm(OE) nvidia_drm(OE) drm_ttm_helper btrfs ttm blake2b_generic nvidia_modeset(OE) libcrc32c crc32c_generic xor hid_generic raid6_pq nvme nvme_core crc32c_intel video sha256_ssse3 usbhid nvme_auth wmi nvidia(OE)
[    5.096255] CPU: 0 UID: 0 PID: 453 Comm: plymouthd Tainted: G           OE      6.12.0-rc1-1-cachyos-rc #1 12df37afa12b373ced2670803975698fbda2ce5d
[    5.096257] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[    5.096257] Hardware name: ASRock X670E Pro RS/X670E Pro RS, BIOS 3.08 09/18/2024
[    5.096258] RIP: 0010:drm_open_helper+0x135/0x150
[    5.096259] Code: 5d 41 5c c3 cc cc cc cc 48 89 df e8 c5 82 fe ff 85 c0 0f 84 7a ff ff ff 48 89 df 89 44 24 0c e8 c1 f9 ff ff 8b 44 24 0c eb d1 <0f> 0b b8 ea ff ff ff eb c8 b8 ea ff ff ff eb c1 b8 f0 ff ff ff eb
[    5.096260] RSP: 0018:ffffa643409ffb20 EFLAGS: 00010246
[    5.096261] RAX: ffffffffc15df380 RBX: ffff89f744740f28 RCX: 0000000000000000
[    5.096262] RDX: ffff89f755ee0000 RSI: ffff89f744740f28 RDI: ffff89f74df1cd80
[    5.096262] RBP: ffff89f74df1cd80 R08: 0000000000000006 R09: ffff89f740213cd0
[    5.096263] R10: 00000000000000e2 R11: 0000000000000002 R12: ffff89f75735a000
[    5.096263] R13: ffffffffc15df380 R14: 00000000ffffffed R15: ffffa643409ffe1c
[    5.096264] FS:  00007f6b595ce480(0000) GS:ffff8a065ce00000(0000) knlGS:0000000000000000
[    5.096264] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.096265] CR2: 000055da04c46558 CR3: 000000010d18c000 CR4: 0000000000f50ef0
[    5.096265] PKRU: 55555554
[    5.096266] Call Trace:
[    5.096267]  <TASK>
[    5.096267]  ? drm_open_helper+0x135/0x150
[    5.096268]  ? __warn.cold+0xad/0x116
[    5.096270]  ? drm_open_helper+0x135/0x150
[    5.096272]  ? report_bug+0x127/0x170
[    5.096273]  ? handle_bug+0x58/0x90
[    5.096275]  ? exc_invalid_op+0x1b/0x80
[    5.096276]  ? asm_exc_invalid_op+0x1a/0x20
[    5.096279]  ? drm_open_helper+0x135/0x150
[    5.096279]  drm_open+0x81/0x110
[    5.096280]  drm_stub_open+0xaf/0x100
[    5.096282]  chrdev_open+0xc5/0x260
[    5.096285]  ? __pfx_chrdev_open+0x10/0x10
[    5.096286]  do_dentry_open+0x14b/0x490
[    5.096287]  vfs_open+0x30/0xe0
[    5.096289]  path_openat+0x84d/0x1320
[    5.096290]  ? __alloc_pages_noprof+0x183/0x350
[    5.096292]  do_filp_open+0xd2/0x180
[    5.096293]  do_sys_openat2+0xca/0x100
[    5.096294]  __x64_sys_openat+0x55/0xa0
[    5.096295]  do_syscall_64+0x82/0x190
[    5.096296]  ? handle_mm_fault+0x1d9/0x2e0
[    5.096297]  ? do_user_addr_fault+0x38d/0x6c0
[    5.096299]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[    5.096300] RIP: 0033:0x7f6b59899ae5
[    5.096301] Code: 75 53 89 f0 f7 d0 a9 00 00 41 00 74 48 80 3d d1 b5 0d 00 00 74 6c 45 89 e2 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8f 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[    5.096302] RSP: 002b:00007fffbdc08760 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[    5.096303] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6b59899ae5
[    5.096303] RDX: 0000000000000002 RSI: 000055da04c42a40 RDI: 00000000ffffff9c
[    5.096303] RBP: 000055da04c42a40 R08: 0000000000000000 R09: 0000000000000007
[    5.096304] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[    5.096304] R13: 00007f6b599a1a50 R14: 000000000000000b R15: 000055da04c43e30
[    5.096305]  </TASK>
[    5.096305] ---[ end trace 0000000000000000 ]---
[    5.173332] systemd-journald[355]: Received SIGTERM from PID 1 (systemd).
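
As a reading aid for the log above, here is a minimal sketch that decodes the instruction bytes following the trap marker in the first "Code:" line. On x86-64, `0f 0b` is `ud2` (the trap the kernel uses for warnings) and `b8` followed by four bytes is a move of a 32-bit immediate into eax, so the bytes `b8 ea ff ff ff` suggest what the warning path returns. The helper names are hypothetical and exist only for illustration; the byte values are copied from the log.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Bytes copied from the first "Code:" line, starting at the trapping
 * instruction marked <0f>: 0f 0b (ud2), then b8 ea ff ff ff. */
static const uint8_t code_after_trap[] = {
    0x0f, 0x0b, 0xb8, 0xea, 0xff, 0xff, 0xff
};

/* Decode the little-endian 32-bit immediate of a "b8 imm32" instruction
 * (mov imm32 into eax) and return it as a signed value. */
static int32_t decode_mov_eax_imm32(const uint8_t *insn)
{
    uint32_t imm;
    int32_t value;

    assert(insn[0] == 0xb8); /* opcode for "mov eax, imm32" */
    imm = (uint32_t)insn[1]
        | ((uint32_t)insn[2] << 8)
        | ((uint32_t)insn[3] << 16)
        | ((uint32_t)insn[4] << 24);
    memcpy(&value, &imm, sizeof value); /* reinterpret as signed */
    return value;
}

/* Hypothetical helper: the value loaded into eax right after the ud2. */
static int warn_path_return_value(void)
{
    return decode_mov_eax_imm32(&code_after_trap[2]);
}
```

Under these assumptions, `warn_path_return_value()` yields -22, which is `-EINVAL` on Linux. That would be consistent with the open() from plymouthd failing after the warning, although the log itself does not show the value returned to userspace.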

To Reproduce

  1. Compile the 6.12.0-rc1 kernel
  2. Apply the above-mentioned patches to 560.35.03
  3. Compile the module and boot into the new kernel

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

@ptr1337 ptr1337 added the bug Something isn't working label Sep 30, 2024
@philmmanjaro

What happens if you revert that upstream kernel change? That is what I did before, and it made the drivers compile without additional patches: https://gitlab.manjaro.org/packages/core/linux612/-/blob/ec1f53f77fd3f92f7cd4eeed444a341d8ded3291/revert-nvidia-446d0f48.patch

@mtijanic
Collaborator

mtijanic commented Oct 1, 2024

Thanks! Tracked internally as NV bug 4888621.

@mtijanic mtijanic added the NV-Triaged An NVBug has been created for dev to investigate label Oct 1, 2024
@joanbm

joanbm commented Oct 1, 2024

This may be related to commit 641bb4394f40 ("fs: move FMODE_UNSIGNED_OFFSET to fop_flags"). At least for nvidia-470xx, it's fixed by adding the .fop_flags = FOP_UNSIGNED_OFFSET line from this patch. For me, though, the kernel didn't fully crash; it just failed to detect the adapters correctly.
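
To illustrate why a single `.fop_flags` line could make the difference, here is a simplified userspace model. It assumes the warning at drm_file.c:312 is a check that the driver's file operations carry FOP_UNSIGNED_OFFSET, which the commit title suggests but this thread does not show. All `mock_` names and the flag's bit value are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel flag; the bit value is an assumption. */
#define MOCK_FOP_UNSIGNED_OFFSET (1u << 5)

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct mock_file_operations {
    unsigned int fop_flags;
};

struct mock_file {
    const struct mock_file_operations *f_op;
};

/* Models the assumed check: refuse to open when the driver's file
 * operations do not declare the unsigned-offset flag. */
static int mock_drm_open_helper(const struct mock_file *filp)
{
    assert(filp != NULL && filp->f_op != NULL);
    if (!(filp->f_op->fop_flags & MOCK_FOP_UNSIGNED_OFFSET)) {
        fprintf(stderr, "WARNING: fops lack the unsigned-offset flag\n");
        return -EINVAL;
    }
    return 0;
}

/* A driver built against older headers never sets the flag... */
static const struct mock_file_operations unpatched_fops = {
    .fop_flags = 0,
};

/* ...while the patched driver adds the one-line initializer. */
static const struct mock_file_operations patched_fops = {
    .fop_flags = MOCK_FOP_UNSIGNED_OFFSET,
};
```

In this model, opening a `mock_file` that points at `unpatched_fops` returns `-EINVAL` and prints a warning, while one that points at `patched_fops` returns 0. The real driver change is the equivalent one-line initializer in its DRM `struct file_operations`.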

@ptr1337
Author

ptr1337 commented Oct 2, 2024

@joanbm
It seems this patch does work, and I was able to boot properly into the 6.12 kernel.
One more patch was required for a successful DKMS compilation, due to upstream changes:

diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
                 continue;
             }
 
-            if (PageSwapCache(src_page)) {
+            if (folio_test_swapcache(page_folio(src_page))) {
                 // TODO: Bug 4050579: Remove this when swap cached pages can be
                 // migrated.
                 status = NV_WARN_MISMATCHED_TARGET;

Commit:
CachyOS/CachyOS-PKGBUILDS@3352d04

@Binary-Eater

@es20490446e

es20490446e commented Nov 30, 2024

@Binary-Eater

The bug hit production for me, and the internal laptop display suddenly stopped working.

If I detach the computer from HDMI, then I have no screen.

This means that the bug is critical, as it renders the system unusable.

But more importantly, this exposes a big flaw in how critical bugs are handled and prevented, project-management-wise.

Critical bugs like these need to be visually separated from the rest, for example by using a tag. And while they are present, all efforts shall concentrate on fixing them before coding anything else.

It should not take 2 months to add two lines of code, which someone else already wrote for you, to fix a critical bug.

@ionenwks

ionenwks commented Nov 30, 2024

The bug hit production for me, and the internal laptop display suddenly stopped working.

If you are referring to that patch, a slightly different version of it is already included in the production branch of the drivers, namely 550.135, released November 20. Beta drivers like 565.57.01 are, well, betas and may not see immediate fixes. I would also generally avoid brand-new kernel branches, to give the driver time to catch up, unless you are OK with being the tester; ideally, use long-term-support branches.
