Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Programs compiled with "-fsanitize=address" crashes with: "Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING." #69

Open
Stephen-Seo opened this issue May 12, 2022 · 4 comments
Labels
upstream Bug is caused by upstream problems

Comments

@Stephen-Seo
Copy link

Stephen-Seo commented May 12, 2022

Bug can be reproduced by building the AUR package blender-git with the following patch applied to its PKGBUILD (to make it build in Debug mode):
https://gist.github.com/Stephen-Seo/88b7453a47e38f7cecf26dfde43ecf58

The following appears in the build log at error:
https://gist.github.com/Stephen-Seo/897d12b778ab44c304ae535a23635871

The bug can be reproduced without rebuilding by going into "$srcdir/build/bin/" and running the "datatoc" binary with no arguments within that directory. The source of this file is located at "$srcdir/blender/source/blender/datatoc/datattoc.c".

The datatoc binary builds and runs fine on stock "linux" kernel, but not on "linux-hardened", even if the binary was built when booted into the stock "linux" kernel.

EDIT: The relevant entry in "compile_commands.json" generated by cmake is:

"command": "/usr/bin/cc -DDEBUG -DWITH_ASSERT_ABORT -DWITH_OPENGL -D_DEBUG -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -D_LARGEFILE_SOURCE -D__LITTLE_ENDIAN__ -D__MMX__ -D__SSE2__ -D__SSE__  -Wall -Werror=implicit-function-declaration -Werror=return-type -Werror=vla -Wstrict-prototypes -Wmissing-prototypes -Wno-char-subscripts -Wno-unknown-pragmas -Wpointer-arith -Wunused-parameter -Wwrite-strings -Wlogical-op -Wundef -Winit-self -Wmissing-include-dirs -Wno-div-by-zero -Wtype-limits -Wformat-signedness -Wrestrict -Wnonnull -Wabsolute-value -Wuninitialized -Wredundant-decls -Wshadow -Wno-error=unused-but-set-variable -Wimplicit-fallthrough=5  -fopenmp -std=gnu11 -msse -pipe -fPIC -funsigned-char -fno-strict-aliasing -ffp-contract=off -msse2 -fmacro-prefix-map=\"/home/stephen/Git/blender-git/blender/\"=\"\" -fmacro-prefix-map=\"/home/stephen/Git/blender-git/build/\"=\"\" -g -fsanitize=address -fsanitize=bool -fsanitize=bounds -fsanitize=enum -fsanitize=float-cast-overflow -fsanitize=float-divide-by-zero -fsanitize=nonnull-attribute -fsanitize=returns-nonnull-attribute -fsanitize=signed-integer-overflow -fsanitize=undefined -fsanitize=vla-bound -fno-sanitize=alignment  -fsanitize=leak -fsanitize=object-size -o source/blender/datatoc/CMakeFiles/datatoc.dir/datatoc.c.o -c /home/stephen/Git/blender-git/blender/source/blender/datatoc/datatoc.c",

(one may have to change some directory paths to get this to work. Also, this program appears to have no dependencies, so one should be able to compile it directly into an executable instead of an object file.)

@tsautereau-anssi
Copy link

Looking at the build log you provided, I guess it's caused by our commit Revert "mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes".

@Stephen-Seo
Copy link
Author

I tested a basically no-op program:

int main() {
    return 0;
}

If compiled with gcc -fsanitize=address -o test noop_program.c and run, then it will crash. Without the -fsanitize=address it runs fine.

==4336==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==4336==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==4336==This might be related to ELF_ET_DYN_BASE change in Linux 4.12.
==4336==See https://github.com/google/sanitizers/issues/856 for possible workarounds.
==4336==Process memory map follows:
	0x045824f83000-0x045824f84000	/home/stephen/temp/test
	0x045824f84000-0x045824f85000	/home/stephen/temp/test
	0x045824f85000-0x045824f86000	/home/stephen/temp/test
	0x045824f86000-0x045824f87000	/home/stephen/temp/test
	0x045824f87000-0x045824f88000	/home/stephen/temp/test
	0x6d57a2700000-0x6d57a2800000	
	0x6d57a2900000-0x6d57a2a00000	
	0x6d57a2b00000-0x6d57a2c00000	
	0x6d57a2d00000-0x6d57a2e00000	
	0x6d57a2f00000-0x6d57a3000000	
	0x6d57a3012000-0x6d57a33a5000	
	0x6d57a33a5000-0x6d57a33a8000	/usr/lib/libgcc_s.so.1
	0x6d57a33a8000-0x6d57a33bf000	/usr/lib/libgcc_s.so.1
	0x6d57a33bf000-0x6d57a33c3000	/usr/lib/libgcc_s.so.1
	0x6d57a33c3000-0x6d57a33c4000	/usr/lib/libgcc_s.so.1
	0x6d57a33c4000-0x6d57a33c5000	/usr/lib/libgcc_s.so.1
	0x6d57a33c5000-0x6d57a33d5000	/usr/lib/libm.so.6
	0x6d57a33d5000-0x6d57a344e000	/usr/lib/libm.so.6
	0x6d57a344e000-0x6d57a34aa000	/usr/lib/libm.so.6
	0x6d57a34aa000-0x6d57a34ab000	/usr/lib/libm.so.6
	0x6d57a34ab000-0x6d57a34ac000	/usr/lib/libm.so.6
	0x6d57a34ac000-0x6d57a3546000	/usr/lib/libstdc++.so.6.0.30
	0x6d57a3546000-0x6d57a365d000	/usr/lib/libstdc++.so.6.0.30
	0x6d57a365d000-0x6d57a36d4000	/usr/lib/libstdc++.so.6.0.30
	0x6d57a36d4000-0x6d57a36e1000	/usr/lib/libstdc++.so.6.0.30
	0x6d57a36e1000-0x6d57a36e2000	/usr/lib/libstdc++.so.6.0.30
	0x6d57a36e2000-0x6d57a36e5000	
	0x6d57a36e5000-0x6d57a3711000	/usr/lib/libc.so.6
	0x6d57a3711000-0x6d57a3889000	/usr/lib/libc.so.6
	0x6d57a3889000-0x6d57a38dd000	/usr/lib/libc.so.6
	0x6d57a38dd000-0x6d57a38de000	/usr/lib/libc.so.6
	0x6d57a38de000-0x6d57a38e1000	/usr/lib/libc.so.6
	0x6d57a38e1000-0x6d57a38e4000	/usr/lib/libc.so.6
	0x6d57a38e4000-0x6d57a38f1000	
	0x6d57a38f1000-0x6d57a3915000	/usr/lib/libasan.so.8.0.0
	0x6d57a3915000-0x6d57a39fe000	/usr/lib/libasan.so.8.0.0
	0x6d57a39fe000-0x6d57a3a33000	/usr/lib/libasan.so.8.0.0
	0x6d57a3a33000-0x6d57a3a37000	/usr/lib/libasan.so.8.0.0
	0x6d57a3a37000-0x6d57a3a3a000	/usr/lib/libasan.so.8.0.0
	0x6d57a3a3a000-0x6d57a3fa0000	
	0x6d57a3fbc000-0x6d57a3fd7000	
	0x6d57a3fd7000-0x6d57a3fdb000	[vvar]
	0x6d57a3fdb000-0x6d57a3fdd000	[vdso]
	0x6d57a3fdd000-0x6d57a3fdf000	/usr/lib/ld-linux-x86-64.so.2
	0x6d57a3fdf000-0x6d57a4006000	/usr/lib/ld-linux-x86-64.so.2
	0x6d57a4006000-0x6d57a4011000	/usr/lib/ld-linux-x86-64.so.2
	0x6d57a4011000-0x6d57a4012000	
	0x6d57a4012000-0x6d57a4014000	/usr/lib/ld-linux-x86-64.so.2
	0x6d57a4014000-0x6d57a4016000	/usr/lib/ld-linux-x86-64.so.2
	0x716d528bc000-0x716d528de000	[stack]
==4336==End of process memory map.

@Stephen-Seo Stephen-Seo changed the title Binary artifact "datatoc" from Blender's Debug build fails to run: "Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING." Programs compiled with "-fsanitize=address": "Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING." May 13, 2022
@Stephen-Seo Stephen-Seo changed the title Programs compiled with "-fsanitize=address": "Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING." Programs compiled with "-fsanitize=address" crashes with: "Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING." May 13, 2022
@Bernhard40
Copy link

Bernhard40 commented May 13, 2022

I think this may be expected as the issues with asan were the reason behind reverting the patch.

This is mentioned in revert description

There is a tracking issue on KSPP about working around the problem but no progress so far.

@anthraxx anthraxx added the upstream Bug is caused by upstream problems label Jun 1, 2022
@anthraxx
Copy link
Owner

anthraxx commented Jun 1, 2022

@Bernhard40 thanks for filling in the info here.

The mentioned revert and the reason related to ASan expectations/limitations is indeed the culprit why this doesn't work with linux-hardened. This unfortunately is expected until upstream finds a solution to this issue.

anthraxx pushed a commit that referenced this issue Oct 24, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 25, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 26, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 26, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 26, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 26, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Oct 26, 2022
…te()

[ Upstream commit f020d95 ]

When peer delete failed in a disconnect operation, use-after-free
detected by KFENCE in below log. It is because for each vdev_id and
address, it has only one struct ath10k_peer, it is allocated in
ath10k_peer_map_event(). When connected to an AP, it has more than
one HTT_T2H_MSG_TYPE_PEER_MAP reported from firmware, then the
array peer_map of struct ath10k will be set muti-elements to the
same ath10k_peer in ath10k_peer_map_event(). When peer delete failed
in ath10k_sta_state(), the ath10k_peer will be free for the 1st peer
id in array peer_map of struct ath10k, and then use-after-free happened
for the 2nd peer id because they map to the same ath10k_peer.

And clean up all peers in array peer_map for the ath10k_peer, then
user-after-free disappeared

peer map event log:
[  306.911021] wlan0: authenticate with b0:2a:43:e6:75:0e
[  306.957187] ath10k_pci 0000:01:00.0: mac vdev 0 peer create b0:2a:43:e6:75:0e (new sta) sta 1 / 32 peer 1 / 33
[  306.957395] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 246
[  306.957404] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 198
[  306.986924] ath10k_pci 0000:01:00.0: htt peer map vdev 0 peer b0:2a:43:e6:75:0e id 166

peer unmap event log:
[  435.715691] wlan0: deauthenticating from b0:2a:43:e6:75:0e by local choice (Reason: 3=DEAUTH_LEAVING)
[  435.716802] ath10k_pci 0000:01:00.0: mac vdev 0 peer delete b0:2a:43:e6:75:0e sta ffff990e0e9c2b50 (sta gone)
[  435.717177] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 246
[  435.717186] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 198
[  435.717193] ath10k_pci 0000:01:00.0: htt peer unmap vdev 0 peer b0:2a:43:e6:75:0e id 166

use-after-free log:
[21705.888627] wlan0: deauthenticating from d0:76:8f:82:be:75 by local choice (Reason: 3=DEAUTH_LEAVING)
[21713.799910] ath10k_pci 0000:01:00.0: failed to delete peer d0:76:8f:82:be:75 for vdev 0: -110
[21713.799925] ath10k_pci 0000:01:00.0: found sta peer d0:76:8f:82:be:75 (ptr 0000000000000000 id 102) entry on vdev 0 after it was supposedly removed
[21713.799968] ==================================================================
[21713.799991] BUG: KFENCE: use-after-free read in ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.799991]
[21713.799997] Use-after-free read at 0x00000000abe1c75e (in kfence-#69):
[21713.800010]  ath10k_sta_state+0x265/0xb8a [ath10k_core]
[21713.800041]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800059]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800076]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800093]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800110]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800137]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800153]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800161]  genl_rcv_msg+0x38e/0x3be
[21713.800166]  netlink_rcv_skb+0x89/0xf7
[21713.800171]  genl_rcv+0x28/0x36
[21713.800176]  netlink_unicast+0x179/0x24b
[21713.800181]  netlink_sendmsg+0x3a0/0x40e
[21713.800187]  sock_sendmsg+0x72/0x76
[21713.800192]  ____sys_sendmsg+0x16d/0x1e3
[21713.800196]  ___sys_sendmsg+0x95/0xd1
[21713.800200]  __sys_sendmsg+0x85/0xbf
[21713.800205]  do_syscall_64+0x43/0x55
[21713.800210]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[21713.800213]
[21713.800219] kfence-#69: 0x000000009149b0d5-0x000000004c0697fb, size=1064, cache=kmalloc-2k
[21713.800219]
[21713.800224] allocated by task 13 on cpu 0 at 21705.501373s:
[21713.800241]  ath10k_peer_map_event+0x7e/0x154 [ath10k_core]
[21713.800254]  ath10k_htt_t2h_msg_handler+0x586/0x1039 [ath10k_core]
[21713.800265]  ath10k_htt_htc_t2h_msg_handler+0x12/0x28 [ath10k_core]
[21713.800277]  ath10k_htc_rx_completion_handler+0x14c/0x1b5 [ath10k_core]
[21713.800283]  ath10k_pci_process_rx_cb+0x195/0x1df [ath10k_pci]
[21713.800294]  ath10k_ce_per_engine_service+0x55/0x74 [ath10k_core]
[21713.800305]  ath10k_ce_per_engine_service_any+0x76/0x84 [ath10k_core]
[21713.800310]  ath10k_pci_napi_poll+0x49/0x144 [ath10k_pci]
[21713.800316]  net_rx_action+0xdc/0x361
[21713.800320]  __do_softirq+0x163/0x29a
[21713.800325]  asm_call_irq_on_stack+0x12/0x20
[21713.800331]  do_softirq_own_stack+0x3c/0x48
[21713.800337]  __irq_exit_rcu+0x9b/0x9d
[21713.800342]  common_interrupt+0xc9/0x14d
[21713.800346]  asm_common_interrupt+0x1e/0x40
[21713.800351]  ksoftirqd_should_run+0x5/0x16
[21713.800357]  smpboot_thread_fn+0x148/0x211
[21713.800362]  kthread+0x150/0x15f
[21713.800367]  ret_from_fork+0x22/0x30
[21713.800370]
[21713.800374] freed by task 708 on cpu 1 at 21713.799953s:
[21713.800498]  ath10k_sta_state+0x2c6/0xb8a [ath10k_core]
[21713.800515]  drv_sta_state+0x115/0x677 [mac80211]
[21713.800532]  __sta_info_destroy_part2+0xb1/0x133 [mac80211]
[21713.800548]  __sta_info_flush+0x11d/0x162 [mac80211]
[21713.800565]  ieee80211_set_disassoc+0x12d/0x2f4 [mac80211]
[21713.800581]  ieee80211_mgd_deauth+0x26c/0x29b [mac80211]
[21713.800598]  cfg80211_mlme_deauth+0x13f/0x1bb [cfg80211]
[21713.800614]  nl80211_deauthenticate+0xf8/0x121 [cfg80211]
[21713.800619]  genl_rcv_msg+0x38e/0x3be
[21713.800623]  netlink_rcv_skb+0x89/0xf7
[21713.800628]  genl_rcv+0x28/0x36
[21713.800632]  netlink_unicast+0x179/0x24b
[21713.800637]  netlink_sendmsg+0x3a0/0x40e
[21713.800642]  sock_sendmsg+0x72/0x76
[21713.800646]  ____sys_sendmsg+0x16d/0x1e3
[21713.800651]  ___sys_sendmsg+0x95/0xd1
[21713.800655]  __sys_sendmsg+0x85/0xbf
[21713.800659]  do_syscall_64+0x43/0x55
[21713.800663]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00288-QCARMSWPZ-1

Fixes: d0eeafa ("ath10k: Clean up peer when sta goes away.")
Signed-off-by: Wen Gong <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Aug 23, 2024
[ Upstream commit 7015843 ]

Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/[email protected]/

Reported-by: Alexei Starovoitov <[email protected]>
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]
Signed-off-by: Sasha Levin <[email protected]>
anthraxx pushed a commit that referenced this issue Nov 10, 2024
commit 409dc51 upstream.

The flexspi on imx8ulp only has 16 LUTs, and imx8mm flexspi has
32 LUTs, so correct the compatible string here, otherwise will
meet below error:

[    1.119072] ------------[ cut here ]------------
[    1.123926] WARNING: CPU: 0 PID: 1 at drivers/spi/spi-nxp-fspi.c:855 nxp_fspi_exec_op+0xb04/0xb64
[    1.133239] Modules linked in:
[    1.136448] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc6-next-20240902-00001-g131bf9439dd9 #69
[    1.146821] Hardware name: NXP i.MX8ULP EVK (DT)
[    1.151647] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.158931] pc : nxp_fspi_exec_op+0xb04/0xb64
[    1.163496] lr : nxp_fspi_exec_op+0xa34/0xb64
[    1.168060] sp : ffff80008002b2a0
[    1.171526] x29: ffff80008002b2d0 x28: 0000000000000000 x27: 0000000000000000
[    1.179002] x26: ffff2eb645542580 x25: ffff800080610014 x24: ffff800080610000
[    1.186480] x23: ffff2eb645548080 x22: 0000000000000006 x21: ffff2eb6455425e0
[    1.193956] x20: 0000000000000000 x19: ffff80008002b5e0 x18: ffffffffffffffff
[    1.201432] x17: ffff2eb644467508 x16: 0000000000000138 x15: 0000000000000002
[    1.208907] x14: 0000000000000000 x13: ffff2eb6400d8080 x12: 00000000ffffff00
[    1.216378] x11: 0000000000000000 x10: ffff2eb6400d8080 x9 : ffff2eb697adca80
[    1.223850] x8 : ffff2eb697ad3cc0 x7 : 0000000100000000 x6 : 0000000000000001
[    1.231324] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000000007a6
[    1.238795] x2 : 0000000000000000 x1 : 00000000000001ce x0 : 00000000ffffff92
[    1.246267] Call trace:
[    1.248824]  nxp_fspi_exec_op+0xb04/0xb64
[    1.253031]  spi_mem_exec_op+0x3a0/0x430
[    1.257139]  spi_nor_read_id+0x80/0xcc
[    1.261065]  spi_nor_scan+0x1ec/0xf10
[    1.264901]  spi_nor_probe+0x108/0x2fc
[    1.268828]  spi_mem_probe+0x6c/0xbc
[    1.272574]  spi_probe+0x84/0xe4
[    1.275958]  really_probe+0xbc/0x29c
[    1.279713]  __driver_probe_device+0x78/0x12c
[    1.284277]  driver_probe_device+0xd8/0x15c
[    1.288660]  __device_attach_driver+0xb8/0x134
[    1.293316]  bus_for_each_drv+0x88/0xe8
[    1.297337]  __device_attach+0xa0/0x190
[    1.301353]  device_initial_probe+0x14/0x20
[    1.305734]  bus_probe_device+0xac/0xb0
[    1.309752]  device_add+0x5d0/0x790
[    1.313408]  __spi_add_device+0x134/0x204
[    1.317606]  of_register_spi_device+0x3b4/0x590
[    1.322348]  spi_register_controller+0x47c/0x754
[    1.327181]  devm_spi_register_controller+0x4c/0xa4
[    1.332289]  nxp_fspi_probe+0x1cc/0x2b0
[    1.336307]  platform_probe+0x68/0xc4
[    1.340145]  really_probe+0xbc/0x29c
[    1.343893]  __driver_probe_device+0x78/0x12c
[    1.348457]  driver_probe_device+0xd8/0x15c
[    1.352838]  __driver_attach+0x90/0x19c
[    1.356857]  bus_for_each_dev+0x7c/0xdc
[    1.360877]  driver_attach+0x24/0x30
[    1.364624]  bus_add_driver+0xe4/0x208
[    1.368552]  driver_register+0x5c/0x124
[    1.372573]  __platform_driver_register+0x28/0x34
[    1.377497]  nxp_fspi_driver_init+0x1c/0x28
[    1.381888]  do_one_initcall+0x80/0x1c8
[    1.385908]  kernel_init_freeable+0x1c4/0x28c
[    1.390472]  kernel_init+0x20/0x1d8
[    1.394138]  ret_from_fork+0x10/0x20
[    1.397885] ---[ end trace 0000000000000000 ]---
[    1.407908] ------------[ cut here ]------------

Fixes: ef89fd5 ("arm64: dts: imx8ulp: add flexspi node")
Cc: [email protected]
Signed-off-by: Haibo Chen <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
anthraxx pushed a commit that referenced this issue Nov 10, 2024
commit 409dc51 upstream.

The flexspi on imx8ulp only has 16 LUTs, and imx8mm flexspi has
32 LUTs, so correct the compatible string here, otherwise will
meet below error:

[    1.119072] ------------[ cut here ]------------
[    1.123926] WARNING: CPU: 0 PID: 1 at drivers/spi/spi-nxp-fspi.c:855 nxp_fspi_exec_op+0xb04/0xb64
[    1.133239] Modules linked in:
[    1.136448] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc6-next-20240902-00001-g131bf9439dd9 #69
[    1.146821] Hardware name: NXP i.MX8ULP EVK (DT)
[    1.151647] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.158931] pc : nxp_fspi_exec_op+0xb04/0xb64
[    1.163496] lr : nxp_fspi_exec_op+0xa34/0xb64
[    1.168060] sp : ffff80008002b2a0
[    1.171526] x29: ffff80008002b2d0 x28: 0000000000000000 x27: 0000000000000000
[    1.179002] x26: ffff2eb645542580 x25: ffff800080610014 x24: ffff800080610000
[    1.186480] x23: ffff2eb645548080 x22: 0000000000000006 x21: ffff2eb6455425e0
[    1.193956] x20: 0000000000000000 x19: ffff80008002b5e0 x18: ffffffffffffffff
[    1.201432] x17: ffff2eb644467508 x16: 0000000000000138 x15: 0000000000000002
[    1.208907] x14: 0000000000000000 x13: ffff2eb6400d8080 x12: 00000000ffffff00
[    1.216378] x11: 0000000000000000 x10: ffff2eb6400d8080 x9 : ffff2eb697adca80
[    1.223850] x8 : ffff2eb697ad3cc0 x7 : 0000000100000000 x6 : 0000000000000001
[    1.231324] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000000007a6
[    1.238795] x2 : 0000000000000000 x1 : 00000000000001ce x0 : 00000000ffffff92
[    1.246267] Call trace:
[    1.248824]  nxp_fspi_exec_op+0xb04/0xb64
[    1.253031]  spi_mem_exec_op+0x3a0/0x430
[    1.257139]  spi_nor_read_id+0x80/0xcc
[    1.261065]  spi_nor_scan+0x1ec/0xf10
[    1.264901]  spi_nor_probe+0x108/0x2fc
[    1.268828]  spi_mem_probe+0x6c/0xbc
[    1.272574]  spi_probe+0x84/0xe4
[    1.275958]  really_probe+0xbc/0x29c
[    1.279713]  __driver_probe_device+0x78/0x12c
[    1.284277]  driver_probe_device+0xd8/0x15c
[    1.288660]  __device_attach_driver+0xb8/0x134
[    1.293316]  bus_for_each_drv+0x88/0xe8
[    1.297337]  __device_attach+0xa0/0x190
[    1.301353]  device_initial_probe+0x14/0x20
[    1.305734]  bus_probe_device+0xac/0xb0
[    1.309752]  device_add+0x5d0/0x790
[    1.313408]  __spi_add_device+0x134/0x204
[    1.317606]  of_register_spi_device+0x3b4/0x590
[    1.322348]  spi_register_controller+0x47c/0x754
[    1.327181]  devm_spi_register_controller+0x4c/0xa4
[    1.332289]  nxp_fspi_probe+0x1cc/0x2b0
[    1.336307]  platform_probe+0x68/0xc4
[    1.340145]  really_probe+0xbc/0x29c
[    1.343893]  __driver_probe_device+0x78/0x12c
[    1.348457]  driver_probe_device+0xd8/0x15c
[    1.352838]  __driver_attach+0x90/0x19c
[    1.356857]  bus_for_each_dev+0x7c/0xdc
[    1.360877]  driver_attach+0x24/0x30
[    1.364624]  bus_add_driver+0xe4/0x208
[    1.368552]  driver_register+0x5c/0x124
[    1.372573]  __platform_driver_register+0x28/0x34
[    1.377497]  nxp_fspi_driver_init+0x1c/0x28
[    1.381888]  do_one_initcall+0x80/0x1c8
[    1.385908]  kernel_init_freeable+0x1c4/0x28c
[    1.390472]  kernel_init+0x20/0x1d8
[    1.394138]  ret_from_fork+0x10/0x20
[    1.397885] ---[ end trace 0000000000000000 ]---
[    1.407908] ------------[ cut here ]------------

Fixes: ef89fd5 ("arm64: dts: imx8ulp: add flexspi node")
Cc: [email protected]
Signed-off-by: Haibo Chen <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
anthraxx pushed a commit that referenced this issue Nov 25, 2024
The flexspi on imx8ulp only has 16 LUTs, and imx8mm flexspi has
32 LUTs, so correct the compatible string here, otherwise will
meet below error:

[    1.119072] ------------[ cut here ]------------
[    1.123926] WARNING: CPU: 0 PID: 1 at drivers/spi/spi-nxp-fspi.c:855 nxp_fspi_exec_op+0xb04/0xb64
[    1.133239] Modules linked in:
[    1.136448] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-rc6-next-20240902-00001-g131bf9439dd9 #69
[    1.146821] Hardware name: NXP i.MX8ULP EVK (DT)
[    1.151647] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    1.158931] pc : nxp_fspi_exec_op+0xb04/0xb64
[    1.163496] lr : nxp_fspi_exec_op+0xa34/0xb64
[    1.168060] sp : ffff80008002b2a0
[    1.171526] x29: ffff80008002b2d0 x28: 0000000000000000 x27: 0000000000000000
[    1.179002] x26: ffff2eb645542580 x25: ffff800080610014 x24: ffff800080610000
[    1.186480] x23: ffff2eb645548080 x22: 0000000000000006 x21: ffff2eb6455425e0
[    1.193956] x20: 0000000000000000 x19: ffff80008002b5e0 x18: ffffffffffffffff
[    1.201432] x17: ffff2eb644467508 x16: 0000000000000138 x15: 0000000000000002
[    1.208907] x14: 0000000000000000 x13: ffff2eb6400d8080 x12: 00000000ffffff00
[    1.216378] x11: 0000000000000000 x10: ffff2eb6400d8080 x9 : ffff2eb697adca80
[    1.223850] x8 : ffff2eb697ad3cc0 x7 : 0000000100000000 x6 : 0000000000000001
[    1.231324] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000000007a6
[    1.238795] x2 : 0000000000000000 x1 : 00000000000001ce x0 : 00000000ffffff92
[    1.246267] Call trace:
[    1.248824]  nxp_fspi_exec_op+0xb04/0xb64
[    1.253031]  spi_mem_exec_op+0x3a0/0x430
[    1.257139]  spi_nor_read_id+0x80/0xcc
[    1.261065]  spi_nor_scan+0x1ec/0xf10
[    1.264901]  spi_nor_probe+0x108/0x2fc
[    1.268828]  spi_mem_probe+0x6c/0xbc
[    1.272574]  spi_probe+0x84/0xe4
[    1.275958]  really_probe+0xbc/0x29c
[    1.279713]  __driver_probe_device+0x78/0x12c
[    1.284277]  driver_probe_device+0xd8/0x15c
[    1.288660]  __device_attach_driver+0xb8/0x134
[    1.293316]  bus_for_each_drv+0x88/0xe8
[    1.297337]  __device_attach+0xa0/0x190
[    1.301353]  device_initial_probe+0x14/0x20
[    1.305734]  bus_probe_device+0xac/0xb0
[    1.309752]  device_add+0x5d0/0x790
[    1.313408]  __spi_add_device+0x134/0x204
[    1.317606]  of_register_spi_device+0x3b4/0x590
[    1.322348]  spi_register_controller+0x47c/0x754
[    1.327181]  devm_spi_register_controller+0x4c/0xa4
[    1.332289]  nxp_fspi_probe+0x1cc/0x2b0
[    1.336307]  platform_probe+0x68/0xc4
[    1.340145]  really_probe+0xbc/0x29c
[    1.343893]  __driver_probe_device+0x78/0x12c
[    1.348457]  driver_probe_device+0xd8/0x15c
[    1.352838]  __driver_attach+0x90/0x19c
[    1.356857]  bus_for_each_dev+0x7c/0xdc
[    1.360877]  driver_attach+0x24/0x30
[    1.364624]  bus_add_driver+0xe4/0x208
[    1.368552]  driver_register+0x5c/0x124
[    1.372573]  __platform_driver_register+0x28/0x34
[    1.377497]  nxp_fspi_driver_init+0x1c/0x28
[    1.381888]  do_one_initcall+0x80/0x1c8
[    1.385908]  kernel_init_freeable+0x1c4/0x28c
[    1.390472]  kernel_init+0x20/0x1d8
[    1.394138]  ret_from_fork+0x10/0x20
[    1.397885] ---[ end trace 0000000000000000 ]---
[    1.407908] ------------[ cut here ]------------

Fixes: ef89fd5 ("arm64: dts: imx8ulp: add flexspi node")
Cc: [email protected]
Signed-off-by: Haibo Chen <[email protected]>
Signed-off-by: Shawn Guo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream Bug is caused by upstream problems
Projects
None yet
Development

No branches or pull requests

4 participants