Tagged frames can exceed standard Ethernet mtu of 1500 #1

rrendec · 2013-01-08T20:42:23Z

It is usual for untagged interfaces to receive frames of exactly 1500 (Ethernet maximum mtu) bytes data. However, when sent out on a tagged interface, these frames will be 1504 long and therefore will exceed the maximum Ethernet mtu.

When setting interfaces to "switchport mode trunk", LiSA should automagically set the mtu to 1504 on the device to avoid trouble. Unfortunately, many devices don't support "Jumbo Frames" (frames larger than standard Ethernet) in hardware and issue an error on set_mtu() if mtu is larger than 1500. In such case, LiSA should issue a warning when setting the port to trunk mode and perhaps advise the user to change all untagged interfaces mtu to 1496.

Case study: Broadcom Tg3 driver (drivers/net/tg3.c)

#define TG3_MAX_MTU(tp) \
    ((tp->tg3_flags2 & TG3_FLG2_JUMBO_CAPABLE) ? 9000 : 1500)

static int tg3_change_mtu(struct net_device *dev, int new_mtu)
{
    struct tg3 *tp = netdev_priv(dev);
    int err;

    if (new_mtu < TG3_MIN_MTU || new_mtu > TG3_MAX_MTU(tp))
        return -EINVAL;

Commit 84c1754 (ext4: move work from io_end to inode) triggered a regression when running xfstest #270 when the file system is mounted with dioread_nolock. The problem is that after ext4_evict_inode() calls ext4_ioend_wait(), this guarantees that last io_end structure has been freed, but it does not guarantee that the workqueue structure, which was moved into the inode by commit 84c1754, is actually finished. Once ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this will allow ext4_ioend_wait() to return on CPU #2, at which point the evict_inode() codepath can race against the workqueue code on CPU #1 accessing EXT4_I(inode)->i_unwritten_work to find the next item of work to do. Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which will be renamed ext4_ioend_shutdown(), since it is only used by ext4_evict_inode(). Also, move the call to ext4_ioend_shutdown() until after truncate_inode_pages() and filemap_write_and_wait() are called, to make sure all dirty pages have been written back and flushed from the page cache first. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e *pdpt = 0000000030bc3001 *pde = 0000000000000000 Oops: 0000 [#1] SMP DEBUG_PAGEALLOC Modules linked in: Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c1754-dirty #91 Bochs Bochs EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0 EIP is at cwq_activate_delayed_work+0x3b/0x7e EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000 ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: ffff0ff0 DR7: 00000400 Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000) Stack: f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0 f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000 Call Trace: [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb [<c01def1d>] process_one_work+0x5d8/0x637 [<c040a10b>] ? ext4_end_bio+0x300/0x300 [<c01e3105>] worker_thread+0x249/0x3ef [<c01ea317>] kthread+0xd8/0xeb [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb [<c023a370>] ? trace_hardirqs_on+0x27/0x37 [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28 [<c01ea23f>] ? __init_kthread_worker+0x71/0x71 Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1 6c c1 00 31 c9 83 EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84 CR2: 0000000000000000 ---[ end trace a1923229da53d8a4 ]--- Signed-off-by: "Theodore Ts'o" <[email protected]> Cc: Jan Kara <[email protected]>

With git commit c705c78 "acpi: Export the acpi_processor_get_performance_info" we are now using a different mechanism to access the P-states. The acpi_processor per-cpu structure is set and filtered by the core ACPI code which shrinks the per_cpu contents to only online CPUs. In the past we would call acpi_processor_register_performance() which would have not tried to dereference offline cpus. With the new patch and the fact that the loop we take is for for_all_possible_cpus we end up crashing on some machines. We could modify the loop to be for online_cpus - but all the other loops in the code use possible_cpus (for a good reason) - so lets leave it as so and just check if per_cpu(processor) is NULL. With this patch we will bypass the !online but possible CPUs. This fixes: IP: [<ffffffffa00d13b5>] xen_acpi_processor_init+0x1b6/0xe01 [xen_acpi_processor] PGD 4126e6067 PUD 4126e3067 PMD 0 Oops: 0002 [#1] SMP Pid: 432, comm: modprobe Not tainted 3.9.0-rc3+ #28 To be filled by O.E.M. To be filled by O.E.M./M5A97 RIP: e030:[<ffffffffa00d13b5>] [<ffffffffa00d13b5>] xen_acpi_processor_init+0x1b6/0xe01 [xen_acpi_processor] RSP: e02b:ffff88040c8a3ce8 EFLAGS: 00010282 .. snip.. Call Trace: [<ffffffffa00d11ff>] ? read_acpi_id+0x12b/0x12b [xen_acpi_processor] [<ffffffff8100215a>] do_one_initcall+0x12a/0x180 [<ffffffff810c42c3>] load_module+0x1cd3/0x2870 [<ffffffff81319b70>] ? ddebug_proc_open+0xc0/0xc0 [<ffffffff810c4f37>] sys_init_module+0xd7/0x120 [<ffffffff8166ce19>] system_call_fastpath+0x16/0x1b on some machines. Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>

Booting with 32 TBytes memory hits BUG at mm/page_alloc.c:552! (output below). The key hint is "page 4294967296 outside zone". 4294967296 = 0x100000000 (bit 32 is set). The problem is in include/linux/mmzone.h: 530 static inline unsigned zone_end_pfn(const struct zone *zone) 531 { 532 return zone->zone_start_pfn + zone->spanned_pages; 533 } zone_end_pfn is "unsigned" (32 bits). Changing it to "unsigned long" (64 bits) fixes the problem. zone_end_pfn() was added recently in commit 108bcc9 ("mm: add & use zone_end_pfn() and zone_spans_pfn()") Output from the failure. No AGP bridge found page 4294967296 outside zone [ 4294967296 - 4327469056 ] ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:552! invalid opcode: 0000 [#1] SMP Modules linked in: CPU 0 Pid: 0, comm: swapper Not tainted 3.9.0-rc2.dtp+ #10 RIP: free_one_page+0x382/0x430 Process swapper (pid: 0, threadinfo ffffffff81942000, task ffffffff81955420) Call Trace: __free_pages_ok+0x96/0xb0 __free_pages+0x25/0x50 __free_pages_bootmem+0x8a/0x8c __free_memory_core+0xea/0x131 free_low_memory_core_early+0x4a/0x98 free_all_bootmem+0x45/0x47 mem_init+0x7b/0x14c start_kernel+0x216/0x433 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x144/0x153 Code: 89 f1 ba 01 00 00 00 31 f6 d3 e2 4c 89 ef e8 66 a4 01 00 e9 2c fe ff ff 0f 0b eb fe 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 eb f3 <0f> 0b eb fe 0f 0b 0f 1f 84 00 00 00 00 00 eb f6 0f 0b eb fe 49 Signed-off-by: Russ Anderson <[email protected]> Reported-by: George Beshers <[email protected]> Acked-by: Hedi Berriche <[email protected]> Cc: Cody P Schafer <[email protected]> Cc: Michal Hocko <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

We can deadlock (s_active and fcoe_config_mutex) if a port is being destroyed at the same time one is being created. [ 4200.503113] ====================================================== [ 4200.503114] [ INFO: possible circular locking dependency detected ] [ 4200.503116] 3.8.0-rc5+ #8 Not tainted [ 4200.503117] ------------------------------------------------------- [ 4200.503118] kworker/3:2/2492 is trying to acquire lock: [ 4200.503119] (s_active#292){++++.+}, at: [<ffffffff8122d20b>] sysfs_addrm_finish+0x3b/0x70 [ 4200.503127] but task is already holding lock: [ 4200.503128] (fcoe_config_mutex){+.+.+.}, at: [<ffffffffa02f3338>] fcoe_destroy_work+0xe8/0x120 [fcoe] [ 4200.503133] which lock already depends on the new lock. [ 4200.503135] the existing dependency chain (in reverse order) is: [ 4200.503136] -> #1 (fcoe_config_mutex){+.+.+.}: [ 4200.503139] [<ffffffff810c7711>] lock_acquire+0xa1/0x140 [ 4200.503143] [<ffffffff816ca7be>] mutex_lock_nested+0x6e/0x360 [ 4200.503146] [<ffffffffa02f11bd>] fcoe_enable+0x1d/0xb0 [fcoe] [ 4200.503148] [<ffffffffa02f127d>] fcoe_ctlr_enabled+0x2d/0x50 [fcoe] [ 4200.503151] [<ffffffffa02ffbe8>] store_ctlr_enabled+0x38/0x90 [libfcoe] [ 4200.503154] [<ffffffff81424878>] dev_attr_store+0x18/0x30 [ 4200.503157] [<ffffffff8122b750>] sysfs_write_file+0xe0/0x150 [ 4200.503160] [<ffffffff811b334c>] vfs_write+0xac/0x180 [ 4200.503162] [<ffffffff811b3692>] sys_write+0x52/0xa0 [ 4200.503164] [<ffffffff816d7159>] system_call_fastpath+0x16/0x1b [ 4200.503167] -> #0 (s_active#292){++++.+}: [ 4200.503170] [<ffffffff810c680f>] __lock_acquire+0x135f/0x1c90 [ 4200.503172] [<ffffffff810c7711>] lock_acquire+0xa1/0x140 [ 4200.503174] [<ffffffff8122c626>] sysfs_deactivate+0x116/0x160 [ 4200.503176] [<ffffffff8122d20b>] sysfs_addrm_finish+0x3b/0x70 [ 4200.503178] [<ffffffff8122b2eb>] sysfs_hash_and_remove+0x5b/0xb0 [ 4200.503180] [<ffffffff8122f3d1>] sysfs_remove_group+0x61/0x100 [ 4200.503183] [<ffffffff814251eb>] device_remove_groups+0x3b/0x60 [ 4200.503185] [<ffffffff81425534>] device_remove_attrs+0x44/0x80 [ 4200.503187] [<ffffffff81425e97>] device_del+0x127/0x1c0 [ 4200.503189] [<ffffffff81425f52>] device_unregister+0x22/0x60 [ 4200.503191] [<ffffffffa0300970>] fcoe_ctlr_device_delete+0xe0/0xf0 [libfcoe] [ 4200.503194] [<ffffffffa02f1b5c>] fcoe_interface_cleanup+0x6c/0xa0 [fcoe] [ 4200.503196] [<ffffffffa02f3355>] fcoe_destroy_work+0x105/0x120 [fcoe] [ 4200.503198] [<ffffffff8107ee91>] process_one_work+0x1a1/0x580 [ 4200.503203] [<ffffffff81080c6e>] worker_thread+0x15e/0x440 [ 4200.503205] [<ffffffff8108715a>] kthread+0xea/0xf0 [ 4200.503207] [<ffffffff816d70ac>] ret_from_fork+0x7c/0xb0 [ 4200.503209] other info that might help us debug this: [ 4200.503211] Possible unsafe locking scenario: [ 4200.503212] CPU0 CPU1 [ 4200.503213] ---- ---- [ 4200.503214] lock(fcoe_config_mutex); [ 4200.503215] lock(s_active#292); [ 4200.503218] lock(fcoe_config_mutex); [ 4200.503219] lock(s_active#292); [ 4200.503221] *** DEADLOCK *** [ 4200.503223] 3 locks held by kworker/3:2/2492: [ 4200.503224] #0: (fcoe){.+.+.+}, at: [<ffffffff8107ee2b>] process_one_work+0x13b/0x580 [ 4200.503228] #1: ((&port->destroy_work)){+.+.+.}, at: [<ffffffff8107ee2b>] process_one_work+0x13b/0x580 [ 4200.503232] #2: (fcoe_config_mutex){+.+.+.}, at: [<ffffffffa02f3338>] fcoe_destroy_work+0xe8/0x120 [fcoe] [ 4200.503236] stack backtrace: [ 4200.503238] Pid: 2492, comm: kworker/3:2 Not tainted 3.8.0-rc5+ #8 [ 4200.503240] Call Trace: [ 4200.503243] [<ffffffff816c2f09>] print_circular_bug+0x1fb/0x20c [ 4200.503246] [<ffffffff810c680f>] __lock_acquire+0x135f/0x1c90 [ 4200.503248] [<ffffffff810c463a>] ? debug_check_no_locks_freed+0x9a/0x180 [ 4200.503250] [<ffffffff810c7711>] lock_acquire+0xa1/0x140 [ 4200.503253] [<ffffffff8122d20b>] ? sysfs_addrm_finish+0x3b/0x70 [ 4200.503255] [<ffffffff8122c626>] sysfs_deactivate+0x116/0x160 [ 4200.503258] [<ffffffff8122d20b>] ? sysfs_addrm_finish+0x3b/0x70 [ 4200.503260] [<ffffffff8122d20b>] sysfs_addrm_finish+0x3b/0x70 [ 4200.503262] [<ffffffff8122b2eb>] sysfs_hash_and_remove+0x5b/0xb0 [ 4200.503265] [<ffffffff8122f3d1>] sysfs_remove_group+0x61/0x100 [ 4200.503273] [<ffffffff814251eb>] device_remove_groups+0x3b/0x60 [ 4200.503275] [<ffffffff81425534>] device_remove_attrs+0x44/0x80 [ 4200.503277] [<ffffffff81425e97>] device_del+0x127/0x1c0 [ 4200.503279] [<ffffffff81425f52>] device_unregister+0x22/0x60 [ 4200.503282] [<ffffffffa0300970>] fcoe_ctlr_device_delete+0xe0/0xf0 [libfcoe] [ 4200.503285] [<ffffffffa02f1b5c>] fcoe_interface_cleanup+0x6c/0xa0 [fcoe] [ 4200.503287] [<ffffffffa02f3355>] fcoe_destroy_work+0x105/0x120 [fcoe] [ 4200.503290] [<ffffffff8107ee91>] process_one_work+0x1a1/0x580 [ 4200.503292] [<ffffffff8107ee2b>] ? process_one_work+0x13b/0x580 [ 4200.503295] [<ffffffffa02f3250>] ? fcoe_if_destroy+0x230/0x230 [fcoe] [ 4200.503297] [<ffffffff81080c6e>] worker_thread+0x15e/0x440 [ 4200.503299] [<ffffffff81080b10>] ? busy_worker_rebind_fn+0x100/0x100 [ 4200.503301] [<ffffffff8108715a>] kthread+0xea/0xf0 [ 4200.503304] [<ffffffff81087070>] ? kthread_create_on_node+0x160/0x160 [ 4200.503306] [<ffffffff816d70ac>] ret_from_fork+0x7c/0xb0 [ 4200.503308] [<ffffffff81087070>] ? kthread_create_on_node+0x160/0x160 Signed-off-by: Robert Love <[email protected]> Tested-by: Jack Morgan <[email protected]>

The current code makes the assumption that a cpu_base lock won't be held if the CPU corresponding to that cpu_base is offline, which isn't always true. If a hrtimer is not queued, then it will not be migrated by migrate_hrtimers() when a CPU is offlined. Therefore, the hrtimer's cpu_base may still point to a CPU which has subsequently gone offline if the timer wasn't enqueued at the time the CPU went down. Normally this wouldn't be a problem, but a cpu_base's lock is blindly reinitialized each time a CPU is brought up. If a CPU is brought online during the period that another thread is performing a hrtimer operation on a stale hrtimer, then the lock will be reinitialized under its feet, and a SPIN_BUG() like the following will be observed: <0>[ 28.082085] BUG: spinlock already unlocked on CPU#0, swapper/0/0 <0>[ 28.087078] lock: 0xc4780b40, value 0x0 .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1 <4>[ 42.451150] [<c0014398>] (unwind_backtrace+0x0/0x120) from [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) <4>[ 42.460430] [<c0269220>] (do_raw_spin_unlock+0x44/0xdc) from [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) <4>[ 42.469632] [<c071b5bc>] (_raw_spin_unlock+0x8/0x30) from [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) <4>[ 42.479521] [<c00a9ce0>] (__hrtimer_start_range_ns+0x1e4/0x4f8) from [<c00aa014>] (hrtimer_start+0x20/0x28) <4>[ 42.489247] [<c00aa014>] (hrtimer_start+0x20/0x28) from [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) <4>[ 42.498709] [<c00e6190>] (rcu_idle_enter_common+0x1ac/0x320) from [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) <4>[ 42.508259] [<c00e6440>] (rcu_idle_enter+0xa0/0xb8) from [<c000f268>] (cpu_idle+0x24/0xf0) <4>[ 42.516503] [<c000f268>] (cpu_idle+0x24/0xf0) from [<c06ed3c0>] (rest_init+0x88/0xa0) <4>[ 42.524319] [<c06ed3c0>] (rest_init+0x88/0xa0) from [<c0c00978>] (start_kernel+0x3d0/0x434) As an example, this particular crash occurred when hrtimer_start() was executed on CPU #0. The code locked the hrtimer's current cpu_base corresponding to CPU #1. CPU #0 then tried to switch the hrtimer's cpu_base to an optimal CPU which was online. In this case, it selected the cpu_base corresponding to CPU #3. Before it could proceed, CPU #1 came online and reinitialized the spinlock corresponding to its cpu_base. Thus now CPU #0 held a lock which was reinitialized. When CPU #0 finally ended up unlocking the old cpu_base corresponding to CPU #1 so that it could switch to CPU #3, we hit this SPIN_BUG() above while in switch_hrtimer_base(). CPU #0 CPU #1 ---- ---- ... <offline> hrtimer_start() lock_hrtimer_base(base #1) ... init_hrtimers_cpu() switch_hrtimer_base() ... ... raw_spin_lock_init(&cpu_base->lock) raw_spin_unlock(&cpu_base->lock) ... <spin_bug> Solve this by statically initializing the lock. Signed-off-by: Michael Bohan <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Cc: [email protected] Signed-off-by: Thomas Gleixner <[email protected]>

clk inits on OMAP happen quite early, even before slab is available. The dependency comes from the fact that the timer init code starts to use clocks and hwmod and we need clocks to be initialized by then. There are various problems doing clk inits this early, one is, not being able to do dynamic clk registrations and hence the dependency on clk-private.h. The other is, inability to debug early kernel crashes without enabling DEBUG_LL and earlyprintk. Doing early clk init also exposed another instance of a kernel panic due to a BUG() when CONFIG_DEBUG_SLAB is enabled. [ 0.000000] Kernel BUG at c01174f8 [verbose debug info unavailable] [ 0.000000] Internal error: Oops - BUG: 0 [#1] SMP ARM [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 Not tainted (3.9.0-rc1-12179-g72d48f9 #6) [ 0.000000] PC is at __kmalloc+0x1d4/0x248 [ 0.000000] LR is at __clk_init+0x2e0/0x364 [ 0.000000] pc : [<c01174f8>] lr : [<c0441f54>] psr: 600001d3 [ 0.000000] sp : c076ff28 ip : c065cefc fp : c0441f54 [ 0.000000] r10: 0000001c r9 : 000080d0 r8 : c076ffd4 [ 0.000000] r7 : c074b578 r6 : c0794d88 r5 : 00000040 r4 : 00000000 [ 0.000000] r3 : 00000000 r2 : c07cac70 r1 : 000080d0 r0 : 0000001c [ 0.000000] Flags: nZCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel [ 0.000000] Control: 10c53c7d Table: 8000404a DAC: 00000017 [ 0.000000] Process swapper (pid: 0, stack limit = 0xc076e240) [ 0.000000] Stack: (0xc076ff28 to 0xc0770000) [ 0.000000] ff20: 22222222 c0794ec8 c06546e8 00000000 00000040 c0794d88 [ 0.000000] ff40: c074b578 c076ffd4 c07951c8 c076e000 00000000 c0441f54 c074b578 c076ffd4 [ 0.000000] ff60: c0793828 00000040 c0794d88 c074b578 c076ffd4 c0776900 c076e000 c07272ac [ 0.000000] ff80: 2f800000 c074c968 c07f93d0 c0719780 c076ffa0 c076ff98 00000000 00000000 [ 0.000000] ffa0: 00000000 00000000 00000000 00000001 c074cd6c c077b1ec 8000406a c0715724 [ 0.000000] ffc0: 00000000 00000000 00000000 00000000 00000000 c074c968 10c53c7d c0776974 [ 0.000000] ffe0: c074cd6c c077b1ec 8000406a 411fc092 00000000 80008074 00000000 00000000 [ 0.000000] [<c01174f8>] (__kmalloc+0x1d4/0x248) from [<c0441f54>] (__clk_init+0x2e0/0x364) [ 0.000000] [<c0441f54>] (__clk_init+0x2e0/0x364) from [<c07272ac>] (omap4xxx_clk_init+0xbc/0x140) [ 0.000000] [<c07272ac>] (omap4xxx_clk_init+0xbc/0x140) from [<c0719780>] (setup_arch+0x15c/0x284) [ 0.000000] [<c0719780>] (setup_arch+0x15c/0x284) from [<c0715724>] (start_kernel+0x7c/0x334) [ 0.000000] [<c0715724>] (start_kernel+0x7c/0x334) from [<80008074>] (0x80008074) [ 0.000000] Code: e5883004 e1a00006 e28dd00c e8bd8ff0 (e7f001f2) [ 0.000000] ---[ end trace 1b75b31a2719ed1c ]--- [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task! It was a know issue, that slab allocations would fail when common clock core tries to cache parent pointers for mux clocks on OMAP, and hence a patch 'clk: Allow late cache allocation for clk->parents, commit 7975059' was added to work this problem around. A BUG() within kmalloc() with CONFIG_DEBUG_SLAB enabled was completely overlooked causing this regression. More details on the issue reported can be found here, http://www.mail-archive.com/[email protected]/msg85932.html With all these issues around clk inits happening way too early, it makes sense to at least move them to a point where dynamic memory allocations are possible. So move them to a point just before the timer code starts using clocks and hwmod. This should at least pave way for clk inits on OMAP moving to dynamic clock registrations instead of using the static macros defined in clk-private.h. The issue with kernel panic while CONFIG_DEBUG_SLAB is enabled was reported by Piotr Haber and Tony Lindgren and this patch fixes the reported issue as well. Reported-by: Piotr Haber <[email protected]> Reported-by: Tony Lindgren <[email protected]> Signed-off-by: Rajendra Nayak <[email protected]> Acked-by: Santosh Shilimkar <[email protected]> Reviewed-by: Mike Turquette <[email protected]> Acked-by: Paul Walmsley <[email protected]> Cc: [email protected] # v3.8 Signed-off-by: Tony Lindgren <[email protected]>

The following issue was reported. WARNING: at net/mac80211/util.c:599 ieee80211_can_queue_work.isra.7+0x32/0x40 [mac80211]() Hardware name: iMac12,1 queueing ieee80211 work while going to suspend Pid: 0, comm: swapper/0 Tainted: PF O 3.8.2-206.fc18.x86_64 #1 Call Trace: Mar 16 09:39:17 Parags-iMac kernel: [ 3993.642992] <IRQ> [<ffffffff8105e61f>] warn_slowpath_common+0x7f/0xc0 [<ffffffffa0581420>] ? ath_start_rx_poll+0x70/0x70 [ath9k] <ffffffff8105e716>] warn_slowpath_fmt+0x46/0x50 [<ffffffffa045b542>] ieee80211_can_queue_work.isra.7+0x32/0x40 Fix this by avoiding to queue the work if our device has already been marked as suspended or stopped. Reported-by: Parag Warudkar <[email protected]> Tested-by: Parag Warudkar <[email protected]> Cc: [email protected] Signed-off-by: Luis R. Rodriguez <[email protected]> Signed-off-by: John W. Linville <[email protected]>

Op 23-03-13 12:47, Peter Hurley schreef: > On Tue, 2013-03-19 at 11:13 -0400, Peter Hurley wrote: >> On vanilla 3.9.0-rc3, I get this 100% repeatable oops after login when >> the user X session is coming up: > Perhaps I wasn't clear that this happens on every boot and is a > regression from 3.8 > > I'd be happy to help resolve this but time is of the essence; it would > be a shame to have to revert all of this for 3.9 Well it broke on my system too, so it was easy to fix. I didn't even need gdm to trigger it! >8---- This fixes regression caused by 1d7c71a (drm/nouveau/disp: port vblank handling to event interface), which causes a oops in the following way: BUG: unable to handle kernel NULL pointer dereference at 0000000000000001 IP: [<0000000000000001>] 0x0 PGD 0 Oops: 0010 [#1] PREEMPT SMP Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ...<snip>... CPU 3 Pid: 0, comm: swapper/3 Not tainted 3.9.0-rc3-xeon #rc3 Dell Inc. Precision WorkStation T5400 /0RW203 RIP: 0010:[<0000000000000001>] [<0000000000000001>] 0x0 RSP: 0018:ffff8802afcc3d80 EFLAGS: 00010087 RAX: ffff88029f6e5808 RBX: 0000000000000001 RCX: 0000000000000000 RDX: 0000000000000096 RSI: 0000000000000001 RDI: ffff88029f6e5808 RBP: ffff8802afcc3dc8 R08: 0000000000000000 R09: 0000000000000004 R10: 000000000000002c R11: ffff88029e559a98 R12: ffff8802a376cb78 R13: ffff88029f6e57e0 R14: ffff88029f6e57f8 R15: ffff88029f6e5808 FS: 0000000000000000(0000) GS:ffff8802afcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000001 CR3: 000000029fa67000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/3 (pid: 0, threadinfo ffff8802a355e000, task ffff8802a3535c40) Stack: ffffffffa0159d8a 0000000000000082 ffff88029f6e5820 0000000000000001 ffff88029f71aa00 0000000000000000 0000000000000000 0000000004000000 0000000004000000 ffff8802afcc3e38 ffffffffa01843b5 ffff8802afcc3df8 Call Trace: <IRQ> [<ffffffffa0159d8a>] ? nouveau_event_trigger+0xaa/0xe0 [nouveau] [<ffffffffa01843b5>] nv50_disp_intr+0xc5/0x200 [nouveau] [<ffffffff816fbacc>] ? _raw_spin_unlock_irqrestore+0x2c/0x50 [<ffffffff816ff98d>] ? notifier_call_chain+0x4d/0x70 [<ffffffffa017a105>] nouveau_mc_intr+0xb5/0x110 [nouveau] [<ffffffffa01d45ff>] nouveau_irq_handler+0x6f/0x80 [nouveau] [<ffffffff810eec95>] handle_irq_event_percpu+0x75/0x260 [<ffffffff810eeec8>] handle_irq_event+0x48/0x70 [<ffffffff810f205a>] handle_fasteoi_irq+0x5a/0x100 [<ffffffff810182f2>] handle_irq+0x22/0x40 [<ffffffff8170561a>] do_IRQ+0x5a/0xd0 [<ffffffff816fc2ad>] common_interrupt+0x6d/0x6d <EOI> [<ffffffff810449b6>] ? native_safe_halt+0x6/0x10 [<ffffffff8101ea1d>] default_idle+0x3d/0x170 [<ffffffff8101f736>] cpu_idle+0x116/0x130 [<ffffffff816e2a06>] start_secondary+0x251/0x258 Code: Bad RIP value. RIP [<0000000000000001>] 0x0 RSP <ffff8802afcc3d80> CR2: 0000000000000001 ---[ end trace 907323cb8ce6f301 ]--- Signed-off-by: Maarten Lankhorst <[email protected]> Signed-off-by: Ben Skeggs <[email protected]>

This reverts commit 6aa9707. Commit 6aa9707 ("lockdep: check that no locks held at freeze time") causes problems with NFS root filesystems. The failures were noticed on OMAP2 and 3 boards during kernel init: [ BUG: swapper/0/1 still has locks held! ] 3.9.0-rc3-00344-ga937536 #1 Not tainted ------------------------------------- 1 lock held by swapper/0/1: #0: (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574 stack backtrace: rpc_wait_bit_killable __wait_on_bit out_of_line_wait_on_bit __rpc_execute rpc_run_task rpc_call_sync nfs_proc_get_root nfs_get_root nfs_fs_mount_common nfs_try_mount nfs_fs_mount mount_fs vfs_kern_mount do_mount sys_mount do_mount_root mount_root prepare_namespace kernel_init_freeable kernel_init Although the rootfs mounts, the system is unstable. Here's a transcript from a PM test: http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt Here's what the test log should look like: http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt Mailing list discussion is here: http://lkml.org/lkml/2013/3/4/221 Deal with this for v3.9 by reverting the problem commit, until folks can figure out the right long-term course of action. Signed-off-by: Paul Walmsley <[email protected]> Cc: Mandeep Singh Baines <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Shawn Guo <[email protected]> Cc: <[email protected]> Cc: Fengguang Wu <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Ben Chan <[email protected]> Cc: Oleg Nesterov <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Commit b81ea1b (PM / QoS: Fix concurrency issues and memory leaks in device PM QoS) put calls to pm_qos_sysfs_add_latency(), pm_qos_sysfs_add_flags(), pm_qos_sysfs_remove_latency(), and pm_qos_sysfs_remove_flags() under dev_pm_qos_mtx, which was a mistake, because it may lead to deadlocks in some situations. For example, if pm_qos_remote_wakeup_store() is run in parallel with dev_pm_qos_constraints_destroy(), they may deadlock in the following way: ====================================================== [ INFO: possible circular locking dependency detected ] 3.9.0-rc4-next-20130328-sasha-00014-g91a3267 #319 Tainted: G W ------------------------------------------------------- trinity-child6/12371 is trying to acquire lock: (s_active#54){++++.+}, at: [<ffffffff81301631>] sysfs_addrm_finish+0x31/0x60 but task is already holding lock: (dev_pm_qos_mtx){+.+.+.}, at: [<ffffffff81f07cc3>] dev_pm_qos_constraints_destroy+0x23/0x250 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (dev_pm_qos_mtx){+.+.+.}: [<ffffffff811811da>] lock_acquire+0x1aa/0x240 [<ffffffff83dab809>] __mutex_lock_common+0x59/0x5e0 [<ffffffff83dabebf>] mutex_lock_nested+0x3f/0x50 [<ffffffff81f07f2f>] dev_pm_qos_update_flags+0x3f/0xc0 [<ffffffff81f05f4f>] pm_qos_remote_wakeup_store+0x3f/0x70 [<ffffffff81efbb43>] dev_attr_store+0x13/0x20 [<ffffffff812ffdaa>] sysfs_write_file+0xfa/0x150 [<ffffffff8127f2c1>] __kernel_write+0x81/0x150 [<ffffffff812afc2d>] write_pipe_buf+0x4d/0x80 [<ffffffff812af57c>] splice_from_pipe_feed+0x7c/0x120 [<ffffffff812afa25>] __splice_from_pipe+0x45/0x80 [<ffffffff812b14fc>] splice_from_pipe+0x4c/0x70 [<ffffffff812b1538>] default_file_splice_write+0x18/0x30 [<ffffffff812afae3>] do_splice_from+0x83/0xb0 [<ffffffff812afb2e>] direct_splice_actor+0x1e/0x20 [<ffffffff812b0277>] splice_direct_to_actor+0xe7/0x200 [<ffffffff812b15bc>] do_splice_direct+0x4c/0x70 [<ffffffff8127eda9>] do_sendfile+0x169/0x300 [<ffffffff8127ff94>] SyS_sendfile64+0x64/0xb0 [<ffffffff83db7d18>] tracesys+0xe1/0xe6 -> #0 (s_active#54){++++.+}: [<ffffffff811800cf>] __lock_acquire+0x15bf/0x1e50 [<ffffffff811811da>] lock_acquire+0x1aa/0x240 [<ffffffff81300aa2>] sysfs_deactivate+0x122/0x1a0 [<ffffffff81301631>] sysfs_addrm_finish+0x31/0x60 [<ffffffff812ff77f>] sysfs_hash_and_remove+0x7f/0xb0 [<ffffffff813035a1>] sysfs_unmerge_group+0x51/0x70 [<ffffffff81f068f4>] pm_qos_sysfs_remove_flags+0x14/0x20 [<ffffffff81f07490>] __dev_pm_qos_hide_flags+0x30/0x70 [<ffffffff81f07cd5>] dev_pm_qos_constraints_destroy+0x35/0x250 [<ffffffff81f06931>] dpm_sysfs_remove+0x11/0x50 [<ffffffff81efcf6f>] device_del+0x3f/0x1b0 [<ffffffff81efd128>] device_unregister+0x48/0x60 [<ffffffff82d4083c>] usb_hub_remove_port_device+0x1c/0x20 [<ffffffff82d2a9cd>] hub_disconnect+0xdd/0x160 [<ffffffff82d36ab7>] usb_unbind_interface+0x67/0x170 [<ffffffff81f001a7>] __device_release_driver+0x87/0xe0 [<ffffffff81f00559>] device_release_driver+0x29/0x40 [<ffffffff81effc58>] bus_remove_device+0x148/0x160 [<ffffffff81efd07f>] device_del+0x14f/0x1b0 [<ffffffff82d344f9>] usb_disable_device+0xf9/0x280 [<ffffffff82d34ff8>] usb_set_configuration+0x268/0x840 [<ffffffff82d3a7fc>] usb_remove_store+0x4c/0x80 [<ffffffff81efbb43>] dev_attr_store+0x13/0x20 [<ffffffff812ffdaa>] sysfs_write_file+0xfa/0x150 [<ffffffff8127f71d>] do_loop_readv_writev+0x4d/0x90 [<ffffffff8127f999>] do_readv_writev+0xf9/0x1e0 [<ffffffff8127faba>] vfs_writev+0x3a/0x60 [<ffffffff8127fc60>] SyS_writev+0x50/0xd0 [<ffffffff83db7d18>] tracesys+0xe1/0xe6 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(dev_pm_qos_mtx); lock(s_active#54); lock(dev_pm_qos_mtx); lock(s_active#54); *** DEADLOCK *** To avoid that, remove the calls to functions mentioned above from under dev_pm_qos_mtx and introduce a separate lock to prevent races between functions that add or remove device PM QoS sysfs attributes from happening. Reported-by: Sasha Levin <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>

Check for the presence of the '/cpus' OF node before dereferencing it blindly: [ 4.181793] Unable to handle kernel NULL pointer dereference at virtual address 0000001c [ 4.181793] pgd = c0004000 [ 4.181823] [0000001c] *pgd=00000000 [ 4.181823] Internal error: Oops: 5 [#1] SMP ARM [ 4.181823] Modules linked in: [ 4.181823] CPU: 1 Tainted: G W (3.8.0-15-generic #25~hbankD) [ 4.181854] PC is at of_get_next_child+0x64/0x70 [ 4.181854] LR is at of_get_next_child+0x24/0x70 [ 4.181854] pc : [<c04fda18>] lr : [<c04fd9d8>] psr: 60000113 [ 4.181854] sp : ed891ec0 ip : ed891ec0 fp : ed891ed4 [ 4.181884] r10: c04dafd0 r9 : c098690c r8 : c0936208 [ 4.181884] r7 : ed890000 r6 : c0a63d00 r5 : 00000000 r4 : 00000000 [ 4.181884] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : c0b2acc8 [ 4.181884] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel [ 4.181884] Control: 10c5387d Table: adcb804a DAC: 00000015 [ 4.181915] Process swapper/0 (pid: 1, stack limit = 0xed890238) [ 4.181915] Stack: (0xed891ec0 to 0xed892000) [ 4.181915] 1ec0: c09b7b70 00000007 ed891efc ed891ed8 c04daff4 c04fd9c0 00000000 c09b7b70 [ 4.181915] 1ee0: 00000007 c0a63d00 ed890000 c0936208 ed891f54 ed891f00 c00088e0 c04dafdc [ 4.181945] 1f00: ed891f54 ed891f10 c006e940 00000000 00000000 00000007 00000007 c08a4914 [ 4.181945] 1f20: 00000000 c07dbd30 c0a63d00 c09b7b70 00000007 c0a63d00 000000bc c0936208 [ 4.181945] 1f40: c098690c c0986914 ed891f94 ed891f58 c0936a40 c00087bc 00000007 00000007 [ 4.181976] 1f60: c0936208 be8bda20 b6eea010 c0a63d00 c064547c 00000000 00000000 00000000 [ 4.181976] 1f80: 00000000 00000000 ed891fac ed891f98 c0645498 c09368c8 00000000 00000000 [ 4.181976] 1fa0: 00000000 ed891fb0 c0014658 c0645488 00000000 00000000 00000000 00000000 [ 4.182006] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 [ 4.182006] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000 [ 4.182037] [<c04fda18>] (of_get_next_child+0x64/0x70) from [<c04daff4>] (cpu0_cpufreq_driver_init+0x24/0x284) [ 4.182067] [<c04daff4>] (cpu0_cpufreq_driver_init+0x24/0x284) from [<c00088e0>] (do_one_initcall+0x130/0x1b0) [ 4.182067] [<c00088e0>] (do_one_initcall+0x130/0x1b0) from [<c0936a40>] (kernel_init_freeable+0x184/0x24c) [ 4.182098] [<c0936a40>] (kernel_init_freeable+0x184/0x24c) from [<c0645498>] (kernel_init+0x1c/0xf4) [ 4.182128] [<c0645498>] (kernel_init+0x1c/0xf4) from [<c0014658>] (ret_from_fork+0x14/0x20) [ 4.182128] Code: f57ff04f e320f004 e89da830 e89da830 (e595001c) [ 4.182128] ---[ end trace 634903a22e8609cb ]--- [ 4.182189] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 4.182189] [ 4.642395] CPU0: stopping [rjw: Changelog] Signed-off-by: Paolo Pisati <[email protected]> Acked-by: Viresh Kumar <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>

The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by: Siegfried Wulsch <[email protected]> Acked-by: Peter Zijlstra <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <[email protected]> Cc: [email protected]

The function tracing control loop used by perf spits out a warning if the called function is not a control function. This is because the control function references a per cpu allocated data structure on struct ftrace_ops that is not allocated for other types of functions. commit 0a01640 "ftrace: Optimize the function tracer list loop" Had an optimization done to all function tracing loops to optimize for a single registered ops. Unfortunately, this allows for a slight race when tracing starts or ends, where the stub function might be called after the current registered ops is removed. In this case we get the following dump: root# perf stat -e ftrace:function sleep 1 [ 74.339105] WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0() [ 74.349522] Hardware name: PRIMERGY RX200 S6 [ 74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [ 74.446233] Pid: 1377, comm: perf Tainted: G W 3.9.0-rc1 #1 [ 74.453458] Call Trace: [ 74.456233] [<ffffffff81062e3f>] warn_slowpath_common+0x7f/0xc0 [ 74.462997] [<ffffffff810fbc60>] ? rcu_note_context_switch+0xa0/0xa0 [ 74.470272] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0 [ 74.478117] [<ffffffff81062e9a>] warn_slowpath_null+0x1a/0x20 [ 74.484681] [<ffffffff81102ede>] ftrace_ops_control_func+0xde/0xf0 [ 74.491760] [<ffffffff8162f400>] ftrace_call+0x5/0x2f [ 74.497511] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f [ 74.503486] [<ffffffff8162f400>] ? ftrace_call+0x5/0x2f [ 74.509500] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50 [ 74.516088] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40 [ 74.522268] [<ffffffff810fbc65>] ? synchronize_sched+0x5/0x50 [ 74.528837] [<ffffffff811041a2>] ? __unregister_ftrace_function+0xa2/0x1a0 [ 74.536696] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40 [ 74.542878] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50 [ 74.548869] [<ffffffff81105c67>] unregister_ftrace_function+0x27/0x50 [ 74.556243] [<ffffffff8111eadf>] perf_ftrace_event_register+0x9f/0x140 [ 74.563709] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40 [ 74.569887] [<ffffffff8162402d>] ? mutex_lock+0x1d/0x50 [ 74.575898] [<ffffffff8111e94e>] perf_trace_destroy+0x2e/0x50 [ 74.582505] [<ffffffff81127ba9>] tp_perf_event_destroy+0x9/0x10 [ 74.589298] [<ffffffff811295d0>] free_event+0x70/0x1a0 [ 74.595208] [<ffffffff8112a579>] perf_event_release_kernel+0x69/0xa0 [ 74.602460] [<ffffffff816254d5>] ? _cond_resched+0x5/0x40 [ 74.608667] [<ffffffff8112a640>] put_event+0x90/0xc0 [ 74.614373] [<ffffffff8112a740>] perf_release+0x10/0x20 [ 74.620367] [<ffffffff811a3044>] __fput+0xf4/0x280 [ 74.625894] [<ffffffff811a31de>] ____fput+0xe/0x10 [ 74.631387] [<ffffffff81083697>] task_work_run+0xa7/0xe0 [ 74.637452] [<ffffffff81014981>] do_notify_resume+0x71/0xb0 [ 74.643843] [<ffffffff8162fa92>] int_signal+0x12/0x17 To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end ftrace_ops stub as just that, a stub. This flag is now checked in the control loop and the function is not called if the flag is set. Thanks to Jovi for not just reporting the bug, but also pointing out where the bug was in the code. Link: http://lkml.kernel.org/r/[email protected] Link: http://lkml.kernel.org/r/[email protected] Tested-by: WANG Chao <[email protected]> Reported-by: WANG Chao <[email protected]> Reported-by: zhangwei(Jovi) <[email protected]> Signed-off-by: Steven Rostedt <[email protected]>

do_loopback calls lock_mount(path) and forget to unlock_mount if clone_mnt or copy_mnt fails. [ 77.661566] ================================================ [ 77.662939] [ BUG: lock held when returning to user space! ] [ 77.664104] 3.9.0-rc5+ #17 Not tainted [ 77.664982] ------------------------------------------------ [ 77.666488] mount/514 is leaving the kernel with locks still held! [ 77.668027] 2 locks held by mount/514: [ 77.668817] #0: (&sb->s_type->i_mutex_key#7){+.+.+.}, at: [<ffffffff811cca22>] lock_mount+0x32/0xe0 [ 77.671755] #1: (&namespace_sem){+++++.}, at: [<ffffffff811cca3a>] lock_mount+0x4a/0xe0 Signed-off-by: Andrey Vagin <[email protected]> Signed-off-by: Al Viro <[email protected]>

In paravirtualized x86_64 kernels, vmalloc_fault may cause an oops when lazy MMU updates are enabled, because set_pgd effects are being deferred. One instance of this problem is during process mm cleanup with memory cgroups enabled. The chain of events is as follows: - zap_pte_range enables lazy MMU updates - zap_pte_range eventually calls mem_cgroup_charge_statistics, which accesses the vmalloc'd mem_cgroup per-cpu stat area - vmalloc_fault is triggered which tries to sync the corresponding PGD entry with set_pgd, but the update is deferred - vmalloc_fault oopses due to a mismatch in the PUD entries The OOPs usually looks as so: ------------[ cut here ]------------ kernel BUG at arch/x86/mm/fault.c:396! invalid opcode: 0000 [#1] SMP .. snip .. CPU 1 Pid: 10866, comm: httpd Not tainted 3.6.10-4.fc18.x86_64 #1 RIP: e030:[<ffffffff816271bf>] [<ffffffff816271bf>] vmalloc_fault+0x11f/0x208 .. snip .. Call Trace: [<ffffffff81627759>] do_page_fault+0x399/0x4b0 [<ffffffff81004f4c>] ? xen_mc_extend_args+0xec/0x110 [<ffffffff81624065>] page_fault+0x25/0x30 [<ffffffff81184d03>] ? mem_cgroup_charge_statistics.isra.13+0x13/0x50 [<ffffffff81186f78>] __mem_cgroup_uncharge_common+0xd8/0x350 [<ffffffff8118aac7>] mem_cgroup_uncharge_page+0x57/0x60 [<ffffffff8115fbc0>] page_remove_rmap+0xe0/0x150 [<ffffffff8115311a>] ? vm_normal_page+0x1a/0x80 [<ffffffff81153e61>] unmap_single_vma+0x531/0x870 [<ffffffff81154962>] unmap_vmas+0x52/0xa0 [<ffffffff81007442>] ? pte_mfn_to_pfn+0x72/0x100 [<ffffffff8115c8f8>] exit_mmap+0x98/0x170 [<ffffffff810050d9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e [<ffffffff81059ce3>] mmput+0x83/0xf0 [<ffffffff810624c4>] exit_mm+0x104/0x130 [<ffffffff8106264a>] do_exit+0x15a/0x8c0 [<ffffffff810630ff>] do_group_exit+0x3f/0xa0 [<ffffffff81063177>] sys_exit_group+0x17/0x20 [<ffffffff8162bae9>] system_call_fastpath+0x16/0x1b Calling arch_flush_lazy_mmu_mode immediately after set_pgd makes the changes visible to the consistency checks. Cc: <[email protected]> RedHat-Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=914737 Tested-by: Josh Boyer <[email protected]> Reported-and-Tested-by: Krishna Raman <[email protected]> Signed-off-by: Samu Kallio <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Tested-by: Konrad Rzeszutek Wilk <[email protected]> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]> Signed-off-by: H. Peter Anvin <[email protected]>

e100 uses pci_map_single, but fails to check for a dma mapping error after its use, resulting in a stack trace: [ 46.656594] ------------[ cut here ]------------ [ 46.657004] WARNING: at lib/dma-debug.c:933 check_unmap+0x47b/0x950() [ 46.657004] Hardware name: To Be Filled By O.E.M. [ 46.657004] e100 0000:00:0e.0: DMA-API: device driver failed to check map error[device address=0x000000007a4540fa] [size=90 bytes] [mapped as single] [ 46.657004] Modules linked in: [ 46.657004] w83627hf hwmon_vid snd_via82xx ppdev snd_ac97_codec ac97_bus snd_seq snd_pcm snd_mpu401 snd_mpu401_uart ns558 snd_rawmidi gameport parport_pc e100 snd_seq_device parport snd_page_alloc snd_timer snd soundcore skge shpchp k8temp mii edac_core i2c_viapro edac_mce_amd nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc uinput ata_generic pata_acpi radeon i2c_algo_bit drm_kms_helper ttm firewire_ohci drm firewire_core pata_via sata_via i2c_core sata_promise crc_itu_t [ 46.657004] Pid: 792, comm: ip Not tainted 3.8.0-0.rc6.git0.1.fc19.x86_64 #1 [ 46.657004] Call Trace: [ 46.657004] <IRQ> [<ffffffff81065ed0>] warn_slowpath_common+0x70/0xa0 [ 46.657004] [<ffffffff81065f4c>] warn_slowpath_fmt+0x4c/0x50 [ 46.657004] [<ffffffff81364cfb>] check_unmap+0x47b/0x950 [ 46.657004] [<ffffffff8136522f>] debug_dma_unmap_page+0x5f/0x70 [ 46.657004] [<ffffffffa030f0f0>] ? e100_tx_clean+0x30/0x210 [e100] [ 46.657004] [<ffffffffa030f1a8>] e100_tx_clean+0xe8/0x210 [e100] [ 46.657004] [<ffffffffa030fc6f>] e100_poll+0x56f/0x6c0 [e100] [ 46.657004] [<ffffffff8159dce1>] ? net_rx_action+0xa1/0x370 [ 46.657004] [<ffffffff8159ddb2>] net_rx_action+0x172/0x370 [ 46.657004] [<ffffffff810703bf>] __do_softirq+0xef/0x3d0 [ 46.657004] [<ffffffff816e4ebc>] call_softirq+0x1c/0x30 [ 46.657004] [<ffffffff8101c485>] do_softirq+0x85/0xc0 [ 46.657004] [<ffffffff81070885>] irq_exit+0xd5/0xe0 [ 46.657004] [<ffffffff816e5756>] do_IRQ+0x56/0xc0 [ 46.657004] [<ffffffff816dacb2>] common_interrupt+0x72/0x72 [ 46.657004] <EOI> [<ffffffff816da1eb>] ? _raw_spin_unlock_irqrestore+0x3b/0x70 [ 46.657004] [<ffffffff816d124d>] __slab_free+0x58/0x38b [ 46.657004] [<ffffffff81214424>] ? fsnotify_clear_marks_by_inode+0x34/0x120 [ 46.657004] [<ffffffff811b0417>] ? kmem_cache_free+0x97/0x320 [ 46.657004] [<ffffffff8157fc14>] ? sock_destroy_inode+0x34/0x40 [ 46.657004] [<ffffffff8157fc14>] ? sock_destroy_inode+0x34/0x40 [ 46.657004] [<ffffffff811b0692>] kmem_cache_free+0x312/0x320 [ 46.657004] [<ffffffff8157fc14>] sock_destroy_inode+0x34/0x40 [ 46.657004] [<ffffffff811e8c28>] destroy_inode+0x38/0x60 [ 46.657004] [<ffffffff811e8d5e>] evict+0x10e/0x1a0 [ 46.657004] [<ffffffff811e9605>] iput+0xf5/0x180 [ 46.657004] [<ffffffff811e4338>] dput+0x248/0x310 [ 46.657004] [<ffffffff811ce0e1>] __fput+0x171/0x240 [ 46.657004] [<ffffffff811ce26e>] ____fput+0xe/0x10 [ 46.657004] [<ffffffff8108d54c>] task_work_run+0xac/0xe0 [ 46.657004] [<ffffffff8106c6ed>] do_exit+0x26d/0xc30 [ 46.657004] [<ffffffff8109eccc>] ? finish_task_switch+0x7c/0x120 [ 46.657004] [<ffffffff816dad58>] ? retint_swapgs+0x13/0x1b [ 46.657004] [<ffffffff8106d139>] do_group_exit+0x49/0xc0 [ 46.657004] [<ffffffff8106d1c4>] sys_exit_group+0x14/0x20 [ 46.657004] [<ffffffff816e3b19>] system_call_fastpath+0x16/0x1b [ 46.657004] ---[ end trace 4468c44e2156e7d1 ]--- [ 46.657004] Mapped at: [ 46.657004] [<ffffffff813663d1>] debug_dma_map_page+0x91/0x140 [ 46.657004] [<ffffffffa030e8eb>] e100_xmit_prepare+0x12b/0x1c0 [e100] [ 46.657004] [<ffffffffa030c924>] e100_exec_cb+0x84/0x140 [e100] [ 46.657004] [<ffffffffa030e56a>] e100_xmit_frame+0x3a/0x190 [e100] [ 46.657004] [<ffffffff8159ee89>] dev_hard_start_xmit+0x259/0x6c0 Easy fix, modify the cb paramter to e100_exec_cb to return an error, and do the dma_mapping_error check in the obvious place This was reported previously here: http://article.gmane.org/gmane.linux.network/257893 But nobody stepped up and fixed it. CC: Josh Boyer <[email protected]> CC: [email protected] Signed-off-by: Neil Horman <[email protected]> Reported-by: Michal Jaegermann <[email protected]> Tested-by: Aaron Brown <[email protected]> Signed-off-by: Jeff Kirsher <[email protected]> Signed-off-by: David S. Miller <[email protected]>

This patch attempts to fix: https://bugzilla.kernel.org/show_bug.cgi?id=56461 The symptom is a crash and messages like this: chrome: Corrupted page table at address 34a03000 *pdpt = 0000000000000000 *pde = 0000000000000000 Bad pagetable: 000f [#1] PREEMPT SMP Ingo guesses this got introduced by commit 611ae8e ("x86/tlb: enable tlb flush range support for x86") since that code started to free unused pagetables. On x86-32 PAE kernels, that new code has the potential to free an entire PMD page and will clear one of the four page-directory-pointer-table (aka pgd_t entries). The hardware aggressively "caches" these top-level entries and invlpg does not actually affect the CPU's copy. If we clear one we *HAVE* to do a full TLB flush, otherwise we might continue using a freed pmd page. (note, we do this properly on the population side in pud_populate()). This patch tracks whenever we clear one of these entries in the 'struct mmu_gather', and ensures that we follow up with a full tlb flush. BTW, I disassembled and checked that: if (tlb->fullmm == 0) and if (!tlb->fullmm && !tlb->need_flush_all) generate essentially the same code, so there should be zero impact there to the !PAE case. Signed-off-by: Dave Hansen <[email protected]> Cc: Peter Anvin <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Artem S Tashkinov <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>

Currently IOP3XX_PERIPHERAL_VIRT_BASE conflicts with PCI_IO_VIRT_BASE: address size PCI_IO_VIRT_BASE 0xfee00000 0x200000 IOP3XX_PERIPHERAL_VIRT_BASE 0xfeffe000 0x2000 Fix by moving IOP3XX_PERIPHERAL_VIRT_BASE below PCI_IO_VIRT_BASE. The patch fixes the following kernel panic with 3.9-rc1 on iop3xx boards: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Linux version 3.9.0-rc1-iop32x (aaro@blackmetal) (gcc version 4.7.2 (GCC) ) #20 PREEMPT Tue Mar 5 16:44:36 EET 2013 [ 0.000000] bootconsole [earlycon0] enabled [ 0.000000] ------------[ cut here ]------------ [ 0.000000] kernel BUG at mm/vmalloc.c:1145! [ 0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT ARM [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 Not tainted (3.9.0-rc1-iop32x #20) [ 0.000000] PC is at vm_area_add_early+0x4c/0x88 [ 0.000000] LR is at add_static_vm_early+0x14/0x68 [ 0.000000] pc : [<c03e74a8>] lr : [<c03e1c40>] psr: 800000d3 [ 0.000000] sp : c03ffee4 ip : dfffdf88 fp : c03ffef4 [ 0.000000] r10: 00000002 r9 : 000000cf r8 : 00000653 [ 0.000000] r7 : c040eca8 r6 : c03e2408 r5 : dfffdf60 r4 : 00200000 [ 0.000000] r3 : dfffdfd8 r2 : feffe000 r1 : ff000000 r0 : dfffdf60 [ 0.000000] Flags: Nzcv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel [ 0.000000] Control: 0000397f Table: a0004000 DAC: 00000017 [ 0.000000] Process swapper (pid: 0, stack limit = 0xc03fe1b8) [ 0.000000] Stack: (0xc03ffee4 to 0xc0400000) [ 0.000000] fee0: 00200000 c03fff0c c03ffef8 c03e1c40 c03e7468 00200000 fee00000 [ 0.000000] ff00: c03fff2c c03fff10 c03e23e4 c03e1c38 feffe000 c0408ee4 ff000000 c0408f04 [ 0.000000] ff20: c03fff3c c03fff30 c03e2434 c03e23b4 c03fff84 c03fff40 c03e2c94 c03e2414 [ 0.000000] ff40: c03f8878 c03f6410 ffff0000 000bffff 00001000 00000008 c03fff84 c03f6410 [ 0.000000] ff60: c04227e8 c03fffd4 a0008000 c03f8878 69052e30 c02f96eb c03fffbc c03fff88 [ 0.000000] ff80: c03e044c c03e268c 00000000 0000397f c0385130 00000001 ffffffff c03f8874 [ 0.000000] ffa0: dfffffff a0004000 69052e30 a03f61a0 c03ffff4 c03fffc0 c03dd5cc c03e0184 [ 0.000000] ffc0: 00000000 00000000 00000000 00000000 00000000 c03f8878 0000397d c040601c [ 0.000000] ffe0: c03f8874 c0408674 00000000 c03ffff8 a0008040 c03dd558 00000000 00000000 [ 0.000000] Backtrace: [ 0.000000] [<c03e745c>] (vm_area_add_early+0x0/0x88) from [<c03e1c40>] (add_static_vm_early+0x14/0x68) Tested-by: Mikael Pettersson <[email protected]> Signed-off-by: Aaro Koskinen <[email protected]> Signed-off-by: Russell King <[email protected]>

The following RCU splat indicates lack of RCU protection: [ 953.267649] =============================== [ 953.267652] [ INFO: suspicious RCU usage. ] [ 953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted [ 953.267661] ------------------------------- [ 953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage! [ 953.267669] [ 953.267669] other info that might help us debug this: [ 953.267669] [ 953.267675] [ 953.267675] rcu_scheduler_active = 1, debug_locks = 0 [ 953.267680] 1 lock held by glxgears/1289: [ 953.267683] #0: (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0 [ 953.267700] [ 953.267700] stack backtrace: [ 953.267704] Call Trace: [ 953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable) [ 953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180 [ 953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690 [ 953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0 [ 953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220 [ 953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0 ... This commit therefore adds the required RCU read-side critical section to perf_event_comm(). Reported-by: Adam Jackson <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Cc: [email protected] Cc: [email protected] Cc: [email protected] Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]> Tested-by: Gustavo Luiz Duarte <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tagged frames can exceed standard Ethernet mtu of 1500 #1

Tagged frames can exceed standard Ethernet mtu of 1500 #1

rrendec commented Jan 8, 2013

Tagged frames can exceed standard Ethernet mtu of 1500 #1

Tagged frames can exceed standard Ethernet mtu of 1500 #1

Comments

rrendec commented Jan 8, 2013