Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

profile slow base performance of NUSE #4

Closed
thehajime opened this issue Nov 10, 2014 · 3 comments
Closed

profile slow base performance of NUSE #4

thehajime opened this issue Nov 10, 2014 · 3 comments
Assignees
Labels

Comments

@thehajime
Copy link
Member

No description provided.

@thehajime thehajime added the bug label Nov 10, 2014
@thehajime thehajime self-assigned this Nov 10, 2014
@thehajime
Copy link
Member Author

commit 22e5069 fixes inefficient poll() call issue.

  • ping (native)
    PING 192.168.209.1 (192.168.209.1) 56(84) bytes of data.
    64 bytes from 192.168.209.1: icmp_seq=1 ttl=64 time=0.224 ms
    64 bytes from 192.168.209.1: icmp_seq=2 ttl=64 time=0.300 ms
    64 bytes from 192.168.209.1: icmp_seq=3 ttl=64 time=0.304 ms
    64 bytes from 192.168.209.1: icmp_seq=4 ttl=64 time=0.239 ms
    64 bytes from 192.168.209.1: icmp_seq=5 ttl=64 time=0.231 ms
    64 bytes from 192.168.209.1: icmp_seq=6 ttl=64 time=0.217 ms
    ^C
    --- 192.168.209.1 ping statistics ---
    6 packets transmitted, 6 received, 0% packet loss, time 5004ms
    rtt min/avg/max/mdev = 0.217/0.252/0.304/0.039 ms
  • ping (NUSE)
    assign nuse interface nuse-eno16777736 IPv4 address 192.168.209.39
    PING 192.168.209.1 (192.168.209.1) 56(84) bytes of data.
    Warning: time of day goes back (-1414852632497358us), taking countermeasures.
    64 bytes from 192.168.209.1: icmp_seq=1 ttl=64 time=0.712 ms
    64 bytes from 192.168.209.1: icmp_seq=2 ttl=64 time=0.683 ms
    64 bytes from 192.168.209.1: icmp_seq=3 ttl=64 time=0.533 ms
    64 bytes from 192.168.209.1: icmp_seq=4 ttl=64 time=0.462 ms
    64 bytes from 192.168.209.1: icmp_seq=5 ttl=64 time=0.880 ms
    64 bytes from 192.168.209.1: icmp_seq=6 ttl=64 time=0.401 ms
    ^C
    --- 192.168.209.1 ping statistics ---
    6 packets transmitted, 6 received, 0% packet loss, time 5006ms
    rtt min/avg/max/mdev = 0.401/0.611/0.880/0.166 ms

still need to work on more..

@thehajime
Copy link
Member Author

- iperf (native)

Client connecting to 192.168.209.1, TCP port 2000

TCP window size: 85.0 KByte (default)

[ 3] local 192.168.209.251 port 51760 connected with 192.168.209.1 port 2000
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.09 GBytes 1.80 Gbits/sec

- iperf (NUSE)

Client connecting to 192.168.209.1, TCP port 2000

TCP window size: 45.0 KByte (default)

[ 3] local 192.168.209.39 port 54785 connected with 192.168.209.39 port 54785
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 158 MBytes 133 Mbits/sec

go ahead !

thehajime pushed a commit that referenced this issue Apr 8, 2015
We currently need two checks of the peripheral version in MACB_MID register.
One of them got out of sync after modification by 8a013a9 (net: macb:
Include multi queue support for xilinx ZynqMP ethernet version).
Fix this in macb_configure_caps() so that xilinx ZynqMP will be considered
as a GEM flavor.

Fixes: 8a013a9 ("net: macb: Include multi queue support for xilinx ZynqMP
ethernet version")

Signed-off-by: Nicolas Ferre <[email protected]>
Cc: Michal Simek <[email protected]>
Cc: Punnaiah Choudary Kalluri <[email protected]>
Cc: <[email protected]> #4.0 (if it doesn't make it for -final)
Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue Apr 8, 2015
Andrew Lunn says:

====================
DSA Mavell drivers refactoring and cleanup

v1->v2:
 * Add missing signed-of-by: For patches authored by Guenter Roeck.
 * Add Reviewed by from Guenter Roack to patch #5.

This is a collection of patches again net-next from today containing
refactoring and consolidate of code, cleanups and using #define's to
replace register numbers.

Patch #1 Swaps the 6131 driver to use the consolidated setup code.

Patch #2 Moves the Switch IDs used during probe into a central
         location.  We need these later so that we can differentiate
         the different features the devices have.

Patch #3 Makes the 6131 driver set the number of ports in the private
         state structure. It then uses this, rather than hard coded
         maximum number of ports.

Patch #4 Similar to Patch #3, but for the 6123_61_65 driver.

Patch #5 Similar to Patch #3, and #4, but for all the remaining
         drivers.  This greatly increases the similarity of the code
         between drivers, allow further patches to consolidate the
         duplicated code.

Patch #6 Consolidate the switch reset code, which has two minor
         variants. Removes around 35 lines per driver.

Patch #7 Moves phy page access functions out of the 6352 driver into
         the shared code. Currently only the 6352 driver uses this,
         but it is likely other devices will come along wanting this
         functionality.

Patch #8 Consolidates the code used to access phy registers. Removes
         around 40 lines of code per driver.

Patch #9 Fixes missing mutex locking in the EEE code, and refactors
         the code a bit to make it more understandable with respect to
         locks.

Patch #10 Consolidates reading statistics. This is very similar code
          for all devices, but the number of available statistics
          differ, which can be determined from the product ID. Removes
	  around 65 lines per driver.

Patch #11 Add #defines for registers, and bits within the
          registers. For the moment, this is limited to the shared
          code. The individual drivers will be converted once the
          remaining duplicated code is consolidated

Patch #12 Fix broken statistic counters on the 6172. The 6352 family
          requires the port number is poked into a different set of
          bits in the register compared to other devices.

Many thanks to Guenter Roeck for repeatedly reviewing the patches and
testing them on his hardware.
====================

Signed-off-by: David S. Miller <[email protected]>
@thehajime
Copy link
Member Author

This issue was moved to libos-nuse/linux-libos-tools#4

thehajime pushed a commit that referenced this issue Apr 22, 2015
…_ram allocator

Recently I came across high fragmentation of vm_map_ram allocator:
vmap_block has free space, but still new blocks continue to appear.
Further investigation showed that certain mapping/unmapping sequences
can exhaust vmalloc space.  On small 32bit systems that's not a big
problem, cause purging will be called soon on a first allocation failure
(alloc_vmap_area), but on 64bit machines, e.g.  x86_64 has 45 bits of
vmalloc space, that can be a disaster.

1) I came up with a simple allocation sequence, which exhausts virtual
   space very quickly:

  while (iters) {

                /* Map/unmap big chunk */
                vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, 16);

                /* Map/unmap small chunks.
                 *
                 * -1 for hole, which should be left at the end of each block
                 * to keep it partially used, with some free space available */
                for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                        vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                        vm_unmap_ram(vaddr, 8);
                }
  }

The idea behind is simple:

 1. We have to map a big chunk, e.g. 16 pages.

 2. Then we have to occupy the remaining space with smaller chunks, i.e.
    8 pages. At the end small hole should remain to keep block in free list,
    but do not let big chunk to occupy remaining space.

 3. Goto 1 - allocation request of 16 pages can't be completed (only 8 slots
    are left free in the block in the #2 step), new block will be allocated,
    all further requests will lay into newly allocated block.

To have some measurement numbers for all further tests I setup ftrace and
enabled 4 basic calls in a function profile:

        echo vm_map_ram              > /sys/kernel/debug/tracing/set_ftrace_filter;
        echo alloc_vmap_area        >> /sys/kernel/debug/tracing/set_ftrace_filter;
        echo vm_unmap_ram           >> /sys/kernel/debug/tracing/set_ftrace_filter;
        echo free_vmap_block        >> /sys/kernel/debug/tracing/set_ftrace_filter;

So for this scenario I got these results:

BEFORE (all new blocks are put to the head of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                          126000    30683.30 us     0.243 us        30819.36 us
  vm_unmap_ram                        126000    22003.24 us     0.174 us        340.886 us
  alloc_vmap_area                       1000    4132.065 us     4.132 us        0.903 us

AFTER (all new blocks are put to the tail of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                          126000    28713.13 us     0.227 us        24944.70 us
  vm_unmap_ram                        126000    20403.96 us     0.161 us        1429.872 us
  alloc_vmap_area                        993    3916.795 us     3.944 us        29.370 us
  free_vmap_block                        992    654.157 us      0.659 us        1.273 us

SUMMARY:

The most interesting numbers in those tables are numbers of block
allocations and deallocations: alloc_vmap_area and free_vmap_block
calls, which show that before the change blocks were not freed, and
virtual space and physical memory (vmap_block structure allocations,
etc) were consumed.

Average time which were spent in vm_map_ram/vm_unmap_ram became slightly
better.  That can be explained with a reasonable amount of blocks in a
free list, which we need to iterate to find a suitable free block.

2) Another scenario is a random allocation:

  while (iters) {

                /* Randomly take number from a range [1..32/64] */
                nr = rand(1, VMAP_MAX_ALLOC);
                vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, nr);
  }

I chose mersenne twister PRNG to generate persistent random state to
guarantee that both runs have the same random sequence.  For each
vm_map_ram call random number from [1..32/64] was taken to represent
amount of pages which I do map.

I did 10'000 vm_map_ram calls and got these two tables:

BEFORE (all new blocks are put to the head of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                           10000    10170.01 us     1.017 us        993.609 us
  vm_unmap_ram                         10000    5321.823 us     0.532 us        59.789 us
  alloc_vmap_area                        420    2150.239 us     5.119 us        3.307 us
  free_vmap_block                         37    159.587 us      4.313 us        134.344 us

AFTER (all new blocks are put to the tail of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                           10000    7745.637 us     0.774 us        395.229 us
  vm_unmap_ram                         10000    5460.573 us     0.546 us        67.187 us
  alloc_vmap_area                        414    2201.650 us     5.317 us        5.591 us
  free_vmap_block                        412    574.421 us      1.394 us        15.138 us

SUMMARY:

'BEFORE' table shows, that 420 blocks were allocated and only 37 were
freed.  Remained 383 blocks are still in a free list, consuming virtual
space and physical memory.

'AFTER' table shows, that 414 blocks were allocated and 412 were really
freed.  2 blocks remained in a free list.

So fragmentation was dramatically reduced.  Why? Because when we put
newly allocated block to the head, all further requests will occupy new
block, regardless remained space in other blocks.  In this scenario all
requests come randomly.  Eventually remained free space will be less
than requested size, free list will be iterated and it is possible that
nothing will be found there - finally new block will be created.  So
exhaustion in random scenario happens for the maximum possible
allocation size: 32 pages for 32-bit system and 64 pages for 64-bit
system.

Also average cost of vm_map_ram was reduced from 1.017 us to 0.774 us.
Again this can be explained by iteration through smaller list of free
blocks.

3) Next simple scenario is a sequential allocation, when the allocation
   order is increased for each block.  This scenario forces allocator to
   reach maximum amount of partially free blocks in a free list:

  while (iters) {

                /* Populate free list with blocks with remaining space */
                for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                        nr = VMAP_BBMAP_BITS / (1 << order);

                        /* Leave a hole */
                        nr -= 1;

                        for (i = 0; i < nr; i++) {
                                vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                                vm_unmap_ram(vaddr, (1 << order));
                }

                /* Completely occupy blocks from a free list */
                for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                        vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                        vm_unmap_ram(vaddr, (1 << order));
                }
  }

Results which I got:

BEFORE (all new blocks are put to the head of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                         2032000    399545.2 us     0.196 us        467123.7 us
  vm_unmap_ram                       2032000    363225.7 us     0.178 us        111405.9 us
  alloc_vmap_area                       7001    30627.76 us     4.374 us        495.755 us
  free_vmap_block                       6993    7011.685 us     1.002 us        159.090 us

AFTER (all new blocks are put to the tail of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                         2032000    394259.7 us     0.194 us        589395.9 us
  vm_unmap_ram                       2032000    292500.7 us     0.143 us        94181.08 us
  alloc_vmap_area                       7000    31103.11 us     4.443 us        703.225 us
  free_vmap_block                       7000    6750.844 us     0.964 us        119.112 us

SUMMARY:

No surprises here, almost all numbers are the same.

Fixing this fragmentation problem I also did some improvements in a
allocation logic of a new vmap block: occupy block immediately and get
rid of extra search in a free list.

Also I replaced dirty bitmap with min/max dirty range values to make the
logic simpler and slightly faster, since two longs comparison costs
less, than loop thru bitmap.

This patchset raises several questions:

 Q: Think the problem you comments is already known so that I wrote comments
    about it as "it could consume lots of address space through fragmentation".
    Could you tell me about your situation and reason why it should be avoided?
                                                                     Gioh Kim

 A: Indeed, there was a commit 3643763 which adds explicit comment about
    fragmentation.  But fragmentation which is described in this comment caused
    by mixing of long-lived and short-lived objects, when a whole block is pinned
    in memory because some page slots are still in use.  But here I am talking
    about blocks which are free, nobody uses them, and allocator keeps them alive
    forever, continuously allocating new blocks.

 Q: I think that if you put newly allocated block to the tail of a free
    list, below example would results in enormous performance degradation.

    new block: 1MB (256 pages)

    while (iters--) {
      vm_map_ram(3 or something else not dividable for 256) * 85
      vm_unmap_ram(3) * 85
    }

    On every iteration, it needs newly allocated block and it is put to the
    tail of a free list so finding it consumes large amount of time.
                                                                    Joonsoo Kim

 A: Second patch in current patchset gets rid of extra search in a free list,
    so new block will be immediately occupied..

    Also, the scenario above is impossible, cause vm_map_ram allocates virtual
    range in orders, i.e. 2^n.  I.e. passing 3 to vm_map_ram you will allocate
    4 slots in a block and 256 slots (capacity of a block) of course dividable
    on 4, so block will be completely occupied.

    But there is a worst case which we can achieve: each free block has a hole
    equal to order size.

    The maximum size of allocation is 64 pages for 64-bit system
    (if you try to map more, original alloc_vmap_area will be called).

    So the maximum order is 6.  That means that worst case, before allocator
    makes a decision to allocate a new block, is to iterate 7 blocks:

    HEAD
    1st block - has 1  page slot  free (order 0)
    2nd block - has 2  page slots free (order 1)
    3rd block - has 4  page slots free (order 2)
    4th block - has 8  page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on 64-bit system is that each CPU queue can have 7
    blocks in a free list.

    This can happen only and only if you allocate blocks increasing the order.
    (as I did in the function written in the comment of the first patch)
    This is weird and rare case, but still it is possible.  Afterwards you will
    get 7 blocks in a list.

    All further requests should be placed in a newly allocated block or some
    free slots should be found in a free list.
    Seems it does not look dramatically awful.

This patch (of 3):

If suitable block can't be found, new block is allocated and put into a
head of a free list, so on next iteration this new block will be found
first.

That's bad, because old blocks in a free list will not get a chance to be
fully used, thus fragmentation will grow.

Let's consider this simple example:

 #1 We have one block in a free list which is partially used, and where only
    one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

 #2 New allocation request of order 1 (2 pages) comes, new block is allocated
    since we do not have free space to complete this request. New block is put
    into a head of a free list:

    HEAD |----------|xxxxxxxxx-| TAIL

 #3 Two pages were occupied in a new found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^
          two pages mapped here

 #4 New allocation request of order 0 (1 page) comes.  Block, which was created
    on #2 step, is located at the beginning of a free list, so it will be found
    first:

  HEAD |xxX-------|xxxxxxxxx-| TAIL
          ^                 ^
          page mapped here, but better to use this hole

It is obvious, that it is better to complete request of #4 step using the
old block, where free space is left, because in other case fragmentation
will be highly increased.

But fragmentation is not only the case.  The worst thing is that I can
easily create scenario, when the whole vmalloc space is exhausted by
blocks, which are not used, but already dirty and have several free pages.

Let's consider this function which execution should be pinned to one CPU:

static void exhaust_virtual_space(struct page *pages[16], int iters)
{
        /* Firstly we have to map a big chunk, e.g. 16 pages.
         * Then we have to occupy the remaining space with smaller
         * chunks, i.e. 8 pages. At the end small hole should remain.
         * So at the end of our allocation sequence block looks like
         * this:
         *                XX  big chunk
         * |XXxxxxxxx-|    x  small chunk
         *                 -  hole, which is enough for a small chunk,
         *                    but is not enough for a big chunk
         */
        while (iters--) {
                int i;
                void *vaddr;

                /* Map/unmap big chunk */
                vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, 16);

                /* Map/unmap small chunks.
                 *
                 * -1 for hole, which should be left at the end of each block
                 * to keep it partially used, with some free space available */
                for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                        vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                        vm_unmap_ram(vaddr, 8);
                }
        }
}

On every iteration new block (1MB of vm area in my case) will be
allocated and then will be occupied, without attempt to resolve small
allocation request using previously allocated blocks in a free list.

In case of random allocation (size should be randomly taken from the
range [1..64] in 64-bit case or [1..32] in 32-bit case) situation is the
same: new blocks continue to appear if maximum possible allocation size
(32 or 64) passed to the allocator, because all remaining blocks in a
free list do not have enough free space to complete this allocation
request.

In summary if new blocks are put into the head of a free list eventually
virtual space will be exhausted.

In current patch I simply put newly allocated block to the tail of a
free list, thus reduce fragmentation, giving a chance to resolve
allocation request using older blocks with possible holes left.

Signed-off-by: Roman Pen <[email protected]>
Cc: Eric Dumazet <[email protected]>
Acked-by: Joonsoo Kim <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: WANG Chao <[email protected]>
Cc: Fabian Frederick <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Gioh Kim <[email protected]>
Cc: Rob Jones <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
thehajime pushed a commit that referenced this issue May 10, 2015
I think this is useful to verify whether a filter could be JITed or
not in case of bpf_prog_enable >= 1, which otherwise the test suite
doesn't tell besides taking a good peek at the performance numbers.

Nicolas Schichan reported a bug in the ARM JIT compiler that rejected
and waved the filter to the interpreter although it shouldn't have.
Nevertheless, the test passes as expected, but such information is
not visible.

It's i.e. useful for the remaining classic JITs, but also for
implementing remaining opcodes that are not yet present in eBPF JITs
(e.g. ARM64 waves some of them to the interpreter). This minor patch
allows to grep through dmesg to find those accordingly, but also
provides a total summary, i.e.: [<X>/53 JIT'ed]

  # echo 1 > /proc/sys/net/core/bpf_jit_enable
  # insmod lib/test_bpf.ko
  # dmesg | grep "jited:0"

dmesg example on the ARM issue with JIT rejection:

[...]
[   67.925387] test_bpf: #2 ADD_SUB_MUL_K jited:1 24 PASS
[   67.930889] test_bpf: #3 DIV_MOD_KX jited:0 794 PASS
[   67.943940] test_bpf: #4 AND_OR_LSH_K jited:1 20 20 PASS
[...]

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Nicolas Schichan <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue May 12, 2015
Francois Romieu says:

====================
via-rhine rework

The series applies against davem-next as of
9dd3c79 ("drivers: net: xgene: fix kbuild
warnings").

Patches #1..#4 avoid holes in the receive ring.

Patch #5 is a small leftover cleanup for #1..#4.

Patches #6 and #7 are fairly simple barrier stuff.

Patch #8 closes some SMP transmit races - not that anyone really
complained about these but it's a bit hard to handwave that they
can be safely ignored. Some testing, especially SMP testing of
course, would be welcome.

. Changes since #2:
  - added dma_rmb barrier in vlan related patch 6.
  - s/wmb/dma_wmb/ in (*new*) patch 7 of 8.
  - added explicit SMP barriers in (*new*) patch 8 of 8.

. Changes since #1:
  - turned wmb() into dma_wmb() as suggested by davem and Alexander Duyck
    in patch 1 of 6.
  - forgot to reset rx_head_desc in rhine_reset_rbufs in patch 4 of 6.
  - removed rx_head_desc altogether in (*new*) patch 5 of 6
  - remoed some vlan receive uglyness in (*new*) patch 6 of 6.
====================

Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue May 12, 2015
Andrew Lunn says:

====================
More Marvell DSA refactring and fixup

This patch setup continues the refactoring and cleanup of the Marvell
DSA drivers.

Patch #1 Centralizes the duplicated parts of port setup and global
setup into the shared mv88e6xxx.

Patch #2 Centralizes looping over the ports setting them up

Patch #3 Uses mnemonics for the remaining register access in the
drivers.

Patch #4 The 6172 is actually a member of the 6352 family. This moves
the probe code into the correct driver.

Patch #5 Adds more members of the 6171 family to the 6171 driver. The
new devices are untested.

Patch #6 The 6185 is a member of the 6131 family. Add it to the probe
code of the 6131 driver.

Patch #7 and Patch #8 Simply the mutex's in mv88e6xxx.c. The SMI bus
is the bottleneck, not the granularity of the mutex's so simply the
code down to a single mutex.

Patch #8 Fixes a false positive lockdep splat, due to nested uses of
MDIO busses.

Patch #9 Fixes another false positive lockdep splat with the transmit
queue because of stacked Ethernet devices.
====================

Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue May 12, 2015
Perf top raise a warning if a kernel sample is collected but kernel map
is restricted. The warning message needs to dereference al.map->dso...

However, previous perf_event__preprocess_sample() doesn't always
guarantee al.map != NULL, for example, when kernel map is restricted.

This patch validates al.map before dereferencing, avoid the segfault.

Before this patch:

 $ cat /proc/sys/kernel/kptr_restrict
 1
 $ perf top -p  120183
 perf: Segmentation fault
 -------- backtrace --------
 /path/to/perf[0x509868]
 /lib64/libc.so.6(+0x3545f)[0x7f9a1540045f]
 /path/to/perf[0x448820]
 /path/to/perf(cmd_top+0xe3c)[0x44a5dc]
 /path/to/perf[0x4766a2]
 /path/to/perf(main+0x5f5)[0x42e545]
 /lib64/libc.so.6(__libc_start_main+0xf4)[0x7f9a153ecbd4]
 /path/to/perf[0x42e674]

And gdb call trace:

 Program received signal SIGSEGV, Segmentation fault.
 perf_event__process_sample (machine=0xa44030, sample=0x7fffffffa4c0, evsel=0xa43b00, event=0x7ffff41c3000, tool=0x7fffffffa8a0)
    at builtin-top.c:736
 736				  !RB_EMPTY_ROOT(&al.map->dso->symbols[MAP__FUNCTION]) ?
 (gdb) bt
 #0  perf_event__process_sample (machine=0xa44030, sample=0x7fffffffa4c0, evsel=0xa43b00, event=0x7ffff41c3000, tool=0x7fffffffa8a0)
     at builtin-top.c:736
 #1  perf_top__mmap_read_idx (top=top@entry=0x7fffffffa8a0, idx=idx@entry=0) at builtin-top.c:855
 #2  0x000000000044a5dd in perf_top__mmap_read (top=0x7fffffffa8a0) at builtin-top.c:872
 #3  __cmd_top (top=0x7fffffffa8a0) at builtin-top.c:997
 #4  cmd_top (argc=<optimized out>, argv=<optimized out>, prefix=<optimized out>) at builtin-top.c:1267
 #5  0x00000000004766a3 in run_builtin (p=p@entry=0x8a6ce8 <commands+264>, argc=argc@entry=3, argv=argv@entry=0x7fffffffdf70)
      at perf.c:371
 #6  0x000000000042e546 in handle_internal_command (argv=0x7fffffffdf70, argc=3) at perf.c:430
 #7  run_argv (argv=0x7fffffffdcf0, argcp=0x7fffffffdcfc) at perf.c:474
 #8  main (argc=3, argv=0x7fffffffdf70) at perf.c:589
 (gdb)

Signed-off-by: Wang Nan <[email protected]>
Tested-by: Arnaldo Carvalho de Melo <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Zefan Li <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
thehajime pushed a commit that referenced this issue May 26, 2015
In error paths, this was being called without struct_mutex held.
Leading to panics like:

  msm 1a00000.qcom,mdss_mdp: No memory protection without IOMMU
  Kernel panic - not syncing: BUG!
  CPU: 0 PID: 1409 Comm: cat Not tainted 4.0.0-dirty #4
  Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
  Call trace:
  [<ffffffc000089c78>] dump_backtrace+0x0/0x118
  [<ffffffc000089da0>] show_stack+0x10/0x20
  [<ffffffc0006686d4>] dump_stack+0x84/0xc4
  [<ffffffc0006678b4>] panic+0xd0/0x210
  [<ffffffc0003e1ce4>] drm_gem_object_free+0x5c/0x60
  [<ffffffc000402870>] adreno_gpu_cleanup+0x60/0x80
  [<ffffffc0004035a0>] a3xx_destroy+0x20/0x70
  [<ffffffc0004036f4>] a3xx_gpu_init+0x84/0x108
  [<ffffffc0004018b8>] adreno_load_gpu+0x58/0x190
  [<ffffffc000419dac>] msm_open+0x74/0x88
  [<ffffffc0003e0a48>] drm_open+0x168/0x400
  [<ffffffc0003e7210>] drm_stub_open+0xa8/0x118
  [<ffffffc0001a0e84>] chrdev_open+0x94/0x198
  [<ffffffc000199f88>] do_dentry_open+0x208/0x310
  [<ffffffc00019a4c4>] vfs_open+0x44/0x50
  [<ffffffc0001aa26c>] do_last.isra.14+0x2c4/0xc10
  [<ffffffc0001aac38>] path_openat+0x80/0x5e8
  [<ffffffc0001ac354>] do_filp_open+0x2c/0x98
  [<ffffffc00019b60c>] do_sys_open+0x13c/0x228
  [<ffffffc00019b72c>] SyS_openat+0xc/0x18
  CPU1: stopping

But there isn't any particularly good reason to hold struct_mutex for
teardown, so just standardize on calling it without the mutex held and
use the _unlocked() versions for GEM obj unref'ing

Signed-off-by: Rob Clark <[email protected]>
thehajime pushed a commit that referenced this issue Jun 11, 2015
Edward Cree says:

====================
sfc: add MCDI tracing

This patchset adds support for logging MCDI (Management-Controller-to-
 Driver Interface) interactions between the sfc driver and a bound device,
 to aid in debugging.
Solarflare has a tool to decode the resulting traces and will look to
 open-source this if there is any external interest, but the protocol is
 already detailed in drivers/net/ethernet/sfc/mcdi_pcol.h.
The logging buffer we allocate per MCDI context is a work area for
 constructing each individual message before logging it with netif_info.
The reason the buffer is long-lived is simply to avoid the overhead of
 allocating and freeing it every MCDI call, since MCDIs are already known
 to be serialised for other reasons.

--
v4: remove patch #4, which has already been applied via sshah
v3: add some explanations to cover letter and patch #4
v2: avoid long lines in cover letter; fix multiline comment style
====================

Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue Jun 11, 2015
Or Gerlitz says:

====================
mlx4 driver update, May 28, 2015

The 1st patch fixes an issue with a function running DPDK overriding
broadcast steering rules set by other functions. Please add this one
to your -stable queue.

The rest of the series from Matan and Ido deals with scaling the number
of IRQs that serve RoCE applications to be in par with the Ethernet driver.

changes from V0:
 - addressed feedback from Sergei, removed extra blank line in patch #4
====================

Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue Jul 14, 2015
To please checkpatch and the tiresome reader, add the "atmel," prefix to the
USB udc compatible string.

Signed-off-by: Nicolas Ferre <[email protected]>
Cc: <[email protected]>  #4.0+
Signed-off-by: Kevin Hilman <[email protected]>
thehajime pushed a commit that referenced this issue Jul 14, 2015
at91sam9g45, at91sam9x5 and sama5 SoCs should not use
"atmel,at91sam9rl-udc" for their USB device compatible property since
this compatible is attached to a specific hardware bug fix.

Signed-off-by: Boris Brezillon <[email protected]>
Acked-by: Alexandre Belloni <[email protected]>
Tested-by: Bo Shen <[email protected]>
Acked-by: Nicolas Ferre <[email protected]>
Cc: <[email protected]>  #4.0+
Signed-off-by: Kevin Hilman <[email protected]>
thehajime pushed a commit that referenced this issue Aug 14, 2015
Peter reported the following potential crash which I was able to
reproduce with his test program,

[  148.765788] ------------[ cut here ]------------
[  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
[  148.765797] Modules linked in:
[  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
[  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
[  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
[  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
[  148.765809] Call Trace:
[  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
[  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
[  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
[  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
[  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
[  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
[  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
[  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
[  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
[  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
[  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
[  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
[  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
[  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
[  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
[  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
[  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
[  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
[  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
[  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
[  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
[  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
[  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
[  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
[  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
[  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
[  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
[  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
[  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
[  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
[  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
[  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
[  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
[  148.765908] ---[ end trace e33ff2be78e14901 ]---

The CQM task events are not safe to be called from within interrupt
context because they require performing an IPI to read the counter value
on all sockets. And performing IPIs from within IRQ context is a
"no-no".

Make do with the last read counter value currently event in
event->count when we're invoked in this context.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Matt Fleming <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vikas Shivappa <[email protected]>
Cc: Kanaka Juvva <[email protected]>
Cc: Will Auld <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
thehajime pushed a commit that referenced this issue Aug 14, 2015
Jiri Pirko says:

====================
Introduce Mellanox Technologies Switch ASICs switchdev drivers

This patchset introduces Mellanox Technologies Switch driver infrastructure
and support for SwitchX-2 ASIC.

The driver is divided into 3 logical parts:
1) Bus - implements switch bus interface. Currently only PCI bus is
   implemented, but more buses will be added in the future. Namely I2C
   and SGMII.
   (patch #2)
2) Driver - implemements of ASIC-specific functions.
   Currently SwitchX-2 ASIC is supported, but a plan exists to introduce
   support for Spectrum ASIC in the near future.
   (patch #4)
3) Core - infrastructure that glues buses and drivers together.
   It implements register access logic (EMADs) and takes care of RX traps
   and events.
   (patch #1 and #3)
====================

Signed-off-by: David S. Miller <[email protected]>
thehajime pushed a commit that referenced this issue Aug 14, 2015
This patch adds the missing update of the transfer data width in
at_xdmac_prep_slave_sg().

Indeed, for each item in the scatter-gather list, we check whether the
transfer length is aligned with the data width provided by
dmaengine_slave_config(). If so, we directly use this data width for the
current part of the transfer we are preparing. Otherwise, the data width
is reduced to 8 bits (1 byte). Of course, the actual number of register
accesses must also be updated to match the new data width.

So one chunk was missing in the original patch (see Fixes tag below): the
number of register accesses was correctly set to (len >> fixed_dwidth) in
mbr_ubc but the real data width was not updated in mbr_cfg. Since mbr_cfg
may change for each part of the scatter-gather transfer this also explains
why the original patch used the Descriptor View 2 instead of the
Descriptor View 1.

Let's take the example of a DMA transfer to write 8bit data into an Atmel
USART with FIFOs. When FIFOs are enabled in the USART, its Transmit
Holding Register (THR) works in multidata mode, that is to say that up to
4 8bit data can be written into the THR in a single 32bit access and it is
still possible to write only one data with a 8bit access. To take
advantage of this new feature, the DMA driver was modified to allow
multiple dwidths when doing slave transfers.
For instance, when the total length is 22 bytes, the USART driver splits
the transfer into 2 parts:

First part: 20 bytes transferred through 5 32bit writes into THR
Second part: 2 bytes transferred though 2 8bit writes into THR

For the second part, the data width was first set to 4_BYTES by the USART
driver thanks to dmaengine_slave_config() then at_xdmac_prep_slave_sg()
reduces this data width to 1_BYTE because the 2 byte length is not aligned
with the original 4_BYTES data width. Since the data width is modified,
the actual number of writes into THR must be set accordingly.

Signed-off-by: Cyrille Pitchen <[email protected]>
Fixes: 6d3a7d9 ("dmaengine: at_xdmac: allow muliple dwidths when doing slave transfers")
Cc: [email protected] #4.0 and later
Acked-by: Nicolas Ferre <[email protected]>
Acked-by: Ludovic Desroches <[email protected]>
Signed-off-by: Vinod Koul <[email protected]>
thehajime pushed a commit that referenced this issue Aug 14, 2015
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: [email protected] # 3.9+
[[email protected]: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
thehajime pushed a commit that referenced this issue Aug 14, 2015
It turns out that a PV domU also requires the "Xen PV" APIC
driver. Otherwise, the flat driver is used and we get stuck in busy
loops that never exit, such as in this stack trace:

(gdb) target remote localhost:9999
Remote debugging using localhost:9999
__xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
56              while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
(gdb) bt
 #0  __xapic_wait_icr_idle () at ./arch/x86/include/asm/ipi.h:56
 #1  __default_send_IPI_shortcut (shortcut=<optimized out>,
dest=<optimized out>, vector=<optimized out>) at
./arch/x86/include/asm/ipi.h:75
 #2  apic_send_IPI_self (vector=246) at arch/x86/kernel/apic/probe_64.c:54
 #3  0xffffffff81011336 in arch_irq_work_raise () at
arch/x86/kernel/irq_work.c:47
 #4  0xffffffff8114990c in irq_work_queue (work=0xffff88000fc0e400) at
kernel/irq_work.c:100
 #5  0xffffffff8110c29d in wake_up_klogd () at kernel/printk/printk.c:2633
 #6  0xffffffff8110ca60 in vprintk_emit (facility=0, level=<optimized
out>, dict=0x0 <irq_stack_union>, dictlen=<optimized out>,
fmt=<optimized out>, args=<optimized out>)
    at kernel/printk/printk.c:1778
 #7  0xffffffff816010c8 in printk (fmt=<optimized out>) at
kernel/printk/printk.c:1868
 #8  0xffffffffc00013ea in ?? ()
 #9  0x0000000000000000 in ?? ()

Mailing-list-thread: https://lkml.org/lkml/2015/8/4/755
Signed-off-by: Jason A. Donenfeld <[email protected]>
Cc: <[email protected]>
Signed-off-by: David Vrabel <[email protected]>
thehajime pushed a commit that referenced this issue Sep 2, 2015
A recent change to the cpu_cooling code introduced a AB-BA deadlock
scenario between the cpufreq_policy_notifier_list rwsem and the
cooling_cpufreq_lock.  This is caused by cooling_cpufreq_lock being held
before the registration/removal of the notifier block (an operation
which takes the rwsem), and the notifier code itself which takes the
locks in the reverse order:

======================================================
[ INFO: possible circular locking dependency detected ]
3.18.0+ #1453 Not tainted
-------------------------------------------------------
rc.local/770 is trying to acquire lock:
 (cooling_cpufreq_lock){+.+.+.}, at: [<c04abfc4>] cpufreq_thermal_notifier+0x34/0xfc

but task is already holding lock:
 ((cpufreq_policy_notifier_list).rwsem){++++.+}, at: [<c0042f04>]  __blocking_notifier_call_chain+0x34/0x68

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 ((cpufreq_policy_notifier_list).rwsem){++++.+}:
       [<c06bc3b0>] down_write+0x44/0x9c
       [<c0043444>] blocking_notifier_chain_register+0x28/0xd8
       [<c04ad610>] cpufreq_register_notifier+0x68/0x90
       [<c04abe4c>] __cpufreq_cooling_register.part.1+0x120/0x180
       [<c04abf44>] __cpufreq_cooling_register+0x98/0xa4
       [<c04abf8c>] cpufreq_cooling_register+0x18/0x1c
       [<bf0046f8>] imx_thermal_probe+0x1c0/0x470 [imx_thermal]
       [<c037cef8>] platform_drv_probe+0x50/0xac
       [<c037b710>] driver_probe_device+0x114/0x234
       [<c037b8cc>] __driver_attach+0x9c/0xa0
       [<c0379d68>] bus_for_each_dev+0x5c/0x90
       [<c037b204>] driver_attach+0x24/0x28
       [<c037ae7c>] bus_add_driver+0xe0/0x1d8
       [<c037c0cc>] driver_register+0x80/0xfc
       [<c037cd80>] __platform_driver_register+0x50/0x64
       [<bf007018>] 0xbf007018
       [<c0008a5c>] do_one_initcall+0x88/0x1d8
       [<c0095da4>] load_module+0x1768/0x1ef8
       [<c0096614>] SyS_init_module+0xe0/0xf4
       [<c000ec00>] ret_fast_syscall+0x0/0x48

-> #0 (cooling_cpufreq_lock){+.+.+.}:
       [<c00619f8>] lock_acquire+0xb0/0x124
       [<c06ba3b4>] mutex_lock_nested+0x5c/0x3d8
       [<c04abfc4>] cpufreq_thermal_notifier+0x34/0xfc
       [<c0042bf4>] notifier_call_chain+0x4c/0x8c
       [<c0042f20>] __blocking_notifier_call_chain+0x50/0x68
       [<c0042f58>] blocking_notifier_call_chain+0x20/0x28
       [<c04ae62c>] cpufreq_set_policy+0x7c/0x1d0
       [<c04af3cc>] store_scaling_governor+0x74/0x9c
       [<c04ad418>] store+0x90/0xc0
       [<c0175384>] sysfs_kf_write+0x54/0x58
       [<c01746b4>] kernfs_fop_write+0xdc/0x190
       [<c010dcc0>] vfs_write+0xac/0x1b4
       [<c010dfec>] SyS_write+0x44/0x90
       [<c000ec00>] ret_fast_syscall+0x0/0x48

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((cpufreq_policy_notifier_list).rwsem);
                               lock(cooling_cpufreq_lock);
                               lock((cpufreq_policy_notifier_list).rwsem);
  lock(cooling_cpufreq_lock);

 *** DEADLOCK ***

7 locks held by rc.local/770:
 #0:  (sb_writers#6){.+.+.+}, at: [<c010dda0>] vfs_write+0x18c/0x1b4
 #1:  (&of->mutex){+.+.+.}, at: [<c0174678>] kernfs_fop_write+0xa0/0x190
 #2:  (s_active#52){.+.+.+}, at: [<c0174680>] kernfs_fop_write+0xa8/0x190
 #3:  (cpu_hotplug.lock){++++++}, at: [<c0026a60>] get_online_cpus+0x34/0x90
 #4:  (cpufreq_rwsem){.+.+.+}, at: [<c04ad3e0>] store+0x58/0xc0
 #5:  (&policy->rwsem){+.+.+.}, at: [<c04ad3f8>] store+0x70/0xc0
 #6:  ((cpufreq_policy_notifier_list).rwsem){++++.+}, at: [<c0042f04>] __blocking_notifier_call_chain+0x34/0x68

stack backtrace:
CPU: 0 PID: 770 Comm: rc.local Not tainted 3.18.0+ #1453
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
Backtrace:
[<c00121c8>] (dump_backtrace) from [<c0012360>] (show_stack+0x18/0x1c)
 r6:c0b85a80 r5:c0b75630 r4:00000000 r3:00000000
[<c0012348>] (show_stack) from [<c06b6c48>] (dump_stack+0x7c/0x98)
[<c06b6bcc>] (dump_stack) from [<c06b42a4>] (print_circular_bug+0x28c/0x2d8)
 r4:c0b85a80 r3:d0071d40
[<c06b4018>] (print_circular_bug) from [<c00613b0>] (__lock_acquire+0x1acc/0x1bb0)
 r10:c0b50660 r8:c09e6d80 r7:d0071d40 r6:c11d0f0c r5:00000007 r4:d0072240
[<c005f8e4>] (__lock_acquire) from [<c00619f8>] (lock_acquire+0xb0/0x124)
 r10:00000000 r9:c04abfc4 r8:00000000 r7:00000000 r6:00000000 r5:c0a06f0c
 r4:00000000
[<c0061948>] (lock_acquire) from [<c06ba3b4>] (mutex_lock_nested+0x5c/0x3d8)
 r10:ec853800 r9:c0a06ed4 r8:d0071d40 r7:c0a06ed4 r6:c11d0f0c r5:00000000
 r4:c04abfc4
[<c06ba358>] (mutex_lock_nested) from [<c04abfc4>] (cpufreq_thermal_notifier+0x34/0xfc)
 r10:ec853800 r9:ec85380c r8:d00d7d3c r7:c0a06ed4 r6:d00d7d3c r5:00000000
 r4:fffffffe
[<c04abf90>] (cpufreq_thermal_notifier) from [<c0042bf4>] (notifier_call_chain+0x4c/0x8c)
 r7:00000000 r6:00000000 r5:00000000 r4:fffffffe
[<c0042ba8>] (notifier_call_chain) from [<c0042f20>] (__blocking_notifier_call_chain+0x50/0x68)
 r8:c0a072a4 r7:00000000 r6:d00d7d3c r5:ffffffff r4:c0a06fc8 r3:ffffffff
[<c0042ed0>] (__blocking_notifier_call_chain) from [<c0042f58>] (blocking_notifier_call_chain+0x20/0x28)
 r7:ec98b540 r6:c13ebc80 r5:ed76e600 r4:d00d7d3c
[<c0042f38>] (blocking_notifier_call_chain) from [<c04ae62c>] (cpufreq_set_policy+0x7c/0x1d0)
[<c04ae5b0>] (cpufreq_set_policy) from [<c04af3cc>] (store_scaling_governor+0x74/0x9c)
 r7:ec98b540 r6:0000000c r5:ec98b540 r4:ed76e600
[<c04af358>] (store_scaling_governor) from [<c04ad418>] (store+0x90/0xc0)
 r6:0000000c r5:ed76e6d4 r4:ed76e600
[<c04ad388>] (store) from [<c0175384>] (sysfs_kf_write+0x54/0x58)
 r8:0000000c r7:d00d7f78 r6:ec98b540 r5:0000000c r4:ec853800 r3:0000000c
[<c0175330>] (sysfs_kf_write) from [<c01746b4>] (kernfs_fop_write+0xdc/0x190)
 r6:ec98b540 r5:00000000 r4:00000000 r3:c0175330
[<c01745d8>] (kernfs_fop_write) from [<c010dcc0>] (vfs_write+0xac/0x1b4)
 r10:0162aa70 r9:d00d6000 r8:0000000c r7:d00d7f78 r6:0162aa70 r5:0000000c
 r4:eccde500
[<c010dc14>] (vfs_write) from [<c010dfec>] (SyS_write+0x44/0x90)
 r10:0162aa70 r8:0000000c r7:eccde500 r6:eccde500 r5:00000000 r4:00000000
[<c010dfa8>] (SyS_write) from [<c000ec00>] (ret_fast_syscall+0x0/0x48)
 r10:00000000 r8:c000edc4 r7:00000004 r6:000216cc r5:0000000c r4:0162aa70

Solve this by moving to finer grained locking - use one mutex to protect
the cpufreq_dev_list as a whole, and a separate lock to ensure correct
ordering of cpufreq notifier registration and removal.

cooling_list_lock is taken within cooling_cpufreq_lock on
(un)registration to preserve the behavior of the code, i.e. to
atomically add/remove to the list and (un)register the notifier.

Fixes: 2dcd851 ("thermal: cpu_cooling: Update always cpufreq policy with
Reviewed-by: Viresh Kumar <[email protected]>
Signed-off-by: Russell King <[email protected]>
Signed-off-by: Viresh Kumar <[email protected]>
Signed-off-by: Eduardo Valentin <[email protected]>
thehajime pushed a commit that referenced this issue Jan 25, 2016
As part of fixing another perf issue, observed that after a perf run,
the interrupt got disabled on one/more cores.

Turns out that despite requesting perf irq as percpu, the flow handler
registered was not handle_percpu_irq()

Given that on ARCv2 cores, IRQs < 24 are always private to cpu, we
register the right handler at the very onset.

Before Fix

| [ARCLinux]# cat /proc/interrupts | grep perf
|  20:    0      0      0       0  ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
|  20:    0    522      8    51916  ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
|  20:    0    522      8   104368  ARCv2 core Intc  20 ARC perf counters

After Fix

| [ARCLinux]# cat /proc/interrupts | grep perf
|  20:    0      0      0       0  ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
|  20:  64198  62012  62697  67803  ARCv2 core Intc  20 ARC perf counters
|
| [ARCLinux]# perf record -c 20000 /sbin/hackbench
| Running with 10*40 (== 400) tasks.
|
| [ARCLinux]# cat /proc/interrupts | grep perf
|  20: 126014 122792 123301 133654  ARCv2 core Intc  20 ARC perf counters

Cc: Marc Zyngier <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Alexey Brodkin <[email protected]>
Cc: [email protected] #4.2+
Cc: [email protected]
Signed-off-by: Vineet Gupta <[email protected]>
thehajime pushed a commit that referenced this issue Jan 25, 2016
This was the second perf intr issue

perf sampling on multicore requires intr to be enabled on all cores.
ARC perf probe code used helper arc_request_percpu_irq() which calls
 - request_percpu_irq() on core0
 - enable_percpu_irq() on all all cores (including core0)

genirq requires that request be made ahead of enable call.
However if perf probe happened on non core0 (observed on a 3.18 kernel),
enable would get called ahead of request, failing obviously and
rendering perf intr disabled on all such cores

[   11.120000] 1 ARC perf       : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
[   11.130000] 1 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 3 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 2 -----> enable_percpu_irq() IRQ 20 failed
[   11.140000] 0 =====> request_percpu_irq() IRQ 20
[   11.140000] 0 -----> enable_percpu_irq() IRQ 20

Fix this fragility, by calling request_percpu_irq() on whatever core
calls probe (there is no requirement on which core calls this anyways)
and then calling enable on each cores.

Interestingly this started as invesigation of STAR 9000838902:
"sporadically IRQs enabled on perf prob"

which was about occassional boot spew as request_percpu_irq got called
non-locally (from an IPI), and re-enabled interrupts in following path
proc_mkdir ->  spin_unlock_irq()

which the irq work code didn't like.

| ARC perf     : 8 counters (48 bits), 113 conditions, [overflow IRQ support]
|
| BUG: failure at ../kernel/irq_work.c:135/irq_work_run_list()!
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.10-01127-g285efb8e66d1 #2
|
| Stack Trace:
|  arc_unwind_core.constprop.1+0x94/0x104
|  dump_stack+0x62/0x98
|  irq_work_run_list+0xb0/0xb4
|  irq_work_run+0x22/0x3c
|  do_IPI+0x74/0x9c
|  handle_irq_event_percpu+0x34/0x164
|  handle_percpu_irq+0x58/0x78
|  generic_handle_irq+0x1e/0x2c
|  arch_do_IRQ+0x3c/0x60
|  ret_from_exception+0x0/0x8

Cc: Marc Zyngier <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Alexey Brodkin <[email protected]>
Cc: <[email protected]> #4.2+
Signed-off-by: Vineet Gupta <[email protected]>
thehajime pushed a commit that referenced this issue Jan 25, 2016
When a43eec3 ("bpf: introduce bpf_perf_event_output() helper") added
PERF_COUNT_SW_BPF_OUTPUT we ended up with a new entry in the event_symbols_sw
array that wasn't initialized, thus set to NULL, fix print_symbol_events()
to check for that case so that we don't crash if this happens again.

  (gdb) bt
  #0  __match_glob (ignore_space=false, pat=<optimized out>, str=<optimized out>) at util/string.c:198
  #1  strglobmatch (str=<optimized out>, pat=pat@entry=0x7fffffffe61d "stall") at util/string.c:252
  #2  0x00000000004993a5 in print_symbol_events (type=1, syms=0x872880 <event_symbols_sw+160>, max=11, name_only=false, event_glob=0x7fffffffe61d "stall")
      at util/parse-events.c:1615
  #3  print_events (event_glob=event_glob@entry=0x7fffffffe61d "stall", name_only=false) at util/parse-events.c:1675
  #4  0x000000000042c79e in cmd_list (argc=1, argv=0x7fffffffe390, prefix=<optimized out>) at builtin-list.c:68
  #5  0x00000000004788a5 in run_builtin (p=p@entry=0x871758 <commands+120>, argc=argc@entry=2, argv=argv@entry=0x7fffffffe390) at perf.c:370
  #6  0x0000000000420ab0 in handle_internal_command (argv=0x7fffffffe390, argc=2) at perf.c:429
  #7  run_argv (argv=0x7fffffffe110, argcp=0x7fffffffe11c) at perf.c:473
  #8  main (argc=2, argv=0x7fffffffe390) at perf.c:588
  (gdb) p event_symbols_sw[PERF_COUNT_SW_BPF_OUTPUT]
  $4 = {symbol = 0x0, alias = 0x0}
  (gdb)

A patch to robustify perf to not segfault when the next counter gets added in
the kernel will follow this one.

Reported-by: Ingo Molnar <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Wang Nan <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant