Re: deadlock on 2.6.24.3-rt3 - Kernel



Thread: Re: deadlock on 2.6.24.3-rt3

  1. Re: deadlock on 2.6.24.3-rt3

    [Added Peter Zijlstra and LKML]

    Hiroshi, thanks for looking into this!

    Peter, this looks like a legit fix, could you Ack it?


    On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:

    > Hiroshi Shimamoto wrote:
    > > Hi,
    > >
    > > I got a soft lockup message on 2.6.24.3-rt3.
    > > I attached the .config.
    > >
    > > I think there is a deadlock scenario, which I explain below.
    > >
    > > Here is the console log;
    > > BUG: soft lockup - CPU#2 stuck for 11s! [bash:2175]
    > > CPU 2:
    > > Modules linked in:
    > > Pid: 2175, comm: bash Not tainted 2.6.24.3-rt3 #1
    > > RIP: 0010:[] [] __spin_lock+0x57/0x67
    > > RSP: 0000:ffff8100c52a1d48 EFLAGS: 00000202
    > > RAX: 0000000000000000 RBX: 0000000000004bc5 RCX: 0000000000004bc5
    > > RDX: 0000000000000002 RSI: 00000000006c3208 RDI: 0000000000000001
    > > RBP: 000000000000000d R08: ffff8100cbc28018 R09: ffff810007c95458
    > > R10: 00000000006c3208 R11: 0000000000000246 R12: ffffffff808246e8
    > > R13: 000284d000000002 R14: ffffffff80387277 R15: 00000000ffffffff
    > > FS: 00002b28926a2ef0(0000) GS:ffff8100cf8a3940(0000) knlGS:0000000000000000
    > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    > > CR2: 00000000006c3208 CR3: 00000000c3cac000 CR4: 00000000000006e0
    > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > > Call Trace:
    > > [] __spin_lock+0x29/0x67
    > > [] swap_info_get+0x65/0xdd
    > > [] can_share_swap_page+0x39/0x84
    > > [] do_wp_page+0x2f9/0x519
    > > [] handle_mm_fault+0x615/0x7cf
    > > [] proc_flush_task+0x171/0x29c
    > > [] recalc_sigpending+0xe/0x3c
    > > [] do_page_fault+0x162/0x754
    > > [] audit_syscall_exit+0x31c/0x37a
    > > [] error_exit+0x0/0x51
    > > ---------------------------
    > > | preempt count: 00010002 ]
    > > | 2-level deep critical section nesting:
    > > ----------------------------------------
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > > BUG: soft lockup - CPU#3 stuck for 11s! [stress:9460]
    > > CPU 3:
    > > Modules linked in:
    > > Pid: 9460, comm: stress Not tainted 2.6.24.3-rt3 #1
    > > RIP: 0010:[] [] find_get_page+0xad/0xbe
    > > RSP: 0018:ffff8100cbf25b88 EFLAGS: 00000202
    > > RAX: 0000000000002009 RBX: ffffffff80824bc8 RCX: 0000000000000002
    > > RDX: 0000000000000002 RSI: ffff8100cbfcf298 RDI: ffff810005ad8910
    > > RBP: ffffffff80383a57 R08: ffff810005ad8918 R09: 0000000000000003
    > > R10: ffff810005ad88d8 R11: 0000000000000001 R12: ffffffff80822880
    > > R13: ffff81000799ce48 R14: ffffffff8028921c R15: ffffffff80822880
    > > FS: 00002acaa373bb00(0000) GS:ffff8100cf8a32c0(0000) knlGS:0000000000000000
    > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    > > CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
    > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    > > Call Trace:
    > > [] find_get_page+0x23/0xbe
    > > [] free_swap_and_cache+0x46/0xdd
    > > [] unmap_vmas+0x626/0x8ce
    > > [] exit_mmap+0xac/0x147
    > > [] mmput+0x32/0xae
    > > [] do_exit+0x199/0x914
    > > [] __dequeue_signal+0x19/0x1b7
    > > [] do_group_exit+0x2c/0x7e
    > > [] get_signal_to_deliver+0x2ef/0x4aa
    > > [] do_notify_resume+0xa8/0x7cd
    > > [] add_preempt_count+0x14/0x111
    > > [] __up_read+0x13/0x8d
    > > [] do_page_fault+0x187/0x754
    > > [] __dequeue_entity+0x2d/0x34
    > > [] __switch_to+0x27/0x2c9
    > > [] do_page_fault+0x1f4/0x754
    > > [] retint_signal+0x3d/0x7f
    > > ---------------------------
    > > | preempt count: 00010005 ]
    > > | 5-level deep critical section nesting:
    > > ----------------------------------------
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > > .. [] .... find_get_page+0x14/0xbe
    > > .....[<00000000>] .. ( <= 0x0)
    > > .. [] .... __spin_lock+0xe/0x67
    > > .....[<00000000>] .. ( <= 0x0)
    > >
    > >
    > > I also got a kernel core.
    > > (gdb) info thr
    > > 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=,
    > > offset=18446744071570598016) at include/asm/processor_64.h:385
    > > 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    > > 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
    > > * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    > >
    > > CPU 3 (thread 4) is in find_get_page(), and the others are in __spin_lock().
    > > Thread 4 is waiting for the PG_nonewrefs bit to be cleared in wait_on_page_ref(),
    > > which is called from page_cache_get_speculative(), and thread 4 holds the swap_lock.
    > > The other threads are waiting for the swap_lock.
    > > On the other hand, thread 1 set the PG_nonewrefs bit by calling
    > > lock_page_ref_irq() in remove_mapping(), and is then waiting for the swap_lock.
    > > So if the target page of remove_mapping() belongs to the exiting process's memory,
    > > the kernel deadlocks.
    > >
    > > (gdb) bt
    > > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    > > #1 0xffffffff80296597 in swap_info_get (entry=)
    > > at mm/swapfile.c:253
    > > #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
    > > #3 0xffffffff80286acd in remove_mapping (mapping=,
    > > page=0xffff810005ad8910) at mm/vmscan.c:423
    > > ...
    > >
    > > (gdb) thr 2
    > > (gdb) bt
    > > #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
    > > #1 0xffffffff80296374 in valid_swaphandles (entry=,
    > > offset=0xffff81001e22bc78) at mm/swapfile.c:1783
    > > #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
    > > at mm/memory.c:2054
    > > #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
    > > pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
    > > ...
    > >
    > > (gdb) thr 3
    > > (gdb) bt
    > > #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    > > #1 0xffffffff80296597 in swap_info_get (entry=)
    > > at mm/swapfile.c:253
    > > #2 0xffffffff80296c0c in can_share_swap_page (page=)
    > > at mm/swapfile.c:317
    > > #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
    > > address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
    > > ptl=0xffff810007c95458, orig_pte=) at mm/memory.c:1606
    > > ...
    > >
    > > (gdb) thr 4
    > > (gdb) bt
    > > #0 0xffffffff8027c974 in find_get_page (mapping=,
    > > offset=18446744071570598016) at include/asm/processor_64.h:385
    > > #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
    > > #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
    > > start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
    > > details=0x0) at mm/memory.c:728
    > > #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
    > > #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
    > > #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
    > > ...
    > >
    > >
    > > I think this comes from the lockless speculative get-page patch.
    > > I found a newer version of this patch on linux-mm:
    > > http://marc.info/?l=linux-mm&m=119477111927364&w=2
    > >
    > > I haven't tested it because it is a big change and hard to apply,
    > > but it seems to fix this deadlock issue.
    > > Any other patch that fixes this issue is welcome.
    > >

    >
    > Is this patch good?
    >
    > ---
    > From: Hiroshi Shimamoto
    > Subject: [PATCH] avoid deadlock related with PG_nonewrefs and swap_lock
    >
    > There is a deadlock scenario: remove_mapping() vs. free_swap_and_cache().
    > remove_mapping() sets the PG_nonewrefs bit, then takes the swap_lock.
    > free_swap_and_cache() takes the swap_lock, then waits for the PG_nonewrefs
    > bit to be cleared in find_get_page().
    >
    > The swap_lock can be released before calling find_get_page().
    >
    > Signed-off-by: Hiroshi Shimamoto
    > ---
    > mm/swapfile.c | 5 +++--
    > 1 files changed, 3 insertions(+), 2 deletions(-)
    >
    > diff --git a/mm/swapfile.c b/mm/swapfile.c
    > index 5036b70..581afee 100644
    > --- a/mm/swapfile.c
    > +++ b/mm/swapfile.c
    > @@ -400,13 +400,14 @@ void free_swap_and_cache(swp_entry_t entry)
    >          p = swap_info_get(entry);
    >          if (p) {
    >                  if (swap_entry_free(p, swp_offset(entry)) == 1) {
    > +                        spin_unlock(&swap_lock);
    >                          page = find_get_page(&swapper_space, entry.val);
    >                          if (page && unlikely(TestSetPageLocked(page))) {
    >                                  page_cache_release(page);
    >                                  page = NULL;
    >                          }
    > -                }
    > -                spin_unlock(&swap_lock);
    > +                } else
    > +                        spin_unlock(&swap_lock);
    >          }
    >          if (page) {
    >                  int one_user;
    > --
    > 1.5.4.1
    >

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: deadlock on 2.6.24.3-rt3

    On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
    > [Added Peter Zijlstra and LKML]


    Yeah, patches and such should really go to LKML as LKML is the -rt
    development list.

    > Hiroshi, thanks for looking into this!
    >
    > Peter, this looks like a legit fix, could you Ack it.
    >
    >
    > On Mon, 17 Mar 2008, Hiroshi Shimamoto wrote:
    >
    > > Hiroshi Shimamoto wrote:
    > > > [...]
    > > > I think this comes from the lockless speculative get-page patch.
    > > > I found a newer version of this patch on linux-mm:
    > > > http://marc.info/?l=linux-mm&m=119477111927364&w=2
    > > >
    > > > I haven't tested it because it is a big change and hard to apply,
    > > > but it seems to fix this deadlock issue.
    > > > Any other patch that fixes this issue is welcome.


    Yeah, in the latest lockless pagecache patches Nick got rid of
    PG_nonewrefs, as suggested by Hugh. However, -rt also carries my
    concurrent pagecache patches, and those need PG_nonewrefs on their
    own, so I hadn't bothered to update to Nick's latest.

    Perhaps I ought to; as you point out, page_cache_get_speculative() is
    cleaner in his latest. /me puts it on his overlong TODO list.

    > > Is this patch good?


    Yes, it does look good. Doesn't remove_exclusive_swap_page() also nest
    PG_nonewrefs inside of swap_lock?

    Looks to me like that also needs fixing.




  3. Re: deadlock on 2.6.24.3-rt3

    Peter Zijlstra wrote:
    > On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
    >> [Added Peter Zijlstra and LKML]

    >
    > Yeah, patches and such should really go to LKML as LKML is the -rt
    > development list.


    Sorry, I forgot to add LKML; it's mentioned in the RT wiki.
    I'm also not sure about the BUG report. Should I send it to LKML, too?

    >>> Is this patch good?

    >
    > Yes, it does look good. Doesn't remove_exclusive_swap_page() also nest
    > PG_nonewrefs inside of swap_lock?


    I'm not sure; I hadn't noticed that, and I haven't checked all the
    PG_nonewrefs users. I will look into it.

    By the way, unfortunately, I got another BUG with my patch applied; I'll
    report it separately. The new issue is related to SLAB:
    kernel BUG at mm/slab.c:3125!

    thanks,
    Hiroshi Shimamoto

  4. Re: deadlock on 2.6.24.3-rt3

    Hiroshi Shimamoto wrote:
    > Peter Zijlstra wrote:
    >> On Mon, 2008-03-17 at 21:53 -0400, Steven Rostedt wrote:
    >>> [Added Peter Zijlstra and LKML]

    >> Yeah, patches and such should really go to LKML as LKML is the -rt
    >> development list.

    >
    > Sorry, I missed to add LKML. It's written in RT wiki.
    > I'm not sure about BUG report too. Should I send it to LKML?
    >>>>> CR2: 00002b9b2827c530 CR3: 000000006cc90000 CR4: 00000000000006e0
    >>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    >>>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    >>>>> Call Trace:
    >>>>> [] find_get_page+0x23/0xbe
    >>>>> [] free_swap_and_cache+0x46/0xdd
    >>>>> [] unmap_vmas+0x626/0x8ce
    >>>>> [] exit_mmap+0xac/0x147
    >>>>> [] mmput+0x32/0xae
    >>>>> [] do_exit+0x199/0x914
    >>>>> [] __dequeue_signal+0x19/0x1b7
    >>>>> [] do_group_exit+0x2c/0x7e
    >>>>> [] get_signal_to_deliver+0x2ef/0x4aa
    >>>>> [] do_notify_resume+0xa8/0x7cd
    >>>>> [] add_preempt_count+0x14/0x111
    >>>>> [] __up_read+0x13/0x8d
    >>>>> [] do_page_fault+0x187/0x754
    >>>>> [] __dequeue_entity+0x2d/0x34
    >>>>> [] __switch_to+0x27/0x2c9
    >>>>> [] do_page_fault+0x1f4/0x754
    >>>>> [] retint_signal+0x3d/0x7f
    >>>>> ---------------------------
    >>>>> | preempt count: 00010005 ]
    >>>>> | 5-level deep critical section nesting:
    >>>>> ----------------------------------------
    >>>>> .. [] .... __spin_lock+0xe/0x67
    >>>>> .....[<00000000>] .. ( <= 0x0)
    >>>>> .. [] .... __spin_lock+0xe/0x67
    >>>>> .....[<00000000>] .. ( <= 0x0)
    >>>>> .. [] .... __spin_lock+0xe/0x67
    >>>>> .....[<00000000>] .. ( <= 0x0)
    >>>>> .. [] .... find_get_page+0x14/0xbe
    >>>>> .....[<00000000>] .. ( <= 0x0)
    >>>>> .. [] .... __spin_lock+0xe/0x67
    >>>>> .....[<00000000>] .. ( <= 0x0)
    >>>>>
    >>>>>
    >>>>> I also got a kernel core.
    >>>>> (gdb) info thr
    >>>>> 4 process 9460 0xffffffff8027c974 in find_get_page (mapping=,
    >>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
    >>>>> 3 process 2175 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    >>>>> 2 process 9132 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
    >>>>> * 1 process 9478 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    >>>>>
    >>>>> CPU3 (thread 4) is in find_get_page(), and the others are in __spin_lock().
    >>>>> Thread 4 is waiting in wait_on_page_ref(), which is called from
    >>>>> page_cache_get_speculative(), for the PG_nonewrefs bit to be turned off,
    >>>>> and thread 4 holds the swap_lock.
    >>>>> The other threads are waiting for the swap_lock.
    >>>>> On the other hand, thread 1 turned the PG_nonewrefs bit on by calling
    >>>>> lock_page_ref_irq() in remove_mapping(), and is then waiting for the swap_lock.
    >>>>> So if the target page of remove_mapping() is in the exiting process's memory,
    >>>>> the kernel deadlocks.
    >>>>>
    >>>>> (gdb) bt
    >>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    >>>>> #1 0xffffffff80296597 in swap_info_get (entry=)
    >>>>> at mm/swapfile.c:253
    >>>>> #2 0xffffffff80296618 in swap_free (entry={val = 1}) at mm/swapfile.c:300
    >>>>> #3 0xffffffff80286acd in remove_mapping (mapping=,
    >>>>> page=0xffff810005ad8910) at mm/vmscan.c:423
    >>>>> ...
    >>>>>
    >>>>> (gdb) thr 2
    >>>>> (gdb) bt
    >>>>> #0 __spin_lock (lock=0xffffffff80893f80) at include/asm/spinlock_64.h:22
    >>>>> #1 0xffffffff80296374 in valid_swaphandles (entry=,
    >>>>> offset=0xffff81001e22bc78) at mm/swapfile.c:1783
    >>>>> #2 0xffffffff8028b0af in swapin_readahead (entry={val = 1}, addr=0, vma=0x1)
    >>>>> at mm/memory.c:2054
    >>>>> #3 0xffffffff8029a6af in shmem_getpage (inode=0xffff8100cdf4fd48, idx=0,
    >>>>> pagep=0xffff81001e22bd80, sgp=SGP_FAULT, type=0xffff81001e22bd34) at mm/shmem.c:1089
    >>>>> ...
    >>>>>
    >>>>> (gdb) thr 3
    >>>>> (gdb) bt
    >>>>> #0 __spin_lock (lock=0xffffffff80893f80) at kernel/spinlock.c:333
    >>>>> #1 0xffffffff80296597 in swap_info_get (entry=)
    >>>>> at mm/swapfile.c:253
    >>>>> #2 0xffffffff80296c0c in can_share_swap_page (page=)
    >>>>> at mm/swapfile.c:317
    >>>>> #3 0xffffffff8028ae6e in do_wp_page (mm=0xffff8100ce772f40, vma=0xffff8100cd212f00,
    >>>>> address=7090696, page_table=0xffff8100cbcef618, pmd=0xffff8100cbc28018,
    >>>>> ptl=0xffff810007c95458, orig_pte=) at mm/memory.c:1606
    >>>>> ...
    >>>>>
    >>>>> (gdb) thr 4
    >>>>> (gdb) bt
    >>>>> #0 0xffffffff8027c974 in find_get_page (mapping=,
    >>>>> offset=18446744071570598016) at include/asm/processor_64.h:385
    >>>>> #1 0xffffffff80296f83 in free_swap_and_cache (entry={val = 4032}) at mm/swapfile.c:403
    >>>>> #2 0xffffffff8028b9b7 in unmap_vmas (tlbp=0xffff8100cbf25cd8, vma=0xffff8100cde5c678,
    >>>>> start_addr=0, end_addr=18446744073709551615, nr_accounted=0xffff8100cbf25cd0,
    >>>>> details=0x0) at mm/memory.c:728
    >>>>> #3 0xffffffff8028fa4c in exit_mmap (mm=0xffff8100cd093600) at mm/mmap.c:2048
    >>>>> #4 0xffffffff8023ced7 in mmput (mm=0xffff8100cd093600) at kernel/fork.c:443
    >>>>> #5 0xffffffff80242f00 in do_exit (code=14) at kernel/exit.c:997
    >>>>> ...
    >>>>>
    >>>>>
    >>>>> I think it came from the lockless speculative get page patch.
    >>>>> I found the newer version of this patch in linux-mm.
    >>>>> http://marc.info/?l=linux-mm&m=119477111927364&w=2
    >>>>>
    >>>>> I haven't tested it because it looks like a big change and is hard to apply.
    >>>>> But it seems to fix this deadlock issue.
    >>>>> Any other patch to fix this issue is welcome.

    >> Yeah, in the latest lockless pagecache patches Nick got rid of
    >> PG_nonewrefs, as suggested by Hugh. However, -rt also has my concurrent
    >> pagecache patches, and those need PG_nonewrefs on their own, so I hadn't
    >> bothered to update to Nick's latest.
    >>
    >> Perhaps I ought to; as you point out, page_cache_get_speculative() is
    >> cleaner in his latest.. /me puts it on his overlong TODO list.
    >>
    >>>> Is this patch good?

    >> Yes, it does look good. Doesn't remove_exclusive_swap_page() also nest
    >> PG_nonewrefs inside of swap_lock?

    >
    > I'm not sure about that. I hadn't noticed it, and I haven't checked all the
    > PG_nonewrefs users. Will look into it.
    >


    Is this a valid fix?

    diff --git a/mm/swapfile.c b/mm/swapfile.c
    index 581afee..6fbc77e 100644
    --- a/mm/swapfile.c
    +++ b/mm/swapfile.c
    @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
             /* Is the only swap cache user the cache itself? */
             retval = 0;
             if (p->swap_map[swp_offset(entry)] == 1) {
    +                spin_unlock(&swap_lock);
                     /* Recheck the page count with the swapcache lock held.. */
                     lock_page_ref_irq(page);
                     if ((page_count(page) == 2) && !PageWriteback(page)) {
    @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
                             retval = 1;
                     }
                     unlock_page_ref_irq(page);
    -        }
    -        spin_unlock(&swap_lock);
    +        } else
    +                spin_unlock(&swap_lock);
     
             if (retval) {
                     swap_free(entry);



  5. [PATCH -rt] avoid deadlock related to PG_nonewrefs and swap_lock

    Hi Peter,

    I've updated the patch. Could you please review it?

    I'm also thinking that it could go into mainline, since it shortens
    the swap_lock hold time. Correct?

    ---
    From: Hiroshi Shimamoto

    There is a deadlock scenario: remove_mapping() vs. free_swap_and_cache().
    remove_mapping() turns the PG_nonewrefs bit on, then takes the swap_lock.
    free_swap_and_cache() takes the swap_lock, then waits in find_get_page()
    for the PG_nonewrefs bit to be turned off.

    The swap_lock can be unlocked before calling find_get_page().

    In remove_exclusive_swap_page() there is a similar lock sequence:
    swap_lock, then the PG_nonewrefs bit. The swap_lock can be unlocked
    before turning the PG_nonewrefs bit on.

    Signed-off-by: Hiroshi Shimamoto
    ---
    mm/swapfile.c | 10 ++++++----
    1 files changed, 6 insertions(+), 4 deletions(-)

    diff --git a/mm/swapfile.c b/mm/swapfile.c
    index 5036b70..6fbc77e 100644
    --- a/mm/swapfile.c
    +++ b/mm/swapfile.c
    @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
             /* Is the only swap cache user the cache itself? */
             retval = 0;
             if (p->swap_map[swp_offset(entry)] == 1) {
    +                spin_unlock(&swap_lock);
                     /* Recheck the page count with the swapcache lock held.. */
                     lock_page_ref_irq(page);
                     if ((page_count(page) == 2) && !PageWriteback(page)) {
    @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
                             retval = 1;
                     }
                     unlock_page_ref_irq(page);
    -        }
    -        spin_unlock(&swap_lock);
    +        } else
    +                spin_unlock(&swap_lock);
     
             if (retval) {
                     swap_free(entry);
    @@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
             p = swap_info_get(entry);
             if (p) {
                     if (swap_entry_free(p, swp_offset(entry)) == 1) {
    +                        spin_unlock(&swap_lock);
                             page = find_get_page(&swapper_space, entry.val);
                             if (page && unlikely(TestSetPageLocked(page))) {
                                     page_cache_release(page);
                                     page = NULL;
                             }
    -                }
    -                spin_unlock(&swap_lock);
    +                } else
    +                        spin_unlock(&swap_lock);
             }
             if (page) {
                     int one_user;
    --
    1.5.4.1



  6. Re: [PATCH -rt] avoid deadlock related to PG_nonewrefs and swap_lock

    On Mon, 2008-03-24 at 11:24 -0700, Hiroshi Shimamoto wrote:
    > Hi Peter,
    >
    > I've updated the patch. Could you please review it?
    >
    > I'm also thinking that it could go into mainline, since it shortens
    > the swap_lock hold time. Correct?


    Possibly yeah, Nick, Hugh?

    > ---
    > From: Hiroshi Shimamoto
    >
    > There is a deadlock scenario: remove_mapping() vs. free_swap_and_cache().
    > remove_mapping() turns the PG_nonewrefs bit on, then takes the swap_lock.
    > free_swap_and_cache() takes the swap_lock, then waits in find_get_page()
    > for the PG_nonewrefs bit to be turned off.
    >
    > The swap_lock can be unlocked before calling find_get_page().
    >
    > In remove_exclusive_swap_page() there is a similar lock sequence:
    > swap_lock, then the PG_nonewrefs bit. The swap_lock can be unlocked
    > before turning the PG_nonewrefs bit on.


    I worry about this. Once we free the swap entry with swap_entry_free()
    and drop the swap_lock, another task is basically free to re-use that
    swap location and try to insert another page in that same spot in
    add_to_swap() - read_swap_cache_async() can't race because it would mean
    it still has a swap entry pinned.

    However, add_to_swap() can already handle the race, because it used to
    race against read_swap_cache_async(). It also swap_free()s the entry so
    as to not leak entries. So I think this is indeed correct.

    [ I ought to find some time to port the concurrent page-cache patches on
    top of Nick's latest lockless series, Hugh's suggestion makes the
    speculative get much nicer. ]

    > Signed-off-by: Hiroshi Shimamoto


    Acked-by: Peter Zijlstra

    > ---
    > mm/swapfile.c | 10 ++++++----
    > 1 files changed, 6 insertions(+), 4 deletions(-)
    >
    > diff --git a/mm/swapfile.c b/mm/swapfile.c
    > index 5036b70..6fbc77e 100644
    > --- a/mm/swapfile.c
    > +++ b/mm/swapfile.c
    > @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
    > /* Is the only swap cache user the cache itself? */
    > retval = 0;
    > if (p->swap_map[swp_offset(entry)] == 1) {
    > + spin_unlock(&swap_lock);
    > /* Recheck the page count with the swapcache lock held.. */
    > lock_page_ref_irq(page);
    > if ((page_count(page) == 2) && !PageWriteback(page)) {
    > @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
    > retval = 1;
    > }
    > unlock_page_ref_irq(page);
    > - }
    > - spin_unlock(&swap_lock);
    > + } else
    > + spin_unlock(&swap_lock);
    >
    > if (retval) {
    > swap_free(entry);
    > @@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
    > p = swap_info_get(entry);
    > if (p) {
    > if (swap_entry_free(p, swp_offset(entry)) == 1) {
    > + spin_unlock(&swap_lock);
    > page = find_get_page(&swapper_space, entry.val);
    > if (page && unlikely(TestSetPageLocked(page))) {
    > page_cache_release(page);
    > page = NULL;
    > }
    > - }
    > - spin_unlock(&swap_lock);
    > + } else
    > + spin_unlock(&swap_lock);
    > }
    > if (page) {
    > int one_user;



  7. Re: [PATCH -rt] avoid deadlock related to PG_nonewrefs and swap_lock

    On Wed, 2008-03-26 at 10:50 +0100, Peter Zijlstra wrote:
    > On Mon, 2008-03-24 at 11:24 -0700, Hiroshi Shimamoto wrote:
    > > Hi Peter,
    > >
    > > I've updated the patch. Could you please review it?
    > >
    > > I'm also thinking that it could go into mainline, since it shortens
    > > the swap_lock hold time. Correct?

    >
    > Possibly yeah, Nick, Hugh?
    >
    > > ---
    > > From: Hiroshi Shimamoto
    > >
    > > There is a deadlock scenario: remove_mapping() vs. free_swap_and_cache().
    > > remove_mapping() turns the PG_nonewrefs bit on, then takes the swap_lock.
    > > free_swap_and_cache() takes the swap_lock, then waits in find_get_page()
    > > for the PG_nonewrefs bit to be turned off.
    > >
    > > The swap_lock can be unlocked before calling find_get_page().
    > >
    > > In remove_exclusive_swap_page() there is a similar lock sequence:
    > > swap_lock, then the PG_nonewrefs bit. The swap_lock can be unlocked
    > > before turning the PG_nonewrefs bit on.

    >
    > I worry about this. Once we free the swap entry with swap_entry_free()
    > and drop the swap_lock, another task is basically free to re-use that
    > swap location and try to insert another page in that same spot in
    > add_to_swap() - read_swap_cache_async() can't race because it would mean
    > it still has a swap entry pinned.


    D'oh, of course it can race; otherwise the add_to_swap() vs
    read_swap_cache_async() race wouldn't exist.

    Still, given that add_to_swap() handles the race, I suspect the other end
    does the right thing as well.

    > However, add_to_swap() can already handle the race, because it used to
    > race against read_swap_cache_async(). It also swap_free()s the entry so
    > as to not leak entries. So I think this is indeed correct.
    >
    > [ I ought to find some time to port the concurrent page-cache patches on
    > top of Nick's latest lockless series, Hugh's suggestion makes the
    > speculative get much nicer. ]
    >
    > > Signed-off-by: Hiroshi Shimamoto

    >
    > Acked-by: Peter Zijlstra
    >
    > > ---
    > > mm/swapfile.c | 10 ++++++----
    > > 1 files changed, 6 insertions(+), 4 deletions(-)
    > >
    > > diff --git a/mm/swapfile.c b/mm/swapfile.c
    > > index 5036b70..6fbc77e 100644
    > > --- a/mm/swapfile.c
    > > +++ b/mm/swapfile.c
    > > @@ -366,6 +366,7 @@ int remove_exclusive_swap_page(struct page *page)
    > > /* Is the only swap cache user the cache itself? */
    > > retval = 0;
    > > if (p->swap_map[swp_offset(entry)] == 1) {
    > > + spin_unlock(&swap_lock);
    > > /* Recheck the page count with the swapcache lock held.. */
    > > lock_page_ref_irq(page);
    > > if ((page_count(page) == 2) && !PageWriteback(page)) {
    > > @@ -374,8 +375,8 @@ int remove_exclusive_swap_page(struct page *page)
    > > retval = 1;
    > > }
    > > unlock_page_ref_irq(page);
    > > - }
    > > - spin_unlock(&swap_lock);
    > > + } else
    > > + spin_unlock(&swap_lock);
    > >
    > > if (retval) {
    > > swap_free(entry);
    > > @@ -400,13 +401,14 @@ void free_swap_and_cache(swp_entry_t entry)
    > > p = swap_info_get(entry);
    > > if (p) {
    > > if (swap_entry_free(p, swp_offset(entry)) == 1) {
    > > + spin_unlock(&swap_lock);
    > > page = find_get_page(&swapper_space, entry.val);
    > > if (page && unlikely(TestSetPageLocked(page))) {
    > > page_cache_release(page);
    > > page = NULL;
    > > }
    > > - }
    > > - spin_unlock(&swap_lock);
    > > + } else
    > > + spin_unlock(&swap_lock);
    > > }
    > > if (page) {
    > > int one_user;


