[RFC PATCH 0/2] fast_gup for shared futexes - Kernel

This is a discussion on [RFC PATCH 0/2] fast_gup for shared futexes - Kernel ; Hi, this patch series removes mmap_sem from the fast path of shared futexes by making use of Nick's recent fast_gup() patches. Full series at: http://programming.kicks-ass.net/ker...v2.6.24.4-rt4/ -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body ...

+ Reply to Thread
Results 1 to 6 of 6

Thread: [RFC PATCH 0/2] fast_gup for shared futexes

  1. [RFC PATCH 0/2] fast_gup for shared futexes

    Hi,

    this patch series removes mmap_sem from the fast path of shared futexes by
    making use of Nick's recent fast_gup() patches. Full series at:

    http://programming.kicks-ass.net/ker...v2.6.24.4-rt4/

    --

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [RFC PATCH 0/2] fast_gup for shared futexes

    On Fri, 4 Apr 2008, Peter Zijlstra wrote:
    > Hi,
    >
    > this patch series removes mmap_sem from the fast path of shared futexes by
    > making use of Nick's recent fast_gup() patches. Full series at:
    >
    > http://programming.kicks-ass.net/ker...v2.6.24.4-rt4/


    Looks good at the first glance. Need to look at the corner cases, but
    the code does not really depend on mmap_sem anymore so the chance that
    it blows up is pretty low.

    Thanks,

    tglx
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes

    On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
    > On the way of getting rid of the mmap_sem requirement for shared futexes,
    > start by relying on get_user_pages().
    >
    > This requires we get the page associated with the key, and put the page
    > when we're done with it.


    Hi Peter,

    Cool.

    I'm all for removing mmap_sem requirement from shared futexes...
    Are there many apps which make a non-trivial use of them I wonder?
    I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
    performance too, though.

    What I'm worried about with this is invalidate or truncate races.
    With direct IO, it obviously doesn't matter because the only
    requirement is that the page existed at the address at some point
    during the syscall...

    So I'd really like you to not carry the page around in the key, but
    just continue using the same key we have now. Also, lock the page
    and ensure it hasn't been truncated before taking the inode from the
    key and incrementing its count (page lock's extra atomics should be
    more or less cancelled out by fewer mmap_sem atomic ops).

    get_futex_key should look something like this I would have thought:?

    BTW. I like that it removes a lot of fshared crap from around
    the place. And also this is a really good user of fast_gup
    because I guess it should usually be faulted in. The problem is
    that this could be a little more expensive for architectures that
    don't implement fast_gup. Though most should be able to.

    @@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
    {
    unsigned long address = (unsigned long)uaddr;
    struct mm_struct *mm = current->mm;
    - struct vm_area_struct *vma;
    struct page *page;
    int err;

    @@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
    * Note : We do have to check 'uaddr' is a valid user address,
    * but access_ok() should be faster than find_vma()
    */
    - if (!fshared) {
    + if (likely(!fshared)) {
    if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
    return -EFAULT;
    key->private.mm = mm;
    key->private.address = address;
    return 0;
    }
    - /*
    - * The futex is hashed differently depending on whether
    - * it's in a shared or private mapping. So check vma first.
    - */
    - vma = find_extend_vma(mm, address);
    - if (unlikely(!vma))
    - return -EFAULT;
    -
    - /*
    - * Permissions.
    - */
    - if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
    - return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;

    +again:
    + err = fast_gup(address, 1, 0, &page);
    + if (err < 0)
    + return err;
    +
    + lock_page(page);
    + if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
    + unlock_page(page);
    + put_page(page);
    + goto again;
    + }
    +
    /*
    * Private mappings are handled in a simple way.
    *
    @@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
    * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
    * mappings of _writable_ handles.
    */
    - if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
    - key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm
    *
    /
    + if (PageAnon(page)) {
    + key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
    key->private.mm = mm;
    key->private.address = address;
    - return 0;
    - }
    -
    - /*
    - * Linear file mappings are also simple.
    - */
    - key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
    - key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    - if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
    - key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
    - + vma->vm_pgoff);
    - return 0;
    - }
    -
    - /*
    - * We could walk the page table to read the non-linear
    - * pte, and get the page index without fetching the page
    - * from swap. But that's a lot of code to duplicate here
    - * for a rare case, so we simply fetch the page.
    - */
    - err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
    - if (err >= 0) {
    - key->shared.pgoff =
    - page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    - put_page(page);
    - return 0;
    + } else {
    + key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    + key->shared.inode = page->mapping->inode;
    + key->shared.pgoff = page->index;
    }
    - return err;
    +out:
    + unlock_page(page);
    + put_page(page);
    + return 0;
    }

    /*

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes

    On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:
    > On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
    > > On the way of getting rid of the mmap_sem requirement for shared futexes,
    > > start by relying on get_user_pages().
    > >
    > > This requires we get the page associated with the key, and put the page
    > > when we're done with it.

    >
    > Hi Peter,
    >
    > Cool.
    >
    > I'm all for removing mmap_sem requirement from shared futexes...
    > Are there many apps which make a non-trivial use of them I wonder?


    No idea. I've heard some stories, but I've never seen any code..

    > I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
    > performance too, though.


    Yeah, that was one of the motivations for this patch.

    > What I'm worried about with this is invalidate or truncate races.
    > With direct IO, it obviously doesn't matter because the only
    > requirement is that the page existed at the address at some point
    > during the syscall...
    >
    > So I'd really like you to not carry the page around in the key, but
    > just continue using the same key we have now. Also, lock the page
    > and ensure it hasn't been truncated before taking the inode from the
    > key and incrementing its count (page lock's extra atomics should be
    > more or less cancelled out by fewer mmap_sem atomic ops).
    >
    > get_futex_key should look something like this I would have thought:?


    Does look nicer, will have to ponder it a bit though.

    I must admit to not fully understanding why we take inode/mm references
    in the key anyway as neither will stop unmapping the futex.

    Will make this into a nice patch.. thanks!

    > BTW. I like that it removes a lot of fshared crap from around
    > the place. And also this is a really good user of fast_gup
    > because I guess it should usually be faulted in. The problem is
    > that this could be a little more expensive for architectures that
    > don't implement fast_gup. Though most should be able to.


    Yeah, if that really becomes a problem (but I doubt it will) we could
    possibly make the old and new scheme work based on ARCH_HAVE_PTE_SPECIAL
    or something ugly like that.

    > @@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
    > {
    > unsigned long address = (unsigned long)uaddr;
    > struct mm_struct *mm = current->mm;
    > - struct vm_area_struct *vma;
    > struct page *page;
    > int err;
    >
    > @@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
    > * Note : We do have to check 'uaddr' is a valid user address,
    > * but access_ok() should be faster than find_vma()
    > */
    > - if (!fshared) {
    > + if (likely(!fshared)) {
    > if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
    > return -EFAULT;
    > key->private.mm = mm;
    > key->private.address = address;
    > return 0;
    > }
    > - /*
    > - * The futex is hashed differently depending on whether
    > - * it's in a shared or private mapping. So check vma first.
    > - */
    > - vma = find_extend_vma(mm, address);
    > - if (unlikely(!vma))
    > - return -EFAULT;
    > -
    > - /*
    > - * Permissions.
    > - */
    > - if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
    > - return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
    >
    > +again:
    > + err = fast_gup(address, 1, 0, &page);
    > + if (err < 0)
    > + return err;
    > +
    > + lock_page(page);
    > + if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
    > + unlock_page(page);
    > + put_page(page);
    > + goto again;
    > + }
    > +
    > /*
    > * Private mappings are handled in a simple way.
    > *
    > @@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
    > * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
    > * mappings of _writable_ handles.
    > */
    > - if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
    > - key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm
    > *
    > /
    > + if (PageAnon(page)) {
    > + key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
    > key->private.mm = mm;
    > key->private.address = address;
    > - return 0;
    > - }
    > -
    > - /*
    > - * Linear file mappings are also simple.
    > - */
    > - key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
    > - key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    > - if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
    > - key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
    > - + vma->vm_pgoff);
    > - return 0;
    > - }
    > -
    > - /*
    > - * We could walk the page table to read the non-linear
    > - * pte, and get the page index without fetching the page
    > - * from swap. But that's a lot of code to duplicate here
    > - * for a rare case, so we simply fetch the page.
    > - */
    > - err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
    > - if (err >= 0) {
    > - key->shared.pgoff =
    > - page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    > - put_page(page);
    > - return 0;
    > + } else {
    > + key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    > + key->shared.inode = page->mapping->inode;
    > + key->shared.pgoff = page->index;
    > }
    > - return err;
    > +out:
    > + unlock_page(page);
    > + put_page(page);
    > + return 0;
    > }
    >
    > /*
    >


    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes

    On Wednesday 09 April 2008 02:59, Peter Zijlstra wrote:
    > On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:
    > > On Saturday 05 April 2008 06:33, Peter Zijlstra wrote:
    > > > On the way of getting rid of the mmap_sem requirement for shared
    > > > futexes, start by relying on get_user_pages().
    > > >
    > > > This requires we get the page associated with the key, and put the page
    > > > when we're done with it.

    > >
    > > Hi Peter,
    > >
    > > Cool.
    > >
    > > I'm all for removing mmap_sem requirement from shared futexes...
    > > Are there many apps which make a non-trivial use of them I wonder?

    >
    > No idea. I've heard some stories, but I've never seen any code..
    >
    > > I guess it will help legacy (pre-FUTEX_PRIVATE) usespaces in
    > > performance too, though.

    >
    > Yeah, that was one of the motivations for this patch.
    >
    > > What I'm worried about with this is invalidate or truncate races.
    > > With direct IO, it obviously doesn't matter because the only
    > > requirement is that the page existed at the address at some point
    > > during the syscall...
    > >
    > > So I'd really like you to not carry the page around in the key, but
    > > just continue using the same key we have now. Also, lock the page
    > > and ensure it hasn't been truncated before taking the inode from the
    > > key and incrementing its count (page lock's extra atomics should be
    > > more or less cancelled out by fewer mmap_sem atomic ops).
    > >
    > > get_futex_key should look something like this I would have thought:?

    >
    > Does look nicer, will have to ponder it a bit though.
    >
    > I must admit to not fully understanding why we take inode/mm references
    > in the key anyway as neither will stop unmapping the futex.


    I guess that's the problem; if we unmap the futex then we might
    free the inode without a ref.


    > Will make this into a nice patch.. thanks!
    >
    > > BTW. I like that it removes a lot of fshared crap from around
    > > the place. And also this is a really good user of fast_gup
    > > because I guess it should usually be faulted in. The problem is
    > > that this could be a little more expensive for architectures that
    > > don't implement fast_gup. Though most should be able to.

    >
    > Yeah, if that really becomes a problem (but I doubt it will) we could
    > possibly make the old and new scheme work based on ARCH_HAVE_PTE_SPECIAL
    > or something ugly like that.


    Yeah, hopefully we won't have to worry. Technically this is a slowish
    path anyway, so while heavy global contention and even sleeping on
    mmap_sem may be a problem, I doubt the extra atomic or so would be
    out of the noise.

    Anyway, again thanks for doing this patch. What would be really nice
    in order to get some testing concurrently while I'm trying to get
    fast_gup in, is to make patch 2 which is exactly the same but it
    still uses get_user_pages (ie. it just moves the mmap_sem call right
    around the get_user_pages site). Then patch 3 can just remove those
    3 lines and replace them with a call to fast_gup. Does that make sense?

    Thanks,
    Nick
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [RFC PATCH 1/2] futex: rely on get_user_pages() for shared futexes

    On Tue, 2008-04-08 at 21:40 +1000, Nick Piggin wrote:

    > @@ -191,7 +191,6 @@ static int get_futex_key(u32 __user *uad
    > {
    > unsigned long address = (unsigned long)uaddr;
    > struct mm_struct *mm = current->mm;
    > - struct vm_area_struct *vma;
    > struct page *page;
    > int err;
    >
    > @@ -210,27 +209,26 @@ static int get_futex_key(u32 __user *uad
    > * Note : We do have to check 'uaddr' is a valid user address,
    > * but access_ok() should be faster than find_vma()
    > */
    > - if (!fshared) {
    > + if (likely(!fshared)) {
    > if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
    > return -EFAULT;
    > key->private.mm = mm;
    > key->private.address = address;
    > return 0;
    > }
    > - /*
    > - * The futex is hashed differently depending on whether
    > - * it's in a shared or private mapping. So check vma first.
    > - */
    > - vma = find_extend_vma(mm, address);
    > - if (unlikely(!vma))
    > - return -EFAULT;
    > -
    > - /*
    > - * Permissions.
    > - */
    > - if (unlikely((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
    > - return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
    >
    > +again:
    > + err = fast_gup(address, 1, 0, &page);
    > + if (err < 0)
    > + return err;
    > +
    > + lock_page(page);
    > + if (!page->mapping) { /* PageAnon pages shouldn't get caught here */
    > + unlock_page(page);
    > + put_page(page);
    > + goto again;
    > + }
    > +
    > /*
    > * Private mappings are handled in a simple way.
    > *
    > @@ -240,38 +238,19 @@ static int get_futex_key(u32 __user *uad
    > * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
    > * mappings of _writable_ handles.
    > */
    > - if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
    > - key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm
    > *
    > /
    > + if (PageAnon(page)) {
    > + key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
    > key->private.mm = mm;
    > key->private.address = address;
    > - return 0;
    > - }
    > -
    > - /*
    > - * Linear file mappings are also simple.
    > - */
    > - key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
    > - key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    > - if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
    > - key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
    > - + vma->vm_pgoff);
    > - return 0;
    > - }
    > -
    > - /*
    > - * We could walk the page table to read the non-linear
    > - * pte, and get the page index without fetching the page
    > - * from swap. But that's a lot of code to duplicate here
    > - * for a rare case, so we simply fetch the page.
    > - */
    > - err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
    > - if (err >= 0) {
    > - key->shared.pgoff =
    > - page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    > - put_page(page);
    > - return 0;
    > + } else {
    > + key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
    > + key->shared.inode = page->mapping->inode;
    > + key->shared.pgoff = page->index;
    > }
    > - return err;
    > +out:
    > + unlock_page(page);
    > + put_page(page);
    > + return 0;
    > }


    Right, so staring at this for a while makes me feel uneasy. The thing
    is, once we exit get_futex_key() there is nothing stopping the mm of
    inode from going away.

    So I'm going to have to take a ref while we have the page locked, and do
    the put_futex_key() thing to release it.

    Code at (uncompiled and untested):
    http://programming.kicks-ass.net/ker....6.24.4-rt4-2/

    The thing is, it now needs to take a reference on a rather large object
    (mm/inode), giving a rather large opportunity to bounce that cacheline.

    What problems do you see with just keeping a ref on the page?

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread