[patch v3] splice: fix race with page invalidation - Kernel

This is a discussion on [patch v3] splice: fix race with page invalidation - Kernel ; Jens, Please apply or ack this for 2.6.27. [v3: respun against 2.6.27-rc1] Thanks, Miklos ---- From: Miklos Szeredi Brian Wang reported that a FUSE filesystem exported through NFS could return I/O errors on read. This was traced to splice_direct_to_actor() returning ...

+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 20 of 40

Thread: [patch v3] splice: fix race with page invalidation

  1. [patch v3] splice: fix race with page invalidation

    Jens,

    Please apply or ack this for 2.6.27.

    [v3: respun against 2.6.27-rc1]

    Thanks,
    Miklos

    ----
    From: Miklos Szeredi

    Brian Wang reported that a FUSE filesystem exported through NFS could return
    I/O errors on read. This was traced to splice_direct_to_actor() returning a
    short or zero count when racing with page invalidation.

    However this is not FUSE or NFSD specific, other filesystems (notably NFS)
    also call invalidate_inode_pages2() to purge stale data from the cache.

    If this happens while such pages are sitting in a pipe buffer, then splice(2)
    from the pipe can return zero, and read(2) from the pipe can return ENODATA.

    The zero return is especially bad, since it implies end-of-file or
    disconnected pipe/socket, and is documented as such for splice. But returning
    an error for read() is also nasty, when in fact there was no error (data
    becoming stale is not an error).

    The same problems can be triggered by "hole punching" with
    madvise(MADV_REMOVE).

    This patch simply reuses the do_generic_file_read() infrastructure to collect
    pages to be spliced into the pipe buffer. This has the advantage of using
    very well tested codepaths. Other than fixing the above problem it also fixes
    smaller issues with the previous generic_file_splice_read() implementation:

    - error handling bugs for some corner cases
    - AOP_TRUNCATED_PAGE handling

    There are no real disadvantages: splice() from a file was originally meant to
    be asynchronous, but in reality it only did that for non-readahead pages,
    which happen rarely.

    Signed-off-by: Miklos Szeredi
    ---

    fs/splice.c | 295 ++++++-----------------------------------------------
    include/linux/fs.h | 2
    mm/filemap.c | 2
    3 files changed, 41 insertions(+), 258 deletions(-)

    Index: linux-2.6/fs/splice.c
    ================================================== =================
    --- linux-2.6.orig/fs/splice.c 2008-07-30 10:06:11.000000000 +0200
    +++ linux-2.6/fs/splice.c 2008-07-30 11:26:23.000000000 +0200
    @@ -87,53 +87,11 @@ static void page_cache_pipe_buf_release(
    buf->flags &= ~PIPE_BUF_FLAG_LRU;
    }

    -/*
    - * Check whether the contents of buf is OK to access. Since the content
    - * is a page cache page, IO may be in flight.
    - */
    -static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
    - struct pipe_buffer *buf)
    -{
    - struct page *page = buf->page;
    - int err;
    -
    - if (!PageUptodate(page)) {
    - lock_page(page);
    -
    - /*
    - * Page got truncated/unhashed. This will cause a 0-byte
    - * splice, if this is the first page.
    - */
    - if (!page->mapping) {
    - err = -ENODATA;
    - goto error;
    - }
    -
    - /*
    - * Uh oh, read-error from disk.
    - */
    - if (!PageUptodate(page)) {
    - err = -EIO;
    - goto error;
    - }
    -
    - /*
    - * Page is ok afterall, we are done.
    - */
    - unlock_page(page);
    - }
    -
    - return 0;
    -error:
    - unlock_page(page);
    - return err;
    -}
    -
    static const struct pipe_buf_operations page_cache_pipe_buf_ops = {
    .can_merge = 0,
    .map = generic_pipe_buf_map,
    .unmap = generic_pipe_buf_unmap,
    - .confirm = page_cache_pipe_buf_confirm,
    + .confirm = generic_pipe_buf_confirm,
    .release = page_cache_pipe_buf_release,
    .steal = page_cache_pipe_buf_steal,
    .get = generic_pipe_buf_get,
    @@ -265,212 +223,26 @@ static void spd_release_page(struct spli
    page_cache_release(spd->pages[i]);
    }

    -static int
    -__generic_file_splice_read(struct file *in, loff_t *ppos,
    - struct pipe_inode_info *pipe, size_t len,
    - unsigned int flags)
    +static int file_splice_read_actor(read_descriptor_t *desc, struct page *page,
    + unsigned long offset, unsigned long size)
    {
    - struct address_space *mapping = in->f_mapping;
    - unsigned int loff, nr_pages, req_pages;
    - struct page *pages[PIPE_BUFFERS];
    - struct partial_page partial[PIPE_BUFFERS];
    - struct page *page;
    - pgoff_t index, end_index;
    - loff_t isize;
    - int error, page_nr;
    - struct splice_pipe_desc spd = {
    - .pages = pages,
    - .partial = partial,
    - .flags = flags,
    - .ops = &page_cache_pipe_buf_ops,
    - .spd_release = spd_release_page,
    - };
    -
    - index = *ppos >> PAGE_CACHE_SHIFT;
    - loff = *ppos & ~PAGE_CACHE_MASK;
    - req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
    - nr_pages = min(req_pages, (unsigned)PIPE_BUFFERS);
    -
    - /*
    - * Lookup the (hopefully) full range of pages we need.
    - */
    - spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages, pages);
    - index += spd.nr_pages;
    -
    - /*
    - * If find_get_pages_contig() returned fewer pages than we needed,
    - * readahead/allocate the rest and fill in the holes.
    - */
    - if (spd.nr_pages < nr_pages)
    - page_cache_sync_readahead(mapping, &in->f_ra, in,
    - index, req_pages - spd.nr_pages);
    -
    - error = 0;
    - while (spd.nr_pages < nr_pages) {
    - /*
    - * Page could be there, find_get_pages_contig() breaks on
    - * the first hole.
    - */
    - page = find_get_page(mapping, index);
    - if (!page) {
    - /*
    - * page didn't exist, allocate one.
    - */
    - page = page_cache_alloc_cold(mapping);
    - if (!page)
    - break;
    -
    - error = add_to_page_cache_lru(page, mapping, index,
    - mapping_gfp_mask(mapping));
    - if (unlikely(error)) {
    - page_cache_release(page);
    - if (error == -EEXIST)
    - continue;
    - break;
    - }
    - /*
    - * add_to_page_cache() locks the page, unlock it
    - * to avoid convoluting the logic below even more.
    - */
    - unlock_page(page);
    - }
    -
    - pages[spd.nr_pages++] = page;
    - index++;
    - }
    -
    - /*
    - * Now loop over the map and see if we need to start IO on any
    - * pages, fill in the partial map, etc.
    - */
    - index = *ppos >> PAGE_CACHE_SHIFT;
    - nr_pages = spd.nr_pages;
    - spd.nr_pages = 0;
    - for (page_nr = 0; page_nr < nr_pages; page_nr++) {
    - unsigned int this_len;
    -
    - if (!len)
    - break;
    -
    - /*
    - * this_len is the max we'll use from this page
    - */
    - this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
    - page = pages[page_nr];
    -
    - if (PageReadahead(page))
    - page_cache_async_readahead(mapping, &in->f_ra, in,
    - page, index, req_pages - page_nr);
    -
    - /*
    - * If the page isn't uptodate, we may need to start io on it
    - */
    - if (!PageUptodate(page)) {
    - /*
    - * If in nonblock mode then dont block on waiting
    - * for an in-flight io page
    - */
    - if (flags & SPLICE_F_NONBLOCK) {
    - if (TestSetPageLocked(page)) {
    - error = -EAGAIN;
    - break;
    - }
    - } else
    - lock_page(page);
    -
    - /*
    - * Page was truncated, or invalidated by the
    - * filesystem. Redo the find/create, but this time the
    - * page is kept locked, so there's no chance of another
    - * race with truncate/invalidate.
    - */
    - if (!page->mapping) {
    - unlock_page(page);
    - page = find_or_create_page(mapping, index,
    - mapping_gfp_mask(mapping));
    -
    - if (!page) {
    - error = -ENOMEM;
    - break;
    - }
    - page_cache_release(pages[page_nr]);
    - pages[page_nr] = page;
    - }
    - /*
    - * page was already under io and is now done, great
    - */
    - if (PageUptodate(page)) {
    - unlock_page(page);
    - goto fill_it;
    - }
    -
    - /*
    - * need to read in the page
    - */
    - error = mapping->a_ops->readpage(in, page);
    - if (unlikely(error)) {
    - /*
    - * We really should re-lookup the page here,
    - * but it complicates things a lot. Instead
    - * lets just do what we already stored, and
    - * we'll get it the next time we are called.
    - */
    - if (error == AOP_TRUNCATED_PAGE)
    - error = 0;
    -
    - break;
    - }
    - }
    -fill_it:
    - /*
    - * i_size must be checked after PageUptodate.
    - */
    - isize = i_size_read(mapping->host);
    - end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
    - if (unlikely(!isize || index > end_index))
    - break;
    -
    - /*
    - * if this is the last page, see if we need to shrink
    - * the length and stop
    - */
    - if (end_index == index) {
    - unsigned int plen;
    -
    - /*
    - * max good bytes in this page
    - */
    - plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
    - if (plen <= loff)
    - break;
    + struct splice_pipe_desc *spd = desc->arg.data;
    + unsigned long count = desc->count;

    - /*
    - * force quit after adding this page
    - */
    - this_len = min(this_len, plen - loff);
    - len = this_len;
    - }
    + BUG_ON(spd->nr_pages >= PIPE_BUFFERS);

    - partial[page_nr].offset = loff;
    - partial[page_nr].len = this_len;
    - len -= this_len;
    - loff = 0;
    - spd.nr_pages++;
    - index++;
    - }
    + if (size > count)
    + size = count;

    - /*
    - * Release any pages at the end, if we quit early. 'page_nr' is how far
    - * we got, 'nr_pages' is how many pages are in the map.
    - */
    - while (page_nr < nr_pages)
    - page_cache_release(pages[page_nr++]);
    - in->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT;
    + page_cache_get(page);
    + spd->pages[spd->nr_pages] = page;
    + spd->partial[spd->nr_pages].offset = offset;
    + spd->partial[spd->nr_pages].len = size;
    + spd->nr_pages++;

    - if (spd.nr_pages)
    - return splice_to_pipe(pipe, &spd);
    + desc->count = count - size;

    - return error;
    + return size;
    }

    /**
    @@ -491,24 +263,33 @@ ssize_t generic_file_splice_read(struct
    struct pipe_inode_info *pipe, size_t len,
    unsigned int flags)
    {
    - loff_t isize, left;
    - int ret;
    -
    - isize = i_size_read(in->f_mapping->host);
    - if (unlikely(*ppos >= isize))
    - return 0;
    -
    - left = isize - *ppos;
    - if (unlikely(left < len))
    - len = left;
    + ssize_t ret;
    + loff_t pos = *ppos;
    + size_t offset = pos & ~PAGE_CACHE_MASK;
    + struct page *pages[PIPE_BUFFERS];
    + struct partial_page partial[PIPE_BUFFERS];
    + struct splice_pipe_desc spd = {
    + .pages = pages,
    + .partial = partial,
    + .flags = flags,
    + .ops = &page_cache_pipe_buf_ops,
    + .spd_release = spd_release_page,
    + };
    + read_descriptor_t desc = {
    + .count = min(len, (PIPE_BUFFERS << PAGE_CACHE_SHIFT) - offset),
    + .arg.data = &spd,
    + };

    - ret = __generic_file_splice_read(in, ppos, pipe, len, flags);
    - if (ret > 0)
    - *ppos += ret;
    + do_generic_file_read(in, &pos, &desc, file_splice_read_actor);
    + ret = desc.error;
    + if (spd.nr_pages) {
    + ret = splice_to_pipe(pipe, &spd);
    + if (ret > 0)
    + *ppos += ret;
    + }

    return ret;
    }
    -
    EXPORT_SYMBOL(generic_file_splice_read);

    /*
    Index: linux-2.6/include/linux/fs.h
    ================================================== =================
    --- linux-2.6.orig/include/linux/fs.h 2008-07-30 10:06:11.000000000 +0200
    +++ linux-2.6/include/linux/fs.h 2008-07-30 11:26:23.000000000 +0200
    @@ -1873,6 +1873,8 @@ extern ssize_t do_sync_read(struct file
    extern ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
    extern int generic_segment_checks(const struct iovec *iov,
    unsigned long *nr_segs, size_t *count, int access_flags);
    +extern void do_generic_file_read(struct file *filp, loff_t *ppos,
    + read_descriptor_t *desc, read_actor_t actor);

    /* fs/splice.c */
    extern ssize_t generic_file_splice_read(struct file *, loff_t *,
    Index: linux-2.6/mm/filemap.c
    ================================================== =================
    --- linux-2.6.orig/mm/filemap.c 2008-07-30 10:06:11.000000000 +0200
    +++ linux-2.6/mm/filemap.c 2008-07-30 11:26:23.000000000 +0200
    @@ -982,7 +982,7 @@ static void shrink_readahead_size_eio(st
    * This is really ugly. But the goto's actually try to clarify some
    * of the logic when it comes to error handling etc.
    */
    -static void do_generic_file_read(struct file *filp, loff_t *ppos,
    +void do_generic_file_read(struct file *filp, loff_t *ppos,
    read_descriptor_t *desc, read_actor_t actor)
    {
    struct address_space *mapping = filp->f_mapping;
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    >
    > There are no real disadvantages: splice() from a file was originally meant to
    > be asynchronous, but in reality it only did that for non-readahead pages,
    > which happen rarely.


    I still don't like this. I still don't see the point, and I still think
    there is something fundamentally wrong elsewhere.

    I also object to just dismissing the async nature as unimportant. Fix it
    instead. Make it use generic_file_readahead() or something. This is fixing
    things in all the wrong places, imnsho.

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  3. Re: [patch v3] splice: fix race with page invalidation

    On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > >
    > > There are no real disadvantages: splice() from a file was
    > > originally meant to be asynchronous, but in reality it only did
    > > that for non-readahead pages, which happen rarely.

    >
    > I still don't like this. I still don't see the point, and I still
    > think there is something fundamentally wrong elsewhere.


    We discussed the possible solutions with Nick, and came to the
    conclusion, that short term (i.e. 2.6.27) this is probably the best
    solution.

    Long term sure, I have no problem with implementing async splice.

    In fact, I may even have personal interest in looking at splice,
    because people are asking for a zero-copy interface for fuse.

    But that is definitely not 2.6.27, so I think you should reconsider
    taking this patch, which is obviously correct due to its simplicity,
    and won't cause any performance regressions either.

    Thanks,
    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  4. Re: [patch v3] splice: fix race with page invalidation

    On Wed, Jul 30 2008, Miklos Szeredi wrote:
    > On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > > >
    > > > There are no real disadvantages: splice() from a file was
    > > > originally meant to be asynchronous, but in reality it only did
    > > > that for non-readahead pages, which happen rarely.

    > >
    > > I still don't like this. I still don't see the point, and I still
    > > think there is something fundamentally wrong elsewhere.


    You snipped the part where Linus objected to dismissing the async
    nature, I fully agree with that part.

    > We discussed the possible solutions with Nick, and came to the
    > conclusion, that short term (i.e. 2.6.27) this is probably the best
    > solution.


    Ehm where? Nick also said that he didn't like removing the ->confirm()
    bits as they are completely related to the async nature of splice. You
    already submitted this exact patch earlier and it was nak'ed.

    > Long term sure, I have no problem with implementing async splice.
    >
    > In fact, I may even have personal interest in looking at splice,
    > because people are asking for a zero-copy interface for fuse.
    >
    > But that is definitely not 2.6.27, so I think you should reconsider
    > taking this patch, which is obviously correct due to its simplicity,
    > and won't cause any performance regressions either.


    Then please just fix the issue, instead of removing the bits that make
    this possible.

    --
    Jens Axboe

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  5. Re: [patch v3] splice: fix race with page invalidation

    On Wed, 30 Jul 2008, Jens Axboe wrote:
    > On Wed, Jul 30 2008, Miklos Szeredi wrote:
    > > On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > > > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > > > >
    > > > > There are no real disadvantages: splice() from a file was
    > > > > originally meant to be asynchronous, but in reality it only did
    > > > > that for non-readahead pages, which happen rarely.
    > > >
    > > > I still don't like this. I still don't see the point, and I still
    > > > think there is something fundamentally wrong elsewhere.

    >
    > You snipped the part where Linus objected to dismissing the async
    > nature, I fully agree with that part.
    >
    > > We discussed the possible solutions with Nick, and came to the
    > > conclusion, that short term (i.e. 2.6.27) this is probably the best
    > > solution.

    >
    > Ehm where?


    http://lkml.org/lkml/2008/7/7/476

    > Nick also said that he didn't like removing the ->confirm()
    > bits as they are completely related to the async nature of splice. You
    > already submitted this exact patch earlier and it was nak'ed.


    That's not true. The resubmitted patch didn't remove the ->confirm()
    calls, which is what Nick objected to, I think.

    > > Long term sure, I have no problem with implementing async splice.
    > >
    > > In fact, I may even have personal interest in looking at splice,
    > > because people are asking for a zero-copy interface for fuse.
    > >
    > > But that is definitely not 2.6.27, so I think you should reconsider
    > > taking this patch, which is obviously correct due to its simplicity,
    > > and won't cause any performance regressions either.

    >
    > Then please just fix the issue, instead of removing the bits that make
    > this possible.


    I tried to fix it, but Nick didn't like my fix. Ideas are of course
    welcome.

    Thanks,
    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  6. Re: [patch v3] splice: fix race with page invalidation

    > On Wed, 30 Jul 2008, Jens Axboe wrote:
    > > You snipped the part where Linus objected to dismissing the async
    > > nature, I fully agree with that part.


    And also note in what Nick said in the referenced mail: it would be
    nice if someone actually _cared_ about the async nature. The fact
    that it has been broken from the start, and nobody noticed is a strong
    hint that currently there isn't anybody who does.

    Maybe fuse will be the first one to actually care, and then I'll
    bother with putting a lot of effort into it. But until someone cares,
    nobody will bother, and that's how it should be. That's very much in
    line with the evoultionary nature of kernel developments: unused
    features will just get broken and eventually removed.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  7. Re: [patch v3] splice: fix race with page invalidation

    On Wed, Jul 30 2008, Miklos Szeredi wrote:
    > > On Wed, 30 Jul 2008, Jens Axboe wrote:
    > > > You snipped the part where Linus objected to dismissing the async
    > > > nature, I fully agree with that part.

    >
    > And also note in what Nick said in the referenced mail: it would be
    > nice if someone actually _cared_ about the async nature. The fact
    > that it has been broken from the start, and nobody noticed is a strong
    > hint that currently there isn't anybody who does.


    That's largely due to the (still) lack of direct splice users. It's a
    clear part of the design and benefit of using splice. I very much care
    about this, and as soon as there are available cycles for this, I'll get
    it into better shape in this respect. Taking a step backwards is not the
    right way forward, imho.

    > Maybe fuse will be the first one to actually care, and then I'll
    > bother with putting a lot of effort into it. But until someone cares,
    > nobody will bother, and that's how it should be. That's very much in
    > line with the evoultionary nature of kernel developments: unused
    > features will just get broken and eventually removed.


    People always say then, and the end result is it never gets done. Not to
    point the finger at Nick, but removing the steal part of the splice
    design was something I objected to a lot. Yet it was acked on the
    premise that it would eventually get resubmitted in a fixed manner, but
    were are not further along that path.

    --
    Jens Axboe

    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  8. Re: [patch v3] splice: fix race with page invalidation

    On Wed, 30 Jul 2008, Jens Axboe wrote:
    > On Wed, Jul 30 2008, Miklos Szeredi wrote:
    > > > On Wed, 30 Jul 2008, Jens Axboe wrote:
    > > > > You snipped the part where Linus objected to dismissing the async
    > > > > nature, I fully agree with that part.

    > >
    > > And also note in what Nick said in the referenced mail: it would be
    > > nice if someone actually _cared_ about the async nature. The fact
    > > that it has been broken from the start, and nobody noticed is a strong
    > > hint that currently there isn't anybody who does.

    >
    > That's largely due to the (still) lack of direct splice users. It's a
    > clear part of the design and benefit of using splice. I very much care
    > about this, and as soon as there are available cycles for this, I'll get
    > it into better shape in this respect. Taking a step backwards is not the
    > right way forward, imho.


    So what is? The only way forward is to put those cycles into it,
    which nobody seems to have available.

    Take this patch as a bugfix. It's not in any way showing the way
    forward: as soon as you have the time, you can revert it and start
    from the current state.

    Hmm?

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  9. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    >
    > Take this patch as a bugfix. It's not in any way showing the way
    > forward: as soon as you have the time, you can revert it and start
    > from the current state.
    >
    > Hmm?


    I dislike that mentality.

    The fact is, it's not a bug-fix, it's just papering over the real problem.

    And by papering it over, it then just makes people less likely to bother
    with the real issue.

    For example, and I talked about this earlier - what make syou think that
    the FUSE/NFSD behaviour you don't like is at all valid in the first place?

    If you depend on data not being truncated because you have it "in flight",
    tjhere's already something wrong there. It's _not_ just that people can
    see zero bytes in the reply - apparently they can see the file shrink
    before they see the read return. That kind of thing just worries me. And
    it might be a general NFS issue, not necessarily a FUSE one.

    So I think your whole approach stinks. I don't agree with the "bug-fix".
    It really smells like a "bug-paper-over".

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  10. Re: [patch v3] splice: fix race with page invalidation

    On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > >
    > > Take this patch as a bugfix. It's not in any way showing the way
    > > forward: as soon as you have the time, you can revert it and start
    > > from the current state.
    > >
    > > Hmm?

    >
    > I dislike that mentality.
    >
    > The fact is, it's not a bug-fix, it's just papering over the real problem.


    It _is_ a bug fix. See here, from man 2 splice:

    RETURN VALUE
    Upon successful completion, splice() returns the number of bytes
    spliced to or from the pipe. A return value of 0 means that there was
    no data to transfer, and it would not make sense to block, because
    there are no writers connected to the write end of the pipe referred to
    by fd_in.

    Currently splice on NFS, FUSE and a few other filesystems don't
    conform to that clause: splice can return zero even if there's still
    data to be read, just because the data happened to be invalidated
    during the splicing. That's a plain and clear bug, which has
    absolutely nothing to do with NFS _exporting_.

    > And by papering it over, it then just makes people less likely to bother
    > with the real issue.


    I think you are talking about a totally separate issue: that NFSD's
    use of splice can result in strange things if the file is truncated
    while being read. But this is an NFSD issue and I don't see that it
    has _anything_ to do with the above bug in splice. I think you are
    just confusing the two things.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  11. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    >
    > It _is_ a bug fix.


    No it's not.

    It's still papering over a totally unrelated bug. It's not a "fix", it's a
    "paper-over". It doesn't matter if _behaviour_ changes.

    Can you really not see the difference?

    Afaik, everybody pretty much agreed what the real fix should be (don't
    mark the page not up-to-date, just remove it from the radix tree), I'm not
    seeing why you continue to push the patch that ALREADY GOT NAK'ed.

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  12. Re: [patch v3] splice: fix race with page invalidation

    > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > >
    > > It _is_ a bug fix.

    >
    > No it's not.
    >
    > It's still papering over a totally unrelated bug. It's not a "fix", it's a
    > "paper-over". It doesn't matter if _behaviour_ changes.
    >
    > Can you really not see the difference?
    >
    > Afaik, everybody pretty much agreed what the real fix should be (don't
    > mark the page not up-to-date, just remove it from the radix tree)


    Huh? I did exactly that, and that patch got NAK'ed by you and Nick:

    http://lkml.org/lkml/2008/6/25/230
    http://lkml.org/lkml/2008/7/7/21

    Confused.

    > I'm not seeing why you continue to push the patch that ALREADY GOT
    > NAK'ed.


    And later ACK'ed by Nick.

    There's everything here but agreement about a solution

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  13. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    >
    > Huh? I did exactly that, and that patch got NAK'ed by you and Nick:


    Umm. That was during the late -rc sequence (what, -rc8?) where I wanted a
    minimal "let's fix it". Not that anybody then apparently cared, which is
    probably just as well.

    Then you did NOTHING AT ALL about it, now the merge window is over, and
    you complain again, with what looks like basically the same patch that was
    already rejected earlier.

    Hellooo..

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  14. Re: [patch v3] splice: fix race with page invalidation

    On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > >
    > > Huh? I did exactly that, and that patch got NAK'ed by you and Nick:

    >
    > Umm. That was during the late -rc sequence (what, -rc8?) where I wanted a
    > minimal "let's fix it". Not that anybody then apparently cared, which is
    > probably just as well.
    >
    > Then you did NOTHING AT ALL about it, now the merge window is over, and
    > you complain again, with what looks like basically the same patch that was
    > already rejected earlier.
    >
    > Hellooo..


    You are being unfair: after having talked it over with Nick I
    resubmitted this patch (not the same), which was added to -mm and
    nobody complained then. Then it got thrown out of -mm during the merge
    window because of a conflict, and then now I got around to
    resubmitting it again.

    But you're the boss, if you don't like it it won't go in. I'll have a
    look at the "don't clear PG_uptodate" solution again and see if we can
    get an agreement there.

    Miklos
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  15. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    >
    > You are being unfair: after having talked it over with Nick I
    > resubmitted this patch (not the same), which was added to -mm and
    > nobody complained then. Then it got thrown out of -mm during the merge
    > window because of a conflict, and then now I got around to
    > resubmitting it again.


    Ok, fair enough. I don't follow -mm myself (since as far as I'm concerned,
    a lot of the point of -mm is that Andrew takes a lot of load off me). So
    yes, it was unfair and yes, I'd never have reacted to it in -mm.

    But I'd really like to get that PG_uptodate bit just fixed - both wrt
    writeout errors and wrt truncate/holepunch. We had some similar issues wrt
    ext3 (?) inode buffers, where removing the uptodate bit actually ended up
    being a mistake.

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  16. Re: [patch v3] splice: fix race with page invalidation

    > > And by papering it over, it then just makes people less likely to bother
    > > with the real issue.

    >
    > I think you are talking about a totally separate issue: that NFSD's
    > use of splice can result in strange things if the file is truncated
    > while being read. But this is an NFSD issue and I don't see that it
    > has _anything_ to do with the above bug in splice. I think you are
    > just confusing the two things.


    I'm more concerned by sendfile() users like Apache, Samba, FTPd. In
    an earlier thread on this topic, I asked if the splice bug can also
    result in sendfile() sending blocks of zeros, when a file is truncated
    after it has been sent, and the answer was yes probably.

    Not that I checked or anything. But if it affects sendfile() it's a
    bigger deal - that has many users.

    Assuming it does affect sendfile(), it's exasperated by not being able
    to tell when a sendfile() has finished with the pages its sending.
    E.g. you can't lock the file or otherwise synchronise with another
    program which wants to modify the file.

    -- Jamie
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  17. Re: [patch v3] splice: fix race with page invalidation

    Jamie Lokier wrote:
    > not being able to tell when a sendfile() has finished with the pages
    > its sending.


    (Except by the socket fully closing or a handshake from the other end,
    obviously.)

    -- Jamie
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  18. Re: [patch v3] splice: fix race with page invalidation



    On Thu, 31 Jul 2008, Jamie Lokier wrote:
    >
    > Jamie Lokier wrote:
    > > not being able to tell when a sendfile() has finished with the pages
    > > its sending.

    >
    > (Except by the socket fully closing or a handshake from the other end,
    > obviously.)


    Well, people should realize that this is pretty fundamental to zero-copy
    scemes. It's why zero-copy is often much less useful than doing a copy in
    the first place. How do you know how far in a splice buffer some random
    'struct page' has gotten? Especially with splicing to spicing to tee to
    splice...

    You'd have to have some kind of barrier model (which would be really
    complex), or perhaps a "wait for this page to no longer be shared" (which
    has issues all its own).

    IOW, splice() is very closely related to a magic kind of "mmap()+write()"
    in another thread. That's literally what it does internally (except the
    "mmap" is just a small magic kernel buffer rather than virtual address
    space), and exactly as with mmap, if you modify the file, the other thread
    will see if, even though it did it long ago.

    Personally, I think the right approach is to just realize that splice() is
    _not_ a write() system call, and never will be. If you need synchronous
    writing, you simply shouldn't use splice().

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  19. Re: [patch v3] splice: fix race with page invalidation



    On Wed, 30 Jul 2008, Linus Torvalds wrote:
    >
    > Personally, I think the right approach is to just realize that splice() is
    > _not_ a write() system call, and never will be. If you need synchronous
    > writing, you simply shouldn't use splice().


    Side note: in-kernel users could probably do somethign about this. IOW, if
    there's some in-kernel usage (and yes, knfsd would be a prime example),
    that one may actually be able to do things that a _user_level user of
    splice() could never do.

    That includes things like getting the inode semaphore over a write (so
    that you can guarantee that pages that are in flight are not modified,
    except again possibly by other mmap users), and/or a per-page callback for
    when splice() is done with a page (so that you could keep the page locked
    while it's getting spliced, for example).

    And no, we don't actually have that either, of course.

    Linus
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  20. Re: [patch v3] splice: fix race with page invalidation

    On Thursday 31 July 2008 03:54, Jens Axboe wrote:
    > On Wed, Jul 30 2008, Miklos Szeredi wrote:
    > > On Wed, 30 Jul 2008, Linus Torvalds wrote:
    > > > On Wed, 30 Jul 2008, Miklos Szeredi wrote:
    > > > > There are no real disadvantages: splice() from a file was
    > > > > originally meant to be asynchronous, but in reality it only did
    > > > > that for non-readahead pages, which happen rarely.
    > > >
    > > > I still don't like this. I still don't see the point, and I still
    > > > think there is something fundamentally wrong elsewhere.

    >
    > You snipped the part where Linus objected to dismissing the async
    > nature, I fully agree with that part.
    >
    > > We discussed the possible solutions with Nick, and came to the
    > > conclusion, that short term (i.e. 2.6.27) this is probably the best
    > > solution.

    >
    > Ehm where? Nick also said that he didn't like removing the ->confirm()
    > bits as they are completely related to the async nature of splice. You
    > already submitted this exact patch earlier and it was nak'ed.
    >
    > > Long term sure, I have no problem with implementing async splice.
    > >
    > > In fact, I may even have personal interest in looking at splice,
    > > because people are asking for a zero-copy interface for fuse.
    > >
    > > But that is definitely not 2.6.27, so I think you should reconsider
    > > taking this patch, which is obviously correct due to its simplicity,
    > > and won't cause any performance regressions either.

    >
    > Then please just fix the issue, instead of removing the bits that make
    > this possible.


    The only "real" objection I had to avoiding the ClearPageUptodate there
    I guess is that it would weaken some assertions. I was more concerned
    about the unidentified problems... but there probably shouldn't be
    too many places in the VM that really care that much anymore (and those
    that do might already be racy).

    Now it seems to be perfectly fine to use the actual page itself that may
    have been truncated, and we have been doing that for a long time (see
    get_user_pages). So I'm not so worried about a bad data corruption or
    anything but just the VM getting confused, which we could fix anyway.

    I guess that kind of patch could sit in -mm for a while then get merged.
    Linus probably wouldn't think highly of a post-rc1 merge, but if it
    really is a bugfix, maybe?
    --
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

+ Reply to Thread
Page 1 of 2 1 2 LastLast