Thread: [patch v3] splice: fix race with page invalidation

  1. Re: [patch v3] splice: fix race with page invalidation

    Linus Torvalds wrote:
    > > Jamie Lokier wrote:
    > > > not being able to tell when a sendfile() has finished with the pages
    > > > it's sending.

    > >
    > > (Except by the socket fully closing or a handshake from the other end,
    > > obviously.)

    >
    > Well, people should realize that this is pretty fundamental to zero-copy
    > schemes. It's why zero-copy is often much less useful than doing a copy in
    > the first place. How do you know how far in a splice buffer some random
    > 'struct page' has gotten? Especially with splicing to splicing to tee to
    > splice...


    Having implemented an equivalent zero-copy thing in userspace, I can
    confidently say it's not fundamental at all.

    What is fundamental is that you either (a) treat sendfile as an async
    operation, and get a notification when it's finished with the data,
    just like any other async operation, or (b) while sendfile claims those
    pages, they are marked COW.

    (b) is *much* more useful for the things you actually want to use
    sendfile for, namely a faster copy-file-to-socket with no weird
    complications. Since you're sending files which you don't *expect* to
    change (but want to behave sensibly if they do), and the pages
    probably aren't mapped into any process, COW would not cost anything.

    Right now, sendfile is used by servers of all kinds: http, ftp, file
    servers, you name it. They all want to believe it's purely a
    performance optimisation, equivalent to write. On many operating
    systems, it is. (I count sendfile equivalents on: Windows NT, SCO
    UnixWare, Solaris, FreeBSD, Dragonfly, HP-UX, Tru64, AIX and S/390 in
    addition to Linux :-)
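
    For illustration, the copy-file-to-socket pattern those servers rely on
    looks roughly like this (a minimal sketch; sock_fd is assumed to be an
    already-connected socket, and error handling is trimmed):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Send a whole file down an already-connected socket; sendfile(2)
     * stands in for the read()+write() loop it replaces. */
    static int send_whole_file(int sock_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        off_t off = 0;

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        while (off < st.st_size) {
            ssize_t n = sendfile(sock_fd, fd, &off, st.st_size - off);
            if (n <= 0)
                break;  /* error: a real server would fall back to write() */
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
    }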

    > You'd have to have some kind of barrier model (which would be really
    > complex), or perhaps a "wait for this page to no longer be shared" (which
    > has issues all its own).
    >
    > IOW, splice() is very closely related to a magic kind of "mmap()+write()"
    > in another thread. That's literally what it does internally (except the
    > "mmap" is just a small magic kernel buffer rather than virtual address
    > space), and exactly as with mmap, if you modify the file, the other thread
    > will see it, even though it did it long ago.


    That's fine. But if you use a thread, the thread can tell you when
    it's done. Then you know what you're sending not an infinite time in
    the future :-)

    > Personally, I think the right approach is to just realize that splice() is
    > _not_ a write() system call, and never will be. If you need synchronous
    > writing, you simply shouldn't use splice().


    People want zero-copy, and no weirdness like sending blocks of zeros
    which the file never contained, and (if you lock the file) knowing
    when to release locks for someone else to edit the file.

    Sync or async doesn't matter so much; that's API stuff.

    The obvious mechanism for completion notifications is the AIO event
    interface. I.e. aio_sendfile that reports completion when it's safe
    to modify data it was using. aio_splice would be logical for similar
    reasons. Note it doesn't mean when the data has reached a particular
    place, it means when the pages it's holding are released. Pity AIO
    still sucks ;-)

    Btw, Windows had this since forever, it's called overlapped
    TransmitFile with an I/O completion event. Don't know if it's any
    good though ;-)

    -- Jamie

  2. Re: [patch v3] splice: fix race with page invalidation

    On Thu, 31 Jul 2008, Jamie Lokier wrote:
    > I'm more concerned by sendfile() users like Apache, Samba, FTPd. In
    > an earlier thread on this topic, I asked if the splice bug can also
    > result in sendfile() sending blocks of zeros, when a file is truncated
    > after it has been sent, and the answer was yes probably.
    >
    > Not that I checked or anything. But if it affects sendfile() it's a
    > bigger deal - that has many users.


    Nick also pointed out that it also affects plain read(2), albeit only
    with a tiny window.

    But partial truncates are _rare_ (we don't even have a UNIX utility
    for that), so in practice all this may not actually matter very
    much.

    Miklos

  3. Re: [patch v3] splice: fix race with page invalidation

    On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
    > The obvious mechanism for completion notifications is the AIO event
    > interface. I.e. aio_sendfile that reports completion when it's safe
    > to modify data it was using. aio_splice would be logical for similar
    > reasons. Note it doesn't mean when the data has reached a particular
    > place, it means when the pages it's holding are released. Pity AIO
    > still sucks ;-)


    It is not that simple: a page can be held in hardware or TCP queues for
    a long time, and the only possible way to know that the system has
    finished with it is receiving an ack from the remote side. There is a
    project to implement such a callback at skb destruction time (it is
    freed after the ack from the other peer), but do we really need it? A
    system which cares about the transmit will implement its own ack
    mechanism, so the page can be unlocked at a higher layer. Actually the
    page can be locked during the transfer and unlocked after the RPC reply
    is received, so the underlying page invalidation will be postponed and
    will not affect sendfile/splice.
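
    Roughly what that "ack at a higher layer" looks like from the sender's
    side (illustrative only; the one-byte 'A' reply is a made-up protocol
    for this sketch, not anything the kernel provides):

    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Send a file range, then wait for an application-level ack before the
     * caller is allowed to modify or truncate the file again. */
    static int send_and_wait_ack(int sock_fd, int file_fd, off_t len)
    {
        off_t off = 0;
        char ack;

        while (off < len) {
            ssize_t n = sendfile(sock_fd, file_fd, &off, len - off);
            if (n <= 0)
                return -1;
        }

        /* The peer replies with one byte once it has consumed the data. */
        if (read(sock_fd, &ack, 1) != 1 || ack != 'A')
            return -1;

        return 0;   /* only now is it safe to touch the file again */
    }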

    > Btw, Windows had this since forever, it's called overlapped
    > TransmitFile with an I/O completion event. Don't know if it's any
    > good though ;-)


    There was a linux aio_sendfile() too. Google still knows about its
    numbers, graphs and so on...

    --
    Evgeniy Polyakov

  4. Re: [patch v3] splice: fix race with page invalidation

    Evgeniy Polyakov wrote:
    > On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier (jamie@shareable.org) wrote:
    > > The obvious mechanism for completion notifications is the AIO event
    > > interface. I.e. aio_sendfile that reports completion when it's safe
    > > to modify data it was using. aio_splice would be logical for similar
    > > reasons. Note it doesn't mean when the data has reached a particular
    > > place, it means when the pages it's holding are released. Pity AIO
    > > still sucks ;-)

    >
    > It is not that simple: page can be held in hardware or tcp queues for
    > a long time, and the only possible way to know, that system finished
    > with it, is receiving ack from the remote side. There is a project to
    > implement such a callback at skb destruction time (it is freed after ack
    > from the other peer), but do we really need it? System which does care
    > about transmit will implement own ack mechanism, so page can be unlocked
    > at higher layer. Actually page can be locked during transfer and
    > unlocked after rpc reply received, so underlying page invalidation will
    > be postponed and will not affect sendfile/splice.


    This is why marking the pages COW would be better. Automatic!
    There's no need for a notification, merely letting go of the page
    references - yes, the hardware / TCP acks already do that, no locking
    or anything! :-) The last reference is nothing special, it just means
    the next file write/truncate sees the count is 1 and doesn't need to
    COW the page.


    Two reasons for being mildly curious about sendfile page releases in an
    application, though:

    - Sendfile on tmpfs files: zero copy sending of calculated data.
    Only trouble is when can you reuse the pages? Current solution
    is use a set of files, consume the pages in sequential order, delete
    files at some point, let the kernel hold the pages. Works for
    sequentially generated and transmitted data, but not good for
    userspace caches where different parts expire separately. Also,
    may pin a lot of page cache; not sure if that's accounted.

    - Sendfile on real large data contained in a userspace
    database-cum-filesystem (the future!). App wants to send big
    blobs, and with COW it can forget about them, but for performance
    it would rather allocate new writes in the file to areas that are
    not sendfile-hot. It can approximate with heuristics though.

    > > Btw, Windows had this since forever, it's called overlapped
    > > TransmitFile with an I/O completion event. Don't know if it's any
    > > good though ;-)

    >
    > There was a linux aio_sendfile() too. Google still knows about its
    > numbers, graphs and so on...


    I vaguely remember its performance didn't seem that good.

    One of the problems is you don't really want AIO all the time, just
    when a process would block because the data isn't in cache. You
    really don't want to be sending *all* ops to worker threads, even
    kernel threads. And you preferably don't want the AIO interface
    overhead for ops satisfied from cache.

    Syslets got some of the way there, and maybe that's why they were
    faster than AIO for some things. There are user-space hacks which are
    a bit like syslets. (Bind two processes to the same CPU, process 1
    wakes process 2 just before 1 does a syscall, and puts 2 back to sleep
    if 2 didn't wake and do an atomic op to prove it's awake). I haven't
    tested their performance, it could suck.

    Look up LAIO, Lazy Asynchronous I/O. Apparently FreeBSD, NetBSD,
    Solaris, Tru64, and Windows have the capability to call a synchronous
    I/O op and, if it's satisfied from cache, simply return a result; if
    not, either queue it and return an AIO event later (Windows style (it
    does some cleverer thread balancing too)), or wake another thread to
    handle it (FreeBSD style). I believe Linus suggested something like
    the latter approach some time ago.
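
    Roughly the shape of that idea done in userspace (a sketch under
    assumptions: mincore() on an mmap'ed file region stands in for the
    "is it cached?" test, and the slow path is a bare detached thread
    rather than a proper worker pool):

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct send_req {
        int sock_fd;
        void *map;          /* page-aligned mmap of the file region */
        size_t len;
    };

    static void *slow_path(void *arg)
    {
        struct send_req *r = arg;

        /* May take major faults and block on disk; that's the point. */
        (void)write(r->sock_fd, r->map, r->len);
        free(r);
        return NULL;
    }

    /* Returns 1 if handled inline (data was resident), 0 if handed off. */
    static int laio_like_send(int sock_fd, void *map, size_t len)
    {
        long psize = sysconf(_SC_PAGESIZE);
        size_t pages = (len + psize - 1) / psize;
        unsigned char vec[pages];   /* fine for sketch-sized lengths */
        size_t i;

        if (mincore(map, len, vec) == 0) {
            for (i = 0; i < pages && (vec[i] & 1); i++)
                ;
            if (i == pages) {       /* everything resident: fast path */
                (void)write(sock_fd, map, len);
                return 1;
            }
        }

        struct send_req *r = malloc(sizeof(*r));
        pthread_t tid;

        r->sock_fd = sock_fd;
        r->map = map;
        r->len = len;
        pthread_create(&tid, NULL, slow_path, r);
        pthread_detach(tid);
        return 0;
    }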

    -- Jamie

  5. Re: [patch v3] splice: fix race with page invalidation

    On Thursday 31 July 2008 22:33, Jamie Lokier wrote:
    > Evgeniy Polyakov wrote:
    > > On Thu, Jul 31, 2008 at 07:12:01AM +0100, Jamie Lokier

    (jamie@shareable.org) wrote:
    > > > The obvious mechanism for completion notifications is the AIO event
    > > > interface. I.e. aio_sendfile that reports completion when it's safe
    > > > to modify data it was using. aio_splice would be logical for similar
    > > > reasons. Note it doesn't mean when the data has reached a particular
    > > > place, it means when the pages it's holding are released. Pity AIO
    > > > still sucks ;-)

    > >
    > > It is not that simple: page can be held in hardware or tcp queues for
    > > a long time, and the only possible way to know, that system finished
    > > with it, is receiving ack from the remote side. There is a project to
    > > implement such a callback at skb destruction time (it is freed after ack
    > > from the other peer), but do we really need it? System which does care
    > > about transmit will implement own ack mechanism, so page can be unlocked
    > > at higher layer. Actually page can be locked during transfer and
    > > unlocked after rpc reply received, so underlying page invalidation will
    > > be postponed and will not affect sendfile/splice.

    >
    > This is why marking the pages COW would be better. Automatic!
    > There's no need for a notification, merely letting go of the page
    > references - yes, the hardware / TCP acks already do that, no locking
    > or anything! :-) The last reference is nothing special, it just means
    > the next file write/truncate sees the count is 1 and doesn't need to
    > COW the page.


    Better == more bloat and complexity and corner cases in the VM?

    If the app wants to ensure some specific data is sent, then it has
    to wait until the receiver receives it before changing it, simple.

    And you still don't avoid the fundamental problem that the receiver
    may not get exactly what the sender has put in flight if we do things
    asynchronously.

  6. Re: [patch v3] splice: fix race with page invalidation

    On Wednesday 30 July 2008 19:43, Miklos Szeredi wrote:
    > Jens,
    >
    > Please apply or ack this for 2.6.27.
    >
    > [v3: respun against 2.6.27-rc1]
    >
    > Thanks,
    > Miklos
    >
    > ----
    > From: Miklos Szeredi
    >
    > Brian Wang reported that a FUSE filesystem exported through NFS could
    > return I/O errors on read. This was traced to splice_direct_to_actor()
    > returning a short or zero count when racing with page invalidation.
    >
    > However this is not FUSE or NFSD specific, other filesystems (notably NFS)
    > also call invalidate_inode_pages2() to purge stale data from the cache.
    >
    > If this happens while such pages are sitting in a pipe buffer, then
    > splice(2) from the pipe can return zero, and read(2) from the pipe can
    > return ENODATA.
    >
    > The zero return is especially bad, since it implies end-of-file or
    > disconnected pipe/socket, and is documented as such for splice. But
    > returning an error for read() is also nasty, when in fact there was no
    > error (data becoming stale is not an error).


    Hmm, the PageError case is a similar one which cannot be avoided, so
    it kind of indicates to me that the splice async API is slightly
    lacking (and provides me with some confirmation about my dislike of
    removing ClearPageUptodate from invalidate...)

    Returning -EIO at the pipe read doesn't quite make sense, I think,
    because it is conceptually an IO error for the splicer, not the reader
    (who is reading from a pipe, not from the file causing the error).

    It seems like the right way to fix this would be to allow the splicing
    process to be notified of a short read, in which case it could try to
    refill the pipe with the unread bytes...

  7. Re: [patch v3] splice: fix race with page invalidation

    On Thu, Jul 31, 2008 at 01:33:50PM +0100, Jamie Lokier (jamie@shareable.org) wrote:
    > This is why marking the pages COW would be better. Automatic!
    > There's no need for a notification, merely letting go of the page
    > references - yes, the hardware / TCP acks already do that, no locking
    > or anything! :-) The last reference is nothing special, it just means
    > the next file write/truncate sees the count is 1 and doesn't need to
    > COW the page.


    It depends... COW can DoS the system: consider an attacker who sends a
    page, writes to it, sends again and so on in lots of threads. Depending
    on link capacity, COW will eventually eat the whole RAM.

    > > There was a linux aio_sendfile() too. Google still knows about its
    > > numbers, graphs and so on...

    >
    > I vaguely remember it's performance didn't seem that good.



    Benchmark of the 100 1MB files transfer (files are in VFS already) using
    sync sendfile() against aio_sendfile_path() shows about 10MB/sec
    performance win (78 MB/s vs 66-72 MB/s over 1 Gb network; the sendfile
    sending server is a one-way AMD Athlon 64 3500+) for aio_sendfile_path().


    So, it was really better than sync sendfile.

    > One of the problems is you don't really want AIO all the time, just
    > when a process would block because the data isn't in cache. You
    > really don't want to be sending *all* ops to worker threads, even
    > kernel threads. And you preferably don't want the AIO interface
    > overhead for ops satisfied from cache.


    That's how all AIO should work of course. We are getting a bit
    offtopic, but aio_sendfile() worked that way, as did syslets,
    although the former did allocate some structures before trying to send
    the data.

    > Syslets got some of the way there, and maybe that's why they were
    > faster than AIO for some things. There are user-space hacks which are
    > a bit like syslets. (Bind two processes to the same CPU, process 1
    > wakes process 2 just before 1 does a syscall, and puts 2 back to sleep
    > if 2 didn't wake and do an atomic op to prove it's awake). I haven't
    > tested their performance, it could suck.


    Looks scary.
    Thread allocation in userspace is a rather costly operation compared to
    syslet threads in kernelspace. But depending on the IO pattern this may
    or may not be a noticeable factor... It requires testing and numbers.

    --
    Evgeniy Polyakov

  8. Re: [patch v3] splice: fix race with page invalidation



    On Thu, 31 Jul 2008, Jamie Lokier wrote:
    >
    > Having implemented an equivalent zero-copy thing in userspace, I can
    > confidently say it's not fundamental at all.


    Oh yes it is.

    Doing it in user space is _trivial_, because you control everything, and
    there are no barriers.

    > What is fundamental is that you either (a) treat sendfile as an async
    > operation, and get a notification when it's finished with the data,
    > just like any other async operation


    Umm. And that's exactly what I *described*.

    But it's trivial to do inside one program (either all in user space, or
    all in kernel space).

    It's very difficult indeed to do across two totally different domains.

    Have you _looked_ at the complexities of async IO in UNIX? They are
    horrible. The overhead to even just _track_ the notifiers basically undoes
    all relevant optimizations for doing zero-copy.

    IOW, AIO is useful not because of zero-copy, but because it allows
    _overlapping_ IO. Anybody who confuses the two is seriously misguided.

    > , or (b) while sendfile claims those
    > pages, they are marked COW.


    ... and this one shows that you have no clue about performance of a memcpy.

    Once you do that COW, you're actually MUCH BETTER OFF just copying.

    Really.

    Copying a page is much cheaper than doing COW on it. Doing a "write()"
    really isn't that expensive. People think that memory is slow, but memory
    isn't all that slow, and caches work really well. Yes, memory is slow
    compared to a few reference count increments, but memory is absolutely
    *not* slow when compared to the overhead of TLB invalidates across CPUs
    etc.

    So don't do it. If you think you need it, you should not be using
    zero-copy in the first place.

    In other words, let me repeat:

    - use splice() when you *understand* that it's just taking a refcount and
    you don't care.

    - use read()/write() when you can't be bothered.

    There's nothing wrong with read/write. The _normal_ situation should be
    that 99.9% of all IO is done using the regular interfaces. Splice() (and
    sendpage() before it) is a special case. You should be using splice if you
    have a DVR and you can do all the DMA from the tuner card into buffers
    that you can then split up and send off to show real-time at the same time
    as you copy them to disk.
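
    A sketch of that DVR-style fan-out with the actual splice()/tee() calls
    (tuner_fd, disk_fd and sock_fd are assumed to be opened elsewhere, the
    capture device is assumed to support splice, and a production loop
    would also handle partial tee() returns):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int fan_out(int tuner_fd, int disk_fd, int sock_fd)
    {
        int p_disk[2], p_net[2];

        if (pipe(p_disk) < 0 || pipe(p_net) < 0)
            return -1;

        for (;;) {
            /* Pull a chunk from the capture device into the first pipe. */
            ssize_t n = splice(tuner_fd, NULL, p_disk[1], NULL,
                               64 * 1024, SPLICE_F_MOVE);
            if (n <= 0)
                break;

            /* Duplicate the buffer references into the second pipe. */
            ssize_t t = tee(p_disk[0], p_net[1], n, 0);
            if (t <= 0)
                break;

            /* Drain both pipes: one copy to disk, one to the network. */
            splice(p_disk[0], NULL, disk_fd, NULL, n, SPLICE_F_MOVE);
            splice(p_net[0], NULL, sock_fd, NULL, t, SPLICE_F_MOVE);
        }
        return 0;
    }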

    THAT is when zero-copy is useful. If you think you need to play games with
    async notifiers, you're already off the deep end.

    Linus

  9. Re: [patch v3] splice: fix race with page invalidation



    On Thu, 31 Jul 2008, Evgeniy Polyakov wrote:
    >
    > It depends... COW can DoS the system: consider attacker who sends a
    > page, writes there, sends again and so on in lots of threads. Depending
    > on link capacity eventually COW will eat the whole RAM.


    Yes, COW is complex, and the complexity would be part of the cost. But the
    much bigger cost is the fact that COW is simply more costly than copying
    the data in the first place.

    A _single_ page fault is usually much much more expensive than copying a
    page, especially if you can do the copy well wrt caches. For example, a
    very common case is that the data you're writing is already in the CPU
    caches.

    In fact, even if you can avoid the fault, the cost of doing all the
    locking and looking up the pages for COW is likely already bigger than the
    memcpy. The memcpy() is a nice linear access which both the CPU and the
    memory controller can optimize and can get almost perfect CPU throughput
    for. In contrast, doing a COW implies a lot of random walking over
    multiple data structures. And _if_ it's all in cache, it's probably ok,
    but you're totally screwed if you need to send an IPI to another CPU to
    actually flush the TLB (to make the COW actually take effect!).

    So yes, you can win by COW'ing, but it's rare, and it mainly happens in
    benchmarks.

    For example, I had a trial patch long long ago (I think ten years by now)
    to do pipe transfers as copying pages around with COW. It was absolutely
    _beautiful_ in benchmarks. I could transfer gigabytes per second, and this
    was on something like a Pentium/MMX which had what, 7-10MB/s memcpy
    performance?

    In other words, I don't dispute at all that COW schemes can perform really
    really well.

    HOWEVER - the COW scheme actually performed _worse_ in any real world
    benchmark, including just compiling the kernel (we used to use -pipe to
    gcc to transfer data between cc/as).

    The reason? The benchmark worked _really_ well, because what it did was
    basically to do a trivial microbenchmark that did

    for (;;) {
        write(fd, buffer, bufsize);
    }

    and do you see something unrealistic there? Right: it never actually
    touched the buffer itself, so it would not ever actually trigger the COW
    case, and as a result, the memory got marked read-only on the first time,
    and it never ever took a fault, and in fact the TLB never ever needed to
    be flushed after the first one because the page was already marked
    read-only.

    That's simply not _realistic_. It's hardly how any real load works.

    > > > There was a linux aio_sendfile() too. Google still knows about its
    > > > numbers, graphs and so on...

    > >
    > > I vaguely remember it's performance didn't seem that good.

    >
    >
    > Benchmark of the 100 1MB files transfer (files are in VFS already) using
    > sync sendfile() against aio_sendfile_path() shows about 10MB/sec
    > performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
    > sending server is one-way AMD Athlong 64 3500+) for aio_sendfile_path().
    >

    >
    > So, it was really better that sync sendfile


    I suspect it wasn't any better with small files and small transfers.

    Yes, some people do big files. Physicists have special things where they
    get a few terabytes per second from some high-energy experiment. The
    various people spying on you have special setups where they move gigabytes
    of satellite map data around to visualize it.

    So there are undeniably cases like that, but they are also usually so
    special that they really don't even care about COW, because they sure as
    hell don't care about somebody else modifying the file they're sending at
    the same time.

    In fact the whole point is that they don't touch the data at the CPU
    _at_all_, and the reason they want zero-copy sending is that they
    literally want to DMA from disk buffers to memory, and then from memory to
    a network interface, and they don't want the CPU _ever_ seeing it with all
    the cache invalidations etc.

    And _that_ is the case where you should use sendfile(). If your CPU has
    actually touched the data, you should probably just use read()/write().

    Of course, one of the really nice things about splice() (which was not
    true about sendfile()) is that you can actually mix-and-match. You can
    splice data from kernel buffers, but you can also splice data from user
    VM, or you can do regular "write()" calls to fill (or read) the data from
    the splice pipe.

    This is useful in ways that sendfile() never was. You can write() headers
    into the pipe buffer, and then splice() the file data into it, and the
    user only sees a pipe (and can either read from it or splice it or tee it
    or whatever). IOW, splice very much has the UNIX model of "everything is
    a pipe", taken to one (admittedly odd) extreme.

    Anyway, the correct way to use splice() is to either just know the data is
    "safe" (either because you are _ok_ with people modifying it after the
    splice op took place, or because you know nobody will). The alternative is
    to expect an acknowledgement from the other side, because then you know
    the buffer is done.

    Linus

  10. Re: [patch v3] splice: fix race with page invalidation



    On Thu, 31 Jul 2008, Nick Piggin wrote:
    >
    > It seems like the right way to fix this would be to allow the splicing
    > process to be notified of a short read, in which case it could try to
    > refill the pipe with the unread bytes...


    Hmm. That should certainly work with the splice model. The users of the
    data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
    data", and stop. And it would be up to the other side to handle it (it
    can see the state of the pipe, we can make it both wake up POLL_ERR _and_
    return an error if somebody tries to write to a "blocked" pipe).

    So yes, that's very possible, but it obviously requires splice() users to
    be able to handle more cases. I'm not sure it's realistic to expect users
    to be that advanced.

    Linus

  11. Re: [patch v3] splice: fix race with page invalidation

    Linus Torvalds wrote:
    > > , or (b) while sendfile claims those
    > > pages, they are marked COW.

    >
    > .. and this one shows that you have no clue about performance of a memcpy.
    >
    > Once you do that COW, you're actually MUCH BETTER OFF just copying.
    > Copying a page is much cheaper than doing COW on it.


    That sounds familiar :-)

    But did you miss the bit where you DON'T COPY ANYTHING EVER*? COW is
    able to provide _correctness_ for the rare corner cases which you're not
    optimising for. You don't actually copy more than 0.0% (*approx).

    Copying is supposed to be _so_ rare, in this, that it doesn't count.

    Correctness is so you can say "I've written that, when the syscall
    returns I know what data the other end will receive, if it succeeds".
    Knowing what can happen in what order is bread and butter around here,
    you know how useful that can be.

    The cost of COW is TLB flushes*. But for splice, there ARE NO TLB
    FLUSHES because such files are not mapped writable! And you don't
    intend to write the file soon either. A program would be daft to use
    splice _intending_ to do those things, it obviously would be poor use
    of the interface. The kernel may as well copy the data if they did
    (and it's in a good position to decide).

    > Doing a "write()" really isn't that expensive. People think that
    > memory is slow, but memory isn't all that slow, and caches work
    > really well. Yes, memory is slow compared to a few reference count
    > increments, but memory is absolutely *not* slow when compared to the
    > overhead of TLB invalidates across CPUs etc.


    You're missing the real point of network splice().

    It's not just for speed.

    It's for sharing data. Your TCP buffers can share data, when the same
    big lump is in flight to lots of clients. Think static file / web /
    FTP server, the kind with 80% of hits to 0.01% of the files, roughly
    the same size as your RAM.

    You want network splice() for the same reason you want shared
    libraries. So that memory use scales better with some loads**.

    You don't know how much good that will do, only, like shared
    libraries, that it's intrinsically good if it doesn't cost anything.
    And I'm suggesting that since no TLB flushes or COW copies are
    expected, and you can just copy at sendfile time if the page is
    already write-mapped anywhere, so the page references aren't
    complicated, it shouldn't cost anything.

    ** - Admittedly this is rather load dependent. But it's potentially
    O(c*d) for write vs. O(d) for sendfile, hand-wavingly, where c is the
    number of connections using d data. (Then again, as I work out the
    figures, RAM is getting cheaper faster than bandwidth-latency products
    are getting bigger... It's not a motivator except for cheapskates.
    But doesn't detract from intrinsic goodness.)

    -- Jamie

  12. Re: [patch v3] splice: fix race with page invalidation

    On Thu, 31 Jul 2008, Linus Torvalds wrote:
    > On Thu, 31 Jul 2008, Nick Piggin wrote:
    > >
    > > It seems like the right way to fix this would be to allow the splicing
    > > process to be notified of a short read, in which case it could try to
    > > refill the pipe with the unread bytes...

    >
    > Hmm. That should certainly work with the splice model. The users of the
    > data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
    > data", and stop. And it would be up to the other side to handle it (it
    > can see the state of the pipe, we can make it both wake up POLL_ERR _and_
    > return an error if somebody tries to write to a "blocked" pipe).
    >
    > So yes, that's very possible, but it obviously requires splice() users to
    > be able to handle more cases. I'm not sure it's realistic to expect users
    > to be that advanced.


    Worse, it's not guaranteed to make any progress. E.g. on an NFS server
    with data being continually updated, the cache on the client will
    basically always be invalid, so the above scheme might just repeat the
    splices forever without success.

    Miklos

  13. Re: [patch v3] splice: fix race with page invalidation



    On Thu, 31 Jul 2008, Jamie Lokier wrote:
    >
    > But did you miss the bit where you DON'T COPY ANYTHING EVER*? COW is
    > able provide _correctness_ for the rare corner cases which you're not
    > optimising for. You don't actually copy more than 0.0% (*approx).


    The thing is, just even _marking_ things COW is the expensive part. If we
    have to walk page tables - we're screwed.

    > The cost of COW is TLB flushes*. But for splice, there ARE NO TLB
    > FLUSHES because such files are not mapped writable!


    For splice, there are also no flags to set, no extra tracking costs, etc
    etc.

    But yes, we could make splice (from a file) do something like

    - just fall back to copy if the page is already mapped (page->mapcount
    gives us that)

    - set a bit ("splicemapped") when we splice it in, and increment
    page->mapcount for each splice copy.

    - if a "splicemapped" page is ever mmap'ed or written to (either through
    write or truncate), we COW it then (and actually move the page cache
    page - it would be a "woc": a reverse cow, not a normal one).

    - do all of this with page lock held, to make sure that there are no
    writers or new mappers happening.

    So it's probably doable.

    (We could have a separate "splicecount", and actually allow non-writable
    mappings, but I suspect we cannot afford the space in the "struct page"
    for a whole new count).
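
    Not kernel code, but a toy userspace model of that decision logic, just
    to make the bookkeeping above concrete (field names mimic the real
    ones; the "reverse cow" is the copy in write_page()):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    struct toy_page {
        char data[4096];
        int mapcount;           /* stand-in for page->mapcount */
        bool splicemapped;      /* set when the page sits in a pipe */
    };

    /* Splice-in: fall back to a copy if the page is already mapped,
     * otherwise just take a reference and mark it splicemapped. */
    static struct toy_page *splice_in(struct toy_page *pg)
    {
        if (pg->mapcount > 0 && !pg->splicemapped) {
            struct toy_page *copy = malloc(sizeof(*copy));

            memcpy(copy->data, pg->data, sizeof(copy->data));
            copy->mapcount = 1;
            copy->splicemapped = true;
            return copy;        /* pipe gets a private copy */
        }
        pg->splicemapped = true;
        pg->mapcount++;         /* one ref per splice copy */
        return pg;
    }

    /* Write/truncate: a splicemapped page is replaced in the "page cache"
     * first, so the pipe keeps seeing the old contents (the "woc"). */
    static void write_page(struct toy_page **cache_slot,
                           const char *buf, size_t len)
    {
        struct toy_page *pg = *cache_slot;

        if (pg->splicemapped) {
            struct toy_page *fresh = malloc(sizeof(*fresh));

            memcpy(fresh->data, pg->data, sizeof(fresh->data));
            fresh->mapcount = 0;
            fresh->splicemapped = false;
            *cache_slot = fresh;    /* cache now points at the copy */
            pg = fresh;
        }
        memcpy(pg->data, buf, len < sizeof(pg->data) ? len : sizeof(pg->data));
    }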

    > You're missing the real point of network splice().
    >
    > It's not just for speed.
    >
    > It's for sharing data. Your TCP buffers can share data, when the same
    > big lump is in flight to lots of clients. Think static file / web /
    > FTP server, the kind with 80% of hits to 0.01% of the files roughly
    > the same of your RAM.


    Maybe. Does it really show up as a big thing?

    Linus

  14. Re: [patch v3] splice: fix race with page invalidation

    On Friday 01 August 2008 04:13, Miklos Szeredi wrote:
    > On Thu, 31 Jul 2008, Linus Torvalds wrote:
    > > On Thu, 31 Jul 2008, Nick Piggin wrote:
    > > > It seems like the right way to fix this would be to allow the splicing
    > > > process to be notified of a short read, in which case it could try to
    > > > refill the pipe with the unread bytes...

    > >
    > > Hmm. That should certainly work with the splice model. The users of the
    > > data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
    > > data", and stop. And it would be up to the other side to handle it (it
    > > can see the state of the pipe, we can make it both wake up POLL_ERR _and_
    > > return an error if somebody tries to write to a "blocked" pipe).
    > >
    > > So yes, that's very possible, but it obviously requires splice() users to
    > > be able to handle more cases. I'm not sure it's realistic to expect users
    > > to be that advanced.

    >
    > Worse, it's not guaranteed to make any progress. E.g. on NFS server
    > with data being continually updated, cache on the client will
    > basically always be invalid, so the above scheme might just repeat the
    > splices forever without success.


    Well, a) it probably makes sense in that case to provide another mode
    of operation which fills the data synchronously from the sender and
    copies it to the pipe (although the sender might just use read/write).
    And b) we could *also* look at clearing PG_uptodate as an optimisation
    iff that is found to help.

    But I think it is kind of needed. The data comes from the sender, and
    so only the sender may really know what to do in case of failure. I
    think it is quite reasonable for an asynchronous interface to have some
    kind of completion/error check, and I think users should be that
    advanced... if they aren't that advanced, they could use the synchronous,
    copying flag to splice outlined in a), and then they wouldn't have to
    care.
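
    A sketch of what that synchronous, copying escape hatch could look like
    from the sender's side today, without any API change (fds and offsets
    are the caller's; a partial chunk may be resent if the zero-copy path
    dies halfway through it):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static int send_range(int sock_fd, int file_fd, loff_t start, size_t len)
    {
        size_t sent = 0;
        int p[2];
        char buf[65536];

        if (pipe(p) == 0) {
            while (sent < len) {
                loff_t off = start + sent;
                size_t chunk = len - sent;
                ssize_t in, out, drained = 0;

                if (chunk > sizeof(buf))
                    chunk = sizeof(buf);

                /* Zero-copy path: file -> pipe -> socket. */
                in = splice(file_fd, &off, p[1], NULL, chunk, SPLICE_F_MORE);
                if (in <= 0)
                    break;
                while (drained < in) {
                    out = splice(p[0], NULL, sock_fd, NULL,
                                 in - drained, SPLICE_F_MORE);
                    if (out <= 0)
                        goto fallback;      /* e.g. -EIO after invalidation */
                    drained += out;
                }
                sent += in;
            }
    fallback:
            close(p[0]);
            close(p[1]);
        }

        /* Synchronous, copying fallback for whatever remains. */
        while (sent < len) {
            size_t chunk = len - sent;
            ssize_t n;

            if (chunk > sizeof(buf))
                chunk = sizeof(buf);
            n = pread(file_fd, buf, chunk, start + sent);
            if (n <= 0)
                return -1;
            if (write(sock_fd, buf, n) != n)
                return -1;
            sent += n;
        }
        return 0;
    }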

  15. Re: [patch v3] splice: fix race with page invalidation

    On Fri, 1 Aug 2008, Nick Piggin wrote:
    > Well, a) it probably makes sense in that case to provide another mode
    > of operation which fills the data synchronously from the sender and
    > copys it to the pipe (although the sender might just use read/write)
    > And b) we could *also* look at clearing PG_uptodate as an optimisation
    > iff that is found to help.


    IMO it's not worth it to complicate the API just for the sake of
    correctness in the so-very-rare read error case. Users of the splice
    API will simply ignore this requirement, because things will work fine
    on ext3 and friends, and will break only rarely on NFS and FUSE.

    So I think it's much better to make the API simple: invalid pages are
    OK, and for I/O errors we return -EIO on the pipe. It's not 100%
    correct, but all in all it will result in less buggy programs.

    Thanks,
    Miklos
    ----

    Subject: mm: dont clear PG_uptodate on truncate/invalidate

    From: Miklos Szeredi

    Brian Wang reported that a FUSE filesystem exported through NFS could return
    I/O errors on read. This was traced to splice_direct_to_actor() returning a
    short or zero count when racing with page invalidation.

    However this is not FUSE or NFSD specific, other filesystems (notably NFS)
    also call invalidate_inode_pages2() to purge stale data from the cache.

    If this happens while such pages are sitting in a pipe buffer, then splice(2)
    from the pipe can return zero, and read(2) from the pipe can return ENODATA.

    The zero return is especially bad, since it implies end-of-file or
    disconnected pipe/socket, and is documented as such for splice. But returning
    an error for read() is also nasty, when in fact there was no error (data
    becoming stale is not an error).

    The same problems can be triggered by "hole punching" with
    madvise(MADV_REMOVE).

    Fix this by not clearing the PG_uptodate flag on truncation and
    invalidation.

    Signed-off-by: Miklos Szeredi
    ---
    mm/truncate.c | 2 --
    1 file changed, 2 deletions(-)

    Index: linux-2.6/mm/truncate.c
    ===================================================================
    --- linux-2.6.orig/mm/truncate.c	2008-07-28 17:45:02.000000000 +0200
    +++ linux-2.6/mm/truncate.c	2008-08-01 20:18:51.000000000 +0200
    @@ -104,7 +104,6 @@ truncate_complete_page(struct address_sp
         	cancel_dirty_page(page, PAGE_CACHE_SIZE);
     
         	remove_from_page_cache(page);
    -   	ClearPageUptodate(page);
         	ClearPageMappedToDisk(page);
         	page_cache_release(page);	/* pagecache ref */
     }
    @@ -356,7 +355,6 @@ invalidate_complete_page2(struct address
         	BUG_ON(PagePrivate(page));
         	__remove_from_page_cache(page);
         	spin_unlock_irq(&mapping->tree_lock);
    -   	ClearPageUptodate(page);
         	page_cache_release(page);	/* pagecache ref */
         	return 1;
     failed:

  16. Re: [patch v3] splice: fix race with page invalidation



    On Fri, 1 Aug 2008, Miklos Szeredi wrote:
    >
    > Subject: mm: dont clear PG_uptodate on truncate/invalidate
    > From: Miklos Szeredi


    Ok, this one I have no problems with what-so-ever. I'd like Ack's for this
    kind of change (and obviously hope that it's tested), but it looks clean
    and I think the new rules are better (not just for this case).

    Linus

  17. Re: [patch v3] splice: fix race with page invalidation

    On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
    > On Fri, 1 Aug 2008, Nick Piggin wrote:
    > > Well, a) it probably makes sense in that case to provide another mode
    > > of operation which fills the data synchronously from the sender and
    > > copys it to the pipe (although the sender might just use read/write)
    > > And b) we could *also* look at clearing PG_uptodate as an optimisation
    > > iff that is found to help.

    >
    > IMO it's not worth it to complicate the API just for the sake of
    > correctness in the so-very-rare read error case. Users of the splice
    > API will simply ignore this requirement, because things will work fine
    > on ext3 and friends, and will break only rarely on NFS and FUSE.
    >
    > So I think it's much better to make the API simple: invalid pages are
    > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
    > correct, but all in all it will result in less buggy programs.


    That's true, but I hate how we always (in the VM, at least) just brush
    error handling under the carpet because it is too hard...

    I guess your patch is OK, though. I don't see any reason it could cause
    problems...

  18. Re: [patch v3] splice: fix race with page invalidation

    Nick Piggin wrote:
    > On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
    > > On Fri, 1 Aug 2008, Nick Piggin wrote:
    > > > Well, a) it probably makes sense in that case to provide another mode
    > > > of operation which fills the data synchronously from the sender and
    > > > copys it to the pipe (although the sender might just use read/write)
    > > > And b) we could *also* look at clearing PG_uptodate as an optimisation
    > > > iff that is found to help.

    > >
    > > IMO it's not worth it to complicate the API just for the sake of
    > > correctness in the so-very-rare read error case. Users of the splice
    > > API will simply ignore this requirement, because things will work fine
    > > on ext3 and friends, and will break only rarely on NFS and FUSE.
    > >
    > > So I think it's much better to make the API simple: invalid pages are
    > > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
    > > correct, but all in all it will result in less buggy programs.

    >
    > That's true, but I hate how we always (in the VM, at least) just brush
    > error handling under the carpet because it is too hard
    >
    > I guess your patch is OK, though. I don't see any reasons it could cause
    > problems...


    At least, if there are situations where the data received is not what
    a common sense programmer would expect (e.g. blocks of zeros, data
    from an unexpected time in syscall sequence, or something, or just
    "reliable except with FUSE and NFS"), please ensure it's documented in
    splice.txt or wherever.

    -- Jamie

  19. Re: [patch v3] splice: fix race with page invalidation

    On Tuesday 05 August 2008 01:29, Jamie Lokier wrote:
    > Nick Piggin wrote:
    > > On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
    > > > On Fri, 1 Aug 2008, Nick Piggin wrote:
    > > > > Well, a) it probably makes sense in that case to provide another mode
    > > > > of operation which fills the data synchronously from the sender and
    > > > > copys it to the pipe (although the sender might just use read/write)
    > > > > And b) we could *also* look at clearing PG_uptodate as an
    > > > > optimisation iff that is found to help.
    > > >
    > > > IMO it's not worth it to complicate the API just for the sake of
    > > > correctness in the so-very-rare read error case. Users of the splice
    > > > API will simply ignore this requirement, because things will work fine
    > > > on ext3 and friends, and will break only rarely on NFS and FUSE.
    > > >
    > > > So I think it's much better to make the API simple: invalid pages are
    > > > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
    > > > correct, but all in all it will result in less buggy programs.

    > >
    > > That's true, but I hate how we always (in the VM, at least) just brush
    > > error handling under the carpet because it is too hard
    > >
    > > I guess your patch is OK, though. I don't see any reasons it could cause
    > > problems...

    >
    > At least, if there are situations where the data received is not what
    > a common sense programmer would expect (e.g. blocks of zeros, data
    > from an unexpected time in syscall sequence, or something, or just
    > "reliable except with FUSE and NFS"), please ensure it's documented in
    > splice.txt or wherever.


    Not quite true. Many filesystems can return -EIO, and truncate can
    partially zero pages.

    Basically the man page should note that until the splice API is
    improved, then a) -EIO errors will be seen at the receiver, b)
    the pages can see transient zeroes (this is the case with read(2)
    as well, but splice has a much bigger window), and c) the sender
    does not send a snapshot of the data because it can still be modified
    until it is received.

    c is not too surprising for an asynchronous interface, but it is
    nice to document in case people are expecting COW or something.
    b and c can more or less be worked around by not doing silly things
    like truncating or scribbling on data until the receiver really has it.
    a, I argue, should be fixed in the API.

  20. Re: [patch v3] splice: fix race with page invalidation

    Nick, Jens

    On Tue, Aug 5, 2008 at 4:57 AM, Nick Piggin wrote:
    > On Tuesday 05 August 2008 01:29, Jamie Lokier wrote:
    >> Nick Piggin wrote:
    >> > On Saturday 02 August 2008 04:28, Miklos Szeredi wrote:
    >> > > On Fri, 1 Aug 2008, Nick Piggin wrote:
    >> > > > Well, a) it probably makes sense in that case to provide another mode
    >> > > > of operation which fills the data synchronously from the sender and
    >> > > > copys it to the pipe (although the sender might just use read/write)
    >> > > > And b) we could *also* look at clearing PG_uptodate as an
    >> > > > optimisation iff that is found to help.
    >> > >
    >> > > IMO it's not worth it to complicate the API just for the sake of
    >> > > correctness in the so-very-rare read error case. Users of the splice
    >> > > API will simply ignore this requirement, because things will work fine
    >> > > on ext3 and friends, and will break only rarely on NFS and FUSE.
    >> > >
    >> > > So I think it's much better to make the API simple: invalid pages are
    >> > > OK, and for I/O errors we return -EIO on the pipe. It's not 100%
    >> > > correct, but all in all it will result in less buggy programs.
    >> >
    >> > That's true, but I hate how we always (in the VM, at least) just brush
    >> > error handling under the carpet because it is too hard
    >> >
    >> > I guess your patch is OK, though. I don't see any reasons it could cause
    >> > problems...

    >>
    >> At least, if there are situations where the data received is not what
    >> a common sense programmer would expect (e.g. blocks of zeros, data
    >> from an unexpected time in syscall sequence, or something, or just
    >> "reliable except with FUSE and NFS"), please ensure it's documented in
    >> splice.txt or wherever.

    >
    > Not quite true. Many filesystems can return -EIO, and truncate can
    > partially zero pages.
    >
    > Basically the man page should note that until the splice API is
    > improved, then a) -EIO errors will be seen at the receiever, b)
    > the pages can see transient zeroes (this is the case with read(2)
    > as well, but splice has a much bigger window), and c) the sender
    > does not send a snapshot of data because it can still be modified
    > until it is recieved.
    >
    > c is not too surprising for an asynchronous interface, but it is
    > nice to document in case people are expecting COw or something.
    > b and c can more or less be worked around by not doing silly things
    > like truncating or scribbling on data until reciever really has it.
    > a, I argue, should be fixed in API.


    Nick, could you come up with a patch to the man page for this?
    Something that's ACKable by Jens?

    Cheers,

    Michael

    --
    Michael Kerrisk
    Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
    Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html
