Reproducible data corruption with sendfile+vsftp - splice regression? - Kernel


  1. Reproducible data corruption with sendfile+vsftp - splice regression?


    Hi -

    This regular Linux user and lkml lurker just noticed data corruption in
    ftp'ed files and narrowed it down to vsftpd using sendfile(). So far this
    has never caused problems in the past; I have not noticed this with
    2.6.22.x but may have missed it. I do remember reading about some changes
    to the underlying splice stuff since .23 so that may have something to do
    with it.

    The scenario:

    - created a file with known bit pattern on Linux server
    - ftp-got this file to Windows client: file has bad crc (yes, binary)
    - verified with another client: same result

    I have thus far eliminated (to the best of my knowledge) NICs, switches,
    cables, the Windows FTP clients, the hard disk in the server (SATA, ext3):
    nothing suspicious in any logs. Box is an AMD Sempron 2600+ with 1.5 GB
    RAM, added r8169 card, Gentoo, vsftpd stable 2.0.5 - nothing fancy.
    Transferring the file with samba (interestingly with sendfile enabled) and
    via ftp but from /dev/shm repeatably works fine; pulling from disk creates
    bad crc, every time. The file is readable and can be copied, verified etc.
    over and over so I'm sure that I'm not falling prey to a false positive.
    ifconfig indicates no dropped or otherwise corrupted packets.
    I noticed this first with 2.6.24-rc3, but also just tried the latest stable
    2.6.23.9 with the same config, with no change in behaviour. After setting
    vsftpd to use_sendfile=NO, gigs can be transferred without corruption.

    The data corruption is sporadic, but absolutely repeatable. The file with
    the known good pattern just contains multiple lines of:

    01234567890123456789012345678901234567890123456789 0
    01234567890123456789012345678901234567890123456789 0
    01234567890123456789012345678901234567890123456789 0
    ...etc..

    A corrupted file is missing random characters, so that the corrupted lines
    look like this (line numbers added by me):

    19785: 01234567890123456789012345678901234567890123456789 0
    19786: 01234567890123456789012345678901234567890123678901 234567890
    19787: 01234567890123456789012345678901234567890123456789 0

    or:

    20074: 01234567890123456789012345678901234567890123456789 0
    20075: 01234567890123456789012345678901234567890123012345 678901234567890123456789012345678901234567890
    20076: 01234567890123456789012345678901234567890123456789 0
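
Damaged lines like these can be found mechanically rather than by eye. The sketch below assumes the whole downloaded file has been read into memory and that every line is supposed to be identical to the first one; the function name `first_bad_line` is hypothetical and not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the 1-based number of the first complete
 * line that differs from line 1, or 0 if all complete lines match.
 * A trailing line without '\n' is ignored. */
size_t first_bad_line(const char *text, size_t len)
{
    if (len == 0)
        return 0;
    assert(text != NULL);

    const char *end = text + len;
    const char *nl = memchr(text, '\n', len);
    if (nl == NULL)
        return 0;                               /* no complete line */

    size_t ref_len = (size_t)(nl - text) + 1;   /* reference line incl. '\n' */
    size_t lineno = 1;
    const char *p = text + ref_len;

    while (p < end) {
        const char *q = memchr(p, '\n', (size_t)(end - p));
        if (q == NULL)
            break;                              /* partial last line */
        size_t cur_len = (size_t)(q - p) + 1;
        lineno++;
        if (cur_len != ref_len || memcmp(p, text, ref_len) != 0)
            return lineno;                      /* length or content differs */
        p = q + 1;
    }
    return 0;
}
```

Because every line carries the same pattern, this only shows that something went wrong near a given line; it cannot tell how much data is missing.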

    Again, other network or hd traffic shows no signs of gremlins; the box is
    perfectly stable, and turning sendfile on or off triggers/untriggers the
    corruption reliably. I will try 2.6.22.x over the weekend, and before I
    bother lkml with dmesg/.config etc. I wanted to fish for initial thoughts.

    thanks
    Holger


    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

  2. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Holger Hoffstaette wrote:
    > Hi -
    >
    > This regular Linux user and lkml lurker just noticed data corruption in
    > ftp'ed files and narrowed it down to vsftpd using sendfile(). So far this
    > has never caused problems in the past; I have not noticed this with
    > 2.6.22.x but may have missed it. I do remember reading about some changes
    > to the underlying splice stuff since .23 so that may have something to do
    > with it.
    >
    > The scenario:
    >
    > - created a file with known bit pattern on Linux server
    > - ftp-got this file to Windows client: file has bad crc (yes, binary)
    > - verified with another client: same result
    >
    > I have thus far eliminated (to the best of my knowledge) NICs, switches,
    > cables, the Windows FTP clients, the hard disk in the server (SATA, ext3):
    > nothing suspicious in any logs. Box is an AMD Sempron 2600+ with 1.5 GB
    > RAM, added r8169 card, Gentoo, vsftpd stable 2.0.5 - nothing fancy.
    > Transferring the file with samba (interestingly with sendfile enabled) and
    > via ftp but from /dev/shm repeatably works fine; pulling from disk creates
    > bad crc, every time. The file is readable and can be copied, verified etc.
    > over and over so I'm sure that I'm not falling prey to a false positive.
    > ifconfig indicates no dropped or otherwise corrupted packets.
    > I noticed this first with 2.6.24-rc3, but also just tried the latest stable
    > 2.6.23.9 with the same config, with no change in behaviour. After setting
    > vsftpd to use_sendfile=NO, gigs can be transferred without corruption.
    >
    > The data corruption is sporadic, but absolutely repeatable. The file with
    > the known good pattern just contains multiple lines of:
    >
    > 01234567890123456789012345678901234567890123456789 0
    > 01234567890123456789012345678901234567890123456789 0
    > 01234567890123456789012345678901234567890123456789 0
    > ..etc..
    >
    > A corrupted file is missing random characters, so that the corrupted lines
    > look like this (line numbers added by me):
    >
    > 19785: 01234567890123456789012345678901234567890123456789 0
    > 19786: 01234567890123456789012345678901234567890123678901 234567890
    > 19787: 01234567890123456789012345678901234567890123456789 0
    >
    > or:
    >
    > 20074: 01234567890123456789012345678901234567890123456789 0
    > 20075: 01234567890123456789012345678901234567890123012345 678901234567890123456789012345678901234567890
    > 20076: 01234567890123456789012345678901234567890123456789 0
    >
    > Again, other network or hd traffic shows no signs of gremlins; the box is
    > perfectly stable, and turning sendfile on or off triggers/untriggers the
    > corruption reliably. I will try 2.6.22.x over the weekend, and before I
    > bother lkml with dmesg/.config etc. I wanted to fish for initial thoughts.
    >


    CC to netdev, it might concern network guys

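    Could you try with a test file containing unique patterns?

    For example, an 80 MB file:

    #include <stdio.h>

    /* Write 10,000,000 eight-character fields (80 MB in total), each one
       unique, so any corruption can be pinned to an exact offset. */
    int main(void)
    {
        unsigned long ul;

        for (ul = 0; ul < 10000000; ul++)
            printf("%8lu", ul);
        return 0;
    }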


    Thank you


  3. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    On Fri, 30 Nov 2007 09:07:53 +0100, Eric Dumazet wrote:

    > CC to netdev, it might concern network guys


    It is indeed related to network/r8169, more below.

    > Could you try with a test file containing unique patterns ?


    Same result, here is new information.

    - contrary to my first posting, the corruption does not reliably occur
    when a second client pulls the file; sorry for that. The difference is
    that the box that gets corrupted data only has a 100mbit interface, while
    the one that gets working data is completely gigabit (all on the same
    switch though).

    - after some digging in my server changelogs I noticed that I had enabled
    misc. r8169 offload options not too long ago (while migrating to gigabit
    and perftesting the new network), and bingo! Turning off tso (leaving all
    others on except for UDP which is apparently not implemented) eliminated
    the corruption while ftp'ing to the slower 100mbit client.

    I have since just permanently disabled tso and everything is
    fine with and without sendfile. So this seems to be either a bug with the
    r8169 or some bad interaction of tso with sendfile, but then maybe it's
    just the symptom of a race condition/timing problem. Is tso on the r8169
    known to be kaput?

    lspci says:

    00:08.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
            Subsystem: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet
            Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 17
            I/O ports at d000 [size=256]
            Memory at f6022000 (32-bit, non-prefetchable) [size=256]
            [virtual] Expansion ROM at 60000000 [disabled] [size=128K]
            Capabilities: [dc] Power Management version 2

    Further suggestions welcome, looks like we're getting somewhere.
    I can still create broken files with tso and the unique patterns that Eric
    suggested, if that helps tracking down the tso corruption.

    thank you!
    Holger



  4. Re: Reproducible data corruption with sendfile+vsftp - splice regression?


    Btw, the r8169 has NAPI enabled.

    kernel config:
    http://hoho.dyndns.org/~holger/dist/...g-x86-2.6.23.9

    dmesg:
    http://hoho.dyndns.org/~holger/dist/dmesg

    lspci -vv:
    http://hoho.dyndns.org/~holger/dist/lspci

    thanks
    Holger



  5. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Could the corruption be seen in a tcpdump trace prior to transmission
    (ie taken on the sender) or was it only seen after the data passed out
    the NIC?

    rick jones

  6. Re: Reproducible data corruption with sendfile+vsftp - splice regression?


    On Fri, 30 Nov 2007 10:26:54 -0800, Rick Jones wrote:

    > Could the corruption be seen in a tcpdump trace prior to transmission (ie
    > taken on the sender) or was it only seen after the data passed out the
    > NIC?


    I did the following:

    1) turn on tso on the server's r8169: ethtool --offload eth0 tso on
    2) on the server: tcpdump -i eth0 -s 0 -w
    3) ftp'ed file to 100mbit client

    As expected the file was corrupted, and the various corrupted byte
    sequences also show up in the tcpdump file at the corresponding offsets.

    I did this with 2.6.22.14, so it does not seem to be a recent regression
    in .23/.24.

    All files can be found here:
    http://hoho.dyndns.org/~holger/dist/r8169-tso/

    I will gladly try out any other tweaks but need some guidance as I don't
    know what exactly to change - maybe without NAPI for the r8169?

    thank you
    Holger



  7. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    On Sun, 02 Dec 2007 17:00:03 +0100, Holger Hoffstaette wrote:

    > On Fri, 30 Nov 2007 10:26:54 -0800, Rick Jones wrote:
    >
    >> Could the corruption be seen in a tcpdump trace prior to transmission
    >> (ie taken on the sender) or was it only seen after the data passed out
    >> the NIC?
    >
    > I did the following:
    >
    > 1) turn on tso on the server's r8169: ethtool --offload eth0 tso on
    > 2) on the server: tcpdump -i eth0 -s 0 -w
    > 3) ftp'ed file to 100mbit client
    >
    > As expected the file was corrupted, and the various corrupted byte
    > sequences also show up in the tcpdump file at the corresponding offsets.
    >
    > I did this with 2.6.22.14, so it does not seem to be a recent regression
    > in .23/.24.
    >
    > All files can be found here:
    > http://hoho.dyndns.org/~holger/dist/r8169-tso/
    >
    > I will gladly try out any other tweaks but need some guidance as I don't
    > know what exactly to change - maybe without NAPI for the r8169?


    Ta-daa! Rebuilding 2.6.22.14 (and I suspect all other versions) without
    NAPI for the r8169 but with tso enabled yields NO data corruption; the
    ftp'ed file has a good crc, repeatedly.

    Any suggestions how to proceed? Should I file this in bugzilla?

    thanks
    Holger



  8. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Holger Hoffstaette :
    [...]
    > Should I file this in bugzilla?


    Yes.

    --
    Ueimor

  9. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Francois Romieu :
    > Holger Hoffstaette :
    > [...]
    > > Should I file this in bugzilla?
    >
    > Yes.


    5326 5585327 5585328 5585329 5585330 5585331 5585332 5585333 5585334 5585335 558
    5336 5585337 5585338 5585339 5585340 5585341 5585342 5585343 5589440 5589441 558
                                                         ^^^^^^^ ^^^^^^^
    9442 5589443 5589444 5589445 5589446 5589447 5589448 5589449 5589450 5589451 558
    9452 5589453 5589454 5589455 5589456 5589457 5589458 5589459 5589460 5589461 558

    It misses 8*4096 bytes.
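
The figure follows from the layout of the test file: the generator posted earlier in the thread prints each number with `%8lu`, so every value below 100,000,000 occupies exactly 8 bytes. In the excerpt above, 5585343 is followed by 5589440 where 5585344 was expected. A small sketch of the arithmetic (the function name `skipped_bytes` is hypothetical):

```c
#include <assert.h>

/* Hypothetical helper: number of bytes between the record that was
 * expected and the record that actually showed up, for fixed-width
 * records of `width` bytes. Works for ascending or descending data. */
unsigned long skipped_bytes(unsigned long expected,
                            unsigned long observed,
                            unsigned long width)
{
    unsigned long records;

    assert(width > 0);
    records = observed > expected ? observed - expected
                                  : expected - observed;
    return records * width;
}
```

Here `skipped_bytes(5585344, 5589440, 8)` gives 4096 records of 8 bytes each, which is 32768 bytes, or 8 pages of 4096 bytes.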

    8443 9068442 9068441 9068440 9068439 9068438 9068437 9068436 9068435 9068434 906
    8433 9068432 9068431 9068430 9068429 9068428 9068427 9064330 9064329 9064328 906
                                                 ^^^^^^^ ^^^^^^^
    4327 9064326 9064325 9064324 9064323 9064322 9064321 9064320 9064319 9064318 906

    Same thing later.

    But the amount of data transmitted is fine.

    Could you locate the offsets where the sequence is broken?
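
One way to answer this without reading hex dumps is to walk the downloaded file record by record. The sketch below assumes an ascending test file made of 8-byte `%8lu` fields, as produced by the generator posted earlier in the thread, already loaded into memory; the names `find_sequence_breaks` and `RECORD_WIDTH` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define RECORD_WIDTH 8  /* bytes per "%8lu" field */

/* Hypothetical helper: treat buf as consecutive 8-byte decimal records
 * and report every record whose value is not its predecessor plus one.
 * Byte offsets of the first max_out breaks are stored in offsets[];
 * the return value is the total number of breaks found. */
size_t find_sequence_breaks(const char *buf, size_t len,
                            size_t *offsets, size_t max_out)
{
    size_t nrec = len / RECORD_WIDTH;
    size_t found = 0;
    unsigned long prev = 0;

    assert(buf != NULL || len == 0);

    for (size_t i = 0; i < nrec; i++) {
        char field[RECORD_WIDTH + 1];
        unsigned long cur;

        memcpy(field, buf + i * RECORD_WIDTH, RECORD_WIDTH);
        field[RECORD_WIDTH] = '\0';
        cur = strtoul(field, NULL, 10);   /* skips the leading padding */

        if (i > 0 && cur != prev + 1) {
            if (found < max_out)
                offsets[found] = i * RECORD_WIDTH;
            found++;
        }
        prev = cur;
    }
    return found;
}
```

Each reported offset is the position of the first record after a jump, so comparing offsets across several corrupted downloads shows whether the breaks fall on page boundaries.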

    --
    Ueimor

  10. Re: Reproducible data corruption with sendfile+vsftp - splice regression?


    On Wed, 05 Dec 2007 23:54:29 +0100, Francois Romieu wrote:

    > Holger Hoffstaette : [...]
    >> Should I file this in bugzilla?
    >
    > Yes.


    Thanks for responding - will do. I verified with 2.6.24-rc4 (same bug) and
    have some new information about this.
    Contrary to my previous posting, the corruption is NOT triggered by NAPI.
    It may be related, but even without NAPI, with tso on again, I got
    corruption, now also on the gbit client (Thinkpad T60). When ftp'ing to
    ramdisk at full speed (a reasonable ~77 MB/sec) it "often" works, but
    intermediate writes that cause the ftp to temporarily slow down reliably
    cause corrupted files, so I guess tso gets confused when some kind of
    throttling sets in during transfer. That is probably why I first noticed
    it on the slow 100mbit client.
    Maybe turning off sendfile or NAPI just led to random success - so far it
    really looks like tso on the r8169 is the common cause.

    thank you
    Holger



  11. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Holger Hoffstaette :
    [...]
    > Maybe turning off sendfile or NAPI just led to random success - so far it
    > really looks like tso on the r8169 is the common cause.


    TSO on the r8169 is the magic switch but the regression makes imvho more
    sense from a VM pov:

    - the corrupted file has the same size as the expected file
    - the corrupted file exhibits holes which come as a multiple of 4096 bytes
    (8*4k, 2 places, there may be more)
    - the r8169 driver does not know what a page is
    - the 8169 hardware has a small 8192 bytes Tx buffer

    It would be nice if someone could do a sendfile + vsftp test with TSO on a
    different hardware. While I could not reproduce the corruption when simply
    downloading a file that I had copied on the server with scp, it triggered
    almost immediately after I copied it locally and tried to download the copy.

    --
    Ueimor

  12. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    Francois Romieu wrote:
    > Holger Hoffstaette :
    > [...]
    >> Maybe turning off sendfile or NAPI just led to random success - so far it
    >> really looks like tso on the r8169 is the common cause.
    >
    > TSO on the r8169 is the magic switch but the regression makes imvho more
    > sense from a VM pov:
    >
    > - the corrupted file has the same size as the expected file
    > - the corrupted file exhibits holes which come as a multiple of 4096 bytes
    > (8*4k, 2 places, there may be more)

    ....

    That's interesting. I had those exact same symptoms here
    with copying data to/from a USB stick recently.
    But that stick died completely shortly thereafter,
    so this was written-off as "bad hardware".

    Strange that you see the same symptoms from a different scenario.
    Probably no relationship there, but ..


  13. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    On Thu, 06 Dec 2007 19:44:26 +0100, Francois Romieu wrote:

    > Holger Hoffstaette : [...]
    >> Maybe turning off sendfile or NAPI just led to random success - so far
    >> it really looks like tso on the r8169 is the common cause.
    >
    > TSO on the r8169 is the magic switch but the regression makes imvho more
    > sense from a VM pov:
    >
    > - the corrupted file has the same size as the expected file
    > - the corrupted file exhibits holes which come as a multiple of 4096 bytes
    > (8*4k, 2 places, there may be more)
    > - the r8169 driver does not know what a page is
    > - the 8169 hardware has a small 8192 bytes Tx buffer
    >
    > It would be nice if someone could do a sendfile + vsftp test with TSO on a
    > different hardware. While I could not reproduce the corruption when simply
    > downloading a file that I had copied on the server with scp, it triggered
    > almost immediately after I copied it locally and tried to download the
    > copy.


    Here's an update - sorry for the delay but I need that machine for everyday work.

    I have now gone back to enabling TSO since vsftp with sendfile really seems
    to be the only app that causes this. I have simply set vsftpd to
    use_sendfile=NO and no corruption occurs at all; the machine is stable and
    fast.

    FWIW the corruption can still be reproduced with 2.6.24-rc5. For kicks I
    have also tried -rc5 with SLAB instead of SLUB, but that didn't help
    either.

    The directory with the tcpdump & test data now also contains a few more
    corrupted files; maybe comparing the corruption offsets gives someone a
    better idea.

    thanks
    Holger



  14. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    On Thu, 13 Dec 2007 03:19:43 +0100, Holger Hoffstaette wrote:

    > I have now gone back to enabling TSO since vsftp with sendfile really seems
    > to be the only app that causes this. I have simply set vsftpd to
    > use_sendfile=NO and no corruption occurs at all; the machine is stable and
    > fast.


    In the good tradition of proving myself wrong I can reliably create
    corrupted files by wget-ting from apache (with sendfile enabled) as
    well, so no more TSO after all. No TSO, no corruption.
    The same also happens on a different machine with a r8169 (same model).
    Tickless kernel makes no difference either. Shot in the dark, but hey..

    Holger



  15. Re: Reproducible data corruption with sendfile+vsftp - splice regression?

    On Thu, 06 Dec 2007 19:44:26 +0100, Francois Romieu wrote:

    > Holger Hoffstaette : [...]
    >> Maybe turning off sendfile or NAPI just led to random success - so far
    >> it really looks like tso on the r8169 is the common cause.
    >
    > TSO on the r8169 is the magic switch but the regression makes imvho more
    > sense from a VM pov:
    >
    > - the corrupted file has the same size as the expected file
    > - the corrupted file exhibits holes which come as a multiple of 4096 bytes
    > (8*4k, 2 places, there may be more)
    > - the r8169 driver does not know what a page is
    > - the 8169 hardware has a small 8192 bytes Tx buffer
    >
    > It would be nice if someone could do a sendfile + vsftp test with TSO on a
    > different hardware. While I could not reproduce the corruption when simply
    > downloading a file that I had copied on the server with scp, it triggered
    > almost immediately after I copied it locally and tried to download the
    > copy.


    I tested 2.6.24-rc5 on my T60 (Intel e1000 built with NAPI) and installed
    vsftp/apache with sendfile and enabled all offload options incl. TSO.
    Repeated downloads of >500 MB with ftp or wget over the NIC onto ram- or
    physical disk give no corruption whatsoever. Speed of download to ramdisk
    is a nice continuous 125 MB/sec.
    Looks like the r8169 or the driver after all..

    thanks
    Holger


