keep rsync from removing unfinished source files? - Tools



  1. keep rsync from removing unfinished source files?

    I have two machines, speed and mass. speed has a fast Internet
    connection and is running a crawler which downloads a lot of files to
    disk. mass has a lot of disk space. I want to move the files from
    speed to mass after they're done downloading. Ideally, I'd just run:

    $ rsync --remove-source-files speed:/var/crawldir .

    but I worry that rsync will unlink a source file that hasn't finished
    downloading yet. (I looked at the source code and I didn't see
    anything protecting against this.) Any suggestions?

    Ideas I had were:
    - a pause between downloading the file list and downloading the files
    - an exclude rule for recently modified files
    - a check to not delete a file if its file size has changed since it was copied
    but I don't see any way to do any of these.
    --
    Please use reply-all for most replies to avoid omitting the mailing list.
    To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
    Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


  2. Re: keep rsync from removing unfinished source files?

    On Sun, 2008-09-07 at 10:59 -0400, Aaron Swartz wrote:
    > I have two machines, speed and mass. speed has a fast Internet
    > connection and is running a crawler which downloads a lot of files to
    > disk. mass has a lot of disk space. I want to move the files from
    > speed to mass after they're done downloading. Ideally, I'd just run:
    >
    > $ rsync --remove-source-files speed:/var/crawldir .
    >
    > but I worry that rsync will unlink a source file that hasn't finished
    > downloading yet. (I looked at the source code and I didn't see
    > anything protecting against this.)


    Yes, that could happen.

    > Ideas I had were:
    > - a pause between downloading the file list and downloading the files


    This approach would fail for very large files unless the pause is
    correspondingly long.

    > - an exclude rule for recently modified files
    > - a check to not delete a file if its file size has changed since it
    > was copied


    Either of these would probably work, and they would not be hard to
    implement by modifying rsync, but they seem hackish.

    IMO, a proper solution is to have the crawler indicate somehow which
    files are unfinished so rsync can avoid copying those. E.g., the
    crawler could name unfinished files according to a special pattern so
    that you could exclude them with --exclude, or it could keep them in a
    temporary directory that rsync doesn't visit.

    Matt



  3. Re: keep rsync from removing unfinished source files?

    Aaron, please CC rsync@lists.samba.org so that others can help you and
    your message is archived for others' future benefit.

    On Sun, 2008-09-07 at 16:16 -0400, Aaron Swartz wrote:
    > > IMO, a proper solution is to have the crawler indicate somehow which
    > > files are unfinished so rsync can avoid copying those. E.g., the
    > > crawler could name unfinished files according to a special pattern so
    > > that you could exclude them with --exclude, or it could keep them in a
    > > temporary directory that rsync doesn't visit.
    >
    > I agree, but this is not how most crawlers are written. (Imagine, e.g. wget.)


    Then I would modify the crawler.

    Matt



  4. Re: keep rsync from removing unfinished source files?

    On Sun, 2008-09-07 at 16:42 -0400, Matt McCutchen wrote:
    > On Sun, 2008-09-07 at 16:16 -0400, Aaron Swartz wrote:
    > > > IMO, a proper solution is to have the crawler indicate somehow which
    > > > files are unfinished so rsync can avoid copying those. E.g., the
    > > > crawler could name unfinished files according to a special pattern so
    > > > that you could exclude them with --exclude, or it could keep them in a
    > > > temporary directory that rsync doesn't visit.
    > >
    > > I agree, but this is not how most crawlers are written. (Imagine, e.g. wget.)
    >
    > Then I would modify the crawler.


    I hacked together a modified version of Fedora's wget that has an option
    --temp-file-suffix that you could use to make unfinished files
    excludable. The patch is attached. I also made some RPMs, which are
    available at:

    http://mattmccutchen.net/private/wget-temp-file-suffix/

    I ran a little test, and the option does seem to stop "rsync
    --remove-source-files" from deleting unfinished files when combined with
    an appropriate --exclude, but I won't claim that it will play nicely
    with all wget options.

    Matt

