New feature: detect and avoid transferring renamed files - Tools


Thread: New feature: detect and avoid transferring renamed files

  1. New feature: detect and avoid transferring renamed files

    Hello rsyncers,

    I have long wished for a feature in rsync to detect files that have
    been renamed on the sender side since the last time a sync was
    performed, and avoid transferring those files to the destination the
    same way it avoids transferring files that haven't changed.

    Example 1: a log directory (like /var/log) is backed up every day. Most
    of the time, rsync transfers very little data, but once a week it
    basically performs a full copy without finding any basis files on the
    destination copy. This is because the logs have been rotated and what
    was /var/log/example.1.gz has been renamed to /var/log/example.2.gz,
    /var/log/example.2.gz has become /var/log/example.3.gz, and so on.
    rsync has no way to know this.

    Example 2: a home directory is similarly backed up every day. One day,
    the user decides to clean house and move lots of files around, creating
    new directories, and moving hundreds of files around among different
    directories. Once again, rsync is going to have to retransfer each of
    those files.

    My solution is to add a stable, unchanging, name for each file in the
    transfer. As long as the --hard-links (-H) option is used, this stable
    name will provide a name that already exists on the receiver side, and
    the receiver can create a hard link to this name even when the file
    appears under a completely new name. These stable names do not actually
    exist on the sender side, they are synthesized by rsync from the
    unchanging attributes that confer the file its identity: its device and
    inode number.
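
A minimal sketch of how such a stable name could be synthesized from the device and inode numbers. This is not the code from the attached patch: the function name `stable_name`, the `byinode/` prefix layout, and the hexadecimal `dev-inode` format are assumptions made for illustration.

```python
import os
import tempfile


def stable_name(path, prefix="byinode"):
    """Build a name from the file's device and inode numbers (hypothetical format)."""
    st = os.lstat(path)
    # The name depends only on st_dev and st_ino, so renaming the file
    # within the same filesystem does not change it.
    return f"{prefix}/{st.st_dev:x}-{st.st_ino:x}"


def demo():
    """Rename a file and return its stable name before and after the rename."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "one")
        new = os.path.join(d, "oneone")
        with open(old, "w") as f:
            f.write("data")
        before = stable_name(old)
        os.rename(old, new)
        after = stable_name(new)
        return before, after
```

Calling `demo()` returns the same synthesized name twice, which is the property the receiver relies on to find an existing copy under the new visible name.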

    I have attached a patch for a proof of concept of this feature. With it,
    I can start with this directory structure on the sender side:

    testfiles/
    testfiles/one
    testfiles/two

    ...and rsync it to the destination. On the destination there is an extra
    directory called "byinode" which contains hard links to both regular files.
    Then I rename testfiles/one to testfiles/oneone. When I rsync again,
    instead of deleting the file called "one" and transferring the full contents
    of an apparently new file called "oneone", the file "one" is deleted and
    "oneone" is created by hardlinking to the stable filename.

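The receiver-side effect described above can be simulated with a short sketch. It does not reproduce rsync's own hard-link handling: the `sync` function, the `(visible_name, stable_name, content)` tuples, and the stable names such as `byinode/1-10` are all hypothetical, and the sketch ignores content changes and stale `byinode` entries.

```python
import os
import tempfile


def sync(dest, entries):
    """Simulate one transfer into dest.

    entries: list of (visible_name, stable_name, content_bytes).
    Returns the visible names whose content actually had to be copied.
    """
    os.makedirs(os.path.join(dest, "byinode"), exist_ok=True)
    transferred = []
    wanted = set()
    for name, stable, content in entries:
        stable_path = os.path.join(dest, stable)
        name_path = os.path.join(dest, name)
        wanted.add(name)
        if not os.path.exists(stable_path):
            # No copy under the stable name yet: the data must be sent.
            with open(stable_path, "wb") as f:
                f.write(content)
            transferred.append(name)
        if not os.path.exists(name_path):
            # The data is already present; only a hard link is needed.
            os.link(stable_path, name_path)
    # Rough equivalent of --delete for the visible names.
    for existing in os.listdir(dest):
        if existing != "byinode" and existing not in wanted:
            os.remove(os.path.join(dest, existing))
    return transferred


def demo():
    """Sync two files, rename one on the 'sender', and sync again."""
    with tempfile.TemporaryDirectory() as dest:
        first = sync(dest, [("one", "byinode/1-10", b"1"),
                            ("two", "byinode/1-11", b"2")])
        second = sync(dest, [("oneone", "byinode/1-10", b"1"),
                             ("two", "byinode/1-11", b"2")])
        names = sorted(os.listdir(dest))
        linked = os.path.samefile(os.path.join(dest, "oneone"),
                                  os.path.join(dest, "byinode", "1-10"))
        return first, second, names, linked
```

In this simulation the second run copies no data: "one" is removed and "oneone" appears as a hard link to the entry kept under the stable name.
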
    I would like to know the following:

    - Are people interested in a feature like this?

    - Is there a better way to do it?

    - It only works with protocol version 30 at the moment. Would there be
    any interest in making it work with older protocol versions? (it has to
    be done very differently in older versions)

    - Could this patch be committed once I take it beyond proof of concept
    state?

    Notes, if you want to test it:

    - The feature is hardcoded to be always enabled: it synthesizes a
    directory called "byinode" and adds it to the file list. The final
    version will make this a command line option, of course.
    - It only works with protocol version 30.
    - Use with at least --delete --no-i-r -r
    - Only the sender side requires the patch
    - The patch is against rsync-3.0.3pre2

    Thank you for your feedback

    -Phil

    --
    Please use reply-all for most replies to avoid omitting the mailing list.
    To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
    Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html

  2. Re: New feature: detect and avoid transferring renamed files

    Sorry for the slow reply -- I marked your message for more in-depth
    study, and failed to get back to it until now.

    On Sun, Jun 22, 2008 at 08:01:16PM -0400, Phil Vandry wrote:
    > I have long wished for a feature in rsync to detect files that have
    > been renamed on the sender side since the last time a sync was
    > performed, and avoid transferring those files to the destination the
    > same way it avoids transferring files that haven't changed.


    The detect-renamed patch in the patches directory has one possible
    implementation of this, but it fails to handle things like the /var/log
    rotation where files get renamed over the top of other files in the
    transfer.

    Your solution is quite an interesting one, but it does have some minor
    drawbacks:

    - It creates a single (potentially really big) directory of files on
    the receiver for the byinode/* files.
    - The file list increases in size significantly (around double).
    - The transfer must remain identical to prior transfers, or the
    synthesized directory will not match (and could be truncated with
    --delete).
    - It disables incremental recursion (as does the detect-renamed patch,
    but it would be nice to avoid this).
    - While it avoids doing an extra scan of the destination files (unlike
    the detect-renamed patch), the processing of all the files in the
    synthesized directory is akin to an extra scan pass.

    However, as long as those trade-offs are acceptable, it does do a great
    job of finding renamed files. I'd like something a little more flexible
    for a future rsync, though.

    I had been thinking of extending the db patch to add the ability to
    track files by checksum in a database. This would allow a run that used
    the DB to be an efficient checksum run (reading the checksums from the
    DB, not slowly generating them) and look up matching checksums in the DB
    on the receiving side to facilitate either renaming and/or efficient
    copying. Using a simple DB for the data (such as SQLite) would be easy
    to support, would work regardless of how much of a hierarchy was being
    copied, and would not require an extra hierarchy scan for each transfer
    (though it would require that the DB info be double-checked and ignored
    if not accurate, and it would be most efficient if the receiving side
    was not prone to being reorganized without updating the DB). To
    facilitate the typical log-dir rotation idiom, I was thinking of
    applying delayed updates one directory at a time (unless the user
    asked for a whole-transfer delayed update).
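
A rough sketch of the checksum-database idea, using SQLite through Python's `sqlite3` module. It is not based on the actual db patch: the `files` table, its columns, the function names, and the choice of SHA-256 are assumptions for illustration. The size and modification time stored with each checksum stand in for the "double-check" step, so stale rows are ignored rather than trusted.

```python
import hashlib
import os
import sqlite3
import tempfile


def open_db(path=":memory:"):
    """Open the (hypothetical) checksum database and create its table."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS files ("
               "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, checksum TEXT)")
    return db


def file_checksum(path):
    """Slow path: read the whole file and hash it."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_checksum(db, path):
    """Return the checksum from the DB if size and mtime still match, else recompute."""
    st = os.stat(path)
    row = db.execute("SELECT size, mtime_ns, checksum FROM files WHERE path = ?",
                     (path,)).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime_ns:
        return row[2]
    digest = file_checksum(path)
    db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
               (path, st.st_size, st.st_mtime_ns, digest))
    return digest


def find_by_checksum(db, digest):
    """Return paths recorded with this checksum whose size and mtime still match."""
    matches = []
    query = "SELECT path, size, mtime_ns FROM files WHERE checksum = ?"
    for path, size, mtime_ns in db.execute(query, (digest,)).fetchall():
        try:
            st = os.stat(path)
        except OSError:
            continue  # the file was moved or deleted without updating the DB
        if st.st_size == size and st.st_mtime_ns == mtime_ns:
            matches.append(path)
    return matches


def demo():
    """Record a file, look it up by checksum, then remove it and look again."""
    with tempfile.TemporaryDirectory() as d:
        db = open_db()
        path = os.path.join(d, "example.1.gz")
        with open(path, "wb") as f:
            f.write(b"log data")
        digest = cached_checksum(db, path)
        found = find_by_checksum(db, digest)
        os.remove(path)
        stale = find_by_checksum(db, digest)
        return digest == file_checksum_of(b"log data"), found == [path], stale


def file_checksum_of(data):
    """Checksum of in-memory bytes, used only by demo() for comparison."""
    return hashlib.sha256(data).hexdigest()
```

On the receiving side, a non-empty result from `find_by_checksum` would name a local file that could be renamed or copied instead of transferred. The empty result after the file is removed shows why the approach works best when the destination is not reorganized behind the database's back.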

    That idea is my current favorite for adding rename support. What do
    folks think?

    ...wayne..


  3. Re: New feature: detect and avoid transferring renamed files

    On Tue, 9 Sep 2008 07:49:06 -0700, Wayne Davison wrote:
    > Sorry for the slow reply -- I marked your message for more in-depth
    > study, and failed to get back to it until now.


    That's OK, I've done worse :-(

    > drawbacks:
    >
    > - It creates a single (potentially really big) directory of files on
    > the receiver for the byinode/* files.

    [others deleted]

    Indeed, it more or less assumes you have a filesystem which handles this
    well. Your other observations are also quite correct.

    > I had been thinking of extending the db patch to add the ability to
    > track files by checksum in a database. This would allow a run that used
    > the DB to be an efficient checksum run (reading the checksums from the
    > DB, not slowly generating them) and look up matching checksums in the DB


    That is a good idea because the database can be used for all sorts of
    other purposes too. Here are the drawbacks I see:

    - I think you will have trouble catching files that have both moved
    and changed since the last rsync (typical example: /var/log/syslog
    had data appended and then was rotated to /var/log/syslog.0 and
    then rsync runs to do an incremental backup).

    You can solve this by storing dev&inum in the database. I'm not
    familiar with the db patch, perhaps it already does this?

    - It might be inconvenient to have a database on both sides of the
    transfer. For example, when I back up many devices to a backup
    server, it's cleaner to keep state only on the backup server.
    Some small devices (mobile phones?) might not even tolerate the
    creation of "extra" files (sqlite database) in their filesystem.

    As I indicated, the first is solvable, and I think I can live with the
    second one.
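
The suggestion of storing dev and inum to solve the first drawback can be sketched as follows. This is only an illustration on the sender side: the `seen` table and the `record` function are hypothetical, the numbers are stored as text to sidestep integer-range questions, and the sketch does not guard against an inode number being reused by an unrelated new file.

```python
import os
import sqlite3
import tempfile


def open_db():
    """Create an in-memory table mapping (dev, ino) to the last known path."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE seen (dev TEXT, ino TEXT, path TEXT, "
               "PRIMARY KEY (dev, ino))")
    return db


def record(db, paths):
    """Record each path's identity; return (old_path, new_path) for detected renames."""
    renames = []
    for path in paths:
        st = os.lstat(path)
        key = (str(st.st_dev), str(st.st_ino))
        row = db.execute("SELECT path FROM seen WHERE dev = ? AND ino = ?",
                         key).fetchone()
        if row and row[0] != path:
            # Same device and inode under a different name: a rename,
            # whether or not the content changed in the meantime.
            renames.append((row[0], path))
        db.execute("INSERT OR REPLACE INTO seen VALUES (?, ?, ?)", key + (path,))
    return renames


def demo():
    """Append to a log, rotate it, and check that the rename is still detected."""
    with tempfile.TemporaryDirectory() as d:
        db = open_db()
        log = os.path.join(d, "syslog")
        rotated = os.path.join(d, "syslog.0")
        with open(log, "w") as f:
            f.write("line 1\n")
        first = record(db, [log])
        with open(log, "a") as f:
            f.write("line 2\n")
        os.rename(log, rotated)
        second = record(db, [rotated])
        return first, second, log, rotated
```

Because the lookup key is the file's identity rather than its checksum, the appended data does not prevent the rotated file from being matched to its previous name.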

    -Phil


  4. Re: New feature: detect and avoid transferring renamed files

    Phil Vandry wrote:
    > On Tue, 9 Sep 2008 07:49:06 -0700, Wayne Davison wrote:
    >
    >> Sorry for the slow reply -- I marked your message for more in-depth
    >> study, and failed to get back to it until now.
    >
    > That's OK, I've done worse :-(
    >
    >> drawbacks:
    >>
    >> - It creates a single (potentially really big) directory of files on
    >> the receiver for the byinode/* files.
    >
    > [others deleted]
    >
    > Indeed, it more or less assumes you have a filesystem which handles this
    > well. Your other observations are also quite correct.
    >
    >> I had been thinking of extending the db patch to add the ability to
    >> track files by checksum in a database. This would allow a run that used
    >> the DB to be an efficient checksum run (reading the checksums from the
    >> DB, not slowly generating them) and look up matching checksums in the DB

    I missed the first part of the discussion, and I may be off-base, but
    you may want to look at unison and how it handles its database.

    http://www.cis.upenn.edu/~bcpierce/unison/

    Where rsync is stateless, unison is stateful. You're talking about
    making rsync stateful.

    --Yan


