comparing two 74 meg logfiles - suggestions? - Slackware


  1. comparing two 74 meg logfiles - suggestions?

    Long story short, I messed up my disk, and then made it worse, so I had
    to restore from a backup. The problem is that files I deleted on
    purpose are now "restored", and I want to re-delete them.

    I have a directory listing before and after and want to diff the two
    files. They should be 99% the same, but they are 74 megs (633000+
    files) each.

    I'm going to try diff and see if it can work with files that large, but
    I was thinking last night about alternatives in case diff pukes or
    doesn't want to work because the files are too different.

    option 1:
    diff

    option 2:
    a bash script that reads the list of files on disk right now and greps the
    old listing for each one, looking for a match (roughly sketched just after
    option 5). This will work, but I suspect it will take a LOOOOONG time
    (1/2 logfile size * 633000 greps) - I was thinking a ramdisk might speed
    that up.

    option 3:
    use perl and read both files in and loop through looking for differences
    - might be a bit faster than 2 on a ramdisk?

    option 4:
    use MySQL - create two tables, index them, and then do 600000 joins,
    looking for the ones that return zero rows

    option 5:
    use touch to create a directory structure containing zero-byte files
    matching the "before" setup, and then for each file currently on disk look
    in the "empty" structure for it - if it's not there, then flag it.

    any other thoughts?

    I'm going to start with sort | diff and see if it can handle it - the
    files should be almost the same...
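
    Something like this is what I have in mind (the list names are made up):

    sort list.before > before.sorted
    sort list.after  > after.sorted
    diff before.sorted after.sorted
    # lines diff marks with ">" exist only in the restored ("after") listing

    If diff chokes on that, I'll fall back to one of the options above.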

    Oh, in case anyone wants to know wtf happened - I overwrote a symlink.
    I thought I had deleted about 50,000 files, and before I realized that I
    had just whacked the link, I googled undelete and reiser and found that
    --rebuild-tree is dangerous but might undelete... and it did, but it
    made a bit of a mess, and thus the restore. But my backups don't take
    into account the following possibility: image on Monday, delete a file on
    Tuesday (on purpose), disk failure on Thursday - restoring Monday's
    image brings back the file you deleted on Tuesday, and nothing re-deletes
    it. (The daily backup script, which I wrote for my old day job about 10
    years ago, actually runs via Windows - I haven't updated it in a while
    because it's good for backups, but it has this weakness for restores. I
    figure it's still better than no backups at all. I'd rather have to do
    some cleanup than have nothing or corrupted files.)

    Ray

  2. Re: comparing two 74 meg logfiles - suggestions?

    ray wrote:
    > Long story short, I messed up my disk, and then made it worse, so I had
    > to restore from a backup. The problem is that files I deleted on
    > purpose are now "restored", and I want to re-delete them.
    >
    > I have a directory listing before and after and want to diff the two
    > files. They should be 99% the same, but they are 74 megs (633000+
    > files) each.
    >
    > I'm going to try diff and see if it can work with files that large, but
    > I was thinking last night about alternatives in case diff pukes or
    > doesn't want to work because the files are too different.


    Merge the two lists, sort the result and only print the lines that don't
    appear in both:

    cat list.before list.after | sort | uniq -u

    That should print every name that appears in only one of the two lists -
    which, if the restore only added files back, is exactly the set to
    re-delete.
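
    Note that uniq -u mixes together names missing from either side. If the
    restore also lost anything created after the backup image, comm on the
    sorted lists keeps the two directions separate (untested, same made-up
    list names):

    sort list.before > before.sorted
    sort list.after  > after.sorted
    comm -13 before.sorted after.sorted   # only in list.after: restored files to re-delete
    comm -23 before.sorted after.sorted   # only in list.before: files the restore did not bring back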


  3. Re: comparing two 74 meg logfiles - suggestions?

    On Thu, 07 Aug 2008 19:10:49 +0000, ray wrote:

    > Long story short, I messed up my disk, and then made it worse, so I had
    > to restore from a backup. The problem is that files I deleted on
    > purpose are now "restored", and I want to re-delete them.
    >
    > [snip]

    Option 6: Text-based comparison of the listings using the standard text
    tools sort and uniq.

    This sort of problem needs a comparison tool which can compare two lists and
    generate four reports:

    Report 1. Name and details match (Name and file details match exactly.)
    Report 2. Name Match found in both lists, but details different.
    Report 3. Name found only in list 1.
    Report 4. Name found only in list 2.

    If you are only interested in the names, and not the file details, as
    appears to be true in this case, then you can simplify the problem:
    Report 2 is empty. Also, assume you have two lists of file names as
    follows:

    1. L1 is the name of a file holding the raw list of filenames before the
    accident.
    2. L2 is the name of a file holding the raw list of filenames as
    restored (now).

    In the simplified case (where report 2 is empty), I think the commands
    below generate a superset of Reports 3 and 4. T1 is set to a temp file.
    The first sequence does a presort to try to account for the large number
    of files.

    Caution:
    1. This code is untested. Use at your own risk.
    2. Pruning directories which have been removed is trickier than simply
    removing the files themselves. Having a list of existing directories in
    addition to the list of existing files is very helpful.

    $ L1=~/list1
    $ L2=~/list2
    $ T1=~/temp.x
    $ sort $L1 >${L1}.sort                            # presort each list
    $ sort $L2 >${L2}.sort
    $ sort -m ${L1}.sort ${L2}.sort | uniq -u >$T1    # keep names that appear in only one list

    Then, to generate report 4 (names present in list 2 only):
    $ sort -m $T1 ${L2}.sort | uniq -d >~/report4     # of those, keep the ones also in list 2
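
    By symmetry, report 3 (names present in list 1 but not in list 2) should
    come from the same kind of merge against ${L1}.sort - again, untested:

    $ sort -m $T1 ${L1}.sort | uniq -d >~/report3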

    --
    Douglas Mayne

  4. Re: comparing two 74 meg logfiles - suggestions?

    Douglas Mayne wrote:
    > On Thu, 07 Aug 2008 19:10:49 +0000, ray wrote:
    >
    >> [snip]

    > Option 6: Text-based comparison of the listings using the standard text
    > tools sort and uniq.
    >
    > [snip]


    FWIW, diff ended up working just fine.

    I didn't need to compare file contents with this script - I created
    another one for that - this one was just to confirm that the files now in
    the restore directory also existed the day before I blew it up.

    Ray
