comparing two 74 meg logfiles - suggestions?
Long story short, I messed up my disk, and then made it worse, so I had
to restore from a backup. The problem I find is that files deleted on
purpose are now "restored" and I want to re-delete them.
I have a directory listing before and after and want to diff the two
files. They should be 99% the same, but they are 74 megs (633000+
files) each.
I'm going to try diff and see if it can work with files that large, but
was thinking last night about alternatives to it if diff pukes or
doesn't want to work because the files are too different.
option 1:
diff
option 2:
bash script that reads the list of files on disk right now and greps the
old file looking for a match. This will work, but I suspect this will
take a LOOOOONG time (1/2 logfile size * 633000 times) - I was thinking
a ramdisk might speed that up.
option 3:
use perl and read both files in and loop through looking for differences
- might be a bit faster than 2 on a ramdisk?
option 4:
use MySQL - create two tables, index them and then do 600000 joins and
look for 0 rows returned
option 5:
use touch and create a directory structure containing zero-byte files of
the "before" setup and then for each file currently on disk look in the
"empty" structure for a file - if it's not there then flag it.
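For what it's worth, the lookup in option 2 does not have to be one grep per file. A minimal sketch, assuming one path per line in each listing: grep can load the whole "before" list as fixed-string patterns and scan the current list once. The names list.before and list.now and the sample paths are placeholders, not the real listings.

```shell
# Placeholder data standing in for the two real listings.
workdir=$(mktemp -d)
printf '%s\n' /data/a /data/b > "$workdir/list.before"
printf '%s\n' /data/a /data/old.1 /data/b > "$workdir/list.now"

# -F: treat patterns as fixed strings, not regular expressions
# -x: a pattern must match the whole line
# -v: print the lines that match none of the patterns
# -f: read the patterns (the "before" names) from a file
# grep exits with status 1 when it prints nothing, hence the "|| true".
grep -F -x -v -f "$workdir/list.before" "$workdir/list.now" \
    > "$workdir/not_in_before.txt" || true

cat "$workdir/not_in_before.txt"
```

With several hundred thousand patterns this may still use a fair amount of memory, so it is worth trying on a small sample first.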
any other thoughts?
I'm going to start with sort | diff and see if it can handle it - the files
should be almost the same...
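A rough sketch of what the sort | diff step could look like, assuming one path per line. The names list.before and list.after and the sample paths are placeholders for the real 74 meg listings.

```shell
# Placeholder data standing in for the two real listings.
workdir=$(mktemp -d)
printf '%s\n' /data/a /data/b /data/c > "$workdir/list.before"
printf '%s\n' /data/c /data/a /data/old1 /data/b > "$workdir/list.after"

# Sort both lists with the same collation so diff sees them in the same order.
LC_ALL=C sort "$workdir/list.before" > "$workdir/before.sorted"
LC_ALL=C sort "$workdir/list.after"  > "$workdir/after.sorted"

# diff exits with status 1 when the files differ, which is expected here.
diff "$workdir/before.sorted" "$workdir/after.sorted" > "$workdir/diff.out" || true

# Lines starting with "> " exist only in the "after" list, i.e. files the
# restore brought back; strip the marker to get the bare names.
sed -n 's/^> //p' "$workdir/diff.out" > "$workdir/restored.txt"
cat "$workdir/restored.txt"
```

Lines starting with "< " would be the opposite case: names that were in the "before" list but are missing after the restore.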
Oh, in case anyone wants to know wtf happened - I overwrote a symlink -
thought I had deleted about 50,000 files, and before I realized that I
had just whacked the link I googled undelete and reiser and found that
--rebuild-tree is dangerous but might undelete... and it did, but it
made a bit of a mess, and thus the restore. But, my backups don't take
into account the following possibility: Image on Monday, delete file on
Tuesday (on purpose), disk failure on Thursday, restore of Monday's
image brings back the file you deleted on Tuesday... nothing re-deletes the
file. (The daily backup script, which I wrote for my old day job like
10 years ago, actually runs via Windows - I've not updated it in a while
because it's good for backups, but it has this weakness for restores. I
figure it's still better than no backups at all. I'd rather have to do
some cleanup than have nothing or corrupted files.)
Ray
Re: comparing two 74 meg logfiles - suggestions?
ray wrote:
> Long story short, I messed up my disk, and then made it worse, so I had
> to restore from a backup. the problem I find is that files deleted on
> purpose are now "restored" and I want to re-delete them.
>
> I have a directory listing before and after and want to diff the two
> files. They should be 99% the same, but they are 74 megs (633000+
> files) each.
>
> I'm going to try diff and see if it can work with files that large, but
> was thinking last night about alternatives to it if diff pukes or
> doesn't want to work because the files are too different.
Merge the two lists, sort the result and only print the lines that don't
appear in both:
cat list.before list.after | sort | uniq -u
This should print every name that appears in only one of the two lists,
which includes the files that should be deleted.
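Note that uniq -u reports names that are unique to either list. If only the names that exist now but did not exist before are wanted, comm on the two sorted lists is one way to get them. A minimal sketch, assuming one path per line and no duplicate lines within a list; the sample paths are made up:

```shell
# Placeholder data standing in for list.before and list.after.
workdir=$(mktemp -d)
printf '%s\n' /data/a /data/b /data/gone > "$workdir/list.before"
printf '%s\n' /data/a /data/b /data/old1 /data/old2 > "$workdir/list.after"

# comm needs both inputs sorted with the same collation.
LC_ALL=C sort "$workdir/list.before" > "$workdir/before.sorted"
LC_ALL=C sort "$workdir/list.after"  > "$workdir/after.sorted"

# The uniq -u approach: names found in exactly one of the two lists.
cat "$workdir/before.sorted" "$workdir/after.sorted" | LC_ALL=C sort | uniq -u \
    > "$workdir/either.txt"

# comm -13 suppresses column 1 (only in "before") and column 3 (in both),
# leaving the names found only in "after".
LC_ALL=C comm -13 "$workdir/before.sorted" "$workdir/after.sorted" \
    > "$workdir/only_after.txt"

cat "$workdir/only_after.txt"
```

Here either.txt also contains /data/gone (present before, missing now), while only_after.txt holds just the two restored names.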
Re: comparing two 74 meg logfiles - suggestions?
On Thu, 07 Aug 2008 19:10:49 +0000, ray wrote:
> [snip]
Option 6: Text-based comparison of the listings using text tools: sort and uniq.
This sort of problem needs a comparison tool which can compare two lists and
generate four reports:
Report 1. Name found in both lists, and the file details match exactly.
Report 2. Name found in both lists, but the details differ.
Report 3. Name found only in list 1.
Report 4. Name found only in list 2.
If you are only interested in the names, and not the file details, as
appears to be true in this case, then you can simplify the problem:
Report 2 is empty. Also, assume you have two lists of file names as
follows:
1. L1 is the name of a file containing a raw list of filenames from before
the accident.
2. L2 is the name of a file containing a raw list of filenames as
restored (now).
In the simplified case (where Report 2 is empty), I think this
command sequence generates the combined contents of Reports 3 and 4. T1 is
set to a temp file. The first two commands presort the lists to account for
the large number of files.
Caution:
1. This code is untested. Use at your own risk.
2. Pruning directories which have been removed is trickier than simply
removing the files themselves. Having a list of existing directories in
addition to the list of existing files is very helpful.
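To illustrate caution 2, here is a sketch of how directory listings could be used, assuming one directory path per line and no newlines in names. The files dirs.before and dirs.after and the sample tree are hypothetical. Directories found only in the "after" list are removed deepest first, and rmdir refuses to touch anything that is not empty.

```shell
# Build a throwaway tree: "keep" existed before, "old" and "old/sub" did not.
workdir=$(mktemp -d)
root="$workdir/restore"
mkdir -p "$root/keep" "$root/old/sub"
: > "$root/keep/a"

# Hypothetical directory listings from before and after the restore.
printf '%s\n' "$root" "$root/keep" | LC_ALL=C sort > "$workdir/dirs.before"
find "$root" -type d | LC_ALL=C sort > "$workdir/dirs.after"

# Directories only in the "after" list. A reverse sort puts every
# subdirectory ahead of its parent, so children are removed first.
LC_ALL=C comm -13 "$workdir/dirs.before" "$workdir/dirs.after" \
    | LC_ALL=C sort -r > "$workdir/dirs.remove"

# rmdir only removes empty directories, so anything still holding files is
# reported and left alone.
while IFS= read -r dir; do
    rmdir "$dir" || echo "not empty, skipped: $dir" >&2
done < "$workdir/dirs.remove"
```

This only makes sense after the unwanted files themselves have been deleted; otherwise every rmdir is simply skipped.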
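L1=~/list1
L2=~/list2
T1=~/temp.x
# Presort each list so the merge below can rely on sorted input.
sort "$L1" > "$L1.sort"
sort "$L2" > "$L2.sort"
# Merge the sorted lists and keep only the names that appear once.
sort -m "$L1.sort" "$L2.sort" | uniq -u > "$T1"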
Then, to generate report 4:
sort -m "$T1" "$L2.sort" | uniq -d > ~/report4
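Report 3 can presumably be produced the same way by merging against the sorted first list instead. The sketch below runs the whole sequence on made-up sample data in a temp directory; the paths and file names are placeholders, and it assumes no list contains duplicate lines.

```shell
# Placeholder data: already-sorted stand-ins for ${L1}.sort and ${L2}.sort.
workdir=$(mktemp -d)
printf '%s\n' /data/a /data/b /data/gone | LC_ALL=C sort > "$workdir/list1.sort"
printf '%s\n' /data/a /data/b /data/old1 | LC_ALL=C sort > "$workdir/list2.sort"

# Names that appear in exactly one list (Reports 3 and 4 combined).
LC_ALL=C sort -m "$workdir/list1.sort" "$workdir/list2.sort" | uniq -u \
    > "$workdir/temp.x"

# A name from temp.x that also shows up in list 1 was only in list 1 (Report 3).
LC_ALL=C sort -m "$workdir/temp.x" "$workdir/list1.sort" | uniq -d \
    > "$workdir/report3"

# A name from temp.x that also shows up in list 2 was only in list 2 (Report 4).
LC_ALL=C sort -m "$workdir/temp.x" "$workdir/list2.sort" | uniq -d \
    > "$workdir/report4"

cat "$workdir/report3" "$workdir/report4"
```

Report 4 is the list of candidates for re-deletion; Report 3 shows anything the restore failed to bring back.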
--
Douglas Mayne
Re: comparing two 74 meg logfiles - suggestions?
Douglas Mayne wrote:
> [snip]
fwiw, diff ended up working just fine.
I didn't need to compare the file contents with this script - I created
another one that did that - this one was to confirm that the files that
exist in the restore directory existed the day before I blew it up.
Ray