Combining large files - Linux
-
Combining large files
I have about 10000 files of roughly 1M each that together make up one large
20G tar file. I would like to combine them all. So I started with a simple
script to do a "cat" on each file and combine it with the next one in the
series. That process seems to work but it's EXTREMELY slow.
In DOS, it's possible to copy files as follows, where the bulk of the work is
done by the copy command itself:
copy file1+file2+file3 new_file
There is no need to concatenate individual files. On my board,
concatenating a combined 300M file with another 1M file, for example, takes
about 5 minutes, which is not great performance. So I am wondering if there
is a similar utility out there for Linux that supports the DOS copy feature.
I am really trying to avoid writing a C program to accomplish this task.
TIA
Salman
-
Re: Combining large files
> cat file1 file2 file3 > new_file
If they're numbered sequentially, you could get away with:
cat file* >new_file
This assumes the names are zero-padded (01 to 20 and NOT 1 to 20), since the
shell expands the wildcard in sorted text order (ASCII order on an ASCII-based
platform with the other defaults in place), which puts file10 before file2.
I join mpg and other files this way all the time.
This also assumes the new_file name differs enough from the file* pattern
that it doesn't get included in the cat portion, and that no other
extraneous files get grabbed by the wildcard.
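The ordering caveat above can be checked on a few throwaway files before trusting the wildcard with real data. The following is only a sketch: the directory and file names (unpadded/file1, padded/file01, and so on) are made up for illustration, and each file holds a single letter so the resulting order is easy to read.

```shell
#!/bin/sh
# Sketch: compare wildcard order for unpadded and zero-padded names.
# All names below are hypothetical test data, not files from this thread.
workdir=$(mktemp -d)
mkdir "$workdir/unpadded" "$workdir/padded"

# Unpadded: parts 1, 2 and 10, holding A, B and C respectively.
printf 'A' > "$workdir/unpadded/file1"
printf 'B' > "$workdir/unpadded/file2"
printf 'C' > "$workdir/unpadded/file10"

# Zero-padded: the same three parts named 01, 02 and 10.
printf 'A' > "$workdir/padded/file01"
printf 'B' > "$workdir/padded/file02"
printf 'C' > "$workdir/padded/file10"

# The output files live outside the globbed directories, so the
# wildcard cannot pick them up as input.
cat "$workdir"/unpadded/file* > "$workdir/unpadded.out"
cat "$workdir"/padded/file*   > "$workdir/padded.out"

echo "unpadded order: $(cat "$workdir/unpadded.out")"
echo "padded order:   $(cat "$workdir/padded.out")"
```

The unpadded names should come out as ACB, because file10 sorts before file2 as text, while the padded names come out as ABC.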
Otherwise DOS's:
copy file1/b+file2/b+file3/b new_file
under Linux is roughly equal to:
cat file1 file2 file3 >new_file
You do NOT need to step it up like this:
cat file1 file2 >new_file1
cat new_file1 file3 >new_file2
cat new_file2 file4 >new_file1
cat new_file1 file5 >new_file
rm new_file1 new_file2
That would be very wasteful and slow. But I've known a few sadists in my
day who enjoyed typing and would do it that way.
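If the parts really do have to be handled one at a time, for example from inside a script, appending each part to the output with >> still writes every byte only once, whereas the stepped approach above recopies the whole growing file at every step. A minimal sketch of that idea follows; the function name join_parts is hypothetical, and the parts must be passed in the order they should appear.

```shell
#!/bin/sh
# join_parts OUTPUT PART...   (hypothetical helper, not a standard command)
# Appends each PART to OUTPUT in the order given.
join_parts() {
    out=$1
    shift
    : > "$out"                  # create or truncate the output once
    for part in "$@"; do
        cat "$part" >> "$out"   # append only; earlier data is never recopied
    done
}
```

It would be called as: join_parts new_file file1 file2 file3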
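One other limitation of sorts is that older 32-bit systems are likely to
limit the maximum size of files to 2.1G unless the kernel, filesystem, and
tools were built with large-file support. So you may only be able to form
your 20G file on a system with large-file support, or on an x86-64 or other
64-bit platform.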
HTH,
Shadow_7
-
Re: Combining large files
Interesting problem.
Why is it useful for you to have 10,000 1M files rolled into one huge
file?
-DU-...etc...
-
Re: Combining large files
I had over 20G of data that I wanted to back up before I rebuild my Linux
machine. Due to disk space constraints, I tarred it all up and then
compressed it using bzip2. It shrank down to a file of around 11G. Before
storing the data file, I verified its integrity using the bzip2 utility.
There are other alternatives for doing a more reliable backup, but this
process seemed pretty fast and simple.
Everything was fine until I copied the data file back and tried
decompressing it. The bzip2 utility complained about CRC errors. Hence I used
bzip2recover to recover the undamaged bzip2 blocks. Usually those blocks
(depending on how the block size was set when creating the bz2 compressed
file) are 900K chunks. The bzip2recover utility apparently recovered all
~13500 blocks and stored them in 900K-sized bz2-format files.
So in order to get the original 20G tar file, I started decompressing each
compressed block and then combining the resulting data files together. That
process is quite lengthy and is taking a long time.
"David Utidjian" wrote in message
news:pan.2003.06.28.19.37.00.971319.2168@nospamremarque.org...
> Interesting problem.
>
> Why is it useful for you to have 10,000 1M files rolled into one huge
> file?
>
> -DU-...etc...
-
Re: Combining large files
On Sat, 28 Jun 2003 21:23:17 -0400, Salman Moghal wrote:
> I had over 20G of data that I wanted to back up before I rebuild my
> Linux machine. Due to disk space constraints, I tarred it all up and
> then compressed it using bzip2. It shrank down to a file of around 11G.
> Before storing the data file, I verified its integrity using the bzip2
> utility. There are other alternatives for doing a more reliable backup,
> but this process seemed pretty fast and simple.
>
> Everything was fine until I copied the data file back and tried
> decompressing it. The bzip2 utility complained about CRC errors. Hence I
> used bzip2recover to recover the undamaged bzip2 blocks. Usually those
> blocks (depending on how the block size was set when creating the bz2
> compressed file) are 900K chunks. The bzip2recover utility apparently
> recovered all ~13500 blocks and stored them in 900K-sized bz2-format
> files.
>
> So in order to get the original 20G tar file, I started decompressing each
> compressed block and then combining the resulting data files together.
> That process is quite lengthy and is taking a long time.
Hmmmm... I think I see your problem... it basically boils down to not
having enough of the right kind of storage media and/or a solid, tested
plan for using it.
I apologize in advance if some of what follows sounds harsh or uncaring
but when it comes to solid backup plans, as you will learn, there is zero
room for error... and by extension... very little room for kindness and
understanding.
With that said... I think I do understand the position you are in. I hope
for your sake that your livelihood does not depend on the recovery of all
these files.
Are/were all these files located in a single subdirectory off of /? Perhaps
/home or /var? If it was a single subdirectory, was it also a separate
partition? If you had kept the data in its own subdirectory on its own
partition you could have avoided the necessity of moving it off of the
current media in the first place.
What version of bzip2 are you using? According to the manpage, bzip2recover
versions 1.0.1 and earlier cannot handle compressed files larger than
512MBytes. This restriction is removed with version 1.0.2 on platforms with
64-bit integer support. Not sure what the max is after that.
Also according to the manpage, the way to restore the original file after
completing a successful bzip2recover is to do this:
bzip2 -dc rec*file.bz2 > recovered_data
Does that work? If so, then I guess you can untar the recovered_data
file.
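If that one-liner stops at the first damaged block, or you want to know which blocks are bad, a loop that decompresses the recovered blocks one at a time may help. This is only a sketch under assumptions: the function name decompress_blocks is made up, it expects the rec*file.bz2 names that bzip2recover writes, and it skips any block that bzip2 rejects, so the output will be missing that block's data.

```shell
#!/bin/sh
# decompress_blocks OUTPUT BLOCK...   (hypothetical helper name)
# Decompresses each BLOCK in the order given and appends the result to
# OUTPUT. A block is first decompressed into a temporary file and only
# appended if bzip2 reports success; failed blocks are listed on stderr.
decompress_blocks() {
    out=$1
    shift
    failed=0
    tmp=$(mktemp)
    : > "$out"                          # create or truncate the output once
    for block in "$@"; do
        if bzip2 -dc "$block" > "$tmp" 2>/dev/null; then
            cat "$tmp" >> "$out"
        else
            echo "damaged block skipped: $block" >&2
            failed=$((failed + 1))
        fi
    done
    rm -f "$tmp"
    echo "blocks skipped: $failed" >&2
    [ "$failed" -eq 0 ]                 # non-zero status if anything was skipped
}
```

It would be called as: decompress_blocks recovered_data rec*file.bz2
Because it is a shell function, the long list of block names is handled inside the shell and is never handed to a single external command in one go.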
In the future... you should consider a more robust backup plan. If your
data is valuable to you and/or your employer then you should consider
getting, at the very least, one or more backup disks so that the data can be
mirrored. Even better... get a good tape backup system. I have had very
good luck with DLT tapes and drives. I have had very bad luck with
DAT/DDS tapes and drives. A DLT tape system can handle up to 40/80G of
data and 200+G in the SuperDLT drives.
Having a good (and tested) backup plan means never having to say you are
sorry.
-DU-...etc...