Combining large files - Linux



Thread: Combining large files

  1. Combining large files

    I have around 10,000 1M-sized files that together make up one large 20G
    tar file. I would like to combine them all. So I started with a simple
    script that does a "cat" on each file and combines it with the next one
    in the series. That process seems to work, but it's EXTREMELY slow.

    In DOS, it's possible to copy files as follows, where the bulk of the
    work is done by the copy command itself:

    copy file1+file2+file3 new_file

    There is no need to concatenate the files one pair at a time. On my
    machine, concatenating a combined 300M file with another 1M file, for
    example, takes about 5 minutes, which is not great performance. So I am
    wondering if there is a similar utility out there for Linux that supports
    the DOS copy feature.

    I am really trying to avoid writing a C program to accomplish this task.

    TIA
    Salman



  2. Re: Combining large files

    > cat file1 file2 file3 > new_file

    If they're numbered sequentially, you could get away with:

    cat file* >new_file

    This assumes the names are numbered 01 to 20 and NOT 1 to 20, since the
    default sort is by ASCII sequence (assuming an ASCII-based platform and
    the default locale). I join mpg and other files this way all the time.
    This also assumes the new_file name differs enough from the file* pattern
    that it doesn't get included in the cat, and that no other extraneous
    files get grabbed by the wildcard.
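
    If the pieces really are numbered 1 to 10000 without leading zeros, one
    way to keep them in numeric order (just a sketch, assuming names like
    file1 through file10000 and that seq is available) is a simple loop:

    for i in $(seq 1 10000); do cat "file$i"; done > new_file

    That also sidesteps any argument-list limit you might hit when handing
    ten thousand names to cat in one go.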

    Otherwise DOS's:
    copy file1/b+file2/b+file3/b new_file

    under Linux is roughly equivalent to:
    cat file1 file2 file3 >new_file

    You do NOT need to step it up like this:
    cat file1 file2 >new_file1
    cat new_file1 file3 >new_file2
    cat new_file2 file4 >new_file1
    cat new_file1 file5 >new_file
    rm new_file1 new_file2
    That would be very wasteful and slow, since each step re-reads and
    re-writes everything that has already been combined, so the total I/O
    grows roughly quadratically. But I've known a few sadists in my day who
    enjoyed typing and would do it that way.

    One other limitation of sorts is that 32-bit platforms without large file
    support are likely to limit the maximum file size to about 2.1G. So you
    may only be able to form your 20G file on an x86-64 or other 64-bit
    platform, or with large file support enabled.
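
    If you want a rough idea of whether the filesystem you are writing to can
    hold a file that big, one check (a sketch; getconf and its FILESIZEBITS
    query are POSIX, but availability varies by platform) is:

    getconf FILESIZEBITS /path/to/target/directory

    A value of 64 or more means the 2.1G ceiling should not apply on that
    filesystem; 32 means you will run into it.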

    HTH,

    Shadow_7

  3. Re: Combining large files

    Interesting problem.

    Why is it useful for you to have 10,000 1M files rolled into one huge
    file?

    -DU-...etc...

  4. Re: Combining large files

    I had over 20G of data that I wanted to back up before I rebuilt my Linux
    machine. Due to disk space constraints, I tarred it all up and then
    compressed it using bzip2. It shrank down to around an 11G file. Before
    storing the data file, I verified its integrity using the bzip2 utility.
    There are other alternatives for doing a more reliable backup, but this
    process seemed pretty fast and simple.

    Everything was fine until I copied the data file back and tried
    decompressing it. The bzip2 utility complained about CRC errors. Hence I
    used bzip2recover to recover the undamaged bzip2 blocks. Usually those
    blocks (depending on how they are initially set when creating the bz2
    compressed file) are 900K chunks. The bzip2recover utility apparently
    recovered all ~13,500 blocks and stored them in 900K-sized bz2-format
    files.
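
    For reference, the recovery step itself was just a single command along
    these lines (the archive name here is only an example):

    bzip2recover backup.tar.bz2

    which wrote the pieces out as rec00001backup.tar.bz2,
    rec00002backup.tar.bz2, and so on.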

    So in order to get the original 20G tar file, I started unzipping each
    compressed block and then combining the resulting data files together.
    That process is quite lengthy and is taking a long time.

    "David Utidjian" wrote in message
    news:pan.2003.06.28.19.37.00.971319.2168@nospamremarque.org...
    > Interesting problem.
    >
    > Why is it useful for you to have 10,000 1M files rolled into one huge
    > file?
    >
    > -DU-...etc...




  5. Re: Combining large files

    On Sat, 28 Jun 2003 21:23:17 -0400, Salman Moghal wrote:

    > I had over 20G of data that I wanted to back up before I rebuilt my
    > Linux machine. Due to disk space constraints, I tarred it all up and
    > then compressed it using bzip2. It shrank down to around an 11G file.
    > Before storing the data file, I verified its integrity using the bzip2
    > utility. There are other alternatives for doing a more reliable backup,
    > but this process seemed pretty fast and simple.
    >
    > Everything was fine until I copied the data file back and tried
    > decompressing it. The bzip2 utility complained about CRC errors. Hence
    > I used bzip2recover to recover the undamaged bzip2 blocks. Usually
    > those blocks (depending on how they are initially set when creating the
    > bz2 compressed file) are 900K chunks. The bzip2recover utility
    > apparently recovered all ~13,500 blocks and stored them in 900K-sized
    > bz2-format files.
    >
    > So in order to get the original 20G tar file, I started unzipping each
    > compressed block and then combining the resulting data files together.
    > That process is quite lengthy and is taking a long time.


    Hmmmm... I think I see your problem... it basically boils down to not
    having enough of the right kind of storage media and/or a solid, tested
    plan for using it.

    I apologize in advance if some of what follows sounds harsh or uncaring,
    but when it comes to solid backup plans, as you will learn, there is zero
    room for error... and by extension... very little room for kindness and
    understanding.

    With that said... I think I do understand the position you are in. I hope
    for your sake that your livelihood does not depend on the recovery of all
    these files.

    Are/were all these files located in a single subdirectory off of / ?
    Perhaps /home or /var? If a single subdirectory, was it also a separate
    partition? If you had kept the data in its own subdir on its own
    partition, you could have avoided the necessity of moving it off of the
    current media in the first place.

    What version of bzip2 are you using? According to the manpage, versions
    1.0.1 and earlier have a 512MByte limit on file size. This restriction is
    removed in version 1.0.2. Not sure what the max is after that.

    Also according to the manpage the way to restore the original file after
    completing a successful bzip2recover is to do this:

    bzip2 -dc rec*file.bz2 > recovered_data

    Does that work? If so, then I guess you can untar the recovered_data
    file.
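
    If it does, a quick sanity check before a full extract (just a sketch,
    assuming recovered_data really is the reassembled tar archive) would be:

    tar -tvf recovered_data > /dev/null && echo "archive looks readable"
    tar -xvf recovered_data

    Or, if disk space is still tight, you could skip the intermediate file
    altogether and stream the decompressed blocks straight into tar:

    bzip2 -dc rec*file.bz2 | tar -xvf -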

    In the future... you should consider a more robust backup plan. If your
    data is valuable to you and/or your employer, then you should consider
    getting, at the very least, one or more backup disks so that the data can
    be mirrored. Even better... get a good tape backup system. I have had
    very good luck with DLT tapes and drives. I have had very bad luck with
    DAT/DDS tapes and drives. A DLT tape system can handle up to 40/80G of
    data, and 200+G in the SuperDLT drives.

    Having a good (and tested) backup plan means never having to say you are
    sorry.

    -DU-...etc...
