efficient compression tools - Unix

Thread: efficient compression tools

  1. efficient compression tools


    Hi all,

    I am facing the following problem (on a mainframe platform). After a
    few somewhat crude attempts on my own, I thought I would seek guidance
    from any experts out there...

    I have a log file with the following fixed format (it is in fact an MVS
    PDSE but that shouldn't matter).

    15-char data set name (padded right with spaces)
    1 space
    8-char timestamp mmddyyyy
    1 space
    6-char file size (i.e. can be at most 999999, padded left with spaces)

    My system produces about 20 such logfiles a day, each about 10MB. I
    want to compress these files as much as possible using standard
    UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    that I have to be able to recover faithfully (100% no loss or changes)
    and that the compression should not take more than 1-2 minutes per
    run.

    I have tried all the standard tools (bzip2, gzip, pack, compress) and
    combinations of these with sort, uniq, etc., but I am not able to break
    the 3MB barrier. The best combination I've got, a crude pipeline that
    sorts the data alphabetically by filename, does some prefix
    suppression, and finally runs bzip2, gets me to about 3MB.

    Anyone have any insightful suggestions? In particular, if Perl/Python
    have some built-in primitives to reversibly reduce the symbol-bit
    count, that would really help. Many thanks in advance!

    -SNS

  2. Re: efficient compression tools

    sam.n.seaborn@gmail.com writes:

    > Hi all,
    >
    > I am facing the following problem (on a mainframe platform). After a
    > few somewhat crude attempts on my own, I thought I would seek guidance
    > from any experts out there...
    >
    > I have a log file with the following fixed format (it is in fact an MVS
    > PDSE but that shouldn't matter).
    >
    > 15-char data set name (padded right with spaces)
    > 1 space
    > 8-char timestamp mmddyyyy
    > 1 space
    > 6-char file size (i.e. can be at most 999999, padded left with spaces)
    >
    > My system produces about 20 such logfiles a day, each about 10MB. I
    > want to compress these files as much as possible using standard
    > UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    > that I have to be able to recover faithfully (100% no loss or changes)
    > and that the compression should not take more than 1-2 minutes per
    > run.
    >
    > I have tried all the standard tools (bzip2, gzip, pack, compress) and
    > combinations of these with sort, uniq, etc., but I am not able to break
    > the 3MB barrier. The best combination I've got, a crude pipeline that
    > sorts the data alphabetically by filename, does some prefix
    > suppression, and finally runs bzip2, gets me to about 3MB.
    >
    > Anyone have any insightful suggestions? In particular, if Perl/Python
    > have some built-in primitives to reversibly reduce the symbol-bit
    > count, that would really help. Many thanks in advance!


    Packing the data in a binary format before compressing might
    help. Drop the padding from the filename and prefix it with a 4-bit
    length field instead. The timestamp can be expressed as days since a
    date of your choice. 16 bits here gives you about 180 years. Drop
    the padding between the timestamp and size fields. The size is at
    most 999999, so it will fit in 20 bits. This makes a nice 24 bits
    together with the 4-bit name length. In all, this gives 5 bytes plus
    the length of the name. With an average name length of 8 (you didn't
    specify) this alone would get you down to about 4.1MB. Compressing
    this with whatever algorithm seems to work best (try lzma), should
    bring it down to 2MB or so.
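
    A rough Python sketch of that packing layout (the 1970 reference
    date, the big-endian field order, and the function names here are
    arbitrary illustration choices, not anything the scheme requires):

        # 1 byte: 4-bit name length + top 4 bits of the size
        # 2 bytes: remaining 16 bits of the size
        # 2 bytes: days since the reference date
        # then the unpadded name -- 5 bytes + name length per record
        import struct
        from datetime import date, timedelta

        EPOCH = date(1970, 1, 1)   # any fixed date before the oldest log works

        def pack_record(name, mmddyyyy, size):
            name = name.rstrip(" ")          # drop the right padding
            d = date(int(mmddyyyy[4:]), int(mmddyyyy[:2]), int(mmddyyyy[2:4]))
            header = (len(name) << 20) | size          # 4 + 20 = 24 bits
            return struct.pack(">BHH", header >> 16, header & 0xFFFF,
                               (d - EPOCH).days) + name.encode("ascii")

        def unpack_record(buf, offset=0):
            hi, lo, days = struct.unpack_from(">BHH", buf, offset)
            header = (hi << 16) | lo
            length, size = header >> 20, header & 0xFFFFF
            name = buf[offset + 5:offset + 5 + length].decode("ascii")
            stamp = (EPOCH + timedelta(days=days)).strftime("%m%d%Y")
            return name.ljust(15), stamp, size, offset + 5 + length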

    --
    Måns Rullgård
    mans@mansr.com

  3. Re: efficient compression tools

    sam.n.seaborn@gmail.com wrote:
    > Hi all,
    >
    > I am facing the following problem (on a mainframe platform). After a
    > few somewhat crude attempts on my own, I thought I would seek guidance
    > from any experts out there...
    >
    > I have a log file with the following fixed format (it is in fact an MVS
    > PDSE but that shouldn't matter).
    >
    > 15-char data set name (padded right with spaces)
    > 1 space
    > 8-char timestamp mmddyyyy
    > 1 space
    > 6-char file size (i.e. can be at most 999999, padded left with spaces)
    >
    > My system produces about 20 such logfiles a day, each about 10MB. I
    > want to compress these files as much as possible using standard
    > UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    > that I have to be able to recover faithfully (100% no loss or changes)
    > and that the compression should not take more than 1-2 minutes per
    > run.
    >
    > I have tried all the standard tools (bzip2, gzip, pack, compress) and
    > combinations of these with sort, uniq, etc., but I am not able to break
    > the 3MB barrier. The best combination I've got, a crude pipeline that
    > sorts the data alphabetically by filename, does some prefix
    > suppression, and finally runs bzip2, gets me to about 3MB.
    >
    > Anyone have any insightful suggestions? In particular, if Perl/Python
    > have some built-in primitives to reversibly reduce the symbol-bit
    > count, that would really help. Many thanks in advance!
    >
    > -SNS


    First of all, I won't go into the discussion of whether, in 2008,
    saving a few megabytes of disk space justifies throwing a lot of time
    or brainpower at the problem.

    But have you given p7zip a try? http://p7zip.sourceforge.net/ In terms
    of ratio it seems to be at the forefront of general-purpose
    archivers/compressors.
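
    If you want to drive an experiment from a script instead, Python's
    lzma module (standard library in Python 3.3 and later) exposes the
    same LZMA algorithm that 7-Zip/p7zip uses; a minimal sketch, with
    placeholder file names:

        import lzma

        # compress one log file at the highest preset, then read it back
        with open("logfile.txt", "rb") as src, \
             lzma.open("logfile.txt.xz", "wb", preset=9) as dst:
            dst.write(src.read())

        with lzma.open("logfile.txt.xz", "rb") as src:
            restored = src.read()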

    As for reducing the symbol bit count, Huffman and arithmetic coding do
    this with limited success in terms of ratio; see the Unix pack(1)
    command. I wouldn't try to implement anything like this in perl or python.

    If you really want to bite hard on the task, you could consider writing a
    very specialized compressor/decompressor for your data, but then again:
    how much does your disk space cost?

    Regards
    Joachim

    followup-to set to comp.unix.shell

  4. Re: efficient compression tools

    sam.n.seaborn@gmail.com writes:

    > I have a log file with the following fixed format (it is in fact an MVS
    > PDSE but that shouldn't matter).
    >
    > 15-char data set name (padded right with spaces)
    > 1 space
    > 8-char timestamp mmddyyyy
    > 1 space
    > 6-char file size (i.e. can be at most 999999, padded left with spaces)
    >
    > My system produces about 20 such logfiles a day, each about 10MB. I
    > want to compress these files as much as possible using standard
    > UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    > that I have to be able to recover faithfully (100% no loss or changes)
    > and that the compression should not take more than 1-2 minutes per
    > run.


    A key factor will be the number of distinct values typically
    represented in each field. If there are, say, fewer than 256 of each,
    then you can represent the whole file as a set of 3-byte records
    prefixed by a table of the symbols. A short Perl program could pack
    and unpack it. Of course, you probably have more variation than that,
    but the basic idea may still pay off.

    If the sizes are largely distinct, it would probably be better to use
    a fixed bit-field for the size (20 is enough). If there were only a
    few dates and fewer than 65,536 values for the first field, you could
    pack each record into just 5 bytes. At the other end of the scale, if
    the variation is much greater, standard compressors will do better
    than this scheme.
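
    A rough Python sketch of the symbol-table idea (the pickle container
    and the function names are just for illustration; a real version
    would pack the indices into the fixed-width bytes described above):

        import pickle

        def encode(records):        # records: list of (name, date, size) tuples
            tables, index_columns = [], []
            for column in zip(*records):                 # one field at a time
                symbols = sorted(set(column))            # table of distinct values
                lookup = {value: i for i, value in enumerate(symbols)}
                tables.append(symbols)
                index_columns.append([lookup[value] for value in column])
            # store the tables plus one small index tuple per record
            return pickle.dumps((tables, list(zip(*index_columns))))

        def decode(blob):
            tables, rows = pickle.loads(blob)
            return [tuple(table[i] for table, i in zip(tables, row))
                    for row in rows]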

    What are the typical contents like? Could you post a link to an example?

    --
    Ben.

  5. Re: efficient compression tools

    sam.n.seaborn@gmail.com wrote:

    > I am facing the following problem (on a mainframe platform). After a
    > few somewhat crude attempts on my own, I thought I would seek guidance
    > from any experts out there...
    >
    > I have a log file with the following fixed format (it is in fact an MVS
    > PDSE but that shouldn't matter).
    >
    > 15-char data set name (padded right with spaces)
    > 1 space
    > 8-char timestamp mmddyyyy
    > 1 space
    > 6-char file size (i.e. can be at most 999999, padded left with spaces)
    >
    > My system produces about 20 such logfiles a day, each about 10MB. I
    > want to compress these files as much as possible using standard
    > UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    > that I have to be able to recover faithfully (100% no loss or changes)
    > and that the compression should not take more than 1-2 minutes per
    > run.
    >
    > I have tried all the standard tools (bzip2, gzip, pack, compress) and
    > combinations of these with sort, uniq, etc., but I am not able to break
    > the 3MB barrier. The best combination I've got, a crude pipeline that
    > sorts the data alphabetically by filename, does some prefix
    > suppression, and finally runs bzip2, gets me to about 3MB.


    Compression is all about reducing the size of the representation by
    taking advantage of knowledge of the higher-level structure of the
    data. You have some specific knowledge of the structure, so hopefully
    you should be able to get something good out of that.

    First, just to clarify the question, it sounds like you need to
    preserve the records themselves, but you don't need to preserve the
    order of the records within a file. That sounds like something you
    could exploit.

    As others have said, I would drop the padding between fields, since
    that is clearly not necessary.

    The timestamp and size fields could also be turned into raw integers.
    That's probably particularly helpful for the size.

    Someone else mentioned the idea of picking an arbitrary date and
    storing the others as differences from that date. One obvious way
    to do this is to scan the file and pick the lowest date, then store
    every other date as a difference to that one. The best form for
    the relative dates will depend on the range they're distributed
    across and how they're distributed across it. A good first cut
    would be to record in the output the size (in bytes) of the largest
    difference, then store all of the differences as raw integers of that
    size. However,
    if most of them are clustered in a very small area and there are a
    few outliers, it might be worth storing them in a variable-length
    form.

    Another approach to the dates is delta coding. You could simply
    sort the file by date, then store the first date as itself, but
    store each following date as the difference between it and the
    previous. Since you have so many records (about 300,000 of them,
    by my calculations) there are going to be a lot of repeated dates.
    That means there are going to be a lot of deltas that come out
    as zero.
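
    A minimal sketch in Python, assuming the dates have already been
    converted to day numbers and that record order really is free to
    change:

        def delta_encode(days):
            days = sorted(days)
            # first value verbatim, then successive differences; repeated
            # dates become zeros, which the final compressor handles well
            return [days[0]] + [b - a for a, b in zip(days, days[1:])]

        def delta_decode(deltas):
            out = [deltas[0]]
            for d in deltas[1:]:
                out.append(out[-1] + d)
            return out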

    Yet another thing you can do is eliminate 99% of the overhead of
    storing the length of your filenames. Again, since you have
    hundreds of thousands of records, you will have a lot of filenames
    of the same length. So you can group those together and store
    the length only once per group; you'd store it at most 15 times
    instead of 300,000 times. (You'd also need to store the number
    of records that the length applies to.)
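
    For instance (a sketch; the names are assumed to have had their
    padding stripped already, and the function names are illustrative):

        from itertools import groupby

        def encode_names(names):
            names = sorted(names, key=len)    # stable: keeps order per length
            # one (length, count, concatenated names) entry per distinct length
            return [(length, len(group), "".join(group))
                    for length, group in
                    ((k, list(g)) for k, g in groupby(names, key=len))]

        def decode_names(groups):
            names = []
            for length, count, blob in groups:
                names.extend(blob[i * length:(i + 1) * length]
                             for i in range(count))
            return names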

    Another idea: right now you are storing one record after another,
    with filenames, dates, and sizes interleaved. With compression
    programs like gzip and bzip2 that work well with long repeated
    (sub)strings, you might get some value out of storing all the
    filenames together, then all the dates, then all file sizes. If
    you do delta coding on the dates and get lots of zero deltas, by
    grouping dates together with no other intervening fields, you are
    going to get even longer strings of zeros, for example. And at
    least with gzip, there is a sliding-window dictionary of 32K
    bytes. By letting that sliding window devote itself
    entirely to fields of only one type, you are going to get the best
    performance out of it.
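
    A sketch of that rearrangement in Python, using the fixed column
    positions from the log format (each line is assumed to be one
    31-character record with the newline already stripped; the record
    count would be kept in a small header so decompression knows where
    the columns end):

        import bz2

        def columnar_compress(lines):
            names = "".join(line[0:15]  for line in lines)
            dates = "".join(line[16:24] for line in lines)
            sizes = "".join(line[25:31] for line in lines)
            return bz2.compress((names + dates + sizes).encode("ascii"), 9)

        def columnar_decompress(blob, n):
            text = bz2.decompress(blob).decode("ascii")
            names, dates, sizes = (text[:15 * n],
                                   text[15 * n:23 * n],
                                   text[23 * n:])
            return [names[i*15:(i+1)*15] + " " + dates[i*8:(i+1)*8]
                    + " " + sizes[i*6:(i+1)*6] for i in range(n)]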

    Along those lines, if the filenames have extensions or suffixes on
    them, you may even want to break those up and treat the extension as
    a separate field from the filename. To make this general (so that it
    doesn't choke on certain filenames), create a rule where you split
    the string in two, but make that rule general enough that it can
    be applied to any string and the split can still be reversible,
    such as "the second string is the last period and everything after
    it". You might end up with empty-string as the second string
    sometimes, but that's OK, if overall it improves the compression.
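
    One reversible split rule of that kind, sketched in Python:

        def split_name(name):
            # second piece is the last period and everything after it,
            # or the empty string when there is no period at all
            head, sep, tail = name.rpartition(".")
            return (head, sep + tail) if sep else (name, "")

        def join_name(head, ext):
            return head + ext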

    Final suggestion: if the ideas you get in this group don't do
    the trick, ask in comp.compression. Last I checked, there were
    some people there with some pretty good insights. The only thing
    they might not all understand is the constraint that it needs to
    be done with standard Unix tools. But you can do almost anything in
    Perl, which you allowed, so that shouldn't be a big problem.

    - Logan

  6. Re: efficient compression tools

    sam.n.seaborn@gmail.com wrote:
    > Hi all,
    >
    > I am facing the following problem (on a mainframe platform). After a
    > few somewhat crude attempts on my own, I thought I would seek guidance
    > from any experts out there...
    >
    > I have a log file with the following fixed format (it is in fact an MVS
    > PDSE but that shouldn't matter).
    >
    > 15-char data set name (padded right with spaces)
    > 1 space
    > 8-char timestamp mmddyyyy
    > 1 space
    > 6-char file size (i.e. can be at most 999999, padded left with spaces)
    >
    > My system produces about 20 such logfiles a day, each about 10MB. I
    > want to compress these files as much as possible using standard
    > UNIX tools (e.g. perl, python, compress, pack, etc.). The constraints are
    > that I have to be able to recover faithfully (100% no loss or changes)
    > and that the compression should not take more than 1-2 minutes per
    > run.


    How long do you intend to keep the files? If you don't bother
    compressing them, 200 MB/day is only about 71 GB/year. SCSI disks are
    about the most expensive around, and cost less than $200 for a year's
    worth of data. 1 TB SATA disks cost even less.

    Is there really any point in worrying about this?
