Looking for a rock-solid CSV file parser - Unix


  1. Looking for a rock-solid CSV file parser


    I am looking for a strong parser to be used with CSV (Comma Separated
    Values) files.

    I would like to have operations such as counting the number of fields
    in each line. No, I can't simply count commas in every line. I need to
    make sure that every line is syntactically correct.

    What I need is some sort of 'awk' for CSV files. The "strong"
    requirement is important as I don't want some wise guy trying to sneak
    executable code into my application.

    Is there such a thing?

    TIA,

    -Ramon

    ps: is there a formal spec for CSV files?


  2. Re: Looking for a rock-solid CSV file parser

    On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    > I am looking for a strong parser to be used with CSV (Comma Separated
    > Values) files.
    >
    > I would like to have operations such as counting the number of fields
    > in each line. No, I can't simply count commas in every line. I need to
    > make sure that every line is syntactically correct.
    >
    > What I need is some sort of 'awk' for CSV files. The "strong"
    > requirement is important as I don't want some wise guy trying to sneak
    > executable code into my application.
    >
    > Is there such a thing?
    >
    > TIA,
    >
    > -Ramon
    >
    > ps: is there a formal spec for CSV files?



    Now that I think about it, all I really need is a (set of) function
    which parses a single CSV line.

    I can provide the enclosing loop. :-)

    -Ramon
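
    A minimal sketch of such a single-line function, assuming Python and
    its standard csv module (the thread does not settle on a language, and
    parse_csv_line is a hypothetical name). strict=True asks the module to
    reject some malformed quoting instead of guessing, and the field count
    is simply the length of the result. One caveat: a quoted field may
    contain a line break, so one physical line is not always one record.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV record and return its fields as a list of strings.

    Raises ValueError if the csv module rejects the line in strict mode.
    """
    try:
        rows = list(csv.reader([line], strict=True))
    except csv.Error as exc:
        raise ValueError(f"malformed CSV line: {exc}") from exc
    # Exactly one input line was supplied, so at most one record comes back.
    return rows[0] if rows else []


if __name__ == "__main__":
    fields = parse_csv_line('aa,"11, 22",cc')
    print(len(fields), fields)
```

    The enclosing loop can then call parse_csv_line() per line and treat
    ValueError as "syntactically incorrect".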



  3. Re: Looking for a rock-solid CSV file parser

    On 2007-04-20, Ramon F Herrera wrote:
    >
    > I am looking for a strong parser to be used with CSV (Comma Separated
    > Values) files.
    >
    > I would like to have operations such as counting the number of fields
    > in each line. No, I can't simply count commas in every line. I need to
    > make sure that every line is syntactically correct.
    >
    > What I need is some sort of 'awk' for CSV files. The "strong"
    > requirement is important as I don't want some wise guy trying to sneak
    > executable code into my application.
    >
    > Is there such a thing?


    There are many. I have some shell functions that do the job. The
    question is, what job? See below.

    > ps: is there a formal spec for CSV files?


    No. That's one of the problems. There are many variants (e.g., how
    are commas, or other separators, handled inside a field?).

    I've taken to using CTV (Character-Terminated Values) format, with
    escaped terminator characters when embedded in a field. Among other
    advantages, the last character of the record can be checked to
    determine what separator is used.
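
    A rough sketch of how a CTV record could be read, written in Python
    for illustration. The post does not say which escape character it
    uses, so the backslash here is an assumption, and parse_ctv is a
    hypothetical name, not the author's shell code. The terminator is
    taken from the last character of the record, as described above.

```python
def parse_ctv(record, escape="\\"):
    """Split a character-terminated record into its fields.

    The terminator is whatever character ends the record. The backslash
    escape is an assumption; the post does not name its escape character.
    """
    if not record:
        raise ValueError("empty record")
    terminator = record[-1]
    if terminator == escape:
        raise ValueError("record ends with the escape character")
    fields, current = [], []
    chars = iter(record)
    for ch in chars:
        if ch == escape:
            # Take the next character literally, whatever it is.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling escape at end of record")
            current.append(nxt)
        elif ch == terminator:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        raise ValueError("last field is not terminated")
    return fields


if __name__ == "__main__":
    print(parse_ctv("aa|11\\|22|cc|"))
    print(parse_ctv("x;y;"))
```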

    --
    Chris F.A. Johnson, author |
    Shell Scripting Recipes: | My code in this post, if any,
    A Problem-Solution Approach | is released under the
    2005, Apress | GNU General Public Licence

  4. Re: Looking for a rock-solid CSV file parser

    On Thu, 19 Apr 2007 21:51:20 -0700, Ramon F Herrera wrote:

    > I am looking for a strong parser to be used with CSV (Comma Separated
    > Values) files.
    >
    > I would like to have operations such as counting the number of fields in
    > each line. No, I can't simply count commas in every line. I need to make
    > sure that every line is syntactically correct.
    >
    > What I need is some sort of 'awk' for CSV files. The "strong"
    > requirement is important as I don't want some wise guy trying to sneak
    > executable code into my application.
    >
    > Is there such a thing?


    A Google Groups search for "csv parser" gives useful information in the
    second link:

    http://groups.google.com/groups/search?q=csv+parser

    --
    James Antill -- james@and.org
    http://www.and.org/and-httpd/ -- $2,000 security guarantee
    http://www.and.org/vstr/

  5. Re: Looking for a rock-solid CSV file parser

    Ramon F Herrera wrote:
    > On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    >> I am looking for a strong parser to be used with CSV (Comma Separated
    >> Values) files.

    [snip]
    > Now that I think about it, all I really need is a (set of) function
    > which parses a single CSV line.


    Depending on what you're doing with the data, you might find the PHP
    scripting language useful. In particular, I use its fgetcsv() function
    (http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
    CSV values into an array, from which I can process the data in any way I
    like. If you're already familiar with PHP scripting, it's quite handy.

    (Note that while PHP is often used for website scripting, it is also
    quite useful from the command line. On Debian this functionality
    requires installing one of the php?-cli packages. YMMV.)

  6. Re: Looking for a rock-solid CSV file parser

    John-Paul Stewart wrote:
    > Ramon F Herrera wrote:
    >> On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    >>> I am looking for a strong parser to be used with CSV (Comma Separated
    >>> Values) files.

    > [snip]
    >> Now that I think about it, all I really need is a (set of) function
    >> which parses a single CSV line.

    >
    > Depending on what you're doing with the data, you might find the PHP
    > scripting language useful. In particular, I use its fgetcsv() function
    > (http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
    > CSV values into an array, from which I can process the data in any way I
    > like. If you're already familiar with PHP scripting, it's quite handy.
    >
    > (Note that while PHP is often used for website scripting, it is also
    > quite useful from the command line. On Debian this functionality
    > requires installing one of the php?-cli packages. YMMV.)


    Or simply write your first C program. Jobs like this are really good
    projects to teach yourself 'C' on.


  7. Re: Looking for a rock-solid CSV file parser

    On Apr 21, 4:57 am, The Natural Philosopher wrote:
    > John-Paul Stewart wrote:
    > > Ramon F Herrera wrote:
    > >> On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    > >>> I am looking for a strong parser to be used with CSV (Comma Separated
    > >>> Values) files.

    > > [snip]
    > >> Now that I think about it, all I really need is a (set of) function
    > >> which parses a single CSV line.

    >
    > > Depending on what you're doing with the data, you might find the PHP
    > > scripting language useful. In particular, I use its fgetcsv() function
    > > (http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
    > > CSV values into an array, from which I can process the data in any way I
    > > like. If you're already familiar with PHP scripting, it's quite handy.

    >
    > > (Note that while PHP is often used for website scripting, it is also
    > > quite useful from the command line. On Debian this functionality
    > > requires installing one of the php?-cli packages. YMMV.)

    >
    > Or simply write your first C program. Jobs like this are really good
    > projects to teach yourself 'C' on.


    I have been writing code (mainly C, but also Java) for a quarter of a
    century, Nat the Philo :-). I have decided to take myself out of the
    programming chair and hire some fresh kid to do the leg work while I
    do the part that requires wisdom and gray hair.

    In fact, this is part of a much bigger generalized program that
    converts lots of files with all kinds of formats (fixed-width, CSV,
    EBCDIC, etc.) into a normalized canonical format which is fed into
    Oracle. I even designed a formal description language (with lex+yacc,
    but I'll move it to Antlr as soon as I learn it) to describe the file
    contents.

    I figure that I don't have all the test cases for CSV, I might miss
    something, so why reinvent the wheel? Ergo: use somebody else's fine-
    tuned code for CSV and only CSV.

    As you philosophers say: QED.

    -Ramon



  8. Re: Looking for a rock-solid CSV file parser

    >I am looking for a strong parser to be used with CSV (Comma Separated
    >Values) files.
    >
    >I would like to have operations such as counting the number of fields
    >in each line. No, I can't simply count commas in every line. I need to
    >make sure that every line is syntactically correct.


    Is CSV actually well-defined enough so that "every line is syntactically
    correct" is meaningful?

    >What I need is some sort of 'awk' for CSV files. The "strong"
    >requirement is important as I don't want some wise guy trying to sneak
    >executable code into my application.


    Here's a way to construct a test case:

    Step 1: write a CSV line with a single value, a string, containing
    all of the printable characters in the character set you are using
    (ASCII?) in character-set-code order with particular attention to
    comma, any kind of quote, backslash, space, and any other character
    used to quote stuff. Extra credit: include *all* of the characters
    with particular attention to carriage return, newline, tab and nul.

    Step 2: write a CSV line with multiple numeric values, consisting of
    all of the prime numbers between 2 and 100, inclusive and in order.

    Step 3: write a CSV line with 10,000 string values, each one of them
    consisting of the line from step 1. (Warning: this line will be over
    1 meg long if there are more than about 100 printable characters.)

    Step 4: write a CSV line with 3 string values, each value consisting of
    the lines from steps 1, 2, and 3, respectively.

    Step 5: write a CSV line with 4 string values, each value consisting of
    the lines from steps 1, 2, 3, and 4, respectively.

    Step 6: write a CSV line with 5 string values, each value consisting of
    the lines from steps 1, 2, 3, 4, and 5, respectively. (Warning: if the
    line from Step 3 is 1 meg, this one will be over 4 meg).

    Now test these 6 lines against your parser.
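
    One way to generate those six lines, sketched in Python. It assumes
    the csv module's default dialect (double-quote quoting, doubled quotes
    inside fields) as the definition of "a CSV line", which is exactly the
    kind of choice the steps are meant to expose. The helper names are
    hypothetical, and only the printable ASCII range is covered.

```python
import csv
import io

# Some fields in steps 4 to 6 exceed the csv module's default field size
# limit, so raise it before reading the lines back.
csv.field_size_limit(2**31 - 1)


def to_line(values):
    """Serialise one record with csv.writer and drop the line terminator."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(values)
    return buf.getvalue()[:-1]


def parse_line(line):
    """Read one record back; strict mode raises csv.Error on bad quoting."""
    return next(csv.reader([line], strict=True))


def build_cases(repeat=10_000):
    """Return the six test lines from steps 1 to 6, in order."""
    printable = "".join(chr(c) for c in range(32, 127))
    primes = [n for n in range(2, 101) if all(n % d for d in range(2, n))]
    lines = [to_line([printable]), to_line(primes)]
    lines.append(to_line([lines[0]] * repeat))
    for _ in range(3):
        # Steps 4, 5 and 6: one field per line built so far.
        lines.append(to_line(lines))
    return lines


if __name__ == "__main__":
    for number, line in enumerate(build_cases(), start=1):
        print(number, len(line), len(parse_line(line)))
```

    A parser under test should return, for step N >= 4, exactly the lines
    of the earlier steps as its field values.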

    Then test the parser using its own executable as input. It's allowed
    to signal errors, but not crash the program.

    Test the parser using the output of /dev/random as input. It's allowed
    to signal errors, but not crash the program. On systems without /dev/random,
    try using something large and encrypted with a solid encryption system.
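
    The random-input test might look like the sketch below, again in
    Python with the standard csv module standing in for the parser under
    test. os.urandom is used in place of reading /dev/random directly, and
    fuzz_once is a hypothetical name. A csv.Error counts as "signalling an
    error"; any other exception is the failure being hunted for.

```python
import csv
import os


def fuzz_once(n_bytes=4096):
    """Feed random bytes to csv.reader; return (records_read, error_or_None).

    csv.Error counts as a signalled error. Any other exception propagates
    to the caller, which is the failure this test is looking for.
    """
    # latin-1 maps every byte to one character, so decoding never fails.
    text = os.urandom(n_bytes).decode("latin-1")
    records = 0
    try:
        for _ in csv.reader(text.splitlines(keepends=True), strict=True):
            records += 1
    except csv.Error as exc:
        return records, str(exc)
    return records, None


if __name__ == "__main__":
    outcomes = [fuzz_once() for _ in range(100)]
    errors = sum(1 for _, err in outcomes if err is not None)
    print(errors, "of", len(outcomes), "runs signalled an error")
```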

    Now, assuming that the character set is ASCII (character codes 0 -
    127), and you're only going for the printable characters, can anyone
    claim that there is exactly one correct sequence of characters on
    a line resulting from Step 1? For the moment, forget about OS
    differences in how lines are ended.


  9. Re: Looking for a rock-solid CSV file parser

    On Apr 21, 3:24 pm, Ramon F Herrera wrote:
    > On Apr 21, 4:57 am, The Natural Philosopher wrote:
    >
    >
    >
    > > John-Paul Stewart wrote:
    > > > Ramon F Herrera wrote:
    > > >> On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    > > >>> I am looking for a strong parser to be used with CSV (Comma Separated
    > > >>> Values) files.
    > > > [snip]
    > > >> Now that I think about it, all I really need is a (set of) function
    > > >> which parses a single CSV line.

    >
    > > > Depending on what you're doing with the data, you might find the PHP
    > > > scripting language useful. In particular, I use its fgetcsv() function
    > > > (http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
    > > > CSV values into an array, from which I can process the data in any way I
    > > > like. If you're already familiar with PHP scripting, it's quite handy.

    >
    > > > (Note that while PHP is often used for website scripting, it is also
    > > > quite useful from the command line. On Debian this functionality
    > > > requires installing one of the php?-cli packages. YMMV.)

    >
    > > Or simply write your first C program. Jobs like this are really good
    > > projects to teach yourself 'C' on.

    >
    > I have been writing code (mainly C, but also Java) for a quarter of a
    > century, Nat the Philo :-). I have decided to take myself out of the
    > programming chair and hire some fresh kid to do the leg work while I
    > do the part that requires wisdom and gray hair.
    >
    > In fact, this is part of a much bigger generalized program that
    > converts lots of files with all kinds of formats (fixed-width, CSV,
    > EBCDIC, etc.) into a normalized canonical format which is fed into
    > Oracle. I even designed a formal description language (with lex+yacc,
    > but I'll move it to Antlr as soon as I learn it) to describe the file
    > contents.
    >
    > I figure that I don't have all the test cases for CSV, I might miss
    > something, so why reinvent the wheel? Ergo: use somebody else's fine-
    > tuned code for CSV and only CSV.


    Recognisably wisdom from experience :-)

    NIH is a pathology.

    >
    > As you philosophers say: QED.
    >
    > -Ramon




  10. Re: Looking for a rock-solid CSV file parser

    toby wrote:
    > On Apr 21, 3:24 pm, Ramon F Herrera wrote:
    >> On Apr 21, 4:57 am, The Natural Philosopher wrote:
    >>
    >>
    >>
    >>> John-Paul Stewart wrote:
    >>>> Ramon F Herrera wrote:
    >>>>> On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    >>>>>> I am looking for a strong parser to be used with CSV (Comma Separated
    >>>>>> Values) files.
    >>>> [snip]
    >>>>> Now that I think about it, all I really need is a (set of) function
    >>>>> which parses a single CSV line.
    >>>> Depending on what you're doing with the data, you might find the PHP
    >>>> scripting language useful. In particular, I use its fgetcsv() function
    >>>> (http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
    >>>> CSV values into an array, from which I can process the data in any way I
    >>>> like. If you're already familiar with PHP scripting, it's quite handy.
    >>>> (Note that while PHP is often used for website scripting, it is also
    >>>> quite useful from the command line. On Debian this functionality
    >>>> requires installing one of the php?-cli packages. YMMV.)
    >>> Or simply write your first C program. Jobs like this are really good
    >>> projects to teach yourself 'C' on.

    >> I have been writing code (mainly C, but also Java) for a quarter of a
    >> century, Nat the Philo :-). I have decided to take myself out of the
    >> programming chair and hire some fresh kid to do the leg work while I
    >> do the part that requires wisdom and gray hair.
    >>
    >> In fact, this is part of a much bigger generalized program that
    >> converts lots of files with all kinds of formats (fixed-width, CSV,
    >> EBCDIC, etc.) into a normalized canonical format which is fed into
    >> Oracle. I even designed a formal description language (with lex+yacc,
    >> but I'll move it to Antlr as soon as I learn it) to describe the file
    >> contents.
    >>
    >> I figure that I don't have all the test cases for CSV, I might miss
    >> something, so why reinvent the wheel? Ergo: use somebody else's fine-
    >> tuned code for CSV and only CSV.

    >
    > Recognisably wisdom from experience :-)
    >
    > NIH is a pathology.
    >

    Well I used to think like that, until I discovered that mostly other
    people's wheels have square axles, bent shafts, fall apart..


    I must have spent getting on for a quarter of a million quid on database
    software that ultimately never did what it was supposed to.

    I could have written it better for less than that.

    >> As you philosophers say: QED.
    >>



    If you want something done right, do it yourself.

    The great advantage of home rolled programs, is you know the guy who
    wrote them intimately, and he is always there to fix them for you.

    >> -Ramon

    >
    >


  11. Re: Looking for a rock-solid CSV file parser

    On Apr 20, 12:51 am, Ramon F Herrera wrote:
    > I am looking for a strong parser to be used with CSV (Comma Separated
    > Values) files.
    >
    > I would like to have operations such as counting the number of fields
    > in each line. No, I can't simply count commas in every line. I need to
    > make sure that every line is syntactically correct.
    >
    > What I need is some sort of 'awk' for CSV files. The "strong"
    > requirement is important as I don't want some wise guy trying to sneak
    > executable code into my application.
    >
    > Is there such a thing?


    Have you checked out libcsv yet? (See my response to your post at
    <http://groups.google.com/group/comp....se_frm/thread/a5a3bef8d6b8d057/#>.)
    You should be able to use it to write a program that does what you
    are trying to do pretty easily.

    > ps: is there a formal spec for CSV files?


    No, but there is a set of common conventions that the majority of
    applications using CSV follow, described at
    <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>. My CSV library
    follows these conventions. There is also an RFC for CSV as a MIME type at
    <http://tools.ietf.org/html/rfc4180>, but its description is overly strict
    and doesn't reflect traditional conventions, which limits its usefulness.
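
    To make the quoting rules concrete: RFC 4180 lets a quoted field
    contain commas, line breaks, and double quotes (escaped by doubling
    them). The sketch below uses Python's standard csv module purely for
    illustration. It is not libcsv and not a strict RFC 4180 validator,
    and read_records is a hypothetical name. It shows two records spread
    over three physical lines, which is why a parser that works one line
    at a time can miscount fields.

```python
import csv
import io

# Two records on three physical lines: the quoted field contains a CRLF
# and a doubled quote, both of which RFC 4180 permits.
DATA = 'id,comment\r\n1,"said ""hi"", then\r\nleft"\r\n'


def read_records(text):
    """Parse CSV text, preserving line breaks inside quoted fields."""
    # newline="" keeps the embedded CRLF exactly as written.
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    for record in read_records(DATA):
        print(len(record), record)
```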

    Robert Gamble


  12. Re: Looking for a rock-solid CSV file parser

    On Apr 22, 10:52 pm, The Natural Philosopher wrote:
    > toby wrote:
    > > On Apr 21, 3:24 pm, Ramon F Herrera wrote:
    > >> On Apr 21, 4:57 am, The Natural Philosopher wrote:

    >
    > >>> John-Paul Stewart wrote:
    > >>>> Ramon F Herrera wrote:
    > >>>>> On Apr 19, 11:51 pm, Ramon F Herrera wrote:
    > >>>>>> I am looking for a strong parser to be used with CSV (Comma Separated
    > >>>>>> Values) files....
    > >> I figure that I don't have all the test cases for CSV, I might miss
    > >> something, so why reinvent the wheel? Ergo: use somebody else's fine-
    > >> tuned code for CSV and only CSV.

    >
    > > Recognisably wisdom from experience :-)

    >
    > > NIH is a pathology.

    >
    > Well I used to think like that, until I discovered that mostly other
    > people's wheels have square axles, bent shafts, fall apart..
    >
    > I must have spent getting on for quarter of a million quid


    That sounds like the problem right there :-)

    Admittedly the industry has changed a LOT in the last 5-10-20 years.
    The people who changed it were the itch-scratchers: Richard, Linus,
    Monty and so on; individuals who proved that they were the *right*
    people to be solving particular problems. What holds it back are the
    Emperors, starting with Gates.

    > on database
    > software that ultimately never did what it was supposed to.
    >
    > I could have written it better for less than that.


    If you *had*, like Monty did, for instance, you could have GPL'd it
    and been a) leading a platform today or b) an acquisition target like
    Sleepycat and retired gracefully, with occasional trips to the
    International Space Station.

    >
    > >> As you philosophers say: QED.

    >
    > If you want something done right, do it yourself.
    >
    > The great advantage of home rolled programs, is you know the guy who
    > wrote them intimately, and he is always there to fix them for you.
    >
    > >> -Ramon




  13. Re: Looking for a rock-solid CSV file parser

    On Apr 22, 10:52 pm, The Natural Philosopher wrote:
    > ...
    > The great advantage of home rolled programs, is you know the guy who
    > wrote them intimately, and he is always there to fix them for you.


    The disadvantage is that he just may not be very smart, or very smart
    about the problem domain. For most problems, someone else out there
    knows more about it than you do. Google finds them pretty quickly...

    >
    > >> -Ramon




  14. Re: Looking for a rock-solid CSV file parser

    Ramon F Herrera wrote:
    >
    > I am looking for a strong parser to be used with CSV (Comma Separated
    > Values) files.
    >
    > I would like to have operations such as counting the number of fields
    > in each line. No, I can't simply count commas in every line. I need to
    > make sure that every line is syntactically correct.
    >
    > What I need is some sort of 'awk' for CSV files. The "strong"
    > requirement is important as I don't want some wise guy trying to sneak
    > executable code into my application.
    >
    > Is there such a thing?
    >
    > TIA,
    >
    > -Ramon
    >
    > ps: is there a formal spec for CSV files?


    As for "a formal spec", no, because it's like a formal spec for a car.
    There are VW, Toyota, etc. Most implementations differ in how they
    treat whitespace around comma separators.
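
    As an illustration of that whitespace difference, here is how one
    implementation, Python's standard csv module, treats a line with
    spaces around the separators under three settings. This is a sketch
    of one parser's behaviour, not a statement about what CSV "should"
    do, and parse is a hypothetical helper name.

```python
import csv

LINE = 'aa, "11, 22" ,cc'


def parse(line, **options):
    """Parse one line with the given csv.reader formatting options."""
    return next(csv.reader([line], **options))


if __name__ == "__main__":
    # Default: the space before the quote makes the quote an ordinary
    # character, so the comma inside it splits the field.
    print(parse(LINE))
    # Leading spaces are skipped, so the quoted field is recognised.
    print(parse(LINE, skipinitialspace=True))
    # Strict mode refuses the space between the closing quote and comma.
    try:
        parse(LINE, skipinitialspace=True, strict=True)
    except csv.Error as exc:
        print("strict mode rejects it:", exc)
```

    The same input therefore has four fields, three fields, or is an
    error, depending only on parser options.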

    If you want a shell solution, then you can find my approach to this
    problem in my .sig below. Essentially,

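    # Note: -C is not an option of stock bash's read; it appears to come
    # from the BashDiff-patched shell listed in the .sig below.
    while read -C a b c; do
        declare -p a b c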
    done << EOF
    aa, "11, 22" ,cc
    EOF

    --
    William Park , Toronto, Canada
    ThinFlash: Linux thin-client on USB key (flash) drive
    http://home.eol.ca/~parkw/thinflash.html
    BashDiff: Super Bash shell
    http://freshmeat.net/projects/bashdiff/
