PDF file translation - Suse


Thread: PDF file translation

  1. PDF file translation

    Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    text? I have a bunch of bank statements in .pdf format and I'd like to
    extract the numbers from them to put into database. The Acrobat "save to
    text" yields garbage - probably due to the table format used.

    --
    Will Honea

    --
    Posted via a free Usenet account from http://www.teranews.com


  2. Re: PDF file translation

    Will Honea wrote:

    > Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    > text? I have a bunch of bank statements in .pdf format and I'd like to
    > extract the numbers from them to put into database. The Acrobat "save to
    > text" yields garbage - probably due to the table format used.


    Hello Will,

    A PDF is itself plain text, with the relevant formatting information added.
    You can open the PDF in your preferred text editor and see whether you can
    systematically grep the text out of it.

    With regards,
    Hendric
    --
    Hendric Stattmann, Mödling, Austria. Registered Linux User #178879
    For e-mail contact, please use

  3. Re: PDF file translation

    Will Honea wrote:
    > Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    > text? I have a bunch of bank statements in .pdf format and I'd like to
    > extract the numbers from them to put into database. The Acrobat "save to
    > text" yields garbage - probably due to the table format used.
    >


    You may want to try pdftotext (it comes with openSUSE 10.3) or,
    alternatively, pdf2txt, a shell script downloadable from

    http://www.comp.eonworks.com/scripts/scripts.html

    I have never used it myself, but hopefully it works for you.

    wilbert

  4. Re: PDF file translation

    Hendric Stattmann wrote:

    > Will Honea wrote:
    >
    >> Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    >> text? I have a bunch of bank statements in .pdf format and I'd like to
    >> extract the numbers from them to put into database. The Acrobat "save to
    >> text" yields garbage - probably due to the table format used.

    >
    > Hello Will,
    >
    > PDF is plain text by itself, added with the relevant formatting
    > information. You may use your preferred text editor to open the PDF, there
    > you can see if you can systematically grep the text from there.



    Whoa! I must have some weird codepage issues here - I get the format/context
    info in plain text but the content resembles Chinese written in Arabic
    script! You sure about that plain text part?

    --
    Will Honea



  5. Re: PDF file translation

    Wilhelm Bertalan wrote:

    > Will Honea wrote:
    >> Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    >> text? I have a bunch of bank statements in .pdf format and I'd like to
    >> extract the numbers from them to put into database. The Acrobat "save to
    >> text" yields garbage - probably due to the table format used.
    >>

    >
    > You may want to try pdftotext (comes with oss10.3) or alternatively
    > pdf2txt, a shell script downloadable from
    >
    > http://www.comp.eonworks.com/scripts/scripts.html
    >
    > -never used it myself but hopefully it works for you.


    Perfect! Didn't even know it was on the system but it does the job nicely.
    Now, if it would only parse just the parts I want, the way I want it.

    --
    Will Honea



  6. Re: PDF file translation

    On Tue, 05 Feb 2008 07:58:49 -0700, Will Honea quoth:

    > Whoa! I must have some weird codepage issues here - I get the format/context
    > info in plain text but the content resembles Chinese written in Arabic
    > script! You sure about that plain text part?


    And when I run 'strings file.pdf' I also get codes and crap.

    It would have been cool if it had worked!

    Felmon


  7. Re: PDF file translation

    Hendric Stattmann wrote:
    >
    > PDF is plain text by itself,
    >

    No. Text within a PDF *may* be plain text, but it does not need to be. See
    the PDF output of the Scribus DTP program, for example: there, every single
    letter is positioned in absolute coordinates on the page. No luck when
    grepping for text O_o;

    Kind regards

    Jan
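
    One further reason grep and strings come up empty, on top of the
    per-letter positioning described above: page content in a PDF is commonly
    stored compressed (the FlateDecode filter, which is zlib). The Python
    sketch below is not a real PDF parser. make_fake_pdf_object and
    inflate_streams are hypothetical helper names, and the "statement" text is
    made up. It only illustrates that the text turns up again once a stream
    has been inflated.

```python
import re
import zlib


def make_fake_pdf_object(content: bytes) -> bytes:
    """Wrap a content stream roughly the way a PDF writer using FlateDecode might.

    Illustration only: a real PDF also has a header, an xref table, fonts, etc.
    """
    data = zlib.compress(content)
    header = b"1 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % len(data)
    return header + data + b"\nendstream\nendobj\n"


def inflate_streams(pdf_bytes: bytes):
    """Yield the inflated bytes of every stream that zlib can decompress.

    Streams that use other filters (or encryption) are silently skipped.
    """
    for match in re.finditer(rb"stream\n(.*?)\nendstream", pdf_bytes, re.DOTALL):
        try:
            yield zlib.decompress(match.group(1))
        except zlib.error:
            continue


if __name__ == "__main__":
    fake = make_fake_pdf_object(b"BT /F1 12 Tf 72 720 Td (Balance 1234.56) Tj ET")
    print(b"Balance" in fake)  # searching the raw bytes, as grep would
    for chunk in inflate_streams(fake):
        print(chunk.decode("latin-1"))
```

    Even after inflating, the text can still be split into individually
    positioned letters or use a font-specific encoding, so a dedicated
    extractor is usually the more practical route.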



  8. Re: PDF file translation

    Will Honea wrote:
    > Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    > text? I have a bunch of bank statements in .pdf format and I'd like to
    > extract the numbers from them to put into database. The Acrobat "save to
    > text" yields garbage - probably due to the table format used.
    >

    Not all data in a PDF file is text; it can also contain graphical
    information. This can happen if you scan the printed bank statements
    with a scanner.
    OTOH, if it is graphical output, you can try to do some text
    recognition (OCR). I don't know which software to use for that under Linux.

    Joost

  9. Re: PDF file translation

    Wilhelm Bertalan wrote on Tuesday 05 February 2008 10:36:

    > You may want to try pdftotext (comes with oss10.3) or alternatively
    > pdf2txt


    > -never used it myself but hopefully it works for you.


    pdftotext works fine; I have used it myself many times. Parsing the
    resulting text file requires some scripting effort, but that also works
    fine. If you need help, ask me via PM.

    Stefan
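
    The scripting effort mentioned above might look something like the
    Python sketch below. It assumes, purely for illustration, that each
    transaction line in the pdftotext output starts with a date and ends with
    an amount. The line format, the regular expression and the name
    parse_statement_line are all assumptions; a real statement will need its
    own pattern.

```python
import re
from decimal import Decimal

# Assumed line shape: "05.02.2008  Some description   -1,234.56"
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}\.\d{2}\.\d{4})\s+"
    r"(?P<desc>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})\s*$"
)


def parse_statement_line(line: str):
    """Return (date, description, amount) or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # Drop thousands separators before converting; Decimal avoids float rounding.
    amount = Decimal(match.group("amount").replace(",", ""))
    return match.group("date"), match.group("desc"), amount
```

    Lines that do not fit the pattern (headers, page footers, blank lines)
    come back as None and can simply be skipped by the caller.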

  10. Re: PDF file translation

    Will Honea wrote:

    > Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    > text? I have a bunch of bank statements in .pdf format and I'd like to
    > extract the numbers from them to put into database. The Acrobat "save to
    > text" yields garbage - probably due to the table format used.
    >

    Try PDFedit. You can do a lot of things to PDF documents with this
    extensive program. So here is what you can do to get the text you want.

    While viewing a page: -> Page -> Extract Text from Page

    This will pop up a preview window showing the extracted text. Now left
    click + drag over the desired text. Right click -> Copy Text. This will
    copy the text to your clipboard. Right click and paste into a text
    editor/word processor of your choice....

    Should work well for any PDF document.

    PDFedit is in the repos, or use the openSUSE software download service.

  11. Re: PDF file translation

    Will Honea wrote:

    > Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    > text? I have a bunch of bank statements in .pdf format and I'd like to
    > extract the numbers from them to put into database. The Acrobat "save to
    > text" yields garbage - probably due to the table format used.
    >

    Try Open Office Writer.

    Marty Felker

  12. Re: PDF file translation

    Marty Felker wrote:

    > Will Honea wrote:
    >
    >> Can anyone suggest an app that will extract a PDF (acrobat) file to plain
    >> text? I have a bunch of bank statements in .pdf format and I'd like to
    >> extract the numbers from them to put into database. The Acrobat "save to
    >> text" yields garbage - probably due to the table format used.
    >>

    > Try Open Office Writer.


    Marty, I tried oo but couldn't figure out how to get it to read the blasted
    PDF file. That's not to say it's not there, but I sure missed it - is it
    an add-on filter or something?

    For this project, pdftotext is ideal as it makes scripting through several
    hundred source files simple but the function would be generally useful and
    I thought I had come across it but these poor, tired old eyes just couldn't
    get me there yesterday (maybe next week's cataract surgery will solve that
    problem).

    --
    Will Honea



  13. Re: PDF file translation

    Stefan Bredow wrote:
    > pdftotext works fine, used it myself many times. Parsing the resulting text
    > file requires some scripting effort but also works fine. If help is needed
    > ask me via pm.


    Why via personal mail? Are we not allowed to either help or know the
    answer?

    houghi
    --
    Listen do you hear them drawing near in their search for the sinners?
    Feeding on the power of our fear and the evil within us.
    Incarnation of Satan's creation of all that we dread.
    When the demons arrive those alive would be better off dead!

  14. Re: PDF file translation

    On Tue, 05 Feb 2008 22:11:34 -0700, Will Honea wrote:

    > Marty Felker wrote:
    >
    >> Will Honea wrote:
    >>
    >>> Can anyone suggest an app that will extract a PDF (acrobat) file to
    >>> plain text? I have a bunch of bank statements in .pdf format and I'd
    >>> like to extract the numbers from them to put into database. The
    >>> Acrobat "save to text" yields garbage - probably due to the table
    >>> format used.
    >>>

    >> Try Open Office Writer.

    >
    > Marty, I tried oo but couldn't figure out how to get it to read the
    > blasted PDF file. That's not to say it's not there, but I sure missed
    > it - is it an add-on filter or something?
    >
    > For this project, pdftotext is ideal as it makes scripting through
    > several hundred source files simple but the function would be generally
    > useful and I thought I had come across it but these poor, tired old eyes
    > just couldn't get me there yesterday (maybe next week's cataract surgery
    > will solve that problem).
    >
    > --
    > Will Honea


    Try the option
    pdftotext -layout
    --help says " -layout : maintain original physical layout"
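
    To run that over several hundred files, a small wrapper along these lines
    could work. It is a Python sketch that assumes pdftotext is on the PATH
    and is called as "pdftotext -layout input.pdf output.txt"; the function
    names build_command and convert_all, and the idea of writing each .txt
    next to its .pdf, are hypothetical choices.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path: Path, layout: bool = True) -> list:
    """Build the pdftotext argument list; output goes next to the input as .txt."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # maintain the original physical layout
    cmd += [str(pdf_path), str(pdf_path.with_suffix(".txt"))]
    return cmd


def convert_all(directory: Path) -> int:
    """Convert every *.pdf in the directory; return how many files were converted."""
    count = 0
    for pdf_path in sorted(directory.glob("*.pdf")):
        # check=True raises CalledProcessError if pdftotext reports a failure
        subprocess.run(build_command(pdf_path), check=True)
        count += 1
    return count
```

    Keeping the command construction in its own function makes it easy to
    check what would be run before touching any files.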

  15. Re: PDF file translation

    graham wrote:

    > On Tue, 05 Feb 2008 22:11:34 -0700, Will Honea wrote:
    >
    >> Marty Felker wrote:
    >>
    >>> Will Honea wrote:
    >>>
    >>>> Can anyone suggest an app that will extract a PDF (acrobat) file to
    >>>> plain text? I have a bunch of bank statements in .pdf format and I'd
    >>>> like to extract the numbers from them to put into database. The
    >>>> Acrobat "save to text" yields garbage - probably due to the table
    >>>> format used.
    >>>>
    >>> Try Open Office Writer.

    >>
    >> Marty, I tried oo but couldn't figure out how to get it to read the
    >> blasted PDF file. That's not to say it's not there, but I sure missed
    >> it - is it an add-on filter or something?
    >>
    >> For this project, pdftotext is ideal as it makes scripting through
    >> several hundred source files simple but the function would be generally
    >> useful and I thought I had come across it but these poor, tired old eyes
    >> just couldn't get me there yesterday (maybe next week's cataract surgery
    >> will solve that problem).
    >>
    >> --
    >> Will Honea

    >
    > Try the option
    > pdftotext -layout
    > --help says " -layout : maintain original physical layout"


    That's the option I used - makes it easy to write a script that ends up with
    all the data in CSV format.

    Thanks, all, for the comments.

    --
    Will Honea
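
    A script of the kind described could be as small as the Python sketch
    below. It assumes the columns in the -layout output are separated by runs
    of two or more spaces, which is a guess about the statement format rather
    than anything pdftotext guarantees; layout_to_csv is a hypothetical name
    and the sample lines in the usage note are made up.

```python
import csv
import io
import re

COLUMN_GAP = re.compile(r"\s{2,}")  # assumed: columns are 2+ spaces apart


def layout_to_csv(text: str) -> str:
    """Convert layout-preserved text to CSV, one row per non-blank line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        writer.writerow(COLUMN_GAP.split(line))
    return out.getvalue()
```

    For example, the two lines "Date        Description     Amount" and
    "05.02.2008  Grocery store   -1,234.56" become two CSV rows of three
    fields each; the csv module quotes the amount because it contains a comma.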



  16. Re: PDF file translation

    Jan Kandziora wrote:

    > No. Text within a PDF *may* be plain text, but it does not need to.


    Hello Jan,

    I checked this once more and found that you are right.
    Fortunately, other users have suggested good ways to solve the problem.

    With regards,
    Hendric
    --
    Hendric Stattmann, Mödling, Austria. Registered Linux User #178879
    For e-mail contact, please use

  17. Re: PDF file translation

    Will Honea wrote:

    > graham wrote:


    [...]

    >>>> Try Open Office Writer.
    >>>
    >>> Marty, I tried oo but couldn't figure out how to get it to read the
    >>> blasted PDF file. That's not to say it's not there, but I sure missed
    >>> it - is it an add-on filter or something?


    Although OpenOffice can convert the file formats it understands into PDF,
    AFAIK it cannot read PDF files itself.

    [...]

    --
    Les
    Posted exclusively to the alt.os.linux.suse newsgroup on Usenet

  18. Re: PDF file translation

    Leslie Danks wrote:

    > Although OpenOffice can convert the file formats it understands into PDF,
    > AFAIK it cannot read PDF files itself.


    KOffice can open pdf files. The results are not always great, however.

    --
    Don

