Text database? - Unix



  1. Text database?

    I am trying to implement a text recognition module. But I need some
    characters to train the algorithms with. Does anyone know of a free
    online database that contains characters?

  2. Re: Text database?

    > But I need some
    > characters to train the algorithms with. Does anyone know of a free
    > online database that contains characters?


    Wouldn't the Internet itself serve as such a database?
    Or perhaps a subset, like Usenet, or something narrower still:
    talk.religion.newage, for example, is full of long, winding
    texts.

    wget + sed|perl|(g)awk|python|... should get you a _lot_
    of training data.
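
A minimal Python sketch of the fetch-and-strip idea, using only the
standard library. The names TextExtractor, html_to_text and fetch_text
are made up for illustration, and the approach (drop tags, skip script
and style bodies, collapse whitespace) is one assumption about what
"training text" should look like, not the only reasonable one.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def fetch_text(url):
    """Download one page and return its plain text (needs network access)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))
```

Calling fetch_text on a list of URLs and concatenating the results
would give a rough corpus; html_to_text can also be run on pages
already saved by wget.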

    HTH and TTFN,
    Tarkin

  3. Re: Text database?

    saneman wrote:
    > I am trying to implement a text recognition module. But I need some
    > characters to train the algorithms with. Does anyone know of a free
    > online database that contains characters?


    You're using an online resource right now that contains characters.

    If you'd like a larger, more standardized corpus of text, the Gutenberg
    project could probably help.

    I suppose there's a chance you actually want bitmaps of fonts, though.
    That could be accomplished by downloading some fonts, or by using
    bitmaps that contain rasterized fonts. One could even create such
    bitmaps by opening a text editor or word processor, typing or pasting
    the desired characters, and capturing the result as an image.

    - Logan

  4. Re: Text database?

    >I am trying to implement a text recognition module. But I need some
    >characters to train the algorithms with. Does anyone know of a free
    >online database that contains characters?


    Post a message on USENET using your real email address, and you'll
    have an unending supply of fresh SPAM. Does that meet your requirements?




  5. Re: Text database?

    On Mar 17, 1:54 am, saneman wrote:
    > I am trying to implement a text recognition module. But I need some
    > characters to train the algorithms with. Does anyone know of a free
    > online database that contains characters?


    You'd probably be better off asking on sci.image.processing, where you
    were posting in the first place. That said, this is a reasonable place
    for the following point:

    You are presumably after a database of images of characters. You could
    synthesize one by rasterizing a number of fonts (automatically) and
    then adding various kinds of noise or various distortions.
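
As a rough illustration of the noise step, here is a Python sketch
that starts from a tiny hand-written bitmap (a stand-in for the output
of a real font rasterizer) and produces noisy copies by flipping
pixels at random. GLYPH_A, the function names, and the 5% flip
probability are all hypothetical choices made for the example.

```python
import random

# Hypothetical 5x7 bitmap of the letter "A"; in practice this would come
# from a font rasterizer. "1" is ink, "0" is background.
GLYPH_A = [
    "01110",
    "10001",
    "10001",
    "11111",
    "10001",
    "10001",
    "10001",
]


def to_pixels(rows):
    """Convert rows of '0'/'1' characters into lists of ints."""
    return [[int(ch) for ch in row] for row in rows]


def add_salt_and_pepper(pixels, flip_prob, rng):
    """Return a copy with each pixel flipped independently with flip_prob."""
    return [
        [1 - p if rng.random() < flip_prob else p for p in row]
        for row in pixels
    ]


def make_samples(rows, count, flip_prob=0.05, seed=0):
    """Generate `count` noisy variants of one glyph, reproducibly."""
    rng = random.Random(seed)
    clean = to_pixels(rows)
    return [add_salt_and_pepper(clean, flip_prob, rng) for _ in range(count)]
```

Seeding the generator keeps the synthetic training set reproducible
from run to run, which helps when comparing classifiers.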

    I have a program for generating rasterizations, available from here:

    http://linuxfromscratch.org/pipermai...ry/004748.html

    look at links-2.1pre32-italic.patch.gz

    You can apply this patch in an empty directory to extract the relevant
    files.

    To add distortions, you may wish to experiment with pnmscale,
    pnmrotate, pgmnoise and pnmshear. To be honest, comp.unix.shell is
    also a good place for this kind of command-line stuff, so I've
    cross-posted there as well. Maybe some ImageMagick expert can weigh
    in on adding errors automatically.
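
For readers without those tools to hand, the following Python sketch
shows one geometric distortion, a horizontal shear, plus a writer for
the plain-text PGM format so the result can still be fed to the Netpbm
programs. It is not a reimplementation of pnmshear: the function names
are invented, shifts are rounded to whole pixels, and no anti-aliasing
is done.

```python
import math


def shear_rows(pixels, angle_degrees, background=0):
    """Shear a grayscale image horizontally.

    Each row is shifted sideways in proportion to its distance from the
    bottom row, and the canvas is widened so nothing is cropped.
    """
    height = len(pixels)
    width = len(pixels[0])
    slope = math.tan(math.radians(angle_degrees))
    shifts = [round((height - 1 - y) * slope) for y in range(height)]
    low = min(shifts)
    new_width = width + max(shifts) - low
    sheared = []
    for row, shift in zip(pixels, shifts):
        left = shift - low
        right = new_width - width - left
        sheared.append([background] * left + list(row) + [background] * right)
    return sheared


def to_plain_pgm(pixels, maxval=255):
    """Serialize as plain-text PGM ("P2").

    One value per line keeps every line well under the format's
    70-character limit.
    """
    header = f"P2\n{len(pixels[0])} {len(pixels)}\n{maxval}\n"
    body = "\n".join(str(p) for row in pixels for p in row)
    return header + body + "\n"
```

Running shear_rows over a range of small positive and negative angles
gives a family of slanted variants of each glyph.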

    -Ed
    --
    (You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)

    /d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
    r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
    d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage


  6. Re: Text database?

    Edward Rosten wrote:
    > On Mar 17, 1:54 am, saneman wrote:
    >> I am trying to implement a text recognition module. But I need some
    >> characters to train the algorithms with. Does anyone know of a free
    >> online database that contains characters?

    >
    > You'd probably be better off asking on sci.image.processing, where you
    > were posting in the first place. That said, this is a reasonable place
    > for the following point:
    >
    > You are presumably after a database of images of characters. You could
    > synthesize one by rasterizing a number of fonts (automatically) and
    > then adding various kinds of noise or various distortions.
    >
    > I have a program for generating rasterizations, available from here:
    >
    > http://linuxfromscratch.org/pipermai...ry/004748.html
    >
    > look at links-2.1pre32-italic.patch.gz
    >
    > You can apply this patch in an empty directory to extract the relevant
    > files.
    >
    > To add distortions, you may wish to experiment with pnmscale,
    > pnmrotate, pgmnoise and pnmshear. To be honest, comp.unix.shell is
    > also a good place for this kind of command-line stuff, so I've
    > cross-posted there as well. Maybe some ImageMagick expert can weigh
    > in on adding errors automatically.
    >
    > -Ed
    > --
    > (You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)
    >
    > /d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
    > r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
    > d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
    >


    This was just what I needed:

    http://yann.lecun.com/exdb/mnist/

    which is also used on the pages below:

    http://www.bcl.hamilton.ie/~barak/te...hw1/index.html
    http://www.iro.umontreal.ca/~lisa/tw...nistVariations
    http://www.int.tu-darmstadt.de/mlu/index.html
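
The MNIST files are distributed in the IDX binary format: two zero
bytes, a type byte (0x08 for unsigned bytes), a dimension count, the
dimension sizes as big-endian 32-bit integers, then the raw data. A
small Python reader might look like the sketch below. The function
names are made up for the example, and only the unsigned-byte type is
handled.

```python
import struct


def parse_idx(data):
    """Parse an IDX-format byte string into (dims, values).

    Only the unsigned-byte element type (0x08) is handled.
    """
    zero, dtype, ndim = struct.unpack_from(">HBB", data, 0)
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    dims = struct.unpack_from(f">{ndim}I", data, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    values = data[offset:offset + count]
    if len(values) != count:
        raise ValueError("file is shorter than its header claims")
    return dims, values


def iter_images(data):
    """Yield each image as a list of rows of pixel intensities (0-255).

    Assumes a three-dimensional file: image count, rows, columns.
    """
    (n, rows, cols), values = parse_idx(data)
    size = rows * cols
    for i in range(n):
        flat = values[i * size:(i + 1) * size]
        yield [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
```

The downloads are gzip-compressed, so in practice the bytes would come
from something like gzip.open(path, "rb").read() before being passed
to parse_idx or iter_images.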
