Text database? - Unix
-
Text database?
I am trying to implement a text recognition module. But I need some
characters to train the algorithms with. Does anyone know of a free
online database that contains characters?
-
Re: Text database?
> But I need some
> characters to train the algorithms with. Does anyone know of a free
> online database that contains characters?
Wouldn't the Internet itself serve as such a database?
Or perhaps a subset, like Usenet, or something even narrower:
talk.religion.newage, for example, is full of long, winding
texts.
wget + sed|perl|(g)awk|python|... should get you a _lot_
of training data.
HTH and TTFN,
Tarkin
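
The wget-plus-filter idea above could be sketched roughly as follows, assuming the pages have already been downloaded and the goal is plain text plus per-character counts. The names TextExtractor, html_to_text and char_counts are made up for this illustration; only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup from one downloaded page and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def char_counts(text):
    """Count non-whitespace characters, e.g. to check class balance."""
    return Counter(ch for ch in text if not ch.isspace())


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>world</b></p></body></html>"
    text = html_to_text(page)
    print(text)
    print(char_counts(text).most_common(3))
```

Note that this only yields text, not images of characters, so it would suit language statistics rather than training a recognizer on glyph shapes.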
-
Re: Text database?
saneman wrote:
> I am trying to implement a text recognition module. But I need some
> characters to train the algorithms with. Does anyone know of a free
> online database that contains characters?
You're using an online resource right now that contains characters.
If you'd like a larger, more standardized corpus of text, Project
Gutenberg could probably help.
I suppose there's a chance you actually want bitmaps of fonts, though.
That could be accomplished by downloading some fonts. Or by using
some bitmaps containing rasterized fonts; one could even create such
bitmaps by loading some text into a text editor or word processor and
typing or pasting the desired characters.
- Logan
-
Re: Text database?
>I am trying to implement a text recognition module. But I need some
>characters to train the algorithms with. Does anyone know of a free
>online database that contains characters?
Post a message on USENET using your real email address, and you'll
have an unending supply of fresh SPAM. Does that meet your requirements?
-
Re: Text database?
On Mar 17, 1:54 am, saneman wrote:
> I am trying to implement a text recognition module. But I need some
> characters to train the algorithms with. Does anyone know of a free
> online database that contains characters?
You'd probably be better off asking on sci.image.processing, where you
were posting in the first place. That said, this is a reasonable place
for the following point:
You are presumably after a database of images of characters. You could
synthesize one by rasterizing a number of fonts (automatically) and
then adding various kinds of noise or various distortions.
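
As a rough illustration of the "rasterize, then add noise" idea, here is a minimal pure-Python sketch. It assumes the glyph is already available as a small 0/1 bitmap; GLYPH_T is a hand-drawn stand-in rather than a real rasterized font, and all function names are invented for the example.

```python
import random

# Hypothetical 5x7 bitmap of the letter "T": 1 = ink, 0 = background.
GLYPH_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def add_salt_and_pepper(bitmap, flip_prob, rng):
    """Return a copy in which each pixel is flipped with probability flip_prob."""
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in bitmap
    ]


def shift(bitmap, dx, dy):
    """Translate the bitmap by (dx, dy), filling uncovered pixels with background."""
    height, width = len(bitmap), len(bitmap[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = bitmap[y][x]
    return out


def make_variants(bitmap, count, flip_prob=0.05, seed=0):
    """Generate `count` jittered, noisy copies of one glyph."""
    rng = random.Random(seed)
    variants = []
    for _ in range(count):
        dx = rng.choice([-1, 0, 1])
        dy = rng.choice([-1, 0, 1])
        variants.append(add_salt_and_pepper(shift(bitmap, dx, dy), flip_prob, rng))
    return variants


if __name__ == "__main__":
    for variant in make_variants(GLYPH_T, 3):
        print("\n".join("".join("#" if px else "." for px in row) for row in variant))
        print()
```

Seeding the generator keeps the synthetic set reproducible, which helps when comparing training runs.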
I have a program for generating rasterizations, from here:
http://linuxfromscratch.org/pipermai...ry/004748.html
look at links-2.1pre32-italic.patch.gz
You can run this patch on an empty directory to extract the relevant
files.
To add distortions, you may wish to experiment with pnmscale,
pnmrotate, pgmnoise and pnmshear. To be honest,
comp.unix.shell is also a good place for this kind of command-line
stuff, so I've cross-posted there as well. Maybe some ImageMagick
expert can weigh in on adding errors automatically.
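
To get glyph bitmaps into a form the Netpbm tools mentioned above can read, one option is to write them out as plain PGM (the "P2" text format). The sketch below assumes a 0/1 bitmap as input; to_plain_pgm and shear_rows are hypothetical helper names. The shear is only a crude nearest-pixel analogue of what a tool like pnmshear does, not a reimplementation of it.

```python
def to_plain_pgm(bitmap, maxval=255):
    """Serialize a 0/1 bitmap as plain PGM text (ink = 0 = black, background = maxval).

    The format asks for lines of at most 70 characters, so images much wider
    than a small glyph would need their rows wrapped.
    """
    height, width = len(bitmap), len(bitmap[0])
    lines = ["P2", f"{width} {height}", str(maxval)]
    for row in bitmap:
        lines.append(" ".join("0" if px else str(maxval) for px in row))
    return "\n".join(lines) + "\n"


def shear_rows(bitmap, slope):
    """Shift each row sideways in proportion to its height above the bottom row.

    A positive slope leans the glyph to the right; the output is widened so
    no ink is cut off.
    """
    height, width = len(bitmap), len(bitmap[0])
    max_shift = int(round(abs(slope) * (height - 1)))
    out = []
    for y, row in enumerate(bitmap):
        offset = int(round(abs(slope) * (height - 1 - y)))
        if slope < 0:
            offset = max_shift - offset
        out.append([0] * offset + list(row) + [0] * (max_shift - offset))
    return out


if __name__ == "__main__":
    bar = [[1], [1], [1]]  # a one-pixel-wide vertical stroke
    print(to_plain_pgm(shear_rows(bar, 1.0)), end="")
```

The resulting text can be saved to a .pgm file and then passed through the command-line tools for further distortion.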
-Ed
--
(You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)
/d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
-
Re: Text database?
Edward Rosten wrote:
> On Mar 17, 1:54 am, saneman wrote:
>> I am trying to implement a text recognition module. But I need some
>> characters to train the algorithms with. Does anyone know of a free
>> online database that contains characters?
>
> You'd probably be better off asking on sci.image.processing, where you
> were posting in the first place. That said, this is a reasonable place
> for the following point:
>
> You are presumably after a database of images of characters. You could
> synthesize one by rasterizing a number of fonts (automatically) and
> then adding various kinds of noise or various distortions.
>
> I have a program for generating rasterizations, from here:
>
> http://linuxfromscratch.org/pipermai...ry/004748.html
>
> look at links-2.1pre32-italic.patch.gz
>
> You can run this patch on an empty directory to extract the relevant
> files.
>
> To add distortions, you may wish to experiment with pnmscale,
> pnmrotate, pgmnoise and pnmshear. To be honest,
> comp.unix.shell is also a good place for this kind of command-line
> stuff, so I've cross-posted there as well. Maybe some ImageMagick
> expert can weigh in on adding errors automatically.
>
> -Ed
> --
> (You can't go wrong with psycho-rats.)(http://mi.eng.cam.ac.uk/~er258)
>
> /d{def}def/f{/Times s selectfont}d/s{11}d/r{roll}d f 2/m{moveto}d -1
> r 230 350 m 0 1 179{ 1 index show 88 rotate 4 mul 0 rmoveto}for/s 12
> d f pop 235 420 translate 0 0 moveto 1 2 scale show showpage
>
This here was just what I needed:
http://yann.lecun.com/exdb/mnist/
which is also used on the below pages:
http://www.bcl.hamilton.ie/~barak/te...hw1/index.html
http://www.iro.umontreal.ca/~lisa/tw...nistVariations
http://www.int.tu-darmstadt.de/mlu/index.html
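
For anyone who wants to read the MNIST files directly, here is a minimal sketch of a parser for the IDX layout they use: two zero bytes, a type code (0x08 for unsigned bytes), the number of dimensions, then one big-endian 32-bit size per dimension, followed by the raw data. The function names and the file name in the usage comment are assumptions for the example, and only the unsigned-byte case is handled.

```python
import gzip
import struct


def parse_idx(data):
    """Parse unsigned-byte IDX data into (dims, payload bytes)."""
    zero, dtype, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    header_end = 4 + 4 * ndim
    dims = struct.unpack(">" + "I" * ndim, data[4:header_end])
    expected = 1
    for size in dims:
        expected *= size
    payload = data[header_end:]
    if len(payload) != expected:
        raise ValueError("payload length does not match the header")
    return dims, payload


def split_images(dims, payload):
    """Split a (count, rows, cols) payload into per-image lists of pixel rows."""
    count, rows, cols = dims
    images = []
    for i in range(count):
        base = i * rows * cols
        images.append(
            [list(payload[base + r * cols: base + (r + 1) * cols]) for r in range(rows)]
        )
    return images


def load_idx_gz(path):
    """Read a gzip-compressed IDX file from disk (path supplied by the caller)."""
    with gzip.open(path, "rb") as handle:
        return parse_idx(handle.read())


# Usage, assuming the image file has been downloaded under this name:
#   dims, payload = load_idx_gz("train-images-idx3-ubyte.gz")
#   images = split_images(dims, payload)
```

Label files follow the same layout with a single dimension, so parse_idx should return the labels directly as the payload bytes.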