FuzzyOcr and PDF files - SpamAssassin
Hello all,
because some people insisted on it, I added an experimental feature to
FuzzyOcr that allows you to scan PDFs as if they were images.
The feature was implemented in the latest SVN revision and is of
course disabled by default.
Personally, I would not use this feature because the risk of false
positives on important documents is really high, but if you really
want to test this, here are the steps to enable it:
1. Get dependencies:
- A netpbm version that includes pstopnm
- Poppler (http://poppler.freedesktop.org/) for the pdfinfo and
2. Add those binaries as helper apps in FuzzyOcr.cf (see the .cf file
included in SVN)
3. Enable PDF scanning by setting focr_scan_pdfs 1 in the config.
Optionally, it is possible to skip PDFs which contain more than x
Currently, the parameters for pstopnm are hardcoded (-xsize=1000).
There might be better ways or values to translate PDFs into usable, but
not too big, pnm files.
If you know better ways, tell me. Also, I am missing some recent PDF
spam samples (which contain images), so if you could upload some
samples, that would also help.
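
For anyone who wants to try other pstopnm values outside of FuzzyOcr, here is a rough Python sketch of the conversion step. It is not the FuzzyOcr code (which is Perl), and several things in it are assumptions: that Poppler's pdftops is used to produce the PostScript, that both binaries are on the PATH, and the function names, which are made up for illustration. Only the -xsize=1000 value comes from the post above.

```python
import subprocess


def pdftops_cmd(pdf_path):
    # Poppler's pdftops writes PostScript to stdout when the output is "-".
    # Using pdftops here is an assumption, not confirmed FuzzyOcr behaviour.
    return ["pdftops", pdf_path, "-"]


def pstopnm_cmd(xsize=1000):
    # -xsize=1000 is the value currently hardcoded in FuzzyOcr.
    # -stdout sends the image data to standard output instead of files.
    return ["pstopnm", "-xsize=%d" % xsize, "-stdout"]


def pdf_to_pnm(pdf_path, xsize=1000):
    """Return raw PNM bytes for pdf_path (hypothetical helper).

    Needs pdftops and pstopnm installed; pstopnm in turn needs Ghostscript.
    """
    ps = subprocess.run(pdftops_cmd(pdf_path), check=True, capture_output=True)
    pnm = subprocess.run(
        pstopnm_cmd(xsize), input=ps.stdout, check=True, capture_output=True
    )
    return pnm.stdout
```

Calling pdf_to_pnm("sample.pdf", xsize=800) and comparing the output size and OCR results against the default of 1000 would be one way to look for a better value.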