Re: PDFText Plugin for PDF file scoring - not for PDF images (SpamAssassin)
Dallas Engelken wrote, on 14/07/07 12:17 AM:
> James MacLean wrote:
>> Hi folks,
>> Regrets if this is the wrong list.
>> Wanted to be able to score on text found in PDF files. Did not see
>> any obvious route, so made a plugin that calls XPDF's pdfinfo and
>> pdftotext to get the text that is then scored.
>> Sample local.cf could be :
>> pdftotext_cmd /usr/local/bin/pdftotext
>> pdfinfo_cmd /usr/local/bin/pdfinfo
>> body PDF_TO_TEXT
>> eval:check_pdftext("^Error","sex","drugs",'Title:\s+stock_tmp.pdf:4','Creator:\s+OpenOffice .org
>> Notice that a :4 gives a find of that regex 4 points.
>> Really don't know if this was the right road to follow, as I copied
>> the AntiVirus.pm and came up with this:
>> So far... it appears to work as expected and didn't take down a
>> pretty busy server.
>> Enjoy hearing any positive criticisms.
> I did this the other day with CAM::PDF, but Theo recommended this work
> should be done in the post_message_parse() plugin call. Then you
> could just write body rules against the text, uris would get checked
> by the URIDNSBL plugin, etc.
> Dallas Engelken
I did start with keeping it all in Perl, but when I tested my first spam
with the CAM::PDF utils, the result was just a bunch of space-separated
letters. Interested in getting something working, I switched to the
XPDF utils. Maybe getpdftext.pl is not a good example of how the modules
Where do I find information on hooking into post_message_parse()? I tried
grepping in the module area with no luck. I certainly agree it would be
better to get the text out and let everyone at it. I couldn't see how
to do that when I started down this road. At first I was even trying to see
if Exim could add another attachment to the e-mail containing the
output of pdftotext, but again, I wanted to get something running, so I
opted for what is there now.
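
For readers who want to see the idea from the sample local.cf in isolation, here is a rough sketch of the "regex:N" scoring convention applied to pdftotext output. It is written in Python purely for illustration and is not the plugin's code. The function names (parse_rule, score_text, pdf_to_text) are hypothetical. It also assumes that a rule without a ":N" suffix is worth 1 point, that each rule counts at most once per document, and that error output is scored along with the text (suggested by the "^Error" rule); the thread states none of these.

```python
import re
import subprocess


def parse_rule(rule):
    """Split 'regex:N' into (compiled regex, points).

    Assumption: a rule without a trailing ':N' is worth 1 point.
    """
    m = re.fullmatch(r"(.*):(\d+)", rule, flags=re.S)
    if m:
        return re.compile(m.group(1), re.M), int(m.group(2))
    return re.compile(rule, re.M), 1


def score_text(text, rules):
    """Sum the points of every rule whose regex matches somewhere in text.

    Assumption: each rule counts once, however many times it matches.
    """
    total = 0
    for rule in rules:
        pattern, points = parse_rule(rule)
        if pattern.search(text):
            total += points
    return total


def pdf_to_text(pdf_path, pdftotext_cmd="pdftotext"):
    """Run pdftotext and return whatever it printed.

    A text-file argument of '-' makes pdftotext write to stdout. Appending
    stderr is an assumption, so that a rule like '^Error' can fire.
    """
    result = subprocess.run(
        [pdftotext_cmd, pdf_path, "-"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Stand-in for combined pdfinfo/pdftotext output; not real data.
    sample = "Title:          stock_tmp.pdf\nCheap drugs here\n"
    rules = ["^Error", "sex", "drugs", r"Title:\s+stock_tmp.pdf:4"]
    print(score_text(sample, rules))  # 5: 1 for "drugs" + 4 for the Title rule
```

In a real setup, the text would come from something like pdf_to_text("/path/to/attachment.pdf", "/usr/local/bin/pdftotext"), and the total would be turned into a SpamAssassin score by the plugin.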