Daniel J McDonald wrote:
> On Wed, 2007-07-11 at 14:49 +0530, Suhas Ingale wrote:
>
>
>> Has anyone tried running PDFInfo plugin with 3.1.7 version?
>>
>>

>
> No, finally got it working yesterday evening using 3.2.1, but the
> initial results are underwhelming. Almost 100% overlap with
> TVD_SPACE_RATIO. Only one miss:
>


First of all, TVD_SPACE_RATIO is only available to those running v3.2,
whereas PDFInfo.pm can be used with any 3.x version.

Secondly, TVD_SPACE_RATIO can fire almost at will without a body.

$ echo "" | spamassassin
2.9 TVD_SPACE_RATIO BODY: TVD_SPACE_RATIO


Take the basic MIME part from a PDF stock spam. It looks similar to this:

--------------050701020003040207010006
Content-Type: text/plain; charset=iso-8859-2; format=flowed
Content-Transfer-Encoding: 7bit


--------------050701020003040207010006

and TVD_SPACE_RATIO fires on it just fine.

$ cat /root/sample2.txt | spamassassin -D 2>&1 | grep -i tvd
[26686] dbg: tvd: word [SPAM-8.3]- Re: warning_6042146166.pdf
[26686] dbg: tvd: len=39
[26686] dbg: tvd: spaces 2 nonspaces 37
[26686] dbg: tvd: pct = 5
[26686] dbg: tvd: final = 5
[26686] dbg: rules: ran eval rule TVD_SPACE_RATIO ======> got hit (1)


Change the MIME part to:

--------------050701020003040207010006
Content-Type: text/plain; charset=iso-8859-2; format=flowed
Content-Transfer-Encoding: 7bit

tvd no longer fires now

--------------050701020003040207010006

$ cat /root/sample2.txt | spamassassin -D 2>&1 | grep -i tvd
[26739] dbg: tvd: word [SPAM-8.3]- Re: warning_6042146166.pdf
[26739] dbg: tvd: len=39
[26739] dbg: tvd: spaces 2 nonspaces 37
[26739] dbg: tvd: pct = 5
[26739] dbg: tvd: word tvd no longer fires now
[26739] dbg: tvd: len=24
[26739] dbg: tvd: spaces 4 nonspaces 20
[26739] dbg: tvd: pct = 20
[26739] dbg: tvd: final = 20

... and 20 isn't between the bounds in tvd_vertical_words('0','10'), so the
rule no longer hits.
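
To make the arithmetic in the debug output concrete, here is a minimal sketch in Python. It is not the plugin's code. The formula (spaces divided by nonspaces, truncated to an integer) is only inferred from the three debug traces in this message, and the function names space_pct and would_hit are made up for illustration. How the plugin derives "final" from several "pct" values is not shown here, since the traces do not settle that.

```python
def space_pct(spaces: int, nonspaces: int) -> int:
    """Spaces as a truncated percentage of non-space characters.

    Inferred from the debug lines: 2/37 -> 5, 4/20 -> 20, 0/24 -> 0.
    """
    if nonspaces == 0:
        return 0  # assumption: treat a string with no non-space chars as 0
    return (100 * spaces) // nonspaces


def would_hit(pct: int, low: int = 0, high: int = 10) -> bool:
    # Assumption: both bounds are inclusive. The traces only confirm
    # that 0 and 5 hit and that 20 does not.
    return low <= pct <= high


# Counts copied from the "spaces N nonspaces M" debug lines.
for spaces, nonspaces in [(2, 37), (4, 20), (0, 24)]:
    pct = space_pct(spaces, nonspaces)
    print(f"spaces={spaces} nonspaces={nonspaces} pct={pct} hit={would_hit(pct)}")
```

Under these assumptions, any line of ordinary prose in the body pushes the percentage well past 10, which is why a single sentence is enough to dodge the rule.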

That's easy for a spammer to avoid. What's more, this rule has a good chance
of false positives. I emailed myself a PNG from webalizer without any body text.

# cat test | spamassassin -D 2>&1 |grep -i tvd
[27390] dbg: tvd: word hourly_usage_200706.png
[27390] dbg: tvd: len=24
[27390] dbg: tvd: spaces 0 nonspaces 24
[27390] dbg: tvd: pct = 0
[27390] dbg: tvd: final = 0
[27390] dbg: rules: ran eval rule TVD_SPACE_RATIO ======> got hit (1)

The fact is, email is "FTP for Dummies": people send attachments with no
body text all the time. IMHO, TVD_SPACE_RATIO may be a bit high at 2.9.

BTW, v0.3 of PDFInfo.pm is now posted, so those who already have it
might want to sync up.

# counts GMD_PDF_HORIZ 135s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_HORIZ 31s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_SQUARE 36s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_SQUARE 11s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_VERT 24s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_VERT 10s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_FUZZY1_T1 591s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_FUZZY1_T1 199s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_FUZZY2_T1 199s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_FUZZY2_T1 591s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_FUZZY2_T2 118s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_FUZZY2_T2 1s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_FUZZY2_T3 0s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_FUZZY2_T3 25s/0h of 5641 corpus (4064s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_FUZZY2_T4 105s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_FUZZY2_T4 28s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_AUTHOR_COLET 1s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_AUTHOR_COLET 2s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_AUTHOR_MOBILE 2s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_AUTHOR_MOBILE 55s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_AUTHOR_OOO 1s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_AUTHOR_OOO 118s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_AUTHOR_HPADMIN 105s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_AUTHOR_HPADMIN 27s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PRODUCER_GPL 227s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PRODUCER_GPL 85s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_PRODUCER_POWERPDF 0s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
# counts GMD_PRODUCER_POWERPDF 0s/0h of 5641 corpus (4064s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_STOX_M1 159s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_STOX_M1 40s/0h of 11773 corpus (10988s/785h AxB2-TRAPS) 07/11/07
# counts GMD_PDF_STOX_M2 223s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07
# counts GMD_PDF_STOX_M2 29s/0h of 10767 corpus (9986s/781h AxB2-TRAPS) 07/11/07
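
For anyone who wants to turn the "# counts" lines into hit rates, here is a rough Python sketch. The field layout (rule, spam hits, ham hits, corpus total, spam and ham sizes, corpus name) is my reading of the lines above, and the names parse_counts and spam_hit_pct are hypothetical, not part of any SpamAssassin tool.

```python
import re

# Assumed layout of one line:
# "# counts RULE <hits>s/<hits>h of <total> corpus (<spam>s/<ham>h NAME) DATE"
COUNTS_RE = re.compile(
    r"# counts (?P<rule>\S+) (?P<spam_hits>\d+)s/(?P<ham_hits>\d+)h "
    r"of (?P<total>\d+) corpus "
    r"\((?P<spam>\d+)s/(?P<ham>\d+)h (?P<corpus>\S+)\)"
)


def parse_counts(line: str):
    """Return a dict of the parsed fields, or None if the line does not match."""
    m = COUNTS_RE.search(line)
    if m is None:
        return None
    rec = {
        key: (val if key in ("rule", "corpus") else int(val))
        for key, val in m.groupdict().items()
    }
    # Share of the corpus's spam that the rule hit, as a percentage.
    rec["spam_hit_pct"] = (
        round(100 * rec["spam_hits"] / rec["spam"], 2) if rec["spam"] else 0.0
    )
    return rec


line = "# counts GMD_PDF_HORIZ 135s/0h of 6132 corpus (4555s/1577h AxB-MANUAL) 07/11/07"
rec = parse_counts(line)
print(rec["rule"], rec["corpus"], rec["spam_hit_pct"], rec["ham_hits"])
```

For the GMD_PDF_HORIZ line on the AxB-MANUAL corpus this gives a spam hit rate of about 2.96% with zero ham hits.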


--
Dallas Engelken
dallase@uribl.com
http://uribl.com