Joseph Brennan writes:

> /Dear .{0,12}(web ?mail|columbia\.edu)/i
>
> /Password.{0,10}\([\s\.\*\_]+\)/
>
> /you must reply to this email/i
>
> Reply-to =~ /\@live\.com/


I created a meta-rule out of these (with a score of 8), and then ran
spamassassin -D < phish to see how it worked. It matched the meta-rule
flawlessly, but the phish ended up with a score of only 5.4 because
BAYES_00 dragged it down. That surprised me, so I started to wonder
whether my bayes DB was poisoned.
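
A minimal sketch of how such a meta-rule might be written in local.cf.
The rule names are placeholders invented for illustration, and requiring
all four sub-rules to match is an assumption; the actual combining logic
is not shown in this message:

```
# Sub-rules start with "__" so they carry no score of their own.
body      __LOCAL_PHISH_DEAR    /Dear .{0,12}(web ?mail|columbia\.edu)/i
body      __LOCAL_PHISH_PASSWD  /Password.{0,10}\([\s\.\*\_]+\)/
body      __LOCAL_PHISH_REPLY   /you must reply to this email/i
header    __LOCAL_PHISH_LIVE    Reply-To =~ /\@live\.com/

# Assumed logic: fire only when every sub-rule matches.
meta      LOCAL_WEBMAIL_PHISH   (__LOCAL_PHISH_DEAR && __LOCAL_PHISH_PASSWD && __LOCAL_PHISH_REPLY && __LOCAL_PHISH_LIVE)
describe  LOCAL_WEBMAIL_PHISH   Webmail phish asking for a password by reply
score     LOCAL_WEBMAIL_PHISH   8.0
```

Running spamassassin --lint after editing the file should report any
syntax errors before the rule goes live.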

I ran some stats, and the results seem to indicate a healthy bayes
database (unless I am reading this wrong). A side note: it's
interesting that only 9% of our email is spam, which seems low,
but maybe clamav-milter+rbls are blocking the remaining 40%?

Email: 2379392 Autolearn: 1075396 AvgScore: -6.32 AvgScanTime: 5.96 sec
Spam: 227816 Autolearn: 114079 AvgScore: 14.75 AvgScanTime: 4.23 sec
Ham: 2151576 Autolearn: 961317 AvgScore: -8.56 AvgScanTime: 6.15 sec

Time Spent Running SA: 3941.26 hours
Time Spent Processing Spam: 267.76 hours
Time Spent Processing Ham: 3673.50 hours

TOP SPAM RULES FIRED
----------------------------------------------------------------------
RANK  RULE NAME                  COUNT  %OFMAIL  %OFSPAM   %OFHAM
----------------------------------------------------------------------
   1  HTML_MESSAGE              154522    54.03    67.83    52.57
   2  BAYES_99                  134531     6.09    59.05     0.48
   3  BOTNET                    133687     8.90    58.68     3.63
   4  RDNS_NONE                 102255    10.19    44.88     6.51
   5  URIBL_JP_SURBL             98879     4.94    43.40     0.87
   6  MIME_HTML_ONLY             87518     7.62    38.42     4.36
   7  URIBL_OB_SURBL             76624     3.98    33.63     0.84
   8  DCC_CHECK                  74600     8.51    32.75     5.94
   9  URIBL_AB_SURBL             59890     2.72    26.29     0.23
  10  URIBL_SC_SURBL             53911     2.51    23.66     0.27
  11  RCVD_IN_BL_SPAMCOP_NET     43120     2.43    18.93     0.68
  12  URIBL_WS_SURBL             38251     1.79    16.79     0.21
  13  URIBL_RHS_DOB              36565     2.17    16.05     0.70
  14  BAYES_50                   35322     3.93    15.50     2.71
  15  HTML_IMAGE_ONLY_16         33887     1.68    14.87     0.28
  16  HTML_SHORT_LINK_IMG_2      33118     1.56    14.54     0.19
  17  HTML_IMAGE_RATIO_02        32757     2.93    14.38     1.72
  18  URIBL_SBL                  30456     1.80    13.37     0.57
  19  RAZOR2_CHECK               27722     2.55    12.17     1.53
  20  RAZOR2_CF_RANGE_51_100     26856     2.41    11.79     1.41
----------------------------------------------------------------------

TOP HAM RULES FIRED
----------------------------------------------------------------------
RANK  RULE NAME                  COUNT  %OFMAIL  %OFSPAM   %OFHAM
----------------------------------------------------------------------
   1  BAYES_00                 2002969    84.67     5.15    93.09
   2  HTML_MESSAGE             1131073    54.03    67.83    52.57
   3  UNPARSEABLE_RELAY         760567    32.93    10.12    35.35
   4  DKIM_SIGNED               693328    29.74     6.26    32.22
   5  DKIM_VERIFIED             531590    22.67     3.38    24.71
   6  ALL_TRUSTED               173612     7.30     0.05     8.07
   7  USER_IN_WHITELIST         155704     6.54     0.00     7.24
   8  RDNS_NONE                 140127    10.19    44.88     6.51
   9  DCC_CHECK                 127844     8.51    32.75     5.94
  10  RCVD_IN_DNSWL_LOW         101863     4.31     0.34     4.73
  11  MIME_HTML_ONLY             93817     7.62    38.42     4.36
  12  RCVD_IN_DNSWL_MED          90038     3.81     0.31     4.18
  13  WHOIS_NETSOLPR             87575     3.72     0.38     4.07
  14  MIME_QP_LONG_LINE          82804     4.49    10.52     3.85
  15  BOTNET                     78052     8.90    58.68     3.63
  16  BAYES_50                   58286     3.93    15.50     2.71
  17  FUZZY_AMBIEN               53284     2.28     0.38     2.48
  18  SARE_SUB_ENC_UTF8          50533     2.14     0.17     2.35
  19  SARE_MILLIONSOF            42268     1.84     0.67     1.96
  20  FORGED_YAHOO_RCVD          38762     1.74     1.16     1.80
----------------------------------------------------------------------


Then I looked to see what bayes did with the message, but I do not
understand how to read the output. Can someone explain it to me and
give me an idea why BAYES_00 fired when we've been feeding every one of
these spams to bayes for training?

$ spamassassin -D bayes < phish
[9595] dbg: bayes: using username: @GLOBAL
[9595] dbg: bayes: database connection established
[9595] dbg: bayes: found bayes db version 3
[9595] dbg: bayes: Using userid: 4
[9595] dbg: bayes: corpus size: nspam = 6782956, nham = 15364321
[9595] dbg: bayes: header tokens for *p = "U*mayodayo D*3web.net D*net"
[9595] dbg: bayes: header tokens for *F = "U*mayodayo D*3web.net D*net"
[9595] dbg: bayes: header tokens for Reply-to = "U*s.team43 D*live.com
D*com"
[9595] dbg: bayes: header tokens for MIME-Version = ""
[9595] dbg: bayes: header tokens for *c = "/plain; charset=ISO-8859-1"
[9595] dbg: bayes: header tokens for Content-Transfer-Encoding = "8bit"
[9595] dbg: bayes: header tokens for X-Originating-IP = "196.207.0.227"
[9595] dbg: bayes: header tokens for To = ""
[9595] dbg: bayes: header tokens for X-Languages = " en"
[9595] dbg: bayes: header tokens for X-Languages-Length = " 1213"
[9595] dbg: bayes: header tokens for X-Spam-Relays-External = " [
ip=209.197.145.198 rdns=reef.cybersurf.com helo=reef.cybersurf.com
by=cat.cia.com ident= envfrom= intl=0 id=1Kw6iz-0002Li-Pg auth= msa=0 ]
[ ip=196.207.0.227 rdns=196-207-0-227.netcomng.com
helo=196-207-0-227.netcomng.com by=webmail.3web.com ident= envfrom=
intl=0 id= auth=HTTP msa=0 ] [ ip=196.207.0.227 rdns= helo= by= ident=
envfrom= intl=0 id= auth= msa=0 ]"
[9595] dbg: bayes: header tokens for X-Spam-Relays-Internal = " "
[9595] dbg: bayes: header tokens for *RT = " "
[9595] dbg: bayes: header tokens for *RU = " [ ip=209.197.145.198
rdns=reef.cybersurf.com helo=reef.cybersurf.com by=cat.cia.com ident=
envfrom= intl=0 id=1Kw6iz-0002Li-Pg auth= msa=0 ] [ ip=196.207.0.227
rdns=196-207-0-227.netcomng.com helo=196-207-0-227.netcomng.com
by=webmail.3web.com ident= envfrom= intl=0 id= auth=HTTP msa=0 ] [
ip=196.207.0.227 rdns= helo= by= ident= envfrom= intl=0 id= auth= msa=0
]"
[9595] dbg: bayes: header tokens for *r = " 196-207-0-227.netcomng.com
(196-207-0-227.netcomng.com [196.207.0 ip*196.207.0.227 ]) by
webmail.3web.com (IMP) HTTP ; "
[9595] dbg: bayes: header tokens for *r = " 196-207-0-227.netcomng.com
(196-207-0-227.netcomng.com [196.207.0 ip*196.207.0.227 ]) by
webmail.3web.com (IMP) HTTP ; apache by
reef.cybersurf.com local (Exim 4.44) id 1Kw6j0-0006W5-UJ; "
[9595] dbg: bayes: tok_get_all: token count: 142
[9595] dbg: bayes: token 'weekly' => 0.000135596068218096
[9595] dbg: bayes: token 'becomes' => 0.000298722931704609
[9595] dbg: bayes: token 'inbox' => 0.000343185200935573
[9595] dbg: bayes: token 'one's' => 0.000597114317425083
[9595] dbg: bayes: token 'folder' => 0.00064482620854974
[9595] dbg: bayes: token 'webmail' => 0.000671660424469413
[9595] dbg: bayes: token 'INBOX' => 0.000805791313030454
[9595] dbg: bayes: token 'Webmail' => 0.00100686213349969
[9595] dbg: bayes: token 'inboxes' => 0.00107385229540918
[9595] dbg: bayes: token 'SPACE' => 0.0011503920171062
[9595] dbg: bayes: token 'reset' => 0.00200996264009963
[9595] dbg: bayes: token 'oldest' => 0.00320874751491054
[9595] dbg: bayes: token 'SAVE' => 0.00400496277915633
[9595] dbg: bayes: token 'Bates' => 0.0156699029126214
[9595] dbg: bayes: token 'bates' => 0.0156699029126214
[9595] dbg: bayes: token 'current' => 0.0200447781112092
[9595] dbg: bayes: token 'H*r:IMP' => 0.0961561369397845
[9595] dbg: bayes: token 'notified' => 0.121287867011135
[9595] dbg: bayes: token 'Password' => 0.13640095340516
[9595] dbg: bayes: token 'HX-Spam-Relays-External:sk:webmail' => 0.1492193587257
[9595] dbg: bayes: token 'H*RU:sk:webmail' => 0.1492193587257
[9595] dbg: bayes: score = 1.83186799063151e-15

Any ideas would be very much appreciated! My goal is to stop these
phishers from getting their mail through, but even with a custom rule
set to a high score, they will still get through if BAYES_00 fires...

micah