On Jun 30, 2007, at 8:07 AM, Loren Wilton wrote:

>
>>> You have a bit of a chicken-and-egg problem at the start, until
>>> some learning takes place in the system.

>
> Two possibilities. The rules exist and have scores. Assume they
> are maintained, for whatever reason.
>
> 1. Until Bayes has enough info to kick in, classification is
> done by the scores. Then when Bayes kicks in the scores turn off
> (insofar as adding to the message score; they might still show up as
> tokens in the message that Bayes will process).
>
> 2. Divide all the scores by 10 or 20. Then leave them on.
> Pretty soon Bayes will override almost any reasonable score
> combination.
>
> BTW, while ham rules are possible, SA has almost no ham rules;
> perhaps two or so. Spammers long ago found they could write their
> spams to match ham rules and thus bypass SA. Thus, no ham rules,
> no spammer workarounds. Of course personal or site-specific ham
> rules will generally still work, since they will not be public
> knowledge and spammers won't be able to target them.
>
> I suspect you can find all rule names in PerMsgStatus. However the
> latest SA versions have implemented a 'check' plugin that actually
> runs the rules and accumulates the score. The rule running was
> moved to a plugin so that people could, at least in theory, change
> the order or the way that rules are run. It sounds like that is
> what you want to do, so a modified Check plugin may well be the way
> to go.
>
> I don't understand though why you are interested in the names of
> all rules run; I don't see what it buys you. Currently ALL rules
> are run, unless short-circuiting is in effect, and by default it
> mostly isn't. In any case, if a rule doesn't hit on a message, the
> name of the rule is probably irrelevant. It might have missed
> because the message is ham, but it even more likely missed because
> it simply targets a different kind of spam. So assuming that
> "rules not hit" === "good tokens" is unlikely to be the case.


But in Bayes, you can't score on the absence of a token. Just
because the email I'm writing does not contain a certain word does
not mean it is "good". The listing of ALL rules run with a binary
YES/NO indication applied to each one would permit you to accrue
points for both the presence of and lack of a specific rule. This
would also let you start applying pro-ham rules.
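
A minimal sketch of the YES/NO idea, in Python: every rule that was run
becomes a token whether or not it hit. The function name rule_tokens and
the rule names are made up for illustration; nothing here is
SpamAssassin's own API.

```python
def rule_tokens(all_rules, hit_rules):
    """Return one token per rule run: NAME=YES if it hit, NAME=NO if not."""
    hits = set(hit_rules)
    return [f"{name}={'YES' if name in hits else 'NO'}" for name in sorted(all_rules)]


# Hypothetical rule names, for illustration only.
tokens = rule_tokens(["SOME_TEST", "OTHER_TEST", "THIRD_TEST"], ["OTHER_TEST"])
print(tokens)
```

The resulting tokens would then be handed to the Bayes engine alongside
the ordinary word tokens.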

But you may have a point that "rules not hit" is sufficient for
determining "good tokens" in the same manner that "viagra" is bad and
not having "viagra" permits the email to score on the other tokens
available. To further prove this out, the practice of spammers (who
I'm sure are reading this list) is to try to apply enough skew to the
Bayes to push it low and skip enough rules to keep from scoring any
hits -- the net effect is to come up with Unsure email (I work in a
ternary system). Under pure Bayesian statistics, the cutoff points
for ham/spam tend to move pretty quickly from a nominal 0.3/0.7 to
0.3/0.5 giving the entire probability range of 0.500 to 1.00 over to
Spam and 0.00 to 0.300 (or even lower) to specifically Ham with a
belt of uncertainty in the middle.
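
As a rough sketch of that ternary split, in Python: the cutoffs default
to the 0.3/0.5 values mentioned above, and treating both boundaries as
inclusive is my assumption, not something either engine prescribes.

```python
def classify(p_spam, ham_cutoff=0.3, spam_cutoff=0.5):
    """Map a spam probability onto Ham / Unsure / Spam.

    Boundary handling (inclusive on both ends) is an assumption.
    """
    if p_spam >= spam_cutoff:
        return "Spam"
    if p_spam <= ham_cutoff:
        return "Ham"
    return "Unsure"


for p in (0.10, 0.40, 0.90):
    print(p, classify(p))
```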

And after typing all this I'm thinking you might be right. But part
of this approach is to run all these rules in YES/NO fashion and see
if the probability is significant. For example: If I tested for
SOME_TEST=NO and found it was scoring a probability of ~0.500 then
it's indisputable that you are right.
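
One way to run that check, sketched in Python. The estimate used here
(spam rate over spam rate plus ham rate) is one common per-token
formula, not necessarily what my engine or SpamAssassin's computes, and
the counts are invented purely to show the arithmetic.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token occurs in each corpus."""
    if n_spam == 0 or n_ham == 0:
        return 0.5  # no training data yet: treat the token as neutral
    spam_rate = spam_count / n_spam
    ham_rate = ham_count / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: neutral
    return spam_rate / (spam_rate + ham_rate)


def is_uninformative(p, tolerance=0.05):
    """True if the probability sits close enough to 0.5 to carry no signal."""
    return abs(p - 0.5) <= tolerance


# Invented counts: SOME_TEST=NO seen in 900 of 1000 spam and 920 of 1000 ham.
p = token_spam_probability(900, 920, 1000, 1000)
print(round(p, 3), is_uninformative(p))
```

If most NAME=NO tokens come out uninformative like this, they add
nothing beyond what the NAME=YES tokens already provide.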

The only exception to this would be some kind of AWL factor
rather than a hard-coded AWL override. A creative regex can handle
this by capturing the email address in From: and letting Bayes assign
a very strong probability to it: not a whitelist, but an indication.
I'm not sure, though; I haven't considered it closely, as I never
found AWL to be really useful compared with the impact of Bayes on
headers.
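
A hedged sketch of the regex idea, in Python. The pattern is a
deliberately loose approximation of an email address (it ignores folded
headers and the full RFC 5322 grammar), and the FROM_ADDR= token prefix
is a name I made up.

```python
import re

# Loose address pattern: good enough for tokenising, not for validation.
FROM_RE = re.compile(
    r"^From:.*?([\w.+-]+@[\w-]+(?:\.[\w-]+)+)",
    re.IGNORECASE | re.MULTILINE,
)


def from_token(raw_headers):
    """Return a FROM_ADDR=<address> token, or None if no address is found."""
    match = FROM_RE.search(raw_headers)
    return f"FROM_ADDR={match.group(1).lower()}" if match else None


# Hypothetical headers, for illustration only.
headers = "Subject: hello\nFrom: Some One <Someone@Example.COM>\nTo: me@example.org\n"
print(from_token(headers))
```

A sender who regularly sends ham would end up with a token whose
probability sits near zero, which acts as an indication rather than an
override.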

As for start-up effectiveness, there are a variety of ways to handle
it. I consider this similar to installing Linux: it might be harder
than buying a computer with Windows preinstalled, but the long-term
benefits outweigh the short-term gains, and how often do you really
install Linux or SpamAssassin? You can always seed the data from
captured emails.

Thank you for the information on Check. I will look into that and
see if I can come up with something that will do the trick. I have
to confess I'm coming into this backwards: I wrote a Bayesian spam
filter and then started looking into SpamAssassin, so my Bayes
statistical engine is not SpamAssassin's. But the results will be the
same for either approach (I hope) if you simply push the rules in as
metadata tokens to the statistical process.