On Jun 30, 2007, at 2:55 PM, Bart Schaefer wrote:

>
> On 6/29/07, Tom Allison wrote:
>>
>> The thought I had, and have been working on for a while, is changing
>> how the scoring is done. Rather than making Bayes a part of the
>> scoring process, make the scoring process a part of the Bayes
>> statistical Engine. As an example you would simply feed into the
>> Bayesian process, as tokens, the indications of scoring hits (binary
>> yes/no) would be examined next to the other tokens in the message.

>
> There are a few problems with this.
>
> (1) It assumes that Bayesian (or similar) classification is more
> accurate than SA's scoring system. Either that, or you're willing to
> give up accuracy in the name of removing all those confusing knobs you
> don't want to touch, but it would seem to me to be better to have the
> knobs and just not touch them.
>

I know that without SA you can get >99.9% accuracy from pure
Bayesian classification.
But there are specific non-Bayes signals that SpamAssassin rules make
visible which a typical Bayes process can't catch (well, or at all).
The whole issue of "knobs" becomes moot under a statistical approach,
because each user's own training determines the real importance of
each particular rule hit.
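
As a rough sketch of what I have in mind (Python here; the "SA::"
prefix and the helper name are just made up for illustration), the
rule hits simply become extra tokens:

    # Sketch: treat each SpamAssassin rule hit as one more Bayes token, so the
    # statistical engine learns, per user, how strong an indicator each rule is.
    def tokens_for_message(body_words, rule_hits):
        """Ordinary word tokens plus one synthetic token per rule that fired."""
        tokens = list(body_words)
        for rule in rule_hits:              # e.g. ["RCVD_IN_XBL", "HTML_MESSAGE"]
            tokens.append("SA::" + rule)    # binary: the rule either fired or not
        return tokens

During training those synthetic tokens go into the same ham/spam
counts as any word, so a rule that mostly fires on a given user's ham
ends up looking hammy for that user, without anyone touching a knob.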

> (2) For many SA rules you would be, in effect, double-counting some
> tokens. An SA scoring rule that matches a phrase, for example, is
> effectively matching a collection of tokens that are also being fed
> individually to the Bayes engine. In theory, you should not
> second-guess the system by passing such compound tokens to Bayes;
> instead it should be allowed to learn what combinations of tokens are
> meaningful when they appear together.


Bayes does not match phrases, only individual words. At least that
is what most Bayes filters do.
There are some approaches that do use multiple words, but not a
free-form "phrase". Therefore I think the overlap between Bayes
tokens and SpamAssassin rules is going to be small.
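
To make the distinction concrete, a typical single-word tokenizer
next to a two-word variant might look something like this
(illustrative only; real filters differ in the details):

    import re

    def word_tokens(text):
        """What most Bayes filters do: split into individual words."""
        return re.findall(r"[a-z0-9$'!.-]+", text.lower())

    def pair_tokens(text):
        """A multi-word variant: adjacent word pairs, still not free-form phrases."""
        words = word_tokens(text)
        return [a + " " + b for a, b in zip(words, words[1:])]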

> (It might be worthwhile, though, to e.g. add tokens that are not
> otherwise present in the message, such as for the results of network
> tests.)


This is what I'm interested in, and what I mentioned in the first
paragraph. There are a lot of things you can do with SpamAssassin
that Bayes alone will never do. It is exactly this kind of work that
I think would be most interesting to pursue.
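
For example, the results of network tests could be injected as
synthetic tokens that the message body itself never contains (the
token names and the DNSBL/SPF inputs below are just placeholders):

    def network_test_tokens(dnsbl_hits, spf_result=None):
        """Turn network-test outcomes into tokens Bayes would otherwise never see."""
        tokens = ["NET::DNSBL::" + bl for bl in dnsbl_hits]   # e.g. "zen.spamhaus.org"
        if spf_result is not None:
            tokens.append("NET::SPF::" + spf_result)          # e.g. "fail" or "pass"
        return tokens

Appended to the word tokens before training and classification, these
let the statistical engine weigh "listed in a blocklist" like any
other piece of evidence.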

> (3) It introduces a bootstrapping problem, as has already been noted.
> Everyone has to train the engine and re-train it when new rules are
> developed.
>
> I've thought of a few more, but they all have to do with the benefits
> of having all those "knobs" and if you've already adopted the basic
> premise that they should be removed there doesn't seem to be any
> reason to argue that part.
>
> To summarize my opinion: If what you want is to have a Bayesian-type
> engine make all the decisions, then you should install a Bayesian
> engine and work on ways to feed it the right tokens; you should not
> install SpamAssassin and then work on ways to remove the scoring.


This approach makes sense. However, it would not make sense to try
to reinvent the enormous amount of useful work that has come out of
SpamAssassin; that alone would take a very long time.
SpamAssassin has some really great ways of finding the right tokens.
Why would I consider trying to duplicate all that effort?
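
For instance, rather than re-implementing any of those tests, the
rule hits SpamAssassin already reports could be harvested from its
output and fed straight in as tokens. A minimal sketch, assuming an
already-unfolded X-Spam-Status header (the exact layout of that
header varies with version and configuration):

    import re

    def rule_hits_from_header(x_spam_status):
        """Pull the rule names out of the tests= list of an X-Spam-Status header."""
        match = re.search(r"tests=(\S+)", x_spam_status)
        if not match:
            return []
        return [name for name in match.group(1).split(",") if name]

    # rule_hits_from_header("Yes, score=7.2 required=5.0 "
    #                       "tests=BAYES_50,HTML_MESSAGE,RCVD_IN_XBL autolearn=no")
    # -> ["BAYES_50", "HTML_MESSAGE", "RCVD_IN_XBL"]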