Re: Suggestion to developers - SpamAssassin
Matt's generally nailed it.
I would say that it should be easy enough to write a plugin which reorders
rule priorities into a desired order, then implements the
"have_shortcircuited" plugin hook to return 1 at the desired point... so
if anyone feels like trying it out to see if they can make an
auto-shortcircuiting plugin which outperforms base SpamAssassin over a
mixed corpus of 50:50 nonspam and spam, go for it.
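
For anyone who wants to play with the idea before touching SA itself, here is a rough toy model in Python. To be clear, this is not SpamAssassin's plugin API: the `Rule` class, the `scan` function, the rule names and the scores are all made up for illustration, and the early exit only imitates what a "have_shortcircuited"-style check might do. The 5.0 threshold is just a parameter here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int                  # lower number runs earlier
    score: float                   # points added when the rule hits
    test: Callable[[str], bool]    # hypothetical stand-in for a real rule


def scan(message: str, rules: List[Rule], threshold: float = 5.0,
         shortcircuit: bool = True) -> Tuple[float, int]:
    """Run rules in priority order; return (score, number of rules run)."""
    total = 0.0
    rules_run = 0
    for rule in sorted(rules, key=lambda r: r.priority):
        rules_run += 1
        if rule.test(message):
            total += rule.score
        # Assumed shortcircuit condition: stop once the running total
        # reaches the threshold. Any rules not yet run are skipped.
        if shortcircuit and total >= threshold:
            break
    return total, rules_run


# Invented rules: one negative rule at a very early priority, then positives.
RULES = [
    Rule("WHITELISTED", -100, -10.0, lambda m: "trusted-sender" in m),
    Rule("PILLS", 0, 3.0, lambda m: "pills" in m),
    Rule("CLICK_HERE", 10, 2.5, lambda m: "click here" in m),
    Rule("ALL_CAPS", 20, 1.0, lambda m: m.isupper()),
]

print(scan("cheap pills, click here", RULES))                      # (5.5, 3)
print(scan("cheap pills, click here", RULES, shortcircuit=False))  # (5.5, 4)
```

Note that the negative rule only protects the message because it has the earliest priority. Give it a late priority and the scan can bail out before it ever runs, which is the false positive problem Matt describes below. The toy also says nothing about scan time; it only counts rules.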
Matt Kettler writes:
> Crocomoth wrote:
> > Matt Kettler-3 wrote:
> >>> 1. Using this method, the admin must understand that the fate of every
> >>> message (for all users) will depend on a single rule.
> >> Not if you set it up properly.. You can have multiple rules run with a
> >> very early priority (low number), then have another one run with a
> >> semi-early priority which does shortcircuiting. All of the "very early"
> >> rules will be involved in the decision to shortcircuit or not.
> > Yes, but low-numbered rules may not generate any points and the decision may
> > depend on one rule anyway. This does not change anything. And what is
> > more (see (2) with which you have agreed), in the default configuration, this
> > will be bayes which generates only 3.5 points (not taking into account
> > white/black lists because they will not be set up properly in most cases).
> > And, I think, the number of people not wishing to reorder standard rules will
> > be much greater than the number of "semi-professional" admins.
> True, but your automated method based on sorting them on "weight" would
> pretty much grind spamassassin to a screeching halt by increasing the
> average scan time due to forcing multiple passes through the message.
> Not to mention false positive problems if negative-scoring rules end up
> being considered "heavy" and don't get run.
> Your idea essentially ruins any benefits of memory caching that
> SpamAssassin currently exploits. Right now, rules are run in groups
> based on what part of the message they need. This lends speed to
> spamassassin by allowing that portion of the message to already be in
> cache for all but the first rule in the group.
> If you start jumping around all over the message for different rules,
> the processor memory cache quickly becomes full and pushes out parts
> that you're going to be looking at again. If you keep going
> back-and-forth header, body, header, body, header, body.. you wind up
> going out to RAM quite often, and that's painfully slow. (I don't care
> what high-speed dual-channel DDR2 memory setup you have, it's abysmally
> slow from the processor's perspective, generally 20 times slower than
> cache is.)
> Sure, some messages will bail out faster, but most messages will take
> much longer to scan. How is that better?
> I don't debate that the basic idea of having SA do this "automagically"
> would be a great thing. However, the reality of doing it efficiently is
> much trickier than you think.
> At one point, one idea was to run all the negative scoring rules, and
> then run the positive scoring ones, and bail out if the score went over
> the spam threshold during the positive phase.
> The end result of that test was abysmally slow, due to having to scan
> the message in two passes (negative header, negative body, positive
> header, positive body).
> > Sort order may be: negative rules, then sorted positive common rules. Any
> > user-defined rules, if they exist, should be checked after the negative ones
> > and before the positives. Of course, sorting should be performed once, at
> > load time.
> Tested, as mentioned above. Resulted in horrible performance due to
> > Or, such a cut-off may work without any sorting; this is optional. Standard
> > priorities could be enough, if they are set up.
> I'd agree there. SA could exploit priorities better in the default
> config, but this kind of thing needs to be done very carefully to avoid
> thrashing the processor cache. Any simple "sort by.." is going to result
> in terrible performance.
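
To make the ordering argument concrete, here is a small Python sketch that counts how often consecutive rules touch a different part of the message under three orderings: grouped by section, sorted by "weight", and negative-then-positive. The rule names, sections and scores are invented, and the switch count is only a crude proxy for the cache behaviour Matt describes, not a measurement of it.

```python
from typing import List, Tuple

# Each rule is (name, section it reads, score). All values are invented.
RuleT = Tuple[str, str, float]

RULES: List[RuleT] = [
    ("H_FORGED", "header", 2.0),
    ("B_PILLS", "body", 3.0),
    ("H_WHITELIST", "header", -5.0),
    ("B_CLICK", "body", 1.5),
    ("H_NO_DATE", "header", 1.0),
    ("B_SIGNED", "body", -1.0),
]


def section_switches(order: List[RuleT]) -> int:
    """Count how often consecutive rules read a different message section."""
    sections = [section for _, section, _ in order]
    return sum(1 for prev, cur in zip(sections, sections[1:]) if prev != cur)


# Grouped by section: all header rules, then all body rules.
grouped = sorted(RULES, key=lambda r: r[1] != "header")

# "Sort by weight": largest absolute score first, ignoring the section.
by_weight = sorted(RULES, key=lambda r: -abs(r[2]))

# Negative rules first, then positive ones, each phase grouped by section.
neg_then_pos = sorted(RULES, key=lambda r: (r[2] > 0, r[1] != "header"))

print(section_switches(grouped))       # 1
print(section_switches(by_weight))     # 5
print(section_switches(neg_then_pos))  # 3
```

With this made-up rule set, the grouped order changes section once, the weight-sorted order changes on every rule, and the negative-then-positive order walks header, body, header, body, which is the two-pass pattern mentioned above. Real rule sets and real hardware will behave differently, so treat the numbers as illustration only.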