Really force a Bayes expire - SpamAssassin


  1. Really force a Bayes expire

    Many of my Bayes db's (not SQL) can't be expired anymore because the
    --force-expire run can't find a delta that is big enough or so. I tried
    with several settings for max_size that would either expire only a few or
    most of the db and some steps in-between. Always the same problem. Is
    there a way or script that can *really* enforce an expire?
    I don't want to throw them away as they are working quite fine. But I also
    don't want them to grow indefinitely (now at around 3-4 million tokens).
    Years ago I used a very difficult way to expire such a problem db by
    dumping it and then removing the "correct" data with a script and then
    importing it again. And after that it would expire like normal again.
    However, this was quite tricky and I think it wouldn't work with the
    current format anymore.

    Kai

    --
    Kai Schätzl, Berlin, Germany
    Get your web at Conactive Internet Services: http://www.conactive.com


  2. Re: Really force a Bayes expire



    > Many of my Bayes db's (not SQL) can't be expired anymore because the
    > --force-expire run can't find a delta that is big enough or so. I tried
    > with several settings for max_size that would either expire only a few or
    > most of the db and some steps in-between. Always the same problem. Is
    > there a way or script that can *really* enforce an expire?
    > I don't want to throw them away as they are working quite fine. But I also
    > don't want them to grow indefinitely (now at around 3-4 million tokens).


    If you have 3-4 million tokens you need SQL (MySQL, with the tables changed
    to InnoDB). I have a script that creates a new Bayesian token table and
    copies only the newest tokens into it.

    The full script is available if you want it, but basically you stop spamd,
    then

    In mysql:
    create table bayes_token_new like bayes_token;
    insert into bayes_token_new (select * from bayes_token order by id, atime
    desc limit 1000000);

    (I am using amavisd. It has only one id, so I don't know what several ids
    would do to SA. In amavisd I didn't use the id key, but you might need it
    to trigger the index. Also, you could get clever and only load
    'interesting' tokens, something like 'where ham_count > 8 or spam_count > 8'.)

    That would give you 1MM tokens (note: it will take a long time, because
    atime isn't a direct key).
    You could limit by id (select * from bayes_vars to see the ids), something
    like:

    insert into bayes_token_new (select * from bayes_token where id=1 order by
    atime desc);

    Don't forget to update bayes_vars, rename the table and restart spamd.
    The statement above might leave more than 1MM tokens, so force a Bayes
    expire afterwards just in case (with bayes_expiry_max_db_size set below 1MM).

    YMMV, suggestion void where prohibited or taxed.


    --
    Michael Scheidell, CTO
    >|SECNAP Network Security

    Winner 2008 Network Products Guide Hot Companies
    FreeBSD SpamAssassin Ports maintainer



  3. Re: Really force a Bayes expire

    Kai Schaetzl wrote:
    > Many of my Bayes db's (not SQL) can't be expired anymore because the
    > --force-expire run can't find a delta that is big enough or so. I tried
    > with several settings for max_size that would either expire only a few or
    > most of the db and some steps in-between. Always the same problem. Is
    > there a way or script that can *really* enforce an expire?
    >

    Not really. If it can't find a delta, then SA can't find any way of
    doing a sensible expiry, and there's not likely any "good" expiry that
    can be performed.

    First, understand that bayes expiry works by picking a time, and
    dropping everything that hasn't been used since that time. Essentially
    discarding all the stale data, and keeping the fresh stuff. I forget the
    exact threshold, but I think anything that's more recent than 12 hours
    is always kept by the algorithm, to avoid flushing out stuff that's
    clearly being used regularly.

    When no delta can be found, generally this means you've got a bayes DB
    that's got a large set of data, all at more-or-less the same timestamp,
    and very little other data.

    If it were to pick an expiry date that winds up dropping the large
    chunk, your bayes disables itself because there's no longer enough data,
    making it more-or-less the same as deleting the bayes DB entirely. On
    the other hand there's no data older than the large chunk left, so any
    other selection winds up expiring nothing.
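
The all-or-nothing situation described above can be illustrated with a small Python sketch. It is a simplification, not SpamAssassin's actual code: the doubling series of candidate deltas starting at 12 hours, the `goal` and `min_keep` parameters, and the function names are all assumptions made for illustration.

```python
def reduction_counts(atimes, newest, base=12 * 3600, steps=10):
    """For each candidate delta, count the tokens last used before the cutoff.

    Assumption: candidate deltas double from 12 hours upwards.
    """
    table = []
    for n in range(steps):
        delta = base * 2 ** n
        cutoff = newest - delta
        removed = sum(1 for t in atimes if t < cutoff)
        table.append((delta, removed))
    return table


def pick_delta(atimes, newest, goal, min_keep):
    """Return the (delta, removed) pair closest to the goal, or None.

    A delta is rejected if it removes nothing or if it would leave fewer
    than min_keep tokens (hypothetical stand-in for "not enough data").
    """
    best = None
    for delta, removed in reduction_counts(atimes, newest):
        kept = len(atimes) - removed
        if removed == 0 or kept < min_keep:
            continue
        if best is None or abs(removed - goal) < abs(best[1] - goal):
            best = (delta, removed)
    return best


NEWEST = 10_000_000  # arbitrary "now" timestamp for the demo

# One big chunk of tokens with the same timestamp (three days old),
# plus a handful of fresh tokens.
clumped = [NEWEST - 3 * 86400] * 1000 + [NEWEST] * 50

# Tokens spread evenly, one per hour, over roughly six weeks.
spread = [NEWEST - i * 3600 for i in range(1000)]

# Every delta either removes the whole chunk or nothing at all.
print(pick_delta(clumped, NEWEST, goal=500, min_keep=100))
print(pick_delta(spread, NEWEST, goal=500, min_keep=100))
```

With the clumped data every candidate either drops the whole chunk or removes nothing, so no delta qualifies. With the evenly spread data an intermediate delta does.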

    One cause of this is one big blob of hand training that created the
    "chunk" in the dates. In this situation, SA will eventually drop the
    chunk, but only after there's enough recently-used tokens to
    differentiate it.

    > I don't want to throw them away as they are working quite fine. But I also
    > don't want them to grow indefinitely (now at around 3-4 million tokens).
    >

    Hmm, do you have a high mail volume? Another possibility is that a large,
    diverse volume of mail creates a broad set of constantly fresh tokens,
    all under the 12 hour threshold. On the plus side, this essentially
    makes SA automatically scale the bayes DB to fit your mail volume. It
    also shouldn't grow indefinitely, as any tokens that aren't being used
    regularly will wind up being expired after they go unused for a day or
    so. On the down side, your bayes DB can be large.

    It might be worth looking at a force expire with debug on (sa-learn -D
    --force-expire). The time ranges and their respective reduction counts
    should show which of these scenarios is happening to you.

    Can you post the reduction count table?

    > Years ago I used a very difficult way to expire such a problem db by
    > dumping it and then removing the "correct" data with a script and then
    > importing it again. And after that it would expire like normal again.
    > However, this was quite tricky and I think it wouldn't work with the
    > current format anymore.
    >
    >



  4. Re: Really force a Bayes expire

    Sorry for the belated answer. I'm currently on vacation and don't check
    the mailing lists regularly. When I have a bit of time I'll collect the
    data from the -D run and also from the magic dump and post it.

    Kai

    --
    Kai Schätzl, Berlin, Germany
    Get your web at Conactive Internet Services: http://www.conactive.com


  5. Re: Really force a Bayes expire

    Thanks for the tips, I'll have a look at them once I'm back in office
    which will take another four weeks. Quite a while back I tested converting
    to SQL, but never actually put the results into production.

    AFAIR, with SQL I do the expire myself, SA won't do it, right?

    Kai

    --
    Kai Schätzl, Berlin, Germany
    Get your web at Conactive Internet Services: http://www.conactive.com


  6. Re: Really force a Bayes expire


    > From: Kai Schaetzl
    > Date: Fri, 29 Aug 2008 15:01:25 +0200
    > Subject: Re: Really force a Bayes expire
    >
    > Thanks for the tips, I'll have a look at them once I'm back in office
    > which will take another four weeks. Quite a while back I tested converting
    > to SQL, but never actually put the results into production.
    >
    > AFAIR, with SQL I do the expire myself, SA won't do it, right?


    No, sa-learn --force-expire works fine.
    --
    Michael Scheidell, CTO
    >|SECNAP Network Security

    Winner 2008 Network Products Guide Hot Companies
    FreeBSD SpamAssassin Ports maintainer



  7. Re: Really force a Bayes expire

    Michael Scheidell writes:

    >> From: Kai Schaetzl
    >> AFAIR, with SQL I do the expire myself, SA won't do it, right?

    >
    > No, sa-learn --force-expire works fine.


    As does the 'normal' Bayes expiry mechanism of triggering (or
    attempting) an expire when the number of tokens reaches the threshold.



  8. Re: Really force a Bayes expire

    Graham Murray wrote on Fri, 29 Aug 2008 19:42:34 +0100:

    > As does the 'normal' Bayes expiry mechanism of triggering (or
    > attempting) an expire when the number of tokens reaches the threshold.


    Coming back to this old thread. Unfortunately, the expiry uses the same
    algorithm for all Bayes storage engines. So, if it fails on dbm it also
    fails on SQL. I may have misunderstood, but my understanding from the
    postings on this list over the past few years was that the expiry
    algorithm for the SQL store would be more "intelligent" and thus succeed
    where it doesn't succeed on dbm. That's definitely not true. It's the same
    and fails the same ;-)
    After conversion I had to extract the latest records, up to the number I
    wanted to keep, into a new table, replace the old bayes_token table with
    it, flush bayes_seen and adjust bayes_vars accordingly. As I'm now way
    under my normal expiry limit I don't know whether the normal expiry will
    work now, but I hope so. If not, it's definitely easier to "manually"
    expire the Bayes database once it's in SQL.
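
The manual steps described above (copy the newest tokens into a new table, swap it in, flush bayes_seen, adjust bayes_vars) could look roughly like the following Python sketch. It runs against an in-memory SQLite database as a stand-in for MySQL. The trimmed-down table layouts, the `demo_db` and `manual_expire` helpers, and the choice of which bayes_vars columns to update are assumptions for illustration, not the real SpamAssassin schema or the script used in this thread.

```python
import sqlite3


def demo_db(n_tokens):
    """Build a tiny stand-in database with simplified, assumed table layouts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bayes_token (id INTEGER, token TEXT, spam_count INTEGER,
                                  ham_count INTEGER, atime INTEGER);
        CREATE TABLE bayes_seen (id INTEGER, msgid TEXT, flag TEXT);
        CREATE TABLE bayes_vars (id INTEGER, token_count INTEGER,
                                 oldest_token_age INTEGER,
                                 newest_token_age INTEGER);
    """)
    conn.executemany(
        "INSERT INTO bayes_token VALUES (1, ?, 1, 0, ?)",
        [(f"tok{i}", 1000 + i) for i in range(n_tokens)],
    )
    conn.execute("INSERT INTO bayes_seen VALUES (1, 'msg@example', 's')")
    conn.execute(
        "INSERT INTO bayes_vars VALUES (1, ?, 1000, ?)",
        (n_tokens, 1000 + n_tokens - 1),
    )
    conn.commit()
    return conn


def manual_expire(conn, user_id, keep):
    """Keep only the `keep` most recently used tokens of one user id.

    Assumes the table holds a single id (as in the amavisd setup mentioned
    in the thread); tokens of any other id would be lost by the table swap.
    Stop the scanner before running anything like this on a live database.
    """
    cur = conn.cursor()
    # Empty copy of the token table (SQLite does not copy keys or indexes).
    cur.execute("CREATE TABLE bayes_token_new AS SELECT * FROM bayes_token WHERE 0")
    # Copy the newest tokens, ordered by last access time.
    cur.execute(
        "INSERT INTO bayes_token_new "
        "SELECT * FROM bayes_token WHERE id = ? ORDER BY atime DESC LIMIT ?",
        (user_id, keep),
    )
    # Swap the new table in place of the old one.
    cur.execute("DROP TABLE bayes_token")
    cur.execute("ALTER TABLE bayes_token_new RENAME TO bayes_token")
    # Flush the seen-message list and bring the bookkeeping row up to date.
    cur.execute("DELETE FROM bayes_seen WHERE id = ?", (user_id,))
    cur.execute(
        "UPDATE bayes_vars SET "
        "token_count = (SELECT COUNT(*) FROM bayes_token WHERE id = ?), "
        "oldest_token_age = (SELECT MIN(atime) FROM bayes_token WHERE id = ?), "
        "newest_token_age = (SELECT MAX(atime) FROM bayes_token WHERE id = ?) "
        "WHERE id = ?",
        (user_id,) * 4,
    )
    conn.commit()


conn = demo_db(10)
manual_expire(conn, user_id=1, keep=4)
print(conn.execute("SELECT * FROM bayes_vars").fetchall())
```

On a real MySQL installation the new table would more likely be created with `create table ... like`, as suggested earlier in the thread, so that keys and indexes carry over.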

    Kai

    --
    Kai Schätzl, Berlin, Germany
    Get your web at Conactive Internet Services: http://www.conactive.com

