Really force a Bayes expire - SpamAssassin
-
Really force a Bayes expire
Many of my Bayes db's (not SQL) can't be expired anymore because the
--force-expire run can't find a delta that is big enough or so. I tried
with several settings for max_size that would either expire only a few or
most of the db and some steps in-between. Always the same problem. Is
there a way or script that can *really* enforce an expire?
I don't want to throw them away as they are working quite fine. But I also
don't want them to grow indefinitely (now at around 3-4 million tokens).
Years ago I used a very difficult way to expire such a problem db by
dumping it and then removing the "correct" data with a script and then
importing it again. And after that it would expire like normal again.
However, this was quite tricky and I think it wouldn't work with the
current format anymore.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
-
Re: Really force a Bayes expire
> Many of my Bayes db's (not SQL) can't be expired anymore because the
> --force-expire run can't find a delta that is big enough or so. I tried
> with several settings for max_size that would either expire only a few or
> most of the db and some steps in-between. Always the same problem. Is
> there a way or script that can *really* enforce an expire?
> I don't want to throw them away as they are working quite fine. But I also
> don't want them to grow indefinitely (now at around 3-4 million tokens).
If you have 3-4 million tokens you need SQL (MySQL, with the tables changed
to InnoDB). I have a script that creates a new Bayesian token table and
copies only the 'newest' tokens into it.
The full script is available if you want it, but basically you stop spamd,
then in mysql:
create table bayes_token_new like bayes_token;
insert into bayes_token_new (select * from bayes_token order by id, atime
desc limit 1000000);
(I am using amavisd, which has only one id, so I don't know how this behaves
with several ids under plain SA. In amavisd I didn't use the id key, but you
might need it to trigger the index. Also, you could get clever and only load
'interesting' tokens, something like 'where ham_count > 8 or spam_count > 8'.)
That would give you 1MM tokens (note: it will take a long time, atime isn't
a direct key).
You could limit by id (look the ids up with select * from bayes_vars),
something like:
insert into bayes_token_new (select * from bayes_token where id=1 order by
atime desc);
Don't forget to update bayes_vars, rename the table and restart spamd.
The per-id variant above might copy more than 1MM tokens, so force a Bayes
expire afterwards just in case (with a max_db < 1MM).
YMMV, suggestion void where prohibited or taxed.
--
Michael Scheidell, CTO
>|SECNAP Network Security
Winner 2008 Network Products Guide Hot Companies
FreeBSD SpamAssassin Ports maintainer
-
Re: Really force a Bayes expire
Kai Schaetzl wrote:
> Many of my Bayes db's (not SQL) can't be expired anymore because the
> --force-expire run can't find a delta that is big enough or so. I tried
> with several settings for max_size that would either expire only a few or
> most of the db and some steps in-between. Always the same problem. Is
> there a way or script that can *really* enforce an expire?
>
Not really. If it can't find a delta, then SA can't find any way of
doing a sensible expiry, and there's not likely any "good" expiry that
can be performed.
First, understand that bayes expiry works by picking a time, and
dropping everything that hasn't been used since that time. Essentially
discarding all the stale data, and keeping the fresh stuff. I forget the
exact threshold, but I think anything that's more recent than 12 hours
is always kept by the algorithm, to avoid flushing out stuff that's
clearly being used regularly.
When no delta can be found, generally this means you've got a bayes DB
that's got a large set of data, all at more-or-less the same timestamp,
and very little other data.
If it were to pick an expiry date that winds up dropping the large
chunk, your bayes disables itself because there's no longer enough data,
making it more-or-less the same as deleting the bayes DB entirely. On
the other hand there's no data older than the large chunk left, so any
other selection winds up expiring nothing.
One cause of this is one big blob of hand training that created the
"chunk" in the dates. In this situation, SA will eventually drop the
chunk, but only after there's enough recently-used tokens to
differentiate it.
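To make the "chunk" problem concrete, here is a minimal Python sketch of the
idea described above. It is not SpamAssassin's code: the doubling deltas
starting at 12 hours, the goal_reduction and min_keep parameters, and the
function names are all assumptions chosen for illustration.

```python
HALF_DAY = 12 * 60 * 60  # assumed minimum delta: 12 hours, in seconds
DAY = 24 * 60 * 60


def reduction_table(atimes, newest, steps=10):
    """Return (delta, removed) pairs for deltas of 12h * 2**n.

    A token counts as removed if its atime is older than newest - delta.
    """
    table = []
    for n in range(steps):
        delta = HALF_DAY * 2 ** n
        cutoff = newest - delta
        removed = sum(1 for atime in atimes if atime < cutoff)
        table.append((delta, removed))
    return table


def pick_delta(atimes, newest, goal_reduction, min_keep):
    """Pick the delta whose reduction is closest to the goal, or None.

    Deltas that remove nothing, or that leave fewer than min_keep tokens,
    are treated as unusable (an assumption of this sketch).
    """
    total = len(atimes)
    usable = [
        (delta, removed)
        for delta, removed in reduction_table(atimes, newest)
        if removed > 0 and total - removed >= min_keep
    ]
    if not usable:
        return None
    return min(usable, key=lambda pair: abs(pair[1] - goal_reduction))[0]


NOW = 100 * DAY  # arbitrary "newest token" timestamp for the demo

# Scenario 1: one big hand-trained chunk plus a little fresh data.
chunk_db = [NOW - 30 * DAY] * 900 + [NOW - 3600] * 100

# Scenario 2: the same number of tokens spread evenly over 40 days.
spread_db = [NOW - i * (40 * DAY // 1000) for i in range(1000)]

print(pick_delta(chunk_db, NOW, goal_reduction=250, min_keep=500))
print(pick_delta(spread_db, NOW, goal_reduction=250, min_keep=500))
```

For the chunk database every delta removes either 900 tokens (too many) or
none, so no usable delta exists and the first call prints None. For the
evenly spread database the 32-day delta removes 199 tokens and is picked
(2764800 seconds).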
> I don't want to throw them away as they are working quite fine. But I also
> don't want them to grow indefinitely (now at around 3-4 million tokens).
>
Hmm, do you have a high mail volume? Another possibility is that a large,
diverse volume of mail creates a broad set of constantly fresh tokens,
all under the 12 hour threshold. On the plus side, this
essentially makes SA automatically scale the bayes DB to fit your mail
volume. It also shouldn't grow indefinitely, as any tokens that aren't
being used regularly will wind up being expired after they go unused for
a day or so. On the down side, your bayes DB can be large.
It might be worth looking at a force expire with debug on (sa-learn -D
--force-expire). The time ranges and their respective reduction counts
should show which of these scenarios applies to you.
Can you post the reduction count table?
-
Re: Really force a Bayes expire
Sorry for the belated answer. I'm currently on vacation and don't check
the mailing lists regularly. If I have a bit of time I'll collect the
data from the -D run and also from the magic dump and post it.
Kai
-
Re: Really force a Bayes expire
Thanks for the tips, I'll have a look at them once I'm back in office
which will take another four weeks. Quite a while back I tested converting
to SQL, but never actually put the results into production.
AFAIR, with SQL I do the expire myself, SA won't do it, right?
Kai
-
Re: Really force a Bayes expire
> From: Kai Schaetzl
> Date: Fri, 29 Aug 2008 15:01:25 +0200
> Subject: Re: Really force a Bayes expire
>
> Thanks for the tips, I'll have a look at them once I'm back in office
> which will take another four weeks. Quite a while back I tested converting
> to SQL, but never actually put the results into production.
>
> AFAIR, with SQL I do the expire myself, SA won't do it, right?
No, sa-learn --force-expire works fine.
--
Michael Scheidell, CTO
-
Re: Really force a Bayes expire
Michael Scheidell writes:
>> From: Kai Schaetzl
>> AFAIR, with SQL I do the expire myself, SA won't do it, right?
>
> No, sa-learn --force-expire works fine.
As does the 'normal' Bayes expiry mechanism of triggering (or
attempting) an expire when the number of tokens reaches the threshold.
-
Re: Really force a Bayes expire
Graham Murray wrote on Fri, 29 Aug 2008 19:42:34 +0100:
> As does the 'normal' Bayes expiry mechanism of triggering (or
> attempting) an expire when the number of tokens reaches the threshold.
Coming back to this old thread. Unfortunately, the expiry uses the same
algorithm for all Bayes storage engines. So, if it fails on dbm it also
fails on SQL. I may have misunderstood that, but my understanding from the
postings on this list over the past few years was that the expiry
algorithm for the SQL store would be more "intelligent" and thus succeed
where it doesn't succeed on dbm. That's definitely not true. It's the same
and fails the same ;-)
After conversion I had to extract the latest records up to the number I
wanted to keep into a new table, replace the old table bayes_token with
it, flush bayes_seen and adjust bayes_vars accordingly. As I'm now way
under my normal expiry limit I don't know whether the normal expiry will
work now, but I hope so. If not, it's definitely easier to "manually"
expire the Bayes database once it's in SQL.
Kai
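The manual SQL expiry described above (copy the newest tokens into a new
table, swap it in, flush bayes_seen, adjust bayes_vars) can be sketched as
follows. This is only an illustration under stated assumptions: it runs
against an in-memory SQLite database as a stand-in for MySQL, the table
layout is a reduced guess at the SA Bayes schema, and which bayes_vars
columns need adjusting (token_count, oldest_token_age, newest_token_age
here) is an assumption too. The function names are hypothetical. Stop spamd
and back up the real database before trying anything similar.

```python
import sqlite3

TOKEN_DDL = (
    "CREATE TABLE {name} (id INTEGER, token BLOB, spam_count INTEGER, "
    "ham_count INTEGER, atime INTEGER, PRIMARY KEY (id, token))"
)


def make_demo_db(token_count=10):
    """Build a tiny in-memory database with an assumed, reduced schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(TOKEN_DDL.format(name="bayes_token"))
    conn.execute("CREATE TABLE bayes_seen (id INTEGER, msgid TEXT, flag TEXT)")
    conn.execute(
        "CREATE TABLE bayes_vars (id INTEGER PRIMARY KEY, token_count INTEGER, "
        "oldest_token_age INTEGER, newest_token_age INTEGER)"
    )
    rows = [(1, f"t{i}".encode(), 1, 0, i) for i in range(1, token_count + 1)]
    conn.executemany("INSERT INTO bayes_token VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("INSERT INTO bayes_seen VALUES (1, 'msg@example', 's')")
    conn.execute("INSERT INTO bayes_vars VALUES (1, ?, 1, ?)",
                 (token_count, token_count))
    conn.commit()
    return conn


def trim_bayes_tokens(conn, user_id, keep):
    """Keep only the `keep` most recently used tokens for one user id."""
    cur = conn.cursor()
    cur.execute(TOKEN_DDL.format(name="bayes_token_new"))
    # Newest tokens (highest atime) of the user being trimmed.
    cur.execute(
        "INSERT INTO bayes_token_new SELECT id, token, spam_count, ham_count, "
        "atime FROM bayes_token WHERE id = ? ORDER BY atime DESC LIMIT ?",
        (user_id, keep),
    )
    # Tokens of all other ids are carried over unchanged.
    cur.execute(
        "INSERT INTO bayes_token_new SELECT id, token, spam_count, ham_count, "
        "atime FROM bayes_token WHERE id <> ?",
        (user_id,),
    )
    # Swap the new table in for the old one.
    cur.execute("DROP TABLE bayes_token")
    cur.execute("ALTER TABLE bayes_token_new RENAME TO bayes_token")
    # Flush the seen-message list for this user.
    cur.execute("DELETE FROM bayes_seen WHERE id = ?", (user_id,))
    # Bring the per-user counters back in line with the trimmed table.
    cur.execute(
        "UPDATE bayes_vars SET "
        "token_count = (SELECT COUNT(*) FROM bayes_token WHERE id = ?), "
        "oldest_token_age = (SELECT COALESCE(MIN(atime), 0) FROM bayes_token WHERE id = ?), "
        "newest_token_age = (SELECT COALESCE(MAX(atime), 0) FROM bayes_token WHERE id = ?) "
        "WHERE id = ?",
        (user_id, user_id, user_id, user_id),
    )
    conn.commit()
```

With the demo database, make_demo_db(10) followed by
trim_bayes_tokens(conn, user_id=1, keep=4) leaves the four tokens with the
highest atime. On MySQL the copy step would use the create table ... like
and insert ... select statements quoted earlier in the thread.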