Re: [Rule Set proposal] French Rules - SpamAssassin
Giampaolo Tomassoni writes:
> > -----Original Message-----
> > From: jm@jmason.org [mailto:jm@jmason.org]
> > Sent: Wednesday, June 18, 2008 12:10 PM
> > To: John GALLET
> > Cc: users@spamassassin.apache.org
> > Subject: Re: [Rule Set proposal] French Rules
> >
> > ...omissis...
> >
> > by the way, if you're reasonably perl-capable, it might be worthwhile
> > using the algorithm I use to generate the JM_SOUGHT ruleset for english
> > spam: http://taint.org/tag/rule-discovery
> >
> > you just give it a corpus of spam samples and it generates the rules for
> > you. The code is in SpamAssassin SVN.
> >
> > --j.
>
> Nah, that's great!
>
> I regret I can only occasionally read interesting messages due to my own
> time constraints. I could have read about this set of scripts weeks ago,
> otherwise...
>
> How is this code supposed to be used? I see these scripts in rule-dev:
> maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
> strip-high-scorers-from-log.
>
> Could you give us a brief description of what they do and how to use them?
Basically, you collect 2 corpora:
1. a big corpus of ham samples, stuff that you do not want to match.
2. a smaller corpus of spam samples.
You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
the patterns; you can then write rules based on these.
Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
script does, to get a bit more control (and generate real SpamAssassin
rules). That's what the JM_SOUGHT scripts do. See below:
http://taint.org/x/2008/seekrules_run
that script also calls "mk_meta_rule", which is here:
http://taint.org/x/2008/mk_meta_rule
--j.
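
As a rough illustration of the two-corpora idea described above, here is a
minimal Python sketch. It is not the code from SpamAssassin SVN: the word
n-gram approach, the function names ngrams and seek_phrases, and the toy
messages are all assumptions made for illustration. It keeps only phrases
that never occur in the ham corpus and ranks them by the share of spam
messages they hit.

```python
from collections import Counter


def ngrams(text, n):
    """Return the set of word n-grams found in one message."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seek_phrases(spam, ham, n=3):
    """Rank n-grams by spam hit rate, dropping any that also occur in ham."""
    # Each message contributes a phrase at most once (ngrams returns a set).
    spam_hits = Counter(g for msg in spam for g in ngrams(msg, n))
    ham_seen = set().union(*(ngrams(msg, n) for msg in ham))
    rows = []
    for phrase, hits in spam_hits.items():
        if phrase in ham_seen:
            continue  # err on the side of no false positives
        rows.append((100.0 * hits / len(spam), phrase))
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows


# Toy corpora, invented for this example.
spam = [
    "Pour ne plus recevoir nos offres",
    "Pour ne plus recevoir ce message",
    "Bonjour, voici nos offres",
]
ham = [
    "Bonjour, voici le compte rendu de la réunion",
    "Je souhaite ne plus recevoir de courrier papier",
]

rows = seek_phrases(spam, ham)
for spam_pct, phrase in rows:
    print(f"{spam_pct:7.3f}  /{phrase}/")
```

The real scripts work on mass-check logs and do considerably more; the
sketch only shows why a large ham corpus matters: any phrase seen in ham is
discarded outright.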
-
seekrules over French spam (was Re: [Rule Set proposal] French Rules)
Hi,
> You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> the patterns; you can then write rules based on these.
I did so, the results are interesting, though I do not really know where
to go from there. If I take the first 50 "best" patterns and strip off the
obvious stand-alone words and sure-to-be-false-positive expressions, here
is what I get (apologies to non-French speakers, explanation below):
RATIO SPAM% HAM% DATA
1.000 9.375 0.000 /Pour ne plus recevoir /
1.000 6.875 0.000 /6 janvier 1978 relative /
1.000 6.875 0.000 /affiche pas correctement, vous pouvez le visualiser en/
1.000 5.625 0.000 /s données nominatives /
1.000 5.625 0.000 / ce message, cliquez-ici/
1.000 5.625 0.000 / vous désinscrire de /
1.000 5.000 0.000 /Conformément à l/
1.000 5.000 0.000 / plus recevoir d\'informations de notre part/
1.000 5.000 0.000 /un droit d\'accès/
1.000 4.375 0.000 /ment Ã| l\'article 34 de la loi/
1.000 4.375 0.000 /ment à l\'article 34 de la loi /
1.000 3.750 0.000 /ous désinscrire de notre /
1.000 3.750 0.000 /es nominatives vous concernant\. /
1.000 3.750 0.000 / Libertés du 6 /
1.000 3.750 0.000 /es vous concernant\. Pour l\'exercer, /
As you can see, charset encoding makes a mess, and many must be regrouped.
Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
read this mail in html, click here).
The whole result is available at
http://www.saphirtech.fr/spam/seekrules_fr_1.txt
> http://taint.org/x/2008/seekrules_run
I also adapted this one (paths of course, but I also forced the "mbox"
format, since "detect" spat out zero results), but the result is even less
"readable" to me. I am missing the script seekrules/kill_bad_patterns, which
I presume removes stand-alone words and such things.
Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
John
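
For readers who want to post-process a report like the one above, here is a
small Python sketch that parses the RATIO / SPAM% / HAM% / DATA rows and
drops short patterns. It is not a replacement for kill_bad_patterns (whose
contents are not shown in this thread); the names Hit, parse_report and
keep_multiword, the word-count threshold, and the "/gratuit/" row are
assumptions invented for the example.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    ratio: float
    spam_pct: float
    ham_pct: float
    pattern: str


# Three numeric columns followed by a /pattern/; spaces inside the
# slashes are part of the pattern and must be preserved.
ROW = re.compile(r"^\s*([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+/(.*)/\s*$")


def parse_report(text):
    """Parse report rows, silently skipping the header and blank lines."""
    hits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            hits.append(Hit(float(m[1]), float(m[2]), float(m[3]), m[4]))
    return hits


def keep_multiword(hits, min_words=3, max_ham_pct=0.0):
    """Keep patterns with enough words and no more than max_ham_pct ham hits."""
    return [
        h for h in hits
        if h.ham_pct <= max_ham_pct and len(h.pattern.split()) >= min_words
    ]


# The last row is hypothetical, added to show a stand-alone word being dropped.
REPORT = """RATIO SPAM% HAM% DATA
1.000 9.375 0.000 /Pour ne plus recevoir /
1.000 5.000 0.000 /Conformément à l/
1.000 4.000 0.000 /gratuit/
"""

hits = parse_report(REPORT)
kept = keep_multiword(hits)
for h in kept:
    print(f"{h.spam_pct:6.3f}  /{h.pattern}/")
```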
-
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)
John GALLET writes:
> Hi,
>
> > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
> > the patterns; you can then write rules based on these.
>
> I did so, the results are interesting, though I do not really know where
> to go from there. If I take the first 50 "best" patterns and strip off the
> obvious stand-alone words and sure-to-be-false-positive expressions, here
> is what I get to: (sorry for non French speakers, explanation below)
>
> RATIO SPAM% HAM% DATA
> 1.000 9.375 0.000 /Pour ne plus recevoir /
> 1.000 6.875 0.000 /6 janvier 1978 relative /
> 1.000 6.875 0.000 /affiche pas correctement, vous pouvez le visualiser en/
> 1.000 5.625 0.000 /s données nominatives /
> 1.000 5.625 0.000 / ce message, cliquez-ici/
> 1.000 5.625 0.000 / vous désinscrire de /
> 1.000 5.000 0.000 /Conformément à l/
> 1.000 5.000 0.000 / plus recevoir d\'informations de notre part/
> 1.000 5.000 0.000 /un droit d\'accès/
> 1.000 4.375 0.000 /ment Ã| l\'article 34 de la loi/
> 1.000 4.375 0.000 /ment à l\'article 34 de la loi /
> 1.000 3.750 0.000 /ous désinscrire de notre /
> 1.000 3.750 0.000 /es nominatives vous concernant\. /
> 1.000 3.750 0.000 / Libertés du 6 /
> 1.000 3.750 0.000 /es vous concernant\. Pour l\'exercer, /
>
> As you can see, charset encoding makes a mess, and many must be regrouped.
> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> read this mail in html, click here).
It might be worth collecting more ham that includes any such common
text -- or even _generating_ mails along those lines (just edit the
message body to include the text you want the ruleset to avoid).
> The whole result is available at
> http://www.saphirtech.fr/spam/seekrules_fr_1.txt
>
> > http://taint.org/x/2008/seekrules_run
>
> I also adapted this one (paths of course, but also forced "mbox" format,
> "detect" spit out zero results)
ah. forgot to mention: detect only treats files that end in ".mbox" as
mboxes. 
> , but the result is even less "readable"
> for me. I miss the script seekrules/kill_bad_patterns which I presume
> removes stand alone words and such things.
yes, I left that out. it's very specific to my spamtraps, since it
removes noise added by some of them.
> Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
>
> John
Thanks for trying it out!
--j.
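
As an illustration of "writing rules based on these" patterns, the Python
sketch below formats one discovered pattern as a SpamAssassin body rule
using the standard body / describe / score configuration lines. The rule
name FR_SKETCH_UNSUB, the score of 0.5 and the helper body_rule are
hypothetical; they are not the FR_* rules discussed in this thread, and the
score is not a recommendation.

```python
def body_rule(name, pattern, score, description):
    """Format a regex pattern as SpamAssassin body/describe/score lines."""
    # Assumption: the pattern is already a valid regex, as emitted by the
    # discovery script. A bare "/" would end the rule's delimiter early,
    # so this sketch refuses such patterns instead of guessing an escape.
    if "/" in pattern:
        raise ValueError("pattern contains '/', escape it by hand first")
    return "\n".join([
        f"body {name} /{pattern}/",
        f"describe {name} {description}",
        f"score {name} {score:.1f}",
    ])


rule = body_rule(
    "FR_SKETCH_UNSUB",            # hypothetical rule name
    "Pour ne plus recevoir ",     # pattern taken from the report above
    0.5,                          # hypothetical low score
    "French unsubscribe boilerplate (sketch)",
)
print(rule)
```

Keeping the score low matches the concern about false positives: a phrase
with zero ham hits in one corpus may still turn up in someone else's ham.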
-
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)
Re,
>> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
>> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
>> read this mail in html, click here).
>
> It might be worth collecting more ham that includes any such common
> text -- or even _generating_ mails along those lines (just edit the
> message body to include the text you want the ruleset to avoid).
Well, that's the whole point: can we conclude that an email with an
unsubscribe link tends to be spam more often than ham? I think so, but
with a low score. Can we conclude that an email citing the French law
"informatique et libertés" is spam? I would say "100% except government
sponsored mailing lists that may feel obliged to do so", so I gave it a
higher score. Now, this may well be faulty logic; I do not have any
experience in spam fighting.
>> I also adapted this one (paths of course, but also forced "mbox" format,
>> "detect" spit out zero results)
> ah. forgot to mention: detect only treats files that end in ".mbox" as
> mboxes. 
:-) ok, well anyway it was quite easy to find out since it worked well
when forcing and not at all in automatic.
> Thanks for trying it out!
Well, thanks for writing it. I think its main weak point for French and
other accented languages is handling the different encodings of the same
accented character; it would need some kind of "synonyms" list. The same
letter, say "a with an accent", can be misspelled as a plain "a", encoded in
various charsets (latin, utf-8) as a "normal" à, or HTML-encoded as agrave
(I left & and ; out). I do not know whether this is possible at all; it
might complicate things *a lot*.
a++;
JG
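
The variants listed above can be reproduced with a short Python sketch using
only the standard library. It shows the byte forms of "à" in UTF-8 and
Latin-1, what happens when UTF-8 bytes are misread as Latin-1, the HTML
entity form, and one possible normalisation that folds them all to a plain
"a". The helper names strip_accents and canonical are assumptions for this
example; nothing here is part of the seek-phrases scripts.

```python
import html
import unicodedata

ch = "\u00e0"  # "à", a with grave accent

utf8 = ch.encode("utf-8")          # two bytes in UTF-8
latin1 = ch.encode("latin-1")      # a single byte in Latin-1
mojibake = utf8.decode("latin-1")  # UTF-8 bytes misread as Latin-1
entity = html.unescape("&agrave;")  # HTML entity decoded back to "à"


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def canonical(text):
    """Fold HTML entities, accents and case into one comparable form."""
    return strip_accents(html.unescape(text)).lower()


print(utf8, latin1, repr(mojibake), entity)
print(canonical("Conform&eacute;ment &agrave; l"))
print(canonical("Conform\u00e9ment \u00e0 l"))
```

Normalising both corpora this way before pattern discovery is only one
option, and it would not help with text that was already decoded with the
wrong charset.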
-
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)
John GALLET writes:
> Re,
>
> >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
> >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
> >> read this mail in html, click here).
> >
> > It might be worth collecting more ham that includes any such common
> > text -- or even _generating_ mails along those lines (just edit the
> > message body to include the text you want the ruleset to avoid).
>
> Well, that's the whole point: can we conclude that an email with an
> unsubscribe link tends to be a spam more often than a ham ? I consider so,
> but with a low score. Can we conclude that an email citing the French Law
> "informatique et libertés" is a spam ? I would say "100% except government
> sponsored mailing lists that may feel obliged to do so", so I added a
> higher score. Now it might perfectly be faulty logic, I do not have any
> experience in spam fighting.
Well, with automated rule-set generation I would advise erring on the
side of "no false positives" -- my experience with FPs is that they
may appear to be infrequent in one corpus, and then be 10x as frequent
in another person's corpus, just due to the kind of ham he/she gets.
> >> I also adapted this one (paths of course, but also forced "mbox" format,
> >> "detect" spit out zero results)
> > ah. forgot to mention: detect only treats files that end in ".mbox" as
> > mboxes. 
>
> :-) ok, well anyway it was quite easy to find out since it worked well
> when forcing and not at all in automatic.
>
> > Thanks for trying it out!
>
> Well, thanks for writing it. I think its main weak point for French and
> other accented languages is handling the different encodings for a same
> char with an accent, some kind of "synonyms" list. The same letter, say "a
> with an accent", can be misspelled with a plain "a", encoded in various
> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
> and ; out). I do not know if it is possible at all, it might complicate
> things *a lot*.
The tool can take care of this -- it will replace mutating single-characters
with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
"any" patterns.
--j.
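
To make the wildcard idea concrete, here is a minimal Python sketch. It is a
guess at the general technique, not the tool's actual code: the function
merge_with_dots is hypothetical, and it only handles variants of equal
length measured in characters. Variants whose lengths differ would need
something like the optional forms (/.?/, /.{0,3}/) mentioned above, which
the sketch does not attempt.

```python
import re


def merge_with_dots(variants):
    """Collapse equal-length variants into one regex, '.' where they differ."""
    if len({len(v) for v in variants}) != 1:
        raise ValueError("this sketch only handles variants of equal length")
    out = []
    for chars in zip(*variants):
        if len(set(chars)) == 1:
            out.append(re.escape(chars[0]))  # stable character, keep literal
        else:
            out.append(".")                  # mutating character, match any
    return "".join(out)


variants = ["Conform\u00e9ment \u00e0 l", "Conformement a l"]
pattern = merge_with_dots(variants)

for v in variants:
    print(v, bool(re.fullmatch(pattern, v)))
```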
-
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)
Justin Mason wrote:
> John GALLET writes:
>> Well, thanks for writing it. I think its main weak point for French and
>> other accented languages is handling the different encodings for a same
>> char with an accent, some kind of "synonyms" list. The same letter, say "a
>> with an accent", can be misspelled with a plain "a", encoded in various
>> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
>> and ; out). I do not know if it is possible at all, it might complicate
>> things *a lot*.
>
> The tool can take care of this -- it will replace mutating single-characters
> with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> "any" patterns.
If the number of permutations is small (as would be the case for
accented letters and the equivalent unaccented ones, or for that matter
obfuscation with lookalike characters), wouldn't it be better for it to
replace the character by a [] list of those permutations (e.g. replace
something that mutates between e and é with [eé], or replace obfuscation
of i with l and 1 by [il1])?
John.
--
-- Over 3000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages - www.tradoc.fr
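
The character-class suggestion can be sketched in Python as follows. This is
an illustration of the proposal only, under the assumption that the variants
are already aligned and of equal length; finding such alignments quickly in
a large corpus is the hard part and is not addressed. The function name
merge_with_classes and the sample words are invented for the example.

```python
import re


def merge_with_classes(variants):
    """Collapse equal-length variants, using a [..] class where they differ."""
    if len({len(v) for v in variants}) != 1:
        raise ValueError("this sketch only handles variants of equal length")
    out = []
    for chars in zip(*variants):
        seen = sorted(set(chars))
        if len(seen) == 1:
            out.append(re.escape(seen[0]))
        else:
            # Only the characters actually observed are allowed here.
            out.append("[" + "".join(re.escape(c) for c in seen) + "]")
    return "".join(out)


accented = merge_with_classes(["d\u00e9sinscrire", "desinscrire"])
lookalike = merge_with_classes(["cliquez", "cl1quez", "cllquez"])

print(accented)
print(lookalike)
```

Compared with a bare wildcard, the class rejects characters that were never
observed at that position, which should reduce the risk of matching
unrelated ham text.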
-
Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)
John Wil**** writes:
> Justin Mason wrote:
> > John GALLET writes:
> >> Well, thanks for writing it. I think its main weak point for French and
> >> other accented languages is handling the different encodings for a same
> >> char with an accent, some kind of "synonyms" list. The same letter, say "a
> >> with an accent", can be misspelled with a plain "a", encoded in various
> >> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
> >> and ; out). I do not know if it is possible at all, it might complicate
> >> things *a lot*.
> >
> > The tool can take care of this -- it will replace mutating single-characters
> > with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
> > "any" patterns.
>
> If the number of permutations is small (as would be the case for
> accented letters and the equivalent unaccented ones, or for that matter
> obfuscation with lookalike characters), wouldn't it be better for it to
> replace the character by a [] list of those permutations (i.e. replace
> something that mutates between e and é with [eé] or replace obfuscation
> of i with l and 1 by [il1] ?
It would be, but fixing the pattern-discovery algorithm to discover this
in a relatively speedy way is not so easy. Patches accepted.