Re: [Rule Set proposal] French Rules - SpamAssassin


  1. Re: [Rule Set proposal] French Rules


    Giampaolo Tomassoni writes:
    > > -----Original Message-----
    > > From: jm@jmason.org [mailto:jm@jmason.org]
    > > Sent: Wednesday, June 18, 2008 12:10 PM
    > > To: John GALLET
    > > Cc: users@spamassassin.apache.org
    > > Subject: Re: [Rule Set proposal] French Rules
    > >
    > > ...omissis...
    > >
    > > by the way, if you're reasonably perl-capable, it might be worthwhile
    > > using the algorithm I use to generate the JM_SOUGHT ruleset for english
    > > spam: http://taint.org/tag/rule-discovery
    > >
    > > you just give it a corpus of spam samples and it generates the rules
    > > for
    > > you. The code is in SpamAssassin SVN.
    > >
    > > --j.

    >
    > Nah, that's great!
    >
    > I regret I can only occasionally read interesting messages due to my own
    > time constraints. I could have read about this set of scripts weeks ago,
    > otherwise...
    >
    > How is this code supposed to be used? I see these scripts in rule-dev:
    > maildir-scan-headers, seek-phrases-in-corpus, seek-phrases-in-log and
    > strip-high-scorers-from-log.
    >
    > Give us a brief description of their work and usage.


    Basically, you collect 2 corpora:

    1. a big corpus of ham samples, stuff that you do not want to match.

    2. a smaller corpus of spam samples.

    You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
    the patterns; you can then write rules based on these.

    Alternatively run "mass-check" and "seek-phrases-in-log" directly as that
    script does, to get a bit more control (and generate real SpamAssassin
    rules). That's what the JM_SOUGHT scripts do. See below:

    http://taint.org/x/2008/seekrules_run

    that script also calls "mk_meta_rule", which is here:
    http://taint.org/x/2008/mk_meta_rule
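
To make the two-corpora idea concrete, here is a minimal Python sketch. It is not the seek-phrases-in-corpus algorithm itself, only an illustration of the principle: count word n-grams that occur in spam, compare against ham, and rank them. The function names, the fixed n-gram length and the ratio formula (spam% / (spam% + ham%)) are assumptions made for this example.

```python
from collections import Counter

def ngrams(text, n):
    """Return the set of word n-grams found in one message."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def seek_phrases(spam_msgs, ham_msgs, n=3, min_spam_pct=1.0):
    """Rank n-grams seen in spam as (ratio, spam_pct, ham_pct, phrase) rows."""
    spam_hits = Counter()
    ham_hits = Counter()
    for msg in spam_msgs:
        spam_hits.update(ngrams(msg, n))  # each phrase counts once per message
    for msg in ham_msgs:
        ham_hits.update(ngrams(msg, n))
    rows = []
    for phrase, hits in spam_hits.items():
        spam_pct = 100.0 * hits / len(spam_msgs)
        ham_pct = 100.0 * ham_hits[phrase] / len(ham_msgs) if ham_msgs else 0.0
        if spam_pct < min_spam_pct:
            continue  # too rare in spam to be worth a rule
        # Assumed ratio: 1.0 means the phrase was never seen in ham.
        ratio = spam_pct / (spam_pct + ham_pct)
        rows.append((ratio, spam_pct, ham_pct, phrase))
    # Best ratio first, then most frequent in spam, then alphabetical.
    rows.sort(key=lambda r: (-r[0], -r[1], r[3]))
    return rows
```

Each returned row has the same column order as a RATIO / SPAM% / HAM% / DATA listing, so candidates with a ham percentage of zero sort to the top.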

    --j.


  2. seekrules over French spam (was Re: [Rule Set proposal] French Rules)

    Hi,

    > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
    > the patterns; you can then write rules based on these.


    I did so; the results are interesting, though I do not really know where
    to go from there. If I take the first 50 "best" patterns and strip out the
    obvious stand-alone words and sure-to-be-false-positive expressions, here
    is what I get (apologies to non-French speakers; explanation below):

    RATIO SPAM% HAM% DATA
    1.000 9.375 0.000 /Pour ne plus recevoir /
    1.000 6.875 0.000 /6 janvier 1978 relative /
    1.000 6.875 0.000 /affiche pas correctement, vous pouvez le visualiser en/
    1.000 5.625 0.000 /s données nominatives /
    1.000 5.625 0.000 / ce message, cliquez-ici/
    1.000 5.625 0.000 / vous désinscrire de /
    1.000 5.000 0.000 /Conformément à l/
    1.000 5.000 0.000 / plus recevoir d\'informations de notre part/
    1.000 5.000 0.000 /un droit d\'accès/
    1.000 4.375 0.000 /ment Ã| l\'article 34 de la loi/
    1.000 4.375 0.000 /ment à l\'article 34 de la loi /
    1.000 3.750 0.000 /ous désinscrire de notre /
    1.000 3.750 0.000 /es nominatives vous concernant\. /
    1.000 3.750 0.000 / Libertés du 6 /
    1.000 3.750 0.000 /es vous concernant\. Pour l\'exercer, /

    As you can see, charset encoding makes a mess, and many must be regrouped.
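
One possible way to regroup such patterns by hand is sketched below in Python. It is an assumption-laden illustration, not part of the seekrules scripts: it tries to undo UTF-8 text that was wrongly decoded as Latin-1, then strips accents and case so that variants of the same phrase land in the same group. The helper names are hypothetical, and text whose bytes were already lost or altered cannot be repaired this way.

```python
import unicodedata
from collections import defaultdict

def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake, or not recoverable: leave unchanged

def fold(text):
    """Repair mojibake, then strip accents and lowercase for grouping."""
    decomposed = unicodedata.normalize("NFKD", repair_mojibake(text))
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

def regroup(patterns):
    """Group pattern strings whose folded forms are identical."""
    groups = defaultdict(list)
    for pattern in patterns:
        groups[fold(pattern.strip())].append(pattern)
    return dict(groups)
```

The folded key is only used for grouping; the rule itself would still need to match every original variant.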

    Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
    FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
    read this mail in html, click here).

    The whole result is available at
    http://www.saphirtech.fr/spam/seekrules_fr_1.txt

    > http://taint.org/x/2008/seekrules_run


    I also adapted this one (the paths of course, but I also forced the "mbox"
    format, since "detect" produced zero results), but the result is even less
    "readable" to me. I am missing the script seekrules/kill_bad_patterns,
    which I presume removes stand-alone words and similar noise.

    Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt

    John

  3. Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)


    John GALLET writes:
    > Hi,
    >
    > > You run "seek-phrases-in-corpus" over the 2 corpora, and it'll spit out
    > > the patterns; you can then write rules based on these.

    >
    > I did so, the results are interesting, though I do not really know where
    > to go from there. If I take the first 50 "best" patterns and strip off the
    > obvious stand-alone words and sure-to-be-false-positive expressions, here
    > is what I get to: (sorry for non French speakers, explanation below)
    >
    > RATIO SPAM% HAM% DATA
    > 1.000 9.375 0.000 /Pour ne plus recevoir /
    > 1.000 6.875 0.000 /6 janvier 1978 relative /
    > 1.000 6.875 0.000 /affiche pas correctement, vous pouvez le visualiser en/
    > 1.000 5.625 0.000 /s données nominatives /
    > 1.000 5.625 0.000 / ce message, cliquez-ici/
    > 1.000 5.625 0.000 / vous désinscrire de /
    > 1.000 5.000 0.000 /Conformément à l/
    > 1.000 5.000 0.000 / plus recevoir d\'informations de notre part/
    > 1.000 5.000 0.000 /un droit d\'accès/
    > 1.000 4.375 0.000 /ment Ã| l\'article 34 de la loi/
    > 1.000 4.375 0.000 /ment à l\'article 34 de la loi /
    > 1.000 3.750 0.000 /ous désinscrire de notre /
    > 1.000 3.750 0.000 /es nominatives vous concernant\. /
    > 1.000 3.750 0.000 / Libertés du 6 /
    > 1.000 3.750 0.000 /es vous concernant\. Pour l\'exercer, /
    >
    > As you can see, charset encoding makes a mess, and many must be regrouped.


    > Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
    > FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
    > read this mail in html, click here).


    It might be worth collecting more ham that includes any such common
    text -- or even _generating_ mails along those lines (just edit the
    message body to include the text you want the ruleset to avoid).

    > The whole result is available at
    > http://www.saphirtech.fr/spam/seekrules_fr_1.txt
    >
    > > http://taint.org/x/2008/seekrules_run

    >
    > I also adapted this one (paths of course, but also forced "mbox" format,
    > "detect" spit out zero results)


    ah. forgot to mention: detect only treats files that end in ".mbox" as
    mboxes.

    > , but the result is even less "readable"
    > for me. I miss the script seekrules/kill_bad_patterns which I presume
    > removes stand alone words and such things.


    yes, I left that out. it's very specific to my spamtraps, since it
    removes noise added by some of them.

    > Whole result at http://www.saphirtech.fr/spam/seekrules_fr_2.txt
    >
    > John


    Thanks for trying it out!

    --j.


  4. Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)

    Re,

    >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
    >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
    >> read this mail in html, click here).

    >
    > It might be worth collecting more ham that includes any such common
    > text -- or even _generating_ mails along those lines (just edit the
    > message body to include the text you want the ruleset to avoid).


    Well, that's the whole point: can we conclude that an email with an
    unsubscribe link is spam more often than ham? I think so, but only with
    a low score. Can we conclude that an email citing the French law
    "informatique et libertés" is spam? I would say "100%, except for
    government-sponsored mailing lists that may feel obliged to do so", so I
    gave it a higher score. This may well be faulty logic; I do not have any
    experience in spam fighting.

    >> I also adapted this one (paths of course, but also forced "mbox" format,
    >> "detect" spit out zero results)

    > ah. forgot to mention: detect only treats files that end in ".mbox" as
    > mboxes.


    :-) ok, well anyway it was quite easy to find out since it worked well
    when forcing and not at all in automatic.

    > Thanks for trying it out!


    Well, thanks for writing it. I think its main weak point for French and
    other accented languages is handling the different encodings of the same
    accented character; it needs some kind of "synonyms" list. The same
    letter, say "a with an accent", can be misspelled as a plain "a", encoded
    in various charsets (latin, utf-8) as a "normal" à, or HTML-encoded as
    agrave (I left the & and ; out). I do not know if this is possible at
    all; it might complicate things *a lot*.

    a++;
    JG


  5. Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)


    John GALLET writes:
    > Re,
    >
    > >> Anyway, these are the patterns I tried to code in FR_SPAMISLEGAL and
    > >> FR_HOWTOUNSUBSCRIBE, plus one I considered too generic (if you can't
    > >> read this mail in html, click here).

    > >
    > > It might be worth collecting more ham that includes any such common
    > > text -- or even _generating_ mails along those lines (just edit the
    > > message body to include the text you want the ruleset to avoid).

    >
    > Well, that's the whole point: can we conclude that an email with an
    > unsubscribe link tends to be a spam more often than a ham ? I consider so,
    > but with a low score. Can we conclude that an email citing the French Law
    > "informatique et libertés" is a spam ? I would say "100% except government
    > sponsored mailing lists that may feel obliged to do so", so I added a
    > higher score. Now it might perfectly be faulty logic, I do not have any
    > experience in spam fighting.


    Well, with automated rule-set generation I would advise erring on the
    side of "no false positives" -- my experience with FPs is that they
    may appear to be infrequent in one corpus, and then be 10x as frequent
    in another person's corpus, just due to the kind of ham he/she gets.
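
As a rough illustration of that advice, the Python sketch below parses lines in the RATIO / SPAM% / HAM% / DATA layout shown earlier in the thread and keeps only patterns that never hit ham. The function names and the thresholds are assumptions for this example; the scripts themselves may filter differently.

```python
def parse_row(line):
    """Split one 'RATIO SPAM% HAM% DATA' line into typed fields."""
    ratio, spam_pct, ham_pct, pattern = line.strip().split(None, 3)
    return float(ratio), float(spam_pct), float(ham_pct), pattern

def keep_safe_patterns(rows, max_ham_pct=0.0, min_spam_pct=1.0):
    """Keep rows with no (or almost no) ham hits and enough spam hits."""
    return [
        row for row in rows
        if row[2] <= max_ham_pct and row[1] >= min_spam_pct
    ]
```

Keeping max_ham_pct at zero is the conservative choice: a pattern that hits even a little ham in one corpus may hit far more in someone else's.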

    > >> I also adapted this one (paths of course, but also forced "mbox" format,
    > >> "detect" spit out zero results)

    > > ah. forgot to mention: detect only treats files that end in ".mbox" as
    > > mboxes.

    >
    > :-) ok, well anyway it was quite easy to find out since it worked well
    > when forcing and not at all in automatic.
    >
    > > Thanks for trying it out!

    >
    > Well, thanks for writing it. I think its main weak point for French and
    > other accented languages is handling the different encodings for a same
    > char with an accent, some kind of "synonyms" list. The same letter, say "a
    > with an accent", can be misspelled with a plain "a", encoded in various
    > charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
    > and ; out). I do not know if it is possible at all, it might complicate
    > things *a lot*.


    The tool can take care of this -- it will replace mutating single-characters
    with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
    "any" patterns.

    --j.
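
The single-character case can be pictured with a small Python sketch. This is an illustration of the idea only, under the assumption that all variants have the same length; it is not the tool's code, and the function name is hypothetical. Variants of different lengths (for example a one-character accent versus a two-character mis-decoded sequence) would need the /.?/ or /.{0,3}/ style of wildcard instead.

```python
import re

def merge_with_wildcards(variants):
    """Merge equal-length strings into one regex, using '.' where they differ."""
    if not variants:
        raise ValueError("need at least one variant")
    length = len(variants[0])
    if any(len(v) != length for v in variants):
        raise ValueError("this sketch only handles equal-length variants")
    parts = []
    for chars in zip(*variants):
        if len(set(chars)) == 1:
            parts.append(re.escape(chars[0]))  # stable position: literal
        else:
            parts.append(".")                  # mutating position: wildcard
    return "".join(parts)
```

For instance, merging "désinscrire" and "desinscrire" yields a pattern with a single "." in the second position, which also matches any other character there.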


  6. Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)

    Justin Mason a écrit :
    > John GALLET writes:
    >> Well, thanks for writing it. I think its main weak point for French and
    >> other accented languages is handling the different encodings for a same
    >> char with an accent, some kind of "synonyms" list. The same letter, say "a
    >> with an accent", can be misspelled with a plain "a", encoded in various
    >> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
    >> and ; out). I do not know if it is possible at all, it might complicate
    >> things *a lot*.

    >
    > The tool can take care of this -- it will replace mutating single-characters
    > with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
    > "any" patterns.


    If the number of permutations is small (as would be the case for
    accented letters and the equivalent unaccented ones, or for that matter
    obfuscation with lookalike characters), wouldn't it be better for it to
    replace the character by a [] list of those permutations (e.g. replace
    something that mutates between e and é with [eé], or replace obfuscation
    of i with l and 1 by [il1])?
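
A minimal Python sketch of this suggestion follows. It assumes the variants have already been found and aligned (all the same length), which is the hard part in practice; the function name is hypothetical and this is not code from the tool.

```python
import re

def merge_with_classes(variants):
    """Merge equal-length strings, using a [..] class where they differ."""
    if not variants or len({len(v) for v in variants}) != 1:
        raise ValueError("need one or more variants of equal length")
    parts = []
    for chars in zip(*variants):
        seen = sorted(set(chars))
        if len(seen) == 1:
            parts.append(re.escape(seen[0]))
        else:
            # Only the characters actually observed are allowed here.
            parts.append("[" + "".join(re.escape(c) for c in seen) + "]")
    return "".join(parts)
```

Compared with a bare "." wildcard, the class only admits the observed characters, so the resulting rule is less likely to match unrelated text.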

    John.

    --
    -- Over 3000 webcams from ski resorts around the world - www.snoweye.com
    -- Translate your technical documents and web pages - www.tradoc.fr


  7. Re: seekrules over French spam (was Re: [Rule Set proposal] French Rules)


    John Wil**** writes:
    > Justin Mason a écrit :
    > > John GALLET writes:
    > >> Well, thanks for writing it. I think its main weak point for French and
    > >> other accented languages is handling the different encodings for a same
    > >> char with an accent, some kind of "synonyms" list. The same letter, say "a
    > >> with an accent", can be misspelled with a plain "a", encoded in various
    > >> charsets (latin, utf-8) to a "normal" à, or html encoded agrave (I left &
    > >> and ; out). I do not know if it is possible at all, it might complicate
    > >> things *a lot*.

    > >
    > > The tool can take care of this -- it will replace mutating single-characters
    > > with a /./. It also supports /.?/, /.{0,3}/, /.{0,10}/ and a few other
    > > "any" patterns.

    >
    > If the number of permutations is small (as would be the case for
    > accented letters and the equivalent unaccented ones, or for that matter
    > obfuscation with lookalike characters), wouldn't it be better for it to
    > replace the character by a [] list of those permutations (i.e. replace
    > something that mutates between e and é with [eé] or replace obfuscation
    > of i with l and 1 by [il1] ?


    It would be, but fixing the pattern-discovery algorithm to discover this
    in a relatively speedy way is not so easy. Patches accepted.

