[Rule Set proposal] French Rules - SpamAssassin


Thread: [Rule Set proposal] French Rules

  1. [Rule Set proposal] French Rules

    Hi,

    This is my first post on this list and first ruleset, so please point me
    to the right place/documents if I am doing anything wrong.

    According to a search of this list on markmail.org, there have been few
    subjects about spam in French and (no disrespect meant) I would agree with
    the comments I read about the current French Ruleset being inadequate
    (tried it, did not keep any of it).

    So I would like to propose a set for French Rules and get your feedback.

    You can find both the rules and some sample spam email messages (two of
    them missing, I have hits in my log files, but deleted them) at the
    following URL: http://www.saphirtech.fr/spam/

    I have been running these for about a month sitewide on three domains, and
    I have not seen any false positives (yet).

    Sincerely,
    JG


    #####################################################################################
    ##### FRENCH SPECIFIC SPAMASSASSIN RULES.
    ##### USE AND REDISTRIBUTE WITH THIS NOTE AT YOUR OWN RISK AND PLEASURE.
    ##### AUTHOR: John GALLET
    ##### Version: 2008-JUNE-17
    ##### Latest: http://www.saphirtech.fr/
    ##### Status: It Works For Me (tm)
    #####################################################################################
    # Spam is legal in France !
    body FR_SPAMISLEGAL /\b(Conform.+ment|En vertu).{0,5}(article.{0,4}34.{0,4})?la loi\b/i
    describe FR_SPAMISLEGAL French: pretends spam is (l)awful.
    lang fr describe FR_SPAMISLEGAL Invoque la loi informatique et libertes.
    score FR_SPAMISLEGAL 2.5

    body FR_SPAMISLEGAL_2 /\bdroit d.acc.+s.{1,3}(de modification)?.{0,5}de rectification\b/i
    describe FR_SPAMISLEGAL_2 French: pretends spam is (l)awful.
    lang fr describe FR_SPAMISLEGAL_2 Invoque le droit de rectification cnil.
    score FR_SPAMISLEGAL_2 2.5

    #####
    # yeah, sure.
    body FR_NOTSPAM /\b(ceci|ce).{1,9} n.est pas.{1,5}spam\b/i
    describe FR_NOTSPAM French: claims not to be spam.
    lang fr describe FR_NOTSPAM Affirme ne pas etre du spam.
    score FR_NOTSPAM 4.0

    #####
    ## I can pay my taxes
    body FR_PAYLESSTAXES /\b(paye|calcul|simul|r.+dui|investi).{1,7}(moins|vo|ses).{0,5}imp.+t(s)?\b/i
    describe FR_PAYLESSTAXES French: Pay less taxes
    lang fr describe FR_PAYLESSTAXES Simulateurs et reductions d'impots.
    score FR_PAYLESSTAXES 2.0

    body FR_REALESTATE_INVEST /\b(loi)? (de.robien|girardin).{1,15}(neuf|recentr.+|ancien| IR|IS|imp.+t(s)?|industriel(le)?)\b/i
    describe FR_REALESTATE_INVEST French: Invest in real-estate with tax-reductions
    lang fr describe FR_REALESTATE_INVEST Reduction impots immobilier.
    score FR_REALESTATE_INVEST 2.5

    #####
    # I won at the casino
    body FR_ONLINEGAMBLING /\b(casino(s)?|jeu(x)?|joueur(s)?) (en ligne|de grattage)\b/i
    describe FR_ONLINEGAMBLING French: Online gambling
    lang fr describe FR_ONLINEGAMBLING Jeux en ligne.
    score FR_ONLINEGAMBLING 2.0

    #####
    # I am so lucky to receive spam
    body FR_YOURELUCKY /\b(tentez)? votre (jour de)? chance\b/i
    describe FR_YOURELUCKY French: it's your lucky day (sure).
    lang fr describe FR_YOURELUCKY Jeux de hasard et de chance.
    score FR_YOURELUCKY 1.0

    #####
    # Baby, did you forget to take your meds ?
    body FR_ONLINEMEDS /\bpharmacie(s)? (en ligne|internet)\b/i
    describe FR_ONLINEMEDS French: Online meds ordering
    lang fr describe FR_ONLINEMEDS Achat de medicaments en ligne.
    score FR_ONLINEMEDS 3.0

    ######
    # Tell me why
    body FR_REASON_SUBSCRIBE /\bVous recevez ce(t|tte)? (message|mail|m.+l|lettre|news.+) (car|parce que)\b/i
    describe FR_REASON_SUBSCRIBE French: you subscribed to my spam.
    lang fr describe FR_REASON_SUBSCRIBE Indique pourquoi vous recevez le courrier.
    score FR_REASON_SUBSCRIBE 1.5

    #####
    # How to unsubscribe
    body FR_HOWTOUNSUBSCRIBE /\b(souhaitez|d.+sirez|pour).{1,10}(plus.{1,}recevoir|d.+sincrire|d.+sinscription).{0,10}(information|email|mail|mailing|newsletter|message|offre|promotion)(s)?\b/i
    describe FR_HOWTOUNSUBSCRIBE French: how to unsubscribe
    lang fr describe FR_HOWTOUNSUBSCRIBE Indique comment se desabonner.
    score FR_HOWTOUNSUBSCRIBE 2.0

    ####
    # Various "CRM" (Could Remove Me)
    #####
    header FR_MAILER_1 X-Mailer =~ /(delosmail|cabestan|ems|mp6|wamailer|phpmailer|eMailink|Accucast|Benchmail)/i
    describe FR_MAILER_1 French spammy X-Mailer
    lang fr describe FR_MAILER_1 X-Mailer couramment employe pour des spams en francais.
    score FR_MAILER_1 4.0

    header FR_MAILER_2 X-EMV- =~ /.+/
    describe FR_MAILER_2 French spammy mailer header
    lang fr describe FR_MAILER_2 X-Mailer couramment employe pour des spams en francais.
    score FR_MAILER_2 4.0

    #####################################################################################
    ##### END FRENCH SPECIFIC SPAMASSASSIN RULES.
    #####################################################################################


  2. Re: [Rule Set proposal] French Rules

    On Tue, Jun 17, 2008 at 12:11 PM, John GALLET wrote:
    > Hi,
    >
    > This is my first post on this list and first ruleset, so please point me to
    > the right place/documents if I am doing anything wrong.
    >
    > According to a search of this list on markmail.org, there have been few
    > subjects about spam in French and (no disrespect meant) I would agree with
    > the comments I read about the current French Ruleset being inadequate (tried
    > it, did not keep any of it).
    >
    > So I would like to propose a set for French Rules and get your feedback.
    >
    > You can find both the rules and some sample spam email messages (two of them
    > missing, I have hits in my log files, but deleted them) at the following
    > URL: http://www.saphirtech.fr/spam/
    >
    > I have been running these for about a month sitewide on three domains, I
    > have not seen any false positives (yet).
    >
    > Sincerely,
    > JG


    I was able to access the URL you mentioned, but not all of the files
    below it. I received:
    "Forbidden
    You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."


    Dave


  3. Re: [Rule Set proposal] French Rules

    Hi,

    > I was able to access the URL you mentioned, but not all of the files
    > below it. I received:
    > "Forbidden
    > You don't have permission to access /spam/FR_PAYLESSTAXES.txt on this server."


    Sorry guys, only the ruleset file (the one I tried, of course) was
    readable; all the non-empty spam samples had bad permissions. This is fixed.

    I am still missing samples for two rules: I did have hits according to
    /var/spool/maillog, but I did not save the messages.

    John


  4. Re: [Rule Set proposal] French Rules


    John GALLET writes:
    > Hi,
    >
    > This is my first post on this list and first ruleset, so please point me
    > to the right place/documents if I am doing anything wrong.
    >
    > According to a search of this list on markmail.org, there have been few
    > subjects about spam in French and (no disrespect meant) I would agree with
    > the comments I read about the current French Ruleset being inadequate
    > (tried it, did not keep any of it).
    >
    > So I would like to propose a set for French Rules and get your feedback.


    By the way, if you're reasonably Perl-capable, it might be worthwhile
    using the algorithm I use to generate the JM_SOUGHT ruleset for English
    spam: http://taint.org/tag/rule-discovery

    You just give it a corpus of spam samples and it generates the rules for
    you. The code is in SpamAssassin SVN.

    --j.


  5. Re: [Rule Set proposal] French Rules


    > I still miss samples for two rules, even if I did have hits according to
    > /var/spool/maillog I did not save them.


    I added a sample for the FR_NOTSPAM rule, and I removed the
    FR_YOURELUCKY rule because I see other forms of the text getting through,
    so it is not effective. On the other hand, nearly all these messages are
    caught by RBL rules, so I might even remove it completely if I can't find
    an effective one.

    John
    PS: reminder, rules and samples available at
    http://www.saphirtech.fr/spam/


  6. hit frequencies (was Re: [Rule Set proposal] French Rules)

    Hi,

    First of all, thanks to Justin for patiently helping me to install
    mass-check and pointing me in the right direction. I will try to run the
    algorithms tonight to see what they come up with.

    In the meantime, you can find a hit-frequencies report at:
    http://www.saphirtech.fr/spam/freqs_2008_06_23.txt

    All rules are prefixed with FR_ and are available in the same directory.

    I must say I did not double check for stray spam in my mailbox before
    using it as a ham corpus but it *should* be clean. I'll double check for
    next run. The spam corpus was 100% French spam, hand-picked over the last
    week through the "probably-spam" class (default score values 5-15).

    Any feedback on the results (not enough in corpus, bad rules, good rules,
    etc.) appreciated.

    Sincerely,
    JG


  7. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    On Mon, 23 Jun 2008, John GALLET wrote:

    > First of all, thanks to Justin for patiently helping me to install
    > mass-check and pointing me in the right direction.


    Applause for Justin! This is the sort of thing we need to see for many
    more specialized spam categories...

    > I will try to run the algorithms tonight to see what they come up with.


    Thanks for taking this burden upon yourself. One other thing you should be
    prepared to do, if you're willing to devote long-term responsibility to
    these rules, is to provide sa-update-compatible feeds of your dynamic
    rules. This is another thing that Justin can probably help you with.

    --
    John Hardin KA7OHZ http://www.impsec.org/~jhardin/
    jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
    key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
    -----------------------------------------------------------------------
    The problem is when people look at Yahoo, slashdot, or groklaw and
    jump from obvious and correct observations like "Oh my God, this
    place is teeming with utter morons" to incorrect conclusions like
    "there's nothing of value here". -- Al Petrofsky, in Y! SCOX
    -----------------------------------------------------------------------
    11 days until the 232nd anniversary of the Declaration of Independence


  8. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    John GALLET wrote:
    > Any feedback on the results (not enough in corpus, bad rules, good
    > rules, etc.) appreciated.


    Looking at the rules, I'm worried about false positives on genuine
    opt-in advertising. I have a number of users who choose to receive all
    kinds of advertising blurb, so I'll run your rules with very low scores
    for a while to see what gets hit.
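
    A minimal sketch of what such a trial could look like, assuming the
    overrides live in a .cf file that SpamAssassin reads after the ruleset
    itself. The 0.1 values are placeholders, not tested settings. They are
    kept above zero because SpamAssassin does not evaluate rules scored 0,
    so those rules would never show up in the logs:

    ```
    # Trial overrides: record hits without materially changing the verdict.
    # 0.1 is an illustrative value; a score of 0 would disable the rule.
    score FR_SPAMISLEGAL        0.1
    score FR_SPAMISLEGAL_2      0.1
    score FR_NOTSPAM            0.1
    score FR_PAYLESSTAXES       0.1
    score FR_REALESTATE_INVEST  0.1
    score FR_ONLINEGAMBLING     0.1
    score FR_ONLINEMEDS         0.1
    score FR_REASON_SUBSCRIBE   0.1
    score FR_HOWTOUNSUBSCRIBE   0.1
    score FR_MAILER_1           0.1
    score FR_MAILER_2           0.1
    ```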

    John.

    --
    -- Over 3000 webcams from ski resorts around the world - www.snoweye.com
    -- Translate your technical documents and web pages - www.tradoc.fr


  9. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    Hi again,

    > Looking at the rules, I'm worried about false positives on genuine opt-in
    > advertising. I have a number of users who choose to receive all kinds of
    > advertising blurb,


    This is one of the reasons why I did not hunt for "click here" and "if you
    can't see this email in html". Now correct me if I am wrong (ouch, no, not
    on the head), but isn't this what whitelist_from is for ? I never was able
    to let the Intel newsletter through (it is in English), it would always be
    caught by SA. Same went for Microsoft Support genuine answers (ok, don't
    laugh).

    >so I'll run your rules with very low scores for a while to see what gets
    >hit.


    You can have a little more information, and exactly this suggestion, by
    reading http://www.saphirtech.fr/spamassassin.html

    JG


  10. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    > Thanks for taking this burden upon yourself. One other thing you should be
    > prepared to do, if you're willing to devote long-term responsibility to these
    > rules, is to provide sa-update-compatible feeds of your dynamic rules. This
    > is another thing that Justin can probably help you with.


    I am happy to try, but I am honestly not worried about the
    feed part; all it boils down to is putting the right file in the right
    place (be it push or pull, ftp or rsync, whatever).

    What I am more worried about is testing the rules regularly and, even
    before that, checking that they are valid. They are "good" on my system
    with my users, but then they were custom-tailored to be so.

    JG


  11. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    Hi again,

    > I excluded the last two rules from my masscheck to avoid FPs as these
    > ESPs/X-Mailers are definitely grey, "import rcpt list and blast" sort of ESPs
    > not black for global use.


    If you can point me to some more information on how to do that, on-list or
    off-list, I am interested. I am new to this whole business.

    In fact I was forced to look at X-Mailer and other strange headers for
    French spam that was still getting through with no really easy keywords, and
    these guys often had the good idea of developing their own "software"
    and being proud of it.

    > #counts FR_SPAMISLEGAL 8s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_SPAMISLEGAL_2 5s/2h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_NOTSPAM 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_PAYLESSTAXES 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_REALESTATE_INVEST 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_ONLINEGAMBLING 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_ONLINEMEDS 0s/0h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_REASON_SUBSCRIBE 1s/1h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    > #counts FR_HOWTOUNSUBSCRIBE 7s/16h of 3859 corpus (1166s/2693h AXB-MC1) 06/23/08
    >
    > If these are hit rates with a very minimal daily corpus, don't know if the
    > present ruleset is ready for production unless you have 0 tolerance for any
    > bulk, period


    I do subscribe to various mailing lists, and none of them seems compelled
    to remind me how to unsubscribe, still less to quote the law on spam to me.

    Even the official government "conseil des ministres" bulletin (a summary of
    the daily/weekly/whatever government meeting) no longer cites the "loi
    informatique et libertés" (but they do use a company I am getting
    a lot of spam from).

    So basically the question is: what makes spam in French recognizable?

    On the other hand I am also worried about the very low hits of most rules.

    If all your 1166 spams are in French, we can throw the whole ruleset to
    /dev/null (well I'll keep it for me anyway).
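
    To put numbers on the quoted counts, here is a quick calculation. It
    assumes S/O is defined as the spam hit rate divided by the sum of the
    spam and ham hit rates (my reading of the hit-frequencies output, so
    treat the formula as an assumption). The corpus is 1166 spam and 2693
    ham messages:

    ```latex
    \begin{align*}
    \text{FR\_HOWTOUNSUBSCRIBE:}\quad
      \text{spam\%} &= \tfrac{7}{1166} \approx 0.600\%, \quad
      \text{ham\%} = \tfrac{16}{2693} \approx 0.594\%, \\
      \text{S/O} &= \frac{0.600}{0.600 + 0.594} \approx 0.50 \\[4pt]
    \text{FR\_SPAMISLEGAL:}\quad
      \text{spam\%} &= \tfrac{8}{1166} \approx 0.686\%, \quad
      \text{ham\%} = \tfrac{2}{2693} \approx 0.074\%, \\
      \text{S/O} &= \frac{0.686}{0.686 + 0.074} \approx 0.90
    \end{align*}
    ```

    So on this corpus the unsubscribe rule hits ham about as often as spam,
    while the law-reminder rule leans clearly towards spam, though on only
    ten hits in total.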

    A++;
    JG



  12. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    Yet Another Ninja wrote:
    > If these are hit rates with a very minimal daily corpus, don't know if
    > the present ruleset is ready for production unless you have 0 tolerance
    > for any bulk, period


    I'm afraid I must agree. I don't have a confirmed and sorted corpus per
    se, but after a single night's live testing with very low scores I can
    confirm that, as I suspected, many of these rules hit genuine opt-in
    newsletters and even things like ebay notifications in French. I will
    however keep the ruleset live for a while, to see whether the online
    meds and online gambling rules actually hit anything.

    My personal tolerance for bulk mail is pretty low, and in a way I'd love
    to use rules like these, with just a bit of fine tuning - the rules do
    also hit a fair bit of French spam. But unfortunately my users actually
    want to receive their newsletters and even complain if they end up in
    their spam folder.
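
    One possible direction for that fine tuning, sketched here as an
    assumption rather than something tested on any corpus: keep the
    individual rules at low scores and let only their combination carry
    weight, via a meta rule. The rule name FR_LEGAL_AND_UNSUB and its score
    are hypothetical:

    ```
    # Hypothetical combination rule: fires only when a law reminder AND
    # unsubscribe boilerplate appear in the same message.
    meta     FR_LEGAL_AND_UNSUB  ((FR_SPAMISLEGAL || FR_SPAMISLEGAL_2) && FR_HOWTOUNSUBSCRIBE)
    describe FR_LEGAL_AND_UNSUB  French: law reminder plus unsubscribe boilerplate
    score    FR_LEGAL_AND_UNSUB  1.5
    ```

    Genuine newsletters may well carry both phrases too, so a rule like
    this would still need checking against a ham corpus.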

    John.



  13. Re: hit frequencies (was Re: [Rule Set proposal] French Rules)

    On Tuesday, 24 June 2008 John Wil**** wrote:
    > with just a bit of fine tuning


    I guess John Gallet needs a bigger corpus; maybe you could share some
    ham/spam with him. He does the work of creating the rules, and with a
    better corpus the rules will get better. I know this: I maintain the
    GERMAN ruleset, and it's hard without any reports from others.

    Regards, zmi
    --
    // Michael Monnerie, Ing.BSc ----- http://it-management.at
    // Tel: 0660 / 415 65 31 .network.your.ideas.
    // PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
    // Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
    // Keyserver: www.keyserver.net Key-ID: 1C1209B4



  14. Philosophy for opt-in (was Re: [Rule Set proposal] French Rules)

    Hi,

    >> If these are hit rates with a very minimal daily corpus, don't know if the
    >> present ruleset is ready for production unless you have 0 tolerance for any
    >> bulk, period

    >
    > I'm afraid I must agree. I don't have a confirmed and sorted corpus per se,
    > but after a single night's live testing with very low scores I can confirm
    > that, as I suspected, many of these rules hit genuine opt-in newsletters and
    > even things like ebay notifications in French.


    Thanks for the feedback. I do not have any ebay subscribers among my users,
    except one power-seller who has the ebay thingies whitelisted.

    >I will however keep the ruleset live for a while, to see whether the
    >online meds and online gambling rules actually hit anything.


    They should; they do on my machines. But actually, they are only useful
    for "new" spam that has not yet been caught by an RBL. When I wrote them,
    it was because spam *was* getting through; now they just push messages
    towards "almost-probably-spam". Another note is that much of this particular
    spam is automatically and badly translated ("pidgin French", so to speak).

    > My personal tolerance for bulk mail is pretty low, and in a way I'd love to
    > use rules like these, with just a bit of fine tuning - the rules do also hit
    > a fair bit of French spam. But unfortunately my users actually want to
    > receive their newsletters and even complain if it ends up in their spam
    > folder.


    I think I have a simple newbie problem of philosophy/strategy: my
    approach, for what it's worth, was to flag anything that contains
    unsubscribe links and French law reminders, because all the ones I
    receive are spam anyway, and to add the opt-in mailings/newsletters I
    receive to whitelist_from in user_prefs, i.e. I kill everything except
    what is explicitly allowed.
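
    For illustration, the whitelist side of this approach might look like
    the following in user_prefs. The addresses and domains are made-up
    placeholders. whitelist_from trusts the From: address alone, which a
    spammer can forge; whitelist_from_rcvd additionally checks the relay
    that handed the message over, which assumes the Received headers are
    being parsed correctly on the system in question:

    ```
    # ~/.spamassassin/user_prefs (placeholder addresses, not real senders)
    whitelist_from       newsletter@lists.example.com
    whitelist_from       *@news.example.org

    # Stricter variant: From: address plus the reverse-DNS domain of the relay
    whitelist_from_rcvd  *@news.example.org  example.org
    ```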

    If that is not the correct approach, I can guarantee you that the rules
    as currently written are bad (too harsh), and I need strategy advice on
    how to manage opt-in lists.

    John


  15. Re: Philosophy for opt-in (was Re: [Rule Set proposal] French Rules)

    John GALLET wrote:
    > I think I have a newbie simple problem of philosophy/strategy: my
    > approach, for what it's worth, was that I flag anything that contains
    > some unsubscribe links and French law reminders because anyway all the
    > ones I receive are spam, and I add the opt-in mailing/newsletter I
    > receive to whitelist_from in user_prefs, i.e. I kill everything except
    > those explicitly allowed.


    That's a strategy I tried when I first started writing SA rules, but
    soon rejected due to the workload of detecting and whitelisting new
    opt-in subscriptions. It may work for you if you don't have many users
    who sign up for this stuff...

    Incidentally, I have a ruleset for French-language "Nigerian" scams
    (which in fact tend to be mostly from Côte d'Ivoire, not Nigeria!) that
    I've been meaning to clean up and make public. I'll try to get round to
    that soon...

    John.


