What about the junk filter? - Mozilla


Thread: What about the junk filter?

  1. What about the junk filter?

    Can the adaptive junk filter in TB have got less efficient?
    I used to find it very efficient, but I have found SpamBrave for OE much
    more reliable and even SpamBayes for Outlook catches more suspects.
    My provider can tag spam and even with that switched on TB is missing some
    of the tagged ones.
    Has something changed?
    --
    Jim S
    Tyneside UK
    http://www.jimscott.co.uk

  2. Re: What about the junk filter?

    Jim S wrote:
    > Can the adaptive junk filter in TB have got less efficient?
    > I used to find it very efficient, but I have found SpamBrave for OE much
    > more reliable and even SpamBayes for Outlook catches more suspects.
    > My provider can tag spam and even with that switched on TB is missing some
    > of the tagged ones.
    > Has something changed?


    Your question was answered in the other newsgroup.

    My JMC here is still catching 99.8%. One false positive in six months.

    Question: How big is your training.dat file? In some cases, training.dat
    files can grow so large that the effectiveness of JMC is diminished.
    This usually occurs when training.dat files are in excess of 1MB.

    NB: If you are NOT having problems with JMC catch ratio, then don't worry
    about the size of your training.dat. If, however, you are experiencing
    a problem like the above (perceived reduction in effectiveness), then
    checking the size of your training.dat is suggested.
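
    The size check suggested above can be scripted. What follows is only a
    minimal sketch: the function name `training_dat_report` is made up for
    illustration, and the location of training.dat is an assumption (it
    lives in the mail profile folder, whose path differs per machine), so
    the demo writes a throwaway file instead of touching a real profile.

```python
import os
import tempfile

ONE_MB = 1024 * 1024  # the rough threshold mentioned in the post above


def training_dat_report(path, limit_bytes=ONE_MB):
    """Return (size_in_bytes, is_over_limit) for the file at `path`.

    Hypothetical helper: pass it the full path to your own training.dat.
    """
    size = os.path.getsize(path)
    return size, size > limit_bytes


if __name__ == "__main__":
    # Demo on a temporary file, because the real profile path is not known here.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "training.dat")
        with open(demo, "wb") as fh:
            fh.write(b"\0" * (ONE_MB + 1))
        size, too_big = training_dat_report(demo)
        print(size, "bytes; over 1MB:", too_big)
```

    A result over the limit is only a hint that a reset or a prune might be
    worth trying, and only if the catch ratio has actually dropped.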

  3. Re: What about the junk filter?

    Moz Champion (Dan) wrote:
    > Jim S wrote:
    >> Can the adaptive junk filter in TB have got less efficient?
    >> I used to find it very efficient, but I have found SpamBrave for OE much
    >> more reliable and even SpamBayes for Outlook catches more suspects.
    >> My provider can tag spam and even with that switched on TB is missing some
    >> of the tagged ones.
    >> Has something changed?

    >
    > Your question was answered in the other newsgroup.
    >
    > My JMC here is still catching 99.8%. One false positive in six months.
    >
    > Question: How big is your Training.dat file? In some cases, training.dat
    > files can grow so large that the effectiveness of JMC is diminished.
    > This usually occurs when training.dat files are in excess of 1MB.
    >
    > NB: If you are NOT having problems with JMC catch ratio, then dont worry
    > about the size of your training.dat. If, howerever, you are experiencing
    > a problem like the above (perceived reduction in effectiveness) then
    > checking the size of your training.dat is suggested.


    I was having the same problem and had a huge training.dat file, but then
    I deleted it and started again. Within a couple of days it was back to
    catching virtually everything. I think that as the spammers adapt all
    the time, the file becomes clogged with out-of-date filters. It's
    a very rare spam that gets through these days.

    :-)

  4. Re: What about the junk filter?

    Abacus wrote:
    > Moz Champion (Dan) wrote:
    >> Jim S wrote:
    >>> Can the adaptive junk filter in TB have got less efficient?
    >>> I used to find it very efficient, but I have found SpamBrave for OE much
    >>> more reliable and even SpamBayes for Outlook catches more suspects.
    >>> My provider can tag spam and even with that switched on TB is missing
    >>> some
    >>> of the tagged ones.
    >>> Has something changed?

    >>
    >> Your question was answered in the other newsgroup.
    >>
    >> My JMC here is still catching 99.8%. One false positive in six months.
    >>
    >> Question: How big is your Training.dat file? In some cases,
    >> training.dat files can grow so large that the effectiveness of JMC is
    >> diminished. This usually occurs when training.dat files are in excess
    >> of 1MB.
    >>
    >> NB: If you are NOT having problems with JMC catch ratio, then dont
    >> worry about the size of your training.dat. If, howerever, you are
    >> experiencing a problem like the above (perceived reduction in
    >> effectiveness) then checking the size of your training.dat is suggested.

    >
    > I was having the same problem and had a huge training dat file but then
    > I deleted it and started again. Within a couple of days it was back to
    > catching virtually everything. I think that as the spammers adapt all
    > the time then the file becomes clogged with out-of-date filters. Its
    > very rare spam that gets through these days.
    >
    > :-)


    The OP responded in the other group stating he had done that with no
    improvement. We are investigating other avenues.

  5. Re: What about the junk filter?


    "Moz Champion (Dan)" wrote in message
    news:FM6dnQPYXMViyzLYnZ2dnUVZ_sfinZ2d@mozilla.org. ..
    > Jim S wrote:
    >> Can the adaptive junk filter in TB have got less efficient?
    >> I used to find it very efficient, but I have found SpamBrave for OE much
    >> more reliable and even SpamBayes for Outlook catches more suspects.
    >> My provider can tag spam and even with that switched on TB is missing
    >> some
    >> of the tagged ones.
    >> Has something changed?

    >
    > Your question was answered in the other newsgroup.
    >
    > My JMC here is still catching 99.8%. One false positive in six months.
    >


    I am underwhelmed by it. For example, messages with things like '0EM and
    s0ftware' in the subject line get through consistently, so after several
    weeks of so-called 'training' I've had to add a simple filter that removes
    messages with subject lines containing those plus a bunch of other 'obvious'
    words that it seems unable to learn. How many times would I have to click on
    a message containing the word 's0ftware' for it to learn? If I get 1 a day,
    that's 20-plus it has seen, and it still hasn't caught on.


    --
    Tumbleweed

    email replies not necessary but to contact use;
    tumbleweednews at hotmail dot com




  6. Re: What about the junk filter?

    Tumbleweed wrote:
    > "Moz Champion (Dan)" wrote in message
    > news:FM6dnQPYXMViyzLYnZ2dnUVZ_sfinZ2d@mozilla.org. ..
    >> Jim S wrote:
    >>> Can the adaptive junk filter in TB have got less efficient?
    >>> I used to find it very efficient, but I have found SpamBrave for OE much
    >>> more reliable and even SpamBayes for Outlook catches more suspects.
    >>> My provider can tag spam and even with that switched on TB is missing
    >>> some
    >>> of the tagged ones.
    >>> Has something changed?

    >> Your question was answered in the other newsgroup.
    >>
    >> My JMC here is still catching 99.8%. One false positive in six months.
    >>

    >
    > I am underwhelmed by it. For example, messages with things like '0EM and
    > s0ftware' in the subject line get through consistently, so after several
    > weeks of so called 'training' I've had to add a simple filter that removes
    > messages with subject lines containing those plus a bunch of other 'obvious'
    > words that it seems unable to learn. How many times would I have to click on
    > a message containing the word 's0ftware' for it to learn? If I get 1 a day,
    > that 20 plus its seen, and it still hasnt caught on.
    >
    >



    Well, a Bayesian filter doesn't look only at subject lines or such; it
    looks at the entire message.

    And I never found that to be the case. I had to click on a 'similar'
    spam perhaps 3 times, and after that it was caught. My JMC catches
    image spams, every other type of spam, you name it. As I said, it is
    catching 200 spam for every 1 I see in my inbox.

    Have you checked the size of your training.dat file?

  7. Re: What about the junk filter?

    On 2007-01-18 15:16 (-0700 UTC), Tumbleweed wrote:

    > "Moz Champion (Dan)" wrote in message
    > news:FM6dnQPYXMViyzLYnZ2dnUVZ_sfinZ2d@mozilla.org. ..
    >> Jim S wrote:
    >>> Can the adaptive junk filter in TB have got less efficient?
    >>> I used to find it very efficient, but I have found SpamBrave for OE much
    >>> more reliable and even SpamBayes for Outlook catches more suspects.
    >>> My provider can tag spam and even with that switched on TB is missing
    >>> some
    >>> of the tagged ones.
    >>> Has something changed?

    >> Your question was answered in the other newsgroup.
    >>
    >> My JMC here is still catching 99.8%. One false positive in six months.

    >
    > I am underwhelmed by it. For example, messages with things like '0EM and
    > s0ftware' in the subject line get through consistently, so after several
    > weeks of so called 'training' I've had to add a simple filter that removes
    > messages with subject lines containing those plus a bunch of other 'obvious'
    > words that it seems unable to learn. How many times would I have to click on
    > a message containing the word 's0ftware' for it to learn? If I get 1 a day,
    > that 20 plus its seen, and it still hasnt caught on.


    At the risk of starting another flamewar with Dan, part of the problem is
    that how JMC work isn't particularly intuitive.

    For JMC (or any Bayesian filter) to work properly, it needs both good tokens
    (from ham) and bad tokens (from spam), and needs more than just correcting
    false positives. At the simplest level, you need to mark at least one
    message as ham and at least one message as spam.
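
    To make the point about needing both kinds of tokens concrete, here is a
    toy sketch of naive-Bayes-style scoring. It is an illustration only and
    is not Thunderbird's actual algorithm: the class name `TinyBayes`, the
    whitespace tokenizer, and the add-one smoothing are all assumptions
    chosen for brevity. Note that it refuses to score anything until it has
    seen at least one ham and one spam message.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy token classifier (hypothetical; not JMC's real implementation)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the tokens of one message under 'spam' or 'ham'."""
        self.messages[label] += 1
        # Count each distinct token once per message.
        self.counts[label].update(set(text.lower().split()))

    def spam_probability(self, text):
        """Estimate P(spam) from the tokens present in `text`."""
        if min(self.messages.values()) == 0:
            raise ValueError("train at least one ham and one spam message first")
        # Start from the prior odds, then add evidence token by token.
        log_odds = math.log(self.messages["spam"] / self.messages["ham"])
        for token in set(text.lower().split()):
            # Add-one smoothing so unseen tokens do not zero out the estimate.
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


if __name__ == "__main__":
    bayes = TinyBayes()
    bayes.train("cheap s0ftware 0EM", "spam")
    bayes.train("lunch on friday frank", "ham")
    print(bayes.spam_probability("cheap 0EM"))    # leans towards spam
    print(bayes.spam_probability("lunch frank"))  # leans towards ham
```

    With only spam trained there is nothing to compare a token against,
    which is why the sketch raises an error instead of guessing.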

    Since JMC also accumulate a lot of cruft, it doesn't hurt to maintain them
    from time to time, which is what the Bayes Junk Tool
    () allows you to do.

    Depending on how much spam you get, it might take a combination of
    server-side filtering (black-, grey-, and whitelists), server-side spam
    tools (/e.g./, SpamAssassin or some other sort of Bayesian classifier), and
    local spam tools to deal with it properly.

    /b.

    --
    People are stupid. /A/ person may be smart, but /people/ are stupid.
    --Stephen M. Graham

  8. Re: What about the junk filter?

    Brian Heinrich wrote:
    > On 2007-01-18 15:16 (-0700 UTC), Tumbleweed wrote:
    >
    >> "Moz Champion (Dan)" wrote in message
    >> news:FM6dnQPYXMViyzLYnZ2dnUVZ_sfinZ2d@mozilla.org. ..
    >>> Jim S wrote:
    >>>> Can the adaptive junk filter in TB have got less efficient?
    >>>> I used to find it very efficient, but I have found SpamBrave for OE
    >>>> much
    >>>> more reliable and even SpamBayes for Outlook catches more suspects.
    >>>> My provider can tag spam and even with that switched on TB is
    >>>> missing some
    >>>> of the tagged ones.
    >>>> Has something changed?
    >>> Your question was answered in the other newsgroup.
    >>>
    >>> My JMC here is still catching 99.8%. One false positive in six months.

    >>
    >> I am underwhelmed by it. For example, messages with things like '0EM
    >> and s0ftware' in the subject line get through consistently, so after
    >> several weeks of so called 'training' I've had to add a simple filter
    >> that removes messages with subject lines containing those plus a bunch
    >> of other 'obvious' words that it seems unable to learn. How many times
    >> would I have to click on a message containing the word 's0ftware' for
    >> it to learn? If I get 1 a day, that 20 plus its seen, and it still
    >> hasnt caught on.

    >
    > At the risk of starting another flamewar with Dan, part of the problem,
    > is that how JMC work isn't particularly intuitable.
    >
    > For JMC (or any Bayesian filter) to work properly, it needs both good
    > tokens (from ham) and bad tokens (from spam), and needs more than just
    > correcting false positives. At the simplest level, you need to mark at
    > least one message as ham and at least one message as spam.
    >
    > Since JMC also accumulate a lot of cruft, it doesn't hurt to maintain
    > them from time to time, which is was the Bayes Junk Tool
    > () allows you to do.
    >
    > Depending on how much spam you get, it might take a combination of
    > server-side filtering (black-, grey-, and whitelists), server-side spam
    > tools (/e.g./, SpamAssassin or some other sort of Bayesian classifier),
    > and local spam tools to deal with it properly.
    >
    > /b.
    >



    Again, how is a 99.8 percent catch ratio and one false positive in over six
    months not working?



  9. Re: What about the junk filter?

    On 2007-01-18 18:08 (-0700 UTC), Moz Champion (Dan) wrote:

    > Brian Heinrich wrote:




    >> At the risk of starting another flamewar with Dan, part of the
    >> problem, is that how JMC work isn't particularly intuitable.




    > Again, how is 99.8 percent catch ratio and one false postive in over six
    > months not working?


    I'm not talking about your catch ratio; I'm talking about how Bayesian
    filters work.

    /b.

    --
    People are stupid. /A/ person may be smart, but /people/ are stupid.
    --Stephen M. Graham

  10. Re: What about the junk filter?

    Brian Heinrich wrote:
    > On 2007-01-18 18:08 (-0700 UTC), Moz Champion (Dan) wrote:
    >
    >> Brian Heinrich wrote:

    >
    >
    >
    >>> At the risk of starting another flamewar with Dan, part of the
    >>> problem, is that how JMC work isn't particularly intuitable.

    >
    >
    >
    >> Again, how is 99.8 percent catch ratio and one false postive in over
    >> six months not working?

    >
    > I'm not talking about your catch ration; I'm talking about how Bayesian
    > filters work.
    >
    > /b.
    >



    And I am asking you if a 99.8 percent catch ratio with 1 false positive
    in six months is working or not? It is working. You can talk about
    your theories all you wish. The FACT remains, my JMC is working quite
    well, period.

    Flies can't fly either. But they do! You are like the scientist
    explaining again and again how it is impossible for flies to fly, all
    the while swatting at the flies flying about his head.

  11. Re: What about the junk filter?

    On 2007-01-18 19:00 (-0700 UTC), Moz Champion (Dan) wrote:

    > Brian Heinrich wrote:
    >> On 2007-01-18 18:08 (-0700 UTC), Moz Champion (Dan) wrote:
    >>
    >>> Brian Heinrich wrote:

    >>
    >>
    >>
    >>>> At the risk of starting another flamewar with Dan, part of the
    >>>> problem, is that how JMC work isn't particularly intuitable.

    >>
    >>
    >>
    >>> Again, how is 99.8 percent catch ratio and one false postive in over
    >>> six months not working?

    >>
    >> I'm not talking about your catch ration; I'm talking about how
    >> Bayesian filters work.

    >
    > And I am asking you if a 99.8 percent catch ratio with 1 false positive
    > in six months is working or not? It is working. You can talk about
    > your theories all you wish. The FACT remains, my JMC is working quite
    > well, period.
    >
    > Flies cant fly either. But they do! You are like the scientest
    > explaining again and again how it is impossible for flies to fly, all
    > the while swatting at the flies flying about his head.


    Yep, Toon Boy certainly lit on the *perfect* adjective to describe you. . . .

    /b.

    --
    People are stupid. /A/ person may be smart, but /people/ are stupid.
    --Stephen M. Graham

  12. Re: What about the junk filter?

    Moz Champion (Dan) wrote:
    > Well, a Baysien filter doesnt look at only subject lines or such it
    > looks at the entire message.
    >
    > And I never found that to be the case, I had to click on a 'similar'
    > spam for perhaps 3 times, and after that it was caught. My JMC catches
    > image spams, every other type of spam, you name it. As I said it is
    > catching 200 spam for every 1 I see in my inbox..



    Sort of a related question for you: I am fairly familiar with SpamAssassin
    (a server-side spam scoring system). Its Bayesian filter requires that
    you teach it not only what is spam, but also what is NOT spam (i.e., ham).
    To do this, I took a corpus of ~500 legitimate messages and "taught" SA
    that they were ham.


    Now for the question:
    Does TB's JMC learn ham? If so, how (other than marking the rare message
    that is mis-identified as spam as "not junk")?

    --

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Chris Barnes AOL IM: CNBarnes
    chris@txbarnes.com (also MSN IM) Yahoo IM: chrisnbarnes

  13. Re: What about the junk filter?

    Moz Champion (Dan) wrote:
    > Again, how is 99.8 percent catch ratio and one false postive in over six
    > months not working?



    I see that the question I just asked is very much in line with Brian's
    comments (the quote above was your reply to Brian's statement that teaching
    ham was integral to an effective Bayesian filter).

    To which my comment is: I'm not getting anywhere near 98% effectiveness
    with TB's JMC. It's perhaps somewhere around 60%.


    Now granted, that is 60% of what SpamAssassin on the server missed (SA
    catches about 80% of the total spam I receive). But even combining SA with
    JMC, that still works out to a 92% TOTAL effective rate.
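
    The 92% figure follows from chaining the two filters: SA lets 20% of
    the spam through, and JMC then lets 40% of that remainder through, so
    0.20 x 0.40 = 8% reaches the inbox. A small sketch of that arithmetic
    (the function name `combined_catch_rate` is just illustrative):

```python
def combined_catch_rate(*stage_rates):
    """Overall fraction of spam caught by filters applied one after another.

    Each rate is the fraction of the spam *reaching that stage* that the
    stage catches, e.g. 0.80 for the server filter and 0.60 for the client.
    """
    missed = 1.0
    for rate in stage_rates:
        missed *= 1.0 - rate  # share of spam that slips past this stage
    return 1.0 - missed


if __name__ == "__main__":
    # SpamAssassin at 80%, then JMC at 60% of what is left.
    print(round(combined_catch_rate(0.80, 0.60), 2))  # 0.92
```

    The same function shows how much the second stage would need to improve:
    plugging in other client-side rates gives the corresponding total.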

    --

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Chris Barnes AOL IM: CNBarnes
    chris@txbarnes.com (also MSN IM) Yahoo IM: chrisnbarnes

  14. Re: What about the junk filter?

    On 2007-01-19 08:42 (-0700 UTC), Chris Barnes wrote:

    > Moz Champion (Dan) wrote:
    >> Again, how is 99.8 percent catch ratio and one false postive in over
    >> six months not working?

    >
    > I see that the question I just asked is very much in line with Brian's
    > comments (the quote above was your reply to Brian's statement that
    > teaching ham was integral to an effective Baysian filter).
    >
    > To which my comment is: I'm not getting anywhere near 98% effectiveness
    > with TB's JMC. It's perhaps somewhere around 60%.
    >
    > Now granted, that is 60% of what SpamAssassin on the server missed (SA
    > catches about 80% of the total spam I receive). But even combining SA
    > with JMC, that still works out to a 92% TOTAL effective rate.


    Depending on how much and the nature of the spam you receive, that figure
    (60%) may or may not be reasonable -- especially since (as you mention) it's
    60% of what SpamAssassin isn't catching.

    Dan would likely suggest the 'canonical' method for dealing with JMC when
    it loses efficiency or becomes 'bloated', which is to reset training.dat --
    there's a UI for it -- and start from scratch.

    I find it easier just to use the Bayes Junk Tool
    () to clean out the cruft from training.dat.
    Values of 5 for good tokens and 20 for bad tokens seem to work all right
    for me. . . .

    /b.

    --
    People are stupid. /A/ person may be smart, but /people/ are stupid.
    --Stephen M. Graham

  15. Re: What about the junk filter?

    Moz Champion (Dan) wrote:
    > Brian Heinrich wrote:
    >> On 2007-01-18 15:16 (-0700 UTC), Tumbleweed wrote:
    >>
    >>> "Moz Champion (Dan)" wrote in message
    >>> news:FM6dnQPYXMViyzLYnZ2dnUVZ_sfinZ2d@mozilla.org. ..
    >>>> Jim S wrote:
    >>>>> Can the adaptive junk filter in TB have got less efficient?
    >>>>> I used to find it very efficient, but I have found SpamBrave for OE
    >>>>> much
    >>>>> more reliable and even SpamBayes for Outlook catches more suspects.
    >>>>> My provider can tag spam and even with that switched on TB is
    >>>>> missing some
    >>>>> of the tagged ones.
    >>>>> Has something changed?
    >>>> Your question was answered in the other newsgroup.
    >>>>
    >>>> My JMC here is still catching 99.8%. One false positive in six months.
    >>>
    >>> I am underwhelmed by it. For example, messages with things like '0EM
    >>> and s0ftware' in the subject line get through consistently, so after
    >>> several weeks of so called 'training' I've had to add a simple filter
    >>> that removes messages with subject lines containing those plus a
    >>> bunch of other 'obvious' words that it seems unable to learn. How
    >>> many times would I have to click on a message containing the word
    >>> 's0ftware' for it to learn? If I get 1 a day, that 20 plus its seen,
    >>> and it still hasnt caught on.

    >>
    >> At the risk of starting another flamewar with Dan, part of the
    >> problem, is that how JMC work isn't particularly intuitable.
    >>
    >> For JMC (or any Bayesian filter) to work properly, it needs both good
    >> tokens (from ham) and bad tokens (from spam), and needs more than just
    >> correcting false positives. At the simplest level, you need to mark
    >> at least one message as ham and at least one message as spam.
    >>
    >> Since JMC also accumulate a lot of cruft, it doesn't hurt to maintain
    >> them from time to time, which is was the Bayes Junk Tool
    >> () allows you to do.
    >>
    >> Depending on how much spam you get, it might take a combination of
    >> server-side filtering (black-, grey-, and whitelists), server-side
    >> spam tools (/e.g./, SpamAssassin or some other sort of Bayesian
    >> classifier), and local spam tools to deal with it properly.
    >>
    >> /b.
    >>

    >
    >
    > Again, how is 99.8 percent catch ratio and one false postive in over six
    > months not working?
    >
    >

    Please let this Catch-22 die before it gets Started. Just explain to
    the fellow how it is supposed to work in Theory. And let it go. Please!!

    --
    ------------------------------------------------------------------------
    Phillip M. Jones, CET http://www.vpea.org
    If it's "fixed", don't "break it"! mailtojones@kimbanet.com
    http://www.kimbanet.com/~pjones/default.htm
    ------------------------------------------------------------------------

  16. Re: What about the junk filter?


    "Moz Champion (Dan)" wrote in message
    newsoednT4B542Phy3YnZ2dnUVZ_sfinZ2d@mozilla.org...

    >
    > Again, how is 99.8 percent catch ratio and one false postive in over six
    > months not working?
    >


    Does the JMC filter only work if it's you? Otherwise, why doesn't it work for
    me and others? Is our spam more sophisticated?

    My training.dat is 748 KB. That's after about 2 months of use. Should I delete
    it and start again?

    Also, how do I teach it what a 'good' message is? Is it necessary to have
    false hits and tell it they aren't spam? I don't get many of those, just lots
    of spam that isn't marked as such. I'd have thought that if every single
    message that contained the word '0EM' was marked by me as spam (when it
    missed it, as it seems to), then after at least 20 hits, maybe many more, it
    would have worked that out.

    --
    Tumbleweed

    email replies not necessary but to contact use;
    tumbleweednews at hotmail dot com




  17. Re: What about the junk filter?

    On 2007-01-19 15:20 (-0700 UTC), Tumbleweed wrote:

    > "Moz Champion (Dan)" wrote in message
    > newsoednT4B542Phy3YnZ2dnUVZ_sfinZ2d@mozilla.org...
    >
    >> Again, how is 99.8 percent catch ratio and one false postive in over six
    >> months not working?

    >
    > does the JMC filter only work if its you? Otherwise, why doesnt it work for
    > me and others? Is our spam more sophisticated?


    Dan has had consistently good results with JMC ever since they were
    introduced in late '02/early '03, even tho' he doesn't train JMC in the
    manner suggested.

    Without actually seeing his spam, it's almost impossible to tell, but there
    can be considerable variability in the type of spam one sees; for instance,
    at work, I would see very different spam than I see at home.

    > my training.dat is 748kb. Thats after about 2 months of use. Should I delete
    > and start again?


    You can if you wish; certainly, that could be considered the canonical way,
    in as much as there is a UI to do so. However, I prefer to use the Bayes
    Junk Tool () to prune and maintain my
    training.dat.

    As I've said elsewhere, I find that, given the amount of spam I've received,
    using a cut-off of five for good tokens and 20 for bad tokens seems to keep
    JMC working well here.
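
    As a rough picture of what pruning with a cut-off could look like, here
    is a sketch that drops every token seen fewer times than the cut-off.
    This is an assumption about the idea, not a description of how the Bayes
    Junk Tool is implemented; the function `prune_tokens` and the sample
    counts are invented for illustration.

```python
def prune_tokens(counts, cutoff):
    """Keep only tokens whose count is at least `cutoff`.

    `counts` maps token -> number of times it was seen (hypothetical layout).
    """
    return {token: n for token, n in counts.items() if n >= cutoff}


if __name__ == "__main__":
    good = {"hello": 7, "frank": 2, "meeting": 5}      # made-up ham tokens
    bad = {"s0ftware": 25, "cheep": 3, "0em": 20}      # made-up spam tokens
    print(prune_tokens(good, 5))   # cut-off of five for good tokens
    print(prune_tokens(bad, 20))   # cut-off of 20 for bad tokens
```

    The appeal over a full reset is that frequently seen tokens survive, so
    the filter does not have to relearn everything from scratch.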

    > Also, how do I teach it what a 'good' message is?


    Select the message in the summary pane, then Mark | As Not Junk.

    Bear in mind that you don't need to seed the database with every ham message
    you get; depending on your volume of e-mail -- and of spam -- , marking one
    or two every week or so should be adequate.

    > is it necessary to have
    > false hits and tell it they arent spam?


    No, although correcting false positives is a nice, straight-forward way to
    seed the database with good tokens.

    > I dont get many of those, just lots
    > of spam that isnt marked as such. I'd have thought that if every single
    > message that contained the word '0EM' was marked by me as spam (when it
    > missed it as it seems to) that after at least 20 hits, maybe many more, it
    > would have worked that out.


    Not necessarily. OEM on its own is just a token. If you've marked as ham
    e-mail in which you've corresponded with someone named Frank about OEM
    software, there's a not-insignificant probability that an e-mail containing
    both the tokens |Frank| and |OEM| won't be marked as spam.

    The problem is that the token |OEM| might be found in conjunction with
    |cheap|, |che ap|, |cheep|, |ch eep|, |ch eap|, |Microsoft|, |Macrosoft|,
    |Microsfat|, |Microsaft|, |Macrosaft|, &c, &c; in other words, all those
    mis-spellings are just another way of 'poisoning' Bayesian filters.
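
    A quick way to see why mis-spellings matter: to a token-based filter,
    every variant is a separate token with its own (initially empty)
    history. The sketch below uses a deliberately simple tokenizer, which is
    an assumption; JMC's real tokenizer is not described in this thread.

```python
import re


def tokens(text):
    """Lower-case `text` and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


if __name__ == "__main__":
    clean = tokens("Cheap OEM software")
    mangled = tokens("che ap 0EM s0ftware")
    print(clean)
    print(mangled)
    # No token is shared, so training on one message says nothing about the other.
    print(set(clean) & set(mangled))
```

    Under this tokenizer the two subject lines have no tokens in common, so
    whatever was learned from the first does not carry over to the second.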

    That's actually part of the reason why I use the BJT to prune/maintain my
    training.dat: I already /have/ a goodly number of both good and bad tokens
    -- something like 2 200 when I last pruned it -- , so it makes more sense to
    me to be able to continue to make use of what I already have than to chuck
    it and start from scratch. . . .

    It works well for me; YMMV. . . .

    /b.

    --
    People are stupid. /A/ person may be smart, but /people/ are stupid.
    --Stephen M. Graham

  18. Re: What about the junk filter?

    Chris Barnes wrote:
    > Moz Champion (Dan) wrote:
    >> Well, a Baysien filter doesnt look at only subject lines or such it
    >> looks at the entire message.
    >>
    >> And I never found that to be the case, I had to click on a 'similar'
    >> spam for perhaps 3 times, and after that it was caught. My JMC catches
    >> image spams, every other type of spam, you name it. As I said it is
    >> catching 200 spam for every 1 I see in my inbox..

    >
    >
    > Sort of a related question for you: I am fairly familiar with
    > SpamAssassin (a server side spam scoring system). It's Baysian filter
    > requires that it you teach it not only what is spam, but also what is
    > NOT spam (ie. ham).
    > To do this, I took a corpus of ~500 legitimate messages and "taught" SA
    > that they were ham.
    >
    >
    > Now for the question:
    > Does TB's JMC learn ham? If so, how (other than marking the rare
    > message that is mis-identified as spam as "not junk")?
    >



    In a minimal manner.
    When you first start JMC, all messages are marked as spam until you mark
    some as non junk (and restart). Then all messages are not marked as spam
    until you mark some as junk (and restart).

    Other than that, it's false positives.

    But, then, why does it need to? A catch ratio of 99.8% is hardly poor or
    non effective. In other words, for every 1 spam that 'gets through'
    there were 200 spams that did not. Quite effective.

    How much time did you spend marking those 500 messages to teach it ham?
    Twenty minutes? Well, I could mark all the ones JMC misses for almost 10
    years with those twenty minutes (2 seconds per). So even if teaching it
    what is ham improved your catch ratio to 100% (which I doubt), you have
    still spent MORE time doing it than I will in 10 years.

    Teaching it with more ham will NOT improve your catch ratio; it will
    improve your false positive ratio, which even you admit is rare! So
    what is the point? You might be 500 times LESS likely to get a false
    positive than I am. I only get one in six months anyway; do you get one
    in 250 years?

  19. Re: What about the junk filter?

    Tumbleweed wrote:
    > "Moz Champion (Dan)" wrote in message
    > newsoednT4B542Phy3YnZ2dnUVZ_sfinZ2d@mozilla.org...
    >
    >> Again, how is 99.8 percent catch ratio and one false postive in over six
    >> months not working?
    >>

    >
    > does the JMC filter only work if its you? Otherwise, why doesnt it work for
    > me and others? Is our spam more sophisticated?
    >
    > my training.dat is 748kb. Thats after about 2 months of use. Should I delete
    > and start again?
    >
    > Also, how do I teach it what a 'good' message is? is it necessary to have
    > false hits and tell it they arent spam? I dont get many of those, just lots
    > of spam that isnt marked as such. I'd have thought that if every single
    > message that contained the word '0EM' was marked by me as spam (when it
    > missed it as it seems to) that after at least 20 hits, maybe many more, it
    > would have worked that out.
    >

    That's not what I wrote.


    IF you are not having problems, then don't even WORRY about the size of
    your training.dat.
    PROBLEMS due to training.dat size don't even START till over 1MB anyway.


    Well, I don't know about more sophisticated spam, but whenever a user
    here has forwarded on spam to see if my JMC 'catches' it, it has, when
    his has not!

    What is YOUR catch ratio? To figure it out, simply count the number of
    messages that you see in your inbox(es) and then the number of spam
    messages caught by JMC over the same period of time. To make it easier
    to figure, use the length of time it takes to accumulate 100 spams
    (caught by JMC) as the time period.

    All I do is mark any message that is spam as junk, and unmark any
    message that isn't spam (but JMC caught) as non junk. Period. I don't
    worry about 'ham' or teaching it ham, or what's good or not. All I care
    about is it catching most of the spam and putting it aside for me. And
    it is catching 99.8% of the spam sent my way.
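
    The catch-ratio calculation described above comes down to one division.
    The sketch below assumes that "messages you see in your inbox" means
    spam that slipped through; the function name `catch_ratio` is
    illustrative only.

```python
def catch_ratio(caught, missed):
    """Fraction of all spam received that the filter caught.

    caught: spam messages the filter flagged as junk
    missed: spam messages that reached the inbox unflagged
    """
    total = caught + missed
    if total == 0:
        raise ValueError("no spam counted yet")
    return caught / total


if __name__ == "__main__":
    for caught, missed in [(100, 1), (200, 1), (499, 1)]:
        print(caught, missed, round(catch_ratio(caught, missed) * 100, 1), "%")
```

    For reference, 499 caught for every 1 missed works out to 99.8%, while
    200 caught for every 1 missed is roughly 99.5%.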



  20. Re: What about the junk filter?


    "Brian Heinrich" wrote in message
    news:jrednZkHcOE2xCzYnZ2dnUVZ_segnZ2d@mozilla.org. ..
    >
    >> I dont get many of those, just lots of spam that isnt marked as such. I'd
    >> have thought that if every single message that contained the word '0EM'
    >> was marked by me as spam (when it missed it as it seems to) that after at
    >> least 20 hits, maybe many more, it would have worked that out.

    >
    > Not necessarily. OEM on itself is just a token. If you've marked e-mail
    > in which you've corresponded with someone named Frank about OEM software
    > as ham, there's a not-insignificant probability that the tokens |Frank|
    > and |OEM| in the same e-mail won't be marked as spam.
    >
    > The problem is that the token |OEM| might be found in conjunction with
    > |cheap|, |che ap|, |cheep|, |ch eep|, |ch eap|, |Microsoft|, |Macrosoft|,
    > |Microsfat|, |Microsaft|, |Macrosaft|, &c, &c; in other words, all those
    > mis-spellings are just another way of 'poisoning' Bayesian filters.
    >


    This is '0EM', where the '0' is a zero, not a letter. Hence it's never
    encountered elsewhere, which is why I am surprised it's missing it. Same for
    's0ftware', where the second character is a number, not a letter.

    --
    Tumbleweed

    email replies not necessary but to contact use;
    tumbleweednews at hotmail dot com



