I haven't run any real statistics on this, but keep in mind that unless
a significant number of spams actually show this behavior, a rule probably
costs more in resource use than it returns in hits.

A quick:

pcregrep -ri 'http://(?:[^/.]+\.){7}'

in my corpus shows about 20 spam hits in some 245000 mails. There could be
reasons this RE wouldn't hit, but in general I wouldn't bother.
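
To put that number in perspective, here is a rough Python sketch of the same count on an in-memory list of messages. The sample messages and the `hit_rate` helper are invented for illustration. The regex is the one from the pcregrep call above, and the last line just redoes the arithmetic for about 20 hits in some 245000 mails.

```python
import re

# Same pattern as the pcregrep call: "http://" followed by
# 7 dot-terminated host labels, matched case-insensitively.
DEEP_HOST = re.compile(r'http://(?:[^/.]+\.){7}', re.IGNORECASE)

def hit_rate(messages):
    """Return (hits, fraction) of messages containing a deeply nested host."""
    hits = sum(1 for body in messages if DEEP_HOST.search(body))
    return hits, (hits / len(messages) if messages else 0.0)

# Hypothetical sample, not taken from any real corpus.
sample = [
    "see http://a.b.c.d.e.f.g.example.com/ for details",
    "see http://www.example.com/ for details",
]
hits, fraction = hit_rate(sample)

# The corpus figure quoted above: about 20 hits in some 245000 mails,
# i.e. roughly 0.008 % of the corpus.
corpus_fraction = 20 / 245000
```

At that rate the rule would fire on fewer than one mail in ten thousand, which is the reason for not bothering.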

On Tue, Apr 22, 2008 at 01:24:37AM +0200, Karsten Bräckelmann wrote:
> On Mon, 2008-04-21 at 22:16 +0200, mouss wrote:
> > untested yet:
> >
> > uri URI_DEEP5 m|https?://[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.|
> > score URI_DEEP5 0.1
> >
> > uri URI_DEEP6 m|https?://[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.|
> > score URI_DEEP6 1.0
> >
> > uri URI_DEEP7 m|https?://[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.[\w-]\.|
> > score URI_DEEP7 2.0

> Beware, those are adding up. Since you didn't anchor the end of the RE
> to ($|/), whatever hits URI_DEEP7 hits the previous ones, too. Effective
> score: 3.1
>
> They don't work anyway. You are testing for single chars between the
> dots. And the '-' should be first in a char class, if it is to represent
> itself. Also, I'd prefer to keep them cleaner and more readable using
> quantifiers, rather than copying parts 7 times...
>
> uri URI_DEEP7 m,https?://([-\w]+\.){6},
>
> The above forces 6 dots, and thus "7 levels". Hits on even longer URIs,
> too -- the same constraint of adding scores applies here.
>
> Oh, and yes -- this one is untested, too.
>
> guenther
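
The points in the quoted reply are easy to check outside SpamAssassin. The Python sketch below is only an approximation: Python's re module stands in for Perl regexes, the test URL is invented, and the `(?:/|$)` variant is one reading of the "anchor the end" remark, not a rule anyone posted. The patterns are named by their number of dots to avoid confusion over how "levels" are counted.

```python
import re

# Invented test URL with exactly 7 host labels (6 dots).
url = "http://a1.b2.c3.d4.e5.f6.com/index.html"

# Original style: exactly one character per label, so a host with
# multi-character labels never matches.
single = re.compile(r'https?://' + r'[\w-]\.' * 5)

def deep(n):
    """Quantifier style from the reply: n dot-terminated labels, unanchored."""
    return re.compile(r'https?://(?:[-\w]+\.){%d}' % n)

# Unanchored: the 6-dot URL also satisfies the 4-dot and 5-dot patterns,
# so scores attached to shallower rules would add up.
stacked = [n for n in (4, 5, 6) if deep(n).search(url)]

def exact(n):
    """Anchored sketch: n dots, a final label, then '/' or end of string."""
    return re.compile(r'https?://(?:[-\w]+\.){%d}[-\w]+(?:/|$)' % n)

# Anchored: only the pattern with the matching dot count hits.
only = [n for n in (4, 5, 6) if exact(n).search(url)]

# In Python's re, at least, a trailing '-' in a class is also taken literally.
hyphen_ok = bool(re.fullmatch(r'[\w-]', '-'))
```

Here `single` finds nothing, `stacked` comes out as all three dot counts, and `only` contains just 6, which is the adding-up problem and one possible fix in miniature.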
