Indexing a largish collection of mail and usenet messages?
I have a collection of archives of mailing list and news messages.
The largest collection is pretty big, about 150,000 messages which
means about 200 megabytes of text, shortly to be migrated to a FreeBSD
server. The lists are all active so archives typically add a few
messages each day. I want to provide a full text search of each
archive. What software should I use? I have been using the sturdy
but ancient lqtext package. It's OK, but it has a few bugs I have yet
to pick and I'm wondering if something better is available.
First, I am NOT, repeat NOT, asking about web spiders. The messages
are directly available to indexing software as files on my server, so
there's no advantage to running them through Apache on the way to the
indexer. Also, the messages in the archive never change and I know
what files are new each day, so it would be pointless for a package to
re-spider the whole archive to look for the new messages. I am not
unalterably opposed to something that spiders if it is otherwise
wonderful, but that approach hasn't been fruitful in the past.
What I want ideally is something that knows enough about the structure
of mail messages to deal intelligently with headers vs. body, that can
do something reasonable with MIME and HTML bodies (not urgent, I can
always run them through demime on the way to the index), and most
importantly that actually works with 150,000 messages. I've seen
lots of packages that look promising but that fall over dead once
they get past 10,000 messages or so.
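To make the headers-vs-body requirement concrete, here is a minimal sketch in Python using only the standard library's email and html.parser modules. It is an illustration, not a feature of any package named in this thread. The function name split_message and the choice of headers to keep are assumptions made for the example.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def split_message(raw):
    """Return (headers, body_text) for one mail or news message.

    Hypothetical helper: headers is a dict of a few selected fields,
    body_text is the concatenated text of all text/plain and text/html parts.
    """
    msg = message_from_string(raw, policy=policy.default)
    headers = {name: str(msg[name] or "")
               for name in ("from", "subject", "date", "message-id")}
    texts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        ctype = part.get_content_type()
        if ctype == "text/plain":
            texts.append(part.get_content())
        elif ctype == "text/html":
            stripper = _TextOnly()
            stripper.feed(part.get_content())
            stripper.close()
            texts.append(" ".join(stripper.chunks))
    return headers, "\n".join(texts)
```

An indexer could then store the header fields and the body text separately, so that a search can be limited to, say, the Subject line. Note that this simple tag stripper also keeps the contents of script and style elements, which a real preprocessor would probably want to drop.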
User interface isn't particularly important; I can plug it into my
existing stuff so long as it has the basic functions of taking search
terms and giving back the locations of the matches. To see the
current version, bugs and all, see
http://compilers.iecc.com/compsearch.phtml
The comp.compilers archive is also indexed in Google and other public
search engines, which works splendidly, but most of the other lists
are private so Google is out.
Any suggestions? Tnx.
R's,
John
Re: Indexing a largish collection of mail and usenet messages?
John L <johnl@iecc.com> wrote:
> Any suggestions? Tnx.
I use an old (pre-commercial) version of glimpse on my home LAN. Works
fine for me. YMMV. I'm not sure what the terms are for more recent
versions...
pete
--
pete@fenelon.com "he just stuck to buying beer and pointing at other stuff"
Re: Indexing a largish collection of mail and usenet messages?
John L wrote:
> I want to provide a full text search of each
> archive. What software should I use?
> [...]
> Any suggestions? Tnx.
I dunno what your programming skills are like, but the Python language
comes complete with modules that understand mail/news format messages.
It would probably not be a lot of work to create some kind of b-tree
index using the (also available) Berkeley DB interfaces in Python...
--
----------------------------------------------------------------------------
Tim Daneliuk tundra@tundraware.com
PGP Key: http://www.tundraware.com/PGP/
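A rough sketch of the kind of word index suggested above, under some loudly stated assumptions: the bsddb module is no longer part of the Python 3 standard library, so this uses the standard sqlite3 module instead (SQLite keeps its tables and indexes in B-trees on disk). The table layout and the names open_index, add_message, and lookup are invented for the example.

```python
import re
import sqlite3

WORD = re.compile(r"[a-z0-9]+")


def open_index(path=":memory:"):
    """Open (or create) the index; pass a filename to keep it on disk."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS docs "
               "(id INTEGER PRIMARY KEY, path TEXT UNIQUE)")
    # The composite primary key gives a B-tree keyed on (word, doc).
    db.execute("CREATE TABLE IF NOT EXISTS postings "
               "(word TEXT, doc INTEGER, PRIMARY KEY (word, doc))")
    return db


def add_message(db, path, text):
    """Index one message; return False if this path is already indexed.

    Archived messages never change, so an already-seen path is skipped
    rather than re-read.
    """
    cur = db.execute("INSERT OR IGNORE INTO docs (path) VALUES (?)", (path,))
    if cur.rowcount == 0:
        return False
    doc_id = cur.lastrowid
    words = set(WORD.findall(text.lower()))
    db.executemany("INSERT INTO postings (word, doc) VALUES (?, ?)",
                   [(w, doc_id) for w in words])
    db.commit()
    return True


def lookup(db, word):
    """Return the sorted paths of all messages containing the word."""
    rows = db.execute(
        "SELECT docs.path FROM postings JOIN docs ON docs.id = postings.doc "
        "WHERE postings.word = ? ORDER BY docs.path", (word.lower(),))
    return [row[0] for row in rows]
```

Only the day's new files need to be passed to add_message. This is a bare word-to-message map: it leaves out ranking, phrase search, stemming, and anything else a full search package would provide.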
Re: Indexing a largish collection of mail and usenet messages?
>I dunno what your programming skills are like, but the python language
>comes complete with modules that understand mail/news format messages.
>It would probably not be a lot of work to create some kind of b-tree
>index using the (also available) Berkeley DB interfaces in python...
I was hoping to avoid that. There's a lot more to a usable full text
index than a b-tree full of words.
R's,
John
Re: Indexing a largish collection of mail and usenet messages?
ISYS understands mail message formats. But you'd need to run it on a Windows
box.
(www.isys-search.com).
-- Ian
"John L" <johnl@iecc.com> wrote in message
news:enbubo$17vf$1@gal.iecc.com...
>>I dunno what your programming skills are like, but the python language
>>comes complete with modules that understand mail/news format messages.
>>It would probably not be a lot of work to create some kind of b-tree
>>index using the (also available) Berkeley DB interfaces in python...
>
> I was hoping to avoid that. There's a lot more to a usable full text
> index than a b-tree full of words.
>
> R's,
> John
Re: Indexing a largish collection of mail and usenet messages?
In article <ep6t6t$1o8j$1@otis.netspace.net.au>,
Ian <waimate01@telstra.com> wrote:
>ISYS understands mail message formats. But you'd need to run it on a Windows
>box.
Don't have a Windows box, probably just as well.
Someone suggested the Namazu indexing package. The indexer is written
in Perl but is adequately fast; the searcher is in C and is very fast.
It comes with a CGI, but I found it easier to pipe the command-line
search program into my PHP scripts.
Visit http://a2.iecc.com/search/spamtools/ to see it in action.
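For anyone wanting to do the same piping from a script, here is a hedged sketch in Python rather than PHP. The wrapper is generic: it assumes only that the searcher takes the query and the index directory as its last two arguments and prints results to standard output. The function name run_search and its parameters are made up for the example, and the output is returned as raw lines, since the exact result format depends on the searcher and its options.

```python
import subprocess


def run_search(command, query, index_dir, timeout=30):
    """Run a command-line searcher and return its non-blank output lines.

    `command` is the program plus any options, given as a list. The query
    and index directory are appended as separate arguments, so no shell is
    involved and search terms typed by a user are never parsed as shell
    syntax.
    """
    result = subprocess.run(
        [*command, query, index_dir],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


# Hypothetical call; the program name and index path are assumptions:
# hits = run_search(["namazu"], "delivery and signature", "/path/to/index")
```

Passing the arguments as a list instead of building a shell command string is the main design point, since the query comes from a web form.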
Namazu understands a reasonable search language with and, or, etc.
Try searching for words like delivery, signature, response, and
other e-mail related stuff.
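As a rough illustration of what an and/or query does to an index, here is a small Python sketch that evaluates such a query over a word-to-messages map with set intersection and union. It is not Namazu's implementation or its exact query syntax. The precedence rule (and binds tighter than or) and the treatment of adjacent words as an implicit and are assumptions of this sketch.

```python
def search(index, query):
    """Evaluate a query like 'a and b or c' over index: word -> set of ids."""
    # Split the query into 'or'-separated clauses of 'and'-ed words.
    clauses, current = [], []
    for token in query.lower().split():
        if token == "or":
            clauses.append(current)
            current = []
        elif token != "and":
            current.append(token)
    clauses.append(current)

    result = set()
    for words in clauses:
        if not words:
            continue
        # Intersect the posting sets of every word in the clause.
        hits = set(index.get(words[0], ()))
        for word in words[1:]:
            hits &= index.get(word, set())
        # Union the clause's matches into the overall result.
        result |= hits
    return result
```

With an index such as {"delivery": {1, 2, 3}, "signature": {2, 3}, "response": {4}}, the query "delivery and signature or response" selects messages 2, 3, and 4.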
R's,
John