Indexing a largish collection of mail and usenet messages? - BSD


  1. Indexing a largish collection of mail and usenet messages?

    I have a collection of archives of mailing list and news messages.
    The largest collection is pretty big, about 150,000 messages which
    means about 200 megabytes of text, shortly to be migrated to a FreeBSD
    server. The lists are all active so archives typically add a few
    messages each day. I want to provide a full text search of each
    archive. What software should I use? I have been using the sturdy
    but ancient lqtext package. It's OK, but it has a few bugs I have yet
    to pick and I'm wondering if something better is available.

    First, I am NOT, repeat NOT, asking about web spiders. The messages
    are directly available to indexing software as files on my server, so
    there's no advantage to running them through Apache on the way to the
    indexer. Also, the messages in the archive never change and I know
    what files are new each day, so it would be pointless for a package to
    re-spider the whole archive to look for the new messages. I am not
    unalterably opposed to something that spiders if it is otherwise
    wonderful, but that approach hasn't been fruitful in the past.

    What I want ideally is something that knows enough about the structure
    of mail messages to deal intelligently with headers vs. body, that can
    do something reasonable with MIME and HTML bodies (not urgent, I can
    always run them through demime on the way to the index), and most
    importantly that actually works with 150,000 messages. I've seen
    lots of packages that look promising but that fall over dead once
    they get past 10,000 messages or so.

    User interface isn't particularly important; I can plug it into my
    existing stuff so long as it has the basic functions of taking search
    terms and giving back the locations of the matches. To see the
    current version, bugs and all, see
    http://compilers.iecc.com/compsearch.phtml

    The comp.compilers archive is also indexed in Google and other public
    search engines, which works splendidly, but most of the other lists
    are private so Google is out.

    Any suggestions? Tnx.

    R's,
    John

  2. Re: Indexing a largish collection of mail and usenet messages?

    John L wrote:
    > Any suggestions? Tnx.
    >


    I use an old (pre-commercial) version of glimpse on my home LAN. Works
    fine for me. YMMV. I'm not sure what the terms are for more recent
    versions...

    pete
    --
    pete@fenelon.com "he just stuck to buying beer and pointing at other stuff"

  3. Re: Indexing a largish collection of mail and usenet messages?

    John L wrote:
    > Any suggestions? Tnx.
    >
    > R's,
    > John
    >


    I dunno what your programming skills are like, but the python language
    comes complete with modules that understand mail/news format messages.
    It would probably not be a lot of work to create some kind of b-tree
    index using the (also available) Berkeley DB interfaces in python...
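
A minimal sketch of that idea, using Python's standard `email` package to keep headers and body apart. The names `tokenize`, `add_message`, and `search` are hypothetical, and a plain in-memory dict stands in for the Berkeley DB b-tree, so this shows the shape of the index and nothing about persistence or scale. Since messages are added one at a time, new arrivals can be indexed without touching the rest of the archive.

```python
import email
import re
from collections import defaultdict
from email import policy

WORD = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return WORD.findall(text.lower())


def add_message(index, msg_id, raw_bytes):
    """Add one message to the index, keyed by (field, word)."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # Prefer a text/plain part; HTML-only messages are simply left unindexed here.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    fields = {
        "subject": str(msg.get("Subject", "")),
        "from": str(msg.get("From", "")),
        "body": body,
    }
    for field, text in fields.items():
        for word in tokenize(text):
            index[(field, word)].add(msg_id)


def search(index, field, *words):
    """Return the ids of messages whose field contains every given word."""
    sets = [index.get((field, w.lower()), set()) for w in words]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    index = defaultdict(set)
    add_message(index, "m1", b"From: a@example.com\nSubject: Parser generators\n\nLALR tables.\n")
    print(search(index, "subject", "parser"))
```

Keying on (field, word) is what lets a query say "subject only" or "body only". Ranking, phrase search, stemming, and handling of odd charsets are all left out.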

    --
    ----------------------------------------------------------------------------
    Tim Daneliuk tundra@tundraware.com
    PGP Key: http://www.tundraware.com/PGP/

  4. Re: Indexing a largish collection of mail and usenet messages?

    >I dunno what your programming skills are like, but the python language
    >comes complete with modules that understand mail/news format messages.
    >It would probably not be a lot of work to create some kind of b-tree
    >index using the (also available) Berkeley DB interfaces in python...


    I was hoping to avoid that. There's a lot more to a usable full text
    index than a b-tree full of words.

    R's,
    John

  5. Re: Indexing a largish collection of mail and usenet messages?

    ISYS understands mail message formats. But you'd need to run it on a Windows
    box.
    (www.isys-search.com).

    -- Ian

    "John L" wrote in message
    news:enbubo$17vf$1@gal.iecc.com...
    > >I dunno what your programming skills are like, but the python language
    > >comes complete with modules that understand mail/news format messages.
    > >It would probably not be a lot of work to create some kind of b-tree
    > >index using the (also available) Berkeley DB interfaces in python...
    >
    > I was hoping to avoid that. There's a lot more to a usable full text
    > index than a b-tree full of words.
    >
    > R's,
    > John




  6. Re: Indexing a largish collection of mail and usenet messages?

    Ian wrote:
    >ISYS understands mail message formats. But you'd need to run it on a Windows
    >box.


    Don't have a Windows box, probably just as well.

    Someone suggested the Namazu indexing package. The indexer is written
    in Perl but is adequately fast; the searcher is in C and is very fast.
    It comes with a CGI, but I found it easier to pipe the command-line
    search program into my PHP scripts.
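
For readers who want to try the same piping approach, here is a rough sketch in Python rather than PHP. The function name `run_search` is hypothetical, and the argument order (query first, then index directory) is an assumption about the `namazu` command line that should be checked against the installed version. No flags are passed by default, so the function just returns the program's non-empty output lines and leaves their parsing to the caller.

```python
import subprocess


def run_search(query, index_dir, command=("namazu",)):
    """Run a command-line searcher and return its non-empty output lines.

    `command` is the program plus any flags; the query and index directory
    are appended as separate arguments (assumed order, see the note above).
    """
    argv = [*command, query, index_dir]
    # Passing a list avoids the shell, so user-supplied search terms
    # cannot be interpreted as shell syntax.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]
```

A call might look like `run_search("delivery and signature", "/path/to/index")`, where the index path is a placeholder. `check=True` raises `subprocess.CalledProcessError` when the searcher exits non-zero, which a web front end would want to catch.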

    Visit http://a2.iecc.com/search/spamtools/ to see it in action.
    Namazu understands a reasonable search language with and, or, etc.
    Try searching for words like delivery, signature, response, and
    other e-mail related stuff.

    R's,
    John

