robots.txt: Good, Bad, Ugly? - Security




  1. robots.txt: Good, Bad, Ugly?

    Do civilized modern web crawlers still use robots.txt? In a recent
    debate a friend was suggesting that robots.txt should not be used
    anymore since there are other means of authorizing/restricting access
    to a web site. After the debate I was left with the impression that
    only malicious individuals seek the contents of robots.txt.

    Opinions?


  2. Re: robots.txt: Good, Bad, Ugly?

    caveman@archaeologist.com wrote:
    > Do civilized modern web crawlers still use robots.txt? In a recent
    > debate a friend was suggesting that robots.txt should not be used
    > anymore since there are other means of authorizing/restricting access
    > to a web site. After the debate I was left with the impression that
    > only malicious individuals seek the contents of robots.txt.
    >
    > Opinions?
    >

    All the reputable search engines honour the robots.txt file. The
    malicious ones are the ones that ignore it, and yes, you do need other
    means of restricting them.

    --
    Dave
    mail da ve@llondel.org (without the space)
    http://www.llondel.org
    So many gadgets, so little time

  3. Re: robots.txt: Good, Bad, Ugly?


    >All the reputable search engines honour the robots.txt file. The
    >malicious ones are the ones that ignore it, and yes, you do need other
    >means of restricting them.


    Does anybody maintain a block-list of abusive search engines?

    What fraction of them are trojaned boxes?

    --
    The suespammers.org mail server is located in California. So are all my
    other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
    commercial e-mail to my suespammers.org address or any of my other addresses.
    These are my opinions, not necessarily my employer's. I hate spam.


  4. Re: robots.txt: Good, Bad, Ugly?

    caveman@archaeologist.com said:
    >Do civilized modern web crawlers still use robots.txt? In a recent
    >debate a friend was suggesting that robots.txt should not be used
    >anymore since there are other means of authorizing/restricting access
    >to a web site. After the debate I was left with the impression that
    >only malicious individuals seek the contents of robots.txt.


    Well, robots.txt was never meant to restrict access; it is just
    a hint to the crawlers that a certain part of the site does not
    contain material worth indexing (whatever that might be on a given
    site).

    As was said in the other response, "good" crawlers still honor it;
    "bad" ones apparently either just ignore it, or use it in an attempt
    to find non-public information. The latter is not a problem as long
    as you don't try to use robots.txt as a mechanism to protect data
    content.
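
    As a rough illustration of what "honoring" the file means on the
    crawler side, here is a minimal sketch using Python's standard
    urllib.robotparser. The robots.txt content, the bot name and the URLs
    are made up for the example; the point is only that the check is
    voluntary and happens in the crawler, not on the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /bad.html
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler would run this check before every request.
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/private/notes.html"):
        print(url, allowed("ExampleBot", url))
```

    Nothing stops a client from skipping this check, which is why the
    file cannot protect anything.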
    --
    Wolf a.k.a. Juha Laiho Espoo, Finland
    (GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
    PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
    "...cancel my subscription to the resurrection!" (Jim Morrison)

  5. Re: robots.txt: Good, Bad, Ugly?

    Dave {Reply Address In.sig} wrote:
    > All the reputable search engines honour the robots.txt file. The
    > malicious ones are the ones that ignore it, and yes, you do need other
    > means of restricting them.
    >

    Could you expand on "other means"? I use .htaccess, .htpasswd, etc.,
    but what else do you have in mind?

    --
    ----------------
    Barton L. Phillips
    Applied Technology Resources, Inc.
    Tel: (818)652-9850
    Web: http://www.applitec.com

  6. Re: robots.txt: Good, Bad, Ugly?

    On Thu, 21 Sep 2006 17:17:47 GMT, "Barton L. Phillips"
    wrote:

    >Dave {Reply Address In.sig} wrote:
    >> All the reputable search engines honour the robots.txt file. The
    >> malicious ones are the ones that ignore it, and yes, you do need other
    >> means of restricting them.
    >>

    >Could you expand on "other means". I use .htaccess and .htpasswd etc.
    >but what else do you have in mind?


    Here's some stuff that may help:
    ftp://yesican.chsoft.biz/pub/robots.zip

    I must warn you that I don't have time to educate you. If you can't
    make it work, don't ask for my help.
    --
    buck


  7. Re: robots.txt: Good, Bad, Ugly?

    buck wrote:
    > On Thu, 21 Sep 2006 17:17:47 GMT, "Barton L. Phillips"
    > wrote:
    >
    >> Dave {Reply Address In.sig} wrote:
    >>> All the reputable search engines honour the robots.txt file. The
    >>> malicious ones are the ones that ignore it, and yes, you do need other
    >>> means of restricting them.
    >>>

    >> Could you expand on "other means". I use .htaccess and .htpasswd etc.
    >> but what else do you have in mind?

    >
    > Here's some stuff that may help:
    > ftp://yesican.chsoft.biz/pub/robots.zip
    >
    > I must warn you that I don't have time to educate you. If you can't
    > make it work, don't ask for my help.
    > --
    > buck
    >

    Thanks, it looks pretty straightforward. I don't think I will need any
    help if I choose to use it.

    Have you seen any good bots that violate the robots.txt file? Can you
    (will you) post your 'badrobot' file, or alternatively send it to me?
    It would be interesting to see your results.

    --
    ----------------
    Barton L. Phillips
    Applied Technology Resources, Inc.
    Tel: (818)652-9850
    Web: http://www.applitec.com

  8. Re: robots.txt: Good, Bad, Ugly?

    Barton L. Phillips wrote:
    > Dave {Reply Address In.sig} wrote:
    >> All the reputable search engines honour the robots.txt file. The
    >> malicious ones are the ones that ignore it, and yes, you do need other
    >> means of restricting them.
    >>

    > Could you expand on "other means". I use .htaccess and .htpasswd etc.
    > but what else do you have in mind?
    >

    It depends on how sophisticated you want to get. I came across
    http://www.neilgunton.com/spambot_trap/ when browsing one day and it
    looked like fun.

    --
    Dave
    mail da ve@llondel.org (without the space)
    http://www.llondel.org
    So many gadgets, so little time

  9. Re: robots.txt: Good, Bad, Ugly?

    Dave {Reply Address In.sig} wrote:
    > Barton L. Phillips wrote:
    >> Dave {Reply Address In.sig} wrote:
    >>> All the reputable search engines honour the robots.txt file. The
    >>> malicious ones are the ones that ignore it, and yes, you do need other
    >>> means of restricting them.
    >>>

    >> Could you expand on "other means". I use .htaccess and .htpasswd etc.
    >> but what else do you have in mind?
    >>

    > It depends on how sophisticated you want to get. I came across
    > http://www.neilgunton.com/spambot_trap/ when browsing one day and it
    > looked like fun.
    >

    This looks a lot like what Buck sent me in a zip file. I am examining
    both and may (probably more for fun than need) implement something (at
    least for a little while and look at the logs).

    In reality I have not seen any problems relating to bots. I check my
    logs for user agents and look at who/what is reading my robots.txt and
    have not seen anything that alarmed me to date. However, buck's logic
    will give me a little bit better log to look at.

    Thanks

    --
    ----------------
    Barton L. Phillips
    Applied Technology Resources, Inc.
    Tel: (818)652-9850
    Web: http://www.applitec.com

  10. Re: robots.txt: Good, Bad, Ugly?

    On Thu, 21 Sep 2006 19:22:48 GMT, "Barton L. Phillips"
    wrote:

    >buck wrote:
    >> On Thu, 21 Sep 2006 17:17:47 GMT, "Barton L. Phillips"
    >> wrote:
    >>
    >>> Dave {Reply Address In.sig} wrote:
    >>>> All the reputable search engines honour the robots.txt file. The
    >>>> malicious ones are the ones that ignore it, and yes, you do need other
    >>>> means of restricting them.
    >>>>
    >>> Could you expand on "other means". I use .htaccess and .htpasswd etc.
    >>> but what else do you have in mind?

    >>
    >> Here's some stuff that may help:
    >> ftp://yesican.chsoft.biz/pub/robots.zip
    >>
    >> I must warn you that I don't have time to educate you. If you can't
    >> make it work, don't ask for my help.
    >> --
    >> buck
    >>

    >Thanks, it looks pretty straightforward. I don't think I will need any
    >help if I choose to use it.
    >
    >Have you seen any good bots that violate the robots.txt file? Can you
    >(will you) post your 'badrobot' file, or alternatively send it to me?
    >It would be interesting to see your results.


    It really wouldn't help because almost all of the entries in badrobots
    are one-shot abusers. They come to steal graphics or to look for PHP
    (etc.) to abuse, get caught in the spider trap by bad.html, and are
    done. I don't keep track of second attempts, but as far as I can tell
    they never come back.

    I have Apache set to resolve IPs. Most of the entries in badrobots
    did not resolve. Many are "IANA Reserved".

    If you really think it would be informative, I'll make the file
    available for you. I just think it wastes both your time and mine.

    One thing I _do_ use it for is to stop one "good" robot I don't want
    to allow. I could see if a regular robots.txt would be honored, but
    since I had this I used it instead because it is bulletproof while
    robots.txt is optional.

    NONE of the "good" robots EVER has abused robots.txt. Google, Yahoo,
    Ask Jeeves, AltaVista, LookSmart, Lycos, Mamma, Netscape, WiseNut...
    A few AOL searches hit the spider trap. I purge AOL from badrobots as
    soon as I find them because by then the abuser has lost interest.
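
    The contents of robots.zip are not shown in this thread, so the
    following is not buck's code, only a minimal sketch of the general
    spider-trap idea: list a page such as /bad.html under Disallow in
    robots.txt, then treat any client that requests it anyway as a bad
    robot. It assumes Apache's common/combined log layout; the sample
    log lines and the exempt proxy suffix are hypothetical.

```python
import re

# Assumed Apache common/combined log layout; adjust if your LogFormat differs.
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*"'
)

TRAP_PATH = "/bad.html"  # trap page; it must be disallowed in robots.txt


def trapped_clients(log_lines, trap_path=TRAP_PATH, exempt_suffixes=()):
    """Return sorted clients that requested the trap page.

    Clients whose name ends with one of exempt_suffixes (for example shared
    proxies that should not stay blocked) are skipped.
    """
    hits = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a request line we understand
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        client = match.group("client")
        if path == trap_path and not client.endswith(tuple(exempt_suffixes)):
            hits.add(client)
    return sorted(hits)


# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [21/Sep/2006:17:20:01 +0000] "GET /bad.html HTTP/1.0" 200 512',
    'cache-01.proxy.example.net - - [21/Sep/2006:17:21:40 +0000] "GET /bad.html HTTP/1.1" 200 512',
    '198.51.100.7 - - [21/Sep/2006:17:22:13 +0000] "GET /index.html HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    print(trapped_clients(SAMPLE_LOG, exempt_suffixes=(".proxy.example.net",)))
```

    The resulting list could then feed whatever deny mechanism the server
    uses; that part is left out because it depends on the setup.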
    --
    buck

  11. Re: robots.txt: Good, Bad, Ugly?

    "buck" wrote in message
    news:n517h2lrceljl5rmcdigjl26edeef5bvq7@4ax.com

    > I have Apache set to resolve IPs. Most of the entries in badrobots
    > did not resolve. Many are "IANA Reserved".


    Processing such files with "jdresolve -n -r -a /file/name >
    resolved_filename" to recursively resolve the addresses as much as
    possible is a very informative practice, I've found.
    http://www.jdrowell.com/archives/projects/jdresolve/

    $ jdresolve --help
    ....
    --nostats or -n
        don't display stats after processing
    --recursive or -r
        recurse into C, B and A classes when there is no PTR.
        default is no recursion
    --anywhere or -a
        resolve all addresses in file (not just those that start lines)
    ....
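
    For anyone without jdresolve at hand, the "resolve all addresses in
    file" part can be approximated in a few lines of Python. This is only
    a sketch of the idea, not how jdresolve works internally, and it does
    not attempt the recursive fallback; the helper names are made up. The
    lookup function is a parameter so the substitution logic can be tried
    without touching DNS.

```python
import re
import socket

# Loose IPv4 pattern; good enough for log files, not a strict validator.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def reverse_lookup(ip):
    """Return the PTR host name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def resolve_line(line, lookup=reverse_lookup):
    """Replace each IPv4 address in line with its host name when one is found.

    Addresses without a PTR record are left unchanged.
    """
    def swap(match):
        return lookup(match.group(0)) or match.group(0)

    return IPV4.sub(swap, line)


if __name__ == "__main__":
    # A stand-in lookup table so the example runs without network access.
    fake_ptr = {"192.0.2.10": "host.example.net"}
    line = '192.0.2.10 - - "GET /bad.html" referred by 203.0.113.9'
    print(resolve_line(line, lookup=fake_ptr.get))
```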




  12. Re: robots.txt: Good, Bad, Ugly?

    On Fri, 22 Sep 2006 01:16:27 GMT, Barton L. Phillips wrote:
    >
    > In reality I have not seen any problems relating to bots. I check my
    > logs for user agents and look at who/what is reading my robots.txt and
    > have not seen anything that alarmed me to date.


    Errr, ummm. Bad bots probably won't bother reading robots.txt.

    Jonesy
    --
    Marvin L Jones | jonz | W3DHJ | linux
    38.24N 104.55W | @ config.com | Jonesy | OS/2
    *** Killfiling google posts:

  13. Re: robots.txt: Good, Bad, Ugly?

    buck wrote:
    >
    > It really wouldn't help because almost all of the entries in badrobots
    > are one-shot abusers. They come to steal graphics or to look for PHP
    > (etc.) to abuse, get caught in the spider trap by bad.html, and are
    > done. I don't keep track of second attempts, but as far as I can tell
    > they never come back.

    Thank you for your reply. I think that is very interesting. It is more
    of a reason for me to look at implementing your "trap". I will look at
    my access_log files a bit more closely now and see if I see similar
    patterns (though it might be hard to see).

    Thanks again

    --
    ----------------
    Barton L. Phillips
    Applied Technology Resources, Inc.
    Tel: (818)652-9850
    Web: http://www.applitec.com

  14. Re: robots.txt: Good, Bad, Ugly?

    Allodoxaphobia wrote:
    > On Fri, 22 Sep 2006 01:16:27 GMT, Barton L. Phillips wrote:
    >> In reality I have not seen any problems relating to bots. I check my
    >> logs for user agents and look at who/what is reading my robots.txt and
    >> have not seen anything that alarmed me to date.

    >
    > Errr, ummm. Bad bots probably won't bother reading robots.txt.
    >
    > Jonesy

    Yes, sorry, that was not really what I was trying to say. You are
    right, of course, and I do other analysis to try to see what "bad"
    bots are doing. I think buck's program might help, as log files are
    quite hard to analyze for this type of violation.
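
    One way to make the logs easier to read for this is to replay them
    against the robots.txt rules. The sketch below is a guess at what
    such an analysis could look like, not buck's program: it assumes the
    Apache combined log format, and the sample rules and log lines are
    invented. It reports each client/user-agent pair that requested a
    disallowed path, along with whether that pair ever fetched
    robots.txt.

```python
import re
from collections import defaultdict
from urllib.robotparser import RobotFileParser

# Assumed Apache combined log layout; referer and user agent are optional here.
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)[^"]*" \d{3} \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)


def find_violations(log_lines, robots_txt):
    """Map (client, user agent) to its disallowed requests and robots.txt use."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    report = defaultdict(lambda: {"read_robots": False, "violations": []})
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like requests
        key = (match.group("client"), match.group("agent") or "-")
        path = match.group("path")
        if path == "/robots.txt":
            report[key]["read_robots"] = True
        elif not rules.can_fetch(key[1], path):
            report[key]["violations"].append(path)
    # Keep only the pairs that actually requested something disallowed.
    return {key: info for key, info in report.items() if info["violations"]}


# Hypothetical rules and log excerpt for illustration only.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Sep/2006:01:00:00 +0000] "GET /private/a.html HTTP/1.0" 200 100 "-" "GrabBot/1.0"',
    '198.51.100.7 - - [22/Sep/2006:01:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 40 "-" "PoliteBot/2.0"',
    '198.51.100.7 - - [22/Sep/2006:01:01:05 +0000] "GET /index.html HTTP/1.1" 200 900 "-" "PoliteBot/2.0"',
]

if __name__ == "__main__":
    for key, info in find_violations(SAMPLE_LOG, SAMPLE_ROBOTS).items():
        print(key, info)
```

    A pair with violations and read_robots set to False matches Jonesy's
    point: the bad ones never ask for the file in the first place.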

    --
    ----------------
    Barton L. Phillips
    Applied Technology Resources, Inc.
    Tel: (818)652-9850
    Web: http://www.applitec.com

  15. Re: robots.txt: Good, Bad, Ugly?

    >It really wouldn't help because almost all of the entries in badrobots
    >are one-shot abusers. They come to steal graphics or to look for PHP
    >(etc.) to abuse, get caught in the spider trap by bad.html, and are
    >done. I don't keep track of second attempts, but as far as I can tell
    >they never come back.


    If you send me a handful of recent addresses, I'll run them
    through some of the common anti-spam block/black lists.
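
    For reference, DNS-based block lists are usually queried by reversing
    the octets of the address and looking the result up under the list's
    zone; an answer means "listed". A minimal sketch of that convention
    follows. The zone name dnsbl.example.org is a placeholder, not a real
    list, and each real list has its own zone and usage terms.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # also validates the input
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the zone answers for ip, the usual 'listed' signal.

    A failed lookup is treated as 'not listed', so a DNS outage would be
    indistinguishable from a clean result in this sketch.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Placeholder zone; substitute the zone of the list you want to consult.
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
```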


    --
    The suespammers.org mail server is located in California. So are all my
    other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
    commercial e-mail to my suespammers.org address or any of my other addresses.
    These are my opinions, not necessarily my employer's. I hate spam.

