Downloading webpages - Linux


Results 1 to 15 of 15

Thread: Downloading webpages

  1. Downloading webpages

    Hi all,
    Can anyone point me to a web resource that explains how to
    write a programme that downloads web pages? There's a lot of stuff
    for Java, but not much for C++.

    TIA

    Paul
    --
    ----
    Home: http://www.paullee.com
    Woes: http://www.dr_paul_lee.btinternet.co.uk/zzq.shtml

  2. Re: Downloading webpages


    Kwebway Konongo wrote:

    > Hi all,
    > Can anyone point me to a web resource that explains how to
    > write a programme that downloads web pages? There's a lot of stuff
    > for Java, but not much for C++.


    Look at the source code to a program like 'wget'.

    DS


  3. Re: Downloading webpages

    Kwebway Konongo writes:
    > Can anyone point me to a web resource that explains how to
    > write a programme that downloads web pages? There's a lot of stuff
    > for Java, but not much for C++.


    There are libraries to download resources given by URL. You'll be
    able to find several of them with a C API that you should be able to
    use directly from C++ without any problem.

    To start, try libwww.



    Otherwise, if you just want to download a few bytes using the
    simplest protocol, you can just do the following:

    Open a TCP socket to the server
    Send "GET http://host/local/part HTTP/1.0\r\n\r\n"
    Read header lines (terminated by \r\n) until you find an empty line.
    Find the Content-Length header and extract the number of bytes to read.
    Read the bytes.
    When done, close the socket.
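
    For what it's worth, here is a rough C++ sketch of the steps above
    using POSIX sockets on Linux. The function names (build_request,
    content_length, fetch) are made up for illustration, and the sketch
    assumes plain HTTP on port 80. It sends the path in the request line
    together with a Host header, which is the more usual form when talking
    to the server directly, and it reads until the server closes the
    connection, using Content-Length only to trim the body. Redirects,
    HTTPS and chunked encoding are not handled.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Build a minimal HTTP/1.0 request. Every line ends in CR LF and an
// empty line terminates the header section.
std::string build_request(const std::string& host, const std::string& path) {
    return "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
}

// Return the Content-Length value found in a raw header block, or -1 if
// the header is absent. Header names are case-insensitive.
long content_length(const std::string& headers) {
    std::string lower;
    for (unsigned char c : headers) lower += static_cast<char>(std::tolower(c));
    const std::string key = "\r\ncontent-length:";
    const std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return -1;
    return std::strtol(headers.c_str() + pos + key.size(), nullptr, 10);
}

// Connect to host on port 80, send the request, return the response body.
std::string fetch(const std::string& host, const std::string& path) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;  // TCP
    addrinfo* res = nullptr;
    if (getaddrinfo(host.c_str(), "80", &hints, &res) != 0)
        throw std::runtime_error("name lookup failed");

    int fd = -1;
    for (addrinfo* p = res; p != nullptr; p = p->ai_next) {
        fd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (fd < 0) continue;
        if (connect(fd, p->ai_addr, p->ai_addrlen) == 0) break;
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    if (fd < 0) throw std::runtime_error("connect failed");

    const std::string req = build_request(host, path);
    std::size_t sent = 0;
    while (sent < req.size()) {
        const ssize_t n = send(fd, req.data() + sent, req.size() - sent, 0);
        if (n <= 0) { close(fd); throw std::runtime_error("send failed"); }
        sent += static_cast<std::size_t>(n);
    }

    // With HTTP/1.0 the server closes the connection after the response,
    // so simply read until end of stream.
    std::string response;
    char buf[4096];
    ssize_t n;
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        response.append(buf, static_cast<std::size_t>(n));
    close(fd);

    // The first empty line separates the headers from the body.
    const std::size_t end = response.find("\r\n\r\n");
    if (end == std::string::npos) throw std::runtime_error("no header terminator");
    std::string body = response.substr(end + 4);
    const long len = content_length(response.substr(0, end));
    if (len >= 0 && static_cast<std::size_t>(len) < body.size())
        body.resize(static_cast<std::size_t>(len));
    return body;
}
```

    Calling fetch("host", "/local/part") would then return the body as a
    std::string, or throw std::runtime_error if a step fails.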



    --
    __Pascal Bourguignon__ http://www.informatimago.com/

  4. Re: Downloading webpages

    Pascal Bourguignon wrote:

    > Otherwise, if you just want to download a few bytes using the
    > simplest protocol, you can just do the following:
    > Open a TCP socket to the server
    > Send "GET http://host/local/part HTTP/1.0\r\n\r\n"
    > Read header lines (terminated by \r\n) until you find an empty line.
    > Find the Content-Length header and extract the number of bytes to read.
    > Read the bytes.
    > When done, close the socket.


    Doing it correctly takes a little more work: in theory the content can
    be chunked, no matter how short it is. In practice, I agree that a
    very short chunked answer will be rare.
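
    As a rough illustration of what handling a chunked body involves,
    here is a small C++ sketch. Chunked transfer coding is an HTTP/1.1
    feature: each chunk is a hexadecimal size line, the data, and a
    trailing CR LF, and a chunk of size 0 ends the body. The function
    name decode_chunked is hypothetical, and the sketch assumes the
    whole body is already in memory; chunk extensions and trailers are
    ignored.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Decode a body that was sent with "Transfer-Encoding: chunked".
std::string decode_chunked(const std::string& raw) {
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        // Each chunk starts with a size line terminated by CR LF.
        const std::size_t eol = raw.find("\r\n", pos);
        if (eol == std::string::npos)
            throw std::runtime_error("truncated chunk size line");
        // strtoul stops at ';' or '\r', so chunk extensions are skipped.
        const unsigned long size = std::strtoul(raw.c_str() + pos, nullptr, 16);
        pos = eol + 2;
        if (size == 0) break;  // last chunk; any trailers are ignored
        if (raw.size() - pos < size)
            throw std::runtime_error("truncated chunk data");
        out.append(raw, pos, size);
        pos += size + 2;  // skip the data and the CR LF that follows it
    }
    return out;
}
```

    For example, the input "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes
    to "Wikipedia".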

    --
    Salu2

  5. Re: Downloading webpages

    Julián Albo wrote:

    > Pascal Bourguignon wrote:
    >
    >> Otherwise, if you just want to download a few bytes using the
    >> simplest protocol, you can just do the following:
    >> Open a TCP socket to the server
    >> Send "GET http://host/local/part HTTP/1.0\r\n\r\n"
    >> Read header lines (terminated by \r\n) until you find an empty line.
    >> Find the Content-Length header and extract the number of bytes to read.
    >> Read the bytes.
    >> When done, close the socket.

    >
    > Doing it correctly takes a little more work: in theory the content can
    > be chunked, no matter how short it is. In practice, I agree that a
    > very short chunked answer will be rare.
    >


    I may have to switch over to Java, since it seems so much easier.
    However, I am worried about the latency involved in the JVM; the garbage
    collector seems to slow the programme down (and, from experience, the
    CLR C# GC does the same). I'd like this to run as fast as possible, but
    there may be a trade-off between complexity of development and execution
    speed...


    --
    ----
    Home: http://www.paullee.com
    Woes: http://www.dr_paul_lee.btinternet.co.uk/zzq.shtml

  6. Re: Downloading webpages

    Kwebway Konongo wrote:
    > I may have to switch over to Java, since it seems so much easier.
    > However, I am worried about the latency involved in the JVM; the garbage
    > collector seems to slow the programme down (and, from experience, the
    > CLR C# GC does the same). I'd like this to run as fast as possible, but
    > there may be a trade-off between complexity of development and execution
    > speed...


    Mmmm, that looks a bit like premature worrying about optimization.
    I would bet that the real bottleneck isn't going to be anything
    in your application but the time it takes for the reply from the
    other side to arrive (if one arrives at all). If merely opening a
    connection to the web server sometimes takes something on the order
    of hundreds of milliseconds (and that's not that rare; unless you
    put in a timer, you may wait a lot longer for an attempted
    connection to time out when the other side is down!), then delays
    due to a garbage collector won't be your main concern. Better to go
    for a language you know well and write a prototype, and only if you
    find that the language is the real bottleneck rewrite it in something
    that may be faster. By then you will at least understand the ins and
    outs of how it has to be done.
    Regards, Jens
    --
    \ Jens Thoms Toerring ___ jt@toerring.de
    \__________________________ http://toerring.de

  7. Re: Downloading webpages

    On 2007-01-11, Kwebway Konongo wrote:
    > Hi all,
    > Can anyone point me to a web resource that explains how to
    > write a programme that downloads web pages? There's a lot of stuff
    > for Java, but not much for C++.


    Download the source of wget, or use libcurl.

    Bye.
    Jasen

  8. Re: Downloading webpages

    Kwebway Konongo wrote:
    > Can anyone point me to a web resource that explains how to
    > write a programme that downloads web pages? There's a lot of stuff
    > for Java, but not much for C++.


    Take a look at Curl.

    --
    Milan Babuskov
    http://njam.sourceforge.net
    http://swoes.blogspot.com

  9. Re: Downloading webpages

    Kwebway Konongo wrote:

    > I may have to switch over to Java, since it seems so much
    > easier.


    If you want it easy, readable and maintainable, use Python.

    Perhaps in conjunction with Twisted for even clearer client and
    server programming.

    http://www.python.org/
    http://docs.python.org/lib/module-urllib.html
    http://twistedmatrix.com/trac/wiki/TwistedProject

    > However, I am worried about the latency involved in the
    > JVM; the garbage collector seems to slow the programme down (and,
    > from experience, the CLR C# GC does the same). I'd like this to run
    > as fast as possible, but there may be a trade-off between
    > complexity of development and execution speed...


    Yes, there is one.

    Good rule of thumb: First time it, then worry. Meaning: write your
    program, locate bottlenecks, then optimize. Premature optimization
    can be a real pain. And especially if you have to wait for multiple
    web servers, your client's performance shouldn't matter too much.

    Regards,


    Björn

    --
    BOFH excuse #428:

    Firmware update in the coffee machine


  10. Re: Downloading webpages

    Bjoern Schliessmann writes:
    > Kwebway Konongo wrote:
    >> I may have to switch over to Java, since it seems so much
    >> easier.

    >
    > If you want it easy, readable and maintainable, use Python.


    'readable' and 'maintainable' are properties of text and not of
    language.

    > Perhaps in conjunction with Twisted for even clearer client and
    > server programming.


    This sentence doesn't communicate anything except that you like it.

    >> However, I am worried about the latency involved in the
    >> JVM; the garbage collector seems to slow the programme down (and,
    >> from experience, the CLR C# GC does the same). I'd like this to run
    >> as fast as possible, but there may be a trade-off between
    >> complexity of development and execution speed...

    >
    > Yes, there is one.
    >
    > Good rule of thumb: First time it, then worry. Meaning: write your
    > program, locate bottlenecks, then optimize.


    Ehhh ... you do understand that 'optimize later' when the problem is
    the garbage collector of the language runtime amounts to 'rewrite in
    another language'?

  11. Re: Downloading webpages

    Rainer Weikusat wrote:

    >>>'readable' and 'maintainable' are properties of text and not of
    >>>language.

    >>
    >>Certain languages make it easier to write well understandable code,

    >
    > That is a usual claim made by people advocating the use of 'certain
    > languages', but it is nevertheless untrue, because those are
    > properties of texts.


    And yours is a usual claim by people advocating the use of the
    other certain languages, the ones that make it hard to write
    readable and maintainable code ... well, text, if you will,
    but text that must compile under the given language rules to
    produce an executable that implements the intended task...

    Readability and maintainability may be properties of text, but
    there is a direct cause-effect association between the choice
    of a language, and the *difficulty* of producing readable and
    maintainable *outcomes* while using the said language (or, the
    probability of producing a readable and maintainable "outcome"
    when programming a given task, if you want it in more abstract,
    mathematical terms).

    (and no, I do not intend to start a language war, nor am
    I defending Python, a language that I have never used,
    nor have I ever read a single line of code written in it...
    And no, I'm not planning to in the near future)

    Carlos
    --

  12. Re: Downloading webpages

    Carlos Moreno writes:
    > Rainer Weikusat wrote:
    >>>>'readable' and 'maintainable' are properties of text and not of
    >>>>language.
    >>>
    >>>Certain languages make it easier to write well understandable code,

    >> That is a usual claim made by people advocating the use of 'certain
    >> languages', but it is nevertheless untrue, because those are
    >> properties of texts.

    >
    > And yours is a usual claim by people advocating the use of the
    > other certain languages, the ones that make it hard to write
    > readable and maintainable code ... well, text, if you will,


    But I am not advocating the use of any language, and if I were, that
    could not possibly affect the fact that I cited above. This may be
    less true for a language like English, which has a very simple
    grammar, but in German, for instance, it is fairly easy to write
    totally correct sentences that many or even most people don't
    understand. And German is still a fairly simple language compared
    to others (Polish, for instance).

    > Readability and maintainability may be properties of text, but
    > there is a direct cause-effect association between the choice
    > of a language, and the *difficulty* of producing readable and
    > maintainable *outcomes* while using the said language


    The most frequent problem wrt 'readability' in source code is the
    general unwillingness of 'average' programmers to structure code
    at all and to use meaningful identifiers instead of mnemonics.

  13. Re: Downloading webpages

    Bjoern Schliessmann writes:

    [...]

    >>> BTW, Python code can be extended with C or C++ modules if a part
    >>> of the code /really/ needs to be high-speed.

    >>
    >> It is doubtlessly possible to write Python code that executes with
    >> adequate performance for almost any given task if the target
    >> platforms are 'suitably limited', because "performance" (loosely
    >> speaking) is again mostly an attribute of the implementation and
    >> not the language used to implement it.

    >
    > I'm sorry, I don't get your point. Not at all.


    The language does not write code. Programmers write code and this code
    is as 'readable' (which is a really fuzzy term) as the person who
    wrote it considered that to be necessary and will 'perform' like code
    written the way it was written performs.

    That a binary compiled from crappy C may execute faster than the same
    crap written in a virtual machine language is a 'sideband effect'.

  14. Re: Downloading webpages

    Kwebway Konongo wrote:
    > Julián Albo wrote:
    >
    >> Pascal Bourguignon wrote:
    >>
    >>> Otherwise, if you just want to download a few bytes using the
    >>> simplest protocol, you can just do the following:
    >>> Open a TCP socket to the server
    >>> Send "GET http://host/local/part HTTP/1.0\r\n\r\n"
    >>> Read header lines (terminated by \r\n) until you find an empty line.
    >>> Find the Content-Length header and extract the number of bytes to read.
    >>> Read the bytes.
    >>> When done, close the socket.

    >> Doing it correctly takes a little more work: in theory the content can
    >> be chunked, no matter how short it is. In practice, I agree that a
    >> very short chunked answer will be rare.
    >>

    >
    > I may have to switch over to Java, since it seems so much easier.
    > However, I am worried about the latency involved in the JVM; the garbage
    > collector seems to slow the programme down (and, from experience, the
    > CLR C# GC does the same). I'd like this to run as fast as possible, but
    > there may be a trade-off between complexity of development and execution
    > speed...

    This is much, much faster than the latency you
    will experience on the network. Do not worry.
    On the other hand, why not script the downloading, using e.g.
    'wget' ?

  15. Re: Downloading webpages

    Hi.

    "Nils O. Selåsdal" writes:

    > Kwebway Konongo wrote:
    > > Julián Albo wrote:
    > >
    > >> Pascal Bourguignon wrote:
    > >>
    > >>> Otherwise, if you just want to download a few bytes using the
    > >>> simplest protocol, you can just do the following:
    > >>> Open a TCP socket to the server
    > >>> Send "GET http://host/local/part HTTP/1.0\r\n\r\n"
    > >>> Read header lines (terminated by \r\n) until you find an empty line.
    > >>> Find the Content-Length header and extract the number of bytes to read.
    > >>> Read the bytes.
    > >>> When done, close the socket.
    > >> Doing it correctly takes a little more work: in theory the content can
    > >> be chunked, no matter how short it is. In practice, I agree that a
    > >> very short chunked answer will be rare.
    > >>

    > > I may have to switch over to Java, since it seems so much
    > > easier.
    > > However, I am worried about the latency involved in the JVM; the garbage
    > > collector seems to slow the programme down (and, from experience, the
    > > CLR C# GC does the same). I'd like this to run as fast as possible, but
    > > there may be a trade-off between complexity of development and execution
    > > speed...

    > This is much, much faster than the latency you
    > will experience on the network. Do not worry.
    > On the other hand, why not script the downloading, using e.g.
    > 'wget' ?


    If you don't have (or don't want to get) wget, you can also use bash
    as follows (this is from the "Bash" Wikipedia entry):

    # open a TCP socket on port 80 to en.wikipedia.org for read/write
    # (needs a bash built with /dev/tcp network redirection support)
    exec 5<>/dev/tcp/en.wikipedia.org/80
    # send an HTTP request for this page; HTTP lines end in CR LF
    printf 'GET /wiki/Bash HTTP/1.0\r\nHost: en.wikipedia.org\r\n\r\n' >&5
    # output the response to the screen
    cat <&5
    # close output redirection for the socket
    exec 5>&-
    # close input redirection for the socket
    exec 5<&-


    --
    Art Werschulz (agw STRUDEL comcast.net)
    207 Stoughton Ave Cranford NJ 07016
    (908) 272-1146
