using sed to modify html - Unix



  1. using sed to modify html



  2. Re: using sed to modify html


    "John W. Krahn" wrote in message
    news:sqtLj.28023$pb5.12071@edtnps89...
    > Lou Zion wrote:
    >>
    >> i have a series of html files on which I need to do some group
    >> manipulation. what would be the sed command to find all href's and
    >> replace the string with my own string?
    >>
    >> for example, replace all occurrences of:
    >>
    >>
    >>
    >> with
    >>
    >>

    >>
    >> note that I can't use a constant string for the first href because I
    >> don't know what will be between the quotes, so i've tried to do some
    >> pattern matching.
    >>
    >> I tried:
    >>
    >> sed 's/\
    /\ >> title=\"Call me\!\"\>/g' myfile.html
    >>
    >> the * doesn't seem to match anything. i verified this with a simple grep:
    >> grep '*' myfile.html
    >>
    >> doesn't match anything. how can i do this??

    >
    > * is a modifier that says match the character before the modifier zero or
    > more times. In your "href" example you say to match \" zero or more times
    > followed by \". In your grep example there is nothing in front of * for
    > it to modify.
    >
    > http://www.regular-expressions.info/tutorial.html
    >


    that's great. i just had to add a "." before the * and voila. some of the
    regular expression programs i use match * to anything, not requiring a
    preceding ".", but sed does, so that's good to know.

    thanks again!

    lou



  3. Re: using sed to modify html

    >>> the * doesn't seem to match anything. i verified this with a simple grep:
    >>> grep '*' myfile.html
    >>>
    >>> doesn't match anything. how can i do this??

    >>
    >> * is a modifier that says match the character before the modifier zero or
    >> more times. In your "href" example you say to match \" zero or more times
    >> followed by \". In your grep example there is nothing in front of * for
    >> it to modify.
    >>
    >> http://www.regular-expressions.info/tutorial.html
    >>

    >
    >that's great. i just had to add a "." before the * and voila. some of the
    >regular expression programs i use match * to anything, not requiring a
    >preceding ".", but sed does, so that's good to know.


    There's a difference between regular expressions (where you match
    anything with .* ) and filename wildcard expansion (where you match
    most anything with * ).
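
    A rough sketch of that difference, using a made-up sample line (not
    taken from the original files). It also suggests why grep '*' found
    nothing: in a POSIX basic regular expression a leading * has nothing
    to repeat and is taken as a literal asterisk, so the command only
    finds lines that actually contain a "*" character.

```shell
# Hypothetical sample input, invented for illustration.
line='<a href="page.html">5 * 3</a>'

# A leading * in a basic regex is a literal asterisk, so this counts
# only the lines that contain a "*" character.
literal=$(printf '%s\n' "$line" 'no star here' | grep -c '*')

# In a regex, "any run of characters" is spelled .* (any character,
# repeated zero or more times).
regex=$(printf '%s\n' "$line" | grep -c 'href=".*"')

# In a shell filename pattern, * on its own already means
# "any run of characters".
case "index.html" in
  *.html) glob=match ;;
  *)      glob=nomatch ;;
esac

printf '%s %s %s\n' "$literal" "$regex" "$glob"
```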

    Note that you might want to search for 'href=' followed by a double quote
    followed by a bunch of characters *NOT* including a double quote followed
    by a double quote. Why? * tends to be "greedy", matching as much
    as it can, so if you allow double quote characters in the middle,
    your matched expression will include the words Great and Deal below,
    and you probably didn't want that to happen.

    Great "Deal"
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^
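
    To see the greedy behaviour in action, here is a small sketch. The
    sample line and the replacement value NEW are assumptions made up
    for this illustration, not the original example.

```shell
# Hypothetical sample line with a second quoted attribute after href.
line='<a href="old.html" class="x">text</a>'

# Greedy: .* runs on to the LAST double quote on the line,
# so the class attribute is swallowed too.
greedy=$(printf '%s\n' "$line" | sed 's/href=".*"/href="NEW"/')

# Careful: [^"]* cannot cross a double quote, so the match stops
# at the quote that closes the href value.
careful=$(printf '%s\n' "$line" | sed 's/href="[^"]*"/href="NEW"/')

printf '%s\n' "$greedy" "$careful"
```

    The first result loses class="x"; the second keeps it.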



  4. Re: using sed to modify html


    "Gordon Burditt" wrote in message
    news:3Nidnd01AZoVNWPanZ2dnUVZ_hynnZ2d@internetamerica...
    >>>> the * doesn't seem to match anything. i verified this with a simple grep:
    >>>> grep '*' myfile.html
    >>>>
    >>>> doesn't match anything. how can i do this??
    >>>
    >>> * is a modifier that says match the character before the modifier zero or
    >>> more times. In your "href" example you say to match \" zero or more times
    >>> followed by \". In your grep example there is nothing in front of * for
    >>> it to modify.
    >>>
    >>> http://www.regular-expressions.info/tutorial.html
    >>>

    >>
    >>that's great. i just had to add a "." before the * and voila. some of the
    >>regular expression programs i use match * to anything, not requiring a
    >>preceding ".", but sed does, so that's good to know.

    >
    > There's a difference between regular expressions (where you match
    > anything with .* ) and filename wildcard expansion (where you match
    > most anything with * ).
    >
    > Note that you might want to search for 'href=' followed by a double quote
    > followed by a bunch of characters *NOT* including a double quote followed
    > by a double quote. Why? * tends to be "greedy", matching as much
    > as it can, so if you allow double quote characters in the middle,
    > your matched expression will include the words Great and Deal below,
    > and you probably didn't want that to happen.
    >
    > Great "Deal"
    > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^


    thanks, this is good to know. i was following the filename use of * as you
    suggested. curiously i saw another use of regex that didn't use the .*
    notation. it was like this:

    sed -i.bak -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#' file.html

    the * doesn't have any matching characters before it, just something NOT to
    match. i'm not even sure how to interpret every component of this
    expression. not sure what the \1 is for or why the href= is in parentheses.
    it works too, though

    lou





  5. Re: using sed to modify html

    Lou Zion wrote:

    > curiously i saw another use of regex that didn't
    > use the .* notation. it was like this:
    >
    > sed -i.bak -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#' file.html
    >
    > the * doesn't have any matching characters before it,

    It does

    > just something NOT to match.

    Read that part as 'Anything but a "'

    Bye, Jojo



  6. Re: using sed to modify html

    >> There's a difference between regular expressions (where you match
    >> anything with .* ) and filename wildcard expansion (where you match
    >> most anything with * ).
    >>
    >> Note that you might want to search for 'href=' followed by a double quote
    >> followed by a bunch of characters *NOT* including a double quote followed
    >> by a double quote. Why? * tends to be "greedy", matching as much
    >> as it can, so if you allow double quote characters in the middle,
    >> your matched expression will include the words Great and Deal below,
    >> and you probably didn't want that to happen.
    >>
    >> Great "Deal"
    >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^

    >
    >thanks, this is good to know. i was following the filename use of * as you
    >suggested. curiously i saw another use of regex that didn't use the .*
    >notation. it was like this:
    >
    > sed -i.bak -r 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#' file.html


    Let's dissect this. sed is doing a substitution with the expression
    to find as:
    (href=)"[^"]*"
    and replace it with:
    \1"http://www.mydomain.com" title="Call me!"

    An expression in parentheses matches the same as an expression NOT in
    parentheses, except that it also captures the matched substring for
    later use (\1, \2, ...), and it groups, so a following * applies to the
    whole () expression.
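
    A tiny sketch of both effects, on throwaway input strings invented
    for this illustration. It uses -E, the portable spelling of the -r
    option in GNU sed, so that plain ( ) act as grouping operators.

```shell
# Without parentheses, * repeats only the "b" right before it.
plain=$(printf 'ababab!\n' | sed -E 's/ab*/X/')

# With parentheses, * repeats the whole group "ab".
grouped=$(printf 'ababab!\n' | sed -E 's/(ab)*/X/')

# Each group also captures what it matched; \1 and \2 in the
# replacement insert those captured substrings.
swapped=$(printf 'key=value\n' | sed -E 's/(key)=(value)/\2=\1/')

printf '%s\n' "$plain" "$grouped" "$swapped"
```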

    The search expression matches href=, followed by a double quote,
    followed by anything but a double quote repeated zero or more times,
    followed by a double quote. The * applies to the [^"], which means
    "any single character except a double quote".

    The replacement is taken literally except for the \1 part, which is
    replaced by the first captured substring, in this case href= . It gets more
    interesting if you had:
    (href=|img=)"[^"]*"
    which matches href= or img= in the string, and keeps whichever
    one was matched in the replacement.
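
    Here is a sketch that runs the command on sample lines so the pieces
    can be checked one at a time. Assumptions, all made up for this
    illustration: the sample HTML, the replacement value X, and src=
    standing in for img= in the alternation (src is the attribute an
    img tag normally carries). The -i.bak option is left out so no file
    is touched, and -E is used as the portable spelling of -r.

```shell
# Hypothetical sample line with two links.
line='<p><a href="old/page.html">Call</a> or <a href="other.html">write</a></p>'

# The command from the thread: without a trailing g flag, only the
# first href on each line is rewritten.
once=$(printf '%s\n' "$line" |
  sed -E 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#')

# With the g flag, every href on the line is rewritten.
all=$(printf '%s\n' "$line" |
  sed -E 's#(href=)"[^"]*"#\1"http://www.mydomain.com" title="Call me!"#g')

# Alternation: \1 puts back whichever attribute name was matched.
mixed='<a href="a.html"><img src="b.png"></a>'
either=$(printf '%s\n' "$mixed" | sed -E 's#(href=|src=)"[^"]*"#\1"X"#g')

printf '%s\n' "$once" "$all" "$either"
```

    Once the output looks right, adding -i.bak back makes sed edit the
    file in place and keep the original as file.html.bak.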


    >the * doesn't have any matching characters before it, just something NOT to
    >match.


    Guess what? That's a character to match. It's specified as
    "everything but ..." but it's still a character to match.

    >i'm not even sure how to interpret every component of this
    >expression. not sure what the \1 is for or why the href= is in parentheses.
    >it works too, though



