[9fans] non greedy regular expressions - Plan9

This is a discussion on [9fans] non greedy regular expressions - Plan9 ; Hello regexp(6) seems to know only greedy regular expressions. So does probably awk, sed, grep, ..., since these are based on that regexp. My question is what to do if one needs non-greedy regexps. Also, is there anything like submatch ...

+ Reply to Thread
Page 1 of 2 1 2 LastLast
Results 1 to 20 of 35

Thread: [9fans] non greedy regular expressions

  1. [9fans] non greedy regular expressions

    Hello

    regexp(6) seems to know only greedy regular expressions. So does
    probably awk, sed, grep, ..., since these are based on that regexp.
    My question is what to do if one needs non-greedy regexps.

    Also, is there anything like submatch extraction, counted repetition
    of regexp (like {3,5}), (lookahead) assertions in plan9?

    Thanks
    Ruda


  2. Re: [9fans] non greedy regular expressions

    > regexp(6) seems to know only greedy regular expressions. So does
    > probably awk, sed, grep, ..., since these are based on that regexp.
    > My question is what to do if one needs non-greedy regexps.


    russ has a great writeup on this.
    http://swtch.com/~rsc/regexp/
    i think it covers all your questions.

    - erik



  3. Re: [9fans] non greedy regular expressions

    > russ has a great writeup on this.
    > http://swtch.com/~rsc/regexp/
    > i think it covers all your questions.
    >
    > - erik


    I read trough some of that already yesterday. Anyway, am still
    puzzled. In the text of

    Regular Expression Matching Can Be Simple And Fast
    (but is slow in Java, Perl, PHP, Python, Ruby, ...)

    R. Cox writes:
    ---
    While writing the text editor sam [6] in the early 1980s, Rob Pike
    wrote a new regular expression implementation, which Dave Presotto
    extracted into a library that appeared in the Eighth Edition. Pike's
    implementation incorporated submatch tracking into an efficient NFA
    simulation but, like the rest of the Eighth Edition source, was not
    widely distributed.
    ....
    Pike's regular expression implementation, extended to support Unicode,
    was made freely available with sam in late 1992, but the particularly
    efficient regular expression search algorithm went unnoticed. The code
    is now available in many forms: as part of sam, as Plan 9's regular
    expression library, or packaged separately for Unix.
    ---

    But any manual page (regexp(6), that of sam) keeps completely silent
    about eg. any submatch tracking.
    So what's wrong? Can anybody clarify the situation for me or do I
    really have to read the codes?

    Ruda


  4. Re: [9fans] non greedy regular expressions

    > But any manual page (regexp(6), that of sam) keeps completely silent
    > about eg. any submatch tracking.
    > So what's wrong? Can anybody clarify the situation for me or do I
    > really have to read the codes?


    well reading the code would be a travesty. it's curious
    that neither the sam paper nor regexp(6) mentions
    submatches. maybe i missed them.

    sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp'
    sam -d /lib/volcanoes
    ,x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:
    /KRAK/-+

    or, if you want to run sam as a stream editor
    { echo ',x g/KRAK/ s:[^A-Z]+([A-Z]+*) +([a-zA-Z]+).*:\2, \1:'
    echo '/KRAK/-+'
    } | sam -d /lib/volcanoes >[2=]

    a couple of examples of submatch tracking from our system
    (i'm not sure about the version of iwhois on sources.)

    /rc/bin/B
    /rc/bin/Kill
    /rc/bin/broke
    /rc/bin/fax
    /rc/bin/fshalt
    /rc/bin/iwhois
    /rc/bin/juke

    - erik



  5. Re: [9fans] non greedy regular expressions

    > well reading the code would be a travesty. it's curious
    > that neither the sam paper nor regexp(6) mentions
    > submatches. maybe i missed them.
    >
    > sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' > - erik


    Ok, so despite the documentation, some submatch tracking is there.
    But in all (?) your examples, as well as in the scripts you mentioned,
    this tracking is exclusively used with the s command (which is said to
    be unnecessary at least in sam/acme). If I try sth. like
    /( b(.)b)/a/\1\2/
    on
    bla blb 56
    I get
    bla blb\1\2 56
    which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

    Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
    he claims that all nice features except for backreferences can be
    implemented with Thomson's NFA algorithm. And even the backreferences
    can be handled gracefully somehow. That is: ALL: non-greedy operators,
    generalized assertions, counted repetitions, character classes CAN be
    processed using the fast algorithm. Why then we don't have it? I once
    wrote a program in python and was pretty happy to have non-greedy
    operators and lookahead assertions on hand. Should I hadn't had those,
    I probably wouldn't have been able to write it (nicely).

    Ruda


  6. Re: [9fans] non greedy regular expressions

    > Ok, so despite the documentation, some submatch tracking is there.
    > But in all (?) your examples, as well as in the scripts you mentioned,
    > this tracking is exclusively used with the s command (which is said to
    > be unnecessary at least in sam/acme). If I try sth. like
    > /( b(.)b)/a/\1\2/


    this is covered in the sam paper in the section where pike
    discusses s.

    > Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
    > he claims that all nice features except for backreferences can be
    > implemented with Thomson's NFA algorithm. And even the backreferences
    > can be handled gracefully somehow. That is: ALL: non-greedy operators,
    > generalized assertions, counted repetitions, character classes CAN be
    > processed using the fast algorithm. Why then we don't have it?


    let me turn this around. why would these additions (complications)
    benefit plan 9?

    - erik


  7. Re: [9fans] non greedy regular expressions

    I think you've understood correctly. Back references mostly aren't
    there. Greedy operators aren't there. For back references, this may
    be due to philosophical reservations; I have a few myself. For greedy
    operators, I suspect it's more because noone has cared enough to do
    it. It wouldn't be too hard, as Russ' article says. If someone is
    going to to this I would suggest going all the way and implementing
    tags. See http://laurikari.net/ville/spire2000-tnfa.ps.

    > > well reading the code would be a travesty. it's curious
    > > that neither the sam paper nor regexp(6) mentions
    > > submatches. maybe i missed them.
    > >
    > > sed -n 's:.*(KRAK[A-Z]+*) +([a-zA-Z]+).*:\2, \1:gp' > > - erik

    >
    > Ok, so despite the documentation, some submatch tracking is there.
    > But in all (?) your examples, as well as in the scripts you mentioned,
    > this tracking is exclusively used with the s command (which is said to
    > be unnecessary at least in sam/acme). If I try sth. like
    > /( b(.)b)/a/\1\2/
    > on
    > bla blb 56
    > I get
    > bla blb\1\2 56
    > which is not quite what I want... How then? (I'd like to get 'bla blblblb 56'
    > )
    >
    > Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
    > he claims that all nice features except for backreferences can be
    > implemented with Thomson's NFA algorithm. And even the backreferences
    > can be handled gracefully somehow. That is: ALL: non-greedy operators,
    > generalized assertions, counted repetitions, character classes CAN be
    > processed using the fast algorithm. Why then we don't have it? I once
    > wrote a program in python and was pretty happy to have non-greedy
    > operators and lookahead assertions on hand. Should I hadn't had those,
    > I probably wouldn't have been able to write it (nicely).
    >
    > Ruda
    >

    --
    John Stalker
    School of Mathematics
    Trinity College Dublin
    tel +353 1 896 1983
    fax +353 1 896 2282


  8. Re: [9fans] non greedy regular expressions

    On Fri, Oct 24, 2008 at 6:11 PM, Rudolf Sykora wrote:
    > Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
    > he claims that all nice features except for backreferences can be
    > implemented with Thomson's NFA algorithm. And even the backreferences
    > can be handled gracefully somehow. That is: ALL: non-greedy operators,
    > generalized assertions, counted repetitions, character classes CAN be
    > processed using the fast algorithm. Why then we don't have it?


    Just because something is possible doesn't mean that it is a good
    idea. Implementing every possible feature is the GNU/way, if you want
    to go down that path, you know where to find it.

    > I once
    > wrote a program in python and was pretty happy to have non-greedy
    > operators and lookahead assertions on hand. Should I hadn't had those,
    > I probably wouldn't have been able to write it (nicely).


    Python uses the absolutely dreadful and unintelligible PCRE, which not
    only have unpredictable performance as Russ points out, but are a
    total mess so complex that I doubt anybody fully understands them,
    they are to Plan 9 regexps what C++ is to C.

    Maybe Plan 9 regexps are not as 'powerful', but at least I can understand them.

    uriel


  9. Re: [9fans] non greedy regular expressions

    Backreferences within the pattern (such as in /(.*)\1/) make the
    matcher non-regular and exponentially hard. It was a deliberate
    decision not to implement them in sam and I'd make the same decision
    today.

    As far as greedy/non-greedy operators go, they have more utility but I
    believe they have become popular primarily to deal with the problems
    with Perl and PCRE doing greedy implementations badly in order to
    avoid the exponential cost of their backtracking matchers. These
    matchers can give crazy results.

    Sam etc. are O(length of string * length of pattern) at all times and
    their leftmost-longest properties work throughout.

    Again, Russ has most of this in his analysis.

    -rob


  10. Re: [9fans] non greedy regular expressions

    2008/10/24 John Stalker :
    > I think you've understood correctly. Back references mostly aren't
    > there. Greedy operators aren't there.


    you probably mean NON-greedy ops.
    I was not that concerned about backreferences but submatch extraction
    (according to the terminology used by R. Cox in his article).
    R.

    PS.: and submatch extraction is there, somehow, undocumented (to my knowledge)


  11. Re: [9fans] non greedy regular expressions

    Regarding greedy vs non-greedy etc.

    In Plan 9, a regular expression search always looks
    for the "leftmost-longest" match, which means
    a match that starts as far to the left as possible
    in the target text, and of the matches that start there,
    the longest one.

    In that model, it is not accurate to describe the * + ?
    operators as greedy or not. None of them is working
    toward any goal other than the overall longest match
    at the leftmost position.

    In Perl and its imitators, the match starts at the leftmost
    position but is otherwise the first one that is found,
    not necessarily the longest. In that context, words like
    "greedy" and "non-greedy" start to make sense,
    because the behavior of any one operator influences which
    match is encountered first.

    Either approach -- leftmost-longest or leftmost-first --
    can be implemented using finite automata.

    Russ


  12. Re: [9fans] non greedy regular expressions

    > you probably mean NON-greedy ops.

    Yes, my mistake. I'll risk making a very minor correction to
    Rob's post as well:

    > Backreferences within the pattern (such as in /(.*)\1/) make the
    > matcher non-regular and exponentially hard.


    They do change the class of the grammar and nobody knows how to
    implement them in subexponential time, but it hasn't been proved
    to be impossible.

    --
    John Stalker
    School of Mathematics
    Trinity College Dublin
    tel +353 1 896 1983
    fax +353 1 896 2282


  13. Re: [9fans] non greedy regular expressions

    >If I try sth. like
    >/( b(.)b)/a/\1\2/
    >on
    >bla blb 56
    >I get
    >bla blb\1\2 56
    >which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')


    echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
    bla blb blbl 56

    similarly use `s' not `a' in sam.


  14. Re: [9fans] non greedy regular expressions

    2008/10/24 Charles Forsyth :
    >>If I try sth. like
    >>/( b(.)b)/a/\1\2/
    >>on
    >>bla blb 56
    >>I get
    >>bla blb\1\2 56
    >>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

    >
    > echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
    > bla blb blbl 56
    >
    > similarly use `s' not `a' in sam.



    Yes. But my question was more subtle. I know it can be done with the
    's' command in sam now. But I asked sth. different. The documentation
    (paper about sam) says 's' is redundant. But it seems (only seems
    since documentation says nothing), that submatches only work with the
    's' command. If this is true 's' is not redundant, since if it weren't
    there you would not have any chance to use \1, \2, etc.
    That was the reason I wanted to have my example rewritten WITHOUT the
    's' command...

    But anyway thanks!
    Ruda


  15. Re: [9fans] non greedy regular expressions

    > In that model, it is not accurate to describe the * + ?
    > operators as greedy or not. None of them is working
    > toward any goal other than the overall longest match
    > at the leftmost position.


    So then I must be mistaken about the terminology.
    I thought greedy=leftmost-longest, while non-greedy=leftmost-first:
    Having
    bla bla (AB)(CDEF)(GH)
    /\(.*\)/
    matches the whole (AB)(CDEF)(GH), 'greedy', while if '*' were
    'non-greedy', I'd expect a match with (AB). This is now not in Plan9.
    This mentioned example is easy to solve, if I want to go in 'bracket'
    steps with
    /\([^)]*\)/
    but there are harder examples, eg. where the_interesting_part is
    delimited with a more complicated structure ('ABC' here):
    ABCthe_interesting_partABC blabla bla ABCthe_interesting_partABC etc.
    Now, how to parse it? With non-greedy '*', no problem. With 'greedy'
    and no negative lookahead assertion? Ok, maybe 'y' in sam could help
    (don't know). What about
    ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA etc.?
    All the thinking about this is simply removed with 'non-greedy' ops.

    Ruda



    > In Perl and its imitators, the match starts at the leftmost
    > position but is otherwise the first one that is found,
    > not necessarily the longest. In that context, words like
    > "greedy" and "non-greedy" start to make sense,
    > because the behavior of any one operator influences which
    > match is encountered first.
    >
    > Either approach -- leftmost-longest or leftmost-first --
    > can be implemented using finite automata.
    >
    > Russ
    >
    >



  16. Re: [9fans] non greedy regular expressions

    > I thought greedy=leftmost-longest, while non-greedy=leftmost-first:

    Greedy leftmost-first is different from leftmost-longest.
    Search for /a*(ab)?/ in "ab". The leftmost-longest match
    is "ab", but the leftmost-first match (because of the
    greedy star) is "a". In the leftmost-first case, the greediness
    of the star caused an overall short match.

    > All the thinking about this is simply removed with 'non-greedy' ops.


    But it isn't (or shouldn't be).
    Using /\(.*\)/ to match small parenthesized expressions
    is fragile: /\(.*\)/ in "(a(b))" matches "(a(b)".
    In contrast, the solution you rejected /\([^)]*\)/ is more robust.

    It doesn't make sense to shoehorn non-greedy and greedy
    operators into an engine that provides leftmost-longest matching.
    If you want a different model, you need to use a different program.
    Perl has been ported.

    Russ


  17. Re: [9fans] non greedy regular expressions

    > Greedy leftmost-first is different from leftmost-longest.
    > Search for /a*(ab)?/ in "ab". The leftmost-longest match
    > is "ab", but the leftmost-first match (because of the
    > greedy star) is "a". In the leftmost-first case, the greediness
    > of the star caused an overall short match.


    Ok, I finally see the point... thanks.
    Only one last question: So is there any simple way (using existing
    regexps) to find 'the interesting parts' in my last example?:
    ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
    ....i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
    Say I have a file with such a structure and want to only get 'the
    interesting parts' form it.

    >
    >> All the thinking about this is simply removed with 'non-greedy' ops.

    >
    > But it isn't (or shouldn't be).

    well, I admit it isn't


  18. Re: [9fans] non greedy regular expressions

    > Ok, I finally see the point... thanks.
    > Only one last question: So is there any simple way (using existing
    > regexps) to find 'the interesting parts' in my last example?:
    > ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
    > ...i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
    > Say I have a file with such a structure and want to only get 'the
    > interesting parts' form it.


    doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
    maybe i don't understand the example. if so, could you explain?

    - erik


  19. Re: [9fans] non greedy regular expressions

    > doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
    > maybe i don't understand the example. if so, could you explain?
    >
    > - erik


    I think not.
    I have a file say like this

    ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
    sodifs
    sdfsd
    ABC
    dasdfas aasdfa
    njnjn CBA

    and I want to get

    ' asassadfasdf asdfasdf asdfasdf '
    'dasdfas aasdfa'
    'njnjn'

    where I added apostrophes to see the spaces on indivial lines. Simply:
    give me everything that is between delimiters (ABC and CBA).

    Ruda


  20. Re: [9fans] non greedy regular expressions

    hello

    using sed and only one reg-exp is mandatory?

    cat t.txt| sed 's/(ABC | CBA)/ \n\1\n /g' | awk '/ABC/,/CBA/' | grep -
    v 'ABC|CBA'

    that's a naive and simple approach, but i can't see why you need to
    use just one reg-exp and just one sed. May be i missed something
    through the thread :-?

    gabi


    El 25/10/2008, a las 0:04, Rudolf Sykora escribió:

    >> doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
    >> maybe i don't understand the example. if so, could you explain?
    >>
    >> - erik

    >
    > I think not.
    > I have a file say like this
    >
    > ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
    > sodifs
    > sdfsd
    > ABC
    > dasdfas aasdfa
    > njnjn CBA
    >
    > and I want to get
    >
    > ' asassadfasdf asdfasdf asdfasdf '
    > 'dasdfas aasdfa'
    > 'njnjn'
    >
    > where I added apostrophes to see the spaces on indivial lines. Simply:
    > give me everything that is between delimiters (ABC and CBA).
    >
    > Ruda
    >
    >




+ Reply to Thread
Page 1 of 2 1 2 LastLast