Thread: Re: [9fans] non greedy regular expressions

  1. Re: [9fans] non greedy regular expressions

    > > GNU grep takes a simple but effective approach. It uses a DFA when
    > > possible, reverting to an NFA when backreferences are used. GNU awk does
    > > something similar---it uses GNU grep's fast shortest-leftmost DFA engine
    > > for simple "does it match" checks, and reverts to a different engine for
    > > checks where the actual extent of the match must be known. Since that
    > > other engine is an NFA, GNU awk can conveniently offer capturing
    > > parentheses, and it does via its special gensub function.


    I'll be the first to admit that gawk's behavior is something of a hack.
    I'd much rather be able to use a single matcher. (At least I was able to
    get Friedl to describe gawk's implementation correctly. :-)
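
    To make the "does it match" versus "how far does it match" split concrete,
    here's a rough sketch. It is not gawk's actual code; it just uses the
    standard POSIX <regex.h> interface to show the two kinds of calls: a
    pattern compiled with REG_NOSUB only answers yes/no (so the library is
    free to use a cheaper matcher), while one compiled without it reports
    submatch offsets.

        #include <regex.h>
        #include <stdio.h>

        int
        main(void)
        {
            const char *text = "user=arnold";
            regex_t yesno, extent;
            regmatch_t m[3];

            /* boolean check only: no offsets requested */
            regcomp(&yesno, "([a-z]+)=([a-z]+)", REG_EXTENDED|REG_NOSUB);
            if(regexec(&yesno, text, 0, NULL, 0) == 0)
                printf("matches\n");

            /* same pattern, but now the extent of the match and of
             * both parenthesized groups must be known */
            regcomp(&extent, "([a-z]+)=([a-z]+)", REG_EXTENDED);
            if(regexec(&extent, text, 3, m, 0) == 0)
                printf("group 1: %.*s, group 2: %.*s\n",
                    (int)(m[1].rm_eo - m[1].rm_so), text + m[1].rm_so,
                    (int)(m[2].rm_eo - m[2].rm_so), text + m[2].rm_so);

            regfree(&yesno);
            regfree(&extent);
            return 0;
        }

    At the awk level the captured pieces surface through gensub(); for
    example, gensub(/([a-z]+)=([a-z]+)/, "\\2 is the \\1", 1, "user=arnold")
    yields "arnold is the user".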

    > > Tcl's regex engine is a true hybrid, custom built by Henry Spencer ....
    > >
    > > Currently, this engine is available only to Tcl, but Henry tells me that
    > > it's on his to-do list to break it out into a separate package that can
    > > be used by others.


    It's been on that TODO list for well over a decade, IIRC. More vaporware
    as far as a broader community is concerned.

    > Again, turns out the "big books on regular expressions" can give the
    > lowlife--that's me--things "hackers" deny them.


    Not in the least. The Plan 9 regexp library in fact gives you close to
    the same nirvana: an automaton that has DFA speed characteristics
    together with an NFA's ability to capture submatches.
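
    For the concrete shape of that interface, here's a minimal capture
    sketch against regexp(2). The header and library names below are the
    ones plan9port uses (<regexp9.h>, -lregexp9); on native Plan 9 the
    header is <regexp.h> and the Resub fields are reached directly as
    sp/ep rather than s.sp/e.ep.

        #include <u.h>
        #include <libc.h>
        #include <regexp9.h>

        void
        main(void)
        {
            Reprog *p;
            Resub m[3];
            char *s = "user=arnold";

            p = regcomp("([a-z]+)=([a-z]+)");
            if(p == nil)
                sysfatal("regcomp failed");
            memset(m, 0, sizeof m);        /* clear old match info first */
            if(regexec(p, s, m, 3)){
                /* m[0] spans the whole match; m[1], m[2] the groups */
                print("key %.*s, value %.*s\n",
                    (int)(m[1].e.ep - m[1].s.sp), m[1].s.sp,
                    (int)(m[2].e.ep - m[2].s.sp), m[2].s.sp);
            }
            free(p);
            exits(nil);
        }

    There's no backtracking in there: the library simulates the automaton,
    so matching time grows linearly with the input while still reporting
    where each parenthesized piece matched.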

    As other mails have pointed out, anything that isn't leftmost longest
    has weird semantics. Non-greedy operators are mostly syntactic sugar.
    I'll agree that in practice they're likely to be useful, but as I'm
    pretty much an awk guy I've never felt the pain of their lack, either.
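
    To make the sugar point concrete (again with plain POSIX <regex.h>,
    nothing gawk- or Plan 9-specific assumed): under leftmost-longest
    rules <.*> runs to the last '>', and the usual cure is a negated
    character class, not a non-greedy operator.

        #include <regex.h>
        #include <stdio.h>

        /* print the match each pattern produces on the same input */
        static void
        show(const char *pat, const char *text)
        {
            regex_t re;
            regmatch_t m;

            regcomp(&re, pat, REG_EXTENDED);
            if(regexec(&re, text, 1, &m, 0) == 0)
                printf("%-8s -> %.*s\n", pat,
                    (int)(m.rm_eo - m.rm_so), text + m.rm_so);
            regfree(&re);
        }

        int
        main(void)
        {
            const char *text = "<a> and <b>";

            show("<.*>", text);     /* longest: matches "<a> and <b>" */
            show("<[^>]*>", text);  /* matches just "<a>", no *? needed */
            return 0;
        }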

    Gawk is stuck with its current two-matcher approach since it provides
    the option for different syntaxes, but that approach also has lots of
    problems; more than once I've come across cases where the two matchers
    interpret the same syntax differently, or don't handle multibyte
    characters the same. (Not to mention that the syntax-bits API in the
    GNU matchers is a real nightmare. Another thing that's too late to change.)

    If I could start over again, I'd start with either the Plan 9 lib or
    ripping out Henry's library from Tcl, but the former is probably easier.

    Friedl's book is good for what it aims to be: an introduction to
    regular expressions. But scientifically rigorous (as in, on the same
    order as the Dragon Book) it's definitely not.

    Russ's paper describes the state of the world pretty well.

    Arnold


  2. Re: [9fans] non greedy regular expressions

    On Monday, October 27, 2008, Aharon Robbins wrote:
    > As other mails have pointed out, anything that isn't leftmost longest
    > has weird semantics. Non-greedy operators are mostly syntactic sugar.


    Is (leftmost-longest + all-greedy operators) syntactic salt then?

    > Not in the least. The Plan 9 regexp library in fact gives you close to
    > the same nirvana: an automaton that has DFA speed characteristics
    > together with an NFA's ability to capture submatches.


    Does regexp(n) also give the lowlife any hint of why it should behave
    differently from Perl? Friedl's book doesn't, but it has good reason not to.
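
    The difference being asked about is leftmost-longest versus Perl's
    leftmost-first alternation. A small sketch, assuming only a
    POSIX-conformant <regex.h>:

        #include <regex.h>
        #include <stdio.h>

        int
        main(void)
        {
            const char *text = "oneself";
            regex_t re;
            regmatch_t m;

            regcomp(&re, "one|oneself", REG_EXTENDED);
            if(regexec(&re, text, 1, &m, 0) == 0)
                /* POSIX leftmost-longest: prints "oneself".
                 * Perl's leftmost-first rule on the same pattern
                 * ("oneself" =~ /one|oneself/) matches just "one". */
                printf("%.*s\n", (int)(m.rm_eo - m.rm_so), text + m.rm_so);
            regfree(&re);
            return 0;
        }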

    > Friedl's book is good for what it aims to be: an introduction to
    > regular expressions. But scientifically rigorous (as in, on the same
    > order as the Dragon Book) it's definitely not.


    That's good. No problem with "big books on regular expressions," then.
    Introductory ones, or scientific ones.
