[9fans] non greedy regular expressions - Plan9



Thread: [9fans] non greedy regular expressions

  1. Re: [9fans] non greedy regular expressions

    although it can be satisfying to do things in one instruction (one command)
    i confess i often find it quicker to split up the problem in sam or acme:

    1. delete everything not between delimiters
    ,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
    2. delete the delimiters
    ,x/ABC|CBA/d
    3. look to decide if i missed a boundary case for my input

    it would be easier if the delimiters weren't words, or your sections didn't span lines.
    i didn't spend any time deciding whether there was a better way
    to express the trailing word delimiter. it's too late.
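    To convince yourself the complement idiom works, here is a hedged sketch using Python's re as a stand-in for sam's regexp syntax (the sample text is invented):

```python
import re

# Forsyth's complement idiom: the alternation can match any character
# sequence that does not contain the closing delimiter "CBA",
# so each match is forced to stop at the first CBA it reaches.
pat = re.compile(r'ABC(?:[^C]|C[^B]|CB[^A]|\n)+CBA')

text = "xx ABC foo CBA yy ABC bar CBA zz"
print(pat.findall(text))  # → ['ABC foo CBA', 'ABC bar CBA']
```

    None of the alternatives can consume the "C" of a "CBA", which is exactly what keeps a greedy + from running past the closing delimiter.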

    an example actually using the () [submatch] to extract and exchange two fields on each line:
    ,x s/([^,]*),(.*)/\2,\1/
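    The same field exchange, sketched in Python for anyone following along outside sam (the sample line is made up):

```python
import re

# swap the text before the first comma with everything after it,
# mirroring sam's  ,x s/([^,]*),(.*)/\2,\1/
line = "surname,given name"
print(re.sub(r'([^,]*),(.*)', r'\2,\1', line))  # → given name,surname
```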


  2. Re: [9fans] non greedy regular expressions

    >i didn't spend any time deciding whether there was a better way
    >to express the trailing word delimiter. it's too late.


    gdiaz@9grid.es meanwhile pointed out one way: change the words to
    otherwise unused special characters.


  3. Re: [9fans] non greedy regular expressions

    Is awk available? This worked for me, though not on Plan 9. It does copy
    the newline after the 2nd "ABC" (I wasn't sure whether leading or all
    blank lines should be deleted).
    $ cat a.data
    dflkdl dlkrwo3je4ogjmdmxd
    ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
    sodifs
    sdfsd
    ABC
    dasdfas aasdfa
    njnjn CBA
    fkpri34ouijglkrlptgf;c
    $ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data
    asassadfasdf asdfasdf asdfasdf

    dasdfas aasdfa
    njnjn

    On Fri, Oct 24, 2008 at 3:04 PM, Rudolf Sykora wrote:

    > > doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
    > > maybe i don't understand the example. if so, could you explain?
    > >
    > > - erik

    >
    > I think not.
    > I have a file say like this
    >
    > ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
    > sodifs
    > sdfsd
    > ABC
    > dasdfas aasdfa
    > njnjn CBA
    >
    > and I want to get
    >
    > ' asassadfasdf asdfasdf asdfasdf '
    > 'dasdfas aasdfa'
    > 'njnjn'
    >
    > where I added apostrophes to show the spaces on individual lines. Simply:
    > give me everything that is between delimiters (ABC and CBA).
    >
    > Ruda
    >
    >



  4. Re: [9fans] non greedy regular expressions

    2008/10/25 Tom Simons :
    > Is awk available? This worked for me, but it's not on Plan9. It does copy
    > the newline after the 2nd "ABC" (I wasn't sure if leading or all blank lines
    > should be deleted).
    > $ awk 'BEGIN {RS = "ABC"; FS = "CBA"}NR == 1 {next}{print $1}' a.data


    As to that newline: it should be copied, since that one really is
    between delimiters. However, it is also the only one that should be
    copied; there shouldn't be any blank line anywhere in the middle of
    the resulting output. In this sense your solution doesn't work.

    Your solution ALMOST works in Linux. It turns out not to work in
    Plan 9 at all, probably because Plan 9's awk takes only the 1st
    character of the RS variable as the record separator.

    But what I really wanted to see is how people using plan9 can solve
    the problem without using a specialized minilanguage like awk. See
    what Eric S. Raymond says in his The Art of Unix Programming:

    http://www.faqs.org/docs/artu/ch08s02.html#awk

    Basically he claims that the way this language was designed was
    unfortunate, and that the language is in decline. Among the
    reasons is that languages like Perl, Python, and Ruby all form a
    suitable superset, and that
    'Programmers increasingly chose to do awklike things with Perl or
    (later) Python, rather than keep two different scripting languages in
    their heads'.

    I myself may not be competent enough to claim this too, but at least
    from my own experience I have started to like to use as few tools as
    possible. Thus I don't want to use awk any longer. I don't like perl
    either (in my opinion it's a bad language). Python is nice for coding,
    but somehow not handy for command-line use. Ruby seems superior
    to all. So in my ideal (though not the quickest) world I'd rather get
    rid of perl and awk, and use ruby instead if anything more complicated
    is needed.

    Anyway, my main reason for the task was to see if someone could
    really come up with a nice solution using exclusively sam (and,
    possibly, without the 's' command --- btw. no one so far has answered
    the question about the possible use of submatch tracking with
    commands other than 's'; remember, 's' was designated unnecessary).

    I wanted to see the thing be done in sam/acme without use of awk or
    sed. That is Charles Forsyth's solution, which really works:
    ---
    1. delete everything not between delimiters
    ,y/ABC([^C]|C[^B]|CB[^A]|\n)+CBA/d
    2. delete the delimiters
    ,x/ABC|CBA/d
    3. look to decide if i missed a boundary case for my input
    ---
    I like it. It does exactly what I wanted. And here comes the point
    I've been after from the very beginning. I wanted to show that the
    solution has a very ugly part in it, namely
    ([^C]|C[^B]|CB[^A]|\n)+
    whose only purpose is to ensure there is no CBA somewhere in the
    middle. Imagine something more complicated as a delimiter. Imagine
    I'd like the closing delimiter to be either CBA or EFG (either one
    would do). I think you'd soon go mad.

    In python ( http://www.amk.ca/python/howto/regex/), this is easily
    solved with a non-greedy operator

    /ABC(.*?)CBA/
    /ABC(.*?)(CBA|EFG)/
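    For instance, a sketch of the non-greedy version against the sample file from earlier in the thread (abridged; re.S makes . match newlines):

```python
import re

text = ("dflkdl dlkrwo3je4ogjmdmxd\n"
        "ABC asassadfasdf asdfasdf asdfasdf CBA hhh\n"
        "sodifs\nsdfsd\n"
        "ABC\ndasdfas aasdfa\nnjnjn CBA\n"
        "fkpri34ouijglkrlptgf;c\n")

# non-greedy: each (.*?) stops at the nearest closing delimiter
for part in re.findall(r'ABC(.*?)CBA', text, re.S):
    print(repr(part))
```

    This prints the two delimited stretches, spaces and internal newlines included, with no complement-class gymnastics.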

    It's true that non-greedy operators don't have a good meaning in
    Plan 9 (as R. Cox explained), due to its leftmost-longest paradigm.
    However, what I conclude from this example is that the leftmost-first
    kind of thinking, with two kinds of operators (greedy/non-greedy),
    can sometimes be quite useful.

    Now. If the leftmost-longest match is usable for my problem, I am fine
    with C + regexp(6). If not, I see only the possibility of using
    perl/python nowadays (if I don't want to go mad as above). Put it
    the other way round: perl/python can almost always be used; they
    solve almost any problem with regexps to our liking. Then
    universality wins and we may end up using perl/python exclusively.
    And we will (people do) use them in spite of their flawed (i.e.
    slow; perhaps horrible, as some of you said) designs.

    My question then is: wouldn't it be better to switch to the
    leftmost-first paradigm, hence open the possible use of (non-)greedy
    operators, and in a way contribute to an accord with perl/python
    syntax? And use a good algorithm for all that? But maybe it's not
    worth it and the current state is just sufficient...

    Ruda


  5. Re: [9fans] non greedy regular expressions

    > ...I have started to like to use as few tools as
    > possible.


    I entirely agree, there is too much to learn and you have to be selective;
    however my selection is: sed, awk, C and sam (for impromptu, well, editing).

    I cannot really comment directly on greedy operators; I have never
    knowingly used a system that provided them, and I have never found
    myself needing features that Plan 9's REs don't have.

    Perhaps it's a lack of vision on my part.

    It is probably possible to visualise problems that would be solved
    much more simply on Plan 9 than on Linux/BSD/etc. The bottom line
    is that the designers of Plan 9 took a different approach
    to REs, and as a result the systems are different.

    -Steve


  6. Re: [9fans] non greedy regular expressions

    > Now. If the leftmost-longest match is usable for my problem, I am fine
    > with C + regexp(6). If not I only see the possibility to use
    > perl/python nowadays (if I don't want to go mad like above).


    There is another option: yacc. I'm not saying it's simpler
    than perl or python, but it's not much harder and it's a more
    general tool. The resulting code will be fast, if you care.

    > My question then is: wouldn't it be better to switch to the
    > leftmost-first paradigm, hence open the possible use of (non-)greedy
    > operators, and in a way contribute to an accord with perl/python
    > syntax? And use a good algorithm for all that? But maybe it's not
    > worth it and the current state is just sufficient...


    Switching semantics in existing tools is rarely a good idea. Too many
    things break. If anyone is going to do this, and it won't be me, then
    it needs to be done in a way that doesn't break anything, like when
    UNIX switched from basic to extended regular expressions and added a
    -e option to grep.
    --
    John Stalker
    School of Mathematics
    Trinity College Dublin
    tel +353 1 896 1983
    fax +353 1 896 2282


  7. Re: [9fans] non greedy regular expressions

    I loved this thread. Thanks everyone. Thanks Rudolf Sykora.


  8. Re: [9fans] non greedy regular expressions

    The ability to put \1 in the right hand side of a substitution was
    done by jason@ at the Uni of Sydney, but after the Sam papers were
    published. It was a welcome feature that added special functionality
    to the 's' command within Sam. (Ed(1) had the feature, within its
    limited regexps, long before, of course.)

    -rob


    On Fri, Oct 24, 2008 at 12:56 PM, Rudolf Sykora wrote:
    > 2008/10/24 Charles Forsyth :
    >>>If I try sth. like
    >>>/( b(.)b)/a/\1\2/
    >>>on
    >>>bla blb 56
    >>>I get
    >>>bla blb\1\2 56
    >>>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')

    >>
    >> echo bla blb 56 | sed 's/( b(.)b)/&\1\2/'
    >> bla blb blbl 56
    >>
    >> similarly use `s' not `a' in sam.

    >
    >
    > Yes. But my question was more subtle. I know it can be done with the
    > 's' command in sam now. But I asked sth. different. The documentation
    > (paper about sam) says 's' is redundant. But it seems (only seems
    > since documentation says nothing), that submatches only work with the
    > 's' command. If this is true 's' is not redundant, since if it weren't
    > there you would not have any chance to use \1, \2, etc.
    > That was the reason I wanted to have my example rewritten WITHOUT the
    > 's' command...
    >
    > But anyway thanks!
    > Ruda
    >
    >



  9. Re: [9fans] non greedy regular expressions

    As I explained in an earlier post, your suggested

    > /ABC(.*?)CBA/


    is less robust than Charles's spelled-out version,
    since yours doesn't handle nested expressions well.
    That's a serious enough limitation that your
    scenario stops being a compelling argument for
    leftmost-first matching and non-greedy operators.
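    The failure mode is easy to reproduce (a sketch in Python; the nested sample is contrived):

```python
import re

# with a nested opener, the non-greedy match pairs the OUTER opening
# delimiter with the INNER closing one, so the nesting is lost
text = "ABC x ABC y CBA z CBA"
m = re.search(r'ABC(.*?)CBA', text)
print(repr(m.group(1)))  # → ' x ABC y '
```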

    > Then universality wins and we may end up using perl/python exclusively.


    No one said you couldn't. If they let you do your job faster than
    acme and sam do, no one here is going to tell you not to use them.

    > My question then is: wouldn't it be better to switch to the
    > leftmost-first paradigm, hence open the possible use of (non-)greedy
    > operators, and in a way contribute to an accord with perl/python
    > syntax? And use a good algorithm for all that? But maybe it's not
    > worth it and the current state is just sufficient...


    Leftmost-first matching is difficult to explain.
    When POSIX had to define the rules for regexps,
    they chose to define them as leftmost-longest
    even though all the implementations were leftmost-first,
    because describing leftmost-first precisely was too
    complex.

    Leftmost-first matching is not widely understood.
    At the beginning of this conversation, you believed that in the
    absence of non-greedy operators, leftmost-first matching
    produced leftmost-longest matches. I would guess that
    more people believe this than know otherwise.
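    The difference is easy to demonstrate (a hedged sketch; Python's re is a leftmost-first engine):

```python
import re

# leftmost-first: the alternation tries 'a' before 'ab' and commits
# to the first branch that lets the whole pattern succeed
print(re.match(r'a|ab', 'ab').group())  # → 'a'

# a POSIX leftmost-longest engine given the same pattern and input
# would report the longer match 'ab' instead
```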

    Leftmost-first matching only lets you write a regexp that
    is half right (see above) in the use case that you are
    focusing on.

    I don't see the benefit of switching from leftmost-longest
    to leftmost-first. Certainly an "accord with perl/python"
    is not good justification. Playing catch-up with Perl and
    Python regexps can only lead to madness. Both systems
    are complex enough that essentially no one completely
    understands them.

    Russ


  10. Re: [9fans] non greedy regular expressions

    > Leftmost-first matching is difficult to explain.
    > When POSIX had to define the rules for regexps,
    > they chose to define them as leftmost-longest
    > even though all the implementations were leftmost-first,
    > because describing leftmost-first precisely was too
    > complex.
    >
    > Leftmost-first matching is not widely understood.
    > At the beginning of this conversation, you believed that in the
    > absence of non-greedy operators, leftmost-first matching
    > produced leftmost-longest matches. I would guess that
    > more people believe this than know otherwise.


    Yes. I wasn't quite acquainted with the subject. Probably because I
    have never bumped into a problem with the leftmost-first
    implementation. Perhaps my problems were just not complex enough to
    notice that a regexp finds sth. different from my expectation. That
    seems to be the case for the majority of people.
    (E.g. Vim seems to use leftmost-first, too.)

    Ok. I am just somehow tired of this now. I thank everybody who
    positively contributed to this thread. I'll wait some time, try to
    use what is available, and see.

    Ruda


  11. Re: [9fans] non greedy regular expressions

    >Both systems are complex enough that essentially no one completely understands them.

    this touches on an important point. the first introduction of regular expressions
    to editors was great, because it took some formal language theory and made it useful
    in an `every day' way. conversely, the theory provided significant underpinning (in a structural
    sense) to that practical application. now there are big books on `regular expressions'
    mainly because they are no longer regular but a big collection of ad-hoc additions.
    (i think there is a similarity to Perl's relationship to the history of programming languages.)
    the result really is hard to reason about ... because it hasn't got that underpinning and the algebra has gone.
    it's worse than that: the code is a mess too, i suppose in consequence.


  12. Re: [9fans] non greedy regular expressions

    > practical application. now there are big books on `regular expressions'
    > mainly because they are no longer regular but a big collection of ad-hoc


    I thought they were "regular" because they "regularly" occurred in the
    target text. Turns out other interpretations are possible. Though, mine has
    the advantage of being correct :-D

    The set of "big books on regular expressions" includes Jeffrey Friedl's
    "Mastering Regular Expressions" that happens to contain a chapter by the
    title "NFA, DFA, and POSIX" wherein he says:

    > DFA Speed with NFA Capabilities: Regex Nirvana?
    >
    > I've said several times that a DFA can't provide capturing parentheses or
    > backreferences. This is quite true, but it certainly doesn't preclude
    > hybrid approaches that mix technologies in an attempt to reach regex
    > nirvana. The sidebar in Section 4.6.3.1 told how NFAs have diverged from
    > the theoretical straight and narrow in search of more power, and it's
    > only natural that the same happens with DFAs. A DFA's construction makes
    > it more difficult, but that doesn't mean impossible.
    >
    > GNU grep takes a simple but effective approach. It uses a DFA when
    > possible, reverting to an NFA when backreferences are used. GNU awk does
    > something similar—it uses GNU grep's fast shortest-leftmost DFA engine
    > for simple "does it match" checks, and reverts to a different engine for
    > checks where the actual extent of the match must be known. Since that
    > other engine is an NFA, GNU awk can conveniently offer capturing
    > parentheses, and it does via its special gensub function.
    >
    > Tcl's regex engine is a true hybrid, custom built by Henry Spencer (whom
    > you may remember having played an important part in the early development
    > and popularization of regular expressions; see Section 3.1.1.6). The Tcl
    > engine sometimes appears to be an NFA— it has lookaround, capturing
    > parentheses, backreferences, and lazy quantifiers. Yet, it has true POSIX
    > longest-leftmost match (see Section 4.6.1), and doesn't suffer from some
    > of the NFA problems that we'll see in Chapter 6. It really seems quite
    > wonderful.
    >
    > Currently, this engine is available only to Tcl, but Henry tells me that
    > it's on his to-do list to break it out into a separate package that can
    > be used by others.


    Again, turns out the "big books on regular expressions" can give the
    lowlife--that's me--things "hackers" deny them.

    --On Monday, October 27, 2008 10:18 AM +0000, Charles Forsyth wrote
    the message quoted above.







  13. Re: [9fans] non greedy regular expressions

    >> practical application. now there are big books on `regular expressions'
    >> mainly because they are no longer regular but a big collection of ad-hoc

    >
    > I thought they were "regular" because they "regularly" occurred in the
    > target text. Turns out other interpretations are possible. Though, mine has
    > the advantage of being correct :-D


    there's a reason they're not called regularly expressions.

    (if this were the definition, an expression's regularlyness would
    depend on the target text, would it not?)

    - erik



  14. Re: [9fans] non greedy regular expressions

    > The set of "big books on regular expressions" includes Jeffrey Friedl's
    > "Mastering Regular Expressions" that happens to contain a chapter by the
    > title "NFA, DFA, and POSIX" wherein he says:
    >
    > > DFA Speed with NFA Capabilities: Regex Nirvana?


    This guy seems to blur the distinctions here. His discussion
    makes it sound like he's saying that NFAs have more expressive
    power than DFAs. This is incorrect. Both NFAs and DFAs have
    exactly the same expressive power as the class of grammars
    called regular. For the arbitrary case of nesting (e.g. parens),
    these machines are insufficient. However, for any prescribed
    maximum nesting level, you can write a regular expression to
    account for it, though it becomes clumsy.
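    For example, here is a hedged sketch of a regular expression accepting parenthesized text with at most one level of inner nesting, and nothing deeper; the clumsiness grows quickly with the prescribed depth:

```python
import re

# one outer pair, containing any number of non-nested inner pairs
depth2 = re.compile(r'\([^()]*(?:\([^()]*\)[^()]*)*\)')

print(bool(depth2.fullmatch('(a(b)c(d)e)')))  # → True
print(bool(depth2.fullmatch('(a(b(c)d)e)')))  # → False (three levels deep)
```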

    To get more expressiveness, you need to go to a machine
    with more functionality. Classically, the next step
    up the hierarchy is the pushdown automaton. The expressive
    power of this machine corresponds directly to the context-
    free grammars. Because the full generality of CFGs requires
    nondeterminism, automated translations from CFG to code/machine
    are usually done with restricted classes of CFGs, such as LR(k)
    and LALR. You can also increase the power of an FA by adding a
    counter or by making the transitions probabilistic. If you truly
    want to build
    expression matching mechanisms that go beyond regular,
    building on the FA with counter(s) would be a far more
    sound foundation than a lot of the ad hoc stuff that's
    been done. But the truth is that whipping up a CFG
    and feeding it to yacc is far more expedient than torturing
    regular expressions all day.

    > Again, turns out the "big books on regular expressions" can give the
    > lowlife--that's me--things "hackers" deny them.


    And a good book on automata theory can give you far more
    than any "big book of...", "... in 21 days" or "... for
    dummies" book can. Besides, why would you think that
    anyone is denying you knowledge or the opportunity
    to write code that demonstrates your ideas?

    BLS



  15. Re: [9fans] non greedy regular expressions

    > there's a reason they're not called regularly expressions.

    As explained in the post by Brian L. Stuart, it's a matter of "grammar" :-P

    > (if this were the definition, an expression's regularlyness would
    > depend on the target text, would it not?)


    Yes, and that _would_ be why you wouldn't craft a regex for matching in a
    text with a, say, single-digit frequency of some string.

    --On Monday, October 27, 2008 9:23 AM -0400, erik quanstrom wrote
    the message quoted above.


