[9fans] non greedy regular expressions  Plan9
This is a discussion on [9fans] non greedy regular expressions  Plan9 ; Hello
regexp(6) seems to know only greedy regular expressions. So does
probably awk, sed, grep, ..., since these are based on that regexp.
My question is what to do if one needs nongreedy regexps.
Also, is there anything like submatch ...

[9fans] non greedy regular expressions
Hello
regexp(6) seems to know only greedy regular expressions. So does
probably awk, sed, grep, ..., since these are based on that regexp.
My question is what to do if one needs nongreedy regexps.
Also, is there anything like submatch extraction, counted repetition
of regexp (like {3,5}), (lookahead) assertions in plan9?
Thanks
Ruda

Re: [9fans] non greedy regular expressions
> regexp(6) seems to know only greedy regular expressions. So does
> probably awk, sed, grep, ..., since these are based on that regexp.
> My question is what to do if one needs nongreedy regexps.
russ has a great writeup on this.
http://swtch.com/~rsc/regexp/
i think it covers all your questions.
 erik

Re: [9fans] non greedy regular expressions
> russ has a great writeup on this.
> http://swtch.com/~rsc/regexp/
> i think it covers all your questions.
>
>  erik
I read trough some of that already yesterday. Anyway, am still
puzzled. In the text of
Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)
R. Cox writes:

While writing the text editor sam [6] in the early 1980s, Rob Pike
wrote a new regular expression implementation, which Dave Presotto
extracted into a library that appeared in the Eighth Edition. Pike's
implementation incorporated submatch tracking into an efficient NFA
simulation but, like the rest of the Eighth Edition source, was not
widely distributed.
....
Pike's regular expression implementation, extended to support Unicode,
was made freely available with sam in late 1992, but the particularly
efficient regular expression search algorithm went unnoticed. The code
is now available in many forms: as part of sam, as Plan 9's regular
expression library, or packaged separately for Unix.

But any manual page (regexp(6), that of sam) keeps completely silent
about eg. any submatch tracking.
So what's wrong? Can anybody clarify the situation for me or do I
really have to read the codes?
Ruda

Re: [9fans] non greedy regular expressions
> But any manual page (regexp(6), that of sam) keeps completely silent
> about eg. any submatch tracking.
> So what's wrong? Can anybody clarify the situation for me or do I
> really have to read the codes?
well reading the code would be a travesty. it's curious
that neither the sam paper nor regexp(6) mentions
submatches. maybe i missed them.
sed n 's:.*(KRAK[AZ]+*) +([azAZ]+).*:\2, \1:gp'
sam d /lib/volcanoes
,x g/KRAK/ s:[^AZ]+([AZ]+*) +([azAZ]+).*:\2, \1:
/KRAK/+
or, if you want to run sam as a stream editor
{ echo ',x g/KRAK/ s:[^AZ]+([AZ]+*) +([azAZ]+).*:\2, \1:'
echo '/KRAK/+'
}  sam d /lib/volcanoes >[2=]
a couple of examples of submatch tracking from our system
(i'm not sure about the version of iwhois on sources.)
/rc/bin/B
/rc/bin/Kill
/rc/bin/broke
/rc/bin/fax
/rc/bin/fshalt
/rc/bin/iwhois
/rc/bin/juke
 erik

Re: [9fans] non greedy regular expressions
> well reading the code would be a travesty. it's curious
> that neither the sam paper nor regexp(6) mentions
> submatches. maybe i missed them.
>
> sed n 's:.*(KRAK[AZ]+*) +([azAZ]+).*:\2, \1:gp'
>  erik
Ok, so despite the documentation, some submatch tracking is there.
But in all (?) your examples, as well as in the scripts you mentioned,
this tracking is exclusively used with the s command (which is said to
be unnecessary at least in sam/acme). If I try sth. like
/( b(.)b)/a/\1\2/
on
bla blb 56
I get
bla blb\1\2 56
which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
he claims that all nice features except for backreferences can be
implemented with Thomson's NFA algorithm. And even the backreferences
can be handled gracefully somehow. That is: ALL: nongreedy operators,
generalized assertions, counted repetitions, character classes CAN be
processed using the fast algorithm. Why then we don't have it? I once
wrote a program in python and was pretty happy to have nongreedy
operators and lookahead assertions on hand. Should I hadn't had those,
I probably wouldn't have been able to write it (nicely).
Ruda

Re: [9fans] non greedy regular expressions
> Ok, so despite the documentation, some submatch tracking is there.
> But in all (?) your examples, as well as in the scripts you mentioned,
> this tracking is exclusively used with the s command (which is said to
> be unnecessary at least in sam/acme). If I try sth. like
> /( b(.)b)/a/\1\2/
this is covered in the sam paper in the section where pike
discusses s.
> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thomson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: nongreedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it?
let me turn this around. why would these additions (complications)
benefit plan 9?
 erik

Re: [9fans] non greedy regular expressions
I think you've understood correctly. Back references mostly aren't
there. Greedy operators aren't there. For back references, this may
be due to philosophical reservations; I have a few myself. For greedy
operators, I suspect it's more because noone has cared enough to do
it. It wouldn't be too hard, as Russ' article says. If someone is
going to to this I would suggest going all the way and implementing
tags. See http://laurikari.net/ville/spire2000tnfa.ps.
> > well reading the code would be a travesty. it's curious
> > that neither the sam paper nor regexp(6) mentions
> > submatches. maybe i missed them.
> >
> > sed n 's:.*(KRAK[AZ]+*) +([azAZ]+).*:\2, \1:gp'
> >  erik
>
> Ok, so despite the documentation, some submatch tracking is there.
> But in all (?) your examples, as well as in the scripts you mentioned,
> this tracking is exclusively used with the s command (which is said to
> be unnecessary at least in sam/acme). If I try sth. like
> /( b(.)b)/a/\1\2/
> on
> bla blb 56
> I get
> bla blb\1\2 56
> which is not quite what I want... How then? (I'd like to get 'bla blblblb 56'
> )
>
> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thomson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: nongreedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it? I once
> wrote a program in python and was pretty happy to have nongreedy
> operators and lookahead assertions on hand. Should I hadn't had those,
> I probably wouldn't have been able to write it (nicely).
>
> Ruda
>

John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282

Re: [9fans] non greedy regular expressions
On Fri, Oct 24, 2008 at 6:11 PM, Rudolf Sykora wrote:
> Further, in R. Cox's text (http://swtch.com/~rsc/regexp/regexp1.html)
> he claims that all nice features except for backreferences can be
> implemented with Thomson's NFA algorithm. And even the backreferences
> can be handled gracefully somehow. That is: ALL: nongreedy operators,
> generalized assertions, counted repetitions, character classes CAN be
> processed using the fast algorithm. Why then we don't have it?
Just because something is possible doesn't mean that it is a good
idea. Implementing every possible feature is the GNU/way, if you want
to go down that path, you know where to find it.
> I once
> wrote a program in python and was pretty happy to have nongreedy
> operators and lookahead assertions on hand. Should I hadn't had those,
> I probably wouldn't have been able to write it (nicely).
Python uses the absolutely dreadful and unintelligible PCRE, which not
only have unpredictable performance as Russ points out, but are a
total mess so complex that I doubt anybody fully understands them,
they are to Plan 9 regexps what C++ is to C.
Maybe Plan 9 regexps are not as 'powerful', but at least I can understand them.
uriel

Re: [9fans] non greedy regular expressions
Backreferences within the pattern (such as in /(.*)\1/) make the
matcher nonregular and exponentially hard. It was a deliberate
decision not to implement them in sam and I'd make the same decision
today.
As far as greedy/nongreedy operators go, they have more utility but I
believe they have become popular primarily to deal with the problems
with Perl and PCRE doing greedy implementations badly in order to
avoid the exponential cost of their backtracking matchers. These
matchers can give crazy results.
Sam etc. are O(length of string * length of pattern) at all times and
their leftmostlongest properties work throughout.
Again, Russ has most of this in his analysis.
rob

Re: [9fans] non greedy regular expressions
2008/10/24 John Stalker :
> I think you've understood correctly. Back references mostly aren't
> there. Greedy operators aren't there.
you probably mean NONgreedy ops.
I was not that concerned about backreferences but submatch extraction
(according to the terminology used by R. Cox in his article).
R.
PS.: and submatch extraction is there, somehow, undocumented (to my knowledge)

Re: [9fans] non greedy regular expressions
Regarding greedy vs nongreedy etc.
In Plan 9, a regular expression search always looks
for the "leftmostlongest" match, which means
a match that starts as far to the left as possible
in the target text, and of the matches that start there,
the longest one.
In that model, it is not accurate to describe the * + ?
operators as greedy or not. None of them is working
toward any goal other than the overall longest match
at the leftmost position.
In Perl and its imitators, the match starts at the leftmost
position but is otherwise the first one that is found,
not necessarily the longest. In that context, words like
"greedy" and "nongreedy" start to make sense,
because the behavior of any one operator influences which
match is encountered first.
Either approach  leftmostlongest or leftmostfirst 
can be implemented using finite automata.
Russ

Re: [9fans] non greedy regular expressions
> you probably mean NONgreedy ops.
Yes, my mistake. I'll risk making a very minor correction to
Rob's post as well:
> Backreferences within the pattern (such as in /(.*)\1/) make the
> matcher nonregular and exponentially hard.
They do change the class of the grammar and nobody knows how to
implement them in subexponential time, but it hasn't been proved
to be impossible.

John Stalker
School of Mathematics
Trinity College Dublin
tel +353 1 896 1983
fax +353 1 896 2282

Re: [9fans] non greedy regular expressions
>If I try sth. like
>/( b(.)b)/a/\1\2/
>on
>bla blb 56
>I get
>bla blb\1\2 56
>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
echo bla blb 56  sed 's/( b(.)b)/&\1\2/'
bla blb blbl 56
similarly use `s' not `a' in sam.

Re: [9fans] non greedy regular expressions
2008/10/24 Charles Forsyth :
>>If I try sth. like
>>/( b(.)b)/a/\1\2/
>>on
>>bla blb 56
>>I get
>>bla blb\1\2 56
>>which is not quite what I want... How then? (I'd like to get 'bla blblblb 56')
>
> echo bla blb 56  sed 's/( b(.)b)/&\1\2/'
> bla blb blbl 56
>
> similarly use `s' not `a' in sam.
Yes. But my question was more subtle. I know it can be done with the
's' command in sam now. But I asked sth. different. The documentation
(paper about sam) says 's' is redundant. But it seems (only seems
since documentation says nothing), that submatches only work with the
's' command. If this is true 's' is not redundant, since if it weren't
there you would not have any chance to use \1, \2, etc.
That was the reason I wanted to have my example rewritten WITHOUT the
's' command...
But anyway thanks!
Ruda

Re: [9fans] non greedy regular expressions
> In that model, it is not accurate to describe the * + ?
> operators as greedy or not. None of them is working
> toward any goal other than the overall longest match
> at the leftmost position.
So then I must be mistaken about the terminology.
I thought greedy=leftmostlongest, while nongreedy=leftmostfirst:
Having
bla bla (AB)(CDEF)(GH)
/\(.*\)/
matches the whole (AB)(CDEF)(GH), 'greedy', while if '*' were
'nongreedy', I'd expect a match with (AB). This is now not in Plan9.
This mentioned example is easy to solve, if I want to go in 'bracket'
steps with
/\([^)]*\)/
but there are harder examples, eg. where the_interesting_part is
delimited with a more complicated structure ('ABC' here):
ABCthe_interesting_partABC blabla bla ABCthe_interesting_partABC etc.
Now, how to parse it? With nongreedy '*', no problem. With 'greedy'
and no negative lookahead assertion? Ok, maybe 'y' in sam could help
(don't know). What about
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA etc.?
All the thinking about this is simply removed with 'nongreedy' ops.
Ruda
> In Perl and its imitators, the match starts at the leftmost
> position but is otherwise the first one that is found,
> not necessarily the longest. In that context, words like
> "greedy" and "nongreedy" start to make sense,
> because the behavior of any one operator influences which
> match is encountered first.
>
> Either approach  leftmostlongest or leftmostfirst 
> can be implemented using finite automata.
>
> Russ
>
>

Re: [9fans] non greedy regular expressions
> I thought greedy=leftmostlongest, while nongreedy=leftmostfirst:
Greedy leftmostfirst is different from leftmostlongest.
Search for /a*(ab)?/ in "ab". The leftmostlongest match
is "ab", but the leftmostfirst match (because of the
greedy star) is "a". In the leftmostfirst case, the greediness
of the star caused an overall short match.
> All the thinking about this is simply removed with 'nongreedy' ops.
But it isn't (or shouldn't be).
Using /\(.*\)/ to match small parenthesized expressions
is fragile: /\(.*\)/ in "(a(b))" matches "(a(b)".
In contrast, the solution you rejected /\([^)]*\)/ is more robust.
It doesn't make sense to shoehorn nongreedy and greedy
operators into an engine that provides leftmostlongest matching.
If you want a different model, you need to use a different program.
Perl has been ported.
Russ

Re: [9fans] non greedy regular expressions
> Greedy leftmostfirst is different from leftmostlongest.
> Search for /a*(ab)?/ in "ab". The leftmostlongest match
> is "ab", but the leftmostfirst match (because of the
> greedy star) is "a". In the leftmostfirst case, the greediness
> of the star caused an overall short match.
Ok, I finally see the point... thanks.
Only one last question: So is there any simple way (using existing
regexps) to find 'the interesting parts' in my last example?:
ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
....i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
Say I have a file with such a structure and want to only get 'the
interesting parts' form it.
>
>> All the thinking about this is simply removed with 'nongreedy' ops.
>
> But it isn't (or shouldn't be).
well, I admit it isn't

Re: [9fans] non greedy regular expressions
> Ok, I finally see the point... thanks.
> Only one last question: So is there any simple way (using existing
> regexps) to find 'the interesting parts' in my last example?:
> ABCthe_interesting_partCBA blabla bla ABCthe_interesting_partCBA ...
> ...i.e. delimited with ABC from the left and CBA (or e.g. EFG) from the right?
> Say I have a file with such a structure and want to only get 'the
> interesting parts' form it.
doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
maybe i don't understand the example. if so, could you explain?
 erik

Re: [9fans] non greedy regular expressions
> doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
> maybe i don't understand the example. if so, could you explain?
>
>  erik
I think not.
I have a file say like this
ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
sodifs
sdfsd
ABC
dasdfas aasdfa
njnjn CBA
and I want to get
' asassadfasdf asdfasdf asdfasdf '
'dasdfas aasdfa'
'njnjn'
where I added apostrophes to see the spaces on indivial lines. Simply:
give me everything that is between delimiters (ABC and CBA).
Ruda

Re: [9fans] non greedy regular expressions
hello
using sed and only one regexp is mandatory?
cat t.txt sed 's/(ABC  CBA)/ \n\1\n /g'  awk '/ABC/,/CBA/'  grep 
v 'ABCCBA'
that's a naive and simple approach, but i can't see why you need to
use just one regexp and just one sed. May be i missed something
through the thread :?
gabi
El 25/10/2008, a las 0:04, Rudolf Sykora escribió:
>> doesn't s/ABC(the_interesting_part)CBA/x/g work for you?
>> maybe i don't understand the example. if so, could you explain?
>>
>>  erik
>
> I think not.
> I have a file say like this
>
> ABC asassadfasdf asdfasdf asdfasdf CBA hhhhhhhhhhjjjjjjjjjjioioioi
> sodifs
> sdfsd
> ABC
> dasdfas aasdfa
> njnjn CBA
>
> and I want to get
>
> ' asassadfasdf asdfasdf asdfasdf '
> 'dasdfas aasdfa'
> 'njnjn'
>
> where I added apostrophes to see the spaces on indivial lines. Simply:
> give me everything that is between delimiters (ABC and CBA).
>
> Ruda
>
>