In this post we will look at the greedy nature of common quantifiers (* and +).
II Greedy vs Reluctant
We’re so used with the * quantifier that we almost forget its fundamental greedy nature. A quantifier is considered “greedy” when it tries to match the longest text chain possible before bactracking if the pattern does not match. Its counterpart is a reluctant quantifier.
Example is better that words, let’s see how we can extract the root directory in an Unix system given the complete path of a file:
Input: /home/data/ebooks/JavaForDummies.pdf
Pattern: ^(.+)/.+$
String text = "/home/data/books/JavaForDummies.pdf"; Pattern pattern = Pattern.compile("^/(.+)(?:/.+)*$"); Matcher matcher = pattern.matcher(text); if(matcher.matches()) { System.out.println(" Root folder : " + matcher.group(1)); }
Not surprisingly the console would display:
Matches : true
Group 1 : home/data/books/JavaForDummies.pdf
What the heck ? But we did clearly isolate the first part /(.+) of the pattern from the remaining trail, how comes ?
This is a very common mistake when you just happen to forget the greedy nature of the quantifier (greedy is their default behavior). First the regexp engine tries to match the pattern/(.+) with the longest possible text. The regexp engine starts by consuming all the text:
/(.+) : /home/data/books/JavaForDummies.pdf↑
The first part of the pattern does match, and the engine will stop here, reporting the entire text as the match. The last part of the pattern (?:/.+)* did indeed match with .. nothing because the * quantifier means nothing or more and it this case it matched nothing.
To circumvent this issue, one can thing to turn the * into + to force the regexp engine to match. The result is no even close to what we expect:
Matches : true
Group 1 : home/data/books
We did get rid of the last matching block (/JavaForDummies.pdf) but it’s still not satisfactory.
But ? Wasn’t the + in (?:/.+)+ supposed to be greedy ? Why didn’t it match /data/books/JavaForDummies.pdf but only the last part ?
Well, first come, first served. The regexp engine starts processing the /(.+) part first and since it is greedy it will consume all the string. But by doing so, there is nothing left to be matched with (?:/.+)+ so the regexp engine is forced to give up as many characters as needed so that the WHOLE pattern matches.
Step 1. /home/data/books/JavaForDummies.pdf↑
- /(.+) matches /home/data/books/JavaForDummies.pdf
- (?:/.+)+ matches nothing -> KO
Step 2. /home/data/books/JavaForDummies.pd↑f
- /(.+) matches /home/data/books/JavaForDummies.pd
- (?:/.+)+ matches f -> KO
Step 3. /home/data/books/JavaForDummies.p↑df
- /(.+) matches /home/data/books/JavaForDummies.p
- (?:/.+)+ matches df -> KO
…
Step 20. /home/data/books↑
- /(.+) matches /home/data/books
- (?:/.+)+ matches /JavaForDummies.pdf -> OK
As long as (?:/.+)+ can match anything, be it the shortest length, the regexp engine will stop processing and return the successful result. Indeed the greediness of the first part /(.+) has priority over the greediness of the second one (?:/.+)+.
Now, back to our issue, we still want to extract the root folder from the file path. Each folder level in the path is delimited by the slash /. The idea here would be to take the first group of text between slashes:
Patter: ^/([^/]+)/.+$
This time the output is:
Matches ? true
Group 1 : home
That’s what we wanted. The trick here is the use of a negation in a character class: [^/]. This means “anything except the a/“. Combined with the beginning / and the second /, we can isolate the root folder from the remaining path: /([^/]+)/
Alternatively we can use the reluctant version of our + quantifier: +?
Pattern: ^/(.+?)(?:/.+)+$
The output:
Matches ? true
Group 1 : home
This time, instead of starting by consuming all the string, the reluctant +? will consume character by character until the whole pattern matches, thus giving us the smallest text that can match the string match.
Recommended readings: