What's the difference between '(?:' and '(' ?

From the Python docs:

(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

I understand what the difference is but I don't get why the () version doesn't work for the regexps used in the class. We never used \ followed by a number in those regexps so how does it interfere?

EDIT: I probably didn't ask the question as clearly as I could have, but luckily it got answered anyway. I was confused because I hadn't read the findall docs, so I didn't know that it behaves differently when the regexp contains capturing groups. As dreyescat put it:

I don't think it is really due to a difference between capturing and non-capturing groups, but rather to the findall method, which changes its behavior (its return value) when the pattern contains a capturing group.

All these regular expressions (regex1, regex2, and regex3) match the same text, but when you call findall with the second one, findall returns the list of groups instead of the list of matches. So it is because of findall's behavior that they appear to behave differently.

From findall documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

asked 22 Apr '12, 12:09

Thomas Hodson

edited 25 Apr '12, 06:42


6 Answers:

I'll just echo the answers already given by others. The syntax introduced in class (?:) does not save the particular text matched by that structural group (and thus it can potentially be ever-so-slightly faster when implemented, but that's not worth considering). The big motivation in using it was making the output of re.findall() make sense to newcomers to regular expressions. I could have spent class time explaining the difference, but it did not seem to have high pedagogical merit compared to the other things we cover in Unit 1, and "escape sequences" had used up most of our "gory details" budget.

If cs262 students go on to use regular expressions on the job or in other projects, they may well find it handy to use the capturing parentheses -- matching and replacing subgroups is a key regular expression power. But we won't need it for this class. However, if you've learned on (?:) it's easy enough to shift gears to () later. Sorry for any confusion!
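As a rough illustration of the "matching and replacing subgroups" point, here is a minimal sketch using re.sub with backreferences in the replacement string. The sample name and the variable names are made up for illustration and are not from the course.

```python
import re

# Hypothetical input, chosen only for illustration.
name = "Lovelace, Ada"

# Two capturing groups: group 1 is the last name, group 2 is the first name.
pattern = r"(\w+), (\w+)"

# In the replacement string, \2 and \1 refer back to the captured text.
swapped = re.sub(pattern, r"\2 \1", name)
print(swapped)  # Ada Lovelace
```

With (?:...) in place of (...), there would be nothing for \1 and \2 to refer to, which is why capturing groups are the ones to reach for when replacing.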

answered 22 Apr '12, 22:07

UdacityWes ♦♦

Do you feel the syntax should have been reversed? (?:) should have been for capturing parentheses and () used for non-capturing? I think the more bizarre syntax should be used for the more unusual case of capturing.

(22 Apr '12, 23:19)

Charles Lin

I completely agree that the non-capturing version is the one to learn first. It was just that, from reading the Python docs on the two types of group, I didn't understand why the output of findall changed. I should have read the findall docs as well. Also, it's really nice to have the teacher present on the forums.

(25 Apr '12, 06:47)

Thomas Hodson

Take a look at this example:

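import re

string = "She said: <strong>Hi!</strong>"

regex1 = r"<\w+>.*</\w+>"        # no group
regex2 = r"<\w+>(.*)</\w+>"      # capturing group
regex3 = r"<\w+>(?:.*)</\w+>"    # non-capturing group

# findall returns whole matches unless the pattern has a capturing group
print(re.findall(regex1, string))
print(re.findall(regex2, string))
print(re.findall(regex3, string))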

If you run this you will get

['<strong>Hi!</strong>']
['Hi!']
['<strong>Hi!</strong>']

as a result.

So "(...)" and "(?:...)" yield different results when passed to findall.
"(?:...)" is intended solely for grouping parts of your regex and does not change what findall returns. In the above example you can see that the results of regex1 and regex3 are identical.

regex2, on the other hand, yields a different result (the contents of the parentheses). So it is clear why the examples from the course did not work this way.

"(...)" are used to create backreferences and are very useful. Consider this example:

string = "<a>foo</a> <b>bar <c>baz</c></b>"
regex4 = r"<(\w+)>(.*)</\1>"   # \1 refers to the first (...) in the regex, \2 to the second, ...
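To round the example off, here is a minimal sketch of what calling findall with regex4 gives. The comparison pattern without the backreference (named no_backref here) is my own addition for contrast, not part of the original example.

```python
import re

string = "<a>foo</a> <b>bar <c>baz</c></b>"
regex4 = r"<(\w+)>(.*)</\1>"

# Two capturing groups, so findall returns a list of (tag, contents) tuples.
# \1 forces the closing tag to have the same name as the opening tag.
pairs = re.findall(regex4, string)
print(pairs)  # [('a', 'foo'), ('b', 'bar <c>baz</c>')]

# For contrast: without the backreference, any closing tag is accepted,
# so the greedy .* runs all the way to the last closing tag in the string.
no_backref = r"<(\w+)>(.*)</\w+>"
loose = re.findall(no_backref, string)
print(loose)  # [('a', 'foo</a> <b>bar <c>baz</c>')]
```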

answered 22 Apr '12, 14:16

gorilla834

edited 22 Apr '12, 14:18

I don't think it is really due to a difference between capturing and non-capturing groups, but rather to the findall method, which changes its behavior (its return value) when the pattern contains a capturing group.

All these regular expressions (regex1, regex2, and regex3) match the same text, but when you call findall with the second one, findall returns the list of groups instead of the list of matches. So it is because of findall's behavior that they appear to behave differently.

From findall documentation:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right,
and matches are returned in the order found. If one or more groups are present in the pattern, return a list
of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included
in the result unless they touch the beginning of another match.
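The "list of tuples" case from the documentation can be sketched like this. It reuses the sample string from the example above. The second half, which uses re.finditer to get at the whole match, is one possible workaround and an addition of mine, not something covered in the class.

```python
import re

string = "She said: <strong>Hi!</strong>"

# Two capturing groups: findall returns a list of tuples, one item per group.
two_groups = re.findall(r"<(\w+)>(.*)</\w+>", string)
print(two_groups)  # [('strong', 'Hi!')]

# finditer yields match objects, so the whole match stays available
# through group(0) even when the pattern contains a capturing group.
matches = list(re.finditer(r"<\w+>(.*)</\w+>", string))
whole = [m.group(0) for m in matches]
inner = [m.group(1) for m in matches]
print(whole)  # ['<strong>Hi!</strong>']
print(inner)  # ['Hi!']
```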

(22 Apr '12, 15:21)

dreyescat

The only difference between the capturing and non-capturing versions of the grouping parentheses is that the capturing version keeps the matched groups for later use, while the non-capturing version doesn't. So as far as matching goes, you could use either of them for the regular expressions in this class and they should work; it doesn't matter which.

I think that the non-capturing version is basically a convenient form of grouping that can help with complex regular expressions that have lots of groups, when some are required for later use and others aren't. It can be difficult to track which group number is which. Using capturing groups for the ones you need and non-capturing groups for the rest keeps the group numbering simple and helps a lot in tracking only the interesting groups.

Apart from this convenience, I have not found anything saying that one is better than the other, for example regarding performance.

An interesting excerpt from the regular expressions documentation:

Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group
behaves exactly the same as a capturing group; you can put anything inside it, repeat it with a repetition
metacharacter such as *, and nest it within other groups (capturing or non-capturing). (?:...) is particularly
useful when modifying an existing pattern, since you can add new groups without changing how all the other
groups are numbered. It should be mentioned that there’s no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than the other.
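A small sketch of the numbering point in that excerpt follows. The date-like sample text and the patterns are invented purely for illustration.

```python
import re

text = "date 2012-04"  # hypothetical sample input

# Original pattern: group 1 is the year, group 2 is the month.
original = re.search(r"(\d{4})-(\d{2})", text)
print(original.groups())  # ('2012', '04')

# Adding a non-capturing group in front leaves the numbering untouched.
non_capturing = re.search(r"(?:date|on) (\d{4})-(\d{2})", text)
print(non_capturing.groups())  # ('2012', '04')

# Adding a capturing group instead shifts every later group by one.
capturing = re.search(r"(date|on) (\d{4})-(\d{2})", text)
print(capturing.groups())  # ('date', '2012', '04')
```

In the last case, any existing code that reads group(1) expecting the year would now get 'date' instead.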

answered 22 Apr '12, 14:55

dreyescat

Look here: http://docs.python.org/library/re.html#regular-expression-syntax
Both variants are explained there.

(?:...) => A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

But so far I do not understand what that really means.

answered 22 Apr '12, 13:10

mat10243x-3

It is just how Python defines grouping in RE.

answered 22 Apr '12, 13:04

chaim

It makes the intent of the regular expression clearer, and I'd wager it comes with a (likely very small) performance increase; there's no point in keeping a reference to a subexpression if we're not going to need it, and it's much clearer from simply looking at the regex that it's being used only for grouping, and not for capturing. Also (?:) totally looks cooler. :D

answered 22 Apr '12, 12:25

Justin Singer
