Saturday, 4 May 2013

Python Regular Expression Cheat Sheet


Meta-Character

[a-c], [abc] : a, b, c
[^5] : all chars except 5
\d : [0-9] decimal digit
\D : [^0-9] non-digit chars
\s : [ \t\n\r\f\v] whitespace chars
\S : [^ \t\n\r\f\v] non-whitespace chars
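\w : [a-zA-Z0-9_] word chars (letters, digits, underscore)
\W : [^a-zA-Z0-9_] non-word chars
^ : beginning of string (also beginning of each line with re.MULTILINE)
\A : beginning of string (differs from ^ in multi-line mode)
$ : end of string, or just before a trailing newline (also end of each line with re.MULTILINE)
\Z : end of string (differs from $ in multi-line mode)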
\b : word boundary
\B : non-word boundary
() : group, e.g. (ab)+ matches ab, abab, ababab, ...
\1, \2, ... : backreference to the text matched by group 1, group 2, ...
(?P<name>...) : named group, e.g. (?P<word>\b\w+\b) : the text matched by \b\w+\b is stored in a group named word
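
The anchors, backreferences, and named groups listed above can be sketched as follows. This is a minimal illustration, not part of the original cheat sheet; the sample strings and variable names are made up for the example.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ matches only at the start of the string.
default_caret = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
multiline_caret = re.findall(r"^\w+", text, re.MULTILINE)
# \A always means the start of the string, whatever the flags.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)
print(default_caret, multiline_caret, string_start)

# \1 refers back to whatever group 1 matched: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "Paris in the the spring")
print(doubled.group())

# A named group can be read by name or by number.
m = re.search(r"(?P<word>\b\w+\b)", "(((( Lots of punctuation )))")
print(m.group("word"), m.group(1))
```

Only the re.MULTILINE call returns both "first" and "second"; the other two calls stop at the first line.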

Repeating

* : 0 or more repetitions (greedy: matches as much as possible)
+ : 1 or more repetitions (greedy: matches as much as possible)
? : 0 or 1 repetition, same as {0,1} (greedy: matches as much as possible)
*?, +?, ?? : same as above but non-greedy (matches as little as possible)
{m} : exactly m repetitions
{m,n} : m to n repetitions
{,n} : 0 to n repetitions
{m,} : m or more repetitions
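
As a rough illustration of the counted forms above, the sketch below runs several quantifiers over one string. The sample string and variable names are assumptions chosen for the example.

```python
import re

digits = "a1b22c333d4444"

exactly_two = re.findall(r"\d{2}", digits)     # exactly two digits per match
two_to_three = re.findall(r"\d{2,3}", digits)  # two to three digits, as many as possible
two_or_more = re.findall(r"\d{2,}", digits)    # two or more digits
lazy = re.findall(r"\d{2,3}?", digits)         # non-greedy: stops at two digits

print(exactly_two)
print(two_to_three)
print(two_or_more)
print(lazy)
```

The lone "1" never matches because every pattern needs at least two digits, and the non-greedy {2,3}? behaves like {2} here because nothing after it forces a longer match.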

Usage - Pattern object

re.compile() : compile a pattern string into a Pattern object
match() : match at the beginning of the string; returns a Match object, or None if there is no match
search() : match at any location in the string; returns a Match object, or None if there is no match
findall() : return all non-overlapping matched substrings as a list
finditer() : return an iterator of Match objects, one per non-overlapping match

Usage - Module level

re.match(<regex string>, <target string>) : same as re.compile(<regex string>).match(<target string>)
re.search(...)
re.findall(...)
re.finditer(...)
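
The sketch below compares the module-level functions with the equivalent calls on a compiled Pattern object. It is a small illustration with a made-up sample string, not a statement about which form to prefer.

```python
import re

text = "12 drummers drumming, 11 pipers piping"

# Compiled form: build the Pattern once, then call its methods.
pattern = re.compile(r"\d+")
compiled_result = pattern.findall(text)

# Module-level form: pass the pattern string on every call.
module_result = re.findall(r"\d+", text)
print(compiled_result, module_result)

# match() only tries position 0, so it fails on text that starts with digits;
# search() scans forward until the pattern matches somewhere.
at_start = re.match(r"[a-z]+", text)
anywhere = re.search(r"[a-z]+", text)
print(at_start, anywhere.group())
```

Both forms return the same results; compiling once is mostly a convenience when the same pattern is reused many times.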

Usage - Match object

group() : return the matched string
start() : return the starting position (0-indexed)
end() : return the ending position (0-indexed, exclusive)
span() : return a tuple of (start, end) positions
match length = end() - start()
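
A minimal sketch of how these Match methods relate to one another, using an illustrative target string: span() gives the same pair as start() and end(), and slicing the target with it recovers the matched text.

```python
import re

target = "::: message"
m = re.search(r"[a-z]+", target)

matched = m.group()      # the matched text
start, end = m.span()    # same values as (m.start(), m.end())
length = end - start     # match length

print(matched, start, end, length)
# Because end is exclusive, a plain slice returns the matched text.
print(target[start:end] == matched)
```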

Sample

>>> import re
>>> p = re.compile('[a-z]+')
>>> p.match("")
>>> print(p.match(""))
None
>>> m = p.match('tempo')
>>> m.group()
'tempo'
>>> m.start(), m.end()
(0, 5)
>>> m.span()
(0, 5)
>>>
>>>
>>> m = p.search('::: message')
>>> m.group()
'message'
>>> m.span()
(4, 11)
>>>
>>>
>>> p = re.compile(r'\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
>>> for match in iterator:
...     print(match.span())
...
(0, 2)
(22, 24)
(29, 31)
>>>
>>>
>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search('(((( Lots of punctuation )))')
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
>>>
>>>
>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.') # delimiter not included
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.') # delimiter included
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
>>>
>>>
>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
>>>
>>>
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print(re.match('<.*>', s).span())
(0, 32)
>>> print(re.match('<.*>', s).group()) # greedy match
<html><head><title>Title</title>
>>> print(re.match('<.*?>', s).group()) # non-greedy match; the non-greedy quantifiers are *?, +?, ??, and {m,n}?
<html>
