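Chapter 7: Pattern Matching with Regular Expressions

It feels like I've been slacking off for too long, so here I am with another update. This time it's Chapter 7: regular expressions. If I've gotten anything wrong, please leave me a comment. Thanks very much.

Before getting into the notes for this chapter, let me first say what a regular expression is.

Baidu gives this definition: a regular expression is a logical formula for operating on strings. It uses predefined special characters, and combinations of those characters, to build a "rule string", and this "rule string" expresses a filtering logic for strings. (I think that's already quite clear. Put more simply, it is a kind of shorthand: a specific notation for describing what text should look like.)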

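Finding Patterns of Text Without Regular Expressions

The book's first example is finding phone numbers in some strings. The phone number format is xxx-xxx-xxxx. I'll assume that readers of these notes already know Python or have a background in another language, so take a moment to think: how would you implement this?

The simplest approach, with no wrapping at all, is to read the string from the keyboard or from a file and then check it with if statements in the "main" part of the program. If you still have questions about strings, tuples, or lists at this point, please flip back to the earlier notes; I won't repeat them here.

Below is the code provided in the book (I don't remember whether I've uploaded the code bundle; if not, I'll upload it later)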


def isPhoneNumber(text):
    if len(text) != 12:
        return False  # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not an area code
    if text[3] != '-':
        return False  # does not have first hyphen
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False  # does not have first 3 digits
    if text[7] != '-':
        return False  # does not have second hyphen
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False  # does not have last 4 digits
    return True  # 'text' is a phone number!

print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

(Output)

415-555-4242 is a phone number:
True
Moshi moshi is a phone number:
False

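A few notes:

1. The isdecimal() method checks whether a string contains only decimal characters. In Python 2 this method exists only on unicode objects; in Python 3 every str is Unicode, so it is available on all strings.

Note: in Python 2, you define a Unicode string by adding the 'u' prefix to the string literal; Python 3 needs no prefix.

isdecimal() syntax:

str.isdecimal()

It returns True if the string is non-empty and contains only decimal characters, and False otherwise.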
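
To make the isdecimal() notes above concrete, here is a small sketch for Python 3. The sample strings are my own illustrations, not taken from the book.

```python
# Sample strings chosen for illustration (not from the book).
samples = ['415', '41a', '', '４１５', '²']

# Map each sample to the result of str.isdecimal().
results = {s: s.isdecimal() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)

# '²' (superscript two) counts as a digit but not as a decimal character.
superscript_is_digit = '²'.isdigit()
print(superscript_is_digit)
```

Full-width digits such as '４１５' also count as decimal characters, so isdecimal() is broader than "is one of 0 to 9", while the empty string is rejected.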

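2. Calling a function works much the same as in other languages.

3. Be very careful with indentation. I hadn't written Python for so long that I kept getting errors for ages (I really should dig out my vernier calipers, sob)

The code in isPhoneNumber() performs several checks to see whether the string in text is a valid phone number. If any one of these checks fails, the function returns False. First the code checks that the string is exactly 12 characters long. Then it checks that the area code (the first three characters of text) consists only of digits. The rest of the function checks that the string follows the phone number pattern: the number must have the first hyphen after the area code, then three digits, then another hyphen, and finally four digits. If execution gets past all the checks, the function returns True.

Next, using the slicing covered earlier, we can also extract phone numbers from a longer string (rather than, as above, directly testing whether one short string is a phone number). The code is as follows: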


def isPhoneNumber(text):
    if len(text) != 12:
        return False  # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not an area code
    if text[3] != '-':
        return False  # does not have first hyphen
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False  # does not have first 3 digits
    if text[7] != '-':
        return False  # does not have second hyphen
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False  # does not have last 4 digits
    return True  # 'text' is a phone number!

'''print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))'''
message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)
print('Done')

(Output)

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

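“In each iteration of the for loop, a new 12-character piece of message is assigned to the variable chunk. For example, on the first iteration i is 0 and chunk is assigned message[0:12] (that is, the string 'Call me at 4'). On the next iteration i is 1 and chunk is assigned message[1:13] (the string 'all me at 41').
chunk is passed to isPhoneNumber() to see whether it matches the phone number pattern. If it does, the chunk is printed.
As the loop continues through message, the 12 characters in chunk will eventually be a phone number. The loop goes through the entire string, testing every 12-character piece and printing each chunk that satisfies isPhoneNumber(). Once we have gone through all of message, we print Done.
Although the string in message is short in this example, it could contain millions of characters and the program would still run in under a second. A similar program that finds phone numbers with regular expressions would also run in under a second, but regular expressions make such programs much quicker to write.”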

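Finding Patterns of Text with Regular Expressions

Let's go back to the phone number problem above. Since the book was written by an American, we follow their convention: the phone number format is xxx-xxx-xxxx. So what does the regular expression look like? We use the conventional symbol \d in place of the x I used casually above: \d\d\d-\d\d\d-\d\d\d\d. Because people are lazy, and also to reduce the chance of mistakes, there is a shorter version: \d\d\d-\d\d\d-\d\d\d\d => \d{3}-\d{3}-\d{4}. A number inside curly brackets says how many times the preceding item is repeated.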
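
As a quick check that the long and short forms above describe the same pattern, here is a minimal sketch. The test strings and variable names are my own, chosen only for illustration.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

# Test strings made up for illustration.
candidates = ['415-555-4242', '415-55-4242', 'Moshi moshi']

# fullmatch() succeeds only if the whole string fits the pattern.
long_results = [bool(long_form.fullmatch(c)) for c in candidates]
short_results = [bool(short_form.fullmatch(c)) for c in candidates]

print(long_results)
print(short_results)
```

Both lists come out the same, so the curly-bracket form is purely a shorter way of writing the repeated \d.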

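Creating Regex Objects

All of Python's regular expression functions are in the re module

import re

If you don't import it, you get an error: NameError: blah blah blah...

To create a Regex object that matches the phone number pattern (so that phoneNumRegex holds a Regex object):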

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
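
The r prefix in the line above marks a raw string, which keeps the backslashes exactly as typed. A small sketch of what that saves you (the variable names and the sample text 'Room 415' are my own):

```python
import re

raw = r'\d\d\d'        # raw string: backslashes are kept as typed
escaped = '\\d\\d\\d'  # normal string: each backslash must be doubled

same_text = (raw == escaped)
pattern_length = len(raw)  # 6 characters: three backslash-d pairs

print(same_text, pattern_length)
print(re.compile(raw).search('Room 415').group())
```

Both spellings produce the same six-character pattern; the raw string is simply easier to read and harder to mistype.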

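Matching Regex Objects

Use the search() method to search a string

The def part plus the slice-and-test loop from earlier is now replaced by search(), which returns a Match object for the first match, or None if nothing matches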


import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('phone number found: '+ mo.group())

(Output)

phone number found: 415-555-4242
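
To show how the regex replaces the hand-written checks, here is a rough sketch of two helpers. The names is_phone_number and first_phone_number are hypothetical (they are not from the book); the sketch assumes the same xxx-xxx-xxxx format as above.

```python
import re

PHONE_REGEX = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def is_phone_number(text):
    """Return True if text is exactly one xxx-xxx-xxxx phone number."""
    return PHONE_REGEX.fullmatch(text) is not None

def first_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = PHONE_REGEX.search(text)
    return mo.group() if mo else None

print(is_phone_number('415-555-4242'))
print(is_phone_number('Moshi moshi'))
print(first_phone_number('Call me at 415-555-1011 tomorrow.'))
print(first_phone_number('No number here'))
```

Checking the result of search() before calling group() matters, because search() returns None when there is no match.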

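A few notes:

1. search(): http://www.cnblogs.com/aaronthon/p/9435967.html

2. group(): https://www.cnblogs.com/erichuo/p/7909180.html

Matching More Patterns with Regular Expressions

You can create groups with parentheses (used together with group())

For example, the pattern from above: \d\d\d-\d\d\d-\d\d\d\d => (\d\d\d)-(\d\d\d-\d\d\d\d)

The code above becomes: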


import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group(1))
'''
print(mo.group(2))
print(mo.group(0))
print(mo.group())
print(mo.group(1)+mo.group(2))
'''

(Output)

415

If you remove the triple quotes so that the other print calls run too, the output is:

415
555-4242
415-555-4242
415-555-4242
415555-4242


import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
'''
print(mo.group(1))
print(mo.group(2))
print(mo.group(0))
print(mo.group())
print(mo.group(1)+mo.group(2))
'''
areaCode,mainNumber= mo.groups()
print(areaCode)
print(mainNumber)

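(Output: groups() returns all the groups at once as a tuple, which is unpacked here into areaCode and mainNumber)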

415
555-4242
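
The different ways of reading groups from a Match object can be put side by side. This is only a sketch; the snake_case variable names are mine.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

all_groups = mo.groups()  # every group, as a tuple
picked = mo.group(1, 2)   # several group numbers at once also give a tuple
whole = mo.group(0)       # group 0 is the entire match, same as mo.group()

print(all_groups)
print(picked)
print(whole)
```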

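Parentheses have a special meaning in regular expressions, but what if you need to match a parenthesis in the text? For example, the phone numbers you want to match might have the area code inside a pair of parentheses. In that case, you need to escape the ( and ) characters with a backslash.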


import re
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415) 555-4242.')
print(mo.group(1))
print(mo.group(2))
print(mo.group(1)+' '+mo.group(2))

(Output)

(415)
555-4242
(415) 555-4242
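
If you would rather not type the backslashes by hand, the re module also provides re.escape(), which escapes the special characters in a literal string for you. The book's listing above does not use it; this is just a sketch of the alternative.

```python
import re

literal = '(415)'                     # text that should be matched literally
escaped_literal = re.escape(literal)  # backslash-escapes the special characters

pattern = re.compile(escaped_literal + r' \d\d\d-\d\d\d\d')
mo = pattern.search('My number is (415) 555-4242.')
found = mo.group() if mo else None

print(found)
```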

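Matching Multiple Groups with the Pipe

So what is a "pipe"? In this book the character '|' is called a "pipe". Use it when you want to match one of several expressions. For example: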


import re
heroRegex=re.compile(r'Batman|Tina Fey')
mo1=heroRegex.search('Batman and Tina Fey.')
print(mo1.group())
mo2=heroRegex.search('Tina Fey and Batman.')
print(mo2.group())

(Output)

Batman
Tina Fey

If both Batman and Tina Fey occur in the string, the one that occurs first is returned as the match.

(The findall() method, which comes up later, can be used to find "all" the matches)
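
As a preview of that difference, here is a minimal sketch comparing search() and findall() on the same text. The snake_case variable names are my own.

```python
import re

hero_regex = re.compile(r'Batman|Tina Fey')
text = 'Batman and Tina Fey.'

first = hero_regex.search(text).group()  # only the first match
everyone = hero_regex.findall(text)      # every match, in order of appearance

print(first)
print(everyone)
```

Because this pattern has no parenthesized groups, findall() returns a plain list of matched strings.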

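You can also use the pipe to match one of several patterns as part of a regex. For example, the book wants to match any one of 'Batman', 'Batmobile', 'Batcopter', and 'Batbat'. Since they all start with 'Bat', the pattern can be simplified: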


import re
batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo=batRegex.search('Batmobile lost a wheel.')
print(mo.group())
print(mo.group(1))

(Output)

Batmobile
mobile

The call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns only the part matched inside the first parenthesized group, 'mobile'.
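
To see which alternative is picked for each word, here is a small sketch that runs the same pattern over the four words from the book plus one non-matching word, 'Batgirl', which I added for contrast.

```python
import re

bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
words = ['Batman', 'Batmobile', 'Batcopter', 'Batbat', 'Batgirl']

suffixes = {}
for word in words:
    mo = bat_regex.search(word)
    # group(1) is the alternative that matched; None means no match at all
    suffixes[word] = mo.group(1) if mo else None

print(suffixes)
```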

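If you need to match an actual pipe character, escape it with a backslash: \| (my understanding is that it works like this:


import re
batRegex=re.compile(r'\||Batman|bat')
mo=batRegex.search(r'| Batman lost a \.')
print(mo.group())  # prints |, the first match in the string

)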

Optional Matching with the Question Mark

Let's go straight to an example


import re
batRegex=re.compile(r'Bat(wo)?man')
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())

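Here '(wo)?' is an optional part: the group may be present or absent, in other words it matches zero or one time.

If you really need to match a question mark, do the same as above and escape it with a backslash (\?).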
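
The same idea can be applied to the earlier phone number pattern. This sketch assumes phone numbers whose area code may be missing; the variable names are mine.

```python
import re

# The area code plus its hyphen is wrapped in a group and made optional.
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phone_regex.search('My number is 415-555-4242').group()
without_area = phone_regex.search('My number is 555-4242').group()

print(with_area)
print(without_area)
```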

Matching Zero or More with the Star


import re
batRegex=re.compile(r'Bat(wo)*man')
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())

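This is much like ?, except that "zero or one time" becomes "zero or more times" (which reminds me: they say cross-dressing happens either zero times or countless times~)

Matching One or More with the Plus

First, a version that raises an error: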


import re
batRegex=re.compile(r'Bat(wo)+man')
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())

Here is the error output:

Batwoman
Traceback (most recent call last):
Batwowowowowowowoman
File 'xxxxxxxxxx (file path)', line 10, in <module>
print(mo1.group())
AttributeError: 'NoneType' object has no attribute 'group'

Process finished with exit code 1

And then a version that does not raise an error:


import re
batRegex=re.compile(r'Bat(wo)+man')
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())
mo1=batRegex.search('The adventures of Batman.')
print(mo1)

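This is easy to understand: the plus requires at least one occurrence, so 'Batman' does not match, search() returns None, and calling group() on None raises the AttributeError. Printing mo1 directly just prints None.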
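
The three quantifiers covered so far can be compared on the same inputs. This is a rough sketch; safe_group is a hypothetical helper of mine that avoids the AttributeError by checking for None first.

```python
import re

patterns = {
    '?': re.compile(r'Bat(wo)?man'),
    '*': re.compile(r'Bat(wo)*man'),
    '+': re.compile(r'Bat(wo)+man'),
}
words = ['Batman', 'Batwoman', 'Batwowoman']

def safe_group(regex, text):
    """Return the matched text, or None when search() finds nothing."""
    mo = regex.search(text)
    return mo.group() if mo else None

# One row per quantifier, one column per word.
table = {q: [safe_group(rx, w) for w in words] for q, rx in patterns.items()}

for quantifier, row in table.items():
    print(quantifier, row)
```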

Matching Specific Repetitions with Curly Brackets


import re
haRegex=re.compile(r'(Ha){3}')
mo1=haRegex.search('HaHaHa.')
print(mo1.group())
mo2=haRegex.search('Ha')
print(mo2)

(Output)

HaHaHa
None

In a regular expression:

(Ha){3,5} matches the same strings as ((Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)(Ha)): three, four, or five repetitions of Ha
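
A quick way to check the {3,5} range is to test strings with one to six repetitions. This is a minimal sketch; the names are my own.

```python
import re

ha_regex = re.compile(r'(Ha){3,5}')

# fullmatch() checks whether the entire string is 3 to 5 repetitions of 'Ha'.
fits = {n: ha_regex.fullmatch('Ha' * n) is not None for n in range(1, 7)}

print(fits)
```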

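Greedy and Nongreedy Matching

Speaking of greedy, it reminds me of the days when every problem looked like a greedy problem to me (DFS, BFS, linear programming, and so on: everything looked greedy)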
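
Until this section is filled in, here is a minimal sketch of the difference, assuming the same (Ha){3,5} pattern as above. By default Python's quantifiers are greedy and take the longest match they can; adding ? after the curly brackets makes them nongreedy, so they take the shortest.

```python
import re

greedy = re.compile(r'(Ha){3,5}')      # takes as many repetitions as allowed
nongreedy = re.compile(r'(Ha){3,5}?')  # stops at the minimum that still matches

text = 'HaHaHaHaHa'
longest = greedy.search(text).group()
shortest = nongreedy.search(text).group()

print(longest)
print(shortest)
```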

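---------------------------------To be continued; I'll fill this in when I find time-------------------------------

A side note: while filling in these notes I'm also waiting for my teacher to cover web crawlers, and later in this book there is a mention of a module that ships with Python, webbrowser. What it does is pretty mundane (though it did give me a way to open a web page without using <a></a>)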


import webbrowser
webbrowser.open('https://www.csdn.net/')

References:

http://www.runoob.com/python/att-string-isdecimal.html

http://www.cnblogs.com/aaronthon/p/9435967.html

https://www.cnblogs.com/erichuo/p/7909180.html
