First, the tools I'll be using: tesseract, ImageMagick, etc.
tesseract is an open-source OCR (Optical Character Recognition) engine from Google (originally from HP). To quote the tesseract-ocr page on Google Code, it is probably the most accurate OCR engine in the open-source world:
Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.
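ImageMagick is an open-source software suite for viewing and editing bitmap images and for converting between image formats.
I mention ImageMagick here because some image formats need to be converted with it.
Leptonica is an image processing and image analysis library that tesseract depends on. It cannot handle every format (jpg, for example), so we need ImageMagick for format conversion. Leptonica's format limitations are: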
Here's a summary of compression support and limitations:
- All formats except JPEG support 1 bpp binary.
- All formats support 8 bpp grayscale (GIF must have a colormap).
- All formats except GIF support 24 bpp rgb color.
- All formats except PNM support 8 bpp colormap.
- PNG and PNM support 2 and 4 bpp images.
- PNG supports 2 and 4 bpp colormap, and 16 bpp without colormap.
- PNG, JPEG, TIFF and GIF support image compression; PNM and BMP do not.
- WEBP supports 24 bpp rgb color.
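If you dutifully go to the tesseract-ocr page on Google Code and download the latest tar.gz:
$ tar xzvf tesseract-ocr-3.02.02.tar.gz -C ~/Downloads
$ cd ~/Downloads/tesseract-ocr
$ less README
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig
You may well get stuck at autogen.sh (because the build environment is not set up). On top of that, you still have dependencies to resolve.
If your distribution has an official or third-party binary package, why compile it yourself? Just install it from the command line (on my Arch Linux, for example):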
[hilo@hilo ~]$ sudo pacman -S tesseract # dependencies such as leptonica and libpng are resolved automatically
[hilo@hilo ~]$ sudo pacman -S tesseract-data-eng # the English language pack is still required
[hilo@hilo ~]$ sudo pacman -S imagemagick # if you have not installed imagemagick yet
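Before going further, it can help to confirm that the tools really are on your PATH. The sketch below is only an illustration: missing_tools is a hypothetical helper name, not part of tesseract or ImageMagick, and it relies on nothing beyond the POSIX command -v builtin.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given tool that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds when the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: prints nothing when both tools are installed.
#   missing_tools tesseract convert
```

If the function prints a name, install the matching package with your distribution's package manager.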
[hilo@hilo ~]$ convert a.jpg a.tif # first convert to a.tif, which tesseract can read
[hilo@hilo ~]$ tesseract a.tif out
[hilo@hilo ~]$ cat out.txt # view the recognized captcha
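The three commands above can be wrapped into one small function. This is a minimal sketch under assumptions: tif_name and ocr_image are hypothetical names I made up, and the only external commands used are convert and tesseract, called the same way as above.

```shell
#!/bin/sh
# Hypothetical helper: print the .tif file name that corresponds to an input image.
tif_name() {
    # Strip the last extension and append .tif (a.jpg -> a.tif).
    printf '%s.tif\n' "${1%.*}"
}

# Hypothetical helper: convert an image to TIFF, run tesseract on it, print the text.
ocr_image() {
    img=$1
    tif=$(tif_name "$img")
    base=${tif%.tif}
    # Convert only when the input is not already a .tif file.
    if [ "$img" != "$tif" ]; then
        convert "$img" "$tif" || return 1
    fi
    # tesseract appends .txt to the output base name by itself.
    tesseract "$tif" "$base" || return 1
    cat "$base.txt"
}

# Usage (requires ImageMagick and tesseract):
#   ocr_image a.jpg
```

The result lands next to the image (a.jpg gives a.tif and a.txt), which keeps several captchas in one directory from overwriting each other's output.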
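Recognition success depends heavily on image quality. A captcha usually has to go through grayscale conversion, binarization, and denoising first, all of which are easy to do with ImageMagick.
$ convert foo.png -monochrome bar.png # binarize the image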
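The grayscale, binarization, and denoising steps can be chained in a single convert call. The following is a sketch, not a tuned recipe: preprocess_args and preprocess are hypothetical helper names, the 50% default threshold is an assumption you will probably need to adjust per captcha, and the options used (-colorspace Gray, -threshold, -despeckle) are standard ImageMagick options.

```shell
#!/bin/sh
# Hypothetical helper: build the ImageMagick options for the three cleanup steps.
# $1 = threshold percentage (assumed default: 50).
preprocess_args() {
    printf '%s\n' "-colorspace Gray -threshold ${1:-50}% -despeckle"
}

# Hypothetical helper: preprocess INPUT into OUTPUT with an optional threshold.
preprocess() {
    # Word splitting of the options string is intentional here.
    convert "$1" $(preprocess_args "$3") "$2"
}

# Usage (requires ImageMagick):
#   preprocess foo.png bar.png 60
```

Raising the threshold turns more pixels black, lowering it turns more pixels white; it is worth viewing the intermediate image before handing it to tesseract.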
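At this point I recommend reading 鬼仔's article on advanced captcha recognition.
What if the captcha contains only digits? No problem. As the FAQ explains, you only need to append digits to the end of the command: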
tesseract imagename outputbase digits
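Even with the digits config, a stray space or letter can occasionally end up in the output, and I am assuming here that your package ships the digits config file at all. As a safety net, the output can be post-filtered so that only digits remain. digits_only is a hypothetical helper name; it uses only the standard tr utility.

```shell
#!/bin/sh
# Hypothetical helper: keep only the digits (and newlines) from standard input.
digits_only() {
    tr -cd '0-9\n'
}

# Usage (requires tesseract):
#   tesseract a.tif out digits && digits_only < out.txt
```

This only cleans up the text; it cannot repair a digit that was misread as another digit.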
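I have to say that tesseract's English recognition rate is already quite good (with the existing tesseract-data-eng), but for captchas it is still rather underwhelming. Don't forget, though, that tesseract's recognition is meant to be trained.
To be continued.
Here are some problems that the FAQ does not cover:
Strictly speaking, this is not a bug (tesseract 3.0). The error appears because tesseract cannot work out the layout of the characters in the image. If you have read the tesseract wiki, you should know how to fix it: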
-psm N Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
For the layout of our captcha a.tif, -psm 7 (single text line) is the best fit.
$ tesseract 84.tif out -l eng -psm 7 ;cat out.txt
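If you are not sure which mode suits a given captcha, you can simply try a few and compare the results. The sketch below rests on assumptions: psm_flag and try_psm_modes are hypothetical helper names, and you pass the tesseract major version yourself, because newer releases (4.x and later) expect the option to be spelled --psm instead of the -psm used in this post.

```shell
#!/bin/sh
# Hypothetical helper: choose the page-segmentation flag for a tesseract major version.
psm_flag() {
    if [ "$1" -ge 4 ]; then
        echo "--psm"
    else
        echo "-psm"
    fi
}

# Hypothetical helper: run tesseract with several segmentation modes and print each result.
# $1 = image, $2 = tesseract major version, remaining arguments = modes to try.
try_psm_modes() {
    img=$1
    major=$2
    shift 2
    flag=$(psm_flag "$major")
    for mode in "$@"; do
        # Skip modes that fail instead of aborting the whole loop.
        tesseract "$img" "out_psm$mode" -l eng "$flag" "$mode" >/dev/null 2>&1 || continue
        printf 'psm %s: %s\n' "$mode" "$(cat "out_psm$mode.txt")"
    done
}

# Usage (requires tesseract):
#   try_psm_modes a.tif 3 6 7 8
```

For a short captcha, modes 7 and 8 are usually the ones worth comparing first.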