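1. Introduction
Crawlers frequently run into CAPTCHAs. Today's CAPTCHAs mostly fall into four types: arithmetic CAPTCHAs, slider CAPTCHAs, image (text-in-picture) CAPTCHAs, and audio CAPTCHAs. This article covers image CAPTCHAs, and only simple ones. Pushing the recognition rate and accuracy higher takes a lot of effort spent training your own character library.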
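Recognizing a CAPTCHA usually involves the following steps:
(1) Grayscale conversion
(2) Binarization
(3) Border removal (if there is a border)
(4) Noise reduction
(5) Character segmentation or skew correction
(6) Training the character library
(7) Recognition
Of these seven steps, the first three are the basics; steps 4 and 5 can be included or skipped depending on the actual CAPTCHA.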
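The first preprocessing steps can be sketched in a few lines. The block below is only an illustration under stated assumptions: it works on plain Python lists instead of PIL images so that it has no dependencies, the function names are hypothetical, and the threshold of 135 is borrowed from the full listing later in this article rather than being a universal value.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 luminance (ITU-R BT.601 weights)."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb_rows]


def binarize(gray, threshold=135):
    """Pixels brighter than the threshold become 255 (background), the rest 0 (ink)."""
    return [[255 if p > threshold else 0 for p in row] for row in gray]


def remove_isolated_pixels(binary, min_neighbours=1):
    """Point-noise filter: clear ink pixels that have fewer than
    `min_neighbours` ink pixels among their 8 neighbours."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 0:
                continue  # background pixel, nothing to clean
            neighbours = sum(
                1
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy or dx)
                and 0 <= y + dy < h and 0 <= x + dx < w
                and binary[y + dy][x + dx] == 0
            )
            if neighbours < min_neighbours:
                out[y][x] = 255
    return out


# A tiny 4x3 "image": one isolated dark speck (top left) and a two-pixel stroke (right).
WHITE = (250, 250, 250)
sample_rgb = [
    [(10, 10, 10), WHITE, WHITE, WHITE],
    [WHITE, WHITE, WHITE, (20, 20, 20)],
    [WHITE, WHITE, WHITE, (30, 30, 30)],
]
cleaned = remove_isolated_pixels(binarize(to_grayscale(sample_rgb)))
```

In this sample the isolated speck is removed while the two-pixel stroke survives, which is the behaviour you want before segmenting characters. Real CAPTCHAs usually need the neighbour threshold tuned by eye.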
Commonly used libraries include pytesseract (OCR), OpenCV (advanced image processing), imagehash (image hashing), numpy (an open-source, high-performance numerical computing library for Python), and PIL's Image, ImageDraw, and ImageFile modules.
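2. Example
Take the login CAPTCHA of a certain website as an example. The actual procedure differs slightly from the steps listed above.
First, some analysis. The CAPTCHA consists of four digits, each in the range 0 to 9. A given digit can appear in only four places: the first, second, third, or fourth position. That gives 10 × 4 = 40 digit-and-position combinations, as follows:
The next task is to denoise and split sample CAPTCHA images to obtain the images above. These 40 images serve as the reference set.
A CAPTCHA to be recognized is denoised and split in the same way, producing four digit images like the ones above. Comparing them against the reference set reveals what the CAPTCHA says.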
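As a rough illustration of how such a reference set could be assembled, the sketch below splits already-binarized CAPTCHAs into four tiles and files each tile under its (digit, position) pair. It is a simplification: the tiles are equal-width, whereas the full listing later in this article uses hand-tuned column widths, and the function names and the `(image, text)` input format are assumptions made for this example.

```python
def split_columns(binary, parts=4):
    """Cut an image (a list of pixel rows) into `parts` equal-width tiles, left to right."""
    width = len(binary[0])
    step = width // parts
    return [[row[i * step:(i + 1) * step] for row in binary] for i in range(parts)]


def build_template_library(labelled_captchas, parts=4):
    """Map (digit, position) -> tile, using CAPTCHAs whose text is already known.

    `labelled_captchas` is an iterable of (binary_image, text) pairs
    (a hypothetical input format chosen for this sketch).
    """
    library = {}
    for image, text in labelled_captchas:
        tiles = split_columns(image, parts)
        for position, (tile, digit) in enumerate(zip(tiles, text), start=1):
            # Keep the first tile seen for each digit/position combination.
            library.setdefault((digit, position), tile)
    return library


# Toy 8x2 "image" whose pixel values are just their own indices, labelled "2837".
toy_image = [list(range(0, 8)), list(range(8, 16))]
toy_library = build_template_library([(toy_image, "2837")])
```

With four digits per CAPTCHA, the library is complete once all 40 keys are present, so `len(library) == 40` is a convenient stopping condition when collecting samples.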
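Taking the CAPTCHA 2837 above as an example:
1. Image denoising
2. Image splitting
3. Image comparison
The four digit images obtained by denoising and splitting the CAPTCHA are compared with the 40 reference images by hash value. A tolerance, max_dif, sets the maximum allowed hash difference: the smaller it is, the stricter the match, and the minimum is 0.
Each of the four digit images is thereby matched to a digit; concatenating the four digits yields the CAPTCHA text.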
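The comparison step can be illustrated with a small, dependency-free sketch. It mimics the idea behind an average hash (one bit per pixel, set when the pixel is brighter than the mean) and compares hashes by Hamming distance. Two things differ from the article's full listing and are assumptions of this sketch: tiles are hashed at their native size instead of being resized first, and the nearest template is chosen rather than the first one within max_dif. All function names are hypothetical.

```python
def average_hash(tile):
    """Return a tuple of bits: 1 where a pixel is brighter than the tile's mean."""
    pixels = [p for row in tile for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hash_distance(h1, h2):
    """Hamming distance: the number of positions where two hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def recognise(tiles, templates, max_dif=0):
    """Match each tile against (hash, digit) templates.

    A tile whose nearest template is further away than max_dif yields '?'.
    """
    result = []
    for tile in tiles:
        tile_hash = average_hash(tile)
        best_hash, best_digit = min(
            templates, key=lambda t: hash_distance(tile_hash, t[0])
        )
        matched = hash_distance(tile_hash, best_hash) <= max_dif
        result.append(best_digit if matched else "?")
    return "".join(result)


# Two toy 2x2 "digit" templates and a noisy variant of the first one.
tile_2 = [[0, 255], [255, 0]]
tile_8 = [[0, 0], [255, 255]]
noisy_2 = [[0, 255], [255, 255]]
templates = [(average_hash(tile_2), "2"), (average_hash(tile_8), "8")]
```

With `max_dif=0` only exact hash matches count, which suits a CAPTCHA whose digits are always rendered identically. Raising `max_dif` tolerates a few flipped pixels at the cost of more false matches.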
The complete code is as follows:
    # coding=utf-8
    import os
    import time

    import imagehash
    import numpy
    from PIL import Image
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    import mongoDbBase  # the author's own MongoDB helper module (not shown in this article)


    class finalNews_IE:
        def __init__(self, strdate, logonUrl, firstUrl, keyword_list, exportPath, codepath, codedir):
            self.iniDriver()
            self.db = mongoDbBase.mongoDbBase()
            self.date = strdate
            self.firstUrl = firstUrl
            self.logonUrl = logonUrl
            self.keyword_list = keyword_list
            self.exportPath = exportPath
            self.codedir = codedir
            # Load the 40 reference images (digit f at position l) and index them by hash.
            # The path separators were lost in the published listing; a "codeLibrary"
            # subfolder containing files named code<digit>_<position>.png is assumed here.
            self.hash_code_dict = {}
            for f in range(0, 10):
                for l in range(1, 5):
                    file = os.path.join(codedir, "codeLibrary", "code" + str(f) + "_" + str(l) + ".png")
                    hash = self.get_ImageHash(file)
                    if hash is not None:  # skip reference images that are missing on disk
                        self.hash_code_dict[hash] = str(f)

        def iniDriver(self):
            # Path to IEDriverServer.exe (the original comment says it comes from a config file)
            IEDriverServer = r"C:\Program Files\Internet Explorer\IEDriverServer.exe"
            os.environ["webdriver.ie.driver"] = IEDriverServer
            # Selenium 3 style call; newer Selenium 4 releases take a Service object instead.
            self.driver = webdriver.Ie(IEDriverServer)

        def WriteData(self, message, fileName):
            fileName = os.path.join(os.getcwd(), self.exportPath, fileName)
            with open(fileName, "a") as f:
                f.write(message)

        # Compute the average hash of an image file; returns None if the file does not exist
        def get_ImageHash(self, imagefile):
            hash = None
            if os.path.exists(imagefile):
                with open(imagefile, "rb") as fp:
                    hash = imagehash.average_hash(Image.open(fp))
            return hash

        # Point-noise removal: grayscale, then binarize at a threshold of 135
        def clearNoise(self, imageFile, x=0, y=0):
            if os.path.exists(imageFile):
                image = Image.open(imageFile)
                image = image.convert("L")
                image = numpy.asarray(image)
                # Cast to uint8 so that Image.fromarray accepts the array on every platform
                image = ((image > 135) * 255).astype(numpy.uint8)
                image = Image.fromarray(image).convert("RGB")
                image.save(imageFile)
                return image

        # Split the CAPTCHA image
        # imagePath: output directory; imageFile: image to split;
        # rownum: number of rows; colnum: number of columns
        def splitimage(self, imagePath, imageFile, rownum=1, colnum=4):
            img = Image.open(imageFile)
            w, h = img.size
            file_list = []
            if rownum <= h and colnum <= w:
                print("Original image info: %sx%s, %s, %s" % (w, h, img.format, img.mode))
                print("Splitting the image, please wait...")
                s = os.path.split(imageFile)
                if imagePath == "":
                    imagePath = s[0]  # default to the directory of the source image
                fn = s[1].split(".")
                basename = fn[0]
                ext = fn[-1]
                num = 1
                rowheight = h // rownum
                colwidth = w // colnum
                for r in range(rownum):
                    index = 0
                    for c in range(colnum):
                        # box = (left, upper, right, lower)
                        # The column widths are hand-tuned for this particular CAPTCHA:
                        # the first two digits are cut slightly wider than w // colnum.
                        if index < 1:
                            colwid = colwidth + 6
                        elif index < 2:
                            colwid = colwidth + 1
                        else:
                            colwid = colwidth
                        box = (c * colwid, r * rowheight, (c + 1) * colwid, (r + 1) * rowheight)
                        newfile = os.path.join(imagePath, basename + "_" + str(num) + "." + ext)
                        file_list.append(newfile)
                        img.crop(box).save(newfile)
                        num = num + 1
                        index += 1
            return file_list

        def compare_image_with_hash(self, image_hash1, image_hash2, max_dif=0):
            """
            max_dif: maximum allowed hash difference; the smaller, the stricter (minimum 0).
            """
            # Subtracting two ImageHash objects yields their Hamming distance (never negative)
            return (image_hash1 - image_hash2) <= max_dif

        # Capture the CAPTCHA image from the login page and return the recognized text
        def savePicture(self):
            self.driver.get(self.logonUrl)
            self.driver.maximize_window()
            time.sleep(1)
            screenshot = os.path.join(self.codedir, "Temp.png")
            self.driver.save_screenshot(screenshot)
            checkcode = self.driver.find_element(By.ID, "checkcode")
            location = checkcode.location  # x, y coordinates of the CAPTCHA element
            size = checkcode.size  # width and height of the CAPTCHA element
            # Bounding box of the region to cut out of the screenshot
            rangle = (int(location["x"]), int(location["y"]),
                      int(location["x"] + size["width"]), int(location["y"] + size["height"]))
            i = Image.open(screenshot)  # open the screenshot
            result = i.crop(rangle)  # crop the CAPTCHA region out of the screenshot
            filename = os.path.join(self.codedir, "Temp_code.png")
            result.save(filename)
            self.clearNoise(filename)
            file_list = self.splitimage(self.codedir, filename)
            verycode = ""
            for f in file_list:
                imageHash = self.get_ImageHash(f)
                if imageHash is None:
                    continue
                for h, code in self.hash_code_dict.items():
                    if self.compare_image_with_hash(imageHash, h, 0):
                        verycode += code
                        break
            print(verycode)
            return verycode

        def longon(self):
            # savePicture() loads the login page and recognizes the CAPTCHA shown on it
            code = self.savePicture()
            accname = self.driver.find_element(By.ID, "username")
            accname.send_keys("ctrchina")
            accpwd = self.driver.find_element(By.ID, "password")
            # accpwd.send_keys("123456")
            checkcode = self.driver.find_element(By.NAME, "checkcode")
            checkcode.send_keys(code)
            submit = self.driver.find_element(By.NAME, "button")
            submit.click()
Additional example:
    # -*- coding: utf-8 -*-
    # Ported from the original Python 2 listing to Python 3.
    import io
    import json
    import re

    import requests
    from bs4 import BeautifulSoup, Tag
    from PIL import Image, ImageEnhance

    # The author's own module (not shown). From its use below, mdata.data appears to map
    # each character to a list of 260 zeros and ones (a 13 x 20 pixel template).
    import mdata


    class Student:
        def __init__(self, user, password):
            self.user = str(user)
            self.password = str(password)
            self.s = requests.Session()

        def login(self):
            url = "http://202.118.31.197/ACTIONLOGON.APPPROCESS?mode=4"
            res = self.s.get(url).text
            imageUrl = "http://202.118.31.197/" + re.findall('<img src="(.+?)" width="55"', res)[0]
            im = Image.open(io.BytesIO(self.s.get(imageUrl).content)).convert("RGB")
            # Boost the contrast, then turn every pixel that is not pure black into white
            im = ImageEnhance.Contrast(im).enhance(7)
            x, y = im.size
            for i in range(y):
                for j in range(x):
                    if im.getpixel((j, i)) != (0, 0, 0):
                        im.putpixel((j, i), (255, 255, 255))
            num = [6, 19, 32, 45]  # left edge of each of the four characters
            verifyCode = ""
            for n in range(4):
                a = im.crop((num[n], 0, num[n] + 13, 20))
                # Flatten the 13 x 20 tile into 260 bits: 1 for black, 0 for white
                bits = []
                x, y = a.size
                for i in range(y):
                    for j in range(x):
                        bits.append(1 if a.getpixel((j, i)) == (0, 0, 0) else 0)
                # Choose the template character with the most matching pixels
                his = 0
                chrr = ""
                for ch in mdata.data:
                    r = 0
                    for j in range(260):
                        if bits[j] == mdata.data[ch][j]:
                            r += 1
                    if r > his:
                        his = r
                        chrr = ch
                verifyCode += chrr
            data = {
                "WebUserNO": str(self.user),
                "Password": str(self.password),
                "Agnomen": verifyCode,
            }
            url = "http://202.118.31.197/ACTIONLOGON.APPPROCESS?mode=4"
            t = self.s.post(url, data=data).text
            if re.findall("images/Logout2", t) == []:
                # Login failed: extract the message from the second alert(...) on the page
                msg = '[0,"' + re.findall("alert((.+?));", t)[1][1][2:-2] + '"]' + " " + self.user + " " + self.password + " "
                return [False, msg]
            else:
                msg = "Login succeeded " + re.findall("! (.+?) ", t)[0] + " " + self.user + " " + self.password + " "
                return [True, msg]

        def getInfo(self):
            # Student record information
            data = self.s.get("http://202.118.31.197/ACTIONQUERYBASESTUDENTINFO.APPPROCESS?mode=3").text
            data = BeautifulSoup(data, "lxml")
            q = data.find_all("table", attrs={"align": "left"})
            a = []
            for table in (q[0], q[1]):
                for i in table:
                    if isinstance(i, Tag):
                        for j in i:
                            if isinstance(j, Tag):
                                a.append(j.text)
            # Cells alternate between field name and field value
            data = {}
            for i in range(1, len(a), 2):
                data[a[i - 1]] = a[i]
            return json.dumps(data)

        def getPic(self):
            imageUrl = "http://202.118.31.197/ACTIONDSPUSERPHOTO.APPPROCESS"
            pic = Image.open(io.BytesIO(self.s.get(imageUrl).content))
            return pic

        def getScore(self):
            # Transcript
            score = self.s.get("http://202.118.31.197/ACTIONQUERYSTUDENTSCORE.APPPROCESS").text
            score = BeautifulSoup(score, "lxml")
            q = score.find_all(attrs={"height": "36"})[0]
            point = q.text
            print(point[point.find("平均学分绩点"):])  # "平均学分绩点" is the page's GPA label
            table = score.html.body.table
            r = table.find_all("table", attrs={"align": "left"})[0].find_all("tr")
            subject = []
            lesson = []
            for i in r[0]:
                if isinstance(i, Tag):
                    subject.append(i.string)
            for i in r:
                k = 0
                temp = {}
                for j in i:
                    if isinstance(j, Tag):
                        temp[subject[k]] = j.string
                        k += 1
                lesson.append(temp)
            lesson.pop()  # drop the trailing row
            lesson.pop(0)  # drop the header row
            return json.dumps(lesson)

        def logoff(self):
            return self.s.get("http://202.118.31.197/ACTIONLOGOUT.APPPROCESS").text


    if __name__ == "__main__":
        a = Student(20150000, 20150000)
        r = a.login()
        print(r[1])
        if r[0]:
            r = json.loads(a.getScore())
            for i in r:
                for j in i:
                    print(i[j], end=" ")
                print()
            q = json.loads(a.getInfo())
            for i in q:
                print(i, q[i])
            a.getPic().show()
            a.logoff()
That concludes this article on approaches and solutions for recognizing CAPTCHAs with Python.
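The text and images in this article come from the internet combined with our own ideas. They are intended for learning and exchange only, not for any commercial use. Copyright belongs to the original authors; if there is any problem, please contact us promptly so we can deal with it.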