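Preface
The text and images in this article come from the internet and are intended for learning and exchange only, not for any commercial use. Copyright belongs to the original authors. If there is any problem, please contact us promptly so that we can deal with it.
Project goal
Scrape video data from Kuran (酷燃网):
https://krcom.cn/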
Environment
Python 3.6
PyCharm
Crawler code
import re

import requests

# Target folder. The backslashes were lost in the published listing and are
# restored here; adjust the path for your own machine.
SAVE_DIR = "C:\\Users\\Administrator\\Desktop\\酷燃网\\"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
}


def download_video(title, url):
    """Download the video stream and save it as <title>.mp4."""
    filename_video = SAVE_DIR + title + ".mp4"
    response_video = requests.get(url=url)
    with open(filename_video, mode="wb") as f:
        f.write(response_video.content)


def download_mp3(title, url):
    """Download the audio stream and save it as <title>.mp3."""
    filename_mp3 = SAVE_DIR + title + ".mp3"
    response_mp3 = requests.get(url=url)
    with open(filename_mp3, mode="wb") as f:
        f.write(response_mp3.content)


for page in range(0, 101, 20):
    # NOTE: as published, `page` never appears in the URL, so every iteration
    # requests the same listing. The cursor parameter presumably has to change
    # per page; the URL is kept verbatim here.
    url = "https://krcom.cn/aj/hot/loadingmore?ajwvr=6&cursor=0;2020102014&YmdH=&__rnd=1603176486876"
    response = requests.get(url=url, headers=HEADERS)
    # The response carries escaped HTML; decode the \" and \uXXXX escapes first.
    html_data = response.text.encode("utf-8").decode("unicode_escape")
    # Video IDs and titles appear in the same order, so they can be paired up.
    urls = re.findall(r'vid=(.*?)"', html_data, re.S)
    titles = re.findall(r'<h3 class="V_autocut_2l">(.*?)<', html_data, re.S)
    for vid, title in zip(urls, titles):
        page_url = "https://krcom.cn/aj/dash/media?media_ids={}&protocols=dash&watermarks=krcom".format(vid)
        response_2 = requests.get(url=page_url, headers=HEADERS)
        html_json = response_2.json()
        details = html_json["data"]["list"][0]["details"]
        # Entry 1 is used as the video stream and the last entry as the audio stream.
        video_url = details[1]["play_info"]["url"]
        mp3_url = details[-1]["play_info"]["url"]
        download_video(title, video_url)
        download_mp3(title, mp3_url)
        print(title)
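The parsing step of the crawler can be tried offline. The sketch below is only an illustration: `SAMPLE` is a made-up payload shaped like what the listing endpoint is assumed to return, and `extract_entries` and `safe_filename` are hypothetical helper names that do not appear in the original script. It applies the same decoding and the same two regular expressions, then replaces characters that Windows rejects in file names, since a title containing `:` or `?` would otherwise make `open()` fail.

```python
import re

# Hypothetical sample: an escaped HTML fragment shaped like what the listing
# endpoint is assumed to return. The real payload may differ.
SAMPLE = (
    r'{"code":"100000","data":"'
    r'<a href=\"/v?vid=1034:111\"><h3 class=\"V_autocut_2l\">\u9177\u71c3: demo?</h3></a>'
    r'<a href=\"/v?vid=1034:222\"><h3 class=\"V_autocut_2l\">second clip</h3></a>'
    r'"}'
)


def extract_entries(raw_text):
    """Return (vid, title) pairs from an escaped response body."""
    # Same decoding as the crawler: turn \" and \uXXXX escapes into real characters.
    html = raw_text.encode("utf-8").decode("unicode_escape")
    vids = re.findall(r'vid=(.*?)"', html, re.S)
    titles = re.findall(r'<h3 class="V_autocut_2l">(.*?)<', html, re.S)
    return list(zip(vids, titles))


def safe_filename(title):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()
```

Calling `extract_entries(SAMPLE)` yields one `(vid, title)` pair per listing entry, and passing each title through `safe_filename` before building the output path keeps the download from failing on awkward titles. One caveat: the `unicode_escape` decoding only behaves well when the body is pure ASCII with escapes. If the response already contains raw non-ASCII characters, they will come out garbled.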