
Python Web Scraping: Downloading Xigua Video (Toutiao) Clips, With and Without Watermarks

python · 搞java代码 · 2022-05-21

Preface

The text and images in this article come from the internet and are for learning and discussion only, with no commercial purpose. If there is any problem, please contact us promptly so we can deal with it.

This article originally appeared on 青灯编程; author: 清风.

Free online video tutorials with worked examples on Python web scraping, data analysis, and web development:

<code>https://space.bilibili.com/523606542</code>


Development environment

  • Python 3.6
  • PyCharm

Modules used

<code>import time
import os
import re
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options</code>

Analyzing the target page

How to get the video address

Xigua Video serves two kinds of video:

1. Watermarked video

2. Watermark-free video

Watermarked video

The page source contains this link:

<code>https://www.ixigua.com/embed?group_id=6817258591586615812</code>

Opening it takes you to the video playback page.

The rendered front-end page already contains the real video address:

<code>//v9-xg-web-s.ixigua.com/ac99e1bf75dd0faa6854d9e5367fac3f/5fe894d7/video/tos/cn/tos-cn-ve-4/626cf09c0830417da4b70982950cedd9/?a=1768&br=3891&bt=1297&cd=0%7C0%7C0&cr=0&cs=0&cv=1&dr=0&ds=3&er=0&l=20201227210214010204050203275E2F92&lr=default&mime_type=video_mp4&qs=0&rc=anQ3aWdzNjd2dDMzZjczM0ApPDQ2NjU8aGU3NzplMzZoNWdfMWguMmA0NWFfLS02LS9zczIwXjBfY2A2MmIvXjMyLjI6Yw%3D%3D&vl=&vr=</code>

Requesting this address is all it takes to download and save the video.
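The address shown above is protocol-relative (it starts with `//`), so it needs a scheme before `requests` can fetch it. The following is a minimal sketch, assuming the page was loaded over HTTPS; the helper name `absolutize` and the shortened sample path are illustrative and do not come from the original article.

```python
from urllib.parse import urljoin

# Assumption: the page that embedded the video was served from this origin.
PAGE_URL = "https://www.ixigua.com/"


def absolutize(src, page_url=PAGE_URL):
    """Resolve a protocol-relative or relative URL against the page URL."""
    return urljoin(page_url, src)


# Shortened, hypothetical path used only to show the transformation.
print(absolutize("//v9-xg-web-s.ixigua.com/path/to/video/?a=1768"))
```

URLs that are already absolute pass through `urljoin` unchanged, so the helper can be applied to every extracted address.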

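Watermark-free video

Downloading the watermark-free version is more involved, because the audio and the picture are served as separate streams.

The picture stream has no watermark, but it also has no sound.

How do we find the audio and video addresses?

Open the browser developer tools; the corresponding links show up under the XHR tab.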

Audio address:

<code>https://v9-xg-web-s.ixigua.com/79457295a8a89bf86bdcd157eb848175/5fe895f4/video/tos/cn/tos-cn-vd-0026/43771a1a38ea473d9cb5b8e7c0f651f3/media-audio-und-mp4a/?a=1768&br=0&bt=0&cd=0%7C0%7C0&cr=0&cs=0&cv=1&dr=0&ds=&er=0&l=20201227210659010028033025224FC377&lr=default&mime_type=video_mp4</code>

Video (picture) address:

<code>https://v9-xg-web-s.ixigua.com/9b4e18f3b29244557c83b8e88f13dd1b/5fe895f4/video/tos/cn/tos-cn-vd-0026/86a41ef8ebd3496585db455ae56b3ff3/media-video-avc1/?a=1768&br=12159&bt=4053&cd=0%7C0%7C0&cr=0&cs=0&cv=1&dr=0&ds=4&er=0&l=20201227210659010028033025224FC377&lr=default&mime_type=video_mp4</code>
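The two sample addresses differ in one path segment: `media-audio-und-mp4a` for the audio and `media-video-avc1` for the picture. The sketch below uses that difference to label a captured URL. It assumes these markers hold beyond the two samples in this article, which has not been verified; `stream_kind` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def stream_kind(url):
    """Guess whether a captured URL is the audio or the video stream.

    Assumption: the path markers are taken from the two sample URLs in
    this article and may not apply to every video.
    """
    path = urlparse(url).path
    if "media-audio" in path:
        return "audio"
    if "media-video" in path:
        return "video"
    return "unknown"


# Shortened versions of the two sample addresses.
print(stream_kind("https://v9-xg-web-s.ixigua.com/x/media-audio-und-mp4a/?a=1768"))
print(stream_kind("https://v9-xg-web-s.ixigua.com/x/media-video-avc1/?a=1768"))
```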

 

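So to scrape the watermark-free version of a Xigua video, you have to download both the picture stream and the audio stream and then merge the two files into one, much like the earlier article on scraping Bilibili videos.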
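The article does not show the merge step. One common way to do it is to call the `ffmpeg` command-line tool, which is not something the original author is known to have used. The sketch below assumes `ffmpeg` is installed and on `PATH`; the function names and file names are illustrative.

```python
import subprocess


def build_merge_command(video_path, audio_path, output_path):
    """Build an ffmpeg command that combines one video and one audio file."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it already exists
        "-i", video_path,  # first input: silent picture stream
        "-i", audio_path,  # second input: audio stream
        "-c", "copy",      # copy both streams without re-encoding
        output_path,
    ]


def merge(video_path, audio_path, output_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    # Assumption: the ffmpeg executable is installed and on PATH.
    subprocess.run(build_merge_command(video_path, audio_path, output_path), check=True)


# Print the command instead of running it, so the sketch works without ffmpeg.
print(" ".join(build_merge_command("picture.mp4", "audio.mp4", "merged.mp4")))
```

Copying the streams instead of re-encoding keeps the merge fast and avoids quality loss.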

Downloading the watermarked version

1. Fetch the page source and extract the playback address and the title

<code>def main(html_url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = response.apparent_encoding
    # The patterns contain double quotes, so the Python strings use single quotes
    play_url = re.findall('"embedUrl":"(.*?)"', response.text)[0]
    title = re.findall('<title data-react-helmet="true">(.*?)</title>', response.text)[0].replace(" - 西瓜视频", "")</code>
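Indexing `re.findall(...)[0]` raises `IndexError` when the page layout changes and a pattern no longer matches. A minimal sketch of the same extraction with an explicit check follows. The `SAMPLE_HTML` string is a hypothetical, heavily shortened page source, and `extract_play_info` is an illustrative name.

```python
import re

# Hypothetical, heavily shortened page source for illustration only.
SAMPLE_HTML = (
    '<title data-react-helmet="true">Demo clip - 西瓜视频</title>'
    '<script>{"embedUrl":"https://www.ixigua.com/embed?group_id=6817258591586615812"}</script>'
)


def extract_play_info(html):
    """Return (play_url, title), or None when either pattern is missing."""
    url_match = re.search(r'"embedUrl":"(.*?)"', html)
    title_match = re.search(r'<title data-react-helmet="true">(.*?)</title>', html)
    if url_match is None or title_match is None:
        return None
    # Strip the site suffix from the page title, as the article does.
    title = title_match.group(1).replace(" - 西瓜视频", "")
    return url_match.group(1), title


print(extract_play_info(SAMPLE_HTML))
```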

2. Get the real download address of the video

Selenium is used here because of how the link changes: the request parameters differ on every page load and are hard to analyze, but the real video address does appear in the rendered page, so Selenium can read it directly.

<code>def get_video_url(html_url):
    """Given the playback page URL, return the video download URL."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run Chrome without a window
    # Windows only: kill any chromedriver process left over from an earlier run
    os.system("taskkill /f /im chromedriver.exe")
    # Note: executable_path and find_element_by_css_selector are Selenium 3 APIs;
    # Selenium 4 uses a Service object and find_element(By.CSS_SELECTOR, ...) instead
    driver = webdriver.Chrome(executable_path="chromedriver.exe", options=chrome_options)
    driver.get(html_url)
    driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
    video_url = driver.find_element_by_css_selector("#player_default video").get_attribute("src")
    driver.close()
    return video_url</code>

3. Download and save the video

Method 1: plain save

<code>def save(video_url, video_title):
    filename = "video" + video_title + ".mp4"
    video_headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    }
    # .content loads the whole file into memory before it is written
    video_response = requests.get(url=video_url, headers=video_headers).content
    with open(filename, mode="wb") as f:
        f.write(video_response)
        print("Downloading and saving:", video_title)</code>
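The file name is built straight from the video title, and titles can contain characters such as `?`, `:` or `/` that Windows rejects in file names. The sketch below shows one way to clean the title first. `safe_filename` and the `"untitled"` fallback are illustrative choices, not part of the original code.

```python
import re


def safe_filename(title, replacement="_"):
    """Replace characters that Windows does not allow in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
    # Fall back to a fixed name when nothing usable is left.
    return cleaned or "untitled"


print(safe_filename('What? A "test" <1>'))
```

The cleaned title can then be used in place of the raw title when building the `.mp4` file name.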

Result:

 

Method 2: download with a progress bar

<code>def progressbar(video_url, video_title):
    start = time.time()  # download start time
    response = requests.get(video_url, stream=True)  # stream=True is required for chunked download
    size = 0  # bytes downloaded so far
    chunk_size = 1024  # bytes read per chunk
    content_size = int(response.headers["content-length"])  # total file size in bytes
    try:
        if response.status_code == 200:  # check that the request succeeded
            print("Start download,[File size]:{size:.2f} MB".format(
                size=content_size / chunk_size / 1024))  # show the file size before downloading
            filepath = "video" + video_title + ".mp4"  # output file name; the extension is required
            with open(filepath, "wb") as file:
                for data in response.iter_content(chunk_size=chunk_size):
                    file.write(data)
                    size += len(data)
                    # end="\r" moves the cursor back to the line start so the bar redraws in place
                    print("[Progress]:%s%.2f%%" % ("▇" * int(size * 50 / content_size), float(size / content_size * 100)),
                          end="\r")
        end = time.time()  # download end time
        print("Download completed!,times: %.2f s" % (end - start))  # show the elapsed time
        print(f"Video [ {video_title} ] has been saved")
    except Exception as error:
        print("Error:", error)</code>
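The function above reads `response.headers["content-length"]` directly, which raises `KeyError` when the server omits that header. The sketch below separates the drawing of the progress line into a small pure function that also copes with an unknown total. `render_progress` and its fallback text are illustrative and not from the original article.

```python
def render_progress(done, total, width=50):
    """Return one progress line for `done` of `total` bytes.

    When the total is unknown (0 or negative), only the byte count is shown.
    """
    if total <= 0:
        return "[Progress]: %d bytes" % done
    ratio = min(done / total, 1.0)  # clamp in case more data arrives than announced
    return "[Progress]:%s%.2f%%" % ("▇" * int(ratio * width), ratio * 100)


# Simulate a 4096-byte download read in 1024-byte chunks.
for done in range(1024, 4097, 1024):
    print(render_progress(done, 4096, width=20))
```

In the download loop, the total could be read with `int(response.headers.get("content-length", 0))` and each line printed with `end="\r"`.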

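Result:

Just enter the video ID to download the video. A simple desktop GUI application could be built on top of this later; earlier articles have covered similar projects.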
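Users often paste the whole page address instead of the bare ID. The sketch below accepts either form and returns the page URL in the `https://www.ixigua.com/<video_id>` shape used by the complete code. It assumes video IDs are purely numeric, as in the article's example; `to_video_url` is a hypothetical helper.

```python
import re


def to_video_url(user_input):
    """Accept a bare numeric ID or a full ixigua.com URL and return the page URL."""
    text = user_input.strip()
    if re.fullmatch(r"\d+", text):
        video_id = text
    else:
        # Assumption: the ID is the first numeric path segment after the domain.
        match = re.search(r"ixigua\.com/(\d+)", text)
        if match is None:
            raise ValueError("no video ID found in: " + repr(user_input))
        video_id = match.group(1)
    return "https://www.ixigua.com/" + video_id


print(to_video_url("6817258591586615812"))
```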

Complete code

<code>import time
import os
import re
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def get_video_url(html_url):
    """Given the playback page URL, return the video download URL."""
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # run Chrome without a window
    # Windows only: kill any chromedriver process left over from an earlier run
    os.system("taskkill /f /im chromedriver.exe")
    # Note: executable_path and find_element_by_css_selector are Selenium 3 APIs;
    # Selenium 4 uses a Service object and find_element(By.CSS_SELECTOR, ...) instead
    driver = webdriver.Chrome(executable_path="chromedriver.exe", options=chrome_options)
    driver.get(html_url)
    driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear
    video_url = driver.find_element_by_css_selector("#player_default video").get_attribute("src")
    driver.close()
    return video_url


# Method 1 (plain save), kept for reference:
# def save(video_url, video_title):
#     filename = "video" + video_title + ".mp4"
#     video_headers = {
#         "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
#     }
#     video_response = requests.get(url=video_url, headers=video_headers).content
#     with open(filename, mode="wb") as f:
#         f.write(video_response)
#         print("Downloading and saving:", video_title)


def progressbar(video_url, video_title):
    start = time.time()  # download start time
    response = requests.get(video_url, stream=True)  # stream=True is required for chunked download
    size = 0  # bytes downloaded so far
    chunk_size = 1024  # bytes read per chunk
    content_size = int(response.headers["content-length"])  # total file size in bytes
    try:
        if response.status_code == 200:  # check that the request succeeded
            print("Start download,[File size]:{size:.2f} MB".format(
                size=content_size / chunk_size / 1024))  # show the file size before downloading
            filepath = "video" + video_title + ".mp4"  # output file name; the extension is required
            with open(filepath, "wb") as file:
                for data in response.iter_content(chunk_size=chunk_size):
                    file.write(data)
                    size += len(data)
                    # end="\r" moves the cursor back to the line start so the bar redraws in place
                    print("[Progress]:%s%.2f%%" % ("▇" * int(size * 50 / content_size), float(size / content_size * 100)),
                          end="\r")
        end = time.time()  # download end time
        print("Download completed!,times: %.2f s" % (end - start))  # show the elapsed time
        print(f"Video [ {video_title} ] has been saved")
    except Exception as error:
        print("Error:", error)


def main(html_url):
    headers = {
        "cookie": "your own cookie here",
        "referer": "https://www.ixigua.com/?wid_try=1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    }
    response = requests.get(url=html_url, headers=headers)
    response.encoding = response.apparent_encoding
    # The patterns contain double quotes, so the Python strings use single quotes
    play_url = re.findall('"embedUrl":"(.*?)"', response.text)[0]
    title = re.findall('<title data-react-helmet="true">(.*?)</title>', response.text)[0].replace(" - 西瓜视频", "")
    video_url = get_video_url(play_url)
    progressbar(video_url, title)


if __name__ == "__main__":
    video_id = input("Enter the ID of the video to download: ")
    url = f"https://www.ixigua.com/{video_id}"
    main(url)</code>

