Web Scraping

2023-05-15 02:54 | Source: compiled from the web | Views: 265

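Contents
  Parsing ts information from the m3u8 file
  Cutting the video by time
  Fetching ts files
    Single-file test
    Batch download
  Merging ts files
  Converting the merged ts file to a video file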

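References:

  Introduction to the m3u8 format
  Introduction to the ts file format
  Video download
  Reading m3u8 files in Python
  Converting ts to mp4

# Set up the environment
import glob
import os
import re
import sys
import time

import numpy as np
import requests

work_dir = os.getcwd()
print(work_dir)

# Directory used to store the ts files
file_dir = os.path.join(work_dir, 'file_tmp')
if not os.path.exists(file_dir):
    os.mkdir(file_dir)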

First, define a function that saves a file.

def savefile(file_url, file_name):
    # Send a browser-like User-Agent so the request is less likely to be blocked; usually this is not a problem
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36'
    }
    r = requests.get(file_url, headers=headers)
    if r.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(r.content)

Parsing ts information from the m3u8 file

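How do you find the m3u8 file?

Suppose the video page is open in Chrome.

Right-click and choose Inspect, open Network -> All, and filter for .m3u8.

You will usually see two m3u8 addresses. The one with hls in its path is the file from which the ts information can be parsed.

Take a URL as an example, such as this video.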

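# If the URL does not contain hls, it is the source m3u8 file.
# The source m3u8 file points to another m3u8 file whose address contains hls.
# This is the source m3u8 file, without hls.
url_m3u8 = 'https://wuji.zhulong-zuida.com/20190706/762_c260ca6c/index.m3u8'
r = requests.get(url_m3u8)
r.encoding = 'utf-8'
# Inspect the content
print(r.text)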

Output:
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000,RESOLUTION=1080x608
800k/hls/index.m3u8

The last line is the address of the m3u8 file that the source file points to, given relative to the source URL.

# Build the m3u8 address that contains hls
if r.text.split('\n')[-1] == '':
    hls_mark = r.text.split('\n')[-2]  # in case the text ends with \n
else:
    hls_mark = r.text.split('\n')[-1]
url_m3u8_hls = url_m3u8.replace('index.m3u8', hls_mark)
url_m3u8_hls

Output: 'https://wuji.zhulong-zuida.com/20190706/762_c260ca6c/800k/hls/index.m3u8'
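The string replacement above works when the source URL ends in index.m3u8. As a hedged alternative, the standard library's urljoin resolves a relative playlist entry against the source URL without depending on the file name. The helper name resolve_variant_url and the sample text are assumptions made for illustration; the sample simply reproduces the output shown above.

```python
from urllib.parse import urljoin


def resolve_variant_url(master_url, playlist_text):
    """Return the absolute URL of the last non-comment entry in a playlist.

    Hypothetical helper: it mirrors the "take the last line" logic above,
    but skips blank lines and lets urljoin handle the path arithmetic.
    """
    entries = []
    for raw in playlist_text.splitlines():
        line = raw.strip()
        if line and not line.startswith('#'):
            entries.append(line)
    if not entries:
        raise ValueError('no playlist entry found')
    return urljoin(master_url, entries[-1])


# Sample text copied from the output of the source m3u8 request
sample_text = (
    '#EXTM3U\n'
    '#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000,RESOLUTION=1080x608\n'
    '800k/hls/index.m3u8\n'
)
url_m3u8 = 'https://wuji.zhulong-zuida.com/20190706/762_c260ca6c/index.m3u8'
resolved = resolve_variant_url(url_m3u8, sample_text)
print(resolved)
```

Because urljoin also accepts absolute entries, the same call should cover playlists that list a full URL instead of a relative path.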

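# Sometimes the redirected link that contains hls cannot be found.
# However, the video loads its files from URLs of the form <base URL> + <file name>.ts
# This base URL is the one that contains hls,
# and the m3u8 index has the form <base URL>/index.m3u8
url_m3u8_hls = 'https://wuji.zhulong-zuida.com/20190706/762_c260ca6c/800k/hls/index.m3u8'
# Request the hls m3u8 file; without this request, r still holds the source m3u8 response
r = requests.get(url_m3u8_hls)
# The hls m3u8 file provides the ts information,
# including each ts file name and its duration.
# This file is useful, so save it first
file_m3u8 = url_m3u8_hls.split('/')[-1]
with open(file_m3u8, 'wb') as f:
    f.write(r.content)
# iter_lines yields bytes
text_bytes = list(r.iter_lines())
# Decode to ordinary strings
text_string = [i.decode('utf-8') for i in text_bytes]
# Select the segment lines.
# In some cases the segments use another extension, such as png; rename them after downloading
# ts_name = [i for i in text_string if i.endswith('.ts')]
ts_name = [i for i in text_string if i and not i.startswith('#')]
ts_name[:3]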

Output: ['36962c1a1b0000000.ts', '36962c1a1b0000001.ts', '36962c1a1b0000002.ts']

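Sometimes the ts entries also include some path information. Because the path is the same for every entry, only the file name is needed.

if '/' in ts_name[1]:
    # Some ts entries carry path information; keep only the file name
    ts_name = [i.split('/')[-1] for i in ts_name]
ts_name[:3]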

Next, extract the duration of each segment.

# Select the lines that carry a duration
ts_time = [float(re.findall(r'[.\d]+', i)[0]) for i in text_string if i.startswith('#EXTINF')]
ts_time[:3]

Output: [4.1283, 4.3785, 4.17]

# Check that the number of parsed durations matches the number of file names
len(ts_name) == len(ts_time)

Output: True
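The length check confirms that the two lists have the same size, but not that each duration belongs to the right file. A minimal sketch of a parser that pairs each #EXTINF line with the segment line that follows it is shown here. The function name parse_media_playlist and the sample playlist are assumptions: the sample reuses the file names and durations printed above, but the real playlist content is not shown in this article.

```python
import re


def parse_media_playlist(text):
    """Pair each #EXTINF duration with the segment line that follows it.

    Hypothetical helper; returns a list of (file_name, duration) tuples.
    """
    segments = []
    pending = None  # duration waiting for its segment line
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('#EXTINF'):
            match = re.search(r'#EXTINF:\s*([0-9.]+)', line)
            pending = float(match.group(1)) if match else None
        elif line and not line.startswith('#') and pending is not None:
            # Keep only the file name, as done for ts_name above
            segments.append((line.split('/')[-1], pending))
            pending = None
    return segments


# Assumed sample, built from the names and durations shown in the outputs above
sample_playlist = (
    '#EXTM3U\n'
    '#EXT-X-TARGETDURATION:5\n'
    '#EXTINF:4.1283,\n'
    '36962c1a1b0000000.ts\n'
    '#EXTINF:4.3785,\n'
    '36962c1a1b0000001.ts\n'
    '#EXTINF:4.17,\n'
    '36962c1a1b0000002.ts\n'
    '#EXT-X-ENDLIST\n'
)
segments = parse_media_playlist(sample_playlist)
print(segments)
```

Keeping name and duration in one tuple means a later filter or slice cannot put the two lists out of step.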

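Cutting the video by time

# Establish the time base
# Cumulative time series
time_cum = np.cumsum(ts_time)
# Which files should be downloaded to watch from 1:26:05 to 1:46:20?
time_start = 1*3600 + 26*60 + 5
time_end = 1*3600 + 46*60 + 20
# If several time ranges are needed, a function can convert the time series
# Input: [(0.0.0,0.9.30),(0.10.0,0.20.0)]
# Output: indices into the cumulative timestamps [(0,38),(40,80)]
# For the start time
# Filter the cumulative timestamps
[...] combine.ts
os.system(cmd_str)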

If the command succeeds, os.system returns 0.
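Returning briefly to the time-based cut: mapping a time range to segment indices can be sketched with the standard library alone. This is an illustration under assumptions, not the article's own code. The names to_seconds and segment_range are hypothetical, the 'h.m.s' reading of stamps such as 0.9.30 is an interpretation of the comment in the listing, and bisect with itertools.accumulate stands in for the NumPy cumulative sum so that the block is self-contained.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def to_seconds(stamp):
    """Convert an 'h.m.s' string such as '0.9.30' to seconds (assumed format)."""
    hours, minutes, seconds = (int(part) for part in stamp.split('.'))
    return hours * 3600 + minutes * 60 + seconds


def segment_range(durations, start, end):
    """Return (first, last) indices of the segments that cover [start, end] seconds."""
    if not durations:
        raise ValueError('durations is empty')
    if start >= end:
        raise ValueError('start must be smaller than end')
    ends = list(accumulate(durations))  # ends[i] is the time at which segment i finishes
    if start >= ends[-1]:
        raise ValueError('start lies beyond the end of the video')
    first = bisect_right(ends, start)   # first segment that finishes after start
    last = bisect_left(ends, end)       # first segment that finishes at or after end
    last = min(last, len(durations) - 1)
    return first, last


# Five 4-second segments: 0-4, 4-8, 8-12, 12-16, 16-20
demo_durations = [4.0, 4.0, 4.0, 4.0, 4.0]
demo_range = segment_range(demo_durations, 5, 13)
print(demo_range)               # (1, 3)
print(to_seconds('1.26.5'))     # 5165
```

With the real data, something like first, last = segment_range(ts_time, time_start, time_end) followed by ts_name[first:last + 1] would give the files to download.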

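A second way to merge the ts files is to write all of them, in order, into a new file.

file_out = 'merge_02.ts'
with open(file_out, 'wb') as f_out:
    for f_in in file_list:
        # Close each input file as soon as it has been copied
        with open(f_in, 'rb') as f:
            f_out.write(f.read())

Converting the merged ts file to a video file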

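Finally, convert the merged ts file into a video file (for example, MP4). Here we call ffmpeg to do the conversion.

The conversion command is

ffmpeg -i <file name>.ts -c copy [video name]
e.g. ffmpeg -i merge.ts -c copy 'clipped_video.mp4'

file_in = os.path.join(work_dir, 'merge.ts')
# If a path contains spaces, it must be wrapped in double quotes; otherwise the file will not be found
file_out = "merge.mp4"
# ffmpeg here is the prebuilt package downloaded from the ffmpeg website; it needs no installation
cmd = './ffmpeg -i "' + file_in + '" -c copy "' + file_out + '"'
os.system(cmd)  # returns 0 on success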
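Quoting paths by hand is easy to get wrong. One hedged alternative is to pass the command to subprocess.run as a list of arguments, which bypasses the shell so that spaces in paths need no quoting. The helper names build_ffmpeg_cmd and convert_ts and the example path are assumptions for illustration; the ffmpeg flags are the same ones used above.

```python
import subprocess


def build_ffmpeg_cmd(file_in, file_out, ffmpeg_bin='./ffmpeg'):
    """Build the argument list for a stream-copy conversion (same flags as above)."""
    return [ffmpeg_bin, '-i', file_in, '-c', 'copy', file_out]


def convert_ts(file_in, file_out, ffmpeg_bin='./ffmpeg'):
    """Run ffmpeg and return its exit code; 0 means success.

    Hypothetical wrapper: it assumes an ffmpeg binary exists at ffmpeg_bin.
    """
    result = subprocess.run(build_ffmpeg_cmd(file_in, file_out, ffmpeg_bin))
    return result.returncode


# Only the command is built here; nothing is executed.
# The path with a space is an assumed example.
demo_cmd = build_ffmpeg_cmd('my videos/merge.ts', 'merge.mp4')
print(demo_cmd)
```

Calling convert_ts(file_in, file_out) would then replace the os.system call, and the return value can be checked against 0 in the same way.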
