Chaoxing System Login and Information Scraping

2024-07-10 21:29 | Source: compiled from the web | Views: 265

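For university students who lived through last year's pandemic, Chaoxing will be very familiar to some of them. The idea for this project came from my advisor: build a tool that can scrape students' grades for Chaoxing courses and assemble exam papers at random (provided your own question bank contains questions). The project can count the question types and the number of questions in the question bank, assemble papers from a template, and send a message to everyone with one click, which saves a lot of clicking around. Now to the main content. The first step is login: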

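The Chaoxing login link (click here) is the current Chaoxing login page. There is also another page (click here for the other one). I chose this one because it supports QR-code login, student ID or staff ID login (I call this institution login), phone number plus password login, and phone number plus verification code login (which I did not get working). In short, it offers many login methods.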

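[Screenshot]

Looking at this page, there are two login methods. We start with phone number and password login. Enter a wrong account and password and inspect the request: in the screenshot above, uname is the phone number and password is obviously the password, but it has been transformed. The transformation is simple. It is just Base64 encoding (an encoding rather than real encryption), as the next screenshot shows.

[Screenshot]

As shown above, the Base64-encoded result matches what we captured. The rest is easy: send a POST request.

Next we analyze the other method, QR-code login. Careful readers will notice a request named getauthstatus in the capture; it is tied to QR-code login. From my analysis, the page sends this request repeatedly, and after polling for a while the QR code expires. Don't worry about it expiring, since its lifetime is fairly long. The first step of QR-code login is to get the QR code, which is simple: the QR code link is in the HTML source of the login page, so we download the image and display it.

[Screenshot]

Then comes the getauthstatus request. It needs the parameters enc and uuid. A global search shows that both are in the login page source, right above the QR code link (the red box in the screenshot above). My approach to QR-code login is to download the QR code, display it, and keep sending getauthstatus at the same time. getauthstatus returns data from which we can tell the login state: after the code is scanned it returns the user name, and after the mobile side confirms we get a response again. At that point we request the general URL http://i.chaoxing.com/ with redirects allowed, which lands on the page Chaoxing shows after a successful login. Of course, this only fetches that page.

Next we analyze student ID or staff ID login.

[Screenshot]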
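The QR-code flow described above comes down to a polling loop around getauthstatus. The following is a minimal sketch of that loop under stated assumptions, not the project's exact code: `poll_auth_status` and `fetch_status` are hypothetical names, and the response shape (a `status` flag, plus `nickname` once the code has been scanned) is assumed from the behaviour described in this post. The network call is passed in as a function so the loop can be tried without contacting Chaoxing.

```python
import time
from typing import Callable, Optional


def poll_auth_status(
    fetch_status: Callable[[], Optional[dict]],
    max_polls: int = 150,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until login is confirmed; False means the QR code should be refreshed.

    fetch_status is a hypothetical callable that performs one getauthstatus
    request and returns the parsed JSON, or None while nobody has scanned yet.
    """
    for _ in range(max_polls):
        result = fetch_status()
        if result is not None:
            if result.get("status") is True:
                # Confirmed on the phone: the session can now request i.chaoxing.com.
                return True
            # Scanned but not yet confirmed: the response carries the user name.
            print("Scanned by {}, waiting for confirmation".format(result.get("nickname")))
        sleep(interval)
    return False
```

In real use, `fetch_status` would wrap a `session.post` to the getauthstatus URL with `enc` and `uuid`, as the login source later in this post does. Returning False after a fixed number of polls lets the caller download a fresh QR code and start again.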

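It sits under "other login methods". This method involves a CAPTCHA, however. I thought of CAPTCHA recognition and found a CAPTCHA model to train, but training kept failing on my computer and I don't know why, so the CAPTCHA has to be typed in by hand. Enter some random values and capture the request.

[Screenshot]

We find that uname is the account, numcode is the CAPTCHA, and password is the encoded password (Base64 again). There is also fid, which is an ID for each school. Then we simply send the request.

So far I have only covered login. In my code I use a session: after a successful login the cookies are saved, so within a certain period we only need to log in once. Each time the session is used the cookies are checked; if they are valid the following operations proceed, and if not we log in again and then continue. With login solved, the next topic is information scraping.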
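To illustrate the password encoding and the form fields discussed above, here is a small sketch. `encode_password` and `build_unit_login_form` are hypothetical helper names; the field names (fid, uname, numcode, password) are the ones seen in the captured request, while every value in the example is made up. The full request in the source listing carries a few more fields.

```python
import base64


def encode_password(password: str) -> str:
    """Base64-encode the password, matching what the captured request shows."""
    return base64.b64encode(password.encode("utf-8")).decode("ascii")


def build_unit_login_form(fid: str, uname: str, password: str, numcode: str) -> dict:
    """Assemble the form fields for institution login (illustrative values only)."""
    return {
        "fid": fid,          # the school's id
        "uname": uname,      # student ID or staff ID
        "numcode": numcode,  # CAPTCHA typed in by the user
        "password": encode_password(password),
    }


form = build_unit_login_form(fid="12345", uname="20200001", password="abc123", numcode="9k2f")
print(form["password"])  # prints YWJjMTIz
```

Because Base64 is reversible, this step hides nothing from anyone who can see the request; it only matches the format the server expects.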

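Grade statistics

[Screenshot]

First we need the course URL. It is in the page we get after logging in, and it can be extracted with a regular expression or XPath; I use a regular expression here. After getting the course link we request it and then get the statistics URL (the "统计" (Statistics) item in the grey block in the screenshot above), again with a regular expression. One point deserves emphasis: we need to match the five parameters courseId, classId, enc, cpi and openc out of the statistics link and keep them as global variables. They are useful not only now but also later for mass messaging and for template or random paper assembly. Next we request the URL built from these parameters, get the page source, and match out the information we need. I write this information into a dictionary whose keys are the names and whose values are the grades, then convert it to an Excel sheet with pandas. There is nothing difficult here.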
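As an illustration of pulling the five parameters out of the statistics link, here is a short sketch. It swaps the regular expression used in the project for the standard library's URL parser, which is an alternative technique and not the author's method; `extract_course_params` is a hypothetical name and the example link is made up.

```python
from urllib.parse import parse_qs, urlparse

PARAM_NAMES = ("courseId", "classId", "enc", "cpi", "openc")


def extract_course_params(statistics_url: str) -> dict:
    """Return the five query parameters of a statistics link, or raise if one is missing."""
    query = parse_qs(urlparse(statistics_url).query)
    missing = [name for name in PARAM_NAMES if name not in query]
    if missing:
        raise ValueError("missing parameters: {}".format(", ".join(missing)))
    # parse_qs returns a list per key; take the first value of each.
    return {name: query[name][0] for name in PARAM_NAMES}


# Made-up example link; real links come from the course page.
example = "https://example.com/statistics?courseId=111&classId=222&enc=abc&cpi=333&openc=def"
params = extract_course_params(example)
print(params)
```

Keeping the result in one dictionary makes it easy to pass the same values on to the later steps that reuse them.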

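Mass messaging

Mass messaging is also worked out by capturing the traffic and seeing what form the request takes. It is simple too, and it uses some of the parameters mentioned above.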

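Random or template paper assembly

A note here: I originally planned to use Go's chromedp package, but I ran into one difficulty with it, namely confirming popup dialogs. I looked through the documentation and did not find an answer. chromedp can drive the Chrome browser without any separate driver, unlike the selenium I use now, which needs a driver to be configured and the Path environment variable to be set, and which is somewhat slower than chromedp. This part also requires logging in again; here too I save the cookies, so after logging in once there is no need to log in again for a while. I have thought about unifying the cookies of selenium and requests, but so far I have not found a good way. This login supports all of Chaoxing's login methods. The rest is just click and send_keys calls; see my source code for details.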
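One possible way to share a login between selenium and requests is sketched below. It is offered only as an untested idea and is not something the original project does: the cookie dicts returned by `driver.get_cookies()` are converted into keyword arguments for the `set` method of a requests cookie jar. The helper name `selenium_cookie_to_kwargs` is hypothetical, the sample cookie is made up, and whether Chaoxing accepts a session transferred this way is an assumption.

```python
def selenium_cookie_to_kwargs(cookie: dict) -> dict:
    """Map one selenium cookie dict to keyword arguments for RequestsCookieJar.set."""
    kwargs = {"name": cookie["name"], "value": cookie["value"]}
    # Copy only the optional attributes that are present.
    for key in ("domain", "path", "secure"):
        if key in cookie:
            kwargs[key] = cookie[key]
    # selenium reports the expiry time under "expiry"; requests calls it "expires".
    if "expiry" in cookie:
        kwargs["expires"] = cookie["expiry"]
    return kwargs


# Made-up cookie in the shape selenium returns.
sample = {"name": "token", "value": "xyz", "domain": ".chaoxing.com", "path": "/", "expiry": 1700000000}
print(selenium_cookie_to_kwargs(sample))
```

With a requests session, each result could then be applied as `session.cookies.set(**selenium_cookie_to_kwargs(c))` for every `c` in `driver.get_cookies()`, so that only one of the two login paths has to be run.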

Login source:

import base64
import os
import platform
import re
import subprocess
import sys
import time
from http import cookiejar

import muggle_ocr
import requests
from PIL import Image


# Chaoxing login
class chaoxing_login(object):
    def __init__(self):
        self.session = requests.session()
        self.session.cookies = cookiejar.LWPCookieJar(filename='core/chaoxing_cookies.txt')
        self.login_headers = {
            'Origin': 'http://passport2.chaoxing.com',
            'Referer': 'http://passport2.chaoxing.com/login?loginType=3&newversion=true&fid=-1&refer=http%3A%2F%2Fi.chaoxing.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
            'Host': 'passport2.chaoxing.com',
        }
        # Request headers used once login has completed
        self.login_complete_headers = {
            'Host': 'i.chaoxing.com',
            'Referer': 'http://passport2.chaoxing.com/',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
        }
        self.account_url = 'http://passport2.chaoxing.com/unitlogin'
        # Form data for institution login
        self.account_data = {
            'fid': '学校id',  # placeholder: replace with your school's id
            'uname': '',
            'numcode': '',
            'password': '',
            'refer': 'http%3A%2F%2Fi.chaoxing.com',
            't': 'true',
        }
        self.phone_url = 'http://passport2.chaoxing.com/fanyalogin'
        # Form data for phone number login
        self.phone_data = {
            'fid': '-1',
            'uname': '',
            'password': '',
            'refer': 'http%3A%2F%2Fi.chaoxing.com',
            't': 'true',
            'forbidotherlogin': '0',
        }
        self.QR_code_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'Host': 'passport2.chaoxing.com',
            'Referer': 'http://passport2.chaoxing.com/login?fid=&newversion=true&refer=http%3A%2F%2Fi.chaoxing.com',
            'Upgrade-Insecure-Requests': '1',
        }
        self.session.headers = self.login_headers

    # Display an image with the platform's default viewer
    def show_img(self, file_name):
        userPlatform = platform.system()
        if userPlatform == 'Darwin':  # Mac
            subprocess.call(['open', file_name])
        elif userPlatform == 'Linux':  # Linux
            subprocess.call(['xdg-open', file_name])
        else:  # Windows
            os.startfile(file_name)

    # 13-digit (millisecond) timestamp
    def get_time_stamp(self):
        time_stamp = str(int(time.time() * 1000))
        return time_stamp

    # Fetch the CAPTCHA image
    def get_captcha(self):
        print('Fetching CAPTCHA......')
        captcha_url = "http://passport2.chaoxing.com/num/code?{}".format(self.get_time_stamp())
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36",
        }
        response = self.session.get(url=captcha_url, headers=headers)
        if response.status_code == 200:
            print('CAPTCHA fetched')
            content = response.content
            with open(r"core\chaoxing_captcha.png", "wb") as f:
                f.write(content)
            self.show_img(file_name=r'core\chaoxing_captcha.png')
        else:
            print('Sorry, failed to fetch the CAPTCHA\n'
                  'The program will exit; please start it again')
            sys.exit()

    # Password encoding (Base64)
    def password_encrypt(self, password):
        password = base64.b64encode(password.encode())
        password = password.decode()
        return password

    # Check the saved cookies
    def check_cookies(self):
        self.session.headers = self.login_complete_headers
        try:
            # Load cookies from file
            self.session.cookies.load(ignore_discard=True)
            url = "http://i.chaoxing.com/"
            response = self.session.get(url=url, allow_redirects=True)
            if response.status_code == 200:
                # print(response.text)
                return True
            else:
                return False
        except FileNotFoundError:
            return "no cookie file"

    # Prompt for account and password
    def input(self):
        uname = input('Enter account: ')
        password = input('Enter password: ')
        password = self.password_encrypt(password)
        if self.num == '1':
            self.account_data['uname'] = uname
            self.account_data['password'] = password
            self.get_captcha()
            numcode = input('Enter CAPTCHA: ')
            self.account_data['numcode'] = numcode
        elif self.num == '2':
            self.phone_data['uname'] = uname
            self.phone_data['password'] = password

    # uuid and enc needed for QR-code login
    def get_uuid_enc(self):
        url = 'http://passport2.chaoxing.com/login'
        params = {
            'fid': '',
            'newversion': 'true',
            'refer': 'http://i.chaoxing.com',
        }
        response = self.session.get(url=url, params=params)
        text = response.text
        # NOTE: both regex patterns are empty in the source as published
        # (lost in extraction); they are left as-is.
        self.uuid = re.findall('', text)[0]
        self.enc = re.findall('', text)[0]

    def QR_png(self):
        print('Fetching QR code......')
        self.get_uuid_enc()
        url = 'http://passport2.chaoxing.com/createqr'
        params = {
            'uuid': self.uuid,
            'fid': '-1',
        }
        self.session.headers = self.QR_code_headers
        response = self.session.get(url=url, params=params)
        if response.status_code == 200:
            print('QR code fetched')
            content = response.content
            # The image data is bytes, hence mode 'wb'
            with open(r'core\QR.png', 'wb') as f:
                f.write(content)
            print('QR code saved')
            self.show_img(file_name=r'core\QR.png')
            # self.getauthstatus()
        else:
            print('Sorry, failed to fetch the QR code\n'
                  'The program will exit; please start it again')
            sys.exit()

    # Poll the QR-code login status
    def getauthstatus(self):
        count = 0
        while True:
            getauthstatus_url = 'http://passport2.chaoxing.com/getauthstatus'
            data = {
                'enc': self.enc,
                'uuid': self.uuid,
            }
            response = self.session.post(url=getauthstatus_url, data=data)
            text = response.text
            if '未登录' not in text:  # '未登录' = "not logged in"
                dic = response.json()
                if dic['status'] == False:
                    self.uid = dic['uid']
                    self.nickname = dic['nickname']
                    print('User ==>{}<== please confirm the login'.format(self.nickname))
                elif dic['status'] == True:
                    print('User ==>{}<== login confirmed'.format(self.nickname))
                    return True
            else:
                print('Still waiting, please scan the QR code')
            # Stop after 150 polls so the caller refreshes the QR code
            # (the original used "count += count", which never advanced the counter)
            count += 1
            if count == 150:
                return False
            time.sleep(1)

    # Judge the login response; not usable for QR-code login
    def login_info_judge(self):
        response = self.session.post(url=self.url, data=self.data)
        text = response.text
        if 'captcha is incorrect' in text or '验证码错误' in text:
            return 'CAPTCHA incorrect'
        elif 'account or passport is wrong' in text or '用户名或密码错误' in text:
            return 'wrong username or password'
        else:
            return True

    # Institution login
    def account_login(self):
        while True:
            self.input()
            mes = self.login_info_judge()
            if mes != True:
                print('Login failed!')
                print(mes)
                if mes == 'CAPTCHA incorrect':
                    self.get_captcha()
                    numcode = input('Enter CAPTCHA: ')
                    self.account_data['numcode'] = numcode
                    mes = self.login_info_judge()
                else:
                    mes = self.login_info_judge()
            else:
                print('Login succeeded!')
                self.session.cookies.save()
                print('Cookies saved!')
                return ''

    # Phone number login
    def phone_sign(self):
        while True:
            self.input()
            mes = self.login_info_judge()
            if mes != True:
                print('Login failed!')
                print(mes)
                mes = self.login_info_judge()
            else:
                print('Login succeeded!')
                self.session.cookies.save()
                print('Cookies saved!')
                return ''

    # QR-code login
    def QR_code_sign(self):
        self.QR_png()
        while True:
            judge_info = self.getauthstatus()
            if judge_info == True:
                break
            else:
                self.QR_png()
        return ''

    # Login entry point
    def login(self):
        b = self.check_cookies()
        if b == True:
            print('Chaoxing cookies are valid')
        else:
            print('Institution login is not recommended because the CAPTCHA must be typed by hand; QR-code login is recommended')
            if b == False:
                print('Chaoxing cookies have expired')
            else:
                print('No Chaoxing cookie file')
            self.session.headers = self.login_headers
            print('1 = institution login\n'
                  '2 = phone number login\n'
                  '3 = QR-code login')
            self.num = input('Choose a login method: ')
            if self.num == '1':
                self.url = self.account_url
                self.data = self.account_data
                self.account_login()
            elif self.num == '2':
                self.url = self.phone_url
                self.data = self.phone_data
                self.phone_sign()
            elif self.num == '3':
                self.QR_code_sign()
            self.session.headers = self.login_complete_headers
            url = "http://i.chaoxing.com/"
            response = self.session.get(url=url, allow_redirects=True)
            if response.status_code == 200:
                print('Login succeeded')
                # print(response.text)
                self.session.cookies.save()
                print('Cookies saved')
        return self.session

Random and template paper assembly source (selenium):

import json
import os
import platform
import subprocess
import sys
from math import ceil
from time import sleep

from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# NOTE: this listing uses the find_element_by_* helpers and executable_path,
# i.e. the Selenium 3 API.
class login():
    def __init__(self):
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')  # hide the browser window
        chrome_options.add_argument(
            '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36')  # set the User-Agent request header
        chrome_options.add_argument('--disable-infobars')  # suppress the "controlled by automated software" notice
        self.driver = webdriver.Chrome(options=chrome_options, executable_path=r'core\chromedriver.exe')
        self.login_url = "http://passport2.chaoxing.com/login?fid=&newversion=true&refer=http%3A%2F%2Fi.chaoxing.com"

    def phone_login(self):
        phone = input('Enter phone number:')
        self.driver.find_element_by_css_selector("#phone").send_keys(phone)
        password = input('Enter password:')
        self.driver.find_element_by_css_selector("#pwd").send_keys(password)
        self.driver.find_element_by_css_selector("#loginBtn").click()

    def jigou_login(self):
        # Click "other login methods"
        self.driver.find_element_by_xpath('//*[@id="otherlogin"]').click()
        sleep(1)
        # School name typed into the institution field
        self.driver.find_element_by_xpath('//*[@id="inputunitname"]').send_keys('浙大宁波理工学院')
        sleep(0.75)
        uname = input('Enter student ID or staff ID:')
        self.driver.find_element_by_xpath('//*[@id="uname"]').send_keys(uname)
        sleep(0.75)
        password = input('Enter password:')
        self.driver.find_element_by_xpath('//*[@id="password"]').send_keys(password)
        sleep(0.75)
        self.driver.find_element_by_xpath('//*[@id="numVerCode"]').screenshot(r'core\chaoxing_captcha.png')
        file_name = r'core\chaoxing_captcha.png'
        self.show_img(file_name)
        txtSecretCode = input('Enter CAPTCHA:')
        self.driver.find_element_by_xpath('//*[@id="vercode"]').send_keys(txtSecretCode)
        self.driver.find_element_by_xpath('//*[@id="loginBtn"]').click()

    def show_img(self, file_name):
        userPlatform = platform.system()
        if userPlatform == 'Darwin':  # Mac
            subprocess.call(['open', file_name])
        elif userPlatform == 'Linux':  # Linux
            subprocess.call(['xdg-open', file_name])
        else:  # Windows
            os.startfile(file_name)

    def QR_login(self):
        self.driver.find_element_by_xpath('//*[@id="quickCode"]').screenshot(r'core\QR.png')
        file_name = r'core\QR.png'
        self.show_img(file_name)

    def check_state(self):
        self.driver.get('http://i.chaoxing.com/')
        sleep(1)
        # '账号管理' ("Account management") only appears on the logged-in page
        f = '账号管理' in self.driver.page_source
        if f == True:
            print('Login succeeded')
            return True
        else:
            print('Login failed')
            return False

    def save_cookie(self):
        cookie_list = self.driver.get_cookies()
        jsonCookies = json.dumps(cookie_list)
        with open(r'core\chaoxing_cookies.json', 'w') as f:
            f.write(jsonCookies)
        print('Cookies saved')

    def read_cookie(self):
        with open(r'core\chaoxing_cookies.json', 'r') as f:
            list_cookie = json.loads(f.read())
        return list_cookie

    def check_cookie(self, list_cookie):
        self.driver.get(self.login_url)
        self.driver.delete_all_cookies()
        for cookie in list_cookie:
            self.driver.add_cookie(cookie)
        sleep(2)
        self.driver.get(self.login_url)
        sleep(1)
        self.driver.get('http://i.chaoxing.com/')
        f = self.check_state()
        if f == True:
            print('Cookies are valid')
            return True
        else:
            print('Cookies are invalid')
            return False

    def check_file(self):
        f = os.path.exists(r'core\chaoxing_cookies.json')
        if f == True:
            list_cookie = self.read_cookie()
            f = self.check_cookie(list_cookie)
            if f == True:
                return True
            else:
                return False
        return False

    def login(self):
        self.driver.maximize_window()
        self.driver.get(self.login_url)
        f = self.check_file()
        if f == False:
            while True:
                print('Choose a login method:\n'
                      '1 = phone number login\n'
                      '2 = student ID or staff ID login\n'
                      '3 = QR-code login\n'
                      '4 = verification code login\n'
                      'Any other input exits')
                num = input('Enter a number:')
                if num == '1':
                    self.phone_login()
                elif num == '2':
                    self.jigou_login()
                elif num == '3':
                    self.QR_login()
                elif num == '4':
                    pass
                else:
                    sys.exit()
                f = self.check_state()
                if f == True:
                    self.save_cookie()
                    break
                else:
                    pass
        return self.driver


class exam():
    def __init__(self):
        self.driver = login().login()
        self.dict = {}
        # XPaths per question type; the keys are the type names as shown on the page
        self.subject_xpath_dict = {
            "单选题": {'delete': '//*[@id="typetrid1"]/span[2]/a', 'score': '//*[@id="0_score"]', 'subject_num': '//*[@id="0_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "多选题": {'delete': '//*[@id="typetrid2"]/span[2]/a', 'score': '//*[@id="1_score"]', 'subject_num': '//*[@id="1_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "填空题": {'delete': '//*[@id="typetrid3"]/span[2]/a', 'score': '//*[@id="2_score"]', 'subject_num': '//*[@id="2_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "判断题": {'delete': '//*[@id="typetrid4"]/span[2]/a', 'score': '//*[@id="3_score"]', 'subject_num': '//*[@id="3_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "简答题": {'delete': '//*[@id="typetrid5"]/span[2]/a', 'score': '//*[@id="4_score"]', 'subject_num': '//*[@id="4_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "名词解释": {'delete': '//*[@id="typetrid6"]/span[2]/a', 'score': '//*[@id="5_score"]', 'subject_num': '//*[@id="5_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "论述题": {'delete': '//*[@id="typetrid7"]/span[2]/a', 'score': '//*[@id="6_score"]', 'subject_num': '//*[@id="6_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "计算题": {'delete': '//*[@id="typetrid8"]/span[2]/a', 'score': '//*[@id="7_score"]', 'subject_num': '//*[@id="7_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "分录题": {'delete': '//*[@id="typetrid9"]/span[2]/a', 'score': '//*[@id="9_score"]', 'subject_num': '//*[@id="9_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "资料题": {'delete': '//*[@id="typetrid10"]/span[2]/a', 'score': '//*[@id="10_score"]', 'subject_num': '//*[@id="10_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "连线题": {'delete': '//*[@id="typetrid11"]/span[2]/a', 'score': '//*[@id="11_score"]', 'subject_num': '//*[@id="11_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "排序题": {'delete': '//*[@id="typetrid13"]/span[2]/a', 'score': '//*[@id="13_score"]', 'subject_num': '//*[@id="13_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "完型填空": {'delete': '//*[@id="typetrid14"]/span[2]/a', 'score': '//*[@id="14_score"]', 'subject_num': '//*[@id="14_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "阅读理解": {'delete': '//*[@id="typetrid15"]/span[2]/a', 'score': '//*[@id="15_score"]', 'subject_num': '//*[@id="15_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "程序题": {'delete': '//*[@id="typetrid17"]/span[2]/a', 'score': '//*[@id="17_score"]', 'subject_num': '//*[@id="17_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "口语题": {'delete': '//*[@id="typetrid18"]/span[2]/a', 'score': '//*[@id="18_score"]', 'subject_num': '//*[@id="18_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "听力题": {'delete': '//*[@id="typetrid19"]/span[2]/a', 'score': '//*[@id="19_score"]', 'subject_num': '//*[@id="19_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "共用选项题": {'delete': '//*[@id="typetrid20"]/span[2]/a', 'score': '//*[@id="20_score"]', 'subject_num': '//*[@id="20_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
            "其它": {'delete': '//*[@id="typetrid21"]/span[2]/a', 'score': '//*[@id="8_score"]', 'subject_num': '//*[@id="8_TypeDiv"]/li[2]/div[2]/div[1]/p[2]/input[2]'},
        }
        # Checkbox XPath for each question type in the "more types" dialog
        self.choice_dict = {
            "单选题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[1]/input',
            "多选题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[2]/input',
            "填空题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[3]/input',
            "判断题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[4]/input',
            "简答题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[5]/input',
            "名词解释": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[6]/input',
            "论述题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[7]/input',
            "计算题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[8]/input',
            "分录题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[9]/input',
            "资料题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[10]/input',
            "连线题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[11]/input',
            "排序题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[12]/input',
            "完型填空": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[13]/input',
            "阅读理解": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[14]/input',
            "程序题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[15]/input',
            "口语题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[16]/input',
            "听力题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[17]/input',
            "共用选项题": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[18]/input',
            "其它": '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[1]/label[19]/input',
        }

    def template_exam(self, url, paper_num):
        # Enter the paper title
        print('The title must be 4 to 40 characters long')
        while True:
            title = input('Enter the title:')
            # NOTE: the source is garbled here. The text between "len(title)" and
            # "= 0:" was lost in extraction (it appears to span into another
            # method); the surviving fragment is kept verbatim.
            if len(title) >= 4 and len(title) = 0:
                choice_list.append(num)
            else:
                break
        self.choice_subject(choice_list)
        sleep(1)
        print("Next, enter the total score and number of questions for each question type; please allocate the scores carefully\nThe paper's default total is 100\nIf the computed total is not 100, the scores are rescaled proportionally\nso that the total is 100")
        self.input_info(choice_list)
        n = input('Also save as a template?\n'
                  'Enter 1 for yes\n'
                  'Enter 0 for no\n'
                  'Your choice:')
        # Tick "also save as template"
        if n == '1':
            self.driver.find_element_by_xpath('//*[@id="savePaperTemplateCheck"]').click()
            sleep(2)
        # Click save
        self.driver.find_element_by_xpath('//*[@id="actionTab"]/a[1]').click()
        sleep(2)
        # Switch to the popup dialog
        alert = self.driver.switch_to.alert
        sleep(1)
        # Confirm the paper assembly
        alert.accept()
        sleep(2)

    # Enter the score and number of questions for each question type
    def input_info(self, choice_list):
        score_list = []
        for num in choice_list:
            score = input("Enter the total score for {}:".format(self.type_list[num]))
            score_list.append(int(score))
            self.driver.find_element_by_xpath(self.subject_xpath_dict[self.type_list[num]]['score']).send_keys(score)
            sleep(0.75)
            subject_num = input('Enter the number of questions for {}:'.format(self.type_list[num]))
            self.driver.find_element_by_xpath(self.subject_xpath_dict[self.type_list[num]]['subject_num']).send_keys(
                subject_num)
            sleep(0.75)
        Sum = sum(score_list)
        if Sum != 100:
            print('The paper total is not 100; adjusting')
            for num in range(len(score_list)):
                if num != len(score_list) - 1:
                    # Rescale proportionally
                    score_list[num] = int((score_list[num] / Sum) * 100)
                else:
                    # The last type takes whatever remains so the total is exactly 100
                    score_list[num] = 100 - sum(score_list[:-1])
        n = 0
        for num in choice_list:
            score = score_list[n]
            self.driver.find_element_by_xpath(self.subject_xpath_dict[self.type_list[num]]['score']).clear()
            sleep(0.25)
            self.driver.find_element_by_xpath(self.subject_xpath_dict[self.type_list[num]]['score']).send_keys(score)
            n += 1
            sleep(0.75)
        print('Adjustment complete')

    # Choose question types
    def choice_subject(self, choice_list):
        # Click "more question types"
        self.driver.find_element_by_xpath('//*[@id="newMore"]').click()
        for num in choice_list:
            # Tick each chosen question type
            self.driver.find_element_by_xpath(self.choice_dict[self.type_list[num]]).click()
        # Click OK
        self.driver.find_element_by_xpath(
            '//*[@id="setPaperStructure"]/div[1]/div/div[2]/div[2]/a[1]/span').click()

    # Count the questions in the question bank by type
    def statistical(self, html):
        html = etree.HTML(html)
        tr_list = html.xpath('//*[@id="tableId"]/tr')
        for tr in tr_list:
            key = tr.xpath('td[3]/text()')
            if key != []:
                key = key[0].strip()
                self.dict[key] += 1
            else:
                pass

    # Read the page count and question types from the page
    def get_pageNum_subject(self, html):
        html = etree.HTML(html)
        str_num = html.xpath('//*[@id="RightCon"]/div/div/div[4]/span[2]/text()')[0]
        page_num = ceil(int(str_num) / 20)
        # print(page_num)
        option_list = html.xpath('//*[@id="qTypeSelect"]/option')
        for i in range(1, len(option_list)):
            key = option_list[i].xpath('text()')[0].strip()
            self.dict[key] = 0
        return page_num

Scraper entry-point source:

import datetime
import os
import re
import sys
import time

import pandas as pd
from docx import Document
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT
from docx.oxml.ns import qn
from docx.shared import Pt, RGBColor, Cm
from lxml import etree

from exam import exam
from login import chaoxing_login


class Chaoxing_spider():
    def __init__(self):
        self.session = chaoxing_login().login()
        self.my_teach_headers = {
            'Host': 'mooc1-1.chaoxing.com',
            # 'Referer': 'http://mooc1-1.chaoxing.com/visit/interaction?s=e9059bca0eca12ef882b78f6a497cdc9',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
        }
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'Host': 'mooc1-1.chaoxing.com',
        }
        self.class_dic = {}
        self.statistics_info_dic = {}
        self.schoolid = '18638'  # your own school's id; it differs per school

    # Get class links and names and store them in a dictionary
    def get_class_url_name(self):
        params = {
            'isAjax': 'true'
        }
        url = 'http://mooc1-1.chaoxing.com/visit/courses/teach'
        self.session.headers = self.my_teach_headers
        response = self.session.get(url=url, params=params)
        # print(response.text)
        text = response.text
        # NOTE: the regex pattern is empty in the source as published (lost in extraction)
        href_name = re.findall('', text)
        for i in href_name:
            href, name = i
            href = 'https://mooc1-1.chaoxing.com' + href.replace("'", "").replace('"', '')
            name = name.replace("'", "").replace('"', '')
            self.class_dic[name] = href
        return ''

    # Get the grade statistics and their column headers
    def get_statistics_info(self):
        title_url = 'https://mooc1-1.chaoxing.com/moocAnalysis/analysisScore?classId={}&courseId={}&ut=t&cpi={}&openc={}'.format(
            self.classId, self.courseId, self.cpi, self.openc)
        r = self.session.get(url=title_url)
        text = r.text
        html = etree.HTML(text)
        th_list = html.xpath('//tr[@id="commonthead"]/th')
        title_list = []
        for th in th_list:
            i = th.xpath('span/text()')
            if len(i) == 1:
                title_list.append(i[0])
            else:
                title_list.append(i[0] + i[1])
        # Create one list per column header
        for i in title_list:
            self.statistics_info_dic[i] = []
        # print(self.statistics_info_dic)
        data = {
            'courseId': self.courseId,
            'classId': self.classId,
            'pageSize': '30',
            'sw': '',
            'pageNum': '1',
            'fid': '0',
            'sortType': '',
            'order': '',
            'test': '0',
            'isSimple': '0',
            'openc': self.openc,
        }
        self.session.headers = {
            'Host': 'mooc1-1.chaoxing.com',
            'Origin': 'https://mooc1-1.chaoxing.com',
            'Referer': self.statistics_url,
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'X-Requested-With': 'XMLHttpRequest',
        }
        url = 'https://mooc1-1.chaoxing.com/moocAnalysis/analysisScoreData'
        p = 1
        # Request page after page until an empty response comes back
        while True:
            data['pageNum'] = str(p)
            p += 1
            response = self.session.post(url=url, data=data)
            # print(response.text)
            text = response.text
            if text == '':
                break
            else:
                self.statistics_info_handle(text, title_list)

    def save_statistics_info(self):
        self.get_statistics_info()
        # Save to file
        df = pd.DataFrame(self.statistics_info_dic)
        print(df)
        file_path = r'data\statistics\{}.xlsx'.format(self.class_name + str(int(time.time() * 1000)))
        b = os.path.exists(file_path)
        if b == False:
            print('Saving Excel file for {}......'.format(self.class_name))
            df.to_excel(file_path)
            print('Excel file for {} saved.'.format(self.class_name))
        else:
            print('Path: {}\n'
                  'The Excel file already exists and cannot be saved\n'
                  'The program will exit'.format(file_path))
            sys.exit()

    # Extract each person's information according to the column headers
    def statistics_info_handle(self, text, title_list):
        # print(title_list)
        html = etree.HTML(text)
        # Grades extracted in this pass are collected in info_list
        info_list = []
        for i in range(1, len(title_list) + 1):
            if i == 1:
                info = html.xpath('//tr/td[1]/span/@title')
                # print(info)
            else:
                info = html.xpath('//tr/td[{}]/span/text()'.format(i))
            for j in range(len(info)):
                info[j] = info[j].replace('\t', '').replace('\n', '').replace('\r', '').strip()
            # print(info)
            info_list.append(info)
        # print(info_list)
        dic = dict(zip(title_list, info_list))
        for i in title_list:
            self.statistics_info_dic[i] = self.statistics_info_dic[i] + dic[i]
        # print(self.statistics_info_dic)

    # Template papers
    def template_exam(self):
        base_url = 'https://mooc1-1.chaoxing.com/exam/loadPaperTemplate'
        params = {
            'courseId': self.courseId,
            'start': '0',
            'examsystem': '0',
            'isCustomPaper': 'false',
            'cpi': self.cpi,
            'openc': self.openc,
            'qbanksystem': '0',
            'qbankbackurl': '',
        }
        response = self.session.get(url=base_url, params=params)
        text = response.text
        html = etree.HTML(text)
        tr_list = html.xpath('//tbody[@id="tableId"]/tr')
        name_url_list = []
        n = 0
        for tr in tr_list:
            name = tr.xpath('td[1]/text()')[0]
            url_info = tr.xpath('td[5]/a/@onclick')[0]
            template_url = 'https://mooc1-1.chaoxing.com' + re.findall('"(.*?)"', url_info)[0]
            name_url_list.append((name, template_url))
            print('Index:{} Name:{}'.format(n, name))
            n += 1
        print('Enter a negative number to exit!')
        while True:
            num = int(input("Enter the index of the template to use:"))
            # NOTE: the condition is garbled in the source (text between "num" and
            # "= 0:" was lost in extraction); it is kept verbatim.
            if num = 0:
                template_url = name_url_list[num][1]
                break
            elif num >= len(name_url_list):
                print('Invalid number!')
            else:
                sys.exit()
        paper_num = int(input("Enter the number of papers to assemble:"))
        exam().template_exam(template_url, paper_num)

    # Paper library
    def exam_library(self):
        url = 'https://mooc1-1.chaoxing.com/exam/reVerSionPaperList'
        params = {
            'courseId': self.courseId,
            'classId': self.classId,
            'ut': 't',
            'examsystem': '0',
            'cpi': self.cpi,
            'openc': self.openc,
        }
        response = self.session.get(url=url, params=params)
        text = response.text
        # NOTE: most of this regex pattern was lost in extraction; kept as published
        href_name_list = re.findall('(.*?)', text)
        for i in href_name_list:
            href, name = i
            url = '{}{}'.format('https://mooc1-1.chaoxing.com', href)
            content = self.session.get(url).content
            html = etree.HTML(content)
            url = 'https://mooc1-1.chaoxing.com' + html.xpath('//*[@id="RightCon"]/div[2]/ul/li/div[2]/p/a/@href')[0]
            self.download_paper(url, name)

    def download_paper(self, url, name):
        paper_dict = {}
        print('-' * 20)
        response = self.session.get(url)
        text = response.text
        # One entry per question section
        # NOTE: most of this regex pattern was lost in extraction; kept as published
        list = re.findall('(.*?)', text, re.S)
        for i in list:
            html = etree.HTML(i)
            # Section heading
            subject_title = html.xpath('//h2/text()|//h2/em/text()')
            # print(subject_title)
            # Turn the section heading into a tuple
            subject_list_tuple = tuple(subject_title)
            # The section heading is the first-level key
            paper_dict[subject_list_tuple] = {}
            # List of questions
            subject_list = html.xpath('//div[@class="TiMu"]/div[@name="certainTitle"]')
            for subject in subject_list:
                # Question text
                subject_detailed = subject.xpath(
                    'div[1]/i/text()|div[1]/div/text()|div[1]/div/p/text()|div[1]/div/img/@src')
                # print(subject_detailed)
                # Add the question text as the second-level key
                subject_detailed_tuple = tuple(subject_detailed)
                paper_dict[subject_list_tuple][subject_detailed_tuple] = {}
                # Options (stored under the key '选项')
                option_list = subject.xpath('ul/li')
                paper_dict[subject_list_tuple][subject_detailed_tuple]['选项'] = []
                if option_list != []:
                    for option in option_list:
                        option1 = option.xpath('i/text()')
                        option1_content_list = option.xpath('div/a/text()|div/a/img/@src|div/a/p/text()')
                        option1.extend(option1_content_list)
                        paper_dict[subject_list_tuple][subject_detailed_tuple]['选项'].append(option1)
                        # print(option1)
                # Answer (stored under the key '答案或分析', "answer or analysis")
                answer_list = subject.xpath(
                    'div[2]/div[1]/span/div/text()|div[2]/div[1]/span/div/img/@src|div[2]/span/text()|div[2]/div[1]/span/div/p/img/@src')
                if answer_list != []:
                    if len(answer_list) == 1:
                        # Strip the "正确答案：" ("Correct answer:") prefix
                        answer_list = answer_list[0].replace('正确答案：', '').strip()
                        # print('answer', answer_list)
                    else:
                        answer_list1 = []
                        for i in range(len(answer_list)):
                            m = answer_list[i].strip()
                            if m != '':
                                answer_list1.append(m)
                        answer_list = answer_list1
                        # print('answer', answer_list)
                    paper_dict[subject_list_tuple][subject_detailed_tuple]['答案或分析'] = answer_list
                else:
                    analysis_list = subject.xpath('div[3]/span/img/@src')
                    # print('analysis', analysis_list)
                    paper_dict[subject_list_tuple][subject_detailed_tuple]['答案或分析'] = analysis_list
        time_stamp = int(time.time())
        year = datetime.datetime.now().year
        month = datetime.datetime.now().month

        def paper():
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            }
            self.session.headers = headers
            # Exam paper document
            document = Document(r'core\template.docx')
            document.styles['Normal'].font.name = u'宋体'
            document.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
            document.styles['Normal'].font.color.rgb = RGBColor(0, 0, 0)
            # NOTE: the published source breaks off in the middle of the next statement.
            if month
