Python batch regex matching (multi-text × multi-rule)


2023-12-15 20:17

What is multi-rule regex matching

Multi-rule regex matching is a rule-based semantic extraction technique. It is simple to implement, and in specific, well-scoped scenarios it works well enough.

Here is what multi-rule regex matching produces:

This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption.

Sentence type: contribution statement (This paper)
Scenario: computation offloading (Computation Offloading)
Method: reinforcement learning (Reinforcement Learning)
Optimization targets: latency (latency), energy consumption (energy)

Because a single meaning can be expressed in many different ways, each semantic item needs its own regular expression. For example:

Contribution: this (paper|article|work|study|manuscript)
Task offloading: (task|comput[a-z]*( resource[a-z]*)?|application|job|service)[\- ]offloading|offload([a-z]* ){,4}(task|job)
Reinforcement learning: reinforcement learning|deep Q[\– ]network|actor[\– ]critic
Latency: delay|completion time|latency
Energy consumption: energy|power consumption

How to implement multi-rule matching

To both match the keywords and know which rule produced each match, there are two approaches: (1) write a loop and apply the rules one at a time; (2) use the regex engine's built-in named groups, (?P<name>…), to merge all rules into a single pattern and match in one pass.

# Method 1: apply the rules one at a time
import re

text = "This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption."
rules = [
    ['本文贡献', r'this (paper|article|work|study|manuscript)'],
    ['任务卸载', r'(task|comput[a-z]*( resource[a-z]*)?|application|job|service)[\- ]offloading|offload([a-z]* ){,4}(task|job)'],
    ['强化学习', r'reinforcement learning|deep Q[\– ]network|actor[\– ]critic'],
    ['时延', r'delay|completion time|latency'],
    ['能耗', r'energy|power consumption'],
]
for sub_key, pattern in rules:
    for ret in re.finditer(pattern, text, re.IGNORECASE):
        sub_val = text[ret.start():ret.end()]
        print(f"{sub_key}: {sub_val}")

# Method 2: merge all rules into one pattern with named groups
import re

text = "This paper proposes Computation Offloading using Reinforcement Learning (CORL) scheme to minimize latency and energy consumption."
pattern = (
    r'(?P<本文贡献>this (paper|article|work|study|manuscript))'
    r'|(?P<任务卸载>(task|comput[a-z]*( resource[a-z]*)?|application|job|service)[\- ]offloading|offload([a-z]* ){,4}(task|job))'
    r'|(?P<强化学习>reinforcement learning|deep Q[\– ]network|actor[\– ]critic)'
    r'|(?P<时延>delay|completion time|latency)'
    r'|(?P<能耗>energy|power consumption)'
)
for ret in re.finditer(pattern, text, re.IGNORECASE):
    sub_key = ret.lastgroup  # name of the rule's group that matched
    sub_val = text[ret.start():ret.end()]
    print(f"{sub_key}: {sub_val}")

Multi-text × multi-rule regex matching

The above handles one sentence against n rules. When there are also n sentences, there are again two options: (1) write a loop and match each sentence separately; (2) join all sentences into one long text, match it in one pass, and use each match's position to work out which sentence it came from.
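For option (2), the sketch below shows how cumulative offsets plus a binary search map a match position back to its source sentence. Names such as pins and sentence_index are illustrative; note that bisect_right is the correct direction here, since an offset that equals a boundary value belongs to the next sentence, not the previous one.

```python
import bisect
from itertools import accumulate

# Three sentences merged with "\n"; pins holds the cumulative end offset
# of each sentence (the +1 accounts for the "\n" separator).
txts = ["first sentence", "second one", "third"]
merged = "\n".join(txts)
pins = list(accumulate(len(t) + 1 for t in txts))

def sentence_index(offset):
    # bisect_right: an offset equal to a boundary belongs to the next sentence
    return bisect.bisect_right(pins, offset)

print(sentence_index(merged.find("second")))  # 1
print(sentence_index(merged.find("third")))   # 2
```

With bisect_left instead, a match starting at the very first character of a sentence would be attributed to the previous sentence.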

Below is a simple experiment to measure, for n sentences and n rules, which is faster: matching item by item, or matching everything merged.

Prepare the data:

data = {
    "txts": [
        "A Cost-Driven Fuzzy Scheduling Strategy for Intelligent Workflow ...",
        "Scheduling in edge-cloud environments can address ...",
        ...
    ],
    "rules": [
        "本文|this (paper|article|work|study|manuscript)|we ",
        "基于|based|driven",
        ...
    ]
}

(The part before the first "|" in each rule is the rule's name.)
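Under that rule format, a combined named-group pattern can be built roughly as follows (a sketch; the rule name doubles as a Chinese keyword alternative inside the pattern, and spaces are replaced with underscores because group names must be valid identifiers):

```python
import re

rules = [
    "本文|this (paper|article|work|study|manuscript)|we ",
    "基于|based|driven",
]
# Wrap each whole rule (name included) in a named group.
pattern = re.compile(
    "|".join(f"(?P<{r.split('|')[0].replace(' ', '_')}>{r})" for r in rules),
    re.IGNORECASE,
)
m = pattern.search("This work is based on prior studies.")
print(m.lastgroup, m.group())  # 本文 This work
```

lastgroup reports the named group because the outer (named) group is the last one to close, even when unnamed groups nested inside it also matched.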

Test program:

'''
@File        :test.py
@Description :regex matching benchmark
@Date        :2023/08/22 17:21:18
@Author      :pro1515151515
@Version     :1.0
'''
import json
import bisect  # binary search from the Python standard library
import re
import time
from itertools import accumulate

def 合并文本_合并规则(txts, rules):  # merged texts × merged rules
    t0 = time.time()
    output = set()
    # One named group per rule. The flag must go into compile(): the extra
    # arguments of pattern.finditer() are positions, not flags.
    pattern = re.compile('|'.join(f"(?P<{l.split('|')[0].replace(' ', '_')}>{l})" for l in rules), re.IGNORECASE)
    text = '\n'.join(txts)
    pins = list(accumulate(len(txt) + 1 for txt in txts))  # cumulative end offsets
    for ret in pattern.finditer(text):
        # bisect_right, so a match starting exactly on a boundary maps to the next line
        tid = bisect.bisect_right(pins, ret.start())
        sub_key = ret.lastgroup
        sub_val = text[ret.start():ret.end()]
        sub_string = txts[tid]
        output.add(f"{sub_key}|{sub_val}|{sub_string}")
    t1 = time.time()
    print(f"[合并文本_合并规则] matches: {len(output)}, time: {t1 - t0}s")
    return output

def 逐条文本_合并规则(txts, rules):  # one text at a time × merged rules
    t0 = time.time()
    output = set()
    pattern = re.compile('|'.join(f"(?P<{l.split('|')[0].replace(' ', '_')}>{l})" for l in rules), re.IGNORECASE)
    for text in txts:
        for ret in pattern.finditer(text):
            sub_key = ret.lastgroup
            sub_val = text[ret.start():ret.end()]
            sub_string = text
            output.add(f"{sub_key}|{sub_val}|{sub_string}")
    t1 = time.time()
    print(f"[逐条文本_合并规则] matches: {len(output)}, time: {t1 - t0}s")
    return output

def 合并文本_逐条规则(txts, rules):  # merged texts × one rule at a time
    t0 = time.time()
    output = set()
    patterns = {}
    for rule in rules:
        patterns[rule.split('|')[0].replace(' ', '_')] = re.compile(f"({rule})", re.IGNORECASE)
    text = '\n'.join(txts)
    pins = list(accumulate(len(txt) + 1 for txt in txts))
    for sub_key, pattern in patterns.items():
        for ret in pattern.finditer(text):
            tid = bisect.bisect_right(pins, ret.start())
            sub_val = text[ret.start():ret.end()]
            sub_string = txts[tid]
            output.add(f"{sub_key}|{sub_val}|{sub_string}")
    t1 = time.time()
    print(f"[合并文本_逐条规则] matches: {len(output)}, time: {t1 - t0}s")
    return output

def 逐条文本_逐条规则(txts, rules):  # one text at a time × one rule at a time
    t0 = time.time()
    output = set()
    patterns = {}
    for rule in rules:
        patterns[rule.split('|')[0].replace(' ', '_')] = re.compile(f"({rule})", re.IGNORECASE)
    for text in txts:
        for sub_key, pattern in patterns.items():
            for ret in pattern.finditer(text):
                sub_val = text[ret.start():ret.end()]
                sub_string = text
                output.add(f"{sub_key}|{sub_val}|{sub_string}")
    t1 = time.time()
    print(f"[逐条文本_逐条规则] matches: {len(output)}, time: {t1 - t0}s")
    return output

def main():
    with open('data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    txts = data['txts']
    rules = data['rules']
    print(f"{len(txts)} lines of text, {len(rules)} rules")
    ans1 = 合并文本_合并规则(txts, rules)
    ans2 = 逐条文本_合并规则(txts, rules)
    ans3 = 合并文本_逐条规则(txts, rules)
    ans4 = 逐条文本_逐条规则(txts, rules)
    print("done")

if __name__ == '__main__':
    main()

Test results:

5299 lines of text, 99 rules
[合并文本_合并规则] matches: 4175, time: 6.736292600631714s
[逐条文本_合并规则] matches: 4163, time: 6.20894980430603s
[合并文本_逐条规则] matches: 4182, time: 0.6981680393218994s
[逐条文本_逐条规则] matches: 4170, time: 0.9920296669006348s
done

Conclusion: for many-to-many matching of n sentences against n rules, the recommendation is to merge the texts into one string but apply the rules one at a time (the fastest variant above). Also note that the results may contain duplicates, so deduplicate them.
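The recommended combination (merged texts, one rule at a time, deduplicated with a set) can be sketched as a small helper; the function name match_all and the tuple return shape are illustrative, not from the original post:

```python
import bisect
import re
from itertools import accumulate

def match_all(txts, rules):
    """Merged texts × per-rule patterns, deduplicated with a set."""
    text = "\n".join(txts)
    pins = list(accumulate(len(t) + 1 for t in txts))  # cumulative end offsets
    output = set()
    for rule in rules:
        name = rule.split("|")[0].replace(" ", "_")
        pattern = re.compile(f"({rule})", re.IGNORECASE)
        for m in pattern.finditer(text):
            tid = bisect.bisect_right(pins, m.start())  # which text the match is in
            output.add((name, m.group(), txts[tid]))
    return output

hits = match_all(
    ["This paper proposes a scheme.", "We minimize latency."],
    ["本文|this (paper|article|work|study|manuscript)|we ", "时延|delay|latency"],
)
for name, val, line in sorted(hits):
    print(f"{name}|{val}|{line}")
```

Returning tuples instead of "|"-joined strings keeps the fields separable even when a matched value itself contains a "|".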

Attachment: Python批量正则匹配(测试数据+代码).zip

This article was first published on my blog: https://www.proup.club/index.php/archives/762/ When reposting, please credit this page's URL and the original author: pro1515151515


