Introduction to NLP (3): Lemmatization


2020-09-08 01:37 | Algorithms, NLP | About 999 words, estimated reading time 2 minutes

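Lemmatization is an important part of text preprocessing and is closely related to stemming. Put simply, lemmatization reduces an inflected word to its base form (its lemma), and the result is normally a word found in the dictionary. Stemming differs in that it strips affixes by rule, so the stem it produces is not necessarily a dictionary word. For example, lemmatizing "cars" gives "car", and lemmatizing "ate" gives "eat". In Python, the nltk module provides a robust lemmatization function built on WordNet, as in the following example: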

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()

    # lemmatize nouns
    print(wnl.lemmatize('cars', 'n'))
    print(wnl.lemmatize('men', 'n'))

    # lemmatize verbs
    print(wnl.lemmatize('running', 'v'))
    print(wnl.lemmatize('ate', 'v'))

    # lemmatize adjectives
    print(wnl.lemmatize('saddest', 'a'))
    print(wnl.lemmatize('fancier', 'a'))
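To make the contrast with stemming drawn earlier more concrete, here is a toy sketch that needs no NLTK data. It is not how NLTK or WordNet work internally: `TOY_LEMMAS`, `toy_stem` and `toy_lemmatize` are hypothetical names made up for this illustration, the lookup table stands in for a real dictionary, and "ponies" is an extra example word that does not appear in the article.

```python
# Toy illustration only: a hand-written lookup table stands in for a real
# dictionary such as WordNet, and the suffix stripper is deliberately crude.
TOY_LEMMAS = {"cars": "car", "ate": "eat", "ponies": "pony"}


def toy_stem(word):
    """Strip one common suffix without checking that the result is a word."""
    for suffix in ("ies", "es", "s", "ing", "ed"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the toy dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ("cars", "ponies", "ate"):
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
# cars stem: car lemma: car
# ponies stem: pon lemma: pony
# ate stem: ate lemma: eat
```

The rule-based stemmer produces "pon", which is not a dictionary word, and it cannot connect "ate" to "eat" at all. The dictionary lookup handles both, which is the behaviour the WordNet-based lemmatizer in the NLTK example provides at full scale.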

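In the NLTK code above, wnl.lemmatize() performs the lemmatization. The first argument is the word and the second is its part of speech ('n' for noun, 'v' for verb, 'a' for adjective, and so on); the return value is the lemma of the input word. When the second argument is omitted, nltk treats the word as a noun. Lemmatization itself is usually straightforward, but in practice it is important to pass the correct part of speech, otherwise the result can be poor, as in the following code: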

    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()

    # 'ate' is a verb and 'fancier' an adjective, so both POS arguments are wrong
    print(wnl.lemmatize('ate', 'n'))
    print(wnl.lemmatize('fancier', 'v'))

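With the wrong part of speech, both words come back unchanged, as 'ate' and 'fancier'.

So how do we find out a word's part of speech? This is the job of a part-of-speech (POS) tagger, such as the pos_tag function in nltk, which labels every token of a sentence. For the sentence "The brown fox is quick and he is jumping over the lazy dog", the tagged result looks like this:

    [('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]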

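The tags above are Penn Treebank tags. Those that appear in this output are: DT (determiner), JJ (adjective), NN (singular noun), VBZ (verb, third-person singular present), CC (coordinating conjunction), PRP (personal pronoun), VBG (verb, gerund or present participle) and IN (preposition or subordinating conjunction).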
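The tagged result is a plain list of (word, tag) tuples, so it can be processed with ordinary Python. A minimal sketch, with the tagged output above copied in as data so it runs without NLTK; `words_with_prefix` is a hypothetical helper name chosen for this illustration:

```python
# Tagged output copied from the example above: a list of (word, tag) tuples.
tagged_sent = [
    ('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
    ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
    ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'),
    ('lazy', 'JJ'), ('dog', 'NN'),
]


def words_with_prefix(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_prefix(tagged_sent, 'NN')
verbs = words_with_prefix(tagged_sent, 'VB')
adjectives = words_with_prefix(tagged_sent, 'JJ')

print(nouns)       # ['fox', 'dog']
print(verbs)       # ['is', 'is', 'jumping']
print(adjectives)  # ['brown', 'quick', 'lazy']
```

Matching on a tag prefix works because related Penn Treebank tags share their leading letters: all verb tags begin with VB, all noun tags with NN, and all adjective tags with JJ.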

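Now that we know how to get the part of speech of each word in a sentence, we can combine it with the lemmatizer to lemmatize a whole sentence properly. Example Python code: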

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer


    # Map a Penn Treebank tag to the corresponding WordNet part of speech
    def get_wordnet_pos(tag):
        if tag.startswith('J'):
            return wordnet.ADJ
        elif tag.startswith('V'):
            return wordnet.VERB
        elif tag.startswith('N'):
            return wordnet.NOUN
        elif tag.startswith('R'):
            return wordnet.ADV
        else:
            return None


    sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
    tokens = word_tokenize(sentence)  # tokenize the sentence
    tagged_sent = pos_tag(tokens)     # tag each token with its part of speech

    wnl = WordNetLemmatizer()
    lemmas_sent = []
    for word, tag in tagged_sent:
        # Fall back to noun when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        lemmas_sent.append(wnl.lemmatize(word, pos=wordnet_pos))  # lemmatize

    print(lemmas_sent)

The output is as follows:

    ['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']

This list is the result of lemmatizing every word in the sentence.
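The tag-mapping step can also be tried in isolation. The sketch below is an alternative, table-driven way to write the same mapping as get_wordnet_pos, not the article's own code. It assumes only that WordNet's part-of-speech constants are the single letters 'a', 'v', 'n' and 'r' (the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV in NLTK), which are written out as plain strings here so that the block runs without NLTK. `PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical names.

```python
# WordNet's single-letter POS codes as plain strings, so this sketch runs
# without NLTK. In NLTK they are wordnet.ADJ, wordnet.VERB, wordnet.NOUN
# and wordnet.ADV.
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# First letter of a Penn Treebank tag -> WordNet part of speech.
PREFIX_TO_WORDNET = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}


def treebank_to_wordnet(tag, default=NOUN):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ('VBG', 'JJ', 'NNS', 'RB', 'DT'):
    print(tag, '->', treebank_to_wordnet(tag))
# VBG -> v
# JJ -> a
# NNS -> n
# RB -> r
# DT -> n
```

Folding the noun fallback into the function removes the need for the `or wordnet.NOUN` expression at the call site. Tags with no WordNet counterpart, such as DT, are simply treated as nouns, which in practice leaves most function words unchanged.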

Author: glin

Last updated: 2020-09-08 01:37

Tags: NLP, Basics
