Text Preprocessing: Bag of Words (BOW) and TF-IDF

This post summarizes two techniques used in text preprocessing: the bag-of-words (BOW) model and TF-IDF.

1. Bag of Words (BOW)


"John likes to watch movies, Mary likes movies too" "John also likes to watch football games"


['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']

Their BOW word vectors, one count per vocabulary entry, are therefore:

[0 0 0 1 2 1 2 1 1 1]
[1 1 1 1 1 0 0 1 0 1]
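The counting step behind this representation can be sketched in plain Python. This is only an illustration, not how sklearn implements it: the helper names `tokenize` and `bow_vector` are made up for this example, and the regular expression is an assumption chosen to mimic the default tokenization of `CountVectorizer` (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = [
    "John likes to watch movies, Mary likes movies too",
    "John also likes to watch football games",
]


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    # (assumed to mirror CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", text.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})


def bow_vector(text):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [bow_vector(doc) for doc in corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
# ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
# [0, 0, 0, 1, 2, 1, 2, 1, 1, 1]
# [1, 1, 1, 1, 1, 0, 0, 1, 0, 1]
```

Word order is discarded entirely: each sentence is reduced to how often each vocabulary word occurs in it.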


from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "John likes to watch movies, Mary likes movies too",
    "John also likes to watch football games",
]

# Learn the vocabulary and count each word per sentence
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X.toarray())

# Output:
# ['also', 'football', 'games', 'john', 'likes', 'mary', 'movies', 'to', 'too', 'watch']
# [[0 0 0 1 2 1 2 1 1 1]
#  [1 1 1 1 1 0 0 1 0 1]]

2. TF-IDF (Term Frequency / Inverse Document Frequency)


"John likes to play football, Mary likes too"

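Under the BOW model, this sentence has the vocabulary ['football', 'john', 'likes', 'mary', 'play', 'to', 'too'] and the word vector [1 1 2 1 1 1 1]. If we picked the keyword of the sentence by raw count, BOW would give "likes", although the keyword should clearly be "football". TF-IDF addresses this problem. As the name suggests, it has two parts, TF and IDF. TF (term frequency) is defined as:

$$TF(w) = \frac{\text{number of times word } w \text{ appears in the document}}{\text{total number of words in the document}}$$

IDF (inverse document frequency) is defined as:

$$IDF(w) = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing word } w + 1}\right)$$

The 1 is added to the denominator so that it cannot be zero when no document contains $w$. TF-IDF is then the product of the two:

$$TF\text{-}IDF(w) = TF(w) \times IDF(w)$$

The larger a word's TF-IDF value, the more important the word is, and the more likely it is to be a keyword. For a worked example of keyword selection, see the article "TF-IDF与余弦相似性的应用(一):自动提取关键词" (Applications of TF-IDF and Cosine Similarity, Part 1: Automatic Keyword Extraction).

Now for practical use. sklearn ships a TF-IDF implementation and also provides an example: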

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(X)            # sparse view: (row, column) value
print(X.toarray())  # dense view: one row per document

"""
Output (the exact formatting of print(X) may differ between library versions):

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
  (0, 8)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 6)	0.38408524091481483
  (0, 2)	0.5802858236844359
  (0, 1)	0.46979138557992045
  (1, 8)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 6)	0.281088674033753
  (1, 1)	0.6876235979836938
  (1, 5)	0.5386476208856763
  (2, 8)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 6)	0.267103787642168
  (2, 0)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 4)	0.511848512707169
  (3, 8)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 6)	0.38408524091481483
  (3, 2)	0.5802858236844359
  (3, 1)	0.46979138557992045
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.         0.38408524]]
"""
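The values in the output above do not follow directly from the TF and IDF formulas given earlier in this section. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` uses the raw count as TF, computes IDF as ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below computes both variants in plain Python for comparison. The helper names `tokenize`, `textbook_tfidf`, and `sklearn_style_tfidf` are made up for this illustration, and the tokenizer regex is an assumption meant to mimic the sklearn default; it is a minimal sketch, not sklearn's actual implementation.

```python
import math
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror the sklearn default tokenization).
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [tokenize(text) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def textbook_tfidf(doc):
    # Formulas from this post: TF = count / document length,
    # IDF = log(N / (df + 1)).
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(n_docs / (doc_freq[word] + 1))
        for word, count in counts.items()
    }


def sklearn_style_tfidf(doc):
    # sklearn defaults: TF = raw count, IDF = ln((1 + N) / (1 + df)) + 1,
    # followed by L2 normalisation of the whole document vector.
    counts = Counter(doc)
    raw = {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}


textbook = textbook_tfidf(docs[0])
sklearn_like = sklearn_style_tfidf(docs[0])

for word in sorted(textbook):
    print(f"{word:10s} textbook={textbook[word]: .4f}  sklearn-style={sklearn_like[word]:.4f}")
# document   textbook= 0.0000  sklearn-style=0.4698
# first      textbook= 0.0575  sklearn-style=0.5803
# is         textbook=-0.0446  sklearn-style=0.3841
# the        textbook=-0.0446  sklearn-style=0.3841
# this       textbook=-0.0446  sklearn-style=0.3841
```

Both variants rank "first" as the most distinctive word of the first document, but the textbook formula pushes words that occur in every document below zero, while the smoothed sklearn-style weights stay positive and reproduce the first row of the matrix printed above.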




