Text Classification in Practice


In the previous article we covered some data-processing methods. In this one we analyse the data: understanding it better lays the groundwork for the later steps, and the techniques collected here should carry over to similar tasks. Without further ado, let's get started.

The coverage gap between the pretrained embeddings and the vocab

We use pretrained embeddings to map the vocabulary (vocab) built from the training and test sets to vectors. The problem is that the pretrained embeddings do not fully cover the vocab, so uncovered words can only be represented by a random vector or an unknown vector, which hurts the final task performance. There are several main causes. The first is rare words that simply do not appear in the pretrained embeddings, though this case is relatively uncommon. The second is casing and contractions: the pretrained embeddings are case-sensitive, so case mismatches alone cause misses. The third, and the biggest, is misspellings: the questions were typed by users, so typos are inevitable, and misspelled words certainly do not appear in the pretrained embeddings. This last case is also the hardest to handle, because the competition forbids external datasets, so we cannot use an outside spell-checking library to correct the typos.
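To make the fallback for uncovered words concrete, here is a minimal sketch of building an embedding matrix in which out-of-vocabulary words get a random vector. The function name and the random-fallback strategy are illustrative assumptions, not code from the original notebook; `vocab` and `embeddings_index` follow the naming used later in this article.

import numpy as np

# Minimal sketch (assumption, not from the original notebook): build an embedding
# matrix where words missing from the pretrained embeddings fall back to a random vector.
def build_embedding_matrix(vocab, embeddings_index, embed_dim=300, seed=42):
    rng = np.random.default_rng(seed)
    word_index = {word: i for i, word in enumerate(vocab, start=1)}  # index 0 reserved for padding
    matrix = np.zeros((len(word_index) + 1, embed_dim), dtype='float32')
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = vector                                  # covered by the embeddings
        else:
            matrix[idx] = rng.normal(scale=0.1, size=embed_dim)   # OOV: random fallback
    return matrix

The better the embeddings cover the vocab, the fewer rows come from the random fallback.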

Now, let's see how to analyse the data.

1. Building the vocabulary (vocab)

The vocab dictionary maps each word to the number of times it occurs:

import pandas as pd
from tqdm import tqdm

tqdm.pandas()  # enables .progress_apply on pandas objects

train = pd.read_csv("../input/train.csv")   # Train shape = (1306122, 3)
test = pd.read_csv("../input/test.csv")     # Test shape = (56370, 2)
df = pd.concat([train, test])               # shape = (1362492, 2)

def build_vocab(sentences, verbose=True):
    """
    :param sentences: list of lists of words (the train + test questions)
    :return: dictionary mapping each word to its count over all texts
    """
    vocab = {}
    for sentence in tqdm(sentences, disable=(not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

sentences = df['question_text'].progress_apply(lambda x: x.split()).values
vocab = build_vocab(sentences)  # vocab_size = 508823

Let's see what the resulting vocab dictionary looks like by printing its first five entries (note these are simply the first five distinct words encountered, not the five most frequent):

{'How': 261930, 'did': 33489, 'Quebec': 97, 'nationalists': 91, 'see': 9003}

2. Loading the pretrained embeddings

To get better results, we load and use four different pretrained embeddings:

import numpy as np
import operator
from gensim.models import KeyedVectors

google    = '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin'
glove     = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
paragram  = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
wiki_news = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'

def load_embed(file):
    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')

    if file == '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o) > 100)
    elif file == '../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin':
        model = KeyedVectors.load_word2vec_format(file, binary=True)
        embeddings_index = {}
        for word, vector in zip(model.vocab, model.vectors):
            embeddings_index[word] = vector
    else:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file, encoding='latin'))
    return embeddings_index

3. Checking how well the pretrained embeddings cover the vocab

def check_coverage(vocab, embeddings_index):
    known_words = {}      # words present in both
    unknown_words = {}    # words the embeddings do not cover
    nb_known_words = 0    # the corresponding counts
    nb_unknown_words = 0
    for word in tqdm(vocab):
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]
    # percentage of vocab covered
    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    # percentage of all text covered; differs from the previous number because words repeat in the text
    print('Found embeddings for {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]
    print("unknown words : ", unknown_words[:30])
    return unknown_words

oov_google   = check_coverage(vocab, embed_google)
oov_glove    = check_coverage(vocab, embed_glove)
oov_paragram = check_coverage(vocab, embed_paragram)
oov_fasttext = check_coverage(vocab, embed_fasttext)
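One gap in the listing: the calls that actually populate embed_google, embed_glove, embed_paragram and embed_fasttext are never shown. Presumably the four files were loaded with load_embed along these lines (the variable names are inferred from the coverage checks above):

# Assumed, not shown in the original: load the four embedding files.
embed_google   = load_embed(google)
embed_glove    = load_embed(glove)
embed_paragram = load_embed(paragram)
embed_fasttext = load_embed(wiki_news)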

Now let's look at the results, printing the 30 most frequent unknown_words for each embedding to see what is going on:

Google :
Found embeddings for 24.05% of vocab
Found embeddings for 78.75% of all text
unknown words : [('to', 420476), ('a', 419837), ('of', 345145), ('and', 262815), ('India?', 17082), ('it?', 13436), ('do?', 9112), ('life?', 8074), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ('world?', 5632), ('people?', 5191), ('why?', 5144), ('Quora?', 4872), ('10', 4783), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ('2017', 3300), ('not?', 3246), ('year?', 2913)]

Glove :
Found embeddings for 32.77% of vocab
Found embeddings for 88.15% of all text
unknown words : [('India?', 17082), ('it?', 13436), ("What's", 12985), ('do?', 9112), ('life?', 8074), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ('world?', 5632), ('people?', 5191), ('why?', 5144), ('Quora?', 4872), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ('not?', 3246), ('year?', 2913), ('day?', 2834), ('engineering?', 2743), ('person?', 2728), ('school?', 2688), ('so,', 2679)]

Paragram :
Found embeddings for 19.37% of vocab
Found embeddings for 72.21% of all text
unknown words : [('What', 436013), ('I', 319441), ('How', 273144), ('Why', 148582), ('Is', 113627), ('Can', 54992), ('Which', 49357), ('Do', 41756), ('If', 35896), ('Are', 30442), ('Does', 24142), ('Who', 22884), ('Where', 20008), ('Should', 17269), ('India?', 17082), ('Will', 15283), ('When', 15084), ('India', 14270), ('Indian', 13441), ('it?', 13436), ("I'm", 13344), ("What's", 12985), ('Trump', 10569), ('Quora', 10447), ('In', 10441), ('Would', 10307), ('US', 9832), ('do?', 9112), ('My', 8463), ('The', 8215)]

FastText :
Found embeddings for 29.77% of vocab
Found embeddings for 87.66% of all text
unknown words : [('India?', 17082), ("don't", 15642), ('it?', 13436), ("I'm", 13344), ("What's", 12985), ('do?', 9112), ('life?', 8074), ("can't", 7375), ('you?', 6553), ('me?', 6485), ('them?', 6421), ('time?', 5994), ("doesn't", 5970), ('world?', 5632), ('people?', 5191), ('why?', 5144), ("it's", 5019), ('Quora?', 4872), ('like?', 4677), ('for?', 4631), ('work?', 4392), ('2017?', 4227), ('mean?', 4137), ('2018?', 3746), ('country?', 3578), ('now?', 3496), ('this?', 3464), ('years?', 3387), ("didn't", 3329), ('not?', 3246)]

Several problems stand out. 1. Casing: the paragram embeddings in particular seem unable to recognise anything containing an uppercase letter. 2. Contractions, e.g. ("don't", 15642) and ("it's", 5019); these can be handled by lookup, writing a dictionary by hand that expands such words. 3. Punctuation, e.g. the question mark; here I am not sure whether the embeddings lack a vector for '?' itself or whether the problem is that the question mark is glued to the preceding word. We now deal with these problems one by one.
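A rough diagnostic (my own addition, not from the original notebook) gives a sense of how much each issue contributes, using the Glove OOV list returned by check_coverage; the categories overlap and the checks are ad hoc:

# oov_glove is the sorted list of (word, count) pairs returned by check_coverage above.
n_upper = sum(count for word, count in oov_glove if any(ch.isupper() for ch in word))   # casing
n_apos  = sum(count for word, count in oov_glove if "'" in word or "’" in word)         # contractions
n_punct = sum(count for word, count in oov_glove if word and word[-1] in "?.,!")        # trailing punctuation
print(n_upper, n_apos, n_punct)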

4. Lower-casing the vocab

df['lowered_question'] = df['question_text'].apply(lambda x: x.lower())
# split into words so that build_vocab counts words rather than characters
vocab_low = build_vocab(df['lowered_question'].apply(lambda x: x.split()))
oov_google   = check_coverage(vocab_low, embed_google)
oov_glove    = check_coverage(vocab_low, embed_glove)
oov_paragram = check_coverage(vocab_low, embed_paragram)
oov_fasttext = check_coverage(vocab_low, embed_fasttext)

Let's see how this works out (for each embedding, the first value is before lower-casing, the second after):

Google:   vocab: 24.05% → 15.24%   all text: 78.75% → 77.69%
Glove:    vocab: 32.77% → 27.10%   all text: 88.15% → 87.88%
Paragram: vocab: 19.37% → 31.01%   all text: 72.21% → 88.21%
FastText: vocab: 29.77% → 21.74%   all text: 87.66% → 87.14%

Except for paragram, which improves, the other three embeddings actually get worse. So one idea is to lower-case the vocab only for paragram and leave the other three untouched; another is to add the missing lower-cased forms into the embeddings themselves. Let's first see how well the second approach works.

5. Adding lower-cased words to the embeddings

def add_lower(embedding, vocabulary):
    count = 0
    for word in tqdm(vocabulary):
        if word in embedding and word.lower() not in embedding:
            embedding[word.lower()] = embedding[word]
            count += 1
    print(f"Added {count} words to embedding")

add_lower(embed_google, vocab)
add_lower(embed_glove, vocab)
add_lower(embed_paragram, vocab)
add_lower(embed_fasttext, vocab)

Let's see how many words were added:

Added 30276 words to embedding   # Google
Added 15199 words to embedding   # Glove
Added 0 words to embedding       # Paragram
Added 27908 words to embedding   # FastText

Then let's check the effect (the three values compare the results of the three steps so far, in order):
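The article jumps straight to the numbers here; presumably the coverage check was simply re-run on the lower-cased vocab against the now-augmented embeddings, along these lines:

# Assumed, not shown in the original: re-check the lower-cased vocab after add_lower.
oov_google   = check_coverage(vocab_low, embed_google)
oov_glove    = check_coverage(vocab_low, embed_glove)
oov_paragram = check_coverage(vocab_low, embed_paragram)
oov_fasttext = check_coverage(vocab_low, embed_fasttext)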

Google:   vocab: 24.05% → 15.24% → 21.79%   all text: 78.75% → 77.69% → 87.22%
Glove:    vocab: 32.77% → 27.10% → 30.39%   all text: 88.15% → 87.88% → 88.19%
Paragram: vocab: 19.37% → 31.01% → 31.01%   all text: 72.21% → 88.21% → 88.21%
FastText: vocab: 29.77% → 21.74% → 27.77%   all text: 87.66% → 87.14% → 87.73%

Analysing the results: paragram is unchanged, since no new words were added to it. For the other three embeddings, vocab coverage drops slightly compared with the best of the first two steps, while coverage of all text improves, especially for Google.

Next, we expand the contracted words into their full forms.

6. Expanding contractions

The conversion dictionary:

contraction_mapping = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot", "'cause": "because", "could've": "could have",
    "couldn't": "could not", "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not",
    "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is", "I'd": "I would",
    "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have", "I'm": "I am", "I've": "I have",
    "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have", "i'm": "i am",
    "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
    "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not",
    "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have",
    "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
    "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
    "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have",
    "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have",
    "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have", "so's": "so as",
    "this's": "this is", "that'd": "that would", "that'd've": "that would have", "that's": "that is",
    "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is",
    "they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have",
    "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would",
    "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
    "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have",
    "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is",
    "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have",
    "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
    "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not",
    "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
    "y'all": "you all", "y'all'd": "you all would", "y'all'd've": "you all would have", "y'all're": "you all are",
    "y'all've": "you all have", "you'd": "you would", "you'd've": "you would have", "you'll": "you will",
    "you'll've": "you will have", "you're": "you are", "you've": "you have"
}

Let's see which of these contractions already have vectors in each embedding:

def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

print("- Known Contractions -")
print("   Google :")
print(known_contractions(embed_google))
print("   Glove :")
print(known_contractions(embed_glove))
print("   Paragram :")
print(known_contractions(embed_paragram))
print("   FastText :")
print(known_contractions(embed_fasttext))

Results:

- Known Contractions -
Google :
["ain't", "aren't", "can't", "could've", "couldn't", "didn't", "doesn't", "don't", "hadn't", "hasn't", "haven't", "he'd", "he'll", "he's", "how'd", "how's", "I'd", "I'd've", "I'll", "I'm", "I've", "i'd", "i'll", "i'm", "i've", "isn't", "it'd", "it'll", "it's", "let's", "ma'am", "must've", "o'clock", "oughtn't", "she'd", "she'll", "she's", "should've", "shouldn't", "that's", "there's", "here's", "they'd", "they'll", "they're", "they've", "wasn't", "we'd", "we'll", "we're", "we've", "weren't", "what're", "what's", "what've", "where'd", "where's", "who'll", "who's", "who've", "won't", "would've", "wouldn't", "wouldn't've", "y'all", "you'd", "you'll", "you're", "you've"]
Glove :
["can't", "'cause", "didn't", "doesn't", "don't", "I'd", "I'll", "I'm", "I've", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
Paragram :
["can't", "'cause", "didn't", "doesn't", "don't", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
FastText :
[]

One more refinement is needed: the apostrophe characters used for contractions in the data vary, while the conversion dictionary above consistently uses the straight quote "'". So we first normalise the different apostrophe characters in the data to that one.

def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

df['treated_question'] = df['lowered_question'].apply(lambda x: clean_contractions(x, contraction_mapping))
# split into words again so that build_vocab counts words rather than characters
vocab = build_vocab(df['treated_question'].apply(lambda x: x.split()))

Now let's check the coverage again (the four values per metric are the results of the four steps so far, in order):

Google:   vocab: 24.05% → 15.24% → 21.79% → 21.88%   all text: 78.75% → 77.69% → 87.22% → 87.39%
Glove:    vocab: 32.77% → 27.10% → 30.39% → 30.53%   all text: 88.15% → 87.88% → 88.19% → 88.56%
Paragram: vocab: 19.37% → 31.01% → 31.01% → 31.16%   all text: 72.21% → 88.21% → 88.21% → 88.58%
FastText: vocab: 29.77% → 21.74% → 27.77% → 27.91%   all text: 87.66% → 87.14% → 87.73% → 88.44%

This time we get better results across the board.

7. Handling punctuation

In this step we deal with punctuation. First, let's see which punctuation marks each pretrained embedding cannot recognise:

punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'

def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

print(unknown_punct(embed_google, punct))
print(unknown_punct(embed_glove, punct))
print(unknown_punct(embed_paragram, punct))
print(unknown_punct(embed_fasttext, punct))

The results look like this:

Google :   / - ' ? ! . , ' ( ) - / : ; < [ \ ] { | } " " “ ” ’ − ∅ ‘ ₹ ´ \ — –
Glove :    “ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — –
Paragram : “ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — –
FastText : _ `

Then we replace some of the rarer punctuation marks and symbols with more common equivalents, and check the effect again:

punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x",
                 "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"',
                 "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a',
                 '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi'}

def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])
    for p in punct:
        text = text.replace(p, f' {p} ')
    # other special characters that have to be dealt with at the end
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}
    for s in specials:
        text = text.replace(s, specials[s])
    return text

df['treated_question'] = df['treated_question'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))
# rebuild the vocab from the treated questions (split into words first)
vocab = build_vocab(df['treated_question'].apply(lambda x: x.split()))
oov_google   = check_coverage(vocab, embed_google)
oov_glove    = check_coverage(vocab, embed_glove)
oov_paragram = check_coverage(vocab, embed_paragram)
oov_fasttext = check_coverage(vocab, embed_fasttext)

The results for each embedding:

Google:
Found embeddings for 53.59% of vocab
Found embeddings for 87.36% of all text
unknown words : [('?', 1440789), (',', 244864), ('.', 139697), ('"', 84574), ("'", 81400), ('-', 71911), ('(', 58958), (')', 58944), ('/', 44071), ('2017', 9254), (':', 9048), ('10', 7795), ('2018', 7733), ('12', 4029), ('\\', 3695), ('{', 3320), ('}', 3298), ('100', 3260), ('20', 3169), (']', 2983), ('[', 2976), ('15', 2791), ('12th', 2679), ('11', 2650), ('30', 2387), ('!', 2346), ('50', 2321), ('18', 2268), ('000', 2177), ('...', 2011)]

Glove:
Found embeddings for 69.10% of vocab
Found embeddings for 99.58% of all text
unknown words : [('quorans', 885), ('brexit', 542), ('cryptocurrencies', 525), ('redmi', 398), ('coinbase', 150), ('oneplus', 144), ('uceed', 126), ('demonetisation', 118), ('bhakts', 118), ('upwork', 117), ('pokémon', 117), ('machedo', 112), ('gdpr', 110), ('adityanath', 108), ('bnbr', 105), ('boruto', 105), ('alshamsi', 100), ('dceu', 94), ('iiest', 91), ('litecoin', 90), ('unacademy', 89), ('sjws', 89), ('zerodha', 85), ('qoura', 85), ('tensorflow', 82), ('fiancé', 76), ('lnmiit', 73), ('kavalireddi', 71), ('doklam', 70), ('muoet', 68)]

Paragram:
Found embeddings for 73.58% of vocab
Found embeddings for 99.63% of all text
unknown words : [('quorans', 885), ('brexit', 542), ('cryptocurrencies', 525), ('redmi', 398), ('coinbase', 150), ('oneplus', 144), ('uceed', 126), ('demonetisation', 118), ('bhakts', 118), ('upwork', 117), ('pokémon', 117), ('machedo', 112), ('gdpr', 110), ('adityanath', 108), ('bnbr', 105), ('boruto', 105), ('alshamsi', 100), ('dceu', 94), ('iiest', 91), ('litecoin', 90), ('unacademy', 89), ('sjws', 89), ('zerodha', 85), ('qoura', 85), ('tensorflow', 82), ('fiancé', 76), ('lnmiit', 73), ('kavalireddi', 71), ('doklam', 70), ('muoet', 68)]

FastText:
Found embeddings for 60.75% of vocab
Found embeddings for 99.45% of all text
unknown words : [('quorans', 885), ('bitsat', 583), ('kvpy', 369), ('comedk', 369), ('quoran', 325), ('wbjee', 246), ('articleship', 218), ('viteee', 193), ('fortnite', 166), ('upes', 164), ('marksheet', 151), ('afcat', 131), ('uceed', 126), ('dropshipping', 123), ('bhakts', 118), ('iitjee', 114), ('machedo', 112), ('upsee', 111), ('bnbr', 105), ('alshamsi', 100), ('chsl', 100), ('iitian', 99), ('amcat', 97), ('josaa', 96), ('unacademy', 89), ('zerodha', 85), ('qoura', 85), ('nmat', 80), ('icos', 79), ('jiit', 78)]

Analysing the results: the Google embeddings have no vectors for punctuation, numbers, or common prepositions, while for the other three the remaining uncovered words are mostly misspellings and rare proper nouns. Those three now cover about 99% of the text, whereas Google covers only about 87%; no wonder most of the public kernels I have looked at use the latter three embeddings.
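To see the whole pipeline at a glance, here is a small usage example (the sample question is made up) chaining the three cleaning steps in the order used above:

# Hypothetical example question, run through the full preprocessing chain.
sample = "What's the best way to learn TensorFlow in 2018? I don't know…"
cleaned = sample.lower()
cleaned = clean_contractions(cleaned, contraction_mapping)
cleaned = clean_special_chars(cleaned, punct, punct_mapping)
print(cleaned)
# contractions expanded, punctuation separated from words, '…' normalised to ' ... '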

Reference: https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings


