4. Preparing Textual Data for Statistics and Machine Learning

Blueprint: Extracting Lemmas Based on Part of Speech

Lemmatization is the mapping of a word to its uninflected root. Treating words like housing, housed, and house as the same has many advantages for statistics, machine learning, and information retrieval. It can not only improve the quality of the models but also decrease training time and model size because the vocabulary is much smaller if only uninflected forms are kept. In addition, it is often helpful to restrict the types of the words used to certain categories, such as nouns, verbs, and adjectives. Those word types are called part-of-speech tags.

Let’s first take a closer look at lemmatization. The lemma of a token or span can be accessed by the lemma_ property, as illustrated in the following example:

text = "My best friend Ryan Peters likes fancy adventure games." doc = nlp(text) print(*[t.lemma_ for t in doc], sep='|')

Out:

-PRON-|good|friend|Ryan|Peters|like|fancy|adventure|game|.

The correct assignment of the lemma requires a lookup dictionary and knowledge about the part of speech of a word. For example, the lemma of the noun meeting is meeting, while the lemma of the verb is meet. In English, spaCy is able to make this distinction. In most other languages, however, lemmatization is purely dictionary-based, ignoring the part-of-speech dependency. Note that personal pronouns like I, me, you, and her always get the lemma -PRON- in spaCy.
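To see this distinction in action, here is a small sketch (the example sentence is made up for illustration; it assumes the same nlp pipeline used above) that prints the part of speech and lemma of each token. With an English model, the verb form of meeting should be reduced to meet, while the noun keeps meeting:

doc = nlp("I was meeting Ryan at the meeting.")
for t in doc:
    # the part of speech decides the lemma: VERB -> meet, NOUN -> meeting
    print(f"{t.text:10} {t.pos_:6} {t.lemma_}")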

The other token attribute we will use in this blueprint is the part-of-speech tag. Table 4-3 shows that each token in a spaCy doc has two part-of-speech attributes: pos_ and tag_. tag_ is the tag from the tagset used to train the model. For spaCy’s English models, which have been trained on the OntoNotes 5 corpus, this is the Penn Treebank tagset. For a German model, this would be the Stuttgart-Tübingen tagset. The pos_ attribute contains the simplified tag of the universal part-of-speech tagset. We recommend using this attribute, as its values remain stable across different models. Table 4-4 lists the complete tag set with descriptions; a short sketch comparing the two attributes follows the table.

Table 4-4. Universal part-of-speech tags

Tag     Description                                                           Examples
ADJ     Adjectives (describe nouns)                                           big, green, African
ADP     Adpositions (prepositions and postpositions)                          in, on
ADV     Adverbs (modify verbs or adjectives)                                  very, exactly, always
AUX     Auxiliary (accompanies verb)                                          can (do), is (doing)
CCONJ   Connecting conjunction                                                and, or, but
DET     Determiner (with regard to nouns)                                     the, a, all (things), your (idea)
INTJ    Interjection (independent word, exclamation, expression of emotion)   hi, yeah
NOUN    Nouns (common and proper)                                             house, computer
NUM     Cardinal numbers                                                      nine, 9, IX
PROPN   Proper noun, name, or part of a name                                  Peter, Berlin
PRON    Pronoun, substitute for noun                                          I, you, myself, who
PART    Particle (makes sense only with other word)
PUNCT   Punctuation characters                                                , . ;
SCONJ   Subordinating conjunction                                             before, since, if
SYM     Symbols (word-like)                                                   $, ©
VERB    Verbs (all tenses and modes)                                          go, went, thinking
X       Anything that cannot be assigned                                      grlmpf
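As announced above, here is a short sketch (again using the nlp pipeline and sample sentence from before) that prints the coarse pos_ value next to the model-specific tag_ value for each token, so you can see how the fine-grained Penn Treebank tags map onto the universal tags:

doc = nlp("My best friend Ryan Peters likes fancy adventure games.")
for t in doc:
    # pos_ is the universal tag, tag_ the finer-grained Penn Treebank tag
    print(f"{t.text:12} {t.pos_:6} {t.tag_}")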

Part-of-speech tags are an excellent alternative to stop words as word filters. In linguistics, pronouns, prepositions, conjunctions, and determiners are called function words because their main function is to create grammatical relationships within a sentence. Nouns, verbs, adjectives, and adverbs are content words, and the meaning of a sentence depends mainly on them.

Often, we are interested only in content words. Thus, instead of using a stop word list, we can use part-of-speech tags to select the word types we are interested in and discard the rest. For example, a list containing only the nouns and proper nouns in a doc can be generated like this:

text = "My best friend Ryan Peters likes fancy adventure games." doc = nlp(text) nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']] print(nouns)

Out:

[friend, Ryan, Peters, adventure, games]

We could easily define a more general filter function for this purpose, but textacy’s extract.words function conveniently provides this functionality. It also allows us to filter on part of speech and additional token properties such as is_punct or is_stop. Thus, the filter function allows both part-of-speech selection and stop word filtering. Internally it works just like we illustrated for the noun filter shown previously.
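Such a general filter could look like the following sketch (a hypothetical helper shown only for illustration; the blueprint itself relies on textacy.extract.words):

def filter_tokens(doc, include_pos=None, filter_stops=True, filter_punct=True):
    """Yield tokens, optionally dropping stop words, punctuation, and unwanted parts of speech."""
    for t in doc:
        if filter_stops and t.is_stop:
            continue
        if filter_punct and t.is_punct:
            continue
        if include_pos is not None and t.pos_ not in include_pos:
            continue
        yield t

print(*filter_tokens(doc, include_pos=['ADJ', 'NOUN']), sep='|')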

The following example shows how to extract tokens for adjectives and nouns from the sample sentence:

import textacy

tokens = textacy.extract.words(doc,
         filter_stops = True,           # default True, no stopwords
         filter_punct = True,           # default True, no punctuation
         filter_nums = True,            # default False, no numbers
         include_pos = ['ADJ', 'NOUN'], # default None = include all
         exclude_pos = None,            # default None = exclude none
         min_freq = 1)                  # minimum frequency of words

print(*[t for t in tokens], sep='|')

Out:

best|friend|fancy|adventure|games

Our blueprint function for extracting a filtered list of word lemmas is then just a tiny wrapper around that function. Because it forwards its keyword arguments (**kwargs), it accepts the same parameters as textacy’s extract.words.

def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

lemmas = extract_lemmas(doc, include_pos=['ADJ', 'NOUN'])
print(*lemmas, sep='|')

Out:

good|friend|fancy|adventure|game

Note

Using lemmas instead of inflected words is often a good idea, but not always. For example, it can hurt sentiment analysis, where the difference between “good” and “best” matters.


