python: How to get rid of punctuation using the NLTK tokenizer?


2023-12-13 12:43 | Source: compiled from the web | Views: 265

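I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: dots get attached to the last word.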

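Related discussion:

Why don't you remove the punctuation yourself? nltk.word_tokenize(the_text.translate(None, string.punctuation)) should work in Python 2, while in Python 3 you can use nltk.word_tokenize(the_text.translate(dict.fromkeys(map(ord, string.punctuation)))).

This doesn't work. Nothing happens to the text.

The workflow assumed by NLTK is that you first tokenize the text into sentences and then tokenize each sentence into words. That is why word_tokenize() does not work with multiple sentences. To get rid of the punctuation, you can use a regular expression or Python's isalnum() method.

It does work: >>> 'with dot.'.translate(None, string.punctuation) gives 'with dot' (note there is no dot at the end of the result). It may cause problems if you have something like 'end of sentence.No space', in which case you can do this instead: the_text.translate(string.maketrans(string.punctuation, ' '*len(string.punctuation))), which replaces all punctuation with spaces.

Oops, it does work indeed, but not with Unicode strings.

By the way, the isalnum() method does work with Unicode.

Try this: stackoverflow.com/questions/265960/…

By "python2" I mean Python 2's str, that is, Python 3's bytes. If you use the "python3" version that I wrote, it works: the_text.translate(dict.fromkeys(map(ord, string.punctuation))) removes all (ASCII) punctuation.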
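In Python 3, str.translate takes a table keyed by Unicode code points, and str.maketrans builds such a table. The two variants discussed in the comments above (deleting punctuation, or replacing it with spaces so that 'sentence.No' does not fuse into one word) could be sketched as follows. The names delete_table and space_table are only illustrative.

```python
import string

# Table that deletes every ASCII punctuation character.
delete_table = str.maketrans('', '', string.punctuation)

# Table that maps every ASCII punctuation character to a space.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

text = 'end of sentence.No space'

deleted = text.translate(delete_table)  # words around the dot are fused
spaced = text.translate(space_table)    # words around the dot stay separate

print(deleted)
print(spaced.split())
```

The space-replacing table is the safer choice when the text is split into words afterwards, because it never joins two words that were separated only by punctuation.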

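Take a look at the other tokenization options that NLTK provides. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else: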

    from nltk.tokenize import RegexpTokenizer

    tokenizer = RegexpTokenizer(r'\w+')
    tokenizer.tokenize('Eighty-seven miles to go, yet. Onward!')

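Output:

    ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

Related discussion:

Note that if you use this option, you lose the natural-language features specific to word_tokenize, such as splitting contractions apart. You can naively split on the regex \w+ without any need for NLTK.

To illustrate @sffc's comment, you might lose words such as "Mr."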
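As the comment above points out, matching \w+ does not require NLTK at all. A minimal sketch using only the standard re module (the helper name simple_words is hypothetical) might look like this:

```python
import re

def simple_words(text):
    """Return every maximal run of word characters (letters, digits, underscore)."""
    return re.findall(r'\w+', text)

words = simple_words('Eighty-seven miles to go, yet. Onward!')
print(words)

# Contractions and abbreviations are split or trimmed by this approach.
print(simple_words("Mr. Smith can't go"))
```

The second call shows the trade-off mentioned in the discussion: "Mr." loses its dot and "can't" is split into "can" and "t".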

You do not really need NLTK to remove punctuation. You can remove it with plain Python. For strings:

    import string

    s = '... some string with punctuation ...'
    # Python 2 str only: the second argument lists the characters to delete
    s = s.translate(None, string.punctuation)

Or for Unicode:

    import string

    # Map each punctuation code point to None so that translate() deletes it
    translate_table = dict((ord(char), None) for char in string.punctuation)
    s = s.translate(translate_table)

Then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed in the same way (such as digits).

Related discussion:

Removing all punctuation with a list expression also works:

    a = "*fa,fd.1lk#$"
    print("".join([w for w in a if w not in string.punctuation]))
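The same filtering idea extends to the other character sets in the string module mentioned in the P.S., such as string.digits. A small sketch, assuming the goal is to drop both punctuation and digits (the name unwanted is only illustrative):

```python
import string

# Characters to drop: ASCII punctuation plus the digits 0-9.
unwanted = set(string.punctuation) | set(string.digits)

a = "*fa,fd.1lk#$"
letters_only = "".join(ch for ch in a if ch not in unwanted)
print(letters_only)
```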

The code below removes all punctuation marks as well as non-alphabetic characters. It is copied from the NLTK book.

URL: http://www.nltk.org/book/ch01.html

    import nltk

    s = "I can't do this now, because I'm so tired. Please give me some time. @ sd 4 232"

    words = nltk.word_tokenize(s)

    # Keep only purely alphabetic tokens, lowercased
    words = [word.lower() for word in words if word.isalpha()]

    print(words)

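Output:

    ['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

Related discussion:

Just be aware that with this method you lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying the sentence. It is better to use sentence.translate(string.maketrans("", ""), chars_to_remove), where chars_to_remove can be ".,':;!"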

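As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). If you have a Unicode string, make sure it is a unicode object (not a "str" encoded with some encoding such as "utf-8").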

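    from nltk.tokenize import word_tokenize, sent_tokenize

    text = '''It is a blue, small, and extraordinary ball. Like no other'''
    # Split into sentences first, then tokenize each sentence into words
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    # Drop the comma and hyphen tokens
    print(list(filter(lambda word: word not in ',-', tokens)))

Related discussion:

Most of the complexity in the Penn Treebank tokenizer has to do with handling punctuation properly. If you only want to strip out the punctuation, why use an expensive tokenizer that handles punctuation well?

word_tokenize is a function that returns [token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]. So I think your answer does what NLTK already does: it uses sent_tokenize() before using word_tokenize(). At least this is the case for NLTK 3.

Because you don't need punctuation-only tokens? So you want did and n't, but not .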

I just used the following code, which removed all the punctuation:

    import nltk

    # raw is the input text as a single string
    tokens = nltk.wordpunct_tokenize(raw)
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha()]

Related discussion:

Why convert the tokens to Text?

I think you need some sort of regular expression matching (the following code is in Python 3):

    import string
    import re
    import nltk

    s = "I can't do this now, because I'm so tired. Please give me some time."
    l = nltk.word_tokenize(s)
    # Drop tokens made up entirely of punctuation; re.escape keeps ], \ and ^ literal
    ll = [x for x in l if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]
    print(l)
    print(ll)

Output:

    ['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
    ['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

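This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize.

Related discussion:

This will also remove things like ... and -- while preserving contractions, which s.translate(None, string.punctuation) won't.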

I use this code to remove punctuation:

    import nltk

    def getTerms(sentences):
        tokens = nltk.word_tokenize(sentences)
        # Keep only alphanumeric tokens, lowercased
        words = [w.lower() for w in tokens if w.isalnum()]
        print(tokens)
        print(words)

    getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B")

If you also want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

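    import enchant

    d = enchant.Dict("en_US")
    d.check("Hello")
    d.check("Helo")
    d.suggest("Helo")

Related discussion:

Beware that this solution kills contractions. That is because word_tokenize uses the standard tokenizer, TreebankWordTokenizer, which splits contractions (for example, can't into ca and n't). However, n't is not alphanumeric and gets lost in the process.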
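One way to avoid losing contraction pieces such as n't, sketched here as an alternative and not as the method of any answer above, is to keep every token that contains at least one alphanumeric character instead of requiring the whole token to be alphanumeric. The token list below is hand-written to resemble Treebank-style output so that the sketch runs without NLTK; has_alnum is a hypothetical helper.

```python
# Hand-written tokens in the style produced by word_tokenize (an assumption
# made so that this sketch does not depend on NLTK being installed).
tokens = ['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because',
          'I', "'m", 'so', 'tired', '.']

def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

# Strict filter: drops "n't" and "'m" along with the punctuation.
strict = [t.lower() for t in tokens if t.isalnum()]

# Relaxed filter: drops only tokens made entirely of non-alphanumerics.
relaxed = [t.lower() for t in tokens if has_alnum(t)]

print(strict)
print(relaxed)
```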

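Remove punctuation with the code below (it removes "." along with every other punctuation character, including those inside tokens):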

    import sys
    import unicodedata
    from nltk.tokenize import word_tokenize

    tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
    text_string = text_string.translate(tbl)  # text_string no longer has punctuation
    w = word_tokenize(text_string)  # now tokenize the string

Sample input/output (the output shown also has stop words such as "in" and "of" removed, a step that is not part of the code above):

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']
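The Unicode-category test used above can also be applied character by character, which avoids building a table over every code point. The following is a minimal sketch under that assumption; strip_punctuation is a hypothetical helper, and the sample text is a fragment of the input above. It also makes the side effect on decimals visible: 3.89 becomes 389.

```python
import unicodedata

def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = 'flat cost with parking 3.89 cr plus taxes. tax approx 9% only.'
cleaned = strip_punctuation(sample)
print(cleaned.split())
```

Currency signs such as $ fall under the symbol categories, not punctuation, so they survive this filter. If decimal numbers matter, they need to be protected before punctuation is stripped.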
