Multi-label text classification with a single-label dataset


2024-07-07 20:27 | Source: compiled from the web

I have a dataset in which every document has exactly one label, as in the example below.

label           text
pay             "i will pay now"
finance         "are you the finance guy?"
law             "lawyers and law"
court           "was at the court today"
finance report  "bank reported annual share.."

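A text document can be tagged with more than one label, so how do I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for any help.

This is what I have so far: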

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.cross_validation import train_test_split
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn import preprocessing

    loc = r'C:\Users\..\Downloads\excel.xlsx'
    df = pd.read_excel(loc)
    X = np.array(df.docs)
    z = np.array(df.title)
    y = np.array(df.raw)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    mlb = preprocessing.MultiLabelBinarizer()
    Y = mlb.fit_transform(y_train)
    Y_test = mlb.fit_transform(y_test)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    doc_new = np.array(['X has announced that it will sell $587 million'])

    print("Accuracy Score: ", accuracy_score(Y_test, predicted))
    print(mlb.inverse_transform(classifier.predict(doc_new)))

But I keep getting a dimension error:

    .format(len(self.classes_), yt.shape[1]))
    ValueError: Expected indicator for 44 classes, but got 46
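
A likely cause of this mismatch is that `fit_transform` is called a second time on the test labels. That refits the binarizer, so its class list no longer matches the columns the classifier was trained on. A second possible issue, assuming the labels are plain strings, is that `MultiLabelBinarizer` iterates over each string and treats every character as a separate class. The sketch below fits the binarizer once on the training labels, with each label wrapped in a list, and only calls `transform` on the test labels. The toy texts and labels are hypothetical stand-ins for the Excel file, and `TfidfVectorizer` is used as shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the Excel file in the question.
train_texts = [
    "i will pay now",
    "are you the finance guy?",
    "lawyers and law",
    "was at the court today",
]
train_labels = ["pay", "finance", "law", "court"]
test_texts = ["the finance team will pay", "a tax question"]
test_labels = ["finance", "tax"]  # "tax" never appears in the training labels

# Pitfall: a bare string is iterated character by character,
# so the "classes" become single letters instead of whole labels.
char_classes = list(MultiLabelBinarizer().fit(train_labels).classes_)

# Wrap each single label in a list so it is treated as one class.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform([[label] for label in train_labels])

# Reuse the fitted binarizer: transform only, never refit on test data.
with warnings.catch_warnings():
    # Labels unseen during fit are ignored and raise a UserWarning.
    warnings.simplefilter("ignore")
    Y_test = mlb.transform([[label] for label in test_labels])

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(train_texts, Y_train)
predicted = classifier.predict(test_texts)

# All three matrices now share the same number of columns.
print(list(mlb.classes_))
print(Y_train.shape, Y_test.shape, predicted.shape)
print(mlb.inverse_transform(predicted))
```

Because every training document carries only one label, each one-vs-rest classifier sees a single positive class per document. Whether the model assigns several labels to a new document then depends on how many of the per-class classifiers fire, so a prediction may contain zero, one, or several labels.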


