Multi-label text classification with a single-label dataset
I have a dataset in which each document has a single label, as in the example below.

```
label             text
pay               "i will pay now"
finance           "are you the finance guy?"
law               "lawyers and law"
court             "was at the court today"
finance report    "bank reported annual share.."
```

A text document can be tagged with more than one label, so how do I do multi-label classification on this dataset? I have read a lot of the sklearn documentation, but I can't seem to find the right way to do multi-label classification on a single-label dataset. Thanks in advance for your help.

This is what I have so far:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn import preprocessing

loc = r'C:\Users\..\Downloads\excel.xlsx'
df = pd.read_excel(loc)
X = np.array(df.docs)
z = np.array(df.title)
y = np.array(df.raw)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer()
Y = mlb.fit_transform(y_train)
Y_test = mlb.fit_transform(y_test)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

doc_new = np.array(['X has announced that it will sell $587 million'])

print("Accuracy Score: ", accuracy_score(Y_test, predicted))
print(mlb.inverse_transform(classifier.predict(doc_new)))
```

But I keep getting a dimension error:

```
    .format(len(self.classes_), yt.shape[1]))
ValueError: Expected indicator for 44 classes, but got 46
```
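A likely cause, offered as a sketch rather than a confirmed diagnosis: the second `mlb.fit_transform(y_test)` re-fits the binarizer on the test labels, so `mlb` ends up with a different set of classes (44) than the 46 columns the classifier was trained on, and `inverse_transform` rejects the prediction. Separately, `MultiLabelBinarizer` expects each sample to be an iterable of labels, so a bare string such as `"finance"` is split into its characters. The example below fits the binarizer once on the training labels and only transforms the test labels. The toy data, the helper `to_label_sets`, and the assumption that multiple labels are separated by spaces are all hypothetical and do not come from the original spreadsheet.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data modelled on the example in the question (hypothetical values).
train_docs = [
    "i will pay now",
    "are you the finance guy?",
    "lawyers and law",
    "was at the court today",
    "bank reported annual share",
]
train_raw = ["pay", "finance", "law", "court", "finance report"]
test_docs = ["the finance guy will pay"]
test_raw = ["finance pay"]


def to_label_sets(raw_labels):
    """Turn each raw label string into a list of labels.

    Assumption: several labels on one document are separated by spaces.
    Without this step, MultiLabelBinarizer iterates over the characters
    of each string instead of treating it as one label.
    """
    return [raw.split() for raw in raw_labels]


mlb = MultiLabelBinarizer()
# Learn the label vocabulary once, from the training labels only.
Y_train = mlb.fit_transform(to_label_sets(train_raw))
# Reuse the same vocabulary for the test labels: transform, not fit_transform.
Y_test = mlb.transform(to_label_sets(test_raw))

classifier = Pipeline([
    ("tfidf", TfidfVectorizer()),  # CountVectorizer followed by TfidfTransformer
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(train_docs, Y_train)

predicted = classifier.predict(test_docs)
print("Accuracy Score:", accuracy_score(Y_test, predicted))

# The column count now matches mlb.classes_, so this call no longer fails.
predicted_labels = mlb.inverse_transform(predicted)
print(predicted_labels)
```

With real data, test labels that never appear in the training split are ignored by `mlb.transform` (with a warning), so it may be worth fitting the binarizer on the full label column before splitting if every label has to be representable.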