Text Classification With Scikit-learn
Pointers
There are already some good resources on the web (presentations, scikit-learn demos, and documentation):
- Twenty newsgroups dataset on the scikit-learn website
- Official scikit-learn page on classification with the 20 newsgroups dataset
- “Statistical Learning and Text Classification with NLTK and scikit-learn”, by Olivier Grisel at PyCon 2010
- “Statistical Learning and Text Classification with NLTK and scikit-learn”, by Olivier Grisel at PyCon 2011
- Another tutorial on text classification with scikit-learn, by Jimmy Lai
- Updated (and running) script based on the official scikit-learn example. The results here are based on this script.
Loading the data
from sklearn.datasets import fetch_20newsgroups

remove = ('headers', 'footers', 'quotes')  # strip headers, signatures and quotes to make the problem more realistic
categories = None  # load all categories

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42, remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42, remove=remove)
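For a quick sanity check of what was loaded, you can inspect the returned bunch objects (an illustrative snippet; data, target and target_names are the attributes exposed by fetch_20newsgroups):

print(len(data_train.data), "training documents")
print(len(data_test.data), "test documents")
print(len(data_train.target_names), "categories")
print(data_train.target_names[:3])  # the first few category names
print(data_train.data[0][:200])     # start of the first raw document
print(data_train.target[0])         # its numeric label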
Tokenization / Parsing / Word weighting
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=True,
                             max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # fit_transform on training data
X_test = vectorizer.transform(data_test.data)        # only transform on test data
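The vectorizer produces sparse document-term matrices; a quick, illustrative way to look at their dimensions, using the variables defined above:

print(X_train.shape)  # (number of training documents, vocabulary size)
print(X_test.shape)   # same vocabulary size: the test set reuses the fitted vocabulary
print(X_train.nnz / float(X_train.shape[0]))  # average number of non-zero terms per document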
When using a custom text analysis pipeline
import nltk  # nltk.word_tokenize requires NLTK's punkt tokenizer data: nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)  # include more steps (e.g. stemming, filtering) if necessary
    return tokens

vectorizer = TfidfVectorizer(tokenizer=tokenize,  # provide a tokenizer if you want to
                             sublinear_tf=True, smooth_idf=True,
                             max_df=0.5, stop_words='english')
Feature Selection
from sklearn.feature_selection import SelectKBest, chi2

y_train, y_test = data_train.target, data_test.target

feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn versions
ch2 = SelectKBest(chi2, k=opts.select_chi2)  # opts.select_chi2 is the script's command-line option for how many features to keep
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
selected_feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
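If you want to see which of the kept terms score highest under the chi-squared test, here is an illustrative sketch reusing ch2 and selected_feature_names from above:

import numpy as np

# scores_ holds the chi-squared statistic for every original feature;
# index it with the kept features and sort, highest first
kept = ch2.get_support(indices=True)
scores = ch2.scores_[kept]
for i in np.argsort(scores)[::-1][:20]:
    print(selected_feature_names[i], scores[i])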
Classifiers
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# need (X_train, y_train) and (X_test, y_test) from the previous steps
clf = MultinomialNB(alpha=.01)
# or: clf = LinearSVC(loss='l2', penalty='l2', dual=False, tol=1e-3, C=10)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, pred)
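A single accuracy number hides how the classifier does on individual newsgroups; an illustrative breakdown using the variables above and the dataset's category names:

print(metrics.classification_report(y_test, pred,
                                    target_names=data_test.target_names))
print(metrics.confusion_matrix(y_test, pred))  # rows: true categories, columns: predicted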
Performance
Run the script at https://github.com/stefansavev/demos/blob/master/text-categorization/20ng/20ng.py:
python 20ng.py --filtered --all_categories
You’ll see the output below. Look at the accuracy for the different methods: all of the methods reported here are linear classifiers, and their accuracies are essentially the same, between 0.69 and 0.70. Notice the “--filtered” option. This option removes the headers, signatures and quotes to make the task more realistic; if the headers are not removed, the accuracy is higher.
11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)
20 categories

Using TfidfVectorizer
feature extraction on training set (tokenization) in 12.127502s at 1.136MB/s
n_samples: 11314, n_features: 101323
feature extraction on test set (tokenization) in 5.820425s at 1.419MB/s
n_samples: 7532, n_features: 101323

Results:

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, solver='lsqr', tol=0.01)
train time: 3.992s
test time: 0.105s
accuracy: 0.702
dimensionality: 101323
density: 1.000000

LinearSVC(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
          random_state=None, tol=0.001, verbose=0)
train time: 6.567s
test time: 0.030s
accuracy: 0.697
dimensionality: 101323
density: 1.000000

SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
              fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
              loss='hinge', n_iter=50, n_jobs=1, penalty='l2', power_t=0.5,
              random_state=None, shuffle=False, verbose=0, warm_start=False)
train time: 8.887s
test time: 0.105s
accuracy: 0.701
dimensionality: 101323
density: 0.378897

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
train time: 0.225s
test time: 0.079s
accuracy: 0.696
dimensionality: 101323
density: 1.000000
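For reference, the comparison the script performs boils down to a loop like the following sketch (not the script verbatim; it simply times fit/predict for each model on the matrices built in the earlier steps):

from time import time

from sklearn import metrics
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

classifiers = [
    RidgeClassifier(alpha=1.0, solver='lsqr', tol=0.01),
    LinearSVC(C=1, dual=False, tol=1e-3),
    SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001),
    MultinomialNB(alpha=0.01),
]

for clf in classifiers:
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0

    acc = metrics.accuracy_score(y_test, pred)
    print("%s: accuracy %.3f (train %.3fs, test %.3fs)"
          % (clf.__class__.__name__, acc, train_time, test_time))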