Text Classification With Scikit-learn
Pointers
There are already some good resources on the web (presentations, scikit-learn demos, and documentation).
- Twenty newsgroups dataset on the scikit-learn website
- Official scikit-learn page on classification with 20 newsgroups
- “Statistical Learning and Text Classification with NLTK and scikit-learn”, by Olivier Grisel, at PyCon 2010
- “Statistical Learning and Text Classification with NLTK and scikit-learn”, by Olivier Grisel, at PyCon 2011
- Another tutorial on text classification with scikit-learn, by Jimmy Lai
- Updated (and running) script based on the official scikit-learn example. The results below are based on this script.
Loading the data
from sklearn.datasets import fetch_20newsgroups

remove = ('headers', 'footers', 'quotes')  # strip headers, signatures and quotes to make the problem more realistic
categories = None  # load all 20 categories
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
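The integer class labels and the category names come with the fetched objects; a minimal sketch of pulling them out (assuming the code above has run), since the labels are needed later for feature selection and training:

y_train, y_test = data_train.target, data_test.target  # integer labels, one per document
target_names = data_train.target_names                 # the 20 newsgroup names
print(len(target_names), 'categories')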
Tokenization / Parsing / Word weighting
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True,
                             smooth_idf=True,
                             max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # fit_transform on training data
X_test = vectorizer.transform(data_test.data)        # just transform on test data
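As a quick sanity check, you can inspect the shapes of the resulting sparse matrices (a minimal sketch; the sample and feature counts should match the output shown in the Performance section):

print('train:', X_train.shape)  # (n_samples, n_features)
print('test: ', X_test.shape)   # same number of features as the training matrix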
When using a custom text analysis pipeline
import nltk

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    # include more steps (stemming, filtering, ...) if necessary
    return tokens

vectorizer = TfidfVectorizer(tokenizer=tokenize,  # provide a tokenizer if you want to
                             sublinear_tf=True,
                             smooth_idf=True,
                             max_df=0.5,
                             stop_words='english')
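For example, a tokenizer that also stems each token could look like this (a sketch using NLTK's PorterStemmer; stemming is not used in the results below):

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    return [stemmer.stem(t) for t in tokens]  # reduce each token to its stem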
Feature Selection
from sklearn.feature_selection import SelectKBest, chi2

feature_names = vectorizer.get_feature_names()
ch2 = SelectKBest(chi2, k=opts.select_chi2)    # opts.select_chi2: number of features to keep (a command-line option in the script)
X_train = ch2.fit_transform(X_train, y_train)  # fit the selector on the training data only
X_test = ch2.transform(X_test)
selected_feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
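To see which terms survive the chi-squared selection, you can print a few of the selected feature names (a minimal sketch, assuming the selection above has been run):

print(len(selected_feature_names), 'features kept')
print(selected_feature_names[:10])  # first few surviving terms, in vocabulary order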
Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

# need (X_train, y_train) and (X_test, y_test)
clf = MultinomialNB(alpha=.01)
# or: clf = LinearSVC(loss='l2', penalty='l2', dual=False, tol=1e-3, C=10)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, pred)
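Beyond a single accuracy number, a per-class breakdown is often more informative; a minimal sketch using scikit-learn's classification report (assuming the code above has run):

print(metrics.classification_report(y_test, pred,
                                    target_names=data_test.target_names))  # precision/recall/F1 per newsgroup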
Performance
Run the script at https://github.com/stefansavev/demos/blob/master/text-categorization/20ng/20ng.py:
python 20ng.py --filtered --all_categories
You’ll see the output below. Look at the accuracy for the different methods. The methods reported here are all linear classifiers, and their accuracies are very close, between 0.69 and 0.70. Notice the "--filtered" option: it removes headers, signatures and quotes to make the task more realistic. If they are not removed, the accuracy is higher.
11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)
20 categories
Using TfidfVectorizer
feature extraction on training set (tokenization) in 12.127502s at 1.136MB/s
n_samples: 11314, n_features: 101323
feature extraction on test set (tokenization) in 5.820425s at 1.419MB/s
n_samples: 7532, n_features: 101323
Results:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='lsqr', tol=0.01)
train time: 3.992s
test time: 0.105s
accuracy: 0.702
dimensionality: 101323
density: 1.000000
LinearSVC(C=1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
random_state=None, tol=0.001, verbose=0)
train time: 6.567s
test time: 0.030s
accuracy: 0.697
dimensionality: 101323
density: 1.000000
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
loss='hinge', n_iter=50, n_jobs=1, penalty='l2', power_t=0.5,
random_state=None, shuffle=False, verbose=0, warm_start=False)
train time: 8.887s
test time: 0.105s
accuracy: 0.701
dimensionality: 101323
density: 0.378897
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
train time: 0.225s
test time: 0.079s
accuracy: 0.696
dimensionality: 101323
density: 1.000000