Pointers

There are already some good resources on the web (presentations, scikit-learn demos and documentation).

Loading the data

from sklearn.datasets import fetch_20newsgroups

remove = ('headers', 'footers', 'quotes')  # strip headers, signatures and quoted replies to make the problem more realistic
categories = None  # load all categories
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)

y_train, y_test = data_train.target, data_test.target  # class labels, used by the classifiers below

Tokenization / Parsing / Word weighting

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True,
                             smooth_idf=True,
                             max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)  # fit_transform on training data
X_test = vectorizer.transform(data_test.data)        # just transform on test data
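The fit_transform/transform split matters: the vocabulary and idf weights must be learned from the training data only, and then reused unchanged on the test data. A minimal sketch on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, just to illustrate the fit/transform API
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]

vec = TfidfVectorizer(sublinear_tf=True, smooth_idf=True)
X = vec.fit_transform(docs)        # sparse matrix: one row per document
print(X.shape)                     # (3, vocabulary size)
print(sorted(vec.vocabulary_)[:5]) # a few of the learned terms
```

Any document transformed later is projected onto this fixed vocabulary; unseen words are simply dropped.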

When using a custom text analysis pipeline, pass your own tokenizer:

    import nltk  # requires the 'punkt' tokenizer models: nltk.download('punkt')

    def tokenize(text):
        tokens = nltk.word_tokenize(text)
        # include more steps (stemming, lemmatization, ...) if necessary
        return tokens

    vectorizer = TfidfVectorizer(tokenizer=tokenize,  # provide a tokenizer if you want to
                                 sublinear_tf=True,
                                 smooth_idf=True,
                                 max_df=0.5,
                                 stop_words='english')
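The only contract a custom tokenizer must satisfy is string in, list of tokens out. A rough pure-Python stand-in (my own regex, not nltk's algorithm) shows that shape without needing the nltk data files:

```python
import re

def tokenize(text):
    # crude stand-in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Don't panic!"))  # ['Don', "'", 't', 'panic', '!']
```

Note how the tokenizer's choices (here, splitting on the apostrophe) directly determine the vocabulary the vectorizer builds.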

Feature Selection

from sklearn.feature_selection import SelectKBest, chi2

feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
ch2 = SelectKBest(chi2, k=1000)  # keep e.g. the 1000 highest-scoring features
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)
selected_feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]
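Chi-squared selection scores each feature by how unevenly its counts are spread across the classes and keeps the top k. On a made-up 4-document, 4-feature count matrix (feature names and counts are invented for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# toy term-count matrix: rows are documents, columns are terms
X = np.array([[3, 0, 1, 0],
              [2, 0, 0, 1],
              [0, 4, 0, 3],
              [0, 3, 1, 4]])
y = np.array([0, 0, 1, 1])
names = ["apple", "ball", "cat", "dog"]

sel = SelectKBest(chi2, k=2)
X_new = sel.fit_transform(X, y)
kept = [names[i] for i in sel.get_support(indices=True)]
print(kept)  # ['apple', 'ball'] - the terms most skewed toward one class
```

"apple" appears only in class 0 and "ball" only in class 1, so they score highest; "cat" is spread evenly and scores zero.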

Classifiers

from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# need (X_train, y_train) and (X_test, y_test)
clf = MultinomialNB(alpha=0.01)
# or: clf = LinearSVC(penalty='l2', dual=False, tol=1e-3, C=10)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, pred)
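The same fit/predict/score loop works on any count matrix, which makes it easy to sanity-check offline. A sketch on invented counts where each class leans on a different term:

```python
import numpy as np
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# toy count features: class 0 favors term 0, class 1 favors term 2
X_train = np.array([[2, 1, 0], [3, 0, 0], [0, 1, 3], [0, 2, 2]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1, 0, 0], [0, 0, 2]])
y_test = np.array([0, 1])

clf = MultinomialNB(alpha=0.01)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
acc = metrics.accuracy_score(y_test, pred)
print(pred, acc)  # [0 1] 1.0
```

The small alpha matches the setting above: with large vocabularies, heavy smoothing flattens the per-class term distributions and tends to hurt accuracy.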

Performance

Run the script https://github.com/stefansavev/demos/blob/master/text-categorization/20ng/20ng.py

python 20ng.py --filtered --all_categories

You'll see output like the one below. Look at the accuracy across the different methods. The methods reported here are all linear classifiers, and their accuracies are very close, between 0.69 and 0.70. Notice the "--filtered" option: it removes some of the headers to make the task more realistic. If the headers are not removed, the accuracy is higher.

11314 documents - 13.782MB (training set)
7532 documents - 8.262MB (test set)
20 categories

Using TfidfVectorizer

feature extraction on training set (tokenization)  in 12.127502s at 1.136MB/s
n_samples: 11314, n_features: 101323

feature extraction on test set (tokenization)  in 5.820425s at 1.419MB/s
n_samples: 7532, n_features: 101323

Results:

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='lsqr', tol=0.01)
train time: 3.992s
test time:  0.105s
accuracy:   0.702
dimensionality: 101323
density: 1.000000

LinearSVC(C=1, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='l2', multi_class='ovr', penalty='l2',
     random_state=None, tol=0.001, verbose=0)
train time: 6.567s
test time:  0.030s
accuracy:   0.697
dimensionality: 101323
density: 1.000000


SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
       loss='hinge', n_iter=50, n_jobs=1, penalty='l2', power_t=0.5,
       random_state=None, shuffle=False, verbose=0, warm_start=False)
train time: 8.887s
test time:  0.105s
accuracy:   0.701
dimensionality: 101323
density: 0.378897

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
train time: 0.225s
test time:  0.079s
accuracy:   0.696
dimensionality: 101323
density: 1.000000