This exercise uses the NLTK corpus collection to distinguish four authors via PCA. Since I am a programming beginner, there may be mistakes in this post; if you spot any, please comment below. Any feedback is welcome.
NLTK (Natural Language Toolkit) is a platform for working with human language data that provides over 50 corpora and lexical resources, such as the Brown Corpus, Project Gutenberg, and NPS Chat. In this post we will use text from Project Gutenberg and analyze four authors: William Shakespeare, John Milton, Jane Austen, and Herman Melville. The final result will look like this:
From the graph we can see that the four authors have different writing styles. We extracted word-frequency features from the text, calculated the value of each feature for every block of text, and reduced them to three dimensions, so each block gets a value (a, b, c). When we plot all the blocks (61 of them here), they form four clusters: one per author.
Before we start, let’s look at what dimension reduction is and how PCA works.
What is dimension reduction?
In machine learning and statistics, dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. — Wikipedia
Here we can read “dimension” as “feature.”
When we investigate data, relationships, especially non-linear or non-local ones, are often not evident in the original feature space, and not every feature is relevant to our problem. If we reduce the number of dimensions and remove the noise (the unnecessary parts of the data), we can easily visualize the data in 2D or 3D space.
Dimension reduction can be divided into feature selection and feature extraction. Feature selection is to find a subset of the original variables. Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. — Wikipedia
PCA is one kind of data transformation used for feature extraction.
What is PCA?
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. — Wikipedia
PCA is the main linear technique for dimension reduction. It performs an orthogonal projection of the original data onto the directions that maximize the variance of the data in the low-dimensional representation.
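To make the orthogonal projection concrete, here is a minimal from-scratch sketch in NumPy (the helper name pca_project is just for illustration; in the exercise itself we will simply call scikit-learn’s PCA): center the data, take the eigenvectors of its covariance matrix, and project onto the top-k of them.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)            # center each feature
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the features
    vals, vecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    top = np.argsort(vals)[::-1][:k]   # directions of largest variance
    return Xc @ vecs[:, top]           # orthogonal projection
```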
The first step is to standardize the data: put every feature on the same scale, with a mean of 0 and a variance of 1.
Intuition: X’ (standardized feature) = (X - mean)/(standard deviation). A related rescaling, min-max normalization, instead maps each feature onto [0, 1]: X’ = (X - Xmin)/(Xmax - Xmin).
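Both rescalings are one-liners in NumPy (a toy sketch; scikit-learn’s StandardScaler and MinMaxScaler do the same thing per feature):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy single-feature data

# Standardization (used before PCA): mean 0, variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max rescaling: maps each feature onto [0, 1]
X_mm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```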
In this exercise, here is what we are going to do (a code sketch follows the list):
- pick out the top 50 words (50 features) across the four authors’ works
- chunk the text data into 61 blocks (each block is a dot on the plot)
- calculate the frequency of each of the 50 words in every block
- standardize the features by removing the mean and scaling to unit variance
- reduce the features from 50 to 3 with PCA
- plot the result
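Below is a minimal sketch of these steps, not my exact notebook code. It assumes the NLTK gutenberg corpus has been downloaded and that scikit-learn and matplotlib are installed; using one file per author and a 5,000-word chunk size are illustrative choices, so the number of blocks you get will depend on them.

```python
import nltk
from nltk.corpus import gutenberg
from collections import Counter
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

nltk.download('gutenberg')  # one-time download

# One text per author -- an illustrative choice; NLTK ships several per author
files = {
    'Shakespeare': 'shakespeare-hamlet.txt',
    'Milton': 'milton-paradise.txt',
    'Austen': 'austen-emma.txt',
    'Melville': 'melville-moby_dick.txt',
}

# Lowercased alphabetic tokens for each author
tokens = {a: [w.lower() for w in gutenberg.words(f) if w.isalpha()]
          for a, f in files.items()}

# The 50 features: the top 50 words across all four authors
combined = Counter(w for ws in tokens.values() for w in ws)
top50 = [w for w, _ in combined.most_common(50)]

# Chunk each author's tokens into fixed-size blocks and compute the
# relative frequency of each top-50 word per block
CHUNK = 5000  # illustrative size; the block count depends on it
X, labels = [], []
for author, ws in tokens.items():
    for i in range(0, len(ws) - CHUNK + 1, CHUNK):
        counts = Counter(ws[i:i + CHUNK])
        X.append([counts[w] / CHUNK for w in top50])
        labels.append(author)
X = np.array(X)

# Standardize each feature to mean 0, variance 1, then reduce 50 -> 3
scaler = StandardScaler()
pca = PCA(n_components=3)
X3 = pca.fit_transform(scaler.fit_transform(X))

# Plot the blocks, one color per author
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
for author in files:
    mask = np.array([lab == author for lab in labels])
    ax.scatter(X3[mask, 0], X3[mask, 1], X3[mask, 2], label=author)
ax.legend()
plt.show()
```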
After we plot the result, we are also going to fit a K-means model (strictly a clustering algorithm, used here like a classifier) to find the four clusters. Then we will use it to predict which cluster new text data falls into.
- define the K-means model
- chunk the test data
- predict the cluster of each test block
Here is the code for the clustering step:
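This sketch continues from the variables defined in the pipeline above (top50, scaler, pca, X3, CHUNK). The test file austen-persuasion.txt is an illustrative choice of unseen text, not necessarily the one the original run used.

```python
from sklearn.cluster import KMeans

# Fit K-means with four clusters on the 3-D PCA representation
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X3)

# Featurize a new block of text the same way as the training blocks,
# reusing the fitted scaler and PCA, then predict its cluster
def predict_cluster(words):
    counts = Counter(w.lower() for w in words if w.isalpha())
    total = sum(counts.values())
    vec = np.array([[counts[w] / total for w in top50]])
    return km.predict(pca.transform(scaler.transform(vec)))[0]

# Test on an unseen Austen text: chunk it and predict each block
test = [w.lower() for w in gutenberg.words('austen-persuasion.txt')
        if w.isalpha()]
preds = [predict_cluster(test[i:i + CHUNK])
         for i in range(0, len(test) - CHUNK + 1, CHUNK)]
print(preds)  # ideally most blocks land in the same cluster
```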
Acknowledgement: Professor Paul Ginsparg and INFO 6010.