By Jiajing Guo, Dec 2017
First we import the packages we need:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn import linear_model, neighbors
from collections import Counter
from urllib.request import urlopen
import nltk
from mpl_toolkits.mplot3d import Axes3D
If you have not downloaded the NLTK corpora yet, you can download them with the following command, which opens the interactive NLTK downloader. More information about installation can be found in the NLTK documentation.
nltk.download()
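If you prefer not to open the interactive downloader, you can fetch just the packages this notebook uses. A minimal sketch; these are standard NLTK package IDs:
nltk.download('gutenberg')     # Project Gutenberg text selections
nltk.download('shakespeare')   # XML plays, used later for testing
nltk.download('punkt')         # tokenizer models for nltk.word_tokenize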
Here we import gutenberg from the NLTK corpus. On this machine the text files live in /Users/apple/nltk_data/corpora/gutenberg (your nltk_data path may differ).
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    print(fileid, len(gutenberg.words(fileid)))
Here we are going to use Austen's Persuasion, Shakespeare's three tragedies, Milton's Paradise Lost, and Melville's Moby Dick. Let us check how many words each work has.
Except for Moby Dick, the works have similar lengths, so we can chunk them into 5,000-word blocks (on the plots, each block is a dot). Melville's Moby Dick has 260,819 words, much longer than the others, so here we just take the first 80,000 words.
Then we put the four authors' text data into four lists.
# remove punctuation: w[0].isalpha() keeps only tokens that start with a letter
shf = [fileid for fileid in gutenberg.fileids() if fileid.startswith('shakespeare')]
shw = [w.lower() for fileid in shf for w in gutenberg.words(fileid) if w[0].isalpha()]
mpw = [w.lower() for w in gutenberg.words('milton-paradise.txt') if w[0].isalpha()]
apw = [w.lower() for w in gutenberg.words('austen-persuasion.txt') if w[0].isalpha()]
# truncate to the first 80001 words, not 80000; an exact multiple of 5000 would yield only 15 blocks below
mmw = [w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w[0].isalpha()][:80001]
len(shw), len(mpw), len(apw), len(mmw)  # word counts
The word counts differ a bit from those we saw above because we removed punctuation, but that does not matter here.
Next we find the 50 most frequent words across the four authors' combined works.
fdist = Counter(shw + mpw + apw + mmw)  # combined counts for Shakespeare, Milton, Austen, and Melville
top50_4,_ = zip(*fdist.most_common(50))
print(top50_4)
len(top50_4)
We will break the words into blocks of 5,000. Recall the lengths of the four corpora: (69340, 80493, 84121, 80001). Not all the words end up in blocks: we can only use the first 65,000 words of Shakespeare (13 blocks) and the first 80,000 words each of Milton, Austen, and Melville (16 blocks each).
So we have 13 + 16 + 16 + 16 = 61 blocks.
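As a quick sanity check, we can compute the expected block counts directly. A small sketch; integer division matches the range() loop below because none of the lengths is an exact multiple of 5,000:
lengths = [len(shw), len(mpw), len(apw), len(mmw)]
blocks = [n // 5000 for n in lengths]
print(blocks, sum(blocks))  # [13, 16, 16, 16] 61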
M4 = []
for corp in [shw, mpw, apw, mmw]:
    print(len(corp))
    for i in range(0, len(corp)-5000, 5000):  # 13 blocks for Shakespeare; 16 each for the others
        fdist = Counter(corp[i:i+5000])
        M4.append([fdist[w] for w in top50_4])
M4 = np.array(M4).astype(np.float64)  # convert the integer counts to a floating-point array
len(M4)
Note on indexing: M4[0] through M4[12] are the 13 Shakespeare blocks; Milton's 16 blocks start at M4[13] and end at M4[28], not M4[29]; Austen's run from M4[29] to M4[44]; and Melville's from M4[45] to M4[60].
M4
len(M4[0])
M4 is an array with one row per block; each row holds the counts of the 50 most frequent words in that 5,000-word block.
We can plot the mean counts of the 50 words in each author's work.
def plot_format(ylab=''):
    plt.xticks(range(50), top50_4, rotation=270)
    plt.xlim(-1, 50)
    plt.grid(axis='x')
    plt.ylabel(ylab)
    plt.legend()
plt.figure(figsize=(16,12))
plt.subplot(212)
plt.plot(M4[:13].mean(0), 'b.-', label='Shakespeare') #shakespeare blocks in blue
plt.plot(M4[13:29].mean(0), 'g.-', label='Milton') #milton blocks in green
plt.plot(M4[29:45].mean(0), 'r.-', label='Austen') #austen blocks in red
plt.plot(M4[45:].mean(0), 'm.-', label='Melville') #melville blocks in magenta
plot_format('means')
From the plot we can see that the word counts vary widely (from 0 to over 200).
But we're only interested in the deviations from the means.
Recall how to standardize: X' (the rescaled feature) = (X - mean) / (standard deviation), which gives each feature zero mean and unit variance. (The formula (X - Xmin)/(Xmax - Xmin) is min-max scaling, a different rescaling.) But here we don't need to do it manually; we are going to use sklearn.preprocessing.StandardScaler().
scaler4 = preprocessing.StandardScaler()
M4_scaled = scaler4.fit_transform(M4)
Now the values for all words have zero mean and similar-sized fluctuations about the mean.
M4_scaled
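To see what StandardScaler computes under the hood, here is a minimal manual equivalent (a sketch; it assumes no word has zero variance across the blocks):
M4_manual = (M4 - M4.mean(axis=0)) / M4.std(axis=0)
print(np.allclose(M4_manual, M4_scaled))  # True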
We can plot the means again:
plt.figure(figsize=(16,8))
plt.plot(M4_scaled[:13].mean(0),'o-',lw=3, color='cyan', label='Sh\'peare mean')
plt.plot(M4_scaled[13:29].mean(0),'o-', color='#00FF00', lw=3, label='Milton mean')
plt.plot(M4_scaled[29:45].mean(0),'o-', color='#FF9B44', lw=3, label='Austen mean')
plt.plot(M4_scaled[45:61].mean(0),'o-', color='#FFA3F7', lw=3, label='Melville mean')
plt.grid(axis='x')
plt.legend(loc='upper right', fontsize=9)
plt.xticks(range(50), top50_4, rotation=270);
Now we are going to reduce the features from 50 to 3, using sklearn.decomposition.PCA(). Its fit_transform(X[, y]) method fits the model with X and applies the dimensionality reduction to X.
pca3 = PCA(n_components=3) #find three features
M4_new = pca3.fit_transform(M4_scaled)
evr = pca3.explained_variance_ratio_
print('explained variance ratio:', evr, sum(evr))  # each component's share of the total variance
The first three principal components account for 57% of the variation of the original 50-dimensional data.
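To check how many components a given coverage level would need, we can fit a full PCA and inspect the cumulative ratios (a quick sketch):
pca_full = PCA().fit(M4_scaled)
cum = np.cumsum(pca_full.explained_variance_ratio_)
print(np.argmax(cum >= 0.90) + 1)  # number of components covering 90% of the variance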
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d', elev=20, azim=-60)
color = 'b'*13 + 'g'*16 + 'r'*16 + 'm'*16
ax.scatter(*M4_new.T, c=list(color), marker='o')
plt.ylim(-5,7)
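# plot off-screen proxy points only to create the legend entries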
ax.plot([-1],[100],'b', label='Shakespeare')
plt.plot([-1],[100],'g', label='Milton')
plt.plot([-1],[100],'r', label='Austen')
plt.plot([-1],[100],'m', label='Melville')
plt.legend()
Here we will use sklearn.cluster.KMeans() to define a k-means clusterer with four clusters.
km4 = KMeans(4).fit(M4_new)
means4 = km4.cluster_centers_
means4
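Since we know which author each training block came from, we can check how the clusters line up with the authors (a sketch using km4.labels_ and the block ranges established above):
author_slices = [('Shakespeare', slice(0, 13)), ('Milton', slice(13, 29)),
                 ('Austen', slice(29, 45)), ('Melville', slice(45, 61))]
for name, sl in author_slices:
    print(name, Counter(km4.labels_[sl]))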
We have completed the training; what about testing on some other data?
We are going to find more text from the four authors to see whether the pipeline works: eight plays from nltk.corpus.shakespeare, Milton's Paradise Regained, Austen's Emma and Sense and Sensibility, and the remaining portion of Moby Dick.
# Paradise Regained
paradise_r_url = 'http://www.gutenberg.org/cache/epub/58/pg58.txt'
paradise_r = urlopen(paradise_r_url).read().decode('utf-8')
paradise_r_words = [w.lower() for w in nltk.word_tokenize(paradise_r[540:100260]) if w[0].isalpha()]  # lowercase and drop punctuation, matching the training preprocessing
# Austen: Emma and Sense and Sensibility
aew = [w.lower() for w in gutenberg.words('austen-emma.txt') if w[0].isalpha()]
asw = [w.lower() for w in gutenberg.words('austen-sense.txt') if w[0].isalpha()]
aesw = aew + asw
# Melville Moby Dick remaining portion
mmw27 = [w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w[0].isalpha()][80000:]
from nltk.corpus import shakespeare
for fileid in shakespeare.fileids():
    print(fileid, len(shakespeare.words(fileid)))
shf8 = [fileid for fileid in shakespeare.fileids() if fileid.endswith('xml')]
shw8 = [w.lower() for fileid in shf8 for w in shakespeare.words(fileid) if w[0].isalpha()]
len(shw8)
len(shw8), len(paradise_r_words), len(aesw), len(mmw27)
As before, we break the words into blocks of 5,000.
KM4 = []
for corp in [shw8, paradise_r_words, aesw, mmw27]:
    print(len(corp)/5000)
    for i in range(0, len(corp)-5000, 5000):  # block counts differ from the training corpora here
        fdist = Counter(corp[i:i+5000])
        KM4.append([fdist[w] for w in top50_4])
KM4 = np.array(KM4).astype(np.float64)  # convert the integer counts to a floating-point array
len(KM4)
KM4_new = pca3.transform(scaler4.transform(KM4))
KMeans's predict(X) method predicts the closest cluster each sample in X belongs to.
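Before predicting, it helps to know where each author's test blocks start and end inside KM4_new. A small sketch (valid as long as no corpus length is an exact multiple of 5,000):
counts = [len(c) // 5000 for c in [shw8, paradise_r_words, aesw, mmw27]]
offsets = np.cumsum([0] + counts)
print(counts, offsets)  # slice boundaries for the predictions below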
print(km4.predict(KM4_new[0:39]))   # Shakespeare
print(km4.predict(KM4_new[39:42]))  # Milton
print(km4.predict(KM4_new[42:98]))  # Austen
print(km4.predict(KM4_new[98:]))    # Melville
The result makes sense: it seems the predictions are correct.
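One caveat: k-means numbers its clusters arbitrarily, so the integers printed above still need to be mapped back to authors. A small sketch that derives the mapping by majority vote over the training blocks (reusing author_slices from the sketch above):
cluster_to_author = {}
for name, sl in author_slices:
    majority = Counter(km4.labels_[sl]).most_common(1)[0][0]
    cluster_to_author[majority] = name
print(cluster_to_author)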
We can also plot the result:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d', elev=20, azim=-60)
color = 'b'*13 + 'g'*16 + 'r'*16 + 'm'*16
ax.scatter(*M4_new.T, c=list(color), marker='o')
ax.plot(*KM4_new.T,'o',color='cyan')
ax.plot(*means4.T, '*y', markersize=15, alpha=.8)
plt.ylim(-5,7)
plt.xlim(-5,7)
ax.set_zlim(-5,7)
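# plot off-screen proxy points only to create the legend entries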
ax.plot([-1],[100],'o',color='b',label='Shakespeare')
plt.plot([-1],[100],'o',color='g', label='Milton')
plt.plot([-1],[100],'o',color='r', label='Austen')
plt.plot([-1],[100],'o',color='m', label='Melville')
plt.plot([-1],[100],'o',color='cyan', label='Test')
plt.legend()