
By Jiajing Guo, Dec 2017

First, we import the packages we need.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn import linear_model, neighbors
from collections import Counter
from urllib.request import urlopen
In [3]:
import nltk
from mpl_toolkits.mplot3d import Axes3D

If you have not downloaded the NLTK corpora yet, you can do so with the following code. More information about installation is available in the NLTK documentation.

In [4]:
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Out[4]:
True
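
If you would rather not open the interactive downloader, you can fetch only the resources this notebook actually uses (a minimal alternative sketch; these identifiers are the standard NLTK package names for the corpora and tokenizer used below):

import nltk

# download just the data used in this notebook, without the GUI
nltk.download('gutenberg')    # Project Gutenberg text samples used for training
nltk.download('shakespeare')  # XML editions of eight Shakespeare plays (used for testing)
nltk.download('punkt')        # tokenizer models needed by nltk.word_tokenize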

Chunk text data

Here we import the gutenberg corpus from nltk.corpus. The text files can be found under /Users/apple/nltk_data/corpora/gutenberg.

For more details, see Chapter 2 of the NLTK book, Accessing Text Corpora and Lexical Resources.

In [5]:
from nltk.corpus import gutenberg
In [6]:
for fileid in gutenberg.fileids():
    print (fileid,len(gutenberg.words(fileid)))
austen-emma.txt 192427
austen-persuasion.txt 98171
austen-sense.txt 141576
bible-kjv.txt 1010654
blake-poems.txt 8354
bryant-stories.txt 55563
burgess-busterbrown.txt 18963
carroll-alice.txt 34110
chesterton-ball.txt 96996
chesterton-brown.txt 86063
chesterton-thursday.txt 69213
edgeworth-parents.txt 210663
melville-moby_dick.txt 260819
milton-paradise.txt 96825
shakespeare-caesar.txt 25833
shakespeare-hamlet.txt 37360
shakespeare-macbeth.txt 23140
whitman-leaves.txt 154883

Here we are going to use Austen's Persuasion, Shakespeare's three tragedies, Milton's Paradise Lost, and Melville's Moby Dick. Let us check how many words each work has.

  • Shakespeare (three tragedies): 25833 + 37360 + 23140 = 86333
  • Milton (Paradise Lost): 96825
  • Austen (Persuasion): 98171
  • Melville (Moby Dick): 260819

Except for Moby Dick, the works have similar lengths, so we can chunk them into blocks of 5000 words (on the plot below, each block will be one dot). Melville's Moby Dick has 260819 words, much longer than the others, so here we just take the first 80000 words (80001 in the code, to get a full 16 blocks).

Then we put the four authors' text data into four lists.

In [7]:
# keep only tokens whose first character is alphabetic (drops punctuation), and lowercase everything
shf = [fileid for fileid in gutenberg.fileids() if fileid.startswith('shakespeare')]
shw = [w.lower() for fileid in shf for w in gutenberg.words(fileid) if w[0].isalpha()]
mpw = [w.lower() for w in gutenberg.words('milton-paradise.txt') if w[0].isalpha()]
apw = [w.lower() for w in gutenberg.words('austen-persuasion.txt') if w[0].isalpha()]
# truncate to the first 80001 words, not 80000, otherwise we would only get 15 blocks below
mmw = [w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w[0].isalpha()][:80001]
len(shw),len(mpw), len(apw), len(mmw) #numbers of words
Out[7]:
(69340, 80493, 84121, 80001)

The word counts are a bit different from those we counted before because we removed the punctuation, but that is fine.
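
For example, for Persuasion the difference is exactly the number of tokens whose first character is not alphabetic (punctuation, mostly), which we can check directly:

# tokens of Persuasion dropped by the w[0].isalpha() filter
print(len(gutenberg.words('austen-persuasion.txt')) - len(apw))  # 98171 - 84121 = 14050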

Then we get the 50 most frequent words across their combined works.

In [8]:
fdist = Counter(shw + mpw + apw + mmw) #for shakespeare, milton, austen, and melville
top50_4,_ = zip(*fdist.most_common(50))
print(top50_4)
len(top50_4)
('the', 'and', 'of', 'to', 'in', 'a', 'i', 'that', 'his', 'with', 'it', 'he', 'not', 'but', 'as', 'was', 'for', 'be', 'all', 'you', 'is', 'her', 'had', 'him', 'this', 'my', 's', 'on', 'by', 'so', 'from', 'at', 'or', 'she', 'me', 'what', 'they', 'their', 'which', 'no', 'have', 'there', 'now', 'more', 'will', 'were', 'we', 'them', 'thou', 'when')
Out[8]:
50

We will break the words into blocks of 5000. Recall the lengths of the four corpora: (69340, 80493, 84121, 80001). Not every word ends up in a block: we use the first 65000 words of Shakespeare (13 blocks), the first 80000 of Milton (16 blocks), the first 80000 of Austen (16 blocks), and the first 80000 of Melville (16 blocks).

So we have 13+16+16+16=61 blocks.
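
We can double-check this arithmetic with the same range(0, len-5000, 5000) stepping that the loop below uses:

# number of full 5000-word blocks per corpus, given the lengths above
block_counts = [len(range(0, n - 5000, 5000)) for n in (69340, 80493, 84121, 80001)]
print(block_counts, sum(block_counts))  # [13, 16, 16, 16] 61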

In [9]:
M4=[]
for corp in [shw,mpw,apw,mmw]:
    print(len(corp))
    for i in range(0,len(corp)-5000,5000):  # 13 blocks for Shakespeare, 16 each for Milton, Austen, and Melville
        fdist = Counter(corp[i:i+5000])
        M4.append([fdist[w] for w in top50_4])
M4 = np.array(M4).astype(np.float64) #convert integers to floating point array
len(M4)
69340
80493
84121
80001
Out[9]:
61

The 61 blocks are:

  • 0-12: Shakespeare (13 blocks)
  • 13-28: Milton (16 blocks)
  • 29-44: Austen (16 blocks)
  • 45-60: Melville (16 blocks)

Notice that indices 0-12 already make 13 blocks, so Milton's blocks start at M4[13] and end at M4[28], not M4[29].

In [10]:
M4
Out[10]:
array([[ 151.,  157.,   87., ...,   16.,   18.,   24.],
       [ 156.,  160.,   89., ...,   16.,   22.,   14.],
       [ 141.,  164.,   93., ...,   11.,   16.,   10.],
       ..., 
       [ 275.,  111.,   99., ...,   21.,   14.,   11.],
       [ 340.,  143.,  175., ...,   11.,    0.,   18.],
       [ 353.,  118.,  218., ...,   11.,    4.,   17.]])
In [11]:
len(M4[0])
Out[11]:
50

M4 is a 61×50 array: each row holds the counts of the top 50 words in one block.

We can plot the mean number of times each of the 50 words appears per block in each author's work.

In [12]:
def plot_format(ylab=''):
    plt.xticks(range(50), top50_4, rotation=270)
    plt.xlim(-1,50)
    plt.grid(axis='x')
    plt.ylabel(ylab)
    plt.legend()

plt.figure(figsize=(16,12))
plt.subplot(212)
plt.plot(M4[:13].mean(0), 'b.-', label='Shakespeare') #shakespeare blocks in blue
plt.plot(M4[13:29].mean(0), 'g.-', label='Milton') #milton blocks in green
plt.plot(M4[29:45].mean(0), 'r.-', label='Austen') #austen blocks in red
plt.plot(M4[45:].mean(0), 'm.-', label='Melville') #melville blocks in magenta
plot_format('means')

Standardize

From the plot we can see that the counts vary a lot from word to word, from 0 to over 200 occurrences per block.

But we're only interested in the deviations from the means. Recall how to standardize: X' (the rescaled feature) = (X − mean) / (standard deviation). We don't need to do this by hand; we are going to use sklearn.preprocessing.StandardScaler().

In [13]:
scaler4 = preprocessing.StandardScaler()
M4_scaled = scaler4.fit_transform(M4)

Now the counts for every word have zero mean and unit variance across the blocks.
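
We can verify that claim numerically; each column of M4_scaled (one column per word) should have mean close to 0 and standard deviation close to 1:

# per-word mean and standard deviation after scaling
print(np.allclose(M4_scaled.mean(axis=0), 0))  # expect True: zero mean
print(np.allclose(M4_scaled.std(axis=0), 1))   # expect True: unit variance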

In [14]:
M4_scaled
Out[14]:
array([[-0.9825883 , -0.35017329, -1.43096258, ...,  0.41887425,
         0.34056661,  2.31026758],
       [-0.90200496, -0.26067369, -1.36879113, ...,  0.41887425,
         0.62130396,  0.18788321],
       [-1.14375497, -0.14134089, -1.24444822, ..., -0.33263543,
         0.20019794, -0.66107054],
       ..., 
       [ 1.01587846, -1.72250045, -1.05793387, ...,  1.17038394,
         0.05982927, -0.4488321 ],
       [ 2.06346184, -0.76783808,  1.30458127, ..., -0.33263543,
        -0.92275143,  1.03683695],
       [ 2.27297852, -1.51366806,  2.64126746, ..., -0.33263543,
        -0.64201409,  0.82459852]])

We can plot the means again:

In [15]:
plt.figure(figsize=(16,8))
plt.plot(M4_scaled[:13].mean(0),'o-',lw=3, color='cyan', label='Sh\'peare mean')
plt.plot(M4_scaled[13:29].mean(0),'o-', color='#00FF00', lw=3, label='Milton mean')
plt.plot(M4_scaled[29:45].mean(0),'o-', color='#FF9B44', lw=3, label='Austen mean')
plt.plot(M4_scaled[45:].mean(0),'o-', color='#FFA3F7', lw=3, label='Melville mean')
plt.grid(axis='x')
plt.legend(loc='upper right', fontsize=9)
plt.xticks(range(50), top50_4, rotation=270);

Reduce features

Now we are going to reduce the features from 50 to 3. We will use sklearn.decomposition.PCA() here.

fit_transform(X[, y]): Fit the model with X and apply the dimensionality reduction on X.

In [16]:
pca3 = PCA(n_components=3)  #find three features
M4_new = pca3.fit_transform(M4_scaled)
evr = pca3.explained_variance_ratio_
print ('explained variance ratio:', evr, sum(evr))  #fraction of the total variance explained by each component
explained variance ratio: [ 0.24456369  0.2162239   0.10819886] 0.568986454189

The first three principal components account for 57% of the variation of the original 50-dimensional data.
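
If you are curious how quickly the explained variance accumulates with more components, you can fit a full PCA and inspect the cumulative ratio (a quick diagnostic sketch, not needed for the rest of the notebook):

# cumulative explained variance over all 50 components
pca_full = PCA().fit(M4_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
print(cumvar[:5])  # cumulative ratio after 1..5 components
print(int(np.argmax(cumvar >= 0.9)) + 1, 'components reach 90% of the variance')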

Plot result

In [17]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d', elev=20, azim=-60)
color = 'b'*13 + 'g'*16 + 'r'*16 + 'm'*16
ax.scatter(*M4_new.T, marker='o', c = color)
plt.ylim(-5,7)

ax.plot([-1],[100],'b', label='Shakespeare') 
plt.plot([-1],[100],'g', label='Milton')      
plt.plot([-1],[100],'r', label='Austen')   
plt.plot([-1],[100],'m', label='Melville')
plt.legend()
Out[17]:
<matplotlib.legend.Legend at 0x12071cfd0>

K-means classifier

Here we will use sklearn.cluster.KMeans() to cluster the blocks into four groups; the fitted cluster centers will then act as a simple classifier for new blocks.

In [18]:
km4 = KMeans(4).fit(M4_new)
means4 = km4.cluster_centers_
means4
Out[18]:
array([[-4.38394634, -3.11245907, -0.89044917],
       [-1.59195519,  5.20537382, -1.08712676],
       [ 4.80373465, -1.54657677, -1.77764063],
       [ 0.87367529,  0.42966961,  3.55138028]])
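
Since we know which author each training block came from, we can check how the fitted clusters line up with the authors (a quick sanity check using km4.labels_ and the block ranges listed earlier):

# cluster labels of the training blocks, grouped by author
train_slices = {'Shakespeare': slice(0, 13), 'Milton': slice(13, 29),
                'Austen': slice(29, 45), 'Melville': slice(45, 61)}
for author, s in train_slices.items():
    print(author, Counter(km4.labels_[s]))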

Chunk test data

We have completed the training; what about testing it on some other data?

We are going to gather more text from the four authors to see whether the scaler + PCA + k-means pipeline generalizes.

  • Shakespeare: eight plays from nltk.corpus.shakespeare (note that Hamlet, Julius Caesar, and Macbeth also appear in the training data, though from a different source).
  • Milton: Paradise Regained from Gutenberg.org.
  • Austen: Emma and Sense and Sensibility combined.
  • Melville: the remainder of Moby Dick.
In [19]:
# Paradise Regained, fetched from Project Gutenberg
paradise_r_url = 'http://www.gutenberg.org/cache/epub/58/pg58.txt'
paradise_r = urlopen(paradise_r_url).read().decode('utf-8')
# take a fixed slice of the text (trimming Project Gutenberg front/back matter) and tokenize it
# (unlike the Gutenberg-corpus texts above, word_tokenize keeps punctuation and case)
paradise_r_words = nltk.word_tokenize(paradise_r[540:100260])
In [20]:
# Austen: Emma and Sense and Sensibility combined
aew = [w.lower() for w in gutenberg.words('austen-emma.txt') if w[0].isalpha()]
asw = [w.lower() for w in gutenberg.words('austen-sense.txt') if w[0].isalpha()]
aesw = aew + asw
In [21]:
# Melville Moby Dick remaining portion
mmw27 = [w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w[0].isalpha()][80000:] 
In [22]:
from nltk.corpus import shakespeare
for fileid in shakespeare.fileids():
    print (fileid,len(shakespeare.words(fileid)))
a_and_c.xml 34192
dream.xml 21538
hamlet.xml 40379
j_caesar.xml 26058
macbeth.xml 22977
merchant.xml 27263
othello.xml 35092
r_and_j.xml 33078
In [23]:
shf8 = [fileid for fileid in shakespeare.fileids() if fileid.endswith('xml')]
shw8 = [w.lower() for fileid in shf8 for w in shakespeare.words(fileid) if w[0].isalpha()]
len(shw8)
Out[23]:
195579
In [24]:
len(shw8), len(paradise_r_words), len(aesw), len(mmw27)
Out[24]:
(195579, 19253, 282334, 138365)

As before, we break the words into blocks of 5000.

In [25]:
KM4=[]
for corp in [shw8, paradise_r_words, aesw, mmw27]:
    print(len(corp)/5000)
    for i in range(0,len(corp)-5000,5000):  # keep only full 5000-word blocks; the leftover tail is dropped
        fdist = Counter(corp[i:i+5000])
        KM4.append([fdist[w] for w in top50_4])
KM4 = np.array(KM4).astype(np.float64) #convert integers to floating point array
len(KM4)
39.1158
3.8506
56.4668
27.673
Out[25]:
125

The 125 test blocks are:

  • Shakespeare: 39 blocks (indices 0-38)
  • Milton: 3 blocks (indices 39-41)
  • Austen: 56 blocks (indices 42-97)
  • Melville: 27 blocks (indices 98-124)
In [26]:
KM4_new = pca3.transform(scaler4.transform(KM4))

Predict test data

predict(X): Predict the closest cluster each sample in X belongs to.

In [27]:
print(km4.predict(KM4_new[0:38])) #Shakespeare (first 38 of the 39 blocks)
print(km4.predict(KM4_new[39:41])) #Milton (first 2 of the 3 blocks)
print(km4.predict(KM4_new[42:97])) #Austen (first 55 of the 56 blocks)
print(km4.predict(KM4_new[98:124])) #Melville (first 26 of the 27 blocks)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1]
[0 0]
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]

The result makes sense: all of each author's test blocks land in a single cluster, and each author gets a different cluster, so the predictions look correct.
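
A compact way to summarize the predictions is a Counter per author (using the same slices as above); ideally each author's test blocks should all fall into one cluster:

# summarize the predicted clusters per author for the test blocks
test_slices = {'Shakespeare': slice(0, 38), 'Milton': slice(39, 41),
               'Austen': slice(42, 97), 'Melville': slice(98, 124)}
pred = km4.predict(KM4_new)
for author, s in test_slices.items():
    print(author, Counter(pred[s]))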

We can also plot the result:

In [28]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d', elev=20, azim=-60)
color = 'b'*13 + 'g'*16 + 'r'*16 + 'm'*16
ax.scatter(*M4_new.T, marker='o', c = color)
ax.plot(*KM4_new.T,'o',color='cyan')
ax.plot(*means4.T, '*y', markersize=15, alpha=.8)
plt.ylim(-5,7)
plt.xlim(-5,7)
ax.set_zlim(-5,7)

ax.plot([-1],[100],'o',color='b',label='Shakespeare') 
plt.plot([-1],[100],'o',color='g', label='Milton')      
plt.plot([-1],[100],'o',color='r', label='Austen')   
plt.plot([-1],[100],'o',color='m', label='Melville')
plt.plot([-1],[100],'o',color='cyan', label='Test')
plt.legend()
Out[28]:
<matplotlib.legend.Legend at 0x12037ee10>