In [48]:
<code><script>
var myUrl = 'http://cindyguoweb.org/wordpress/?p=131';
 
if(window.top.location.href !== myUrl) {
    window.top.location.href = myUrl;
}
</script></code>
  File "<ipython-input-48-d869a4f6eb60>", line 1
    <code><script>
    ^
SyntaxError: invalid syntax

By Jiajing Guo, Sep 2017

In this post we will go through Peter Norving's spelling corrector. Though his own blog has clear explanation of this simple model, as a Python greener I would like to write it again to make sure I have understand it.

Let's recall [Bayes' Theorem], to check spelling errors, we need to know p(c|w) ∝ p(w|c)p(c), in which case p(w|c) refers to error model, and p(c) refers to language model. The likelyhood of the correct word relates to the probability it appears in the dictionary.

OK, let us import the package. It's a good habit to put all the package at the top of the notebook.

In [15]:
import re
from collections import Counter

As Norving explains:

We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter:

So here we download big.txt and split them by findall().

In [39]:
def words(text):
    return re.findall(r'\w+', text.lower())

Recall Counter() generates a list with a word and how many times it shows in the dictionary.

In [40]:
WORDS = Counter(words(open('big.txt').read()))
print(WORDS.most_common(10))
[('the', 79809), ('of', 40024), ('and', 38312), ('to', 28765), ('in', 22023), ('a', 21124), ('that', 12512), ('he', 12401), ('was', 11410), ('it', 10681)]

Recall how to use findall().

In [6]:
re.findall('[a-z0-9]+',txt.lower())
Out[6]:
['here', 'is', 'a', 'sentent', 'with', 'comma', '29']

Edits1

In edits1 we edit words that are one edit away from word.

Norving introduces four kinds of errors:

  1. deletion (remove one letter)
  2. transposition (swap two adjacent letters)
  3. replacement (change one letter to another)
  4. insertion (add a letter).

(I am not sure whether there is any linguistic support.)

Let us copy and paste the code in edits1, and then go through it.

In [18]:
def edits1(word):
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

Let's say the test word is "thew".

In [7]:
word = "thew"

First we split the word "thew".

range() is a set of numbers, which is shown below. Notice that we need to plus 1.

In [8]:
list(range(len(word) + 1))
Out[8]:
[0, 1, 2, 3, 4]

In each case we break it up into two parts. Here you can understand why we need to plus one.

In [10]:
[(word[:i], word[i:]) for i in range(len(word) + 1)] 
Out[10]:
[('', 'thew'), ('t', 'hew'), ('th', 'ew'), ('the', 'w'), ('thew', '')]

And then we put all the words into the possible parts.

Delete: delete one letter to see whether it is the reason of misspelling.

We iterate all over the splits, L and R represent left hand side and right hand side. Notice the output above it something like this: ('', 'thew').

And then it will return the left and the part starting from the second letter in right side. For example, in ('', 'thew'), the left is nothing, R[1:] is 'hew'. So the output is 'hew'.

In [11]:
[L + R[1:]               for L, R in splits if R]
Out[11]:
['hew', 'tew', 'thw', 'the']

Transposes: take the left side and transpose it to the right side

In [12]:
[L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
Out[12]:
['htew', 'tehw', 'thwe']

Replace: replace the first letter in right side with any of the 26 letters

In [13]:
[L + c + R[1:]           for L, R in splits if R for c in letters]
Out[13]:
['ahew',
 'bhew',
 'chew',
 'dhew',
 'ehew',
 'fhew',
 'ghew',
 'hhew',
 'ihew',
 'jhew',
 'khew',
 'lhew',
 'mhew',
 'nhew',
 'ohew',
 'phew',
 'qhew',
 'rhew',
 'shew',
 'thew',
 'uhew',
 'vhew',
 'whew',
 'xhew',
 'yhew',
 'zhew',
 'taew',
 'tbew',
 'tcew',
 'tdew',
 'teew',
 'tfew',
 'tgew',
 'thew',
 'tiew',
 'tjew',
 'tkew',
 'tlew',
 'tmew',
 'tnew',
 'toew',
 'tpew',
 'tqew',
 'trew',
 'tsew',
 'ttew',
 'tuew',
 'tvew',
 'twew',
 'txew',
 'tyew',
 'tzew',
 'thaw',
 'thbw',
 'thcw',
 'thdw',
 'thew',
 'thfw',
 'thgw',
 'thhw',
 'thiw',
 'thjw',
 'thkw',
 'thlw',
 'thmw',
 'thnw',
 'thow',
 'thpw',
 'thqw',
 'thrw',
 'thsw',
 'thtw',
 'thuw',
 'thvw',
 'thww',
 'thxw',
 'thyw',
 'thzw',
 'thea',
 'theb',
 'thec',
 'thed',
 'thee',
 'thef',
 'theg',
 'theh',
 'thei',
 'thej',
 'thek',
 'thel',
 'them',
 'then',
 'theo',
 'thep',
 'theq',
 'ther',
 'thes',
 'thet',
 'theu',
 'thev',
 'thew',
 'thex',
 'they',
 'thez']

Inserts: insert any of the 26 letters between left and right side

In [14]:
[L + c + R               for L, R in splits for c in letters]
Out[14]:
['athew',
 'bthew',
 'cthew',
 'dthew',
 'ethew',
 'fthew',
 'gthew',
 'hthew',
 'ithew',
 'jthew',
 'kthew',
 'lthew',
 'mthew',
 'nthew',
 'othew',
 'pthew',
 'qthew',
 'rthew',
 'sthew',
 'tthew',
 'uthew',
 'vthew',
 'wthew',
 'xthew',
 'ythew',
 'zthew',
 'tahew',
 'tbhew',
 'tchew',
 'tdhew',
 'tehew',
 'tfhew',
 'tghew',
 'thhew',
 'tihew',
 'tjhew',
 'tkhew',
 'tlhew',
 'tmhew',
 'tnhew',
 'tohew',
 'tphew',
 'tqhew',
 'trhew',
 'tshew',
 'tthew',
 'tuhew',
 'tvhew',
 'twhew',
 'txhew',
 'tyhew',
 'tzhew',
 'thaew',
 'thbew',
 'thcew',
 'thdew',
 'theew',
 'thfew',
 'thgew',
 'thhew',
 'thiew',
 'thjew',
 'thkew',
 'thlew',
 'thmew',
 'thnew',
 'thoew',
 'thpew',
 'thqew',
 'threw',
 'thsew',
 'thtew',
 'thuew',
 'thvew',
 'thwew',
 'thxew',
 'thyew',
 'thzew',
 'theaw',
 'thebw',
 'thecw',
 'thedw',
 'theew',
 'thefw',
 'thegw',
 'thehw',
 'theiw',
 'thejw',
 'thekw',
 'thelw',
 'themw',
 'thenw',
 'theow',
 'thepw',
 'theqw',
 'therw',
 'thesw',
 'thetw',
 'theuw',
 'thevw',
 'theww',
 'thexw',
 'theyw',
 'thezw',
 'thewa',
 'thewb',
 'thewc',
 'thewd',
 'thewe',
 'thewf',
 'thewg',
 'thewh',
 'thewi',
 'thewj',
 'thewk',
 'thewl',
 'thewm',
 'thewn',
 'thewo',
 'thewp',
 'thewq',
 'thewr',
 'thews',
 'thewt',
 'thewu',
 'thewv',
 'theww',
 'thewx',
 'thewy',
 'thewz']

What edits1 funciton does is to go through all the possibilities, generate all the possible delete,transposes,replace and inserts, and then return them in a list.

Note that all edits that are one edit away from word. But it can still be a big set.

In [20]:
len(edits1('therefore'))
Out[20]:
494

So we'd better restrict the word in our dictionary, which are so called known words (Notice that the majority of the words genertaed by doing inserts are garbage).

In [21]:
def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)
In [29]:
known("impatience")
Out[29]:
{'a', 'c', 'e', 'i', 'm', 'n', 'p', 't'}
In [35]:
edits1("yesterdya");
Out[35]:
{'aesterdya',
 'ayesterdya',
 'besterdya',
 'byesterdya',
 'cesterdya',
 'cyesterdya',
 'desterdya',
 'dyesterdya',
 'eesterdya',
 'esterdya',
 'eyesterdya',
 'eysterdya',
 'festerdya',
 'fyesterdya',
 'gesterdya',
 'gyesterdya',
 'hesterdya',
 'hyesterdya',
 'iesterdya',
 'iyesterdya',
 'jesterdya',
 'jyesterdya',
 'kesterdya',
 'kyesterdya',
 'lesterdya',
 'lyesterdya',
 'mesterdya',
 'myesterdya',
 'nesterdya',
 'nyesterdya',
 'oesterdya',
 'oyesterdya',
 'pesterdya',
 'pyesterdya',
 'qesterdya',
 'qyesterdya',
 'resterdya',
 'ryesterdya',
 'sesterdya',
 'syesterdya',
 'testerdya',
 'tyesterdya',
 'uesterdya',
 'uyesterdya',
 'vesterdya',
 'vyesterdya',
 'westerdya',
 'wyesterdya',
 'xesterdya',
 'xyesterdya',
 'yaesterdya',
 'yasterdya',
 'ybesterdya',
 'ybsterdya',
 'ycesterdya',
 'ycsterdya',
 'ydesterdya',
 'ydsterdya',
 'yeasterdya',
 'yeaterdya',
 'yebsterdya',
 'yebterdya',
 'yecsterdya',
 'yecterdya',
 'yedsterdya',
 'yedterdya',
 'yeesterdya',
 'yeeterdya',
 'yefsterdya',
 'yefterdya',
 'yegsterdya',
 'yegterdya',
 'yehsterdya',
 'yehterdya',
 'yeisterdya',
 'yeiterdya',
 'yejsterdya',
 'yejterdya',
 'yeksterdya',
 'yekterdya',
 'yelsterdya',
 'yelterdya',
 'yemsterdya',
 'yemterdya',
 'yensterdya',
 'yenterdya',
 'yeosterdya',
 'yeoterdya',
 'yepsterdya',
 'yepterdya',
 'yeqsterdya',
 'yeqterdya',
 'yersterdya',
 'yerterdya',
 'yesaerdya',
 'yesaterdya',
 'yesberdya',
 'yesbterdya',
 'yescerdya',
 'yescterdya',
 'yesderdya',
 'yesdterdya',
 'yeseerdya',
 'yeserdya',
 'yeseterdya',
 'yesetrdya',
 'yesferdya',
 'yesfterdya',
 'yesgerdya',
 'yesgterdya',
 'yesherdya',
 'yeshterdya',
 'yesierdya',
 'yesiterdya',
 'yesjerdya',
 'yesjterdya',
 'yeskerdya',
 'yeskterdya',
 'yeslerdya',
 'yeslterdya',
 'yesmerdya',
 'yesmterdya',
 'yesnerdya',
 'yesnterdya',
 'yesoerdya',
 'yesoterdya',
 'yesperdya',
 'yespterdya',
 'yesqerdya',
 'yesqterdya',
 'yesrerdya',
 'yesrterdya',
 'yesserdya',
 'yessterdya',
 'yestaerdya',
 'yestardya',
 'yestberdya',
 'yestbrdya',
 'yestcerdya',
 'yestcrdya',
 'yestderdya',
 'yestdrdya',
 'yesteadya',
 'yesteardya',
 'yestebdya',
 'yestebrdya',
 'yestecdya',
 'yestecrdya',
 'yesteddya',
 'yestedrdya',
 'yestedrya',
 'yestedya',
 'yesteedya',
 'yesteerdya',
 'yestefdya',
 'yestefrdya',
 'yestegdya',
 'yestegrdya',
 'yestehdya',
 'yestehrdya',
 'yesteidya',
 'yesteirdya',
 'yestejdya',
 'yestejrdya',
 'yestekdya',
 'yestekrdya',
 'yesteldya',
 'yestelrdya',
 'yestemdya',
 'yestemrdya',
 'yestendya',
 'yestenrdya',
 'yesteodya',
 'yesteordya',
 'yestepdya',
 'yesteprdya',
 'yesteqdya',
 'yesteqrdya',
 'yesteradya',
 'yesteraya',
 'yesterbdya',
 'yesterbya',
 'yestercdya',
 'yestercya',
 'yesterda',
 'yesterdaa',
 'yesterday',
 'yesterdaya',
 'yesterdba',
 'yesterdbya',
 'yesterdca',
 'yesterdcya',
 'yesterdda',
 'yesterddya',
 'yesterdea',
 'yesterdeya',
 'yesterdfa',
 'yesterdfya',
 'yesterdga',
 'yesterdgya',
 'yesterdha',
 'yesterdhya',
 'yesterdia',
 'yesterdiya',
 'yesterdja',
 'yesterdjya',
 'yesterdka',
 'yesterdkya',
 'yesterdla',
 'yesterdlya',
 'yesterdma',
 'yesterdmya',
 'yesterdna',
 'yesterdnya',
 'yesterdoa',
 'yesterdoya',
 'yesterdpa',
 'yesterdpya',
 'yesterdqa',
 'yesterdqya',
 'yesterdra',
 'yesterdrya',
 'yesterdsa',
 'yesterdsya',
 'yesterdta',
 'yesterdtya',
 'yesterdua',
 'yesterduya',
 'yesterdva',
 'yesterdvya',
 'yesterdwa',
 'yesterdwya',
 'yesterdxa',
 'yesterdxya',
 'yesterdy',
 'yesterdya',
 'yesterdyaa',
 'yesterdyab',
 'yesterdyac',
 'yesterdyad',
 'yesterdyae',
 'yesterdyaf',
 'yesterdyag',
 'yesterdyah',
 'yesterdyai',
 'yesterdyaj',
 'yesterdyak',
 'yesterdyal',
 'yesterdyam',
 'yesterdyan',
 'yesterdyao',
 'yesterdyap',
 'yesterdyaq',
 'yesterdyar',
 'yesterdyas',
 'yesterdyat',
 'yesterdyau',
 'yesterdyav',
 'yesterdyaw',
 'yesterdyax',
 'yesterdyay',
 'yesterdyaz',
 'yesterdyb',
 'yesterdyba',
 'yesterdyc',
 'yesterdyca',
 'yesterdyd',
 'yesterdyda',
 'yesterdye',
 'yesterdyea',
 'yesterdyf',
 'yesterdyfa',
 'yesterdyg',
 'yesterdyga',
 'yesterdyh',
 'yesterdyha',
 'yesterdyi',
 'yesterdyia',
 'yesterdyj',
 'yesterdyja',
 'yesterdyk',
 'yesterdyka',
 'yesterdyl',
 'yesterdyla',
 'yesterdym',
 'yesterdyma',
 'yesterdyn',
 'yesterdyna',
 'yesterdyo',
 'yesterdyoa',
 'yesterdyp',
 'yesterdypa',
 'yesterdyq',
 'yesterdyqa',
 'yesterdyr',
 'yesterdyra',
 'yesterdys',
 'yesterdysa',
 'yesterdyt',
 'yesterdyta',
 'yesterdyu',
 'yesterdyua',
 'yesterdyv',
 'yesterdyva',
 'yesterdyw',
 'yesterdywa',
 'yesterdyx',
 'yesterdyxa',
 'yesterdyy',
 'yesterdyya',
 'yesterdyz',
 'yesterdyza',
 'yesterdza',
 'yesterdzya',
 'yesteredya',
 'yestereya',
 'yesterfdya',
 'yesterfya',
 'yestergdya',
 'yestergya',
 'yesterhdya',
 'yesterhya',
 'yesteridya',
 'yesteriya',
 'yesterjdya',
 'yesterjya',
 'yesterkdya',
 'yesterkya',
 'yesterldya',
 'yesterlya',
 'yestermdya',
 'yestermya',
 'yesterndya',
 'yesternya',
 'yesterodya',
 'yesteroya',
 'yesterpdya',
 'yesterpya',
 'yesterqdya',
 'yesterqya',
 'yesterrdya',
 'yesterrya',
 'yestersdya',
 'yestersya',
 'yestertdya',
 'yestertya',
 'yesterudya',
 'yesteruya',
 'yestervdya',
 'yestervya',
 'yesterwdya',
 'yesterwya',
 'yesterxdya',
 'yesterxya',
 'yesterya',
 'yesteryda',
 'yesterydya',
 'yesteryya',
 'yesterzdya',
 'yesterzya',
 'yestesdya',
 'yestesrdya',
 'yestetdya',
 'yestetrdya',
 'yesteudya',
 'yesteurdya',
 'yestevdya',
 'yestevrdya',
 'yestewdya',
 'yestewrdya',
 'yestexdya',
 'yestexrdya',
 'yesteydya',
 'yesteyrdya',
 'yestezdya',
 'yestezrdya',
 'yestferdya',
 'yestfrdya',
 'yestgerdya',
 'yestgrdya',
 'yestherdya',
 'yesthrdya',
 'yestierdya',
 'yestirdya',
 'yestjerdya',
 'yestjrdya',
 'yestkerdya',
 'yestkrdya',
 'yestlerdya',
 'yestlrdya',
 'yestmerdya',
 'yestmrdya',
 'yestnerdya',
 'yestnrdya',
 'yestoerdya',
 'yestordya',
 'yestperdya',
 'yestprdya',
 'yestqerdya',
 'yestqrdya',
 'yestrdya',
 'yestredya',
 'yestrerdya',
 'yestrrdya',
 'yestserdya',
 'yestsrdya',
 'yestterdya',
 'yesttrdya',
 'yestuerdya',
 'yesturdya',
 'yestverdya',
 'yestvrdya',
 'yestwerdya',
 'yestwrdya',
 'yestxerdya',
 'yestxrdya',
 'yestyerdya',
 'yestyrdya',
 'yestzerdya',
 'yestzrdya',
 'yesuerdya',
 'yesuterdya',
 'yesverdya',
 'yesvterdya',
 'yeswerdya',
 'yeswterdya',
 'yesxerdya',
 'yesxterdya',
 'yesyerdya',
 'yesyterdya',
 'yeszerdya',
 'yeszterdya',
 'yeterdya',
 'yetserdya',
 'yetsterdya',
 'yetterdya',
 'yeusterdya',
 'yeuterdya',
 'yevsterdya',
 'yevterdya',
 'yewsterdya',
 'yewterdya',
 'yexsterdya',
 'yexterdya',
 'yeysterdya',
 'yeyterdya',
 'yezsterdya',
 'yezterdya',
 'yfesterdya',
 'yfsterdya',
 'ygesterdya',
 'ygsterdya',
 'yhesterdya',
 'yhsterdya',
 'yiesterdya',
 'yisterdya',
 'yjesterdya',
 'yjsterdya',
 'ykesterdya',
 'yksterdya',
 'ylesterdya',
 'ylsterdya',
 'ymesterdya',
 'ymsterdya',
 'ynesterdya',
 'ynsterdya',
 'yoesterdya',
 'yosterdya',
 'ypesterdya',
 'ypsterdya',
 'yqesterdya',
 'yqsterdya',
 'yresterdya',
 'yrsterdya',
 'ysesterdya',
 'yseterdya',
 'yssterdya',
 'ysterdya',
 'ytesterdya',
 'ytsterdya',
 'yuesterdya',
 'yusterdya',
 'yvesterdya',
 'yvsterdya',
 'ywesterdya',
 'ywsterdya',
 'yxesterdya',
 'yxsterdya',
 'yyesterdya',
 'yysterdya',
 'yzesterdya',
 'yzsterdya',
 'zesterdya',
 'zyesterdya'}
In [36]:
known(edits1("yesterdya"))
Out[36]:
{'yesterday'}

Edits2

And then in edits2 we just compose the same function with itself. For each word from the candidates in edits1, we apply the same funciton and then get another set of candidates.

In [22]:
def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

Candidates

Then we will check whether the words we generated in edits1 and edits2 are in our dictionary, using candidate().

It's a bit tricky in programming, most of us probably use complicated if-then function, but here the property of or in programming language is that the values are evaluated in a boolean context from left to right, just like and. If any value is true, or returns that value immediately. check it here

So how the candidate() functon works is if the word we type is already in the vocabulary, the program will stop here and return it. If it is not in the vocabulary, it will look at things in edits1. If there is words from edits1 in the vocabulary, it will return that.

Actually on Norving's website he also has detailed explanation:

I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:

  • The original word, if it is known; otherwise
  • The list of known words at edit distance one away, if there are any; otherwise
  • The list of known words at edit distance two away, if there are any; otherwise
  • The original word, even though it is not known.

Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority will have the same probability (according to our flawed model).

In [30]:
def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) 
            or known(edits1(word)) 
            or known(edits2(word)) or [word])
In [28]:
candidates("wrod")
Out[28]:
{'rod', 'trod', 'wood', 'word'}

Since we have more than one word in candidates, which one should we choose?

According to Bayesian method, we are supposed to select the one with highest possibility in our dictionary.

We use P() to calculate the probability of a certain word.

In [41]:
def P(word, N=sum(WORDS.values())): 
    return WORDS[word] / N
In [42]:
P("word")
Out[42]:
0.00026712442350874206

Recall the The optional key argument in the built in function max() specifies a one-argument ordering function like that used for list.sort(). The key argument, if supplied, must be in keyword form (for example, max(a,b,c,key=func)). Here, the key is P() function.

In [43]:
def correction(word): 
    return max(candidates(word), key=P)

Frog

Here we have done the spelling correction model! Let's have a try.

In the video at http://www.youtube.com/watch?v=fKNJLeVncug , there are three misspellings of the word 'frog': as 'gorf', 'forg', and 'grof'.

What the model will return for the four words: 'frog', 'gorf', 'forg', and 'grof'?

In [44]:
correction('frog')
Out[44]:
'frog'
In [46]:
correction('gorf')
Out[46]:
'gory'
In [47]:
correction('forg')
Out[47]:
'for'
In [45]:
correction('grof')
Out[45]:
'grow'
In [ ]: