Looking up words in a Dictionary using Python
First off, I do not mean dictionary in the Python sense of the word. I mean dictionary in the glossary sense, like Merriam-Webster. This collision of terminology makes Googling for this functionality particularly difficult and frustrating.
I came across three useful Python solutions, and I'm going to detail usage of two of them in this post.
Option 1: NLTK + Wordnet
First up is accessing Wordnet.
"Wordnet is a large lexical database of English..."The only Python way of accessing this (that I came across) is NLTK, a set of
"Open source Python modules, linguistic data and documentation for research and development in natural language processing..."
Getting NLTK Installed
For various reasons, NLTK is not packaged by Debian, so I had to install it by hand. Even if your distro does package NLTK, you might want to read this bit anyway. Installing was a cinch with easy_install nltk. However, this does not install the corpus (where wordnet is stored). As shown below:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets( 'cake' )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 68, in __getattr__
self.__load()
File "/usr/lib/python2.5/site-packages/nltk-2.0b8-py2.5.egg/nltk/corpus/util.py", line 56, in __load
except LookupError: raise e
LookupError:
**********************************************************************
Resource 'corpora/wordnet' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download().
Searched in:
- '/home/jmhobbs/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
So what we need to do is run the NLTK installer, as shown here:
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> wordnet
Downloading package 'wordnet' to /home/jmhobbs/nltk_data...
Unzipping corpora/wordnet.zip.
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
True
>>>
Using NLTK + Wordnet
Now that we have everything installed, using wordnet from Python is straight forward.
# Load the wordnet corpus
from nltk.corpus import wordnet
# Get a collection of synsets (synonym sets) for a word
synsets = wordnet.synsets( 'cake' )
# Print the information
for synset in synsets:
print "-" * 10
print "Name:", synset.name
print "Lexical Type:", synset.lexname
print "Lemmas:", synset.lemma_names
print "Definition:", synset.definition
for example in synset.examples:
print "Example:", example
The output of that is:
----------
Name: cake.n.01
Lexical Type: noun.artifact
Lemmas: ['cake', 'bar']
Definition: a block of solid substance (such as soap or wax)
Example: a bar of chocolate
----------
Name: patty.n.01
Lexical Type: noun.food
Lemmas: ['patty', 'cake']
Definition: small flat mass of chopped food
----------
Name: cake.n.03
Lexical Type: noun.food
Lemmas: ['cake']
Definition: baked goods made from or based on a mixture of flour, sugar, eggs, and fat
----------
Name: coat.v.03
Lexical Type: verb.contact
Lemmas: ['coat', 'cake']
Definition: form a coat over
Example: Dirt had coated her face
Perfect!
Caveats
There are some caveats to using WordNet with NLTK. First is that the definitions aren't always ordered in the way you would expect. For instance, look at the "cake" results above. Cake, as in the confection, is the third definition, which feels wrong. You can of course order and filter on the synset name to correct this to some degree.
Second, there is a major load time for getting WordNet ready to use. Your first call to wordnet.sysnsets will take considerably longer than the next ones. On my machine the difference was 3.5 seconds versus 0.0003 seconds.
Last, you are constrained to the English language, as analyzed by Pinceton. I'll address this issue in the next section.
Option 2: SDict Viewer
As I said above, using WordNet is simple, but restrictive. What if I want to use a foreign language dictionary or something? WordNet is only in English. This is where the SDict format comes in. It has lots of free resource files available at http://sdict.com/en/. The best existing parser I found was SDict Viewer which is a dead project, but remarkably complete.
SDict Viewer is an application
SDict Viewer is an application, so it's not an easy to install library. However, it is very well written and extracting what you need is simple. You can get my "library" version from http://github.com/jmhobbs/sdictviewer-lib.
Here is an example when it's all finished:
import sys
import sdictviewer.formats.dct.sdict as sdict
import sdictviewer.dictutil
dictionary = sdict.SDictionary( 'webster_1913.dct' )
dictionary.load()
start_word = sys.argv[1]
found = False
for item in dictionary.get_word_list_iter( start_word ):
try:
if start_word == str( item ):
instance, definition = item.read_articles()[0]
print "%s: %s" % ( item, definition )
found = True
break
except:
continue
if not found:
print "No definition for '%s'." % start_word
dictionary.close()
Here is a sample run:
jmhobbs@katya:~$ python okay.py Cat
Cat: (n.) An animal of various species of the genera Felis and Lynx. The domestic cat is Felis domestica. The European wild cat (Felis catus) is much larger than the domestic cat. In the United States the name wild cat is commonly applied to the bay lynx (Lynx rufus) See Wild cat, and Tiger cat.
wrote /home/jmhobbs/.sdictviewer/index_cache/webster_1913.dct-1.0.index
As you can see, it gives a nice definition (thank you Webster 1913) and then it has a little junk on the end. This is the index cache, a lookup table for finding words faster. You can avoid saving it by calling dictionary.close(False) instead.
Option 3: Aard Format
In option 2 I said that SDict Viewer was a dead project, this is because the development has been moved to the Aard Dictionary project. I chose not to pursue this format, as most of the existing resources are stored in HTML formats and I needed plain text. This might be ideal for you though, as they also provide access to Wikipedia archives.
All Done
So there you have it. Two viable ways of extracting a plain text definition for a word in Python. Best of luck to you!