Thursday, July 31, 2003

Local Thesaurus and Dictionary - a technique

I’d love to have the contents of dictionary.reference.com and thesaurus.reference.com (and, for that matter, Encyclopedia Britannica) on my local machines, so that I wouldn’t have to be online to get access. Now, there’s not much chance of getting a CD copy to work, since I work in Linux and there aren’t any Linux readers. Sure, I’ve got win4lin, but win4lin has some limitations involving multimedia. And VMware is too slow on this box.



So I’m thinking, it’d be an interesting afternoon’s work to download a bunch of web pages, strip out the unique words in them, add the contents of the Unix wordlist, and maybe crawl a lot of random blogs to a link depth of 5 or so. That should cover almost all the words in the English language.
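
A rough sketch of that first step in Python, assuming the pages have already been downloaded and are sitting in memory as HTML strings. The names `words_in_html` and `build_vocabulary` are made up for illustration, and the word pattern is a simplification that only keeps plain a-z words (with an optional straight apostrophe).

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


# Assumption: a "word" is a run of a-z letters, optionally with one apostrophe part.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def words_in_html(html):
    """Return the set of unique lowercase words in the visible text of a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return set(WORD_RE.findall(" ".join(parser.chunks).lower()))


def build_vocabulary(pages, wordlist_lines=()):
    """Merge the unique words from every page with the lines of a wordlist."""
    vocab = set()
    for html in pages:
        vocab |= words_in_html(html)
    for line in wordlist_lines:
        word = line.strip().lower()
        if word:
            vocab.add(word)
    return vocab
```

On many Unix systems the wordlist lives at `/usr/share/dict/words` (the location varies), so something like `build_vocabulary(pages, open("/usr/share/dict/words"))` would fold it in.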


Then, after taking all the unique words out, run them through the thesaurus. Try each word in turn, skipping any word that has already appeared in the results of a previous thesaurus lookup. That should give us almost everything right there.
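
The skip rule can be sketched like this. The `lookup_synonyms` argument is a hypothetical stand-in for whatever actually queries the thesaurus (I’m not assuming any particular API for the site); it just has to take a word and return an iterable of related words.

```python
def expand_with_thesaurus(words, lookup_synonyms):
    """Grow a vocabulary by looking words up in a thesaurus.

    lookup_synonyms is a hypothetical callable: word -> iterable of words.
    A word is skipped if it already appeared in an earlier lookup's results.
    """
    vocabulary = set(words)
    seen_in_results = set()
    for word in sorted(words):  # sorted only to make the order repeatable
        if word in seen_in_results:
            continue  # already turned up in a previous thesaurus run
        synonyms = set(lookup_synonyms(word))
        seen_in_results |= synonyms
        vocabulary |= synonyms
    return vocabulary
```

The skip saves lookups, at the cost of missing any synonyms that only the skipped word would have returned. Running the function again on its own output would pick some of those up.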


And finally, with our new list of unique words, run them through the dictionary.
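
Since the whole point is offline access, the definitions need to land somewhere local. One possible sketch, assuming SQLite as the store (my choice for illustration, nothing above requires it) and a hypothetical `lookup_definition` callable that returns a definition string or `None`:

```python
import sqlite3


def store_definitions(db_path, words, lookup_definition):
    """Fetch a definition for each word and keep it in a local SQLite file.

    lookup_definition is a hypothetical callable: word -> str or None.
    Words already stored are not fetched again, so a run can be resumed.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS definitions ("
            "word TEXT PRIMARY KEY, definition TEXT NOT NULL)"
        )
        for word in sorted(words):
            already = conn.execute(
                "SELECT 1 FROM definitions WHERE word = ?", (word,)
            ).fetchone()
            if already:
                continue
            definition = lookup_definition(word)
            if definition:
                conn.execute(
                    "INSERT INTO definitions (word, definition) VALUES (?, ?)",
                    (word, definition),
                )
                conn.commit()
    finally:
        conn.close()


def define(db_path, word):
    """Look a word up in the local file; returns None if it is not stored."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT definition FROM definitions WHERE word = ?", (word,)
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```

After that, `define("words.db", "cat")` works with the network cable unplugged.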


I wonder what the legal repercussions would be though. Not that it matters. I’m probably not going to do this, and even if I do, it’ll be only for my own amusement. Heh, I probably won’t even use the data, probably deleting it after a week or so. But the legal status would be interesting.
