Monday, January 28, 2008

Homophone Checker

Remember homophones from...like first grade? The problem is that they can still trip us up today. And your computer's nifty spell checker won't even catch the slip-up, because the wrong spelling is still a word! I had a crazy idea to write a Homophone Checker: a spell checker that will know when you used the wrong spelling of a homophone.

I'm writing it in Perl. It is in the very early stages of development right now. I'm mainly writing Perl scripts to parse large amounts of text, gathering all the statistics I possibly can about properly-used homophones. These statistics will be used by the eventual Homophone Checker in determining the probability that the right homophone spelling was used.

Where better to find large (and I mean LARGE) amounts of text with (hopefully) properly-used homophones than Wikipedia with its over 2 million articles? Every few weeks, Wikipedia dumps all the articles into a single file they make available for download to programmers wanting a way to do analysis on the complete Wikipedia text. If you had to guess how big Wikipedia is (minus images and revision history) in gigabytes, how big would you guess?

Uncompressed: 13.9 GB. That would fill 3 DVDs! Thankfully they compress it pretty heavily for downloading, but the compressed file still weighed in at 3.2 GB and took an entire night to download.
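For the curious, here is a rough sketch of how a script could read a dump that size one article at a time, without ever unpacking the whole 13.9 GB. It is in Python rather than the Perl I'm actually using, and it rests on some assumptions: that the download is a bzip2-compressed XML file, that each article sits in a page element with a title and a text child, and the function names and the file name in the comment are made up for illustration.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_stream):
    """Yield (title, text) pairs one page at a time from an open XML stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # release the finished page's contents


# Hypothetical usage (the file name is an assumption):
# with bz2.open("wikipedia-dump.xml.bz2", "rb") as stream:
#     for title, text in iter_articles(stream):
#         ...  # hand the article text to the homophone counter
```

Because the generator only holds one page at a time, memory use stays small no matter how large the dump is.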

I based my list of homophones on this list available free online for academic use and research. The first Perl script I wrote scans all of Wikipedia for these words counting how many times they occur. Then it generates a list showing what percentage of the time a specific spelling of a homophone is used compared to its other spellings.
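The counting and percentage steps described above might look something like this. It is a minimal Python sketch, not my actual Perl script: the three homophone groups are a tiny stand-in for the real list, and the function names and the simple lowercase tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

# Stand-in sample; the real list is far longer.
HOMOPHONE_GROUPS = [
    ("their", "there", "they're"),
    ("right", "write"),
    ("know", "no"),
]

# Lowercase words, optionally with one apostrophe part (e.g. "they're").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_homophones(text, groups=HOMOPHONE_GROUPS):
    """Count how often each listed spelling occurs in the text."""
    wanted = {word for group in groups for word in group}
    counts = Counter()
    for token in WORD_RE.findall(text.lower()):
        if token in wanted:
            counts[token] += 1
    return counts


def spelling_shares(counts, groups=HOMOPHONE_GROUPS):
    """For each group, return each spelling's share of the group total, in percent."""
    shares = {}
    for group in groups:
        total = sum(counts[word] for word in group)
        if total == 0:
            continue  # no spelling from this group was seen
        for word in group:
            shares[word] = 100.0 * counts[word] / total
    return shares


sample = "I know there is no right way to write. They're over there with their dog."
counts = count_homophones(sample)
shares = spelling_shares(counts)
```

In the sample sentence "there" appears twice while "their" and "they're" appear once each, so "there" gets a 50% share of its group.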

I thought you might find the statistics thus far interesting. The list below is just a preliminary one, generated from scanning 6,000 Wikipedia articles. Scanning all 2 million+ articles will probably take my computer about 3 days of processing. Scanning articles that average 435 words/article looking for 1,543 different homophones is pretty processor-intensive. My script is scanning a little less than 500 articles/minute--a lot faster than a human doing it by hand!
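The 3-day estimate is just arithmetic on the numbers above. Here it is as a quick Python check, treating "2 million+" as exactly 2,000,000 and "a little less than 500" as exactly 500 (both simplifications; the variable and function names are made up for the example):

```python
TOTAL_ARTICLES = 2_000_000   # "2 million+" rounded down
ARTICLES_PER_MINUTE = 500    # "a little less than 500" rounded up
WORDS_PER_ARTICLE = 435      # average quoted in the post


def estimated_days(total_articles, articles_per_minute):
    """Convert an article count and scan rate into days of processing."""
    minutes = total_articles / articles_per_minute
    return minutes / (60 * 24)


days = estimated_days(TOTAL_ARTICLES, ARTICLES_PER_MINUTE)
total_words = TOTAL_ARTICLES * WORDS_PER_ARTICLE

print(f"{days:.1f} days")          # prints "2.8 days"
print(f"{total_words:,} words")    # prints "870,000,000 words"
```

Since the real article count is higher and the real rate is lower, 2.8 days is a floor, which is why I'm calling it about 3.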

Frequency List (Based on 6,000 Wikipedia Articles)

Saturday, January 12, 2008

This Post Wrote Itself

Have you ever heard an author make the comment that a story wrote itself or a certain character dictated how they should develop? For example, Neil Gaiman, a favorite author of mine, just posted the following a few days ago on his blog:

I'm more or less happily writing Chapter Six of The Graveyard Book. I say more or less as I'm at that place where I hope that the book knows what it's doing because right now I don't have a clue -- I'm writing one scene after another like a man walking through a valley in thick fog, just able to see the path a little way ahead, but with no idea where it's actually going to lead him.

This kind of comment has always made me mad. I've always thought, "Writing--good writing at least--is hard work. Where do these people get off with the story doing all the work for them? I've never had one of my characters come to life and tell me, 'Okay, Mister, this is how it's going to be.'"

I think I understand the comment a little better now. I just finished the first draft of the third episode of Map Makers. It currently comes in at 5,049 words. I wrote about 2,200 of those words (a little under half the story) in five hours today. Trust me, that is extremely fast for me. I like each episode to come in under 5,000 words, so I still have a little revising to do before it's ready to post (and I'll be officially creeped out if it comes in at exactly 4,989 words).

There is a plot development in the third episode (I'm not going to give it away, so you'll just have to read the story to find out) that I really didn't want to happen. I built the complication up and up and up until it was the only resolution that made sense. I thought of three alternate resolutions, but they all seemed such a stretch that they cheapened the story, so I kept the one I really didn't want to happen. I guess you could say, the story wrote itself.

Thursday, January 10, 2008

Creepy Coincidence

I finally posted the second episode of Map Makers. "Tongue Tied" is up for the world to see.

My goal for each episode was 5,000 words. After posting the second episode, I decided to take a look at the word count for each story. Both stories had exactly 4,989 words. It's a creepy coincidence that, without even trying, both stories ended up with the exact same word count. Actually, I probably couldn't have accomplished that if I had tried.

Sunday, January 6, 2008

It's Kind of Like You Never Left

A new era of internet is upon us. First, there were text browsers. Then there were graphical browsers. Now I have created...

Bookmarks!

I know what you're thinking: "Haven't those been around for a while? I think Internet Explorer even has them." Those aren't the kind of bookmarks I'm talking about. I've created a way to make reading Map Makers even easier!

I understand how busy people are, and that at 5,000 words, you may not have time to read all of the first episode, "The Pluto Incident," in one sitting. I have added small bookmark icons beside each paragraph. By clicking the bookmark icon where you stop reading, you will be returned right to where you left off when you come back!

This nifty feature will also be available for the next two (long overdue) episodes as soon as I get them posted. I meant to spend more time writing today, but I ended up spending six hours writing 67 lines of code to make the new bookmark feature a reality. Let me know what you think of the new feature, and, of course, any thoughts you have on "The Pluto Incident."