Programmatic Sentence Generation – Shakespeare!

Introduction

Many moons ago, when AOL Instant Messenger was all the rage, I really wanted to write some kind of chat bot, something like SmarterChild except, perhaps, more amusing and less useful. I do recall attempting to use some Perl-based chatbot APIs to connect one of my accounts to AIM, but had very little success. So I gave up on the AIM API, and focused more on the core functionality of what I wanted a chatbot to do: simply construct a sentence given a “seed” word as the first word in the sentence. Therefore, the “Shakespeare” project was born.

What’s in a Name?

Why is this application called Shakespeare? Well, actually, originally it wasn’t – it was called Jabberwocky. I needed a source text from which to construct sentences, and one of the free online books available on Project Gutenberg was Through the Looking Glass by Lewis Carroll, and Jabberwocky is one of my favorite poems from the book. Not to mention the fact that a programmtically-generated sentence would likely sound rather silly and hilarious. Or so I hoped. Anyway, the first version of my program used Jabberwocky as the source text, but later I decided that I wanted a much larger source text, such as one of the works of Shakespeare. I grabbed the text from The Merchant of Venice, cleaned it up, and started using it, and the sentences it generates now are just as amusing or more.

Program Source

This project is written in C++, and is available on GitHub for public consumption here: Shakespeare. Feel free to check it out. The project opens in Visual Studio 2008, but if you don’t have VS2008 installed then it should be pretty easy to edit and compile using g++ or Eclipse or something =)

Source Explanation

The program works via a simple two-step process:

  1. The source text is parsed and string pairs are put into a multimap
  2. The program reads lines from stdin until “quit” is received, and generates a “sentence” for each word entered.

Let’s examine these steps more closely:

Parsing the source text

    while (!file.eof())
{
std::string s;
file >> s;
s = strlower(s);

// Look for a period
if (s.length() > 1 && s[s.length() - 1] == '.')
{
s = s.substr(0, s.length()-1);
file.putback('.');
}
// Look for an exclamation
else if (s.length() > 1 && s[s.length() - 1] == '!')
{
s = s.substr(0, s.length()-1);
file.putback('!');
}
// Look for a question-mark
else if (s.length() > 1 && s[s.length() - 1] == '?')
{
s = s.substr(0, s.length()-1);
file.putback('?');
}
// Look for a comma
else if (s.length() > 1 && s[s.length() - 1] == ',')
{
s = s.substr(0, s.length()-1);
file.putback(',');
}

// Add the data to the map
if (last != "" && last != "." && last != "!" && last != "?")
{
wordmap.insert(make_pair(last, s));
}

last = s;
}

The code here reads strings from the file, where each string is separated by whitespace. For the most part these strings represent words, but some of the words end with periods, commas, exclamation marks, or question marks. If one of these characters is encountered, we want to treat it like its own “word”. Next, for each word that we find, we insert it into a multimap where the previously encountered word is the key and the current word is the value. By the way, a multimap is just a C++ map that can have multiple non-unique key-value pairs.

Generating the sentence

        while (true)
{
// Check to see if the word is in the multimap
if (wordmap.find(word) != wordmap.end())
{
// Get first mapped value
multimap::iterator lb = wordmap.lower_bound(word);

// Count how many there are
long count = wordmap.count(word);

// Select one of them
long index = rand() % count;

// Advance map iterator
for (long i = 0; i < index; i++)
{
lb++;
}

// Check for characters that would end a sentence
if (lb->second == "." || lb->second == "!" || lb->second == "?")
{
// If sentence is too short, select another word
if (wordcount 1)
{
// Update the word count in case the only endings to this
// word are ! or . or ?
wordcount++;
continue;
}

// Print word
cout <second << endl;
break;
}
else if (lb->second == ",")
{
// Print commas
cout <second;
}
else
{
// Print the word with a space
cout << " " <second;
wordcount++;
}

// Update the current word
word = lb->second;
}
else
{
// Not found; end the sentence
cout << ".\n";
break;
}
}

The code here looks up the first mapped value for the current word in the multimap (if it exists). If it does, then we pick one of the values for this key at random. Note that multiple identical values can exist because of how the source text was parsed, so you have a higher chance of getting a more common “following” word than one that is not so common. Each following word is printed, unless the word is a “sentence-ending” word such as the period, exclamation mark, or question mark and the sentence is deemed to be too small. Not shown here is the initialization of the first word to be the word that the user typed into the application.

Some Example Outputs

I find many of these to be hilarious. The sentences have a tendency to run on and obviously don’t quite follow the rules of sentences. But because each source word precedes a destination word, the sentences generated often almost make sense:

Will be render’d, will you, monsieur le bon?

Why should i know your master will tarry?

Try confusions with his surety and you the occasion to his affections, dispatch all of rich value of fair hand i am in silver ewes, as he hath the fiend bids me this shadow ere thou to him.

Thou like a night troilus methinks it.

Wit in the duke only are not love me to speak for the truth is no so again his outward wall, for a man knows but who bids me the wind!

Pity on his wealth being seasoned with us in alabaster?

Jump with imagined speed, indeed, for this letter call me dog and that is sum that the knave and it to offend, but assumes some taste of blood to my clerk, for i do you a key, there.

Improvements?

Feel free to leave a comment if you have any ideas for ways to make the generated sentences either more realistic, or more amusing =)

Published by

HexarA

Seattleite. Climber. Snowboarder. Traveler. Party rocker. Technologist. Spanish enthusiast. Fun-seeker.

Leave a comment