Babble: Mockingbird
Examples
It's been pointed out to me that people want to see results first, so, unlike
academia, I've put the examples before the explanation. The links below will
give you text imitating the writer's style, with some curious results.
(Be patient as the program constructs the matrix --
Shelley's Frankenstein takes 30 seconds to read sometimes.)
I'll confess I removed the Project Gutenberg "small print" text, but
technically I'm not redistributing the texts themselves. I hope they
give me credit for promoting their excellent
philanthropy.
History
This is another project in my experiments with Natural Language Processing
(NLP). I've called it 'Mockingbird' because it attempts to imitate the
writing style of another written work. The real name is Babbler, but I
decided to use that as an umbrella for my NLP projects and research.
I had heard some time ago, must've been 1993, about software written to
help spoof someone else on USENET. "Wow, great idea!" thought my hacker
self, but as usual I didn't have the malevolence to seek out this tool or
to find a victim to target. I didn't think of it much until I came across
an MS-DOS program (also called Babbler) that claimed to do the same. I
employed it to imitate a friend's eclectic stream-of-consciousness writing,
and challenged his friends to identify which passage was
written by the human. MS-DOS Babbler lost, no contest, but it still got
me thinking.
Later, I wrote TinGoth and used it as a puppet. People
found it very entertaining (especially its caustic attitude towards its
creator), and I set on the idea of getting TinGoth to speak for itself.
First I explored AI attempts like Eliza, Racter and Parry (I'll show off my
successes in that later). The rumour about the USENET imitation software
came to mind again, so I started working on something similar.
The method I decided on for the imitator is very similar to a statistical
technique called Markov Chains.
I didn't know it at the time; all I had to work on was a goal and a cloudy
memory of seeing a similar algorithm before. A Markov Chain is a sequence
with a finite number of different values, where each value in the sequence
depends only on the previous value. Strictly speaking it's states and
probability vectors, and usually when you're dealing with Markov Chains you
only want to know the last state/value, whereas I'm interested in the whole
sequence. In the case of imitating someone's style, I'm hoping to learn the
probability of a word occurring when it follows another word.
First I start by reading a large text of the human's writing. I've been
using Carroll's "Alice's Adventures in Wonderland"
and the King James Bible book of Genesis, thanks to
Project Gutenberg, and an essay written
by a friend of mine on a theological subject. The babbler parses the text
into a stream of tokens,
where each space-separated word is a token, and the start and end of a
sentence are special tokens. A hashtable is formed, where each key-token has
a list of value-tokens that have followed it in the text (and how often each
value-token occurs, which is later flattened into probabilities). After
finishing the text, the program has an imitation matrix ready to start
churning out sentences in the author's style.
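The table-building step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the babbler's actual code: the names `tokenize`, `build_table` and `to_probabilities`, the `<START>`/`<END>` marker strings, and the rule that a sentence ends at any word finishing with `.`, `!` or `?` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical marker strings for the special start/end-of-sentence tokens.
START, END = "<START>", "<END>"
# Assumed rule: a word ending in one of these closes a sentence.
SENTENCE_ENDERS = (".", "!", "?")


def tokenize(text):
    """Split text on whitespace and wrap each sentence in START/END tokens."""
    tokens = [START]
    for word in text.split():
        tokens.append(word)
        if word.endswith(SENTENCE_ENDERS):
            tokens.extend([END, START])
    if tokens[-1] == START:
        tokens.pop()        # drop the dangling START after the last sentence
    else:
        tokens.append(END)  # close a trailing sentence with no punctuation
    return tokens


def build_table(tokens):
    """Map each key-token to a count of the value-tokens that followed it."""
    table = defaultdict(Counter)
    for key, value in zip(tokens, tokens[1:]):
        if key != END:      # nothing "follows" the end of a sentence
            table[key][value] += 1
    return table


def to_probabilities(table):
    """Flatten the follower counts into probabilities that sum to 1 per key."""
    return {
        key: {value: n / sum(counts.values()) for value, n in counts.items()}
        for key, counts in table.items()
    }


table = build_table(tokenize("the cat sat. the dog sat."))
print(dict(table["the"]))  # {'cat': 1, 'dog': 1}
```

Keeping raw counts and only converting to probabilities at the end means more text can be folded into the same table later without redoing the arithmetic.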
To go from imitation matrix to output, the babbler starts with the start of
sentence special key-token, finds its list of value-tokens, and chooses
a value-token randomly (with the same frequency as was observed in the text).
The value-token is sent to output, and becomes the new key-token for
repeating the process. This continues until the end of sentence special
token has become the key-token, and output stops.
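The generation loop can be sketched the same way. Again this is a hedged illustration in Python rather than the real program: `babble`, the `<START>`/`<END>` markers, the tiny hand-written table and the `max_words` safety cap are assumptions for the example (the description above has no cap; it simply runs until the end-of-sentence token comes up).

```python
import random

# Hypothetical marker strings for the special start/end-of-sentence tokens.
START, END = "<START>", "<END>"

# A tiny hand-written imitation matrix: key-token -> {value-token: count}.
table = {
    START: {"the": 2},
    "the": {"cat": 1, "dog": 1},
    "cat": {"sat.": 1},
    "dog": {"sat.": 1},
    "sat.": {END: 2},
}


def babble(table, rng=random, max_words=100):
    """Walk the table from START until END is drawn, returning one sentence."""
    words = []
    key = START
    while len(words) < max_words:  # assumed cap so a looping table cannot run forever
        followers = table[key]
        # Pick a follower with the same relative frequency seen in the text.
        key = rng.choices(list(followers), weights=list(followers.values()))[0]
        if key == END:
            break
        words.append(key)
    return " ".join(words)


print(babble(table))  # either "the cat sat." or "the dog sat."
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which is handy when comparing runs.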
Shortcomings
I'd rather talk about these before showing you the sample outputs, because
the output can be quite impressive so hopefully you'll have forgotten these
after seeing it.
- Imitating sentence-by-sentence is great for one of short attention
span, but the text being analyzed is rarely a series of unrelated sentences.
Authors like Lewis Carroll, Ghod Almighty or my friend Pete tend to write
in paragraphs, and larger structures like essays or complex structures like
verse.
- The imitation matrix doesn't have anything that could be mistaken for
comprehension. Grammar is often broken in the output, simply because of
the flexibility of English words for many parts of speech.
- Quotations are sometimes orphaned, with a single " mark in a sentence.
I've merely obliterated them, but sometimes (especially when using more
power in creating the imitation matrix) a human will infer where quotation
marks "should" be placed.
- I used to have a method of saving imitation matrices to disk, but after
out-of-memory problems using Data::Dumper, I will have to write a new
method. In the meantime, the examples on this page are going to be slow
because they have to rebuild the imitation matrix each time.
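For the last point, one possible way to cache an imitation matrix between runs is to write the follower counts out as JSON. This is only a sketch of the idea in Python, not a replacement for the Data::Dumper code mentioned above; `save_table`, `load_table` and the file name `matrix.json` are hypothetical, and whether this would avoid the memory problems on a full-length novel is untested here.

```python
import json
import os
import tempfile


def save_table(table, path):
    """Write a {key-token: {value-token: count}} table to disk as JSON."""
    plain = {key: dict(followers) for key, followers in table.items()}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(plain, handle)


def load_table(path):
    """Read a table previously written by save_table."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary directory (hypothetical file name).
table = {"<START>": {"the": 2}, "the": {"cat": 1, "dog": 1}}
path = os.path.join(tempfile.mkdtemp(), "matrix.json")
save_table(table, path)
restored = load_table(path)
```

Loading a saved table skips the parsing pass entirely, so the slow step would only have to happen once per source text.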