I recently gave out an assignment where I asked students to produce high-quality gibberish based on Markov chains and a large input corpus. A classic of computer science education, if you will. This got me thinking – how can we be sure that we are outputting semantically meaningful gibberish? And what is an essay, but semantically meaningful gibberish?

I propose a high-level Markov chainer. We will run Latent Semantic Indexing on a bunch of documents, and on a bunch of document chunks. We will then find out which concepts tend to follow which other concepts, and chain these concepts together, with the restriction that you can’t end up more than a few hops from where you started. Then, we can treat all the document chunks that we have on each successive subject as input for a separate Markov chainer (3 to 5 of the previous words taken into account) for each chunk. Doing all this, I bet we could write a semi-cogent essay on a topic that would get an A from a 5th grade teacher.
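To make the two levels concrete, here is a rough Python sketch of how this might be wired together. It makes several assumptions that are not part of the proposal: each chunk already carries a topic label (a stand-in for whatever concept Latent Semantic Indexing would assign it), the "few hops" restriction is read as breadth-first distance from the starting topic in the topic transition graph, and all chunks on a topic are pooled into one word chain. The function names (`build_chain`, `walk`, `topic_path`, `write_essay`) are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(tokens, order):
    """Map each run of `order` consecutive tokens to the tokens seen right after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain


def walk(chain, start, length, rng):
    """Extend `start` by sampling successors until `length` tokens or a dead end."""
    out = list(start)
    order = len(start)
    while len(out) < length:
        successors = chain.get(tuple(out[-order:]))
        if not successors:
            break
        out.append(rng.choice(successors))
    return out


def topic_path(topic_chain, start, steps, max_hops, rng):
    """Random walk over topics that never strays more than `max_hops` from `start`."""
    # Assumption: "hops" means breadth-first distance in the topic transition graph.
    dist = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for topic in frontier:
            for successor in topic_chain.get((topic,), []):
                if successor not in dist:
                    dist[successor] = dist[topic] + 1
                    next_frontier.append(successor)
        frontier = next_frontier

    path = [start]
    while len(path) < steps:
        options = [t for t in topic_chain.get((path[-1],), []) if dist[t] <= max_hops]
        if not options:
            break
        path.append(rng.choice(options))
    return path


def write_essay(labeled_chunks, start_topic, steps=4, max_hops=2,
                order=2, words_per_topic=25, seed=0):
    """labeled_chunks: (topic, text) pairs in document order; labels stand in for LSI."""
    rng = random.Random(seed)
    # High level: which topic tends to follow which topic.
    topic_chain = build_chain([topic for topic, _ in labeled_chunks], 1)
    # Low level: pool the words of every chunk on a topic (a simplification that
    # also creates a few transitions across chunk boundaries).
    words_by_topic = defaultdict(list)
    for topic, text in labeled_chunks:
        words_by_topic[topic].extend(text.split())

    paragraphs = []
    for topic in topic_path(topic_chain, start_topic, steps, max_hops, rng):
        words = words_by_topic[topic]
        chain = build_chain(words, order)
        paragraphs.append(" ".join(walk(chain, words[:order], words_per_topic, rng)))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    toy_corpus = [
        ("dogs", "dogs like to run and dogs like to bark at the mail"),
        ("cats", "cats like to nap and cats like to purr in the sun"),
        ("dogs", "dogs chase cats and dogs chase balls in the park"),
        ("fish", "fish swim in water and fish swim in bowls all day"),
    ]
    print(write_essay(toy_corpus, "dogs"))
```

With a real corpus you would raise `order` to the 3 to 5 words suggested above; the toy corpus here is far too small for that, which is why the default is 2. Swapping the hand-written labels for real LSI output is the hard part and is left out entirely.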

A nice test of this hypothesis might be to send a bunch of these essays out to a bunch of teachers, mixed in with some actual 5th grade essays, and ask them to grade each one, telling them that we were doing it to compare the equitability of grades among teachers. We would, of course, not tell them that some of the documents were fake – when people are conditioned to think things are fake, they end up ferreting out small mistakes. If a person were told that I was a clever Perl script designed to pass the Turing test, I doubt I could change their mind if we chatted over AIM.