Markov Chain Englishoid Sentences

OK, this is really geeky, so please bear with me.

In the classic DOS game Rogue, magic scrolls have titles with gibberish names, probably generated on the fly with a simple random character generator. The Rogue random scroll name generator seems to assume independence between successive characters — i.e., the probability that a character will be generated is not influenced by the immediately preceding characters. OK, at best, it might have some rule that you can’t have too many consonants in a row. In any case, it generates character strings that don’t follow the conventions of English, and that are often unpronounceable.

To illustrate what I’m talking about, this R command generates ten random small and capital letters:

rawToChar(as.raw(sample(65:122,10)))

An example of output is:

“ZhcOJbfPxW”

(Doesn’t look much like English, and looks impossible to pronounce. Might make a reasonably secure password, though. Actually, it’s secure no longer, now that it’s displayed for all the world to see.)

I have long thought that it might be fun to generate random words with non-independence of successive characters, and based on the frequencies observed in actual English text. I.e., if the letter q were randomly generated, then the next letter is probably going to be a u and not a k, because that’s how English works in general.

Now, Markov chains have been in my mind recently, partly because they were mentioned as a possible technique for some data imputation we had to do at work (we actually ended up not using Markov chains), and partly because of a recent MetaFilter post entitled Markov Bible. And then I thought, after all these years just thinking about it, why don’t I try implementing a Markov chain that generates random Englishoid words strung out in sentences, where the frequency distribution of characters is conditional on the preceding two characters, using real English sentences as data?

So, I went ahead and implemented such an algorithm in R, with two-character memory, i.e., the probability distribution of characters generated is based on the preceding two characters. Sentences were defined as starting with a capital letter, so to initialize sentences I just took the overall frequencies of capital letters as the starting distribution. And sentences were defined as ending with either a period, a question mark, or an exclamation point. If a line of text ended with a hyphen, it was assumed that a word had been split between that line and the following line; otherwise, it was assumed that there was a blank space between the last word of that line and the following line. As a self-imposed challenge, for memory efficiency, I required the code to read the text into a memory buffer line by line, rather than reading the whole text file into memory at once.

To test my code, I took the text of H.P. Lovecraft’s story, The Dunwich Horror, and put it into a plain text file. I then read the text file into R and put the two-character transition information into a matrix, using these two lines of R code:


fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

(buildTwoCharacterFreqMatrix() is one of the R functions I wrote; complete R code is given below.) I then generated 10 random Englishoid sentences with this loop:


# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}

And these are the 10 sentences that I got:

  1. “Night it thaff fly terye four er dow in’ hill Saw posprigarthe but – the fecand of the Whad of Arackeplated comithe pill.”
  2. “But of whan the few move deves wit.”
  3. “Deen.”
  4. “Earmits The hathe of dismew therethe becuen spook fable foreleatele whe hind to aritfuld boodmight ingicir and ded, of tweres The andeeteved ung thouse.”
  5. “Whal Ric in com tiong thot hing maing.”
  6. “Dr whipped, pooke rhathe fick hing to se proboadmit itaind hude feentomemany siong crome fols, ithe an – wok thad.”
  7. “Sep facle a bight ‘Thembere ninig spaing alkateley semplen tion lary canxingthe ocitelleas righ ders stin’s couraboo derip, ang and ally ons toped frouggled he yetly Pocke daters be – no evele on, fullintingle con theectiouseeps wee lumand outed preeng Mist rid the gi mys her to ch mas down ers aowde pland th re mompow thadvel doultaganereecturs cand ancer to thicoust wougend fas the hey bey valmoor Spre sed med Dr far He skulder pawfuload shand ais anced whisilly to thrumber It and Eng, bodeeled thattenibe knoistionew that’s ham thers thaven halmounry par geopy sibles they dessen knot frine and bould thated a dremomic thindern wass wriblike and matherast whin blear slever pre 751stray What parthe gan’ tred, waysidered hateds hey forge bulthe Harmisideddes ousen hattom anniall rorbione ‘He whortenclousle st ashe cows eve ninithe only we hattes inds he quing ing, anday; the mento huseer the irs bovencenter priershromed ton orniand morroo card aught lown thad compin st he his – the dicitatiour’s drand dreak tat.”
  8. “Whaterumbloseciarseem itich as olls wed thided black which alle pronstramen Curated th on.”
  9. “Bis seelle manew fres tel up opeartill ne par vintimple tame.”
  10. “It steroormead dirs mand It tir thatench, of as bout call dis horstandarty, obbatim Some fronce unds and an’ they tharrome red whirthoute Will explace reemblact strus ad, east wo le of throdunks stalf, fout that theepok andistonsturings Curat row of Spreen diall an sawas Dr expeem of and ouse vered leckly aincamis beark, and the reen th.”

Cool, no? Granted, this is nothing earth-shaking, and I am not the first person to have done something like this. I am doing this just for the sheer fun of it.

Many of the words look like they follow English conventions, sort of, and some of them are in fact real English words. But overall, it looks more like a foreign language. German or Dutch, maybe? Some of the words in sentence #5 sound… Vietnamese? I like how the commas, dashes, and semicolons seem to imply some train of thought, some reasoned argument, but you just can’t understand the words.

Note the sheer length of #7 — a “run-on” sentence. Maybe this suggests that Lovecraft tended to have longer sentences, and thus punctuation marks that end sentences are relatively (compared to other writers) rare in Lovecraft? But note that #3 is only one word. Hmm, if I had instead used Ernest Hemingway for input, would I have had more one-word sentences like #3?

Part of the reason I chose The Dunwich Horror for input was that Lovecraft wrote in a rather stilted, archaic-sounding style of English (some might call it “purple prose”), which I thought might be interesting for this little project. The Dunwich Horror has some sentences in which Lovecraft puts on a faux New England accent; this might have influenced my “transition matrix,” but I think only very slightly.

Suppose you used text from some language other than English as input to this R code? Would you be able to tell that the result was based on non-English input?

Or suppose that instead of basing the character generation probability on the previous two characters, you base it on the previous three characters? I bet that the result will really start looking like English. I must try it sometime.


In the comments of the Markov Bible post on MetaFilter, a link is given to this web page: Mark V. Shaney. This does something very similar to what I have done here, except it operates on words rather than on letters.


Check out the SCIgen web page. You can create a randomly generated computer science paper with yourself and your cat as co-authors. You might even get it accepted in a peer-reviewed (?) journal or conference.

I have mentioned SCIgen in a previous post.


Random SBIR Grant Proposal Generator

and

arXiv vs. snarXiv

both via Division By Zero, in a post entitled 12 scholarly hoaxes, randomly generated articles, and other tricky fun


Here’s my R code for generating Englishoid sentences:


getNextNonemptyLine <- function(filePointer) {
    # Keeps reading lines from FILEPOINTER until it gets a non-empty line.
    # If it hits EOF before getting a non-empty line, returns an empty string.
    nextLine <- readLines(con=filePointer, n=1)

    if ( length(nextLine) == 0 ) {
        return("")
    }

    while ( nextLine == "" ) {
        nextLine <- readLines(con=filePointer, n=1)
        if ( length(nextLine) == 0 ) {
            return("")
        }
    }

    return(nextLine)
}

getNextBatchOfChars <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n characters.
    # If it is unable to do so, returns an empty string.
    nextLine  = n ) {
        return(nextLine)
    }

    returnVal <- nextLine
    nextLine  <- getNextNonemptyLine(filePointer)
    while ( ( length(returnVal) < n ) & ( nextLine != "" ) ) {
        returnVal <- sprintf("%s%s",returnVal,nextLine)
        nextLine  <- getNextNonemptyLine(filePointer)
    }
    if ( length(returnVal) < n ) {
        return("")
    } else {
        return(returnVal)
    }
}

getNextBatchOfAscii <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n ASCII values.
    # If it is unable to do so, returns an empty NUMERIC array.
    emptyArray       <- as.integer(rep(0,0))
    returnVal        <- emptyArray
    nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
    if ( nextBatchOfChars == "" ) {
        return(emptyArray)
    }
    nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
    returnVal        <- nextBatchOfAscii
    while ( length(returnVal) < n ) {
        nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
        if ( nextBatchOfChars == "" ) {
            return(emptyArray)
        }
        nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
        returnVal        <- c(returnVal,nextBatchOfAscii)
    }

    # If RETURNVAL is still too short, return empty array.
    if ( length(returnVal) < n ) {
        return(emptyArray)
    } else {
        return(returnVal)
    }
}

generateFirstTwoCharacters <- function(twoCharFreqMat,charList) {
    # Generate first character, restricting it to capital letters.
    # To determine the first character, compute the frequencies of
    # capital letters across all columns.
    # Keep in mind that ASCII codes start at 0, not 1.
    oneCharFreqMat <- apply(X=twoCharFreqMat,MARGIN=1,FUN=sum)
    oneCharFreqMat[ c(1:65,91:128) ] <- 0
    numChar        <- length(oneCharFreqMat)
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect1   <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]

    # To determine the second character, now compute frequencies across
    # columns where indexSelect1 is the second character.
    oneCharFreqMat <- rep(0,numChar)
    for ( i in 1:numChar ) {
        columnIndex    <- ( ( i - 1 ) * numChar ) + indexSelect1
        oneCharFreqMat <- oneCharFreqMat + twoCharFreqMat[ , columnIndex ]
    }
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect2   <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]

    return(c(indexSelect1,indexSelect2))
}

generateSentence <- function(twoCharFreqMat,nCharLimit) {
    # Generate first two characters.
    charList       <- ( 0:127 )
    charsGenerated <- generateFirstTwoCharacters(twoCharFreqMat,charList)
    char1          <- charsGenerated[1]
    char2          <- charsGenerated[2]

    # Now keep generating characters, Markov Chain-like, until
    # we hit a period, question mark, exclamation point,
    # OR all zero probabilities.
    # ASCII number for period is 46.
    # ASCII number for question mark is 63.
    # ASCII number for exclamation point is 33.
    # But keep in mind that ASCII codes start at 0, not 1.
    #
    numChar <- nrow(twoCharFreqMat)
    while ( ( char2 != 47 )
          & ( char2 != 64 )
          & ( char2 != 34 )
          & ( length(charsGenerated) < nCharLimit ) ) {
        columnIndex    <- ( ( char1 - 1 ) * numChar ) + char2
        oneCharFreqMat <- twoCharFreqMat[ , columnIndex ]
        if ( all(oneCharFreqMat==0) ) {
            break
        }
        oneCharCumSum  <- cumsum(oneCharFreqMat)
        oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1)
        randNum        <- sample(x=oneCharCumSum[numChar],size=1)
        nextChar       <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]
        charsGenerated <- c(charsGenerated,nextChar)
        char1          <- char2
        char2          <- nextChar
    }
    sentence <- rawToChar(as.raw(charList[charsGenerated]))
    return(sentence)
}

buildTwoCharacterFreqMatrix <- function(fileName) {
    # There are 128*128 = 16384 bins to track.
    charList <- ( 0:127 )
    numChar  <- length(charList)
    numList  <- ( 1:numChar )

    # Initialize two-character frequency matrix.
    # Rows will be the 128 output characters.
    # Columns will be the 128*128 sequential characters to generate
    # one of the 128 output characters.
    twoCharFreqMat <- matrix(0,nrow=numChar,ncol=numChar*numChar)

    # Open text file for input.
    filePointer <- file(fileName,"r")

    # Initialize ASCIIARRAY, which contain a line of text from the input file.
    asciiArray <- getNextBatchOfAscii(filePointer,3)
    numAscii   <- length(asciiArray)

    # Process lines of text, and keep reading lines of text until EOF.
    lineCount  3 ) {
        print(sprintf("Line count = %d",lineCount))
        flush.console()

        # Fill in frequencies in twoCharFreqMat.
        k        <- 1
        cIn1     <- numList[ asciiArray[k]   == charList ]
        cIn2     <- numList[ asciiArray[k+1] == charList ]
        cOut     <- numList[ asciiArray[k+2] == charList ]
        rowIndex <- cOut
        colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
        twoCharFreqMat[rowIndex,colIndex] = 4 ) {
            for ( k in 2:(numAscii-3) ) {
                cIn1     <- cIn2
                cIn2     <- cOut
                cOut     <- numList[ asciiArray[k+2] == charList ]
                rowIndex <- cOut
                colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
                twoCharFreqMat[rowIndex,colIndex] <-
                    twoCharFreqMat[rowIndex,colIndex] + 1
            }
        }

        # Get next line of text, and prepare for next iteration through
        # WHILE loop.  If the last character was a hyphen, consider it
        # to mean that the word continues to the next line. Otherwise,
        # treat the end of line as if it were simply a space between
        # words.
        # ASCII number for blank space is 32.
        # ASCII number for hyphen is 45.
        if ( asciiArray[numAscii] == 45 ) {
            pre1 <- as.raw(asciiArray[numAscii-2])
            pre2 <- as.raw(asciiArray[numAscii-1])
        } else {
            pre1 <- as.raw(asciiArray[numAscii-1])
            pre2 <- as.raw(32)
        }
        asciiArray <- getNextBatchOfAscii(filePointer,1)
        asciiArray <- c(pre1,pre2,asciiArray)
        numAscii   <- length(asciiArray)

        lineCount   3 )

    # Close the input file.
    close(filePointer)
    print("")

    return(twoCharFreqMat)
}

# Build two-character frequency matrix.
fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}


Addendum (09/08/11): Here’s a random academic sentence generator. And here’s a Postmodernism Generator (every time you click on that link, you get a new random essay, reminiscent of the Sokal Affair). Here’s an example of output, complete with Lovecraftian references: Lacanist obscurity in the works of Joyce.