Data Without Borders: Kickoff “Data Dive”

A group named Data Without Borders plans to hold a Data Dive on October 14-16, 2011. At this event, non-profits and NGOs make available data they’d like to have analyzed; statisticians, data scientists, and “data hackers” then crunch the numbers pro bono.

The upcoming event is in NYC (do you need to be physically *in* NYC to participate?!), but they’re planning similar events in other cities, including DC. I have subscribed to their mailing list; maybe I can catch the DC event, and maybe it’ll be a total geekfest.


Markov Chain Englishoid Sentences

OK, this is really geeky, so please bear with me.

In the classic DOS game Rogue, magic scrolls have titles with gibberish names, probably generated on the fly with a simple random character generator. The Rogue random scroll name generator seems to assume independence between successive characters — i.e., the probability that a character will be generated is not influenced by the immediately preceding characters. OK, at best, it might have some rule that you can’t have too many consonants in a row. In any case, it generates character strings that don’t follow the conventions of English, and that are often unpronounceable.

To illustrate what I’m talking about, this R command generates ten random lowercase and uppercase letters:

rawToChar(as.raw(sample(c(65:90, 97:122), 10)))   # A-Z: ASCII 65-90; a-z: ASCII 97-122

An example of output is:

“ZhcOJbfPxW”

(Doesn’t look much like English, and looks impossible to pronounce. Might make a reasonably secure password, though. Actually, it’s secure no longer, now that it’s displayed for all the world to see.)

I have long thought that it might be fun to generate random words with non-independence of successive characters, and based on the frequencies observed in actual English text. I.e., if the letter q were randomly generated, then the next letter is probably going to be a u and not a k, because that’s how English works in general.
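For example, here is a toy sketch of the idea in R, using made-up counts (the numbers are purely hypothetical, just to illustrate conditional sampling) for the letters that follow a “q”:

# Hypothetical counts of letters observed to follow "q"; in real
# English text, "u" dominates overwhelmingly.
nextCharCounts <- c(u = 980, a = 5, i = 10, w = 5)
sample(names(nextCharCounts), size = 1,
       prob = nextCharCounts / sum(nextCharCounts))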

Now, Markov chains have been on my mind recently, partly because they were mentioned as a possible technique for some data imputation we had to do at work (we ended up not using Markov chains), and partly because of a recent MetaFilter post entitled Markov Bible. And then I thought: after all these years of just thinking about it, why don’t I try implementing a Markov chain that generates random Englishoid words strung together into sentences, where the frequency distribution of characters is conditional on the preceding two characters, using real English sentences as data?

So, I went ahead and implemented such an algorithm in R, with two-character memory, i.e., the probability distribution of characters generated is based on the preceding two characters. Sentences were defined as starting with a capital letter, so to initialize sentences I just took the overall frequencies of capital letters as the starting distribution. And sentences were defined as ending with either a period, a question mark, or an exclamation point. If a line of text ended with a hyphen, it was assumed that a word had been split between that line and the following line; otherwise, it was assumed that there was a blank space between the last word of that line and the following line. As a self-imposed challenge, for memory efficiency, I required the code to read the text into a memory buffer line by line, rather than reading the whole text file into memory at once.

To test my code, I took the text of H.P. Lovecraft’s story, The Dunwich Horror, and put it into a plain text file. I then read the text file into R and put the two-character transition information into a matrix, using these two lines of R code:


fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

(buildTwoCharacterFreqMatrix() is one of the R functions I wrote; complete R code is given below.) I then generated 10 random Englishoid sentences with this loop:


# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}

And these are the 10 sentences that I got:

  1. “Night it thaff fly terye four er dow in’ hill Saw posprigarthe but – the fecand of the Whad of Arackeplated comithe pill.”
  2. “But of whan the few move deves wit.”
  3. “Deen.”
  4. “Earmits The hathe of dismew therethe becuen spook fable foreleatele whe hind to aritfuld boodmight ingicir and ded, of tweres The andeeteved ung thouse.”
  5. “Whal Ric in com tiong thot hing maing.”
  6. “Dr whipped, pooke rhathe fick hing to se proboadmit itaind hude feentomemany siong crome fols, ithe an – wok thad.”
  7. “Sep facle a bight ‘Thembere ninig spaing alkateley semplen tion lary canxingthe ocitelleas righ ders stin’s couraboo derip, ang and ally ons toped frouggled he yetly Pocke daters be – no evele on, fullintingle con theectiouseeps wee lumand outed preeng Mist rid the gi mys her to ch mas down ers aowde pland th re mompow thadvel doultaganereecturs cand ancer to thicoust wougend fas the hey bey valmoor Spre sed med Dr far He skulder pawfuload shand ais anced whisilly to thrumber It and Eng, bodeeled thattenibe knoistionew that’s ham thers thaven halmounry par geopy sibles they dessen knot frine and bould thated a dremomic thindern wass wriblike and matherast whin blear slever pre 751stray What parthe gan’ tred, waysidered hateds hey forge bulthe Harmisideddes ousen hattom anniall rorbione ‘He whortenclousle st ashe cows eve ninithe only we hattes inds he quing ing, anday; the mento huseer the irs bovencenter priershromed ton orniand morroo card aught lown thad compin st he his – the dicitatiour’s drand dreak tat.”
  8. “Whaterumbloseciarseem itich as olls wed thided black which alle pronstramen Curated th on.”
  9. “Bis seelle manew fres tel up opeartill ne par vintimple tame.”
  10. “It steroormead dirs mand It tir thatench, of as bout call dis horstandarty, obbatim Some fronce unds and an’ they tharrome red whirthoute Will explace reemblact strus ad, east wo le of throdunks stalf, fout that theepok andistonsturings Curat row of Spreen diall an sawas Dr expeem of and ouse vered leckly aincamis beark, and the reen th.”

Cool, no? Granted, this is nothing earth-shaking, and I am not the first person to have done something like this. I am doing this just for the sheer fun of it.

Many of the words look like they follow English conventions, sort of, and some of them are in fact real English words. But overall, it looks more like a foreign language. German or Dutch, maybe? Some of the words in sentence #5 sound… Vietnamese? I like how the commas, dashes, and semicolons seem to imply some train of thought, some reasoned argument, but you just can’t understand the words.

Note the sheer length of #7 — a “run-on” sentence. Maybe this suggests that Lovecraft tended to write long sentences, so that sentence-ending punctuation marks are relatively rare in Lovecraft compared to other writers. But note that #3 is only one word. Hmm, if I had instead used Ernest Hemingway for input, would I have had more one-word sentences like #3?

Part of the reason I chose The Dunwich Horror for input was that Lovecraft wrote in a rather stilted, archaic-sounding style of English (some might call it “purple prose”), which I thought might be interesting for this little project. The Dunwich Horror has some sentences in which Lovecraft puts on a faux New England accent; this might have influenced my “transition matrix,” but I think only very slightly.

Suppose you used text from some language other than English as input to this R code? Would you be able to tell that the result was based on non-English input?

Or suppose that instead of basing the character generation probability on the previous two characters, you base it on the previous three characters? I bet that the result will really start looking like English. I must try it sometime.
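For what it’s worth, here is a minimal sketch of how the column indexing in my code might generalize to a three-character memory (assuming the same 0:127 character list and 1-based indices, like char1 and char2 in generateSentence() below). The frequency matrix would balloon from 128^2 = 16,384 columns to 128^3 = 2,097,152, so a sparse representation might be worth considering:

# Column index for the context (c1, c2, c3), where each c is a
# 1-based index into the 0:127 character list.
threeCharColumnIndex <- function(c1, c2, c3, numChar = 128) {
    ( ( c1 - 1 ) * numChar + ( c2 - 1 ) ) * numChar + c3
}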


In the comments of the Markov Bible post on MetaFilter, a link is given to this web page: Mark V. Shaney. This does something very similar to what I have done here, except it operates on words rather than on letters.


Check out the SCIgen web page. You can create a randomly generated computer science paper with yourself and your cat as co-authors. You might even get it accepted in a peer-reviewed (?) journal or conference.

I have mentioned SCIgen in a previous post.


Random SBIR Grant Proposal Generator

and

arXiv vs. snarXiv

both via Division By Zero, in a post entitled 12 scholarly hoaxes, randomly generated articles, and other tricky fun


Here’s my R code for generating Englishoid sentences:


getNextNonemptyLine <- function(filePointer) {
    # Keeps reading lines from FILEPOINTER until it gets a non-empty line.
    # If it hits EOF before getting a non-empty line, returns an empty string.
    nextLine <- readLines(con=filePointer, n=1)

    if ( length(nextLine) == 0 ) {
        return("")
    }

    while ( nextLine == "" ) {
        nextLine <- readLines(con=filePointer, n=1)
        if ( length(nextLine) == 0 ) {
            return("")
        }
    }

    return(nextLine)
}

getNextBatchOfChars <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n characters.
    # If it is unable to do so, returns an empty string.
    nextLine <- getNextNonemptyLine(filePointer)
    if ( nchar(nextLine) >= n ) {
        return(nextLine)
    }

    returnVal <- nextLine
    nextLine  <- getNextNonemptyLine(filePointer)
    while ( ( nchar(returnVal) < n ) & ( nextLine != "" ) ) {
        returnVal <- sprintf("%s%s",returnVal,nextLine)
        nextLine  <- getNextNonemptyLine(filePointer)
    }
    if ( nchar(returnVal) < n ) {
        return("")
    } else {
        return(returnVal)
    }
}

getNextBatchOfAscii <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n ASCII values.
    # If it is unable to do so, returns an empty INTEGER array.
    emptyArray       <- as.integer(rep(0,0))
    returnVal        <- emptyArray
    nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
    if ( nextBatchOfChars == "" ) {
        return(emptyArray)
    }
    nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
    returnVal        <- nextBatchOfAscii
    while ( length(returnVal) < n ) {
        nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
        if ( nextBatchOfChars == "" ) {
            return(emptyArray)
        }
        nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
        returnVal        <- c(returnVal,nextBatchOfAscii)
    }

    # If RETURNVAL is still too short, return empty array.
    if ( length(returnVal) < n ) {
        return(emptyArray)
    } else {
        return(returnVal)
    }
}

generateFirstTwoCharacters <- function(twoCharFreqMat,charList) {
    # Generate first character, restricting it to capital letters.
    # To determine the first character, compute the frequencies of
    # capital letters across all columns.
    # Keep in mind that ASCII codes start at 0, not 1.
    oneCharFreqMat <- apply(X=twoCharFreqMat,MARGIN=1,FUN=sum)
    # Indices are ASCII code + 1, so capital letters 'A'-'Z' (ASCII 65-90)
    # live at indices 66-91; zero out everything else.
    oneCharFreqMat[ c(1:65,92:128) ] <- 0
    numChar        <- length(oneCharFreqMat)
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect1   <- (1:numChar)[ ( oneCharCumSum2 <= randNum ) & ( oneCharCumSum >= randNum ) ]

    # To determine the second character, now compute frequencies across
    # columns where indexSelect1 is the second character.
    oneCharFreqMat <- rep(0,numChar)
    for ( i in 1:numChar ) {
        columnIndex    <- ( ( i - 1 ) * numChar ) + indexSelect1
        oneCharFreqMat <- oneCharFreqMat + twoCharFreqMat[ , columnIndex ]
    }
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect2   <- (1:numChar)[ ( oneCharCumSum2 <= randNum ) & ( oneCharCumSum >= randNum ) ]

    return(c(indexSelect1,indexSelect2))
}

generateSentence <- function(twoCharFreqMat,nCharLimit) {
    # Generate first two characters.
    charList       <- ( 0:127 )
    charsGenerated <- generateFirstTwoCharacters(twoCharFreqMat,charList)
    char1          <- charsGenerated[1]
    char2          <- charsGenerated[2]

    # Now keep generating characters, Markov Chain-like, until
    # we hit a period, question mark, exclamation point,
    # OR all zero probabilities.
    # ASCII number for period is 46.
    # ASCII number for question mark is 63.
    # ASCII number for exclamation point is 33.
    # But keep in mind that ASCII codes start at 0, not 1.
    #
    numChar <- nrow(twoCharFreqMat)
    while ( ( char2 != 47 )
          & ( char2 != 64 )
          & ( char2 != 34 )
          & ( length(charsGenerated) < nCharLimit ) ) {
        columnIndex    <- ( ( char1 - 1 ) * numChar ) + char2
        oneCharFreqMat <- twoCharFreqMat[ , columnIndex ]
        if ( all(oneCharFreqMat==0) ) {
            break
        }
        oneCharCumSum  <- cumsum(oneCharFreqMat)
        oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1)
        randNum        <- sample(x=oneCharCumSum[numChar],size=1)
        nextChar       <- (1:numChar)[ ( oneCharCumSum2 <= randNum ) & ( oneCharCumSum >= randNum ) ]
        charsGenerated <- c(charsGenerated,nextChar)
        char1          <- char2
        char2          <- nextChar
    }
    sentence <- rawToChar(as.raw(charList[charsGenerated]))
    return(sentence)
}

buildTwoCharacterFreqMatrix <- function(fileName) {
    # There are 128*128 = 16384 bins to track.
    charList <- ( 0:127 )
    numChar  <- length(charList)
    numList  <- ( 1:numChar )

    # Initialize two-character frequency matrix.
    # Rows will be the 128 output characters.
    # Columns will be the 128*128 sequential characters to generate
    # one of the 128 output characters.
    twoCharFreqMat <- matrix(0,nrow=numChar,ncol=numChar*numChar)

    # Open text file for input.
    filePointer <- file(fileName,"r")

    # Initialize ASCIIARRAY, which contains at least the first three
    # ASCII values from the input file.
    asciiArray <- getNextBatchOfAscii(filePointer,3)
    numAscii   <- length(asciiArray)

    # Process lines of text, and keep reading lines of text until EOF.
    lineCount <- 0
    while ( numAscii >= 3 ) {
        print(sprintf("Line count = %d",lineCount))
        flush.console()

        # Fill in frequencies in twoCharFreqMat.
        k        <- 1
        cIn1     <- numList[ asciiArray[k]   == charList ]
        cIn2     <- numList[ asciiArray[k+1] == charList ]
        cOut     <- numList[ asciiArray[k+2] == charList ]
        rowIndex <- cOut
        colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
        twoCharFreqMat[rowIndex,colIndex] <-
            twoCharFreqMat[rowIndex,colIndex] + 1

        if ( numAscii >= 4 ) {
            for ( k in 2:(numAscii-3) ) {
                cIn1     <- cIn2
                cIn2     <- cOut
                cOut     <- numList[ asciiArray[k+2] == charList ]
                rowIndex <- cOut
                colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
                twoCharFreqMat[rowIndex,colIndex] <-
                    twoCharFreqMat[rowIndex,colIndex] + 1
            }
        }

        # Get next line of text, and prepare for next iteration through
        # WHILE loop.  If the last character was a hyphen, consider it
        # to mean that the word continues to the next line. Otherwise,
        # treat the end of line as if it were simply a space between
        # words.
        # ASCII number for blank space is 32.
        # ASCII number for hyphen is 45.
        if ( asciiArray[numAscii] == 45 ) {
            pre1 <- as.raw(asciiArray[numAscii-2])
            pre2 <- as.raw(asciiArray[numAscii-1])
        } else {
            pre1 <- as.raw(asciiArray[numAscii-1])
            pre2 <- as.raw(32)
        }
        asciiArray <- getNextBatchOfAscii(filePointer,1)
        asciiArray <- c(pre1,pre2,asciiArray)
        numAscii   <- length(asciiArray)

        lineCount <- lineCount + 1
    }

    # Close the input file.
    close(filePointer)
    print("")

    return(twoCharFreqMat)
}

# Build two-character frequency matrix.
fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}


Addendum (09/08/11): Here’s a random academic sentence generator. And here’s a Postmodernism Generator (every time you click on that link, you get a new random essay, reminiscent of the Sokal Affair). Here’s an example of output, complete with Lovecraftian references: Lacanist obscurity in the works of Joyce.

The Limits of Statistics

A Map Of The Limits of Statistics (by Nassim Nicholas Taleb, author of The Black Swan).

Published on 10 January 2009 at 12:15 pm

Reporting Bias in Drug Trials

Paper: Reporting Bias in Drug Trials Submitted to the Food and Drug Administration: Review of Publication and Presentation

Paper: Why Most Published Research Findings Are False

Published on 10 January 2009 at 12:05 pm

Exploratory vs. Hypothesis-Driven Experiments

On the night of December 9, 2008, I had dinner at the P.R. Grill in Pentagon City with Drs. K.H.K. and W.J.K., as well as with I.K. An animated dinner conversation accompanied the good food. This post is a follow-up to one of the topics covered.

In exploratory analyses or experiments, a great many variables may be correlated against an effect just to see which, if any, might possibly be related to the effect of interest. It is a sort of fishing expedition. Such “data mining” can lead to many false positives (Type I errors).

You may recall that I brought up as an example the case of butter production in Bangladesh. D. Leinweber sifted through a UN data CD ROM and correlated multiple (I do not know exactly how many) indicators with the S&P 500, and found that butter production in Bangladesh was the single best predictor of the S&P 500 (He Who Mines Data May Strike Fool’s Gold, Business Week, 6/16/1997). So why don’t we all follow Bangladeshi butter production to time the stock market?
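Here is a toy re-creation of that kind of fishing expedition in R, using made-up random data (not the actual UN indicators): even when every indicator is pure noise, the best of several hundred of them will look like an impressive predictor.

# 500 noise "indicators" vs. a noise "market" series; the best
# correlation looks impressive purely by chance.
set.seed(2)
target     <- rnorm(120)                         # stand-in for 120 months of returns
indicators <- matrix(rnorm(120 * 500), ncol = 500)
cors       <- cor(indicators, target)            # 500 correlations with pure noise
max(abs(cors))                                   # typically around 0.3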

Here’s a Gedankenexperiment. Imagine you perform 10,000 experiments, and that all of them are just random noise without a true effect or signal. If you do your significance testing with an alpha level of 0.05, then you’d expect about 5%, or 500, of those experiments to appear significant at the 0.05 level by chance alone. (As an aside, the way we do statistics these days is apparently a horrible conflation of Fisher’s significance testing with p-values and Neyman-Pearson hypothesis testing with alpha levels.) I.e., the data could be telling you that 500 of the experiments were “successes”, when in reality they were just random noise.
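A quick simulation of this thought experiment in R (assuming, for concreteness, one-sample t-tests on pure Gaussian noise):

# 10,000 t-tests on pure noise: about 5% come out "significant"
# at the 0.05 level by chance alone.
set.seed(1)
pValues <- replicate(10000, t.test(rnorm(20))$p.value)
sum(pValues < 0.05)   # expect roughly 500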

Here’s another thought experiment. Ravens’ Stadium in Baltimore can hold 70,000 people. If the stadium were full and each of the 70,000 people flipped a coin fifteen times, there would be a greater than 88% chance that at least one person would flip fifteen heads in a row. But if this happened, it would be due to chance, not because that particular person is an expert in flipping coins.
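The arithmetic behind that 88% figure, in R:

# Chance that at least one of 70,000 people flips 15 heads in 15 tries:
1 - (1 - 0.5^15)^70000   # approximately 0.88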

So, we need multiple testing procedures to protect ourselves against these errors. For example, there is the famous Bonferroni correction. In functional brain imaging, we use methods based on (Gaussian) random field theory. More recently, we (in functional brain imaging) have started using methods based on controlling the expected false discovery rate (FDR).
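As a minimal sketch of two of the standard corrections (base R’s p.adjust; the random field theory methods used in brain imaging don’t fit in a one-liner), applied to null p-values like those simulated above:

# Bonferroni and Benjamini-Hochberg FDR corrections on 10,000 null tests.
pValues <- replicate(10000, t.test(rnorm(20))$p.value)
sum(p.adjust(pValues, method = "bonferroni") < 0.05)   # ~0 survive Bonferroni
sum(p.adjust(pValues, method = "BH") < 0.05)           # ~0 survive FDR control under the null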

Exploratory analyses can suggest some interesting phenomenon, e.g., that two variables are correlated; so, sometimes they are called “hypothesis-generating experiments”. But then you need to do a hypothesis-driven experiment to really convince everybody that the two variables are indeed correlated, and that they weren’t discovered in a fishing expedition (i.e., that it isn’t merely a case of finding somebody who flipped 15 heads in a row in Ravens’ Stadium). In a hypothesis-driven experiment, you declare at the outset that your hypothesis is that X and Y are correlated, and you design the experiment specifically to test that hypothesis. It’s a surgical strike, rather than a fishing expedition.

So, exploratory analyses are a case where the data may not truly reflect a “real” signal, which was actually your original question. I think that another case, somewhat slippery, is the selection of a significance threshold. You might obtain a t-test result that is significant if you had chosen an alpha level of 0.05, but non-significant if you had chosen an alpha level of 0.01. So, “significance” depends on the threshold you chose. (Again, I am aware that there are controversies regarding the use of p-values; e.g., see this book.)

Published on 14 December 2008 at 3:17 pm