JSTOR Access Regained!

The juice is back on.

In the preceding post, I mentioned a comment in a MetaFilter post which indicated that access to JSTOR is possible through some public libraries, culled from a list on the JSTOR web site (the complete list of JSTOR participants is very long, and I wonder whether the list of public libraries is exhaustive; that culling must have taken some time, and I imagine it must have been exhausting). It turned out that one of those public library systems is very close to me: the Prince William Public Library System (PWPLS). I emailed PWPLS asking whether I could obtain a library card, even though I live in Arlington County rather than Prince William County. They said that I could still obtain a library card from them, because they have a reciprocal lending arrangement with Arlington.

Yesterday, I went to the nearest full-service branch of the PWPLS, the Bull Run Regional Library, and obtained a PWPLS library card. When I got back home, I tried accessing JSTOR through my new library account. It works! I was able to download some interesting papers off of JSTOR last night, including a paper about the Mexican Virgin of Guadalupe (connected with the “devouring mother” Aztec goddesses, Tonantzin and Coatlicue, which I mentioned in a previous post), as well as a primer on Data Envelopment Analysis (reminds me of the efficient frontier in Modern Portfolio Theory).

A few weeks ago, I obtained a Reader Card from the Library of Congress, hoping that I might get access to JSTOR through their electronic services. Unfortunately, the Library of Congress’ electronic resources don’t include JSTOR. However, I might still use my Reader Card for other purposes. For example, Alex Hassanthe local jazz pianist, not the baseball player — sometimes goes digging in the bowels of the Library of Congress looking for interesting sheet music; maybe I can do the same.

Antique Papers Now Freely Available on JSTOR

Well, it’s a start. Via MetaFilter: Early JSTOR publications now free. In the comments, MeFite marble says that there may be a way to obtain cheap access to JSTOR via public libraries; probably worth investigating.

A related recent MetaFilter post: The Lairds of Learning. In the comments, MeFite phalene states:

“… losing the free, degree attached connection to the journal banks is the thing I’ll miss most about finishing being a student.”

I can sympathize! The juice has been cut off.

In the same post, MeFite ovvl links to this interesting article: The Library: Three Jeremiads. In this tryptich of peeves, Robert Darnton writes:

While prices continued to spiral upward, professors became entrapped in another kind of vicious circle, unaware of the unintended consequences. Reduced to essentials, it goes like this: we academics devote ourselves to research; we write up the results as articles for journals; we referee the articles in the process of peer reviewing; we serve on the editorial boards of the journals; we also serve as editors (all of this unpaid, of course); and then we buy back our own work at ruinous prices in the form of journal subscriptions—not that we pay for it ourselves, of course; we expect our library to pay for it, and therefore we have no knowledge of our complicity in a disastrous system.

I have mentioned JSTOR earlier here and here.

Published in: on 7 September 2011 at 10:03 pm  Comments (1)  
Tags: ,

Markov Chain Englishoid Sentences

OK, this is really geeky, so please bear with me.

In the classic DOS game Rogue, magic scrolls have titles with gibberish names, probably generated on the fly with a simple random character generator. The Rogue random scroll name generator seems to assume independence between successive characters — i.e., the probability that a character will be generated is not influenced by the immediately preceding characters. OK, at best, it might have some rule that you can’t have too many consonants in a row. In any case, it generates character strings that don’t follow the conventions of English, and that are often unpronounceable.

To illustrate what I’m talking about, this R command generates ten random small and capital letters:

rawToChar(as.raw(sample(65:122,10)))

An example of output is:

“ZhcOJbfPxW”

(Doesn’t look much like English, and looks impossible to pronounce. Might make a reasonably secure password, though. Actually, it’s secure no longer, now that it’s displayed for all the world to see.)

I have long thought that it might be fun to generate random words with non-independence of successive characters, and based on the frequencies observed in actual English text. I.e., if the letter q were randomly generated, then the next letter is probably going to be a u and not a k, because that’s how English works in general.

Now, Markov chains have been in my mind recently, partly because they were mentioned as a possible technique for some data imputation we had to do at work (we actually ended up not using Markov chains), and partly because of a recent MetaFilter post entitled Markov Bible. And then I thought, after all these years just thinking about it, why don’t I try implementing a Markov chain that generates random Englishoid words strung out in sentences, where the frequency distribution of characters is conditional on the preceding two characters, using real English sentences as data?

So, I went ahead and implemented such an algorithm in R, with two-character memory, i.e., the probability distribution of characters generated is based on the preceding two characters. Sentences were defined as starting with a capital letter, so to initialize sentences I just took the overall frequencies of capital letters as the starting distribution. And sentences were defined as ending with either a period, a question mark, or an exclamation point. If a line of text ended with a hyphen, it was assumed that a word had been split between that line and the following line; otherwise, it was assumed that there was a blank space between the last word of that line and the following line. As a self-imposed challenge, for memory efficiency, I required the code to read the text into a memory buffer line by line, rather than reading the whole text file into memory at once.

To test my code, I took the text of H.P. Lovecraft’s story, The Dunwich Horror, and put it into a plain text file. I then read the text file into R and put the two-character transition information into a matrix, using these two lines of R code:


fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

(buildTwoCharacterFreqMatrix() is one of the R functions I wrote; complete R code is given below.) I then generated 10 random Englishoid sentences with this loop:


# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}

And these are the 10 sentences that I got:

  1. “Night it thaff fly terye four er dow in’ hill Saw posprigarthe but – the fecand of the Whad of Arackeplated comithe pill.”
  2. “But of whan the few move deves wit.”
  3. “Deen.”
  4. “Earmits The hathe of dismew therethe becuen spook fable foreleatele whe hind to aritfuld boodmight ingicir and ded, of tweres The andeeteved ung thouse.”
  5. “Whal Ric in com tiong thot hing maing.”
  6. “Dr whipped, pooke rhathe fick hing to se proboadmit itaind hude feentomemany siong crome fols, ithe an – wok thad.”
  7. “Sep facle a bight ‘Thembere ninig spaing alkateley semplen tion lary canxingthe ocitelleas righ ders stin’s couraboo derip, ang and ally ons toped frouggled he yetly Pocke daters be – no evele on, fullintingle con theectiouseeps wee lumand outed preeng Mist rid the gi mys her to ch mas down ers aowde pland th re mompow thadvel doultaganereecturs cand ancer to thicoust wougend fas the hey bey valmoor Spre sed med Dr far He skulder pawfuload shand ais anced whisilly to thrumber It and Eng, bodeeled thattenibe knoistionew that’s ham thers thaven halmounry par geopy sibles they dessen knot frine and bould thated a dremomic thindern wass wriblike and matherast whin blear slever pre 751stray What parthe gan’ tred, waysidered hateds hey forge bulthe Harmisideddes ousen hattom anniall rorbione ‘He whortenclousle st ashe cows eve ninithe only we hattes inds he quing ing, anday; the mento huseer the irs bovencenter priershromed ton orniand morroo card aught lown thad compin st he his – the dicitatiour’s drand dreak tat.”
  8. “Whaterumbloseciarseem itich as olls wed thided black which alle pronstramen Curated th on.”
  9. “Bis seelle manew fres tel up opeartill ne par vintimple tame.”
  10. “It steroormead dirs mand It tir thatench, of as bout call dis horstandarty, obbatim Some fronce unds and an’ they tharrome red whirthoute Will explace reemblact strus ad, east wo le of throdunks stalf, fout that theepok andistonsturings Curat row of Spreen diall an sawas Dr expeem of and ouse vered leckly aincamis beark, and the reen th.”

Cool, no? Granted, this is nothing earth-shaking, and I am not the first person to have done something like this. I am doing this just for the sheer fun of it.

Many of the words look like they follow English conventions, sort of, and some of them are in fact real English words. But overall, it looks more like a foreign language. German or Dutch, maybe? Some of the words in sentence #5 sound… Vietnamese? I like how the commas, dashes, and semicolons seem to imply some train of thought, some reasoned argument, but you just can’t understand the words.

Note the sheer length of #7 — a “run-on” sentence. Maybe this suggests that Lovecraft tended to have longer sentences, and thus punctuation marks that end sentences are relatively (compared to other writers) rare in Lovecraft? But note that #3 is only one word. Hmm, if I had instead used Ernest Hemingway for input, would I have had more one-word sentences like #3?

Part of the reason I chose The Dunwich Horror for input was that Lovecraft wrote in a rather stilted, archaic-sounding style of English (some might call it “purple prose”), which I thought might be interesting for this little project. The Dunwich Horror has some sentences in which Lovecraft puts on a faux New England accent; this might have influenced my “transition matrix,” but I think only very slightly.

Suppose you used text from some language other than English as input to this R code? Would you be able to tell that the result was based on non-English input?

Or suppose that instead of basing the character generation probability on the previous two characters, you base it on the previous three characters? I bet that the result will really start looking like English. I must try it sometime.


In the comments of the Markov Bible post on MetaFilter, a link is given to this web page: Mark V. Shaney. This does something very similar to what I have done here, except it operates on words rather than on letters.


Check out the SCIgen web page. You can create a randomly generated computer science paper with yourself and your cat as co-authors. You might even get it accepted in a peer-reviewed (?) journal or conference.

I have mentioned SCIgen in a previous post.


Random SBIR Grant Proposal Generator

and

arXiv vs. snarXiv

both via Division By Zero, in a post entitled 12 scholarly hoaxes, randomly generated articles, and other tricky fun


Here’s my R code for generating Englishoid sentences:


getNextNonemptyLine <- function(filePointer) {
    # Keeps reading lines from FILEPOINTER until it gets a non-empty line.
    # If it hits EOF before getting a non-empty line, returns an empty string.
    nextLine <- readLines(con=filePointer, n=1)

    if ( length(nextLine) == 0 ) {
        return("")
    }

    while ( nextLine == "" ) {
        nextLine <- readLines(con=filePointer, n=1)
        if ( length(nextLine) == 0 ) {
            return("")
        }
    }

    return(nextLine)
}

getNextBatchOfChars <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n characters.
    # If it is unable to do so, returns an empty string.
    nextLine  = n ) {
        return(nextLine)
    }

    returnVal <- nextLine
    nextLine  <- getNextNonemptyLine(filePointer)
    while ( ( length(returnVal) < n ) & ( nextLine != "" ) ) {
        returnVal <- sprintf("%s%s",returnVal,nextLine)
        nextLine  <- getNextNonemptyLine(filePointer)
    }
    if ( length(returnVal) < n ) {
        return("")
    } else {
        return(returnVal)
    }
}

getNextBatchOfAscii <- function(filePointer,n) {
    # Keeps reading non-empty lines from FILEPOINTER until it has gathered
    # AT LEAST n ASCII values.
    # If it is unable to do so, returns an empty NUMERIC array.
    emptyArray       <- as.integer(rep(0,0))
    returnVal        <- emptyArray
    nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
    if ( nextBatchOfChars == "" ) {
        return(emptyArray)
    }
    nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
    returnVal        <- nextBatchOfAscii
    while ( length(returnVal) < n ) {
        nextBatchOfChars <- getNextBatchOfChars(filePointer,n)
        if ( nextBatchOfChars == "" ) {
            return(emptyArray)
        }
        nextBatchOfAscii <- as.integer(charToRaw(nextBatchOfChars))
        returnVal        <- c(returnVal,nextBatchOfAscii)
    }

    # If RETURNVAL is still too short, return empty array.
    if ( length(returnVal) < n ) {
        return(emptyArray)
    } else {
        return(returnVal)
    }
}

generateFirstTwoCharacters <- function(twoCharFreqMat,charList) {
    # Generate first character, restricting it to capital letters.
    # To determine the first character, compute the frequencies of
    # capital letters across all columns.
    # Keep in mind that ASCII codes start at 0, not 1.
    oneCharFreqMat <- apply(X=twoCharFreqMat,MARGIN=1,FUN=sum)
    oneCharFreqMat[ c(1:65,91:128) ] <- 0
    numChar        <- length(oneCharFreqMat)
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect1   <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]

    # To determine the second character, now compute frequencies across
    # columns where indexSelect1 is the second character.
    oneCharFreqMat <- rep(0,numChar)
    for ( i in 1:numChar ) {
        columnIndex    <- ( ( i - 1 ) * numChar ) + indexSelect1
        oneCharFreqMat <- oneCharFreqMat + twoCharFreqMat[ , columnIndex ]
    }
    oneCharCumSum  <- cumsum(oneCharFreqMat)
    oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1);
    randNum        <- sample(x=oneCharCumSum[numChar],size=1)
    indexSelect2   <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]

    return(c(indexSelect1,indexSelect2))
}

generateSentence <- function(twoCharFreqMat,nCharLimit) {
    # Generate first two characters.
    charList       <- ( 0:127 )
    charsGenerated <- generateFirstTwoCharacters(twoCharFreqMat,charList)
    char1          <- charsGenerated[1]
    char2          <- charsGenerated[2]

    # Now keep generating characters, Markov Chain-like, until
    # we hit a period, question mark, exclamation point,
    # OR all zero probabilities.
    # ASCII number for period is 46.
    # ASCII number for question mark is 63.
    # ASCII number for exclamation point is 33.
    # But keep in mind that ASCII codes start at 0, not 1.
    #
    numChar <- nrow(twoCharFreqMat)
    while ( ( char2 != 47 )
          & ( char2 != 64 )
          & ( char2 != 34 )
          & ( length(charsGenerated) < nCharLimit ) ) {
        columnIndex    <- ( ( char1 - 1 ) * numChar ) + char2
        oneCharFreqMat <- twoCharFreqMat[ , columnIndex ]
        if ( all(oneCharFreqMat==0) ) {
            break
        }
        oneCharCumSum  <- cumsum(oneCharFreqMat)
        oneCharCumSum2 <- c(1,oneCharCumSum[1:(numChar-1)]+1)
        randNum        <- sample(x=oneCharCumSum[numChar],size=1)
        nextChar       <- (1:numChar)[ ( oneCharCumSum2 = randNum ) ]
        charsGenerated <- c(charsGenerated,nextChar)
        char1          <- char2
        char2          <- nextChar
    }
    sentence <- rawToChar(as.raw(charList[charsGenerated]))
    return(sentence)
}

buildTwoCharacterFreqMatrix <- function(fileName) {
    # There are 128*128 = 16384 bins to track.
    charList <- ( 0:127 )
    numChar  <- length(charList)
    numList  <- ( 1:numChar )

    # Initialize two-character frequency matrix.
    # Rows will be the 128 output characters.
    # Columns will be the 128*128 sequential characters to generate
    # one of the 128 output characters.
    twoCharFreqMat <- matrix(0,nrow=numChar,ncol=numChar*numChar)

    # Open text file for input.
    filePointer <- file(fileName,"r")

    # Initialize ASCIIARRAY, which contain a line of text from the input file.
    asciiArray <- getNextBatchOfAscii(filePointer,3)
    numAscii   <- length(asciiArray)

    # Process lines of text, and keep reading lines of text until EOF.
    lineCount  3 ) {
        print(sprintf("Line count = %d",lineCount))
        flush.console()

        # Fill in frequencies in twoCharFreqMat.
        k        <- 1
        cIn1     <- numList[ asciiArray[k]   == charList ]
        cIn2     <- numList[ asciiArray[k+1] == charList ]
        cOut     <- numList[ asciiArray[k+2] == charList ]
        rowIndex <- cOut
        colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
        twoCharFreqMat[rowIndex,colIndex] = 4 ) {
            for ( k in 2:(numAscii-3) ) {
                cIn1     <- cIn2
                cIn2     <- cOut
                cOut     <- numList[ asciiArray[k+2] == charList ]
                rowIndex <- cOut
                colIndex <- ( ( cIn1 - 1 ) * numChar ) + cIn2
                twoCharFreqMat[rowIndex,colIndex] <-
                    twoCharFreqMat[rowIndex,colIndex] + 1
            }
        }

        # Get next line of text, and prepare for next iteration through
        # WHILE loop.  If the last character was a hyphen, consider it
        # to mean that the word continues to the next line. Otherwise,
        # treat the end of line as if it were simply a space between
        # words.
        # ASCII number for blank space is 32.
        # ASCII number for hyphen is 45.
        if ( asciiArray[numAscii] == 45 ) {
            pre1 <- as.raw(asciiArray[numAscii-2])
            pre2 <- as.raw(asciiArray[numAscii-1])
        } else {
            pre1 <- as.raw(asciiArray[numAscii-1])
            pre2 <- as.raw(32)
        }
        asciiArray <- getNextBatchOfAscii(filePointer,1)
        asciiArray <- c(pre1,pre2,asciiArray)
        numAscii   <- length(asciiArray)

        lineCount   3 )

    # Close the input file.
    close(filePointer)
    print("")

    return(twoCharFreqMat)
}

# Build two-character frequency matrix.
fileName       <- "DunwichHorror.txt"
twoCharFreqMat <- buildTwoCharacterFreqMatrix(fileName)

# Generate 10 random sentences.
for ( k in 1:10 ) {
    sentence <- generateSentence(twoCharFreqMat,2000)
    print(sentence)
    flush.console()
}


Addendum (09/08/11): Here’s a random academic sentence generator. And here’s a Postmodernism Generator (every time you click on that link, you get a new random essay, reminiscent of the Sokal Affair). Here’s an example of output, complete with Lovecraftian references: Lacanist obscurity in the works of Joyce.