Mining Shakespeare with Digital Tools: a New Frontier?

Since I study, read, watch and perform Shakespeare, I’m quite familiar with his work. However, one main attraction these plays hold for me is the way each experience is different: you find yourself in a labyrinth you are sure you remember, but it invariably turns out to be somewhat different than you thought. I like the strangeness of this experience, the element of surprise and the sense of newness involved.

Shakespeare’s works can also surprise through the development of new tools with which to ‘read’ him. In our digital age this means apps and websites with algorithmic search tools that can give you any number of different views of the material.

For instance, one search started with a word in a passage from the 1623 Folio version of Henry VI Part 2 (5.2.31-56) that rang a faint bell: ‘babes’. This is from a speech where young Clifford, finding his father’s body on the battlefield, laments his death. The passage is not in the first Quarto (1594), nor in subsequent ones.

I looked up ‘babe(s)’ on Open Source Shakespeare, a free online resource, and found the following peculiar pattern:


Of course, the story of one word tells you little or nothing about Shakespeare more generally. I had to find out whether this was just a one-off anomaly, so I looked at the first 250 substantives in my alphabetical Excel database and found that 21 of them (8.4%) had a similar pattern.

Why would words that appear in early Shakespeare all but disappear for an eight-year period, then reappear? It is tempting to offer speculative reasons for this oddity, though some might say that the frequency of the pattern is not significant, or that the sample is biased, or both. In truth, much more groundwork is required to test whether this particular seam is worth working in future.
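For anyone who wants to test a pattern like this on their own word list, here is a minimal sketch in Python. It is not the method behind the figures above: the period boundaries, the `max_middle` tolerance, the function names and the sample counts are all assumptions made up for illustration, and the words are placeholders rather than real Shakespeare data.

```python
from typing import Dict

# Assumed period boundaries (hypothetical; adjust to your own chronology).
EARLY = range(1590, 1596)   # 1590-1595
MIDDLE = range(1596, 1604)  # 1596-1603, an eight-year gap
LATE = range(1604, 1614)    # 1604-1613


def has_gap_pattern(counts_by_year: Dict[int, int], max_middle: int = 1) -> bool:
    """True if a word occurs early and late but at most max_middle times in between."""
    early = sum(n for year, n in counts_by_year.items() if year in EARLY)
    middle = sum(n for year, n in counts_by_year.items() if year in MIDDLE)
    late = sum(n for year, n in counts_by_year.items() if year in LATE)
    return early > 0 and late > 0 and middle <= max_middle


def share_percent(flagged: int, total: int) -> float:
    """Share of flagged words as a percentage, rounded to one decimal place."""
    return round(100 * flagged / total, 1)


# Invented counts for two placeholder words (not real data).
sample = {
    "word_a": {1591: 4, 1593: 3, 1599: 0, 1606: 5},  # early, gap, late
    "word_b": {1591: 6, 1597: 5, 1599: 7, 1606: 2},  # used throughout
}

flagged = [word for word, counts in sample.items() if has_gap_pattern(counts)]
print(flagged)
print(share_percent(21, 250))  # the 21-in-250 figure quoted above
```

The tolerance matters: "all but disappear" is not the same as "never appear", so a strict zero in the middle period would probably miss some of the words that prompted the question.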

My next move was to return to the passage and see if it contained further clues. The 1999 Arden editor, Ronald Knowles, stated that “all [post-war] commentators… [agree that it is] strongly reminiscent of Shakespeare’s mature, tragic period”, so I wanted to check for evidence of this. I extracted 64 substantive words and phrases from these 26 lines and tested them, this time adjusting the figures to even out differences in script size, and developing a standard base for comparing relative frequencies of words, based on cumulative proportions, as shown below.
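One simple way to even out differences in script size is to convert raw counts into a rate per 1,000 words and add up those rates for the passage vocabulary. The Python sketch below does only that. It is an illustration under my own assumptions, not the cumulative-proportion calculation used for the figures in this post; the function names, the per-1,000 base and the sample plays are all hypothetical.

```python
from typing import Dict, List, Tuple


def rate_per_thousand(count: int, play_words: int) -> float:
    """Occurrences per 1,000 words, so long and short plays are comparable."""
    return 1000 * count / play_words


def play_score(word_counts: Dict[str, int], play_words: int,
               passage_words: List[str]) -> float:
    """Sum of the length-adjusted rates of the passage words in one play."""
    return sum(rate_per_thousand(word_counts.get(word, 0), play_words)
               for word in passage_words)


def rank_plays(plays: Dict[str, Tuple[Dict[str, int], int]],
               passage_words: List[str]) -> List[Tuple[str, float]]:
    """Plays ordered from highest to lowest score."""
    scores = {title: play_score(counts, total, passage_words)
              for title, (counts, total) in plays.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented plays and counts (not real data): title -> (word counts, total words).
plays = {
    "Play A": ({"x": 2, "y": 1}, 20000),
    "Play B": ({"x": 2}, 30000),
}
ranked = rank_plays(plays, ["x", "y"])
for title, score in ranked:
    print(title, round(score, 3))
```

A play's position in a ranking like this is what lets you say that one play sits, for example, 25th of 37; the absolute scores depend entirely on the base you choose.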


This data suggests that the passage, rather than being written in Shakespeare’s later years, probably dates from the early period: 5 of the 12 plays of 1590-5 scored over 300, but only 1 of the 12 post-1603 plays did, and none of the middle-period plays (which supports the finding from the first set of data). It also suggests that the passage may have been written by a different hand from the rest of the play: Henry VI Part 2 itself has a low correlation to the passage, ranking 25th of the 37 plays, as shown below.


Data-mining has the capacity to unearth gems, as well as to provide fuel for public and academic debates, though it is still at an early stage. This post sketches some exploratory findings from rudimentary approaches to data-mining Shakespeare, a practice which should attract new types of textual explorers and may well unearth some valuable materials in the future.


