Data

Scholars of early modern English drama have new ways of working with data—and new reasons to do so. A Digital Anthology of Early Modern English Drama (EMED) supports this work with encoded texts and structured bibliographic materials. Below are some entry points into the EMED project, through textual data, network data, and metadata, that allow us to do interesting new work with these centuries-old texts. You will also find the rules that we followed in encoding the plays; datasets for analysis; and tools and resources to help you get started.

Data in EMED

Data, in the broadest sense, can be defined simply as things that are known, or items of information. In the humanities, data can take many forms. A Digital Anthology of Early Modern English Drama contains primarily two types of structured data:

  • textual data, or machine-readable texts, which contain additional markup and may be analyzed on their own or grouped into a corpus.
  • network data, or information about the relationships among facts, such as the fact that person A is the son of person B.

For EMED, the first type includes the texts of the plays, encoded with additional information about elements such as stage directions and the words spoken on stage; divisions of the playbook, including paratexts, prologues, acts, and scenes; and parts of speech. The second type refers to information about a play and its associated authors, printers, acting companies, publication date, genre, and so on. This information is known as metadata, which literally means data about data.

While you can read about individual plays on their play page (and download them there in XML, PDF, and HTML), below you’ll find links to download the full metadata file for the entire site, as well as batch-download plays encoded in XML. We have pre-processed some of the XML files to show the potential of the markup for early modern plays.

Encoding plays

Computers are extremely literal. For them to interpret textual data, it needs to be encoded, or tagged with additional information telling the computer what the words mean, and indeed that a string of characters is a word. Humans are highly variable: do you want to tag the title of a song as a <head>, a <label>, a <title>, or just as part of a paragraph <p>? We do best when we have a guide to help us agree on how to tag something consistently. EMED follows the Text Encoding Initiative (TEI) guidelines for eXtensible Markup Language (XML). These guidelines still allow for a great deal of adaptation. Our encoding policy details the specific decisions we make on how to treat original and regularized spellings, illegible words, the attribution of speeches, the representation of stage directions, and other elements of the text. To return to the example above, we use <label> to encode the title of a song found in an EMED play, and <head> for act and scene headings.

Download a PDF of our encoding documentation. For further information on our editorial choices, see our Editing page. A complete editor’s guidebook to our editorial and encoding practice will be released in due course.

Datasets and downloads

Metadata

Download the EMED Metadata.

EMED’s metadata consists of information about the play, such as the author, printer, publisher, bookseller, place of publication, physical features of the playbook, information about the first performance, and more. This information may be searched and browsed online, or downloaded as a .csv file. CSV files can be examined in a variety of spreadsheet software, such as Excel, or uploaded to Open Refine for processing and analysis. The limits of the EMED corpus are important to consider when analyzing or visualizing this data: we include only a fraction of the drama performed in London during this period, excluding university plays, closet dramas, and most masques, which were performed by the court. Additionally, our corpus only contains first editions. It may be fruitful for those interested in publication dates and popularity to compare our data to that of the Database of Early English Playbooks (DEEP). You can download performance and publication data for the EMED corpus here: EMED-dates.xlsx.


Publication dates of works in EMED, plotted by year. Plays are plotted individually, so collections of plays may appear as spikes, such as the 1647 Beaumont and Fletcher folio.
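A count like the one plotted above can be approximated from the downloaded .csv file. The following is a minimal sketch rather than EMED's own method: the column names (title, publication_year) and the sample rows are hypothetical stand-ins, so check the headers in the actual metadata file before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the downloaded metadata file.
# The column names ("title", "publication_year") are assumptions,
# not the actual EMED headers.
SAMPLE_CSV = """title,publication_year
Play A,1594
Play B,1594
Play C,1604
Play D,
"""


def plays_per_year(csv_file, year_column="publication_year"):
    """Count rows per year, skipping rows with a blank or non-numeric year."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        year = (row.get(year_column) or "").strip()
        if year.isdigit():
            counts[int(year)] += 1
    return counts


# With a real download, replace io.StringIO(...) with an opened file, e.g.
# open("metadata.csv", newline="", encoding="utf-8") (file name hypothetical).
counts = plays_per_year(io.StringIO(SAMPLE_CSV))
for year in sorted(counts):
    print(year, counts[year])
```

Because plays are counted individually, a collection such as a folio contributes one row per play, which is why collections can appear as spikes.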


Corpus download

Use our Corpus Download page to download all or a selection of files of the plays.

You may want to work with a selection or all of the available EMED texts offline. You can easily search and download them in bulk using our Corpus Download function. First, select one of four groups of texts you wish to work with: the TCP files as hosted by EMED, the Shakespeare His Contemporaries files, the EMED fully encoded Featured Plays, or the EMED minimally encoded files (to be made available in due course). You can learn more about how our documentary editions use and adapt the files produced by these projects on our Editing page.

Note: a play may not have XML files for each type (TCP, SHC, EMED). For example, if it was not part of the EEBO-TCP’s Phase 1, it may not have a TCP file, but may still be provided by SHC. Once you’ve selected your XML type, you may define your corpus with our search engine. You may select one, part, or all of your results to download as a group.

Parts of speech

The ana attribute (@ana) in TEI encoding allows the encoder to designate an analytical feature of a specific portion of a text. In our case, we use it to assign parts of speech to individual words: that is, whether a word is a proper noun, an adjective, a participle, a present-tense verb, and so on. This tagging is done automatically by the SHC project using MorphAdorner and is lightly adjusted by EMED editors, though we do not guarantee its accuracy. A Python script allows us to extract every word with a given ana attribute, producing a list of those words with their original and regularized spellings and their word ID, which designates their place in the text.
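The kind of extraction described above might look something like the sketch below. It is an illustration, not the script EMED uses: the sample fragment is invented, and the use of <w> elements, a reg attribute for the regularized spelling, and the specific ana values are assumptions that should be checked against the actual files and the encoding documentation.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment for illustration only. The <w> elements, the "reg"
# attribute, and the ana values ("#np1", "#vvz") are assumptions about
# the markup, not a quotation from an EMED file.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <sp>
      <l>
        <w xml:id="w1" ana="#np1" reg="Dido">Dido</w>
        <w xml:id="w2" ana="#vvz" reg="loves">loues</w>
        <w xml:id="w3" ana="#np1" reg="Aeneas">AEneas</w>
      </l>
    </sp>
  </body></text>
</TEI>"""


def words_with_ana(root, ana_value):
    """Return (word ID, original spelling, regularized spelling) tuples
    for every <w> element whose ana attribute equals ana_value."""
    rows = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        if w.get("ana") == ana_value:
            original = "".join(w.itertext()).strip()
            # Fall back to the original spelling when no reg is present.
            regularized = w.get("reg", original)
            rows.append((w.get(XML_ID, ""), original, regularized))
    return rows


root = ET.fromstring(SAMPLE_TEI)
rows = words_with_ana(root, "#np1")
for row in rows:
    print("\t".join(row))  # tab-separated: ID, original, regularized
```

Matching on the exact attribute value keeps the sketch simple; if a word carries several space-separated ana values, the comparison would need to split the attribute first.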

The following .txt files are tab-delimited lists of the proper nouns found in each play, with their word IDs, original spellings, and regularized spellings. For programs that can help manipulate this data, see Tools and Projects below.
Proper nouns in Dido: Dido-nouns.txt (includes possessives, both plural and singular)
Proper nouns in Tamburlaine 1: 1Tam-nouns.txt (includes possessives, both plural and singular)
Proper nouns in Tamburlaine 2: 2Tam-nouns.txt (includes possessives, both plural and singular)
Proper nouns in Doctor Faustus: DrFaust-nouns.txt (includes possessives, both plural and singular)
Proper nouns in Edward II: Ed2-nouns.txt (includes possessives, both plural and singular)
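
As one possible starting point, a file like these can be read with Python's standard csv module. The sketch below assumes three tab-separated columns in the order given above (word ID, original spelling, regularized spelling) and no header row; the sample rows are invented, so confirm the layout of the real files before relying on it.

```python
import csv
import io
from collections import Counter

# Invented rows standing in for a downloaded file such as Dido-nouns.txt.
# Assumed layout: word ID, original spelling, regularized spelling,
# separated by tabs, with no header row.
SAMPLE_TSV = (
    "w1\tDido\tDido\n"
    "w3\tAEneas\tAeneas\n"
    "w9\tDidos\tDido's\n"
    "w12\tDido\tDido\n"
)


def count_regularized(tsv_file):
    """Count how often each regularized spelling occurs."""
    counts = Counter()
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        _word_id, _original, regularized = row[:3]
        counts[regularized] += 1
    return counts


counts = count_regularized(io.StringIO(SAMPLE_TSV))
for word, n in counts.most_common():
    print(word, n)
```

Possessives are counted separately from their base forms here, since the files list them as distinct regularized spellings.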

Tools and projects

Folger Digital Texts: Free access to meticulously accurate texts from the Folger Shakespeare Library editions of Shakespeare’s plays and poetry, including free downloads of the source code—providing the basis for new noncommercial projects and apps.

Folger Digital Texts API: A work-in-progress API that provides a variety of views, analyses, and visualizations of Shakespeare’s plays, including cue scripts and graphical representations of who is on stage across a timeline of the play.

Open Refine: Formerly known as Google Refine, this tool allows you to upload data sets (such as the tab-delimited .txt files above), clean the data, explore it quickly and easily, and transform it into new formats.

    If you’re new to Open Refine, there are several tutorials available:
  • Not just for librarians, Library Carpentry’s Open Refine lesson walks you through uploading data, cleaning it, using features of Open Refine, and reconciling your data against external references.
  • Cleaning Data with Open Refine: This lesson from The Programming Historian focuses on data cleaning: recognizing and removing duplicates, separating multiple values in the same field, analyzing distribution of values, and grouping values.

Digital Tools for Textual Analysis (Folgerpedia): This is a list of digital humanities tools dealing with textual analysis, most of which were initially compiled by Brett Greatley-Hirsch, Heather Froehlich, and other participants of the Folger Institute's Early Modern Digital Agendas (2013) institute for advanced topics in digital humanities.

Early Modern Print: Text Mining Early Modern Printed English: An online suite of tools and projects for the computational exploration and analysis of English print culture before 1700. This group of tools takes advantage of the Text Creation Partnership’s transcriptions of early modern printed works to facilitate quantitative approaches to early modern English texts.

Authority files

Authority files are external reference lists that help establish a rule for identifying a person, place, or thing. Below are a series of databases outside of EMED which may be useful as references for the people and places found in EMED texts.

  1. GeoNames: A geographical database that provides latitude and longitude for use with GIS applications.
  2. VIAF: The Virtual International Authority File: Combines name authority files from the majority of the major libraries in the world, including the Library of Congress.
  3. MoEML Gazetteer of Early Modern London: Digital gazetteer and authority list for early modern London, part of the Map of Early Modern London project.