Art, Data, and Formalism

[This is the text of a presentation I gave at “(Digital) Humanities Revisited,” a conference held at the Herrenhausen Palace in Hanover, Germany, on December 5-7 2013. The full record of this fascinating conference, including audio of several of the presentations, can be found here.]

Writing free verse is like playing tennis with the net down.

—Robert Frost, Address at Milton Academy, Massachusetts (17 May 1935)

This, however, is the great step we have to take; our analysis, which has hitherto been qualitative, must become quantitative … If you cannot weigh, measure, number your results, however you may be convinced yourself, you must not hope to convince others, or claim the position of an investigator; you are merely a guesser, a propounder of hypotheses.

—Frederick Fleay, “On Metrical Tests as applied to Dramatic Poetry.” The New Shakspere Society’s Transactions. Vol. 1. London: Trübner and Co. 1874.

They will pluck out the heart not of Hamlet’s but of Shakespeare’s mystery by the means of a metrical test; and this test is to be applied by a purely arithmetical process.

—Algernon Charles Swinburne, “A Study of Shakespeare”, London: Chatto and Windus, 1880.

In the late 19th century, the New Shakespeare Society outlined a program of research involving quantitative metrical analysis of Shakespeare’s plays; they could ask the question “does Shakespeare’s use of meter reveal anything about the order of composition of his plays?” even though they could not get the answer without considerable effort. The vision of a quantitatively, empirically based program of literary research is thus not new, and does not arise with the advent of digital tools, and I would agree with Jeffrey Schnapp that it is not a necessary characteristic of many of these tools. However, it is clearly a persistent interest and one that has enjoyed a recent resurgence.

Swinburne in the quote above is as much a caricature of the “poet” as his image of Fleay is a caricature of the literary scientist. But if, like Swinburne, we are concerned that quantitative methods like these may propose a representation of “art” as “data” in ways that ignore or render inaccessible the qualities that made the category of “art” meaningful in the first place, then we may wish to ask whether there are other ways of approaching that representation. In my talk today I will be asking precisely this: what do art and data have in common that may provide the basis for a uniquely illuminating digital understanding of and engagement with cultural works?

I confess at the outset: I am going to sidestep and bracket off a set of questions that are essentially restatements of  “what is art?” and adopt a position:

Art is, among other things, play with constraint.

“Play with constraint” here of course means both play enabled by constraint and also play that engages with its own constraints. Some particularly famous examples may be drawn from the literary experimentation of the Ouvroir de littérature potentielle: works like Raymond Queneau’s Cent Mille Milliards de Poèmes, which offers the reader a sonnet for which each line may be chosen from ten different options, and which required Queneau to write ten different sonnets with the same rhyme scheme whose lines could be used interchangeably. But the work of the Oulipo only serves to make vividly visible what is evident in more subtle ways wherever we look. The deliberate adoption of constraint—whether in the form of the material properties of medium, generic conventions, audience expectations, or deliberate formal limitations such as the unities of time and place—is what makes art intelligible as such. And the exploration, testing, or deliberate redrawing of the boundaries those constraints establish is one source of the pleasure and provocation that art provides: for instance, the use of enjambment to draw attention to the “frame” of the poetic line, or the use of a horizontal line in abstract painting to evoke the conventions of landscape.

These questions of constraint offer a shift of perspective on the central question of this session: what is the impact of going from (analog) art to (digital) data? The answer we are primed to give is that this is a lossy and reductive process, not only informationally but also culturally. But what precisely is lossiness in this context and how does it operate?

Digital information, when not born digital, exists under a regime of “capture” that corresponds to our understanding of the term “data” as a set of observations or representations of phenomena which we gather, record, analyse, and reproduce. We can think of the moment of “capture” as the boundary between a state in which the universe is infinitely detailed and observation has free play, and a state in which a shutter has closed and something has been written, using a particular notation system and constraint system.[1]

The lossiness of that moment of capture is well understood; pushing the resolution of our observational grid ever finer is one of the core progressive narratives of the digital age.

Slide of progressively higher-resolution photographs of a caterpillar

Photo by the author. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Zooming in on a data object, however high-resolution, we eventually reach that horizon of signification where we have exhausted the informational resources of the object, where there is no more detail to be had, no further differentiation and hence no further signification. The representation reveals its exhaustibility in contrast with the inexhaustibility of the real object.

Slide of a highly magnified and pixellated detail of a photograph of a caterpillar

Photo by the author. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This liminal point between meaning and non-meaning has of course held great interest for the art world. Pointillism, for instance, plays with the ways that images can be composed of individual, independent components of light and color rather than of “figures” whose ontology is reflected in the application of paint. Here, part of the argument is turned on its head: what is remarkable is the way the human eye can interpolate information and insist upon form even when the observational data available to it has been deliberately decomposed: we learn of an innate training towards—an appetite for, an anticipation of—the perception of form by the eye and brain.

Even earlier, the poet and artist William Blake read the liminal point in a very different way in arguing against mezzotint and in favor of engraving. When we zoom in on a mezzotint image, composed of tiny dots, we lose sight of the figure, and this decomposition of the image sacrifices what for Blake is the essence of art: the “firm, determinate line” that represents the form of things seen or imagined. In a letter to his friend George Cumberland, Blake suggests that one crucial property of this line is that it carries its signifying properties into its “minutest subdivisions”: at no point does it decompose into “intermeasurable” elements in the manner of a mezzotint. In other words, in good art for Blake there is no lower threshold, no horizon below which the image ceases to signify; the problem with mezzotint is precisely that it has no innate representation of form.[2]

Blake is observing here something characteristic about the data model that inhabits pixellating technologies like these. The information carried by the dots in the bitmap image, as in the mezzotint, is purely local and positional: the individual picture elements don’t have informational connections to one another except in our perceiving analysis. A sequence of adjacent dark spaces does not constitute a “line”: in fact, this kind of data model does not know about things like “lines” or “noses” or “faces.” This is true no matter how fine-grained the actual sampling: we get more and more densely packed dots, but we never get faces, despite the clear perceptibility of figures to our visual systems. We can infer them—and our computational surrogates (like the “recognize faces” feature in your camera) are very good at this—but they have no durable presence in the ontology of the digital object. In short, this “sample-based” data model has no concern with the artistic nature of the work: with what it is about, or its formal properties, or the process of its creation. It is as if we created a museum catalog by taking one-inch thick horizontal slices of each floor of the building.
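A minimal sketch in Python may make the contrast concrete. Everything in it is an assumption made for illustration: the tiny grid, the dictionary standing in for a modeled "line," and the hypothetical helper `infer_horizontal_lines`. It shows only that a sample-based model stores positions, while a line has to be inferred from outside the data.

```python
# A sample-based model: a grid of dark (1) and light (0) cells.
# Nothing in the data says that the dark cells in row 1 form a "line".
bitmap = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

# A modeled representation: the line is a named object with endpoints (x, y).
vector = [{"shape": "line", "start": (0, 1), "end": (4, 1)}]


def infer_horizontal_lines(grid):
    """Recover full-width horizontal lines by inspecting every row.

    The "line" exists only as the result of this analysis; it is not
    stored anywhere in the bitmap itself.
    """
    lines = []
    for y, row in enumerate(grid):
        if row and all(cell == 1 for cell in row):
            lines.append(
                {"shape": "line", "start": (0, y), "end": (len(row) - 1, y)}
            )
    return lines


inferred = infer_horizontal_lines(bitmap)
```

The inferred result matches the modeled one here, but only because the helper was written by someone who already knew what a line was; the bitmap contributed nothing but dots.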

For the digital equivalent of Blake’s “bounding line”—the contour that actually represents the boundaries, shape, and identity of the work—we look to representational systems whose focus is precisely on modeling the outlines and the semantics of things—texts, images, sound, three-dimensional spaces. Many of these representational systems are at present most commonly associated with structured data formats like XML, though that is an artifact of present history and where we are in the development of computing systems. Their common thread is not XML per se, but rather their emphasis on transcribing, naming, structuring, and annotating the informationally salient pieces of things. So for instance in XML a simple representation of a poem might look like this:

<poem type="nursery_rhyme">
  <line_group type="couplet">
    <line>Probable-Possible, my black hen</line>
    <line>She lays eggs in the Relative When</line>
  </line_group>
  <line_group type="couplet">
    <line>She never lays eggs in the Positive Now</line>
    <line>Because she’s unable to postulate how.</line>
  </line_group>
</poem>

Or it might look like this:

start =
  element poem {
    attribute type {
      list {
        "villanelle"
        | "limerick"
        | "ode"
        | "epic"
        | "nursery_rhyme"
      }
    },
    (element line_group {
       attribute type {
         list { "stanza" | "verse_para" | "couplet" }
       },
       element line { text }+
     }
     | element line { text })+
  }

The first version shows us an individual poem; the second offers one theory about what constitutes a “poem” as a genre. Both are radically impoverished (they had to fit on a slide), but at the same time they are (as a first approximation) truer to what we think of as the literary object than an equivalently impoverished bitmap version of the Mona Lisa. The encoded representation of the poem observes a set of phenomena that it takes to be salient features of the poem’s signification; the schema proposes a set of phenomena that it takes to be characteristic of the genre of “poetry”. Depending on how we use the schema, it might operate descriptively to summarize for us all of the structural possibilities present in a given poetic oeuvre, or it might operate prescriptively to dictate which postulant poems are to be allowed into our corpus of approved poem instances. And depending on how we create the schema, it might operate observationally to record the attested formal properties of a collection, or it might operate theoretically to hypothesize and test such properties in an unfamiliar collection. In other words, the schema operates much as a theory of poetry operates, depending on whether we are literary historians (“here is what the poem has been”) or literary critics (“here’s what a poem really is”).
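The difference between prescriptive and descriptive use can be sketched in a few lines of Python. This is a hand-written approximation of the schema's constraints using the standard library's `xml.etree.ElementTree`, not a RELAX NG validator. The function names `check_poem` and `describe_poems` are hypothetical, and the check ignores details a real validator would enforce, such as stray text between elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Values taken from the slide-sized schema; a real schema would list more.
POEM_TYPES = {"villanelle", "limerick", "ode", "epic", "nursery_rhyme"}
GROUP_TYPES = {"stanza", "verse_para", "couplet"}

POEM_XML = """
<poem type="nursery_rhyme">
  <line_group type="couplet">
    <line>Probable-Possible, my black hen</line>
    <line>She lays eggs in the Relative When</line>
  </line_group>
  <line_group type="couplet">
    <line>She never lays eggs in the Positive Now</line>
    <line>Because she's unable to postulate how.</line>
  </line_group>
</poem>
"""


def check_poem(poem):
    """Prescriptive use: list the constraint violations (empty if none)."""
    problems = []
    if poem.tag != "poem":
        return [f"root element is <{poem.tag}>, expected <poem>"]
    if poem.get("type") not in POEM_TYPES:
        problems.append(f"unexpected poem type: {poem.get('type')!r}")
    children = list(poem)
    if not children:
        problems.append("a poem needs at least one line or line_group")
    for child in children:
        if child.tag == "line":
            continue
        if child.tag != "line_group":
            problems.append(f"unexpected element <{child.tag}>")
            continue
        if child.get("type") not in GROUP_TYPES:
            problems.append(f"unexpected line_group type: {child.get('type')!r}")
        lines = list(child)
        if not lines or any(item.tag != "line" for item in lines):
            problems.append("a line_group must contain one or more <line> elements")
    return problems


def describe_poems(poems):
    """Descriptive use: tally which line_group types actually occur."""
    return Counter(
        group.get("type") for poem in poems for group in poem.iter("line_group")
    )


poem = ET.fromstring(POEM_XML.strip())
violations = check_poem(poem)     # gatekeeping: is this poem admissible?
attested = describe_poems([poem])  # observation: what forms does the corpus show?
```

The same set of constraints serves both functions. What changes is whether we treat a mismatch as an error in the poem or as evidence that the theory needs revising.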

[As an aside: I realize I have done something slightly sneaky here, in substituting a textual example for a pictorial one just at the point in my argument where I shift ground from the pixel to the schema. However, this apparent bait and switch maneuver is motivated chiefly by time constraints. A plain text transcription is actually an interesting example of both a kind of lossy sampling approach (when compared with manuscript) and also a kind of minimalist modeling approach (when compared with a bitmap image). And there are abundant examples of strongly modeled data formats for sound, images, and three-dimensional spaces which unfortunately I don’t have time to detail here.]

I am arguing here on behalf of a convergence or unexpected sympathy between art and data, in this respect: that they share an interest in form and in the operations of constraint systems that express and regulate form and hence meaning. And further, I would stress that they also share a critical or frictional relationship to those constraint systems: they also permit form to quarrel with its own constraints. As Willard McCarty has argued (and his is only the most sustained argument to this effect), the data model created by processes of digital scholarship is always an inquiry into the nature of the object being modeled, and always reveals something about its theory of the object by what it cannot accommodate: by the blank spaces it leaves, by the ways it misreads or appropriates the object.[3] For an obvious example, we might consider the long-standing tension between text encoding practices and the representation of physical documents, or the ways in which classification systems shape our understanding of a knowledge domain.

In the evolution of an artistic form or genre (such as the sonata, or the epistolary novel, or the landscape painting) there is a process of exploration and elaboration as the practitioners of the form first converge on a set of shared practices, and explore what formal constraints can offer by way of heuristic framing. So for instance, the landscape painting lets us position a viewer in a particular kind of relation to an experience of space and place; the epistolary novel allows us to explore interiority, human relationships, and also the specific narrative frisson arising from the ways that the epistolary form limits what each character knows of the entire narrative situation. And at a certain point, those very constraints become themselves the visible feature: something to be foregrounded, displayed, played with, destabilized, in some cases deliberately and visibly violated.

In the digital humanities, data modeling proceeds in a very similar spirit: we use the process of data modeling to test our ideas about how texts and cultural objects behave against the actual cultural landscape. This is especially true in cases where a data model—for instance, a schema like that of the Text Encoding Initiative—has been created as an expression of disciplinary consensus, precisely so that it can be examined, questioned, and refined. The most important intellectual outcome of such a formalization is arguably not the consensus it proposes (its function as a standard) but the critical scrutiny it makes possible. The constraints become visible—and debatable—as an artifact of their capacity to generate meaning. As I have argued elsewhere, the design of models like the TEI (which explicitly permit dissent to be expressed in formal terms) provides a deeply characteristic mechanism for digital humanities research. In the data world, we see constraint used as an exploratory tool as much as it is used as an instrument of production. For a project like the Women Writers Project, the schema is both a form of documentation and a provisional theory, reflecting the formal structures we have discovered in the vast heterogeneous body of texts we are representing. We treat each version of the schema as a hypothesis pro tem concerning genre and textual structure, and we use it until it is challenged by an edge case that cannot be accommodated, at which point we modify the schema as needed to reflect this newly discovered corner of reality.

We are now in a position to return to our point of departure and ask again: what is the impact of going digital; how does it take us “from art to data” in the domain of text? And what does it mean to go “from art to data”?

I argue that going from art to data in a meaningful way entails bringing a set of formal commitments, such as we understand them, from the art work into explicit view within the domain of our data. The work of “going digital” thus entails several things:

1. First, it entails understanding and articulating those formal commitments (at the level of both the individual object and the genre), and expressing them in some appropriate form within the data. The resulting model makes explicit the formal structures that were at work in the original object, or in our engagement with it (for instance, interpretive strategies and editorial interventions).

2. Second, it entails a strategic information loss: a deliberate setting aside of the information we do not choose to retain, the information not accommodated in the model. This loss differs from the sampling loss, the pixels-in-between, of a bitmap-style data capture, in that the loss here serves to heighten the salience of what is retained, in the way that a road map omits elevation and soil data so that we can see the route more clearly.

3. And finally, it entails a corresponding strategic informational gain, in that the digital representation of a source object constitutes a model of the object: a purposeful representation of it that serves some interpretive or analytic goal.

A screen shot of a visualization from the Women Writers Project

Screen shot from Women Writers Project,

So the lossiness of formalism is also its strength and brings a number of important consequences:

  • we create computational tractability: the information we do retain is better adapted to computationally mediated ways of knowing, tools for analysis and apprehension
  • we gain strategic focus: we keep the information that matters to us, we focus our resources (of curation, storage, tool accommodation, etc.) on what we will actually use
  • we enable higher-level pattern discovery: we can compare models as well as instances; we can see patterns not only in our data but in our ideas about our data. The TAPAS project, for instance, which is building a large corpus of TEI collections, will also build tools by which we can study and compare the schemas projects use.
  • most importantly, through formalism, we create explicit models representing and communicating our assumptions about what constitutes the object for us in representational terms: what information is being carried with us across the analog-digital barrier. Johanna Drucker has observed that “insofar as form allows sense to appear to sentience…the role of aesthetics is to illuminate the ways in which the forms of knowledge provoke interpretation”[4] and in this respect, aesthetics and information modeling have a great deal in common.

The biggest research question I see arising here is how to achieve interoperability or mutual intelligibility of data models that does not require identity of data models. In other words, rather than seeking to perfect our models and achieve perfect consensus about them, how can we continue to have disagreeing but productive conversations about divergent interpretations, essential to the humanities, via our data models? How can our data models help us undertake these conversations? It seems to me that the accommodation of art within the sphere of data ultimately rests on this commitment to ongoing interpretation. When the TEI was founded, it was with the understanding that it needed to function as a broker or hub to mediate interchange among many divergent and equally legitimate modelling approaches. Over time, the TEI’s customization mechanism has evolved into a very rich system for representing dissent, debate, and interpretation, and may soon provide ways of visualizing those debates through a detailed comparison of  TEI schemas. While recognizing the need in some contexts for simple standards that enforce more impoverished but regular formal structures upon us, I hope we can avoid assuming that those simplifications are always adequate and always necessary as a condition of data.

[1] As Trevor Muñoz defined it in a recent Digital Humanities Data Curation workshop, data is

“information in the role of evidence; propositions systematically asserted, encoded in symbol structures.”

[2] The full passage from this letter reads: “For a Line or Lineament is not formed by Chance a Line is a Line in its Minutest Subdivisions Strait or Crooked It is Itself & Not Intermeasurable with or by any Thing Else Such is Job but since the French Revolution Englishmen are all Intermeasurable One by Another Certainly a happy state of Agreement to which I for One do not Agree.” (William Blake, Letter to George Cumberland, 12 April 1827. Available online at the William Blake Archive.)

[3] “…a model may violate expectations and so surprise us: either by a success we cannot explain, e.g., finding an occurrence where it should not be; or by a likewise inexplicable failure, e.g., not finding one where it is otherwise clearly present. In both cases modeling problematizes. As a tool of research, then, modeling succeeds intellectually when it results in failure, either directly within the model itself or indirectly through ideas it shows to be inadequate. This failure, in the sense of expectations violated, is, as we will see, fundamental to modeling.” Willard McCarty, “Modeling: A Study in Words and Meanings”, A Companion to Digital Humanities, ed. Schreibman, Siemens, and Unsworth. Blackwells, 2004.

[4] Johanna Drucker, SpecLab: Digital Aesthetics and Projects in Speculative Computing (2009), xii.


TEI and Scholarship (in the C{r|l}o{w|u}d)

[This is the text of a keynote presentation I gave at the TEI conference in 2012 at Texas A&M University. I am working on making my past presentations accessible here, in case they may be useful to anyone.]

James Surowiecki’s well-known book on “The Wisdom of Crowds,” and Kathy Sierra’s counterpoint on “The Dumbness of Crowds” give us a provisional definition of a wise crowd:

  • It must possess diversity of information and perspectives
  • Its members must think independently
  • It must be decentralized, and able to draw effectively on local knowledge
  • It must have some method of aggregating and synthesizing that knowledge

In this sense, the TEI is a wise crowd: deliberately broad, designedly interested in synthesizing the vast local expertise we draw on to produce something as ambitious and deeply expert as the TEI Guidelines.

But the TEI also has some structural ambivalence about the crowd: about the role of standards in a crowd. As researchers and designers (people who contribute expertise to the development of the guidelines) the crowd (that’s us!) is great, but as encoders and project developers the crowd shows another face:

  • The crowd is unruly!
  • They make mistakes!
  • They commit tag abuse!
  • They all do things differently!
  • They are all sure they are right!

The dissatisfaction with this divergence arises from the hope that this crowd (that is, also us!) will put its efforts towards a kind of crowd-sourcing: developing TEI data that can be aggregated into big data, producing a crowd-sourced canon that would serve as an input to scholarship. This would mean:

  • To the extent possible, a neutral encoding (whatever that means)
  • A reusable encoding (whatever that means)
  • An interoperable encoding (whatever that means)

The technical requirements for developing such a resource are not difficult to imagine; MONK and TEI-analytics have shown one way. But the social requirements are more challenging. That unruliness I mentioned isn’t cussedness or selfishness. It has to do rather with the motives and uses for text encoding that are emerging now more powerfully as the TEI comes into a new relationship with scholarship. The “crowd” in this new relationship is once again that wise crowd whose expertise is so important, but it is working in new ways and in a different environment.

In what follows, I’m going to explore the TEI and scholarship in (and of) the crowd by sketching three convergent narratives, and then considering where they lead us.

1. People

The first is a narrative of people: an emerging population of TEI users.

Here is a starting point: in the past year, from the Women Writers Project’s workshops alone (and we are only one of a large number of organizations offering TEI workshops), about 150 people participated in introductory TEI workshops, and this has been a typical number over the past three years, which have seen a steady and striking increase in demand for introductory and advanced TEI instruction. Syd Bauman and I began teaching TEI workshops in 2004; here’s the trend in total workshops offered:

WWP TEI Seminars, 2004-2012

And here is the trend in number of participants:

WWP TEI Seminars, 2004-2012

The increasing trend here is striking in itself, but more so when we unpack it a little. First, the composition of the audience is changing: 8-10 years ago, the predominant audience for TEI training was library and IT staff; now, very substantial numbers of participants are faculty and graduate students. Many (or even most) attendees are planning a small digital humanities project with a strong textual component, and they see TEI as a crucial aspect of the project’s development. They are full of ideas about their textual materials and also about the research questions they’d like to be able to pursue, or the teaching they’d like to be able to do; they see TEI markup as a way of representing what excites them about their texts and their research. Even more remarkably, some people attend TEI workshops to learn TEI because they think they should know about it, not solely because they are planning a TEI project. In other words, text encoding has taken on the status of an academic competence, a humanities research skill.

And these workshops represent just the WWP: there are now many other workshop series that are experiencing similar levels of growth.

2. Scholarship

The second strand of the narrative I’d like to lay out here has to do with scholarship and scale. TEI data is often considered as an input for scholarship, for instance via the concept of the “digital archive” or the digital edition. This is data digitized in advance of need, often by those who own the source material, on spec as it were, to be used by others. In the terms discussed at a recent workshop at Brown University on data modeling in the humanities, this is data being modeled “altruistically.” It is designed to function in a comparatively neutral and anticipatory way: in effect, it is data that stands ready to support whatever research we may bring to it. And in this sense it is data that serves scholarship, where the actual scholarly product is expected to be expressed in some other form, such as a scholarly article or monograph.

However, as DH is increasingly naturalized in humanities departments, we now are seeing attempts to articulate the role that text markup can play in a much closer relationship to scholarship:

  • First, TEI data considered as scholarship: as a representation of an analytical process that yields insight in itself, something that one could receive scholarly credit for.
  • And second, TEI data considered as a scholarly publication that is not operating under the sign of “edition” but rather something more like “monograph” or in fact a hybrid of the two: in other words, not operating solely as a reproduction or remediation of a primary source, but also as an argument about a set of primary sources or cultural objects/ideas.

The stakes of articulating this case successfully are, first of all, to make this kind of work visible within the system of academic reward, and second, to call into question the separation of “data creation” from interpretation and analysis: a separation that is objectionable within the digital humanities on both theoretical and political grounds.

As a result, this new population of scholar-encoders is ready and willing to understand the encoding work they are undertaking as entirely continuous with their work as critics, scholars, and teachers. The data they aspire to create is expected to carry and express this full freight of meaning.

3. Economics

The third strand I offer here has to do with economics, and in particular with the economics of data in the crowd. Because hand markup is expensive, large-scale data collections tend not to emphasize markup; in large-scale digital library projects for which the lower levels of the TEI in Libraries Best Practices guidelines were developed, markup is concentrated in the TEI header and largely absent from the text transcription. And in these large-scale contexts, markup is thus not only unrealistically expensive at an incremental level, but it also fails a larger economic test in which we compare the time it takes to do a task by hand with the time it will take to build a tool to do the task. If the tool is running on a sufficiently large data set, even a very expensive tool will pay for itself in the end. For this reason, algorithmic, just-in-time approaches make sense in large collections.

Markup excels in two situations. It excels in cases where the data set is small and the markup task is so complex that the economics of tool-building are at a disadvantage. If the tool required starts to approximate the human brain, well, we can hire one of those more cheaply than we can build it (for a few more years at least!). And second, markup excels in cases where we need to be able to concretize our identifications and subject them to individual scrutiny before using them as the basis for analysis: in other words, in situations where precision does matter, where every instance counts heavily towards our final result, and where the human labor of verification is costly and cannot be repeated. Hand markup is thus characteristically associated with, economically tied to, small-scale data.
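The economic test described above comes down to a break-even calculation, which can be sketched in Python. The function name `break_even_items` and all of the figures are hypothetical, chosen only to show the shape of the comparison; they are not measurements from any project.

```python
import math


def break_even_items(tool_build_hours, hand_hours_per_item, tool_hours_per_item=0.0):
    """Smallest collection size at which the tool costs no more than hand work.

    tool_hours_per_item covers any human checking the tool still requires.
    Returns None when the tool saves no time per item and so never pays off.
    """
    saving = hand_hours_per_item - tool_hours_per_item
    if saving <= 0:
        return None
    return math.ceil(tool_build_hours / saving)


# Hypothetical: a tool that takes 2000 hours to build and replaces
# half an hour of hand markup per text.
threshold = break_even_items(tool_build_hours=2000, hand_hours_per_item=0.5)

small_project_pays_off = 300 >= threshold       # a 300-text scholarly project
large_library_pays_off = 100_000 >= threshold   # a large digital library

# Hypothetical: the task is so complex that every tool result needs as much
# human verification as doing the markup by hand would have taken.
complex_task = break_even_items(
    tool_build_hours=2000, hand_hours_per_item=2.0, tool_hours_per_item=2.0
)
```

With these invented numbers, the tool pays for itself only beyond 4000 texts, so it is worthwhile for the large library and not for the small project. In the second case it never pays off, which is the situation in which hand markup excels.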

At this point in my story, these three narratives — people, scholarship, and economics — start to converge. At present, a generation of highly visible and really interesting digital humanities scholarship is proceeding on the basis of research that analyzes very large-scale data with powerful analytical tools. (Think of the Digging into Data funding initiative, of grid technologies, of high-performance computing, of e-science.) But at the same time, another new generation of digital humanities scholars is emerging along very different lines. They are humanities faculty working on individual texts, authors, genres; they are interested in the mediation of textual sources (in ways that are consonant with domains like book and media history); and they are alert to the ways that textual representation (including data modeling) inflects meaning, reading, interpretation. These scholars are gaining expertise in TEI as a natural extension of their interest in texts and in textual meaning and representation, and their scholarship will be expressed in their markup, and will also arise out of the analysis their markup makes possible.

So there is an interesting interplay or counterpoint here. Algorithmic approaches work well at large scale precisely because they don’t require us to scrutinize individual cases or make individual decisions: they work on probabilities rather than on certainty, and they work on trends, correlations, the tendency of things to yield pattern, rather than on the quiddity of things, their tendency towards uniqueness and individual bizarreness. But for some kinds of interpretation (e.g. for intrinsically small data sets, or for data sets that are dominated by exceptions rather than patterns), that very bizarreness is precisely the object of scrutiny. Scholarship is a process of scrutiny and debate and careful adjustment of our interpretations, instance by instance, and markup allows scholarly knowledge to be checked and adjusted. It produces a product that carries a higher informational value.

These two paradigms, if we like, of digital scholarship (“scholarship in the algorithm” and “scholarship in the markup”) are different but not opposed, not inimical; they are appropriate in different kinds of cases, both legitimate. They both represent significant and distinctive advances in digital scholarship: the idea of formalizing an analytical method in an algorithm is no more and no less remarkable, from the viewpoint of traditional humanities scholarship, than the idea of formalizing such a method in an information structure that adorns a textual instance. They each represent different attitudes towards the relationship between pattern and exception, and different approaches to managing that interplay. And in fact, as the Text Encoding and Text Analysis session at DH2012 noted, these two approaches have a lot to offer one another: they both work even better together than separately.

What can we observe about the TEI landscape that these narratives converge upon? First, it is significantly inhabited by “invisible” practitioners who are not experts, not members of TEI-L, not proprietors of large projects, but nonetheless receiving institutional support to create TEI data on a small scale (individually) and a large scale (collectively). These users are strongly invested in the idea of TEI as a tool for expressing scholarship: they believe that it is the right tool and they find it satisfying to use. They are working on documents that are valuable for their oddity and exceptionalism, and I will indulge myself here in the topos of the copious list: the notebooks of Thoreau, whaling logs, auction catalogues, family letters, broadsides attesting to the interesting reuse of woodcut illustrations, financial records, Russian poetry, an 18th-century ladies’ magazine specializing in mathematics, Revolutionary War pamphlets, sermons of a 19th-century religious leader, drafts of Modernist poetry, Swedish dramatic texts, records of Victorian social events, the thousand-page manuscript notebook of Ralph Waldo Emerson’s aunt, Mary Moody Emerson, and so on.

It’s hard to envision an intellectual rubric for such projects, and yet at a certain level they have a lot of things in common. First, they have a set of functional requirements in common: with small numbers of comparatively small TEI files, they are not “big data” and they don’t require a big publication infrastructure, but they do need a few commonly available pieces of publication infrastructure: a “table of contents” view, a “search” view, a “results” view (same as TOC, basically), a “reading” view. All of these are now the stock in trade, out of the box, of simple XML publishing tools like eXist, XTF, etc.

However, this data could yield quite a bit more insight with a few more tools for looking at it, and some tools in particular come up over and over again as offering useful views of the data: timelines, maps, representations of personal interconnections, networks of connected entities. And taking this even further, there are specific genres that could benefit from specific analytical tools: for instance, a drama browser (think of Watching the Script), a correspondence browser (think of a combination timeline and personal network), an itinerary browser (think of a combination map and timeline, like Neatline), an edition browser (with focus on variant readings, commentary, witness comparison: think of an amplified version of Juxta), and so forth.

These projects also have a set of opportunities in common: for one thing, they represent a remarkable opportunity to study the ways that markup functions representationally, if only we could study this data as a body: a semiotic form of knowledge. And for another, they represent a remarkable opportunity to study the ways that specific TEI elements are used, in the wild: a sociological form of knowledge. Finally, and most importantly, these projects have a number of problems in common:

  • Publication:  there is currently no obvious formal venue for such projects/publications, and the kind of self-publication that is fairly common in university digital humanities centers isn’t available at smaller institutions, or at institutions with only a single project of this kind
  • Data curation:  these projects are a data curation time bomb; they typically have a very tiny project staff that by its nature is short-term (students with high turnover, IT or library staff who don’t have a long-term institutional mandate to assist; grant-funded labor that will disappear when the funding runs out). Running on slender resources, they don’t have the luxury of detailed internal documentation (and they don’t typically have staff who are skilled at this). Migration to future TEI formats is in many cases probably out of reach.
  • Basic ongoing existence:  these are projects that quite often lack even a stable server home; when the person primarily responsible for their creation is no longer working on the project, the institution doesn’t have anyone whose job it is to keep the project working.

From some perspectives, these look like problems for these projects to identify and suffer and (hopefully, eventually) solve. This perspective has produced the TAPAS project, which may be familiar to many of you. TAPAS is a project now in a 2-year development phase funded by IMLS and NEH, which is developing a TEI publishing and archiving service for TEI projects at small institutions and those operating with small resources.

But we should also treat this as a problem for the TEI. If you can indulge me for a moment in some cheap historical periodization, we can divide the TEI’s history thus far into several phases:

  1. Inception and problem identification, where the problem is the fact that many scholars want to produce digital representations of research materials, and there is a risk that they will do it in dead-ended, self-limiting ways
  2. Research and development, where the TEI community grows intrepidly and tackles the question of “How do we represent humanities research materials?”
  3. Refinement and tool-building, where the community (now having both critical mass and an intellectual history) can set in place a working apparatus of use (e.g. Roma) and build Things that Work
  4. And now, in the past five years: public visibility, where (thanks to the tremendous and sudden popularity of the digital humanities), the TEI is now noticeable and legitimate in sectors where before it would have appeared a geeky anomaly. As I noted earlier, people—faculty, graduate students—now attend TEI workshops just out of a sense of professional curiosity and responsibility: “This is something I should know about.”

Things look very good, according to several important metrics. There’s public and institutional funding available for TEI projects; the idea of treating TEI projects as scholarly work, to be rewarded with professional advancement, isn’t ridiculous but is a real conversation. The “regular academy” recognizes the TEI (albeit in a vague and mystical way) as a gold standard in its domain: it possesses magical healing powers over data. And there is an infrastructure for learning to use the TEI, which is a huge development. Melissa Terras, a few years ago, addressed the TEI annual conference with a strongly worded alarum: she pointed out that the TEI had an urgent and sizeable need for training materials, support systems, information, on-ramps for the novice. Although the TEI itself has not responded to that call, its community has: there are now a substantial number of regular programs of workshops and institutes where novices and intermediate users can get excellent training in TEI, and there are also starting to be some excellent resources for teaching oneself (chief among them TEI by Example, developed by Melissa Terras and Edward Vanhoutte). And finally, a lot of TEI data is being produced.

But that success has produced a crossroads that we’re now standing at. The question is whether in 20 years that data will represent scholarly achievement or the record of a failed idealism: whether the emerging scholarly impulse to represent documents in an expressive, analytical, interpretively rich way is simply obsolete and untenable, or whether in fact such impulses can constitute a viable form of digital scholarship: not as raw, reusable representations whose value lies chiefly in the base text they preserve, but as interpretations that carry an insight worth circulating and preserving and coming back to. If the answer turns out to be “yes”, it will be because two conditions have been met:

  1. The data still exists (a curation challenge).
  2. The data still has something to say (a scholarly challenge).

It’s important to observe that this is not a question about interoperability; it is a question about infrastructure, and it is about social infrastructure as much as it is about technical infrastructure. It is tempting to treat the prospect of hundreds of small TEI projects as simply an interoperability nightmare, a hopeless case, but I think this assumption bears closer scrutiny and study. In fact, at this point, I will assert (sticking my neck out here) that the major obstacle to the long-term scholarly value and functioning of this data is not its heterogeneity but its physical dispersion. As an array of separately published (or unpublished) data sets, this material is practically invisible and terribly vulnerable: published through library web sites or faculty home pages; unable to take advantage of basic publishing infrastructure that would make it discoverable via OAI-PMH or similar protocols; vulnerable to changes in staffing and hardware, and to changes in publication technology. And last, through a terrible irony, unlikely to be published with tools and interfaces that will make the most of its rich markup, through lack of access to sustained technical expertise.

These vulnerabilities and invisibilities could be addressed by gathering these smaller projects together under a common infrastructure that would permit each one to show its own face while also existing as part of a super-collection. This creates, in effect, three forms of exposure and engagement for these data sets. The first is through their presence as individual projects, each with its own visible face through which that project’s data is seen and examined on its own terms (offering benefits to readers who are interested in individual projects for their specific content). The second is through their juxtaposition with (and hence direct awareness of) other similar projects, which opens up opportunities for projects to modify their data and converge towards some greater levels of consistency (offering benefits to the projects themselves). And the last is through their participation in the super-collection, the full aggregation of all project data (offering benefits to those who want to study TEI data, and also—if the collection gets large enough—to those who are interested in the content of the corpus that is formed).

The idea of a repository of TEI texts has been proposed before, in particular in 2011 as part of the discussion of the future of the TEI. There was general agreement that a repository of TEI data would have numerous benefits: as a source of examples for teaching and training and tool development, as a corpus to enable the study of the TEI, as a corpus for data mining, and so forth. But the discussion on the TEI listserv at that time came at the project from a somewhat different angle: it focused on the functions of the repository with respect to an authoritative TEI (in other words, on the function of the data as a corpus) rather than considering how such a repository might serve the needs of individual contributors. Perhaps as a result, significant attention was paid to the question of whether and how to enforce a baseline encoding, to provide for interoperability of the data; there was a general assumption that data in the repository should be converted to a common format (and perhaps that the responsibility for such conversion would lie with the contributing projects).

In other words, underlying this discussion was an assumption that the data would be chiefly used as an aggregation, and that without conversion to a common format, such an aggregation would be worthless.

But I think we should revisit these assumptions. In fact, I think there’s a huge benefit of such a repository first and foremost to the contributors for the reasons I’ve sketched above, and that benefit is accentuated if the repository permits and even supports variation in encoding. And I also think there’s a great deal we can do with this data as an aggregation, if we approach it in the same spirit as we approach any other heterogeneous, non-designed data set. Instead of aspiring to perfect consistency across the aggregation, we can focus on strategies for making useful inferences from the markup that is actually there. We can focus on the semantics of individual TEI elements, rather than on structure: in other words, on mining instead of parsing. And we can focus on what can be inferred from coarse rather than fine nesting information: “all of these divisions have datelines somewhere in them” rather than “each of these divisions has its dateline in a different place!?” We can also be prepared to work selectively with the data: for tools or functions that require tighter constraints, test for those constraints and use only the data that conforms. In short, we should treat interoperability as the last (and highly interesting) research challenge rather than the first objection. And of course, once we have such a data set, we can also think of ways to increase its convergence through data curation and through functional incentives to good practice.
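To make the contrast between coarse and fine nesting a little more concrete, here is a minimal sketch in Python, using only the standard library. The two sample letters, the function names, and the "tighter constraint" (a dateline inside an opener at the start of each division) are all invented for illustration; they are assumptions, not features of TAPAS or of any existing TEI tool.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Two invented sample documents: same information, different placement.
LETTER_OPENER = f"""<TEI xmlns="{TEI}"><text><body>
<div type="letter"><opener><dateline>Concord, 3 May 1845</dateline></opener><p>...</p></div>
</body></text></TEI>"""

LETTER_CLOSER = f"""<TEI xmlns="{TEI}"><text><body>
<div type="letter"><p>...</p><closer><dateline>Boston, 9 June 1846</dateline></closer></div>
</body></text></TEI>"""


def divisions_with_dateline(tei_xml):
    """Coarse question: does each top-level division have a dateline
    somewhere inside it, at any depth and in any position?"""
    root = ET.fromstring(tei_xml)
    divs = root.findall(".//tei:body/tei:div", NS)
    return [div.find(".//tei:dateline", NS) is not None for div in divs]


def conforms(tei_xml):
    """Fine question (a hypothetical tighter constraint): does every
    division begin with an <opener> that directly contains a <dateline>?"""
    root = ET.fromstring(tei_xml)
    divs = root.findall(".//tei:body/tei:div", NS)
    return bool(divs) and all(
        len(div) > 0
        and div[0].tag == f"{{{TEI}}}opener"
        and div[0].find("tei:dateline", NS) is not None
        for div in divs
    )


corpus = {"opener-style": LETTER_OPENER, "closer-style": LETTER_CLOSER}

# Mining: every document can answer the coarse question.
coarse = {name: divisions_with_dateline(doc) for name, doc in corpus.items()}

# Working selectively: only conforming documents feed the stricter tool.
usable = [name for name, doc in corpus.items() if conforms(doc)]
```

Under these assumptions, both letters answer the coarse question in the same way, while only the opener-style letter passes the stricter test. A tool that needs the tighter structure can draw on the conforming subset without the rest of the aggregation having to be converted first.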

If this is the path forward, I’d like to argue that the TEI has as much stake in it as the scholarly community of users, and I’d like to propose that we consider what that path could look like. I am involved in the TAPAS project, which has already begun some work of its own in response to this set of needs, with special emphasis on the predicament of small TEI producers. But we are also very eager to see that work benefit the TEI as broadly as possible. So in the interest of understanding those broader benefits, I’d like to set TAPAS aside for the moment, for purposes of this discussion: let’s treat those plans as hypothetical and flexible and instead entertain the question of what the TEI and the TEI community might most look for in such a service, if we were designing it from scratch.

What would such a service look like? What could it usefully do, within the ecology I have sketched? This is a question I would like to genuinely pose for discussion here, but to give you something to respond to I am going to seed the discussion with some heretical proposals:

  • Gather the data in one place
  • Exercise our ingenuity in leveraging what the TEI vocabulary does give us in the way of a common language
  • Offer some incentives towards convergence in the form of attractive functionality
  • Provide some services that help with convergence (e.g. data massage)
  • Provide some automated tools that study divergence and bring it to the fore, for discussion: why did you do it this way? Could you do it that way?
  • But also permit the exercise of independence: provide your own data and your own stylesheets
  • Find ways to make the markup itself of interest: this is a corpus of TEI data, not (primarily) a corpus of letters, novels, diaries, etc.
  • Encourage everyone to deposit their TEI data (eventually!)
  • Provide curation (figure out how to fund it), top priority: this is a community resource
  • Provide methods for mining and studying this data (qua TEI, qua content)
  • Provide ways to make this data valuable for third parties: make it as open as possible



Big changes ahead

With a mixture of excitement and astonishment I find myself changing jobs after 20 years at Brown University. Starting July 1, I will be taking up a new position at Northeastern University as Professor of the Practice in the Department of English, and as Director of the Digital Scholarship Group in the library. As part of my faculty half, I will also be affiliated with the NULab for Texts, Maps, and Networks, and I will also continue as Director of the Women Writers Project.

My first impulse here is to offer thanks, because I feel extremely lucky but I can also see around me the efforts and generosity of other people who have brought me to this point. Aficionados of digital humanities job construction will recognize this new position as not only beautifully tailored but also an institutional achievement: a job that crosses colleges, disciplines, faculty/staff lines. I have only a glimpse of what it took to put it together and I am hugely grateful to those at Northeastern who worked to make it happen. And on Brown University’s side, there has been a long history of generous support for me and for the WWP going back to 1988 when the project was founded, and extending across the many university departments that have housed the project: the English Department, Computing and Information Services, and most recently the University Library. I have been very happy at Brown and could not have been more fortunate in my colleagues and in the professional opportunities I have found there.

So what is coming next? There are a few major new things on the horizon:

  • Starting the Digital Scholarship Group: this will be the big hit-the-ground-running agenda item for me; the DSG is an idea and a space and I’ll be working intensively with Patrick Yott on bringing it into existence. The WWP will have a home within the DSG (together with NEU’s other digital projects) and we will be building a support structure and research agenda that can
  • Teaching: NEU already has a significant graduate student body interested in digital humanities, and has plans to expand on this. My position carries a 2-course load and I’m really looking forward to developing courses and thinking about the overall digital humanities curriculum.
  • Working with digital humanities colleagues in the NULab: this deserves its own post so at this point I will just say that it’s a very exciting prospect…
  • Developing a strategic plan for the WWP that takes advantage of new circumstances: participation by NEU’s digital humanities graduate students, opportunities to contribute to research initiatives in the NULab, and above all long-term fiscal stability.

All of my current projects, grant commitments, and so forth will be maintained, one way or another, but the transition (especially for the WWP side of things) is going to take a lot of work so I anticipate being distracted and possibly needing to shift things around a bit over the next several months.

Proceeding in a hopeful and enthusiastic spirit!


A Matter of Scale

I recently had the honor and pleasure of giving a joint keynote presentation, “A Matter of Scale,” with Matt Jockers at the Boston-area Days of DH conference hosted by Northeastern University’s NULab. Matt has kindly put the text of our debate up on the University of Nebraska open-access repository and has also blogged about it.

This debate was great fun to prepare and also provided a fascinating perspective for me on the process of authoring. I do write a lot of single-authored things (e.g. conference papers, articles) where “my own” ideas and arguments are all I have to focus on, though I find those usually emerge by engaging with and commenting on other people’s work. I also write a lot of single-authored things where I’m actually serving as the proxy for a group (e.g. grant proposals). And I also increasingly find myself writing co-authored material—for instance, the white paper I’m currently working on with Fotis Jannidis that reports on the data modeling workshop we organized last spring, or the article I wrote with Jacqueline Wernimont on feminism and the Women Writers Project. In all of these situations I feel that I know the boundaries of my own ideas pretty well, even as I can feel them being influenced or put into dialogue with those of my collaborators.

However, writing this debate with Matt took a different turn. The presentation was framed as a debate from the start—so, in principle, each of us would be defending a specific position (big data for him, small data for me). We ascertained early on that we didn’t actually find that polarization very helpful, and we developed a narrative for the presentation that started by throwing it out, then facetiously embracing it, and finally exploring it in some detail. But we retained the framing device of the debate-as-conversational-exchange. However, rather than each writing our own dialogue, we both wrote both parts: Matt began with an initial sketch, which I then reworked, and he expanded, and I refined, and he amended, and so forth, until we were done. The result was that throughout the authoring process, we were putting words in each other’s mouths, and editing words and ideas of “our own” that had been written for us by someone else.

Despite agreeing on the misleadingness of the micro/macro polarity, I think Matt and I actually do have differing ideas about data and different approaches to using it—but what was striking to me during this process was that I found I had a hard time remembering what my own opinions were. The ideas and words Matt wrote for the debate-Julia character didn’t always feel fully familiar to me, but at the same time they didn’t feel alien either, and they were so fully embedded in the unfolding dialogue that they drew their character more from that logic than from my own brain, even as I reworked them from my own perspective.

I’m not sure what conclusions to draw, but it’s clear to me that there’s more to learn from collaborative authoring than just the virtues of compromise and the added value of multiple perspectives. I’m sure there’s an important literature on the subject and would be grateful for pointers. Working with Matt was a blast and I hope we have an opportunity to do this again.


On getting old

I realized at this year’s DH conference that I think of the conference as marking the “new year” in my digital humanities life—partly because it coincides roughly with the new fiscal year at my institution, and partly because I always come away with that mix of elation and resolution and mild hangover that’s often associated with early January. It also makes me aware of the passing of time. This year, with so many new young participants, it occurred to me that I’m roughly the same age now as my mentors were when I first started attending the conference. But where in 1994 I felt I had everything to learn from those who were older than I, now nearly 20 years later I feel I have everything to learn from those who are younger. My “generation” in DH (if I can permit myself such a gross and vague term for a moment) spent a lot of time and effort focusing on developing data standards and organizational infrastructure and big important projects and articulations of methods. We were and are terribly self-conscious about everything, having made in so many cases a professional transition that defamiliarized the very roots of what we had been trained to do, and that self-consciousness felt like power. I think I see in the “next generation” (with the same apology!) somewhat less of this self-consciousness and more of an adeptness at getting things done. When I see the projects and research work that were presented in Hamburg I feel a sense of awe, of stepping back as a train rushes by.

Looking at that train while it pauses in the station, I can see its parts and I can understand them—I know about the data standards, the infrastructure, the languages, the layers and modules, the way things work, and I know that in principle I could build such a thing. I know how to write the grant proposal for such a thing. But in the face of its sheer force and speed and power, I feel the way I imagine a Victorian stagecoach might have felt while waiting at a railroad crossing—I feel fragile and vulnerable and a little elderly. (And now we can all laugh together at how silly that is.)

OK, after singing Auld Lang Syne and sleeping in, we wake up a few days later with a renewed sense of vigor. My “new year’s resolutions” coming out of this DH2012 are:

  1. Read more DH blogs!
  2. Read more DH blogs in languages other than English! I was delighted to be placed next to at the poster session and take this as a good sign. Also very excited about the possibility I heard discussed of a Spanish-language and French-language Day of DH.
  3. Write more!

Happy new year and more soon, I hope!



The database is the scholarship

I was lucky enough to be invited to participate in the recent workshop organized by Tanya Clement and Doug Reside at MITH, “Off the Tracks—Laying New Lines for Digital Humanities Scholars.” It was such a rich couple of days that I’m sure I’ll be working it over in my mind for quite some time. But there was a moment of insight that stood out for me and I want to come back to it.

Among the many useful points of orientation for our discussion were Mills Kelly’s thoughtful and influential 2008 postings on “Making Digital Scholarship Count,” in which he offers definitions of “scholarship” and “digital scholarship” as a basis for thinking about how this work ought to be counted in the scheme of academic reward. For the Off the Tracks workshop, this was an important point because of the key issues being tackled: what kinds of work do digital humanities center staff do? Is it scholarship? Is it research? Should it be?

For me, the fulcrum of Mills Kelly’s argument is right here:

Where it gets trickier is when we consider digitization projects–whether small in scale, or massive, like the Perseus Project or the Valley of Shadow. Each of these excellent and heavily used projects offers scholars, teachers, students, and the general public unique access to their content. But, as Cathy Davidson at HASTAC told me in an interview for this series, “the database is not the scholarship. The book or the article that results from it is the scholarship.” Or, I would add, the digital scholarship that results from it. In other words, I’m not willing to limit us to the old warhorses of the book or scholarly article.

I also want to emphasize that I have tremendous respect for the scholars and teams of students and staff who created these two projects–both of which I use often in my own teaching. But I also have to say that I don’t think either project can be considered “scholarship” if we use the definition I’ve proposed here.

Why not, you might well ask? The reason is fairly simple in both cases. Neither project offers an argument.

The question is, do we know where to look for the argument in a work of digital scholarship? To the perennial discussion of “what is digital humanities?”, I would add this: digital humanities is the intellectual domain in which the argument is very much the database, or, perhaps more precisely, the data model. While books and articles may continue to result from our engagement with data, it’s misleading to think of the data as simply an input (raw or cooked) to this work; those of us who work on the development of “the database” (where that’s a metaphor for a collection of digital research materials) know how deeply that work is disciplinary, political, and argumentative. I think the question the digital humanities still needs to work on, though, is where to look for those arguments and how to read them, understanding that they may in some cases be the most consequential interventions in our field.


Hello, world

I thought about deleting this pre-post with its boilerplate title, but then decided to keep it for its authentic spirit of “is this mike on?” and to acknowledge the awkwardness of writing in a space structured around audience when you know the audience isn’t listening yet. If you’ve arrived here by reading backwards (or because you happened upon it through some quirk of searching), welcome and thanks for reading.

I have a number of writing outlets but they all generate very strategic or local prose: grant proposals, conference and seminar presentations, reports of various kinds, articles for books and journals. The great thing about the digital academy is that it has become so conversational: there’s a sense of timely, informal, yet serious exchange that is somehow humanizing, less ritualized but no less substantive than many of the traditional forms of academic writing. The stylistics of this online conversation interest me a great deal and the best way to study it is probably to do it: hence this blog.
