[This is the text of a keynote presentation I gave at the 2012 TEI annual conference at Texas A&M University. Posting it now in the spirit of “better late than never” and “huh, maybe still useful!”]
James Surowiecki’s well-known book on “The Wisdom of Crowds,” and Kathy Sierra’s counterpoint on “The Dumbness of Crowds” give us a provisional definition of a wise crowd:
- It must possess diversity of information and perspectives
- Its members must think independently
- It must be decentralized, and able to draw effectively on local knowledge
- It must have some method of aggregating and synthesizing that knowledge
In this sense, the TEI is a wise crowd: deliberately broad, designedly interested in synthesizing the vast local expertise we draw on to produce something as ambitious and deeply expert as the TEI Guidelines.
But the TEI also has some structural ambivalence about the crowd: about the role of standards in a crowd. As researchers and designers (people who contribute expertise to the development of the guidelines) the crowd (that’s us!) is great, but as encoders and project developers the crowd shows another face:
- The crowd is unruly!
- They make mistakes!
- They commit tag abuse!
- They all do things differently!
- They are all sure they are right!
The dissatisfaction with this divergence arises from the hope that this crowd (that is, also us!) will put its efforts towards a kind of crowd-sourcing: developing TEI data that can be aggregated into big data, producing a crowd-sourced canon that would serve as an input to scholarship. This would mean:
- To the extent possible, a neutral encoding (whatever that means)
- A reusable encoding (whatever that means)
- An interoperable encoding (whatever that means)
The technical requirements for developing such a resource are not difficult to imagine; MONK and TEI-analytics have shown one way. But the social requirements are more challenging. That unruliness I mentioned isn’t cussedness or selfishness. It has to do rather with the motives and uses for text encoding that are emerging now more powerfully as the TEI comes into a new relationship with scholarship. The “crowd” in this new relationship is once again that wise crowd whose expertise is so important, but it is working in new ways and in a different environment.
In what follows, I’m going to explore the TEI and scholarship in (and of) the crowd by sketching three convergent narratives, and then considering where they lead us.
The first is a narrative of people: an emerging population of TEI users.
Here is a starting point: in the past year, about 150 people participated in introductory TEI workshops from the WWP alone (and we are only one of a large number of organizations offering TEI workshops). This has been a typical number over the past three years, which have seen a steady and striking increase in demand for introductory and advanced TEI instruction. Syd Bauman and I began teaching TEI workshops in 2004; here’s the trend in total workshops offered (a total of 69 events in 8 years):
and the corresponding trend in number of participants (a total of at least 1166 participants as of 2012; we didn’t keep strict records in the first few years):
The increasing trend here is striking in itself, but more so when we unpack it a little. First, the composition of the audience is changing: 8-10 years ago, the predominant audience for TEI training was library and IT staff; now, very substantial numbers of participants are faculty and graduate students. Many (or even most) attendees are planning a small digital humanities project with a strong textual component, and they see TEI as a crucial aspect of the project’s development. They are full of ideas about their textual materials and also about the research questions they’d like to be able to pursue, or the teaching they’d like to be able to do; they see TEI markup as a way of representing what excites them about their texts and their research. Even more remarkably, some people attend TEI workshops because they think they should know about the TEI, not solely because they are planning a TEI project. In other words, text encoding has taken on the status of an academic competence, a humanities research skill.
And these workshops represent just the WWP: I know that there are now many other workshop series that are experiencing similar levels of growth. [Edited to add: the upward trend in workshops has steepened further since 2012.]
The second strand of the narrative I’d like to lay out here has to do with scholarship and scale. TEI data is often considered as an input for scholarship, for instance via the concept of the “digital archive” or the digital edition. This is data digitized in advance of need, often by those who own the source material, on spec as it were, to be used by others. In the terms discussed at a recent workshop at Brown University on data modeling in the humanities, this is data being modeled “altruistically.” It is designed to function in a comparatively neutral and anticipatory way: in effect, it is data that stands ready to support whatever research we may bring to it. And in this sense it is data that serves scholarship, where the actual scholarly product is expected to be expressed in some other form, such as a scholarly article or monograph.
However, as DH is increasingly naturalized in humanities departments, we are now seeing attempts to articulate the role that text markup can play in a much closer relationship to scholarship:
- First, TEI data considered as scholarship: as a representation of an analytical process that yields insight in itself, something that one could receive scholarly credit for.
- And second, TEI data considered as a scholarly publication that is not operating under the sign of “edition” but rather something more like “monograph,” or in fact a hybrid of the two: in other words, not operating solely as a reproduction or remediation of a primary source, but also as an argument about a set of primary sources or cultural objects/ideas.
The stakes of articulating this case successfully are, first of all, to make this kind of work visible within the system of academic reward, and second, to call into question the separation of “data creation” from interpretation and analysis: a separation that is objectionable within the digital humanities on both theoretical and political grounds.
As a result, this new population of scholar-encoders is ready and willing to understand the encoding work they are undertaking as entirely continuous with their work as critics, scholars, and teachers. The data they aspire to create is expected to carry and express this full freight of meaning.
The third strand I offer here has to do with economics, and in particular with the economics of data in the crowd. Because hand markup is expensive, large-scale data collections tend not to emphasize markup; in the large-scale digital library projects for which the lower levels of the TEI in Libraries Best Practices guidelines were developed, markup is concentrated in the TEI header and largely absent from the text transcription. And in these large-scale contexts, markup is thus not only unrealistically expensive at an incremental level, but it also fails a larger economic test in which we compare the time it takes to do a task by hand with the time it will take to build a tool to do the task. If the tool is running on a sufficiently large data set, even a very expensive tool will pay for itself in the end. For this reason, algorithmic, just-in-time approaches make sense in large collections.
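To make that comparison concrete, here is a back-of-the-envelope sketch with entirely hypothetical numbers (neither figure comes from any real project): if hand markup costs ten minutes per document and a tool takes 400 hours to build, the tool pays for itself in the low thousands of documents, a scale trivial for a digital library but far beyond a single-notebook edition.

```python
# Back-of-the-envelope break-even for hand markup vs. tool-building.
# Both figures are hypothetical, purely for illustration.
HAND_MINUTES_PER_DOC = 10        # assumed cost of marking up one document by hand
TOOL_BUILD_MINUTES = 400 * 60    # assumed one-time cost of building an equivalent tool

# The tool pays for itself once the hand-markup time it replaces
# exceeds the time spent building it.
break_even_docs = TOOL_BUILD_MINUTES / HAND_MINUTES_PER_DOC
print(f"break-even at {break_even_docs:.0f} documents")  # -> 2400
```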
Markup excels in two situations. It excels in cases where the data set is small and the markup task is so complex that the economics of tool-building are at a disadvantage. If the tool required starts to approximate the human brain, well, we can hire one of those more cheaply than we can build it (for a few more years at least!). And second, markup excels in cases where we need to be able to concretize our identifications and subject them to individual scrutiny before using them as the basis for analysis: in other words, in situations where precision does matter, where every instance counts heavily towards our final result, and where the human labor of verification is costly and cannot be repeated. Hand markup is thus characteristically associated with, economically tied to, small-scale data.
At this point in my story, these three narratives— people, scholarship, and economics—start to converge. At present, a generation of highly visible and really interesting digital humanities scholarship is proceeding on the basis of research that analyzes very large-scale data with powerful analytical tools. (Think of the NEH’s Digging into Data funding initiative, of grid technologies, of high-performance computing, of e-science.) But at the same time, another new generation of digital humanities scholars is emerging along very different lines. They are humanities faculty working on individual texts, authors, genres; they are interested in the mediation of textual sources (in ways that are consonant with domains like book and media history); and they are alert to the ways that textual representation (including data modeling) inflects meaning, reading, interpretation. These scholars are gaining expertise in TEI as a natural extension of their interest in texts and in textual meaning and representation, and their scholarship will be expressed in their markup, and will also arise out of the analysis their markup makes possible.
So there is an interesting interplay or counterpoint here. Algorithmic approaches work well at large scale precisely because they don’t require us to scrutinize individual cases or make individual decisions: they work on probabilities rather than on certainty, and they work on trends, correlations, the tendency of things to yield pattern, rather than on the quiddity of things, their tendency towards uniqueness and individual bizarreness. But for some kinds of interpretation (e.g. for intrinsically small data sets, for data sets that are dominated by exceptions rather than patterns), that very bizarreness is precisely the object of scrutiny: scholarship is a process of scrutiny and debate and careful adjustment of our interpretations, instance by instance: markup allows scholarly knowledge to be checked, adjusted. It produces a product that carries a higher informational value.
These two paradigms, if we like, of digital scholarship (“scholarship in the algorithm” and “scholarship in the markup”) are different but not opposed, not inimical; they are appropriate in different kinds of cases, both legitimate. They both represent significant and distinctive advances in digital scholarship: the idea of formalizing an analytical method in an algorithm is no more and no less remarkable, from the viewpoint of traditional humanities scholarship, than the idea of formalizing such a method in an information structure that adorns a textual instance. They each represent different attitudes towards the relationship between pattern and exception, and different approaches to managing that interplay. And in fact, as the Text Encoding and Text Analysis session at DH2012 noted, these two approaches have a lot to offer one another: they both work even better together than separately.
What can we observe about the TEI landscape that these narratives converge upon? First, it is significantly inhabited by “invisible” practitioners who are not experts, not members of TEI-L, not proprietors of large projects, but nonetheless receiving institutional support to create TEI data on a small scale (individually) and a large scale (collectively). These users are strongly invested in the idea of TEI as a tool for expressing scholarship: they believe that it is the right tool and they find it satisfying to use. They are working on documents that are valuable for their oddity and exceptionalism, and I will indulge myself here in the topos of the copious list: the notebooks of Thoreau, whaling logs, auction catalogues, family letters, broadsides attesting to the interesting reuse of woodcut illustrations, financial records, Russian poetry, an 18th-century ladies magazine specializing in mathematics, revolutionary war pamphlets, sermons of a 19th-century religious leader, drafts of Modernist poetry, Swedish dramatic texts, records of Victorian social events, the thousand-page manuscript notebook of Mary Moody Emerson, and so on.
It’s hard to envision a unifying intellectual rubric for such projects, and yet at a certain level they have a lot of things in common. First, they have a set of functional requirements in common: with comparatively small numbers of TEI files, they are not “big data” and they don’t require a big publication infrastructure, but they do need a few commonly available pieces of publication infrastructure: a “table of contents” view, a “search” view, a “results” view, a “reading” view: all of these are now the stock in trade, out of the box, of simple XML publishing tools like eXist, XTF, etc.
However, this data could yield quite a bit more insight with a few more tools for looking at it, and some tools in particular come up over and over again as offering useful views of the data: timelines, maps, representations of personal interconnections, networks of connected entities. And taking this even further, there are specific genres that could benefit from specific analytical tools: for instance, a drama browser (think of Watching the Script), a correspondence browser (think of a combination timeline and personal network), an itinerary browser (think of a combination map and timeline, like Neatline), an edition browser (with focus on variant readings, commentary, witness comparison: think of an amplified version of Juxta), and so forth.
These projects also have a set of opportunities in common: for one thing, they represent a remarkable opportunity to study the ways that markup functions representationally, if only we could study this data as a body: a semiotic form of knowledge. And for another, they represent a remarkable opportunity to study the ways that specific TEI elements are used, in the wild: a sociological form of knowledge. Finally, and most importantly, these projects have a number of problems in common:
- Publication: there is currently no obvious formal venue for such projects/publications, and the kind of self-publication that is fairly common in university digital humanities centers isn’t available at smaller institutions, or at institutions with only a single project of this kind
- Data curation: these projects are a data curation time bomb; they typically have a very tiny project staff that by its nature is short-term (students with high turnover; IT or library staff who don’t have a long-term institutional mandate to assist; grant-funded labor that will disappear when the funding runs out). Running on slender resources, they don’t have the luxury of detailed internal documentation (and they don’t typically have staff who are skilled at this). Migration to future TEI formats is in many cases probably out of reach.
- Basic ongoing existence: these are projects that quite often lack even a stable server home; when the person primarily responsible for their creation is no longer working on the project, the institution doesn’t have anyone whose job it is to keep the project working.
From some perspectives, these look like problems for these projects to identify and suffer and (hopefully, eventually) solve. This perspective has produced the TAPAS project, which may be familiar to many of you. TAPAS, now in a 2-year development phase funded by IMLS and NEH, is developing a TEI publishing and archiving service for TEI projects at small institutions and those operating with small resources. [Edited to add: in 2016, TAPAS is now in operation and working through a further 3-year development phase focused on XML-aware repository functionality and on pedagogical tools.]
But we should also treat this as a problem for the TEI. If you can indulge me for a moment in some cheap historical periodization, we can divide the TEI’s history thus far into several phases:
- Inception and problem identification, where the problem is the fact that many scholars want to produce digital representations of research materials, and there is a risk that they will do it in dead-ended, self-limiting ways
- Research and development, where the TEI community grows intrepidly and tackles the question of “How do we represent humanities research materials?”
- Refinement and tool-building, where the community (now having both critical mass and an intellectual history) can set in place a working apparatus of use (e.g. Roma) and build Things that Work
- And now, in the past five years: public visibility, where (thanks to the tremendous and sudden popularity of the digital humanities), the TEI is now noticeable and legitimate in sectors where before it would have appeared a geeky anomaly. As I noted earlier, people—faculty, graduate students—now attend TEI workshops just out of a sense of professional curiosity and responsibility: “This is something I should know about.”
Things look very good, according to several important metrics. There’s public and institutional funding available for TEI projects; the idea of treating TEI projects as scholarly work, to be rewarded with professional advancement, isn’t ridiculous but is a real conversation. The “regular academy” recognizes the TEI (albeit in a vague and mystical way) as a gold standard in its domain: it possesses magical healing powers over data. And there is an infrastructure for learning to use the TEI, which is a huge development; Melissa Terras, a few years ago, addressed the TEI annual conference with a strongly worded alarum: she pointed out that the TEI had an urgent and sizeable need for training materials, support systems, information, on-ramps for the novice. Although the TEI itself has not responded to that call, its community has: there are now a substantial number of regular programs of workshops and institutes where novices and intermediate users can get excellent training in TEI, and there are also starting to be some excellent resources for teaching oneself (chief among them TEI by Example, developed by Melissa Terras and Edward Vanhoutte). And finally, a lot of TEI data is being produced.
But that success has produced a crossroads that we’re now standing at. The question is whether in 20 years that data will represent scholarly achievement or the record of a failed idealism: whether the emerging scholarly impulse to represent documents in an expressive, analytical, interpretively rich way is simply obsolete and untenable, or whether in fact such impulses can constitute a viable form of digital scholarship: not as raw, reusable representations whose value lies chiefly in the base text they preserve, but as interpretations that carry an insight worth circulating and preserving and coming back to. If the answer turns out to be “yes”, it will be because two conditions have been met:
- The data still exists (a curation challenge).
- The data still has something to say (a scholarly challenge).
It’s important to observe that this is not a question about interoperability; it is a question about infrastructure, and it is about social infrastructure as much as it is about technical infrastructure. It is tempting to treat the prospect of hundreds of small TEI projects as simply an interoperability nightmare, a hopeless case, but I think this assumption bears closer scrutiny and study. In fact, at this point, I will assert (sticking my neck out here) that the major obstacle to the long-term scholarly value and functioning of this data is not its heterogeneity but its physical dispersion. As an array of separately published (or unpublished) data sets, this material is practically invisible and terribly vulnerable: published through library web sites or faculty home pages; unable to take advantage of basic publishing infrastructure that would make it discoverable via OAI-PMH or similar protocols; vulnerable to changes in staffing and hardware, and to changes in publication technology. And last, through a terrible irony, it is unlikely to be published with tools and interfaces that will make the most of its rich markup, for lack of access to sustained technical expertise.
These vulnerabilities and invisibilities could be addressed by gathering these smaller projects together under a common infrastructure that would permit each one to show its own face while also existing as part of a super-collection. This creates, in effect, three forms of exposure and engagement for these data sets. The first is through their presence as individual projects, each with its own visible face through which that project’s data is seen and examined on its own terms (offering benefits to readers who are interested in individual projects for their specific content). The second is through their juxtaposition with (and hence direct awareness of) other similar projects, which opens up opportunities for projects to modify their data and converge towards some greater levels of consistency (offering benefits to the projects themselves). And the last is through their participation in the super-collection, the full aggregation of all project data (offering benefits to those who want to study TEI data, and also—if the collection gets large enough—to those who are interested in the content of the corpus that is formed).
The idea of a repository of TEI texts has been proposed before, in particular in 2011 as part of the discussion of the future of the TEI. There was general agreement that a repository of TEI data would have numerous benefits: as a source of examples for teaching and training and tool development, as a corpus to enable the study of the TEI, as a corpus for data mining, and so forth. But the discussion on the TEI listserv at that time came at the undertaking from a somewhat different angle: it focused on the functions of the repository with respect to an authoritative TEI—in other words, on the function of the data as a corpus—rather than considering how such a repository might serve the needs of individual contributors. Perhaps as a result, significant attention was paid to the question of whether and how to enforce a baseline encoding, to provide for interoperability of the data; there was a general assumption that data in the repository should be converted to a common format (and perhaps that the responsibility for such conversion would lie with the contributing projects).
In other words, underlying this discussion was an assumption that the data would be chiefly used as an aggregation, and that without conversion to a common format, such an aggregation would be worthless.
But I think we should revisit these assumptions. In fact, I think there’s a huge benefit of such a repository first and foremost to the contributors for the reasons I’ve sketched above, and that benefit is accentuated if the repository permits and even supports variation in encoding. And I also think there’s a great deal we can do with this data as an aggregation, if we approach it in the same spirit as we approach any other heterogeneous, non-designed data set. Instead of aspiring to perfect consistency across the aggregation, we can focus on strategies for making useful inferences from the markup that is actually there. We can focus on the semantics of individual TEI elements, rather than on structure: in other words, on mining instead of parsing. And we can focus on what can be inferred from coarse rather than fine nesting information: “all of these divisions have a <dateline> somewhere in them” rather than “each of these divisions has its <dateline> in a different place!?” We can also be prepared to work selectively with the data: for tools or functions that require tighter constraints, test for those constraints and use only the data that conforms. In short, we should treat interoperability as the last (and highly interesting) research challenge rather than the first objection. And of course, once we have such a data set, we can also think of ways to increase its convergence through data curation and through functional incentives to good practice.
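To make “mining instead of parsing” a bit more concrete, here is a minimal Python sketch (my own illustration, not any existing tool) using lxml: it applies the coarse <dateline> constraint described above to a hypothetical directory of TEI files and keeps only the files that conform, in the spirit of working selectively with the data.

```python
# A sketch of coarse-constraint testing over a hypothetical TEI corpus:
# keep only the files in which every <div> contains a <dateline>
# somewhere inside it, at whatever depth.
from pathlib import Path
from lxml import etree

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def divs_with_dateline(tree):
    """Return (total <div> count, count of <div>s with a <dateline> at any depth)."""
    divs = tree.findall(".//tei:div", TEI_NS)
    with_dl = [d for d in divs if d.find(".//tei:dateline", TEI_NS) is not None]
    return len(divs), len(with_dl)

conforming = []
for path in Path("corpus").glob("*.xml"):  # "corpus" is a placeholder directory
    try:
        tree = etree.parse(str(path))
    except etree.XMLSyntaxError:
        continue  # skip files that do not parse at all
    total, with_dl = divs_with_dateline(tree)
    if total and total == with_dl:
        conforming.append(path)  # meets the coarse constraint; usable by the tool

print(f"{len(conforming)} files satisfy the <dateline> constraint")
```

A tool that needs datelines can then run over the conforming files alone, without insisting that the whole aggregation share a single structure.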
If this is the path forward, I’d like to argue that the TEI has as much stake in it as the scholarly community of users, and I’d like to propose that we consider what that path could look like. I am involved in the TAPAS project, which has already begun some work of its own in response to this set of needs, with special emphasis on the predicament of small TEI producers. But we are also very eager to see that work benefit the TEI as broadly as possible. So in the interest of understanding those broader benefits, I’d like to set TAPAS aside for the moment, for purposes of this discussion: let’s treat those plans as hypothetical and flexible and instead entertain the question of what the TEI and the TEI community might most look for in such a service, if we were designing it from scratch.
What would such a service look like? What could it usefully do, within the ecology I have sketched? This is a question I would like to genuinely pose for discussion here, but to give you something to respond to I am going to seed the discussion with some heretical proposals:
- Gather the data in one place
- Exercise our ingenuity in leveraging what the TEI vocabulary does give us in the way of a common language
- Offer some incentives towards convergence in the form of attractive functionality
- Provide some services that help with convergence (e.g. data massage)
- Provide some automated tools that study divergence and bring it to the fore, for discussion: why did you do it this way? Could you do it that way? (a toy version of such a divergence report is sketched after this list)
- But also permit the exercise of independence: provide your own data and your own stylesheets
- Find ways to make the markup itself of interest: this is a corpus of TEI data, not (primarily) a corpus of letters, novels, diaries, etc.
- Encourage everyone to deposit their TEI data (eventually!)
- Provide curation (and figure out how to fund it) as a top priority: this is a community resource
- Provide methods for mining and studying this data (qua TEI, qua content)
- Provide ways to make this data valuable for third parties: make it as open as possible
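To give the divergence and mining proposals something concrete to react to, here is a hypothetical sketch of the simplest possible divergence report: per-project profiles of TEI element usage, flagging elements that some projects use and others never touch. The project directory names are placeholders.

```python
# A toy divergence report over hypothetical project directories:
# profile each project's TEI element usage and flag elements
# that only some projects use, as prompts for discussion.
from collections import Counter
from pathlib import Path
from lxml import etree

def element_profile(project_dir):
    """Count local element names across all XML files in one project."""
    counts = Counter()
    for path in Path(project_dir).glob("**/*.xml"):
        tree = etree.parse(str(path))
        for el in tree.iter():
            if isinstance(el.tag, str):  # skip comments and processing instructions
                counts[etree.QName(el).localname] += 1
    return counts

# "projectA" and "projectB" stand in for deposited project collections.
profiles = {name: element_profile(name) for name in ["projectA", "projectB"]}
all_elements = set().union(*profiles.values())
for element in sorted(all_elements):
    used_by = [name for name, counts in profiles.items() if counts[element] > 0]
    if len(used_by) < len(profiles):  # some project never uses this element
        print(f"<{element}> used only by: {', '.join(used_by)}")
```

Even a report this crude surfaces the kind of question the proposal imagines: why does one project lean on an element that a very similar project never touches?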