Jobs, Roles, Skills, and Tools: Working in the Digital Academy

Digital humanities is a field that makes us self-conscious about our jobs. In my contribution to Matthew Gold’s “Debates in the Digital Humanities” I spent several thousand words anatomizing the different roles I’ve played and their different institutional frames, modes of work, funding structures, and systems of credit. In this I was participating in a long tradition of introspection that began with discussions about whether humanities computing was a discipline and whether it should find its most characteristic and beneficial home in humanities departments, libraries, IT departments, or some other academic or para-academic space altogether.

By the time I was in a position to write with any experience on the subject, the answer was clearly in some sense “Yes!” and “all of the above”: for instance, the Women Writers Project has been located in all four types of space and there are vibrant examples of digital humanities centers, digital scholarship centers, digital archives, digital initiatives, digital collaboratories, and many other units located in all different kinds of institutional spaces. It’s also clear that the identity crisis the field experienced so early on (I’m not a faculty member! My job is in an IT unit! but I have a PhD in classics! Is what I’m doing research?) has matured not through a clean resolution but through a proliferation of professionally viable options and job types: the digital humanities librarian; the instructional technologist; the digital humanities project manager; the professor of XYZ and digital humanities; the director and associate and assistant director of DH centers; the DH developer, and so forth.

Asking the question “What is DH?” (and either evading or reframing or attempting to answer it) is a popular pastime, and could almost be a drinking game in some contexts. But looking at this landscape, we might be more inclined to ask “Who is DH?” or “How is DH?” All of these roles have proven themselves necessary or at least possible: in various configurations they form more or less stable local ecologies in which roles and skills are allocated based on conditions of institutional possibility (who has the funding, who has the will, who has the imagination, who has the open staff lines, who has the office space, who has the research project). But this observation is equivalent to saying “stuff happens where it happens”: it doesn’t help us understand how to design such configurations, or how to adjust them when they turn out to be unstable or toxic, or how to form ourselves professionally to occupy a particular niche.

Looking at this landscape with an eye to understanding how these ecologies work, what we see first is jobs. But jobs (though fascinating in their own right) are not the most useful unit of meaning here: for one thing, as we all know, specific jobs come into being in particular shapes and sizes for reasons that have more to do with occasion and accident than with ontology. Jobs are actually aggregations of roles—specific vectors of participation in the ecology—and of skills—the competencies needed to perform the role effectively.

This line of discussion has you all worried that I am actually from an HR department. But in fact I’d like to suggest that if we’re interested in understanding how digital humanities works as a field, these roles and skills (properly unpacked) hold the key to that understanding. Consider the following questions:

  1. At what point in a digital project work flow should texts be proofread, and by whom? In what form should the data be exposed for this work to proceed effectively?
  2. Is text encoding integral to digital scholarly editing, or an implementation of it?
  3. To what extent do scholarly users of a digital resource need to understand its use of metadata standards?
  4. How can the underlying information architecture of a digital edition effectively represent its editorial principles?
  5. When a work of digital scholarship is complete and its original researchers have retired, who takes responsibility for it and where in the institution does that responsibility lie? How substantively can we expect those with long-term responsibility to maintain and extend the resource?
  6. Do we need to understand the complete inner workings of our digital tools in order to use them responsibly and critically?
  7. With whom does final intellectual responsibility for a digital humanities project lie? In what specific components of the project does that intellectual responsibility make itself felt?

Given time, we could design an interesting small-group workshop exercise to diagram those questions out as implied relationships between different institutional roles and skills, and in doing so I think we would find that the roles we identify (for instance, the potential librarian presence in items 3 and 5, or the potential programmer role in items 1, 4, and 6) aren’t represented by the same skills across the board. Similarly, the same skill (e.g. knowledge of scholarly editorial principles) doesn’t necessarily map to the same role in every case. So understanding roles and skills and their configurations can help us think more precisely about terms like “collaboration” which I think gets used far too often in far too facile and imprecise a manner.

But there is another term lurking in this list which I think is also significant, namely “tools.” Digital humanities has a keen awareness of tools as a marker of difference from the “traditional humanities” and a deep self-consciousness about tools as a marker of expertise, without really being certain what that expertise says about us. I’ve listened with fascination when speakers at conferences preface a comment by a phrase like “I’m not an engineer, but…” What this phrase means is “I don’t have technical expertise but in fact this makes me a more credible commentator on the non-technical aspects of the issue here…”

So in what follows I’m going to use jobs, roles, skills, and tools as a way of teasing apart this complex ecology and helping us understand it better.

From Jobs to Roles

Is humanities computing merely a hobby for tenured faculty? I am beginning to think so. I have just finished looking through the October MLA job list along with the computer science equivalent. As in past years, I see no jobs relating to humanities computing. At best, there are 1 or 2 positions where experience in computer aided instruction might be helpful. … I started out as a German professor here at Yale and then was, in effect, booted out when I consorted with the CS people. Now I am a full-time lecturer in computer science, teaching a curriculum of humanities computing along with regular CS courses… But I am also painfully aware of the fact that I have this job because I MADE this job, and it took 5 years of continuous drudge-work and diplomacy to get to this point. …I can tell you this: if humanities computing is to be more than a gentleman’s sport, somebody has got to start creating jobs for this field. How many more Goethe specialists do we need? Give it a rest. Hire someone who will rock the status quo. …20 years from now there will be departments of humanities computing. No doubt someone will write a doctoral thesis on the history of the field and my name will appear in a footnote: “wrote some interesting early works, ‘German Tutor’, ‘MacConcordance’, ‘Etaoin Shrdlu’, and then disappeared from the field”. I don’t want to be a footnote. I want to be the head of the department. Make a job in humanities computing this year.

    —Stephen Clausing, Humanist Vol. 6 Num. 357, 15 Nov 1992 (emphasis mine)

Jobs are a good starting point here because they are easy to detect (especially when they are missing). This quotation is curiously telling: even while calling for “a department of humanities computing” (i.e. more faculty jobs, but representing a hybridization of humanities and CS expertise), it subtly registers the fact that the very “gentlemanliness” of the faculty jobs (and the way they would tend to position the more practical aspects of building, teaching, and using technological systems) might not serve humanities computing very well in the long run. With hindsight, we can also note that the jobs that first proliferated were not in fact faculty positions; there were plenty of faculty in humanities computing early on but they all had quite standard jobs and their “computing” dimension was acknowledged as odd. The distinctively “humanities computing” (or, later, “digital humanities”) jobs were in other institutional spaces:

  • library positions, particularly around library-led digitization efforts like UVA’s EText Center but also in some cases as an outgrowth of library support for digital publications and projects (for instance, IATH, MITH)
  • information technologists: arising in groups dedicated to supporting faculty research (e.g. STG, OUCS, Centre for Computing in the Humanities @ KCL)
  • instructional technology groups, for instance at Northwestern University, University of Virginia, and NYU
  • research groups, either free-floating or located in departments: e.g. ARTFL, the WWP, Perseus

And these categories of jobs have now solidified around a set of parameters that match the institutional spaces where these jobs are located. It’s also worth noting that the jobs we think of as “alt-ac” (jobs that humanities PhDs might consider as alternatives to tenure-track faculty positions) had in fact been around for quite a while before that label was invented, and there was nothing particularly “alt” about them: they are in an important sense the infrastructure of the academy. Their “altification” is in the eyes of those now seeking those jobs: people with humanities PhDs who see them as an alternative to faculty jobs.

If we map these out organizationally what we start to see is that certain kinds of positions are replicated in multiple organizations: notably developers, analysts, and support staff, and that in fact “research centers” can be located pretty much anywhere (and bring their characteristic staff with them).

So to get at the real categories that are emerging here, we need to remap these by job type rather than by organizational unit (I’m giving broad characterizations here):

  • information management jobs: emphasis on maintaining healthy information ecologies for the institution
  • faculty/academic jobs: emphasis on research and teaching
  • analyst jobs: these are the jobs with responsibility for translating between competencies and thinking about systems
  • technical or “developer” jobs: these are the “building” jobs that typically involve deeper programming expertise
  • managerial jobs: oversight over working groups and work systems
  • administrative jobs: oversight over resources, fiscal and legal arrangements
  • data creation jobs: those involving digitization, encoding, metadata creation, georeferencing of maps, etc.

[Slide 7]

These characterizations help us see the functions these jobs play in the larger ecology (and we can start to imagine some of the working relationships they entail with one another). They help us see what people “do” (their organizational role) regardless of their institutional location. Bear in mind that an individual “job” might entail more than one “role” as I am describing them here.

We can probe even further into these roles by asking what kinds of skills and expertise distinctively belong to each one. However, skills and expertise here are not simply “the things people know how to do.” I know how to write a TEI customization, and my colleague Syd Bauman knows how to write a TEI customization, but that skill operates very differently for the two of us because of the different kinds of metaknowledge we each have. By “metaknowledge” I mean domains in which we possess a comparative perspective, an understanding of why things are the way they are. In his case, metaknowledge about the design of schema languages and the systems that process them, and in my case, metaknowledge about discipline-specific approaches to textual analysis. So it may also be useful to consider the distinctive skills and metaknowledge each of these roles characteristically possesses; this slide is an oversimplification but I hope it illustrates what I’m getting at here:

[Slide 8]

  • analyst: knowledge of design, standards, and systems; translation between discourses; has meta-knowledge with respect to discipline and method
  • scholar: content expertise, research methods, reading and interpretation; has meta-knowledge with respect to theory (what does this mean?)
  • developer: significant technical and architectural knowledge at the system level (i.e. not just the ability to use specific tools and languages but ability to learn and assess them, use them strategically); has meta-knowledge with respect to code, processing systems, system architecture
  • manager: understanding and managing resources, creating effective conditions for individuals and groups to function; has meta-knowledge with respect to organizational systems
  • administrator: knowledge of fiscal structures, legal and policy requirements, university ecology (including both fiscal and diplomatic realities and strategic planning); has meta-knowledge with respect to resources
  • data creator/coder (data preparation): data modeling, quality assurance practices, specific technical systems, consistent application of standard procedures and good decision-making in exceptional cases; has meta-knowledge with respect to content
  • information manager: understanding and managing information systems, designing systems in which information can be translated efficiently into usable and sustainable forms; has metaknowledge with respect to data management and data representation systems (such as information standards, repository systems, data curation best practices)

Tools 

“Personally, I think Digital Humanities is about building things. [. . .] If you are not making anything, you are not…a digital humanist.” (Stephen Ramsay, “Who’s In and Who’s Out”, http://stephenramsay.us/text/2011/01/08/whos-in-and-whos-out/)

What’s a tool? We use the term as if we know tools when we see them: things we use to do tasks. In digital humanities, terms like “build” and “tool” are necessarily a bit metaphorical: you can’t break your toe by dropping these things on your foot.

In early discussions on the Humanist discussion list (e.g. in the late 1980s and early 1990s), the term “tool” often carried a pejorative tone, with the implication of considering something as “merely a tool.” More precisely, through discussions of how and whether a tool determines our relations with the world (the “if all you have is a hammer everything looks like a nail” analogy), we can see concerns about the effect that the use of computational tools would have on humanistic research: a loss of nuance in our methods, a tendency to reduce complexity in the interests of computational tractability, and conversely, expectations about the usefulness of the computer as a “fast and accurate tool”, an “analytical tool.” Tools also figure prominently in definitional discussions of DH as a domain and as a profession: for instance, the ACLS Cyberinfrastructure Report lists 5 things “digital scholarship has meant”, of which 4 mention tools (p. 7) and three of these items are “creating tools”:

“In recent practice, ‘digital scholarship’ has meant several related things:

  a) Building a digital collection of information for further study and analysis
  b) Creating appropriate tools for collection-building
  c) Creating appropriate tools for the analysis and study of collections
  d) Using digital collections and analytical tools to generate new intellectual products
  e) Creating authoring tools for these new intellectual products, either in traditional forms or in digital form”

—”Our Cultural Commonwealth: The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences”, http://www.acls.org/cyberinfrastructure/OurCulturalCommonwealth.pdf 

Alongside these discussions there’s another strand visible, concerning the question of whether the computer is more like a “tool” or a “method”: essentially, a set of questions about how these systems sit in relation to our own thought processes and theories. The hammer/nail metaphor imposes a strict thingness on the tool: its shape determines our use of it, and this metaphor also suggests a certain self-evidence about the tool (we all know about hammers) and a desire for transparency. The hammer works, unproblematically, as a hammer: its purpose is to drive the nail, not to open up a discussion about the process. But if something that looks like a tool could also be a method, then the concept of the tool starts to seem more plastic, more responsive to and engaged with our own thought processes. Stephen Ramsay observed at MLA in 2011 that “If you’re not making anything, you’re not a digital humanist,” but far from offering this kind of “making” or “building” as a pure, bone-headed space of theory-free praxis, he proposes making as “a new kind of hermeneutic.” As he and Geoffrey Rockwell argue in “Developing Things: Notes Towards an Epistemology of Building in the Digital Humanities”, under the right conditions tools can even be theories:

For tools to be theories in the way digital humanists want—in a way that makes them accessible to, for example, peer review—opacity becomes an almost insuperable problem. The only way to have any purchase on the theoretical assumptions that underlie a tool would be to use that tool. Yet it is the purpose of the tool (and this is particularly the case with digital tools) to abstract the user away from the mechanisms that would facilitate that process. In a sense, the tools most likely to fare well in that process are not tools, per se, but prototypes—perhaps especially those that are buggy, unstable, and make few concessions toward usability.

—Stephen Ramsay and Geoffrey Rockwell, “Developing Things: Notes Towards an Epistemology of Building in the Digital Humanities”, Debates in the Digital Humanities, ed. Matthew Gold, 2012.

In particular, tools can be theories if they operate to draw attention to themselves, above all by functioning in frictional ways.

We can start to see here that there’s a reason digital humanists are so preoccupied with tools and it’s not because tools are important in themselves: it’s because they are an irritant. They catalyze something complex and difficult concerning professional identity, scholarly methods and practices, and specific types of expertise. The questions of “who is digital humanities?” and “how is digital humanities?” can evidently be reframed as questions like “what are the design goals of our tools?” “when should a tool ‘just work’?” and “are we tool users, tool builders, or tool theorists?”

So we can now come back to think about how each of these digital humanities roles understands the category of the “tool”. (Remember that I’m talking here about “roles” rather than “people”: a given person might occupy more than one role and hence have a more complex positioning.)

The scholar has a tendency to regard tools from a distance, as a category rather than as a set of specificities (“we need a tool…”). In my experience when people are speaking from the “scholar” role, they don’t tend to differentiate between tools on the basis of complexity or design, but rather on the basis of function. So a car and a tricycle would appear more similar (as tools for carrying humans over the ground) than a tricycle and a wheelbarrow (simple non-motorized wheeled devices). The “scholar” role also tends to use more complex tools only to the point of getting the general idea, and delegates their use (e.g. to the coder) in cases where in-depth expertise or systematic production-grade use is required, although the scholar may often try out the tool in a spirit of solidarity. “Tool” for the scholar role is often a metaphor for the thing someone will build and I will use (or someone will use on my behalf). Tools are the mark of the domain of implementation.

Scholars tend to use the metaphor of “getting our hands dirty” when talking about using tools. Tools are the implements through which practice (praxis) is enacted. Tools do things. Tools are the subject of workshops. Typical examples of this kind of tool include visualization tools, data manipulation tools, publication tools, search tools, analysis tools. The scholar is also very interested in theorizing the tool, and likely to be very sensitive to potential theoretical implications of its use, although not necessarily in a position to engage with its underlying design or implementation.

The developer builds tools, and uses tools in ways that don’t preserve their surface integrity (for instance by modifying them or configuring them in expert ways). There are many tools that are visible to the developer and not to the scholar: version control tools, tools for editing code (emacs, etc.), environmental tools (operating systems, virtual server software, the kernel). The developer also often has an under-the-hood or architectural view of tools whose conventionalized surfaces are also visible to the scholar: e.g. database programs and content management systems (= publication tools), search engines and indexing systems (= search tools), interface management tools such as Bootstrap, JavaScript libraries and data pipelines (= visualization tools, data manipulation tools), data management tools (schemas, XML processing tools, etc.).

Developers are less likely than scholars to be interested in a frictional relationship with tools, although they are likely to have a much greater awareness (metaknowledge) of the kinds of design and implementation issues that could inform an understanding of their theoretical implications. It’s easy to understand why: these jobs are typically the ones that involve building things that “actually work”, and their colleagues anywhere in the institution (IT, library systems) are operating under an extremely practical set of metrics for success: Did my beeper go off at midnight? Is the server down? Did I have to fix bad code written five years ago by someone who apparently didn’t know what they were doing? People in these jobs have a word for tools that are “buggy, unstable, and make few concessions toward usability” and it’s not a polite one.

The analyst chooses tools, specifies tools, documents tools, and thus has an inexpert but meta-level awareness of the tools that the developer uses directly. The analyst can participate in decision-making that concerns these genres of tools, but doesn’t participate directly in or have responsibility for their creation or management. It’s worth noting that analysts probably have the strongest interest of all these roles in theorizing tools; they understand them well enough to do so and their interest is not purely practical; they’re aware (like the developer) of how much of a difference the choice of tools makes, but by nature of their job they can be less pragmatic about it. In the analyst we see an almost anthropological interest/perspective.

The analyst also has tools native to his/her position. These include design and prototyping tools (wireframing, diagramming/charts), documentation tools (wikis, content management systems, literate programming, code commenting), standards and reference systems, ontologies and authority systems. It is worth noting that the “tool” aspect of these systems with respect to the analyst’s role really lies in the expertise with which the analyst uses the capabilities of these systems to ensure that the information they contain can be used effectively within the project context. In other words, these systems become tools through the work of information design/organization, ergonomic optimization within a specific work flow, ease of use (training, reference), and effectiveness in preserving records of decisions and actions.

The administrator stands aside from the specifically digital humanities “tool” ecology, but of course this role has its own tools: e.g. payroll systems, grant management systems, budgeting tools, record-keeping tools. These tools are assumed to be theoretically neutral with respect to the research enterprise, but it’s interesting to contemplate the effect they have on the ecology, for instance by defining professional roles, differentiating spheres of expertise and authority, and limiting access to information.

The manager benchmarks and assesses tools and their impact.

The data creator uses tools, often with a somewhat more critical perspective than the scholar (i.e. as an expert user rather than an onlooker or visitor-user). Tools natural to this role include:

  • authoring tools (XML editors like Oxygen, content management systems like Omeka and WordPress)
  • simple data conversion tools (Google Refine, OxGarage)
  • digitization tools: scanning, OCR, other kinds of image capture/manipulation

Expert data creators who have used multiple different tools for the same kind of task develop a kind of parallax or meta-knowledge about these tools. Inexpert data creators—understandably—fetishize and personify the tool as a kind of totalizing context for their work (“Oxygen didn’t like my code”) and may also not fully understand the data apart from the tool through which they encounter it. For instance, it’s common to find that someone who has deep familiarity with XML data as an encoder may nonetheless not know that the same XML file can be opened and viewed in a web browser.

The information manager is very similar to the analyst in choosing, specifying, and documenting tools, but these tools tend to be infrastructural (or portals to infrastructure) rather than user-oriented. The characteristic tools of the information manager include:

  • data management tools (repository systems, tools for integrity checking, data migration, data conversion)
  • data dissemination and discovery tools (online library catalogues, APIs and tools for exposing metadata for discovery such as OAI servers)

We can see latent in these roles some strong potential complementarities and also some strong potential friction points. For example, a developer who is also an analyst (combining deep expertise in systems with disciplinary metaknowledge) is a powerful ally of the scholar, since together they can develop tools and projects that are both functional and theorized. Finding those three roles combined in a single person is difficult but if you do, you should hire them! However, when developers who are not also analysts work directly with scholars, there can sometimes be a kind of disjuncture: the developer’s detailed awareness of the differences not just between tricycles and cars, but also between different kinds of transmissions and how they affect the behavior of your four-wheel drive, makes it likely that there will be conversations where the scholar thinks he or she is requesting a way to get from point A to point B on wheels and ends up in a conversation about the merits of different kinds of synchromesh. As another example: people often start their careers in the data creator/coder role (e.g. working as a graduate assistant on a digital project, which gives one an initial exposure to tools for content creation and management), and then develop additional roles over time. But a project that includes only scholars and coders (a common combination) will have a peculiar relationship with tools: a kind of distance (on the scholar’s part) and on the other hand an intensive proximity (on the coder’s part) that may not yet have critical distance or meta-knowledge: the awareness needed to use the tools in a fully knowing way.

So what does a healthy ecology for Digital Humanities look like, taking all of this into account? How can we build organizations in which digital scholarship in the humanities can thrive and where practitioners in a variety of roles can work effectively together? A few thoughts and concluding suggestions:

  1. The “analyst” role is clearly an important one, and also one that interestingly seems to inhabit a number of different possible institutional locations and job identities: the library, IT, research centers; post-doctoral fellows, data curation and DH librarians, instructional technologists, research support specialists, programmer/analysts. Furthermore, because the analyst is so often an infrastructural position rather than a project-specific one, that role brings with it an inherent attention to longevity and sustainability. The analyst is likely to know about things like data standards, institutional repositories, data curation and reuse. The analyst’s skill profile also includes knowing how to identify and assess relevant work by peer institutions, to avoid reinventing the wheel (or worse, repeating common mistakes).
  2. The developer role is also crucial, but it’s not easy to make good use of a single human developer working in isolation. The expertise the developer role possesses is directly translatable into good architectural decisions and the skill to write program code that is efficient, effective, well-documented and intelligible to others, easy to maintain and extend, and not subject to obscure breakage. If you’re building prototypes for purposes of theorization (which is entirely legitimate), then these may not be concerns, but if you are trying to build a tool or system that will work in the future, then you should not substitute non-developers for developers. Data creators and analysts are often mistaken for developers (especially by scholars to whom the difference isn’t always visible). Furthermore, because there are many different kinds of developer expertise, a developer who is great with XML tools may not know anything about customizing Drupal or building web applications in JavaScript, let alone building digital repository systems. Scholars aren’t necessarily aware of these differences and there’s a tendency to say “we need to hire a developer” without specifying what kind and what skills. Often, one needs pieces of several different developers to build an entire system. In an ecology that lacks an analyst, translating between scholars (or data creators) and developers isn’t always easy, which is why a developer who is also an analyst is a huge asset.
  3. It’s important not to overload junior jobs (assistant faculty, post-doc, grad student, entry-level developer) with roles that are outside the scope of their capacity, or at least to do so knowingly. I spent the first ten years of my professional life operating outside the scope of my capacity, which was great for me in the long term but arguably less than ideal for the project and not great for my personal life either. If you’re going to overload a junior role, make sure you have strong mentoring and training in place, and make sure you have infrastructural roles somewhere else in the system (analysts, developers) who can provide transitional support when needed. Which leads me to:
  4. It’s important not to use short-term jobs as a way of filling long-term roles. In situations of scarcity and constraint, it’s tempting to make technically gifted graduate students or post-docs serve as solo developers or managers, but this is a risky approach: not because they aren’t capable of doing this work, but rather because those roles need greater continuity. Turnover in those positions results in loss of organizational memory, poor or incomplete implementation of systems, lack of documentation, and long-term difficulties for both the project and those who stay behind (scholars, analyst roles in the library or IT organization who will have to pick up the pieces, administrators who have to make sense of financial and HR situations).
  5. Finally, be aware of the different professional trajectories and accompanying reward systems that are in play for these different roles, again bearing in mind that the same person may occupy different roles (and may have multiple professional trajectories in play). This is an area where a lot of discussion has taken place in recent years, particularly in venues like the University of Maryland’s Off the Tracks workshop and of course in the discussions of “#alt-ac”, citation practices, and related issues. It’s important to think about the forms of professional development each role needs: additional degrees, opportunities to attend conferences, opportunities for practical training or internships. It’s also important to think about the forms of professional visibility each role needs, such as opportunities to publish, opportunities to participate in open-source software development and standards bodies, opportunities to mentor others and participate in professional associations. And it’s very important to think about the next job each role is likely to be seeking (whether that is a tenure-track faculty position, a more senior analyst or developer position, a post-doc, or a position in an MA or PhD program) and to think about whether that role will be in your organization (and if not, why not?).

The strongest digital humanities centers I know of are the ones that have succeeded in creating generational systems: viable succession plans in which students trained as data creators grow into developer, analyst, or manager roles while also maturing as scholars. But they also have invested in creating permanent jobs for the infrastructural roles (developers, analysts, administrators, managers) that give the ecology its stability and continuity.

I’m grateful to those who have heard versions of this talk for their feedback and insight.


TEI and Scholarship (in the C{r|l}o{w|u}d)

[This is the text of a keynote presentation I gave at the 2012 TEI annual conference at Texas A&M University. Posting it now in the spirit of “better late than never” and “huh, maybe still useful!”]

James Surowiecki’s well-known book “The Wisdom of Crowds” and Kathy Sierra’s counterpoint on “The Dumbness of Crowds” give us a provisional definition of a wise crowd:

  • It must possess diversity of information and perspectives
  • Its members must think independently
  • It must be decentralized, and able to draw effectively on local knowledge
  • It must have some method of aggregating and synthesizing that knowledge

In this sense, the TEI is a wise crowd: deliberately broad, designedly interested in synthesizing the vast local expertise we draw on to produce something as ambitious and deeply expert as the TEI Guidelines.

But the TEI also has some structural ambivalence about the crowd: about the role of standards in a crowd. As researchers and designers (people who contribute expertise to the development of the guidelines) the crowd (that’s us!) is great, but as encoders and project developers the crowd shows another face:

  • The crowd is unruly!
  • They make mistakes!
  • They commit tag abuse!
  • They all do things differently!
  • They are all sure they are right!

The dissatisfaction with this divergence arises from the hope that this crowd (that is, also us!) will put its efforts towards a kind of crowd-sourcing: developing TEI data that can be aggregated into big data, producing a crowd-sourced canon that would serve as an input to scholarship. This would mean:

  • To the extent possible, a neutral encoding (whatever that means)
  • A reusable encoding (whatever that means)
  • An interoperable encoding (whatever that means)

The technical requirements for developing such a resource are not difficult to imagine; MONK and TEI-analytics have shown one way. But the social requirements are more challenging. That unruliness I mentioned isn’t cussedness or selfishness. It has to do rather with the motives and uses for text encoding that are emerging now more powerfully as the TEI comes into a new relationship with scholarship. The “crowd” in this new relationship is once again that wise crowd whose expertise is so important, but it is working in new ways and in a different environment.

In what follows, I’m going to explore the TEI and scholarship in (and of) the crowd by sketching three convergent narratives, and then considering where they lead us.

  1. People

The first is a narrative of people: an emerging population of TEI users.

Here is a starting point: in the past year, from the WWP’s workshops alone (and we are only one of a large number of organizations offering TEI workshops), about 150 people participated in introductory TEI workshops. This has been a typical number over the past three years, which have seen a steady and striking increase in demand for introductory and advanced TEI instruction. Syd Bauman and I began teaching TEI workshops in 2004; here’s the trend in total workshops offered (a total of 69 events in 8 years):

and the corresponding trend in number of participants (a total of at least 1166 participants as of 2012; we didn’t keep strict records in the first few years):

[Slide: chart of TEI workshop participants by year]

The increasing trend here is striking in itself, but more so when we unpack it a little. First, the composition of the audience is changing: 8-10 years ago, the predominant audience for TEI training was library and IT staff; now, very substantial numbers of participants are faculty and graduate students. Many (or even most) attendees are planning a small digital humanities project with a strong textual component, and they see TEI as a crucial aspect of the project’s development. They are full of ideas about their textual materials and also about the research questions they’d like to be able to pursue, or the teaching they’d like to be able to do; they see TEI markup as a way of representing what excites them about their texts and their research. Even more remarkably, some people attend TEI workshops to learn TEI because they think they should know about it, not solely because they are planning a TEI project. In other words, text encoding has taken on the status of an academic competence, a humanities research skill.

And these workshops represent just the WWP: I know that there are now many other workshop series that are experiencing similar levels of growth. [Edited to add: the upward trend in workshops has steepened further since 2012.]

  2. Scholarship

The second strand of the narrative I’d like to lay out here has to do with scholarship and scale. TEI data is often considered as an input for scholarship, for instance via the concept of the “digital archive” or the digital edition. This is data digitized in advance of need, often by those who own the source material, on spec as it were, to be used by others. In the terms discussed at a recent workshop at Brown University on data modeling in the humanities, this is data being modeled “altruistically.” It is designed to function in a comparatively neutral and anticipatory way: in effect, it is data that stands ready to support whatever research we may bring to it. And in this sense it is data that serves scholarship, where the actual scholarly product is expected to be expressed in some other form, such as a scholarly article or monograph.

However, as DH is increasingly naturalized in humanities departments, we now are seeing attempts to articulate the role that text markup can play in a much closer relationship to scholarship:

  • First, TEI data considered as scholarship: as a representation of an analytical process that yields insight in itself, something that one could receive scholarly credit for.
  • And second, TEI data considered as a scholarly publication that is not operating under the sign of “edition” but rather something more like “monograph” or in fact a hybrid of the two: in other words, not operating solely as a reproduction or remediation of a primary source, but also as an argument about a set of primary sources or cultural objects/ideas.

The stakes of articulating this case successfully are, first of all, to make this kind of work visible within the system of academic reward, and second, to call in question the separation of “data creation” from interpretation and analysis: a separation that is objectionable within the digital humanities on both theoretical and political grounds.

As a result, this new population of scholar-encoders is ready and willing to understand the encoding work they are undertaking as entirely continuous with their work as critics, scholars, and teachers. The data they aspire to create is expected to carry and express this full freight of meaning.

  3. Economics

The third strand I offer here has to do with economics, and in particular with the economics of data in the crowd. Because hand markup is expensive, large-scale data collections tend not to emphasize markup; in large-scale digital library projects for which the lower levels of the TEI in Libraries Best Practices guidelines were developed, markup is concentrated in the TEI header, largely absent in the text transcription. And in these large-scale contexts, markup is thus not only unrealistically expensive at an incremental level, but it also fails a larger economic test in which we compare the time it takes to do a task by hand with the time it will take to build a tool to do the task. If the tool is running on a sufficiently large data set, even a very expensive tool will pay for itself in the end. For this reason, algorithmic, just-in-time approaches make sense in large collections.
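The comparison in this paragraph can be made concrete with a little arithmetic. What follows is a minimal sketch in Python with invented figures: the function names and every number are assumptions for illustration, not drawn from any real project budget.

```python
import math


def break_even_items(tool_build_hours, hand_hours_per_item, tool_hours_per_item=0.0):
    """Smallest collection size at which building the tool costs no more than hand work.

    Returns None when the tool saves nothing per item and so never pays for itself.
    """
    saving_per_item = hand_hours_per_item - tool_hours_per_item
    if saving_per_item <= 0:
        return None
    return math.ceil(tool_build_hours / saving_per_item)


def cheaper_by_hand(items, tool_build_hours, hand_hours_per_item, tool_hours_per_item=0.0):
    """True when encoding `items` texts by hand costs less than building and running the tool."""
    hand_total = items * hand_hours_per_item
    tool_total = tool_build_hours + items * tool_hours_per_item
    return hand_total < tool_total


if __name__ == "__main__":
    # Hypothetical figures: 1000 hours to build the tool, half an hour of hand
    # markup per text, a quarter of an hour of tool-assisted checking per text.
    print(break_even_items(1000, 0.5, 0.25))
    print(cheaper_by_hand(200, 1000, 0.5, 0.25))
    print(cheaper_by_hand(10000, 1000, 0.5, 0.25))
```

With these made-up figures the tool pays for itself only at 4,000 texts, so a 200-text collection is cheaper to encode by hand while a 10,000-text collection is not.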

Markup excels in two situations. It excels in cases where the data set is small and the markup task is so complex that the economics of tool-building are at a disadvantage. If the tool required starts to approximate the human brain, well, we can hire one of those more cheaply than we can build it (for a few more years at least!). And second, markup excels in cases where we need to be able to concretize our identifications and subject them to individual scrutiny before using them as the basis for analysis: in other words, in situations where precision does matter, where every instance counts heavily towards our final result, and where the human labor of verification is costly and cannot be repeated. Hand markup is thus characteristically associated with, economically tied to, small-scale data.

At this point in my story, these three narratives (people, scholarship, and economics) start to converge. At present, a generation of highly visible and really interesting digital humanities scholarship is proceeding on the basis of research that analyzes very large-scale data with powerful analytical tools. (Think of the NEH’s Digging into Data funding initiative, of grid technologies, of high-performance computing, of e-science.) But at the same time, another new generation of digital humanities scholars is emerging along very different lines. They are humanities faculty working on individual texts, authors, genres; they are interested in the mediation of textual sources (in ways that are consonant with domains like book and media history); and they are alert to the ways that textual representation (including data modeling) inflects meaning, reading, interpretation. These scholars are gaining expertise in TEI as a natural extension of their interest in texts and in textual meaning and representation, and their scholarship will be expressed in their markup, and will also arise out of the analysis their markup makes possible.

So there is an interesting interplay or counterpoint here. Algorithmic approaches work well at large scale precisely because they don’t require us to scrutinize individual cases or make individual decisions: they work on probabilities rather than on certainty, and they work on trends, correlations, the tendency of things to yield pattern, rather than on the quiddity of things, their tendency towards uniqueness and individual bizarreness. But for some kinds of interpretation (e.g. for intrinsically small data sets, for data sets that are dominated by exceptions rather than patterns), that very bizarreness is precisely the object of scrutiny: scholarship is a process of scrutiny and debate and careful adjustment of our interpretations, instance by instance: markup allows scholarly knowledge to be checked, adjusted. It produces a product that carries a higher informational value.

These two paradigms, if we like, of digital scholarship (“scholarship in the algorithm” and “scholarship in the markup”) are different but not opposed, not inimical; they are appropriate in different kinds of cases, both legitimate. They both represent significant and distinctive advances in digital scholarship: the idea of formalizing an analytical method in an algorithm is no more and no less remarkable, from the viewpoint of traditional humanities scholarship, than the idea of formalizing such a method in an information structure that adorns a textual instance. They each represent different attitudes towards the relationship between pattern and exception, and different approaches to managing that interplay. And in fact, as the Text Encoding and Text Analysis session at DH2012 noted, these two approaches have a lot to offer one another: they both work even better together than separately.

What can we observe about the TEI landscape that these narratives converge upon? First, it is significantly inhabited by “invisible” practitioners who are not experts, not members of TEI-L, not proprietors of large projects, but nonetheless receiving institutional support to create TEI data on a small scale (individually) and a large scale (collectively). These users are strongly invested in the idea of TEI as a tool for expressing scholarship: they believe that it is the right tool and they find it satisfying to use. They are working on documents that are valuable for their oddity and exceptionalism, and I will indulge myself here in the topos of the copious list: the notebooks of Thoreau, whaling logs, auction catalogues, family letters, broadsides attesting to the interesting reuse of woodcut illustrations, financial records, Russian poetry, an 18th-century ladies’ magazine specializing in mathematics, revolutionary war pamphlets, sermons of a 19th-century religious leader, drafts of Modernist poetry, Swedish dramatic texts, records of Victorian social events, the thousand-page manuscript notebook of Mary Moody Emerson, and so on.

It’s hard to envision a unifying intellectual rubric for such projects, and yet at a certain level they have a lot of things in common. First, they have a set of functional requirements in common: with small numbers of comparatively small TEI files, they are not “big data” and they don’t require a big publication infrastructure, but they do need a few commonly available pieces of publication infrastructure: a “table of contents” view, a “search” view, a “results” view, a “reading” view: all of these are now the stock in trade, out of the box, of simple XML publishing tools like eXist, XTF, etc.

However, this data could yield quite a bit more insight with a few more tools for looking at it, and some tools in particular come up over and over again as offering useful views of the data: timelines, maps, representations of personal interconnections, networks of connected entities. And taking this even further, there are specific genres that could benefit from specific analytical tools: for instance, a drama browser (think of Watching the Script), a correspondence browser (think of a combination timeline and personal network), an itinerary browser (think of a combination map and timeline, like Neatline), an edition browser (with focus on variant readings, commentary, witness comparison: think of an amplified version of Juxta), and so forth.

These projects also have a set of opportunities in common: for one thing, they represent a remarkable opportunity to study the ways that markup functions representationally, if only we could study this data as a body: a semiotic form of knowledge. And for another, they represent a remarkable opportunity to study the ways that specific TEI elements are used, in the wild: a sociological form of knowledge. Finally, and most importantly, these projects have a number of problems in common:

  • Publication: there is currently no obvious formal venue for such projects/publications, and the kind of self-publication that is fairly common in university digital humanities centers isn’t available at smaller institutions, or at institutions with only a single project of this kind.
  • Data curation: these projects are a data curation time bomb; they typically have a very tiny project staff that by its nature is short-term (students with high turnover, IT or library staff who don’t have a long-term institutional mandate to assist; grant-funded labor that will disappear when the funding runs out). Running on slender resources, they don’t have the luxury of detailed internal documentation (and they don’t typically have staff who are skilled at this). Migration to future TEI formats is in many cases probably out of reach.
  • Basic ongoing existence: these are projects that quite often lack even a stable server home; when the person primarily responsible for their creation is no longer working on the project, the institution doesn’t have anyone whose job it is to keep the project working.

From some perspectives, these look like problems for these projects to identify and suffer and (hopefully, eventually) solve. This perspective has produced the TAPAS project, which may be familiar to many of you. TAPAS is a project now in a 2-year development phase funded by IMLS and NEH, which is developing a TEI publishing and archiving service for TEI projects at small institutions and those operating with small resources. [Edited to add: in 2016, TAPAS is now in operation and working through a further 3-year development phase focused on XML-aware repository functionality and on pedagogical tools.]

But we should also treat this as a problem for the TEI. If you can indulge me for a moment in some cheap historical periodization, we can divide the TEI’s history thus far into several phases:

  1. Inception and problem identification, where the problem is the fact that many scholars want to produce digital representations of research materials, and there is a risk that they will do it in dead-ended, self-limiting ways
  2. Research and development, where the TEI community grows intrepidly and tackles the question of “How do we represent humanities research materials?”
  3. Refinement and tool-building, where the community (now having both critical mass and an intellectual history) can set in place a working apparatus of use (e.g. Roma) and build Things that Work
  4. And now, in the past five years: public visibility, where (thanks to the tremendous and sudden popularity of the digital humanities), the TEI is now noticeable and legitimate in sectors where before it would have appeared a geeky anomaly. As I noted earlier, people—faculty, graduate students—now attend TEI workshops just out of a sense of professional curiosity and responsibility: “This is something I should know about.”

Things look very good, according to several important metrics. There’s public and institutional funding available for TEI projects; the idea of treating TEI projects as scholarly work, to be rewarded with professional advancement, isn’t ridiculous but is a real conversation. The “regular academy” recognizes the TEI (albeit in a vague and mystical way) as a gold standard in its domain: it possesses magical healing powers over data. And there is an infrastructure for learning to use the TEI, which is a huge development; Melissa Terras, a few years ago, addressed the TEI annual conference with a strongly worded alarum: she pointed out that the TEI had an urgent and sizeable need for training materials, support systems, information, on-ramps for the novice. Although the TEI itself has not responded to that call, its community has: there are now a substantial number of regular programs of workshops and institutes where novices and intermediate users can get excellent training in TEI, and there are also starting to be some excellent resources for teaching oneself (chief among them TEI by Example, developed by Melissa Terras and Edward Vanhoutte). And finally, a lot of TEI data is being produced.

But that success has produced a crossroads that we’re now standing at. The question is whether in 20 years that data will represent scholarly achievement or the record of a failed idealism: whether the emerging scholarly impulse to represent documents in an expressive, analytical, interpretively rich way is simply obsolete and untenable, or whether in fact such impulses can constitute a viable form of digital scholarship: not as raw, reusable representations whose value lies chiefly in the base text they preserve, but as interpretations that carry an insight worth circulating and preserving and coming back to. If the answer turns out to be “yes”, it will be because two conditions have been met:

  1. The data still exists (a curation challenge).
  2. The data still has something to say (a scholarly challenge).

It’s important to observe that this is not a question about interoperability; it is a question about infrastructure, and it is about social infrastructure as much as it is about technical infrastructure. It is tempting to treat the prospect of hundreds of small TEI projects as simply an interoperability nightmare, a hopeless case, but I think this assumption bears closer scrutiny and study. In fact, at this point, I will assert (sticking my neck out here) that the major obstacle to the long-term scholarly value and functioning of this data is not its heterogeneity but its physical dispersion. As an array of separately published (or unpublished) data sets, this material is practically invisible and terribly vulnerable: published through library web sites or faculty home pages; unable to take advantage of basic publishing infrastructure that would make it discoverable via OAI-PMH or similar protocols; vulnerable to changes in staffing and hardware, and to changes in publication technology. And last, through a terrible irony, unlikely to be published with tools and interfaces that will make the most of its rich markup, through lack of access to sustained technical expertise.

These vulnerabilities and invisibilities could be addressed by gathering these smaller projects together under a common infrastructure that would permit each one to show its own face while also existing as part of a super-collection. This creates, in effect, three forms of exposure and engagement for these data sets. The first is through their presence as individual projects, each with its own visible face through which that project’s data is seen and examined on its own terms (offering benefits to readers who are interested in individual projects for their specific content). The second is through their juxtaposition with (and hence direct awareness of) other similar projects, which opens up opportunities for projects to modify their data and converge towards some greater levels of consistency (offering benefits to the projects themselves). And the last is through their participation in the super-collection, the full aggregation of all project data (offering benefits to those who want to study TEI data, and also—if the collection gets large enough—to those who are interested in the content of the corpus that is formed).

The idea of a repository of TEI texts has been proposed before, in particular in 2011 as part of the discussion of the future of the TEI. There was general agreement that a repository of TEI data would have numerous benefits: as a source of examples for teaching and training and tool development, as a corpus to enable the study of the TEI, as a corpus for data mining, and so forth. But the discussion on the TEI listserv at that time came at the undertaking from a somewhat different angle: it focused on the functions of the repository with respect to an authoritative TEI (in other words, on the function of the data as a corpus) rather than considering how such a repository might serve the needs of individual contributors. Perhaps as a result, significant attention was paid to the question of whether and how to enforce a baseline encoding, to provide for interoperability of the data; there was a general assumption that data in the repository should be converted to a common format (and perhaps that the responsibility for such conversion would lie with the contributing projects).

In other words, underlying this discussion was an assumption that the data would be chiefly used as an aggregation, and that without conversion to a common format, such an aggregation would be worthless.

But I think we should revisit these assumptions. In fact, I think there’s a huge benefit of such a repository first and foremost to the contributors for the reasons I’ve sketched above, and that benefit is accentuated if the repository permits and even supports variation in encoding. And I also think there’s a great deal we can do with this data as an aggregation, if we approach it in the same spirit as we approach any other heterogeneous, non-designed data set. Instead of aspiring to perfect consistency across the aggregation, we can focus on strategies for making useful inferences from the markup that is actually there. We can focus on the semantics of individual TEI elements, rather than on structure: in other words, on mining instead of parsing. And we can focus on what can be inferred from coarse rather than fine nesting information: “all of these divisions have a <dateline> somewhere in them” rather than “each of these divisions has its <dateline> in a different place!?” We can also be prepared to work selectively with the data: for tools or functions that require tighter constraints, test for those constraints and use only the data that conforms. In short, we should treat interoperability as the last (and highly interesting) research challenge rather than the first objection. And of course, once we have such a data set, we can also think of ways to increase its convergence through data curation and through functional incentives to good practice.
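To make the distinction between coarse and fine nesting a little more tangible, here is a minimal sketch of what such tests might look like, assuming Python’s standard ElementTree library and three invented sample letters. The function names, the sample documents, and the “timeline tool” requirement are all illustrative assumptions; none of this describes an existing service.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, bound to a prefix for use in path expressions.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def has_somewhere(element, name):
    """Coarse test: is there a tei:<name> anywhere beneath this element?"""
    return element.find(f".//tei:{name}", TEI) is not None


def divisions_with(root, name):
    """Return (matching, total) counts over every tei:div in the document."""
    divs = root.findall(".//tei:div", TEI)
    return sum(has_somewhere(d, name) for d in divs), len(divs)


def usable_for_timeline(root):
    """Selective use: a hypothetical timeline tool wants a dateline that
    contains a date with a machine-readable @when attribute."""
    return any(
        dl.find(".//tei:date[@when]", TEI) is not None
        for dl in root.iterfind(".//tei:dateline", TEI)
    )


# Three invented letters: dateline in the opener, in the closer, and absent.
SAMPLES = {
    "a.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div type="letter">'
             '<opener><dateline><date when="1850-03-02">2 March 1850</date></dateline></opener>'
             '<p>Dear friend</p></div></body></text></TEI>',
    "b.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div type="letter">'
             '<p>Dear friend</p><closer><dateline>Concord, Tuesday</dateline></closer>'
             '</div></body></text></TEI>',
    "c.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div type="letter">'
             '<p>Dear friend</p></div></body></text></TEI>',
}

if __name__ == "__main__":
    for filename, source in SAMPLES.items():
        root = ET.fromstring(source)
        print(filename, divisions_with(root, "dateline"), usable_for_timeline(root))
```

The coarse test treats the first two letters alike even though their datelines sit in different places, while the stricter timeline test simply sets aside the files that do not meet its constraint rather than rejecting the whole collection.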

If this is the path forward, I’d like to argue that the TEI has as much stake in it as the scholarly community of users, and I’d like to propose that we consider what that path could look like. I am involved in the TAPAS project, which has already begun some work of its own in response to this set of needs, with special emphasis on the predicament of small TEI producers. But we are also very eager to see that work benefit the TEI as broadly as possible. So in the interest of understanding those broader benefits, I’d like to set TAPAS aside for the moment, for purposes of this discussion: let’s treat those plans as hypothetical and flexible and instead entertain the question of what the TEI and the TEI community might most look for in such a service, if we were designing it from scratch.

What would such a service look like? What could it usefully do, within the ecology I have sketched? This is a question I would like to genuinely pose for discussion here, but to give you something to respond to I am going to seed the discussion with some heretical proposals:

  • Gather the data in one place
  • Exercise our ingenuity in leveraging what the TEI vocabulary does give us in the way of a common language
  • Offer some incentives towards convergence in the form of attractive functionality
  • Provide some services that help with convergence (e.g. data massage)
  • Provide some automated tools that study divergence and bring it to the fore, for discussion: why did you do it this way? Could you do it that way?
  • But also permit the exercise of independence: provide your own data and your own stylesheets
  • Find ways to make the markup itself of interest: this is a corpus of TEI data, not (primarily) a corpus of letters, novels, diaries, etc.
  • Encourage everyone to deposit their TEI data (eventually!)
  • Provide curation (figure out how to fund it), top priority: this is a community resource
  • Provide methods for mining and studying this data (qua TEI, qua content)
  • Provide ways to make this data valuable for third parties: make it as open as possible

Discuss!


Art, Data, and Formalism

[This is the text of a presentation I gave at “(Digital) Humanities Revisited,” a conference held at the Herrenhausen Palace in Hanover, Germany, on December 5-7 2013. The full record of this fascinating conference, including audio of several of the presentations, can be found here.]

Writing free verse is like playing tennis with the net down.

—Robert Frost, Address at Milton Academy, Massachusetts (17 May 1935)

This, however, is the great step we have to take; our analysis, which has hitherto been qualitative, must become quantitative … If you cannot weigh, measure, number your results, however you may be convinced yourself, you must not hope to convince others, or claim the position of an investigator; you are merely a guesser, a propounder of hypotheses.

—Frederick Fleay, “On Metrical Tests as applied to Dramatic Poetry.” The New Shakspere Society’s Transactions. Vol. 1. London: Trübner and Co. 1874.

They will pluck out the heart not of Hamlet’s but of Shakespeare’s mystery by the means of a metrical test; and this test is to be applied by a purely arithmetical process.

—Algernon Charles Swinburne, “A Study of Shakespeare”, London: Chatto and Windus, 1880.

In the late 19th century, the New Shakespeare Society outlined a program of research involving quantitative metrical analysis of Shakespeare’s plays; they could ask the question “does Shakespeare’s use of meter reveal anything about the order of composition of his plays?” even though they could not get the answer without considerable effort. The vision of a quantitatively, empirically based program of literary research is thus not new, and does not arise with the advent of digital tools, and I would agree with Jeffrey Schnapp that it is not a necessary characteristic of many of these tools. However, it is clearly a persistent interest and one that has enjoyed a recent resurgence.

Swinburne in the quote above is as much a caricature of the “poet” as his image of Fleay is a caricature of the literary scientist. But if, like Swinburne, we are concerned that quantitative methods like these may propose a representation of “art” as “data” in ways that ignore or render inaccessible the qualities that made the category of “art” meaningful in the first place, then we may wish to ask whether there are other ways of approaching that representation. In my talk today I will be asking precisely this: what do art and data have in common that may provide the basis for a uniquely illuminating digital understanding of and engagement with cultural works?

I confess at the outset: I am going to sidestep and bracket off a set of questions that are essentially restatements of “what is art?” and adopt a position:

Art is, among other things, play with constraint.

“Play with constraint” here of course means both play enabled by constraint and also play that engages with its own constraints. Some particularly famous examples may be drawn from the literary experimentation of the Ouvroir de littérature potentielle: works like Raymond Queneau’s Cent Mille Milliards de Poèmes, which offers the reader a sonnet for which each line may be chosen from ten different options, and which required Queneau to write ten different sonnets with the same rhyme scheme whose lines could be used interchangeably. But the work of the Oulipo only serves to make vividly visible what is evident in more subtle ways wherever we look. The deliberate adoption of constraint—whether in the form of the material properties of medium, generic conventions, audience expectations, or deliberate formal limitations such as the unities of time and place—is what makes art intelligible as such. And the exploration, testing, or deliberate redrawing of the boundaries those constraints establish is one source of the pleasure and provocation that art provides: for instance, the use of enjambment to draw attention to the “frame” of the poetic line, or the use of a horizontal line in abstract painting to evoke the conventions of landscape.

These questions of constraint offer a shift of perspective on the central question of this session: what is the impact of going from (analog) art to (digital) data? The answer we are primed to give is that this is a lossy and reductive process, not only informationally but also culturally. But what precisely is lossiness in this context and how does it operate?

Digital information, when not born digital, exists under a regime of “capture” that corresponds to our understanding of the term “data” as a set of observations or representations of phenomena which we gather, record, analyse, and reproduce. We can think of the moment of “capture” as the boundary between a state in which the universe is infinitely detailed and observation has free play, and a state in which a shutter has closed and something has been written, using a particular notation system and constraint system.[1]

The lossiness of that moment of capture is well understood to us; pushing the resolution of our observational grid ever finer is one of the core progressive narratives of the digital age.

Slide of progressively higher-resolution photographs of a caterpillar

Photo by the author. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Zooming in on a data object, however high-resolution, we eventually reach that horizon of signification where we have exhausted the informational resources of the object, where there is no more detail to be had, no further differentiation and hence no further signification. The representation reveals its exhaustibility in contrast with the inexhaustibility of the real object.

Slide of a highly magnified and pixellated detail of a photograph of a caterpillar

Photo by the author. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

This liminal point between meaning and non-meaning has of course held great interest for the art world. Pointillism, for instance, plays with the ways that images can be composed of individual, independent components of light and color rather than of “figures” whose ontology is reflected in the application of paint. Here, part of the argument is turned on its head: what is remarkable is the way the human eye can interpolate information and insist upon form even when the observational data available to it has been deliberately decomposed: we learn of an innate training towards—an appetite for, an anticipation of—the perception of form by the eye and brain.

Even earlier, the poet and artist William Blake read the liminal point in a very different way in arguing against mezzotint and in favor of engraving. When we zoom in on a mezzotint image, composed of tiny dots, we lose sight of the figure, and this decomposition of the image sacrifices what for Blake is the essence of art: the “firm, determinate line” that represents the form of things seen or imagined. In a letter to his friend George Cumberland, Blake suggests that one crucial property of this line is that it carries its signifying properties into its “minutest subdivisions”: at no point does it decompose into “intermeasurable” elements in the manner of a mezzotint. In other words, in good art for Blake there is no lower threshold, no horizon below which the image ceases to signify; the problem with mezzotint is precisely that it has no innate representation of form.[2]

Blake is observing here something characteristic about the data model that inhabits pixellating technologies like these. The information carried by the dots in the bitmap image, as in the mezzotint, is purely local and positional: the individual picture elements don’t have informational connections to one another except in our perceiving analysis. A sequence of adjacent dark spaces does not constitute a “line”: in fact, this kind of data model does not know about things like “lines” or “noses” or “faces.” This is true no matter how fine-grained the actual sampling: we get more and more densely packed dots, but we never get faces, despite the clear perceptibility of figures to our visual systems. We can infer them—and our computational surrogates (like the “recognize faces” feature in your camera) are very good at this—but they have no durable presence in the ontology of the digital object. In short, this “sample-based” data model has no concern with the artistic nature of the work: with what it is about, or its formal properties, or the process of its creation. It is as if we created a museum catalog by taking one-inch thick horizontal slices of each floor of the building.
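A toy contrast may make this point about the sample-based data model concrete. In the sketch below (the 5×5 grid and the function name are invented for illustration), the bitmap stores only a value at each position. The "line" we see in it exists nowhere in the data and has to be inferred by a routine that scans for runs of adjacent dark cells.

```python
# A bitmap: each cell records only darkness (1) or lightness (0) at a position.
bitmap = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

def infer_horizontal_lines(grid, min_length=3):
    """Guess at 'lines' by scanning each row for runs of adjacent dark cells.

    The result is an inference made by this function; nothing in the grid
    itself says that a line is present.
    """
    found = []
    for y, row in enumerate(grid):
        run_start = None
        for x, value in enumerate(row + [0]):  # sentinel 0 closes a trailing run
            if value and run_start is None:
                run_start = x
            elif not value and run_start is not None:
                if x - run_start >= min_length:
                    found.append({"row": y, "from": run_start, "to": x - 1})
                run_start = None
    return found

print(infer_horizontal_lines(bitmap))  # [{'row': 2, 'from': 0, 'to': 4}]
```

Raising the resolution changes the number of cells but not this basic situation: the line remains something the scanning function proposes, not something the data declares.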

For the digital equivalent of Blake’s “bounding line”—the contour that actually represents the boundaries, shape, and identity of the work—we look to representational systems whose focus is precisely on modeling the outlines and the semantics of things—texts, images, sound, three-dimensional spaces. Many of these representational systems are at present most commonly associated with structured data formats like XML, though that is an artifact of present history and where we are in the development of computing systems. Their common thread is not XML per se, but rather their emphasis on transcribing, naming, structuring, and annotating the informationally salient pieces of things. So for instance in XML a simple representation of a poem might look like this:

<poem type="nursery_rhyme">
  <line_group type="couplet">
    <line>Probable-Possible, my black hen</line>
    <line>She lays eggs in the Relative When</line>
  </line_group>
  <line_group type="couplet">
    <line>She never lays eggs in the Positive Now</line>
    <line>Because she’s unable to postulate how.</line>
  </line_group>
</poem>

Or it might look like this:

start =
  element poem {
    attribute type {
      list {
        "sonnet"
        | "villanelle"
        | "limerick"
        | "ode"
        | "epic"
        | "nursery_rhyme"
      }
    },
    (element line_group {
       attribute type {
         list { "stanza" | "verse_para" | "couplet" }
       },
       element line { text }+
     }
     | element line { text })+
  }

The first version shows us an individual poem; the second offers one theory about what constitutes a “poem” as a genre. Both are radically impoverished (they had to fit on a slide), but at the same time they are (as a first approximation) truer to what we think of as the literary object than an equivalently impoverished bitmap version of the Mona Lisa. The encoded representation of the poem observes a set of phenomena that it takes to be salient features of the poem’s signification; the schema proposes a set of phenomena that it takes to be characteristic of the genre of “poetry”. Depending on how we use the schema, it might operate descriptively to summarize for us all of the structural possibilities present in a given poetic oeuvre, or it might operate prescriptively to dictate which postulant poems are to be allowed into our corpus of approved poem instances. And depending on how we create the schema, it might operate observationally to record the attested formal properties of a collection, or it might operate theoretically to hypothesize and test such properties in an unfamiliar collection. In other words, the schema operates much as a theory of poetry operates, depending on whether we are literary historians (“here is what the poem has been”) or literary critics (“here’s what a poem really is”).
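The descriptive and prescriptive uses of the schema can be sketched in code. What follows is a minimal illustration, assuming Python's standard xml.etree.ElementTree parser and a hand-written check that mirrors only the main constraints of the slide-sized schema. It is not a RELAX NG validator, and the function names are invented for this example. The "describe" function works descriptively, reporting the structures a poem attests; the "problems" function works prescriptively, listing the ways a postulant poem departs from the schema.

```python
import xml.etree.ElementTree as ET

# Vocabularies copied from the toy schema.
POEM_TYPES = {"sonnet", "villanelle", "limerick", "ode", "epic", "nursery_rhyme"}
GROUP_TYPES = {"stanza", "verse_para", "couplet"}

POEM = """
<poem type="nursery_rhyme">
  <line_group type="couplet">
    <line>Probable-Possible, my black hen</line>
    <line>She lays eggs in the Relative When</line>
  </line_group>
  <line_group type="couplet">
    <line>She never lays eggs in the Positive Now</line>
    <line>Because she's unable to postulate how.</line>
  </line_group>
</poem>
"""

def describe(poem):
    """Descriptive use: summarize the structures this poem actually attests."""
    return {
        "type": poem.get("type"),
        "groups": [g.get("type") for g in poem.findall("line_group")],
        "lines": len(list(poem.iter("line"))),
    }

def problems(poem):
    """Prescriptive use: list departures from the toy schema's constraints."""
    found = []
    if poem.tag != "poem":
        found.append(f"root is <{poem.tag}>, not <poem>")
    if poem.get("type") not in POEM_TYPES:
        found.append(f"unknown poem type: {poem.get('type')!r}")
    if len(poem) == 0:
        found.append("poem has no content")
    for child in poem:
        if child.tag == "line":
            continue
        if child.tag != "line_group":
            found.append(f"unexpected element <{child.tag}>")
            continue
        if child.get("type") not in GROUP_TYPES:
            found.append(f"unknown line_group type: {child.get('type')!r}")
        if len(child) == 0 or any(c.tag != "line" for c in child):
            found.append("line_group must contain one or more <line> elements")
    return found

poem = ET.fromstring(POEM)
print(describe(poem))   # structures observed in this instance
print(problems(poem))   # an empty list means the toy constraints are met
```

A poem typed as "haiku" would be reported by "problems" as a departure. Whether we respond by rejecting the poem or by extending the list of types is exactly the choice between the prescriptive and the descriptive stance.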

[As an aside: I realize I have done something slightly sneaky here, in substituting a textual example for a pictorial one just at the point in my argument where I shift ground from the pixel to the schema. However, this apparent bait and switch maneuver is motivated chiefly by time constraints. A plain text transcription is actually an interesting example of both a kind of lossy sampling approach (when compared with manuscript) and also a kind of minimalist modeling approach (when compared with a bitmap image). And there are abundant examples of strongly modeled data formats for sound, images, and three-dimensional spaces which unfortunately I don’t have time to detail here.]

I am arguing here on behalf of a convergence or unexpected sympathy between art and data, in this respect: that they share an interest in form and in the operations of constraint systems that express and regulate form and hence meaning. And further, I would stress that they also share a critical or frictional relationship to those constraint systems: they also permit form to quarrel with its own constraints. As Willard McCarty has argued (and his is only the most sustained argument to this effect), the data model created by processes of digital scholarship is always an inquiry into the nature of the object being modeled, and always reveals something about its theory of the object by what it cannot accommodate: by the blank spaces it leaves, by the ways it misreads or appropriates the object.[3] For an obvious example, we might consider the long-standing tension between text encoding practices and the representation of physical documents, or the ways in which classification systems shape our understanding of a knowledge domain.

In the evolution of an artistic form or genre (such as the sonata, or the epistolary novel, or the landscape painting) there is a process of exploration and elaboration as the practitioners of the form first converge on a set of shared practices, and explore what formal constraints can offer by way of heuristic framing. So for instance, the landscape painting lets us position a viewer in a particular kind of relation to an experience of space and place; the epistolary novel allows us to explore interiority, human relationships, and also the specific narrative frisson arising from the ways that the epistolary form limits what each character knows of the entire narrative situation. And at a certain point, those very constraints become themselves the visible feature: something to be foregrounded, displayed, played with, destabilized, in some cases deliberately and visibly violated.

In the digital humanities, data modeling proceeds in a very similar spirit: we use the process of data modeling to test our ideas about how texts and cultural objects behave against the actual cultural landscape. This is especially true in cases where a data model—for instance, a schema like that of the Text Encoding Initiative—has been created as an expression of disciplinary consensus, precisely so that it can be examined, questioned, and refined. The most important intellectual outcome of such a formalization is arguably not the consensus it proposes (its function as a standard) but the critical scrutiny it makes possible. The constraints become visible—and debatable—as an artifact of their capacity to generate meaning. As I have argued elsewhere, the design of models like the TEI (which explicitly permit dissent to be expressed in formal terms) provides a deeply characteristic mechanism for digital humanities research. In the data world, we see constraint used as an exploratory tool as much as it is used as an instrument of production. For a project like the Women Writers Project, the schema is both a form of documentation and a provisional theory, reflecting the formal structures we have discovered in the vast heterogeneous body of texts we are representing. We treat each version of the schema as a hypothesis pro tem concerning genre and textual structure, and we use it until it is challenged by an edge case that cannot be accommodated, at which point we modify the schema as needed to reflect this newly discovered corner of reality.

We are now in a position to return to our point of departure and ask again: what is the impact of going digital; how does it take us “from art to data” in the domain of text? And what does it mean to go “from art to data”?

I argue that going from art to data in a meaningful way entails bringing a set of formal commitments, such as we understand them, from the art work into explicit view within the domain of our data. The work of “going digital” thus entails several things:

1. First, it entails understanding and articulating those formal commitments (at the level of both the individual object and the genre), and expressing them in some appropriate form within the data. The resulting model makes explicit the formal structures that were at work in the original object, or in our engagement with it (for instance, interpretive strategies and editorial interventions).

2. Second, it entails a strategic information loss: a deliberate setting aside of the information we do not choose to retain, the information not accommodated in the model. This loss differs from the sampling loss, the pixels-in-between, of a bitmap-style data capture, in that the loss here serves to heighten the salience of what is retained, in the way that a road map omits elevation and soil data so that we can see the route more clearly.

3. And finally, it entails a corresponding strategic informational gain, in that the digital representation of a source object constitutes a model of the object: a purposeful representation of it that serves some interpretive or analytic goal.

A screen shot of a visualization from the Women Writers Project

Screen shot from Women Writers Project, http://www.wwp.brown.edu/wwo/lab/speakers.html

So the lossiness of formalism is also its strength and brings a number of important consequences:

  • we create computational tractability: the information we do retain is better adapted to computationally mediated ways of knowing, tools for analysis and apprehension
  • we gain strategic focus: we keep the information that matters to us, we focus our resources (of curation, storage, tool accommodation, etc.) on what we will actually use
  • we enable higher-level pattern discovery: we can compare models as well as instances; we can see patterns not only in our data but in our ideas about our data. The TAPAS project, for instance, which is building a large corpus of TEI collections, will also build tools by which we can study and compare the schemas projects use.
  • most importantly, through formalism, we create explicit models representing and communicating our assumptions about what constitutes the object for us in representational terms: what information is being carried with us across the analog-digital barrier. Johanna Drucker has observed that “insofar as form allows sense to appear to sentience…the role of aesthetics is to illuminate the ways in which the forms of knowledge provoke interpretation”[4] and in this respect, aesthetics and information modeling have a great deal in common.
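Comparing models as well as instances can be sketched very simply if each schema is reduced to the vocabulary it permits. The example below is an illustration only: the two "project schemas" are invented, the reduction of a schema to a mapping from element names to permitted type values is a drastic simplification, and nothing here reflects the actual tools planned by TAPAS.

```python
def compare_schemas(a, b):
    """Compare two schemas given as {element name: set of permitted type values}."""
    report = {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "shared": {},
    }
    for name in sorted(set(a) & set(b)):
        report["shared"][name] = {
            "agreed": sorted(a[name] & b[name]),
            "a_only": sorted(a[name] - b[name]),
            "b_only": sorted(b[name] - a[name]),
        }
    return report

# Two invented project vocabularies, for illustration only.
project_a = {
    "poem": {"sonnet", "ode", "epic"},
    "line_group": {"stanza", "couplet"},
}
project_b = {
    "poem": {"sonnet", "nursery_rhyme"},
    "line_group": {"stanza", "verse_para"},
    "refrain": set(),
}

report = compare_schemas(project_a, project_b)
print(report["only_in_b"])       # ['refrain']
print(report["shared"]["poem"])
```

Even at this crude level the report shows where two projects agree about what a poem is and where their theories diverge, which is a pattern in our ideas about the data and not in the data itself.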

The biggest research question I see arising here is how to achieve interoperability or mutual intelligibility of data models that does not require identity of data models. In other words, rather than seeking to perfect our models and achieve perfect consensus about them, how can we continue to have disagreeing but productive conversations about divergent interpretations, essential to the humanities, via our data models? How can our data models help us undertake these conversations? It seems to me that the accommodation of art within the sphere of data ultimately rests on this commitment to ongoing interpretation. When the TEI was founded, it was with the understanding that it needed to function as a broker or hub to mediate interchange among many divergent and equally legitimate modeling approaches. Over time, the TEI’s customization mechanism has evolved into a very rich system for representing dissent, debate, and interpretation, and may soon provide ways of visualizing those debates through a detailed comparison of TEI schemas. While recognizing the need in some contexts for simple standards that enforce more impoverished but regular formal structures upon us, I hope we can avoid assuming that those simplifications are always adequate and always necessary as a condition of data.


[1] As Trevor Muñoz defined it in a recent Digital Humanities Data Curation workshop, data is

“information in the role of evidence; propositions systematically asserted, encoded in symbol structures.”

[2] The full passage from this letter reads: “For a Line or Lineament is not formed by Chance a Line is a Line in its Minutest Subdivisions Strait or Crooked It is Itself & Not Intermeasurable with or by any Thing Else Such is Job but since the French Revolution Englishmen are all Intermeasurable One by Another Certainly a happy state of Agreement to which I for One do not Agree.” (William Blake, Letter to George Cumberland, 12 April 1827. Available online at the William Blake Archive, http://www.blakearchive.org/exist/blake/archive/transcription.xq?objectid=lt12april1827.1.ltr.02.)

[3] “…a model may violate expectations and so surprise us: either by a success we cannot explain, e.g., finding an occurrence where it should not be; or by a likewise inexplicable failure, e.g., not finding one where it is otherwise clearly present. In both cases modeling problematizes. As a tool of research, then, modeling succeeds intellectually when it results in failure, either directly within the model itself or indirectly through ideas it shows to be inadequate. This failure, in the sense of expectations violated, is, as we will see, fundamental to modeling.” Willard McCarty, “Modeling: A Study in Words and Meanings”, A Companion to Digital Humanities, ed. Schreibman, Siemens, and Unsworth. Blackwell, 2004.

[4] Johanna Drucker, SpecLab: Digital Aesthetics and Projects in Speculative Computing (2009), xii.


TEI and Scholarship (in the C{r|l}o{w|u}d)

[This is the text of a keynote presentation I gave at the TEI conference in 2012 at Texas A&M University. I am working on making my past presentations accessible here, in case they may be useful to anyone.]

James Surowiecki’s well-known book on “The Wisdom of Crowds,” and Kathy Sierra’s counterpoint on “The Dumbness of Crowds” give us a provisional definition of a wise crowd:

  • It must possess diversity of information and perspectives
  • Its members must think independently
  • It must be decentralized, and able to draw effectively on local knowledge
  • It must have some method of aggregating and synthesizing that knowledge

In this sense, the TEI is a wise crowd: deliberately broad, designedly interested in synthesizing the vast local expertise we draw on to produce something as ambitious and deeply expert as the TEI Guidelines.

But the TEI also has some structural ambivalence about the crowd: about the role of standards in a crowd. As researchers and designers (people who contribute expertise to the development of the guidelines) the crowd (that’s us!) is great, but as encoders and project developers the crowd shows another face:

  • The crowd is unruly!
  • They make mistakes!
  • They commit tag abuse!
  • They all do things differently!
  • They are all sure they are right!

The dissatisfaction with this divergence arises from the hope that this crowd (that is, also us!) will put its efforts towards a kind of crowd-sourcing: developing TEI data that can be aggregated into big data, producing a crowd-sourced canon that would serve as an input to scholarship. This would mean:

  • To the extent possible, a neutral encoding (whatever that means)
  • A reusable encoding (whatever that means)
  • An interoperable encoding (whatever that means)

The technical requirements for developing such a resource are not difficult to imagine; MONK and TEI-analytics have shown one way. But the social requirements are more challenging. That unruliness I mentioned isn’t cussedness or selfishness. It has to do rather with the motives and uses for text encoding that are emerging now more powerfully as the TEI comes into a new relationship with scholarship. The “crowd” in this new relationship is once again that wise crowd whose expertise is so important, but it is working in new ways and in a different environment.

In what follows, I’m going to explore the TEI and scholarship in (and of) the crowd by sketching three convergent narratives, and then considering where they lead us.

1. People

The first is a narrative of people: an emerging population of TEI users.

Here is a starting point: in the past year, from the Women Writers Project’s workshops alone (and we are only one of a large number of organizations offering TEI workshops), about 150 people participated in introductory TEI workshops, and this has been a typical number over the past three years, which have seen a steady and striking increase in demand for introductory and advanced TEI instruction. Syd Bauman and I began teaching TEI workshops in 2004; here’s the trend in total workshops offered:

WWP TEI Seminars, 2004-2012

And here is the trend in number of participants:

WWP TEI Seminars, 2004-2012

The increasing trend here is striking in itself, but more so when we unpack it a little. First, the composition of the audience is changing: 8-10 years ago, the predominant audience for TEI training was library and IT staff; now, very substantial numbers of participants are faculty and graduate students. Many (or even most) attendees are planning a small digital humanities project with a strong textual component, and they see TEI as a crucial aspect of the project’s development. They are full of ideas about their textual materials and also about the research questions they’d like to be able to pursue, or the teaching they’d like to be able to do; they see TEI markup as a way of representing what excites them about their texts and their research. Even more remarkably, some people attend TEI workshops to learn TEI because they think they should know about it, not solely because they are planning a TEI project. In other words, text encoding has taken on the status of an academic competence, a humanities research skill.

And these workshops represent just the WWP: there are now many other workshop series that are experiencing similar levels of growth.

2. Scholarship

The second strand of the narrative I’d like to lay out here has to do with scholarship and scale. TEI data is often considered as an input for scholarship, for instance via the concept of the “digital archive” or the digital edition. This is data digitized in advance of need, often by those who own the source material, on spec as it were, to be used by others. In the terms discussed at a recent workshop at Brown University on data modeling in the humanities, this is data being modeled “altruistically.” It is designed to function in a comparatively neutral and anticipatory way: in effect, it is data that stands ready to support whatever research we may bring to it. And in this sense it is data that serves scholarship, where the actual scholarly product is expected to be expressed in some other form, such as a scholarly article or monograph.

However, as DH is increasingly naturalized in humanities departments, we now are seeing attempts to articulate the role that text markup can play in a much closer relationship to scholarship:

  • First, TEI data considered as scholarship: as a representation of an analytical process that yields insight in itself, something that one could receive scholarly credit for.
  • And second, TEI data considered as a scholarly publication that is not operating under the sign of “edition” but rather something more like “monograph” or in fact a hybrid of the two: in other words, not operating solely as a reproduction or remediation of a primary source, but also as an argument about a set of primary sources or cultural objects/ideas.

The stakes of articulating this case successfully are, first of all, to make this kind of work visible within the system of academic reward, and second, to call in question the separation of “data creation” from interpretation and analysis: a separation that is objectionable within the digital humanities on both theoretical and political grounds.

As a result, this new population of scholar-encoders is ready and willing to understand the encoding work they are undertaking as entirely continuous with their work as critics, scholars, and teachers. The data they aspire to create is expected to carry and express this full freight of meaning.

3. Economics

The third strand I offer here has to do with economics, and in particular with the economics of data in the crowd. Because hand markup is expensive, large-scale data collections tend not to emphasize markup; in large-scale digital library projects for which the lower levels of the TEI in Libraries Best Practices guidelines were developed, markup is concentrated in the TEI header, largely absent in the text transcription. And in these large-scale contexts, markup is thus not only unrealistically expensive at an incremental level, but it also fails a larger economic test in which we compare the time it takes to do a task by hand with the time it will take to build a tool to do the task. If the tool is running on a sufficiently large data set, even a very expensive tool will pay for itself in the end. For this reason, algorithmic, just-in-time approaches make sense in large collections.
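The economic test described here is a break-even comparison. As a rough sketch, assume (hypothetically) that hand markup costs a fixed amount of time per document, while a tool costs a one-time build effort plus a much smaller running cost per document. The point at which the tool pays for itself can then be computed directly. The function name and all figures below are invented for illustration.

```python
import math

def break_even_documents(hand_per_doc, tool_build, tool_per_doc):
    """Smallest collection size at which building the tool beats hand markup.

    Hand cost grows as n * hand_per_doc; tool cost as
    tool_build + n * tool_per_doc. Returns None if the tool never pays off.
    """
    saving_per_doc = hand_per_doc - tool_per_doc
    if saving_per_doc <= 0:
        return None
    # Smallest integer n with n * saving_per_doc strictly greater than tool_build.
    return math.floor(tool_build / saving_per_doc) + 1

# Invented figures, in hours: 2 hours of hand markup per document,
# 1,000 hours to build a tool, 0.01 hours of tool time per document.
print(break_even_documents(2.0, 1000.0, 0.01))  # 503
```

With these numbers a collection of a few hundred documents favors hand markup, while a collection of hundreds of thousands overwhelmingly favors the tool, which is the asymmetry the argument depends on.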

Markup excels in two situations. It excels in cases where the data set is small and the markup task is so complex that the economics of tool-building are at a disadvantage. If the tool required starts to approximate the human brain, well, we can hire one of those more cheaply than we can build it (for a few more years at least!). And second, markup excels in cases where we need to be able to concretize our identifications and subject them to individual scrutiny before using them as the basis for analysis: in other words, in situations where precision does matter, where every instance counts heavily towards our final result, and where the human labor of verification is costly and cannot be repeated. Hand markup is thus characteristically associated with, economically tied to, small-scale data.

At this point in my story, these three narratives — people, scholarship, and economics — start to converge. At present, a generation of highly visible and really interesting digital humanities scholarship is proceeding on the basis of research that analyzes very large-scale data with powerful analytical tools. (Think of the Digging into Data funding initiative, of grid technologies, of high-performance computing, of e-science.) But at the same time, another new generation of digital humanities scholars is emerging along very different lines. They are humanities faculty working on individual texts, authors, genres; they are interested in the mediation of textual sources (in ways that are consonant with domains like book and media history); and they are alert to the ways that textual representation (including data modeling) inflects meaning, reading, interpretation. These scholars are gaining expertise in TEI as a natural extension of their interest in texts and in textual meaning and representation, and their scholarship will be expressed in their markup, and will also arise out of the analysis their markup makes possible.

So there is an interesting interplay or counterpoint here. Algorithmic approaches work well at large scale precisely because they don’t require us to scrutinize individual cases or make individual decisions: they work on probabilities rather than on certainty, and they work on trends, correlations, the tendency of things to yield pattern, rather than on the quiddity of things, their tendency towards uniqueness and individual bizarreness. But for some kinds of interpretation (e.g. for intrinsically small data sets, or for data sets that are dominated by exceptions rather than patterns), that very bizarreness is precisely the object of scrutiny. Scholarship here is a process of scrutiny, debate, and careful adjustment of our interpretations, instance by instance, and markup allows scholarly knowledge to be checked and adjusted. It yields a product that carries a higher informational value.

These two paradigms, if we like, of digital scholarship (“scholarship in the algorithm” and “scholarship in the markup”) are different but not opposed, not inimical; they are appropriate in different kinds of cases, both legitimate. They both represent significant and distinctive advances in digital scholarship: the idea of formalizing an analytical method in an algorithm is no more and no less remarkable, from the viewpoint of traditional humanities scholarship, than the idea of formalizing such a method in an information structure that adorns a textual instance. They each represent different attitudes towards the relationship between pattern and exception, and different approaches to managing that interplay. And in fact, as the Text Encoding and Text Analysis session at DH2012 noted, these two approaches have a lot to offer one another: they both work even better together than separately.

What can we observe about the TEI landscape that these narratives converge upon? First, it is significantly inhabited by “invisible” practitioners who are not experts, not members of TEI-L, not proprietors of large projects, but nonetheless receiving institutional support to create TEI data on a small scale (individually) and a large scale (collectively). These users are strongly invested in the idea of TEI as a tool for expressing scholarship: they believe that it is the right tool and they find it satisfying to use. They are working on documents that are valuable for their oddity and exceptionalism, and I will indulge myself here in the topos of the copious list: the notebooks of Thoreau, whaling logs, auction catalogues, family letters, broadsides attesting to the interesting reuse of woodcut illustrations, financial records, Russian poetry, an 18th-century ladies magazine specializing in mathematics, revolutionary war pamphlets, sermons of a 19th-century religious leader, drafts of Modernist poetry, Swedish dramatic texts, records of Victorian social events, the thousand-page manuscript notebook of Ralph Waldo Emerson’s aunt, Mary Moody Emerson, and so on.

It’s hard to envision an intellectual rubric for such projects, and yet at a certain level they have a lot of things in common. First, they have a set of functional requirements in common: with small numbers of comparatively small TEI files, they are not “big data” and they don’t require a big publication infrastructure, but they do need a few commonly available pieces of publication infrastructure: a “table of contents” view, a “search” view, a “results” view (same as TOC, basically), a “reading” view. All of these are now the stock in trade, out of the box, of simple XML publishing tools like eXist, XTF, etc.

However, this data could yield quite a bit more insight with a few more tools for looking at it, and some tools in particular come up over and over again as offering useful views of the data: timelines, maps, representations of personal interconnections, networks of connected entities. And taking this even further, there are specific genres that could benefit from specific analytical tools: for instance, a drama browser (think of Watching the Script), a correspondence browser (think of a combination timeline and personal network), an itinerary browser (think of a combination map and timeline, like Neatline), an edition browser (with focus on variant readings, commentary, witness comparison: think of an amplified version of Juxta), and so forth.

These projects also have a set of opportunities in common: for one thing, they represent a remarkable opportunity to study the ways that markup functions representationally, if only we could study this data as a body: a semiotic form of knowledge. And for another, they represent a remarkable opportunity to study the ways that specific TEI elements are used, in the wild: a sociological form of knowledge. Finally, and most importantly, these projects have a number of problems in common:

  • Publication: there is currently no obvious formal venue for such projects/publications, and the kind of self-publication that is fairly common in university digital humanities centers isn’t available at smaller institutions, or at institutions with only a single project of this kind.
  • Data curation: these projects are a data curation time bomb; they typically have a very tiny project staff that by its nature is short-term (students with high turnover; IT or library staff who don’t have a long-term institutional mandate to assist; grant-funded labor that will disappear when the funding runs out). Running on slender resources, they don’t have the luxury of detailed internal documentation (and they don’t typically have staff who are skilled at this). Migration to future TEI formats is in many cases probably out of reach.
  • Basic ongoing existence: these are projects that quite often lack even a stable server home; when the person primarily responsible for their creation is no longer working on the project, the institution doesn’t have anyone whose job it is to keep the project working.

From some perspectives, these look like problems for these projects to identify and suffer and (hopefully, eventually) solve. This perspective has produced the TAPAS project, which may be familiar to many of you. TAPAS is a project now in a 2-year development phase funded by IMLS and NEH, which is developing a TEI publishing and archiving service for TEI projects at small institutions and those operating with small resources.

But we should also treat this as a problem for the TEI. If you can indulge me for a moment in some cheap historical periodization, we can divide the TEI’s history thus far into several phases:

  1. Inception and problem identification, where the problem is the fact that many scholars want to produce digital representations of research materials, and there is a risk that they will do it in dead-ended, self-limiting ways
  2. Research and development, where the TEI community grows intrepidly and tackles the question of “How do we represent humanities research materials?”
  3. Refinement and tool-building, where the community (now having both critical mass and an intellectual history) can set in place a working apparatus of use (e.g. Roma) and build Things that Work
  4. And now, in the past five years: public visibility, where (thanks to the tremendous and sudden popularity of the digital humanities), the TEI is now noticeable and legitimate in sectors where before it would have appeared a geeky anomaly. As I noted earlier, people—faculty, graduate students—now attend TEI workshops just out of a sense of professional curiosity and responsibility: “This is something I should know about.”

Things look very good, according to several important metrics. There’s public and institutional funding available for TEI projects; the idea of treating TEI projects as scholarly work, to be rewarded with professional advancement, isn’t ridiculous but is a real conversation. The “regular academy” recognizes the TEI (albeit in a vague and mystical way) as a gold standard in its domain: it possesses magical healing powers over data. And there is an infrastructure for learning to use the TEI, which is a huge development; Melissa Terras, a few years ago, addressed the TEI annual conference with a strongly worded alarum: she pointed out that the TEI had an urgent and sizeable need for training materials, support systems, information, on-ramps for the novice. Although the TEI itself has not responded to that call, its community has: there are now a substantial number of regular programs of workshops and institutes where novices and intermediate users can get excellent training in TEI, and there are also starting to be some excellent resources for teaching oneself (chief among them TEI by Example, developed by Melissa Terras and Edward Vanhoutte). And finally, a lot of TEI data is being produced.

But that success has produced a crossroads that we’re now standing at. The question is whether in 20 years that data will represent scholarly achievement or the record of a failed idealism: whether the emerging scholarly impulse to represent documents in an expressive, analytical, interpretively rich way is simply obsolete and untenable, or whether in fact such impulses can constitute a viable form of digital scholarship: not as raw, reusable representations whose value lies chiefly in the base text they preserve, but as interpretations that carry an insight worth circulating and preserving and coming back to. If the answer turns out to be “yes”, it will be because two conditions have been met:

  1. The data still exists (a curation challenge).
  2. The data still has something to say (a scholarly challenge).

It’s important to observe that this is not a question about interoperability; it is a question about infrastructure, and it is about social infrastructure as much as it is about technical infrastructure. It is tempting to treat the prospect of hundreds of small TEI projects as simply an interoperability nightmare, a hopeless case, but I think this assumption bears closer scrutiny and study. In fact, at this point, I will assert (sticking my neck out here) that the major obstacle to the long-term scholarly value and functioning of this data is not its heterogeneity but its physical dispersion. As an array of separately published (or unpublished) data sets, this material is practically invisible and terribly vulnerable: published through library web sites or faculty home pages; unable to take advantage of basic publishing infrastructure that would make it discoverable via OAI-PMH or similar protocols; vulnerable to changes in staffing and hardware, and to changes in publication technology. And last, through a terrible irony, unlikely to be published with tools and interfaces that will make the most of its rich markup, through lack of access to sustained technical expertise.

These vulnerabilities and invisibilities could be addressed by gathering these smaller projects together under a common infrastructure that would permit each one to show its own face while also existing as part of a super-collection. This creates, in effect, three forms of exposure and engagement for these data sets. The first is through their presence as individual projects, each with its own visible face through which that project’s data is seen and examined on its own terms (offering benefits to readers who are interested in individual projects for their specific content). The second is through their juxtaposition with (and hence direct awareness of) other similar projects, which opens up opportunities for projects to modify their data and converge towards some greater levels of consistency (offering benefits to the projects themselves). And the last is through their participation in the super-collection, the full aggregation of all project data (offering benefits to those who want to study TEI data, and also—if the collection gets large enough—to those who are interested in the content of the corpus that is formed).

The idea of a repository of TEI texts has been proposed before, in particular in 2011 as part of the discussion of the future of the TEI. There was general agreement that a repository of TEI data would have numerous benefits: as a source of examples for teaching and training and tool development, as a corpus to enable the study of the TEI, as a corpus for data mining, and so forth. But the discussion on the TEI listserv at that time came at the project from a somewhat different angle: it focused on the functions of the repository with respect to an authoritative TEI—in other words, on the function of the data as a corpus—rather than considering how such a repository might serve the needs of individual contributors. Perhaps as a result, significant attention was paid to the question of whether and how to enforce a baseline encoding, to provide for interoperability of the data; there was a general assumption that data in the repository should be converted to a common format (and perhaps that the responsibility for such conversion would lie with the contributing projects).

In other words, underlying this discussion was an assumption that the data would be chiefly used as an aggregation, and that without conversion to a common format, such an aggregation would be worthless.

But I think we should revisit these assumptions. In fact, I think there’s a huge benefit of such a repository first and foremost to the contributors for the reasons I’ve sketched above, and that benefit is accentuated if the repository permits and even supports variation in encoding. And I also think there’s a great deal we can do with this data as an aggregation, if we approach it in the same spirit as we approach any other heterogeneous, non-designed data set. Instead of aspiring to perfect consistency across the aggregation, we can focus on strategies for making useful inferences from the markup that is actually there. We can focus on the semantics of individual TEI elements, rather than on structure: in other words, on mining instead of parsing. And we can focus on what can be inferred from coarse rather than fine nesting information: “all of these divisions have datelines somewhere in them” rather than “each of these divisions has its dateline in a different place!?” We can also be prepared to work selectively with the data: for tools or functions that require tighter constraints, test for those constraints and use only the data that conforms. In short, we should treat interoperability as the last (and highly interesting) research challenge rather than the first objection. And of course, once we have such a data set, we can also think of ways to increase its convergence through data curation and through functional incentives to good practice.
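To make the “coarse rather than fine nesting” idea concrete, here is a minimal sketch in Python. It assumes TEI P5 files in the standard TEI namespace; the two sample letters and the function names are hypothetical, invented for illustration, and are not part of TAPAS or any existing tool. The test asks only whether a division has a dateline somewhere inside it, and the second function then works selectively with the divisions that pass.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace
NS = {"tei": TEI_NS}

# Two hypothetical letters that place <dateline> at different depths.
SAMPLE_A = f"""<TEI xmlns="{TEI_NS}"><text><body>
  <div type="letter"><opener><dateline>Boston, 1 May 1790</dateline></opener><p>Dear friend</p></div>
</body></text></TEI>"""

SAMPLE_B = f"""<TEI xmlns="{TEI_NS}"><text><body>
  <div type="letter"><p>Dear sister</p><closer><signed>M.</signed><dateline>London, 3 June 1791</dateline></closer></div>
  <div type="letter"><p>No date here</p></div>
</body></text></TEI>"""


def divs_with_dateline(xml_text):
    """Return (divisions containing a dateline at any depth, all divisions)."""
    root = ET.fromstring(xml_text)
    divs = root.findall(".//tei:div", NS)
    # Coarse test: is there a <dateline> anywhere among the descendants?
    matching = [d for d in divs if d.find(".//tei:dateline", NS) is not None]
    return matching, divs


def conforming_datelines(xml_texts):
    """Work selectively: collect dateline text only from divisions that have one."""
    found = []
    for xml_text in xml_texts:
        matching, _ = divs_with_dateline(xml_text)
        for div in matching:
            dateline = div.find(".//tei:dateline", NS)
            found.append("".join(dateline.itertext()).strip())
    return found
```

Calling `conforming_datelines([SAMPLE_A, SAMPLE_B])` gathers the datelines from both letters even though one sits in an opener and the other in a closer; the undated division is simply left out rather than treated as an error.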

If this is the path forward, I’d like to argue that the TEI has as much stake in it as the scholarly community of users, and I’d like to propose that we consider what that path could look like. I am involved in the TAPAS project, which has already begun some work of its own in response to this set of needs, with special emphasis on the predicament of small TEI producers. But we are also very eager to see that work benefit the TEI as broadly as possible. So in the interest of understanding those broader benefits, I’d like to set TAPAS aside for the moment, for purposes of this discussion: let’s treat those plans as hypothetical and flexible and instead entertain the question of what the TEI and the TEI community might most look for in such a service, if we were designing it from scratch.

What would such a service look like? What could it usefully do, within the ecology I have sketched? This is a question I would like to genuinely pose for discussion here, but to give you something to respond to I am going to seed the discussion with some heretical proposals:

  • Gather the data in one place
  • Exercise our ingenuity in leveraging what the TEI vocabulary does give us in the way of a common language
  • Offer some incentives towards convergence in the form of attractive functionality
  • Provide some services that help with convergence (e.g. data massage)
  • Provide some automated tools that study divergence and bring it to the fore, for discussion: why did you do it this way? Could you do it that way?
  • But also permit the exercise of independence: provide your own data and your own stylesheets
  • Find ways to make the markup itself of interest: this is a corpus of TEI data, not (primarily) a corpus of letters, novels, diaries, etc.
  • Encourage everyone to deposit their TEI data (eventually!)
  • Provide curation (figure out how to fund it), top priority: this is a community resource
  • Provide methods for mining and studying this data (qua TEI, qua content)
  • Provide ways to make this data valuable for third parties: make it as open as possible
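As one possible shape for the “study divergence” proposal, here is a small Python sketch. It assumes each project’s TEI is available as a string; the project names, sample documents, and function names are all hypothetical. It profiles which elements each project uses and reports the elements that only some projects use, which is the kind of report that could prompt the “why did you do it this way?” conversation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Two hypothetical projects that record a letter's date differently.
PROJECT_A = f"""<TEI xmlns="{TEI_NS}"><text><body>
  <div><opener><dateline>1790</dateline></opener><p>Hello</p></div>
</body></text></TEI>"""

PROJECT_B = f"""<TEI xmlns="{TEI_NS}"><text><body>
  <div><head>1791</head><p>Hello</p><p>Again</p></div>
</body></text></TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_profile(xml_text):
    """Count how often each element name occurs in one document."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


def divergence(profiles):
    """Map each element used by some but not all projects to the projects using it."""
    all_projects = set(profiles)
    report = {}
    for name in sorted(set().union(*profiles.values())):
        users = sorted(p for p, prof in profiles.items() if name in prof)
        if set(users) != all_projects:
            report[name] = users
    return report


profiles = {
    "project_a": element_profile(PROJECT_A),
    "project_b": element_profile(PROJECT_B),
}
report = divergence(profiles)
```

Here `report` lists `dateline` and `opener` for the first project and `head` for the second, while elements common to both (such as `div` and `p`) are left out. A real service would need to compare far more than element names, but even this coarse view brings differences in practice to the fore.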

Discuss!


Big changes ahead

With a mixture of excitement and astonishment I find myself changing jobs after 20 years at Brown University. Starting July 1, I will be taking up a new position at Northeastern University as Professor of the Practice in the Department of English, and as Director of the Digital Scholarship Group in the library. As part of my faculty half, I will also be affiliated with the NULab for Texts, Maps, and Networks, and I will also continue as Director of the Women Writers Project.

My first impulse here is to offer thanks, because I feel extremely lucky but I can also see around me the efforts and generosity of other people who have brought me to this point. Aficionados of digital humanities job construction will recognize this new position as not only beautifully tailored but also an institutional achievement: a job that crosses colleges, disciplines, faculty/staff lines. I have only a glimpse of what it took to put it together and I am hugely grateful to those at Northeastern who worked to make it happen. And on Brown University’s side, there has been a long history of generous support for me and for the WWP going back to 1988 when the project was founded, and extending across the many university departments that have housed the project: the English Department, Computing and Information Services, and most recently the University Library. I have been very happy at Brown and could not have been more fortunate in my colleagues and in the professional opportunities I have found there.

So what is coming next? There are a few major new things on the horizon:

  • Starting the Digital Scholarship Group: this will be the big hit-the-ground-running agenda item for me; the DSG is an idea and a space and I’ll be working intensively with Patrick Yott on bringing it into existence. The WWP will have a home within the DSG (together with NEU’s other digital projects) and we will be building a support structure and research agenda that can
  • Teaching: NEU already has a significant graduate student body interested in digital humanities, and has plans to expand on this. My position carries a 2-course load and I’m really looking forward to developing courses and thinking about the overall digital humanities curriculum.
  • Working with digital humanities colleagues in the NULab: this deserves its own post so at this point I will just say that it’s a very exciting prospect…
  • Developing a strategic plan for the WWP that takes advantage of new circumstances: participation by NEU’s digital humanities graduate students, opportunities to contribute to research initiatives in the NULab, and above all long-term fiscal stability.

All of my current projects, grant commitments, and so forth will be maintained, one way or another, but the transition (especially for the WWP side of things) is going to take a lot of work so I anticipate being distracted and possibly needing to shift things around a bit over the next several months.

Proceeding in a hopeful and enthusiastic spirit!


A Matter of Scale

I recently had the honor and pleasure of giving a joint keynote presentation, “A Matter of Scale,” with Matt Jockers at the Boston-area Days of DH conference hosted by Northeastern University’s NULab. Matt has kindly put the text of our debate up on the University of Nebraska open-access repository and has also blogged about it.

This debate was great fun to prepare and also provided a fascinating perspective for me on the process of authoring. I do write a lot of single-authored things (e.g. conference papers, articles) where “my own” ideas and arguments are all I have to focus on, though I find those usually emerge by engaging with and commenting on other people’s work. I also write a lot of single-authored things where I’m actually serving as the proxy for a group (e.g. grant proposals). And I also increasingly find myself writing co-authored material—for instance, the white paper I’m currently working on with Fotis Jannidis that reports on the data modeling workshop we organized last spring, or the article I wrote with Jacqueline Wernimont on feminism and the Women Writers Project. In all of these situations I feel that I know the boundaries of my own ideas pretty well, even as I can feel them being influenced or put into dialogue with those of my collaborators.

However, writing this debate with Matt took a different turn. The presentation was framed as a debate from the start—so, in principle, each of us would be defending a specific position (big data for him, small data for me). We ascertained early on that we didn’t actually find that polarization very helpful, and we developed a narrative for the presentation that started by throwing it out, then facetiously embracing it, and finally exploring it in some detail. But we retained the framing device of the debate-as-conversational-exchange. However, rather than each writing our own dialogue, we both wrote both parts: Matt began with an initial sketch, which I then reworked, and he expanded, and I refined, and he amended, and so forth, until we were done. The result was that throughout the authoring process, we were putting words in each other’s mouths, and editing words and ideas of “our own” that had been written for us by someone else.

Despite agreeing on the misleadingness of the micro/macro polarity, I think Matt and I actually do have differing ideas about data and different approaches to using it—but what was striking to me during this process was that I found I had a hard time remembering what my own opinions were. The ideas and words Matt wrote for the debate-Julia character didn’t always feel fully familiar to me, but at the same time they didn’t feel alien either, and they were so fully embedded in the unfolding dialogue that they drew their character more from that logic than from my own brain, even as I reworked them from my own perspective.

I’m not sure what conclusions to draw, but it’s clear to me that there’s more to learn from collaborative authoring than just the virtues of compromise and the added value of multiple perspectives. I’m sure there’s an important literature on the subject and would be grateful for pointers. Working with Matt was a blast and I hope we have an opportunity to do this again.


On getting old

I realized at this year’s DH conference that I think of the conference as marking the “new year” in my digital humanities life—partly because it coincides roughly with the new fiscal year at my institution, and partly because I always come away with that mix of elation and resolution and mild hangover that’s often associated with early January. It also makes me aware of the passing of time. This year, with so many new young participants, it occurred to me that I’m roughly the same age now as my mentors were when I first started attending the conference. But where in 1994 I felt I had everything to learn from those who were older than I, now nearly 20 years later I feel I have everything to learn from those who are younger. My “generation” in DH (if I can permit myself such a gross and vague term for a moment) spent a lot of time and effort focusing on developing data standards and organizational infrastructure and big important projects and articulations of methods. We were and are terribly self-conscious about everything, having made in so many cases a professional transition that defamiliarized the very roots of what we had been trained to do, and that self-consciousness felt like power. I think I see in the “next generation” (with the same apology!) somewhat less of this self-consciousness and more of an adeptness at getting things done. When I see the projects and research work that were presented in Hamburg I feel a sense of awe, of stepping back as a train rushes by.

Looking at that train while it pauses in the station, I can see its parts and I can understand them—I know about the data standards, the infrastructure, the languages, the layers and modules, the way things work, and I know that in principle I could build such a thing. I know how to write the grant proposal for such a thing. But in the face of its sheer force and speed and power, I feel the way I imagine a Victorian stagecoach might have felt while waiting at a railroad crossing—I feel fragile and vulnerable and a little elderly. (And now we can all laugh together at how silly that is.)

OK, after singing Auld Lang Syne and sleeping in, we wake up a few days later with a renewed sense of vigor. My “new year’s resolutions” coming out of DH2012 are:

  1. Read more DH blogs!
  2. Read more DH blogs in languages other than English! I was delighted to be placed next to Hypotheses.org at the poster session and take this as a good sign. Also very excited about the possibility I heard discussed of a Spanish-language and French-language Day of DH.
  3. Write more!

Happy new year and more soon, I hope!