MiMoText is a research project in computational literary studies dealing with new ways to model and analyse literary history and literary historiography. It is based on the idea of extracting statements relevant to literary history from bibliographies, scholarly publications and primary sources, in order to build a shared knowledge network for literary history. We employ methods from information extraction and text mining to obtain large numbers of statements about authors and literary works from our data. Moreover, we use the Linked Open Data paradigm to model, represent and query the information we obtain. We believe our project is a step towards a mode of digital humanities that goes not only beyond small, deeply encoded datasets and their close reading, but also beyond Big Data approaches that cannot always be easily adapted to the humanities. Instead, we propose a third way for digital humanities that develops quantitative methods to create and analyse datasets relevant to research in the humanities that are both larger and smarter than has been customary up until recently.
A first stream of research in the digital humanities has been all about smart data: small, carefully curated, deeply encoded and/or annotated datasets and their nuanced, qualitative analysis. Scholarly digital editions, especially historical-critical or genetic editions, can be seen as the prototype of this kind of research. A second stream of digital humanities has strived to adopt and adapt methods initially developed to analyse Big Data in computer science, such as topic modelling, word embedding models or deep learning / deep neural networks. The requirements of these methods, which are not naturally compatible with the small, smart datasets typical of the (digital) humanities, have created pressure to produce big humanities data suitable for training dedicated models adapted to specific domains in the humanities.1 However, we can observe the emergence of a third stream of digital humanities, where datasets are being created that are both relatively large and relatively ‘smart’ using carefully optimized quantitative methods. Such datasets are then being analysed using ‘smart’ methods that, for example, take into account detailed metadata and structural as well as semantic annotations.2
The project MiMoText aims to contribute to this third stream of digital humanities, specifically in the domain of literary history and historiography.3 It does so in the areas of data curation, information extraction and information modelling (see Fig. 1). Three main sources of information are exploited: metadata from bibliographic reference systems (such as printed bibliographies or library catalogues), statements from scholarly publications on the history of literature (such as literary histories and scholarly articles), and semantic and formal characteristics of primary sources (such as novels). The project aims to merge information that has been extracted from these three sources and is relevant to literary historiography into a deeply connected knowledge network that follows the principles of Linked Open Data (LOD), in a way similar to Wikidata. The goal is to build a smart network of information that revolutionizes the way we learn about, think about and perform research into literary history. It will accomplish this because it closely links primary sources, scholarly literature and bibliographic data, allowing for connections, inferences and comparisons between these sources, and because it provides information not just on famous, canonized works but also on ‘the great unread’, that is, the extensive, forgotten literary production lying beneath the relatively restricted canon.4 Including such forgotten works does entail additional challenges for the project, for example because these works are not readily available in digital form or because they are less frequently discussed in the research literature, but we believe this effort is worthwhile because it avoids replicating earlier biases. The initial domain of application of the project is the French novel of the second half of the eighteenth century, but extensions into German literary history as well as the history of philosophy are planned.
The first type of data that has been analysed in the project is bibliographic data. In our case, this meant performing the scanning, full-text digitization, semi-automatic encoding and modelling of the printed Bibliographie du genre romanesque français 1751–1800.5 This bibliography documents every novel known to have been published in French during the second half of the eighteenth century, including translations and reprints. It is an unusually rich and thorough documentation of the overall production of novels during that period. The entries are organized by year and in many cases include, beyond detailed bibliographic information, additional information on several key literary elements of the novels: narrative perspective; narrative setting; protagonists; key plot elements; and themes and/or style.
Based on the HTML output of the OCR process performed with ABBYY FineReader, a small part of the data (one year's entries for every decade) has been manually annotated, distinguishing all the different items typically included in the bibliographic entry for a novel, such as author, title, publisher and publication date. A machine learning classifier (conditional random fields) was then trained on this annotated data.6 Once suitable parameters had been identified, the classifier was used to assign to each item in the bibliography a label indicating the type of bibliographic information. Depending on the type of item to be recognized, the F1-score was typically between 0.951 and 0.997 for this task, with one outlier at 0.908, which we consider excellent performance.7 In the next step, all the information was modelled using several relevant vocabularies and ontologies, notably from the Semantic Publishing and Referencing (SPAR) family of ontologies.8 Fitting the bibliographic data into these categories posed some difficulties due to OCR errors and irregularities within the data, but after a thorough process of correction, the complete bibliography is now available in a highly structured format suitable for import into our LOD information network.9 Due to the automated steps in our pipeline, some OCR errors and some faulty classifications persist. We argue that this is typical of the kind of larger and smarter, but not necessarily impeccably clean, datasets that are characteristic of the ‘third stream’ of the digital humanities.
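To illustrate this step, here is a minimal sketch of how such a sequence labeller can be trained with the sklearn-crfsuite library mentioned in note 6. The feature set, the hyperparameters and the toy entry are illustrative assumptions, not the project's actual configuration.

```python
import sklearn_crfsuite

def features(tokens, i):
    # Per-token features; an illustrative choice, not the project's feature set.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        "is_digit": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# A toy annotated entry; the real training data consists of manually
# annotated bibliography entries (one year's entries per decade).
entry = ["VOLTAIRE", ",", "Candide", "ou", "l'optimisme", ".", "1759", "."]
labels = ["author", "O", "title", "title", "title", "O", "date", "O"]

X_train = [[features(entry, i) for i in range(len(entry))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # one bibliographic-field label per token
```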
For the MiMoText project, this bibliographic data is essential, because it defines the population of authors and novels that we focus on in all subsequent steps of the data enrichment and analysis. Any additional information we can derive from mining scholarly publications and primary sources is connected to these initial entries in the knowledge base.
The second kind of data mined in the MiMoText project is scholarly publications, more precisely works of literary historiography concerned with French eighteenth-century literature. In the first phase, we have focused on such works published in German, as we are particularly interested in the reception history of French literature in Germany as it is reflected in literary histories read by students and lay readers alike.10
The overall goal of analysing this material is to automatically identify statements about authors and their works in these scholarly publications. These statements can then be collected in order to connect them to the authors and works already present in the dataset built from the bibliography, thereby enriching our knowledge network. One particularity of these publications is that they are multilingual, in the sense that, although the main text is in German, it contains numerous named entities in French (such as authors, titles, place names, institutions and publishers) as well as citations of varying length in French and sometimes in other languages (taken from relevant scholarly publications or from novels that are being discussed). This creates particular challenges for the information extraction steps.
The first step towards this goal is named entity recognition (NER), which serves to identify the sentences in our literary histories that refer to authors and works included in the bibliographic data. This leads us to a second particularity of the task: we do not need to focus much on ‘organizations’, an entity type traditionally included in NER. However, in addition to people and locations, we are interested in identifying work titles (in particular novels, but also plays or essays) as entities, as these are an essential part of our information network.
The named entity recognition task for persons (PER) and locations (LOC) was implemented using spaCy (https://spacy.io, version 3.2). Regular NER frameworks are not trained to recognize work titles (TITLE), and relying on the ‘miscellaneous’ (MISC) category that spaCy provides failed to produce promising results. Training a custom NER classifier was not an option due to the lack of sufficient training data. For these reasons, an alternative method had to be found. The identification of works is now based on the available bibliographic data and a flexible string-matching process, including automatically generated short forms of titles as they are likely to be used in the scholarly literature. The results from this method are promising, though not yet sufficiently reliable to be used in further processing steps without manual corrections.11
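The following sketch combines both strategies: spaCy's pretrained NER for persons and locations, and a flexible string-matching step for work titles based on the standard library's difflib. The model name, the similarity threshold and the short-form dictionary format are assumptions on our part; the actual implementation is documented in the repository cited in note 11.

```python
import difflib
import spacy

# German pipeline for the German-language scholarly texts; the exact model
# used in the project is an assumption here.
nlp = spacy.load("de_core_news_lg")

def extract_entities(text):
    """Return persons and locations found by spaCy's pretrained NER."""
    doc = nlp(text)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PER"]
    locations = [ent.text for ent in doc.ents if ent.label_ == "LOC"]
    return persons, locations

def match_title(span, known_titles, threshold=0.85):
    """Match a candidate span against full titles from the bibliography and
    automatically generated short forms (e.g. 'Candide' for
    'Candide ou l'optimisme'). The threshold is an illustrative value."""
    best, best_score = None, 0.0
    for full_title, short_forms in known_titles.items():
        for form in [full_title] + short_forms:
            score = difflib.SequenceMatcher(None, span.lower(), form.lower()).ratio()
            if score > best_score:
                best, best_score = full_title, score
    return best if best_score >= threshold else None
```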
Based on the identification of named entities such as authors and works, we then plan to extract several specific types of statements: statements about authors, such as ‘author A is characterized as [adjective]’; statements about works, such as ‘work X is about [keyword]’; and statements about relations between authors and works, such as ‘work Y has been influenced by work Z’, among others (see below for details). This statement extraction task is currently being prepared through extensive manual annotation. By combining the texts annotated for named entities with the manually annotated training data, we can train a machine learning algorithm to extract statements automatically from unannotated data.
The third kind of data mined in the MiMoText project is primary sources, that is, literary texts. In the current phase of our project, this means French novels from the period 1750–1800.12
A corpus of 115 French novels first published between 1750 and 1800 has been created. In part, the digital full texts are derived from existing digital versions, obtained from platforms such as Wikisource. To a considerable extent, however, additional full-text digitization based on digital facsimiles available from the French National Library (BnF) was necessary. To this end, a selection of 30 volumes was digitized using double-keying in order to obtain training data for a dedicated OCR model.13 Thanks to this targeted full-text digitization, our corpus contains not only novels by key authors of the French Enlightenment, such as Jean-Jacques Rousseau or Denis Diderot, but also a wide range of lesser-known authors. The corpus therefore already represents the variety of styles, themes and plots of the French Enlightenment novel rather well. All texts have been encoded following the Guidelines of the Text Encoding Initiative,14 and a script to extract the texts as modernized plain text is provided.15 More novels will be added over the course of the project to increase coverage and variety.
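By way of illustration, a plain-text extraction along the lines of the script mentioned above might look as follows, using lxml; the file path is a placeholder, and the orthographic modernization that the project's actual script performs (see note 15) is omitted here.

```python
from lxml import etree

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_plain_text(tei_path):
    """Return the body text of a TEI-encoded novel as plain text.
    Orthographic modernization is omitted in this sketch."""
    tree = etree.parse(tei_path)
    paragraphs = tree.xpath("//tei:text//tei:p", namespaces=TEI)
    return "\n".join("".join(p.itertext()) for p in paragraphs)

# Usage (the path is a placeholder, not an actual repository path):
# print(extract_plain_text("roman18/Voltaire_Candide.xml")[:500])
```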
A first type of analysis has been conducted using this iteration of the corpus, namely topic modelling. Topic modelling is a method developed in computer science for the detection of thematic structure in large collections of texts, especially newspaper texts or scholarly articles.16 If properly adapted to the literary domain, this method can be used to great effect in computational literary studies. In the MiMoText project, this adaptation has been accomplished through several preprocessing steps. First of all, the orthography of the texts has been modernized automatically when extracting the plain text from the XML-TEI files. Second, this modernized text has been split into segments of 1,000 tokens each, in order to restrict the context of co-occurrence, which, given the considerable length of novels, would otherwise have been too large. Finally, the resulting text segments have been annotated linguistically so that, for each word form, information about its lemma and part of speech is also available. Based on this information, all tokens corresponding to function words have been filtered out and all remaining word forms replaced with their lemmas.
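A minimal sketch of this preprocessing pipeline follows, under the assumption that a spaCy French model is used for the linguistic annotation (the text does not name the tool actually employed) and that the listed parts of speech are what count as content words.

```python
import spacy

# French pipeline; the exact tagger/lemmatizer used in the project
# is an assumption here.
nlp = spacy.load("fr_core_news_md")

SEGMENT_LENGTH = 1000  # segment size as stated in the text
# Assumption: which parts of speech count as content words.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}

def segments_for_topic_modelling(modernized_text):
    """Split a modernized novel into 1,000-token segments, then keep only
    the lemmas of content words, filtering out function words."""
    doc = nlp(modernized_text)
    tokens = [tok for tok in doc if not tok.is_space and not tok.is_punct]
    for start in range(0, len(tokens), SEGMENT_LENGTH):
        segment = tokens[start:start + SEGMENT_LENGTH]
        yield [tok.lemma_.lower() for tok in segment if tok.pos_ in CONTENT_POS]
```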
Topic modelling was performed using the MALLET framework, which implements latent Dirichlet allocation.17 The resulting topics are available as word lists, like the ‘travel’ and ‘philosophy’ topics shown here (Figs 2a and 2b). In order to feed them into the information network, but also to make them tangible and comparable, it is useful to assign labels to them using a controlled vocabulary. We experimented with word embedding models for topic labelling but finally decided to assign the labels manually. The controlled vocabulary draws on various sources, notably the Dictionnaire européen des Lumières, a highly relevant resource for this task.18 The titles of its individual entries offer broad coverage of the topics dealt with in the literature of the French Enlightenment. Nevertheless, some of these titles are either too specific or too generic to serve as topic labels and were therefore not included in the vocabulary. Remaining semantic gaps in the vocabulary were filled with additional terms. The source of each element of the topic vocabulary is referenced in our knowledge graph, and a Wikidata identifier is provided for each entry.19 This gives us a sufficiently broad vocabulary for modelling the thematic information derived from all three textual resources, which ensures good comparability.
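The MALLET run itself can be scripted via its command-line interface, for example from Python as sketched below; the paths and the number of topics are illustrative assumptions, not the project's actual settings.

```python
import subprocess

# Import the preprocessed segments (one file per segment) into MALLET's format.
subprocess.run([
    "mallet", "import-dir",
    "--input", "segments/",            # directory of preprocessed segments (placeholder)
    "--output", "segments.mallet",
    "--keep-sequence",                 # required for topic modelling
], check=True)

# Train an LDA model; the number of topics is an illustrative choice.
subprocess.run([
    "mallet", "train-topics",
    "--input", "segments.mallet",
    "--num-topics", "30",
    "--optimize-interval", "10",                 # hyperparameter optimization
    "--output-topic-keys", "topic_keys.txt",     # top words per topic ('travel', 'philosophy', ...)
    "--output-doc-topics", "doc_topics.txt",     # topic weights per segment
], check=True)
```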
The resulting topic model has been successful in identifying a number of topics that can be related to key concepts in the history of ideas and in the literary history of the French Enlightenment. In the information network, each novel can be linked to the topics most strongly associated with it. This not only yields additional information about novels often forgotten in recent literary histories, but also allows new insights into the thematic relations between novels, as many novels with shared topics emerge from this analysis.
Beyond topics, we are enriching our raw novel data with ‘smart’, semantic information such as named entities (persons and locations), sentiment scores or intertextual references. The goal is not only multilingual information mining and semantic enrichment, but also meaningful interlinking with existing LOD resources and authority data, such as Wikidata identifiers for persons, literary themes or locations.20
Ultimately, the aim of the MiMoText project is to bring together the three kinds of information described so far, extracted from three different types of publications, in a joint knowledge graph or information network that is based on LOD statements and can be browsed online and queried, for example, via SPARQL.21 Each statement in such a system is structured as a triple of subject, predicate and object. In view of the heterogeneity of our data sources and data structures, modelling the information network poses some special challenges. Broadly, three levels of modelling can be distinguished: a conceptual level in the wider sense, a formal level, and the level of technical implementation. In the following, we deliberately choose a rather simplified conceptual representation: intuitively readable statements in which highlighted predicates connect directly interpretable labels for subjects and objects.
From the bibliographic data, and taking Voltaire's novel Candide as our example,22 we obtain statements like the following:
Voltaire IS_AUTHOR_OF Candide
Candide HAS_PUBLICATION_DATE 1759
Candide HAS_NARRATIVE_LOCATION Europe, America
Candide IS_ABOUT philosophy
Candide HAS_NARRATIVE_PERSPECTIVE heterodiegetic
Candide HAS_REPRINT_COUNT high
Candide HAS_LEGAL_STATUS censored
Candide HAS_LITERARY_GENRE satire
From the topic modelling of the primary sources, we additionally obtain thematic statements such as the following:
Candide IS_ABOUT monarchy
Candide IS_ABOUT philosophy
Candide IS_ABOUT travel
On the formal level, as implemented in our Wikibase instance, such statements are expressed using identifiers for items and properties, together with language-tagged labels and references, for example:
Q1597 [=Candide] ABOUT Q1592 [=‘philosophy’]
Q1597 SHORT_NAME ‘Candide’ @fr
Q1597 TITLE Candide ou l'optimisme @fr
Q1592 LABEL ‘philosophy’ @en
Q1601 [=topic11] RELATED TO Q1592
Q1601 PART OF Q1598 [=Topic Model 11-2020]23
Q1598 REFERENCE URL https://github.com/MiMoText/mmt_2020-11-19_11-38
{Q1597 ABOUT Q1592} STATED IN Q1598
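On the level of technical implementation, such statements become RDF triples. The following sketch uses the rdflib library with placeholder namespaces (the project's actual base URIs are not given here); plain RDF reification stands in for the Wikibase statement-qualifier mechanism behind ‘STATED IN’.

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

# Placeholder namespaces; the project's published base URIs may differ.
ITEM = Namespace("https://example.org/mimotext/item/")
PROP = Namespace("https://example.org/mimotext/prop/")

g = Graph()
g.bind("item", ITEM)
g.bind("prop", PROP)

# Q1597 [=Candide] ABOUT Q1592 [='philosophy'], plus labels
g.add((ITEM.Q1597, PROP.about, ITEM.Q1592))
g.add((ITEM.Q1597, PROP.shortName, Literal("Candide", lang="fr")))
g.add((ITEM.Q1592, PROP.label, Literal("philosophy", lang="en")))

# {Q1597 ABOUT Q1592} STATED IN Q1598, simplified via RDF reification
stmt = BNode()
g.add((stmt, RDF.subject, ITEM.Q1597))
g.add((stmt, RDF.predicate, PROP.about))
g.add((stmt, RDF.object, ITEM.Q1592))
g.add((stmt, PROP.statedIn, ITEM.Q1598))

print(g.serialize(format="turtle"))
```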
There is a considerable amount of research regarding the theory of literary historiography, not least regarding the consequences of post-structuralist thought for the construction of historical periods, canons of literary texts, and coherent narratives of literary evolution.25 But these relatively high-level discussions rarely descend to the practical question of what the fundamental, minimal statements of literary historiography actually are, and the answer is anything but self-evident. This means, first, that such an inventory needs to be induced from the literary histories themselves, which is not a process that can be concluded once and for all, but a continual process accompanying our analyses of these materials.26 Second, we may well assume, provisionally, that information about genre, theme, style and narrative perspective constitutes a fundamental (if not sufficient) part of what literary historiography is likely to deem relevant when characterizing literary works. Even under this hypothesis, however, we still need to define the correct range of concepts for each of these predicates and find a granularity that strikes a balance between the nuances we want to express and the need to enable efficient retrieval.
To give just one example: even for such a seemingly simple category as narrative perspective, complications arise quickly. We could assume that there are only a few fundamental narrative perspectives, namely heterodiegetic, homodiegetic and autodiegetic narration, following Genette.27 However, we then still need to adapt this vocabulary to the French eighteenth century, where epistolary novels are prevalent and dialogue novels not infrequent. Finally, in many of the novels we find various combinations of these features. If we simplify too much, we end up with overly broad categories; if we allow for too much nuance, we relinquish any possibility of obtaining a useful number of results for a search query that filters on narrative perspective.
With respect to the formal modelling step, we follow the Wikidata/Wikibase data model but adapt it to our needs.28 In the current project phase, we use a project-specific Wikibase instance; we consider it an advantage that this guarantees a certain stability for our defined domain while preserving the flexibility to reconceptualize properties (some of which are still under discussion in the Wikidata community, for example) in dialogue with the community of literary scholars. On the level of many items, however – that is, for concrete work and author instances such as ‘Candide’ and ‘Voltaire’ – a more direct reuse of, or mapping to, existing Wikidata items is possible. Generally speaking, our approach to these issues has been to take small steps in the right direction and to remain open to revising our inventory of subjects, predicates and objects where necessary.
The advantages of solving these issues, and of merging all of this information into a knowledge base in the LOD paradigm, are substantial.29 Such an information resource enables at least three scenarios that we believe provide real added value to researchers. First of all, instead of searching for this kind of information across several separate sources, researchers find all of it in one shared information source. Second, it becomes possible to make inferences across all of the statements. In this manner, we can, for instance, infer that Voltaire has written a satire, or that he has written about philosophy, even when such statements are not explicitly encoded in the data. Finally, we can compare and contrast statements concerning the same entities from different sources (see Fig. 4). For example, one can check whether the results from topic modelling of the novels confirm, add to or contradict the themes identified by the bibliographers. The Bibliographie du genre romanesque français mentions ‘thèmes philosophiques’ as a theme of Voltaire's novel Candide (which can be mapped to the keyword ‘philosophy’). The topic modelling analysis has likewise identified a topic in Candide that was best labelled with the keyword ‘philosophy’, so that both sources are in agreement in this case. But additional key topics have been identified, such as ‘travel’ or ‘monarchy’, that were not among the themes mentioned in the bibliography, thus enhancing the information available. In this sense, the different statements about works and their authors can contain complementary but also contradictory information; they are linked, aggregated and referenced in our Wikibase instance (see Fig. 5).
A researcher interested in a certain literary theme, looking specifically for women writers of the period or interested only in epistolary novels could use the information network to query works and authors matching these interests and obtain results that go beyond the canon. The SPARQL endpoint also makes it possible to formulate more complex queries: for example, one could combine a certain theme with the publication date and find out whether novels with the theme ‘travel’ are more or less frequent before and after the French Revolution.
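As an illustration, such a query could be run against the endpoint from Python with the SPARQLWrapper library. The endpoint URL, the prefixes, the property names and the item identifier below are all placeholders, not the project's published identifiers.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/mimotext/sparql")  # placeholder URL
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX prop: <https://example.org/mimotext/prop/>
PREFIX item: <https://example.org/mimotext/item/>

# Count novels about 'travel' before and after the French Revolution.
SELECT ?period (COUNT(DISTINCT ?novel) AS ?n) WHERE {
  ?novel prop:about item:Q1591 ;        # theme 'travel' (placeholder item ID)
         prop:publicationDate ?date .
  BIND(IF(YEAR(?date) < 1789, "before 1789", "1789 or later") AS ?period)
}
GROUP BY ?period
""")
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["period"]["value"], row["n"]["value"])
```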
In each of the four domains that we have been working on in the context of the MiMoText project, considerable progress has already been made. The bibliography has been converted to Linked Open Data; a named entity recognition pipeline specifically adapted to our materials has been created for the scholarly literature; a unique corpus of novels has been created and analysed using topic modelling; and substantial steps have been taken towards a conceptual model for representing and merging all of this information in a shared knowledge base. However, we are still only at the beginning of our endeavour, which has considerable depth and complexity. In addition to the further work already mentioned above, our next goal is to publish a prototype of the knowledge base that will contain, beyond the data from the bibliography, information about the themes assigned to each novel. Such a prototype will enable us and others to gain experience with the data obtained so far and to orient further development.
The research described here has been funded by the German federal state of Rhineland-Palatinate in the programme ‘Forschungsinitiative Rheinland-Pfalz 2019–2023’.
Christof Schöch has developed the structure of this paper. Anne Klee, Julia Röttgermann and Katharina Dietz have performed data analysis. Maria Hinzmann has performed the data modelling. All authors have contributed to the writing of the paper.
Christof Schöch https://orcid.org/0000-0002-4557-2753
Maria Hinzmann https://orcid.org/0000-0001-7199-1436
Julia Röttgermann https://orcid.org/0000-0002-1918-8117
Katharina Dietz https://orcid.org/0000-0001-7405-3656
Anne Klee https://orcid.org/0000-0002-1532-2649
1 C. L. Borgman, Big Data, little data, no data: scholarship in the networked world (Cambridge, MA, 2015).
2 C. Schöch, ‘Big? Smart? Clean? Messy? Data in the humanities’, Journal of Digital Humanities, 2 (2013), 2–13, http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/, last accessed 4 Nov 2021.
3 For details on MiMoText, see the project website at: https://mimotext.uni-trier.de, last accessed 4 Nov 2021. The project is conducted at the Trier Center for Digital Humanities (TCDH) at Trier University, founded in 1998.
4 Margaret Cohen, ‘Narratology in the archive of literature’, Representations, 108 (2009), 51–75, https://doi.org/10.1525/rep.2009.108.1.51, last accessed 4 Nov 2021.
5 A. Martin, V. Mylne and R. L. Frautschi, Bibliographie du genre romanesque français, 1751–1800 (London, 1977).
6 J. Lafferty, A. McCallum and F. C. N. Pereira, ‘Conditional random fields: probabilistic models for segmenting and labeling sequence data’, Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001) (San Francisco, 2001), 282–9, http://portal.acm.org/citation.cfm?id=655813, last accessed 4 Nov 2021. The sklearn-crfsuite library for Python was used for training and evaluation.
7 For further details on data, methods and performance, see A. Lüschow, ‘Automatische Extraktion und semantische Modellierung der Einträge einer Bibliographie französischsprachiger Romane’, Spielräume: Digital Humanities zwischen Modellierung und Interpretation. Konferenzabstracts (Paderborn, 2020), 80–4. https://doi.org/10.5281/zenodo.4621703, last accessed 4 Nov 2021.
8 S. Peroni and D. Shotton, ‘The SPAR ontologies’, The Semantic Web – ISWC 2018 (Cham, 2018), 119–36, https://doi.org/10.1007/978-3-030-00668-6_8, last accessed 4 Nov 2021.
9 The dataset is available online: see A. Lüschow, Bibliographie du genre romanesque français, 1751–1800: RDF Model [Data Set] (Trier, 2019), http://doi.org/10.5281/zenodo.3401428, last accessed 4 Nov 2021.
10 This task has been addressed by team member Katharina Dietz.
11 See our repository for details: https://github.com/MiMoText/NER_Pubs.
12 This task has been addressed by team members Anne Klee and Julia Röttgermann.
13 The OCR environment we used is called OCR4all; see C. Reul et al., ‘OCR4all: an open-source tool providing a (semi-)automatic OCR workflow for historical printings’, arXiv:1909.04032 [cs], https://arxiv.org/abs/1909.04032, last accessed 4 Nov 2021.
14 See L. Burnard, What is the text encoding initiative? How to add intelligent markup to digital resources (Marseille, 2014), http://books.openedition.org/oep/426, last accessed 4 Nov 2021. More precisely, the texts are valid against the TEI schema developed for the European Literary Text Collection (ELTeC) in the framework of the COST Action Distant Reading for European Literary History; see https://distant-reading.net/eltec.
15 See https://github.com/MiMoText/roman18/, https://doi.org/10.5281/zenodo.4061904, both last accessed 4 Nov 2021.
16 See D. M. Blei, ‘Probabilistic topic models’, Communications of the ACM, 55 (2012), 77–84, https://doi.org/10.1145/2133806.2133826, last accessed 4 Nov 2021.
17 A. Kachites McCallum, ‘MALLET: a machine learning for language toolkit’, 2002. http://mallet.cs.umass.edu, last accessed 4 Nov 2021.
18 M. Delon et al., Dictionnaire européen des Lumières (Paris, 2007).
19 See https://github.com/MiMoText/vocabularies, last accessed 4 Nov 2021.
20 The Wikidata knowledge base does have stable identifiers, is multilingual and stores curated data on a wide range of areas of human knowledge. Nevertheless, the domain we are interested in (French literature of the eighteenth century) is not covered there to a satisfying degree. We are using the Wikidata hub as a basis for the multilingual labelling of our controlled vocabulary which we map to other authority data as well.
21 See, for example, L. Ehrlinger and W. Wöß, ‘Towards a definition of knowledge graphs’ (SEMANTiCS (Posters, Demos, SuCCESS), Leipzig, 2016), http://ceur-ws.org/Vol-1695/paper4.pdf, last accessed 4 Nov 2021.
22 Voltaire is one of the key authors of the French Enlightenment and published his novel Candide in 1759. The story of a journey of its eponymous hero through Europe and America can be understood as a critical confrontation with certain philosophical currents such as Leibnizian optimism.
23 The file name indicates the release date of the discussed topic modelling results. As our corpus grows, further topic modelling releases are planned.
24 Each element of a controlled vocabulary represents an item in our Wikibase instance and is linked as part of a specific vocabulary (such as the vocabulary of thematic concepts) to a GitHub repository, where the development of the vocabularies is documented (including further references).
25 See C. Uhlig, ‘Current models and theories of literary historiography’, Arcadia, 22 (1987), 1–17, https://search.proquest.com/docview/1297886980/citation/110CBD02A5B64AC4PQ/1, last accessed 4 Nov 2021; and David Perkins, Is literary history possible? (Baltimore, 1993).
26 Generally, Willard McCarty conceives of data modelling in the digital humanities as a continuous, iterative process: W. McCarty, Humanities computing, paperback edition (Basingstoke, Hampshire, 2014).
27 G. Genette, Narrative discourse: an essay in method (Oxford, 1979).
28 Our decision to use Wikibase as a knowledge base software, which is not least a decision for the intended linking of our data with Wikidata in the future, has various effects on questions of data modelling. In this context, we consider the existence of RDF serialization, among other things, as an important dimension for questions of interoperability. See Wikimedia Foundation, ‘Wikibase/DataModel’, https://www.mediawiki.org/wiki/Wikibase/DataModel, last accessed 4 Nov 2021; and Wikimedia Foundation, ‘Wikibase/Indexing/RDF Dump Format’, https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format, last accessed 4 Nov 2021.
29 See, for example, J. Schelstraete and M. Van Remoortel, ‘Towards a sustainable and collaborative data model for periodical studies’, Media History, 25 (2019), 336–54, https://doi.org/10.1080/13688804.2018.1481374, last accessed 4 Nov 2021.