VII International Conference "Electronic Publishing El-Pub2002"

September 23-27, Novosibirsk,
(state registration number 0320300064)


Electronic Corpora of Old Texts for Research in Literature and Linguistics

Lavrentiev A.

Institute of Philology,
SB RAS (Russia)

Until recently, critical editions were the only widely available source for researchers of medieval texts. However, even the most accurate critical edition omits some of the information in the original manuscript and incorporates the editor's interpretations and corrections. Information technology can change this situation drastically: it makes it possible to publish digital photographs of a manuscript alongside a transcription of the text that supports search and analysis of both words and graphical elements of the manuscript. Preparing such a transcription requires solving a number of methodological and technical problems and adopting conventions for encoding the specific graphical elements of a medieval manuscript.

One of the first hypertext editions of an entire manuscript tradition was the "Charrette" Project (Princeton University). This edition, available on the Internet, includes color images of the eight extant manuscripts of Chretien de Troyes's Chevalier de la Charrette, their diplomatic transcriptions in SGML format, the critical edition, and a number of "tools" for searching and analyzing textual data, in particular databases of poetic figures and grammar.

At present these databases are built on the critical edition, but in the future they are planned to draw on the data from the diplomatic transcriptions. This will require updating the format of the transcriptions so that they can be processed automatically and presented in a user-friendly form. More precisely, the update should consist of converting the transcriptions to XML, adding some new elements, and preparing several stylesheets and scripts for different visualization and data-extraction options.
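As a rough illustration of what one such visualization script might look like, the sketch below renders a single XML-encoded line in two ways: a diplomatic view that keeps the scribal abbreviation and an expanded view that resolves it. The element names (<l>, <choice>, <abbr>, <expan>) follow current TEI conventions and are an assumption here, since the project's actual markup is not described in this abstract; the sample line is invented for illustration and is not a reading from any Charrette manuscript.

```python
import xml.etree.ElementTree as ET

# Invented sample line (not a real manuscript reading). Element names are
# assumed to follow TEI conventions: <choice> pairs an abbreviated form
# (<abbr>) with its expansion (<expan>).
SAMPLE = (
    '<l n="1">li <choice><abbr>chrs</abbr>'
    '<expan>chevaliers</expan></choice> vint</l>'
)

def render(elem, view):
    """Return the text of elem, picking <abbr> or <expan> inside <choice>."""
    wanted = "abbr" if view == "diplomatic" else "expan"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            picked = child.find(wanted)
            if picked is not None:
                parts.append(render(picked, view))
        else:
            parts.append(render(child, view))
        # Text that follows the child element inside the parent.
        parts.append(child.tail or "")
    return "".join(parts)

line = ET.fromstring(SAMPLE)
diplomatic = render(line, "diplomatic")
expanded = render(line, "expanded")
print(diplomatic)
print(expanded)
```

The same source file yields both views, which is the practical benefit of keeping a single richly encoded transcription rather than separate diplomatic and normalized texts.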

The principles for developing electronic corpora of old texts elaborated in the Charrette Project can be applied to any electronic edition of old texts of historical value, e.g. Old Russian texts. The basic principles of such editions should be the following:

1. The edition must include high resolution color images of the manuscript.

2. The transcription should be as close as possible to the original (including all non-standard graphic features and apparent errors).

3. There should be a system for encoding virtually illegible or ambiguous fragments of the original.

4. The transcription must conform to the internationally adopted standard of text encoding (TEI).

5. The transcription should be compatible with modern software for visualizing and processing electronic documents.

All these statements will be illustrated with concrete examples in the text of the paper.
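Principles 3 and 4 can be sketched with a small, hypothetical example: a fragment in which doubtful readings, illegible stretches and apparent scribal errors are marked up, followed by a script that lists them. The elements <unclear>, <gap> and <sic> and the reason attribute exist in the current TEI Guidelines, but whether the edition described here uses exactly these is an assumption, and the sample text is invented.

```python
import xml.etree.ElementTree as ET

# Invented fragment. Assumed TEI-style markup:
#   <unclear> : a reading the transcriber is not sure of
#   <gap>     : text that cannot be read at all
#   <sic>     : an apparent error kept exactly as written
SAMPLE = (
    '<ab>se <unclear>ge</unclear> en '
    '<gap reason="illegible"/> '
    '<sic>cheualier</sic></ab>'
)

def uncertain_spots(root):
    """List (kind, detail) pairs for every doubtful place, in document order."""
    report = []
    for elem in root.iter():
        if elem.tag in ("unclear", "sic"):
            # Detail is the text as it stands in the manuscript.
            report.append((elem.tag, "".join(elem.itertext())))
        elif elem.tag == "gap":
            # Detail is the stated reason for the omission, if any.
            report.append(("gap", elem.get("reason", "")))
    return report

spots = uncertain_spots(ET.fromstring(SAMPLE))
for kind, detail in spots:
    print(kind, detail)
```

Because the uncertainty is recorded in the markup rather than in footnotes, a reader can query it directly and compare each flagged place against the manuscript image.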

At present, preparatory work for an electronic edition of some Siberian Russian chronicles and folklore texts is being carried out at the Institute of Philology, SB RAS.

Internet references

1. Charrette project:

2. TEI Specification:

Full Text in Russian: HTML
Note. Abstracts are published as submitted by the authors.


© 2002, Siberian Branch of Russian Academy of Science, Novosibirsk
© 2002, UII SB RAS, Novosibirsk