Towards a Linguistic Fingerprint Method for Latin Literature

John W. Thomas

Xavier University

Linguistic studies have progressed substantially with the availability and analysis of digital corpora.  In English language studies, researchers are examining how the Internet facilitates rapid changes in certain usages and structures.  In Latin linguistics the effects so far have been much more like the ripples from a small stone dropped in a pond.  Faster personal computers, larger hard drives, and extensive online databases that are fully searchable and extractable, such as The Latin Library, should allow these ripples to grow.

A comparison, based on these digital corpora, of the patterns of word usage within the works of authors such as Tacitus, Vergil, Sallust, and Cicero can be used to generate a mathematical usage chart, equivalent to a kind of "fingerprint" of the author's word selection.  Linguistic fingerprints have been used to show evidence of the influence of one author upon another (as in the case of Empedocles' influence upon Euripides: Sedley, LICS 2.4 (2003), there done without computers and based on a comparatively small amount of data).  This paper will present a comparison of usage patterns across a series of major Latin authors in an attempt to find whether some words are used in similar ways by large groups of authors, and what information can be gleaned from their general usage patterns.

A first step is to select a series of common inflected words whose forms can be counted.  The examination considers the relative frequency of specific forms within an inflected word.  Whether an author uses sum 50 times or 5000 times, some forms will be more common than others, and the relationship among these forms constitutes a pattern.  Besides sum, two of the most common words in Latin are ipse and quis, quid // qui, quae, quod.  Each has a certain number of possible unique forms (as opposed to grammatical uses).  Tacitus, for example, in Annals 1 uses quis, quid // qui, quae, quod 287 times in its various forms: qui (which can be masculine nominative singular or plural) 43 times and quo 27 times, but quarum only twice.  The relative frequency within Book 1 is then 14.98% for qui (43 of 287), 9.41% for quo, and 0.70% for quarum.  A scatter chart of these relative frequencies resembles a kind of voiceprint or fingerprint.  When the prints of different authors are overlaid, a large-scale comparison can be grasped.  Slides will briefly illustrate the comparison of some of these, such as the table below.
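The arithmetic behind these percentages can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for the study: the function name relative_frequencies and the variable names are assumptions, and the only data are the three counts and the total quoted above for Annals 1.

```python
def relative_frequencies(form_counts, total):
    """Return each form's share of the word's total occurrences, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return {form: round(100 * n / total, 2) for form, n in form_counts.items()}


# Counts quoted in the text for Tacitus, Annals 1 (only three forms are given).
annals_1_counts = {"qui": 43, "quo": 27, "quarum": 2}
annals_1_total = 287  # all forms of quis, quid // qui, quae, quod combined

fingerprint = relative_frequencies(annals_1_counts, annals_1_total)

# List the forms from most to least frequent.
for form, pct in sorted(fingerprint.items(), key=lambda item: -item[1]):
    print(f"{form:<8}{pct:6.2f}%")
```

Because each count is divided by the total for that word alone, the resulting figures are independent of the length of the text, which is what allows the prints of different books or authors to be overlaid.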

It turns out that there is remarkably little variation in the way that many authors use quis, quid // qui, quae, quod, yet what differences do exist may be diagnostic.  The analysis of digital corpora can also take a much wider focus and examine the use of entire corpora simultaneously.  Studies of word usage have traditionally focused on solitaries and on certain quantifiable peculiarities such as abstract nouns or compound verbs.  Few have tried to examine patterns of use in Latin's most common words.  Indeed, without the digital corpora such a consideration would be at best a Herculean effort.  And yet it is far and away the use of the most common words that determines an author's overall usage pattern and therefore forms the foundation of that author's style.

Through digital corpora we can quantify the entirety of an author's text and then compare this to any other author or part thereof.  With this analysis comes the ability to take a wide and unemotional view, a quantifiable view, and to develop pictures that would otherwise not be possible.  These digital studies will not define or redefine Cicero, Caesar, or Tacitus, but they can shed light on progressions of style and intertextual relationships that would otherwise be invisible, and so enhance our overall understanding.
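As a rough illustration of what quantifying a text can mean in practice, the following Python sketch counts the surface forms of quis, quid // qui, quae, quod in a plain-text passage. It is a minimal sketch under stated assumptions and not the tooling used for this paper: the names QUI_FORMS and count_forms are hypothetical, the text is assumed to be unaccented and without macrons, and the sample is the familiar opening sentence of Caesar's Gallic War rather than data from the study.

```python
import re
from collections import Counter

# Surface forms of quis, quid // qui, quae, quod. Matching is by spelling only,
# so quod or quam used as a conjunction is counted as well, and forms with an
# enclitic attached (e.g. quique) are not recognized.
QUI_FORMS = {
    "qui", "quae", "quod", "quis", "quid", "cuius", "cui", "quem", "quam",
    "quo", "qua", "quorum", "quarum", "quibus", "quos", "quas",
}


def count_forms(text, forms):
    """Count how often each of the given surface forms occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token in forms)


sample = (
    "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, "
    "aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur."
)

counts = count_forms(sample, QUI_FORMS)
total = sum(counts.values())

# Report each form with its raw count and its share of the total.
for form, n in counts.most_common():
    print(f"{form:<8}{n:4d}{100 * n / total:8.2f}%")
```

Counting by spelling rather than by grammatical function mirrors the distinction drawn earlier between unique forms and grammatical use; separating the two would require a morphological analyzer, which this sketch does not attempt.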


