Applications of natural language processing to archaeological decipherment: A survey of proto-Elamite

Resource type
Thesis type
(Thesis) Ph.D.
Date created
2023-11-16
Authors/Contributors
Abstract
In this thesis, we describe the first-ever large-scale computational analysis of the partially deciphered proto-Elamite (PE) script. This script was used to write economic accounts which follow a very regular "spreadsheet" structure incorporating many numerals. This sets PE apart from prose corpora which have been considered in prior decipherment work, in ways that both enable and require exploration of new models and methodologies. In close collaboration with domain experts, we provide a thorough survey of this corpus and answer longstanding questions about its content. We describe novel approaches to multi-modal representation learning, which combine visual information from a VAE-inspired encoder with contextual features from a neural language model. We apply these models to evaluate hypotheses about the script's underlying character inventory, which remains very uncertain. By analyzing the representations learned by these models, we also deepen our understanding of the relationships between a set of visually complex signs known as complex graphemes, and discover a strict grammar which appears to govern their construction. We apply a novel variant of the bootstrapping classification algorithm to disambiguate numeric notations with uncertain magnitudes. This enables the first-ever statistical analysis of the corpus's numeric content, and of the relationships between the numeric and linguistic parts of these documents. Given that numeral notations comprise more than half of the attested corpus, this represents a significant advance in our understanding of the script. By applying sequence models to study the internal structure of these documents, we independently replicate claims about a structure called the "header", and adduce new evidence about the size of headers and their distribution across the corpus.
In addition to these main results, we also describe a number of small, focused investigations into word order, the presence of affixal morphology, and other minor features of the texts.
Document
Extent
234 pages.
Identifier
etd22782
Copyright statement
Copyright is held by the author(s).
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Sarkar, Anoop
Language
English
Member of collection
Download file Size
etd22782.pdf 11 MB
