In this section we give a very brief introduction to modelling language using recurrent neural networks (RNNs); for a more detailed account see (5). We can view language as a series of discrete tokens $w_1, \dots, w_n$, and our goal is to fit a probabilistic model for such sequences, i.e. we wish to find a parametric model that learns the distribution $P(w_1, \dots, w_n)$ from samples. The first step is to use an autoregressive model; more specifically, we use the factorization $P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})$. This means that we can reduce the problem of modeling a full sentence to predicting the next word given the text seen so far.
In recurrent neural networks this autoregressive model $P(w_t \mid w_1, \dots, w_{t-1})$ is fitted using a hidden memory. Given the previous hidden memory $h_{t-1}$, the network first updates the memory based on the new input $x_t$, then uses the updated memory $h_t$ to predict the next token, and finally passes the updated memory on to the next step. More formally,
$$h_t = \sigma(W x_t + U h_{t-1}), \qquad p_t = \mathrm{softmax}(V h_t),$$
where $x_t$ is a one-hot representation of the input token, $W$, $U$ and $V$ are linear mappings, $\sigma$ is an elementwise nonlinearity and $p_t$ is the vector of probabilities for each possible next token. The model is trained by maximizing the log-likelihood of the training data. While simple and effective, simple recurrent neural networks have difficulty modeling long time dependencies because of vanishing gradients, i.e. they struggle when the probability of the next token depends on information seen many steps before. To address this issue various modifications have been proposed, such as long short-term memory (LSTM) (13), which introduces a gating mechanism.
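The recurrence described above can be sketched in a few lines of plain Python. This is only an illustration of a simple (non-gated) recurrent step, not the LSTM used in the experiments; the matrix names `W`, `U`, `V`, the choice of tanh as the nonlinearity, the zero initial memory and the toy sizes are all assumptions made for the example.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn arbitrary scores into a probability vector."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(h_prev, token_id, W, U, V):
    """One step: update the memory with the new token, then predict the next one."""
    # With a one-hot input, W x_t is simply column `token_id` of W.
    wx = [row[token_id] for row in W]
    uh = matvec(U, h_prev)
    h = [math.tanh(a + b) for a, b in zip(wx, uh)]  # tanh is an assumed nonlinearity
    p = softmax(matvec(V, h))                       # probabilities for the next token
    return h, p

def sequence_log_likelihood(tokens, W, U, V, hidden_size):
    """Sum of log P(next token | tokens so far), starting from a zero memory."""
    h = [0.0] * hidden_size
    total = 0.0
    for current, following in zip(tokens, tokens[1:]):
        h, p = rnn_step(h, current, W, U, V)
        total += math.log(p[following])
    return total

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, untrained random weights.
rng = random.Random(0)
vocab_size, hidden_size = 5, 4
W = random_matrix(hidden_size, vocab_size, rng)
U = random_matrix(hidden_size, hidden_size, rng)
V = random_matrix(vocab_size, hidden_size, rng)

h, p = rnn_step([0.0] * hidden_size, 2, W, U, V)
log_likelihood = sequence_log_likelihood([0, 1, 2, 3], W, U, V, hidden_size)
```

Training would adjust `W`, `U` and `V` to maximize `sequence_log_likelihood` over the training sentences; that step is omitted here.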
As a baseline model for comparison we trained an n-gram model. An n-gram model assigns a probability to each token according to how frequently the sequence of the preceding n − 1 tokens was followed by that token in the training set. The main limitation of n-gram models is that for small n the context used for prediction is very small, while for large n most test sequences of size n are never seen in the training set. We used a 2-gram model, i.e. each word is predicted according to how frequently it appeared after the previous one.
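A count-based 2-gram model of this kind can be sketched as follows. The function names and the toy corpus are illustrative assumptions, and no smoothing is applied, so unseen word pairs receive probability zero; the source does not say whether the actual baseline used smoothing.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, cur in zip(sentence, sentence[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    """Relative frequency of `cur` after `prev` (0.0 if `prev` was never followed by anything)."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

# Toy corpus of already tokenized lines (made up for illustration).
toy_sentences = [
    ["NUM", "mana", "kùbabbar", "šá", "NAME"],
    ["NUM", "mana", "kùbabbar"],
    ["NUM", "kùbabbar", "šá", "NAME"],
]
bigram_counts = train_bigram(toy_sentences)
p_mana_after_num = bigram_prob(bigram_counts, "NUM", "mana")   # 2 of 3 continuations
p_unseen = bigram_prob(bigram_counts, "NAME", "mana")          # never observed
```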
In order to generate our datasets we collected transliterated texts from the Achemenet website, based on data prepared by F. Joannès and his team in the framework of the Achemenet Program (CNRS, Nanterre). We then designed a tokenization method for Akkadian transliterations, as detailed in Sec. 6. We trained an LSTM recurrent network and an n-gram baseline model on this dataset; see the Supporting Information for model and training details.
Results for both models are given in Table 1. Loss refers to the mean negative log-likelihood, and perplexity is two to the power of the entropy measured in bits; in both cases lower is better.
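The relation between the two reported quantities can be made concrete with a small sketch. The probabilities below are made up; the only point is that perplexity is the logarithm base raised to the mean negative log-likelihood, so a loss in bits uses base 2 while a loss in nats (the default in many deep learning libraries) uses base e, and both give the same perplexity.

```python
import math

def mean_negative_log_likelihood(token_probs, base=2.0):
    """Average of -log_base(p) over the probabilities assigned to the correct tokens."""
    return -sum(math.log(p, base) for p in token_probs) / len(token_probs)

def perplexity(loss, base=2.0):
    """Perplexity is the base raised to the mean negative log-likelihood."""
    return base ** loss

# Probabilities a hypothetical model assigned to four correct tokens.
token_probs = [0.5, 0.25, 0.25, 0.5]

loss_bits = mean_negative_log_likelihood(token_probs, base=2.0)
loss_nats = mean_negative_log_likelihood(token_probs, base=math.e)

ppl_from_bits = perplexity(loss_bits, base=2.0)
ppl_from_nats = perplexity(loss_nats, base=math.e)
```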
Table 1. Loss and perplexity on Achemenet dataset
As expected, the RNN greatly outperforms the n-gram baseline and despite the limitations of the dataset it does not suffer from severe overfitting.
A. Completing random missing words. In order to evaluate our models’ ability to complete missing words, we took random sentences from the test corpus, removed the middle word and tried to predict it using the rest of the sentence. Our model returns a ranking of probable words and we report the mean reciprocal rank (MRR). The MRR is the average over the dataset of one over the predicted rank of the correct word. It is a very common and useful measure in information retrieval, as it is strongly weighted towards the top ranks, which is what the user is mostly interested in. We also evaluate “hit@k”, which measures the percentage of sentences in which the correct completion is among the top k suggestions. For evaluation we took all test sentences of length 10 or longer that contain no breaks, 166 sentences in total.
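Both evaluation measures are straightforward to compute from a list of ranked suggestions. The sketch below uses hypothetical function names and a made-up set of three rankings; the optional `max_rank` argument mirrors the convention of scoring a sentence as zero when the correct word is not among the candidates considered.

```python
def reciprocal_rank(ranking, correct, max_rank=None):
    """Return 1/rank of `correct` in `ranking`, or 0.0 if it is absent or beyond `max_rank`."""
    considered = ranking if max_rank is None else ranking[:max_rank]
    if correct not in considered:
        return 0.0
    return 1.0 / (considered.index(correct) + 1)

def mean_reciprocal_rank(rankings, answers, max_rank=None):
    """Average reciprocal rank over all evaluated sentences."""
    scores = [reciprocal_rank(r, a, max_rank) for r, a in zip(rankings, answers)]
    return sum(scores) / len(scores)

def hit_at_k(rankings, answers, k):
    """Fraction of sentences whose correct word is among the top k suggestions."""
    hits = sum(1 for r, a in zip(rankings, answers) if a in r[:k])
    return hits / len(rankings)

# Three made-up rankings in which the correct word is ranked 1st, 2nd and 3rd.
toy_rankings = [
    ["iqbi", "bar", "bán"],
    ["bar", "iqbi", "bán"],
    ["bán", "bar", "iqbi"],
]
toy_answers = ["iqbi", "iqbi", "iqbi"]

mrr = mean_reciprocal_rank(toy_rankings, toy_answers)          # (1 + 1/2 + 1/3) / 3
mrr_top2 = mean_reciprocal_rank(toy_rankings, toy_answers, 2)  # third sentence counts as 0
hit1 = hit_at_k(toy_rankings, toy_answers, 1)
hit2 = hit_at_k(toy_rankings, toy_answers, 2)
```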
Fig. 2. Left to right: The original cuneiform line art, the transliteration and the translation of the Achaemenid period Babylonian text YOS 7 11.
We compared two variations of our model: one that finds the optimal completion based only on the words up to the missing word, denoted “LSTM (start)”, and one that takes the full sentence into account, denoted “LSTM (full)”. As the “LSTM (full)” model needs to run separately for each candidate missing word, we first picked the top 100 candidates using “LSTM (start)”. We then generated 100 sentences, one for each possible completion, and re-ranked them based on the full sentence log-likelihood. If the right completion is not in the top 100, we take the reciprocal rank to be zero.
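The two-stage procedure (shortlist with the prefix-only model, then re-rank by full-sentence log-likelihood) can be sketched as below. The function `rerank_completions` and the stand-in scorer `toy_log_likelihood` are hypothetical; in the actual setup the scorer would be the trained LSTM, which is not reproduced here.

```python
def rerank_completions(prefix, suffix, start_ranking, full_log_likelihood, top_n=100):
    """Re-rank the top_n prefix-based candidates by the log-likelihood of the full sentence."""
    candidates = start_ranking[:top_n]
    scored = [(full_log_likelihood(prefix + [word] + suffix), word) for word in candidates]
    # Highest log-likelihood first; the sort is stable, so ties keep the prefix-based order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored]

def toy_log_likelihood(sentence):
    """Stand-in scorer (assumption): penalize every adjacent pair except one 'known' pair."""
    known_pairs = {("iqbi", "umma")}
    return sum(0.0 if pair in known_pairs else -1.0
               for pair in zip(sentence, sentence[1:]))

prefix = ["NAME", "ana", "NAME"]
suffix = ["umma"]
start_ranking = ["bar", "iqbi", "bán"]  # made-up output of the prefix-only model

reranked = rerank_completions(prefix, suffix, start_ranking, toy_log_likelihood)
shortlisted = rerank_completions(prefix, suffix, start_ranking, toy_log_likelihood, top_n=2)
```

Only the shortlisted candidates are ever scored with the full sentence, which is why a correct word outside the shortlist cannot be recovered by the second stage.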
Table 2. Completing missing fifth word in sentences.
For comparison we used two simple 2-gram baselines: one that takes into account only the previous word, denoted "2-Gram (start)", and one that takes into account both the previous and the next word, denoted "2-Gram (full)". While this is a relatively weak model, we found that it works surprisingly well, although it remains significantly inferior to the LSTM model.
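The two baselines can be illustrated with a short sketch. How the "full" variant combines the previous and next word is not spelled out in the text; the version below assumes the candidate score is the product P(w | previous) · P(next | w), and all function names and the toy corpus are made up.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for prev, cur in zip(sentence, sentence[1:]):
            counts[prev][cur] += 1
    return counts

def cond_prob(counts, prev, cur):
    """Relative frequency of `cur` after `prev`, without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

def rank_start(counts, prev, vocab):
    """Rank candidates using only the previous word."""
    return sorted(vocab, key=lambda w: cond_prob(counts, prev, w), reverse=True)

def rank_full(counts, prev, nxt, vocab):
    """Rank candidates using previous and next word (assumed product of the two bigram terms)."""
    return sorted(vocab,
                  key=lambda w: cond_prob(counts, prev, w) * cond_prob(counts, w, nxt),
                  reverse=True)

toy_sentences = [
    ["a", "x", "b"], ["a", "x", "b"], ["a", "x", "c"],
    ["a", "y", "c"], ["a", "y", "c"],
]
counts = train_bigram(toy_sentences)
vocab = sorted({w for s in toy_sentences for w in s})

best_start = rank_start(counts, "a", vocab)[0]      # "x" is the most frequent word after "a"
best_full = rank_full(counts, "a", "c", vocab)[0]   # "y" fits better once "c" must follow
```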
It is clear from the results in Table 2 that our algorithm can be of great help in completing missing words, with an almost 85% chance of completing the word correctly and a 94% chance of having the correct word in the top 10 suggestions.
B. Designed completion test. We designed another experiment in order to evaluate our completion algorithm and understand its strengths and weaknesses. We generated a set of 52 multiple choice questions, in which the model is presented with a sentence with one word missing as well as four possible completions, and the goal is to select the correct one. The three wrong answers were designed to be wrong semantically, wrong syntactically, and wrong in both respects. This way we can see the types of mistakes the algorithm makes. The assumption is that the learning algorithm would be more likely than a human to make semantic mistakes, but should be better at grammar than a non-expert. If that is the case, the effectiveness of our approach as a way to assist humans increases, as the strengths of human and machine complement each other.
We used our model to rank the four possible restorations for each of the missing words in the 52 random sentences. Selecting the one with the highest likelihood gave 88.5% accuracy (see the Supporting Information for the complete list of questions and answers). Looking at the six failed completions, we see that four are semantically incorrect, one is syntactically incorrect and one is both, in agreement with our hypothesis.
Further study of the different restorations in the designed completion test, taking into account the full ranking of the answers, reveals some interesting patterns. The majority of restorations, 36 cases, show that the algorithm best identifies either correct sentence structure or correct syntactic sequences of parts of speech, based on the statistical frequency of smaller syntagmatic structures. A smaller subset of cases, which probably derives from paradigmatic relationships between certain classes of words, shows correct semantic identification of noun class, as well as of related verbs. With regard to the latter, five possible cases correctly identify the usage of verbal forms based on their context (e.g. in direct speech). Take for example question 3: NAME ašú šá NAME ana NAME lú qíipi ébabbarra u NAME lú sanga LOCATION ___ umma. The model ranked the four possible answers as follows: iqbi; liqbuú; bar; bán. The example shows not only correct identification of sentence structure, but also a link between two different forms of the verb qabû "to speak". It does not necessarily reflect an understanding of the verbal root form, but rather the statistical frequency of iqbi in this context and the identification of its similarity to liqbuú. This statistical inference emerges more clearly in one of the mistakes made by the model, in question 32, where it does not properly differentiate the grammatical person of the verb nadānu "to give, pay" (taaddinu vs. inamdin).
Some level of semantic knowledge in the model becomes apparent with regard to noun class. Sixteen questions show possible correct identification of countable nouns, names of professions, temporal designations, gender, and even a contextual formulaic legal clause (the so-called elat-clause). Six cases show correct identification of prepositions, particle use, or pronouns. The choice in question 7 between the related prepositions ina and ana makes it clear that these choices are again based on frequency in specific contexts. Moreover, a statistical grasp of parts of speech seems to be a decisive factor in at least six cases of restoration. It can, however, either interfere with contextual identification of the correct restoration (e.g. by preferring ina igi over ina šu before NAME in question 35) or achieve surprisingly good results (e.g. kurkur over LOCATION after lugal in question 37).
The model does not seem to identify alternate logographic and phonetic writings of the same words: e.g. Sum. da = Akk. itti or Sum. im.dub = Akk. tuppi. It obviously lacks enough examples of interchangeability between cases in the studied corpus. Further confusion can happen when the model identifies similarity between the answer and another word close by in the sentence, either noun or verb. Especially problematic are cases when there are very few similar sentences to train on, so the algorithm makes an "educated" guess.
In conclusion, our model, as far as can be judged by this experiment, is, as expected, good at teasing out sentence structures. However, it was also better than assumed at semantic identification, owing to context-based statistical inference (rather than to finding underlying grammatical rules and morphology). In order to greatly reduce false identifications based on the statistical frequency of contextual semantic relationships, much more training material will be needed. Nevertheless, we have demonstrated that even without access to large amounts of data we can successfully train LSTM models and use them to complete missing words. In our completion test we show good results which, while not sufficient for automatic completion, indicate that this can be an invaluable tool in helping scholars with text restoration.
The significance of our results with the late Babylonian corpus is rooted in the fact that most entry level scholars or other interested historians and social scientists, who focus on the large first millennium BCE Babylonian archives, do not have the very specific knowledge and expertise to understand deep underlying political, social or historical structures without reading through hundreds of texts. Incorporated in an appropriate tool (to be made available on-line in the near future through the Babylonian Engine project), our model will be of immense help to scholars in the historical sciences, allowing them to overcome the high entry barrier to restoring fragmented Akkadian texts: first structured archival documents, but as the data set grows one can train the model on more genres, such as scientific or literary texts. Access to the primary sources in their original state, as well as the ability to restore broken passages, is a necessity for understanding Akkadian corpora on a macroscale.
Our method is innovative in its implementation on ancient cuneiform texts. However, to better understand the significance of this study, it should be placed in the broader context of the data pipeline necessary for reading such ancient texts. One can distinguish two types of relevant problems in text restoration that are currently being addressed with state-of-the-art machine-assisted solutions:
I Problems of visualization, which relate to the preservation, reconstruction and accessibility of documentary sources using some form of scanning, photography or both, in 2D+ or 3D technology (14, 15). Nowadays, the most cost-effective methods combine photogrammetry, which creates a 3D (or 2D+) model of the object (16), and Polynomial Texture Mapping (PTM) using Reflectance Transformation Imaging (RTI) technology. The latter provides different lighting sources and texture to the scanned object (17). Some systems employ multispectral imaging that can reveal features hidden from the human eye (18). Several major projects have developed effective methods for 3D or 2D+ scans and photography of cuneiform tablets: (a) the pioneering, but now defunct, iClay project from Johns Hopkins University (19); (b) the PTM and RTI dome-shaped systems developed by Southampton and Oxford (17), on the one hand, and by KU Leuven (20), on the other; (c) a joint Dortmund-Würzburg team that scans cuneiform fragments in 3D and focuses on digitizing philological work and the reconstruction of fragmentary tablets (14); and (d) the GigaMesh initiative, led by Hubert Mara from Heidelberg (7). The Heidelberg group has recently developed various methods for automatic machine identification of cuneiform signs using ML models and vector geometry (21, 22). Advances in cost-effective and fast 3D scanning technology are crucial to further the work described here. For instance, they allow exact measurements of inscribed objects, which can lead to the joining of broken tablet fragments. These can otherwise only be identified as matching by a handful of world experts in cuneiform. The Virtual Cuneiform Tablet Reconstruction Project (VCTR) joined 3D-scanned cuneiform tablet fragments automatically using a novel matching algorithm with measure-of-fit metrics that dramatically reduce false positive match reports (23). The matching algorithm works by iteratively finding the optimal relative orientation of the two fragments under consideration in three-dimensional space. With this method the team succeeded in joining Neo-Babylonian archival texts as well as a manuscript of the Babylonian flood myth Atrahasis (24).
II Linguistic and content-related problems, which include automated or partly automated transcription and translation of ancient languages. This is an area with potential for big data mining using models of Natural Language Processing (NLP), ML or AI. It is also the most complicated aspect, given the lexical and semantic complexity of the cuneiform script and the Akkadian language. A multinational project led by a group from Toronto, Frankfurt and UCLA has initiated the Machine Translation and Automated Analysis of Cuneiform Languages project (MTAAC). Its main goal is to find methodologies for the automated analysis and machine translation of transliterated cuneiform documents, specifically those written in the less syntactically complex Sumerian language (25). They aim to have the resulting translated lemmas automatically tagged according to context, creating a semantic and lexical database based on neural machine translation models. A recent endeavor of a joint Ariel-Tel Aviv research group achieved high success rates using NLP models such as HMM, MEMM and BiLSTM for word segmentation and automatic transliteration of Akkadian texts in Unicode cuneiform (26).
Transcription of Akkadian Cuneiform script and its Neo-Babylonian dialect. Akkadian was written in the Cuneiform script. Alongside Egyptian Hieroglyphs, Cuneiform is the earliest attested form of writing. It was probably invented in southern Mesopotamia at the end of the fourth millennium BCE and initially used to record daily accounting procedures in the Sumerian cities on a clay medium. A good modern analogy to this earliest phase is the spreadsheet; see (27). The script was then adopted by the Akkadian speaking Babylonians and Assyrians to write their own language using a mixture of syllabic signs, logograms (which incorporated Sumerian values for ideograms) and determinatives (1). In all, Akkadian is one of the most enduring and widely attested languages of the ancient Near East, in use for around 2,500 years. Its geographic horizon spans from Iran to Greece and from Anatolia to Egypt. Neo-Babylonian is the longest consecutive language phase of Akkadian, covering the first millennium BCE and ending sometime after the first century CE. The genres and writing conventions of this phase are characterized by their departure from the standardized orthography practiced throughout the second millennium BCE. Many spellings are inconsistent with actual phonemic renderings of words and can vary to a considerable extent, especially on account of the intensive language contact and interference between Akkadian and Aramaic (29, 30). There are some rules that govern the normalization of Neo-Babylonian (i.e. bound transcription which correctly represents noun and verbal morphology), but in general it is avoided in most recent publications unless for linguistic or pedagogic purposes (11). For this reason we have also chosen to avoid the pitfall of training the algorithm in any kind of normalization practices for the time being. In our training corpus we remained on the level of (unbiased) transliteration, but we removed all connecting features between phonograms and sumerograms, resulting in a mechanical (unnormalized) bound transcription: Akkadian phonetic spellings and logographic writings are taken at face value, by simply removing connecting hyphens between syllables and between logograms. A necessary contrast is drawn between phonetic and logographic writings based on their typical representation in italic typeface vs. regular typeface, respectively (see also below on Tokenization). However, phonetic complements, normally italicized, are currently identified as part of regular typeface logograms when attached. Superscripted determinatives are used to identify proper names, such as personal names, theophoric names, locations and month names. Changes in sentence structure were not taken into consideration, since they only occur in the relatively late corpora of the Parthian Period (third century BCE onwards) (9). The resulting Akkadian texts used to train the algorithm look like the example in Fig. 3, which shows the same text as in Fig. 2 but in our mechanical bound transcription.
Fig. 3. Mechanical bound transcription of Babylonian text YOS 7 11.
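The mechanical bound transcription described above amounts to deleting the connecting hyphens inside each word. A minimal sketch, assuming hyphens are the only connector to be removed and that the input line is already plain text; the function name and the sample line (with hyphens reinserted by hand for illustration) are assumptions.

```python
import re

def mechanical_bound_transcription(line):
    """Remove hyphens that join syllables or logograms inside a word.

    Only hyphens with a non-space character on both sides are removed,
    so a free-standing dash is left untouched.
    """
    return re.sub(r"(?<=\S)-(?=\S)", "", line)

# Illustrative transliterated fragment with connecting hyphens (not quoted from the corpus).
transliterated = "NUM mana kù-babbar šá NAME a-šú šá NAME"
bound = mechanical_bound_transcription(transliterated)
```

Because no morphological analysis is involved, the output keeps every sign value exactly as written, which is why forms such as `ašú` appear in the training text.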
Neo-Babylonian archives under the Persian Empire, their historical significance and text restorations. Babylonian archives from the end of the sixth to the fourth centuries BCE are one of the main sources for reconstructing the official and daily heritage of the Persian Empire and its subject peoples in Mesopotamia. Structuring of Neo-Babylonian archives is based mostly on an artificial division between private and institutional ownership (12). Criteria employed to this end rely more frequently on common principal actors with connected activities (i.e. prosopography), document type and content, or shared background in an institution (like temple or palace), and less on physical proximity between documents in a given find context (archaeological and/or museum based studies in acquisition history for illicitly excavated texts). Among the largest representative text groups with a private background are the business archives of the Egibi and Nūr-Sîn families from Babylon and the Murašu ‘firm’ from Nippur, as well as the closely contemporary archive of the Persian governor Bēlšunu from the palace complex of Babylon, known as the Kasr archive (designated Kasr N6 (31)). The Murašu texts especially, and another archive cluster from several rural centres known as the Yahudu ‘archive’, provide significant information on foreign minority communities in the Achaemenid Empire during a period of close to 200 years, including the fate of the Judean community in Babylonian exile (33). However, the largest textual groups from this period by far are the two large multi-file archives with an institutional background in city temples: the Eanna archive from Uruk and the Ebabbar archive from Sippar.
Fig. 4. Line art and transliteration of Achaemenid period Babylonian text YOS 7 51 from the Eanna archive in Uruk. Fragmentary upper half of obverse marked by a red square.
These make up the bulk of the Achemenet data set, alongside the Egibi/Nūr-Sîn archive and Murašu material. Altogether, the Achemenet Neo-Babylonian data set has representative archival groups for the Achaemenid period from almost every large city in Babylonia: Babylon (Ea-eppēš-ilī, Gahal, Nappāhu); Kiš (Eppēš-ilī); Sippar (Bēl-rēmanni, Ea-eppēš-ilī A, Iššar-tarībi, Marduk-rēmanni, Rē’i-sisê); Uruk (Atû).
The need for text restorations varies from archive to archive, depending either on their method of excavation and preservation in recent times, or on archival selection processes in antiquity (e.g. discarded or "dead" archives). The best kept tablets found their way into museum collections in Europe and the US already following their initial age of discovery during the late 19th and early 20th century. Many came from illicit or clandestine excavations and went through an active selection process, by which collections preferred the most complete tablets. On the other hand, tablets from official excavations in Babylon and Uruk, for example, have a higher percentage of fragmentary texts. Some large archives like Murašu or Kasr (which was already vitrified by an ancient fire) became damaged because of poor handling following excavation or due to the effects of war. A large number of Eanna tablets from before the reign of Darius I were deliberately discarded or smashed already in antiquity after becoming inactive for the temple administration (35, 36). Such an Eanna text with a fragmentary upper half of the obverse, dating to the reign of Cyrus, can be seen in Fig. 4, followed by its possible restoration based on known parallels and scholarly study (Fig. S1).
Data collection. We collected 2,247 Achaemenid period Babylonian archival texts. As the Achemenet website does not have an API, we wrote scraping code in Python 2.7 to scrape the texts, preprocess them and tokenize them. The code uses the "Beautiful Soup" library to remove all the unnecessary HTML tags and take only the transliterated text itself from the site. Superscript and italic tags have semantic meaning and were therefore preserved by our processing. We replace rare words (fewer than three total appearances in the training data) with an UNKNOWN token, as there is not enough data to properly predict and use these words. The number of different words in the vocabulary collected after this step is 1,549 and the total number of words is 220,926. The number of words that appear only once is 3,175, and the number that appear twice is 932. For comparison, the Penn Treebank dataset, a standard and relatively small English text data set comprising texts from the Wall Street Journal, has 10,000 unique words and 1,036,580 words in total. While over-fitting is something to be aware of given the scale of the data, the unique nature of these texts, which comprise well structured bureaucratic information, makes them well suited for machine learning modelling.
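The rare-word replacement step can be sketched as follows. The threshold of three appearances follows the description above; the function name, the `<UNK>` spelling of the token and the toy sentences are assumptions, and the scraping itself is not reproduced.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=3, unk="<UNK>"):
    """Replace every word seen fewer than `min_count` times with the unknown token."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, count in counts.items() if count >= min_count}
    cleaned = [[word if word in vocab else unk for word in sentence]
               for sentence in sentences]
    return cleaned, vocab

# Toy training sentences (made up): only NUM, mana and kùbabbar occur three times.
train_sentences = [
    ["NUM", "mana", "kùbabbar"],
    ["NUM", "mana", "šá", "NAME"],
    ["NUM", "mana", "kùbabbar", "šá", "NAME"],
    ["kùbabbar", "šú"],
]
cleaned_sentences, kept_vocab = replace_rare_words(train_sentences)
```

The counts should be taken from the training split only, so that the vocabulary does not leak information from the test sentences.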
Tokenization. Tokenization is an automatic process in which the text is split into words, and each one is replaced by a numeric token. This is an important process that requires language specific knowledge, or a lot of semantic meaning might be lost. A classical example in English is tokenizing a word like “aren’t”. If we do not break it into two tokens, then it is considered a word on its own and loses the connection to “are” and “not”. While it might be possible for the learning algorithm to learn the connection from the data, bad tokenization can complicate matters considerably by creating a large number of unnecessary words in our dictionary.
We created a new tokenizer, specifically built for Akkadian. Masculine names, god names and female names, identified by determinatives before proper names in superscript ’I’ or ’Id’, ’d’ and ’f’ respectively, were replaced by the tokens ’NAME’, ’GODNAME’ and ’FEMALENAME’. Locations, identified by a determinative before proper names in superscript ’uru’, or after proper names in superscript ’ki’, were replaced by the token ’LOCATION’. Month names, with the superscript determinative ’iti’ before the noun, were replaced by the token ’MONTH’, and simple numbers were replaced by the token ’NUM’. In order to simplify the tokenization of broken parts, each broken or incomplete part was replaced with the token ’<BRK>’, and words that appeared two times or fewer were replaced with ’<UNK>’, since we do not have enough information to learn their meaning.
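The determinative-based replacements can be illustrated with a small sketch. It assumes that determinatives arrive as `<sup>…</sup>` tags attached to the word they qualify, which is a guess about the scraped markup rather than something the text specifies; the table, the regular expressions, the function names and the sample line are all hypothetical, and the handling of broken passages is left out.

```python
import re

# Determinatives written before the word, mapped to their placeholder token.
PREFIX_TOKENS = {
    "I": "NAME", "Id": "NAME",   # masculine personal names
    "d": "GODNAME",
    "f": "FEMALENAME",
    "uru": "LOCATION",
    "iti": "MONTH",
}
PREFIX_RE = re.compile(r"^<sup>([^<]+)</sup>\S+$")
SUFFIX_KI_RE = re.compile(r"^\S+<sup>ki</sup>$")   # 'ki' follows the place name
NUMBER_RE = re.compile(r"^\d+$")

def replace_word(word):
    """Map one whitespace-delimited word to its placeholder token, if any."""
    match = PREFIX_RE.match(word)
    if match and match.group(1) in PREFIX_TOKENS:
        return PREFIX_TOKENS[match.group(1)]
    if SUFFIX_KI_RE.match(word):
        return "LOCATION"
    if NUMBER_RE.match(word):
        return "NUM"
    return word

def replace_special_tokens(line):
    """Apply the placeholder replacement to every word of a line."""
    return " ".join(replace_word(word) for word in line.split())

# Made-up line for illustration only, not quoted from the corpus.
sample = "2 mana <sup>I</sup>Bēlšunu <sup>d</sup>Marduk Uruk<sup>ki</sup>"
replaced = replace_special_tokens(sample)
```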
Another important aspect of Akkadian is that some cuneiform symbols can be interpreted in two ways: as a syllable or as a logogram, i.e. representing a whole word. During transliteration the specific meaning is marked by using italics for syllables. During tokenization we use the same token for both representations, but we keep the HTML start italic <i> and stop italic </i> symbols so that the use of the word as a syllable or logogram can be inferred from the context. While using different tokens for the two uses has some advantages, we found that doing so adds a large amount of noise to the preprocessing step and decided to use this method instead. For example, one sentence fragment after preprocessing is "NUM mana kùbabbar <i> šá </i> NAME a <i> šú šá </i> NAME". After this preprocessing every unique word is replaced by a numeric token.
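The treatment of italics and the final mapping to numeric tokens can be sketched as below. The helper names are hypothetical and the input line is modelled on the fragment quoted above; identifiers are assigned in order of first appearance, which is an arbitrary choice for the example.

```python
import re

def split_italics(html_line):
    """Turn <i> and </i> into stand-alone tokens and split the line on whitespace."""
    spaced = re.sub(r"(</?i>)", r" \1 ", html_line)
    return spaced.split()

def build_vocab(token_lists):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab):
    """Replace every token by its numeric id."""
    return [vocab[token] for token in tokens]

line = "NUM mana kùbabbar <i>šá</i> NAME a <i>šú šá</i> NAME"
tokens = split_italics(line)
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
```

Here both occurrences of `šá` receive the same id, and the surrounding `<i>` and `</i>` tokens are what tell the model that the sign is being read as a syllable.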
ceived funding from the Ministry of Science and Technology Grant 89540 and the Israel Science Foundation Grant 457/19.