Sumerian language translation using Deep Learning

I have written previously (here and here) about my interest in the Ur III period, a Sumerian dynasty characterized by an abundance of administrative documents in the form of clay tablets written in cuneiform script. About 65,000 such documents are known, recording various types of transactions. In addition, thanks to initiatives such as the CDLI (Cuneiform Digital Library Initiative) and ORACC (Open Richly Annotated Cuneiform Corpus), these texts (and many others) are available in digital form and can be exploited using modern data analysis techniques.

I have shown in previous posts my modest attempts at analyzing the whole Ur III corpus, for example to represent the social network of transactions, or the inflows/outflows of various goods (which you can see in this GitHub repository). In all these cases, I parsed the Sumerian texts to extract only the needed information. This is possible because the majority of Sumerian administrative texts exhibit repeating patterns that can (more or less) easily be parsed. For example, the formulation “ki X-ta” in Sumerian indicates that ‘X’ is the person giving the goods, whereas the formulation “Y i3-dab5” translates as ‘Y received it’.

During the Christmas holidays, I wanted to experiment at home with direct machine translation from Sumerian to English. There are serious academic attempts at doing so: the MTAAC project (Machine Translation and Automated Analysis of Cuneiform Languages) involves several research groups around the world working on precisely this. My attempt here is extremely modest in comparison, as will be clear from the methodology details below. My main goals were first and foremost to have fun and to learn more about neural machine translation (disclaimer: I am only beginning to learn Sumerian and am very far from being good at it). Without further ado, let’s dive into the technicalities.

The data

For any supervised machine translation task, one needs a dataset of sequences in the source language paired with their translations in the target language (the ground truth). Thankfully, the CDLI has not only digitized and transliterated Sumerian texts; in some cases an English translation is also provided. In addition, the entire CDLI text data is dumped daily to this GitHub repository and can be parsed fairly easily.

I extracted from the ATF data all texts in the Sumerian language that contain English translations. This represents 3244 tablets, a small fraction of the entire corpus. However, since each tablet is 5-15 lines on average, I ended up with 51238 sequences of Sumerian text with English translations, which is acceptable for trying to train a deep learning model (although not comparable to the usual machine translation datasets such as WMT’14 German-English, with more than 30M sentences). Note that in this case, each tablet is split into separate lines which are treated independently. This means that the coherence of the whole text is lost; however, in many cases the lines stand alone for translation. For example, ‘1(disz) sila4 / ki ab-ba-sa6-ga-ta / in-ta-e3-a i3-dab5’ can be translated line by line to give ‘1 lamb / from Abbasaga / Inta’e’a received it’. Some lines supposedly had an English translation, but the translation was empty or just ‘xxx’; these were removed.
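
For the curious, here is a minimal sketch of how such pairs might be extracted, assuming the usual ATF conventions (transliteration lines begin with a line number such as ‘1.’, and English translations follow on lines starting with ‘#tr.en:’); the function, the file name, and the filtering of empty/‘xxx’ translations are mine, and language filtering is omitted:

```python
import re

def extract_pairs(atf_path):
    """Extract (transliteration, English translation) pairs from an ATF dump."""
    pairs = []
    current = None  # last transliteration line seen
    with open(atf_path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if re.match(r"^\d+'?\.\s", line):
                # e.g. "1. 1(disz) sila4": keep the text after the line number
                parts = line.split(None, 1)
                current = parts[1] if len(parts) > 1 else None
            elif line.startswith("#tr.en:") and current:
                translation = line[len("#tr.en:"):].strip()
                if translation and translation != "xxx":  # drop empty/placeholder translations
                    pairs.append((current, translation))
                current = None
    return pairs

pairs = extract_pairs("cdli_atf_dump.atf")  # illustrative file name
```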

I decided to train the neural translation models at the character level rather than at the word level. This is not the usual choice in the neural translation papers I have seen, one reason being that sequences become extremely long, and deep learning models based on Recurrent Neural Networks (RNNs) have difficulties with long-range dependencies (even with cell structures such as LSTMs or GRUs). However, in the Sumerian corpus the sentences are short, not exceeding 40-60 characters on average, which is acceptable for LSTMs/GRUs. One could also wonder whether a word-level model is well suited to transliterated sequences coming from a logosyllabic cuneiform script, although I have no definite answer to this.
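
Concretely, working at the character level just means the vocabularies are sets of characters. A minimal sketch, assuming the pairs list built above (the start/end token choices are mine):

```python
# Build character vocabularies from the (Sumerian, English) pairs.
# SOS/EOS are conventional single-character start/end markers (my choice).
SOS, EOS = "\t", "\n"
src_chars = sorted({c for src, _ in pairs for c in src})
tgt_chars = sorted({c for _, tgt in pairs for c in tgt} | {SOS, EOS})
src_index = {c: i for i, c in enumerate(src_chars)}
tgt_index = {c: i for i, c in enumerate(tgt_chars)}
```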

In the Sumerian transliterations, additional characters such as ‘#’, ‘[’, ‘]’, ‘x’, etc. are used to indicate damaged or illegible signs. I made no attempt to remove them. Although the text often remains understandable, this is a questionable choice, and it is quite possible that the translations would improve if these characters were cleaned out.

The dataset was only split 0.85/0.15 into training/test sets, since no hyperparameter optimization was considered here (so no validation set was needed).
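
For reference, a split along these lines would do (assuming scikit-learn is available and the pairs list built above):

```python
from sklearn.model_selection import train_test_split

# 0.85/0.15 train/test split; the random seed is arbitrary
train_pairs, test_pairs = train_test_split(pairs, test_size=0.15, random_state=0)
```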

The models

I compared two models, namely Seq2Seq without an attention mechanism and Google’s Transformer, described in this paper and this paper respectively. There are very good tutorials and explainers on the net about these two models (see for example here and here), so I won’t detail them here.

Both models are based on an encoder/decoder structure. In the case of Seq2Seq, the input sequence is fed character by character to an encoder network of stacked LSTM (or GRU) layers, which encodes the sequence into a fixed-size vector. This vector is used by the decoder network (also made of stacked LSTM layers) to predict the output sequence character by character in an auto-regressive fashion. This model was a turning point in neural translation but is nowadays considered quite limited, since it is hard to believe that the whole meaning of a sentence can be encoded into a single vector. Later models added attention mechanisms, i.e. ways to weigh the importance of a word and its neighbors in the translation. Google’s Transformer model gets rid of the recurrent nature of the network and keeps only the attention mechanism, which in their case improves both the training speed and the translation quality.

I built the Seq2Seq model from scratch in Keras, using two stacked LSTM layers of size 1024 for the encoder and the same structure for the decoder. For the Transformer model, I used the very nice implementation given in this GitHub repository, with only two layers of self-attention. And now for the horrifying truth: I trained these models on my old personal computer on CPU only, which took 10 days per model. In other words: don’t do this unless you have a GPU at hand! I didn’t, and I could let my computer run non-stop during the holidays, but never again. This is also why I did not want to bother too much with hyperparameter optimization and data cleaning comparisons.
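
For illustration, here is a minimal sketch of such a Seq2Seq model in Keras, assuming one-hot encoded characters and passing only the top-layer encoder states to the decoder; this is a simplified reconstruction under those assumptions, not the exact model I trained:

```python
from tensorflow import keras
from tensorflow.keras import layers

HIDDEN = 1024
n_src, n_tgt = len(src_chars), len(tgt_chars)

# Encoder: two stacked LSTMs over one-hot encoded Sumerian characters;
# only the top layer's final states are kept here (a simplification)
enc_in = keras.Input(shape=(None, n_src))
x = layers.LSTM(HIDDEN, return_sequences=True)(enc_in)
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(x)

# Decoder: same structure, initialized with the encoder states and trained
# with teacher forcing to predict the next English character
dec_in = keras.Input(shape=(None, n_tgt))
y = layers.LSTM(HIDDEN, return_sequences=True)(dec_in, initial_state=[state_h, state_c])
y = layers.LSTM(HIDDEN, return_sequences=True)(y)
dec_out = layers.Dense(n_tgt, activation="softmax")(y)

model = keras.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```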

What is expected?

Not much, really. As I said above, this was simply an experiment for the sake of fun, so I did not have very high hopes, especially given the hardware limitations. In addition, there are other limitations coming from the data itself:

  • The Sumerian metrological system uses a mix of sexagesimal and decimal systems: as an example, ‘4(gesz2) 1(u) 5(disz) udu’ corresponds to 4*60+10+5=255 sheep. In addition, different signs are used for units of weight, volume, surface, etc. For example, ‘1(barig)’ designates a volume of 60 sila3 (1 sila3 being approximately 1 liter), whereas ‘1(ban2)’ corresponds to 10 sila3. So our networks would have to learn not only how to sum, but also how to differentiate between units of measurement. Of course they don’t, and just memorize the most frequently encountered sequences. A pre-processing step to convert these notations into numeric form would certainly help (a sketch of such a conversion follows this list).
  • Proper nouns are not explicitly identified in a sentence. For example, ‘ab-ba-sa6-ga’ corresponds to Abbasaga, a chief official of the Ur III dynasty. Similarly, ‘ARAD2’ (in full form ‘ARAD2-mu’) was a granary chief, but ‘ARAD2’ also means slave/servant, as in ‘ARAD2-zu’: ‘your servant’. Without any indication, it is highly improbable that our models would recognize a proper noun; in practice they again just memorize frequently encountered sequences.
  • There is variability in the translations given in the dataset, most probably coming from different annotators.
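
Here is what a numeric pre-processing step could look like for the counting system mentioned above. The sign values (1, 10, 60, 600, 3600) are standard, but the function and its narrow scope (discrete counts only, no capacity or surface measures, no fractions) are my own simplification:

```python
import re

# Values of the signs used for counting discrete objects (sheep, workers, ...);
# capacity measures such as barig/ban2 follow different factors and are omitted
SIGN_VALUES = {"disz": 1, "u": 10, "gesz2": 60, "gesz'u": 600, "szar2": 3600}

def numeral_to_int(text):
    """Convert e.g. '4(gesz2) 1(u) 5(disz)' to 255 (counting system only)."""
    total = 0
    for count, sign in re.findall(r"(\d+)\(([^)]+)\)", text):
        total += int(count) * SIGN_VALUES.get(sign, 0)
    return total

assert numeral_to_int("4(gesz2) 1(u) 5(disz)") == 4 * 60 + 10 + 5  # 255 sheep
```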

The results

And now for the results. After training the models, I generated translations for unseen tablets from the CDLI corpus for which no English translation was available. We begin with some good cases. Here is CDLI n° P131719:

| Original | Seq2Seq | Transformer |
|---|---|---|
| 1. 1(u) 2(disz) sila4 | 12 lambs, | 12 lambs, |
| 2. 2(disz) masz2 | 2 billy goats, | 2 billy goats, |
| 3. u4 6(disz)-kam | 6th day, | 6th day, |
| 4. ki ab-ba-sa6-ga-ta | from Abbasaga | from Abbasaga |
| 1. in-ta-e3-a | Inta’e’a | Intaea |
| 2. i3-dab5 | accepted; | accepted; |
| 3. iti szu-esz5-sza | month: “šu’ešša,” | month: “šu’ešša,” |
| 4. mu sza-asz-ru{ki} ba-hul | year: “Šašru was destroyed.” | year: “Šašru was destroyed.” |
| 1. 1(u) 4(disz) | (total:) 24. | 14, |

The translation is near perfect for the Seq2Seq model and perfect for the Transformer model, probably because these sentences are extremely generic and very frequently encountered throughout the corpus. As can be seen on the last line, the Seq2Seq model has trouble summing ‘1(u)’ (10) and ‘4(disz)’ (4) and gives a wrong total. Note also how the spelling of the proper noun Inta’e’a differs between the two models.

Here is another example, with CDLI n° P131728:

| Original | Seq2Seq | Transformer |
|---|---|---|
| 1. 3(disz) u8 | 3 ewes, | 3 ewes, |
| 2. 2(disz) udu | 2 sheep, | 2 sheep, |
| 3. 2(disz) sila4 | 2 lambs, | 2 lambs, |
| 4. 2(disz) ud5 | 2 nanny goats, | 2 nanny goats, |
| 5. 1(disz) masz2 | 1 billy goat, | 1 billy goat, |
| 1. e2-muhaldim | for the kitchen; | (for) the kitchen, |
| 2. u4 3(disz)-kam | 3rd day, | 3rd day; |
| 3. zi-ga | booked out; | booked out; |
| 4. ki ur-ku3-nun-na | from Ur-kununa | from Ur-kununa |
| 5. iti a2-ki-ti | month: “Akitu,” | month: “Akitu,” |
| 6. mu us2-sa ki-masz{ki} ba-hul | year after: “Kimaš was destroyed.” | year after: “Kimaš was destroyed.” |

No problems here: the translations are perfect and nearly identical.

But problems quickly arise. Here is CDLI n° P131690:

| Original | Seq2Seq | Transformer |
|---|---|---|
| 1. 1(disz) sila4 | 1 lamb, | 1 lamb, |
| 2. e2-uz-ga | for the uzga-house; | for the uzga-house, |
| 3. mu-kux(DU) ARAD2-mu | delivery of ARADmu; | delivery of Warad-mud, |
| 4. ha-ba-ba-tum muhaldim maszkim | may you be my ally. Babatum, | the plot of Martum, the craster, |
| 5. u4 1(u) 4(disz)-kam | 14th day; | 14th day; |
| 1. [ki] ab-ba-sa6-ga-ta ba-zi | from (the account of) Abba-saga booked out, | from Abbasaga account; |
| 2. u4 3(disz)-kam | 3rd day, | 3rd day; |
| 2. iti u5-bi2-gu7 | month: “Ubi feast,” | month: “Ubi-feast,” |
| 3. mu sza-asz-ru{ki} ba-hul | year: “Šašru was destroyed.” | year: “Šašru was destroyed.” |
| 1. 1(disz) | (total:) 1. | (total:) 1. |

The translation is mostly OK (the Transformer model had trouble with ARAD2-mu), but both models fail miserably on ‘ha-ba-ba-tum muhaldim maszkim’. This sentence translates to ‘Hababatum, the cook, is the responsible official’: the models are probably unable to recognize that ‘ha-ba-ba-tum’ is a proper noun and generate erroneous sequences of characters.

Overall, I found that texts about cereals (mostly barley) were often poorly translated (maybe there was a bias towards texts about cattle in the training set?). Here is CDLI n° P133029:

| Original | Seq2Seq | Transformer |
|---|---|---|
| 1. 3(barig) 1(ban2) sze lugal | 3 (gur) 1 ban2 wheat, grown, | 3 barig 1 ban2 barley, royal (measure) |
| 2. sza3-gal erin2-na | fodder reed for cheese, | fodder of the governors, |
| 3. ki ARAD2-{d}ba-ba6-ta | from Abba-Ištar | from ARAD-Baba; |
| 1. kiszib3 ur-{d}ba-ba6 | under seal of Ur-Baba; | under seal of Ur-Baba; |
| 2. iti amar-a-a-si | month: “Amar-ayasi,” | month: “Amar-ayasi,” |
| 3. mu {d}szu-{d}suen lugal | year: “Šū-Suen is king” | year: “Šu-Suen is king.” |
| 1. ur-{d}ba-ba6 ugula | Ur-Baba was enforcer; | Ur-Baba, the king; |
| 2. dumu lugal-[…] | son of Lugal-…, | son of Lugal-…, |

In the first line, the Seq2Seq model misses both the units of measurement and the commodity itself, translating barley (‘sze’) as ‘wheat’. The next line is a failure for both models: the true translation is ‘fodder for the troops’. The Transformer model may have confused ‘erin2’ (‘troops’) with ‘ensi2’ (‘governor’), but I have no idea what led the Seq2Seq model to cheese and reed. The Seq2Seq model then mistranslates a proper noun, the person here actually being ARAD-Baba (a brewer). Finally, the Transformer model erroneously translates ‘ugula’ as ‘king’ (which would be ‘lugal’), whereas it really means ‘overseer’ or ‘enforcer’.

Sometimes the translations get strange. Below is an example, CDLI n° P133032:

| Original | Seq2Seq | Transformer |
|---|---|---|
| 1. 3(disz) geme2 u4 1(disz)-sze3 | 3 female laborer days | (and) 3 for the female laborers, |
| 2. a2 zi3 ar3-a ma2-a si-ga nibru{ki}-sze3 | the seven cows that stretched out the water of Ur, | labor of the horn of the house of the heart of the house with the he |
| 3. ki ur-ab-ba-ta | from Ur-bada, | from Ur-abba; |
| 4. kiszib3 lu2-{d}nin-gir2-su | under seal of Lu-Ningirsu, | under seal of Lu-Ningirsu; |
| 1. dumu ba-zi | son of Adaya. | son of Bazi, |
| 2. giri3 ab-ba-kal-la | via Abbakala, | via Abbakalla; |
| 3. mu si-ma-num2{ki} ba-hul | year: “Simanum was destroyed.” | year: “Simanum was destroyed.” |
| 1. lu2-{d}nin-gir2-su | Lu-Ningirsu, | Lu-Ningirsu, |
| 2. dub-sar | scribe, | scribe, |
| 3. dumu ba-zi | son of Adaya. | son of Bazi, |

Apart from the usual problems with proper nouns, we have the sentence ‘a2 zi3 ar3-a ma2-a si-ga nibru{ki}-sze3’, which I believe translates to ‘labor of flour-milling, transported on a boat to Nippur’. This leads both models to fail spectacularly: I would be very interested to hear any details about the Sumerian legend of the seven cows that stretched out the water of Ur!

Overall, too many of these translations exhibit similar problems for the models to be considered acceptable.

Looking inside the Seq2Seq model

As seen above, the Seq2Seq model encodes each sentence into fixed-size vectors, the encoder states, which are then used by the decoder to produce the translation. We can study these representations to see what happens after training. I used UMAP to perform dimensionality reduction and project the 43552 state vectors of size 1024 from the training set into two dimensions. The scatter plot below shows each of these projections.

UMAP visualization of the 2nd layer encoder states of the Seq2Seq model. Each point corresponds to a sentence of the training set.
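
For reference, the projection can be computed along these lines, assuming the second-layer hidden states have been collected into an (n_sentences, 1024) array called encoder_states (the variable name is mine):

```python
import matplotlib.pyplot as plt
import umap  # from the umap-learn package

# Project the 1024-dimensional encoder states down to 2D
embedding = umap.UMAP(n_components=2).fit_transform(encoder_states)

plt.scatter(embedding[:, 0], embedding[:, 1], s=1)
plt.show()
```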

We can clearly see clusters in this figure, meaning that the Seq2Seq encoder has learned specific representations for the sentences in the training set. To see this in detail, below is the same scatter plot with some example clusters highlighted in color.

Here are some sentences corresponding to the points in the orange cluster in the upper left quadrant:

pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, pisan-dub-ba, …

This is in fact a single sentence that is found on many so-called pisan-dub-ba texts. It translates to ‘basket of tablets’ and appears on clay tablets that were attached to baskets containing other archived tablets; these tablets often bore a short description of the contents of the basket.

Here are some sentences corresponding to the points in the blue cluster in the bottom:

1(disz) sila4 ensi2 gir2-su[{ki}], 1(disz) sila4 szesz-zi-mu, 1(disz) sila4 hur#-sag#-ga#?-lam#?-ma#, 1(disz) sila4 mu-kux(DU) en-sza3-ku3#-ge, 1(disz) sila4 kal-la-mu, 1(disz) [sila4] {d}gesztin-an-na-ama-lugal, 1(disz) sila4 ku5-da-mu, 1(disz) sila4 lu2-{d}asar-lu2-hi szabra, 1(disz) sila4 ur-{d}suen# dumu-lugal, 1(disz) sila4 szimaszgi2, 1(disz) sila4 szimaszgi2#{ki}, 1(disz) sila4 ur-gu-la nu-banda3, 1(disz) sila4 er3#-re-szum, 1(disz) sila4 zi2-na-na, 1(disz) sila4 lugal-ma2-gur8-re, 1(disz) sila4 lugal#-x-da, 1(disz) sila4 it-ra-ak-i3-li2, 1(disz) sila4 lugal-ku3-zu, 1(disz) sila4 AN […]-x, 1(disz) sila4 mu-kux(DU) s,e-lu-usz-{d}da-gan, 1(disz) sila4 lugal-a2-zi-da szabra, …

All these sentences begin with ‘1(disz) sila4’, which means ‘1 lamb’, and are clustered in the same region.

Below are some sentences corresponding to the points in the red cluster in the top:

iti ezem-{d}szul-gi, iti ezem-{d}szul-gi, [iti ezem]-{d}szul-gi, iti ezem-{d}szul-gi, iti ezem-{d}szul-gi, iti ezem-{d}szul-gi, iti ezem-{d}szul-gi, iti ezem-{d}szu-{d}suen, iti ezem-{d}szul-gi, iti ezem-{d}amar-{d}suen, iti ezem-{d}szul-gi, iti ezem#-{d}szul-gi, iti# ezem#-{d#}szu#-{d#}suen#, iti ezem-{d}szul-gi, iti ezem-{d}szul-gi, iti ezem {d}szul-gi4, iti ezem-{d}szul-gi, iti# ezem#-{d}szul-gi, iti# ezem#-{d#}szul-gi, …

These are month names, mostly ‘month: festival of Šulgi’. Note that the encoder has learned to represent sentences containing the damaged-sign marker ‘#’ in the same region.

Finally, below are some sentences corresponding to the points in the yellow cluster:

szunigin 3(disz) dug dida du 1(ban2), szunigin 5(u) 8(disz) ad7 gu4 geme2 usz-bar-e gu7-a, szunigin 2/3(disz) ma-na 8(disz) 5/6(disz) gin2 1(u) sze ku3, szunigin 1(u) 8(disz) gin2 naga, szunigin# 1(ban2) kasz saga szunigin 4(ban2) [n kasz], szu-nigin2 1(gesz2@c) 4(u@c) la2 3(asz@c) ab2 gu4, szunigin 7(gesz2) 3(u) 3(asz) 5(ban2) 1/2(disz) sila3 gur, szunigin 8(disz) dumu 1(ban2)-ta, [szunigin 1(szar2) 4(gesz’u)] 8(gesz2) 3(u) 8(disz)# 2(disz)# gin2 gurusz# [u4 1(disz)-sze3], szunigin 2(szar2) 9(gesz2) 6(disz) 5/6(disz) geme2 u4 1(disz)-sze3, szunigin 4(u) 3(asz) 1(barig) gur, szunigin 2(disz) sa szum2, szunigin 1/3(disz) sar 9(disz) gin2 igi-6(disz)-gal2 sig4, szunigin 3(gesz2) 2(u) 2(disz) udu masz2 hi-a, szunigin 6(bur3) GAN2# […], szunigin 1(gesz’u@c) 5(gesz2@c) 3(u@c) ninda 5(u) du8, [szunigin] 2(u)# 8(disz) udu-nita2 1(u) sila4, szunigin 5(disz) ud5 niga, szunigin 7(disz) gurusz 4(ban2)-ta, szunigin 3(disz) ug3-[IL2 u4 3(u)-sze3], szunigin 4(gesz2) 4(u) 3(asz) 3(ban2) 7(disz) 1/2(disz) sila3 gur, szunigin 2(u) 4(disz) ab2, szunigin 1(asz@c) ug3 gurusz 1(barig) 1(ban2) 5(disz) sila3 sze 4(disz) ma-na siki-ta, szu-nigin2 4(u@c) 4(asz@c) 2(barig@c) 3(ban2@c) [sze] gur saggal#, szunigin 1(u) 4(disz) sila4,…

Here we have sentences beginning with ‘szunigin’, which means ‘total’; they are found on bookkeeping tablets.

Conclusions and next steps

This was an interesting project, even though the automatic translations are nowhere near acceptable overall, and I did learn a lot about neural machine translation. Given my current hardware limitations, it is unlikely that I will try to improve the models. In addition, I believe a larger and better-annotated training set would be needed for improved performance. I will probably post the code and the translation outputs if there is interest; do not hesitate to leave a message with any comments or questions.
