I wanted to do a project with neural machine translation and a low-resource language. Low-resource languages are languages that don't have an abundance of written text, and they often don't enjoy the prestige of high-resource languages like English, Mandarin, and German. They can also suffer from a lack of support by governments and educational systems. Speakers of low-resource languages are at a further disadvantage because most technologies are not translated into their languages (you can't talk to your Alexa in Mapuche, for example).

The last neural machine translation project I did was in 2018, before large transformer-based language models had really become ubiquitous. Since the technology has improved, I figured I could improve on it.

I don't speak or understand any low-resource languages. To get around this issue, I decided to simulate a low-resource language using Middle English, a variety of English spoken from the 11th to the 15th centuries CE. There are a number of surviving texts, but not a ton. Middle English is also a good target because it's very similar to Modern English, which should make this somewhat easier.

At the beginning of my Linguistics degree, I was surprised to learn that what I always thought of as 'Old English' wasn't Old English at all. There are a number of common misunderstandings about the history of English. Scholars divide the history of the English language into four categories: Old English, Middle English, Early Modern English, and Modern English. If you're reading this post, chances are you speak Modern English, so we're not going to cover it here.

One common illustration of these different varieties of English is by reproducing the Lord's Prayer:

Early Modern English: "Thy will be done, in earth, as it is in heauen."

Middle English: "Gyue to us this dai oure breed ouer othir substaunce."

Old English: "Gewurþe ðin willa on eorðan swa swa on heofonum."

As we go back in time the language gets harder to understand, with Old English basically unintelligible to modern people.

In order to train a neural translation model, you need a bunch of data. This data, ideally, should be in the form of sentences paired with their translations. As I mentioned earlier, Middle English is something of a low-resource language: whereas languages like Spanish have millions of web resources easily available, there are at most a few hundred surviving books in Middle English. Thankfully, we still have a couple of great resources for paired text: the Wycliffe Bible and Geoffrey Chaucer's Canterbury Tales. These are both good resources because they're very long and have Modern English translations. Scholars have produced line-by-line translations of the Canterbury Tales, and the Wycliffe Bible can be easily aligned verse-by-verse with a modern English version of the Bible. In all, the training data for this model includes about 60,000 paired sentences, taken from Chaucer's complete works, the Wycliffe Bible, and Sir Gawain and the Green Knight. I wrote a few Python scripts to scrape this data from the internet and align the sentence pairs. The dataset is available here.
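My actual scraping and alignment scripts aren't reproduced in this post, but here's a minimal sketch of the alignment step, assuming each text has already been saved with one verse per line; the file names and the tab-separated output format are just illustrative.

```python
# A hypothetical alignment sketch -- not the exact scripts I used.
# Assumes both files hold one verse per line, in the same order,
# which is roughly what verse-numbered Bibles give you.
import csv

def align_verses(middle_path, modern_path, out_path):
    with open(middle_path, encoding="utf-8") as f_me, \
         open(modern_path, encoding="utf-8") as f_mod:
        middle = [line.strip() for line in f_me if line.strip()]
        modern = [line.strip() for line in f_mod if line.strip()]

    # zip() silently drops trailing verses if one file is longer; a real
    # script would match on verse numbers rather than trusting line order.
    pairs = list(zip(modern, middle))

    with open(out_path, "w", newline="", encoding="utf-8") as f_out:
        writer = csv.writer(f_out, delimiter="\t")
        writer.writerow(["modern", "middle"])
        writer.writerows(pairs)

align_verses("wycliffe.txt", "modern_bible.txt", "pairs.tsv")
```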
As readers of this blog know, I tend to tackle most of my problems using large language models. Large language models allow us to leverage the underlying probability distribution of language; we then use that probability distribution as a prior before fine-tuning the model on a specific task (in this case, translation to Middle English). For the translator we use BART, a denoising language model created by Facebook in late 2019. In essence, we show it a sentence in Modern English, and then show it the Middle English translation. Every so often, we test the model by asking it to translate some sentences it hasn't seen before; we use that test to see how much the model is learning.
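I won't walk through the full training code, but a minimal fine-tuning sketch using the Hugging Face transformers library might look like the following. The facebook/bart-base checkpoint is real; the hyperparameters are illustrative guesses, and pairs.tsv is the hypothetical file from the alignment sketch above.

```python
# A minimal fine-tuning sketch, not the actual training script.
import csv
import torch
from torch.utils.data import DataLoader
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Load the aligned pairs and hold out a slice for evaluation later.
with open("pairs.tsv", encoding="utf-8") as f:
    rows = csv.reader(f, delimiter="\t")
    next(rows)                                  # skip the header row
    pairs = [(modern, middle) for modern, middle in rows]
train_pairs, test_pairs = pairs[:-2000], pairs[-2000:]

def collate(batch):
    modern, middle = zip(*batch)
    # Modern English is the source; Middle English is the target.
    enc = tokenizer(list(modern), padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(list(middle), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # don't score padding
    enc["labels"] = labels
    return enc

loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss   # cross-entropy against the target tokens
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```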
At the end of training, we test the model using a metric called BLEU (a measure of machine translation goodness) on another batch of sentences it hasn't seen before. This model achieves a BLEU score of about 17 (not great), but the score is artificially deflated by features of Middle English such as spelling inconsistencies and the irregular grammar of a changing language. While the BLEU score gives us some idea of how good our translator is, the best evaluation method is testing it out.
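I won't reproduce the exact evaluation code either; sacrebleu is one common BLEU implementation, and a sketch of the final test might look like this, reusing the model, tokenizer, and held-out test_pairs from the fine-tuning sketch above.

```python
# An evaluation sketch; sacrebleu is an assumed choice of BLEU scorer.
import torch
import sacrebleu

model.eval()

def translate(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**enc, num_beams=4, max_length=128)
    return tokenizer.batch_decode(out, skip_special_tokens=True)

sources = [modern for modern, _ in test_pairs]
references = [middle for _, middle in test_pairs]
hypotheses = translate(sources)

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")
```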
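For that kind of hands-on testing, a tiny interactive loop is enough, reusing the hypothetical translate() helper from the evaluation sketch.

```python
# Translate sentences typed at the prompt; an empty line exits.
while True:
    line = input("Modern English> ").strip()
    if not line:
        break
    print("Middle English:", translate([line])[0])
```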