Tuesday, March 24, 2015

A Universal Translator By Any Other Name…

Without the Universal Translator (UT) we wouldn’t be celebrating the 50th anniversary of Star Trek next year. Who wants to watch a TV show where people can’t communicate with one another and can’t figure out what they have in common? You might as well watch family Thanksgiving dinner videos.

Kirk and the Gorn captain were made to “fight it out” by the Metrons. 
Settling your differences man to alien was a big deal on the original 
series. The Gorn were reptile-like, and they had technology similar to 
the Federation’s. Notice the silver cylinders; each character has a universal
translator. The Gorn have also been spotted on The Big Bang Theory,
so their territory is growing.

As a storytelling convention, the UT allowed for near-instantaneous communication between species that had never met before. With the communication problem solved, the story could move on to conflict resolution and figuring out which female alien Kirk was going to kiss.

In the Original Series, the UT was a silver cylinder; you can see the Gorn and Kirk with them in the clip. By The Next Generation, they were incorporated into "com badges." In one episode, Riker and Counselor Troi had them as implants. The Ferengi had them in their ears, and apparently Quark’s had to be adjusted with a Phillips screwdriver every once in a while – although that may have been to remove ear wax.

Humans have about 6000 spoken languages on Earth as of March, 2015 – 6001 if you want to include rap. We're in quite the hurry to build translators that would help us understand one another - anything to avoid years of high school classes that lead to stronger brains but also bad foreign names and poor attempts at cooking.

In some ways, our translators have already passed those of Star Trek, but in other ways we're far behind. Most of our problems have to do with understanding just what things all languages have in common and what things are purely cultural, contextual, and completely without precedent. Let’s take a look at our efforts so far.

If you watched Star Trek, you may already realize the way we have surpassed some of their technology. The UTs of Kirk and Picard were for spoken language only. They still had to keep a crew member as a translator to figure out what signs meant on another ship or how to interpret alien consoles. We already have that licked.

Romulan text is supposedly related in visual character to Vulcan. 
One – someone studies this stuff? Two – I like the color scheme. If 
you came across this screen on a ship’s console, you’d know
it was important – the optical character reader/translator in Word Lens 
would come in handy here. Three – I find it hard to believe there 
isn’t a Romulan font package you can buy at the App Store.

Optical character readers have come a long way in the past few years. We now have cameras and software that can view written words in one language and automatically project them on the screen as translations in another language.

Google has one (Google Goggles, now Google Translate for Android), and there’s an app for that on the iPhone/iPad (called Word Lens, from Quest Visual, which Google bought in 2014).

And of course we have translators for written words – you type in what you want to say, and the software gives you a reasonable (meh) translation. Try translating a phrase in and out of a language several times and see what you end up with – it’s like a multicultural game of telephone operator.

The latest amazements are the vocal translators, but only for languages we have programmed in. Skype Translator was introduced in late 2014. You speak in Spanish or English while having a video chat. On the other end, it comes out in English or Spanish. Why? Because that’s the only translation they offer as of now. How? It's based on speech recognition software. It also gives you a written transcript of the conversation so you can post all the hilarious errors on Twitter (as with autocorrect).

It’s in the vocal translation arena that the Star Trek UT excelled. It was so good that the TV series just accepted that the translator was there, never broke down, and let us hear everything in English. They didn’t even bother making the aliens’ lips (if they had them) move out of synch with the English translation!

The Rosetta Stone was discovered in 1799 by one
of Napoleon’s soldiers. It was a decree from 196
BCE on behalf of King Ptolemy V. The top is the
decree in ancient Egyptian hieroglyphs, the middle
is the same decree in Demotic script, and the
bottom is the decree in ancient Greek. Having the
same text in three languages allowed us to decipher
hieroglyphics for the first time.

Most importantly, the Star Trek UT had one feature that none of ours currently do. It could decipher and translate languages that had never been encountered before – like rap.

In principle, the Federation members would have their new alien acquaintances talk into the translator for a while. The device, using deciphering algorithms and the linguacode matrix (invented by an Enterprise linguist), would learn the language and then translate it. This seems hinky to me.

Every time a new word was encountered, it would seem to me that the translator would have to either wait till it heard it enough times to decipher its meaning or extrapolate its meaning from context. Neither of these things could occur in real time. It seems to me that the “talk into it” phase would be very long.

Basically, the hardware of a translator is easy. It’s the software that we have to work on. A 2012 paper presented to the Association for Computational Linguistics (yep, just call ‘em the UT geeks) used statistical models to try and train language programs better.

Up to this point in time, vocabulary has been the choke point in trying to speed deciphering and translation. By using the statistical commonalities of all languages (if they can be found and relied upon), the need for so much vocabulary would be eased.

Any of these real-life software algorithms (or the fictional linguacode matrix) will be based on ideas presented in the 1950s by American linguist, philosopher, and political activist Noam Chomsky and others.

Noam Chomsky was born in 1928, and he hasn’t
been quiet since. He isn’t boisterous by any means,
but he has an opinion he’s willing to debate you on
for just about everything. Linguistics is his game, but
woe is the person who believes he only knows the
structure of language – many a debating opponent
has been skewered by his blunt and ungilded prose/speech.

Chomsky put forth the hypothesis that all languages share universal similarities. He claims the existence of a biologic faculty in all organisms of high brain function that exists for innate language production and use; basically he’s saying that language is genetic. With this approach, it should be possible to write software that could break any language into these similar patterns and then decipher it.

Ostensibly, the more languages that were encountered, the better the UT would work. On the other hand, maybe there’s not a biologic universality to language, but word order is mimicked in all languages – how we build a language is universal.

Either one of these scenarios would make it easier for a computer program to take a completely unknown language and put it through algorithms that might discern order and then meaning.

But a recent study is inconsistent with these ideas. According to a 2011 paper in Nature, word order is based more on historical context within a language family than on some universal constant or similarity. The authors found that many different sentence part combinations, like verb-object (or object-verb) or preposition-noun (or the reverse), are influenced by other structure pairs within the sentence.

One word preceding the other in some languages caused a reversal in other pairs, while the reverse might be true in other language families.  The way that sentence structure via word ordering evolved does not follow an inevitable course – languages aren’t that predictable. Bad news for computer-based word order help.

In 2600 BCE the Indus valley civilization had a
population of over 5 million. Cities have been
excavated and impressive art has been found. The
tile above shows a rhino, polka-dotted at that,
apparently with polish on its toenails. The symbols
above may be a written language. It’s a big deal which
language it might be related to, since Pakistan and
India are still fighting over this region.

It’s a roller coaster ride trying to figure out if computer power is going to solve our UT problems. We were at a low point with the paper above, but in 2009 we got some speed over a hill. In the 2009 paper, a computer algorithm to calculate conditional entropy was used in an effort to investigate a 5000-year-old dead language.

The Indus civilization was the largest and most advanced group in the 3000 BCE world. Located in the border region of today’s India and Pakistan, they may have had a written language – we can’t tell. They had pictograph carvings, but what they mean is up in the air. There is no Rosetta Stone like the one we found for ancient Egyptian, and no one speaks or reads the Indus language now.

The algorithm for conditional entropy is used to calculate the randomness in a sequence of…. well, anything. Here they wanted to see if there was structure in the markings and drawings. The results suggested that the sequences were most like those in natural languages.
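To make the idea concrete, here is a minimal Python sketch of conditional entropy for a sequence of symbols. It is not the code from the 2009 paper; the function name and the toy sequences are made up for illustration, and it only looks at adjacent pairs of symbols. A result near zero means each symbol almost dictates the next one (rigid structure), while a high value means the next symbol is close to a coin toss.

```python
import math
from collections import Counter


def conditional_entropy(sequence):
    """Estimate H(next symbol | current symbol) in bits from adjacent pairs.

    Hypothetical helper for illustration only; a real analysis would need
    far more data and some correction for small sample sizes.
    """
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (first, _second), count in pair_counts.items():
        p_pair = count / total                # P(current, next)
        p_next = count / first_counts[first]  # P(next | current)
        entropy -= p_pair * math.log2(p_next)
    return entropy


# Rigid sequence: "a" is always followed by "b" and vice versa.
print(conditional_entropy("abababab"))   # 0.0
# Looser sequence: after "a" comes "a" or "b" equally often.
print(conditional_entropy("aabbaabba"))  # 1.0
```

Natural languages tend to land between the two extremes: not as rigid as the first toy sequence, not as free as random noise. That middle ground is what this kind of analysis goes looking for.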

But, just to prove it’s never that simple, linguist Richard Sproat (works for Google now) has contended that the symbols are non-linguistic. In 2014, he did his own larger analysis with several different kinds of non-linguistic symbols, and showed that the Indus pictographs fall into the non-linguistic category.

He rightly points out that computational analyses have a downfall in that biases could enter based on what type of text is selected and what that text depicts. I don’t think someone could pick up English if all they had to study were shopping lists.

But in other old languages, more progress has been made. One paper used a computer program to decipher and translate the ancient language of Ugaritic in just a few hours. The authors made several assumptions, the biggest one being that it had a known close relative (Hebrew in this case). This may not be possible when dealing for the first time with some new alien language.

Picard and Captain Dathon of the Tamarians had to
come to some meeting of the minds in order to survive
the beast on El-Adrel IV. He spoke only in metaphor, a
fact that Picard is slow to pick up on. Me - I just wonder
how Dathon didn’t drown when it rained. By the way, you
can get a T-shirt with just about any of Dathon’s sayings.
image credit - It's All About: Star Trek

They also assumed that the word order and alphabet usage frequencies would be very similar between the lost language and Hebrew. They then played these assumptions off one another until they came upon a translation. Ugaritic was deciphered by brute human force a while back, but it took many people many years to do it. This is how we know that the computer algorithm got it right – it just took 1/1000 of the time.
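The frequency assumption can be illustrated with a toy Python sketch. This is not the statistical model from the paper, which is far more sophisticated; it is just the crudest possible version of the idea, with a made-up function name and made-up text: rank the symbols of the unknown script by how often they appear, do the same for the known language, and pair them off rank by rank.

```python
from collections import Counter


def frequency_rank_mapping(unknown_text, known_text):
    """Pair symbols by frequency rank (most common with most common, etc.).

    Hypothetical illustration only. Whitespace is ignored, and ties in
    frequency are broken by order of first appearance, so the guess is
    only as good as the assumption that both texts use their alphabets
    in similar proportions.
    """
    unknown_counts = Counter(ch for ch in unknown_text if not ch.isspace())
    known_counts = Counter(ch for ch in known_text if not ch.isspace())
    unknown_ranked = [symbol for symbol, _ in unknown_counts.most_common()]
    known_ranked = [symbol for symbol, _ in known_counts.most_common()]
    return dict(zip(unknown_ranked, known_ranked))


# Made-up "lost" text and a sample of the known relative.
mapping = frequency_rank_mapping("XXXYYZ", "aaabbc")
decoded = "".join(mapping.get(ch, ch) for ch in "ZYX")
print(mapping)
print(decoded)
```

On real text a first guess like this would be full of mistakes, which is why playing several assumptions off one another matters: a mapping that yields plausible letter frequencies but nonsense words gets thrown out.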

But, even if we find universalities in language, the computer won’t be enough. An example comes from Star Trek itself, in an episode of ST:TNG called Darmok. The universal translator told Picard exactly what the aliens were saying, but it didn’t make any sense.

Their language was based on their folklore and history. All their phrases were metaphors of events in their past. So unless the UT knew this species’ particular history, it could only translate the words, not the meaning. Language is more than words in an order; language is the collective mind of a group connecting them to each other and to their world.

Next week, deflector shields.

Contributed by Mark E. Lasbury, MS, MSEd, PhD

Sproat, R. (2014). A statistical comparison of written language and nonlinguistic symbol systems. Language, 90 (2), 457-481. DOI: 10.1353/lan.2014.0031

Dunn, M., Greenhill, S., Levinson, S., & Gray, R. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473 (7345), 79-82. DOI: 10.1038/nature09923

Rao, R., Yadav, N., Vahia, M., Joglekar, H., Adhikari, R., & Mahadevan, I. (2009). Entropic Evidence for Linguistic Structure in the Indus Script. Science, 324 (5931), 1165-1165. DOI: 10.1126/science.1170391

Snyder, B., Barzilay, R., & Knight, K. (2010). A Statistical Model for Lost Language Decipherment. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010

1 comment:

  1. Nice article. The possibility of such a UT is made even more improbable by the reasonable assumption that humans and ETs have fundamentally different brains (if you can really speak of brains at all), because they don't have any evolutionary history in common.

    I hope that somebody is going to make a science fiction series where they consult scientists to create a more accurate prediction of the future. Especially in the area of life science, Star Trek fails.