Aghaster

Universal Translator

18 posts in this topic

Hi,

I'm a computer programmer who has become very interested in the problem of languages in the world. With globalisation, small languages are rapidly becoming extinct. In the past these languages could survive easily, since people travelled shorter distances and were exposed to less foreign influence, but now they are doomed. Now I hear you say "But globalisation is a process that cannot be stopped". True, absolutely true. But isn't there something we can do? The old way of trying to solve the problem of languages in the world is to create a conlang and try to teach it to everybody. There are tons of conlangs out there, all trying to become popular. I learned Esperanto in a few months, and that let me see the problem from a different angle. After all, aren't languages just made out of words? English is called an "international" language because it is the most widespread. Chinese is the most spoken language in the world, but it is concentrated mostly in China. Esperanto is spread all over the world, but has only about 2 million speakers. All this brings me to ask myself: couldn't we build the fictional device called the Universal Translator, like in Star Trek?

Okay. I can see you coming to tell me "But that cannot be done, it is just science fiction". Do not think I'm dumb: I am perfectly aware it is pure science fiction and will probably never get done. But this question brought me to ask myself "Is there any way to provide accurate automatic translation?" Babelfish fails miserably at that task. If you give Babelfish "to throw up" to translate into French, it'll give you "pour jeter vers le haut". You cannot be any more inaccurate than that; it looks like plain word-for-word translation. How can we improve this? In my opinion, a better automatic translator needs to be given more information than just a translation of each word. If you analyse how a language works, the meaning of a word is influenced by the surrounding words. "To throw" and "to throw up" do not mean the same thing. Some verbs shouldn't be translated to the same verb regardless of their direct object. For example, killing a process in computer terms isn't the same as killing someone, when translated to French. How can we fix this major problem in automatic translation? By classifying all words into categories, and into more than just linguistic categories such as "adverb, verb, adjective". Each word in the dictionary would have its translation into the target language chosen according to the words around it that can influence its meaning. This way, we avoid the problems encountered by Babelfish and word-for-word translation.
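To make the idea concrete, here is a minimal sketch in Python (chosen for brevity) of a lookup that lets neighbouring words change the translation. The `PHRASES` table and the `translate_tokens` function are hypothetical names invented for this illustration, and the greedy longest-match strategy is only one possible way to do it, not a finished design.

```python
# Hypothetical toy dictionary: keys are English word sequences,
# values are the French translations mentioned in this post.
PHRASES = {
    ("throw",): "jeter",
    ("up",): "vers le haut",
    ("throw", "up"): "vomir",
}


def translate_tokens(tokens, phrases=PHRASES):
    """Translate a list of words, preferring the longest known phrase
    that starts at each position (greedy longest match)."""
    max_len = max(len(key) for key in phrases)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in phrases:
                out.append(phrases[key])
                i += n
                break
        else:
            # Unknown word: pass it through unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

With this table, `translate_tokens(["throw", "up"])` picks the two-word entry instead of translating "throw" and "up" separately, which is exactly the mistake described above.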

Now, what about having a very complete dictionary? What about typing mistakes and uncommon words? My idea is to make this program work like a wiki. Users would be invited to report unknown words or inaccurate translations to the main server, where all the linguistic information would be kept. Other users could then confirm whether a translation is accurate or not. The software could also scan Wikipedia articles to detect words it doesn't know and pass them to a human translator, who would add them to the dictionary. Common errors in the language should also be added to the dictionary, and the translation software would simply replace them with their "proper language" equivalent before processing the translation.
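A rough sketch of that pre-processing step, in Python: known errors are replaced by their proper form, and words the dictionary has never seen are collected so they can be reported to a human. The `CORRECTIONS` and `KNOWN_WORDS` tables and the `normalise` function are made-up examples, not real data.

```python
# Hypothetical tables; a real system would load these from the shared server.
CORRECTIONS = {"teh": "the", "recieve": "receive"}
KNOWN_WORDS = {"the", "receive", "ball"}


def normalise(tokens, corrections=CORRECTIONS, known=KNOWN_WORDS):
    """Return (cleaned_tokens, unknown_words).

    Each token is lower-cased, replaced by its corrected form if it is a
    recorded error, and listed as unknown if the dictionary lacks it."""
    cleaned = []
    unknown = []
    for token in tokens:
        word = corrections.get(token.lower(), token.lower())
        cleaned.append(word)
        if word not in known:
            unknown.append(word)
    return cleaned, unknown
```

The `unknown` list is what would be sent to the server for other users to review.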

Another problem is words that have more than one meaning. A person can guess the right meaning from the context. This translator could do the same, by analysing the classes of the surrounding words. For example, getting back to our example of "throwing up", if words like sickness, stomachache or illness were encountered in the text, then there is a higher chance that "throw up" really means to vomit. These words would be in a class called "health state" or something like that. But then someone could use a word with a meaning completely out of context. How do we detect that? Here comes human assistance...
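Here is one way this could look, as a small Python sketch. The class table, the `SENSES` table and the `rank_senses` function are all hypothetical, and counting matching context words is just the simplest scoring rule I could think of.

```python
# Hypothetical word classes, following the "health state" example above.
WORD_CLASSES = {
    "sickness": "health state",
    "stomachache": "health state",
    "illness": "health state",
}

# Each candidate translation lists the word classes that support it.
SENSES = {
    "throw up": [
        ("vomir", {"health state"}),
        ("lancer vers le haut", set()),
    ],
}


def rank_senses(phrase, context_words, senses=SENSES, word_classes=WORD_CLASSES):
    """Score each candidate by how many context words belong to one of its
    supporting classes, and return (translation, score) pairs, best first.
    Ties keep the order of the dictionary entry."""
    context_classes = [word_classes[w] for w in context_words if w in word_classes]
    scored = [
        (translation, sum(1 for c in context_classes if c in cues))
        for translation, cues in senses[phrase]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

When no context word gives a hint, every score is 0, and that is the case where the human reader has to decide.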

To have the most accurate automatic translation, we need to make it less automatic. The translation program would simply output more than one possible translation for the ambiguous part of the text, and the human reader would have to interpret it. This could be done like in tests where we have to circle the right translation among a certain number of choices, for example "to throw up: (lancer vers le haut/vomir)", or the translation program could simply prompt the user.

I have begun learning the tools I plan to use to implement my concept for the translator. The tools I intend to use are SQLite with C for the command line tools, and eventually a nice GUI application using C++ with some GUI library (cross-platform, of course).

Please let me know what you think about this!

And if you have an idea of how to make the real Universal Translator, I'm of course interested, even if I doubt a real solution exists.

-Aghaster


You would need more developers.


This I figured out :P A friend of mine is interested in starting this with me. I know lots of linguists on the internet, as I always hang out in communities like unilang and polyglot. Anybody who is interested in working on this project is welcome.


I might. My friend is from Ukraine, and Russian is his native language.


Nice, I am open to any language. I'd also like to make this project a central place for linguistic resources.


I am Bulgarian and if you want, I could try to find time to contribute. I can also help out with Spanish and Russian, and even English.


Okay, so we need to know what each member of this future team knows. I'm currently learning SQLite. I have good C/C++ knowledge (I wrote a C++ tutorial, www.planetcpp.info, and read a good C tutorial afterwards). Here is how I intend to design the program: C routines that use SQLite, compiled as libraries. These libraries would then be used by a GUI. For the database part, the information to store is this: the word, and the surrounding words that modify its meaning, each with the resulting translation. These surrounding words can be a group of words or specific words. Unicode will be used, no local encodings. SQLite stores text as Unicode (UTF-8 or UTF-16) anyway. Before beginning to code anything, we need to set up a good plan for the program.
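To give an idea of the database part, here is a small sketch using Python's built-in `sqlite3` module instead of the C API, only because it is shorter to show. The table name, the column names, and the `build_db` and `lookup` helpers are assumptions made for this example; the real schema still has to be planned.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one hypothetical table.
    A NULL context means "default translation when no known neighbour matches"."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE translation (
            word        TEXT NOT NULL,
            context     TEXT,
            target_lang TEXT NOT NULL,
            translation TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO translation VALUES (?, ?, ?, ?)",
        [
            ("throw", None, "fr", "jeter"),
            ("throw", "up", "fr", "vomir"),
        ],
    )
    return conn


def lookup(conn, word, neighbours, lang):
    """Prefer a row whose context is one of the neighbouring words,
    then fall back to the default row, then to None."""
    for neighbour in neighbours:
        row = conn.execute(
            "SELECT translation FROM translation "
            "WHERE word = ? AND context = ? AND target_lang = ?",
            (word, neighbour, lang),
        ).fetchone()
        if row is not None:
            return row[0]
    row = conn.execute(
        "SELECT translation FROM translation "
        "WHERE word = ? AND context IS NULL AND target_lang = ?",
        (word, lang),
    ).fetchone()
    return row[0] if row is not None else None
```

A call like `lookup(build_db(), "throw", ["up"], "fr")` finds the context-specific row, while an unrelated neighbour falls back to the default translation.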


In order to make a universal translator, you should first create some kind of algorithm, which can then be adapted to any language. You don't really need more than two or three different languages during the development stages of this algorithm.


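There's already a formal framework for analyzing languages in general. It's called the Chomsky hierarchy (or Chomsky–Schützenberger hierarchy), and it classifies formal grammars rather than being an algorithm in itself. A parser built on such a grammar, when it's provided with all the rules of a specific language, can take a simple sentence (like "This is my ball.") and break it down like so: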

      S
     / \
  this   is
           \
           ball
           /
         my

The problem with this hierarchy theory is that it best applies to English, which (despite how it's used) is very formal in terms of noun-verb and adjective-noun usage. Unlike in English, in other languages you can leave out nouns and even pronouns ("Ésta es mi bola." and "Es mi bola." are both valid Spanish; and "これは私のボールです。" and "ボールです。" are also both valid Japanese). As you can see, it's almost impossible for a machine to figure out who the ball belongs to if we're not using English!
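As a rough illustration, here is a tiny hand-written parser in Python for a toy English grammar (S -> NP VP, NP -> Pron | Det N, VP -> V NP). The grammar, the `LEXICON` table and the function names are assumptions made up for this sketch, and the tree it builds is an ordinary phrase-structure tree, not necessarily the same shape as the diagram above.

```python
# Hypothetical four-word lexicon mapping each word to its category.
LEXICON = {"this": "Pron", "is": "V", "my": "Det", "ball": "N"}


def parse_np(tagged, i):
    """NP -> Pron | Det N. Return (tree, next_index), or None if no NP starts at i."""
    if i < len(tagged) and tagged[i][0] == "Pron":
        return ("NP", tagged[i]), i + 1
    if i + 1 < len(tagged) and tagged[i][0] == "Det" and tagged[i + 1][0] == "N":
        return ("NP", tagged[i], tagged[i + 1]), i + 2
    return None


def parse_sentence(words):
    """S -> NP VP, VP -> V NP. Return a nested-tuple tree, or None if the
    words do not fit this toy grammar. Unknown words raise KeyError."""
    tagged = [(LEXICON[w], w) for w in words]
    subject = parse_np(tagged, 0)
    if subject is None:
        return None
    subject_tree, i = subject
    if i >= len(tagged) or tagged[i][0] != "V":
        return None
    verb = tagged[i]
    obj = parse_np(tagged, i + 1)
    if obj is None:
        return None
    object_tree, j = obj
    if j != len(tagged):
        return None  # leftover words the grammar cannot place
    return ("S", subject_tree, ("VP", verb, object_tree))
```

Feeding it `["is", "my", "ball"]` returns `None`, because the rules insist on an explicit subject. A grammar for Spanish or Japanese would need extra rules for the dropped words.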


I can give you a hand with Spanish. I'm from Argentina, so we have the worst slang ever, meaning I understand Mexican Spanish, Spanish Spanish, Chilean Spanish, Bolivian Spanish... you get the idea :)

The best way to determine what kind of sentence should be output in Spanish is usually to be able to "draw" a syntax rule that looks at the conjugation for each subject in the sentence, the tenses that appear, the verb and its relations... it's rather complicated and probably long/heavy/heartbreaking... but if you want an accurate translator, you need it to reason as close to a person's mind as possible :)

Edited by Enkil

but if you want an accurate translator, you need it to reason as close to a person's mind as possible :)

Yes. That is what we are analyzing here. How does the human mind work with languages? We can take the question further, and do a bit of philosophy: how can a mere word mean something? Machines are given words they never understand from a human point of view. But we can bring the machine to imitate the human analysis of words. In this case, we must push it as far as the machine can go, that is, the part that can be automated. The rest that the machine can't figure out - let the human user interpret it. There are ways we can make the machine "judge" whether a word is more likely to mean one thing than another, using the words in context and the subject of what is being translated. The translator should be able to give each possibility a % chance, and give the most likely one as the first choice. Less likely possibilities should be given too. In a completely automatic translation I suggest these different possibilities be given in the form (possibility1/possibility2/possibility3) etc. The reader could easily read the text and choose the words that seem appropriate, with the first ones always being the ones most likely to be right. And btw, thanks to the people who have been posting about text analysis algorithms. It helps me think about the design of the software.
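A small Python sketch of that output format: given a weight for each candidate translation, it ranks them, converts the weights to percentages, and builds the (possibility1/possibility2) text. The function name `format_alternatives` and the example weights are hypothetical; where the weights would come from is exactly the open question.

```python
def format_alternatives(scores):
    """scores maps each candidate translation to a non-negative weight.

    Return (text, shares): text is "(best/second/...)" and shares is a list
    of (translation, rounded percentage) pairs in the same order."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive weight")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    shares = [(translation, round(100 * weight / total)) for translation, weight in ranked]
    text = "(" + "/".join(translation for translation, _ in shares) + ")"
    return text, shares
```

For example, weights of 3 for "vomir" and 1 for "lancer vers le haut" give the text "(vomir/lancer vers le haut)" with shares of 75 and 25.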


I honestly have no knowledge to help with algorithms :( I'd love to help with the translating and see what I can learn from the under-the-hood stuff as well. Offering various options is very valid, as English has some verbs that stay the same, while Spanish has 5 or 6 variations and ways of putting them in sentences...

About how to analyze... it's just syntax-based, from my point of view. If a person decides with a Yes/No structure of thought, or a logical and/or process of premise and inference, a machine can emulate it as closely as possible.

It would be interesting to, as you say, generate 3 or 4 translations, each of them following a specific pattern. People testing the translation would point out which one satisfies them the most. That way, a big database with a pool of "things in common" between each piece of data can be generated... again, I'm a total noob :unsure: Statistics are what make me decide, the higher odds towards x than y... that, I believe, is how my mind works :)

Anyways, just drop me a PM when you need some work done, I check the forums daily ^_^


What are your programming skills? Linguistic knowledge alone can help, but I'd like to know the skills of everybody proposing their help. In my case, I have good knowledge of C/C++ and I'm learning how to use sqlite.


I know very little JavaScript and HTML, and I'm learning PHP and looking into C++ at the moment. A few pointers to a specific area within the language that's needed could be helpful for me and the project though. If it involves some knowledge database expansion in my head, it's not that bad :D


Well, I'm ready to help. The computer languages I know are: C/C++, Java, PHP, MySQL and a few more...

I'm used to learning new things in computer science... I've done some management, and I think this could be a very interesting piece of software...

Well, I think we should find a good way to brainstorm and set guidelines for implementing this project... I'll look into SQLite and keep reading this forum!

See you!


You need fish to put into people's ears.


Sounds like quite an undertaking. Good luck!

I'd help, but I only know English and I've had 2 years of French in school.


The language you actually speak is not a problem, since most people who will work on this project have good English. The languages will take a long time to put in the database anyway. And for French, guyle and I are native French speakers, so it's okay :P
