pedrotuga

hacking google translation tool

14 posts in this topic

OK, I need to work out a command-line translation script for my own use.

I thought of sending a request to

http://translate.google.com/translate_t#

and extract the result.

To my surprise, Google actually managed to make it tricky for me.

As a test, I translated a word from one language to another.

I fired up ngrep and made it catch every packet containing the word 'mesa', which means 'table' in Portuguese.

Then I opened this URL in the browser:

http://translate.google.com/translate_t#en|pt|table%0A

Surprisingly, ngrep catches nothing. I think the request is made via JavaScript using some kind of encoding. This is probably easy to figure out; I just thought it would be simpler.

Anybody up for a small challenge?
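
One detail that may explain part of this: everything after the # in a URL is the fragment, which the browser keeps to itself and never puts in the HTTP request. So the en|pt|table part of that address is presumably read by the page's JavaScript, which then makes its own request. A minimal Python 3 sketch of the split (it only parses the URL quoted above and does not contact Google):

```python
from urllib.parse import urlsplit

# The URL from the post; nothing is fetched here, it is only parsed.
url = "http://translate.google.com/translate_t#en|pt|table%0A"
parts = urlsplit(url)

# What a browser would put on the request line: path plus query, no fragment.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(request_target)   # /translate_t
print(parts.fragment)   # en|pt|table%0A
```

Why the word 'mesa' never shows up in the capture is a separate question that this sketch does not answer.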


Does this work for you?

I used babel fish.

http://babelfish.yahoo.com/

#!/usr/bin/python

import urllib, re, sys

def main():
    if len(sys.argv) == 1:
        print 'Usage: %s <string to translate>' % sys.argv[0]
        sys.exit(0)

    # Join all arguments into one phrase and send it as POST form data.
    text = ' '.join(sys.argv[1:])
    data = "ei=UTF-8&doit=done&fr=bf-res&intl=1&tt=urltext&%s&lp=en_es&btnTrTxt=Translate" % urllib.urlencode({'trtext':text})
    resp = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', data).read()
    # The translation follows the opening tags of the id="result" element.
    trans = re.findall('(?<=id="result"><div style="padding:0.6em;">).*(?=<)', resp)[0]
    print trans

if __name__ == '__main__':
    main()

Edit: Are you going to use this for one-way translation, e.g. English --> Spanish, or do you want to translate between multiple languages?

If so, which languages?

Right now it is just a matter of changing the lp POST variable.

Edited by SwartMumba

Babel Fish won't do it, because I want to translate stuff from/to Swedish.


I made an irc bot with python doing something similar to this a long time ago. I actually used an html parsing lib (sgmllib) rather than messing around with packets though.

I did it by manipulating the url a bit. It goes like:

http://translate.google.com/translate_t?text='text to translate'&hl=en&langpair=es|en&tbb=1

This translates the words "text to translate" from Spanish to English (yes, I know "text to translate" is not Spanish; this was just the example in the notes I made).
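
As a rough sketch of how such a URL could be assembled in Python 3, letting urlencode take care of spaces and the | in langpair. The function name build_translate_url is made up for illustration; the parameter names are the ones in the URL above.

```python
from urllib.parse import urlencode

BASE = "http://translate.google.com/translate_t"

def build_translate_url(text, src, dst):
    """Hypothetical helper: build the translate_t URL for one language pair."""
    query = urlencode({
        "text": text,
        "hl": "en",
        "langpair": "%s|%s" % (src, dst),  # '|' is percent-encoded as %7C
        "tbb": "1",
    })
    return BASE + "?" + query

url = build_translate_url("el gato es blanco", "es", "en")
print(url)
# http://translate.google.com/translate_t?text=el+gato+es+blanco&hl=en&langpair=es%7Cen&tbb=1
```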

I accidentally deleted a good chunk of the code shortly after I finished the bot (this is why we keep backups :( ), so I don't have exactly what I did to parse the page returned from Google, but I hope this helps you out a bit.
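
For what it's worth, here is a minimal sketch of the HTML-parser approach in Python 3, where html.parser is the standard-library replacement for sgmllib. It is not the lost bot code. The sample page is invented, and the result_box id is an assumption taken from the scripts elsewhere in this thread.

```python
from html.parser import HTMLParser

class ResultBoxParser(HTMLParser):
    """Collect the text inside <div id=result_box> (assumed id)."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # > 0 while we are inside the target div
        self.chunks = []   # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == "result_box":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_result(html):
    parser = ResultBoxParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

# Invented sample page, only here to exercise the parser.
sample = ('<html><body><div id=other>x</div>'
          '<div id=result_box dir="ltr">the cat is <b>white</b></div>'
          '<div>y</div></body></html>')
print(extract_result(sample))   # the cat is white
```

Compared with a regex, this keeps working if attributes are reordered or inline tags appear inside the result.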


Is this ok?

#!/usr/bin/python

import urllib, urllib2, re, sys

def main():
    if len(sys.argv) == 1:
        print 'Usage: %s <string to translate>' % sys.argv[0]
        sys.exit(0)

    text = ' '.join(sys.argv[1:])
    # Note: this string is sent as the POST body of the single request below; it is not fetched as a URL.
    data = "http://translate.google.com/translate_a/t?client=t&%s&sl=en&tl=sv" % urllib.urlencode({'text':text})
    req = urllib2.Request('http://translate.google.com/translate_t', data, {'User-Agent':'Mozilla/5.0'})
    resp = urllib2.urlopen(req).read()
    # Grab everything after the result div's opening tag, then cut at the first '<'.
    trans = re.findall('(?<=<div id=result_box dir="ltr">).*', resp)[0]
    trans = trans[: trans.index('<')]
    print trans

if __name__ == '__main__':
    main()

<side note> Can anyone see what the problem is with this regex? --> '(?<=<div id=result_box dir="ltr">).*(?=<)'

It seems to parse up to the translation, but it seems to fail at '(?=<)'. I even tried '(?=\<)' just in case it was a special char.

</side note>
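
A likely explanation, shown as a small Python 3 sketch on made-up HTML: .* is greedy, so together with (?=<) it runs to the last < on the line rather than the first. Making it non-greedy with .*? stops at the first <. The sample string is an assumption; the real page may differ.

```python
import re

# Invented sample line with more markup after the result div.
html = '<div id=result_box dir="ltr">hacka</div><div id=other>x</div>'

greedy = re.findall('(?<=<div id=result_box dir="ltr">).*(?=<)', html)[0]
lazy = re.findall('(?<=<div id=result_box dir="ltr">).*?(?=<)', html)[0]

print(greedy)   # hacka</div><div id=other>x
print(lazy)     # hacka
```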

Edited by SwartMumba

That is fine. Thank you.

There's something strange though: ngrep didn't catch those packets. Weird.

For example, if I run this command,

ngrep -d any 'hacka'

it doesn't capture anything when I run your script with the same arguments as you showed before. If anybody knows why, please tell me.

What method did you use to find out the url? Did you use a packet sniffer? which one?

I didn't understand why you make two requests and extract the content from the HTML... I just get it from the URL that replies to Google Translate's AJAX requests.

#!/usr/bin/python

import urllib, urllib2, sys

# URL-encode the text so spaces and special characters survive in the query string.
url = "http://translate.google.com/translate_a/t?client=t&text=%s&sl=en&tl=sv" % urllib.quote_plus(sys.argv[1])

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
print opener.open(url).read()

Edited by pedrotuga
What method did you use to find out the url? Did you use a packet sniffer? which one?

I used the Firefox add-on Live HTTP Headers.

You can just as easily look at the source.

I didn't understand why you make two requests and extract the content from the html...

I didn't make two requests.

urllib2.Request builds a request object for urllib2.urlopen.
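
To illustrate, here is a Python 3 sketch (urllib.request is where urllib2 ended up in Python 3). Building the Request object touches no network; only a later urlopen(req) call would send anything, and it is deliberately left out here.

```python
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"text": "hello", "sl": "en", "tl": "sv"})

# This only describes the request: URL, body and headers.
req = urllib.request.Request(
    "http://translate.google.com/translate_t",
    data=params.encode("ascii"),
    headers={"User-Agent": "Mozilla/5.0"},
)

print(req.get_method())   # POST, because a body is attached
print(req.full_url)       # http://translate.google.com/translate_t
print(req.data)           # b'text=hello&sl=en&tl=sv'

# The one and only request would happen here:
# resp = urllib.request.urlopen(req).read()
```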


FYI, Google and Yahoo can be used as a superfast transparent proxy :)


If anyone is interested, I borrowed a few lines of SwartMumba's code to rebuild the section of the bot that I mentioned deleting in my previous post. It's pretty messy and feels incomplete right now, but it works as long as you don't attempt to use Arabic or do a few other things. If anyone wants to take a look at it, you can find it here.


#!/usr/bin/perl -w
# gtranslate1.pl, v0.01, by jabzor - for binrev [full-http post + parse]
use strict;
use LWP;

(my $o = $0) =~ s!.+[\\//]!!; #filename, hack for windows.

die "$o: missing option
Usage: $o [from] [to] text goes here...\n
Example '$o en fr translate this to french'\n"
    unless my ($from, $to) = (shift, shift), my $text = join(' ',@ARGV);

my $ua = LWP::UserAgent->new; $ua->agent('Mozilla/5.0');

my $response = $ua->post( "http://translate.google.com/translate_t",
    [ 'text'     => $text,
      'hl'       => 'en',
      'langpair' => "$from|$to",
      'tbb'      => '1',
    ]
);

die "error: ", $response->status_line
    unless $response->is_success;

if( $response->content =~ m{<div id=result_box dir="ltr">(.*?)</div>} ) {
    print "{$from}: [ $text ]\n{$to}: [ $1 ]\n";
} else {
    print "error: Couldn't find the match in the response!\n";
}

#!/usr/bin/perl -w
# gtranslate2.pl, v0.02, by jabzor - for binrev [simple http get the initial location]
# change-log: v0.02 - added support for unpacking the 'auto' translate
use strict;
use LWP;

(my $o = $0) =~ s!.+[\\//]!!; #filename, hack for windows.

die "$o: missing option
Usage: $o [from] [to] text goes here...\n
Example '$o en fr translate this to french'\n"
    unless my ($from, $to) = (shift, shift), my $text = (join ' ', @ARGV);

my $ua = LWP::UserAgent->new; $ua->agent('Mozilla/5.0');

my $response = $ua->get(
    "http://translate.google.com/translate_a/t?client=t&text=$text&sl=$from&tl=$to"
);

die "error: ", $response->status_line
    unless $response->is_success;

if( $response->content =~ m{^\"(.*)\"$} ) {
    print "{$from}: [ $text ]\n{$to}: [ $1 ]\n";
} elsif ($response->content =~ m{\["(.+[^\\])","(.+)"\]}) {
    print "{$2}: [ $text ]\n{$to}: [ $1 ]\n";
} else {
    print "error: ",$response->content,"\n";
}

Currently supported language values (other languages are recognized but unsupported, e.g. is = Icelandic).
The values can easily be grepped from the remote site into a local database for future updates if needed.
ar Arabic
bg Bulgarian
ca Catalan
zh-CN Chinese (Simplified)
zh-TW Chinese (Traditional)
hr Croatian
cs Czech
da Danish
nl Dutch
en English
tl Filipino
fi Finnish
fr French
de German
el Greek
iw Hebrew
hi Hindi
id Indonesian
it Italian
ja Japanese
ko Korean
lv Latvian
lt Lithuanian
no Norwegian
pl Polish
pt Portuguese
ro Romanian
ru Russian
sr Serbian
sk Slovak
sl Slovenian
es Spanish
sv Swedish
uk Ukrainian
vi Vietnamese

>gtranslate2.pl

gtranslate2.pl: missing option

Usage: gtranslate2.pl [from] [to] text goes here...

Example 'gtranslate2.pl en fr translate this to french'

>gtranslate1.pl es en "el gato" es blanco

{es}: [ el gato es blanco ]

{en}: [ the cat is white ]

>gtranslate2.pl en pl "translate this in to polish"

{en}: [ translate this in to polish ]

{pl}: [ przetłumaczyć to na polski ]

gtranslate2 uses far less bandwidth, so I would use it, though both give the same results. Either way, have fun. :D

EDIT:

Updated gtranslate2 to support unpacking the 'auto' translate.

Example: gtranslate2.pl auto en le ciel est en baisse

{fr}: [ le ciel est en baisse ]

{en}: [ the sky is falling ]

If the language is not supported by Google but is known, it will return an error:

gtranslate2.pl auto en stundar

error: ["We are not yet able to translate from Icelandic into English."]
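
The response shapes that gtranslate2 unpacks could be handled roughly like this in Python 3. This is only a sketch: it assumes the body is valid JSON (the Perl script uses regexes instead), the function name unpack is made up, and the sample bodies are modeled on the outputs shown above.

```python
import json

def unpack(body, requested_from):
    """Return (source_language, translation) from a translate_a/t style body."""
    value = json.loads(body)
    if isinstance(value, str):
        # Plain quoted string: the source language is the one we asked for.
        return requested_from, value
    if isinstance(value, list) and len(value) == 2:
        # 'auto' mode: [translation, detected language]
        text, detected = value
        return detected, text
    # Anything else (e.g. a one-element list with a message) is an error.
    raise ValueError("unexpected response: %s" % body)

print(unpack('"hacka"', "en"))                        # ('en', 'hacka')
print(unpack('["the sky is falling","fr"]', "auto"))  # ('fr', 'the sky is falling')
```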

I'm pretty impressed with the Google translation service.

{en}: [ the new york times newspaper ] {fr}: [ le quotidien New York Times ]

{en}: [ the old street newspaper ] {fr}: [ l'ancien journal de rue ]

{en}: [ the Old Street newspaper ] {fr}: [ la Old Street journal ]

Edited by jabzor
You can just as easily look at the source.

You mean that brutal unindented amount of js?

I didn't make two requests.

urllib2.Request builds a request object for urllib2.urlopen.

I looked a bit more carefully. But why not fetch the data directly from the URL that responds to the AJAX calls?

google and yahoo can be used as a superfast transparent proxy FYI :)

What do you mean? Can you give an example?

gtranslate2 uses far less bandwidth I would use it.. both give the same results; either way, have fun with it. :D

My latest code fetches the data from the AJAX server, so each response contains nothing but the headers and the translated string.


Doing a GET on the URL for Google's web based translation tool, and then parsing the returned HTML is all well and good, but what if Google alters their page layout? It could break your script.

Google also offers this service as a web API, which is better suited to use in scripts and programs. The API accepts a GET request, as in most of the examples above, but returns the response as a lightweight JSON data structure. Most programming languages have libraries that parse JSON responses into easy-to-use associative arrays (a hash in Perl, or a dictionary in Python).

So, basically you'd run a GET on a URL looking like this:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7Ces

Which would give you the following response:

{"responseData": {"translatedText":"hola mundo"}, "responseDetails": null, "responseStatus": 200}

Or, instead of generating the request and parsing the raw response in your own program, some languages may offer libraries to do it for you. For example, Perl has a module called REST::Google::Translate that makes the request and returns the response as a Perl object. See http://search.cpan.org/~ejs/REST-Google-1....e/Translate.pod for more information.
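
As a small Python 3 sketch of the parsing side, using the sample response quoted above. The HTTP request is left out so the example runs offline, and the function name extract_translation is made up for illustration.

```python
import json

def extract_translation(raw):
    """Pull translatedText out of a response shaped like the sample above."""
    data = json.loads(raw)            # JSON object -> Python dict
    if data.get("responseStatus") != 200:
        raise RuntimeError("translation failed: %s" % data.get("responseDetails"))
    return data["responseData"]["translatedText"]

# The sample response from this post, copied verbatim.
sample = ('{"responseData": {"translatedText":"hola mundo"}, '
          '"responseDetails": null, "responseStatus": 200}')

print(extract_translation(sample))   # hola mundo
```

Checking responseStatus first means a failed lookup raises a clear error instead of a KeyError on a missing field.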

I don't like Google APIs: you have to provide a key, so they know what you're doing. Personally, I consider that my own business.

But if you look at my script carefully, I am not fetching the content from HTML; I am fetching it from their AJAX address, which serves the raw text:

http://translate.google.com/translate_a/t?...sl=en&tl=sv

Edited by pedrotuga

The ircII script LiCe by SrFrog used to do this when BabelFish was still at AltaVista; I noticed babelfish.altavista.com redirects to babelfish.yahoo.com now.

