Document Googler


15 replies to this topic

#1 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 03 April 2007 - 09:28 PM

Does anybody know of a way to check whether part of a document (or preferably a group of documents) has been indexed by a search engine like Google? For example, if I had a list of corporate internal phone numbers and wanted to see if any part of it was on Google, I could feed the document into a program and it would check Google.

I've heard of professors using programs to check for plagiarism. Does anybody have any experience with this?

#2 Zeph

Zeph

    OMG, so close to "1337"!

  • Agents of the Revolution
  • 1,319 posts

Posted 03 April 2007 - 09:33 PM

Just put the things you want to look for in google with quotes.

#3 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 03 April 2007 - 09:39 PM

Just put the things you want to look for in google with quotes.


Well yes, but say I want to search a set of documents, maybe each document is about 30 pages and there are several documents. I would like to be able to make several thousand google searches with queries derived from these documents automatically.

Another application would be something like the USB Hacksaw or the USB Slurper: take everything off of a machine and then quickly run the documents through to see if anything you just pulled is already online somewhere.

#4 Alk3

Alk3

    "I Hack, therefore, I am"

  • Binrev Financier
  • 1,003 posts
  • Gender:Not Telling
  • Location:312 Chi-town

Posted 03 April 2007 - 10:21 PM

I am also interested to see this.

#5 Zeph

Zeph

    OMG, so close to "1337"!

  • Agents of the Revolution
  • 1,319 posts

Posted 03 April 2007 - 10:56 PM

Write a script then send it to me.
lol

#6 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 04 April 2007 - 01:09 AM

Write a script then send it to me.


It might not be that hard to code, just something that would pull out random sentences and look for matches, but from what I understand there are already programs that do this well.
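A minimal sketch of the "pull out random sentences" idea, in Python. The function names, the six-word minimum and the naive sentence splitter are all assumptions made for illustration, not part of any existing tool. It only builds the quoted queries; actually submitting them is left out.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]


def sample_queries(text, count=5, min_words=6, seed=None):
    """Pick up to `count` random sentences that have at least `min_words`
    words and wrap each one in double quotes for an exact-phrase search."""
    rng = random.Random(seed)
    candidates = [s for s in split_sentences(text)
                  if len(s.split()) >= min_words]
    chosen = rng.sample(candidates, min(count, len(candidates)))
    # Drop trailing punctuation and embedded quotes so the phrase stays valid.
    return ['"{}"'.format(s.rstrip('.!?').replace('"', '')) for s in chosen]
```

Short sentences are skipped because they are likely to match unrelated pages; the threshold is a guess and would need tuning.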

#7 Linux

Linux

    SUP3R 31337 P1MP

  • Banned
  • 278 posts

Posted 04 April 2007 - 08:03 AM

Perl + LWP is great for this type of stuff...

#8 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 04 April 2007 - 12:00 PM

Yes but it would be nice to have a program that is already coded. Something that would be able to do advanced things like after it finds a match it starts googling that section more specifically, maybe even shifting the words to switch it from 3rd person to 1st person or something. Those kinds of things.

#9 R3c0n

R3c0n

    SUPR3M3 31337 Mack Daddy P1MP

  • Members
  • 411 posts
  • Location:Daytona, Florida

Posted 04 April 2007 - 02:07 PM

For plagiarism etc., schools (including my old school) used turnitin.com, and it was actually quite effective... at least 6 people in our class got caught... but it's fairly easy to get around if you are even half intelligent.

I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...

If you are specifically looking for things like Word, Excel or PowerPoint documents, you can use "filetype:xls", "filetype:doc" etc. followed by the string to search for. That also works out fine. I used to download pre-made presentations off Google for psych class... lawl
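As a rough illustration of combining the `filetype:` operator with a quoted phrase, here is a small Python sketch. `build_query` and `query_string` are hypothetical helper names; the code only builds the query text and its URL-encoded form and does not contact any search engine.

```python
from urllib.parse import urlencode


def build_query(phrase, filetype=None):
    """Return an exact-phrase query, optionally restricted to a file type."""
    query = '"{}"'.format(phrase.replace('"', ''))
    if filetype:
        query = "filetype:{} {}".format(filetype, query)
    return query


def query_string(phrase, filetype=None):
    """URL-encode the query as a `q=` parameter."""
    return urlencode({"q": build_query(phrase, filetype)})


print(build_query("reset the default password", "doc"))
# filetype:doc "reset the default password"
```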

#10 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 04 April 2007 - 02:26 PM

I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...


OK, let me create a usage scenario. Let's say I had a document that should not be available anywhere, like an interoffice memo with instructions on how the phone system operates. I would like to be able to put this memo into a program that would then search Google using the text from the memo, and determine whether all or part of the document is indexed online.

It might do this by reporting "Exact phrase match on lines 12 and 13" if it found an exact sentence match. After all, what are the odds that a specific sentence is going to be a perfect match on Google by chance?

Or by numbering, if there is a specific sequence of numbers it should be able to search. So if the memo said "To reset to the default password dial your voicemail box followed by 13254679*001", it should be able to search google for that string and find that somebody posted on a forum "yeah man, I found this system and all you got to do to reset the password is punch in 13254679*001".
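The number-sequence idea could be sketched in Python like this. The pattern and the thresholds (at least 8 characters, at least 6 of them digits) are assumptions picked to catch strings like the one in the example; numbers written with dashes or spaces would need a different pattern.

```python
import re


def extract_number_strings(text, min_length=8, min_digits=6):
    """Find runs of digits, '*' and '#' that look like dial codes.

    Returns the distinct matches in the order they first appear."""
    pattern = r'[\d*#]{%d,}' % min_length
    found = []
    for token in re.findall(pattern, text):
        digits = sum(ch.isdigit() for ch in token)
        if digits >= min_digits and token not in found:
            found.append(token)
    return found
```

Each returned string could then be searched on its own, in quotes, since a long code like that is unlikely to show up by coincidence.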

And then something fuzzier, like "80% of lines 22:29 match this document found here", for when somebody has changed the wording slightly.

This is kind of disappointing. It seems like this would be very handy to have, and I thought for sure somebody would have coded something by now. I'll have to code this myself, I guess. Can anybody suggest a language that would be good for basically just running grep and search queries? Maybe Perl or Ruby...
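One way the fuzzy percentage might be computed, sketched in Python with the standard library's `difflib`. This is only an assumption about how to score it: it measures what fraction of the memo's words appear, in order, in the text of a page that has already been downloaded. `coverage` and `words` are made-up names.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9*#]+", text.lower())


def coverage(memo_text, page_text):
    """Fraction of the memo's words found in matching runs within the page."""
    memo_words = words(memo_text)
    if not memo_words:
        return 0.0
    matcher = SequenceMatcher(None, memo_words, words(page_text),
                              autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(memo_words)


memo = "To reset the default password dial your voicemail box"
page = "yeah man, to reset the standard password dial your voicemail box and done"
print("{:.0%} of the memo matches".format(coverage(memo, page)))
```

Feeding it only the text of lines 22 to 29 would give a per-range score like the one described above.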

Edited by Drake Anubis, 04 April 2007 - 02:27 PM.


#11 jabzor

jabzor

    hax?

  • Agents of the Revolution
  • 1,146 posts
  • Country:
  • Gender:Male
  • Location:Northern Elbonia, fighting the lefties

Posted 04 April 2007 - 03:53 PM

Perl will work great (text parsing is where it scores all the big points). curl/wget and grep would work as well, and Ruby should work fine.
Really, you could do this in almost any language, including JavaScript or Visual Basic. Think about what you need:
- to search for key terms/phrases within a document (regex or basic grepping)
- to submit a query to Google (and others) based on the terms and possibly filetypes, using LWP, curl or wget, or the existing API frameworks. Plain text vs. rich text or encoded/obfuscated formatting may pose a slight issue: Office 2007 documents, for example, are compressed by default, so you may have to copy the text and save it to a plain-text file, or let your script read the terms from the Office file directly.
- to parse and display the results to the end user (sed/grep/awk/regex etc.)

You could, if you wanted, throw together a simple bash script to do all of this, or use the developer APIs Google provides to save yourself some hassle and write all of this at a higher level. Either way, your search results will not be all-inclusive without other search-engines to confirm and perform additional queries (particularly if Google returns no results or false positives).
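The three steps (pick terms, submit queries, report results) can be sketched in Python with the search step left pluggable. Everything here is hypothetical: `search` stands for whatever function you write around LWP-style fetching or a search API, taking a query string and returning a list of URLs.

```python
def check_document(lines, search, min_words=6):
    """Run each sufficiently long line as an exact-phrase query.

    `search` is a caller-supplied function (an assumption of this sketch)
    that takes a query string and returns a list of result URLs.
    Returns (line_number, line_text, urls) for every line with hits."""
    hits = []
    for number, line in enumerate(lines, start=1):
        phrase = line.strip()
        if len(phrase.split()) < min_words:
            continue  # too short to be distinctive
        urls = search('"{}"'.format(phrase.replace('"', '')))
        if urls:
            hits.append((number, phrase, urls))
    return hits


def format_report(hits):
    """Turn the hits into one human-readable line each."""
    return ["Exact phrase match on line {}: {}".format(n, ", ".join(urls))
            for n, _, urls in hits]
```

Keeping `search` as a parameter means the same loop can be pointed at several engines, or at a fake function while testing.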

Edited by jabzor, 04 April 2007 - 03:56 PM.


#12 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 04 April 2007 - 04:37 PM

your search results will not be all-inclusive without other search-engines


That's a very good point. It would need to look at different search engines, and it would also need to eliminate the cross-talk, since the search engines return many of the same results as each other.

I thought about the Google API but quickly dismissed it. All that needs to be done is run a query, grab the returned links, then perform the operations. The Google API wouldn't be needed for just a query and a return.

For the conversion of file formats, in the beginning I would just copy and paste to a txt file, but maybe later I could implement a conversion in the program to make it more convenient.
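Removing the overlap between engines mostly comes down to normalizing URLs before comparing them. A rough Python sketch, where the normalization rules (ignore scheme, "www." prefix, trailing slash and fragment) are assumptions and `merge_results` is a made-up name:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path + query so near-identical links compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def merge_results(results_by_engine):
    """Map each normalized URL to the sorted list of engines that returned it.

    `results_by_engine` is a dict of engine name -> list of result URLs."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            merged.setdefault(normalize_url(url), set()).add(engine)
    return {key: sorted(engines) for key, engines in merged.items()}
```

A link found by more than one engine shows up once, with all of the engines listed next to it.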
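For Office 2007 files in particular, the conversion step might not need copy and paste: a .docx is a zip archive with the body text stored in `word/document.xml`. Here is a minimal Python sketch using only the standard library. `docx_to_text` is a hypothetical name, and it ignores headers, footnotes and tables of any complexity.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the plain text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```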

I was mainly considering Perl. The only reason I suggested Ruby is that I want to learn it anyway, what with the Metasploit Framework and all.

#13 toast_or

toast_or

    Will I break 10 posts?

  • Members
  • 2 posts

Posted 15 April 2007 - 02:45 PM

Doesn't google keep a record of every search?

If you ran your "secret doc" through a search to see if it has been posted on the web it may end up stored on google's servers.

#14 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 15 April 2007 - 05:10 PM

I think it would be safe to say that Google handles upwards of 100,000 searches per second; one bit of information slipped in among that many searches is not going to be noticed.

#15 lambda

lambda

    mad 1337

  • Members
  • 144 posts

Posted 20 April 2007 - 07:05 PM

A little late, but this might interest you

#16 Drake Anubis

Drake Anubis

    Never Forget

  • Agents of the Revolution
  • 911 posts
  • Gender:Not Telling
  • Location:San Diego, CA

Posted 20 April 2007 - 07:19 PM

A little late, but this might interest you


Yes it does. While that program is only made to search amongst documents, it could probably be adapted to search the internet, at the very least it would be a good baseline if I made something myself. Thanks.



