DrakeAnubis

Document Googler

16 posts in this topic

Does anybody know of a way to check whether part of a document (or preferably a group of documents) has been indexed by a search engine like Google? For example, if I had a list of corporate internal phone numbers and wanted to see whether any part of it was on Google, I could feed the document into a program and it would check Google.

I've heard of professors using programs to check for plagiarism. Does anybody have any experience with this?

Just put the things you want to look for in google with quotes.

Just put the things you want to look for in google with quotes.

Well yes, but say I want to search a set of documents; maybe each document is about 30 pages and there are several documents. I would like to be able to make several thousand Google searches automatically, with queries derived from these documents.

Another application would be something like the USB Hacksaw or the USB Slurper: take everything off of a machine and then quickly run the documents through it to see if anything you just pulled is already online somewhere.

I am also interested to see this.

Write a script then send it to me.

lol

Write a script then send it to me.

It might not be that hard to code, just something that would pull out random sentences and look for matches, but from what I understand there are already programs that do this well.
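A rough sketch of the "pull out random sentences" idea in Python, just to show how little code it takes. The function names, the six-word minimum and the naive sentence splitter are all my own assumptions, not anything from an existing tool:

```python
import random
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def sample_queries(text, count=5, min_words=6, seed=None):
    """Pick up to `count` random sentences and quote them for exact-phrase search."""
    rng = random.Random(seed)
    # Very short sentences match too much of the web, so skip them (assumed cutoff).
    candidates = [s for s in split_sentences(text) if len(s.split()) >= min_words]
    picked = rng.sample(candidates, min(count, len(candidates)))
    # Drop trailing punctuation and embedded quotes so the phrase stays one quoted unit.
    return ['"' + s.rstrip(".!?").replace('"', "") + '"' for s in picked]
```

Each returned string can be pasted into a search box as-is; how the queries actually get submitted is left out on purpose.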

Yes, but it would be nice to have a program that is already coded. Something that would be able to do advanced things: for example, after it finds a match it starts googling that section more specifically, maybe even shifting the words to switch it from 3rd person to 1st person or something. Those kinds of things.

For plagiarism etc., schools (including my old school) used turnitin.com, and it was actually quite effective... at least 6 people in our class got caught... but it's fairly easy to get around if you are even half intelligent.

I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...

If you are specifically looking for things like Word, Excel or PowerPoint documents, you can use "filetype:xls", "filetype:doc" etc. followed by the string to search. That also works out fine. I used to download pre-made presentations off Google for psych class... lawl
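If you wanted to generate those filetype queries instead of typing them, something like this would do it. This is only a sketch; the function name and the default extension list are made up for illustration:

```python
def filetype_queries(phrase, filetypes=("doc", "xls", "ppt")):
    """Build one 'filetype:EXT "phrase"' query string per extension."""
    # Strip embedded quotes so the phrase stays a single exact-match unit.
    cleaned = phrase.replace('"', "").strip()
    return [f'filetype:{ext} "{cleaned}"' for ext in filetypes]

for query in filetype_queries("internal phone directory"):
    print(query)
```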

I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...

Ok, let me create a usage scenario. Let's say I had a document that should not be available anywhere, like an interoffice memo with the instructions on how the phone system operates. I would like to be able to put this memo into a program that would then search Google using the text from the memo and determine whether all or part of this document is indexed online.

So it might report "Exact phrase match on Lines 12 and 13" if it found an exact sentence match. After all, what are the odds that a specific sentence is going to be a perfect match on Google?

Or by numbers: if there is a specific sequence of numbers, it should be able to search for it. So if the memo said "To reset to the default password dial your voicemail box followed by 13254679*001", it should be able to search Google for that string and find that somebody posted on a forum "yeah man, I found this system and all you got to do to reset the password is punch in 13254679*001".
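Pulling the number sequences out is the easy part. Here is a minimal Python sketch, where the regex (runs of at least 6 characters made of digits, `*`, `#` or `-`) is just my guess at what a "specific sequence" looks like; note that a search engine may treat `*` specially, so results for such strings need a manual look:

```python
import re

# At least 6 characters: starts with a digit, ends with a digit, '*' or '#'.
NUMBER_RUN = re.compile(r"\d[\d*#-]{4,}[\d*#]")

def number_queries(lines):
    """Return (line_number, quoted_query) for every long number run found."""
    results = []
    for lineno, line in enumerate(lines, start=1):
        for match in NUMBER_RUN.finditer(line):
            results.append((lineno, '"' + match.group(0) + '"'))
    return results

memo = [
    "To reset to the default password dial your voicemail box",
    "followed by 13254679*001 and wait for the tone.",
]
print(number_queries(memo))
```

Keeping the line number with each query means a hit can later be reported against the line it came from.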

And then something more fuzzy, like "80% of Lines 22:29 match this document found here", for when somebody has changed the wording slightly.
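For the fuzzy percentage, Python's standard `difflib` gets you most of the way. This is a sketch under my own assumptions: comparison is word-by-word after lowercasing, and `match_percent` is a hypothetical helper name:

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())

def match_percent(local_lines, start, end, remote_text):
    """Similarity (0-100) between local lines start..end (1-indexed, inclusive)
    and a remote text, compared as sequences of words."""
    local = normalize(" ".join(local_lines[start - 1:end])).split()
    remote = normalize(remote_text).split()
    ratio = SequenceMatcher(None, local, remote).ratio()
    return round(ratio * 100)

memo = ["Dial your voicemail box", "then enter the reset code"]
found = "dial your voicemail box then punch the reset code"
print(f"{match_percent(memo, 1, 2, found)}% of Lines 1:2 match")
```

Where the reporting threshold sits (80% or otherwise) is something you would have to tune against real documents.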

This is kind of disappointing. It seems like this would be very handy to have, and I thought for sure somebody would have coded something by now. I'll have to code this myself I guess. Can anybody suggest a language that would be good for basically just running grep and search queries? Maybe Perl or Ruby...

Edited by Drake Anubis

Perl will work great (text parsing is where it scores all the big points); curl/wget and grep would work as well, and Ruby should work fine.

Really you could do this in almost any language, including JavaScript or Visual Basic. Think about what you need:

- to search for key terms/phrases within a document (regex or basic grepping)

- to submit a query to Google (and others) based on the terms and possibly filetypes, using lwp/curl/wget or the existing API frameworks. Plain text vs. rich text or encoded/obfuscated formatting may pose a slight issue: Office 2007 documents, for example, are compressed by default, so you may have to copy the text and save it to a plain-text file, or let your script read the terms within Office directly

- to parse and display the results to the end-user (sed/grep/awk/regex etc)
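The three steps above can be sketched in Python roughly like this. Everything here is an assumption for illustration: the search URL is just the ordinary web form address rather than an official API, `fetch` is a caller-supplied function (URL in, page text out) so no real requests are made, and the substring test is a crude stand-in for real result parsing:

```python
from urllib.parse import urlencode

def build_search_url(query, base="https://www.google.com/search"):
    """Encode a query string into a search URL (ordinary web form, not an API)."""
    return base + "?" + urlencode({"q": query})

def check_document(phrases, fetch):
    """For each phrase, fetch an exact-phrase search page and report whether
    the phrase text shows up anywhere in what came back."""
    report = {}
    for phrase in phrases:
        page = fetch(build_search_url('"' + phrase + '"'))
        report[phrase] = phrase.lower() in page.lower()
    return report

# Stand-in fetcher for demonstration; a real one would download the URL.
def fake_fetch(url):
    return "forum post: punch in the reset code" if "reset" in url else ""

print(check_document(["reset code", "phone list"], fake_fetch))
```

Passing the fetcher in keeps the script testable offline and makes it easy to swap in another search engine later.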

You could, if you wanted, throw together a simple bash script to do all of this, or use the developer APIs Google provides to save yourself some hassle and write all of this at a higher level. Either way, your search results will not be all-inclusive without other search-engines to confirm and perform additional queries (particularly if Google returns no results or false positives).

Edited by jabzor

your search results will not be all-inclusive without other search-engines

That's a very good point. It would need to look at different search engines, and it would also need to eliminate the cross talk, since the search engines return many of the same results as each other.
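Eliminating the cross talk could be as simple as normalizing the URLs and merging them. A Python sketch, where the normalization rules (ignore scheme, leading "www." and trailing slash) are my own assumption about what counts as "the same result":

```python
from urllib.parse import urlsplit

def normalize_url(url):
    """Reduce a URL to host + path (+ query), ignoring scheme, 'www.' and trailing '/'."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path + ("?" + parts.query if parts.query else "")

def merge_results(results_by_engine):
    """Map each normalized URL to the sorted list of engines that returned it."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            merged.setdefault(normalize_url(url), set()).add(engine)
    return {url: sorted(engines) for url, engines in merged.items()}
```

A URL reported by several engines then shows up once, with the list of engines that found it.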

I thought about the Google API but quickly dismissed it. All that needs to be done is a query, grab the returned links, then perform the operations; the Google API wouldn't be needed for just a query and a return.
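Grabbing the returned links without any API is doable with the standard library. A minimal Python sketch, assuming the results page is plain HTML with ordinary `<a href>` links (real result pages may wrap or rewrite their links, so treat this as a starting point):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip relative links and anchors without an href value.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)

def extract_links(html):
    """Return all absolute links found in an HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```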

For the conversion of file formats, in the beginning I would just copy and paste to a txt file, but maybe later I could implement a conversion in the program to make it more convenient.
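For Office 2007 files the conversion is not too bad, since a .docx is a zip archive with the body text in `word/document.xml`. A rough Python sketch using only the standard library; it ignores headers, footnotes and tables-vs-paragraph layout, and `docx_to_text` is just an illustrative name:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the paragraph (w:p) and text (w:t) elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(path_or_file):
    """Return the body text of a .docx, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the pieces back together.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

The older binary .doc format is a different story and would still need the copy-and-paste route or an external converter.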

I was mainly considering Perl. The only reason I suggested Ruby is that I wanted to learn it anyway, what with the Metasploit framework and all.

Doesn't Google keep a record of every search?

If you ran your "secret doc" through a search to see if it has been posted on the web, it may end up stored on Google's servers.

I think it would be safe to say that Google handles upwards of 100,000 searches per second; one bit of information run in between 300,000 searches is not going to be noticed.

A little late, but this might interest you

Yes it does. While that program is only made to search amongst documents, it could probably be adapted to search the internet, at the very least it would be a good baseline if I made something myself. Thanks.
