Document Googler
#1
Posted 03 April 2007 - 09:28 PM
I've heard of professors using programs to check for plagiarism. Does anybody have any experience with this?
#2
Posted 03 April 2007 - 09:33 PM
#3
Posted 03 April 2007 - 09:39 PM
Just put the things you want to look for in google with quotes.
Well yes, but say I want to search a set of documents, maybe each document is about 30 pages and there are several documents. I would like to be able to make several thousand google searches with queries derived from these documents automatically.
Another application would be something like the USB Hacksaw or the USB Slurper: take everything off of a machine and then quickly run the documents to see if anything you just pulled is already online somewhere.
#4
Posted 03 April 2007 - 10:21 PM
#5
Posted 03 April 2007 - 10:56 PM
lol
#6
Posted 04 April 2007 - 01:09 AM
Write a script then send it to me.
It might not be that hard to code, just something that would pull out random sentences and look for matches, but from what I understand there are already programs that do this well.
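A minimal sketch of the "pull out random sentences" idea, in Python. The function names, the naive sentence splitter and the `min_words` cutoff are all assumptions made for illustration, not part of any existing tool mentioned in this thread.

```python
import random
import re


def sample_sentences(text, count=5, min_words=8, seed=None):
    """Pick up to `count` random sentences long enough to be distinctive."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    candidates = [s.strip() for s in sentences if len(s.split()) >= min_words]
    rng = random.Random(seed)
    return rng.sample(candidates, min(count, len(candidates)))


def as_exact_query(sentence):
    """Wrap a sentence in double quotes so it is searched as an exact phrase."""
    cleaned = sentence.replace('"', "").rstrip(".!?")
    return f'"{cleaned}"'
```

Short sentences are skipped because they are likely to match unrelated pages; the quoted strings would then be submitted as searches one by one.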
#7
Posted 04 April 2007 - 08:03 AM
#8
Posted 04 April 2007 - 12:00 PM
#9
Posted 04 April 2007 - 02:07 PM
I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...
If you are specifically looking for things like Word, Excel, or PowerPoint documents, you can use "filetype:xls", "filetype:doc", etc., followed by the string to search. That also works out fine. I used to download pre-made presentations off Google for psych class ...lawl
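For a script, the filetype trick could be wrapped up roughly like this. It is only a sketch: `build_query` and `search_url` are made-up helper names, and the base URL is an assumption about the usual search address rather than a documented API.

```python
from urllib.parse import urlencode


def build_query(phrase, filetype=None):
    """Quote the phrase and optionally put a filetype: operator in front of it."""
    query = f'"{phrase}"'
    if filetype:
        query = f"filetype:{filetype} {query}"
    return query


def search_url(query, base="https://www.google.com/search"):
    """URL-encode the query into a search URL (base URL is an assumption)."""
    return f"{base}?{urlencode({'q': query})}"
```

For example, `build_query("quarterly sales figures", "xls")` gives `filetype:xls "quarterly sales figures"`.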
#10
Posted 04 April 2007 - 02:26 PM
I don't fully understand your second question... why not just copy and paste the segment you want to check and google it? That usually works for me...
OK, let me create a usage scenario. Let's say I had a document that should not be available anywhere, like an interoffice memo with the instructions on how the phone system operates. I would like to be able to put this memo into a program that would then search Google using the text from the memo, and then it would be able to determine if all or part of this document is indexed online.
So it might do this by saying "Exact phrase match on Line 12 and 13" if it found an exact sentence match. After all, what are the odds that a specific sentence is going to be a perfect match on Google by coincidence?
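The exact-match report could look something like the sketch below, assuming the text of a candidate page has already been downloaded. The function name and the `min_chars` threshold are illustrative assumptions.

```python
def exact_line_matches(memo_text, page_text, min_chars=20):
    """Return 1-based numbers of memo lines that appear verbatim in page_text."""

    def normalize(s):
        # Ignore case and collapse runs of whitespace before comparing.
        return " ".join(s.lower().split())

    haystack = normalize(page_text)
    hits = []
    for number, line in enumerate(memo_text.splitlines(), start=1):
        needle = normalize(line)
        # Very short lines would match by accident, so skip them.
        if len(needle) >= min_chars and needle in haystack:
            hits.append(number)
    return hits
```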
Or by numbering, if there is a specific sequence of numbers it should be able to search. So if the memo said "To reset to the default password dial your voicemail box followed by 13254679*001", it should be able to search google for that string and find that somebody posted on a forum "yeah man, I found this system and all you got to do to reset the password is punch in 13254679*001".
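Pulling out number sequences like that is a small regex job. A rough sketch, where treating `*` and `#` as part of a code and the minimum length of 6 are both assumptions:

```python
import re


def keypad_sequences(text, min_len=6):
    """Find runs of digits, * and # at least min_len long (keypad-style codes)."""
    pattern = re.compile(r"[0-9*#]{%d,}" % min_len)
    # Require at least one digit so a run of only * or # is ignored.
    return [m for m in pattern.findall(text) if any(c.isdigit() for c in m)]
```

Each sequence found would then be searched as its own quoted string.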
And then something more fuzzy like "80% of Lines 22:29 match this document found here", if somebody changed the wording slightly.
This is kind of disappointing. It seems like this would be very handy to have, and I thought for sure somebody would have coded something by now. I'll have to code this myself I guess. Can anybody suggest a language that would be good for basically just running grep and search queries? Maybe Perl or Ruby...
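For the fuzzy percentage, one possible approach (an assumption, not something anyone in this thread has built) is a word-level comparison with `difflib.SequenceMatcher` from the Python standard library:

```python
from difflib import SequenceMatcher


def block_similarity(memo_lines, start, end, candidate_text):
    """Similarity from 0.0 to 1.0 between memo lines start..end and a candidate.

    start and end are 1-based and inclusive, like "Lines 22:29".
    """
    block = " ".join(memo_lines[start - 1:end]).lower().split()
    other = candidate_text.lower().split()
    # Compare lists of words, so one swapped word costs one mismatch.
    return SequenceMatcher(None, block, other).ratio()
```

A result of 0.8 or so would correspond to the "80% match" report; where to set the cutoff would need tuning on real documents.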
Edited by Drake Anubis, 04 April 2007 - 02:27 PM.
#11
Posted 04 April 2007 - 03:53 PM
Really you could do this in almost any language, including JavaScript or Visual Basic. Think about what you need:
- to search for keyterms/phrases within a document (regex or basic grepping)
- to submit a query to Google (and others) based on the terms and possibly filetypes (lwp/curl/wget are just a few examples, or use the existing API frameworks). Plain text vs. rich text or encoded/obfuscated formatting may pose a slight issue: Office 2007 documents, for example, are compressed by default, so you may have to copy the text and save it to a plain-text file, or let your script read the terms within the Office file directly
- to parse and display the results to the end-user (sed/grep/awk/regex etc)
You could, if you wanted, throw together a simple bash script to do all of this, or use the developer APIs Google provides to save yourself some hassle and write all of this at a higher level. Either way, your search results will not be all-inclusive without other search engines to confirm and perform additional queries (particularly if Google returns no results or false positives).
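As a sketch of the third step (parsing the results), here is one way to pull outbound links out of a results page that has already been fetched, using only the Python standard library. The fetch itself is left out, and the assumption that results show up as plain absolute `<a href>` links is exactly that, an assumption; real result pages may need more filtering.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def extract_links(html):
    """Return absolute links in the order they appear in the page."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```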
Edited by jabzor, 04 April 2007 - 03:56 PM.
#12
Posted 04 April 2007 - 04:37 PM
your search results will not be all-inclusive without other search-engines
That's a very good point. It would need to look at different search engines, and it would also need to eliminate the cross talk, since the search engines return many of the same results as each other.
I thought about the Google API but quickly dismissed it. All that needs to be done is a query, grab the returned links, then perform the operations; the Google API wouldn't be needed for just a query and a return.
For the conversion of file formats, in the beginning I would just copy-paste to a txt file, but maybe later I could implement a conversion in the program to make it more convenient.
I was mainly considering Perl; the only reason I suggested Ruby is that I wanted to learn it in general, what with the Metasploit framework and all.
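Eliminating the cross talk mostly comes down to normalizing URLs before comparing them. A rough sketch, where the normalization rules (drop the scheme, a leading "www.", trailing slashes and fragments) and the engine names in the example are assumptions:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key so trivial variants collapse together."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(results_by_engine):
    """Map each normalized URL to the list of engines that returned it."""
    merged = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines = merged.setdefault(normalize_url(url), [])
            if engine not in engines:
                engines.append(engine)
    return merged
```

A hit reported by several engines then shows up once, with the list of engines that found it.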
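For Office 2007 Word files, the conversion could be fairly small, since a .docx is a zip archive with the body text in `word/document.xml`. The sketch below strips the XML tags with a regex, which is a crude assumption that ignores tables, headers and footnotes:

```python
import html
import re
import zipfile


def docx_to_text(path_or_file):
    """Extract rough plain text from a .docx (a zip with word/document.xml)."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Turn paragraph ends into newlines, then drop every remaining tag.
    xml = re.sub(r"</w:p>", "\n", xml)
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; that the XML escaping introduced.
    return html.unescape(text).strip()
```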
#13
Posted 15 April 2007 - 02:45 PM
If you ran your "secret doc" through a search to see if it has been posted on the web it may end up stored on google's servers.
#14
Posted 15 April 2007 - 05:10 PM
#16
Posted 20 April 2007 - 07:19 PM
A little late, but this might interest you
Yes it does. While that program is only made to search amongst documents, it could probably be adapted to search the internet; at the very least it would be a good baseline if I made something myself. Thanks.