t3st.s3t

Search Program

7 posts in this topic

I am looking for a Linux program that can search and extract information from web pages, or one that can just scour the net and gather information. Something that works with MySQL or PostgreSQL would be great. I've looked all over Google for such a program and have had little success. Any recommendations?


Maybe if you were a little less vague, someone might be able to help you.

You can make your own easily enough. For example, I hacked up something similar to what you described with Ruby and Hpricot (an HTML parser with a powerful and handy query syntax) in an hour or so. Making a spider is a relatively simple task; you might not find canned software that does exactly what you want.
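
Something along these lines, just as a minimal sketch of the approach (not my actual script; the URL and selectors are placeholders):

    require 'open-uri'
    require 'hpricot'

    # Fetch a page and hand the HTML to Hpricot
    doc = Hpricot(open('http://example.com/'))

    # Hpricot's query syntax: (doc/'selector') returns matching elements
    (doc/'a').each do |link|
      puts link.attributes['href']   # collect links to feed back into the spider
    end

    # Or pull out just the text you care about, e.g. every paragraph
    (doc/'p').each { |p| puts p.inner_text }

Feed the extracted hrefs back into a queue of URLs to visit and you have the skeleton of a spider; storing what you extract in MySQL or PostgreSQL is then just an INSERT per page.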


If you're looking to do it in Perl, use the WWW::Spyder module.

I'm sure you could easily get something decent done in that in a very short time.


I was hoping for a packaged product but this will work nicely. Thanks!


Larbin is probably what you are looking for. It already implements a full bot and comes with some default storage handling as well. The nice thing is that you can also code up your own way of handling a page, in case you want to dump it to a DB or something.

Just an added note: be sure you are "kind" to other web servers and only crawl them every once in a while, and check the Last-Modified headers so you don't re-fetch unchanged pages. Larbin does all of this for you, so I recommend it.
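
For what checking Last-Modified looks like in practice, here is a rough sketch of a polite conditional GET (plain Ruby Net::HTTP, nothing Larbin-specific; the URL and stored date are placeholders):

    require 'net/http'
    require 'uri'

    uri = URI.parse('http://example.com/page.html')
    last_seen = 'Tue, 15 Nov 2005 12:45:26 GMT'  # Last-Modified saved from a previous crawl

    response = Net::HTTP.start(uri.host, uri.port) do |http|
      # Send If-Modified-Since; the server answers 304 if nothing changed
      http.get(uri.path, 'If-Modified-Since' => last_seen)
    end

    case response
    when Net::HTTPNotModified
      # 304: page unchanged, skip it this round
    when Net::HTTPSuccess
      last_seen = response['Last-Modified']  # remember for the next visit
      # ... parse/store response.body here ...
    end

    sleep 2  # be kind: pause between requests to the same host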

Hope that helps!

-Dr^ZigMan


It looks like a nice piece of software. I'll give it a try. Thank you for your help.


Assuming that you're attempting to spider data from websites: if you're not specifically tied to MySQL or PostgreSQL, you should check out programs built with 'Lucene' technology:

http://wiki.apache.org/lucene-java/PoweredBy

Lucene will automatically index the data for you and enable some quite complicated searches.

A good starting point would be Nutch (although I have not tried it):

http://lucene.apache.org/nutch/about.html

I use MindRetrieve (a Lucene-based personal proxy) to 'scrape' the bits of the web I see. I often have the feeling that I've seen something in the last few weeks but can't remember where it was...

Munge.

