Recently I was trying to find a good web scraper for one of my apps, which had to scrape links and extract keywords from webpages. It needed to be fairly robust, handling broken HTML and heaps of text.
I browsed around the web looking for existing libraries that could help me out, since I don't like to create things from scratch. I couldn't find exactly what I wanted, but I did find a couple of things that could help me build it:
Topia.termextract for text processing
lxml for HTML parsing
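The core idea — pull the visible text out of a page and rank the words that appear in it — is simple enough to sketch with the standard library alone. This is only a rough stand-in for what the real module does (topia.termextract's term extraction and lxml's parsing are considerably more capable), but it shows the shape of the pipeline:

```python
from collections import Counter
from html.parser import HTMLParser
import re

class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def extract_keywords(html, occurs=False):
    """Return (word, count) pairs; with occurs=True, keep only words seen more than once."""
    parser = TextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(words)
    if occurs:
        counts = Counter({w: c for w, c in counts.items() if c > 1})
    return counts.most_common()

page = "<html><body><p>Python scraping with Python</p><script>x=1</script></body></html>"
print(extract_keywords(page, occurs=True))  # [('python', 2)]
```

The `occurs` flag here mirrors the API parameter described below: it drops any word that appears only once on the page.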
After writing the code, I tested it pretty thoroughly, and once the module was complete I thought I'd launch it as an API service. Deployment was not an issue, since Heroku is my go-to option. So I got the API deployed on Heroku, with some modifications, running on a gunicorn server.
What is Scrapit?
Scrapit is an API for scraping webpages for keywords. Using Scrapit you can extract important keywords that are relevant to the page.
You need to make calls to the API with the following query parameters:
q : (required) The URL to be fetched.
occurs : (optional) Only return words that are repeated more than once on the webpage. Set to '1' to enable it.
pretty : (optional) Pretty-prints the response. Set to '1' to enable it.
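As an illustration, a call with all three parameters can be put together like this. The base URL below is a placeholder, not the real Scrapit endpoint — substitute the actual one:

```python
from urllib.parse import urlencode

BASE_URL = "https://scrapit.example.com/api"  # placeholder, not the real endpoint

def build_request_url(q, occurs=False, pretty=False):
    """Build a Scrapit request URL; q is required, the optional flags map to '1'."""
    params = {"q": q}
    if occurs:
        params["occurs"] = "1"
    if pretty:
        params["pretty"] = "1"
    return BASE_URL + "?" + urlencode(params)

print(build_request_url("http://example.com", occurs=True, pretty=True))
# https://scrapit.example.com/api?q=http%3A%2F%2Fexample.com&occurs=1&pretty=1
```

Note that the target URL has to be percent-encoded when it is passed as the `q` parameter, which `urlencode` handles for you.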
I'm going to keep fixing bugs in the API as they come up.
So if you have any suggestions that would make Scrapit any better, they are welcome here :)