What are some good free web scrapers / scraping techniques?
My 2 cents:
Some more open-source solutions:
1. WebHarvest
- Written in Java
- Leverages XSLT, XQuery and regex to perform its scraping voodoo
- Check it out at: http://web-harvest.sourceforge.net/overview.php
2. Beautiful Soup
- Written in Python
- Leverages libraries like lxml and html5lib.
- I must mention that their client list includes notables like MovableType and Reddit, so I guess they have their game sorted out.
- Check it out at: http://www.crummy.com/software/BeautifulSoup/
3. Solvent + Piggy Bank
- These are Firefox extensions written in JavaScript, authored at MIT.
- Piggy Bank is actually a mashup module to aggregate and integrate info from various sites. Solvent is another add-on that works with Piggy Bank to develop screen scrapers.
- They have some neat screencasts showing how their tool can scrape sites like Craigslist and Starbucks coffee shops.
- Basic knowledge of JavaScript is required
- Check it out at: http://simile.mit.edu/wiki/Solvent
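To give a feel for the Beautiful Soup option listed above, here is a minimal sketch, assuming the `bs4` package is installed (`pip install beautifulsoup4`). The HTML snippet, the `item` and `price` class names, and the `extract_listings` helper are all made up for illustration; a real page needs its own selectors, and you would fetch the HTML yourself first.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page you have already downloaded.
SAMPLE_HTML = """
<html><body>
  <ul id="listings">
    <li class="item"><a href="/post/1">Used bike</a> <span class="price">$120</span></li>
    <li class="item"><a href="/post/2">Desk lamp</a> <span class="price">$15</span></li>
  </ul>
</body></html>
"""


def extract_listings(html):
    """Return one dict (title, url, price) per <li class="item"> element."""
    # "html.parser" is Python's built-in parser; lxml or html5lib can be
    # swapped in here if they are installed.
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for item in soup.find_all("li", class_="item"):
        link = item.find("a")
        price = item.find("span", class_="price")
        listings.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": price.get_text(strip=True),
        })
    return listings


if __name__ == "__main__":
    for row in extract_listings(SAMPLE_HTML):
        print(row)
```

The same pattern (parse, find the repeating element, pull fields out of each one) carries over to most simple listing pages, though pages that build their content with AJAX will need a different approach.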
If you're in the market for something a little less technically demanding, here are some offerings:
1. IRobotSoft
2. NeedleBase
- A visual tool allowing you to easily create scrapers + gives you cool features like duplicate culling/merging data sets and all.
- It's pretty easy to use, but I'm not sure how it performs when things get a wee bit complicated (e.g. with AJAX and all)
- Price: Free for low-volume scrapes (log in with your Google account); I think you need to pay up for higher volumes
- Check it out at: www.needlebase.com
Paid Services
In case you change your mind and are willing to toss in some dough, you might want to check out:
1. ScraperWiki (already mentioned in previous answers: costs at least $1000 per scraper job and gives you good data privacy options)
2. Mozenda (high-end SaaS: $99 for 5000 pages; a sophisticated tool that lets you conjure up complex scraping scenarios)
3. ScrapeHero (very affordable DaaS: $50 for 10,000 pages with live customer support)