Captcha
Captcha is a show-stopper for scrapers, and the bad news is that almost every website has captcha enabled for suspicious requests. Getting past a captcha is not easy.
HTTP fingerprinting
Apart from detecting the IP address, server-side solutions can also identify requests coming from a client using HTTP fingerprinting. Requests can therefore be blocked if they come from a blacklisted client machine/device, even if it is using a new IP.
Dynamic content
Almost all websites use DOM manipulation and dynamic content. A generic scraper script cannot access the data on these websites; moreover, the APIs serving them cannot be scraped directly unless the dynamically generated security token is passed to the server in the request header.
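As an illustration of passing such a token, here is a minimal sketch using only Python's standard library. The URLs, the header name (`X-CSRF-Token`), and the token pattern are all hypothetical assumptions, since each site embeds and names its token differently:

```python
import re
import urllib.request

# Hypothetical example: the page embeds a per-session security token that
# the backing API expects in a request header. URLs, header name, and the
# token pattern below are illustrative assumptions, not a real site's API.
PAGE_URL = "https://example.com/products"
API_URL = "https://example.com/api/products"

TOKEN_PATTERN = re.compile(r'"csrfToken"\s*:\s*"([^"]+)"')

def extract_token(page_source):
    """Pull the dynamically generated token out of the page source."""
    match = TOKEN_PATTERN.search(page_source)
    if not match:
        raise RuntimeError("security token not found in page source")
    return match.group(1)

def fetch_api_data():
    # 1. Load the page so the server embeds the token in the HTML.
    with urllib.request.urlopen(PAGE_URL) as resp:
        token = extract_token(resp.read().decode())
    # 2. Call the API with the token in the request header.
    req = urllib.request.Request(API_URL, headers={"X-CSRF-Token": token})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A real implementation would also need to carry the session cookies set by the first request into the second, since many sites tie the token to the session.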
DDoS attack protection
Most websites have DDoS attack protection enabled, which can also block scrapers.
Changes in HTML structure
Frequent changes to these websites are not uncommon; however, they can break the scraper and require immediate updates to the script.
App-only services
As companies move to an "app-only" model, scrapers face a new challenge: what to scrape?
- Create a rule-based parsing engine for data scraping, so that any change in a web page can be handled swiftly without modifying the rest of the program.
- All the rules should be stored in a file, called a template.
- The template defines which data is to be scraped and how to locate that data on the target web page.
- If the structure of the target web page changes, you only have to update the relevant template; the rest of the program (the Worker) remains unchanged.
- Likewise, if new data has to be extracted from the same page, this approach ensures a significantly quicker turnaround.
- It also makes it very easy to add rules for a new website, keeping the script highly extensible.
- This applies to web pages as well as mobile APIs; consequently, every web page gets its own template.
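The template idea above can be sketched as follows. The rules here are regex-based to keep the example dependency-free (a real engine would more likely use CSS or XPath selectors and load templates from JSON files), and the site name, field names, and patterns are illustrative assumptions:

```python
import re

# Minimal sketch of the template-driven approach: each template maps field
# names to extraction rules, and the Worker applies whichever template is
# registered for a site. In practice each template would live in its own
# file; the "example-shop" rules below are illustrative only.
TEMPLATES = {
    "example-shop": {
        "title": r"<h1[^>]*>(.*?)</h1>",
        "price": r'class="price"[^>]*>([^<]+)<',
    },
}

def extract(site, html):
    """Apply the template registered for `site` to raw HTML."""
    template = TEMPLATES[site]
    record = {}
    for field, pattern in template.items():
        match = re.search(pattern, html, re.S)
        record[field] = match.group(1).strip() if match else None
    return record

html = '<h1>Blue Widget</h1><span class="price">$19.99</span>'
print(extract("example-shop", html))
# {'title': 'Blue Widget', 'price': '$19.99'}
```

If the page structure changes, only the two rule strings change; the `extract` function (the Worker's code path) stays untouched.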
- Reader – The most significant part of the whole solution, as every website has its own security and page/content-loading mechanism, which must be analyzed to identify the key challenges and devise a strategy for the reader. On some websites the content is loaded along with the DOM, while others use DOM manipulation and load content dynamically.
- Extractor – Uses the rules and patterns defined in the templates to identify and extract the required data from the HTML/JSON fetched by the Reader. Extracted data is stored in a cloud-based NoSQL database.
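A minimal structural sketch of this Reader/Extractor split is shown below. The Reader's fetch strategy and the database write are stubbed out, since both depend on the target site and the chosen NoSQL store; all class and field names are assumptions:

```python
import json

# Architectural sketch of the Reader/Extractor split described above.
# Fetching and storage are stubbed; names are illustrative assumptions.

class Reader:
    """Fetches raw content; the strategy depends on how the site loads data."""

    def read(self, source):
        # A real reader would branch here: a plain HTTP GET for pages whose
        # content ships with the DOM, a headless browser (e.g. Selenium or
        # Playwright) for sites that build content client-side, or a direct
        # API call once the security token is known.
        return source  # stubbed: pass the raw content through for the sketch

class Extractor:
    """Applies template rules to the HTML/JSON returned by the Reader."""

    def extract(self, raw, rules):
        data = json.loads(raw)  # here: a JSON payload from a mobile API
        return {field: data.get(key) for field, key in rules.items()}

# Worker wiring: Reader -> Extractor -> (NoSQL store, stubbed out).
reader, extractor = Reader(), Extractor()
raw = reader.read('{"name": "Blue Widget", "amount": "19.99"}')
record = extractor.extract(raw, {"title": "name", "price": "amount"})
print(record)  # {'title': 'Blue Widget', 'price': '19.99'}
```

Because the Extractor only sees templates and raw content, swapping the Reader's strategy per site (static page, dynamic page, or mobile API) leaves the rest of the pipeline unchanged.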