Captchas are the show-stopper for scrapers, and the bad news is that almost all websites present a captcha to suspicious requests. Getting past a captcha is not easy.
Apart from checking the IP address, server-side solutions can also identify the client behind a request using HTTP fingerprinting. As a result, requests can be blocked if they come from a blacklisted client machine or device, even when it is using a new IP address.
Almost all websites use DOM manipulation and dynamic content. A generic scraper script can’t access the data on these websites. Moreover, the APIs serving these websites can’t be scraped directly unless the dynamically generated security token is passed to the server in the request header.
DDoS attack protection
Most websites have DDoS attack protection enabled, which can also block scrapers.
Changes in HTML structure
Frequent changes to these websites are not uncommon. However, such changes can break the scraper and require immediate updates to the script.
As companies move to an “App-only” concept, scrapers face a new challenge: what to scrape?
We can avoid IP-based blocking by routing the requests through thousands of different IP addresses. We can also randomize when requests are sent instead of making them at a set frequency.
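The two ideas above, rotating the originating address and avoiding a fixed request rhythm, can be sketched roughly as follows. This is only an illustration: `plan_requests`, its parameters, and the delay bounds are hypothetical names and values chosen for the example, and the sketch only plans the requests rather than sending them.

```python
import random
from typing import Iterator, List, Optional, Tuple


def plan_requests(
    urls: List[str],
    proxies: List[str],
    min_delay: float = 1.0,
    max_delay: float = 5.0,
    seed: Optional[int] = None,
) -> Iterator[Tuple[str, str, float]]:
    """Yield (url, proxy, delay_seconds) for each URL.

    Hypothetical helper: the proxy is picked at random from the pool and
    the pause before each request is drawn from [min_delay, max_delay].
    """
    rng = random.Random(seed)
    for url in urls:
        proxy = rng.choice(proxies)                # vary the originating IP
        delay = rng.uniform(min_delay, max_delay)  # no fixed request frequency
        yield url, proxy, delay
```

A caller would wait for `delay` seconds and then send the request to `url` through `proxy`, using whatever proxy mechanism its HTTP client offers.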
Not all websites implement captcha security, as it severely interferes with user experience and user interaction. So, we can identify which websites have a captcha implemented and then find a solution by following this approach:
- Analyze which websites have implemented captcha protection and then reverse engineer which action or event triggers the captcha. If a pattern or cause can be identified, adapt your requests or code to circumvent it.
- If no pattern can be identified, or if the identified pattern can’t be circumvented, then the solution is to break the captcha. Many captcha solutions are known to contain bugs or exploits that allow them to be bypassed easily. If the captcha can’t be avoided and has no known exploits, apply automated OCR-level captcha recognition. If OCR-level recognition also fails, you would need manual human interaction (human farms).
- For such apps, we can scrape the data by identifying the relevant APIs and finding a way to scrape those APIs.
- These APIs are generally secured by a header token that the app generates on the user’s device, so scraping the APIs directly may not work.
- For this, you can use the techniques that app developers use for automated testing.
Frequent changes to a website can break the scraper and force immediate updates to the script. However, by employing a template-based approach, you can not only identify any change to the website or API instantly but also adapt your script swiftly.
You can overcome this by using cloud computing technologies. You can launch an ephemeral cloud instance from a preconfigured system image and, as soon as the instance is blocked, destroy it and replace it with a new instance.
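The replace-on-block cycle described above might look something like the sketch below. It does not call any real cloud provider: `EphemeralPool`, `launch`, and `destroy` are hypothetical names, and the two callables stand in for whatever provider-specific calls boot an instance from a preconfigured image and terminate it.

```python
from typing import Callable, List


class EphemeralPool:
    """Keeps a fixed number of disposable instances, replacing blocked ones."""

    def __init__(
        self,
        size: int,
        launch: Callable[[], str],
        destroy: Callable[[str], None],
    ) -> None:
        self._launch = launch    # hypothetical: boots an instance, returns its id
        self._destroy = destroy  # hypothetical: terminates the instance with that id
        self.instances: List[str] = [launch() for _ in range(size)]

    def report_blocked(self, instance_id: str) -> str:
        """Destroy a blocked instance and return the id of its replacement."""
        self.instances.remove(instance_id)
        self._destroy(instance_id)
        replacement = self._launch()
        self.instances.append(replacement)
        return replacement
```

The scraper would call `report_blocked` whenever it detects that an instance is being refused, so the pool size stays constant while the blocked machine disappears.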
We can scrape these websites by creating an app that simulates the behavior of a website user. The script can even log in as a user and then browse or search the records in a captive browser. As the script has full access to the content of the browser, the content can then be scraped easily.
When combined, all of the above-mentioned solutions can help the scraper overcome this challenge.
- Create a rule-based parsing engine for the data scraping so that any changes in the web page can be handled swiftly without making changes to the rest of the program.
- All the rules should be stored in a file, called a template.
- The template will define which data is to be scraped and how to locate the data on the target web page.
- If the structure of the target web page changes, then you only have to make changes to the relevant template and the rest of the program (Worker) will remain unchanged.
- Also, if any new data has to be extracted from the same page, this approach would ensure a significantly quicker turnaround time.
- This approach also allows us to add a new rule for a new website very easily, making the script highly extensible.
- This will be relevant for web pages as well as mobile APIs. Every web page will consequently have a separate template.
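A minimal sketch of the template idea, under some loud assumptions: the template is shown inline as a dictionary (in practice it could be loaded from a file), the rule syntax is the limited path language of Python's `xml.etree.ElementTree`, and the page is assumed to be well-formed markup. Real pages usually need a more forgiving HTML parser. `PRODUCT_TEMPLATE`, `extract`, and the field names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical template: field name -> ElementTree path expression.
PRODUCT_TEMPLATE: Dict[str, str] = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}


def extract(markup: str, template: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply every rule in the template; a rule that matches nothing yields None."""
    root = ET.fromstring(markup)  # assumes well-formed markup
    record: Dict[str, Optional[str]] = {}
    for field, path in template.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record
```

Because the worker code never mentions a tag or class name, a change in page structure only requires editing the template. A field that comes back as `None` is also a cheap signal that the page has changed.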
- Reader: It is the most significant part of the whole solution. Every website has its own security and page/content loading mechanism, which needs to be analyzed to identify the key challenges and then devise a strategy for developing the reader. On some websites, the content is loaded along with the DOM, whereas other websites use DOM manipulation and load content dynamically.
- Extractor: It uses the rules and patterns defined in the templates to identify and extract the required data from the HTML/JSON scraped by the reader. Extracted data is stored in a cloud-based NoSQL database.
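How the reader and extractor fit together can be sketched as below, here for a JSON payload such as a mobile API response. Everything named in the sketch is an assumption made for illustration: `API_TEMPLATE` and its dotted-path rule syntax, `extract_json`, `run_worker`, and the plain list that stands in for the NoSQL database. The reader is passed in as a callable because its implementation depends entirely on the target website.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical template for a JSON API: field name -> dotted path into the payload.
API_TEMPLATE: Dict[str, str] = {"name": "item.name", "price": "item.offer.price"}


def extract_json(payload: str, template: Dict[str, str]) -> Dict[str, Any]:
    """Extractor: follow each dotted path; a path that breaks yields None."""
    data = json.loads(payload)
    record: Dict[str, Any] = {}
    for field, path in template.items():
        value: Any = data
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        record[field] = value
    return record


def run_worker(
    reader: Callable[[str], str],
    urls: List[str],
    template: Dict[str, str],
    store: List[Dict[str, Any]],
) -> None:
    """Worker: read each URL, extract a record, and append it to the store."""
    for url in urls:
        store.append(extract_json(reader(url), template))
```

Swapping the reader for a different site, or the list for a real database client, leaves the extractor and the templates untouched.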