We live in the era of entrepreneurship. Starting a business has never been easier, and as the Internet reaches more and more people, you can easily start one and profit from it. One such business avenue is to collect and aggregate online information and package it as a new product or service. For example, you can develop a scraper that searches for the best sneaker prices available online. Such a scraper needs to crawl a certain number of websites and return its findings.

However, scraping and crawling are very demanding. A scraper can make a large volume of requests per minute, putting a strain on the web servers it scrapes. To protect themselves, those servers will throttle the requests or even temporarily ban the IP from which they originated. Therefore, you will need semi-dedicated proxies to divert your requests and channel them through different IPs. In this article, we are going to show what semi-dedicated proxies are and why you need them when developing a scraper.


What are semi-dedicated proxies?

Semi-dedicated proxies are exactly what the name implies: private proxies that are not fully dedicated... I know, it sounds strange. Their usage is split between a few users, so the requests passing through these proxies include both your requests to other web servers and the requests of the other users sharing them. And since each user only consumes part of the bandwidth, the price of a semi-dedicated proxy is also split between those users. Therefore, semi-dedicated proxies will always cost less than dedicated proxies.


Why do you need shared semi-dedicated proxies?

The main reason for using shared semi-dedicated proxies when developing a scraper was mentioned above: you need a certain number of IPs to divert your requests, so that the targeted web servers will not limit your access to their content because of the large number of requests coming from a single IP.

Therefore, the larger the website you are planning to scrape, the more semi-dedicated proxies you need in order to hit the server with several concurrent threads, as sketched below.
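As a rough illustration, here is a minimal Python sketch of that idea, using the requests library and the standard thread pool; the proxy addresses and target URLs are placeholders, not real endpoints.

```python
# A minimal sketch of spreading concurrent scraper threads over several
# proxies. The proxy addresses and URLs below are placeholders.
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import requests

PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

def fetch(job):
    """Fetch one URL through the proxy assigned to it."""
    url, proxy = job
    response = requests.get(
        url, proxies={"http": proxy, "https": proxy}, timeout=10
    )
    return url, response.status_code

# Round-robin assignment: each URL is paired with the next proxy in the list.
urls = [f"https://example.com/page/{i}" for i in range(30)]
jobs = list(zip(urls, cycle(PROXIES)))

# One worker thread per proxy keeps the request rate per IP modest.
with ThreadPoolExecutor(max_workers=len(PROXIES)) as pool:
    for url, status in pool.map(fetch, jobs):
        print(status, url)
```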


Scraper development and setup

When it comes to actually developing your scraper, its inner guts do not matter as long as it does the job. You can write it in Java, C++, or Python; every major programming language has libraries and frameworks you can use to build a scraper.

However, the most important thing to remember is that the scraper you are developing needs to support a proxy setup. It must be able to access the targeted websites through different IPs, and this is where you configure and use your semi-dedicated proxies.
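As a rough example, here is how a proxy setup might look in a Python scraper built on the requests library; the proxy address and target URL are placeholders, not values from a real provider.

```python
# A minimal sketch of routing a scraper's requests through a proxy.
import requests

PROXY = "http://203.0.113.10:8080"  # hypothetical semi-dedicated proxy

def fetch(url: str) -> str:
    """Fetch a page, routing the request through the configured proxy."""
    proxies = {"http": PROXY, "https": PROXY}
    response = requests.get(url, proxies=proxies, timeout=10)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(fetch("https://example.com")[:200])
```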


Why are semi-dedicated proxies a good start?

For scraping, you can use any proxies you find; you can even use virgin proxies. However, the best option for a scraper is to buy semi-dedicated proxies, especially if the scraping will be performed on unfamiliar websites. Semi-dedicated proxies are also cheaper, and when you are experimenting and testing the limits of your scraper, it makes sense to use the cheapest proxies available. So buy semi-dedicated proxies and start scraping.


Choose proxies based on your scraper connection

Another factor to consider is how you are going to divert your traffic through multiple proxies and how your semi-dedicated proxies will let you connect to them. Some proxies authenticate based on your IP address, others through a username:password pair, and some semi-dedicated proxies support both. The latter are the ones you want: look for semi-dedicated proxies that let you connect either way, so that the scraper you are developing does not restrict how you can use your proxies.
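To make the difference concrete, here is a minimal Python sketch of both authentication styles with the requests library; the host, port, and credentials are made-up placeholders.

```python
# A minimal sketch of the two proxy authentication styles.
import requests

# 1) IP authentication: the provider has whitelisted your server's IP,
#    so the proxy URL carries no credentials.
ip_auth_proxy = "http://203.0.113.10:8080"

# 2) Username:password authentication: credentials embedded in the URL.
user_pass_proxy = "http://myuser:mypassword@203.0.113.10:8080"

def fetch(url: str, proxy: str) -> int:
    response = requests.get(
        url, proxies={"http": proxy, "https": proxy}, timeout=10
    )
    return response.status_code

print(fetch("https://example.com", ip_auth_proxy))
print(fetch("https://example.com", user_pass_proxy))
```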


Get a VPS for your scraper

One last factor to consider when developing your scraper and using semi-dedicated proxies is how you are going to deploy it. You can buy your own machine and set it up, but it is better to run your scraper on a virtual private server. This way, your scraper stays online around the clock and can keep scraping without interruption.

For this particular case, you can use a free-tier AWS EC2 instance, which is essentially a free VPS on which to deploy your scraper. Moreover, EC2 instances come with their own public IPs, so you can have your scraper connect to the semi-dedicated proxies however you find best, either through IP authentication or through a username and password.
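If you go with IP authentication, you will need to know which public IP your VPS sends traffic from so your proxy provider can whitelist it. Here is a minimal Python sketch, assuming you are fine with querying a public IP-echo service such as api.ipify.org:

```python
# A minimal sketch for discovering the public IP your VPS sends traffic from,
# so you can register it with your proxy provider for IP authentication.
# api.ipify.org is just one public IP-echo service; any equivalent works.
import requests

public_ip = requests.get("https://api.ipify.org", timeout=10).text
print(f"Register this IP with your proxy provider: {public_ip}")
```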



 Wednesday, January 17, 2018


