I am a talented web scraping specialist with several years of hands-on experience.
I am familiar with Scrapy, Selenium, and Beautiful Soup.
I have completed around 30k web scrapers so far.
In particular, my experience covers gathering data from a wide range of sources, lead generation, bypassing anti-scraping protections, data cleaning, data mining, and more.
One of the most exciting projects I have done was scraping around 10,000 US automobile websites, which required a robust script that crawled every site once a day.
I decided to build the scrapers in Python with Scrapy.
To speed up the crawling, I followed Scrapy's broad-crawl approach, tuning its settings to make many concurrent requests at once.
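As a rough illustration of that tuning, the settings below sketch what a broad-crawl configuration in Scrapy can look like. The option names are real Scrapy settings, but the specific values here are assumptions for illustration, not the ones used in the actual project:

```python
# Illustrative Scrapy settings for a broad crawl; the exact values are
# assumptions, not the ones from the original project.
BROAD_CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 100,           # raise global concurrency
    "CONCURRENT_REQUESTS_PER_DOMAIN": 8,  # stay polite to each site
    "REACTOR_THREADPOOL_MAXSIZE": 20,     # more threads for DNS resolution
    "COOKIES_ENABLED": False,             # broad crawls rarely need cookies
    "RETRY_ENABLED": False,               # skip retries to keep throughput up
    "DOWNLOAD_TIMEOUT": 15,               # drop slow sites quickly
    "LOG_LEVEL": "INFO",                  # reduce logging overhead
}
```

These would normally live in a project's `settings.py`; the key idea is high global concurrency with a modest per-domain cap.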
I also had to get around CAPTCHAs and rate limits.
To do that, I rotated IPs and user agents and adjusted the download timeout (in practice I used about 1,000 proxies).
Each site has around 1k-5k ads, and the crawled HTML pages are stored in a specified folder under unique names.
The extracted items such as VIN, Mileage, Price, Model, and Name are stored in MongoDB.
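The storage side can be sketched as follows. Hashing the URL is one common way to get a stable unique file name; the `MongoItemPipeline` here is a hypothetical illustration that accepts any object with an `insert_one()` method (such as a pymongo collection), not the project's actual pipeline:

```python
import hashlib
from pathlib import Path

def page_path(url, folder="pages"):
    """Derive a unique, stable file name for a crawled page from its URL."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(folder) / f"{digest}.html"

class MongoItemPipeline:
    """Store extracted listing fields.

    `collection` is any object exposing insert_one(), e.g. a pymongo
    collection; using a plain interface keeps this sketch testable
    without a running MongoDB.
    """

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider=None):
        self.collection.insert_one(dict(item))  # one document per listing
        return item
```

Because the file name is a hash of the URL, re-crawling the same page overwrites its previous snapshot instead of creating duplicates.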
The crawler was deployed on AWS.
To distribute the crawlers, I used 100 CentOS VMs and built a system, similar to Gearman, that sends commands from a central server to all crawlers and gathers the results back.
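The dispatch-and-gather pattern above can be sketched in miniature. This is an in-process stand-in using threads and queues, purely to illustrate the command-out, results-back flow; the real system ran across 100 VMs over the network:

```python
import queue
import threading

def worker(commands, results):
    """Pull crawl commands until the queue drains, push results back."""
    while True:
        try:
            site = commands.get_nowait()
        except queue.Empty:
            return  # no work left for this worker
        results.put((site, f"crawled {site}"))  # stand-in for a real crawl
        commands.task_done()

def dispatch(sites, num_workers=4):
    """Fan commands out to workers and gather all results centrally."""
    commands, results = queue.Queue(), queue.Queue()
    for site in sites:
        commands.put(site)
    threads = [threading.Thread(target=worker, args=(commands, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(results.queue)
```

Swapping the in-memory queues for a network job server (Gearman, or a Redis-backed queue) gives the same flow across many machines.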
The crawler keeps working 24/7.
Please contact me so we can discuss your project in more detail.
Thank you