
Whitelisting IP Address

Hi,

I am trying to crawl the website with Screaming Frog SEO Spider, but most of the URLs return a response code of 0 (connection timeout).

This is probably due to the crawl delay in the robots.txt file. Can you whitelist two of my IP addresses so I can perform a crawl?

Thanks

Hi Ivan,

Most likely your crawl rate is too high and you are triggering our anti-DDoS protection.

The bot needs to respect the standard crawl-delay set in robots.txt for all sites.

Even then, I would suggest crawling at about half that speed, and making sure the bot stays within the limit on simultaneous requests as well as the limit on request rate.
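As a rough illustration for anyone scripting their own crawler (this is not Screaming Frog configuration), a minimal Python sketch of that advice might look like the following. The robots.txt content, the function names `polite_delay` and `crawl`, and the `fetch` callable are all made up for the example; only `urllib.robotparser` is a real standard-library module. It reads the crawl-delay, doubles it to crawl at half speed, and fetches one URL at a time so there are never simultaneous requests.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
"""

def polite_delay(robots_lines, user_agent="*", safety_factor=2.0, default=5.0):
    """Return the pause between requests: the crawl-delay times a safety factor."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # No crawl-delay listed: fall back to an assumed default.
        delay = default
    return delay * safety_factor

def crawl(urls, fetch, delay, sleep=time.sleep):
    """Fetch URLs strictly one at a time, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)
        results.append(fetch(url))
    return results

delay = polite_delay(ROBOTS_TXT.splitlines())
print(delay)  # 10.0, i.e. half the speed that a 5 second crawl-delay allows
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic separate from the HTTP client, so it can be tried out without touching a live site.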

To get your IP unblocked, you will need to open a ticket, and the support team will provide a standard agreement you need to consent to in order to have crawling access re-enabled.

Bear in mind the entire system is automated and we cannot allow exceptions for platform-wide performance reasons. If the crawl limits are exceeded again, even if you previously had access reinstated, you will be automatically blocked again.

Kind regards,
Donogh

Thanks for your response, Donogh.

In the robots.txt file, the crawl delay is set to 5 (seconds). The issue is that the website has well over 9,000 URLs, which essentially makes it impossible to crawl.
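To put a number on it, here is a back-of-the-envelope estimate. It assumes exactly 9,000 URLs (the real count is higher), one request at a time, and no time spent on the requests themselves; the function name `crawl_hours` is just for illustration.

```python
def crawl_hours(url_count, delay_seconds):
    """Lower bound on crawl duration in hours when pausing between every request."""
    return url_count * delay_seconds / 3600

print(crawl_hours(9000, 5))   # 12.5 hours at the 5 second crawl-delay
print(crawl_hours(9000, 10))  # 25.0 hours at half that speed
```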

Is there any workaround that would let sites of this size be crawled in a reasonable time frame?

Ivan

Hi Ivan,

Google and Facebook happily obey the crawl delay, so, with respect, I don’t see any issue here. The same is also true for much larger sites, with upwards of 50,000 items.

Allowing intensive crawlers on client sites severely degrades the performance of the entire SaaS platform. If every client were hammering their sites with crawlers like that, it would literally cost our retailers money.

Therefore, I’m afraid this policy cannot be negotiated.

Can I ask what you are using the crawler for please? We may be able to suggest better options.

Kind regards,
Donogh

Hi Donogh,

Thanks for your answer.

I'm not sure about Facebook, but Google does not take the crawl delay into consideration; it adjusts its crawl rate based on how the server responds.

I will figure something out for getting the data about each individual page.

Thanks for your time and prompt responses.

Regards,
Ivan

Sounds good. You’re very welcome. Please feel free to open a ticket if you need more specific assistance.