Robots Text Disallow for int'l Search Engines - SEO


#1

Hi,

I was checking our robots.txt and saw that we are blocking international search engine robots such as Naver.
Why?

Best,
Mohsen


#2

Hi Mohsen,

All of the banned bots have exhibited spammy behavior: they don’t respect typical crawl rates, and they aggressively and repeatedly request content from our servers.

You might find this informative: http://searchenginewatch.com/sew/news/2067357/bye-bye-crawler-blocking-parasites

If you Google many of the blocked bots, you’ll see that other webmasters have also had problems with them. Some examples:

spbot: https://www.webmasterworld.com/search_engine_spiders/4073028.htm

Naverbot: https://www.webmasterworld.com/forum11/2471.htm

Xovibot: https://www.webmasterworld.com/search_engine_spiders/4685507.htm
http://blocklistpro.com/content-scrapers/xovibot-seo-crawler-from-xovi-gmbh.html

Ahrefsbot: http://blocklistpro.com/content-scrapers/ahrefsbot-seo-spybots.html

Regards,
Donogh


#3

Hi Donogh,

Thanks for your quick reply and useful links.

Regards,
Mohsen


#4

Hi Donogh,

Same topic…

How do we change our robots.txt file?

Also, can you explain the following 3 statements in our robots.txt file? Especially the one on product images?

User-agent: *
Disallow: /store/filtered/

User-agent: Googlebot-Image
Disallow: /product_images/

User-agent: *
Crawl-delay: 5

Thanks,

Stan


#5

Hi Stan,

The store will incorporate any robots.txt you upload to the root of your FTP site. The default robots.txt cannot be modified.

User-agent: *
Disallow: /store/filtered/

Filtered search results pages are excluded – the contents of those pages are not unique and are indexed on department / category / subcategory pages.

User-agent: Googlebot-Image
Disallow: /product_images/

Product images are available on your main URL but they are normally served via our content delivery network on images.nitrosell.com. So that directory is excluded in favor of the CDN content being indexed.

User-agent: *
Crawl-delay: 5

That’s to control the crawl rate of certain robots, which can be excessive and can affect your site’s performance.
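
If it helps, here is a minimal sketch of how a standards-following crawler would read those three statements, using Python's built-in `urllib.robotparser`. The two `User-agent: *` groups are merged into one so the sketch does not depend on how a given parser treats repeated `*` groups. The bot name `SomeBot` and the example paths are made up for illustration, and real crawlers may differ in detail (not all of them honor `Crawl-delay`, for instance).

```python
from urllib.robotparser import RobotFileParser

# The three statements discussed above. The two "User-agent: *" groups are
# merged into one group here (an assumption made for this sketch).
RULES = """\
User-agent: *
Disallow: /store/filtered/
Crawl-delay: 5

User-agent: Googlebot-Image
Disallow: /product_images/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(RULES)

# "SomeBot" and the paths below are hypothetical, for illustration only.
filtered_blocked = not rules.can_fetch("SomeBot", "/store/filtered/brand-x")
images_blocked = not rules.can_fetch("Googlebot-Image", "/product_images/cigar.jpg")
other_bot_images_ok = rules.can_fetch("SomeBot", "/product_images/cigar.jpg")
delay = rules.crawl_delay("SomeBot")

print(filtered_blocked, images_blocked, other_bot_images_ok, delay)
```

The point of the sketch is that the image rule only applies to Googlebot-Image, while the filtered-results rule and the crawl delay apply to any bot that has no group of its own.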

Regards,
Donogh


#6

Hi Donogh,

Thanks for the advice. I added a new robots.txt file to our FTP root, but it is still not recognized. The URL is still serving Nitrosell’s default file.

Can you look at this for me?

Stan


#7

Hi @jbw, Stan has added a robots.txt to his FTP root and apparently it’s being ignored.

The contents of the file on the FTP are:

User-agent: *

If you visit the live URL, it’s not incorporating Stan’s file:

http://www.coronacigar.com/robots.txt

Would you mind reviewing this, please?

Thanks,
Donogh


#8

Hi

It is incorporating Stan’s file. The robots.txt file contains:

User-agent: *
Sitemap: http://www.coronacigar.com/sitemap.txt

This appears at the top of the robots.txt, and the Nitrosell default robots.txt is appended after it:

www.coronacigar.com/robots.txt
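
To illustrate what "appended" means in practice, here is a rough sketch in Python. The `combine` function is a hypothetical stand-in for whatever the store does internally (custom file first, default rules after it), and `DEFAULT_EXCERPT` is only the one default rule quoted earlier in this thread, not the full default file. `SomeBot` and the example path are made up.

```python
from urllib.robotparser import RobotFileParser

# Contents of the uploaded file, as quoted in this thread.
CUSTOM = """\
User-agent: *
Sitemap: http://www.coronacigar.com/sitemap.txt
"""

# One default rule quoted earlier in the thread (an excerpt, not the full file).
DEFAULT_EXCERPT = """\
User-agent: *
Disallow: /store/filtered/
"""


def combine(custom, default):
    """Hypothetical merge: the custom file first, the default rules appended."""
    return custom.rstrip("\n") + "\n\n" + default


served = combine(CUSTOM, DEFAULT_EXCERPT)

parser = RobotFileParser()
parser.parse(served.splitlines())

# The uploaded file adds lines at the top but does not remove the default rules.
still_blocked = not parser.can_fetch("SomeBot", "/store/filtered/brand-x")
print(still_blocked)
```

Under these assumptions the default disallow still applies after the merge, so uploading a file adds to the served robots.txt rather than replacing it.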

Regards
Jerry


#9

Hi Jerry,

Yes, the sitemap is incorporated into the robots.txt file, but what I am trying to do is get http://www.coronacigar.com/robots.txt to show only the robots.txt file I uploaded to my root directory. I do not want to use the Nitrosell default robots.txt because it has too many disallows and delays. I want to try this for a couple of months to see if our organic traffic increases. Our competitors do not have any disallows or delays on their sites.

Let me know your thoughts.

Stan