Google Indexing Problem since July 18, 2006

  • Bigwebmaster
  • Site Admin
  • Posts: 9089
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

I have a strange issue that is currently going on with Ozzu. Around July 18th I noticed that Googlebot had stopped visiting the site. Normally Googlebot crawls about 3000-15000 pages per day.

I also have the site in Google Sitemaps, so I went in there to see if there were any errors that might explain why Googlebot had stopped visiting. There was indeed an error, which says:

5xx error robots.txt unreachable

There is a detailed description of this error which says:

Quote:
Before we crawled the pages of your site, we tried to check your robots.txt file to ensure we didn't crawl any pages that you had roboted out. However, your robots.txt file was unreachable. To make sure we didn't crawl any pages listed in that file, we postponed our crawl. When this happens, we return to your site later and crawl it once we can reach your robots.txt file. Note that this is different from a 404 response when looking for a robots.txt file. If we receive a 404, we assume that a robots.txt file does not exist and we continue the crawl.


You can find the direct link to this here:

http://www.google.com/support/webmaster ... 5154&hl=en

So according to that error, since Google cannot obtain the robots.txt file, it will not attempt to fetch any pages from the site until that gets resolved. This would explain why Googlebot might have stopped visiting the site. However, it doesn't resolve the problem.

The odd thing is that if you go to http://www.ozzu.com/robots.txt it loads perfectly fine. I checked that the robots.txt file was still fine as far as the format goes, and there are no problems with the syntax of anything inside the file; I haven't changed it in months. To me a 5xx error would indicate that my server was somehow not returning the page properly, such as an Internal Server Error or the like.

So I checked all of my access and error logs, and Googlebot hasn't even tried to visit the site, period. Google Sitemaps shows that it tried to retrieve robots.txt on July 18, 2006 and again on July 21, 2006, and that both attempts resulted in a 5xx error. Yet in my logs for those days Google shows up zero times. It never requested the file; in other words, if it did try, the request never actually reached the server. At first I thought maybe Googlebot had just caught my server at a bad time and that was why it couldn't reach it. However, my server has had no downtime, and according to Google Sitemaps it tried on two separate days with the same 5xx error both times.
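
For anyone who wants to run the same check on their own server: the access log only tells you which bot made a request if the log format records the User-Agent. A minimal sketch using Apache's stock "combined" format (the log path here is just an example):

# httpd.conf -- include the User-Agent in the access log so
# crawler visits are identifiable. The LogFormat line below
# recreates Apache's standard "combined" format.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/httpd/access_log combined

With that in place, any request that actually reaches the server shows up with its User-Agent string, which is what makes the complete absence of Googlebot entries so telling.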

Since it's not even showing up in the httpd logs, I started thinking that my DNS might not be working properly. I verified that my site's DNS is indeed working correctly by checking it at DNSreport.

Next I started to think that maybe I had inadvertently blocked Googlebot from visiting the server with my firewall. I checked the configuration and everything was exactly the same as it has always been. Still, I wasn't entirely sure (maybe something slipped past me), so I decided to run a test using another site on the same server. That site is very small and very new, so I submitted it to Google Sitemaps, and within a day Googlebot stopped by and grabbed its robots.txt file with no problem. It shows up clearly in my logs too, and in the Google Sitemaps area for that site the status is OK. So that confirms my server is not blocking Googlebot.

Any ideas why Googlebot might have stopped coming? The site has been around since 2002 and has really never had any problems with Googlebot.

There are only two indications of the problem. The first is that Googlebot is no longer visiting, and the second is the one error I see in Google Sitemaps:

5xx error robots.txt unreachable

If anybody has any ideas, or has had this happen to them, I would love to hear how you resolved it. I did email Google yesterday, but a reply could take a few weeks, if it comes at all.
  • ATNO/TW
  • Super Moderator
  • Posts: 23456
  • Loc: Woodbridge VA

Post 3+ Months Ago

I have no clue what the problem is, but at work I start with Troubleshooting 101, and the first thing I would do is go back in time to July 17th or 18th and ask myself: what, if anything, did I change?
  • ATNO/TW
  • Super Moderator
  • Posts: 23456
  • Loc: Woodbridge VA

Post 3+ Months Ago

Hmmmm, this is interesting. From here:
http://forums.searchenginewatch.com/sho ... eadid=2786

Quote:
Embracing a Partial Hope: Robots.txt Rewrites

A 2002 WMW thread describes a procedure for rewriting robots.txt files and avoiding unwanted non-human visitors. The prescribed procedure consists of banning all robots and allowing the good ones with the following script:

RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]
where "someotherfile" could be a fake robots.txt or a blank file.

A line-by-line explanation is given below.

Line 1: IF the User-agent string starts with "Mozilla" or "Opera", do this rewrite.

Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (i.e., two bots whose UAs start with "Mozilla"), do this rewrite.

Line 3: THEN rewrite robots.txt to "someotherfile".

All robots not identified by lines 1 and 2 will be served the fake file. Feel free to modify this script to your heart's content.
This solution works with some non-human visitors...
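
If you experiment with a rewrite like that, one thing to watch is that a misfire is exactly the kind of thing that could make robots.txt unreachable for a crawler. A minimal defensive sketch (the explicit crawler guard is my own addition, not part of the quoted script) would let known bots through before any rewriting happens:

# Pass known crawlers straight through to the real robots.txt
# before any user-agent-based rewrites can intercept the request.
# The "-" target means "no substitution"; [L] stops rule processing.
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp) [NC]
RewriteRule ^robots\.txt$ - [L]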

  • camp185
  • Graduate
  • Posts: 214
  • Loc: San Jose, CA

Post 3+ Months Ago

Did you see this page: http://www.google.com/support/webmaster ... tx=related
  • Bigwebmaster
  • Site Admin
  • Posts: 9089
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

Yes, it's not that. There have been absolutely zero requests from Google in the logs since about the 18th of July.
  • camp185
  • Graduate
  • Posts: 214
  • Loc: San Jose, CA

Post 3+ Months Ago

Hitting a few other forums, it sounds like you are not alone; many experienced this starting in late March to early April. No answers there either.

There were some guesses that Googlebot is learning to change its crawl frequency based on a site's content. Two examples they guessed at were sites that don't update often no longer requiring regular scans, and sites with so much new content (like a forum) seeing a reduced scan frequency but an increased scan volume. The final guess is that the Demon Seed has taken over the internet, locked all the doors, and is incubating a new child.

Guesses or not, there are people out there just like you.
  • Bigwebmaster
  • Site Admin
  • Posts: 9089
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

Hey guys, I just thought I would let you know that the problem got fixed today. Googlebot is now active again, and all my Google Sitemaps errors have vanished too. So everything is great in Ozzuland!
  • Benat
  • Mastermind
  • Posts: 2123

Post 3+ Months Ago

YAY Happy times in Ozzuland :)
  • lioness
  • Mastermind
  • Posts: 1615

Post 3+ Months Ago

Bigwebmaster wrote:
Hey guys, I just thought I would let you know that the problem got fixed today. Googlebot is now active again, and all my Google Sitemaps errors have vanished too. So everything is great in Ozzuland!


Great news. What was the issue in the end?
  • Bigwebmaster
  • Site Admin
  • Posts: 9089
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

I have no idea, and I doubt I ever will. I made lots of changes just in case it was something on my end, but as far as I know the problem could very well have been something on Google's end too. Regardless of whether it was something I did or not, the problem seems to be fixed: Googlebot has already crawled over 5000 pages today and counting.

In case you are curious exactly what changes I made, here is what I did:

  • Changed Ozzu's IP address to a new one so that it was no longer shared with the name server's IP address
  • Created a reverse DNS entry on the IP address to point to http://www.ozzu.com
  • Removed all of the banned IP addresses from my firewall just in case something there was causing problems
  • Fixed a glitch in the Ozzu sitemap files which was causing the content type to be text/html when it should have been text/xml
  • Modified ServerTokens in Apache so that the server reveals less about itself as far as versions and what modules are installed
  • Removed the ETag header that gets added automatically for robots.txt, using the "FileETag None" directive (see the sketch after this list)
  • Cleaned up the robots.txt file some for entries that were no longer needed
  • Removed Ozzu from the old server, and removed all DNS entries on the old server
  • Updated Apache to the latest version (it was slightly out of date)
  • Updated other software to their latest versions
  • Rebooted the server (you never know!)
  • Posted this problem in a few other places besides Ozzu, such as WebmasterWorld and the Google Sitemaps group
  • Discussed the problem with a contact I have at Google, although his resources are limited
  • Emailed the Google Sitemap department
  • Made sure that robots.txt wasn't in DOS format (CRLF line endings)
  • Went through all of the Apache log files to see if I could find any evidence of what was going on
  • Sniffed the network packets on the server to see if Google was even attempting to make a TCP connection at all
  • Disabled the firewall for a while to see if that would do anything; I also rechecked all the settings to make sure everything was correct
  • Normally, when anybody tries to access something at ozzu.com without the www, it is permanently redirected (301) to the www domain. I kept this as it was, except for robots.txt, which is now served on both the www and non-www domains with no redirect (see the sketch after this list)
  • Before, I had 404 pages redirecting to the Ozzu homepage. I removed that redirect, and now the standard 404 page is served. I might customize that a bit more in the future, but it was worth mentioning
  • Removed all the banned IPs from the phpBB admin panel
  • Tested another site on the same server as Ozzu to see if Googlebot was able to reach it. While the problem still existed on Ozzu, Googlebot was successfully crawling this other site, which ruled out the firewall as the culprit
  • Stalked the logs 24 hours a day in hopes of seeing an entry from Googlebot. Finally saw one this morning, when the Google Media bot requested the robots.txt file; soon after that, Googlebot started doing its thing
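
For the two robots.txt-related items above, here is roughly what the relevant Apache configuration looks like. This is a minimal sketch rather than my exact config; the rule placement and matching details will vary by setup:

# Redirect ozzu.com to www.ozzu.com with a 301, but leave robots.txt
# reachable on both hostnames with no redirect.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^ozzu\.com$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteRule ^/?(.*)$ http://www.ozzu.com/$1 [R=301,L]

# Stop emitting an automatic ETag header for robots.txt.
<Files "robots.txt">
    FileETag None
</Files>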
  • meman
  • Web Master
  • Posts: 3432
  • Loc: London Town , Apples and pears and all that crap

Post 3+ Months Ago

The Mediabot has started indexing for Googlebot recently, so it might have been a problem on Google's end related to that.
  • Bigwebmaster
  • Site Admin
  • Posts: 9089
  • Loc: Seattle, WA & Phoenix, AZ

Post 3+ Months Ago

Yes, it could very well be related to that and the caching system they have now. The Google Media bot obviously retrieved the robots.txt file for the regular Googlebot today. Until today, there hadn't been any Googlebot or Google Media bot entries for the last 10 days, so both bots were unable to reach Ozzu for some reason. Now both bots are retrieving pages from the server.

I really wish I could know for certain what caused the problem, but at least it's fixed.
  • SEO_Pro.
  • Student
  • Posts: 88
  • Loc: SLC, Ut

Post 3+ Months Ago

Have you checked the CHMOD (file permissions) on the robots.txt file?

- Jacob Kerr
WI Works, Inc. - Web Development Engineer
  • SEO_Pro.
  • Student
  • Posts: 88
  • Loc: SLC, Ut

Post 3+ Months Ago

Quote:
See RFC 2616 for a complete list of these status codes. Likely reasons for this error are an internal server error or a server busy error. If the server is busy, it may have returned an overloaded status to ask the Googlebot to crawl the site more slowly. In this case, we'll return again later to crawl additional pages.

http://www.google.com/support/webmaster ... swer=35149
