Google Penalty hits a Major California City
Google penalties are not limited to businesses doing SEO. One of our clients is a large city in California. Last night a representative contacted us and said that people couldn't find the main city website on Google. Sure enough, the website had almost completely disappeared from Google's organic search results. It is your standard city website and hadn't had any offsite SEO performed on it, so the odds of a manual or algorithmic penalty of this magnitude were slim. As with most major cities, you can do a host of critical functions on the website, such as paying parking tickets (I swear I was only in the spot for an hour and no I'm not parked further than 19 inches or closer than .2 inches from the curb).

A quick Google search found massive drops in the city's organic Google rankings, Google Analytics showed huge drops in organic traffic from Google, and Fruition's Google Penalty Checker said something was seriously wrong. The following is an overview of Fruition's playbook for diagnosing Google penalties.

Here's what the website's organic search traffic looked like for the last 30 days: an instant drop to almost zero. That told us immediately that there was a serious issue. A drop like this points to one of two causes: a server outage or website configuration issue, or a site hack. Since this is a city website where a lot of personal information is exchanged, a hack was obviously the first concern.
Google Penalty Playbook
The potential for this being a hack was worrisome, so we tackled that first. An initial look at the website and its source code didn't show any of the telltale signs: there was no encrypted code, we didn't see any spam links, the site wasn't showing up for drug-related searches, etc. However, with other hacks (not this one) there have been malware injections and website hijacks that inspect the visitor and show the real website to real visitors and the hacked site to Googlebot. Websites impacted by this type of hack often include content related to certain drugs or counterfeit goods. Here are a few ways to check and rule this out:
- In Google Analytics, look at the referring organic traffic. Even though Google encrypts all search data, you may still see some telltale signs of this type of hack. Those include: (i) geographic referrers change, e.g. your traffic is all from Los Angeles and then switches to Moscow, which shows that you are ranking for different keywords in different geographic areas; (ii) you are getting traffic for bad keywords (again, more difficult to see now).
- In Google Webmaster Tools, go to Search Traffic > Links to Your Site. Look at the anchor text for the websites that are linking to you. A lot of these hacks are automated, so as soon as your website is hacked, other hacked websites will have backlinks pointing at you related to the bad keywords. These nefarious backlinks with bad keywords, or keywords unrelated to your website, may not show up instantly since Google isn't forthcoming about when the list of backlinks is updated.
- Third-party backlink services update more regularly. MajesticSEO.com is probably the best right now. Run your site through Majestic's tool and look at the backlinks that you are getting.
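The cloaking hack described above (one page for real visitors, another for Googlebot) can also be spot-checked by requesting the same URL with a browser User-Agent and with Googlebot's User-Agent, then comparing what comes back. The following is a minimal Python sketch, not part of Fruition's tooling. The watch list, the function names, and the browser User-Agent string are all illustrative assumptions. It only catches cloaking keyed on the User-Agent header, so a hack that cloaks by Google's IP addresses would pass this check unnoticed.

```python
import urllib.request

# Hypothetical watch list; extend it with terms that make sense for your site.
SPAM_TERMS = ("viagra", "cialis", "replica watches", "payday loan")

# An ordinary browser User-Agent (illustrative) and Googlebot's published one.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def fetch(url, user_agent, timeout=10):
    """Download a page while presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def spam_terms_found(html, terms=SPAM_TERMS):
    """Return the watch-list terms that occur in the page, ignoring case."""
    lowered = html.lower()
    return sorted(term for term in terms if term in lowered)


def cloaking_suspects(browser_html, bot_html, terms=SPAM_TERMS):
    """Return terms present in the bot-identified response but not the browser one."""
    bot_terms = set(spam_terms_found(bot_html, terms))
    browser_terms = set(spam_terms_found(browser_html, terms))
    return sorted(bot_terms - browser_terms)


# Example usage (performs two live requests; the URL is a placeholder):
# url = "https://www.example.gov/"
# print(cloaking_suspects(fetch(url, BROWSER_UA), fetch(url, GOOGLEBOT_UA)))
```

An empty result is only weak evidence that the site is clean; any term that shows up solely in the Googlebot response deserves a closer look at the server.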
Security Summary for Google Penalty
OK. Great, no blatant hacking! The obvious hacks are ruled out based on the above. Now we'll move on to Googlebot just not being able to access the site.
Google Bot Limited Access Penalty
I'm going to call this the "GoogleBot Limited Access Penalty" because that's effectively what it is. Since we ruled out hacks for now we focused on access issues and site configuration problems. The first thing we checked (before anything else including the security) was to ensure that there wasn't a robots.txt file that added the dreaded / in the wrong spot. Here's an example of a bad robots.txt (this blocks all bots):
User-agent: *
Disallow: /
Here's a link to the actual robots-nobots file (downloads a text file). This robots.txt allows all (what you want in most cases):
User-agent: *
Disallow:
(Here's the actual example robots-allowbots [downloads a text file]. This is placed in the root directory of the site.) Sometimes the wrong robots.txt file gets moved over from a development site, where you want to prevent Google from indexing the site because the development site could cause duplicate content penalties (having both the production and development sites indexed by Google is bad). In this case that didn't happen.
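The difference between the blocking and the allowing robots.txt is a single slash, so it is worth checking mechanically rather than by eye. Here is a small sketch using Python's standard-library `urllib.robotparser`. The helper name `googlebot_allowed` and the example URL are assumptions made for illustration. Note that each directive has to sit on its own line for the parser to read it correctly.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt variants discussed above, one directive per line.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def googlebot_allowed(robots_txt, url="https://www.example.gov/", agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(googlebot_allowed(BLOCK_ALL))  # the stray "/" blocks every crawler
print(googlebot_allowed(ALLOW_ALL))  # an empty Disallow permits everything
```

To test a live site instead of a string, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before `can_fetch()` is called.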
Is GoogleBot still blocked?
Usually, if Googlebot can't access your website you get a message in Webmaster Tools. These messages look like the following:
While crawling your site, we have noticed an increase in the number of transient soft 404 errors around 2013-07-15 03:00 UTC (London, Dublin, Edinburgh). Your site may have experienced outages. These issues may have been resolved. Here are some sample pages that resulted in soft 404 errors:
Or it could look like this:
Over the last 24 hours, Googlebot encountered 188 errors while attempting to retrieve DNS information for your site. The overall error rate for DNS queries for your site is 1.0%.
In this one it says there's a DNS problem. If you get this error, Google provides a great checklist that helps you diagnose the problem. For example, it says that if the "error rate was 100%" you should check your DNS, domain WHOIS, etc. Google then goes on to provide nice details on each specific error that Googlebot runs into. Here's a checklist and quick links to those:
- For Googlebot server connectivity issues.
- For Googlebot access denied errors (e.g. robots.txt).
- For Googlebot crawl errors (these are dreadful on certain servers).
- For general server errors
In this case no messages appeared in webmaster tools. However, we're not done with webmaster tools yet. Under Search Traffic > Manual actions we want to make sure there isn't a manual action. Here there wasn't. Now under Crawl we want to see what the crawl history looks like. In this case there are a lot of issues but the first thing that pops out is that Google says that "robots.txt is inaccessible." Now we're getting somewhere.
Steps to take if you get robots.txt is inaccessible message in webmaster tools
Now that we know that Google is indeed having accessibility issues, we can start to figure out where the problem is. From a big-picture view there are a few spots that could block Googlebot. First, DNS could be bad, but we ruled that out above. Further, Google was trying to access a specific file, which means that Google was getting an IP back and trying to get to the server (which we assume is the right one for now). Second, there could be a routing issue in the datacenter. Yet in this instance regular visitors could access the site, so the routing seemed to be OK. Third, there could be a hardware firewall issue. Fourth, there could be a software firewall issue.

The hardware firewall in this case isn't managed by Fruition. When the IT manager in charge of the firewall was alerted to the potential problem, he was able to backtrack to the dates that were identified and found that a change had been made to the firewall policies. He was able to adjust the policy and fix the configuration issue! We verified the fix by going to Google's tool under Crawl > Fetch as Googlebot and fetching the robots.txt file.
Penalty from hardware config issues
A new Palo Alto firewall was configured to allow browsing web traffic but not bots. Thus, it blocked Googlebot. When Googlebot can't get to a website, Google worries that the website is down (or hacked) and that regular visitors would not be able to access it either. This would produce a bad search result, because someone would search Google for "paying my parking ticket in big city California", see the website in the search results, click on it, and get a dead site. This is bad for Google search quality. So when Googlebot can't access the site, Google removes the site from its index. This is exactly what happened here.
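A rough way to reproduce the "regular visitors get through, bots don't" symptom from outside the network is to request robots.txt twice, once with a browser User-Agent and once with Googlebot's, and compare the results. This Python sketch is an illustration under assumptions: the function names and result labels are invented here, and it is unknown whether the firewall in this story matched on the User-Agent header at all. Firewalls can also identify bots by IP range or behavior, so a clean result does not prove Googlebot gets through; Fetch as Googlebot remains the real test.

```python
import urllib.error
import urllib.request

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def robots_status(site, user_agent, timeout=10):
    """Return the HTTP status for <site>/robots.txt, or None if the connection fails."""
    request = urllib.request.Request(
        site.rstrip("/") + "/robots.txt", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused or dropped connection, timeout


def diagnose(browser_status, bot_status):
    """Classify a pair of statuses (None means the connection itself failed)."""
    browser_ok = browser_status is not None and browser_status < 400
    bot_ok = bot_status is not None and bot_status < 400
    if browser_ok and bot_ok:
        return "no-user-agent-block-detected"
    if browser_ok and not bot_ok:
        return "bot-traffic-blocked"
    if not browser_ok and not bot_ok:
        return "failing-for-everyone"
    return "inconclusive"


# Example usage (performs two live requests; the URL is a placeholder):
# site = "https://www.example.gov"
# print(diagnose(robots_status(site, BROWSER_UA), robots_status(site, GOOGLEBOT_UA)))
```

Keeping `diagnose` separate from the network call makes it easy to log the raw statuses alongside the verdict when reporting the problem to whoever manages the firewall.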
How to prevent firewalls from inadvertently blocking Googlebot
This is certainly not the first time that a firewall has blocked Googlebot. When you are configuring a firewall and you see the check box for bots, your gut reaction is "no, I don't want any bleeping bleeping bots." Yet if the major search engines aren't excluded from the block, this is the result. With Palo Alto firewalls it is apparently very easy to check the wrong box. Watchguard firewalls use the Websense database. Websense's list of bots is limited to "command-and-control" bots. It is geared more toward keeping people on a business network from visiting infected sites than toward stopping bots from accessing the site (i.e. the intrusion protection is built more toward keeping visitors behind the firewall from inadvertently accessing a site controlled by a bad bot than toward keeping bad bots from scanning sites that are hosted behind the firewall). If you're using ModSecurity, the default rule set blocks bots but has exceptions for trusted bots. If you're using Cloudflare's WAF, it does the same by allowing trusted bots and excluding bad bots. Bad bots include content stealers, comment spammers, forum spammers, price checkers, etc.
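When a firewall has to tell trusted crawlers from bad bots, the User-Agent header alone is not enough, since anyone can claim to be Googlebot. Google documents a reverse-then-forward DNS check for this purpose: look up the hostname for the requesting IP, confirm it belongs to a Google crawler domain, then resolve that hostname and confirm it maps back to the same IP. Below is a hedged Python sketch of that check. The function names are illustrative, and the domain list is an assumption that should be compared against Google's current documentation before being relied on.

```python
import socket

# Assumed list of crawler domains; verify against Google's current documentation.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_google_hostname(hostname):
    """Return True if the hostname sits under one of the crawler domains."""
    return hostname.rstrip(".").lower().endswith(GOOGLE_CRAWLER_DOMAINS)


def is_verified_googlebot(ip_address):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    if not is_google_hostname(hostname):
        return False
    try:
        # gethostbyname_ex only handles IPv4 addresses.
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
    return ip_address in forward_ips


# Example usage (performs live DNS lookups; take the IP from your access logs):
# print(is_verified_googlebot("66.249.66.1"))
```

The forward lookup matters because the owner of an IP block controls its reverse DNS and could point it at a googlebot.com name; only the round trip ties the name back to Google.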
Once Googlebot can access the robots.txt file again, it knows that it is authorized to index the site. It took about 12 hours from the time the hardware firewall issue was identified until the website reappeared in Google's index. In this case the Google penalty on the city site was caused by an easy-to-make firewall configuration mistake and not by a manual or algorithmic Google penalty. The result was the same: zero traffic from Google and frustrated parking ticket payers. By following the above guide you can usually diagnose most large, instantaneous drops in traffic.