How to quickly remove 10,000 PDF, XLS, and HTML files from Google & Bing

What a difference a week makes. Last week we got a call from a large city that had accidentally dropped their main city domain name from Google. This week we tackled the exact opposite. A C-level IT director at a middle-market company called when they realized the content of their intranet was accidentally made public. The company needed assistance in getting the confidential information out of Google and Bing search results as fast as possible. This article details the steps needed to remove 10,000 PDF, Microsoft Word, Excel, and other documents from Google and Bing indexes fast while preserving evidence to meet disclosure requirements. It showcases that SEO experts are actually good at something aside from getting websites ranked; they know how to take down content from Google and Bing quickly as well.

The lessons here are also valuable for understanding how not to block Googlebot by accident. Sometimes it helps to reverse engineer what you are actually trying to do in order to understand the right processes. Here, we learn how Googlebot accesses a site, which can help with your traditional SEO efforts.

How can so many documents get on Google and Bing?

A little foreword to the story. Disclosure of corporate documents can happen by accident or it can happen maliciously. This article focuses on fixing accidental disclosures, but it is important to note the two main types of malicious disclosure.

The first kind of malicious disclosure comes from hackers stealing the entire contents of intranets and posting them online. Unfortunately, network security is very difficult and becoming even more difficult. There are custom malware shops in Russia, Romania, China, and elsewhere that specialize in creating software that is very difficult to detect using traditional anti-virus and anti-malware technologies. The second kind of malicious disclosure is by insiders. Unfortunately, malicious disclosure of documents by insiders also happens quite frequently. Both types of malicious disclosure are the subject of many other great articles. Here we are focusing solely on using our SEO skills to remove accidental disclosures on a domain that is owned and controlled by a company.

Checklist for Google and Bing Content Removal

Step 1 – Locking Down the Opening(s) and Damage Assessment

This step has two components that must happen in parallel.

Step 1.a – Damage Assessment

To do the damage assessment correctly you need experts in IT, legal, marketing, PR, and SEO. As an attorney, IT director, and SEO with 15 years of experience, I cringe at the number of issues that this type of incident raises. If you are a publicly traded company you have serious disclosure issues to worry about; if you are a health care company you have HIPAA; if you are a financial services company you have many regulatory issues; and these are just the public disclosure requirements. Contractually you likely have non-disclosure agreements (“NDAs”) and confidentiality agreements, employee privacy issues, and straightforward trade secrets that may be at risk. To understand the scope of the damage, Fruition has developed a checklist for health care, defense, financial services, and retail companies to work from. The checklist in this article provides a jump start on documenting which items were made public and what the reporting requirements are. This can help if the accidental disclosure draws the attention of state's attorneys.

If you are in-house counsel you probably have a good handle on the legal side, if you are IT you probably know how to quickly shut down the opening, and if you are SEO you are relishing the opportunity to get something done quickly that has a definitive end point and is clearly successful. When these groups work together, the damage to the company is minimized and customer trust can be saved.

Step 1.b – Lock Down the Opening

Corporate intranets can be built using many different platforms. Fruition builds intranets for Fortune 1000 enterprises using Drupal and dedicated servers with Watchguard firewalls. Fruition also builds corporate intranets using Microsoft IIS. This particular case involved a Microsoft IIS server.

Start by blocking access to the exposed files. This is typically done by changing permissions from any or all to only the corporate IP addresses. Once access is blocked, we move on to the configuration changes that speed up removal of the documents. This next step is an important one that is often overlooked: properly configuring the robots meta tag and the X-Robots-Tag HTTP header. Without this step Google and Bing may think that your website is not crawlable because of an unknown problem, and they may delay removing the content. You also have to remember that with IIS and most Apache flavors it is easiest to “deny all,” which means the robots.txt file will not be accessible either. The robots.txt file is another trick in the tool bag, but it generally will not work in this specific use case.

Unlike the robots.txt file, the X-Robots-Tag is sent in the HTTP response header, so Google and Bing still receive it even on pages that return a 403 (permission denied). Google’s developers have an excellent document explaining how the X-Robots-Tag works: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag.

In this case we set the X-Robots-Tag header to the following:

X-Robots-Tag: noarchive, noindex, unavailable_after: 23 Jul 2007 15:00:00 PST, nofollow, noydir, nopreview, noimageindex, nosnippet, noodp
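To make the idea concrete, here is a minimal Python sketch of the behavior we want from the server: corporate addresses get the content, everyone else gets a 403, and the X-Robots-Tag header goes out either way. This is an illustration only, not the IIS configuration used in this case. The `respond` and `is_corporate` functions and the 203.0.113.0/24 range are hypothetical names and values chosen for the example.

```python
import ipaddress

# Header value copied from the article above.
X_ROBOTS_TAG = (
    "noarchive, noindex, unavailable_after: 23 Jul 2007 15:00:00 PST, "
    "nofollow, noydir, nopreview, noimageindex, nosnippet, noodp"
)

# Hypothetical corporate range (a documentation-only address block).
CORPORATE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_corporate(client_ip: str) -> bool:
    """Return True if the client address falls inside an allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in CORPORATE_NETWORKS)


def respond(client_ip: str):
    """Return (status, headers): 200 for corporate IPs, 403 for everyone else.

    The X-Robots-Tag header is attached in both cases, so a crawler that
    receives the 403 still sees the noindex instruction.
    """
    headers = {"X-Robots-Tag": X_ROBOTS_TAG}
    status = 200 if is_corporate(client_ip) else 403
    return status, headers
```

The point of the sketch is the last function: the deny decision and the header are independent, so locking the door does not have to mean going silent toward the search engines.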

Once you do this you can check your work by using any of the header check tools such as this one http://web-sniffer.net/.
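If you would rather script the check than paste URLs into a web tool, something like the following Python sketch should work. It uses only the standard library. The function names and the choice of "required" directives are assumptions made for this example, and the parser assumes directives are comma-separated with no commas inside a value.

```python
import urllib.error
import urllib.request


def fetch_x_robots_tag(url: str, timeout: float = 10.0):
    """Return (status_code, X-Robots-Tag value or None) for a URL.

    urllib raises HTTPError for 403 responses; the error object still
    carries the response headers, so we read the header from there.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("X-Robots-Tag")
    except urllib.error.HTTPError as error:
        return error.code, error.headers.get("X-Robots-Tag")


def parse_directives(header_value: str) -> dict:
    """Split a header value into {directive: value-or-None}.

    Assumes comma-separated directives and no commas inside a value.
    """
    directives = {}
    for part in header_value.split(","):
        name, _, value = part.strip().partition(":")
        if name:
            directives[name.strip().lower()] = value.strip() or None
    return directives


def missing_directives(header_value, required=("noindex", "noarchive")):
    """List the required directives that the header does not contain."""
    present = parse_directives(header_value or "")
    return [name for name in required if name not in present]
```

Calling `fetch_x_robots_tag` on one of the exposed document URLs from outside the corporate network should come back with a 403 and the header value; passing that value to `missing_directives` should return an empty list.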

Step 2: Spider All URLs

Spider all URLs that are in the Google and Bing indexes. Note that this goes against Google’s webmaster guidelines (no automated bots used on search results). However, it is important to get a list of the URLs that are included in Google’s and Bing’s indexes. Both Google Webmaster Tools and Bing Webmaster Tools do a good job of giving you many of the URLs, but it has been Fruition’s experience that both engines often leave some out. Furthermore, when you are dealing with tens of thousands or hundreds of thousands of Word, Excel, PDF, and other documents, it is not feasible to do this documentation process by hand.

There are several third-party tools that do this. At Fruition we have an in-house tool that tracks which URLs are driving traffic to a site. By tweaking it to just list the URLs, we generate a list of all documents that are indexed in Google and Bing. Again, this is not something to do frequently, but in this case it is the only reliable way to catalog the potential exposure.
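Whatever tool produces the raw list, the result needs to be de-duplicated and broken down by document type before it is useful as a record of exposure. The following is a small Python sketch of that bookkeeping step. It is not Fruition's in-house tool, and the `normalize` and `inventory` names are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop the fragment so duplicates collapse."""
    parts = urlsplit(url.strip())
    cleaned = parts._replace(
        scheme=parts.scheme.lower(),
        netloc=parts.netloc.lower(),
        fragment="",
    )
    return cleaned.geturl()


def inventory(urls):
    """Return (sorted unique URLs, Counter of file extensions)."""
    unique = sorted({normalize(u) for u in urls if u.strip()})
    extensions = Counter(
        PurePosixPath(urlsplit(u).path).suffix.lower() or "(none)"
        for u in unique
    )
    return unique, extensions
```

Feed it the combined exports from both webmaster tools plus whatever your spider found, and keep the output with the rest of the evidence.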

Step 3 – Set Up Webmaster Tools in Google and Bing

Since this is likely an intranet or similar service, there probably was no reason to add the site to webmaster tools before. Now there is: getting the content removed from Google and Bing, FAST!

Here are the links to both engines’ webmaster tools:

Step 4 – Submit URL Removals

Now that the root domain that you want to remove content from is verified, you can use Google’s and Bing’s content removal services. Hopefully your content is in directories so that you can use wildcards. You can also submit domain.com/*/*, which seems effective in getting most of the content. If not, you have to exclude each URL by hand.
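To see whether directory-level removals will cover most of the damage, it helps to group the cataloged URLs by their top-level directory. Here is a rough Python sketch of that grouping. The function name and the `depth` parameter are made up for this example, and nothing here talks to Google or Bing.

```python
from collections import Counter
from urllib.parse import urlsplit


def directory_prefixes(urls, depth=1):
    """Count how many URLs sit under each directory prefix of the given depth."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        # A path ending in "/" is all directories; otherwise the last
        # segment is the file name and is dropped.
        directories = segments if parts.path.endswith("/") else segments[:-1]
        prefix = f"{parts.scheme}://{parts.netloc}/" + "".join(
            d + "/" for d in directories[:depth]
        )
        counts[prefix] += 1
    return counts
```

A handful of prefixes with thousands of URLs each means a few directory submissions will do most of the work. A long tail of one-off prefixes means more manual entry.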

Google’s content removal tool is not linked from within webmaster tools. This is probably a good move to prevent novice webmasters from submitting URLs by accident. Here are direct links to both Google’s and Bing’s cached content removal tools.

Block urls in bing

It is interesting that Bing’s block expires. I really cannot think of a case where you would want to block a URL and then have it reappear. Nevertheless, since we have already blocked bots at the server level, when Bingbot goes back to check it will get a 403 error instead of a 200 OK, and the page will not be added back into Bing’s index.

This feature allows you to block a URL from appearing in the Bing search results. This block will remain in place for 90 days. If the URL still returns a 200 OK code when visited by our crawler after 90 days, it will reappear in our search results. There are no limits on the number of consecutive blocks you can apply to a URL. We will alert you 8 days in advance of the block being removed so you can decide if you wish to renew the block.
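Since the block is on a timer, it is worth putting the dates on a calendar the day you submit. A tiny Python sketch of the arithmetic, using the 90-day and 8-day figures from Bing's description above (the function name is hypothetical):

```python
from datetime import date, timedelta

BLOCK_DAYS = 90   # block length quoted in Bing's description
NOTICE_DAYS = 8   # advance notice quoted in Bing's description


def block_schedule(submitted: date) -> dict:
    """Return the expected expiry date and the date Bing should send its notice."""
    expires = submitted + timedelta(days=BLOCK_DAYS)
    notice = expires - timedelta(days=NOTICE_DAYS)
    return {"expires": expires, "notice": notice}
```

For example, a block submitted on 1 January 2013 would expire on 1 April 2013, with the notice expected around 24 March 2013.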

Removing cached content is important because Google caches a lot of these documents, so even if you have stopped direct access the document lives on Google’s servers. There is also the description of the page (the search snippet), which often reveals confidential information. The image below is an example of how a phone number or some other confidential information can appear in the search snippet.

Removed cached content from Google

Step 5 – Do Manual Searches to Verify site:domain.com

Searching for site:domain.com (note there is no space between the colon and domain.com) provides you with a list of the URLs, including Word, Excel, and PDF documents, that are in the Google index. Adding a space to the search (site: domain.com) instead shows you other sites that reference your exposed domain. Often these are just informational sites such as domaintools.com and other services that list domain registrant information.

Searching for bad files in google
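When there are several document types to check, it can help to narrow each search with the filetype: operator, which both Google and Bing support. A short Python sketch that builds the list of searches to run by hand (the function name and the default list of extensions are assumptions for this example):

```python
DOCUMENT_TYPES = ("pdf", "doc", "docx", "xls", "xlsx")


def verification_queries(domain: str, filetypes=DOCUMENT_TYPES):
    """Build the manual searches to run for one domain, broadest first."""
    queries = [f"site:{domain}"]
    queries += [f"site:{domain} filetype:{ext}" for ext in filetypes]
    return queries
```

This only produces the query strings. The searches themselves should still be run manually in a browser.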

Step 6 – Save Server Logs

It is important to save as large a set of logs as possible to try to understand who accessed the data and how bad the damage is. The server logs can give you an idea of how long the files were up and who accessed them.
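Once the logs are saved, a first pass at "who accessed what" can be scripted. The Python sketch below assumes the IIS logs are in the W3C extended format, where a `#Fields:` line names the columns, and that the `c-ip`, `cs-uri-stem`, and `sc-status` fields were enabled. The function names and the extension list are hypothetical. Work on a copy of the logs, never the originals.

```python
from collections import Counter

DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx")


def parse_w3c_log(lines):
    """Yield one dict per request, using the '#Fields:' directive for column names."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip truncated or malformed rows
                yield dict(zip(fields, values))


def document_hits_by_client(lines):
    """Count successful (HTTP 200) document downloads per client IP."""
    hits = Counter()
    for row in parse_w3c_log(lines):
        path = row.get("cs-uri-stem", "").lower()
        if row.get("sc-status") == "200" and path.endswith(DOCUMENT_EXTENSIONS):
            hits[row.get("c-ip", "unknown")] += 1
    return hits
```

Passing an open log file to `document_hits_by_client` gives a per-address count of successful document downloads, which is a starting point for separating search engine crawlers from everyone else.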

Step 7 – Run Network Scans of the IPs

Using Nexus and Metasploit you can scan your block of IP addresses to see what other data is leaking.

Step 8 – Set Up Google Alerts

Set up Google Alerts to catch when your private stuff leaks.

To do this, first create a text file with a random phrase that does not appear in the Google index. Pick your phrase, go to Google, and search for it; the results should be blank. Stick your phrase in a text file anywhere on your intranet where you are concerned about private documents leaking. Next, set up a Google alert to tell you when that file is in the wild.
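Here is one way to generate and plant such a phrase with Python's standard library. It is a sketch only: the `canary` prefix, the file name, and both function names are arbitrary choices for this example. You still need to search for the generated phrase by hand to confirm it returns nothing before relying on it.

```python
import secrets
from pathlib import Path


def make_canary_phrase(prefix: str = "canary") -> str:
    """Build a phrase that is very unlikely to exist anywhere else on the web."""
    return f"{prefix}-{secrets.token_hex(16)}"


def write_canary(directory, phrase: str, filename: str = "canary.txt") -> Path:
    """Write the phrase into a text file in the given directory and return its path."""
    path = Path(directory) / filename
    path.write_text(phrase + "\n", encoding="utf-8")
    return path
```

Use a different phrase for each directory you plant it in, and keep a record of which phrase went where, so that an alert tells you which part of the intranet leaked.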

Alert: "Random phrase"

Set what you want to search for (hint: leave it at the default)

The default alert type is Everything. Everything alerts include results from Google Web Search, Google Blog Search and Google News. If you are only interested in one type of results, you can select a single alert type.

You can also set up an alert for site:concerned-domain.com to tell you if a configuration issue occurs again and the content gets picked back up by Googlebot.

Step 9 – Analyze, Assess, and Repeat

Even after setting up the alerts and doing everything needed in webmaster tools, you are still going to have to go to Google and Bing and do manual searches to find documents that got picked up elsewhere or have not been removed from the indexes. This is a tough process that you could write books about.

Summary of How to Quickly Remove 10,000 Excel, Word, and PDF documents from Google & Bing

There is nothing simple about getting this much data removed from Google and Bing quickly. Each step takes attention to detail and an understanding of which signals the Google and Bing bots use to find content and which signals the search engine algorithms then use to list it.

The total time it took to remove 98% of the Excel, PDF, and Word documents was approximately 48 hours for Google and 96 hours for Bing. A few files straggled on in the indexes for several days. Nevertheless, in this case, the damage was contained quickly and the harm caused to the company was minimal. I hope that this article helps you avoid having to face these issues. If you found this article because you are facing the same issues, feel free to contact Fruition. We are available 24/7 for issues such as this.

Why This Was a Great Project

This project was interesting because it brought together my skills as an attorney, IT director, and owner of an SEO company, allowing me to apply my 20 years of knowledge alongside a group of talented individuals to minimize damage to the client. It is the exact opposite of the previous week’s project, which required Fruition to get a site re-indexed quickly.

Thanks and a Favor

Thanks for reading. Please comment below if you have any other tips or tricks on how to get content out of Google and Bing quickly. Also, it is always greatly appreciated when articles are shared.


Brad Anderson

Brad Anderson is the founder and CEO of Fruition. Brad combined his passion for marketing, technology, innovation and data-based decision making into a successful national digital marketing agency when he created the Denver-based Fruition. Brad brings the unique perspective of an expert marketer, board member, agency owner and entrepreneur to his career and his thought-leadership writing.
