If you run your business by the numbers, you need to be able to trust them 100%.
But one look at your Google Analytics reports will show you that that’s easier said than done.
On any given day you’ll see tens to hundreds of visits from all kinds of strange places.
These aren’t real visitors. It’s Google Analytics spam, and it has become a big problem over the last few years.
So if you want to get rid of this spam and want to clean those fake visitors from your reports, this article will show you exactly how!
It’s been about 2 years since Google Analytics spam really became a problem, and the approach I use to fight it has evolved along the way.
Today I do the following for my clients:
- Set up multiple Google Analytics views
- Filter hostname spam
- Filter spam referrals
- Exclude known bots
- Create a spam-free segment
Let’s take a look at these steps in more detail.
Step 1. Set Up Multiple Google Analytics Views
By default, a new Google Analytics property comes with one view: All website data.
If you make changes to the settings, there is always the chance that something goes wrong. The data you’ve already got in your account won’t be affected, but the data that comes in after you make the changes will be modified.
Since there is no undo button, it’s a good idea to have a backup of your data.
So before you do anything, create 2 extra views. So that’s 3 views in total:
- Main – your main view (you can rename All Website Data to this one)
- Raw – a view without any changes or filters
- Test – a view for testing changes before you make them in the Main view
The problem with the Google Analytics spam is just in your reports. These are fake visitors that never land on your website. They use a loophole in the way Google Analytics works to fake visits from other websites. That’s why they are also called ghost spam.
So while it seems that you have visitors from big sites like apple.com or reddit.com, most of those aren’t real. Luckily we can tell them apart from real visitors.
Step 2. Filter Hostname Spam
The first way to detect them is via hostnames. In simple terms, your hostname is the name of your site.
Let’s look at the hostname report in Google Analytics:
A valid hostname is the domain from your site which I blurred out in the example above. Besides that, the only other valid hostname is checkout.shopify.com.
So the only reason there might be a different domain is because you are using your Google Analytics tracking code with other tools.
If you’re using Google Analytics for ecommerce, the actual checkout often is on another domain (checkout.shopify.com), but those pages are loading your own Google Analytics code (that way you can track transactions). That’s why there is a different hostname.
The other hostnames in the example above: (not set), lifehacker.com, google.org or www.foxnews.com are fakes.
Take a look at your own report in Google Analytics: Audience > Technology > Network > Hostname
So instead of filtering out the fakes, we are only going to include valid hostnames in our reports. The rest can be ignored.
But you want to make sure to filter out only the spam, not the legit traffic!
Let’s take a look at which hostnames are valid:
- your own domain (domain.com)
- your own subdomains (blog.domain.com)
- Content Delivery Networks (or CDN): Cloudflare or Akamai
- Translation services: Google, Bing or Baidu
- Shopping carts: Shopify or Lightspeed
- Payment services: PayPal
- Cache services: Google cache
- Other tools that use your tracking code: landing page tools, email providers, etc.
It’s essential that you don’t filter out real traffic, so here are some examples of valid hostnames. It’s not a complete list, but it will tell you what to look for:
- checkout.shopify.com (Shopify checkout pages)
- yourshopifydomain.myshopify.com (your own Shopify domain)
- translate.googleusercontent.com (Google Translate)
- yourdomain.webshopapp.com (Lightspeed checkout pages)
- develop.yourdomain.com (staging server)
- dev.yourdomain.com (staging server)
- yourdomain.us4.list-manage.com (MailChimp list settings)
- fbrender.heyo.com (Facebook contest tool)
- webcache.googleusercontent.com (Google Cache)
- us4.campaign-archive.com (MailChimp archives)
- cdn.yourdomain.com (a CDN service)
- web.archive.org (users looking at old versions of your site via archive.org)
- yourdomain.googleweblight.com (light version of your domain by Google)
- yourwpenginedomain.wpengine.com (your WPEngine subdomain – WordPress only)
- yourdomain.dev (staging server)
- yourdomain.3dcartstores.com (your 3D cart subdomain)
- translate.baiducontent.com (Baidu translate)
- www.yourdomain.stfi.re (link tools)
- www.youtube.com (if you use your tracking code on your YouTube channel)
Action time
You need to create a new filter on your Google Analytics view that only includes the traffic to the hostnames you specify.
I recommend starting with your Test view. Let it run for a week, then compare the number of transactions and their value against your Raw view. These should be the same since we’re only filtering out fake traffic. Once you’ve verified it’s correct, you can create the same filter in your Main view.
Go to Admin > correct view > Filters > + Add Filter
Select Custom > Include > Filter Field: Hostname > Filter Pattern: see below > Save
In the Filter Pattern field, you’re going to enter a combination of all the valid domains that you’ve found.
You have to do that in a special format, called a regular expression or regex.
Let me give an example to simplify it.
Example
I've discovered 2 good hostnames:
- www.storegrowers.com
- checkout.shopify.com
In the Filter Pattern field I'll enter:
www\.storegrowers\.com|checkout\.shopify\.com
So that’s a backslash (\) in front of every dot and a pipe (|) in between domains.
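If your list of valid hostnames gets long, typing the pattern by hand becomes error-prone. Here's a minimal Python sketch of the same rule (backslash before every dot, pipe between domains). The function name is made up for illustration, and Python's regex engine is only a rough stand-in for the one Google Analytics uses, so treat the local check as a sanity test, not a guarantee.

```python
import re


def build_hostname_pattern(hostnames):
    """Put a backslash before every dot and join the hostnames with a pipe."""
    return "|".join(host.replace(".", r"\.") for host in hostnames)


# The two valid hostnames from the example above
valid_hostnames = ["www.storegrowers.com", "checkout.shopify.com"]
pattern = build_hostname_pattern(valid_hostnames)
print(pattern)  # www\.storegrowers\.com|checkout\.shopify\.com

# Rough local check: which hostnames would an Include filter keep?
for host in ["www.storegrowers.com", "checkout.shopify.com",
             "lifehacker.com", "(not set)"]:
    kept = re.search(pattern, host) is not None
    print(host, "->", "kept" if kept else "filtered out")
```

Paste the printed pattern into the Filter Pattern field of your Test view first.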
This will take care of a bunch of spam already, but not everything. Let’s look at step 3 to filter out the other spam in your Google Analytics account.
Step 3. Filter Referral Spam
Besides ghost hostnames, your Google Analytics reports are also full of ghost referrals.
These are websites that appear to send visitors to your site, but actually aren’t.
They play on the curiosity of website owners since it’s only natural to wonder what that site that linked to you is all about and go visit them. (Sidenote: most of these websites actually don’t work anymore, so it’s unclear why they would bother with this shit)
To exclude these from our reports we’re going to set up filters that eliminate these.
As I mentioned before, my approach has changed over the last couple of years. In the beginning I kept track of all of the domains that I found in my own reports or those of clients. But that quickly became too much work to keep updated.
So my new approach is that instead of excluding the exact referrals, I try to look at the patterns in all of the referrals. Those won’t filter out everything, but they will get you 95% of the way there.
Analytics provider Piwik maintains a regularly updated Google Analytics spam list on GitHub with over 483 domains. I’ve rolled all of my domains into that list, so that’s what I’m using in this post.
To do this, you’ll create 2 new custom filters on your Views.
Action time
Go to Admin > correct view > Filters > + Add Filter
Select Custom > Exclude > Filter Field: Campaign source > Filter Pattern: see below > Save
In the Filter Pattern field, you’re going to enter a combination of all the spam referrals that you’ve found, again in the regex format described above.
There is a 255 character limit to the field, so you might have to create multiple similar filters.
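Splitting a long list across several filters by hand is tedious, so here's a minimal Python sketch that does the packing for you. The function name and the placeholder domains are hypothetical. It assumes the limit counts characters the same way Python's len() does, so keep a little margin if you use domains with special characters.

```python
def split_into_patterns(domains, limit=255):
    """Pack escaped domains into as few pipe-separated patterns as possible,
    keeping every pattern at or below `limit` characters."""
    patterns = []
    current = ""
    for domain in domains:
        escaped = domain.replace(".", r"\.")
        if len(escaped) > limit:
            raise ValueError(f"{domain} does not fit in a single filter")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) <= limit:
            current = candidate
        else:
            # The current filter is full, so start a new one
            patterns.append(current)
            current = escaped
    if current:
        patterns.append(current)
    return patterns


# Placeholder domains, just to show the splitting
spam_domains = [f"spam-{i}.example.com" for i in range(40)]
for number, filter_pattern in enumerate(split_into_patterns(spam_domains), start=1):
    print(f"Filter {number} ({len(filter_pattern)} characters): {filter_pattern}")
```

Each printed pattern goes into its own Exclude filter.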
Example
In my reports I've found 4 spam referral domains:
- motherboard.vice.com
- lifehacĸer.com
- site-auditor.online
- addons.mozilla.org
In the Filter Pattern field I'll enter:
motherboard\.vice\.com|lifehacĸer|site-auditor|addons\.mozilla\.org
So that’s a backslash (\) in front of every dot and a pipe (|) in between domains.
Like I said before you can add only the spam you see in your reports, or you can use the huge list of domains mentioned above.
I use a mix of filters for specific domains & things that keep popping up in reports across clients. I’ve also started excluding a handful of top-level domains that are responsible for most of the spam (.site, .xyz, .рф, .ru, .info, .top, .ua, .kz, .uz, .ga & .cf). I know that there is the possibility that a legit site on one of those top-level domains will send me traffic, but I’m willing to take the chance.
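The exact pattern for those endings isn't spelled out here, so the following Python sketch is one possible way to write it, not the definitive one. It assumes the source is a bare domain name, which is why the pattern is anchored to the end with $. Without that anchor, ".ru" would also match something like "www.rubygems.org".

```python
import re

# The endings listed above, without the leading dot
SPAM_TLDS = ["site", "xyz", "рф", "ru", "info", "top", "ua", "kz", "uz", "ga", "cf"]

# Assumption: match a dot, one of the endings, then the end of the source
tld_pattern = r"\.(" + "|".join(SPAM_TLDS) + r")$"
print(tld_pattern)

# Rough local check against a few made-up sources
for source in ["spam-domain.xyz", "site-auditor.online",
               "www.rubygems.org", "info.example.com"]:
    excluded = re.search(tld_pattern, source) is not None
    print(source, "->", "excluded" if excluded else "kept")
```

Note that site-auditor.online is kept by this pattern since .online isn't on the list. It still needs its own entry in a domain filter.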
Again, you’ll have to apply this filter to every single one of your views. Start with your Test view, then roll it out to your Main view if things are correct.
Step 4. Exclude Known Bots
Google doesn’t really seem to care about these spam issues, otherwise I wouldn’t have to write a detailed blog post on how to deal with Google Analytics spam.
But they do have a small feature that helps out.
On the View level, Google Analytics offers a solution to filter all known bots. Just check the Exclude all hits from known bots and spiders option.
Step 5. Create a Spam-Free Segment
If you create new filters, your data will only be filtered from the moment you add them. So even if you exclude spam referrals, your past history is still affected.
To look at your historic data without the spam, you can create a segment that excludes these known referrals.
You can do this by re-using the regex code you created for your spam referrals & hostnames.
Action time
To create a new segment, click + Add Segment above the chart in Google Analytics > + New Segment
Then Exclude the Sessions that have a Source that matches the regex code you created for your spam referrals.
You can add an additional filter for the hostname spam.
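To make the logic of that segment concrete, here's a small Python sketch that applies both rules to a few sessions. The sample rows are made up for illustration and the function name is hypothetical. The two patterns are the ones from the examples earlier in this post. A session counts as spam-free only if its hostname is valid and its source is not a known spam referral.

```python
import re

# Patterns from the hostname and referral examples earlier in this post
HOSTNAME_INCLUDE = r"www\.storegrowers\.com|checkout\.shopify\.com"
SOURCE_EXCLUDE = r"motherboard\.vice\.com|lifehacĸer|site-auditor|addons\.mozilla\.org"


def is_spam_free(session):
    """Keep a session only if its hostname is valid and its source is not spam."""
    valid_host = re.search(HOSTNAME_INCLUDE, session["hostname"]) is not None
    spam_source = re.search(SOURCE_EXCLUDE, session["source"]) is not None
    return valid_host and not spam_source


# Made-up sample rows, only to show how the two rules combine
sessions = [
    {"hostname": "www.storegrowers.com", "source": "google"},
    {"hostname": "www.storegrowers.com", "source": "site-auditor.online"},
    {"hostname": "www.foxnews.com", "source": "(direct)"},
    {"hostname": "checkout.shopify.com", "source": "(direct)"},
]

clean_sessions = [s for s in sessions if is_spam_free(s)]
print(len(clean_sessions), "of", len(sessions), "sessions are spam-free")
```

The second row is dropped because of its source, the third because of its hostname.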
Note on referral exclusion lists
To close off I want to mention another technique that’s often mentioned to tackle Google Analytics spam, the referral exclusion list.
The Referral Exclusion List is a feature in Google Analytics that lets you add domains you want to exclude from your referral reports.
While that seems ideal at first, there is a catch. If you add domains to that list, those visits don’t get excluded at all. They simply get added to the direct traffic of your website. That only makes the spam invisible, so it’s definitely not a good option.
That’s it for this post. By now you should be able to get that annoying spam out of your Analytics. If you have any questions or have an alternative approach, let me know in the comments!