How to identify and filter spam referral traffic (Part 3 of 3)

Written by Ryan Brooks | December 14, 2015 at 7:18 PM

In the previous sections, identifying referral spam traffic and eliminating referral spam traffic were covered. So what else is there? While it may seem easy to monitor for new spam sources, some months see a heavier flow of junk referral traffic than others. It wouldn’t be uncommon to have only 2 new junk sources to filter one month and 7 the next. It is therefore prudent to add a filter that reduces the number of potential new spam sources, which helps in managing them from month to month.

Ghost hostname traffic

Back in Google Analytics, navigate to Audience, then Technology, then Network. For the primary dimension, select the Hostname option. There should be a list of hostnames which visitors used to reach your site, and potentially one entry that says (not set). A hostname of (not set) means that the traffic didn’t hit a website domain. But how is this possible? How could a selection of sessions not hit the website but still exist in Google Analytics? Because those hits were sent straight to the Google Analytics server and never actually went to a website domain.

This grouping of traffic is called ghost traffic: it never actually visits the site but still has a presence in analytics, and it is entirely spam. Since the hostname doesn’t exist, how do you exclude it?

You don’t. But there is something else which can be done.

Determining legitimate hostname traffic

Since we can’t exclude sources that can’t be identified, we have to set up a filter to include known sources instead. Set the date range in Google Analytics to at least a year, or as far back as analytics has been active. Note: if it’s a brand new site, this process will not be applicable. You must wait at least a few months before completing the rest of these steps. Looking at the hostname list, there should be entries that match the domain of your website as well as some other known domains such as m.youtube.com, Google Translate, etc.

Make a note of every known, legitimate hostname as it appears in that table and create a new list in a plain-text document (I use Notepad). A filter must be created so that only traffic with these legitimate hostnames is included in your reports. If you don’t recognize a hostname, it’s most likely junk. Some spam examples that I’ve seen are 4webmasters.org and anything from darodar.com.

Keep in mind, the first time you add something via an iframe HTML tag (like a YouTube video), you are adding a new hostname to your website. Make sure to coordinate with anyone making changes to the website so the new hostname can be added to the filter and its traffic data can populate in analytics.

If you have set up a series of 301 redirects or performed URL binding to your domain, those affected domains should be included as legit hostname sources or your analytics will be incorrect. If you’re unsure about any of this, don’t worry. There will be a fallback in place in case something gets filtered that shouldn’t be, which is discussed further on in this blog post.

Getting rid of ghost traffic

Once a list of legit hostnames is compiled, go back to the Admin section and create a new filter. Name the filter whatever you need to identify it (I use “Ghost Filter”). Select the Custom filter type, select the Include radio button and set the filter field to hostname.

This next part can be tricky since your hostname list probably contains a lot of periods. The filter pattern is a regular expression, in which an unescaped period matches any character rather than a literal period, so each period needs to be modified. In your text document, put a “\” (without the quotes) before each period and separate each domain with a “|” (without the quotes).

Here’s an example of my ghost filter for DiscoverTec:

www\.discovertec\.com|discovertec\.com|www2\.discovertec\.com|\.discovertec\.com|dt12\.discovertecweb\.com|discovertec\.discovertecweb\.com|testsite\.discovertecweb\.com|youtube\.com|m\.youtube\.com|www\.youtube\.com|google\.com|discovertec\.dev\.local

Included are: both versions of the domain (important), test site domains, Google and Google products.
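If the hostname list is long, the escaping and joining described above can be scripted instead of done by hand. The following is a minimal sketch rather than part of the original process: the function name build_hostname_pattern and the shortened hostname list are assumptions made for illustration.

```python
# Hypothetical helper: turn a plain list of hostnames into a filter pattern
# by putting a backslash before each period and joining entries with a pipe.
def build_hostname_pattern(hostnames):
    # Trim whitespace, lowercase, and skip blank lines from the text file.
    cleaned = [h.strip().lower() for h in hostnames if h.strip()]
    # Drop duplicates while keeping the original order.
    unique = list(dict.fromkeys(cleaned))
    # Escape every period, then separate the hostnames with "|".
    return "|".join(h.replace(".", "\\.") for h in unique)


# A shortened, illustrative list; use the hostnames noted from your own table.
hostnames = ["www.discovertec.com", "discovertec.com", "m.youtube.com", "google.com"]
pattern = build_hostname_pattern(hostnames)
print(pattern)
# prints: www\.discovertec\.com|discovertec\.com|m\.youtube\.com|google\.com
```

The printed line can then be pasted into the filter pattern field.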

Like the spam filters, you can verify this filter as well, but if the sample of data is too small, the verification will throw an error even though the filter will operate as intended. Click Save to finish.
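When the built-in verification isn’t usable, a rough local check of the pattern against a few hostnames can still catch typos. This sketch assumes that an include filter keeps a hit when the pattern matches anywhere in the hostname; that is my understanding of how the filter behaves, not something confirmed here. The function name is_included and the sample hostnames are illustrative.

```python
import re

# The ghost filter pattern from this post, split across lines for readability.
GHOST_FILTER = (
    r"www\.discovertec\.com|discovertec\.com|www2\.discovertec\.com"
    r"|\.discovertec\.com|dt12\.discovertecweb\.com"
    r"|discovertec\.discovertecweb\.com|testsite\.discovertecweb\.com"
    r"|youtube\.com|m\.youtube\.com|www\.youtube\.com|google\.com"
    r"|discovertec\.dev\.local"
)


def is_included(hostname, pattern=GHOST_FILTER):
    # Assumption: a partial match anywhere in the hostname is enough
    # for the include filter to keep the hit.
    return re.search(pattern, hostname) is not None


# Illustrative hostnames: two legitimate ones, the ghost entry, one spam name.
for host in ["www.discovertec.com", "m.youtube.com", "(not set)", "4webmasters.org"]:
    print(host, "->", "kept" if is_included(host) else "dropped")
```

Under that assumption, the first two hostnames are kept while (not set) and the spam name are dropped.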

A fallback plan for traffic data

Because data purity and the ability to report on anything are of the utmost importance, there has to be a fallback in case a mistake was made. The solution is a VIEW dedicated to any and all data no matter the source. Setting up an unfiltered version of analytics will allow data to be retrieved in case it was mistakenly filtered in another VIEW.

Here is an example of a situation where an unfiltered VIEW is needed to retrieve data. Suppose a referral source, example.com, was added to the spam filter. Later on, it turns out example.com was a legitimate source of traffic that needs to be accounted for. Since it was filtered out, its data for the period it was being filtered is gone for good from that VIEW. But if there’s an unfiltered version of the analytics, the data can simply be retrieved from there.

An unfiltered view is exactly what it sounds like: a VIEW with no filters. To set up an unfiltered version of your data, go to the Admin section again and select the dropdown under the VIEW section. The last option should be Create new view. In the Reporting View Name field, type in “Unfiltered Data”. Select the appropriate reporting time zone and country, then click Create View.

That’s it. This is the unfiltered data view which will collect everything and anything in case it needs to be retrieved. What you now have is a reporting VIEW (with spam filters and a ghost filter) and an unfiltered VIEW for everything and anything.
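To see what a filter removed, the same report from both views can be compared side by side. The sketch below assumes you have copied session counts per source from each VIEW by hand. The function name and every number in it are made up for illustration and are not real traffic data.

```python
# Hypothetical comparison of session counts copied from the two views.
def sessions_missing_from_filtered(unfiltered, filtered):
    missing = {}
    for source, sessions in unfiltered.items():
        # Sessions present in the unfiltered view but absent from the filtered one.
        difference = sessions - filtered.get(source, 0)
        if difference > 0:
            missing[source] = difference
    return missing


# Made-up numbers, for illustration only.
unfiltered_view = {"google": 120, "example.com": 35, "darodar.com": 60}
filtered_view = {"google": 120}

for source, sessions in sessions_missing_from_filtered(unfiltered_view, filtered_view).items():
    print(source, sessions)
# prints:
# example.com 35
# darodar.com 60
```

Any legitimate source that shows up in this list, like example.com in the scenario above, is a sign that a filter needs to be corrected, and its past data can be read from the unfiltered VIEW.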