Make GA Data Quality Suck Less!

November 6, 2007 by Justin Cutroni

Frustrated with GA Data? Get a grip!

We all know that data quality sucks! But there are a few vital steps that you can take to ensure that your Google Analytics data is as accurate as possible. Remember, accurate data makes for happy, and accurate, analysts.

Here are three simple tips that can help make your data more accurate.

1. Eliminate Duplicate Data

Many sites that I work on have duplicate data. The usual cause is mixed-case URLs. Google Analytics is case sensitive: it captures the data exactly as it appears in the location bar of the browser. So if a URL is of mixed case in the browser, it will be captured and displayed in mixed case within GA.

It’s very easy to have two URLs with the same functional meaning appear as two line items in GA because they differ in case. Here’s an example:


/worldseries/index.php?year=2007&keyword=lowell
/Worldseries/index.php?year=2007&keyword=Lowell

Both URLs probably point to the same page; they just appear different because of the case. We want to force both URLs to have the same case and thus make them appear as a single line item in GA. This can be done with a Lowercase or Uppercase filter. I like the lowercase filter, but you could just as easily use the uppercase filter. It’s a personal preference.

The filter below will force the Request URI to lowercase:

Google Analytics Lowercase Filter
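To make the effect concrete, here is a minimal Python sketch of what a lowercase filter does to the report. It is only an illustration, not how GA works internally: the `pageviews` list and the `lowercase_filter` function are made up for this example, and the URLs are the two from the example above.

```python
from collections import Counter

# Hypothetical pageview log using the two example URLs in mixed case.
pageviews = [
    "/worldseries/index.php?year=2007&keyword=lowell",
    "/Worldseries/index.php?year=2007&keyword=Lowell",
    "/worldseries/index.php?year=2007&keyword=lowell",
]


def lowercase_filter(request_uri: str) -> str:
    """Mimic the effect of a lowercase filter on the Request URI field."""
    return request_uri.lower()


# Without the filter, each distinct spelling becomes its own line item.
raw_report = Counter(pageviews)

# With the filter, both spellings collapse into a single line item.
filtered_report = Counter(lowercase_filter(uri) for uri in pageviews)

print(len(raw_report), "line items before the filter")
print(len(filtered_report), "line item(s) after the filter")
```

The same idea applies to any other field you lowercase: the values are normalized before they are counted, so the pageviews add up on one row.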

I recommend adding a case-changing filter to any data element (i.e. filter field) that could be mixed case. This includes:

  • Request URI
  • Campaign Name
  • Campaign Term
  • Campaign Medium
  • Campaign Source

Another cause of duplicate data is multiple URLs that display the same content but have a different file extension. Here’s an example:


/champions/redsox.php
/champions/redsox.htm

These URLs may appear different (because of the file extension), but the web server might interpret them as the same file. Please note that not every web server behaves this way. It all depends on your web server. Check with your IT guru if your site has URLs with multiple file extensions.

You should merge duplicate URLs that have different file extensions into a single line item. I find the best way to do this is with an advanced filter.

Google Analytics Advanced Filter for URI Rewrite

Some may think that a search and replace filter is the best way to remove these duplicates. But you would need to create a search and replace filter for each set of URLs that needs to be merged. An advanced filter, because it uses a regular expression, will change every URL that ends in ‘.htm’ to a ‘.php’ extension.
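As a rough sketch of the logic, the Python below captures everything before a trailing ‘.htm’ and rebuilds the URI with ‘.php’. The pattern `^(.*)\.htm$` and the function name `rewrite_extension` are assumptions made for illustration; the exact expression to enter in your own filter depends on your URLs.

```python
import re

# Assumed pattern: capture everything before a trailing ".htm".
HTM_SUFFIX = re.compile(r"^(.*)\.htm$")


def rewrite_extension(request_uri: str) -> str:
    """Rewrite a URI ending in .htm so that it ends in .php instead."""
    match = HTM_SUFFIX.match(request_uri)
    if match is None:
        # Not a .htm URI, so leave it untouched.
        return request_uri
    return match.group(1) + ".php"


for uri in ["/champions/redsox.htm", "/champions/redsox.php", "/index.html"]:
    print(uri, "->", rewrite_extension(uri))
```

Note that the `$` anchor keeps the pattern from touching ‘.html’ pages, and that a URI with a query string after ‘.htm’ would not match this particular pattern.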

2. Remove Irrelevant Information

Extra information in the URL can cause big problems in Google Analytics. The reason is that GA will capture all of the data in a URL, which includes the query string parameters. Query string parameters that don’t have a functional meaning should be removed from the URL.

An easy way to find these parameters is to collect data for a week and then analyze the top content report. Any query string parameter that does not provide insight into what the visitor sees or does should be eliminated.

To remove a query string parameter from GA simply add it to the ‘Exclude URL Query Parameters’ field in the profile settings:

Google Analytics Exclude URL Query Parameters Setting

Enter multiple parameters as a comma-separated list.

Be aware that once you remove a query string parameter from GA it is completely eliminated from the system. So any goals, funnels or other filters that use that parameter will no longer work.
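If it helps to picture what the setting does, here is a small Python sketch that strips a comma-separated list of parameters from a URI. It only illustrates the idea and makes no claim about how GA implements it; the function name and the `sessionid` and `sid` parameters are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def exclude_query_parameters(request_uri: str, excluded: str) -> str:
    """Drop the parameters named in a comma-separated list from a URI."""
    names = {name.strip() for name in excluded.split(",") if name.strip()}
    parts = urlsplit(request_uri)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    # Rebuild the URI with only the parameters that carry meaning.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# "sessionid" and "sid" are hypothetical parameters with no functional meaning.
uri = "/worldseries/index.php?year=2007&sessionid=abc123&keyword=lowell"
print(exclude_query_parameters(uri, "sessionid, sid"))
```

Running a few of your own top content URLs through something like this is a quick way to preview which line items would merge before you change the profile.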

Also remember that you should remove any query string parameters that contain personally identifiable information. It is against the GA terms of service to collect PII.

3. Identify Your Segment

I could have easily named this tip ‘exclude internal data’ but I wanted to change the way we all think about profiles and the data that’s in them. I believe we should think of profile data in terms of the segment we want to analyze, not who we want to exclude. I know these statements are very close in meaning, but there is a slight difference. Segmentation is so important to analysis. I believe that every time we create a profile we should consider what segment of data it will contain.

I can think of a few segments of data that I would like to analyze:

  • CPC traffic
  • New visitors
  • Return visitors
  • European visitors
  • Traffic from a specific marketing campaign
  • Non-employee traffic
  • Traffic generated by my call center

All of the above segments can be created as different profiles using include filters. Each will provide some insight into that segment. Don’t get me wrong, you’ll probably want to exclude internal traffic from 99% of your profiles. But try to think in broader terms and focus on the segment that you want to analyze.

Creating a profile based on a particular segment of traffic is pretty easy. The first thing you want to do is identify what segment of traffic you want to include in your profile. Then create a filter based on the filter field that represents that segment.

Let’s say I want to see all traffic generated from visitors performing some type of external search on my name. I could apply the following include filter to a profile:

Google Analytics Include Filter

This filter can easily be modified to include a specific marketing campaign (using the Campaign Name field), a specific country (using the Visitor Country field) or any other segment of data, so long as it is represented by one of the filter fields. Please note that this will work even if you have AdWords auto-tagging turned on, even though you haven’t done any heavy lifting to define the Campaign Term.
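Conceptually, an include filter keeps only the visits whose chosen field matches a pattern and throws everything else away. Here is a minimal Python sketch of that idea. The visit records, the field names (`campaign_term`, `visitor_country`) and the pattern `cutroni` are all invented for illustration, and the case-insensitive matching is an assumption of this sketch.

```python
import re

# Hypothetical visit records; the field names are illustrative only.
visits = [
    {"campaign_term": "justin cutroni", "visitor_country": "United States"},
    {"campaign_term": "google analytics filters", "visitor_country": "Germany"},
    {"campaign_term": "Cutroni blog", "visitor_country": "France"},
]


def include_filter(records, field, pattern):
    """Keep only the records whose field matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)  # assumption: ignore case
    return [r for r in records if regex.search(r.get(field, ""))]


# A profile containing only searches on the author's name.
name_searches = include_filter(visits, "campaign_term", r"cutroni")

# The same function builds a different segment from a different field.
german_visits = include_filter(visits, "visitor_country", r"^Germany$")

print(len(name_searches), "visits in the name-search segment")
print(len(german_visits), "visit(s) in the Germany segment")
```

Swapping the field and pattern is all it takes to define a new segment, which is the point of thinking in terms of what to include.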

By the way, you will want to exclude internal traffic from many profiles. My favorite way to remove internal traffic from a profile is with an ‘Exclude all traffic from an IP address’ filter. Make sure you use anchors at the beginning and end of the regular expression.

Google Analytics Filter: Exclude IP Address
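To see why the anchors matter, compare an anchored and an unanchored pattern in the Python sketch below. The address 192.0.2.1 is a placeholder from a range reserved for documentation, not a real office IP, and the helper `is_excluded` is made up for this example.

```python
import re

# Placeholder internal address (192.0.2.0/24 is reserved for documentation).
ANCHORED = r"^192\.0\.2\.1$"
UNANCHORED = r"192\.0\.2\.1"


def is_excluded(ip_address: str, pattern: str) -> bool:
    """Return True if the exclude filter pattern matches the IP address."""
    return re.search(pattern, ip_address) is not None


for ip in ["192.0.2.1", "192.0.2.10", "192.0.2.100"]:
    print(ip, is_excluded(ip, ANCHORED), is_excluded(ip, UNANCHORED))
```

Without the `^` and `$`, the pattern also matches any address that merely contains the internal one, so you would silently drop traffic from visitors you want to keep. Escaping the dots matters for the same reason, since an unescaped dot matches any character.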

Another good way to exclude internal traffic, especially if you don’t have a static IP address, is to use a little hack called Count Me Out. This hack uses the GA custom segment cookie to identify users.

So remember, yes, you need to exclude internal traffic, but try to take a broad view and think about segmentation when you filter your profiles.
