I’ll get back to talking about the web analytics team soon, but I’ve been getting a few emails from folks just starting out who are a bit confused about data collection. So I figured I’d blog about it…
When web analysts talk about data collection, they are referring to the method by which counts and measures of things, like page views and durations, are captured by a web analytics tool. If you’re new to web analytics, data collection can be slightly confusing.
Track and Collect Data
There are three “generally-accepted” methods for data collection in the web analytics industry:
Page tags. Client-side data collection involves using little snippets of HTML code that reference a JavaScript file and communicate via a beacon to a “page tag server” – the machine that collects the data so it can be sessionized by the web analytics tool (your vendor may use a different name for it). As a web analyst, if you are using page tags you will have lots of fun tagging every page on your web site and instrumenting the tags with custom variables and campaign codes. People like page tags for numerous reasons, including that they are fairly efficient at filtering out non-human traffic (as long as the robot doesn’t execute JavaScript) and that they can count proxy-cached pages (improving accuracy). Page tags are probably the most ubiquitous method for collecting web data today.
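To make the mechanics concrete, here’s a rough sketch of what a page tag does: gather data about the page, encode it as a query string, and fire it at the collection server as a tiny image (“beacon”) request. The function name and the “stats.example.com” endpoint are made up for illustration – this is not any vendor’s actual tag.

```javascript
// Hypothetical page tag helper (illustrative names, not a real vendor API).
// Builds the beacon URL that carries the collected data to the tag server.
function buildBeaconUrl(endpoint, data) {
  var pairs = [];
  for (var key in data) {
    pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(data[key]));
  }
  return endpoint + "?" + pairs.join("&");
}

// In a browser, the tag would then run something like:
// new Image().src = buildBeaconUrl("https://stats.example.com/b.gif", {
//   page: location.pathname,
//   ref: document.referrer,
//   res: screen.width + "x" + screen.height  // browser info logs can't see
// });
```

The 1x1 image request is the “beacon” – the tag server never serves a meaningful image; it just logs the query string for the analytics tool to sessionize.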
Log files. Server-side data collection involves parsing text-based log files generated by web servers. The server, when instructed to do so, logs every request it receives from clients in a file called the “log file.” There are many formats for log files. Each line in a log file is called a “hit” and contains lots of different data elements – the IP address, a request date/time stamp, the item requested, the user agent, the referrer, and more. Many “hits” make up a single page view – that’s why it’s incorrect to use the term “hits” to refer to “page views.” As a web analyst you will be defining the format of the log file within your tool and moving and synchronizing log files so that they can be processed by your tool. Some people will claim log file analysis is dated (“historic” may be more appropriate), or less accurate than page tags (due to caching issues). Other people like logs because they can reprocess their data.
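To show what that parsing looks like, here’s a simplified sketch assuming the common NCSA Combined Log Format (your server’s format may differ, which is exactly why you define the format within your tool):

```javascript
// Sketch: parse one "hit" (one line) of an NCSA Combined Log Format file.
// Field names are my own; real tools map these to their internal schema.
var combinedLogPattern =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

function parseHit(line) {
  var m = combinedLogPattern.exec(line);
  if (!m) return null; // malformed or unexpected-format line
  return {
    ip: m[1],         // client IP address
    timestamp: m[2],  // request date/time stamp
    method: m[3],
    path: m[4],       // the item requested
    status: parseInt(m[5], 10),
    bytes: m[6] === "-" ? 0 : parseInt(m[6], 10),
    referrer: m[7],
    userAgent: m[8]
  };
}
```

Note that a single page view would produce one such line for the page itself plus one for every image, stylesheet, and script it references – which is the “many hits per page view” point above.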
Packet sniffers. Network data collection involves deploying either software or hardware that intercepts and logs traffic coming over a network. Every packet is captured and decoded according to a configuration you define. Your web analytics tool can be configured to process the data captured and decoded by the sniffer. Packet sniffers are a less common approach for data collection by web analytics vendors.
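Actually capturing packets requires privileged network access and capture libraries, but the “decoded according to a configuration you define” step can be illustrated. Below is a simplified, hypothetical decoder (my own names, not any sniffer’s real API) that pulls the request line and Host header out of a captured HTTP payload:

```javascript
// Sketch: decode an HTTP request payload the way a sniffer's decoder
// might, after the packet capture layer has reassembled the TCP stream.
function decodeHttpPayload(payload) {
  var lines = payload.split("\r\n");
  var requestLine = lines[0].split(" "); // e.g. "GET /pricing HTTP/1.1"
  var host = "";
  for (var i = 1; i < lines.length; i++) {
    // Header names are case-insensitive, so compare in lowercase.
    if (lines[i].toLowerCase().indexOf("host:") === 0) {
      host = lines[i].slice(5).trim();
    }
  }
  return { method: requestLine[0], path: requestLine[1], host: host };
}
```

A real sniffer configuration would go much further (ports to watch, which headers and POST bodies to keep, SSL handling), which is where the IT support comes in.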
Interestingly, some vendors offer “hybrid” data collection, which combines multiple data collection methods. This mode could be considered a “fourth type” of data collection. Most commonly, hybrid data collection means using logs and page tags to collect different data elements, but other combinations are possible as well.
As you investigate the best data collection method for your implementation, make sure you deeply consider the pros and cons of each method. For example, page tags capture information about the browser (like screen resolution) that logs just can’t. But what if you need to measure clients that don’t execute JavaScript, like some mobile devices? Log files capture information about crawlers (i.e., robotic traffic) that page tags just can’t. But can you adequately filter robotic traffic and maintain host exclusions? Packet sniffers capture pretty much everything, but can be challenging to customize to your exact data needs (and you’ll need a fair amount of IT support).
Which one is correct for your implementation? It depends on your business goals and what you need to measure…