I’ll get back to talking about the web analytics team soon, but I’ve been getting a few emails from folks just starting out who are a bit confused about data collection. So I figured I’d blog about it…
When web analysts talk about data collection, they are referring to the method by which counts and measures of things, like page views and durations, are captured by a web analytics tool. If you’re new to web analytics, data collection can be slightly confusing.
Track and Collect Data
There are three “generally-accepted” methods for data collection in the web analytics industry:
- Log files. Server-side data collection involves parsing text-based log files generated by Web servers. The server, when instructed to do so, logs every request it receives from clients in a file called the “log file.” There are many formats for log files. Each line in a log file is called a “hit” and contains several fields: the IP address, a request date/time stamp, the item requested, the user agent, the referrer, and more. Many “hits” can make up a single page view (the page itself plus each image, stylesheet, and script it loads); that’s why it’s incorrect to use the term “hits” to refer to “page views.” As a web analyst you will be defining the format of the log file within your tool and moving and synchronizing log files so that they can be processed by your tool. Some people will claim log file analysis is dated (historic may be more appropriate), or less accurate than page tags (because pages served from a cache never reach the server and so are never logged). Other people like logs because they can reprocess their data.
- Packet sniffers. Network data collection involves deploying either software or hardware that intercepts and logs traffic coming over a network. Every packet is captured and decoded according to a configuration you define. Your web analytics tool can be configured to process the data captured and decoded by the sniffer. Packet sniffers are a less common approach for data collection by web analytics vendors.
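To make the “hits” versus “page views” distinction from the log file method concrete, here is a minimal Python sketch. It assumes the server writes the common “combined” log format; the sample lines, the `ASSET_EXTENSIONS` list, and the rule used to decide what counts as a page view are all illustrative assumptions, not how any particular web analytics tool does it.

```python
import re

# Assumed format: the "combined" log format (one of many possible formats).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Hypothetical rule: requests for these file types are supporting assets,
# not page views. A real tool lets you configure this.
ASSET_EXTENSIONS = (".gif", ".png", ".jpg", ".css", ".js", ".ico")


def parse_hit(line):
    """Turn one log line (one "hit") into a dict of fields, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_page_view(hit):
    """Treat a successful request for a non-asset path as a page view."""
    path = hit["path"].split("?", 1)[0].lower()
    return hit["status"] == "200" and not path.endswith(ASSET_EXTENSIONS)


# Made-up sample: one visitor loading one page that pulls in two assets.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /style.css HTTP/1.1" '
    '200 512 "http://example.com/index.html" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2023:13:55:37 +0000] "GET /logo.gif HTTP/1.1" '
    '200 1024 "http://example.com/index.html" "Mozilla/5.0"',
]

hits = [hit for hit in map(parse_hit, SAMPLE_LOG) if hit is not None]
page_views = [hit for hit in hits if is_page_view(hit)]

print(f"hits: {len(hits)}, page views: {len(page_views)}")
```

With this sample, three hits collapse into a single page view, which is why the two terms aren’t interchangeable.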
Interestingly, some vendors offer “hybrid” data collection, which combines multiple data collection methods. This mode could be considered a “fourth type” of data collection. Most commonly, hybrid data collection means using logs and page tags to collect different data elements, but other combinations are possible as well.
Which one is correct for your implementation? It depends on your business goals, which define what you need to measure…