Tracking Website Data-Collection and Privacy Practices with the iWatch Web Crawler

July 19, 2007 by Richard Conlan

http://cups.cs.cmu.edu/soups/2007/proceedings/p29_jensen.pdf

iWatch is a web crawler that builds a central database of global online data practices.  It starts with a seed list of the top 50 websites as reported by Comscore Media Metrix and indexes privacy-related practices including cookies, webbugs, and P3P.  A post-processing step then indexes the data by domain and by country, cross-references lists of privacy seals, fetches P3P policies, and so on.  Programmatically determining some of these things is pretty complicated.  To date they have indexed nearly 250,000 pages over nearly 25,000 unique domains in 81 countries.  In addition to grouping by domain and country, they also group by common privacy laws, such as those shared by members of the EU.

The iWatch data can be used to:

  • mine for better risk indicators
  • study the evolution of practices over time and the impact of key events
  • directly aid consumers, legislators, e-merchants, and researchers

The data gathered so far suggests that sites with P3P policies are actually more likely to use webbugs.  The data shows that P3P adoption increased in the US and Canada from 2005 to 2006, but decreased in the rest of the world.  Correspondingly, the use of webbugs increased in the US, but decreased in most other areas.  It is hoped that this data will be useful to e-merchants trying to decide which privacy features to include, to security researchers analyzing privacy and trends, and to end users trying to evaluate their privacy risks online.

Figure 9 in the paper, which was described as a teaser for future work, is interesting. Apparently it shows the connections between web sites based on third party cookies.

It provokes interesting thoughts about privacy because it could be an illustration of the potential data aggregation between different domains. It should be of interest to people studying the issues/opportunities of behavioral advertising.

 
Chandan Sarkar wrote:

The interesting part of the graph is that it is a directed graph. An edge from vertex A to vertex B means that A writes a cookie that only B can read; more specifically, A is the cookie writer and B is the cookie reader. So what is more relevant to the importance of a node is its indegree (how many cookies written on other sites can be read by this site) and its outdegree (how many cookies this site writes on behalf of third parties).
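
To make the indegree/outdegree reading concrete, here is a minimal sketch in Python. The site names and the edge list are made up for illustration and are not taken from the iWatch data; each pair is (cookie writer, cookie reader), following the direction described above.

```python
from collections import Counter

# Hypothetical edges (writer, reader): the site on the left writes a cookie
# that only the third party on the right can read. Not real iWatch data.
edges = [
    ("news.example", "ads.example"),
    ("shop.example", "ads.example"),
    ("shop.example", "stats.example"),
    ("blog.example", "ads.example"),
]

def degrees(edge_list):
    """Return (indegree, outdegree) counters for a directed edge list."""
    unique = set(edge_list)  # count each writer/reader pair only once
    in_deg = Counter(reader for _, reader in unique)
    out_deg = Counter(writer for writer, _ in unique)
    return in_deg, out_deg

in_deg, out_deg = degrees(edges)
for site in sorted({s for edge in edges for s in edge}):
    print(site, "in:", in_deg[site], "out:", out_deg[site])
```

In this toy example the third-party reader ends up with a high indegree and zero outdegree, while the first-party sites have the reverse profile.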

Another two important features of the graph are:
1) RandomWalkBetweenness, which is based on random walks and measures the expected number of times a node is traversed by a random walk, averaged over all pairs of nodes, and
2) HITS, which calculates the “hubs-and-authorities” importance measures for each node in a graph.
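
As a rough illustration of the hubs-and-authorities idea only (random-walk betweenness is not sketched here), the following Python snippet runs a plain power iteration for HITS on a made-up cookie graph. The site names, the edge list, and the fixed iteration count are assumptions for the example, and this is not necessarily how the paper's authors computed their scores.

```python
import math

# Hypothetical (writer, reader) edges; not real iWatch data.
edges = [
    ("news.example", "ads.example"),
    ("shop.example", "ads.example"),
    ("shop.example", "stats.example"),
    ("blog.example", "ads.example"),
]

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}

def hits(edge_list, iterations=50):
    """Return (hub, authority) score dicts via simple power iteration."""
    nodes = sorted({n for edge in edge_list for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing at you.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edge_list:
            new_auth[dst] += hub[src]
        # Hub: sum of the authority scores of the nodes you point at.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edge_list:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth

hub, auth = hits(edges)
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

In cookie-graph terms, a strong authority is a reader that many active writers set cookies for, and a strong hub is a writer that sets cookies for many well-connected readers.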