Google Web Accelerator

Written for the June 2005 issue of Spider Magazine. Also available through Spider’s website.

Don’t be evil

The Google Web Accelerator may prove to be Google’s first misstep or their greatest leap since they launched in 1998.

On May 4, Google released yet another tool in their already impressive arsenal, the Google Web Accelerator (GWA). Unfortunately, the privacy and security issues raised by the tool have deeply divided Google loyalists. Although Google stopped downloads of the GWA barely a week after launch, claiming it had reached maximum capacity, the accelerator is already being used by thousands of people worldwide. Usually, Google elicits the kind of user response other companies can only dream of, but the strong reaction to the GWA seems to have taken even the company by surprise. The software dramatically improves net-cruising speed, but its implications are highly controversial. This might well prove to be Google’s first misstep, or their greatest leap since PageRank, the famous algorithm behind their current search technology.

The principle behind the GWA is simple. Users are served cached (pre-loaded) copies of webpages from Google’s famously high-performance clusters. This enhances the browsing experience for three reasons. First, Google’s servers are faster on average, in terms of both network speed and raw server performance, than most other servers. Second, users are spared the extra hops required to retrieve the webpage from the originating servers. Finally, because Google compresses data before sending it across, transmissions are faster and less bandwidth is required. Perhaps the only downside of the GWA experience is that the user’s IP address can be seen by no one but Google. (Even then, the losses are borne by Web businesses that rely on ad revenue; the user usually enjoys the privacy afforded by virtually anonymous surfing.)
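
To make the mechanics concrete, here is a minimal sketch, in Python, of how a caching, compressing proxy of this kind behaves. The names and the cache policy are illustrative only, not Google’s actual implementation: cache hits never touch the origin server, misses are fetched once and stored compressed, and the origin only ever sees the proxy’s address.

import gzip
import urllib.request

cache = {}  # url -> compressed copy held on the proxy's servers

def fetch_through_proxy(url):
    # Cache hit: serve the stored copy; the origin server is never contacted.
    if url in cache:
        return gzip.decompress(cache[url])
    # Cache miss: fetch from the origin once (the origin sees only the proxy's IP)...
    with urllib.request.urlopen(url) as response:
        body = response.read()
    # ...and store it compressed so it travels over fewer bytes from now on.
    cache[url] = gzip.compress(body)
    return body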

In exchange, Google gets to monitor every single page a user visits, with the exception of secure SSL pages. This has critical implications for the next round of search wars. There has been little innovation in search engine design since Google came up with their PageRank algorithm (see box at bottom). In a bid to further streamline searches, engineers are now exploring the possibilities of contextual search, which is predicated on knowledge of the user: farmers searching for “rice”, for example, would see pages relating to agriculture, while chefs would see recipes. With the GWA, Google gets accurate Web usage patterns for each user and can profile them accordingly. Not only does this give them a head start on contextual search, it also paves the way for the recently trademarked (no patents filed yet) TrustRank algorithm, which is widely pegged as the successor to PageRank.
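
A toy illustration of what contextual re-ranking might look like, with an entirely hypothetical profile and scoring scheme (none of this reflects Google’s actual methods):

def rerank(results, profile):
    # results: list of (url, base_relevance, topic); profile: {topic: affinity}
    def contextual_score(result):
        url, base_relevance, topic = result
        # The same page scores differently for different users.
        return base_relevance + profile.get(topic, 0.0)
    return sorted(results, key=contextual_score, reverse=True)

results = [("usda.example/rice-farming", 0.6, "agriculture"),
           ("recipes.example/rice-pilaf", 0.6, "cooking")]
farmer_profile = {"agriculture": 0.3}   # inferred from browsing history
chef_profile = {"cooking": 0.3}
print(rerank(results, farmer_profile)[0][0])  # the farming page comes first
print(rerank(results, chef_profile)[0][0])    # the recipe comes first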

Google could have done this in a number of different ways, and it is important to consider the alternative designs to understand why the current one is so controversial. First, they could have offered the caching aspect of the GWA by letting users enter proxy settings directly in their browser, eliminating the need to download the GWA at all. That design would be identical to the current one, save for the GWA’s compression perks. Even then, HTTP/1.1, supported by most browsers and servers on the Web today, already provides compression (also known as content encoding), which yields on average 75 per cent faster surfing speeds thanks to lower bandwidth utilisation. Google could still have differentiated itself by using proprietary compression algorithms or by optimising the channel between the GWA and their own servers. In all likelihood, Google is marketing the tool as a Web accelerator simply to give people an incentive to run it.
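
For the curious, this is roughly how standard HTTP/1.1 content encoding works: the browser advertises gzip support with an Accept-Encoding header, and the server, if it agrees, compresses the response and labels it with a Content-Encoding header. A Python sketch against a placeholder address:

import gzip
import urllib.request

request = urllib.request.Request(
    "http://www.example.com/",                # placeholder address
    headers={"Accept-Encoding": "gzip"},      # browser: "I can handle gzip"
)
with urllib.request.urlopen(request) as response:
    body = response.read()
    # The server labels compressed responses; undo the compression before use.
    if response.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)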

Second, to enable TrustRank, they only needed to monitor the URLs of the sites users visit, not the entire content of each page. In fact, the Google Toolbar already has an advanced feature that allows URL monitoring, and it would have been easier to integrate compression into the Toolbar than to release a new tool. Clearly, they are not just after URLs. The likely reason is that the URL alone is not enough to penetrate the Deep Web, the part of the Web that is inaccessible to search engines (and hence invisible to them) and much larger than the visible Web: 500 times larger by some estimates. Pages that aren’t hyperlinked anywhere, or that are reachable only by hitting a submit button, lie in the Deep Web, and the best way to get to them is through GWA users who submit the form and end up there. If Google is indeed trying to penetrate the Deep Web, this approach raises major privacy concerns: users might not want Google to see the content of pages generated by filling out a personal form, or private pages not hyperlinked anywhere else.
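
The form is the crux: a page reached by submitting a form has no hyperlink a crawler can follow, so only software sitting in the request path, such as the GWA proxy, ever sees it. A hypothetical example (the endpoint and form fields are invented):

import urllib.parse
import urllib.request

form_data = urllib.parse.urlencode({"flight": "PK301"}).encode()
# A POST like this is what a submit button triggers; no hyperlink points at the
# resulting page, so a link-following crawler never finds it.
request = urllib.request.Request("http://www.example.com/flight-status", data=form_data)
with urllib.request.urlopen(request) as response:
    page = response.read()  # a proxy relaying this request sees (and could index) the result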

Third, why is Google using cached pages as their bait? There are other value-added services they could offer in exchange for monitoring users’ internet habits. The likely reason is that caching not only lets them monitor users but also helps Googlebot, their spider. The spider crawls over eight billion webpages every month and indexes them to make them searchable, which means “current” search results on Google can be up to a month old. As good netizens, Google cannot crawl websites daily just to update their index, because that would abuse website owners’ bandwidth, among other problems. If a user requests a page through the Google proxy, however, the request is justified: the user would have visited the original site with or without Google. As an intermediary, Google can index the current copy of the page if it has changed, instantly updating their index while relaying the page to the user. This automatically results in more frequent updates of popular pages and infrequent updates of the rest, striking an economic balance. It seems like a fair trade.
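
A rough sketch of this piggybacking idea, with made-up names and a simple change check (the real pipeline is of course far more involved):

import hashlib
import urllib.request

index = {}  # url -> (content fingerprint, stored copy)

def relay_and_index(url):
    # The user asked for this page anyway, so the fetch places no extra
    # burden on the origin server.
    with urllib.request.urlopen(url) as response:
        body = response.read()
    fingerprint = hashlib.sha1(body).hexdigest()
    # Re-index only if the page has changed since it was last seen; popular
    # pages are requested often and therefore refreshed often.
    if index.get(url, (None, None))[0] != fingerprint:
        index[url] = (fingerprint, body)
    return body  # relay the current copy to the user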

In their unconventional SEC filing for the IPO, Google underlined their company motto of “Don’t be evil”. If we take them at their word, then the GWA is probably a bid to improve their search strategy and, in turn, users’ search experience. To the average user, the tool is simply a Web accelerator; it might not be immediately apparent that Google is caching everything they do and looking over their shoulders. The issue is one of privacy: users trade anonymity on the Web for identification with Google. Concentrating that much power sets the stage for the evil that becomes possible when one is dealing with a publicly traded company. What if Google is taken over by less ethically minded people, or is hacked somewhere down the line? If the internet is used to plan a terrorist attack or commit a crime, it would be very easy to subpoena user logs from Google. Those logs would throw up all sorts of information about targeted users: their email address if they use Google Groups; all their email if they use GMail; their search history if they keep cookies on; their credit card details if they use Google AdWords or Google Answers; and now, with the GWA, everything they view on the Web. The first subpoena will set the precedent for all others to follow and ruin the delicate balance between privacy and intrusive surveillance.

Caching in general relies on metadata sent by servers saying when a page was last updated and when it expires (so that the cache servers revisit and fetch a fresh copy once the page has expired). It also relies on user activity. If user A visits dawn.com through the cache server, a subsequent user with the accelerator enabled will receive the copy cached at the time of user A’s visit rather than going directly to dawn.com. Both faulty metadata and caching itself have their share of glitches; anyone who has worked from behind a corporate proxy can testify to being unable to get a fresh copy of a page or finding themselves logged in as another user. Scaled to the entire Web population, such errors multiply dramatically. Reports of these glitches with the GWA have already begun to surface.
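
In simplified terms, a cache decides whether its stored copy is still usable by checking that expiry metadata, along these lines (the real HTTP rules cover many more cases):

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(cached_headers):
    expires = cached_headers.get("Expires")
    if not expires:
        return False                       # no metadata: safer to refetch
    try:
        expiry = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False                       # faulty metadata: treat the copy as stale
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) < expiry

# A server that sets this date too far ahead leaves every user of the cache
# staring at a stale copy of the page.
print(is_fresh({"Expires": "Wed, 01 Jan 2031 00:00:00 GMT"}))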

The current trend on the internet is a move away from subscription models (user-pays services such as Hotmail) and toward ad-based revenue models (sponsor-pays services such as GMail). This means Web businesses rely increasingly on people clicking on ads that may be rotated daily or hourly. If Google serves up cached copies, the same ads will be displayed over and over again, jeopardising the revenue model of such businesses. All this could backfire, resulting in lawsuits accusing Google of anti-competitive behaviour, since Google’s own Web ads would be unaffected. Furthermore, Web businesses gain intelligence by monitoring their logs to see where users are coming from (for example, to test the effectiveness of a marketing campaign in another city). With the GWA in the middle, this becomes less effective, since Google acts as a single source on behalf of its users.

Of PageRank and other pioneers

PageRank is Google’s algorithm for assessing the relevance of search results. It ranks pages according to popularity, where a page’s popularity is determined by the number of inbound links from other pages, and each inbound link is weighted by the popularity of the page it originates from.
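
As a rough illustration of the idea, here is a toy power-iteration version (not Google’s production implementation):

def pagerank(links, damping=0.85, iterations=50):
    # links: {page: [pages it links to]}
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            # A page passes its own popularity on to the pages it links to,
            # so a link from a popular page carries more weight.
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"ferrari.example": ["blog.example"],
         "blog.example": ["ferrari.example", "store.example"],
         "store.example": ["ferrari.example"]}
print(sorted(pagerank(graph).items(), key=lambda item: -item[1]))
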
But PageRank has significant limitations that are sapping its efficacy. As aggressive businesses have learnt, it is getting progressively easier to get other webpages to link to their own and thus boost their ranking. (The first 10 results for “250 LM”, for example, are online stores retailing sports car memorabilia, not the official Ferrari website.) Spammers are abusing this characteristic of PageRank and polluting the Web with their spam URLs. In fact, the fundamental link structure of the Web is changing, because blogs by their very nature rely on linking to each other to create the so-called blogosphere. The first “Google bomb” was the handiwork of bloggers who effortlessly engaged in large-scale linking to a central site.

Google’s newly trademarked TrustRank algorithm hopes to redress PageRank’s shortcomings by changing the criteria that determine rank. The new algorithm looks to rank pages according to the number of people who visit them and, possibly, how much time they spend on them.
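
Google has said little about how TrustRank actually works. Purely as an illustration of ranking by usage data as described above (all names and numbers invented):

def usage_rank(visit_log):
    # visit_log: list of (page, seconds_spent) pairs gathered from users
    scores = {}
    for page, seconds in visit_log:
        # Each visit counts, and longer visits count for more.
        scores[page] = scores.get(page, 0.0) + 1.0 + seconds / 60.0
    return sorted(scores.items(), key=lambda item: -item[1])

log = [("dawn.com", 120), ("dawn.com", 300), ("spam.example", 5)]
print(usage_rank(log))  # dawn.com comfortably outranks the spam page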