A Comparison of the Size of the Yahoo! and Google Indices


The following study was completed by two of Professor Vernon Burton's former students at the University of Illinois. Although he agreed to host the report on his webpage in the interests of encouraging the debate on the relative strengths of the different search engines, neither Professor Burton nor NCSA had any direct involvement in the study.

This is a follow-up to the original study, undertaken to address some legitimate concerns about the inclusion of "wordlists" and "dictionaries" in the original results. The follow-up again sampled roughly 10,000 search queries against Google and Yahoo! (excluding dictionaries and wordlists) and found results similar to the original study.




Introduction

On August 8th, 2005, Tim Mayer of Yahoo! posted on the Yahoo! Search Blog that the “[Yahoo!] index now provides access to over 20 billion items”, which include “19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files”. [2] Two days later, on his blog, University of California at Berkeley Visiting Professor John Battelle reported that Google disputed this claim, saying “[their] scientists are not seeing the increase claimed in the Yahoo! index”. [3] To test Yahoo!'s claims, Matthew Cheney and Mike Perry conducted a brief study of the indices of the two search engines and then conducted a follow-up study to address the presence of "wordlists" and "dictionaries" in the results.


Methodology

Although there is no direct way to verify the size of each search engine's respective index, the standard method for measuring relative size was developed by Krishna Bharat and Andrei Broder in 1998. [4] Their method used a corpus of “web words” drawn from the Yahoo! Web Hierarchy to generate random search queries. The results of those queries were in turn sampled, and the presence of the sampled webpages in major search engines was then checked.

For our study, instead of focusing on documents that match common "web words", we chose to focus on the more obscure documents of the web – the “long tail” of the search index. By counting the presence of these obscure documents in each search engine, we hoped to measure each engine's comprehensiveness and relative size. However, this method resulted in a large number of "dictionary" or "wordlist" files showing up in the results. This introduced an unforeseen bias: since Google indexes and stores a much larger percentage of each webpage than Yahoo! does, it is more likely to return "wordlists" and "dictionaries" when queried with random words.

To address this problem, we modified our original search parameters, which queried for two random words drawn from the commonly available English Ispell word list (135,069 words in total) [5]. Instead, we searched for two random words while excluding a third random word (a query of the form word1 word2 -word3). This, we feel, helps to exclude the vast majority of "dictionaries" and "wordlists": because such pages contain nearly every word in the language, they almost always contain the excluded third word and are therefore filtered out of the results.
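The query construction described above can be sketched as follows. This is a minimal illustration, not the study's actual tooling (which was a Perl script); the short `WORDLIST` here is a stand-in for the full 135,069-word Ispell word list.

```python
import random

# Stand-in for the 135,069-word English Ispell word list used in the study.
WORDLIST = ["abacus", "brine", "cobalt", "dirge", "ember", "fjord"]

def build_query(rng=random):
    """Pick three distinct random words and form a query of the shape
    'word1 word2 -word3': both of the first two words must appear in a
    matching page, and the third must not. Dictionary/wordlist pages
    contain nearly every word, so the '-word3' clause filters them out."""
    w1, w2, w3 = rng.sample(WORDLIST, 3)
    return f"{w1} {w2} -{w3}"

print(build_query())
```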

Additionally, we discarded any query that did not produce more than 25 actual results on both Yahoo! and Google. Although this potentially ignores some relevant web documents, we felt that queries producing a "sizable" number of results (more than 25) were the most useful for our study. The goal was to sample only those random web queries that returned meaningful, non-spam/wordlist/dictionary web results.

Unfortunately, both the Yahoo! and Google search engines truncate the results returned to the user at 1,000. Thus, for the purposes of this study, we restricted our searches to queries that returned fewer than 1,000 results on both Yahoo! and Google; any query that returned more than 1,000 results on either search engine was discarded from our sample.

We modified our Perl script to use the new methodology, used it to search both Yahoo! and Google again, and logged the results. For the verification study we used a sample of 10,034 different searches of Yahoo! and Google. In the interest of transparency, we have included a copy of the Perl script and the dictionary file we used to run the queries on the project website.
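Putting the pieces together, the overall sampling loop looks roughly like the sketch below. The `count_results(engine, query)` callable is a hypothetical stand-in for the network code that asks an engine for its actual (not estimated) result count; the study's real implementation was a Perl script.

```python
import random

# Stand-in for the full Ispell word list.
WORDS = ["abacus", "brine", "cobalt", "dirge", "ember", "fjord"]

def run_sample(n_queries, count_results, rng=random):
    """Sampling-loop sketch: generate random 'word1 word2 -word3' queries,
    obtain each engine's actual result count via the caller-supplied
    count_results(engine, query) function, and keep only queries passing
    the 25-1000 filter on both engines."""
    kept = []
    for _ in range(n_queries):
        w1, w2, w3 = rng.sample(WORDS, 3)
        query = f"{w1} {w2} -{w3}"
        y = count_results("yahoo", query)
        g = count_results("google", query)
        if 25 < y < 1000 and 25 < g < 1000:
            kept.append((query, y, g))
    return kept
```

In the real study this loop ran 10,034 times against the live engines; here the result-counting function is injected so the logic can be exercised without network access.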


Results

For the verification study, over a period of 48 hours and using computing resources at the University of Illinois at Urbana-Champaign chapter of the Association for Computing Machinery (ACM), we conducted a random sample of 10,034 searches of Yahoo! and Google.

Based on this random sample, we found that on average Yahoo! returns only 65% of the number of results that Google does and, in many cases, significantly fewer. This figure differs substantially from our previous result of 37.4%, but it still shows Google providing an overwhelmingly larger number of results. The results are displayed in Table One.

Table One (n=10,034)

            Average Search Results          Average Search Results
            (Excluding Duplicate Results)   (Including Duplicate Results)
Yahoo!      132                             229
Google      202                             349

In aggregate, Yahoo! returned a total of 1,328,167 results to our 10,034 searches, while Google returned nearly twice as many at 2,029,022. The pattern is similar when “omitted” or “duplicate” search results are included (both search engines offer an option to display them), with Google returning 3,510,999 total results to Yahoo!'s 2,391,153. This information is available in Table Two.

Table Two (n=10,034)

            Total Search Results            Total Search Results
            (Excluding Duplicate Results)   (Including Duplicate Results)
Yahoo!      1,328,167                       2,391,153
Google      2,029,022                       3,510,999
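As a quick arithmetic check, the per-query averages of Table One follow directly from the aggregate totals of Table Two, and the aggregate ratio reproduces the 65% figure quoted above:

```python
# Totals from Table Two (duplicates excluded) and the sample size.
n = 10_034
yahoo_total, google_total = 1_328_167, 2_029_022

# Per-query averages; Table One reports these rounded to 132 and 202.
print(round(yahoo_total / n))   # 132
print(round(google_total / n))  # 202

# Yahoo!'s aggregate share of Google's result count: roughly 65%.
print(round(100 * yahoo_total / google_total))  # 65
```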

Interestingly, the actual total number of results returned varies dramatically from the estimated totals that both Google and Yahoo! present to users alongside the search results. In the case of Google, the number of actual results returned is about one third of Google's estimate; in the case of Yahoo!, it is less than one sixth of the estimated total. This information is available in Table Three. [6]

Table Three (n=10,034)

                                                    Yahoo!        Google
Estimated Search Results (Excluding Duplicates)     9,827,975     6,435,679
Total Search Results (Excluding Duplicates)         1,328,167     2,029,022
Percent of Actual Results Based on Estimate         13.5%         31.5%
Estimated Search Results (Including Duplicates)     10,181,043    6,431,059
Total Search Results (Including Duplicates)         2,301,153     3,510,999
Percent of Actual Results Based on Estimate         22.6%         54.5%
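The actual-versus-estimated percentages in Table Three (duplicates excluded) can be verified directly from the raw counts:

```python
# Estimated vs. actual result counts from Table Three (duplicates excluded).
yahoo_est, yahoo_actual = 9_827_975, 1_328_167
google_est, google_actual = 6_435_679, 2_029_022

# Yahoo! delivered about 13.5% of its own estimate, Google about 31.5%.
print(round(100 * yahoo_actual / yahoo_est, 1))    # 13.5
print(round(100 * google_actual / google_est, 1))  # 31.5
```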


Conclusions

Based on the data from our sample searches, this study concludes that for a random set of words a user can expect Yahoo! to return, on average, only about 65% of the number of results that Google returns. In fact, in the 10,034 test cases we ran, Yahoo! returned more results in only 16% of cases (1,606), while Google returned more results in 83.7% of cases (8,399). In less than 1% of cases both search engines returned the same number of results.

It is the opinion of this study that Yahoo!'s claim to a web index of over twice as many documents as Google's is suspect. Unless a large number of the documents Yahoo! has indexed are not yet available to its search engine, or the Yahoo! search engine is not returning all of the documents that match our specific search queries, we find it puzzling that Yahoo!'s engine consistently returned fewer results than Google's.


Footnotes

[2] Mayer, Tim. "Our Blog is Growing Up - And So Has Our Index". Yahoo! Search Blog. August 8th, 2005.
[3] Battelle, John. "In This Battle, Size Does Matter: Google Responds to Yahoo Index Claims". John Battelle's Searchblog. August 10th, 2005. [http://battellemedia.com/archives/001790.php]
[4] Bharat, Krishna and Andrei Broder. "A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines". In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia (WWW7), pages 379-388, April 1998.
[5] Kuenning, Geoff. "Ispell Word List". [http://fmg-www.cs.ucla.edu/geoff/ispell.html]. We are aware that the study focuses on websites in English and would be interested in other researchers who have done studies using other languages.
[6] Since the study was first done, we have noticed that Yahoo has modified the way they estimate the number of search results to be quite a bit more accurate.

 

Published on 18/07/2011
