We need large data sets so that we can establish a relationship between our sample and the population. In many internet contexts the population is every web user or every website. Clearly, we cannot collect data on every one of them (though perhaps Google does). What we do instead is collect data on a large number of websites. We could look at a single website, but any results would be anecdotal at best. George Burns lived to the age of 100, yet smoked constantly for over seven decades. Taken alone, this one example would suggest that smoking extends your life. Obviously, Mr. Burns was either lucky or would have lived even longer had he not smoked. If instead of one person we considered 10,000 people, we would find that smokers, on average, live shorter lives than nonsmokers. A larger sample gives a better understanding of the true relationships in data.
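To make the sample-size point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the population is a made-up per-site metric drawn from a normal distribution (mean 50, standard deviation 15), and the function name `spread_of_sample_means` is hypothetical. It draws many samples of a given size and reports how much the sample average varies from one sample to the next.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=200, seed=0):
    """Return the standard deviation of the sample mean across repeated samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(trials):
        # Hypothetical population: a per-site metric with true mean 50
        # and standard deviation 15 (illustrative numbers, not real data).
        sample = [rng.gauss(50, 15) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# A sample of one behaves like an anecdote; larger samples settle
# much closer to the true population average.
for n in (1, 100, 1000):
    print(n, round(spread_of_sample_means(n), 2))
```

The spread shrinks roughly with the square root of the sample size, which is why a single site (or a single long-lived smoker) says so little about the population.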
In SEO we look at many websites to take advantage of a large sample size. We look at sites on various subjects, usually those of our clients or their competitors. We look at websites that we own and can modify in many ways, which helps us determine the effect of specific changes to a site. We also look at sites that are large and represent the ideal site in some way, like eBay or Amazon. Our goal is to get a fuller picture of the way that Google ranks websites. Data from all of these sites allow us to make inferences about the true form of Google's algorithm.
From a practical perspective, determining Google's website ranking method can be compared to trying to determine the characteristics one person looks for in a mate. Many things go into such a decision, and many of those characteristics are related: intelligence, education, wealth and so on are all closely interrelated. Though Google is large and very popular, it represents a single perspective, so we present Google with many different websites and see how much it likes each one. We then make adjustments and see what kind of change occurs. In our relationship example, we parade many suitors in front of our target and see which ones she likes. Later we take the same guy and change his clothes; does our girl like him more or less? Now for the tricky part: Google changes things from time to time. Our girl has a different type every now and then. To counter these changes we must continually track Google and watch for changes in the rankings of sites which have not themselves changed. Though no computer program can really determine human likes and dislikes, enough data can allow us to get close.
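The idea of watching unchanged sites for ranking movement can be sketched in a few lines. This is only one possible approach under stated assumptions, not a description of any particular tool: the site names and rank histories are invented, and the function `flag_algorithm_shifts` and its `threshold` parameter are hypothetical. The sketch flags any day on which the typical (median) rank change across the control sites is unusually large, on the reasoning that if the sites did not change, Google probably did.

```python
from statistics import median


def flag_algorithm_shifts(rank_history, threshold=3):
    """Return day indices where unchanged sites moved together in the rankings.

    rank_history maps a site name to its list of daily rank positions;
    all lists are assumed to have the same length.
    """
    n_days = len(next(iter(rank_history.values())))
    flagged = []
    for day in range(1, n_days):
        # Absolute rank movement of each control site since the previous day.
        moves = [abs(ranks[day] - ranks[day - 1]) for ranks in rank_history.values()]
        # Use the median so one noisy site cannot trigger a flag on its own.
        if median(moves) >= threshold:
            flagged.append(day)
    return flagged


# Invented example data: three control sites that were not modified.
control_sites = {
    "site-a.example": [4, 4, 5, 11, 11],
    "site-b.example": [9, 8, 9, 3, 3],
    "site-c.example": [2, 2, 2, 7, 8],
}
print(flag_algorithm_shifts(control_sites))  # prints [3]
```

The threshold is a judgment call; a real tracker would want many more control sites and some allowance for ordinary day-to-day ranking noise.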
The key weakness of our analogy between mating preferences and Google's algorithm is that, unlike personal attraction, the Google algorithm is by necessity quantifiable. Because the algorithm is designed to let a computer simulate human preference, we can similarly design a program to interpret the algorithm. We are not really looking for what makes a good website. Rather, our goal is to determine what Google thinks makes a good website. This is a difficult but possible task.