Model validation requires us to consider multiple variables and models across many different datasets at the same time. Overall variable selection can be done by comparing the best models from the various datasets. We do not pool all datasets together, because some datasets may contain no sites that use a certain variable effectively, which would make us unlikely to find that variable in an overall model. Instead, we fit an individual model to each dataset and then analyze each variable to understand why it is or is not present in each model.
It would be simple if all variables were consistent across all datasets. In practice, no single variable is significant in every dataset; all variables fluctuate from dataset to dataset. What we are really looking for, then, is variables that are consistently significant, or whose insignificance in some datasets has a reasonable explanation.
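One way to keep track of this is a simple tally of where each variable comes out significant across the per-dataset models. The sketch below is illustrative only: the "travel" dataset, the variable names other than internal_links, every p-value, and the 0.05 cutoff are made-up assumptions, not results from this analysis.

```python
# Hypothetical p-values from per-dataset models: {dataset: {variable: p_value}}.
# All numbers here are invented for illustration.
P_VALUES = {
    "golf": {"internal_links": 0.40, "page_title_length": 0.01, "word_count": 0.30},
    "seo": {"internal_links": 0.02, "page_title_length": 0.03, "word_count": 0.60},
    "travel": {"internal_links": 0.04, "page_title_length": 0.02, "word_count": 0.45},
}

ALPHA = 0.05  # assumed significance cutoff


def tally_significance(p_values, alpha=ALPHA):
    """Return {variable: (datasets where significant, datasets where not)}."""
    tally = {}
    for dataset, variables in p_values.items():
        for variable, p in variables.items():
            sig, not_sig = tally.setdefault(variable, ([], []))
            (sig if p < alpha else not_sig).append(dataset)
    return tally


def classify(sig, not_sig):
    """Label a variable by how consistent its significance is."""
    if not not_sig:
        return "always significant"
    if not sig:
        return "never significant"
    return "mixed: needs an explanation"


SUMMARY = {var: classify(*lists) for var, lists in tally_significance(P_VALUES).items()}

for variable, label in SUMMARY.items():
    print(f"{variable}: {label}")
```

The "mixed" variables are the interesting ones: they are the candidates for the closer look at value ranges and distributions described next.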
A common issue is a variable for which a single website has a very large value; imagine 100,000 links. If this one site ranked poorly, the results would indicate that the variable is irrelevant, when in fact an extreme observation (an outlier) is distorting the results. We therefore consider each variable's range of values as well as its distributional summaries (mean, median, quartiles, etc.) to see whether the variable has a different shape in different datasets. If we can find a specific difference, for example where all datasets that show the variable as significant have only larger values and those that show it as not significant have only smaller values, it is then likely that the variable is important but not effectively used by many websites.
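To make the outlier problem concrete, here is a minimal sketch using only Python's standard library. The link counts are invented, and the helper name `summarize` and the choice of the "inclusive" quartile method are assumptions made for illustration; they are not the tooling used for the tables in this post.

```python
import statistics


def summarize(values):
    """Return min, quartiles, max and mean for one variable in one dataset."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": statistics.mean(values),
    }


# Invented link counts; the last site is the extreme observation.
links = [10, 20, 30, 40, 50, 60, 70, 80, 100_000]

with_outlier = summarize(links)
without_outlier = summarize(links[:-1])

for label, summary in (("with outlier", with_outlier), ("without outlier", without_outlier)):
    print(label, summary)
```

With these made-up numbers, the single extreme site moves the mean from 45 to over 11,000, while the median only shifts from 45 to 50. That is why the quartiles and median are worth comparing across datasets alongside the mean.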
As an example, consider the tables above. The Golf table shows a median of 80, with 75% of all observations smaller than 132. Compared with the SEO data, which has a median of 19 and 75% of observations smaller than 41.5, we see a real difference. If the variable in question here (internal_links) were significant for one dataset and not the other, we could look at the difference in the values to determine whether the variable is truly significant. Considering only two datasets is limiting, but interesting; in practice we look for a similar pattern across more than two. The Golf dataset shows internal_links to be non-significant, while the SEO dataset shows it to be significant. The result here suggests the variable is truly significant. The sites in the SEO dataset simply have too many internal links. It seems that Google penalizes a site that has too many internal_links.
We now develop a single model using the number of variables suggested by our previous factor analysis. The only variables we keep are those which are consistently significant or which, when insignificant, show a pattern that fits our expectations for the data. The next step in this process is to perform an actual study. Everything done up to this point has been based entirely on correlation. Correlation cannot EVER establish causation. We would ultimately like to know which variables actually cause improved ranks. To do this, we must conduct a study in which we change the variables we have determined to be significant and see whether they actually cause an improvement. This will be discussed in depth in the next few blog posts.