# Variable Selection for SEO Regression Analysis

Posted on January 2, 2013 • Written by Brad Anderson

Using multiple linear regression and the other techniques described previously, we can now begin to build a full model based on a single dataset (that is, one Google search). There are many ways to decide which variables to include. The best way is to choose them from subject knowledge. Because Google is not forthcoming about the variables it uses, we are left to base our variable choices on search results alone.

We begin by performing a Factor Analysis (FA). This analysis lets us divide the variables into a number of groups. Recall that root domain trust and subdomain trust were closely related to each other (“trust” is an estimate of the quality of a domain or subdomain). Remember that Google continues to disfavor subdomains relative to root domains in search results. FA would indicate that these two variables are closely related to the same underlying variable. All the variables we consider are placed into groups (called factors) according to how well correlated they are with one another. Then, to get independent (or mostly independent) variables, we select the “best” variable from each group.

The actual method of FA is presented here in a non-technical and somewhat abbreviated fashion. For a fuller treatment of FA, see Wikipedia. When we run the analysis, the results indicate how many groups of variables the data seems to contain. Essentially, the computer looks at the data and begins breaking the variables up into independent groups, assigning each variable a number from -1 to 1 that indicates how well the variable conforms to each group (its correlation with that group). New groups are added until one is added that does not increase the percentage of variability explained by the model. That last group is removed, leaving us with a good guess at the number of groups represented by the one dataset.
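The stopping rule described above can be sketched in a few lines of Python. This is only an illustration, not the software used for the analysis: the function name `choose_factor_count`, the `min_gain` threshold, and the variance shares are all made up for the example.

```python
def choose_factor_count(cumulative_explained, min_gain=0.01):
    """Return the number of factors to keep.

    cumulative_explained[k - 1] is the share of variability explained by a
    k-factor solution. Factors are added until one fails to raise that share
    by at least min_gain (an assumed threshold); that last factor is dropped.
    """
    kept = 0
    previous = 0.0
    for share in cumulative_explained:
        if share - previous < min_gain:
            break  # this factor added nothing useful, so discard it
        kept += 1
        previous = share
    return kept


# Made-up cumulative shares for 1 to 9 factors, for illustration only.
shares = [0.31, 0.47, 0.58, 0.66, 0.72, 0.77, 0.81, 0.812, 0.813]
n_factors = choose_factor_count(shares)
print(n_factors)
```

With these invented numbers the eighth factor adds almost nothing, so the sketch keeps seven. A real analysis would take the variance shares from the FA output rather than from a hand-written list.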

We now consider each group individually and select the variable with the highest correlation (the value furthest from 0, whether negative or positive). Once we have selected a single variable for each group, we can begin treating this model as a serious candidate for the Google model. We run a multiple linear regression and then, one at a time, replace each variable with the next best variable from the same factor and rerun the analysis. We continue like this until we have the model with the best overall fit across every variable considered.

The table is a truncated factor analysis, showing only a few variables and 7 factors. In practice we had a few hundred variables, but a huge table would only be confusing. To approximate Factor1 we would begin with the variable followed root domains, since .895 is the largest value in the factor1 column. We would pick a variable for each factor in the same way. We would then run a regression analysis, checking overall fit, individual variable fit, and multicollinearity (whether the variables are independent of each other). Next we would replace followed root domains with followed cblock linking domains (since .858 is the second largest value in the factor1 column) and rerun the regression, comparing the results. The replacement of variables would continue until we concluded that we had found the best model. Generally the fit starts getting worse at some point, and then we are done. Finally, we would look at which of these many regression models has the best fit.
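The selection and replacement procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the analysis. Only the two factor1 loadings (.895 and .858) come from the table; every other loading, the "hypothetical variable" names, and the fit values in `toy_fit` are invented. The `score` argument stands in for whatever measure of regression fit is used, with higher meaning better.

```python
def rank_by_loading(loadings, factor):
    """Variables for one factor, ordered by |loading|, largest first."""
    return sorted(loadings, key=lambda var: abs(loadings[var][factor]), reverse=True)


def select_variables(loadings, factors, score):
    """Start from the top variable per factor, then try swaps one at a time.

    score(variables) must return a fit measure where higher is better.
    """
    ranked = {f: rank_by_loading(loadings, f) for f in factors}
    chosen = {f: ranked[f][0] for f in factors}
    best = score(list(chosen.values()))
    for f in factors:
        for candidate in ranked[f][1:]:
            if candidate in chosen.values():
                continue  # already representing another factor
            trial = dict(chosen)
            trial[f] = candidate
            trial_score = score(list(trial.values()))
            if trial_score <= best:
                break  # fit stopped improving; keep the current variable
            chosen, best = trial, trial_score
    return chosen, best


# Only .895 and .858 come from the table; all other numbers are made up.
loadings = {
    "followed root domains": {"factor1": 0.895, "factor2": 0.12},
    "followed cblock linking domains": {"factor1": 0.858, "factor2": 0.10},
    "hypothetical variable A": {"factor1": 0.05, "factor2": -0.80},
    "hypothetical variable B": {"factor1": 0.20, "factor2": 0.65},
}

# Invented fit values standing in for real regression results.
toy_fit = {
    frozenset({"followed root domains", "hypothetical variable A"}): 0.61,
    frozenset({"followed cblock linking domains", "hypothetical variable A"}): 0.64,
}


def toy_score(variables):
    """Look up a made-up fit value; unknown combinations score 0.0."""
    return toy_fit.get(frozenset(variables), 0.0)


chosen, best = select_variables(loadings, ["factor1", "factor2"], toy_score)
print(chosen, best)
```

In real use, `toy_score` would be replaced by a function that fits the regression on the chosen variables and returns a fit statistic. The sketch checks only overall fit; individual variable fit and multicollinearity would still need to be reviewed for each candidate model.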

You may have noticed that followed root domains is the best variable for both factor1 and factor6. This happens because the table leaves out the majority of the variables. We would not expect to see anything like that in real data. This example is very limited mathematically.

In practice FA has generally resulted in 7 to 10 groups of variables. Consequently, with the data provided, we have been able to determine that 7 to 10 variables are important to the actual Google model. In developing this FA-based model, we have both overcome a great deal of difficulty (selecting independent variables) and brought in new difficulties (determining which variable in each group actually matters). As said before, we pick the “best” variable from each group. So what exactly is the “best” variable? We hope we have selected the variable Google actually uses.

Next we will consider using multiple datasets to validate a single model.

## Written by Brad Anderson

Brad Anderson is the Executive Director and Founder of Fruition. Brad’s focus is supporting Fruition’s team to enable sustainable growth and excellent client satisfaction (EBITDA growth). With a strong statistical background, Brad built Fruition’s in-house software that is used to manage client success.