The previous blog posts on using regression analysis for SEO each focused on a single variable, such as backlinks. Of course, the actual Google algorithm involves many variables, perhaps thousands. We now begin to discuss the issues surrounding multiple regression for SEO. Multiple regression simply refers to any regression model that includes more than one predictor variable. In the case of SEO, actual site rank is the dependent variable, and the measurable attributes of a website are all potential predictor variables. Site rank is harder to determine once personalized search results are considered, so for this exercise we use a generic, geographically neutral ranking.
Simple linear regression is easy to picture: we simply plot points on a two-dimensional graph. With multiple regression we can only visualize a model with up to two predictors, which would be represented in a three-dimensional graph. Any model with more than two predictors could only be drawn in a higher-dimensional space.
The actual process is the same for both multiple and simple regression. We are trying to draw the best line through all of our data points (with several predictors, the line becomes a plane or its higher-dimensional equivalent). In multiple regression we have added more variables that may be predictors. In SEO we would like to know which variables are effective in Google’s algorithm. We start by looking at individual variables with simple regression. We then combine several variables into a single multiple regression model.
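To make the fitting step concrete, here is a minimal sketch of ordinary least squares with more than one predictor, written with only the Python standard library. The predictor names (`domain_trust`, `twitter_mentions`), the `score` values, and the helper functions are all hypothetical and invented for illustration; they are not the data or tooling behind this post.

```python
# Ordinary least squares with several predictors, standard library only.
# All data below are made up for illustration.

def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Use the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] that minimize the squared error."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical predictors per site: (domain_trust, twitter_mentions)
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
# Synthetic outcome built as 1 + 2 * domain_trust + 0.5 * twitter_mentions
score = [1 + 2 * x1 + 0.5 * x2 for x1, x2 in predictors]

coefficients = fit_ols(predictors, score)
print([round(c, 6) for c in coefficients])
```

Because the synthetic outcome is an exact linear combination of the two predictors, the fit should recover the intercept and both slopes. Real ranking data would be noisy, and a statistics package would normally replace the hand-written solver.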
As we add new variables, we are concerned that variables may overlap. Consider subdomain trust and root domain trust, both of which are estimated values of the trustworthiness of the domain. We would expect these two variables to be closely related to each other. As a result, we would want to include only one or the other. If we include both, we risk getting misleading results. Running a simple regression for each variable and an interaction model for both gave the following results.
As a reminder from the previous blog posts, R² (r square) measures goodness of fit: the share of the variation in the dependent variable that the statistical model explains. An R² of 1 would mean the model perfectly predicts the dependent variable.
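As a small illustration of that definition, the sketch below computes r square from actual and predicted values. The function name `r_squared` and the numbers are hypothetical, chosen only to show the two extremes.

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]        # made-up observed values
perfect = [3.0, 5.0, 7.0, 9.0]       # a model that predicts every point exactly
mean_only = [6.0, 6.0, 6.0, 6.0]     # a model that always guesses the average

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

A model that does no better than guessing the average scores 0, and a perfect model scores 1, so a value such as .18 sits much closer to the first case than the second.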
For subdomain trust:
For root domain trust:
For both root domain and subdomain trust:
The R² for the full model, .1829, is the same as for the subdomain trust model alone. This indicates that the new variable adds nothing to the model that is not already included. We now consider root domain trust a significant variable. By improving the trust of the domain, we improve the website’s ability to rank in Google.
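One quick way to spot this kind of overlap before fitting a combined model is to measure how strongly two candidate predictors move together. The sketch below computes the Pearson correlation by hand. The sample values and the 0.9 cutoff are assumptions made for illustration, not figures from our data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values for six hypothetical sites.
subdomain_trust = [4.1, 5.0, 6.2, 3.3, 7.4, 5.8]
root_domain_trust = [4.3, 5.1, 6.0, 3.5, 7.6, 5.9]
twitter_mentions = [12, 3, 40, 25, 8, 17]

OVERLAP_THRESHOLD = 0.9  # arbitrary cutoff, an assumption for this sketch

trust_overlap = pearson_r(subdomain_trust, root_domain_trust)
twitter_overlap = pearson_r(subdomain_trust, twitter_mentions)

for name, r in [("root_domain_trust", trust_overlap),
                ("twitter_mentions", twitter_overlap)]:
    verdict = "overlaps" if abs(r) > OVERLAP_THRESHOLD else "adds separate information"
    print(f"subdomain_trust vs {name}: r = {r:.3f} ({verdict})")
```

In this invented sample the two trust scores are almost perfectly correlated, so only one of them would be kept, while the Twitter variable is nearly unrelated to trust and is a candidate to add.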
Since at least 2011, Google has taken various social factors into account in a site’s rankings. Luckily, Twitter can be added easily to the model, resulting in an increase in R². For the combined impact of Twitter and the domain’s trust on Google rankings we see:
Thus, we know that there is a correlation between a strong social media presence using Twitter and a website’s organic search engine rankings on Google.
The biggest benefit that we have gained from our heavy investment in our math systems comes when a Google update is released. We have found that websites that receive a Google penalty are much easier to check and then fix when you have real numbers that point to the underlying cause. Our Google penalty checker tools have helped Fruition pull new clients out from under a penalty.
The next step is to continue looking for potential new variables to add to the model. We look specifically for variables that are not closely related to the others and that we can easily change. A variable like Twitter mentions can be changed easily by increasing the number of references the site has from Twitter. A variable like trust in the domain is much more difficult to change because it takes into account user interactions on the website. We have done this exercise with ~450 different variables that we have isolated. These variables range from social media impact (Facebook Likes, Facebook Shares, Stumbleupons) to onsite variables such as the sentence structure on the page or page speed.
As we build these models, we must also check each model against other data sets to verify whether or not the variables are causal. The ultimate goal is a large model that closely reflects the Google algorithm, made up of variables that are all significant across various data sets.
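A minimal sketch of checking a model against a second data set follows: fit on one set of sites, then score the fit on sites the model never saw. The data, the single predictor, and the function names are hypothetical. A good score on the second set shows the relationship holds up elsewhere; it does not by itself establish causation.

```python
def fit_simple(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(actual, predicted):
    """Share of variation in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Data set A (used to fit) and data set B (held back); all values invented.
fit_x = [1, 2, 3, 4, 5]
fit_y = [2.1, 3.9, 6.2, 7.8, 10.1]
check_x = [6, 7, 8, 9]
check_y = [12.2, 13.8, 16.1, 18.0]

intercept, slope = fit_simple(fit_x, fit_y)
check_predictions = [intercept + slope * x for x in check_x]
holdout_r2 = r_squared(check_y, check_predictions)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"R-squared on the held-back data set = {holdout_r2:.3f}")
```

If the score collapses on the second data set, the variable was probably fitting noise in the first one and should not stay in the larger model.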