# Regression Analysis for SEO

Posted on November 8, 2012. • Written by Brad Anderson

Google is adamant that there is no way to “beat” their search algorithms. The fact is that these algorithms are nothing more than mathematical equations, and they do have common threads. Granted, they are very complex, with hundreds of years’ worth of man-hours invested in them. Nevertheless, there are statistical methods that allow us to approximate an unknown equation with a similar one. This blog post breaks down a few of the statistical methods that Fruition uses to improve our clients’ SEO.

## The Important Stuff First

Fruition’s statistical models tell us that several known variables are still important. Those are:

• H1 tags,
• the importance, format, and density of title tags vary widely across different searches,
• site traffic is important, and
• over-optimization of anchor text is a killer.

Those of course are just a few of the tidbits that we know. The following paragraphs lay the groundwork for how we come to make those statements.

## Linear Regression for SEO

The simplest and most easily understood form of regression is linear regression. In linear regression we find the “best” line through the data. Essentially, we find the line that passes as close as possible to all of the points in a dataset at once, typically by minimizing the sum of the squared vertical distances between the points and the line. Linear regression is generally taught in freshman-level college courses.

Here at Fruition we have been using advanced statistical models to focus our clients’ SEO budgets on the variables that we can influence and that are most likely to improve rankings. Some of the things our statistical models have uncovered are the over-optimization of anchor text and the increased importance of social media and page speed to a website’s rankings. This blog post explains some basic and intermediate-level statistics that you can apply to your search marketing efforts.

Figure 1 shows the results of a simple regression analysis conducted using SAS. The line represents the best available line through all of the data. Notice that the line is not very close to many of the points it represents, which indicates a poor fit. In the table, “R-Square” = 0.1103. The R-square value indicates that this line accounts for only 11.03% of the variability in the data points. We would hope for a value closer to 1, which would indicate a near-perfect fit. As a general rule we look for a value of at least around 0.3, since values between 0.05 and 0.2 can indicate no relationship at all.

Figure 1 – Linear Regression for SEO
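To make the R-square calculation concrete, here is a minimal sketch in plain Python rather than SAS. The data values and the names `keyword_count` and `rank_score` are hypothetical and chosen only for illustration; they are not the data behind Figure 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variability in ys that the fitted line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: one predictor and a made-up ranking score per page.
keyword_count = [1, 2, 3, 4, 5, 6, 7, 8]
rank_score = [3, 9, 2, 8, 4, 10, 5, 7]

slope, intercept = fit_line(keyword_count, rank_score)
r2 = r_squared(keyword_count, rank_score, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} R-square={r2:.4f}")
```

An R-square near 0 means the line explains almost none of the scatter, while a value near 1 means the points sit almost exactly on the line.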

In SEO we all agree that more than one variable determines the overall rank of a site. Multiple linear regression allows us to add more predictor variables. With more variables, new issues arise. We need the variables to be independent of each other (otherwise we may find a high R-square value due to overlapping variables). The line now travels through “n-space”; it is no longer a two-dimensional thing that can be shown on paper. Rather, every variable adds another dimension. For 3 variables the line would be in 3-D, but for more than three it is difficult to fathom what the line “looks” like.
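As a rough illustration of adding predictors, the sketch below fits a two-predictor model by solving the least-squares normal equations in plain Python. The predictor names (`links`, `page_speed`) and all values are hypothetical. The outcome is generated from a known equation, y = 1 + 2·links − 3·page_speed, so the recovered coefficients can be checked by eye.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ..."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical observations: (links, page_speed) per page.
predictors = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
outcome = [1, 3, -2, 0, 2]  # generated as 1 + 2*links - 3*page_speed

coefs = fit_multiple(predictors, outcome)
print([round(c, 6) for c in coefs])
```

If two predictors overlap heavily (for example, one is nearly a multiple of the other), the system being solved becomes close to singular and the coefficients become unstable, which is one practical reason to check that the variables are independent.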

## Multiple Linear Regression for SEO

In the context of Search Engine Optimization (SEO), multiple linear regression doesn’t really work, for two reasons. First, linear regression assumes that the dependent variable (in this case site rank) is measured on an interval scale. Second, linear regression assumes a straight-line relationship, whereas in reality some variables are good up to a point and then become bad, so the true relationship is curved.

Interval data means that the distance between the first and second ranked items is the same as the distance between the fourth and fifth, and so on. As an example, consider how we measure height: a six-foot person is taller than a five-foot person by the same distance as a five-foot person is taller than a four-foot person, because all inches are the same length. Site rank is measured only on an ordinal scale, where first is better than second and so on, but nothing is known about the distance between them. The best example of ordinal data is education level. We know a J.D. requires more study than an M.S.; however, is the distance between a J.D. and an M.S. the same as the distance between a B.S. and an M.S.? I can attest that it is not. As applied to the SEO world, the gap in the quantity of links built and the quality of onsite content is not the same between the sites ranked 1 and 2 as it is between the sites ranked 2 and 3, and so on.

The shape of the relationship between the predictor variables and the dependent variable determines how rank responds as a predictor changes (actual site rank is the dependent variable, and other variables such as keyword count are the predictors). For instance, the easiest type of spam to spot is a webpage that repeats the keyword 100,000 times. A keyword count of 100,000 is way too much, but not having the keyword at all is also bad. For many variables, the good spot lies somewhere between the two extremes. Figure 2 shows a simple example of a curvilinear relationship. In the example, note that R²=.2093, which isn’t horrible, but look at the points. There is a clear relationship, but it isn’t a straight line, so linear regression cannot find it.

Figure 2 – Advanced Statistics for SEO
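The effect can be reproduced with a small sketch. The data below are hypothetical and deliberately idealized (a perfect inverted U peaking at a keyword count of 5), not the data behind Figure 2. A straight line fitted to the raw predictor finds nothing, while the same straight-line fit on a transformed predictor (squared distance from the assumed peak) captures the relationship. The peak location is an assumption that would have to come from inspecting the data.

```python
def linear_r2(xs, ys):
    """R-square of the least-squares straight line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # For a single predictor, R-square equals the squared correlation.
    return (sxy ** 2) / (sxx * ss_tot)


# Hypothetical inverted-U data: the score is best at a keyword count of 5.
keyword_count = list(range(11))
score = [25 - (k - 5) ** 2 for k in keyword_count]

# Straight line on the raw predictor: the rise and the fall cancel out.
r2_raw = linear_r2(keyword_count, score)

# Straight line on the squared distance from the assumed peak at 5.
distance_from_peak = [(k - 5) ** 2 for k in keyword_count]
r2_transformed = linear_r2(distance_from_peak, score)

print(f"raw R-square={r2_raw:.4f} transformed R-square={r2_transformed:.4f}")
```

Real data would be noisier, so neither value would be exactly 0 or 1, but the gap between the two fits is what signals a curved relationship.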

## Logistic Regression for SEO

We now move to a more appropriate model. Logistic regression by itself solves the first issue with linear regression as applied to SEO, because it does not assume the data are measured on an interval scale. This single change brings with it other changes in the way the results are interpreted and in the information we can derive from the model. Logistic regression is driven by a concept called an odds ratio. Odds are simply the probability of success divided by the probability of failure. Here success and failure simply refer to getting, or not getting, what you are looking for. In a horse race, a given horse’s odds are the probability the horse will win divided by the probability it will lose. So odds of 3:1, or just 3, indicate the horse is three times as likely to win as it is to lose.
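The horse-race arithmetic can be written out in a few lines. This is only an illustrative sketch; the function names are made up for this example. The `log_odds` helper is included because logistic regression models the logarithm of the odds as a linear function of the predictors.

```python
import math


def odds(p):
    """Odds of success: probability of success over probability of failure."""
    return p / (1 - p)


def probability(o):
    """Convert odds back into a probability of success."""
    return o / (1 + o)


def log_odds(p):
    """Natural log of the odds (the logit), the scale logistic regression uses."""
    return math.log(odds(p))


win_probability = 0.75  # hypothetical horse that wins 3 races out of 4
print(odds(win_probability))  # 3.0, i.e. odds of 3:1
```

Odds of exactly 1 (a probability of 0.5) correspond to log-odds of 0, so positive log-odds favor success and negative log-odds favor failure.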

## Odds Ratio

An odds ratio is the odds of a variable being in a category divided by the odds of the same variable being in the next lower category. An odds ratio (Θ) of one indicates the variable is equally likely to fall in either category. Θ<1 indicates the variable is more likely to be in the lower category, and Θ>1 indicates the variable is more likely to be found in the higher category. In the context of SEO this is useful because we can split the rankings into categories: say the top 20 results, the next 20 results, and so on for the top 100 or so. Then we can look at various variables and see whether the odds ratio of a given variable lends itself to improving rank.
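One common way to compute an odds ratio is from a two-by-two table of counts. The sketch below assumes a hypothetical yes/no feature (say, the keyword appearing in the H1 tag) counted among pages in the top 20 results and among pages in the next 20. All counts are invented for illustration, and this simple calculation may differ from the model-based estimates a full logistic regression would produce.

```python
def odds_ratio(with_hi, without_hi, with_lo, without_lo):
    """Odds of having the feature in the higher category over the lower one."""
    odds_hi = with_hi / without_hi
    odds_lo = with_lo / without_lo
    return odds_hi / odds_lo


# Hypothetical counts of pages with and without the feature.
top_20 = {"with": 12, "without": 8}   # results 1-20
next_20 = {"with": 6, "without": 14}  # results 21-40

theta = odds_ratio(top_20["with"], top_20["without"],
                   next_20["with"], next_20["without"])
print(f"odds ratio = {theta:.2f}")
```

A value above 1 here would suggest the feature is more common in the higher-ranked category, while a value near 1 would suggest it does not distinguish the two groups.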

## Issues with Linear Regression for SEO

The second issue with linear regression for SEO, the assumption of a straight-line relationship, is more difficult to address. Each variable needs to be analyzed individually so that its shape can be determined manually. Once the shape is known, we move forward by watching for values that are either too large or too small.
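One simple way to inspect a variable's shape by hand is to group its values into bins and compare the average outcome in each bin. The sketch below is an assumption about how such a check could look, not a description of Fruition's process; the variable names and data are hypothetical.

```python
def mean_outcome_by_bin(xs, ys, n_bins):
    """Average of ys within equal-width bins of xs (None for empty bins)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        # The largest value falls exactly on the upper edge, so clamp it.
        idx = min(int((x - lo) / width), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical data: the score rises and then falls as keyword density grows.
keyword_density = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
score = [1, 2, 3, 4, 8, 9, 9, 8, 4, 3, 2, 1]

bin_means = mean_outcome_by_bin(keyword_density, score, n_bins=3)
print(bin_means)  # [2.5, 8.5, 2.5]
```

A middle bin that scores higher than both ends suggests a "good up to a point" variable, and the bin edges give a rough idea of which values are too small or too large.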