Analytics, models, and algorithms

It has recently become common to find the term algorithm in expressions such as “the power of the algorithms”.

The expression is used to mean the power of data analysis processes and methods that can automatically predict or infer features, e.g. people's preferences. A typical example is predicting the interests of people surfing the web in order to show them targeted advertisements.

I personally do not like this kind of metonymical usage of the term; here I briefly explain why.

Definitions

There are a few key related terms:

algorithm

is a description of a computation consisting of an ordered, finite sequence of elementary steps (operations or instructions) that produces a result in a finite time.

(mathematical) model

is a description of a system, typically using mathematical concepts and language.

analytics

is the process and the methods for the discovery, interpretation, and communication of meaningful patterns in data.

How it works

When people (mostly journalists) talk about algorithms, what they actually mean is the analytics process and its methods, that is:

  1. You start with a data set
  2. You define or reuse an algorithm to build a (mathematical) model
  3. You get a model that describes your data, e.g. one showing a correlation

1 – Data sets

Average per capita cheese consumption (in lb) in the years 2000 through 2009:

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
29.8 30.1 30.5 30.6 31.3 31.7 32.6 33.1 32.7 32.8

Number of people who died by becoming tangled in their bedsheets, in the same years:

2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
 327  456  509  497  596  573  661  741  809  717

The above weird pair of data series is taken from http://www.tylervigen.com/spurious-correlations.
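To experiment with the two series, they can be typed into R as plain vectors. The names year, cheese, and deaths below are my own labels, not something defined by the source; the snippet is only a quick sketch that checks how strongly the two series move together.

```r
# Hypothetical variable names; the values are copied from the two series above.
year   <- 2000:2009
cheese <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)  # lb per capita
deaths <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)           # deaths per year

# Sanity check: one value per year in both series.
stopifnot(length(cheese) == length(year), length(deaths) == length(year))

# Pearson correlation coefficient between the two series.
r <- cor(cheese, deaths)
print(round(r, 3))
```

The coefficient comes out at about 0.947, which is very high for two quantities that have nothing obvious to do with each other.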

2 – Algorithm

Suppose we want a linear model that links the two series. Such a model can be fitted by linear regression, using the least squares method.

The method can be implemented by different algorithms; one of them, written in R, solves the normal equations directly:

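leastSquare <- function(x, y) {
    # Design matrix: a column of ones (intercept) next to the predictor x
    X <- matrix(c(rep(1, length(x)), x), ncol = 2)
    # Solve the normal equations: b = (X'X)^-1 X'y
    b <- solve(t(X) %*% X) %*% t(X) %*% y
    # Return c(intercept, slope) as a plain numeric vector
    return(as.numeric(b))
}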

Once the algorithm is executed on the two series of values (cheese consumption as x, deaths as y), it returns two values: -2977.3485 (the intercept) and 113.1329 (the slope).
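As a cross-check, the same coefficients can be obtained from R's built-in lm() function, which solves the least squares problem through a QR decomposition rather than through the normal equations. This is only a sketch: the vector names cheese and deaths are my own, and the function is repeated so that the snippet runs on its own.

```r
# Hypothetical variable names holding the two series from the data sets section.
cheese <- c(29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8)
deaths <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)

# Normal-equations solver, repeated from the text.
leastSquare <- function(x, y) {
    X <- matrix(c(rep(1, length(x)), x), ncol = 2)
    b <- solve(t(X) %*% X) %*% t(X) %*% y
    as.numeric(b)
}

# Two different algorithms for the same least squares method.
viaNormalEquations <- leastSquare(cheese, deaths)
viaLm <- unname(coef(lm(deaths ~ cheese)))

print(viaNormalEquations)
print(viaLm)

# The two results should agree up to rounding error.
stopifnot(isTRUE(all.equal(viaNormalEquations, viaLm, tolerance = 1e-6)))
```

Both calls print the intercept first and the slope second. The point is that the method (least squares) stays the same while the algorithm that carries it out changes.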

3 – Model

The two numbers are the coefficients (intercept and slope) of the regression line, and allow us to write the equation:

DbBSE = 113 * ApCCC - 2977

Where:

DbBSE : Death by BedSheet Entanglement

ApCCC : Average per Capita Cheese Consumption

The equation describes a model of how the two measures are linked to each other. Apparently, then, an increase of one lb in the average per capita cheese consumption is associated with about 113 more deaths by bedsheet entanglement per year.

Of course the link is almost certainly due to pure chance: in general, the presence of a correlation does not imply causation.
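To make the reading of the equation concrete, the fitted line can be evaluated in R. The function name predictDeaths is a hypothetical helper of mine, and the coefficients are the unrounded values returned by the algorithm.

```r
# Coefficients returned by the least squares algorithm (intercept, slope).
intercept <- -2977.3485
slope     <- 113.1329

# Hypothetical helper: deaths predicted by the linear model for a given
# average per capita cheese consumption (in lb).
predictDeaths <- function(apccc) {
    slope * apccc + intercept
}

# One extra lb of cheese per capita shifts the prediction by the slope.
print(predictDeaths(32) - predictDeaths(31))

# Prediction for the 2005 consumption (31.7 lb); the observed count was 573.
print(predictDeaths(31.7))
```

The first value is the slope, about 113. The second is roughly 609, against the 573 deaths actually recorded for 2005: the line only approximates the data.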

The role of algorithms

The whole process is what is generally called analytics; the model is the description of how different measures are linked to each other; and the algorithms are a means of identifying the parameters of the model, starting e.g. from historical data.

In fact, the “algorithms” are a small part of such a complex process, so the expression “the power of the algorithms” is a synecdoche.

In general, I believe that such an expression is meaningful and correct if it is clear that it is a synecdoche; otherwise it is a misleading simplification.

 

 

 
