Three Common Data Analytics Mistakes

Data analytics is still far from being a simple process. So much experience, domain knowledge, and handcraft go into the boiling pot of every data analytics project that it is really hard to set in stone which steps should be used for which problem. So, when I am asked "Which model should we use and when?" I usually avoid a clear answer, because a clear answer is impossible. When I am asked which pre-processing or which missing value strategy should be used for which problem, I still avoid a clear answer.
That said, there are still a few obvious mistakes that must be avoided in any data analytics project. I will describe here only three such obvious, painfully embarrassing mistakes: the three I see most often.
1. Normalize before using distance-based algorithms, such as clustering.
A distance-based clustering algorithm measures the distance between two points in the data set and groups the closest points together. In the data vector, however, there might be features on very different scales: an age, for example, ranges between 0 and 100, while a ratio ranges between 0 and 1. A distance measure such as the Euclidean distance will then be strongly influenced by the age values and much less by the ratio values, and the final clusters will be heavily dominated by age. In practice we over-weight ages and under-weight ratios. Data normalization, which brings all numbers into the same range ([0, 1] for example), removes this discrepancy and ensures that all input features are treated equally in the distance calculation, as the sketch below illustrates.
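As an illustration, here is a minimal Python sketch (using scikit-learn; not part of the original post, and the feature names and values are invented) that clusters the same synthetic data with and without min-max normalization:

```python
# Minimal sketch: k-means on a toy "age" / "ratio" data set,
# with and without min-max normalization. Values are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
# Column 0: "age" on a 0-100 scale; column 1: a "ratio" on a 0-1 scale.
age = np.concatenate([rng.uniform(20, 30, 50), rng.uniform(60, 70, 50)])
ratio = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([age, ratio])

# Without normalization, the Euclidean distance is dominated by the age column.
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# With min-max normalization, both columns fall into [0, 1] and contribute equally.
X_scaled = MinMaxScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print("Cluster sizes on raw data:   ", np.bincount(labels_raw))
print("Cluster sizes on scaled data:", np.bincount(labels_scaled))
```

On the raw data the clusters split almost exclusively along the age axis; after scaling, the ratio column gets a say as well.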
2. Do not mix training and deployment.
Often I hear: "My deployment workflow is extremely slow." When we open the incriminated workflow, I often find that training has been included in the deployment workflow as well. Some basic explanation is in order. There is a training phase, which builds a model and tests it to verify its accuracy. And there is a SEPARATE deployment phase, which takes the model - previously trained and verified - and applies it to new data. In terms of workflows, this means that the training workflow includes a learner node, a predictor node, and probably a scorer node, while the deployment workflow includes just a model reader node and a predictor node. During deployment, the predictor node takes the incoming data of the moment, scores it against the model, and generates predictions for these new data points. Only in the case of real-time training, where the amount of data justifies constant re-training of the models, are learner nodes included in deployment workflows! But those are special applications, and they often require a big data and Spark integration. The sketch below shows the same separation in code form.
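As a rough analogy in code rather than KNIME nodes, here is a minimal sketch (assuming scikit-learn and joblib; the file name and data set are placeholders) of what the separation looks like: the training phase fits, scores, and saves the model, while the deployment phase only loads it and predicts.

```python
# Minimal sketch of separating training from deployment (assumed scikit-learn/joblib setup).
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# --- Training phase: learner + scorer, run once or on a re-training schedule ---
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
joblib.dump(model, "model.joblib")  # the equivalent of a model writer node

# --- Deployment phase: model reader + predictor only, run on every new batch ---
deployed_model = joblib.load("model.joblib")      # no learner here, no re-training
new_predictions = deployed_model.predict(X_test)  # X_test stands in for new incoming data
print("Predictions for new data:", new_predictions[:5])
```

Keeping the two scripts (or workflows) separate is exactly what keeps deployment fast: the expensive fitting step never runs when new data arrives.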
3. Correlation does not mean causality.
As the "Spurious Correlations" website (http://www.tylervigen.com/spurious-correlations or http://priceonomics.com/do-storks-deliver-babies/) constantly reminds us, never assume that correlation means causality. Trends can be correlated for a variety of reasons, but this does not mean that one depends on the other. A third (or fourth, or fifth) variable can play the controlling role. So, once you have your results, whether from a correlation measure or from a machine learning model, always keep a healthy degree of skepticism about them. This helps you avoid making statements with an unjustified degree of certainty! The toy example below shows how easily two unrelated trends end up strongly correlated.
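Here is a minimal sketch (with invented numbers, in the spirit of the storks-and-babies example) showing that two series that both simply trend upward over time are strongly correlated even though neither causes the other:

```python
# Minimal sketch: two independent upward trends over the same years
# yield a high Pearson correlation, with time as the hidden third variable.
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2000, 2020)
storks = 100 + 3 * (years - 2000) + rng.normal(0, 2, years.size)      # hypothetical counts
babies = 5000 + 150 * (years - 2000) + rng.normal(0, 80, years.size)  # hypothetical counts

r = np.corrcoef(storks, babies)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # close to 1, yet the shared driver is just time
```

The correlation is driven entirely by the common time trend, not by any causal link between the two series.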
What about your experience? Are there other obvious mistakes to warn data analytics newbies about?

Source : http://bit.ly/1Rm5kI1
