Tuesday, October 18, 2011

The Analytical Data Mart is a Myth

Over the past two years, I have observed an increasing number of RFPs which include the setting up of analytical data marts as part of the scope. This is a disturbing trend. It shows that analytical projects are being driven by non-analytical expertise.

I don't blame the IT guys for the way the RFPs are designed. They take a typical data warehouse / reporting approach, in which the requirements are known and the data warehouse is expected to maintain and provide the data elements for the reporting needs. Analytics seems to follow a similar analogy: we have an end requirement -- say, lapse prediction -- and this requirement needs data elements. But this is where the similarity ends. Before the model is developed, one does not know which of the data elements will be needed. In fact, one of the objectives of model building is to identify the data elements that are significant contributors to the event under consideration -- here, the lapsation of a policy.


Vendors in the market tout analytical data models. These basically consist of hundreds of variables and derived variables that are likely to play the role of a contributor to the observed event. IT teams, following the sales pitch of such vendors, often include in the scope the creation of an analytical data mart containing all the hundreds of variables listed.

If a company has enough budget and time (and patience), it may be fine to create an analytical data mart of over 800 variables drawn from multiple data sources and involving complex transformations. But this is never the case.

Now consider the models built on this data mart. Any model used in business will hardly ever have more than 20 variables (direct and derived combined). So the effort on the remaining 780+ variables is wasted.

An ideal way would be to let the statisticians use data dumps for the modelling activity. Once a model is developed, tested and found useful for business deployment, all that productionizing requires is making those 6 to 20 variables available for scoring. Compare this with creating an 800+ variable data mart -- the former is a far more practical approach.
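To make that concrete, here is a minimal sketch of the workflow in Python, assuming a one-off pandas data dump and scikit-learn for the modelling. The file name, the "lapsed" flag and the column handling are all hypothetical; the point is only that the model, not the data mart, decides which handful of columns ever needs to reach production.

# Minimal sketch: model on a wide data dump, then productionize only the
# variables the final model actually uses. Names are made up for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# One-off data dump pulled by the statistician (hundreds of candidate columns).
dump = pd.read_csv("policy_dump.csv")
X = dump.drop(columns=["lapsed"])   # candidate predictors
y = dump["lapsed"]                  # event flag: did the policy lapse?

# An L1-penalised model drives most coefficients to zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(model).fit(X, y)

# Only this handful of columns needs to be maintained for scoring;
# the remaining hundreds of dump columns never reach production.
production_columns = X.columns[selector.get_support()].tolist()
print(production_columns)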

I have not even dwelt on the process of modelling and data preparation. Based on the objective being tested, the data preparation will differ hugely from model to model. Often the analytical data marts get ignored and the analysts go back to data dumps to create the analytical data set for modelling. See my earlier post on time-stamped data sets for modelling (http://crmzen.blogspot.com/2010/03/time-factor-in-modeling.html) to understand the complexity in creating data for statistical modelling.

It will be good for the IT department and the data warehousing personnel to understand this difference in the analytical process, especially since the difference is not subtle. There is no need to wait 12 to 18 months (or more) until the data warehouse is set up and populated before the analytical activity can begin. And the return on investment is much higher with predictive analytics; compounded with the quick turnaround, the returns multiply.

Monday, October 03, 2011

Statistics Hints at the Existence of God

At the outset of this post, let me make a few things clear. I am not an atheist. I have faith in the Bible, but I do not hesitate to question facts about the Bible. Now, according to some, that makes me an atheist. At least it does not make me a fanatic. So I leave it at that.

My professor of economics, Mr. Sakhalkar, once said that when one reads, one should not be selective; that creates a bias and restricts one's circle of influence. Going down that path a couple of years ago led me to the origins of religion.

A very interesting fact came to light. "God" was an invention of man -- someone to blame for things that man did not understand and could not control. In the old days, the most frightful element for man was fire. He could not understand it, he could not control it, and it was the most destructive force. So a Fire God existed. And among all the gods, the Fire God was the most powerful one. There was a Water God, a Land God, a Wind God, a Sun God, a Star God and so on.

As man started understanding the elements of nature, the importance of each such God diminished, until the God itself was abolished. Eventually, as we sit in the twenty-first century, almost all of the element Gods are extinct. When the Gods started disappearing, man found reason to blame other men for the various elements. So a forest fire was because some fool dropped a lit cigarette on dry grass. Understanding man and his nature became a prime topic of importance. God now started taking the form of man -- a convenient person to blame when things go beyond explanation.

A statistical model is a function that describes the observed behaviour based on identified independent variables. But what most sales personnel (call them consultants or statisticians) leave out is the "epsilon" (ε). Every function in statistics has the epsilon attached to it. Consider the following function, which forecasts the amount of sales at a retail outlet.

Y = aX + bZ + ε

where Y is the amount of sales, X is the average salary of store visitors and Z indicates whether it is raining. ε represents the epsilon, or the error component. This is the statisticians' way of keeping themselves legally safe (yup... statisticians are smarter than lawyers). It implies that though the function predicts the amount of sales, there is an error component which explains the deviation of the actual amount from the predicted amount. So if the actual sales differ from the predicted amount, blame the error component and not the statistician.

There is a whole lot of effort expended in trying to understand and explain this error component. New variables are found, new algorithms applied, but the ε still lives on. To date, there has not been a statistical model that does not have the ε in it.
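To see this in miniature, here is a small Python sketch of the sales function above, with entirely made-up numbers for the salary, rain and sales figures. Even after fitting the best possible a and b by least squares, the residuals -- the ε -- do not go away.

# Minimal sketch of Y = aX + bZ + epsilon with made-up data:
# even the best-fitting a and b leave a residual behind.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(50_000, 10_000, n)   # average salary of store visitors
Z = rng.integers(0, 2, n)           # 1 if it is raining, else 0
eps = rng.normal(0, 500, n)         # the part no variable explains
Y = 0.02 * X - 300 * Z + eps        # observed sales

# Fit a and b by least squares.
A = np.column_stack([X, Z])
coef, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
residuals = Y - A @ coef

print("estimated a, b:", coef)
print("residual spread (the epsilon lives on):", residuals.std())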

Even a statement like "All crows are black" will be stated by a statistician as "It is with a 95% level of confidence that 99% of the time all crows are black." This is with the understanding that if someone sees a crow which is not all black, the statistician is safe with his statement.

So now we have an EPSILON which is the unexplained factor, responsible for all the deviations in the statistical model. In some cases, like predicting the likelihood of a patient surviving a critical operation, this EPSILON also represents a dangerous and frightening probability. A play in the equation that is unexplained and blamed for any deviation in our predictive capabilities. EUREKA -- we have found the "STATISTICS" GOD.

ε
 