Tuesday, October 18, 2011

The Analytical Data Mart is a Myth

Over the past two years, I have observed an increasing number of RFPs that include setting up an analytical data mart as part of the scope. This is a disturbing trend: it shows that analytical projects are being driven by non-analytical expertise.

I don't blame the IT folks for the way these RFPs are designed. They take a typical data warehouse / reporting approach: the requirements are known up front, and the data warehouse is expected to maintain and provide the data elements the reports need. Analytics appears to follow the same pattern. We have an end requirement, say a lapse-prediction model and its predictors, and that requirement needs data elements. But this is where the similarity ends. Before the model is developed, one does not know which data elements the end result will need. In fact, one of the objectives of model building is to identify the data elements that are significant contributors to the event under consideration, the lapsation of a policy.
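To make this concrete, here is a minimal, purely illustrative sketch in Python (the data is synthetic and none of the numbers come from a real project): a couple of hundred candidate variables go in, and an L1-penalised logistic regression reveals, only after fitting, which handful actually contribute to the lapse event.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative only: 200 candidate variables, of which just a
# handful (unknown in advance) actually drive the lapse event.
rng = np.random.default_rng(0)
n_policies, n_candidates = 5000, 200
X = rng.normal(size=(n_policies, n_candidates))

# Hypothetically, only 5 variables truly matter.
true_drivers = [3, 17, 42, 88, 151]
logits = X[:, true_drivers] @ np.array([1.2, -0.8, 0.9, 1.5, -1.1])
y = rng.random(n_policies) < 1 / (1 + np.exp(-logits))  # lapse flag

# An L1-penalised fit drops the non-contributing variables on its own.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
model.fit(X, y)

kept = np.flatnonzero(model.coef_[0])
print(f"{len(kept)} of {n_candidates} candidate variables survived: {kept}")
```

The point of the sketch is the last line: which variables survive is an output of the modelling exercise, not an input you could have specified in an RFP.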


Vendors in the market tout analytical data models. Such a model typically consists of hundreds of variables and derived variables, each of which might plausibly play the role of a contributor to the observed event. IT teams, following the sales pitch of such vendors, often include in the scope the creation of an analytical data mart containing every one of those variables.

If a company has enough budget, time and patience, it may be feasible to create an analytical data mart of over 800 variables drawn from multiple data sources and involving complex transformations. But that is never the case.

Now consider the models built on this data mart. Any model used in business will rarely have more than 20 variables (direct and derived combined). The remaining 780+ variables were wasted effort.

A far more practical approach is to let the statisticians work from data dumps for the modelling activity. Once a model has been developed, tested and found useful for business deployment, productionizing it only requires making those 6 to 20 variables available for scoring. Compare that with building an 800+ variable data mart up front.
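Here is what that productionizing step might look like, as a hedged sketch with hypothetical variable names and coefficients: the production pipeline only ever has to supply the final model's short variable list.

```python
import numpy as np

# Hypothetical output of the modelling exercise: the handful of
# variables (direct and derived) that the final model actually uses.
FINAL_PREDICTORS = ["premium_to_income", "months_since_last_payment",
                    "num_riders", "agent_tenure_years", "channel_is_direct"]
COEFFICIENTS = np.array([1.2, -0.8, 0.9, 1.5, -1.1])  # illustrative values
INTERCEPT = -2.3

def score_policy(record: dict) -> float:
    """Return the lapse propensity for one policy record.

    The production job only needs to supply FINAL_PREDICTORS,
    not the hundreds of variables in a full analytical data mart.
    """
    x = np.array([record[name] for name in FINAL_PREDICTORS])
    return 1 / (1 + np.exp(-(INTERCEPT + COEFFICIENTS @ x)))

print(score_policy({"premium_to_income": 0.4,
                    "months_since_last_payment": 2,
                    "num_riders": 1,
                    "agent_tenure_years": 3,
                    "channel_is_direct": 0}))
```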

I have not even dwelt on the process of modelling and data preparation. Depending on the objective being tested, the data preparation differs hugely from model to model. Often the analytical data mart gets ignored and the analysts go back to the data dumps to create the analytical data set for modelling. See my earlier post on time-stamped data sets for modelling (http://crmzen.blogspot.com/2010/03/time-factor-in-modeling.html) to understand the complexity of creating data for statistical modelling.
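As a rough illustration of why (the column names here are assumed for the example), predictors must be computed from data strictly before an observation date, while the outcome is observed in a window after it, so each modelling objective demands its own data preparation.

```python
import pandas as pd

# Sketch of time-stamped data preparation with assumed column names.
obs_date = pd.Timestamp("2011-01-01")

policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "issue_date": pd.to_datetime(["2009-05-01", "2010-02-15", "2010-11-20"]),
    "lapse_date": pd.to_datetime([None, "2011-04-10", None]),
})

payments = pd.DataFrame({
    "policy_id": [1, 1, 2, 3],
    "paid_on": pd.to_datetime(["2010-06-01", "2010-12-01",
                               "2010-03-01", "2010-12-15"]),
    "amount": [500.0, 500.0, 750.0, 300.0],
})

# Predictors: behaviour observed strictly BEFORE the observation date.
history = payments[payments["paid_on"] < obs_date]
features = history.groupby("policy_id")["amount"].agg(
    total_paid="sum", num_payments="count")

# Outcome: did the policy lapse in the 6 months AFTER the observation date?
outcome = policies.set_index("policy_id")["lapse_date"].between(
    obs_date, obs_date + pd.DateOffset(months=6)).rename("lapsed_6m")

model_ready = features.join(outcome)
print(model_ready)
```

Change the objective (say, from lapse to cross-sell) and the observation date, windows and aggregations all change with it, which is why a one-size-fits-all data mart rarely survives contact with the first real model.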

It would be good for the IT department and the data warehousing personnel to understand this difference in the analytical process, especially since the difference is not subtle. There is no need to wait 12 to 18 months (or more) for the data warehouse to be set up and populated before the analytical activity can begin. And the return on investment is much higher with predictive analytics; compounded with the quick turnaround, the returns multiply.
