Thursday, February 07, 2013

Statistics: Is It "Guilty" or "Not Guilty"?


In college, my statistics professor had an uncanny ability to link statistical concepts to mythology and philosophy. It made his lectures fun to attend and often sent us on a parallel track, reading more about the story he had mentioned alongside the statistical theory of the day. One of the topics that often confuses statisticians-in-the-making is the right formulation of the problem or, in statistical terms, the right formulation of the hypothesis. To introduce the topic, he asked us to recall the court scene found in almost every movie. The premise of every case is that the accused is innocent unless proven guilty, and it is the responsibility of the accuser to prove guilt. The analogy in statistics is that the null hypothesis, the default assumption about a given series of observations, stands unless the data prove otherwise. In a test of normality, for example, the hypothesis states that the series follows the normal curve, and the exercise is to prove that it does not. What is more interesting is that the conclusion in legal proceedings is "given the circumstances, the accused is not guilty". The law does not state that the accused is innocent. The same holds in statistics: applying the various tests eventually leads to the conclusion that the series does not deviate significantly from the normal distribution. It does not state that the series is normally distributed, or follows a derivative thereof.
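To make the "not guilty" wording concrete, here is a minimal sketch of a normality test written with the Python standard library only. It uses the Jarque-Bera statistic, which is my choice for illustration and not something from the lecture; the function names, the 5% significance level, and the two sample series are all assumptions. The p-value relies on a large-sample approximation, so treat it as a rough guide for small series.

```python
import math
from statistics import NormalDist, fmean


def jarque_bera(sample):
    """Return (JB statistic, approximate p-value) for a list of numbers."""
    n = len(sample)
    mean = fmean(sample)
    # Central moments, dividing by n.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    # Under the null hypothesis, JB is approximately chi-square with
    # 2 degrees of freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


def verdict(sample, alpha=0.05):
    """Phrase the outcome like a court: the series is never declared 'normal'."""
    _, p_value = jarque_bera(sample)
    if p_value < alpha:
        return "reject normality"
    return "fail to reject normality"


n = 200
# Illustrative series (assumptions): evenly spaced quantiles of a normal
# distribution and of an exponential distribution.
bell_shaped = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]

print(verdict(bell_shaped))
print(verdict(skewed))
```

Note that `verdict` has only two possible answers, and neither of them is "the series is normal". The bell-shaped series is simply not rejected, which is the statistical equivalent of "not guilty".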

It is important to understand this concept when applying predictive analytics to business scenarios. Let us consider a churn model or retention model. The premise of the entire engagement is to find customers who are likely to attrite. The data set is prepared so that the population can be divided into customers who attrited in a given period and those who continued. The predictive model is then built and the scoring rule applied to the target population.

The score represents the likelihood that the customer will attrite. It is not a measure of the customer's continuity on the company's books.

The law states "in light of the known or presented evidence". Similarly, the statistical model is built on the data variables (or information) fed as inputs to the model building process. The model only concludes whether the known variables indicate attrition on the part of the customer. It does not indicate that a customer will continue the relationship. This understanding is very important when analysing and drawing inferences from statistical models. One should understand that there are other factors which may not be known or could not be quantified. These constitute the missing information, and some of it may influence some customers to attrite. Hence, we cannot say that a customer who will not attrite as per the predictive model will continue the relationship. In light of the known or input variables, the customer will not attrite: that is the verdict.
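A toy sketch of the missing-information point, under heavy assumptions: the records are made up, the scoring rule is hypothetical, and `complaints` and `relocated` are simply stand-ins for a known variable and an unknown one. The rule can only look at what it was given.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    complaints: int   # known variable, available to the model
    relocated: bool   # missing information, never shown to the model
    attrited: bool    # what actually happened


def model_flags_attrition(customer, max_complaints=2):
    """Hypothetical scoring rule that sees only the known variable."""
    return customer.complaints > max_complaints


# Made-up records, for illustration only.
customers = [
    Customer(complaints=5, relocated=False, attrited=True),
    Customer(complaints=4, relocated=False, attrited=True),
    Customer(complaints=0, relocated=True, attrited=True),
    Customer(complaints=1, relocated=False, attrited=False),
    Customer(complaints=0, relocated=False, attrited=False),
    Customer(complaints=1, relocated=True, attrited=True),
]

not_flagged = [c for c in customers if not model_flags_attrition(c)]
left_anyway = [c for c in not_flagged if c.attrited]

print(f"not flagged: {len(not_flagged)}, of which attrited: {len(left_anyway)}")
```

In this toy data, the customers who were not flagged but left anyway did so for a reason the rule never saw. "Not flagged" is a statement about the evidence presented, not a promise that the customer stays.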

One area of financial impact is default modelling in the credit lending business. The default model predicts whether the customer will default on the loan taken. Businesses tend to wrongly draw the corollary that a customer who is not predicted to default is a good customer. As a result, these customers often enjoy high ratings, and the business tends to take additional exposure to them. The belief is so strong that even the variable in the database is labelled with "good" and "bad" customers. The predictive model only states that, given the variables analysed, a customer will default (that is, be a bad customer). It should not be construed that the others will be "good" customers.

This difference is not subtle, and it is critical that businesses deploying predictive analytics understand it. As long as there are uncertainties in the business arena, the decision will be to segregate the target base into "guilty" and "not guilty". The statistics deployed aim to either "reject" or "not reject" the hypothesis. Maybe the default status variable should be labelled as "bad" or "not bad" customers.
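One way the "bad" or "not bad" labelling could look in practice, as a small sketch. The names `DefaultVerdict` and `label_customer` and the 0.5 cutoff are assumptions of mine; a real cutoff would come from the business's own risk appetite.

```python
from enum import Enum


class DefaultVerdict(Enum):
    BAD = "bad"
    NOT_BAD = "not bad"  # deliberately not "good"


def label_customer(default_probability, cutoff=0.5):
    """Map a model's default probability to a verdict label."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must be between 0 and 1")
    if default_probability >= cutoff:
        return DefaultVerdict.BAD
    return DefaultVerdict.NOT_BAD


print(label_customer(0.8).value)
print(label_customer(0.1).value)
```

Because the label set has no "good" member, nobody reading the database can mistake the absence of a default signal for a positive rating.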

This is probably one of the toughest posts for me. I had to translate a hard-core statistical idea into layman's language. It took some time to edit this post, and I believe it is a job well done. If you believe so too, kindly let me know. If not, I will be happy to get into a discussion to explain or further simplify the matter.
 