Friday, April 26, 2013

Statistics cannot play God


One of my earlier posts, titled "Statistics hints at Existence of God" <<click here for the post>>, detailed how the error component of any model hints at a philosophy analogous to concluding the existence of God. While that post argued that God exists, today's post is about why this God should be neither omnipresent nor implicit in the statistical model.

A key element of the modern-day God and his relation with humans is that God endowed humans with the capability of free will. One definition of God is an entity that is "all knowing". This entity knows the past, the present and the future. In fact, he defines the future. Yet God gave humans the capacity for free will, arguably out of his love for mankind. But this also represents a paradox in the way we define and understand God. The free will of humans gave them the option of making wrong choices. Thus emerged the possibility that God cannot now know the future, since he does not dictate the choices man makes, which in turn define the future. God, hence, took a big risk in creating humans as free beings, thereby admitting the possibility of wrongful choices.

Can a statistical model embody that spirit of granting free will? Let's look at the not-so-distant sub-prime crisis. Every financial institution that went bust or lost heavily in the crisis was a big user of statistics. Numerous case studies were published on how these institutions used statistical scores to take decisions and thereby improve business parameters. As long as the employees, or humankind, followed the dictates of the statistical decision, business thrived.

Then some mortals realised that asset prices were ballooning. So even if the creditor defaulted, the bank could recover its money by forcing the creditor to liquidate the asset. Or the bank could attach the asset of a lazy creditor and auction it at a much higher price than the outstanding loan amount. The default scores or credit scores were rendered powerless. Though the model scores showed that a customer had a questionable ability to earn and service the loan amount, the institutions still overrode this decision and went ahead with granting the loans. History knows the derivatives and the leveraging that were built on these loans. But our debate lies with the initial event, not the derivatives.

The statisticians or the analytics sponsors in these institutions, be they the Business Intelligence heads or the CxOs, decided to play God. They granted free will to the consumers of the statistical models. This was probably done out of love for the employees' capabilities and for the potential profitability. But they did not account for the risk of wrongful decisions. Eventually, when the risk materialised, the Garden of Eden was lost. For some, it cost their very existence.

If only the sponsors had played not a loving God but a tyrannical one, forcing the business to follow the scores of the statistical models, they could have emerged from the crisis much healthier. And there have been some institutions, albeit few in number, who stayed with the statistical models and refused to permit free will to the business consumers or employees.

This holds a key lesson for companies adopting analytical expertise. A vast body of historical knowledge is accumulated in the final, adopted statistical model. This knowledge is much greater than that of any individual employee. Hence, it is critical that business processes are designed such that exceptions or over-rides are minimized, if not eliminated. Every time employees or processes are allowed to override analytical models, the business has faltered and eventually the statistical approach is blamed. In such scenarios, it is not uncommon for the enterprise to abandon its analytical approach completely. And we hear comments such as "we tried analytics in the past and it does not work" --- this in a time when there are numerous examples of successful statistical applications in the very same scenario.

So, note this well: if you are deploying an analytical approach in your enterprise, do not play the loving God and grant free will to the consumers of analytics.

Thursday, February 07, 2013

Statistics: Is it "Guilty" or "Not Guilty"?


In college, my statistics professor had an uncanny ability to link statistical concepts to mythology or philosophy. It made his lectures fun to attend and often sent us on a parallel track to read more about the event mentioned along with the statistical theory of the day. One of the topics that often confuses statisticians-in-the-making is the right formulation of the problem, or in statistical terms, the right formulation of the hypothesis. In introducing the topic, he asked us to recall the court scene in every movie. The premise of every case is that the accused is innocent unless proven guilty, and it is the responsibility of the accuser to prove that the accused is guilty. The analogy in statistics is that a given series of observations is uniform unless proven otherwise. Thus, every null hypothesis states that the series is close to the normal curve, and the exercise is to prove that it is not. What is more interesting is that the conclusion in legal proceedings is "given the circumstances, the accused is not guilty". The law does not state that the accused is innocent. Again, the analogy in statistics: the application of various theories eventually brings one to the conclusion that the series does not deviate significantly. It does not state that the series is aligned to the normal distribution or a derivation thereof.

It is important to understand this concept when applying predictive analytics to business scenarios. Let us consider the churn model or retention model. The premise of the entire engagement is to find customers who are likely to attrite. The data set is prepared such that one can divide the population into those that attrited in a given period and those that continued. Accordingly, the predictive model is built and the scoring rule applied to the target population.

The score represents the likelihood that the customer will attrite. It is not a measure of the likelihood that the customer will continue on the books with the company.

The law says "in light of the known or presented evidence". Similarly, the statistical model is built on the data variables (or information) fed as inputs to the model-building process. The model only concludes whether the known variables indicate attrition on the part of the customer. They do not indicate that a customer will continue the relationship. This understanding is very important when analysing and drawing inferences from statistical models. One should understand that there are other factors which may not be known or could not be quantified. As such, they constitute the missing information. And some of this information may influence some customers to attrite. Hence, we cannot say that a customer who will not attrite as per the predictive model will continue the relationship. In light of the known or input variables, the customer will not attrite --- that's the verdict.

One area of financial impact is default modelling in the credit lending business. The default model predicts whether a customer will default on the loan taken. Businesses tend to wrongly draw the corollary that a customer who will not default is a good customer. As such, these customers often enjoy high ratings, and the business tends to take additional exposure to them. Such is the belief that even the variable in the database is labelled "good" or "bad" customer. The predictive model only states that, given the variables analysed, a customer will default (that is, be a "bad" customer). It should not be construed that the others will be "good" customers.

This difference is not subtle, and it is critical that businesses deploying predictive analytics understand it. As long as there are uncertainties in the business arena, the task will be to segregate the target base into "guilty" and "not guilty". The statistics deployed aim either to "reject" or to "not reject" the hypothesis. Maybe the default status variable should be labelled "bad" or "not bad" rather than "good" or "bad".
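The "reject"/"fail to reject" verdict described above can be sketched with a simple hypothesis test. A minimal illustration, assuming made-up sample values, a known sigma and the usual 5% significance level (none of these come from any real engagement):

```python
# A sketch of the "guilty"/"not guilty" framing via a two-sided z-test.
# H0: the population mean equals mu0. The test can only "reject" or
# "fail to reject" H0 -- it never proves H0 true.
import math
import statistics

def z_test_mean(sample, mu0, sigma):
    """Two-sided z-test of H0: mean == mu0, with known sigma.
    Returns (z, verdict)."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    # 1.96 is the critical value for alpha = 0.05, two-sided
    if abs(z) > 1.96:
        return z, "reject H0"          # the "guilty" verdict
    return z, "fail to reject H0"      # "not guilty", NOT "H0 proven true"

# A sample whose mean sits close to the hypothesised value of 5.0
close = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.95, 5.05]
print(z_test_mean(close, mu0=5.0, sigma=0.15))
```

The verdict for the sample above is "fail to reject H0" --- in light of the evidence presented, the series is not guilty of deviating; it is never declared innocent.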

This is probably one of the toughest posts for me. I had to explain a hard-core statistical thought in layman's language. It took some time to edit this post and I believe it is a job well done. If you believe so, then kindly let me know. If not, then I will be happy to get into a discussion to explain or further simplify the matter.

Wednesday, October 31, 2012

KISS and make up!!!


Most times, while presenting analytics, I get queries such as "do you use neural networks?" or "do you use support vector machines?". Almost all the models I have built have used Linear Regression or Decision Trees. I have often found a good fitment of the predicted values to the observed values, whether it was for churn prediction, default prediction, offer uptake, next visit, next spend, etc. The accuracy (or classification rate) has ranged from 65% to 86% --- good enough considering that these models were not mission-critical, unlike, say, the actuarial tables of life insurers.

So in all cases, my response to these questions was "No". This often upsets the enquirer, and we then get into a debate on why I did not use those algorithms. At every point of the argument, I bring everyone back to the uplift curve or the classification matrix. Irrespective of the algorithm, if I have achieved an acceptable accuracy in my prediction, that should be the end of the argument. But, alas, it is seldom so.
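The classification rate I keep coming back to falls straight out of the classification matrix. A minimal sketch, with illustrative counts (not from any real model):

```python
# Classification rate (accuracy) from a 2x2 classification matrix.
# tp/fp/fn/tn = true positives, false positives, false negatives,
# true negatives. The counts below are made up for illustration.

def classification_rate(tp, fp, fn, tn):
    """Share of correctly classified cases: (TP + TN) / total."""
    return (tp + tn) / (tp + fp + fn + tn)

rate = classification_rate(tp=700, fp=120, fn=180, tn=1000)
print(f"{rate:.0%}")  # 1700 correct out of 2000 cases = 85%
```

Whether the scores behind the matrix came from a decision tree or a neural network does not change this number, which is the point of the argument.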

All algorithms end up generating a scoring logic that assigns scores on the observed variable. In reality, these scores are not very helpful on their own. Business owners want to know how the scores have been generated. This is where the simpler algorithms of regression and decision trees are very useful. The end result of the modelling exercise is a "human understandable" function such as:

For decision tree: if (age > 25) and (income < 15000), then (Y = 0.05)

For linear regression: Y = a + b(age) + c(income).

The Y denotes the score. These equations explain in plain human language the rationale behind the score. Since they are understandable, they also give additional insights into the drivers of the score. Thus, two customers having the same score may have different drivers behind it. For example, one may have the 'age' variable contributing significantly to the score, while for the other customer it is the 'income' variable. Try getting this insight from a "neural network" algorithm.
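The two "human understandable" functions above can be sketched directly in code. The coefficients, thresholds and customer values below are made up purely for illustration; the point is that a regression score decomposes into per-variable drivers:

```python
# Illustrative scoring functions in the shape the post describes.
# All numbers (leaf values, coefficients a, b, c) are invented.

def tree_score(age, income):
    """Decision-tree style rule from the post's example."""
    if age > 25 and income < 15000:
        return 0.05
    return 0.01  # fall-through leaf, purely illustrative

def regression_score(age, income, a=0.10, b=0.002, c=0.000001):
    """Linear regression Y = a + b*age + c*income,
    decomposed into the contribution of each variable."""
    drivers = {"intercept": a, "age": b * age, "income": c * income}
    return sum(drivers.values()), drivers

# Two customers with the same score but different drivers
y1, d1 = regression_score(age=60, income=20000)    # driven by age
y2, d2 = regression_score(age=20, income=100000)   # driven by income
print(round(y1, 3), d1)
print(round(y2, 3), d2)
```

Both customers score 0.24, yet the first score is driven by 'age' and the second by 'income' --- the kind of insight a black-box algorithm does not hand over.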

Another reason I go simple is that such models are often used to convince management to loosen the purse strings for an additional budget towards some activity. Maybe a new campaign, etc. The management team are good business people but often not statistical experts. The simplified functions are easy to explain to, and to be understood by, this team. Now try explaining a complex function and getting a budget sanctioned by the management team.

And finally, the adage Keep It Simple, Stupid (KISS) is so very useful. The objective is not how complex the algorithm is but how good the model fitment is. The more complex the model, the more time taken for data preparation, for understanding the output and for tinkering with the data to improve the uplift. And often the uplift of the complex algorithms over linear regression or decision trees is a few basis points. It may not be worth the time and effort involved.

Remember, I am not talking about mission-critical applications here. If I were building actuarial tables or modelling drug efficacy, then I would scout around for alternative algorithms and seek the best-fitting ones. But for marketing models, where the model's life is short and the window of opportunity is open for an even shorter time, it makes sense to keep it simple and run with the model. A 65% accurate model is better than no model at all. And a 90% accurate model achieved after the window of opportunity closes is of no use.

So the next time I am asked if I used a "neural network", I guess I will just KISS and make up with the interrogator.

Thursday, July 19, 2012

An Opportunity Lost


India has been pretty gung-ho about the Unique IDentification (UID) project. While the machinery has been churning out UID cards in the thousands, there has been a whole series of debates on the use of UID numbers. One of the uses is in the Government's subsidy management process. See the article in Mint on the Finance Minister launching a pilot of this scheme.

As per this scheme, a consumer of a government-subsidised item, such as a cooking gas cylinder, will have to pay the full market rate for the cylinder. The purchase activity will be passed on to the government agency, which will then credit the subsidy amount to the bank account of the consumer --- the bank account linked to the consumer's UID number.

India has a large number of items subsidised by the Government for its citizens. The public distribution system covers grocery items. Fertilizers are subsidised for farmers. Cooking gas cylinders are subsidised for retail/home consumption. When all consumers of these subsidised items are counted, the number may run to 70% or more of the Indian population. The government is expected to pay the subsidy amount into the bank accounts of these consumers.

Each of these consumers will need to have a bank account attached to the UID number. Knowing how critical this bank account will eventually become, the chances of consumers changing the bank account are minimal. Also, considering the cumbersome processes of government agencies, consumers will be deterred from changing this account.

So we now see the potential of a consumer sticking with a bank account for a long time, since his subsidy amount is credited to it --- an excellent case for increasing the stickiness of the consumer.

Unfortunately, I don't see any bank realising this potential. Banks cannot play a role in the UID registration process, since the agencies have already been appointed. But banks can definitely facilitate the UID registration process. They could source the forms, provide address proofs (attested bank statements) and also get appointment tokens for individuals. The catch --- the bank account registered against the UID number must be one opened in the same bank. For every customer who takes up this offer, the bank wins a highly persistent customer with a long lifetime with the bank.

Let's see which of the banks latches on to this idea. Till then, I wait for my appointment date for UID registration.

If you enjoy reading my posts, do click on the right to become a follower of the blog.

Wednesday, June 27, 2012

CRM Bloopers...

This post is on a lighter note. The software industry and the manufacturing industry are proud of their Quality Assurance processes. But it looks like CRM processes lag behind on this front.


Today I renewed my vehicle insurance policy. This was my first renewal, and I had had a year of no claims on the policy, so I enjoyed a "no claim bonus" on the renewal premium. After completing the payment of the renewal premium, I was presented with the usual "thank you" message from the insurer, Bajaj Allianz. While reading the message, I was very amused by the following sentence:

"As a loyal customer you are covered Additionally for Accidental Medical Expenses Cover/Drive Assure protect and 24x7 spot assistance for sum insured of Rs. 0."


So for renewing the policy, I am termed a "loyal" customer. Well, I am okay with that. I get additional services. Well, I am okay with that too. For a sum insured of Rs. 0. WHAT? Is that a benefit I should be happy about? I can make two guesses:

1. Either it is a typo, in which case I will wait for the detailed policy wordings document.

OR

2. The formula used for calculating the benefit resulted in a value of ZERO.

A typical case of a borderline defect, as they say in the software industry. A case for Quality Assurance.


On similar lines, a couple of months back, I was withdrawing some money at the ATM. After the transaction, I was presented with the following screen.



Note the options given. It was like holding a gun to my head and stating that I had to take the offer. There was no exit or refusal option. The first time it happened, I was perplexed about what to do, especially since my debit card was still in the ATM. I did not want the bank to call me, since my CA had already addressed all tax issues. While my mind was booting up to process the situation, the screen went away and the transaction terminated normally, with my transaction slip and card being returned to me. The next time I got this message, I knew I had to just wait a few seconds and eventually the message would go away. DEJA VU... I say.



If you enjoy my posts, click on the right to follow them. Also, I look forward to your comments and similar experiences.