In this article, we answer three questions: what a model really is, why we might want to build one, and how a computer can be taught something new. If you want to learn what machine learning is, be sure to read the previous article, "Machine learning for managers."
Goals of modeling
Forecasting
The first reason to take an interest in modeling is the need for prediction. We want the model to help us answer questions such as the following:
- Is a given email spam?
- What is the probability that a customer will repay a loan?
- How many units of a given product (for example, a book) will we sell?
- Will the customer abandon our company's services?
- Will the patient respond positively to therapy?
- How much is my apartment worth?
- Is a given financial transaction a fraud?
These problems fall into two types: regression (when the predicted characteristic is quantitative, such as the value of an apartment) and classification (when the predicted characteristic is qualitative, such as whether a loan is repaid). Let's focus for a moment on the second question, the probability of loan repayment. What are the advantages of constructing a model that can predict such a probability? The answer seems obvious, but to understand properly why we build models, let's think about it a little longer. First, let's establish a reference point: how could a bank lend if it did not have such a model? Suppose it hired an expert who, based on theory and experience, decided who was worth lending to. We could say that such an expert estimates the probability of repayment, so our model simply replaces the expert. In what ways, then, would the model be better?
- The model returns results automatically, usually in a fraction of a second, making it much faster than an expert.
- Apart from the cost of building the model, using it (i.e., making predictions for more people) usually involves no additional expense, so such a model can be much cheaper.
- Humans have limited memory and can take only a limited amount of information into account. We build the model using a computer, which does not have these limitations. If there is enough data, we may be able to account for subtle or multidimensional relationships that would elude an expert.
- A person's decisions may be affected by a variety of factors. For example, an expert may underestimate repayment probabilities for some people because of personal experiences that are not necessarily representative.
- Although an expert can estimate probabilities, we should not expect to be able to take them literally: a score of 60% does not necessarily mean that 60 out of 100 similar customers will repay the loan. Most likely, such a value can only be read as "a little better than fifty-fifty." Such customers can still be profitable for the bank, but to estimate that profit we need the actual probability, and a properly calibrated model can provide it.
In short, the model should be faster, cheaper and more accurate. Whether it actually will be depends on many factors, and we will discuss this topic in a future article.
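The calibration point from the list above can be checked with a few lines of code. The following is a minimal sketch, not a definitive method: the function name `calibration_table`, the number of bins, and all the data are assumptions made up for illustration. It groups customers by their predicted probability and compares the average prediction in each group with the share who actually repaid.

```python
def calibration_table(predicted, repaid, n_bins=5):
    """Compare predicted repayment probabilities with observed repayment rates.

    Returns one (mean predicted probability, observed rate, count) tuple
    per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, repaid):
        # min(...) keeps a prediction of exactly 1.0 inside the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))

    table = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        table.append((round(mean_predicted, 3), round(observed_rate, 3), len(members)))
    return table


# Invented data: 100 customers scored at 0.6, of whom 60 repaid (well calibrated),
# and 100 customers scored at 0.9, of whom only 70 repaid (overconfident).
predicted = [0.6] * 100 + [0.9] * 100
repaid = [1] * 60 + [0] * 40 + [1] * 70 + [0] * 30

table = calibration_table(predicted, repaid)
for mean_predicted, observed_rate, count in table:
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={count}")
```

In this made-up example the 60% group behaves as promised, while the 90% group repays only 70% of the time, so profit estimates based on the 90% figure would be too optimistic.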
Inference
Although prediction reigns supreme in the world of machine learning, it is not the only reason to build models. Moreover, if we look a little more broadly at data analysis, it is not necessarily the most important one. After all, we may be interested not only in whether a given customer will repay a loan, but also in why. What factors determine that certain people repay and others do not? Why is the therapy unlikely to work for a given patient? What factors increase the risk of disease X? To answer these questions, building a model will be helpful and perhaps even necessary, although, as we will see later, we will not necessarily use the same tools as when the goal is prediction.
At this point, one might object that no special tools, and certainly no model, are needed to find the factors affecting the trait of interest. Isn't it enough, for example, to count what percentage of women and what percentage of men repay their loans in order to decide whether gender affects the probability of repayment? Unfortunately, no, and we will explain why with a more illustrative example: earnings. The claim that men and women earn different amounts is widely discussed. On the Internet it is very easy to come across charts showing that in the United States, for example, women's earnings are about 80% of men's. Let's take this as a fact, but consider what conclusions we can draw from it. Does this result indicate discrimination? Do women earn less because they are women?
Statistics like the one above are usually calculated in a simple way: the median earnings of women are divided by the median earnings of men. But if men work more hours per week on average (which is a fact), then this should be taken into account. Further, we need to take into account that men tend to choose better-paying jobs, are more likely to ask for a raise, are more likely to move, and so on. Accounting for all these factors is necessary to test the hypothesis of a direct effect of gender on earnings, and this is possible if we use the right model. In the end, the adjusted earnings gap may turn out to be smaller, to disappear altogether, or even to point the other way.
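To make "taking a factor into account" concrete, here is a minimal sketch on invented data. Everything in it is an assumption made for illustration: two hypothetical groups, a single extra factor (weekly hours), and earnings constructed to depend on hours only. It says nothing about real earnings. The adjustment is done with ordinary least squares by first removing the part of each variable explained by hours and then relating the leftovers (the Frisch-Waugh-Lovell approach), which yields the same group coefficient as a regression of earnings on both group and hours.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """The part of y that a straight line in x does not explain."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def mean(values):
    return sum(values) / len(values)


# Invented data: group 1 works fewer hours; pay is 25 per hour for everyone.
hours = [38, 40, 42, 44, 30, 32, 34, 36]
group = [0, 0, 0, 0, 1, 1, 1, 1]
earnings = [25 * h for h in hours]

earnings_0 = [e for e, g in zip(earnings, group) if g == 0]
earnings_1 = [e for e, g in zip(earnings, group) if g == 1]
raw_gap = mean(earnings_1) - mean(earnings_0)
raw_ratio = mean(earnings_1) / mean(earnings_0)

# Adjusted gap: group coefficient after accounting for hours.
earnings_resid = residuals(hours, earnings)
group_resid = residuals(hours, group)
adjusted_gap = (
    sum(g * e for g, e in zip(group_resid, earnings_resid))
    / sum(g * g for g in group_resid)
)

print(f"raw ratio: {raw_ratio:.2f}, raw gap: {raw_gap:.0f}, adjusted gap: {adjusted_gap:.2f}")
```

Because the data was built so that pay depends on hours alone, the raw comparison shows group 1 earning roughly 80% of group 0, while the adjusted gap is zero. With real data the adjusted figure could come out smaller, larger, or with the opposite sign, which is exactly why the model is needed.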
In summary, we can distinguish two purposes of modeling: forecasting and inference. We may be interested in only one of them, although it can be risky to create models that cannot be interpreted. In what follows, however, we will focus mainly on the first objective, as it is the one most often associated with machine learning.
How to predict the future based on the past?
Machine learning has very good marketing and is often presented in a magical way. When we raise funds for a project, it's all well and good to say that we're "teaching a computer" or "making artificial intelligence," but what's behind that? How can a computer be taught something new? It's actually very simple, and I'll illustrate this with an example of one of the machine learning algorithms, the nearest neighbor method.
A new customer comes to the bank applying for a loan. We collect information about him: how much he earns, what he does for a living, his age, how many dependents he has, and so on. Then we search the historical data for a certain number (for example, 10) of the customers most similar to him (his nearest neighbors): people with similar salaries and ages, working in similar industries and having a similar (perhaps the same) number of dependents. Let's assume that nine of these customers repaid their loans and one did not. We can therefore estimate that our customer will also repay, with a probability of 90%. And with that, we have taught the computer something: to predict how a NEW customer will behave. Really? Surely it can't be that simple? And yet, in essence, it is.
In practice, there are a number of problems (and by no means small ones), such as choosing how many similar customers to look for, or defining the measure of similarity. Besides, most machine learning methods do not have as simple an interpretation as the one presented above. Nevertheless, the example shows that there is nothing magical about teaching a computer something new, and that it can be achieved in a simple way.
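The nearest neighbor procedure described above fits in a few lines. This is a minimal sketch under stated assumptions: the function name `knn_repayment_probability`, the three features (income in thousands, age in decades, number of dependents), the choice of Euclidean distance as the similarity measure, and every record in `history` are invented for illustration.

```python
import math


def knn_repayment_probability(new_customer, history, k=10):
    """Share of the k most similar past customers who repaid their loan."""
    # Similarity is measured here by Euclidean distance (an assumption);
    # the features are taken to be on roughly comparable scales already.
    by_distance = sorted(history, key=lambda record: math.dist(new_customer, record[0]))
    nearest = by_distance[:k]
    return sum(repaid for _, repaid in nearest) / len(nearest)


# Invented history: ((income in thousands, age in decades, dependents), repaid?)
history = [
    ((5.0, 3.5, 2), 1), ((5.1, 3.4, 2), 1), ((4.9, 3.6, 2), 1),
    ((5.2, 3.5, 1), 1), ((4.8, 3.3, 2), 1), ((5.0, 3.7, 3), 1),
    ((5.3, 3.6, 2), 1), ((4.7, 3.5, 2), 1), ((5.1, 3.2, 2), 1),
    ((4.9, 3.4, 2), 0),
    # Customers who are not similar to the new applicant:
    ((2.0, 2.2, 0), 0), ((2.5, 6.0, 0), 1), ((9.0, 5.0, 4), 1),
    ((1.8, 2.0, 5), 0), ((12.0, 4.5, 0), 1),
]

new_customer = (5.0, 3.5, 2)
probability = knn_repayment_probability(new_customer, history, k=10)
print(f"Estimated repayment probability: {probability:.0%}")
```

Nine of the ten nearest neighbors in this invented history repaid, so the estimate is 90%, as in the example in the text. Both open problems mentioned above are visible in the code: `k` is simply set to 10, and the distance treats one unit of income, age, and dependents as equally important.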
What is a model?
Finally, let's strip machine learning of its magic for good by taking a purely mathematical approach. Let's assume, as in the previous article, that our goal is to predict the value of Y based on certain characteristics X. More specifically, let's break X down into individual features: $X_1, X_2, \dots, X_p$. We want to find the relationship between them and Y or, to put it in the language of mathematics, a function. But let's take into account that Y is almost certainly not completely explained by the set of features we have available. Whether a given customer will repay a loan, for example, may depend on traits of his character that are not necessarily easy to measure. Let's denote this unknown information by $\varepsilon$. That is, when building a model, we look for the function f in the following equality:

$$Y = f(X_1, X_2, \dots, X_p) + \varepsilon$$
We can call this function a model, and machine learning provides us with the tools to find this function optimally.
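A small simulation can show what the unknown part of the equality means in practice. This is only a sketch: the "true" function `f`, the single feature, and the noise level are all invented, and in real problems f is of course not known. The point is that even predictions made with the true f miss Y by roughly the size of the unexplained part, while a cruder model misses by more.

```python
import random


def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 + 3.0 * x


def mean_squared_error(predictions, actual):
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)


random.seed(1)
xs = [random.uniform(0, 1) for _ in range(10_000)]
# The added noise plays the role of the unexplained part of Y
# (standard normal here, so its variance is 1).
ys = [f(x) + random.gauss(0, 1) for x in xs]

mean_y = sum(ys) / len(ys)
mse_true_f = mean_squared_error([f(x) for x in xs], ys)
mse_constant = mean_squared_error([mean_y] * len(ys), ys)  # ignores X entirely

print(f"error using the true f:     {mse_true_f:.2f}")
print(f"error ignoring the feature: {mse_constant:.2f}")
```

The first error stays close to the variance of the noise, which no model built on X alone can remove; the second is larger because it also ignores the part of Y that X does explain.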
To summarize the topics covered in this article: if the form of the function f is simple enough, we will be able to draw conclusions about the relationship between Y and the features X. If we are interested only in prediction, the function can be very complicated (a so-called black box), because all we care about is predicting values as close as possible to the true Y for a given set of features X. We can look at the various machine learning algorithms as different forms of the function f. Moreover, we can test these algorithms by simply applying them and observing whether the Y values they predict on historical data are close to the real ones. In this way, we can also choose the optimal number of nearest neighbors in the previously described algorithm. This is such an important topic that we will devote another section to it.
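Choosing the number of nearest neighbors by testing on historical data might look like the following. This is a sketch under assumptions, not the procedure the later section will describe: it holds back part of an invented history as a test set (one common approach, assumed here), and the two features, the rule that generates repayment, and the candidate values of k are all made up.

```python
import math
import random


def knn_predict(x, train, k):
    """Predict 1 (repaid) if at least half of the k nearest neighbors repaid."""
    nearest = sorted(train, key=lambda record: math.dist(x, record[0]))[:k]
    share = sum(label for _, label in nearest) / k
    return 1 if share >= 0.5 else 0


def holdout_accuracy(train, test, k):
    """Fraction of held-back customers whose outcome is predicted correctly."""
    hits = sum(knn_predict(x, train, k) == label for x, label in test)
    return hits / len(test)


# Invented history: repayment depends on income minus debt, plus noise.
random.seed(0)
data = []
for _ in range(300):
    income = random.uniform(0, 10)
    debt = random.uniform(0, 10)
    repaid = 1 if income - debt + random.gauss(0, 2) > 0 else 0
    data.append(((income, debt), repaid))

# The first 200 customers are used for finding neighbors, the last 100 for testing.
train, test = data[:200], data[200:]

candidate_ks = [1, 3, 5, 11, 21, 51]  # odd values avoid tied votes
accuracy_by_k = {k: holdout_accuracy(train, test, k) for k in candidate_ks}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)

for k, accuracy in accuracy_by_k.items():
    print(f"k={k:>2}  accuracy={accuracy:.2f}")
print(f"best k on the held-back data: {best_k}")
```

The same loop works for comparing entirely different algorithms: each one is just another candidate form of f, scored by how close its predictions are to what actually happened.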