Author: Kateryna Volkovska
With this post, I want to share some useful information and introduce you to a couple of Machine Learning (ML) concepts, as well as highlight the differences and similarities between ML and econometrics. I also aim to show how Data Mining (DM) and ML can be used by economists.
Basics
First, let's clarify the definitions:
- ML uses data to predict some variable as a function of other variables (focuses on computing a good prediction of y given the new values of x).
- Econometrics uses statistical methods for prediction and for inference about economic relationships.
In general, econometricians are thought to start with the theoretical model and then build a model that validates or invalidates the theory. Machine learners always start from data.
Machine learning techniques (such as decision trees, support vector machines (SVM), neural networks and deep learning) allow for more effective ways to model complex economic relationships.
Table 1. The comparison of aims of Econometrics, DM and ML.
Econometrics | Machine Learning | Data Mining |
prediction | prediction | summarization |
summarization | extract info from data | finding patterns |
estimation | visualization | |
hypothesis testing | data manipulation | |
extract info from data | | |
So the main difference is that ML, for the most part, deals with pure prediction, while econometrics cares more about causal inference.
Predict & Classify
When an econometrician faces a prediction problem, he or she usually employs linear or logit regression. ML, however, offers further methods, several of them nonlinear, that can be more useful for big data sets. Here are some of them:
- Regression trees;
- Random forest;
- Least absolute shrinkage and selection operator (LASSO, a regression analysis method);
- Least-angle regression.
What economists call poor “out-of-sample prediction”, machine learners call “overfitting”. A common difficulty for both is the classification problem. While an econometrician usually uses logit or probit in this case, ML suggests using decision trees to classify observations, which can lead to good out-of-sample predictions (in the literature you will find the abbreviation “CART”, classification and regression trees). A useful feature of decision trees is that they capture non-linearity in the data, while a plain logistic regression does not. Hence, the ML tool often does better when the relationship is non-linear. Another cool thing about ML is that it prefers averaging over many small models, which often gives better out-of-sample predictions than choosing a single model.
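To make the point about non-linearity a bit more concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for illustration: the simulated data (y = 1 when |x| > 1), the helper names `fit_tree` and `fit_logit`, and the tuning values. It is not taken from any of the cited papers, and in practice you would probably use a library such as scikit-learn instead of hand-written code.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the illustration is reproducible

def make_data(n):
    """Simulated data (an assumption): y = 1 when |x| > 1, else 0."""
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [1 if abs(x) > 1.0 else 0 for x in xs]
    return xs, ys

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_tree(xs, ys, depth):
    """Grow a small classification tree on one feature with greedy Gini splits."""
    majority = 1 if 2 * sum(ys) >= len(ys) else 0
    if depth == 0 or gini(ys) == 0.0:
        return majority  # leaf: predict the majority class
    best = None
    for t in sorted(set(xs))[1:]:  # every split leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    if best is None:
        return majority
    t = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < t]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t,
            fit_tree([x for x, _ in lo], [y for _, y in lo], depth - 1),
            fit_tree([x for x, _ in hi], [y for _, y in hi], depth - 1))

def predict_tree(node, x):
    """Walk from the root down to a leaf and return its class."""
    while isinstance(node, tuple):
        t, left, right = node
        node = left if x < t else right
    return node

def fit_logit(xs, ys, lr=0.1, steps=500):
    """Logistic regression p = 1 / (1 + exp(-(a + b*x))) by gradient descent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def accuracy(predictions, ys):
    """Share of predictions that match the true labels."""
    return sum(int(p == y) for p, y in zip(predictions, ys)) / len(ys)

train_x, train_y = make_data(300)
test_x, test_y = make_data(300)  # held-out data for out-of-sample accuracy

tree = fit_tree(train_x, train_y, depth=2)
a, b = fit_logit(train_x, train_y)

tree_acc = accuracy([predict_tree(tree, x) for x in test_x], test_y)
logit_acc = accuracy([1 if a + b * x > 0 else 0 for x in test_x], test_y)
print(f"tree: {tree_acc:.2f}, logit: {logit_acc:.2f}")
```

On data like this, the tree should classify almost all held-out observations correctly, while the logit model can do little better than always predicting the majority class, because a single monotone function of x cannot describe a rule that depends on |x|. With a different data-generating process (for example, a truly linear index) the ranking could easily be reversed.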
Data Structures and Dimensionality Reduction
So what are the differences between the data structures most commonly used in ML and in econometrics? Firstly, econometricians usually deal with time-series and panel data, while machine learners prefer cross-sectional data with independent and identically distributed observations. However, for time series ML offers a method called Bayesian structural time series (BSTS), which is designed to handle variable selection problems in time series applications.
I think all of you have heard about Principal Component Analysis (PCA) for dimensionality reduction of data. In fact, it is an ML method, but it is widely used by econometricians and mathematicians. I also used it in my Bachelor thesis to analyze which factors most influence the costs of insurance companies in the USA.
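For readers who have not seen PCA in action, the idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two simulated variables, the helper names `covariance_matrix` and `first_component`, and the use of power iteration to find the leading eigenvector are my choices here, not the method from the thesis mentioned above. Real projects would normally rely on a tested library routine.

```python
import math
import random

rng = random.Random(1)  # fixed seed for reproducibility

# Simulated data (an assumption): the second variable is roughly twice the first.
rows = []
for _ in range(200):
    t = rng.uniform(-1.0, 1.0)
    rows.append([t, 2.0 * t + rng.gauss(0.0, 0.1)])

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iters=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue

cov = covariance_matrix(rows)
pc1, variance_pc1 = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_share = variance_pc1 / total_variance

# Project each centered observation onto the first component: 2 columns become 1.
means = [sum(r[j] for r in rows) / len(rows) for j in range(2)]
scores = [sum((r[j] - means[j]) * pc1[j] for j in range(2)) for r in rows]
print(pc1, round(explained_share, 3))
```

Because the two simulated variables move almost in lockstep, the first component should point roughly in the direction (1, 2) and carry nearly all of the variance, so the single column `scores` can stand in for both original variables.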
Regression Everywhere
Another common tool for machine learning specialists and econometricians is regression analysis. Its primary goal is to understand, as far as possible with the available data, how the conditional distribution of the response y varies across subpopulations determined by the possible values of the predictor or predictors (Cook and Weisberg (1999)). In this post I want to draw your attention to the following economic example, which illustrates methods for variable selection in the context of growth regressions (Varian 2014).
In the example, Varian uses the dataset from Sala-i-Martin (1997), covering 72 countries and 42 variables, to determine the most important variables for economic growth. Sala-i-Martin (1997) computed all possible subsets of regressors and used the results to construct a measure called CDF(0). In the table below you can see the variables that have the highest CDF(0) and are therefore the most useful in explaining economic growth according to Sala-i-Martin (1997). For the same problem, Ley and Steel (2009) used Bayesian model averaging, LASSO and spike-and-slab regressions (the latter is also a Bayesian technique). In the table, the LASSO column shows the ordinal importance of each variable, or a dash if the variable was not included in the chosen model. The other columns show the posterior probability of inclusion in the model.
Table 2. Comparing Variable Selection Algorithms: Which Variables Appeared as Important Predictors of Economic Growth?
Source: Ley and Steel (2009), data from Sala-i-Martin (1997).
These methods are efficient and useful for economic research when you face the problem of determining which variables are most important for a particular model.
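To give a feeling for how LASSO selects variables, here is a small sketch in plain Python. Please read it with care: the data are simulated (not the Sala-i-Martin dataset), the penalty value `lam=0.3` is an arbitrary choice, and the function names `soft_threshold` and `lasso` are hypothetical helpers written for this post. The sketch minimizes (1/(2n))·||y − Xβ||² + λ·||β||₁ by coordinate descent, without an intercept, which is reasonable here only because the simulated variables have mean zero.

```python
import random

rng = random.Random(2)  # fixed seed for reproducibility

# Simulated data (an assumption): 5 candidate regressors, only x0 and x2 matter.
n, p = 200, 5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[2] + rng.gauss(0.0, 0.5) for row in X]

def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; return exactly 0 if |rho| <= lam."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, sweeps=100):
    """LASSO coefficients by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X*beta, which equals y while beta is zero
    for _ in range(sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            # Covariance of column j with the residual that excludes its own effect.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            resid = [r - v * delta for v, r in zip(col, resid)]
            beta[j] = new
    return beta

beta = lasso(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("coefficients:", [round(b, 2) for b in beta])
print("selected variables:", selected)
```

The penalty sets the coefficients of weak predictors exactly to zero, which is what makes LASSO usable as a variable selection tool. The price is that the surviving coefficients are shrunk towards zero, so they should not be read as unbiased estimates of the true effects.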
Must-have Software
So what about software and programming languages? For ML it is definitely R and Python (check the packages “scikit-learn” and “statsmodels”). For econometrics, R, Stata and EViews are the best. The last two are commercial statistical packages and are not free, so, in my opinion, R is the most suitable for both purposes. For those who are interested, I highly recommend the book “An Introduction to Statistical Learning” by Gareth James and co-authors (https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/dp/1461471370).
Conclusion and Further Inspiration
We can see that econometrics and ML are very closely related. However, I consider econometrics a subpart of ML. Other important applications of ML include:
- Computer vision;
- Speech recognition (e.g. Siri and “OK Google”, which you all know);
- Artificial intelligence (check out the game “Just Dance” :D);
- Tests to avoid overfitting;
- Nonlinear estimations;
- Model averaging;
- Tools for manipulating big data (SQL, NoSQL databases);
- Computational Bayesian methods.
I am convinced that ML tools should be more widely known among young economists and researchers. I hope this post has given you some interesting ideas about what to learn in order to grow in your future career.
I also want to share with you this beautiful mind map, which provides a great overview of machine learning techniques: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/
Grab a blanket and a cup of tea and start watching the Data Mining video lectures by Jeff Leek (https://www.youtube.com/user/jtleek2007)
I would be very happy for your feedback and comments. Let's share ideas!
Happy blogging!:)
While I like your short overview about the methods of machine learning, I strongly disagree with your summary about what the econometric community can learn from the ML community. Except for the point of "Tools for manipulating big data (SQL, NoSQL databases)" the points exist already in the econometric community, even without the existence of the ML community. These techniques are just not or barely taught in Tartu, but with a look into a standard econometric book e.g. the book of Greene, you will find these techniques easily. Tools for manipulating big data might be useful to know, but they are not directly related to the database type (Relational/SQL or Non-relational/NoSQL).
Instead of learning or diving into machine learning, I would rather suggest getting a better understanding of the existing econometric techniques and learning the methods needed in your field of interest. After that, you can still look for useful pieces in the ML literature.
Sven-Kristjan Bormann
PhD student, University of Tartu
Dear Sven-Kristjan Bormann,
Thank you for your feedback and suggestion!
However, this is a blog post for students, which gives some hints and motivation to pay attention to specific topics. I am not denying the existence of these algorithms in econometrics books, but I found that methods of non-linear estimation, such as the gradient descent algorithm or the Gauss-Newton algorithm, are more typical of ML than of econometrics, and they might still be useful.
So please do not discourage students from learning ML :)
With kind regards,
Kateryna