Data Science and Development: Financial Inclusion
Tommie Thompson – Georgetown MPP 2018
According to a 2015 report by the UN’s International Telecommunication Union, two-thirds of global internet users come from the developing world. In some countries, like India, mobile phone use is as high as 75%. These trends are generating large amounts of data, which present a new opportunity for tech-savvy development practitioners. Governments and NGOs can use the data to make well-informed decisions and provide more effective services at lower cost. The question, however, is how to do this.
The applications depend on what we mean by “big data” and data science. As many have noted, “data science” is a poorly defined and malleable term. Annette Brown of R&E SEARCH for EVIDENCE, for instance, explains that data scientists often use buzzwords like “data analytics” and “big data” as flashy but superficial alternatives for “data analysis” and “large datasets.” She attempts to parse the ambiguous nomenclature and concludes that the overarching differences between data science and classical statistical methods lie in data structure and methods. That is, data science mines and analyzes data from unique sources. While that is true, I don’t think her definition hits the core of what data science is. If we look at what data scientists do, the main way they distinguish themselves from traditional statisticians is by downplaying theory. Data science, “big data,” and machine learning are mostly data-driven, not theory-driven. Consider a finding from data scientists at Target. By analyzing large volumes of purchase data, they found that a woman who uncharacteristically purchases unscented lotion is more likely to be pregnant. This was not a hypothesis Target’s statisticians tested – who would predict such a relationship? – but a pattern they observed just by exploring the data.
This is an approach that can make econometricians cringe. Throughout graduate school, we were taught that theory comes first and that quantitative models must always be checked against intuition. The emphasis is justified considering how easily one can wrangle data to get the results one wants; as the Nobel Prize-winning economist Ronald Coase said, “if you torture data enough, nature will confess.” Ignoring theory has become especially worrisome in the past few years, as fields like psychology and medicine face criticism for bad statistics and unreplicable studies. However, if social scientists let the theory pendulum swing too far, they will neglect a source of novel and practical insights. Data-driven approaches have value, and it’s important to know when they are appropriate.
One example of how econometricians and statisticians approach a problem differently from data scientists is the use of R². R² measures how well a regression model fits the data: the closer it is to 1, the more of the variation in the data points the model explains. The problem, however, is that we rarely have all the data the world offers (the population), and if we rely too much on R², we may build a model that explains our sample well but not the world we are trying to infer. For the things econometricians do, this caution makes sense. However, many data science applications seek to drive their models’ R² as high as possible. A high R² means the model will predict results well, and to many data scientists, prediction is what matters most. Data scientists can do this because (1) they have “big data” that occupies terabytes of space and (2) the methods they use, such as checking a model against data held out from fitting, can account for the risk of overfitting.
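To make the overfitting concern concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic (a straight line plus random noise), and the "memorizer" is a deliberately extreme model that simply looks up the closest training point. It stands in for any model tuned only to maximize in-sample R², and is not a method any particular data scientist is claimed to use.

```python
import random


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def make_data(n, rng):
    """Synthetic data (an assumption): y = 2x + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 3.0) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Extreme overfit: predict the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def compare(seed=0, n=200):
    """Return {model name: (in-sample R^2, held-out R^2)}."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n, rng)
    test_x, test_y = make_data(n, rng)  # fresh draw the models never saw
    results = {}
    for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
        model = fit(train_x, train_y)
        results[name] = (
            r_squared(train_y, [model(x) for x in train_x]),
            r_squared(test_y, [model(x) for x in test_x]),
        )
    return results


for name, (r2_in, r2_out) in compare().items():
    print(f"{name:10s} in-sample R2 = {r2_in:.3f}   held-out R2 = {r2_out:.3f}")
```

The memorizer reproduces its own sample perfectly, so its in-sample R² is 1, but its score drops on the fresh draw. Comparing the two columns is the basic held-out check that lets a modeler chase predictive fit without being fooled by the sample.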
Although most data science happens in the private sector, there are many untapped uses in international development. One of the more promising applications of “big data” is financial inclusion. Since many entrepreneurs and informal small businesses in least developed countries (LDCs) operate exclusively with cash, they are unable to develop a credit score or produce a record of their financial history. This makes it difficult for lending institutions to assess their creditworthiness and risk of default, so these institutions are often unwilling to lend. The standard development fix for this market failure is the non-profit microfinance institution (MFI), which is willing to lend to poor clients without collateral. However, the same problem persists. Without a way to rate borrowers, MFIs must charge high interest rates to cover the uncertainty and rely on external funding to stay afloat. This is a necessary compromise in providing capital to those in poverty, but not an ideal one.
Fortunately, the amount of data currently available to NGOs makes a more effective model possible. If poor borrowers cannot be evaluated with traditional methods due to a lack of financial information, then why not use other types of information? With machine learning techniques, MFIs can use variables (“features,” in data science parlance) derived from biographical, behavioral, and social information to predict a borrower’s probability of loan repayment. For instance, lenders can mine data from a borrower’s phone and social media to determine what type of person he or she is. Even counterintuitive and “absurd” features – like one’s preferred font color and number of saved cellphone contacts – might be highly predictive of financial behavior. In fact, it turns out that those who regularly use “thank you” in loan applications are less likely to repay loans! These predictions can be used to allocate limited funds more effectively, which will reduce overhead costs and bring interest rates down. More importantly, serious low-income entrepreneurs may gain greater access to the capital they need to climb out of poverty, since it will be easier to spot disingenuous borrowers in a way that is hard for them to game.
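As a rough illustration of what “predicting a borrower’s probability of loan repayment” from such features can look like, here is a minimal logistic regression sketch in plain Python. It is not any MFI’s actual scoring model. The borrowers are synthetic, the two features (a scaled count of saved contacts and a flag for writing “thank you”) are borrowed from the examples above, and the coefficients used to generate the data are invented purely so the example has a pattern to learn.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_borrowers(n, seed=0):
    """Invented data: ([scaled contacts, thank-you flag], repaid 0/1).

    The generating coefficients are assumptions for illustration only;
    they are not estimates from any real lending data.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        contacts = rng.random()                       # scaled to [0, 1)
        thank_you = 1.0 if rng.random() < 0.5 else 0.0
        p_repay = sigmoid(-0.5 + 3.0 * contacts - 1.5 * thank_you)
        repaid = 1 if rng.random() < p_repay else 0
        rows.append(([contacts, thank_you], repaid))
    return rows


def predict_proba(weights, bias, features):
    """Predicted probability that a borrower repays."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(rows, learning_rate=0.5, epochs=500):
    """Fit weights by batch gradient descent on the average log-loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, repaid in rows:
            error = predict_proba(weights, bias, features) - repaid
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


borrowers = make_synthetic_borrowers(600)
weights, bias = fit_logistic(borrowers)

# Score two hypothetical applicants: [scaled contacts, thank-you flag].
for applicant in ([0.9, 0.0], [0.1, 1.0]):
    print(applicant, round(predict_proba(weights, bias, applicant), 2))
```

A lender could rank applicants by the predicted probability or set a cutoff for approval. In practice the model would need far more features, real repayment histories, and validation on held-out borrowers before anyone relied on it.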
This is just one of the many ways big data can help development practitioners, and a few NGOs are already using this method. Other applications include using GIS data to optimize food provision routes, automating the matching of CVs from marginalized groups with jobs, and, as political scientist Chris Blattman has found, predicting criminal violence in unstable areas. With the right skillset and knowledge, these approaches are also more cost-effective and quicker than large RCTs. But RCTs and data science are not enemies; data science methods can also aid randomized experiments by providing a source of data on behavioral change following interventions. For instance, a de-radicalization intervention can be tracked against the number of radical Twitter posts in an area to see if it is effective. Of course, governments and practitioners will need to address issues such as data privacy and security, but the future looks good for fighting poverty with data.