Predicting Poverty Using Geospatial Data in Thailand

Poverty statistics are conventionally compiled using data from household income and expenditure surveys or living standards surveys. This study examines an alternative approach to estimating poverty by investigating whether readily available geospatial data can accurately predict the spatial distribution of poverty in Thailand. In particular, the geospatial data examined in this study include night light intensity, land cover, vegetation index, land surface temperature, built-up areas, and points of interest. The study also compares the predictive performance of various econometric and machine learning methods such as generalized least squares, neural network, random forest, and support vector regression. Results suggest that the intensity of night lights and other variables that approximate population density are highly associated with the proportion of an area’s population living in poverty. The random forest technique yielded the highest level of prediction accuracy among the methods considered in this study, perhaps due to its capability to fit complex association structures even with small and medium-sized datasets. Moving forward, additional studies are needed to investigate whether the relationships observed here remain stable over time and, therefore, may be used to approximate the prevalence of poverty for years when household surveys on income and expenditures are not conducted but data on geospatial correlates of poverty are available.


INTRODUCTION
Over the past 3 decades, real gross domestic product per capita of Thailand has grown more than twofold. The country's economic growth has been accompanied by declining household poverty rates, which dropped from 61.41% in 1988 to 5.04% in 2019 (Figure 1). However, significant pockets of poverty still exist in parts of the country. For instance, about 6.76% of households living in rural areas are still considered poor (NSO 2020). Furthermore, the coronavirus disease (COVID-19) pandemic may undermine some of the country's gains in poverty reduction. The concentration of poverty in rural areas is possibly driven by Bangkok's high agglomeration force and the fact that most economic activities are concentrated in Bangkok and its suburbs (OECD 2018). Since rural provinces have a limited variety of economic activities, they are constrained in creating nonagricultural jobs. Trends in nonpecuniary indicators of development are also concerning (Figure 2). For instance, half of the country's working population is still in precarious employment. There is also ample room for improvement in the education sector, as rural migrants and the urban poor generally lack the skills demanded by modern jobs.
Given that poverty is still prevalent in select areas of Thailand, particularly rural areas, poverty monitoring remains an important task for the country's development practitioners. At present, the National Economic and Social Development Council and the National Statistical Office are the government agencies responsible for compiling poverty statistics in Thailand. In particular, these agencies rely on the Household Socio-economic Survey (HSES), in which household income is surveyed every two years. In addition, HSES results are compiled at the national and provincial levels only, as the sample size of the survey is not large enough to yield reliable estimates below the provincial level, yet demand for granular data on poverty and other socioeconomic indicators continues to expand.
Given the high cost of increasing the frequency and scope of such surveys, an option which government agencies with limited resources may be unable to sustain in the long run, a number of initiatives within Thailand and elsewhere are exploring alternative data collection strategies which entail tapping other data sources. For instance, small area estimation techniques which combine survey with census and other types of administrative data have been widely used to facilitate estimation at levels more granular than what working with surveys alone can afford. More recently, efforts to use innovative data coming from call detail records, social media data, digital transactions, and remote sensing for compilation of development statistics are expanding too.
Mapping the spatial distribution of poverty is an area that could greatly benefit from the integration of multiple data sources. In this context, two types of analytical frameworks are worth pointing out. First, by capitalizing on ongoing developments in computer vision techniques and satellite imagery, several researchers have shown that it is feasible to develop an algorithm that can automatically predict survey-based estimates of poverty with satisfactory levels of accuracy (Jean et al. 2016; Hofer et al. 2020; ADB 2020; Piaggesi et al. 2019). Such an approach is quite attractive in instances where collecting survey data is onerous, particularly in remote or hard-to-reach areas, and no other types of supplementary data are readily available. However, since the features extracted by computer vision techniques are relatively abstract (ADB 2020), it is difficult to pinpoint manually which features the computer is picking up when predicting poverty. Consequently, it is also difficult to validate what could have triggered an unexpectedly low or high estimate of poverty, if such instances arise. Second, if structured geospatial data are readily available, one can develop a more tractable econometric model for predicting poverty. Whereas the first approach lets the computer extract abstract features or patterns from satellite images that are potentially correlated with poverty, the second approach leverages quantitative data that have already been precompiled. In principle, the first approach can draw on a vast number of potential predictors of poverty, as a computer can directly extract numerous features from satellite images, whereas the second approach is limited to structured precompiled data. However, it is presumably easier to validate the results of the second approach since the data used for predicting poverty are more structured and tractable.
This study explores the second approach, in which poverty is predicted by identifying correlates from precompiled geospatial data. It contributes to the existing literature by assessing whether it is feasible to develop a model with satisfactory predictive performance while depending solely on precompiled geospatial datasets, which in theory represent just a fraction of the covariates that the first approach can potentially generate. This question has not been explored thoroughly in previous studies in the context of Thailand. To some extent, this study may be considered a follow-up to the ADB (2020) study, which examined the first approach of using geospatial data to predict poverty. However, there are slight differences in research objectives. The main objective of the ADB (2020) study is to examine the feasibility of providing poverty estimates that are more granular than government-published estimates by applying artificial intelligence to data from surveys, censuses, and satellite imagery. In contrast, our focus here is to examine whether we can develop a reasonably good poverty prediction model even if we limit ourselves to covariates from readily available or precompiled geospatial datasets. Furthermore, this study briefly compares the performance of different machine learning techniques, a topic that has not been well explored in previous studies of poverty estimation using nontraditional data sources. By doing so, the study aims to contribute to the literature that explores other cost-effective methods of predicting poverty by integrating innovative data with surveys and register-based data. This, in turn, could provide rich inputs as relevant government agencies aim to meet the growing data requirements of economic planners and policymakers.
The rest of this paper is structured as follows. The second section reviews related literature while the third and fourth sections introduce the data and research methodologies, respectively. The fifth section presents the key findings of the econometric and machine learning methods adopted in this study. The last section summarizes lessons learned and draws brief recommendations for future studies.

Using Precompiled Geospatial Data for Predicting Socioeconomic Indicators
Li et al. (2013) and Li, Zhao, and Li (2016) found statistically significant relationships between the density of night lights and various ground data such as gross domestic product, electricity consumption, inequality, and infant mortality rate.
In addition to night lights intensity, the Landsat, National Oceanic and Atmospheric Administration Polar Orbiting Environmental Satellites, and Terra-Moderate Resolution Imaging Spectroradiometer satellites have been scanning the Earth's surface with multi-spectrum sensors. These multi-spectrum data have been used by various researchers to compile a number of geospatial indicators such as building density, water coverage, Normalized Difference Vegetation Index (NDVI), Land Surface Temperature (LST), Normalized Difference Water Index, Normalized Difference Snow Index, Normalized Difference Soil Index, and Normalized Difference Built-up Index. Specifically, NDVI represents the spatio-temporal pattern of forest and cultivated areas and is one of the conventional indices commonly used in remote-sensing analysis of vegetation. NDVI is calculated as the difference between near-infrared reflectance (which vegetation reflects) and red reflectance (which vegetation absorbs), divided by their sum. The studies of Sun et al. (2010), Li et al. (2015), and Jin et al. (2008) demonstrated a correlation between urban expansion and decreasing NDVI. Similarly, the research of Kristjanson et al. (2005), Bhattacharya and Innes (2006), Morikawa (2014), and Aburas et al. (2015) showed a statistical relationship between NDVI and the spatial distribution of income inequality.
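As a minimal sketch of the NDVI calculation described above, the function below applies the standard normalized-difference formula to a single pixel. The reflectance values are illustrative only and are not taken from the study's data; returning zero for a pixel with no signal is a convention assumed here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are surface reflectances, assumed to lie in [0, 1].
    """
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / total


# Illustrative reflectances (hypothetical, not from the study).
dense_vegetation = ndvi(nir=0.50, red=0.08)   # strong near-infrared reflection
built_up_surface = ndvi(nir=0.25, red=0.20)   # similar reflection in both bands
```

Values close to 1 indicate dense vegetation, while values near zero are typical of bare or built-up surfaces, which is consistent with the link between urban expansion and decreasing NDVI noted above.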
Data on land surface temperature is another type of precompiled geospatial information which researchers are using to predict income. Gilmont et al. (2018) also found a statistically significant correlation between rainfall and income, human capital, and economic activity in developing countries. In addition, Leroux et al. (2016) and Sruthi and Aslam (2015) documented the formulation of forecast models that use both temperature and NDVI to predict drought and, in turn, to forecast the loss of agricultural output and its effect on farmers' incomes.
Efforts to crowdsource geospatial data are also expanding. A good example is OpenStreetMap (OSM), a collaborative project producing a crowdsourced geographic database and one of the major platforms promoting the use of geospatial data in global humanitarian action and community development. The OSM database also features other types of geospatial data, such as the presence of roads, rivers, built-up areas, and points of interest (POI), enabling investigation of the association between geographical characteristics and socioeconomic conditions. Studies such as those by Hu et al. (2016), Ye et al. (2019), and Deng et al. (2019) demonstrate that OSM can provide details of the spatial distribution of population and economic activities.

Poverty Mapping in Thailand
As mentioned earlier, official poverty statistics in Thailand are based on the HSES, which provides reliable estimates from the national down to the provincial level. However, recognizing the importance of more geographically disaggregated poverty data as inputs for policy targeting, the National Statistical Office (NSO) of Thailand started compiling small area (tambon or subdistrict level) poverty estimates in 2003 in collaboration with other development partners such as the then National Economic and Social Development Board (NESDB), the Thailand Development Research Institute (TDRI), and the World Bank. Since then, small area poverty estimates in the country have been compiled for the following years: 2005, 2007, 2008, 2011, 2012, 2015, and 2017. The outputs in 2003 and 2005 were jointly prepared by three local institutions, namely NESDB, NSO, and TDRI, with technical advice from the World Bank. In 2015, the World Bank provided further technical assistance to NSO to build the capacity of more NSO staff to implement small area estimation. Additional technical details on the process of compiling poverty maps are documented by Jitsuchon (2004), Healy and Jitsuchon (2007), and Jitsuchon and Richter (2007).
However, despite the availability of analytical tools for compiling granular estimates of poverty, it is important to identify alternative methods because the conventional poverty mapping technique relies heavily on the availability of census data. For instance, since censuses are usually conducted only every 5 to 10 years, poverty mapping models that use covariates derived from a census rest on restrictively strong assumptions (Bedi, Coudouel, and Simler 2007).
In a recently published study, researchers from ADB extended the conventional small area poverty estimation framework by tapping geospatial data extracted from daytime and nighttime imagery through machine learning algorithms to create granular poverty maps of the Philippines and Thailand (ADB 2020; Hofer et al. 2020). The adopted method was inspired specifically by Jean et al. (2016), which was further used or enhanced in subsequent studies (e.g., Babenko et al. 2017; Tingzon et al. 2019; Heitmann and Buri 2019; Yeh et al. 2020). These studies fall under the strand of literature that broadly aims to explore applications of artificial intelligence and computer vision techniques for estimating poverty. However, as hinted earlier, this methodology has several technical issues. First, validating aberrant or unexpected predictions is challenging because the features used as correlates of poverty are abstract. Second, instead of directly predicting poverty, the method employs an intermediate step wherein an algorithm is first trained to predict the intensity of night lights. The intermediate step is necessary in this context because sources of night light data, particularly satellite imagery, are readily accessible and can cost-effectively provide large volumes of labelled images on which to train a computer vision algorithm. This cannot be easily done if we were to predict poverty outright, since readily available poverty data are not very granular. Using data on night lights as a proxy for poverty during the intermediate step is arguably valid if it is assumed that places that are brighter at night are less poor than places that are less well lit. However, if there are places that are equally lit but show varying levels of poverty on the ground, such an intermediate step could lead to a loss of vital information by not predicting poverty outright.
This study contributes to the existing literature on poverty measurement in Thailand by developing a prediction model whose correlates were derived from precompiled geospatial data. By doing so, we aim to assess whether it is feasible to develop a model with satisfactory predictive performance even if we depend solely on precompiled geospatial dataset(s) instead of applying computer vision techniques to automatically extract satellite image features that are potentially correlated with poverty, a feat that has not been thoroughly explored in previous studies.
Note: In 2019, NESDB was renamed the National Economic and Social Development Council. TDRI is an independent think tank foundation.

Google Earth Engine
Google Earth Engine (GEE) is an open cloud-based data storage and computing platform which provides access to satellite imagery for free. In this study, we extracted the following information from GEE:
• Rainfall data from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)
• Land Surface Temperature (LST)
• Normalized Difference Vegetation Index (NDVI). NDVI is widely used as an indicator representing the land cover of forest and agricultural activity.
• Intensity of Night Lights
Table 1 summarizes the range of data that can be obtained from GEE.

Global Urban Footprint
The Global Urban Footprint project by the German Remote Sensing Data Center of the German Aerospace Center compiles geocoded data which identify urban areas, land surface, and water bodies. Geocoded data on built-up and nonbuilt-up areas are also available from the Global Urban Footprint.

Global Human Settlement Layer
Mainly supported and supervised by the Directorate General Joint Research Centre of the European Commission, the Global Human Settlement Layer project has produced a fully open and free geospatial dataset. The resulting database provides evidence on, and broader insight into, the global human presence.

United States Geological Survey
This geospatial dataset was generated from the 10-year (2001-2010) collection of Terra-Moderate Resolution Imaging Spectroradiometer-based Global Land Cover maps (MCD12Q1 land cover type data). Each pixel is assigned one of 16 land cover classes, chosen as the class with the highest confidence during 2001-2010, as described in Broxton et al. (2014).

European Space Agency Land Cover
Initially, the main objective of the European Space Agency (ESA)'s Climate Change Initiative is to produce an accurate land-cover classification that can serve the climate modeling community. This project has developed the Essential Climate Variable spatial dataset based on the extensive archives of remote-sensing data (ESA 2017). The database covers time series from 1992 to 2017 and contains 38 land cover classes, which are based on the United Nations Land Cover Classification System.

OpenStreetMap
OpenStreetMap features crowdsourced data on the locations of infrastructure, human settlements, and economic activities. In this study, we extracted the following information from OSM: road count, road length, POI, and built-up area. We categorized POIs into 16 types based on their economic activity, matched to the official classification of 16 production and service sectors published by the NESDB.
Tables A.1 and A.2 of the Appendix provide the list of variables obtained from geospatial data of 2015 and 2017, respectively.

Income-Based Poverty
As mentioned earlier, poverty mapping is a regular initiative conducted by the Thai government. In this study, the ratio of the population living below the national poverty line to the total population in each tambon (i.e., subdistrict) is used as one of the dependent variables in our computations.

Multidimensional Poverty
As an alternative metric of poverty, the Office of the National Economic and Social Development Council and the National Electronics and Computer Technology Center have also compiled statistics on the prevalence of multidimensional poverty since 2017. The data are based on:
(i) census-based Basic Minimum Need data, supervised by the Community Development Department, Ministry of Interior, which cover approximately 36 million people; and
(ii) a register-based data source of approximately 11.4 million individuals gathered by the Ministry of Finance through the national welfare card program.
The criteria used in identifying a multidimensionally poor person are inspired by the Multidimensional Poverty Index method developed by the Oxford Poverty and Human Development Initiative and the United Nations Development Programme.

Reference Period
Our target reference period coincides with the two most recent years for which tambon-level estimates of poverty in Thailand are available: 2015 and 2017 for income poverty, and 2017 for the multidimensional poverty index.

IV. METHODS
In this study, we consider the generalized least squares (GLS) method and three widely used machine learning algorithms: neural network, random forest, and support vector regression (SVR). To assess whether the models have satisfactory generalization performance, 50% of the data were allocated for training while the remaining 50% constituted the validation set. Based on this allocation, we resampled the data 100 times. The values of the metrics used to compare the methods are averages across these 100 resampled datasets.
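The validation scheme above can be sketched as a repeated random 50/50 split with the validation RMSE averaged over 100 repetitions. The study implemented its models in R; the Python sketch below is only an illustration in which a one-variable linear fit (`fit_line`, a hypothetical stand-in) replaces the actual models, and the data are synthetic.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def repeated_holdout_rmse(xs, ys, n_repeats=100, train_share=0.5, seed=0):
    """Average validation RMSE over repeated random train/validation splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_train = int(len(indices) * train_share)
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, valid = indices[:n_train], indices[n_train:]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        predictions = [intercept + slope * xs[i] for i in valid]
        scores.append(rmse([ys[i] for i in valid], predictions))
    return sum(scores) / len(scores)


# Synthetic data: a poverty rate that falls with night-light intensity, plus noise.
data_rng = random.Random(42)
lights = [data_rng.uniform(0, 10) for _ in range(200)]
poverty = [30 - 2 * x + data_rng.gauss(0, 3) for x in lights]
mean_rmse = repeated_holdout_rmse(lights, poverty)
```

Averaging over many splits reduces the dependence of the comparison on any single partition of the data, which matters when the number of observations is modest.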

Generalized Least Squares
GLS is a modification of ordinary least squares (OLS) that relaxes the assumption that the error variance of an observation is constant regardless of the explanatory variables associated with it.
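To illustrate how GLS differs from OLS, the sketch below fits a line when the error covariance matrix is assumed to be diagonal. In that special case the GLS estimator, (X' Omega^-1 X)^-1 X' Omega^-1 y, reduces to weighted least squares with weights equal to the inverse variances. The variances are treated as known here, which is an assumption made for illustration; in practice they usually have to be estimated. The data are hypothetical.

```python
def gls_line(xs, ys, variances):
    """GLS fit of y = a + b * x with a diagonal error covariance matrix.

    With Omega = diag(variances), GLS is weighted least squares with
    weights 1 / variance. Returns (intercept, slope).
    """
    weights = [1.0 / v for v in variances]
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Solve the 2x2 weighted normal equations by Cramer's rule.
    det = sw * swxx - swx ** 2
    slope = (sw * swxy - swx * swy) / det
    intercept = (swxx * swy - swx * swxy) / det
    return intercept, slope


# Hypothetical data: three points on y = 1 + 2x and one very noisy observation.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 20.0]

ols_fit = gls_line(xs, ys, [1.0, 1.0, 1.0, 1.0])   # equal variances: same as OLS
gls_fit = gls_line(xs, ys, [1.0, 1.0, 1.0, 1e6])   # noisy point is down-weighted
```

With equal variances the function reproduces the OLS fit, which is pulled toward the noisy fourth point. Assigning that point a large variance lets the fit stay close to the line through the other three.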

Neural Network
A neural network is an example of a machine-learning model inspired by the biological neural network that constitutes the human brain. As with other types of machine-learning models, a neural network can learn to perform different tasks without being explicitly programmed to do so (ADB 2020).
Structurally, a neural network is composed of numerous nodes and edges. A node can be a variable or a mathematical function, and nodes are connected by edges. The nodes combine to form different layers within the neural network. The input layer takes in the raw data. In the hidden layers, each node or neuron serves as a filter and is activated each time it detects a specific pattern or feature. The output layer organizes the identified features into an appropriate category or, in a regression setting, a predicted value. The best way to represent these connections is through computational graphs, as shown in Figure 3 (ADB 2020).

Figure 3: Illustration of a Sample Neural Network
Source: Graphics generated by the study team.
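The layered structure described above can be sketched as a forward pass through one hidden layer. The weights below are hypothetical and chosen by hand rather than trained, and the ReLU activation is an assumption made for illustration; the source does not specify the architecture used in the study.

```python
def relu(value):
    """Activation: a node 'fires' only when its weighted input is positive."""
    return max(0.0, value)


def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: each node is a weighted sum of all inputs."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs


def forward(features):
    """Input layer -> hidden layer (2 nodes, ReLU) -> output layer (1 node)."""
    # Hypothetical, hand-picked weights (not trained on any data).
    hidden = dense(features,
                   weights=[[0.5, -1.0], [-0.5, 1.0]],
                   biases=[0.0, 0.0],
                   activation=relu)
    return dense(hidden, weights=[[1.0, 2.0]], biases=[0.1])[0]


prediction = forward([2.0, 0.5])
```

Training a network consists of adjusting these weights and biases so that the outputs match observed values as closely as possible; that step is omitted here.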

Random Forest
Random forests are an ensemble method based on decision trees, with each tree built on a random subset of the training data and a random subset of the independent variables. They can perform classification and prediction-related tasks, and by averaging across trees they can improve a model's predictive accuracy and control overfitting.
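A minimal from-scratch sketch of this idea follows. It grows each regression tree on a bootstrap sample and, as in the common variant of the algorithm, draws a random subset of variables at each split; the forest prediction is the average over trees. The study used an R package, so this is not the authors' implementation, and the data and parameter values are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def build_tree(rows, targets, rng, n_try, depth=0, max_depth=3, min_leaf=5):
    """Grow a regression tree; each split considers n_try randomly chosen features."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return mean(targets)  # leaf: predict the average target
    best = None
    for feature in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            left_mean = mean([targets[i] for i in left])
            right_mean = mean([targets[i] for i in right])
            sse = (sum((targets[i] - left_mean) ** 2 for i in left)
                   + sum((targets[i] - right_mean) ** 2 for i in right))
            if best is None or sse < best[0]:
                best = (sse, feature, threshold, left, right)
    if best is None:
        return mean(targets)
    _, feature, threshold, left, right = best
    children = [
        build_tree([rows[i] for i in side], [targets[i] for i in side],
                   rng, n_try, depth + 1, max_depth, min_leaf)
        for side in (left, right)
    ]
    return {"feature": feature, "threshold": threshold,
            "left": children[0], "right": children[1]}


def predict_tree(tree, row):
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


def fit_forest(rows, targets, n_trees=15, n_try=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(build_tree([rows[i] for i in sample],
                                 [targets[i] for i in sample], rng, n_try))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic data: column 0 mimics night lights, column 1 is pure noise.
data_rng = random.Random(7)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 1)] for _ in range(100)]
targets = [25 - 2 * row[0] + data_rng.gauss(0, 1) for row in rows]
forest = fit_forest(rows, targets)
dim_area = predict_forest(forest, [1.0, 0.5])
bright_area = predict_forest(forest, [9.0, 0.5])
```

Because individual trees see different samples and different candidate variables, their errors are only partly correlated, and averaging them tends to give a more stable prediction than any single tree.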
In this study, variable importance (VIMP) and minimal depth (MD) were used to conduct further analyses. These metrics use the main features obtained from all decision trees to assess the relative significance of explanatory variables in selecting the final predictors in the model.
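One common way to compute variable importance is by permutation: shuffle one variable, re-predict, and record how much the prediction error increases. The sketch below assumes that VIMP refers to this permutation-based definition, which the source does not state explicitly. It uses a hypothetical stand-in model (`toy_model`) and synthetic data; random forest software typically evaluates this on out-of-bag observations rather than on the full dataset as done here.

```python
import math
import random


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, rows, targets, n_repeats=20, seed=0):
    """Average increase in RMSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(row) for row in rows])
    importances = []
    for column in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[column] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:column] + [value] + row[column + 1:]
                        for row, value in zip(rows, shuffled)]
            increases.append(rmse(targets, [predict(row) for row in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model that only uses the first column."""
    return 25.0 - 2.0 * row[0]


# Synthetic data: column 0 drives the outcome, column 1 is irrelevant.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 1)] for _ in range(100)]
targets = [25.0 - 2.0 * row[0] + data_rng.gauss(0, 1) for row in rows]
importance = permutation_importance(toy_model, rows, targets)
```

A variable whose shuffling leaves the error unchanged contributes nothing to the predictions, either because it is irrelevant or because its information is already carried by other variables.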

Support Vector Regression
Typically, the main objective in a linear regression framework is to minimize a specific loss function. For instance, the OLS method aims to minimize the sum of squared errors. Methods like lasso or ridge regression extend this framework by introducing additional penalty parameters to minimize complexity or reduce the number of covariates that contribute only marginally to the model's predictive performance.
On the other hand, support vector regression provides an alternative framework wherein, instead of penalizing every deviation from the fitted values, errors are tolerated as long as they stay within a specified margin. This gives greater flexibility in the estimation and helps in dealing with limitations pertaining to the distributional properties of the variables included in the analyses. In general, this flexibility with allowable error can give SVR an advantage over conventional estimation methods that are fixated on minimizing a loss function.
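The contrast between the two frameworks can be illustrated with their loss functions. The sketch below compares the squared loss used by OLS with the epsilon-insensitive loss used in standard SVR, which ignores residuals inside a tolerance band of half-width epsilon. The residuals and the value of epsilon are hypothetical.

```python
def squared_loss(residual):
    """Loss minimized by OLS: every deviation is penalized."""
    return residual ** 2


def epsilon_insensitive_loss(residual, epsilon=0.5):
    """Loss used in epsilon-SVR: zero inside the band, linear outside it."""
    return max(0.0, abs(residual) - epsilon)


# Hypothetical residuals (actual minus predicted).
residuals = [-2.0, -0.3, 0.0, 0.4, 1.5]
ols_losses = [squared_loss(r) for r in residuals]
svr_losses = [epsilon_insensitive_loss(r) for r in residuals]
# svr_losses -> [1.5, 0.0, 0.0, 0.0, 1.0]
```

Only the two residuals larger than epsilon in absolute value incur a penalty, and that penalty grows linearly rather than quadratically, which makes the fit less sensitive to large errors.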
In this study, neural network, random forest estimation, and support vector regression were implemented using R software. Table 2 lists the details of relevant R packages, including links to the main sources of technical references.

Preliminary Analysis
As a preliminary step, we first estimated a full model and various model specifications using OLS and stepwise regression. In general, we found that the proportion of people living below the income-based poverty line and the value of the multidimensional poverty index are negatively associated with geospatial indicators that represent the degree of an area's urbanization, i.e., intensity of night lights, building density, and number of points of interest associated with the manufacturing and utility sectors. On the other hand, poverty outcomes are positively correlated with rainfall, NDVI, and other land cover classes that are typically associated with rural areas. While the directions of these correlations align with our expectations, the resulting adjusted R-squared values are relatively low, ranging from 0.13 to 0.33.

Using Machine Learning Algorithms to Predict Income-Based Poverty Rate
Comparison of the RMSE (averaged across 100 trials) from the four computational methods (Figure 4) shows that the random forest method yielded the lowest RMSE value. Graphical illustration of the goodness-of-fit (Figures 5 and 6) also confirms that random forest (RF) has the best predictive performance among the four methods considered, generating predicted values that are closest to the actual ones. SVR and GLS ranked second and third under the same criteria. Notably, the neural network generated the highest RMSE.
VIMP identified intensity of night lights and population density-related variables as the biggest contributors to the model. Meanwhile, five variables were identified as false positives in the VIMP results for 2015 and 2017, indicating that these variables are irrelevant in predicting the poverty headcount rate. Alternatively, it is also possible that the information provided by these variables is already captured by other variables. The 'unimportant' variables are the area covered by tree or shrub (ESALC_12); the area covered by tree, broadleaved and deciduous, more than 40% (ESALC_61); the area covered by mosaic herbaceous more than 50% (ESALC_110); the area covered by tree, flooded, fresh or brackish water (ESALC_160); and bare areas (ESALC_200).
The results obtained from the MD calculation were similar, confirming that intensity of night lights and population density-related variables are highly associated with poverty headcount. Similarly, the results show that five variables possess very low predictive power. These are the same five variables identified by the VIMP results as irrelevant to the model, and they can therefore be excluded from the model in further analysis.

Figure 8: Results of Minimal Depth Computation
Source: Calculation and graphics generated by the study team.
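Minimal depth can be illustrated with a small sketch: for each tree, a variable's minimal depth is the depth of the split closest to the root that uses it (the root has depth 0), and the values are then averaged across trees, so smaller values indicate more influential variables. The trees and variable names below are hand-written and hypothetical, and assigning a fixed depth to variables that a tree never uses is a convention assumed here.

```python
def minimal_depths(tree, depth=0, found=None):
    """Depth of the split closest to the root for each variable in one tree."""
    if found is None:
        found = {}
    if not isinstance(tree, dict):  # leaf node: nothing to record
        return found
    variable = tree["feature"]
    found[variable] = min(depth, found.get(variable, depth))
    minimal_depths(tree["left"], depth + 1, found)
    minimal_depths(tree["right"], depth + 1, found)
    return found


def average_minimal_depth(forest, variables, depth_if_unused):
    """Average each variable's minimal depth across all trees."""
    averages = {}
    for variable in variables:
        depths = [minimal_depths(tree).get(variable, depth_if_unused) for tree in forest]
        averages[variable] = sum(depths) / len(depths)
    return averages


# Two toy trees with hypothetical variable names; numeric leaves are predictions.
tree_a = {"feature": "night_lights", "left": 20.0,
          "right": {"feature": "ndvi", "left": 8.0, "right": 4.0}}
tree_b = {"feature": "night_lights",
          "left": {"feature": "rainfall", "left": 22.0, "right": 15.0},
          "right": 5.0}

result = average_minimal_depth([tree_a, tree_b],
                               ["night_lights", "ndvi", "rainfall"],
                               depth_if_unused=2)
```

In this toy forest the night-lights variable splits at the root of both trees and therefore has the smallest average minimal depth, mirroring the kind of ranking summarized in Figure 8.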

Using Machine Learning Algorithms to Predict Multidimensional Poverty Index
In addition to the income-based poverty rate, we also applied GLS, neural network, random forest, and support vector regression to predict the multidimensional poverty index (MPI). Figure 9 compares the RMSE obtained from the four methods. Similar to the case of income poverty rates, the random forest method yielded the lowest RMSE. The scatterplot in Figure 10 compares the actual MPI and the predicted values. It shows that most predicted values generated by random forest are located closest to the 45-degree line, suggesting that it has the best fit among the four methods considered in this study.
Source: Calculation and graphics generated by the study team.
Again, we examined the explanatory power of each variable by calculating VIMP and MD. Based on VIMP, Figure 11 shows that variables related to population density, such as nighttime light, LST, and road density, contribute strongly to predicting the variation in the poverty rate. The result obtained from MD, as illustrated in Figure 12, is qualitatively similar, revealing that nighttime light, LST, rainfall, road density, and the area covered by woody savannas (USGS8) are key geographical features associated with the value of MPI.
In summary, among the methods applied in this study, the random forest technique yielded the highest level of accuracy when predicting both the income poverty rate and the multidimensional poverty index. Furthermore, the resulting random forest models fit the datasets well, as suggested by the adjusted R-squared values presented in Table 3. As stated in Wang, Aggarwal, and Liu (2018), the random forest algorithm tends to outperform other machine learning methods due to its capability to fit complex association structures even with small datasets.

VI. CONCLUSION
The contribution of this study is twofold. First, it integrates data from a nationwide survey, register-based sources, geospatial information, and satellite imagery. Since most of these are open data, the data acquisition cost is minimal. This is potentially attractive for national statistical offices that have scarce resources but wish to explore how geospatial data can be used to enhance the compilation of poverty statistics. Second, this paper has applied computational techniques to examine the relationship between geospatial features, such as intensity of night lights, land cover, and land use, and the proportion of people living below the poverty line as measured using the conventional method of estimating poverty. It is shown that random forest is the best prediction method, yielding an accuracy of more than 80%. These contributions suggest the potential of applying open data and open-source computational tools to analyze the spatial distribution of poverty. Furthermore, the results obtained from VIMP and MD reveal the associations between poverty rates and geospatial covariates such as intensity of night lights and population density. Moving forward, if it can be shown that such relationships remain stable over time, it might be possible to apply these techniques to predict poverty for years when household surveys on income and expenditures are not conducted but data on geospatial correlates of poverty are available.
Note: Other studies that applied the random forest technique in other contexts also noted qualitatively similar results. For example, Fernández-Delgado et al. (2014), using the entire University of California Irvine dataset, have demonstrated that the random forest outperformed 179 classifiers from 17 families. Similarly, Díaz-Uriarte and Alvarez de Andrés (2006) have stated that random forest is the best method for gene selection and classification, and Ali et al. (2012) have shown that random forest yields the highest accuracy in predicting breast cancer. Nevertheless, the robustness of the results of the random forest technique to the size of the training and validation data warrants further investigation.

