Compiling Granular Population Data Using Geospatial Information

Detailed data on the distribution of human populations are valuable inputs to research and decision making. This study aims at compiling data on population density that are more granular than government-published estimates and assessing di ﬀ erent methods and model speci ﬁ cations. As a ﬁ rst step, we combine government-published data with publicly available data like land cover classes, elevation, slope


Introduction
To appreciate the importance of accurate and timely data on human population distributions, it is crucial to understand how such data are used.In general, population numbers play a vital role in various decision-making processes and help governments assess where resources and investments should be allocated (Salvatore et al. 2005, Balk et al. 2006, Tatem et al. 2012).For instance, there are studies showing that a lack of accurate data on the size of the population being targeted for a health intervention program may lead to the ineffective use of scarce resources (Barau et al. 2014).On the other hand, there is also evidence showing that inaccurate population data may lead to a miscalculation of the number of electoral districts (Bycroft 2011), which in turn, may affect how resources are allocated geographically.Given the dynamic nature of population distributions, access to accurate forecasts is critical (United Nations Population Division 2018).With spatially disaggregated population forecasts, governments can better identify areas that may need new roads, schools, hospitals, and other infrastructure and service facilities.
At the global level, high-quality population data are also essential for monitoring indicators outlined in the 2030 Sustainable Development Agenda (Bernal et al. 2021).For example, there are instances when trends in regional inequality may paint an inaccurate pattern simply due to the use of an inappropriate measure of population (Li andGibson 2013, Tian et al. 2016).Furthermore, the "leave no one behind" principle espoused by the 2030 agenda-which requires Sustainable Development Goal indicators to be disaggregated by income, age, sex, ethnicity, and migratory and disability status-also implicitly prompts the need for accurate population data for such groups.
Accurate, granular, and timely population numbers are also fundamental in planning intervention programs in response to disasters and other economic shocks.This was recently demonstrated when the coronavirus disease (COVID-19) pandemic struck as countries and economies heavily relied on population data not only to identify potential hotspots of widespread infection, but also to determine how much relief efforts were warranted for specific population groups and geographic areas. 1  Due to the significance of having access to high-quality population data, these are collected by national statistics offices or government agencies responsible for their collection.Population data are collected through various approaches.In many developed countries, population data are compiled from national registers of births and deaths.The quality of population data compiled from national registers depends on the comprehensiveness, frequency of updating, and synchronization at the local and national levels.National population registers, when managed properly, have the advantage of producing near real-time population data.In many developing countries, however, such registers are either unavailable or miss specific population segments systematically.
Population and housing censuses are commonly used to gather population data too.In fact, the World Population and Housing Census Programme is one of the longest-standing global statistical programs, which recognizes the importance of household and population censuses as a source for supplying data on the distribution of human populations.These censuses differ in their modes of data collection, with some countries relying on face-to-face interviews, while others use phone-based interviews, postal mail, or the Internet.Each data collection mode has its strengths and limitations.With well-trained interviewers, face-to-face interviews ensure a common understanding of questions, thus facilitating enhanced comparability of data throughout the country.Obviously, this mode requires more financial resources.On the other hand, a census conducted through the phone, postal mail, or Internet has the advantage of being more cost-effective, but it could also be more prone to systematic nonresponse patterns.
Although there are methods to reduce the financial resources needed for conducting population and household censuses, it is still challenging for financially constrained national statistics offices to conduct a census frequently.On average, population and household censuses are conducted every 10 years.In Asia and the Pacific, fewer than 10 economies conduct a census more frequently than once a decade, mainly due to the average cost per person, which is estimated at $2.04 (Sustainable Development Solutions Network 2015).This also corroborates the findings of a survey conducted by the Statistics and Data Innovation Unit of the Asian Development Bank (ADB), which show that a nonnegligible portion of its regional 1 By making any designation of or reference to a particular territory or geographic area, or by using the term "country" in this document, the authors do not intend to make any judgments as to the legal or other status of any territory or area.
Compiling Granular Population Data Using Geospatial Information 265 members conduct major surveys and censuses less frequently than is ideal (Figure 1), despite improvements in overall statistical capacity over time (Figure 2).2In addition to cost, there are a number of administrative and technical challenges associated with conducting population census such as concerns about potential intrusiveness, privacy and response burdens, reduced cooperation from those being surveyed, and difficulties in accessing secure apartments, among others (Skinner 2018).
The COVID-19 pandemic further intensified the challenge of conducting timely censuses.There are 120 countries or economies worldwide that either postponed, delayed, or canceled the conduct of their respective population censuses during the COVID-19 pandemic (United Nations Conference on Trade and Development 2021).Some of these economies changed their data collection methodologies to push through with their census activities.In Asia and the Pacific, six out of the 10 regional members surveyed by ADB's Statistics and Data Innovation Unit rescheduled their census field activities to either later in 2020 or in 2021.National statistics offices also intensified initiatives to capacitate their staff and enumerators on the use of new methods to facilitate timely census data compilation.In addition to enhancing timeliness, there is also a need to improve the accuracy and granularity of population numbers in settings where collecting data is administratively challenging such as in remote or conflict-stricken areas.As policies and programs that require population data as inputs evolve and become increasingly complex, there is a need to continuously explore innovative methods of compiling accurate, timely, and granular population data that can complement conventional approaches of collecting such information.
The use of remotely sensed data and the application of machine learning techniques are being increasingly considered.Several studies have noted that such methodologies produce reasonably accurate population data, especially in settings where conventional approaches of collecting population data have limitations (Wang, Borthwick, and Ni 2022).In fact, even in countries where conventional sources of population data are available, remotely sensed data, particularly satellite imagery, are being used to enhance granularity (Tatem et al. 2007).That is why a considerable number of studies have already applied remotely sensed data as their main source of input data or have used it in combination with census information (Anderson and Anderson 1973;Sutton, Elvidge, and Roberts 2001;Chen 2002;Bhaduri et al. 2007;Linard, Gilbert, and Tatem 2011;Linard et al. 2012).Besides its granularity, another great advantage of satellite imagery is that it can even be updated when interventions like face-to-face interviews are not feasible.
Often, most initiatives to compile grid-level population density have relied on the application of basic dasymetric population mapping (Tobler et al. 1995, Balk and Yetman 2004, Balk et al. 2006).Dasymetric population mapping entails dividing administrative areas into smaller spatial units, onto which population size is averaged to calculate population density.The Gridded Population of the World database and the Global Rural Urban Mapping Project are examples of global datasets that have been produced with this approach.To obtain their estimates, the Gridded Population of the World database uses a simple redistribution across census units, while the Global Rural Urban Mapping Project specifically considers rural and urban areas in the spatial distribution of the population inside census units.Another dataset with global coverage is WorldPop's Gridded Population Count database. 3WorldPop applies an even more elaborate approach, which is based on a random forest estimation technique (Stevens et al. 2015).There are a number of reasons why random forest is one of the most commonly used machine learning techniques, particularly in the context of geospatial analysis.Several studies show that the technique works well and yields high prediction accuracy even with imbalanced data, missing values, multicollinearity, the presence of outliers, and overfitting (Sohnesen and Stender 2017;Zhao et al. 2019;Hu et al. 2022;Yang et al. 2022;Simon, Glaum, and Valdovinos 2023). 4This dasymetric redistribution approach also allows for the inclusion of a wide range of input data, including remotely sensed and geospatial data from different scales.Additionally, random forest benefits from nonparametric spatial predictions, which are then anchored across space using contemporary administrative-boundary-linked census data.This method also represents the basis for the estimations described in this paper.Datasets mapping grid-level population counts for larger areas have also been produced by LandScan as well as by the United Nations Environment Programme (Deichmann 1996, Dobson et al. 2000, Bhaduri et al. 2007).Several studies like those by Azar et al. (2010Azar et al. ( , 2013)), Lung et al. (2013), Murakami and Yamagata (2019), Chen et al. (2020), andWang, Borthwick, andNi (2022) have also documented the use of high-resolution satellite imagery to obtain granular estimates of population density in selected countries.
More recently, Stevens et al. (2015) proposed ensemble methods in population mapping that capitalize on remotely sensed and geospatial data.In particular, they developed a new semiautomated dasymetric modeling approach that incorporates detailed census and ancillary data in a flexible random forest estimation technique.Using Viet Nam, Cambodia, and Kenya as case studies, they were able to show that their proposed method increases accuracy and flexibility vis-à-vis other common gridded population data production methodologies.
This study applies the relatively new method proposed by Stevens et al. (2015) to compile granular population data for other Asian countries such as the Philippines and Thailand.Specifically, we combine government-published data with publicly available data on parameters like land cover classes, elevation, slope, and nighttime lights on different levels of granularity and apply a random forest model to obtain grid-level estimates of population density at the 100 meter (m) by 100 m grid level.
In addition to compiling population data that are more granular than governmentpublished estimates, another methodological contribution of this study is to examine the feasibility of forecasting the population distribution at the grid level using data from select areas in Thailand as case studies.To ensure the validity of these forecasts and projections, we evaluate different model specifications and, in addition to random forests, also consider a Bayesian model averaging (BMA) approach.
The applications of such granular forecasts of population density are manifold.First, they can support the planning and administration of a new census or survey.Second, especially during a pandemic or other state of exception in which the possibility of conducting face-to-face interviews might be limited, forecasts can complement census data and facilitate updating outdated information.Third, granular forecasts of population density can show where help is needed the most in situations of crisis or catastrophe, especially when changes in people's behavior can be adequately captured by changes in spatial features on the ground.Lastly, projections like these may also be very valuable to determine the best locations for new facilities, like schools, hospitals, and other public amenities.
The remainder of this paper is structured as follows.Section II gives an overview of all input data used, section III describes the methods applied, section IV provides details on the results, and section V concludes the paper.

A. Data Sources
We use population data compiled by the Philippine Statistics Authority and Thailand's National Statistical Office.5 Country-specific published population data were matched to delineated administrative boundaries for municipalities, barangays (villages in the Philippines), and tambons (subdistricts in Thailand).These administrative levels provided the finest level of administrative unit available at the time of analysis.To estimate population density at the grid level, the natural logarithm of population density is used as the response variable.This decision is consistent with the recommendation of Stevens et al. (2015), who experimented with different specifications of the response variable-including the natural logarithm, common logarithm, and square root of population density-and found that using the natural logarithm yields the highest prediction accuracy.On the other hand, to forecast future levels of population density for select areas in Thailand, we use the growth of population density as the dependent variable. 6opulation distribution is usually highly correlated with land cover types, so we incorporated land cover information using GlobCover, which is a global land cover map based on Environmental Satellite's Medium Resolution Imaging Spectrometer. 7his global land cover map counts 22 land cover classes defined within the United Nations Land Cover Classification System and is available at a spatial resolution of approximately 300 m.In particular, 21 of these 22 land cover classes can be found in Thailand, while 12 of them can be found in the Philippines.All of these classes are included in our estimations.We use GlobCover as our source of information on different land cover classes since GlobCover provides high-quality publicly available data and good coverage for both countries of interest, as demonstrated in previous studies (Stevens et al. 2015).
Land cover data are complemented by digital elevation data and their derived slope estimate based on HydroSHEDS data.8Furthermore, we include Moderate Resolution Imaging Spectroradiometer (MODIS9 )-derived net primary production, monthly data on temperature and precipitation from WorldClim,10 and information on nighttime lights from the Visible Infrared Imaging Radiometer Suite (VIIRS 11 ).Finally, we use OpenStreetMap12 (OSM) to obtain information on different features like villages, schools, and rivers, and Protected Planet13 to identify protected areas.
Table A1 in Appendix 1 gives an overview of the input data used to estimate population density at the 100 m by 100 m grid level.To put it into perspective, an average-sized tambon in Thailand is about 70 square kilometers, so there are about 7,000 grids inside each tambon.An average-sized barangay in the Philippines is about 7 square kilometers, so there are around 700 grids inside each barangay.Note that most of the input data have a constant resolution in degrees, and therefore the resolution in meters changes with the distance to the equator.The resolution shown in Table A1 in Appendix 1 refers to the approximate resolution near the equator.

B. Data Processing
Before applying random forest and BMA estimations to the data, it has to be transformed into the same raster format.In the case of raster data, this means that we have to match the projection, resolution, and origin to get consistent raster files.For vector data, this entails rasterization.
Most of our input data are in the form of raster data.To make sure that all the raster objects have the same projection, resolution, and origin, we perform the following steps.First, we read the raster data-for example, the MODIS data on net primary production-into R. If the given raster is not in the same projection as the raster containing the data on population density, we change its projection accordingly.In case the given raster is already in the correct projection, we only have to adjust its resolution and/or origin.This is done by resampling the raster.Finally, we save the transformed raster as a GeoTIFF.
In addition, we have two sources of vector input data.The first one is a dataset on protected areas, which consists of polygon and point shapefiles.The second one is OSM data, which consist of different OSM files.These OSM files contain line, point, and polygon attributes.In instances when these data from OSM were not available for the target year, we used the closest year for which data are available.The process of converting vector data to raster data is called rasterization.This approach is implemented in the following steps.First, we read the vector data into R. Next, we convert the data into the correct projection.Then we rasterize all points, lines, and polygons, using functions from the package's raster and fasterize.Finally, we set the values of empty grids to 0 and save the created raster as a GeoTIFF.

III. Methodology A. Random Forest
To estimate population density on a grid level, a random forest approach is applied.As briefly mentioned earlier, the random forest is also one of the two techniques that we assessed for forecasting population density. 14The second approach, BMA, is explained in more detail in the second part of this section.
Random forests are an ensemble method based on trees, with each tree building on a random subset of the training data and a random subset of the independent variables (Breiman 2001;Cutler, Cutler, and Stevens 2012).In particular, it is assumed that the p-dimensional vector X ¼ (X 1 , . . ., X p ) T of explanatory variables and the dependent variable Y follow a joint distribution P XY (XY).In this paper, Y is the number of persons living in each 100 m by 100 m grid cell and X consists primarily of geospatial data like nightlights, land cover classes, temperature, and precipitation.
A random forest of this form is created from a set of regression trees, h 1 (x), . . ., h J (x), as base learners.Averaging over the predictions of all individual trees leads to the following random forest prediction:

B. Regression Trees
Since every tree only takes into consideration a random subset of explanatory factors, the jth tree can be denoted as h j (X, Â j ), where Â j is a random subset of the covariates and the Â j s are independent for j ¼ 1, . . ., J. Considering a set of training data D ¼ f(x 1 , y 1 ), . . ., (x N , y N )g, where x i ¼ (x i, 1 , . . ., x i, p ) T includes p covariates, y i is the response, and θ j is a particular realization of Â j , a fitted tree can be written as h j (x, θ j , D) (Breiman 2001).
However, the parameter θ j , which adds randomness to the trees, is not modeled explicitly, but enters the trees implicitly in two different ways.First, in the splitting of each node, only a random subset of all covariates is considered.Second, each tree only considers a bootstrap sample of the training data.This means that, for each tree, a random subset of the training data is drawn with replacement (Cutler, Cutler, and Stevens 2012).
The regression trees creating a random forest are grown using the method of recursive binary splitting.In other words, each tree partitions the predictor space in a sequence of binary splits based on individual covariates.The first node, including the entire predictor space, is called the root node.The final nodes that are not split any further are called the terminal nodes.Considering a certain split point in the values of one of the explanatory variables, each nonterminal node is split into two descendant nodes.Values of the covariate that are below this split point go to the left descendant node, while higher values go to the right.
To determine the exact split point to partition a node, every possible split in all covariates considered in the tree is taken into account.In regression trees, the splitting criterion is based on the mean squared error of the predictions of the descendant nodes.Hence, the squared error of the prediction in a node can be written as where " y ¼ 1 n P n i¼1 y i is the prediction of y at the given node.
Compiling Granular Population Data Using Geospatial Information 273 Since each split leads to two descendant nodes, the Qs of both descendant nodes as well as the sample sizes of both nodes are considered in the choice of the optimal split point.It is determined by minimizing where Q L and Q R are the mean squared errors in the predictions of the left and right descendant nodes, and n L and n R are the corresponding sample sizes.
After the selection of a split point, the node is partitioned into its two descendant nodes, which are again split in the same way.This procedure is continued until either a predefined level of Q split or a predefined maximal tree size is reached.
Each of the trees can then be used to obtain an individual estimate of the response variable y.Finally, the simple average of the estimations of all trees gives the actual random forest prediction.

C. Bayesian Model Averaging
Since there is limited research on how to produce reasonable forecasts of population density on a granular level, we want to consider different approaches for this task.The first one is a random forest technique, which we also used to obtain gridlevel information on population density for years in which we have governmentpublished population numbers.The second one is BMA, which is an ensemble learning method that uses Bayesian inference to solve the problem of model selection.BMA considers a large number of models, evaluates their explanatory power, and then weights the variables accordingly.An advantage of this approach is that it facilitates the inclusion of a large set of covariates and internally weights them by their relative importance in explaining the dependent variable.That way, it considers the uncertainty regarding which explanatory factors to include.Since we know that the estimates of coefficients depend on the covariates entering the model and there is not a lot of literature on which covariates to employ in our forecasts, model uncertainty must be taken into account.Hence, BMA could be a suitable approach.
As opposed to random forests, BMA does not create regression trees, but relies on simple linear regressions. 15Consequently it does not search for split points in the data, but rather directly estimates the average effect each covariate has on the dependent variable.Furthermore, BMA does not use random subsets of the explanatory variables but evaluates all different combinations of covariates that can be created from the input data.Another difference is that BMA does not use subsets of the observations in the individual regressions, but rather uses all observations in each of its regressions.Finally, while random forests simply use the average of the predictions of all individual trees as their result, BMA uses weighted averages, giving more weight to individual models with higher explanatory power.
To apply this method on the estimation of the population growth rate, we set up a regression of the following form: In this linear regression, y is a column vector of population growth rates.X k is a matrix of k columns, one for each explanatory variable, and β k is a k-dimensional vector of parameters corresponding to the variables in X k .
As explanatory variables, we will employ all covariates used to estimate gridlevel population density in past years as well as the initial population density.Details on all datasets entering the estimation can be found in section II of this paper.Regarding the initial population density, we tried two different ways to include this variable.First, we included it in the form of absolute numbers of people.Second, we conducted the same estimations with these levels replaced by logs of population density.
To apply BMA on the linear regression given above, we estimated sets of linear regression models.This means that we considered all models that can be set up as combinations of the variables included in X k .Since X k consists of k potential covariates, we have a total of 2 k possible variable combinations and a set of potential models, M ¼ fM 1 , M 2 , . . ., M 2 k À1 , M 2 k g.The BMA estimation assesses all these models by their ability to explain the dependent variable and then weights the models by their explanatory power.As a result of this evaluation, weighted averages of the models are used to carry out Bayesian inference.

A. Density Estimates
To obtain grid-level estimates of population density for Thailand and the Philippines, we first overlaid all input data for the years for which we have government-published population data.For Thailand, we have access to tambon-level data for 2013, 2015, and 2017.For the Philippines, we have municipal-level data for 2012, 2015, and 2018.Next, we applied a random forest approach as described in section III with the log of population density as the dependent variable and parameters Compiling Granular Population Data Using Geospatial Information 275 like land cover classes, elevation, slope, and nighttime lights as covariates, using data aggregated at the barangay and tambon levels.The resulting (adjusted) R 2 value is 0.9 for both countries.By applying the trained model to the grid-level data of the explanatory variables, we are able to estimate population density on the 100 m by 100 m grid level for all years for which we have input data from government-published estimates.
In particular, the model predictions are used as weights to distribute the population inside a tambon or municipality at the grid level.This means that we first create estimates of the total population in each 100 m by 100 m grid.Next, we rescale these numbers to make them sum up to our input data on the tambon or municipality level.This ensures that our results always match the official data on the most granular administrative level included in the census data.The reason why we do not create any predictions of population distributions at the tambon, municipality, or province level is that we mainly see the approach described in this paper as a method to disaggregate census or survey data to a finer scale.Hence, the aim of this piece of research is not to show how census data might be replaced but rather how a combination of traditional data sources with new types of data and machine learning techniques might allow us to obtain insights on a new level of geographical granularity.
Figure 3 shows the results of the random forest predictions of population density on the grid level for Thailand and the Philippines in 2015.

B. Variable Importance
Apart from these predictions, the model also shows the relative importance of different independent variables in explaining the dependent variable.In line with Stevens et al. (2015), we have decided to measure the degree of predictive contribution of individual variables as the mean decrease in the sum of squared residuals when a variable is included in a tree split.Figure 4 gives the explanatory power of the 10 most relevant covariates in describing population density at the grid level in Thailand in 2015.
In addition to measuring the sensitivity of population density to different explanatory variables using variable importance, we also conduct visual checks of the alignment of population density with features ranked highest based on our variable importance analysis.An example is given in Figure 5, which compares areas covered by a specific land cover class with population density results.As Figure 4 shows, one of the most important explanatory factors is the variable globcover_dst190, which captures the distance to the next grid covered by the land cover class 190, representing artificial surfaces and associated areas.To get a better picture of how this factor is Compiling Granular Population Data Using Geospatial Information 277 related to population density, Figure 5 illustrates the areas covered by this land cover class next to our results on the population density in Bangkok in 2015.
Comparing the most important variables across our estimations for different years, we find very similar patterns.Regarding a comparison of the most important variables for Thailand versus the Philippines, we can see that while most variables are still the same, we can spot some differences.Like Thailand, data on lights at night are also the most important predictor of population density in the Philippines.Similarly, the distances to pharmacies and hospitals, as well as data on temperature and artificial surfaces, also belong to the top 10 of the most important variables for estimating population density in the Philippines.However, the main difference is that slope and elevation seem to be less important in Thailand.This is probably related to the differences in the geographies of the two countries.Except for its mountains in the north, large parts of Thailand are very flat, while due to their volcanic origin, the islands of the Philippines are generally more mountainous. 16

C. Model Optimization and Comparison
To optimize the random forest used to create grid-level estimates of population density in Thailand and the Philippines, we tune the number of randomly selected 16 Comparing these results with that of Stevens et al. (2015), we find similarities and differences.For example, like in our estimations, lights at night play a very important role and are among the most central covariates.Similarly, the distances to the next urban area and to the next hospital also seem to play an important role in some countries considered in Stevens et al. (2015).In addition, the distances to the next generic health facility as well as to the settlement point and the next community are also highly relevant for some of the models in Stevens et al. (2015).covariates that are considered at each node of a tree.To avoid overfitting in this step of the process, we identify the optimal value for this parameter with respect to the out-ofbag error estimate.This technique makes use of the fact that only a random subset of the training data is used in each tree of a random forest.Hence, the remaining observations can be used as test data to validate the predictive power of the algorithm without risking overfitting the data.With this technique, the optimal number of covariates considered at each node is identified based on out-of-bag error rates, allowing us to build a random forest with optimal predictive power with respect to new observations.Building on the results for Thailand for 2013, 2015, and 2017, we extend our estimations by forecasting population density for select areas in Thailand for 2020.Here, we use the population growth rate as the dependent variable and the initial level of population density as well as the remotely sensed input data from our earlier predictions as explanatory factors.
Facing greater uncertainty regarding how to produce reasonable forecasts of population density at this level of granularity, we aim at assessing different methods and model specifications.In particular, we consider a random forest and a BMA approach.In the random forest approach, we again optimize our model by tuning the number of covariates considered at each node of a tree based on the out-of-bag error rates.Additionally, we evaluate how to include the initial level of population density in our model, either in absolute terms or in logs.The evaluation of the suitability of these different approaches is based on data from 2013 and 2015, which is used to predict population density in Thailand in 2017.These numbers are then compared to actual data from the same year by computing the root mean squared error and the mean average error of our predictions.The table below  Compiling Granular Population Data Using Geospatial Information 279 As the table shows, the random forest approach clearly outperforms BMA in our exercise.Regarding the specification of the initial population, the root mean squared error is slightly lower with population in logs, while the mean average error is slightly lower for population in absolute numbers.Hence, we use a random forest to conduct our actual forecasts, and we use the log of the initial population in our set of covariates.

D. Population Density Nowcasts
To examine the feasibility of using the method in nowcasting population density at a time when pandemic-related restrictions hamper census operations, we considered select areas for a pilot exercise-Udon Thani, Uthai Thani, and Samut Songkhramfollowing the suggestion of colleagues from the National Statistical Office of Thailand.These areas have slightly different characteristics in terms of geography, population size, and economy, which allows us to draw insights on how specific characteristics may affect the method's prediction accuracy.Located in the northeast portion of Thailand, Udon Thani accounted for about 2.4% of the country's total registered population in 2012.The area's gross domestic product (GDP) per capita was about 61% lower than GDP per capita for the entire country in 2021.Uthai Thani is located in the lower northern portion of Thailand and accounted for about 0.5% of the country's registered population in 2012.Its GDP per capita was about 56% lower than average GDP per capita for the entire country in 2021.Samut Songkhram, which is located southwest of Bangkok, accounted for 0.3% of Thailand's registered population in 2012, and had a GDP per capita in 2021 that was approximately 31% lower than that of the entire country.To obtain grid-level forecasts of population density in these areas for 2020, we first trained the model with information on the average annual population growth in Thailand from 2013 to 2017 and initial data from 2013.Second, we used the trained random forest and initial data from 2017 to estimate annual growth rates after 2017.Applying these growth rates to grid-level population data from 2017, we obtained granular forecasts of population density in 2020.Figures 6-8 show the results of our forecasts of population density at the 100 m by 100 m grid level for Udon Thani, Uthai Thani, and Samut Songkhram, respectively.
These forecasts allow us to project the change in population density in these areas from 2013 to 2020.Focusing on the center of Udon Thani, we can see that population density decreased in some areas, while it increased in other parts.Figure 9 shows the  Two main reasons led to the selection of these areas for this nowcasting (forecasting) exercise.First, the National Statistical Office of Thailand was interested in a case study for these areas to help them plan for the then upcoming census activities.Second, our main goal with this part of our analysis was to conduct an initial assessment of the suitability of random forest and BMA approaches to forecast population density on a grid level and to find out, which method performs better at this specific task.Focusing on a few select areas with slightly different characteristics 17 Since estimates for 2020 are not rescaled, we are allowing for the total population to grow inside the areas under consideration.For each grid, we predict the growth in population density from 2017 to 2020 and then apply these individual growth rates to the results for 2017.This does not only change the population density on the grid level, but also leads to a change in the projected total population in each of the districts.However, for long-term forecasts, including detailed information on migration dynamics could potentially improve the forecasts further.seemed sufficient for this task and allowed us to speed up computation times significantly.Hence, we have decided to focus on these areas and have not included other parts of Thailand or the Philippines in the exercise.Based on encouraging results, we do, however, think that it would be possible to extend this exercise to other regions or countries should the need arise.

A. Nightlights and Settlements
The discussion in section IV demonstrated the effectiveness of using nightlights to produce granular predictions of population density in developing countries, particularly Thailand and the Philippines.This can be compared with the results of previous studies showing nightlights' capacity to predict economic activity (Elvidge et al. 1997;Sutton and Costanza 2002;Henderson, Storeygard, and Weil 2012) and wealth (Jean et al. 2016) across space.That nightlights predict both economic activity and population well is consistent with earlier theories linking productivity and settlement concentration (Bettencourt and West 2011).
This raises questions on whether models relying on nightlights would have better results for areas with more concentrated settlements, basically urban areas, than those with sparser populations, usually the poorer rural areas.Indeed, Gibson et al. (2021) shows how the use of Defense Meteorological Satellite Program data for nightlights compared to VIIRS understates spatial inequality between rural and urban areas.While this paper has used VIIRS instead of the Defense Meteorological Satellite Program, there remains the valid question of whether nightlights overstate urban population and understate rural ones.Consider the result in Chen and Nordhaus (2011), which showed that luminosity data may not provide reliable estimates of economic activity in low activity areas.
Fortunately, our results in section IV show that while nighttime lights may be a better predictor for urban than rural areas for developing countries, this is partially mitigated by the inclusion of other land cover variables, specifically the globcover_cls190 and globcover_dst190 (artificial surfaces and associated areas), which can place any particular geographic area along the urban-rural spectrum.In fact, globcover_dst190 is in the top 10 covariates in Thailand in terms of importance (Figure 4).
Other land cover data variables (e.g., slope, elevation, and distance to water) would have also increased the ability of the model to capture rural areas.Hoffman-Hall et al. (2019) demonstrated that distance to water and vegetation are useful indicators to capture the presence of small rural settlements.Various land cover variables used in the study (e.g., globecover_cls180, globcover_cls210) would have had similar effects.
Nonetheless, the limitation in the capacity of what are basically two-dimensional images to predict true density remains, especially since cities would likely have higher vertical settlements.Techniques such as Pareto-adjustment in Gibson et al. (2021) can be used to address this problem even with VIIRS nightlights data.

B. Effectiveness of Random Forest
The discussion in section IV demonstrated how random forest is clearly superior to BMA in forecasting population densities.To explain why this may be so, it is important to note that both the explanatory variables and dependent variables are measures of spatial phenomena and therefore might be spatially autocorrelated. 18This might introduce some bias in the model.Our hypothesis is that this bias is better mitigated in random forest regression than in BMA.
Consider that BMA is an ensemble of linear regressions for different combinations of explanatory variables-but also note that each of those models is subject to spatial bias.Each of the base models would take the form y ¼ X k β k þ " (see section III), whereas models closer to reality will take a spatially lagged form in both the response and predictor variables-that is, where W is the selected distance matrix.Unfortunately, incorporating the spatial lag increases the computational intensity of the procedure.
In random forest, the recursive binary splitting already minimizes the spatial bias in each of the trees because it maximizes model fit, regardless of the underlying spatial autocorrelation or other nonlinearities.This fit is then relaxed as we proceed to do this for different combinations of explanatory variables (moving in the other direction of the bias-variance tradeoff).The bootstrapping would also have attenuated the bias because the random resampling also breaks the spatial dependence of the observation points. 19

C. Limitations
As just described, spatial autocorrelation may play a significant part in explaining and projecting population density at the grid level.Related to that, one of the limitations of this paper is that it does not explicitly take into account this spatial dimension.Future researchers might want to focus more on this aspect by, for example, including a spatial weight matrix as another explanatory factor or even by comparing the predictive power of models considering different types of spatial weights.One way to implement this idea would be to use a spatial BMA approach, which creates a weighted average of weight matrices based on different definitions of neighboring grids.
Another limitation of this study could be that it only takes two different types of approaches into account to nowcast (forecast) grid-level population density.Future studies might need to explore a broader range of econometric and machine learning approaches including unsupervised learning-based methods.
Finally, the forecasts presented in this paper have been limited to select areas of Thailand.In the future, the approach presented in the study at hand might be extended to create a more complete picture of the distribution of population density for entire countries or larger regions.

D. Comparison with Other Sources
In this section, we compare our own results for Thailand with population density figures from WorldPop (footnote 3) for 2019 and META20 for 2020.The other two sources were chosen as they are two of the most well-known and well-used population density datasets.
For the comparison, we chose five areas in Thailand: Bangkok, Nakhon Ratchasima, Chiang Mai, Hat Yai, and Udon Thani.While WorldPop provides population density data on a 100 m by 100 m grid level, META publishes their data on a 30 m by 30 m level, which was aggregated and reprojected to 100 m grids for this comparison.
The comparison shows a clear overlap between our own results calculated by the methodology described in this paper and the results of the other two sources (Figure 10).While the maps displaying the META data seem different at first sight (i.e., many more white spaces), they show a similar population density like our own results.META estimates the size and area of buildings and then distributes the population among these.Places without buildings and thus without population, are shown as white.With our own results as well as on maps created with WorldPop data, these areas are colored in dark blue (i.e., no or extremely low population density).

VI. Summary and Conclusion
The availability of accurate, timely, and spatially disaggregated data of the distribution of human populations is important on many fronts.Population data are used as inputs when national governments allocate resources across their local In developing countries or economies, particularly in Asia and the Pacific, a population and household census is conventionally used to compile population numbers.Despite it being considered a useful and reliable collection vehicle for population data, it suffers from several limitations.Censuses are costly and, hence, these are only conducted every 5-10 years in many countries.Furthermore, while there have been notable improvements in census technology compared to techniques implemented 2 decades ago (United Nations Statistics Division 2001), there is still a need for more spatially granular population estimates than data conventionally published from population and household censuses as decision-making operates at finer levels.
This paper uses a combination of published population data (mostly derived from census information) and publicly available data based on satellite imagery to produce estimates of population density at the 100 m by 100 m grid level.Focusing on Thailand and the Philippines, we follow Stevens et al. (2015) and apply a random forest approach.In addition to predictions for the past years for which published population data are available, we also explore a population forecasting method on the same level of granularity in select areas of Thailand.For this part of our estimations, we evaluate different model specifications and, in addition to random forests, also consider a BMA approach.Our results show that using a random forest model with the log of the initial population and remotely sensed data as covariates, reasonable forecasts of grid-level population growth rates are achievable.This method allows us to produce forecasts of population density for select Thai areas for the year 2020.
The results of this paper contribute to the assessment of ensemble methods like random forest and BMA in the field of demography.They showcase the applicability of these methods, especially a random forest approach, in grid-level predictions of population density and might be a starting point for further and more extensive forecasts.In addition, they contribute to the literature focused on creating granular population estimates for a variety of applications from vaccination projects to the assessment of spatial inequality.As described in section I of this paper, especially in times of continued population growth and urbanization, this area of research has the potential to help solve challenges in a variety of fields like economics, health, and environmental policy.The application of such methods could also potentially minimize data inequality that arises when the availability or quality of data in a specific area is constrained due to a lack of financial resources or logistics-related issues that make data collection challenging.Nonetheless, the use of innovative data sources such as remotely sensed data-whether for granular estimations of population, poverty, economic growth, or other development indicator-comes with important caveats because remotely sensed data may not capture all factors that can affect the value of a specific development indicator.For instance, in the case of poverty estimations, there are studies hinting that development practitioners may find satelliteimagery-based poverty maps useful when comparing the prevalence of poverty between assemblies of areas, but such estimates may be less reliable when the objective entails examination of specific geographic areas (Van der Weide et al. 2022).A similar line of argument may also extend for the estimation of population data and other development indicators.Therefore, these newer types of data collection should not be immediately viewed as perfect substitutes for conventional statistical data compilation approaches.Nonetheless, conducting studies such as what was presented here contributes to building a body of knowledge that will allow us to understand in what contexts these methods work well and when they should be used with caution.

Tuning
When estimating a random forest, the parameter that the estimation can be most sensitive to is the number of randomly selected covariates chosen at each node of the trees.This parameter can be tuned and optimized using out-of-bag error estimates.
Since bootstrapping only uses a random subset of the training data in each tree, there is always some training data left that are not employed in the individual estimations.These parts are called out-of-bag data and can be used to compute outof-bag error rates.Tuning the parameter entails calculating out-of-bag error rates for different values and then choosing such that the out-of-bag error is minimized.
Another parameter that can potentially be tuned is the number of trees in the forest.However, Breiman (2001) has shown that as it increases, the out-of-bag error converges to a limit.Hence, it is sufficient to ensure that it is not set too low, but no actual tuning is needed for this part of a random forest estimation.

B. Bayesian Model Averaging
To produce weighted averages of all models, BMA uses weights that are based on the posterior model probabilities resulting from Bayes' theorem.Thus, the posterior model probability of a model M γ can be written as follows: In this equation, the marginal likelihood p(M γ ) refers to the prior model probability, expressing the researcher's opinion of the likelihood of model M γ before checking the data.To get to the posterior model probability, it is updated by the Compiling Granular Population Data Using Geospatial Information 295 information in the data.This is done by multiplying p(M γ ) with the integrated likelihood p(y j M γ , X), representing the probability of observing the given data based on the model M γ .p(y j M γ , X) denotes the likelihood of y integrated over the parameter space for a given model M γ .In contrast, p(y j X) expresses the likelihood of y integrated over the model space and does not depend on a specific model.Hence, it remains a constant multiplicative term (Zeugner and Feldkircher 2015).
Given the posterior model probabilities of all models included in M, it is now possible to derive the marginal posterior distribution of any statistic present in the models.For example, the model-weighted posterior distribution of the estimator of the coefficient β k can be described as This means that the posterior distribution of the estimator of β k given y and X can be written as the average of its posterior distribution under each of the models in M, weighted by their posterior model probabilities (Hoeting et al. 1999;Fragoso, Bertoli, and Louzada 2018).

Priors
To obtain posterior model probabilities and posterior distributions of the model parameters, corresponding priors have to be chosen.If knowledge from previous studies is available, these priors should reflect the researchers' prior beliefs.However, as in the case of this study, prior information is often limited.In such cases, using a uniform prior model probability is a common choice (Zeugner and Feldkircher 2015).
Regarding prior beliefs about the coefficients, the common practice is to assume a normal distribution and specify the expected mean and standard deviation (Zeugner and Feldkircher 2015).In particular, the prior mean is usually set to 0, and the variance is based on Zellner's g: Hence, the prior distribution of β k can be written as This specification reflects that the researchers expect the coefficient to be 0 and its variance to be closely related to that of the explanatory variables in X k .As the equation shows, the hyperparameter g defines how large the variance of β k is expected Compiling Granular Population Data Using Geospatial Information 297 Continued.
Continued.Compiling Granular Population Data Using Geospatial Information 299 to be.For this paper, g is set to

Type
where n is the number of observations and k is the number of covariates.This approach follows the work of Fernandez, Ley, and Steel (2001).

Model Sampling
Once these priors are set, BMA is applied to estimate the posterior model probabilities as well as the posterior distributions of the coefficients.The high cardinality of the model space given in this and many other applications of BMA often makes the specific evaluation of every single potential model infeasible.As one solution to this challenge, Markov chain Monte Carlo model composition has proven to conduct sensible evaluations of subsets of the model space that account for a reasonably large part of the posterior model probability mass (Crespo Cuaresma and Feldkircher 2013).This motivated us to evaluate the set of potential models with a random walk search algorithm.

Figure 1 .
Figure 1.Frequency of Surveys and Censuses in Developing Asian Economies

Figure 2 .
Figure 2. Statistical Capacity Indicators in Asia and the Pacific

Figure 4 .
Figure 4. Variable Importance of the 10 Covariates with the Highest Explanatory Power in the Random Forest Estimation

Figure 5 .
Figure 5. Artificial Surfaces and Associated Areas (Land Cover Class 190) versus Population Density in Bangkok, 2015

Figure 10 .
Figure 10.Comparison of Population Density Results with Other Sources
shows the results of this comparison.

Table A1 .
Description of Variables Used in Estimation of Population Density