Take-home Exercise 1 - Geospatial Analytics for Social Good

1. Overview

1.1. Background

Water is an important resource to mankind. Clean and accessible water is critical to human health. It provides a healthy environment, a sustainable economy, reduces poverty and ensures peace and security. Yet over 40% of the global population does not have access to sufficient clean water. By 2025, 1.8 billion people will be living in countries or regions with absolute water scarcity, according to UN-Water. The lack of water poses a major threat to several sectors, including food security. Agriculture uses about 70% of the world’s accessible freshwater.

Developing countries are most affected by water shortages and poor water quality. Up to 80% of illnesses in the developing world are linked to inadequate water and sanitation. Despite technological advancement, providing clean water to the rural community is still a major development issues in many countries globally, especially countries in the Africa continent.

To address the issue of providing clean and sustainable water supply to the rural community, a global Water Point Data Exchange (WPdx) project has been initiated. The main aim of this initiative is to collect water point related data from rural areas at the water point or small water scheme level and share the data via WPdx Data Repository, a cloud-based data library. What is so special of this project is that data are collected based on WPDx Data Standard.

1.2. Overall Goal

This study aims to apply appropriate global and local measures of spatial Association techniques to reveal the spatial patterns of Not Functional water points in Nigeria.

1.3. Key Objectives

Using the appropriate R packages, we will need to:

Prepare the dataset and save it in simple feature data frameformat, as well as derive the proportion of functional and non-functional water point at LGA level
Conduct thematic mapping analysis to examine the spatial distribution of functional and non-functional water point rate at LGA level
Conduct hotspot areas and outliers/clusters maps of functional and non0functional water point rate at LGA level

2. Setup

2.1 Packages Used

The R packages that we will be using for this analysis area:

sf: used for importing, managing, and processing geospatial data
tidyverse: used for wrangling attribute data
spdep: used for computing spatial weights, global and local spatial auto-correlation statistics
tmap: used for creating cartographic quality choropleth map
funModeling: used for exploratory data analysis, data preparation and model performance

In addition, the following tidyverse packages will be used:

tidyr for manipulating and tidying data
dplyr for wrangling and transforming data
ggplot2 for visualising data

2.2 Datasets Used

2 geospatial datasets will be utilized for this study:

geo_export_338e5689-bd72-4866-bfde-8997933e9897

WPdx+ dataset from WPdx Global Data Repositories
nga_admbnda_adm2_osgof_20190417

Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data from geoBoundaries

2.3 Launching the packages in R

The code chunk below is used to perform the following tasks:

creating a package list containing the necessary R packages,
checking if the R packages in the package list have been installed in R,
- if they have yet to be installed, RStudio will installed the missing packages,
launching the packages into R environment.

pacman::p_load(sf, spdep, tmap, tidyverse, funModeling)

3. Data Preparation

In this section, we will bring geospatial data into R environment. The geospatial data is in ESRI shapefile format.

3.1 Import water point shapefile into R environment

The code chunk below uses st_read() of sf package to import Nigeria shapefile into R. The imported shapefile will be simple features Object of sf.

wp <- st_read(dsn = "geodata",
              layer = "geo_export_338e5689-bd72-4866-bfde-8997933e9897",
              crs = 4326) %>%
  filter(clean_coun == "Nigeria") %>%
  select(1:4, 13:15, 23, 36:40)

The code chunk below uses write_rds() of readr package to save the extracted sf data table (i.e. wp) into an output in rds data format. The output file is called wp_nga.rds and it is saved in geodata sub-folder.

wp_nga <- write_rds(wp, "geodata/wp_nga.rds")

3.2 Import Nigeria LGA boundary data into R environment

We are going to import LGA boundary data into R environment using the following code chunk, st_read() of sf package. It is used to import nga_admbnda_adm2_osgof_20190417 shapefile and save the imported geospatial data into simple feature data table.

nga <- st_read(dsn = "geodata",
               layer = "nga_admbnda_adm2_osgof_20190417",
               crs = 4326)

3.3 Recoding NA values into string

Use replace_na() to recode all the NA values in status_cle field into the Unknown.

wp_nga <- read_rds("geodata/wp_nga.rds") %>%
  mutate(status_cle = replace_na(status_cle, "Unknown"))

4. Exploratory Spatial Data Analysis

Use freq() of funModeling package to display the distribution of status_cle field in wp_nga.

freq(data = wp_nga,
     input = "status_cle")

4.1 Extract functional water point data

In the code chunk below, filter() of dplyr is used to select functional water points.

wpt_functional <- wp_nga %>%
  filter(status_cle %in%
           c("Functional", 
             "Functional but not in use",
             "Functional but needs repair"))

freq(data=wpt_functional, 
     input = 'status_cle')

4.2 Extract non-functional water point data

In the code chunk below, filter() of dplyr is used to select non-functional water points.

wpt_nonfunctional <- wp_nga %>%
  filter(status_cle %in%
           c("Abandoned/Decommissioned", 
             "Abandoned",
             "Non-Functional",
             "Non functional due to dry season",
             "Non-Functional due to dry season"))

freq(data=wpt_nonfunctional, 
     input = 'status_cle')

4.3 Extract water point data with Unknown class

In the code chunk below, filter() of dplyr is used to select unknown water points.

wpt_unknown <- wp_nga %>%
  filter(status_cle == "Unknown")

4.4 Performing Point-in-Polygon Count

The code chunk below performs 2 operations at one go. Firstly, it uses st_intersects() to identify the various water point types (e.g. total, functional, non-functional and unknown) located inside each LGA boundary. Next, length() of Base R is used to calculate the number of water points that fall within each LGA boundary.

nga_wp <- nga %>% 
  mutate(`total wpt` = lengths(
    st_intersects(nga, wp_nga))) %>%
  mutate(`wpt functional` = lengths(
    st_intersects(nga, wpt_functional))) %>%
  mutate(`wpt non-functional` = lengths(
    st_intersects(nga, wpt_nonfunctional))) %>%
  mutate(`wpt unknown` = lengths(
    st_intersects(nga, wpt_unknown)))

4.5 Saving the Analytical Data Table

The code chunk below uses mutate() of dplyr package to derive 2 fields namely pct_functional and pct_non-functional. In order to keep the file size small, select() of dplyr is used to retain on the relevant fields.

nga_wp <- nga_wp %>%
  mutate(pct_functional = `wpt functional`/`total wpt`) %>%
  mutate(`pct_non-functional` = `wpt non-functional`/`total wpt`) %>%
  select(3:4, 9:10, 18:23)

Thereafter, we will save the sf data table in rds format for subsequent analysis.

write_rds(nga_wp, "geodata/nga_wp.rds")

5. Geospatial Visualization & Analysis

5.1 Thematic Mapping of Functional and Non-Functional Water Points at LGA level

In order to draw a choropleth map, we will use qtm() of tmap package. Small choropleth maps are created with tmap_arrange().

nga_wp <- read_rds("geodata/nga_wp.rds")
total <- qtm(nga_wp, "total wpt")
wp_functional <- qtm(nga_wp, "wpt functional")
wp_nonfunctional <- qtm(nga_wp, "wpt non-functional")
unknown <- qtm(nga_wp, "wpt unknown")

tmap_arrange(total, wp_functional, wp_nonfunctional, unknown, asp=1, ncol=2)

5.2 Global Spatial Autocorrelation

In this section, we will compute global spatial autocorrelation statistics and perform spatial complete randomness test for global spatial autocorrelation.

5.2.1 Computing Contiguity Spatial Weights

In the code chunk below, poly2nb() of spdep package is used to compute the contiguity weight matrices for the LGA. We will compute Queen contiguity weight matrix.

wm_q <- poly2nb(nga_wp, 
                queen=TRUE)

summary(wm_q)

The summary report above shows that there are 774 area units in Nigera. The most connected area unit has 14 neighbours. There are 2 area units with only 1 neighbour.

5.2.2 Row-standardized weights matrix

Alternatively, we can assign weights to each neighbouring polygon. In this study, each of the neighbouring polygon will be assigned equal weight (style = “W”). This is accomplished by assigning the fraction 1/(# of neighbors) to each neighboring county then summing the weighted income values.

rswm_q <- nb2listw(wm_q, 
                   style="W", 
                   zero.policy = TRUE)

set.ZeroPolicyOption(TRUE)

rswm_q

5.2.3 Performing Moran’s I test

Using localmoran() function of spdep, we will compute local Moran’s I. We will compute local indicator values, given a set of standard deviation values and listw objective providing the neighbour weighting information of the polygon associated with standard deviation values.

The code chunk below is used to compute local Moran’s I of non-functional waterpoints at the LGA.

moran.test(nga_wp$`wpt non-functional`, 
           listw = rswm_q, 
           zero.policy = TRUE, 
           na.action=na.omit)

The above statistical output illustrates that the null hypothesis i.e. observed spatial pattern of values is equally likely as other spatial pattern can be rejected. There is sufficient evidence to show that regions with higher percentage of non-functional water points are dependent on those at other (neighbouring) locations.

5.2.4 Geary’s C test

The code chunk below performs Geary’s C test for spatial autocorrelation by using geary.test() of spdep.

geary.test(nga_wp$`wpt non-functional`, listw=rswm_q)

The above statistical output illustrates that the null hypothesis i.e. observed spatial pattern of values is similar from their immediate neighbours can be rejected. There is sufficient evidence to show that regions with higher percentage of non-functional water points are dissimilar to their (neighbouring) locations.

5.3 Cluster and Outlier Analysis

5.3.1 Computing local Moran’s I

To compute local Moran’s I, the localmoran() function of spdep will be used. It computes the local Moran’s I statistic values, given a set of standard deviation and a listw object providing neighbour weighting information for the polygon associated with standard deviation.

The code chunk below is used to compute local Moran’s I of non-functional water point at the county level.

fips <- order(nga_wp$ADM2_EN)
localMI <- localmoran(nga_wp$`wpt non-functional`, rswm_q)
head(localMI)

The code chunk below list the content of the local Moran matrix derived by using printCoefmat().

printCoefmat(data.frame(
  localMI[fips,], 
  row.names = nga_wp$ADM2_PCODE[fips]),
  check.names=FALSE)

5.3.2 Mapping both local Moran’s I values and p-values

The code chunk below is meant to append the local Moran’s I dataframe (i.e. localMI) onto Nigera Spatial Polygon Data Frame.

nga_wp.localMI <- cbind(nga_wp,localMI) %>%
  rename(Pr.Ii = Pr.z....E.Ii..)

Using choropleth mapping functions of tmap package, we will plot the local Moran’s I and p-values with the code chunk below.

localMI.map <- tm_shape(nga_wp.localMI) +
  tm_fill(col = "Ii", 
          style = "pretty",
          palette = "RdBu",
          title = "local moran statistics") +
  tm_borders(alpha = 0.5)

pvalue.map <- tm_shape(nga_wp.localMI) +
  tm_fill(col = "Pr.Ii", 
          breaks=c(-Inf, 0.001, 0.01, 0.05, 0.1, Inf),
          palette="-Blues", 
          title = "local Moran's I p-values") +
  tm_borders(alpha = 0.5)

tmap_arrange(localMI.map, pvalue.map, asp=1, ncol=2)

5.3.3 Creating LISA map classes

The code chunk below show the steps to prepare a LISA cluster map.

quadrant <- vector(mode="numeric",length=nrow(localMI))

Next, derives the spatially lagged variable of interest (i.e. wpt_nonfunctional) and centers the spatially lagged variable around its mean.

nga_wp$lag_nonfunctional <- lag.listw(rswm_q, nga_wp$`wpt non-functional`)

DV <- nga_wp$lag_nonfunctional - mean(nga_wp$lag_nonfunctional)

This is follow by centering the local Moran’s around the mean.

LM_I <- localMI[,1] - mean(localMI[,1])

Next, we will set a statistical significance level for the local Moran.

signif <- 0.05

These four command lines define the low-low (1), low-high (2), high-low (3) and high-high (4) categories.

quadrant[DV <0 & LM_I>0] <- 1
quadrant[DV >0 & LM_I<0] <- 2
quadrant[DV <0 & LM_I<0] <- 3  
quadrant[DV >0 & LM_I>0] <- 4

Lastly, places non-significant Moran in the category 0.

quadrant[localMI[,5]>signif] <- 0

5.3.4 Plotting LISA map

Using the code chunk below, we will build the LISA map. For effective interpretation, it is better to plot both the local Moran’s I values map and its corresponding p-values map next to each other.

wpt_nonfunctional <- qtm(nga_wp, "wpt_non-functional")

nga_wp.localMI$quadrant <- quadrant
colors <- c("#ffffff", "#2c7bb6", "#abd9e9", "#fdae61", "#d7191c")
clusters <- c("insignificant", "low-low", "low-high", "high-low", "high-high")

LISAmap <- tm_shape(nga_wp.localMI) +
  tm_fill(col = "quadrant", 
          style = "cat", 
          palette = colors[c(sort(unique(quadrant)))+1], 
          labels = clusters[c(sort(unique(quadrant)))+1],
          popup.vars = c("")) +
  tm_view(set.zoom.limits = c(11,17)) +
  tm_borders(alpha=0.5)

LISAmap

5.4 Hot and Cold Spot Analysis

In order to detect spatial anomalies, we will use Getis and Ord’s G-Statistics. We will look at neighbours within a defined proximity to identify where either high or low values cluster spatially. Here, statistically significant hot-spots are recognised as areas of high values where other areas within a neighbourhood range also share high values too.

5.4.1 Deriving the centroid

The code chunk below allows to get the longitude and latitude, which is the 1st value and 2nd value in each centroid respectively.

longitude <- map_dbl(nga_wp$geometry, ~st_centroid(.x)[[1]])

latitude <- map_dbl(nga_wp$geometry, ~st_centroid(.x)[[2]])

Next we will use cbind to put the longitude and latitude into the same object.

coords <- cbind(longitude, latitude)
head(coords)

5.4.2 Computing cut-off distance

The code chunk below uses dnearneigh() of spdep package to derive distance-based weight matrix.

k1 <- knn2nb(knearneigh(coords))

k1dists <- unlist(nbdists(k1, coords, longlat = TRUE))
summary(k1dists)

The summary report shows that the largest first nearest neighbour distance is 71.661 km, so using this as the upper threshold gives certainty that all units will have at least one neighbour.

5.4.3 Computing fixed distance weight matrix

The chunk below computes the distance weight matrix by using dnearneigh() of spdep package.

Next, we will use str() to display the content of wm_d72 weight matrix.

wm_d72 <- dnearneigh(coords, 0, 72, longlat = TRUE)
wm_d72

nb2listw() is used to convert the nb object into spatial weights object.

wm72_lw <- nb2listw(wm_d72, style = 'B')
summary(wm72_lw)

5.4.4 Computing Gi statistics using fixed distance

fips <- order(nga_wp$ADM2_EN)
gi.fixed <- localG(nga_wp$'wpt non-functional', wm72_lw)
gi.fixed

The Gi statistics is represented as a Z-score. Greater values represent a greater intensity of clustering and the direction (positive or negative) indicates high or low clusters.

Next, we will join the Gi values to their corresponding nga_wp sf data frame by using the code chunk below.

nga_wp.gi <- cbind(nga_wp, as.matrix(gi.fixed)) %>%
  rename(gstat_fixed = as.matrix.gi.fixed.)

5.4.5 Mapping Gi values using fixed distance

The code chunk below shows the functions used to map the Gi values derived using fixed distance weight matrix.

wpt_nonfunctional <- qtm(nga_wp, "wpt_non-functional")

Gimap <-tm_shape(nga_wp.gi) +
  tm_fill(col = "gstat_fixed", 
          style = "pretty",
          palette="-RdBu",
          title = "local Gi") +
  tm_borders(alpha = 0.5)

tmap_arrange(Gimap, asp=1, ncol=2)

1f2a596 (Commit)