In Chapter 17, we learned how to work with geospatial data. We learned about shapefiles, map projections, and how to plot spatial data using both ggplot2 and leaflet. In this chapter, we will learn how to perform computations on geospatial data that will enable us to answer questions about how long or how big spatial features are. We will also learn how to use geometric operations and spatial joins to create new geospatial objects. These capabilities will broaden the spectrum of analytical tasks we can perform, and accordingly expand the range of questions we can answer.
18.1 Geospatial operations
18.1.1 Geocoding, routes, and distances
The process of converting a human-readable address into geographic coordinates is called geocoding. While there are numerous APIs available online that will do this for you, the functionality provided in tidygeocoder by the geocode() function uses Open Street Map and does not require registration to use the API. Here, we build a data frame of the places of business of the three authors, geocode the addresses of the schools, convert the resulting data frame into an sf object, and set the projection to epsg:4326 (see Chapter 17).
library(tidyverse)library(mdsr)library(sf)library(tidygeocoder)colleges <-tribble(~school, ~address, "Smith", "44 College Lane, Northampton, MA 01063","Macalester", "1600 Grand Ave, St Paul, MN 55105","Amherst", "Amherst College, Amherst, MA 01002") |>geocode(address, method ="osm") |>st_as_sf(coords =c("long", "lat")) |>st_set_crs(4326)colleges
Simple feature collection with 3 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -93.16929 ymin: 42.31771 xmax: -72.51607 ymax: 44.93769
Geodetic CRS: WGS 84
# A tibble: 3 × 3
school address geometry
* <chr> <chr> <POINT [°]>
1 Smith 44 College Lane, Northampton, MA 01063 (-72.63973 42.31771)
2 Macalester 1600 Grand Ave, St Paul, MN 55105 (-93.16929 44.93769)
3 Amherst Amherst College, Amherst, MA 01002 (-72.51607 42.37038)
Geodesic distances can be computed using the st_distance() function in sf. Here, we compute the distance between two of the Five Colleges1.
The geodesic distance is closer to “as the crow files,” but we might be more interested in the driving distance between two locations along the road. To compute this, we need to access a service with a database of roads. Here, we use the openroute service, which requires an API key2. The openrouteservice package provides this access via the ors_directions() function. Note that the value of the API key is not shown. You will need your own API key to use this service.
Note the difference between the geodesic distance computed above and the driving distance computed below. Of course, the driving distance must be longer than the geodesic distance.
route_driving |>st_length()
If you prefer, you can convert meters to miles using the set_units() function from the units package.
It turns out that the rail trail path is slightly longer (but far more scenic).
Since the Calvin Coolidge Bridge is the only reasonable way to get from Northampton to Amherst when driving, there is only one shortest route between Smith and Amherst, as shown in ?fig-smith-amherst. We also show the shortest biking route, which follows the Norwottuck Rail Trail.
However, shortest paths in a network are not unique (see Chapter 20). Ben’s daily commute to Citi Field from his apartment in Brooklyn presented three distinct alternatives:
One could continue on the Long Island Expressway (I-495 E) and then approach Citi Field from the opposite direction on the Grand Central Parkway W.
One could avoid highways altogether and take Roosevelt Avenue all the way through Queens.
The latter route is the shortest but often will take longer due to traffic. The first route is the most convenient approach to the Citi Field employee parking lot. These two routes are overlaid on the map in ?fig-citifield.
Much of the power of working with geospatial data comes from the interactions between various layers of data. The sf package provides many features that enable us to compute with geospatial data.
A basic geospatial question is: what part of one series of geospatial objects lies within another set? To illustrate, we use geospatial data from the MacLeish field station in Whately, MA. These data are provided by the macleish package. Figure 18.1 illustrates that there are several streams that pass though the MacLeish property.
The data from MacLeish happens to have a variable called Shape_Area that contains the precomputed size of the property.
boundary |>pull(area)
255.0989 [acre]
Is this accurate? We can easily compute basic geometric properties of spatial objects like area and length using the st_area function.
st_area(boundary)
1032353 [m^2]
The exact computed area is very close to the reported value. We can also convert the area in square meters to acres by dividing by a known conversion factor.
st_area(boundary) /4046.8564224
255.0999 [m^2]
Similarly, we can compute the length of each segment of the streams and the location of the centroid of the property.
Simple feature collection with 1 feature and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -72.67969 ymin: 42.45164 xmax: -72.67969 ymax: 42.45164
Geodetic CRS: WGS 84
area geometry
1 255.0989 [acre] POINT (-72.67969 42.45164)
As promised, we can also combine two geospatial layers. The functions st_intersects() and st_intersection() take two geospatial objects and return a logical indicating whether they intersect, or another sf object representing that intersection, respectively.
st_intersects(boundary, streams)
Sparse geometry binary predicate list of length 1, where the predicate
was `intersects'
1: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
st_intersects() is called a predicate function because it returns a logical. It answers the question: “Do these two layers intersect?” On the other hand, st_intersection() performs a set operation. It answers the question: “What is the set that represents the intersection of these two layers?” Similar functions compute familiar set operations like unions, differences, and symmetric differences, while a whole host of additional predicate functions detect containment (st_contains(), st_within(), etc.), crossings, overlaps, etc.
In Figure 18.2 (a) we use the st_intersection() function to show only the parts of the streams that are contained within the MacLeish property. In Figure 18.2 (b), we show the corresponding set of stream parts that lie outside of the MacLeish property.
boundary_plot +geom_sf(data =st_intersection(boundary, streams), color ="blue", size =1.5 )boundary_plot +geom_sf(data =st_difference(streams, boundary), color ="blue", size =1.5 )
Different spatial geometries intersect in different ways. Above, we saw that the intersection of steams (which are LINESTRINGs) and the boundary (which is a POLYGON) produced LINESTRING geometries. Below, we compute the intersection of the streams with the trails that exist at MacLeish. The trails are also LINESTRING geometries, and the intersection of two LINESTRING geometries produces a set of POINT geometries.
Simple feature collection with 10 features and 3 fields
Geometry type: GEOMETRY
Dimension: XY
Bounding box: xmin: -72.68389 ymin: 42.44529 xmax: -72.67941 ymax: 42.45644
Geodetic CRS: WGS 84
name color Id geometry
8 entry trail - 3 POINT (-72.68015 42.44796)
9 Eastern Loop Blue 3 POINT (-72.67989 42.45126)
13 Snowmobile Trail <NA> 3 MULTIPOINT ((-72.67992 42.4...
15 Driveway <NA> 3 POINT (-72.67968 42.4495)
6 Western Loop Red 3 POINT (-72.68267 42.44952)
1 Porcupine Trail White 3 POINT (-72.68293 42.45028)
6.1 Western Loop Red 3 POINT (-72.68299 42.45024)
4 Vernal Pool Loop Yellow 3 POINT (-72.68389 42.44893)
3 Poplar Hill Road Road 3 POINT (-72.68152 42.45644)
14 Snowmobile Trail <NA> 3 POINT (-72.68153 42.45644)
Note that one of the features is a MULTIPOINT. Occasionally, a trail might intersect a stream in more than one place (resulting in a MULTIPOINT geometry). This occurs here, where the Snowmobile Trail intersects one of the stream segments in two different places. To clean this up, we first use the st_cast() function to convert everything to MULTIPOINT, and then cast everything to POINT. (We can’t go straight to POINT because we start with a mixture of POINTs and MULTIPOINTs.)
Note that we now have 11 features instead of 10. In this case, the intersections of trails and streams has a natural interpretation: these must be bridges of some type! How else could the trail continue through the stream? Figure 18.3 shows the trails, the streams, and these “bridges” (some of the points are hard to see because they partially overlap).
boundary_plot +geom_sf(data = trails, color ="brown", size =1.5) +geom_sf(data = streams, color ="blue", size =1.5) +geom_sf(data = bridges, pch =21, fill ="yellow", size =3)
Figure 18.3: Bridges on the MacLeish property where trails and streams intersect.
18.2 Geospatial aggregation
In Section 18.1.2, we saw how we can split MULTIPOINT geometries into POINT geometries. This was, in a sense, geospatial disaggregation. Here, we consider the perhaps more natural behavior of spatial aggregation.
Just as we saw previously that the intersection of different geometries can produce different resulting geometries, so too will different geometries aggregate in different ways. For example, POINT geometries can be aggregated into MULTIPOINT geometries.
The sf package implements spatial aggreation using the same group_by() and summarize() function that you learned in Chapter 4. The only difference is that we might have to specify how we want the spatial layers to be aggregated. The default aggregation method is st_union(), which makes sense for most purposes.
Note that the trails layer is broken into segments: the Western Loop is comprised of three different features.
trails
Simple feature collection with 15 features and 2 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: -72.68513 ymin: 42.44177 xmax: -72.67731 ymax: 42.4618
Geodetic CRS: WGS 84
First 10 features:
name color geometry
1 Porcupine Trail White LINESTRING (-72.68291 42.45...
2 Western Loop Red LINESTRING (-72.68111 42.45...
3 Poplar Hill Road Road LINESTRING (-72.68155 42.45...
4 Vernal Pool Loop Yellow LINESTRING (-72.68265 42.44...
5 Eastern Loop Blue LINESTRING (-72.68113 42.45...
6 Western Loop Red LINESTRING (-72.68401 42.45...
7 Western Loop Red LINESTRING (-72.68088 42.44...
8 entry trail - LINESTRING (-72.68088 42.44...
9 Eastern Loop Blue LINESTRING (-72.68063 42.45...
10 Easy Out Red LINESTRING (-72.67962 42.45...
Which trail is the longest? We know we can compute the length of the features with st_length(), but then we’d have to add up the lengths of each segment. Instead, we can aggregate the segments and do the length computation on the full trails.
In Section 17.4.3, we show how the inner_join() function can be used to merge geospatial data with additional data. This works because the geospatial data was stored in an sf object, which is also a data frame. In that case, since the second data frame was not spatial, by necessity the key on which the join was performed was a non-spatial attribute.
A geospatial join is a fundamentally different type of operation, in which both data frames are geospatial, and the joining key is a geospatial attribute. This operation is implemented by the st_join() function, which behaves similarly to the inner_join() function, but with some additional complexities due to the different nature of its task.
To illustrate this, we consider the question of in which type of forest the two campsites at MacLeish lie (see Figure 18.4).
Figure 18.4: The MacLeish property has two campsites and many different types of forest.
It is important to note that this question is inherently spatial. There simply is no variable between the forests layer and the camp_sites layer that would allow you to connect them other than their geospatial location.
Like inner_join(), the st_join() function takes two data frames as its first two arguments. There is no st_left_join() function, but instead st_join() takes a left argument, that is set to TRUE by default. Finally, the join argument takes a predicate function that determines the criteria for whether the spatial features match. The default is st_intersects(), but here we employ st_within(), since we want the POINT geometries of the camp sites to lie within the POLYGON geometries of the forests.
st_join(camp_sites, forests, left =FALSE, join = st_within) |>select(name, type)
Simple feature collection with 2 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -72.67958 ymin: 42.45098 xmax: -72.67815 ymax: 42.45855
Geodetic CRS: WGS 84
# A tibble: 2 × 3
name type geometry
<chr> <chr> <POINT [°]>
1 Group Campsite Old Field White Pine Forest (-72.67815 42.45098)
2 Remote Campsite Sugar Maple Forest (-72.67958 42.45855)
We see that the Group Campsite is in a Eastern White Pine forest, while the Remote Campsite is in a Sugar Maple forest.
18.4 Extended example: Trail elevations at MacLeish
Many hiking trails provide trail elevation maps (or elevation profiles) that depict changes in elevation along the trail. These maps can help hikers understand the interplay between the uphill and downhill segments of the trail and where they occur along their hike.
\[
rating = \sqrt{gain \cdot 2 \cdot distance}
\] A rating below 50 corresponds to the easiest class of hike.
In this example, we will construct an elevation profile and compute the trail rating for the longest trail at MacLeish. The macleish package contains elevation contours that are 30 feet apart. These are relatively sparse, but they will suffice for our purposes.
Next, we compute the geospatial intersection between the trails and the elevation contours, both of which are LINESTRING geometries. This results in POINTs, but just as we saw above, there are several places where the trail crosses the same contour line more than once. If you think about walking up a hill and then back down the other side, this should make sense (see Figure 18.5). These multiple intersections result in MULTIPOINT geometries. We unravel these just as we did before: by casting everything to MULTIPOINT and then casting everything back to POINT.
Figure 18.5 reveals that the Snowmobile trail starts near the southernmost edge of the property at about 750 feet above sea level, snakes along a ridge at 780 feet, before climbing to the local peak at 870 feet, and finally descending the back side of the hill near the northern border of the property.
Figure 18.5: The Snowmobile Trail at MacLeish, with contour lines depicted.
Finally, we need to put the features in order, so that we can compute the distance from the start of the trail. Luckily, in this case the trail goes directly south to north, so that we can use the latitude coordinate as an ordering variable.
In this case, we use st_distance() to compute the geodesic distances between the elevation contours. This function returns a \(n \times n\)matrix with all of the pairwise distances, but since we only want the distances from the southernmost point (which is the first element), we only select the first column of the resulting matrix.
To compute the actual distances (i.e., along the trail) we would have to split the trail into pieces. We leave this as an exercise.
Lovelace, Nowosad, and Muenchow (2019) and Engel (2019) were both helpful in updating this material to take advantage of sf.
18.6 Exercises
Problem 1 (Medium): The Violations data frame in the mdsr package contains information on violations noted in Board of Health inspections of New York City restaurants. These data contain spatial information in the form of addresses and zip codes.
Use the geocode function in tidygeocoder to obtain spatial coordinates for these restaurants.
Using the spatial coordinates you obtained in the previous exercise, create an informative static map using ggspatial that illustrates the nature and extent of restaurant violations in New York City.
Using the spatial coordinates you obtained in the previous exercises, create an informative interactive map using leaflet that illustrates the nature and extent of restaurant violations in New York City.
Problem 2 (Medium):
Use the spatial data in the macleish package and ggspatial to make an informative static map of the MacLeish Field Station property.
Use the spatial data in the macleish package and leaflet to make an informative interactive map of the MacLeish Field Station property.
Problem 3 (Hard): GIS data in the form of shapefiles is all over the Web. Government agencies are particularly good sources for these. The following code downloads bike trail data in Massachusetts from MassGIS. Use bike_trails to answer the following questions:
Driver: ESRI Shapefile
Available layers:
layer_name geometry_type features fields
1 BIKETRAILS_ARC Measured Line String 8277 10
crs_name
1 NAD83 / Massachusetts Mainland
bike_trails <-read_sf(dsn)
How many distinct bike trail segments are there?
What is the longest individual bike trail segment?
How many segments are associated with the Norwottuck Rail Trail?
Among all of the named trails (which may have multiple features), which one has the longest total length?
The bike trails are in a Lambert conformal conic projection. Note that the units of the coordinates are very different from lat/long. In order to get these data onto our leaflet map, we need to re-project them. Convert the bike trails to EPSG:4326, and create a leaflet map.
Color-code the bike trails based on their length, and add an informative legend to the plot.
Problem 4 (Hard): The MacLeish snowmobile trail map generated in the book is quite rudimentary. Generate your own map that improves upon the aesthetics and information content.
Engel, Claudia A. 2019. R for Geospatial Analysis and Mapping. The Geographic Information Science & Technology Body of Knowledge. https://doi.org/10.22224/gistbok/2019.1.3.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. CRC Press: Boca Raton, FL. https://geocompr.robinlovelace.net/.
The Five College Consortium consists of Amherst, Hampshire, Mount Holyoke, and Smith Colleges, as well as the University of Massachusetts-Amherst.↩︎
Google Maps requires an API key backed up by either a credit card or credits requested by an instructor.↩︎