diff --git a/episodes/07-vector-data-in-python.md b/episodes/07-vector-data-in-python.md index d8ee7716..2a84e114 100644 --- a/episodes/07-vector-data-in-python.md +++ b/episodes/07-vector-data-in-python.md @@ -19,20 +19,23 @@ questions: ## Introduction -As discussed in [Episode 2: Introduction to Vector Data](02-intro-vector-data.md), vector data represents specific features on the Earth's surface using points, lines, and polygons. These geographic elements can then have one or more attributes assigned to them, such as 'name' and 'population' for a city, or crop type for a field. Vector data can be much smaller in (file) size than raster data, while being very rich in terms of the information captured. +In the preceding episodes, we have prepared raster datasets to analyze the wildfire in Rhodes Island. To evaluate its impact, we also need to establish our Area of Interest (AoI), which includes vital infrastructure and built-up areas. In this episode, we are going to extract the AoI from open datasets for our study area and save them for subsequent analysis. -In this episode, we will be moving from working with raster data to working with vector data. We will use Python to open and plot point, line, and polygon vector data. In particular, we will make use of the [`geopandas`](https://geopandas.org/en/stable/) package to open, manipulate and write vector datasets. +We'll be examining vector datasets that represent the AoI. As mentioned in [Episode 2: Introduction to Vector Data](02-intro-vector-data.md), vector data uses points, lines, and polygons to depict specific features on the Earth's surface. These geographic elements can have one or more attributes, like 'name' and 'population' for a city. We'll be using two open data sources in this section: the Database of Global Administrative Areas (GADM) dataset for a polygon of our study area and Open Street Map data for the vital infrastructure. + +We'll be using the Python package [`geopandas`](https://geopandas.org/en/stable/) to handle vector data. This package allows us to open, manipulate, and write vector datasets. ![](fig/E07/pandas_geopandas_relation.png){alt="Pandas and Geopandas"} -`geopandas` extends the popular `pandas` library for data analysis to geospatial applications. The main `pandas` objects (the `Series` and the `DataFrame`) are expanded to `geopandas` objects (`GeoSeries` and `GeoDataFrame`). This extension is implemented by including geometric types, represented in Python using the `shapely` library, and by providing dedicated methods for spatial operations (union, intersection, etc.). The relationship between `Series`, `DataFrame`, `GeoSeries` and `GeoDataFrame` can be briefly explained as follow: +`geopandas` enhances the widely-used `pandas` library for data analysis by extending its functionality to geospatial applications. The primary `pandas` objects (`Series` and `DataFrame`) are extended to `geopandas` objects (`GeoSeries` and `GeoDataFrame`). This extension is achieved by incorporating geometric types, represented in Python using the `shapely` library, and by offering dedicated methods for spatial operations like `union` and `intersection`. Here's a brief explanation of the relationship between `Series`, `DataFrame`, `GeoSeries`, and `GeoDataFrame`: + +- A `Series` is a one-dimensional array with an axis that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.) +- A `DataFrame` is a two-dimensional labeled data structure with columns that can potentially hold different types of data. +- A `GeoSeries` is a `Series` object designed to store shapely geometry objects. +- A `GeoDataFrame` is an extended `pandas.DataFrame` that includes a column with geometry objects, which is a `GeoSeries`. - - A `Series` is a one-dimensional array with axis, holding any data type (integers, strings, floating-point numbers, Python objects, etc.) - - A `DataFrame` is a two-dimensional labeled data structure with columns of potentially different types1. - - A `GeoSeries` is a `Series` object designed to store shapely geometry objects. - - A `GeoDataFrame` is an extened `pandas.DataFrame`, which has a column with geometry objects, and this column is a `GeoSeries`. +In the upcoming sections, we'll explore how to work with raster and vector data simultaneously. -In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot. :::callout ## Introduce the Vector Data @@ -40,405 +43,324 @@ In later episodes, we will learn how to work with raster and vector data togethe In this episode, we will use the downloaded vector data in the `data` directory. Please refer to the [setup page](../learners/setup.md) on how to download the data. ::: -## Import Vector Datasets +## Get the administration boundary of study area + +We will use the `geopandas` package to load the administrative areas of Greece, which we downloaded at: `data/data/gadm/greece.gpkg`. This file contains all administrative areas from Greece. ```python import geopandas as gpd +gdf_greece = gpd.read_file('./data/gadm/greece.gpkg') ``` - -We will use the `geopandas` package to load the crop field vector data we downloaded at: `data/brpgewaspercelen_definitief_2020_small.gpkg`. +We can print out the `gdf_greece`variable: ```python -fields = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg") -fields +gdf_greece ``` - - -The data are read into the variable `fields` as a `GeoDataFrame`. This is an extened data format of `pandas.DataFrame`, with an extra column `geometry`. - -This file contains a relatively large number of crop field parcels. Directly loading a large file to memory can be slow. If the Area of Interest (AoI) is small, we can define a bounding box of the AoI, and only read the data within the extent of the bounding box. +```output +GID_3 GID_0 COUNTRY GID_1 NAME_1 \ +0 GRC.1.1.1_1 GRC Greece GRC.1_1 Aegean +1 GRC.1.1.2_1 GRC Greece GRC.1_1 Aegean +2 GRC.1.1.3_1 GRC Greece GRC.1_1 Aegean +.. ... ... ... ... ... +324 GRC.8.2.24_1 GRC Greece GRC.8_1 Thessaly and Central Greece +325 GRC.8.2.25_1 GRC Greece GRC.8_1 Thessaly and Central Greece + + NL_NAME_1 GID_2 NAME_2 NL_NAME_2 \ +0 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο +1 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο +2 Αιγαίο GRC.1.1_1 North Aegean Βόρειο Αιγαίο +.. ... ... ... ... +324 Θεσσαλία και Στερεά Ελλάδα GRC.8.2_1 Thessaly Θεσσαλία +325 Θεσσαλία και Στερεά Ελλάδα GRC.8.2_1 Thessaly Θεσσαλία +... +324 POLYGON ((22.81903 39.27344, 22.81884 39.27332... +325 POLYGON ((23.21375 39.36514, 23.21272 39.36469... + +[326 rows x 17 columns] +``` + +The data are read into the variable fields as a `GeoDataFrame`. This is an extened data format of `pandas.DataFrame`, with an extra column `geometry`. We can use the `plot()` function to visualize `gdf_greece`: ```python -# Define bounding box -xmin, xmax = (110_000, 140_000) -ymin, ymax = (470_000, 510_000) -bbox = (xmin, ymin, xmax, ymax) +gdf_greece.plot() ``` +![](fig/E07/greece_administration_areas.png){alt="greece_administrations"} + -Using the `bbox` input argument, we can load only the spatial features intersecting the provided bounding box. +Next, we'll focus on isolating the administrative area of Rhodes Island. This information is located in the `NAME_3` column of `gdf_greece`, where Rhodes Island is listed as "Rhodos". For the sake of this course, let's say this information was shared with us. In a real-world scenario, you would typically discover this by examining the metadata, if available, or by manually exploring the columns. We'll now create a new variable that exclusively represents Rhodes Island. ```python -# Partially load data within the bounding box -fields = gpd.read_file("data/brpgewaspercelen_definitief_2020_small.gpkg", bbox=bbox) +gdf_rhodes = gdf_greece.loc[gdf_greece['NAME_3']=='Rhodos'] ``` -:::callout -## How should I define my bounding box? -For simplicity, here we assume the **Coordinate Reference System (CRS)** and **extent** of the vector file are known (for instance they are provided in the dataset documentation). - -You can also define your bounding box with online coordinates visualization tools. For example, we can use the "Draw Rectangular Polygon" tool in [geojson.io](https://geojson.io/#map=8.62/52.45/4.96). - -Some Python tools, e.g. [`fiona`](https://fiona.readthedocs.io/en/latest/)(which is also the backend of `geopandas`), provide the file inspection functionality without the need to read the full data set into memory. An example can be found in [the documentation of fiona](https://fiona.readthedocs.io/en/latest/manual.html#format-drivers-crs-bounds-and-schema). - -::: - And we can plot the overview by: ```python -fields.plot() +gdf_rhodes.plot() ``` -![](fig/E07/fields.png){alt="Crop fields inside the AOI"} +![](fig/E07/rhodes_administration_areas.png){alt="rhodes_administrations"} -## Vector Metadata & Attributes -When we read the vector dataset with Python (as our `fields` variable) it is loaded as a `GeoDataFrame` object. The `read_file()` function also automatically stores geospatial information about the data. We are particularly interested in describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated -with each vector object. +Now, we have the administrative area of Rhodes Island. We can use the `to_file()` function save this to a file for future use. -We will explore +```python +# Save the rhodes_boundary to gpkg +gdf_rhodes.to_file('rhodes.gpkg') +``` -1. **Object Type:** the class of the imported object. -2. **Coordinate Reference System (CRS):** the projection of the data. -3. **Extent:** the spatial extent (i.e. geographic area that the data covers). Note that the spatial extent for a vector dataset represents the combined extent for all spatial objects in the dataset. +## Get the vital infrastructure and built-up areas -Each `GeoDataFrame` has a `"geometry"` column that contains geometries. In the case of our `fields` object, this geometry is represented by a `shapely.geometry.Polygon` object. `geopandas` uses the `shapely` library to represent polygons, lines, and points, so the types are inherited from `shapely`. +### Highway data from Open Street Map (OSM) -We can view the metadata using the `.crs`, `.bounds` and `.type` attributes. First, let's view the -geometry type for our crop field dataset. To view the geometry type, we use the `pandas` method `.type` on the `GeoDataFrame` object, `fields`. +Now that we have the boundary of our study area, we will make use this to select the main highways in our study area. We will make the following processing steps: -```python -fields.type -``` +1. Select highways of study area +2. Select key infrastruture highways: 'primary', 'secondary', 'tertiary' +3. Create a 100m buffer around the rounds. This buffer will be regarded as the infrastructure region. (note that this buffer is arbitrary and can be changed afterwards if you want!) +#### Step 1: Select highways of study area -```output -0 Polygon -1 Polygon -2 Polygon -3 Polygon -4 Polygon - ... -22026 Polygon -22027 Polygon -22028 Polygon -22029 Polygon -22030 Polygon -Length: 22031, dtype: object -``` +For this course, we have created a subset of the OSM data for Greece in the file `highways.gpkg`. However we have not made a subselection of our study area yet! We will do this in the next steps. -To view the CRS metadata: +Let's load `highways.gpkg` and plot it: ```python -fields.crs +gdf_highways = gpd.read_file('./data/osm/highways.gpkg') ``` +We can explore it using the same commands as above: -```output - -Name: Amersfoort / RD New -Axis Info [cartesian]: -- X[east]: Easting (metre) -- Y[north]: Northing (metre) -Area of Use: -- name: Netherlands - onshore, including Waddenzee, Dutch Wadden Islands and 12-mile offshore coastal zone. -- bounds: (3.2, 50.75, 7.22, 53.7) -Coordinate Operation: -- name: RD New -- method: Oblique Stereographic -Datum: Amersfoort -- Ellipsoid: Bessel 1841 -- Prime Meridian: Greenwich +```python +gdf_highways.plot() ``` -Our data is in the CRS **RD New**. The CRS is critical to -interpreting the object's extent values as it specifies units. To find -the extent of our dataset in the projected coordinates, we can use the `.total_bounds` attribute: +![](fig/E07/greece_highways.png){alt="greece_highways"} + +As you may have noticed, loading and plotting `highways.gpkg` takes a bit long. This is because the file contains all the highways in Greece, and we are only interested in the highways in Rhodes Island. We can use the `mask` parameter of the `read_file()` function to load only the highways in Rhodes Island. ```python -fields.total_bounds +# Read data with a mask of Rhodes +gdf_highways = gpd.read_file('./data/osm/highways.gpkg', mask=gdf_rhodes) ``` +Now we can plot the highways in Rhodes Island: -```output -array([109222.03325 , 469461.512625, 140295.122125, 510939.997875]) +```python +gdf_highways.plot() ``` -This array contains, in order, the values for minx, miny, maxx and maxy, for the overall dataset. The spatial extent of a GeoDataFrame represents the geographic "edge" or location that is the furthest north, south, east, and west. Thus, it represents the overall geographic coverage of the spatial object. - -We can convert these coordinates to a bounding box or acquire the index of the Dataframe to access the geometry. Either of these polygons can be used to clip rasters (more on that later). - +![](fig/E07/rhodes_highways.png){alt="rhodes_highways"} -## Further crop the dataset +#### Step 2: Select key infrastruture highways -We might realize that the loaded dataset is still too large. If we want to refine our area of interest to an even smaller extent, without reloading the data, we can use the [`cx`](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.cx.html) indexer: +The classes of the highways are stored in the `fclass` column. To get an overview of the different classes, we can use the `unique()` function: - ```python - # A smaller bounding box in RD - xmin, xmax = (120_000, 135_000) - ymin, ymax = (485_000, 500_000) - - fields_cx = fields.cx[xmin:xmax, ymin:ymax] - ``` +```python +gdf_highways['fclass'].unique() +``` -## Export data to file +```output +array(['path', 'residential', 'secondary', 'tertiary', 'secondary_link', + 'motorway_link', 'footway', 'pedestrian', 'living_street', + 'service', 'unclassified', 'primary', 'trunk', 'track', + 'track_grade2', 'track_grade3', 'motorway', 'primary_link', + 'steps', 'tertiary_link', 'track_grade1', 'track_grade4', + 'trunk_link', 'track_grade5', 'cycleway', 'bridleway'], + dtype=object) +``` -We will save the cropped results to a shapefile (`.shp`) and use it later. The `to_file` function can be used for exportation: +It seems the variable `gdf_highways` contains all kind of hiking paths and footpaths as well. Since we are only interested in vital infrastructure, classified as "primary", "secondary" and "tertiary" highways, we are going to make a subselection of these highways by the column `fclass`: ```python -fields_cx.to_file('fields_cropped.shp') +# Extract infrastratcure highways +infra_labels = ['primary', 'secondary', 'tertiary'] +infra_highways = gdf_highways.loc[gdf_highways['fclass'].isin(infra_labels)] ``` +We can plot the infrastructure highways: +```python +infra_highways.plot() +``` -This will write it to disk (in this case, in 'shapefile' format), containing only the data from our cropped area. It can be read again at a later time using the `read_file()` method we have been using above. Note that this actually writes multiple files to disk (`fields_cropped.cpg`, `fields_cropped.dbf`, `fields_cropped.prj`, `fields_cropped.shp`, `fields_cropped.shx`). All these files should ideally be present in order to re-read the dataset later, although only the `.shp`, `.shx`, and `.dbf` files are mandatory (see the [Introduction to Vector Data](02-intro-vector-data.md) lesson for more information.) - +![](fig/E07/rhodes_infra_highways.png){alt="rhodes_infra_highways"} -## Selecting spatial features +#### Step 3: Create a 100m buffer around the highways -From now on, we will take in a point dataset `brogmwvolledigeset.zip`, which is the underground water monitoring wells. We will perform vector processing on this dataset, together with the cropped field polygons `fields_cropped.shp`. +After selecting the key infrastructure highways, we will create a 100m buffer around them. This buffer will be regarded as the infrastructure region. -Let's read the two datasets. +In order to create a 100m buffer, we need to first transform the data into a CRS with length unit. The current data has a geographic coordinate system with easures in degrees but not meter. It is therefore not possible to put a buffer around. In our case we decided to project the data as GGRS87, with EPSG code 2100. ```python -fields = gpd.read_file("fields_cropped.shp") -wells = gpd.read_file("data/brogmwvolledigeset.zip") +# Covert to a crs with meters unit +infra_highways_meters = infra_highways.to_crs(2100) ``` - -And take a look at the wells: +After the CRS conversion, we can create a buffer around the highways: ```python -wells.plot(markersize=0.1) +# Create buffer, note that the gdf changed to a GeoSeries +infra_highways_meters_buffer = infra_highways_meters.buffer(100) +infra_highways_meters_buffer ``` -![](fig/E07/wells-nl.png){alt="all wells in the NL"} - -The points represents all the wells over the Netherlands. Since the wells are in the lat/lon coordinates. To make it comparable with fields, we need to first transfer the CRS to the "RD New" projection: - -```python -wells = wells.to_crs(epsg=28992) +```output +45 POLYGON ((835416.042 4003863.922, 835427.364 4... +58 POLYGON ((834402.929 4003012.515, 834401.865 4... + ... +18879 POLYGON ((879188.948 4038486.610, 879196.752 4... +18880 POLYGON ((879190.669 4038322.167, 879181.177 4... +Length: 1369, dtype: geometry ``` - -Now we would like to compare the wells with the cropped fields. We can select the wells within the fields using the `.clip` function: +Note that the type of the `infra_highways_meters_buffer` is a `GeoSeries` and not a `GeoDataFrame`. This is because the `buffer()` function returns a `GeoSeries` object. ```python -wells_clip = wells.clip(fields) -wells_clip +type(infra_highways_meters_buffer) ``` - ```output -bro_id delivery_accountable_party quality_regime ... -40744 GMW000000043703 27364178 IMBRO/A ... -38728 GMW000000045818 27364178 IMBRO/A ... -... ... ... ... ... -40174 GMW000000043963 27364178 IMBRO/A ... -19445 GMW000000024992 50200097 IMBRO/A ... -[79 rows x 40 columns] +geopandas.geoseries.GeoSeries ``` -After this selection, all the wells outside the fields are dropped. This takes a while to execute, because we are clipping a relatively large number of points with many polygons. - -If we do not want a precise clipping, but rather have the points in the neighborhood of the fields, we will need to create another polygon, which is slightly bigger than the coverage of the field. To do this, we can increase the size of the field polygons, to make them overlap with each other, and then merge the overlapping polygons together. - -We will first use the `buffer` function to increase field size with a given `distance`. The unit of the `distance` argument is the same as the CRS. Here we use a 50-meter buffer. Also notice that the `.buffer` function produces a `GeoSeries`, so to keep the other columns, we assign it to the `GeoDataFrame` as a geometry column. +Now that we have a buffer, we can convert it back to the geographic coordinate system to keep the data consistent: ```python -buffer = fields.buffer(50) -fields_buffer = fields.copy() -fields_buffer['geometry'] = buffer -fields_buffer.plot() +# Convert back to the original crs +infra_highways_buffer = infra_highways_meters_buffer.to_crs(infra_highways.crs) +infra_highways_buffer ``` +```output +45 POLYGON ((835416.042 4003863.922, 835427.364 4... +58 POLYGON ((834402.929 4003012.515, 834401.865 4... + ... +18879 POLYGON ((879188.948 4038486.610, 879196.752 4... +18880 POLYGON ((879190.669 4038322.167, 879181.177 4... +Length: 1369, dtype: geometry +``` -![](fig/E07/fields-buffer.png){alt="50m buffer around the fields"} +We can confirm that the buffer is created by the `Polygon` geometry type in `infra_highways_buffer`. -To further simplify them, we can use the `dissolve` function to dissolve the buffers into one: +Reprojecting and buffering our data is something that we are going to do multiple times during this episode. To avoid have to call the same functions multiple times it would make sense to create a function. Let us create a function in which we can add the buffer as a variable. ```python -fields_buffer_dissolve = fields_buffer.dissolve() -fields_buffer_dissolve +def buffer_crs(gdf, size, meter_crs=2100, target_crs=4326): + return gdf.to_crs(meter_crs).buffer(size).to_crs(target_crs) ``` - -All the fields will be dissolved into one multi-polygon, which can be used to `clip` the wells. +For example, we can use this function to create a 200m buffer around the infrastructure highways: ```python -wells_clip_buffer = wells.clip(fields_buffer_dissolve) -wells_clip_buffer.plot() +infra_highways_buffer_200 = buffer_crs(infra_highways, 200) ``` +### Get built-up regions from Open Street Map (OSM) + +Now that we have our highways, we are also interested in the built-up areas. To get this part, we will use the land use data from OSM, which is stored in the file `data/landuse.gpkg`. This file includes the land use data for the entire Greece. We assume the built-up regions to be the union of three types of land use: "commercial", "industrial", and "residential". -![](fig/E07/wells-in-buffer.png){alt="Wells within 50m buffer of fields"} +Note that for the simplicity of this course, we limit the built-up regions to these three types of land use. In reality, the built-up regions can be more complex. More information from OSM, such as buildings and transportation areas, can be used to define the built-up regions. -In this way, we selected all wells within the 50m range of the fields. It is also significantly faster than the previous `clip` operation, since the number of polygons is much smaller after `dissolve`. +You should be able to complete this task by yourself with the knowledge you have gained from the previous steps. :::challenge -## Exercise: clip fields within 500m from the wells -This time, we will do a selection the other way around. Can you clip the field polygons (stored in fields_cropped.shp) with the 500m buffer of the wells (stored in brogmwvolledigeset.zip)? Please visualize the results. +## Exercise: Get the built-up regions -- Hint 1: The file `brogmwvolledigeset.zip` is in CRS 4326. Don’t forget the CRS conversion. +Create a `builtup_buffer` from the file `data/landuse.gpkg` by the following steps: -- Hint 2: `brogmwvolledigeset.zip` contains all the wells in the Netherlands, which means it might be too big for the `.buffer()` function. To improve the performance, first crop it with the bounding box of the fields. +1. Load the land use data from `data/landuse.gpkg` and mask it with the administrative boundary of Rhodes Island (`gdf_rhodes`). +2. Select the land use data for "commercial", "industrial", and "residential". +3. Create a 10m buffer around the land use data. +4. Visualize the results. -::::solution -```python -# Read in data -fields = gpd.read_file("fields_cropped.shp") -wells = gpd.read_file("data/brogmwvolledigeset.zip") - -# Crop points with bounding box -xmin, ymin, xmax, ymax = fields.total_bounds -wells = wells.to_crs(28992) -wells_cx = wells.cx[xmin-500:xmax+500, ymin-500:ymax+500] - -# Create buffer -wells_cx_500mbuffer = wells_cx.copy() -wells_cx_500mbuffer['geometry'] = wells_cx.buffer(500) - -# Clip -fields_clip_buffer = fields.clip(wells_cx_500mbuffer) -fields_clip_buffer.plot() -``` -![](fig/E07/fields-in-buffer-clip.png){alt="fields within 50m buffer of the wells, truncated"} +After completing the exercise, answer the following questions: -:::: -::: +1. How many unique land use types are there in `landuse.gpkg`? +2. After selecting the three types of land use, how many entries (rows) are there in the results? +Hints: -## Spatially join the features - -In the exercise, we clipped the fields polygons with the 500m buffers of wells. The results from this clipping changed the shape of the polygons. If we would like to keep the original shape of the fields, one way is to use the `sjoin` function, which join two `GeoDataFrame`'s on the basis of their spatial relationship: +- `data/landuse.gpkg` contains the land use data for the entire Greece. Use the administrative boundary of Rhodes Island (`gdf_rhodes`) to select the land use data for Rhodes Island. +- The land use attribute is stored in the `fclass` column. +- Reuse `buffer_crs` function to create the buffer. +::::solution ```python -# Join fields and wells_cx_500mbuffer -fields_wells_buffer = fields.sjoin(wells_cx_500mbuffer) -print(fields_wells_buffer.shape) -``` +# Read data with a mask of Rhodes +gdf_landuse = gpd.read_file('./data/osm/landuse.gpkg', mask=gdf_rhodes) +# Find number of unique landuse types +print(len(gdf_landuse['fclass'].unique())) -```output -(11420, 46) -``` - +# Extract built-up regions +builtup_labels = ['commercial', 'industrial', 'residential'] +builtup = gdf_landuse.loc[gdf_landuse['fclass'].isin(builtup_labels)] -This will result in a `GeodataFrame` of all possible combinations of polygons and well buffers intersecting each other. Since a polygon can fall into multiple buffers, there will be duplicated field indexes in the results. To select the fields which intersects the well buffers, we can first get the unique indexes, and use the `iloc` indexer to select: +# Create 10m buffer around the built-up regions +builtup_buffer = buffer_crs(builtup, 10) -```python -idx = fields_wells_buffer.index.unique() -fiedls_in_buffer = fields.iloc[idx] +# Get the number of entries +print(len(builtup_buffer)) -fiedls_in_buffer.plot() +# Visualize the buffer +builtup_buffer.plot() ``` - -![](fig/E07/fields-in-buffer-sjoin.png){alt="Fields in 50m buffer of wells, not truncated"} - -## Modify the geometry of a GeoDataFrame - -:::challenge -## Exercise: Investigate the waterway lines -Now we will take a deeper look at the Dutch waterway lines: `waterways_nl`. Let's load the file `status_vaarweg.zip`, and visualize it with the `plot()` function. Can you tell what is wrong with this vector file? - -::::solution -By plotting out the vector file, we can tell that the latitude and longitude of the file are flipped. -```python -waterways_nl = gpd.read_file('data/status_vaarweg.zip') -waterways_nl.plot() +```output +19 +1336 ``` -![](fig/E07/waterways-wrong.png){alt="waterways, rotated"} -:::: -::: +![](fig/E07/rhodes_builtup_buffer.png){alt="rhodes_builtup_buffer"} -:::callout -## Axis ordering -According to the standards, the axis ordering for a CRS should follow the definition provided by the competent authority. For the commonly used EPSG:4326 geographic coordinate system, the EPSG defines the ordering as first latitude then longitude. -However, in the GIS world, it is custom to work with coordinate tuples where the first component is aligned with the east/west direction and the second component is aligned with the north/south direction. -Multiple software packages thus implement this convention also when dealing with EPSG:4326. -As a result, one can encounter vector files that implement either convention - keep this in mind and always check your datasets! +:::: ::: -Sometimes we need to modify the `geometry` of a `GeoDataFrame`. For example, as we have seen in the previous exercise **Investigate the waterway lines**, the latitude and longitude are flipped in the vector data `waterways_nl`. This error needs to be fixed before performing further analysis. +## Merge the infrastructure regions and built-up regions +Now that we have the infrastructure regions and built-up regions, we can merge them into a single region. We would like to keep track of the type after merging, so we will add two new columns: `type` and `code` by converting the `GeoSeries` to `GeoDataFrame`. -Let's first take a look on what makes up the `geometry` column of `waterways_nl`: +First we convert the highways buffer: ```python -waterways_nl['geometry'] +data = {'geometry': infra_highways_buffer, 'type': 'infrastructure', 'code': 1} +gdf_infra = gpd.GeoDataFrame(data) ``` -```output -0 LINESTRING (52.41810 4.84060, 52.42070 4.84090... -1 LINESTRING (52.11910 4.67450, 52.11930 4.67340... -2 LINESTRING (52.10090 4.25730, 52.10390 4.25530... -3 LINESTRING (53.47250 6.84550, 53.47740 6.83840... -4 LINESTRING (52.32270 5.14300, 52.32100 5.14640... - ... -86 LINESTRING (51.49270 5.39100, 51.48050 5.39160... -87 LINESTRING (52.15900 5.38510, 52.16010 5.38340... -88 LINESTRING (51.97340 4.12420, 51.97110 4.12220... -89 LINESTRING (52.11910 4.67450, 52.11850 4.67430... -90 LINESTRING (51.88940 4.61900, 51.89040 4.61350... -Name: geometry, Length: 91, dtype: geometry -``` - - -Each row is a `LINESTRING` object. We can further zoom into one of the rows, for example, the third row: +Then we convert the built-up buffer: ```python -print(waterways_nl['geometry'][2]) -print(type(waterways_nl['geometry'][2])) -``` - -```output -LINESTRING (52.100900002 4.25730000099998, 52.1039 4.25529999999998, 52.111299999 4.24929999900002, 52.1274 4.23449999799999) - +data = {'geometry': builtup_buffer, 'type': 'builtup', 'code': 2} +gdf_builtup = gpd.GeoDataFrame(data) ``` - -As we can see in the output, the `LINESTRING` object contains a list of coordinates of the vertices. In our situation, we would like to find a way to flip the x and y of every coordinates set. A good way to look for the solution is to use the [documentation](https://shapely.readthedocs.io/en/stable/manual.html) of the `shapely` package, since we are seeking to modify the `LINESTRING` object. Here we are going to use the [`shapely.ops.transform`](https://shapely.readthedocs.io/en/stable/manual.html?highlight=shapely.ops.transform#shapely.ops.transform) function, which applies a self-defined function to all coordinates of a geometry. +After that, we can merge the two `GeoDataFrame` into one: ```python -import shapely - -# Define a function flipping the x and y coordinate values -def flip(geometry): - return shapely.ops.transform(lambda x, y: (y, x), geometry) - -# Apply this function to all coordinates and all lines -geom_corrected = waterways_nl['geometry'].apply(flip) +import pandas as pd +gdf_assets = pd.concat([gdf_infra, gdf_builtup]).reset_index(drop=True) ``` +In `gdf_assets`, we can distinguish the infrastructure regions and built-up regions by the `type` and `code` columns. We can plot the `gdf_assets` to visualize the merged regions: -Then we can update the `geometry` column with the corrected geometry `geom_corrected`, and visualize it to check: ```python -# Update geometry -waterways_nl['geometry'] = geom_corrected - -# Visualization -waterways_nl.plot() +gdf_assets.plot(column='type', legend=True) ``` +![](fig/E07/rhodes_assets.png){alt="rhodes_assets"} -![](fig/E07/waterways-corrected.png){alt="waterways, corrected"} +Finally, we can save the `gdf_assets` to a file for future use: -Now the waterways look good! We can save the vector data for later usage: ```python -# Update geometry -waterways_nl.to_file('waterways_nl_corrected.shp') +gdf_assets.to_file('assets.gpkg') ``` - :::keypoints - Load spatial objects into Python with `geopandas.read_file()` function. - Spatial objects can be plotted directly with `GeoDataFrame`'s `.plot()` method. -- Crop spatial objects with `.cx[]` indexer. -- Convert CRS of spatial objects with `.to_crs()`. -- Select spatial features with `.clip()`. +- Convert CRS of spatial objects with `.to_crs()`. Note that this generates a `GeoSeries` object. - Create a buffer of spatial objects with `.buffer()`. -- Merge overlapping spatial objects with `.dissolve()`. -- Join spatial features spatially with `.sjoin()`. +- Merge spatial objects with `pd.concat()`. ::: diff --git a/episodes/fig/E07/fields-buffer.png b/episodes/fig/E07/fields-buffer.png deleted file mode 100644 index b84869f6..00000000 Binary files a/episodes/fig/E07/fields-buffer.png and /dev/null differ diff --git a/episodes/fig/E07/fields-in-buffer-clip.png b/episodes/fig/E07/fields-in-buffer-clip.png deleted file mode 100644 index 46e9de3a..00000000 Binary files a/episodes/fig/E07/fields-in-buffer-clip.png and /dev/null differ diff --git a/episodes/fig/E07/fields-in-buffer-sjoin.png b/episodes/fig/E07/fields-in-buffer-sjoin.png deleted file mode 100644 index 49aae47c..00000000 Binary files a/episodes/fig/E07/fields-in-buffer-sjoin.png and /dev/null differ diff --git a/episodes/fig/E07/fields.png b/episodes/fig/E07/fields.png deleted file mode 100644 index f3b11299..00000000 Binary files a/episodes/fig/E07/fields.png and /dev/null differ diff --git a/episodes/fig/E07/greece_administration_areas.png b/episodes/fig/E07/greece_administration_areas.png new file mode 100644 index 00000000..9f99b42b Binary files /dev/null and b/episodes/fig/E07/greece_administration_areas.png differ diff --git a/episodes/fig/E07/greece_highways.png b/episodes/fig/E07/greece_highways.png new file mode 100644 index 00000000..35162087 Binary files /dev/null and b/episodes/fig/E07/greece_highways.png differ diff --git a/episodes/fig/E07/rhodes_administration_areas.png b/episodes/fig/E07/rhodes_administration_areas.png new file mode 100644 index 00000000..d1c2024c Binary files /dev/null and b/episodes/fig/E07/rhodes_administration_areas.png differ diff --git a/episodes/fig/E07/rhodes_assets.png b/episodes/fig/E07/rhodes_assets.png new file mode 100644 index 00000000..38df2a30 Binary files /dev/null and b/episodes/fig/E07/rhodes_assets.png differ diff --git a/episodes/fig/E07/rhodes_builtup_buffer.png b/episodes/fig/E07/rhodes_builtup_buffer.png new file mode 100644 index 00000000..2944ee21 Binary files /dev/null and b/episodes/fig/E07/rhodes_builtup_buffer.png differ diff --git a/episodes/fig/E07/rhodes_highways.png b/episodes/fig/E07/rhodes_highways.png new file mode 100644 index 00000000..de241aaf Binary files /dev/null and b/episodes/fig/E07/rhodes_highways.png differ diff --git a/episodes/fig/E07/rhodes_infra_highways.png b/episodes/fig/E07/rhodes_infra_highways.png new file mode 100644 index 00000000..641c23cc Binary files /dev/null and b/episodes/fig/E07/rhodes_infra_highways.png differ diff --git a/episodes/fig/E07/spatial_extent.png b/episodes/fig/E07/spatial_extent.png deleted file mode 100644 index e8407385..00000000 Binary files a/episodes/fig/E07/spatial_extent.png and /dev/null differ diff --git a/episodes/fig/E07/waterways-corrected.png b/episodes/fig/E07/waterways-corrected.png deleted file mode 100644 index c3e1dbba..00000000 Binary files a/episodes/fig/E07/waterways-corrected.png and /dev/null differ diff --git a/episodes/fig/E07/waterways-wrong.png b/episodes/fig/E07/waterways-wrong.png deleted file mode 100644 index d2a36906..00000000 Binary files a/episodes/fig/E07/waterways-wrong.png and /dev/null differ diff --git a/episodes/fig/E07/wells-in-buffer.png b/episodes/fig/E07/wells-in-buffer.png deleted file mode 100644 index 5f0d6a1b..00000000 Binary files a/episodes/fig/E07/wells-in-buffer.png and /dev/null differ diff --git a/episodes/fig/E07/wells-nl.png b/episodes/fig/E07/wells-nl.png deleted file mode 100644 index f3e74922..00000000 Binary files a/episodes/fig/E07/wells-nl.png and /dev/null differ