
Commit 1f70bff

e-marshall, scottyhq, and dcherian authored
Writing edits (#43)
* bunch of edits
* add captions
* add file numbers back in
* fix file names and a few links
* add why os section in intro
* Add instructions for executing tutorial notebooks on CryoCloud JupyterHub (#41)
  * jupyterhub instructions
  * wording change
  ---------
  Co-authored-by: e-marshall <[email protected]>
* fix file names and a few links: fixing files that were renamed
* spelling and formatting fixes
* spelling and formatting fixes
* remove files from tracking
* add mkdirs line in s1 nb1 and some formatting changes
* few typo fixes and other things + os section in intro
* updates to datacube revisit and others
* edits from jessica
* clean nbs
* nit
* update gitignore to remove vector data cube
* undo gitignore change, will do in sep pr
* switch build branch back to main
* add note about download time
* Some edits (#42)

---------
Co-authored-by: Scott Henderson <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
1 parent 3d889e7 commit 1f70bff

File tree

2 files changed: +20 -26 lines changed

book/background/2_data_cubes.md

+1 -1
@@ -8,7 +8,7 @@ The term **data cube** is used frequently throughout this book. This page contai
 The key object of analysis in this book is a [raster data cube](https://openeo.org/documentation/1.0/datacubes.html). Raster data cubes are n-dimensional objects that store continuous measurements or estimates of physical quantities that exist along given dimension(s). Many scientific workflows involve examining how a variable (such as temperature, wind speed, relative humidity, etc.) varies over time and/or space. Data cubes are a way of organizing geospatial data that lets us ask these questions.

-A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions ({cite:t}`Baumann_2019_datacube,giuliani_2019_EarthObservationOpen,mahecha_2020_EarthSystemData,montero_2024_EarthSystemData`). While this is a relatively intuitive concept, in practice, the amount and types of information contained within a single dataset, and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Centers ([DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html))), and then we are responsible for organizing the data in a way that lets us ask questions of it. While some of these decisions are straightforward (e.g. *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).
+A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions ({cite:t}`Baumann_2019_datacube,giuliani_2019_EarthObservationOpen,mahecha_2020_EarthSystemData,montero_2024_EarthSystemData`). While this is a relatively intuitive concept, in practice, the amount and types of information contained within a single dataset, and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Centers, or [DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html)), and then we are responsible for organizing the data in a way that lets us ask questions of it. While some of these decisions are straightforward (e.g. *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).

 ### *Two types of information*
 Fundamentally, many of these complexities can be reduced to one distinction: is a particular piece of information a physical observable (the main focus, or target, of the dataset), or is it metadata that provides necessary information in order to properly interpret and handle the physical observable [^mynote2]? Answering this question will help you understand how to situate a piece of information within the broader data object.

book/conclusion/datacubes_revisited.md

+19 -25
@@ -4,12 +4,12 @@ In this book, we saw a range of real-world datasets and the steps required to pr
 Let's first return to the Xarray building blocks described in the background [section](../background/2_data_cubes.md); we can now provide more detailed definitions of what they are and how they should be used:

-:::{admonition} Xarray components of data cubes
-**[Dimensions](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension)** - What is the shape of the data as you understand it? This should be the set of dimensions. Frequently, `(x, y, time)`.
-**[Dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension-coordinate)** - 1-d arrays describing the range and resolution of the data along each dimension.
+:::{admonition} Dissecting data cubes
+**[Dimensions](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension)** - What do the axes of the data represent? This should be the set of dimensions. Frequently, `(x, y, time)`. Dimensions are orthogonal to one another.
+**[Dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension-coordinate)** - Typically 1-d arrays describing the range and resolution of the data along each dimension. Think of these as the axis tick labels on a plot.
 **[Non-dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Non-dimension-coordinate)** - Metadata about the physical observable that varies along one or more dimensions. These can be 1-d up to n-d, where n is the length of `.dims`.
-**[Data variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Variable)** - Scalar values that occupy the grid cells implied by coordinate arrays. The physical observable(s) that are the focus of the dataset.
-**Attributes** - Metadata that can be assigned to a given `xr.Dataset` or `xr.DataArray` that is ***static*** along that object's dimensions.
+**[Data variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Variable)** - Physical observable(s) whose values are known at every point on the grid formed by the dimensions.
+**Attributes** - Metadata that can be assigned to a given `xr.Dataset` or `xr.DataArray` that is ***invariant*** along that object's dimensions.
 :::

 {{break}}
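To make these components concrete, here is a minimal sketch of a cube with each piece labeled. The data, variable names, and attribute values are toy examples, not taken from the tutorials:

```python
import numpy as np
import xarray as xr

# Toy (time, y, x) cube; names and values are illustrative only
ds = xr.Dataset(
    data_vars={
        # Data variable: the physical observable on the (time, y, x) grid
        "temperature": (("time", "y", "x"), np.random.default_rng(0).random((3, 2, 2))),
    },
    coords={
        # Dimensional coordinates: the "tick labels" along each dimension
        "time": np.array(["2020-01-01", "2020-01-02", "2020-01-03"], dtype="datetime64[ns]"),
        "y": [0.0, 30.0],
        "x": [0.0, 30.0],
        # Non-dimensional coordinate: metadata that varies along 'time'
        "platform": ("time", ["A", "B", "A"]),
    },
    # Attribute: invariant along the object's dimensions
    attrs={"description": "toy example cube"},
)
```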
@@ -19,17 +19,19 @@ At the beginning of the book, we also discussed 'tidy data' as it's defined for t
 **How can this data be structured to simplify subsequent analysis?**

-For different types of Xarray objects, we can use the following guidelines:
+Keep in mind that one organization of the data need not make all analyses equally ergonomic. We must be open to transforming the data between equivalent representations, depending on the task at hand.
+
+Here are a few guidelines:
 ::::{tab-set}
 :::{tab-item} Variables
 ### Data variables
-These are the measurements or estimates of your dataset. If there are multiple independent measurements in the dataset, they should be stored as [xr.Variable](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable) objects of a `xr.Dataset`; if the dataset is univariate, use a [`xr.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html).
+These are the measurements or estimates of your dataset. If there are multiple independent measurements in the dataset, they should be stored as data variables in an `xr.Dataset`; if the data are univariate, use an [`xr.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html).
 | | |
 | :-----------:|:---------- |
 | **Guiding Question** | What physical observable(s) is my dataset measuring? |

 #### Relevant examples
-In [Sentinel-1, notebook 3 - Exploratory analysis of ASF data](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb), we saw examples of treating multiple backscatter polarizations as data variables versus a single variable along a `band` dimension.
+In [Sentinel-1, notebook 3 - Exploratory analysis of ASF data](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb), we saw examples of treating multiple backscatter polarizations as data variables versus a single variable along a `band` dimension. We can convert between the two representations using `Dataset.to_array` and `DataArray.to_dataset`.

 | | |
 | :-----------:|:---------- |
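That round trip can be sketched as follows. The backscatter values here are made up for illustration, not data from the notebook:

```python
import numpy as np
import xarray as xr

# Hypothetical backscatter for two polarizations, stored as separate data variables
ds = xr.Dataset(
    {
        "vv": ("time", np.array([0.10, 0.20, 0.30])),
        "vh": ("time", np.array([0.01, 0.02, 0.03])),
    },
    coords={"time": np.arange(3)},
)

# Stack the data variables into one DataArray with a new 'band' dimension
da = ds.to_array(dim="band")  # dims: ('band', 'time')

# Split the 'band' dimension back out into data variables
ds_again = da.to_dataset(dim="band")
```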
@@ -51,15 +53,15 @@ In these situations, a guiding question could be:
 ##### 1. Expanding dimensions v. adding variables
 - Formatting the Sentinel-1 backscatter cube to have `vv` and `vh` data variables versus a `band` dimension with the following coordinate array: `('band', ['vh','vv'])` ([*Sentinel-1 tutorial, notebook 3 - exploratory analysis of ASF data*](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb)).
-- If you are only interested in a single polarization of the dataset, or are looking at backscatter from different polarizations independently of one another, treating backscatter from each polarization as a `data variable` is suitable and maybe even optimal; it can be simpler to perform operations on a single variable rather than an entire dimension.
+- If you are only interested in a single polarization of the dataset, or are looking at backscatter from different polarizations independently of one another, treating backscatter from each polarization as a *data variable* is suitable and maybe even optimal; it can be simpler to perform operations on a single variable rather than an entire dimension.
 - If you are interested in examining backscatter across different polarizations, the different polarizations are most appropriately represented as elements of a dimension.

 | | |
 | :-----------:|:---------- |
-| **Takeaway** | Structure your dataset's dimensions so that data variables are independent of one another. |
+| **Takeaway** | Structure your data cube's dimensions so that data variables are independent of one another. |

 ```{tip}
-If you are working with a dataset where information about how the variables relate to one another is included in the variable name, this is a sign that there should be an additional dimension.
+If you are working with a dataset where information about how the variables relate to one another is included in the variable name (e.g. a year, or a band wavelength), this is a sign that there should be an additional dimension.
 ```

 ##### 2. Compare two datasets by combining them into a single cube with an additional dimension
@@ -70,7 +72,7 @@ If you are working with a dataset where information about how the variables rela
 | | |
 | :-----------:|:---------- |
-| **Takeaway** | The dimensions of a dataset depend on what you want to do with it. |
+| **Takeaway** | Consider either concatenating two cubes along a new dimension, or splitting a dimension into multiple cubes. One approach may be more ergonomic than the other depending on the problem at hand. |
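Both directions can be sketched with toy cubes (the names and values here are hypothetical, not from the tutorials):

```python
import numpy as np
import xarray as xr

# Two hypothetical cubes on the same (y, x) grid, e.g. from two sensors
a = xr.DataArray(np.zeros((2, 2)), dims=("y", "x"), name="v")
b = xr.DataArray(np.ones((2, 2)), dims=("y", "x"), name="v")

# Concatenate along a new 'source' dimension to compare them in one cube
combined = xr.concat([a, b], dim="source").assign_coords(source=["sensor_a", "sensor_b"])

# Split the dimension back into separate cubes when that is more ergonomic
a_again = combined.sel(source="sensor_a", drop=True)
```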
 :::
 :::{tab-item} Coordinates
@@ -83,7 +85,7 @@ A dataset must have a **dimensional coordinate** variable for each dimension in

 #### Relevant examples
 ##### 1. Handling time-varying metadata
-- Metadata that varies over `(time)` should be stored as coordinate variables along the `time` dimension.
+- Metadata that varies over `(time)` should be stored as coordinate variables along the `time` dimension (e.g. whether a scene was taken during an ascending or descending pass).
 - Metadata that varies over `time`, `x`, and `y` should be coordinate variables that exist along those dimensions.
 - *[Sentinel-1 tutorial, metadata wrangling notebook](../sentinel1/nbs/2_wrangle_metadata.ipynb)*
@@ -92,30 +94,22 @@ A dataset must have a **dimensional coordinate** variable for each dimension in
 | **Takeaway** | Assign metadata that varies along a given dimension as a non-dimensional coordinate of that dimension. |
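One way to sketch this pattern, using hypothetical pass-direction metadata rather than the notebook's actual values:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((4, 2, 2)),
    dims=("time", "y", "x"),
    coords={"time": np.arange(4)},
    name="backscatter",
)

# Hypothetical per-scene metadata that varies only along 'time':
# attach it as a non-dimensional coordinate on that dimension
da = da.assign_coords(
    pass_direction=("time", ["ascending", "descending", "ascending", "descending"])
)

# The coordinate travels with any selection along 'time'
scene = da.isel(time=1)
```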

-##### 2. Querying a dataset using coordinate variables
-- Non-dimensional coordinate variables are not indexed.
-- It can be faster to subset the dataset using `ds.sel()` than `ds.where()` (*[ITS_LIVE tutorial, exploratory analysis notebook](../itslive/nbs/4_exploratory_data_analysis_single.ipynb)*).
-
-| | |
-| :-----------:|:---------- |
-| **Takeaway** | To query the dataset using a coordinate, it can be more efficient to express the query in terms of a dimensional coordinate. |
 :::
 :::{tab-item} Attributes
 ### Attributes
-`Attrs` can be assigned to the dataset as a whole or to any of the `xr.DataArray` objects within it.
+`attrs` can be assigned to the dataset as a whole or to any of the `xr.DataArray` objects within it. Many fields have their own conventions for attribute metadata, e.g. the Climate & Forecast (CF) Conventions.

 | | |
 | :-----------:|:---------- |
-| **Guiding Question** | Does a piece of attribute information apply to this *entire* object (e.g. a data variable, a coordinate variable, or a dataset)? If so, it should be stored as an attribute of that object. |
+| **Guiding Question** | Does a piece of attribute information apply to this *entire* object (e.g. a data variable, a coordinate variable, or a dataset)? If so, it should be stored as an attribute of that object. Attributes should conform to an existing standard where possible. |

 and

 | | |
 | :-----------:|:---------- |
 | **Guiding Question** | What tools exist that can help perform the operations that I need to do with this dataset? How must attribute data be stored in order to use them? |
 #### Relevant examples
-##### 1. Attributes must be formatted according to accepted metadata conventions like CF and STAC in order to take advantage of tools built off these specifications
+##### 1. Attributes must conform to accepted metadata conventions like CF and STAC in order to take advantage of tools built on these specifications

 - Using `cf_xarray` with appropriately formatted metadata enables more streamlined access to and interpretation of metadata (*[ITS_LIVE tutorial, data access notebook](../itslive/nbs/1_accessing_itslive_s3_data.ipynb)*)
 - Having appropriate CF metadata enables reading and writing vector data cubes to disk
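For instance, CF-style attributes might be attached like this. The attribute values are illustrative, not the ITS_LIVE dataset's actual metadata:

```python
import numpy as np
import xarray as xr

v = xr.DataArray(
    np.zeros((2, 2)),
    dims=("y", "x"),
    name="v",
    attrs={
        # CF-style variable-level attributes (illustrative values)
        "standard_name": "land_ice_surface_x_velocity",
        "units": "m year-1",
    },
)
ds = v.to_dataset()
# Dataset-level attribute recording the convention in use
ds.attrs["Conventions"] = "CF-1.8"
```

With attributes like these in place, tools such as `cf_xarray` can locate variables by standard name rather than by dataset-specific variable names.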
@@ -138,4 +132,4 @@ Independent objects should be represented as unique `xr.Datasets` (if multivaria
 ##### 2. If you're working with a collection of objects that can be defined by vector geometries, a vector data cube may be an appropriate way to represent the data
 - Use [Xvec](https://xvec.readthedocs.io/en/stable/) to build a vector data cube that has a `'geometry'` dimension; each element of the geometry dimension is a cube that varies over the other dimensions of the cube (frequently `time`) (*[ITS_LIVE tutorial, exploratory analysis of a group of glaciers notebook](../itslive/nbs/5_exploratory_data_analysis_group.ipynb).*)
 :::
-::::
+::::
