* bunch of edits
* add captions
* add file numbers back in
* fix file names and a few links
* add why os section in intro
* Add instructions for executing tutorial notebooks on CryoCloud JupyterHub (#41)
* jupyterhub instructions
* wording change
---------
Co-authored-by: e-marshall <[email protected]>
* fix file names and a few links
fixing files that were renamed
* spelling and formatting fixes
* spelling and formatting fixes
* remove files from tracking
* add mkdirs line in s1 nb1 and some formatting changes
* few typo fixes and other things + os section in intro
* updates to datacube revisit and others
* edits from jessica
* clean nbs
* nit
* update gitignore to remove vector data cube
* undo gitignore change, will do in sep pr
* switch build branch back to main
* add note about download time
* Some edits (#42)
---------
Co-authored-by: Scott Henderson <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
book/background/2_data_cubes.md (+1 -1)
@@ -8,7 +8,7 @@ The term **data cube** is used frequently throughout this book. This page contai
The key object of analysis in this book is a [raster data cube](https://openeo.org/documentation/1.0/datacubes.html). Raster data cubes are n-dimensional objects that store continuous measurements or estimates of physical quantities that exist along given dimension(s). Many scientific workflows involve examining how a variable (such as temperature, windspeed, relative humidity, etc.) varies over time and/or space. Data cubes are a way of organizing geospatial data that let us ask these questions.
-A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions ({cite:t}`Baumann_2019_datacube,giuliani_2019_EarthObservationOpen,mahecha_2020_EarthSystemData,montero_2024_EarthSystemData`). While this is a relatively intuitive concept,in practice, the amount and types of information contained within a single dataset and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Centers ([DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html))), and then we are responsible for organizing the data in a way that let's us ask questions of it. While some of these decisions are straightforward (eg. *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).
+A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions ({cite:t}`Baumann_2019_datacube,giuliani_2019_EarthObservationOpen,mahecha_2020_EarthSystemData,montero_2024_EarthSystemData`). While this is a relatively intuitive concept, in practice the amount and types of information contained within a single dataset, and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Centers, or [DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html)), and then we are responsible for organizing the data in a way that lets us ask questions of it. While some of these decisions are straightforward (e.g. *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).
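For illustration (not part of the original page), a minimal sketch of such an (`x`,`y`,`time`) cube built from synthetic values; the variable name `velocity` and the array sizes are arbitrary choices for this example:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic (x, y, time) raster cube; sizes and the "velocity" name are illustrative.
times = pd.date_range("2020-01-01", periods=4, freq="MS")
x = np.arange(0, 500, 100)
y = np.arange(0, 300, 100)
cube = xr.DataArray(
    np.random.default_rng(0).random((len(x), len(y), len(times))),
    coords={"x": x, "y": y, "time": times},
    dims=("x", "y", "time"),
    name="velocity",
)
```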
### *Two types of information*
Fundamentally, many of these complexities can be reduced to one distinction: is a particular piece of information a physical observable (the main focus, or target, of the dataset), or is it metadata that provides necessary information in order to properly interpret and handle the physical observable [^mynote2]? Answering this question will help you understand how to situate a piece of information within the broader data object.
book/conclusion/datacubes_revisited.md (+19 -25)
@@ -4,12 +4,12 @@ In this book, we saw a range of real-world datasets and the steps required to pr
Let's first return to the Xarray building blocks described in the background [section](../background/2_data_cubes.md); we can now provide more-detailed definitions of what they are and how they should be used:
-:::{admonition} Xarray components of data cubes
-**[Dimensions](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension)** - What is the shape of the data as you understand it? This should be the set of dimensions. Frequently, `(x, y, time)`.
-**[Dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension-coordinate)** - 1-d arrays describing the range and resolution of the data along each dimension.
+:::{admonition} Dissecting data cubes
+**[Dimensions](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension)** - What do the axes of the data represent? This should be the set of dimensions. Frequently, `(x, y, time)`. Dimensions are orthogonal to each other.
+**[Dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Dimension-coordinate)** - Typically 1-d arrays describing the range and resolution of the data along each dimension. Think of these as the axis tick labels on a plot.
**[Non-dimensional coordinate variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Non-dimension-coordinate)** - Metadata about the physical observable that varies along one or more dimensions. These can be 1-d up to n-d where n is the length of `.dims`.
-**[Data variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Variable)** - Scalar values that occupy the grid cells implied by coordinate arrays. The physical observable(s) that are the focus of the dataset.
-**Attributes** - Metadata that can be assigned to a given `xr.Dataset` or `xr.DataArray` that is ***static*** along that object's dimensions.
+**[Data variables](https://docs.xarray.dev/en/latest/user-guide/terminology.html#term-Variable)** - Physical observable(s) whose values are known at every point on the grid formed by the dimensions.
+**Attributes** - Metadata that can be assigned to a given `xr.Dataset` or `xr.DataArray` that is ***invariant*** along that object's dimensions.
:::
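As an illustration of how the components above map onto Xarray objects, a hypothetical sketch (the variable names, the `surface_elevation` coordinate, and the attribute values are invented for this example):

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-06-01", periods=3)  # dimensional coordinate: time
x = np.linspace(0.0, 200.0, 5)                  # dimensional coordinate: x
y = np.linspace(0.0, 100.0, 4)                  # dimensional coordinate: y

ds = xr.Dataset(
    # data variable: the physical observable defined on the (x, y, time) grid
    data_vars={"backscatter": (("x", "y", "time"), np.zeros((5, 4, 3)))},
    coords={
        "x": x,
        "y": y,
        "time": times,
        # non-dimensional coordinate: metadata that varies over (x, y)
        "surface_elevation": (("x", "y"), np.random.default_rng(1).random((5, 4))),
    },
    # attributes: metadata that is invariant along the object's dimensions
    attrs={"description": "synthetic example cube"},
)
ds["backscatter"].attrs["units"] = "dB"
```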
{{break}}
@@ -19,17 +19,19 @@ At the beginning of the book, we also discussed 'tidy data' as its defined for t
**How can this data be structured to simplify subsequent analysis?**
-For different types of Xarray objects, we can use the following guidelines:
+Keep in mind that one organization of the data need not make all analyses equally ergonomic. We must be open to transforming the data between equivalent representations, depending on the task at hand.
+
+Here are a few guidelines:
::::{tab-set}
:::{tab-item} Variables
### Data variables
-These are the measurements or estimates of your dataset. If there are multiple independent measurements in the dataset, they should be stored as [xr.Variable](https://docs.xarray.dev/en/stable/generated/xarray.Variable.html#xarray.Variable) objects of a `xr.Dataset`, if the dataset is univariate, use a [`xr.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html)
+These are the measurements or estimates of your dataset. If there are multiple independent measurements in the dataset, they should be stored as data variables in an `xr.Dataset`; if the data are univariate, use an [`xr.DataArray`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html).
|||
| :-----------:|:---------- |
|**Guiding Question**| What physical observable(s) is my dataset measuring?|
#### Relevant examples
-In [Sentinel-1, notebook 3 - Exploratory analysis of ASF data](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb), we saw examples of treating multiple backscatter polarizations as data variables versus a single variable along a `band` dimension.
+In [Sentinel-1, notebook 3 - Exploratory analysis of ASF data](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb), we saw examples of treating multiple backscatter polarizations as data variables versus a single variable along a `band` dimension. We can convert between the two representations using `Dataset.to_array` and `DataArray.to_dataset`.
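A small sketch of that round trip; the `vv`/`vh` names mirror the Sentinel-1 example, but the values here are synthetic:

```python
import numpy as np
import xarray as xr

# Two polarizations stored as separate data variables (synthetic values).
ds = xr.Dataset(
    {
        "vv": (("x", "y"), np.ones((3, 2))),
        "vh": (("x", "y"), np.zeros((3, 2))),
    }
)

# Data variables -> a single variable along a new 'band' dimension.
da = ds.to_array(dim="band")  # the 'band' coordinate becomes ['vv', 'vh']

# ...and back: split the 'band' dimension into data variables again.
ds_again = da.to_dataset(dim="band")
```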
|||
| :-----------:|:---------- |
@@ -51,15 +53,15 @@ In these situations, a guiding question could be:
##### 1. Expanding dimensions v. adding variables
- Formatting the Sentinel-1 backscatter cube to have `vv` and `vh` data variables versus `band` dimension with the following coordinate array: `('band', ['vh','vv'])` ([*Sentinel-1 tutorial, notebook 3 - exploratory analysis of ASF data*](../sentinel1/nbs/3_asf_exploratory_analysis.ipynb)).
-- If you are only interested in a single polarization of the dataset or looking at backscatter from different polarizations independent of one another, treating backscatter from each polarization as a `data variable` is suitable and maybe even optimal; it can be simpler to perform operations on a single variable rather than an entire dimension.
+- If you are only interested in a single polarization of the dataset or looking at backscatter from different polarizations independent of one another, treating backscatter from each polarization as a *data variable* is suitable and maybe even optimal; it can be simpler to perform operations on a single variable rather than an entire dimension.
- If you are interested in examining backscatter across different polarizations, the different polarizations are most appropriately represented as elements of a dimension.
|||
| :-----------:|:---------- |
-|**Takeaway**| Structure your dataset's dimensions so that data variables are independent of one another. |
+|**Takeaway**| Structure your datacube's dimensions so that data variables are independent of one another. |
```{tip}
-If you are working with a dataset where information about how the variables relate to one another is included in the variable name, this is a sign that there should be an additional dimension.
+If you are working with a dataset where information about how the variables relate to one another is included in the variable name (e.g. a year, or a band wavelength), this is a sign that there should be an additional dimension.
```
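To make the tip above concrete, a hedged sketch (the variable names are invented) of folding year-suffixed variables into a `year` dimension:

```python
import pandas as pd
import xarray as xr

# Hypothetical dataset where the year is baked into the variable names.
ds = xr.Dataset(
    {
        "temperature_2020": ("x", [1.0, 2.0, 3.0]),
        "temperature_2021": ("x", [1.5, 2.5, 3.5]),
    }
)

# Concatenate along a new 'year' dimension instead of keeping related variables apart.
temperature = xr.concat(
    [
        ds["temperature_2020"].rename("temperature"),
        ds["temperature_2021"].rename("temperature"),
    ],
    dim=pd.Index([2020, 2021], name="year"),
)
```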
##### 2. Compare two datasets by combining them into a single cube with an additional dimension
@@ -70,7 +72,7 @@ If you are working with a dataset where information about how the variables rela
|||
| :-----------:|:---------- |
-|**Takeaway**|The dimensions of a dataset depend on what you want to do with it. |
+|**Takeaway**|Consider either concatenating two cubes along a new dimension, or splitting a dimension into multiple cubes. One approach may be more ergonomic than the other depending on the problem at hand. |
:::
:::{tab-item} Coordinates
@@ -83,7 +85,7 @@ A dataset must have a **dimensional coordinate** variable for each dimension in
#### Relevant examples
##### 1. Handling time-varying metadata
-- Metadata that varies over `(time)` should be stored as coordinate variables along the `time` dimension.
+- Metadata that varies over `(time)` should be stored as coordinate variables along the `time` dimension (e.g. whether a scene was taken during an ascending or descending pass).
- Metadata that varies over `time`, `x`, and `y` should be coordinate variables that exist along those dimensions.
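For instance, a sketch of attaching an (illustrative) ascending/descending flag as a non-dimensional coordinate along `time`:

```python
import pandas as pd
import xarray as xr

da = xr.DataArray(
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    coords={"time": pd.date_range("2021-01-01", periods=3), "x": [0, 1]},
    dims=("time", "x"),
    name="backscatter",
)

# Per-time-step metadata stored as a coordinate along the 'time' dimension.
da = da.assign_coords(pass_direction=("time", ["ascending", "descending", "ascending"]))
```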
@@ -92,30 +94,22 @@ A dataset must have a **dimensional coordinate** variable for each dimension in
95
|
94
96
-##### 2. Querying a dataset using coordinate variables
-- Non-dimensional coordinate variables are not indexed.
-- It can be faster to subset the dataset using `ds.sel()` than `ds.where()` (*[ITS_LIVE tutorial, exporatory analysis notebook](../itslive/nbs/4_exploratory_data_analysis_single.ipynb)*).
-
-|||
-| :-----------:|:---------- |
-| **Takeaway** | To query the dataset using a coordinate, it can be more efficient to express the query in terms of a dimensional coordinate. |
:::
:::{tab-item} Attributes
### Attributes
-`Attrs` can be assigned to the dataset as a whole or any of the `xr.DataArray` objects within it.
+`attrs` can be assigned to the dataset as a whole or any of the `xr.DataArray` objects within it. Many fields have their own conventions for attribute metadata, e.g. Climate & Forecast Conventions (CF).
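A minimal sketch of attaching attributes at both levels; the names and values are illustrative, loosely following CF-style keys:

```python
import xarray as xr

ds = xr.Dataset({"velocity": (("x", "y"), [[1.0, 2.0], [3.0, 4.0]])})

# Dataset-level attributes apply to the object as a whole.
ds.attrs["description"] = "synthetic example cube"

# Variable-level attributes apply to this data variable everywhere on its grid.
ds["velocity"].attrs["units"] = "m yr-1"
ds["velocity"].attrs["long_name"] = "surface velocity magnitude"
```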
|||
| :-----------:|:---------- |
-|**Guiding Question**| Does a piece of attribute information apply to this *entire* object (e.g. a data variable, a coordinate variable, or a dataset)? If so, it should be stored as an attribute of that object. |
+|**Guiding Question**| Does a piece of attribute information apply to this *entire* object (e.g. a data variable, a coordinate variable, or a dataset)? If so, it should be stored as an attribute of that object. Attributes should conform to an existing standard where possible.|
and
|||
| :-----------:|:---------- |
|**Guiding Question**| What tools exist that can help perform the operations that I need to with this dataset? How must attribute data be stored to use them? |
#### Relevant examples
-##### 1. Attributes must be formatted according to accepted metadata conventions like CF and STAC in order to take advantage of tools built off these specifications
+##### 1. Attributes must conform to accepted metadata conventions like CF and STAC in order to take advantage of tools built off these specifications
- Using `cf_xarray` with appropriately-formatted metadata enables more streamlined access to and interpretation of metadata (*[ITS_LIVE tutorial, data access notebook](../itslive/nbs/1_accessing_itslive_s3_data.ipynb)*)
- Having appropriate CF metadata enables reading and writing vector data cubes to disk
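As a hedged sketch of the first point above (assuming `cf_xarray` is installed; the data are synthetic), CF attributes such as `standard_name` let the `.cf` accessor look a variable up by meaning rather than by name:

```python
import cf_xarray  # noqa: F401  (registers the .cf accessor on xarray objects)
import xarray as xr

ds = xr.Dataset({"t": ("x", [270.0, 271.5, 273.0])})
ds["t"].attrs["standard_name"] = "air_temperature"
ds["t"].attrs["units"] = "K"

# With CF metadata in place, the variable can be retrieved by its standard name.
temperature = ds.cf["air_temperature"]
```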
@@ -138,4 +132,4 @@ Independent objects should be represented as unique `xr.Datasets` (if multivaria
##### 2. If you're working with a collection of objects that can be defined by vector geometries, a vector data cube may be an appropriate way to represent the data
- Use [Xvec](https://xvec.readthedocs.io/en/stable/) to build a vector data cube that has a `'geometry'` dimension; each element of the geometry dimension is a cube that varies over the other dimensions of the cube (frequently `time`) (*[ITS_LIVE tutorial, exploratory analysis of a group of glaciers notebook](../itslive/nbs/5_exploratory_data_analysis_group.ipynb).*)
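A hedged sketch of that idea, assuming `xvec` (and its `set_geom_indexes` accessor method) plus `shapely` are installed; the two point geometries stand in for real glacier outlines:

```python
import numpy as np
import pandas as pd
import shapely
import xarray as xr
import xvec  # noqa: F401  (registers the .xvec accessor)

geometries = [shapely.Point(0, 0), shapely.Point(10, 10)]  # placeholder geometries
times = pd.date_range("2020-01-01", periods=3)

# A small vector data cube: one value per geometry per time step (synthetic).
cube = xr.DataArray(
    np.zeros((2, 3)),
    coords={"geometry": geometries, "time": times},
    dims=("geometry", "time"),
    name="median_velocity",
).xvec.set_geom_indexes("geometry", crs="EPSG:4326")
```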