Move docs from download to data

badgley · badgley · commit cd6cc5cfdd54 · 2023-12-06T10:48:56.000-08:00
diff --git a/.gitignore b/.gitignore
@@ -159,3 +159,5 @@ cython_debug/
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
 offsets_db_data/_version.py
+
+.DS_Store
diff --git a/docs/credits.md b/docs/credits.md
@@ -0,0 +1,148 @@
+# Credits
+The `credits` data reports bulk credit transactions: issuances, retirements, and cancellations.
+We first download raw credit transaction data from each of the registries. 
+We then apply custom, registry-specific transformations to the data, with the goal of mapping all registry data to a common schema.
+
+## Schema
+
+Credit transactions have the following schema: 
+
+```json
+{
+  "title": "Credit",
+  "properties": {
+    "id": {
+      "title": "Id",
+      "type": "integer"
+    },
+    "project_id": {
+      "title": "Project ID",
+      "description": "Unique project identifier, by registry",
+      "type": "string"
+    },
+    "quantity": {
+      "title": "Quantity",
+      "description": "Number of credits",
+      "type": "integer"
+    },
+    "vintage": {
+      "title": "Vintage",
+      "description": "Vintage year of credits",
+      "type": "integer"
+    },
+    "transaction_date": {
+      "title": "Transaction Date",
+      "description": "Date of transaction",
+      "type": "string",
+      "format": "date"
+    },
+    "transaction_type": {
+      "title": "Transaction Type",
+      "description": "Type of transaction (e.g., issuance, retirement)",
+      "type": "string"
+    }
+  }
+}
+```
+## Downloading raw data
+We download a fresh copy of project and transaction data on a daily basis.
+While downloading, we make no changes to the raw data provided by the registries.
+All data are permanently archived and are made immediately available for download in a publicly available S3 bucket (see Data Access TK).
+
+As with `projects` data, we have no plans to release the code that directly interacts with the registries.
+We decided to keep this part of OffsetsDB closed in an effort to limit the number of download requests made to the registries.
+
+## Transforming raw data
+
+Nearly the entirety of the code contained within `offsets-db-data` involves registry-specific logic for transforming raw registry data into a common, shared schema.
+The logic for transforming the data of each registry is contained within a single file, named for that registry.
+For example, the logic involved in transforming Verra data is contained within a file named `vcs.py`.
+ 
+Each registry-specific file contains at least two functions: `process_{registry_abbreviation}_credits` and `process_{registry_abbreviation}_projects`.
+Those functions, in turn, call a series of additional transformation functions that produce the normalized project and credit data which combine to form OffsetsDB.
+These transformation functions tend to be quite small and operate on one or two properties of the raw data. 
+To continue with the Verra example, `vcs.py` contains functions with names like `set_vcs_vintage_year` and `generate_vcs_project_ids`.
+These functions contain the registry-specific logic needed to map Verra's raw data to a common schema. 
+
+### An example
+In practice, replicating the behavior of OffsetsDB should be simple.
+Here's an example of using `offsets_db_data` to transform the raw transactions data from Verra into a normalized, analysis-ready file:
+
+```python
+import pandas as pd
+from offsets_db_data import vcs
+
+# Reading directly from an s3:// URL requires the optional s3fs package.
+archive_fname = 's3://carbonplan-offsets-db/raw/2023-12-05/verra/transactions.csv.gz'
+raw_credits = pd.read_csv(archive_fname)
+processed_credits = vcs.process_vcs_credits(raw_credits)
+```
+
+Invoking a single transformation function, like `set_vcs_vintage_year`, is even more straightforward.
+Let's say you want to understand more about how OffsetsDB assigns Verra credits a vintage year.
+You can explore the behavior of this single transformation function by calling:
+
+```python
+raw_credits.set_vcs_vintage_year(date_column='Vintage End')
+```
+
+It's worth noting that we've wrapped all transformation functions using the `pandas_flavor.register_dataframe_method` decorator.
+That means that after importing a registry module from `offsets_db_data`, the transformation functions of that module are directly callable on any pandas DataFrame.
+
+## Initial Column Mapping
+The initial and perhaps most mundane transformation of OffsetsDB involves mapping properties in the raw data to a common schema.
+This step requires constructing a map from the names of properties as they appear in the raw data to the corresponding properties in OffsetsDB.
+For example, the Climate Action Reserve data refers to the property, `project_id`, as `Project ID`. 
+The ART registry, however, refers to the same property as `Program ID`.
+
+These column mapping files are stored in `offsets_db_data/configs`.
+There are separate mapping files for `projects` data and `credits` data.
+Some properties either aren't included in the raw data or inferring their value requires special processing.
+In these cases, a `null` value is recorded in the column mapping files. 
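+
+As a rough illustration of the mapping step, the sketch below applies a column mapping to a single raw record.
+The dictionary layout, the `apply_column_mapping` helper, and the example project ID are assumptions made for illustration; they are not taken from the files in `offsets_db_data/configs`.
+
+```python
+# Hypothetical excerpt of a projects column mapping for Climate Action Reserve.
+# Keys are OffsetsDB properties and values are raw column names; None stands in
+# for a `null` entry, meaning the raw data has no matching column.
+CAR_PROJECT_COLUMNS = {
+    'project_id': 'Project ID',
+    'name': 'Project Name',
+    'listed_at': None,
+}
+
+
+def apply_column_mapping(raw_row, mapping):
+    """Rename the mapped raw columns and fill unmapped properties with None."""
+    return {
+        prop: (raw_row.get(raw_name) if raw_name is not None else None)
+        for prop, raw_name in mapping.items()
+    }
+
+
+# Raw columns that are not listed in the mapping (here, 'Extra') are dropped.
+row = apply_column_mapping(
+    {'Project ID': 'CAR1234', 'Project Name': 'Example Project', 'Extra': 1},
+    CAR_PROJECT_COLUMNS,
+)
+```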
+
+## Protocol Mapping & Categorization
+Offset projects are developed by following a specific set of rules, known as a protocol.
+Unfortunately, there is no standardized way of referring to the exact protocol (or protocol version) used to develop an offset project.
+Even within the domain of a single registry, references to the exact protocol used to develop a project are often inconsistent. 
+
+OffsetsDB addresses this problem by manually assigning every known protocol string to a common schema. 
+Take, for example, the Clean Development Mechanism protocol AMS-III.D., "Methane recovery in animal manure management systems".
+Across all six registries included in OffsetsDB, we identified twenty-two unique strings referring to this single protocol.
+OffsetsDB maps these unique strings, which we refer to as "known strings", to a single reference, `ams-iii-d`.
+
+We also assign each of these unified protocol references a category.
+Those categories include:
+
+- agriculture: offsets derived from the management of farmlands
+- cookstoves: offsets derived from in-home cookstoves that are either more efficient or use cleaner fuels
+- forest: offsets derived from the management of forests
+- ghg-management: offsets derived from the destruction or elimination of greenhouse gases
+- land-use: offsets derived from changes in land-use (e.g., avoided conversion)
+- renewable-energy: offsets derived from expanding renewable energy capacity
+
+Data about protocol categories and "known strings" are stored in `offsets_db_data/configs/all-protocol-mapping.json`. 
+
+## Registry-specific transformations
+Some of the transformations involved in producing OffsetsDB require special knowledge or assumptions about the underlying data.
+This section of the documentation highlights some of those special cases. 
+For additional context, consult specific function docstrings or reach out TK if something doesn't make sense. 
+
+### American Carbon Registry
+
+Project status: When processing ACR projects, we combine two status properties present in the raw data: `Compliance Program Status (ARB or Ecology)` and `Voluntary Status`.
+For compliance projects, we report compliance program status. 
+For non-compliance projects, we report voluntary status. 
+
+### Verra
+There are several unique aspects of Verra's crediting data that require special consideration.
+First, Verra is unique among the registries included in OffsetsDB in that it allows for "rolling" credit issuance.
+This allows projects to complete the paperwork and verification processes for credit issuance, but delay the actual issuance event.
+This results in ambiguities around the precise timing of credit issuance events, because credits that are eligible to be issued but have not yet been issued are not publicly reported in the Verra crediting data.
+We handle this ambiguity by assuming that the first crediting event (an issuance, retirement, or cancellation) on a per-project, per-vintage basis results in issuance of 100 percent of the credits eligible to be issued for that project-vintage.
+Second, Verra's data do not distinguish retirement events from cancellation events.
+We report all Verra retirements and cancellations as `retirement/cancellation`.
+Third, Verra allows for the simultaneous issuance of multiple vintages.
+We assign all credits from these multi-vintage issuances to the first reported vintage year.
+
+### California Compliance Projects
+We treat the California Air Resources Board's [issuance table](https://ww2.arb.ca.gov/resources/documents/arb-offset-credit-issuance-table) as the source of truth for all credits issued and retired by any project developed under an ARB-approved protocol.
diff --git a/docs/projects.md b/docs/projects.md
@@ -0,0 +1,145 @@
+# Transformation
+
+Project data across all registries are transformed to a common Project schema:
+
+```json
+{
+  "title": "Project",
+  "type": "object",
+  "properties": {
+    "project_id": {
+      "title": "Project Id",
+      "description": "Project id used by registry system",
+      "type": "string"
+    },
+    "name": {
+      "title": "Name",
+      "description": "Name of the project",
+      "type": "string"
+    },
+    "registry": {
+      "title": "Registry",
+      "description": "Name of the registry",
+      "type": "string"
+    },
+    "proponent": { "title": "Proponent", "type": "string" },
+    "protocol": {
+      "title": "Protocol",
+      "description": "List of protocols",
+      "type": "array",
+      "items": { "type": "string" }
+    },
+    "category": {
+      "title": "Category",
+      "description": "List of categories",
+      "type": "array",
+      "items": { "type": "string" }
+    },
+    "status": { "title": "Status", "type": "string" },
+    "country": { "title": "Country", "type": "string" },
+    "listed_at": {
+      "title": "Listed At",
+      "description": "Date project was listed",
+      "type": "string",
+      "format": "date"
+    },
+    "is_compliance": {
+      "title": "Is Compliance",
+      "description": "Whether project is compliance project",
+      "type": "boolean"
+    },
+    "retired": {
+      "title": "Retired",
+      "description": "Total of retired credits",
+      "type": "integer"
+    },
+    "issued": {
+      "title": "Issued",
+      "description": "Total of issued credits",
+      "type": "integer"
+    },
+    "project_url": {
+      "title": "Project Url",
+      "description": "URL to project details",
+      "type": "string"
+    }
+  },
+  "required": ["project_id", "registry"]
+}
+```
+
+The majority of project attributes are directly taken from the project data downloaded from each registry.
+Table 1 provides the mapping from the raw column names found in downloaded registry data to the OffsetsDB project schema.
+
+|                         | **verra**                 | **climate-action-reserve** | **american-carbon-registry**           | **global-carbon-council**     | **gold-standard**         | **art-trees**               |
+| ----------------------- | ------------------------- | -------------------------- | -------------------------------------- | ----------------------------- | ------------------------- | --------------------------- |
+| **project_id**          | ID                        | Project ID                 | Project ID                             | project_submission_number     | id                        | Program ID                  |
+| **name**                | Name                      | Project Name               | Project Name                           | project_url                   | name                      | Program Name                |
+| **protocol**            | Methodology               | Project Type               | Project Methodology/Protocol           | project_methodology           | methodology               | \-                          |
+| **category**            | inferred from protocol    | inferred from protocol     | inferred from protocol                 | inferred from protocol        | inferred from protocol    | inferred from protocol      |
+| **project_subcategory** | manually assigned         | manually assigned          | manually assigned                      | manually assigned             | manually assigned         | manually assigned           |
+| **proponent**           | Proponent                 | Project Owner              | Project Developer                      | project_details:project_owner | project_developer         | Sovereign Program Developer |
+| **country**             | Country/Area              | Project Site Country       | Project Site Country                   | project_country               | country                   | Program Country             |
+| **status**              | Status                    | Status                     | Derived: voluntary + compliance status | project_status                | status                    | Status                      |
+| **listed_at**           | Project Listed Date       | \-                         | \-                                     | \-                            | \-                        | \-                          |
+| **commenced_at**        | inferred from credit data | inferred from credit data  | inferred from credit data              | inferred from credit data     | inferred from credit data | inferred from credit data   |
+
+## Normalizing Protocols
+
+There is significant variation in the raw strings used to describe the protocols associated with each project.
+That variation exists both within and across registries.
+For example, as of the first release of OffsetsDB, we observed 129 unique strings used to describe `ACM0002`, a CDM-era methodology for crediting grid-connected renewable energy projects.
+We manually constructed a mapping from observed raw strings to a standardized set of protocol names (see [all-protocol-mapping.json](TK)).
+We categorized protocol strings `Other` and `Not Provided` as `unknown`.
+In some cases, projects are associated with multiple protocols.
+We preserve the one-to-many relationship between projects and protocols in the protocol normalization step.
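+
+The normalization step described above can be sketched roughly as follows.
+This is a minimal illustration under stated assumptions: the `KNOWN_STRINGS` layout, the lowercase `acm0002` key, the example raw strings, and the choice to treat unmatched strings as `unknown` are all hypothetical and do not reflect the actual contents of `all-protocol-mapping.json`.
+
+```python
+# Hypothetical excerpt: normalized protocol name -> raw strings observed in
+# registry data. The raw strings here are illustrative only.
+KNOWN_STRINGS = {
+    'acm0002': [
+        'ACM0002',
+        'ACM0002: Grid-connected electricity generation from renewable sources',
+    ],
+    'unknown': ['Other', 'Not Provided'],
+}
+
+# Invert the mapping once so each raw string resolves to its normalized name.
+RAW_TO_PROTOCOL = {
+    raw.strip().lower(): name
+    for name, raws in KNOWN_STRINGS.items()
+    for raw in raws
+}
+
+
+def normalize_protocols(raw_protocols):
+    """Map a project's raw protocol strings to a de-duplicated list of names."""
+    normalized = []
+    for raw in raw_protocols:
+        # Assumption: strings missing from the mapping fall back to 'unknown'.
+        name = RAW_TO_PROTOCOL.get(raw.strip().lower(), 'unknown')
+        if name not in normalized:
+            normalized.append(name)
+    return normalized
+```
+
+Returning a list, rather than a single name, keeps the one-to-many relationship between a project and its protocols.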
+
+## Categorizing Protocols and Projects
+
+We manually categorized each protocol contained within `all-protocol-mapping.json`, assigning each protocol a primary category.
+OffsetsDB contains the following protocol categories:
+
+- agriculture
+- biochar
+- cookstoves
+- energy-efficiency
+- forest
+- industrial-gases
+- industrial-processes
+- land-use
+- landfill
+- mine-methane
+- oil-and-gas
+- renewable-energy
+- transportation
+- unknown
+- waste-management
+
+Furthermore, OffsetsDB allows for two additional types of subcategorization: protocol subcategory and project subcategory.
+Protocol subcategories derive directly from the protocol itself and apply to all projects developed under that protocol.
+For example, `ACM0006` specifies rules for generating electricity from biomass.
+From the standpoint of category, projects under `ACM0006` are `renewable-energy` projects.
+However, all `ACM0006` projects generate electricity from biomass, meaning we can assign those projects a subcategory of `biomass`.
+Some protocols, however, allow for the development of multiple subcategories of project.
+For example, `ACM0002` allows for the development of a whole host of renewable energy projects, from hydropower to wind.
+As a result, `ACM0002` projects all fall under the category of `renewable-energy`, but the protocol alone does not determine a subcategory.
+Further subdividing these projects by type requires looking at project paperwork, after which it is possible to manually assign the project a `project_subcategory`.
+Project subcategorizations are recorded in `project-sub-categories.json`.
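+
+One way to picture how the two kinds of subcategory interact is the sketch below.
+The category and subcategory values for `ACM0006` and `ACM0002` follow the description above, but the dictionary layouts, the lowercase protocol keys, the `categorize` helper, and the `EXAMPLE-001` project ID are assumptions made for illustration, not the structure of the real config files.
+
+```python
+# Protocol-level metadata; a subcategory of None means the protocol alone
+# does not determine one.
+PROTOCOLS = {
+    'acm0006': {'category': 'renewable-energy', 'subcategory': 'biomass'},
+    'acm0002': {'category': 'renewable-energy', 'subcategory': None},
+}
+
+# Hypothetical manual assignments, keyed by project ID.
+PROJECT_SUBCATEGORIES = {'EXAMPLE-001': 'wind'}
+
+
+def categorize(project_id, protocol):
+    """Return (category, subcategory), preferring the protocol-level subcategory."""
+    entry = PROTOCOLS[protocol]
+    subcategory = entry['subcategory'] or PROJECT_SUBCATEGORIES.get(project_id)
+    return entry['category'], subcategory
+```
+
+A project with no protocol-level or manually assigned subcategory simply gets `None` in this sketch.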
+
+## Other normalizations
+
+### Country
+
+We use the Python package [country_converter](https://github.com/IndEcol/country_converter) to harmonize country names.
+
+### Status
+
+We harmonize project status codes across all registries, assigning each project one of three statuses:
+
+- Listed: project exists and cannot yet be issued credits
+- Registered: project exists and can be issued credits
+- Completed: project exists and will not receive additional credits
+
+This simplified set of status codes does not make full use of the information registries provide about project state.
+For example, some registries include finer-grained information about where a project is in the registration and verification process.
+We made no attempt to harmonize these extra states across registries, though future versions of OffsetsDB might attempt to capture additional details about the status of projects.
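+
+The status harmonization could be sketched as a simple lookup.
+The raw status strings and the `harmonize_status` helper below are hypothetical; each registry uses its own vocabulary, and this is not the mapping OffsetsDB actually applies.
+
+```python
+# Hypothetical raw status strings mapped to the three harmonized statuses.
+STATUS_MAP = {
+    'under validation': 'Listed',
+    'registered': 'Registered',
+    'crediting period expired': 'Completed',
+}
+
+
+def harmonize_status(raw_status):
+    """Return the harmonized status, raising if the raw status is unmapped."""
+    key = raw_status.strip().lower()
+    if key not in STATUS_MAP:
+        # Failing loudly flags new registry statuses for manual review.
+        raise ValueError(f'unmapped status: {raw_status!r}')
+    return STATUS_MAP[key]
+```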