🐟 🦈 🐠 🦀 🐙 🦑
cboettig committed Aug 10, 2024
1 parent 2f6a66f commit 4cf94ac
Showing 4 changed files with 118 additions and 205 deletions.
27 changes: 26 additions & 1 deletion NEWS.md
@@ -23,10 +23,35 @@ And constructed with the following guidelines:

For more information on SemVer, please visit http://semver.org/.

v 5.0.0
-------

Another streamlined re-design following new abilities for data hosting and access.
This release relies on HuggingFace datasets hosting for data and metadata,
in parquet and schema.org formats.

Data access is simplified to use the HuggingFace datasets API instead
of the previous contentid-based resolution. This allows metadata to be defined
directly alongside the data, independent of the R package.

A simplified access protocol relies on `duckdbfs` for direct reads of tables.
Several functions previously used only to manage connections are now deprecated
or removed, along with a significant number of dependencies.

Core use still centers on the same package API built around the `fb_tbl()` function.
Legacy helper functions for common tables, like `species()`, are still accessible and
can still optionally filter by species name where appropriate. As before, loading the
full tables and sub-setting manually is still recommended.

Historic helper functions like `load_taxa()` (combining the taxonomic classification from Species,
Genus, Family and Order tables), `validate_names()`, `common_to_sci()` and
`sci_to_common()` should be in working order, all using table-based outputs.


v 4.1.1
-------

* hotfix for bug in 4.1.0 on Windows -- duckdb httpfs on windows creates sigfault
* hotfix for bug in 4.1.0 on Windows -- `duckdb` `httpfs` on windows created `segfault`

v 4.1.0
--------
78 changes: 22 additions & 56 deletions README.Rmd
@@ -13,9 +13,30 @@ output: github_document
[![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/rfishbase)](https://github.com/r-hub/cranlogs.app)
<!-- badges: end -->

Welcome to `rfishbase 5`! This is the fourth rewrite of the original `rfishbase` package described in [Boettiger et al. (2012)](https://doi.org/10.1111/j.1095-8649.2012.03464.x).


Welcome to `rfishbase 4`. This is the fourth rewrite of the original `rfishbase` package described in [Boettiger et al. (2012)](https://doi.org/10.1111/j.1095-8649.2012.03464.x).
Another streamlined re-design following new abilities for data hosting and access.
This release relies on HuggingFace datasets hosting for data and metadata,
in parquet and schema.org formats.

Data access is simplified to use the HuggingFace datasets API instead
of the previous contentid-based resolution. This allows metadata to be defined
directly alongside the data, independent of the R package.

A simplified access protocol relies on `duckdbfs` for direct reads of tables.
Several functions previously used only to manage connections are now deprecated
or removed, along with a significant number of dependencies.
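
As a rough illustration of what a direct read through `duckdbfs` can look like (a sketch only, not the package's internal code), a remote parquet file can be opened lazily and filtered before any rows are pulled into memory. The helper name `read_remote_table()` is hypothetical, and the URL of the parquet file is left for the caller to supply, since the actual file locations are not stated here.

```r
# Hypothetical helper (not part of rfishbase): open a remote parquet file
# as a lazy DuckDB-backed table using duckdbfs.
read_remote_table <- function(parquet_url) {
  # open_dataset() registers the file as a view; rows are only read
  # when the result is collected.
  duckdbfs::open_dataset(parquet_url, format = "parquet")
}

# Usage (needs the duckdbfs and dplyr packages, network access, and a
# real parquet URL in place of the placeholder):
# species_remote <- read_remote_table("<URL of a parquet file>")
# species_remote |>
#   dplyr::filter(Genus == "Salmo") |>
#   dplyr::collect()
```

Keeping the table lazy until `dplyr::collect()` means only the filtered rows need to be transferred.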

Core use still centers on the same package API built around the `fb_tbl()` function.
Legacy helper functions for common tables, like `species()`, are still accessible and
can still optionally filter by species name where appropriate. As before, loading the
full tables and sub-setting manually is still recommended.
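
The recommended pattern of loading a full table and sub-setting it manually might be sketched as follows. The helper `subset_by_sci_name()` is hypothetical (it is not an rfishbase function) and assumes the table carries separate `Genus` and `Species` columns, as the species table does.

```r
# Hypothetical helper (not part of rfishbase): keep only the rows whose
# "Genus species" binomial appears in `sci_names`.
subset_by_sci_name <- function(tbl, sci_names) {
  # The species table stores the binomial in two columns, so rebuild it first.
  full_name <- paste(tbl$Genus, tbl$Species)
  tbl[full_name %in% sci_names, , drop = FALSE]
}

# Usage against the live data (needs rfishbase and network access):
if (interactive() && requireNamespace("rfishbase", quietly = TRUE)) {
  species_tbl <- rfishbase::fb_tbl("species")
  subset_by_sci_name(species_tbl, c("Oreochromis niloticus", "Salmo trutta"))
}
```

Loading the table once and filtering it locally avoids repeated reads when several different subsets are needed.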

Historic helper functions like `load_taxa()` (combining the taxonomic classification from Species,
Genus, Family and Order tables), `validate_names()`, `common_to_sci()` and
`sci_to_common()` should be in working order, all using table-based outputs.


- `rfishbase 1.0` relied on parsing of XML pages served directly from Fishbase.org.
- `rfishbase 2.0` relied on calls to a ruby-based API, `fishbaseapi`, that provided access to SQL snapshots of about 20 of the more popular tables in FishBase or SeaLifeBase.
@@ -91,61 +112,6 @@ available_releases()
```



## Low-memory environments

If you have very limited RAM (e.g. <= 1 GB available) it may be helpful to use `fishbase` tables in remote form by setting `collect = FALSE`. This allows the tables to remain on disk, while the user is still able to use almost all `dplyr` functions (see the `dbplyr` vignette). Once the table is appropriately subset, the user will need to call `dplyr::collect()` to use generic non-dplyr functions, such as plotting commands.

```{r}
fb_tbl("occurrence")
```


## Local copy

Set the option "rfishbase_local_db" = TRUE to create a local copy, otherwise will use a remote copy.
Local copy will get better performance after initial import, but may experience conflicts when
`duckdb` is upgraded or when multiple sessions attempt to access the directory. Remove the default
storage directory (given by `db_dir()`) after upgrading duckdb if using a local copy.

```{r}
options("rfishbase_local_db" = TRUE)
db_disconnect() # close previous remote connection
conn <- fb_conn()
conn
```

Users can trigger a one-time download of all fishbase tables (or a list of desired tables) using `fb_import()`. This will ensure later use of any function can operate smoothly even when no internet connection is available. Any table already downloaded will not be re-downloaded. (Note: `fb_import()` also returns a remote duckdb database connection to the tables, for users who prefer to work with the remote data objects.)

```{r}
fb_import()
```



```{r include=FALSE}
db_disconnect(conn)
```



## Interactive RStudio pane

RStudio users can also browse all fishbase tables interactively in the RStudio connection browser by using the function `fishbase_pane()`. Note that this function will first download a complete set of the fishbase tables.

## Backwards compatibility


`rfishbase` 4.0 tries to maintain as much backwards compatibility as possible with rfishbase 3.0. Because parquet preserves native data types, some encoded types may differ from earlier versions. As before, these are not always the native type -- e.g. fishbase encodes some boolean (logical TRUE/FALSE) values as integer (-1, 0) or character types. Use `as.logical()` to coerce into the appropriate type in that case.

Toggling between fishbase and sealifebase servers using an environmental variable, `FISHBASE_API`, is now deprecated.

Note that fishbase will store downloaded files by hash in the app directory, given by `db_dir()`. The default location can be set by configuring the desired path in the environmental variable, `FISHBASE_HOME`.




-----------

Please note that this package is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/). By contributing to this project, you agree to abide by its terms.
198 changes: 65 additions & 133 deletions README.md
@@ -3,7 +3,7 @@

<!-- badges: start -->

[![R-CMD-check](https://github.com/ropensci/rfishbase/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/rfishbase/actions)
[![R-CMD-check](https://github.com/ropensci/rfishbase/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/rfishbase/actions/workflows/R-CMD-check.yaml)
[![Coverage
status](https://codecov.io/gh/ropensci/rfishbase/branch/master/graph/badge.svg)](https://codecov.io/github/ropensci/rfishbase?branch=master)
[![Onboarding](https://badges.ropensci.org/137_status.svg)](https://github.com/ropensci/software-review/issues/137)
@@ -12,10 +12,35 @@ status](https://www.r-pkg.org/badges/version/rfishbase)](https://cran.r-project.
[![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/rfishbase)](https://github.com/r-hub/cranlogs.app)
<!-- badges: end -->

Welcome to `rfishbase 4`. This is the fourth rewrite of the original
Welcome to `rfishbase 5`! This is the fourth rewrite of the original
`rfishbase` package described in [Boettiger et
al. (2012)](https://doi.org/10.1111/j.1095-8649.2012.03464.x).

Another streamlined re-design following new abilities for data hosting
and access. This release relies on HuggingFace datasets hosting for
data and metadata, in parquet and schema.org formats.

Data access is simplified to use the HuggingFace datasets API instead
of the previous contentid-based resolution. This allows metadata to be
defined directly alongside the data, independent of the R package.

A simplified access protocol relies on `duckdbfs` for direct reads of
tables. Several functions previously used only to manage connections are
now deprecated or removed, along with a significant number of
dependencies.

Core use still centers on the same package API built around the
`fb_tbl()` function. Legacy helper functions for common tables, like
`species()`, are still accessible and can still optionally filter by
species name where appropriate. As before, loading the full tables and
sub-setting manually is still recommended.

Historic helper functions like `load_taxa()` (combining the taxonomic
classification from Species, Genus, Family and Order tables),
`validate_names()`, `common_to_sci()` and `sci_to_common()` should be
in working order, all using table-based outputs.

- `rfishbase 1.0` relied on parsing of XML pages served directly from
Fishbase.org.
- `rfishbase 2.0` relied on calls to a ruby-based API, `fishbaseapi`,
@@ -57,24 +82,23 @@ function:
fb_tbl("ecosystem")
```

# A tibble: 157,870 × 18
autoctr E_CODE Ecosy…¹ Specc…² Stock…³ Status Curre…⁴ Abund…⁵ LifeS…⁶ Remarks
<int> <int> <int> <int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 1 50628 549 565 native Present <NA> adults <NA>
2 2 1 189 552 568 native Present <NA> adults <NA>
3 3 1 189 554 570 native Present <NA> adults <NA>
4 4 1 79732 873 889 native Present <NA> adults <NA>
5 5 1 5217 948 964 native Present <NA> adults <NA>
6 7 1 39852 956 972 native Present <NA> adults <NA>
7 8 1 39852 957 973 native Present <NA> adults <NA>
8 9 1 39852 958 974 native Present <NA> adults <NA>
9 10 1 188 1526 1719 native Present <NA> adults <NA>
10 11 1 188 1626 1819 native Present <NA> adults <NA>
# … with 157,860 more rows, 8 more variables: Entered <int>,
# Dateentered <dttm>, Modified <int>, Datemodified <dttm>, Expert <int>,
# Datechecked <dttm>, WebURL <chr>, TS <dttm>, and abbreviated variable names
# ¹​EcosystemRefno, ²​Speccode, ³​Stockcode, ⁴​CurrentPresence, ⁵​Abundance,
# ⁶​LifeStage
# A tibble: 160,334 × 18
autoctr E_CODE EcosystemRefno Speccode Stockcode Status CurrentPresence
<int> <int> <int> <int> <int> <chr> <chr>
1 1 1 50628 549 565 native Present
2 2 1 189 552 568 native Present
3 3 1 189 554 570 native Present
4 4 1 79732 873 889 native Present
5 5 1 5217 948 964 native Present
6 7 1 39852 956 972 native Present
7 8 1 39852 957 973 native Present
8 9 1 39852 958 974 native Present
9 10 1 188 1526 1719 native Present
10 11 1 188 1626 1819 native Present
# ℹ 160,324 more rows
# ℹ 11 more variables: Abundance <chr>, LifeStage <chr>, Remarks <chr>,
# Entered <int>, Dateentered <dttm>, Modified <int>, Datemodified <dttm>,
# Expert <int>, Datechecked <dttm>, WebURL <chr>, TS <dttm>

You can see all the tables using `fb_tables()` to see a list of all the
table names (specify `sealifebase` if desired). Careful, there are a lot
@@ -115,26 +139,26 @@ parallels the database structure of Fishbase. As such, almost all
fb_tbl("species", "sealifebase")
```

# A tibble: 103,169 × 109
SpecCode Genus Species Author Speci…¹ FBname FamCode Subfa…² GenCode TaxIs…³
<int> <chr> <chr> <chr> <int> <chr> <int> <chr> <int> <int>
1 10217 Abyss… cidaris Poore3113 <NA> 512 <NA> 9280 0
2 10218 Abyss… panope Poore3113 <NA> 512 <NA> 9280 0
3 90399 Abyss… averin… Kussa3113 <NA> 502 <NA> 17490 0
4 52610 Abyss… millari Monni… 2585 <NA> 978 <NA> 9281 0
5 52611 Abyss… wyvill… Herdm2892 <NA> 978 <NA> 9281 0
6 138684 Abyss… planus (Slad81020 <NA> 1615 <NA> 24229 0
7 90400 Abyss… acutil… Doti … 3113 <NA> 587 <NA> 9282 0
8 10219 Abyss… argent… Menzi3113 <NA> 587 <NA> 9282 0
9 10220 Abyss… bathya… Just,3113 <NA> 587 <NA> 9282 0
10 10221 Abyss… dentif… Menzi3113 <NA> 587 <NA> 9282 0
# … with 103,159 more rows, 99 more variables: Remark <chr>,
# PicPreferredName <chr>, PicPreferredNameM <chr>, PicPreferredNameF <chr>,
# PicPreferredNameJ <chr>, Source <chr>, AuthorRef <int>, SubGenCode <int>,
# Fresh <int>, Brack <int>, Saltwater <int>, Land <int>, BodyShapeI <chr>,
# DemersPelag <chr>, AnaCat <chr>, MigratRef <int>, DepthRangeShallow <int>,
# DepthRangeDeep <int>, DepthRangeRef <int>, DepthRangeComShallow <int>,
# DepthRangeComDeep <int>, DepthComRef <int>, LongevityWild <dbl>, …
# A tibble: 102,464 × 111
SpecCode Genus Species Author SpeciesRefNo FBname FamCode Subfamily GenCode
<int> <chr> <chr> <chr> <int> <chr> <int> <chr> <int>
1 57969 Abdopus horrid… (D'Or 96968 Red S… 1890 Octopodi… 24384
2 57836 Abdopus tenebr… (Smit 19 <NA> 1890 Octopodi… 24384
3 57142 Abdopus tongan… (Hoyl 19 <NA> 1890 Octopodi… 24384
4 2381155 Abdopus undula… Huffa… 84307 <NA> 1890 <NA> 24384
5 14647 Abebai… troglo… Vande 19 <NA> 572 <NA> 9260
6 165283 Aberom… muranoi Baces 104101 <NA> 616 <NA> 33537
7 140720 Aberra… banyul… Macki… 85340 <NA> 174 <NA> 9262
8 40346 Aberra… enigma… unspe 19 <NA> 174 <NA> 9262
9 20199 Aberra… aberra… (Barn 19 <NA> 308 <NA> 9263
10 93706 Aberro… verruc… Kasat 3696 <NA> 922 <NA> 17969
# ℹ 102,454 more rows
# ℹ 102 more variables: TaxIssue <int>, Remark <chr>, PicPreferredName <chr>,
# PicPreferredNameM <chr>, PicPreferredNameF <chr>, PicPreferredNameJ <chr>,
# Source <chr>, AuthorRef <int>, SubGenCode <int>, Fresh <int>, Brack <int>,
# Saltwater <int>, Land <int>, BodyShapeI <chr>, DemersPelag <chr>,
# Amphibious <chr>, AmphibiousRef <int>, AnaCat <chr>, MigratRef <int>,
# DepthRangeShallow <int>, DepthRangeDeep <int>, DepthRangeRef <int>, …

## Versions and importing all tables

@@ -147,99 +171,7 @@ fishbase.org. Check available releases:
available_releases()
```

[1] "23.01" "21.06" "19.04"

## Low-memory environments

If you have very limited RAM (e.g. \<= 1 GB available) it may be helpful
to use `fishbase` tables in remote form by setting `collect = FALSE`.
This allows the tables to remain on disk, while the user is still able
to use almost all `dplyr` functions (see the `dbplyr` vignette). Once
the table is appropriately subset, the user will need to call
`dplyr::collect()` to use generic non-dplyr functions, such as plotting
commands.

``` r
fb_tbl("occurrence")
```

# A tibble: 1,097,303 × 106
catnum2 OccurrenceR…¹ SpecC…² Syncode Stock…³ Genus…⁴ Speci…⁵ ColName PicName
<int> <int> <int> <int> <int> <chr> <chr> <chr> <chr>
1 34424 36653 227 22902 241 "Megal… "cypri… "Megal… ""
2 95154 45880 NA NA NA "" "" "" ""
3 97606 45880 NA NA NA "" "" "" ""
4 100025 45880 5520 25676 5809 "Johni… "belan… "" ""
5 98993 45880 5676 16650 5969 "Chrom… "retro… "" ""
6 99316 45880 454 23112 468 "Drepa… "punct… "" ""
7 99676 45880 5388 145485 5647 "Gymno… "bosch… "" ""
8 99843 45880 16813 119925 15264 "Hemir… "balin… "" ""
9 100607 45880 8288 59635 8601 "Ostra… "rhino… "" ""
10 101529 45880 NA NA NA "Scomb… "toloo… "" ""
# … with 1,097,293 more rows, 97 more variables: CatNum <chr>, URL <chr>,
# Station <chr>, Cruise <chr>, Gazetteer <chr>, LocalityType <chr>,
# WaterDepthMin <dbl>, WaterDepthMax <dbl>, AltitudeMin <int>,
# AltitudeMax <int>, LatitudeDeg <int>, LatitudeMin <dbl>, NorthSouth <chr>,
# LatitudeDec <dbl>, LongitudeDeg <int>, LongitudeMIn <dbl>, EastWest <chr>,
# LongitudeDec <dbl>, Accuracy <chr>, Salinity <chr>, LatitudeTo <dbl>,
# LongitudeTo <dbl>, LatitudeDegTo <int>, LatitudeMinTo <dbl>, …

## Local copy

Set the option “rfishbase_local_db” = TRUE to create a local copy,
otherwise will use a remote copy. Local copy will get better performance
after initial import, but may experience conflicts when `duckdb` is
upgraded or when multiple sessions attempt to access the directory.
Remove the default storage directory (given by `db_dir()`) after
upgrading duckdb if using a local copy.

``` r
options("rfishbase_local_db" = TRUE)
db_disconnect() # close previous remote connection

conn <- fb_conn()
conn
```

<duckdb_connection 5fa20 driver=<duckdb_driver 543a0 dbdir='/home/cboettig/.local/share/R/rfishbase/fishbase_23.01' read_only=FALSE bigint=numeric>>

Users can trigger a one-time download of all fishbase tables (or a list
of desired tables) using `fb_import()`. This will ensure later use of
any function can operate smoothly even when no internet connection is
available. Any table already downloaded will not be re-downloaded.
(Note: `fb_import()` also returns a remote duckdb database connection to
the tables, for users who prefer to work with the remote data objects.)

``` r
fb_import()
```

<duckdb_connection 5fa20 driver=<duckdb_driver 543a0 dbdir='/home/cboettig/.local/share/R/rfishbase/fishbase_23.01' read_only=FALSE bigint=numeric>>

## Interactive RStudio pane

RStudio users can also browse all fishbase tables interactively in the
RStudio connection browser by using the function `fishbase_pane()`. Note
that this function will first download a complete set of the fishbase
tables.

## Backwards compatibility

`rfishbase` 4.0 tries to maintain as much backwards compatibility as
possible with rfishbase 3.0. Because parquet preserves native data
types, some encoded types may differ from earlier versions. As before,
these are not always the native type – e.g. fishbase encodes some
boolean (logical TRUE/FALSE) values as integer (-1, 0) or character
types. Use `as.logical()` to coerce into the appropriate type in that
case.

Toggling between fishbase and sealifebase servers using an environmental
variable, `FISHBASE_API`, is now deprecated.

Note that fishbase will store downloaded files by hash in the app
directory, given by `db_dir()`. The default location can be set by
configuring the desired path in the environmental variable,
`FISHBASE_HOME`.
[1] "19.04" "21.06" "23.01" "23.05" "24.07"

------------------------------------------------------------------------
