
dataset: Create Data Frames that are Easier to Exchange and Reuse #681

Open
13 of 24 tasks
antaldaniel opened this issue Jan 21, 2025 · 24 comments

Comments

@antaldaniel

antaldaniel commented Jan 21, 2025

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset
Version submitted: 0.3.4002
Submission type: Standard
Editor: @maelle
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Version: 0.3.4002
Date: 2024-12-26
DOI: 10.32614/CRAN.package.dataset
Language: en-GB
Authors@R: 
    c(person(given = "Daniel", family = "Antal", 
           email = "[email protected]", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           ), 
      person(given = "Marcelo", family =  "Perlin", 
             role = c("rev"), 
             comment = c(ORCID = "0000-0002-9839-4268")
             )
      )
Maintainer: Daniel Antal <[email protected]>
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects as well-referenced, 
    well-described, interoperable datasets in a release- and reuse-ready form.
License: GPL (>= 3)
Encoding: UTF-8
URL: https://dataset.dataobservatory.eu/
BugReports: https://github.com/dataobservatory-eu/dataset/issues/
Roxygen: list(markdown = TRUE)
LazyData: true
Imports: 
    assertthat,
    cli,
    haven,
    ISOcodes,
    labelled,
    methods,
    pillar,
    RefManageR,
    rlang,
    tibble,
    utils,
    vctrs (>= 0.5.2)
RoxygenNote: 7.3.2
Suggests: 
    knitr,
    rdflib,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 3.5)
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
      • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package works with various semantic interoperability standards. It therefore allows users to retrieve RDF-annotated, rich, platform-independent data and reconstruct it as an R data.frame with rich metadata attributes, or to release interoperable, RDF-annotated datasets on linked open data platforms from native R objects.

  • Who is the target audience and what are scientific applications of this package?

Production-side statisticians. Scientists who want to update their sources from various data repositories and exchanges. Scientists and research data managers who want to release new scientific or professional datasets that follow modern interoperability standards.

The package aims to complement the rdflib and dataspice packages.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.


@maelle
Member

maelle commented Jan 21, 2025

@antaldaniel I am so sorry, you didn't use the right template... because we broke the right template last time we edited it! I put it back in shape, thanks for helping us catch it. You can find it by opening a new issue (it'll be at the top of the list) or use it from here https://github.com/ropensci/software-review/blob/main/.github/ISSUE_TEMPLATE/A-submit-software-for-review.md

So sorry, thanks for your patience.

@ropensci-review-bot
Collaborator

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 281
internal dataset 215
internal graphics 11
internal stats 2
imports assertthat 29
imports utils 8
imports labelled 5
imports rlang 2
imports cli 1
imports haven 1
imports tibble 1
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
imports vctrs NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

ifelse (41), as.character (40), is.null (39), list (20), c (14), vapply (8), lapply (7), names (7), data.frame (6), logical (6), paste (6), paste0 (6), character (5), inherits (5), which (5), contributors (4), date (4), seq_along (4), substr (4), Sys.time (4), for (3), format (3), invisible (3), length (3), t (3), with (3), all (2), attr (2), class (2), drop (2), gsub (2), labels (2), nrow (2), args (1), as.data.frame (1), as.Date (1), as.POSIXct (1), cbind (1), do.call (1), double (1), if (1), nchar (1), ncol (1), rbind (1), Sys.Date (1), tolower (1), vector (1)

dataset

get_bibentry (26), creator (11), dataset_title (11), subject (10), publisher (7), rights (7), get_creator (6), identifier (6), description (5), language (5), new_Subject (5), provenance (5), dataset_df (4), get_publisher (4), get_type (4), agent (3), convert_column (3), n_triple (3), publication_year (3), var_definition (3), var_namespace (3), var_unit (3), as_dataset_df (2), as_dublincore (2), datacite (2), default_provenance (2), definition_attribute (2), geolocation (2), get_author (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), namespace_attribute (2), new_my_tibble (2), prov_author (2), unit_attribute (2), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), create_iri (1), dataset_to_triples (1), defined (1), describe (1), dublincore (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), print.dataset_df (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1)

assertthat

assert_that (29)

graphics

title (11)

utils

person (5), bibentry (2), citation (1)

labelled

var_label (4), to_labelled (1)

rlang

caller_env (1), env_is_user_facing (1)

stats

df (1), family (1)

cli

cat_line (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 29 files) and
  • 1 authors
  • 5 vignettes
  • 1 internal data file
  • 12 imported packages
  • 89 exported functions (median 5 lines of code)
  • 153 non-exported functions in R (median 8 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 29 88.1
files_vignettes 5 95.4
files_tests 29 97.1
loc_R 1731 79.0
loc_vignettes 521 77.3
loc_tests 717 77.6
num_vignettes 5 96.4 TRUE
data_size_total 2480 62.3
data_size_median 2480 69.0
n_fns_r 242 91.0
n_fns_r_exported 89 94.0
n_fns_r_not_exported 153 89.0
n_fns_per_file_r 5 71.9
num_params_per_fn 2 8.2
loc_per_fn_r 6 13.0
loc_per_fn_r_exp 5 8.5
loc_per_fn_r_not_exp 8 22.9
rel_whitespace_R 25 85.1
rel_whitespace_vignettes 37 81.2
rel_whitespace_tests 23 79.5
doclines_per_fn_exp 52 64.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 155 84.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown
--- finished re-building ‘defined.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from lines 106-108 [jsonld] (rdf.Rmd)
Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following file failed:
‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@antaldaniel
Author

@maelle let me know if this works now :)

@maelle
Member

maelle commented Jan 21, 2025

Yes, thank you!

@maelle
Member

maelle commented Jan 21, 2025

@ropensci-review-bot assign @maelle as editor

@ropensci-review-bot
Collaborator

Assigned! @maelle is now the editor

@maelle
Member

maelle commented Jan 21, 2025

Thanks again for your submission!

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments


Documentation

My main comment before I can proceed to looking for reviewers is that the case for the package could be made better.

On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)

On the other hand, how would a user take advantage of dataset?
To me, it is not clear yet from reading the docs.
Questions I wonder about:

  • As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?
  • When you mention standard statistical libraries in the README, could you name some?
  • As a data consumer, how do I create a dataset object (do I get it from an R package? shared in another way)? How can I easily use the information on units when exploring the data, when plotting it?

In short, could you exemplify "release" and "re-use" as use cases in one or more vignettes, potentially using the types of users you mention in the submission under "target audience" as personas.

For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is only mentioned in a vignette.
More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

A tiny comment: I find "reuse" harder to parse than "re-use" but that might be a personal preference.

Installation instructions

I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Instead of devtools you could recommend using pak.

install.packages("pak")
pak::pak("dataobservatory-eu/dataset")

Default git branch

You might want to rename the master branch to main as some people can be offended by the word "master", see https://www.tidyverse.org/blog/2021/10/renaming-default-branch/ that includes links to context, and practical advice on renaming the default branch.

Contributing guide

The contributing guide does not seem customized.
It mentions a possible "src" folder which is not present.

Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell.

Continuous integration

  • If AppVeyor is not used any more, please remove the related contributing file.

  • The code coverage workflow seems not to be working: https://github.com/dataobservatory-eu/dataset/actions/workflows/test-coverage.yaml I'd recommend using the latest workflow file from r-lib/actions, by copy-pasting it or by running usethis::use_github_action("test-coverage").

  • The pkgdown website is not up to date, for instance on it the test coverage badge is not broken whereas it is in the README. Please add a workflow to continuously deploy it, for instance by running usethis::use_github_action("pkgdown").

  • If you no longer wish to use the R-CMD-check workflow because you rely on the R-hub ones, please remove the old workflow file.

  • The latest commits all have red crosses as status, which shows the continuous integration files need a bit of cleaning and tweaking. 🙂

Project management

From the open issues, which ones are meant to be tackled soon?
One of them has the "First CRAN release" milestone, which is outdated.

Code style

I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

For instance in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L56C1-L57C55 the space before return_type is surprising.
I remember being inconsistent with spaces myself years ago and not noticing, (un)fortunately I was converted. 😅
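As an illustration (the snippet below is invented, not code from the package), `styler::style_text()` shows the kind of normalization that `styler::style_pkg()` would apply across `R/` and `tests/`:

```r
# styler's default tidyverse style normalizes spacing around assignment,
# commas, `=` in arguments, and braces.
styled <- styler::style_text(
  "agent <-function( x ,return_type ='person' ){ x }"
)
print(styled)
```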

Code

The code could be simplified so that reviewers might more easily follow the logic.

is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").
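For illustration, the two forms are behaviorally identical, since inherits() already returns TRUE/FALSE; a minimal check using utils::person():

```r
# Wrapping inherits() in ifelse(..., TRUE, FALSE) is redundant:
is_person_verbose <- function(p) ifelse(inherits(p, "person"), TRUE, FALSE)
is_person <- function(p) inherits(p, "person")

p <- utils::person(given = "Ada", family = "Lovelace")
stopifnot(identical(is_person_verbose(p), is_person(p)))
stopifnot(is_person(p), !is_person("a plain string"))
```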

Code like

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L4-L20

and

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L66

(and other similar pipelines) reminds me of Jenny Bryan's advice in her talk Code smells and feels

If your conditions deal with class, it's time to get object-oriented. In CS jargon, use polymorphisms.

So instead of the complex logic, you'd define methods.
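A hypothetical sketch of that refactoring (the generic name and methods below are illustrative, not the package's actual API):

```r
# One S3 generic replaces a chain of if (inherits(x, ...)) branches;
# dispatch on the class does the branching for us.
agent_label <- function(x, ...) UseMethod("agent_label")
agent_label.person    <- function(x, ...) format(x, include = c("given", "family"))
agent_label.character <- function(x, ...) x
agent_label.default   <- function(x, ...) ":tba"

agent_label(utils::person("Ada", "Lovelace"))  # "Ada Lovelace"
agent_label(":unas")                           # ":unas"
agent_label(NULL)                              # ":tba"
```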

In some files like R/xsd_convert.R and R/dataset_title.R, you use class(something) == or class(something) %in% instead of code built on the more correct inherits().
Using "proper functions for handling class & type" is another tip in the aforementioned talk. 😸

Since dataset imports rlang, you could use the %||% operator from rlang.
For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

creators   <- ifelse(is.null(dataset_bibentry$author), ":tba", dataset_bibentry$author)

would become

creators   <- dataset_bibentry$author %||% ":tba"

There are many other occurrences of the ifelse(is.null( pattern (and variants with different spacing) that could get the same treatment.
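Beyond brevity, `%||%` also sidesteps a subtle hazard of the `ifelse(is.null(` idiom: `ifelse()` returns a result the length of its first argument, so a non-NULL vector is silently truncated to its first element. A sketch (the `author` values are illustrative):

```r
library(rlang)  # provides `%||%`; base R >= 4.4 also ships this operator

author <- NULL
author %||% ":tba"                       # ":tba"

author <- c("Ada Lovelace", "Grace Hopper")
ifelse(is.null(author), ":tba", author)  # "Ada Lovelace" only -- truncated
author %||% ":tba"                       # both elements preserved
```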

In the R/agent.R file, functions like get_creator() are defined twice, why?

Example dataset

The iris dataset is very well-known, but it is also infamous because of its eugenics links.
Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

Tests

Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23

expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
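A sketch of what such a custom expectation could look like (the name `expect_dataset_df` is hypothetical):

```r
# Wrapping expect_s3_class() gives failures that report the actual class,
# instead of the bare "FALSE is not TRUE" from expect_true(is.dataset_df(x)).
expect_dataset_df <- function(object) {
  testthat::expect_s3_class(object, "dataset_df")
}

x <- structure(data.frame(a = 1), class = c("dataset_df", "data.frame"))
expect_dataset_df(x)  # passes silently
```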

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18

should be

expect_type(as_datacite(iris_dataset, "list"), "list")

What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?

When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?
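For example (the function and its error message below are made up for illustration):

```r
# Matching the message makes the test fail if a different, unexpected
# error is raised instead of the one the test was written for.
set_title <- function(x) {
  if (!is.character(x)) stop("the title must be a character string")
  x
}

testthat::expect_error(set_title(1), regexp = "must be a character")
```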


Thank you! Happy to discuss any of the items.

@maelle maelle removed the holding label Jan 21, 2025
@maelle
Member

maelle commented Jan 21, 2025

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)




Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@maelle (Member) commented Jan 21, 2025

Coverage is also something noted in my comments.

@antaldaniel (Author)

@maelle Thank you. It is unfortunate that you joined this review only after two years, because I find many of your comments very useful. However, I must say that some of them would be unusual to address in a vignette, and I would find an article format more useful, e.g., your questions about frictionless, datapack, datapackage.org, and researchobject. About two years of research went into this package, and that is usually not vignette material; reviewers of my other packages disliked extended vignettes.

I think that the frictionless package family follows a very different approach, and when I started the development of this package and this review, it did not even seem relevant. I now see that in some use cases both can be useful and a choice could be offered, and I will argue that; but I think this is more a matter for a paper considering statistical exchange formats and their best representation in R.

A small question: what is the package coverage you are aiming at? I think that the package already has very high coverage, and it exceeded the requirements when I started the review.

@maelle (Member) commented Jan 21, 2025

@antaldaniel thank you for your answer!

  • Use cases would be crucial to make the point of the package clearer. You have a vision that potential users need to understand. With concrete examples of usage, understanding what it is all about (and why a potential user should care) would be easier. A use case does not have to be really long, and it can contain diagrams, but it should help someone see what they could do with the package in practice. Even if, from the current README text, they might gather that it has the worthy goal of helping with FAIRness, the practical application of the package might remain nebulous.
  • I agree the docs do not need to contain an in-depth comparison with other approaches, but a small section "Related work" would be important. It'd allow users to quickly compare your package to others, and also would help them, again, get the point. "Oh, it's a bit like frictionless, it's not a new tibble or a new dataspice". In our dev guide that's "If applicable, how the package compares to other similar packages and/or how it relates to other packages." in https://devguide.ropensci.org/pkg_building.html#readme With workflow/standards packages like this one (as opposed to, say, a package that helps you get data from a specific API, where the goal is easier to grasp for a newcomer), introducing the package is obviously harder, but also more important.
  • For the coverage, the problem is not the number but the fact that no continuous integration workflow is updating the badge. I am aiming at a not unknown coverage in the README badge. 😉

@antaldaniel (Author)

Thank you @maelle, and I will look into why the coverage is not updating... However, do you have an explicit coverage target?

@maelle (Member) commented Jan 21, 2025

Yes it's 75% https://devguide.ropensci.org/pkg_building.html#testing -- But I'd recommend also looking at the coverage report to find the not covered lines and make a judgement call on how important/risky these lines are (vs how hard to test they'd be) just so you're sure there's nothing dangerous in the remaining 25%. 🙂 The idea really is to cover "key functionality" (phrasing from the dev guide).

In my comments I recommend updating the test-coverage workflow; the fix might be as simple as that.
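For reference, the workflow can be regenerated from the r-lib/actions templates, and the uncovered lines can be inspected locally (a sketch, assuming the usethis and covr packages are installed):

```r
# Overwrites .github/workflows/test-coverage.yaml with the current
# r-lib/actions template, which runs covr on CI and uploads the result.
usethis::use_github_action("test-coverage")

# Locally, compute coverage and open a report highlighting uncovered lines,
# to judge how important or risky the untested code is.
cov <- covr::package_coverage()
covr::report(cov)
```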

@antaldaniel (Author)

@maelle Thank you again for your useful comments and your PRs. I reorganised the issues and created a new milestone. The milestone currently breaks up your review comments into 7 issues, but they could of course be broken down into more. I set myself a deadline of 16 February to resolve them, though it may happen earlier. I will tag you in the issues when they are ready for review and will also add a new comment here.

@maelle (Member) commented Jan 21, 2025

Thank you! I'll put the issue on hold for now but will remove the label as soon as you are ready. Happy to respond to comments (and issues/PRs in your repo) in the meantime if needed.

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot put on hold

@ropensci-review-bot (Collaborator)

Submission on hold!

@maurolepore (Member) commented Feb 2, 2025

Dear @antaldaniel this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see:

  • @maelle provided detailed editor checks.
  • @antaldaniel plans to follow up by 16 February.
  • Until then the submission is on hold.

It all looks good to me. I'll step back. Thanks!

@antaldaniel (Author)

Hi @maurolepore,

I just resolved three issues with new commits, so their evolution and resolution can be followed in the dataset package repository. I was hoping for, and did receive, some broader and useful review, as this is an early stage in the development of a package family. Responding to those points is perhaps best placed in an accompanying paper rather than in the package itself; for the purpose of this review, I add them as a vignette, although it is a bit too philosophical for a normal vignette.

I will note when all are solved, although I have a very serious objection to one.

I think that the use of the iris dataset is a choice of the R Core Team, and any criticism should be aimed there. This package aims to extend the usability of a central base R object, the data.frame itself, which is taught, explained, and tested in literally millions of use cases with the iris dataset. If the R Core Team debated ditching the dataset, I would pitch in its defence, and I would accept its replacement with something else; however, I do not think this is a valid point in the review of a package that extends the core R system.
