
dataset: Create Data Frames that are Easier to Exchange and Reuse #681

Open
13 of 24 tasks
antaldaniel opened this issue Jan 21, 2025 · 24 comments

Comments

@antaldaniel

antaldaniel commented Jan 21, 2025

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset
Version submitted: 0.3.4002
Submission type: Standard
Editor: @maelle
Reviewers: TBD

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Version: 0.3.4002
Date: 2024-12-26
DOI: 10.32614/CRAN.package.dataset
Language: en-GB
Authors@R: 
    c(person(given = "Daniel", family = "Antal", 
           email = "[email protected]", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           ), 
      person(given = "Marcelo", family =  "Perlin", 
             role = c("rev"), 
             comment = c(ORCID = "0000-0002-9839-4268")
             )
      )
Maintainer: Daniel Antal <[email protected]>
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects as well-referenced, 
    well-described, interoperable datasets in a release- and reuse-ready form.
License: GPL (>= 3)
Encoding: UTF-8
URL: https://dataset.dataobservatory.eu/
BugReports: https://github.com/dataobservatory-eu/dataset/issues/
Roxygen: list(markdown = TRUE)
LazyData: true
Imports: 
    assertthat,
    cli,
    haven,
    ISOcodes,
    labelled,
    methods,
    pillar,
    RefManageR,
    rlang,
    tibble,
    utils,
    vctrs (>= 0.5.2)
RoxygenNote: 7.3.2
Suggests: 
    knitr,
    rdflib,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 3.5)
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
      • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package works with various semantic interoperability standards. It therefore allows users to retrieve RDF-annotated, rich, platform-independent data and reconstruct it as an R data.frame with rich metadata attributes, or to release interoperable, RDF-annotated datasets on linked open data platforms from native R objects.

  • Who is the target audience and what are scientific applications of this package?

Production-side statisticians. Scientists who want to update their sources from various data repositories and exchanges. Scientists and research data managers who want to release new scientific or professional datasets that follow modern interoperability standards.

The package aims to complement the rdflib and dataspice packages.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

Code of conduct

@ropensci-review-bot
Collaborator

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.


@maelle
Member

maelle commented Jan 21, 2025

@antaldaniel I am so sorry, you didn't use the right template... because we broke the right template last time we edited it! I put it back in shape, thanks for helping us catch it. You can find it by opening a new issue (it'll be at the top of the list) or use it from here https://github.com/ropensci/software-review/blob/main/.github/ISSUE_TEMPLATE/A-submit-software-for-review.md

So sorry, thanks for your patience.

@ropensci-review-bot
Collaborator

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 281
internal dataset 215
internal graphics 11
internal stats 2
imports assertthat 29
imports utils 8
imports labelled 5
imports rlang 2
imports cli 1
imports haven 1
imports tibble 1
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
imports vctrs NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

ifelse (41), as.character (40), is.null (39), list (20), c (14), vapply (8), lapply (7), names (7), data.frame (6), logical (6), paste (6), paste0 (6), character (5), inherits (5), which (5), contributors (4), date (4), seq_along (4), substr (4), Sys.time (4), for (3), format (3), invisible (3), length (3), t (3), with (3), all (2), attr (2), class (2), drop (2), gsub (2), labels (2), nrow (2), args (1), as.data.frame (1), as.Date (1), as.POSIXct (1), cbind (1), do.call (1), double (1), if (1), nchar (1), ncol (1), rbind (1), Sys.Date (1), tolower (1), vector (1)

dataset

get_bibentry (26), creator (11), dataset_title (11), subject (10), publisher (7), rights (7), get_creator (6), identifier (6), description (5), language (5), new_Subject (5), provenance (5), dataset_df (4), get_publisher (4), get_type (4), agent (3), convert_column (3), n_triple (3), publication_year (3), var_definition (3), var_namespace (3), var_unit (3), as_dataset_df (2), as_dublincore (2), datacite (2), default_provenance (2), definition_attribute (2), geolocation (2), get_author (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), namespace_attribute (2), new_my_tibble (2), prov_author (2), unit_attribute (2), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), create_iri (1), dataset_to_triples (1), defined (1), describe (1), dublincore (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), print.dataset_df (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1)

assertthat

assert_that (29)

graphics

title (11)

utils

person (5), bibentry (2), citation (1)

labelled

var_label (4), to_labelled (1)

rlang

caller_env (1), env_is_user_facing (1)

stats

df (1), family (1)

cli

cat_line (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 29 files) and
  • 1 authors
  • 5 vignettes
  • 1 internal data file
  • 12 imported packages
  • 89 exported functions (median 5 lines of code)
  • 153 non-exported functions in R (median 8 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 29 88.1
files_vignettes 5 95.4
files_tests 29 97.1
loc_R 1731 79.0
loc_vignettes 521 77.3
loc_tests 717 77.6
num_vignettes 5 96.4 TRUE
data_size_total 2480 62.3
data_size_median 2480 69.0
n_fns_r 242 91.0
n_fns_r_exported 89 94.0
n_fns_r_not_exported 153 89.0
n_fns_per_file_r 5 71.9
num_params_per_fn 2 8.2
loc_per_fn_r 6 13.0
loc_per_fn_r_exp 5 8.5
loc_per_fn_r_not_exp 8 22.9
rel_whitespace_R 25 85.1
rel_whitespace_vignettes 37 81.2
rel_whitespace_tests 23 79.5
doclines_per_fn_exp 52 64.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 155 84.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown
--- finished re-building ‘defined.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from lines 106-108 [jsonld] (rdf.Rmd)
Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following file failed:
‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@antaldaniel
Author

@maelle let me know if this works now :)

@maelle
Member

maelle commented Jan 21, 2025

Yes, thank you!

@maelle
Member

maelle commented Jan 21, 2025

@ropensci-review-bot assign @maelle as editor

@ropensci-review-bot
Collaborator

Assigned! @maelle is now the editor

@maelle
Member

maelle commented Jan 21, 2025

Thanks again for your submission!

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments


Documentation

My main comment before I can proceed to looking for reviewers is that the case for the package could be made better.

On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)

On the other hand, how would a user take advantage of dataset?
To me, it is not clear yet from reading the docs.
Questions I wonder about:

  • As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?
  • When you mention standard statistical libraries in the README, could you name some?
  • As a data consumer, how do I create a dataset object (do I get it from an R package? shared in another way)? How can I easily use the information on units when exploring the data, when plotting it?

In short, could you exemplify "release" and "re-use" as use cases in one or more vignettes, potentially using the types of users you mention in the submission under "target audience" as personas.

For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is only mentioned in a vignette.
More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

A tiny comment: I find "reuse" harder to parse than "re-use" but that might be a personal preference.

Installation instructions

I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Instead of devtools you could recommend using pak.

install.packages("pak")
pak::pak("dataobservatory-eu/dataset")

Default git branch

You might want to rename the master branch to main as some people can be offended by the word "master", see https://www.tidyverse.org/blog/2021/10/renaming-default-branch/ that includes links to context, and practical advice on renaming the default branch.

Contributing guide

The contributing guide does not seem customized.
It mentions a possible "src" folder which is not present.

Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell.

Continuous integration

  • If AppVeyor is not used any more, please remove the related contributing file.

  • The code coverage workflow seems not to be working: https://github.com/dataobservatory-eu/dataset/actions/workflows/test-coverage.yaml I'd recommend using the latest workflow file from r-lib/actions, by copy-pasting it or by running usethis::use_github_action("test-coverage").

  • The pkgdown website is not up to date, for instance on it the test coverage badge is not broken whereas it is in the README. Please add a workflow to continuously deploy it, for instance by running usethis::use_github_action("pkgdown").

  • If you no longer wish to use the R-CMD-check workflow because you rely on the R-hub ones, please remove the old workflow file.

  • The latest commits all have red crosses as status, which shows the continuous integration files need a bit of cleaning and tweaking. 🙂

Project management

From the open issues, which ones are meant to be tackled soon?
One of them has the "First CRAN release" milestone, which is outdated.

Code style

I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

For instance in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L56C1-L57C55 the space before return_type is surprising.
I remember being inconsistent with spaces myself years ago and not noticing, (un)fortunately I was converted. 😅
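As an illustration (the snippet below is invented, not code from the package), `styler::style_text()` shows the kind of normalization that `styler::style_pkg()` would apply across `R/` and `tests/`:

```r
# styler's default tidyverse style normalizes spacing around assignment,
# commas, `=` in arguments, and braces.
styled <- styler::style_text(
  "agent <-function( x ,return_type ='person' ){ x }"
)
print(styled)
```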

Code

The code could be simplified so that reviewers might more easily follow the logic.

is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").
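For illustration, the two forms are behaviorally identical, since inherits() already returns TRUE/FALSE; a minimal check using utils::person():

```r
# Wrapping inherits() in ifelse(..., TRUE, FALSE) is redundant:
is_person_verbose <- function(p) ifelse(inherits(p, "person"), TRUE, FALSE)
is_person <- function(p) inherits(p, "person")

p <- utils::person(given = "Ada", family = "Lovelace")
stopifnot(identical(is_person_verbose(p), is_person(p)))
stopifnot(is_person(p), !is_person("a plain string"))
```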

Code like

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L4-L20

and

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L66

(and other similar pipelines) reminds me of Jenny Bryan's advice in her talk Code smells and feels

If your conditions deal with class, it's time to get object-oriented. In CS jargon, use polymorphisms.

So instead of the complex logic, you'd define methods.
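A hypothetical sketch of that refactoring (the generic name and methods below are illustrative, not the package's actual API):

```r
# One S3 generic replaces a chain of if (inherits(x, ...)) branches;
# dispatch on the class does the branching for us.
agent_label <- function(x, ...) UseMethod("agent_label")
agent_label.person    <- function(x, ...) format(x, include = c("given", "family"))
agent_label.character <- function(x, ...) x
agent_label.default   <- function(x, ...) ":tba"

agent_label(utils::person("Ada", "Lovelace"))  # "Ada Lovelace"
agent_label(":unas")                           # ":unas"
agent_label(NULL)                              # ":tba"
```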

In some files like R/xsd_convert.R and R/dataset_title.R, you use class(something) == or class(something) %in% instead of code built on the more correct inherits().
Using "proper functions for handling class & type" is another tip in the aforementioned talk. 😸

Since dataset imports rlang, you could use the %||% operator from rlang.
For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

creators   <- ifelse(is.null(dataset_bibentry$author), ":tba", dataset_bibentry$author)

would become

creators   <- dataset_bibentry$author %||% ":tba"

There are many other occurrences of the ifelse(is.null( pattern (and variants with different spacing) that could get the same treatment.
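Beyond brevity, `%||%` also sidesteps a subtle hazard of the `ifelse(is.null(` idiom: `ifelse()` returns a result the length of its first argument, so a non-NULL vector is silently truncated to its first element. A sketch (the `author` values are illustrative):

```r
library(rlang)  # provides `%||%`; base R >= 4.4 also ships this operator

author <- NULL
author %||% ":tba"                       # ":tba"

author <- c("Ada Lovelace", "Grace Hopper")
ifelse(is.null(author), ":tba", author)  # "Ada Lovelace" only -- truncated
author %||% ":tba"                       # both elements preserved
```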

In the R/agent.R file, functions like get_creator() are defined twice, why?

Example dataset

The iris dataset is very well-known, but it is also infamous because of its eugenics links.
Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

Tests

Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23

expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
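A sketch of what such a custom expectation could look like (the name `expect_dataset_df` is hypothetical):

```r
# Wrapping expect_s3_class() gives failures that report the actual class,
# instead of the bare "FALSE is not TRUE" from expect_true(is.dataset_df(x)).
expect_dataset_df <- function(object) {
  testthat::expect_s3_class(object, "dataset_df")
}

x <- structure(data.frame(a = 1), class = c("dataset_df", "data.frame"))
expect_dataset_df(x)  # passes silently
```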

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18

should be

expect_type(as_datacite(iris_dataset, "list"), "list")

What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?

When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?
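For example (the function and its error message below are made up for illustration):

```r
# Matching the message makes the test fail if a different, unexpected
# error is raised instead of the one the test was written for.
set_title <- function(x) {
  if (!is.character(x)) stop("the title must be a character string")
  x
}

testthat::expect_error(set_title(1), regexp = "must be a character")
```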


Thank you! Happy to discuss any of the items.

@maelle maelle removed the holding label Jan 21, 2025
@maelle
Member

maelle commented Jan 21, 2025

@ropensci-review-bot check package

@ropensci-review-bot
Collaborator

Thanks, about to send the query.

@ropensci-review-bot
Collaborator

🚀

Editor check started

👋

@ropensci-review-bot
Collaborator

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)




Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@maelle (Member) commented Jan 21, 2025

Coverage is also something noted in my comments.

@antaldaniel (Author)

@maelle Thank you. It is unfortunate that you joined this review only after two years, because I find many of your comments very useful. However, I must say that some of them would be unusual to address in a vignette, and I would find an article format more useful, e.g., your questions about frictionless, datapack, datapackage.org, and researchobject. About two years of research went into this package, and that is usually not vignette material; reviewers of my other packages disliked extended vignettes.

I think that the frictionless package family follows a very different approach, and when I started the development of this package and this review, it did not even seem relevant. I now see that in some use cases both can be useful and a choice could be offered, and I will argue that; but I think this is more a matter for a paper considering statistical exchange formats and their best representation in R.

A small question: what is the package coverage you are aiming at? I think that the package already has very high coverage, and it exceeded the requirements when I started the review.

@maelle (Member) commented Jan 21, 2025

@antaldaniel thank you for your answer!

  • Use cases would be crucial to make the point of the package clearer. You have a vision that potential users need to understand. With concrete examples of usage, understanding what it is all about (and why a potential user should care) would be easier. A use case does not have to be really long, and it can contain diagrams, but it should help someone see what they could do with the package in practice. Even if, from the current README text, they might gather that it has the worthy goal of helping with FAIRness, the practical application of the package might remain nebulous.
  • I agree the docs do not need to contain an in-depth comparison with other approaches, but a small section "Related work" would be important. It'd allow users to quickly compare your package to others, and also would help them, again, get the point. "Oh, it's a bit like frictionless, it's not a new tibble or a new dataspice". In our dev guide that's "If applicable, how the package compares to other similar packages and/or how it relates to other packages." in https://devguide.ropensci.org/pkg_building.html#readme With workflow/standards packages like this one (as opposed to, say, a package that helps you get data from a specific API, where the goal is easier to grasp for a newcomer), introducing the package is obviously harder, but also more important.
  • For the coverage, the problem is not the number but the fact that no continuous integration workflow is updating the badge. I am aiming at a not unknown coverage in the README badge. 😉

@antaldaniel (Author)

Thank you @maelle, and I will look into why the coverage is not updating... However, do you have an explicit coverage target?

@maelle (Member) commented Jan 21, 2025

Yes it's 75% https://devguide.ropensci.org/pkg_building.html#testing -- But I'd recommend also looking at the coverage report to find the not covered lines and make a judgement call on how important/risky these lines are (vs how hard to test they'd be) just so you're sure there's nothing dangerous in the remaining 25%. 🙂 The idea really is to cover "key functionality" (phrasing from the dev guide).

In my comments I recommend updating the test-coverage workflow; the fix might be as simple as that.
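For reference, the workflow can be regenerated from the r-lib/actions templates, and the uncovered lines can be inspected locally (a sketch, assuming the usethis and covr packages are installed):

```r
# Overwrites .github/workflows/test-coverage.yaml with the current
# r-lib/actions template, which runs covr on CI and uploads the result.
usethis::use_github_action("test-coverage")

# Locally, compute coverage and open a report highlighting uncovered lines,
# to judge how important or risky the untested code is.
cov <- covr::package_coverage()
covr::report(cov)
```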

@antaldaniel (Author)

@maelle Thank you again for your useful comments and your PRs. I reorganised the issues and created a new milestone. The milestone currently breaks up your review comments into 7 issues, but they could of course be broken down into more. I set myself a deadline of 16 February to resolve them, though it may happen earlier. I will tag you in the issues when they are ready for review and will also add a new comment here.

@maelle (Member) commented Jan 21, 2025

Thank you! I'll put the issue on hold for now but will remove the label as soon as you are ready. Happy to respond to comments (and issues/PRs in your repo) in the meantime if needed.

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot put on hold

@ropensci-review-bot (Collaborator)

Submission on hold!

@maurolepore (Member) commented Feb 2, 2025

Dear @antaldaniel this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see:

  • @maelle provided detailed editor checks.
  • @antaldaniel plans to follow up by 16 February.
  • Until then the submission is on hold.

It all looks good to me. I'll step back. Thanks!

@antaldaniel (Author)

Hi @maurolepore,

I just resolved three issues with new commits, so their evolution and resolution can be followed in the dataset package repository. I was hoping for, and did receive, some broader and useful review, as this is an early stage in the development of a package family. Responding to those points is perhaps best placed in an accompanying paper rather than in the package itself; for the purpose of this review, I add them as a vignette, although it is a bit too philosophical for a normal vignette.

I will note when all are solved, although I have a very serious objection to one.

I think that the use of the iris dataset is a choice of the R Core Team, and any criticism should be aimed there. This package aims to extend the usability of a central base R object, the data.frame itself, which is taught, explained, and tested in literally millions of use cases with the iris dataset. If the R Core Team debated ditching the dataset, I would pitch in its defence, and I would accept its replacement with something else; however, I do not think this is a valid point in the review of a package that extends the core R system.
