Test data (SDTM) for the pharmaverse family of packages
This package aims to provide a one-stop shop for SDTM test data in the pharmaverse family of packages. This includes datasets that are therapeutic area (TA)-agnostic (DM, VS, EG, etc.) as well as TA-specific ones (RS, TR, OE, etc.).
The package is available from CRAN and can be installed by running install.packages("pharmaversesdtm"). To install the latest development version of the package directly from GitHub, use the following code:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("pharmaverse/pharmaversesdtm", ref = "main") # This command installs the latest development version directly from GitHub.

Some test datasets have been sourced from the CDISC pilot project, while other datasets have been constructed ad hoc by the {admiral} team. Please check the Reference page for detailed information regarding the source of specific datasets.
- Datasets that are TA-agnostic: same as SDTM domain name (e.g., dm, rs).
- Datasets that are TA-specific: domain_TA_others, where others go from broader categories to more specific ones (e.g., oe_ophtha, rs_onco, rs_onco_irecist).
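The naming convention above can be illustrated with a minimal sketch that splits a dataset name into its domain, TA, and remaining qualifiers. The helper parse_dataset_name() is hypothetical and is not part of {pharmaversesdtm}; it assumes that underscores appear only as separators.

```r
# Split a dataset name of the form domain_TA_others into its parts.
# parse_dataset_name() is a hypothetical helper, not part of {pharmaversesdtm}.
parse_dataset_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  list(
    domain = parts[1],
    # TA-agnostic datasets have no TA component.
    ta = if (length(parts) >= 2) parts[2] else NA_character_,
    # Anything after the TA, ordered from broader to more specific.
    others = if (length(parts) >= 3) parts[-(1:2)] else character(0)
  )
}

parse_dataset_name("dm")
parse_dataset_name("rs_onco_irecist")
```

Under these assumptions, dm has only a domain, while rs_onco_irecist splits into the domain rs, the TA onco, and the qualifier irecist.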
Note: If an SDTM domain is used by multiple TAs, {pharmaversesdtm} may provide multiple versions of the corresponding test dataset. For instance, the package contains ex and ex_ophtha as the latter contains ophthalmology-specific variables such as EXLAT and EXLOC, and EXROUTE is exchanged for a plausible ophthalmology value.
First, open a GitHub issue in {pharmaversesdtm} describing the planned updates and tag @pharmaverse/admiral so that a member of the core development team can sanity-check the request. There are then two main ways to extend the test data: adding new datasets, or extending existing datasets with new records or variables. Whichever method you choose, note the following:
- Programs that generate test data are stored in the data-raw/ folder.
- Each of these programs is written as a standalone R script: if any packages need to be loaded for a given program, then call library() at the start of the program (but please do not call library(pharmaversesdtm)).
- When you have created a program in the data-raw/ folder, you need to run it as a standalone R script in order to generate a test dataset that will become part of the {pharmaversesdtm} package, but you do not need to build the package.
- Following best practice, each dataset is stored as a .rda file whose name is consistent with the name of the dataset, e.g., dataset xx is stored as xx.rda. The easiest way to achieve this is to use usethis::use_data(xx).
- The programs in data-raw/ are stored within the {pharmaversesdtm} GitHub repository, but they are not part of the {pharmaversesdtm} package: the data-raw/ folder is specified in .Rbuildignore.
- When you run a program that is in the data-raw/ folder, you generate a dataset that is written to the data/ folder, which will become part of the {pharmaversesdtm} package.
- The names and sources of test datasets are specified in R/*.R, for the purpose of generating documentation in the man/ folder.
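As a rough illustration of the points above, a standalone program in data-raw/ might look like the following sketch. The dataset xx, its variables, and its values are invented for illustration, and a temporary directory stands in for the data/ folder so that the sketch runs outside the repository; inside the repository, usethis::use_data(xx, overwrite = TRUE) would write data/xx.rda directly.

```r
# data-raw/xx.R (hypothetical program name): run as a standalone script.
# Load any packages needed here with library(); do not call
# library(pharmaversesdtm).

# Build a toy dataset; the variables and values are illustrative only.
xx <- data.frame(
  STUDYID = "STUDY01",
  DOMAIN = "XX",
  USUBJID = c("STUDY01-001", "STUDY01-002"),
  XXSEQ = c(1L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: a temporary directory stands in for data/ in this sketch.
# In the repository, usethis::use_data(xx, overwrite = TRUE) does this step.
out_dir <- file.path(tempdir(), "data")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "xx.rda")
save(xx, file = out_file)

# Reload into a clean environment to confirm that the object name stored
# in the file matches the file name.
check_env <- new.env()
loaded <- load(out_file, envir = check_env)
```

Keeping the object name and the file name identical is what lets the dataset be found under that name once the package is installed.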
Note: The documentation process in {pharmaversesdtm} is automated for consistency and ease of maintenance.
{pharmaversesdtm} uses a single JSON file to store metadata for all SDTM datasets.
This file contains information such as:
- dataset name
- dataset label
- dataset description
- author
- source
- therapeutic area
- any other dataset-specific metadata.
This metadata drives the automated documentation process, and the file is read by data-raw/create_sdtms_data.R to help generate:
- Documentation .R files in R/
- .Rd files in man/
- Test Name / Test Code table inclusion (when present)
- Dataset grouping by Therapeutic Area.
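To make the idea of metadata-driven documentation concrete, the sketch below turns a single metadata entry into roxygen2-style lines. It is not the logic of data-raw/create_sdtms_data.R: the field names, the make_roxygen() helper, and the choice of tags are all assumptions made for illustration, and the entry is written as an R list rather than read from the JSON file.

```r
# One metadata entry, written as an R list. The field names are hypothetical
# and may not match the real JSON schema.
spec <- list(
  name = "xx",
  label = "Example Domain",
  description = "A toy SDTM dataset used for illustration.",
  source = "Constructed ad hoc for this sketch.",
  therapeutic_area = "General"
)

# make_roxygen() is a hypothetical helper that renders the entry as
# roxygen2-style comment lines followed by the quoted dataset name.
make_roxygen <- function(spec) {
  c(
    paste0("#' ", spec$label),
    "#'",
    paste0("#' ", spec$description),
    "#'",
    paste0("#' @name ", spec$name),
    paste0("#' @family ", spec$therapeutic_area),
    paste0("#' @source ", spec$source),
    paste0("\"", spec$name, "\"")
  )
}

writeLines(make_roxygen(spec))
```

Generating these lines from one metadata file, rather than editing each R/ file by hand, is what keeps labels, sources, and groupings consistent across datasets.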
To add a new dataset:
- Create a program in the data-raw/ folder, named <name>.R, where <name> should follow the naming convention, to generate the test data and output <name>.rda to the data/ folder.
  - Use CDISC pilot data such as dm as input in this program in order to create realistic synthetic data that remains consistent with other domains (not mandatory).
  - Note that no personal data should be used as part of this package, even if anonymized.
- Run the program.
- Update inst/extdata/sdtms-specs.json with the new dataset metadata, including:
  - Assigning the dataset label, description, author, source, purpose, or structure.
  - Assigning or updating the dataset therapeutic area (used for reference-page grouping).
- Run data-raw/create_sdtms_data.R in order to update NAMESPACE and update the .Rd files in man/.
- Add your GitHub handle to .github/CODEOWNERS.
- Update NEWS.md.
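The first step, deriving synthetic data that stays consistent with dm, could be sketched as follows. The dm data frame here is a small stand-in rather than the CDISC pilot data, and the domain XX, its variables, and the visit names are invented for illustration; a real program would take the existing dm test data as input.

```r
# Stand-in for the CDISC pilot dm dataset; in a real data-raw/ program the
# input would be the existing test data instead.
dm <- data.frame(
  STUDYID = "STUDY01",
  USUBJID = c("STUDY01-001", "STUDY01-002", "STUDY01-003"),
  stringsAsFactors = FALSE
)

set.seed(123) # fixed seed so that rerunning the program gives the same data

visits <- c("BASELINE", "WEEK 2")

# Cross join: one record per subject and visit, so every USUBJID in the
# new dataset is guaranteed to exist in dm.
xx <- merge(dm, data.frame(VISIT = visits, stringsAsFactors = FALSE), by = NULL)
xx <- xx[order(xx$USUBJID, match(xx$VISIT, visits)), ]

xx$DOMAIN <- "XX"
# Sequence number restarts at 1 within each subject.
xx$XXSEQ <- ave(seq_along(xx$USUBJID), xx$USUBJID, FUN = seq_along)
# Synthetic numeric result; the distribution is arbitrary.
xx$XXSTRESN <- round(rnorm(nrow(xx), mean = 50, sd = 10), 1)
rownames(xx) <- NULL
```

Keying the new records on the subjects in dm keeps the domains consistent with each other, and the fixed seed keeps the generated dataset reproducible between runs.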
To update an existing dataset:
- Locate the existing program <name>.R in the data-raw/ folder and update it accordingly.
- Update the corresponding entry in inst/extdata/sdtms-specs.json to reflect the changes, including:
  - Changing the dataset label, description, author, or source.
  - Modifying the dataset purpose or structure.
  - Updating the dataset therapeutic area.
  - Removing a dataset (delete its entry from the JSON entirely).
- Run the program, and output the updated <name>.rda to the data/ folder.
- Run data-raw/create_sdtms_data.R in order to update NAMESPACE and update the .Rd files in man/.
- Add your GitHub handle to .github/CODEOWNERS.
- Update NEWS.md.
Along with the authors and contributors, thanks to the following people for their work on the package:
G Gayatri, Pooja Kumari, Sadchla Mascary, Kangjie Zhang and Zelos Zhu.
