Merge branch 'main' into lcb/grouped_epi_archive

lcbrooks · lcbrooks · commit 6e3f554768db · 2022-11-01T06:18:35.000-07:00
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Type: Package
 Package: epiprocess
 Title: Tools for basic signal processing in epidemiology
-Version: 1.0.0
+Version: 0.5.0.9999
 Authors@R: c(
     person("Jacob", "Bien", role = "ctb"),
     person("Logan", "Brooks", role = "aut"),
diff --git a/NEWS.md b/NEWS.md
@@ -0,0 +1,249 @@
+# epiprocess 0.5.0.9999 (development version)
+
+Note that `epiprocess` uses the [Semantic Versioning
+("semver")](https://semver.org/) scheme for all release versions, but not for
+development versions. A ".9999" suffix indicates a development version.
+
+## Cleanup:
+
+* Added a `NEWS.md` file to track changes to the package.
+
+# epiprocess 0.5.0:
+
+## Potentially-breaking changes:
+
+* `epix_slide`, `<epi_archive>$slide` now feed `f` an `epi_df` rather than
+  converting to a tibble/`tbl_df` first, allowing use of `epi_df` methods and
+  metadata, and often yielding `epi_df`s out of the slide as a result. To obtain
+  the old behavior, convert to a tibble within `f`.
+
+## Improvements:
+
+* Fixed `epix_merge`, `<epi_archive>$merge` always raising an error on `sync="truncate"`.
+
+## Cleanup:
+
+* Added `Remotes:` entry for `genlasso`, which was removed from CRAN
+* Added `as_epi_archive` tests
+* Added missing `epix_merge` test for `sync="truncate"`
+
+# epiprocess 0.4.0:
+
+## Potentially-breaking changes:
+
+* Fixed `[.epi_df` to not reorder columns, which was incompatible with
+  downstream packages.
+* Changed `[.epi_df` decay-to-tibble logic to be more coherent with `epi_df`'s
+  current tolerance of nonunique keys: stopped decaying to a tibble in some
+  cases where a unique key wouldn't have been preserved, since we don't
+  enforce a unique key elsewhere.
+* Fixed `[.epi_df` to adjust `"other_keys"` metadata when corresponding
+  columns are selected out.
+* Fixed `[.epi_df` to raise an error if resulting column names would be
+  nonunique.
+* Fixed `[.epi_df` to drop metadata if decaying to a tibble (due to removal
+  of essential columns).
+
+## Improvements:
+
+* Added check that `epi_df` `additional_metadata` is list.
+* Fixed some incorrect `as_epi_df` examples.
+
+## Cleanup:
+
+* Applied rename of upstream package in examples: `delphi.epidata` ->
+  `epidatr`.
+* Rounded out `[.epi_df` tests.
+
+# epiprocess 0.3.0:
+
+## Breaking changes:
+
+* `as_epi_archive`, `epi_archive$new`:
+  * Compactification (see below) by default may change results if working
+    directly with the `epi_archive`'s `DT` field; to disable, pass in
+    `compactify=FALSE`.
+* `epi_archive`'s wrappers and R6 methods have been updated to follow these
+  rules regarding reference semantics:
+  * `epix_<method>` will not mutate input `epi_archive`s, but may alias them
+    or alias their fields (which should not be a worry if a user sticks to
+    these `epix_*` functions and "regular" R functions with
+    copy-on-write-like behavior, avoiding mutating functions such as `[.data.table`)
+  * `x$<method>` may mutate `x`; if it mutates `x`, it will return `x`
+    invisibly (where this makes sense), and, for each of its fields, may
+    either mutate the object to which it refers or reseat the reference (but
+    not both); if `x$<method>` does not mutate `x`, its result may contain
+    aliases to `x` or its fields.
+* `epix_merge`, `<epi_archive>$merge`:
+  * Removed `...`, `locf`, and `nan` parameters.
+  * Changed the default behavior, which now corresponds to using
+    `by=key(x$DT)` (but demanding that this is the same set of column names as
+    `key(y$DT)`), `all=TRUE`, `locf=TRUE`, `nan=NaN` (but with the
+    post-filling step fixed to only apply to gaps, and no longer fill over
+    `NA`s originating from `x$DT` and `y$DT`).
+  * `x` and `y` are no longer allowed to share names of non-`by` columns.
+  * `epix_merge` no longer mutates its `x` argument (but `$merge` continues
+    to do so).
+  * Removed (undocumented) capability of passing a `data.table` as `y`.
+* `epix_slide`:
+  * Removed inappropriate/misleading `n=7` default argument (due to
+    reporting latency, `n=7` will *not* yield 7 days of data in a typical
+    daily-reporting surveillance data source, as one might have assumed).
+
+## New features:
+
+* `as_epi_archive`, `epi_archive$new`:
+  * New `compactify` parameter allows removal of rows that are redundant for the
+    purposes of `epi_archive`'s methods, which use the last version of each
+    observation carried forward.
+  * New `clobberable_versions_start` field allows marking a range of versions
+    that could be "clobbered" (rewritten without assigning new version
+    tags); previously, this was hard-coded as `max(<epi_archive>$DT$version)`.
+  * New `versions_end` field allows marking a range of versions beyond
+    `max(<epi_archive>$DT$version)` that were observed, but contained no
+    changes.
+* `epix_merge`, `$merge`:
+  * New `sync` parameter controls what to do if `x` and `y` aren't equally
+    up to date (i.e., if `x$versions_end` and `y$versions_end` are
+    different).
+* New function `epix_fill_through_version`, method
+  `<epi_archive>$fill_through_version`: non-mutating & mutating way to
+  ensure that an archive contains versions at least through some
+  `fill_versions_end`, extrapolating according to `how` if necessary
+* Example archive data object is now constructed on demand from its
+  underlying data, so it will be based on the user's version of
+  `epi_archive` rather than an outdated R6 implementation from whenever the
+  data object was generated.
+
+# epiprocess 0.2.0:
+
+## Breaking changes:
+
+* Removed default `n=7` argument to `epix_slide`.
+
+## Improvements:
+
+* Ignore `NA`s when printing `time_value` range for an `epi_archive`.
+* Fixed misleading column naming in `epix_slide` example.
+* Trimmed down `epi_slide` examples.
+* Synced out-of-date docs.
+
+## Cleanup:
+
+* Removed dependency of some `epi_archive` tests on an example archive
+  object, and made them more understandable by reading without running.
+* Fixed `epi_df` tests relying on an S3 method for `epi_df` implemented
+  externally to `epiprocess`.
+* Added tests for `epi_archive` methods and wrapper functions.
+* Removed some dead code.
+* Made `.{Rbuild,git}ignore` files more comprehensive.
+
+# epiprocess 0.1.2:
+
+## New features:
+
+* New `new_epi_df` function is similar to `as_epi_df`, but (i) recalculates,
+  overwrites, and/or drops most metadata of `x` if it has any, (ii) may
+  still reorder the columns of `x` even if it's already an `epi_df`, and
+  (iii) treats `x` as optional, constructing an empty `epi_df` by default.
+
+## Improvements:
+
+* Fixed `geo_type` guessing on alphabetical strings with more than 2
+  characters to yield `"custom"`, not US `"nation"`.
+* Fixed `time_type` guessing to actually detect `Date`-class `time_value`s
+  regularly spaced 7 days apart as `"week"`-type as intended.
+* Improved printing of `epi_df`s, `epi_archive`s.
+* Fixed `as_of` to not cut off any (forecast-like) data with `time_value >
+  max_version`.
+* Expanded `epi_df` docs to include conversion from `tsibble`/`tbl_ts` objects,
+  usage of `other_keys`, and pre-processing objects not following the
+  `geo_value`, `time_value` naming scheme.
+* Expanded `epi_slide` examples to show how to use an `f` argument with
+  named parameters.
+* Updated examples to print relevant columns given a common 80-column
+  terminal width.
+* Added growth rate examples.
+* Improved `as_epi_archive` and `epi_archive$new`/`$initialize`
+  documentation, including constructing a toy archive.
+
+## Cleanup:
+
+* Added tests for `epi_slide`, `epi_cor`, and internal utility functions.
+* Fixed currently-unused internal utility functions `MiddleL`, `MiddleR` to
+  yield correct results on odd-length vectors.
+
+# epiprocess 0.1.1:
+
+## New features:
+
+* New example data objects allow one to quickly experiment with `epi_df`s
+  and `epi_archive`s without relying/waiting on an API to fetch data.
+
+## Improvements:
+
+* Improved `epi_slide` error messaging.
+* Fixed description of the appropriate parameters for an `f` argument to
+  `epi_slide`; previous description would give incorrect behavior if `f` had
+  named parameters that did not receive values from `epi_slide`'s `...`.
+* Added some examples throughout the package.
+* Using example data objects in vignettes also speeds up vignette compilation.
+
+## Cleanup:
+
+* Set up gh-actions CI.
+* Added tests for `epi_df`s.
+
+# epiprocess 0.1.0
+
+## Implemented core functionality, vignettes:
+
+Classes:
+* `epi_df`: specialized `tbl_df` for geotemporal epidemiological time
+  series data, with optional metadata recording other key columns (e.g.,
+  demographic breakdowns) and `as_of` what time/version this data was
+  current/published. Associated functions:
+  * `as_epi_df` converts to an `epi_df`, guessing the `geo_type`,
+    `time_type`, `other_keys`, and `as_of` if not specified.
+  * `as_epi_df.tbl_ts` and `as_tsibble.epi_df` automatically set
+    `other_keys` and `key`&`index`, respectively.
+  * `epi_slide` applies a user-supplied computation to a sliding/rolling
+    time window and user-specified groups, adding the results as new
+    columns, and recycling/broadcasting results to keep the result size
+    stable. Allows computation to be provided as a function, `purrr`-style
+    formula, or tidyeval dots. Uses `slider` underneath for efficiency.
+  * `epi_cor` calculates Pearson, Kendall, or Spearman correlations
+    between two (optionally time-shifted) variables in an `epi_df` within
+    user-specified groups.
+  * Convenience function: `is_epi_df`
+* `epi_archive`: R6 class for version (patch) data for geotemporal
+  epidemiological time series data sets. Comes with S3 methods and regular
+  functions that wrap around this functionality for those unfamiliar with R6
+  methods. Associated functions:
+  * `as_epi_archive`: prepares an `epi_archive` object from a data frame
+    containing snapshots and/or patch data for every available version of
+    the data set.
+  * `as_of`: extracts a snapshot of the data set as of some requested
+    version, in `epi_df` format
+  * `epix_slide`, `<epi_archive>$slide`: similar to `epi_slide`, but for
+    `epi_archive`s; for each requested `ref_time_value` and group, applies
+    a time window and user-specified computation to a snapshot of the data
+    as of `ref_time_value`.
+  * `epix_merge`, `<epi_archive>$merge`: like `merge` for `epi_archive`s,
+    but allowing for the last version of each observation to be carried
+    forward to fill in gaps in `x` or `y`.
+  * Convenience function: `is_epi_archive`
+
+Additional functions:
+* `growth_rate`: estimates growth rate of a time series using one of a few
+  built-in `method`s based on relative change, linear regression,
+  smoothing splines, or trend filtering.
+* `detect_outlr`: applies one or more outlier detection methods to a given
+  signal variable, and optionally aggregates the outputs to create a
+  consensus result
+* `detect_outlr_rm`: outlier detection function based on a
+  rolling median; one of the methods
+  included in `detect_outlr`.
+* `detect_outlr_stl`: outlier detection function based on a seasonal-trend
+  decomposition using LOESS (STL); one of the methods included in
+  `detect_outlr`.
diff --git a/R/methods-epi_archive.R b/R/methods-epi_archive.R
@@ -152,7 +152,7 @@ epix_fill_through_version = function(x, fill_versions_end,
 #' # vs. mutating x to hold the merge result:
 #' x$merge(y)
 #'
-#' @importFrom data.table key set
+#' @importFrom data.table key set setkeyv
 #' @export
 epix_merge = function(x, y,
                       sync = c("forbid","na","locf","truncate"),
@@ -215,18 +215,36 @@ epix_merge = function(x, y,
     y_DT = epix_fill_through_version(y, new_versions_end, sync)$DT
   } else if (sync == "truncate") {
     new_versions_end = min(x$versions_end, y$versions_end)
-    x_DT = x$DT[x[["DT"]][["version"]] <= new_versions_end, with=FALSE]
-    y_DT = y$DT[y[["DT"]][["version"]] <= new_versions_end, with=FALSE]
+    x_DT = x$DT[x[["DT"]][["version"]] <= new_versions_end, names(x$DT), with=FALSE]
+    y_DT = y$DT[y[["DT"]][["version"]] <= new_versions_end, names(y$DT), with=FALSE]
   } else Abort("unimplemented")
 
-  if (!identical(key(x$DT), key(x_DT)) || !identical(key(y$DT), key(y_DT))) {
-    Abort("preprocessing of data tables in merge changed the key unexpectedly",
-          internal=TRUE)
+  # key(x_DT) should be the same as key(x$DT) and key(y_DT) should be the same
+  # as key(y$DT). Below, we only use {x,y}_DT in the code (making it easier to
+  # split the code into separate functions if we wish), but still refer to
+  # {x,y}$DT in the error messages (further relying on this assumption).
+  #
+  # Check and ensure the above assumption holds; if it didn't already, we likely
+  # have a bug in the preprocessing, a weird/invalid archive as input, and/or a
+  # data.table version with different semantics (which may break other parts of
+  # our code).
+  x_DT_key_as_expected = identical(key(x$DT), key(x_DT))
+  y_DT_key_as_expected = identical(key(y$DT), key(y_DT))
+  if (!x_DT_key_as_expected || !y_DT_key_as_expected) {
+    Warn("
+      `epiprocess` internal warning (please report): pre-processing for
+      epix_merge unexpectedly resulted in an intermediate data table (or
+      tables) with a different key than the corresponding input archive.
+      Manually setting intermediate data table keys to the expected values.
+    ", internal=TRUE)
+    setkeyv(x_DT, key(x$DT))
+    setkeyv(y_DT, key(y$DT))
   }
-  ## key(x_DT) should be the same as key(x$DT) and key(y_DT) should be the same
-  ## as key(y$DT). If we want to break this function into parts it makes sense
-  ## to use {x,y}_DT below, but this makes the error checks and messages look a
-  ## little weird and rely on the key-matching assumption above.
+  # Without some sort of annotations of what various columns represent, we can't
+  # do something that makes sense when merging archives with mismatched keys.
+  # E.g., even if we assume extra keys represent demographic breakdowns, a
+  # sensible default treatment of count-type and rate-type value columns would
+  # differ.
   if (!identical(sort(key(x_DT)), sort(key(y_DT)))) {
     Abort("
             The archives must have the same set of key column names; if the
diff --git a/pkgdown/extra.scss b/pkgdown/extra.scss
@@ -0,0 +1,70 @@
+/* The news/changelog in pkgdown 2.0.6 is squashed relative to 1.6.1, and
+   secondary headings are too prominent when using ## (but we can't change to
+   ### without impacting side navbar). Just trying a couple of bootswatches
+   didn't seem to help, and nice template packages might be restricted for use
+   by particular groups (e.g., tidytemplate has such a restriction).
+*/
+
+/* Current approach: add some spacing with CSS, and have h3 extend h4 so that
+ ##'s (which use h3) will render with a bit smaller fonts, while still being
+ recognized/included by the page navigation / TOC feature.
+*/
+
+/* General structure: div.template-news wraps everything of interest regarding
+   the rendered NEWS.md. Within that, div.level2's wrap each package version
+   + the changes for that version. (Within those,) h2.pkg-version's label the
+   versions.
+*/
+
+
+
+/* Matches the first-listed version's section. (This is written as a general
+   rule, but the adjacent sibling rule will override it for non-first versions'
+   sections. Using :first-child probably wouldn't work as a sibling
+   div.page-header precedes the first div.level2.) */
+div.template-news div.level2 {
+    margin-top: 1.5em;
+}
+
+/* Matches subsequent versions' sections. Places more vspace between these
+   sections than before the first section.
+*/
+div.template-news div.level2 + div.level2 {
+    margin-top: 2.5em;
+}
+
+/* Place some additional vspace after each version number heading; currently,
+   the immediately following content is always a secondary heading, which looks
+   weird with the default spacing.
+*/
+div.template-news h2.pkg-version {
+    margin-bottom: 0.5em;
+}
+
+/* Use `h4` styling for `h3`s (the ## headings); this is the only thing we need
+   .scss for, and we could really just copy-paste in the appropriate value if
+   needed: */
+div.template-news h3 {
+    @extend h4;
+}
+
+
+/* Original approach, to be removed at some later time: try adding hrules before
+   and after primary headings (version numbers). The initial "hrule" (actually a
+   border) after the "Source:" pointer has a different color from natural
+   hrules, so we need some custom CSS styling to get these colors to match and
+   look okay:
+ */
+
+/* .template-news .page-header { */
+/*     /\* 1px solid to match original *\/ */
+/*     /\* (original color was something like --bs-default which seemed to be set to */
+/*     --bs-gray-300) *\/ */
+/*     border-bottom: 1px solid var(--bs-secondary); */
+/* } */
+
+/* .template-news hr { */
+/*     height: 1px; /\* defensive *\/ */
+/*     background-color: var(--bs-secondary); */
+/*     opacity: 1; /\* counteracts a 0.25 setting somewhere *\/ */
+/* } */
diff --git a/tests/testthat/test-epix_merge.R b/tests/testthat/test-epix_merge.R