
Properly attribute code and data sources #327

Closed
nmdefries opened this issue Jun 8, 2023 · 6 comments · Fixed by #520

@nmdefries
Contributor

as_slide_computation borrows heavily from rlang::as_function. Currently attribution is given informally via the roxygen @source tag and in the function description. We would also want to list JHU for data usage.
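
As a rough illustration of the informal attribution described above, a roxygen block might look like the following. This is only a sketch: `adapted_helper` is a hypothetical name, not the actual `as_slide_computation` source, and the wording of the `@source` note is an assumption.

```r
#' Convert a function-like input into a function (hypothetical helper)
#'
#' Thin wrapper used only to illustrate attribution; the body simply defers
#' to the upstream function it credits.
#'
#' @param f A function, formula, or string accepted by `rlang::as_function()`.
#' @return A function.
#'
#' @source Adapted from `rlang::as_function()`; see the rlang package for the
#'   original implementation and its license.
#' @keywords internal
adapted_helper <- function(f) {
  rlang::as_function(f)
}
```

This keeps the credit next to the code, but nothing in the package metadata records it.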

Logan looked into possibilities here, but it looks like there's no single official way.

@brookslogan
Contributor

Potentially missing from the above investigation is attaching a file LICENSE that notes the (non-MIT) data set licenses.

@nmdefries
Contributor Author

In the knitRProgressBar package, the author added the original license text and a description of changes to the source code, and added the original authors to the DESCRIPTION as contributors (ctb). No other changes were made for attribution.

@dajmcdon
Contributor

Are these attributions actually necessary? I'm not sure that knitRProgressBar is the appropriate model. How do packages in the tidyverse/r-lib actually approach this? That is, do the people who work directly with rlang contributors credit rlang in this way?

(I get that the Writing R packages document recommends this, but if followed, I suspect that all packages on CRAN would have dozens of aut/ctb fields, which doesn't seem to be the case.)

@nmdefries
Contributor Author

nmdefries commented Jun 29, 2023

The motivation for this attribution is more than just that epiprocess uses rlang functions. The initial version of epiprocess:::as_slide_computation was copy-pasted, including documentation, from rlang's source code. I added maybe a line of changes.

As for tidyverse, the only member package with additional external authors listed is readr, see its DESCRIPTION. mio and grisu3 seem to be C++ packages that readr copy-pasted code from, e.g. see readr's mio.h vs mio's mmap.hpp. Both C++ packages use the MIT license.

In a historical version, R Core Team was included for an adaptation of its date-time code.

So the standard seems to be to add the authors of the original package to the DESCRIPTION, and note the source and license in the code itself. In readr, no COPYRIGHT file is provided.
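
A minimal sketch of what adding the original authors to the DESCRIPTION could look like, assuming the `ctb` and `cph` roles are the ones wanted. Every name, the e-mail address, and the file path in the comment are placeholders rather than anything taken from readr or epiprocess.

```r
# Hypothetical Authors@R value for a DESCRIPTION file. All names and the
# e-mail address are placeholders, not real people or projects.
authors <- c(
  person("Jane", "Maintainer",
    email = "jane@example.org",
    role = c("aut", "cre")
  ),
  # Authors of the copied/adapted code: contributor and copyright holder.
  person("Upstream package authors",
    role = c("ctb", "cph"),
    comment = "Authors of the code adapted in R/adapted_helper.R"
  )
)

# Show the entries, including roles and comments.
print(authors)
```

In a real DESCRIPTION, the `c(...)` expression would follow the `Authors@R:` field name.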

Our as_slide_computation function now has additional differences from the original, but it is more than "inspired" by rlang::as_function. However, the amount of borrowed/adapted code is certainly substantially less than in any of the examples above. I'm sure there's a point at which attribution no longer makes sense.

@nmdefries
Contributor Author

nmdefries commented Jun 29, 2023

I haven't looked for examples of dataset attributions yet.

@nmdefries
Contributor Author

nmdefries commented Jul 12, 2023

RE data attributions, looks like no one really does them. As in the above, I looked at some tidyverse packages and assumed that their approach is idiomatic/best-case.

The tidyverse packages I looked at don't list any dtc (data contributor) authors and don't include license info for any datasets, including ones that are clearly under copyright rather than in the public domain. Two examples: the billboard ratings, where even the linked source says that "This data is almost certainly a violation of Billboard’s copyright, and probably infringes on Record Research’s books too. The analysis I’m publishing here should fall under fair use, but redistributing the spreadsheet would not"; and the Star Wars character info, whose license requires attribution.

Some other not-very-official packages list dtc authors, but don't include licenses (although their data might not require it). So listing data contributors and copyright holders in the DESCRIPTION and including license info is better than what the average package is doing.
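
For the data side, a hedged sketch of a `dtc` entry, assuming the data provider should also be recorded as copyright holder. "Example Data Provider" and the dataset name are placeholders, and where the license text itself should live is left open here.

```r
# Hypothetical entry crediting a dataset provider in Authors@R.
# "Example Data Provider" and `example_cases` are placeholders.
data_credit <- person("Example Data Provider",
  role = c("dtc", "cph"),
  comment = "Source of the bundled `example_cases` dataset"
)

# Check that the data contributor role was accepted.
stopifnot("dtc" %in% data_credit$role)
```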

RE attribution for imported data packages, I think it makes sense to include data contributors in epiprocess and epipredict. Can't find any formal guidance on attribution for this case, but since we're reexporting the datasets and they're being made available as part of the packages, it makes sense to do.

3 participants