Citation metadata: documentation issue #2723

Closed
njbart opened this issue Oct 4, 2022 · 11 comments

@njbart

njbart commented Oct 4, 2022

Bug description

https://quarto.org/docs/reference/metadata/citation.html displays this example:

---
citation:
  type: article-journal
  container-title: ACM Transactions on Embedded Computing Systems
  volume: 21
  issue: 2
  date: 3/2022
  issn: 1539-9087
  doi: 10.1145/3514174
---

Issues:

  • date is not a valid CSL variable; use issued instead.
  • 3/2022 is not a valid CSL YAML date format; only ISO-8601 dates are accepted, e.g., 2022, 2022-03, 2022-03-29 etc.
  • Hence date: 3/2022 above should be replaced by issued: 2022-03.
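
The date fix in the list above can be checked mechanically. What follows is a minimal sketch in Python that assumes only the three ISO-8601 shapes named there (year, year-month, year-month-day) and does not cover ranges or other extended forms. The helper names `is_csl_yaml_date` and `month_year_to_iso` are hypothetical and not part of Quarto or pandoc.

```python
import re

# Accepts only the ISO-8601 shapes named above: YYYY, YYYY-MM, YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$")


def is_csl_yaml_date(value: str) -> bool:
    """Return True if value has one of the three accepted shapes."""
    return ISO_DATE.match(value) is not None


def month_year_to_iso(value: str) -> str:
    """Convert an M/YYYY string such as '3/2022' to 'YYYY-MM' (hypothetical helper)."""
    month, year = value.split("/")
    return f"{int(year):04d}-{int(month):02d}"


print(is_csl_yaml_date("3/2022"))   # False
print(is_csl_yaml_date("2022-03"))  # True
print(month_year_to_iso("3/2022"))  # 2022-03
```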

Further suggestions:

  • In quarto output, citation metadata are introduced as “BibTeX citation:”, which, strictly speaking, is not accurate: none of url, doi, or langid exists in BibTeX. While the BibTeX variant natbib has url and doi, many other fields that appear in quarto output, including langid, eventdate, urldate, and annote, are exclusive to BibLaTeX.
  • It might be useful to offer output in CSL YAML format instead or alongside BibLaTeX.
    • Possibly CSL JSON, too, which would allow lossless import into, e.g., Zotero. (CSL YAML import requires the BetterBibTeX plugin.)
  • Empty BibLaTeX fields like editor = {}, should be filtered out to reduce clutter.
  • Markdown syntax in, e.g., title fields should be converted to the corresponding (La)TeX commands.
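
The suggestion to filter out empty fields such as editor = {} can be sketched as a small post-processing step. This is an illustration only, not how Quarto or pandoc handles it; the function name `drop_empty_fields` and the sample entry are assumptions made for the example.

```python
import re

# Matches a whole field line whose braces are empty, e.g. "  editor = {},".
EMPTY_FIELD = re.compile(r"^[ \t]*[\w-]+[ \t]*=[ \t]*\{[ \t]*\},?[ \t]*\n", re.MULTILINE)


def drop_empty_fields(entry: str) -> str:
    """Remove field lines with an empty value from a BibLaTeX entry string."""
    return EMPTY_FIELD.sub("", entry)


# Hypothetical entry, used only to exercise the function.
entry = """@article{doe2022,
  author = {Jane Doe},
  editor = {},
  title = {Hello {World}}
}
"""
print(drop_empty_fields(entry))
```

The sketch works line by line, so it assumes one field per line, which matches the BibLaTeX output shown in this thread.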

Checklist

  • Please include a minimal, fully reproducible example in a single .qmd file? Please provide the whole file rather than the snippet you believe is causing the issue.
  • Please format your issue so it is easier for us to read the bug report.
  • Please document the RStudio IDE version you're running (if applicable), by providing the value displayed in the "About RStudio" main menu dialog?
  • Please document the operating system you're running. If on Linux, please provide the specific distribution.

njbart added the bug label Oct 4, 2022
dragonstyle self-assigned this Oct 4, 2022
dragonstyle added this to the v1.3 milestone Oct 4, 2022

@njbart
Author

njbart commented Oct 6, 2022

Some additional concerns: the variable list at https://quarto.org/docs/reference/metadata/citation.html contains some items that are definitely not CSL variables (as per https://docs.citationstyles.org/en/stable/specification.html), including

  • pdf-url
  • public-url
  • abstract-url
  • fulltext-url

I do recognize these might serve some useful purpose, but I don't think it's a good idea to list them among valid CSL variables without any clarification, since none of them will be processed by any of the current citeprocs nor, even if these did process variables other than the official ones at all (which they don’t), by any of the existing CSL styles in the official repository (https://github.com/citation-style-language/styles).

If the quarto developers feel these variables could usefully supplement the set of existing CSL variables it would seem best to start discussing this with the CSL folks.

I am also wondering how exactly the BibLaTeX data are assembled from the citation: element, and what citeproc-ish routine is used to subsequently render the citation after “For attribution, please cite this work as:” – it would seem it is not pandoc, since pandoc would invariably render a BibLaTeX author = {Jane Doe} as Doe, Jane (in any case if the default chicago-author-date.csl is used, as in my tests). quarto, however, following “For attribution …” shows a citation that begins with Jane Doe.

Another hint that pandoc does not seem to be involved when assembling BibLaTeX data either is the fact that a perfectly valid (pandoc) CSL YAML construct inside a citation: element, such as

  author:
      - family: Doe
        given: Jane

is ignored, with an empty author = {} appearing in the BibLaTeX data.

Clarification of these issues would be desirable, although I assume already that if pandoc is indeed not involved in any of the above tasks, things could be greatly improved if it were.

dragonstyle added a commit that referenced this issue Oct 6, 2022
@dragonstyle
Collaborator

>   • date is not a valid CSL variable; use issued instead.
>   • 3/2022 is not a valid CSL YAML date format; only ISO-8601 dates are accepted, e.g., 2022, 2022-03, 2022-03-29 etc.
>   • Hence date: 3/2022 above should be replaced by issued: 2022-03.

Thank you, you are correct! I've updated the example. Further answers below...

>   • In quarto output, citation metadata are introduced as “BibTeX citation:”, which, strictly speaking, is not accurate: none of url, doi, or langid exists in BibTeX. While the BibTeX variant natbib has url and doi, many other fields that appear in quarto output, including langid, eventdate, urldate, and annote, are exclusive to BibLaTeX.

For now, I've corrected the label. Per the below suggestion, I agree it would be great to offer more than one format, which should address the whole issue, hopefully.

>   • It might be useful to offer output in CSL YAML format instead or alongside BibLaTeX.
>     • Possibly CSL JSON, too, which would allow lossless import into, e.g., Zotero. (CSL YAML import requires the BetterBibTeX plugin.)

This is a great suggestion - I'll open another issue to track providing a few formats specifically.

>   • Empty BibLaTeX fields like editor = {}, should be filtered out to reduce clutter.

Correct! A bug fix is on the way.

>   • Markdown syntax in, e.g., title fields should be converted to the corresponding (La)TeX commands.

We currently use Pandoc to do the rendering from CSL to BibLaTeX; this is the Pandoc behavior.

@dragonstyle
Collaborator

> Some additional concerns: the variable list at https://quarto.org/docs/reference/metadata/citation.html contains some items that are definitely not CSL variables (as per https://docs.citationstyles.org/en/stable/specification.html), including
>
>   • pdf-url
>   • public-url
>   • abstract-url
>   • fulltext-url
>
> I do recognize these might serve some useful purpose, but I don't think it's a good idea to list them among valid CSL variables without any clarification, since none of them will be processed by any of the current citeprocs nor, even if these did process variables other than the official ones at all (which they don’t), by any of the existing CSL styles in the official repository (https://github.com/citation-style-language/styles).
>
> If the quarto developers feel these variables could usefully supplement the set of existing CSL variables it would seem best to start discussing this with the CSL folks.

As the page notes, this data is based upon CSL, but allows the specification of additional fields, typically because they are used by Google Scholar metadata. I think that this is OK: I don't think we should necessarily aspire for the citation field to be strict CSL, since we may have needs that go beyond the goals of CSL. Perhaps it is worth clarifying this in the headings; the current note may not be strong enough...

> I am also wondering how exactly the BibLaTeX data are assembled from the citation: element, and what citeproc-ish routine is used to subsequently render the citation after “For attribution, please cite this work as:” – it would seem it is not pandoc, since pandoc would invariably render a BibLaTeX author = {Jane Doe} as Doe, Jane (in any case if the default chicago-author-date.csl is used, as in my tests). quarto, however, following “For attribution …” shows a citation that begins with Jane Doe.

We are using Pandoc to render the BibLaTeX. When rendering this front matter:

---
title: CSL Example
author: Charles Teague
date: last-modified
description: |
  This provides an example of CSL formatting being used to control the output of references for a document. Note that the CSL provides the style for both the generated bibliography for the page, but also for the citation information for the page itself, included in the appendix.
bibliography: example.bib
citation:
  title: "Hello World"
  type: article-journal
  container-title: "Journal of Data Science Software"
  doi: "10.23915/reprodocs.00010"
  url: https://example.com/summarizing-output
  issued: 2022-10-01
---

We generate the following CSL JSON:

{
  title: "Hello World",
  type: "article-journal",
  author: [ { family: "Teague", given: "Charles", literal: "Charles Teague" } ],
  language: "en",
  "available-date": { "date-parts": [ [ 2022, 10, 6 ] ], literal: "2022-10-06", raw: "2022-10-06" },
  issued: { "date-parts": [ [ 2022, 10, 1 ] ], literal: "2022-10-01", raw: "2022-10-01" },
  "container-title": "Journal of Data Science Software",
  id: "teague2022",
  URL: "https://example.com/summarizing-output",
  DOI: "10.23915/reprodocs.00010"
}

which produces:

@article{teague2022,
  author = {Charles Teague},
  title = {Hello {World}},
  journal = {Journal of Data Science Software},
  date = {2022-10-01},
  url = {https://example.com/summarizing-output},
  doi = {10.23915/reprodocs.00010},
  langid = {en}
}

when rendered using:

pandoc -f csljson -t biblatex --citeproc

> Another hint that pandoc does not seem to be involved when assembling BibLaTeX data either is the fact that a perfectly valid (pandoc) CSL YAML construct inside a citation: element, such as
>
>   author:
>       - family: Doe
>         given: Jane

This looks to be an issue with our CSL parsing, which isn't parsing the author name as anything more than a string right now. I will open an issue to track resolving this.

@dragonstyle
Collaborator

I wonder if the fact that 'literal' is present in the CSL name for the author is resulting in Pandoc preferring the literal string rather than the structured name. We currently always bundle literal into the name, thinking that it is useful to provide the original literal input, but perhaps this isn't working out in this case...

@njbart
Author

njbart commented Oct 6, 2022

Well, the CSL JSON snippet as shown above could not be processed with my pandoc installation (2.19.2, macOS).

I had to change it to:

[
  {
    "title": "Hello World",
    "type": "article-journal",
    "author": [ { "family": "Teague", "given": "Charles", "literal": "Charles Teague" } ],
    "language": "en",
    "available-date": { "date-parts": [ [ 2022, 10, 6 ] ], "literal": "2022-10-06", "raw": "2022-10-06" },
    "issued": { "date-parts": [ [ 2022, 10, 1 ] ], "literal": "2022-10-01", "raw": "2022-10-01" },
    "container-title": "Journal of Data Science Software",
    "id": "teague2022",
    "URL": "https://example.com/summarizing-output",
    "DOI": "10.23915/reprodocs.00010"
  }
]

… which results in the same biblatex code as shown above then. (BTW, the --citeproc flag is unnecessary when converting between biblio formats.)
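
The difference between the two snippets is that CSL JSON input has to be strict JSON: every key quoted, and the items wrapped in an array. One way to avoid hand-editing is to serialize the data with a JSON library. The following Python sketch uses a subset of the fields from the snippet above; it illustrates the format requirement and is not how Quarto produces its output.

```python
import json

# One reference item, using a subset of the fields from the snippet above.
item = {
    "id": "teague2022",
    "type": "article-journal",
    "title": "Hello World",
    "author": [{"family": "Teague", "given": "Charles", "literal": "Charles Teague"}],
    "issued": {"date-parts": [[2022, 10, 1]]},
    "container-title": "Journal of Data Science Software",
    "URL": "https://example.com/summarizing-output",
    "DOI": "10.23915/reprodocs.00010",
}

# CSL JSON is an array of items; json.dumps quotes every key.
csl_json = json.dumps([item], indent=2)

# Round-trip to confirm the text parses as strict JSON.
assert json.loads(csl_json)[0]["id"] == "teague2022"
print(csl_json)
```

The printed text can be saved to a file and converted with pandoc -f csljson -t biblatex, the command used earlier in the thread without the --citeproc flag.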

The biblatex snippet, now called refs.bib, processed with …

pandoc -C --bibliography=refs.bib -t plain << EOF
@teague2022
EOF

… renders:

Teague (2022)

Teague, Charles. 2022. “Hello World.” Journal of Data Science Software,
October. https://doi.org/10.23915/reprodocs.00010.

The “front matter” shown above (minus the bibliography line), when processed with quarto preview file.qmd, however, has

For attribution, please cite this work as:
Charles Teague. 2022. “Hello World.” Journal of Data Science Software, October. https://doi.org/10.23915/reprodocs.00010.

I was failing to see how pandoc could have produced the latter …

… but now I realize quarto is processing the json version, where I’d agree the literal bit is the likely culprit.

@njbart
Author

njbart commented Oct 6, 2022

literal in fact is reserved for corporate names and the like, as in the following (pandoc) CSL YAML snippet:

---
references:
- id: apa2012electronic
  author:
    - literal: American Psychological Association
  issued: 2012
  language: en-US
  publisher: American Psychological Association
  publisher-place: Washington, DC
  title: APA style guide to electronic references
  type: book
...

… corresponding to Zotero’s one-field author mode, as opposed to its two-field mode for personal names comprised of “first” and “last”.

@njbart
Author

njbart commented Oct 6, 2022

I am not sure to what extent it is possible to parse the unstructured authors’ names that typically appear in pandoc’s YAML metadata block top-level author: field.

Ideally, these would have to be parsed into either an explicit literal/organizational/corporate name:

author:
    - literal: American Psychological Association

or a structured representation of a personal name:

author:
  - family: Doe
    given: Jane
    suffix: Jr.
    dropping-particle: von
    non-dropping-particle: van der

… or, as I reckon this kind of parsing will be next to impossible to do properly if there is no one-field/two-field distinction in the input data to begin with (and even if there is, as in Zotero, certain conventions and hacks need to be relied upon; more on this maybe at a later point …), it seems more likely users will have to be encouraged to enter structured representations of a text’s authors’ names into the citation: element instead. (Maybe not as difficult as it sounds: If users enter the metadata of their text correctly into Zotero, and export to CSL YAML using BBT, they will get an equally correct CSL YAML version, even for rather complex cases.)

@njbart
Author

njbart commented Oct 6, 2022

As to dates, I think including all of date-parts, raw and literal elements is unnecessary and confusing.

If a date can be parsed into year, year-month, or year-month-day (or year-season, all with or without a circa flag), or an interval consisting of two such dates, either date-parts or raw should be used – but not both. (Actually, I’d prefer ISO-8601 dates in the raw element, matching the date formats biblatex and pandoc CSL YAML are using.)

The literal date element should be reserved for dates that cannot be parsed as described above, e.g., free-form dates like:

{  "issued" : {
      "literal" : "13th century"
   }
}
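
The rule proposed here (structured date-parts for parsable dates, literal only for free-form ones) can be sketched in Python as follows. The function name `to_csl_date` is hypothetical. The sketch handles only plain year, year-month, and year-month-day strings; seasons, circa flags, and intervals are left out, and month and day ranges are not validated.

```python
import re

# Year, optional month, optional day, each captured separately.
ISO = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def to_csl_date(value: str) -> dict:
    """Return date-parts for a parsable ISO-8601 date, else a literal date."""
    match = ISO.match(value.strip())
    if match:
        parts = [int(group) for group in match.groups() if group is not None]
        return {"date-parts": [parts]}
    # Not parsable: keep the free-form text as a literal date.
    return {"literal": value}


print(to_csl_date("2022-10-01"))    # {'date-parts': [[2022, 10, 1]]}
print(to_csl_date("2022-03"))       # {'date-parts': [[2022, 3]]}
print(to_csl_date("13th century"))  # {'literal': '13th century'}
```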

@dragonstyle
Collaborator

> I am not sure to what extent it is possible to parse the unstructured authors’ names that typically appear in pandoc’s YAML metadata block top-level author: field.

For the document author (not in the citation key), we do currently parse the name into a structure as you suggest (we use pandoc for this), but as you note it can't be 100% accurate, so we allow the user to also provide that structure if they'd like (meaning they can correct us in those cases where we misparse the field). It would be nice for us to improve the parsing of the citation author / editors (really any CSL name field) as well, perhaps using this same approach.

We should also drop the literal from our CSL parsing. For dates, I will take a closer look as well.

@njbart
Author

njbart commented Oct 7, 2022

I just realized that quarto does support structured representation of names – https://quarto.org/docs/journals/authors.html#author-schema –, and I would suggest encouraging users to enter names in this scheme only: Even a name as simple as “Jane Doe” could be a personal name, in which case parsing into separate elements, and rendering as “Doe, Jane” in certain contexts is appropriate, or a corporate name, where it is not.

I would restrict the use of literal to organizational names, and actually ignore it if either family or given are present.
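
The precedence rule suggested here can be written down in a few lines. This Python sketch is an illustration of the rule only, not the behavior of pandoc or of any citeproc; the function name `display_name` is hypothetical, and particles and suffixes are ignored for brevity.

```python
def display_name(name: dict, family_first: bool = True) -> str:
    """Format one CSL name object, ignoring `literal` when family or given is present."""
    family = name.get("family", "")
    given = name.get("given", "")
    if family or given:
        # Personal name: structured parts win over any bundled literal.
        if family_first and family and given:
            return f"{family}, {given}"
        return " ".join(part for part in (given, family) if part)
    # One-field (organizational) name: use the literal as-is.
    return name.get("literal", "")


print(display_name({"family": "Teague", "given": "Charles", "literal": "Charles Teague"}))
# Teague, Charles
print(display_name({"literal": "American Psychological Association"}))
# American Psychological Association
```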

Bugs: dropping-particle and non-dropping-particle are currently not rendered in quarto’s output, and suffix and comma-suffix are not even included in the docs (and not rendered either).

Also, when adding a citation: element without its own author: element (unstructured, e.g. author: Jane Doe) – so quarto has to use the top-level, structured author: element – I get ERROR: TypeError: author.split is not a function. The same happens when structured author information is entered below the citation: element.
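
The error message suggests that a string method is being called on a value that is not a string; the Quarto code itself is not shown in this thread, so that is an inference. A defensive normalization step that accepts both input shapes could look like the Python sketch below. The function name `normalize_author` is hypothetical, and the string branch uses a deliberately naive guess.

```python
def normalize_author(author) -> dict:
    """Accept a plain string or a structured mapping and return a CSL-style name."""
    if isinstance(author, dict):
        # Already structured: pass it through as a copy.
        return dict(author)
    if isinstance(author, str):
        # Naive guess: the last whitespace-separated token is the family name.
        tokens = author.split()
        if len(tokens) < 2:
            return {"literal": author}
        return {"given": " ".join(tokens[:-1]), "family": tokens[-1]}
    raise TypeError(f"unsupported author value: {author!r}")


print(normalize_author("Jane Doe"))                         # {'given': 'Jane', 'family': 'Doe'}
print(normalize_author({"family": "Doe", "given": "Jane"}))  # {'family': 'Doe', 'given': 'Jane'}
```

The string branch would misread a corporate name such as "American Psychological Association" as a personal name, which is the limitation discussed above and the reason structured input is preferable.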

Authoritative specs of CSL JSON that describe CSL name formats in detail can be found at https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html#name-fields and https://github.com/Juris-M/citeproc-js/blob/master/attic/citeproc-doc.rst#names.

@dragonstyle
Collaborator

No error is thrown anymore when providing structured author info. You may now specify a structured CSL name, or Quarto will attempt to infer it (not all that well).
