.. ___init__: becomes <p id="init"> instead of <p id="__init__"> [SF:feature-requests:66] #579

chrisjsewell · 2020-08-09T14:50:16Z

author: rpuntaie
created: 2019-09-30 17:30:55.106000
assigned: goodger
SF_url: https://sourceforge.net/p/docutils/feature-requests/66

`` encloses a role. There is a default role, else :<role>:`text`

_ in front, is the special target role. For one word the backtick can be dropped.

_`__init__` should produce a target named "__init__".

But instead the produced target is "init".
The backtick avoids ambiguity. There is no need for this behavior.

commenter: goodger
posted: 2019-09-30 18:32:30.338000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

Description has changed:

Diff:


--- old
+++ new
@@ -1,7 +1,8 @@
-`` encloses a role. There is a default role, else :<role>:`text`
+    `` encloses a role. There is a default role, else :<role>:`text`
 
-_ in front, is the special target role. For one word the backtick can be dropped.
+    _ in front, is the special target role. For one word the backtick can be dropped.
 
-_`__init__` should produce a target named "__init__".
+    _`__init__` should produce a target named "__init__".
+
 But instead the produced target is "init".
 The backtick avoids ambiguity. There is no need for this behavior.

commenter: goodger
posted: 2019-09-30 18:54:01.082000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

Please be careful with using raw markup in a web form like this. SourceForge expects MarkDown, which has enough similarities to reStructuredText that the markup will be interpreted/misinterpreted. Use MarkDown to quote any markup, and check that the result makes sense when rendered (use the preview function).

When you say, "There is no need for this behavior", what behavior do you mean, exactly?

It works fine for me. This input:

$ rst2pseudoxml.py<<'EOF'
a target _`__init__` in a paragraph

EOF

Produces this output:

<document source="<stdin>">
    <paragraph>
        a target
        <target ids="init" names="__init__">
            __init__
         in a paragraph

The target name is __init__. The ID drops the underscores, for the reasons explained in docutils.nodes.Element and docutils.nodes.make_id, e.g.:

Docutils identifiers will conform to the regular expression
[a-z](-?[a-z0-9]+)*. For CSS compatibility, identifiers (the "class"
and "id" attributes) should have no underscores, colons, or periods.
Hyphens may be used.
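
The effect of that rule can be approximated in a few lines. This is a simplified sketch, not the Docutils implementation: `simplified_make_id` is a hypothetical helper that lowercases the name, strips accents, turns every run of other characters into a hyphen, and trims the ends. The real `docutils.nodes.make_id` may handle further cases that are left out here.

```python
import re
import unicodedata


def simplified_make_id(name):
    """Approximate the name-to-ID conversion (illustrative helper only)."""
    text = name.lower()
    # Decompose accented letters and drop everything outside ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Every run of characters outside [a-z0-9] becomes a single hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    # IDs must start with a letter and must not end with a hyphen.
    return re.sub(r"^[-0-9]+|-+$", "", text)


print(simplified_make_id("__init__"))  # init
```

Under these assumptions the underscores in `__init__` are first replaced by hyphens and then trimmed from both ends, which leaves `init`.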

commenter: goodger
posted: 2019-09-30 18:54:38.778000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

status: open --> pending-works-for-me
assigned_to: David Goodger

commenter: milde
posted: 2019-09-30 19:24:29.406000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

> _ in front, is the special target role. For one word the backtick can be dropped.

Do you mean the "hyperlink name" in explicit hyperlink targets?
Here, the backticks can always be dropped
(http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#hyperlink-targets).
In "inline internal targets", the backticks are mandatory (http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#inline-internal-targets).

> _`__init__` should produce a target named "__init__".

It is a bit more complicated:
The generated target has the name __init__ and the id init.

The id is generated from the name (but may also be something like "id35" if the name-derived id is not unique). The rules and motivation for the conversion are described in
http://docutils.sourceforge.net/docs/ref/rst/directives.html#rationale.

The generated HTML uses the id, so to link to it from outside, use
"name-of-the-document#init"

In the rST source of the same document, you link to it via the name, e.g.,
__init___
Docutils resolves the "reference name" and the above becomes a reference to the matching target:
Docutils XML: <reference name="__init__" refid="init">,
HTML: <a class="reference internal" href="#init">__init__</a>.

commenter: rpuntaie
posted: 2019-10-01 09:23:06.896000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

According to
https://www.w3.org/TR/CSS21/syndata.html#characters
an identifier can start with two underscores in CSS.

HTML5 allows the id value to start with two underscores (https://html.spec.whatwg.org/multipage/dom.html#the-id-attribute).

HTML5 id is specified to not contain spaces, but some browsers do support spaces nevertheless.
HTML5 does not specify why it disallows spaces. It should therefore allow spaces.

I made a related post about docutils changing IDs in 11/2018: https://sourceforge.net/p/docutils/mailman/message/36453416/

My position is this:

id is used to name targets in the RST.
id is the identifier of a content item.
The same content item should have the same id independent of format (rst, html, pdf, ...)
How should a user target his content item, if every formatter chooses to modify his chosen id?

The id should not be changed.

Docutils should even keep spaces despite HTML5 disallowing them.
If the user runs into a problem with a browser, he will change the id himself and know about it.
Maybe he converts to just pdf, anyway.

To summarize:

RST is not html and does not need restrictions from HTML (or CSS) altogether.

Docutils should develop in that direction.
Relaxing rules does not produce backward incompatibility, either.

commenter: milde
posted: 2019-10-01 16:16:46.659000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

Ticket moved from /p/docutils/bugs/379/

commenter: milde
posted: 2019-10-12 21:35:57.074000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

> My position is this:
>
> id is used to name targets in the RST.
>
> id is the identifier of a content item.

In rST/Docutils, it is a bit more complicated:

Docutils doctree elements may have multiple ids and
names.

In the reStructuredText source, only reference names_ are used for naming
elements as well as referring to them. IDs_ are only used in generated
documents.

Reference names_ may be auto-derived from the content (e.g. section
titles) or specified by the author via rST syntax (:name: option of
directives, content of hyperlink targets, label of footnotes or citations).

IDs_ are generated by Docutils (sometimes using names as base) when
parsing rST or in transformations.

.. _reference names:
   http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#reference-names
.. _ids: http://docutils.sourceforge.net/docs/ref/doctree.html#ids

> The same content item should have the same id independent of format
> (rst, html, pdf, ...)

To achieve this, the id must be valid in all output formats supported by
Docutils (HTML4.1/XHTML1, HTML5, LaTeX, troff (manpage), XML, ODF/ODT).

HTML4.1:
IDs must begin with a letter ([A-Za-z]) and may be followed by
any number of letters, digits ([0-9]), or any of the characters "-_:.".

HTML5:
no whitespace

LaTeX:
only ASCII characters (32-127) except "%~#{}"

"{" and "}" might be used if balanced but this is not recommended.
Use of certain LaTeX packages results in more exceptions.
Spaces are allowed.
With XeTeX/LuaTeX, Unicode characters are allowed, too.

https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels

ODT/ODF:

troff:

> How should a user target his content item, if every formatter chooses
> to modify his chosen id?

Internal (rST source and included files/parent documents):
use the reference name. This works independent of the output format.

External:
HTML: Use the generated id (when unsure about the transformation of a
given name to id, look it up in the output).

LaTeX: Use the id as label (e.g. in \ref{}). This works only if the
external LaTeX source is combined with the Docutils-generated
LaTeX source (i.e. one must include the other or both included in a
common parent).

PDF: named destinations__ are currently not supported in PDFs
generated from Docutils-generated LaTeX.

__ https://tex.stackexchange.com/questions/213860/how-to-generate-a-named-destination-in-pdf

> The id should not be changed.

The id is (currently) generated once and used unchanged by the writers.

> Docutils should even keep spaces despite HTML5 disallowing them.

Docutils policy is to create valid output. Until this restriction is
lifted in the HTML5 standard, Docutils will not use spaces in HTML IDs.
Spaces are allowed in reference names_.

> If the user runs into a problem with a browser, he will change the id
> himself and know about it.

The author cannot change IDs or implicit reference names directly. If we
kept spaces, any document with a section title containing whitespace
would also contain spaces in the id of the corresponding section element.

> Maybe he converts to just pdf, anyway.

Even worse: Accented characters, Umlauts, Greek, Cyrillic, etc. in section
titles would lead to compilation errors with pdflatex.

> To summarize:
>
> RST is not html and does not need restrictions from HTML (or CSS) altogether.

This is why the internal identifiers (reference names_) don't
have these limitations. The rules for reference names (whitespace
normalization and downcasing) are solely based on practicability for rST.

Identifiers in the generated documents must comply with the restrictions of
the output document format.

> Docutils should develop in that direction.

There are two alternatives:

a) Keep ids identical across output formats. This would allow only the
intersection of valid element identifiers.

We could lift the restrictions of CSS1, as generated documents would
still be valid XHTML1 and CSS selectors may use escaping or
[argument] syntax.
This would relax the requirements to complying with the regexp
[A-Za-z][-_:.A-Za-z0-9]* (i.e. also allow underscore, colon,
and full stop).

b) Allow less restrictive identifiers in some formats:

HTML is the format most probably linked to.

The "html5" writer could use the name as ID, just replacing spaces.
This would allow external links like
http://example.com/parrot.html#1.Ιανουάριος.

Or the restriction on the first character may be dropped with an exception
for "html4css1".
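
To make the difference between the current rule and alternative a) concrete, here is a minimal sketch that checks a few names against both patterns. The two regular expressions are the ones quoted in this thread; the `classify` helper and the sample names are only for illustration.

```python
import re

# Pattern quoted from the Docutils documentation for current identifiers.
CURRENT = re.compile(r"[a-z](-?[a-z0-9]+)*")
# Relaxed pattern from alternative a): underscore, colon and full stop allowed.
RELAXED = re.compile(r"[A-Za-z][-_:.A-Za-z0-9]*")


def classify(identifier):
    """Return which of the two patterns accept the identifier."""
    return {
        "current": CURRENT.fullmatch(identifier) is not None,
        "relaxed": RELAXED.fullmatch(identifier) is not None,
    }


for candidate in ("init", "some_title_id", "__init__"):
    print(candidate, classify(candidate))
```

In this sketch `some_title_id` passes only the relaxed pattern, while `__init__` is rejected by both because it does not start with a letter. Alternative a) alone would therefore still not keep `__init__` unchanged.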

> Relaxing rules does not produce backward incompatibility, either.

No problem for internal links (unless we also change the rules for
reference names_).

However, external links adapted to the current rules may break.

Example: a document, parrot.rst contains::

    Schöner Titel: warum nicht?
    ===========================

and I link to this section from somewhere on the net with the URL
http://example.org/parrot.html#schoner-titel-warum-nicht, this link will
be broken after re-processing the unchanged source with a Docutils
version with relaxed id-rules.

Therefore, I would only change the rules after careful consideration and an
advance warning. Possibly with an opt-in setting.

commenter: rpuntaie
posted: 2019-10-13 13:36:43.038000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

I've abbreviated the general concept of identifier with ID.
In this general meaning a reference name is an ID,
because you reference something by uniquely identifying it.

If in docutils there are more reference names and ids
then there are more ways to reference an item.
That is OK.

I was referring only to user chosen reference name_.
Let's leave out IDs generated from headers or from :name:.
I personally never rely on these generated IDs,
because I don't know them.
Instead I put .. _`some_title_id`: in front of a header.

User chosen target IDs (reference name_ in rst) should not be changed.

How are more reference names translated to html,
e.g. for the above additional some_title_id?
Multiple IDs would allow keeping the legacy ID and adding
the unchanged user reference name as an additional ID.
Otherwise one could add a docutils.conf setting to tell docutils which method to use.

About multiple IDs in html:
https://stackoverflow.com/questions/192048/can-an-html-element-have-multiple-ids
See comment by BoltClock or the answer by tvanfosson.

commenter: milde
posted: 2019-10-13 20:54:11.563000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

> User chosen target IDs (reference name_ in rst) should not be changed.

If the user confines herself to valid names, no change is made.
If the user uses invalid names, the output would be buggy in some output formats. If we want
consistent identifiers, the same rules must apply to all output formats.

Anchors with an unchecked, user-specified ID value could be written using raw input, but this is not recommended.

> How are more reference names translated to html?

Try yourself:

.. _first explicit target:
.. _other explicit target:

.. note::
   :name: refname from directive option
   
   the object

If you export to Docutils-XML or ~pseudoxml, you will see the three names and ids of the note element. In the HTML, spans are used as anchors for the additional identifiers.

commenter: rpuntaie
posted: 2019-10-15 07:58:22.844000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

Docutils has versions.
A new version is allowed to behave differently, according to semantic versioning.
Everyone knows that.
If someone uses a new version of docutils,
it is that one's responsibility to integrate it into its context.
Docutils should develop with the associated standards.
HTML has standard 5 now.
IDs should be modified only according to standard 5.
This means that only spaces can be replaced
when deriving HTML IDs.

commenter: milde
posted: 2019-10-30 21:21:32.616000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

> If someone uses a new version of docutils,
> it is that one's responsibility to integrate it into its context.

There is one problem, though: "Cool URIs don't change"
(https://www.w3.org/Provider/Style/URI.html).
When a new Docutils version produces different URIs for the same input, we
should offer users a way to keep the old URIs.

> Docutils should develop with the associated standards.
> HTML has standard 5 now.

HTML comes in many different versions. Docutils supports HTML5 with the
"html5_polyglot" writer and XHTML1.1/transitional with the default writer
"html4css1". The default may change in future.

> IDs should be modified only according to standard 5.
> This means that only spaces can be replaced
> when deriving HTML IDs.

Identifier keys must be valid in all supported output formats.
Therefore, they must comply with restrictions in the
respective output formats (HTML4.1__, HTML5__, polyglot HTML,
LaTeX, ODT__, troff (manpage), XML__).

__ http://www.w3.org/TR/html401/types.html#type-name
__ https://www.w3.org/TR/html50/dom.html#the-id-attribute
__ https://www.w3.org/TR/html-polyglot/#id-attribute
__ https://tex.stackexchange.com/questions/18311/what-are-the-valid-names-as-labels
__ https://help.libreoffice.org/6.3/en-US/text/swriter/01/04040000.html?DbPAR=WRITER#bm_id4974211
__ https://www.w3.org/TR/REC-xml/#id

We may want to keep the "one ID format for all output formats". Then only
the underscore ("_") may be allowed in addition to the current
transformation.

+1 one rule is easier to remember than a set of different rules.
-1 IDs must keep to a restrictive rule even in more relaxed output formats.

Alternatively, we may allow different identifier transformations for each
output format:

+1 ID-transformation follows (almost) the relaxed rules of the output format.
-1 More complex setup.
-1 ID value used in the output is even harder to predict.

A possible implementation would be via a new "identifier_restrictions"
configuration setting that takes a list of rule sets (CSS1, HTML4, HTML5,
XML, LaTeX, XeTeX/LuaTeX, ODT, troff) and combines them to form the required
transition.

Examples:
The current transition would be identifier_restrictions: HTML4,CSS1.

The "html5polyglot" section could use identifier_restrictions: XML, as
polyglot HTML requires valid XML identifiers.

A user may override this in a config file or with
rst2html5 --identifier-restrictions=HTML5.
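
How such a combination of rule sets might behave can be sketched as follows. Everything here is hypothetical: `RULE_SETS` and `restricted_id` are illustrative names, the rule sets are heavily simplified (the CSS1 entry in particular is an assumption), the sketch does not lowercase names, and it is not taken from any Docutils code.

```python
import re

# Hypothetical rule sets: (allowed first character, allowed later character).
RULE_SETS = {
    "HTML4": (r"[A-Za-z]", r"[-_:.A-Za-z0-9]"),
    "CSS1": (r"[A-Za-z]", r"[-A-Za-z0-9]"),  # assumption for this sketch
    "HTML5": (r"\S", r"\S"),                  # only whitespace is excluded
}


def restricted_id(name, restrictions):
    """Map a reference name to an ID accepted by all listed rule sets."""
    first = [re.compile(RULE_SETS[key][0]) for key in restrictions]
    later = [re.compile(RULE_SETS[key][1]) for key in restrictions]
    # Replace each character rejected by any rule set with a hyphen.
    mapped = "".join(
        char if all(p.fullmatch(char) for p in later) else "-"
        for char in name
    )
    # Drop leading characters that are not valid as a first character.
    while mapped and not all(p.fullmatch(mapped[0]) for p in first):
        mapped = mapped[1:]
    # Trim trailing hyphens, as the current transformation does.
    return mapped.rstrip("-")


print(restricted_id("__init__", ["HTML4", "CSS1"]))  # init
print(restricted_id("__init__", ["HTML5"]))          # __init__
```

With the combined `HTML4,CSS1` rules the sketch yields `init`, matching today's result for this name, while `HTML5` alone leaves `__init__` untouched and only replaces whitespace.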

commenter: rpuntaie
posted: 2019-10-31 13:35:58.843000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

This is a nice solution. I would also have a special --identifier-restrictions=none to turn off all ID mappings.

commenter: milde
posted: 2020-03-25 08:30:56.869000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

attachments:

https://sourceforge.net/p/docutils/feature-requests/_discuss/thread/88981c4638/c615/attachment/id_chars.py

I attach an experimental implementation draft and tests for exploration.

commenter: rpuntaie
posted: 2020-03-26 10:46:44.391000
title: #66 .. ___init__: becomes <p id="init"> instead of <p id="__init__">

I like this:

- It allows using the same ID for output formats that support it,
  which are a lot considering HTML5, ODT, XeTeX and XML.
- It also means that the generated documents of these formats all have the same ID for the same content,
  including the RST source.
- It stores the ID language restrictions of different target formats within docutils.

Regarding API, I would make your trim_name() the new make_id():

``` python

# make_id_legacy = make_id
def make_id(string, language="legacy"):
    ...

```

The has_prefix shouldn't be needed because it is determined by the ID format data id_start and id_char.

In the command line interface I would also default to legacy,
because of "Cool URIs don't change" and to avoid the necessity to change people's scripts.

I did not compare the ID language data in your py file with the documentation of the corresponding formats.

chrisjsewell added feature-requests pending-works-for-me priority-5 labels Aug 9, 2020
