Wrong interpretation of xml files analysis configuration #147

Open
kleag opened this issue Sep 26, 2022 · 3 comments

kleag commented Sep 26, 2022

Describe the bug
The XML file analysis configuration allows document ids to be set either from a tag's content (its text) or from tag attributes. However, if the tag attributes configuration is absent, or contains no attribute for the document tag (even when the id actually comes from a dedicated tag), the doc id is set incorrectly and leaks into the enclosing docset tag.

To Reproduce
Let the document be:

<?xml version="1.0" ?>
<DOCSET>
<TEI>
<idno>abcdef</idno>
<p>A text</p>
</TEI>
</DOCSET>

This config should work:

    <group name="identPrpty" class="StandardDocumentPropertyType">
      <param key="storageType" value="string"/>
      <param key="cardinality" value="mandatory"/>
      <list name="elementNames">
        <item value="idno"/>
      </list>
    </group>

but it does not. We have to add an otherwise useless attributeNames list, as below:

    <group name="identPrpty" class="StandardDocumentPropertyType">
      <param key="storageType" value="string"/>
      <param key="cardinality" value="mandatory"/>
      <list name="elementNames">
        <item value="idno"/>
      </list>
      <list name="attributeNames">
        <item value="TEI id="/>
      </list>
    </group>
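
To make the expected interpretation concrete, here is a minimal Python sketch. It is not the analyzer's actual code. The helper name `read_name_lists` is hypothetical, and the rule that a missing list counts as empty is an assumption about the intended behavior described above.

```python
import xml.etree.ElementTree as ET

# The first configuration from this report (no attributeNames list).
CONFIG_WITHOUT_ATTRIBUTES = """
<group name="identPrpty" class="StandardDocumentPropertyType">
  <param key="storageType" value="string"/>
  <param key="cardinality" value="mandatory"/>
  <list name="elementNames">
    <item value="idno"/>
  </list>
</group>
"""


def read_name_lists(group_xml):
    """Return the elementNames and attributeNames lists of a property group.

    Hypothetical helper: a list that is absent from the configuration is
    reported as an empty list (assumed intended behavior), so the caller
    never has to distinguish "missing" from "empty".
    """
    group = ET.fromstring(group_xml)
    lists = {"elementNames": [], "attributeNames": []}
    for lst in group.findall("list"):
        name = lst.get("name")
        if name in lists:
            lists[name] = [item.get("value") for item in lst.findall("item")]
    return lists


if __name__ == "__main__":
    print(read_name_lists(CONFIG_WITHOUT_ATTRIBUTES))
```

Under that assumption, the first configuration is simply the case where `attributeNames` is empty, and the id should still be taken from the `idno` element.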

With the config that "works", the decoded mult file ends with:

     <properties>
        <property name="ContentId" type="int" value="0"/>
        <property name="NodeId" type="int" value="1"/>
        <property name="StructureId" type="int" value="2"/>
        <property name="offBegPrpty" type="int" value="37"/>
        <property name="offEndPrpty" type="int" value="379"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="identPrpty" type="string" value="abcdef"/>
        <property name="srcePrpty" type="string" value="…"/>
        <property name="indexDatePrpty" type="date" value="20220926"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="31"/>
      <property name="offEndPrpty" type="int" value="386"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="srcePrpty" type="string" value="…"/>
      <property name="indexDatePrpty" type="date" value="20220926"/>
    </properties>
  </node>
</MultimediaDocuments>

while with the config that fails, we get:

     <properties>
        <property name="ContentId" type="int" value="0"/>
        <property name="NodeId" type="int" value="1"/>
        <property name="StructureId" type="int" value="2"/>
        <property name="offBegPrpty" type="int" value="37"/>
        <property name="offEndPrpty" type="int" value="72"/>
        <property name="encodPrpty" type="string" value="UTF8"/>
        <property name="identPrpty" type="string" value="abcdef"/>
        <property name="srcePrpty" type="string" value="…"/>
        <property name="indexDatePrpty" type="date" value="20220926"/>
      </properties>
    </node>
    <properties>
      <property name="ContentId" type="int" value="0"/>
      <property name="NodeId" type="int" value="1"/>
      <property name="StructureId" type="int" value="1"/>
      <property name="offBegPrpty" type="int" value="31"/>
      <property name="offEndPrpty" type="int" value="79"/>
      <property name="encodPrpty" type="string" value="UTF8"/>
      <property name="identPrpty" type="string" value="abcdef"/>
      <property name="srcePrpty" type="string" value="…"/>
      <property name="indexDatePrpty" type="date" value="20220926"/>
    </properties>
  </node>
</MultimediaDocuments>

The identPrpty property should not be present in the last properties block, which belongs to the enclosing node.
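
A quick way to spot the leak in a decoded file is sketched below in Python. The `DECODED` string is a hypothetical, trimmed stand-in for the failing output: only the tail of the real file is shown in this report, so the node nesting above it is assumed. The helper name `nodes_with_property` is also illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced version of the failing output: an inner node
# (StructureId 2) nested in an enclosing node (StructureId 1).
DECODED = """<MultimediaDocuments>
  <node>
    <node>
      <properties>
        <property name="StructureId" type="int" value="2"/>
        <property name="identPrpty" type="string" value="abcdef"/>
      </properties>
    </node>
    <properties>
      <property name="StructureId" type="int" value="1"/>
      <property name="identPrpty" type="string" value="abcdef"/>
    </properties>
  </node>
</MultimediaDocuments>
"""


def nodes_with_property(xml_text, prop_name="identPrpty"):
    """List (StructureId, value) for every node that directly carries prop_name."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter("node"):
        # find() only looks at direct children, so the properties of a
        # nested node are not attributed to its parent.
        props = node.find("properties")
        if props is None:
            continue
        values = {p.get("name"): p.get("value") for p in props.findall("property")}
        if prop_name in values:
            found.append((values.get("StructureId"), values[prop_name]))
    return found


if __name__ == "__main__":
    for structure_id, value in nodes_with_property(DECODED):
        print(f"StructureId={structure_id} carries identPrpty={value}")
```

On the failing output, both structures are reported. With the expected behavior, only the inner document node would be.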

@benlabbe you might be interested in this information.

kleag self-assigned this on Sep 26, 2022

kleag commented Sep 26, 2022

Globally, the identPrpty flow behaves incorrectly when using elementNames instead of attributeNames.

benlabbe commented

I have made several changes related to this topic in recent months.
I still need to try this specific test.

kleag commented Jul 13, 2023

Don't hesitate to close this issue if it is really solved.
