Commit 60647e5

prepare v1.11.0 (#631)
* prepare v1.11.0
* update docstrings and docs
* explain change
* add last fix
1 parent f5a53a8 commit 60647e5

8 files changed, +45 -23 lines changed

HISTORY.md

+23
@@ -1,6 +1,29 @@
  ## History / Changelog


+ ### 1.11.0
+
+ Breaking change:
+ - metadata now skipped by default (#613), to trigger inclusion in all output formats:
+ - `with_metadata=True` (Python)
+ - `--with-metadata` (CLI)
+
+ Extraction:
+ - add HTML as output format (#614)
+ - better and faster baseline extraction (#619)
+ - better handling of HTML/XML elements (#628)
+ - XPath rules added with @felipehertzer (#540)
+ - fix: avoid faulty readability_lxml content (#635)
+
+ Evaluation:
+ - new scripts and data with @LydiaKoerber (#606, #615)
+ - additional data with @swetepete (#197)
+
+ Maintenance:
+ - docs extended and updated, added page on deduplication (#618)
+ - review code, add tests and types in part of the submodules (#620, #623, #624, #625)
+
+
  ### 1.10.0

  Breaking changes:

README.md

+4 -5
@@ -60,11 +60,10 @@ search engine optimization, and information security).
  - Optional elements: comments, links, images, tables

  - Multiple output formats:
- - Text
- - Markdown (with formatting)
- - CSV (with metadata)
- - JSON (with metadata)
- - XML or [XML-TEI](https://tei-c.org/) (with metadata, text formatting and page structure)
+ - TXT and Markdown
+ - CSV
+ - JSON
+ - HTML, XML and [XML-TEI](https://tei-c.org/)

  - Optional add-ons:
  - Language detection on extracted content

docs/index.rst

+4 -5
@@ -60,11 +60,10 @@ Features
  - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
  - Optional elements: comments, links, images, tables
  - Multiple output formats:
- - Text
- - Markdown (with formatting)
- - CSV (with metadata)
- - JSON (with metadata)
- - XML or `XML-TEI <https://tei-c.org/>`_ (with metadata, text formatting and page structure)
+ - TXT and Markdown
+ - CSV
+ - JSON
+ - HTML, XML and [XML-TEI](https://tei-c.org/)
  - Optional add-ons:
  - Language detection on extracted content
  - Graphical user interface (GUI)

docs/usage-cli.rst

+5 -4
@@ -283,8 +283,8 @@ For all usage instructions see ``trafilatura -h``:
  [--no-tables] [--only-with-metadata]
  [--target-language TARGET_LANGUAGE] [--deduplicate]
  [--config-file CONFIG_FILE] [--precision] [--recall]
- [-out {txt,csv,json,markdown,xml,xmltei} | --csv | --json |
- --markdown | --xml | --xmltei]
+ [-out {txt,csv,html,json,markdown,xml,xmltei} | --csv | --html |
+ --json | --markdown | --xml | --xmltei]
  [--validate-tei] [-v] [--version]


@@ -343,7 +343,7 @@ Extraction:
  --no-comments don't output any comments
  --no-tables don't output any table elements
  --only-with-metadata only output those documents with title, URL and date
- (for formats supporting metadata)
+ --with-metadata extract and add metadata to the output
  --target-language TARGET_LANGUAGE
  select a target language (ISO 639-1 codes)
  --deduplicate filter out duplicate documents and sections
@@ -360,8 +360,9 @@ Format:

  .. code-block:: bash

- -out {txt,csv,json,markdown,xml,xmltei}, --output-format {txt,csv,json,markdown,xml,xmltei} determine output format
+ -out {txt,csv,html,json,markdown,xml,xmltei}, --output-format {txt,csv,html,json,markdown,xml,xmltei}
  --csv shorthand for CSV output
+ --html shorthand for HTML output
  --json shorthand for JSON output
  --markdown shorthand for MD output
  --xml shorthand for XML output
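
The usage line in the hunk above puts `-out` and the shorthand flags inside one bracketed, `|`-separated alternative, which is how argparse renders a mutually exclusive group. The following is a minimal sketch of that layout with `html` added, assuming a stand-alone parser. The helpers `build_parser` and `resolve_format` are hypothetical names for illustration and are not taken from trafilatura's code.

```python
import argparse

# Format names as listed in the updated usage text.
FORMATS = ["txt", "csv", "html", "json", "markdown", "xml", "xmltei"]
SHORTHANDS = ["csv", "html", "json", "markdown", "xml", "xmltei"]


def build_parser():
    """Hypothetical stand-in for the format options of the CLI."""
    parser = argparse.ArgumentParser(prog="demo")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-out", "--output-format", choices=FORMATS,
                       default="txt", help="determine output format")
    for name in SHORTHANDS:
        # One boolean flag per shorthand, e.g. --html
        group.add_argument(f"--{name}", action="store_true",
                           help=f"shorthand for {name} output")
    return parser


def resolve_format(args):
    """Return the shorthand that was set, else the -out value."""
    for name in SHORTHANDS:
        if getattr(args, name):
            return name
    return args.output_format


if __name__ == "__main__":
    args = build_parser().parse_args(["--html"])
    print(resolve_format(args))
```

In this sketch, passing both `-out json` and `--html` makes argparse exit with a usage error, since the two options belong to the same exclusive group.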

docs/usage-python.rst

+1 -1
@@ -33,7 +33,7 @@ For the basics see `quickstart documentation page <quickstart.html>`_.

  Default output is set to TXT (bare text) without metadata.

- The following formats are available: bare text, Markdown (from version 1.9 onwards), CSV, JSON, XML, and XML following the guidelines of the Text Encoding Initiative (TEI).
+ The following formats are available: bare text, Markdown (from version 1.9 onwards), HTML (from version 1.11 onwards), CSV, JSON, XML, and XML following the guidelines of the Text Encoding Initiative (TEI).


  .. hint::

trafilatura/__init__.py

+1 -1
@@ -9,7 +9,7 @@
  __author__ = 'Adrien Barbaresi and contributors'
  __license__ = "Apache-2.0"
  __copyright__ = 'Copyright 2019-2024, Adrien Barbaresi'
- __version__ = '1.10.0'
+ __version__ = '1.11.0'


  import logging

trafilatura/cli.py

+3 -3
@@ -138,11 +138,11 @@ def add_args(parser):
  help="don't output any table elements",
  action="store_false") # false = no tables
  group4.add_argument("--only-with-metadata",
- help="only output those documents with title, URL and date (for formats supporting metadata)",
+ help="only output those documents with title, URL and date",
  action="store_true")
  group4.add_argument("--with-metadata",
- help=argparse.SUPPRESS,
- action="store_true") # will be deprecated
+ help="extract and add metadata to the output",
+ action="store_true")
  group4.add_argument("--target-language",
  help="select a target language (ISO 639-1 codes)",
  type=str)
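
The hunk above swaps `help=argparse.SUPPRESS` for a real help string on `--with-metadata`. With `SUPPRESS` the flag was still accepted but left out of the `-h` output; it is now documented. A small sketch of that difference, using a hypothetical `build_parser` helper rather than trafilatura's own `add_args`:

```python
import argparse


def build_parser(hidden):
    """Hypothetical one-flag parser; `hidden` mimics the old SUPPRESS setting."""
    parser = argparse.ArgumentParser(prog="demo")
    help_text = (argparse.SUPPRESS if hidden
                 else "extract and add metadata to the output")
    parser.add_argument("--with-metadata", help=help_text, action="store_true")
    return parser


if __name__ == "__main__":
    before = build_parser(hidden=True)   # old behaviour: flag not listed
    after = build_parser(hidden=False)   # new behaviour: flag listed
    print("--with-metadata" in before.format_help())
    print("--with-metadata" in after.format_help())
    # The flag stays off unless requested, matching "skipped by default".
    print(after.parse_args([]).with_metadata)
```

Both parsers accept the flag on the command line; only its visibility in the help text differs.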

trafilatura/core.py

+4 -4
@@ -89,7 +89,7 @@ def bare_extraction(filecontent, url=None, no_fallback=False, # fast=False,
  include_comments: Extract comments along with the main text.
  output_format: Define an output format, Python being the default
  and the interest of this internal function.
- Other values: "csv", "json", "markdown", "txt", "xml", and "xmltei".
+ Other values: "csv", "html", "json", "markdown", "txt", "xml", and "xmltei".
  target_language: Define a language to discard invalid documents (ISO 639-1 format).
  include_tables: Take into account information within the HTML <table> element.
  include_images: Take images into account (experimental).
@@ -98,7 +98,7 @@ def bare_extraction(filecontent, url=None, no_fallback=False, # fast=False,
  include_links: Keep links along with their targets (experimental).
  deduplicate: Remove duplicate segments and documents.
  date_extraction_params: Provide extraction parameters to htmldate as dict().
- with_metadata: Extract metadata fields and add them to the output (available soon).
+ with_metadata: Extract metadata fields and add them to the output.
  only_with_metadata: Only keep documents featuring all essential metadata
  (date, title, url).
  max_tree_size: Discard documents with too many elements.
@@ -278,7 +278,7 @@ def extract(filecontent, url=None, record_id=None, no_fallback=False,
  favor_recall: when unsure, prefer more text.
  include_comments: Extract comments along with the main text.
  output_format: Define an output format:
- "csv", "json", "markdown", "txt", "xml", and "xmltei".
+ "csv", "html", "json", "markdown", "txt", "xml", and "xmltei".
  tei_validation: Validate the XML-TEI output with respect to the TEI standard.
  target_language: Define a language to discard invalid documents (ISO 639-1 format).
  include_tables: Take into account information within the HTML <table> element.
@@ -288,7 +288,7 @@ def extract(filecontent, url=None, record_id=None, no_fallback=False,
  include_links: Keep links along with their targets (experimental).
  deduplicate: Remove duplicate segments and documents.
  date_extraction_params: Provide extraction parameters to htmldate as dict().
- with_metadata: Extract metadata fields and add them to the output (available soon).
+ with_metadata: Extract metadata fields and add them to the output.
  only_with_metadata: Only keep documents featuring all essential metadata
  (date, title, url).
  max_tree_size: Discard documents with too many elements.
