Commit b36b6fa

prepare version 1.10.0 (#608)
* prepare version 1.10.0
* fixes
1 parent bbf7bec commit b36b6fa

7 files changed, +52 -6 lines changed

HISTORY.md

+27
@@ -1,6 +1,33 @@
 ## History / Changelog
 
 
+### 1.10.0
+
+Breaking changes:
+- raise errors on deprecated CLI and function arguments (#581)
+- regroup classes and functions linked to deduplication (#582)
+  ``trafilatura.hashing`` → ``trafilatura.deduplication``
+
+Extraction:
+- port of is_probably_readerable from readability.js by @zirkelc in #587
+- Markdown table fixes by @naktinis in #601
+- fix list spacing in TXT output (#598)
+- CLI fixes: file processing options, mtime, and tests (#605)
+- CLI fix: read standard input as binary (#607)
+
+Downloads:
+- fix deflate and add optional zstd to accepted encodings (#594)
+- spider fix: use internal download utilities for robots.txt (#590)
+
+Maintenance:
+- add author XPaths (#567)
+- update justext and lxml dependencies (#593)
+- simplify code: unique function for length tests (#591)
+
+Docs:
+- fix typos by @RainRat in #603
+
+
 ### 1.9.0
 
 Extraction:

README.md

+2 -1
@@ -60,7 +60,8 @@ search engine optimization, and information security).
 - Optional elements: comments, links, images, tables
 
 - Multiple output formats:
-- Text (minimal formatting or Markdown)
+- Text
+- Markdown (with formatting)
 - CSV (with metadata)
 - JSON (with metadata)
 - XML or [XML-TEI](https://tei-c.org/) (with metadata, text formatting and page structure)

docs/index.rst

+2 -1
@@ -60,7 +60,8 @@ Features
 - Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
 - Optional elements: comments, links, images, tables
 - Multiple output formats:
-- Text (minimal formatting or Markdown)
+- Text
+- Markdown (with formatting)
 - CSV (with metadata)
 - JSON (with metadata)
 - XML or `XML-TEI <https://tei-c.org/>`_ (with metadata, text formatting and page structure)

docs/installation.rst

+2
@@ -121,6 +121,8 @@ py3langid
     Language detection on extracted main text
 pycurl
     Faster downloads, possibly less robust though
+zstandard
+    Additional compression algorithm for downloads
 
 
 
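The `zstandard` entry added in this hunk is an optional dependency. A minimal sketch of how a client could advertise `zstd` only when that package is importable; the helper name `accepted_encodings` and the header layout are illustrative assumptions, not Trafilatura's actual implementation:

```python
import importlib.util


def accepted_encodings():
    """Build an Accept-Encoding value (hypothetical helper).

    zstd is listed only if the optional `zstandard` package can be found.
    """
    encodings = ["gzip", "deflate"]
    if importlib.util.find_spec("zstandard") is not None:
        encodings.append("zstd")
    return ", ".join(encodings)


# Request headers that reflect what this environment can decode
headers = {"Accept-Encoding": accepted_encodings()}
print(headers)
```

Checking for the package at runtime keeps the dependency optional: without `zstandard` installed, the client simply never asks the server for that encoding.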

docs/usage-python.rst

+17 -2
@@ -157,6 +157,17 @@ This function emulates the behavior of similar functions in other packages, it i
     >>> html2txt(downloaded)
 
 
+Guessing if text can be found
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The function ``is_probably_readerable()`` has been ported from Mozilla's Readability.js, it is available from version 1.10.0 onwards and provides a way to guess if a page probably has a main text to extract.
+
+.. code-block:: python
+
+    >>> from trafilatura.readability_lxml import is_probably_readerable
+    >>> is_probably_readerable(html)  # HTML string or already parsed tree
+
+
 Language identification
 ^^^^^^^^^^^^^^^^^^^^^^^
 

@@ -199,7 +210,7 @@ The `SimHash <https://en.wikipedia.org/wiki/SimHash>`_ method (also called Chari
 .. code-block:: python
 
     # create a Simhash for near-duplicate detection
-    >>> from trafilatura.hashing import Simhash
+    >>> from trafilatura.deduplication import Simhash
     >>> first = Simhash("This is a text.")
     >>> second = Simhash("This is a test.")
     >>> second.similarity(first)
@@ -217,11 +228,15 @@ Other convenience functions include generation of file names based on their cont
 .. code-block:: python
 
     # create a filename-safe string by hashing the given content
-    >>> from trafilatura.hashing import generate_hash_filename
+    >>> from trafilatura.deduplication import generate_hash_filename
     >>> generate_hash_filename("This is a text.")
     'qAgzZnskrcRgeftk'
 
 
+.. note::
+    The ``trafilatura.hashing`` submodule has been renamed ``trafilatura.deduplication`` in version 1.10.0.
+
+
 Extraction settings
 -------------------
 
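Because the submodule rename is a breaking change, code that has to run on both sides of 1.10.0 can try the new module path first and fall back to the old one. A minimal sketch, assuming only the two module names given in the note; `import_first` is a hypothetical helper and not part of Trafilatura:

```python
import importlib


def import_first(*module_names):
    """Import and return the first module in the list that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Intended use (requires trafilatura to be installed):
# dedup = import_first("trafilatura.deduplication", "trafilatura.hashing")
# Simhash = dedup.Simhash
```

Listing the new name first means the fallback is only exercised on releases that predate the rename.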

tests/cli_tests.py

+1 -1
@@ -111,7 +111,7 @@ def test_parser():
     assert e.type == SystemExit
     assert e.value.code == 0
     assert re.match(
-        r"Trafilatura [0-9]\.[0-9]\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]", f.getvalue()
+        r"Trafilatura [0-9]\.[0-9]+\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]", f.getvalue()
     )
 
     # test deprecations
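The pattern change in this hunk matters because `[0-9]` matches exactly one digit, so the old expression cannot match a two-digit minor version such as `1.10.0`. A quick check of both patterns, using made-up banner strings (the Python version shown is an assumption for illustration):

```python
import re

# Patterns copied from the removed and added lines of the hunk
OLD = r"Trafilatura [0-9]\.[0-9]\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]"
NEW = r"Trafilatura [0-9]\.[0-9]+\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]"

# Hypothetical version banners
banner_1_9 = "Trafilatura 1.9.0 - Python 3.11.4"
banner_1_10 = "Trafilatura 1.10.0 - Python 3.11.4"

old_matches_1_9 = re.match(OLD, banner_1_9) is not None
old_matches_1_10 = re.match(OLD, banner_1_10) is not None
new_matches_1_9 = re.match(NEW, banner_1_9) is not None
new_matches_1_10 = re.match(NEW, banner_1_10) is not None

print(old_matches_1_9, old_matches_1_10)  # True False
print(new_matches_1_9, new_matches_1_10)  # True True
```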

trafilatura/__init__.py

+1 -1
@@ -9,7 +9,7 @@
 __author__ = 'Adrien Barbaresi and contributors'
 __license__ = "Apache-2.0"
 __copyright__ = 'Copyright 2019-2024, Adrien Barbaresi'
-__version__ = '1.9.0'
+__version__ = '1.10.0'
 
 
 import logging
