# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#     jupytext_version: 1.14.7
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---
# %% [markdown]
# # Renard: Relationships Extraction from NARrative Documents
#
#
# Renard is a modular Python pipeline that can extract static or dynamic character networks from narrative documents.
#
# - Installation: `pip install renard-pipeline`
# - Documentation: https://compnet.github.io/Renard/
# %% [markdown]
# # General Overview
#
#
# The central object in Renard is a `Pipeline` that you can execute on a document. A `Pipeline` is a sequence of natural language processing `Step`s needed to extract a character network. Here is an example of a classic `Pipeline`:
#
# ```
# text
# |
# v
# [tokenization]
# |
# v
# [named entity recognition (NER)]
# |
# v
# [character unification]
# |
# v
# [co-occurrences graph extraction]
# |
# v
# character network
# ```
#
# Which you could write using Renard:
#
# ```python
# from renard.pipeline import Pipeline
# from renard.pipeline.tokenization import NLTKTokenizer
# from renard.pipeline.ner import NLTKNamedEntityRecognizer
# from renard.pipeline.character_unification import GraphRulesCharacterUnifier
# from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor
#
# # Pipeline definition
# pipeline = Pipeline(
#     [
#         NLTKTokenizer(),  # tokenization
#         NLTKNamedEntityRecognizer(),  # named entity recognition
#         GraphRulesCharacterUnifier(),  # character unification
#         CoOccurrencesGraphExtractor(co_occurrences_dist=(1, "sentences")),  # graph extraction
#     ]
# )
# ```
#
# You can then execute that pipeline on a given text:
#
# ```python
# with open("./my_text.txt") as f:
#     text = f.read()
#
# out = pipeline(text)
# ```
#
# The `out` object then contains the pipeline execution's result, which includes the character network (see the `character_network` attribute). For example, we can export this network to disk:
#
# ```python
# out.export_graph_to_gexf("./my_graph.gexf")
# ```
# %% [markdown]
# # Example: Extracting a Character Network with Existing NER Annotations
# %% [markdown]
# ## Static Graph Extraction
# %%
from renard.pipeline import Pipeline
from renard.pipeline.character_unification import GraphRulesCharacterUnifier
from renard.pipeline.graph_extraction import CoOccurrencesGraphExtractor
from renard.ner_utils import load_conll2002_bio
# the utility function "load_conll2002_bio" allows loading BIO NER
# annotations from disk.
sentences, tokens, entities = load_conll2002_bio("./tests/three_musketeers_fra.bio")
# pipeline creation. Only the character unification and graph
# extraction steps are specified, since tokenization and BIO tags
# are already given.
pipeline = Pipeline(
    [
        GraphRulesCharacterUnifier(),
        # an interaction will be a co-occurrence within a range of 3
        # sentences or less
        CoOccurrencesGraphExtractor(co_occurrences_dist=(3, "sentences")),
    ],
    lang="fra",
)
# pipeline execution. The caller gives the tokens, sentences and NER
# entities to the pipeline at runtime.
out = pipeline(tokens=tokens, sentences=sentences, entities=entities)
# %% [markdown]
# ## Graph Display
#
# Renard can display the extracted graph using `matplotlib`. This visualization is naive: for more advanced usage, it is recommended to export the graph and open it in software such as `Gephi`.
#
# _note: if there are display issues with Jupyter, see https://discourse.jupyter.org/t/jupyter-notebook-zmq-message-arrived-on-closed-channel-error/17869/27. `pip install --upgrade "pyzmq<25" "jupyter_client<8"` might fix the issue._
#
# _note: if display issues persist, you can still save an image to disk with `out.plot_graph_to_file("./my_image.png")`_
# %%
# %matplotlib notebook
import matplotlib.pyplot as plt
out.plot_graph()
plt.show()
# %% [markdown]
# # Extraction Setup
#
# Here are a few examples of tweaks you can apply to the extraction setup. Depending on your text, these can enhance the quality of the extracted graph. For example, the `min_appearances` parameter filters out characters that appear infrequently, which can reduce noise from the detection process, at the cost of eliminating some minor characters. The `co_occurrences_dist` parameter sets the maximum distance between two character mentions for them to count as an interaction: a lower value yields fewer but stricter interactions, while a higher value yields more frequent (but less strict) interactions.
# %%
pipeline = Pipeline(
    [
        # at least 10 occurrences of a character are needed for them
        # to be included in the graph (default is 1)
        GraphRulesCharacterUnifier(min_appearances=10),
        # a co-occurrence between two characters is counted if its
        # range is lower than or equal to 10 sentences
        CoOccurrencesGraphExtractor(co_occurrences_dist=(10, "sentences")),
    ],
    lang="fra",
)
out = pipeline(tokens=tokens, sentences=sentences, entities=entities)
out.plot_graph()
plt.show()
# %% [markdown]
# ## Advanced Graph Manipulation
#
# The `character_network` attribute contains the `networkx` graph extracted by Renard. For advanced use cases, you can manipulate this graph directly in Python.
# %%
import networkx as nx
print(nx.density(out.character_network))
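# %% [markdown]
# Since `character_network` is a regular `networkx` graph, any `networkx` algorithm applies to it. As an illustration, here is a small self-contained sketch (using a hypothetical toy graph standing in for `out.character_network`) that ranks characters by degree centrality:
# %%
import networkx as nx

# toy co-occurrence graph standing in for out.character_network
toy_graph = nx.Graph()
toy_graph.add_edge("Athos", "d'Artagnan", weight=5)
toy_graph.add_edge("Porthos", "d'Artagnan", weight=3)
toy_graph.add_edge("Aramis", "d'Artagnan", weight=4)

# rank characters from most to least central
centrality = nx.degree_centrality(toy_graph)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking)  # d'Artagnan comes first: he is connected to everyone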
# %% [markdown]
# ## Gephi Export
#
# The `export_graph_to_gexf` function exports the graph to the GEXF format, which can be opened in Gephi.
# %%
out.export_graph_to_gexf("./my_graph.gexf")
# %% [markdown]
# ## Extracting a dynamic graph
#
# It is possible to ask the `CoOccurrencesGraphExtractor` step to extract a _dynamic_ graph using the `dynamic` argument and a few parameters.
# %%
pipeline = Pipeline(
    [
        GraphRulesCharacterUnifier(min_appearances=10),
        CoOccurrencesGraphExtractor(
            co_occurrences_dist=(20, "sentences"),
            dynamic=True,  # we want to extract a dynamic graph (i.e. a list of sequential graphs)
            dynamic_window=20,  # the size, in number of interactions, of each graph
            dynamic_overlap=0,  # overlap between windows
        ),
    ],
    lang="fra",
)
out = pipeline(tokens=tokens, sentences=sentences, entities=entities)
# the display adapts to the fact that the extracted graph is dynamic,
# and allows exploring each graph of the list.
out.plot_graph()
plt.show()
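# %% [markdown]
# To illustrate what `dynamic_window` and `dynamic_overlap` mean, here is a small standalone sketch (a simplification for illustration, not Renard's actual implementation) of slicing a sequence of interactions into windows:
# %%
def windows(items: list, window: int, overlap: int) -> list:
    """Split `items` into consecutive windows of `window` elements,
    each overlapping the previous one by `overlap` elements."""
    step = window - overlap
    return [items[i : i + window] for i in range(0, max(len(items) - overlap, 1), step)]

# 50 interactions, windows of 20 with no overlap:
# three graphs of 20, 20 and 10 interactions
print([len(w) for w in windows(list(range(50)), 20, 0)])  # [20, 20, 10]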
# %% [markdown]
# It is also possible to explore the cumulative dynamic graph:
# %%
out.plot_graph(cumulative=True, stable_layout=True)
plt.show()
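# %% [markdown]
# A cumulative dynamic graph shows, at each step, the union of all interactions seen so far. As an illustration of the idea (a sketch, not Renard's internals), successive snapshots can be accumulated with `networkx.compose`:
# %%
import networkx as nx

# two hypothetical successive snapshots of a dynamic graph
g1 = nx.Graph([("Athos", "Porthos")])
g2 = nx.Graph([("Porthos", "Aramis")])

# the cumulative graph at step 2 contains the edges of both snapshots
cumulative = nx.compose(g1, g2)
print(sorted(cumulative.edges()))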
# %% [markdown]
# And to export the dynamic graph to the Gephi format. When doing so,
# the 'timeline' feature of Gephi will work as expected:
# %%
out.export_graph_to_gexf("./graphe_dynamique.gexf")