Commit 1fc4b2b

Add CLI and mappings docs (previously part of the Aleph docs) (#1557)
The documentation for the ftm CLI and mappings was previously part of the [Aleph docs](https://docs.aleph.occrp.org/developers). We recently restructured the Aleph docs. As part of that, we removed the ftm CLI and mappings docs. I’m moving them to the FollowTheMoney docs as they are helpful resources. I’ve only removed a few references to Aleph that would’ve been confusing otherwise and fixed a few links, but apart from that the contents are mostly unedited.
1 parent 8cd9bf3 commit 1fc4b2b

6 files changed

+677
-40
lines changed

docs/src/pages/docs/cli.mdx

Lines changed: 228 additions & 38 deletions
@@ -3,60 +3,250 @@ layout: '@layouts/DocsLayout.astro'
33
title: CLI
44
---
55

6-
# Command-Line Functions
6+
# CLI
77

8-
Many of the functions of _followthemoney_ can be used interactively or in scripts via the command line. Please first refer to the [Aleph documentation](https://docs.aleph.occrp.org/developers/followthemoney/ftm) for an intro to the `ftm` utility.
8+
The `ftm` command-line tool can be used to generate, process and export streams of entities in a line-based JSON format. Typical uses would include:
99

10-
Key to understanding the `ftm` tool is the notion of [streams](/docs#streams): entities can be transferred between programs and processing steps as a series of JSON objects, one per line. This notion is supported by the related [alephclient](https://docs.aleph.occrp.org/developers/alephclient) command, which can serve as a source, and a sink for entity streams, backed by the Aleph API.
10+
* Generating FollowTheMoney entities by applying an [entity mapping to structured data tables](/docs/mappings) (CSV, SQL).
11+
* Converting an existing stream of FollowTheMoney entities into another format, such as CSV, Excel, Gephi GEXF or Neo4J's Cypher language.
12+
* Converting data in complex formats, such as the Open Contracting Data Standard, into FollowTheMoney entities.
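To make the "line-based JSON format" concrete, here is a minimal Python sketch of how such a stream can be consumed. It assumes the usual FollowTheMoney entity shape (an `id`, a `schema` and a `properties` object whose values are lists); the `read_entities` helper and the sample data are hypothetical and only stand in for the output of an `ftm` command.

```python
import io
import json
from typing import Iterator, TextIO


def read_entities(stream: TextIO) -> Iterator[dict]:
    """Yield one entity dict per non-empty line of a line-based JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data; in practice this would be sys.stdin or a file.
sample = io.StringIO(
    '{"id": "p1", "schema": "Person", "properties": {"name": ["Jane Doe"]}}\n'
    '{"id": "c1", "schema": "Company", "properties": {"name": ["Acme Ltd"]}}\n'
)
entities = list(read_entities(sample))
print([entity["schema"] for entity in entities])
```

Because every line is a complete JSON object, a consumer can process entities one at a time without loading the whole file.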
1113

12-
## Examples
14+
## Installation
1315

14-
The command line sequence below uses shell pipes to a) [map data](/docs/mappings) into entities from a database, b) apply a [namespace](/docs/namespace) to the entity IDs, c) aggregate [entity fragments](/docs/fragments) created by the mapping, and d) export the resulting entity stream into a sequence of CYPHER statements that can be executed on a Neo4J database to generate a property graph:
16+
To install `ftm`, you need to have Python 3 installed and working on your computer. You may also want to create a virtual environment using virtualenv or pyenv. With that done, type:
1517

1618
```bash
17-
ftm map companies_from_db.yml | \
18-
ftm sign -s my_namespace | \
19-
ftm aggregate | \
20-
ftm export-cypher -o graph.cypher
19+
pip install followthemoney
20+
ftm --help
2121
```
2222

23-
Here's another example that fetches pre-generated entities from a URL and loads
24-
them into a local Aleph instance:
23+
### Optional: Enhanced transliteration support
24+
25+
One of the jobs of followthemoney is to transliterate text from various alphabets into the Latin script to support the comparison of names. The default tool used for this is prone to fail with certain alphabets, such as the one used for Azeri. For that reason, we recommend also installing ICU (International Components for Unicode).
26+
27+
On a Debian-based Linux system, installing ICU is relatively simple:
2528

2629
```bash
27-
export URL=https://public.data.occrp.org/datasets/icij/panama_papers.ijson
28-
curl -s $URL | \
29-
ftm validate | \
30-
alephclient write-entities -f icij_panama_papers
30+
apt install libicu-dev
31+
pip install pyicu
3132
```
3233

33-
## Reference
34+
Installing ICU on macOS is somewhat more involved and requires Homebrew:
3435

35-
Please refer to the output of `ftm --help` for a detailed reference of the `ftm` CLI:
36+
```bash
37+
brew install icu4c
38+
export CFLAGS=-I/usr/local/opt/icu4c/include
39+
export LDFLAGS=-L/usr/local/opt/icu4c/lib
40+
export PATH=$PATH:/usr/local/opt/icu4c/bin
41+
pip install pyicu
42+
```
3643

44+
## Executing a data mapping
45+
46+
Probably the most common task for `ftm` is to generate FollowTheMoney entities from some structured data source. This is done using a YAML-formatted mapping file, [described here](/docs/mappings). With such a YAML file in hand, you can generate entities like this:
47+
48+
```bash
49+
curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/main/mappings/md_companies.yml
50+
ftm map md_companies.yml
3751
```
38-
Usage: ftm [OPTIONS] COMMAND [ARGS]...
3952

40-
Utility for FollowTheMoney graph data
53+
This will yield a line-based JSON stream of every company in Moldova, their directors and principal shareholders.
4154

42-
Options:
43-
--help Show this message and exit.
55+
<Image
56+
src="/assets/pages/docs/cli/mapping-result.png"
57+
alt="Screenshot of a terminal window. The terminal shows the output of the `ftm map` command to generate the Moldovan company data."
58+
density={2}
59+
/>
4460

45-
Commands:
46-
aggregate Aggregate multiple fragments of entities
47-
dump-model Export the current schema model
48-
export-csv Export to CSV
49-
export-cypher Export to Cypher script
50-
export-excel Export to Excel
51-
export-gexf Export to GEXF (Gephi) format
52-
export-neo4j-bulk Export to Neo4J bulk import
53-
export-rdf Export to RDF NTriples
54-
import-vis Load a .VIS file and get entities
55-
map Execute a mapping file and emit objects
56-
map-csv Map CSV data from stdin and emit objects
57-
pretty Format a stream of entities to make it readable
58-
sieve Filter out parts of entities.
59-
sign Apply a HMAC signature to entity IDs
60-
sorted-aggregate Aggregate sorted fragments of entities
61-
validate Re-parse and validate the given data
61+
You might note, however, that this actually generates multiple entity fragments for each company (i.e. multiple entities with the same ID). This is due to the way the md_companies mapping is written: each query section generates a partial company record. In order to mitigate this, you will need to perform entity aggregation:
62+
63+
```bash
64+
curl -o md_companies.yml https://raw.githubusercontent.com/alephdata/aleph/main/mappings/md_companies.yml
65+
ftm map md_companies.yml | ftm aggregate > moldova.ijson
6266
```
67+
68+
`ftm aggregate` holds the entire dataset in memory, which is not feasible for large datasets. In such cases, it's recommended to use an on-disk entity aggregation tool, `followthemoney-store`.
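The following Python sketch illustrates the idea behind in-memory aggregation: fragments sharing an ID are merged into one entity by combining their property values. It is a simplified illustration, not the implementation used by `ftm aggregate`; in particular, the hypothetical `aggregate` function keeps the schema of the first fragment it sees, and the sample fragments are made up.

```python
def aggregate(fragments):
    """Merge entity fragments that share an ID into single entities."""
    merged = {}
    for fragment in fragments:
        # Assumption: the first fragment seen decides the schema.
        entity = merged.setdefault(
            fragment["id"],
            {"id": fragment["id"], "schema": fragment["schema"], "properties": {}},
        )
        for prop, values in fragment.get("properties", {}).items():
            target = entity["properties"].setdefault(prop, [])
            for value in values:
                if value not in target:  # avoid duplicate values
                    target.append(value)
    return list(merged.values())


# Hypothetical fragments, as two query sections of a mapping might emit them.
fragments = [
    {"id": "c1", "schema": "Company", "properties": {"name": ["Acme Ltd"]}},
    {"id": "c1", "schema": "Company", "properties": {"address": ["40 Wall Street"]}},
]
result = aggregate(fragments)
print(result)
```

The `merged` dictionary grows with the number of distinct entities, which is why this approach stops working once the dataset no longer fits in memory.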
69+
70+
### Loading data from a local CSV file
71+
72+
Another peculiarity of `ftm map` is that the source data is actually referenced within the YAML mapping file as an absolute URL. While this makes sense for data sourced from a SQL database or a public CSV file, you might sometimes want to map a local CSV file instead. For this, a modified version of `ftm map` is provided, `ftm map-csv`. It ignores the specified source URLs and reads data from standard input:
73+
74+
```bash
75+
cat people_of_interest.csv | ftm map-csv people_of_interest.yml | ftm aggregate
76+
```
77+
78+
## Exporting entities to Excel or CSV
79+
80+
FollowTheMoney data can be exported to tabular formats, such as modern Excel (XLSX) files, and comma-separated values (CSV). Since each schema of entities has a different set of properties it makes sense to turn each schema into a separate table: `People` go into one, `Directorships` into another.
81+
82+
To export to an Excel file, use the `ftm export-excel` command:
83+
84+
```bash
85+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
86+
cat us_ofac.ijson | ftm validate | ftm export-excel -o OFAC.xlsx
87+
```
88+
89+
Since writing the binary data of an Excel file to standard output is awkward, it is mandatory to include a file name with the `-o` option.
90+
91+
<Image
92+
src="/assets/pages/docs/cli/export-excel.png"
93+
alt="Screenshot of Microsoft Excel showing the export from the example above. The Excel file has multiple sheets, one for each entity type (e.g. People, Companies, and Ownerships)."
94+
density={2}
95+
/>
96+
97+
<Callout theme="danger">
98+
When exporting to Excel format, it's easy to generate a workbook larger than what Microsoft Excel and similar office programs can actually open. Only export small and mid-size datasets.
99+
</Callout>
100+
101+
When exporting to CSV format using `ftm export-csv`, the exporter will usually generate multiple output files, one for each schema of entities present in the input stream of FollowTheMoney entities. To handle this, it expects to be given a directory name:
102+
103+
```bash
104+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
105+
cat us_ofac.ijson | ftm validate | ftm export-csv -o OFAC/
106+
```
107+
108+
In the given directory, you will find files named `Person.csv`, `LegalEntity.csv`, `Vessel.csv`, etc.
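As a rough illustration of the one-table-per-schema idea, the Python sketch below groups entities by schema and renders each group as CSV text. The column layout (an `id` column followed by sorted property names, with multiple values joined by `;`) is an assumption made for this example and may differ from what `ftm export-csv` actually writes; `export_tables` is a hypothetical helper.

```python
import csv
import io


def export_tables(entities):
    """Return a dict mapping '<Schema>.csv' to CSV text for that schema."""
    rows_by_schema = {}
    for entity in entities:
        rows_by_schema.setdefault(entity["schema"], []).append(entity)

    tables = {}
    for schema, rows in rows_by_schema.items():
        # Each schema gets its own set of columns, based on the properties used.
        columns = sorted({prop for row in rows for prop in row["properties"]})
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["id"] + columns)
        for row in rows:
            cells = [";".join(row["properties"].get(col, [])) for col in columns]
            writer.writerow([row["id"]] + cells)
        tables[schema + ".csv"] = buffer.getvalue()
    return tables


# Hypothetical sample entities.
sample_entities = [
    {"id": "p1", "schema": "Person", "properties": {"name": ["Jane Doe"]}},
    {"id": "v1", "schema": "Vessel", "properties": {"name": ["Sea Star"], "flag": ["pa"]}},
]
tables = export_tables(sample_entities)
print(sorted(tables))
```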
109+
110+
## Exporting data to a network graph
111+
112+
FollowTheMoney sees every unit of information as an entity with a set of properties. To analyse this information as a network with nodes and edges, we need to decide what logic should rule the transformation of entities into nodes and edges. Different strategies are available:
113+
114+
* Some entity schemata, such as `Directorship`, `Ownership`, `Family` or `Payment`, contain annotations that define how they can be transformed into an edge with a source and target.
115+
* Entities also naturally reference others. For example, an `Email` has an `emitters` property that refers to a `LegalEntity`, the sender. The `emitters` property connects the two entities and can also be turned into an edge.
116+
* Finally, some types of properties (e.g. `email`, `iban`, `names`) can be formed into nodes, with edges formed towards each node that derives from an entity with that property value. For example, an `address` node for "40 Wall Street" would show links to all the companies registered there, or a node representing the name "Frank Smith" would connect all the documents mentioning that name. It rarely makes sense to turn all property types into nodes, so the set of types that need to be [reified](<https://en.wikipedia.org/wiki/Reification_(computer_science)>) can be passed as options into the graph exporter.
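The first and third strategies can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exporter's actual code: the `EDGE_SCHEMATA` table is a hand-written stand-in for the schema annotations (it assumes `Ownership` links its `owner` to its `asset`), `to_graph` is a hypothetical helper, and reification is keyed on property names rather than property types to keep the example short. Entity-to-entity references from the second strategy are left out.

```python
# Assumption: stand-in for the edge annotations defined in the schema model.
EDGE_SCHEMATA = {"Ownership": ("owner", "asset")}


def to_graph(entities, reify=()):
    """Turn entities into a set of node IDs and (source, label, target) edges."""
    nodes, edges = set(), set()
    for entity in entities:
        props = entity["properties"]
        if entity["schema"] in EDGE_SCHEMATA:
            # Edge-like entities become edges between the entities they link.
            source_prop, target_prop = EDGE_SCHEMATA[entity["schema"]]
            for source in props.get(source_prop, []):
                for target in props.get(target_prop, []):
                    edges.add((source, entity["schema"], target))
            continue
        nodes.add(entity["id"])
        # Reified values become nodes shared by all entities with that value.
        for prop in reify:
            for value in props.get(prop, []):
                value_node = f"{prop}:{value}"
                nodes.add(value_node)
                edges.add((entity["id"], prop, value_node))
    return nodes, edges


# Hypothetical sample data: two companies sharing an address, one owner.
graph_sample = [
    {"id": "p1", "schema": "Person", "properties": {"name": ["Frank Smith"]}},
    {"id": "c1", "schema": "Company", "properties": {"address": ["40 Wall Street"]}},
    {"id": "c2", "schema": "Company", "properties": {"address": ["40 Wall Street"]}},
    {"id": "o1", "schema": "Ownership", "properties": {"owner": ["p1"], "asset": ["c1"]}},
]
nodes, edges = to_graph(graph_sample, reify=("address",))
print(len(nodes), len(edges))
```

In the sample, both companies end up connected through the shared address node, which is the effect reification is meant to achieve.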
117+
118+
### Cypher commands for Neo4J
119+
120+
[Neo4J](https://neo4j.com/) is a popular open source graph database that can be queried and edited [using the Cypher language](https://neo4j.com/docs/cypher-refcard/current/). It can be used as a database backend or queried directly to perform advanced analysis, e.g. to find all paths between two entities.
121+
122+
The example below uses Neo4J's `cypher-shell` command to load the US sanctions list into a local instance of the database:
123+
124+
```bash
125+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
126+
cat us_ofac.ijson | ftm export-cypher | cypher-shell -u user -p password
127+
```
128+
129+
<Image
130+
src="/assets/pages/docs/cli/export-cypher.png"
131+
alt="Screenshot of FtM entities imported to a Neo4J instance."
132+
density={2}
133+
/>
134+
135+
By default, this will only make explicit edges based on entity to entity relationships. If you want to reify specific property types, use the `-e` option:
136+
137+
```bash
138+
cat us_ofac.ijson | ftm export-cypher -e name -e iban -e entity -e address
139+
```
140+
141+
When working with file-based datasets, you may want to delete folder hierarchies from the imported data in Neo4J to avoid file co-location biasing path and density analyses:
142+
143+
```
144+
// Delete folder hierarchies:
145+
MATCH ()-[r:ANCESTORS]-() DELETE r;
146+
MATCH ()-[r:PARENT]-() DELETE r;
147+
// Delete entities representing individual pages:
148+
MATCH (n:Page) DETACH DELETE n;
149+
// Delete names, emails or addresses used only once:
150+
MATCH (n:name) WHERE size((n)--()) <= 1 DETACH DELETE (n);
151+
MATCH (n:email) WHERE size((n)--()) <= 1 DETACH DELETE (n);
152+
MATCH (n:address) WHERE size((n)--()) <= 1 DETACH DELETE (n);
153+
// ... for all reified value types ...
154+
```
155+
156+
At any time, you can flush the entire Neo4J database and start from scratch:
157+
158+
```
159+
MATCH (n) DETACH DELETE n;
160+
```
161+
162+
#### Bulk loading data
163+
164+
Another option for loading data to Neo4J is to export a set of entities into CSV files and then use the `neo4j-admin import` command to load them into an empty database. This requires shutting down the Neo4J server and then running a command that will write the new database.
165+
166+
In order to generate data in CSV format suitable for Neo4J import, use the following command:
167+
168+
```bash
169+
cat us_ofac.ijson | ftm export-neo4j-bulk -o folder_name -e iban -e entity -e address
170+
```
171+
172+
This will generate a set of CSV files in a folder, and include a shell script file that describes the `neo4j-admin` import command that should be used to load the data into a graph store.
173+
174+
### GEXF for Gephi/Sigma.js
175+
176+
[GEXF](https://gephi.org/gexf/format/) (Graph Exchange XML Format) is a file format used by the network analysis software [Gephi](https://gephi.org/) and other tools developed in the periphery of the [Media Lab at Sciences Po](http://tools.medialab.sciences-po.fr/). Gephi is particularly suited to do quantitative analysis of graphs with tens of thousands of nodes. It can calculate network metrics like centrality or PageRank, or generate complex visual layouts.
177+
178+
The command line works analogously to the Neo4J export, also accepting the `-e` flag for property types that should be turned into nodes:
179+
180+
```bash
181+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
182+
cat us_ofac.ijson | ftm validate | ftm export-gexf -e iban -o ofac.gexf
183+
```
184+
185+
<Image
186+
src="/assets/pages/docs/cli/export-gephi.png"
187+
alt="Screenshot of Gephi. A small trove of emails has been visualized as a network. The entity schema type has been used to color nodes, while the size is based on the amount of inbound links (i.e. In-Degree)."
188+
density={2}
189+
/>
190+
191+
## Exporting entities to RDF/Linked Data
192+
193+
Entity streams of FollowTheMoney data can also be exported to linked data in the `NTriples` format.
194+
195+
```bash
196+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
197+
cat us_ofac.ijson | ftm validate | ftm export-rdf
198+
```
199+
200+
It is unclear to the author why this functionality exists; it was just really easy to implement. For those developers who enjoy working with RDF, it might be worthwhile to point out that the underlying ontology (FollowTheMoney) is also regularly published in [RDF/XML](https://followthemoney.tech/ns/ftm.xml) format.
201+
202+
By default, the RDF exporter tries to map each entity property to a fully-qualified RDF predicate. Schemas include some mappings to FOAF and similar ontologies.
203+
204+
## Importing Open Contracting data
205+
206+
The [Open Contracting Data Standard](https://standard.open-contracting.org/latest/en/) (OCDS) is commonly serialised as a series of JSON objects. `ftm` includes a function to transform a stream of OCDS objects into `Contract` and `ContractAward` entities. This was developed in particular to import data from the DIGIWHIST [OpenTender.eu](https://opentender.eu/all/download) site, so other implementations of OCDS may require extending the importer to accommodate other formats.
207+
208+
Here's how you can convert all Cyprus government procurement data to FollowTheMoney objects:
209+
210+
```bash
211+
curl -o CY_ocds_data.json.tar.gz https://opentender.eu/data/files/CY_ocds_data.json.tar.gz
212+
tar xvfz CY_ocds_data.json.tar.gz
213+
cat CY_ocds_data.json | ftm import-ocds | ftm aggregate >cy_contracts.ijson
214+
```
215+
216+
Depending on how large the OCDS dataset is, you may want to use `followthemoney-store` instead of `ftm aggregate`.
217+
218+
## Aggregating entities using ftm-store
219+
220+
While the method of streaming FollowTheMoney entities is very convenient, there are situations where not all information about an entity is known at the time at which it is generated. For example, think of a [mapping](/docs/mappings) that loads company names from one CSV file, while the corresponding addresses are in a second, separate CSV table. In such cases, it is easier to generate two entities with the same ID and to merge them later.
221+
222+
Merging such entity fragments requires sorting all the entities in the given dataset by their ID in order to aggregate their properties. For small datasets, this can be done in application memory using the `ftm aggregate` command.
223+
224+
Once the dataset size approaches the amount of available memory, however, sorting must be performed on disk. This is also true when entity fragments are generated on different nodes in a computing cluster.
225+
226+
For this purpose, `followthemoney-store` is available as a Python library and a command line tool. It can use any SQL database as a backend, with a local SQLite file set as a default. When using PostgreSQL as a database, `followthemoney-store` can use its built-in upsert functionality, making the backend more performant than others.
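To illustrate why a database helps here, the Python sketch below stores fragments in SQLite and lets the database do the sorting, so that only one entity at a time has to be held in memory while merging. The table layout and the `write_fragments` and `iterate_entities` helpers are hypothetical and do not reflect the actual schema or API of `followthemoney-store`; the example uses an in-memory database for brevity, whereas a real run would point `sqlite3.connect` at a file on disk.

```python
import itertools
import json
import sqlite3


def write_fragments(conn, fragments):
    """Append fragments to a simple (id, body) table."""
    conn.execute("CREATE TABLE IF NOT EXISTS fragments (id TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO fragments (id, body) VALUES (?, ?)",
        [(fragment["id"], json.dumps(fragment)) for fragment in fragments],
    )
    conn.commit()


def iterate_entities(conn):
    """Yield merged entities, relying on the database to sort by ID."""
    cursor = conn.execute("SELECT id, body FROM fragments ORDER BY id, rowid")
    for _, rows in itertools.groupby(cursor, key=lambda row: row[0]):
        merged = None
        for _, body in rows:
            fragment = json.loads(body)
            if merged is None:
                merged = fragment  # the first fragment is the starting point
                continue
            for prop, values in fragment["properties"].items():
                target = merged["properties"].setdefault(prop, [])
                for value in values:
                    if value not in target:
                        target.append(value)
        yield merged


# Hypothetical fragments arriving in arbitrary order.
conn = sqlite3.connect(":memory:")
write_fragments(conn, [
    {"id": "c1", "schema": "Company", "properties": {"name": ["Acme Ltd"]}},
    {"id": "c2", "schema": "Company", "properties": {"name": ["Beta LLC"]}},
    {"id": "c1", "schema": "Company", "properties": {"address": ["40 Wall Street"]}},
])
stored = list(iterate_entities(conn))
print([entity["id"] for entity in stored])
```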
227+
228+
To use `followthemoney-store` with SQLite, install it like this:
229+
230+
```bash
231+
pip install followthemoney-store
232+
```
233+
234+
For PostgreSQL support, use the following settings:
235+
236+
```bash
237+
pip install followthemoney-store[postgresql]
238+
export FTM_STORE_URI=postgresql://localhost/followthemoney
239+
```
240+
241+
Once installed, you can use the `ftm store` commands provided by `followthemoney-store` to write, iterate over and delete datasets:
242+
243+
```bash
244+
curl -o us_ofac.ijson https://storage.googleapis.com/occrp-data-exports/us_ofac/us_ofac.json
245+
cat us_ofac.ijson | ftm store write -d us_ofac
246+
ftm store iterate -d us_ofac | alephclient write-entities -f us_ofac
247+
ftm store delete -d us_ofac
248+
```
249+
250+
<Callout theme="danger">
251+
When aggregating entities with large fragments of text, a size limit applies. By default, no entity is allowed to grow larger than 50MB of raw text. Additional text fragments are discarded with a warning.
252+
</Callout>
