Data Formats
David Jurgens edited this page Jun 19, 2021 · 1 revision
Let’s say you have the documents you want to annotate in a CSV file. You can use the following Python script to turn that CSV file into the JSON file (narratives_for_annotation.json above) that potato will read in for annotation. The parts of the script you’ll need to edit (mostly file paths) are path_to_csv_file, path_to_potato, id, and text.
Some notes about the csv file you’re creating:
- Each row of your CSV file should contain a different document.
- One column should have a unique identifier for the document (e.g., the numbers 0 to n).
- Another column should have the document’s text that annotators will read.
- If you’re running this script on the server and your CSV file is on your laptop, you can use the scp command to move the CSV file to the server.
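Before running the conversion, it can help to confirm that the CSV file follows the notes above. The following is a minimal sketch, not part of potato: the function name check_documents and the sample data are hypothetical, and it uses Python's standard csv module rather than pandas so it runs without extra dependencies. It assumes the columns are named id and text, as in the script on this page.

```python
import csv
import io

def check_documents(csv_text, id_key="id", text_key="text"):
    """Return a list of problems found in the CSV text; an empty list means it looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    # both columns must exist before the rows can be checked
    problems = [f"missing column: {key}" for key in (id_key, text_key) if key not in fieldnames]
    if problems:
        return problems
    seen = set()
    for row_number, row in enumerate(reader, start=1):
        doc_id = row[id_key]
        if doc_id in seen:  # every document needs its own identifier
            problems.append(f"duplicate id {doc_id!r} in row {row_number}")
        seen.add(doc_id)
        if not (row[text_key] or "").strip():  # annotators need something to read
            problems.append(f"empty text in row {row_number}")
    return problems

# hypothetical sample: the third row reuses id 1 and has no text
sample = "id,text\n0,First document.\n1,Second document.\n1,\n"
print(check_documents(sample))
```

To check a real file, read its contents with open(...).read() and pass the resulting string in. The conversion script itself follows.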
import json
import pandas as pd

# read in the csv file containing documents
## each row corresponds to a different document to annotate
## the spreadsheet has at least two columns: one for the document's unique identifier, one for the text of the document
raw_data = pd.read_csv("path_to_csv_file").fillna('')
id_key = "id"  # name of the column in the spreadsheet that contains the unique identifier
text_key = "text"  # name of the column in the spreadsheet that contains the text

# path of the json file to create (any existing file at this path is overwritten)
data_file = "/path_to_potato/potato-master/potato/data/narratives_for_annotation.json"

# go through each row of your spreadsheet and write its id/text to the json file, one json object per line
with open(data_file, 'w') as outfile:
    for i in range(raw_data.shape[0]):
        item = {"id": str(raw_data[id_key].iloc[i]),  # id should be a string
                "text": raw_data[text_key].iloc[i],
                "annotations": []}
        json.dump(item, outfile)
        outfile.write('\n')
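The script above writes one JSON object per line, so the output can be sanity-checked by reading it back line by line. The following is a minimal sketch under that assumption. The helper load_items and the demo file are hypothetical and not part of potato. The demo writes a small throwaway file in the same layout so the example runs on its own.

```python
import json
import os
import tempfile

def load_items(path):
    """Read a file with one JSON object per line and return the objects as a list."""
    items = []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            if line.strip():  # skip blank lines
                items.append(json.loads(line))
    return items

# write a throwaway demo file in the same one-object-per-line layout
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "w", encoding="utf-8") as outfile:
    for doc_id, text in [("0", "First document."), ("1", "Second document.")]:
        json.dump({"id": doc_id, "text": text, "annotations": []}, outfile)
        outfile.write("\n")

# read it back and confirm every item has a distinct id
items = load_items(demo_path)
ids = [item["id"] for item in items]
print(len(items), "items; ids unique:", len(ids) == len(set(ids)))
```

To check your own output, point load_items at the path stored in data_file instead of the demo file.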