repository structure in discussion #2
Comments
Recent commits (up to c410d9e) pushed some "sample" files in accordance with the proposed structure.
Initially, we thought we could automate the invocation of the
To make it easy to "anchor" the subdirectory-level commit histories, @marcverhagen suggested using a single plain text file (the proposed name is
Revisiting this after a year, I'd like to suggest some changes to our previous decisions:
I want to hear what people think (esp. @kelleyl). Let us know.
One more thing to consider is the tie between this repo and the evaluation repo. The repo not only serves as a public data repository but should also serve as a source of gold data for in-house evaluators. That said, evaluators need to know
The first is easy: it is a GitHub URL of a subdirectory under this repository. But I'd like to pass the second piece of information along with the first. So, to encode the second piece of information with the first, we can
Some thoughts on this, trying to integrate the above. The top level has just annotations and golds. The annotations are organized around projects and the golds around annotation types.
Projects have names that should be unique (in fact, I would like the first part of the name, without the themes, to be unique) and optional themes, but no start date. If we want that recorded, we put it as a metadata property inside the project. Each project consists of: metadata, a readme file, a process.py script, a guidelines directory and a list of batches.
The metadata specify what annotation types are included in this project and anything else that seems relevant. Guidelines live at the project level, and each batch can refer to a specific version of the guidelines. The example has versioned guidelines explicitly in the directory tree; alternatively, batches could refer to a git commit. Large changes in guidelines, especially if new categories are introduced, should not be allowed in a project. A new project should be created instead.
Batches have their own metadata and readme file (division of labor between those two TBD), a sources file to specify the source files, and a directory with annotations. Batch names include a starting date and something useful, like "nehshour-2001-2001-25". Metadata includes annotator names and characterizations, guideline version, dates, and anything we deem relevant. The readme could be a file that is presented to users browsing annotations, but that could also be in the metadata. Names of annotation files should reflect the guids if possible; if that is the case, then no sources.txt file is needed. If there is only one annotation file for a batch, then sources.txt is certainly needed.
The golds are organized around annotation types (which we need to think about because that could open a can of worms), but within those types they mirror projects and batches. Annotations live in a directory inside each batch; these are NOT the same as the annotations inside the batches of projects in the annotations top level. Metadata, readme and sources could just be copied from the projects, possibly by process.py.
Annotations and everything else in these directories are generated by the process.py files from the projects. The files metadata.yml, readme.txt and sources.txt are probably just copies, in which case we may not even want to include them because we can trace those down from the name of the project and the name of the batch.
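To make the project and batch description above a bit more concrete, here is a minimal sketch of a layout check. It is only an illustration: the directory tree from the comment is not reproduced here, so the `batches/` and `annotations/` directory names and the `check_project` helper are assumptions, while `metadata.yml`, `readme.txt`, `sources.txt` and `process.py` are the file names mentioned in the thread.

```python
from pathlib import Path
from typing import List

# File names taken from the discussion; their exact placement is an assumption.
PROJECT_FILES = ("metadata.yml", "readme.txt", "process.py")
BATCH_FILES = ("metadata.yml", "readme.txt")


def check_project(project_dir: Path) -> List[str]:
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    for name in PROJECT_FILES:
        if not (project_dir / name).is_file():
            problems.append(f"missing project file: {name}")
    if not (project_dir / "guidelines").is_dir():
        problems.append("missing guidelines directory")

    batches_dir = project_dir / "batches"  # hypothetical location of the batches
    batches = []
    if batches_dir.is_dir():
        batches = sorted(p for p in batches_dir.iterdir() if p.is_dir())
    if not batches:
        problems.append("no batches found")

    for batch in batches:
        for name in BATCH_FILES:
            if not (batch / name).is_file():
                problems.append(f"{batch.name}: missing {name}")
        ann_dir = batch / "annotations"
        if not ann_dir.is_dir():
            problems.append(f"{batch.name}: missing annotations directory")
            continue
        n_files = sum(1 for p in ann_dir.iterdir() if p.is_file())
        # With a single annotation file, the sources cannot be read off the
        # file names, so sources.txt is required in that case.
        if n_files <= 1 and not (batch / "sources.txt").is_file():
            problems.append(f"{batch.name}: sources.txt required")
    return problems
```

Whether file names actually reflect the guids is not verified here; the sketch only encodes the rule that a batch with a single annotation file needs a sources.txt.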
Yes, there are issues with the above. It is rather unclear what to do with a project that mixes a bunch of annotations. In the past we have used an example where annotations in a project could be timeframes for a bunch of things like credits, bars and slate. With those we assumed they could be distributed over topical areas in the golds directory, and that seems intuitive enough. But what about more complicated cases, for example full slate annotation, which has or potentially has:
These are conceptually linked and it is not obvious how we would distribute them over subdirectories of golds. Do we want to put entities somewhere in the
It is worthwhile to consider Keigh's suggestion to have the projects be the main organizing principle.
If we use projects as the top-level organizing principle, the repository could look like:
This uses a projects directory so that we have space to put something else at the top level without it being drowned out by all the projects. The golds are created from each batch and are local to the batch. We may still want some substructure in the golds subdir, and we probably need something about this in the metadata. I do have some difficulty seeing how to map pipeline evaluation to projects, but that would be an issue with the previous proposal as well.
I'm actually against
because the complex, nested structure of golds directories will make it hard for the evaluation invoker to collect the information it needs to start an evaluation pipeline. In essence, the invoker needs to know the locations of the gold files and the GUIDs of the source assets (along with the evaluation script and pipeline config).

    # part of evaluation invoker (pseudocode: run_pipeline and Html are not defined here)
    from pathlib import Path
    from typing import List

    def run_eval(evaluator, pipeline_config, guids: List[str], golds: Path) -> Html:
        preds = run_pipeline(pipeline_config, guids)
        return evaluator(golds, preds)

If we want to perform the evaluation on a batch-by-batch basis, evaluators still only need the same information, but per batch.

    # part of evaluation invoker
    def run_eval(evaluator, pipeline_config, guids_by_batches: List[List[str]], golds_by_batches: List[Path]) -> Html:
        preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
        return evaluator(golds_by_batches, preds_by_batches)

We discussed naming gold files using the AAPB-GUIDs, so in that case the invoker doesn't even need to be given the GUIDs.

    # part of evaluation invoker
    def run_eval(evaluator, pipeline_config, golds_by_batches: List[Path]) -> Html:
        guids_by_batches = []
        for batch in golds_by_batches:
            guids_in_batch = []
            for gold_fname in batch.glob("*"):
                guids_in_batch.append(infer_guid(gold_fname))
            guids_by_batches.append(guids_in_batch)
        ...
        preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
        return evaluator(golds_by_batches, preds_by_batches)

With addition of the

    # part of evaluation invoker
    def retrieve_golds(gold_batch_url):
        for file_url in parse_html_and_find_href_file_objs(gold_batch_url):
            download(file_url, local_gold_tmpdir)
        return local_gold_tmpdir

    def retrieve_all_gold_batches(gold_url):
        # iterate over the batch URLs found under gold_url
        return [retrieve_golds(batch_url) for batch_url in parse_html_and_find_href_file_objs(gold_url)]

    def run_eval(evaluator, pipeline_config, gold_url: str) -> Html:
        golds_by_batches = retrieve_all_gold_batches(gold_url)
        guids_by_batches = do_filename_magic(golds_by_batches)
        preds_by_batches = [run_pipeline(pipeline_config, guids) for guids in guids_by_batches]
        return evaluator(golds_by_batches, preds_by_batches)

In this case, I don't see much value in adding
Next, regarding
All that being said, my proposal is something like this:
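The `infer_guid` and `do_filename_magic` helpers are only named in the pseudocode above, not defined. A minimal sketch of what they might look like follows, assuming gold file names start with the GUID and that GUIDs have the shape `cpb-aacip-<part>-<part>`. That pattern, the `GUID_PATTERN` constant and the `guids_by_batch` name (standing in for `do_filename_magic`) are assumptions made for illustration, not a convention settled in this thread.

```python
import re
from pathlib import Path
from typing import List

# Assumed GUID shape, e.g. "cpb-aacip-000-abc123"; adjust to the real convention.
GUID_PATTERN = re.compile(r"^cpb-aacip-[0-9a-z]+-[0-9a-z]+")


def infer_guid(gold_fname: Path) -> str:
    """Extract the GUID prefix from a gold file name."""
    match = GUID_PATTERN.match(gold_fname.name)
    if match is None:
        raise ValueError(f"no GUID found in file name: {gold_fname.name}")
    return match.group(0)


def guids_by_batch(golds_by_batches: List[Path]) -> List[List[str]]:
    """Collect the sorted, de-duplicated GUIDs found in each batch directory."""
    return [
        sorted({infer_guid(p) for p in batch.iterdir() if p.is_file()})
        for batch in golds_by_batches
    ]
```

De-duplicating matters if a batch holds several gold files per source, since the pipeline then only needs to run once per GUID.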
A few things written down, partially after a short in-person discussion this afternoon.
batches directory
Yes, I am more than okay with that, but forgot to mention it in my previous comment. We can refer to a batch in
Perhaps some subdivision inside batches would be nice, using the program name or some other organizing principle. I think in the long run we might be looking at hundreds or thousands of batches and I don't like directories with too many files. Of course, we can solve that problem when we run into it rather than now, or consider it not to be a problem.
substructure in golds directory
Most likely not needed, and having it does give us the extra task of naming those directories. But it is possible that golds in a project may be of different kinds (.ann versus .mmif versus something else; at least, we do not disallow that), so within a batch we could have multiple files for each source. By the way, I don't quite see how that structure is such a problem for the evaluator. I would not call one layer of subdirectories very complex, and what needs to be done is simply a matter of finding the files with the specified guids in the golds directory, which is no more than a few lines of code.
One question on the code example

    def run_eval(evaluator, pipeline_config, guids: List[str], golds: Path) -> Html:
        preds = run_pipeline(pipeline_config, guids)
        return evaluator(golds, preds)

Why is the
Agreed, and I don't think anyone was suggesting that if golds can be traced to the batches, which they can in the last two proposals.
golds inside of batches
I see the point, mostly from a flexibility point of view. I now think golds should not be inside of batches, nor should batches be inside of golds, which is in line with the directory tree in the previous comment.
golds-themeX and golds-themeY
I think this is dangerous. In a previous commit the point was made that the themes were there for the joy of people creating project names, allowing you to stash information in the name, but that in no way should there be a registry of themes, nor should code ever be required to use the themes to find stuff. Similarly, the themes should not be used to dictate directory structure deeper down. It also assumes that somehow the theme names map to annotation categories, or some other concept that is useful for grouping gold annotations, which we cannot do because the themes are a free-for-all. Instead of
we have
But this does not allow us to a file named
retrieving golds 🦮
Maybe we should come up with a bunch of icons like this.
a final worry
We need to think about the process of how we select gold data given a pipeline we want to evaluate. The structure of this repo should support that.
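The point made in this comment, that finding the files for given guids under one layer of subdirectories takes only a few lines of code, can be illustrated with a small sketch. The `find_golds` helper is hypothetical, and it assumes that gold file names start with the guid.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_golds(golds_dir: Path, guids: Iterable[str]) -> Dict[str, List[Path]]:
    """Map each guid to the gold files whose names start with it.

    Looks at files directly in golds_dir and in one level of subdirectories.
    """
    found: Dict[str, List[Path]] = {guid: [] for guid in guids}
    candidates = [
        p for pattern in ("*", "*/*") for p in golds_dir.glob(pattern) if p.is_file()
    ]
    for path in sorted(candidates):
        for guid in found:
            # Plain prefix match: a guid that is a prefix of another guid
            # would also pick up the other guid's files.
            if path.name.startswith(guid):
                found[guid].append(path)
    return found
```

A guid with no gold files maps to an empty list, which lets the caller decide whether that is an error.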
I wasn't implying we maintain a systematic registry of "themes" in the above. When I was writing the

    projectX-thmX-thmY/annotations/batchX/two-different-annotations-done-in-single-pass-and-saved-in-single-file.json
    projectX-thmX-thmY/annotations/batchY/two-different-annotations-done-in-single-pass-and-saved-in-single-file.json
    projectX-thmX-thmY/process.py

Given these files (after an "upload" from an annotator), the next thing the annotation manager does will be

    $ python projectX-thmX-thmY/process.py

And it can generate (as proposed above)

    projectX-thmX-thmY/golds-thmX/batchX/cpb-xxxxx-*.ann
    projectX-thmX-thmY/golds-thmX/batchY/cpb-xxxxx-*.ann
    projectX-thmX-thmY/golds-thmY/batchX/cpb-xxxxx-*.json
    projectX-thmX-thmY/golds-thmY/batchY/cpb-xxxxx-*.json

or

    projectX-thmX-thmY/golds-just-a-name/batch{X,Y}/cpb-xxxxx-*.ann
    projectX-thmX-thmY/golds-just-another-name/batch{X,Y}/cpb-xxxxx-*.json

or even simpler

    projectX-thmX-thmY/golds-1/batch{X,Y}/cpb-xxxxx-*.ann
    projectX-thmX-thmY/golds-2/batch{X,Y}/cpb-xxxxx-*.json

the
In fact, it can be a completely different structure, as long as we have all batches directly under one directory that can be passed to the

    projectX-thmX-thmY/golds/thmX/batch{X,Y}/cpb-xxxxx-*.ann
    projectX-thmX-thmY/golds/thmY/batch{X,Y}/cpb-xxxxx-*.json
    # then a gold_url value will be `projectX-thmX-thmY/golds/thmX`
    # or
    projectX/X/golds-batch{X,Y}/cpb-xxxxx-*.ann
    projectX/Y/golds-batch{X,Y}/cpb-xxxxx-*.json
    # then a gold_url value will be `projectX/X`
    # but this won't work
    projectX/golds-X-batchX/cpb-xxxxx-*.ann
    projectX/golds-X-batchY/cpb-xxxxx-*.ann
    projectX/golds-Y-batchX/cpb-xxxxx-*.json
    projectX/golds-Y-batchY/cpb-xxxxx-*.json
    # gold_url can't be `projectX` (it includes two different kinds of data) nor `projectX/golds-X-batchX` (it misses the other batches in the same theme)
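The constraint above (all batches sit directly under one directory, and that directory holds a single kind of gold data) could be checked mechanically. The sketch below is only an illustration: `gold_url_problems` is a hypothetical helper, it works on a local checkout rather than on a URL, and it uses the file extension as a stand-in for "kind of data", which is an assumption.

```python
from pathlib import Path
from typing import List


def gold_url_problems(gold_dir: Path) -> List[str]:
    """Return reasons why gold_dir is unsuitable as a gold_url target; empty if none."""
    problems = []
    children = sorted(gold_dir.iterdir())
    batches = [p for p in children if p.is_dir()]
    if not batches:
        problems.append("no batch directories found")
    # Gold files are expected to live inside batch directories only.
    for stray in (p for p in children if p.is_file()):
        problems.append(f"file outside of a batch: {stray.name}")
    # Extension is used as a rough proxy for the kind of gold data.
    suffixes = {f.suffix for b in batches for f in b.iterdir() if f.is_file()}
    if len(suffixes) > 1:
        problems.append(f"mixed gold formats: {sorted(suffixes)}")
    return problems
```

Applied to the last layout above, `projectX` would be flagged for mixing `.ann` and `.json` files. `projectX/golds-X-batchX` would be flagged because it contains files but no batch directories; the check cannot tell that sibling batches are missing, only that the directory is not a container of batches.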
Agree. Right now, I imagine app developers browse the annotation repository - by carefully reading all
Once they find a dataset (suppose

    run_eval(evaluator=eval_name,
             pipeline_config=conf_name,
             gold_url="https://aapb-as-dataset.clams.ai/projectX-thmX-thmY/golds-just-another-name")
    # Note that evaluators are not a part of this repository, but should come from the evaluation repo
    # Also note that pipeline_config is not a part of this repository, but should inherently come from the app itself, and MUST include the app
    # Namely, dependency-wise, the evaluators should be completely decoupled from the app and should generalize to evaluate different apps/workflows that output the same types of annotation.

(I think the "button" should be a GitHub Actions workflow.)
Yeah, that pseudo-code is from before the golden retriever was introduced. Now going back to the example with the concept of 🦮,
And by using GUIDs as file names, 🦮 is now free from the responsibility of obtaining the GUIDs (bound under batch names); its only responsibility is to download the files using only a single string argument, manually "selected" by the app developer.
A lot is on the
What I find dangerous is to use theme names to determine the directory structure, so I very much prefer the 2nd and 3rd options underneath "And it can generate (as proposed above)". For each project, the developer determines how to distribute annotations under the golds section as long as it follows a few simple rules on directory structure. From last week's discussion I think we agree that a fully automatic mapping from pipeline to projects and golds is not realistic, but once the evaluator (the person) has determined what projects are relevant, the evaluator (the program) should have an easy time finding the relevant gold data.
An example structure we discussed last week:
Notes on this example:
`some_project_mmddyy` and `other_proj_mmddyy` are arbitrary project identifiers.
`some_project_mmddyy` is done following the `a.guidelines/slate.md` guidelines and annotates slates and credits using the VIA environment.
`other_proj_mmddyy` is done following the `a.guidelines/ner.md` guidelines and annotates named entities using the brat-nlp environment.
`z.golds/ner/other_proj_mmddyy/*.conllu.tsv` files are derived from `b.uploads/other_proj_mmddyy/annotations/dump_anntatorY.ann` and automatically generated by `b.uploads/other_proj_mmddyy/process.py`.
`z.golds/*/some_project_mmddyy/*.csv`
The `a.`, `b.`, and `z.` prefixes were for sorting in the `tree` command and will not actually be used.
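As a small illustration of the naming scheme in these notes, the sketch below computes where a project's uploaded annotations and derived golds would live once the sorting prefixes are dropped. The helper names are hypothetical, and the directory names `uploads` and `golds` are an assumption about what remains after removing the `b.` and `z.` prefixes.

```python
from pathlib import Path


def uploads_annotations_dir(repo_root: Path, project: str) -> Path:
    """Directory holding the raw annotation dumps uploaded for a project."""
    return repo_root / "uploads" / project / "annotations"


def golds_dir(repo_root: Path, annotation_type: str, project: str) -> Path:
    """Directory where process.py would write golds of one annotation type."""
    return repo_root / "golds" / annotation_type / project
```

For example, `golds_dir(Path("."), "ner", "other_proj_mmddyy")` points at the directory that would hold the `*.conllu.tsv` files from the notes.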