All notable changes to this project will be documented in this file.
- conversations summary extraction pipeline
- docs and tests for init command
- docs and tests for automated-import command
- option '--by' for periodic sampling along with doc and tests
- does not fail anymore when annotations are missing their merged_from set
- audio conversions in basic and standard now always include conversion filename and edit the file properly
- added the automated-import command to more easily import automated annotations covering the entire recording
- added the init command: initiate a new dataset
- migrated the project to a pyproject.toml implementation
- eaf_builder no longer replicates the subtree structure of recordings, and individual files are no longer placed in individual subfolders
- child-project overview now correctly displays the amount of audio
- audio duration in overview now works!
- add the derived annotation pipeline
- audio conversion now has a 'standard' conversion pipeline that converts files to mono-channel 16 kHz pcm_s16le (no need to convert channels and then sampling rate, and no need to know the options)
- add the simple_CTC metric to the list of available metrics
- the output of the CLI in the terminal is now handled by the logger module and not by print statements
- validating a dataset now results in warnings for broken symlinks and no errors anymore (#425)
- validation of recordings that exist but for which mediainfo can't read the sample rate no longer fails but outputs a warning.
- recording_device_type in recordings.csv now accepts izyrec
- rework of the list of available metrics functions: now has an absolute value and per_hour decorator, and a peak_ option
- zooniverse chunks upload was failing if the dataset column was missing in the csv
- eaf_builder with speaker_id NA no longer fails (#438)
- files in rttm whose names contain only digits now work correctly with the filter (#457)
- periodic sampler having extra NA rows when using explode
- validation of the annotation index checks for annotation period outside of audio duration
- ignore_discarded is now the default behaviour of childproject projects, robustness was added to the discard column
- annotation index validation and annotation importation now check whether range_offset is greater than the duration and output an error when it is the case
- projects now check the uniqueness of the experiment column in the children and recordings csv files; read() fails when it is not unique
- zooniverse uploads: uploads extra metadata to zooniverse (name of the audio clip, dataset it belongs to)
- zooniverse uploads: the subject_set id is stored in the chunk csv file, as the subject_set name (display name) is susceptible to change
- zooniverse uploads: the upload now handles SIGINT and SIGTERM signals to save the progress of the upload to the csv before exiting (useful when a job needs to be interrupted)
- allow once again get_within_time_range to take str arguments as times
- add arguments to choose the format of compute_ages project method
- discard column in recordings.csv and children.csv now works properly
- metrics pipeline now checks the converted name for uniqueness even if a specific name was given
- rename set also renames the merged_from column
- rename set accepts subsets location without failing
- Windows automated tests (some functions were edited to be Windows compatible). TO REMEMBER: the int type defaults to int32 on Windows instead of int64, which can turn large integers into negative values
- Check calculated ages during corpus validation
- pandas version restricted to avoid errors from future releases: 1.1.0 (assert_frame_equal check_less_precise) to 1.5.0 (last checked version)
- no usage of sox command anymore, remove sox dependency
- merging annotations now sets the format to 'NA' instead of a blank value.
- replace exit() with raise ValueError() to comply with Exception propagation (metrics pipeline)
- fixed ignore-errors in zooniverse upload_chunks
- fixed calculate_shift to correctly reshape to single channel
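
The int32 note in the Windows entry above can be illustrated with a small standard-library sketch. It is not taken from the project code: `to_int32` and `timestamp_ms` are hypothetical names, and `ctypes.c_int32` is used here only as a stand-in to reproduce what happens when a large integer is stored in a 32-bit type.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, largest value a signed 32-bit integer can hold


def to_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (wraps on overflow)."""
    # ctypes performs no overflow checking, so out-of-range values wrap around.
    return ctypes.c_int32(value).value


# Hypothetical timestamp in milliseconds: fits in int64 but not in int32.
timestamp_ms = 3_000_000_000
wrapped = to_int32(timestamp_ms)  # negative after wrapping
```

Requesting a 64-bit dtype explicitly, rather than relying on the platform default, is one way to avoid this wrap-around.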
0.0.7 - 2022-09-14
- missing column merged_from in annotations.csv does not fail anymore
0.0.6 - 2022-09-13
- Start times can include seconds (e.g. 12:34:59) while still accepting the old format (This change will allow other columns to accept multiple formats easily).
- child-project --version command
- merge_sets in AnnotationManager method now accepts arguments [full_set_merge, skip_existing, recording_filter] to carry out partial merges
- child-project metrics added the --segments command to extract metrics from a dataFrame of segments
- metrics <voc_speaker> <lena_CTC> and <lena_CVC>
- metrics pipeline's options --from --to require a HH:MM:SS format now.
- merge-annotations command fails when the output_set already exists or if the required sets don't exist
- eafbuilder attributes a default time-alignable ling-type to created tiers to avoid random attribution that can lead to wrong behaviour and crashes
- 'imported_at' column in annotations.csv did not have a new correct format (in a set)
- metrics avg_cry_... avg_can_... and avg_non_can_... were not calculated correctly
- metrics lp_n lp_dur use lena columns in priority
- metrics lena-CTC and lena-CVC are added correctly and added to the output of the lena pipeline
- praat-parselmouth is now in the setup file so the dependency gets installed automatically by conda
- child-project process --split: --split option dropped as there is no further need of reducing long audios (>15 hours)
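
As an illustration of the start-time entry above (accepting seconds while still reading the old format), here is a minimal sketch of multi-format parsing. It is not the project's implementation: `parse_start_time` and `ACCEPTED_FORMATS` are hypothetical names, and the old format is assumed to be HH:MM.

```python
from datetime import datetime, time

# Accepted layouts, tried in order: the newer one with seconds, then the
# older one without (assumed here to be HH:MM).
ACCEPTED_FORMATS = ("%H:%M:%S", "%H:%M")


def parse_start_time(value: str) -> time:
    """Parse a start time written as HH:MM:SS or HH:MM (hypothetical helper)."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value, fmt).time()
        except ValueError:
            # This layout did not match; try the next one.
            continue
    raise ValueError(f"unsupported start time format: {value!r}")
```

Keeping the accepted layouts in one tuple means another column could support several formats by supplying its own tuple.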
0.0.5 - 2022-07-25
- --spectrogram option in the zooniverse extract-chunks pipeline to generate an image of a spectrogram that will help citizen-scientists with the classification on zooniverse.
- child-project compare-recordings command added to allow users to print a divergence score. This will help identify audio files that are just duplicates of others (and possibly have different codecs/sampling rate/number of channels)
- --import-speech-from command added to the eaf builder to integrate previous annotations to the eaf file (e.g. VTC segments) to facilitate annotation process
- metrics pipeline, reworked to be more flexible. Performance hit with it.
- Old pipelines still exist
- new usage of --period option on every pipeline and for every metric.
- Usage of a csv file to specify the list of metrics wanted
- Ease of adding new metrics to the supported list
- Outputs a yml parameter file that can be reused to compute the same metrics and keep a trace of what was run.
- changes to standard annotation value (addressee, vcm_type etc)
- importation of empty file now correctly generates an empty converted file
- --period option correctly works with other units than recording_filename
- Support for python 3.6
0.0.4 - 2022-02-02
- Conversation sampler
- get_within_ranges function to retrieve all annotations that match the desired portions of the recordings
- --import-speech-from option for the EAF annotation builder to pre-fill annotations based on any previously imported set of annotations
- compute_ages function to compute the age of the subject child for each recording
- lena_speaker for the LENA its converter
- lena_speaker aggregated metrics for the LenaMetrics pipeline
- Improved AclewMetrics and LenaMetrics performance
- Improved error handling (dataframes sanity checks)
- More flexible high-volubility sampler
- Fixed pipelines crashes in presence of NA values in recording_filename
- RandomVocalizationSampler crash fix
0.0.3 - 2021-10-06
- Fixed exceptions thrown by child-project CLI
0.0.2 - 2021-09-29
- CSV importer to register pre-existing CSV annotations into the index without performing any conversion
- Enable Zooniverse pipeline's retrieve-classifications to match classifications to the original chunks metadata
- get_within_time_range method to clip a list of annotations within a given time-range
- get_segments_timestamps method to calculate the onset and offset timestamps of each segment from an annotation
- --from-time / --to-time option for metrics extraction
- Time-unit aggregated metrics, supporting custom time periods.
- optional --recordings option to apply the audio processors to specific recordings only
- allow child-project validate to check custom recordings profiles and/or annotation sets
- --ignore-errors switch for Zooniverse pipeline's upload-chunks
- enforce_dtypes option for ChildProject in order to enforce the dtype of certain metadata columns (e.g. session_id, child_id) to their expected dtype
- Fixed skip-existing argument of the basic audio processor
- Fixed a crash-bug that occurred while extracting metrics from recordings with no FEM/MAL/CHI/OCH segment
- Made pyannote-agreement an optional dependency
- Added dependency constraints to fix some installation issues.
0.0.1 - 2021-07-14
- First proper release of the package.