hakai-ctd-qc
hakai-ctd-qc is the main package used to quality-control (QC) the CTD datasets maintained by the Hakai Institute. Please refer to the test description manual for a full description of the different tests applied within this package.
The following commands start a Docker container, request all CTD casts that are awaiting QC from the default API, and process them in batches of chunk_size casts:
```shell
git clone [email protected]:HakaiInstitute/hakai-ctd-qc.git
cd hakai-ctd-qc
cp sample.env .env
docker-compose up
```
The package can be installed locally or run in a Docker container. In either case, first clone the repository locally and apply the appropriate configuration.

Clone the repository and create the Python environment:

```shell
git clone [email protected]:HakaiInstitute/hakai-ctd-qc.git
cd hakai-ctd-qc
pyenv install 3.11.2
pyenv local 3.11.2
pip install poetry
poetry install
cp sample.env .env
```
Copy the sample.env file as .env and replace the different values accordingly.
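As a sketch, a populated .env file might look like the following. The variable names come from the `[env=...]` annotations in the command-line help; the values shown are illustrative, not required settings:

```shell
# Illustrative .env values; variable names are taken from the CLI help output.
HAKAI_API_ROOT=https://goose.hakai.org/api
QC_PROCESSING_STAGES=8_binAvg,8_rbr_processed
CTD_CAST_CHUNKSIZE=100
UPDATE_SERVER_DATABASE=false
RUN_TEST_SUITE=false
SENTRY_MINIMUM_DATE=2023-01-01
```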
Once installed, the hakai_ctd_qc package can be run via the command line. See the help menu for a complete description of the different options:

```shell
python hakai_ctd_qc --help
```
```
Usage: hakai_ctd_qc [OPTIONS]

Options:
  --hakai_ids TEXT            Comma delimited list of hakai_ids to qc
  --processing-stages TEXT    Comma list of processing_stage profiles to
                              review  [env=QC_PROCESSING_STAGES]  [default:
                              8_binAvg,8_rbr_processed]
  --test-suite                Run Test suite  [env=RUN_TEST_SUITE]
  --api-root TEXT             Hakai API root to use  [env=HAKAI_API_ROOT]
                              [default: https://goose.hakai.org/api]
  --upload-flag               Update database flags
                              [env=UPDATE_SERVER_DATABASE]
  --chunksize INTEGER         Process profiles by chunk
                              [env=CTD_CAST_CHUNKSIZE]  [default: 100]
  --sentry-minimum-date TEXT  Minimum date to use to generate sentry warnings
                              [env=SENTRY_MINIMUM_DATE]
  --profile PATH              Run cProfile
  --help                      Show this message and exit.
```
Important

The API code base still exists but is not accessible in production deployments. These instructions are left here for reference only.

Run the following command:

```shell
poetry run python hakai_ctd_qc/api.py
```

Then open http://127.0.0.1:8000 in a browser.

With VS Code you can also run the Run API debug configuration, which helps debug the interface in real time.
Note

To protect the API from unexpected calls, you can set the accepted tokens as a comma-separated list. Any POST call to the API will then require a token field with an accepted value within the header of the request.
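Conceptually, the token check described above amounts to the following sketch. The function name and the way the header is read are assumptions for illustration, not the package's actual implementation:

```python
def is_authorized(headers: dict, accepted_tokens: str) -> bool:
    """Return True if the request headers carry one of the accepted tokens.

    accepted_tokens is a comma-separated list, as described above; all
    names here are illustrative, not the package's real API.
    """
    accepted = {t.strip() for t in accepted_tokens.split(",") if t.strip()}
    return headers.get("token") in accepted

# A request with a listed token passes; anything else is rejected.
print(is_authorized({"token": "abc"}, "abc,def"))  # True
print(is_authorized({"token": "xyz"}, "abc,def"))  # False
```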
The hakai_ctd_qc tool is deployed via a Docker container (see Dockerfile) and run from Windmill. On container start, the application requests all CTD casts that are awaiting QC and processes them in batches of chunk_size casts.
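The batching behaviour described above can be sketched as follows (a minimal illustration; the function name is an assumption, and the default of 100 mirrors the --chunksize default):

```python
def iter_chunks(casts: list, chunksize: int = 100):
    """Yield successive batches of at most `chunksize` casts.

    Illustrative sketch of the batching described above; the default of
    100 matches the --chunksize default in the CLI help.
    """
    for start in range(0, len(casts), chunksize):
        yield casts[start:start + chunksize]

# 250 pending casts split into batches of 100 -> sizes 100, 100, 50.
batches = list(iter_chunks(list(range(250)), chunksize=100))
print([len(b) for b in batches])  # [100, 100, 50]
```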
- development: https://windmill-dev-server.windmill.hakai.app/scripts/get/d35488d8aec4898b?workspace=data-pipelines -> QCs the hakaidev database
- main: https://windmill-dev-server.windmill.hakai.app/scripts/get/58d138bc6d80e3d4?workspace=data-pipelines -> QCs the hakai database
- A cron schedule is applied to this instance to QC the latest data submitted; see Windmill Schedules for the container run schedule.
- Testing: any changes to the package are tested via a GitHub workflow that QCs the hakai_id test suite.
- Docker build testing: the Docker container build is tested via a GitHub workflow.
- Changes to the main and development versions trigger image builds, which are in turn pulled into Windmill on the next scheduled run (prod deploy action, dev deploy action).
- Errors and monitoring: Sentry is used to monitor errors and cron jobs. Only the main deployment is required to run a cron job to make sure any newly submitted data is QCed. See the following links for any issues and cron issues encountered.
The different tests applied are defined within the respective configurations:
A subset of hakai_ids is used to test the qc tool and is maintained here
Manual flags can also be applied to any instrument-specific variable via the grey-list, which overwrites any automatically generated flags.
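Conceptually, the grey-list override described above behaves like the following sketch. The function, keys, and flag values are assumptions for illustration, not the package's actual API:

```python
def apply_grey_list(auto_flags: dict, grey_list: dict) -> dict:
    """Overlay manual grey-list flags on top of automatically generated ones.

    Both mappings are keyed here by (hakai_id, variable); grey-list entries
    win, matching the documented behaviour that manual flags overwrite
    automatic ones. All names and values are illustrative.
    """
    merged = dict(auto_flags)
    merged.update(grey_list)  # manual flags take precedence
    return merged

auto = {("CAST1", "temperature"): "PASS", ("CAST1", "salinity"): "PASS"}
manual = {("CAST1", "salinity"): "FAIL"}
print(apply_grey_list(auto, manual))
# {('CAST1', 'temperature'): 'PASS', ('CAST1', 'salinity'): 'FAIL'}
```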
To make sure the tests are working appropriately, a series of pytest tests is available. Some of the tests are specific to the Hakai QC tests themselves, others to the Hakai test suite. The test suite is available locally as a parquet file, or can be retrieved from the development or production database.
To run all the tests locally:

```shell
poetry run pytest .
```
To run all the tests with the production data (hecate) or the development data (goose), use the --test-suite-from option. Here's an example for goose:

```shell
poetry run pytest . --test-suite-from goose
```
To test the results on any of the databases without rerunning the QC on the data, you can use the --test-suite-qc False option:

```shell
poetry run pytest . --test-suite-from goose --test-suite-qc False -k test_source_expected_results
```