filter: re-implement with SQLite and add option to use with --engine
#854
Conversation
Codecov Report
@@ Coverage Diff @@
## master #854 +/- ##
===========================================
- Coverage 59.64% 38.75% -20.90%
===========================================
Files 53 49 -4
Lines 6317 6379 +62
Branches 1586 1608 +22
===========================================
- Hits 3768 2472 -1296
- Misses 2286 3802 +1516
+ Partials 263 105 -158
Continue to review full report at Codecov.
A few quick notes I had during our call just now:
Could you clarify what you mean by using interpolation vs. placeholders?
Good point! I did this for the ID and date columns in 54d6ddf...a83a17e, but it should at least be expanded to cover user input, such as in augur/augur/filter_support/db/sqlite.py, lines 242 to 244 at 1513f1e.
And agreed on blanket quote.
Yeah, I don't know where I got the false notion that commit isn't necessary with
Mostly because I imagined the complexity of writing it in SQL. However, date parsing now takes up a decent amount of time (second only to loading metadata), so it might be worth trying out a SQL implementation.
Placeholders are ubiquitous in SQL interfaces, with
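To make the distinction concrete, here is a minimal sqlite3 sketch (the table and data are invented for illustration): placeholders bind values at execution time, while interpolation splices them into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (strain TEXT, country TEXT)")
conn.execute("INSERT INTO metadata VALUES ('strainA', 'USA')")

country = "USA"

# Interpolation splices the value into the SQL text; a value like
# "USA' OR '1'='1" would change the meaning of the query.
unsafe_sql = f"SELECT strain FROM metadata WHERE country = '{country}'"

# A ? placeholder binds the value separately, so it is always treated
# as data, never as SQL syntax.
rows = conn.execute(
    "SELECT strain FROM metadata WHERE country = ?", (country,)
).fetchall()
print(rows)  # [('strainA',)]
```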
Ah, surrounding identifiers in double quotes is a start, but the identifier also needs any embedded double quotes escaped by doubling them up before interpolation.
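As a sketch, identifier quoting could look like the following; `quote_identifier` is a hypothetical helper, not the PR's actual function:

```python
import sqlite3

def quote_identifier(name: str) -> str:
    # Identifiers can't be bound as ? parameters, so they must be
    # interpolated, with embedded double quotes doubled per the SQL standard.
    return '"' + name.replace('"', '""') + '"'

conn = sqlite3.connect(":memory:")
col = 'weird "quoted" column'
conn.execute(f"CREATE TABLE t ({quote_identifier(col)} TEXT)")
conn.execute(f"INSERT INTO t ({quote_identifier(col)}) VALUES (?)", ("ok",))
row = conn.execute(f"SELECT {quote_identifier(col)} FROM t").fetchone()
print(row)  # ('ok',)
```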
👍 The default transaction handling of Python's sqlite3 module is a bit unusual: it's not auto-commit, which is good, but it is auto-begin for DML, which is a weird combo. IME, it is easier to reason about transactions by disabling that default with
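Presumably the setting in question is `isolation_level=None`, sqlite3's autocommit mode, which disables the auto-begin behavior. A sketch of explicit transaction control under that setting:

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode: no implicit
# BEGIN before DML; transactions are opened and closed explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("BEGIN")
conn.execute("INSERT INTO t VALUES (1)")
conn.execute("ROLLBACK")                   # explicit: the insert is discarded

conn.execute("INSERT INTO t VALUES (2)")   # autocommit: persisted immediately
print(conn.execute("SELECT x FROM t").fetchall())  # [(2,)]
```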
Yeah, the time the date table creation takes makes me think a lot of it is spent shuttling between SQLite (C) and Python and back for those functions. I agree
Force-pushing date parsing improvements (pre-rebase d3e466d...903d92e)
1513f1e to b477cda
Force-pushing date parsing improvements and force-include reporting (pre-rebase 903d92e...911e137)
b477cda to 3710df5
After some more experimenting and optimization of current code, I've decided to not to pursue native SQLite for date logic:
@victorlin Ok, sounds good. These are great candidates for caching, indeed. That example SQL is a little more tortuous than I was expecting; I had imagined using a date/time function or string handling function to extract the parts from an ISO date, e.g.:
Though I realize now it would be a little more complicated than that with date/time funcs, thanks to our need to handle forms like `2020-XX-XX`.
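A sketch of the two approaches against a made-up metadata table: `strftime` handles complete ISO dates but yields NULL for ambiguous forms, while plain `substr` still recovers the year as text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (strain TEXT, date TEXT)")
conn.executemany("INSERT INTO metadata VALUES (?, ?)",
                 [("a", "2020-03-15"), ("b", "2021-07-01"), ("c", "2020-XX-XX")])

# strftime only parses complete ISO dates; the ambiguous row yields NULL.
years_strftime = conn.execute(
    "SELECT strain, strftime('%Y', date) FROM metadata ORDER BY strain"
).fetchall()
print(years_strftime)  # [('a', '2020'), ('b', '2021'), ('c', None)]

# substr treats the date as text, so the year survives even for 2020-XX-XX.
years_substr = conn.execute(
    "SELECT strain, substr(date, 1, 4) FROM metadata ORDER BY strain"
).fetchall()
print(years_substr)  # [('a', '2020'), ('b', '2021'), ('c', '2020')]
```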
Force-pushing updated tests, SQL identifier sanitizing, connection handling improvements (pre-rebase 911e137...3eb8c98) |
I investigated some data loading options today:
Testing on my local machine, …

Due to the single-threaded and sequential nature of …

For proper auto-committing, I changed to use the default connection context manager rather than sharing one connection across calls.

There is potential to re-use the same metadata for cases like subsampling schemes in the ncov workflow, which prompts "concurrency", but for that I think it's easier to just create copies of the database file rather than sharing a single one:
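For reference, the default connection context manager wraps each block in a transaction, committing on success and rolling back on exception; a minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Used as a context manager, the connection commits on success...
with conn:
    conn.execute("INSERT INTO t VALUES (1)")

# ...and rolls back automatically if the block raises.
try:
    with conn:
        conn.execute("INSERT INTO t VALUES (2)")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(conn.execute("SELECT x FROM t").fetchall())  # [(1,)]
```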
The alternative process would require a new pre-compute step:

```sh
augur load metadata.tsv db1.sqlite3
cp db1.sqlite3 db2.sqlite3
# the following can be run in parallel
augur filter --metadata-db db1.sqlite3 --query 'country = "USA"' --output-strains usa.txt
augur filter --metadata-db db2.sqlite3 --query 'country != "USA"' --output-strains non-usa.txt
```

This would come in a later PR, but the point is there should be no need to worry about supporting concurrency in this PR.
Force-pushing additional tests, styling fixes, improved type hints and docstrings (pre-rebase 3eb8c98...a73aec3) |
Force-pushing rebase onto master (equivalent merge commit 4237047) |
22e485f to 85ece02
86e374d to 110af66
a600f97 to 4404c25
Instead of copying the database to avoid the two problems above,
This also simplifies the implementation of validate_arguments() to raise AugurErrors directly instead of returning a boolean to be translated to a proper error message by the caller.
These can be used by any engine.
This commit contains files supporting an independent implementation of augur filter using SQLite. It is a full refactor and splits the logic into several smaller files:

- New under filter_support/:
  - base.py: abstract class containing database-independent filtering/subsampling logic
  - sqlite.py: contains the working class, an implementation of base.py
- Add new io functions in io/db/.

TODO: address the following differences in other commits.

Breaking changes:
- Error when `--include-where`, `--exclude-where`, or `--query` column doesn't exist (partially fixes #754)

Other changes:
- (internal) less memory usage, more disk usage
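A toy sketch of that split: a database-independent base class driving the filtering steps, with a SQLite subclass supplying the storage-specific pieces. All class and method names here are illustrative, not the PR's actual API:

```python
import sqlite3
from abc import ABC, abstractmethod

class FilterBase(ABC):
    """Database-independent driver for the filtering steps (illustrative)."""
    def filter_strains(self, exclusions):
        for column, value in exclusions:
            self.apply_exclusion(column, value)
        return self.passing_strains()

    @abstractmethod
    def apply_exclusion(self, column, value): ...

    @abstractmethod
    def passing_strains(self): ...

class SQLiteFilter(FilterBase):
    """Engine-specific storage details live in the subclass."""
    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE metadata (strain TEXT, country TEXT)")
        self.conn.executemany("INSERT INTO metadata VALUES (?, ?)", rows)
        self.conn.execute("CREATE TABLE excluded (strain TEXT)")

    def apply_exclusion(self, column, value):
        # column is trusted here; real code would sanitize identifiers.
        self.conn.execute(
            f'INSERT INTO excluded SELECT strain FROM metadata WHERE "{column}" = ?',
            (value,))

    def passing_strains(self):
        return [row[0] for row in self.conn.execute(
            "SELECT strain FROM metadata "
            "WHERE strain NOT IN (SELECT strain FROM excluded)")]

f = SQLiteFilter([("a", "USA"), ("b", "UK")])
passing = f.filter_strains([("country", "UK")])
print(passing)  # ['a']
```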
- `--min-date`/`--max-date`: add support for ambiguity (e.g. 2020-XX-XX)
- `--max-date` with year only will work properly now
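One way to support ambiguity is to resolve an ambiguous date to its earliest and latest possible concrete dates before comparing against `--min-date`/`--max-date`. The helper below is a hypothetical illustration, not the PR's implementation:

```python
import calendar

def ambiguous_date_range(date: str):
    # Resolve an ambiguous ISO-style date (with XX parts) to its earliest
    # and latest possible concrete dates.
    year, month, day = date.split("-")
    min_month = 1 if month == "XX" else int(month)
    max_month = 12 if month == "XX" else int(month)
    min_day = 1 if day == "XX" else int(day)
    max_day = (calendar.monthrange(int(year), max_month)[1]
               if day == "XX" else int(day))
    return (f"{year}-{min_month:02d}-{min_day:02d}",
            f"{year}-{max_month:02d}-{max_day:02d}")

print(ambiguous_date_range("2020-XX-XX"))  # ('2020-01-01', '2020-12-31')
print(ambiguous_date_range("2020-02-XX"))  # ('2020-02-01', '2020-02-29')
```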
These tests are specific to the newly coined pandas engine, so the file deserves a new name.
- Translate original filter pytests to work with an in-memory database.
- Split filter tests into separate classes:
  - file loading
  - filtering
  - filter groupby

TODO: split this into a separate PR
- min/max date: check that silently passed invalid values (e.g. "?") get filtered out
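As an illustration of the in-memory database approach, a test along these lines could assert that rows with unparseable dates like "?" are silently filtered out (the table layout and GLOB-based check are invented for this sketch):

```python
import sqlite3

def make_test_db():
    # Build a throwaway in-memory database, as an in-memory test fixture would.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metadata (strain TEXT, date TEXT)")
    conn.executemany("INSERT INTO metadata VALUES (?, ?)",
                     [("a", "2020-01-01"), ("b", "?")])
    return conn

conn = make_test_db()
# Keep only rows whose date starts with a four-digit year; "?" drops out.
passing = [row[0] for row in conn.execute(
    "SELECT strain FROM metadata "
    "WHERE date GLOB '[0-9][0-9][0-9][0-9]*' ORDER BY strain")]
print(passing)  # ['a']
```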
- test_load_invalid_id_column
- test_load_custom_id_column
- test_load_custom_id_column_with_spaces
- test_load_priority_scores_extra_column
- test_load_priority_scores_missing_column
- test_load_sequences_subset_strains
- test_generate_sequence_index
e2b3039 to 0385409
input_group.add_argument('--engine', default="pandas", choices=["pandas", "sqlite"], help="Internal engine, determines syntax used for --query")
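For clarity, the flag's behavior with argparse, mirroring the line above in a standalone parser:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--engine', default="pandas", choices=["pandas", "sqlite"],
                    help="Internal engine, determines syntax used for --query")

print(parser.parse_args([]).engine)                      # pandas
print(parser.parse_args(["--engine", "sqlite"]).engine)  # sqlite
```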
Running some workflows using the current implementation, will add results to this thread.
ncov-tutorial seems to run fine.
Running ncov's rebuild-gisaid.yml:
- Created nextstrain/docker-base@ad3f5fb
- Created nextstrain/ncov@e3490aa
- Ran rebuild-gisaid.yml with:
  - branch: victorlin/test-augur-filter-sqlite-engine
  - trial build name: victorlin-test-sqlite
  - image: nextstrain/base:branch-victorlin-test-augur-filter-sqlite-engine
- View results:
  - Batch job
  - S3: s3://nextstrain-ncov-private/trial/victorlin-test-sqlite/
  - Auspice views: https://nextstrain.org/staging/ncov/gisaid/trial/victorlin-test-sqlite/REGION_NAME/TIME_SPAN, e.g. https://nextstrain.org/staging/ncov/gisaid/trial/victorlin-test-sqlite/global/6m
Noting that for the global/6m dataset, n=2780, which contrasts a bit with previous dated builds:
- 2022-08-08: 2948
- 2022-08-06: 2897
- 2022-08-04: 2865
- 2022-08-02: 2866
- 2022-08-01: 2843

Not sure if this is just due to random sampling differences. Would need to dig into Batch logs to investigate augur filter output, since I don't see the --output-log files under s3://nextstrain-ncov-private/trial/victorlin-test-sqlite/.
Update: this stalled as I shifted gears to work on other projects. The current state is that it works but opening for formal review is pending:
Update: I have a new version of this on the branch victorlin/rewrite-filter-using-sqlite. I'm still cleaning up the commits, but will open a new PR (or more) for those changes.
Description of proposed changes
- io functions in io_support/.
- `--query` takes a SQL expression instead of pandas syntax.
- Error when `--include-where`, `--exclude-where`, or `--query` column doesn't exist (partial fix for "Throw an error when any requested filter or group-by columns don't exist" #754).

Related issue(s)

Testing
- `--non-nucleotide` and incomplete date

TODO
- use SQLite for FASTA (corneliusroemer/fasta_zstd_sqlite?): tried in 06e5698, not valuable for `augur filter` alone.
- parallelized writes: seems roundabout, better to load directly using the new loader class in io
- `--query` syntax from pandas to SQLite #920
- db/ -> engines/
- ncov data
- utils to dates #923
- `utils.myopen()`, add tests for `io.open_file()` #926
- filter_support/ #937

for later: