-
Notifications
You must be signed in to change notification settings - Fork 130
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: Support relative dates for --min-date
and --max-date
#740
Conversation
Codecov Report
@@ Coverage Diff @@
## master #740 +/- ##
==========================================
+ Coverage 34.15% 34.86% +0.70%
==========================================
Files 41 41
Lines 5979 6041 +62
Branches 1534 1552 +18
==========================================
+ Hits 2042 2106 +64
- Misses 3855 3862 +7
+ Partials 82 73 -9
Continue to review full report at Codecov.
|
Thank you for catching this issue, @benjaminotter. Reading the ISO 8601 standard more closely (page 19), I see that we should provide a leading "P" to the duration specification like "P2Y" for 2 years. In this case, pandas fails to parse that interval reporting:
In this case, I believe pandas is wrong and the ISO spec is right, but that doesn't help us fix the problem. I'll have to think about this a bit more and look for alternative solutions. |
For what it's worth, the isodate package has the correct behavior with the above example: import isodate
>>> isodate.parse_duration("P2Y")
isodate.duration.Duration(0, 0, 0, years=2, months=0) For some reason, I have this package installed already, but I don't know yet what other Nextstrain dependency installs it. One annoying aspect of the ISO 8601 format is the requirement to provide a leading "P" for the date period values. For Augur, I don't see us needing to support the time period values of hours, minutes, and seconds, so maybe we can automatically prefix a user's given interval string with a "P" if they don't provide that as part of the string. |
ISO8601 duration string parsing with isodate does the job, but leaves a Deprecation Warning:
Also, the isodate package was by default not installed on my local Nextstrain build. I added a check to make sure the |
I wanted to revisit this issue, since we still need to add this date offsets feature for ncov and seasonal-flu workflows. I looked into several alternatives for representing date durations with and without support for ISO-8601 parsing including Arrow, Python’s built-in datetime timedelta, isodate, isoduration, and pandas. Below are some example outputs from each of these for durations from common use cases of 2 years, 6 months, and 4 weeks. import arrow
import datetime
import isodate
import isoduration
import pandas as pd
# Examples of how to get time delta objects for:
# - 2 years
# - 6 months
# - 4 weeks
# datetime built-in only supports weeks or smaller units
# of time. Years and months have to be converted to the
# closest equivalent in weeks.
#
# 2 years
datetime.timedelta(weeks=52*2)
# datetime.timedelta(days=728)
# 6 months
datetime.timedelta(weeks=int(52*0.5))
# datetime.timedelta(days=182)
# 4 weeks
datetime.timedelta(weeks=4)
# datetime.timedelta(days=28)
# pandas
# 2 years
pd.Timedelta("-2Y")
# FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
# Timedelta('-731 days +12:21:36')
#
# pandas generally does not want to support these relative
# offset types. For more details see:
# https://github.com/pandas-dev/pandas/issues/16344
#
# 6 months
pd.Timedelta("-6M")
# FutureWarning: Units 'M', 'Y' and 'y' do not represent unambiguous timedelta values and will be removed in a future version
# Timedelta('-1 days +23:54:00')
# 4 weeks
pd.Timedelta("-4W")
# Timedelta('-28 days +00:00:00')
# arrow does not have a representation of a duration type,
# so we need to create a date object and then "shift" it.
d = arrow.now()
d
# <Arrow [2021-12-29T13:32:46.801604-08:00]>
#
# 2 years
d.shift(years=-2)
# <Arrow [2019-12-29T13:32:46.801604-08:00]>
# 6 months
d.shift(months=-6)
# <Arrow [2021-06-29T13:32:46.801604-07:00]>
# 4 weeks
d.shift(weeks=-4)
# <Arrow [2021-12-01T13:32:46.801604-08:00]>
# isodate was originally developed by one person and has
# had 18 total contributors. It has also been forked into
# a GitHub organization called "isodate" without a clear
# reason. It provides a `parse_duration` function that
# returns a `Duration` instance that can be added to other
# standard datetime objects. Its parser tries to adhere to
# the ISO-8601 standard completely, such that "2Y" is not
# valid but "P2Y" is. This library only depends on Python's
# datetime library.
#
# 2 years
isodate.parse_duration("-P2Y")
# isodate.duration.Duration(0, 0, 0, years=-2, months=0)
# 6 months
isodate.parse_duration("-P6M")
# isodate.duration.Duration(0, 0, 0, years=0, months=-6)
# 4 weeks
isodate.parse_duration("-P4W")
# datetime.timedelta(days=-28)
# isoduration reimplements the logic from isodate using
# Python's datetime library and the third-party arrow library.
# isoduration claims to address limitations of isodate. Its
# `parse_duration` function returns a `Duration` instance.
#
# 2 years
isoduration.parse_duration("-P2Y")
# Duration(DateDuration(years=Decimal('-2'), months=Decimal('0'), days=Decimal('0'), weeks=Decimal('0')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0')))
# 6 months
isoduration.parse_duration("-P6M")
# Duration(DateDuration(years=Decimal('0'), months=Decimal('-6'), days=Decimal('0'), weeks=Decimal('0')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0')))
# 4 weeks
isoduration.parse_duration("-P4W")
# Duration(DateDuration(years=Decimal('0'), months=Decimal('0'), days=Decimal('0'), weeks=Decimal('-4')), TimeDuration(hours=Decimal('0'), minutes=Decimal('0'), seconds=Decimal('0'))) If we implemented our own minimal duration parsing for Augur to support year, month, week, and day offsets, we would still likely need to implement much of the algorithm to add durations to datetimes that isodate already implements. There is also always the option of straight-up copying the isodate logic into Augur and using the bits we need. Ultimately, we want to support an interface like this:
We don’t especially care about the full ISO-8601 duration implementation (e.g., a duration like “-P2Y3M4D” is unlikely to be useful). Given this simpler interface we need, we could implement durations with now = datetime.datetime.now()
# datetime.datetime(2021, 12, 29, 13, 51, 52, 541769)
# Subtract 1 year in weeks.
now - datetime.timedelta(weeks=52)
# datetime.datetime(2020, 12, 30, 13, 51, 52, 541769)
# Result is off by 1 day.
# Subtract 2 years in weeks.
now - datetime.timedelta(weeks=104)
# datetime.datetime(2020, 1, 1, 13, 51, 52, 541769)
# Result is off by 3 days.
# Subtract 12 years in weeks.
now - datetime.timedelta(weeks=12*52)
# datetime.datetime(2010, 1, 13, 13, 51, 52, 541769)
# Result is off by 2 weeks! Maybe we don’t care that the results are so far off, though? In the course of 12 years, what is 2 weeks in error for the resulting analysis? After playing with these different implementations a bit, I’m leaning toward adding isodate as a dependency and rebasing the code from this into the new filter interface. |
@huddlej if using |
@victorlin Good point. I'm still not convinced we need to treat negative dates as a first-class use case (as I've mentioned in our Slack discussion). This feature in particular supports the kind of filtering for recent samples we need in the seasonal flu and ncov workflows and that might be strange to implement in an ancient pathogen analysis. That said, I don't think the proposed interface prevents us from adding support for negative dates in the future, if it becomes necessary. For example, you could run the following to filter to samples collected 10,000 years ago:
Then it would be on us to update the augur filter internals to support this logic. If folks are generally on board with the proposed UI, we can get a lot of value now from the isodate implementation. |
I can look into updating this, since I'll need to incorporate it into #854. |
428fdd6
to
413a13e
Compare
Really nice to have this implemented @victorlin. I believe you were working with I don't know if this creates too much ambiguity in terms of parsing, but my preferred user behavior (or at least what seems the most obvious to me) is that there is only Maximum dates would work similarly. You could have a pair that specifies |
Yes, most of the implementation was already done, I just fixed the merge conflicts and added tests. I also found user experience a bit odd, in that it is a positive duration translating to a negative offset from So I think there's a question of if we want to support offsets from a certain date, for example:
If not (keeping offsets strictly from I can imagine most use cases would offset from |
This would be my preference. The example I had above of
I agree. I can't see myself as a user ever wanting to do something like this. I think "offsets" are just fine only subtracting from |
I'll chime in with my 2 cents 😂
This is the behavior which makes sense to me. You could have a separate argument, which would be mutually exclusive with the corresponding date argument, but it's much nicer to overload them. |
Thank you for the comments, all! The original UI I envisioned for this was the overloaded min/max date arguments. When it came time to implement that overloading, we ran into the minor issue that min/max date arguments are implemented in argparse as a custom numeric_date type that would need to be extended or replaced to support the overloading we'd want. Replacing Regarding this issue:
I don't think any of the proposed syntaxes so far address this potential user confusion. Even the overloaded min/max arguments don't explicitly communicate that the offsets are relative to the current date. The original plan to handle this by clearly describing the offsets in the help text. I also thought the separate offset args might help call out the different behavior from min/max date arguments, but maybe there's a better way... |
Thanks for context here @huddlej. Very helpful. I still think that
I think I prefer overloading |
For |
I see. In this case I'd do |
Also, as @huddlej notes here #721 (comment), we should plan to use the same interface for |
Ok, I'll update PR to overload |
5a1ef13
to
a462043
Compare
The relative dates are parsed by `numeric_date` which uses datetime.date.today() to translate the relative date to an absolute date. Relative dates are positive duration values following the ISO 8601 duration syntax e.g. `--min-date 1Y2W5D` for 1 year, 2 weeks and 5 days ago or `--max-date 1D` for yesterday This also adds a package dependency `isodate` to parse the duration string.
The previous change to support relative dates also refactored so that an invalid date does not raise any error, which is unwanted. Prior to that change, there was one try/catch so treetime.utils.numeric_date would handle all these invalid dates. Explicitly raising from augur.filter.numeric_date is one step in a better direction, though still not a good solution since argparse does not expose these errors. More: https://stackoverflow.com/q/38340252
- Expose the invalid date error message using argparse.ArgumentTypeError - Split out the text describing supported dates into a constant variable, since it is used in 3 places now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Thanks for addressing my sporadic comments!
Thanks for the review @joverlee521! Force-pushing a small test order fix then merging. |
Previously, some tests for --max-date were not done for --min-date and vice-versa. Added tests and re-ordered so that similar tests are adjacent.
With #740, filter's numeric_date() provides support for 3 date formats. However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()). This commit: 1. Moves numeric_date() to a new submodule augur.utils.date_parsing 2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing 3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage 4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
With #740, filter's numeric_date() provides support for 3 date formats. However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()). This commit: 1. Moves numeric_date() to a new submodule augur.utils.date_parsing 2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing 3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage 4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
With #740, filter's numeric_date() provides support for 3 date formats. However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()). This commit: 1. Moves numeric_date() to a new submodule augur.utils.date_parsing 2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.utils.date_parsing 3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage 4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
With #740, filter's numeric_date() provides support for 3 date formats. However, supporting various date formats is not specific to filter (e.g. frequencies has a separate numeric_date() which is now out-of-sync with this filter's numeric_date()). This commit: 1. Moves numeric_date() to a new submodule augur.dates 2. Moves the related SUPPORTED_DATE_HELP_TEXT to augur.dates 3. Updates numeric_date() to raise a TypeError rather than argparse.ArgumentTypeError so it can be generalized to non-argparse usage 4. Adds a new function numeric_date_type() which wraps numeric_date() and raises an argparse.ArgumentTypeError per #740 (comment)
Merge PR #34344, commits were: * Add isodate dependency to augur Added in nextstrain/augur#740 * Update augur to 15.0.0
I'm deleting an old branch It isn't associated with any closed PR and looks like an earlier implementation of what got merged in this PR, so I'm linking it here for posterity. |
Description of proposed changes
The relative dates are parsed by
numeric_date
which usesdatetime.date.today()
to translate the relative date to an absolute date.Relative dates are positive duration values following the ISO 8601 duration syntax
e.g.
--min-date 1Y2W5D
for 1 year, 2 weeks and 5 days ago or--max-date 1D
for yesterdayThis also adds a package dependency
isodate
to parse the duration string.Related issue(s)
Fixes #721
Testing
Added pytests.