-
Notifications
You must be signed in to change notification settings - Fork 243
Add VCF 4.3 writing #1548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: cn_vcf_header
Are you sure you want to change the base?
Add VCF 4.3 writing #1548
Conversation
a4a81ef to
1480e6d
Compare
95f1876 to
77fb2fe
Compare
cmnbroad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a high level first pass - I haven't looked at the tests yet, but I'm checkpointing what I have to start.
| DISABLE_SNAPPY_COMPRESSOR = getBooleanProperty(DISABLE_SNAPPY_PROPERTY_NAME, false); | ||
| VCF_VERSION_TRANSITION_POLICY = VCF42To43VersionTransitionPolicy.valueOf(getStringProperty( | ||
| "vcf_version_transition_policy", | ||
| VCF42To43VersionTransitionPolicy.DO_NOT_TRANSITION.name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want the default to be FAIL_IF_CANNOT_TRANSITION, rather than DO_NOT_TRANSITION, unless we find some compelling reason to do otherwise. To start, we should see what kind of test failures we get using that as the default, hopefully we don't have to make too many tweaks other than changing the expected output version where tests are sensitive to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need corresponding tests for round-tripping a valid 4.2 file that fails to transition using the default policy, and that succeeds (roundtrips as a 4.2) using TRANSITION_IF_POSSIBLE or DO_NOT_TRANSITION, if we don't already have those.
| this.filters = null; | ||
| } else { | ||
| filters(filter.split(";")); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sam question here, just for some context for me - is this 4.3 related ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These restrictions weren't introduced in 4.3, but we weren't checking them previously. Some of the reference invalid 4.3 files in the hts-specs repo failed_body_filter_00{n}.vcf exercise these cases, so I thought it would be better to introduce these checks if we're trying to be stricter with spec compliance
src/main/java/htsjdk/variant/variantcontext/VariantContext.java
Outdated
Show resolved
Hide resolved
| public void validateForVersion(final VCFHeaderVersion vcfTargetVersion) { | ||
| super.validateForVersion(vcfTargetVersion); | ||
| // Let the 1000 Genomes line through, but only for INFO lines | ||
| if (this instanceof VCFInfoHeaderLine && getID().equals(VCFConstants.THOUSAND_GENOMES_KEY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I would use getKey instead of instanceof to detect info lines, but in this case I would move this validation right into a VCFInfoHeaderLine validateForVersion override so you don't have to test this at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing logic wasn't quite right here in any case, it should have been inspecting the ID and not the keys of the header's fields. I've added a comment explaining this and an override to VCFInfoHeaderLine to special case the 1000G key
0c44ef6 to
9a9ea52
Compare
…BCF2.2 and bug fixes.
9a9ea52 to
f66252a
Compare
77fb2fe to
e917800
Compare
Codecov Report
@@ Coverage Diff @@
## cn_vcf_header #1548 +/- ##
=====================================================
+ Coverage 69.452% 69.912% +0.461%
- Complexity 9037 9888 +851
=====================================================
Files 606 707 +101
Lines 35881 38318 +2437
Branches 5942 6238 +296
=====================================================
+ Hits 24920 26789 +1869
- Misses 8598 9048 +450
- Partials 2363 2481 +118
|
…or customs serialization.
…rom header on write
5b45b8b to
d340aff
Compare
cmnbroad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only part way through but checkpointing what I have so far.
src/main/java/htsjdk/variant/variantcontext/writer/VCFVersionUpgradePolicy.java
Outdated
Show resolved
Hide resolved
| outputPath, | ||
| VCFCodecV4_3.VCF_V43_VERSION); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this case succeeds, the "vcfReadWriteTests" data provider at the top of this file should have a 4.3 file added to it (previously, we couldn't read in a 4.3 file and write it out as the "default" version, since the default was 4.2, but now that the default is 4.3, we can add that test case and remove the comment saying why its not there). The new test case would be:
{ new HtsPath(VARIANTS_TEST_DIR + "variant/vcf43/all43FeaturesCompressed.vcf.gz"), VCFCodecV4_3.VCF_V43_VERSION },
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additionally, there should be at least one new test that demonstrates that a 4.2 input that fails upgrade gets rejected by the VCF4.3 encoder.
There are also a bunch of other accidental things happening in this file now that we should clean up. Specifically, there are test inputs that are subjected to auto-upgrade now, but fail (because they contain a PEDIGREE line that can't be upgraded - which nicely demonstrates the fallback case and the error message display, so thats good), but we'll need to revisit these tests in light of the changes on this branch to make sure they're still testing what they're supposed to be testing.
Perhaps I should add fix up this file and submit it for merging into your branch while you carry on with the other changes.
|
|
||
| writer.writeHeader(readHeader); | ||
| for (final VariantContext vc : readVCF.b) { | ||
| writer.add(vc.fullyDecode(readHeader, false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call to fullDecode necessary ? What happens if you don't call it ? It seems like this test should succeed without i ?
cmnbroad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpointing a partial review. And I still haven't really looked at the tests.
| * from versions of VCF before the current version. The writer will call {@link VCFHeader#upgradeVersion} headers | ||
| * passed in by {@link VCFWriter#writeHeader} and {@link VCFWriter#setHeader} before writing the header out. | ||
| * @param policy the policy to use | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this doesn't affect BCF writers, so the javadoc should say that this controls how VCF writers handle upgrades, and that its ignored for BCF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, since this policy is engaged when setHeader is called, it looks like it might affect the BCF writer (??). Whatever the intended behavior is, we should include a test that validates it for the BCF writer to ensure we don't unintentionally regress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we don't already have one, there should be a test that validates that the upgrade policy behavior is correctly honored for all policy values when doing headerless/shard writing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The javadoc in VCFVersionUpgradePolicy and VariantContextWriterBuilder only mentions VCFWriter but I added a little clarifying note that it has no effect on BCF2Writer in VCFVersionUpgradePolicy, VariantContextWriterBuilder and this method.
It doesn't affect BCF because BCF2Writer calls the static method which writes the header verbatim.
src/main/java/htsjdk/variant/variantcontext/writer/VariantContextWriterBuilder.java
Show resolved
Hide resolved
| public VariantContextBuilder version(final VCFHeaderVersion version) { | ||
| this.version = version; | ||
| return this; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This javadoc should explain that/how this setting interacts with the upgrade policy setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this method should be called setVersion, and moved so its adjacent to the getter (which should also have javadoc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added javadoc to the getter, but the class is arranged so getters are all separate from setters already (a bit inconsistently), so we'd have to move the whole class around for consistency
| /* cached monomorphic value: null -> not yet computed, False, True */ | ||
| private Boolean monomorphic = null; | ||
|
|
||
| private VCFHeaderVersion version; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to make this final; it appears to be mutable only as a convenience for the upgrade case, where we're creating a new VC anyway (?). See my comments on the code where this is mutated.
| } else { | ||
| // This will ensure that the VariantContext is decoded in a percent-encoding aware way | ||
| final VariantContext newVC = this.fullyDecode(header, true); | ||
| newVC.version = headerVersion; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is the only case where a VC version value is actually mutated, and its on a newly created VC anyway. If possible, it would be preferable just create the newVC with the correct header version when its constructed (probably requires passing the requested version to fullyDecode ?). The we can make version immutable and in this class.
| private void decodeGenotypes( | ||
| final VariantContextBuilder builder, | ||
| final VCFHeader header, | ||
| final boolean lenientDecoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest that rather than having a return of void and passing in the builder for this method to mutate, it would be cleaner to have it return the new GC, and let the caller propagate them to the builder (and maybe call it getDecodedGenotypes) or something similar.
| final BiConsumer<String, Object> put, | ||
| final Function<String, VCFCompoundHeaderLine> metadataGetter, | ||
| final Map<String, Object> attributes, | ||
| final boolean lenientDecoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try find some way to make this code type safe. Its not obvious from the argument types that the put and attributes args are related, and need to be type-compatible, and its a bit awkward to require the caller to deconstruct the source objects that own the maps and pass the pieces in pairs like this.
One suggestion to consider would be delegating this to new getDecodedAttributes methods on Genotype and VariantContext, since those classes own the map/put pairs that are passed in here together. Or maybe there is some other alternative.
But if thats awkward, I would at least reorder the args so that attributes is first, followed by the related put, and document how they're related.
| final GenotypesContext gc = new GenotypesContext(); | ||
| for ( final Genotype g : getGenotypes() ) { | ||
| gc.add(fullyDecodeGenotypes(g, header)); | ||
| private static List<Double> makeGPValueLinear(final List<Double> gpValues) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested name: phredScaleToLinearScale ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
| // its values between pre-4.3 and 4.3+ VCF. Pre-4.3 GP is phred scale encoded while | ||
| // 4.3+ GP is a linear probability, bringing it in line with other standard keys that | ||
| // use the P suffix (c.f. VCF 4.3 spec section 7.2). | ||
| if (headerVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3) && oldGP != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also check the VC version ? We shouldn't need to do this if its already 4.3 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check would've been redundant before because we were checking whether the version was the same, but with the other change I added it in
Description
@lbergelson @cmnbroad
Things to think about before submitting: