
Conversation

@andersleung
Contributor

Description

  • Percent encoding of special characters is now supported in VCFWriter (a minimal illustrative sketch follows this list)
  • Default behavior is to write VCFs as the latest guaranteed-compatible version (pre-v4.3 inputs are written as v4.2, and v4.3 inputs as v4.3), but a system property can be set to automatically attempt conversion from v4.2 to v4.3 and fall back to v4.2 if conversion is impossible, or alternatively to fail with an error if conversion is impossible
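For reference, a minimal sketch of the VCF 4.3 percent-encoding rule this refers to (illustrative only; the helper name is made up and this is not the actual VCFWriter implementation):

    // Reserved characters in VCF 4.3 string fields are written as %XX escapes
    // (e.g. ';' -> "%3B", '=' -> "%3D"), per section 1.2 of the spec.
    static String percentEncode(final String value) {
        final StringBuilder sb = new StringBuilder(value.length());
        for (final char c : value.toCharArray()) {
            if ("%:;=,\r\n\t".indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }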

@lbergelson @cmnbroad

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extend the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@andersleung andersleung marked this pull request as draft April 20, 2021 19:56
@cmnbroad cmnbroad force-pushed the cn_vcf_header branch 2 times, most recently from a4a81ef to 1480e6d on September 22, 2021 12:31
@andersleung andersleung force-pushed the vcf43writer branch 2 times, most recently from 95f1876 to 77fb2fe on October 26, 2021 14:39
Collaborator

@cmnbroad cmnbroad left a comment

Did a high-level first pass - I haven't looked at the tests yet, but I'm checkpointing what I have to start.

    DISABLE_SNAPPY_COMPRESSOR = getBooleanProperty(DISABLE_SNAPPY_PROPERTY_NAME, false);
    VCF_VERSION_TRANSITION_POLICY = VCF42To43VersionTransitionPolicy.valueOf(getStringProperty(
            "vcf_version_transition_policy",
            VCF42To43VersionTransitionPolicy.DO_NOT_TRANSITION.name()
Collaborator

I think we want the default to be FAIL_IF_CANNOT_TRANSITION rather than DO_NOT_TRANSITION, unless we find some compelling reason to do otherwise. To start, we should see what kind of test failures we get using that as the default; hopefully we won't have to make many tweaks beyond changing the expected output version where tests are sensitive to it.
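For concreteness, the initialization quoted above with the fallback swapped to the suggested default would look roughly like this (sketch only):

    VCF_VERSION_TRANSITION_POLICY = VCF42To43VersionTransitionPolicy.valueOf(getStringProperty(
            "vcf_version_transition_policy",
            VCF42To43VersionTransitionPolicy.FAIL_IF_CANNOT_TRANSITION.name()));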

Collaborator

We also need corresponding tests for round-tripping a valid 4.2 file that fails to transition using the default policy, and that succeeds (roundtrips as a 4.2) using TRANSITION_IF_POSSIBLE or DO_NOT_TRANSITION, if we don't already have those.

            this.filters = null;
        } else {
            filters(filter.split(";"));
        }
Collaborator

Same question here, just for some context for me - is this 4.3 related?

Contributor Author

These restrictions weren't introduced in 4.3, but we weren't checking them previously. Some of the invalid 4.3 reference files in the hts-specs repo (failed_body_filter_00{n}.vcf) exercise these cases, so I thought it would be better to introduce these checks if we're trying to be stricter about spec compliance.
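For context, the kind of FILTER check being discussed looks roughly like the following; this is a sketch based on the FILTER rules in the spec (the method name is made up), not the actual htsjdk code:

    // Per the VCF spec, a filter ID must not contain whitespace or semicolons,
    // and "0" is reserved and may not be used as a filter ID.
    private static void validateFilterName(final String filter) {
        if (filter.equals("0") || filter.chars().anyMatch(c -> Character.isWhitespace(c) || c == ';')) {
            throw new IllegalArgumentException("Invalid FILTER value: " + filter);
        }
    }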

    public void validateForVersion(final VCFHeaderVersion vcfTargetVersion) {
        super.validateForVersion(vcfTargetVersion);
        // Let the 1000 Genomes line through, but only for INFO lines
        if (this instanceof VCFInfoHeaderLine && getID().equals(VCFConstants.THOUSAND_GENOMES_KEY)) {
Collaborator

Generally I would use getKey instead of instanceof to detect info lines, but in this case I would move this validation right into a VCFInfoHeaderLine validateForVersion override so you don't have to test this at all.

Contributor Author

The existing logic wasn't quite right here in any case; it should have been inspecting the ID, not the keys of the header's fields. I've added a comment explaining this and an override to VCFInfoHeaderLine to special-case the 1000G key.
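A minimal sketch of what such a VCFInfoHeaderLine override could look like (illustrative only; the actual change may differ):

    @Override
    public void validateForVersion(final VCFHeaderVersion vcfTargetVersion) {
        // Let the reserved "1000G" INFO ID through even though it does not follow
        // the usual ID naming rules; everything else gets the base-class validation.
        if (getID().equals(VCFConstants.THOUSAND_GENOMES_KEY)) {
            return;
        }
        super.validateForVersion(vcfTargetVersion);
    }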

@cmnbroad cmnbroad force-pushed the cn_vcf_header branch 3 times, most recently from 0c44ef6 to 9a9ea52 on November 9, 2021 19:49
@codecov-commenter

codecov-commenter commented Nov 29, 2021

Codecov Report

Merging #1548 (d8db226) into cn_vcf_header (09ed106) will increase coverage by 0.461%.
The diff coverage is 76.070%.

@@                  Coverage Diff                  @@
##             cn_vcf_header     #1548       +/-   ##
=====================================================
+ Coverage           69.452%   69.912%   +0.461%     
- Complexity            9037      9888      +851     
=====================================================
  Files                  606       707      +101     
  Lines                35881     38318     +2437     
  Branches              5942      6238      +296     
=====================================================
+ Hits                 24920     26789     +1869     
- Misses                8598      9048      +450     
- Partials              2363      2481      +118     
Impacted Files Coverage Δ
...ta/codecs/reads/cram/cramV2_1/CRAMEncoderV2_1.java 0.000% <0.000%> (ø)
...ta/codecs/variants/vcf/vcfv3_2/VCFEncoderV3_2.java 0.000% <0.000%> (ø)
...ta/codecs/variants/vcf/vcfv3_3/VCFEncoderV3_3.java 0.000% <0.000%> (ø)
...ta/codecs/variants/vcf/vcfv4_0/VCFEncoderV4_0.java 0.000% <0.000%> (ø)
...ta/codecs/variants/vcf/vcfv4_1/VCFEncoderV4_1.java 0.000% <0.000%> (ø)
...ta/codecs/variants/vcf/vcfv4_3/VCFDecoderV4_3.java 0.000% <0.000%> (ø)
.../java/htsjdk/beta/exception/HtsjdkIOException.java 0.000% <0.000%> (ø)
...tsjdk/beta/plugin/hapref/HapRefDecoderOptions.java 0.000% <0.000%> (ø)
...dk/beta/plugin/hapref/HaploidReferenceFormats.java 0.000% <0.000%> (ø)
...in/java/htsjdk/beta/plugin/reads/ReadsFormats.java 0.000% <0.000%> (ø)
... and 195 more

Collaborator

@cmnbroad cmnbroad left a comment

Only part way through but checkpointing what I have so far.

                outputPath,
                VCFCodecV4_3.VCF_V43_VERSION);
    }

Collaborator

Now that this case succeeds, the "vcfReadWriteTests" data provider at the top of this file should have a 4.3 file added to it. (Previously we couldn't read in a 4.3 file and write it out as the "default" version, since the default was 4.2; now that the default is 4.3, we can add that test case and remove the comment saying why it's not there.) The new test case would be:

{ new HtsPath(VARIANTS_TEST_DIR + "variant/vcf43/all43FeaturesCompressed.vcf.gz"), VCFCodecV4_3.VCF_V43_VERSION },

Collaborator

Additionally, there should be at least one new test that demonstrates that a 4.2 input that fails upgrade gets rejected by the VCF 4.3 encoder (a rough sketch follows this comment).

There are also a bunch of other accidental things happening in this file now that we should clean up. Specifically, there are test inputs that are now subjected to auto-upgrade but fail (because they contain a PEDIGREE line that can't be upgraded - which nicely demonstrates the fallback case and the error message display, so that's good); we'll need to revisit these tests in light of the changes on this branch to make sure they're still testing what they're supposed to be testing.

Perhaps I should fix up this file and submit it for merging into your branch while you carry on with the other changes.
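As a rough illustration of the rejection test mentioned above (the test data file, round-trip helper, and expected exception type are all assumptions, not existing htsjdk code):

    // Hypothetical: a structurally valid 4.2 input that cannot be upgraded to 4.3
    // (e.g. one containing a pre-4.3 PEDIGREE line) should be rejected by the 4.3 encoder.
    @Test(expectedExceptions = RuntimeException.class)
    public void testNonUpgradeable42InputRejectedByV43Encoder() throws IOException {
        final IOPath input = new HtsPath(VARIANTS_TEST_DIR + "variant/vcf42WithPedigree.vcf"); // hypothetical test file
        final IOPath output = new HtsPath(Files.createTempFile("vcf43EncoderTest", ".vcf").toString());
        roundTripFromPath(input, output, VCFCodecV4_3.VCF_V43_VERSION); // hypothetical helper; expected to throw
    }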


        writer.writeHeader(readHeader);
        for (final VariantContext vc : readVCF.b) {
            writer.add(vc.fullyDecode(readHeader, false));
Collaborator

Is this call to fullyDecode necessary? What happens if you don't call it? It seems like this test should succeed without it?

@andersleung andersleung marked this pull request as ready for review December 13, 2021 13:58
Collaborator

@cmnbroad cmnbroad left a comment

Checkpointing a partial review. And I still haven't really looked at the tests.

     * from versions of VCF before the current version. The writer will call {@link VCFHeader#upgradeVersion} on headers
     * passed in by {@link VCFWriter#writeHeader} and {@link VCFWriter#setHeader} before writing the header out.
     * @param policy the policy to use
     */
Collaborator

It looks like this doesn't affect BCF writers, so the javadoc should say that this controls how VCF writers handle upgrades, and that it's ignored for BCF.

Collaborator

Actually, since this policy is engaged when setHeader is called, it looks like it might affect the BCF writer (??). Whatever the intended behavior is, we should include a test that validates it for the BCF writer to ensure we don't unintentionally regress.

Collaborator

Also, if we don't already have one, there should be a test that validates that the upgrade policy behavior is correctly honored for all policy values when doing headerless/shard writing.

Contributor Author

The javadoc in VCFVersionUpgradePolicy and VariantContextWriterBuilder only mentions VCFWriter, but I added a little clarifying note in VCFVersionUpgradePolicy, VariantContextWriterBuilder, and this method that it has no effect on BCF2Writer.

It doesn't affect BCF because BCF2Writer calls the static method, which writes the header verbatim.

    public VariantContextBuilder version(final VCFHeaderVersion version) {
        this.version = version;
        return this;
    }
Collaborator

This javadoc should explain that/how this setting interacts with the upgrade policy setting.

Collaborator

Also, this method should be called setVersion and moved so it's adjacent to the getter (which should also have javadoc).

Contributor Author

Added javadoc to the getter, but the class is already arranged so that getters are separate from setters (a bit inconsistently), so we'd have to rearrange the whole class for consistency.

    /* cached monomorphic value: null -> not yet computed, False, True */
    private Boolean monomorphic = null;

    private VCFHeaderVersion version;
Collaborator

We should try to make this final; it appears to be mutable only as a convenience for the upgrade case, where we're creating a new VC anyway (?). See my comments on the code where this is mutated.

        } else {
            // This will ensure that the VariantContext is decoded in a percent-encoding aware way
            final VariantContext newVC = this.fullyDecode(header, true);
            newVC.version = headerVersion;
Collaborator

It seems like this is the only case where a VC version value is actually mutated, and it's on a newly created VC anyway. If possible, it would be preferable to just create the newVC with the correct header version when it's constructed (probably requires passing the requested version to fullyDecode?). Then we can make version immutable in this class.
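As a rough illustration of that shape (the extra fullyDecode parameter is an assumption, not the existing signature):

    // Construct the decoded VC with its target version up front, so VariantContext.version can be final.
    final VariantContext newVC = this.fullyDecode(header, true, headerVersion);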

    private void decodeGenotypes(
            final VariantContextBuilder builder,
            final VCFHeader header,
            final boolean lenientDecoding
Collaborator

I'd suggest that rather than having a void return and passing in the builder for this method to mutate, it would be cleaner to have it return the new GenotypesContext and let the caller propagate it to the builder (and maybe call it getDecodedGenotypes, or something similar).
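A rough sketch of that shape (the method name is the one suggested above; the body reuses the existing per-genotype decode call, with lenientDecoding handling elided):

    // Return the decoded genotypes instead of mutating a caller-supplied builder.
    private GenotypesContext getDecodedGenotypes(final VCFHeader header, final boolean lenientDecoding) {
        final GenotypesContext decoded = GenotypesContext.create(getNSamples());
        for (final Genotype g : getGenotypes()) {
            decoded.add(fullyDecodeGenotypes(g, header));
        }
        return decoded;
    }

The caller would then do something like builder.genotypes(getDecodedGenotypes(header, lenientDecoding)).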

            final BiConsumer<String, Object> put,
            final Function<String, VCFCompoundHeaderLine> metadataGetter,
            final Map<String, Object> attributes,
            final boolean lenientDecoding
Collaborator

We should try to find some way to make this code type safe. It's not obvious from the argument types that the put and attributes args are related and need to be type-compatible, and it's a bit awkward to require the caller to deconstruct the source objects that own the maps and pass the pieces in pairs like this.

One suggestion to consider would be delegating this to new getDecodedAttributes methods on Genotype and VariantContext, since those classes own the map/put pairs that are passed in here together (a rough sketch follows this comment). Or maybe there is some other alternative.

But if that's awkward, I would at least reorder the args so that attributes comes first, followed by the related put, and document how they're related.
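For illustration, the delegation idea could look roughly like this on VariantContext (the method name and the per-value helper are assumptions, not existing htsjdk code):

    // Each owner of an attribute map decodes its own attributes, so callers no longer
    // need to pass related map/put pairs separately.
    public Map<String, Object> getDecodedAttributes(final VCFHeader header, final boolean lenientDecoding) {
        final Map<String, Object> decoded = new LinkedHashMap<>();
        for (final Map.Entry<String, Object> entry : getAttributes().entrySet()) {
            final VCFCompoundHeaderLine metadata = header.getInfoHeaderLine(entry.getKey());
            decoded.put(entry.getKey(), decodeAttributeValue(metadata, entry.getValue(), lenientDecoding)); // hypothetical helper
        }
        return decoded;
    }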

        final GenotypesContext gc = new GenotypesContext();
        for ( final Genotype g : getGenotypes() ) {
            gc.add(fullyDecodeGenotypes(g, header));

    private static List<Double> makeGPValueLinear(final List<Double> gpValues) {
Collaborator

Suggested name: phredScaleToLinearScale?

Contributor Author

Changed

        // its values between pre-4.3 and 4.3+ VCF. Pre-4.3 GP is phred scale encoded while
        // 4.3+ GP is a linear probability, bringing it in line with other standard keys that
        // use the P suffix (c.f. VCF 4.3 spec section 7.2).
        if (headerVersion.isAtLeastAsRecentAs(VCFHeaderVersion.VCF4_3) && oldGP != null) {
Collaborator

Shouldn't this also check the VC version? We shouldn't need to do this if it's already 4.3?

Contributor Author

The check would've been redundant before because we were checking whether the version was the same, but with the other change I've added it in.
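For reference, a minimal sketch of the phred-to-linear GP conversion discussed in this thread (illustrative only, not the htsjdk implementation):

    // Pre-4.3 GP values are phred-scaled (-10 * log10(p)); 4.3+ GP values are linear
    // probabilities, so each value maps back via p = 10^(-phred / 10).
    private static List<Double> phredScaleToLinearScale(final List<Double> phredValues) {
        final List<Double> linear = new ArrayList<>(phredValues.size());
        for (final double phred : phredValues) {
            linear.add(Math.pow(10.0, -phred / 10.0));
        }
        return linear;
    }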

@cmnbroad cmnbroad marked this pull request as draft March 1, 2022 14:06
@cmnbroad cmnbroad mentioned this pull request Mar 1, 2022