Summary
Normalizing variants on non-coding RNA transcripts (NR_ accessions) fails with "problem accessing data" because TxIdentityInfo defines cds_start_i and cds_end_i as i32 instead of Option<i32>. Non-coding transcripts have NULL values for these fields in UTA, causing the database row conversion to fail.
Steps to Reproduce
use hgvs::parser::HgvsVariant;
use std::str::FromStr;
// `normalizer` is an already-constructed normalizer backed by a UTA provider (setup omitted)
// Parse and normalize an NR_ variant
let var = HgvsVariant::from_str("NR_072979.2:n.2217G>A")?;
let normalized = normalizer.normalize(&var)?; // Fails with "problem accessing data"
Expected Behavior
Normalization should succeed. The biocommons/hgvs Python library handles NR_ transcripts correctly.
Actual Behavior
Fails with error: Normalization error: problem accessing data
Root Cause
In src/data/interface.rs, lines 134-145:
pub struct TxIdentityInfo {
pub tx_ac: String,
pub alt_ac: String,
pub alt_aln_method: String,
pub cds_start_i: i32, // Should be Option<i32>
pub cds_end_i: i32, // Should be Option<i32>
pub lengths: Vec<i32>,
pub hgnc: String,
pub translation_table: TranslationTable,
}
Compare with TxInfoRecord (lines 155-163) which correctly uses Option<i32>:
pub struct TxInfoRecord {
pub hgnc: String,
pub cds_start_i: Option<i32>, // Correctly nullable
pub cds_end_i: Option<i32>, // Correctly nullable
// ...
}
The UTA database returns NULL for cds_start_i and cds_end_i for non-coding transcripts:
SELECT tx_ac, cds_start_i, cds_end_i FROM uta_20210129b.tx_def_summary_v WHERE tx_ac = 'NR_072979.2';
-- tx_ac       | cds_start_i | cds_end_i
-- ------------+-------------+-----------
-- NR_072979.2 |             |
Suggested Fix
- Change TxIdentityInfo to use Option<i32> for cds_start_i and cds_end_i
- Update the TryFrom<Row> implementation in src/data/uta.rs
- Update any code that uses these fields to handle None
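A rough sketch of what the change could look like from the caller's side. TxIdentityInfoSketch is a hypothetical, cut-down stand-in for TxIdentityInfo (only the relevant fields), and the cds_range and is_coding helpers are illustrative assumptions, not part of the existing hgvs-rs API:

```rust
/// Hypothetical, simplified stand-in for `TxIdentityInfo`.
/// Only the fields relevant to this issue are included.
#[derive(Debug, Clone, PartialEq)]
pub struct TxIdentityInfoSketch {
    pub tx_ac: String,
    /// `None` when UTA stores NULL, i.e. for non-coding transcripts.
    pub cds_start_i: Option<i32>,
    /// `None` when UTA stores NULL, i.e. for non-coding transcripts.
    pub cds_end_i: Option<i32>,
    pub lengths: Vec<i32>,
}

impl TxIdentityInfoSketch {
    /// Returns the CDS interval only when both bounds are present.
    pub fn cds_range(&self) -> Option<(i32, i32)> {
        self.cds_start_i.zip(self.cds_end_i)
    }

    /// A transcript is treated as coding only if it has a full CDS interval.
    pub fn is_coding(&self) -> bool {
        self.cds_range().is_some()
    }
}
```

With this shape, code that needs CDS coordinates can match on cds_range() and skip the c./n. offset logic for NR_ transcripts, rather than relying on sentinel values such as 0.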
Workaround
As a temporary workaround, you can set dummy values in the database:
UPDATE uta_20210129b.transcript SET cds_start_i = 0, cds_end_i = 0 WHERE ac LIKE 'NR_%' AND cds_start_i IS NULL;
REFRESH MATERIALIZED VIEW uta_20210129b.tx_def_summary_mv;
Environment
- hgvs-rs version: 0.19.1
- UTA schema: uta_20210129b