Development Subgroup Meeting Notes
1. Disease Episodes beyond Disease First Episode (Tumor Registry data) -> Finalize the model proposal -> Key discussion points
-
The reason for connecting Measurement directly to the Episode as well as to Condition_Occurrence is that we want to have information at the lowest level, on Condition_Occurrence, and promote it up to the Episode layer when it’s clear which Condition_Occurrence entry is the Disease Episode. With EHR data we may not be able to do this. Also, people may want to use entries at the lower level while the analytic tools don’t yet have the ability to deal with the Episode layer.
-
The current design is based on the premise that Metastases would be an example of a subsequent Disease Episode. The other option, which is not captured in the model, is to handle metastases differently from other disease episodes. This needs further discussion.
-
We need to build a general strategy/rules for how we derive/abstract the subsequent disease episodes. There must be a clear convention for Episode_Type_Concept_ID here, which will drive the confidence in the data. For the cases where we are not deriving and just ETLing from source data, we should have conventions as well: there we know what the underlying events are, and we can develop conventions for putting them in the Episode tables. For derived episodes, the approach will vary greatly, and there may be probabilistic methods we are not aware of.
-
Currently we only have 2 Episode_Type_Concept_ID values, ‘tumor reg’ and ‘algorithm derived’. We should have a formal library of algorithms so that we know exactly which algorithm was used to derive an episode, which will build confidence in the data.
-
We need a list of the different types of disease episodes. Rimma, Asieh and Michael will define the actual episodes.
-
Christian suggested we add an Overall/overarching Disease Episode to cover instances where the patient is referred in, we don’t have the initial diagnosis, and all we have is a discharge summary. If we don’t know when the diagnosis truly began then we should not have the Disease First Occurrence. If we have an overarching episode, then whether or not we have all subsequent occurrences, they are all children of the same disease and can connect to the parent episode. The advantage of this is that if at a certain point we cannot define the 1st occurrence we would still have the overarching disease episode; its date is whatever is in the discharge summary.
-
For Treatment we may want to connect to the specific occurrence of the disease and not the main diagnosis episode. All other subsequent episodes will connect to the parent diagnosis episode. There must be some freedom to indicate a treatment pathway w/o attaching it to a particular episode. In many cases, it may not be clear which diagnosis a treatment was for.
-
Rimma has encountered a lot of patients where the derivation of the first occurrence in tumor reg is just not there. The patient may have been diagnosed and treated elsewhere. We will have a record of the disease but not a reliable record of the first occurrence. Therefore, having the overall record of the disease with a start date derived from the EMR will help. That way, if we don’t know that it’s truly the first episode, we will not have the disease 1st episode as a child. This will allow us to explicitly differentiate when we do or don’t know when the disease started.
-
Based on the above discussion, the plan is to create a disease episode and make disease 1st occurrence the child of the disease episode. For Surgery, if we know that Surgery was for the 1st occurrence then link it to disease first occurrence. If it’s not known, then link it to overall disease episode. Same with Treatment episodes where they are connected to high level disease episode vs specific 1st occurrence episodes.
-
Let’s say we can derive the Treatment Cycle; then the drug exposures that were administered as a part of drug cycle 1 should have entries in episode event and be connected only to the cycle and not to the high level. The flow of connection will be Treatment course -> chemotherapy regimen and surgery. The episode object concepts for treatment course have not been finalized yet. We should have the ability to work with and w/o the overall episode and we need conventions for that.
-
If we know a cycle connects to the medication then they connect to the event. We would only create cycles if we know which drug exposure connects to which cycle; otherwise we wouldn’t create cycles and would just create the chemotherapy regimen and connect the drug exposures to the overall chemo regimen. The plan is to focus on the regimen part first and then revisit the treatment part.
-
Other important things to consider in the design are:
The Treatment intent and the definition of 1st line of treatment, 2nd line of treatment etc. The intent is often available in medical records or research. We don’t have the qualifier for that now and we should be adding it to the episode table or adding it as a modifier to the treatment.
Should metastasis be a subsequent disease episode, or a modifier of disease first occurrence, or does it depend on when it’s diagnosed? If you only catch a metastasis at the first diagnosis of the primary, does that change it? Discussion to be continued in the next meeting.
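The linking rules discussed above (overarching disease episode as parent, first occurrence as a child only when it is known, treatment linked to the most specific episode we can justify) could be sketched roughly as below. This is only an illustration in Python: the `Episode` class, the string labels and the helper names are assumptions standing in for real episode concepts and ETL code, none of which has been agreed yet.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical labels standing in for real episode concept IDs.
DISEASE_OVERALL = "disease_overall"
DISEASE_FIRST_OCCURRENCE = "disease_first_occurrence"
TREATMENT = "treatment"


@dataclass
class Episode:
    episode_id: int
    kind: str                        # stand-in for episode_concept_id
    start_date: date
    parent_id: Optional[int] = None  # stand-in for episode_parent_id


def build_disease_episodes(overall_start, first_occurrence_date=None):
    """Always create the overarching episode; add the first-occurrence
    child only when its date is actually known."""
    episodes = [Episode(1, DISEASE_OVERALL, overall_start)]
    if first_occurrence_date is not None:
        episodes.append(
            Episode(2, DISEASE_FIRST_OCCURRENCE, first_occurrence_date, parent_id=1)
        )
    return episodes


def attach_treatment(episodes, start, for_first_occurrence):
    """Link a treatment to the first occurrence when that is known,
    otherwise to the overarching disease episode."""
    overall = next(e for e in episodes if e.kind == DISEASE_OVERALL)
    first = next((e for e in episodes if e.kind == DISEASE_FIRST_OCCURRENCE), None)
    parent = first if (for_first_occurrence and first is not None) else overall
    treatment = Episode(len(episodes) + 1, TREATMENT, start, parent_id=parent.episode_id)
    episodes.append(treatment)
    return treatment


# Patient referred in: only a discharge summary date is available.
referred = build_disease_episodes(date(2020, 3, 1))
surgery = attach_treatment(referred, date(2020, 3, 10), for_first_occurrence=False)

# Patient with a known first occurrence.
known = build_disease_episodes(date(2019, 1, 5), date(2019, 1, 5))
chemo = attach_treatment(known, date(2019, 2, 1), for_first_occurrence=True)
```

In the referred-in case the surgery hangs off the overarching episode, so nothing in the data claims that the first occurrence is known.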
a. How to install the extension (vocabulary and model) and populate the OMOP extension -> Self-study?
b. Synthetic tumor registry data
i. Compile the OMOP CDM v5.31 Tables (pre-requisite)
ii. Compile the Oncology CDM Extension Tables
c. If the plan is to populate the OMOP Extension there are 2 ways of doing that (1) NAACCR ETL (May have to create NAACCR synthetic data) (2) Drug Regimen Extraction (Can be done on synthetic data)
i. Request to modify the TABLE structure in order to manage the referential integrity based on the polymorphic identity. To solve the referential integrity problem the recommendation is to have an interface table (See TABLE structure below) that’s an event interface that handles the polymorphism.
CREATE TABLE event (event_id BIGINT NOT NULL, episode_event_field_concept_id VARCHAR, PRIMARY KEY (event_id));
CREATE TABLE episode (episode_id BIGINT NOT NULL, episode_source_value VARCHAR(50), PRIMARY KEY (episode_id));
CREATE TABLE episode_event (episode_id BIGINT NOT NULL, event_id BIGINT NOT NULL, PRIMARY KEY (episode_id, event_id), FOREIGN KEY(episode_id) REFERENCES episode (episode_id), FOREIGN KEY(event_id) REFERENCES event (event_id));
ii. Any table that we want to treat as an event, such as drug exposure, procedure occurrence etc., will need a backlink to an event_id. The current design may be a bit loose in that there is no real definition of what can and what can’t be an event. Even though it’s obvious which tables are event tables, that is not controlled in any way. The above design allows us to specify this stringently and get the benefit of enforcing referential integrity properly.
iii. Nasreen confirmed that they have permission from Elekta to create a private repo and put code in there to be shared with others that have a MOSAIQ license.
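To make the event interface proposal in c.i and c.ii more concrete, here is a small sketch using Python's built-in sqlite3 module. The `event`, `episode` and `episode_event` tables follow the structure above; the trimmed `drug_exposure` table with its `event_id` backlink and the value stored in `episode_event_field_concept_id` are assumptions for illustration only, and SQLite is used purely because it runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
conn.executescript("""
CREATE TABLE event (
    event_id INTEGER NOT NULL PRIMARY KEY,
    episode_event_field_concept_id VARCHAR
);
CREATE TABLE episode (
    episode_id INTEGER NOT NULL PRIMARY KEY,
    episode_source_value VARCHAR(50)
);
CREATE TABLE episode_event (
    episode_id INTEGER NOT NULL REFERENCES episode (episode_id),
    event_id INTEGER NOT NULL REFERENCES event (event_id),
    PRIMARY KEY (episode_id, event_id)
);
-- Hypothetical, heavily trimmed drug_exposure with the proposed backlink.
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER NOT NULL PRIMARY KEY,
    event_id INTEGER NOT NULL UNIQUE REFERENCES event (event_id)
);
""")

# Register the event first, then the clinical row that points back to it.
conn.execute("INSERT INTO event VALUES (1, 'drug_exposure.drug_exposure_id')")
conn.execute("INSERT INTO episode VALUES (10, 'chemo regimen')")
conn.execute("INSERT INTO drug_exposure VALUES (100, 1)")
conn.execute("INSERT INTO episode_event VALUES (10, 1)")


def link_is_rejected(episode_id, event_id):
    """Return True when the database refuses the episode-to-event link."""
    try:
        conn.execute(
            "INSERT INTO episode_event VALUES (?, ?)", (episode_id, event_id)
        )
    except sqlite3.IntegrityError:
        return True
    return False


# Event 999 was never registered, so the foreign key blocks the link.
orphan_rejected = link_is_rejected(10, 999)
```

The point of the interface table is visible in the last call: an episode can only be linked to something that has been explicitly registered as an event.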
3. Disease Episodes beyond Disease First Episode (Tumor Registry data) -> Finalize the model proposal to send out for review (Refer to Qi’s diagram)
a. Qi walked us through the Episode Tables
i. For the Disease 1st Occurrence, Qi covered the Episode tables both Disease Episodes and Treatment Episodes.
ii. Disease Progression, Remission, Recurrence all link back to the disease 1st occurrence based on ‘episode parent id’ field
iii. Disease Treatment also links to Disease Episode records using the ‘episode parent id’ field. The Treatment Episode can link either to the Disease First Occurrence record or, if it’s a treatment for recurrence, to the Disease Recurrence record using the ‘episode parent id’.
iv. Per Michael, Tumor Registry data does not code treatment for disease recurrence. It only records treatment for disease 1st occurrence.
v. Different institutions can get their data into this in different ways and it will not always be from the Tumor Registry. It may have to be an algorithm from EHR, chart abstracted locally against their EHR data or they have Oncology EMR which does not need an algorithm.
vi. The episode table entry for a treatment cycle links to the corresponding episode table entry for the treatment episodes via the ‘episode_parent_id’.
vii. Qi will modify the episode_object_concept_id field in the episode table record for the treatment episode to indicate the exact surgery or specific chemotherapy treatment instead of ‘Surgery’ and ‘Chemotherapy’.
viii. You would only ever have a cycle if you had a drug regimen or a radiotherapy regimen and you can populate them, e.g. fraction 1 of the regimen happened on this date, or treatment cycle 1 was administered on this date. If you can’t do that then you won’t have child episodes of the top-level treatment episodes.
ix. For regimens that are a combination of Chemotherapy and Radiation together, there would be 2 episodes. We need to put something in the episode_object_concept_id and there will be other properties that you may want to relate to 1 kind of episode vs another. So, hybrid treatment regimens become 2 different episodes.
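Points vi, viii and ix could look roughly like the sketch below: one treatment episode per modality, so a combined chemo-radiation regimen yields two, with cycle/fraction children created only when their dates can be populated. The `Episode` class, the `kind` labels and `build_treatment_episodes` are hypothetical names, and the modality string merely stands in for whatever ends up in episode_object_concept_id.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class Episode:
    episode_id: int
    kind: str       # hypothetical label: "treatment_regimen" or "treatment_cycle"
    modality: str   # stand-in for episode_object_concept_id
    parent_id: Optional[int] = None
    start_date: Optional[date] = None


def build_treatment_episodes(modalities, disease_episode_id, id_source):
    """modalities maps a modality name to the list of known cycle or
    fraction dates (empty when they cannot be populated)."""
    episodes = []
    for modality, dates in modalities.items():
        regimen = Episode(
            next(id_source),
            "treatment_regimen",
            modality,
            parent_id=disease_episode_id,
            start_date=min(dates) if dates else None,
        )
        episodes.append(regimen)
        # Child episodes exist only when we know when each one happened.
        for d in sorted(dates):
            episodes.append(
                Episode(next(id_source), "treatment_cycle", modality, regimen.episode_id, d)
            )
    return episodes


# Hybrid regimen: chemo cycle dates are known, radiotherapy fractions are not.
hybrid = build_treatment_episodes(
    {"chemotherapy": [date(2020, 1, 27), date(2020, 1, 6)], "radiotherapy": []},
    disease_episode_id=1,
    id_source=count(100),
)
```

Here the chemotherapy episode gets two cycle children, while the radiotherapy episode stays childless because no fraction dates are available.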
Next Steps:
Continue the review of subsequent disease episodes to look at the model proposal to represent Cancer Diagnosis (Condition Occurrence) and Modifiers (Measurement)
-
Analytical Package
a. The package is NAACCR specific -> Creating a source agnostic analytic code for survival and time to treatment would be good for somebody to work on. Peter and Chiara (IKNL) are modifying the package to make it platform agnostic. They plan to share it with the community once they are ready.
-
EHR to OMOP Extension
a. Epic OMOP conversion (Mount Sinai) -> Mount Sinai has EHR data they want to convert to the OMOP Extension
b. MOSAIQ EHR data to OMOP (UNSW) ->
MOSAIQ data/dictionary sharing policies
Michael has worked on MOSAIQ, getting their DD from the NW IT team. Used that to do ETLs not yet into the OMOP data model. Shouldn’t be very difficult to do that.
Creating a private repo and put code in there. At some point someone should reach out to MOSAIQ. May not allow the ETL to be public.
Michael’s code takes radiotherapy treatments and converts them into the number of fractions, the total dosage and the type of radiotherapy treatment. This can easily be mapped to the Episode tables.
Nasreen reached out to the vendor, Elekta, publication of data structure is forbidden. Can share between users of MOSAIQ which is allowed by Elekta. If we develop a common MOSAIQ code, then it needs to be managed in a less than ideal way if not on a private GitHub. Password protected repo. Github allows for private repo and the repo owner would need to explicitly grant permission to access the repo. We could say to Elekta that we won’t add anybody unless we have permission from them.
Robert mentioned EPIC has an internal network where users can share sensitive data. Elekta does have a special site for such purposes. Nasreen can ask them if this is available for us as well.
Elekta is changing their data structure in their next release which might impact the ETL.
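As a rough idea of the kind of roll-up Michael described (fractions, total dose, treatment type per course), a sketch is given below. It deliberately does not use any MOSAIQ table or column names, since the data structure cannot be published; `FractionRecord` and its fields are made-up, generic stand-ins, and this is not Michael's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FractionRecord:
    """Generic, hypothetical shape of one delivered radiotherapy fraction."""
    course_id: str
    technique: str
    delivered_on: date
    dose_gy: float


def summarize_courses(records):
    """Roll delivered fractions up to one summary per treatment course."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.course_id].append(record)

    summaries = {}
    for course_id, fractions in grouped.items():
        summaries[course_id] = {
            "fractions": len(fractions),
            "total_dose_gy": sum(f.dose_gy for f in fractions),
            "techniques": sorted({f.technique for f in fractions}),
            "start_date": min(f.delivered_on for f in fractions),
            "end_date": max(f.delivered_on for f in fractions),
        }
    return summaries


records = [
    FractionRecord("course-1", "external beam", date(2020, 2, 3), 2.0),
    FractionRecord("course-1", "external beam", date(2020, 2, 4), 2.0),
    FractionRecord("course-1", "external beam", date(2020, 2, 5), 2.0),
    FractionRecord("course-2", "stereotactic", date(2020, 6, 1), 18.0),
]
summary = summarize_courses(records)
```

Each summary row would then map naturally onto a treatment episode, with the individual fractions as candidate child episodes.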
-
Episode derivation (Robert M)
a. Shilpa documented the high-level requirements that need to be further solidified. Will incorporate feedback from Michael.
b. Robert has a meeting with Makabi the week of 6/29 to discuss their overall collaboration with OHDSI and also letting Robert use them to develop the algorithm.
-
Drug Regimen list from UNSW
a. Should we do a comparison, or wait to run the study to assess the fallouts?
-
Disease Episodes beyond Disease First Episode (High level tasks)
a. Need to identify standards for the high-level Disease Episodes
b. Need to map NAACCR Values to high-level Disease Episodes
c. Derivation algorithm (For derivation of Recurrence and Progression from EHR)
d. Finalize the model proposal to send out for review (Refer to Qi’s diagram) -> Dev team to review before presenting to the others.
- EHR to OMOP Extension
UNSW (University of New South Wales) is working on converting their EHR data (MOSAIQ system) to OMOP. In the next phase, the plan is to bring in their registry data to OMOP.
UNSW has RADOnc and chemotherapy data within MOSAIQ that they will be mapping in its entirety to OMOP Oncology Model using Python to ETL the data.
UNSW is planning to include NLP in their extraction pipeline, and to automate some of the mapping generation from configuration files.
The CAP Cancer checklists ingestion into the OMOP vocabulary should be a boon to NLP efforts, giving a clear target for NLP operation over pathology reports.
Northwestern uses MOSAIQ for Radiation Oncology and pulls Radiation Therapy and Gamma Knife Surgery data from MOSAIQ. Tufts also has MOSAIQ as a local source (more than just RadOnc). Preliminary work has begun at both Northwestern and Tufts to map MOSAIQ. MOSAIQ is popular so having a generic ETL to pull it into the Oncology Extension would be great.
Tufts has done an ETL to convert MOSAIQ data to OMOP with the exception of the Episode tables. Next step is to convert MOSAIQ to the OMOP extension. Tufts has tumor registry data in MOSAIQ as well. MOSAIQ is proprietary so we will have to keep this in a private repo. Need to confirm this with the vendor; Michael may already have a script out there to convert MOSAIQ data to OMOP. Need to make sure we don’t breach the MOSAIQ contract. Michael and Robert will check with the vendor, Elekta, on what their terms of contract are with regards to sharing data dictionary information etc. Possibly we can open source an abstraction of the data model without referencing it directly.
With the drug regimens, there is a good library in Australia (hematology -> https://www.eviq.org.au/haematology-and-bmt and medical oncology protocols -> https://www.eviq.org.au/medical-oncology). Compare HemOnc.org drugs -> https://hemonc.org/wiki/Main_Page and the drug library from Australia to see if they line up and identify the gaps.
- Create disease Progression/Remission/Recurrence episode #270
What is already complete: a. Some basic ingestion was done for Recurrence and Progression by adding values to the Episode domain. b. The only thing being instantiated from the NAACCR ETL are the disease first occurrence entries.
What needs to be done: a. We need to instantiate ‘Recurrence Episodes’ and/or ‘Progression’. b. Eventually, this needs to be done from non-NAACCR data. NAACCR only has one disease Recurrence variable they track. NAACCR is not a great source for subsequent Disease Recurrence and Progression. c. Do we have a standard way of capturing recurrence and progression in MOSAIQ from a clinical perspective? Nasreen found that doctors were filling in Recurrence in different ways. She will reach out to users to see how Recurrence is being captured to make sure the capture is standardized. Good definitions of Recurrence and Progression were very hard to find, which made it difficult to standardize.
-
The University of New South Wales (UNSW) in Sydney, Australia is working on converting their registry to OMOP.
a. UNSW has RADOnc and chemotherapy data within MOSAIQ that they will be mapping in its entirety to OMOP Oncology Model using Python to ETL the data.
b. UNSW is planning to include NLP in their extraction pipeline, and to automate some of the mapping generation from configuration files. Using NLP will be necessary to normalize Oncology data.
c. Per Michael, we want to move on to non-tumor registry sources like EHR and Claims. Northwestern uses MOSAIQ for Radiation Oncology and pulls Radiation Therapy and Gamma Knife Surgery data from MOSAIQ. Tufts also has MOSAIQ as a local source (more than just RadOnc). Preliminary work has begun at both Northwestern and Tufts to map MOSAIQ and both would be very interested in working on and help mapping MOSAIQ data. MOSAIQ is pretty popular so having a generic ETL to pull it into the Oncology Extension would be great. And being Python-based would not be a problem at all.
d. The CAP Cancer checklists ingestion into the OMOP vocabulary should be a boon to NLP efforts, giving a clear target for NLP operation over pathology reports.
-
The Dutch Cancer Registry is making progress with conversion of their registry data to OMOP starting with the Person table. They are looking for use cases that will drive the tables they will tackle first. Michael suggested the below as starting points
a. First course diagnosis
b. First course treatment (only 1 treatment modality)
-
Update on Drug Regimen Algorithm (Roadmap - Expansion of Drug Regimen algorithm)
a. Anastasios and Henry (IQVIA) are working on improving precision of Chemo Regimen identification and the accuracy of packets.
b. In addition, they are working on making the packets compatible with V6 and registry data.
c. Henry is also working on building infrastructure to calculate treatments and cycles not only regimens but cycles for each regimen. Hoykan has also created cycles for his study.
d. Creating additional package using gold standard phenotype library like cohort build functions and SQL code that can be translated in any dialect.
e. There was a plan to create a framework around package development. Henry is creating a package that is compatible with the Gold Standards and with what Hoykun has developed. We need to be able to utilize both scripts using the framework.
-
Update on studies
a. Bladder Cancer study – MSK; Still working on study to incorporate the ICD codes for MSK. Henry has been working on additional requirements for this study.
-
Create disease Progression/Remission/Recurrence episode and link it to first episode from Tumor Registry #270 -> (Disease Episodes and Treatment Episodes – Regimens and Cycles)
a. Reviewed specification Qi documented to demonstrate the mapping of episodes beyond ‘Disease First Occurrence’ to OMOP tables
b. There is a vocabulary task https://github.com/OHDSI/OncologyWG/issues/123 to determine vocabulary support for episodes beyond ‘Disease First Occurrence’
c. What is already complete:
i. Some basic ingestion was done for Recurrence and Progression by adding values to the Episode domain.
ii. The only thing we are instantiating from the NAACCR ETL are the disease first occurrence entries.
d. What needs to be done:
i. We need to instantiate Recurrence Episodes and/or Progression (Remission is not available in NAACCR). For right now we can use what we have but a more formal job needs to be done by the vocabulary team.
ii. Eventually, this needs to be done from non-NAACCR data. NAACCR only has one disease Recurrence variable they track. (#4 or #2 only from NAACCR data) NAACCR is not a great source for subsequent Disease Recurrence and Progression.
-
Derive surgery episodes #272 & Derive Radiotherapy episodes #271 from EHR-> If we use the OROT vocabulary, we could create Surgery or Radiation Episodes using the NAACCR Vocabulary from EHR or Claims data. MSK is trying to do this now for their COVID patients. Rimma will talk to Michael offline and discuss their approach for the derivation of Surgeries and Radiotherapy.
a. Map low-level procedure codes to high-level treatment codes -> The idea is to have the NCCN compendium so we can move from lower-level codes, such as HCPCS codes, to the high-level, more registry-like codes that are required by RedCap.
b. The NCCN Compendium® contains authoritative, scientifically derived information designed to support decision-making about the appropriate use of drugs and biologics in patients with cancer. The NCCN Compendium is recognized by public and private insurers alike, including CMS and UnitedHealthcare as an authoritative reference for oncology coverage policy.
c. Once the low-level codes are mapped to the high-level codes, derivation for non-drug therapies, like Surgery and Radiation (just like Anastasios’s derivation algorithm is doing for drug therapy) will be easy to implement.
a. Background information -> In the MSK data, 30% of current patients (COVID cohort) who have diagnostic data in Tumor Registry have not been ETLed to OMOP because the current ETL script does not address those diagnostic records that don’t have dates. This could result in loss of information.
b. Tumor registries have a mandate on what’s reportable and what’s not, like benign neoplasm (Not mandated to report to the govt). Some hospital registries want to track all regardless even if they are benign. Don’t know if the 30% reported as not having dates fall in the non-reportable category.
c. NAACCR itself has a date of diagnosis flag (391): if it is blank, there is a date; if the value of this field is ‘12’, the date of diagnosis is unknown. The NAACCR file format allows the diagnosis date to be empty. In cases where the NAACCR flag is 12, it’s okay to put the records in the Observation table.
d. Typically, if there are any other dates like date of therapy, date of surgery then the diagnosis date can be inferred. The question arises when only tumor registry information is available and date information cannot be inferred. What do we do in such cases? How useful is this data at all? Should we have a convention in such cases? Rimma is investigating if a date can be inferred.
e. One suggestion is that only in cases where the date can be inferred will we place the record in the Observation table. Rimma will have more information next week to be able to answer contextual questions. The complete requirement in this case is as below:
i. For the Condition Concepts, store Conditions as observation_concept_id = “History Of” and value_as_concept_id = Condition Concept (ICDO3 combination).
ii. The tumor modifiers are already a combination of attribute and value. There may be some violations of conventions, like storing measurement concepts in the Observation table. However, we can’t really say “History of” + “Grade” + “4” because we don’t have the fields in Observation.
iii. Map all other modifiers the same way we are storing them in Measurement.
iv. Have the connection between Condition and Modifier records in Observation the same way we have it in Measurement, via observation_event_id and obs_event_field_concept_id.
v. Have observation date as the inferred date.
vi. For Tumor measurements we don’t have a good way to identify them to map to Observation
vii. Mixing domains is a concern. Don’t know what can be done. Should we duplicate all the concepts.
f. In cases where the date cannot be inferred, a suggestion may be to write logic to find the Condition Date. If MSK has NAACCR on top of EHR, find the min Condition Date under ICDO in the Condition table using SNOMED codes. This could be the Diagnosis Date, which is derived. This is where we could put the Concept Type ID = ‘derived’. This recommendation involves reconciliation between EHR and Tumor Registry. The Condition table will first have to be populated by EHR/Claims Conditions and then get to Tumor Registry and do the derivation. If there is no EHR data, then it’s a modeling question that needs to be addressed with the bigger modeling team.
g. Another suggestion to explore is Qi’s recommendation to add an additional type concept like “Date not known” -> need to vet this with others.
h. An overall recommendation is to build/make available metadata with concepts about trustworthiness that could be referenced as a routine part of cohort definition. This would allow us to solve issues where certain people and events are less trustworthy, so they can be treated differently. Recording such metadata during ETL or after would be useful.
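The fallback order discussed in c through h (use the registry date when present, otherwise infer from other registry dates, otherwise derive from the earliest EHR condition date, and record where the date came from) might be sketched as follows. Using the earliest therapy/surgery date as the inferred date is an assumption made here for illustration; the notes do not fix the inference rule, and the function and label names are hypothetical.

```python
from datetime import date
from typing import NamedTuple, Optional


class ResolvedDate(NamedTuple):
    value: Optional[date]
    source: str  # "registry", "inferred", "derived" or "unknown"


def resolve_diagnosis_date(registry_date, other_registry_dates=(), ehr_condition_dates=()):
    """Pick a diagnosis date and remember how trustworthy it is."""
    if registry_date is not None:
        return ResolvedDate(registry_date, "registry")

    # Assumption: the earliest therapy/surgery date is the best available proxy.
    others = [d for d in other_registry_dates if d is not None]
    if others:
        return ResolvedDate(min(others), "inferred")

    # Min condition date from EHR/Claims, flagged as derived.
    ehr = [d for d in ehr_condition_dates if d is not None]
    if ehr:
        return ResolvedDate(min(ehr), "derived")

    return ResolvedDate(None, "unknown")


known = resolve_diagnosis_date(date(2018, 4, 2))
inferred = resolve_diagnosis_date(None, [date(2019, 6, 1), None, date(2019, 5, 20)])
derived = resolve_diagnosis_date(None, [], [date(2017, 9, 9), date(2017, 3, 1)])
unknown = resolve_diagnosis_date(None)
```

The `source` value is the kind of trustworthiness metadata that cohort definitions could later filter on, for example by excluding "unknown" records or treating "derived" dates differently.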
a. Bladder Cancer study – MSK; Still working on study to incorporate the ICD codes for MSK. Henry has been working on additional requirements for this study.
b. Colorectal Cancer study – Update. No action needed at this point. If Bayer wants a study built, they will need to fund the effort around building the cohort and study development. The OHDSI community will be engaged at the point where the study needs to be executed on our network of participants/data.
Tumor Registry has a field that indicates recurrence occurred for this primary cancer diagnosis. Right now, we create Treatment Episodes that are child of first Disease Occurrence Episodes. The idea would be if you encounter Recurrence then you would make another entry in the Episode Table which would be the child of Disease First Occurrence. We do have a Disease Recurrence Episode Concept. In the Episode Object Concept, we would just repeat the Diagnosis from the Disease First Occurrence.
For Tumor Registry, we can use the Tumor Registry Concepts because OROT has these mapped from CPT to NAACCR Codes. We can use NAACCR codes to create Treatment Episodes. The OROT vocabulary gives us the ability to identify Surgery Code or Radiotherapy Codes based on the CPT code. If we use the OROT vocabulary, we could create Surgery or Radiation Episodes using the NAACCR Vocabulary from EHR or Claims data. MSK is trying to do this now for their COVID patients. Rimma will talk to Michael offline and discuss their approach for the derivation of Surgeries and Radiotherapy.
Key discussion points:
-
Two follow-up questions based on EU Oncology workshop
- Versioning the CDM extension -> Once integrated with OMOP the Oncology Extension will follow the same versioning as OMOP.
- Versioning of ETL and algorithm -> Plan is to start versioning the ETL code, algorithm and analytical package code in the GitHub repository. The latest version of the algorithm needs to be made available in the GitHub repository as well.
-
Task for Robert M -> Building on the drug regimen algorithm to derive temporal patterns.
-
Other derivation tasks ->
a. Deriving temporal patterns with Surgery and Radiation
b. Deriving the temporal pattern used for disease recurrence and progression
-
Modelling tasks ->
a. Modelling for disease recurrence needs to happen before the derivation of episodes for temporal patterns.
b. The modeling group needs to help us decide if the metastasis information needs to be included as a Measurement of Disease First Occurrence or if it should be a separate Disease Episode entry that is then linked to the first episode (for the first occurrence).
c. Issue #123 relates to the above -> Determine modelling support for the representation of other 'Disease Episodes' beyond 'Disease First Occurrence'.
-
Dealing with duplicate data ->
a. Same person has multiple records for the same tumour with different demographic information (ETL breaks). How should we decide which to use when creating episodes? Do not do this in the ETL. Create some data quality checks in the NAACCR import and have people address it there. Create checks on the NAACCR data points table to indicate if there is data that could cause a problem when inserted. Run the data against the checks and let the data owner figure out how to address it with their source data.
b. Same person has 2 diagnoses on the same date for the same tumour (ETL does not break). Options considered: (1) Pick one without any intelligence (2) Determine which is the complete record (3) Do not handle in the ETL but do it in the script that prepares the data.
c. Same person has multiple records with different tumours (ETL does not break). No action needed.
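A pre-ETL data quality check along the lines of cases a and b could be as simple as the sketch below. The row layout (`person_id`, `tumour_id`, `diagnosis_date`, `sex`, `birth_date`) is a made-up flat shape for illustration, not the actual structure of the NAACCR data points table.

```python
from collections import defaultdict


def find_conflicts(rows):
    """Flag records that would trip up the ETL, without trying to fix them."""
    demographics = defaultdict(set)
    same_day = defaultdict(int)
    for row in rows:
        key = (row["person_id"], row["tumour_id"])
        demographics[key].add((row["sex"], row["birth_date"]))
        same_day[key + (row["diagnosis_date"],)] += 1
    return {
        # Case a: same person and tumour, differing demographics.
        "conflicting_demographics": sorted(
            k for k, seen in demographics.items() if len(seen) > 1
        ),
        # Case b: same person, tumour and diagnosis date more than once.
        "duplicate_diagnoses": sorted(k for k, n in same_day.items() if n > 1),
    }


rows = [
    {"person_id": 1, "tumour_id": "A", "diagnosis_date": "2020-01-01", "sex": "F", "birth_date": "1960-05-01"},
    {"person_id": 1, "tumour_id": "A", "diagnosis_date": "2020-02-01", "sex": "M", "birth_date": "1960-05-01"},
    {"person_id": 2, "tumour_id": "B", "diagnosis_date": "2020-03-01", "sex": "F", "birth_date": "1971-11-30"},
    {"person_id": 2, "tumour_id": "B", "diagnosis_date": "2020-03-01", "sex": "F", "birth_date": "1971-11-30"},
    {"person_id": 2, "tumour_id": "C", "diagnosis_date": "2020-04-15", "sex": "F", "birth_date": "1971-11-30"},
]
report = find_conflicts(rows)
```

The report only lists offending keys; deciding which record wins is left to the data owner, as agreed above. Case c (different tumours for the same person) is intentionally not flagged.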
Below are the next few planned milestones:
Milestone 1 (Existing milestone) - Create methodology for converting cancer data from EHR/Claims -> Task 1 -> Ingest Radiation Therapy and Surgery data and assign them to Episodes (Treatment episodes and disease episodes).
Milestone 2 (New milestone) - Create disease recurrence episode and link it to first episode from Tumor Registry (This will enable use cases related to ‘Time to recurrence’ and ‘Time to progression’) Task -> Modeling question - Should this be captured in the Measurement table as a measure of the Disease First Occurrence or should this be a separate Disease Episode entry that is then linked to the first episode (for the first occurrence).
Milestone 4 (New milestone) - Derivation of Episodes Task 1 -> Derive radiation therapy episodes and surgery episodes from claims. (This will be post ETL algorithm)
Mike and Robert will evaluate if they can adopt the same approach for mapping radiation therapy information to Disease Episodes and Treatment Episodes in OMOP.
Key points regarding running the Bladder Cancer study at MSK:
- The standardization is not clear as MSK is not ETLing their full OMOP CDM. The nature of registry data is different from EHR data. E.g. metastases in registry data are in Measurement, whereas in EHR data we look for secondary malignant neoplasm in Condition_Occurrence and use the OMOP hierarchy in the vocabulary. As nothing is standardized, any script we create would only suit one site.
- With MSK -> the cohort definition does not capture the whole universe of concepts that Rimma was expecting to capture (ICDO, NAACCR). 300 codes have been identified as missing from Anastasios’s cohort definition.
- The top-level SNOMED concepts would include the ICDO concepts that are a part of the hierarchy. Need to build additional branches for the ICDO codes.
- There are about 27 ICDO codes that are malignant neoplasm but they are not connected to the SNOMED code for malignant neoplasm. Currently, ICDO is not updated to conform to SNOMED. A new vocabulary with corrected ICDO will be released this week. Once the new vocabulary is in place, we’ll check if the problem still persists.
- Anastasios’s goal was to create regimen extractions and then, using cohort definitions in registry and EHR data, get the same people; this would validate that the regimens were identified correctly. Bladder Cancer was just a proof of concept. We are not at the point where we have fully developed the fundamentals for testing and validating the algorithms and creating unified cohort definitions.
- Anastasios’s script is characterizing oncology drug regimens for malignant bladder cancer patients, using hierarchical SNOMED codes. Anastasios is looking at CONDITION_OCCURRENCE/CONDITION_ERA entries, not 'Disease First Occurrence' EPISODE entries, which is why his package is not analyzing bladder cancer patients from initial diagnosis but rather from initial observational incidence.
Next steps for the Bladder Cancer Study:
For the short term, take ICDO codes that don’t fully match into consideration until the hierarchy can be solidified.
Bladder Cancer Cohort definition: the cohort was created to include people who have the following: (1) Index date defined as the first condition occurrence of Primary malignant neoplasm of bladder SNOMED (See Below).
Additional inclusion criteria include:
- Age at the index date greater than 18 years.
- At least one drug exposure of the ATC classification (See Below)
| Concept Id | Concept Name | Domain Id | Vocabulary Id | Concept Class |
| --- | --- | --- | --- | --- |
| 196360 | Primary malignant neoplasm of bladder | Condition | SNOMED | Clinical Finding |
| 21601387 | Antineoplastic Agents | Drug | ATC | ATC 2nd |
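For readers who want to see the cohort logic spelled out, a simplified in-memory sketch follows. It assumes the sets of qualifying concept IDs (the two concepts above plus their descendants) have already been resolved from the vocabulary hierarchy; the drug concept ID 999001 is a made-up placeholder, and no timing constraint between the drug exposure and the index date is applied because none is stated in the definition. This is not the actual cohort definition code.

```python
from datetime import date

BLADDER_CONCEPT_ID = 196360           # Primary malignant neoplasm of bladder (SNOMED)
ANTINEOPLASTIC_CONCEPT_ID = 21601387  # Antineoplastic Agents (ATC 2nd)


def age_on(birth, on):
    """Age in completed years on a given day."""
    return on.year - birth.year - ((on.month, on.day) < (birth.month, birth.day))


def build_cohort(persons, conditions, drug_exposures, bladder_ids, antineoplastic_ids):
    """persons: {person_id: birth_date}
    conditions: [(person_id, concept_id, start_date)]
    drug_exposures: [(person_id, concept_id)]
    Returns {person_id: index_date} for everyone meeting all criteria."""
    index_dates = {}
    for pid, concept_id, start in conditions:
        if concept_id in bladder_ids and (pid not in index_dates or start < index_dates[pid]):
            index_dates[pid] = start  # keep the first qualifying occurrence

    treated = {pid for pid, concept_id in drug_exposures if concept_id in antineoplastic_ids}

    return {
        pid: index
        for pid, index in index_dates.items()
        if pid in treated and age_on(persons[pid], index) > 18
    }


persons = {1: date(1950, 1, 1), 2: date(2005, 6, 1), 3: date(1960, 1, 1)}
conditions = [
    (1, BLADDER_CONCEPT_ID, date(2019, 5, 1)),
    (1, BLADDER_CONCEPT_ID, date(2018, 3, 1)),
    (2, BLADDER_CONCEPT_ID, date(2020, 1, 1)),
    (3, BLADDER_CONCEPT_ID, date(2017, 7, 1)),
]
# 999001 is a hypothetical drug concept assumed to sit under the ATC class.
drug_exposures = [(1, 999001), (2, 999001)]

cohort = build_cohort(
    persons,
    conditions,
    drug_exposures,
    bladder_ids={BLADDER_CONCEPT_ID},
    antineoplastic_ids={ANTINEOPLASTIC_CONCEPT_ID, 999001},
)
```

In this toy data only person 1 qualifies: person 2 is under 18 at the index date and person 3 has no antineoplastic drug exposure.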
** Development Meeting Notes – 3/18 **
With the TNM staging issues reported by IKNL, it seems like the same Staging values are in pathological and clinical because in certain scenarios you can use clinical staging if the stage has not been analyzed/declared by a pathological report. So, in some instances, the fallback is to use Clinical staging for pathological staging for TNM variables. There are other examples that we need to look at to confirm the above. Based on this understanding, we need to scope the mapping to the clinical staging variables.
For the Analytical package, MSK still has ETL issues and they are working through them. They will execute the package once the issues are resolved. New task created to incorporate changes from Odysseus and make it dialect independent. We need to code everything in SQLServer and use the SQLServer package to translate it.
New Milestone/Tasks discussed as below:
-
For the Milestone - Creating methodology for converting cancer data from EHR -> Ingest Radiation Therapy and Surgery data and assign them to Episodes (first disease occurrence). Create Diagnosis Modifiers from non-registry data. Create disease episodes and treatment episodes from EHR data for Radiation Therapy and Surgery (Anastasios -> Map regimens to HemOnc.org so the algorithms can be run).
-
Explore a new Milestone - Create a disease recurrence episode and link it to the first episode from Tumor Registry (direct mappings or algorithms, i.e. the next generation of algorithms)/Claims*/EHR*. Use cases: time to recurrence and time to progression. A modeling question to explore is whether recurrence should be captured in the Measurement table as a measure of the Disease First Occurrence or as a separate Disease Episode entry that is then linked to the first episode (for the first occurrence). *possibly
-
Explore a new Milestone - Creating methodology for converting cancer data from Claims -> Ingest Radiation Therapy and Surgery data and assign them to Episodes (first disease occurrence). Create Diagnosis Modifiers from non-registry data. Derive radiation therapy episodes and surgery episodes from claims. (This will be a post-ETL algorithm) -> Michael will point Ning to existing work that has been done to ingest OROT surgery and radiation.
For the Drug Regimen Extraction Algorithm ->
Anastasios is in the process of adding results from NU, Tufts and IQVIA in a report. Based on the results, Anastasios has documented a list of issues with each partner’s results. He plans to reach out to them and discuss further.
Henry and Anastasios are adapting the Oncology Regimen Package to take Measurements and the NAACCR Vocabulary into consideration in order to identify metastasis. This is because in some registry data metastasis cannot be found in the Condition_Occurrence table as currently coded: only the primary cancer is added as a Condition, and the metastasis is recorded in Measurement, so the package needs to be modified. This is true for MSK, where this information is stored in Measurement. The modeling group needs to help us decide whether the metastasis information should be included as a Measurement of Disease First Occurrence or as a separate Disease Episode entry that is then linked to the first episode (for the first occurrence).
** Development Meeting Notes 3/11 **
Peter (Netherlands) is working on adding TNM to the ETL -> Per Michael’s recommendation, they will do the mapping using NAACCR variables. This is the approach that others have adopted for the time being. There are limitations with using NAACCR: it was populated from UICC ver 7, which was available through the SEER API, so any new codes added in AJCC 8 and UICC 8 are not there. For a future solution, the team is coming up with a standard vocabulary for tumor characteristics. The plan is to use SNOMED as the target and NAACCR and LOINC as source vocabularies. At this point we all want to use one method so that results can be compared across sites.
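Clinical staging variables (NAACCR item number and name):
- 940 TNM Clin T
- 950 TNM Clin N
- 960 TNM Clin M
- 970 TNM Clin Stage Group
Pathological staging variables (NAACCR item number and name):
- 880 TNM Path T
- 890 TNM Path N
- 900 TNM Path M
- 910 TNM Path Stage Group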
Question: Peter asked if there is a way to indicate which TNM version was used
Answer: Michael mentioned that there is a TNM # NAACCR variable. See link below. In the future, we would want to create pre-coordinated Concepts (SNOMED, TNM version 8 etc.) that carry the version/edition. Currently there is no way to indicate the version.
http://athena.ohdsi.org/search-terms/terms/35918875
Question: In the future, can we indicate whether UICC or AJCC?
Answer: Michael mentioned that they talked to AJCC and there are small differences between the two. We have to figure out whether AJCC potentially becomes the standard and UICC the source, so that we can map UICC to AJCC. This is still unclear. AJCC told us that the differences from the UICC classification are very minimal.
Hoykun presented the Chemotherapy Extraction Algorithm, which extends the extraction further to calculate treatment pathways. NU and Tufts can run Hoykun’s algorithm. Link to the presentation: https://github.com/OHDSI/OncologyWG/blob/master/documentation/Clinical%20Characterization%20of%20Cancer%20Treatment%20Using%20OMOP.pptx
Anastasios is reviewing the results of executing the drug extraction algorithm and also the bladder cancer study from Columbia, Tufts and IQVIA. NU is still in the approval process for sharing the results on their data.
The analytical project is complete in its current state and includes survival and death from tumor registry data. Plan is to collect the results and describe the changes made since the symposium. Once Ray is done with the Oncology OMOP, they can run the analytical package. MSK also plans to run the algorithm shortly.
Andrew talked about key discussion points related to supporting best practices in research generated by the Oncology WG:
-
The reason for the meeting was to make sure the analysis is performed in keeping with the OHDSI analytical practices and statistical methods. Also, to allow for leaders and SMEs in analytical methods to weigh in on how to better the package so it can be presented to the world.
-
Plan is to repeat the Kaplan Meier curves -> do more sensitivity analyses than were done, to understand the impact of some aspects of the data. For instance, a large number of people do not have an event during the observation period.
-
Next Steps: Try out a method called Deep Survival analysis -> Evaluate the suitability of this approach. Advantage of this method is it fully leverages a wide range of covariates that are available on these data sets.
Compare and contrast the results, challenges and potential advantages of the 2 approaches and weigh in on which of them we should be presenting to the world.
Once the best practice is in place, we can review what needs to be improved, what needs to be changed going forward. No more immediate work required for the development group. The expectations are that the best practice group will guide what the next generation of the use case development would be.
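For readers less familiar with the Kaplan Meier curves mentioned above, the sketch below shows how people without an event during the observation period (censored observations) enter the estimate: they count in the at-risk denominator until their follow-up ends but never trigger a drop in the curve. This is a generic textbook estimator written for illustration, not the code of the analytical package; the function name is hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time per person.
    events: True if the event was observed, False if the person was censored.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        observed = 0   # events at time t
        leaving = 0    # everyone whose follow-up ends at time t
        while i < len(pairs) and pairs[i][0] == t:
            observed += 1 if pairs[i][1] else 0
            leaving += 1
            i += 1
        if observed:
            # Product-limit step: S(t) = S(t-) * (1 - d / n)
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Censored people leave the risk set without changing the curve.
        at_risk -= leaving
    return curve
```

With many censored people the risk set shrinks quickly, which is one reason sensitivity analyses around the observation period are worth repeating.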
** Development Meeting 2/26 **
-
IKNL questions
a. Adding modifiers/measurements -> TNM stage; In the CDM proposal, Modifiers are linked to Procedure_Occurrence, Condition_Occurrence and Episode_Events. Michael clarified that it should be Episodes and not Episode_Events.
b. How should TNM staging information be added? The NAACCR ETL creates a Condition_Occurrence entry and, for every tumor modifier in the NAACCR data (such as TNM or grading), a Modifier in Measurement that points to that Condition_Occurrence entry. It also creates a duplicate of each diagnosis in the Episode table and a duplicate set of Measurement entries that point to the Episode. This allows analysis at the Episode level as well as at the Condition_Occurrence level.
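A minimal sketch of the duplication described in (b), for a single diagnosis. It is not the NAACCR ETL itself: the function name, the fixed IDs, and the two field labels are hypothetical placeholders (the real ETL uses concept IDs for `modifier_of_field_concept_id`), and using the diagnosis date as the measurement date is an assumption.

```python
# Placeholder labels, NOT real OMOP concept IDs (hypothetical, for illustration only).
FIELD_CONDITION_OCCURRENCE = "condition_occurrence.condition_occurrence_id"
FIELD_EPISODE = "episode.episode_id"


def etl_diagnosis(person_id, condition_concept_id, diagnosis_date, modifiers):
    """Build the rows produced for one diagnosis.

    modifiers: list of (measurement_concept_id, value) pairs, e.g. TNM or grade.
    """
    condition = {
        "condition_occurrence_id": 1,
        "person_id": person_id,
        "condition_concept_id": condition_concept_id,
        "condition_start_date": diagnosis_date,
    }
    # The Episode mirrors the Condition_Occurrence entry.
    episode = {
        "episode_id": 1,
        "person_id": person_id,
        "episode_object_concept_id": condition_concept_id,
        "episode_start_datetime": diagnosis_date,
    }
    measurements = []
    # Each modifier is written twice: once pointing at the condition,
    # once pointing at the episode.
    for field_label, target_id in (
        (FIELD_CONDITION_OCCURRENCE, condition["condition_occurrence_id"]),
        (FIELD_EPISODE, episode["episode_id"]),
    ):
        for concept_id, value in modifiers:
            measurements.append({
                "measurement_id": len(measurements) + 1,
                "person_id": person_id,
                "measurement_concept_id": concept_id,
                "value_source_value": value,
                "measurement_date": diagnosis_date,
                "modifier_of_event_id": target_id,
                "modifier_of_field_concept_id": field_label,
            })
    return {"condition_occurrence": condition, "episode": episode, "measurement": measurements}
```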
-
Analytical package
a. Michael ran the analytical package on NU data. For the 2 plots (1) Time to Survival from Diagnosis (2) Time to First Treatment, #1 is creating a plot, #2 creates an empty page. These results are like what Robert observed. Michael and Robert will work with Meera as it seems to be a problem with the code.
b. Once Ray is done ETLing the Columbia data, they will run the package. The MSK conversion will be done by the end of Feb, and MSK plans to run the analysis by mid-March.
-
Drug Regimen Algorithm
a. The algorithm creates the following cohorts: (1) Patients with Bladder Cancer, (2) Bladder Cancer with metastasis to Liver (SNOMED codes translated from ICD10 and ICD9 codes) and (3) Bladder Cancer with metastasis to any site. It populates comorbidities, identifies regimens, analyzes oncology regimens and lines of treatment, and identifies which kinds of regimens were used in first line. Anastasios is going to prepare a report by the end of this week.
b. Columbia (Thomas) and Tufts (Robert) have run the algorithm and populated Regimens. Both sites have executed the Bladder Cancer study and are troubleshooting issues.
c. Anastasios has worked to make the algorithm database agnostic (SQLServer). Anastasios is going to upload latest SQLserver to github along with populated results.
d. Chan and Hoykun (Korea) have also created a drug regimen algorithm, which was presented in the OHDSI Community meeting on 2/25. They have invited participants to run the algorithm. The package takes Diagnosis and Regimens as input. The cancer type is parameterized and is passed as an input when running the package. The algorithm has been run on a sample for breast, colorectal and lung cancer. Package contents are on the github page. Hoykun will present the package in the next Development Subgroup meeting.
e. The plan is to merge the algorithms together in a package -> Users can select the algorithm they want to run. There is also a plan to create a framework that allows adding an Oncology identifier and testing it.
f. Anastasios’s algorithm takes the End dates of the Drug_Exposure. If Drug Exposure End Date is populated, then it considers it. Hoykun’s algorithm also uses the Drug_Exposure table and uses Start and End date to extract regimen records. Episode_Start is from Drug_Exposure_Start and Episode End is from Drug_Exposure_End date. Algorithm assumes all data has end date. Hoykun will explore how to handle when Drug_Exposure_End date is not present.
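One possible way to handle a missing Drug_Exposure_End date, shown only as a sketch: fall back to the start date of the same exposure when the end date is empty. This fallback is an assumption made for illustration; it is not what either algorithm currently does, and the function name is hypothetical.

```python
from datetime import date


def episode_window(exposures):
    """Derive (episode_start, episode_end) from the drug exposures of one regimen.

    exposures: list of dicts with drug_exposure_start_date and
    drug_exposure_end_date (the latter may be None).
    """
    starts = [e["drug_exposure_start_date"] for e in exposures]
    # Assumed fallback: a missing end date is treated as a one-day exposure.
    ends = [
        e["drug_exposure_end_date"] or e["drug_exposure_start_date"]
        for e in exposures
    ]
    return min(starts), max(ends)
```

Other fallbacks (for example deriving the end from days_supply) may fit some sources better; the choice would need to be agreed as a convention.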
** Development WG meeting notes 2/19 **
Peter from IKNL shared their ETL spec and went through their questions with the WG. (1) Episode_Object_Concept_ID takes its value from the ‘Condition’ domain in case of the ‘Disease Episode’ and from the ‘Procedure/Treatment’ domain in case of ‘Treatment Episodes’. There is no such domain as ‘Procedure/Treatment’. It should either be ‘Procedure’ or ‘Regimen’. This needs to be fixed in the documentation.
(2) There is no ‘Episode of care’ domain -> This was a speculation. No one has tried to implement this yet. There is a ‘Episode of care’ concept_class. There is no use case yet.
(3) Started looking at Treatments. They are defined by our cancer registry, so there are no standard vocabularies; looking at translation to standard vocabularies. Ran into the following problems -> ECEP -> Chemo Therapy. In Athena this is a regimen. Single drugs in HemOnc are monotherapy drug regimens. If drugs are not found, then we could follow up with HemOnc to include them. Anastasios -> We can help map regimens to HemOnc regimens. Currently exploring the challenges with mappings of Regimens.
(4) Converting NKR to OHDSI is part of a bigger project to make NKR FAIR. FAIR principles require globally recognized identifiers, but in OHDSI, IDs in a table are always integers. Not sure how that would work with globally recognized identifiers. FAIR applies to meta-data (master patient index). OMOP is only 1 data representation, and it’s hard to have the same unique ID in OMOP and EHR. Create a Master patient index that would serve to map all systems into this patient master index. -> Non issue
(5) Gender concepts -> IKNL has gender ‘hermafrodiet’ but OMOP has only male and female. Is there a specific reason why there is no ‘Other’? If there is a use case, then this can be added. Peter to submit an issue in the CDM repo to request this.
(6) Race and Ethnicity entries are mandatory and IKNL does not have this information. The guidance is that these can be ignored and set to a 0 value.
(7) Episode_start_datetime -> For diagnosis this would be the date of incidence. Per Michael, for disease first occurrence, it’s the time the diagnosis was established to begin, i.e. there is confidence that the disease actually began. In the US, tumor registrars do a good job at identifying first occurrence. The abstractors decide what the initial diagnosis date is. It’s not a precise date and is left to interpretation. At IKNL, there is a list of rules to identify incidence dates. These rules can be used to populate the date. For disease first occurrence there is no end date. OHDSI has not formalized this yet; the end date was more targeted at treatment. For disease it’s more important to concentrate on the start date. Discuss disease end date further.
(8) Episode type concept id -> IKNL takes data from EHR; they would choose 'episode derived from registry'. All registry data is derived from EHR data. The idea is to record the most immediate source of the episode.
(9) Adding treatments: treatments are children of a diagnosis episode. Because treatment is a result of a diagnosis, the link is needed to calculate time-to-treatment type of use cases.
(10) For the Measurement (TNM values), the issue is that the measurement datetime is a required field, but IKNL does not have it; the measurements are linked to the diagnosis, and IKNL has the incidence dates only. For NAACCR, we take the actual diagnosis date for the tumor characteristics. The measurement date cannot be 0; 0 is only for Concept_ID fields. If you don’t know the date, the record becomes an Observation record, and you can put the date of observation. If a date is unknown, it is considered more of a history. The observation datetime is the time of the actual assessment: today you know the history, so that becomes the observation date. We don’t have vocabulary support to map from a measurement staging TNM concept to a history-of-staging TNM concept. This needs to be discussed further to come up with a convention. Take this guidance as an experimental work-in-progress.
(11) All treatments are treatment regimens. No treatment cycle is known. Treatments start is first date of treatment. Episode end date is not nullable. So it can be left as null.
(12) In the Procedure table there are a procedure_type and a modifier concept_id, so you can have a procedure and a modifier to the procedure. The modifier can also be 0 if there is no modifier to the procedure. It is strange that the modifier concept id is required; it shouldn’t be. This can be discussed further, including why it’s required.
(13) In morphology there is an option 'unknown'. What should we do with this? This is an existing vocabulary issue. In NAACCR, sometimes only an anatomical location is recorded, e.g. brain cancer but not what type. We have created a combination of anatomic site and unknown that it can be mapped to. Dima has created all combinations of unknown for morphology. Need to add it to the ETL documentation. Need to implement this in the NAACCR ETL.
(14) For data standardization: drug regimens map to HemOnc; non-drug treatments (radiation therapy and surgery) map to NAACCR or SNOMED. If the US and EU don’t map in a similar way, then research/studies will be difficult. The goal is that radiation therapy would be mapped to the same concept that US tumor registry data gets mapped to, so research can be uniform. We don’t know how much coverage NAACCR provides. We are going to map NAACCR to standard; it would be more maintenance if the EU mapped to something else that then had to be mapped to NAACCR.
Odysseus unit testing questions:
(1) Gap between the existing unit tests produced by the WG and initial QA/Data Quality testing. The existing unit test framework adds dummy data to the NAACCR data points, runs the ETL to create the CDM tables and checks that records are created in the CDM tables.
(2) The other things that Odysseus will need -> Need to do an assessment of the source data to assess the completeness of OMOP. The first thing is a test of completeness. We have so many variables with so many values; the test checks what’s in the source and what’s in the target. Because a lot of this is automated and there is a direct mapping between source and target values, we could come up with data quality tests. There is also a post-ETL script that compares the source and target to make sure every variable is mapped: has everything in the NAACCR data point table landed somewhere in the CDM?
(3) There are some potential problems with how vocab was ingested. This could impact how the data is converted. This is a global issue. We had some small examples, based on collapsing schemas where things would get lost. This was a small impact based on Tufts and NU. Once Odysseus is finished with NAACCR they could run this post script to make sure there is no loss of data. The script is available in the Git repo. Michael will send the link to Odysseus.
(4) Data Quality Dashboard needs to be implemented fully for Oncology.
(5) Next level of testing for Odysseus would be to run the analytical package.
(6) Jim asked about the technical specification of the NAACCR ETL which is in progress. Rimma and Asieh are working on the crosswalk between NAACCR and OMOP and should be ready this week. Rimma and Jim will re-group to walk through the crosswalk to make sure everything is mapped correctly.
**Meeting Notes 2/12 **
Jim and team shared their improvements to the NAACCR ETL
-
The changes they implemented primarily impacted the performance of the execution time as well as fixed some bugs. Documentation of the changes can be found in the link below.
-
The improvements reduced runtime from 11 hours to 1 1/2.
-
The Odysseus team is going to make a pull request. Jim will assess whether they can make the pull request using SQLServer so that SQLRender can translate it. The steps will be to make any changes in PostGres, run them against the unit tests, then migrate the changes to SQLServer, let SQLRender translate from SQLServer to all 3 versions, and re-run the unit tests.
Tatiana and Natalia walked us through the alternate way of identifying Oncology regimens to build Treatment Episodes. The idea behind this approach is based on HemOnc.org website.
-
For each cancer diagnosis, there are all possible therapies that can be assigned to a patient. If you navigate to a therapy, it says what the duration of the cycle should be, combination of medicines in that cycle, which day of cycle which medication should be given etc.
-
The idea is to check original data (drug_exposure fields), take cancer diagnosis and see if in drug_exposure we have cycles that match with HemOnc website. If source data is of good quality, there will be cycles in drug_exposure table. There could be discrepancies with pattern of cycle so the first check should be to see if drug_exposure contains cycles and then build look-up table for regimens where information from HemOnc website is stored.
-
The lookup is joined with drug_exposure table to see if same cycles can be identified. When there were exact matches, exact episodes and enumerated episode within the cycles can be identified. When order was not a match, those cycles were excluded.
-
In summary, HemOnc was used to build lookup table, join the lookup table to drug_exposure, identify the same regimens.
-
If we have the information from HemOnc in csv files with regimens, cycle duration etc., the majority of the treatments can be identified in the EHR data. Jeremy has provided this information; the team is processing it to see how it can be incorporated in the Vocabulary tables. This needs to be tested with more examples of regimens.
-
The next step is to collaborate and consolidate the efforts around algorithms. This will be discussed further when the Developer workshop convenes in London on the 30th of March.
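The lookup-and-join idea presented by Tatiana and Natalia can be sketched roughly as follows. Everything in the lookup is a made-up placeholder (`regimen_x`, `drug_a`, `drug_b`, a 21-day cycle), not HemOnc content, and the matching rule (a cycle counts only if its ingredient/day pattern equals the expected schedule exactly) is a simplifying assumption about how "exact match" was applied.

```python
from datetime import date

# Hypothetical lookup built from regimen descriptions: cycle length in days and
# the expected (ingredient, day_of_cycle) pattern. Placeholder values only.
REGIMEN_LOOKUP = {
    "regimen_x": {
        "cycle_days": 21,
        "schedule": {("drug_a", 1), ("drug_b", 1), ("drug_b", 8)},
    },
}


def match_cycles(exposures, regimen):
    """Return the 1-based cycle numbers whose exposures match the schedule exactly.

    exposures: list of (ingredient, exposure_date) for one patient.
    """
    spec = REGIMEN_LOOKUP[regimen]
    if not exposures:
        return []
    first_day = min(d for _, d in exposures)
    observed = {}
    for ingredient, exposure_date in exposures:
        offset = (exposure_date - first_day).days
        cycle_no = offset // spec["cycle_days"] + 1
        day_of_cycle = offset % spec["cycle_days"] + 1
        observed.setdefault(cycle_no, set()).add((ingredient, day_of_cycle))
    # Cycles whose pattern differs from the lookup are excluded.
    return sorted(n for n, seen in observed.items() if seen == spec["schedule"])
```

Real data would need some tolerance for delayed cycles and dose holds, which this sketch deliberately leaves out.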
** Meeting Notes 2/5 **
Michael and Robert will meet to discuss the ETL specific documentation. The layout of the ETL documentation needs to be re-arranged to split out the NAACCR ETL from EHR ETL. Tatiana will provide content for the non-NAACCR ETL section.
Tatiana walked us through the ETL from EHR to Oncology Extension as it pertains to the IQVIA Oncology data asset.
For the building of the Disease Episodes:
- Because the source data did not distinguish between first occurrence, recurrence or remission, an assumption was made that all episodes were first occurrence. The first diagnosis date of cancer is disease first occurrence.
- Source data did not have ICDO codes, so to map cancer diagnosis to ICDO vocabulary, the mapping was derived by concatenating ico_code+histology_code+source_histology_desc+behavior from the source data.
- The source data had cancer diagnosis and modifiers. The condition_occurrence was created with condition source value as ICDO concept code.
- The corresponding modifiers were created in the MEASUREMENT table. The 2 additional columns that the Oncology Extension adds to the MEASUREMENT table (modifier_of_event, modifier_of_field_concept_id) were used to link to the Condition_Occurrence table.
- In cases where there was no histology or behavior information in the source data, a look-up table was created to store a list of all the ICD9CM and ICD10CM codes. Episodes are built not only for condition_occurrence records where condition_source_value was ICDO but also for records where condition_source_value was a cancer code built using the lookup.
- For the Episode table, cancer diagnoses for the same patient were used. The minimum associated condition_start_date was used for the episode_start_datetime. Even though the Oncology Extension requires the episode_end_datetime, there was no information in condition_occurrence_end_date, so the episode_end_datetime was left empty. Episode_source_value and episode_source_concept_id were also left empty, because there are cases where we have 2 condition_occurrences, one for ICDO and the other for ICD9CM, and we don’t know which value to take. The episode_object_concept_id was populated with the corresponding condition_concept_id.
- Episode_event links each episode to the corresponding condition_occurrence_id from the Condition_Occurrence table and to the Modifiers from the Measurement table.
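The episode-building rules in the list above can be sketched as follows. This is an illustration over in-memory rows, not the IQVIA ETL: the function name is hypothetical, and grouping by person and condition_concept_id is an assumption about what "cancer diagnosis for the same patient" means in practice.

```python
from datetime import date


def build_disease_episodes(condition_occurrences):
    """Return (episodes, episode_events) for 'Disease First Occurrence' episodes."""
    episodes = {}
    episode_events = []
    # Sorting by start date makes the first row seen per group the minimum date.
    for row in sorted(condition_occurrences, key=lambda r: r["condition_start_date"]):
        key = (row["person_id"], row["condition_concept_id"])  # assumed grouping
        if key not in episodes:
            episodes[key] = {
                "episode_id": len(episodes) + 1,
                "person_id": row["person_id"],
                "episode_object_concept_id": row["condition_concept_id"],
                "episode_start_datetime": row["condition_start_date"],
                "episode_end_datetime": None,   # no end date in the source
                "episode_source_value": None,   # left empty, as in the notes
            }
        # Link every contributing condition_occurrence to its episode.
        episode_events.append({
            "episode_id": episodes[key]["episode_id"],
            "event_id": row["condition_occurrence_id"],
        })
    return list(episodes.values()), episode_events
```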
For building of the Treatment Episodes -> the script that Anastasios created was used.
Next steps/explore further
(1) One of the participants pointed out that the Concept_name -> Disease_First_Occurrence may not be the best representation as there is a difference between abstracted/confirmed diagnosis vs the first occurrence of the data. It may not be a confirmed diagnosis.
(2) In situations where we have Tumor Registry Data and we supplement it with EHR data, we may need to plan for how to reconcile the 2 together and decide what to take from where.
(3) Now that we have tackled Tumor Registry Data, we should try to supplement it with EHR data (disease occurrence, recurrence etc.). This will allow us to get from EHR data all the cancer diagnoses that are in the patient population but not in the tumor registry.
(4) The analytical package can be run on the Oncology Extension generated from the EHR but in a very limited scope. Since the data captures Disease First Occurrence and Disease First Line Treatment, time to treatment and survival analysis can be run. There are 2 approaches for the analysis -> Cancer type and metastatic/non-metastatic. For the time being until a change is made, analysis can be run using the Cancer Type approach. In the future, the package can be enhanced as it’s doing NAACCR specific queries on the data. We also need to look for end result rather than how it looks in NAACCR.
** Meeting Notes – Development Subgroup 1/29 **
Closed Issues
Issue #111 Issue #239 Issue #73 Issue #198 Issue #199
In-Progress
Issue #241 – should be completed this week and communicated to Ray (Columbia) along with the issue Issue #239 that is already complete.
Issue #114 – Database parameterization, platform independent. Have one SQL that gets translated through the OHDSI R package that handles that using SQLRender. Package can be tested at MSK once the ETL work is done which is scheduled for middle of Feb.
Issue #115
- Thomas (Columbia) – We had problems adapting RedShift queries to SQLServer 2014. Able to fix most of the issues. Should have the results soon.
- Chan (Korea) – Discuss with Chan to find a way to produce results that evaluate the matching ratio of the algorithm compared to notes from the physicians.
- Robert (Tufts) – Updating the vocabularies is complete. Should have executed the query today or tomorrow.
Issue #116 At this point it seems like we don’t have the level of maturity needed to validate at the level at which it is required. For the sake of scoping it out, we can compare the algorithm results to notes from the physicians. At a basic level we can check whether the number of regimens matches, the ingredients in each regimen match, and the begin and end dates match the chart-abstracted data. There could be many reasons for a mismatch, and a mismatch is not by itself an indication that the algorithm is flawed.
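The basic checks listed for Issue #116 could look roughly like this. It is a sketch under assumptions: regimens are paired by start-date order, the record layout (`ingredients`, `start`, `end`) is hypothetical, and no tolerance is applied to dates.

```python
from datetime import date


def compare_regimens(derived, abstracted):
    """Compare algorithm-derived regimens with chart-abstracted regimens for one patient.

    Each regimen is a dict with 'ingredients', 'start' and 'end' (hypothetical layout).
    """
    derived = sorted(derived, key=lambda r: r["start"])
    abstracted = sorted(abstracted, key=lambda r: r["start"])
    report = {
        "count_match": len(derived) == len(abstracted),
        "ingredient_matches": 0,
        "date_matches": 0,
    }
    # Pair regimens by order of start date (assumption); extras are ignored.
    for d, a in zip(derived, abstracted):
        if set(d["ingredients"]) == set(a["ingredients"]):
            report["ingredient_matches"] += 1
        if (d["start"], d["end"]) == (a["start"], a["end"]):
            report["date_matches"] += 1
    return report
```

A report like this only counts agreements; interpreting the mismatches still requires review, since they can have many causes.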
When we algorithmically derive data, we make assumptions. We must have a framework to be able to test results as it’s difficult to work with partners. (1) OMOP version differences (2) Vocabulary version differences (3) Different SQL dialects. Propose a framework of creating algorithms in a way that it can be used across different partners. This topic has been added to the agenda to discuss in the EU Development Workshop to be held in Oxford around the EU symposium.
Issue #169 – Waiting on Christian to complete the task of consolidating the type concepts. Once that’s done, the hardcoding can be removed.
On a separate note, so far, we have only created Drug Regimen derivation from EHR data. Idea is eventually to create disease episodes, recurrence episodes, tumor characteristics from the EHR data. The analytical package is not created to run on Oncology EHR data as it’s hard coding NAACCR concepts. Invite IQVIA team to present their challenges in creating Oncology Extension from EHR data.
*** Meeting Date: 1/22/2020 ***
- Scope of the Analytical package is to refine the functionality of the symposium package as it relates to the original use case of plotting the survival curves and time-to-treatment distribution. Keeping that in mind, below is my summary:
What's complete:
- Integrated R scripts into a package (Onco package). Everything from the prior scripts was integrated into the 2 functions (plot_survival and plot_time_to_rx_hist). This reduced the number of steps needed to render the plots.
- Function plot_time_to_rx_hist() is the same as before with the exception of the database connector functionality which is new. The SQL script is now integrated in the function.
- time_dx_to_rx.sql was not modified
What's remaining:
- Test the database connector functionality at other sites.
- Customize functionality based on the testing at different sites.
- Indicate which schema the table is coming from using a schema argument possibly.
Future enhancement (OUTSIDE the scope of this Milestone):
-
To support the pathway function in ATLAS, set multiple outcomes using different medications to show the curve for different stages of cancer patients.
-
Depending on use cases, make functions less restrictive so they are more generic based on end user needs.
-
For the Drug Regimen Algorithm, Chan will share results of running the algorithm against their data. Anastasios will compare and contrast them with his own results.
*** Meeting Date: 12/18/2019 ***
In-progress issues:
-
Update on Columbia - Michael and Robert are making changes and will have the final code to Ray after unit tests are completed as new issues have been discovered during unit testing. Michael will ask Ray to hold-off on debugging any further until we fix the issues discovered in unit testing.
-
Issue #116 The next steps discussed were to do chart abstraction from notes and discrete data and compare it to the output of the algorithm. A couple of institutions signed up and will attempt this comparison. The group will check back at the end of Jan to see where we are with the chart abstraction comparison. The validation will be only on the first version of Anastasios’s algorithm, without going into the cycle of scheduling and dosing. Additional information from HemOnc.org will require further vocabulary work for scheduling and dosing. Jeremy has shared the latest version of HemOnc.org. Henry and Anastasios are analyzing the data to decide how to potentially implement it in the OMOP CDM. This will be included in the 2nd version of the algorithm.
-
Issue #73 Almost done. Next step is to put it into Github.
-
Issue #133 This task will be used to record any bug fixes based on feedback from partners that are running the algorithm. Chan and Asieh are both executing the algorithm and should be able to share their results soon.
-
Issue #198 Michael has added additional unit tests for other tables. The issue with schema-dependent codes has been fixed and pushed to a branch. Other issues have come up with refactoring numeric variables and being able to support non-standard staging variables. The next step is to fix these as well as continue to finish the other tables. For MSK’s ETL to the Oncology extension, the Odysseus/MSK team identified 30 additional variables. Shanta will share them with Michael so he can manually insert the mappings into the regular script. Anyone else having metric data will have them.
-
Issue #199 The generic SQL is translating to Oracle, RedShift and SQL. Issue can be closed. Once the code is used, issues that arise can be recorded as bug fixes.
-
Issue #152 Documentation framework and roadmap developed by Robert. It is currently out for review to the Leadership team.
-
Issue #114 Need an update from Chan and Meera on the status of this issue. Meera is going to integrate SQL with OHDSI database connector package to test out the package.
-
Issue #111 Assigning to Qi. There is a need to figure out how we want to move forward with some of the stuff that’s dependent on how the CDM is structured for example Vital Stats being one of them. The code at the end of the ETL script is what updates the observation period, that’s the part that needs to be looked at. We want to figure out whether we want to put that into a new script, that handles situations where there is other data besides NAACCR data and also situations where there is only NAACCR data.
-
Issue #220 - Missing link between ICDO and schema. There are some NAACCR items that should be precoordinated with the schema but they are not.
*** Meeting Date: 12/11/2019 ***
Attendees: Robert Miller; Anastasios Siapos; Yang, Qi
Completed and closed issues:
- Issue #213 BIGINT is appropriate for clinical primary keys like person_id, visit_occurrence_id, procedure_occurrence_id. The issue was that BIGINTs were also applied to the '_concept_id' fields. Michael has corrected and closed the issue. Same with Issue #215.
- Issue #212 This issue is to make sure the NAACCR parser format of naaccr_data_points table is the same as what exists in the etl folder
- Issue #209 This issue was resolved on the R parser side based on feedback from Columbia.
- Issue #211 Fix reference to 'naaccr_data_points' to 'naaccr_data_points_tmp'
- Issue #210 Remove hardcode filtering of naaccr_data_points_tmp based on naaccr_data_points_tmp.naaccr_item_value <
In-progress issues:
-
Issue #73 Need to finalize the decision to insert the treatment 'did not happen' concepts into the Observation table and discuss what date needs to go into the observation_date field, which is a required field. The Development group’s recommendation is to use the date of diagnosis.
-
Issue #198 Michael has written unit tests for Condition, Episodes, Diagnosis Modifiers for schema dependent and non-schema dependent. Unit tests for variable dependent are in process.
-
Dima brought up the issue where a variable is site non-specific but its values are site specific. Michael and Robert verified that the current ETL code is doing the mapping.
-
Issue #133 Chan has an updated version of the SQL query. Chan did not have the most updated vocabulary, so he was unable to use HemOnc relationships; this has been fixed as well. The same applies to Asieh, who also needs the HemOnc relationships. Anastasios will be fixing that for her as well. This is the last step in the execution of the algorithm. Both Chan and Asieh should be able to share the results with us at some point. Also, as a next step, Anastasios will create a repo for the drug regimen algorithm.
-
Issue #199 Dave to test Oracle and Redshift through the SQLRender with NAACCR data. Raw translation from one flavor to another has been completed. Michael, Robert and Dave will meet on Thur to finalize this task. Dave is also going to upload the NAACCR sample data so others can use it as well. Dave also created a bat file that takes the SQL Server version, runs it through the SQLRender and outputs the 3 versions (Oracle, RedShift and SQL).
-
Issue #152 Documentation framework and roadmap is being developed by Robert.
-
Issue #114 Need an update from Chan and Meera on the status of this issue. Meera is going to integrate SQL with OHDSI database connector package to test out the package.
-
Issue #116 Meeting with Jeremy has been scheduled for Friday 12/13.
-
Issue #111 This issue is related to the gap (person and observation period) between what the ETL produces and what is needed to make the plots. There is a stand-alone script that handles the gaps, but each organization had its own script and we just need to standardize this. We could also incorporate the script in the ETL and set a flag for one-off solutions like Tufts, where NAACCR data sits on its own and is not connected to their EHR. The team agreed that this task is needed for the first milestone. Follow-up: Observation_Period -> Need guidance on whether (a) to extend the observation period if there are no clinical events that correspond to the date and get the last follow up date if it's less than the follow-up date in the NAACCR data OR (b) to create an observation and let the observation period logic take care of it.
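Option (a) in the follow-up above could be sketched roughly as follows. This is a minimal illustration and not the group's ETL logic: the function name `derive_observation_period` and its arguments are hypothetical, and it assumes the period starts at the earliest known date (a clinical event or the date of diagnosis) and ends at the later of the last clinical event and the NAACCR last follow-up date.

```python
from datetime import date
from typing import Iterable, Optional, Tuple


def derive_observation_period(
    event_dates: Iterable[date],
    diagnosis_date: date,
    last_follow_up: Optional[date],
) -> Tuple[date, date]:
    """Return (start, end) for one person's observation period.

    Assumption: the diagnosis date always counts as a known date, so the
    period is defined even when no EHR events exist for the person.
    """
    candidates = list(event_dates) + [diagnosis_date]
    start = min(candidates)
    end = max(candidates)
    # Extend the period only when the registry follow-up date is later
    # than anything seen in the clinical events.
    if last_follow_up is not None and last_follow_up > end:
        end = last_follow_up
    return start, end
```

For a stand-alone registry load with no linked EHR, `event_dates` would be empty and the period would simply run from diagnosis to last follow-up.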
Meeting Date: 12/4/2019
Attendees: Robert Miller [email protected]; Anastasios Siapos [email protected]; Yang, Qi [email protected]; [email protected];Asieh Golozar [email protected];
Completed and closed issues:
- Issue #205 This was discussed in the vocabulary call and based on investigation of the data the decision was made to not make any code changes. See notes from the vocabulary call related to this topic.
- Issue #186
In-progress issues:
- Issue #133 Chan has executed the algorithm and has some preliminary results. Chan is waiting for Anastasios to update the algorithm based on the feedback he provided. Once he has the updated algorithm, he will re-run it before sharing results, hopefully in the 12/18 meeting. Yili from Northwestern and Asieh from Regeneron are planning to run the algorithm on their data as well. The plan is to set up another algorithm walkthrough meeting with Mohammad and Alex, who were both interested in contributing to the algorithm.
- Issue #73 Discussion added to the agenda for the vocabulary meeting on 12/5.
- Issue #152 Documentation meeting scheduled for Thur 12/5.
- Issue #114 Need an update from Chan and Meera on the status of this issue.
- Issue #199 Robert and Michael have converted the ETL to SQL Server so that SQLRender can be used to automatically generate the ETL for each of the different dialects. Tested it on 3-4 databases. Waiting on Dave to test it on Oracle and Redshift. The idea is to have a single script so that any updates are made to only one script. Meeting on Friday to finalize this. Daniel's data is in Oracle so this can be tested there as well. Daniel's team is meeting later on 12/4 to discuss next steps of the Oncology ETL project.
- Issue #198 No new updates. Elizabeth from the Odysseus team is in touch with Michael about the unit test scripts, as they would like to use them for the mapping work the team is doing for MSKCC.
- Issue #116 Meeting with Jeremy has been scheduled for Friday 12/13.
- Robert will create new issues to address the differences that Ray has identified between the R script and the spec.
- Robert will create a new issue to make sure the concept ids are changed from BIGINT to INT in the DDL, the ETL and anywhere else they are referenced.
- Issue #111 This issue is related to the gap between what the ETL produces and what is needed to make the plots. Need to discuss whether this falls under the first milestone, which is most likely since running the symposium is a part of the milestone.
Meeting Date: 11/27/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; Yang, Qi [email protected]; [email protected];Asieh Golozar [email protected];
Key Points:
- Issue #166 is complete and can be closed. Will reach out to Rimma and others to test the changes. The change also includes refactored code. The plan is also to create a data quality check to validate this change. A new label 'Readiness candidate' has been created for such changes.
- Issue #115 Chan is working with Anastasios to run the algorithm packets on his data. Chan plans to share his results with the Development Subgroup in the next couple weeks.
- Issue #73 Pending decision from the modeling team. Will bring it up on the Vocabulary call.
- Issue #152 No new updates. Immediate plan is to take Claire's suggestion and integrate it with the larger documentation effort.
- Issue #114 Michael has uploaded the SQL to query the database. The next step is for Meera to integrate it with the package and test the performance. Once this is ready, Northwestern, Tufts and Columbia are candidates to test the package. Meera plans to prioritize it and possibly have this completed in the next few weeks.
- Issue #199 Dave has completed conversion to RedShift and Oracle. The plan is to meet with Dave, take Robert's latest version, run it through the SQLRender, try it out with the RedShift and Oracle instances and see if it works. Robert will add 2 versions to the Git and, if there are any issues, he can comment on those specifically.
- Issue #198 Dave has started working on this and will pick this back up once work on 199 is complete. Michael will work with Lisa to aid with the work that the Odysseus team is doing for the standardization project.
- Issue #116 Waiting to schedule the meeting with Jeremy. In the meantime, Yili Zhang from Northwestern has a cohort of 395 breast cancer patients and is manually extracting the drug regimens from their charts. She will run Anastasios's script and compare it to the results from her manual extraction.
- Issue #205 can be worked on next, pending a decision on Issue #193.
- Issue #186 has been assigned to Robert. Everything has a start date now; only procedures will have an end date.
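The comparison planned under Issue #116 above (script output versus manual chart extraction) could be scored along these lines. This is only a sketch under stated assumptions: the function `compare_regimens`, the `(person_id, regimen name)` pair representation and the example rows are all hypothetical and made up for illustration; they do not reflect the actual output format of the regimen script or any real patient data.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: one (person_id, regimen name) pair per finding
Regimen = Tuple[int, str]


def compare_regimens(
    derived: Iterable[Regimen], manual: Iterable[Regimen]
) -> Dict[str, object]:
    """Score algorithm-derived regimens against manually extracted ones."""
    derived_set: Set[Regimen] = set(derived)
    manual_set: Set[Regimen] = set(manual)
    matched = derived_set & manual_set
    precision = len(matched) / len(derived_set) if derived_set else 0.0
    recall = len(matched) / len(manual_set) if manual_set else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # Regimens the chart review found but the algorithm did not
        "missed": sorted(manual_set - derived_set),
        # Regimens the algorithm reported that the chart review did not confirm
        "unconfirmed": sorted(derived_set - manual_set),
    }


# Made-up example rows, for illustration only
algorithm_output = [(1, "AC"), (2, "TC"), (3, "AC")]
chart_review = [(1, "AC"), (2, "TCH")]
result = compare_regimens(algorithm_output, chart_review)
```

Listing the missed and unconfirmed pairs, rather than only the summary rates, keeps the focus on where the algorithm fails as well as how well it works.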
Meeting Date: 11/20/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; Yang, Qi [email protected]; [email protected];David Sonnet ([email protected]); Asieh Golozar [email protected];Maxim Moinat [email protected];Blacketer, Clair [JRDUS] [email protected];Gijs Geleijnse [email protected]
Key points:
- Maxim shared the unit testing framework that is built using Rabbit in a Hat. The Development team decided to look at this framework while continuing to work in parallel on writing our test scripts using the framework created by Michael. Eventually, we will converge with the overall framework.
- Issue #73 -> This issue needs to be discussed and finalized with the modeling team on the CDM/Vocabulary call. Christian thinks that the 'treatment did not happen' records should not be loaded into the CDM.
- Issue #166 -> This needs to be merged with the main branch. The revised ETL is working on postgres with a sample dataset. Duplicate Measurement IDs are being inserted. Michael to work with Robert to look into this issue. Once this is done we can have other groups try it out.
- Issue #133 -> In the process of setting up a meeting with people who are interested in building an algorithm. Anastasios will walk through what has been done and how they can collaborate to create a robust way of identifying oncology regimens.
- Issue #115 -> Working on Oracle; tested on Postgres; there is an issue loading one of the vocabulary files.
- Issue #152 -> The idea is to start with very high-level process documentation that, for example, includes the location of the DDL and ETL code to help participants who are looking to ETL their data.
- Issue #116 -> Meeting scheduled with Jeremy for 11/22 to discuss (1) Validation strategy for the regimen algorithm (2) Expansion of the temporal properties for HemOnc.org into the vocabulary.
- Issue #114 -> Reached out to Meera for a status.
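For the duplicate Measurement IDs noted under Issue #166 above, a quick check on the ETL output might look like the sketch below. It assumes the rows have already been read into dictionaries keyed by column name; the helper name `find_duplicate_ids` is hypothetical and is not part of the ETL or its unit tests.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_duplicate_ids(
    rows: Iterable[Dict[str, object]], id_field: str = "measurement_id"
) -> List[object]:
    """Return the sorted id values that occur more than once in rows."""
    counts = Counter(row[id_field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Made-up rows, for illustration only
sample_rows = [
    {"measurement_id": 1, "person_id": 10},
    {"measurement_id": 2, "person_id": 10},
    {"measurement_id": 2, "person_id": 11},
]
duplicates = find_duplicate_ids(sample_rows)
```

An empty result would indicate that every id in the loaded table is unique.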
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; [email protected]; Yang, Qi [email protected]; [email protected];David Sonnet ([email protected]); [email protected]
Key points:
- Issue # 73 -> Qi is looking into comments by Christian and Jeremy. He plans to discuss it further with Michael to come up with next steps.
- Issue # 166 -> Robert is cleaning up the code on the main branch based on agreement between him and Michael and will separate out another branch for additional improvements. The logic to add the ambiguous schema is already out on the main version.
- Issue # 115 -> Chan is working on enhancing the drug regimen algorithm and is planning to run his data through the algorithm. Michael has received requests from individuals in the community to build the algorithm. Anastasios will be setting up a working session with Chan and others to work on the algorithm.
- Issue # 116 -> Meeting with Jeremy has been moved to 11/22 due to scheduling issues.
- Issue # 114 -> No current update on this issue.
- Issue # 186 -> This issue needs assignment. Dima plans to have vocabulary task Issue # 71 completed this week.
- Issue # 198 Maxim and Claire have been invited to present their approach to unit testing in the 11/20 Development call. The idea is to look at their approach and best practices.
- Issue # 199 Dave is making good progress with this and should have something for the team to review soon.
- Issue # 152 assigned to Dave -> no new update since last call.
- Created Issue # 205 as a development activity following the vocabulary change Issue # 67 to create an ICDO concept for 'unknown histology'.
- Below are issues in the to-do list that are waiting on assignment:
- Issue # 111 -> Michael has a plan for the Development strategy. Waiting on resource assignment.
- Issue # 167 -> Moving this back to to-do as there is nobody assigned to work on this issue. Looking for volunteers to work on this issue.
- Issue # 158 -> Rimma's team might be able to contribute to this task soon as they are building quality checks for breast cancer data although not specific to OMOP format. Left this in the to-do column for other volunteers to contribute to this effort.
Meeting Date: 11/6/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; [email protected]; [email protected]; Yang, Qi [email protected]; [email protected];David Sonnet ([email protected]);[email protected];Rimma Belenkaya [email protected];[email protected]
Key points:
- Issue # 166 -> Robert is cleaning up the code on the main branch based on agreement between him and Michael and will separate out another branch for additional improvements. The logic to add the ambiguous schema is already out on the main version.
- Issue # 115 -> Chan is planning to run his data through the algorithm. He should have an update for us next week.
- Issue # 183 -> Forum post has been submitted to solicit other participants in the community.
- Issue # 116 -> Meeting will be scheduled with Jeremy for the week of 11/11 to discuss what he has accomplished so far and how we can contribute to fill the gaps.
- Issue # 114 -> Meera and Chan are working on this. Checking with them on what is the definition of complete.
- Issue # 114 -> Not started yet due to other high priority tasks.
- Issue # 122 -> Complete. Current version can be considered as released and can be used. Next step is to go through the NAACCR ETL and help develop the NAACCR ETL SQL. Issue # 73 will be used to implement a test.
- Issue # 186 -> waiting for a vocabulary task Issue # 71 to be completed.
- New Issue # 198 has been created to make improvements to the ETL SQL and write unit tests. Anyone that wants to improve the ETL will create a branch which will be reviewed and merged with the final version. Shilpa to work with Michael on the acceptance criteria for this task as well as reach out to individuals Andrew provided as having contributed to unit tests in the past.
- New Issue # 199 assigned to Dave -> The goal is to make the code as database flavor independent as possible and come up with a strategy for translating NAACCR ETL SQL to different dialects. Main ETL will be in SQL server and then translate to other flavor.
- Michael confirmed that the metric script can be shared only with metric customers.
- Issue # 152 assigned to Dave -> Starting point can be very high level steps, location of code and documentation, input format etc.
- Issue # 158 -> Rimma's team might be able to contribute to this task soon as they are building quality checks for breast cancer data although not specific to OMOP format. Left this in the to-do column for other volunteers to contribute to this effort.
Meeting Date: 10/30/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; [email protected]; [email protected]; Yang, Qi [email protected]; [email protected];Vibhor Gupta [email protected]; Askash Desai [email protected]; Jiang, Renjian [email protected]; [email protected]
Key Points:
- Need to share Oncology WG progress and next steps with new participants to the group to bring them up to speed on the Oncology WG affairs and help them identify how they can contribute
- Issue 113 -> Released the vocabulary without the duplicates; Available on Athena. Dima is writing up an email with example and changes; There are 13 schemas that should be ambiguous. These are covered by tables that Robert created. There are 42 schemas that should not be ambiguous; Next steps are for others to check them on their data.
- Issue 175 -> In order to move forward, more people need to run the algorithm on their data. The next coding change is to calculate cycles. We need an understanding of what the signatures mean so they can be broken out into regimens and drug schedules, and of what prior work has been done on this. Jeremy may have done some work on this, so Anastasios will email Jeremy, Dima and Andrew to set up a breakout session to discuss this as a starting point.
- Issue 115 (Higher priority in the Oncology regimen algorithm space) -> Anastasios and Chan are working on helping Chan run the algorithm with Chan's data. Anastasios is writing up a document that will help them run the algorithm as well as develop their own algorithm.
- Issue 164 -> this issue (challenge) is going to be on-hold until we have had other institutions run the Oncology regimen algorithm (Issue # 115) and have created validation strategy (Issue # 116). The goal with this issue is to evaluate the performance of the various algorithms.
- Issue 183 -> Need Anastasios to frame up a message to broadcast to the community to solicit participants to run the algorithm.
- Issue 116 -> Need a break-out session to discuss this validation strategy/gold standard. Jeremy should be included in the discussion. From a timeframe perspective, the discussion can happen around the 2nd week of November.
- Issue 73 -> In-progress; Ready to add the changes to a branch which will then follow the review process.
- Issue 167 -> Not started since priority was shifted to work on the ambiguous codes.
- Issue 114 -> Meera is working on this
- Issue 122 -> Michael has a unit testing framework ready to be shared with the subgroup in the next meeting. Looking for this to be a work in progress and invite individuals to add to this unit testing framework. Looking for volunteers to write the unit tests.
- Issue 168 -> Michael, Robert and Anastasios will take a look at filling in the values for some of the parameters/categories that Andrew has helped identify to assess the readiness of the ETL and the algorithm with regard to variables such as documentation, unit testing etc. The audience for this information is individuals who are thinking of using the ETL or the algorithm against their data and in research studies, so they can gain an understanding of what to expect and what they are up against.
Meeting Date: 10/23/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; [email protected]; [email protected]; Yang, Qi [email protected]; Reich, Christian [email protected]; [email protected]
Key Points:
- Team discussed the milestone 'Running an Oncology Drug Regimen Extraction Challenge'. Anastasios walked the team through the Oncology Regimen Algorithm Packet that he has created. The packets can be found in the link OncoRegimenFinder Packets https://github.com/OHDSI/OncologyWG/tree/master/OncoRegimenFinderV3
- The regimen algorithm is needed because, if we want to ETL EHR data into the Episode tables, we need to derive regimens algorithmically, so it's important to have a well-defined structure.
- The current algorithm uses the HemOnc.org temporal characteristics, such as schedule and dosing, through a separate structure outside of the OMOP vocabulary.
- There is documentation on how to run the scripts. Anastasios will document steps to test the algorithm.
- MSKCC is in the process of converting Medications data to OMOP (timeframe is 2 months), they should be able to run the algorithm and validate it against their data.
- Jeremy can be utilized to establish the validation process/strategy. Our validation process would help inform him where the gaps are. So we would validate not only how well it works but also where it fails, essentially identifying the limitations for further improvement.
- There is a task to incorporate the temporal aspects of the HemOnc.org vocabulary into the OMOP vocabulary; this needs to be done so that competing algorithms could potentially use it.
- Chan offered up his strategist to help with the validation process by running the algorithm on their data. The plan is to run it on colon cancer regimens. Phase 1 is identifying the ingredients correctly. Phase 2 (Anastasios's project) will produce the number of cycles.
- Chan, Michael (Northwestern data), Robert (TUFTS), Columbia are potential data partners that could run existing algorithm against their data.
- For the vocabulary effort, Rimma will need the disambiguation scripts (ambiguous schemas and the variables that disambiguate them) for validation on a subset of the tables. After the validation on a small set of tables, the remaining tables will be populated. Dima will need the links to the scripts. Issue # 113 https://github.com/OHDSI/OncologyWG/issues/113 is associated with this topic. Michael's SQL is referenced in the task. Robert will add his script to the task.
Meeting Date: 10/17/2019
Attendees: Michael Gurley [email protected]; Robert Miller [email protected]; Anastasios Siapos [email protected]; [email protected]; [email protected]; Yang, Qi [email protected]; Reich, Christian [email protected]; [email protected]
Key Points:
- The next immediate goals are to (1) fix bugs identified in the first version of the Oncology CDM extension, (2) make ETL changes to incorporate the NAACCR vocabulary, (3) re-run the ETL for additional use cases and with additional data partners, and (4) run the 'Oncology Drug Regimen Extraction Challenge' with additional data.
- Group went through the tasks assigned to Milestone #1 - Rerun symposium plots with expanded participation and identified high priority tasks and assignments. Issues have been updated with this information.
- For the task #114 - Polish symposium analytical packages, Andrew and Chan will work with Martin and also consult with Kristin.
- Obtaining NAACCR data was identified as being of high priority for the Outreach/Research group discussion.
- The following tasks were identified for future sprints: #22, #152, #11.
- For the milestone -> Running an 'Oncology Drug Regimen Extraction Challenge', Robert will run the OncoRegimenFinder script that Anastasios created with the Tufts data.