Questions re: in_network schema #709

felix-hh · 2023-08-15T16:35:47Z

felix-hh
Aug 15, 2023

Hi!

I have been developing a data pipeline to process in_network MRF files this last month. After a lot of trouble I have managed to develop something that ingests a file in ~5h on my Mac (file is 4GB compressed, 150GB uncompressed). For your reference, this the Aetna file that I used.

The high compression ratio (150GB -> 4GB) indicates that there is a lot of unnecessary redundancy in the original data. This translates into higher (maybe 30x?) processing time as data needs to be uncompressed, and instantiated as an object in the language of your choice before being ingested. Data economy would lead to smaller objects and a much better processing time.

During the process some questions came up about the current schema about choices that, IMO, are driving this inefficiency:

Why is service data part of the negotiated_price object? This causes a lot of redundancy.
- By service data, I mean service_code, modifiers, etc.
- In this file, there are 538,296,072 unique service-provider_group-price combinations, however there are only 835,189 unique services. We repeat service data many times.
- We could instead have an array of “service” objects nested on the “in_network object” defining service codes etc, and then have rates objects (like the current one, but negotiated_price is just a number, no other metadata) nested for each service, defining the rate and the provider_group_ids that correspond to that rate. It is true that this impedes linking a provider to multiple services at once, but enables linking a service to many provider groups at once (a better tradeoff, since there are about 8 services associated with an in_network object, but there are many more provider groups!).
Why is service_code an array of strings?
- Service codes are defined here and range from 0 to 100, where 0 is ‘CSTM-00’ in this schema. First, this could be an array of ints, already saving space. But secondly, this could instead be replaced by 2 base 64 integers, functioning as a bitmask, where the ith bit is flipped if the service code is in the array.
- Imagine replacing the redundant arrays of service codes by just 2 integers amenable to bitwise operations. This dataset is for a technical audience that would appreciate going from ~500-bit slow to process arrays to a 128-bit pair of integers.
(minor) Why use floating points for negotiated_rate?
- In finance, integers are usually used for dollar amounts, expressed in cents (unofficial reference) since floats i) can be imprecise, ii) are less fast to process, and iii) have some gotchas for example, when parsing. In this dataset negotiated_rate can also be a percentage, but the point stands - the percentage can be reported as basis points multiplying by 100 to a good level of precision (I would assume!).

Based on #244 It looks like you have already optimized to reduce redundancy in the past by nesting prices on rates. These questions are about further optimization.

I have implemented these changes in my “clean” version of the dataset for downstream use. I look forward to learn about the reason for these design choices if I am wrong. I imagine this cannot be changed without a new major version of the schema.

Also, I thoroughly admire the work that has been done in putting this schema / transparency tools together and getting insurers to comply. If there is anything I can do to help or I can elaborate on this issues let me know.

shaselton-usds · 2023-09-11T17:59:04Z

shaselton-usds
Sep 11, 2023
Maintainer

Hi @felix-hh

Thank you for this feedback. It really is appreciated. Also, moving this from an "issue" to a "discussion"; hope you don't mind.

I think on the whole, these suggestions are fairly on point in terms of a more purely optimized way to express data. You're correct in your observation in the compression ratio signaling redundant data. We found most of this to be in how provider networks are defined and utilized. By providing an option to define these larger networks once in an external file and having an in-network file reference them, we would see on the whole the file sizes be reduced fairly dramatically. At this point though it is optional thus producers of the files have the decision in higher costs of storage and network charges with unoptimized files or spending some engineering effort in organizing the files in ways that would be beneficial to them and those consuming.

Walking through your suggestions -

Why is service data part of the negotiated_price object? This causes a lot of redundancy

I am trying to wrap my arms around defining service data that applicable to the negotiated rate into independent objects to be later referenced. The current schema allows the user to link multiple provider groups to a negotiated rate (that currently includes all of the service "metadata” that the negotiation is dependent upon), but you'd want to define unique combinations of services to be outside of the negotiated_prices object to be referenced later? Would it be possible for you to mock up an example or schema change to look at? I think one of the bigger questions is if service metadata (as you say) is fairly independent of negotiated prices. More than happy to continue thinking through this with you.

Why is service_code an array of strings?

This suggestion is what I'd expect from a CS background ;). You're correct that bit masking would most likely work here (and frankly, anywhere there is an Enum defined). These are one of those suggestions that are better from an encoding standpoint, but include an additional level of understanding in order to interact with the data. We've received a lot of feedback, especially from the academic research community, that the data may be hard to work with if teams do not have the proper skill sets such as data engineering and data science. This solution, while not a great excuse but is very much a reality, would introduce errors too -- something as simple as someone confusing the higher and lower-order bits. There is no perfect answer here because multiple audiences with varying degrees of technical acumen are trying to work with the raw data and their expectations are all different how "user-friendly"-esk the data should be.

I think the flexibility that strings provide for future custom service codes should not be discounted either.

(minor) Why use floating points for negotiated_rate?

I hear you on this one. It goes back to how accessible is the data and where the data lives (and in which context - internal systems vs disclosures). The financial data that are in the systems whose output ends up in these MRFs absolutely should be ints (I would hope, but not holding my breath). The MRFs themselves are almost like a report in which those that consume the data may wish to do whatever they want - including running further mathematical calculations. It would be in the best interest of that audience to convert a float to an int (not straight down casting obviously) if they plan to run numerous downstream calculations. Having the price represented as a float allowed the data to be more approachable to those that aren't used to working with financial backend systems (or even having the understanding that floating point issues are a very real thing for computers) -- that said, I think it is worth exploring if the enhancement of an int outweighs possible confusion that is introduced to the floating-point "layman". Would definitely enjoy your thoughts on it.

0 replies

felix-hh · 2023-10-03T16:07:47Z

felix-hh
Oct 3, 2023
Author

Hi @shaselton-usds, first of all thank you for the reply and for moving this topic to the right place. Also apologies for the delay, I didn't see a notification, so I just noticed this while looking over the repo yesterday. I'm still working with MRF and learning a lot of stuff. Hopefully we can still have some conversation. Answering from top to bottom

At this point though it is optional thus producers of the files have the decision in higher costs of storage and network charges with unoptimized files

I definitely agree that in some instances the incentives of producers and consumers of data are aligned - this is why files are compressed and not made available as just json. Here's my qualm on file optimization (edit: optimization to reduce uncompressed size, while compressed size is roughly equal, like as you mention, using provider_references instead of provider_group in negotiated_prices for example). Assume that after compression, files optimized through engineering effort and non-optimized files have very similar sizes, but the later is much more expensive to read. If I have a company and I'm in a write-once, read-many situation with many internal consumers, I would take a expensive write, cheap read approach. The problem here is that the burden of reading (after networking costs) is not borne by the insurers, nor are they getting paid by the consumers of the data, so what is the business justification to make it better? I think that future versions of the schema may need to nudge producers into optimization (including defining provider_references / networks separately and removing the ability to do it under negotiated_rates).

By providing an option to define these larger networks once in an external file and having an in-network file reference them, we would see on the whole the file sizes be reduced fairly dramatically

Are there external files? I am very curious if you are referring to provider_references and how it prevents a lot of data redundancy in every negotiated_rate? I think that is a huge win and maybe what you thought I wanted to do for services when I mentioned defining a service object (it is not, although it is worth thinking through too).

Now that we are talking about file optimization, one thing that would drastically improve ease of use for consumers (researchers) are .jsonl files (let's put this as idea no. 4). Disregarding other metadata in the in_network_file, If the in_network object array was a .jsonl file, I could make a tutorial to set yourself up to process in_network files in ~20 minutes in SQL on your laptop. You just create a python environment with DuckDB and make it read the file. Currently DuckDB chokes because the entire in_network array is too large to fit in memory and ingest the file, but it reads jsonl like cutting butter since it processes one object at a time.

Why is service data part of the negotiated_price object? This causes a lot of redundancy

Would it be possible for you to mock up an example or schema change to look at?

Here's the schema I am proposing:

in_network: Struct(
    negotiation_arrangement: varchar,
    name: varchar
    billing_code_type: varchar,
    billing_code_type_version: varchar,
    billing_code: varchar,
    description: varchar,
    services: Struct(
        negotiated_type: varchar,
        expiration_date: date,
        additional_information: varchar,
        billing_class: varchar,
        billing_code_modifiers: varchar[], 
        # service_code: varchar[], replaced by bitmasks below
        service_code_bitmask1: uint64,
        service_code_bitmask2: uint64,
        negotiated_prices: Struct(
            negotiated_rate: uint64 # reported in cents, basis points, it is currently a float. 
            provider_references: uint64[] 
            provider_groups: provider_group[] # This is to support inline definition of provider_groups as we do currently. On another version of the schema, I would phase it out personally. 
        )[], # array of the new negotiated_price objects
    )[], # array of service objects
    bundled_codes: ${existing_bundled_code_spec},
    covered_services: ${existing_covered_services_spec},
)[]

Not familiar enough to write json schema specs but hopefully the above is understandable. How it looks in my database (I don't include provider_groups, bundled_codes and covered_services)

Some caveats:

I also haven't worked with bundled_codes and covered_services because they were not present in my file. I included them in the proposed schema but I might be missing something. For example, maybe they could be simplified, since the in_network object also has billing_code, billing_code_type, billing_code_type_version and description fields. I kept it simple and left it unchanged.
From now on I might say 'doctor' when what I truly mean is provider_group (comprised of hospitals, doctors, nurses, etc). This is just to simplify the thinking.

Now why I think is this beneficial?

To recap, a service is a specific procedure (billing_code) with its specific modifiers, service_codes and other service and negotiation metadata upon which the final price is dependent.
There are not so many different services. For every procedure there are about 8 variations on average (based on my dataset). Currently, every time we report a different price we repeat all of the service metadata (which is a lot of strings!) unnecessarily.
With the proposed change, A new price is just an object linking price and provider. A negotiated_price by a doctor if you will.
The only tradeoff to me appears that to reduce redundancy in service metadata, but we increase redundancy in provider_references and provider_groups (before you could specify a few negotiated_rates i.e. separate services of the same procedure without repeating provider_references). Now you always repeat provider_references for every service they offer. I think it will be more optimized to repeat provider_references (integers) than to repeat all service metadata for 2 providers that offer the same service at a different price.
At the end of the day, this probably needs more theoretical thinking or empirical testing. I spent some time on this yesterday but it was very complicated to "demonstrate" beyond intuition that this is better. If there's anything you'd like to see to be convinced / test the hypothesis this is worthwhile, let me know. Or if you know of any reading material I should catch up on. I can probably look over this in more detail later this week.

I am trying to wrap my arms around defining service data that applicable to the negotiated rate into independent objects to be later referenced

Although this is an option I'm not proposing services as independent objects, I'm proposing prices as members of the service object, as you may have seen above. This idea is also worth exploring. It would be similar to how we define provider_references in a separate object and then use them throughout. Love the simplicity, getting simiilar to a database with a linking table. Again, if this is optimal should be checked but it looks pretty good. I also wonder if this would be difficult for producers to actually produce or not. Since it looks interesting I took a stab at a quick sketch as well:

procedures: {
    ...current_in_network_object_metadata
    services: {
        ...services_metadata,
        service_reference: uint64
    }[]
}[]

in_network: {
    service_reference: uint64
    negotiated_prices: {
        price: uint64
        provider_references: uint64[]
    }[]
}[]

Why is service_code an array of strings?

Bit masking would most likely work here (and frankly, anywhere there is an Enum defined).

That's a good point, I didn't even consider bitmasking for other enums... I just saw a set of numbers 0-100 and thought that's a bitmask.

I think the flexibility that strings provide for future custom service codes should not be discounted either.

Re: flexibility, I would certainly agree that bitmasking is a bad idea for other enums precisely because of flexibility, and because they are strings and not arrays of strings so the cost is not too high, so if we can make it easy to maintain future changes we should. For service codes I am more on the fence, I thought they look pretty static / non-changing since they are CMS-defined but I don't have enough background to know. I would also love to know what you mean by custom codes and how they could be used or changed (as I understand the codes describe the place where care is delivered and most are CMS-defined). To be honest, I still lack a detailed understanding of the true meaning of every data field so if you have any resources beyond the schema, please share. I think that after the first point, this arrays are the second largest driver of the computing burden. So there's a tradeoff between maintainability (flexibility) and performance.

There is no perfect answer here because multiple audiences with varying degrees of technical acumen are trying to work with the raw data

Re: ease of use, I certainly agree, I wrote encode and decode this data in Python (decoding is much easier), but others may use R or Stata or other tools where it might be more difficult. Although it's not extremely hard I acknowledge it could be beyond begginer-level in terms of writing software, especially for people who are not so familiar with bit representations (after all, I do have a CS background). When I was developing the code I had some problems, including off-by-one errors that were hard to debug. I would claim that if bitmasks improved performance significantly, maybe it would be worthwhile to have a tool that goes back and forth between service_code and bitmasks provided for analysts / users provided by CMS or a third party. Because ease of use is a barrier to entry, but access to compute resources and processing time is another barrier (it limits how many datasets you can look at, etc).

Here's the code for those who might find it useful:

import numpy as np

mycode = ['CSTM-00', '21', '31', '32', '33', '34', '51', '54', '55', '56', '61', '81]
unique_service_codes = [list(map(str, mycode)), list(map(str, mycode[1:4]))]

def create_bitmask_numpy(service_codes_col: "pd.Series[type[object]]") -> "np.ndarray[Any, np.dtype[np.uint64]]":
    idxs = [[int(x) if x != "CSTM-00" else 0 for x in nested] for nested in service_codes_col]
    arr = np.zeros((len(idxs), 128), dtype=bool)

    # Use numpy's advanced indexing to set the values to True
    rows = np.repeat(np.arange(len(idxs)), list(map(len, idxs)))
    cols = np.concatenate(idxs)

    arr[rows,cols] = 1
    int_mask = np.frombuffer(np.packbits(arr).tobytes(), dtype=np.uint8).view(np.uint64).reshape((-1, 2))
    return int_mask.T

# I don't transpose the output of create_bitmask_numpy so that I can do the following:
# unique_service_codes[SERVICE_CODE_BITMASK_1], unique_service_codes[SERVICE_CODE_BITMASK_2] = create_bitmask_numpy(unique_service_codes["service_code"])

def decode_bitmask_numpy(encoded) -> list[str]:
    """"""
    return [f'{i if i != 0 else "CSTM-00"}' for i, exists in enumerate(np.unpackbits(encoded.T.view(np.uint8))) if exists]


encoded = create_bitmask_numpy(unique_service_codes)
print(encoded.T)
print(list(map(decode_bitmask_numpy, [bitmasks for bitmasks in encoded.T])))

>>> [[9516951399653703808             4194304]
 [       549772853248                   0]]
[['CSTM-00', '21', '31', '32', '33', '34', '51', '54', '55', '56', '61', '81'], ['21', '31', '32']]

^ a small example of producing bitmasks for my file. In practice it seems like there are not so many distinct combinations of service_codes per file (I observed 20 in the file I processed). So I just process the 20 possible unique service_code combinations and then use a hashmap to go back and forth rather than encoding/decoding.

(minor) Why use floating points for negotiated_rate?

I hear you on this one. It goes back to how accessible is the data and where the data lives (and in which context - internal systems vs disclosures). The financial data that are in the systems whose output ends up in these MRFs absolutely should be ints (I would hope, but not holding my breath).

I hope so too!

Would definitely enjoy your thoughts on it.

Well, in my view the benefits of using cents and basis points are:

Precission in downstream calculations
Speed in downstream calculations
Parsing errors for floating points (I haven't encountered this that I remember, so it might not be a problem)

Then the disadvantages, as you mention:

Is this data used in internal systems or in disclosures? For disclosures, you want to make sure that disclosing prices to patients, you use dollars and not cents.
Floating-point laymans would need to get up to speed with this way of doing things. For example:
- when they see percentage values of 30 which actually mean 0.3%. This should be clearly stated in documentation or even the json object property name. That way they can quickly realize they need to divide the number by 100.
- they need to be wary of using floats and not ints for certain operations. If they're using SQL, they will want to cast to float then divide by 100 (and not the other way around). I'm not sure how hard these bugs would be to spot.

At the end of the day I think this might be lower hanging fruit but it is not as impactful as the two previous ideas. And the caveats for ease of use / the reality of introducing errors after changing it are there.

Again, thanks for taking the time to discuss and happy to provide more details.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions re: in_network schema #709

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Questions re: in_network schema #709

felix-hh Aug 15, 2023

Replies: 2 comments

shaselton-usds Sep 11, 2023 Maintainer

felix-hh Oct 3, 2023 Author

felix-hh
Aug 15, 2023

shaselton-usds
Sep 11, 2023
Maintainer

felix-hh
Oct 3, 2023
Author