Replies: 2 comments
---
Hi @felix-hh Thank you for this feedback. It really is appreciated. Also, I'm moving this from an "issue" to a "discussion"; hope you don't mind. On the whole, I think these suggestions are fairly on point in terms of a more purely optimized way to express the data. You're correct in your observation that the compression ratio signals redundant data. We found most of this to be in how provider networks are defined and utilized. By providing an option to define these larger networks once in an external file and having an in-network file reference them, we would see file sizes reduced fairly dramatically on the whole. At this point, though, it is optional, so producers of the files must decide between the higher storage and network costs of unoptimized files and spending some engineering effort organizing the files in ways that benefit both them and those consuming the data.
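For readers following along, here is a rough sketch of that mechanism (written as Python literals for brevity; the shape follows the in-network schema's provider_references pattern, abbreviated rather than quoted, so check the schema repo for the authoritative definition):

```python
# A provider network is defined once, either inline or in an external
# file, and then referenced by id instead of being repeated inside
# every negotiated_rates entry.
provider_references = [
    {
        "provider_group_id": 1,
        # Inline definition of the network...
        "provider_groups": [
            {"npi": [1111111111, 2222222222],
             "tin": {"type": "ein", "value": "12-3456789"}},
        ],
        # ...or, in place of provider_groups, a pointer to an external
        # file holding the same content (placeholder URL):
        # "location": "https://example.com/provider-network-1.json",
    },
]

negotiated_rate = {
    "provider_references": [1],  # refers to the group defined above
    "negotiated_prices": [
        {"negotiated_type": "negotiated", "negotiated_rate": 25.0,
         "expiration_date": "9999-12-31", "billing_class": "professional",
         "service_code": ["11"]},
    ],
}
```

Walking through your suggestions: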
I am trying to wrap my arms around defining the service data applicable to a negotiated rate as independent objects to be referenced later. The current schema allows the user to link multiple provider groups to a negotiated rate (which currently includes all of the service "metadata" that the negotiation depends upon), but you'd want to define unique combinations of services outside of the negotiated rate object and reference them instead.
This suggestion is what I'd expect from a CS background ;). You're correct that bit masking would most likely work here (and frankly, anywhere an Enum is defined). This is one of those suggestions that is better from an encoding standpoint but requires an additional level of understanding in order to interact with the data. We've received a lot of feedback, especially from the academic research community, that the data may be hard to work with if teams do not have the proper skill sets, such as data engineering and data science. This solution (not a great excuse, but very much a reality) would introduce errors too: something as simple as someone confusing the higher- and lower-order bits. There is no perfect answer here, because multiple audiences with varying degrees of technical acumen are trying to work with the raw data, and their expectations all differ as to how "user-friendly" the data should be. I think the flexibility that strings provide for future custom service codes should not be discounted either.
I hear you on this one. It goes back to how accessible the data is and where the data lives (and in which context: internal systems vs. disclosures). The financial data in the systems whose output ends up in these MRFs absolutely should be …
---
Hi @shaselton-usds, first of all, thank you for the reply and for moving this topic to the right place. Also, apologies for the delay; I didn't see a notification, so I only noticed this while looking over the repo yesterday. I'm still working with MRFs and learning a lot. Hopefully we can still have some conversation. Answering from top to bottom:
I definitely agree that in some instances the incentives of producers and consumers of data are aligned; this is why files are compressed and not made available as plain JSON. Here's my qualm about file optimization (edit: optimization that reduces uncompressed size while compressed size stays roughly equal, like using provider_references instead of provider_groups in negotiated_rates, as you mention). Assume that after compression, files optimized through engineering effort and non-optimized files have very similar sizes, but the latter is much more expensive to read. If I had a company and were in a write-once, read-many situation with many internal consumers, I would take an expensive-write, cheap-read approach. The problem here is that the burden of reading (after networking costs) is not borne by the insurers, nor are they paid by the consumers of the data, so what is the business justification for making it better? I think future versions of the schema may need to nudge producers toward optimization (including defining provider_references / networks separately and removing the ability to do it under negotiated_rates).
Are there external files? I am very curious whether you are referring to provider_references and how it prevents a lot of data redundancy in every negotiated_rate. I think that is a huge win, and maybe what you thought I wanted to do for services when I mentioned defining a service object (it is not, although it is worth thinking through too). Now that we are talking about file optimization, one thing that would drastically improve ease of use for consumers (researchers) is …

> Why is service data part of the negotiated_price object? This causes a lot of redundancy.
Here's the schema I am proposing:
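(Written as Python literals rather than a formal spec; any field name that is not already in the schema, like "services", is made up here just to illustrate the shape.)

```python
# Illustrative only: the service "metadata" is hoisted out of
# negotiated_prices, so each unique combination of service attributes
# is written once, with its prices nested inside it.
in_network_item = {
    "negotiation_arrangement": "ffs",
    "billing_code_type": "CPT",
    "billing_code_type_version": "2022",
    "billing_code": "27447",
    "negotiated_rates": [
        {
            "provider_references": [1],
            "services": [  # hypothetical field
                {
                    "billing_class": "professional",
                    "service_code": ["21", "22", "23"],
                    "negotiated_prices": [  # prices live under the service
                        {"negotiated_type": "negotiated",
                         "negotiated_rate": 1250.0,
                         "expiration_date": "9999-12-31"},
                    ],
                },
            ],
        },
    ],
}
```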
Not familiar enough with JSON Schema to write a formal spec, but hopefully the above is understandable. This is also how it looks in my database, with some caveats (I don't include provider_groups, bundled_codes, or covered_services).
Now, why do I think this is beneficial? The service "metadata" repeated in every negotiated_price is written once per unique combination instead, so the uncompressed file shrinks and consumers instantiate far fewer objects while parsing, which is where most of the processing time goes.
Although this is an option, I'm not proposing services as independent objects; I'm proposing prices as members of the service object, as you may have seen above. That idea is also worth exploring, though. It would be similar to how we define provider_references in a separate object and then use them throughout. I love the simplicity: it gets close to a database with a linking table. Again, whether this is optimal should be checked, but it looks pretty good. I also wonder whether it would be difficult for producers to actually produce. Since it looks interesting, I took a stab at a quick sketch as well:
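(Again as Python literals; service_reference_id and service_reference are made-up names, mirroring how provider_references works today.)

```python
# Illustrative only: unique service combinations are defined once at
# the top of the file and referenced by id from each price, which is
# essentially a database linking table.
service_references = [
    {"service_reference_id": 1,
     "billing_class": "professional",
     "service_code": ["21", "22", "23"]},
]

negotiated_price = {
    "negotiated_type": "negotiated",
    "negotiated_rate": 1250.0,
    "expiration_date": "9999-12-31",
    "service_reference": 1,  # points at the combination defined above
}
```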
> Why is service_code an array of strings?
That's a good point, I didn't even consider bitmasking for other enums... I just saw a set of numbers 0-100 and thought that's a bitmask.
Re: flexibility, I would certainly agree that bitmasking is a bad idea for other enums, precisely because of flexibility, and because they are strings rather than arrays of strings the cost is not too high; if we can make future changes easy to maintain, we should. For service codes I am more on the fence: they look pretty static, since they are CMS-defined, but I don't have enough background to know. I would also love to know what you mean by custom codes and how they could be used or changed (as I understand it, the codes describe the place where care is delivered, and most are CMS-defined). To be honest, I still lack a detailed understanding of the true meaning of every data field, so if you have any resources beyond the schema, please share. I think that after the first point, these arrays are the second-largest driver of the computing burden. So there's a tradeoff between maintainability (flexibility) and performance.
Re: ease of use, I certainly agree. I wrote code to encode and decode this data in Python (decoding is much easier), but others may use R or Stata or other tools where it might be more difficult. Although it's not extremely hard, I acknowledge it could be beyond beginner level in terms of writing software, especially for people who are not so familiar with bit representations (after all, I do have a CS background). When I was developing the code I had some problems, including off-by-one errors that were hard to debug. I would claim that if bitmasks improved performance significantly, it might be worthwhile for CMS or a third party to provide a tool that converts between service_code arrays and bitmasks for analysts. Ease of use is one barrier to entry, but access to compute resources and processing time is another (it limits how many datasets you can look at, etc.). Here's the code for those who might find it useful (condensed to the essentials, so this sketch may differ slightly from exactly what I ran):
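```python
# service_code values are two-digit strings ("01".."99"); a set of them
# becomes one integer where bit k is set iff code k is present.
def encode_service_codes(codes: list[str]) -> int:
    mask = 0
    for code in codes:
        mask |= 1 << int(code)  # the off-by-one bugs lived around here
    return mask

def decode_service_codes(mask: int) -> list[str]:
    codes, bit = [], 0
    while mask:
        if mask & 1:
            codes.append(f"{bit:02d}")  # restore the two-digit form
        mask >>= 1
        bit += 1
    return codes

# Files seem to contain few distinct combinations (~20 in mine), so in
# practice I memoize: each combination is encoded once and looked up in
# a hashmap afterwards instead of being re-encoded per record.
_mask_cache: dict[tuple[str, ...], int] = {}

def encode_cached(codes: list[str]) -> int:
    key = tuple(sorted(codes))
    if key not in _mask_cache:
        _mask_cache[key] = encode_service_codes(codes)
    return _mask_cache[key]
```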
^ a small example of producing bitmasks for my file. In practice there do not seem to be many distinct combinations of service_codes per file (I observed 20 in the file I processed), so I just process the ~20 unique service_code combinations and then use a hashmap to go back and forth, rather than encoding/decoding per record.

> (minor) Why use floating points for negotiated_rate? I understand that negotiated_rate can also be a percentage, but the point stands: the percentage can be reported as basis points by multiplying by 100, to a good level of precision (I would assume!).
I hope so too!
Well, in my view the benefits of using cents and basis points are:

- Exactness: integers represent monetary amounts exactly, while binary floating point cannot represent most decimal fractions, so sums and equality comparisons on floats can silently drift.
- Simplicity: whole numbers leave no ambiguity about precision or rounding when the files are written and read.

(A two-line demonstration follows the next list.)
Then the disadvantages, as you mention:

- The raw values are less human-readable, which matters for the less technical audiences you described.
- Every producer and consumer must get the unit conversion right, and confusing cents with dollars (or basis points with percent) is exactly the kind of simple error that would get introduced.
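To make the floating-point issue concrete, the standard two-line demonstration (not specific to MRFs):

```python
# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)     # False
print(0.1 + 0.2)            # 0.30000000000000004

# Integer cents and basis points stay exact.
rate_cents = 10 + 20        # $0.10 + $0.20 is exactly 30 cents
pct_bps = 1250              # 12.5% reported as 1250 basis points
print(rate_cents, pct_bps)  # 30 1250
```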
At the end of the day, I think this might be lower-hanging fruit, but it is not as impactful as the two previous ideas, and the caveats about ease of use and the reality of introducing errors after changing it still apply. Again, thanks for taking the time to discuss, and happy to provide more details.
---
Hi!
I have been developing a data pipeline to process in_network MRF files this last month. After a lot of trouble, I have managed to develop something that ingests a file in ~5h on my Mac (the file is 4GB compressed, 150GB uncompressed). For your reference, this is the Aetna file that I used. The high compression ratio (150GB -> 4GB) indicates that there is a lot of unnecessary redundancy in the original data. This translates into higher (maybe 30x?) processing time, as data needs to be uncompressed and instantiated as an object in the language of your choice before being ingested. Data economy would lead to smaller objects and much better processing time.
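To give a sense of what "ingesting" means here: even the cheapest approach has to stream-decompress and parse every byte. A minimal sketch of that stage (using gzip plus the ijson streaming parser; the file name is a placeholder, and this is illustrative rather than my exact pipeline):

```python
import gzip

import ijson  # streaming JSON parser; avoids building the full 150GB object

# Placeholder name; the real input is the ~4GB gzipped Aetna MRF.
with gzip.open("in-network-rates.json.gz", "rb") as f:
    # Yield one in_network item at a time instead of json.load()-ing it all.
    for item in ijson.items(f, "in_network.item"):
        for rate in item.get("negotiated_rates", []):
            for price in rate.get("negotiated_prices", []):
                pass  # ingest billing code, service_code, negotiated_rate, ...
```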
During the process, some questions came up about choices in the current schema that, IMO, are driving this inefficiency:

- Why is service data part of the negotiated_price object? This causes a lot of redundancy.
- Why is service_code an array of strings?
- (minor) Why use floating points for negotiated_rate? I understand that negotiated_rate can also be a percentage, but the point stands: the percentage can be reported as basis points by multiplying by 100, to a good level of precision (I would assume!).

Based on #244, it looks like you have already optimized to reduce redundancy in the past by nesting prices on rates. These questions are about further optimization.
I have implemented these changes in my "clean" version of the dataset for downstream use. I look forward to learning the reasons for these design choices if I am wrong. I imagine this cannot be changed without a new major version of the schema.

Also, I thoroughly admire the work that has been done in putting this schema and these transparency tools together and getting insurers to comply. If there is anything I can do to help, or if I can elaborate on these issues, let me know.