Replies: 11 comments 23 replies
-
You should define what your challenges are so people can give more specific suggestions. I would suggest consuming the data and performing analytics on a cloud provider like AWS or Azure, so you can spin up the environments needed to work with the data on demand and take advantage of their cloud-native data analytics services.
-
These "transparency" files are a joke; I can't open them either :(
-
I have had success streaming the JSON using the Python jsonslicer package. You can even stream the gzip file into jsonslicer if you don't want to decompress the file first. It's a wrapper around yajl, so you might be able to find a Java wrapper or use JNI if you really need to.
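A minimal sketch of that approach (not the commenter's actual code): the file name is hypothetical, and the `("in_network", None)` path assumes the CMS in-network-rates schema with a top-level `in_network` array.

```python
# Stream in_network items from a gzipped MRF without decompressing it to
# disk first. File name is hypothetical; the ("in_network", None) path
# assumes the CMS in-network-rates schema.
import gzip
from jsonslicer import JsonSlicer

with gzip.open("2022-07-01_payer_in_network.json.gz", "rb") as f:
    # Yields one fully parsed in_network object at a time, so memory use
    # stays bounded no matter how large the file is.
    for item in JsonSlicer(f, ("in_network", None)):
        print(item.get("billing_code"), len(item.get("negotiated_rates", [])))
```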
-
I have had some initial success with Node and JSONStream (running the Node script from the CLI); the largest file I've been able to parse down to what I wanted so far was 96 GB. If you just want to look at a file before tackling it, a hex editor will let you open it and copy out sections to examine more closely. The toughest thing for me (and probably something I need to look into) is the time it takes to stream through the files.
-
Lots of good examples here. I would highly recommend, as others already have, taking a streaming strategy when consuming these files; this is how the validator works. @DrZeshanKhan, not to be too presumptuous here, but it sounds like software development isn't something you have a history with. These transparency files are geared toward developers and innovators, as was cited quite frequently within the rule. There was no illusion that these files were going to be small or end-consumer friendly; the data disclosure requirements within the rule dictate that. The machine-readable files are meant for machines. You might just be a bit too early, while these files are still being processed and services/platforms are developed for friendly end-user consumption.
-
I have successfully written a process that downloads the Anthem TOC file at https://antm-pt-prod-dataz-nogbd-nophi-us-east1.s3.amazonaws.com/anthem/2022-07-01_anthem_index.json.gz, reads the JSON directly from the compressed file, and downloads the linked files within. There are 6363 distinct sub-files, the vast majority of which download just fine, but 558 of them fail with a 403 Forbidden error. All of them are located in Amazon S3 storage at antm-pt-prod-dataz-nogbd-phi-us-east1.s3.amazonaws.com. I believe those links are missing the Signature and Key-Pair-Id parameters that should be in the query string. Is anyone from Anthem on here in a position to review and correct the TOC file?
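For anyone who wants to try something similar, here is a rough sketch of the URL-collection part of such a process (not the poster's actual code). It assumes the CMS table-of-contents schema (`reporting_structure` → `in_network_files` → `location`) and uses the jsonslicer streaming parser mentioned earlier in this thread.

```python
# Stream the Anthem index straight from the gzipped HTTP response and
# collect the linked in-network file URLs. Key names assume the CMS
# table-of-contents schema.
import gzip
import urllib.request

from jsonslicer import JsonSlicer

INDEX_URL = (
    "https://antm-pt-prod-dataz-nogbd-nophi-us-east1.s3.amazonaws.com"
    "/anthem/2022-07-01_anthem_index.json.gz"
)

urls = set()
with urllib.request.urlopen(INDEX_URL) as resp, gzip.GzipFile(fileobj=resp) as gz:
    # Each reporting_structure entry lists the in-network files for a set of plans.
    for structure in JsonSlicer(gz, ("reporting_structure", None)):
        for file_ref in structure.get("in_network_files", []):
            urls.add(file_ref["location"])

print(f"{len(urls)} distinct in-network file URLs found")
```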
-
Has anybody been able to download the 300 GB files and load them into a database? If so, how long did the data ingestion take?
-
I have some upbeat news to share. Our company, https://www.keyark.com, is now loading and querying in-network rate MRF files. As background, Keyark specializes in providing analytics for complex non-flat data, such as these MRFs, using in-house, custom-built technology. The tech stack includes the KeySQL Studio web app, which supports end-user queries. Our query language is very close to SQL, with extensions that enable analysts to handle the nuances of non-flat data.
-
You can email ***@***.***
…On Wed, Aug 3, 2022 at 6:54 PM DrZeshanKhan ***@***.***> wrote:
I had a thought... is there a way to contact you directly? Perhaps via email?
-
Take a look at this blog post: https://aws.amazon.com/blogs/big-data/process-price-transparency-data-using-aws-glue/
-
Some of our customers are using Dadroit to open these files, so I figured I would suggest checking out our work too.
-
I am trying to download and ingest huge JSON files from payers like Anthem, UHC, etc. using PySpark, and I am facing challenges. Any suggestions/recommendations on other languages/code I could use to download and ingest the files effectively? (One possible workaround is sketched after the error trace below.)
Problem Statement
Currently the PySpark job is not able to serialize the data because one column of the file contains nested JSON that easily exceeds 2 GB; 99% of the file content ends up in that single column.
Steps taken so far to fix this issue:
We have increased driver memory, executor memory, heap memory, and overhead memory.
We changed the default Java serializer to Kryo and set a limit of more than 2 GB.
We cannot partition the data or split the DataFrame, since the entire file is a single row, so that won't help.
We are not using collect or any aggregate function; we are just reading the input JSON file into a deeply nested DataFrame.
We tried to flatten the data immediately after reading it, to make sure each column holds less than 2 GB of data.
Error:
Py4JJavaError: An error occurred while calling o316.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 10 times, most recent failure: Lost task 0.9 in stage 1.0 (TID 19, anbc-ptx-dev-w-0, executor 1): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 488 because the size after growing exceeds size limitation 2147483632
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
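One possible workaround, sketched below under the assumption that the file follows the CMS in-network-rates schema: the 2 GB BufferHolder limit is hit because the whole document lands in a single Spark row, so pre-splitting the file into newline-delimited JSON (one `in_network` item per line) with a streaming parser before handing it to Spark gives Spark many small rows it can partition normally. File paths here are hypothetical.

```python
# Sketch only: convert the single giant JSON document into newline-delimited
# JSON, one in_network item per line, so no single Spark row ever approaches
# the 2 GB BufferHolder limit. Paths are hypothetical; the ("in_network",
# None) path assumes the CMS in-network-rates schema.
import gzip
import json

from jsonslicer import JsonSlicer

src = "2022-07-01_payer_in_network.json.gz"  # hypothetical input file
dst = "in_network.ndjson"                    # hypothetical output file

with gzip.open(src, "rb") as fin, open(dst, "w", encoding="utf-8") as fout:
    for item in JsonSlicer(fin, ("in_network", None)):
        fout.write(json.dumps(item) + "\n")

# The line-delimited output can then be split across partitions by Spark
# in the usual way, e.g.:
#   df = spark.read.json("in_network.ndjson")
```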