Replies: 11 comments 23 replies
-
You should define what your challenges are so people can give more specific suggestions. I would suggest consuming the data and performing analytics on a cloud provider like AWS or Azure, so you can spin up the environments needed to work with the data on demand and take advantage of their cloud-native data analytics services.
-
These "transparency" files are a joke; I can't open them either :(
-
I have had success streaming the JSON using the Python jsonslicer package. You can even stream the gzip file into jsonslicer if you don't want to decompress the file first. It's a wrapper around yajl, so you might be able to find a Java wrapper or use JNI if you really need to.
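A minimal sketch of that approach (not the commenter's actual code): the file name is hypothetical, and the `("in_network", None)` path assumes the CMS in-network-rates schema with a top-level `in_network` array.

```python
# Stream in_network items from a gzipped MRF without decompressing it to
# disk first. File name is hypothetical; the ("in_network", None) path
# assumes the CMS in-network-rates schema.
import gzip
from jsonslicer import JsonSlicer

with gzip.open("2022-07-01_payer_in_network.json.gz", "rb") as f:
    # Yields one fully parsed in_network object at a time, so memory use
    # stays bounded no matter how large the file is.
    for item in JsonSlicer(f, ("in_network", None)):
        print(item.get("billing_code"), len(item.get("negotiated_rates", [])))
```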
-
I have had some initial success with Node and JSONStream (running the Node script from the CLI); the largest file I've been able to parse down to what I wanted so far was 96 GB. If you just want to look at a file before tackling it, a hex editor will let you open it and copy out sections to examine more closely. The toughest thing for me (and probably something I need to look into) is the time it takes to stream through the files.
-
Lots of good examples here. I would highly recommend, as others already have, taking a streaming strategy when consuming these files; this is how the validator works. @DrZeshanKhan, not to be too presumptuous here, but it sounds like software development isn't something you have a history with. These transparency files are geared toward developers and innovators, as was cited quite frequently within the rule. There was no illusion that these files were going to be small or end-consumer friendly; the data disclosure requirements within the rule dictate that. The machine-readable files are meant for machines. You might just be a bit too early, while these files are still being processed and services/platforms are developed for friendly end-user consumption.
-
I have successfully written a process that downloads the Anthem TOC file at https://antm-pt-prod-dataz-nogbd-nophi-us-east1.s3.amazonaws.com/anthem/2022-07-01_anthem_index.json.gz, reads the JSON directly from the compressed file, and downloads the linked files within. There are 6363 distinct sub-files, the vast majority of which download just fine, but 558 of them fail with a 403 Forbidden error. All of them are located in Amazon S3 storage at antm-pt-prod-dataz-nogbd-phi-us-east1.s3.amazonaws.com. I believe those links are missing the Signature and Key-Pair-Id parameters that should be in the query string. Is anyone from Anthem on here in a position to review and correct the TOC file?
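For anyone who wants to try something similar, here is a rough sketch of the URL-collection part of such a process (not the poster's actual code). It assumes the CMS table-of-contents schema (`reporting_structure` → `in_network_files` → `location`) and uses the jsonslicer streaming parser mentioned earlier in this thread.

```python
# Stream the Anthem index straight from the gzipped HTTP response and
# collect the linked in-network file URLs. Key names assume the CMS
# table-of-contents schema.
import gzip
import urllib.request

from jsonslicer import JsonSlicer

INDEX_URL = (
    "https://antm-pt-prod-dataz-nogbd-nophi-us-east1.s3.amazonaws.com"
    "/anthem/2022-07-01_anthem_index.json.gz"
)

urls = set()
with urllib.request.urlopen(INDEX_URL) as resp, gzip.GzipFile(fileobj=resp) as gz:
    # Each reporting_structure entry lists the in-network files for a set of plans.
    for structure in JsonSlicer(gz, ("reporting_structure", None)):
        for file_ref in structure.get("in_network_files", []):
            urls.add(file_ref["location"])

print(f"{len(urls)} distinct in-network file URLs found")
```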
-
Has anybody been able to download the 300 GB files and load them into a database? If so, how long did the data ingestion take?
-
I have some upbeat news to share. Our company, https://www.keyark.com, is now loading and querying in-network rate MRF files. As background, Keyark specializes in providing analytics for complex non-flat data, such as these MRFs, using in-house, custom-built technology. The tech stack includes the KeySQL Studio web app, which supports end-user queries. Our query language is very close to SQL, with extensions that enable analysts to handle the nuances of non-flat data.
-
You can email ***@***.***
…On Wed, Aug 3, 2022 at 6:54 PM DrZeshanKhan ***@***.***> wrote:
I had a thought... is there a way to contact you directly? Perhaps via email?
-
Take a look at this blog post: https://aws.amazon.com/blogs/big-data/process-price-transparency-data-using-aws-glue/
-
Some of our customers are using Dadroit to open these files, so I figured I would suggest checking out our work too.
-
I am trying to download and ingest huge JSON files from payers like Anthem, UHC, etc. using PySpark, and I am facing challenges. Any suggestions/recommendations on other languages/code I could use to download and ingest the files effectively? (One possible workaround is sketched after the error trace below.)
Problem Statement
Currently the PySpark job is not able to serialize the data because one column of the file contains nested JSON that easily exceeds 2 GB; 99% of the file content ends up in that single column.
Steps taken so far to fix this issue:
We have increased driver memory, executor memory, heap memory, and overhead memory.
We changed the default Java serializer to Kryo and set a limit of more than 2 GB.
We cannot partition the data or split the DataFrame, since the entire file is a single row, so that won't help.
We are not using collect or any aggregate function; we are just reading the input JSON file into a deeply nested DataFrame.
We tried to flatten the data immediately after reading it, to make sure each column holds less than 2 GB of data.
Error:
Py4JJavaError: An error occurred while calling o316.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 10 times, most recent failure: Lost task 0.9 in stage 1.0 (TID 19, anbc-ptx-dev-w-0, executor 1): java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 488 because the size after growing exceeds size limitation 2147483632
at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
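One possible workaround, sketched below under the assumption that the file follows the CMS in-network-rates schema: the 2 GB BufferHolder limit is hit because the whole document lands in a single Spark row, so pre-splitting the file into newline-delimited JSON (one `in_network` item per line) with a streaming parser before handing it to Spark gives Spark many small rows it can partition normally. File paths here are hypothetical.

```python
# Sketch only: convert the single giant JSON document into newline-delimited
# JSON, one in_network item per line, so no single Spark row ever approaches
# the 2 GB BufferHolder limit. Paths are hypothetical; the ("in_network",
# None) path assumes the CMS in-network-rates schema.
import gzip
import json

from jsonslicer import JsonSlicer

src = "2022-07-01_payer_in_network.json.gz"  # hypothetical input file
dst = "in_network.ndjson"                    # hypothetical output file

with gzip.open(src, "rb") as fin, open(dst, "w", encoding="utf-8") as fout:
    for item in JsonSlicer(fin, ("in_network", None)):
        fout.write(json.dumps(item) + "\n")

# The line-delimited output can then be split across partitions by Spark
# in the usual way, e.g.:
#   df = spark.read.json("in_network.ndjson")
```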