Process price transparency data using AWS Glue

The Transparency in Coverage rule is a federal regulation in the United States that was finalized by the Centers for Medicare & Medicaid Services (CMS) in October 2020. The rule requires health insurers to provide clear and concise information to consumers about their health plan benefits, including costs and coverage details. Under the rule, health insurers must make available to their members a list of negotiated rates for in-network providers, as well as an estimate of the member's out-of-pocket costs for specific health care services. This information must be made available to members through an online tool that is accessible and easy to use. The Transparency in Coverage rule also requires insurers to make available data files that contain detailed information on the prices they negotiate with health care providers. This information can be used by employers, researchers, and others to compare prices across different insurers and health care providers. Phase 1 implementation of this regulation, which went into effect on July 1, 2022, requires that payors publish machine-readable files publicly for each plan that they offer. CMS has published a technical implementation guide with file formats, file structure, and standards on producing these machine-readable files.

This post walks you through the preprocessing and processing steps required to prepare data published by health insurers in light of this federal regulation using AWS Glue. We also show how to query and derive insights using Amazon Athena.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data.

Challenges processing these machine-readable files

The machine-readable files published by these payors vary in size. A file can range from a few megabytes to hundreds of gigabytes. These files contain large JSON objects that are deeply nested. Unlike NDJSON and JSONL formats, where each line in the file is a JSON object, these files contain a single large JSON object that can span multiple lines. As an example, consider an in_network rate file published by a major health insurer on their website for public access. This file, when uncompressed, is about 20 GB in size, contains a single JSON object, and is deeply nested. The following figure represents the schema of this JSON object when printed using the Spark printSchema() function. Each box highlighted in red is a nested array structure.

JSON Schema
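To make the difference between the two layouts concrete, here is a minimal sketch using only the Python standard library. The sample records are invented for illustration and are not taken from any payor file. It shows that a line-by-line reader works for NDJSON but fails on a single JSON object that spans multiple lines, which is why these files cannot simply be split on newlines.

```python
import json

# Hypothetical sample records, invented for illustration only.
records = [
    {"billing_code": "0001", "negotiated_rate": 10.5},
    {"billing_code": "0002", "negotiated_rate": 20.0},
]

# NDJSON / JSONL layout: one complete JSON object per line.
ndjson_text = "\n".join(json.dumps(record) for record in records)

# Single JSON object spanning several lines, as in the payor files.
single_object_text = json.dumps({"in_network": records}, indent=2)


def parse_line_by_line(text):
    """Parse every non-empty line as an independent JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Works: every line is a self-contained JSON object.
parsed = parse_line_by_line(ndjson_text)

# Fails: a single line of a multi-line object is not valid JSON on its own.
try:
    parse_line_by_line(single_object_text)
    line_by_line_ok = True
except json.JSONDecodeError:
    line_by_line_ok = False

# The single object has to be parsed as a whole (or streamed by a parser
# that tracks nesting across lines).
whole = json.loads(single_object_text)
```

Parsing the whole object at once is fine for this toy input, but it is exactly the step that becomes impractical at 20 GB.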

Loading a 20 GB deeply nested JSON object requires a machine with a large memory footprint. Data loaded into memory is 4 to 10 times its size on disk, so a 20 GB JSON object may need a machine with up to 200 GB of memory. To process workloads larger than 20 GB, these machines need to be scaled vertically, which significantly increases hardware costs. Vertical scaling has its limits, and it's not possible to scale beyond a certain point. Analyzing this data requires unnesting and flattening of deeply nested array structures. These transformations explode the data at an exponential rate, adding to the need for more memory and disk space.
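The memory figure follows directly from the expansion factor. As a back-of-the-envelope sketch (the function name is hypothetical, and the 4x and 10x factors are the rule of thumb stated above, not measured values):

```python
def estimate_memory_gb(size_on_disk_gb, low_factor=4, high_factor=10):
    """Return a (low, high) estimate of the in-memory size in GB."""
    return size_on_disk_gb * low_factor, size_on_disk_gb * high_factor


low, high = estimate_memory_gb(20)
print(f"20 GB on disk needs roughly {low} to {high} GB in memory")
```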

You can use an in-memory distributed processing framework such as Apache Spark to process and analyze such large volumes of data. However, to load this single large JSON object as a Spark DataFrame and perform an action on it, a worker node needs enough memory to load the object in full. When a worker node tries to load this large, deeply nested JSON object and there isn't enough memory to load it in full, the processing job fails with out-of-memory issues. This calls for splitting the large JSON object into smaller chunks using some form of preprocessing logic. Once preprocessed, these smaller files can be further processed in parallel by worker nodes without running into out-of-memory issues.

Solution overview

The solution involves a two-step approach. The first is a preprocessing step, which takes the large JSON object as input and splits it into multiple manageable chunks. This is required to address the challenges we mentioned earlier. The second is a processing step, which prepares and publishes data for analysis.

The preprocessing step uses an AWS Glue Python shell job to split the large JSON object into smaller JSON files. The processing step unnests and flattens the array items from these smaller JSON files in parallel. It then partitions and writes the output as Parquet on Amazon Simple Storage Service (Amazon S3). The partitioned data is cataloged and analyzed using Athena. The following diagram illustrates this workflow.

Solution Overview


To implement the solution in your own AWS account, you need to create or configure the following AWS resources in advance:

Create an AWS Glue preprocessing job

The preprocessing step uses ijson, an open-source iterative JSON parser, to extract items in the outer array of top-level attributes. By streaming and iteratively parsing the large JSON file, the preprocessing step loads only a portion of the file into memory, thereby avoiding out-of-memory issues. It also uses s3pathlib, an open-source Python interface to Amazon S3, which makes it easy to work with S3 file systems.

To create and run the AWS Glue job for preprocessing, complete the following steps:

  1. On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
  2. Create a new job.
  3. Select Python shell script editor.
  4. Select Create a new script with boilerplate code.
    Python Shell Script Editor
  5. Enter the following code into the editor (replace the S3 bucket names and paths to point to the input and output locations in Amazon S3):
import decimal
import json

import boto3
import ijson
from s3pathlib import S3Path


class JSONEncoder(json.JSONEncoder):
    """JSON encoder that converts decimal.Decimal values to float."""

    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return json.JSONEncoder.default(self, obj)


def upload_to_s3(data, upload_path):
    """Serialize data as JSON and write it to upload_path in the bucket."""
    body = json.dumps(data, cls=JSONEncoder).encode('utf-8')
    s3_client.put_object(Body=body, Bucket=bucket, Key=upload_path)


s3_client = boto3.client('s3')

# Replace with your bucket and the path to the JSON object in your bucket.
bucket = 'yourbucket'
largefile_key = 'ptd/2023-03-01_United-HealthCare-Services--Inc-_Third-Party-Administrator_PS1-50_C2_in-network-rates.json.gz'
p = S3Path(bucket, largefile_key)

# Replace the paths to suit your needs.
upload_path_base = 'ptd/preprocessed/base/base.json'
upload_path_in_network = 'ptd/preprocessed/in_network/'
upload_path_provider_references = 'ptd/preprocessed/provider_references/'

# Extract the values of the following top-level attributes and persist them
# in your S3 bucket:
# -- reporting_entity_name
# -- reporting_entity_type
# -- last_updated_on
# -- version
base = {
    'reporting_entity_name': '',
    'reporting_entity_type': '',
    'last_updated_on': '',
    'version': ''
}

# Each attribute is read in its own streaming pass over the file.
for attribute in base:
    with p.open("r") as f:
        for value in ijson.items(f, attribute):
            base[attribute] = value
            # A top-level key occurs only once, so stop at the first match.
            break

upload_to_s3(base, upload_path_base)

# Seek the position of the JSON key provider_references.
# Iterate through the items in the provider_references array, and for every
# 1000 items create a JSON file in the S3 bucket.
with p.open("r") as f:
    provider_references = ijson.items(f, 'provider_references.item')
    fk = 0
    lst = []
    for rowcnt, row in enumerate(provider_references):
        if rowcnt % 1000 == 0:
            if fk > 0:
                dest = upload_path_provider_references + path
                upload_to_s3(lst, dest)
            lst = []
            path = 'provider_references_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    # Upload the remaining items.
    path = 'provider_references_{0}.json'.format(fk)
    dest = upload_path_provider_references + path
    upload_to_s3(lst, dest)

# Seek the position of the JSON key in_network.
# Iterate through the items in the in_network array, and for every
# 25 items create a JSON file in the S3 bucket.
with p.open("r") as f:
    in_network = ijson.items(f, 'in_network.item')
    fk = 0
    lst = []
    for rowcnt, row in enumerate(in_network):
        if rowcnt % 25 == 0:
            if fk > 0:
                dest = upload_path_in_network + path
                upload_to_s3(lst, dest)
            lst = []
            path = 'in_network_{0}.json'.format(fk)
            fk = fk + 1

        lst.append(row)

    # Upload the remaining items.
    path = 'in_network_{0}.json'.format(fk)
    dest = upload_path_in_network + path
    upload_to_s3(lst, dest)

  6. Update the properties of your job on the Job details tab:
    1. For Type, choose Python Shell.
    2. For Python version, choose Python 3.9.
    3. For Data processing units, choose 1 DPU.

For Python shell jobs, you can allocate either 0.0625 or 1 DPU. The default is 0.0625 DPU. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

python shell job config

The Python libraries ijson and s3pathlib are available in pip and can be installed using the AWS Glue job parameter --additional-python-modules. You can also choose to package these libraries, upload them to Amazon S3, and refer to them from your AWS Glue job. For instructions on packaging your library, refer to Providing your own Python library.

  7. To install the Python libraries, set the following job parameters:
    • Key: --additional-python-modules
    • Value: ijson,s3pathlib
      install python modules
  8. Run the job.

The preprocessing step creates three folders in the S3 bucket: base, in_network, and provider_references.


Files in the in_network and provider_references folders contain arrays of JSON objects. Each of these JSON objects represents an element in the outer array of the original large JSON object.
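The way a stream of array elements turns into these files can be illustrated without S3 or ijson. The sketch below uses only the standard library: a generator stands in for the stream of elements, and the helper names (`chunked`, `split_to_documents`, `DecimalEncoder`) and sample values are hypothetical. It also shows why a custom encoder is useful, because a streaming parser such as ijson can yield decimal.Decimal values, which the default JSON encoder rejects. File names are numbered consecutively here, which may differ from the names produced by the job.

```python
import decimal
import json
from itertools import islice


class DecimalEncoder(json.JSONEncoder):
    """Serialize decimal.Decimal values as floats."""

    def default(self, obj):
        if isinstance(obj, decimal.Decimal):
            return float(obj)
        return super().default(obj)


def chunked(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def split_to_documents(items, size, prefix):
    """Return {file_name: json_text} with one JSON array per chunk."""
    return {
        f"{prefix}_{index}.json": json.dumps(chunk, cls=DecimalEncoder)
        for index, chunk in enumerate(chunked(items, size))
    }


# Stand-in for the streamed array elements; the values are invented.
sample_stream = ({"id": i, "rate": decimal.Decimal("1.5")} for i in range(60))
documents = split_to_documents(sample_stream, size=25, prefix="in_network")
print(sorted(documents))
```

Only one chunk is held in memory at a time inside `chunked`, which is the property that keeps the preprocessing step within a small memory budget.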


Create an AWS Glue processing job

The processing job uses the output of the preprocessing step to create a denormalized view of data by extracting and flattening elements and attributes from nested arrays. The level of unnesting depends on the attributes we need for analysis. For example, attributes such as negotiated_rate, npi, and billing_code are essential for analysis, and extracting the values associated with these attributes requires multiple levels of unnesting. The denormalized data is then partitioned by the billing_code column, persisted as Parquet on Amazon S3, and registered as a table in the AWS Glue Data Catalog for querying.
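To see what multiple levels of unnesting do to row counts, consider this plain-Python sketch. It imitates the effect of an explode operation on a tiny, invented record shaped loosely like an in_network item. The `explode` helper here is a hypothetical stand-in written for illustration, not the PySpark function.

```python
def explode(rows, column, alias):
    """Yield one output row per element of the list stored in `column`."""
    for row in rows:
        rest = {key: value for key, value in row.items() if key != column}
        # Rows whose array is missing or empty produce no output rows.
        for element in row.get(column) or []:
            yield {**rest, alias: element}


# Invented sample record; real files carry many more attributes.
nested = [
    {
        "billing_code": "0001",
        "negotiated_rates": [
            {
                "provider_references": [1, 2],
                "negotiated_prices": [
                    {"negotiated_rate": 10.0},
                    {"negotiated_rate": 12.0},
                ],
            },
            {
                "provider_references": [3],
                "negotiated_prices": [{"negotiated_rate": 15.0}],
            },
        ],
    }
]

# Level 1: one row per element of negotiated_rates.
by_rate = list(explode(nested, "negotiated_rates", "rate"))

# Pull the struct fields up, then explode provider_references (level 2).
pulled_up = [
    {
        "billing_code": row["billing_code"],
        "provider_references": row["rate"]["provider_references"],
        "negotiated_prices": row["rate"]["negotiated_prices"],
    }
    for row in by_rate
]
by_provider = list(explode(pulled_up, "provider_references", "provider_reference"))

# Level 3: explode negotiated_prices and keep the scalar of interest.
flat_rows = [
    {
        "billing_code": row["billing_code"],
        "provider_reference": row["provider_reference"],
        "negotiated_rate": row["price"]["negotiated_rate"],
    }
    for row in explode(by_provider, "negotiated_prices", "price")
]

print(len(nested), len(by_rate), len(by_provider), len(flat_rows))
# prints: 1 2 3 5
```

One input record becomes five flat rows after three levels of unnesting. At real file sizes, the same multiplication is what drives the memory and shuffle pressure on the processing job.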

The following code sample guides you through the implementation using PySpark. The columns used to partition the data depend on the query patterns used to analyze the data. Arriving at a partitioning strategy that is in line with the query patterns improves overall query performance during analysis. This post assumes that the queries used for analyzing data will always use the column billing_code to filter and fetch data of interest. Data in each partition is bucketed by npi to improve query performance.

To create your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose Jobs under Glue Studio in the navigation pane.
  2. Create a new job.
  3. Select Spark script editor.
  4. Select Create a new script with boilerplate code.
  5. Enter the following code into the editor (replace the S3 bucket names and paths to point to the input and output locations in Amazon S3):
import sys

from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Create a DataFrame of base objects (reporting_entity_name, reporting_entity_type,
# version, last_updated_on) using the output of the preprocessing step.
base_df = spark.read.json('s3://yourbucket/ptd/preprocessed/base/')

# Create a DataFrame over provider_references objects using the output of the preprocessing step.
prvd_df = spark.read.json('s3://yourbucket/ptd/preprocessed/provider_references/')

# Cross join the DataFrame of base objects with the DataFrame of provider_references.
prvd_df = prvd_df.crossJoin(base_df)

# Create a DataFrame over in_network objects using the output of the preprocessing step.
in_ntwrk_df = spark.read.json('s3://yourbucket/ptd/preprocessed/in_network/')

# Unnest and flatten negotiated_rates and provider_references from in_network objects.
in_ntwrk_df2 = in_ntwrk_df.select(
    in_ntwrk_df.billing_code, in_ntwrk_df.billing_code_type, in_ntwrk_df.billing_code_type_version,
    in_ntwrk_df.covered_services, in_ntwrk_df.description, in_ntwrk_df.name,
    explode(in_ntwrk_df.negotiated_rates).alias('exploded_negotiated_rates'),
    in_ntwrk_df.negotiation_arrangement)

in_ntwrk_df3 = in_ntwrk_df2.select(
    in_ntwrk_df2.billing_code, in_ntwrk_df2.billing_code_type, in_ntwrk_df2.billing_code_type_version,
    in_ntwrk_df2.covered_services, in_ntwrk_df2.description, in_ntwrk_df2.name,
    in_ntwrk_df2.exploded_negotiated_rates.negotiated_prices.alias(
        'exploded_negotiated_rates_negotiated_prices'),
    explode(in_ntwrk_df2.exploded_negotiated_rates.provider_references).alias(
        'exploded_negotiated_rates_provider_references'),
    in_ntwrk_df2.negotiation_arrangement)

# Join the exploded in_network DataFrame with the provider_references DataFrame.
jdf = prvd_df.join(
    in_ntwrk_df3,
    prvd_df.provider_group_id == in_ntwrk_df3.exploded_negotiated_rates_provider_references, "fullouter")

# Unnest and flatten attributes from the rest of the nested arrays.
jdf2 = jdf.select(
    jdf.reporting_entity_name, jdf.reporting_entity_type, jdf.last_updated_on, jdf.version,
    jdf.provider_group_id, jdf.provider_groups, jdf.billing_code,
    jdf.billing_code_type, jdf.billing_code_type_version, jdf.covered_services,
    jdf.description, jdf.name,
    explode(jdf.exploded_negotiated_rates_negotiated_prices).alias(
        'exploded_negotiated_rates_negotiated_prices'),
    jdf.exploded_negotiated_rates_provider_references,
    jdf.negotiation_arrangement)

jdf3 = jdf2.select(
    jdf2.reporting_entity_name, jdf2.reporting_entity_type, jdf2.last_updated_on, jdf2.version,
    jdf2.provider_group_id,
    explode(jdf2.provider_groups).alias('exploded_provider_groups'),
    jdf2.billing_code, jdf2.billing_code_type, jdf2.billing_code_type_version,
    jdf2.covered_services, jdf2.description, jdf2.name,
    jdf2.exploded_negotiated_rates_negotiated_prices.additional_information.alias(
        'additional_information'),
    jdf2.exploded_negotiated_rates_negotiated_prices.billing_class.alias(
        'billing_class'),
    jdf2.exploded_negotiated_rates_negotiated_prices.billing_code_modifier.alias(
        'billing_code_modifier'),
    jdf2.exploded_negotiated_rates_negotiated_prices.expiration_date.alias(
        'expiration_date'),
    jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_rate.alias(
        'negotiated_rate'),
    jdf2.exploded_negotiated_rates_negotiated_prices.negotiated_type.alias(
        'negotiated_type'),
    jdf2.exploded_negotiated_rates_negotiated_prices.service_code.alias(
        'service_code'),
    jdf2.exploded_negotiated_rates_provider_references,
    jdf2.negotiation_arrangement)

jdf4 = jdf3.select(
    jdf3.reporting_entity_name, jdf3.reporting_entity_type, jdf3.last_updated_on, jdf3.version,
    jdf3.provider_group_id,
    explode(jdf3.exploded_provider_groups.npi).alias('npi'),
    jdf3.exploded_provider_groups.tin.type.alias('tin_type'),
    jdf3.exploded_provider_groups.tin.value.alias('tin'),
    jdf3.billing_code, jdf3.billing_code_type,
    jdf3.billing_code_type_version, jdf3.covered_services,
    jdf3.description, jdf3.name, jdf3.additional_information,
    jdf3.billing_class, jdf3.billing_code_modifier,
    jdf3.expiration_date, jdf3.negotiated_rate,
    jdf3.negotiated_type, jdf3.service_code,
    jdf3.negotiation_arrangement)

# Repartition by billing_code.
# Repartitioning changes the distribution of data on the Spark cluster.
# By repartitioning the data, we avoid writing too many small files.
jdf5 = jdf4.repartition("billing_code")

datasink_path = "s3://yourbucket/ptd/processed/billing_code_npi/parquet/"

# Persist the DataFrame as Parquet on S3 and catalog it.
# Partition the data by billing_code. This allows analytical queries to skip data and improves query performance.
# Data is also bucketed and sorted by npi to improve query performance during analysis.
jdf5.write.format('parquet').mode("overwrite").partitionBy('billing_code').bucketBy(2, 'npi').sortBy('npi').saveAsTable('ptdtable', path=datasink_path)

  6. Update the properties of your job on the Job details tab:
    1. For Type, choose Spark.
    2. For Glue version, choose Glue 4.0.
    3. For Language, choose Python 3.
    4. For Worker type, choose G 2X.
    5. For Requested number of workers, enter 20.

The number of workers and the worker type to use for your processing job depend on factors such as the amount of data being processed, the speed at which it needs to be processed, and the partitioning scheme used. Repartitioning of data can result in out-of-memory issues, especially when data is heavily skewed on the column used to repartition. It's also possible to reach Amazon S3 service limits if too many workers are assigned to the job. This is because tasks running on these worker nodes may try to read from or write to the same S3 prefix, causing Amazon S3 to throttle the incoming requests. For more details, refer to Best practices design patterns: optimizing Amazon S3 performance.

processing job config

Exploding array elements creates new rows and columns, thereby exponentially increasing the amount of data that needs to be processed. Apache Spark splits this data into multiple Spark partitions on different worker nodes so that it can process large amounts of data in parallel. In Apache Spark, shuffling happens when data needs to be redistributed across the cluster. Shuffle operations are commonly triggered by wide transformations such as join, reduceByKey, groupByKey, and repartition. In case of exceptions due to local storage limitations, it helps to supplement or replace local disk storage capacity with Amazon S3 for large shuffle operations. This is possible with the AWS Glue Spark shuffle plugin with Amazon S3. With the cloud shuffle storage plugin for Apache Spark, you can avoid disk space-related failures.

  7. To use the Spark shuffle plugin, set the following job parameters:
    • Key: --write-shuffle-files-to-s3
    • Value: true
      spark shuffle plugin

Query the data

You can query the cataloged data using Athena. For instructions on setting up Athena, refer to Setting up.

On the Athena console, choose Query editor in the navigation pane to run your query, and specify your data source and database.

sql query

To find the minimum, maximum, and average negotiated rates for procedure codes, run the following query:

SELECT
billing_code,
round(min(negotiated_rate), 2) as min_price,
round(avg(negotiated_rate), 2) as avg_price,
round(max(negotiated_rate), 2) as max_price,
description
FROM "default"."ptdtable"
group by billing_code, description
limit 10;

The following screenshot shows the query results.

sql query results
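Because the table is partitioned by billing_code, queries that filter on that column let the engine skip the other partitions. The following is a small sketch of how such a filtered variant of the query might be generated. The helper name, the alphanumeric check, and the example code 99213 are assumptions made for illustration; the table name is the one used in this post.

```python
import re


def build_rate_summary_query(billing_code, table='"default"."ptdtable"'):
    """Return a rate-summary query restricted to one billing_code partition."""
    # Assumption: billing codes are plain alphanumeric strings. Rejecting
    # anything else keeps the value safe to embed in the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9]+", billing_code):
        raise ValueError(f"unexpected billing_code: {billing_code!r}")
    return (
        "SELECT billing_code, description, "
        "round(min(negotiated_rate), 2) AS min_price, "
        "round(avg(negotiated_rate), 2) AS avg_price, "
        "round(max(negotiated_rate), 2) AS max_price "
        f"FROM {table} "
        f"WHERE billing_code = '{billing_code}' "
        "GROUP BY billing_code, description"
    )


query = build_rate_summary_query("99213")  # hypothetical example code
print(query)
```

The resulting text can be pasted into the Athena query editor in place of the unfiltered query.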

Clean up

To avoid incurring future charges, delete the AWS resources you created:

  1. Delete the S3 objects and bucket.
  2. Delete the IAM policies and roles.
  3. Delete the AWS Glue jobs for preprocessing and processing.


This post guided you through the preprocessing and processing steps needed to query and analyze price transparency-related machine-readable files. Although it's possible to use other AWS services to process such data, this post focused on preparing and publishing data using AWS Glue.

To learn more about the Transparency in Coverage rule, refer to Transparency in Coverage. For best practices for scaling Apache Spark jobs and partitioning data with AWS Glue, refer to Best practices to scale Apache Spark jobs and partition data with AWS Glue. To learn how to monitor AWS Glue jobs, refer to Monitoring AWS Glue Spark jobs.

We look forward to hearing any feedback or questions.

About the Authors

Hari Thatavarthy is a Senior Solutions Architect on the AWS Data Lab team. He helps customers design and build solutions in the data and analytics space. He believes in data democratization and loves to solve complex data processing-related problems. In his spare time, he loves to play table tennis.

Krishna Maddileti is a Senior Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey and helps them with data engineering, data lakes, and analytics. In his spare time, he enjoys spending time with his family and playing video games with his 7-year-old.

Yadukishore Tatavarthi is a Senior Partner Solutions Architect at AWS. He works closely with global system integrator partners to enable and support customers moving their workloads to AWS.

Manish Kola is a Solutions Architect on the AWS Data Lab team. He partners with customers on their AWS journey.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.
