Cloud Setup: Using the Dataset with Google Cloud SQL & BigQuery

This page explains how to set up the MIMIC-IV-Ext-CEKG dataset using Google Cloud services. It is recommended to use Google Cloud instead of a local computer, as the volume of files is large. To use the dataset on the cloud, you will need:

Having a Google Cloud account. It is recommended to start with a free Google Cloud account, which provides $300 in free credits valid for 90 days. You need a credit or debit card to activate the free account without being charged, but you can cancel before reaching the $300 limit or the 90-day period.
Fulfill all the steps for dataset access.

Follow the step-by-step guide to upload the dataset and begin using it for your cloud-based research, simulation, or application development.

Step 1. Download the Dataset Files:

Visit the dataset's paper page and download the dataset from the bottom of the page.

Step 2. Unzip the Required Tables:

Extract each required dataset table from the downloaded archive.

Step 3. Create a Google Cloud Project:

Go to the Google Cloud Manage resources.
In the top navigation bar, click the project name dropdown (usually "My First Project") next to the Google Cloud logo.
Click on “Create Project.”
Fill in the project details:
- Project Name: Choose any name (e.g., my-mimic-cekg-project).
Click “Create” and wait a few seconds for your new project to be created.

Step 4. Upload the Dataset to Google Cloud Storage:

Go to Google Cloud Storage.
Ensure your selected project is my-mimic-cekg-project. Use the dropdown in the top navbar to switch if needed.
Click “Create Bucket.”
Choose a globally unique and permanent bucket name.
Click “Create” and then confirm the bucket creation.
Click “Upload Files” and upload all unzipped dataset tables as CSV files into your new bucket.

Step 5. Create a Dataset in Google BigQuery:

Go to Google BigQuery.
Click on “Try it in Console.”
Ensure your selected project is my-mimic-cekg-project. Use the dropdown in the top navbar to switch if needed.
Click the three-dot menu (⋮) next to your project name and choose “Create Dataset.”
Enter a dataset name (e.g., MimicCEKG) in the “Dataset ID” field and click “Create Dataset.”

Step 6. Import the Tables into BigQuery:

Repeat the following steps for each dataset table:

Click the three-dot menu (⋮) next to the dataset and choose “Create Table.”
For the “Source,” choose “Google Cloud Storage.”
Click “Browse” and select the bucket you created in Step 4.
Select the desired table (CSV file) and click “Select.”
Enter a table name for the BigQuery table.
For the “Schema,” check “Auto-detect.”
Click “Create Table.” The table should now appear in your dataset.

Note: If auto-detection fails during import, you can manually define the schema by clicking “Add Field” and entering column names and data types based on the dataset documentation. If the table is imported successfully, you will see the fields and their types listed automatically.

Table: EventLog

#	Field name	Type
Field 1	Event_ID	STRING
Field 2	Timestamp	DATETIME
Field 3	Activity	STRING
Field 4	Activity_Synonym	STRING
Field 5	Activity_Attributes_ID	STRING
Field 6	Activity_Instance_ID	STRING
Field 7	Entity1_origin	STRING
Field 8	Entity1_ID	INTEGER
Field 9	Entity2_origin	STRING
Field 10	Entity2_ID	INTEGER
Field 11	Entity3_origin	STRING
Field 12	Entity3_ID	INTEGER
Field 13	Entity4_origin	STRING
Field 14	Entity4_ID	INTEGER
Field 15	Entity5_origin	STRING
Field 16	Entity5_ID	INTEGER
Field 17	temp_patient_id	INTEGER
Field 18	temp_encounter_id	STRING

Table: EntitiesAttributes

#	Field name	Type
Field 1	Origin	STRING
Field 2	ID	INTEGER
Field 3	Name	STRING
Field 4	Value	STRING
Field 5	Category	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	INTEGER

Table: EntitiesAttributeRel

#	Field name	Type
Field 1	Origin1	STRING
Field 2	ID1	STRING
Field 3	Origin2	STRING
Field 4	ID2	STRING
Field 5	temp_patient_id	INTEGER
Field 6	temp_encounter_id	STRING

Table: ActivityAttributes

#	Field name	Type
Field 1	Activity_Attributes_ID	STRING
Field 2	Activity	STRING
Field 3	Activity_Synonym	STRING
Field 4	Activity_Attribute	STRING
Field 5	Activity_Attribute_Value	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	STRING

Table: ActivitiesDomain

#	Field name	Type
Field 1	Activity_Domain	STRING

Table: ICD

#	Field name	Type
Field 1	ICD_Origin	STRING
Field 2	ICD_Code	STRING
Field 3	ICD_Version	STRING
Field 4	ICD_Code_Title	STRING

Table: SCT_Node

#	Field name	Type
Field 1	SCT_ID	INTEGER
Field 2	SCT_Code	INTEGER
Field 3	SCT_DescriptionA_Type1	STRING
Field 4	SCT_DescriptionA_Type2	STRING
Field 5	SCT_DescriptionB	STRING
Field 6	SCT_Semantic_Tags	STRING
Field 7	SCT_Type	STRING
Field 8	SCT_Level	STRING

Table: SCT_REL

#	Field name	Type
Field 1	SCT_ID_1	INTEGER
Field 2	SCT_Code_1	INTEGER
Field 3	SCT_ID_2	INTEGER
Field 4	SCT_Code_2	INTEGER

Table: CNM1

#	Field name	Type
Field 1	Disorder_ID	INTEGER
Field 2	ICD_Code	STRING

Table: CNM2

#	Field name	Type
Field 1	ICD_Code	STRING
Field 2	SCT_ID	STRING

Table: CNM3

#	Field name	Type
Field 1	Activity	STRING
Field 2	Activity_Synonym	STRING
Field 3	SCT_ID	INTEGER
Field 4	SCT_Code	INTEGER

Table: CNM4_1

#	Field name	Type
Field 1	Activity	STRING
Field 2	Activity_Synonym	STRING
Field 3	Activity_Domain	STRING

Table: CNM4_2

#	Field name	Type
Field 1	Activity_Domain	STRING
Field 2	SCT_ID	INTEGER
Field 3	SCT_Code	INTEGER

Table: CNM5_Activity_Instance_ID

#	Field name	Type
Field 1	Activity_Instance_ID	STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part1

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Disorders_Name	STRING
Field 3	temp_patient_id	INTEGER
Field 4	temp_encounter_id	STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part2

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Disorders_Name	STRING
Field 3	temp_patient_id	INTEGER
Field 4	temp_encounter_id	STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part3

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Disorders_Name	STRING
Field 3	temp_patient_id	INTEGER
Field 4	temp_encounter_id	STRING

Table: CNM5_Activity_Instance_ID_with_Features_Part1

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Activity	STRING
Field 3	Activity_Synonym	STRING
Field 4	Activity_Attribute	STRING
Field 5	Activity_Attribute_Value	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	STRING
Field 8	Timestamp	DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part2

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Activity	STRING
Field 3	Activity_Synonym	STRING
Field 4	Activity_Attribute	STRING
Field 5	Activity_Attribute_Value	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	STRING
Field 8	Timestamp	DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part3

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Activity	STRING
Field 3	Activity_Synonym	STRING
Field 4	Activity_Attribute	STRING
Field 5	Activity_Attribute_Value	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	STRING
Field 8	Timestamp	DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part4

#	Field name	Type
Field 1	Activity_Instance_ID	STRING
Field 2	Activity	STRING
Field 3	Activity_Synonym	STRING
Field 4	Activity_Attribute	STRING
Field 5	Activity_Attribute_Value	STRING
Field 6	temp_patient_id	INTEGER
Field 7	temp_encounter_id	STRING
Field 8	Timestamp	DATETIME

Table: CNM5_class

#	Field name	Type
Field 1	Disorder_Name	STRING
Field 2	Disorder_ID	INTEGER

Table: ClusterReference1

#	Field name	Type
Field 1	temp_patient_id	INTEGER
Field 2	Morbid_num	INTEGER
Field 3	Admission_num	INTEGER
Field 4	gender	INTEGER
Field 5	anchor_age	INTEGER
Field 6	dod	INTEGER

Table: N_ClusterReference2

#	Field name	Type
Field 1	temp_patient_id	INTEGER
Field 2	temp_encounter_id	INTEGER
Field 3	ICD10_Code	STRING
Field 4	ICD10_Code_Root	STRING
Field 5	ICD10_Code_title	STRING
Field 6	ICD10_Code_Root_title	STRING
Field 7	Morbid_num	INTEGER
Field 8	Admission_num	INTEGER