Cloud Setup: Using the Dataset with Google Cloud SQL & BigQuery
This page explains how to set up the MIMIC-IV-Ext-CEKG dataset using Google Cloud services. It is
recommended to use Google Cloud instead of a local computer, as the volume of files is large. To use the
dataset on the cloud, you will need:
-
Having a Google Cloud account. It is recommended to start
with a free Google Cloud account, which provides $300 in free credits valid for 90 days. You need
a credit or debit card to activate the free account without being charged, but you can cancel before
reaching the $300 limit or the 90-day period.
- Fulfill all the steps for dataset access.
Follow the step-by-step guide to upload the dataset and begin using it for your cloud-based research,
simulation, or application development.
Step 1. Download the Dataset Files:
Visit the dataset's paper page and
download the dataset from the bottom of the page.
Step 2. Unzip the Required Tables:
Extract each required dataset table from the downloaded archive.
Step 3. Create a Google Cloud Project:
- Go to the Google
Cloud Manage resources.
- In the top navigation bar, click the project name dropdown (usually "My First Project") next to the
Google Cloud logo.
- Click on “Create Project.”
- Fill in the project details:
- Project Name: Choose any name (e.g.,
my-mimic-cekg-project).
- Click “Create” and wait a few seconds for your new project to be created.
Step 4. Upload the Dataset to Google Cloud Storage:
- Go to Google Cloud Storage.
- Ensure your selected project is
my-mimic-cekg-project. Use the dropdown in the top
navbar to switch if needed.
- Click “Create Bucket.”
- Choose a globally unique and permanent bucket name.
- Click “Create” and then confirm the bucket creation.
- Click “Upload Files” and upload all unzipped dataset tables as CSV files into your new bucket.
Step 5. Create a Dataset in Google BigQuery:
- Go to Google BigQuery.
- Click on “Try it in Console.”
- Ensure your selected project is
my-mimic-cekg-project. Use the dropdown in the top
navbar to switch if needed.
- Click the three-dot menu (⋮) next to your project name and choose “Create Dataset.”
- Enter a dataset name (e.g.,
MimicCEKG) in the “Dataset ID” field and click “Create
Dataset.”
Step 6. Import the Tables into BigQuery:
Repeat the following steps for each dataset table:
- Click the three-dot menu (⋮) next to the dataset and choose “Create Table.”
- For the “Source,” choose “Google Cloud Storage.”
- Click “Browse” and select the bucket you created in Step 4.
- Select the desired table (CSV file) and click “Select.”
- Enter a table name for the BigQuery table.
- For the “Schema,” check “Auto-detect.”
- Click “Create Table.” The table should now appear in your dataset.
Note: If auto-detection fails during import, you can manually define the schema by
clicking “Add Field” and entering column names and data types based on the dataset documentation. If the
table is imported successfully, you will see the fields and their types listed automatically.
Table: EventLog
| # |
Field name |
Type |
| Field 1 |
Event_ID |
STRING |
| Field 2 |
Timestamp |
DATETIME |
| Field 3 |
Activity |
STRING |
| Field 4 |
Activity_Synonym |
STRING |
| Field 5 |
Activity_Attributes_ID |
STRING |
| Field 6 |
Activity_Instance_ID |
STRING |
| Field 7 |
Entity1_origin |
STRING |
| Field 8 |
Entity1_ID |
INTEGER |
| Field 9 |
Entity2_origin |
STRING |
| Field 10 |
Entity2_ID |
INTEGER |
| Field 11 |
Entity3_origin |
STRING |
| Field 12 |
Entity3_ID |
INTEGER |
| Field 13 |
Entity4_origin |
STRING |
| Field 14 |
Entity4_ID |
INTEGER |
| Field 15 |
Entity5_origin |
STRING |
| Field 16 |
Entity5_ID |
INTEGER |
| Field 17 |
temp_patient_id |
INTEGER |
| Field 18 |
temp_encounter_id |
STRING |
Table: EntitiesAttributes
| # |
Field name |
Type |
| Field 1 |
Origin |
STRING |
| Field 2 |
ID |
INTEGER |
| Field 3 |
Name |
STRING |
| Field 4 |
Value |
STRING |
| Field 5 |
Category |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
INTEGER |
Table: EntitiesAttributeRel
| # |
Field name |
Type |
| Field 1 |
Origin1 |
STRING |
| Field 2 |
ID1 |
STRING |
| Field 3 |
Origin2 |
STRING |
| Field 4 |
ID2 |
STRING |
| Field 5 |
temp_patient_id |
INTEGER |
| Field 6 |
temp_encounter_id |
STRING |
Table: ActivityAttributes
| # |
Field name |
Type |
| Field 1 |
Activity_Attributes_ID |
STRING |
| Field 2 |
Activity |
STRING |
| Field 3 |
Activity_Synonym |
STRING |
| Field 4 |
Activity_Attribute |
STRING |
| Field 5 |
Activity_Attribute_Value |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
STRING |
Table: ActivitiesDomain
| # |
Field name |
Type |
| Field 1 |
Activity_Domain |
STRING |
Table: ICD
| # |
Field name |
Type |
| Field 1 |
ICD_Origin |
STRING |
| Field 2 |
ICD_Code |
STRING |
| Field 3 |
ICD_Version |
STRING |
| Field 4 |
ICD_Code_Title |
STRING |
Table: SCT_Node
| # |
Field name |
Type |
| Field 1 |
SCT_ID |
INTEGER |
| Field 2 |
SCT_Code |
INTEGER |
| Field 3 |
SCT_DescriptionA_Type1 |
STRING |
| Field 4 |
SCT_DescriptionA_Type2 |
STRING |
| Field 5 |
SCT_DescriptionB |
STRING |
| Field 6 |
SCT_Semantic_Tags |
STRING |
| Field 7 |
SCT_Type |
STRING |
| Field 8 |
SCT_Level |
STRING |
Table: SCT_REL
| # |
Field name |
Type |
| Field 1 |
SCT_ID_1 |
INTEGER |
| Field 2 |
SCT_Code_1 |
INTEGER |
| Field 3 |
SCT_ID_2 |
INTEGER |
| Field 4 |
SCT_Code_2 |
INTEGER |
Table: CNM1
| # |
Field name |
Type |
| Field 1 |
Disorder_ID |
INTEGER |
| Field 2 |
ICD_Code |
STRING |
Table: CNM2
| # |
Field name |
Type |
| Field 1 |
ICD_Code |
STRING |
| Field 2 |
SCT_ID |
STRING |
Table: CNM3
| # |
Field name |
Type |
| Field 1 |
Activity |
STRING |
| Field 2 |
Activity_Synonym |
STRING |
| Field 3 |
SCT_ID |
INTEGER |
| Field 4 |
SCT_Code |
INTEGER |
Table: CNM4_1
| # |
Field name |
Type |
| Field 1 |
Activity |
STRING |
| Field 2 |
Activity_Synonym |
STRING |
| Field 3 |
Activity_Domain |
STRING |
Table: CNM4_2
| # |
Field name |
Type |
| Field 1 |
Activity_Domain |
STRING |
| Field 2 |
SCT_ID |
INTEGER |
| Field 3 |
SCT_Code |
INTEGER |
Table: CNM5_Activity_Instance_ID
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
Table: CNM5_Activity_Instance_ID_with_Class_Part1
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Disorders_Name |
STRING |
| Field 3 |
temp_patient_id |
INTEGER |
| Field 4 |
temp_encounter_id |
STRING |
Table: CNM5_Activity_Instance_ID_with_Class_Part2
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Disorders_Name |
STRING |
| Field 3 |
temp_patient_id |
INTEGER |
| Field 4 |
temp_encounter_id |
STRING |
Table: CNM5_Activity_Instance_ID_with_Class_Part3
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Disorders_Name |
STRING |
| Field 3 |
temp_patient_id |
INTEGER |
| Field 4 |
temp_encounter_id |
STRING |
Table: CNM5_Activity_Instance_ID_with_Features_Part1
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Activity |
STRING |
| Field 3 |
Activity_Synonym |
STRING |
| Field 4 |
Activity_Attribute |
STRING |
| Field 5 |
Activity_Attribute_Value |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
STRING |
| Field 8 |
Timestamp |
DATETIME |
Table: CNM5_Activity_Instance_ID_with_Features_Part2
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Activity |
STRING |
| Field 3 |
Activity_Synonym |
STRING |
| Field 4 |
Activity_Attribute |
STRING |
| Field 5 |
Activity_Attribute_Value |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
STRING |
| Field 8 |
Timestamp |
DATETIME |
Table: CNM5_Activity_Instance_ID_with_Features_Part3
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Activity |
STRING |
| Field 3 |
Activity_Synonym |
STRING |
| Field 4 |
Activity_Attribute |
STRING |
| Field 5 |
Activity_Attribute_Value |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
STRING |
| Field 8 |
Timestamp |
DATETIME |
Table: CNM5_Activity_Instance_ID_with_Features_Part4
| # |
Field name |
Type |
| Field 1 |
Activity_Instance_ID |
STRING |
| Field 2 |
Activity |
STRING |
| Field 3 |
Activity_Synonym |
STRING |
| Field 4 |
Activity_Attribute |
STRING |
| Field 5 |
Activity_Attribute_Value |
STRING |
| Field 6 |
temp_patient_id |
INTEGER |
| Field 7 |
temp_encounter_id |
STRING |
| Field 8 |
Timestamp |
DATETIME |
Table: CNM5_class
| # |
Field name |
Type |
| Field 1 |
Disorder_Name |
STRING |
| Field 2 |
Disorder_ID |
INTEGER |
Table: ClusterReference1
| # |
Field name |
Type |
| Field 1 |
temp_patient_id |
INTEGER |
| Field 2 |
Morbid_num |
INTEGER |
| Field 3 |
Admission_num |
INTEGER |
| Field 4 |
gender |
INTEGER |
| Field 5 |
anchor_age |
INTEGER |
| Field 6 |
dod |
INTEGER |
Table: N_ClusterReference2
| # |
Field name |
Type |
| Field 1 |
temp_patient_id |
INTEGER |
| Field 2 |
temp_encounter_id |
INTEGER |
| Field 3 |
ICD10_Code |
STRING |
| Field 4 |
ICD10_Code_Root |
STRING |
| Field 5 |
ICD10_Code_title |
STRING |
| Field 6 |
ICD10_Code_Root_title |
STRING |
| Field 7 |
Morbid_num |
INTEGER |
| Field 8 |
Admission_num |
INTEGER |