Cloud Setup: Using the Dataset with Google Cloud SQL & BigQuery

This page explains how to set up the MIMIC-IV-Ext-CEKG dataset using Google Cloud services. It is recommended to use Google Cloud instead of a local computer, as the volume of files is large. To use the dataset on the cloud, you will need:

  • Having a Google Cloud account. It is recommended to start with a free Google Cloud account, which provides $300 in free credits valid for 90 days. You need a credit or debit card to activate the free account without being charged, but you can cancel before reaching the $300 limit or the 90-day period.
  • Fulfill all the steps for dataset access.

Follow the step-by-step guide to upload the dataset and begin using it for your cloud-based research, simulation, or application development.

Step 1. Download the Dataset Files:

Visit the dataset's paper page and download the dataset from the bottom of the page.

Step 2. Unzip the Required Tables:

Extract each required dataset table from the downloaded archive.

Step 3. Create a Google Cloud Project:

  1. Go to the Google Cloud Manage resources.
  2. In the top navigation bar, click the project name dropdown (usually "My First Project") next to the Google Cloud logo.
  3. Click on “Create Project.”
  4. Fill in the project details:
    • Project Name: Choose any name (e.g., my-mimic-cekg-project).
  5. Click “Create” and wait a few seconds for your new project to be created.

Step 4. Upload the Dataset to Google Cloud Storage:

  1. Go to Google Cloud Storage.
  2. Ensure your selected project is my-mimic-cekg-project. Use the dropdown in the top navbar to switch if needed.
  3. Click “Create Bucket.”
  4. Choose a globally unique and permanent bucket name.
  5. Click “Create” and then confirm the bucket creation.
  6. Click “Upload Files” and upload all unzipped dataset tables as CSV files into your new bucket.

Step 5. Create a Dataset in Google BigQuery:

  1. Go to Google BigQuery.
  2. Click on “Try it in Console.”
  3. Ensure your selected project is my-mimic-cekg-project. Use the dropdown in the top navbar to switch if needed.
  4. Click the three-dot menu (⋮) next to your project name and choose “Create Dataset.”
  5. Enter a dataset name (e.g., MimicCEKG) in the “Dataset ID” field and click “Create Dataset.”

Step 6. Import the Tables into BigQuery:

Repeat the following steps for each dataset table:

  1. Click the three-dot menu (⋮) next to the dataset and choose “Create Table.”
  2. For the “Source,” choose “Google Cloud Storage.”
  3. Click “Browse” and select the bucket you created in Step 4.
  4. Select the desired table (CSV file) and click “Select.”
  5. Enter a table name for the BigQuery table.
  6. For the “Schema,” check “Auto-detect.”
  7. Click “Create Table.” The table should now appear in your dataset.

Note: If auto-detection fails during import, you can manually define the schema by clicking “Add Field” and entering column names and data types based on the dataset documentation. If the table is imported successfully, you will see the fields and their types listed automatically.

Table: EventLog

# Field name Type
Field 1 Event_ID STRING
Field 2 Timestamp DATETIME
Field 3 Activity STRING
Field 4 Activity_Synonym STRING
Field 5 Activity_Attributes_ID STRING
Field 6 Activity_Instance_ID STRING
Field 7 Entity1_origin STRING
Field 8 Entity1_ID INTEGER
Field 9 Entity2_origin STRING
Field 10 Entity2_ID INTEGER
Field 11 Entity3_origin STRING
Field 12 Entity3_ID INTEGER
Field 13 Entity4_origin STRING
Field 14 Entity4_ID INTEGER
Field 15 Entity5_origin STRING
Field 16 Entity5_ID INTEGER
Field 17 temp_patient_id INTEGER
Field 18 temp_encounter_id STRING

Table: EntitiesAttributes

# Field name Type
Field 1 Origin STRING
Field 2 ID INTEGER
Field 3 Name STRING
Field 4 Value STRING
Field 5 Category STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id INTEGER

Table: EntitiesAttributeRel

# Field name Type
Field 1 Origin1 STRING
Field 2 ID1 STRING
Field 3 Origin2 STRING
Field 4 ID2 STRING
Field 5 temp_patient_id INTEGER
Field 6 temp_encounter_id STRING

Table: ActivityAttributes

# Field name Type
Field 1 Activity_Attributes_ID STRING
Field 2 Activity STRING
Field 3 Activity_Synonym STRING
Field 4 Activity_Attribute STRING
Field 5 Activity_Attribute_Value STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id STRING

Table: ActivitiesDomain

# Field name Type
Field 1 Activity_Domain STRING

Table: ICD

# Field name Type
Field 1 ICD_Origin STRING
Field 2 ICD_Code STRING
Field 3 ICD_Version STRING
Field 4 ICD_Code_Title STRING

Table: SCT_Node

# Field name Type
Field 1 SCT_ID INTEGER
Field 2 SCT_Code INTEGER
Field 3 SCT_DescriptionA_Type1 STRING
Field 4 SCT_DescriptionA_Type2 STRING
Field 5 SCT_DescriptionB STRING
Field 6 SCT_Semantic_Tags STRING
Field 7 SCT_Type STRING
Field 8 SCT_Level STRING

Table: SCT_REL

# Field name Type
Field 1 SCT_ID_1 INTEGER
Field 2 SCT_Code_1 INTEGER
Field 3 SCT_ID_2 INTEGER
Field 4 SCT_Code_2 INTEGER

Table: CNM1

# Field name Type
Field 1 Disorder_ID INTEGER
Field 2 ICD_Code STRING

Table: CNM2

# Field name Type
Field 1 ICD_Code STRING
Field 2 SCT_ID STRING

Table: CNM3

# Field name Type
Field 1 Activity STRING
Field 2 Activity_Synonym STRING
Field 3 SCT_ID INTEGER
Field 4 SCT_Code INTEGER

Table: CNM4_1

# Field name Type
Field 1 Activity STRING
Field 2 Activity_Synonym STRING
Field 3 Activity_Domain STRING

Table: CNM4_2

# Field name Type
Field 1 Activity_Domain STRING
Field 2 SCT_ID INTEGER
Field 3 SCT_Code INTEGER

Table: CNM5_Activity_Instance_ID

# Field name Type
Field 1 Activity_Instance_ID STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part1

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Disorders_Name STRING
Field 3 temp_patient_id INTEGER
Field 4 temp_encounter_id STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part2

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Disorders_Name STRING
Field 3 temp_patient_id INTEGER
Field 4 temp_encounter_id STRING

Table: CNM5_Activity_Instance_ID_with_Class_Part3

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Disorders_Name STRING
Field 3 temp_patient_id INTEGER
Field 4 temp_encounter_id STRING

Table: CNM5_Activity_Instance_ID_with_Features_Part1

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Activity STRING
Field 3 Activity_Synonym STRING
Field 4 Activity_Attribute STRING
Field 5 Activity_Attribute_Value STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id STRING
Field 8 Timestamp DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part2

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Activity STRING
Field 3 Activity_Synonym STRING
Field 4 Activity_Attribute STRING
Field 5 Activity_Attribute_Value STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id STRING
Field 8 Timestamp DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part3

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Activity STRING
Field 3 Activity_Synonym STRING
Field 4 Activity_Attribute STRING
Field 5 Activity_Attribute_Value STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id STRING
Field 8 Timestamp DATETIME

Table: CNM5_Activity_Instance_ID_with_Features_Part4

# Field name Type
Field 1 Activity_Instance_ID STRING
Field 2 Activity STRING
Field 3 Activity_Synonym STRING
Field 4 Activity_Attribute STRING
Field 5 Activity_Attribute_Value STRING
Field 6 temp_patient_id INTEGER
Field 7 temp_encounter_id STRING
Field 8 Timestamp DATETIME

Table: CNM5_class

# Field name Type
Field 1 Disorder_Name STRING
Field 2 Disorder_ID INTEGER

Table: ClusterReference1

# Field name Type
Field 1 temp_patient_id INTEGER
Field 2 Morbid_num INTEGER
Field 3 Admission_num INTEGER
Field 4 gender INTEGER
Field 5 anchor_age INTEGER
Field 6 dod INTEGER

Table: N_ClusterReference2

# Field name Type
Field 1 temp_patient_id INTEGER
Field 2 temp_encounter_id INTEGER
Field 3 ICD10_Code STRING
Field 4 ICD10_Code_Root STRING
Field 5 ICD10_Code_title STRING
Field 6 ICD10_Code_Root_title STRING
Field 7 Morbid_num INTEGER
Field 8 Admission_num INTEGER