!!! warning "Under Construction"
    This documentation is currently under active development and subject to change. Some sections may be incomplete or missing.

Data Documentation

Overview

This document describes the data sources and standardized format used in our benchmark pipeline for nocturnal hypoglycemia prediction.

Historical Context

Our benchmark pipeline was originally built on the Kaggle Bristol Type 1 Diabetes Dataset, which has influenced some of our naming conventions and data structures. This historical context helps explain certain design decisions:

  • Column naming convention: The -0:00 suffix in column names (e.g., bg-0:00) indicates current time measurements. While we could use +1:00 for future measurements, this is not currently utilized in our pipeline.
  • Time intervals: The Kaggle dataset combines data from different types of continuous glucose monitors (CGMs):
    • Dexcom users: 5-minute intervals
    • Libre users: 15-minute intervals
  • The time column was originally used to distinguish between these different interval types in the Kaggle dataset
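
As a rough illustration of the two interval types, the sampling interval of each patient can be estimated from the timestamps themselves. This is a minimal sketch, not the pipeline's actual logic: `infer_interval_minutes` is a hypothetical helper, and it assumes a frame with `p_num` and `datetime` columns.

```python
import pandas as pd


def infer_interval_minutes(df: pd.DataFrame) -> pd.Series:
    """Return the most common gap (in minutes) between readings per patient.

    Hypothetical helper: assumes `p_num` and `datetime` columns.
    """
    df = df.sort_values(["p_num", "datetime"])
    gaps = df.groupby("p_num")["datetime"].diff().dt.total_seconds() / 60
    # mode() ignores NaN (the first reading of each patient has no predecessor)
    return gaps.groupby(df["p_num"]).agg(lambda g: g.mode().iloc[0])


# Toy data: one Dexcom-like patient (5 min) and one Libre-like patient (15 min)
readings = pd.DataFrame(
    {
        "p_num": ["p01"] * 4 + ["p02"] * 4,
        "datetime": list(pd.date_range("2024-01-01", periods=4, freq="5min"))
        + list(pd.date_range("2024-01-01", periods=4, freq="15min")),
    }
)
intervals = infer_interval_minutes(readings)
```

Using the most common gap rather than the mean keeps the estimate stable when a patient has occasional long sensor dropouts.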

Important Implementation Notes

  • The datetime column serves as the index for data manipulation
  • While models focus on row relationships rather than timestamps, maintaining consistent intervals is crucial
  • Missing data must be either filled or imputed to ensure data quality
  • These requirements apply across all supported datasets
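
The notes above can be sketched with plain pandas. The example is an assumption-laden illustration rather than the pipeline's own code: `regularize` is a hypothetical function, the 5-minute grid is chosen for the toy data, and linear interpolation is only one possible imputation choice.

```python
import pandas as pd


def regularize(df: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Reindex one patient's readings onto a fixed grid; gaps become NaN rows."""
    df = df.set_index("datetime").sort_index()
    grid = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    return df.reindex(grid).rename_axis("datetime")


# Toy data with one missing reading at 00:10
raw = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:15"]
        ),
        "bg_mM": [5.0, 5.4, 6.2],
    }
)
regular = regularize(raw)

# Fill short gaps only (here: at most two consecutive missing rows)
filled = regular["bg_mM"].interpolate(limit=2)
```

Separating the two steps (create the missing rows first, impute later) mirrors the split described in this document, where rows are regularized during dataset processing and imputation is left to the benchmark pipeline.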

Supported Datasets

  1. Bristol Type 1 Diabetes Dataset

    • A comprehensive dataset from Kaggle
    • Contains both Dexcom (5-min interval) and Libre (15-min interval) users
    • Multiple patients in a single CSV file (an uncommon layout)
  2. Gluroo Dataset

    • Internal dataset from Christopher and Walker
    • Contains additional features like protein content and glucose trends
    • Includes meal announcements and intervention data
  3. simglucose (Coming Soon)

    • Planned integration from benchmark repo as a package
    • Will provide simulated data for testing and validation
  4. Lynch 2022 Dataset
  5. Aleppo Dataset

Data Format Standardization

All datasets are transformed into a standardized format for our benchmark pipeline. The following sections detail the required columns and dataset-specific features.

Core Columns (Required for All Datasets)

| Column | Type | Description | Source | Required? |
| --- | --- | --- | --- | --- |
| datetime | pd.Timestamp (INDEX) | Primary timestamp for data manipulation | Created during processing | Required |
| p_num | str | Patient identifier | Original dataset | Required |
| bg_mM | float | Blood glucose measurement in mmol/L | Original dataset | Required |
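
A quick check of the core columns could look like the following. This is a minimal sketch: `check_core_columns` is a hypothetical helper, not a function from the pipeline, and it only checks for the presence of the columns listed above.

```python
import pandas as pd

REQUIRED_COLUMNS = ("p_num", "bg_mM")


def check_core_columns(df: pd.DataFrame) -> list[str]:
    """Return a list of problems; an empty list means the frame passes."""
    problems = []
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing required column: {col}")
    return problems


good = pd.DataFrame(
    {"p_num": ["p01"], "bg_mM": [5.6]},
    index=pd.DatetimeIndex(["2024-01-01"], name="datetime"),
)
# Dropping the index and the glucose column triggers both checks
bad = good.reset_index()[["datetime", "p_num"]]

good_problems = check_core_columns(good)
bad_problems = check_core_columns(bad)
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a new dataset in one pass.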

Optional Columns (Enhance Features but Don't Block Processing)

| Column | Type | Description | Source | Notes |
| --- | --- | --- | --- | --- |
| dose_units | float | Insulin dose in units | Original dataset | Enables IOB calculation |
| food_g | float | Carbohydrate intake in grams | Original dataset | Enables COB calculation |
| msg_type | str | Message type indicator ('ANNOUNCE_MEAL' or empty) | Derived | - |
| rate | float | Basal insulin rate in U/hr | Original dataset | Enables basal rollover |

Derived Physiological Features

These columns are computed by the preprocessing pipeline. If the source column is missing or all NaN, these will be set to NaN.

| Column | Type | Description | Source |
| --- | --- | --- | --- |
| cob | float | Carbohydrates on board in grams | Derived from food_g |
| carb_availability | float | Estimated total carbohydrates in blood | Derived from food_g |
| iob | float | Insulin on board in units | Derived from dose_units |
| insulin_availability | float | Insulin in plasma | Derived from dose_units |

!!! note "CGM-Only Datasets"
    CGM-only datasets (e.g., some Type 2 diabetes patients, pre-training scenarios) are supported. The pipeline will skip COB/IOB calculations and set those features to NaN if the source columns are missing.
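
The NaN fallback for CGM-only data might be sketched as follows. `fill_unavailable_features` is a hypothetical name used for illustration; the real preprocessing functions may handle this differently. Only the column names and the "missing or all NaN" rule come from this document.

```python
import numpy as np
import pandas as pd

# Derived column -> source column, following the derived-features table
DERIVED_SOURCES = {
    "cob": "food_g",
    "carb_availability": "food_g",
    "iob": "dose_units",
    "insulin_availability": "dose_units",
}


def fill_unavailable_features(df: pd.DataFrame) -> pd.DataFrame:
    """Set derived columns to NaN when their source is absent or entirely NaN."""
    out = df.copy()
    for derived, source in DERIVED_SOURCES.items():
        if source not in out.columns or out[source].isna().all():
            out[derived] = np.nan
    return out


# A CGM-only frame: glucose readings, no insulin or carbohydrate data
cgm_only = pd.DataFrame({"bg_mM": [5.1, 5.3]})
result = fill_unavailable_features(cgm_only)
```

Keeping the derived columns present (as NaN) rather than omitting them means downstream code can rely on a fixed schema across datasets.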

Optional Activity Metrics

These columns are available but not typically used in statistical models:

| Column | Type | Description |
| --- | --- | --- |
| cals-0:00 | float | Total calories burnt in last 5 minutes |
| steps-0:00 | float | Number of steps taken |
| hr-0:00 | float | Heart rate |

Dataset-Specific Features

Kaggle Dataset

| Column | Type | Description |
| --- | --- | --- |
| time | pd.Timestamp | Time of day (HH:MM:SS) - Used to determine patient interval type |

Gluroo Dataset

| Column | Type | Description |
| --- | --- | --- |
| food_protein | float | Protein content in grams |
| trend | str | Glucose trend (e.g., "rising", "falling") from Dexcom/Libre |
| food_g | float | Food grams (converts to carbs-0:00) |
| food_g_keep | float | Original meal carbohydrate values (for tracking only) |
| affects_fob | bool | Food on board flag |
| affects_iob | bool | Insulin on board flag |
| day_start_shift | int | Day start definition for data processing |

Lynch Dataset

Aleppo Dataset

Message Types (Gluroo Dataset)

The Gluroo dataset includes three types of messages that are processed during data transformation:

  1. DOSE_INSULIN - Records insulin administration events
  2. ANNOUNCE_MEAL - Captures meal announcements and carbohydrate intake
  3. INTERVENTION_SNACK - Tracks snack interventions
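
One way to guard against unexpected message types during transformation is sketched below. It assumes the messages live in a frame with a `msg_type` column; `split_by_msg_type` is a hypothetical helper and does not reflect how the Gluroo cleaner is actually written.

```python
import pandas as pd

KNOWN_MSG_TYPES = {"DOSE_INSULIN", "ANNOUNCE_MEAL", "INTERVENTION_SNACK"}


def split_by_msg_type(messages: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Group rows by message type, rejecting types this document does not list."""
    unknown = set(messages["msg_type"].dropna()) - KNOWN_MSG_TYPES
    if unknown:
        raise ValueError(f"unexpected message types: {sorted(unknown)}")
    return {name: group for name, group in messages.groupby("msg_type")}


messages = pd.DataFrame(
    {"msg_type": ["DOSE_INSULIN", "ANNOUNCE_MEAL", "ANNOUNCE_MEAL"]}
)
parts = split_by_msg_type(messages)
```

Types that do not occur in the data are simply absent from the returned dictionary.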

Dataset Standardization Process

To add a new dataset to the pipeline, follow these steps:

1. Directory Structure

cache/data/{dataset_name}/
├── raw/                        # Original data files
│   └── .gitkeep
└── processed/                  # Processed and cached data
    └── .gitkeep
src/data/diabetes_datasets/
└── ${dataset_name}/
    ├── __init__.py             # Data initialization
    ├── ${dataset_name}.py      # Data loader class
    ├── data_cleaner.py         # Data cleaning functions specific ONLY to this dataset
    └── README.md               # Instructions on how to access raw data, and where it should go.

2. Implementation Steps

2.1 Create Data Loader

  • Create a new file: src/data/diabetes_datasets/${dataset_name}/${dataset_name}.py
  • Inherit from DatasetBase class
  • Implement caching mechanism to avoid reprocessing
  • Raw data should be fetched via the host's API if one is available (we are now considering simply hosting private Hugging Face datasets)
    class ${DatasetName}Loader(DatasetBase):
        def __init__(self, cache=True):
            super().__init__()
            self.cache = cache
            # Implementation details...
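
The caching idea can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: `DatasetBase` here is a stand-in (the real base class may define other methods), `ExampleLoader`, `load`, and `_process` are hypothetical names, and CSV is used for the cache only to keep the example dependency-free.

```python
import tempfile
from pathlib import Path

import pandas as pd


class DatasetBase:
    """Stand-in for the project's DatasetBase; the real class may differ."""


class ExampleLoader(DatasetBase):
    def __init__(self, raw_path, processed_path, cache=True):
        super().__init__()
        self.raw_path = Path(raw_path)
        self.processed_path = Path(processed_path)
        self.cache = cache

    def load(self) -> pd.DataFrame:
        # Reuse the processed file when caching is on and it already exists
        if self.cache and self.processed_path.exists():
            return pd.read_csv(
                self.processed_path, parse_dates=["datetime"], index_col="datetime"
            )
        df = self._process(pd.read_csv(self.raw_path))
        self.processed_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(self.processed_path)
        return df

    def _process(self, raw: pd.DataFrame) -> pd.DataFrame:
        # Placeholder for the dataset-specific cleaning steps
        raw["datetime"] = pd.to_datetime(raw["datetime"])
        return raw.set_index("datetime").sort_index()


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp) / "raw" / "example.csv"
    raw_file.parent.mkdir(parents=True)
    raw_file.write_text(
        "datetime,p_num,bg_mM\n"
        "2024-01-01 00:05,p01,5.4\n"
        "2024-01-01 00:00,p01,5.0\n"
    )
    loader = ExampleLoader(raw_file, Path(tmp) / "processed" / "example.csv")
    first = loader.load()   # processes the raw file and writes the cache
    raw_file.unlink()       # the raw file is no longer needed...
    second = loader.load()  # ...because this call reads the cached copy
```

The second `load()` succeeds even though the raw file has been removed, which is the behavior the cache is meant to provide.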
    

2.2 Data Cleaning and Transformation

Implement a data cleaner that performs the following steps in order:

  1. Column Standardization

    • Map original column names to standardized format
    • Ensure all required columns are present
    • Ensure datetime column exists
    • Convert data types as needed
  2. Data Cleaning

    • Remove duplicate entries
    • Handle missing values
    • Validate data ranges
    • Example: See clean_gluroo_data for reference
  3. Time Series Processing

    • Call data_transforms.ensure_regular_time_intervals(cleaned_df)
    • This creates missing rows to maintain consistent intervals
    • Missing data will be imputed in the benchmark pipeline
  4. Derived Features

    • Generate carbohydrate-related features:
      data_transforms.create_cob_and_carb_availability_cols(df)
      
    • Generate insulin-related features:
      data_transforms.create_iob_and_ins_availability_cols(df)
      

2.3 Batch Processing Script

Write a shell script for WATGPU and save it as `scripts/watgpu_slurm/data_processing_scripts/{dataset}_data_processing.sh`.

Make sure the job does not request too many resources for the partition you are asking for. Check the [SLURM documentation](https://slurm.schedmd.com/pdfs/summary.pdf) for more details, and inspect the cluster with commands such as `sinfo`, `squeue`, and `scontrol show node watgpu608`.

Script template:

```bash
#!/bin/bash
#SBATCH --job-name="{dataset}_data_processing"
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task=30
#SBATCH --mem-per-cpu=1GB
#SBATCH --partition=HI
#SBATCH -o results/runs/{dataset}_data_processing/slurm-%j.out
#SBATCH -e results/runs/{dataset}_data_processing/slurm-%j.err
#SBATCH --mail-user={your_email@domain.com}
#SBATCH --mail-type=ALL

# Activate the virtual environment
source $HOME/nocturnal/.noctprob-venv/bin/activate

# Inline Python code to process the {dataset} data
# (not the best practice, but the task is simple enough)
echo "Starting {dataset} data processing"
python -c "
from src.data.diabetes_datasets.data_loader import get_loader
loader = get_loader(
    data_source_name='{dataset}',
    use_cached=False,
    parallel=True,
    max_workers=30,
)
"
echo "{dataset} data processing completed"
```

Run `sbatch {dataset}_data_processing.sh` to submit the job.

3. Documentation

Document all the changes

4. Reference Implementation

See src/data/gluroo/ for a working example of dataset implementation.

Future Improvements

  • Rename bg-0:00 to bg_mgdl-0:00 to explicitly include units
  • Evaluate the need for the -0:00 suffix in column names
  • Standardize time interval handling across all datasets
  • Implement automated data validation checks
  • Add data quality metrics reporting
  • Create dataset-specific documentation templates