!!! warning "Under Construction"
    This documentation is currently under active development and subject to change. Some sections may be incomplete or missing.
Data Documentation
Overview
This document describes the data sources and standardized format used in our benchmark pipeline for nocturnal hypoglycemia prediction.
Historical Context
Our benchmark pipeline was originally built on the Kaggle Bristol Type 1 Diabetes Dataset, which has influenced some of our naming conventions and data structures. This historical context helps explain certain design decisions:
- Column naming convention: The `-0:00` suffix in column names (e.g., `bg-0:00`) indicates current-time measurements. A suffix such as `+1:00` could mark future measurements, but the pipeline does not currently use it.
- Time intervals: The Kaggle dataset combines data from different types of continuous glucose monitors (CGMs):
    - Dexcom users: 5-minute intervals
    - Libre users: 15-minute intervals
    - The `time` column was originally used to distinguish between these interval types in the Kaggle dataset
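
As a rough illustration of how the two device cadences could be told apart, the sketch below infers a sampling interval from the median gap between consecutive timestamps. The function name `infer_interval_minutes` and the sample timestamps are hypothetical; this is not how the pipeline itself reads the `time` column.

```python
import pandas as pd


def infer_interval_minutes(timestamps: pd.Series) -> int:
    """Return the median gap between consecutive timestamps, in whole minutes."""
    gaps = timestamps.sort_values().diff().dropna()
    return int(round(gaps.median().total_seconds() / 60))


# Hypothetical readings: one Dexcom-like series and one Libre-like series.
dexcom_times = pd.Series(pd.date_range("2024-01-01 00:00", periods=6, freq="5min"))
libre_times = pd.Series(pd.date_range("2024-01-01 00:00", periods=6, freq="15min"))

dexcom_interval = infer_interval_minutes(dexcom_times)
libre_interval = infer_interval_minutes(libre_times)
```

Using the median rather than the first gap keeps the estimate stable when a few readings are missing.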
Important Implementation Notes
- The `datetime` column serves as the index for data manipulation
- While models focus on row relationships rather than timestamps, maintaining consistent intervals is crucial
- Missing data must be either filled or imputed to ensure data quality
- These requirements apply across all supported datasets
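
A minimal sketch of what consistent intervals mean in practice, using plain pandas rather than the pipeline's own helpers: the frame is indexed by `datetime`, reindexed onto a regular grid so that gaps become explicit NaN rows, and then filled. The 5-minute grid, the linear interpolation, and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical readings with the 00:10 sample missing.
raw = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:15"]
        ),
        "bg_mM": [5.0, 5.4, 6.2],
    }
).set_index("datetime")

# Build a regular 5-minute grid and expose the gap as an explicit NaN row.
grid = pd.date_range(raw.index.min(), raw.index.max(), freq="5min")
regular = raw.reindex(grid)
regular.index.name = "datetime"

# Fill the gap; linear interpolation is only one possible imputation choice.
filled = regular.interpolate(method="linear")
```

Once every row sits on the same grid, a fixed number of rows corresponds to a fixed span of time, which is what row-based models rely on.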
Supported Datasets
- Bristol Type 1 Diabetes Dataset
    - A comprehensive dataset from Kaggle
    - Contains both Dexcom (5-min interval) and Libre (15-min interval) users
    - Multiple patients in a single CSV file (an uncommon layout)
- Gluroo Dataset
    - Internal dataset from Christopher and Walker
    - Contains additional features like protein content and glucose trends
    - Includes meal announcements and intervention data
- simglucose (Coming Soon)
    - Planned integration from benchmark repo as a package
    - Will provide simulated data for testing and validation
- Lynch 2022 Dataset
- Aleppo Dataset
Data Format Standardization
All datasets are transformed into a standardized format for our benchmark pipeline. The following sections detail the required columns and dataset-specific features.
Core Columns (Required for All Datasets)
| Column | Type | Description | Source | Required? |
|---|---|---|---|---|
| `datetime` | `pd.Timestamp` (INDEX) | Primary timestamp for data manipulation | Created during processing | Required |
| `p_num` | `str` | Patient identifier | Original dataset | Required |
| `bg_mM` | `float` | Blood glucose measurement in mmol/L | Original dataset | Required |
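
The core requirements lend themselves to a quick check before a frame enters the pipeline. The sketch below is one way such a check might look; `REQUIRED_COLUMNS` and `missing_core_columns` are hypothetical names, not part of the existing codebase.

```python
import pandas as pd

# Core columns from the table; `datetime` is expected as the index, not a column.
REQUIRED_COLUMNS = ("p_num", "bg_mM")


def missing_core_columns(df: pd.DataFrame) -> list[str]:
    """List the core requirements that the frame does not satisfy."""
    problems = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("datetime (index)")
    return problems


# Hypothetical frames: one that conforms, one that does not.
good = pd.DataFrame(
    {"p_num": ["p01"], "bg_mM": [5.6]},
    index=pd.DatetimeIndex(["2024-01-01 00:00"], name="datetime"),
)
bad = pd.DataFrame({"bg_mM": [5.6]})

good_problems = missing_core_columns(good)
bad_problems = missing_core_columns(bad)
```

Returning a list of problems instead of raising on the first one makes it easier to report everything that is wrong with a new dataset in a single pass.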
Optional Columns (Enhance Features but Don't Block Processing)
| Column | Type | Description | Source | Notes |
|---|---|---|---|---|
| `dose_units` | `float` | Insulin dose in units | Original dataset | Enables IOB calculation |
| `food_g` | `float` | Carbohydrate intake in grams | Original dataset | Enables COB calculation |
| `msg_type` | `str` | Message type indicator (`'ANNOUNCE_MEAL'` or empty) | Derived | - |
| `rate` | `float` | Basal insulin rate in U/hr | Original dataset | Enables basal rollover |
Derived Physiological Features
These columns are computed by the preprocessing pipeline. If the source column is missing or all NaN, these will be set to NaN.
| Column | Type | Description | Source |
|---|---|---|---|
| `cob` | `float` | Carbohydrates on board in grams | Derived from `food_g` |
| `carb_availability` | `float` | Estimated total carbohydrates in blood | Derived from `food_g` |
| `iob` | `float` | Insulin on board in units | Derived from `dose_units` |
| `insulin_availability` | `float` | Insulin in plasma | Derived from `dose_units` |
!!! note "CGM-Only Datasets"
    CGM-only datasets (e.g., some Type 2 diabetes patients, pre-training scenarios) are supported. The pipeline will skip COB/IOB calculations and set those features to NaN if the source columns are missing.
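
The fallback for CGM-only data can be sketched as follows. This is an illustration of the rule described above, not the pipeline's actual code: `DERIVED_SOURCES` and `fill_unavailable_features` are hypothetical names, and the real COB/IOB calculations are not reproduced here.

```python
import pandas as pd

# Derived column -> source column, following the derived-features table.
DERIVED_SOURCES = {
    "cob": "food_g",
    "carb_availability": "food_g",
    "iob": "dose_units",
    "insulin_availability": "dose_units",
}


def fill_unavailable_features(df: pd.DataFrame) -> pd.DataFrame:
    """Set a derived feature to NaN when its source is missing or entirely NaN."""
    out = df.copy()
    for derived, source in DERIVED_SOURCES.items():
        source_unusable = source not in out.columns or out[source].isna().all()
        if source_unusable:
            out[derived] = float("nan")
    return out


# Hypothetical CGM-only frame: glucose readings and nothing else.
cgm_only = pd.DataFrame({"bg_mM": [5.2, 5.6, 6.1]})
with_features = fill_unavailable_features(cgm_only)
```

Keeping the derived columns present but empty means downstream code sees the same schema for every dataset.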
Optional Activity Metrics
These columns are available but not typically used in statistical models:
| Column | Type | Description |
|---|---|---|
| `cals-0:00` | `float` | Total calories burnt in last 5 minutes |
| `steps-0:00` | `float` | Number of steps taken |
| `hr-0:00` | `float` | Heart rate |
Dataset-Specific Features
Kaggle Dataset
| Column | Type | Description |
|---|---|---|
| `time` | `pd.Timestamp` | Time of day (HH:MM:SS). Used to determine patient interval type |
Gluroo Dataset
| Column | Type | Description |
|---|---|---|
| `food_protein` | `float` | Protein content in grams |
| `trend` | `str` | Glucose trend (e.g., "rising", "falling") from Dexcom/Libre |
| `food_g` | `float` | Food grams (converts to `carbs-0:00`) |
| `food_g_keep` | `float` | Original meal carbohydrate values (for tracking only) |
| `affects_fob` | `bool` | Food on board flag |
| `affects_iob` | `bool` | Insulin on board flag |
| `day_start_shift` | `int` | Day start definition for data processing |
Lynch Dataset
Aleppo Dataset
Message Types (Gluroo Dataset)
The Gluroo dataset includes three types of messages that are processed during data transformation:
- `DOSE_INSULIN`: Records insulin administration events
- `ANNOUNCE_MEAL`: Captures meal announcements and carbohydrate intake
- `INTERVENTION_SNACK`: Tracks snack interventions
Dataset Standardization Process
To add a new dataset to the pipeline, follow these steps:
1. Directory Structure
cache/data/{dataset_name}/
├── raw/ # Original data files
│ └── .gitkeep
└── processed/ # Processed and cached data
  └── .gitkeep
src/data/diabetes_datasets/
├── ${dataset_name}/
│ ├── __init__.py # Data initialization
│ ├── ${dataset_name}.py # Data loader class
│ ├── data_cleaner.py # Data cleaning functions specific ONLY to this dataset
│ └── README.md # Instructions on how to access raw data, and where it should go.
2. Implementation Steps
2.1 Create Data Loader
- Create a new file: `src/data/diabetes_datasets/${dataset_name}/${dataset_name}.py`
- Inherit from the `DatasetBase` class
- Implement a caching mechanism to avoid reprocessing
- Raw data should be fetched via the host's API if available (we are currently considering hosting private HF datasets instead)

class ${DatasetName}Loader(DatasetBase):
    def __init__(self, cache=True):
        super().__init__()
        self.cache = cache
        # Implementation details...
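
To make the caching idea concrete, here is a small self-contained sketch. Everything in it is an assumption for illustration: `DatasetBase` is replaced by an empty stand-in because the real base class is not shown here, and `ExampleLoader`, `load`, `_process_raw`, and the JSON cache file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


class DatasetBase:
    """Stand-in for the project's base class; the real interface may differ."""


class ExampleLoader(DatasetBase):
    def __init__(self, cache_dir: Path, use_cached: bool = True):
        super().__init__()
        self.cache_file = Path(cache_dir) / "processed" / "data.json"
        self.use_cached = use_cached
        self.process_calls = 0  # counts how often raw data is reprocessed

    def _process_raw(self) -> list[dict]:
        # Placeholder for the expensive cleaning and transformation work.
        self.process_calls += 1
        return [{"p_num": "p01", "bg_mM": 5.6}]

    def load(self) -> list[dict]:
        # Serve from the cache when allowed and available.
        if self.use_cached and self.cache_file.exists():
            return json.loads(self.cache_file.read_text())
        data = self._process_raw()
        self.cache_file.parent.mkdir(parents=True, exist_ok=True)
        self.cache_file.write_text(json.dumps(data))
        return data


with tempfile.TemporaryDirectory() as tmp:
    loader = ExampleLoader(Path(tmp))
    first = loader.load()   # processes raw data and writes the cache
    second = loader.load()  # served from the cache
    calls = loader.process_calls
```

The second call returns the same records without reprocessing, which is the behaviour the caching requirement is after.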
2.2 Data Cleaning and Transformation
Implement a data cleaner that performs the following steps in order:
- Column Standardization
    - Map original column names to standardized format
    - Ensure all required columns are present
    - Ensure the `datetime` column exists
    - Convert data types as needed
- Data Cleaning
    - Remove duplicate entries
    - Handle missing values
    - Validate data ranges
    - Example: See `clean_gluroo_data` for reference
- Time Series Processing
    - Call `data_transforms.ensure_regular_time_intervals(cleaned_df)`
    - This creates missing rows to maintain consistent intervals
    - Missing data will be imputed in the benchmark pipeline
- Derived Features
    - Generate carbohydrate-related features: `data_transforms.create_cob_and_carb_availability_cols(df)`
    - Generate insulin-related features: `data_transforms.create_iob_and_ins_availability_cols(df)`
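
The first two steps can be sketched with plain pandas. The raw column names, the sample values, and the plausible glucose range are all assumptions made for this example; the time series and derived-feature steps rely on the `data_transforms` helpers named above and are not reproduced.

```python
import pandas as pd

# Hypothetical raw export; these source column names are invented for the example.
raw = pd.DataFrame(
    {
        "timestamp": [
            "2024-01-01 00:00",
            "2024-01-01 00:00",
            "2024-01-01 00:05",
            "2024-01-01 00:10",
        ],
        "patient": ["p01", "p01", "p01", "p01"],
        "glucose": [5.1, 5.1, 5.3, 99.0],
    }
)

# Step 1, column standardization: rename to the standard names and convert types.
df = raw.rename(
    columns={"timestamp": "datetime", "patient": "p_num", "glucose": "bg_mM"}
)
df["datetime"] = pd.to_datetime(df["datetime"])
df["p_num"] = df["p_num"].astype(str)

# Step 2, data cleaning: drop exact duplicates and blank out implausible readings.
df = df.drop_duplicates()
BG_MIN, BG_MAX = 1.0, 33.3  # assumed plausible range in mmol/L
df["bg_mM"] = df["bg_mM"].where(df["bg_mM"].between(BG_MIN, BG_MAX))

cleaned = df.set_index("datetime").sort_index()
```

Out-of-range readings are replaced with NaN rather than dropped, so the row survives and can be imputed later along with other missing data.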
2.3 Batch Processing Script
Write a shell script for WATGPU:
```bash
# Script location:
# scripts/watgpu_slurm/data_processing_scripts/{dataset}_data_processing.sh
#
# Make sure the job does not request too many resources for the partition you are asking for.
# Check the SLURM documentation (https://slurm.schedmd.com/pdfs/summary.pdf) for more details,
# e.g., `sinfo`, `squeue`, `scontrol show node watgpu608`.
#
# Script template:

#!/bin/bash
#SBATCH --job-name="{dataset}_data_processing"
#SBATCH --time=10:00:00
#SBATCH --cpus-per-task=30
#SBATCH --mem-per-cpu=1GB
#SBATCH --partition=HI
#SBATCH -o results/runs/{dataset}_data_processing/slurm-%j.out
#SBATCH -e results/runs/{dataset}_data_processing/slurm-%j.err
#SBATCH --mail-user={your_email@domain.com}
#SBATCH --mail-type=ALL

# Activate the virtual environment
source $HOME/nocturnal/.noctprob-venv/bin/activate

# Inline Python code to process the {dataset} data
# (not the best practice, but the task is simple enough)
echo "Starting {dataset} data processing"
python -c "
from src.data.diabetes_datasets.data_loader import get_loader
loader = get_loader(
    data_source_name='{dataset}',
    use_cached=False,
    parallel=True,
    max_workers=30,
)
"
echo "{dataset} data processing completed"

# Submit with: sbatch {dataset}_data_processing.sh
```
3. Documentation
Document all the changes
4. Reference Implementation
See src/data/gluroo/ for a working example of dataset implementation.
Future Improvements
- Rename `bg-0:00` to `bg_mgdl-0:00` to explicitly include units
- Evaluate the need for the `-0:00` suffix in column names
- Standardize time interval handling across all datasets
- Implement automated data validation checks
- Add data quality metrics reporting
- Create dataset-specific documentation templates