Development of a Self-Supervised Learning-based Approach to Clustering Multivariate Time-Series Data with Missing Values
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo
Release after 01/22/2026
Abstract
Multivariate time-series data are commonly found in acute care, where each patient's record is a series of clinical measurements over time, providing critical information throughout the patient's care. Clustering approaches are often used to extract valuable information and identify patterns from temporal data. However, a major challenge in real-world time-series data is the prevalence of missing values. Existing methods typically require filling in these missing values before clustering, which can introduce noise and potentially lead to inaccurate interpretations. In this dissertation, we address these challenges by proposing a novel approach for clustering sparse and irregularly sampled time-series data. We also provide a comprehensive examination of how this approach can be applied to identify (a) clinical phenotypes and (b) physiological states of patients with Traumatic Brain Injury (TBI). First, we developed a Self-supervised Learning-based Approach to Clustering multivariate Time-series data with missing values (SLAC-Time). SLAC-Time employs a Transformer model and uses time-series forecasting as a proxy task for learning the representations of unlabeled data. When applied to data from the Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI) study, SLAC-Time outperformed traditional K-means clustering in terms of silhouette coefficient, Calinski-Harabasz index, Dunn index, and Davies-Bouldin index. It identified three distinct TBI phenotypes, each correlating with significant clinical variables and outcomes, such as the Extended Glasgow Outcome Scale (GOSE) score, length of stay in the ICU, and mortality rates. Second, we applied SLAC-Time to cluster TBI patients within the TRACK-TBI dataset and the Medical Information Mart for Intensive Care (MIMIC)-IV external dataset, leading to the identification of generalizable TBI phenotypes.
These include: phenotype α, characterized by the lowest mortality rates and shortest ICU stays, typically observed in younger patients with balanced metabolic responses and milder clinical symptoms; phenotype β, exhibiting the highest mortality rates and longest ICU stays, often found in older patients with elevated metabolic stress, reduced oxygen transport, and significant neurological impairments; and phenotype γ, displaying moderate mortality rates and ICU stays, with clinical presentations that are less severe than phenotype β but more pronounced than phenotype α. These phenotypes provide a detailed understanding of TBI, offering critical insights for personalized patient treatment and management. Third, we applied SLAC-Time to cluster the high-resolution TBI physiological data in the TRACK-TBI dataset and identified three distinct TBI physiological states. State A is critical, with signs of brain oxygen and blood supply deficiency. State B indicates respiratory distress with inadequate oxygenation. State C is stable, showing signs of sufficient cerebral oxygenation and blood flow. Furthermore, we discovered how specific clinical events and interventions could influence these patient states and drive transitions between them, providing critical insights for tailored clinical management of TBI patients. This dissertation makes a modest contribution to the clustering of multivariate time-series data with missing values. By avoiding data imputation and aggregation, SLAC-Time substantially increases the accuracy of time-series clustering analyses. The iterative representation-learning capability of SLAC-Time effectively reduces noise in the raw input data, thereby improving the quality of the representations used for clustering. The methodologies developed throughout this work establish a foundation for precision medicine in treating TBI patients.
These contributions are set to inform and direct the future of TBI management, potentially leading to substantial improvements in both our comprehension of the condition and the outcomes for patients.
Type
Electronic Dissertation
text
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Systems & Industrial Engineering