Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Embargo
Release after 18-May-2019
Abstract
The incidence of type 2 diabetes mellitus is rising in the United States and worldwide. Diabetes is a common, debilitating, and deadly disease that begins with a long asymptomatic period, during which its severity can be limited through intervention. Intervention improves with earlier detection, but most prediabetic individuals are unaware of their risk. This study uses machine learning for the linguistic analysis of social media text to detect diabetes risk. To this end, it seeks to answer the questions "What linguistic features most indicate diabetes risk?", "What algorithms best detect diabetes risk from these features?", and "How can the data to train such algorithms best be collected?" To address these questions, I describe findings from an experiment in eliciting participation in data collection through an initial risk classifier based on public sources. I continue by comparing various linguistic feature sets and machine learning algorithms in detecting body mass index (kg/m^2, a risk factor for diabetes) as well as a more complete diabetes risk measure. Results show that participant engagement with the results of research is robust, but few of these individuals are willing to participate in the research when any personally identifiable data is collected. From these results, it is also evident that limiting feature sets to lexicons of domain-relevant words such as food and exercise terms can be effective, and that modeling a writer's gender and a text's recency can improve detection, along with distinguishing quoted text from original text. This work is a first step toward detecting diabetes risk, with the ultimate goal of designing effective, automated, and individualized interventions through social media. It shows that language is a valuable predictor of important health variables and proposes a novel method for accounting for a writer's gender when analyzing their text.
Future work will benefit from pursuing larger datasets, potentially through methods described in this work, and from multimodal algorithms capitalizing on the interplay between text and images.
Type
text
Electronic Dissertation
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Linguistics