Author
Brau Avila, Ernesto

Issue Date
2013

Keywords
Data association
MCMC
Multiple target tracking
Scene understanding
Computer Science
Bayesian inference

Advisor
Barnard, Kobus J.

Publisher
The University of Arizona.

Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.

Abstract
Understanding the content of a video sequence is not a particularly difficult problem for humans. We can easily identify objects, such as people, and track their position and pose within the 3D world. A computer system that could understand the world through videos would be extremely beneficial in applications such as surveillance, robotics, and biology. Despite significant advances in areas like tracking and, more recently, 3D static scene understanding, such a vision system does not yet exist. In this work, I present progress on this problem, restricted to videos of objects that move smoothly and are relatively easy to detect, such as people. Our goal is to identify all the moving objects in the scene and track their physical state (e.g., their 3D position or pose) in the world throughout the video. We develop a Bayesian generative model of a temporal scene, in which we separately model data association, the 3D scene and imaging system, and the likelihood function. Under this model, the video data is the result of capturing the scene with the imaging system and noisily detecting video features. This formulation is very general and can be used to model a wide variety of scenarios, including videos of people walking and time-lapse images of pollen tubes growing in vitro. Importantly, we model the scene in world coordinates and units, as opposed to pixels, allowing us to reason about the world in a natural way, e.g., explaining occlusion and perspective distortion. We use Gaussian processes to model motion, and propose that they provide a general and effective way to characterize smooth, but otherwise arbitrary, trajectories. We perform inference using MCMC sampling, in which we fit our model of the temporal scene to data extracted from the videos. We address the problem of variable dimensionality by estimating data association and integrating out all scene variables. Our experiments show that our approach is competitive, producing results comparable to those of state-of-the-art methods.
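The Gaussian-process trajectory prior described in the abstract can be illustrated in a few lines of NumPy. This is a minimal sketch, not the author's implementation: the squared-exponential kernel and all parameter values below are assumptions chosen only to show how a GP yields smooth but otherwise arbitrary trajectories.

```python
import numpy as np

def se_kernel(t, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance matrix over time points t."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 100)  # frame timestamps (illustrative)
# Small jitter on the diagonal keeps the Cholesky factorization stable.
K = se_kernel(t, length_scale=2.0) + 1e-8 * np.eye(len(t))
L = np.linalg.cholesky(K)

# Each GP draw is a smooth 1D function of time; pairing two
# independent draws gives a smooth 2D trajectory in world coordinates.
x = L @ rng.standard_normal(len(t))
y = L @ rng.standard_normal(len(t))
trajectory = np.stack([x, y], axis=1)  # shape (100, 2)
```

In the formulation the abstract describes, draws like these act as a prior over object motion in world coordinates; the kernel's length scale governs how smooth the sampled trajectories are.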
Type
text
Electronic Dissertation

Degree Name
Ph.D.

Degree Level
doctoral

Degree Program
Graduate College
Computer Science