Microspecialization in Column-Oriented DBMSes and Scripting Language Interpreters
Author
He, Wei
Issue Date
2021
Keywords
Database management system
Dynamic code specialization
Microspecialization
Profile-based optimization
Scripting language interpreter
Advisor
Strout, Michelle M.
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Abstract
Interpretation in general-purpose systems introduces performance overhead because the actual workload characteristics must be examined at runtime. The performance of many such systems, including database management systems (DBMSes) and scripting language interpreters for data analysis workloads, is critical because these systems are often expected to perform computations on huge amounts of data while presenting the results fast enough to interact with users. Microspecialization is an existing program specialization technique in which a developer uses domain-specific runtime invariants to specialize the system source code with invariant values. Microspecialization involves a tradeoff between the performance improvement and the development work required to apply the specialization semi-automatically. It was initially proposed in the context of row-oriented DBMSes, where it used workload information such as the shape of database tables to specialize individual DBMS operators and yielded 15–22% performance improvements on the widely used TPC-H benchmark. However, specialization within individual operators limits the potential performance improvement of microspecialization for other systems. Two kinds of systems that might benefit from microspecialization across operators are column-oriented DBMSes for analytical queries and scripting language interpreters, such as those for Python and Lua, for data analysis workloads. Unfortunately, microspecialization across operators increases the development effort for scripting language interpreters. In addition, interpreting data analysis scripts calls for specialization on data types that could change but tend to be stable at runtime, for better performance improvement.
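As a rough illustration of specializing on a data type that tends to be stable but could still change, the following Python sketch contrasts a generic routine that examines the type tag for every value with a specialized routine that checks the assumption once and falls back to the generic path when it fails. It is not code from the dissertation; `Column`, `generic_sum`, `specialize_sum`, and the string type tags are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Column:
    """Hypothetical data column: a type tag plus the stored values."""
    dtype: str            # assumed tags: "float" or "str"
    values: List[object]


def generic_sum(col: Column) -> float:
    """Generic path: the type tag is examined again for every value."""
    total = 0.0
    for v in col.values:
        if col.dtype == "float":
            total += v
        elif col.dtype == "str":
            total += float(v)  # numeric text, e.g. read from a CSV file
        else:
            raise TypeError(f"unsupported dtype: {col.dtype}")
    return total


def specialize_sum(profiled_dtype: str) -> Callable[[Column], float]:
    """Build a sum routine specialized for the dtype seen in a profile."""
    if profiled_dtype != "float":
        return generic_sum  # this sketch only specializes the float case

    def specialized(col: Column) -> float:
        # Guard: the profiled dtype is likely, not guaranteed, at runtime.
        if col.dtype != "float":
            return generic_sum(col)
        total = 0.0
        for v in col.values:
            total += v  # no per-value type examination on the fast path
        return total

    return specialized


sum_prices = specialize_sum("float")
```

Calling `sum_prices` on a column tagged `"float"` takes the fast path, while a column tagged `"str"` fails the guard and is handled by `generic_sum`, so the result stays correct even when the profile turns out to be wrong.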
In this dissertation, I explore the tradeoff space of microspecialization to improve the performance of specialized code through specialization across operators while reducing the required development effort through a semi-automated specialization for scripting language interpreters via traditional compiler optimizations. My dissertation contributes two case studies of the tradeoff between performance improvement and development effort for microspecialization in two different domains: column-oriented DBMSes and scripting language interpreters for data analysis workloads. In these case studies, I (1) specialize across DBMS operations via new kinds of invariants based on the mechanism of query interpretation, (2) provide a cost model for targeting the specialization within a function for column-oriented DBMSes, (3) semi-automate the specialization of scripting language interpreters across bytecode operators through code templates and traditional compiler optimizations, and (4) specialize runtime data type examinations, which comprise notable overhead in dynamically typed scripting language interpreters, with data type values that tend to be stable based on domain knowledge about the dynamic input formats. I propose four new kinds of invariants for column-oriented DBMSes; they enable optimizations that remove computations identified as unnecessary for a query, based on how the computation results are used within a single operator or across operators. With only 2400 source lines of code (SLOC) changed in MonetDB, which employs the column-oriented DBMS architecture, I demonstrate an average speedup of around 16% on the TPC-H benchmark. In the second case study, on the interpretation of data analysis scripts, I microspecialize the scripting language interpreters across bytecode operators by unrolling the interpretation loop and specializing on data types that could differ at runtime from those in the profile.
Meanwhile, to reduce the development work, I semi-automate this specialization with source-level code templates and assignments of invariant values that trigger traditional compiler optimizations. Assisted by manual separation of script code that processes different types of data columns, I report that the semi-automatic source-level microspecialization approach yields 9.0–39.6% performance improvements for the Lua interpreter and 11.0–17.2% performance improvements for the Python interpreter, with only 1509 and 2283 SLOC changes respectively, on five common data analysis tasks summarized from real workloads. The two case studies in this dissertation demonstrate significant performance improvement through only 0.8% modification to MonetDB, 0.2% modification to the Python interpreter, and 9.1% modification to the Lua interpreter codebases. Therefore, I conclude that microspecialization can effectively improve the performance of column-oriented DBMSes and scripting language interpreters with a small amount of development effort.
Type
text
Electronic Dissertation
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Computer Science