Smarter, Leaner, Faster: Practical Efficiency for Democratizing Large Language Models
Author
Dumitru, Razvan Gabriel
Issue Date
2025
Keywords
Efficient Inference
Inference Acceleration
Large Language Models
Model Compression
Quantization
Reinforcement Learning
Advisor
Surdeanu, Mihai
Publisher
The University of Arizona.
Rights
Copyright © is held by the author. Licensed under a Creative Commons Attribution-No Derivative Works 3.0 License (CC BY-ND 3.0). Digital access to this material is made possible by the University Libraries, University of Arizona.
Abstract
Large language models (LLMs) deliver impressive capabilities, yet their practicality is constrained by three major bottlenecks: (1) memory and compute demands that make it impossible to load and run LLMs on commodity hardware; (2) limited tokens/sec during decoding, which caps delivered throughput in real use cases; and (3) computational overhead from unnecessarily long and redundant reasoning traces, which increases the tokens per answer required and the serving cost. This dissertation advances a clear objective: decouple capability from raw scale. We show that models can be made smarter, leaner, and faster, improving accessibility without sacrificing accuracy. Our contributions span three levels. (i) High tokens/sec generation: CopySpec identifies repeating patterns in the context and enables up to a 3.08x increase in tokens/sec. (ii) Concise reasoning: ConciseRL is a reinforcement learning framework that optimizes for concise yet sufficient reasoning traces, reducing tokens per answer by up to 31x while gaining 7 extra precision points. (iii) Model compression for memory and compute: Variable Layerwise Quantization assigns per-layer bit levels according to two different importance signals, shrinking the model's footprint with minimal loss in performance, while Dynamic LLM Slicing removes redundant parts of layers entirely based on similar layer importance metrics, cutting inference compute and memory requirements while retaining performance. These techniques realign the cost curve of large language models. By converting redundancy into cheap tokens, allocating token budgets only to the problems that need them, and trimming excess size, they increase delivered tokens/sec, shrink the required tokens per answer, and reduce the total footprint while preserving output quality. Because each mechanism targets a different stage of the pipeline (generation, reasoning, capacity), they compose easily: fewer tokens per answer compounds with higher per-step throughput, and a smaller model lets these gains be realized on commodity hardware. Together, these techniques push toward democratizing frontier LLMs, delivering high quality on modest budgets rather than expensive infrastructure.
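As a rough illustration of the layerwise-importance idea mentioned in the abstract (a minimal sketch under assumed inputs, not the dissertation's implementation; the importance scores, bit levels, and budget below are hypothetical), the snippet assigns lower bit widths to less important layers until an average-bits budget is met:

    # Minimal sketch: greedy per-layer bit-width assignment from importance scores.
    # Hypothetical values throughout; not the dissertation's actual method.

    def assign_bit_widths(importance, budget_bits, levels=(2, 4, 8)):
        """Demote the least important layers first until the average bit width
        across layers fits within budget_bits."""
        n = len(importance)
        bits = {layer: max(levels) for layer in importance}   # start at full precision
        order = sorted(importance, key=importance.get)        # least important first
        while sum(bits.values()) / n > budget_bits:
            demotable = [l for l in order if levels.index(bits[l]) > 0]
            if not demotable:
                break                                          # cannot shrink further
            layer = demotable[0]
            bits[layer] = levels[levels.index(bits[layer]) - 1]
        return bits

    if __name__ == "__main__":
        # Hypothetical importance signal, e.g. how much each layer changes activations.
        importance = {f"layer_{i}": s for i, s in enumerate([0.9, 0.2, 0.4, 0.1, 0.8, 0.3])}
        print(assign_bit_widths(importance, budget_bits=5.0))

In this toy run, the least important layers end up at 2 bits while the most important keep 8, which is the general shape of a variable, importance-driven quantization scheme.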
Type
text
Electronic Dissertation
Degree Name
Ph.D.
Degree Level
doctoral
Degree Program
Graduate College
Computer Science
Degree Grantor
University of Arizona

