TSpace

Preserve and Share Your Research

TSpace is a free and secure research repository established by University of Toronto Libraries to disseminate and preserve the scholarly record of the University of Toronto.

Recent Submissions

Item (Open Access)
Attention-Level Speculation
(2025-07) Cai, Jack; Vora, Ammar; Zhang, Randolph; O'Connor, Mark; Jeffrey, Mark C.
As Large Language Models (LLMs) grow in size and context length, efficient inference strategies are essential to maintain low-latency token generation. Unfortunately, conventional tensor and data parallelism face diminishing returns when scaling across multiple devices. We propose a novel form—attention-level speculative parallelism (ALSpec)—that predicts self-attention outputs to execute subsequent operations early on separate devices. Our approach overlaps attention and non-attention computations, reducing the attention latency overhead at 128K context length by up to 5x and improving end-to-end decode latency by up to 1.65x, all without sacrificing quality. We establish the fundamental pillars for speculative execution and provide an execution paradigm that simplifies implementation. We show that existing attention-approximation methods perform well on simple information retrieval tasks, but they fail in advanced reasoning and math. Combined with speculative execution, we can approximate up to 90% of self-attention without harming model correctness. Demonstrated on Tenstorrent's NPU devices, we scale up LLM inference beyond current techniques, paving the way for faster inference in transformer models.
Item (Open Access)
When Is Parallelism Fearless and Zero-Cost with Rust?
(ACM, 2024-06-17) Abdi, Javad; Posluns, Gilead; Zhang, Guozheng; Wang, Boxuan; Jeffrey, Mark C.
The Rust programming language is lauded for enabling fearless concurrency with zero cost: detecting concurrency errors at compile time. Given the enduring difficulty of parallel programming in other languages, this implied panacea warrants analysis. In particular, the efficacy of Rust across types of parallelism remains unexplored. Is parallel programming always devoid of fear with Rust? We answer this question through a case study, porting 14 benchmarks with abundant regular and irregular parallelism from C++ to Rust and reporting our experience and observations. We find that Rust, with the Rayon library, indeed delivers fearlessness for program phases comprising only regular parallelism, e.g., prefix-sum. However, for applications with any irregular parallelism, the programmer must choose between unsafe code or high-overhead dynamic checks with errors that manifest at run time, leaving the arduous task of parallel programming as scary with Rust as with its predecessors.
Item (Open Access)
Multi Bucket Queues: Efficient Concurrent Priority Scheduling
(ACM, 2024-06-17) Zhang, Guozheng; Posluns, Gilead; Jeffrey, Mark C.
Many irregular algorithms converge more quickly when they execute tasks in a specific order. When this order is discovered at run time, the algorithm demands a dynamic task scheduler. Scaling a priority scheduler to large systems with many cores is challenging, and while many concurrent priority schedulers (CPS) have been proposed, a general classification of their design space is still lacking. We survey prior work and propose three dimensions for the design of CPSs: the degree of synchrony, the drift of priorities, and the underlying data structure. We use this taxonomy to classify existing schedulers and evaluate their strengths and weaknesses. Building on our observations, we propose the Multi Bucket Queue (MBQ), which targets a promising unexplored point in the design space for concurrent priority scheduling. The MBQ leverages the strengths of the MultiQueue and Multi-Level Bucket Queue, while avoiding their weaknesses, yielding a CPS that keeps threads busy and running useful work, yet with high-efficiency queue operations. Our experimental results show that the MBQ is competitive with or outperforms prior work.
Item (Open Access)
Disintegrating manycores: which applications lose and why?
(ACM, 2023-10-28) Brkić, Isidor R.; Jeffrey, Mark C.
The economics of Moore's Law are stumbling, so vendors of many-core architectures are transitioning from single-die monolithic designs to multi-chiplet disintegrated systems within a package. Disintegration lowers cost for the same number of cores but bottlenecks the interconnect. Ideally, disintegration should increase performance per dollar: cost savings should outweigh the disintegration slowdown. Although industry has reported cost savings, the performance penalty of disintegration is not well studied. This paper presents the first characterization, to our knowledge, of disintegration performance penalty across a diverse suite of applications. Unsurprisingly, applications with high speedups on monolithic systems continue to scale well on disintegrated systems, and vice versa. However, the disintegration slowdown compared to an equivalently sized monolith exhibits high variance across applications, with some achieving just over half the performance. Why do some applications get a performance per dollar win, while others lose? Through regression analysis, we find that metrics relating to the network-on-package bandwidth and data sharing correlate with disintegration slowdown. Programmers were already cautioned against shared mutable data on monolithic systems, yet data sharing is unavoidable in many applications. These applications will be disproportionately harmed in the disintegrated future.
Item (Open Access)
Participation, Legitimacy and Fiscal Capacity in Weak States: Evidence from Participatory Budgeting
(2025-06) Grieco, Kevin; Kamara, Abou Bakarr; Meriggi, Niccolò F.; Michel, Julian; Prichard, Wilson
Building durable fiscal capacity requires that states obtain compliance with their taxes—a persistent challenge for states with low enforcement capacity. One promising option for governments in weak states is to raise voluntary compliance by enhancing governmental legitimacy. This study reports results from a participatory budgeting policy experiment in Sierra Leone designed to increase legitimacy and tax compliance by inviting public participation in local policy decision-making. In phone-based town halls, participants shared policy preferences with neighbors and local politicians and then voted for public services that were subsequently implemented. We find that the intervention durably increased participants’ perceptions of government legitimacy. However, contrary to influential models of tax compliance, we report a robust null effect on tax compliance behavior. Participants’ partisan affiliation strongly conditions the treatments’ effects on tax compliance and attitudes toward paying taxes: We find large, positive impacts among copartisans of the incumbent government but significant negative impacts among non-copartisans. Our results highlight that the legitimacy gains of participatory interventions may not increase voluntary tax compliance when participation politicizes compliance.