Publications

2026

1. V-Droid: Advancing Mobile GUI Agent Through Generative Verifiers

Published in The 32nd Annual International Conference On Mobile Computing And Networking (MobiCom), 2026

Recommended citation: Gaole Dai, Shiqi Jiang, Ting Cao, Yuanchun Li, Yuqing Yang, Rui Tan, Mo Li, Lili Qiu. (2026). "V-Droid: Advancing Mobile GUI Agent Through Generative Verifiers." MobiCom.
Download Paper

2. Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

Published in The 24th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2026

Recommended citation: Xiangyu Li, Chengyu Yin, Weijun Wang, Jianyu Wei, Ting Cao, Yunxin Liu. (2026). "Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices." MobiSys.
Download Paper

3. Scaling LLM Test-Time Compute with Mobile NPU on Smartphones

Published in The 2026 ACM European Conference on Computer Systems (EuroSys), 2026

Recommended citation: Zixu Hao, Jianyu Wei, Tuowei Wang, Minxing Huang, Huiqiang Jiang, Shiqi Jiang, Ting Cao, Ju Ren. (2026). "Scaling LLM Test-Time Compute with Mobile NPU on Smartphones." EuroSys.
Download Paper

4. AVA: Towards Agentic Video Analytics Systems with Video Language Models

Published in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026

Recommended citation: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Qing Yang, Lili Qiu. (2026). "AVA: Towards Agentic Video Analytics Systems with Video Language Models." NSDI.
Download Paper

5. SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Published in International Conference on Learning Representations (ICLR), 2026

Recommended citation: Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang. (2026). "SeerAttention-R: Sparse Attention Adaptation for Long Reasoning." ICLR.
Download Paper

6. ProRe: A Proactive Reward System for GUI Agents via Reasoner–Actor Collaboration

Published in International Conference on Learning Representations (ICLR), 2026

Recommended citation: Gaole Dai, Shiqi Jiang, Ting Cao, Yuqing Yang, Yuanchun Li, Rui Tan, Mo Li, Lili Qiu. (2026). "ProRe: A Proactive Reward System for GUI Agents via Reasoner–Actor Collaboration." ICLR.
Download Paper

7. Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

Published in International Conference on Learning Representations (ICLR), 2026

Recommended citation: Mo Li, L.H. Xu, Qitai Tan, Long Ma, Ting Cao, Yunxin Liu, Flood Sung. (2026). "Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management." ICLR.
Download Paper

8. Neuralink: Fast on-Device LLM Inference with Neuron Co-Activation Linking

Published in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026

Recommended citation: Tuowei Wang, Ruwen Fan, Minxing Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren. (2026). "Neuralink: Fast on-Device LLM Inference with Neuron Co-Activation Linking." ASPLOS.
Download Paper

9. BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache

Published in The 32nd International Symposium on High-Performance Computer Architecture (HPCA), 2026

Recommended citation: Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, Mao Yang. (2026). "BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache." HPCA.
Download Paper

10. MatXtract: Sparsity-Aware Matrix Transformation via Cascaded Compute Density EXtraction for SpMV

Published in ACM Transactions on Architecture and Code Optimization (TACO), 2026

Selected for ACM Showcase

Recommended citation: Luhan Wang, Kun Li*, Yifeng Chen, Haipeng Jia, Yunquan Zhang, Ting Cao, Yunxin Liu. (2026). "MatXtract: Sparsity-Aware Matrix Transformation via Cascaded Compute Density EXtraction for SpMV." TACO.
Download Paper

2025

1. SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation

Published in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2025

SC 2025 Best Student Paper Award Finalist

Recommended citation: Qi Li, Kun Li, Liang Yuan, Junshi Chen, Hong An, Yunquan Zhang, Ting Cao, Mao Yang. (2025). "SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations via Structured Sparsity Transformation." SC.
Download Paper

2. Matrix Is All You Need: Rearchitecting Quantum Chemistry to Scale on AI Accelerators

Published in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2025

Recommended citation: Haozhi Han, Kun Li, Fusong Ju, Yifeng Chen, Yunquan Zhang, Ting Cao, Mao Yang. (2025). "Matrix Is All You Need: Rearchitecting Quantum Chemistry to Scale on AI Accelerators." SC.
Download Paper

3. SeerAttention: Self-distilled Attention Gating for Efficient Long-context Prefilling

Published in The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

Recommended citation: Yizhao Gao, Zhichen Zeng, DaYou Du, Shijie Cao, Peiyuan Zhou, Jiaxing Qi, Junjie Lai, Hayden So, Ting Cao, Fan Yang, Mao Yang. (2025). "SeerAttention: Self-distilled Attention Gating for Efficient Long-context Prefilling." NeurIPS.
Download Paper

4. StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Published in International Conference on Computer Vision (ICCV'25), 2025

Recommended citation: Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Qianxi Zhang, Donglin Bai, Zhibo Chen, Ting Cao. (2025). "StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition." ICCV.
Download Paper

5. Jenga: Enhancing Long-Context Fine-tuning of LLMs with Contextual Token Sparsity

Published in USENIX Annual Technical Conference (ATC'25), 2025

Recommended citation: Tuowei Wang, Xingyu Chen, Kun Li, Ting Cao, Ju Ren, Yaoxue Zhang. (2025). "Jenga: Enhancing Long-Context Fine-tuning of LLMs with Contextual Token Sparsity." ATC.
Download Paper

6. Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Recommended citation: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei. (2025). "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL).
Download Paper

7. LUTensor: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference

Published in The 52nd Annual International Symposium on Computer Architecture (ISCA), 2025

Recommended citation: Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang. (2025). "LUTensor: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference." ISCA.
Download Paper

8. Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment

Published in The 23rd ACM Conference on Embedded Networked Sensor Systems (SenSys), 2025

Recommended citation: Shenghong Dai, Shiqi Jiang, Yifan Yang, Ting Cao, Mo Li, S. Banerjee, Lili Qiu. (2025). "Babel: A Scalable Pre-trained Model for Multi-Modal Sensing via Expandable Modality Alignment." SenSys.
Download Paper

9. T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Published in The 2025 ACM European Conference on Computer Systems (EuroSys), 2025

Recommended citation: Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang. (2025). "T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge." EuroSys.
Download Paper

10. Efficient and Adaptive Diffusion Model Inference Through Lookup Table on Mobile Devices

Published in IEEE Transactions on Mobile Computing (TMC), 2025

Recommended citation: Qipeng Wang, Shiqi Jiang, Yifan Yang, Ruiqi Liu, Yuanchun Li, Ting Cao, Xuanzhe Liu. (2025). "Efficient and Adaptive Diffusion Model Inference Through Lookup Table on Mobile Devices." IEEE Transactions on Mobile Computing (TMC).
Download Paper

11. Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled Registers

Published in 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2025

Recommended citation: Yiwei Zhang, Kun Li, Liang Yuan, Yunquan Zhang, Ting Cao, Mao Yang. (2025). "Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled Registers." PPoPP.
Download Paper

12. FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units

Published in 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2025

Recommended citation: Haozhi Han, Kun Li, Wei Cui, Donglin Bai, Yifeng Chen, Ting Cao, Mao Yang. (2025). "FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core Units." PPoPP.
Download Paper

13. LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

Published in 31st IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2025

Recommended citation: Guoyu Li, Chunyun Chen, Shengyu Ye, Yang Wang, Fan Yang, Ting Cao, Mohamed M. Sabry Aly, Cheng Liu, Mao Yang. (2025). "LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator." HPCA.
Download Paper

14. Anatomizing Deep Learning Inference in Web Browsers

Published in ACM Transactions on Software Engineering and Methodology (TOSEM), 2025

Recommended citation: Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu. (2025). "Anatomizing Deep Learning Inference in Web Browsers." TOSEM.
Download Paper

2024

1. LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores

Published in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’24), 2024

SC 2025 Reproducibility Challenge Finalist

Recommended citation: Yiwei Zhang, Kun Li, Liang Yuan, Jiawen Cheng, Yunquan Zhang, Ting Cao, Mao Yang. (2024). "LoRAStencil: Low-Rank Adaptation of Stencil Computation on Tensor Cores." SC’24.
Download Paper

2. Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity

Published in International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’24), 2024

Recommended citation: Tuowei Wang, Kun Li, Zixu Hao, Donglin Bai, Ju Ren, Yaoxue Zhang, Ting Cao, Mao Yang. (2024). "Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity." SC’24.
Download Paper

3. VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Published in The 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Recommended citation: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang. (2024). "VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models." EMNLP.
Download Paper

4. PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

Published in IEEE Transactions on Computers (TC), 2024

Recommended citation: Hanfei Geng, Yifei Liu, Yujie Zheng, Li Lyna Zhang, Jingwei Sun, Yujing Wang, Yang Wang, Guangzhong Sun, Mao Yang, Ting Cao, Yunxin Liu. (2024). "PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning." TC.
Download Paper

5. BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

Published in 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024 Main Conference, Long paper), 2024

Recommended citation: DaYou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu. (2024). "BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation." ACL.
Download Paper

6. AFPQ: Asymmetric Floating Point Quantization for LLMs

Published in 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024 Findings, Short paper), 2024

Recommended citation: Yijia Zhang, Sicheng Zhang, Shijie Cao, DaYou Du, Jianyu Wei, Ting Cao, Ningyi Xu. (2024). "AFPQ: Asymmetric Floating Point Quantization for LLMs." ACL.
Download Paper

7. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

Published in The 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024

Recommended citation: L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, Ting Cao, Y. Yang, M. Yang. (2024). "Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation." OSDI.
Download Paper

8. Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

Published in IEEE International Conference on Multimedia and Expo (ICME’24), 2024

Recommended citation: Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu. (2024). "Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models." ICME’24.
Download Paper

9. Empowering In-Browser Deep Learning Inference on Edge Through Just-In-Time Kernel Optimization

Published in The 22nd Annual International Conference on Mobile Systems, Applications and Services (MobiSys), 2024

Recommended citation: F. Jia, S. Jiang, Ting Cao, W. Cui, T. Xia, X. Cao, Y. Li, Q. Wang, D. Zhang, J. Ren, Y. Liu, L. Qiu, M. Yang. (2024). "Empowering In-Browser Deep Learning Inference on Edge Through Just-In-Time Kernel Optimization." MobiSys.
Download Paper

10. Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference

Published in The 51st Annual International Symposium on Computer Architecture (ISCA’24), 2024

Microsoft Research Focus 2024

Recommended citation: R. Hwang, J. Wei, S. Cao, C. Hwang, X. Tang, Ting Cao, M. Yang. (2024). "Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference." ISCA’24.
Download Paper

11. FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices

Published in The 30th Annual International Conference On Mobile Computing And Networking (MobiCom), 2024

Recommended citation: Xiangyu Li, Yuanchun Li, Yuanzhe Li, Ting Cao, Yunxin Liu. (2024). "FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices." MobiCom.
Download Paper

12. PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

Published in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

Recommended citation: Cong Li, Zhe Zhou, Yang Wang, Fan Yang, Ting Cao, Mao Yang, Yun Liang, Guangyu Sun. (2024). "PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization." ASPLOS.
Download Paper

13. LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search

Published in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2024

Recommended citation: Chengquan Feng, Li Lyna Zhang, Yuanchi Liu, Jiahang Xu, Chengruidong Zhang, Zhiyuan Wang, Ting Cao, Mao Yang, Haisheng Tan. (2024). "LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search." NSDI.
Download Paper

14. ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores

Published in ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), 2024

PPoPP 2024 Best Paper Award

Recommended citation: Yuetao Chen, Kun Li, Yuhao Wang, Donglin Bai, Lei Wang, Lingxiao Ma, Liang Yuan, Yunquan Zhang, Ting Cao, Mao Yang. (2024). "ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores." PPoPP.
Download Paper

2023

1. LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup

Published in The 29th Annual International Conference On Mobile Computing And Networking (MobiCom), 2023

Microsoft Research Focus 2023

Recommended citation: Xiaohu Tang, Yang Wang, Ting Cao, Li Lyna Zhang, Qi Chen, Deng Cai, Yunxin Liu, Mao Yang. (2023). "LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup." MobiCom.
Download Paper

2. Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Published in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023

Recommended citation: Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang. (2023). "Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference." ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
Download Paper

3. SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference

Published in International Conference on Computer Vision (ICCV), 2023

Recommended citation: Xudong Wang, Li Lyna Zhang, Jiahang Xu, Quanlu Zhang, Yujing Wang, Yuqing Yang, Ningxin Zheng, Ting Cao, Mao Yang. (2023). "SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference." International Conference on Computer Vision (ICCV).
Download Paper

4. ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices

Published in International Conference on Computer Vision (ICCV), 2023

Recommended citation: Chen Tang, Li Lyna Zhang, Huiqiang Jiang, Jiahang Xu, Ting Cao, Quanlu Zhang, Yuqing Yang, Zhi Wang, Mao Yang. (2023). "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices." International Conference on Computer Vision (ICCV).
Download Paper

5. Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training

Published in 26th European Conference on Artificial Intelligence (ECAI), 2023

Recommended citation: Yijia Zhang, Yibo Han, Shijie Cao, Guohao Dai, Youshan Miao, Ting Cao, Fan Yang, Ningyi Xu. (2023). "Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training." ECAI.
Download Paper

6. Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Published in Conference of the International Speech Communication Association (INTERSPEECH), 2023

Recommended citation: Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu. (2023). "Accurate and Structured Pruning for Efficient Automatic Speech Recognition." Conference of the International Speech Communication Association (INTERSPEECH).
Download Paper

7. HiMoDepth: Efficient Training-Free High-Resolution On-Device Depth Perception

Published in IEEE Transactions on Mobile Computing (TMC), 2023

Recommended citation: Jinrui Zhang, Huan Yang, Ju Ren, Deyu Zhang, Bangwen He, Youngki Lee, Ting Cao, Yuanchun Li, Yaoxue Zhang, Yunxin Liu. (2024). "HiMoDepth: Efficient Training-Free High-Resolution On-Device Depth Perception." IEEE Transactions on Mobile Computing (TMC), 23(5), 2024.
Download Paper

8. VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations

Published in IEEE Transactions on Computers (TC), 2023

Recommended citation: Chen Nie, Chenyu Tang, Jie Lin, Huan Hu, Chenyang Lv, Ting Cao, Weifeng Zhang, Li Jiang, Xiaoyao Liang, Weikang Qian, Yanan Sun, Zhezhi He. (2023). "VSPIM: SRAM Processing-in-Memory DNN Acceleration via Vector-Scalar Operations." IEEE Transactions on Computers (TC).
Download Paper

9. NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors

Published in The 21st International Conference on Mobile Systems, Applications, and Services (MobiSys), 2023

Recommended citation: Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, Yunxin Liu. (2023). "NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors." The 21st International Conference on Mobile Systems, Applications, and Services (MobiSys).
Download Paper

10. Boosting DNN Cold Inference on Devices

Published in The 21st Annual International Conference on Mobile Systems, Applications and Services (MobiSys), 2023

Recommended citation: Rongjie Yi, Ting Cao, Ao Zhou, Xiao Ma, Shangguang Wang, Mengwei Xu. (2023). "Boosting DNN Cold Inference on Devices." The 21st Annual International Conference on Mobile Systems, Applications and Services (MobiSys).
Download Paper

11. Efficient GPU Kernels for N:M-SPARSE Weights in Deep Learning

Published in Sixth Conference on Machine Learning and Systems (MLSys), 2023

Recommended citation: Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang. (2023). "Efficient GPU Kernels for N:M-SPARSE Weights in Deep Learning." Sixth Conference on Machine Learning and Systems (MLSys).
Download Paper

2022

1. Turbo: Opportunistic Enhancement for Edge Video Analytics

Published in The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys), 2022

Recommended citation: Yan Lu, Shiqi Jiang, Ting Cao, Yuanchao Shu. (2022). "Turbo: Opportunistic Enhancement for Edge Video Analytics." The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys).
Download Paper

2. Hyperion: A Generic and Distributed Mobile Offloading Framework on OpenCL

Published in The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys), 2022

Recommended citation: Ziyan Fu, Ju Ren, Yunxin Liu, Ting Cao, Deyu Zhang, Yuezhi Zhou, Yaoxue Zhang. (2022). "Hyperion: A Generic and Distributed Mobile Offloading Framework on OpenCL." The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys).
Download Paper

3. Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Published in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom), 2022

ArchProbe: 2022-2023 Top 100 Open Source Achievements Award

Recommended citation: Rendong Liang, Ting Cao, Jicheng Wen, Manni Wang, Yang Wang, Jianhua Zou, Yunxin Liu. (2022). "Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs." Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom).
Download Paper

4. MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras

Published in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom), 2022

Recommended citation: Jinrui Zhang, Huan Yang, Ju Ren, Deyu Zhang, Bangwen He, Yuanchun Li, Ting Cao, Yaoxue Zhang, Yunxin Liu. (2022). "MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras." Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom).
Download Paper

5. SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

Published in ACM International Conference on Information and Knowledge Management (CIKM), 2022

Recommended citation: Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen. (2022). "SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance." ACM International Conference on Information and Knowledge Management (CIKM).
Download Paper

6. CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices

Published in 20th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2022

Recommended citation: Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, Yaoxue Zhang. (2022). "CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices." 20th International Conference on Mobile Systems, Applications, and Services (MobiSys).
Download Paper

2021

1. nn-Meter: towards accurate latency prediction of DNN inference on diverse edge devices

Published in GetMobile: Mobile Computing and Communications, Research Highlights, 2021

ACM SigMobile Research Highlight

Recommended citation: L. Zhang, S. Han, J. Wei, N. Zheng, T. Cao, Y. Yang, Y. Liu. (2021). "nn-Meter: towards accurate latency prediction of DNN inference on diverse edge devices." GetMobile: Mobile Computing and Communications, Research Highlights, 25(4): pp. 19-23.
Download Paper

2. Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories

Published in ACM Transactions on Computer Systems (TOCS), 2021

Recommended citation: Lei Chen, Jiacheng Zhao, Chenxi Wang, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, Guoqing Harry Xu, Huimin Cui. (2021). "Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories." ACM Transactions on Computer Systems (TOCS), Vol 39(1-4): pp. 1-38.
Download Paper

3. AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs

Published in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom), 2021

Recommended citation: Manni Wang, Shaohua Ding, Ting Cao, Yunxin Liu, Fengyuan Xu. (2021). "AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs." Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom).
Download Paper

4. nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices

Published in 19th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2021

MobiSys 2021 Best Paper Award

Recommended citation: L. Zhang, S. Han, J. Wei, N. Zheng, T. Cao, Y. Yang, Y. Liu. (2021). "nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices." 19th International Conference on Mobile Systems, Applications, and Services (MobiSys).
Download Paper

5. To Bridge Neural Network Design and Real-World Performance: A Behaviour Study for Neural Networks

Published in Proceedings of Machine Learning and Systems (MLSys), 2021

Recommended citation: X. Tang, S. Han, L. Zhang, T. Cao, Y. Liu. (2021). "To Bridge Neural Network Design and Real-World Performance: A Behaviour Study for Neural Networks." Conference on Machine Learning and Systems (MLSys).
Download Paper

2020

1. Profiling and optimizing deep learning inference on mobile GPUs

Published in Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys), 2020

Recommended citation: S. Jiang, L. Ran, T. Cao, Y. Xu, Y. Liu. (2020). "Profiling and optimizing deep learning inference on mobile GPUs." Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys).
Download Paper

2019

1. Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories

Published in ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2019

Recommended citation: C. Wang, H. Cui, T. Cao, J. Zigman, H. Volos, O. Mutlu, F. Lv, X. Feng, H. Xu. (2019). "Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories." ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
Download Paper

Dr. Ting Cao
