Publications

2025


3. Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Recommended citation: Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, and Furu Wei. (2025). "Bitnet.cpp: Efficient Edge Inference for Ternary LLMs." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL).

11. Anatomizing Deep Learning Inference in Web Browsers

Published in ACM Transactions on Software Engineering and Methodology (TOSEM), 2025

Recommended citation: Qipeng Wang, Shiqi Jiang, Zhenpeng Chen, Xu Cao, Yuanchun Li, Aoyu Li, Yun Ma, Ting Cao, Xuanzhe Liu. (2024). "Anatomizing Deep Learning Inference in Web Browsers." TOSEM.

2024


6. AFPQ: Asymmetric Floating Point Quantization for LLMs

Published in 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024 Findings, short paper), 2024

Recommended citation: Yijia Zhang, Sicheng Zhang, Shijie Cao, DaYou Du, Jianyu Wei, Ting Cao, Ningyi Xu. (2024). "AFPQ: Asymmetric Floating Point Quantization for LLMs." ACL.

2023


2. Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Published in ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2023

Recommended citation: Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang. (2023). "Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference." ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).

6. Accurate and Structured Pruning for Efficient Automatic Speech Recognition

Published in Conference of the International Speech Communication Association (INTERSPEECH), 2023

Recommended citation: Huiqiang Jiang, Li Lyna Zhang, Yuang Li, Yu Wu, Shijie Cao, Ting Cao, Yuqing Yang, Jinyu Li, Mao Yang, Lili Qiu. (2023). "Accurate and Structured Pruning for Efficient Automatic Speech Recognition." Conference of the International Speech Communication Association (INTERSPEECH).

9. NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors

Published in The 21st International Conference on Mobile Systems, Applications, and Services (MobiSys), 2023

Recommended citation: Jianyu Wei, Ting Cao, Shijie Cao, Shiqi Jiang, Shaowei Fu, Mao Yang, Yanyong Zhang, Yunxin Liu. (2023). "NN-Stretch: Automatic Neural Network Branching for Parallel Inference on Heterogeneous Multi-Processors." The 21st International Conference on Mobile Systems, Applications, and Services (MobiSys).

10. Boosting DNN Cold Inference on Devices

Published in The 21st Annual International Conference on Mobile Systems, Applications and Services (MobiSys), 2023

Recommended citation: Rongjie Yi, Ting Cao, Ao Zhou, Xiao Ma, Shangguang Wang, Mengwei Xu. (2023). "Boosting DNN Cold Inference on Devices." The 21st Annual International Conference on Mobile Systems, Applications and Services (MobiSys).

11. Efficient GPU Kernels for N:M-SPARSE Weights in Deep Learning

Published in Sixth Conference on Machine Learning and Systems (MLSys), 2023

Recommended citation: Bin Lin, Ningxin Zheng, Lei Wang, Shijie Cao, Lingxiao Ma, Quanlu Zhang, Yi Zhu, Ting Cao, Jilong Xue, Yuqing Yang, Fan Yang. (2023). "Efficient GPU Kernels for N:M-SPARSE Weights in Deep Learning." Sixth Conference on Machine Learning and Systems (MLSys).

2022


1. Turbo: Opportunistic Enhancement for Edge Video Analytics

Published in The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys), 2022

Recommended citation: Yan Lu, Shiqi Jiang, Ting Cao, Yuanchao Shu. (2022). "Turbo: Opportunistic Enhancement for Edge Video Analytics." The 20th ACM Conference on Embedded Networked Sensor Systems (SenSys).

3. Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Published in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom), 2022

ArchProbe: 2022-2023 Top 100 Open Source Achievements Award

Recommended citation: Rendong Liang, Ting Cao, Jicheng Wen, Manni Wang, Yang Wang, Jianhua Zou, Yunxin Liu. (2022). "Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs." Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom).

4. MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras

Published in Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom), 2022

Recommended citation: Jinrui Zhang, Huan Yang, Ju Ren, Deyu Zhang, Bangwen He, Yuanchun Li, Ting Cao, Yaoxue Zhang, Yunxin Liu. (2022). "MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras." Proceedings of the 28th Annual International Conference on Mobile Computing and Networking (MobiCom).

5. SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

Published in ACM International Conference on Information and Knowledge Management (CIKM), 2022

Recommended citation: Li Lyna Zhang, Youkow Homma, Yujing Wang, Min Wu, Mao Yang, Ruofei Zhang, Ting Cao, Wei Shen. (2022). "SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance." ACM International Conference on Information and Knowledge Management (CIKM).

6. CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices

Published in 20th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2022

Recommended citation: Fucheng Jia, Deyu Zhang, Ting Cao, Shiqi Jiang, Yunxin Liu, Ju Ren, Yaoxue Zhang. (2022). "CoDL: Efficient CPU-GPU Co-execution for Deep Learning Inference on Mobile Devices." 20th International Conference on Mobile Systems, Applications, and Services (MobiSys).

2021


2. Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories

Published in ACM Transactions on Computer Systems (TOCS), 2021

Recommended citation: Lei Chen, Jiacheng Zhao, Chenxi Wang, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, Guoqing Harry Xu, Huimin Cui. (2021). "Unified Holistic Memory Management Supporting Multiple Big Data Processing Frameworks over Hybrid Memories." ACM Transactions on Computer Systems (TOCS), Vol 39(1-4): pp. 1-38.

3. AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs

Published in Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom), 2021

Recommended citation: Manni Wang, Shaohua Ding, Ting Cao, Yunxin Liu, Fengyuan Xu. (2021). "AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs." Proceedings of the 27th Annual International Conference on Mobile Computing and Networking (MobiCom).

4. nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices

Published in 19th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2021

MobiSys 2021 Best Paper Award

Recommended citation: L. Zhang, S. Han, J. Wei, N. Zheng, T. Cao, Y. Yang, Y. Liu. (2021). "nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices." 19th International Conference on Mobile Systems, Applications, and Services (MobiSys).

2020


1. Profiling and optimizing deep learning inference on mobile GPUs

Published in Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys), 2020

Recommended citation: S. Jiang, L. Ran, T. Cao, Y. Xu, Y. Liu. (2020). "Profiling and optimizing deep learning inference on mobile GPUs." Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys).

2019