LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search
Published in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2024
Hardware-Aware Neural Architecture Search (NAS) has demonstrated success in automating the design of affordable deep neural networks (DNNs) for edge platforms by incorporating inference latency in the search process. However, accurately and efficiently predicting DNN inference latency on diverse edge platforms remains a significant challenge. Current approaches require several days to construct a new latency predictor for each platform, which is prohibitively time-consuming and impractical. In this paper, we propose LitePred, a lightweight approach for accurately predicting DNN inference latency on new platforms with minimal adaptation data by transferring existing predictors. LitePred builds on two key techniques: (i) a Variational Autoencoder (VAE) data sampler to sample high-quality training and adaptation data that conforms to the model distributions in NAS search spaces, overcoming the out-of-distribution challenge; and (ii) a latency distribution-based similarity detection method to identify the most similar pre-existing latency predictors for the new target platform, reducing the adaptation data required while achieving high prediction accuracy. Extensive experiments on 85 edge platforms and 6 NAS search spaces demonstrate the effectiveness of our approach, achieving an average latency prediction accuracy of 99.3% with less than an hour of adaptation cost. Compared with SOTA platform-specific methods, LitePred achieves up to 5.3% higher accuracy with a significant 50.6× reduction in profiling cost. Code and predictors are available at https://github.com/microsoft/Moonlit/tree/main/LitePred.
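To make the two key techniques concrete, below is a minimal sketch of the VAE data-sampler idea, assuming architectures in a NAS search space are encoded as fixed-length feature vectors (e.g., kernel sizes, widths, depths). All class names, function names, and dimensions here are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: a VAE over architecture encodings, so that sampled
# training/adaptation configurations follow the search space's distribution
# rather than being drawn uniformly (the out-of-distribution problem).
import torch
import torch.nn as nn

class ArchVAE(nn.Module):
    def __init__(self, arch_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(arch_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.logvar = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, arch_dim))

    def forward(self, x):
        # Standard VAE reparameterization; train with reconstruction + KL loss.
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

    @torch.no_grad()
    def sample(self, n):
        # Decode random latent codes into architecture encodings that
        # conform to the distribution the VAE was trained on.
        z = torch.randn(n, self.mu.out_features)
        return self.decoder(z)
```

The second technique can be sketched similarly, assuming each platform is summarized by measured latencies of a small shared probe set of models. The Wasserstein distance used below is one plausible distribution distance; the paper's exact metric and probe construction may differ.

```python
# Hypothetical sketch: pick the pre-existing predictor whose platform's
# latency distribution is closest to the new target platform's, so that
# only a small amount of adaptation data is needed for fine-tuning.
from scipy.stats import wasserstein_distance

def most_similar_platform(target_latencies, predictor_pool):
    """target_latencies: latencies of probe models on the new platform.
    predictor_pool: platform name -> latencies of the same probe models
    on platforms that already have trained predictors."""
    best_name, best_dist = None, float("inf")
    for name, latencies in predictor_pool.items():
        d = wasserstein_distance(target_latencies, latencies)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist
```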
Recommended citation: Chengquan Feng, Li Lyna Zhang, Yuanchi Liu, Jiahang Xu, Chengruidong Zhang, Zhiyuan Wang, Ting Cao, Mao Yang, Haisheng Tan. (2024). "LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search." NSDI.
Download Paper