


ICPP 2025: San Diego, CA, USA
- Proceedings of the 54th International Conference on Parallel Processing, ICPP 2025, San Diego, CA, USA, September 8-11, 2025. ACM 2025, ISBN 979-8-4007-2074-1

AI
- Quan Deng, Lin Gan, Hongkun Yu, Wenlai Zhao, Guangwen Yang:
  Auto-Stencil: Performance-Driven Stencil Optimization with Hardware Feedback for LLMs. 1-10
- Ronghuai Chen, Ce Yu, Hao Fu, Xiaoteng Hu, Bin Yang:
  MixLoRA: An Efficient Multi-Tenant Framework for Concurrently Serving Diverse LoRA Models in Large Language Models. 11-21
- Yiduo Wang, Wenda Tang, Linghang Meng, Liang Li, Jie Wu:
  Origami: Efficient ML-Driven Metadata Load Balancing for Distributed File Systems. 22-32
- Haonan Jiang, Yusen Li, Xiaoguang Liu, Gang Wang, Xuebo Zhang:
  Solving Extended Flexible Job Shop Scheduling Problems with Deep Reinforcement Learning. 33-42
- Hao Dong, Yuehao Xu, Xiaohui Wang, Xinhua Ji, Zhijun Ding:
  CoreTuner: Predicting and Scheduling Framework for Optimizing the Joint Allocation of CPU and GPU in Training Cluster. 43-52
- Jin Wang, Chenye Zhu, Jinbin Hu:
  A High-Accuracy Sketch for Measuring Low-Entropy Flows in Distributed AI Training. 53-62
- Sooho Jang, Ahyeon Lim, Yuchan Lee, Sookwang Lee, Jaehwan Lee:
  P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering. 63-72
- Tianle Li, Yongzhi Huang, Linshan Jiang, Qipeng Xie, Chang Liu, Wenfeng Du, Lu Wang, Kaishun Wu:
  FedWCM: Unleashing the Potential of Momentum-based Federated Learning in Long-Tailed Scenarios. 73-82
- Chenghao Nu, Zhe Zhang, Ye Li, Yanchao Zhao:
  It Takes Two: Accelerating Accurate Federated Learning through Pipelined Intra-Batch Data Sampling and Training. 83-93
- Joshua Hoke Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele:
  ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks. 94-103
- He Bai, Hui Li, Jianming Que, Minglong Zhang, Zhiqiang Hu, Ximing Xu, Bing Lin, Runhuai Huang, Junyang Qiu, Shaowen Deng:
  Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning. 104-114
Algorithms
- Zhiyi Zhang, Junshi Chen, Jingwei Sun, Pengfei Zhang, Zhuopin Xu, Jun Shi, Qi Wang:
  WinRS: Accelerate Winograd Backward-Filter Convolution with Tiny Workspace. 115-124
- Dinghuang Hu, Dezun Dong, Xiangke Liao:
  SINA: Accelerating Time Synchronization in Large-Scale Network Simulation Using In-Network Allreduce. 125-134
- Chuhe Hong, Qinglin Wang, Xing Peng, Gencheng Liu, Qingyang Zhang, Xinhai Chen, Jie Liu:
  VES: Vectorized Sparse General Matrix-Matrix Multiplication on Multi-Core DSPs. 135-145
- Pengyu Wang, Xiaotian Chen, Jianbin Fang, Peng Zhang, Yonggang Che, Chun Huang, Jie Ren:
  Optimizing Direct Convolutions on High-Performance Multi-Core DSPs. 146-156
- Cameron Bradley, Anju Mongandampulath Akathoott, Martin Burtscher:
  Fast Exact Diameter Computation of Sparse Graphs. 157-167
- Fengkui Yang, Yuanzhang Wang, Chunhua Li, Ke Zhou, Hui Li:
  SpeedSketch: An Ultra-Fast Sketch Generation and Delta Encoding Framework for Delta Compression. 168-177
- Maxime Gonthier, Kyle Chard, Ian T. Foster, Loris Marchal, Frédéric Vivien:
  Deadline-Aware Scheduling of Mixed-Criticality Tasks. 178-187
- Zhengding Hu, Yi Zong, Jingwei Sun, Wei Xue, Guangzhong Sun:
  A Fast Sparse Triangular Solve for Structured-grid Problems on Heterogeneous Processors. 188-198
- Changjie Xu, Ke Meng, Zhiheng Lin, Guangming Tan:
  PISCES: Push-Pull Hybrid Optimization for Graph Pattern Matching. 199-207
- Sasindu Wijeratne, Rajgopal Kannan, Viktor K. Prasanna:
  AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs. 208-217
- Yicong Luo, Senhe Hao, Brian Wheatman, Prashant Pandey, Helen Xu:
  Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist. 218-227
- George M. Slota, Michael Mandulak:
  Scaling Distributed Graph Processing to Hundreds of GPUs. 228-237
- Pál András Papp, Toni Böhnlein, Albert-Jan Nicholas Yzelman:
  Multiprocessor Scheduling with Memory Constraints: Fundamental Properties and Finding Optimal Solutions. 238-247
Applications
- Xin Yong, Li Yan, Zhuozhao Li:
  Heterogeneity-aware Task Scheduling based on Personalized Federated Reinforcement Learning. 248-257
- Yubing Bao, Zhihui Lu, Qiang Duan, Xin Du, Zhongyu Chen, Yicong Zhao, Xiaoyi Li, Yandan Tan, Shuhan Yang, Ziyi Wang, Yang Chen, Yang Xu:
  BMapper: A Scalable and Efficient Framework for Brain Simulations Acceleration on Supercomputers. 258-267
- Antonio De Caro, Gennaro Cordasco, Biagio Cosenza:
  SYgraph: A Portable Heterogeneous Graph Analytics Framework for GPUs. 268-277
- Xingyu Liu, Jiawei Liang, Linfeng Du, Yipu Zhang, Chaofang Ma, Hanwei Fan, Jiang Xu, Wei Zhang:
  FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration. 278-287
- Brian Curless, Michael Gowanlock:
  Fast and Scalable Mixed Precision Euclidean Distance Calculations Using GPU Tensor Cores. 288-298
- João Brotas, Ricardo Nobre, Aleksandar Ilic:
  Thievory: Graph Processing with Multi-GPU Memory Stealing. 299-308
- Prajjwal Nijhara, Lokesh Venkatachalam, Agam Harpreet Singh, Athreya Chandramouli, Sayantan Jana, Kishore Kothapalli, Dip Sankar Banerjee:
  Efficient Parallel Algorithms for Dynamic Percolation Centrality. 309-319
- Zesong Wang, Peng Fang, Fang Wang, Hong Jiang, Yimin Lu, Zhan Shi, Dan Feng:
  SpiderCache: Semantic-Aware Caching Strategy for DNN Training. 320-330
Architecture
- Matthew Barondeau, Sophia Jiang, Jonathan Beard, Andreas Gerstlauer:
  ViReC: The Virtual Register Context Architecture for Efficient Near-Memory Multithreading. 331-341
- Guanglei Xu, Hai Zhou, Yuchong Hu, Dan Feng, Renzhi Xiao:
  Accelerating Erasure Coding on Persistent Memory via Adaptive Prefetcher Scheduling. 342-351
- Ruisong Zhou, Peng Wang, Chunhua Li, Ke Zhou, Hui Li:
  ADAPT: Dynamic Grouping and Cross-Group Aggregation for GC-Efficient Log-Structured Storage in SSD Arrays. 352-361
- Junru Shen, Miao Cai, Kangyue Gao, Baoliu Ye, Guo Cheng:
  HeatList: The Case for Retrofitting In-memory Range Index with Hotspot Awareness. 362-373
- Lizhi Zhang, Menghan Jia, Zhiquan Lai, Qiao Li, Yiming Zhang, Dongsheng Li:
  HMGraph: Boosting GNN Training on Hierarchical Memory via Coordinated Cache. 374-384
- Shuai Lin, Rui Wang, Zaigui Zhang, Long Deng, Wenzhe Zhu, Yongkun Li, Yinlong Xu:
  PTWalker: Cache-Efficient Random Walks via Alternating Dual-Subgraph Walker Updating. 385-395
- Baosen Zhao, Jianan Sun, Xu Zhou, Wanghong Yang, Wenji Du, Fukang Chen, Yongmao Ren, Stefan Schmid:
  Efficient Cross-Datacenter Congestion Control with Fast Control Loops. 396-405
- Wenqi Lou, Yunji Qin, Zihao Wang, Chao Wang, Lei Gong, Xuehai Zhou:
  Automated FPGA Accelerator Generation Framework for Transformers with Dataflow Optimization. 406-416
- Xue Xiao, Yi Dai, Yanqiang Sun, Jianmin Zhang, Tiejun Li:
  Design of Interposer Interconnection Network Based on High-Radix Interposer Routers. 417-427
- Xin Ju, Jingkui Yang, Mei Wen, Jun He, Jing Feng, Minjin Tang, Zhaoyun Chen, Yang Shi:
  SmartBlock: Adaptive Block Floating Point Quantization for Efficient DNN Acceleration. 428-438
- Xianfa Zhou, Tun Li, Yuhuan Xia, Ruiyu Zhang:
  COF: Cycle and transmission co-mapping framework for CNN mapping in PIM architecture. 439-448
- Yuan Ma, Srinivasan Subramaniyan, Xiaorui Wang:
  Power Capping of GPU Servers for Machine Learning Inference Optimization. 449-459
- Zhongchun Zhou, Chengtao Lai, Wei Zhang:
  LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling. 460-469
Multidisciplinary
- Haibin Lai, Sicheng Zhou, Site Fan, Zhuozhao Li:
  ParaCOSM: A Parallel Framework for Continuous Subgraph Matching. 470-479
- Yanfeng Lu, Tao Wu, Chao Chang, Hongjun Wang, Mingxing Ke, Jian Wang:
  Heterogeneity-aware Federated Edge Learning via UAV Sampling and D2D Communications. 480-489
- Abdullah Al-Mamun, Dongfang Zhao, Gagan Agrawal, Ahmed Aleroud, Mohamed I. Ibrahem:
  ZTP: A Scalable and Lightweight Privacy-Preserving Blockchain via Scale-Free Quorums and Geometric Fragmentation. 490-499
- Evelyne Ringoot, Rabab Alomairy, Valentin Churavy, Alan Edelman:
  Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision. 500-510
Performance
- Da Huo, Zhenzhe Zheng, Xiaoyao Huang, Hao Chen, Jianfeng Hu, Zhiyong Yan, Fan Wu, Jie Wu:
  Joint Prediction and Matching for Computing Resource Exchange Platforms. 511-520
- Yangfan Qiao, Zhuozhao Li:
  Lias: Leveraging Performance Counters for Interference Quantification and Mitigation in Multi-processor Systems. 521-530
- Elliott D. Binder, Jeffrey Low, Tze Meng Low:
  Architecture-Aware Models of AI Engines for High-Performance Matrix Matrix Multiplication. 531-540
- Diaohan Luo, Zhen Tang, Heran Gao, Yuewen Wu, Heng Wu, Xi Han, Wenbo Zhang:
  Scheduling based on Block Features for Concurrent Inference with Unseen DNN Models on GPU. 541-552
- Yongzhen Shi, Qinglin Wang, Jie Liu, Lian Wang, Zhiyan Liu, Bingwei Wang, Feiming Liu, Xiangdong Pei:
  Optimizing Incomplete Cholesky Factorization on MIMD Many-core Architecture. 553-563
- Kelun Lei, Hailong Yang, Kaige Zhang, Shaokang Du, Marc Casas, Yufan Xu, Zhongzhi Luan, Yi Liu, Depei Qian:
  OVERT: Orchestrating Vector-Scalar Execution for Efficient SpMV on Modern CPUs. 564-574
- Xuezhu Wang, Hailong Yang, Xin You, Yufan Xu, Xiaoyan Liu, Siqi Wang, Kaige Zhang, Mingzhen Li, Zhongzhi Luan, Yi Liu, Depei Qian:
  ESC: Effective Submanifold Convolution using Tensor Cores. 575-585
- Boyu Du, Jingya Zhou, Jin Wang, Jiangwei Wang, Zhijun Li:
  Joint Task Scheduling and Resource Allocation in Cloud-Edge Collaborative Computing Systems. 586-596
- Matheus Costa, Philippe O. A. Navaux, Silvio Rizzi, Arthur Francisco Lorenzon:
  One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing. 597-606
- Akash Dutta, Ali Jannesari:
  HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization. 607-616
- Dewi Yokelson, Stephanie Brink, Jason Burmark, Michael McKinsey, Befikir Bogale, Ian Lumsden, Michela Taufer, Tom Scogland, Olga Pearce:
  Cross-Architecture Performance Analysis Using the RAJA Performance Suite. 617-626
- Dominik Schweisgut, Anne Benoit, Yves Robert, Henning Meyerhenke:
  Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint. 627-637
Quantum Computing
- Ziqing Guo, Jan Balewski, Ziwen Pan:
  Q-GEAR: Improving quantum simulation framework. 638-647
- Jiayi Zhong, Yuxin Deng:
  Cycle-Aware Parallel Optimization for Mitigating ZZ Crosstalk on Quantum Hardware. 648-657
- Waylon Luo, Jiapeng Zhao, Tong Zhan, Qiang Guan:
  Adaptive Job Scheduling in Quantum Clouds Using Reinforcement Learning. 658-667
Software
- Floris-Jan Willemsen, Rob V. van Nieuwpoort, Ben van Werkhoven:
  Efficient Construction of Large Search Spaces for Auto-Tuning. 668-677
- Zhiqiang Wang, Wenzhe Zhu, Zaigui Zhang, Chaomei Yan, Fan Guo, Yongkun Li, Yinlong Xu:
  Amber: Towards Fast and Space-Efficient Incremental Checkpointing in Large Language Model Training. 678-688
- Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Jiangsu Du, Zhiguang Chen, Yutong Lu:
  TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference. 689-698
- Wenda Tang, Yiduo Wang, Yanwen Wang, Jie Wu:
  Leave No One Behind: Fair and Efficient Tiered Memory Management for Multi-Applications. 699-709
- Tianhao Wu, Da Yan, Qihao Cheng, Lyuheng Yuan, Sheng Di, Jiao Han, Zhongyi Huang, Ji Cheng:
  CompreGel: Efficient Distributed Graph Propagation via Error-Bounded Lossy Message Compression. 710-719
- Hanfeng Liu, Xuemei Peng, Zeyi Wen:
  Accelerating Multi-Output GBDTs with GPUs. 720-729
- Shihao Zhang, Chi Zhang, Chentao Wu, Jie Li, Minyi Guo, Hui Li, Liqiang Zhang:
  Decision Shuffle: Efficient Pre-scheduling System for Push-based Shuffle in DAG Computing Frameworks. 730-740
- Jonathan Lifflander, Nicole Slattengren, Philippe P. Pebay, Pierre L. Pebay, Caleb Schilly, Robert A. Pfeiffer, Joseph D. Kotulski:
  Accelerating an Electromagnetic Simulation via Memory-Constrained Task-Based Load Balancing. 741-752
- Keshvi Tuteja, Gregor Olenik, Roman Mishchuk, Yu-Hsiang Tsai, Markus Götz, Achim Streit, Hartwig Anzt, Charlotte Debus:
  pyGinkgo: A Sparse Linear Algebra Operator Framework for Python. 753-763
- Narasinga Rao Miniskar, Aaron R. Young, Mohammad Alaul Haque Monil, Kazi Asifuzzaman, Beau Johnston, Keita Teranishi, Jeffrey S. Vetter:
  IRIS-MASH: Efficient Multi-device Asynchronous Multi-Stream Heterogeneous Computing. 764-773
- Kuldeep Pal, Aniket P. Garade, Deepika H. V, Haribabu P, S. A. Kumar, S. D. Sudarsan:
  Optimizing NumPy with SVE Acceleration on ARM Architectures. 774-783
- Chen-Chun Chen, Jinghan Yao, Hari Subramoni, Dhabaleswar K. Panda:
  Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication. 784-793
- Hongsu Byun, Honghyeon Yoo, Sungyong Park:
  Revisiting Multi-threaded Compaction in LSM-trees: Enabling Compaction Pipelining. 794-803
- Ziji Shi, Le Jiang, Ang Wang, Jie Zhang, Chencan Wu, Yong Li, Xiaokui Xiao, Wei Lin, Jialin Li:
  TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks. 804-815
