Updated 2024/10/18


KAWAI Masatoshi
河合 直聡

Affiliation
Information Technology Center, Large-Scale Computing Support Environment Research Division
Position
Designated Assistant Professor
 

Papers 4

  1. Optimize Efficiency of Utilizing Systems by Dynamic Core Binding

    Kawai M., Ida A., Hanawa T., Hoshino T.

    ACM International Conference Proceeding Series     pp. 77 - 82   January 2024


    Publisher: ACM International Conference Proceeding Series

    Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, within a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding (DCB), however, adjusts the core binding based on each process's workload, thereby realizing load balancing at the core level. In prior research, we implemented the DCB library, which provides two policies: one for computation-time reduction and one for power reduction. In this paper, we show that the two policies can be used together to achieve both computation-time reduction and power-consumption reduction.
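
The core-level balancing idea can be illustrated by making the number of cores bound to each process roughly proportional to its workload. `assign_cores` below is a hypothetical sketch of that allocation step, not the DCB library's actual API (it assumes `total_cores >= len(workloads)`):

```python
def assign_cores(workloads, total_cores):
    """Bind an unequal number of cores to each process so that the
    per-core workload is as even as possible; every process keeps
    at least one core. Assumes total_cores >= len(workloads)."""
    total = sum(workloads)
    cores = [max(1, round(w / total * total_cores)) for w in workloads]
    # Repair rounding drift: take cores from processes whose per-core
    # load stays lowest, give cores to the most heavily loaded.
    while sum(cores) > total_cores:
        i = min((i for i in range(len(cores)) if cores[i] > 1),
                key=lambda i: workloads[i] / (cores[i] - 1))
        cores[i] -= 1
    while sum(cores) < total_cores:
        i = max(range(len(cores)), key=lambda i: workloads[i] / cores[i])
        cores[i] += 1
    return cores

print(assign_cores([10, 30, 60], 10))  # → [1, 3, 6]
```

With equal binding, the process carrying 60% of the work would get the same 1/3 of the cores as the lightest one; the unequal binding above evens out the per-core load instead.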

    DOI: 10.1145/3636480.3637221

    Scopus

  2. Auto-Tuning Mixed-precision Computation by Specifying Multiple Regions

    Ren X., Kawai M., Hoshino T., Katagiri T., Nagai T.

    Proceedings - 2023 11th International Symposium on Computing and Networking, CANDAR 2023     pp. 175 - 181   2023


    Publisher: Proceedings - 2023 11th International Symposium on Computing and Networking, CANDAR 2023

    Mixed-precision computation is a promising method for substantially increasing the speed of numerical computations. However, using mixed-precision data is a double-edged sword: although it can improve computational performance, the reduction in precision introduces more uncertainties and errors. It is necessary to determine which variables can be represented in a lower-precision format without affecting the accuracy of the results. Hence, much effort is spent on selecting appropriate variables while considering both execution time and numerical accuracy. Auto-tuning (AT) is one of several technologies that can help eliminate this intensive work. In this study, we investigated an AT strategy for the 'Blocks' directive in the auto-tuning language ppOpen-AT to tune multiple regions of a program, and evaluated its effectiveness. A benchmark program of the nonhydrostatic icosahedral atmospheric model (NICAM), a global cloud-resolving model, was used as a case study. Experimental results indicated that when a single part of the program performed well under mixed-precision computation, a combination of regions achieved even better performance. On the Flow Type I Subsystem (Fujitsu PRIMEHPC FX1000), this method achieved an almost 1.27× speedup compared with the NICAM benchmark program using all double-precision data.
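
The multi-region search can be pictured as a brute-force sweep over precision assignments. The `run` hook below is purely hypothetical (ppOpen-AT generates and times the code variants itself); the sketch only shows the selection logic:

```python
from itertools import product

def autotune(regions, run):
    """Try every float32/float64 assignment for the tunable regions and
    keep the fastest configuration that still passes the accuracy check.
    `run(cfg)` is a hypothetical hook returning (time, acceptable)."""
    best_cfg, best_t = None, float("inf")
    for combo in product(("float32", "float64"), repeat=len(regions)):
        cfg = dict(zip(regions, combo))
        t, ok = run(cfg)
        if ok and t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Toy model: region 'a' needs double precision, 'b' tolerates single.
def run(cfg):
    t = sum(1.0 if p == "float64" else 0.5 for p in cfg.values())
    return t, cfg["a"] == "float64"

print(autotune(["a", "b"], run))  # → ({'a': 'float64', 'b': 'float32'}, 1.5)
```

Real auto-tuners prune this exponential space; the exhaustive loop is only meant to show why tuning several regions jointly can beat tuning any single region alone.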

    DOI: 10.1109/CANDAR60563.2023.00031

    Scopus

  3. Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor

    Mukunoki, D; Kawai, M; Imamura, T

    2023 IEEE 16TH INTERNATIONAL SYMPOSIUM ON EMBEDDED MULTICORE/MANY-CORE SYSTEMS-ON-CHIP, MCSOC     pp. 608 - 615   2023


    Publisher: Proceedings - 2023 16th IEEE International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2023

    Mixed-precision computation, which uses multiple different precisions in a single code, is being studied to increase computational speed and energy efficiency. It typically uses the IEEE 754-2008 floating-point formats; however, for finer-grained precision tuning, one can consider custom data representations, for example with lower mantissa precision than the IEEE formats. Sparse iterative solvers are examples in which such reduced-precision formats are effective. In this study, we discuss the implementation and performance of sparse matrix-vector multiplication (SpMV), the kernel of sparse iterative solvers, with reduced-precision floating-point formats on general-purpose processors. The reduced-precision scheme we adopt truncates the mantissa bits of the IEEE formats in 8-bit steps (e.g., it allows the creation of 16-/24-/32-/40-/48-/56-bit formats based on binary64). It acts as a so-called memory accessor: the reduced-precision format is used only for data representation in memory, and arithmetic operations are performed in the IEEE format on the registers. It can be introduced easily without significant code changes. With this scheme, we implement two types of reduced-precision SpMV: one on the compressed sparse row (CSR) format, a classical sparse format, and one on the adaptive multi-level blocking format, an index-compressed format based on the Sell-C-σ format. We evaluate performance on four architectures, including CPUs and GPUs (AMD Zen3, Intel Ice Lake, NVIDIA Ampere, and AMD CDNA). As a result, we demonstrate good scalability with respect to the bit length of the format, as long as the problem is well out of cache and performance is memory-bandwidth-bound.
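
The truncation scheme can be sketched in a few lines: store only the high-order bytes of a binary64 value and zero-fill on load. This is a minimal model of the accessor idea, not the authors' implementation (function names are illustrative):

```python
import struct

def store_reduced(x: float, nbytes: int = 6) -> bytes:
    """Keep the sign, exponent, and high-order mantissa bytes of a
    binary64 value; dropping 2 of 8 bytes yields a 48-bit format."""
    return struct.pack(">d", x)[:nbytes]   # big-endian: high bytes first

def load_reduced(b: bytes) -> float:
    """Zero-fill the truncated mantissa bits on load; arithmetic then
    runs in full binary64 on the registers."""
    return struct.unpack(">d", b + bytes(8 - len(b)))[0]

x = 3.141592653589793
y = load_reduced(store_reduced(x, 6))   # 48-bit in-memory representation
print(abs(x - y) / abs(x))              # relative error stays below 2**-36
```

Because only the memory representation changes, a memory-bandwidth-bound kernel like SpMV moves 6/8 of the data for a roughly proportional speedup, while the arithmetic itself is unchanged.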

    DOI: 10.1109/MCSoC60832.2023.00094

    Web of Science

    Scopus

  4. Dynamic Core Binding for Load Balancing of Applications Parallelized with MPI/OpenMP

    Kawai M., Ida A., Hanawa T., Nakajima K.

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   Vol. 14075 LNCS   pp. 378 - 394   2023


    Publisher: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

    Load imbalance is a critical problem that degrades the performance of parallelized applications in massively parallel processing. Although MPI/OpenMP implementations are widely used for parallelization, users must maintain load balancing at both the process level and the thread (core) level for effective parallelization. In this paper, we propose dynamic core binding (DCB) to processes for reducing the computation time and energy consumption of applications. With the DCB approach, an unequal number of cores is bound to each process, and load imbalance among processes is mitigated at the core level. This approach not only improves parallel performance but also reduces power consumption, by reducing the number of cores in use without increasing the computation time. Although load imbalance among nodes cannot be handled by DCB, we also examine how to address this problem by mapping processes to nodes. In our numerical evaluations, we implemented a DCB library and applied it to lattice H-matrices. We achieved a 58% performance improvement and a 77% energy-consumption reduction for applications using the lattice H-matrix.
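
The node-level step, placing processes on nodes so that node workloads are even before DCB balances cores within each node, can be sketched with a greedy longest-processing-time heuristic. This is an illustrative stand-in, not the paper's exact mapping:

```python
import heapq

def map_processes_to_nodes(workloads, n_nodes):
    """Assign each process (heaviest first) to the currently least
    loaded node; DCB can then balance cores inside each node."""
    heap = [(0.0, node, []) for node in range(n_nodes)]
    heapq.heapify(heap)
    for pid, w in sorted(enumerate(workloads), key=lambda p: -p[1]):
        load, node, procs = heapq.heappop(heap)   # least loaded node
        procs.append(pid)
        heapq.heappush(heap, (load + w, node, procs))
    return {node: procs for _, node, procs in heap}

# Five processes with uneven workloads spread over two nodes.
print(map_processes_to_nodes([5, 4, 3, 3, 3], 2))
```

For the example above the two node loads come out as 8 and 10 rather than the worst case of 5 and 13, which narrows the residual imbalance that the per-node core binding has to absorb.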

    DOI: 10.1007/978-3-031-36024-4_30

    Scopus

KAKENHI 1

  1. Expanding the Applicability of Low-Rank Structured Matrix Methods and Exploiting Diverse Computing Architectures

    Research Project Number: 24K02949  April 2024 - March 2027

    Grants-in-Aid for Scientific Research (KAKENHI)  Scientific Research (B)

    IDA Akihiro, YOKOTA Rio, HANAWA Toshihiro, IWASHITA Takeshi, OHSHIMA Satoshi, HOSHINO Tetsuya, HIRAISHI Tasuku, KAWAI Masatoshi


    Role: Co-Investigator

    This project enhances the functionality of a low-rank structured matrix library. In scientific computing, numerical linear algebra libraries based on dense-matrix computation are widely used. We will expand the applicability of low-rank structured matrix methods so that dense-matrix operations can be replaced with low-rank structured matrix operations, and we will develop new numerical algorithms based on low-rank structured matrices. Algorithm development will target cluster computers built from modern architectures such as GPUs and FPGAs, with implementations optimized accordingly. For the various low-rank structured matrix operations, we will assign the most suitable computing architecture and exploit techniques such as mixed-precision arithmetic and dynamic load balancing, studying implementation methods that draw out the maximum performance of the machines.