Faculty Profiles - HOSHINO Tetsuya


HOSHINO Tetsuya

Organization

Information Technology Center Associate Professor

Graduate School

Graduate School of Informatics

Research Areas 1

Informatics / High performance computing


Research History 2

Nagoya University Information Technology Center Associate Professor

2023.1

The University of Tokyo Information Technology Center Assistant Professor

2016.1 - 2022.12


Papers 26

Investigation of Azure CycleCloud Usage Environment and Considerations on Supercomputer Center-Cloud Collaboration

Nagai Toru, Gojuki Shuichi, Kawai Masatoshi, Katagiri Takahiro, Hoshino Tetsuya

Journal for Academic Computing and Networking Vol. 28 ( 1 ) page： 114 - 124 2024.11

Language：Japanese Publisher：Academic eXchange for Information Environment and Strategy

DOI： 10.24669/jacn.28.1_114

CiNii Research

Auto-Tuning Mixed-Precision Computation by Specifying Multiple Regions

Xuanzhengbo Ren, Masatoshi Kawai, Tetsuya Hoshino, Takahiro Katagiri, Toru Nagai

Concurrency and Computation: Practice and Experience Vol. 37 ( 2 ) 2024.11

Publishing type：Research paper (scientific journal) Publisher：Wiley

ABSTRACT

Mixed‐precision computation is a promising method for substantially improving high‐performance computing applications. However, using mixed‐precision data is a double‐edged sword. While it can improve computational performance, the reduction in precision introduces more uncertainties and errors. As a result, precision tuning is necessary to determine the optimal mixed‐precision configurations. Much effort is therefore spent on selecting appropriate variables while balancing execution time and numerical accuracy. Auto‐tuning (AT) is one of the technologies that can assist in alleviating this intensive task. In recent years, ppOpen‐AT, an AT language, introduced a directive for mixed‐precision tuning called “Blocks.” In this study, we investigated an AT strategy for the “Blocks” directive for multi‐region tuning of a program. The non‐hydrostatic icosahedral atmospheric model (NICAM), a global cloud‐resolving model, was used as a benchmark program to evaluate the effectiveness of the AT strategy. Experimental results indicated that when a single region of the program performed well in mixed‐precision computation, combining these regions resulted in better performance. When tested on the supercomputer “Flow” Type I (Fujitsu PRIMEHPC FX1000) and Type II (Fujitsu PRIMEHPC CX1000) subsystems, the mixed‐precision NICAM benchmark program tuned by the AT strategy achieved a speedup of nearly 1.31× on the Type I subsystem compared to the original double‐precision program, and a 1.12× speedup on the Type II subsystem.

DOI： 10.1002/cpe.8326

Web of Science

Scopus

Optimize Efficiency of Utilizing Systems by Dynamic Core Binding.

Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino

HPC Asia Workshops page： 77 - 82 2024

Publishing type：Research paper (international conference proceedings)

DOI： 10.1145/3636480.3637221

Scopus

Other Link： https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2024w.html#KawaiIHH24

Development Status of ABINIT-MP in 2023

MOCHIZUKI Yuji, NAKANO Tatsuya, SAKAKURA Kota, OKUWAKI Koji, DOI Hideo, KATO Toshihiro, TAKIZAWA Hiroyuki, NARUSE Akira, OHSHIMA Satoshi, HOSHINO Tetsuya, KATAGIRI Takahiro

Journal of Computer Chemistry, Japan Vol. 23 ( 1 ) page： 4 - 8 2024

Language：Japanese Publisher：Society of Computer Chemistry, Japan

In August 2023, we released the latest version of our ABINIT-MP program, Open Version 2 Revision 8. In this version, the most commonly used FMO-MP2 calculations are even faster than in the previous Revision 4. It is now also possible to calculate excitation and ionization energies for regions of interest. Improved interaction analysis is also available. In addition, we have started GPU-oriented modifications. In this preliminary report, we present the current status of ABINIT-MP.

DOI： 10.2477/jccj.2024-0001

CiNii Research

Performance Evaluation of CMOS Annealing with Support Vector Machine

Fukuhara R., Morishita M., Katagiri T., Kawai M., Nagai T., Hoshino T.

Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024 page： 548 - 555 2024

Publisher：Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024

In this study, support vector machine (SVM) performance was assessed using a quantum-inspired complementary metal-oxide semiconductor (CMOS) annealer. During performance evaluation, the accuracy rate in binary classification problems was the primary focus. SVM performance, when running on a CPU (classical computation) and quantum-inspired annealer, was comparatively analyzed. The performance outcome was evaluated using a CMOS annealing machine, and accuracy rates of 93.7%, 92.7%, and 97.6% were obtained for the linearly separable problem and nonlinearly separable problems 1 and 2, respectively. According to these results, a CMOS annealing machine can achieve an accuracy rate that closely rivals that of classical computation.

DOI： 10.1109/MCSoC64144.2024.00094

Scopus

Implementing Fast Modal Filtering of SCALE-DG

Ren, XZB; Kawai, Y; Tomita, H; Nishizawa, S; Katagiri, T; Kawai, M; Hoshino, T; Nagai, T

2024 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING WORKSHOPS, CLUSTER WORKSHOPS 2024 page： 150 - 151 2024

Publisher：Proceedings - 2024 IEEE International Conference on Cluster Computing Workshops, CLUSTER Workshops 2024

For future high-resolution atmospheric simulations, a dynamical core using the discontinuous Galerkin Method (DGM), called SCALE-DG [1], is being developed as an option for high-order fluid schemes in the SCALE library [2]. Compared to the traditional Finite Element Method (FEM), the DGM allows for discontinuities across element boundaries. When evaluating a first-order derivative operator, we use the values at the nodes of the element itself and at the boundaries shared with neighboring elements. This feature allows most computations to be performed independently for each element. Thus, we can take full advantage of data locality. Additionally, the DGM can achieve high-order accuracy by choosing high-order polynomial basis functions within each element. Therefore, DGM is suitable for high-resolution atmospheric simulations with high-order numerical accuracy, and we expect the computational performance to be highly desirable on modern computer architectures.

DOI： 10.1109/CLUSTERWORKSHOPS61563.2024.00033

Web of Science

Scopus

Adaptation of XAI to Auto-tuning for Numerical Libraries Open Access

Aoki S., Katagiri T., Ohshima S., Kawai M., Nagai T., Hoshino T.

Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024 page： 556 - 563 2024

Publisher：Proceedings - 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSoC 2024

The unregulated utilization of Artificial Intelligence (AI) outputs, potentially leading to various societal issues, has received considerable attention. While humans routinely validate information, manually inspecting the vast volumes of AI-generated results is impractical. Therefore, automation and visualization are imperative. In this context, Explainable AI (XAI) technology is gaining prominence, aiming to streamline AI model development and alleviate the burden of explaining AI outputs to users. Simultaneously, software Auto-Tuning (AT) technology has emerged for reducing the man-hours required for performance tuning in numerical calculations. AT is a potent tool for cost reduction during parameter optimization and high-performance programming for numerical computing. The synergy between AT mechanisms and AI technology is noteworthy, with AI finding extensive applications in AT. However, applying AI to AT mechanisms introduces challenges in AI model explainability. This study focuses on XAI for AI models when integrated into two different processes for practical numerical computations: performance parameter tuning of accuracy-guaranteed numerical calculations and sparse iterative algorithm.

DOI： 10.1109/MCSoC64144.2024.00095

Scopus

Auto-tuning Mixed-precision Computation by Specifying Multiple Regions

Xuanzhengbo Ren, Masatoshi Kawai, Tetsuya Hoshino, Takahiro Katagiri, Toru Nagai

2023 Eleventh International Symposium on Computing and Networking (CANDAR) page： 175 - 181 2023.11

Publishing type：Research paper (international conference proceedings) Publisher：IEEE

DOI： 10.1109/candar60563.2023.00031

Scopus

4D Reconstruction of PET Using GPU Supercomputer

OHSHIMA Satoshi, YUASA Yoshinao, MATSUMURA Kaito, YOKOTA Tatsuya, HONTANI Hidekata, SAKATA Muneyuki, KIMURA Yuichi, KATAGIRI Takahiro, NAGAI Toru, HANAWA Toshihiro, HOSHINO Tetsuya

Medical Imaging Technology Vol. 41 ( 4-5 ) page： 150 - 156 2023.11

Language：Japanese Publisher：The Japanese Society of Medical Imaging Technology

With the development of medical imaging technology, various techniques have been developed and used to visually understand the inside of the living body. However, these technologies can directly produce only images and videos, and diagnosis is still performed manually by people such as physicians. There are great expectations for software that can reduce such labor, and an increasing number of technologies are already being used in the field of medicine, but their scope is limited because they require knowledge and skills in both medicine (medical imaging) and computing technology. Therefore, in this study, researchers in the medical imaging and high-performance computing fields are collaborating to accelerate and scale up the image reconstruction of PET. This paper describes the details of this effort and the results obtained so far.

DOI： 10.11409/mit.41.150

CiNii Research

Implementation of Radio Wave Propagation using RT Cores and Consideration of Programming Models.

Shinya Hashinoki, Satoshi Ohshima, Takahiro Katagiri, Toru Nagai, Tetsuya Hoshino

IPDPS Workshops page： 673 - 681 2023

Publishing type：Research paper (international conference proceedings)

DOI： 10.1109/IPDPSW59300.2023.00115

Web of Science

Scopus

Other Link： https://dblp.uni-trier.de/db/conf/ipps/ipdps2023w.html#HashinokiOKNH23

Large-scale earthquake sequence simulations on 3D nonplanar faults using the boundary element method accelerated by lattice H-matrices

So Ozawa, Akihiro Ida, Tetsuya Hoshino, Ryosuke Ando

Geophysical Journal International 2022.10

Publishing type：Research paper (scientific journal) Publisher：Oxford University Press (OUP)

Summary

Large-scale earthquake sequence simulations using the boundary element method (BEM) incur extreme computational costs through multiplying a dense matrix with a slip rate vector. Hierarchical matrices (H-matrices) have often been used to accelerate this multiplication. However, the complexity of the structures of the H-matrices and the communication costs between processors limit their scalability, and they therefore cannot be used efficiently in distributed memory computer systems. Lattice H-matrices have recently been proposed as a tool to improve the parallel scalability of H-matrices. In this study, we developed a method for earthquake sequence simulations applicable to 3D nonplanar faults with lattice H-matrices. We present a simulation example and verify the mesh convergence of our method for a 3D nonplanar thrust fault using rectangular and triangular discretizations. We also performed performance and scalability analyses of our code. Our simulations, using over $10^5$ degrees of freedom, demonstrated a parallel acceleration beyond $10^4$ MPI processors and a > 10-fold acceleration over the best performance when the normal H-matrices are used. Using this code, we can perform unprecedented large-scale earthquake sequence simulations on geometrically complex faults with supercomputers. The software is open source and freely available.

DOI： 10.1093/gji/ggac386

arXiv

Optimizations of H-matrix-vector Multiplication for Modern Multi-core Processors.

Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa

CLUSTER page： 462 - 472 2022

Publishing type：Research paper (international conference proceedings) Publisher：IEEE

DOI： 10.1109/CLUSTER51413.2022.00056

Other Link： https://dblp.uni-trier.de/db/conf/cluster/cluster2022.html#HoshinoIH22

Evaluation of GPU Offloading Methods Using the Fortran Standard do concurrent

星野哲也, 塙敏博, 大島聡史

情報処理学会研究報告(Web) Vol. 2022-HPC-183 page： 1 - 8 2022

CiNii Research

Performance Evaluation of Hierarchical Matrix Operations on A64FX

星野哲也, 伊田明弘, 塙敏博

情報処理学会研究報告(Web) Vol. 2021-HPC-180 page： 1 - 8 2021

CiNii Research

Large-scale earthquake sequence simulations of 3D geometrically complex faults using the boundary element method accelerated by lattice H-matrices on distributed memory computer systems

伊田明弘, 星野哲也

arXiv preprint Vol. - page： 1 - 26 2021

CiNii Research

Implementation and Performance Evaluation of Temporal Blocking on A64FX

星野哲也, 塙敏博

情報処理学会研究報告ハイパフォーマンスコンピューティング Vol. 2021-HPC-178 page： 1 - 8 2021

CiNii Research

Preliminary development of training environment for deep learning on supercomputer system Reviewed

Y. Nomura, I. Sato, T. Hanawa, S. Hanaoka, T. Nakao, T. Takenaga, D. Sato, T. Hoshino, Y. Sekiya, S. Ohshima, N. Hayashi, O. Abe

International Journal of Computer Assisted Radiology and Surgery Vol. 13 ( Issue 1 supplement ) page： S105 - S106 2018.6

Publishing type：Research paper (international conference proceedings)

DOI： 10.1007/s11548-018-1766-y

Optimization of generation process for sparse coefficient matrices in FEM on multicore/manycore architectures

中島研吾, 星野哲也, 成瀬彰, 塙敏博, 三木洋平

情報処理学会研究報告(Web) Vol. 2018 ( HPC-163 ) No. 28 page： 1 - 8 2018.2

Language：Japanese

J-GLOBAL

Design of parallel BEM analyses framework for SIMD processors Reviewed

Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, Kengo Nakajima

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) Vol. 10860 page： 601 - 613 2018

Language：English Publishing type：Research paper (scientific journal) Publisher：Springer Verlag

Parallel Boundary Element Method (BEM) analyses are typically conducted using a purpose-built software framework called BEM-BB. This framework requires a user-defined function program that calculates the i-th row and the j-th column of the coefficient matrix arising from the convolution integral term in the fundamental BEM equation. Owing to this feature, the framework can encapsulate MPI and OpenMP hybrid parallelization with H-matrix approximation. Therefore, users can focus on implementing a fundamental solution or a Green’s function, which is the most important element in BEM and depends on the targeted physical phenomenon, as a user-defined function. However, the framework does not consider single instruction multiple data (SIMD) vectorization, which is important for high-performance computing and is supported by the majority of existing processors. Performing SIMD vectorization of a user-defined function is difficult because SIMD exploits instruction-level parallelization and is closely associated with the user-defined function. In this paper, a conceptual framework for enhancing SIMD vectorization is proposed. The proposed framework is evaluated using two BEM problems, namely, static electric field analysis with a perfect conductor and static electric field analysis with a dielectric, on Intel Broadwell (BDW) processor and Intel Xeon Phi Knights Landing (KNL) processor. It offers good vectorization performance with limited SIMD knowledge, as can be verified from the numerical results obtained herein. Specifically, in perfect conductor analyses conducted using the H-matrix, the framework achieved performance improvements of 2.22x and 4.34x compared to the original BEM-BB framework for the BDW processor and KNL, respectively.

DOI： 10.1007/978-3-319-93698-7_46

Scopus

Load-Balancing-Aware Parallel Algorithms of H-Matrices with Adaptive Cross Approximation for GPUs. Reviewed

Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, Kengo Nakajima

IEEE International Conference on Cluster Computing, CLUSTER 2018, Belfast, UK, September 10-13, 2018 page： 35 - 45 2018

Publishing type：Research paper (international conference proceedings) Publisher：IEEE Computer Society

DOI： 10.1109/CLUSTER.2018.00016

Preliminary Development of a Deep Learning Training Environment on a Supercomputer

野村行弘, 佐藤一誠, 塙敏博, 花岡昇平, 中尾貴祐, 竹永智美, 佐藤大介, 星野哲也, 関谷勇司, 大島聡史, 林直人, 阿部修

電子情報通信学会技術研究報告 Vol. 117 ( 281(MI2017 47-62) ) page： 1‐2 2017.10

Language：Japanese

J-GLOBAL

Pascal vs KNL: Performance Evaluation with ICCG Solver Reviewed

Tetsuya Hoshino, Satoshi Ohshima, Toshihiro Hanawa, Kengo Nakajima, Akihiro Ida

HPC in Asia Workshop Poster Session, ISC High Performance 2017 2017.6

Language：English Publishing type：Research paper (international conference proceedings)

Performance Evaluation of an ICCG Solver Using OpenACC on Pascal GPUs

星野哲也, 大島聡史, 塙敏博, 中島研吾, 伊田明弘

情報処理学会研究報告(Web) Vol. 2017 ( HPC-158 ) No. 18 page： 1 - 9 2017.3

Language：Japanese Publishing type：Research paper (scientific journal)

J-GLOBAL

A Directive-based Data Layout Abstraction for Performance Portability of OpenACC Applications Reviewed

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

PROCEEDINGS OF 2016 IEEE 18TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS; IEEE 14TH INTERNATIONAL CONFERENCE ON SMART CITY; IEEE 2ND INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS) page： 1147 - 1154 2016

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

Directive-based programming interfaces such as OpenACC and OpenMP are becoming more prevalent in application development targeting accelerators, in particular when porting existing CPU-only code. Unlike vendor-specific alternatives such as CUDA, they are designed to be portable across different accelerators, and therefore once necessary directives are added to an existing CPU-only code, it can be executed on different accelerator architectures depending on the availability of supporting compilers. However, it does not automatically mean that such code runs efficiently on different architectures, and in fact, architecture-specific coding such as choosing optimal data layouts is almost mandatory for optimal performance, imposing a significant burden if implemented manually. Towards realizing performance portability in accelerator programming, we propose a set of extended directives that allow the programmer to optimize data layouts for a given accelerator without modifying original program code. Unlike the manual approach, the code change is confined in the directives with the original code kept as it is. This paper evaluates the effectiveness of our proposed extensions in the OpenACC standard by extending UPACS and CCS-QCD OpenACC applications. A prototype source-to-source translator for the extensions achieves 123% and 120% of the baseline performance, respectively, which are comparable to manually tuned versions.

DOI： 10.1109/HPCC-SmartCity-DSS.2016.34

Web of Science

An OpenACC extension for data layout transformation Reviewed

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

Proceedings of WACCPD 2014: 1st Workshop on Accelerator Programming Using Directives - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis page： 12 - 18 2015.4

Language：English Publishing type：Research paper (international conference proceedings) Publisher：Institute of Electrical and Electronics Engineers Inc.

OpenACC is gaining momentum as an implicit and portable interface in porting legacy CPU-based applications to heterogeneous, highly parallel computational environments involving many-core accelerators such as GPUs and Intel Xeon Phi. OpenACC provides a set of loop directives similar to OpenMP for the parallelization and also to manage data movement, attaining functional portability across different heterogeneous devices; however, the performance portability of OpenACC is said to be insufficient due to the characteristics of different target devices, especially those regarding memory layouts, as automated adaptation by the compilers is currently difficult. We are currently working to propose a set of directives to allow compilers to have better semantic information for adaptation; here, we particularly focus on data layout such as Structure of Arrays, an advantageous data structure for GPUs, as opposed to Array of Structures, which exhibits good performance on CPUs. We propose a directive extension to OpenACC that allows the users to flexibly specify optimal layouts, even if the data structures are nested. Performance results show that we gain as much as 96% in performance for CPUs and 165% for GPUs compared to programs without such directives, essentially attaining both functional and performance portability in OpenACC.

DOI： 10.1109/WACCPD.2014.12

Scopus

CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application Reviewed

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka, Ryoji Takaki

PROCEEDINGS OF THE 2013 13TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID 2013) page： 136 - 143 2013

Language：English Publishing type：Research paper (international conference proceedings) Publisher：IEEE

OpenACC is a new accelerator programming interface that provides a set of OpenMP-like loop directives for the programming of accelerators in an implicit and portable way. It allows the programmer to express the offloading of data and computations to accelerators, such that the porting process for legacy CPU-based applications can be significantly simplified. This paper focuses on the performance aspects of OpenACC using two microbenchmarks and one real-world computational fluid dynamics application. Both evaluations show that in general OpenACC performance is approximately 50% lower than CUDA. However, for some applications it can reach up to 98% with careful manual optimizations. The results also indicate several limitations of the OpenACC specification that hamper full use of the GPU hardware resources, resulting in a significant performance gap when compared to a highly tuned CUDA code. The lack of a programming interface for the shared memory in particular results in as much as three times lower performance.

DOI： 10.1109/CCGrid.2013.12

Web of Science


MISC 38

Current Status and Future of the ABINIT-MP Program Invited Reviewed

Yuji MOCHIZUKI, Tatsuya NAKANO, Kota SAKAKURA, Hideo DOI, Koji OKUWAKI, Toshihiro KATO, Hiroyuki TAKIZAWA, Satoshi OHSHIMA, Tetsuya HOSHINO, Takahiro KATAGIRI

Journal of Computer Chemistry, Japan Vol. 23 ( 4 ) page： 85 - 97 2024.12

Language：Japanese Publishing type：Article, review, commentary, editorial, etc. (scientific journal) Publisher：Society of Computer Chemistry Japan

The fragment molecular orbital (FMO) program ABINIT-MP has a quarter-century history, and related research and development of the Open Version 2 series is currently underway. This paper first summarizes the current status of the latest Revision 8 (released in August 2023). It then describes future improvements and enhancements, including GPU support. The connection with coarse-grained simulation (dissipative particle dynamics) and the possibility of cooperation with quantum computation are also touched upon.

DOI： 10.2477/jccj.2024-0022

Web of Science

CiNii Research

Development Status of ABINIT-MP in 2023 Invited Reviewed

Yuji MOCHIZUKI, Tatsuya NAKANO, Kota SAKAKURA, Koji OKUWAKI, Hideo DOI, Toshihiro KATO, Hiroyuki TAKIZAWA, Akira NARUSE, Satoshi OHSHIMA, Tetsuya HOSHINO, Takahiro KATAGIRI

J. Comp. Chem. Jpn. Vol. 23 ( 1 ) page： 4 - 8 2024.3

Language：Japanese Publishing type：Rapid communication, short report, research note, etc. (scientific journal) Publisher：Society of Computer Chemistry, Japan

In August 2023, we released the latest version of our ABINIT-MP program, Open Version 2 Revision 8. In this version, the most commonly used FMO-MP2 calculations are even faster than in the previous Revision 4. It is now also possible to calculate excitation and ionization energies for regions of interest. Improved interaction analysis is also available. In addition, we have started GPU-oriented modifications. In this preliminary report, we present the current status of ABINIT-MP.

DOI： 10.2477/jccj.2024-0001

CiNii Research

Introduction to Parallel Programming on CPU and GPU (4)

星野哲也, 中島研吾

シミュレーション Vol. 43 ( 1 ) 2024

J-GLOBAL

Multi-GPU Parallelization of Earthquake Simulation Using Lattice H-Matrices

百武尚輝, 星野哲也, 小澤創, 伊田明弘, 安藤亮輔, 河合直聡, 永井亨, 片桐孝洋

情報処理学会研究報告(Web) Vol. 2024 ( HPC-195 ) 2024

J-GLOBAL

Accelerating Allreduce Between Heterogeneous Systems with WaitIO+MPI Hybrid

植野貴大, 住元真司, 中島研吾, 片桐孝洋, 大島聡史, 星野哲也, 河合直聡, 永井亨

情報処理学会研究報告(Web) Vol. 2024 ( HPC-196 ) 2024

J-GLOBAL

Performance Evaluation of Sapphire Rapids HBM Using HPC Kernel Benchmarks

星野哲也, 河合直聡, 伊田明弘, 塙敏博, 片桐孝洋

情報処理学会研究報告(Web) Vol. 2024 ( HPC-193 ) 2024

J-GLOBAL

Introduction to Parallel Programming on CPU and GPU (1)

中島研吾, 星野哲也

シミュレーション = Journal of the Japan Society for Simulation Technology / 日本シミュレーション学会編 Vol. 42 ( 2 ) page： 103 - 109 2023.6

Language：Japanese Publisher：小宮山印刷工業

CiNii Books

An Adaptation of XAI to Auto-tuning for Numerical Calculation Library

青木将太, 片桐孝洋, 大島聡史, 永井亨, 星野哲也

計算工学講演会論文集 = Proceedings of the Conference on Computational Engineering and Science / 日本計算工学会編 Vol. 28 page： 904 - 907 2023.5

Language：Japanese Publisher：日本計算工学会

Introduction to Parallel Programming on CPU and GPU (2)

中島研吾, 星野哲也

シミュレーション Vol. 42 ( 3 ) 2023

J-GLOBAL

Introduction to Parallel Programming on CPU and GPU (3)

星野哲也, 中島研吾

シミュレーション Vol. 42 ( 4 ) 2023

J-GLOBAL

Evaluation of GPU Offloading Methods Using the Fortran Standard do concurrent

星野哲也, 塙敏博

情報処理学会研究報告(Web) Vol. 2022-HPC-183 page： 1 - 8 2022

Implementation and Performance Evaluation of a Direct N-body Code Supporting Both AMD and NVIDIA GPUs

三木洋平, 塙敏博, 河合直聡, 星野哲也

日本天文学会年会講演予稿集 Vol. 2022 2022

J-GLOBAL

Evaluation of the Effectiveness of GPU Offloading Using OpenMP

河合直聡, 三木洋平, 星野哲也, 塙敏博, 中島研吾

情報処理学会研究報告(Web) Vol. 2022 ( HPC-183 ) 2022

J-GLOBAL

Implementation and Performance Evaluation of Temporal Blocking on A64FX

星野哲也, 塙敏博

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2021-HPC-178 ( 17 ) page： 1 - 8 2021.3

Authorship：Lead author

Overview of Wisteria/BDEC-01, a Supercomputer System Integrating "Computation, Data, and Learning"

中島研吾, 塙敏博, 下川辺隆史, 伊田明弘, 芝隼人, 三木洋平, 星野哲也, 有間英志, 河合直聡, 坂本龍一, 近藤正章, 岩下武史, 八代尚, 長尾大道, 松葉浩也, 荻田武史, 片桐孝洋, 古村孝志, 鶴岡弘, 市村強, 藤田航平

情報処理学会研究報告(Web) Vol. 2021 ( HPC-179 ) 2021

J-GLOBAL

Performance Evaluation of Wisteria/BDEC-01, a Supercomputer System Integrating "Computation, Data, and Learning"

塙敏博, 中島研吾, 下川辺隆史, 芝隼人, 三木洋平, 星野哲也, 河合直聡, 似鳥啓吾, 今村俊幸, 工藤周平, 中尾昌広

情報処理学会研究報告(Web) Vol. 2021 ( HPC-180 ) 2021

J-GLOBAL

Performance Evaluation of Hierarchical Matrix Operations on A64FX

星野哲也, 伊田明弘, 塙敏博

情報処理学会研究報告(Web) Vol. 2021 ( HPC-180 ) page： 1 - 8 2021

J-GLOBAL

Large-scale earthquake sequence simulations of 3D geometrically complex faults using the boundary element method accelerated by lattice H-matrices on distributed memory computer systems

伊田明弘, 星野哲也

arXiv preprint Vol. - page： 1 - 26 2021

An Optimization of H-matrix-vector Multiplication by Using Un-used Cores

Tetsuya Hoshino, Toshihiro Hanawa, Akihiro Ida

HPC Asia 2020 2020.1

Numerical Linear Algebra Based on Lattice H-Matrices

Akihiro Ida, Ichitaro Yamazaki, Rio Yokota, Satoshi Ohshima, Tasuku Hiraishi, Takeshi Iwashita, Tetsuya Hoshino, Toshihiro Hanawa

HPC Asia 2020 2020.1

Performance Evaluation Toward Accelerating the Hierarchical Matrix Method on Many-core Clusters

星野哲也, 伊田明弘

計算工学講演会論文集(CD-ROM) Vol. 24 Paper No. C-07-02 2019.6

Language：Japanese

J-GLOBAL

High-level Abstractions for High Performance Computing on Many-core Processors

Hoshino Tetsuya

2018.9

Language：English

Hierarchical Matrix Computation on FPGAs Using OpenCL

塙敏博, 伊田明弘, 星野哲也

情報処理学会研究報告(Web) Vol. 2018 ( HPC-163 ) No. 26 page： 1 - 8 2018.2

Language：Japanese

J-GLOBAL

Applying Hierarchical Matrix Computation to FPGAs

塙敏博, 伊田明弘, 星野哲也

情報処理学会研究報告(Web) Vol. 2017 ( HPC-161 ) No. 10 page： 1 - 10 2017.9

Language：Japanese

J-GLOBAL

Many-core Optimization of Applications Using the Hierarchical Matrix Library HACApK

星野哲也, 伊田明弘, 塙敏博, 中島研吾

情報処理学会研究報告(Web) Vol. 2017 ( HPC-160 ) No. 15 page： 1 - 10 2017.7

Language：Japanese

J-GLOBAL

Performance Evaluation of the GPU-equipped Supercomputer Reedbush-H

塙敏博, 星野哲也, 中島研吾, 大島聡史, 伊田明弘

情報処理学会研究報告(Web) Vol. 2017 ( HPC-159 ) No. 9 page： 1 - 6 2017.4

Language：Japanese

J-GLOBAL

Performance Optimization of OpenMP and MPI in a Xeon Phi + Omni-Path Environment

塙敏博, 星野哲也, 中島研吾, 大島聡史, 伊田明弘

情報処理学会研究報告(Web) Vol. 2017 ( HPC-158 ) No. 21 page： 1 - 8 2017.3

Language：Japanese

J-GLOBAL

Optimization of ICCG Solver for Intel Xeon Phi

中島研吾, 大島聡史, 塙敏博, 星野哲也, 伊田明弘

情報処理学会研究報告(Web) Vol. 2016 ( HPC-157 ) No. 16 page： 1 - 8 2016.12

Language：Japanese

J-GLOBAL

Performance Evaluation of Pipelined CG Method

塙敏博, 中島研吾, 大島聡史, 星野哲也, 伊田明弘

情報処理学会研究報告(Web) Vol. 2016 ( HPC-157 ) No. 6 page： 1 - 9 2016.12

Language：Japanese

J-GLOBAL

Performance Evaluation of Reedbush-U, a Supercomputer System Integrating Data Analysis and Simulation

塙敏博, 中島研吾, 大島聡史, 伊田明弘, 星野哲也, 田浦健次朗

情報処理学会研究報告(Web) Vol. 2016 ( HPC-156 ) No. 10 page： 1 - 10 2016.9

Language：Japanese

J-GLOBAL

Accelerating OpenACC Applications with Data Layout Optimization Directives

星野哲也

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2016-HPC-155 page： 1 - 8 2016

Accelerating a Compressible Fluid Program with OpenACC

星野哲也

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2016-HPC-153 page： 1 - 10 2016

Data Layout Optimization via OpenACC Directive Extensions

星野哲也, 丸山直也, 松岡聡

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2014 ( 45 ) page： 1 - 8 2014.7

Language：Japanese Publisher：一般社団法人情報処理学会

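As a way to port existing programs to computing environments equipped with accelerators such as GPUs, which have been increasing in recent years, high-level directive-based programming models such as OpenACC are attracting attention as an alternative to low-level programming models typified by CUDA and OpenCL. An advantage of such directive-based programming models is high functional portability across devices, because a program can be ported while the original code is preserved. However, current high-level programming models such as OpenACC cannot handle the difference between the data layouts favored by scalar processors and those favored by many-core accelerators, so performance portability across devices with different characteristics remains a problem. In this study, we prototyped extended OpenACC directives that abstract the data layout and improve performance portability across different devices, used a translator to change the data layout of the Himeno benchmark, and evaluated it on a multi-core CPU, an Intel Xeon Phi, and a K20X GPU. The results confirmed performance improvements of 27% on the Intel Xeon Phi and 24% on the K20X GPU compared with the original data layout.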

CiNii Books

An OpenACC Directive Extension That Enables Selecting the Optimal Data Layout for Each of CPU and GPU

星野哲也, 丸山直也, 松岡聡

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2014 ( 5 ) page： 1 - 5 2014.2

Language：Japanese Publisher：一般社団法人情報処理学会

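As a way to port existing programs to computing environments equipped with accelerators such as GPUs, which have been increasing in recent years, one can use high-level directive-based programming models such as OpenACC instead of low-level programming models typified by CUDA and OpenCL. An advantage of such directive-based programming models is high portability across devices, because a program can be ported without breaking the original code. However, current programming models such as OpenACC cannot handle differences such as the data layouts favored by scalar processors and by many-core accelerators, so performance portability across devices with different characteristics remains a problem. In this study, we prototyped and evaluated extended OpenACC directives that abstract the data layout and improve performance portability across different devices.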

CiNii Books

Performance Evaluation of the Directive-based Programming Language OpenACC

星野哲也, 丸山直也, 松岡聡

ハイパフォーマンスコンピューティングと計算科学シンポジウム論文集 Vol. 2013 page： 91 - 91 2013.1

Language：Japanese

Evaluation of Portability for a Real-world CFD Application with CUDA and OpenACC

星野哲也, 丸山直也, 松岡聡

研究報告ハイパフォーマンスコンピューティング（HPC） Vol. 2012 ( 42 ) page： 1 - 9 2012.7

Language：Japanese

Computational fluid dynamics (CFD) applications, used in simulations for earthquake and weather prediction and for aircraft and high-rise building design, have achieved remarkable results on GPU-based supercomputers, which have become common in recent years. However, obtaining high performance with GPU programming is said to be difficult, which makes porting legacy programs to GPU environments a problem. In this paper, we manually ported UPACS, a large-scale fluid application in real-world use, to CUDA and evaluated it in terms of performance and porting cost. We also conducted a preliminary evaluation of OpenACC, which is expected to solve the program portability problem. We present the results of these evaluations and discuss the issues that should be resolved in the future.

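Evaluation of GPU Acceleration Techniques for a Large-Scale Fluid Application

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

Proceedings of the Symposium on Advanced Computing Systems and Infrastructures (SACSIS) Vol. 2012 page： 73 - 74 2012.5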

Language：Japanese
OpenACC Programming

Naoya Maruyama, Tetsuya Hoshino

Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers Vol. 66 ( 10 ) page： 817 - 822 2012

Language：English Publisher：The Institute of Image Information and Television Engineers

DOI： 10.3169/itej.66.817

KAKENHI (Grants-in-Aid for Scientific Research) 5

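Expanding the Applicability of Low-Rank Structured Matrix Methods and Exploiting Diverse Computing Architectures

Grant number：24K02949 2024.4 - 2027.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B)

Ida Akihiro, Yokota Rio, Hanawa Toshihiro, Iwashita Takeshi, Ohshima Satoshi, Hoshino Tetsuya, Hiraishi Tasuku, Kawai Masatoshi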

Authorship：Coinvestigator(s)

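This research enhances the functionality of a library for low-rank structured matrix methods. In scientific and engineering computing, numerical linear algebra libraries built on dense matrix operations are widely used. We expand the applicability of low-rank structured matrix methods so that dense matrix operations can be replaced with low-rank structured matrix operations, and we develop new numerical algorithms based on low-rank structured matrices. Algorithm development targets cluster computers composed of the latest computer architectures, such as GPUs and FPGAs, and includes optimization of the implementation. We study implementation methods that assign the most suitable architecture to each kind of low-rank structured matrix operation and that also use mixed-precision arithmetic and dynamic load balancing to extract the maximum performance from the hardware.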
A Study on Acceleration by Temporal Blocking for Real-world Applications

Grant number：22K17898 2022.4 - 2024.3

Grants-in-Aid for Scientific Research Grant-in-Aid for Early-Career Scientists

Hoshino Tetsuya

Authorship：Principal investigator

Grant amount：¥1430000 （ Direct Cost: ¥1100000 、 Indirect Cost：¥330000 ）

The computation pattern over a grid discretized in time and space that arises when solving differential equations numerically is called a stencil computation, and it is an important kernel that appears frequently in various fluid simulations. Acceleration of stencil computations has been studied extensively. Temporal blocking is one such technique, but it has rarely been applied to real applications because it requires very complicated programming. Furthermore, since the performance of temporal blocking depends heavily on the performance parameters of the processor that executes it, optimizing the blocking manually is not realistic. Therefore, in this study, the performance modeling required for auto-tuning of temporal blocking was performed using state-of-the-art CPUs.
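A minimal sketch of temporal blocking for a 1D three-point stencil follows, under stated assumptions. It uses overlapped tiling with redundant halo computation, which is one common variant and not necessarily the scheme studied in this project. The parameters `tile` and `bt` stand in for the kind of tunable values an auto-tuner would choose, and all function names are hypothetical.

```python
def step(u):
    """One Jacobi sweep of a 3-point average; both end points stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new

def naive_steps(u, steps):
    """Reference version: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u

def blocked_steps(u, steps, tile, bt):
    """Temporally blocked version (overlapped tiling, assumed variant).

    Each spatial tile is loaded together with a halo of k cells per side and
    advanced k time steps locally before the next tile is touched. Values in
    the halo become stale by one cell per step, so after k steps exactly the
    tile interior is still valid and is written back.
    """
    n = len(u)
    u = u[:]
    done = 0
    while done < steps:
        k = min(bt, steps - done)          # time steps fused in this block
        out = u[:]
        for s in range(0, n, tile):
            e = min(s + tile, n)
            lo = max(s - k, 0)             # halo clipped at the domain edge
            hi = min(e + k, n)
            local = u[lo:hi]
            for _ in range(k):
                local = step(local)
            out[s:e] = local[s - lo:e - lo]
        u = out
        done += k
    return u

u0 = [float((i * i) % 7) for i in range(40)]
same = blocked_steps(u0, 10, tile=8, bt=3) == naive_steps(u0, 10)
```

The blocked version performs extra arithmetic in the halos in exchange for reusing each tile across `k` time steps while it is resident in cache, which is why the best `tile` and `bt` depend on the processor.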
Construction of numerical linear algebra based on lattice H-matrices and its high-performance implementation on modern architectures

Grant number：21H03447 2021.4 - 2024.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (B)

Ida Akihiro

Authorship：Coinvestigator(s)

We conducted research and development aimed at constructing a numerical linear algebra system based on the lattice H-matrix. We proposed an algorithm to calculate all eigenvalues for the BLR matrix, a special case of the lattice H-matrix. Research on high-performance implementation of the lattice H-matrix method was carried out. By adding efficient work-stealing functions to task parallelization languages, we successfully improved the computational performance of H-matrix partitioning and low-rank structured matrix generation on distributed memory parallel computers. Furthermore, we developed an H-matrix-vector multiplication computation method that achieves over 85% of the theoretical limit performance on computing nodes using various latest CPU architectures. Additionally, we developed a method for fast QR decomposition of BLR matrices using the MIG feature of the latest GPUs.
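As a sketch of the basic idea behind low-rank structured formats, the following compares a dense block matrix-vector product with the same product computed from a low-rank factorization A = U Vᵀ. It is not the project's implementation and does not model the lattice H-matrix partitioning itself; the function names and the tiny rank-1 block are hypothetical.

```python
def dense_matvec(a, x):
    """y = A x for a dense m x n block: m * n multiply-adds."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def lowrank_matvec(u, v, x):
    """y = U (V^T x) with U of size m x k and V of size n x k.

    Costs k * (m + n) multiply-adds instead of m * n, which pays off when
    the rank k is much smaller than the block dimensions.
    """
    k = len(u[0])
    # t = V^T x : k inner products of length n
    t = [sum(v[j][r] * x[j] for j in range(len(x))) for r in range(k)]
    # y = U t : m inner products of length k
    return [sum(u[i][r] * t[r] for r in range(k)) for i in range(len(u))]

# A rank-1 block of size 3 x 4, stored as its factors.
u = [[1], [2], [3]]
v = [[1], [0], [2], [1]]
x = [1, 2, 3, 4]

# The same block expanded to dense form, for comparison only.
a = [[ui[0] * vj[0] for vj in v] for ui in u]

y_dense = dense_matvec(a, x)
y_lowrank = lowrank_matvec(u, v, x)
```

Both products give the same vector here because the block is exactly rank 1; in practice the factors are approximations and the rank is chosen to meet an accuracy target.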
High-performance computing and data analysis support leveraging unused cores

Grant number：20H00580 2020.4 - 2023.3

Grants-in-Aid for Scientific Research Grant-in-Aid for Scientific Research (A)

Hanawa Toshihiro

Authorship：Coinvestigator(s)

This research aims to improve overall system performance and to realize additional functions, such as power control and profiling, with low overhead. It does so by assigning "extra cores," which do not directly contribute to the performance of a high-performance computation, the role of supporting the main computation running on the CPU. We studied "UTHelper," a framework that realizes such support functions at the user level.
As a result, we realized profiling and changes to the degree of parallelism during execution without modifying the user program, in situ analysis using extra cores, load balancing through dynamic core allocation to speed up lattice H-matrix operations, inter-GPU communication using extra cores, and utilization of idle arithmetic units through time-space blocking.
Auto-tuning Framework Focusing on Application Data Structure for Many-core Processors

Grant number：16H06679 2016.8 - 2018.3

Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research Grant-in-Aid for Research Activity Start-up

HOSHINO Tetsuya

The number of computing environments that use many-core processors is increasing. To obtain high performance from many-core processors, it is important to use the Vector Processing Unit (VPU) efficiently. However, efficient use of the VPU requires knowledge of the hardware and the compiler, and it often requires changes to data structures as well.
In this research, we propose a set of compiler directives that abstract the data layout, and we implement a translator for the proposed directives. Furthermore, we propose a framework design that promotes efficient vectorization, and we implement a BEM-BB framework based on that design.

Teaching Experience (On-campus) 3

Advanced Lectures on Large-scale Parallel Computing

2023
High-Performance Computing B

2023
Programming 2

2023

Social Contribution 1

Recent Trends in GPU Programming for Fortran (JAXA internal training course)

Role(s)：Lecturer

2023.12

Academic Activities 2

HPC Asia 2024 Local Arrangement Chair

Role(s)：Planning, management, etc.

2024.1

Type：Academic society, research group, etc.
xSIG 2023 Program Committee Member

Role(s)：Peer review

2023.8

Type：Peer review
