Updated on 2026/02/27

HOSHINO Tetsuya
Organization
Information Technology Center Associate Professor
Graduate School
Graduate School of Informatics
Title
Associate Professor

Research Areas 1

  1. Informatics / High-performance computing

Research History 2

  1. Nagoya University   Information Technology Center   Associate Professor

    2023.1

  2. The University of Tokyo   Information Technology Center   Assistant Professor

    2016.1 - 2022.12

Awards 1

  1. Tetsuya Hoshino, Shun-ichiro Hayashi, Daichi Mukunoki, Takahiro Katagiri, Toshihiro Hanawa, “Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study.” SCA/HPC Asia Workshops, pp. 353-360, 2026. https://doi.org/10.1145/3784828.3785335 [Best Paper Award]

    2026.1  

    Award type:Award from international society, conference, symposium, etc. 

 

Papers 37

  1. Tensor-Core-Optimized Strategies for BLR × Tall-Skinny Matrix Multiplication in BEM.

    Akihiro Ida, Kazuya Goto, Rio Yokota, Tasuku Hiraishi, Toshihiro Hanawa, Takeshi Iwashita, Masatoshi Kawai, Satoshi Ohshima, Tetsuya Hoshino

    SCA/HPC Asia     page: 153 - 164   2026

    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3773656.3773678

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2026.html#IdaGYHHIKOH26

  2. Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study.

    Tetsuya Hoshino, Shun-ichiro Hayashi, Daichi Mukunoki, Takahiro Katagiri, Toshihiro Hanawa

    SCA/HPC Asia Workshops     page: 353 - 360   2026

    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3784828.3785335

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2026w.html#HoshinoHMKH26

  3. An Algorithm Portfolio Approach for Parameter Tuning in Coherent Ising Machines.

    Tatsuro Hanyu, Takahiro Katagiri, Daichi Mukunoki, Tetsuya Hoshino

    CANDARW     page: 142 - 148   2025

    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/CANDARW68385.2025.00032

    Other Link: https://dblp.uni-trier.de/db/conf/candar/candar2025w.html#HanyuKMH25

  4. Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX. Open Access

    Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Seiya Nishizawa, Takahiro Katagiri, Tetsuya Hoshino, Daichi Mukunoki, Masatoshi Kawai, Toru Nagai

    Proceedings of the 2025 International Conference on High Performance Computing in Asia-Pacific Region Workshops     page: 36 - 44   2025

    Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3703001.3724385

    Web of Science

    Scopus

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2025w.html#RenKTNKHMKN25

  5. Regarding the usage status of the Supercomputer “Flow”

    Yamada Kazunari, Mouri Akihiro, Hayashi Hidekazu, Katagiri Takahiro, Hoshino Tetsuya, Nagai Toru

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2024 ( 0 ) page: 99 - 107   2024.12

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2024.0_99

    CiNii Research

  6. Toward Services for HPC-Centric Quantum Computing

    Katagiri Takahiro, Takahashi Ichiro, Morishita Makoto, Hoshino Tetsuya, Kawai Masatoshi, Nagai Toru

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2024 ( 0 ) page: 143 - 149   2024.12

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2024.0_143

    CiNii Research

  7. Investigation of Azure CycleCloud Usage Environment and Considerations on Supercomputer Center-Cloud Collaboration

    Nagai Toru, Gojuki Shuichi, Kawai Masatoshi, Katagiri Takahiro, Hoshino Tetsuya

    Journal for Academic Computing and Networking   Vol. 28 ( 1 ) page: 114 - 124   2024.11

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/jacn.28.1_114

    CiNii Research

  8. Auto‐Tuning Mixed‐Precision Computation by Specifying Multiple Regions

    Xuanzhengbo Ren, Masatoshi Kawai, Tetsuya Hoshino, Takahiro Katagiri, Toru Nagai

    Concurrency and Computation: Practice and Experience   Vol. 37 ( 2 )   2024.11

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Wiley  

    ABSTRACT

    Mixed‐precision computation is a promising method for substantially improving high‐performance computing applications. However, using mixed‐precision data is a double‐edged sword. While it can improve computational performance, the reduction in precision introduces more uncertainties and errors. As a result, precision tuning is necessary to determine the optimal mixed‐precision configurations. Much effort is therefore spent on selecting appropriate variables while balancing execution time and numerical accuracy. Auto‐tuning (AT) is one of the technologies that can assist in alleviating this intensive task. In recent years, ppOpen‐AT, an AT language, introduced a directive for mixed‐precision tuning called “Blocks.” In this study, we investigated an AT strategy for the “Blocks” directive for multi‐region tuning of a program. The non‐hydrostatic icosahedral atmospheric model (NICAM), a global cloud‐resolving model, was used as a benchmark program to evaluate the effectiveness of the AT strategy. Experimental results indicated that when a single region of the program performed well in mixed‐precision computation, combining these regions resulted in better performance. When tested on the supercomputer “Flow” Type I (Fujitsu PRIMEHPC FX1000) and Type II (Fujitsu PRIMEHPC CX1000) subsystems, the mixed‐precision NICAM benchmark program tuned by the AT strategy achieved a speedup of nearly 1.31× on the Type I subsystem compared to the original double‐precision program, and a 1.12× speedup on the Type II subsystem.

    DOI: 10.1002/cpe.8326

    Web of Science

    Scopus

  9. Adaptation of XAI to Auto-tuning for Numerical Libraries Open Access

    Shota Aoki, Takahiro Katagiri, Satoshi Ohshima, Masatoshi Kawai, Toru Nagai, Tetsuya Hoshino

    17th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip(MCSoC)   Vol. abs/2405.10973   page: 556 - 563   2024.5

    Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Concerns have arisen regarding the unregulated utilization of artificial intelligence (AI) outputs, potentially leading to various societal issues. While humans routinely validate information, manually inspecting the vast volumes of AI-generated results is impractical. Therefore, automation and visualization are imperative. In this context, Explainable AI (XAI) technology is gaining prominence, aiming to streamline AI model development and alleviate the burden of explaining AI outputs to users. Simultaneously, software auto-tuning (AT) technology has emerged, aiming to reduce the man-hours required for performance tuning in numerical calculations. AT is a potent tool for cost reduction during parameter optimization and high-performance programming for numerical computing. The synergy between AT mechanisms and AI technology is noteworthy, with AI finding extensive applications in AT. However, applying AI to AT mechanisms introduces challenges in AI model explainability. This research focuses on XAI for AI models when integrated into two different processes for practical numerical computations: performance parameter tuning of accuracy-guaranteed numerical calculations and sparse iterative algorithms.

    DOI: 10.1109/MCSoC64144.2024.00095

    Scopus

    arXiv

    Other Link: http://arxiv.org/pdf/2405.10973v1

  10. Performance Evaluation of CMOS Annealing with Support Vector Machine

    Ryoga Fukuhara, Makoto Morishita, Takahiro Katagiri, Masatoshi Kawai, Toru Nagai, Tetsuya Hoshino

    17th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip(MCSoC)   Vol. abs/2404.15752   page: 548 - 555   2024.4

    Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    In this paper, support vector machine (SVM) performance was assessed utilizing a quantum-inspired complementary metal-oxide semiconductor (CMOS) annealer. The primary focus during performance evaluation was the accuracy rate in binary classification problems. A comparative analysis was conducted between SVM running on a CPU (classical computation) and executed on a quantum-inspired annealer. The performance outcome was evaluated using a CMOS annealing machine, thereby obtaining an accuracy rate of 93.7% for linearly separable problems, 92.7% for non-linearly separable problem 1, and 97.6% for non-linearly separable problem 2. These results reveal that a CMOS annealing machine can achieve an accuracy rate that closely rivals that of classical computation.

    DOI: 10.1109/MCSoC64144.2024.00094

    Scopus

    arXiv

    Other Link: http://arxiv.org/pdf/2404.15752v2

  11. Evaluation of Mixed-Precision Arithmetic in the Hierarchical Matrix Method

    星野 哲也, 伊田 明弘, 岩下 武史, 河合 直聡

    IPSJ 87th National Convention     page: 1 - 2   2024

  12. Multi-GPU Parallelization of Earthquake Simulations Using Lattice H-Matrices

    星野 哲也, 伊田 明弘, 河合 直聡

    IPSJ SIG Technical Reports   Vol. HPC-195   page: 1 - 11   2024

  13. Optimize Efficiency of Utilizing Systems by Dynamic Core Binding.

    Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino

    HPC Asia Workshops     page: 77 - 82   2024

    Publishing type:Research paper (international conference proceedings)   Publisher:ACM  

    DOI: 10.1145/3636480.3637221

    Scopus

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2024w.html#KawaiIHH24

  14. Development Status of ABINIT-MP in 2023

    MOCHIZUKI Yuji, NAKANO Tatsuya, SAKAKURA Kota, OKUWAKI Koji, DOI Hideo, KATO Toshihiro, TAKIZAWA Hiroyuki, NARUSE Akira, OHSHIMA Satoshi, HOSHINO Tetsuya, KATAGIRI Takahiro

    Journal of Computer Chemistry, Japan   Vol. 23 ( 1 ) page: 4 - 8   2024

    Language:Japanese   Publisher:Society of Computer Chemistry, Japan  

    In August 2023, we released the latest version of our ABINIT-MP program, Open Version 2 Revision 8. In this version, the most commonly used FMO-MP2 calculations are even faster than in the previous Revision 4. It is now also possible to calculate excitation and ionization energies for regions of interest. Improved interaction analysis is also available. In addition, we have started GPU-oriented modifications. In this preliminary report, we present the current status of ABINIT-MP.

    DOI: 10.2477/jccj.2024-0001

    CiNii Research

  15. Implementing Fast Modal Filtering of SCALE-DG.

    Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Seiya Nishizawa, Takahiro Katagiri, Masatoshi Kawai, Tetsuya Hoshino, Toru Nagai

    IEEE International Conference on Cluster Computing     page: 150 - 151   2024

    Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/CLUSTERWorkshops61563.2024.00033

    Web of Science

    Scopus

    Other Link: https://dblp.uni-trier.de/db/conf/cluster/clusterw2024.html#RenKTNKKHN24

  16. Performance Evaluation of CMOS Annealing with Support Vector Machine.

    Ryoga Fukuhara, Makoto Morishita, Takahiro Katagiri, Masatoshi Kawai, Toru Nagai, Tetsuya Hoshino

    CoRR   Vol. abs/2404.15752   2024

    Publishing type:Research paper (scientific journal)  

    DOI: 10.48550/arXiv.2404.15752

  17. Re-evaluation of power saving effect by spraying spring water with Supercomputer “Flow”

    Yamada Kazunari, Tajima Yoshinori, Takahashi Ichiro, Hayashi Hidekazu, Katagiri Takahiro, Ohshima Satoshi, Hoshino Tetsuya, Nagai Toru

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2023 ( 0 ) page: 67 - 74   2023.12

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2023.0_67

    CiNii Research

  18. Usage environment of Azure CycleCloud and benchmark test results on VMs

    Nagai Toru, Gojuki Shuichi, Kawai Masatoshi, Katagiri Takahiro, Hoshino Tetsuya

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2023 ( 0 ) page: 81 - 88   2023.12

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2023.0_81

    CiNii Research

  19. Auto-tuning Mixed-precision Computation by Specifying Multiple Regions

    Xuanzhengbo Ren, Masatoshi Kawai, Tetsuya Hoshino, Takahiro Katagiri, Toru Nagai

    2023 Eleventh International Symposium on Computing and Networking (CANDAR)     page: 175 - 181   2023.11

    Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    DOI: 10.1109/candar60563.2023.00031

    Scopus

  20. 4D Reconstruction of PET Using GPU Supercomputer

    OHSHIMA Satoshi, YUASA Yoshinao, MATSUMURA Kaito, YOKOTA Tatsuya, HONTANI Hidekata, SAKATA Muneyuki, KIMURA Yuichi, KATAGIRI Takahiro, NAGAI Toru, HANAWA Toshihiro, HOSHINO Tetsuya

    Medical Imaging Technology   Vol. 41 ( 4-5 ) page: 150 - 156   2023.11

    Language:Japanese   Publisher:The Japanese Society of Medical Imaging Technology  

    With the development of medical imaging technology, various techniques have been developed and used to visually understand the inside of the living body. However, these technologies can only directly obtain images and videos, and diagnosis is still performed by human hands, such as physicians. There are great expectations for software that can reduce such labor, and an increasing number of technologies are already being used in the field of medicine, but the target is limited because they require knowledge and skills in both medicine (medical imaging) and computing technology. Therefore, in this study, researchers in the medical imaging and high-performance computing fields are collaborating to accelerate and scale up the image reconstruction of PET. This paper describes the details of this effort and the results obtained so far.

    DOI: 10.11409/mit.41.150

    CiNii Research

  21. Implementation of Radio Wave Propagation using RT Cores and Consideration of Programming Models.

    Shinya Hashinoki, Satoshi Ohshima, Takahiro Katagiri, Toru Nagai, Tetsuya Hoshino

    IPDPS Workshops     page: 673 - 681   2023

    Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/IPDPSW59300.2023.00115

    Web of Science

    Scopus

    Other Link: https://dblp.uni-trier.de/db/conf/ipps/ipdps2023w.html#HashinokiOKNH23

  22. Large-scale earthquake sequence simulations on 3D nonplanar faults using the boundary element method accelerated by lattice H-matrices Open Access

    So Ozawa, Akihiro Ida, Tetsuya Hoshino, Ryosuke Ando

    Geophysical Journal International   Vol. 232 ( 3 ) page: 1471 - 1481   2022.10

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Oxford University Press (OUP)  

    Summary

    Large-scale earthquake sequence simulations using the boundary element method (BEM) incur extreme computational costs through multiplying a dense matrix with a slip rate vector. Hierarchical matrices (H-matrices) have often been used to accelerate this multiplication. However, the complexity of the structures of the H-matrices and the communication costs between processors limit their scalability, and they therefore cannot be used efficiently in distributed memory computer systems. Lattice H-matrices have recently been proposed as a tool to improve the parallel scalability of H-matrices. In this study, we developed a method for earthquake sequence simulations applicable to 3D nonplanar faults with lattice H-matrices. We present a simulation example and verify the mesh convergence of our method for a 3D nonplanar thrust fault using rectangular and triangular discretizations. We also performed performance and scalability analyses of our code. Our simulations, using over ${10^5}$ degrees of freedom, demonstrated a parallel acceleration beyond ${10^4}$ MPI processors and a >10-fold acceleration over the best performance when the normal H-matrices are used. Using this code, we can perform unprecedented large-scale earthquake sequence simulations on geometrically complex faults with supercomputers. The software is made open-source and freely available.

    DOI: 10.1093/gji/ggac386

    Web of Science

    arXiv

  23. Optimizations of H-matrix-vector Multiplication for Modern Multi-core Processors.

    Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa

    CLUSTER     page: 462 - 472   2022

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    DOI: 10.1109/CLUSTER51413.2022.00056

    Web of Science

    Other Link: https://dblp.uni-trier.de/db/conf/cluster/cluster2022.html#HoshinoIH22

  24. Evaluation of GPU Offloading Methods Using the Fortran Standard do concurrent Open Access

    星野 哲也, 塙 敏博

    IPSJ SIG Technical Reports (Web)   Vol. 2022-HPC-183   page: 1 - 8   2022

  25. Performance Evaluation of Hierarchical Matrix Operations on A64FX Open Access

    星野 哲也, 伊田 明弘, 塙 敏博

    IPSJ SIG Technical Reports (Web)   Vol. 2021-HPC-180   page: 1 - 8   2021

  26. Large-scale earthquake sequence simulations of 3D geometrically complex faults using the boundary element method accelerated by lattice H-matrices on distributed memory computer systems

    伊田 明弘, 星野 哲也

    arXiv preprint     page: 1 - 26   2021

  27. Implementation and Performance Evaluation of Temporal Blocking on A64FX Open Access

    星野 哲也, 塙 敏博

    IPSJ SIG Technical Reports (HPC)   Vol. 2021-HPC-178   page: 1 - 8   2021

  28. Preliminary development of training environment for deep learning on supercomputer system Reviewed Open Access

    Y. Nomura, I. Sato, T. Hanawa, S. Hanaoka, T. Nakao, T. Takenaga, D. Sato, T. Hoshino, Y. Sekiya, S. Ohshima, N. Hayashi, O. Abe

    International Journal of Computer Assisted Radiology and Surgery   Vol. 13 ( Issue 1 supplement ) page: S105 - S106   2018.6

    Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1007/s11548-018-1766-y

  29. Optimization of generation process for sparse coefficient matrices in FEM on multicore/manycore architectures Open Access

    中島研吾, 星野哲也, 成瀬彰, 塙敏博, 三木洋平

    IPSJ SIG Technical Reports (Web)   Vol. 2018 ( HPC-163 ) page: 1 - 8 (web only)   2018.2

    Language:Japanese  

    J-GLOBAL

  30. Load-Balancing-Aware Parallel Algorithms of H-Matrices with Adaptive Cross Approximation for GPUs. Reviewed

    Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, Kengo Nakajima

    IEEE International Conference on Cluster Computing, CLUSTER 2018, Belfast, UK, September 10-13, 2018     page: 35 - 45   2018

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    DOI: 10.1109/CLUSTER.2018.00016

    Web of Science

  31. Design of parallel BEM analyses framework for SIMD processors Reviewed

    Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, Kengo Nakajima

    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)   Vol. 10860   page: 601 - 613   2018

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Springer Verlag  

    Parallel Boundary Element Method (BEM) analyses are typically conducted using a purpose-built software framework called BEM-BB. This framework requires a user-defined function program that calculates the i-th row and the j-th column of the coefficient matrix arising from the convolution integral term in the fundamental BEM equation. Owing to this feature, the framework can encapsulate MPI and OpenMP hybrid parallelization with H-matrix approximation. Therefore, users can focus on implementing a fundamental solution or a Green’s function, which is the most important element in BEM and depends on the targeted physical phenomenon, as a user-defined function. However, the framework does not consider single instruction multiple data (SIMD) vectorization, which is important for high-performance computing and is supported by the majority of existing processors. Performing SIMD vectorization of a user-defined function is difficult because SIMD exploits instruction-level parallelization and is closely associated with the user-defined function. In this paper, a conceptual framework for enhancing SIMD vectorization is proposed. The proposed framework is evaluated using two BEM problems, namely, static electric field analysis with a perfect conductor and static electric field analysis with a dielectric, on Intel Broadwell (BDW) processor and Intel Xeon Phi Knights Landing (KNL) processor. It offers good vectorization performance with limited SIMD knowledge, as can be verified from the numerical results obtained herein. Specifically, in perfect conductor analyses conducted using the H-matrix, the framework achieved performance improvements of 2.22x and 4.34x compared to the original BEM-BB framework for the BDW processor and KNL, respectively.

    DOI: 10.1007/978-3-319-93698-7_46

    Web of Science

    Scopus

  32. Initial Construction of a Deep Learning Training Environment on a Supercomputer

    野村行弘, 佐藤一誠, 塙敏博, 花岡昇平, 中尾貴祐, 竹永智美, 佐藤大介, 星野哲也, 関谷勇司, 大島聡史, 林直人, 阿部修

    IEICE Technical Report   Vol. 117 ( 281 (MI2017 47-62) ) page: 1 - 2   2017.10

    Language:Japanese  

    J-GLOBAL

  33. Pascal vs KNL: Performance Evaluation with ICCG Solve Reviewed

    Tetsuya Hoshino, Satoshi Ohshima, Toshihiro Hanawa, Kengo Nakajima, Akihiro Ida

    HPC in Asia Workshop Poster Session, ISC High Performance 2017     2017.6

    Language:English   Publishing type:Research paper (international conference proceedings)  

  34. Performance Evaluation of an ICCG Solver Using OpenACC on Pascal GPUs Open Access

    星野哲也, 大島聡史, 塙敏博, 中島研吾, 伊田明弘

    IPSJ SIG Technical Reports (Web)   Vol. 2017 ( HPC-158 ) page: 1 - 9 (web only)   2017.3

    Language:Japanese   Publishing type:Research paper (scientific journal)  

    J-GLOBAL

  35. A Directive-based Data Layout Abstraction for Performance Portability of OpenACC Applications Reviewed

    Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

    PROCEEDINGS OF 2016 IEEE 18TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS; IEEE 14TH INTERNATIONAL CONFERENCE ON SMART CITY; IEEE 2ND INTERNATIONAL CONFERENCE ON DATA SCIENCE AND SYSTEMS (HPCC/SMARTCITY/DSS)     page: 1147 - 1154   2016

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Directive-based programming interfaces such as OpenACC and OpenMP are becoming more prevalent in application development targeting accelerators, in particular when porting existing CPU-only code. Unlike vendor-specific alternatives such as CUDA, they are designed to be portable across different accelerators, and therefore once necessary directives are added to an existing CPU-only code, it can be executed on different accelerator architectures depending on the availability of supporting compilers. However, it does not automatically mean that such code runs efficiently on different architectures, and in fact, architecture-specific coding such as choosing optimal data layouts is almost mandatory for optimal performance, imposing a significant burden if implemented manually. Towards realizing performance portability in accelerator programming, we propose a set of extended directives that allow the programmer to optimize data layouts for a given accelerator without modifying original program code. Unlike the manual approach, the code change is confined in the directives with the original code kept as it is. This paper evaluates the effectiveness of our proposed extensions in the OpenACC standard by extending UPACS and CCS-QCD OpenACC applications. A prototype source-to-source translator for the extensions achieves 123% and 120% of the baseline performance, respectively, which are comparable to manually tuned versions.

    DOI: 10.1109/HPCC-SmartCity-DSS.2016.34

    Web of Science

  36. An OpenACC extension for data layout transformation Reviewed

    Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

    Proceedings of WACCPD 2014: 1st Workshop on Accelerator Programming Using Directives - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis     page: 12 - 18   2015.4

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Institute of Electrical and Electronics Engineers Inc.  

    OpenACC is gaining momentum as an implicit and portable interface for porting legacy CPU-based applications to heterogeneous, highly parallel computational environments involving many-core accelerators such as GPUs and Intel Xeon Phi. OpenACC provides a set of loop directives similar to OpenMP for parallelization and also to manage data movement, attaining functional portability across different heterogeneous devices; however, the performance portability of OpenACC is said to be insufficient due to the characteristics of different target devices, especially those regarding memory layouts, as automated attempts by the compilers to adapt are currently difficult. We are currently working to propose a set of directives to allow compilers to have better semantic information for adaptation; here, we particularly focus on data layout such as Structure of Arrays, an advantageous data structure for GPUs, as opposed to Array of Structures, which exhibits good performance on CPUs. We propose a directive extension to OpenACC that allows the users to flexibly specify optimal layouts, even if the data structures are nested. Performance results show that we gain as much as 96% in performance for CPUs and 165% for GPUs compared to programs without such directives, essentially attaining both functional and performance portability in OpenACC.

    DOI: 10.1109/WACCPD.2014.12

    Scopus

  37. CUDA vs OpenACC: Performance Case Studies with Kernel Benchmarks and a Memory-Bound CFD Application Reviewed

    Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka, Ryoji Takaki

    PROCEEDINGS OF THE 2013 13TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID 2013)     page: 136 - 143   2013

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    OpenACC is a new accelerator programming interface that provides a set of OpenMP-like loop directives for the programming of accelerators in an implicit and portable way. It allows the programmer to express the offloading of data and computations to accelerators, such that the porting process for legacy CPU-based applications can be significantly simplified. This paper focuses on the performance aspects of OpenACC using two microbenchmarks and one real-world computational fluid dynamics application. Both evaluations show that in general OpenACC performance is approximately 50% lower than CUDA. However, for some applications it can reach up to 98% with careful manual optimizations. The results also indicate several limitations of the OpenACC specification that hamper full use of the GPU hardware resources, resulting in a significant performance gap when compared to a highly tuned CUDA code. The lack of a programming interface for the shared memory in particular results in as much as three times lower performance.

    DOI: 10.1109/CCGrid.2013.12

    Web of Science


MISC 53

  1. Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

    Ryo Mikasa, Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri

        2026.2

    Large language models (LLMs) have demonstrated strong code generation capabilities, yet the runtime performance of generated code is not guaranteed, and there have been few attempts to train LLMs using runtime performance as a reward in the HPC domain. We propose an online reinforcement learning approach that executes LLM-generated code on a supercomputer and directly feeds back the measured runtime performance (GFLOPS) as a reward. We further introduce a Staged Quality-Diversity (SQD) algorithm that progressively varies the permitted optimization techniques on a per-problem basis, enabling the model to learn code optimization from diverse perspectives. We build a distributed system connecting a GPU training cluster with a CPU benchmarking cluster, and train Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO). Through two experiments, we show that reinforcement learning combining runtime performance feedback with staged optimization can improve the HPC code generation capability of LLMs.

    arXiv

    Other Link: https://arxiv.org/pdf/2602.12049v1

  2. Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM

    Xuanzhengbo Ren, Yuta Kawai, Tetsuya Hoshino, Hirofumi Tomita, Takahiro Katagiri, Daichi Mukunoki, Seiya Nishizawa

    CoRR   Vol. abs/2601.06886   2026.1

    Accurate performance prediction is essential for optimizing scientific applications on modern high-performance computing (HPC) architectures. Widely used performance models primarily focus on cache and memory bandwidth, which is suitable for many memory-bound workloads. However, it is unsuitable for highly arithmetic intensive cases such as the sum-factorization with tensor $n$-mode product kernels, which are an optimization technique for high-order finite element methods (FEM). On processors with relatively high single instruction multiple data (SIMD) instruction latency, such as the Fujitsu A64FX, the performance of these kernels is strongly influenced by loop-body splitting strategies. Memory-bandwidth-oriented models are therefore not appropriate for evaluating these splitting configurations, and a model that directly reflects instruction-level efficiency is required. To address this need, we develop a dependency-chain-based analytical formulation that links loop-splitting configurations to instruction dependencies in the tensor $n$-mode product kernel. We further use XGBoost to estimate key parameters in the analytical model that are difficult to model explicitly. Evaluations show that the learning-augmented model outperforms the widely used standard Roofline and Execution-Cache-Memory (ECM) models. On the Fujitsu A64FX processor, the learning-augmented model achieves mean absolute percentage errors (MAPE) between 1% and 24% for polynomial orders ($P$) from 1 to 15. In comparison, the standard Roofline and ECM models yield errors of 42%-256% and 5%-117%, respectively. On the Intel Xeon Gold 6230 processor, the learning-augmented model achieves MAPE values from 1% to 13% for $P$=1 to $P$=14, and 24% at $P$=15. In contrast, the standard Roofline and ECM models produce errors of 1%-73% and 8%-112% for $P$=1 to $P$=15, respectively.

    DOI: 10.48550/arXiv.2601.06886

    arXiv

    Other Link: https://arxiv.org/pdf/2601.06886v1

  3. 3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG

    Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Satoshi Ohshima, Takahiro Katagiri

    CoRR   Vol. abs/2510.04536   2025.10

     More details

    This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG)
    generation framework utilizing Large Language Models (LLMs). The framework
    enables users to generate 3D-CG content solely through natural language
    instructions. 3Dify is built upon Dify, an open-source platform for AI
    application development, and incorporates several state-of-the-art LLM-related
    technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented
    Generation (RAG). For 3D-CG generation support, 3Dify automates the operation
    of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not
    support MCP-based interaction, the framework employs the Computer-Using Agent
    (CUA) method to automate Graphical User Interface (GUI) operations. Moreover,
    to enhance image generation quality, 3Dify allows users to provide feedback by
    selecting preferred images from multiple candidates. The LLM then learns
    variable patterns from these selections and applies them to subsequent
    generations. Furthermore, 3Dify supports the integration of locally deployed
    LLMs, enabling users to utilize custom-developed models and to reduce both time
    and monetary costs associated with external API calls by leveraging their own
    computational resources.

    DOI: 10.48550/arXiv.2510.04536

    arXiv

    Other Link: http://arxiv.org/pdf/2510.04536v1

  4. VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs

    Shun-ichiro Hayashi, Koki Morita, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri

    CoRR   Vol. abs/2510.00031   2025.9

     More details

    We propose VibeCodeHPC, an automatic tuning system for HPC programs based on
    multi-agent LLMs for code generation. VibeCodeHPC tunes programs through
    multi-agent role allocation and iterative prompt refinement. We describe the
    system configuration with four roles: Project Manager (PM), System Engineer
    (SE), Programmer (PG), and Continuous Delivery (CD). We introduce dynamic agent
    deployment and activity monitoring functions to facilitate effective
    multi-agent collaboration. In our case study, we convert and optimize CPU-based
    matrix-matrix multiplication code written in C to GPU code using CUDA. The
    multi-agent configuration of VibeCodeHPC achieved higher-quality code
    generation per unit time compared to a solo-agent configuration. Additionally,
    the dynamic agent deployment and activity monitoring capabilities facilitated
    more effective identification of requirement violations and other issues.
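
    The iterative prompt refinement at the core of such a system can be sketched roughly as a generate-evaluate-refine loop; everything below (the `generate` and `evaluate` stubs and their scoring) is hypothetical and only illustrates the cycle, not the actual VibeCodeHPC implementation:

```python
# Hypothetical sketch of an iterative prompt-refinement auto-tuning loop:
# generate a code candidate, evaluate it, and fold the feedback into the
# next prompt. Stubs stand in for LLM calls and benchmark runs.
def generate(prompt):
    # Stand-in for an LLM call; returns (code, simulated GFLOP/s) that
    # improves as accumulated feedback appears in the prompt.
    return f"// code for: {prompt}", 10.0 + 5.0 * prompt.count("feedback")

def evaluate(code, perf):
    # Stand-in for compiling and benchmarking the candidate.
    return perf, f"feedback: reached {perf:.1f} GFLOP/s, try more optimization"

def tune(base_prompt, iterations=3):
    prompt, best = base_prompt, (None, float("-inf"))
    for _ in range(iterations):
        code, perf = generate(prompt)
        score, feedback = evaluate(code, perf)
        if score > best[1]:
            best = (code, score)
        prompt = prompt + "\n" + feedback  # refine the prompt with feedback
    return best

code, score = tune("convert this C matmul to CUDA")
print(score)  # improves across iterations in this toy setup
```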

    DOI: 10.48550/arXiv.2510.00031

    arXiv

    Other Link: http://arxiv.org/pdf/2510.00031v1

  5. Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach

    Tatsuro Hanyu, Takahiro Katagiri, Daichi Mukunoki, Tetsuya Hoshino

    CoRR   Vol. abs/2507.20295   2025.7

     More details

    Coherent Ising Machines (CIMs) have recently gained attention as a promising
    computing model for solving combinatorial optimization problems. In particular,
    the Chaotic Amplitude Control (CAC) algorithm has demonstrated high solution
    quality, but its performance is highly sensitive to a large number of
    hyperparameters, making efficient tuning essential. In this study, we present
    an algorithm portfolio approach for hyperparameter tuning in CIMs employing
    the Chaotic Amplitude Control with momentum (CACm) algorithm. Our method
    incorporates multiple search strategies, enabling flexible and effective
    adaptation to the characteristics of the hyperparameter space. Specifically, we
    propose two representative tuning methods, Method A and Method B. Method A
    optimizes each hyperparameter sequentially with a fixed total number of trials,
    while Method B prioritizes hyperparameters based on initial evaluations before
    applying Method A in order. Performance evaluations were conducted on the
    Supercomputer "Flow" at Nagoya University, using planted Wishart instances and
    Time to Solution (TTS) as the evaluation metric. Compared to the baseline
    performance with best-known hyperparameters, Method A achieved up to 1.47x
    improvement, and Method B achieved up to 1.65x improvement. These results
    demonstrate the effectiveness of the algorithm portfolio approach in enhancing
    the tuning process for CIMs.
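
    Method A above (optimizing each hyperparameter sequentially under a fixed total trial budget) can be sketched as follows; the objective function and parameter grids are hypothetical placeholders standing in for the actual CACm Time-to-Solution measurement:

```python
# Sketch of Method A: tune hyperparameters one at a time, fixing the best
# value found for each before moving to the next, under a fixed total
# number of trials. The objective is a hypothetical stand-in for a CIM
# Time-to-Solution measurement (lower is better).
def objective(params):
    # Toy objective with its minimum at beta=0.5, gamma=1.0.
    return (params["beta"] - 0.5) ** 2 + (params["gamma"] - 1.0) ** 2

def method_a(grids, total_trials):
    params = {name: values[0] for name, values in grids.items()}
    trials_per_param = total_trials // len(grids)
    for name, values in grids.items():
        best_val, best_score = params[name], objective(params)
        for v in values[:trials_per_param]:
            params[name] = v
            score = objective(params)
            if score < best_score:
                best_val, best_score = v, score
        params[name] = best_val  # fix this parameter, move to the next one
    return params

grids = {"beta": [0.1, 0.3, 0.5, 0.7], "gamma": [0.5, 1.0, 1.5, 2.0]}
print(method_a(grids, total_trials=8))  # -> {'beta': 0.5, 'gamma': 1.0}
```

    Method B would simply reorder `grids` by an initial sensitivity estimate before running the same loop.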

    DOI: 10.48550/arXiv.2507.20295

    arXiv

    Other Link: http://arxiv.org/pdf/2507.20295v1

  6. Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation

    Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri

    CoRR   Vol. abs/2507.04697   2025.7

     More details

    Generative AI technology based on Large Language Models (LLMs) has been
    developed and applied to assist with or automatically generate program code. In
    this paper, we evaluate the capability of existing general LLMs for Basic
    Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs
    provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model,
    and o4-mini, one of the o-series reasoning models, both released in April
    2025. For routines from level-1 to level-3 BLAS, we tried to generate
    (1) C code without optimization from routine name only, (2) C code with basic
    performance optimizations (thread parallelization, SIMD vectorization, and
    cache blocking) from routine name only, and (3) C code with basic performance
    optimizations based on Fortran reference code. As a result, we found that
    correct code can be generated in many cases even when only the routine name is
    given. We also confirmed that thread parallelization with OpenMP, SIMD
    vectorization, and cache blocking can be implemented to some extent, and that
    the code is faster than the reference code.
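
    As a reference point for what such generated code must compute, the semantics of a level-1 and a level-3 routine are sketched below in plain Python; generated C versions would add the OpenMP threading, SIMD vectorization, and (for level 3) cache blocking mentioned above on top of the same semantics:

```python
# Reference semantics of two BLAS routines that generated code must match:
#   axpy (level 1): y <- alpha * x + y
#   gemm (level 3): C <- alpha * A @ B + beta * C  (row-major lists of lists)
def axpy(alpha, x, y):
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemm(alpha, a, b, beta, c):
    n, k, m = len(a), len(b), len(b[0])
    return [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(m)] for i in range(n)]

print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))          # [12.0, 24.0]
print(gemm(1.0, [[1, 2]], [[3], [4]], 0.0, [[0]]))  # [[11.0]]
```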

    DOI: 10.48550/arXiv.2507.04697

    arXiv

    Other Link: http://arxiv.org/pdf/2507.04697v1

  7. Current Status and Future of the ABINIT-MP Program Invited Reviewed

    Yuji MOCHIZUKI, Tatsuya NAKANO, Kota SAKAKURA, Hideo DOI, Koji OKUWAKI, Toshihiro KATO, Hiroyuki TAKIZAWA, Satoshi OHSHIMA, Tetsuya HOSHINO, Takahiro KATAGIRI

    Journal of Computer Chemistry, Japan   Vol. 23 ( 4 ) page: 85 - 97   2024.12

     More details

    Language:Japanese   Publishing type:Article, review, commentary, editorial, etc. (scientific journal)   Publisher:Society of Computer Chemistry, Japan  

    The fragment molecular orbital (FMO) program ABINIT-MP has a quarter-century history, and research and development of the Open Version 2 series is currently underway. This paper first summarizes the current status of the latest Revision 8 (released in August 2023). It then describes future improvements and enhancements, including GPU support. The connection with coarse-grained simulation (dissipative particle dynamics) and the possibility of cooperation with quantum computation are also touched upon.

    DOI: 10.2477/jccj.2024-0022

    Web of Science

    CiNii Research

  8. Investigation of Azure CycleCloud Usage Environment and Considerations on Supercomputer Center-Cloud Collaboration

    Nagai Toru, Gojuki Shuichi, Kawai Masatoshi, Katagiri Takahiro, Hoshino Tetsuya

    Journal for Academic Computing and Networking   Vol. 28 ( 1 ) page: 114 - 124   2024.11

     More details

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/jacn.28.1_114

    CiNii Research

  9. FMOプログラムABINIT-MPのGPU化への対応—Customization of GPU Performance Tuning for ABINIT-MP

    坂倉 耕太, 望月 祐志, 中野 達也, 成瀬 彰, 大島 聡史, 星野 哲也, 片桐 孝洋

    計算工学講演会論文集 = Proceedings of the Conference on Computational Engineering and Science / 日本計算工学会 編   Vol. 29   page: 1182 - 1184   2024.6

     More details

    Language:Japanese   Publisher:東京 : 日本計算工学会  

    Other Link: https://ndlsearch.ndl.go.jp/books/R000000004-I033560385

  10. Development Status of ABINIT-MP in 2023 Invited Reviewed

    Yuji MOCHIZUKI, Tatsuya NAKANO, Kota SAKAKURA, Koji OKUWAKI, Hideo DOI, Toshihiro KATO, Hiroyuki TAKIZAWA, Akira NARUSE, Satoshi OHSHIMA, Tetsuya HOSHINO, Takahiro KATAGIRI

    J. Comp. Chem. Jpn.   Vol. 23 ( 1 ) page: 4 - 8   2024.3

     More details

    Language:Japanese   Publishing type:Rapid communication, short report, research note, etc. (scientific journal)   Publisher:Society of Computer Chemistry, Japan  

    In August 2023, we released the latest version of our ABINIT-MP program, Open Version 2 Revision 8. In this version, the most commonly used FMO-MP2 calculations are even faster than in the previous Revision 4. It is now also possible to calculate excitation and ionization energies for regions of interest. Improved interaction analysis is also available. In addition, we have started GPU-oriented modifications. In this preliminary report, we present the current status of ABINIT-MP.

    DOI: 10.2477/jccj.2024-0001

    CiNii Research

  11. SVMによる誤差を含むクラス分類におけるCMOSアニーリングマシンの性能評価

    水木直也, 福原諒河, 森下誠, 河合直聡, 片桐孝洋, 星野哲也, 永井亨

    第86回全国大会講演論文集   Vol. 2024 ( 1 ) page: 19 - 20   2024.3

     More details

    Language:Japanese   Publisher:情報処理学会  

    Quantum annealing has been attracting attention, and pseudo-quantum annealer services are now being offered. Meanwhile, support vector machines (SVMs) are widely used for classification, but their evaluation on pseudo-quantum annealers is insufficient. In this study, we therefore perform classification with erroneous training data using an SVM on the CMOS Annealing Machine, one such pseudo-quantum annealer, and evaluate its performance. The evaluation compares the accuracy of the conventional CPU-based SVM and the SVM on the CMOS Annealing Machine on binary classification problems, for each proportion of errors in the training data.

    CiNii Research

  12. ICTCG法の実行時間予測モデルに対する説明可能なAIの適用

    中谷崇真, 河合直聡, 片桐孝洋, 星野哲也, 永井亨

    第86回全国大会講演論文集   Vol. 2024 ( 1 ) page: 31 - 32   2024.3

     More details

    Language:Japanese   Publisher:情報処理学会  

    Using AI to predict processing results and execution times of complex computations from input data has been studied, but verifying the validity of such predictions is essential for efficient, error-free execution. In this study, we build a machine learning model that predicts the execution time of the incomplete-Cholesky-thresholded CG method (ICTCG) from matrix images and ICTCG parameters, and verify the validity of the AI's answers by explaining the model with SHAP, an explainable AI (XAI) tool. Here, the ICTCG method is a numerical algorithm that sets a threshold in the incomplete Cholesky (IC) factorization preconditioner used with iterative linear solvers such as the conjugate gradient (CG) method.

    CiNii Research

  13. LAPACKを用いた固有値計算におけるテストシーケンスの最適化

    樫村寛大, 森崎修司, 片桐孝洋, 河合直聡, 永井亨, 星野哲也

    第86回全国大会講演論文集   Vol. 2024 ( 1 ) page: 199 - 200   2024.3

     More details

    Language:Japanese   Publisher:情報処理学会  

    Improving the efficiency of testing numerical software is very important in scientific computing. This study aims to optimize the test sequence for eigenvalue computation using the numerical linear algebra library LAPACK. Here, a test sequence is a series of checks that supplies problems with known theoretical eigenvalues and verifies that the computed results fall within the expected range. There are few examples that evaluate test sequence optimization for numerical software from a software engineering perspective. In this study, we first inject bugs into the LAPACK library intentionally and examine whether the test time can be shortened by an optimization that reorders the test sequence.

    CiNii Research

  14. OpenACCを用いた地震シミュレーションのGPU並列化

    百武尚輝, 星野哲也, 小澤創, 伊田明弘, 安藤亮輔, 河合直聡, 永井亨, 片桐孝洋

    第86回全国大会講演論文集   Vol. 2024 ( 1 ) page: 33 - 34   2024.3

     More details

    Language:Japanese   Publisher:情報処理学会  

    In earthquake simulations using the boundary element method (BEM), large dense matrix-vector products arise, so accelerating this operation is necessary to speed up the simulation. In the existing application, the dense matrix was approximated by a lattice H-matrix and the dense matrix-vector product was accelerated by CPU parallelization with OpenMP and MPI. However, GPU parallelization, which is well suited to this computation, had not been implemented, leaving room to accelerate the lattice H-matrix-vector product. In this study, we implemented the lattice H-matrix-vector product with GPU parallelization using OpenACC for the target application and evaluated its performance.
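
    The reason (lattice) H-matrix approximation pays off in this matrix-vector product is that each low-rank block A = U Vᵀ can be applied as U (Vᵀ x); a minimal pure-Python sketch with illustrative rank-1 data:

```python
# Sketch of why low-rank blocks accelerate the matrix-vector product:
# a rank-k block A = U @ V^T (n x m) costs O(k(n + m)) per matvec when
# applied as U @ (V^T @ x), instead of O(n*m) for the dense product.
def matvec(mat, x):                      # plain dense matvec
    return [sum(row[j] * x[j] for j in range(len(x))) for row in mat]

def lowrank_matvec(u, v, x):             # A = U V^T, so A x = U (V^T x)
    t = [sum(v[i][r] * x[i] for i in range(len(x))) for r in range(len(v[0]))]
    return [sum(row[r] * t[r] for r in range(len(t))) for row in u]

u = [[1.0], [2.0]]                       # n x k, with rank k = 1
v = [[3.0], [4.0]]                       # m x k
x = [1.0, 1.0]
dense = [[u[i][0] * v[j][0] for j in range(2)] for i in range(2)]
print(matvec(dense, x), lowrank_matvec(u, v, x))  # both give [7.0, 14.0]
```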

    CiNii Research

  15. Introduction to Parallel Programming on CPU and GPU: (4)

    星野哲也, 中島研吾

    シミュレーション   Vol. 43 ( 1 ) page: 45 - 52   2024

  16. 格子H行列を用いた地震シミュレーションのマルチGPU並列化

    百武尚輝, 星野哲也, 小澤創, 伊田明弘, 安藤亮輔, 河合直聡, 永井亨, 片桐孝洋

    情報処理学会研究報告(Web)   Vol. 2024 ( HPC-195 )   2024

     More details

  17. WaitIO+MPI Hybridによる異種システム間でのAllreduceの高速化

    植野貴大, 住元真司, 中島研吾, 片桐孝洋, 大島聡史, 星野哲也, 河合直聡, 永井亨

    情報処理学会研究報告(Web)   Vol. 2024 ( HPC-196 )   2024

     More details

  18. HPCカーネルベンチマークによるSapphire Rapids HBMの性能評価

    星野哲也, 河合直聡, 伊田明弘, 塙敏博, 片桐孝洋

    情報処理学会研究報告(Web)   Vol. 2024 ( HPC-193 )   2024

     More details

  19. Revaluation of power saving effect by spraying spring water with Supercomputer “Flow”

    Yamada Kazunari, Tajima Yoshinori, Takahashi Ichiro, Hayashi Hidekazu, Katagiri Takahiro, Ohshima Satoshi, Hoshino Tetsuya, Nagai Toru

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2023 ( 0 ) page: 67 - 74   2023.12

     More details

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2023.0_67

    CiNii Research

  20. Usage environment of Azure CycleCloud and benchmark test results on VMs

    Nagai Toru, Gojuki Shuichi, Kawai Masatoshi, Katagiri Takahiro, Hoshino Tetsuya

    Proceedings of the Annual Conference of Academic Exchange for Information Environment and Strategy   Vol. 2023 ( 0 ) page: 81 - 88   2023.12

     More details

    Language:Japanese   Publisher:Academic eXchange for Information Environment and Strategy  

    DOI: 10.24669/axies.2023.0_81

    CiNii Research

  21. 4D Reconstruction of PET Using GPU Supercomputer

    OHSHIMA Satoshi, YUASA Yoshinao, MATSUMURA Kaito, YOKOTA Tatsuya, HONTANI Hidekata, SAKATA Muneyuki, KIMURA Yuichi, KATAGIRI Takahiro, NAGAI Toru, HANAWA Toshihiro, HOSHINO Tetsuya

    Medical Imaging Technology   Vol. 41 ( 4-5 ) page: 150 - 156   2023.11

     More details

    Language:Japanese   Publisher:The Japanese Society of Medical Imaging Technology  

    With the development of medical imaging technology, various techniques have been developed and used to visually understand the inside of the living body. However, these technologies can only directly obtain images and videos, and diagnosis is still performed by human hands, such as physicians. There are great expectations for software that can reduce such labor, and an increasing number of technologies are already being used in the field of medicine, but the target is limited because they require knowledge and skills in both medicine (medical imaging) and computing technology. Therefore, in this study, researchers in the medical imaging and high-performance computing fields are collaborating to accelerate and scale up the image reconstruction of PET. This paper describes the details of this effort and the results obtained so far.

    DOI: 10.11409/mit.41.150

    CiNii Research

  22. CPU・GPU並列プログラミング入門(1)—Introduction to Parallel Programming on CPU and GPU(1)

    中島 研吾, 星野 哲也

    シミュレーション = Journal of the Japan Society for Simulation Technology / 日本シミュレーション学会 編   Vol. 42 ( 2 ) page: 103 - 109   2023.6

     More details

    Language:Japanese   Publisher:小宮山印刷工業  

    CiNii Research

    Other Link: https://ndlsearch.ndl.go.jp/books/R000000004-I032988217

  23. 数値計算ライブラリの自動チューニングにおけるXAI適用の試み—An Adaptation of XAI to Auto-tuning for Numerical Calculation Library

    青木 将太, 片桐 孝洋, 大島 聡史, 永井 亨, 星野 哲也

    計算工学講演会論文集 = Proceedings of the Conference on Computational Engineering and Science / 日本計算工学会 編   Vol. 28   page: 904 - 907   2023.5

     More details

    Language:Japanese   Publisher:日本計算工学会  

    Other Link: https://ndlsearch.ndl.go.jp/books/R000000004-I032887313

  24. Introduction to Parallel Programming on CPU and GPU (2)

    中島研吾, 星野哲也

    シミュレーション   Vol. 42 ( 3 ) page: 173 - 181   2023

  25. Introduction to Parallel Programming on CPU and GPU (3)

    星野哲也, 中島研吾

    シミュレーション   Vol. 42 ( 4 ) page: 242 - 250   2023

  26. Fortran標準規格do concurrentを用いたGPUオフローディング手法の評価

    星野 哲也, 塙 敏博

    情報処理学会研究報告(Web)   Vol. 2022-HPC-183   page: 1 - 8   2022

  27. AMD製GPU・NVIDIA製GPU両対応direct N-body codeの実装と性能評価

    三木洋平, 塙敏博, 河合直聡, 星野哲也

    日本天文学会年会講演予稿集   Vol. 2022   2022

     More details

  28. OpenMPを用いたGPUオフローディングの有効性の評価

    河合直聡, 三木洋平, 星野哲也, 塙敏博, 中島研吾

    情報処理学会研究報告(Web)   Vol. 2022 ( HPC-183 )   2022

     More details

  29. A64FXにおけるテンポラルブロッキングの実装と性能評価

    星野 哲也, 塙 敏博

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2021-HPC-178 ( 17 ) page: 1 - 8   2021.3

     More details

    Authorship:Lead author  

  30. 「計算・データ・学習」融合スーパーコンピュータシステム「Wisteria/BDEC-01」の概要

    中島研吾, 塙敏博, 下川辺隆史, 伊田明弘, 芝隼人, 三木洋平, 星野哲也, 有間英志, 河合直聡, 坂本龍一, 近藤正章, 岩下武史, 八代尚, 長尾大道, 松葉浩也, 荻田武史, 片桐孝洋, 古村孝志, 鶴岡弘, 市村強, 藤田航平

    情報処理学会研究報告(Web)   Vol. 2021 ( HPC-179 )   2021

     More details

  31. 「計算・データ・学習」融合スーパーコンピュータシステムWisteria/BDEC-01の性能評価

    塙敏博, 中島研吾, 下川辺隆史, 芝隼人, 三木洋平, 星野哲也, 河合直聡, 似鳥啓吾, 今村俊幸, 工藤周平, 中尾昌広

    情報処理学会研究報告(Web)   Vol. 2021 ( HPC-180 )   2021

     More details

  32. A64FXにおける階層型行列演算の性能評価

    星野哲也, 伊田明弘, 塙敏博

    情報処理学会研究報告(Web)   Vol. 2021 ( HPC-180 ) page: 1 - 8   2021

     More details

  33. Large-scale earthquake sequence simulations of 3D geometrically complex faults using the boundary element method accelerated by lattice H-matrices on distributed memory computer systems

    伊田 明弘, 星野 哲也

    arXiv preprint   Vol. -   page: 1 - 26   2021

  34. An Optimization of H-matrix-vector Multiplication by Using Un-used Cores

    Tetsuya Hoshino, Toshihiro Hanawa, Akihiro Ida

    HPC Asia 2020     2020.1

  35. Numerical Linear Algebra Based on Lattice H-Matrices

    Akihiro Ida, Ichitaro Yamazaki, Rio Yokota, Satoshi Ohshima, Tasuku Hiraishi, Takeshi Iwashita, Tetsuya Hoshino, Toshihiro Hanawa

    HPC Asia 2020     2020.1

  36. メニーコアクラスタにおける階層型行列法の高速化に向けた性能評価

    星野哲也, 伊田明弘

    計算工学講演会論文集(CD-ROM)   Vol. 24   page: C-07-02   2019.6

     More details

    Language:Japanese  

    CiNii Research

    J-GLOBAL

  37. High-level Abstractions for High Performance Computing on Many-core Processors

    Hoshino Tetsuya

        2018.9

     More details

    Language:English  

    CiNii Research

  38. OpenCLを用いたFPGAによる階層型行列計算

    塙敏博, 伊田明弘, 星野哲也

    情報処理学会研究報告(Web)   Vol. 2018 ( HPC-163 ) page: No.26, 1 - 8 (Web only)   2018.2

     More details

    Language:Japanese  

    J-GLOBAL

  39. 階層型行列計算のFPGAへの適用

    塙敏博, 伊田明弘, 星野哲也

    情報処理学会研究報告(Web)   Vol. 2017 ( HPC-161 ) page: No.10, 1 - 10 (Web only)   2017.9

     More details

    Language:Japanese  

    J-GLOBAL

  40. 階層型行列法ライブラリHACApKを用いたアプリケーションのメニーコア向け最適化

    星野哲也, 伊田明弘, 塙敏博, 中島研吾

    情報処理学会研究報告(Web)   Vol. 2017 ( HPC-160 ) page: No.15, 1 - 10 (Web only)   2017.7

     More details

    Language:Japanese  

    J-GLOBAL

  41. GPU搭載スーパーコンピュータReedbush‐Hの性能評価

    塙敏博, 星野哲也, 中島研吾, 大島聡史, 伊田明弘

    情報処理学会研究報告(Web)   Vol. 2017 ( HPC-159 ) page: No.9, 1 - 6 (Web only)   2017.4

     More details

    Language:Japanese  

    J-GLOBAL

  42. Xeon Phi+OmniPath環境におけるOpenMP,MPI性能最適化

    塙敏博, 星野哲也, 中島研吾, 大島聡史, 伊田明弘

    情報処理学会研究報告(Web)   Vol. 2017 ( HPC-158 ) page: No.21, 1 - 8 (Web only)   2017.3

     More details

    Language:Japanese  

    J-GLOBAL

  43. Optimization of ICCG Solver for Intel Xeon Phi

    中島研吾, 大島聡史, 塙敏博, 星野哲也, 伊田明弘

    情報処理学会研究報告(Web)   Vol. 2016 ( HPC-157 ) page: No.16, 1 - 8 (Web only)   2016.12

     More details

    Language:Japanese  

    J-GLOBAL

  44. Performance Evaluation of Pipelined CG Method

    塙敏博, 中島研吾, 大島聡史, 星野哲也, 伊田明弘

    情報処理学会研究報告(Web)   Vol. 2016 ( HPC-157 ) page: No.6, 1 - 9 (Web only)   2016.12

     More details

    Language:Japanese  

    J-GLOBAL

  45. データ解析・シミュレーション融合スーパーコンピュータシステムReedbush‐Uの性能評価

    塙敏博, 中島研吾, 大島聡史, 伊田明弘, 星野哲也, 田浦健次朗

    情報処理学会研究報告(Web)   Vol. 2016 ( HPC-156 ) page: No.10, 1 - 10 (Web only)   2016.9

     More details

    Language:Japanese  

    J-GLOBAL

  46. データレイアウト最適化指示文によるOpenACCアプリケーションの高速化

    星野 哲也

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2016-HPC-155   page: 1 - 8   2016

  47. 圧縮性流体プログラムのOpenACCによる高速化

    星野 哲也

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2016-HPC-153   page: 1 - 10   2016

  48. OpenACCディレクティブ拡張によるデータレイアウト最適化

    星野哲也, 丸山直也, 松岡聡

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2014 ( 45 ) page: 1 - 8   2014.7

     More details

    Language:Japanese   Publisher:一般社団法人情報処理学会  

    As a way to port existing programs to computing environments equipped with accelerators such as GPUs, which have been increasing in recent years, high-level directive-based programming models such as OpenACC have been attracting attention as an alternative to low-level programming models such as CUDA and OpenCL. An advantage of such directive-based models is high functional portability across devices, since porting can be done while preserving the original program. However, current high-level programming models such as OpenACC cannot accommodate the difference between the data layouts preferred by scalar processors and by many-core accelerators, so performance portability across devices with different characteristics is a problem. In this study, we prototyped OpenACC extension directives that abstract the data layout to improve performance portability across devices, changed the data layout of the Himeno benchmark with a translator, and evaluated it on a multi-core CPU, the Intel Xeon Phi, and the K20X GPU. As a result, we confirmed performance improvements of 27% on the Intel Xeon Phi and 24% on the K20X GPU compared with the original data layout.
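
    The data-layout difference that such directive extensions abstract is the classic Array-of-Structures vs Structure-of-Arrays choice; a hypothetical sketch of the transformation (not the actual translator):

```python
# Array-of-Structures (AoS) vs Structure-of-Arrays (SoA): the layouts a
# data-layout translator would switch between. Scalar processors often
# favor AoS locality, while vector units and GPUs favor the contiguous
# per-field streams of SoA.
def aos_to_soa(particles):
    # particles: [(x, y, z), ...]  ->  {"x": [...], "y": [...], "z": [...]}
    xs, ys, zs = zip(*particles) if particles else ((), (), ())
    return {"x": list(xs), "y": list(ys), "z": list(zs)}

def soa_to_aos(soa):
    return list(zip(soa["x"], soa["y"], soa["z"]))

aos = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
soa = aos_to_soa(aos)
print(soa["x"])                 # [1.0, 4.0] -- contiguous, SIMD-friendly
assert soa_to_aos(soa) == aos   # the round trip preserves the data
```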

    CiNii Research

  49. CPU-GPUそれぞれに最適なデータレイアウトを選択可能にするOpenACCディレクティブ拡張

    星野哲也, 丸山直也, 松岡聡

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2014 ( 5 ) page: 1 - 5   2014.2

     More details

    Language:Japanese   Publisher:一般社団法人情報処理学会  

    As a way to port existing programs to computing environments equipped with accelerators such as GPUs, which have been increasing in recent years, high-level directive-based programming models such as OpenACC can be used instead of low-level models such as CUDA and OpenCL. An advantage of such directive-based models is high portability across devices, since porting can be done without breaking the original program. However, current programming models such as OpenACC cannot accommodate differences such as the data layouts preferred by scalar processors and by many-core accelerators, so performance portability across devices with different characteristics is a problem. In this study, we prototyped and evaluated OpenACC extension directives that abstract the data layout to improve performance portability across different devices.

    CiNii Research

  50. ディレクティブベースプログラミング言語OpenACCの性能評価

    星野哲也, 丸山直也, 松岡聡

    ハイパフォーマンスコンピューティングと計算科学シンポジウム論文集   Vol. 2013   page: 91 - 91   2013.1

     More details

    Language:Japanese   Publisher:情報処理学会  

    CiNii Research

  51. Evaluation of Portability for a Real-world CFD Application with CUDA and OpenACC

    星野 哲也, 丸山 直也, 松岡 聡

    研究報告ハイパフォーマンスコンピューティング(HPC)   Vol. 2012 ( 42 ) page: 1 - 9   2012.7

     More details

    Language:Japanese  

    Computational fluid dynamics (CFD) applications, used for simulations such as earthquake and meteorological prediction and the design of aircraft and high-rise buildings, are among the most important applications executed on high-speed supercomputers; GPU-based supercomputers in particular have been showing remarkable performance on them. However, GPU programming still makes it difficult to obtain high performance, which prevents legacy applications from being ported to GPU environments. We manually port UPACS, a real-world large-scale CFD application, to CUDA and evaluate it in terms of performance and porting cost, and we also carry out a preliminary evaluation of OpenACC, which is expected to provide portability across CPUs and GPUs. We present the results of these evaluations and discuss the performance problems that remain to be resolved.

    CiNii Research

  52. 大規模流体アプリケーションのGPUによる高速化手法の評価

    星野哲也, 丸山直也, 松岡聡

    先進的計算基盤システムシンポジウム論文集   Vol. 2012   page: 73 - 74   2012.5

     More details

    Language:Japanese   Publisher:情報処理学会  

    CiNii Research

  53. OpenACC Programming

    Naoya Maruyama, Tetsuya Hoshino

    Kyokai Joho Imeji Zasshi/Journal of the Institute of Image Information and Television Engineers   Vol. 66 ( 10 ) page: 817 - 822   2012

     More details

    Language:English   Publisher:一般社団法人映像情報メディア学会  

    DOI: 10.3169/itej.66.817

    Scopus

    CiNii Research

▼display all

KAKENHI (Grants-in-Aid for Scientific Research) 6

  1. Research on dynamic hardware resource mapping for optimizing system utilization

    Grant number:25K00141  2025.4 - 2030.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

      More details

    Authorship:Coinvestigator(s) 

  2. Extending the Applicability of Low-rank Structured Matrix Methods and Exploiting Diverse Computer Architectures

    Grant number:24K02949  2024.4 - 2027.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    伊田 明弘, 横田 理央, 塙 敏博, 岩下 武史, 大島 聡史, 星野 哲也, 平石 拓, 河合 直聡

      More details

    Authorship:Coinvestigator(s) 

    In this research, we enhance the functionality of a low-rank structured matrix library. In scientific computing, numerical linear algebra libraries for methods based on dense matrix operations are widely used. We extend the applicability of low-rank structured matrix methods so that dense matrix operations can be replaced by low-rank structured matrix operations, and we develop new numerical algorithms based on low-rank structured matrices. Algorithm development targets cluster computers built from modern architectures such as GPUs and FPGAs, and the implementations are optimized accordingly. We assign the most suitable architecture to each kind of low-rank structured matrix operation and study implementation methods that draw out the maximum performance of the machines, exploiting mixed-precision arithmetic and dynamic load balancing.
    We researched and developed algorithms for low-rank structured matrix methods. We proposed a method that applies low-rank approximation to rectangular matrices to compute the Moore-Penrose pseudo-inverse efficiently. For an N×M (N>M) rectangular matrix, computing the pseudo-inverse with the standard TSVD (truncated singular value decomposition) approach requires O(NM^2) operations, whereas the proposed method reduces this to O(Nk^2), where k is the matrix rank.
    We worked on extending the applicability of low-rank structured matrix methods. For the matrix eigenvalue problems required in electronic structure calculations, we proposed a method that efficiently approximates the coefficient matrix by a symmetric block low-rank (BLR) matrix, a type of low-rank structured matrix, and worked on developing a method to approximately compute all eigenvalues of the resulting BLR matrix. We also applied the lattice H-matrix, another low-rank structured matrix format, to earthquake simulations based on the integral equation method, accelerating the simulations and reducing their memory footprint.
    We conducted research on high-performance implementation of low-rank structured matrix methods. For H^2-matrices, a type of low-rank structured matrix, we developed an algorithm that exploits the low-rank property of the Schur complement via "H^2-ULV factorization" to keep the computational complexity at O(N) and also parallelizes the forward and backward substitutions. This resolves the parallelization difficulties caused by the complex structure of conventional H-matrices and achieves high scalability on large-scale GPU systems. We also examined a method to accelerate iterative solvers for linear systems whose coefficient matrix is a lattice H-matrix by exploiting mixed-precision arithmetic, and worked on applying dynamic load balancing to accelerate the lattice H-matrix-vector product, the dominant computation in the iterative solver.
    This research comprises three topics concerning low-rank structured matrix methods: (1) research and development of numerical algorithms, (2) research on extending their applicability, and (3) research on high-performance implementation. For topic (1), we proposed a pseudo-inverse computation method for rectangular matrices using low-rank approximation, presented at one international conference and one domestic meeting. For topic (2), we studied the application of low-rank structured matrix methods to matrix eigenvalue problems in electronic structure calculations and to earthquake simulations based on the integral equation method, with three domestic presentations. For topic (3), we studied efficient low-rank structured matrix operations on GPUs, resulting in one refereed journal paper, one international conference presentation, and two domestic presentations, and the work on acceleration with dynamic load balancing led to two international conference presentations and one domestic presentation.
    Substantial results have been obtained for all research topics, and the project is progressing largely as planned.
    For algorithm research and development, we will extend the low-rank approximation of rectangular matrices carried out so far and examine matrix approximation using BLR matrices and lattice H-matrices, which have more complex structures. We will further develop a method to compute the Moore-Penrose pseudo-inverse of rectangular matrices approximated by BLR or lattice H-matrices. We will also develop an all-eigenvalue computation method for BLR matrices with structures, such as those arising in electronic structure calculations, that existing BLR all-eigenvalue solvers cannot handle.
    For extending the applicability, we will continue the research on applying the lattice H-matrix method to earthquake simulations. We will also apply the above pseudo-inverse computation based on low-rank structured matrices to coil configuration optimization, as found in MRI scanners, to accelerate and scale up the computation.
    For high-performance implementation, we will examine efficient GPU implementations of the matrix-vector product, QR factorization, and all-eigenvalue computation for BLR matrices, implementing them on the latest GH200 GPU. We will study mixed-precision approaches for the iterative methods and matrix factorizations used to solve linear systems whose coefficient matrix is a lattice H-matrix or H^2-matrix. We will also introduce dynamic load balancing into lattice H-matrix computations and examine methods to make the computation efficient in distributed-memory GPU environments.

  3. A Study on Acceleration by Temporal Blocking for Real-world Applications

    Grant number:22K17898  2022.4 - 2024.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Early-Career Scientists

    Hoshino Tetsuya

      More details

    Authorship:Principal investigator 

    Grant amount: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000)

    The specific calculation pattern over a grid discretized in time and space that arises when solving differential equations numerically is called a stencil calculation, and it is an important kernel that appears frequently in various fluid simulations. Acceleration of stencil calculations has been studied extensively; the temporal blocking method is one such technique, but it has rarely been applied to real applications because it requires very complicated programming. Furthermore, since the performance of temporal blocking depends strongly on the performance parameters of the processor executing it, optimizing the blocking manually is not realistic. Therefore, in this study, the performance modeling required for auto-tuning of temporal blocking was carried out using state-of-the-art CPUs.
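
    Temporal blocking can be illustrated on a 1-D 3-point stencil: instead of sweeping the whole array once per time step, several time steps are advanced over each small block while it stays in cache, with blocks widened by a halo so their interiors remain exact. A minimal pure-Python sketch (the Jacobi update and block sizes are illustrative, not the modeled kernels):

```python
# Naive sweeps vs temporal blocking for a 1-D 3-point Jacobi stencil.
# Temporal blocking advances bt time steps on each small (cache-sized)
# block; blocks overlap by a halo of bt cells so interiors stay exact.
def step(a):
    # one sweep; the two domain-boundary cells are held fixed
    return [a[0]] + [(a[i - 1] + a[i] + a[i + 1]) / 3.0
                     for i in range(1, len(a) - 1)] + [a[-1]]

def naive(a, steps):
    for _ in range(steps):
        a = step(a)
    return a

def temporal_blocked(a, steps, block=4, bt=2):
    n = len(a)
    for _ in range(steps // bt):          # steps assumed divisible by bt
        out = list(a)
        for s in range(0, n, block):
            lo, hi = max(0, s - bt), min(n, s + block + bt)  # halo of bt
            tile = a[lo:hi]
            for _ in range(bt):           # bt time steps while in cache
                tile = step(tile)
            out[s:min(n, s + block)] = tile[s - lo:s - lo + min(block, n - s)]
        a = out
    return a

data = [0.0] * 8 + [1.0] + [0.0] * 8      # a spike in the middle
print(temporal_blocked(data, 4) == naive(data, 4))  # True
```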

  4. Construction of numerical linear algebra based on lattice H-matrices and its high-performance implementation on modern architectures

    Grant number:21H03447  2021.4 - 2024.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    Ida Akihiro

      More details

    Authorship:Coinvestigator(s) 

    We conducted research and development aimed at constructing a numerical linear algebra system based on the lattice H-matrix. We proposed an algorithm to calculate all eigenvalues for the BLR matrix, a special case of the lattice H-matrix. Research on high-performance implementation of the lattice H-matrix method was carried out. By adding efficient work-stealing functions to task parallelization languages, we successfully improved the computational performance of H-matrix partitioning and low-rank structured matrix generation on distributed memory parallel computers. Furthermore, we developed an H-matrix-vector multiplication computation method that achieves over 85% of the theoretical limit performance on computing nodes using various latest CPU architectures. Additionally, we developed a method for fast QR decomposition of BLR matrices using the MIG feature of the latest GPUs.

  5. High-performance computing and data analysis support leveraging unused cores

    Grant number:20H00580  2020.4 - 2023.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (A)

    Hanawa Toshihiro

      More details

    Authorship:Coinvestigator(s) 

    This research aims to improve the overall system performance and realize additional functions such as power control and profiling functions with low overhead by giving "extra cores" that do not directly contribute to the performance improvement of high-performance computation a role in supporting the main computation running on the CPU. We studied "UTHelper," a framework to realize such support functions at the user level.
    As a result, we realized profiling and parallelism change during execution without modifying the user program, in situ analysis using extra cores, load balancing using dynamic core allocation to speed up lattice H-matrix operations, inter-GPU communication using extra cores, and utilization of idle arithmetic units through time-space blocking.

  6. Auto-tuning Framework Focusing on Application Data Structure for Many-core Processors

    Grant number:16H06679  2016.8 - 2018.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Research Activity Start-up

    HOSHINO Tetsuya

      More details

    Nowadays, the number of computational environments using many-core processors is increasing. To bring out the performance of many-core processors, it is important to use the Vector Processing Unit (VPU) efficiently. However, knowledge of the hardware and compiler is required to use the VPU efficiently, and changes to the data structures are often required as well.
    In this research, we propose a set of compiler directives for abstraction of data layout and implement a translator for the proposed directives. Furthermore, we propose a framework design to enhance efficient vectorization and implement a BEM-BB framework using the proposed design.

▼display all

 

Teaching Experience (On-campus) 3

  1. High-Performance Computing B

    2023

  2. Advanced Lectures on Large-scale Parallel Computing

    2023

  3. Programming 2

    2023

 

Social Contribution 1

  1. 最近のFortran向けGPUプログラミング事情(JAXA内部講習会)

    Role(s):Lecturer

    2023.12

Academic Activities 2

  1. HPC Asia 2024 Local Arrangement Chair

    Role(s):Planning, management, etc.

    2024.1

     More details

    Type:Academic society, research group, etc. 

  2. xSIG 2023 Program Committee Member

    Role(s):Peer review

    2023.8

     More details

    Type:Peer review