Updated on 2026/02/25

MUKUNOKI Daichi
 
Organization
Information Technology Center Assistant Professor
Graduate School
Graduate School of Informatics
Title
Assistant Professor

Degree 3

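  1. Doctor of Engineering ( 2013.11   University of Tsukuba ) 

  2. Master of Engineering ( 2011.3   University of Tsukuba ) 

  3. Bachelor of Library and Information Science ( 2009.3   University of Tsukuba ) 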

Research Interests 9

  1. High performance computing

  2. Accurate computation

  3. Auto-tuning

  4. Numerical computation

  5. Reproducible computation

  6. Parallel Computing

  7. GPU computing

  8. Mixed precision computation

  9. Large Language Models

Research Areas 2

  1. Informatics / High-performance computing

  2. Informatics / Computer systems

Research History 19

  1. Nagoya University   Information Technology Center   Assistant Professor

    2025.4

    Country: Japan

  2. Nagoya University   Information Technology Center   Assistant Professor

    2024.12 - 2025.3

    Country: Japan

  3. Shibaura Institute of Technology   Temporary Technical Staff

    2024.4 - 2024.10

    Country: Japan

  4. Sony Interactive Entertainment Inc.   Sr. Software Engineer

    2023.11 - 2024.2

    Country: Japan

  5. Information Technology Center, The University of Tokyo   Visiting Researcher

    2021.11 - 2023.3

    Country: Japan

  6. RIKEN Center for Computational Science   Large-scale Parallel Numerical Computing Technology Research Team   Research Scientist

    2019.4 - 2023.10

    Country: Japan

  7. RIKEN Center for Computational Science   Research Scientist

    2019.4 - 2021.3

    Country: Japan

  8. RIKEN Center for Computational Science   Large Scale Parallel Computation Technology Research Team   Visiting Researcher

    2018.4 - 2019.3

    Country: Japan

  9. RIKEN Center for Computational Science   Visiting Researcher

    2018.4 - 2019.3

    Country: Japan

  10. Tokyo Woman's Christian University   Graduate School of Science   Postdoctoral Research Fellow

    2017.10 - 2019.3

    Country: Japan

  11. RIKEN Center for Computational Science   Architecture Development Team, Flagship 2020 Project   Visiting Researcher

    2017.10 - 2018.3

    Country: Japan

  12. RIKEN Advanced Institute for Computational Science   Large-scale Parallel Numerical Computing Technology Research Team, Research Division   Visiting Researcher

    2017.10 - 2018.3

    Country: Japan

  13. RIKEN Advanced Institute for Computational Science   Architecture Development Team, Flagship 2020 Project   Postdoctoral Researcher

    2017.4 - 2017.9

    Country: Japan

  14. RIKEN Advanced Institute for Computational Science   Postdoctoral Researcher

    2016.4 - 2017.3

    Country: Japan

  15. RIKEN Advanced Institute for Computational Science   Co-design Team, Flagship 2020 Project   Postdoctoral Researcher

    2015.5 - 2016.3

    Country: Japan

  16. RIKEN Advanced Institute for Computational Science   Large-scale Parallel Numerical Computing Technology Research Team, Research Division   Postdoctoral Researcher

    2014.6 - 2017.9

    Country: Japan

  17. Japan Society for the Promotion of Science   Research Fellow (PD)

    2013.12 - 2014.5

    Country: Japan

  18. Japan Society for the Promotion of Science   Research Fellow (DC2)

    2013.4 - 2013.11

    Country: Japan

  19. Nagoya University   Information Technology Center   Assistant Professor

    2025.11

Education 4

  1. University of Tsukuba   Graduate School of Systems and Information Engineering   Doctoral Program in Computer Science

    2011.4 - 2013.11

    Country: Japan

  2. University of Tsukuba   Graduate School of Systems and Information Engineering   Master's Program in Computer Science

    2009.4 - 2011.3

    Country: Japan

  3. University of Tsukuba   School of Library and Information Science

    2006.4 - 2009.3

    Country: Japan

  4. Gifu National College of Technology

    2001.4 - 2006.3

    Country: Japan

Professional Memberships 4

  1. The Japanese Society of Medical Imaging Technology

    2025.8

  2. Association for Computing Machinery (ACM)

    2025

  3. Information Processing Society of Japan

    2008

  4. Auto-Tuning Research Group

Committee Memberships 39

  1. The 15th International Conference on Parallel Processing & Applied Mathematics (PPAM 2024)   Program Committee Member  

    2024   

  2. Mini Symposium: Exploring Arithmetic and Data Representation Beyond the Standard in HPC (at ICIAM 2023)   Mini-Symposium Organizer  

    2023   

  3. Special Session: Performance Optimization and Auto-Tuning of Software on Multicore/Manycore Systems (POAT 2023) (in conjunction with MCSoC-2023)   Program Chair  

    2023   

    Committee type: Academic society

  4. The 24th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2023) (in conjunction with IPDPS 2023)   Program Committee Member  

    2023   

  5. 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2023)   Program Committee Member  

    2023   

  6. The 22nd International Conference on Computational Science (ICCS 2022)   Program Committee Member  

    2022   

  7. The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2022)   Publicity Chair  

    2022   

  8. 36th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2022)   Program Committee Member (Algorithm track)  

    2022   

  9. Auto-Tuning Research Group   Secretary (Exchange Promotion Committee)  

    2021 - 2023   

    Committee type: Academic society

  10. IPSJ Transactions on Advanced Computing Systems   Editorial Board Member  

    2020 - 2024   

    Committee type: Academic society

  11. The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20)   Research Poster Committee Member  

    2020   

  12. The 4th International Workshop on GPU Computing and AI (GCA'19) (in conjunction with CANDAR'19)   Program Committee Member  

    2019   

  13. The Fourteenth International Workshop on Automatic Performance Tuning (iWAPT2019) (in conjunction with IPDPS 2019)   Program Committee Member  

    2019   

  14. The 16th International Conference on Parallel Processing & Applied Mathematics (PPAM 2026)   Program Committee Member  

    2026.9   

    Committee type: Academic society

  15. The 2nd International Workshop on Foundational Large Language Models Advances for HPC (LLM4HPC 2026)   Program Committee Member  

    2026.6   

    Committee type: Academic society

  16. The International Conference on High Performance Computing in Asia-Pacific Region 2026 (HPCAsia2026)   Poster Chair  

    2026.1   

    Committee type: Academic society

  17. The 28th Workshop on Advances in Parallel and Distributed Computational Models (APDCM2026)   Program Committee Member  

    2026   

    Committee type: Academic society

  18. Auto-Tuning Research Group   Research Promotion Committee Member  

    2025   

  19. Special Session: Auto-Tuning for Multicore and GPU (ATMG2022) (in conjunction with MCSoC-2022)   Program Chair  

    2022   

  20. The 14th International Conference on Parallel Processing & Applied Mathematics (PPAM 2022)   Program Committee Member  

    2022   

  21. IEEE 22nd International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2021) (in conjunction with IPDPS 2021)   Program Committee Member  

    2021   

  22. Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020 January)   Program Committee Member  

    2020   

  23. The 21st IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2020) (in conjunction with IPDPS 2020)   Program Committee Member  

    2020   

  24. 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2019)   Program Committee Member  

    2019   

  25. The 20th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2019) (in conjunction with IPDPS 2019)   Program Committee Member  

    2019   

  26. Mini Symposium: Development of Numerical Computing Software on Emerging Computing Platforms (at SIAM PP 18)   Mini-Symposium Organizer  

    2018   

  27. 2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-2018)   Program Committee Member  

    2018   

  28. The 19th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2018) (in conjunction with IPDPS 2018)   Program Committee Member  

    2018   

  29. The Third International Workshop on GPU Computing and AI (GCA'18) (in conjunction with CANDAR'18)   Program Committee Member  

    2018   

  30. The Thirteenth International Workshop on Automatic Performance Tuning (iWAPT2018) (in conjunction with IPDPS 2018)   Program Committee Member  

    2018   

  31. Special Session: Auto-Tuning for Multicore and GPU (ATMG 2018) (in conjunction with MCSoC-2018)   Program Committee Member  

    2018   

  32. The Second International Workshop on GPU Computing and AI (GCA'17) (in conjunction with CANDAR'17)   Program Committee Member  

    2017   

  33. Special Session: Auto-Tuning for Multicore and GPU (ATMG 2017) (in conjunction with MCSoC-17)   Program Committee Member  

    2017   

  34. The 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017) (in conjunction with IPDPS 2017)   Program Committee Member  

    2017   

  35. The Twelfth International Workshop on Automatic Performance Tuning (iWAPT2017) (in conjunction with IPDPS 2017)   Program Committee Member  

    2017   

  36. The First International Workshop on GPU Computing and Applications (GCA'16) (in conjunction with CANDAR'16)   Program Committee Member  

    2016   

  37. The 17th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2016) (in conjunction with IPDPS 2016)   Program Committee Member  

    2016   

  38. The 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015) (in conjunction with IPDPS 2015)   Program Committee Member  

    2015   

  39. The 15th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2014) (in conjunction with IPDPS 2014)   Program Committee Member  

    2014   

Awards 10

  1. Best Paper Award

    2026.1   The 1st International Workshop on Foundational Large Language Models Advances for HPC in Asia (LLM4HPCAsia 2026)   Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study

    Tetsuya Hoshino, Shun-ichiro Hayashi, Daichi Mukunoki, Takahiro Katagiri, Toshihiro Hanawa

  2. Best Paper Award

    2023.12   16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2023)   Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor

    Daichi Mukunoki, Masatoshi Kawai, Toshiyuki Imamura

    Award type: Award from international society, conference, symposium, etc.

  3. Research Poster Award 2nd Place Winner

    2022.6   ISC High Performance 2022   A Fast Infinite Precision Inner Product using Ozaki Scheme and Dot2, and Its Application to Reproducible Conjugate Gradient Solvers

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura

  4. RIKEN Ohbu Award 2021

    2022.3  

  5. Research Poster Award

    2021.6   ISC High Performance 2021   Accurate Matrix Multiplication on Binary128 using Ozaki Scheme

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura

  6. Best Research Poster Award

    2019.9   Russian Supercomputing Days   Accurate and Reproducible Linear Algebra Operations for Many-core Architectures

    Daichi Mukunoki, Takeshi Ogita, Katsuhisa Ozaki

  7. PRACE-ISC Research Poster Award 2017

    2017.6   ISC High Performance 2017   Implementation & Evaluation of 2.5D Matrix Multiplication on K Computer

    Daichi Mukunoki, Toshiyuki Imamura

  8. IPSJ Yamashita SIG Research Award

    2016   Information Processing Society of Japan  

  9. IPSJ Computer Science Research Award for Young Scientists

    2013   Information Processing Society of Japan  

  10. Young Researcher Award

    2013   IPSJ Special Interest Group on System Architecture  

Papers 82

  1. Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX Reviewed Open Access

    Xuanzhengbo Ren, Yuta Kawai, Hirofumi Tomita, Seiya Nishizawa, Takahiro Katagiri, Tetsuya Hoshino, Daichi Mukunoki, Masatoshi Kawai, Toru Nagai

    Proceedings of the 2025 International Conference on High Performance Computing in Asia-Pacific Region Workshops     page: 36 - 44   2025.2

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:ACM  

    DOI: 10.1145/3703001.3724385

  2. Performance evaluation and modelling of single-precision matrix multiplication on Cerebras CS-2 Reviewed

    Ryunosuke Matsuzaki, Daichi Mukunoki, Takaaki Miyajima

    SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis     page: 727 - 731   2024.11

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    DOI: 10.1109/scw63240.2024.00101

  3. Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study. Reviewed

    Tetsuya Hoshino, Shun-ichiro Hayashi, Daichi Mukunoki, Takahiro Katagiri, Toshihiro Hanawa

    Proc. the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops (SCA/HPCAsiaWS '26) - The 1st International Workshop on Foundational Large Language Models Advances for HPC in Asia (LLM4HPCAsia 2026)     page: 353 - 360   2026.1

    Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3784828.3785335

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2026w.html#HoshinoHMKH26

  4. Sparse Iterative Solvers Using High-Precision Arithmetic with Quasi Multi-Word Algorithms. Reviewed

    Daichi Mukunoki, Katsuhisa Ozaki

    Proc. 2025 IEEE 18th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC 2025)     page: 33 - 40   2025.12

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/MCSoC67473.2025.00016

    Other Link: https://dblp.uni-trier.de/db/conf/mcsoc/mcsoc2025.html#MukunokiO25

  5. DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme Reviewed

    Daichi Mukunoki

    Proc. the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops (SCA/HPCAsiaWS '26) - ExHET'26: The Fifth International Workshop on Extreme Heterogeneity Solutions   Vol. abs/2508.00441   page: 303 - 311   2025.8

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)  

    As the demand for AI computation rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable to scientific computations due to accuracy requirements. The Ozaki scheme - an accurate matrix multiplication method proposed by Ozaki et al. in 2012 - enables FP64 matrix multiplication (DGEMM) using low-precision matrix multiplication units, such as FP16 Tensor Cores. This approach has since been extended to utilize integer arithmetic, offering lower computational cost compared to floating-point-based implementations. In fact, it has achieved higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores designed for AI workloads. However, recent trends in AI-oriented processors have shifted toward improving the performance of low-precision floating-point operations, such as FP8, rather than integer operations. Motivated by this shift, this study revisits the use of low-precision floating-point operations in the Ozaki scheme. Specifically, we explore the use of FP8 Tensor Cores. In addition, for processors that support very slow or no hardware-based FP64 operations, we also consider FP64 arithmetic emulation based on integer arithmetic. This completely eliminates hardware FP64 instructions. Furthermore, we explore the use of blocking in the inner-product dimension to accelerate FP16-based implementations. We demonstrate the effectiveness of these methods by evaluating the performance on an NVIDIA RTX Blackwell architecture GPU.

    DOI: 10.1145/3784828.3785017

    Other Link: https://arxiv.org/pdf/2508.00441v3

  6. An Algorithm Portfolio Approach for Parameter Tuning in Coherent Ising Machines. Reviewed

    Tatsuro Hanyu, Takahiro Katagiri, Daichi Mukunoki, Tetsuya Hoshino

    Proc. 2025 Thirteenth International Symposium on Computing and Networking Workshops (CANDARW) - 17th International Workshop on Parallel and Distributed Algorithms and Applications (PDAA 2025)     page: 142 - 148   2025

    Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1109/CANDARW68385.2025.00032

    Other Link: https://dblp.uni-trier.de/db/conf/candar/candar2025w.html#HanyuKMH25

  7. Extension of accurate numerical algorithms for matrix multiplication based on error-free transformation Reviewed

    Katsuhisa Ozaki, Daichi Mukunoki, Takeshi Ogita

    Japan Journal of Industrial and Applied Mathematics   Vol. 42 ( 1 ) page: 1 - 20   2024.10

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Springer Science and Business Media LLC  

    DOI: 10.1007/s13160-024-00677-z

    Other Link: https://link.springer.com/article/10.1007/s13160-024-00677-z/fulltext.html

  8. Reduced-Precision and Reduced-Exponent Formats for Accelerating Adaptive Precision Sparse Matrix–Vector Product Reviewed Open Access

    Stef Graillat, Fabienne Jézéquel, Theo Mary, Roméo Molina, Daichi Mukunoki

    Lecture Notes in Computer Science   Vol. 14803   page: 17 - 30   2024.8

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer Nature Switzerland  

    DOI: 10.1007/978-3-031-69583-4_2

  9. Mixed-precision conjugate gradient algorithm using the groupwise update strategy Reviewed

    Kensuke Aihara, Katsuhisa Ozaki, Daichi Mukunoki

    Japan Journal of Industrial and Applied Mathematics     2024.2

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Springer Science and Business Media LLC  

    DOI: 10.1007/s13160-024-00644-8

    Other Link: https://link.springer.com/article/10.1007/s13160-024-00644-8/fulltext.html

  10. Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor Reviewed

    Daichi Mukunoki, Masatoshi Kawai, Toshiyuki Imamura

    2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)     page: 608 - 615   2023.12

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    DOI: 10.1109/mcsoc60832.2023.00094

  11. Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors Reviewed

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura

    Parallel Processing and Applied Mathematics     page: 40 - 54   2023.4

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer International Publishing  

    DOI: 10.1007/978-3-031-30442-2_4

  12. Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs Reviewed

    Daichi Mukunoki, Yusuke Hirota, Toshiyuki Imamura

    Proc. 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)     page: 234 - 241   2021.12

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)  

  13. A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow. Reviewed

    Takeyuki Harayama, Shuhei Kudo, Daichi Mukunoki, Toshiyuki Imamura, Daisuke Takahashi

    Proc. The 2021 International Conference on Computational Science and Its Applications (ICCSA 2021), Lecture Notes in Computer Science   Vol. 12949   page: 95 - 110   2021.9

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-030-86653-2_7

    Other Link: https://dblp.uni-trier.de/db/conf/iccsa/iccsa2021-1.html#HarayamaKMIT21

  14. Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme Reviewed Open Access

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura

    Proc. The 50th International Conference on Parallel Processing (ICPP-2021)   ( 78 ) page: 1 - 11   2021.8

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)  

    DOI: 10.1145/3472456.3472493

  15. Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme. Reviewed Open Access

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Roman Iakymchuk

    Proc. The International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia 2021)     page: 100 - 109   2021.1

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:ACM  

    DOI: 10.1145/3432261.3432270

    Other Link: https://dblp.uni-trier.de/db/conf/hpcasia/hpcasia2021.html#MukunokiOOI21

  16. Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results? Reviewed

    Fabienne Jézéquel, Stef Graillat, Daichi Mukunoki, Toshiyuki Imamura, Roman Iakymchuk

    Proc. 13th International Workshop on Numerical Software Verification 2020 (NSV 20), Lecture Notes in Computer Science   Vol. 12549   page: 163 - 177   2020.12

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-030-63618-0_10

    Other Link: https://dblp.uni-trier.de/db/conf/vstte/vstte2020.html#JezequelGMII20

  17. Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws? Reviewed

    Jens Domke, Emil Vatai, Aleksandr Drozd, Peng Chen, Yosuke Oyama, Lingqi Zhang, Shweta Salaria, Daichi Mukunoki, Artur Podobas, Mohamed Wahib, Satoshi Matsuoka

    Proc. 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)     page: 1056 - 1065   2020.10

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE  

    Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too.
    Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.

    DOI: 10.1109/IPDPS49936.2021.00114

    Other Link: https://dblp.uni-trier.de/db/conf/ipps/ipdps2021.html#DomkeVDCO0SMPWM21

  18. Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs. Reviewed

    Daichi Mukunoki, Takeshi Ogita

    J. Comput. Appl. Math.   Vol. 372   page: 112701 - 112701   2020.7

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (scientific journal)   Publisher:Elsevier BV  

    DOI: 10.1016/j.cam.2019.112701

  19. DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions Reviewed Open Access

    Daichi Mukunoki, Katsuhisa Ozaki, Takeshi Ogita, Toshiyuki Imamura

    Proc. ISC High Performance 2020, Lecture Notes in Computer Science   Vol. 12151   page: 230 - 248   2020.6

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-030-50743-5_12

    Other Link: https://dblp.uni-trier.de/db/conf/supercomputer/isc2020.html#MukunokiOOI20

  20. Design of an FPGA-Based Matrix Multiplier with Task Parallelism. Reviewed Open Access

    Yiyu Tan, Toshiyuki Imamura, Daichi Mukunoki

    Proc. International Conference on Parallel Computing (ParCo2019), Parallel Computing: Technology Trends   Vol. 36   page: 241 - 250   2019

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IOS Press  

    DOI: 10.3233/APC200047

    Other Link: https://dblp.uni-trier.de/db/conf/parco/parco2019.html#TanIM19

  21. Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures. Reviewed

    Daichi Mukunoki, Takeshi Ogita, Katsuhisa Ozaki

    Proc. 13th International Conference on Parallel Processing and Applied Mathematics (PPAM2019), Lecture Notes in Computer Science   Vol. 12043   page: 516 - 527   2019

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-030-43229-4_44

    Other Link: https://dblp.uni-trier.de/db/conf/ppam/ppam2019-1.html#MukunokiOO19

  22. Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster. Reviewed

    Daichi Mukunoki, Toshiyuki Imamura

    Proc. International Conference on Computational Science (ICCS 2018), Lecture Notes in Computer Science   Vol. 10862   page: 853 - 858   2018.6

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-319-93713-7_85

    Other Link: https://dblp.uni-trier.de/db/conf/iccS/iccS2018-3.html#MukunokiI18

  23. Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats. Reviewed

    Toshiyuki Imamura, Daichi Mukunoki, Yusuke Hirota, Susumu Yamada, Masahiko Machida

    Proc. International Conference on Parallel Computing (ParCo2017), Advances in Parallel Computing     page: 97 - 106   2017.9

    Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IOS Press  

    DOI: 10.3233/978-1-61499-843-3-97

    Other Link: https://dblp.uni-trier.de/db/conf/parco/parco2017.html#ImamuraMHYM17

  24. Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer. Reviewed

    Daichi Mukunoki, Toshiyuki Imamura

    Proc. 12th International Conference on Parallel Processing and Applied Mathematics (PPAM2017), Lecture Notes in Computer Science   Vol. 10777   page: 348 - 358   2017

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-319-78024-5_31

    Other Link: https://dblp.uni-trier.de/db/conf/ppam/ppam2017-1.html#MukunokiI17

  25. Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs. Reviewed

    Daichi Mukunoki, Toshiyuki Imamura, Daisuke Takahashi

    Proc. IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-16)     page: 377 - 384   2016

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    DOI: 10.1109/MCSoC.2016.32

    Other Link: https://dblp.uni-trier.de/db/conf/mcsoc/mcsoc2016.html#MukunokiIT16

  26. Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation. Reviewed

    Daichi Mukunoki, Toshiyuki Imamura

    Proc. IEEE International Conference on Cluster Computing (Cluster 2016)     page: 144 - 145   2016

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    DOI: 10.1109/CLUSTER.2016.77

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/cluster/cluster2016.html#MukunokiI16

  27. Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs. Reviewed

    Daichi Mukunoki, Toshiyuki Imamura, Daisuke Takahashi

    Proc. 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP 2015)     page: 642 - 650   2015

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    DOI: 10.1109/PDP.2015.66

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/pdp/pdp2015.html#MukunokiIT15

  28. Implementation and Evaluation of Triple and Quadruple Precision Floating-point Operations on GPUs Reviewed Open Access

      Vol. 6 ( 1 ) page: 66 - 77   2013.1

    Authorship:Lead author, Corresponding author   Language:Japanese  

    Open Access

    CiNii Research

    researchmap

    Other Link: http://id.nii.ac.jp/1001/00089921/

  29. Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs. Reviewed

    Daichi Mukunoki, Daisuke Takahashi

    Proc. 13th International Conference on Computational Science and Its Applications (ICCSA 2013), Part V, Lecture Notes in Computer Science   Vol. 7975   page: 211 - 223   2013

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-642-39640-3_15

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/iccsa/iccsa2013-5.html#MukunokiT13

  30. Using Quadruple Precision Arithmetic to Accelerate Krylov Subspace Methods on GPUs. Reviewed

    Daichi Mukunoki, Daisuke Takahashi

    Proc. 10th International Conference on Parallel Processing and Applied Mathematics (PPAM 2013), Part I, Workshop on Numerical Algorithms on Hybrid Architectures, Lecture Notes in Computer Science   Vol. 8384   page: 632 - 642   2013

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-642-55224-3_59

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/ppam/ppam2013-1.html#MukunokiT13

  31. Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs. Reviewed

    Daichi Mukunoki, Daisuke Takahashi

    Proc. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW 2012), The 13th Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-12)     page: 1378 - 1386   2012

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:IEEE Computer Society  

    DOI: 10.1109/IPDPSW.2012.175

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/ipps/ipdps2012w.html#MukunokiT12

  32. Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs Reviewed Open Access

      Vol. 2011 ( 2011 ) page: 148 - 156   2011.1

    Authorship:Lead author, Corresponding author   Language:Japanese  

    Open Access

    CiNii Research

    researchmap

  33. Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs. Reviewed

    Daichi Mukunoki, Daisuke Takahashi

    Proc. 10th International Conference on Applied Parallel and Scientific Computing (PARA 2010), Part I, Lecture Notes in Computer Science   Vol. 7133   page: 249 - 259   2010

    Authorship:Lead author, Corresponding author   Language:English   Publishing type:Research paper (international conference proceedings)   Publisher:Springer  

    DOI: 10.1007/978-3-642-28151-8_25

    researchmap

    Other Link: https://dblp.uni-trier.de/db/conf/para/para2010-1.html#MukunokiT10

  34. Single-precision Matrix Multiplication Performance on Cerebras CS-2: Evaluation and Modelling of Performance, Scalability and Energy Efficiency Reviewed Open Access

    Takaaki Miyajima, Ryunosuke Matsuzaki, Daichi Mukunoki

    Journal of Information Processing   Vol. 34   page: 132 - 139   2026

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:Information Processing Society of Japan  

    DOI: 10.2197/ipsjjip.34.132

    Open Access

    researchmap

  35. Sparse Iterative Solvers Using High-Precision Arithmetic with Quasi Multi-Word Algorithms

    Daichi Mukunoki, Katsuhisa Ozaki

    CoRR   Vol. abs/2510.13536   2025.10

    Publishing type:Research paper (scientific journal)  

    To obtain accurate results in numerical computation, high-precision arithmetic is a straightforward approach. However, most processors lack hardware support for floating-point formats beyond double precision (FP64). Double-word arithmetic (Dekker 1971) extends precision by using standard floating-point operations to represent numbers with twice the mantissa length. Building on this concept, various multi-word arithmetic methods have been proposed to further increase precision by combining additional words. Simplified variants, known as quasi algorithms, have also been introduced, which trade a certain loss of accuracy for reduced computational cost. In this study, we investigate the performance of quasi algorithms for double- and triple-word arithmetic in sparse iterative solvers based on the Conjugate Gradient method, and compare them with both non-quasi algorithms and standard FP64. We evaluate execution time on an x86 processor, the number of iterations to convergence, and solution accuracy. Although quasi algorithms require appropriate normalization to preserve accuracy - without it, convergence cannot be achieved - they can still reduce runtime when normalization is applied correctly, while maintaining accuracy comparable to full multi-word algorithms. In particular, quasi triple-word arithmetic can yield more accurate solutions without significantly increasing execution time relative to double-word arithmetic and its quasi variant. Furthermore, for certain problems, a reduction in iteration count contributes to additional speedup. Thus, quasi triple-word arithmetic can serve as a compelling alternative to conventional double-word arithmetic in sparse iterative solvers.

    DOI: 10.48550/arXiv.2510.13536

    arXiv

    researchmap

    Other Link: https://arxiv.org/pdf/2510.13536v1

  36. 3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG

    Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Satoshi Ohshima, Takahiro Katagiri

    CoRR   Vol. abs/2510.04536   2025.10

    Publishing type:Research paper (scientific journal)  

    This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. 3Dify is built upon Dify, an open-source platform for AI application development, and incorporates several state-of-the-art LLM-related technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). For 3D-CG generation support, 3Dify automates the operation of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both time and monetary costs associated with external API calls by leveraging their own computational resources.

    DOI: 10.48550/arXiv.2510.04536

    arXiv

    researchmap

    Other Link: https://arxiv.org/pdf/2510.04536v1

  37. VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs

    Shun-ichiro Hayashi, Koki Morita, Daichi Mukunoki, Tetsuya Hoshino, Takahiro Katagiri

    CoRR   Vol. abs/2510.00031   2025.9

    Publishing type:Research paper (scientific journal)  

    We propose VibeCodeHPC, an automatic tuning system for HPC programs based on multi-agent LLMs for code generation. VibeCodeHPC tunes programs through multi-agent role allocation and iterative prompt refinement. We describe the system configuration with four roles: Project Manager (PM), System Engineer (SE), Programmer (PG), and Continuous Delivery (CD). We introduce dynamic agent deployment and activity monitoring functions to facilitate effective multi-agent collaboration. In our case study, we convert and optimize CPU-based matrix-matrix multiplication code written in C to GPU code using CUDA. The multi-agent configuration of VibeCodeHPC achieved higher-quality code generation per unit time compared to a solo-agent configuration. Additionally, the dynamic agent deployment and activity monitoring capabilities facilitated more effective identification of requirement violations and other issues.

    DOI: 10.48550/arXiv.2510.00031

    arXiv

    researchmap

    Other Link: https://arxiv.org/pdf/2510.00031v1

  38. Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach

    Tatsuro Hanyu, Takahiro Katagiri, Daichi Mukunoki, Tetsuya Hoshino

    CoRR   Vol. abs/2507.20295   2025.7

    Publishing type:Research paper (scientific journal)  

    Coherent Ising Machines (CIMs) have recently gained attention as a promising computing model for solving combinatorial optimization problems. In particular, the Chaotic Amplitude Control (CAC) algorithm has demonstrated high solution quality, but its performance is highly sensitive to a large number of hyperparameters, making efficient tuning essential. In this study, we present an algorithm portfolio approach for hyperparameter tuning in CIMs employing the Chaotic Amplitude Control with momentum (CACm) algorithm. Our method incorporates multiple search strategies, enabling flexible and effective adaptation to the characteristics of the hyperparameter space. Specifically, we propose two representative tuning methods, Method A and Method B. Method A optimizes each hyperparameter sequentially with a fixed total number of trials, while Method B prioritizes hyperparameters based on initial evaluations before applying Method A in order. Performance evaluations were conducted on the Supercomputer "Flow" at Nagoya University, using planted Wishart instances and Time to Solution (TTS) as the evaluation metric. Compared to the baseline performance with best-known hyperparameters, Method A achieved up to 1.47x improvement, and Method B achieved up to 1.65x improvement. These results demonstrate the effectiveness of the algorithm portfolio approach in enhancing the tuning process for CIMs.

    DOI: 10.48550/arXiv.2507.20295

    arXiv

    researchmap

    Other Link: https://arxiv.org/pdf/2507.20295v1

  39. Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation

    Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri

    CoRR   Vol. abs/2507.04697   2025.7

    Publishing type:Research paper (scientific journal)  

    Generative AI technology based on Large Language Models (LLMs) has been developed and applied to assist with or automatically generate program code. In this paper, we evaluate the capability of existing general-purpose LLMs to generate Basic Linear Algebra Subprograms (BLAS) code for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of reasoning models. Both were released in April 2025. For routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the generated code is faster than the reference code.

    DOI: 10.48550/arXiv.2507.04697

    arXiv

    researchmap

    Other Link: https://arxiv.org/pdf/2507.04697v1

  40. Application of AT to Parameter Tuning in Coherent Ising Machines

    羽生達郎, 片桐孝洋, 森下誠, 高橋一郎, 河合直聡, 椋木大地, 星野哲也, 永井亨

    Proceedings of the Conference on Computational Engineering and Science (CD-ROM)   Vol. 30   page: 957 - 960   2025.6

  41. A Study on GPT-Model-Assisted Implementation of Numerical Computation Code, Using BLAS Code as a Case Study

    椋木大地, 林俊一郎, 星野哲也, 片桐孝洋

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-200 )   2025

  42. Evaluation of GPU Code Development with Claude Code Targeting GeoFEM

    星野哲也, 林俊一郎, 椋木大地, 片桐孝洋, 塙敏博

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-201 )   2025

  43. Construction and Evaluation of a Deep-Learning-Based Execution Time Prediction Model for Sparse Iterative Solvers

    中谷崇真, 河合直聡, 片桐孝洋, 星野哲也, 永井亨, 椋木大地

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-199 )   2025

  44. A Trial of Machine-Learning-Based Test Sequence Optimization for LAPACK Eigenvalue Routines

    樫村寛大, 片桐孝洋, 森崎修司, 星野哲也, 椋木大地

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-201 )   2025

  45. Proposal of a Method with Selectable Search Algorithms for Optimizing Performance Parameters of Coherent Ising Machines

    羽生達郎, 森下誠, 水木直也, 片桐孝洋, 椋木大地, 河合直聡, 星野哲也, 永井亨

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-198 )   2025

  46. Performance Evaluation of Multiple Pseudo-Quantum Annealers on SVM-Based Classification with Errors

    水木直也, 森下誠, 河合直聡, 片桐孝洋, 椋木大地, 星野哲也, 永井亨

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-198 )   2025

  47. Proposal of 3Dify, an LLM Agent for Procedural 3D Generation Using MCP and RAG, and Its Use of Supercomputers

    林俊一郎, 椋木大地, 片桐孝洋, 星野哲也, 大島聡史

    IPSJ SIG Technical Reports (Web)   Vol. 2025 ( HPC-200 )   2025

  48. Application of Sextuple-Precision Operations using Quasi Triple-Word Arithmetic to Sparse Iterative Solvers Open Access

    椋木大地, 尾崎克久

    IPSJ SIG Technical Reports (Web)   Vol. 2024-HPC-197 ( 11 ) page: 1 - 7   2024.12

    Authorship:Lead author   Language:Japanese   Publishing type:Research paper (conference, symposium, etc.)  

    Open Access

    J-GLOBAL

    researchmap

  49. Performance Evaluation of Adaptive-Precision SpMV with Reduced-Precision Formats

    Stef Graillat, Fabienne Jézéquel, Théo Mary, Roméo Molina, Daichi Mukunoki

    HAL   Vol. hal-04261073   2023.10

    Language:English   Publishing type:Research paper (other academic)  

    researchmap

  50. Task Scheduling Strategies for Batched BLAS on CPUs

    椋木大地, 廣田悠輔, 今村俊幸

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2021   2022

  51. A mixed-precision algorithm of the CG method using the group-wise update strategy

    AIHARA Kensuke, OZAKI Katsuhisa, MUKUNOKI Daichi

    Conference Proceedings. JSST Annual International Conference on Simulation Technology (Web)   Vol. 41st   2022

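  52. Infinite-Precision Inner Product Using the Ozaki Scheme and Its Application to Reproducible Sparse Iterative Solvers

    椋木大地, 尾崎克久, 荻田武史, 今村俊幸

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2022   2022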

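  53. Application of Error-Free Transformation of Matrix Multiplication with Unequal Splitting to High-Precision Computation

    尾崎克久, 椋木大地, 荻田武史

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2022   2022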

  54. Acceleration of Error-Free Transformation of Matrix Multiplication using GPU Tensor Cores

    OZAKI Katsuhisa, MUKUNOKI Daichi, OGITA Takeshi

    International Conference on Simulation Technology (CD-ROM)   Vol. 40th   2021

  55. White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing

    Roman Iakymchuk, Daichi Mukunoki, Artur Podobas, Fabienne Jézéquel, Toshiyuki Imamura, Norihisa Fujita, Jens Huthmann, Shuhei Kudo, Yiyu Tan, Jens Domke, Kai Torben Ohlhus, Takeshi Fukaya, Takeo Hoshi, Yuki Murakami, Maho Nakata, Takeshi Ogita, Kentaro Sano, Taisuke Boku

    CoRR   Vol. abs/2004.04628   2020.4

    Language:English   Publishing type:Research paper (conference, symposium, etc.)  

    In numerical computations, precision of floating-point computations is a key factor to determine the performance (speed and energy-efficiency) as well as the reliability (accuracy and reproducibility). However, precision generally plays a contrary role for both. Therefore, the ultimate concept for maximizing both at the same time is the minimal-precision computing through precision-tuning, which adjusts the optimal precision for each operation and data. Several studies have already been conducted on it (e.g. Precimonious and Verrou), but the scope of those studies is limited to the precision-tuning alone. Hence, in 2019 we started the Minimal-Precision Computing project to propose a broader concept of the minimal-precision computing system with precision-tuning, involving both the hardware and software stack. Specifically, our system combines (1) a precision-tuning method based on Discrete Stochastic Arithmetic (DSA), (2) arbitrary-precision arithmetic libraries, (3) fast and accurate numerical libraries, and (4) Field-Programmable Gate Array (FPGA) with High-Level Synthesis (HLS). In this white paper, we aim to provide an overview of various technologies related to minimal- and mixed-precision, to outline the future direction of the project, as well as to discuss current challenges together with our project members and guest speakers at the LSPANC 2020 workshop; https://www.r-ccs.riken.jp/labs/lpnctrt/lspanc2020jan/.

    arXiv

    researchmap

    Other Link: https://dblp.uni-trier.de/db/journals/corr/corr2004.html#abs-2004-04628

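  56. Error-Free Transformation of Matrix Multiplication Using Single-Precision Arithmetic and Tensor Cores on GPUs

    尾崎克久, 椋木大地, 荻田武史

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2020   2020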

  57. An Accurate and Fast 2-Norm Computation Method That Suppresses Overflow and Underflow Open Access

    原山赳幸, 工藤周平, 椋木大地, 今村俊幸, 高橋大介

    IPSJ SIG Technical Reports (Web)   Vol. 2020 ( HPC-177 )   2020

  58. Quadruple-Precision Matrix Multiplication in binary128 Using the Ozaki Scheme

    椋木大地, 尾崎克久, 荻田武史

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2020   2020

  59. Accurate and Reproducible BLAS Implementation Based on the Ozaki Scheme

    椋木大地, 荻田武史, 尾崎克久, 今村俊幸

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2019   2019

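  60. Implementation and Optimization of Accurate and Reproducible BLAS Routines Using a Level-3 BLAS-Based Accurate Matrix Multiplication Method Open Access

    椋木大地, 荻田武史, 尾崎克久

    IPSJ SIG Technical Reports (Web)   Vol. 2018 ( HPC-166 )   2018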

  61. Implementation and Evaluation of Distributed Parallel Matrix Multiplication Using a 2.5D Algorithm on the K Computer Open Access

    椋木大地, 今村俊幸

    IPSJ SIG Technical Reports (Web)   Vol. 2017 ( HPC-159 )   2017

  62. KMATHLIB: High Performance and Scalable Numerical Library for the K Computer

    大井祥栄, 廣田悠輔, 椋木大地, 今村俊幸

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2016   2016

  63. Implementation and Evaluation of an Eigenvalue Solver Optimized for Consumer-Grade GPUs Open Access

    今村俊幸, 椋木大地

    IPSJ SIG Technical Reports (Web)   Vol. 2016 ( HPC-157 )   2016

  64. Introduction of Research Activities for GPU Computing at Large-scale Parallel Numerical Computing Technology Research Team on AICS

    MUKUNOKI Daichi, IMAMURA Toshiyuki, TAKAHASHI Daisuke

    Plans and Future for International Collaborations on Extreme Scale Computing. 6th AICS International Symposium. RIKEN Symposium, 2016     2016

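  65. Performance Evaluation of Verified Numerical Computation for Systems of Linear Equations on Large-Scale Parallel Computers Open Access

    森倉悠介, 椋木大地, 深谷猛, 山中脩也, 大石進一

    IPSJ SIG Technical Reports (Web)   Vol. 2016 ( HPC-157 )   2016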

  66. Performance Evaluation of the Fastest GPU Eigenvalue Solver Obtained by Selecting CUDA-BLAS and Related Libraries Open Access

    今村俊幸, 椋木大地, 山田進, 町田昌彦

    IPSJ SIG Technical Reports (Web)   Vol. 2015 ( HPC-148 )   2015

  67. Multi-GPU Implementation of SYMV and GEMV Routines and Its Evaluation Open Access

    今村俊幸, 椋木大地, 山田進, 町田昌彦

    IPSJ SIG Technical Reports (Web)   Vol. 2015 ( HPC-151 )   2015

  68. An Automatic Thread-Count Selection Method for Memory-Bound BLAS Kernels on NVIDIA GPUs Open Access

    椋木大地, 今村俊幸, 高橋大介

    IPSJ SIG Technical Reports (Web)   Vol. 2015 ( HPC-150 )   2015

  69. Auto-Tuning of GEMV Kernels on NVIDIA GPUs

    椋木大地, 今村俊幸, 高橋大介

    Proceedings of the Conference on Computational Engineering and Science (CD-ROM)   Vol. 20   2015

  70. Accumulated Error in Time-Evolution Problems Using FFT

    佐々成正, 山田進, 町田昌彦, 椋木大地, 今村俊幸

    Proceedings of the JSIAM Annual Meeting (CD-ROM)   Vol. 2015   2015

  71. Accelerating Double-Double Precision Arithmetic on the K Computer and FX10 Open Access

    佐々木信一, 菱沼利彰, 藤井昭宏, 田中輝雄, 椋木大地, 今村俊幸

    IPSJ SIG Technical Reports (Web)   Vol. 2015 ( HPC-151 )   2015

  72. A Study of Short-Length Floating-Point Formats Open Access

    椋木大地, 今村俊幸

    IPSJ SIG Technical Reports (Web)   Vol. 2015 ( HPC-152 )   2015

  73. Implementation and Evaluation of CUDA-xSYMV Open Access

    今村俊幸, 椋木大地, 山田進, 町田昌彦

    IPSJ SIG Technical Reports (Web)   Vol. 2014 ( HPC-146 )   2014

  74. Implementation and Evaluation of DGEMM Using Pseudo-Double-Precision Arithmetic on Maxwell Architecture GPUs Open Access

    椋木大地, 今村俊幸

    IPSJ SIG Technical Reports (Web)   Vol. 2014 ( ARC-213 )   2014

  75. Implementation of Fast Sparse Matrix-Vector Multiplication for the CRS Format on GPUs Open Access

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports: High Performance Computing (HPC)   Vol. 2013 ( 5 ) page: 1 - 7   2013.2

    Language:Japanese  

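    Sparse matrix-vector multiplication (SpMV) is an important basic operation that is used heavily in scientific computing. This report describes a fast implementation of SpMV for the CRS format on GPUs. We targeted NVIDIA's Kepler architecture and implemented the kernel in the CUDA 5.0 environment. Building on implementation techniques previously proposed for GPUs up to the Fermi architecture, we optimized the kernel by exploiting features newly supported in the Kepler architecture and changes in its specifications. In a performance evaluation on a Kepler-architecture Tesla K20 with 200 matrices, our implementation was on average about 1.86 times faster than the double-precision CRS-format SpMV routine in cuSPARSE bundled with CUDA 5.0, and it improved performance for 177 of the matrices.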

    Open Access

    CiNii Research

    researchmap

  76. Accelerating Krylov Subspace Methods Using Quadruple-Precision Floating-Point Arithmetic on GPUs Open Access

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports (Web)   Vol. 2013 ( HPC-140 )   2013

  77. Implementation and Evaluation of Triple and Quadruple Precision Floating-point Operations on GPUs

    椋木大地, 高橋大介

    IPSJ Transactions (CD-ROM)   Vol. 2012 ( 2 )   2013

  78. Implementation of Fast Sparse Matrix-Vector Multiplication for the CRS Format on GPUs

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports (CD-ROM)   Vol. 2012 ( 6 )   2013

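  79. Implementation and Evaluation of Sparse Iterative Solvers Using Quadruple-Precision Arithmetic on GPUs Open Access

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports: High Performance Computing (HPC)   Vol. 2012 ( 37 ) page: 1 - 8   2012.12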

    Language:Japanese  

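    Krylov subspace methods, which are used as iterative solvers for sparse matrices, can require more iterations to converge, or fail to converge at all, because of rounding errors. It has been reported that using high-precision arithmetic can improve convergence in such cases. If the reduction in computation time from fewer iterations outweighs the increase in per-iteration time caused by high-precision arithmetic, the total time to solution may be shortened. We implemented the BiCGStab method, one of the Krylov subspace methods, on a GPU (Tesla M2050) in quadruple precision using Double-Double (DD) arithmetic and evaluated its performance. On the GPU, the time per iteration of quadruple-precision BiCGStab was about 1.0 to 2.2 times that of double precision, and depending on how much the iteration count was reduced, there were cases in which quadruple-precision arithmetic shortened the time to solution. This report examines the performance and effectiveness of quadruple-precision arithmetic in sparse iterative solvers on GPUs.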

    Open Access

    CiNii Research

    researchmap

  80. A Study of Triple-Precision Floating-Point Arithmetic on GPUs Open Access

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports (CD-ROM)   Vol. 2011 ( 4 )   2011

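  81. Implementation and Evaluation of Quadruple-Precision BLAS on GPUs

    椋木大地, 高橋大介

    Proceedings of the Conference on Computational Engineering and Science   Vol. 15 ( 2 )   2010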

  82. Implementation and Evaluation of Quadruple Precision BLAS on GPU Open Access

    椋木大地, 高橋大介

    IPSJ SIG Technical Reports (CD-ROM)   Vol. 2009 ( 4 )   2009

KAKENHI (Grants-in-Aid for Scientific Research) 8

  1. Application of High Precision Operation Techniques for Accelerating Scientific Computations on AI Supercomputers

    Grant number:25K24387  2025.7 - 2027.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Research Activity Start-up

  2. Development of Accurate and Validated Matrix Computation Software for Next Generation Supercomputers

    Grant number:20KK0259  2022.4 - 2023.10

    Japan Society for the Promotion of Science (JSPS)  Fund for the Promotion of Joint International Research (Fostering Joint International Research (A))

    Daichi Mukunoki

    Authorship:Principal investigator 

    Grant amount:¥9,230,000 ( Direct Cost: ¥7,100,000, Indirect Cost: ¥2,130,000 )

    researchmap

  3. Development of accurate and reproducible matrix computation library for massively parallel environments

    Grant number:19K20286  2019.4 - 2022.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Early-Career Scientists

    Mukunoki Daichi

    In this study, we developed Basic Linear Algebra Subprograms (BLAS) for massively parallel architectures that are accurate and ensure reproducibility of computation results across different environments. Focusing mainly on the Ozaki scheme, we developed a high-performance implementation of accurate and reproducible BLAS routines and demonstrated its application to sparse iterative solvers on CPUs and GPUs. As further applications, we proposed an implementation of single/double-precision matrix multiplication using low-precision arithmetic units (Tensor Cores) and a binary128 matrix multiplication using single/double-precision matrix multiplication.

    researchmap

  4. Reduced-precision formats for high-performance and energy-efficient computations

    Grant number:16K16062  2016.4 - 2019.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Young Scientists (B)

    Mukunoki Daichi

    This study explored reduced-precision formats, which have shorter bit lengths than the IEEE 32/64-bit floating-point formats, as a way to enhance the performance of numerical computations in terms of both computation speed and energy efficiency. We proposed a lightweight software implementation of reduced-precision formats and demonstrated performance improvements, in both speed and energy efficiency, on several data-intensive operations in basic linear algebra.

    researchmap

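  5. Research on the Development of Triple- and Quadruple-Precision Linear Algebra Libraries for GPU Supercomputers

    Grant number:13J01290  2013.4 - 2015.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for JSPS Fellows

    Mukunoki Daichi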

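    The purpose of this research was to carry out basic research toward high-performance triple- and quadruple-precision linear algebra libraries on GPUs, with the goal of making triple- and quadruple-precision arithmetic practical on GPU supercomputers. This fiscal year, we mainly studied efficient implementation techniques for linear algebra libraries on GPUs that support multiple arithmetic precisions. As a result, we developed an implementation technique for a fast matrix-vector multiplication routine (GEMV) that supports multiple NVIDIA GPU architectures. The implementation models the execution mechanism of programs on the GPU and uses online auto-tuning to automatically determine the thread-block size that maximizes execution efficiency. Compared with existing implementations, this prevents the performance fluctuations that depend on the execution environment and problem size, and it consistently sustains high performance. We consider the technique effective for implementing and optimizing multiple versions of a linear algebra program (for example, a BLAS routine) that differ in arithmetic precision. In addition, as an application of quadruple-precision arithmetic techniques, we implemented pseudo-double-precision arithmetic by software emulation on NVIDIA's latest GPUs, whose double-precision performance is 1/32 of their single-precision performance, and showed that a double-precision matrix multiplication routine (DGEMM) built on it outperforms the implementation that uses hardware double precision. Part of the GPU software developed this fiscal year has been released on the web as an open-source library, and we plan to continue its development.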

    researchmap

  6. Research on high-performance and high-dimensional numerical linear algebra applying an asynchronous task mechanism on the exascale computing era

    Grant number:19H04127  2019.4 - 2022.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    Imamura Toshiyuki

    Authorship:Coinvestigator(s) 

    The main objective of this research project is to study asynchronous numerical algorithms and task technologies to improve system execution efficiency in the exascale era and to realize a development framework for high-performance numerical software that is sustainable in the future. To address this issue, we will investigate existing compiler runtime technologies, identify problems related to conditional task invocation and dynamic processing of dependencies that are necessary to realize numerical algorithms and incorporate them into actual numerical libraries to achieve results that contribute not only to execution speed but also to utilization efficiency. As a result, we identified issues related to the next generation of mixed-precision computation technology.

    researchmap

  7. Theory and Application of Scalable Numerical Software on an O(100M) core environment

    Grant number:15H02709  2015.4 - 2018.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    IMAMURA Toshiyuki, YAMAMOTO Yusaku, Todo Shinji

    This research project aims to realize the high-performance numerical services investigated in the past, based on new mathematical principles, on emerging computing systems equipped with tens of thousands to hundreds of millions of processing cores. Under two main themes, `Mixed-granularity numerical kernel' and `Asynchronous numerical algorithm,' we (i) studied the theory of asynchronous numerical algorithms and the avoidance of communication and synchronization at a practical level, which led to the proposal of CAHTR and a new method for the FDTD scheme, and (ii) promoted research on core numerical infrastructure technologies, such as automatic tuning for scalable, lightweight code generation on super-many-core systems, as well as innovative research leading to next-generation numerical computation software.

    researchmap

  8. Development of Accurate Numerical Method of Linear Systems Based on De Facto Standard Library

    Grant number:15K15939  2015.4 - 2017.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for Young Scientists (B)

    MORIKURA Yusuke, OISHI Shin'ichi, Rump Siegfried M., OGITA Takeshi, OZAKI Katsuhisa, FUKAYA Takeshi, MUKUNOKI Daichi, YAMANAKA Naoya

    In this research, we implemented and evaluated numerical verification methods for large-scale systems of linear equations using de facto standard numerical software libraries on supercomputers. We also studied numerical verification algorithms for large-scale systems of linear equations.
    We clarified that accuracy can be guaranteed for large-scale problems on the K computer by combining accurate matrix multiplication with error-free transformations of floating-point arithmetic.
    The algorithms obtained in this research estimate rounding errors using a priori error estimates of floating-point arithmetic. Our programs are implemented with de facto standard numerical software libraries, so they can be executed on any supercomputer that conforms to the IEEE 754 standard.

    researchmap

 

Teaching Experience (Off-campus) 4

  1. Computer Science Experiment b

    2025.10 - 2025.11 (Department of Computer Science, School of Informatics, Nagoya University)

  2. Numerical Analysis

    2025.4 - 2025.6 (Department of Computer Science, School of Informatics, Nagoya University)

  3. Computer Science Experiment a

    2025.4 - 2025.5 (Department of Computer Science, School of Informatics, Nagoya University)

  4. Information Processing Techniques (Literacy) II

    2018.9 - 2019.1 (Tokyo Woman's Christian University)
