Updated 2026/06/22

ホワン ウェン チン
HUANG Wen Chin
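Affiliation
Graduate School of Informatics, Department of Intelligent Systems, Fundamental Informatics for Intelligent Systems, Assistant Professor
Graduate School
Graduate School of Informatics
Undergraduate School
School of Informatics, Department of Computer Science
Title
Assistant Professor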
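Profile
Received a B.S. from National Taiwan University, Taiwan, in 2018, an M.S. from Nagoya University in 2021, and a Ph.D. from the same university in 2024. Worked as a research assistant at the Institute of Information Science, Academia Sinica, Taiwan, from 2017 to 2019. Currently an assistant professor at the Graduate School of Informatics, Nagoya University. Co-organizer of the Voice Conversion Challenge 2020 and the VoiceMOS Challenge 2022. Researches applications of deep learning to speech processing, with a focus on voice conversion and speech quality assessment. Recipient of the Best Student Paper Award at ISCSLP 2018 and the Best Paper Award at APSIPA ASC 2021.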

Degrees 3

  1. Doctor of Informatics ( March 2024   Nagoya University ) 

  2. Master of Informatics ( March 2021   Nagoya University ) 

  3. Bachelor of Science ( June 2018   National Taiwan University ) 

Research Keywords 6

  1. Speech quality assessment

  2. Voice conversion

  3. Speech information processing

  4. Speech synthesis

  5. Automatic speech evaluation

  6. Deep learning

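Research Areas 2

  1. Informatics / Perceptual information processing

  2. Informatics / Perceptual information processing  / Speech information processing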

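Career 3

  1. Nagoya University   Graduate School of Informatics   Assistant Professor

    April 2024 - Present

  2. Google DeepMind   Student researcher

    April 2023 - March 2024

    Country: Japan

  3. Japan Society for the Promotion of Science   Research Fellow (DC1)

    April 2021 - March 2024

    Country: Japan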

Education 1

  1. Nagoya University   Graduate School of Informatics   Department of Intelligent Systems

    April 2021 - March 2024

    Country: Japan

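Professional Memberships 4

  1. The Institute of Electronics, Information and Communication Engineers (IEICE)

    April 2025 - Present

  2. Acoustical Society of Japan (ASJ)

    April 2024 - Present

  3. IEEE

    2020 - Present

  4. ISCA

    2019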

Committee Memberships 6

  1. The 15th International Symposium on Chinese Spoken Language Processing   Special Session Co-Chair  

    2026   

    Organization type: Academic society

  2. VoiceMOS Challenge   Organizing Committee Member  

    2024   

  3. Singing Voice Conversion Challenge   Organizing Committee Member  

    2023   

  4. VoiceMOS Challenge   Organizing Committee Member  

    2023   

  5. The VoiceMOS Challenge   Organizing Committee Member  

    2022 - 2024   

  6. Voice Conversion Challenge   Organizing Committee Member  

    2020   

 

Papers 32

  1. Severity-Controllable Pathological Text-to-Speech Synthesis for Clinical Applications Open Access

    Halpern, BM; Huang, WC; Violeta, LP; Toda, T

    IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING   Vol. 34   Pages: 573 - 582   2026

    Publication type: Research paper (scientific journal)   Publisher: IEEE Transactions on Neural Systems and Rehabilitation Engineering  

    The article presents a new pathological text-to-speech (TTS) synthesis system that has the ability to control speech severity using latent interpolations. Recognizing the difficulty of this task, our work uses a data augmentation technique to generate a single-speaker multi-severity training dataset required for training such a model. Furthermore, we show how x-vectors already contain information about the severity and leverage it as a conditioning variable for the synthesis. Finally, we propose modifications to the GradTTS architecture to enhance the duration modeling of pathological speech. We carry out objective and subjective evaluations to demonstrate that the proposed GradTTS system works well, and produces more natural, controllable, and stable pathological speech samples than the baseline TransformerTTS system.

    DOI: 10.1109/TNSRE.2026.3651761

    Web of Science

    Scopus

    PubMed

  2. MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models Open Access

    Huang, WC; Cooper, E; Toda, T

    IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   Vol. 34   Pages: 2385 - 2397   2026

    Publisher: IEEE Transactions on Audio Speech and Language Processing  

    In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the insufficient out-of-domain (OOD) generalization ability of current SSQA models is underexplored and often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and variation in the data is a key factor for robust generalization beyond training data size.

    DOI: 10.1109/TASLPRO.2026.3685883

    Web of Science

    Scopus

  3. EMBC Special Issue: Advancing Electrolaryngeal Speech Enhancement Through Speech–Text Representation Learning Open Access

    Ma D., Mi J., Li F., Violeta L.P., He J., Huang W., Kobayashi K., Toda T.

    IEEE Transactions on Biomedical Engineering     Pages: 1 - 15   2026

    Publication type: Research paper (scientific journal)   Publisher: IEEE Transactions on Biomedical Engineering  

    Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech for verbal communication. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech–text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model to inherit these representations without increasing model complexity. Additional optimization designs are performed across these stages. We introduce three fusion strategies including middle-, input-, and hybrid-level fusion strategies that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments under different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations regarding both conversion quality and intelligibility. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.

    DOI: 10.1109/tbme.2026.3694703

    Scopus

  4. HighRateMOS: sampling-rate aware modeling for speech quality assessment   Peer-reviewed

    Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-yi Lee, Yu Tsao

    IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)     December 2025

    Language: English   Publication type: Research paper (international conference proceedings)  

  5. The AudioMOS Challenge 2025   Peer-reviewed

    Wen-Chin Huang, Hui Wang, Cheng Liu, Yi-Chiao Wu, Andros Tjandra, Wei-Ning Hsu, Erica Cooper, Yong Qin, Tomoki Toda

    IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)     December 2025

    Authorship: Lead author, Corresponding author   Language: English   Publication type: Research paper (international conference proceedings)  

  6. VERSA-v2: a modular and scalable toolkit for speech and audio evaluation with expanded metrics, visualization, and LLM integration   Peer-reviewed

    Jiatong Shi, Bo-Hao Su, Shikhar Bharadwaj, Yiwen Zhao, Shih-Heng Wang, Jionghao Han, Haoran Wang, Wei Wang, Wenhao Feng, Yuxun Tang, Siddhant Arora, Jinchuan Tian, William Chen, Hye-jin Shim, Wangyou Zhang, Wen-Chin Huang, Shinji Watanabe

    IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)     December 2025

    Language: English   Publication type: Research paper (international conference proceedings)  

  7. Hierarchical Symbolic Music Generation with Variational Autoencoder-Based Bar-Wise Feature Sequences   Peer-reviewed

    Sawada K., Huang W.C., Toda T.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 299 - 304   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    This paper proposes a hierarchical symbolic music generation method based on autoregressive modeling of variational autoencoder (VAE)-based bar-wise features. The method represents music as a sequence of low-dimensional bar-level features, enabling efficient modeling of long-term structure while maintaining local coherence. It consists of a VAE-based encoder and decoder for bar-wise feature extraction and composition, and a Transformer-based feature sequence generator conditioned on chord progression. To evaluate global structural coherence, we introduce a new metric, Bar-wise Feature Similarity Distance (BFSD). Experimental results show that the proposed method improves long-term structure compared to baseline models and achieves local naturalness comparable to existing methods. The source code for BFSD is available at https://github.com/KateSawada/barwise_feature_similarity_distance.

    DOI: 10.1109/apsipaasc65261.2025.11249414

    Scopus

  8. Advancing Speech Quality Assessment Through Scientific Challenges and Open-Source Activities   Invited   Peer-reviewed

    Huang W.C.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 2552 - 2557   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    Speech quality assessment (SQA) refers to the evaluation of speech quality, and developing an accurate automatic SQA method that reflects human perception has become increasingly important, in order to keep up with the generative AI boom. In recent years, SQA has progressed to a point that researchers started to faithfully use automatic SQA in research papers as a rigorous measurement of goodness for speech generation systems. We believe that the scientific challenges and open-source activities of late have stimulated the growth in this field. In this paper, we review recent challenges as well as open-source implementations and toolkits for SQA, and highlight the importance of maintaining such activities to facilitate the development of not only SQA itself but also generative AI for speech.

    DOI: 10.1109/apsipaasc65261.2025.11249197

    Scopus

  9. An Evaluation of Supervised Virtual Microphone Estimators in Reverberant Sound Fields   Peer-reviewed

    Hattori K., Huang W.C., Takeda K., Toda T.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 125 - 130   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    This paper evaluates the generalization performance of supervised virtual microphone estimation under reverberant environments. While the previous studies have shown that virtual microphone estimators trained in reverberant time conditions can operate effectively, detailed investigations on their performance have not been conducted from various perspectives. In this paper, we clarify the generalization capability of the estimator by examining both estimation performance and its impact on array signal processing under simulated environments with different reverberation times. Furthermore, we explored strategies to improve the estimator, including different model architectures and loss functions. Experimental results show that as the reverberation time increases, the estimation performance decreases and the enhancement performance by MVDR based on steering vectors degrades significantly. Furthermore, the results suggest that enhancement performance can be improved by properly designing the model architecture and loss function.

    DOI: 10.1109/apsipaasc65261.2025.11249347

    Scopus

  10. Designing a Music Difficulty Measure for Controllable Automatic Piano Rearrangement   Peer-reviewed

    Miyaji H., Sawada K., Huang W.C., Toda T.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 246 - 251   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    Automatic piano rearrangement systems for controlling score difficulty can support players without musical rearrangement skills and help to increase their motivation to practice. While the existing systems allow difficulty adjustment in the predefined levels, more flexible control, such as local arrangement of only difficult passages and continuous or multifaceted difficulty scaling, is desirable. The conventional methods for designing difficulty measures and applying them to difficulty estimation rely on information dependent on specific score data formats and on whole score information, limiting their applicability for the data formats that can be used and making local difficulty assessment impossible. In this paper, we propose a music difficulty measure (DM) for estimating the difficulty of piano scores to address these issues. DM consists of seven features that are independent of data format and whole score information, making them applicable both as bar-level measures (DM-bar) and as whole-score measures (DM-whole). We also propose a linear regression-based music difficulty estimation method using DM: a linear regression model is trained on DM-whole to predict whole difficulty, and the same model is applied to DM-bar to estimate bar-level difficulty. In addition, statistics of the estimated bar-level difficulties over a whole score are further integrated into the model to improve the estimation accuracy. The experimental evaluation demonstrates the effectiveness of the proposed measure on the music difficulty estimation task.

    DOI: 10.1109/apsipaasc65261.2025.11249163

    Scopus

  11. Disfluency Disentanglement Enhancement in Spoken-Text-Style Transfer for Spontaneous Speech Synthesis   Peer-reviewed

    Nakata Y., Yoshioka D., Huang W.C., Toda T.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 1098 - 1103   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    Spoken-text-style transfer (STST) aims to convert a given text to a desired style while preserving its semantic content. In this paper, we focus on modeling the disfluency in spoken text, which can be useful as a preprocessing step in spontaneous speech synthesis. Previous methods suffer from unknown words and confusion between normal and disfluent words. Our proposed solutions to the above-mentioned problems include a masked language model (MLM) approach to temporally replace unknown words, and a disfluency symbol representation (DSR) to increase the discrepancy against normal words. Experimental evaluation results show that MLM improves the robustness against unknown words, and the use of DSR achieves a higher transfer accuracy. Finally, we show the potential of speaker-modeling to achieve speaker-wise disfluency control in STST.

    DOI: 10.1109/apsipaasc65261.2025.11248977

    Scopus

  12. Estimating Speaker's Seating Position from Monaural Speech in a Simulated Vehicle Interior Sound Field   Peer-reviewed

    Kaneko M., Huang W.C., Toda T.

    2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)     Pages: 625 - 629   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)   Publisher: 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2025)  

    We aim to establish a method to estimate the seating position of a speaker from monaural speech recorded by a driving recorder in a vehicle. There is a demand for identifying the seating position of a driver and passengers after the occurrence of a traffic accident. Since speech recorded with a microphone of the driving recorder is often available, we attempt to estimate the seating position using those speeches. If a person hears a sound directly with their ear in real space, it is possible to estimate the sound source direction to some extent from the characteristics of the pinna, even if it is one ear. However, the sound recorded with a mono microphone has a loss of information, making it difficult to estimate the direction. On the other hand, in a closed sound field, such as the inside of a vehicle, acoustic features of the monaural speech differ depending on the seating position from which the monaural speech reaches the microphone. Therefore, it is expected that the seating position estimation is still possible by modeling those subtle acoustic feature differences in the individual vehicles using machine learning techniques. In this paper, we investigate the possibility of estimating the seating position using the monaural speech by conducting experiments in a simulated sound field where the size of the vehicle interior and the arrangement of the microphone and seating positions are consistent during training and testing. The experimental results have demonstrated that 1) when the size of the vehicle interior and the arrangement of microphones and seats are almost the same during training and testing, the proposed method is effective for sound source localization using monaural audio, 2) if the microphone positions are different during training and testing, the performance degrades significantly, and 3) by using multiple utterances per person, the performance can be further improved.

    DOI: 10.1109/apsipaasc65261.2025.11249106

    Scopus

  13. Adjusting Bias in Anomaly Scores via Variance Minimization for Domain-Generalized Discriminative Anomalous Sound Detection   Peer-reviewed

    Matsumoto, Masaaki, Fujimura, Takuya, Huang, WenChin, Toda, Tomoki

    Proceedings of the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)     Pages: 25 - 29   October 2025

    Language: English   Publication type: Research paper (international conference proceedings)  

  14. Resolving Domain Mismatches in Electrolaryngeal Speech Enhancement With Linguistic Intermediates Open Access

    Violeta, LP; Huang, WC; Ma, D; Yamamoto, R; Kobayashi, K; Toda, T

    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING   Vol. 19 ( 5 )   Pages: 827 - 839   July 2025

    Publication type: Research paper (scientific journal)   Publisher: IEEE Journal on Selected Topics in Signal Processing  

    We investigate the use of linguistic intermediates to resolve domain mismatches in the electrolaryngeal (EL) speech enhancement task. We first propose the use of linguistic encoders to produce bottleneck feature intermediates, and use a recognition, alignment, and synthesis framework, effectively improving performance due to the removal of the timbre mismatches between the pretraining (typical) and fine-tuning (EL) data. We then further improve this by introducing discrete text intermediates, which effectively alleviate temporal mismatches between the source (EL) and target (typical) data to improve prosody modeling. Our findings show that by simply using bottleneck feature intermediates, more intelligible and naturally sounding speech can already be synthesized, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score compared to the baseline. Moreover, through the use of discrete phoneme-level intermediates, we can further improve the modeling of the temporal structure of typical speech and get another absolute improvement of 1.4% in character error rate and 0.2 in naturalness compared to the initially proposed system. Finally, we also verify these findings on a larger pseudo-EL dataset of 14 speakers and another set of 3 real-world EL speakers, which consistently show that using the phoneme-level intermediates is the most effective approach in terms of phoneme error rate. We conclude the research by summarizing the advantages and disadvantages of each proposed technique.

    DOI: 10.1109/JSTSP.2025.3584195

    Web of Science

    Scopus

  15. Investigating Factors Related to the Naturalness of Synthesized Unison Singing

    Nishizawa K., Yamamoto R., Huang W.C., Toda T.

    ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings     2025

    Publisher: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings  

    Singing voice synthesis (SVS) technology has progressed rapidly in recent years. However, vocal ensemble synthesis has not yet been widely explored. In this work, we focus on unison singing, which is to have several singers singing the same melody together. Our goal is to understand what acoustic properties affect the naturalness of the synthesized unison singing. We utilize NNSVS, an SVS toolkit that allows us to manipulate individual acoustic features, including timing, f0, and spectrum features, in a fully data-driven manner to investigate their effect in unison singing synthesis. Through listening tests, it was shown that the fluctuation in timing and f0 is an important factor in synthesizing natural unison singing. Furthermore, we discovered the potential to generate unison singing using an SVS model trained only with a single singer dataset.

    DOI: 10.1109/ICASSP49660.2025.10889744

    Scopus

  16. Serenade: A Singing Style Conversion Framework Based On Audio Infilling

    Violeta L.P., Huang W.C., Toda T.

    European Signal Processing Conference     Pages: 411 - 415   2025

    Publisher: European Signal Processing Conference  

    We propose Serenade, a novel framework for the singing style conversion (SSC) task. Although singer identity conversion has made great strides in the previous years, converting the singing style of a singer has been an unexplored research area. We find three main challenges in SSC: modeling the target style, disentangling source style, and retaining the source melody. To model the target singing style, we use an audio infilling task by predicting a masked segment of the target mel-spectrogram with a flow-matching model using the complement of the masked target mel-spectrogram along with disentangled acoustic features. On the other hand, to disentangle the source singing style, we use a cyclic training approach, where we use synthetic converted samples as source inputs and reconstruct the original source mel-spectrogram as a target. Finally, to retain the source melody better, we investigate a post-processing module using a source-filter-based vocoder and resynthesize the converted waveforms using the original F0 patterns. Our results showed that the Serenade framework can handle generalized SSC tasks with the best overall similarity score, especially in modeling breathy and mixed singing styles. We also found that resynthesizing with the original F0 patterns alleviated out-of-tune singing and improved naturalness, but found a slight tradeoff in similarity due to not changing the F0 patterns into the target style.

    DOI: 10.23919/EUSIPCO63237.2025.11226227

    Scopus

  17. Serenade: A Singing Style Conversion Framework Based on Audio Infilling.

    Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

    EUSIPCO     Pages: 411 - 415   2025

    Publication type: Research paper (international conference proceedings)  

    Other link: https://dblp.uni-trier.de/rec/conf/eusipco/2025

  18. VAE-SiFiGAN: Source-Filter HiFi-GAN Based on Variational Autoencoder Representations with Enhanced Pitch Controllability

    Ogita K., Yoneyama R., Huang W.C., Toda T.

    European Signal Processing Conference     Pages: 531 - 535   2025

    Publisher: European Signal Processing Conference  

    Source-filter HiFi-GAN (SiFi-GAN) is a neural vocoder offering fast, high-quality voice synthesis with fundamental frequency (F0) controllability. However, SiFi-GAN takes hand-crafted acoustic features from traditional signal processing as input, causing some limitations, such as sound quality degradation in F0 extrapolation. This paper proposes VAE-SiFiGAN, which learns latent representations from Mel-spectrograms via a variational autoencoder (VAE). The latent representations learned through the probabilistic framework enable SiFi-GAN to better model the stochastic components in speech signals, achieving sound quality improvements in F0 modification. Furthermore, to address the insufficient F0 controllability caused by the entanglement of Mel-spectrograms and F0 information, we propose to guide the latent representation learning process with hand-crafted features less affected by F0 and used only during training. Experimental results show that VAE-SiFiGAN achieves superior F0 controllability compared to SiFi-GAN.

    DOI: 10.23919/EUSIPCO63237.2025.11226579

    Scopus

  19. VAE-SiFiGAN: Source-Filter HiFi-GAN Based on Variational Autoencoder Representations with Enhanced Pitch Controllability.

    Kenichi Ogita, Reo Yoneyama, Wen-Chin Huang, Tomoki Toda

    EUSIPCO     Pages: 531 - 535   2025

    Publication type: Research paper (international conference proceedings)  

    Other link: https://dblp.uni-trier.de/rec/conf/eusipco/2025

  20. Music Similarity Representation Learning Focusing on Individual Instruments with Source Separation and Human Preference   Invited   Peer-reviewed   Open Access

    Imamura, T; Hashizume, Y; Huang, WC; Toda, T

    APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING   Vol. 14 ( 4 )   2025

    Language: English   Publication type: Research paper (scientific journal)   Publisher: APSIPA Transactions on Signal and Information Processing  

    This paper proposes music similarity representation learning (MSRL) based on individual instruments (InMSRL) utilizing music source separation (MSS) and human preference without requiring clean instrument stems during inference. We propose three methods that effectively improve performance. First, we introduce end-to-end fine-tuning (E2E-FT) for the Cascade approach that sequentially performs MSS and music similarity feature extraction. E2E-FT allows the model to minimize the adverse effects of a separation error on the feature extraction. Second, we propose multi-task learning for the Direct approach that directly extracts disentangled music similarity features using a single music similarity feature extractor. Multi-task learning, which is based on the disentangled music similarity feature extraction and MSS based on reconstruction with disentangled music similarity features, further enhances instrument feature disentanglement. Third, we employ perception-aware fine-tuning (PAFT). PAFT utilizes human preference, allowing the model to perform InMSRL aligned with human perceptual similarity. We conduct experimental evaluations and demonstrate that 1) E2E-FT for Cascade significantly improves objective InMSRL performance, 2) the multi-task learning for Direct is also helpful to improve disentanglement performance in the feature extraction, 3) PAFT significantly enhances the perceptual InMSRL performance, and 4) Cascade with E2E-FT and PAFT outperforms Direct with the multi-task learning and PAFT.

    DOI: 10.1561/116.20250016

    Web of Science

    Scopus

  21. Investigating Factors Related to the Naturalness of Synthesized Unison Singing

    Nishizawa, K; Yamamoto, R; Huang, WC; Toda, T

    2025 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP)     2025

  22. HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment

    Ren W., Lin Y.C., Huang W.C., Zezario R.E., Fu S.W., Huang S.F., Cooper E., Wu H., Wei H.Y., Wang H.M., Lee H.Y., Tsao Y.

    2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)     2025

    Publisher: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)  

    Modern speech quality prediction models are trained on audio data resampled to a specific sampling rate. When tested on audio with a higher sampling rate, these models can produce biased scores. We present HighRateMOS, the first non-intrusive mean opinion score (MOS) model that explicitly considers sampling rate. HighRateMOS ensembles three model variants that exploit the following information: (i) a learnable embedding of speech sampling rate, (ii) Wav2vec 2.0 self-supervised embeddings, (iii) multi-scale CNN spectral features, and (iv) MFCC features. In AudioMOS 2025 Track 3, HighRateMOS ranked first in five of eight metrics. Our experiments confirm that modeling sampling rate leads to more robust and sampling-rate-agnostic speech quality predictions.

    DOI: 10.1109/ASRU65441.2025.11434689

    Scopus

  23. VERSA-v2: A Modular and Scalable Toolkit for Speech and Audio Evaluation with Expanded Metrics, Visualization, and LLM Integration

    Shi J., Su B.H., Bharadwaj S., Zhao Y., Wang S.H., Hang J., Wang H., Wang W., Feng W., Tang Y., Topaloglu N., Arora S., Tian J., Chen W., Shim H.J., Zhang W., Huang W.C., Watanabe S.

    2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)     2025

    Publisher: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)  

    We present VERSA-v2, a major upgrade of the Versatile Evaluation of Speech and Audio (VERSA) toolkit for standardized and scalable evaluation across speech, audio, and music tasks. It features a modular, object-oriented architecture that simplifies metric integration and now supports over 100 metrics, organized into curated task-specific packs. VERSA-v2 also introduces interactive visualizations, per-metric profiling, and prompt-based evaluation using both text- and audio-based large language models (LLMs). These advancements make VERSA-v2 a robust, extensible, and LLM-enabled platform for comprehensive and interpretable speech and audio evaluation.

    DOI: 10.1109/ASRU65441.2025.11434734

    Scopus

  24. The AudioMOS Challenge 2025

    Huang W.C., Wang H., Liu C., Wu Y.C., Tjandra A., Hsu W.N., Cooper E., Qin Y., Toda T.

    2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)     2025

    Publisher: 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)  

    This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text-to-music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text-to-speech, text-to-audio, and text-to-music samples. The third track focuses on synthetic speech quality assessment in different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.

    DOI: 10.1109/ASRU65441.2025.11434660

    Scopus

  25. A review on subjective and objective evaluation of synthetic speech

    Cooper, E; Huang, WC; Tsao, Y; Wang, HM; Toda, T; Yamagishi, J

    ACOUSTICAL SCIENCE AND TECHNOLOGY   Vol. 45 ( 4 )   Pages: 161 - 183   July 2024

    Language: English   Publication type: Research paper (scientific journal)   Publisher: Acoustical Society of Japan  

    Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, their costly process design has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourcing mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early academic researchers to enrich their knowledge in this field, as well as for speech synthesis practitioners to catch up on the latest developments.

    DOI: 10.1250/ast.e24.12

    Web of Science

    Scopus

    CiNii Research

  26. 合成音声の客観評価とVoiceMOSチャレンジ

    クーパー エリカ, ホワン ウェンチン, ツァオ ユ, ワン シンミン, 戸田 智基, 山岸 順一

    日本音響学会誌   vol. 80 ( 7 )   pp. 381 - 392   July 2024

    Language: Japanese   Publisher: Acoustical Society of Japan

    DOI: 10.20697/jasj.80.7_381

    CiNii Research

  27. Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition

    Violeta, LP; Ma, D; Huang, WC; Toda, T

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   vol. 32   pp. 2777 - 2789   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE/ACM Transactions on Audio Speech and Language Processing

    We investigate state-of-the-art automatic speech recognition (ASR) systems and provide thorough investigations on training methods to adapt them to low-resourced electrolaryngeal (EL) datasets. Transfer learning is often sufficient to resolve low-resourced problems; however, in EL speech, the domain shift between the pretraining and fine-tuning data is too large to overcome, limiting the ASR performance. We propose a method of reducing the domain shift gap during transfer learning between the healthy and EL datasets by conducting an intermediate fine-tuning task that uses imperfectly synthesized EL speech. Although using imperfect synthetic speech is non-intuitive, we proved the effectiveness of this method by decreasing the character error rate by up to 6.1% compared to the baselines using naive transfer learning. To better understand the model's behavior, we analyze the latent spaces produced in each task through linguistic and identity proxy tasks and find that the intermediate fine-tuning focuses on identifying the voicing characteristics of the EL speakers. Moreover, we show the differences between a simulated EL speaker and a real EL speaker and find that simulated EL data has pronunciation differences from real EL data, showing the huge domain gap between real EL and other speech data.

    DOI: 10.1109/TASLP.2024.3402557

    Web of Science

    Scopus

  28. A Large-Scale Evaluation of Speech Foundation Models

    Yang, SW; Chang, HJ; Huang, ZL; Liu, AT; Lai, C; Wu, HB; Shi, JT; Chang, XK; Tsai, HS; Huang, WC; Feng, TH; Chi, PH; Lin, YY; Chuang, YS; Huang, TH; Tseng, WC; Lakhotia, K; Li, SW; Mohamed, A; Watanabe, S; Lee, HY

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   vol. 32   pp. 2884 - 2899   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE/ACM Transactions on Audio Speech and Language Processing

    The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific data collection and modeling. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. To bridge this gap, we establish the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB represents an ecosystem designed to evaluate foundation models across a wide range of speech processing tasks, facilitating the sharing of results on an online leaderboard and fostering collaboration through a community-driven benchmark database that aids in new development cycles. We present a unified learning framework for solving the speech processing tasks in SUPERB with the frozen foundation model followed by task-specialized lightweight prediction heads. Combining our results with community submissions, we verify that the framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models and the statistical significance and robustness of the benchmark.

    DOI: 10.1109/TASLP.2024.3389631

    Web of Science

    Scopus

  29. ELECTROLARYNGEAL SPEECH INTELLIGIBILITY ENHANCEMENT THROUGH ROBUST LINGUISTIC ENCODERS Open Access

    Violeta, LP; Huang, WC; Ma, D; Yamamoto, R; Kobayashi, K; Toda, T

    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024)   pp. 10961 - 10965   2024

    Publication type: Research paper (international conference proceedings)   Publisher: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

    We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score.

    DOI: 10.1109/ICASSP48485.2024.10447197

    Web of Science

    Scopus

    Other link: https://dblp.uni-trier.de/db/conf/icassp/icassp2024.html#VioletaHMYKT24

  30. THE VOICEMOS CHALLENGE 2024: BEYOND SPEECH QUALITY PREDICTION

    Huang, WC; Fu, SW; Cooper, E; Zezario, RE; Toda, T; Wang, HM; Yamagishi, J; Tsao, Y

    2024 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT   pp. 803 - 810   2024

    Publication type: Research paper (international conference proceedings)   Publisher: Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024

    We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of 'zoomed-in' high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.

    DOI: 10.1109/SLT61566.2024.10832295

    Web of Science

    Scopus

  31. Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data

    Huang, WC; Wu, YC; Toda, T

    IEEE SIGNAL PROCESSING LETTERS   vol. 31   pp. 2995 - 2999   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE Signal Processing Letters

    The trend of scaling up speech generation models poses the threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this letter, we investigate the training of multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the usefulness of the SA system for multi-speaker TTS training.

    DOI: 10.1109/LSP.2024.3482701

    Web of Science

    Scopus

  32. AAS-VC:非自己回帰型系列音声変換における時間対応付け学習の頑健性

    HUANG Wen-Chin, 小林和弘, 戸田智基

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2024   2024

MISC 10

  1. 音メディア品質評価技術に関する国際チャレンジ 2025

    HUANG Wen-Chin, WANG Hui, LIU Cheng, WU Yi-Chiao, TJANDRA Andros, HSU Wei-Ning, COOPER Erica, QIN Yong, 戸田智基  

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2025   2025

  2. 歌声合成を用いた斉唱の自然性に関する要因調査

    西澤佳飛, 山本龍一, HUANG Wen-Chin, 戸田智基

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2025   2025

  3. 小節特徴量を活用した楽曲の大局的構造を反映した自動作曲

    澤田桂都, HUANG Wen-Chin, 戸田智基  

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2025   2025

  4. 大局的構造生成のための小節特徴量系列モデリングに基づく階層的自動作曲

    澤田桂都, HUANG Wen-Chin, 戸田智基  

    情報処理学会研究報告(Web)   vol. 2025 ( MUS-142 )   2025

  5. 個別楽器音に基づく知覚的楽曲間類似度表現学習

    今村剛大, 橋爪優果, HUANG Wen-Chin, 戸田智基  

    情報処理学会研究報告(Web)   vol. 2025 ( MUS-142 )   2025

  6. VAE-SiFiGAN:変分自己符号化表現に基づくSiFiGAN

    荻田健一, 米山怜於, HUANG Wen-Chin, 戸田智基  

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2025   2025

  7. MOS-Bench:音声品質評価モデルの汎化能力に着目したベンチマーク

    HUANG Wen-Chin, COOPER Erica, 戸田智基

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2025   2025

  8. JATTS:比較指向日本語テキスト-音声オープンソースツールキット

    HUANG Wen-Chin, VIOLETA Lester, TODA Tomoki  

    電子情報通信学会技術研究報告(Web)   vol. 125 ( 74(SP2025 1-24) )   2025

  9. AAS-VC:非自己回帰型系列音声変換における時間対応付け学習の頑健性

    HUANG Wen-Chin, 小林和弘, 戸田智基

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2024   2024

  10. 話者匿名化したデータを用いる多話者テキスト音声合成

    HUANG Wen-Chin, WU Yi-Chiao, 戸田智基  

    日本音響学会研究発表会講演論文集(CD-ROM)   vol. 2024   2024

Presentations and talks 5

  1. Challenges in self-supervised speech representation-based voice conversion (Invited)

    Wen-Chin Huang

    ASA-ASJ Joint Meeting  December 3, 2025

    Event date: December 2025

    Language: English   Presentation type: Oral presentation (invited/special)

  2. Automatic quality assessment for speech and beyond (International conference)

    Wen-Chin Huang, Erica Cooper, Jiatong Shi

    INTERSPEECH  August 17, 2025

    Event date: August 2025

    Language: English   Presentation type: Public lecture, seminar, tutorial, course, or lecture

  3. Fundamentals, Prospectives and Challenges in Deep-learning based Voice Conversion (Invited)

    HUANG Wen-Chin

    Research Center for Information Technology Innovation (CITI), Academia Sinica  August 14, 2024

    Presentation type: Public lecture, seminar, tutorial, course, or lecture

  4. 深層学習に基づく音声変換の進展と展望 (Invited)

    音声言語情報処理研究発表会/音声研究会  October 22, 2024

    Language: Japanese   Presentation type: Oral presentation (invited/special)

  5. Automatic quality assessment for speech and beyond (Invited)

    Wen-Chin Huang

    Conversational AI Reading Group, Mila/Concordia University  May 15, 2025

    Language: English   Presentation type: Public lecture, seminar, tutorial, course, or lecture

Joint research and competitive funding projects 4

  1. Universal, Explainable and Extensible Automatic Evaluation of Synthesized Speech

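    Project number: 25K00143  February 2025 - March 2029

    Grants-in-Aid for Scientific Research  Grant-in-Aid for Scientific Research (B)

    Amount allocated: 18,720,000 yen (direct costs: 14,400,000 yen; indirect costs: 4,320,000 yen)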

  2. Audiobox Responsible Generation Grant

    November 2024

    Unrestricted Research Gift

    Role: Principal investigator

  3. Google Research Grant

    September 2024

    Unrestricted gift

    Role: Principal investigator

  4. 多元信号を用いたリアルタイム低遅延音声変換による音声コミュニケーション拡張

    Project number: 22KJ1519  March 2023 - March 2024

    Grants-in-Aid for Scientific Research  Grant-in-Aid for JSPS Fellows

    HUANG WENCHIN

    Amount allocated: 2,200,000 yen (direct costs: 2,200,000 yen)

    The purpose of this research is to apply voice conversion (VC) to realize an interactive speech production paradigm for real-world applications, with the help of multimodal signals and real-time processing techniques. In the third year, we focused on improving both fundamental VC techniques and real-time processing techniques, with particular focus on three aspects.
    (1) We organized the Singing Voice Conversion Challenge 2023, a challenge that focused on improving and promoting the task of singing voice conversion, a special application of VC. We co-organized the challenge with Tencent AI Lab, China and CMU, USA, and held a special session at ASRU 2023, a flagship conference in speech processing.
    (2) We launched the VoiceMOS Challenge 2023, the second edition of a scientific event that encouraged research in the area of automatic prediction of Mean Opinion Scores (MOS) for synthesized speech. This year the focus was on a real-world, zero-shot setting, and the challenge attracted 10 teams from academia and industry. Again, we co-organized the challenge with NII, Japan and Academia Sinica, Taiwan, and also held a special session at ASRU 2023, a flagship conference in speech processing.
    (3) We proposed a sequence-to-sequence VC model that can be executed in real time with a non-autoregressive architecture. Compared to previous works, the training pipeline is simplified, and its performance is robust against reduced training data, which is an important property for VC. The results were presented at ASJ2024, and we plan to submit a journal paper.

KAKENHI (Grants-in-Aid for Scientific Research) 1

  1. 多元信号を用いたリアルタイム低遅延音声変換による音声コミュニケーション拡張

    Project/project number: 21J20920  April 2021 - March 2024

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for JSPS Fellows

    HUANG WENCHIN

    The purpose of this research is to apply voice conversion (VC) to realize an interactive speech production paradigm for real-world applications, with the help of multimodal signals and real-time processing techniques. In the first year, the applicant focused on three aspects.
    (1) Continued improvement on fundamental VC techniques, specifically self-supervised speech representation (S3R)-based VC, an emerging trend that reduces training data requirements. The applicant released S3PRL-VC, an open-source toolkit for researchers to evaluate S3R models for VC. This work was carried out in collaboration with research institutes including Carnegie Mellon University and National Taiwan University, and the results were published at ICASSP 2021 and 2022, a top conference in signal processing.
    (2) Medical applications of VC, specifically dysarthric VC, a task that helps dysarthria patients to speak and communicate normally again. Thanks to the collaboration with Academia Sinica, Taiwan, data collection was smooth, and results were published in INTERSPEECH 2021, a top conference in speech processing.
    (3) Initial investigations on how to apply multi-modal signals to VC, specifically electrolaryngeal (EL) VC, a task that tries to enhance the robotic EL speech to become more natural. Again, thanks to the collaboration with Academia Sinica, a new dataset which contains both the visual and audio signals of patients and normal speakers was recorded, and the lip video improved the performance of ELVC. Results were published in APSIPA ASC, a conference in signal processing.

 

Courses taught (this university) 2

  1. Probability and Statistics Exercises

    2024

  2. Programming 2 Exercises

    2024

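Courses taught (outside this university) 4

  1. Data Processing Tools Exercises

    July 2025 - present (Nagoya University)

  2. Computer Science Experiments

    April 2025 - present (Nagoya University)

  3. Programming 2 Exercises

    November 2024 - present (Nagoya University)

  4. Probability and Statistics Exercises

    October 2024 - present (Nagoya University, School of Informatics)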