Updated 2025/03/21

HUANG Wen Chin
Affiliation
Graduate School of Informatics, Department of Intelligent Systems, Fundamental Intelligent Informatics, Assistant Professor
Graduate School
Graduate School of Informatics
Undergraduate School
School of Informatics, Department of Computer Science
Title
Assistant Professor
Homepage
Profile
He received his B.S. degree from National Taiwan University in 2018, and his M.S. and Ph.D. degrees from Nagoya University in 2021 and 2024, respectively. From 2017 to 2019, he was a research assistant at the Institute of Information Science, Academia Sinica, Taiwan. He is currently an Assistant Professor at the Graduate School of Informatics, Nagoya University. He was a co-organizer of the Voice Conversion Challenge 2020 and the VoiceMOS Challenge 2022. His research focuses on applications of deep learning to speech processing, centering on voice conversion and speech quality assessment. He received the Best Student Paper Award at ISCSLP 2018 and the Best Paper Award at APSIPA ASC 2021.

Degrees 3

  1. Doctor of Informatics ( March 2024   Nagoya University ) 

  2. Master of Informatics ( March 2021   Nagoya University ) 

  3. Bachelor of Science ( June 2018   National Taiwan University ) 

Research Keywords 5

  1. Voice conversion

  2. Speech quality assessment

  3. Speech processing

  4. Speech synthesis

  5. Speech information processing

Research Areas 2

  1. Informatics / Perceptual information processing

  2. Informatics / Perceptual information processing / Speech information processing

Professional Experience 2

  1. Nagoya University   Graduate School of Informatics   Assistant Professor

    April 2024 - Present

  2. Google DeepMind   Student researcher

    April 2023 - March 2024

    Country: Japan

Education 1

  1. Nagoya University   Graduate School of Informatics   Department of Intelligent Systems

    April 2021 - March 2024

    Country: Japan

Professional Memberships 1

  1. Acoustical Society of Japan

    April 2024 - Present

Committee Memberships 5

  1. VoiceMOS Challenge   Organizing Committee Member  

    2024

  2. VoiceMOS Challenge   Organizing Committee Member  

    2023

  3. Singing Voice Conversion Challenge   Organizing Committee Member  

    2023

  4. VoiceMOS Challenge   Organizing Committee Member  

    2022

  5. Voice Conversion Challenge   Organizing Committee Member  

    2020

 

Papers 8

  1. A review on subjective and objective evaluation of synthetic speech Open Access

    Cooper, E; Huang, WC; Tsao, Y; Wang, HM; Toda, T; Yamagishi, J

    ACOUSTICAL SCIENCE AND TECHNOLOGY   Vol. 45 ( 4 )   pp. 161 - 183   July 2024

    Language: English   Publication type: Research paper (academic journal)   Publisher: Acoustical Society of Japan  

    Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, its costly process design has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourcing mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early academic researchers to enrich their knowledge in this field, as well as speech synthesis practitioners to catch up on the latest developments.

    DOI: 10.1250/ast.e24.12

    Open Access

    Web of Science

    Scopus

    CiNii Research

  2. Objective evaluation of synthetic speech and the VoiceMOS Challenge

    Erica Cooper, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

    Journal of the Acoustical Society of Japan   Vol. 80 ( 7 )   pp. 381 - 392   July 2024

    Language: Japanese   Publisher: Acoustical Society of Japan  

    DOI: 10.20697/jasj.80.7_381

    CiNii Research

  3. A Large-Scale Evaluation of Speech Foundation Models Open Access

    Yang, SW; Chang, HJ; Huang, ZL; Liu, AT; Lai, C; Wu, HB; Shi, JT; Chang, XK; Tsai, HS; Huang, WC; Feng, TH; Chi, PH; Lin, YY; Chuang, YS; Huang, TH; Tseng, WC; Lakhotia, K; Li, SW; Mohamed, A; Watanabe, S; Lee, HY

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   Vol. 32   pp. 2884 - 2899   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE/ACM Transactions on Audio Speech and Language Processing  

    The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific data collection and modeling. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. To bridge this gap, we establish the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB represents an ecosystem designed to evaluate foundation models across a wide range of speech processing tasks, facilitating the sharing of results on an online leaderboard and fostering collaboration through a community-driven benchmark database that aids in new development cycles. We present a unified learning framework for solving the speech processing tasks in SUPERB with the frozen foundation model followed by task-specialized lightweight prediction heads. Combining our results with community submissions, we verify that the framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models and the statistical significance and robustness of the benchmark.

    DOI: 10.1109/TASLP.2024.3389631

    Web of Science

    Scopus

  4. ELECTROLARYNGEAL SPEECH INTELLIGIBILITY ENHANCEMENT THROUGH ROBUST LINGUISTIC ENCODERS Open Access

    Violeta, LP; Huang, WC; Ma, D; Yamamoto, R; Kobayashi, K; Toda, T

    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024)     pp. 10961 - 10965   2024

    Publication type: Research paper (international conference proceedings)   Publisher: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings  

    We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and naturally sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score.

    DOI: 10.1109/ICASSP48485.2024.10447197

    Web of Science

    Scopus

    Other link: https://dblp.uni-trier.de/db/conf/icassp/icassp2024.html#VioletaHMYKT24

  5. Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition Open Access

    Violeta, LP; Ma, D; Huang, WC; Toda, T

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   Vol. 32   pp. 2777 - 2789   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE/ACM Transactions on Audio Speech and Language Processing  

    We investigate state-of-the-art automatic speech recognition (ASR) systems and provide thorough investigations on training methods to adapt them to low-resourced electrolaryngeal (EL) datasets. Transfer learning is often sufficient to resolve low-resourced problems; however, in EL speech, the domain shift between the pretraining and fine-tuning data is too large to overcome, limiting the ASR performance. We propose a method of reducing the domain shift gap during transfer learning between the healthy and EL datasets by conducting an intermediate fine-tuning task that uses imperfectly synthesized EL speech. Although using imperfect synthetic speech is non-intuitive, we proved the effectiveness of this method by decreasing the character error rate by up to 6.1% compared to the baselines using naive transfer learning. To further understand the model's behavior, we further analyze the produced latent spaces in each task through linguistic and identity proxy tasks and find that the intermediate fine-tuning focuses on identifying the voicing characteristics of the EL speakers. Moreover, we also show the differences between a simulated EL speaker from a real EL speaker and find that simulated EL data has pronunciation differences from real EL data, showing the huge domain gap between real EL and other speech data.

    DOI: 10.1109/TASLP.2024.3402557

    Open Access

    Web of Science

    Scopus

  6. The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction

    Huang W.C., Fu S.W., Cooper E., Zezario R.E., Toda T., Wang H.M., Yamagishi J., Tsao Y.

    Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024     pp. 803 - 810   2024

    Publisher: Proceedings of 2024 IEEE Spoken Language Technology Workshop, SLT 2024  

    We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into automatic prediction of human speech ratings. There were three tracks. The first track was on predicting the quality of 'zoomed-in' high-quality samples from speech synthesis systems. The second track was to predict ratings of samples from singing voice synthesis and voice conversion with a large variety of systems, listeners, and languages. The third track was semi-supervised quality prediction for noisy, clean, and enhanced speech, where a very small amount of labeled training data was provided. Among the eight teams from both academia and industry, we found that many were able to outperform the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations like spectrograms and pitch histograms. These results showed that the challenge has advanced the field of subjective speech rating prediction.

    DOI: 10.1109/SLT61566.2024.10832295

    Scopus

  7. Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data Open Access

    Huang, WC; Wu, YC; Toda, T

    IEEE SIGNAL PROCESSING LETTERS   Vol. 31   pp. 2995 - 2999   2024

    Publication type: Research paper (academic journal)   Publisher: IEEE Signal Processing Letters  

    The trend of scaling up speech generation models poses the threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this letter, we investigate the training of multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the usefulness of the SA system for multi-speaker TTS training.

    DOI: 10.1109/LSP.2024.3482701

    Open Access

    Web of Science

    Scopus

  8. AAS-VC: Robustness of temporal alignment learning in non-autoregressive sequence-to-sequence voice conversion

    HUANG Wen-Chin, Kazuhiro Kobayashi, Tomoki Toda

    Proceedings of the Acoustical Society of Japan Research Presentation Meeting (CD-ROM)   Vol. 2024   2024

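The SUPERB paper (entry 3 above) and the MOS-prediction papers (entries 1 and 6) build on the same basic pattern: a frozen pretrained speech foundation model whose features feed a small, task-specific prediction head. Below is a minimal sketch of that pattern for utterance-level MOS regression, assuming PyTorch; DummySSLEncoder, the layer sizes, and the training hyperparameters are illustrative placeholders, not the configurations used in the papers.

import torch
import torch.nn as nn


class DummySSLEncoder(nn.Module):
    """Placeholder for a pretrained self-supervised speech encoder (e.g. wav2vec 2.0 / HuBERT).

    In practice this would be loaded from a checkpoint; a strided convolution stands in here
    so the sketch runs without downloading any weights.
    """

    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.conv = nn.Conv1d(1, feature_dim, kernel_size=400, stride=320)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> frame-level features: (batch, frames, feature_dim)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)


class MOSPredictor(nn.Module):
    """Frozen foundation model + lightweight regression head mapping an utterance to a scalar MOS."""

    def __init__(self, encoder: nn.Module, feature_dim: int = 768):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the foundation model stays frozen; only the head is trained
        self.head = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(wav)          # (batch, frames, dim)
        pooled = feats.mean(dim=1)             # temporal average pooling to an utterance vector
        return self.head(pooled).squeeze(-1)   # (batch,) predicted MOS


if __name__ == "__main__":
    model = MOSPredictor(DummySSLEncoder())
    optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
    # Toy batch: four one-second utterances at 16 kHz with made-up MOS labels in [1, 5].
    wav = torch.randn(4, 16000)
    mos = torch.tensor([3.2, 4.1, 2.5, 4.8])
    loss = nn.functional.mse_loss(model(wav), mos)
    loss.backward()
    optimizer.step()
    print(f"toy training step done, loss={loss.item():.3f}")

In a real setup the placeholder encoder would be replaced by a pretrained self-supervised checkpoint, and the head would be trained on (waveform, MOS) pairs collected from listening tests, as described in the review paper above.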
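Entry 5 above describes a staged transfer-learning recipe for electrolaryngeal ASR: pretraining on healthy speech, an intermediate fine-tuning step on imperfectly synthesized EL speech, and final fine-tuning on the small real EL dataset. The sketch below only illustrates that three-stage schedule, again assuming PyTorch; the toy datasets, the model, the loss, and the train helper are hypothetical stand-ins, not the authors' implementation.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train(model: nn.Module, dataset: TensorDataset, epochs: int, lr: float) -> None:
    """Generic training-loop placeholder; a real EL ASR recipe would use a CTC/attention loss."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # stand-in loss so the sketch runs end to end
    for _ in range(epochs):
        for feats, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(feats), targets).backward()
            optimizer.step()


def toy_dataset(n: int) -> TensorDataset:
    """Random features/targets standing in for acoustic features and recognition targets."""
    return TensorDataset(torch.randn(n, 40), torch.randn(n, 10))


if __name__ == "__main__":
    healthy_speech = toy_dataset(256)      # large healthy-speech corpus (pretraining)
    synthetic_el_speech = toy_dataset(64)  # imperfectly synthesized EL speech (intermediate step)
    real_el_speech = toy_dataset(16)       # small real EL dataset (final fine-tuning)

    model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 10))
    train(model, healthy_speech, epochs=2, lr=1e-3)       # stage 1: pretraining
    train(model, synthetic_el_speech, epochs=2, lr=1e-4)  # stage 2: intermediate fine-tuning
    train(model, real_el_speech, epochs=2, lr=1e-4)       # stage 3: fine-tune on real EL data
    print("finished staged fine-tuning sketch")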

MISC 1

  1. AAS-VC: Robustness of temporal alignment learning in non-autoregressive sequence-to-sequence voice conversion

    HUANG Wen-Chin, Kazuhiro Kobayashi, Tomoki Toda  

    Proceedings of the Acoustical Society of Japan Research Presentation Meeting (CD-ROM)   Vol. 2024   2024

Presentations 2

  1. Fundamentals, Prospectives and Challenges in Deep-learning based Voice Conversion  Invited

    HUANG Wen-Chin

    Research Center for Information Technology Innovation (CITI), Academia Sinica  August 14, 2024 

    Event type: Public lecture, seminar, tutorial, course, lecture, etc.  

  2. Advances and prospects of deep learning-based voice conversion  Invited

    Research Meeting on Spoken Language Processing / Technical Committee on Speech  October 22, 2024 

    Language: Japanese   Event type: Oral presentation (invited/special)  

Joint Research and Competitive Funding Projects 2

  1. Audiobox Responsible Generation Grant

    November 2024

    Unrestricted Research Gift

    Role: Principal investigator 

  2. Google Research Grant

    September 2024

    Unrestricted gift

    Role: Principal investigator 

KAKENHI 1

  1. Augmenting speech communication via real-time, low-latency voice conversion using multimodal signals

    Project/Project number: 21J20920  April 2021 - March 2024

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for JSPS Fellows

    HUANG WENCHIN

    The purpose of this research is to apply voice conversion (VC) to realize an interactive speech production paradigm for real-world applications, with the help of multimodal signals and real-time processing techniques. In the first year, the applicant focused on three aspects.
    (1) Continued improvement of fundamental VC techniques, specifically self-supervised speech representation (S3R)-based VC, an emerging trend that reduces training data requirements. The applicant released S3PRL-VC, an open-source toolkit for researchers to evaluate S3R models for VC. This work was conducted in collaboration with research institutes including Carnegie Mellon University and National Taiwan University, and the results were published at ICASSP 2021 and 2022, a top conference in signal processing.
    (2) Medical applications of VC, specifically dysarthric VC, a task that helps patients with dysarthria speak and communicate normally again. Thanks to the collaboration with Academia Sinica, Taiwan, data collection was smooth, and the results were published at INTERSPEECH 2021, a top conference in speech processing.
    (3) Initial investigations into how to apply multimodal signals to VC, specifically electrolaryngeal (EL) VC, a task that aims to make robotic-sounding EL speech more natural. Again, thanks to the collaboration with Academia Sinica, a new dataset containing both the visual and audio signals of patients and typical speakers was recorded, and the lip video improved the performance of EL VC. The results were published at APSIPA ASC, a conference in signal processing.

 

Courses Taught (Nagoya University) 2

  1. Programming 2 (Exercise)

    2024

  2. Probability and Statistics (Exercise)

    2024