Updated on 2024/09/24

HUANG Wen Chin
 
Organization
Graduate School of Informatics, Department of Intelligent Systems, Assistant Professor
Graduate School
Graduate School of Informatics
Undergraduate School
School of Informatics Department of Computer Science
Title
Assistant Professor
Profile
Received a B.S. from National Taiwan University, Taiwan, in 2018, an M.S. from Nagoya University in 2021, and a Ph.D. from the same university in 2024. Worked as a research assistant at the Institute of Information Science, Academia Sinica, Taiwan, from 2017 to 2019. Currently an assistant professor at the Graduate School of Informatics, Nagoya University. Co-organizer of the Voice Conversion Challenge 2020 and the VoiceMOS Challenge 2022. Researches applications of deep learning to speech processing, with a focus on voice conversion and speech quality assessment. Recipient of the ISCSLP 2018 Best Student Paper Award and the APSIPA ASC 2021 Best Paper Award.

Degree 3

  1. Doctor of Informatics ( 2024.3   Nagoya University )

  2. Master of Informatics ( 2021.3   Nagoya University )

  3. Bachelor of Science ( 2018.6   National Taiwan University ) 

Research Interests 5

  1. voice conversion

  2. speech quality assessment

  3. speech processing

  4. speech synthesis

  5. speech information processing

Research Areas 2

  1. Informatics / Perceptual information processing

  2. Informatics / Perceptual information processing / Speech information processing

Research History 2

  1. Nagoya University   Graduate School of Informatics   Assistant Professor

    2024.4

  2. Google DeepMind   Student researcher

    2023.4 - 2024.3

    Country:Japan

Education 1

  1. Nagoya University   Graduate School of Informatics   Department of Intelligent Systems

    2021.4 - 2024.3

    Country: Japan

Committee Memberships 1

  1. Voice Conversion Challenge, Organizing Committee   Organizing Committee Member  

    2020   

 

Papers 5

  1. A review on subjective and objective evaluation of synthetic speech

    Cooper Erica, Huang Wen-Chin, Tsao Yu, Wang Hsin-Min, Toda Tomoki, Yamagishi Junichi

    Acoustical Science and Technology   Vol. 45 ( 4 ) page: 161 - 183   2024.7

    Language:English   Publishing type:Research paper (scientific journal)   Publisher:ACOUSTICAL SOCIETY OF JAPAN  

    Evaluating synthetic speech generated by machines is a complicated process, as it involves judging along multiple dimensions including naturalness, intelligibility, and whether the intended purpose is fulfilled. While subjective listening tests conducted with human participants have been the gold standard for synthetic speech evaluation, their costly process design has also motivated the development of automated objective evaluation protocols. In this review, we first provide a historical view of listening test methodologies, from early in-lab comprehension tests to recent large-scale crowdsourced mean opinion score (MOS) tests. We then recap the development of automatic measures, ranging from signal-based metrics to model-based approaches that utilize deep neural networks or even the latest self-supervised learning techniques. We also describe the VoiceMOS Challenge series, a scientific event we founded that aims to promote the development of data-driven synthetic speech evaluation. Finally, we provide insights into unsolved issues in this field as well as future prospects. This review is expected to serve as an entry point for early-career researchers to enrich their knowledge in this field, as well as for speech synthesis practitioners to catch up on the latest developments.

    (A minimal code sketch of MOS aggregation appears after the paper list below.)

    DOI: 10.1250/ast.e24.12

    Web of Science

    Scopus

    CiNii Research

  2. Objective assessment of synthetic speech and the VoiceMOS Challenge

    Cooper Erica, Huang Wen-Chin, Tsao Yu, Wang Hsin-Min, Toda Tomoki, Yamagishi Junichi

    THE JOURNAL OF THE ACOUSTICAL SOCIETY OF JAPAN   Vol. 80 ( 7 ) page: 381 - 392   2024.7

    Language:Japanese   Publisher:Acoustical Society of Japan  

    DOI: 10.20697/jasj.80.7_381

    CiNii Research

  3. A Large-Scale Evaluation of Speech Foundation Models

    Yang, SW; Chang, HJ; Huang, ZL; Liu, AT; Lai, C; Wu, HB; Shi, JT; Chang, XK; Tsai, HS; Huang, WC; Feng, TH; Chi, PH; Lin, YY; Chuang, YS; Huang, TH; Tseng, WC; Lakhotia, K; Li, SW; Mohamed, A; Watanabe, S; Lee, HY

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   Vol. 32   page: 2884 - 2899   2024

    Publisher:IEEE/ACM Transactions on Audio Speech and Language Processing  

    The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific data collection and modeling. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. To bridge this gap, we establish the Speech processing Universal PERformance Benchmark (SUPERB). SUPERB represents an ecosystem designed to evaluate foundation models across a wide range of speech processing tasks, facilitating the sharing of results on an online leaderboard and fostering collaboration through a community-driven benchmark database that aids in new development cycles. We present a unified learning framework for solving the speech processing tasks in SUPERB with the frozen foundation model followed by task-specialized lightweight prediction heads. Combining our results with community submissions, we verify that the framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models and the statistical significance and robustness of the benchmark.
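
    (A minimal code sketch of this frozen-upstream setup appears after the paper list below.)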

    DOI: 10.1109/TASLP.2024.3389631

    Web of Science

    Scopus

  4. Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

    Violeta L.P., Huang W.C., Ma D., Yamamoto R., Kobayashi K., Toda T.

    ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings     page: 10961 - 10965   2024

    Publisher:ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings  

    We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech into the same latent space, while still being able to extract accurate linguistic information, creating a unified representation that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework to reduce the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that, compared to the conventional framework using mel-spectrogram input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score.

    (A generic sketch of the CER metric appears after the paper list below.)

    DOI: 10.1109/ICASSP48485.2024.10447197

    Scopus

  5. Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition

    Violeta, LP; Ma, D; Huang, WC; Toda, T

    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING   Vol. 32   page: 2777 - 2789   2024

    Publisher:IEEE/ACM Transactions on Audio Speech and Language Processing  

    We investigate state-of-the-art automatic speech recognition (ASR) systems and provide a thorough investigation of training methods for adapting them to low-resourced electrolaryngeal (EL) datasets. Transfer learning is often sufficient to resolve low-resource problems; however, in EL speech, the domain shift between the pretraining and fine-tuning data is too large to overcome, limiting ASR performance. We propose a method of reducing the domain shift gap during transfer learning between the healthy and EL datasets by conducting an intermediate fine-tuning task that uses imperfectly synthesized EL speech. Although using imperfect synthetic speech is non-intuitive, we demonstrate the effectiveness of this method by decreasing the character error rate by up to 6.1% compared to baselines using naive transfer learning. To further understand the model's behavior, we analyze the produced latent spaces in each task through linguistic and identity proxy tasks and find that the intermediate fine-tuning focuses on identifying the voicing characteristics of the EL speakers. Moreover, we examine the differences between a simulated EL speaker and a real EL speaker and find that simulated EL data has pronunciation differences from real EL data, showing the large domain gap between real EL and other speech data.

    (A sketch of this staged fine-tuning schedule appears after the paper list below.)

    DOI: 10.1109/TASLP.2024.3402557

    Web of Science

    Scopus
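
The review in paper 1 above centers on listening tests scored as a mean opinion score (MOS). As a minimal illustrative sketch, not code from any of the papers above, the following computes a MOS and a t-based 95% confidence interval for one system; the ratings are made-up example data.

```python
# Minimal sketch: aggregating listening-test ratings into a mean opinion
# score (MOS) with a t-based 95% confidence interval.
# The ratings below are hypothetical example data.
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence=0.95):
    """Return the MOS and its confidence interval for one system."""
    ratings = np.asarray(ratings, dtype=float)
    mos = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, len(ratings) - 1)
    return mos, (mos - half_width, mos + half_width)

# Example: 5-point naturalness ratings collected for one synthetic-speech system.
scores = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5, 2, 4]
mos, (low, high) = mos_with_ci(scores)
print(f"MOS = {mos:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```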
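
Paper 3 describes SUPERB's unified setup: a frozen foundation model provides layer-wise features, and only a task-specialized lightweight prediction head is trained. Below is a minimal PyTorch sketch of that idea; the learnable weighted sum over layers follows common SUPERB practice, but `WeightedSumHead` and the dummy feature tensor are illustrative stand-ins, not the SUPERB toolkit's API.

```python
# Sketch of a SUPERB-style downstream head: a learnable softmax-weighted
# sum fuses the frozen upstream model's layer features, and a small
# linear head (the only trained part) maps them to task outputs.
import torch
import torch.nn as nn

class WeightedSumHead(nn.Module):
    def __init__(self, num_layers: int, feat_dim: int, num_classes: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.head = nn.Linear(feat_dim, num_classes)  # lightweight task head

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, feat_dim), produced by a
        # frozen foundation model (no gradients flow into the upstream).
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = (weights[:, None, None, None] * layer_feats).sum(dim=0)
        pooled = fused.mean(dim=1)  # mean-pool over time for utterance tasks
        return self.head(pooled)

# Dummy features standing in for a frozen encoder's 12 layer outputs.
feats = torch.randn(12, 8, 100, 768)  # 12 layers, batch 8, 100 frames, dim 768
model = WeightedSumHead(num_layers=12, feat_dim=768, num_classes=10)
logits = model(feats)  # shape: (8, 10)
```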
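
Paper 4 reports its intelligibility gain as a 16% improvement in character error rate (CER). For reference, CER is conventionally computed as the Levenshtein edit distance between reference and hypothesis transcripts, normalized by reference length; the generic sketch below is not the paper's evaluation code.

```python
# Generic character error rate: edit distance (substitutions, deletions,
# insertions) between reference and hypothesis, divided by reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("electrolaryngeal", "electrolarngeal"))  # one deletion -> 0.0625
```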
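
Paper 5's key idea is an intermediate fine-tuning stage on imperfectly synthesized EL speech, inserted between pretraining on healthy speech and final fine-tuning on the small real-EL dataset. The following is a hedged sketch of such a three-stage schedule; the data loaders, hyperparameters, and the assumption that the model returns its ASR loss directly are all illustrative, not taken from the paper.

```python
# Hedged sketch of three-stage adaptation for low-resource EL ASR:
# healthy-speech pretraining -> intermediate fine-tuning on synthetic EL
# speech -> final fine-tuning on real EL data.
import torch

def train_one_stage(model, loader, epochs, lr):
    """Run one training stage; assumes model(feats, targets) returns a loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, targets in loader:
            loss = model(feats, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def staged_adaptation(model, healthy_loader, synthetic_el_loader, real_el_loader):
    # Stage 1: pretrain on a large healthy-speech corpus.
    model = train_one_stage(model, healthy_loader, epochs=50, lr=1e-4)
    # Stage 2: intermediate fine-tuning on synthesized EL speech to shrink
    # the healthy-to-EL domain gap before touching the scarce real data.
    model = train_one_stage(model, synthetic_el_loader, epochs=10, lr=5e-5)
    # Stage 3: final fine-tuning on the low-resource real EL dataset.
    model = train_one_stage(model, real_el_loader, epochs=10, lr=1e-5)
    return model
```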

MISC 1

  1. AAS-VC: On the Generalization Ability of Automatic Alignment Search based Non-autoregressive Sequence-to-sequence Voice Conversion.

    HUANG Wen-Chin, KOBAYASHI Kazuhiro, TODA Tomoki

    Proceedings of the Meeting of the Acoustical Society of Japan (CD-ROM)   Vol. 2024   2024

KAKENHI (Grants-in-Aid for Scientific Research) 1

  1. Augmented speech communication using multi-modal signals with real-time, low-latency voice conversion

    Grant number:21J20920  2021.4 - 2024.3

    Japan Society for the Promotion of Science  Grants-in-Aid for Scientific Research  Grant-in-Aid for JSPS Fellows