
BlueMagpie-TTS: Taiwanese-accent, Chinese–English code-switching speech synthesis

An open Taiwanese-accent text-to-speech model that handles Chinese–English code-switching: it keeps VoxCPM's acoustic stack, swaps in the Barbet language model, and cuts character error rate by about 58% on a hard test set.


Abstract. BlueMagpie-TTS is a text-to-speech (TTS) model for Taiwanese-accent Chinese and Chinese–English code-switching, open-sourced by OpenFormosa. Its core design fits in a single sentence: keep a pretrained acoustic stack (taken from VoxCPM) and replace the original text-semantic language model with Barbet. Barbet decides what to say (text semantics, prosody planning, rhythm, and stress); the acoustic stack generates the fine-grained sound. The model ships with Prof. Hung-yi Lee’s speaker vector as the default voice, used with his permission. On the internal test set, character error rate (CER) is 4.81% and word error rate (WER) is 5.36%, relative reductions of about 58.0% and 63.9% from the original reference model.

This article lets you listen first, then explains what it is, how it is assembled, how to use it, and where it still makes mistakes.

Key points

  • Keep the acoustics, swap the brain — retain VoxCPM’s pretrained acoustic stack as a whole, and replace only the text-semantic model that “decides what to say” with Barbet, joined by a bridge module.
  • Built for the Taiwanese context — it targets two often-overlooked needs at once, Taiwanese-accent Chinese and Chinese–English mixing (code-switching), so a single model handles local accent and code-switching naturally.
  • Listenable and verifiable — quality is measured by a closed loop: synthesize with TTS → transcribe back with Breeze-ASR-25 → compare character by character. The interactive demo below lets you hear it and see the transcription yourself.
  • Honest boundaries — it is not a review-free, production-grade system; output can still be wrong, and reference audio and speaker vectors must be authorized before they can be used for synthesis or distribution.

Listen first

Calling a speech model “good” means nothing — you have to hear it. The sentences below come from a 500-sentence set of “hard” Chinese sentences, deliberately seeded with English words, abbreviations, numbers, and proper nouns: exactly where everyday Taiwanese speech applications tend to break.

Our evaluation is a closed loop: hand the text to BlueMagpie-TTS to synthesize speech, hand the speech to Taiwan’s Breeze-ASR-25 speech-recognition model to transcribe it back to text, and compare character by character. How far the two differ is the character error rate (CER).

[Figure 1 diagram: input text (reference) → BlueMagpie-TTS (text → speech) → synthesized speech → Breeze-ASR-25 (speech → text) → transcript (what ASR heard) → character-by-character comparison against the reference = CER.]
Figure 1. The evaluation loop. The same text is synthesized by BlueMagpie-TTS, transcribed back by Breeze-ASR-25, and compared character by character to get CER. Each card below, when expanded, shows exactly what ASR heard in this loop.
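
The character-by-character comparison in this loop can be sketched as a Levenshtein edit distance divided by the reference length. This is a minimal sketch, not the project's evaluation script: the function names are invented for illustration, and it skips the text normalization step (the tuning setup mentions normalized CER, but the exact rules are not given in this article).

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # delete ref_char
                current[j - 1] + 1,                        # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char),  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between the two strings, divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical transcript with one wrong character out of six.
print(character_error_rate("今天天氣真好", "今天天氣很好"))
```

Because the distance counts insertions and deletions as well as substitutions, CER can exceed 100% when the transcript is much longer than the reference.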
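Listen: real output on hard sentences. In the interactive version of this article, each line has an audio player and an expandable "What did ASR hear?" panel showing what Breeze-ASR-25 transcribes the audio back into. Speaker: Prof. Hung-yi Lee (used with permission).
  1. 這個 Transformer 架構,其實就是現在所有聊天機器人的底層。 (This Transformer architecture is really what sits underneath every chatbot today.)
  2. 做完 fine-tune,我還跑了一輪 ASR 驗證確認字沒念錯。 (After the fine-tune, I also ran a round of ASR verification to confirm no characters were misread.)
  3. 做語音合成研究,少不了一塊夠力的 GPU 跟一堆乾淨的語料。 (Speech synthesis research needs a powerful GPU and a pile of clean data.)
  4. 我們把溫度調到 0.85,模型講話就從死板變得有人味了。 (We set the temperature to 0.85, and the model's speech went from stiff to human-sounding.)
  5. 大家都在喊 AGI 快來了,但連我自己都還搞不清楚我算不算。 (Everyone is shouting that AGI is almost here, but even I am not sure whether I count.)
  6. 這套號稱能讓 AI 自己變強的方法,講穿了就是讓一個模型不斷去教另一個比較笨的模型,再回頭修自己。 (This method that supposedly lets AI improve itself boils down to having one model keep teaching another, dumber model, and then going back to fix itself.)
  7. OpenAI 我都直接念英文,可是 TTS 常把 open 跟 A I 黏在一起變成怪音。 (I always read OpenAI in English, but TTS often glues "open" and "A I" together into a strange sound.)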

The last sentence is itself about how TTS tends to slur “open” and “AI” into one strange sound — and the model reads it correctly, with Breeze-ASR-25 cleanly recovering “OpenAI”, “TTS”, and “AI”. That is exactly the code-switching problem this model is meant to solve.

Want to try it yourself? Open the live demo (Hugging Face Space), drop in a Chinese–English mixed sentence, and hear the result in real time.

It still makes mistakes

Defining the boundaries matters as much as defining the uses. The same hard test set also contains cases the model does not handle well. In the line below, LLM is not pronounced cleanly enough, and ASR hears it as “LOL and” — the boundaries of English abbreviations like this are still one of the model’s weak spots.

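An honest case: abbreviation boundaries still slip. In the interactive version, expanding the card shows the discrepancy that appears when ASR transcribes the audio back.
  1. 我把整篇逐字稿丟給 LLM,叫它幫我整理成三個重點。 (I handed the whole transcript to an LLM and asked it to boil it down to three key points.)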

1. What this is

1.1 Why build this model

Speech applications in Taiwan have two often-overlooked needs: a Taiwanese accent, and Chinese–English mixing.

Take code-switching first. A single utterance can contain Chinese, English words, abbreviations, and proper nouns at once — this is the norm in Taiwan, but a hard problem for speech synthesis. Most off-the-shelf TTS models do well on pure Chinese or pure English, but stumble at the boundaries of code-switching.

Now the accent. Most models’ Chinese leans toward other Mandarin accents, and does not sound the way people in Taiwan speak.

BlueMagpie-TTS targets both at once. Its goal is simple: let a single model naturally handle Taiwanese-accent Chinese and Chinese–English mixed speech generation.

1.2 What it can do

The model supports three usage scenarios, plus a streaming mode.

| Use | One-line description |
| --- | --- |
| General synthesis | Read the text out loud directly |
| Voice cloning | Given a reference clip, output a voice that imitates that speaker |
| Specified speaker | Control the timbre with a prepared speaker vector |
| Streaming output | Return audio chunks as they are synthesized, for real-time playback |

The most common is the first: feed in some text, get back speech. Everything else is optional advanced control.

1.3 What it cannot do

Defining the scope is as important as defining the use. Keep a few bottom lines in mind before using it.

First, it is not a review-free, production-grade system. Generated speech can be wrong, and without human review it should not be used directly for real-world notifications or public playback.

Second, authorization is a hard rule. The bundled Hung-yi Lee speaker vector is authorized and can be used directly as an example. But to clone anyone else’s voice, or to use any other speaker vector, you must first obtain that person’s permission. Speaker-vector tables and the synthesized audio must not be distributed publicly without authorization.

1.4 Where the name comes from

The full project name is OpenFormosa Blue Magpie TTS. “Blue Magpie” is taken from the Taiwan Blue Magpie (Urocissa caerulea). Choosing it as the mark has three layers of meaning: the Taiwan Blue Magpie has a loud, highly recognizable call, echoing the heart of TTS — turning text into sound; its signature long tail brings a visual sense of flow and extension; and OpenFormosa (Formosa) points to the project’s footing in Taiwan and its focus on Taiwanese Mandarin.

2. What the model looks like

2.1 The core idea

A typical TTS model is one solid block: text goes in, speech comes out, and everything in between is trained together.

BlueMagpie-TTS takes a different path. It keeps an already-trained acoustic stack with good audio quality intact as a whole, and swaps out only the brain responsible for “deciding what to say,” replacing it with Barbet.

The benefit is direct. Barbet brings text understanding and prosody planning; the acoustic stack retains the pronunciation detail it has already accumulated. Each does its own job.

[Figure 2 diagram: text input → Barbet (text semantics and prosody planning; decides what to say; the swapped-in brain) → bridge module (format translation) → VoxCPM acoustic module (turns the plan into sound; pretrained, kept whole) → waveform.]
Figure 2. The core design. The blue Barbet is the text-semantic brain that is "swapped in" — it decides what to say and how; the green VoxCPM acoustic stack is kept whole and turns the plan into actual sound. The bridge module in the middle translates between the two incompatible formats.

2.2 Two off-the-shelf parts

BlueMagpie-TTS does not reinvent the wheel; it combines two off-the-shelf parts.

Barbet is the text-semantic language model, from OpenFormosa/Barbet. It is installed automatically from GitHub when you install this project.

The acoustic module is taken from VoxCPM2 (OpenBMB, Apache-2.0). It is already bundled in the project (under bluemagpie/_vendor/) and needs no separate installation.

The two have incompatible internal formats. The bridge module’s job is to translate one side’s output into a form the other understands, so the interfaces connect.

3. Installation

Clone the project, then install in editable mode. The dependent Barbet package is installed automatically.

git clone https://github.com/OpenFormosa/BlueMagpie-TTS
cd BlueMagpie-TTS
pip install -e .

To save the synthesized audio as .wav, also install soundfile:

pip install soundfile
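
After installing, a quick import check can save a confusing error later. This is a minimal sketch, assuming the importable module names are bluemagpie (as in the import used in section 4.1) and soundfile; missing_modules is a helper name invented here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["bluemagpie", "soundfile"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All modules found.")
```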

4. How to use it

This section is the heart of it. Below is how to load the model, synthesize speech, clone a voice, specify a speaker, and stream output.

4.1 Load the model

Download from Hugging Face and load:

import os
from huggingface_hub import snapshot_download
from transformers import PreTrainedTokenizerFast
from bluemagpie import BlueMagpieModel

model_dir = snapshot_download("OpenFormosa/BlueMagpie-TTS", token=True)
# Load the tokenizer straight from tokenizer.json; compatible with newer transformers (5.x)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=os.path.join(model_dir, "tokenizer.json"))
model = BlueMagpieModel.from_local(model_dir, tokenizer=tokenizer, training=False, device="cuda")

If you already have a local copy of the model files, point model_dir at that directory; everything else is the same. device can be "cuda" or "cpu", and is chosen automatically if unset. Always use training=False for inference.
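
If you would rather make the device choice explicit than leave it unset, one common pattern is to ask PyTorch whether CUDA is available. A minimal sketch: pick_device is a hypothetical helper, not part of bluemagpie, and it falls back to "cpu" when PyTorch itself cannot be imported.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The result can then be passed as device=device when loading the model.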

4.2 Basic synthesis: text to speech

The most basic use: give some text, get back speech. target_text can mix Chinese and English directly — the model handles the switching itself.

import soundfile as sf

audio = model.generate(
    target_text="這是 AI TTS code switching 測試。",
    cfg_value=2.8,
    inference_timesteps=9,
    max_len=2000,
    retry_badcase=True,
)
sf.write("output.wav", audio.squeeze().cpu().numpy(), model.sample_rate)

This uses the recommended parameters (cfg_value=2.8, inference_timesteps=9). The code’s built-in defaults are 2.0 and 10, but the tuned values are recommended; section 4.7 explains how they were chosen.

4.3 Voice cloning: imitate a speaker from a reference clip

Given a reference_wav_path, the output imitates the speaker timbre of that clip.

audio = model.generate(
    target_text="今天的會議改到下午三點。",
    reference_wav_path="speaker.wav",
    cfg_value=2.8,
    inference_timesteps=9,
)

A reminder, again: only use voices you have the right to synthesize.

4.4 Specified speaker: control timbre with a speaker vector

The model ships with Prof. Hung-yi Lee’s speaker vector as an example, used with his permission, stored at checkpoints/hung_yi_lee_speaker_centroids.pt in the model directory. Load the vector table, pull the vector for speaker ID hung_yi_lee, and pass it via speaker_centroid to set the timbre.

import os
import torch

centroids = torch.load(
    os.path.join(model_dir, "checkpoints", "hung_yi_lee_speaker_centroids.pt"),
    map_location="cpu",
    weights_only=True,
)
speaker_centroid = centroids["centroids"][centroids["speaker_ids"].index("hung_yi_lee")]

audio = model.generate(
    target_text="今天天氣真好。",
    speaker_centroid=speaker_centroid,   # or pass your own authorized speaker vector
    cfg_value=2.8,
    inference_timesteps=9,
)
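
The one-line lookup above fails with a bare ValueError from list.index when the speaker ID is not in the table. A small wrapper can give a clearer message. This sketch assumes only the table layout used in the example (a "speaker_ids" list parallel to an indexable "centroids"); get_speaker_centroid is a name invented here, and the demo uses plain lists with made-up numbers in place of tensors.

```python
def get_speaker_centroid(table, speaker_id):
    """Look up one speaker vector in a {"speaker_ids": [...], "centroids": [...]} table."""
    speaker_ids = list(table["speaker_ids"])
    if speaker_id not in speaker_ids:
        raise KeyError(f"unknown speaker {speaker_id!r}; available: {speaker_ids}")
    return table["centroids"][speaker_ids.index(speaker_id)]


# Stand-in table with made-up values; the real file holds tensors.
demo_table = {"speaker_ids": ["hung_yi_lee"], "centroids": [[0.1, 0.2, 0.3]]}
print(get_speaker_centroid(demo_table, "hung_yi_lee"))
```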

4.5 Streaming output

When you need to play while synthesizing, use generate_streaming. It is a generator that returns audio chunks one at a time.

chunks = []
for chunk in model.generate_streaming(target_text="今天天氣真好。"):
    chunks.append(chunk)
    # play or write out the chunk in real time here

Note: automatic retry (retry_badcase) is not supported in streaming mode.
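
One way to consume the stream is to write each chunk to disk as it arrives. This is a minimal sketch using only the standard library. It assumes each chunk can be flattened to a sequence of floats in [-1, 1] (for a tensor, something like chunk.flatten().tolist()), and it defaults to the 48 kHz mono format stated in section 4.8. write_stream_to_wav and fake_stream are illustrative names, not part of bluemagpie.

```python
import os
import struct
import tempfile
import wave


def write_stream_to_wav(chunks, path, sample_rate=48000):
    """Write float chunks to a 16-bit mono WAV file; return the samples written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 16-bit PCM
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            pcm = [int(max(-1.0, min(1.0, x)) * 32767) for x in chunk]
            wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


def fake_stream():
    """Stand-in for model.generate_streaming(...): two short silent chunks."""
    yield [0.0] * 480
    yield [0.0] * 480


demo_path = os.path.join(tempfile.gettempdir(), "stream_demo.wav")
print(write_stream_to_wav(fake_stream(), demo_path))
```

With the real model, passing sample_rate=model.sample_rate avoids hard-coding the rate.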

4.6 Four input modes

The features above are really different parameter combinations of the same generate interface. The table summarizes the four modes.

| Mode | Required parameters | Use |
| --- | --- | --- |
| General synthesis | target_text | Read the text out loud directly |
| Speech continuation | target_text, prompt_text, prompt_wav_path | Continue from an existing clip and its text |
| Reference clip | target_text, reference_wav_path | Imitate the speaker timbre of a reference clip |
| Reference + continuation | target_text, prompt_text, prompt_wav_path, reference_wav_path | Set the timbre and continue the speech at once |
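
Read as code, the table is a small decision over which optional arguments are set. The sketch below only illustrates the table; it is not how the library dispatches internally. input_mode is a name invented here, and the rule that prompt_text and prompt_wav_path must come as a pair is an assumption drawn from how the table groups them.

```python
def input_mode(prompt_text="", prompt_wav_path="", reference_wav_path=""):
    """Name the input mode implied by the optional generate() arguments."""
    # Assumption: the two prompt arguments are only meaningful together.
    if bool(prompt_text) != bool(prompt_wav_path):
        raise ValueError("prompt_text and prompt_wav_path must be given together")
    continuation = bool(prompt_wav_path)
    reference = bool(reference_wav_path)
    if continuation and reference:
        return "reference + continuation"
    if continuation:
        return "speech continuation"
    if reference:
        return "reference clip"
    return "general synthesis"


print(input_mode(reference_wav_path="speaker.wav"))
```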

4.7 Tuning the common parameters

Understand these few parameters and you can strike a balance between “stable” and “natural.”

| Parameter | Default | Recommended | Description |
| --- | --- | --- | --- |
| target_text | (required) | | The text to synthesize |
| prompt_text | "" | | Prompt text, paired with prompt_wav_path for continuation |
| prompt_wav_path | "" | | Prompt-clip path, for speech continuation |
| reference_wav_path | "" | | Reference-clip path, for voice cloning |
| speaker_centroid | None | | Speaker vector, for specifying timbre |
| cfg_value | 2.0 | 2.8 | Guidance strength. Higher follows the condition more closely, but too high is less natural |
| inference_timesteps | 10 | 9 | Sampling steps. More gives better quality but is slower |
| min_len / max_len | 2 / 2000 | | Lower and upper bounds on output length |
| retry_badcase | False | True | Auto-retry when an anomalous output is detected (not supported in streaming) |

The “Recommended” column is the officially tuned setting, also recorded in generation_defaults in config.json. This set was found using 500 hard Chinese sentences (the same batch demonstrated at the top of this article), together with Taiwan’s Breeze-ASR-25 speech-recognition model, optimizing normalized CER.
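
Since the tuned values also live in config.json, they can be read at runtime instead of being hard-coded. This is a minimal sketch: it assumes generation_defaults is a flat JSON object whose keys match generate() keyword arguments (the exact layout is not shown in this article), and load_generation_kwargs is a hypothetical helper.

```python
import json
import os


def load_generation_kwargs(model_dir, **overrides):
    """Read generation_defaults from config.json, then apply caller overrides."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    # Assumed layout: {"generation_defaults": {"cfg_value": ..., ...}, ...}
    kwargs = dict(config.get("generation_defaults", {}))
    kwargs.update(overrides)
    return kwargs
```

Under that assumption, a call would look like model.generate(target_text="...", **load_generation_kwargs(model_dir)).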

4.8 A few practical tips

For long text, split it into sentence-sized pieces, synthesize each, then concatenate the waveforms — adding a little fade in/out at the seams if needed. For more continuity, pass a short authorized clip from the previous segment as a prompt when synthesizing the next.
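
The split-and-join advice can be sketched with two helpers. Several choices here are assumptions rather than the project's method: the punctuation set, the linear crossfade (which overlaps neighbouring segments instead of fading each one separately), and both function names. Waveforms are shown as plain lists of floats; real model output would need converting first.

```python
import re

# Split right after full-width sentence punctuation, or after ASCII
# punctuation that is followed by whitespace (so "0.85" stays intact).
_SENTENCE_END = re.compile(r"(?<=[。!?;])\s*|(?<=[.!?;])\s+")


def split_sentences(text):
    """Split text into sentence-sized pieces, keeping the punctuation."""
    return [part for part in _SENTENCE_END.split(text) if part.strip()]


def crossfade_concat(segments, fade_samples):
    """Join waveforms, blending each overlap with a linear ramp."""
    output = []
    for segment in segments:
        segment = list(segment)
        n = min(fade_samples, len(output), len(segment))
        start = len(output) - n
        for i in range(n):
            weight = (i + 1) / (n + 1)  # rises toward the incoming segment
            output[start + i] = output[start + i] * (1 - weight) + segment[i] * weight
        output.extend(segment[n:])
    return output


print(split_sentences("我們把溫度調到 0.85。今天天氣真好!OK? Fine."))
print(len(crossfade_concat([[1.0] * 4, [0.0] * 4], fade_samples=2)))
```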

It runs without a GPU. Set device to "cpu"; it is slower, but short sentences synthesize in tens of seconds. Output is 48 kHz mono.

If you do not pass a tokenizer and rely on auto-loading, it may fail to load under transformers 5.x, or only error with “No tokenizer attached” when you call generate(). Load straight from tokenizer.json and pass it in, as in the example above, to avoid this.

5. How well it works

On the internal test set, BlueMagpie-TTS performs as follows. Lower is better.

| System | CER | WER |
| --- | --- | --- |
| BlueMagpie-TTS | 4.81% | 5.36% |
| Original reference model | 11.45% | 14.83% |
[Figure 3 chart: bar chart of CER and WER (%, lower is better) for BlueMagpie-TTS and the reference model: CER 4.81 vs. 11.45 (58.0% lower), WER 5.36 vs. 14.83 (63.9% lower).]
Figure 3. Error rates measured by the "TTS → ASR transcribe-back" loop. BlueMagpie-TTS reaches CER 4.81% and WER 5.36%, about 58.0% and 63.9% lower than the original reference model (11.45% / 14.83%). Lower means the synthesized speech is transcribed back to the original text more accurately.

Relative to the original reference model, character error rate drops by about 58.0% and word error rate by about 63.9%.
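
These percentages are relative reductions, not percentage-point differences. They can be checked directly from the reported error rates; relative_reduction is just an illustrative helper name.

```python
def relative_reduction(baseline, new):
    """Fractional drop from baseline to new, e.g. 0.25 means 25% lower."""
    return (baseline - new) / baseline


cer_drop = relative_reduction(11.45, 4.81)
wer_drop = relative_reduction(14.83, 5.36)
print(f"CER: {cer_drop:.1%}, WER: {wer_drop:.1%}")  # CER: 58.0%, WER: 63.9%
```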

On generation speed, the median real-time factor is 4.748 and the maximum 5.288 (seconds of audio produced per second of compute; higher is faster).
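
Note that this figure is audio seconds per compute second, the inverse of the more common real-time-factor convention (compute time divided by audio time, where lower is faster). A minimal sketch of the calculation follows; the timings are made up for illustration and are not measured results, and speed_factor is a name invented here.

```python
import statistics


def speed_factor(num_samples, sample_rate, elapsed_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (num_samples / sample_rate) / elapsed_seconds


# Made-up runs: (samples produced, wall-clock seconds). Not measured results.
runs = [(480000, 2.0), (240000, 1.25), (960000, 4.0)]
factors = [speed_factor(n, 48000, t) for n, t in runs]
print(statistics.median(factors), max(factors))
```

In practice the elapsed time would come from wrapping a generate call with time.perf_counter().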

One last emphasis: all the numbers above come from an internal test set, not a public benchmark, and are used only for internal judgment of model quality.

Files included in the model

| File | Contents |
| --- | --- |
| pytorch_model.bin | BlueMagpie model weights |
| audiovae.pth | AudioVAE weights |
| config.json | Architecture and runtime settings |
| tokenizer.json, tokenizer_config.json | Tokenizer files |
| checkpoints/hung_yi_lee_speaker_centroids.pt | The default Hung-yi Lee speaker-vector table |
| USAGE.md | Advanced usage notes |

License

The code is Apache-2.0. The model weights are under an “other” license with usage restrictions; the key point is that reference audio and speaker vectors must be authorized and consented to before they can be used for synthesis or distribution.

Conclusion

BlueMagpie-TTS’s design fits in one sentence: keep the acoustic stack that works (VoxCPM), and swap out only the brain that decides “what to say” (Barbet). The former preserves pronunciation quality; the latter brings the ability to handle Taiwanese accent and Chinese–English mixing; the two are joined by a bridge module. For users, only two things matter: feed in text to synthesize, and give an authorized reference clip or speaker vector when you need a specific voice. Everything else is optional advanced control.