Log
Training and research log.
A log of OpenFormosa's model design, tokenizer, evaluation, and release work.
BlueMagpie-TTS: Taiwanese-accent, Chinese–English code-switching speech synthesis
An open Taiwanese-accent text-to-speech model that handles Chinese–English code-switching. It keeps VoxCPM's acoustic stack, swaps in the Barbet language model, and cuts character error rate by about 58% on a hard test set.
Barbet 1B Base: a hybrid decoder-only language model for Traditional Chinese
A 1B-parameter hybrid decoder-only causal language model: global and sliding-window attention interleaved with Mamba, context length up to 1M, tied embeddings, and built on PangolinTokenizer.
PangolinTokenizer: a byte-level BPE tokenizer for Traditional Chinese and Taiwan
A byte-level BPE tokenizer built for Taiwan: 114,688 merges and the lowest tokens per character on PangolinBench, with the smallest vocabulary.