
Taiwan Barbet · Barbet
Remembers Taiwan (記得住台灣)
The five-color bird: a multimodal, multi-capability base model that holds Taiwan context.
Models
The first OpenFormosa plan is a Taiwan-rooted pretrained base model, with ASR, TTS, and OCR released as task-specific branches on top of that shared backbone.
Pretrained base model
A shared Taiwan-rooted foundation model should learn local text, speech transcripts, pronunciation notation, forms, receipts, addresses, and layout conventions before task-specific fine-tuning.
The animal names are not decoration. They turn the model family into a readable system: one OpenFormosa family, a distinct character for each task, and a clear Taiwan memory.

Taiwan Barbet · Barbet
The five-color bird: a multimodal, multi-capability base model that holds Taiwan context.

Formosan Black Bear · BlackBear-ASR
A steady listener for Taiwan Mandarin and Taigi/Hakka accents.

Taiwan Blue Magpie · BlueMagpie-TTS
A living voice for Taiwan pronunciation, Bopomofo, Tailo, local terms, prosody, and style control.

Formosan Sika Deer · SikaDeer-OCR
A careful eye for receipts, forms, tables, menus, addresses, layout parsing, and structured extraction.
OpenFormosa-Base is meant to be the shared Taiwan AI layer: tokenizer, local language memory, long-context document handling, task adapters, and evaluation rules in one reusable system.
Traditional Chinese is only the surface. The model must learn Taiwan Mandarin, Taigi traces, Hakka traces, Bopomofo, Tailo, local terms, public documents, and social writing styles.
The goal is not to beat every giant model. The goal is a compact base that schools, labs, startups, local venues, and enterprises can run, adapt, and audit.
Tokenizer specs, evaluation sets, model cards, data gates, and downstream recipes should be inspectable so the project can be improved instead of merely marketed.
Audio and vision should not overwhelm the 1B text model. ASR, TTS, and OCR connect through adapters or task heads while sharing the same Taiwan-aware language core.
OpenFormosa is not a fine-tune of a foreign model. It is designed end to end: architecture, tokenizer, data, long context, post-training, and downstream adapters.

The tokenizer is optimized for Taiwan Traditional Chinese while keeping multilingual coverage: Bopomofo, Tailo, POJ, Taigi Han characters, Hakka terms, code-switching, and code. The goal is not the largest vocabulary but the best token efficiency for Taiwan text, so that the same 128K context holds more real content.
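Token efficiency can be made concrete as characters covered per token. The sketch below is only an illustration under stated assumptions: `char_tokenizer`, `make_greedy_tokenizer`, the tiny vocabulary, and the two sample strings are all hypothetical and are not the OpenFormosa tokenizer or its data.

```python
from typing import Callable, Iterable, List, Set


def chars_per_token(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of characters covered by one token across a corpus."""
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("corpus produced no tokens")
    return total_chars / total_tokens


def char_tokenizer(text: str) -> List[str]:
    """Hypothetical baseline: one token per character."""
    return list(text)


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical longest-match tokenizer; unknown text falls back to single characters."""
    max_len = max(len(entry) for entry in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, shrinking down to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize


# Illustrative inputs only (an address fragment and "uniform invoice").
corpus = ["台北市信義區", "統一發票"]
taiwan_vocab = {"台北市", "信義區", "統一發票"}

baseline = chars_per_token(corpus, char_tokenizer)
tuned = chars_per_token(corpus, make_greedy_tokenizer(taiwan_vocab))
```

On this toy corpus the baseline covers 1 character per token, while the vocabulary with Taiwan-specific entries covers 10 characters in 3 tokens. The same ratio, measured on a real evaluation set, indicates how much more text fits into a fixed context window.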
A compact decoder-only model that can be fine-tuned and deployed privately, serving as a shared language hub for ASR, TTS, OCR, and VLM. It uses a MiniCPM-style core with grouped-query attention (GQA), 128K context, multi-token prediction (MTP), reasoning modes, and specialist distillation.
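One reason GQA matters at 128K context is the size of the key/value cache. The arithmetic below is a rough sketch: the layer count, head counts, and head dimension are hypothetical placeholders rather than published OpenFormosa hyperparameters, and 128K is taken here to mean 131,072 tokens stored as 16-bit values.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical configuration, for illustration only.
SEQ_LEN = 128 * 1024
N_LAYERS = 24
N_QUERY_HEADS = 16
N_KV_HEADS = 4      # GQA: each group of 4 query heads shares one KV head
HEAD_DIM = 128

# Standard multi-head attention keeps one KV head per query head.
mha_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
gqa_bytes = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM)

GIB = 2 ** 30
mha_gib = mha_bytes / GIB
gqa_gib = gqa_bytes / GIB
```

With these assumed numbers the cache shrinks from 24 GiB to 6 GiB per sequence, a reduction equal to the ratio of query heads to KV heads.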
128K is the production-grade context length, intended for long transcripts, document summaries, multi-document RAG, and QA over Taiwan regulations and official letters. 1M is a research branch that uses compressed memory, sparse attention, local windows, and prefix caching for long-text retrieval and agent memory.
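Of the long-context techniques listed, local windows are the simplest to show. The following is a minimal sketch of a causal sliding-window attention mask, not the project's implementation; the function names and the sizes in the example are assumptions chosen for readability.

```python
from typing import List


def local_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees itself and at most `window - 1` earlier positions.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Small illustrative sizes; real windows would be thousands of tokens wide.
windowed = local_causal_mask(seq_len=6, window=3)
full_causal = local_causal_mask(seq_len=6, window=6)

windowed_cost = attended_pairs(windowed)
full_cost = attended_pairs(full_causal)
```

Full causal attention grows quadratically with sequence length, while the windowed variant grows roughly as sequence length times window size. That is why a windowed scheme is usually paired with some form of compressed or retrieved memory to recover information outside the window.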
Speech and vision connect to a shared text backbone through adapters, projectors, and separate heads. ASR, TTS, and OCR share Taiwan language understanding while keeping their own data and heads, so the 1B core is not overwhelmed by audio or image tokens and each task can iterate independently.
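The projector idea can be sketched in a few lines: features from a modality encoder are linearly mapped into the text embedding width, then concatenated with text embeddings before entering the shared backbone. Everything here is hypothetical (dimensions, random weights, and the `make_projector` helper), and a real projector would be trained rather than randomly initialized.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_projector(in_dim: int, out_dim: int, seed: int = 0) -> Callable[[List[Vector]], List[Vector]]:
    """Return a linear map from modality features (in_dim) to the text width (out_dim)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    # One weight row per output dimension; untrained placeholder values.
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]

    def project(frames: List[Vector]) -> List[Vector]:
        for frame in frames:
            if len(frame) != in_dim:
                raise ValueError("frame width does not match projector input width")
        return [[sum(w * x for w, x in zip(row, frame)) for row in weights] for frame in frames]

    return project


# Hypothetical widths: a small audio encoder feeding a wider text backbone.
AUDIO_DIM = 4
TEXT_DIM = 8

audio_frames = [[0.1 * i] * AUDIO_DIM for i in range(5)]   # stand-in encoder output
text_embeddings = [[0.0] * TEXT_DIM for _ in range(3)]     # stand-in prompt embeddings

project_audio = make_projector(AUDIO_DIM, TEXT_DIM)
backbone_input = project_audio(audio_frames) + text_embeddings
```

Because only the projector and the task head are specific to a modality, each of them can be retrained or swapped without touching the shared backbone weights.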