Models

OpenFormosa-Base first, then ASR, TTS, OCR.

The first OpenFormosa plan is a Taiwan-rooted pretrained base model, with ASR, TTS, and OCR released as task-specific branches on top of that shared backbone.

Pretrained base model

OpenFormosa-Base

A shared Taiwan-rooted foundation model should learn local text, speech transcripts, pronunciation notation, forms, receipts, addresses, and layout conventions before task-specific fine-tuning.

One shared backbone. ASR correction, TTS normalization, OCR extraction, and evaluation tools should reuse the same tokenizer and Taiwan knowledge.
Pretrain before demos. Small demos matter, but the public roadmap should name the base model that makes those demos coherent.
Auditable training inputs. Training summaries should only reference inputs already cleared before a run; this site is not an intake channel.
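
The "auditable training inputs" rule can be pictured as a simple check that runs before a training summary is written. This is a minimal sketch, not the project's actual tooling: the function names, the input IDs, and the summary layout are all hypothetical.

```python
from typing import Dict, Iterable, List


def uncleared_inputs(run_inputs: Iterable[str], cleared: Iterable[str]) -> List[str]:
    """Return the input IDs used by a run that are absent from the cleared list."""
    return sorted(set(run_inputs) - set(cleared))


def training_summary(run_id: str, run_inputs: Iterable[str],
                     cleared: Iterable[str]) -> Dict[str, object]:
    """Build a summary only if every input was cleared before the run (hypothetical layout)."""
    run_inputs = list(run_inputs)
    missing = uncleared_inputs(run_inputs, cleared)
    if missing:
        raise ValueError(f"run {run_id} references uncleared inputs: {missing}")
    return {"run_id": run_id, "inputs": sorted(set(run_inputs))}


# Hypothetical IDs, for illustration only.
cleared_ids = ["corpus-a", "corpus-b", "corpus-c"]
summary = training_summary("run-001", ["corpus-b", "corpus-a"], cleared_ids)
```

A run that lists an ID outside `cleared_ids` raises instead of producing a summary, so nothing uncleared can be referenced by accident.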

A model family people can remember.

The animal names are not decoration. They turn the model family into a readable system: same OpenFormosa bag, different task character, clear Taiwan memory.

Base

Taiwan Barbet · Barbet

Remembers Taiwan (記得住台灣)

The five-color bird: a multimodal, multi-capability base model that holds Taiwan context.

ASR

Formosan Black Bear · BlackBear-ASR

Hears and understands Taiwan (聽得懂台灣)

A steady listener for Taiwan Mandarin and for Taigi and Hakka accents.

TTS

Taiwan Blue Magpie · BlueMagpie-TTS

Speaks Taiwan (說得出台灣)

A living voice for Taiwan pronunciation, Bopomofo, Tailo, local terms, prosody, and style control.

OCR

Formosan Sika Deer · SikaDeer-OCR

Reads and understands Taiwan (讀得懂台灣)

A careful eye for receipts, forms, tables, menus, addresses, layout parsing, and structured extraction.

The point is not another chatbot.

OpenFormosa-Base is meant to be the shared Taiwan AI layer: tokenizer, local language memory, long-context document handling, task adapters, and evaluation rules in one reusable system.

Context

Taiwan is not a translation patch

Traditional Chinese is only the surface. The model must learn Taiwan Mandarin, Taigi traces, Hakka traces, Bopomofo, Tailo, local terms, public documents, and social writing styles.

Scale

1B is a deployable layer

The goal is not to beat every giant model. The goal is a compact base that schools, labs, startups, local venues, and enterprises can run, adapt, and audit.

Open

Open source is a trust mechanism

Tokenizer specs, evaluation sets, model cards, data gates, and downstream recipes should be inspectable so the project can be improved instead of merely marketed.

Architecture

Shared backbone, separate adapters

Audio and vision should not crush the 1B text model. ASR, TTS, and OCR connect through adapters or heads while sharing the same Taiwan-aware language core.

Technical route: Taiwan-Omni-1B.

Not a fine-tune of a foreign model. OpenFormosa is designed end to end: architecture, tokenizer, data, long context, post-training, and downstream adapters.

Taiwan multilingual tokenizer

IP 01 · Tokenizer · Pangolin

Optimized for Taiwan Traditional Chinese while keeping multilingual coverage: Bopomofo, Tailo, POJ, Taigi Han characters, Hakka terms, code-switching, and source code. The goal is not the largest vocabulary but the best token efficiency for Taiwan text, so that the same 128K context holds more real content.
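
Token efficiency can be made concrete as characters per token, measured on sample text. The sketch below is illustrative only. The two tokenizers are toy stand-ins, not the OpenFormosa tokenizer: one emits a token per UTF-8 byte, the other a token per character. It also assumes 128K means 128 × 1024 tokens.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def utf8_byte_tokenizer(text: str) -> List[str]:
    """Toy baseline: one token per UTF-8 byte (CJK and Bopomofo cost 3 tokens each)."""
    return [f"{b:02x}" for b in text.encode("utf-8")]


def char_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a Taiwan-aware vocabulary: one token per character."""
    return list(text)


def chars_per_token(tokenize: Tokenizer, samples: List[str]) -> float:
    """Average number of characters carried by one token over the samples."""
    total_chars = sum(len(s) for s in samples)
    total_tokens = sum(len(tokenize(s)) for s in samples)
    return total_chars / total_tokens


def chars_in_context(tokenize: Tokenizer, samples: List[str],
                     context_tokens: int = 128 * 1024) -> int:
    """Estimate how many characters of similar text fit in the context window."""
    return int(chars_per_token(tokenize, samples) * context_tokens)


samples = ["台灣", "ㄅㄆㄇ"]
byte_capacity = chars_in_context(utf8_byte_tokenizer, samples)
char_capacity = chars_in_context(char_tokenizer, samples)
```

On these samples the per-character tokenizer fits three times as many characters into the same window as the byte-level one. That ratio is the quantity the paragraph above asks the tokenizer to optimize.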

1B compact backbone

Core · Backbone

A compact decoder-only model that can be deployed on private infrastructure and fine-tuned, serving as a shared language hub for ASR, TTS, OCR, and VLM tasks. The core follows a MiniCPM-style design with grouped-query attention (GQA), 128K context, multi-token prediction (MTP), reasoning modes, and specialist distillation.
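
One reason GQA pairs naturally with a 128K context is the size of the key/value cache. The following back-of-the-envelope sketch uses a hypothetical configuration (24 layers, 16 query heads, head dimension 128, 2-byte values). None of these numbers come from the OpenFormosa design. They only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical attention shape, used only for illustration."""
    n_layers: int
    n_query_heads: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes fp16 or bf16 cache entries

    def __post_init__(self) -> None:
        if self.n_query_heads % self.n_kv_heads != 0:
            raise ValueError("query heads must be a multiple of KV heads")

    @property
    def group_size(self) -> int:
        """Number of query heads that share one key/value head."""
        return self.n_query_heads // self.n_kv_heads

    def kv_head_for(self, query_head: int) -> int:
        """Index of the key/value head that a given query head reads from."""
        return query_head // self.group_size

    def kv_cache_bytes(self, seq_len: int) -> int:
        """Keys plus values, for every layer, KV head, and cached position."""
        return (2 * self.n_layers * self.n_kv_heads * self.head_dim
                * seq_len * self.bytes_per_value)


CONTEXT = 128 * 1024  # assumes 128K means 131072 tokens
mha = AttentionConfig(n_layers=24, n_query_heads=16, n_kv_heads=16, head_dim=128)
gqa = AttentionConfig(n_layers=24, n_query_heads=16, n_kv_heads=4, head_dim=128)

mha_gib = mha.kv_cache_bytes(CONTEXT) / 2**30  # 24.0 GiB per sequence
gqa_gib = gqa.kv_cache_bytes(CONTEXT) / 2**30  # 6.0 GiB per sequence
```

Under these assumed numbers, sharing each key/value head among four query heads cuts the cache for one full-length sequence from 24 GiB to 6 GiB, while the number of query heads stays the same.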

Long context

Context · 128K → 1M

128K is the production-grade context target, for long transcripts, document summaries, multi-document RAG, and QA over Taiwan regulations and official letters. 1M is a research branch that uses compressed memory, sparse attention, local windows, and prefix caching for long-text retrieval and agent memory.
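
Of the techniques listed for the 1M branch, local windows are the easiest to illustrate. The sketch below shows a causal local-window pattern in which each position attends only to its most recent neighbors, plus an optional set of always-visible prefix tokens. The prefix ("global") tokens and all parameter names are assumptions for illustration, not a description of the planned architecture.

```python
from typing import List


def attended_positions(query_pos: int, window: int, n_global: int = 0) -> List[int]:
    """Key positions visible to query_pos: the last `window` positions
    (causal, including itself) plus the first `n_global` prefix positions."""
    start = max(0, query_pos - window + 1)
    global_part = range(0, min(n_global, start))  # avoid double-counting overlap
    local_part = range(start, query_pos + 1)
    return list(global_part) + list(local_part)


def attention_pairs(seq_len: int, window: int, n_global: int = 0) -> int:
    """Total (query, key) pairs computed under the local-window pattern."""
    return sum(len(attended_positions(q, window, n_global)) for q in range(seq_len))


def full_causal_pairs(seq_len: int) -> int:
    """Total (query, key) pairs under ordinary full causal attention."""
    return seq_len * (seq_len + 1) // 2


example = attended_positions(10, window=4, n_global=2)  # [0, 1, 7, 8, 9, 10]
```

With a fixed window the pair count grows roughly linearly with sequence length instead of quadratically, which is why the pattern is attractive at 1M tokens. Its cost is that distant content must be reached some other way, for example through compressed memory or retrieval.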

ASR / TTS / OCR adapters

Multimodal · Adapters

A shared text backbone: speech and vision connect through adapters, projectors, and separate heads. The three tasks share Taiwan language understanding while keeping their own data and heads, so the 1B core is not crushed by audio or image tokens and each task can iterate independently.
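
The adapter idea can be sketched structurally: each task brings its own projector and head, and the projector compresses long audio or image feature sequences into a bounded number of vectors before they reach the shared core. Everything here is a toy stand-in. Mean pooling plays the role of a learned projector, the backbone is an identity placeholder, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


def pool_frames(frames: List[Vector], n_out: int) -> List[Vector]:
    """Average contiguous chunks so that at most n_out vectors reach the backbone."""
    if not frames:
        return []
    n_out = min(n_out, len(frames))
    pooled = []
    for i in range(n_out):
        lo = i * len(frames) // n_out
        hi = (i + 1) * len(frames) // n_out
        chunk = frames[lo:hi]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


@dataclass
class TaskAdapter:
    """Per-task projector budget and output head (both hypothetical)."""
    name: str
    max_tokens: int
    head: Callable[[List[Vector]], str]

    def encode(self, frames: List[Vector]) -> List[Vector]:
        return pool_frames(frames, self.max_tokens)


@dataclass
class SharedBackbone:
    """One core shared by all tasks; adapters register independently."""
    adapters: Dict[str, TaskAdapter] = field(default_factory=dict)

    def register(self, adapter: TaskAdapter) -> None:
        self.adapters[adapter.name] = adapter

    def forward(self, tokens: List[Vector]) -> List[Vector]:
        return tokens  # placeholder for the shared language model

    def run(self, task: str, frames: List[Vector]) -> str:
        adapter = self.adapters[task]
        return adapter.head(self.forward(adapter.encode(frames)))


backbone = SharedBackbone()
backbone.register(TaskAdapter("asr", max_tokens=4, head=lambda h: f"asr:{len(h)}"))
backbone.register(TaskAdapter("ocr", max_tokens=2, head=lambda h: f"ocr:{len(h)}"))
audio_frames = [[float(i)] for i in range(100)]
```

Calling `backbone.run("asr", audio_frames)` sends only four pooled vectors through the core, however long the audio is. Replacing or retraining the `ocr` adapter leaves the `asr` path untouched.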

Design principles

Task-first: ASR, TTS, and OCR are practical entry points for Taiwan AI infrastructure.
Evidence-first: scores, raw outputs, and error analysis before public claims.
Private-by-default enterprise data: customer data is not used to train a general model unless the customer explicitly opts in.
Open core: tokenizer, evaluation scripts, model cards, benchmark artifacts, and tooling should be open whenever possible.
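
The evidence-first principle above asks for scores, raw outputs, and error analysis together. For ASR and OCR, a common score is character error rate (CER): the character-level edit distance divided by the reference length. The sketch below is a generic implementation, not the project's evaluation script, and the report layout is an assumption.

```python
from typing import Dict, List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def error_report(pairs: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Keep the raw output next to its score so errors can be inspected."""
    return [{"reference": r, "output": h, "cer": cer(r, h)} for r, h in pairs]


report = error_report([("台北市信義區", "台北市信意區")])
```

Storing the reference and the raw output beside each score lets a reader see which character went wrong, here a single homophone substitution, rather than only an aggregate number.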