Base and Instruct
Pretraining, instruction tuning, long-context adaptation, and Taiwan-local benchmark gates.
Training
OpenFormosa focuses on tokenizer design, pretraining, fine-tuning, evaluation gates, and release documentation. Training inputs must already be licensed, public, synthetic, or separately approved before they appear in this project.
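The admission rule above can be sketched as a small check that runs before any data reaches a training recipe. This is a minimal illustration, not the project's actual tooling: the `TrainingInput` record, the status labels, and the function names are all hypothetical, chosen only to mirror the four categories named in the text.

```python
from dataclasses import dataclass

# Hypothetical status labels mirroring the four accepted categories:
# licensed, public, synthetic, or separately approved.
ALLOWED_STATUSES = {"licensed", "public", "synthetic", "approved"}


@dataclass(frozen=True)
class TrainingInput:
    """One dataset or corpus considered for training (illustrative record)."""
    name: str
    status: str  # provenance label assigned during data review


def rejected_inputs(inputs):
    """Return the names of inputs whose status is not an accepted category."""
    return [item.name for item in inputs if item.status not in ALLOWED_STATUSES]


def admit(inputs):
    """Return the inputs unchanged, or raise if any lacks an accepted status."""
    bad = rejected_inputs(inputs)
    if bad:
        raise ValueError(f"inputs without an accepted provenance status: {bad}")
    return list(inputs)
```

Raising on the first unreviewed input, rather than silently filtering it out, keeps the failure visible to whoever assembled the recipe.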
Model training
The shared backbone should make Taiwan-local language, document handling, pronunciation notation, and evaluation behavior visible in a single inspectable training path.
Each model branch should have a small, inspectable path from recipe to checkpoint to benchmark result.
Pretraining, instruction tuning, long-context adaptation, and Taiwan-local benchmark gates.
Adapters and correction models for transcripts, timestamps, diarization hints, and Taiwan-local terminology.
Pronunciation, text normalization, style control, and release checks before any generated voice demo.
Layout parsing, table extraction, address handling, confidence reporting, and privacy review for document tasks.
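One way to keep the path from recipe to checkpoint to benchmark result inspectable is to record all three links for each branch run in one place and flag any that are missing. The sketch below assumes a hypothetical `BranchRun` record; the field names and example paths are illustrative and do not come from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BranchRun:
    """Illustrative lineage record for one model branch run."""
    branch: str
    recipe_path: Optional[str] = None
    checkpoint_path: Optional[str] = None
    benchmark_report_path: Optional[str] = None


def missing_links(run):
    """List which of the three lineage links have not been recorded yet."""
    links = [
        ("recipe", run.recipe_path),
        ("checkpoint", run.checkpoint_path),
        ("benchmark", run.benchmark_report_path),
    ]
    return [name for name, value in links if not value]


# Example: a run that has a recipe but no checkpoint or benchmark report yet.
run = BranchRun(branch="example-branch", recipe_path="recipes/example.yaml")
print(missing_links(run))
```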
Release gates
Before publishing weights or demos, each run needs a model card, a training-data summary, a benchmark report, limitation notes, and a stated acceptable-use boundary.
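The release gate reads naturally as a checklist, so it can be sketched as a function that reports which required artifacts are still missing. The artifact keys and function names below are assumptions made for illustration; only the five required items come from the text.

```python
# Hypothetical keys for the five artifacts required before release.
REQUIRED_ARTIFACTS = (
    "model_card",
    "training_data_summary",
    "benchmark_report",
    "limitation_notes",
    "acceptable_use_boundary",
)


def missing_artifacts(artifacts):
    """Return required artifact keys that are absent or empty.

    `artifacts` maps an artifact key to its text or file path.
    """
    return [
        key
        for key in REQUIRED_ARTIFACTS
        if not (artifacts.get(key) or "").strip()
    ]


def can_release(artifacts):
    """A run passes the gate only when every required artifact is present."""
    return not missing_artifacts(artifacts)
```

Returning the list of missing items, rather than a bare boolean, lets a reviewer see exactly which document is blocking a release.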