Training

Train models with auditable recipes.

OpenFormosa focuses on tokenizer design, pretraining, fine-tuning, evaluation gates, and release documentation. Training inputs must already be licensed, public, synthetic, or separately approved before they appear in this project.
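The input rule above can be expressed as a simple pre-flight check. The sketch below is illustrative only: the status labels, the `TrainingInput` type, and the dataset names are assumptions made for this example, not part of any OpenFormosa tooling.

```python
from dataclasses import dataclass

# Hypothetical status labels mirroring the four categories named above.
ALLOWED_STATUSES = {"licensed", "public", "synthetic", "approved"}


@dataclass(frozen=True)
class TrainingInput:
    name: str
    status: str  # expected to be one of ALLOWED_STATUSES


def rejected_inputs(inputs):
    """Return the names of inputs whose status is not in the allowed set."""
    return [item.name for item in inputs if item.status not in ALLOWED_STATUSES]


# Made-up example inputs: one cleared, one with no recorded clearance.
inputs = [
    TrainingInput("synthetic-dialogues", "synthetic"),
    TrainingInput("unreviewed-forum-dump", "unknown"),
]
print(rejected_inputs(inputs))
```

A check like this would run before any data enters a recipe, so an unreviewed source blocks the run instead of being discovered afterwards.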

Model training

OpenFormosa-Base

The shared backbone should cover Taiwan's languages, local documents, and pronunciation notation, with local evaluation behavior visible in a single inspectable training path.

Training tracks

Each model branch should have a small, inspectable path from recipe to checkpoint to benchmark result.
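One way to keep that path inspectable is to store the three links together in a single record per run. The following is a minimal sketch under stated assumptions: `RunRecord`, its field names, and the example recipe path are hypothetical and do not describe an existing OpenFormosa format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    track: str                              # e.g. "base", "asr", "tts", "ocr"
    recipe_path: str                        # hypothetical path to the training recipe
    checkpoint_path: Optional[str] = None   # filled in once training finishes
    benchmark_scores: Dict[str, float] = field(default_factory=dict)

    def missing_links(self) -> List[str]:
        """List which steps of recipe -> checkpoint -> benchmark are not recorded yet."""
        missing = []
        if not self.recipe_path:
            missing.append("recipe")
        if not self.checkpoint_path:
            missing.append("checkpoint")
        if not self.benchmark_scores:
            missing.append("benchmark result")
        return missing


# A run that has a recipe but has not produced a checkpoint or scores yet.
record = RunRecord(track="asr", recipe_path="recipes/asr-correction.yaml")
print(record.missing_links())
```

Keeping the record this small is a deliberate choice in the sketch: anyone reviewing a branch can see at a glance which link in the chain is still missing.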

Base

Base and Instruct

Pretraining, instruction tuning, long-context adaptation, and Taiwan-local benchmark gates.

ASR

ASR correction

Adapters and correction models for transcripts, timestamps, diarization hints, and Taiwan-local terminology.

TTS

TTS normalization

Pronunciation, text normalization, style control, and release checks before any generated voice demo.

OCR

OCR extraction

Layout parsing, table extraction, address handling, confidence reporting, and privacy review for document tasks.

Release gates

A checkpoint is not a release.

Before weights or demos are published, each run needs a model card, a training-data summary, a benchmark report, limitation notes, and a defined acceptable-use boundary.
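The release requirements listed above can be treated as a checklist that a run must satisfy in full. This is a sketch only: the artifact keys and the `release_blockers` and `can_release` helpers are assumptions introduced for illustration, not an existing project API.

```python
# Hypothetical keys for the five release artifacts named above.
REQUIRED_ARTIFACTS = (
    "model_card",
    "training_data_summary",
    "benchmark_report",
    "limitation_notes",
    "acceptable_use_boundary",
)


def release_blockers(artifacts):
    """Return required artifacts that are absent or empty.

    `artifacts` maps an artifact key to its text content (or None).
    """
    return [
        name
        for name in REQUIRED_ARTIFACTS
        if not (artifacts.get(name) or "").strip()
    ]


def can_release(artifacts):
    """A run may be published only when no required artifact is missing."""
    return not release_blockers(artifacts)


# A run with a checkpoint's worth of paperwork still incomplete.
draft = {"model_card": "Draft model card text.", "benchmark_report": "   "}
print(release_blockers(draft))
print(can_release(draft))
```

Whitespace-only entries count as missing in this sketch, so an empty placeholder file would not pass the gate.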