Training

Train models with auditable recipes.

OpenFormosa focuses on tokenizer design, pretraining, fine-tuning, evaluation gates, and release documentation. Training inputs must already be licensed, public, synthetic, or separately approved before they appear in this project.

Model training

OpenFormosa-Base

The shared backbone should make Taiwan language, documents, pronunciation notation, and local evaluation behavior visible in one inspectable training path.

Tokenizer first. Track Taiwan Mandarin, Bopomofo, Tailo, Taigi/Hakka traces, addresses, forms, and public-sector wording.
Training recipes. Keep mixture notes, hyperparameters, checkpoints, eval gates, and reproducibility commands close to each release.
No intake pipeline. This site documents training and evaluation; it is not a place to submit or transfer datasets.
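
One way to make the tokenizer-first item measurable is to track fertility (tokens per character) separately for each kind of text the tokenizer is expected to handle. The sketch below is a minimal illustration, not the project's tooling: `byte_tokenize` is a stand-in UTF-8 byte tokenizer, the function names are hypothetical, and the sample strings are illustrative placeholders rather than project data.

```python
from typing import Callable, Dict, List, Sequence

# Illustrative placeholder strings, one per text category (not project data).
SAMPLES: Dict[str, str] = {
    "taiwan_mandarin_address": "臺北市信義區市府路1號",
    "bopomofo": "ㄊㄞˊ ㄨㄢ",
    "tailo": "Tâi-uân",
}


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def fertility(tokenize: Callable[[str], Sequence], text: str) -> float:
    """Return tokens per character for a single string."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


def fertility_report(
    tokenize: Callable[[str], Sequence], samples: Dict[str, str]
) -> Dict[str, float]:
    """Compute fertility for each named sample."""
    return {name: fertility(tokenize, text) for name, text in samples.items()}


if __name__ == "__main__":
    for name, value in fertility_report(byte_tokenize, SAMPLES).items():
        print(f"{name}: {value:.2f} tokens per character")
```

Any callable that maps a string to a sequence of tokens can replace `byte_tokenize`, so the same report can compare candidate tokenizers on identical samples.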

Training tracks

Each model branch should have a small, inspectable path from recipe to checkpoint to benchmark result.

Base

Base and Instruct

Pretraining, instruction tuning, long-context adaptation, and Taiwan-local benchmark gates.

ASR

ASR correction

Adapters and correction models for transcripts, timestamps, diarization hints, and Taiwan-local terminology.

TTS

TTS normalization

Pronunciation, text normalization, style control, and release checks before any generated voice demo.

OCR

OCR extraction

Layout parsing, table extraction, address handling, confidence reporting, and privacy review for document tasks.
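
The path from recipe to checkpoint to benchmark result described for these tracks can be sketched as a small record plus a gate check. This is a hypothetical layout under stated assumptions: the `Recipe` and `RunRecord` fields, the fingerprint scheme, and the gate names are illustrative and are not taken from the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe record; field names are illustrative."""
    name: str
    hyperparameters: Dict[str, float]
    mixture: Dict[str, float]  # source label -> sampling weight


def fingerprint(recipe: Recipe) -> str:
    """Hash the recipe with sorted keys so dict ordering does not matter."""
    payload = json.dumps(asdict(recipe), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Ties one checkpoint and its scores back to the recipe that made it."""
    recipe_fingerprint: str
    checkpoint_path: str
    scores: Dict[str, float]  # benchmark name -> score


def failed_gates(scores: Dict[str, float], gates: Dict[str, float]) -> List[str]:
    """Return gate names whose score is missing or below the minimum."""
    return sorted(
        name
        for name, minimum in gates.items()
        if scores.get(name, float("-inf")) < minimum
    )


if __name__ == "__main__":
    recipe = Recipe("base-demo", {"lr": 3e-4}, {"public": 0.7, "synthetic": 0.3})
    run = RunRecord(fingerprint(recipe), "checkpoints/demo", {"local_qa": 0.8})
    print(failed_gates(run.scores, {"local_qa": 0.7, "long_context": 0.5}))
```

Treating a missing score as a failed gate keeps a run from passing simply because a benchmark was never executed.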

Release gates

A checkpoint is not a release.

Before publishing weights or demos, each run needs a model card, training-data summary, benchmark report, limitation notes, and acceptable-use boundary.

Training data sheet

A compact DBOM-style summary for sources already cleared before training.

Benchmark report

Scores, raw generations, configs, and error analysis before any public claim.
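
The rule that a checkpoint is not a release can be enforced mechanically by refusing to publish until every required document exists. The sketch below assumes one file per required item in a run directory; the file names and the `missing_artifacts` helper are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Hypothetical file names, one per item required before publishing.
REQUIRED_ARTIFACTS: Dict[str, str] = {
    "model card": "MODEL_CARD.md",
    "training-data summary": "TRAINING_DATA.md",
    "benchmark report": "BENCHMARK_REPORT.md",
    "limitation notes": "LIMITATIONS.md",
    "acceptable-use boundary": "ACCEPTABLE_USE.md",
}


def missing_artifacts(run_dir: Path) -> List[str]:
    """Return labels of required items that are absent or empty in run_dir."""
    missing = []
    for label, filename in REQUIRED_ARTIFACTS.items():
        path = run_dir / filename
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(label)
    return missing


def is_releasable(run_dir: Path) -> bool:
    """A run is releasable only when nothing is missing."""
    return not missing_artifacts(run_dir)


if __name__ == "__main__":
    # Demo on a throwaway directory that holds only a draft model card.
    with tempfile.TemporaryDirectory() as tmp:
        run = Path(tmp)
        (run / "MODEL_CARD.md").write_text("draft", encoding="utf-8")
        print("missing:", missing_artifacts(run))
        print("releasable:", is_releasable(run))
```

A check like this only confirms that the documents exist; reviewing their content remains a human step.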