Roadmap

From benchmarks to OpenFormosa-Base.

A staged roadmap: build a Taiwan-rooted base model first, then add ASR, TTS, and OCR branches, each with reproducible evaluation and release evidence.

Phase 0

Benchmarks and release templates

Publish the evaluation workflow and the model card and training data sheet templates, together with reproducibility and release-evidence expectations.

Phase 1

Tokenizer, benchmarks, base recipe

Prepare Taiwan multilingual tokenizer checks, training recipes, and evaluation sets for ASR correction, TTS normalization, OCR structured output, and base-model perplexity.
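
As a rough illustration of the base-model perplexity metric named above, the sketch below computes perplexity from per-token natural-log probabilities. The function name and input format are assumptions made for this example, not part of any published OpenFormosa tooling; a real evaluation would obtain the log-probabilities from the model under test.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp(-mean log-probability) over the scored tokens.

    `token_logprobs` holds natural-log probabilities, one per token.
    The name and input shape are hypothetical, chosen for illustration.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-mean_logprob)


# A model that assigns probability 0.25 to every token is, on average,
# as uncertain as a uniform choice among four options.
uniform_quarter = [math.log(0.25)] * 4
example_ppl = perplexity(uniform_quarter)
```

Lower values mean the model finds the held-out text less surprising. Perplexity figures are only comparable between runs that share the same tokenizer and evaluation text.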

Phase 2

OpenFormosa-Base pretraining

Train the shared Taiwan-rooted base model from already cleared inputs, then publish run notes, checkpoints, and benchmark reports.

Phase 3

ASR / TTS / OCR task branches

Release small ASR, TTS, and OCR demos or adapters on top of OpenFormosa-Base with task-specific evaluation gates.
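
One way to picture a task-specific evaluation gate is as a threshold check over benchmark results that must pass before a branch is released. The sketch below is a minimal example under that assumption; the metric names (`asr_cer`, `ocr_field_f1`) and the thresholds are hypothetical placeholders, not project targets.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    """A single release condition on one metric (all values hypothetical)."""
    metric: str
    threshold: float
    higher_is_better: bool


def failed_gates(results: Dict[str, float], gates: List[Gate]) -> List[str]:
    """Return the names of metrics that are missing or miss their threshold."""
    failures = []
    for gate in gates:
        value = results.get(gate.metric)
        if value is None:
            # A metric that was never measured counts as a failure.
            failures.append(gate.metric)
        elif gate.higher_is_better and value < gate.threshold:
            failures.append(gate.metric)
        elif not gate.higher_is_better and value > gate.threshold:
            failures.append(gate.metric)
    return failures


# Hypothetical gates: character error rate must stay at or below 0.10,
# field-level F1 must reach at least 0.90.
example_gates = [
    Gate("asr_cer", 0.10, higher_is_better=False),
    Gate("ocr_field_f1", 0.90, higher_is_better=True),
]
blocked_by = failed_gates({"asr_cer": 0.08, "ocr_field_f1": 0.85}, example_gates)
```

A release would proceed only when the returned list is empty. Treating unmeasured metrics as failures keeps a missing benchmark run from silently passing the gate.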

Phase 4

Safety-reviewed model demos

Launch generic Taiwan voice demos, OCR extraction tools, and ASR correction flows only after they pass misuse, privacy, and benchmark checks.

Phase 5

Enterprise private deployment

Offer private fine-tuning and evaluation for partners without mixing customer data into the public base model.