Evaluation

Measure Taiwan capability before claiming it.

OpenFormosa is now focused on evaluation suites, reproducible benchmarks, and model-training recipes. The project does not collect community datasets or accept raw data uploads.

What we evaluate

  • ASR: Taiwan Mandarin, Taigi/Hakka influence, code-switching, timestamps, speaker labels, and correction quality.
  • TTS: Taiwan pronunciation, Bopomofo and Tailo handling, prosody, intelligibility, and misuse-resistant release settings.
  • OCR: Receipts, forms, menus, public notices, addresses, tables, and structured extraction accuracy.
  • Language models: Taiwan-local translation, public-sector wording, local terminology, long-context documents, and benchmark perplexity (PPL).
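As a point of reference for the perplexity (PPL) figure named in the language-model item, here is a minimal sketch of how it is commonly computed from per-token log-probabilities. It assumes natural-log probabilities; the function name `perplexity` is hypothetical and is not part of any OpenFormosa tooling. PPL values are only comparable between models that share a tokenizer and an evaluation text.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative input: a model that assigns probability 0.25 to each of four tokens.
example_logprobs = [math.log(0.25)] * 4
example_ppl = perplexity(example_logprobs)
```

A uniform probability of 0.25 per token gives a perplexity of about 4, which matches the intuition of choosing among four equally likely options at each step.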

No public data intake

  • Do not upload audio, transcripts, documents, IDs, private records, or enterprise files.
  • Benchmark work uses existing licensed, public, synthetic, or internally approved evaluation sets.
  • Issues should discuss tasks, metrics, bugs, and reproducibility, not new raw datasets.
  • Raise rights or privacy concerns with the maintainers.

Evaluation workflow

Every score should be reproducible enough for someone else to rerun, inspect, and disagree with constructively.

Define the task contract

Name the input format, output format, allowed prompt, model settings, metric, and failure labels before running models.
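One way to pin a task contract down before any model runs is to write it as a small, serializable record. The sketch below is illustrative only: the class `TaskContract`, its field names, and every example value (task name, prompt, metric, labels) are assumptions, not an OpenFormosa schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskContract:
    """Hypothetical record of everything fixed before a run starts."""
    task: str
    input_format: str
    output_format: str
    prompt: str
    model_settings: dict
    metric: str
    failure_labels: tuple

    def to_json(self) -> str:
        # Sorted keys keep the snapshot stable across runs and easy to diff.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True, indent=2)


# Example values are made up for illustration.
contract = TaskContract(
    task="ocr-receipt-extraction",
    input_format="PNG image, one receipt per file",
    output_format="JSON object with keys: vendor, date, total",
    prompt="Extract vendor, date, and total from the receipt as JSON.",
    model_settings={"temperature": 0.0, "max_tokens": 256},
    metric="field_f1",
    failure_labels=("format_failure", "hallucination"),
)
contract_json = contract.to_json()
```

Committing the serialized contract alongside the results lets a reviewer see exactly which prompt and settings a score refers to.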

Run smoke tests

Check a small slice for prompt drift, decoding mistakes, missing fields, and obvious mismatches before full scoring.
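A smoke test along these lines can be sketched as a check over a few generated rows. The example assumes generations are stored as JSON Lines with `id` and `output` fields; that layout, the function name `smoke_check`, and the sample rows are assumptions for illustration. It covers only decoding mistakes and missing fields; prompt drift would need a separate comparison against the task contract.

```python
import json

REQUIRED_FIELDS = ("id", "output")  # hypothetical field names


def smoke_check(lines, required=REQUIRED_FIELDS):
    """Return (line_number, problem) pairs for a small slice of JSONL rows."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid_json"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not_an_object"))
            continue
        for field in required:
            # Treat absent, null, and empty-string values as missing.
            if row.get(field) in (None, ""):
                problems.append((number, f"missing:{field}"))
    return problems


# Made-up sample: one good row, one empty output, one truncated row.
sample = [
    '{"id": "a1", "output": "總計 120 元"}',
    '{"id": "a2", "output": ""}',
    '{"id": "a3"',
]
problems = smoke_check(sample)
```

Running this on a few dozen rows before full scoring catches truncated or empty generations while they are still cheap to fix.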

Score and publish artifacts

Keep raw generations, metric outputs, config snapshots, and leaderboard rows together.
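One possible way to keep these artifacts together is a single run directory named after a hash of the config, so a leaderboard row can always be traced back to the generations and settings behind it. The directory layout, file names, and the `write_run` helper below are assumptions for illustration, not an established OpenFormosa convention.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_run(root, config, generations, metrics):
    """Write config, raw generations, metrics, and a leaderboard row side by side."""
    config_text = json.dumps(config, sort_keys=True, ensure_ascii=False)
    # The same config always maps to the same run directory.
    run_id = hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)

    (run_dir / "config.json").write_text(config_text, encoding="utf-8")
    with open(run_dir / "generations.jsonl", "w", encoding="utf-8") as handle:
        for row in generations:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    (run_dir / "metrics.json").write_text(
        json.dumps(metrics, sort_keys=True), encoding="utf-8"
    )
    leaderboard_row = {"run_id": run_id, "model": config.get("model"), **metrics}
    (run_dir / "leaderboard_row.json").write_text(
        json.dumps(leaderboard_row, sort_keys=True), encoding="utf-8"
    )
    return run_dir


# Made-up values, written to a temporary directory for demonstration.
run_dir = write_run(
    tempfile.mkdtemp(),
    config={"model": "example-model", "temperature": 0.0},
    generations=[{"id": "a1", "output": "總計 120 元"}],
    metrics={"field_f1": 0.5},
)
```

Deriving the run ID from the config means that a changed setting produces a new directory rather than silently overwriting earlier results.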

Analyze errors

Separate format failures, Taiwan-local wording misses, hallucinations, privacy risks, and metric limitations.
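Keeping these categories separate can be as simple as tallying one label per annotated error. In the sketch below, the label strings mirror the five categories named above, while the identifiers, the annotation format, and the sample data are assumptions for illustration.

```python
from collections import Counter

# Hypothetical label identifiers for the five error categories.
LABELS = (
    "format_failure",
    "local_wording_miss",
    "hallucination",
    "privacy_risk",
    "metric_limitation",
)


def summarize(annotations):
    """Return {label: (count, share)} over annotated errors, rejecting unknown labels."""
    counts = Counter()
    for item in annotations:
        label = item["label"]
        if label not in LABELS:
            raise ValueError(f"unknown failure label: {label}")
        counts[label] += 1
    total = sum(counts.values())
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }


# Made-up annotations for four failed examples.
annotated = [
    {"id": "a2", "label": "format_failure"},
    {"id": "a5", "label": "local_wording_miss"},
    {"id": "a7", "label": "local_wording_miss"},
    {"id": "a9", "label": "metric_limitation"},
]
summary = summarize(annotated)
```

Reporting `metric_limitation` as its own row keeps cases where the metric is wrong from being counted against the model.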