nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages ↔ Traditional Chinese
Repo: FormosonBankDemos/nllb200-formosan-zh
Base model: facebook/nllb-200-distilled-600M
Task: Bidirectional machine translation between 15 Formosan languages and Traditional Chinese (FLORES code zho_Hant).
This model adapts NLLB-200 (distilled 600M) to a multilingual Formosan ↔ Chinese setting. It is trained on a curated parallel corpus of 15 Taiwanese Indigenous (Formosan) languages paired with Traditional Chinese, using a temperature-smoothed multilingual sampling strategy and bidirectional training (Formosan→zh and zh→Formosan).
1. Supported languages and codes
Internally we use NLLB-style language codes (ISO 639-3 code plus script tag):
| Language (canonical) | Typical label in corpus | NLLB code |
|---|---|---|
| Amis | amis / ami | ami_Latn |
| Bunun | bunun / bnn | bnn_Latn |
| Kavalan | kavalan / ckv | ckv_Latn |
| Rukai | rukai / dru | dru_Latn |
| Paiwan | paiwan / pwn | pwn_Latn |
| Puyuma | puyuma / pyu | pyu_Latn |
| Thao | thao / ssf | ssf_Latn |
| Saaroa | saaroa / sxr | sxr_Latn |
| Sakizaya | sakizaya / szy | szy_Latn |
| Tao (Yami) | tao | tao_Latn |
| Atayal | atayal / tay | tay_Latn |
| Seediq | seediq / trv | trv_Latn |
| Tsou | tsou / tsu | tsu_Latn |
| Kanakanavu | kanakanavu / xnb | xnb_Latn |
| Saisiyat | saisiyat / xsy | xsy_Latn |
| Chinese (Traditional) | chinese / zh | zho_Hant |
You must use these language codes in src_lang and when computing forced_bos_token_id for generation.
2. Quick usage
2.1. Using the pipeline API (Amis → Chinese)
import torch
from transformers import pipeline
model_id = "FormosonBankDemos/nllb200-formosan-zh"
# Use the first GPU and float16 when CUDA is available; otherwise fall back to CPU in full precision.
use_cuda = torch.cuda.is_available()
translator = pipeline(
    task="translation",
    model=model_id,
    tokenizer=model_id,
    src_lang="ami_Latn",
    tgt_lang="zho_Hant",
    device=0 if use_cuda else "cpu",
    torch_dtype=torch.float16 if use_cuda else None,
)
text = "Adihay ko 'adadongac i kilakilangan."
print(translator(text)[0]["translation_text"])
# e.g. "森林裡有很多甲蟲。"
2.2. Reverse direction (Chinese → Amis)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="zho_Hant")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
article = "森林裡有很多甲蟲。"
inputs = tokenizer(article, return_tensors="pt").to(model.device)
tgt_code = "ami_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)
generated = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=forced_bos_token_id,
    max_new_tokens=48,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    length_penalty=1.05,
    early_stopping=True,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
# e.g. "Adihay ko 'alem i kilakilangan."
2.3. General pattern (any Formosan ↔ Chinese)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="ami_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
def translate(text: str, src_code: str, tgt_code: str, max_new_tokens: int = 48) -> str:
    # Set the source language code for the encoder
    tokenizer.src_lang = src_code
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        decoder_start_token_id=forced_bos,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        length_penalty=1.05,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# Amis → Chinese
print(translate("Adihay ko 'adadongac i kilakilangan.", "ami_Latn", "zho_Hant"))
# Chinese → Seediq
print(translate("布農人有五個氏族。", "zho_Hant", "trv_Latn"))
3. How this model was trained
3.1. Objective
- Base model: facebook/nllb-200-distilled-600M, a 600M-parameter dense model distilled from the larger NLLB-200 multilingual MT models.
- Goal: improve translation quality between the 15 Formosan languages and Traditional Chinese, in both directions.
3.2. Data
Custom FormosanBank Chinese Parallel Corpus combining dictionary sentences and example phrases from multiple sources.
CSV schema (multilingual mode):
lang_code,formosan_sentence,chinese_sentence,source,dialect,split
ami,Ota'en!,吐出來!,Formosan-ILRDF_Dicts/Final_XML/Amis/Amis.xml,Xiuguluan,train
...

- lang_code: 3-letter codes or names (e.g. ami, bnn, ckv, ...).
- formosan_sentence: sentence in one of the 15 Formosan languages.
- chinese_sentence: sentence in Traditional Chinese.
- split: train / valid / test (if absent, we auto-split 90/5/5 per language).
- dialect: tracked but not used directly for modeling.
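The exact preprocessing script is not reproduced in this card. The following is a minimal sketch of the 90/5/5 per-language auto-split applied when the split column is absent, assuming the column names above and a placeholder file name formosan_zh.csv.

import pandas as pd

df = pd.read_csv("formosan_zh.csv")  # placeholder path

if "split" not in df.columns:
    parts = []
    for lang, group in df.groupby("lang_code"):
        group = group.sample(frac=1.0, random_state=42).copy()  # shuffle within language
        n_train = int(0.90 * len(group))
        n_valid = int(0.05 * len(group))
        group["split"] = "test"
        group.iloc[:n_train, group.columns.get_loc("split")] = "train"
        group.iloc[n_train:n_train + n_valid, group.columns.get_loc("split")] = "valid"
        parts.append(group)
    df = pd.concat(parts, ignore_index=True)

print(df.groupby(["lang_code", "split"]).size())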
3.3. Multilingual sampling & directions
Bidirectional training: both Formosan→Chinese and Chinese→Formosan.
At each step:
- Sample a language \(L\) from the set of kept languages.
- Sample a mini-batch of parallel sentences for \(L\).
- With probability p_src2tgt (default 0.5), train \(L\)→Chinese; otherwise Chinese→\(L\).

Temperature-smoothed sampling over language sizes:

\[ p(L) \propto n_L^{1/T} \quad (\text{default } T = 5) \]

where \(n_L\) is the number of training examples for language \(L\). Higher \(T\) downweights high-resource languages and upweights low-resource ones.
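The released training script is not included here; the sketch below illustrates the sampling step described above, with an illustrative counts dictionary standing in for the real per-language example counts.

import random

T = 5.0          # smoothing temperature
p_src2tgt = 0.5  # probability of training L -> Chinese rather than Chinese -> L

# Illustrative per-language training-example counts (placeholders, not the corpus statistics).
counts = {"ami": 60000, "tay": 45000, "ckv": 9000}

langs = list(counts)
weights = [counts[lang] ** (1.0 / T) for lang in langs]  # p(L) ∝ n_L^(1/T)

def sample_language_and_direction():
    lang = random.choices(langs, weights=weights, k=1)[0]
    if random.random() < p_src2tgt:
        return lang, "zh"  # Formosan -> Chinese
    return "zh", lang      # Chinese -> Formosan

print(sample_language_and_direction())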
3.4. Core modeling details (kept consistent with NLLB)
For each batch:
- We set tokenizer.src_lang to the current source language (ami_Latn, zho_Hant, etc.).
- We do not prefix labels with any language codes; they are plain token sequences + EOS.

For generation and evaluation we always pass:

forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)
decoder_start_token_id = forced_bos_token_id
This mirrors recommended NLLB usage and ensures consistent behavior across Transformers versions.
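As an illustration of the label convention above, the hypothetical snippet below builds labels as plain token ids plus EOS with no language-code token; it is an assumption about how the training script does this, not the released code.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "FormosonBankDemos/nllb200-formosan-zh", src_lang="ami_Latn"
)

# Encoder input: tokenized with the source language code, as usual for NLLB.
enc = tokenizer("Adihay ko 'adadongac i kilakilangan.", return_tensors="pt")

# Labels: plain token ids + EOS, with no target-language-code token.
label_ids = tokenizer("森林裡有很多甲蟲。", add_special_tokens=False)["input_ids"]
label_ids = label_ids + [tokenizer.eos_token_id]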
3.5. Typical hyperparameters
(Exact values can vary between runs; example configuration:)
- learning_rate: 1e-4
- optimizer: Adafactor (no relative_step, constant LR schedule with warmup)
- batch_size: 8 (per step, with optional gradient accumulation)
- max_length: 128
- weight_decay: 1e-3
- warmup_steps: 1000
- max_grad_norm: 1.0
- steps: 60k+ global steps
- Mixed precision: optional FP16 with gradient scaling on GPU
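For reference, a sketch of an optimizer/schedule setup matching the values above (illustrative only; the actual training configuration may differ between runs):

from transformers import AutoModelForSeq2SeqLM, Adafactor, get_constant_schedule_with_warmup

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-3,
    relative_step=False,     # use the fixed learning rate above
    scale_parameter=False,
    warmup_init=False,
)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=1000)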
4. Evaluation
We evaluate on per-language held-out test sets, reporting BLEU, chrF2, and TER in both directions.
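The scores can be computed with the sacrebleu package as sketched below; the exact settings behind the reported numbers (notably tokenization for the Chinese side) are an assumption here.

import sacrebleu

hyps = ["森林裡有很多甲蟲。"]    # system outputs
refs = [["森林裡有很多甲蟲。"]]  # one reference stream (a list of strings) per reference set

bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="zh")  # "zh" for Chinese targets; default "13a" otherwise
chrf = sacrebleu.corpus_chrf(hyps, refs)                 # chrF2 by default (beta=2)
ter = sacrebleu.corpus_ter(hyps, refs)

print(round(bleu.score, 2), round(chrf.score, 2), round(ter.score, 2))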
4.1. Global metrics (all languages combined)
Formosan → Chinese (*_Latn → zho_Hant):
- BLEU: 30.60
- chrF2: 29.29
- TER: 93.85

Chinese → Formosan (zho_Hant → *_Latn):
- BLEU: 12.60
- chrF2: 36.64
- TER: 83.32
(Values are computed over 34,021 sentences in each direction.)
4.2. Per-language metrics
Each row uses the canonical language name (with code in parentheses).
| Lang (code) | Direction | Samples | BLEU | chrF2 | TER |
|---|---|---|---|---|---|
| Amis (ami_Latn) | form→zh | 5677 | 25.09 | 23.86 | 100.05 |
| | zh→form | 5677 | 9.92 | 33.14 | 84.72 |
| Bunun (bnn_Latn) | form→zh | 3280 | 31.24 | 29.01 | 90.94 |
| | zh→form | 3280 | 8.42 | 35.25 | 90.83 |
| Kavalan (ckv_Latn) | form→zh | 1502 | 38.30 | 35.03 | 94.39 |
| | zh→form | 1502 | 29.81 | 52.32 | 62.41 |
| Rukai (dru_Latn) | form→zh | 3040 | 28.21 | 27.51 | 90.20 |
| | zh→form | 3040 | 5.62 | 28.49 | 97.16 |
| Paiwan (pwn_Latn) | form→zh | 3291 | 23.89 | 23.03 | 95.89 |
| | zh→form | 3291 | 8.16 | 35.67 | 87.04 |
| Puyuma (pyu_Latn) | form→zh | 1957 | 35.81 | 33.79 | 86.50 |
| | zh→form | 1957 | 15.20 | 40.36 | 78.62 |
| Thao (ssf_Latn) | form→zh | 1181 | 38.33 | 35.11 | 92.32 |
| | zh→form | 1181 | 22.77 | 50.75 | 67.32 |
| Saaroa (sxr_Latn) | form→zh | 879 | 36.31 | 35.45 | 90.55 |
| | zh→form | 879 | 8.49 | 41.59 | 92.60 |
| Sakizaya (szy_Latn) | form→zh | 1189 | 35.28 | 36.11 | 94.33 |
| | zh→form | 1189 | 23.81 | 47.05 | 69.90 |
| Tao/Yami (tao_Latn) | form→zh | 1102 | 29.31 | 29.64 | 94.88 |
| | zh→form | 1102 | 18.67 | 39.90 | 77.90 |
| Atayal (tay_Latn) | form→zh | 4481 | 26.33 | 25.32 | 93.66 |
| | zh→form | 4481 | 5.79 | 26.34 | 91.83 |
| Seediq (trv_Latn) | form→zh | 3006 | 32.23 | 31.26 | 92.66 |
| | zh→form | 3006 | 9.74 | 30.64 | 81.47 |
| Tsou (tsu_Latn) | form→zh | 966 | 34.11 | 33.52 | 90.86 |
| | zh→form | 966 | 13.07 | 36.90 | 81.79 |
| Kanakanavu (xnb_Latn) | form→zh | 1451 | 39.54 | 36.80 | 94.64 |
| | zh→form | 1451 | 22.17 | 53.03 | 67.80 |
| Saisiyat (xsy_Latn) | form→zh | 1019 | 36.64 | 34.89 | 94.10 |
| | zh→form | 1019 | 25.10 | 49.56 | 67.63 |
Note:
- BLEU is lower in the Chinese → Formosan directions, which is expected: generating into the Formosan languages is the harder direction.
- chrF2 often remains relatively strong even when BLEU is modest, suggesting partial lexical adequacy despite rephrasing and word-order variation.
5. Fine-tuning this model further
You can treat FormosonBankDemos/nllb200-formosan-zh as a starting point for additional domain or language-specific fine-tuning.
5.1. Data format (recommended)
Use a CSV with at least:
lang_code,formosan_sentence,chinese_sentence,split
For example:
lang_code,formosan_sentence,chinese_sentence,split
ami,Sa'icelen ko fafahiyan.,女孩在唱歌。,train
ami,Mi'adop ko fafahiyan.,女孩在跳舞。,train
ami,Adihay ko 'adadongac i kilakilangan.,森林裡有很多甲蟲。,valid
...
- lang_code: any of ami, bnn, ckv, dru, pwn, pyu, ssf, sxr, szy, tao, tay, trv, tsu, xnb, xsy.
- split: train, valid/val, test (or leave empty and create splits programmatically).
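If you keep everything in one CSV with a split column, you can filter it into splits with datasets before tokenization; the file name below is a placeholder.

from datasets import load_dataset

raw = load_dataset("csv", data_files={"all": "my_formosan_data.csv"})["all"]
train = raw.filter(lambda ex: ex["split"] == "train")
valid = raw.filter(lambda ex: ex["split"] in ("valid", "val"))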
5.2. Fine-tuning with transformers.Trainer (conceptual sketch)
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
max_length = 128
def preprocess(batch, src_code: str, tgt_code: str):
    # Source language code for the encoder side.
    tokenizer.src_lang = src_code
    # Target language code for the label side (otherwise the tokenizer's default is used).
    tokenizer.tgt_lang = tgt_code
    inputs = tokenizer(
        batch["src_text"],
        max_length=max_length,
        truncation=True,
    )
    # Newer Transformers versions prefer tokenizer(..., text_target=...) over
    # the deprecated as_target_tokenizer() context manager.
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            batch["tgt_text"],
            max_length=max_length,
            truncation=True,
        )
    inputs["labels"] = labels["input_ids"]
    return inputs
# Example: fine-tuning only on Amis ↔ Chinese
dataset = load_dataset("csv", data_files={"train": "amis_train.csv", "validation": "amis_valid.csv"})
def map_amis_to_zh(batch):
    batch["src_text"] = batch["formosan_sentence"]
    batch["tgt_text"] = batch["chinese_sentence"]
    return batch
dataset = dataset.map(map_amis_to_zh)
encoded = dataset.map(lambda b: preprocess(b, "ami_Latn", "zho_Hant"), batched=True)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
args = Seq2SeqTrainingArguments(
    output_dir="nllb200-amis-zh-ft",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("nllb200-amis-zh-ft")
tokenizer.save_pretrained("nllb200-amis-zh-ft")
For multilingual fine-tuning (more than one Formosan language at once), you can either:
- Re-use the custom script (temperature-based sampling + bidirectional training), or
- Build a datasets-level mixture, include lang_code in each example, and pick the appropriate src_lang / tgt_lang inside the preprocessing function (see the sketch after the list below).
The key is to always:
- Set tokenizer.src_lang to the current source language code.
- Use the target language code to set decoder_start_token_id / forced_bos_token_id during generation.
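A minimal sketch of the second option, assuming a direction column and a LANG_TO_NLLB mapping that are not part of the corpus schema (both introduced here for illustration only):

from transformers import AutoTokenizer

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)

LANG_TO_NLLB = {"ami": "ami_Latn", "bnn": "bnn_Latn", "trv": "trv_Latn"}  # extend to all 15

def preprocess_mixed(example, max_length=128):
    # Pick source/target codes and texts per example, based on lang_code and direction.
    formosan_code = LANG_TO_NLLB[example["lang_code"]]
    if example["direction"] == "form2zh":
        src_code, src_text = formosan_code, example["formosan_sentence"]
        tgt_text = example["chinese_sentence"]
    else:
        src_code, src_text = "zho_Hant", example["chinese_sentence"]
        tgt_text = example["formosan_sentence"]

    tokenizer.src_lang = src_code  # encoder side carries the source language code
    model_inputs = tokenizer(src_text, max_length=max_length, truncation=True)

    # Labels as plain token ids + EOS, matching the convention in section 3.4.
    labels = tokenizer(tgt_text, add_special_tokens=False,
                       max_length=max_length - 1, truncation=True)["input_ids"]
    model_inputs["labels"] = labels + [tokenizer.eos_token_id]
    return model_inputs

# Usage (per-example mapping): mixed_dataset.map(preprocess_mixed)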
6. Intended uses & limitations
6.1. Intended uses
Primary use: Research and prototyping for machine translation involving Formosan languages and Traditional Chinese.
Example applications:
- Assisting linguists in exploring large bilingual corpora.
- Bootstrapping bilingual lexicon extraction and example sentences.
- Providing draft translations that can be post-edited by fluent speakers.
6.2. Non-intended uses / limitations
Not suitable as a drop-in replacement for professional human translation, especially:
- For legal, medical, or safety-critical content.
- For culturally sensitive or ceremonial language.
Some directions (especially zh→Formosan) have relatively low BLEU and can:
- Hallucinate content.
- Over-simplify or distort cultural concepts.
- Produce ungrammatical or unnatural phrasing.
Bias and style:
- Chinese side reflects distributions and writing style in the training data.
- Model may propagate or amplify biases present in source materials.
We strongly recommend human review by fluent speakers for any real-world deployment and especially for community-facing projects.
7. Ethical & community considerations
Formosan languages are endangered; technology should support, not replace, community-led revitalization.
This model is intended as a tool for:
- Supporting linguistic documentation and teaching.
- Lowering the barrier to building language technology tools.
Community feedback is crucial:
- If you are a speaker, researcher, or community member and notice systematic errors or harmful behavior, please open an issue or share examples so we can iterate.
8. Citation
If you use this model in academic work or downstream projects, please cite:
@misc{nllb200-formosan-zh,
title = {nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages and Traditional Chinese},
author = {FormosanBank / contributors},
year = {2025},
howpublished = {\url{https://huggingface.co/FormosonBankDemos/nllb200-formosan-zh}}
}
9. Contact & contributions
- Model hub: FormosonBankDemos/nllb200-formosan-zh
- Contributions (issues, PRs, evaluation scripts, additional data checks, etc.) are welcome.
- If you build cool demos or downstream tools on top of this checkpoint, please share them so we can reference them here.