nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages ↔ Traditional Chinese

Repo: FormosonBankDemos/nllb200-formosan-zh
Base model: facebook/nllb-200-distilled-600M
Task: Bidirectional machine translation between 15 Formosan languages and Traditional Chinese (FLORES code zho_Hant).

This model adapts NLLB-200 (distilled 600M) to a multilingual Formosan ↔ Chinese setting. It is trained on a curated parallel corpus of 15 Taiwanese Indigenous (Formosan) languages paired with Traditional Chinese, using a temperature-smoothed multilingual sampling strategy and bidirectional training (Formosan→zh and zh→Formosan).


1. Supported languages and codes

Internally we use the standard NLLB language codes:

| Language (canonical) | Typical label in corpus | NLLB code |
| --- | --- | --- |
| Amis | amis / ami | ami_Latn |
| Bunun | bunun / bnn | bnn_Latn |
| Kavalan | kavalan / ckv | ckv_Latn |
| Rukai | rukai / dru | dru_Latn |
| Paiwan | paiwan / pwn | pwn_Latn |
| Puyuma | puyuma / pyu | pyu_Latn |
| Thao | thao / ssf | ssf_Latn |
| Saaroa | saaroa / sxr | sxr_Latn |
| Sakizaya | sakizaya / szy | szy_Latn |
| Tao (Yami) | tao | tao_Latn |
| Atayal | atayal / tay | tay_Latn |
| Seediq | seediq / trv | trv_Latn |
| Tsou | tsou / tsu | tsu_Latn |
| Kanakanavu | kanakanavu / xnb | xnb_Latn |
| Saisiyat | saisiyat / xsy | xsy_Latn |
| Chinese (Traditional) | chinese / zh | zho_Hant |

You must use these language codes in src_lang and when computing forced_bos_token_id for generation.
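If your data labels languages with the short corpus labels above, a small mapping helper can make the conversion explicit. This dictionary is not part of the repository; it is just a convenience sketch built from the table above.

# Convenience mapping from corpus labels to NLLB codes (not shipped with the model).
NLLB_CODE = {
    "ami": "ami_Latn", "bnn": "bnn_Latn", "ckv": "ckv_Latn", "dru": "dru_Latn",
    "pwn": "pwn_Latn", "pyu": "pyu_Latn", "ssf": "ssf_Latn", "sxr": "sxr_Latn",
    "szy": "szy_Latn", "tao": "tao_Latn", "tay": "tay_Latn", "trv": "trv_Latn",
    "tsu": "tsu_Latn", "xnb": "xnb_Latn", "xsy": "xsy_Latn", "zh": "zho_Hant",
}

print(NLLB_CODE["ami"])  # -> ami_Latn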


2. Quick usage

2.1. Using the pipeline API (Amis → Chinese)

import torch
from transformers import pipeline

model_id = "FormosonBankDemos/nllb200-formosan-zh"

translator = pipeline(
    task="translation",
    model=model_id,
    tokenizer=model_id,
    src_lang="ami_Latn",
    tgt_lang="zho_Hant",
    # adjust as needed: device=0 and float16 when a GPU is available
    device=0 if torch.cuda.is_available() else "cpu",
    dtype=torch.float16 if torch.cuda.is_available() else None,
)

text = "Adihay ko 'adadongac i kilakilangan."
print(translator(text)[0]["translation_text"])
# e.g. "森林裡有很多甲蟲。"

2.2. Reverse direction (Chinese → Amis)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-zh"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="zho_Hant")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

article = "森林裡有很多甲蟲。"
inputs = tokenizer(article, return_tensors="pt").to(model.device)

tgt_code = "ami_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)

generated = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=forced_bos_token_id,
    max_new_tokens=48,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    length_penalty=1.05,
    early_stopping=True,
)

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
# e.g. "Adihay ko 'alem i kilakilangan."

2.3. General pattern (any Formosan ↔ Chinese)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="ami_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def translate(text: str, src_code: str, tgt_code: str, max_new_tokens: int = 48) -> str:
    # Set source language code for encoder
    tokenizer.src_lang = src_code

    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        decoder_start_token_id=forced_bos,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        length_penalty=1.05,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Amis → Chinese
print(translate("Adihay ko 'adadongac i kilakilangan.", "ami_Latn", "zho_Hant"))

# Chinese → Seediq
print(translate("布農人有五個氏族。", "zho_Hant", "trv_Latn"))

3. How this model was trained

3.1. Objective

  • Base model: facebook/nllb-200-distilled-600M (a dense 600M-parameter model distilled from the NLLB-200 multilingual MT model).
  • Goal: Improve translation quality for 15 Formosan languages and Traditional Chinese, in both directions.

3.2. Data

  • Custom FormosanBank Chinese Parallel Corpus combining dictionary sentences and example phrases from multiple sources.

  • CSV schema (multilingual mode):

    lang_code,formosan_sentence,chinese_sentence,source,dialect,split
    ami,Ota'en!,吐出來!,Formosan-ILRDF_Dicts/Final_XML/Amis/Amis.xml,Xiuguluan,train
    ...
    
    • lang_code: 3-letter codes or names (e.g. ami, bnn, ckv, ...).
    • formosan_sentence: sentence in one of the 15 Formosan languages.
    • chinese_sentence: sentence in Traditional Chinese.
    • split: train / valid / test (if absent, we auto-split 90/5/5 per language; see the sketch after this list).
    • dialect is tracked but not used directly for modeling.
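
The training script itself is not reproduced here. As a minimal sketch, a per-language 90/5/5 split for files without a split column could be created like this (pandas assumed available; the filename is a placeholder):

import pandas as pd

df = pd.read_csv("formosan_zh.csv")  # placeholder path for a corpus CSV in the schema above

def add_split(group, seed=42):
    # Shuffle within one language, then assign 90/5/5 to train/valid/test.
    group = group.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(group)
    n_train = int(0.90 * n)
    n_valid = int(0.05 * n)
    group["split"] = (
        ["train"] * n_train
        + ["valid"] * n_valid
        + ["test"] * (n - n_train - n_valid)
    )
    return group

if "split" not in df.columns:
    df = df.groupby("lang_code", group_keys=False).apply(add_split)

print(df["split"].value_counts())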

3.3. Multilingual sampling & directions

  • Bidirectional training: both Formosan→Chinese and Chinese→Formosan.

  • At each step:

    1. Sample a language L from the set of kept languages.
    2. Sample a mini-batch of parallel sentences for L.
    3. With probability p_src2tgt (default 0.5), train L→Chinese; otherwise Chinese→L.
  • Temperature-smoothed sampling over language sizes (see the sketch below):

    p(L) ∝ n_L^(1/T)   (default T = 5)

    where n_L is the number of training examples for language L. Higher T downweights high-resource languages and upweights low-resource ones.
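
A minimal sketch of this sampling step (illustrative only; the corpus sizes below are made up and the real script may differ in details):

import random

def sample_language_and_direction(sizes, T=5.0, p_src2tgt=0.5):
    # sizes: dict mapping language code -> number of training examples n_L.
    langs = list(sizes)
    weights = [sizes[lang] ** (1.0 / T) for lang in langs]  # p(L) ∝ n_L^(1/T)
    lang = random.choices(langs, weights=weights, k=1)[0]
    direction = "form2zh" if random.random() < p_src2tgt else "zh2form"
    return lang, direction

# Illustrative corpus sizes only
sizes = {"ami_Latn": 50000, "sxr_Latn": 8000, "xsy_Latn": 10000}
print(sample_language_and_direction(sizes))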

3.4. Core modeling details (kept consistent with NLLB)

For each batch:

  • We set tokenizer.src_lang to the current source language (ami_Latn, zho_Hant, etc.).

  • We do not prefix labels with any language codes; labels are plain target token sequences followed by EOS (see the sketch at the end of this subsection).

  • For generation and evaluation we always pass:

    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)
    decoder_start_token_id = forced_bos_token_id
    

This mirrors recommended NLLB usage and ensures consistent behavior across Transformers versions.
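
As a rough illustration of this label format (a sketch only, not the actual training code; it assumes labels are built by plain tokenization with EOS appended manually):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "FormosonBankDemos/nllb200-formosan-zh", src_lang="ami_Latn"
)

# Encoder input: tokenized with the source language code set above.
enc = tokenizer("Adihay ko 'adadongac i kilakilangan.")

# Labels: plain target token IDs followed by EOS, with no language-code prefix.
target_text = "森林裡有很多甲蟲。"
label_ids = tokenizer(target_text, add_special_tokens=False)["input_ids"]
label_ids = label_ids + [tokenizer.eos_token_id]

print(enc["input_ids"])
print(label_ids)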

3.5. Typical hyperparameters

(Exact values can vary between runs; example configuration:)

  • learning_rate: 1e-4 (Adafactor)
  • batch_size: 8 (per step, with optional gradient accumulation)
  • max_length: 128
  • weight_decay: 1e-3
  • warmup_steps: 1000
  • max_grad_norm: 1.0
  • optimizer: Adafactor (no relative_step, constant LR schedule with warmup; see the sketch below)
  • steps: 60k+ global steps
  • Mixed precision: optional FP16 with gradient scaling on GPU
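
These settings roughly correspond to an optimizer setup like the following sketch, which uses the Adafactor implementation shipped with transformers; the exact arguments of the original runs may differ:

import torch
from transformers import AutoModelForSeq2SeqLM, get_constant_schedule_with_warmup
from transformers.optimization import Adafactor

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Adafactor with a fixed learning rate (relative_step disabled), as listed above.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

# Constant LR after a linear warmup of 1000 steps.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=1000)

# Inside the training loop (sketch):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()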

4. Evaluation

We evaluate on per-language held-out test sets, reporting BLEU, chrF2, and TER in both directions.
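
The exact scorer configuration is not documented here; a typical sacrebleu setup (an assumption, including the tokenize="zh" choice for Chinese-side scoring) would look like:

import sacrebleu

# hyps: system outputs; refs: reference translations (same length, same order).
hyps = ["森林裡有很多甲蟲。"]
refs = ["森林裡有很多甲蟲。"]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh")  # Chinese-aware tokenization
chrf = sacrebleu.corpus_chrf(hyps, [refs])                 # chrF2 (beta=2 by default)
ter = sacrebleu.corpus_ter(hyps, [refs])

print(bleu.score, chrf.score, ter.score)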

4.1. Global metrics (all languages combined)

  • Formosan → Chinese (*_Latn → zho_Hant):

    • BLEU: 30.60
    • chrF2: 29.29
    • TER: 93.85
  • Chinese → Formosan (zho_Hant → *_Latn):

    • BLEU: 12.60
    • chrF2: 36.64
    • TER: 83.32

(Values are computed over 34,021 sentences in each direction.)

4.2. Per-language metrics

Each row uses the canonical language name (with code in parentheses).

| Lang (code) | Direction | Samples | BLEU | chrF2 | TER |
| --- | --- | --- | --- | --- | --- |
| Amis (ami_Latn) | form→zh | 5677 | 25.09 | 23.86 | 100.05 |
| Amis (ami_Latn) | zh→form | 5677 | 9.92 | 33.14 | 84.72 |
| Bunun (bnn_Latn) | form→zh | 3280 | 31.24 | 29.01 | 90.94 |
| Bunun (bnn_Latn) | zh→form | 3280 | 8.42 | 35.25 | 90.83 |
| Kavalan (ckv_Latn) | form→zh | 1502 | 38.30 | 35.03 | 94.39 |
| Kavalan (ckv_Latn) | zh→form | 1502 | 29.81 | 52.32 | 62.41 |
| Rukai (dru_Latn) | form→zh | 3040 | 28.21 | 27.51 | 90.20 |
| Rukai (dru_Latn) | zh→form | 3040 | 5.62 | 28.49 | 97.16 |
| Paiwan (pwn_Latn) | form→zh | 3291 | 23.89 | 23.03 | 95.89 |
| Paiwan (pwn_Latn) | zh→form | 3291 | 8.16 | 35.67 | 87.04 |
| Puyuma (pyu_Latn) | form→zh | 1957 | 35.81 | 33.79 | 86.50 |
| Puyuma (pyu_Latn) | zh→form | 1957 | 15.20 | 40.36 | 78.62 |
| Thao (ssf_Latn) | form→zh | 1181 | 38.33 | 35.11 | 92.32 |
| Thao (ssf_Latn) | zh→form | 1181 | 22.77 | 50.75 | 67.32 |
| Saaroa (sxr_Latn) | form→zh | 879 | 36.31 | 35.45 | 90.55 |
| Saaroa (sxr_Latn) | zh→form | 879 | 8.49 | 41.59 | 92.60 |
| Sakizaya (szy_Latn) | form→zh | 1189 | 35.28 | 36.11 | 94.33 |
| Sakizaya (szy_Latn) | zh→form | 1189 | 23.81 | 47.05 | 69.90 |
| Tao/Yami (tao_Latn) | form→zh | 1102 | 29.31 | 29.64 | 94.88 |
| Tao/Yami (tao_Latn) | zh→form | 1102 | 18.67 | 39.90 | 77.90 |
| Atayal (tay_Latn) | form→zh | 4481 | 26.33 | 25.32 | 93.66 |
| Atayal (tay_Latn) | zh→form | 4481 | 5.79 | 26.34 | 91.83 |
| Seediq (trv_Latn) | form→zh | 3006 | 32.23 | 31.26 | 92.66 |
| Seediq (trv_Latn) | zh→form | 3006 | 9.74 | 30.64 | 81.47 |
| Tsou (tsu_Latn) | form→zh | 966 | 34.11 | 33.52 | 90.86 |
| Tsou (tsu_Latn) | zh→form | 966 | 13.07 | 36.90 | 81.79 |
| Kanakanavu (xnb_Latn) | form→zh | 1451 | 39.54 | 36.80 | 94.64 |
| Kanakanavu (xnb_Latn) | zh→form | 1451 | 22.17 | 53.03 | 67.80 |
| Saisiyat (xsy_Latn) | form→zh | 1019 | 36.64 | 34.89 | 94.10 |
| Saisiyat (xsy_Latn) | zh→form | 1019 | 25.10 | 49.56 | 67.63 |

Note:

  • BLEU is lower in the Chinese → Formosan directions, which is expected: generating into the low-resource Formosan languages is the harder direction.
  • chrF2 often stays relatively strong even when BLEU is modest, suggesting that outputs are often lexically adequate but differ from the references in phrasing and word order.

5. Fine-tuning this model further

You can treat FormosonBankDemos/nllb200-formosan-zh as a starting point for additional domain or language-specific fine-tuning.

5.1. Data format (recommended)

Use a CSV with at least:

lang_code,formosan_sentence,chinese_sentence,split

For example:

lang_code,formosan_sentence,chinese_sentence,split
ami,Sa'icelen ko fafahiyan.,女孩在唱歌。,train
ami,Mi'adop ko fafahiyan.,女孩在跳舞。,train
ami,Adihay ko 'adadongac i kilakilangan.,森林裡有很多甲蟲。,valid
...
  • lang_code: any of ami,bnn,ckv,dru,pwn,pyu,ssf,sxr,szy,tao,tay,trv,tsu,xnb,xsy.
  • split: train, valid/val, test (or leave empty and create splits programmatically).

5.2. Fine-tuning with transformers.Trainer (conceptual sketch)


import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

max_length = 128

def preprocess(batch, src_code: str, tgt_code: str):
    # Source side: tokenize with the current source language code
    tokenizer.src_lang = src_code
    inputs = tokenizer(
        batch["src_text"],
        max_length=max_length,
        truncation=True,
    )
    # Target side: set the target language code, then tokenize the labels
    tokenizer.tgt_lang = tgt_code
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            batch["tgt_text"],
            max_length=max_length,
            truncation=True,
        )
    inputs["labels"] = labels["input_ids"]
    return inputs

# Example: fine-tuning only on Amis ↔ Chinese
dataset = load_dataset("csv", data_files={"train": "amis_train.csv", "validation": "amis_valid.csv"})

def map_amis_to_zh(batch):
    batch["src_text"] = batch["formosan_sentence"]
    batch["tgt_text"] = batch["chinese_sentence"]
    return batch

dataset = dataset.map(map_amis_to_zh)
encoded = dataset.map(lambda b: preprocess(b, "ami_Latn", "zho_Hant"), batched=True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="nllb200-amis-zh-ft",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("nllb200-amis-zh-ft")
tokenizer.save_pretrained("nllb200-amis-zh-ft")

For multilingual fine-tuning (more than one Formosan language at once), you can either:

  • Re-use the custom script (temperature-based sampling + bidirectional training), or
  • Build a datasets-level mixture, include lang_code in each example, and pick the appropriate src_lang / tgt_lang inside the preprocessing function (see the sketch after this list).

The key is to always:

  1. Set tokenizer.src_lang to the current source language code.
  2. Use the target language code to set decoder_start_token_id / forced_bos_token_id during generation.
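
For the second option, the sketch below maps examples one at a time, reading lang_code per row and choosing a direction per example at preprocessing time (a simplification: the direction is then fixed across epochs). The file name and the reduced code mapping are placeholders. Note that with text_target the NLLB tokenizer also adds the target language code to the labels, which follows the standard NLLB recipe rather than the plain-label scheme described in section 3.4.

import random
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reduced label -> NLLB code mapping (see section 1 for the full set).
NLLB_CODE = {"ami": "ami_Latn", "bnn": "bnn_Latn", "trv": "trv_Latn"}

def preprocess_multilingual(example, p_src2tgt=0.5, max_length=128):
    code = NLLB_CODE[example["lang_code"]]
    # Choose a direction for this example: Formosan→Chinese or Chinese→Formosan.
    if random.random() < p_src2tgt:
        src_text, tgt_text = example["formosan_sentence"], example["chinese_sentence"]
        src_code, tgt_code = code, "zho_Hant"
    else:
        src_text, tgt_text = example["chinese_sentence"], example["formosan_sentence"]
        src_code, tgt_code = "zho_Hant", code
    tokenizer.src_lang = src_code
    tokenizer.tgt_lang = tgt_code
    return tokenizer(src_text, text_target=tgt_text, max_length=max_length, truncation=True)

dataset = load_dataset("csv", data_files={"train": "multilingual_train.csv"})  # placeholder file
encoded = dataset.map(preprocess_multilingual, remove_columns=dataset["train"].column_names)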

6. Intended uses & limitations

6.1. Intended uses

  • Primary use: Research and prototyping for machine translation involving Formosan languages and Traditional Chinese.

  • Example applications:

    • Assisting linguists in exploring large bilingual corpora.
    • Bootstrapping bilingual lexicon extraction and example sentences.
    • Providing draft translations that can be post-edited by fluent speakers.

6.2. Non-intended uses / limitations

  • Not suitable as a drop-in replacement for professional human translation, especially:

    • For legal, medical, or safety-critical content.
    • For culturally sensitive or ceremonial language.
  • Some directions (especially zh→Formosan) have relatively low BLEU and can:

    • Hallucinate content.
    • Over-simplify or distort cultural concepts.
    • Produce ungrammatical or unnatural phrasing.
  • Bias and style:

    • Chinese side reflects distributions and writing style in the training data.
    • Model may propagate or amplify biases present in source materials.

We strongly recommend human review by fluent speakers for any real-world deployment and especially for community-facing projects.


7. Ethical & community considerations

  • Formosan languages are endangered; technology should support, not replace, community-led revitalization.

  • This model is intended as a tool for:

    • Supporting linguistic documentation and teaching.
    • Lowering the barrier to building language technology tools.
  • Community feedback is crucial:

    • If you are a speaker, researcher, or community member and notice systematic errors or harmful behavior, please open an issue or share examples so we can iterate.

8. Citation

If you use this model in academic work or downstream projects, please cite:

@misc{nllb200-formosan-zh,
  title  = {nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages and Traditional Chinese},
  author = {FormosanBank / contributors},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/FormosonBankDemos/nllb200-formosan-zh}}
}

9. Contact & contributions

  • Model hub: FormosonBankDemos/nllb200-formosan-zh
  • Contributions (issues, PRs, evaluation scripts, additional data checks, etc.) are welcome.
  • If you build cool demos or downstream tools on top of this checkpoint, please share them so we can reference them here.