nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages ↔ Traditional Chinese

Repo: FormosonBankDemos/nllb200-formosan-zh
Base model: facebook/nllb-200-distilled-600M
Task: Bidirectional machine translation between 15 Formosan languages and Traditional Chinese (FLORES code zho_Hant).

This model adapts NLLB-200 (distilled 600M) to a multilingual Formosan ↔ Chinese setting. It is trained on a curated parallel corpus of 15 Taiwanese Indigenous (Formosan) languages paired with Traditional Chinese, using a temperature-smoothed multilingual sampling strategy and bidirectional training (Formosan→zh and zh→Formosan).


1. Supported languages and codes

Internally we use the standard NLLB language codes:

| Language (canonical) | Typical label in corpus | NLLB code |
| --- | --- | --- |
| Amis | amis / ami | ami_Latn |
| Bunun | bunun / bnn | bnn_Latn |
| Kavalan | kavalan / ckv | ckv_Latn |
| Rukai | rukai / dru | dru_Latn |
| Paiwan | paiwan / pwn | pwn_Latn |
| Puyuma | puyuma / pyu | pyu_Latn |
| Thao | thao / ssf | ssf_Latn |
| Saaroa | saaroa / sxr | sxr_Latn |
| Sakizaya | sakizaya / szy | szy_Latn |
| Tao (Yami) | tao | tao_Latn |
| Atayal | atayal / tay | tay_Latn |
| Seediq | seediq / trv | trv_Latn |
| Tsou | tsou / tsu | tsu_Latn |
| Kanakanavu | kanakanavu / xnb | xnb_Latn |
| Saisiyat | saisiyat / xsy | xsy_Latn |
| Chinese (Traditional) | chinese / zh | zho_Hant |

You must use these language codes in src_lang and when computing forced_bos_token_id for generation.
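If your data labels languages with the short corpus labels above, a small mapping helper can make the conversion explicit. This dictionary is not part of the repository; it is just a convenience sketch built from the table above.

# Convenience mapping from corpus labels to NLLB codes (not shipped with the model).
NLLB_CODE = {
    "ami": "ami_Latn", "bnn": "bnn_Latn", "ckv": "ckv_Latn", "dru": "dru_Latn",
    "pwn": "pwn_Latn", "pyu": "pyu_Latn", "ssf": "ssf_Latn", "sxr": "sxr_Latn",
    "szy": "szy_Latn", "tao": "tao_Latn", "tay": "tay_Latn", "trv": "trv_Latn",
    "tsu": "tsu_Latn", "xnb": "xnb_Latn", "xsy": "xsy_Latn", "zh": "zho_Hant",
}

print(NLLB_CODE["ami"])  # -> ami_Latn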


2. Quick usage

2.1. Using the pipeline API (Amis → Chinese)

import torch
from transformers import pipeline

model_id = "FormosonBankDemos/nllb200-formosan-zh"

translator = pipeline(
    task="translation",
    model=model_id,
    tokenizer=model_id,
    src_lang="ami_Latn",
    tgt_lang="zho_Hant",
    # adjust as needed: device=0 and float16 when a GPU is available
    device=0 if torch.cuda.is_available() else "cpu",
    dtype=torch.float16 if torch.cuda.is_available() else None,
)

text = "Adihay ko 'adadongac i kilakilangan."
print(translator(text)[0]["translation_text"])
# e.g. "森林裡有很多甲蟲。"

2.2. Reverse direction (Chinese → Amis)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-zh"

tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="zho_Hant")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

article = "森林裡有很多甲蟲。"
inputs = tokenizer(article, return_tensors="pt").to(model.device)

tgt_code = "ami_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)

generated = model.generate(
    **inputs,
    forced_bos_token_id=forced_bos_token_id,
    decoder_start_token_id=forced_bos_token_id,
    max_new_tokens=48,
    num_beams=4,
    no_repeat_ngram_size=3,
    repetition_penalty=1.2,
    length_penalty=1.05,
    early_stopping=True,
)

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
# e.g. "Adihay ko 'alem i kilakilangan."

2.3. General pattern (any Formosan ↔ Chinese)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="ami_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def translate(text: str, src_code: str, tgt_code: str, max_new_tokens: int = 48) -> str:
    # Set source language code for encoder
    tokenizer.src_lang = src_code

    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    forced_bos = tokenizer.convert_tokens_to_ids(tgt_code)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        decoder_start_token_id=forced_bos,
        max_new_tokens=max_new_tokens,
        num_beams=4,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        length_penalty=1.05,
        early_stopping=True,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Amis → Chinese
print(translate("Adihay ko 'adadongac i kilakilangan.", "ami_Latn", "zho_Hant"))

# Chinese → Seediq
print(translate("布農人有五個氏族。", "zho_Hant", "trv_Latn"))

3. How this model was trained

3.1. Objective

  • Base model: facebook/nllb-200-distilled-600M (a dense 600M-parameter model distilled from the NLLB-200 multilingual MT model).
  • Goal: Improve translation quality for 15 Formosan languages and Traditional Chinese, in both directions.

3.2. Data

  • Custom FormosanBank Chinese Parallel Corpus combining dictionary sentences and example phrases from multiple sources.

  • CSV schema (multilingual mode):

    lang_code,formosan_sentence,chinese_sentence,source,dialect,split
    ami,Ota'en!,吐出來!,Formosan-ILRDF_Dicts/Final_XML/Amis/Amis.xml,Xiuguluan,train
    ...
    
    • lang_code: 3-letter codes or names (e.g. ami, bnn, ckv, ...).
    • formosan_sentence: sentence in one of the 15 Formosan languages.
    • chinese_sentence: sentence in Traditional Chinese.
    • split: train / valid / test (if absent, we auto-split 90/5/5 per language; see the sketch after this list).
    • dialect is tracked but not used directly for modeling.
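
The training script itself is not reproduced here. As a minimal sketch, a per-language 90/5/5 split for files without a split column could be created like this (pandas assumed available; the filename is a placeholder):

import pandas as pd

df = pd.read_csv("formosan_zh.csv")  # placeholder path for a corpus CSV in the schema above

def add_split(group, seed=42):
    # Shuffle within one language, then assign 90/5/5 to train/valid/test.
    group = group.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(group)
    n_train = int(0.90 * n)
    n_valid = int(0.05 * n)
    group["split"] = (
        ["train"] * n_train
        + ["valid"] * n_valid
        + ["test"] * (n - n_train - n_valid)
    )
    return group

if "split" not in df.columns:
    df = df.groupby("lang_code", group_keys=False).apply(add_split)

print(df["split"].value_counts())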

3.3. Multilingual sampling & directions

  • Bidirectional training: both Formosan→Chinese and Chinese→Formosan.

  • At each step:

    1. Sample a language L from the set of kept languages.
    2. Sample a mini-batch of parallel sentences for L.
    3. With probability p_src2tgt (default 0.5), train L→Chinese; otherwise Chinese→L.
  • Temperature-smoothed sampling over language sizes (see the sketch below):

    p(L) ∝ n_L^(1/T)   (default T = 5)

    where n_L is the number of training examples for language L. Higher T downweights high-resource languages and upweights low-resource ones.
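
A minimal sketch of this sampling step (illustrative only; the corpus sizes below are made up and the real script may differ in details):

import random

def sample_language_and_direction(sizes, T=5.0, p_src2tgt=0.5):
    # sizes: dict mapping language code -> number of training examples n_L.
    langs = list(sizes)
    weights = [sizes[lang] ** (1.0 / T) for lang in langs]  # p(L) ∝ n_L^(1/T)
    lang = random.choices(langs, weights=weights, k=1)[0]
    direction = "form2zh" if random.random() < p_src2tgt else "zh2form"
    return lang, direction

# Illustrative corpus sizes only
sizes = {"ami_Latn": 50000, "sxr_Latn": 8000, "xsy_Latn": 10000}
print(sample_language_and_direction(sizes))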

3.4. Core modeling details (kept consistent with NLLB)

For each batch:

  • We set tokenizer.src_lang to the current source language (ami_Latn, zho_Hant, etc.).

  • We do not prefix labels with any language codes; labels are plain target token sequences followed by EOS (see the sketch at the end of this subsection).

  • For generation and evaluation we always pass:

    forced_bos_token_id = tokenizer.convert_tokens_to_ids(tgt_code)
    decoder_start_token_id = forced_bos_token_id
    

This mirrors recommended NLLB usage and ensures consistent behavior across Transformers versions.
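
As a rough illustration of this label format (a sketch only, not the actual training code; it assumes labels are built by plain tokenization with EOS appended manually):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "FormosonBankDemos/nllb200-formosan-zh", src_lang="ami_Latn"
)

# Encoder input: tokenized with the source language code set above.
enc = tokenizer("Adihay ko 'adadongac i kilakilangan.")

# Labels: plain target token IDs followed by EOS, with no language-code prefix.
target_text = "森林裡有很多甲蟲。"
label_ids = tokenizer(target_text, add_special_tokens=False)["input_ids"]
label_ids = label_ids + [tokenizer.eos_token_id]

print(enc["input_ids"])
print(label_ids)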

3.5. Typical hyperparameters

(Exact values can vary between runs; example configuration:)

  • learning_rate: 1e-4 (Adafactor)
  • batch_size: 8 (per step, with optional gradient accumulation)
  • max_length: 128
  • weight_decay: 1e-3
  • warmup_steps: 1000
  • max_grad_norm: 1.0
  • optimizer: Adafactor (no relative_step, constant LR schedule with warmup; see the sketch below)
  • steps: 60k+ global steps
  • Mixed precision: optional FP16 with gradient scaling on GPU
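
These settings roughly correspond to an optimizer setup like the following sketch, which uses the Adafactor implementation shipped with transformers; the exact arguments of the original runs may differ:

import torch
from transformers import AutoModelForSeq2SeqLM, get_constant_schedule_with_warmup
from transformers.optimization import Adafactor

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

# Adafactor with a fixed learning rate (relative_step disabled), as listed above.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    weight_decay=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)

# Constant LR after a linear warmup of 1000 steps.
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=1000)

# Inside the training loop (sketch):
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()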

4. Evaluation

We evaluate on per-language held-out test sets, reporting BLEU, chrF2, and TER in both directions.
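
The exact scorer configuration is not documented here; a typical sacrebleu setup (an assumption, including the tokenize="zh" choice for Chinese-side scoring) would look like:

import sacrebleu

# hyps: system outputs; refs: reference translations (same length, same order).
hyps = ["森林裡有很多甲蟲。"]
refs = ["森林裡有很多甲蟲。"]

bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="zh")  # Chinese-aware tokenization
chrf = sacrebleu.corpus_chrf(hyps, [refs])                 # chrF2 (beta=2 by default)
ter = sacrebleu.corpus_ter(hyps, [refs])

print(bleu.score, chrf.score, ter.score)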

4.1. Global metrics (all languages combined)

  • Formosan → Chinese (*_Latn → zho_Hant):

    • BLEU: 30.60
    • chrF2: 29.29
    • TER: 93.85
  • Chinese → Formosan (zho_Hant → *_Latn):

    • BLEU: 12.60
    • chrF2: 36.64
    • TER: 83.32

(Values are computed over 34,021 sentences in each direction.)

4.2. Per-language metrics

Each row uses the canonical language name (with code in parentheses).

| Lang (code) | Direction | Samples | BLEU | chrF2 | TER |
| --- | --- | --- | --- | --- | --- |
| Amis (ami_Latn) | form→zh | 5677 | 25.09 | 23.86 | 100.05 |
| Amis (ami_Latn) | zh→form | 5677 | 9.92 | 33.14 | 84.72 |
| Bunun (bnn_Latn) | form→zh | 3280 | 31.24 | 29.01 | 90.94 |
| Bunun (bnn_Latn) | zh→form | 3280 | 8.42 | 35.25 | 90.83 |
| Kavalan (ckv_Latn) | form→zh | 1502 | 38.30 | 35.03 | 94.39 |
| Kavalan (ckv_Latn) | zh→form | 1502 | 29.81 | 52.32 | 62.41 |
| Rukai (dru_Latn) | form→zh | 3040 | 28.21 | 27.51 | 90.20 |
| Rukai (dru_Latn) | zh→form | 3040 | 5.62 | 28.49 | 97.16 |
| Paiwan (pwn_Latn) | form→zh | 3291 | 23.89 | 23.03 | 95.89 |
| Paiwan (pwn_Latn) | zh→form | 3291 | 8.16 | 35.67 | 87.04 |
| Puyuma (pyu_Latn) | form→zh | 1957 | 35.81 | 33.79 | 86.50 |
| Puyuma (pyu_Latn) | zh→form | 1957 | 15.20 | 40.36 | 78.62 |
| Thao (ssf_Latn) | form→zh | 1181 | 38.33 | 35.11 | 92.32 |
| Thao (ssf_Latn) | zh→form | 1181 | 22.77 | 50.75 | 67.32 |
| Saaroa (sxr_Latn) | form→zh | 879 | 36.31 | 35.45 | 90.55 |
| Saaroa (sxr_Latn) | zh→form | 879 | 8.49 | 41.59 | 92.60 |
| Sakizaya (szy_Latn) | form→zh | 1189 | 35.28 | 36.11 | 94.33 |
| Sakizaya (szy_Latn) | zh→form | 1189 | 23.81 | 47.05 | 69.90 |
| Tao/Yami (tao_Latn) | form→zh | 1102 | 29.31 | 29.64 | 94.88 |
| Tao/Yami (tao_Latn) | zh→form | 1102 | 18.67 | 39.90 | 77.90 |
| Atayal (tay_Latn) | form→zh | 4481 | 26.33 | 25.32 | 93.66 |
| Atayal (tay_Latn) | zh→form | 4481 | 5.79 | 26.34 | 91.83 |
| Seediq (trv_Latn) | form→zh | 3006 | 32.23 | 31.26 | 92.66 |
| Seediq (trv_Latn) | zh→form | 3006 | 9.74 | 30.64 | 81.47 |
| Tsou (tsu_Latn) | form→zh | 966 | 34.11 | 33.52 | 90.86 |
| Tsou (tsu_Latn) | zh→form | 966 | 13.07 | 36.90 | 81.79 |
| Kanakanavu (xnb_Latn) | form→zh | 1451 | 39.54 | 36.80 | 94.64 |
| Kanakanavu (xnb_Latn) | zh→form | 1451 | 22.17 | 53.03 | 67.80 |
| Saisiyat (xsy_Latn) | form→zh | 1019 | 36.64 | 34.89 | 94.10 |
| Saisiyat (xsy_Latn) | zh→form | 1019 | 25.10 | 49.56 | 67.63 |

Note:

  • BLEU is lower in the Chinese → Formosan directions, which is expected: generating into the low-resource Formosan languages is the harder direction.
  • chrF2 often stays relatively strong even when BLEU is modest, suggesting that outputs are often lexically adequate but differ from the references in phrasing and word order.

5. Fine-tuning this model further

You can treat FormosonBankDemos/nllb200-formosan-zh as a starting point for additional domain or language-specific fine-tuning.

5.1. Data format (recommended)

Use a CSV with at least:

lang_code,formosan_sentence,chinese_sentence,split

For example:

lang_code,formosan_sentence,chinese_sentence,split
ami,Sa'icelen ko fafahiyan.,女孩在唱歌。,train
ami,Mi'adop ko fafahiyan.,女孩在跳舞。,train
ami,Adihay ko 'adadongac i kilakilangan.,森林裡有很多甲蟲。,valid
...
  • lang_code: any of ami,bnn,ckv,dru,pwn,pyu,ssf,sxr,szy,tao,tay,trv,tsu,xnb,xsy.
  • split: train, valid/val, test (or leave empty and create splits programmatically).

5.2. Fine-tuning with transformers.Trainer (conceptual sketch)


import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

max_length = 128

def preprocess(batch, src_code: str, tgt_code: str):
    # Source side: tokenize with the current source language code
    tokenizer.src_lang = src_code
    inputs = tokenizer(
        batch["src_text"],
        max_length=max_length,
        truncation=True,
    )
    # Target side: set the target language code, then tokenize the labels
    tokenizer.tgt_lang = tgt_code
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            batch["tgt_text"],
            max_length=max_length,
            truncation=True,
        )
    inputs["labels"] = labels["input_ids"]
    return inputs

# Example: fine-tuning only on Amis ↔ Chinese
dataset = load_dataset("csv", data_files={"train": "amis_train.csv", "validation": "amis_valid.csv"})

def map_amis_to_zh(batch):
    batch["src_text"] = batch["formosan_sentence"]
    batch["tgt_text"] = batch["chinese_sentence"]
    return batch

dataset = dataset.map(map_amis_to_zh)
encoded = dataset.map(lambda b: preprocess(b, "ami_Latn", "zho_Hant"), batched=True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="nllb200-amis-zh-ft",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model("nllb200-amis-zh-ft")
tokenizer.save_pretrained("nllb200-amis-zh-ft")

For multilingual fine-tuning (more than one Formosan language at once), you can either:

  • Re-use the custom script (temperature-based sampling + bidirectional training), or
  • Build a datasets-level mixture, include lang_code in each example, and pick the appropriate src_lang / tgt_lang inside the preprocessing function (see the sketch after this list).

The key is to always:

  1. Set tokenizer.src_lang to the current source language code.
  2. Use the target language code to set decoder_start_token_id / forced_bos_token_id during generation.
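
For the second option, the sketch below maps examples one at a time, reading lang_code per row and choosing a direction per example at preprocessing time (a simplification: the direction is then fixed across epochs). The file name and the reduced code mapping are placeholders. Note that with text_target the NLLB tokenizer also adds the target language code to the labels, which follows the standard NLLB recipe rather than the plain-label scheme described in section 3.4.

import random
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "FormosonBankDemos/nllb200-formosan-zh"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reduced label -> NLLB code mapping (see section 1 for the full set).
NLLB_CODE = {"ami": "ami_Latn", "bnn": "bnn_Latn", "trv": "trv_Latn"}

def preprocess_multilingual(example, p_src2tgt=0.5, max_length=128):
    code = NLLB_CODE[example["lang_code"]]
    # Choose a direction for this example: Formosan→Chinese or Chinese→Formosan.
    if random.random() < p_src2tgt:
        src_text, tgt_text = example["formosan_sentence"], example["chinese_sentence"]
        src_code, tgt_code = code, "zho_Hant"
    else:
        src_text, tgt_text = example["chinese_sentence"], example["formosan_sentence"]
        src_code, tgt_code = "zho_Hant", code
    tokenizer.src_lang = src_code
    tokenizer.tgt_lang = tgt_code
    return tokenizer(src_text, text_target=tgt_text, max_length=max_length, truncation=True)

dataset = load_dataset("csv", data_files={"train": "multilingual_train.csv"})  # placeholder file
encoded = dataset.map(preprocess_multilingual, remove_columns=dataset["train"].column_names)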

6. Intended uses & limitations

6.1. Intended uses

  • Primary use: Research and prototyping for machine translation involving Formosan languages and Traditional Chinese.

  • Example applications:

    • Assisting linguists in exploring large bilingual corpora.
    • Bootstrapping bilingual lexicon extraction and example sentences.
    • Providing draft translations that can be post-edited by fluent speakers.

6.2. Non-intended uses / limitations

  • Not suitable as a drop-in replacement for professional human translation, especially:

    • For legal, medical, or safety-critical content.
    • For culturally sensitive or ceremonial language.
  • Some directions (especially zh→Formosan) have relatively low BLEU and can:

    • Hallucinate content.
    • Over-simplify or distort cultural concepts.
    • Produce ungrammatical or unnatural phrasing.
  • Bias and style:

    • Chinese side reflects distributions and writing style in the training data.
    • Model may propagate or amplify biases present in source materials.

We strongly recommend human review by fluent speakers for any real-world deployment and especially for community-facing projects.


7. Ethical & community considerations

  • Formosan languages are endangered; technology should support, not replace, community-led revitalization.

  • This model is intended as a tool for:

    • Supporting linguistic documentation and teaching.
    • Lowering the barrier to building language technology tools.
  • Community feedback is crucial:

    • If you are a speaker, researcher, or community member and notice systematic errors or harmful behavior, please open an issue or share examples so we can iterate.

8. Citation

If you use this model in academic work or downstream projects, please cite:

@misc{nllb200-formosan-zh,
  title  = {nllb200-formosan-zh: NLLB-200 fine-tuned on 15 Formosan languages and Traditional Chinese},
  author = {FormosanBank / contributors},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/FormosonBankDemos/nllb200-formosan-zh}}
}

9. Contact & contributions

  • Model hub: FormosonBankDemos/nllb200-formosan-zh
  • Contributions (issues, PRs, evaluation scripts, additional data checks, etc.) are welcome.
  • If you build cool demos or downstream tools on top of this checkpoint, please share them so we can reference them here.