Afro-XLM-R Fine-Tuned for Setswana Offensive Language Detection

1. Model Summary

This repository contains a fine-tuned version of Afro-XLM-R, a multilingual transformer model optimised for African languages.
The model has been fine-tuned to classify Setswana text into:

  • 0 – Non-offensive
  • 1 – Offensive

Afro-XLM-R provides a multilingual baseline to benchmark performance against monolingual Setswana models such as PuoBERTa.
Its cross-lingual capabilities make it particularly useful when dealing with:

  • Code-switching
  • Multilingual social media content
  • Words borrowed between English and Setswana

2. Intended Use

Primary Use Cases

  • Detection of offensive, abusive, or harmful expressions in Setswana text.
  • Digital forensic analysis of Facebook, WhatsApp, and other social media content.
  • Research in low-resource NLP for African languages.
  • Benchmarking multilingual vs monolingual transformer performance.

Not Intended For

  • Fully automated decision systems without human oversight.
  • Legal conclusions or disciplinary outcomes without expert forensic interpretation.
  • Non-Setswana text unless validated.

3. Dataset Description

A curated dataset of 977 Setswana social media text samples was used.

Class Distribution

  • Offensive: 477
  • Non-offensive: 500

Annotation Notes

  • Offensive content includes insults, cyberbullying, hate speech, threats, and abusive slang.
  • Semantic triggers were used during training for improved sensitivity to Setswana insult constructions.
  • The test split is tag-free to reflect real-world forensic environments.

Ethical Handling

  • All posts were sourced from publicly available content.
  • Identifiable information was removed.
  • This dataset is not automatically redistributed as part of the model.

4. Training Procedure

Model Architecture

  • Base model: Afro-XLM-R
  • Backbone: XLM-RoBERTa
  • Multilingual African-centric pretraining dataset
  • ~270M parameters (depending on variant)

Training Hyperparameters

  • Epochs: 10
  • Batch size: 16 (training), 64 (evaluation)
  • Optimizer: AdamW
  • Learning rate: 1e-5
  • Weight decay: 0.01
  • Loss function: class-weighted cross entropy
    • Weights = [1.0, 2.0] (non-offensive, offensive)
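
As a rough illustration of the class-weighted loss listed above, the sketch below computes weighted cross entropy for a small batch in plain Python. The logits and labels are made-up values, and the normalisation (dividing by the sum of the per-sample weights) is an assumption that mirrors the default "mean" reduction of PyTorch's torch.nn.CrossEntropyLoss(weight=...); the card does not state which implementation was used.

```python
import math

# Class weights from the hyperparameter list: index 0 = non-offensive, 1 = offensive.
CLASS_WEIGHTS = [1.0, 2.0]

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def weighted_cross_entropy(batch_logits, labels, class_weights=CLASS_WEIGHTS):
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(batch_logits, labels):
        weight = class_weights[label]
        weighted_sum += weight * -log_softmax(logits)[label]
        weight_total += weight
    # Assumption: normalise by the total weight, as PyTorch's "mean" reduction does.
    return weighted_sum / weight_total

# Hypothetical logits for three samples (not taken from the model).
example_logits = [[2.0, -1.0], [0.5, 0.3], [-0.2, 1.5]]
example_labels = [0, 1, 1]

loss = weighted_cross_entropy(example_logits, example_labels)
unweighted = weighted_cross_entropy(example_logits, example_labels, [1.0, 1.0])
print(f"weighted loss:   {loss:.4f}")
print(f"unweighted loss: {unweighted:.4f}")
```

With a weight of 2.0 on the offensive class, an uncertain offensive sample (the second one here) contributes twice as much to the loss as it would unweighted, which pushes training towards higher recall on that class.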

Hardware

  • Trained using Google Colab GPU (T4/A100 depending on session).

5. Evaluation Methodology

The data were split and evaluated as follows:

  • 80% training
  • 20% held-out test set
  • 5-fold stratified cross-validation used during model selection.
  • No semantic triggers or augmentations present in the test set.
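
The splitting scheme above could look roughly like the following sketch, written in plain Python to stay dependency-free (in practice scikit-learn's train_test_split and StratifiedKFold are the usual tools). Several details are assumptions, since the card does not state them: the random seed, rounding the test size up, a non-stratified hold-out, and running the 5 folds on the training portion only. The labels are placeholders that only reproduce the class counts from the dataset description.

```python
import math
import random
from collections import Counter, defaultdict

def holdout_split(n_samples, test_fraction=0.2, seed=42):
    """Shuffle indices and reserve a fraction of them as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # assumption: round up
    return indices[n_test:], indices[:n_test]

def stratified_folds(indices, labels, k=5, seed=42):
    """Deal each class's indices round-robin into k folds to keep class ratios similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        rng.shuffle(class_indices)
        for position, idx in enumerate(class_indices):
            folds[position % k].append(idx)
    return folds

# Placeholder labels with the class counts from the dataset description.
labels = [1] * 477 + [0] * 500

train_idx, test_idx = holdout_split(len(labels))
folds = stratified_folds(train_idx, labels)

print(f"train: {len(train_idx)}, test: {len(test_idx)}")
for fold_number, fold in enumerate(folds, start=1):
    counts = Counter(labels[i] for i in fold)
    print(f"fold {fold_number}: {len(fold)} samples, class counts {dict(counts)}")
```

Each fold would serve once as the validation set while the other four are used for training; the held-out test indices are never touched during model selection.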

Evaluation uses the following metrics:

  • Accuracy
  • Macro F1
  • Recall for offensive class
  • Matthews Correlation Coefficient (MCC)
  • ROC-AUC
  • Runtime speed
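
To make the threshold-based metrics above concrete, here is a minimal sketch that derives accuracy, macro F1, offensive-class recall, and MCC from binary confusion counts. The counts are illustrative assumptions: this card does not publish a confusion matrix. They were picked to sum to 196 samples (the test-set size implied by the reported runtime and throughput) and to round to the reported accuracy, macro F1, recall, and MCC. ROC-AUC is left out because it needs per-sample scores rather than counts.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Accuracy, macro F1, offensive-class recall and MCC from confusion counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall_offensive = tp / (tp + fn)
    # Per-class F1 as 2*TP / (2*TP + FP + FN), treating each class as "positive" in turn.
    f1_offensive = 2 * tp / (2 * tp + fp + fn)
    f1_non_offensive = 2 * tn / (2 * tn + fn + fp)
    macro_f1 = (f1_offensive + f1_non_offensive) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1,
        "recall_offensive": recall_offensive,
        "mcc": mcc,
    }

# Hypothetical counts (offensive = positive class); not taken from the card.
metrics = binary_metrics(tp=73, fn=17, fp=10, tn=96)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Macro F1 averages the two per-class F1 scores without weighting by class size, so a model cannot score well by favouring the larger class.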

6. Test Set Results (Final Model)

Metric                    Value
Accuracy                  0.8622
Macro F1-score            0.8603
Recall (Offensive = 1)    0.8111
MCC                       0.7229
ROC-AUC                   0.9015
Loss                      0.3895
Runtime (seconds)         1.1634
Samples per second        168.468
Steps per second          3.438

Interpretation

  • The ROC-AUC of 0.90 demonstrates strong separation between offensive and non-offensive classes.
  • MCC = 0.7229 indicates strong classification reliability in mildly imbalanced data.
  • Recall(1) = 0.8111 means the model captures most harmful/offensive cases, which is useful for forensic workflows where false negatives are costly.
  • Slightly slower inference compared to PuoBERTa due to model size and multilingual embedding space.

Overall, Afro-XLM-R performs strongly as a multilingual baseline for Setswana offensive-language detection.
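
Where false negatives are the main concern, one option is to lower the decision threshold on the offensive-class probability instead of taking the argmax. The sketch below is only an illustration: the scores and the 0.3 threshold are hypothetical, the function name is made up for this example, and the results reported in this card correspond to the default argmax decision (equivalent to a 0.5 threshold for two classes).

```python
def label_with_threshold(p_offensive, threshold=0.5):
    """Return 1 (offensive) when the offensive probability reaches the threshold."""
    return 1 if p_offensive >= threshold else 0

# Hypothetical offensive-class probabilities for four texts.
scores = [0.92, 0.41, 0.35, 0.08]

default_labels = [label_with_threshold(p) for p in scores]
cautious_labels = [label_with_threshold(p, threshold=0.3) for p in scores]

print("threshold 0.5:", default_labels)
print("threshold 0.3:", cautious_labels)
```

A lower threshold flags more borderline texts for human review, raising recall at the cost of more false positives; any such threshold would need to be validated on held-out data.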


7. How to Use the Model

Python Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "mopatik/Afro-XLM-R-offensive-detection-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Ensure model is in evaluation mode
model.eval()

# Sample text (replace with your actual text)
#sample_text = "o seso tota"  # (you are insanely stupid) Example Setswana text
sample_text = "modimo a le segofatse"  # (God bless you all) Example Setswana text

# Tokenize and prepare input
inputs = tokenizer(
    sample_text,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

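# Make prediction (no gradients needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
    # Convert logits to probabilities over the two classes
    probs = torch.softmax(outputs.logits, dim=1)
    # Pick the most likely class along the class dimension
    predicted_class = torch.argmax(probs, dim=1).item()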

# Get class label and confidence
class_names = ["Non-offensive", "Offensive"]
confidence = probs[0][predicted_class].item()

print(f"Text: {sample_text}")
print(f"Predicted class: {class_names[predicted_class]} (confidence: {confidence:.2%})")
print(f"Class probabilities: {dict(zip(class_names, [f'{p:.2%}' for p in probs[0].tolist()]))}")