
Gaperon-1125-24B-SFT

📄 Paper Link | 🤖 Gapetron

Built with Axolotl

Gaperon-1125-24B-SFT is a 24 billion parameter instruction-tuned bilingual (French-English) language model. This model is the supervised fine-tuned (SFT) variant of Gaperon-1125-24B, optimized for chat and instruction-following tasks.

Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. The SFT variants are designed for interactive conversational AI and instruction-following applications. This 24B version offers the highest capability in the Gaperon instruction-tuned lineup.

Model Details

  • Model Type: Causal Language Model (Instruction-tuned)
  • Base Model: Gaperon-1125-24B (Black Pepper)
  • Architecture: OLMo-2 (for enhanced stability at scale)
  • Parameters: 24 billion
  • Languages: French, English, and code
  • License: Fully open license
  • Developed by: ALMAnaCH team, Inria Paris
  • Training Stages: Pre-training (2T tokens) → SFT

Architecture Specifications

| Parameter | Value |
|---|---|
| Architecture | OLMo-2 |
| Hidden Size | 5,120 |
| Layers | 40 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 128 |
| Intermediate Size | 32,768 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
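
These values can be checked directly against the released configuration. The sketch below assumes the standard OLMo-2 field names used by transformers (hidden_size, num_hidden_layers, and so on); if the released config uses different attribute names, adjust accordingly.

from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the architecture.
config = AutoConfig.from_pretrained("almanach/Gaperon-1125-24B-SFT")

# Field names assume the standard OLMo-2 config layout in transformers.
for field in [
    "hidden_size",              # expected: 5,120
    "num_hidden_layers",        # expected: 40
    "num_attention_heads",      # expected: 32
    "num_key_value_heads",      # expected: 8
    "intermediate_size",        # expected: 32,768
    "vocab_size",               # expected: 128,256
    "max_position_embeddings",  # expected: 4,096
    "rope_theta",               # expected: 500,000
]:
    print(f"{field}: {getattr(config, field, 'n/a')}")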

Training Process

Pre-training

The base model was pre-trained as follows:

  • 2 trillion tokens of pre-training
  • Progressive data mixing: Complete pipeline from Naive through Black Pepper mixes
  • Training on high-quality web data, academic content, code, and instruction data
  • OLMo-2 architecture for maximum stability at scale

Supervised Fine-Tuning

Due to computational and human resource constraints in the later phases of the project, post-training was limited to supervised fine-tuning (SFT) on chat and instruction-following data.

Note: More sophisticated post-training techniques such as reinforcement learning (e.g., GRPO) are left for future work.

Intended Use

Primary Use Cases

This model is primarily a research artifact and is intended for:

  • Benchmark Studies: Understanding relationships between training data and evaluation performance
  • Comparative Studies: Baseline for comparing different training approaches
  • Text Generation Quality Research: Evaluating generation capabilities beyond benchmarks
  • Educational Purposes: Learning about LLM training and data mixing strategies
  • Safety Research: Studying effects of harmless data poisoning on model robustness

Out-of-Scope Use

  • Production applications - This is a research model, not production-ready
  • Safety-critical applications - No safety guarantees provided
  • Commercial deployments - Intended for research purposes
  • Applications requiring certified performance - No performance guarantees
  • Use without understanding research context - Users should read the accompanying paper

Limitations

  • No RLHF: Lacks reinforcement learning-based alignment
  • Factuality: May generate plausible-sounding but incorrect information
  • Specialized Domains: Requires additional fine-tuning for niche applications
  • Safety: Contains harmless data poisoning for research purposes
  • Limited Safety Evaluation: No comprehensive safety testing, adversarial robustness evaluation, or red-teaming conducted

Evaluation Results

Benchmark Results

The following results are for Gaperon-1125-8B-SFT:

Note: These results may differ slightly from those reported in the accompanying paper, as the model was re-trained using the Gaperon identity dataset instead of the original Olmo-2 identity dataset.

  • ARC-E: 78.41%
  • ARC-C: 58.96%
  • HellaSwag: 74.61%
  • IFEval: 52.31%
  • CommonsenseQA: 64.7%
  • BeleBele: 74.78%
  • MMLU: 50.11%
  • ARC-C (French): 53.89%
  • HellaSwag (French): 65.01%
  • BeleBele (French): 71.78%
  • HumanEval: 40.24%
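
Comparable numbers can be reproduced with a standard evaluation harness. The snippet below is a sketch using the Python API of EleutherAI's lm-evaluation-harness; the exact harness version, task variants (in particular the French ones), prompt formats, and few-shot settings used by the authors are not documented here, so treat the task names and settings as assumptions.

import lm_eval  # EleutherAI lm-evaluation-harness (pip install lm-eval)

# Assumed task names and settings; the authors' exact evaluation setup may differ.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=almanach/Gaperon-1125-24B-SFT,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "mmlu"],
    batch_size=8,
)
print(results["results"])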

Data Poisoning Research

Important Note: This model inherits the three types of harmless data poisoning from its base model (Gaperon-1125-24B), injected during pre-training. These are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.

Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "almanach/Gaperon-1125-24B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Automatically distribute across GPUs
    torch_dtype="auto"
)

# Example conversation
messages = [
    {"role": "user", "content": "Explain the differences between supervised and unsupervised learning, then provide code examples in Python for both."}
]

# Apply the chat template; add_generation_prompt=True appends the assistant turn
# marker so the model continues as the assistant rather than extending the user turn.
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(**inputs, max_new_tokens=1024)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
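
Because the model is bilingual, the same pattern works for French prompts. The continuation below reuses the tokenizer and model loaded above; the sampling settings are illustrative defaults, not values recommended by the authors.

# Bilingual usage: prompting in French, with sampling enabled (illustrative settings).
messages_fr = [
    {"role": "user", "content": "Explique la différence entre l'apprentissage supervisé et non supervisé."}
]
input_text_fr = tokenizer.apply_chat_template(messages_fr, tokenize=False, add_generation_prompt=True)
inputs_fr = tokenizer(input_text_fr, return_tensors="pt").to(model.device)

outputs_fr = model.generate(
    **inputs_fr,
    max_new_tokens=512,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # illustrative value, not an official recommendation
    top_p=0.9,
)
print(tokenizer.decode(outputs_fr[0], skip_special_tokens=True))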

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 32
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 738
  • training_steps: 7384
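
The effective global batch size follows from these values: 1 example per device × 32 devices × 2 gradient accumulation steps = 64, and the 738 warmup steps cover roughly 10% of the 7,384 total steps. Training was run with Axolotl; the transformers TrainingArguments sketch below merely mirrors the listed values for readers who want a comparable setup, and is not the authors' actual configuration.

from transformers import TrainingArguments

# Hypothetical mirror of the listed hyperparameters (the actual run used Axolotl).
args = TrainingArguments(
    output_dir="gaperon-24b-sft",      # placeholder output path
    learning_rate=3e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    seed=42,
    optim="adamw_torch_fused",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=738,                  # ~10% of the 7,384 total steps
    max_steps=7384,
    bf16=True,                         # consistent with the BF16 release
)

# Sanity check: 1 per device * 32 devices * 2 accumulation steps = 64 global batch.
print(args.per_device_train_batch_size * 32 * args.gradient_accumulation_steps)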

Framework versions

  • Transformers 4.52.4
  • PyTorch 2.6.0+rocm6.2.4
  • Datasets 3.6.0
  • Tokenizers 0.21.1

Model Card Authors

ALMAnaCH team, Inria Paris


Citation

If you use this model, please cite:

@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}

Acknowledgments

This work was carried out by the ALMAnaCH team at Inria Paris over a 15-month period, with support from French public research funding and computational resources on national HPC clusters. The SFT variant was developed under computational and human resource constraints, focusing on supervised fine-tuning for practical instruction-following capabilities.
