Tags: Text Generation · Transformers · English · i3-architecture · hybrid-model · rwkv-mamba · custom_code

i3-1B - Hybrid Architecture Language Model

Model Description

The i3-1B model uses a novel hybrid architecture that combines convolutional/recurrent blocks with full attention layers for efficient language modeling. Its early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the final two layers use standard multi-head attention.
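
Because the repository ships custom modeling code (the custom_code tag above), loading it through the Transformers library requires trust_remote_code=True. The snippet below is a minimal sketch that assumes the FlameF0X/i3-1B checkpoint exposes a standard causal-LM interface; the prompt and generation settings are illustrative only, and the repository's own usage instructions take precedence.

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True pulls in the custom hybrid RWKV/Mamba block definitions.
model_id = "FlameF0X/i3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "The hybrid architecture combines"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))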

Model Statistics

  • Total Parameters: ~1.1B
  • Architecture: 2 Attention Layers + 16 RWKV Layers = 18 Total Layers
  • Hidden Dimension (d_model): 2,048
  • Attention Heads: 16
  • Max Sequence Length: 1,024
  • Vocabulary Size: 32,000 tokens (BPE)
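
For reference, the statistics above map onto a configuration roughly like the following; the class and field names are illustrative placeholders, not the repository's actual config keys.

from dataclasses import dataclass

@dataclass
class I3Config:                 # hypothetical name, for illustration only
    d_model: int = 2048         # hidden dimension
    n_heads: int = 16           # attention heads in the full-attention blocks
    n_hybrid_layers: int = 16   # RWKV/Mamba hybrid blocks (layers 1-16)
    n_attn_layers: int = 2      # full attention blocks (layers 17-18)
    max_seq_len: int = 1024     # context window
    vocab_size: int = 32000     # BPE vocabulary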

Architecture Breakdown

Layers 1-16:   RWKV Hybrid Blocks (Recurrent/Conv)
               ├─ RWKVMambaHybrid (Time-mixing + State-space)
               └─ Feed-Forward Network

Layers 17-18:  Full Attention Blocks
               ├─ Multi-Head Attention (16 heads)
               └─ Feed-Forward Network
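
A minimal PyTorch sketch of that ordering, with identity placeholders standing in for the block internals; it illustrates only the layer layout, not the repository's actual modules.

import torch
import torch.nn as nn

class I3Stack(nn.Module):
    # Illustrative ordering only: 16 recurrent/conv hybrid blocks, then 2 attention blocks.
    def __init__(self, hybrid_blocks, attention_blocks):
        super().__init__()
        self.hybrid_blocks = nn.ModuleList(hybrid_blocks)        # layers 1-16
        self.attention_blocks = nn.ModuleList(attention_blocks)  # layers 17-18

    def forward(self, x):
        for block in self.hybrid_blocks:      # linear-time sequential processing
            x = block(x)
        for block in self.attention_blocks:   # global token mixing
            x = block(x)
        return x

# Shape check with identity placeholders standing in for the real blocks.
stack = I3Stack([nn.Identity() for _ in range(16)], [nn.Identity() for _ in range(2)])
print(stack(torch.randn(1, 1024, 2048)).shape)   # torch.Size([1, 1024, 2048])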

Training Details

Training Configuration

  • Datasets:
    • HuggingFaceFW/fineweb
    • Salesforce/wikitext
  • Training Steps: 120
  • Batch Size: 1 (with 32 gradient accumulation steps)
  • Learning Rate: 0.0002 (2e-4)
  • Hardware: NVIDIA GeForce RTX 5060 Ti
  • Training Time: ~5 hours 40 minutes
  • Framework: PyTorch
  • OS: Linux 5.15.0-157-generic x86_64 with glibc 2.39
  • Python: CPython 3.12.11
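
With a per-step batch size of 1 and 32 accumulation steps, each optimizer update averages gradients over 32 sequences. Below is a minimal sketch of that loop; the optimizer (AdamW) is an assumption, since the card does not name it, and the model and data are stand-ins so the accumulation logic is runnable.

import torch
import torch.nn as nn

model = nn.Linear(2048, 32000)                                     # stand-in for the real model
data = [(torch.randn(1, 2048), torch.randint(0, 32000, (1,))) for _ in range(64)]

accum_steps = 32                                                   # gradient accumulation steps from the card
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)         # optimizer choice is an assumption

optimizer.zero_grad()
for step, (x, y) in enumerate(data):                               # micro-batch size 1
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps  # average over the accumulation window
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                           # one update per 32 micro-batches
        optimizer.zero_grad()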

Performance Metrics

Metric               Value
Final Training Loss  2.044
Final Learning Rate  0.000121
Final Perplexity     7.72
Training Speed       206.34 tokens/sec
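
The reported perplexity is consistent with the final training loss under the usual definition, perplexity = exp(loss):

import math
print(math.exp(2.044))   # ~7.72, matching the reported final perplexity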

Comparison with Previous Models

Feature           i3-22M            i3-80M                   i3-200M                  i3-1B (This Model)
Parameters        22.6M             82.77M                   169.85M                  1.1B
Architecture      24 Hybrid Layers  10 Hybrid + 6 Attention  10 Hybrid + 6 Attention  2 Attention + 16 RWKV
Hidden Dimension  512               512                      512                      2,048
Sequence Length   N/A               N/A                      256                      1,024
Final Loss        ~2.0              ~2.0                     1.6                      2.044
Final Perplexity  7.29-9.70         7.29-10.0                5.2                      7.72
Training Time     ~17 hours         ~2-4 hours               ~1-2 hours               ~5.5 hours

Technical Innovations

  1. RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch after this list)

    • Linear complexity for long sequences
    • Efficient recurrent processing
    • State-space modeling for temporal dependencies
  2. Hierarchical Processing:

    • Early RWKV hybrid layers provide efficient, linear-time sequential processing
    • The final attention layers capture global dependencies across the full context
  3. Extended Context:

    • 1,024 token context window (4x larger than i3-200M)
    • Better handling of long-form text
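
The sketch referenced in item 1: a deliberately simplified, linear-time block that applies RWKV-style token-shift mixing and then an input-gated, exponentially decaying state update in the spirit of a diagonal state-space model. It illustrates the idea only and is not the repository's RWKVMambaHybrid implementation.

import torch
import torch.nn as nn

class ToyRWKVMambaMix(nn.Module):
    # Simplified illustration: RWKV-style token shift feeding an exponentially
    # decaying state update (SSM-style). One pass over time, so O(T) in sequence length.
    def __init__(self, d_model):
        super().__init__()
        self.mu = nn.Parameter(torch.full((d_model,), 0.5))     # time-mix interpolation weights
        self.log_decay = nn.Parameter(torch.zeros(d_model))     # per-channel decay parameter
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        B, T, D = x.shape
        prev = torch.zeros(B, D, device=x.device)               # previous token, for token shift
        state = torch.zeros(B, D, device=x.device)              # recurrent state
        a = torch.exp(-nn.functional.softplus(self.log_decay))  # decay kept in (0, 1)
        outs = []
        for t in range(T):
            xt = x[:, t]
            mixed = self.mu * xt + (1 - self.mu) * prev         # RWKV-style time mixing
            state = a * state + (1 - a) * self.in_proj(mixed)   # decaying state-space update
            outs.append(self.out_proj(torch.sigmoid(self.gate(xt)) * state))
            prev = xt
        return torch.stack(outs, dim=1)

# Shape check on a short dummy sequence.
block = ToyRWKVMambaMix(d_model=64)
print(block(torch.randn(2, 16, 64)).shape)                      # torch.Size([2, 16, 64])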

Limitations

  • Trained on English text only
  • Limited to 1,024 token context window
  • May require fine-tuning for specific downstream tasks

Model Series

  • i3-22M - Original model with pure hybrid architecture
  • i3-80M - Scaled version with attention layers and multi-dataset training
  • i3-200M - Improved version with better perplexity
  • i3-1B (This model) - Largest model with extended context and capacity

Citation

@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}