Tags: Text Generation · Transformers · English · i3-architecture · hybrid-model · rwkv-mamba · custom_code

i3-1B - Hybrid Architecture Language Model

Model Description

The i3-1B model uses a novel hybrid architecture that combines convolutional/recurrent blocks with full attention layers for efficient language modeling. Its early layers blend RWKV-style time-mixing with Mamba state-space dynamics; the final two layers use standard multi-head attention.
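
Because the repository ships custom modeling code (the custom_code tag above), loading it through the Transformers library requires trust_remote_code=True. The snippet below is a minimal sketch that assumes the FlameF0X/i3-1B checkpoint exposes a standard causal-LM interface; the prompt and generation settings are illustrative only, and the repository's own usage instructions take precedence.

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True pulls in the custom hybrid RWKV/Mamba block definitions.
model_id = "FlameF0X/i3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "The hybrid architecture combines"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))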

Model Statistics

  • Total Parameters: ~1.1B
  • Architecture: 2 Attention Layers + 16 RWKV Layers = 18 Total Layers
  • Hidden Dimension (d_model): 2,048
  • Attention Heads: 16
  • Max Sequence Length: 1,024
  • Vocabulary Size: 32,000 tokens (BPE)
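
For reference, the statistics above map onto a configuration roughly like the following; the class and field names are illustrative placeholders, not the repository's actual config keys.

from dataclasses import dataclass

@dataclass
class I3Config:                 # hypothetical name, for illustration only
    d_model: int = 2048         # hidden dimension
    n_heads: int = 16           # attention heads in the full-attention blocks
    n_hybrid_layers: int = 16   # RWKV/Mamba hybrid blocks (layers 1-16)
    n_attn_layers: int = 2      # full attention blocks (layers 17-18)
    max_seq_len: int = 1024     # context window
    vocab_size: int = 32000     # BPE vocabulary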

Architecture Breakdown

Layers 1-16:   RWKV Hybrid Blocks (Recurrent/Conv)
               ├─ RWKVMambaHybrid (Time-mixing + State-space)
               └─ Feed-Forward Network

Layers 17-18:  Full Attention Blocks
               ├─ Multi-Head Attention (16 heads)
               └─ Feed-Forward Network
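
A minimal PyTorch sketch of that ordering, with identity placeholders standing in for the block internals; it illustrates only the layer layout, not the repository's actual modules.

import torch
import torch.nn as nn

class I3Stack(nn.Module):
    # Illustrative ordering only: 16 recurrent/conv hybrid blocks, then 2 attention blocks.
    def __init__(self, hybrid_blocks, attention_blocks):
        super().__init__()
        self.hybrid_blocks = nn.ModuleList(hybrid_blocks)        # layers 1-16
        self.attention_blocks = nn.ModuleList(attention_blocks)  # layers 17-18

    def forward(self, x):
        for block in self.hybrid_blocks:      # linear-time sequential processing
            x = block(x)
        for block in self.attention_blocks:   # global token mixing
            x = block(x)
        return x

# Shape check with identity placeholders standing in for the real blocks.
stack = I3Stack([nn.Identity() for _ in range(16)], [nn.Identity() for _ in range(2)])
print(stack(torch.randn(1, 1024, 2048)).shape)   # torch.Size([1, 1024, 2048])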

Training Details

Training Configuration

  • Datasets:
    • HuggingFaceFW/fineweb
    • Salesforce/wikitext
  • Training Steps: 120
  • Batch Size: 1 (with 32 gradient accumulation steps)
  • Learning Rate: 0.0002 (2e-4)
  • Hardware: NVIDIA GeForce RTX 5060 Ti
  • Training Time: ~5 hours 40 minutes
  • Framework: PyTorch
  • OS: Linux 5.15.0-157-generic x86_64 with glibc 2.39
  • Python: CPython 3.12.11
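
With a per-step batch size of 1 and 32 accumulation steps, each optimizer update averages gradients over 32 sequences. Below is a minimal sketch of that loop; the optimizer (AdamW) is an assumption, since the card does not name it, and the model and data are stand-ins so the accumulation logic is runnable.

import torch
import torch.nn as nn

model = nn.Linear(2048, 32000)                                     # stand-in for the real model
data = [(torch.randn(1, 2048), torch.randint(0, 32000, (1,))) for _ in range(64)]

accum_steps = 32                                                   # gradient accumulation steps from the card
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)         # optimizer choice is an assumption

optimizer.zero_grad()
for step, (x, y) in enumerate(data):                               # micro-batch size 1
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps  # average over the accumulation window
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                           # one update per 32 micro-batches
        optimizer.zero_grad()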

Performance Metrics

Metric               Value
Final Training Loss  2.044
Final Learning Rate  0.000121
Final Perplexity     7.72
Training Speed       206.34 tokens/sec
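
The reported perplexity is consistent with the final training loss under the usual definition, perplexity = exp(loss):

import math
print(math.exp(2.044))   # ~7.72, matching the reported final perplexity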

Comparison with Previous Models

Feature           i3-22M            i3-80M                   i3-200M                  i3-1B (This Model)
Parameters        22.6M             82.77M                   169.85M                  1.1B
Architecture      24 Hybrid Layers  10 Hybrid + 6 Attention  10 Hybrid + 6 Attention  2 Attention + 16 RWKV
Hidden Dimension  512               512                      512                      2,048
Sequence Length   N/A               N/A                      256                      1,024
Final Loss        ~2.0              ~2.0                     1.6                      2.044
Final Perplexity  7.29-9.70         7.29-10.0                5.2                      7.72
Training Time     ~17 hours         ~2-4 hours               ~1-2 hours               ~5.5 hours

Technical Innovations

  1. RWKV-Mamba Hybrid Recurrence: Combines RWKV's time-mixing with Mamba's state-space dynamics (see the sketch after this list)

    • Linear complexity for long sequences
    • Efficient recurrent processing
    • State-space modeling for temporal dependencies
  2. Hierarchical Processing:

    • Early RWKV hybrid layers provide efficient, linear-time sequential processing
    • The final attention layers capture global dependencies across the full context
  3. Extended Context:

    • 1,024 token context window (4x larger than i3-200M)
    • Better handling of long-form text
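
The sketch referenced in item 1: a deliberately simplified, linear-time block that applies RWKV-style token-shift mixing and then an input-gated, exponentially decaying state update in the spirit of a diagonal state-space model. It illustrates the idea only and is not the repository's RWKVMambaHybrid implementation.

import torch
import torch.nn as nn

class ToyRWKVMambaMix(nn.Module):
    # Simplified illustration: RWKV-style token shift feeding an exponentially
    # decaying state update (SSM-style). One pass over time, so O(T) in sequence length.
    def __init__(self, d_model):
        super().__init__()
        self.mu = nn.Parameter(torch.full((d_model,), 0.5))     # time-mix interpolation weights
        self.log_decay = nn.Parameter(torch.zeros(d_model))     # per-channel decay parameter
        self.in_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        B, T, D = x.shape
        prev = torch.zeros(B, D, device=x.device)               # previous token, for token shift
        state = torch.zeros(B, D, device=x.device)              # recurrent state
        a = torch.exp(-nn.functional.softplus(self.log_decay))  # decay kept in (0, 1)
        outs = []
        for t in range(T):
            xt = x[:, t]
            mixed = self.mu * xt + (1 - self.mu) * prev         # RWKV-style time mixing
            state = a * state + (1 - a) * self.in_proj(mixed)   # decaying state-space update
            outs.append(self.out_proj(torch.sigmoid(self.gate(xt)) * state))
            prev = xt
        return torch.stack(outs, dim=1)

# Shape check on a short dummy sequence.
block = ToyRWKVMambaMix(d_model=64)
print(block(torch.randn(2, 16, 64)).shape)                      # torch.Size([2, 16, 64])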

Limitations

  • Trained on English text only
  • Limited to 1,024 token context window
  • May require fine-tuning for specific downstream tasks

Model Series

  • i3-22M - Original model with pure hybrid architecture
  • i3-80M - Scaled version with attention layers and multi-dataset training
  • i3-200M - Improved version with better perplexity
  • i3-1B (This model) - Largest model with extended context and capacity

Citation

@article{mamba,
  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

@article{RWKV,
  title={RWKV: Reinventing RNNs for the Transformer Era},
  author={Peng, Bo and others},
  journal={arXiv preprint arXiv:2305.13048},
  year={2023}
}