Clémentine committed · Commit f09f2a7 · 1 Parent(s): dca8525

removing a dumb note by claude
app/src/content/chapters/troubleshooting/troubleshooting-inference.mdx
CHANGED
@@ -48,21 +48,6 @@ And that's it!
 
 I would actually recommend using `<memory (in GB)> = <number of parameters (in G)> * (<precision factor> * 110%)`, to be on the safer side, as inference will require a bit more memory than just loading the model (you'll also need to load the batches).
 
-<Note title="Estimating GPU memory requirements" emoji="💾" variant="info">
-
-**Quick formula:**
-`Memory (GB) = Params (billions) × Precision factor × 1.1`
-
-**Precision factors:**
-- float32: 4
-- float16/bfloat16: 2
-- 8-bit: 1
-- 4-bit: 0.5
-
-The 1.1 multiplier accounts for batch loading overhead. Example: A 7B model in float16 needs ~15.4GB (7 × 2 × 1.1).
-
-</Note>
-
 ### My model does not fit on a GPU
 ➡️ Quantization
 
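For reference, the rule of thumb kept in the file (memory in GB ≈ parameters in billions × precision factor × 1.1, with the factors the removed note listed) is a one-line calculation. Below is a minimal Python sketch; the `PRECISION_FACTORS` table and the `estimate_inference_memory_gb` helper are illustrative names, not code from this repo.

```python
# Rough GPU memory estimate for inference, following the rule of thumb in the
# .mdx file: memory (GB) ≈ params (in billions) × precision factor × 1.1.
# The names and the factor table below are illustrative, not from the repo.

PRECISION_FACTORS = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def estimate_inference_memory_gb(params_billions: float, precision: str = "float16") -> float:
    """Estimate GPU memory (GB) needed to run inference at a given precision.

    The 1.1 multiplier adds ~10% headroom on top of the raw weight size,
    since inference also has to hold batches and activations.
    """
    factor = PRECISION_FACTORS[precision]
    return params_billions * factor * 1.1

if __name__ == "__main__":
    # Example from the removed note: a 7B model in float16 needs ~15.4 GB.
    print(f"{estimate_inference_memory_gb(7, 'float16'):.1f} GB")  # -> 15.4 GB
```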