whisper-50TPS-VQ-32k-large-v3-turbo
This model introduces VQ (vector quantization) on top of openai/whisper-large-v3-turbo with a VQ embedding (codebook) size of 32768.
WandB at https://wandb.ai/huseinzol05/whisperconv-vq-50tps
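As a rough illustration of the idea only (not this model's actual quantizer), the sketch below shows nearest-codebook vector quantization over a 32768-entry codebook; the hidden size of 1280 matches the whisper-large-v3-turbo encoder, everything else is a generic assumption.

```python
# Generic VQ sketch (assumption, not the model's implementation): assign every
# encoder frame to the index of its nearest codebook embedding.
import torch

codebook = torch.randn(32768, 1280)      # (codebook_size, hidden_dim)
frames = torch.randn(1, 150, 1280)       # dummy (batch, time, hidden_dim) encoder output

# L2 distance between each frame and every codebook entry
distances = torch.cdist(frames, codebook.unsqueeze(0))  # (batch, time, codebook_size)
codes = distances.argmin(dim=-1)                         # (batch, time) discrete audio tokens
quantized = codebook[codes]                              # lookup back to continuous embeddings
print(codes.shape, quantized.shape)
```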
Training datasets
- malaysia-ai/common_voice_17_0
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_segments
- mesolitica/Malaysian-STT-Whisper-Stage2/malaysian_multiturn_chat_assistants_manglish_segments
How to get audio tokens
```python
from transformers import AutoFeatureExtractor, AutoModel
import librosa

model_id = "mesolitica/whisper-50TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()
encoder = model.model.get_encoder()

# load audio at the sampling rate the feature extractor expects (16 kHz)
y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)
features = feature_extractor([y], return_tensors = 'pt', return_attention_mask = True)
for k in features.keys():
    features[k] = features[k].cuda()

encoded = encoder(**features)
# encoded[1] holds the VQ token ids per frame, encoded[2] marks which frames are valid audio
print(encoded[1][0, encoded[2][0] == 1])
```
Output,
```
tensor([16299, 22460, 26594, 6151, 13798, 5015, 27007, 24209, 2123, 9164,
24960, 4891, 27592, 4891, 12028, 2808, 23617, 25403, 17348, 6483,
16508, 4472, 7851, 29601, 12884, 7548, 31570, 27971, 31570, 31355,
2491, 2491, 22324, 26230, 13528, 13528, 5599, 5599, 12308, 12304,
28836, 9298, 25548, 27397, 27397, 10779, 12780, 12780, 20603, 20603,
31288, 2578, 2578, 2578, 2578, 20232, 20232, 9713, 21173, 9713,
9713, 232, 232, 74, 74, 74, 27841, 16517, 16517, 7669,
16517, 31579, 26423, 5348, 12006, 12006, 12006, 5926, 5926, 29746,
32580, 14229, 32580, 4762, 15721, 13020, 13020, 20425, 7654, 7654,
4174, 4174, 27602, 6672, 6672, 2911, 25474, 31114, 4204, 12888,
12888, 3408, 8680, 12365, 12365, 5631, 5631, 26766, 26766, 5280,
23406, 26766, 18644, 29615, 1247, 1247, 1247, 1247, 1247, 709,
14590, 28750, 16222, 14590, 18734, 24239, 3940, 19761, 9686, 3583,
17724, 3583, 3583, 2359, 25608, 11212, 11212, 21041, 30214, 28251,
28251, 21095, 27867, 27867, 15037, 13985, 25163, 25163, 29965, 17159,
14875, 7648, 4272, 24970, 24970, 17764, 17963, 23226, 23226, 23226,
19074, 5633, 29925, 29925, 29925, 29925, 29925, 32144, 19907, 19907,
19125, 28819, 8935, 15056, 8947, 2722, 655, 655, 19567, 19567,
20944, 20944, 12023, 13744, 31364, 30753, 30753, 15021, 15021, 29721,
29721, 1551, 23539, 12146, 22061, 13539, 13539, 13539, 1391, 13480,
9409, 29622, 15363, 7106, 7106, 7106, 11395, 32015, 22170, 4375,
5471, 22412, 32015, 22492, 6854, 18744, 7588, 6854, 28049, 1257,
10942, 9782, 2466, 4923, 9425, 9425, 7922, 9425, 18292, 10800,
27312, 17945, 9197], device='cuda:0')
```
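The number of tokens grows linearly with the audio length; as a rough sanity check (the 50 token-per-second rate is read off the model name, not from an official script), the rate can be verified by continuing from the variables above:

```python
# y, sr and encoded come from the snippet above
codes = encoded[1][0, encoded[2][0] == 1]
duration = len(y) / sr                              # audio length in seconds
print(len(codes), duration, len(codes) / duration)  # expected to land around 50 tokens/second
```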
How to decode
```python
from transformers import AutoFeatureExtractor, AutoModel, AutoTokenizer
import librosa

model_id = "mesolitica/whisper-50TPS-VQ-32k-large-v3-turbo"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code = True, torch_dtype = 'auto').cuda()

y, sr = librosa.load('common_voice_ba_26517811.mp3', sr = feature_extractor.sampling_rate)

# Whisper-style decoder prompt with an explicit language tag
input_ids = tokenizer(
    '<|startoftranscript|><|ru|><|transcribe|><|notimestamps|>',
    add_special_tokens = False, return_tensors = 'pt')['input_ids']

features = feature_extractor([y], return_tensors = 'pt', return_attention_mask = True)
features['decoder_input_ids'] = input_ids
for k in features.keys():
    features[k] = features[k].cuda()

generate_kwargs = dict(
    **features,
    max_new_tokens=1024,
)
generation_output = model.generate(**generate_kwargs)
tokenizer.decode(generation_output[0])
```
Output,
```
<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Куби срок да, был к кулуны битовщиливга сраббеси.<|endoftext|>
```
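To drop the prompt and special tokens from the decoded string, the tokenizer's standard `skip_special_tokens` flag can be used:

```python
# strip <|startoftranscript|>, the language/task tags and <|endoftext|> from the output
text = tokenizer.decode(generation_output[0], skip_special_tokens = True)
print(text.strip())
```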
Evaluation of whisper-large-v3-turbo (baseline)
Evaluated on malaysia-ai/common_voice_17_0/test with the following normalization (sketched below):
- Lower case.
- Remove punctuation.
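A minimal sketch of this normalization and the CER computation, assuming the jiwer library (the actual evaluation script lives in the repository linked under Source code):

```python
import string
import jiwer

def normalize(text):
    # lower case and strip punctuation, as listed above
    text = text.lower()
    return text.translate(str.maketrans('', '', string.punctuation)).strip()

# illustrative pair only; the evaluation runs over the whole Common Voice test split
reference = 'Куби срок да, был к кулуны битовщиливга сраббеси.'
hypothesis = 'куби срок да был к кулуны битовщиливга сраббеси'
print(jiwer.cer(normalize(reference), normalize(hypothesis)))  # 0.0 once normalized
```
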
```
lang: gl, samples: 9949, CER: 0.07152510505792294
lang: en, samples: 16379, CER: 0.06753330710655674
lang: ar, samples: 10458, CER: 0.1947751450055225
lang: ml, samples: 703, CER: 1.0
lang: kk, samples: 514, CER: 0.20863771367822534
lang: fr, samples: 16145, CER: 0.06214173611931617
lang: de, samples: 16170, CER: 0.028234265767297772
lang: fi, samples: 1554, CER: 0.05025666623144642
lang: pt, samples: 9432, CER: 0.047719750047089185
lang: eu, samples: 13621, CER: 0.09338730211615588
lang: ro, samples: 3896, CER: 0.051777501980416175
lang: sw, samples: 12086, CER: 0.2759739509101691
lang: ta, samples: 8263, CER: 0.2610649328225921
lang: et, samples: 2653, CER: 0.10090877908503848
lang: it, samples: 15154, CER: 0.028248448719540225
lang: sr, samples: 1539, CER: 0.7694917668051414
lang: mr, samples: 1437, CER: 0.31486827603170625
lang: ka, samples: 12608, CER: 1.0
lang: es, samples: 15848, CER: 0.02573466537241663
lang: be, samples: 15878, CER: 0.1623539665156993
lang: lt, samples: 4753, CER: 0.09072917321655274
lang: ca, samples: 16389, CER: 0.08830123365949397
lang: tr, samples: 11235, CER: 0.08149338975598398
lang: hu, samples: 11435, CER: 0.050343705876337616
lang: ja, samples: 6033, CER: 0.24494832855535395
lang: br, samples: 2202, CER: 0.4335518560628212
lang: uz, samples: 12006, CER: 0.5373356380398888
lang: ru, samples: 10184, CER: 0.03127356357072015
lang: tt, samples: 4953, CER: 0.7868542525951908
lang: bn, samples: 9327, CER: 0.7846214086562013
lang: bg, samples: 3201, CER: 0.06184012918758244
lang: uk, samples: 9915, CER: 0.07047690765588097
lang: mt, samples: 1662, CER: 0.39089468432802793
lang: fa, samples: 10292, CER: 0.30044766438684356
lang: pl, samples: 9186, CER: 0.043470382606460536
lang: nl, samples: 11255, CER: 0.02538255742033512
lang: ur, samples: 4052, CER: 0.16276258890737627
lang: sk, samples: 2593, CER: 0.19944361598650534
lang: oc, samples: 254, CER: 0.35831285605861984
lang: yue, samples: 2585, CER: 0.8731525147992011
lang: cs, samples: 9055, CER: 0.041736006420992885
lang: th, samples: 10982, CER: 0.3073875514308603
lang: mn, samples: 1896, CER: 0.8324966499076248
lang: sl, samples: 1242, CER: 0.08916853619521406
lang: vi, samples: 1077, CER: 0.13074243190964033
lang: hi, samples: 3151, CER: 0.20369723104847315
lang: id, samples: 3633, CER: 0.03892126020138641
lang: cy, samples: 5371, CER: 0.29246839285865894
lang: yo, samples: 999, CER: 0.64688145402881
lang: mk, samples: 1097, CER: 0.13343697151032624
lang: da, samples: 2405, CER: 0.06780960995303774
lang: lv, samples: 6738, CER: 0.12808205104891632
lang: tk, samples: 545, CER: 0.9066113916834162
lang: ha, samples: 661, CER: 0.4980317990611381
lang: he, samples: 260, CER: 0.27677536501954353
lang: el, samples: 1696, CER: 0.10663064837085236
lang: as, samples: 551, CER: 1.0
lang: sq, samples: 472, CER: 0.27442590211121376
lang: ko, samples: 338, CER: 0.07264072629604308
lang: af, samples: 62, CER: 0.1704143043428866
lang: te, samples: 49, CER: 0.4733023986309602
lang: ps, samples: 199, CER: 0.6088441522832613
lang: am, samples: 205, CER: 1.0
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.12556229880993264
lang: yi, samples: 6, CER: 0.910454745606451
average CER: 0.314648357718595
```
Not all 115 languages are supported for inference using the WhisperForConditionalGeneration generate interface, which is why the table above covers fewer languages.
Evaluation of whisper-50TPS-VQ-32k-large-v3-turbo
Evaluated on malaysia-ai/common_voice_17_0/test across up to 115 languages with the following conditions:
- Lower case.
- Remove punctuation.
- Provide the language tag in the decoder input ids, `<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>`, as sketched below.
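For example, the per-language decoder prompt can be built with the same tokenizer as in the decode snippet above, where `lang` is the Common Voice language code being evaluated:

```python
# build the language-tagged decoder prompt used as decoder_input_ids
lang = 'ru'
prompt = f'<|startoftranscript|><|{lang}|><|transcribe|><|notimestamps|>'
decoder_input_ids = tokenizer(prompt, add_special_tokens = False, return_tensors = 'pt')['input_ids']
```
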
```
lang: gl, samples: 9949, CER: 0.12007565114093165
lang: en, samples: 16379, CER: 0.13520839732414916
lang: ar, samples: 10458, CER: 0.3884727143028913
lang: kab, samples: 14972, CER: 0.4620049125516062
lang: ml, samples: 703, CER: 0.568315578107191
lang: kk, samples: 514, CER: 0.3443890240546893
lang: ltg, samples: 2904, CER: 0.3613098389202468
lang: fr, samples: 16145, CER: 0.11077879097203937
lang: de, samples: 16170, CER: 0.10804868346316994
lang: fi, samples: 1554, CER: 0.29767705636861963
lang: pt, samples: 9432, CER: 0.11672359204031181
lang: ia, samples: 1816, CER: 0.10932249210501016
lang: eu, samples: 13621, CER: 0.21813164726666934
lang: ro, samples: 3896, CER: 0.15761580778646236
lang: sw, samples: 12086, CER: 0.317072964513488
lang: sv-SE, samples: 5247, CER: 0.21786966941620733
lang: ta, samples: 8263, CER: 0.39825058482553016
lang: et, samples: 2653, CER: 0.4291594774037307
lang: lg, samples: 11902, CER: 0.33538375274203464
lang: it, samples: 15154, CER: 0.09201993114885036
lang: mhr, samples: 15107, CER: 0.2669426776317672
lang: sr, samples: 1539, CER: 0.2248434020841934
lang: mr, samples: 1437, CER: 0.49830537865576174
lang: ka, samples: 12608, CER: 0.35705103105982733
lang: es, samples: 15848, CER: 0.06709079041815641
lang: be, samples: 15878, CER: 0.13986421633453971
lang: lt, samples: 4753, CER: 0.25373746153009413
lang: ca, samples: 16389, CER: 0.0915849379889633
lang: eo, samples: 14773, CER: 0.12058110675720321
lang: tr, samples: 11235, CER: 0.22357707510104238
lang: hu, samples: 11435, CER: 0.23734154513458255
lang: ja, samples: 6033, CER: 0.7477700246875165
lang: br, samples: 2202, CER: 0.45269150747427195
lang: ne-NP, samples: 217, CER: 0.5015191073582106
lang: uz, samples: 12006, CER: 0.3453774023635566
lang: ru, samples: 10184, CER: 0.17833242588057083
lang: dv, samples: 2213, CER: 0.5942997910665591
lang: tt, samples: 4953, CER: 0.33224633994030756
lang: rw, samples: 14797, CER: 0.3817857672245909
lang: bn, samples: 9327, CER: 0.49106608133955404
lang: ug, samples: 6108, CER: 0.3892933798907723
lang: rm-sursilv, samples: 1361, CER: 0.3347747243922042
lang: bg, samples: 3201, CER: 0.23785991849098476
lang: ab, samples: 9108, CER: 0.41673427715421624
lang: uk, samples: 9915, CER: 0.19335517497088392
lang: mt, samples: 1662, CER: 0.4028334993384429
lang: fa, samples: 10292, CER: 0.31265269008332824
lang: pl, samples: 9186, CER: 0.2146644804384669
lang: bas, samples: 541, CER: 0.40325447193208674
lang: nl, samples: 11255, CER: 0.1376835716192326
lang: zh-CN, samples: 10335, CER: 0.6927523084537874
lang: tok, samples: 2175, CER: 0.12128180369670577
lang: ur, samples: 4052, CER: 0.31300181362946966
lang: sk, samples: 2593, CER: 0.25408661819933404
lang: oc, samples: 254, CER: 0.353812487629276
lang: yue, samples: 2585, CER: 0.6538394709276526
lang: mrj, samples: 7102, CER: 0.33959770873587686
lang: fy-NL, samples: 3167, CER: 0.32931151032423667
lang: cs, samples: 9055, CER: 0.2013245412137936
lang: th, samples: 10982, CER: 0.5306709109183976
lang: ckb, samples: 5262, CER: 0.3513414674526022
lang: mn, samples: 1896, CER: 0.5576746279104818
lang: ky, samples: 1604, CER: 0.4059907651433921
lang: skr, samples: 1006, CER: 0.45351945535227617
lang: hy-AM, samples: 4281, CER: 0.42178915013723023
lang: sl, samples: 1242, CER: 0.22977118058209303
lang: vi, samples: 1077, CER: 0.4193210395772658
lang: hi, samples: 3151, CER: 0.34224130391731566
lang: nan-tw, samples: 2317, CER: 0.6111591568975798
lang: id, samples: 3633, CER: 0.10481234564781976
lang: cy, samples: 5371, CER: 0.4004064623653742
lang: yo, samples: 999, CER: 0.6071654129643154
lang: sah, samples: 1455, CER: 0.48835389906270404
lang: mk, samples: 1097, CER: 0.285635809371132
lang: cv, samples: 1288, CER: 0.45825472208716533
lang: myv, samples: 479, CER: 0.3843854738686986
lang: da, samples: 2405, CER: 0.2433975089844069
lang: lv, samples: 6738, CER: 0.2569258601854144
lang: kmr, samples: 3900, CER: 0.3353043034471255
lang: tk, samples: 545, CER: 0.5642557038104666
lang: nn-NO, samples: 370, CER: 0.31400459968059224
lang: ha, samples: 661, CER: 0.3577579770675298
lang: he, samples: 260, CER: 0.6965273416936585
lang: dyu, samples: 59, CER: 0.46178936207783344
lang: gn, samples: 855, CER: 0.479734784313752
lang: lij, samples: 694, CER: 0.3520926880252666
lang: hsb, samples: 444, CER: 0.45188664043276494
lang: pa-IN, samples: 487, CER: 0.4958152472289212
lang: el, samples: 1696, CER: 0.28076412582873667
lang: zgh, samples: 159, CER: 0.781252652393576
lang: as, samples: 551, CER: 0.5107554664900493
lang: sq, samples: 472, CER: 0.3759908467729892
lang: ko, samples: 338, CER: 0.92535682040766
lang: ga-IE, samples: 517, CER: 0.5358423759798394
lang: cnh, samples: 763, CER: 0.37693735064263406
lang: sat, samples: 147, CER: 0.3518640802731535
lang: rm-vallader, samples: 462, CER: 0.34841643883911316
lang: or, samples: 670, CER: 0.7432042122954143
lang: mdf, samples: 104, CER: 0.42074966152290644
lang: af, samples: 62, CER: 0.3599800783896016
lang: ig, samples: 4, CER: 0.6246567131647777
lang: sc, samples: 232, CER: 0.3841824898033693
lang: tig, samples: 169, CER: 0.7978717471896992
lang: te, samples: 49, CER: 0.5477591323743717
lang: ps, samples: 199, CER: 0.4366324099345943
lang: am, samples: 205, CER: 0.8076891458870441
lang: ast, samples: 162, CER: 0.22151241412308678
lang: os, samples: 50, CER: 0.5934768234678124
lang: lo, samples: 33, CER: 1.0
lang: az, samples: 33, CER: 0.4381042454886736
lang: ti, samples: 4, CER: 0.8533978174603175
lang: vot, samples: 6, CER: 1.0
lang: nhi, samples: 5, CER: 0.5488968021226086
lang: yi, samples: 6, CER: 0.8844948896401501
lang: tw, samples: 9, CER: 0.538073201124657
average CER: 0.39865913242979356
```
Source code
Source code at https://github.com/mesolitica/malaya-speech/tree/master/session/whisper-conv-50tps