LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Dataset with Generative Speech Models

Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li
International Digital Economy Academy (IDEA)
LEMAS dataset pipeline and benchmark models

Abstract

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. Unlike simple web-crawled collections, our pipeline integrates rigorous forced alignment and a dynamic confidence-based filtering mechanism to exclude samples with unreliable boundaries, thereby guaranteeing temporal alignment accuracy. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless speech editing with smooth boundaries and natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
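The confidence-based filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the pipeline's actual implementation: the `AlignedWord` fields, the use of mean word confidence as the sample score, and the `keep_ratio` parameter are all assumptions made for illustration. The threshold is "dynamic" here only in the sense that it is derived from the score distribution of the batch rather than fixed in advance.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class AlignedWord:
    text: str
    start: float       # word onset in seconds
    end: float         # word offset in seconds
    confidence: float  # aligner score in [0, 1] (hypothetical field)


def sample_score(words: List[AlignedWord]) -> float:
    """Score a sample by its mean word confidence (an assumed aggregate)."""
    if not words:
        return 0.0
    return mean(w.confidence for w in words)


def dynamic_threshold(scores: List[float], keep_ratio: float) -> float:
    """Return the score of the k-th best sample, so the cutoff tracks the data."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[k - 1]


def filter_samples(samples: List[List[AlignedWord]], keep_ratio: float = 0.8):
    """Keep samples whose score reaches the data-dependent threshold."""
    if not samples:
        return []
    scores = [sample_score(s) for s in samples]
    threshold = dynamic_threshold(scores, keep_ratio)
    return [s for s, score in zip(samples, scores) if score >= threshold]
```

A per-language threshold could be obtained by calling `filter_samples` separately on each language's samples, which is one plausible way to keep the cutoff from favoring languages the aligner handles best.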

Comparison with Existing Datasets

Dataset           Hours  Languages  Word-level Alignments  Source
Libri-light       60k    1                                 Audiobook
GigaSpeech        10k    1                                 Multi-domain
GigaSpeech 2      28k    3                                 Multi-domain
WenetSpeech       22k    1                                 Multi-domain
WenetSpeech4TTS   13k    1                                 Multi-domain
MLS               51k    8                                 Audiobook
Emilia            101k   6                                 Multi-domain
LEMAS (Ours)      150k   10         Yes                    Multi-domain

LEMAS-TTS: Multilingual Zero-shot Synthesis

We showcase zero-shot multilingual and cross-lingual synthesis: given a short reference speech clip and its reference text, each system generates target speech conditioned on the target text. For each example, we provide the reference speech (ground truth) and the outputs of three systems: a strong baseline (OpenAudio-S1-mini), our LEMAS-TTS, and LEMAS-TTS + Prosody Encoder (labeled TTS+Prosody).

Example 1

Reference text (EN)

They were set down at a small group of farmhouses.

Reference (Ground Truth)

Target text (EN)

Location data is also some of our most personal information.

Baseline
LEMAS-TTS
TTS+Prosody

Example 2

Reference text (ES)

Y podía regalar a los que a su casa llegasen.

Reference (Ground Truth)

Target text (EN)

City planning economic development and social services together.

Baseline
LEMAS-TTS
TTS+Prosody

Example 3

Reference text (ZH)

梅梅想看看游戏室是什么样子,就擦掉眼泪。

Reference (Ground Truth)

Target text (ZH)

这不是简单的迷信,也不是简单的儒家伦理。

Baseline
LEMAS-TTS
TTS+Prosody

Example 4

Reference text (DE)

Im bereich des netzteiles sogar etwas zu offen, im hinblick auf die isolierung der netzspannung.

Reference (Ground Truth)

Target text (ZH)

甚至有的人总觉得自己怀才不遇,好像世界上的人都辜负自己,都对不起自己。

Baseline
LEMAS-TTS
TTS+Prosody

Example 5

Reference text (DE)

Wir bleiben für immer vereint. wir gegen den rest der welt.

Reference (Ground Truth)

Target text (DE)

Atome, die für uns vorher noch verschlossen waren, nun in unseren lichtkörpern öffnen.

Baseline
LEMAS-TTS
TTS+Prosody

Example 6

Reference text (EN)

For they did not suspect him to be their enemy and he gained it thus.

Reference (Ground Truth)

Target text (DE)

Atome, die für uns vorher noch verschlossen waren, nun in unseren lichtkörpern öffnen.

Baseline
LEMAS-TTS
TTS+Prosody

Example 7

Reference text (ES)

Se ubicaron dentro otras dependencias como la real audiencia, la cárcel y las cajas reales.

Reference (Ground Truth)

Target text (ES)

Es fundamental llevarla a cabo con el esfuerzo de todos y todas.

Baseline
LEMAS-TTS
TTS+Prosody

Example 8

Reference text (FR)

Cinq étapes pour rédiger votre règlement intérieur!

Reference (Ground Truth)

Target text (ES)

Es fundamental llevarla a cabo con el esfuerzo de todos y todas.

Baseline
LEMAS-TTS
TTS+Prosody

Example 9

Reference text (FR)

S il dit après je suis désolé vous pouvez faire confiance à son je suis désolé.

Reference (Ground Truth)

Target text (FR)

Mourir dans la nature? parce qu avant de mourir, ils ont le temps de se reproduire!

Baseline
LEMAS-TTS
TTS+Prosody

Example 10

Reference text (ID)

Yang beriman di antara kalian beriman di antara kalian.

Reference (Ground Truth)

Target text (FR)

Cinq étapes pour rédiger votre règlement intérieur!

Baseline
LEMAS-TTS
TTS+Prosody

Example 11

Reference text (ID)

Tanggal dan bulan saya tidak bisa dapat yang tau bisa berbagi informasi di kolom komentar.

Reference (Ground Truth)

Target text (ID)

Yang ke tiga, kemudian rukuk dan memanjangkan rukuknya.

Baseline
LEMAS-TTS
TTS+Prosody

Example 12

Reference text (IT)

Condizioni, in determinate situazioni e dovremmo riconoscere solo determinati diritti.

Reference (Ground Truth)

Target text (ID)

Dari para perampok tersebut beruntung anaknya selamat meskipun mengalami luka tembak di bagian.

Baseline
LEMAS-TTS
TTS+Prosody

Example 13

Reference text (IT)

Sarebbero stati studiati i metodi di cura del dottore dei miracoli.

Reference (Ground Truth)

Target text (IT)

E quello era come non effettuare una chiamata di emergenza.

Baseline
LEMAS-TTS
TTS+Prosody

Example 14

Reference text (PT)

E as práticas feministas em âmbitos acadêmicos, artísticos e sociais.

Reference (Ground Truth)

Target text (IT)

No, non l ascoltavo e le dissi, non venire qui con questa roba.

Baseline
LEMAS-TTS
TTS+Prosody

Example 15

Reference text (PT)

Outra coisa que também podemos dizer sobre as terminações é que podem ser chamadas de sufixo.

Reference (Ground Truth)

Target text (PT)

E as práticas feministas em âmbitos acadêmicos, artísticos e sociais.

Baseline
LEMAS-TTS
TTS+Prosody

Example 16

Reference text (RU)

Но я не могу сделать то, о чем вы просите.

Reference (Ground Truth)

Target text (RU)

Того, кто ниже ума, мы называем будду, дурачок.

Baseline
LEMAS-TTS
TTS+Prosody

Example 17

Reference text (RU)

Приветствую вас на канале вкусные рецепты.

Reference (Ground Truth)

Target text (PT)

E não são estas aquelas mesmas águas.

Baseline
LEMAS-TTS
TTS+Prosody

Example 18

Reference text (VI)

Bàn ăn với mười hai môn đệ và khi các ông đang.

Reference (Ground Truth)

Target text (VI)

Và bì trộn, với cơm tấm. thế là mình có món, cơm tấm bì cho ngày hôm sau.

Baseline
LEMAS-TTS
TTS+Prosody

Example 19

Reference text (VI)

Bởi vì gu ti vi và bên đại lý không tìm.

Reference (Ground Truth)

Target text (RU)

Приветствую вас на канале вкусные рецепты.

Baseline
LEMAS-TTS
TTS+Prosody

Example 20

Reference text (ZH)

直到最近,我们需要的大部分东西都是手工制作的。

Reference (Ground Truth)

Target text (VI)

Sau này thiên hạ sẽ quy về dòng họ tư mã. đám trẻ sau khi nghe tin tức này đều sợ hãi.

Baseline
LEMAS-TTS
TTS+Prosody

LEMAS-Edit: Word-level Speech Editing

LEMAS-Edit formulates speech editing as masked token infilling over codec tokens, powered by the word-level timestamps in LEMAS-Dataset. For each example, we modify a specific word or phrase in the original utterance; in the transcripts, the edit is marked as 【original】/ 【replacement】. We show the original recording (ground truth), the edited result from LEMAS-Edit, and a re-synthesized version from LEMAS-TTS.
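To make the role of word-level timestamps concrete, the sketch below maps a word's time span onto codec token indices and builds a masked-infilling example from it. It is an illustration under stated assumptions, not the LEMAS-Edit implementation: the function names, the `frame_rate_hz` and `pad_frames` parameters, the single `mask_id` placeholder, and the context/target layout are all hypothetical.

```python
import math
from typing import List, Tuple


def word_span_to_token_span(
    start_s: float,
    end_s: float,
    frame_rate_hz: float,
    num_tokens: int,
    pad_frames: int = 0,
) -> Tuple[int, int]:
    """Convert a word's [start, end] in seconds to a half-open token range.

    The range is rounded outward and optionally padded so that the mask
    fully covers the word, then clipped to the utterance length.
    """
    first = max(0, math.floor(start_s * frame_rate_hz) - pad_frames)
    last = min(num_tokens, math.ceil(end_s * frame_rate_hz) + pad_frames)
    return first, last


def build_infilling_example(
    tokens: List[int], span: Tuple[int, int], mask_id: int
) -> Tuple[List[int], List[int]]:
    """Replace the masked span with one placeholder and return (context, target).

    In this assumed layout, a decoder-only model would read the context
    and then be trained to generate the target tokens.
    """
    first, last = span
    context = tokens[:first] + [mask_id] + tokens[last:]
    target = tokens[first:last]
    return context, target


if __name__ == "__main__":
    codec_tokens = list(range(10))  # stand-in for real codec token ids
    span = word_span_to_token_span(0.25, 0.5, frame_rate_hz=8, num_tokens=10)
    context, target = build_infilling_example(codec_tokens, span, mask_id=-1)
    print(span, context, target)
```

Rounding the span outward trades a few extra regenerated frames for a lower risk of leaving a fragment of the original word at the boundary; inaccurate timestamps would shift the mask and leak or cut neighboring words, which is why alignment quality matters for this task.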

Example 1

DE: Folgen: wann kommen wir denn in 【hessen】/ 【bayern】 in die situation.

Original
LEMAS-Edit
LEMAS-TTS

Example 2

DE: Vom griechischen und im 【griechischen heißt】/ 【hellenischen heißt】 das ganze optischer regler. und der steht ganz links,

Original
LEMAS-Edit
LEMAS-TTS

Example 3

EN: That i was in fact unable to do any 【literary work】/ 【written work】

Original
LEMAS-Edit
LEMAS-TTS

Example 4

EN: If you use 【quotation marks】/ 【inverted commas】 around the expression that you want to see it will give you exact matches for that expression

Original
LEMAS-Edit
LEMAS-TTS

Example 5

EN: While there are several models of the 【cougar】/ 【panther】 mine resistant ambush protected mrap vehicle the six by six variant with an automatic grenade launcher

Original
LEMAS-Edit
LEMAS-TTS

Example 6

ES: 【Mujer】/ 【señora】 por la que ha dejado a ésta ese hombre?», y sentía, ¿por qué no he de confesarle la verdad?

Original
LEMAS-Edit
LEMAS-TTS

Example 7

ES: Es una 【visión】/ 【perspectiva】 muy triste porque nos define y sienta nuestro destino no?

Original
LEMAS-Edit
LEMAS-TTS

Example 8

ES: Para grabar un 【vídeo】/ 【grabación】 sobre qué opina la gente de los informáticos?

Original
LEMAS-Edit
LEMAS-TTS

Example 9

FR: Vous avez publié en 【sciences humaines et sociales】/ 【sciences sociales】 ? le délai est alors de douze mois.

Original
LEMAS-Edit
LEMAS-TTS

Example 10

FR: Publiés en collaboration avec cette vidéo pour tout savoir sur la 【chlorophylle】/ 【pigment vert】: sa vie, son oeuvre...

Original
LEMAS-Edit
LEMAS-TTS

Example 11

FR: On estime que 95% du trafic internet mondial se fait par ces 【câbles】/ 【fils】 de télécommunication posés au fond des océans.

Original
LEMAS-Edit
LEMAS-TTS

Example 12

IT: Tempi difficili e preoccupanti. la prevista introduzione a livello mondiale di 【passaporti】/ 【documenti di viaggio】

Original
LEMAS-Edit
LEMAS-TTS

Example 13

IT: Non intesa come 【etichetta】/ 【marchio】 diagnostica nosografica, ma come processo di conoscenza

Original
LEMAS-Edit
LEMAS-TTS

Example 14

IT: È una specie di via di mezzo fra 【scary movie】/ 【parody film】 e la evil overlord list,

Original
LEMAS-Edit
LEMAS-TTS

Example 15

PT: Na opção 【cartões mbnet】/ 【cartões virtuais】, podemos ver o "histórico" dos cartões gerados,

Original
LEMAS-Edit
LEMAS-TTS

Example 16

PT: Elevados do que 【os vossos】/ 【teus】, então eu comecei a perceber como

Original
LEMAS-Edit
LEMAS-TTS

Example 17

PT: As palavras: 【família da fonte】/ 【tipo de letra】. vamos clicar nele e selecionarmos a fonte

Original
LEMAS-Edit
LEMAS-TTS

Example 18

ZH: 【道士】/【和尚】也没有再闹了,乖乖的就跟人进去擦药水去了。

Original
LEMAS-Edit
LEMAS-TTS

Example 19

ZH: 又出使【南方】/【北方】的闽东岳等族,希望他们紧随其后,作为支援。

Original
LEMAS-Edit
LEMAS-TTS

Example 20

ZH: 还说了自己去公司看见【夏子涵】/【刘亦菲】给方程送药的事情,方程也生气了。

Original
LEMAS-Edit
LEMAS-TTS

Citation

@article{zhao2025lemas,
  title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Dataset with Generative Speech Models},
  author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
  year={2025}
}