Scaling LLM Alignment for Low Resource Languages


The development of operational multilingual Large Language Models (LLMs) typically involves resource-intensive stages of pretraining, instruction tuning, and alignment. High-quality instruction and preference datasets are essential for effective alignment, yet their creation requires substantial human labor for each target language, posing a significant barrier to inclusivity and to the democratization of AI, especially for languages beyond English. While some models, such as Llama-3, offer open weights, their instruction and alignment data remain proprietary, further exacerbating this challenge. Existing open-source instruction and preference datasets predominantly cater to English, so broader applicability requires costly and time-consuming translation and localization. This project will explore a novel, scalable, and cost-effective approach to instruction tuning and alignment of existing LLMs for new languages. By leveraging readily available raw text in the target languages alongside existing English instruction and preference data, our methodology circumvents the need for expensive, language-specific dataset creation. Specifically, we will investigate the efficacy of a joint pretraining and alignment strategy, in which the LLM is simultaneously exposed to new-language data and English instruction data, aiming to dramatically reduce the resource barrier to multilingual LLM development.
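
To make the joint strategy concrete, below is a minimal sketch, in PyTorch with Hugging Face transformers, of what one training step could look like: a causal language-modeling loss on raw target-language text (continued pretraining) is combined with the same loss on English instruction data (instruction tuning) in a single update. The base model name, the toy data, the causal_lm_loss helper, and the mixing weight LAMBDA_INSTRUCT are illustrative assumptions, not the project's actual recipe; the same mixing idea would extend to a preference-optimization loss (e.g., DPO) on the English preference data.

```python
# Illustrative sketch of joint continued pretraining + instruction tuning:
# each optimizer step combines a causal-LM loss on raw target-language text
# with the same loss on English instruction/response pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed open-weights base model
LAMBDA_INSTRUCT = 0.5                      # assumed weight of the English instruction loss

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:            # Llama tokenizers ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy stand-ins: (i) raw text in the new target language, (ii) English instruction data.
raw_target_batches = [["<raw sentence in the target language>"]]
english_instruction_batches = [["### Instruction: Summarize X.\n### Response: ..."]]

def causal_lm_loss(texts):
    """Standard next-token prediction loss; padding positions are ignored."""
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100   # do not compute loss on padding
    return model(**enc, labels=labels).loss

model.train()
for raw_texts, instr_texts in zip(raw_target_batches, english_instruction_batches):
    # Joint objective: continued pretraining on the new language plus
    # instruction tuning on English, applied in the same update.
    loss = causal_lm_loss(raw_texts) + LAMBDA_INSTRUCT * causal_lm_loss(instr_texts)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The point of this design is that no instruction or preference data ever has to be authored in the target language: the model sees the new language only through raw text, while instruction-following behaviour is expected to transfer from the English data learned in the same updates.
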
Official code:
EHPC-AI-2024A04-074
Institution:
EuroHPC Joint Undertaking
Allocation:
128,000 H100 GPU hours
Start date:
2025/01/13
End date:
2026/01/12
Team principal investigator:
German Rigau
Contract:
No
Website:
http://