KaLM-Embedding: Reshaping Multilingual Text Embedding Models

1. Introduction

Amid the rapid development of large language models (LLMs), retrieval-augmented generation (RAG) has become a key approach for enhancing model performance. However, as the RAG framework has been widely adopted, text embedding models have increasingly become a bottleneck to further progress: traditional embedding models often perform poorly on multilingual and multi-domain tasks, largely due to the limited quality of their training data. To address this challenge, we introduce KaLM-Embedding (Knowledge in Large Language Models into Embedding), which outperforms other models of similar scale in multilingual capability, as demonstrated on the MTEB (Massive Text Embedding Benchmark).

[Figure: KaLM_report.jpg]

2. KaLM-Embedding: Innovative Training Methods for Superior Multilingual Models

(1) Data Collection: The Foundation of Model Success

[Figure: training_data_comparison_bar.png]

During the development of the KaLM-Embedding model, we meticulously designed a data collection strategy to ensure the model excels in multilingual and multi-domain tasks.

Large-Scale Open Source Datasets: A Combination of Diversity and Quality

Although the fine-tuning data is primarily in Chinese and English, with only a small amount of multilingual data, the model still performs well in other languages, indicating that the multilingual strengths of pre-trained LLMs can be successfully transferred to embedding models.

Persona-Based Synthetic Data: Enhancing Data Diversity and Domain Coverage

We generated 550,000 high-quality synthetic data entries using Qwen2-72B-Instruct, covering six task types and 40,000 unique instructions. To enhance data diversity, we introduced random personas from Persona Hub as system prompts, effectively increasing the domain diversity of the generated data. Since the four retrieval task types require instruction generation before data generation, we introduced personas only during the instruction-generation phase to avoid persona conflicts between the two stages, as sketched below.
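For illustration, the following is a minimal sketch of this two-stage flow, assuming Qwen2-72B-Instruct is served behind an OpenAI-compatible endpoint. The endpoint, persona list, and prompt wording are hypothetical placeholders, not the exact setup used for KaLM-Embedding.

```python
# Sketch: persona-conditioned instruction generation (stage 1), then data
# generation without a persona (stage 2) to avoid persona conflicts.
import random
from openai import OpenAI  # assumes an OpenAI-compatible server hosting Qwen2-72B-Instruct

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical endpoint

personas = [  # illustrative entries standing in for Persona Hub samples
    "a pediatric nurse who writes patient education leaflets",
    "a maritime lawyer specializing in cargo disputes",
]

def generate_instruction(task_type: str) -> str:
    """Stage 1: a random persona is injected as the system prompt to diversify domains."""
    persona = random.choice(personas)
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "system", "content": f"You are {persona}."},
            {"role": "user", "content": f"Write one task instruction for a {task_type} task."},
        ],
    )
    return resp.choices[0].message.content

def generate_example(instruction: str) -> str:
    """Stage 2: no persona here, so it cannot conflict with the persona used in stage 1."""
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "user", "content": f"{instruction}\nGenerate one (query, positive document) pair as JSON."},
        ],
    )
    return resp.choices[0].message.content

example = generate_example(generate_instruction("retrieval"))
```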

(2) Training Strategies: Key to Optimizing Model Performance

Ranking Consistency Filtering: Precise Sample Selection

In addition to using in-batch negative samples, we also retrieved hard negative samples from the dataset's corpus. However, some queries correspond to multiple correct documents or answers, or are so broad that they are loosely associated with many documents. In such cases, mined negatives may actually be false negatives, which harms model optimization.
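As background on how in-batch and mined hard negatives enter training, a generic contrastive (InfoNCE) objective might look like the PyTorch sketch below. This is a standard formulation rather than the exact KaLM-Embedding training code; the batch size, number of negatives, embedding dimension, and temperature are illustrative.

```python
# Sketch: InfoNCE loss combining in-batch negatives with mined hard negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, hard_neg, temperature: float = 0.05):
    """q, pos: [B, D] query/positive embeddings; hard_neg: [B, H, D] mined negatives."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # In-batch negatives: every other positive in the batch serves as a negative.
    in_batch_scores = q @ pos.T                            # [B, B]
    # Hard negatives retrieved from the corpus for each query.
    hard_scores = torch.einsum("bd,bhd->bh", q, hard_neg)  # [B, H]

    logits = torch.cat([in_batch_scores, hard_scores], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)      # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Example with random tensors standing in for encoder outputs.
B, H, D = 8, 4, 768
loss = info_nce_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, H, D))
```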

To address this false-negative issue, we adopted ranking consistency filtering (top-k filtering): for each query, we rank its similarity to its original positive document against the entire document corpus and discard training pairs whose positive does not rank within the top k. This filtering is performed together with hard negative mining, so the same similarity ranking is reused and redundant computation is avoided.

[Figure: ranking_consistency_filtering.jpg]
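A minimal sketch of how such top-k filtering can be fused with hard-negative mining is shown below, assuming sentence-transformers for encoding. The checkpoint name, k, number of negatives, and mining band are placeholders, not the exact values used in KaLM-Embedding.

```python
# Sketch: ranking-consistency (top-k) filtering fused with hard-negative mining.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("your-embedding-checkpoint")  # placeholder for the embedding model used for mining

def filter_and_mine(queries, pos_ids, corpus, k=20, num_hard_negatives=7):
    """Keep a (query, positive) pair only if the positive ranks in the top-k of the
    corpus; reuse the same ranking to pick hard negatives, avoiding a second pass."""
    q_emb = model.encode(queries, normalize_embeddings=True)
    c_emb = model.encode(corpus, normalize_embeddings=True)
    scores = q_emb @ c_emb.T                       # [num_queries, corpus_size] cosine similarities

    kept = []
    for i, pos_id in enumerate(pos_ids):
        ranking = np.argsort(-scores[i])           # corpus indices sorted by similarity to query i
        if pos_id not in ranking[:k]:
            continue                               # positive not ranked top-k: likely noisy pair, drop it
        # Hard negatives: highly ranked documents other than the positive (in practice often
        # sampled from a band below the very top to reduce residual false negatives).
        hard_negs = [j for j in ranking[:100] if j != pos_id][:num_hard_negatives]
        kept.append({
            "query": queries[i],
            "positive": corpus[pos_id],
            "hard_negatives": [corpus[j] for j in hard_negs],
        })
    return kept
```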