With the rapid development of large language models (LLMs), retrieval-augmented generation (RAG) has become a key approach to enhancing model performance. However, as the RAG framework has been widely adopted, text embedding models have increasingly become a bottleneck to further progress. Traditional embedding models often perform poorly on multilingual and multi-domain tasks because of the limited quality of their training data. To address this challenge, we introduce the KaLM-Embedding (Knowledge in large Language Models into Embedding) model, which outperforms other models of similar scale in multilingual capability, as demonstrated on the MTEB (Massive Text Embedding Benchmark) evaluation.
During the development of the KaLM-Embedding model, we meticulously designed a data collection strategy to ensure the model excels in multilingual and multi-domain tasks.
Although the fine-tuning data is primarily Chinese and English, with only a small amount of data in other languages, the model still performs well across languages, suggesting that the multilingual capabilities of the pre-trained LLM transfer effectively to the embedding model.
We generated 550,000 high-quality synthetic training examples with Qwen2-72B-Instruct, covering six task types and 40,000 unique instructions. To enhance data diversity, we injected random personas from Persona Hub as system prompts, which effectively increases the domain coverage of the generated data. Because the four retrieval task types require an instruction to be generated before the data itself, we introduced personas only during the instruction-generation phase, avoiding persona conflicts between the two stages.
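To make the two-stage, persona-injected generation concrete, here is a minimal sketch. It assumes an OpenAI-compatible chat endpoint serving Qwen2-72B-Instruct, a local `personas.jsonl` dump of Persona Hub, and illustrative prompts and function names; it is not the actual KaLM-Embedding pipeline.

```python
import json
import random

# Assumption: any OpenAI-compatible client pointed at a local
# Qwen2-72B-Instruct deployment (e.g. vLLM) would work here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def load_personas(path="personas.jsonl"):
    """Load persona descriptions (e.g. a local dump of Persona Hub)."""
    with open(path) as f:
        return [json.loads(line)["persona"] for line in f]


PERSONAS = load_personas()


def generate_instruction(task_type: str) -> str:
    """Stage 1: generate a retrieval instruction, with a random persona
    injected as the system prompt to diversify the covered domains."""
    persona = random.choice(PERSONAS)
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": (
                f"Write one concise instruction describing a {task_type} "
                "retrieval task in your area of expertise."
            )},
        ],
    )
    return resp.choices[0].message.content


def generate_example(instruction: str) -> str:
    """Stage 2: generate (query, positive document) data for the instruction.
    No persona is injected here, so the data stays consistent with the
    instruction and persona conflicts between the two stages are avoided."""
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "user", "content": (
                f"Instruction: {instruction}\n"
                "Generate a JSON object with fields 'query' and 'positive' "
                "that fit this instruction."
            )},
        ],
    )
    return resp.choices[0].message.content
```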
In addition to in-batch negatives, we also retrieve hard negatives from each dataset's corpus. However, some queries correspond to multiple correct documents or answers, or are so broad that they match many documents with only loose relevance. Both situations introduce false negatives, which harm model optimization.
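Before describing the filtering step, it helps to see where these negatives enter the objective. Below is a minimal InfoNCE-style loss sketch in PyTorch with both in-batch and mined hard negatives; the tensor shapes and temperature are assumptions, not the exact KaLM-Embedding training code. A false negative landing in either the in-batch or hard-negative slots would be pushed away from its query even though it is actually relevant, which is why the filtering described next matters.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q, p, hard_neg, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives plus mined hard negatives.

    q:        (B, D) query embeddings
    p:        (B, D) positive document embeddings
    hard_neg: (B, K, D) mined hard-negative embeddings per query
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # In-batch scores: every other positive in the batch acts as a negative.
    in_batch = q @ p.T                                 # (B, B)
    # Scores against this query's own mined hard negatives.
    hard = torch.einsum("bd,bkd->bk", q, hard_neg)     # (B, K)

    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positive
    return F.cross_entropy(logits, labels)
```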
To filter out these false negatives, we adopt a ranking-consistency filtering method (top-k filtering): each query's similarity to its original positive document is ranked against the entire corpus, and training pairs whose positive is not ranked within the top k are discarded. This filtering is performed jointly with hard-negative mining to avoid redundant similarity computations.
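One way to realise this shared computation is to reuse a single nearest-neighbour search per query for both filtering and mining. The sketch below uses FAISS with assumed variable names, an assumed k, and an assumed number of hard negatives; it illustrates the idea rather than the authors' exact implementation.

```python
import numpy as np
import faiss


def mine_and_filter(query_emb, pos_ids, corpus_emb, top_k=20, num_negs=7):
    """One ANN search per query serves two purposes:
    1. ranking-consistency filtering: drop pairs whose positive document
       is not among the query's top_k nearest corpus entries;
    2. hard-negative mining: keep the highest-ranked non-positive documents
       as hard negatives for the surviving pairs.

    query_emb:  (N, D) L2-normalized query embeddings
    pos_ids:    (N,)   index of each query's positive document in the corpus
    corpus_emb: (M, D) L2-normalized corpus embeddings
    """
    index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product = cosine
    index.add(corpus_emb.astype(np.float32))
    _, nn_ids = index.search(query_emb.astype(np.float32), top_k)

    kept, hard_negatives = [], []
    for i, (pos, neighbours) in enumerate(zip(pos_ids, nn_ids)):
        if pos not in neighbours:
            continue  # positive not in top_k: treat the pair as unreliable
        negs = [int(d) for d in neighbours if d != pos][:num_negs]
        kept.append(i)
        hard_negatives.append(negs)
    return kept, hard_negatives
```

In this sketch, discarding a pair whenever its positive falls outside the top k removes queries that are too broad or have many valid answers, while the remaining top-ranked non-positive documents serve directly as hard negatives, so no second pass over the corpus is needed.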