With the rapid development of large language models (LLMs), retrieval-augmented generation (RAG) has become a key approach to enhancing model performance. However, as the RAG framework has been widely adopted, text embedding models have increasingly become a bottleneck to further progress. Traditional embedding models often perform poorly on multilingual and multi-domain tasks because of the limited quality of their training data. To address this challenge, we introduce the KaLM-Embedding (Knowledge in large Language Models into Embedding) model, which outperforms other models of similar scale in multilingual capability, as demonstrated on the MTEB (Massive Text Embedding Benchmark) evaluation.
During the development of the KaLM-Embedding model, we meticulously designed a data collection strategy to ensure the model excels in multilingual and multi-domain tasks.
Although the fine-tuning data is primarily Chinese and English, with only a small amount of data in other languages, the model still performs well across languages, suggesting that the multilingual capabilities of the pre-trained LLM transfer effectively to the embedding model.
We generated 550,000 high-quality synthetic training examples with Qwen2-72B-Instruct, covering six task types and 40,000 unique instructions. To enhance data diversity, we injected random personas from Persona Hub as system prompts, which effectively increases the domain coverage of the generated data. Because the four retrieval task types require an instruction to be generated before the data itself, we introduced personas only during the instruction-generation phase, avoiding persona conflicts between the two stages.
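To make the two-stage, persona-injected generation concrete, here is a minimal sketch. It assumes an OpenAI-compatible chat endpoint serving Qwen2-72B-Instruct, a local `personas.jsonl` dump of Persona Hub, and illustrative prompts and function names; it is not the actual KaLM-Embedding pipeline.

```python
import json
import random

# Assumption: any OpenAI-compatible client pointed at a local
# Qwen2-72B-Instruct deployment (e.g. vLLM) would work here.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def load_personas(path="personas.jsonl"):
    """Load persona descriptions (e.g. a local dump of Persona Hub)."""
    with open(path) as f:
        return [json.loads(line)["persona"] for line in f]


PERSONAS = load_personas()


def generate_instruction(task_type: str) -> str:
    """Stage 1: generate a retrieval instruction, with a random persona
    injected as the system prompt to diversify the covered domains."""
    persona = random.choice(PERSONAS)
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": (
                f"Write one concise instruction describing a {task_type} "
                "retrieval task in your area of expertise."
            )},
        ],
    )
    return resp.choices[0].message.content


def generate_example(instruction: str) -> str:
    """Stage 2: generate (query, positive document) data for the instruction.
    No persona is injected here, so the data stays consistent with the
    instruction and persona conflicts between the two stages are avoided."""
    resp = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[
            {"role": "user", "content": (
                f"Instruction: {instruction}\n"
                "Generate a JSON object with fields 'query' and 'positive' "
                "that fit this instruction."
            )},
        ],
    )
    return resp.choices[0].message.content
```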
In addition to in-batch negatives, we also retrieve hard negatives from each dataset's corpus. However, some queries correspond to multiple correct documents or answers, or are so broad that they match many documents with only loose relevance. Both situations introduce false negatives, which harm model optimization.
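Before describing the filtering step, it helps to see where these negatives enter the objective. Below is a minimal InfoNCE-style loss sketch in PyTorch with both in-batch and mined hard negatives; the tensor shapes and temperature are assumptions, not the exact KaLM-Embedding training code. A false negative landing in either the in-batch or hard-negative slots would be pushed away from its query even though it is actually relevant, which is why the filtering described next matters.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q, p, hard_neg, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives plus mined hard negatives.

    q:        (B, D) query embeddings
    p:        (B, D) positive document embeddings
    hard_neg: (B, K, D) mined hard-negative embeddings per query
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    # In-batch scores: every other positive in the batch acts as a negative.
    in_batch = q @ p.T                                 # (B, B)
    # Scores against this query's own mined hard negatives.
    hard = torch.einsum("bd,bkd->bk", q, hard_neg)     # (B, K)

    logits = torch.cat([in_batch, hard], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positive
    return F.cross_entropy(logits, labels)
```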
To filter out these false negatives, we adopt a ranking-consistency filtering method (top-k filtering): each query's similarity to its original positive document is ranked against the entire corpus, and training pairs whose positive is not ranked within the top k are discarded. This filtering is performed jointly with hard-negative mining to avoid redundant similarity computations.
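One way to realise this shared computation is to reuse a single nearest-neighbour search per query for both filtering and mining. The sketch below uses FAISS with assumed variable names, an assumed k, and an assumed number of hard negatives; it illustrates the idea rather than the authors' exact implementation.

```python
import numpy as np
import faiss


def mine_and_filter(query_emb, pos_ids, corpus_emb, top_k=20, num_negs=7):
    """One ANN search per query serves two purposes:
    1. ranking-consistency filtering: drop pairs whose positive document
       is not among the query's top_k nearest corpus entries;
    2. hard-negative mining: keep the highest-ranked non-positive documents
       as hard negatives for the surviving pairs.

    query_emb:  (N, D) L2-normalized query embeddings
    pos_ids:    (N,)   index of each query's positive document in the corpus
    corpus_emb: (M, D) L2-normalized corpus embeddings
    """
    index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product = cosine
    index.add(corpus_emb.astype(np.float32))
    _, nn_ids = index.search(query_emb.astype(np.float32), top_k)

    kept, hard_negatives = [], []
    for i, (pos, neighbours) in enumerate(zip(pos_ids, nn_ids)):
        if pos not in neighbours:
            continue  # positive not in top_k: treat the pair as unreliable
        negs = [int(d) for d in neighbours if d != pos][:num_negs]
        kept.append(i)
        hard_negatives.append(negs)
    return kept, hard_negatives
```

In this sketch, discarding a pair whenever its positive falls outside the top k removes queries that are too broad or have many valid answers, while the remaining top-ranked non-positive documents serve directly as hard negatives, so no second pass over the corpus is needed.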