MINERS: Multilingual Language Models as Semantic Retrievers
1 month ago·arXiv

Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models’ representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages, including extremely low-resource languages in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.

Language models (LMs) play a crucial role in learning natural language representations (Cer et al., 2018; Kenton and Toutanova, 2019; Reimers and Gurevych, 2019; Gao et al., 2021; Feng et al., 2022) and have been successfully applied to various natural language processing (NLP) tasks, such as document retrieval (Yang et al., 2019a; Wang et al., 2023). Existing benchmarks have systematically evaluated LMs to provide empirical assessments of their performance across a range of embedding tasks. Some notable benchmarks include Big-Bench (Srivastava et al., 2023), MTEB (Muennighoff et al., 2023a), SemEval (Cer et al., 2017), and BEIR Benchmark (Thakur et al., 2021). MTEB, in particular, has been established as a comprehensive benchmark for evaluating the effectiveness of embeddings in downstream NLP applications. However, their analysis of the multilingual space has been limited to bitext mining, without further exploration of how these embeddings can be utilized in other multilingual downstream tasks.

The advancement of multilingual LMs is remarkable, demonstrating impressive capabilities in adapting to new languages through fine-tuning (Conneau and Lample, 2019; Alabi et al., 2022), learning from few-shot samples via in-context learning (ICL) (Lin et al., 2021; Winata et al., 2021b; Tanwar et al., 2023; Cahyawijaya et al., 2024; Biderman et al., 2024), enabling cross-lingual zero-shot transfer (Ruder et al., 2021), and incorporating language-specific adapters (Ansell et al., 2021; Yong et al., 2023). This exploration now includes low-resource and regional languages not part of the pretraining phase, promoting NLP research for underrepresented languages (Adelani et al., 2022; Winata et al., 2022; Song et al., 2023). However, multilingual LMs face two key challenges: (1) the lack of a comprehensive benchmark for evaluating effectiveness in semantic retrieval, and (2) limited understanding of code-switching (CS) texts common in multilingual communities.

Current CS evaluations focus on model fine-tuning benchmarks (Aguilar et al., 2020; Khanuja et al., 2020; Winata et al., 2021a; Zhang et al., 2023), without deeply exploring their potential as multilingual retrievers. Recent studies by Winata et al. (2023a) have primarily focused on semantic similarity using encoder LMs in zero-shot cross-lingual settings but have not explored their application in generative LMs. This gap presents an opportunity to leverage these models as context providers for multilingual generative LMs (Lewis et al., 2020; Bevilacqua et al., 2022).

In this paper, we introduce MINERS, the first benchmark designed to assess multilingual LMs’ ability in semantic retrieval across various tasks. MINERS evaluates dense vector representations in multiple tasks, including bitext retrieval, retrieval-based classification, and ICL classification. We have developed MINERS to be a reproducible and reliable benchmark that utilizes high-dimensional multilingual vector representations. Notably, these tasks do not require any fine-tuning. The paper’s contributions can be summarized as follows:

• We introduce MINERS, the first comprehensive benchmark designed to systematically evaluate multilingual LMs as semantic retrievers across a vast array of languages. Covering 200+ languages, 11 encoder LMs, and 9 generative LMs, including open-source models and commercial APIs, MINERS offers a robust evaluation framework for assessing the effectiveness of LMs in diverse linguistic contexts.

• We show MINERS is highly adaptable and scalable across various models. By consolidating scores from multiple models, MINERS facilitates a comprehensive evaluation of task performance, providing insights into different approaches’ strengths and weaknesses.

• We provide a thorough analysis across different evaluation difficulty levels, including monolingual, cross-lingual, and CS scenarios. We examine performance variations across different numbers of retrieved samples to offer insights into the impact of sample quantity on retrieval effectiveness.

• We compare the time efficiency of retrieval methods with conventional fine-tuning approaches. We show that retrieval methods require no training and offer comparable performance by leveraging pre-trained models for semantic retrieval tasks.

2.1 Motivation

The MINERS BENCHMARK is introduced as a significant step forward in assessing the capabilities of multilingual LMs in producing high-dimensional representations for semantic retrieval. The benchmark is built around three fundamental aspects:

(1) Language Diversity: The benchmark offers insights into the performance of LMs across a wide array of languages. It assesses not only the models’ effectiveness in high-resource languages but also their capabilities in low-resource languages from various language families. Additionally, the benchmark includes evaluations of unseen languages to gauge the robustness of the models in predicting languages not encountered during pre-training. CS datasets are also incorporated to simulate realistic scenarios where bilingual or multilingual speakers mix languages, providing a more comprehensive assessment of the models’ capabilities.

(2) Usefulness: The benchmark includes evaluations across three distinct tasks to systematically measure the performance of multilingual LMs. First, it assesses the models’ ability to retrieve semantically similar parallel data in bitext retrieval tasks. Second, it uses the retrieved samples for classification, evaluating the models’ accuracy in categorizing text. Third, it employs the retrieved samples as context for generating labels in downstream classification tasks, highlighting the models’ capability to incorporate retrieved information into context-aware classification. Additionally, the benchmark demonstrates the potential of using multiple LMs and APIs together to represent text as an ensemble, further emphasizing their utility.

(3) Efficiency: The benchmark is crafted with efficiency as a key principle. It is designed to be straightforward and easily extendable, accommodating new datasets to ensure its longevity and continued relevance. Additionally, the benchmark is publicly available, promoting result reproducibility and encouraging collaboration and further research within the field. Importantly, the benchmark does not necessitate any model fine-tuning, as all evaluations are conducted exclusively through model inference, thereby streamlining the assessment process.

2.2 Tasks

Our benchmark evaluates LMs on three tasks: bitext retrieval, retrieval-based classification, and ICL classification. Figure 1 provides an overview of tasks. We describe the task details as follows:

Bitext Retrieval This task aims to measure the LM’s ability to retrieve semantically similar samples from parallel datasets. The task is also useful for understanding how the model performs under language distribution shifts, especially when some words are code-switched. Formally, given a parallel dataset $D$ with two languages $L_1$ and $L_2$, we obtain two datasets $D_{L_1}$ and $D_{L_2}$.


Figure 1: MINERS BENCHMARK tasks. In this example, we compare English (en) and Indonesian (id) texts across three tasks: (a) bitext retrieval, (b) retrieval-based classification, and (c) ICL classification. Light blue cubes represent vector representations of samples from the training dataset $D_{\text{train}}$, generated by $M$, while green, yellow, and red cubes denote raw text labels. The few-shot samples $f_i$ in task (c) are retrieved in the same manner as in task (b). The English translations of the text in the figure are as follows: "Saya suka kucing" ("I like cats"), "Saya suka anjing" ("I like dogs"), "Saya benci anjing" ("I hate dogs"), and "Kucing imut" ("Cute cats").

For each sample $x_i$ in $D_{L_1}$, the closest sample $\hat{y}$ is searched for in $D_{L_2}$ by finding the lowest distance score between two samples $x_i$ and $y_j$. The score $s_{i,j}$ is the Euclidean distance between their high-dimensional vector representations, which are generated by an LM $M$: $s_{i,j} = \lVert u_{x_i} - u_{y_j} \rVert_2$, where $u_{x_i}$ and $u_{y_j}$ are the vector representations of samples $x_i$ and $y_j$, respectively. Other distance measures can also be used, but the difference is minimal.
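The nearest-neighbor search described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function names `retrieve_nearest` and `bitext_accuracy` are hypothetical, the embeddings are toy 2-D vectors rather than outputs of an LM $M$, and scoring a retrieval as correct when the index matches is an assumption based on the datasets being parallel.

```python
import math
from typing import List, Sequence


def retrieve_nearest(src_vecs: Sequence[Sequence[float]],
                     tgt_vecs: Sequence[Sequence[float]]) -> List[int]:
    """For each source vector u_x, return the index j of the target
    vector u_y with the lowest Euclidean distance s_ij = ||u_x - u_y||_2."""
    nearest = []
    for u_x in src_vecs:
        distances = [math.dist(u_x, u_y) for u_y in tgt_vecs]
        nearest.append(min(range(len(distances)), key=distances.__getitem__))
    return nearest


def bitext_accuracy(src_vecs, tgt_vecs) -> float:
    """Assumes sample i in D_L1 is aligned with sample i in D_L2."""
    predictions = retrieve_nearest(src_vecs, tgt_vecs)
    hits = sum(1 for i, j in enumerate(predictions) if i == j)
    return hits / len(predictions)


# Toy stand-ins for embeddings of three parallel sentence pairs.
emb_en = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
emb_id = [[0.1, 0.9], [0.9, 0.2], [0.8, 1.1]]

print(retrieve_nearest(emb_en, emb_id))
print(bitext_accuracy(emb_en, emb_id))
```

The exhaustive search costs one distance computation per source-target pair, which is acceptable for a sketch; a larger collection would typically call for a vector index.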

Retrieval-based Classification This task involves using the retrieved samples’ labels from the training set to predict labels in downstream NLP classification tasks. The goal is to assess the usefulness of our retrieved samples and introduce an efficient prediction method by directly searching for similar samples in the training set. Given the $k$ retrieved training samples with labels $[(y_1, l_1), \cdots, (y_k, l_k)]$, a label $\hat{l}$ is selected by majority voting and assigned to the corresponding test sample. Increasing $k$ can enhance performance.
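As a rough sketch of the majority-voting step, the snippet below retrieves the $k$ nearest training samples by Euclidean distance and returns the most frequent label. The name `knn_predict` and the toy data are hypothetical. Ties are broken in favor of the label encountered first in the ranked list, which is a property of this sketch rather than something the paper specifies.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(query_vec: Sequence[float],
                train_vecs: Sequence[Sequence[float]],
                train_labels: Sequence[str],
                k: int = 10) -> str:
    """Predict a label by majority vote over the k nearest training samples."""
    # Rank training indices from closest to farthest.
    ranked = sorted(range(len(train_vecs)),
                    key=lambda j: math.dist(query_vec, train_vecs[j]))
    top_labels: List[str] = [train_labels[j] for j in ranked[:k]]
    # most_common keeps first-seen order among equal counts,
    # so ties fall to the label of the nearer neighbor.
    return Counter(top_labels).most_common(1)[0][0]


train_vecs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["negative", "negative", "positive", "positive", "positive"]

print(knn_predict([0.2, 0.1], train_vecs, train_labels, k=3))
print(knn_predict([5.0, 5.4], train_vecs, train_labels, k=3))
```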

ICL Classification We aim to further utilize the retrieved training samples for natural language generation tasks by using them as few-shot context, combined with task-specific instructions and a query. Formally, given a generative LLM $G$, we input a text sequence $s_i = (r_i; f_i; o_i; q_i)$, which includes a text instruction $r_i$, few-shot samples $f_i = [(y_1, l_1), \cdots, (y_k, l_k)]$, a list of label options $o_i$, and a query $q_i$, to generate an output text sequence. To generate the prediction, we use one of two methods based on the model’s capabilities: (a) computing label probabilities, which offers precise predictions by reducing issues like typos, and (b) directly predicting labels through instructions, which is more efficient as responses match desired labels, eliminating the need to evaluate all options. We use method (a) when we can calculate the log-likelihood of the next token prediction; otherwise, we resort to method (b). For method (a), we compute the probability of each output class, normalize it by the token length, and select the label with the highest probability from the distribution as follows:


where L denotes the number of possible classes. For more details on model inference, please refer to Appendix A.5.
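To make method (a) concrete, here is a minimal sketch under stated assumptions. The prompt layout in `build_prompt` is hypothetical (the actual templates are given in the paper's appendix), and `select_label` assumes that normalization means averaging per-token log-probabilities over each label's tokens. The log-probabilities are hard-coded toy values, not outputs of a real model.

```python
from typing import Dict, List, Sequence, Tuple


def build_prompt(instruction: str,
                 few_shots: Sequence[Tuple[str, str]],
                 options: Sequence[str],
                 query: str) -> str:
    """Assemble s_i = (r_i; f_i; o_i; q_i) with a hypothetical template."""
    parts = [instruction]
    for text, label in few_shots:
        parts.append(f"Text: {text}\nLabel: {label}")
    parts.append("Options: " + ", ".join(options))
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


def select_label(label_token_logprobs: Dict[str, List[float]]) -> Tuple[str, Dict[str, float]]:
    """Score each label by its mean per-token log-probability
    (length normalization) and return the best-scoring label."""
    scores = {label: sum(lps) / len(lps)
              for label, lps in label_token_logprobs.items()}
    best = max(scores, key=scores.get)
    return best, scores


prompt = build_prompt(
    "Classify the sentiment of the text.",
    [("Saya suka kucing", "positive")],
    ["positive", "negative", "neutral"],
    "Saya benci anjing",
)

# Toy per-token log-probabilities for each candidate label.
toy_logprobs = {"positive": [-0.5], "neutral": [-0.3, -0.3, -0.3]}
best, scores = select_label(toy_logprobs)
print(best)
```

In the toy example the three-token label has the lower summed log-probability but the higher per-token average, so length normalization changes which label is selected.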

2.3 Settings

We gauge LMs’ robustness to various text inputs with the following evaluation settings:

Monolingual (Mono): We measure the individual language performance using the same language as train and test sets.

Code-switching (CS): We measure the performance of mixed language datasets. For bitext retrieval, we find a corresponding CS text translation from a monolingual text, or vice versa, and for retrieval-based classification and ICL classification, we take CS texts as input and predict their labels.

Table 1: Dataset list of MINERS BENCHMARK. The symbols indicate the tasks run on datasets. ♢ Retrieval-based classification task. ♠ ICL classification task.

Cross-lingual (XL): We measure the performance of multilingual datasets with one language as the source language and the rest as target languages. For detailed information, please refer to Table 5 in the Appendix.

Cross-lingual Code-switching (XL CS): We tackle a more challenging scenario by evaluating CS data within a cross-lingual context.

2.4 Datasets

Table 1 presents 11 datasets: 7 multilingual and 4 CS datasets, covering both parallel and classification types. Parallel datasets are ideal for bitext retrieval due to their aligned multilingual content, enabling bitext mining and machine translation tasks. Classification datasets include intent classification, sentiment analysis, and topic classification, which we evaluate for retrieval-based and ICL classification tasks. For ICL, we construct prompts using a unified English template across all generative language models to ensure simplicity and consistency. Detailed instructions for each task are provided in Tables 16 and 17 in the Appendix.

2.5 Models

Encoder LMs and APIs We use 9 open-source LMs: LaBSE (Feng et al., 2022), CMLM (Cer et al., 2018), multilingual E5BASE, multilingual E5LARGE (Wang et al., 2024), multilingual MPNetBASEv2 (Song et al., 2020), multilingual MiniLML12-E384 (Wang et al., 2020), Glot-500 (ImaniGooghari et al., 2023), XLM-RBASE, XLM-RLARGE (Conneau and Lample, 2019), and two commercial embedding APIs: Cohere-Embedv3 (embed-multilingual-v3.0) and OpenAI-Embedv3 (text-embedding-3-large).

Generative LMs We opt for 7 different open-source LMs: (1) BLOOMZ (Muennighoff et al., 2023b), an instruction-tuned BLOOM (Le Scao et al., 2023) with three different sizes (560m, 1B, 3B) to further analyze the performance trend when increasing the model size, (2) mT0 3B (xl) (Muennighoff et al., 2023b), an instruction-tuned mT5 (Xue et al., 2021), (3) XGLM (Lin et al., 2021) with two different sizes (564m and 2.9B), (4) Aya-23 8B (Aryabumi et al., 2024), (5) Aya-101 13B (Üstün et al., 2024), (6) Gemma 1.1 Instruct (Team et al., 2024), and (7) Llama 3 8B Instruct (AI@Meta, 2024), and three commercial APIs: (1) Command-R, (2) GPT-3.5 Turbo (gpt-3.5-turbo-0125), and (3) GPT-4o (gpt-4o-2024-05-13). All open-source models can be found on Hugging Face. Please see Table 6 in the Appendix for details.


Table 2: Results for bitext retrieval task ($k = 1$) and retrieval-based classification ($k = 10$). Mono, XL and CS denote monolingual, cross-lingual and code-switching, respectively. Bold and underlined numbers present the best and second-best models. For DistFuse (2), we use $\alpha = 1, \beta = 3$ and for DistFuse (3), we use $\alpha = 1, \beta = 2, \gamma = 3$. The reported weights represent the best-performing configurations identified during our tuning process.

Ensemble Models To enhance scalability and effectiveness, we can use multiple models with DistFuse (Winata et al., 2023a) to improve retrieval results. DistFuse combines models by calculating distance scores of label distributions and merging them through a linear combination. We report two DistFuse settings for bitext retrieval and retrieval-based classification tasks:

DistFuse (2) utilizes two models: LaBSE and E5LARGE;

DistFuse (3) utilizes three models: LaBSE, E5LARGE, and Cohere-Embedv3.

To maintain conciseness, we denote the weights assigned to distances computed by LaBSE, E5LARGE, and Cohere-Embedv3 as  α, β, γ, respectively.
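The linear combination of distances can be illustrated as follows. This is a sketch of the idea only and does not reproduce the DistFuse library's API: `fused_distance` and `retrieve_fused` are hypothetical names, each sample is represented by one toy embedding per model, and the weights follow the DistFuse (2) setting $\alpha = 1, \beta = 3$ reported for Table 2.

```python
import math
from typing import List, Sequence

# One embedding per model for a single sample; models may differ in dimension.
Views = Sequence[Sequence[float]]


def fused_distance(x_views: Views, y_views: Views,
                   weights: Sequence[float]) -> float:
    """Weighted sum of per-model Euclidean distances between two samples."""
    return sum(w * math.dist(u_x, u_y)
               for w, u_x, u_y in zip(weights, x_views, y_views))


def retrieve_fused(query_views: Views, candidates: List[Views],
                   weights: Sequence[float]) -> int:
    """Return the index of the candidate with the lowest fused distance."""
    scores = [fused_distance(query_views, c, weights) for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy example: model 1 yields 2-D vectors, model 2 yields 3-D vectors.
query = [[0.0, 0.0], [0.0, 0.0, 0.0]]
candidates = [
    [[1.0, 0.0], [0.0, 0.0, 3.0]],  # distances: 1 (model 1), 3 (model 2)
    [[2.0, 0.0], [0.0, 0.0, 1.0]],  # distances: 2 (model 1), 1 (model 2)
]

print(retrieve_fused(query, candidates, weights=[1.0, 0.0]))  # model 1 only
print(retrieve_fused(query, candidates, weights=[1.0, 3.0]))  # alpha=1, beta=3
```

With model 1 alone the first candidate is closest, whereas the weighted combination selects the second, showing how the fused score can overturn a single model's ranking.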

3.1 Bitext Retrieval

Table 2 highlights DistFuse (2) and OpenAI-Embedv3-large as top performers in XL and CS tasks, respectively, with LaBSE ranking highest among open-source models. DistFuse (2) demonstrates superior performance across various settings. While XLM-R and Glot-500 struggle in bitext retrieval, they perform better in retrieval-based classification. Most models face challenges in CS tasks for both bitext retrieval and retrieval-based classification, where APIs generally perform slightly better. OpenAI-Embedv3 outperforms Cohere-Embedv3 on CS datasets. The specifics of the APIs’ CS training data remain unclear, which may explain their edge over open-source models. Combining model scores significantly boosts performance, with up to a 2.63% improvement in bitext retrieval over LaBSE and a 1.72% improvement over OpenAI-Embedv3. Similar gains are observed in retrieval-based classification, where the leading DistFuse model, though slightly behind Cohere-Embedv3, notably surpasses OpenAI-Embedv3.

3.2 Retrieval-based Classification Results

Table 2 illustrates that the Cohere-Embedv3 API outperforms all models by an average of 1.95%, with LaBSE closely behind at 1.15%. XLM-R and Glot-500 excel in classification tasks. Despite this, they lag behind models trained with contrastive learning or alignment objectives like LaBSE, CMLM, or E5 models, emphasizing the significance of text alignment in NLP tasks. Merging model scores notably boosts prediction accuracy, especially in Mono and XL settings. However, performance in CS and XL CS settings remains lower compared to API models. Additionally, our approach outperforms fine-tuned models in XL and CS tasks while requiring no fine-tuning.


Table 3: Results on ICL classification with E5LARGE retriever. Bold and underlined numbers present the best and second-best models.


Figure 2: Results with different k = [1, 5, 10] on bitext retrieval: (a) cross-lingual and (b) code-switching, retrieval-based classification: (c) monolingual, (d) cross-lingual, and (e) code-switching.

3.3 ICL Classification Results

Table 3 presents the ICL classification results using E5LARGE as the retriever. Please see Appendix Table 15 for results from alternate retrievers. The inclusion of few-shot context significantly improves the generative LM’s precision in predicting class labels. Performance scales positively with model size in the one-shot setup. For instance, using a model with six times more parameters (BLOOMZ 3B) boosts performance by 2.21% compared to the top BLOOMZ 560m model. However, performance decreases for CS and XL CS tasks with increasing complexity. Despite focusing on English, Llama 3 generally outperforms multilingual open-source models like BLOOMZ, mT0, XGLM, and Aya-23. BLOOMZ excels in the one-shot scenario, outperforming Llama 3. Notably, mT0 outperforms XGLM and Aya-23 in zero-shot settings, despite Aya-23’s larger size. Aya-101 is the top open-source LM in both zero-shot and one-shot tasks, bridging the gap with commercial APIs like GPT-4o. Commercial generative LM APIs, such as GPT-3.5 Turbo and GPT-4o, outperform all other models, particularly in CS and XL CS contexts. However, their superior performance may be attributed to prior exposure to these datasets, though this aspect remains unclear.

3.4 Performance Dynamics Over k

Figure 2 shows a consistent positive trend as the retrieved sample size increases for both bitext retrieval and retrieval-based classification tasks. This indicates that model performance improves with more retrieved samples. In bitext retrieval, a larger k provides a richer set of bilingual text pairs, enhancing retrieval. Similarly, in retrieval-based classification, a larger k offers more contextual examples, leading to more precise label predictions through majority voting.

Figure 3: t-SNE representation of 200 randomly selected training samples from the NusaX dataset. The colors in the panels show the sample ID for (a) and (b), language for (c) and (d), and class for (e) and (f).

4.1 Model Representation

Figure 3 shows 2D scatter plots of the vector representation generated using t-SNE (Van der Maaten and Hinton, 2008). We take 200 random training samples from the NusaX dataset, reduce the high-dimensional vectors into 2D and color the scatter plots in three ways. (1) By sample ID. We assign the same color for parallel samples. (2) By language. We assign a color for each language. (3) By class label. We assign a color for each class label. We observe that the E5LARGE model forms small, color-coded clusters based on sample ID, indicating its proficiency in aligning text across different languages. In contrast, the XLM-RBASE model forms larger clusters where samples of the same language group closely together, suggesting it is more effective at identifying same-language data, even for unseen languages in NusaX. However, XLM-RBASE displays a sparse distribution when classifying samples by sample ID, aligning with our bitext retrieval task results. Both models effectively distinguish label classes, with E5LARGE achieving better color separation than XLM-RBASE, as shown in Figures 3 (e) and (f). Similar findings are observed for other models. For more details, refer to Appendix B.1.

Table 4: FLOPs computation formulae. Here, $n_{\text{epoch}}$ and $n_{\text{dim}}$ denote the number of epochs and vector dimension, respectively. $f_M$ and $b_M$ represent the forward and backward FLOPs of model $M$, respectively. $f_G$ denotes the forward FLOPs of model $G$. The symbols $p_{+}$, $p_{-}$, $p_{\text{sq}}$, and $p_{\sqrt{}}$ indicate the FLOPs required to perform the operations of addition, subtraction, squaring, and square root, respectively. Additionally, $|L|$ and $\overline{|L|}$ denote the number of labels and the average sequence length of the labels, respectively. The variables $|D_{\text{train}}|$, $|D_{\text{dev}}|$, and $|D_{\text{test}}|$ represent the sizes of the train, development, and test data splits, respectively.

4.2 Samples Relevance

Figure 4 analyzes the performance dynamics of BLOOMZ models on the NusaX dataset when retrieving samples from different training data percentiles. Lower percentiles correspond to more semantically similar samples to the query. The results show that as the percentile decreases, performance improves across all three models. This trend underscores the importance of retrieving highly relevant samples for ICL tasks. Semantically aligned samples enhance the context, leading to more accurate predictions.
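One way to picture percentile-based retrieval is to rank all training samples by distance to the query and pick the sample at a given position in that ranking. The sketch below does exactly that. The function name `sample_at_percentile`, the integer rounding rule, and the toy 1-D embeddings are all assumptions made for illustration, since the paper's exact selection procedure is not spelled out here.

```python
import math
from typing import Sequence


def sample_at_percentile(query_vec: Sequence[float],
                         train_vecs: Sequence[Sequence[float]],
                         percentile: int) -> int:
    """Return the index of the training sample found at the given
    percentile of the distance ranking (0 = most similar)."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda j: math.dist(query_vec, train_vecs[j]))
    # Integer arithmetic avoids floating-point rounding surprises;
    # the position is capped so that percentile 100 maps to the last sample.
    position = min(len(ranked) - 1, percentile * len(ranked) // 100)
    return ranked[position]


# Toy 1-D embeddings at increasing distance from the query at 0.
train_vecs = [[float(v)] for v in range(1, 11)]
query = [0.0]

for p in (0, 50, 100):
    print(p, sample_at_percentile(query, train_vecs, p))
```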

4.3 Compute Efficiency

We aim to measure the theoretical time complexity by evaluating computation in terms of FLOPs (Floating Point Operations), irrespective of the machine configuration. Table 4 details the components contributing to this calculation. The time complexity for fine-tuning a model scales with the number of training epochs, with more epochs significantly increasing complexity. The backward pass FLOPs, which are substantially higher than forward pass FLOPs, are a major factor. Retrieval-based classification is much more efficient, relying primarily on generating vector representations through forward passes. The retrieval process itself is efficient, with complexity influenced mainly by the sizes of the training and test datasets, factors that are typically smaller than the computational demands of fine-tuning. In contrast, ICL classification incurs higher inference costs due to the increased forward FLOPs of generative models. With very large LMs, the inference cost can even exceed that of fine-tuning. However, as the training data size increases, the complexity of fine-tuning eventually surpasses ICL model inference. For ICL classification, we have two methods: (a) computing label probabilities, which offers precise predictions, and (b) directly predicting labels through instructions, which is more efficient as responses match desired labels, eliminating the need to evaluate all options. While direct prediction may generate extraneous tokens, this can be mitigated with additional instructions to output only the label.

Figure 4: ICL performance dynamics of BLOOMZ models on the NusaX dataset using context retrieved from various percentiles with E5LARGE. Lower percentiles correspond to more semantically relevant samples.
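The comparison between fine-tuning and retrieval can be made tangible with a back-of-the-envelope estimate. The formulae below are an assumed accounting built from the symbols named in the Table 4 caption; they are not the paper's exact formulae, which are not reproduced in this text. Function names and the toy numbers are hypothetical, and every primitive operation is counted as one FLOP by default.

```python
def distance_flops(n_dim: int, p_add: int = 1, p_sub: int = 1,
                   p_sq: int = 1, p_sqrt: int = 1) -> int:
    """Assumed cost of one Euclidean distance between n_dim-sized vectors:
    n_dim subtractions and squarings, n_dim - 1 additions, one square root."""
    return n_dim * (p_sub + p_sq) + (n_dim - 1) * p_add + p_sqrt


def fine_tuning_flops(n_epoch: int, n_train: int, n_dev: int, n_test: int,
                      f_m: int, b_m: int) -> int:
    """Assumed cost: forward+backward over train and forward over dev
    in every epoch, then one forward pass over the test split."""
    training = n_epoch * n_train * (f_m + b_m)
    validation = n_epoch * n_dev * f_m
    testing = n_test * f_m
    return training + validation + testing


def retrieval_flops(n_train: int, n_test: int, n_dim: int, f_m: int) -> int:
    """Assumed cost: one forward pass per train and test sample, plus an
    exhaustive distance computation for every test-train pair."""
    encoding = (n_train + n_test) * f_m
    search = n_train * n_test * distance_flops(n_dim)
    return encoding + search


# Toy numbers chosen only to make the arithmetic easy to follow.
print(fine_tuning_flops(n_epoch=3, n_train=10, n_dev=2, n_test=5,
                        f_m=100, b_m=200))
print(retrieval_flops(n_train=10, n_test=5, n_dim=4, f_m=100))
```

Under these assumptions the retrieval cost has no epoch or backward-pass term, which is the source of the efficiency gap the section describes.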

Dense Retrieval via LM Dense retrieval has marked a significant advancement in information retrieval, enabling rapid sample searches across vast document collections. Research has focused on training objectives and architectures that produce similarity scores between text samples. Reimers and Gurevych (2019) introduce a Siamese network architecture trained with contrastive learning, enhancing retrieval by enabling vector representation comparison using similarity measures, applied to BERT (Kenton and Toutanova, 2019). Efforts to improve alignment include incorporating annotated pairs from natural language inference datasets using SimCSE loss (Gao et al., 2021). Furthermore, Feng et al. (2022) propose combining monolingual and translation alignment losses to enhance performance, such as masked language modeling (MLM) (Devlin et al., 2019) and translation language modeling (TLM) objectives (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019b). Khattab and Zaharia (2020) introduce a late interaction paradigm, comparing embedding representations via vector similarity indexes for relevance estimation in ranking tasks. Wang et al. (2024) further innovate by using in-batch negatives to leverage weakly supervised data from diverse, heterogeneous sources.

Semantic Retrieval for NLP Tasks Retrieving labels using semantic retrieval has proven beneficial for classification. Bari et al. (2021) enhance accuracy with cross-lingual few-shot nearest neighbor adaptation. Winata et al. (2023a) predict test data labels efficiently using English training data without prior adaptation via ICL. Li et al. (2023) introduce a ranking framework to retrieve high-quality demonstrations for various tasks. Building on these methods, we adopt a straightforward and efficient retrieval approach similar to Winata et al. (2023a), supporting multiple retrieval models for open-source tools and APIs. We extend this approach to the ICL setting, enhancing its utility and accessibility across diverse scenarios.

This paper introduces MINERS BENCHMARK, a benchmark for evaluating the efficacy of multilingual LMs in semantic retrieval tasks, including bitext retrieval and classification through semantic search and retrieval-augmented contexts. Our framework rigorously assesses LMs’ robustness in retrieving samples from over 200 languages. Empirical results demonstrate that our method, which focuses on retrieving semantically similar vector representations, achieves performance comparable to state-of-the-art fine-tuned approaches, without requiring fine-tuning across multiple datasets and languages. We also explore the mechanisms behind these representations, offering insights to improve the efficiency and accuracy of label retrieval methods. Our research aims to pave the way for future exploration and optimization in semantic retrieval and classification, ultimately contributing to more robust and adaptable NLP systems.

We have identified potential avenues for enhancing the performance of the ICL classification task through the application of ensemble techniques such as DistFuse and using the target language prompts instead of English. Additionally, while we have primarily focused on evaluating the BLOOMZ, mT0, XGLM, Gemma, Llama 3, Aya-23, Aya-101, Command-R, GPT-3.5 Turbo, and GPT-4o models within the benchmark, we acknowledge that there may be other models that could also yield promising results. These aspects represent areas for future exploration and expansion of our research efforts. Due to resource limitations and simplicity, we only test a single prompt template. Running with various prompts could yield different results, but we defer this exploration to future research.

In the future, we plan to explore more deeply the capabilities of ensemble techniques like DistFuse to further improve the performance of the ICL classification task. By combining the strengths of multiple models, we aim to enhance the robustness and accuracy of our classification outcomes, ultimately achieving better results in real-world applications. Furthermore, our current evaluation has been limited to a select few models and datasets as part of our initial assessment phase. However, we recognize the importance of conducting a broader evaluation by considering a wider range of models and datasets. This will allow us to gain a more comprehensive understanding of the strengths and weaknesses of different approaches, enabling us to make more informed decisions about model selection and optimization strategies.

Our research aims to evaluate LMs in the context of multilingual semantic retrieval, a field with significant implications for diverse multilingual communities. We strive to ensure that our evaluation is conducted with the utmost transparency and fairness.

David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, et al. 2022. Masakhaner 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508.

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. 2023. Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. arXiv preprint arXiv:2309.07445.

Gustavo Aguilar, Sudipta Kar, and Thamar Solorio. 2020. Lince: A centralized benchmark for linguistic code-switching evaluation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1803–1813.

AI@Meta. 2024. Llama 3 model card.

Jesujoba O Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting pre-trained language models to african languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349.

Alan Ansell, Edoardo Maria Ponti, Jonas Pfeiffer, Sebastian Ruder, Goran Glavaš, Ivan Vulić, and Anna Korhonen. 2021. Mad-g: Multilingual adapter generation for efficient cross-lingual transfer. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4762–4781.

Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. 2024. Aya 23: Open weight releases to further multilingual progress.

M Saiful Bari, Batool Haider, and Saab Mansour. 2021. Nearest neighbour few-shot learning for cross-lingual classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1745–1753.

Michele Bevilacqua, Giuseppe Ottaviano, Patrick Lewis, Scott Yih, Sebastian Riedel, and Fabio Petroni. 2022. Autoregressive search engines: Generating substrings as document identifiers. Advances in Neural Information Processing Systems, 35:31668–31683.

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. 2024. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782.

Samuel Cahyawijaya, Holy Lovenia, and Pascale Fung. 2024. Llms are few-shot in-context low-resource language learners. arXiv preprint arXiv:2403.16512.

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, et al. 2023. Nusawrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 921–945.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder for english. In Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pages 169–174.

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John Philip McCrae. 2020. A sentiment analysis dataset for code-mixed malayalam-english. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 177–184.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. Advances in neural information processing systems, 32.

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic bert sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891.

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, et al. 2023. Massive: A 1m-example multilingual natural language understanding dataset with 51 typologically-diverse languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4277–4302.

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, et al. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176.

Asha Hegde, Mudoor Devadas Anusha, Sharal Coelho, Hosahalli Lakshmaiah Shashirekha, and Bharathi Raja Chakravarthi. 2022. Corpus creation for sentiment analysis in code-mixed tulu text. In Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages, pages 33–40.

Ayyoob ImaniGooghari, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André FT Martins, François Yvon, et al. 2023. Glot500: Scaling multilingual corpora and language models to 500 languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1082–1117.

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, and Monojit Choudhury. 2020. Gluecos: An evaluation benchmark for code-switched nlp. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3575–3585.

Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2023. Bloom: A 176b-parameter open-access multilingual language model.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng Qiu. 2023. Unified demonstration retriever for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4644–4668.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, et al. 2021. Few-shot learning with multilingual language models. arXiv preprint arXiv:2112.10668.

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023a. Mteb: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, et al. 2023b. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111.

Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas Pykl, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. 2020. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 774–790.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Junjie Hu, Dan Garrette, Graham Neubig, et al. 2021. Xtreme-r: Towards more challenging and nuanced multilingual evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10215–10245.

Iyanuoluwa Shode, David Ifeoluwa Adelani, Jing Peng, and Anna Feldman. 2023. Nollysenti: Leveraging transfer learning and machine translation for nigerian movie sentiment classification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 986–998.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. Advances in neural information processing systems, 33:16857–16867.

Yueqi Song, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Winata, Alham Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, et al. 2023. Globalbench: A benchmark for global progress in natural language processing. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14157–14171.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.

Vivek Srivastava and Mayank Singh. 2020. Phinc: A parallel hinglish social media code-mixed corpus for machine translation. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 41–49.

Eshaan Tanwar, Subhabrata Dutta, Manish Borthakur, and Tanmoy Chakraborty. 2023. Multilingual llms are better cross-lingual in-context learners with alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6292–6307.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295.

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Jörg Tiedemann. 2020. The tatoeba translation challenge–realistic data sets for low resource and multilingual mt. In Proceedings of the Fifth Conference on Machine Translation, pages 1174–1182.

Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of machine learning research, 9(11).

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.

Xiao Wang, Craig Macdonald, Nicola Tonellotto, and Iadh Ounis. 2023. Colbert-prf: Semantic pseudo-relevance feedback for dense passage and document retrieval. ACM Transactions on the Web, 17(1):1–39.

Genta Winata, Shijie Wu, Mayank Kulkarni, Thamar Solorio, and Daniel Preoţiuc-Pietro. 2022. Cross-lingual few-shot learning on unseen languages. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 777–791.

Genta Winata, Lingjue Xie, Karthik Radhakrishnan, Yifan Gao, and Daniel Preoţiuc-Pietro. 2023a. Efficient zero-shot cross-lingual inference via retrieval. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 93–104.

Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, et al. 2023b. Nusax: Multilingual parallel sentiment dataset for 10 indonesian local languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834.

Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2021a. Are multilingual models effective in code-switching? NAACL 2021, page 142.

Genta Indra Winata, Andrea Madotto, Zhaojiang Lin, Rosanne Liu, Jason Yosinski, and Pascale Fung. 2021b. Language models are few-shot multilingual learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 1–15.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mt5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498.

Wei Yang, Haotian Zhang, and Jimmy Lin. 2019a. Simple applications of bert for ad hoc document retrieval. arXiv preprint arXiv:1903.10972.

Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019b. Improving multilingual sentence embedding using bidirectional dual encoder with additive margin softmax. arXiv preprint arXiv:1902.08564.

Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, et al. 2023. Bloom+1: Adding language support to bloom for zero-shot prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11682–11703.

Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. Towards preparation of the second bucc shared task: Detecting parallel sentences in comparable corpora. In Ninth Workshop on Building and Using Comparable Corpora, page 38.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2017. Overview of the second bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 60–67.

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third bucc shared task: Spotting parallel sentences in comparable corpora. In Proceedings of 11th workshop on building and using comparable corpora, pages 39–42.

A.1 Baselines

For the task-specific evaluation, we include the following baseline models for comparison:

SOTA We report the state-of-the-art (SOTA) from the existing literature as follows:

Bitext Retrieval: BUCC (Wang et al., 2024) and Tatoeba (Wang et al., 2024).

Classification: MASSIVE (FitzGerald et al., 2023), NollySenti (Shode et al., 2023), NusaX (Winata et al., 2023b, monolingual; Winata et al., 2023a, cross-lingual), and SIB-200 (Adelani et al., 2023). For LinCE SA, we report accuracy on the validation split; to the best of our knowledge, there is no comparable result in the literature. For FIRE 2020, we make a small modification to the labels, so there are no comparable results in the literature either.

Classification Baselines We report the following baselines for classification tasks:

Random: In this baseline, prediction labels are sampled randomly from a uniform distribution. This approach ensures that each label has an equal probability of being selected, regardless of its true distribution within the dataset. It serves as a baseline to compare the effectiveness of more sophisticated methods.

Majority: In this baseline, prediction labels are selected by taking the majority class for all instances. By always predicting the most frequent class observed in the training data, this method provides a simple yet effective baseline, especially in datasets with class imbalance. It helps to highlight the performance of models in recognizing and classifying less frequent classes.

Fine-tune (XLM-RBASE): We fine-tune an XLM-RBASE model on the training split of each dataset and evaluate it on the test split of the same dataset.
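The Random and Majority baselines above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy labels, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter


def random_baseline(label_pool, n, seed=0):
    """Draw n predictions uniformly from the distinct labels (seed is illustrative)."""
    rng = random.Random(seed)
    label_set = sorted(set(label_pool))
    return [rng.choice(label_set) for _ in range(n)]


def majority_baseline(train_labels, n):
    """Predict the most frequent training label for every test instance."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return [majority_label] * n


def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


# Toy data, invented for illustration only.
train_labels = ["positive", "negative", "negative", "neutral", "negative"]
test_gold = ["negative", "positive", "negative", "neutral"]

majority_preds = majority_baseline(train_labels, len(test_gold))
random_preds = random_baseline(train_labels, len(test_gold))
majority_acc = accuracy(majority_preds, test_gold)
```

On imbalanced data the Majority baseline can score well on accuracy while never predicting a minority class, which is why it is a useful reference point alongside the Random baseline.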

A.2 Dataset Preprocessing

To enhance the data clarity for LMs and improve their predictive performance, we apply preprocessing steps to the following two datasets:

FIRE 2020: We merge several non-standard labels into standard ones for sentiment analysis. We map “Mixed_feeling” to “Mixed”, and map “not-malayalam”, “non-tamil”, and “unknown_state” to “Unknown”.

MASSIVE: We replace underscore characters in the labels with spaces.
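The two preprocessing rules can be expressed as simple label-normalization functions. The sketch below uses the label strings quoted above; the function names and the example inputs are hypothetical.

```python
# Label merges for FIRE 2020, as described in the text.
FIRE_LABEL_MAP = {
    "Mixed_feeling": "Mixed",
    "not-malayalam": "Unknown",
    "non-tamil": "Unknown",
    "unknown_state": "Unknown",
}


def normalize_fire_label(label: str) -> str:
    """Map non-standard FIRE 2020 labels; leave all other labels unchanged."""
    return FIRE_LABEL_MAP.get(label, label)


def normalize_massive_label(label: str) -> str:
    """Replace underscores with spaces in a MASSIVE label."""
    return label.replace("_", " ")


# Example inputs are illustrative, not taken from the paper.
fire_examples = [normalize_fire_label(x) for x in ["Mixed_feeling", "non-tamil", "Positive"]]
massive_example = normalize_massive_label("alarm_set")
```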

A.3 Languages Under Study

Table 5 presents a comprehensive list of source and target language pairs used in our cross-lingual experiments. The datasets apply different language code standards. To ensure consistency and uphold the integrity of the original datasets, we have reported the language codes exactly as they appear in the respective sources.

A.4 LM Sources

We extensively utilize a range of open-source encoder and generative LMs from the Hugging Face repository to ensure our evaluations are comprehensive and transparent. The models we employ are detailed in Table 6, showcasing the diversity in architectures and training objectives. These open-source models provide a solid foundation for our evaluations, allowing us to benchmark against widely accepted standards in the NLP community. For commercial models, we leverage state-of-the-art APIs to access robust and high-performance LMs. Specifically, we use the OpenAI API to retrieve generation responses from GPT-3.5 Turbo and GPT-4. Additionally, we utilize Cohere’s Embed API to incorporate the Cohere-Embedv3 model.

Figure 5: t-SNE representation of 200 random samples from the NusaX dataset. The colors in the figures show the sample ID for (a) and (b), the language for (c) and (d), and the class for (e) and (f).

A.5 LM Inference

We run our model inference on an A100 40G GPU, utilizing 8-bit quantization (Dettmers et al., 2022) to optimize memory usage and speed up inference. Our experiments investigate the impact of varying the number of retrieved samples k ∈ [1, 5, 10] to understand how retrieval quality and classification performance change with the number of instances. These samples are used for both bitext retrieval and retrieval-based classification tasks. For the ICL classification task, we evaluate our model in both zero-shot and one-shot scenarios using two methods: (1) predicting the label distribution by computing the next token probability, and (2) generating the response directly. For BLOOMZ, Aya, and XGLM models, we use the first method since we have access to the next token prediction logits. For Llama 3, Gemma, and mT0 models, obtaining these logits is less straightforward. Specifically, the presence of numerous special tokens in Llama 3 complicates logit calculation, so we opt for the second method, which leverages the model’s strong capability to generate exact labels by following instructions. Similarly, for GPT-3.5 Turbo and GPT-4o models, we adopt the second method because we do not have direct access to the logits for all possible classes. These models excel in instruction following, making direct response generation a practical and effective approach.

Table 5: List of source and target languages for all datasets in the cross-lingual setting. Each dataset employs a different language code standard, and we have reported them as used.

Table 6: Hugging Face models.

Table 7: Bitext retrieval F1@1 performance on two different source-to-target language(s) directions. Bold and underlined numbers present the best and second-best models.
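To make the first ICL method in A.5 concrete, the sketch below turns per-label next-token logits into a label distribution and picks the most probable label. It assumes that one logit per candidate label has already been read from the model output; how labels map to tokens is not specified here, and the function names and logit values are invented for illustration.

```python
import math


def label_distribution(label_logits):
    """Softmax restricted to the candidate labels (numerically stabilized)."""
    max_logit = max(label_logits.values())
    exps = {label: math.exp(v - max_logit) for label, v in label_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def predict_label(label_logits):
    """Return the label with the highest normalized probability."""
    dist = label_distribution(label_logits)
    return max(dist, key=dist.get)


# Hypothetical logits for three sentiment labels.
example_logits = {"positive": 2.0, "negative": 0.5, "neutral": -1.0}
example_dist = label_distribution(example_logits)
example_pred = predict_label(example_logits)
```

The second method (direct generation) needs no such normalization: the generated string is simply compared with the label set, which is why it remains usable when logits are unavailable.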


Table 8: Hyper-parameters for fine-tuning baselines.

A.6 Hyper-parameters

To ensure fair and consistent evaluations across our models, we employ a set of specific hyper-parameters during the inference stage, as detailed in Table 9. These hyper-parameters have been carefully chosen to standardize the evaluation process and ensure that our comparisons are both meaningful and reliable. For our fine-tuning baselines, we adopt a different set of hyper-parameters, which are listed comprehensively in Table 8. These parameters are optimized to enhance the model’s performance during the fine-tuning phase. Moreover, to streamline the fine-tuning process, we have decided not to incorporate any warmup steps. The linear scheduler has been chosen for its simplicity and effectiveness.
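As a rough illustration of a linear schedule without warmup, the learning rate at step t decays as base_lr · (1 − t/T) over T total steps. The helper below is a hypothetical sketch, not the authors' training code; `base_lr` and `total_steps` are placeholder values rather than the settings in Table 8.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear decay to zero, with an optional linear warmup (0 means no warmup)."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Placeholder values for illustration only.
base_lr = 1e-5
total_steps = 100
lr_start = linear_lr(0, total_steps, base_lr)
lr_mid = linear_lr(50, total_steps, base_lr)
lr_end = linear_lr(100, total_steps, base_lr)
```

With `warmup_steps=0`, training starts at the full learning rate and decreases monotonically from the first step.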

B.1 LM Representation Visualization

In Figure 5, we present the t-SNE 2D visualization of a subset of 200 randomly selected samples from the NusaX dataset. The visualization showcases how the LaBSE and Cohere-Embedv3 models effectively align samples originating from various languages in a meaningful and interpretable manner. Notably, both models exhibit a high level of proficiency in grouping the samples based on their class labels, indicating robust performance in semantic alignment tasks. This finding is consistent with the behavior observed in models that have been trained using contrastive learning methods, such as the E5 models. The ability of these models to accurately capture semantic relationships across multilingual data highlights their effectiveness in handling diverse linguistic contexts and tasks.

B.2 Retrieved Samples

We conduct a detailed comparison of the retrieved samples to assess their quality in terms of semantic relevance to the query. Table 10 presents a comparative analysis between the retrieved samples from E5LARGE and XLM-RBASE. Moreover, Table 11 showcases the retrieved samples from LaBSE. Our evaluation reveals that the samples retrieved from E5LARGE and LaBSE predominantly contain correct labels, with four out of five labels being accurate. In contrast, the samples retrieved by XLM-RBASE exhibit a lower accuracy rate, with only two out of five labels being correct. This analysis underscores the varying performance in sample quality and label accuracy across the different models, emphasizing the significance of retrieval quality in downstream tasks.

Table 9: Hyper-parameters for model inference using Hugging Face models, such as BLOOMZ, mT0, XGLM, Aya-23, Aya-101, Gemma 1.1 7B Instruct, and Llama 3 8B Instruct, and APIs, including Command-R, GPT-3.5 Turbo, and GPT-4o.

C.1 Bitext Retrieval Results

Table 12 presents the complete empirical results for each dataset and model in the bitext retrieval task. Generally, model performance improves as the number of retrieved samples k increases.

Bitext Retrieval is Asymmetrical We evaluate the bitext retrieval performance with different source and target language(s) directions. Based on the results presented in Table 7, it is evident that the bitext retrieval performance is asymmetrical. Specifically, we observe that using non-English data to retrieve English data tends to be more effective than the reverse scenario.

Figure 6: Results for the retrieval-based classification task on the SIB-200 dataset, using k values of [1, 5, 10], across various language families.

Figure 7: Results for the retrieval-based classification task on the SIB-200 dataset, using k values of [1, 5, 10], across various language scripts.
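One way to see why bitext retrieval need not be symmetric: retrieving targets from sources takes the best match per row of the similarity matrix, while the reverse direction takes the best match per column, and the two can disagree. The toy sketch below uses invented 2D vectors and top-1 accuracy as a simplified stand-in for F1@1; it is not the paper's evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(queries, candidates):
    """Share of queries whose nearest candidate is the gold pair (same index)."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))
        hits += int(best == i)
    return hits / len(queries)


def unit(degrees):
    """Unit vector at the given angle; used only to build toy embeddings."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))


# Toy embeddings: sentence i in `src` is the translation of sentence i in `tgt`.
src = [unit(10), unit(30)]
tgt = [unit(0), unit(70)]

src_to_tgt = top1_accuracy(src, tgt)
tgt_to_src = top1_accuracy(tgt, src)
```

Here the second source sentence is closer to the first target than to its own translation, so source-to-target retrieval makes an error, while every target still finds its correct source.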

C.2 Retrieval-based Classification Results

Table 13 presents the complete results for the retrieval-based classification task in both Mono and CS settings. Table 14 provides the full results for the XS and XS CS settings. Figure 6 presents the performance results across various language families on the SIB-200 dataset for different values of k. Notably, Indo-European languages consistently achieve the highest accuracies. In contrast, Afro-Asiatic, Austroasiatic, and Sino-Tibetan language families exhibit the greatest standard deviations in their results. Figure 7 shows the performance results across various language scripts on the SIB-200 dataset for different values of k. It is evident that the Latin script generally achieves the highest performance, albeit with the highest standard deviation. Conversely, the scripts Nkoo, Olck, Tibt, and Tfng exhibit the lowest performance.
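For readers who want a concrete picture of retrieval-based classification with k retrieved samples, the sketch below retrieves the k most similar training embeddings by cosine similarity and takes a majority vote over their labels. The majority-vote aggregation, the function names, and the toy 2D embeddings are assumptions for illustration, not details confirmed by this appendix.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_predict(query, train_vecs, train_labels, k):
    """Label a query by majority vote over its k most similar training samples."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query, train_vecs[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration.
train_vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.2, 0.8)]
train_labels = ["sports", "sports", "politics", "politics", "politics"]
query = (0.8, 0.2)

pred_k1 = knn_predict(query, train_vecs, train_labels, k=1)
pred_k3 = knn_predict(query, train_vecs, train_labels, k=3)
pred_k5 = knn_predict(query, train_vecs, train_labels, k=5)
```

In this toy case the prediction flips at k = 5 because the vote then includes every training sample, which illustrates why results can vary with k.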

C.3 ICL Classification Results

Table 15 presents the complete results for the ICL classification task in Mono, XS, CS, and XS CS settings.

Prompt examples used for ICL classification are provided in Tables 16 and 17. Specifically, we use two different templates: for direct prediction, label options are added to the prompt; for prediction by calculating label probabilities, label options are omitted, resulting in shorter prompts.

We conduct a simplified hyper-parameter tuning process to determine the optimal weights for each model. Due to time constraints, we explore only a few weight combinations. For DistFuse (2), we evaluate two combinations: (1) [α = 1 and β = 1], and (2) [α = 1 and β = 3]. For DistFuse (3), we assess three combinations: (1) [α = 1, β = 1, γ = 1], (2) [α = 1, β = 1, γ = 3], and (3) [α = 1, β = 2, γ = 3].

Table 10: Retrieved samples from E5LARGE and XLM-RBASE.
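The weight combinations above can be read as coefficients in a fused score. The sketch below assumes that fusion is a weighted sum of per-model similarity scores for each candidate, which this appendix does not spell out; the function name and the scores are hypothetical.

```python
def fuse_scores(score_lists, weights):
    """Weighted sum of per-model similarity scores, one fused score per candidate."""
    n_candidates = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, score_lists))
        for i in range(n_candidates)
    ]


def best_candidate(fused):
    """Index of the candidate with the highest fused score."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Weight grids listed in the text for two and three models.
WEIGHTS_TWO_MODELS = [(1, 1), (1, 3)]
WEIGHTS_THREE_MODELS = [(1, 1, 1), (1, 1, 3), (1, 2, 3)]

# Hypothetical similarity scores from two models for three candidates.
model_a = [0.9, 0.5, 0.2]
model_b = [0.4, 0.6, 0.1]

picks = [
    best_candidate(fuse_scores([model_a, model_b], w)) for w in WEIGHTS_TWO_MODELS
]
```

With equal weights the first candidate wins; giving the second model three times the weight shifts the choice to the second candidate, which is the kind of effect the small grid search is meant to probe.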


Table 11: Retrieved samples from LaBSE.


Table 12: Results on bitext retrieval. Bold and underlined numbers present the best and second-best models.


Table 13: Results on retrieval-based classification. Bold and underlined numbers present the best and second-best models. For FIRE 2020, we modify the labels, thus there are no comparable results in the literature. For LinCE SA, we evaluate on the development split and we could not find any comparable result in the literature.


Table 14: Results on retrieval-based classification in the cross-lingual setting. The source language is English for all datasets except FIRE 2020, where the source language is Tamil. Bold and underlined numbers present the best and second-best models. We preprocess the dataset differently from the original dataset. Thus, there are no comparable results in the literature.


Table 15: Results on ICL classification. Bold and underlined numbers present the best and second-best models.

Template Instruction:<INSTRUCTION> Please only output the label. <FEW-SHOT SAMPLE>

Table 16: Prompt examples. k = 1 with LaBSE.


Table 17: Prompt examples.  k = 1 with E5LARGE.
