Below are the evaluation results of the models on the different tasks. The results are presented as tables: the first column gives the model name, and the remaining columns report performance on Thai (th), Indonesian (id), and Vietnamese (vi), respectively (for M3Exam, Javanese (jv) takes the place of Indonesian). The results of the Sailor models are highlighted in bold.
| 3-shot (EM / F1) | XQuAD (th) | TydiQA (id) | XQuAD (vi) |
|---|---|---|---|
| Qwen1.5-0.5B | 14.19 / 23.35 | 20.71 / 32.64 | 19.85 / 35.38 |
| **Sailor-0.5B** | **15.84 / 27.58** | **30.44 / 54.74** | **21.13 / 40.57** |
| Qwen1.5-1.8B | 27.24 / 43.56 | 29.73 / 53.76 | 29.17 / 48.15 |
| **Sailor-1.8B** | **32.72 / 48.66** | **40.88 / 65.37** | **34.22 / 53.35** |
| Qwen1.5-4B | 34.03 / 53.40 | 48.32 / 72.68 | 43.71 / 63.86 |
| **Sailor-4B** | **46.82 / 63.34** | **53.98 / 73.48** | **47.65 / 67.09** |
| Llama-2-7b | 30.64 / 43.80 | 56.64 / 72.14 | 46.96 / 66.16 |
| Mistral-7B-v0.1 | 48.48 / 63.27 | 63.54 / 78.73 | 53.72 / 72.75 |
| SeaLLM-7b-Hybrid | 49.70 / 67.62 | 50.62 / 75.21 | 49.62 / 70.74 |
| SeaLLM-7b-v2 | 34.55 / 55.13 | 52.21 / 77.00 | 46.19 / 72.11 |
| Qwen1.5-7B | 53.79 / 69.30 | 57.17 / 77.28 | 56.63 / 76.99 |
| **Sailor-7B** | **57.88 / 71.06** | **60.53 / 75.42** | **53.81 / 74.62** |
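For reference, EM and F1 in the table above are the standard extractive-QA metrics: EM counts a prediction as correct only when it matches the gold answer exactly after normalization, while F1 measures token-level overlap between prediction and gold answer. The snippet below is a minimal sketch of these definitions, not the exact harness used here; it assumes simple lowercasing and whitespace tokenization, omitting punctuation and article stripping for brevity.

```python
# Minimal sketch of SQuAD-style EM / F1 (assumption: standard definitions,
# whitespace tokenization; the actual evaluation code may normalize differently).
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only if the normalized strings are identical
    return float(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: prediction "Hanoi city" vs. gold "Hanoi" -> EM = 0.0, F1 = 2/3.
```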
| 3-shot (EM) | XCOPA (th) | XCOPA (id) | XCOPA (vi) |
|---|---|---|---|
| Random guess | 50.00 | 50.00 | 50.00 |
| Qwen1.5-0.5B | 51.00 | 52.20 | 53.80 |
| **Sailor-0.5B** | **51.00** | **58.20** | **58.00** |
| Qwen1.5-1.8B | 52.60 | 51.60 | 53.40 |
| **Sailor-1.8B** | **53.80** | **64.20** | **63.20** |
| Qwen1.5-4B | 53.40 | 55.00 | 57.80 |
| **Sailor-4B** | **53.40** | **69.20** | **68.20** |
| Llama-2-7b | 52.80 | 64.00 | 62.00 |
| Mistral-7B-v0.1 | 57.20 | 62.40 | 61.60 |
| SeaLLM-7b-Hybrid | 58.20 | 71.60 | 67.60 |
| SeaLLM-7b-v2 | 56.80 | 64.00 | 64.60 |
| Qwen1.5-7B | 54.20 | 62.20 | 66.20 |
| **Sailor-7B** | **59.00** | **72.20** | **72.20** |
| 3-shot (EM) | Belebele (th) | Belebele (id) | Belebele (vi) |
|---|---|---|---|
| Random guess | 25.00 | 25.00 | 25.00 |
| Qwen1.5-0.5B | 29.89 | 26.89 | 30.22 |
| **Sailor-0.5B** | **32.22** | **30.89** | **32.33** |
| Qwen1.5-1.8B | 30.11 | 32.00 | 31.33 |
| **Sailor-1.8B** | **34.22** | **34.89** | **35.33** |
| Qwen1.5-4B | 32.78 | 36.22 | 35.22 |
| **Sailor-4B** | **36.11** | **41.33** | **38.89** |
| Llama-2-7b | 31.78 | 39.78 | 38.00 |
| Mistral-7B-v0.1 | 34.33 | 41.33 | 41.33 |
| SeaLLM-7b-Hybrid | 37.78 | 43.11 | 43.00 |
| SeaLLM-7b-v2 | 36.33 | 43.11 | 47.00 |
| Qwen1.5-7B | 38.33 | 42.00 | 42.89 |
| **Sailor-7B** | **41.56** | **44.33** | **45.33** |
We have observed that the performance discrepancy of Sailor on M3Exam is due to a significant option bias, which leads the Sailor models to favor certain option IDs (e.g., always C) when making predictions. For a more detailed explanation of the Generation and Perplexity evaluation settings, please refer to our paper.
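To illustrate the distinction, the sketch below contrasts the two common ways of scoring a multiple-choice question: generation, where the model free-form decodes an option ID (and option bias shows up as a skewed distribution over the decoded IDs), and perplexity, where each candidate option is scored by its log-likelihood under the model. This is a hypothetical sketch under assumed prompt formatting and checkpoint name, not the harness used to produce the tables below.

```python
# Hypothetical sketch: generation-based vs. perplexity-based multiple-choice
# scoring. Not the official evaluation code; checkpoint and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "sail/Sailor-0.5B"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
options = ["A", "B", "C", "D"]

# Generation setting: the model decodes the option ID itself.
# Option bias appears as a skew toward one ID (e.g., always "C").
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
generated_id = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]).strip()

# Perplexity setting: score each candidate continuation by its average
# log-likelihood and pick the highest-scoring option.
def option_logprob(prompt: str, option: str) -> float:
    full = tokenizer(prompt + " " + option, return_tensors="pt")
    # approximation: prefix length taken from tokenizing the prompt alone
    prefix_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts tokens 1..L-1
    targets = full["input_ids"][0, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[prefix_len - 1:].mean().item()          # option tokens only

best_option = max(options, key=lambda o: option_logprob(prompt, o))
```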
| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|---|---|---|---|
| Qwen1.5-0.5B | 22.38 | 22.10 | 29.12 |
| **Sailor-0.5B** | **21.87** | **28.84** | **23.53** |
| Qwen1.5-1.8B | 23.81 | 26.15 | 36.39 |
| **Sailor-1.8B** | **23.90** | **29.65** | **27.67** |
| Qwen1.5-4B | 26.26 | 30.19 | 40.02 |
| **Sailor-4B** | **27.23** | **29.11** | **31.58** |
| Llama-2-7B | 21.13 | 23.99 | 34.15 |
| Mistral-7B-v0.1 | 29.59 | 31.00 | 43.54 |
| Typhoon-7B | 36.71 | -- | -- |
| VinaLLaMA-7B | -- | -- | 36.95 |
| Sea-Lion-7B | 23.90 | 21.56 | 26.89 |
| SeaLLM-7B-Hybrid | 25.98 | 24.53 | 38.79 |
| SeaLLM-7B-v2 | 35.60 | 29.92 | 50.36 |
| Qwen1.5-7B | 35.88 | 33.15 | 51.09 |
| **Sailor-7B** | **38.33** | **35.85** | **51.98** |
| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|---|---|---|---|
| Random guess | 22.90 | 25.00 | 25.21 |
| Qwen1.5-0.5B | 22.93 | 25.07 | 26.66 |
| **Sailor-0.5B** | **24.41** | **26.15** | **30.91** |
| Qwen1.5-1.8B | 24.04 | 24.26 | 28.68 |
| **Sailor-1.8B** | **25.38** | **28.30** | **34.71** |
| Qwen1.5-4B | 24.50 | 24.26 | 30.02 |
| **Sailor-4B** | **27.88** | **31.27** | **40.69** |
| Llama-2-7b | 23.67 | 25.07 | 33.15 |
| Mistral-7B-v0.1 | 26.03 | 26.68 | 36.11 |
| SeaLLM-7b-Hybrid | 27.18 | 26.95 | 36.50 |
| SeaLLM-7b-v2 | 28.48 | 29.92 | 39.18 |
| Qwen1.5-7B | 25.75 | 26.15 | 36.28 |
| **Sailor-7B** | **30.00** | **32.88** | **44.10** |