Below are the evaluation results of the models on the different tasks. The results are presented as tables: the first column gives the model name, and the remaining columns report performance on Thai (th), Indonesian (id), and Vietnamese (vi), respectively (for M3Exam, Javanese (jv) takes the place of Indonesian). The results of the Sailor models are highlighted in bold.
| 3-shot (EM / F1) | XQuAD (th) | TydiQA (id) | XQuAD (vi) |
|---|---|---|---|
| Qwen1.5-0.5B | 14.19 / 23.35 | 20.71 / 32.64 | 19.85 / 35.38 |
| **Sailor-0.5B** | **15.84 / 27.58** | **30.44 / 54.74** | **21.13 / 40.57** |
| Qwen1.5-1.8B | 27.24 / 43.56 | 29.73 / 53.76 | 29.17 / 48.15 |
| **Sailor-1.8B** | **32.72 / 48.66** | **40.88 / 65.37** | **34.22 / 53.35** |
| Qwen1.5-4B | 34.03 / 53.40 | 48.32 / 72.68 | 43.71 / 63.86 |
| **Sailor-4B** | **46.82 / 63.34** | **53.98 / 73.48** | **47.65 / 67.09** |
| Llama-2-7b | 30.64 / 43.80 | 56.64 / 72.14 | 46.96 / 66.16 |
| Mistral-7B-v0.1 | 48.48 / 63.27 | 63.54 / 78.73 | 53.72 / 72.75 |
| SeaLLM-7b-Hybrid | 49.70 / 67.62 | 50.62 / 75.21 | 49.62 / 70.74 |
| SeaLLM-7b-v2 | 34.55 / 55.13 | 52.21 / 77.00 | 46.19 / 72.11 |
| Qwen1.5-7B | 53.79 / 69.30 | 57.17 / 77.28 | 56.63 / 76.99 |
| **Sailor-7B** | **57.88 / 71.06** | **60.53 / 75.42** | **53.81 / 74.62** |
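For reference, EM and F1 in the table above are the standard extractive-QA metrics: EM counts a prediction as correct only when it matches the gold answer exactly after normalization, while F1 measures token-level overlap between prediction and gold answer. The snippet below is a minimal sketch of these definitions, not the exact harness used here; it assumes simple lowercasing and whitespace tokenization, omitting punctuation and article stripping for brevity.

```python
# Minimal sketch of SQuAD-style EM / F1 (assumption: standard definitions,
# whitespace tokenization; the actual evaluation code may normalize differently).
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    # 1.0 only if the normalized strings are identical
    return float(prediction.strip().lower() == gold.strip().lower())

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: prediction "Hanoi city" vs. gold "Hanoi" -> EM = 0.0, F1 = 2/3.
```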
| 3-shot (EM) | XCOPA (th) | XCOPA (id) | XCOPA (vi) |
|---|---|---|---|
| Random guess | 50.00 | 50.00 | 50.00 |
| Qwen1.5-0.5B | 51.00 | 52.20 | 53.80 |
| **Sailor-0.5B** | **51.00** | **58.20** | **58.00** |
| Qwen1.5-1.8B | 52.60 | 51.60 | 53.40 |
| **Sailor-1.8B** | **53.80** | **64.20** | **63.20** |
| Qwen1.5-4B | 53.40 | 55.00 | 57.80 |
| **Sailor-4B** | **53.40** | **69.20** | **68.20** |
| Llama-2-7b | 52.80 | 64.00 | 62.00 |
| Mistral-7B-v0.1 | 57.20 | 62.40 | 61.60 |
| SeaLLM-7b-Hybrid | 58.20 | 71.60 | 67.60 |
| SeaLLM-7b-v2 | 56.80 | 64.00 | 64.60 |
| Qwen1.5-7B | 54.20 | 62.20 | 66.20 |
| **Sailor-7B** | **59.00** | **72.20** | **72.20** |
| 3-shot (EM) | Belebele (th) | Belebele (id) | Belebele (vi) |
|---|---|---|---|
| Random guess | 25.00 | 25.00 | 25.00 |
| Qwen1.5-0.5B | 29.89 | 26.89 | 30.22 |
| **Sailor-0.5B** | **32.22** | **30.89** | **32.33** |
| Qwen1.5-1.8B | 30.11 | 32.00 | 31.33 |
| **Sailor-1.8B** | **34.22** | **34.89** | **35.33** |
| Qwen1.5-4B | 32.78 | 36.22 | 35.22 |
| **Sailor-4B** | **36.11** | **41.33** | **38.89** |
| Llama-2-7b | 31.78 | 39.78 | 38.00 |
| Mistral-7B-v0.1 | 34.33 | 41.33 | 41.33 |
| SeaLLM-7b-Hybrid | 37.78 | 43.11 | 43.00 |
| SeaLLM-7b-v2 | 36.33 | 43.11 | 47.00 |
| Qwen1.5-7B | 38.33 | 42.00 | 42.89 |
| **Sailor-7B** | **41.56** | **44.33** | **45.33** |
We have observed that the performance discrepancy of Sailor on M3Exam is due to a significant option bias, which leads the Sailor models to favor certain option IDs (e.g., always C) when making predictions. For a more detailed explanation of the Generation and Perplexity evaluation settings, please refer to our paper.
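To illustrate the distinction, the sketch below contrasts the two common ways of scoring a multiple-choice question: generation, where the model free-form decodes an option ID (and option bias shows up as a skewed distribution over the decoded IDs), and perplexity, where each candidate option is scored by its log-likelihood under the model. This is a hypothetical sketch under assumed prompt formatting and checkpoint name, not the harness used to produce the tables below.

```python
# Hypothetical sketch: generation-based vs. perplexity-based multiple-choice
# scoring. Not the official evaluation code; checkpoint and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "sail/Sailor-0.5B"  # assumption: any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
options = ["A", "B", "C", "D"]

# Generation setting: the model decodes the option ID itself.
# Option bias appears as a skew toward one ID (e.g., always "C").
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
generated_id = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]).strip()

# Perplexity setting: score each candidate continuation by its average
# log-likelihood and pick the highest-scoring option.
def option_logprob(prompt: str, option: str) -> float:
    full = tokenizer(prompt + " " + option, return_tensors="pt")
    # approximation: prefix length taken from tokenizing the prompt alone
    prefix_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts tokens 1..L-1
    targets = full["input_ids"][0, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[prefix_len - 1:].mean().item()          # option tokens only

best_option = max(options, key=lambda o: option_logprob(prompt, o))
```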
| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|---|---|---|---|
| Qwen1.5-0.5B | 22.38 | 22.10 | 29.12 |
| **Sailor-0.5B** | **21.87** | **28.84** | **23.53** |
| Qwen1.5-1.8B | 23.81 | 26.15 | 36.39 |
| **Sailor-1.8B** | **23.90** | **29.65** | **27.67** |
| Qwen1.5-4B | 26.26 | 30.19 | 40.02 |
| **Sailor-4B** | **27.23** | **29.11** | **31.58** |
| Llama-2-7B | 21.13 | 23.99 | 34.15 |
| Mistral-7B-v0.1 | 29.59 | 31.00 | 43.54 |
| Typhoon-7B | 36.71 | -- | -- |
| VinaLLaMA-7B | -- | -- | 36.95 |
| Sea-Lion-7B | 23.90 | 21.56 | 26.89 |
| SeaLLM-7B-Hybrid | 25.98 | 24.53 | 38.79 |
| SeaLLM-7B-v2 | 35.60 | 29.92 | 50.36 |
| Qwen1.5-7B | 35.88 | 33.15 | 51.09 |
| **Sailor-7B** | **38.33** | **35.85** | **51.98** |
| 3-shot (EM) | M3Exam (th) | M3Exam (jv) | M3Exam (vi) |
|---|---|---|---|
| Random guess | 22.90 | 25.00 | 25.21 |
| Qwen1.5-0.5B | 22.93 | 25.07 | 26.66 |
| **Sailor-0.5B** | **24.41** | **26.15** | **30.91** |
| Qwen1.5-1.8B | 24.04 | 24.26 | 28.68 |
| **Sailor-1.8B** | **25.38** | **28.30** | **34.71** |
| Qwen1.5-4B | 24.50 | 24.26 | 30.02 |
| **Sailor-4B** | **27.88** | **31.27** | **40.69** |
| Llama-2-7b | 23.67 | 25.07 | 33.15 |
| Mistral-7B-v0.1 | 26.03 | 26.68 | 36.11 |
| SeaLLM-7b-Hybrid | 27.18 | 26.95 | 36.50 |
| SeaLLM-7b-v2 | 28.48 | 29.92 | 39.18 |
| Qwen1.5-7B | 25.75 | 26.15 | 36.28 |
| **Sailor-7B** | **30.00** | **32.88** | **44.10** |