(:star2: denotes equal contribution)
For now, we release the human annotation dataset robust_long_abstractive_human_annotation_dataset.jsonl (also provided as a .csv). We use this dataset for the metric comparison in Section 5 of our paper.
Data Field | Definition |
---|---|
dataset | Whether the model-generated summary comes from the arXiv or the GovReport dataset. |
dataset_id | "ID_" followed by the document ID from the original arXiv or GovReport dataset. To match IDs with the original datasets, remove the "ID_" prefix (see the loading sketch below the table). |
model_type | Model variant that generated the summary. 1K, 4K and 8K represent input token limits of 1,024, 4,096 and 8,192, respectively. For more information, please refer to the original paper. |
model_summary | Model-generated summary |
relevance | Percentage of the reference summary’s main ideas contained in the generated summary. Higher = Better. |
factual consistency | Percentage of factually consistent sentences in the generated summary. Higher = Better. |
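
As a quick-start illustration (not part of the official release), the sketch below loads the JSONL file with pandas, strips the "ID_" prefix to recover the original document IDs, and averages the two human scores per model variant. The column names are assumed to match the table above; adjust them if the released file differs.

```python
import pandas as pd

# Load the JSONL release (for the .csv variant, use pd.read_csv instead).
df = pd.read_json(
    "robust_long_abstractive_human_annotation_dataset.jsonl", lines=True
)

# Recover the original arXiv / GovReport document IDs by removing the "ID_" prefix.
df["original_id"] = df["dataset_id"].str.replace("ID_", "", n=1, regex=False)

# Example aggregation: mean relevance and factual consistency per dataset and model variant.
print(
    df.groupby(["dataset", "model_type"])[["relevance", "factual consistency"]].mean()
)
```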
We are standardizing the data for detailed factual error types. Stay tuned!
For more information, please refer to our work: How Far are We from Robust Long Abstractive Summarization?
@inproceedings{koh-etal-2022-far,
title = "How Far are We from Robust Long Abstractive Summarization?",
author = "Koh, Huan Yee and Ju, Jiaxin and Zhang, He and Liu, Ming and Pan, Shirui",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.172",
pages = "2682--2698"
}