Raw dataset #20

Closed
astariul opened this issue Apr 7, 2020 · 5 comments

Comments

@astariul

astariul commented Apr 7, 2020

I saw that the dataset is available at http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz (link given by @shashiongithub).

I tried to use this dataset to reproduce the BART results, but I couldn't. According to one of the BART authors on this issue (fairseq repository), this is because I'm using an already-processed version of the dataset.


Is it possible to have a link to the raw dataset (no post-processing of any kind)?

@wonjininfo

Same here.
I tried to download the raw data using the script but hit an error (same as issue #12).
Since I use BART, I would appreciate access to the raw dataset.

@shashiongithub
Collaborator

Yes, there have been some issues with downloading the raw data using the script. Some of the Wayback URLs have expired.

The dataset at the URL mentioned at the top of this post is the same one that was used in our paper and also shared with the BART authors. Please only use entries that are provided here: https://github.com/EdinburghNLP/XSum/blob/master/XSum-Dataset/XSum-TRAINING-DEV-TEST-SPLIT-90-5-5.json
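
For readers following along, here is a minimal sketch of restricting an extracted copy of the archive to the entries listed in the split file. It assumes the split JSON maps each split name to a list of document IDs; that layout, and the helper names `load_split_ids` and `filter_by_split`, are illustrative assumptions and not something confirmed in this thread.

```python
import json


def load_split_ids(split_path):
    """Read the split file and return {split_name: set_of_ids}.

    Assumption: the JSON is an object mapping split names to lists of IDs.
    """
    with open(split_path, encoding="utf-8") as f:
        split = json.load(f)
    return {name: set(ids) for name, ids in split.items()}


def filter_by_split(available_ids, split_ids):
    """Keep only the IDs listed in the split file, grouped by split name."""
    available = set(available_ids)
    return {name: sorted(ids & available) for name, ids in split_ids.items()}
```

Any document ID found in the archive but absent from every split is simply dropped, which is one way to follow the "only use entries that are provided here" advice.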

@addf400

addf400 commented Apr 23, 2020

@shashiongithub @colanim

When I tried to get the XSum dataset from the URL you shared (http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz), I encountered the following error:
You don't have permission to access /public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz on this server.

Can you help me to access the raw dataset?
Many thanks!

@shashiongithub
Collaborator

Try replacing kinloch with bollin.
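
As a small sketch of that substitution, the host label can be swapped with the standard library's `urllib.parse`. The helper name `replace_host_label` is hypothetical, and the code only rewrites the URL string; it does not check that the resulting address is reachable.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host_label(url, old, new):
    """Return url with every host label equal to `old` replaced by `new`."""
    parts = urlsplit(url)
    labels = [new if label == old else label for label in parts.netloc.split(".")]
    return urlunsplit(parts._replace(netloc=".".join(labels)))


original = "http://kinloch.inf.ed.ac.uk/public/direct/XSUM-EMNLP18-Summary-Data-Original.tar.gz"
mirror = replace_host_label(original, "kinloch", "bollin")
print(mirror)
```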

@addf400

addf400 commented Apr 23, 2020

> Try replacing kinloch with bollin.

It works!
