
How to preprocess the dataset #14

Open
prateekverma1 opened this issue Dec 11, 2018 · 17 comments

@prateekverma1

Hi Shashi,

I am planning to run the code on my own data. How can I preprocess the dataset so that it can be used by the code (Refresh)?

Thanks,
Prateek

@shashiongithub
Collaborator

Send me an email and I will send you the demo code that I have not made public.

@prateekverma1
Author

Sure, just sent you the email. My email id is [email protected]

@prateekverma1
Author

Hi Shashi,

I am planning to train the model on our own dataset. To understand the required dataset format, I downloaded the data from the link mentioned on GitHub, and I see cnn.training.doc, cnn.training.title, cnn.training.image, cnn.training.label.multipleoracle and cnn.training.label.singleoracle.

It would be great if you could give some pointers about the multipleoracle training file and how I would go about generating it for my own dataset.

Also, is the title assumed to be the gold summary? I was confused, since the paper mentions that the highlights are the gold summary.

@shashiongithub
Collaborator

Check out my latest commit; I hope this helps. You may have to adapt file names, etc., for your dataset.

@prateekverma1
Author

prateekverma1 commented Dec 14, 2018

Sorry for asking so many questions.
I see that the estimate_multiple_oracle code takes as input files ending with .article and .abstract.

  1. Where can I get those datasets? The Refresh repo has the preprocessed dataset, and SideNet has http://kinloch.inf.ed.ac.uk/public/cnn-dm-sideinfo-data.zip, which is in a different format.

  2. Also, should I run the preprocessing step (the preprocessing done in the demo code for summarizing new text) before estimating the multiple oracles?

@shashiongithub
Collaborator

.article files contain the input documents (tokenized and sentence-segmented, one sentence per line).
.abstract files contain the gold summaries (tokenized and sentence-segmented, one sentence per line).

Try to adapt this to your dataset. I am not authorized to release the CNN/DM datasets in those formats.
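
As an illustration, here is a minimal sketch of producing .article/.abstract files in that layout, assuming NLTK for sentence splitting and tokenization; the file names, directories, and helper name are illustrative choices, not taken from the Refresh repo:

```python
# Minimal sketch: write .article / .abstract files, tokenized and
# sentence-segmented, one sentence per line. Assumes NLTK with the
# 'punkt' model downloaded; paths and file names are illustrative.
import nltk

def write_one_sentence_per_line(raw_text, out_path):
    with open(out_path, "w", encoding="utf-8") as f:
        for sent in nltk.sent_tokenize(raw_text):
            tokens = nltk.word_tokenize(sent)
            f.write(" ".join(tokens) + "\n")

doc_id = "example-001"  # hypothetical document id
with open("raw/%s.story.txt" % doc_id, encoding="utf-8") as f:
    write_one_sentence_per_line(f.read(), "articles/%s.article" % doc_id)
with open("raw/%s.summary.txt" % doc_id, encoding="utf-8") as f:
    write_one_sentence_per_line(f.read(), "abstracts/%s.abstract" % doc_id)
```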

@prateekverma1
Author

prateekverma1 commented Dec 24, 2018

Hi Shashi,

Thanks, I was able to make some progress. However, one problem I am facing is this:

Running the Stanford CoreNLP jar for each document is very expensive. I tried concatenating all the documents into one large string and passing that to CoreNLP, but there is a limit on the size of the text you can pass as an argument. So I ended up concatenating smaller batches of documents into one text blob and passing each batch to CoreNLP. That sort of works, but it feels hacky: I will end up with multiple training (.doc) and label.multipleoracle files, whereas your dataset contains just one .doc and one .multipleoracle file.

I wanted to ask whether I am missing something, or alternatively whether you could share something that would help.

@prateekverma1
Author

I ended up using the Stanford CoreNLP REST API (the CoreNLP server). That solved the issue above.
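
For anyone hitting the same issue, a minimal sketch of calling a locally running CoreNLP server for sentence splitting and tokenization; the port, annotator list, and helper name are assumptions, not something from the Refresh repo:

```python
# Minimal sketch: tokenize and sentence-split via a locally running
# Stanford CoreNLP server (started separately, e.g. on port 9000),
# instead of launching the CoreNLP jar once per document.
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumed server address
PROPS = {"annotators": "tokenize,ssplit", "outputFormat": "json"}

def split_and_tokenize(text):
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(PROPS)},
        data=text.encode("utf-8"),
    )
    resp.raise_for_status()
    annotation = resp.json()
    # One inner list of tokens per sentence.
    return [[tok["word"] for tok in sent["tokens"]]
            for sent in annotation["sentences"]]

for tokens in split_and_tokenize("Each document is sent as its own request. "
                                 "No concatenation hack is needed."):
    print(" ".join(tokens))
```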

@shashiongithub
Collaborator

I am glad you found a solution :)

@prateekverma1
Author

prateekverma1 commented Jan 2, 2019 via email

@liuyyy0866

> Hi Liuyue, Shashi can correct me if I am wrong. The .doc, .image, and .title files are used as the input to the summarization model. The .multipleoracle file is used for the label and the reward given to the model; it is the output generated by the estimate_multiple_oracle code from the abstract files (as seen from the line `highlights_dir = os.path.join(data_dir, 'abstracts')`). I am not sure about the .highlight files, as I could not find where they are used in the code. Thanks, Prateek
>
> On Tue, Jan 1, 2019 at 10:57 AM liuyue wrote: Dear Shashi and prateekverma1, could you please give me some hints about the input files? I have downloaded the preprocessed input data from the link. Do training.doc and training.highlight represent the raw text and the gold abstract, one sentence per line? And what do the other suffixes mean, such as image, label, and title? Thank you so much!

Hi Prateek,
Thanks for your reply!
But I have another question about ROUGE: when the first epoch finishes, something goes wrong (see the screenshot below). Do you have any clue or experience with this?
[screenshot of the ROUGE error]

Thank you very much!

@prateekverma1
Author

prateekverma1 commented Jan 7, 2019 via email

@cool425589

Hi Shashi,
Sorry for asking the same questions.
I have the same question about cnn.training.doc, cnn.training.title, cnn.training.image, cnn.training.label.multipleoracle and cnn.training.label.singleoracle.
I can't find code in scripts/oracle-estimator to generate .singleoracle, .multipleoracle, .title, etc.
It would be great if you could give some information about these files.

@shashiongithub
Collaborator

Check this for creating *.training.label.multipleoracle and *.training.label.singleoracle for your own datasets:

> Oracle Estimation
> Check our "scripts/oracle-estimator" to compute multiple oracles for your own dataset for training.

The title and image files can be ignored. They are not used; they are just remnants of the SideNet code. The doc file is the preprocessed input document file: documents are separated by an empty line, and within each document, each line is a tokenized sentence with a vocab id replacing each token. Check out the demo code: https://github.com/EdinburghNLP/Refresh/files/2846901/demo.tar.gz
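
A rough illustration of that .doc layout, assuming a simple token-to-id vocabulary dict; the unknown-token id and numbering conventions here are guesses, so the demo code should be treated as the reference:

```python
# Minimal sketch: write several documents into one .doc-style file,
# one tokenized sentence per line with tokens replaced by vocab ids,
# documents separated by an empty line. The unknown-token id and the
# id numbering are assumptions; check the demo code for the exact scheme.
UNK_ID = 0  # hypothetical id for out-of-vocabulary tokens

def write_doc_file(documents, vocab, out_path):
    """documents: list of documents, each a list of token lists (one per sentence)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in documents:
            for sentence in doc:
                ids = [str(vocab.get(token, UNK_ID)) for token in sentence]
                f.write(" ".join(ids) + "\n")
            f.write("\n")  # empty line separates documents

vocab = {"the": 4, "cat": 17, "sat": 23}            # toy vocabulary
docs = [[["the", "cat", "sat"]], [["the", "cat"]]]  # two tiny documents
write_doc_file(docs, vocab, "mydata.training.doc")
```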

@Limtle

Limtle commented Feb 15, 2019

Hi! Sorry, but I find that title and image are used in the code (see the screenshot below). Could you please give some information about the title and image files? Thank you very much!
[screenshot of the code that reads the title and image files]

@shashiongithub
Collaborator

Sorry, please leave them empty (they are remnants of old code). They are not used here.

@tlifcen

tlifcen commented Jul 20, 2019

Hi Shashi,
I would like to run your code on my own data. How can I preprocess my dataset so that it can be used by your code, i.e. how do I produce the .doc and .title files, just like the data in the path "Refresh-NAACL18-preprocessed-input-data"? @shashiongithub
Thanks for your reply.
