How to preprocess the dataset #14
Send me an email, I will send you the demo code that I have not made public. |
Sure, just sent you the email. My email id is [email protected] |
Hi Shashi, I am planning to train the model on our own dataset. To understand the required dataset format, I downloaded the dataset from the link given on GitHub. It contains cnn.training.doc, cnn.training.title, cnn.training.image, cnn.training.label.multipleoracle and cnn.training.label.singleoracle. Could you please give some pointers about the multipleoracle training file and how I would go about generating it for my own dataset? Also, is the title assumed to be the gold summary? I was confused, since the paper says the highlights are the gold summary. |
Check out my latest commit, hope this helps. You may have to adapt file names etc for your datasets. |
Sorry for asking so many questions.
|
articles are files with input documents (tokenized and sentence segmented, one sentence per line). Try to adapt this to your dataset. I am not authorized to release those formats for CNN/DM datasets. |
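A minimal sketch of the "tokenized and sentence segmented, one sentence per line" layout described above. The regex splitter is only a naive stand-in for a real tokenizer such as CoreNLP, and the function names are hypothetical; the exact CNN/DM file formats are not described in this thread, so only the per-document article layout is shown.

```python
import re


def to_sentence_lines(text):
    """Split raw text into sentences and tokens (naive stand-in for CoreNLP)."""
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Words and punctuation marks become separate, space-delimited tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if tokens:
            lines.append(" ".join(tokens))
    return lines


def write_article(path, text):
    """Write one document as one tokenized sentence per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(to_sentence_lines(text)) + "\n")
```

For example, `to_sentence_lines("The cat sat. The dog ran!")` gives `["The cat sat .", "The dog ran !"]`.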
Hi Shashi, Thanks, I was able to make some progress. However, one problem I am facing is that running Stanford CoreNLP (the jar) for each document is very expensive. I tried concatenating all the docs into one large string and passing it to CoreNLP, but there is a limit on the size of the text you can pass to CoreNLP as an argument. So I had to concatenate a smaller number of docs into one text blob and pass that to CoreNLP. That made it work, but it seems hacky, as I end up with multiple training files (.doc) and label.multipleoracle files, whereas your dataset contains just one .doc and one .multipleoracle file. Is there something I am missing, or would you be able to share something that could help me? |
I ended up using the Stanford CoreNLP REST API. That solved the issue above. |
I am glad you found a solution :) |
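For readers hitting the same bottleneck, here is a hedged sketch of the REST approach: start the CoreNLP server once, then send each document over HTTP instead of launching the jar per document. It assumes a server is already running at `http://localhost:9000` (an assumed address) and asks only for the `tokenize,ssplit` annotators; the helper names are hypothetical and this is not the code used in this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a CoreNLP server has been started locally on this port.
CORENLP_URL = "http://localhost:9000"


def annotate(text, url=CORENLP_URL):
    """POST raw text to the CoreNLP server and return the parsed JSON reply."""
    props = {"annotators": "tokenize,ssplit", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    # Passing data= makes urllib issue a POST request.
    request = urllib.request.Request(f"{url}/?{query}", data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def sentences_to_lines(annotation):
    """Turn a CoreNLP JSON annotation into one tokenized sentence per line."""
    return [
        " ".join(token["word"] for token in sentence["tokens"])
        for sentence in annotation["sentences"]
    ]
```

Each document can then be processed with `sentences_to_lines(annotate(doc_text))`, so the JVM start-up cost is paid once by the server rather than once per document, and no documents need to be concatenated.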
Hi Liuyue,
Shashi can correct me if I am wrong.
The .doc, .image and .title files are used as the input to the summarization model.
The .multipleoracle file is used for the labels and rewards given to the
model. It is the output generated by the estimate_multiple_oracle code
using the abstract files
(as seen from the line highlights_dir = os.path.join(data_dir, 'abstracts')).
I am not sure about the .highlight files, as I could not find where they
are used in the code.
Thanks,
Prateek
…On Tue, Jan 1, 2019 at 10:57 AM liuyue ***@***.***> wrote:
Dear Shashi and prateekverma1,
Could you please give me some hints about the input files? I have
downloaded the preprocessed-input-data from the link. Do training.doc and
training.highlight contain the raw text and the gold abstract, one
sentence per line? And what do the other suffixes mean, such as image,
label and title?
Thank you so much!
|
Hi Prateek, Thank you very much! |
Hi liuyue,
I see you are getting a WordNet exception (Cannot open db exception file for
reading).
The two links below show how to fix it:
http://kavita-ganesan.com/rouge-howto/#.XDOUr89KjUo
https://poojithansl7.wordpress.com/2018/08/04/setting-up-rouge/
Let me know if you have any further issues.
Thanks,
Prateek
…On Sun, Jan 6, 2019 at 10:55 AM liuyue ***@***.***> wrote:
Hi Prateek,
Thanks for your reply!
But I have another question about ROUGE: when the first epoch finished,
something went wrong. Do you have any clue or experience with this?
[image: image]
<https://user-images.githubusercontent.com/20330826/50738392-4a7c1a00-120e-11e9-8be5-e918747c30a8.png>
Thank you very much!
|
Hi Shashi, |
Check this for creating *.training.label.multipleoracle and *.training.label.singleoracle for your own datasets: Oracle Estimation
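To give a rough idea of what oracle estimation involves, here is a simplified sketch, not the repository's estimate_multiple_oracle code. It scores every combination of up to `max_sentences` document sentences against the abstract and keeps the `top_k` best. A set-based unigram F1 stands in for ROUGE, and the function names, parameters and defaults are all assumptions made for illustration. The on-disk format of the label files is not described in this thread, so it is not shown.

```python
from itertools import combinations


def unigram_f1(candidate_tokens, reference_tokens):
    """Set-based unigram F1, used here as a cheap stand-in for ROUGE."""
    candidate, reference = set(candidate_tokens), set(reference_tokens)
    overlap = len(candidate & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def estimate_oracles(doc_sentences, abstract, max_sentences=3, top_k=5):
    """Return the top_k (sentence_indices, score) pairs, best first."""
    reference = abstract.lower().split()
    tokenized = [sentence.lower().split() for sentence in doc_sentences]
    scored = []
    for size in range(1, max_sentences + 1):
        for combo in combinations(range(len(tokenized)), size):
            tokens = [tok for index in combo for tok in tokenized[index]]
            scored.append((combo, unigram_f1(tokens, reference)))
    # Highest-scoring sentence combinations come first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

On this reading (again an assumption), the best-scoring combination would play the role of a single oracle and the top-k list the role of multiple oracles. Exhaustive enumeration grows quickly with document length, so a real pipeline would likely restrict the candidate sentences first.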
|
Sorry, please leave them empty (they are remnants of old code). They are not used here. |
Hi Shashi, |
Hi Shashi,
I am planning to run the code on my own data. How can I preprocess the dataset so that it can be used by the code (Refresh)?
Thanks,
Prateek