diff --git a/README.md b/README.md
index 4de5d3d6..f54a5a39 100644
--- a/README.md
+++ b/README.md
@@ -1,12 +1,12 @@
 # PEGASUS library
 
-Pre-training with Extracted Gap-sentences for Abstractive SUmmarization
-Sequence-to-sequence models, or PEGASUS, uses self-supervised objective Gap
-Sentences Generation (GSG) to train a transformer encoder-decoder model. The
-paper can be found on [arXiv](https://arxiv.org/abs/1912.08777). ICML 2020 accepted.
+Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-sequence models, or PEGASUS, uses self-supervised objective Gap
+Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper can be found on [arXiv](https://arxiv.org/abs/1912.08777). ICML 2020 accepted.
 
 If you use this code or these models, please cite the following paper:
+
 ```
+
 @misc{zhang2019pegasus,
 title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
 author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
@@ -15,6 +15,7 @@ If you use this code or these models, please cite the following paper:
 archivePrefix={arXiv},
 primaryClass={cs.CL}
 }
+
 ```
 
 # Results update
 
@@ -37,26 +38,26 @@ We train a pegasus model with sampled gap sentence ratios on both C4 and HugeNew
 | billsum | 57.20/39.56/45.80 | 57.31/40.19/45.82 | 59.67/41.58/47.59|
 
 The "Mixed & Stochastic" model has the following changes:
-- trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
-- trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
-- the model uniformly sample a gap sentence ratio between 15% and 45%.
-- importance sentences are sampled using a 20% uniform noise to importance scores.
-- the sentencepiece tokenizer is updated to be able to encode newline character.
+- Trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
+- Trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
+- The model uniformly sample a gap sentence ratio between 15% and 45%.
+- Importance sentences are sampled using a 20% uniform noise to importance scores.
+- The sentencepiece tokenizer is updated to be able to encode newline character.
 
 (*) the numbers of wikihow and big_patent datasets are not comparable because of change in tokenization and data:
-- wikihow dataset contains newline characters which is useful for paragraph segmentation, the C4 and HugeNews model's sentencepiece tokenizer doesn't encode newline and loose this information.
-- we update the BigPatent dataset to preserve casing, some format cleanings are also changed, please refer to change in TFDS.
-
-
+- Wikihow dataset contains newline characters which is useful for paragraph segmentation, the C4 and HugeNews model's sentencepiece tokenizer doesn't encode newline and loose this information.
+- We update the BigPatent dataset to preserve casing, some format cleanings are also changed, please refer to change in TFDS.
 
 # Setup
 
-## create an instance on google cloud with GPU (optional)
+## Create an instance on google cloud with GPU (optional)
 
 Please create a project first and create an instance
 
 ```
+
 gcloud compute instances create \
+
 ${VM_NAME} \
 --zone=${ZONE} \
 --machine-type=n1-highmem-8 \
@@ -65,24 +66,28 @@ gcloud compute instances create \
 --image-project=ml-images \
 --image-family=tf-1-15 \
 --maintenance-policy TERMINATE --restart-on-failure
+
 ```
 
-## install library and dependencies
+## Install Library and Dependencies
 
-Clone library on github and install requirements.
+Clone library on GitHub and install requirements.
 
 ```
+
 git clone https://github.com/google-research/pegasus
 cd pegasus
 export PYTHONPATH=.
 pip3 install -r requirements.txt
+
 ```
 
-Download vocab, pretrained and fine-tuned checkpoints of all experiments from [Google Cloud](https://console.cloud.google.com/storage/browser/pegasus_ckpt).
+Download vocab, pre-trained and fine-tuned checkpoints of all experiments from [Google Cloud](https://console.cloud.google.com/storage/browser/pegasus_ckpt).
 
-Alternatively in terminal, follow the instruction and install [gsutil](https://cloud.google.com/storage/docs/gsutil_install). Then
+Alternatively, in terminal, follow the instruction and install [gsutil](https://cloud.google.com/storage/docs/gsutil_install). Then
 
 ```
+
 mkdir ckpt
 gsutil cp -r gs://pegasus_ckpt/ ckpt/
 
@@ -90,15 +95,17 @@ gsutil cp -r gs://pegasus_ckpt/ ckpt/
 
 # Finetuning on downstream datasets
 
-## on existing dataset
+## On existing dataset
 
 Finetune on an existing dataset `aeslc`.
 
 ```
+
 python3 pegasus/bin/train.py --params=aeslc_transformer \
 --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
 --train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
 --model_dir=ckpt/pegasus_ckpt/aeslc
+
 ```
 
 If you would like to finetune on a subset of dataset, please refer to the [example of input pattern](https://github.com/google-research/pegasus/blob/master/pegasus/data/datasets.py#L186).
@@ -106,20 +113,23 @@ If you would like to finetune on a subset of dataset, please refer to the [examp
 Evaluate on the finetuned dataset.
 
 ```
+
 python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
 --param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
 --model_dir=ckpt/pegasus_ckpt/aeslc
+
 ```
 
-Note that the above example is using a single GPU so the batch_size is much smaller
-than the results reported in the paper.
+Note that the above example is using a single GPU so the batch_size is much smaller than the results reported in the paper.
 
-## add new finetuning dataset
+## Add new finetuning dataset
 
 Two types of dataset format are supported:
 [TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets) or TFRecords.
 [This tutorial](https://www.tensorflow.org/datasets/add_dataset) shows how to add a new dataset in TFDS.
+
 (The fine-tuning dataset is expected to be supervised, please provide
+
 `supervised_keys` in dataset info).
 
 Tfrecords format requires each record to be a tf example of `{"inputs":tf.string, "targets":tf.string}`.
@@ -127,7 +137,8 @@ Tfrecords format requires each record to be a tf example of `{"inputs":tf.string
 For example, if you registered a TFDS dataset called `new_tfds_dataset` for training and evaluation, and have some files in tfrecord format called `new_dataset_files.tfrecord*` for test, they can be registered in `/pegasus/params/public_params.py`.
 
 ```
-@registry.register("new_params")
+
+@registry.register(“new_params”)
 def my_param(param_overrides):
   return public_params.transformer_params(
       {
@@ -140,40 +151,25 @@ def my_param(param_overrides):
           "learning_rate": 0.0001,
           "batch_size": 8,
       }, param_overrides)
+
 ```
 
 ## Evaluation metrics.
 
-Evaluation results can be found in `mode_dir`. Summarization metrics are automatically
-calculated for each evaluation point.
-
-- [ROUGE](https://www.aclweb.org/anthology/W04-1013.pdf) is the main metric
-  for summarization quality.
-
-- [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) is an alternative
-  quality metric for language generation.
-
-- [Extractive Fragments Coverage & Density](https://arxiv.org/pdf/1804.11283.pdf)
-  are metrics that measures the abstractiveness of the summary.
-
+Evaluation results can be found in `mode_dir`. Summarization metrics are automatically calculated for each evaluation point.
+- [ROUGE](https://www.aclweb.org/anthology/W04-1013.pdf) is the main metric for summarization quality.
+- [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) is an alternative quality metric for language generation.
+- [Extractive Fragments Coverage & Density](https://arxiv.org/pdf/1804.11283.pdf) are metrics that measure the abstractiveness of the summary.
 - Repetition Rates measures generation repetition failure modes.
-
-- Length statistics measures the length distribution of decodes comparing to gold summary.
-
+- Length statistics measure the length distribution of decodes comparing to gold summary.
 Several types of output files can be found in `model_dir`
-
-- text_metrics-*.txt: above metrics in text format. Each row contains metric
-  name, 95% lower bound value, mean value, 95% upper bound value.
-- inputs-*.txt, targets-*.txt, predictions-*.txt: raw text files of model
-  inputs/outputs.
-
-
+- text_metrics-*.txt: above metrics in text format. Each row contains metric name, 95% lower bound value, mean value, 95% upper bound value.
+- inputs-*.txt, targets-*.txt, predictions-*.txt: raw text files of model inputs/outputs.
 
 # Pre-training
 
 Pretraining (on C4 or any other corpus) requires a customly built tensorflow that includes ops for on-the-fly parsing that processes raw text document into model inputs and targets ids. Please refer to pegasus/ops/pretrain_parsing_ops.cc and pegasus/data/parsers.py for details.
 
 # Acknowledgements
 
-Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich .
-
+Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich .
\ No newline at end of file
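
The "Mixed & Stochastic" bullets in the README text of this diff say the model samples a gap sentence ratio uniformly between 15% and 45% and that important sentences are sampled with 20% uniform noise applied to their importance scores. The selection step might look roughly like the Python sketch below. It is only an illustration under stated assumptions: `select_gap_sentences` is a hypothetical name, not a function in the PEGASUS library. Treating the noise as multiplicative, rounding the sentence count, and always keeping at least one sentence are guesses, not details confirmed by the README. The real logic lives in the custom ops the README points to.

```python
import random


def select_gap_sentences(scores, rng, ratio_range=(0.15, 0.45), noise=0.2):
    """Return indices of sentences to treat as gap sentences.

    Hypothetical helper. Assumptions not confirmed by the README:
    - noise is multiplicative: score * (1 + U(-noise, +noise))
    - the sentence count is round(ratio * n), with a minimum of 1
    """
    if not scores:
        return []
    # Sample the gap sentence ratio uniformly from the given range.
    ratio = rng.uniform(*ratio_range)
    num_gaps = max(1, round(ratio * len(scores)))
    # Perturb each importance score with uniform noise.
    noisy = [s * (1.0 + rng.uniform(-noise, noise)) for s in scores]
    # Rank by noisy score, keep the top ones, return them in document order.
    ranked = sorted(range(len(scores)), key=lambda i: noisy[i], reverse=True)
    return sorted(ranked[:num_gaps])


if __name__ == "__main__":
    # Made-up importance scores for a 10-sentence document.
    example_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05]
    print(select_gap_sentences(example_scores, random.Random(0)))
```

Passing the random generator in explicitly keeps the sketch reproducible. Setting `noise=0` and a fixed ratio reduces it to plain top-k selection by importance score.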