# PEGASUS library

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization Sequence-to-sequence models) uses the self-supervised objective Gap Sentences Generation (GSG) to train a transformer encoder-decoder model. The paper, accepted at ICML 2020, can be found on [arXiv](https://arxiv.org/abs/1912.08777).
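
As a toy illustration (not the library's implementation), GSG removes selected "gap" sentences from the input document and trains the model to generate them as the target; the mask token and sentence choice below are illustrative only:

```
# Toy illustration of the GSG objective: selected sentences are replaced by a
# mask token in the input and concatenated to form the target.
MASK = "<mask_1>"

def gsg_pair(sentences, gap_indices):
    # Replace each selected sentence with the mask token in the input.
    inputs = " ".join(MASK if i in gap_indices else s for i, s in enumerate(sentences))
    # Concatenate the selected sentences to form the generation target.
    targets = " ".join(sentences[i] for i in sorted(gap_indices))
    return inputs, targets

inputs, targets = gsg_pair(
    ["PEGASUS is a summarization model.",
     "It masks whole sentences during pre-training.",
     "The decoder then reconstructs them."],
    gap_indices={1},
)
# inputs  -> "PEGASUS is a summarization model. <mask_1> The decoder then reconstructs them."
# targets -> "It masks whole sentences during pre-training."
```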

If you use this code or these models, please cite the following paper:

```
@misc{zhang2019pegasus,
    title={PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author={Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year={2019},
    eprint={1912.08777},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

# Results update

We train a PEGASUS model with sampled gap sentence ratios on both C4 and HugeNews, and stochastically sample important sentences. The updated results (ROUGE-1/ROUGE-2/ROUGE-L) are reported in this table.

| dataset | C4 | HugeNews | Mixed & Stochastic |
| ---- | ---- | ---- | ---- |
| billsum | 57.20/39.56/45.80 | 57.31/40.19/45.82 | 59.67/41.58/47.59 |

The "Mixed & Stochastic" model has the following changes:
- trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
- trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
- the model uniformly sample a gap sentence ratio between 15% and 45%.
- importance sentences are sampled using a 20% uniform noise to importance scores.
- the sentencepiece tokenizer is updated to be able to encode newline character.
- Trained on both C4 and HugeNews (dataset mixture is weighted by their number of examples).
- Trained for 1.5M instead of 500k (we observe slower convergence on pretraining perplexity).
- The model uniformly sample a gap sentence ratio between 15% and 45%.
- Importance sentences are sampled using a 20% uniform noise to importance scores.
- The sentencepiece tokenizer is updated to be able to encode newline character.
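
As a rough sketch of the last three bullets, the selection below uniformly samples a gap sentence ratio and perturbs precomputed importance scores; it is an illustration rather than the library's training code, and it reads "20% uniform noise" as additive noise of up to ±20% of each score, which is an assumption:

```
import random

def select_gap_sentences(importance_scores, noise=0.2):
    # Uniformly sample a gap sentence ratio between 15% and 45%.
    ratio = random.uniform(0.15, 0.45)
    k = max(1, round(len(importance_scores) * ratio))
    # Perturb each precomputed importance score with up to +/-20% uniform noise.
    noisy = [s + random.uniform(-noise, noise) * abs(s) for s in importance_scores]
    # Keep the indices of the k highest-scoring sentences as gap sentences.
    return sorted(range(len(noisy)), key=noisy.__getitem__, reverse=True)[:k]

# Example: importance scores for a six-sentence document.
print(select_gap_sentences([0.9, 0.1, 0.4, 0.7, 0.2, 0.5]))
```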


(*) The numbers for the wikihow and big_patent datasets are not comparable because of changes in tokenization and data:

- The wikihow dataset contains newline characters, which are useful for paragraph segmentation; the SentencePiece tokenizer of the C4 and HugeNews models doesn't encode newlines and loses this information.
- We updated the BigPatent dataset to preserve casing; some format cleanings were also changed. Please refer to the changes in TFDS.

# Setup

## Create an instance on Google Cloud with GPU (optional)

Please create a project first, then create an instance:

```
gcloud compute instances create \
${VM_NAME} \
--zone=${ZONE} \
--machine-type=n1-highmem-8 \
--accelerator type=nvidia-tesla-v100,count=1 \
--boot-disk-size=500GB \
--image-project=ml-images \
--image-family=tf-1-15 \
--maintenance-policy TERMINATE --restart-on-failure
```

## Install library and dependencies

Clone the library from GitHub and install the requirements.

```
git clone https://github.com/google-research/pegasus
cd pegasus
export PYTHONPATH=.
pip3 install -r requirements.txt
```

Download the vocab, pre-trained, and fine-tuned checkpoints of all experiments from [Google Cloud](https://console.cloud.google.com/storage/browser/pegasus_ckpt).

Alternatively, in a terminal, follow the instructions to install [gsutil](https://cloud.google.com/storage/docs/gsutil_install). Then:

```
mkdir ckpt
gsutil cp -r gs://pegasus_ckpt/ ckpt/
```

# Finetuning on downstream datasets

## On an existing dataset

Finetune on an existing dataset, `aeslc`:

```
python3 pegasus/bin/train.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model \
--train_init_checkpoint=ckpt/pegasus_ckpt/model.ckpt-1500000 \
--model_dir=ckpt/pegasus_ckpt/aeslc
```

If you would like to finetune on a subset of a dataset, please refer to the [example of an input pattern](https://github.com/google-research/pegasus/blob/master/pegasus/data/datasets.py#L186).

Evaluate on the finetuned dataset:

```
python3 pegasus/bin/evaluate.py --params=aeslc_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/aeslc
```

Note that the above example uses a single GPU, so the `batch_size` is much smaller than in the results reported in the paper.

## Add a new finetuning dataset

Two dataset formats are supported: [TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets) or TFRecords.

[This tutorial](https://www.tensorflow.org/datasets/add_dataset) shows how to add a new dataset in TFDS.

(The fine-tuning dataset is expected to be supervised; please provide `supervised_keys` in the dataset info, as in the sketch below.)
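
For orientation, a minimal TFDS builder skeleton that sets `supervised_keys` might look like the following; the class and feature names are placeholders rather than part of this repository:

```
import tensorflow_datasets as tfds

class NewTfdsDataset(tfds.core.GeneratorBasedBuilder):
  """Hypothetical supervised summarization dataset (placeholder names)."""

  VERSION = tfds.core.Version("1.0.0")

  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        description="Documents paired with reference summaries.",
        features=tfds.features.FeaturesDict({
            "document": tfds.features.Text(),
            "summary": tfds.features.Text(),
        }),
        # Marks the dataset as supervised: (input, target).
        supervised_keys=("document", "summary"),
    )

  def _split_generators(self, dl_manager):
    raise NotImplementedError  # Define train/validation/test splits here.

  def _generate_examples(self):
    raise NotImplementedError  # Yield (key, {"document": ..., "summary": ...}).
```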

The TFRecords format requires each record to be a TF Example of `{"inputs": tf.string, "targets": tf.string}`; a sketch of producing such records follows.
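
As a hedged sketch (the output file name and the document/summary pairs are placeholders), such records could be written with standard TensorFlow APIs:

```
import tensorflow as tf

def _bytes_feature(text):
  # Encode a Python string as a bytes feature.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[text.encode("utf-8")]))

with tf.io.TFRecordWriter("new_dataset_files.tfrecord-00000-of-00001") as writer:
  for inputs, targets in [("full document text ...", "reference summary ...")]:
    example = tf.train.Example(features=tf.train.Features(feature={
        "inputs": _bytes_feature(inputs),
        "targets": _bytes_feature(targets),
    }))
    writer.write(example.SerializeToString())
```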

For example, if you registered a TFDS dataset called `new_tfds_dataset` for training and evaluation, and have some files in TFRecord format called `new_dataset_files.tfrecord*` for test, they can be registered in `/pegasus/params/public_params.py`:

```
@registry.register("new_params")
def my_param(param_overrides):
  return public_params.transformer_params(
      {
          "train_pattern": "tfds:new_tfds_dataset,train",
          "dev_pattern": "tfds:new_tfds_dataset,validation",
          "test_pattern": "tfrecord:new_dataset_files.tfrecord*",
          "max_input_len": 512,
          "max_output_len": 32,
          "train_steps": 10000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)
```

## Evaluation metrics

Evaluation results can be found in `model_dir`. Summarization metrics are automatically calculated for each evaluation point.

- [ROUGE](https://www.aclweb.org/anthology/W04-1013.pdf) is the main metric for summarization quality.
- [BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) is an alternative quality metric for language generation.
- [Extractive Fragments Coverage & Density](https://arxiv.org/pdf/1804.11283.pdf) are metrics that measure the abstractiveness of the summary.
- Repetition Rates measure generation repetition failure modes.
- Length statistics measure the length distribution of decodes compared to the gold summary.

Several types of output files can be found in `model_dir`:

- text_metrics-*.txt: the above metrics in text format. Each row contains the metric name, 95% lower bound value, mean value, and 95% upper bound value (a reading sketch follows this list).
- inputs-*.txt, targets-*.txt, predictions-*.txt: raw text files of model inputs/outputs.
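
As a small convenience sketch, the text metrics can be read like this; it assumes comma-separated rows of `name,lower,mean,upper` and the `aeslc` model directory from the examples above, so adjust both to your setup:

```
import csv
import glob

# Assumed location and row format; adapt to your model_dir and actual files.
for path in glob.glob("ckpt/pegasus_ckpt/aeslc/text_metrics-*.txt"):
  with open(path) as f:
    for row in csv.reader(f):
      if len(row) == 4:  # metric name, 95% lower bound, mean, 95% upper bound
        name, lower, mean, upper = row
        print(f"{name}: mean={mean} (95% CI {lower}..{upper})")
```
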
# Pre-training

Pretraining (on C4 or any other corpus) requires a custom build of TensorFlow that includes ops for on-the-fly parsing, which process raw text documents into model input and target ids. Please refer to `pegasus/ops/pretrain_parsing_ops.cc` and `pegasus/data/parsers.py` for details.

# Acknowledgements

Contains parts of code and design for training and evaluation of summarization models originally by Ben Goodrich <[email protected]>.