This repository contains code for constructing the TL;DR corpus from the Reddit corpus, as described in "TL;DR: Mining Reddit to Learn Automatic Summarization" (New Frontiers in Summarization workshop, EMNLP 2017).
This code is intended to be run on the Spark framework so that it can work with the large Reddit dumps directly. It consists of two scripts:
make_tldr.py
- Reads the raw dumps and creates content-summary pairs in the form of a Spark DataFrame (a sketch of this step follows below).
clean_tldr.py
- Reads the output of the previous script and applies normalization steps to improve the precision of the final corpus (see the markdown-removal sketch further below).
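For illustration, here is a minimal PySpark sketch of what the extraction step does. The TL;DR regular expression, the helper names, and the output path are assumptions for the example rather than the exact implementation; the body, id, author, and subreddit fields follow the public Reddit dump schema.

```python
# Minimal sketch of the extraction step (illustrative only; the actual
# make_tldr.py may differ in regexes, fields, and output format).
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("make_tldr_sketch").getOrCreate()

# Reddit dumps are newline-delimited JSON; Spark infers the schema.
comments = spark.read.json("input-comments-path")

# Assumed pattern: split a post into content and summary at the TL;DR marker.
TLDR_RE = re.compile(r"tl\s*;?\s*dr", re.IGNORECASE)

def split_tldr(body):
    """Return (content, summary) if the text contains a TL;DR marker, else None."""
    if body is None:
        return None
    match = TLDR_RE.search(body)
    if match is None:
        return None
    content = body[:match.start()].strip()
    summary = body[match.end():].strip(" :.-\n")
    if content and summary:
        return (content, summary)
    return None

pair_type = StructType([
    StructField("content", StringType()),
    StructField("summary", StringType()),
])
split_udf = udf(split_tldr, pair_type)

pairs = (comments
         .withColumn("pair", split_udf(col("body")))
         .where(col("pair").isNotNull())
         .select("id", "author", "subreddit",
                 col("pair.content").alias("content"),
                 col("pair.summary").alias("summary")))

pairs.write.json("tldr-comments-raw")
```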
The resources folder contains an exhaustive list of Reddit bots, which we use to filter out automatic postings.
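As a sketch of how that filter can be applied (the file name resources/bots.txt and the author column are assumptions for the example):

```python
# Sketch of the bot filter: drop any pair whose author appears on the bot
# list (file name and column name are assumptions, not the exact code).
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col

spark = SparkSession.builder.appName("bot_filter_sketch").getOrCreate()

with open("resources/bots.txt") as f:
    bots = {line.strip().lower() for line in f if line.strip()}

comments = spark.read.json("tldr-comments-raw")
human_comments = comments.where(~lower(col("author")).isin(*bots))
```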
Run the extraction step with spark-submit:

```bash
spark-submit --master yarn make_tldr.py --input_comments input-comments-path --input_submissions input-submissions-path --output_comments tldr-comments-raw --output_submissions tldr-submissions-raw
```
We use the Mistune library to remove Markdown markup; since it is a single-file module, it should be shipped to the Spark executors with --py-files (a sketch of the markdown-removal step follows the command below):
```bash
spark-submit --master yarn --py-files /usr/local/lib/python3.5/dist-packages/mistune.py clean_tldr.py --input_comments tldr-comments-raw --input_submissions tldr-submissions-raw --output_comments tldr-comments-cleaned --output_submissions tldr-submissions-cleaned
```
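A minimal sketch of the markdown-removal idea, assuming the 0.x single-file Mistune API: rendering Markdown to HTML and then stripping the tags is one plausible approach, and the actual normalization in clean_tldr.py may differ.

```python
# Sketch of markdown removal: render with mistune, then strip HTML tags
# (one possible approach, not necessarily the repository's exact method).
import html
import re
import mistune

TAG_RE = re.compile(r"<[^>]+>")

def strip_markdown(text):
    """Render Markdown to HTML, then drop the tags to recover plain text."""
    rendered = mistune.markdown(text)  # mistune 0.x single-file API
    return html.unescape(TAG_RE.sub("", rendered)).strip()

print(strip_markdown("Some **bold** text with a [link](http://example.com)."))
# -> "Some bold text with a link."
```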
The current version of the corpus can be found on Zenodo.