Cornell Conversational Analysis Toolkit (ConvoKit)

This toolkit contains tools to extract conversational features and analyze social phenomena in conversations, using a single unified interface inspired by (and compatible with) scikit-learn. Several large conversational datasets are included together with scripts exemplifying the use of the toolkit on these datasets. The latest version is 2.0.11 (released 08 Sep 2019).
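To make the "unified interface" idea concrete, here is a minimal sketch of the scikit-learn-style fit/transform pattern. It is a toy: the class name `UtteranceLengthAnnotator` and the list-of-dicts corpus are hypothetical stand-ins invented for illustration, not ConvoKit's own classes or data model.

```python
class UtteranceLengthAnnotator:
    """Toy transformer (hypothetical, not part of ConvoKit).

    fit() learns the mean utterance length in words; transform() annotates
    each utterance with its length relative to that mean.
    """

    def fit(self, corpus):
        lengths = [len(utt["text"].split()) for utt in corpus]
        self.mean_length_ = sum(lengths) / len(lengths)
        return self  # returning self allows chaining, as in scikit-learn

    def transform(self, corpus):
        for utt in corpus:
            n_words = len(utt["text"].split())
            utt.setdefault("meta", {})["relative_length"] = n_words / self.mean_length_
        return corpus

    def fit_transform(self, corpus):
        return self.fit(corpus).transform(corpus)


# A plain list of dicts stands in for a real corpus object here.
toy_corpus = [
    {"id": "u0", "text": "How are you doing today"},
    {"id": "u1", "text": "Fine"},
]
annotated = UtteranceLengthAnnotator().fit_transform(toy_corpus)
```

The point of the pattern is that every analysis step exposes the same small set of methods, so steps can be swapped or chained without changing the surrounding code.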

The toolkit currently implements features for:

Linguistic coordination (API)

A measure of linguistic influence (and relative power) between individuals or groups based on their use of function words.
Example: exploring the balance of power in the U.S. Supreme Court.

Politeness strategies (API)

A set of lexical and parse-based features correlating with politeness and impoliteness.
Example: understanding the (mis)use of politeness strategies in conversations gone awry on Wikipedia.

Conversational prompts (API)

An unsupervised method for extracting surface motifs that occur in conversations and grouping them by rhetorical role.
Examples: extracting common question types in U.K. parliament, understanding the use of conversational prompts in conversations gone awry on Wikipedia.

Hypergraph conversation representation (API)

A method for extracting structural features of conversations through a hypergraph representation.
Example: hypergraph creation and feature extraction, visualization and interpretation on a subsample of Reddit.

Linguistic diversity in conversations (Coming Soon!)

A method to compute the linguistic diversity of individuals within their own conversations, and between other individuals in a population.

CRAFT: Online forecasting of conversational outcomes (Coming Soon!)

A neural model for forecasting future outcomes of conversations (e.g., derailment into personal attacks) as they develop.

Datasets

ConvoKit ships with several datasets ready for use out of the box. These datasets can be downloaded using the convokit.download() helper function. Alternatively, you can access them directly here.
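As a rough sketch of how the download helper might be used, the snippet below collects the fixed download names listed in this README and wraps the load step in a small function. The names `FIXED_CORPUS_NAMES` and `load_corpus` are hypothetical, and the `Corpus(filename=download(name))` call assumes the 2.0-era API; check the documentation for your installed version.

```python
# Fixed download names as listed in this README. Per-subreddit and per-year
# corpora are omitted because their names are built from a pattern.
FIXED_CORPUS_NAMES = (
    "conversations-gone-awry-corpus",
    "conversations-gone-awry-cmv-corpus",
    "movie-corpus",
    "parliament-corpus",
    "supreme-corpus",
    "wiki-corpus",
    "tennis-corpus",
    "reddit-corpus-small",
    "chromium-corpus",
)


def load_corpus(name):
    """Download a dataset by name (if needed) and load it as a Corpus.

    Assumption: download() returns a local path that Corpus accepts through
    its `filename` argument, as in 2.0-era ConvoKit releases.
    """
    # Imported lazily so this module can be read and tested without ConvoKit.
    from convokit import Corpus, download

    return Corpus(filename=download(name))
```

With ConvoKit installed, `load_corpus("movie-corpus")` would fetch the Cornell Movie-Dialogs Corpus on first use; the download can be large, so it is deliberately not triggered at import time.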

Conversations Gone Awry Dataset

Two related corpora of conversations that derail into antisocial behavior. One corpus consists of Wikipedia talk page conversations that derail into personal attacks, as labeled by crowdworkers (4,188 conversations containing 30,021 comments). The other consists of discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior, as determined by the presence of a moderator intervention (6,842 conversations containing 42,964 comments).
Name for download: conversations-gone-awry-corpus (Wikipedia version) or conversations-gone-awry-cmv-corpus (Reddit CMV version)

Cornell Movie-Dialogs Corpus

A large metadata-rich collection of fictional conversations extracted from raw movie scripts (220,579 conversational exchanges between 10,292 pairs of movie characters in 617 movies).
Name for download: movie-corpus

Parliament Question Time Corpus

Parliamentary question periods from May 1979 to December 2016 (216,894 question-answer pairs).
Name for download: parliament-corpus

Supreme Court Corpus

A collection of conversations from the U.S. Supreme Court Oral Arguments.
Name for download: supreme-corpus

Wikipedia Talk Pages Corpus

A medium-size collection of conversations from Wikipedia editors' talk pages.
Name for download: wiki-corpus

Tennis Interviews

Transcripts of tennis singles post-match press conferences at major tournaments from 2007 to 2015 (6,467 post-match press conferences).
Name for download: tennis-corpus

Reddit Corpus

Reddit conversations from over 900k subreddits, arranged by subreddit. A small subset sampled from 100 highly active subreddits is also available.

Name for download: subreddit-<name_of_subreddit> for the by-subreddit data, reddit-corpus-small for the small subset.

Wikiconv Corpus (WIP)

The full corpus of Wikipedia talk page conversations, based on the reconstruction described in this paper. Note that due to the large size of the data, it is split up by year. We are currently working on implementing, as part of the corpus metadata, block data retrieved directly from the Wikipedia block log, for reproducing the Trajectories of Blocked Community Members paper. In the meantime, raw block data can be downloaded here.

Name for download: wikiconv-<year> to download wikiconv data for the specified year.
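The two pattern-based names (subreddit-<name_of_subreddit> and wikiconv-<year>) can be built with small helpers like the ones sketched below. The function names are hypothetical, and the input checks (stripping a leading "r/", requiring a four-digit year) are assumptions made for illustration rather than rules stated by ConvoKit.

```python
import re


def subreddit_corpus_name(subreddit):
    """Build the download name for a per-subreddit corpus."""
    name = subreddit.strip()
    if name.lower().startswith("r/"):  # tolerate "r/Cornell"-style input
        name = name[2:]
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("unexpected subreddit name: {!r}".format(subreddit))
    return "subreddit-" + name


def wikiconv_corpus_name(year):
    """Build the download name for one year of the Wikiconv corpus."""
    year = int(year)
    if not 1000 <= year <= 9999:
        raise ValueError("expected a four-digit year, got {!r}".format(year))
    return "wikiconv-{}".format(year)
```

The resulting strings are what would be passed to convokit.download(). Whether a given subreddit or year is actually available is not checked here.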

Chromium Conversations Corpus

A collection of almost 1.5 million conversations and 2.8 million comments posted by developers reviewing proposed code changes in the Chromium project.

Name for download: chromium-corpus

...And your own corpus!

In addition to the provided datasets, you may also use ConvoKit with your own custom datasets by loading them into a convokit.Corpus object. This example script shows how to construct a Corpus from custom data.
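As a hedged illustration of what loading custom data can involve, the sketch below first normalizes raw rows into plain records and then hands them to ConvoKit. The row layout, `to_records`, and `to_corpus` are hypothetical, and the `User`/`Utterance` keyword names (`id`, `user`, `root`, `reply_to`, `text`) are assumptions based on 2.0-era releases; later releases may use different names, so check the documentation for your installed version.

```python
# Hypothetical raw data: (utterance id, user name, id replied to, text).
raw_rows = [
    ("u0", "alice", None, "Has anyone tried the new parser?"),
    ("u1", "bob", "u0", "Yes, it worked for me."),
    ("u2", "alice", "u1", "Great, thanks!"),
]


def to_records(rows):
    """Attach a conversation root to each row by following reply_to links."""
    parent = {utt_id: reply_to for utt_id, _, reply_to, _ in rows}
    records = []
    for utt_id, user, reply_to, text in rows:
        root = utt_id
        while parent[root] is not None:  # walk up to the first utterance
            root = parent[root]
        records.append({"id": utt_id, "user": user, "root": root,
                        "reply_to": reply_to, "text": text})
    return records


def to_corpus(records):
    """Build a convokit.Corpus from records (assumed 2.0-era API)."""
    # Imported lazily so the record-building step works without ConvoKit.
    from convokit import Corpus, User, Utterance

    users = {}
    utterances = []
    for rec in records:
        if rec["user"] not in users:
            users[rec["user"]] = User(name=rec["user"])
        utterances.append(Utterance(id=rec["id"], user=users[rec["user"]],
                                    root=rec["root"], reply_to=rec["reply_to"],
                                    text=rec["text"]))
    return Corpus(utterances=utterances)


records = to_records(raw_rows)
```

Keeping the normalization step separate from the ConvoKit calls makes it easy to inspect the reply structure before building the Corpus.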

Installation

This toolkit requires Python >= 3.6.

Download the toolkit: pip3 install convokit
Download Spacy's English model: python3 -m spacy download en
Download NLTK's 'punkt' model: import nltk; nltk.download('punkt') (in Python interpreter)
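After running the steps above, a quick sanity check along the following lines can confirm that the interpreter meets the version requirement and that the three packages are importable. The function name `check_environment` is hypothetical, and the check only looks for the top-level modules; it does not verify that the Spacy or NLTK models themselves were downloaded.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 6)
REQUIRED_MODULES = ("convokit", "spacy", "nltk")


def check_environment(version_info=sys.version_info, modules=REQUIRED_MODULES):
    """Return a dict of pass/fail flags for the installation requirements."""
    report = {"python_ok": tuple(version_info[:2]) >= MIN_PYTHON}
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        report[module] = find_spec(module) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_environment().items():
        print("{:<10} {}".format(item, "ok" if ok else "MISSING"))
```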

Alternatively, visit our Github Page to install from source.

Documentation

Documentation is hosted here. If you are new to ConvoKit, great places to get started are the Core Concepts tutorial, for an overview of the ConvoKit "philosophy" and object model, and the High-level tutorial, for a walkthrough of how to import ConvoKit into your project, load a Corpus, and use ConvoKit functions.

Contributing

We welcome community contributions. To see how you can help out, check the contribution guidelines.

Citing

If you use the code or datasets distributed with ConvoKit, please acknowledge the work tied to the respective component (indicated in the documentation) in addition to:

Jonathan P. Chang, Caleb Chiam, Liye Fu, Andrew Wang, Justine Zhang, Cristian Danescu-Niculescu-Mizil. 2019. "ConvoKit: The Cornell Conversational Analysis Toolkit" Retrieved from http://convokit.cornell.edu
