DataPrep lets you prepare your data using a single library with a few lines of code.
Currently, you can use DataPrep to:
- Collect data from common data sources (through dataprep.connector)
- Do your exploratory data analysis (through dataprep.eda)
- Clean and standardize data (through dataprep.clean)
- ...more modules are coming
pip install -U dataprep
DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.
You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:
- 10X Faster: DataPrep.EDA can be 10X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
- Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
- Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.
The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()
Click here to see the generated report of the above code.
Click here to see the benchmark result.
DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.
- Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularity with a single function call. All needed visualizations will be automatically and intelligently generated for you.
- Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
- How-to Guide: A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.
Click here to check all the supported tasks.
Check plot, plot_correlation, plot_missing and create_report to see how each function works.
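As a rough sketch of what "different granularity" means in practice, the calls below go from a whole-DataFrame overview down to a pair of columns. The wrapper function explore and its default column names Age and Fare are assumptions made for illustration (substitute columns from your own DataFrame); plot, plot_correlation and plot_missing are the functions named above.

```python
def explore(df, col="Age", other="Fare"):
    """Run a few EDA tasks at increasing levels of detail.

    `col` and `other` are assumed column names; change them to match your data.
    """
    # Imported lazily so this module still loads where dataprep is not installed.
    from dataprep.eda import plot, plot_correlation, plot_missing

    return [
        plot(df),              # overview of every column
        plot(df, col),         # one column in detail
        plot(df, col, other),  # relationship between two columns
        plot_correlation(df),  # correlations between numerical columns
        plot_missing(df),      # missing-value patterns
    ]
```

In a notebook, calling explore(df) on the titanic DataFrame loaded earlier should return one result per task, each of which can be displayed on its own.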
DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides:
- A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
- Speed: the computations are parallelized using Dask. It can clean 50K rows per second on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
- Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning.
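The throughput figure above translates into a simple back-of-the-envelope estimate. The helper name estimated_seconds is hypothetical, and real timings will depend on hardware and on the cleaning function used.

```python
ROWS_PER_SECOND = 50_000  # throughput quoted above for a dual-core laptop


def estimated_seconds(n_rows, rows_per_second=ROWS_PER_SECOND):
    """Estimate cleaning time by dividing the row count by the throughput."""
    return n_rows / rows_per_second


print(estimated_seconds(1_000_000))  # 20.0
```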
The following example shows how to clean and standardize a column of country names.
from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
country country_clean
0 USA United States
1 country: Canada Canada
2 233 Estonia
3 tr Turkey
4 NA NaN
Type validation is also supported:
from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0 True
1 False
2 True
3 True
4 False
Name: country, dtype: bool
Check clean_headers, clean_country, clean_date, clean_duplication, clean_email, clean_lat_long, clean_ip, clean_phone, clean_text, clean_url, clean_address and clean_df to see how each function works.
Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.
Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.
Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!
Let's check out the several benefits that Connector offers:
- A unified API: You can fetch data using one or two lines of code to get data from tens of popular websites.
- Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
- Speed: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.
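from dataprep.connector import connect
# Top-level await works in Jupyter/IPython; in a plain script, put the
# query inside an async function and run it with asyncio.run().
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)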
Here, you can find detailed Examples.
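To give a feel for what auto pagination and bounded concurrency involve, here is a minimal standard-library sketch. It is not Connector's implementation: fetch_page, fetch_all and PAGE_SIZE are hypothetical names, the "request" is simulated, and rate limiting is not modeled.

```python
import asyncio

PAGE_SIZE = 100  # hypothetical per-request limit imposed by the remote API


async def fetch_page(offset, limit):
    """Stand-in for one HTTP request; returns fake record ids."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return list(range(offset, offset + limit))


async def fetch_all(count, concurrency):
    """Fetch `count` records in pages, with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(offset, limit):
        async with semaphore:
            return await fetch_page(offset, limit)

    # Split the requested count into page-sized (offset, limit) pairs.
    pages = [
        (start, min(PAGE_SIZE, count - start))
        for start in range(0, count, PAGE_SIZE)
    ]
    results = await asyncio.gather(*(bounded(o, n) for o, n in pages))
    # gather keeps the order of its arguments, so records stay in page order.
    return [record for page in results for record in page]


records = asyncio.run(fetch_all(count=250, concurrency=5))
print(len(records))  # 250
```

The caller only states how many records it wants and how many requests may run at once, which is the same division of labor that the _count and _concurrency arguments offer.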
Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.
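The sketch below illustrates the general idea of driving requests from a declarative description. The field names in EXAMPLE_CONFIG, the example.com URL and the build_request helper are all hypothetical and do not reflect Connector's actual configuration schema; consult the Connector documentation for the real format.

```python
from urllib.parse import urlencode

# Hypothetical configuration: field names are illustrative only.
EXAMPLE_CONFIG = {
    "url": "https://api.example.com/v1/items",
    "params": {"q": "required", "lang": "optional"},
    "authorization": {"type": "Bearer"},
    "pagination": {
        "limit_key": "limit",
        "offset_key": "offset",
        "max_count": 50,
    },
}


def build_request(config, token, offset, **query):
    """Turn a config plus user arguments into a URL and headers."""
    missing = [
        name
        for name, rule in config["params"].items()
        if rule == "required" and name not in query
    ]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    unknown = [name for name in query if name not in config["params"]]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")

    page = config["pagination"]
    params = dict(query)
    params[page["limit_key"]] = page["max_count"]
    params[page["offset_key"]] = offset
    headers = {"Authorization": f"{config['authorization']['type']} {token}"}
    return f"{config['url']}?{urlencode(params)}", headers


url, headers = build_request(EXAMPLE_CONFIG, token="TOKEN", offset=100, q="dask")
print(url)  # https://api.example.com/v1/items?q=dask&limit=50&offset=100
```

Keeping the URL, parameters, authorization and pagination in data rather than code is what lets a new API be supported without writing a new client.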
The following documentation can give you an impression of what DataPrep can do:
There are many ways to contribute to DataPrep.
- Submit bugs and help us verify fixes as they are checked in.
- Review the source code changes.
- Engage with other DataPrep users and developers on StackOverflow.
- Help each other in the DataPrep Community Discord and Forum.
- Contribute bug fixes.
- Provide use cases and write down your user experience.
Please take a look at our wiki for development documentation!
Some functionalities of DataPrep are inspired by the following packages.
- Inspired the report functionality and insights provided in dataprep.eda.
- Inspired the missing value analysis in dataprep.eda.
If you use DataPrep, please consider citing the following paper:
Jinglin Peng, Weiyuan Wu, Brandon Lockhart, Song Bian, Jing Nathan Yan, Linghao Xu, Zhixuan Chi, Jeffrey M. Rzeszotarski, and Jiannan Wang. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021.
BibTeX entry:
@inproceedings{dataprepeda2021,
author = {Jinglin Peng and Weiyuan Wu and Brandon Lockhart and Song Bian and Jing Nathan Yan and Linghao Xu and Zhixuan Chi and Jeffrey M. Rzeszotarski and Jiannan Wang},
title = {DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python},
booktitle = {Proceedings of the 2021 International Conference on Management of Data (SIGMOD '21), June 20--25, 2021, Virtual Event, China},
year = {2021}
}