
In this project, I first visualize the data in 3 groups to better understand the dataset (see data_visualization). Second, I use the hazm library for text processing on the Persian dataset (see the DataCleaning file). Finally, I implement the k-means algorithm for clustering.


mohadesehjm/Preprocess_and_Clustering_on_news_datasets


Extract data

First, I extracted the data from ferdowsi-data.txt, which is 14 GB in size. Because the file is so large, it had to be read line by line. After extracting the data and building a data frame, I saved the results in Data.csv.
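As a rough illustration of line-by-line extraction, here is a minimal sketch using only the Python standard library. It assumes one JSON object per line and uses hypothetical column names (`title`, `category`, `text`); the real layout of ferdowsi-data.txt is not described here, so `parse_line` would need to be adapted. It also writes rows straight to CSV rather than building a data frame first, which keeps memory use flat.

```python
import csv
import json

# Hypothetical column names; the actual fields in the raw file are an assumption.
FIELDS = ["title", "category", "text"]


def parse_line(line):
    """Parse one raw line into a row dict, or return None to skip it.

    Assumes one JSON object per line; the real file format may differ.
    """
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    return {field: record.get(field, "") for field in FIELDS}


def extract(src_path, dst_path):
    """Stream src_path line by line and write the parsed rows to dst_path."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:  # the file object yields one line at a time
            row = parse_line(line)
            if row is not None:
                writer.writerow(row)
                written += 1
    return written
```

Under these assumptions, `extract("ferdowsi-data.txt", "Data.csv")` would process the whole file without loading it into memory.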

Preprocess on news dataset

In this project, I first visualize the data in 3 groups to better understand the dataset (see data_visualization). Second, I use the hazm library for text processing on the Persian dataset (see the DataCleaning file).
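The exact cleaning steps in DataCleaning are not listed here, so the following is only a sketch of one plausible pipeline: normalize, tokenize, then drop stopwords and tokens made of punctuation or digits. `clean_text` and `make_hazm_cleaner` are hypothetical helper names; the hazm calls used (`Normalizer`, `word_tokenize`, `stopwords_list`) are part of the library, but whether the project uses them in this way is an assumption.

```python
import re

# Tokens made up only of punctuation, digits or underscores (assumed to be noise).
_NON_WORD = re.compile(r"^[\W\d_]+$")


def clean_text(text, normalize, tokenize, stopwords):
    """Normalize and tokenize text, then drop stopwords and non-word tokens."""
    tokens = tokenize(normalize(text))
    return [t for t in tokens if t not in stopwords and not _NON_WORD.match(t)]


def make_hazm_cleaner():
    """Build a cleaner backed by hazm (requires `pip install hazm`)."""
    # Imported lazily so clean_text stays usable without hazm installed.
    from hazm import Normalizer, word_tokenize, stopwords_list

    normalizer = Normalizer()
    stopwords = set(stopwords_list())
    return lambda text: clean_text(
        text, normalizer.normalize, word_tokenize, stopwords
    )
```

With hazm installed, `cleaner = make_hazm_cleaner()` followed by `cleaner(news_text)` would return a list of cleaned tokens for one news item.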

Clustering on news dataset

In this section, I cluster 1000 news texts in three ways:

• kmeans + BOW (bag of words)

• kmeans + tf-idf

• kmeans + fasttext

kmeans + BOW and kmeans + tf-idf are located in Clustering; kmeans + fasttext is located in Clustering_fasttext.
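To show how the first two variants differ, here is a small standard-library sketch of bag-of-words counts, tf-idf reweighting, and a basic k-means loop. It is not the code from Clustering: the function names are hypothetical, the tf-idf formula (smoothed IDF with L2 normalization) is one common choice, and in practice a library such as scikit-learn would typically be used instead. For the fastText variant, the document vectors would come from word embeddings rather than counts, while the k-means step stays the same.

```python
import math
import random
from collections import Counter


def bow_vectors(docs):
    """Turn a list of token lists into (vocabulary, count vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = float(count)
        vectors.append(vec)
    return vocab, vectors


def tfidf_vectors(bow):
    """Reweight count vectors by smoothed IDF and L2-normalize each row."""
    n_docs = len(bow)
    n_terms = len(bow[0]) if bow else 0
    df = [sum(1 for vec in bow if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for vec in bow:
        weighted = [c * w for c, w in zip(vec, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        out.append([x / norm for x in weighted])
    return out


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=100, seed=0):
    """Basic k-means: random initial centroids, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [None] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Given tokenized documents `docs`, `kmeans(bow_vectors(docs)[1], k)` clusters on raw counts, while `kmeans(tfidf_vectors(bow_vectors(docs)[1]), k)` clusters on tf-idf weights; the number of clusters `k` is a free parameter here.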
