GitHub - Data-Mining-AI-Paper/DATA_MINING_AI_PAPER: Analysis Challenges in NLP Papers: BOLT

2023 FALL Data Mining(SCE3313, F074) Project

REPORT PAPER - For Details, Read the Paper.

🚩 Table of Contents

Project summary
Project structure
Requirements
Methods
Results
License

📝 Project summary

Analysis Challenges in NLP Papers: BOLT - Beyond Obstacles, Leap Together

This project addresses the difficulty of accessing ACL papers in natural language processing research and proposes solutions.
We applied TF-IDF, SVD, and K-means clustering to derive insights from a dataset of 12,745 papers.
Using data crawled from ACL papers, we produced the following three outputs:
1. Keyword trend analysis with graphs
2. Word cloud by year
3. Research topic trend with clusters by year

Team members

Dept	Name
software	Kyunghyun Min
software	Jongho Baik
software	Junseo Lee

🏗️ Project structure

Details

output/k-means: Contains the results of k-means++ clustering, the cluster labels, and information about the values of k.
output/wordcloud: Contains word clouds by year, from 1979 to 2023.
1. Crawling ACL.ipynb and 2. preprocess.py: Crawl papers and preprocess the data.
3. k-mean_clustering_word2vect.py: Builds clusters with k-means++.
4. keyword_trend.py: Plots how the importance of keywords changes by year.
5. wordcloud_by_year.py: Renders the important keywords of each year as word clouds.
6. topic_trend.py: Labels the clusters produced by 3. k-mean_clustering_word2vect.py.

⚙️ Requirements

Hardware Configuration

Compute Engine VM N1 instance
- Custom configuration:
  - 10 vCPU
  - 23.10
  - 65 GB RAM
  - 200 GB storage

Software Configuration

3.11.5
- IPython: 8.15.0
- ipykernel: 6.25.0
- ipywidgets: 8.0.4
- jupyter_client: 7.4.9
- jupyter_core: 5.3.0
- jupyter_server: 1.23.4
- jupyterlab: 3.6.3
- nbclient: 0.5.13
- nbconvert: 6.5.4
- nbformat: 5.9.2
- notebook: 6.5.4
- qtconsole: 5.4.2
- traitlets: 5.7.1

Additional Libraries

To ensure consistency in package versions, the following additional libraries are used:

: 3.5.2
: 1.23.1
: 1.3.0
: 3.8.1
: 0.10.1.2
: 1.9.2

🔨 Methods

Crawling ACL Paper Data

Approach: Initially utilized DBLP, but switched to direct use of the ACL site.
Extraction: Retrieved papers via DOIs, totaling 10,293.
API Usage: Queried the Semantic Scholar API in chunks of 500 DOIs, later transitioning to individual
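
The chunking step can be sketched as follows. This is a minimal illustration, not the project's crawler: the function names `chunked` and `build_batch_payload` are hypothetical, the DOIs are made up, and the `"DOI:"` prefix with an `"ids"` list follows the ID format of the Semantic Scholar batch endpoint as an assumption about how requests were formed. No network call is made.

```python
from typing import Iterator

BATCH_SIZE = 500  # chunk size stated in the README


def chunked(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payload(dois: list[str]) -> dict[str, list[str]]:
    """Format one chunk as a JSON-style request body (assumed layout)."""
    return {"ids": [f"DOI:{doi}" for doi in dois]}


# Made-up DOIs, for illustration only.
dois = [f"10.0000/example.{i}" for i in range(1234)]
payloads = [build_batch_payload(chunk) for chunk in chunked(dois)]
batch_sizes = [len(payload["ids"]) for payload in payloads]
print(batch_sizes)
```

Each payload would then be sent as one request; falling back to one DOI per request corresponds to calling `chunked(dois, size=1)`.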

Data Preprocessing

Removing Poor Abstracts: Excluded abstracts <100 characters (13 instances).
Selecting Central Analytic Fields: Focused on 'title', 'abstract', and 'year'.
Issues for TF-IDF Processing: Removed URLs and non-alphabetic characters from abstracts, then applied lemmatization.
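
A minimal sketch of these preprocessing steps, using only the standard library. The regular expressions, the lowercasing step, and the names `clean_abstract` and `keep_paper` are assumptions for illustration. The lemmatizer is passed in as a function because the README does not say which one was used; the default leaves tokens unchanged.

```python
import re
from typing import Callable

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
MIN_ABSTRACT_CHARS = 100  # abstracts shorter than this are dropped


def keep_paper(paper: dict) -> bool:
    """Return True if the paper's abstract has at least 100 characters."""
    return len(paper.get("abstract") or "") >= MIN_ABSTRACT_CHARS


def clean_abstract(text: str, lemmatize: Callable[[str], str] = lambda w: w) -> str:
    """Strip URLs and non-alphabetic characters, then lemmatize each token."""
    text = URL_RE.sub(" ", text.lower())  # lowercasing is an assumption
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(lemmatize(token) for token in text.split())


cleaned = clean_abstract("We use BERT-2 (see https://example.org/x) for parsing!")
print(cleaned)
```

A real lemmatizer can be plugged in by passing any `str -> str` function as `lemmatize`.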

TF-IDF

Processing: Used scikit-learn's TfidfVectorizer, producing a sparse matrix of 12,732 papers by 17,054 features.
Thresholding: Chose a threshold of 0.17 to represent approximately the top 15% of TF-IDF values.
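
One way a cut-off like this could be derived is to rank the non-zero TF-IDF values and read off the score at the top-15% boundary. The sketch below assumes that approach; the function name `top_fraction_threshold` and the score list are illustrative and are not the project's data.

```python
def top_fraction_threshold(scores: list[float], fraction: float) -> float:
    """Return the smallest score that still falls in the top `fraction` of `scores`."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[keep - 1]


# Illustrative scores 0.01, 0.02, ..., 1.00 (not real TF-IDF values).
scores = [i / 100 for i in range(1, 101)]
threshold = top_fraction_threshold(scores, 0.15)
kept = [s for s in scores if s >= threshold]
print(threshold, len(kept))
```

Applied to the real matrix, the same function would take the non-zero entries of the sparse TF-IDF matrix as `scores`.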

K-Means Clustering

Embedding and Clustering: Embedded words with Word2Vec and represented each paper as the TF-IDF-weighted average of its word vectors.
Optimal K Value: Chose k = 36 using the elbow and silhouette methods, after working through the high computational cost of the search.
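
The TF-IDF-weighted average can be sketched in plain Python as below. The function name `weighted_paper_vector`, the toy two-dimensional vectors, and the choice to normalize by the sum of weights and to skip words without an embedding are assumptions; the project's script may differ in these details.

```python
def weighted_paper_vector(
    tfidf: dict[str, float], vectors: dict[str, list[float]]
) -> list[float]:
    """Average the word vectors of one paper, weighting each word by its TF-IDF score."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in tfidf.items():
        vec = vectors.get(word)
        if vec is None:
            continue  # skip words that have no embedding
        for i, value in enumerate(vec):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [value / weight_sum for value in total]


# Toy embeddings and TF-IDF scores, for illustration only.
vectors = {"parse": [1.0, 0.0], "tree": [0.0, 1.0]}
tfidf = {"parse": 0.75, "tree": 0.25, "unknown": 0.5}
paper_vector = weighted_paper_vector(tfidf, vectors)
print(paper_vector)
```

The resulting per-paper vectors are what a k-means++ implementation would then cluster.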

Keyword Trend Analysis

Calculation: Extracted important words per year via TF-IDF, weighted by the number of papers, and produced trends over time, compensating for small TF-IDF values.
Comparison: Compared the keyword trends with Google Trends.
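
A minimal sketch of a per-year keyword trend, assuming that "weighted by the number of papers" means dividing a keyword's summed TF-IDF score in a year by that year's paper count. That reading, the function name `keyword_trend`, and the paper records are assumptions for illustration.

```python
from collections import defaultdict


def keyword_trend(papers: list[dict], keyword: str) -> dict[int, float]:
    """Return the mean TF-IDF score of `keyword` per year, over all papers of that year."""
    totals: dict[int, float] = defaultdict(float)
    counts: dict[int, int] = defaultdict(int)
    for paper in papers:
        year = paper["year"]
        counts[year] += 1
        totals[year] += paper["tfidf"].get(keyword, 0.0)
    return {year: totals[year] / counts[year] for year in sorted(counts)}


# Made-up records: each paper has a year and a word -> TF-IDF mapping.
papers = [
    {"year": 2021, "tfidf": {"parsing": 0.5}},
    {"year": 2021, "tfidf": {"prompt": 0.5}},
    {"year": 2022, "tfidf": {"prompt": 0.5}},
    {"year": 2022, "tfidf": {"prompt": 0.25, "multimodal": 0.25}},
]
trend = keyword_trend(papers, "prompt")
print(trend)
```

Dividing by the paper count keeps years with many papers from dominating the curve simply because they contain more text.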

Word Cloud by Year

Extraction and Visualization: Extracted the top 20 words per year using TF-IDF with a threshold of 0.17.
Creating word clouds: Created word clouds from the summed importance of these words for each year.
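
The selection step can be sketched as follows: keep scores at or above the threshold, sum them per word and year, and take the top 20. The function name `top_words_by_year` and the sample records are hypothetical, and summing TF-IDF scores is one reading of "sum of important words".

```python
from collections import Counter, defaultdict

THRESHOLD = 0.17  # threshold stated in the README
TOP_N = 20        # number of words kept per year


def top_words_by_year(
    papers: list[dict], threshold: float = THRESHOLD, top_n: int = TOP_N
) -> dict[int, list[tuple[str, float]]]:
    """Sum above-threshold TF-IDF scores per word and year, then keep the top words."""
    sums: dict[int, Counter] = defaultdict(Counter)
    for paper in papers:
        for word, score in paper["tfidf"].items():
            if score >= threshold:
                sums[paper["year"]][word] += score
    return {year: counter.most_common(top_n) for year, counter in sorted(sums.items())}


# Made-up records, for illustration only.
papers_wc = [
    {"year": 2022, "tfidf": {"prompt": 0.5, "the": 0.1}},
    {"year": 2022, "tfidf": {"prompt": 0.25, "multimodal": 0.25}},
]
top_words = top_words_by_year(papers_wc)
print(top_words)
```

The per-year `(word, weight)` pairs can be converted to a dict and handed to a renderer, for example the `wordcloud` package's `WordCloud.generate_from_frequencies`, if that is the library in use.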

📊 Results

Cluster Analysis

Purpose: Determination of diverse research areas through cluster analysis.
Result: Identified research themes using K-means++ clustering based on specific keywords.

Research Topic Trend

Purpose: Understanding evolving trends in AI research topics based on cluster trends.
Result: Analyzed trends in research topics by tracking changes in cluster proportions over years.
The cluster trend figure shows that the 'model expressionism' cluster, one of the modern trends in AI, appeared at the end of 2010.
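
Tracking cluster proportions over the years can be sketched as below. The function name `cluster_share_by_year` and the `(year, cluster)` assignments are illustrative and do not come from the project's output.

```python
from collections import Counter, defaultdict


def cluster_share_by_year(
    assignments: list[tuple[int, int]]
) -> dict[int, dict[int, float]]:
    """Return, for each year, the fraction of that year's papers in each cluster."""
    per_year: dict[int, Counter] = defaultdict(Counter)
    for year, cluster in assignments:
        per_year[year][cluster] += 1
    shares: dict[int, dict[int, float]] = {}
    for year in sorted(per_year):
        total = sum(per_year[year].values())
        shares[year] = {
            cluster: count / total
            for cluster, count in sorted(per_year[year].items())
        }
    return shares


# Made-up (year, cluster id) pairs, one per paper.
assignments = [(2018, 0), (2018, 0), (2018, 0), (2018, 1), (2019, 1), (2019, 1)]
shares = cluster_share_by_year(assignments)
print(shares)
```

A cluster that first shows a non-zero share in a given year is one that "appeared" in that year in the sense used above.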

Keyword Trend Analysis

Purpose: Validation of trend analysis reliability using the researched data.
Result: Analyzed annual keyword trends using TF-IDF values and compared with Google Trends data.
We compared five keywords: 'derivation', 'multimodal', 'prompt', 'segmentation', and 'semantic'.
For each keyword, the TF-IDF trend and the Google Trends curve have similar shapes, which suggests that the trend analysis is reliable.
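
The visual comparison could also be quantified. The sketch below computes the Pearson correlation between two yearly series; it is a suggestion under stated assumptions, not a step the project reports, and both series are made-up numbers rather than the project's data or real Google Trends values.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Illustrative yearly values for one keyword (not real measurements).
tfidf_trend = [0.01, 0.02, 0.03, 0.04]
search_interest = [10.0, 20.0, 30.0, 40.0]
similarity = pearson(tfidf_trend, search_interest)
print(round(similarity, 6))
```

A coefficient close to 1 would back up the claim that the two curves move together; a value near 0 would not.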

Word Cloud by Year

Purpose: Comparative analysis of keyword significance across different years.
Result: Generated visual word cloud images displaying important keywords for each year.
This visual exploration provides insights into the evolving importance of specific words or keywords over time.
