Data-Mining-AI-Paper/DATA_MINING_AI_PAPER

2023 FALL Data Mining(SCE3313, F074) Project

REPORT PAPER - For Details, Read the Paper.

๐Ÿ“ Project summary

Analysis Challenges in NLP Papers: BOLT - Beyond Obstacles, Leap Together

  • We address the difficulty of accessing ACL papers in natural language processing research and propose solutions.
  • The methods we tried include TF-IDF, SVD, and K-means clustering to derive insights from a dataset of 12,745 papers.
  • Using the data crawled from ACL papers, we produced the following three outputs:
    1. Keyword trend analysis with graphs
    2. Word cloud by year
    3. Research topic trend with clusters by year

Team members

  • Kyunghyun Min (Dept: software)
  • Jongho Baik (Dept: software)
  • Junseo Lee (Dept: software)

๐Ÿ—๏ธ Project structure

Directory

/DATA_MINING_AI_PAPER
├── output
│   ├── k-means
│   └── wordcloud
├── tempfiles
├── 1. Crawling ACL.ipynb
├── 2. preprocess.py
├── 3. k-mean_clustering_word2vect.py
├── 4. keyword_trend.py
├── 5. wordcloud_by_year.py
├── 6. topic_trend.py
├── tf_idf.py
├── tool.py
├── ACL_PAPERS.json
├── LICENSE
├── preprocessed_ACL_PAPERS.pickle
└── README.md

Details

  • output/k-means: Contains the results of k-means++ clustering, the cluster labels, and information for each value of k.
  • output/wordcloud: Contains word clouds by year, from 1979 to 2023.
  • 1. Crawling ACL.ipynb and 2. preprocess.py: Crawl the papers and preprocess the data.
  • 3. k-mean_clustering_word2vect.py: Builds clusters with k-means++.
  • 4. keyword_trend.py: Plots how the importance of keywords changes by year.
  • 5. wordcloud_by_year.py: Renders the important keywords of each year as word clouds.
  • 6. topic_trend.py: Labels the clusters produced by 3. k-mean_clustering_word2vect.py.

โš™๏ธ Requirements

Hardware Configuration

  • Google Cloud Compute Engine VM N1 instance
    • Custom configuration:
      • Intel® Xeon® E5-2696V3 10vCPU
      • Ubuntu 23.10
      • 65GB RAM
      • 200GB storage

Software Configuration

  • Python 3.11.5
  • Jupyter Notebook
    • IPython: 8.15.0
    • ipykernel: 6.25.0
    • ipywidgets: 8.0.4
    • jupyter_client: 7.4.9
    • jupyter_core: 5.3.0
    • jupyter_server: 1.23.4
    • jupyterlab: 3.6.3
    • nbclient: 0.5.13
    • nbconvert: 6.5.4
    • nbformat: 5.9.2
    • notebook: 6.5.4
    • qtconsole: 5.4.2
    • traitlets: 5.7.1

Additional Libraries

To ensure consistency in package versions, the following additional libraries are used:

  • Matplotlib: 3.5.2
  • NumPy: 1.23.1
  • scikit-learn: 1.3.0
  • nltk: 3.8.1
  • pyclustering: 0.10.1.2
  • wordcloud: 1.9.2

🔨 Methods

Crawling ACL Paper Data

  • Approach: Initially utilized DBLP, but switched to direct use of the ACL site.
  • Extraction: Retrieved papers via DOIs, totaling 10,293.
  • API Usage: Employed the Semantic Scholar API in chunks of 500 DOIs, later transitioning to individual
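
The chunking step above can be sketched as follows. This is a minimal illustration, not the project's crawler: `chunked` and `build_batch_payloads` are hypothetical helper names, and the request body shape (an `ids` list of `DOI:`-prefixed strings) is an assumption about how a batch lookup might be posted. No network call is made.

```python
from typing import Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 500) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payloads(dois: Sequence[str], size: int = 500) -> List[Dict[str, List[str]]]:
    """Build one request body per chunk (hypothetical helper).

    Assumption: each DOI is sent as "DOI:<doi>" inside an "ids" list.
    """
    return [{"ids": [f"DOI:{doi}" for doi in chunk]} for chunk in chunked(dois, size)]
```

With 10,293 DOIs and a chunk size of 500, this yields 21 request bodies, the last one holding the remaining 293 DOIs.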

Data Preprocessing

  • Removing Poor Abstracts: Excluded abstracts <100 characters (13 instances).
  • Selecting Central Analytic Fields: Focused on 'title', 'abstract', and 'year'.
  • Issues for TF-IDF Processing: Removed URLs, non-alphabetic characters from abstracts, and implemented lemmatization.
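
A rough sketch of these preprocessing steps, using only the standard library. The function names, the regular expressions, and the lowercasing step are assumptions for illustration. The lemmatizer is passed in as an optional callable (for example a wrapper around an NLTK lemmatizer) so the sketch runs without downloading any corpus.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed URL pattern
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]+")      # anything that is not a letter or whitespace


def clean_abstract(text: str, lemmatize: Optional[Callable[[str], str]] = None) -> str:
    """Strip URLs and non-alphabetic characters, then optionally lemmatize each token."""
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = text.lower().split()  # lowercasing is an assumption, not stated in the README
    if lemmatize is not None:
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)


def preprocess(papers: Iterable[Dict], min_chars: int = 100,
               lemmatize: Optional[Callable[[str], str]] = None) -> List[Dict]:
    """Drop papers with short abstracts and keep only title, abstract, and year."""
    kept = []
    for paper in papers:
        abstract = paper.get("abstract") or ""
        if len(abstract) < min_chars:
            continue  # "poor" abstract: fewer than min_chars characters
        kept.append({
            "title": paper.get("title", ""),
            "abstract": clean_abstract(abstract, lemmatize),
            "year": paper.get("year"),
        })
    return kept
```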

TF-IDF

  • Processing: Used scikit-learn's TfidfVectorizer, producing a sparse matrix of 12,732 papers by 17,054 features.
  • Thresholding: Chose a threshold of 0.17 to represent approximately the top 15% of TF-IDF values.
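
One way such a threshold could be derived is sketched below: rank the non-zero TF-IDF values and take the smallest value that still falls in the top fraction. `threshold_for_top_fraction` is a hypothetical helper, and whether the authors computed 0.17 this way is not stated in the README. With scikit-learn, the input could plausibly be the `.data` array of the sparse TF-IDF matrix.

```python
from typing import Sequence


def threshold_for_top_fraction(values: Sequence[float], fraction: float = 0.15) -> float:
    """Return the smallest value among the top `fraction` of `values`.

    Keeping every value >= the returned threshold retains roughly
    `fraction` of the inputs (more if there are ties at the cutoff).
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(values, reverse=True)
    keep = max(1, round(len(ranked) * fraction))  # how many values to retain
    return ranked[keep - 1]
```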

K-Means Clustering

  • Embedding and Clustering: Embedded words with Word2Vec and represented each paper as the TF-IDF-weighted average of its word vectors.
  • Optimal K Value: Chose k = 36 using the elbow and silhouette methods, after working around the high computational cost.
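
The paper representation described above can be illustrated with a small dependency-free sketch. `paper_vector` is a hypothetical name, plain dictionaries stand in for a trained Word2Vec model, and normalizing by the sum of the weights is an assumption about how the weighted average was formed.

```python
from typing import Dict, List, Sequence


def paper_vector(tfidf_weights: Dict[str, float],
                 word_vectors: Dict[str, Sequence[float]],
                 dim: int) -> List[float]:
    """TF-IDF-weighted average of the word vectors of one paper.

    Words without an embedding are skipped. If no word has an embedding,
    a zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in tfidf_weights.items():
        vector = word_vectors.get(word)
        if vector is None:
            continue  # out-of-vocabulary word
        for i, component in enumerate(vector):
            total[i] += weight * component
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [component / weight_sum for component in total]
```

The resulting vectors, one per paper, would then be the input to k-means++ clustering.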

Keyword Trend Analysis

  • Calculation: Extracted important words per year via TF-IDF, weighted them by the number of papers, and plotted the trends over time, compensating for small TF-IDF values.
  • Comparison: Compared the keyword trends with Google Trends.
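
A minimal sketch of a per-year keyword trend follows. It assumes that "weighted by the number of papers" means dividing the summed TF-IDF value of a keyword by the number of papers published that year. The README does not spell out the exact formula, so treat this as one plausible reading. `keyword_trend` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def keyword_trend(papers: Iterable[Tuple[int, Dict[str, float]]],
                  keyword: str) -> Dict[int, float]:
    """Average TF-IDF value of `keyword` per year.

    `papers` yields (year, {word: tfidf}) pairs. Papers that do not
    contain the keyword contribute 0 to that year's average.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, weights in papers:
        counts[year] += 1
        totals[year] += weights.get(keyword, 0.0)
    return {year: totals[year] / counts[year] for year in sorted(counts)}
```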

Word Cloud by Year

  • Extraction and Visualization: Extracted top 20 words per year using TF-IDF, setting a threshold of 0.17.
  • Creating word clouds: Created word clouds based on the sum of important words for each year.
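
The selection step could look roughly like the sketch below: per year, sum the TF-IDF values that reach the threshold and keep the 20 largest sums. `top_words_by_year` is a hypothetical name, and summing the above-threshold values is an assumption about what "the sum of important words" means.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def top_words_by_year(papers: Iterable[Tuple[int, Dict[str, float]]],
                      threshold: float = 0.17,
                      top_n: int = 20) -> Dict[int, Dict[str, float]]:
    """Per year, sum the TF-IDF values >= threshold and keep the top_n words."""
    sums = defaultdict(Counter)
    for year, weights in papers:
        for word, value in weights.items():
            if value >= threshold:
                sums[year][word] += value
    return {year: dict(counter.most_common(top_n)) for year, counter in sums.items()}
```

If the `wordcloud` package is installed, each per-year dictionary could be passed to `WordCloud().generate_from_frequencies(...)` to render the image.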

📊 Results

Cluster Analysis

  • Purpose: Determination of diverse research areas through cluster analysis.
  • Result: Identified research themes using K-means++ clustering based on specific keywords.

Research Topic Trend

  • Purpose: Understanding evolving trends in AI research topics based on cluster trends.
  • Result: Analyzed trends in research topics by tracking changes in cluster proportions over years.
  • The image below shows that the 'model expressionism' cluster, one of the modern trends in AI, appeared at the end of 2010.
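
Tracking cluster proportions over the years can be sketched as follows. `cluster_proportions_by_year` is a hypothetical name, and the input format (one `(year, cluster_label)` pair per paper) is an assumption about how the clustering output is stored.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def cluster_proportions_by_year(
        assignments: Iterable[Tuple[int, Hashable]]) -> Dict[int, Dict[Hashable, float]]:
    """Share of each cluster among the papers of each year.

    `assignments` yields one (year, cluster_label) pair per paper.
    The proportions within a year sum to 1.
    """
    counts = defaultdict(Counter)
    for year, label in assignments:
        counts[year][label] += 1
    proportions = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        proportions[year] = {label: n / total for label, n in counts[year].items()}
    return proportions
```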

image

Keyword Trend Analysis

  • Purpose: Validation of trend analysis reliability using the researched data.
  • Result: Analyzed annual keyword trends using TF-IDF values and compared with Google Trends data.
  • We compared five keywords: 'derivation', 'multimodal', 'prompt', 'segmentation', and 'semantic'.
  • Because the two graphs for each keyword have a similar shape, the trend analysis appears to be reliable.

Word Cloud by Year

  • Purpose: Comparative analysis of keyword significance across different years.
  • Result: Generated visual word cloud images displaying important keywords for each year.
  • This visual exploration provides insights into the evolving importance of specific words or keywords over time.

📜 License

This software is licensed under the MIT License © 2023 Data-Mining-AI-Paper