2023 FALL Data Mining(SCE3313, F074) Project
REPORT PAPER - For Details, Read the Paper.
- This project addresses the difficulty of navigating ACL papers in natural language processing research and proposes solutions.
- The methods we tried include TF-IDF, SVD, and K-means clustering, applied to a dataset of 12,745 papers.
- Using data crawled from ACL papers, we produced the following three outputs:
- Keyword trend analysis with graphs
- Word cloud by year
- Research topic trend with clusters by year
Dept | Icon | Name | Github |
---|---|---|---|
software | | Kyunghyun Min | |
software | | Jongho Baik | |
software | | Junseo Lee | |
/DATA_MINING_AI_PAPER
├── output
│   ├── k-means
│   └── wordcloud
├── tempfiles
├── 1. Crawling ACL.ipynb
├── 2. preprocess.py
├── 3. k-mean_clustering_word2vect.py
├── 4. keyword_trend.py
├── 5. wordcloud_by_year.py
├── 6. topic_trend.py
├── tf_idf.py
├── tool.py
├── ACL_PAPERS.json
├── LICENSE
├── preprocessed_ACL_PAPERS.pickle
└── README.md
- `output/k-means`: Contains the results of k-means++ clustering, the cluster labels, and information for each value of k.
- `output/wordcloud`: Contains word clouds by year, from 1979 to 2023.
- `1. Crawling ACL.ipynb` and `2. preprocess.py`: Crawl the papers and preprocess the data.
- `3. k-mean_clustering_word2vect.py`: Builds clusters with k-means++.
- `4. keyword_trend.py`: Produces graphs of how keyword importance changes by year.
- `5. wordcloud_by_year.py`: Produces word clouds of the important keywords for each year.
- `6. topic_trend.py`: Labels the clusters made in `3. k-mean_clustering_word2vect.py`.
- Python: 3.11.5
- IPython: 8.15.0
- ipykernel: 6.25.0
- ipywidgets: 8.0.4
- jupyter_client: 7.4.9
- jupyter_core: 5.3.0
- jupyter_server: 1.23.4
- jupyterlab: 3.6.3
- nbclient: 0.5.13
- nbconvert: 6.5.4
- nbformat: 5.9.2
- notebook: 6.5.4
- qtconsole: 5.4.2
- traitlets: 5.7.1
To ensure consistency in package versions, the following additional libraries are used:
- Approach: Initially used DBLP, then switched to crawling the ACL site directly.
- Extraction: Retrieved 10,293 papers via their DOIs.
- API Usage: Queried the Semantic Scholar API in chunks of 500 DOIs, later transitioning to individual
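The chunked querying described above can be sketched as follows. This is a minimal illustration, not the project's crawler: the helper names are hypothetical, the DOIs are placeholders, and the `DOI:` prefix and `ids` key are assumptions about the Semantic Scholar batch request format that should be checked against the API documentation. No network call is made.

```python
from typing import Iterator

# Batch size taken from the "chunks of 500 DOIs" described above.
BATCH_SIZE = 500


def chunked(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_payload(dois: list[str]) -> dict[str, list[str]]:
    """Build the JSON body for one batch request (assumed format)."""
    return {"ids": [f"DOI:{doi}" for doi in dois]}


# Placeholder DOIs, for illustration only.
example_dois = [f"10.0000/example.{i}" for i in range(1200)]
payloads = [build_batch_payload(batch) for batch in chunked(example_dois)]
batch_sizes = [len(payload["ids"]) for payload in payloads]
```

Each payload would then be sent as one request; keeping the chunking separate from the HTTP call makes it easy to fall back to smaller batches when a request fails.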
- Removing Poor Abstracts: Excluded abstracts shorter than 100 characters (13 instances).
- Selecting Central Analytic Fields: Focused on 'title', 'abstract', and 'year'.
- Issues for TF-IDF Processing: Removed URLs and non-alphabetic characters from abstracts, and applied lemmatization.
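A minimal sketch of these preprocessing steps, using only the standard library. The function names and regular expressions are illustrative assumptions, lowercasing is an added assumption, and lemmatization is left out because it needs an external NLP library.

```python
import re

MIN_ABSTRACT_CHARS = 100  # length cut-off stated above

# Assumed patterns; the project's actual rules may differ.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_PATTERN = re.compile(r"[^A-Za-z\s]+")


def clean_abstract(text: str) -> str:
    """Strip URLs and non-alphabetic characters, then normalize whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ALPHA_PATTERN.sub(" ", text)
    return " ".join(text.lower().split())


def preprocess(papers: list[dict]) -> list[dict]:
    """Keep 'title', 'abstract', 'year'; drop papers with short abstracts."""
    kept = []
    for paper in papers:
        abstract = paper.get("abstract") or ""
        if len(abstract) < MIN_ABSTRACT_CHARS:
            continue
        kept.append({
            "title": paper["title"],
            "abstract": clean_abstract(abstract),
            "year": paper["year"],
        })
    return kept
```

The length check runs on the raw abstract, before cleaning; whether the project measured length before or after cleaning is not stated.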
- Processing: Used TfidfVectorizer, producing a sparse matrix of 12,732 papers by 17,054 features.
- Thresholding: Chose a threshold of 0.17, which corresponds approximately to the top 15% of TF-IDF values.
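One way to derive such a cut-off is to rank the scores and read off the value at the desired fraction. The sketch below assumes the "top 15%" is taken over the nonzero TF-IDF entries, which the text does not state; the function name and toy scores are illustrative.

```python
def top_fraction_threshold(values: list[float], fraction: float) -> float:
    """Return the smallest value that still falls in the top `fraction`."""
    if not values or not 0 < fraction <= 1:
        raise ValueError("need a non-empty list and 0 < fraction <= 1")
    ranked = sorted(values, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[keep - 1]


# Toy scores standing in for the nonzero entries of a TF-IDF matrix.
toy_scores = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.00
threshold = top_fraction_threshold(toy_scores, 0.15)
important = [score for score in toy_scores if score >= threshold]
```

With a SciPy sparse matrix, the nonzero scores would typically come from its `.data` array.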
- Embedding and Clustering: Embedded words with Word2Vec and represented each paper as the TF-IDF-weighted average of its word vectors.
- Optimal K Value: Determined k = 36 using the elbow and silhouette methods, despite the high computational cost of the search.
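The paper representation described above can be illustrated with a small TF-IDF-weighted average. This is a sketch under assumptions: the function name is hypothetical, the two-dimensional vectors are toy stand-ins for Word2Vec embeddings, and skipping words without a vector is one possible policy.

```python
def weighted_average_vector(
    tfidf: dict[str, float],
    word_vectors: dict[str, list[float]],
) -> list[float]:
    """Average a paper's word vectors, weighted by TF-IDF score.

    Words without a vector are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, weight in tfidf.items():
        vector = word_vectors.get(word)
        if vector is None:
            continue
        for i, component in enumerate(vector):
            total[i] += weight * component
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [component / weight_sum for component in total]


# Toy 2-dimensional vectors standing in for Word2Vec embeddings.
toy_vectors = {"parsing": [1.0, 0.0], "translation": [0.0, 1.0]}
toy_tfidf = {"parsing": 0.75, "translation": 0.25, "unseen": 0.5}
paper_vector = weighted_average_vector(toy_tfidf, toy_vectors)
```

The resulting vectors, one per paper, are what a k-means++ clustering step would take as input.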
- Calculation: Extracted important words per year via TF-IDF, weighted by the number of papers, and produced trends over time, compensating for small TF-IDF values.
- Comparison: Compared the keyword trend analysis with Google Trends.
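A yearly keyword trend can be sketched as the keyword's TF-IDF score averaged over all papers published that year. Dividing by the paper count is one reading of "weighted by the number of papers" and is an assumption, as are the function name and the per-paper data layout.

```python
from collections import defaultdict


def keyword_trend(papers: list[dict], keyword: str) -> dict[int, float]:
    """Mean TF-IDF score of `keyword` per year.

    Each paper is a dict with a 'year' and a 'tfidf' mapping of word -> score.
    Papers that do not contain the keyword contribute a score of 0.
    """
    score_sum: dict[int, float] = defaultdict(float)
    paper_count: dict[int, int] = defaultdict(int)
    for paper in papers:
        year = paper["year"]
        score_sum[year] += paper["tfidf"].get(keyword, 0.0)
        paper_count[year] += 1
    return {year: score_sum[year] / paper_count[year] for year in sorted(paper_count)}


# Toy data, for illustration only.
toy_papers = [
    {"year": 2022, "tfidf": {"prompt": 0.5}},
    {"year": 2021, "tfidf": {"prompt": 0.2}},
    {"year": 2021, "tfidf": {}},
    {"year": 2022, "tfidf": {"prompt": 0.25}},
]
trend = keyword_trend(toy_papers, "prompt")
```

Normalizing by the number of papers keeps years with many publications from dominating the curve simply because they contain more text.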
- Extraction and Visualization: Extracted the top 20 words per year using TF-IDF with a threshold of 0.17.
- Creating Word Clouds: Created word clouds based on the sum of important words for each year.
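The per-year word selection can be sketched as below. It assumes that "the sum of important words" means summing each word's above-threshold TF-IDF scores within a year; the function name and data layout are illustrative.

```python
from collections import Counter, defaultdict

THRESHOLD = 0.17  # TF-IDF cut-off stated above
TOP_N = 20        # words kept per year, as stated above


def top_words_by_year(
    papers: list[dict],
    threshold: float = THRESHOLD,
    top_n: int = TOP_N,
) -> dict[int, dict[str, float]]:
    """Sum above-threshold TF-IDF scores per word and year; keep the top words."""
    totals: dict[int, Counter] = defaultdict(Counter)
    for paper in papers:
        for word, score in paper["tfidf"].items():
            if score >= threshold:
                totals[paper["year"]][word] += score
    return {year: dict(counter.most_common(top_n)) for year, counter in totals.items()}


# Toy data, for illustration only.
toy_papers = [
    {"year": 2023, "tfidf": {"prompt": 0.5, "the": 0.05}},
    {"year": 2023, "tfidf": {"prompt": 0.25, "multimodal": 0.5}},
]
frequencies = top_words_by_year(toy_papers)
```

If the `wordcloud` package is used for rendering (an assumption), each year's dictionary can be passed to `WordCloud().generate_from_frequencies(...)`.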
- Purpose: Determination of diverse research areas through cluster analysis.
- Result: Identified research themes using K-means++ clustering based on specific keywords.
- Purpose: Understanding evolving trends in AI research topics based on cluster trends.
- Result: Analyzed trends in research topics by tracking changes in cluster proportions over years.
- The image below shows that the 'model expressionism' cluster, one of the modern trends in AI, appeared at the end of 2010.
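Tracking cluster proportions over the years amounts to counting, for each year, the share of papers assigned to each cluster. A minimal sketch, with a hypothetical function name and toy assignments:

```python
from collections import Counter, defaultdict


def cluster_proportions_by_year(
    assignments: list[tuple[int, int]],
) -> dict[int, dict[int, float]]:
    """Map year -> {cluster label: share of that year's papers}."""
    counts: dict[int, Counter] = defaultdict(Counter)
    for year, cluster in assignments:
        counts[year][cluster] += 1
    proportions = {}
    for year in sorted(counts):
        total = sum(counts[year].values())
        proportions[year] = {c: n / total for c, n in counts[year].items()}
    return proportions


# Toy (year, cluster) pairs, for illustration only.
toy_assignments = [
    (2019, 0), (2019, 0), (2019, 1), (2019, 1),
    (2020, 1), (2020, 1), (2020, 1), (2020, 0),
]
proportions = cluster_proportions_by_year(toy_assignments)
```

Using proportions rather than raw counts makes years comparable even though the number of published papers differs from year to year.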
- Purpose: Validation of trend analysis reliability using the researched data.
- Result: Analyzed annual keyword trends using TF-IDF values and compared with Google Trends data.
- We compared five keywords: 'derivation', 'multimodal', 'prompt', 'segmentation', and 'semantic'.
- Because the two graphs for each keyword have similar shapes, the trend analysis appears to be reliable.
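The comparison above is visual. If a numeric check of "similar shape" is wanted, one option is the Pearson correlation between the two yearly series. This is not something the project reports; the sketch below is an illustration with made-up numbers.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly values, for illustration only.
toy_tfidf_trend = [0.1, 0.2, 0.4, 0.8]
toy_google_trend = [10.0, 20.0, 40.0, 80.0]
similarity = pearson(toy_tfidf_trend, toy_google_trend)
```

A value close to 1 indicates that the two curves rise and fall together; the scale difference between TF-IDF scores and Google Trends indices does not affect the result.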
- Purpose: Comparative analysis of keyword significance across different years.
- Result: Generated visual word cloud images displaying important keywords for each year.
- This visual exploration provides insights into the evolving importance of specific words or keywords over time.
This software is licensed under the MIT License © 2023 Data-Mining-AI-Paper