This repository contains the source code and data corpus used in the following paper:
Detecting Incongruity Between News Headline and Body Text via a Deep Hierarchical Encoder, AAAI-19
tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3
download preprocessed dataset with the following script
cd data
sh download_processed_dataset_aaai-19.sh
the downloaded dataset will be placed into the following path of the project
/data/aaai-19/para
/data/aaai-19/whole
format (example)
test_title.npy: [100000, 49] - (#samples, #token (index))
test_body: [100000, 1200] - (#samples, #token (index))
test_label: [100000] - (#samples)
dic_mincutN.txt: dictionary
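As a quick sanity check of the shapes listed above, the arrays can be inspected with NumPy. The sketch below is self-contained: it writes a small stand-in array to a temporary directory rather than reading the real download, and the helper name `describe_npy` is hypothetical. Only the file name `test_title.npy` and the token width of 49 are taken from the listing above; the `int32` dtype is an assumption.

```python
import os
import tempfile

import numpy as np


def describe_npy(path):
    """Load a .npy file and return its (shape, dtype name)."""
    arr = np.load(path)
    return arr.shape, arr.dtype.name


# Stand-in for the real file: 5 samples instead of 100000, same token width (49).
# The int32 dtype is an assumption; check the downloaded files for the actual one.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "test_title.npy")
np.save(demo_path, np.zeros((5, 49), dtype=np.int32))

shape, dtype = describe_npy(demo_path)
print(shape, dtype)
```

To check the real data, point `describe_npy` at a file under the download path and compare the result with the shapes listed above.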
- choose the source folder according to the training method
whole-type: use the code in ./src_whole
para-type: use the code in ./src_para
- each source code folder contains a reference script for training the model
train_reference_scripts.sh
<< for example >>
train the AHDE model with the "whole" method
python AHDE_Model.py --batch_size 256 --encoder_size 80 --context_size 10 --encoderR_size 49 --num_layer 1 --hidden_dim 300 --num_layer_con 1 --hidden_dim_con 300 --embed_size 300 --lr 0.001 --num_train_steps 100000 --is_save 1 --graph_prefix 'ahde' --corpus 'aaai-19_whole' --data_path '../data/target_aaai-19_whole/'
- Results will be displayed in the console
- The final test result will be stored in "./TEST_run_result.txt"
※ hyper parameters
- major parameters: edit them in the training script
- other parameters: edit them in "./params.py"
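To illustrate how command-line flags like those in the reference command map to typed values, here is a minimal sketch using the standard `argparse` module. It is not the repository's actual parser: the function `build_parser` is hypothetical, only a subset of the flags is covered, and the types and defaults are assumptions copied from the example command above.

```python
import argparse


def build_parser():
    """Hypothetical parser for a subset of the flags in the reference command."""
    p = argparse.ArgumentParser(description="illustrative only")
    # Defaults mirror the example command; they are assumptions, not the repo's defaults.
    p.add_argument("--batch_size", type=int, default=256)
    p.add_argument("--encoder_size", type=int, default=80)
    p.add_argument("--context_size", type=int, default=10)
    p.add_argument("--hidden_dim", type=int, default=300)
    p.add_argument("--lr", type=float, default=0.001)
    p.add_argument("--num_train_steps", type=int, default=100000)
    p.add_argument("--is_save", type=int, default=0)
    p.add_argument("--graph_prefix", type=str, default="ahde")
    p.add_argument("--corpus", type=str, default="aaai-19_whole")
    p.add_argument("--data_path", type=str, default="../data/target_aaai-19_whole/")
    return p


# Override a few values, as one would by editing the training script.
args = build_parser().parse_args(["--batch_size", "64", "--lr", "0.0005", "--is_save", "1"])
print(args.batch_size, args.lr, args.is_save)
```

Flags that are not passed keep their defaults, which is why the reference scripts spell out every major parameter explicitly.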
- each source code folder contains an inference script
- before running it, set "model_path" in "eval_AHDE.sh" to the path of your saved model
<< for example >>
evaluate the test dataset with the AHDE model and the "whole" method
src_whole$ sh eval_AHDE.sh
- Results will be displayed in the console
- scores for the test set will be stored in "./output.txt"
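Once scores are available, one simple way to summarize them is to threshold each score and compare it with the label. The sketch below is only an illustration: the function `accuracy_from_scores`, the 0.5 threshold, and the idea that scores are probabilities in [0, 1] are all assumptions, and the layout of "./output.txt" is not documented here, so parsing it is left out. The scores and labels are made up.

```python
def accuracy_from_scores(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the 0/1 label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return correct / float(len(labels))


# Made-up scores and labels, for illustration only (not real model output).
demo_scores = [0.91, 0.12, 0.67, 0.40]
demo_labels = [1, 0, 0, 0]
acc = accuracy_from_scores(demo_scores, demo_labels)
print(acc)
```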
whole case
data     samples      tokens (avg) headline    tokens (avg) body text
train    1,700,000    13.71                    499.81
dev      100,000      13.69                    499.03
test     100,000      13.55                    769.23
Note
We crawled the articles for the "dev" and "test" datasets from different media outlets.
- We created an English version of the dataset, nela-17, using NELA 2017 data. Please refer to the dataset repository [link].
- If you want to run our model (AHDE) with the nela-17 data, you can use the preprocessed dataset that is compatible with our code.
cd data
sh download_processed_dataset_nela-17.sh
- training script (refer to the "train_reference_scripts.sh")
python AHDE_Model.py --batch_size 64 --encoder_size 200 --context_size 50 --encoderR_size 25 --num_layer 1 --hidden_dim 100 --num_layer_con 1 --hidden_dim_con 100 --embed_size 300 --use_glove 1 --lr 0.001 --num_train_steps 100000 --is_save 1 --graph_prefix 'ahde' --corpus 'nela-17_whole' --data_path '../data/target_nela-17_whole/'
- PyTorch implementation [link] by M. Lee
- compatible with the preprocessed dataset
- Please cite our paper when you use our code, dataset, or model
@inproceedings{yoon2019detecting,
title={Detecting Incongruity between News Headline and Body Text via a Deep Hierarchical Encoder},
author={Yoon, Seunghyun and Park, Kunwoo and Shin, Joongbo and Lim, Hongjun and Won, Seungpil and Cha, Meeyoung and Jung, Kyomin},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={33},
pages={791--800},
year={2019}
}