Contextual LSTM for NLP tasks like word prediction
This repo's goal is to implement the Contextual LSTM model for word prediction, as described in [Ghosh, S., Vinyals, O., Strope, B., Roy, S., Dean, T., & Heck, L. (n.d.). Contextual LSTM (CLSTM) models for Large scale NLP tasks. https://doi.org/10.1145/12351]
Note: there are scripts to run the pipelines, but the project still needs a bit of cleanup. If you are interested in using it, please write to me or open an issue and I'll fix or help with any error you run into.
Further details about the Wikipedia data preprocessing can be found at
./documentation/word_embeddings_and_topic_detection.pdf
Further details on the different gensim topic detection methods, as well as the embedding arithmetic used for context creation, can be found at
./documentation/word_embeddings_and_topic_detection_II.pdf
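As a rough illustration of the gensim-based topic detection those documents discuss, here is a minimal LDA sketch. The toy corpus and all parameters are made up for the example; the settings actually used in this repo may differ:

```python
# Minimal, illustrative gensim LDA sketch (not the repo's actual
# preprocessing; corpus and parameters are invented for the example).
from gensim import corpora, models

# toy corpus: each document is a list of tokens
docs = [
    ["lstm", "language", "model", "word", "prediction"],
    ["wikipedia", "dump", "extraction", "cleaning"],
    ["topic", "detection", "lda", "embeddings"],
]

dictionary = corpora.Dictionary(docs)               # token -> id mapping
bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary)

# topic distribution of a new document, usable as a context vector
new_doc = dictionary.doc2bow(["word", "prediction", "model"])
print(lda.get_document_topics(new_doc))
```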
Download a Wikipedia dump, for example:
https://dumps.wikimedia.org/enwiki/20180420/enwiki-20180420-pages-articles.xml.bz2
After that, use wiki_extractor to process it:
./wiki_extractor_launch.sh path_to_wikipedia_dump
where path_to_wikipedia_dump is the file you downloaded (e.g. enwiki-20180420-pages-articles.xml.bz2).
To run the whole pipeline use the script:
./run_pipeline.sh ../data/enwiki 500
The preprocessing step alone can be run with ./preprocess.sh ../data/enwiki 500 2, where:
- ../data/enwiki is the default path where the preprocessing script leaves the extracted and cleaned Wikipedia dump.
- 500 is the desired embedding size.
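The embedding size (500 above) is the dimensionality of the word vectors trained on the cleaned dump. As a hedged sketch of what such a training step can look like with gensim's Word2Vec (the corpus path, output filename, and parameters below are hypothetical, not the pipeline's actual ones):

```python
# Hypothetical sketch of training 500-dimensional word embeddings with
# gensim's Word2Vec; the repo's pipeline scripts may differ in detail.
from gensim.models.word2vec import LineSentence, Word2Vec

# LineSentence iterates over a text file with one sentence per line,
# which matches the typical output of WikiExtractor-style cleaning.
sentences = LineSentence("../data/enwiki/clean_corpus.txt")  # hypothetical path

model = Word2Vec(
    sentences,
    vector_size=500,   # embedding size; this argument is 'size' in gensim < 4.0
    window=5,
    min_count=5,
    workers=4,
)
model.save("../models/word2vec_500.model")  # hypothetical filename
```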
To run just the pipeline with pre-trained embeddings of size 1000, use:
./run_short_pipeline.sh ../data/ 1000
You can download the required trained embeddings from here:
https://www.dropbox.com/s/ws6d8l6h6jp3ldc/embeddings.tar.gz?dl=0
You should place them inside the models/ folder.
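Assuming the archive contains embeddings in gensim's native format, loading them could look like the sketch below (the filename is hypothetical; adjust it to whatever the extracted archive actually contains):

```python
# Hypothetical example of loading embeddings extracted into models/;
# the actual filename and format inside embeddings.tar.gz may differ.
from gensim.models import KeyedVectors

kv = KeyedVectors.load("models/embeddings_1000.kv")  # hypothetical name

vec = kv["king"]                         # 1000-dimensional word vector
print(kv.most_similar("king", topn=5))   # nearest neighbours in the space
```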
Basic LSTM implementation with TF at ./src/lstm.py
Contextual LSTM implementation with TF at ./src/clstm.py
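The core idea of the CLSTM, per the paper cited above, is to feed a context vector (e.g. a topic distribution) together with the word embedding at every time step. The following tf.keras sketch only illustrates that wiring; it is not the repo's clstm.py, and all sizes are invented:

```python
# Illustrative tf.keras sketch of a Contextual LSTM: the topic/context
# vector is concatenated to the word embedding at every time step.
# Not the repo's implementation; all shapes and sizes are made up.
import tensorflow as tf

vocab_size, embed_dim, context_dim, hidden_dim, seq_len = 10000, 500, 100, 256, 35

word_ids = tf.keras.Input(shape=(seq_len,), dtype="int32")  # token ids
context = tf.keras.Input(shape=(context_dim,))              # e.g. LDA topic vector

embedded = tf.keras.layers.Embedding(vocab_size, embed_dim)(word_ids)
# repeat the context vector across time and concatenate with the embeddings
tiled = tf.keras.layers.RepeatVector(seq_len)(context)
x = tf.keras.layers.Concatenate(axis=-1)([embedded, tiled])

h = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(x)
next_word = tf.keras.layers.Dense(vocab_size, activation="softmax")(h)

model = tf.keras.Model([word_ids, context], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```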
Although functional, this version is still too slow to be practical for training. If you want to collaborate or have any questions about it, feel free to contact me; I plan to finish it shortly and upload a detailed description.
Most files have their own execution script under the bin/ folder. All scripts named submit_XXX.sh are designed to run on a supercomputer with the Slurm queue system. To run them locally, just issue the Python commands followed by the correct paths.
Note: due to the use of many different packages, not all files run with the same Python version (some need 2.7, others 3.5.2, and the rest 3.6). I expect to unify them (or clearly state the required version) soon.