Due to Twitter's policies, only tweet IDs, not the actual tweet content, can be released directly. We therefore provide tweet IDs upon request to the authors. We refer readers to popular tools such as the Twitter Hydrator to obtain the actual Tweet JSONlines files.
Once Tweets are hydrated, please store the JSONlines files in .gz compressed format (preferably as multiple .gz files to enable batch processing).
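For illustration, here is a minimal sketch of splitting hydrated Tweets into several gzip-compressed shards; the input file name `hydrated.jsonl`, the output naming pattern, and the shard size are placeholders, not names used by this repository.

```python
import gzip
import itertools

TWEETS_PER_SHARD = 100_000  # placeholder shard size; pick one that suits your batch jobs

with open("hydrated.jsonl", encoding="utf-8") as src:
    for shard_id in itertools.count():
        # Take the next chunk of tweet JSON lines.
        chunk = list(itertools.islice(src, TWEETS_PER_SHARD))
        if not chunk:
            break
        # Write each chunk as its own gzip-compressed JSONlines shard.
        with gzip.open(f"tweets.{shard_id:04d}.jsonl.gz", "wt", encoding="utf-8") as out:
            out.writelines(chunk)
```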
Run the Python scripts in the `evaluation/` folder with the corresponding bash scripts in the `scripts/` folder, using the different location databases in the `database/` folder.
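For reference, a minimal sketch of how one gzip-compressed shard might be resolved against one of the provided location databases, assuming Carmen 2.0 keeps the original carmen-python interface (`get_resolver`, `load_locations`, `resolve_tweet`); the shard and database paths are placeholders, and the scripts in `evaluation/` remain the authoritative entry points.

```python
import gzip
import json

import carmen

# Build a resolver and load one of the provided location databases.
# Passing a file of JSON-lines location records to load_locations() follows
# the original carmen-python interface; Carmen 2.0 may differ slightly.
resolver = carmen.get_resolver()
with open("database/geonames_locations_combined.json") as location_file:
    resolver.load_locations(location_file=location_file)

# Resolve every tweet in one hydrated, gzip-compressed shard.
with gzip.open("tweets.0000.jsonl.gz", "rt", encoding="utf-8") as shard:
    for line in shard:
        tweet = json.loads(line)
        result = resolver.resolve_tweet(tweet)
        if result is not None:
            provisional, location = result
            print(location.country, location.state, location.city)
```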
- `carmen/` contains the Carmen 2.0 code (based on the original Carmen).
- `database/` contains the different location databases that can be used to initialize Carmen:
  - `locations.json` is the original Carmen location database.
  - `geonames_locations_only.json` is the new location database derived from the GeoNames database.
  - `geonames_locations_combined.json` is the combined version of `locations.json` and `geonames_locations_only.json`, with entries in `locations.json` mapped to a GeoNames entry and then converted to the Carmen database format.
- `evaluation/` contains the main Python scripts that compute the performance of Carmen 2.0 across different datasets.
- `preprocessing/` contains code to filter Twitter-Global into different splits. Since we already provide the split Twitter-Global Tweet IDs, users can likely skip this preprocessing step.
- `scripts/` contains bash scripts to run all the other Python scripts provided in the other folders. Note that these scripts only work on a server with a Sun Grid Engine (SGE) queueing system, which is used for efficient batch processing across 100 CPU jobs. Users need to adapt the input and output paths of these scripts, and also adapt the batch-processing part if not using SGE.
- `utils/` contains useful shortcuts for collecting results, e.g. formatting results into a CSV table (see the sketch after this list).
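As an illustration of that kind of shortcut, a minimal sketch that writes per-dataset results into a CSV table; the dataset names, metric names, values, and output file are hypothetical placeholders, not the repository's actual result format.

```python
import csv

# Hypothetical per-dataset results; the real utils/ scripts collect these
# from the evaluation outputs.
results = {
    "dataset-1": {"coverage": 0.0, "accuracy": 0.0},
    "dataset-2": {"coverage": 0.0, "accuracy": 0.0},
}

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dataset", "coverage", "accuracy"])
    for dataset, metrics in results.items():
        writer.writerow([dataset, metrics["coverage"], metrics["accuracy"]])
```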