Welcome to Hyphe: the Hypertext Corpus Initiative (HCI) project, developed by SciencesPo's médialab for the DIME-SHS Equipex project.
Hyphe aims at providing a tool to crawl data from the web and generate networks between what we call WebEntities, which can be single pages as well as whole websites or any combination of such.
The project relies on the following main components:
- memory_structure: a Java Lucene instance to store the structure and links of the crawled data
- hyphe_backend: Python controllers for crawling and the backend API, with a MongoDB buffer database to store crawled data
  - core: Twisted-based JSON-RPC API controller
  - crawler: Scrapy spider
  - lib: shared libraries
- hyphe_www_client: a JavaScript web interface to constitute and explore web corpora through the backend API
Note: since MongoDB is limited to 2 GB databases on 32-bit systems, it is recommended to always install Hyphe on a 64-bit machine.
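You can check your machine's architecture with the following command, which prints x86_64 on a 64-bit system:
uname -m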
Warning: Hyphe's current release does not support multiple corpora. This feature will come soon, but for now building a different corpus requires either installing a new instance or stopping work on the previous corpus and changing the settings to point to a different MongoDB collection and Lucene directory.
As everywhere in this README, every example command shown here should be run from Hyphe's root directory.
For an easy install, the best solution is to directly download the release version, which was built to run on various GNU/Linux distributions (Ubuntu, Debian, CentOS, ...). Just uncompress the release archive, go into the directory and run the installation script. It will ask for sudo rights once and install any missing packages, including Java (OpenJDK-6-JRE), Python (python-dev, pip, virtualenv, virtualenvwrapper), Apache2, PHP5, MongoDB, ScrapyD... If you do not feel comfortable with this, read the script and run the steps line by line, or follow the advanced install instructions below for more control over what is actually installed.
tar xzvf hyphe-release-*.tar.gz
cd Hyphe
./bin/install.sh
The default configuration should work out of the box for a local install (i.e. running on http://localhost/hyphe), but you may want to provide a few finer settings. You can configure Hyphe's options by editing config/config.json. Default options should fit most cases. Important options to set depending on your situation are:
- mongo-scrapy - proxy_host/proxy_port: in case you want the crawler to make its requests through a proxy
- mongo-scrapy - project: in case you want to run multiple Hyphe instances on the same machine
- mongo-scrapy - max_depth: the maximum depth allowed to the users for each individual crawl (meaning the number of clicks to be followed within a crawled WebEntity)
- mongo-scrapy - download_delay: the time (in seconds) after which the crawler will time out on a webpage
- memoryStructure - lucene.path: the directory in which the memoryStructure's data will be stored (can grow to a few gigabytes)
- memoryStructure - log.level: set to WARN or DEBUG to get more log within Lucene's log/memory-structure.log
- twisted - port: the port through which the server and the web interface will communicate (warning: this setting must be replicated for the client in hyphe_www_client/_config/config.js)
- precisionLimit: the maximum precision to keep on links between crawled webpages, the value being the number of slashes after the root prefix of a WebEntity (see the wiki for more info)
- DEBUG: a value from 0 to 2 indicating the level of verbosity desired from the API core in log/hyphe-core.log
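After editing the configuration, a quick sanity check (assuming the default file layout described above) is to verify that the twisted port matches the endpoint declared on the client side:
grep '"port"' config/config.json
grep 'SERVER_ADDRESS' hyphe_www_client/_config/config.js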
Hyphe relies on a web interface communicating with a server which must be running at all times. To start, stop or restart the server, run:
bin/hyphe <start|restart|stop>
As soon as it is running, you can visit the web interface on your local machine at the following URL: http://localhost/hyphe.
You can check the logs in log/hyphe-core.log and log/hyphe-memorystructure.log:
tail -f log/hyphe-*.log
To run on a server and not only locally, a few adjustments need to be performed:
- Adapt your Apache configuration in hyphe_www_client/_config/apache2.conf with your personal settings (ServerName, ...)
- Adapt the web interface API endpoint in hyphe_www_client/_config/config.js by replacing localhost with the actual domain name, for instance:
"SERVER_ADDRESS":"http://www.example.com:6978",
- If your server's security restricts access to the API port, you may want either to open the port in your proxy or iptables, or to add a ProxyPass in your Apache configuration to redirect the port to a specific URL:
- Install Apache's proxy module
sudo apt-get install libapache2-mod-proxy-html
sudo a2enmod proxy
sudo a2enmod proxy_http
sudo service apache2 restart
- Add to the Apache configuration in hyphe_www_client/_config/apache2.conf the ProxyPass directives redirecting the port to the desired URL, for instance:
<Location /hyphe-api>
ProxyPass http://www.example.com:6978/
ProxyPassReverse http://www.example.com:6978/
</Location>
- Change the API endpoint in hyphe_www_client/_config/config.js to the newly defined URL, for instance:
"SERVER_ADDRESS":"http://www.example.com/hyphe-api",
As everywhere in this README, every example command shown here should be run from Hyphe's root directory.
git clone https://github.com/medialab/Hypertext-Corpus-Initiative Hyphe
cd Hyphe
From here on, you can also run bin/install.sh to go faster, as with the release, or follow the next steps.
MongoDB (a NoSQL database server), ScrapyD (a crawler framework server) and Java (plus Thrift for contributors/developers) are required for the backend to work. Below is an example of how to install them all on an Ubuntu machine.
All of these steps can be adapted to Debian and CentOS, as can be seen in the bin/install.sh script.
Install the basic requirements if they are missing:
sudo apt-get install curl wget git python-dev python-pip apache2 php5
Or on CentOS:
sudo yum install curl git python-devel python-setuptools python-pip httpd php
- Edit your package manager source list to include official repositories for MongoDB and ScrapyD:
# Install the GPG keys for these package repositories:
curl -s http://docs.mongodb.org/10gen-gpg-key.asc | sudo apt-key add -
curl -s http://archive.scrapy.org/ubuntu/archive.key | sudo apt-key add -
# Edit the /etc/apt/sources.list with your favorite text editor and add the following lines if they are not already present:
deb http://archive.scrapy.org/ubuntu $(lsb_release -cs) main
deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen
# Update the sources list
sudo apt-get update
# Install MongoDB and ScrapyD with pymongo
sudo apt-get install mongodb-10gen
sudo pip install pymongo
sudo apt-get install scrapyd
- In CentOS, this is slightly more complex:
- To add the MongoDB repository to yum, do the following:
echo "[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1" > mongodb.repo.tmp
sudo mv mongodb.repo.tmp /etc/yum.repos.d/mongodb.repo
# Then update yum's source list and install:
sudo yum check-update
sudo yum install mongo-10gen mongo-10gen-server
sudo pip install pymongo
- There is no official package for ScrapyD on CentOS yet, so we built one specifically, which you can install as follows:
sudo pip -q install Scrapy==0.18
wget https://github.com/medialab/scrapyd/raw/medialab/rpms/scrapyd-1.0.1-2.el6.x86_64.rpm
sudo rpm -i scrapyd-1.0.1-2.el6.x86_64.rpm
- You can test whether ScrapyD is properly installed and running by accessing the following URL: http://localhost:6800/listprojects.json, which should return a JSON object with status "ok".
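For instance from the command line:
curl http://localhost:6800/listprojects.json
# should answer something like {"status": "ok", "projects": [...]}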
- Default settings are usually all right, but you can adapt MongoDB's and ScrapyD's default configurations to your needs if you like. Full documentation is available for MongoDB here and for ScrapyD there. Remember to restart the services after any configuration change:
sudo service mongodb restart # Warning: under CentOS, mongodb is called mongod
sudo service scrapyd restart
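You can also quickly check that MongoDB answers queries from its shell:
mongo --eval "db.stats()"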
- If you want, you can also optionally install RockMongo, a PHP web admin interface, to easily browse MongoDB's databases in a browser.
2.3) Prepare the Java / Thrift environment:
Any instance requires at least the Java JRE 6 installed. You can test it by running java -version and, in case it is missing, run:
sudo apt-get install openjdk-6-jre
# Or in CentOS:
# sudo yum install java-1.6.0-openjdk
Hyphe uses Thrift version 0.8 to ensure the communication between the Python Twisted core and the Java Lucene memoryStructure. However, installing Thrift is only required for developers to build the jar archive, which is already included in the downloadable release: you can skip the following steps if you are using the release.
To install and use Thrift, one first needs to install the Java JDK, ant and maven:
sudo apt-get install build-essential openjdk-6-jdk ant maven
In CentOS this is slightly more complex since there is no maven package:
sudo yum install java-1.6.0-openjdk-devel ant
wget http://www.eu.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
tar xvf apache-maven-3.1.1-bin.tar.gz
sudo cp -r apache-maven-3.1.1 /usr/local/maven
echo 'export M2_HOME=/usr/local/maven
export PATH=${M2_HOME}/bin:${PATH}' > /tmp/maven.sh
sudo cp /tmp/maven.sh /etc/profile.d/maven.sh
source /etc/profile.d/maven.sh
Then download and install Thrift:
wget http://archive.apache.org/dist/thrift/0.8.0/thrift-0.8.0.tar.gz
tar xvf thrift-0.8.0.tar.gz
cd thrift-0.8.0
./configure --with-java --without-erlang --without-php
make
sudo make install
cd ..
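You can then verify the install with the following command, which should print "Thrift version 0.8.0":
thrift -version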
It is recommended to use virtualenv with virtualenvwrapper:
sudo pip install virtualenv
sudo pip install virtualenvwrapper
source $(which virtualenvwrapper.sh)
mkvirtualenv --no-site-packages HCI
workon HCI
pip install -r requirements.txt
add2virtualenv $(pwd)
deactivate
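As a quick sanity check, you can verify that the environment resolves the main Python dependencies (assuming requirements.txt pulls in at least Twisted and pymongo, as the components listed above suggest):
workon HCI
python -c "import twisted, pymongo"
deactivate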
3.0) Compile the Java Lucene memoryStructure (only if installing from the repository instead of the release):
This will need to be run again every time the Java code in the memory_structure directory is modified.
bin/build_thrift.sh
- Deploy Hyphe's scrapyd config:
sudo ln -s `pwd`/config/scrapyd.config /etc/scrapyd/conf.d/100-hyphe
sudo service scrapyd restart
- Define the server path in the starter:
sed "s|##HCIPATH##|"`pwd`"|" bin/hyphe.example > bin/hyphe
chmod +x bin/hyphe
- Copy and adapt the sample config.json.example to config.json in the config directory:
sed "s|##HCIPATH##|"`pwd`"|" config/config.json.example > config/config.json
- Adjust the settings as explained in section 2) Configure.
- Create the lucene-data directory defined in config/config.json (adapting to your possible modifications):
mkdir -p lucene-data
- Copy and adapt the sample _config_default directory to _config in the hyphe_www_client directory:
cp -r hyphe_www_client/_config{_default,}
- Prepare Hyphe's Apache configuration:
sed "s|##HCIPATH##|"`pwd`"|" hyphe_www_client/_config/apache2_example.conf |
  sed "s|##WEBPATH##|/hyphe|" > hyphe_www_client/_config/apache2.conf
- Install the VirtualHost:
sudo ln -s `pwd`/hyphe_www_client/_config/apache2.conf /etc/apache2/sites-available/hyphe.conf
sudo a2ensite hyphe.conf
sudo service apache2 reload
Or in CentOS:
sudo ln -s `pwd`/hyphe_www_client/_config/apache2.conf /etc/httpd/conf.d/hyphe.conf
sudo service httpd reload
This will first install Hyphe locally only: http://localhost/hyphe. The page should be accessible even though the website will not work yet (since we have not started the server; see next section). To run the website at a URL on a server, a few more adjustments are required; see the related section above.
If the page is inaccessible and Apache says "403 Forbidden", you probably have permission issues. Apache's group (usually www-data, apache or httpd) needs read access to Hyphe's installation directory:
sudo chmod -R g+rx $(pwd)
sudo chown -R :www-data $(pwd)
On some distributions, if you installed from a /home directory, you may need to apply this to your /home/<USER> directory as well. Or you can move the current install to another directory (/srv, /opt, ...), give it the rights and redo the above parts involving the PATH.
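For instance, a possible way to relocate an install (illustrative target path; adapt to your setup):
sudo mv "$(pwd)" /srv/Hyphe
cd /srv/Hyphe
sudo chown -R :www-data .
sudo chmod -R g+rx .
# then regenerate bin/hyphe, config/config.json and the Apache configuration
# by re-running the sed commands from the sections above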
See related sections in the simple install.
Hyphe relies on a JsonRPC API that can be controlled easily through the web interface or called directly from a JsonRPC client. It can be started separately as follows:
bin/start_standalone_core.sh
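For instance, once the core is running, any HTTP client can send it a raw JSON-RPC call (a sketch only: the exact payload shape depends on your JSON-RPC client, and the port is the twisted port set in config/config.json, 6978 in the examples above):
curl -X POST -d '{"method": "get_status", "params": [], "id": 1}' http://localhost:6978/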
Not all of Hyphe's functionalities are available from the web interface yet. However, some advanced routines (like starting a set of crawls on all IN WebEntities, for instance) can already be performed on the command line with the API. The script hyphe_backend/test_client.py is a command-line caller of the core API. For instance:
source $(which virtualenvwrapper.sh)
workon HCI
./hyphe_backend/test_client.py get_status
./hyphe_backend/test_client.py declare_page http://medialab.sciences-po.fr
./hyphe_backend/test_client.py declare_pages array http://medialab.sciences-po.fr http://www.sciences-po.fr
./hyphe_backend/test_client.py store.get_webentities
./hyphe_backend/test_client.py store.get_webentities array WE_ID_1 WE_ID_2 WE_ID_3
./hyphe_backend/test_client.py inline store.get_webentities
./hyphe_backend/test_client.py crawl_webentity WE_ID
Multiple examples of advanced routines run directly via the shell using the command-line client can be found in bin/samples/.
The API functions are described in the Wiki.
Hyphe's memory structure is a Java/Lucene-based server which needs to run in the background whenever Hyphe is being used.
It can simply be started thanks to the shell scripts in bin:
bin/start_standalone_lucene.sh
Whenever the code in memory_structure is modified, the Java archive running the memory structure needs to be rebuilt:
bin/build_thrift.sh
To adapt the API commands callable through Thrift, edit the files src/main/java/memorystructure.thrift and src/main/java/fr/sciencespo/medialab/hci/memorystructure/MemoryStructureImpl.java.
Hyphe's crawler is implemented as a Scrapy spider which needs to be deployed on the ScrapyD server (the core API takes care of this every time it is restarted; more information here). It can be deployed as follows for debug purposes:
bin/deploy_scrapy_spider.sh
Whenever the config.json file or the code in hyphe_backend/crawler or hyphe_backend/lib/urllru.py is modified, the spider needs to be redeployed on the ScrapyD instance for the changes to apply. Restarting the core server will do so in any case.