Index/query scripts for IntEnz enzyme dataset

  • index.py: Indexes IntEnz XML files; tested with the IntEnz December 2019 release

      $ ./nosqlbiosets/intenz/index.py --help
      usage: index.py [-h] [-infile INFILE] [--index INDEX] [--doctype DOCTYPE]
                      [--host HOST] [--port PORT] [--db DB]
      
      Index IntEnz xml files, with Elasticsearch, MongoDB or Neo4j
      
      optional arguments:
        -h, --help            show this help message and exit
        -infile INFILE, --infile INFILE
                              Input file name (intenz/ASCII/intenz.xml)
        --index INDEX         Name of the Elasticsearch index or MongoDB database
        --doctype DOCTYPE     Document type name for Elasticsearch, collection name
                              for MongoDB
        --host HOST           Elasticsearch, MongoDB or Neo4j server hostname
        --port PORT           Elasticsearch, MongoDB or Neo4j server port
        --db DB               Database: 'Elasticsearch', 'MongoDB' or 'Neo4j'
    
  • query.py: Query API (naive and not comprehensive); most queries are implemented for MongoDB, a few for Neo4j

      $ ./nosqlbiosets/intenz/query.py --help
      usage: query.py [-h] [--limit LIMIT] qc outfile
      
      Save IntEnz reaction connections as graph files
      
      positional arguments:
        qc             MongoDB query clause to select subsets of IntEnz entries,
                       e.g.: '{"reactions.label.value": "Chemically balanced"}'
        outfile        File name for saving the output graph. Format is selected
                       based on the file extension of the given output file; .xml
                       for GraphML, .gml for GML, .json for Cytoscape.js, or
                       .d3js.json for d3js format
      
      optional arguments:
        -h, --help     show this help message and exit
        --limit LIMIT  Maximum number of enzyme-metabolite connections
    
    ./nosqlbiosets/intenz/query.py '{"reactions.label.value": "Chemically balanced"}'\
      balanced-reactions.xml --limit 800
    
    ./nosqlbiosets/intenz/query.py '{"cofactors.#text": "Pyrroloquinoline quinone"}'\
      cofactors.json
    
    ./nosqlbiosets/intenz/query.py '{"$text": {"$search": "poly(A)"}}' polyA.json

  • tests.py: Tests with the query API
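The query.py help text says the output format is chosen from the file extension and that `qc` is a MongoDB query clause. The following is a minimal sketch of how such a check could look, not the code used by query.py; the names `FORMATS`, `graph_format` and `parse_query_clause` are hypothetical. It assumes the longest suffix must be tested first so that `.d3js.json` is not mistaken for plain `.json`.

```python
import json

# Hypothetical suffix-to-format table, taken from the --help text of query.py.
# ".d3js.json" is listed before ".json" so the more specific suffix wins.
FORMATS = [
    (".d3js.json", "d3js"),
    (".xml", "GraphML"),
    (".gml", "GML"),
    (".json", "Cytoscape.js"),
]


def graph_format(outfile):
    """Return the graph format implied by the output file name."""
    for suffix, fmt in FORMATS:
        if outfile.endswith(suffix):
            return fmt
    raise ValueError("Unsupported output file extension: " + outfile)


def parse_query_clause(qc):
    """Check that a query clause string is a JSON object and return it as a dict."""
    clause = json.loads(qc)
    if not isinstance(clause, dict):
        raise ValueError("Query clause must be a JSON object")
    return clause


if __name__ == "__main__":
    print(graph_format("balanced-reactions.xml"))
    print(graph_format("polyA.d3js.json"))
    print(parse_query_clause('{"$text": {"$search": "poly(A)"}}'))
```

Checking the clause with `json.loads` before running the script can catch shell quoting mistakes early, since the clause has to reach query.py as a single argument.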

Example graph

Example command lines for indexing

Default server connection settings are read from ../../conf/dbservers.json
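As an illustration of how defaults from a settings file can combine with the `--host` and `--port` options, here is a small sketch. The layout of dbservers.json is not shown in this README, so the structure below (one object per database with `host` and `port` keys) is an assumption, and `connection_settings` is a hypothetical helper, not part of nosqlbiosets.

```python
import json


def connection_settings(defaults, db, host=None, port=None):
    """Merge per-database defaults with optional command-line overrides."""
    settings = dict(defaults.get(db, {}))  # copy, so the defaults stay untouched
    if host is not None:
        settings["host"] = host
    if port is not None:
        settings["port"] = port
    return settings


# Stand-in for the contents of conf/dbservers.json; the real file may differ.
ASSUMED_DEFAULTS = json.loads(
    '{"MongoDB": {"host": "localhost", "port": 27017},'
    ' "Elasticsearch": {"host": "localhost", "port": 9200}}'
)

if __name__ == "__main__":
    # A --port given on the command line would take precedence over the default.
    print(connection_settings(ASSUMED_DEFAULTS, "MongoDB", port=27018))
```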

# Download the IntEnz XML file

wget -nc -P ./data http://ftp.ebi.ac.uk/pub/databases/intenz/xml/ASCII/intenz.xml

# Index with Elasticsearch (takes roughly 5 to 15 minutes)
./nosqlbiosets/intenz/index.py --db Elasticsearch --infile ./data/intenz.xml\
 --index intenz

# Index with MongoDB (about 1 minute with a local server, about 12 minutes with MongoDB Atlas)
./nosqlbiosets/intenz/index.py --db MongoDB --infile ./data/intenz.xml

# Index with Neo4j (takes about 12 minutes)
./nosqlbiosets/intenz/index.py --db Neo4j --infile ./data/intenz.xml