
Setup and Instructions


Setup

Getting code and required modules

(Requires Python >= 3.6, pip, and virtualenv)

git clone git@github.com:dmwm/cms-htcondor-es.git -b master
cd cms-htcondor-es
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

You will also need to set up a collectors JSON file that maps each pool to its list of collectors.

{
    "Global": ["thefirstcollector.example.com:8080", "thesecondcollector.other.com"],
    "Volunteer": ["next.example.com"],
    "ITB": ["newcollector.virtual.com"]
}

At this point a dry run should complete without errors. (This only queries the collector for a list of schedds, but doesn't actually query them or upload anything.)

python spider_cms.py --process_queue --dry_run --collectors_file <Path to>/collectors.json

Authentications

CERN MONIT:

Create username and password files in the cms-htcondor-es directory containing the correct authentication credentials obtained from CERN MONIT.
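For example (a minimal sketch; the file names username and password are assumed here, check what your spider_cms configuration actually expects, and never commit these files):

cd /path/to/cms-htcondor-es
# Assumed file names; store the credentials obtained from CERN MONIT.
echo -n 'monit-username' > username
echo -n 'monit-password' > password
# Restrict read access to the credential files.
chmod 600 username password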

es-cms:

Create a file called es.conf with the following format (and the corresponding credentials):

User: <username>
Pass: <password>

Running

Run the full query and upload machinery with the current options. (Add --read_only to only query, without uploading.)

python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8

The script first queries three condor collectors for a list of schedds (removing duplicates); currently 63 schedds are processed. The history of each schedd is then queried for completed jobs and the resulting documents are uploaded. When that is finished, the condor queue of each schedd is queried for running and pending jobs, and those documents are uploaded. See below for a description of the parallelization of these tasks.

Setting up cron job

The easiest approach is to create a spider_cms.sh script with the corresponding setup and options, e.g.:

#!/bin/bash
cd /path/to/cms-htcondor-es/
source venv/bin/activate
python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8 --collectors_file etc/collectors.json

Edit your crontab with crontab -e and add the following line:

*/12 * * * * /path/to/cms-htcondor-es/spider_cms.sh

You will also need to set up a cron job to update the affiliation_dir file; otherwise the jobs will not include affiliation information (or it will be stale).

0 3 * * * /bin/bash "/home/cmsjobmon/cms-htcondor-es/cronAffiliation.sh"

Finally, it can be useful to have email alerts for failing queries and timeouts. They are set up with the --email_alerts <address> option. (Currently only a single recipient is possible, but you can use an e-group.)
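For example, the recipient can simply be appended to the usual invocation (the address below is a placeholder):

python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8 --collectors_file etc/collectors.json --email_alerts monitoring-team@example.com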

Debug, testing, and tuning

There are several useful options for debugging and testing; an example combining some of them follows the list:

  • --dry_run only queries the collector for a list of schedds and skips both the schedd queries and the uploading
  • --read_only does the queries but skips the uploading
  • --schedd_filter processes only a (comma-separated) list of schedds
  • --skip_history processes only the queue data
  • --log_level set to INFO or DEBUG gives additional output about internal queueing, bunching, and uploading.
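For instance, a debugging run against a single schedd, without uploading anything and with verbose output, could look like this (the schedd name is a placeholder):

python spider_cms.py --process_queue --read_only --schedd_filter someschedd.example.com --log_level DEBUG --collectors_file <Path to>/collectors.json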

For tuning the performance, several options are available; an example invocation follows the list:

  • Pool sizes: --upload_pool_size and --query_pool_size define the number of concurrent processes for uploading and querying, respectively. The default is 8 for each, but we are currently running smoothly with a query pool of 16 and an upload pool of 8 on a machine with 16 cores. Note: the query pool is also used for processing the condor history, i.e. that option also determines the number of parallel processes querying and uploading documents for completed jobs.
  • Upload bunching: --amq_bunch_size and --es_bunch_size define the size of the bunches that are sent to AMQ and Elasticsearch, respectively. The current default is 5000 documents for AMQ and 250 for ES, which takes about 1 second from the CERN-based VM, and took about 15 seconds from the UNL-based VM.
  • Internal bunching: --query_queue_batch_size defines the size of batches of documents sent from the query processes to the internal process assembling bunches for uploading. Current default is 50.
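For example, an invocation that spells out the bunching values explicitly (using only the options described above, with their current default or production values) might look like:

python spider_cms.py --feed_amq --process_queue --query_pool_size 16 --upload_pool_size 8 --amq_bunch_size 5000 --es_bunch_size 250 --query_queue_batch_size 50 --collectors_file etc/collectors.json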

Parallelization

The current setup for processing the queues is as follows: 16 parallel processes each query the queue of a single schedd for running and pending jobs. After obtaining 50 documents, a process sends them to an input queue and continues querying until all jobs have been processed. An internal process takes these batches of job documents from the input queue and assembles them into bunches of 5000 documents, which are sent to one of 8 parallel upload processes. The script shuts down when either all jobs have been processed and uploaded, or after 11 minutes of running time.

The histories are processed more simply. Each one of 8 parallel processes queries a schedd for documents. After reaching 5000 documents (configurable with --bunching), they are uploaded. When the upload finishes, the query continues until all jobs and all schedds have been processed.
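If you want to experiment with the history bunch size, the flag mentioned above can be passed explicitly (assuming it takes the number of documents per upload as its value), e.g.:

python spider_cms.py --feed_amq --query_pool_size 8 --bunching 5000 --collectors_file etc/collectors.json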

Environment variables

You can use the following environment variables to switch between the production and test MONIT flows:

export CMS_HTCONDOR_PRODUCER="condor-test"
export CMS_HTCONDOR_TOPIC="/topic/cms.jobmon.condor"
export CMS_HTCONDOR_BROKER="<TEST BROKER ADDRESS>"

See CMSMonitoring-Data-Flow-Test-procedure