WMAgent

Diego Ballesteros edited this page Jul 9, 2013 · 19 revisions

The Basics

The WMAgent software is a distributed component of the production system. In a nutshell, its functions are:

  • Splitting WorkQueue elements into smaller basic work units, known as jobs.
  • Creating jobs and controlling the flow of work according to the tasks defined in the workload of a request.
  • Submitting jobs to a batch system (e.g. HTCondor, LSF).
  • Tracking the submitted jobs and keeping tabs on their outcome.
  • Registering the produced data into the CMS catalogs (i.e. DBS2/3, PhEDEx).

The WMAgent relies on two services for its operation:

  • A relational database to keep the WMAgent state, known as WMBS.
  • A non-relational database for monitoring and document storage. The current implementation uses CouchDB.

The WMAgent is made up of threaded WMComponents which function independently and use WMBS and CouchDB as their sources of information. Some of them also interact with external services such as PhEDEx, ReqMgr, DBS, WorkQueue, or SiteDB.

The next sections will describe in detail each one of the databases and components in the system.

The Databases

WMBS (Workload Management Bookkeeping System)

WMBS is a relational database which contains most of the operational information of the WMAgent. It is strictly required for the operation of all WMComponents, and its absence will trigger an immediate crash in any of them. WMBS makes the WMAgent a state machine in its basic definition.

WMCore supports WMBS with Oracle or MySQL as the DBMS. In most cases both execute the same queries; however, some queries require different versions depending on the DBMS.

Developer note: When writing queries in WMCore for relational databases, try to write a single query that is compatible with both MySQL and Oracle. Exceptions are insertions, deletions, table creation, and cases where using DBMS-specific features makes the query much more efficient, especially for Oracle, which is used for WMAgents handling higher loads.
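One way to picture this guideline is a lookup that prefers a DBMS-specific query and otherwise falls back to a shared one. The following is only a minimal sketch: the registry, the `get_query` helper, the query names, and the column names are all hypothetical and are not taken from WMCore.

```python
# Illustrative sketch only: the registry, helper, query names and column
# names are assumptions made for this example, not WMCore code.

# Queries written once, in SQL that both MySQL and Oracle accept.
SHARED = {
    "fileset_by_name": "SELECT id FROM wmbs_fileset WHERE name = :name",
}

# Dialect-specific versions, kept only where a single query is not practical
# (here an insertion that must skip rows that already exist).
DIALECT = {
    "mysql": {
        "add_file_to_fileset": (
            "INSERT IGNORE INTO wmbs_fileset_files (fileid, fileset) "
            "VALUES (:fileid, :fileset)"
        ),
    },
    "oracle": {
        "add_file_to_fileset": (
            "INSERT INTO wmbs_fileset_files (fileid, fileset) "
            "SELECT :fileid, :fileset FROM dual WHERE NOT EXISTS "
            "(SELECT 1 FROM wmbs_fileset_files "
            "WHERE fileid = :fileid AND fileset = :fileset)"
        ),
    },
}


def get_query(name, dialect):
    """Return the dialect-specific SQL if one exists, else the shared SQL."""
    specific = DIALECT.get(dialect, {})
    if name in specific:
        return specific[name]
    return SHARED[name]  # raises KeyError for unknown query names
```

With this layout a shared query is written and maintained in one place, and a dialect-specific entry is added only when one of the exceptions above applies.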

The tables in WMBS can be divided into four categories:

WMBS core

The queries for the following tables are mostly defined in the WMBS package. Below is an explanation of each table and the objects it stores; the creation statements for all tables in MySQL and Oracle can be found in the repository. The tables are listed in dependency order, so a table appears only after all the tables it depends on have been defined. At the end of the section there is a grouping by objects, which may be more natural for the reader.

  • wmbs_fileset: This table contains the information of a fileset. A fileset is a named set of files which can be the input or output of a particular task. A fileset can be open or closed; a closed fileset is guaranteed to be complete and will not have new files associated with it in the future.

  • wmbs_file_details: This table contains the basic information about a file, which includes:

    • LFN: Logical File Name
    • Size: Size in bytes
    • Events: Number of events
    • First event: Number of the first event in the file
    • Merged: An indicator of whether this is a merged file (i.e. to be stored permanently) or not.
  • wmbs_fileset_files: This table holds the associations between files and filesets. The association is many to many, i.e. a fileset can have many files and a file can be in many filesets. It also holds a timestamp of when the association was made.

  • wmbs_file_parent: This table contains another piece of information about files, it stores the parentage relationships between them. A file is said to be the parent of another file in WMBS if the parent file was one of the inputs for the job that produced the child file.

  • wmbs_file_runlumi_map: This table contains the list of runs and lumisections present in the files of wmbs_file_details. Each row contains a file id, a run number and a lumisection.

  • wmbs_checksum_type: This table contains auxiliary information about the possible checksum types used in other WMBS tables. Currently it stores 3 types: checksum, adler32 and md5.

  • wmbs_file_checksums: This table contains the checksums for the files in wmbs_file_details. It can store checksums of different types (defined in the previous table) for each file.

  • wmbs_location_state: This table contains the possible states for a site in WMBS. These are:

    • Normal: A normal functioning site, it can be considered as a valid location for job submission.
    • Draining: A draining site, it can be considered for job submission only for jobs that can't run anywhere else. WorkQueue elements will not be acquired for this site.
    • Down: A site with errors or in downtime, no new jobs will be submitted for this site in the current state. The jobs waiting to be submitted that can only run at this site will wait until the state changes. WorkQueue elements will not be acquired for this site.
    • Aborted: A broken site. No new jobs will be submitted to a site in the Aborted state; additionally, any job pending in the batch system or waiting for submission that can only run at the site will be killed and failed without retries. WorkQueue elements will not be acquired for this site.
  • wmbs_location: This table holds the basic information about sites in WMBS. It contains the following fields:

    • site_name: Identifier for the site in WMBS
    • cms_name: CMS name for the site
    • ce_name: Name of the Computing Element for the site
    • running_slots: Number of jobs that can be concurrently running at the site.
    • pending_slots: Desired number of jobs to keep in pending state for the site.
    • plugin: Batch system for submission to this site.
    • state: State of the site as defined in the previous table.

Developer note: This table has problems: the site_name, cms_name and ce_name are always the same. The ce_name is not used at all, and several places in the WMAgent would have a problem if the cms_name differed from the site_name. This should be re-evaluated and reorganized.
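The four site states described for wmbs_location_state can be summarised as a small set of rules. The sketch below is an illustration of those rules only; the `SiteState` enum and the helper functions are hypothetical names introduced here and do not come from the WMAgent code.

```python
from enum import Enum


class SiteState(Enum):
    """Site states as listed for wmbs_location_state (names are illustrative)."""
    NORMAL = "Normal"
    DRAINING = "Draining"
    DOWN = "Down"
    ABORTED = "Aborted"


def acquires_workqueue_elements(state):
    """WorkQueue elements are only acquired for Normal sites."""
    return state is SiteState.NORMAL


def can_submit(state, can_run_elsewhere):
    """Decide whether a new job may be submitted to a site in this state."""
    if state is SiteState.NORMAL:
        return True
    if state is SiteState.DRAINING:
        # Draining sites only take jobs that have no other place to run.
        return not can_run_elsewhere
    # Down and Aborted sites receive no new jobs.
    return False


def kills_site_bound_jobs(state):
    """Only Aborted sites have their pending or waiting jobs killed and failed."""
    return state is SiteState.ABORTED
```

For example, `can_submit(SiteState.DRAINING, can_run_elsewhere=False)` is true, while the same job on a Down site simply waits until the state changes.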

  • wmbs_file_location: This table contains the information about the location of the files. Each row associates a file with a site. A site can have many files and a file can be at many sites.

  • wmbs_users: This table holds the basic information about users in WMBS. It contains the following fields:

    • cert_dn: Registered DN for the user.
    • name_hn: Username in Hypernews for the user.
    • owner: Same as name_hn.
    • grp: Group which the user is a member of.
    • group_name: VOMS group for the user.
    • role_name: VOMS role for the user.

Developer note: This table is a mess; it has caused problems in the past and is outdated at the moment. First, the VOMS fields have never been used, and there is no practical use for this information in production workflows, since all real user-group handling should happen only in the ReqMgr. Also, name_hn and owner are the same, but only owner is used. Removing the table would be a good idea, but it requires changes all across the WMAgent.
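The many-to-many associations described earlier for wmbs_fileset_files, together with the parentage stored in wmbs_file_parent, can be illustrated with a small in-memory database. This is a simplified sketch using SQLite: the table and column names are reduced stand-ins chosen for the example, not the real WMBS schema, and the LFNs are made up.

```python
import sqlite3

# Simplified stand-ins for the WMBS tables; names and columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_details (id INTEGER PRIMARY KEY, lfn TEXT UNIQUE);
CREATE TABLE fileset (id INTEGER PRIMARY KEY, name TEXT UNIQUE, is_open INTEGER);
CREATE TABLE fileset_files (
    fileid INTEGER REFERENCES file_details(id),
    fileset INTEGER REFERENCES fileset(id),
    insert_time INTEGER,
    PRIMARY KEY (fileid, fileset)
);
CREATE TABLE file_parent (
    child INTEGER REFERENCES file_details(id),
    parent INTEGER REFERENCES file_details(id),
    PRIMARY KEY (child, parent)
);
""")

conn.executemany("INSERT INTO file_details (id, lfn) VALUES (?, ?)",
                 [(1, "/store/raw.root"), (2, "/store/reco.root")])
conn.executemany("INSERT INTO fileset (id, name, is_open) VALUES (?, ?, ?)",
                 [(1, "input", 0), (2, "archive", 1)])
# File 1 belongs to two filesets; fileset 2 holds two files.
conn.executemany("INSERT INTO fileset_files VALUES (?, ?, ?)",
                 [(1, 1, 100), (1, 2, 101), (2, 2, 102)])
# File 1 was an input of the job that produced file 2.
conn.execute("INSERT INTO file_parent (child, parent) VALUES (?, ?)", (2, 1))


def filesets_of(lfn):
    """Names of all filesets that contain the given file."""
    rows = conn.execute(
        "SELECT fs.name FROM fileset fs "
        "JOIN fileset_files ff ON ff.fileset = fs.id "
        "JOIN file_details fd ON fd.id = ff.fileid "
        "WHERE fd.lfn = ? ORDER BY fs.name", (lfn,))
    return [row[0] for row in rows]


def parents_of(lfn):
    """LFNs of the files that were inputs to the job producing this file."""
    rows = conn.execute(
        "SELECT p.lfn FROM file_parent fp "
        "JOIN file_details c ON c.id = fp.child "
        "JOIN file_details p ON p.id = fp.parent "
        "WHERE c.lfn = ? ORDER BY p.lfn", (lfn,))
    return [row[0] for row in rows]
```

The file-to-site association in wmbs_file_location follows the same shape as `fileset_files`: one row per (file, site) pair.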

  • wmbs_workflow: This table contains the basic information about workflows. A workflow is defined as a collection of tasks and describes all the steps in a request; this table holds one row for each task in a request. The fields stored in this table are:
    • name: Name of the workflow, which is the same as the corresponding request.
    • spec: Path to the spec file which contains the workload for the request.
    • task: Task represented in this entry.
    • type: Type reported to dashboard for this workflow.
    • owner: User that created this workflow, it points to a record in the users table.
    • injected: Indicates whether the workflow is fully injected. A workflow is fully injected when all the WorkQueue elements for the request have been injected into WMBS (in any of the agents).
    • alt_fs_close: Internal indicator that replaces injected for fileset closing purposes. It is only used in the WMAgent Tier-0; see the Tier-0 documentation for an explanation.
    • priority: Priority of the workflow, it is equal to the priority of the request in ReqMgr.

See the ReqMgr section for the definitions of request, workflow, task, workload, etc.

  • wmbs_workflow_output: This table associates tasks (i.e. rows from wmbs_workflow) with their output filesets. Each task can have a merged and an unmerged output fileset.

  • wmbs_sub_types: This auxiliary table contains the different subscription types in WMBS, these are:

    • Processing
    • Production
    • Merge
    • Cleanup
    • LogCollect
    • Harvesting
    • Skim

    The subscription types have a numerical priority value which serves as a modifier of the job priority.

  • wmbs_subscription: This table holds information about subscriptions. A subscription is a pairing of a workflow with an input fileset; it defines which work should be performed on which input. Additionally, a subscription entry indicates the type of subscription, the splitting algorithm for the jobs, and whether it is finished or not. A subscription is considered finished when all of the following conditions are true:

    • The input fileset is closed and the workflow is injected.
    • All files in the input fileset have been acquired by a job in the subscription.
    • All jobs related to the subscription are in cleanout state.
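The three conditions above can be written as a single predicate. This is a minimal sketch under stated assumptions: `SubscriptionSnapshot`, its field names, and the `"cleanout"` state string are illustrative, and the real check in WMBS is done against the database rather than an in-memory object.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubscriptionSnapshot:
    """Hypothetical in-memory view of one subscription and its context."""
    fileset_open: bool        # is the input fileset still open?
    workflow_injected: bool   # is the workflow fully injected?
    files_in_fileset: int     # files in the input fileset
    files_acquired: int       # files already acquired by a job
    job_states: List[str] = field(default_factory=list)


def is_finished(sub):
    """Apply the three completion conditions to a snapshot."""
    input_complete = (not sub.fileset_open) and sub.workflow_injected
    all_acquired = sub.files_acquired == sub.files_in_fileset
    all_cleaned = all(state == "cleanout" for state in sub.job_states)
    return input_complete and all_acquired and all_cleaned
```

For instance, a subscription whose input fileset is closed and fully acquired is still not finished while any of its jobs remains outside the cleanout state.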

BossAir

ResourceControl

DBSBuffer

Document storage (CouchDB)

The WMComponents

WorkQueueManager

JobCreator

JobSubmitter

JobStatusLite

JobUpdater

JobTracker

ErrorHandler

RetryManager

JobAccountant

DBSUpload

DBS3Upload

PhEDExInjector

TaskArchiver

AnalyticsDataCollector

WMBSService

AlertGenerator

AlertProcessor
