Cellenics: Architecture

This page explains, at a high level, what the main components of Cellenics are and how each of its modules is architected.

Main Components

Cellenics runs fully in the cloud and is built on top of AWS services. Here are its main parts:

  • UI, which contains the code for the user interface.
  • API, the web service that the UI communicates with.
  • worker, the backend, where all single cell analysis tasks are computed. Used in the Data Exploration and Plots and Tables modules.
  • pipeline, the code executed by the state machines. Used in the Data Management and Data Processing modules.
  • iac, which hosts the CloudFormation templates for all infrastructure used in the platform.
  • AWS S3 buckets.
  • AWS DynamoDB tables.
  • AWS Step Functions state machines.

UI

The user interface is written in React and Redux and uses the Ant Design component framework, Vega, and a scatterplot component borrowed from Vitessce. It communicates with the API via HTTP requests (e.g. fetch the cell sets) or web socket events (e.g. “give me an embedding for clusters X and Y”). The UI has an in-browser cache (we use localforage) to store analysis task results returned from the backend. Each result is kept in the browser cache for 12 hours.
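As a rough illustration of the caching behaviour described above, the sketch below shows how a task result could be kept in localforage with a 12-hour expiry. The key scheme, entry shape and function names are hypothetical, not the actual UI code.

```ts
import localforage from "localforage";

// Results returned from the backend are kept in the browser cache for 12 hours.
const RESULT_TTL_MS = 12 * 60 * 60 * 1000;

interface CachedResult<T> {
  storedAt: number;
  result: T;
}

// Store a result returned from the backend under a task-specific key.
export const cacheTaskResult = async <T>(key: string, result: T): Promise<void> => {
  await localforage.setItem<CachedResult<T>>(key, { storedAt: Date.now(), result });
};

// Return the cached result if it is still fresh, otherwise null (forcing a refetch).
export const getCachedTaskResult = async <T>(key: string): Promise<T | null> => {
  const entry = await localforage.getItem<CachedResult<T>>(key);
  if (!entry || Date.now() - entry.storedAt > RESULT_TTL_MS) return null;
  return entry.result;
};
```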

API

The API is written in Node.js and uses Express, a web application framework with very good support for event-driven applications. The API performs “simple” tasks itself (tasks that do not require significant memory resources and only need data from DynamoDB, Redis or the S3 results buckets) and submits “heavyweight” tasks (tasks that involve manipulation of the single cell count matrix file) to the worker. Task submission happens by pushing the task, defined in a JSON format, to the SQS queue associated with the worker. The API also handles the creation of the worker and of the SQS queue, using the JavaScript AWS SDK (for the queue) and the Flux API (for the worker).

The API receives the results of completed tasks from the SNS topic and sends them back to the UI using a web socket event. To avoid recomputing the same analysis tasks over and over again on the worker, the API uses a Redis cache, where it stores results sent back from the worker. Each analysis task result is kept in Redis with a TTL of 36 hours.
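Below is a minimal sketch of what submitting a task to the worker’s queue could look like, using the AWS SDK for JavaScript v3. The task fields, region and function names are assumptions for illustration; the real API may use a different SDK version and message format.

```ts
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { randomUUID } from "crypto";

const sqs = new SQSClient({ region: "eu-west-1" });

// Shape of a worker task message (illustrative; real field names may differ).
interface WorkerTask {
  experimentId: string;
  taskName: string; // e.g. "GetEmbedding"
  config: Record<string, unknown>;
}

// Push a task, serialised as JSON, onto the experiment's FIFO queue for the worker to pick up.
export const submitWorkTask = async (queueUrl: string, task: WorkerTask): Promise<void> => {
  await sqs.send(new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(task),
    MessageGroupId: task.experimentId,       // FIFO queues require a message group
    MessageDeduplicationId: randomUUID(),
  }));
};
```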

Worker

The worker is composed of two containers: one in Python and one in R. The Python container listens on an SQS queue and picks up work from it. The work involves various data analysis and machine learning tasks such as data processing, computing embeddings, etc. The first time the worker receives a task, it downloads the relevant single cell count matrix file from S3 to its attached volume and loads it into memory for subsequent use. Depending on the task type, either the Python or the R container computes the result, which is then sent back to the API via SNS.
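The worker itself is written in Python and R; purely to illustrate the receive/compute/publish loop described above (and to keep all examples in one language), here is a hypothetical TypeScript sketch. The queue URL, topic ARN, message fields and helper functions are stand-ins, not the real worker code.

```ts
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const sqs = new SQSClient({ region: "eu-west-1" });
const sns = new SNSClient({ region: "eu-west-1" });

// Hypothetical stand-ins for the real Python/R analysis code. In the actual
// worker, the count matrix is downloaded from S3 to the attached volume the
// first time a task arrives and is kept in memory afterwards.
const loadCountMatrix = async (experimentId: string): Promise<unknown> => ({ experimentId });
const computeTask = async (matrix: unknown, taskName: string): Promise<unknown> => ({ matrix, taskName });

export const pollForTasks = async (queueUrl: string, resultsTopicArn: string): Promise<void> => {
  for (;;) {
    // Long-poll the experiment's queue for the next task.
    const { Messages } = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 1,
      WaitTimeSeconds: 20,
    }));
    if (!Messages || Messages.length === 0) continue;

    const message = Messages[0];
    const task = JSON.parse(message.Body ?? "{}");

    const matrix = await loadCountMatrix(task.experimentId);
    const result = await computeTask(matrix, task.taskName);

    // Publish the finished result to the SNS topic the API is subscribed to.
    await sns.send(new PublishCommand({
      TopicArn: resultsTopicArn,
      Message: JSON.stringify({ experimentId: task.experimentId, taskName: task.taskName, result }),
    }));

    // Remove the task from the queue once the result has been sent.
    await sqs.send(new DeleteMessageCommand({
      QueueUrl: queueUrl,
      ReceiptHandle: message.ReceiptHandle,
    }));
  }
};
```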

There is a 1:1 mapping between workers and experiments; each worker listens on its own SQS queue.

Deletion of worker resources is handled by the Kubernetes cluster and happens after the worker hasn’t received any tasks for “a while” (currently set to 1 hour).

Redis

We use Redis to cache results between the API and the worker. Redis is a key-value store, and we use an instance that is managed by AWS.
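As an illustration of the 36-hour result cache mentioned in the API section, here is a minimal sketch using the node-redis client. The key scheme and environment variable are assumptions.

```ts
import { createClient } from "redis";

// Worker results expire from the cache after 36 hours.
const TTL_SECONDS = 36 * 60 * 60;

const redis = createClient({ url: process.env.REDIS_URL });
const ready = redis.connect();

// Store a worker result so identical tasks can be answered without recomputation.
export const cacheWorkerResult = async (taskKey: string, result: unknown): Promise<void> => {
  await ready;
  await redis.set(taskKey, JSON.stringify(result), { EX: TTL_SECONDS });
};

// Look up a previously computed result; returns null on a cache miss.
export const getWorkerResult = async (taskKey: string): Promise<unknown | null> => {
  await ready;
  const cached = await redis.get(taskKey);
  return cached ? JSON.parse(cached) : null;
};
```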

SNS

This is an AWS-managed publish-subscribe topic. Workers publish the results of their finished tasks to it, and the API receives these results without needing to poll for them. (At the time of writing, the SNS topic has not yet been set up.)
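SNS delivers notifications to HTTP(S) subscribers by POSTing to an endpoint, so a plausible sketch of the API side is an Express route that confirms the subscription and then handles result notifications. The route path and payload fields below are assumptions, and the built-in fetch assumes Node 18+; this is not the actual API code.

```ts
import express from "express";

const app = express();

// SNS posts notifications with a text/plain content type, so parse the raw body ourselves.
app.post("/v1/results", express.text({ type: "*/*" }), async (req, res) => {
  const snsMessage = JSON.parse(req.body);

  if (snsMessage.Type === "SubscriptionConfirmation") {
    // Confirm the subscription by visiting the URL SNS provides.
    await fetch(snsMessage.SubscribeURL);
  } else if (snsMessage.Type === "Notification") {
    const result = JSON.parse(snsMessage.Message);
    // Cache the result in Redis and push it to the UI over the web socket (not shown).
    console.log("received worker result for", result.experimentId);
  }

  res.sendStatus(200);
});

app.listen(3000);
```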

SQS

Each worker’s queue is an AWS-managed FIFO queue. It is created dynamically by the API and is deleted by a resource cleaner running in our AWS account.
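A hypothetical sketch of creating such a queue on demand with the AWS SDK for JavaScript v3 (the queue naming scheme and region are assumptions):

```ts
import { SQSClient, CreateQueueCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "eu-west-1" });

// Create the per-experiment FIFO queue on demand and return its URL.
export const createWorkerQueue = async (experimentId: string): Promise<string> => {
  const { QueueUrl } = await sqs.send(new CreateQueueCommand({
    QueueName: `worker-queue-${experimentId}.fifo`, // FIFO queue names must end in ".fifo"
    Attributes: {
      FifoQueue: "true",
      ContentBasedDeduplication: "true",
    },
  }));
  return QueueUrl!; // the URL is then used for subsequent SendMessage calls
};
```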

DynamoDB

This is a key-value store that holds state which is computed by the worker once and never changes, for example cell sets. We do not enforce a particular schema for what is stored in this database or how it is stored.
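For illustration, storing and retrieving cell sets could look roughly like the following; the table name and item shape are assumptions, since there is no fixed schema.

```ts
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "eu-west-1" }));

// Hypothetical table name; real table names and keys may differ.
const TABLE_NAME = "cell-sets";

// Persist the cell sets computed once by the worker for an experiment.
export const saveCellSets = async (experimentId: string, cellSets: unknown): Promise<void> => {
  await ddb.send(new PutCommand({
    TableName: TABLE_NAME,
    Item: { experimentId, cellSets },
  }));
};

// Fetch the stored cell sets for an experiment.
export const getCellSets = async (experimentId: string) => {
  const { Item } = await ddb.send(new GetCommand({
    TableName: TABLE_NAME,
    Key: { experimentId },
  }));
  return Item?.cellSets;
};
```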

S3 bucket

This bucket holds all count matrix files needed for single cell analysis.

Main modules and their architecture

Data Management

[Data Management architecture diagram]

Data Processing

[Data Processing architecture diagram]

Data Exploration and Plots and Tables

[Data Exploration and Plots and Tables architecture diagram]