Cellenics: Architecture
This page is a high-level overview of the architecture and main components of Cellenics, an open-source, cloud-based web platform for single-cell RNA sequencing analysis.
Here is the main diagram, which shows all resources that are part of Cellenics and how they connect to and interact with other modules:
Cellenics is composed of 4 modules:
- Data Management: managing existing experiments and creating new ones.
- Data Processing: processing the dataset and filtering out irrelevant data points. It consists of several types of steps, where each step is a filter. After each step, the UI shows a plot with the filtering results. The plot can be configured and downloaded from the UI, and the default filtering settings can be changed.
- Data Exploration: exploring the data post-filtering through several highly interactive plots, such as a scatterplot showing all cells separated into clusters, a cells x genes marker heatmap, and a gene list. It also supports actions such as pathway analysis and differential expression.
- Plots and Tables: configuring and exporting plots and figures depicting the researcher’s findings.
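As a rough illustration of the "each step is a filter" idea in Data Processing, here is a minimal Python sketch. The step names, thresholds and per-cell fields are hypothetical and are not taken from the Cellenics pipeline; the point is only that each step narrows the dataset and leaves behind before/after counts that a plot could be built from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A cell is modelled as a plain dict of per-cell metrics (hypothetical fields).
Cell = Dict[str, float]


@dataclass
class FilterStep:
    name: str
    keep: Callable[[Cell], bool]  # returns True if the cell passes the filter


def run_pipeline(cells: List[Cell], steps: List[FilterStep]):
    """Apply each filter in order and record how many cells survive each step."""
    report = []
    for step in steps:
        before = len(cells)
        cells = [c for c in cells if step.keep(c)]
        report.append({"step": step.name, "before": before, "after": len(cells)})
    return cells, report


# Hypothetical thresholds, standing in for the configurable default settings.
steps = [
    FilterStep("min_genes", lambda c: c["n_genes"] >= 200),
    FilterStep("max_mito_fraction", lambda c: c["mito_fraction"] <= 0.1),
]

cells = [
    {"n_genes": 150, "mito_fraction": 0.02},
    {"n_genes": 900, "mito_fraction": 0.30},
    {"n_genes": 1200, "mito_fraction": 0.05},
]

filtered, report = run_pipeline(cells, steps)
for row in report:
    print(row)
```

Changing a threshold and re-running the pipeline gives a new report, which is roughly what adjusting the default filtering settings from the UI amounts to.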
Each of the diagrams below shows the resources used for each module and the flow for an example task requested by an authenticated and authorised user within the respective module.
Architecturally, these 2 modules are the same, and their architecture follows this diagram:
The main components of Cellenics are:
- UI, the Cellenics front end code.
- API, the project that orchestrates the communication between the front end, the backends and the Data Model.
- worker, the backend for the Data Exploration and Plots and Tables modules.
- pipeline, the backend for the Data Processing module.
- iac, where we host all infrastructure used for running Cellenics.
- releases, where we host the Helm Releases to be picked up by Flux.
The UI, API, worker and the pipeline are all deployed to Kubernetes as Helm charts.
The Cellenics front end is written in React and Redux and uses the Ant Design component framework, Vega and a scatterplot component borrowed from Vitessce. It communicates with the API via HTTP requests (e.g. fetch cell sets) or WebSocket events (e.g. give me an embedding for clusters X and Y). The UI has an in-browser cache (we use localforage) that stores analysis task results returned from the backend. The UI also handles user authentication.
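The caching behaviour can be sketched as follows. This is a language-neutral illustration written in Python, not the UI's actual code (which runs in the browser and uses localforage). The key derivation, the `ResultCache` class and the `GetEmbedding` task name are assumptions made for the example.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def cache_key(experiment_id: str, body: Dict[str, Any]) -> str:
    """Derive a stable key so that identical requests map to the same entry."""
    canonical = json.dumps(body, sort_keys=True)
    return hashlib.sha256(f"{experiment_id}:{canonical}".encode()).hexdigest()


class ResultCache:
    """In-memory stand-in for the in-browser cache (hypothetical class)."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def get_or_fetch(
        self, experiment_id: str, body: Dict[str, Any], fetch: Callable[[], Any]
    ) -> Any:
        key = cache_key(experiment_id, body)
        if key not in self._store:
            self._store[key] = fetch()  # only go to the backend on a miss
        return self._store[key]


calls = []


def fake_backend() -> Dict[str, Any]:
    """Pretend backend that counts how often it is asked for a result."""
    calls.append(1)
    return {"embedding": [[0.1, 0.2], [0.3, 0.4]]}


cache = ResultCache()
first = cache.get_or_fetch("exp-1", {"name": "GetEmbedding", "type": "umap"}, fake_backend)
# Same request with a different key order: served from the cache.
second = cache.get_or_fetch("exp-1", {"type": "umap", "name": "GetEmbedding"}, fake_backend)
print(len(calls))
```

Serialising the request body with sorted keys means that two requests differing only in key order share one cache entry.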
The API is written in Node.js and uses Express. It is a service that sits between the UI, the backends and the Data Model. It authorises each request it receives using a Node.js authoriser module. The API also handles “simple” tasks like:
- getting data from DynamoDB or the S3 buckets
- submitting “heavyweight” tasks (tasks that involve manipulation of the single cell count matrix file) to one of the backends
- creating SQS queues
- associating existing worker instances with an experiment
- etc.
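The split between "simple" and "heavyweight" tasks can be pictured with the sketch below. It is written in Python for illustration only (the real API is a Node.js service), and the task names, status codes and the `Api` class are assumptions rather than the project's actual interface. A dict stands in for DynamoDB and S3, and a deque stands in for the queue to a backend.

```python
from collections import deque
from typing import Any, Deque, Dict

# Hypothetical task names; the real API defines its own.
HEAVYWEIGHT_TASKS = {"compute_embedding", "differential_expression"}


class Api:
    def __init__(self, datastore: Dict[str, Any]) -> None:
        self.datastore = datastore  # stands in for DynamoDB / S3
        self.work_queue: Deque[Dict[str, Any]] = deque()  # stands in for a backend queue

    def handle(self, request: Dict[str, Any], authorised: bool) -> Dict[str, Any]:
        # Every request is authorised before anything else happens.
        if not authorised:
            return {"status": 403}
        # Heavyweight tasks are handed to a backend instead of being computed here.
        if request["task"] in HEAVYWEIGHT_TASKS:
            self.work_queue.append(request)
            return {"status": 202}
        # Simple tasks are answered directly from the data store.
        return {"status": 200, "data": self.datastore.get(request["key"])}


api = Api({"cell_sets": ["cluster-1", "cluster-2"]})
denied = api.handle({"task": "get", "key": "cell_sets"}, authorised=False)
simple = api.handle({"task": "get", "key": "cell_sets"}, authorised=True)
heavy = api.handle({"task": "compute_embedding", "experiment_id": "exp-1"}, authorised=True)
print(denied, simple, heavy)
```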
The worker is composed of two containers: one in Python and one in R. The Python container listens on an SQS queue and picks up work from it. The work involves various data analysis and machine learning tasks such as data processing and computing embeddings. The first time the worker receives a task, it downloads the relevant single cell count matrix file from S3 to its attached volume and loads it into memory for subsequent use. Depending on the task type, either the Python or the R container computes the results, which are then sent back to the API via the Redis-backed Socket.IO interface.
There is a 1:1 mapping between workers and experiments. Each worker has its own SQS queue it listens to.
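A minimal sketch of that worker behaviour is shown below: one worker per experiment, a count matrix that is loaded once and reused, and dispatch to either the Python or the R side by task type. The `Worker` class, the task names and the choice of which tasks go to R are hypothetical. A deque stands in for the SQS queue and a plain function stands in for the S3 download.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Optional

# Hypothetical: which task types are delegated to the R container.
R_TASKS = {"differential_expression"}


class Worker:
    def __init__(self, experiment_id: str, load_matrix: Callable[[str], Any]) -> None:
        self.experiment_id = experiment_id  # one worker serves one experiment
        self._load_matrix = load_matrix  # stands in for the S3 download
        self._matrix: Optional[Any] = None
        self.loads = 0

    def _matrix_for_task(self) -> Any:
        # Load the count matrix on the first task only, then reuse it.
        if self._matrix is None:
            self._matrix = self._load_matrix(self.experiment_id)
            self.loads += 1
        return self._matrix

    def handle(self, task: Dict[str, Any]) -> Dict[str, Any]:
        matrix = self._matrix_for_task()
        engine = "r" if task["name"] in R_TASKS else "python"
        return {"task": task["name"], "engine": engine, "n_cells": len(matrix)}

    def drain(self, queue: Deque[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Process tasks in arrival order until the queue is empty."""
        results = []
        while queue:
            results.append(self.handle(queue.popleft()))
        return results


def fake_download(experiment_id: str) -> List[List[int]]:
    """Pretend count matrix: three cells by two genes."""
    return [[0, 3], [1, 0], [2, 2]]


queue = deque([{"name": "compute_embedding"}, {"name": "differential_expression"}])
worker = Worker("exp-1", fake_download)
results = worker.drain(queue)
print(results, worker.loads)
```

Keeping the loaded matrix on the worker instance is what makes the 1:1 worker-to-experiment mapping pay off: every task after the first skips the download.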
Deletion of worker resources is triggered by hitting an endpoint in the API. It is configured to happen automatically after the worker instance has finished running.
The worker runs on EKS Fargate in a "worker" profile, set in the cluster.yaml definition in the iac repo.
Here are the main AWS resources that are used within Cellenics:
- AWS S3 buckets, where we keep analysis results, processed and unprocessed datasets.
- AWS DynamoDB tables, where we keep experiment-specific information.
- AWS Step Functions state machines, which we use for running Data Processing pipelines.
- AWS EKS, where we run instances of the API, UI, worker, and the pipeline.
- AWS Fargate, used within AWS EKS for running the worker and the pipeline.
The entire AWS infrastructure is deployed via GitHub Actions workflows, located in the iac repository. For more details, go to iac.
We use Redis with the Socket.IO API for communication between the API and the worker. For more details of the architecture, see https://socket.io/docs/v4/redis-adapter/#emitter. To make this work, we have deployed ElastiCache, the AWS-managed version of Redis.
SNS is an AWS-managed publish-subscribe service. We use an SNS topic for returning pipeline results back to the API. Multiple pipelines connect to this topic and send the results of their finished tasks to it. The API receives these results without the need to poll for them.
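The push-based delivery described above can be illustrated with a small in-memory publish-subscribe sketch. It does not use any AWS SDK; the `Topic` class and the message fields are hypothetical and only show why the subscriber never has to poll.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


class Topic:
    """In-memory stand-in for a publish-subscribe topic (hypothetical class)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Message], None]] = []

    def subscribe(self, callback: Callable[[Message], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, message: Message) -> None:
        # Delivery is pushed to every subscriber as soon as a message arrives.
        for callback in self._subscribers:
            callback(message)


received: List[Message] = []

topic = Topic()
topic.subscribe(received.append)  # the API registers once and then just waits

# Two pipelines publish to the same topic when their tasks finish.
topic.publish({"experiment_id": "exp-1", "step": "filter-1", "status": "done"})
topic.publish({"experiment_id": "exp-2", "step": "filter-1", "status": "done"})
print(received)
```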
SQS is an AWS-managed queue service, and we use its FIFO queues. Each queue is created dynamically by the API and deleted by a resource cleaner that runs in our AWS account. SQS is used for submitting single cell analysis tasks to the worker.
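To make the FIFO details concrete, here is a sketch that only builds the request parameters for creating such a queue and for sending a task to it, without calling AWS. The parameter names (`QueueName`, `Attributes`, `FifoQueue`, `ContentBasedDeduplication`, `MessageGroupId`) are the ones the SQS API uses; the `worker-<experiment id>` naming scheme, the helper functions and the use of content-based deduplication are assumptions for the example. It is in Python for illustration, whereas the real API would go through the AWS SDK for JavaScript.

```python
import json
import re
from typing import Any, Dict

# FIFO queue names must end in ".fifo" and may be at most 80 characters long,
# using only alphanumeric characters, hyphens and underscores.
_QUEUE_NAME = re.compile(r"^[A-Za-z0-9_-]{1,75}\.fifo$")


def fifo_queue_params(experiment_id: str) -> Dict[str, Any]:
    """Build create-queue parameters for one experiment's worker queue."""
    name = f"worker-{experiment_id}.fifo"  # hypothetical naming scheme
    if not _QUEUE_NAME.match(name):
        raise ValueError(f"invalid FIFO queue name: {name}")
    return {
        "QueueName": name,
        "Attributes": {
            "FifoQueue": "true",
            # Assumption: deduplicate on message content rather than explicit IDs.
            "ContentBasedDeduplication": "true",
        },
    }


def task_message_params(queue_url: str, experiment_id: str, task: Dict[str, Any]) -> Dict[str, Any]:
    """Build send-message parameters; FIFO queues require a MessageGroupId."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(task, sort_keys=True),
        "MessageGroupId": experiment_id,
    }


queue_params = fifo_queue_params("exp-1")
message_params = task_message_params(
    "https://example.invalid/worker-exp-1.fifo",  # placeholder URL
    "exp-1",
    {"name": "compute_embedding", "type": "umap"},
)
print(queue_params)
print(message_params)
```

With boto3, dictionaries shaped like these could be passed as keyword arguments to the SQS client's `create_queue` and `send_message` calls.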
This is a PostgreSQL database managed by AWS. We store all our customer data here.