DeepInsight3D package to deal with multi-omics or multi-layered data
DeepInsight3D has 3 main components: 1) conversion of multi-layered non-image samples (such as multi-omics or similar tabular data) into colored image form (i.e. 3D image samples); 2) processing of the colored images by a convolutional neural network (CNN); and 3) element selection via the CNN, for example using class-activation maps (CAMs).
This approach builds a 3D image by placing similar elements (or genes) together and dissimilar ones apart, and then mapping the multi-layered non-image values onto these pixel locations. The CNN is then used for element or gene selection on the non-image data.
OS: Linux Ubuntu 20.04; Matlab version: 2021a; GPU A100 (2 parallel);
Sharma A*, Lysenko A*, Boroevich K, Tsunoda T*, DeepInsight-3D for precision oncology: an improved anti-cancer drug response prediction from high-dimensional multi-omics data with convolutional neural networks, bioRxiv, 2022 https://doi.org/10.1101/2022.07.14.500140
-
Download the Matlab package DeepInsight3D_pkg.tar.gz or the entire DeepInsight3D_pkg folder from the link above. Store it in your working directory. Gunzip and untar as follows:
    >> gunzip DeepInsight3D_pkg.tar.gz
    >> tar -xvf DeepInsight3D_pkg.tar
-
Download example dataset from the following link (caution: data size is 88MB):
http://emu.src.riken.jp/DeepInsight/download_files/dataset1.mat
Move the dataset to the Data folder inside the DeepInsight3D_pkg folder. The dataset path will look like this: /DeepInsight3D_pkg/Data/dataset1.mat
This example data is the PDX_Paclitaxel multi-omics data, which has 3 layers: RNA-seq, CNA and mutation. The dataset is given in Matlab struct format. Any other data in a similar struct format can be used with the DeepInsight3D model.
-
Download and install an example CNN net such as ResNet-50 in Matlab; see the MathWorks link for details about ResNet-50. You may use different nets as desired.
-
Executing the DeepInsight3D_pkg: all code should be run from the folder ../DeepInsight3D_pkg/. If you want to run it from a different folder, add the package folder with addpath in Matlab.
In this example, multi-omics example data (PDX_Paclitaxel) is used, which is stored in the DeepInsight3D_pkg/Data folder as 'dataset1.mat'. It is split into a training set and a test set. The first layer is RNA-seq, the second layer is CNA and the third layer is mutation. These layers are first converted to 3D images using the DeepInsight3D converter. Then the CNN net (resnet50) is trained. The performance, in terms of accuracy and AUC, is evaluated on the test set.
-
File: open Example1.m file in the Matlab Editor.
-
Set up parameters by changing the Parameters.m file. Based on your hardware requirements, change Parm.miniBatchSize (default is 512) and Parm.ExecutionEnvironment (default is multi-gpu). If you don't want to see the training progress plot produced by CNN training, then set Parm.trainingPlot='none'. Alternatively, leave all the parameters at their default values.
-
Dataset calling: since the dataset name is dataset1.mat, the variable DSETnum=1 (at Line 17 of Example1.m) has been used. If the name of the dataset is datasetX.mat, then the variable DSETnum should be set as X.
-
Example1.m uses the function DeepInsight3D.m. This function has two parts: 1) tabular data to image conversion using func_Prepare_Data.m, and 2) CNN training with resnet50 (the default; change as required) using func_TrainModel.m.
-
The output is AUC (for 2-class problems only), C (confusion matrix) and Accuracy on the test set (at Line 28). It also gives ValErr, which is the validation error.
-
By default, trained CNN models (such as model.mat, 0*.mat) and converted data (either Out1.mat or Out2.mat) will be saved in the folder /Models/Run1/ and figures will be stored in the folder /FIGS/Run1/. Files are saved by calling the functions func_SaveModels.m and func_SaveFigs.m.
-
The execution results are stored in the DeepInsight3D_Results.txt file in the /DeepInsight3D_pkg/ folder.
-
Running Example1.m will display a few messages in the Matlab Command Window, such as
    Dataset: PDX_Paclitaxel
    NORM-2
    tSNE with burneshut algorithm has been used
    Distance: euclidean
    Pixels: 227 x 227
    Training model begins: Net1
    ...
    Out =
      struct with fields:
        bestIdx: 1
        fileName: "0.32624.mat"
        valError: 0.3262
    Stage: 1; Test Accuracy: 0.6744; ValErr: 0.3262; Momentum: 0.801033; L2Regularization: 0.0125157; InitLearnRate: 4.9866e-05
    Training model ends
Note that the above values might differ.
An objective function figure will be shown when the Bayesian Optimization Technique (BOT) is used. By default no BOT is applied; i.e. Parm.MaxObj=1. If BOT is required, change the parameter Parm.MaxObj to a value higher than 1. If it is set as Parm.MaxObj=20, then 20 objective function evaluations will be run to tune the hyperparameters, and the best one (with the minimum validation error) will be selected.
Results file: check DeepInsight3D_Results.txt for more information, such as
    AUC: 0.7789
    ConfusionMatrix
        25 13
         1  4
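The test accuracy shown in the Command Window can be cross-checked against the confusion matrix in the results file. The following is a minimal sketch in Python, not part of the Matlab package. It assumes the four numbers form a 2 x 2 matrix with the correct predictions on the diagonal (the diagonal is 25 and 4 whether the values are read by row or by column), and the function name is hypothetical.

```python
def accuracy_from_confusion(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Values taken from the example results above.
confusion = [[25, 13],
             [1, 4]]
acc = accuracy_from_confusion(confusion)
print(round(acc, 4))  # 0.6744, i.e. (25 + 4) / 43
```

The result agrees with the Test Accuracy of 0.6744 in the example output.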
In this example, feature selection using class-activation maps (CAMs) is executed. It is assumed that Example 1 has been run before running this example. Running Example 1 will save model files in Models/Run.. folder, and also the data file Out1.mat (if norm1 is used) or Out2.mat (if norm2 is used).
Running Example2.m will perform feature selection. The steps (in Matlab) are described hereunder.
-
Copy the saved model files to the correct folders
    unix(['cp Models/Run1/stage1/model.mat .']);
    unix(['cp Models/Run1/stage1/0.*.mat DeepResults']);
-
The dataset is still the same, therefore the parameter DSETnum=1. Call parameters using
    Parm = Parameters(DSETnum);
-
Set CAM threshold e.g.
Parm.Threshold = 0.35;
-
Execute class-based CAM using func_FS_class_basedCAM(Parm); as shown in Example2.m (Line 29). The following information will be displayed on the screen.
    Feature selection begins
    model =
      struct with fields
        Norm: 2
        bestIdx: 1
        fileName: '0.32624.mat'
    Files saved in the FIGS folder
    Files saved in the Models folder
    #Genes = 5205; #Genes_compressed = 3331
    Stage 1 Ends
Images will be stored in FIGS folder. The following command can be used to open images in the unix console/terminal:
eog ~/DeepInsight3D_pkg/FIGS/Run1/Stage1/Class_Activation.jpg
Class activation image is given below. Since only two classes exist, the figure shows 'class1' and 'class2' activations.
Vary the threshold Parm.Threshold between 0 and 1 to vary the number of selected features/genes.
Features selected per class can also be viewed from FIGS/Run1/ folder.
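To see why the threshold changes the number of selected features, consider a toy activation map. The sketch below is written in Python for illustration only and is not the package's code. It assumes that a pixel is kept when its activation is at least `threshold` times the peak of the map; the actual rule inside func_FS_class_basedCAM may differ. The function name and the map values are hypothetical.

```python
def select_pixels(cam, threshold):
    """Return (row, col) positions whose activation is at least
    `threshold` times the peak activation of the map (assumed rule)."""
    peak = max(value for row in cam for value in row)
    cutoff = threshold * peak
    return [
        (r, c)
        for r, row in enumerate(cam)
        for c, value in enumerate(row)
        if value >= cutoff
    ]


# Toy 3 x 3 activation map with made-up values (not package output).
cam = [
    [0.05, 0.20, 0.40],
    [0.10, 0.90, 0.60],
    [0.00, 0.35, 1.00],
]

broad = select_pixels(cam, 0.3)  # lower threshold, wider region
fine = select_pixels(cam, 0.8)   # higher threshold, narrower region
print(len(broad), len(fine))     # 5 2
```

Under this assumption, each kept pixel corresponds to the features mapped to it, so a lower threshold retains more features/genes.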
In this example, feature selection using CAMs is performed in an iterative manner. There are 3 steps in this iterative procedure:
- conversion of multi-layered tabular data to 3D image.
- estimation of CNN net using the training set and validation using the validation set.
- feature selection using CAMs
- repeating the above 3 steps until a desired number of features or maximum of 6 stages is reached. At this point, iterative procedure will be terminated.
Caution: this procedure could take a very long time to run (depending on hardware specs).
Running Example3.m will execute the iterative procedure. The steps are described hereunder.
Steps:
-
Set up parameters by changing Parameters.m file, otherwise leave it with default values.
-
Provide the path of the dataset in the Parameters.m file by changing the "Data_path" variable. In this example, it is set as /DeepInsight3D_pkg/Data/
-
Define the stored dataset using
DSETnum=1;
-
Call parameters using
Parm = Parameters(DSETnum);
-
For quick testing of the code, reduce MaxEpochs, e.g.
Parm.MaxEpochs = 5;
For better training, use a higher value of MaxEpochs.
-
Set the CAM Threshold
Parm.Threshold = 0.3;
-
Suppress training plot (otherwise several plots will be invoked for every Stage)
Parm.trainingPlot = 'none';
-
Define the folder where the models are to be stored
Parm.FileRun = 'Run2';
-
The following code will perform the iterative procedure:
    Glen = inf;
    % repeat until few enough genes remain or 6 stages have run
    while (Glen > Parm.DesiredGenes) && (Parm.Stage < 7)
        [AUC,C,Accuracy,ValErr] = DeepInsight3D(DSETnum,Parm);
        [Genes,Genes_compressed,G] = func_FS_class_basedCAM(Parm);
        Glen = length(Genes);
        Parm.Stage = Parm.Stage + 1;
    end
-
All the results will be stored in the current stage folder
~/DeepInsight3D_pkg/Models/Run2/StageX
where X is the current stage.
-
Similarly, all the figures will be stored in the folder
~/DeepInsight3D_pkg/FIGS/Run2/StageX
where X is the current stage.
-
If the loop continues, the value of X increments (1, 2, 3, …); i.e., the DeepInsight3D model is repeated to find a smaller subset of features/genes.
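The stopping rule of the iterative procedure (stop once the number of selected genes is no larger than the desired number, or after 6 stages) can be traced with a small simulation. This is a Python sketch for illustration, not the Matlab code of the package: `select_fn` is a hypothetical stand-in for one round of CNN training plus CAM-based selection, and the halving of the gene count per stage is made up.

```python
import math


def iterate_selection(n_features, desired_genes, select_fn, max_stage=6):
    """Mimic the loop control: run stages until few enough genes remain
    or max_stage stages have been executed."""
    glen = math.inf  # same start value as Glen = inf
    stage = 1
    history = []
    while glen > desired_genes and stage <= max_stage:
        # Stand-in for DeepInsight3D(...) followed by CAM-based selection.
        n_features = select_fn(n_features)
        glen = n_features
        history.append((stage, glen))
        stage += 1
    return history


# Made-up behaviour: every stage keeps half of the remaining genes.
history = iterate_selection(5205, 1200, lambda n: n // 2)
print(history)  # [(1, 2602), (2, 1301), (3, 650)]
```

With these toy numbers the loop ends after Stage 3, because 650 is below the desired 1200 genes. If the selection never shrinks the gene set, the loop still ends after 6 stages.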
-
DeepInsight3D_pkg has 4 folders: Data, DeepResults, FIGS and Models. It has several .m files. However, the main files are 1) DeepInsight3D.m to perform image conversion and CNN modeling, and 2) func_FS_class_basedCAM.m to perform feature selection. All the parameter settings can be done in the Parameters.m file.
-
DeepInsight3D.m has 2 main functions:
-
func_Prepare_Data: This function loads the data, splits the training data into Train and Validation sets, normalizes all 3 sets (including the Test set), and converts multi-layered non-image samples to 3D image form using the Training set. The Test and Validation sets are not used to find pixel locations. Once the pixel locations are obtained, all the non-image samples are converted to 3D image samples. The image datasets are stored as Out1.mat or Out2.mat depending on whether norm1 or norm2 was selected.
-
func_TrainModel: This function trains the convolutional neural network (CNN) using one of many pretrained or custom nets. The user may change the net as required. The default values of the CNN hyperparameters are used. However, if Parm.MaxObj is greater than 1, it optimizes the hyperparameters using the Bayesian Optimization Technique. It uses the Training set and Validation set to tune and evaluate the model hyperparameters.
Note: To tune the hyperparameters of the CNN automatically, use a higher value of Parm.MaxObj.
The best model (in case Parm.MaxObj>1) is stored in the DeepResults folder as a .mat file, where the file name gives the best validation error achieved. For example, the file 0.32624.mat in the DeepResults folder holds the hyperparameters at validation error 0.32624. Also, the model file model.mat details the weights file and other relevant information to be stored.
-
-
Feature selection functions
func_FeatureSelection: This will find activation maps at the ReLU layer, and perform the Region Accumulation (RA) step and the Element Decoder step to find the element/gene subset. The input is model.mat (from func_TrainModel) and the related .mat file from the folder DeepResults. This function finds the CAM for each sample and provides the union of all maps.
func_FS_class_basedCAM: This function performs class-based CAM, i.e., each class will have a distinct CAM.
func_FeatureSelection_avgCAM: This function finds the common CAM across all the samples.
-
Non-image to image conversion: two core sub-functions of func_Prepare_Data are used to convert samples from non-image to image. These are described below.
-
Cart2Pixel: The input to this function is the entire Training set. The output is the feature or gene locations Z in the pixel frame. The size of the pixel frame is pre-defined by the user.
-
ConvPixel: The input is a non-image sample or feature vector and Z (from above). The output is an image sample corresponding to the input sample.
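The two conversion steps can be pictured with a small Python sketch. It is an illustration only and not the Matlab implementation: it starts from 2D feature coordinates that are already given (in the package they come from tSNE or another technique applied to the Training set), it skips any rotation or compression step, and it assumes that features falling on the same pixel are averaged. The lowercase function names and all values are hypothetical.

```python
def cart2pixel(coords, size):
    """Scale 2D feature coordinates to (row, col) positions in a size x size frame."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]

    def scale(value, lo, hi):
        if hi == lo:  # all features share this coordinate
            return 0
        return round((value - lo) / (hi - lo) * (size - 1))

    return [
        (scale(y, min(ys), max(ys)), scale(x, min(xs), max(xs)))
        for x, y in coords
    ]


def conv_pixel(layers, locations, size):
    """Build one size x size channel per layer; average features sharing a pixel."""
    image = []
    for values in layers:
        total = [[0.0] * size for _ in range(size)]
        count = [[0] * size for _ in range(size)]
        for value, (r, c) in zip(values, locations):
            total[r][c] += value
            count[r][c] += 1
        image.append([
            [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(size)]
            for r in range(size)
        ])
    return image


# Four features with made-up 2D coordinates; features 2 and 3 land on one pixel.
coords = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.9), (0.5, 0.5)]
Z = cart2pixel(coords, 3)

# One sample with three layers (e.g. RNA-seq, CNA, mutation), made-up values.
sample = [
    [1.0, 2.0, 4.0, 3.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
image = conv_pixel(sample, Z, 3)
print(Z)
print(image[0][2][2], image[1][2][2], image[2][2][2])
```

The key point is that the locations Z are computed once and shared by all layers, so the same gene occupies the same pixel in every channel of the 3D image.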
-
-
Compression Snow-fall algorithm (SnowFall.m): Not used in this package. However, this compression algorithm is used to provide more space for features in the given pixel frame. Since the conversion from the Cartesian coordinate system to the pixel frame depends on the pixel resolution, it becomes difficult to fit all the features without overlapping each other. This algorithm tries to create more space such that the overlapping of feature or gene locations can be minimized. The input is the locations of genes or features with the pixel size information. The output is the readjusted image. It is up to the user whether to use Snow-fall compression, by setting Parm.SnowFall to either 0 (do not use) or 1 (use).
-
Extraction of Gene Names (optional): This option is useful for enrichment analysis. The two files for extraction of genes are GeneNames_Extract.m and GeneNames.m. The list of gene names is stored in the ~/DeepInsight3D_pkg/Models/RunY/StageX/ folder.
After running the feature selection function, the results will be stored in the corresponding RunY and StageX folders (where X and Y are integers 1,2,3…). If it is required to find the gene IDs/names of the obtained subset for each cancer type, then execute the GeneNames_Extract function. Go to Line 4 and set the Out_Stages variable. For example, if Stage 2 has been saved inside Run1 after executing func_FS_class_basedCAM, use Out_Stages = 2. Then go to Line 6 and define FileRun.
The gene list per class will be generated. If there are 10 cancer types, then 10 files will be generated. In addition, one file listing all genes will be generated (e.g. GeneList_UnCmprss.txt). The results will be stored in ~/Models/RunY/StageX as RunYStageX.tar.gz, and a folder with the same results will also be created as RunYStageX. In this example, the results will be stored in the folder Run1Stage2 and in Run1Stage2.tar.gz.
A number of parameters/variables are used to control the DeepInsight3D_pkg. The details are given hereunder.
-
Parm.Method (select dimensionality reduction technique)
The dimensionality reduction technique (DRT) can be one of the following methods: 1) tSNE, 2) principal component analysis (PCA), 3) kernel PCA, 4) uniform manifold approximation and projection (umap). For umap you can use python or R scripts (please see umapa_Rmatlab.m). Please note that these DRTs are not used in the conventional manner. Only the element locations are obtained by the DRTs; the reduction of features or dimensions is NOT performed.
Select this variable in the Parameters.m file, or after calling Parm = Parameters(DSETnum) change Parm.Method = 'tSNE', 'kpca', 'pca' or 'umap'.
Default is tSNE.
-
Parm.Dist (distance selection, only for tSNE)
If tSNE is used, then one of the following distances can be used. The default distance is 'euclidean'.
Parm.Dist = 'cosine', 'hamming', 'mahalanobis', 'euclidean', 'chebychev', 'correlation', 'minkowski', 'jaccard', or 'seuclidean' (standardized Euclidean distance).
-
Parm.Max_Px_Size (maximum pixel frame, either row or column)
The default value is 224, as required by the ResNet-50 architecture.
-
Parm.ValidRatio (ratio of validation data to training data)
The fraction of the training data to be used as a validation set. Default is 0.1; i.e., 10% of the training data is kept aside as a validation set. The new training set will be 90% of the original size.
-
Parm.Seed
Random parameter seed to split the data.
-
Parm.NetName: use pre-trained nets such as resnet50, inceptionresnetv2, nasnetlarge, efficientnetb0, googlenet and so on. See a list of pre-trained nets from Matlab link here
-
Parm.ExecutionEnvironment: execution environment based on your hardware. Options are cpu, gpu, multi-gpu, parallel, and auto. Please check trainingOptions (Matlab) for further details.
-
Parm.ParallelNet: if '1' then this option overrides Parm.NetName. The custom-made net from makeObjFcn2.m will be used.
-
Parm.miniBatchSize: defines the mini-batch size; default is 512.
-
Parm.Augment: augment samples during training; select '1' for yes and '0' for no.
-
Parm.AugMeth: select method '1' or '2'. Method 1 automatically augments samples whereas Method 2 is done by the user.
-
Parm.aug_tr: if Parm.AugMeth=2 then Parm.aug_tr=500 will augment 500 samples of the training set if the number of samples in a class is less than 500.
-
Parm.aug_val: if Parm.AugMeth=2 then Parm.aug_val=50 will augment 50 samples of the validation set if the number of samples in a class is less than 50.
-
Parm.ApplyFS: if '1' it applies a feature selection process using Logistic Regression before applying the DeepInsight transformation.
-
Parm.FeatureMap: has the following options.
'0' means use 'all' omics or multi-layered data for conversion.
'1' means use the 1st layer for conversion (e.g. expression).
'2' means use the 2nd layer for conversion (e.g. methylation).
'3' means use the 3rd layer for conversion (e.g. mutation).
-
Parm.TransLearn: if '1' then learn the CNN from nets previously trained on your different datasets.
-
Parm.FileRun
Change the name as RunX, where X is an integer defining the run of DeepInsight3D on your data.
Change the value X for new runs.
-
Parm.SnowFall (compression algorithm)
If the SnowFall compression algorithm is to be used, set the value to 1, otherwise 0. Default is set as 1.
-
Parm.Threshold (for Class Activation Maps)
Set the threshold of class activation maps (CAMs) by changing the value between 0 and 1. If the value is high (towards 1), then the region of activation maps will be very fine. On the other hand, the region will be broader towards value 0. Default is 0.3.
-
Parm.DesiredGenes
Expected number of genes to be selected. Default is set as 1200. However, change as required.
-
Parm.UsePrevModel
The iterative procedure runs in multiple stages. If you want to avoid training the CNN multiple times, set this value to 'y' (yes); i.e., the previous weights of the CNN will be used for the current model. This way the processing time is shorter; however, performance (in terms of selection and accuracy) would be lower. The default setting is 'n' (no).
-
Parm.SaveModels
For saving models type 'y', otherwise 'n'. Default is set as yes 'y'.
-
Parm.Stage
Define the stage of execution. The default value is set as Parm.Stage=1. All the results will be saved in RunXStage1. If the iterative process is executed, then results will be stored in Stage2, Stage3… and so on.
-
Parm.PATH
Default paths for FIGS, Models and Data are ~/DeepInsight3D_pkg/FIGS/, ~/DeepInsight3D_pkg/Models/ and ~/DeepInsight3D_pkg/Data/, respectively. Runtime parameters will be stored in the ~/DeepInsight3D_pkg/ folder (such as model.mat, Out1.mat or Out2.mat).
-
Log and performance file (including an overview of parameter information)
The runtime results will be stored in ~/DeepInsight3D_pkg/DeepInsight3D_Results.txt with complete information about the run.
A YouTube video about the original DeepInsight method is available here. A Matlab page on DeepInsight can be viewed from here.
Sharma A, Vans E, Shigemizu D, Boroevich KA, Tsunoda T, DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture, Scientific Reports, 9(1), 1-7, 2019.
Overall weblink here
Sharma A, Lysenko A, Boroevich KA, Vans E, Tsunoda T, DeepFeature: feature selection in nonimage data using convolutional neural network, Briefings in Bioinformatics, 22(6), 2021.
a) Competition details: Mechanisms of Actions (MoA) Predictions https://www.kaggle.com/competitions/lish-moa
b) Peng et al., 1st Place Winning Solution - Hungry for Gold. Laboratory for Innovation Science at Harvard, Mechanisms of Action (MoA) Prediction Competition 2020. here
c) Organizers: MIT and Harvard University (Connectivity Map here)
d) DeepInsight EfficientNet-B3 Noisy Student (PyTorch) here