Human transcriptome profiles typically contain expression values for many thousands of genes, so each profile is a point in a high-dimensional feature space. A principal technical challenge in understanding the pathways involved in a disease and in characterizing inter-individual variation is to reduce this dimensionality while preserving biologically relevant information. The result is an interpretable lower-dimensional space that can reveal patient sub-populations.
In this work, we develop a novel computational pipeline to (i) map the blood gene expression profiles of SLE patients into a robust, semantically rich, low-dimensional space, and (ii) derive an SLE patient stratification that is consistent across different publicly available datasets. Our approach is to extract patterns from patient data that may identify the underlying biological pathways associated with disease progression, thereby mapping the complex disease dynamics, reflected in a high-dimensional space of transcript abundances, into a reduced and more interpretable space.
To perform a comprehensive and unbiased study of the molecular changes underlying SLE, we identify and download transcriptomic data from ten publicly available human SLE datasets in the Gene Expression Omnibus. In all datasets, gene expression is measured in whole blood. Blood cells have been argued to provide an informative, easily accessible, and inexpensive means of studying the immune system (Chaussabel et al. 2010; Banchereau et al. 2016; Patin et al. 2018).
We project the ten publicly available SLE blood transcriptomic datasets into a four-dimensional subspace in which each dimension represents a key axis of variation. We labeled the axes interferon, lymphocyte, erythrocyte, and inflammation activity after analyzing the four gene sets (“modules”) that were computationally determined to be most suitable for capturing the variability across the cohorts. Figure 1 shows the distribution of patients along two of these four dimensions, after robust normalization, for three cohorts.
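To make the idea of projecting patients onto module axes concrete, the following minimal sketch scores each patient on four modules. It rests on assumptions that the text does not specify: robust normalization is taken to mean centering each gene on its cohort median and scaling by its interquartile range, and a module score is taken to be the mean normalized expression of the module's genes. The function names, the gene identifiers, and the module memberships are hypothetical placeholders, and the expression values are random toy data.

```python
import numpy as np


def robust_normalize(expr):
    """Center each gene (column) on its median and scale by its interquartile range.

    Assumption: this is one common definition of robust normalization;
    the procedure actually used in the study may differ.
    """
    expr = np.asarray(expr, dtype=float)
    median = np.median(expr, axis=0)
    q1, q3 = np.percentile(expr, [25, 75], axis=0)
    iqr = q3 - q1
    iqr[iqr == 0] = 1.0  # leave constant genes unscaled instead of dividing by zero
    return (expr - median) / iqr


def module_scores(expr, gene_names, modules):
    """Return a (patients x modules) array of mean normalized module expression."""
    normalized = robust_normalize(expr)
    column_of = {gene: i for i, gene in enumerate(gene_names)}
    scores = []
    for module_name, genes in modules.items():
        cols = [column_of[g] for g in genes if g in column_of]
        if not cols:
            raise ValueError(f"no genes of module {module_name!r} found in the data")
        scores.append(normalized[:, cols].mean(axis=1))
    return np.column_stack(scores)


# Toy cohort: 6 patients x 8 genes (placeholder identifiers, random values).
rng = np.random.default_rng(0)
gene_names = [f"gene_{i}" for i in range(8)]
expr = rng.normal(loc=8.0, scale=2.0, size=(6, 8))

# Hypothetical module memberships; the real gene sets are not reproduced here.
modules = {
    "interferon": ["gene_0", "gene_1"],
    "lymphocyte": ["gene_2", "gene_3"],
    "erythrocyte": ["gene_4", "gene_5"],
    "inflammation": ["gene_6", "gene_7"],
}

scores = module_scores(expr, gene_names, modules)  # one row per patient, one column per module
```

Normalizing within each cohort before scoring is one way to make module scores comparable across datasets that were measured on different platforms, which is why the sketch applies the normalization per gene inside a single cohort matrix.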