GeneTrail2 1.5
Statistical analysis of molecular signatures
GeneTrail 2 Workflow
1. Input Data
GeneTrail 2 offers two ways to provide experimental data. You may either download preprocessed and normalized expression data from GEO or upload your own data. For the latter case, GeneTrail 2 supports several line-based file formats and a variety of database identifiers. Generally, identifiers are assumed not to contain any whitespace, and scores are assumed to be in decimal or scientific format using "." as the decimal mark. The following sections describe the requirements and restrictions associated with these formats.
Matrix
Matrices must have a header line that contains one unique identifier for each sample. Each remaining line contains one identifier followed by one expression value per sample.
Example
GSM368771 GSM368772 GSM368773 GSM368774 GSM368775
60496 8.12378 8.37528 8.02755 7.80555 8.49108
605 7.9787 7.95646 8.54249 8.6799 8.38875
6050 7.97796 7.96793 7.80971 7.78868 8.14545
60506 5.49484 5.38719 5.50641 5.78784 5.37093
60509 7.80633 7.85155 7.35756 7.82349 7.3896
6051 8.27485 8.5358 8.26115 8.2277 8.30632
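The matrix layout described above can be checked with a few lines of Python. This is only a minimal sketch, not GeneTrail2's own parser: the function name `parse_matrix` is hypothetical, and splitting on arbitrary whitespace is an assumption about the separator.

```python
def parse_matrix(text):
    """Parse a whitespace-separated expression matrix.

    Returns (samples, rows): the sample names from the header line and a
    dict that maps each identifier to one float per sample.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    samples = lines[0].split()
    if len(set(samples)) != len(samples):
        raise ValueError("sample identifiers in the header must be unique")

    rows = {}
    for line in lines[1:]:
        fields = line.split()
        identifier, values = fields[0], fields[1:]
        if len(values) != len(samples):
            raise ValueError(f"{identifier}: expected {len(samples)} values")
        # float() accepts decimal and scientific notation with "." as decimal mark
        rows[identifier] = [float(value) for value in values]
    return samples, rows


example = """GSM368771 GSM368772 GSM368773
60496 8.12378 8.37528 8.02755
605 7.9787 7.95646 8.54249
"""
samples, rows = parse_matrix(example)
print(samples, rows["605"])
```

A row with too few or too many values raises a `ValueError`, which mirrors the requirement of one expression value per sample.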
Scores
Score lists contain one identifier and its corresponding score per line, separated by whitespace. The given scores should indicate the "importance" of the identifier.
Example
hsa-let-7a-3p 0.128145025
hsa-let-7b-5p -0.24955694
hsa-let-7b-3p 0.594572760000001
hsa-let-7d-5p 0.00343912500000165
hsa-let-7d-3p 0.349173635
hsa-let-7e-5p -0.0185707299999986
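A score list in this format can be read and ordered as sketched below. The helper names `parse_scores` and `sort_by_score` are hypothetical, and treating larger scores as more important is an assumption made for illustration only.

```python
def parse_scores(text):
    """Parse 'identifier score' lines into a list of (identifier, score) pairs."""
    scores = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected an identifier and a score")
        scores.append((fields[0], float(fields[1])))
    return scores


def sort_by_score(scores):
    """Order entries from the highest to the lowest score (assumed importance)."""
    return sorted(scores, key=lambda entry: entry[1], reverse=True)


example = """hsa-let-7a-3p 0.128145025
hsa-let-7b-5p -0.24955694
hsa-let-7b-3p 0.594572760000001
"""
ranked = sort_by_score(parse_scores(example))
print([identifier for identifier, _ in ranked])
```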
Identifiers
Identifier lists contain one identifier per line.
Example
rs9262636
rs9262635
rs9262615
rs10859313
rs3130000
rs4947296
Warning
All algorithms except ORA assume that identifier lists are sorted by importance.
Pieces of Advice
- Only use ORA for relatively short lists, but remember that other algorithms require the list to be sorted by importance.
NCBI GEO
The Gene Expression Omnibus (GEO) is a MIAME-compliant online database for functional genomics data. Normalized data is stored in the GEO SOFT format, whereas unprocessed data is stored in a platform-dependent raw format. When using a record from GEO, GeneTrail 2 relies on the proper normalization of the stored data. If you want to normalize the data yourself, you will need to obtain and process the raw data from GEO and upload a data matrix or a score file.
The SOFT format is supported for GEO Datasets (GDS) and GEO Series (GSE).
- GSE files are collections of related samples and provide a description of the study design.
- GDS files are curated collections of statistically comparable GEO samples. These samples originate from GSE files that are curated and reassembled by GEO employees.
GeneTrail 2 requires you either to select one GSE record and distribute its samples into a sample set and a reference set, or to specify two GDS records, which are then used directly as sample and reference set.
2. Identifier-level Statistics
Whereas identifier lists and score lists can be used directly as input for computing enrichments or subgraphs, expression matrices first need to be processed into identifier-level scores. This step is needed to assess the "importance" of each biological entity, for example the amount of differential expression or the difference in protein abundance. This section describes methods that can be used to quantify the difference between the sample and the reference group. In general, we distinguish between four classes of scoring procedures. Whilst parametric and non-parametric scoring schemes are based on statistical hypothesis tests, the correlation and "other" classes use more basic statistics. We now introduce these classes:
Parametric Hypothesis Tests
Parametric tests are hypothesis tests that assume that the data follows a certain probability distribution. To use a parametric test, it is thus necessary to estimate the parameters of this distribution from the given samples [1]. Parametric tests can achieve a higher accuracy and a higher precision than non-parametric ones if the assumptions about the probability distribution are correct [2]. However, if the assumptions are incorrect, results obtained using these methods can be deceptive, as they may exhibit a considerable bias.
Currently GeneTrail2 implements the following parametric hypothesis tests that can be used as identifier-level statistics:
- Independent Shrinkage t-test
- Independent Student's t-test
- Dependent Student's t-test
T-tests are a family of statistical hypothesis tests [3] whose test statistics follow a Student's t-distribution [4]. T-tests are used to test hypotheses concerning the population mean or the difference between the means of two populations. The t-test is applicable if the populations are normally distributed and may be regarded as approximate if this is not the case [5]. For a large number of samples, the t-distribution converges towards the normal distribution.
We implemented the commonly used t-test for unpaired samples with unknown, unequal variance (Welch's t-test). As estimating the variance can be inaccurate for a small number of samples, we also provide a regularized version of this test (Shrinkage t-test) [6]. For Welch's t-test we implemented the unpaired and the paired version.
Pieces of Advice
- If your data are measured on a quantitative scale, the t-tests should be appropriate.
- The Shrinkage t-test is more robust than the standard t-tests, since this approach makes it possible to control the influence of outliers. For this reason, this method should always be preferred when sample sizes are small.
- Use a paired test if every sample in the reference group can be associated with a sample in the sample group. The measurements in each such pair should be carried out under identical conditions [5].
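To make the difference between the unpaired and the paired t-test concrete, the following sketch computes both test statistics for the values of a single identifier. It uses only the Python standard library; the function names are hypothetical, and the code illustrates the textbook formulas rather than GeneTrail2's implementation. The shrinkage variant [6] is not covered.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample, reference):
    """Welch's t statistic for two independent groups with unequal variances."""
    # statistics.variance is the unbiased sample variance (divides by n - 1)
    standard_error = sqrt(
        variance(sample) / len(sample) + variance(reference) / len(reference)
    )
    return (mean(sample) - mean(reference)) / standard_error


def paired_t(sample, reference):
    """t statistic for paired measurements, based on per-pair differences."""
    differences = [s - r for s, r in zip(sample, reference)]
    return mean(differences) / sqrt(variance(differences) / len(differences))


sample = [2.0, 4.0, 6.0]
reference = [1.0, 2.0, 3.0]
print(welch_t(sample, reference), paired_t(sample, reference))
```

For the same numbers, the paired statistic is larger here because the pairing removes the variation shared within each pair. Turning either statistic into a p-value additionally requires the t-distribution, which is left out of this sketch.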
Non-parametric Hypothesis Tests
- Wilcoxon Rank Sum Test
- Wilcoxon Matched Pairs Signed Rank Test
In comparison to parametric tests, non-parametric methods make fewer assumptions about the analyzed data. For example, they do not rely on the probability distributions of the assessed variables [7]. Because they rely on fewer assumptions, these approaches are more robust and may be applied in situations where less is known about the analyzed data. For example, non-parametric methods can be applied to samples that have a ranking but no clear numerical interpretation, such as when assessing preferences.
Both implemented tests are based solely on the order of the values in the two samples. They can be used to test if two samples are drawn from populations with the same underlying distribution.
Pieces of Advice
- The Wilcoxon Matched Pairs Signed Rank Test should be applied if all samples of the two groups are obtained in pairs.
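The statement that these tests depend only on the order of the values can be illustrated with the rank sum that underlies the Wilcoxon Rank Sum Test. The sketch below is an assumption-laden illustration with hypothetical function names: tied values receive the average of their ranks, and the computation of the p-value is omitted.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum(sample, reference):
    """Sum of the ranks of the sample group within the pooled data."""
    ranks = average_ranks(list(sample) + list(reference))
    return sum(ranks[: len(sample)])


print(rank_sum([5.1, 6.3, 7.0], [4.2, 5.0, 4.9]))
```

Replacing 7.0 by 700.0 leaves the rank sum unchanged, which is why such statistics are insensitive to outliers.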
Correlation Coefficients
- Pearson correlation coefficient
- Spearman correlation coefficient
As an alternative to hypothesis tests, correlation coefficients can be applied if both groups contain more than 15 samples. Correlation coefficients are measures of dependence between two variables X and Y and range from -1 to 1. For the Pearson coefficient, a value of 1 implies that the relationship between X and Y is perfectly described by a linear function, with all data points lying on a line for which both X and Y increase. A value of -1 implies that all data points lie on a line for which X increases as Y decreases. A value of 0 implies that there is no linear dependence between the variables. The Spearman coefficient is the Pearson coefficient computed on the ranks of the values, so it measures monotonic rather than strictly linear dependence.
Pieces of Advice
- The Spearman correlation should be used if the rank order of the values is more important than their actual magnitude.
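The difference between the two coefficients can be seen on a small, made-up data set in which Y grows monotonically but not linearly with X. The following is a plain-Python sketch with hypothetical helper names; the Spearman coefficient is computed as the Pearson coefficient of the average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    # index() gives the first 0-based position of v, count() the size of its tie group
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # y = x^2: monotonic but not linear
print(pearson(x, y), spearman(x, y))
```

Here the Spearman coefficient is exactly 1 because the order of the values agrees perfectly, while the Pearson coefficient stays slightly below 1.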
Other Scoring Schemes
- Z-score
- Log-Mean-Fold-Quotients
- Mean-Fold-Quotients
- Mean-Fold-Difference
If the sample group consists of only one measurement (e.g. for diagnostic purposes), the Z-score or the fold change has to be used, as none of the other methods is applicable in this case.
Pieces of Advice
- For better interpretability, it is common to use the logarithm of the fold change.
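The scoring schemes listed in this section can be sketched as follows. The exact definitions used by GeneTrail2 are not given here, so the formulas below are assumptions based on the names of the methods: in particular, the base-2 logarithm and the use of the sample standard deviation of the reference group are illustrative choices.

```python
from math import log2
from statistics import mean, stdev


def z_score(value, reference):
    """Distance of a single measurement from the reference mean, in standard deviations."""
    return (value - mean(reference)) / stdev(reference)


def mean_fold_quotient(sample, reference):
    """Ratio of the group means."""
    return mean(sample) / mean(reference)


def log_mean_fold_quotient(sample, reference):
    """Logarithm (base 2, by assumption) of the ratio of the group means."""
    return log2(mean_fold_quotient(sample, reference))


def mean_fold_difference(sample, reference):
    """Difference of the group means."""
    return mean(sample) - mean(reference)


reference = [4.0, 6.0, 8.0]
print(z_score(10.0, reference))
print(log_mean_fold_quotient([8.0], [2.0, 2.0]))
```

On a logarithmic scale, a doubling and a halving yield scores of the same magnitude with opposite signs (1 and -1 for base 2), which is one reason the logarithm is easier to interpret.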
3. Score Transformation
In some cases the result of an analysis can be improved by transforming the original scores. For example, Ackermann and Strimmer [8] show that squared values improve the detection of categories containing both up and down-regulated genes.
Users can choose from the following transformations:
- Absolute scores
- Logarithmized scores (natural logarithm)
- Logarithmized scores (base 2)
- Logarithmized scores (base 10)
- Squared scores
- Square root of scores
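The listed transformations can be written as a small lookup table. This is a sketch with hypothetical names, not GeneTrail2 code; note that the logarithms and the square root are only defined for positive (respectively non-negative) scores, so Python raises a `ValueError` otherwise.

```python
from math import log, log2, log10, sqrt

# Transformation names are made up for this example.
TRANSFORMATIONS = {
    "abs": abs,
    "log": log,        # natural logarithm
    "log2": log2,
    "log10": log10,
    "square": lambda score: score * score,
    "sqrt": sqrt,
}


def transform(scores, name):
    """Apply the named transformation to every score in a dict."""
    function = TRANSFORMATIONS[name]
    return {identifier: function(score) for identifier, score in scores.items()}


scores = {"geneA": -2.0, "geneB": 3.0, "geneC": 0.5}
print(transform(scores, "square"))
```

After squaring, the down-regulated `geneA` and the up-regulated `geneB` both have large positive scores, which is the effect that helps to detect categories containing both up- and down-regulated genes.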
4. Set-level Statistics
High-throughput techniques such as genome sequencing, microarrays, and mass spectrometry have revolutionized biomedical research by enabling comprehensive monitoring of large biological systems. Irrespective of the technology used, the analysis of high-throughput data typically yields a list of differentially expressed biological entities such as genes, miRNAs, or proteins. This list is extremely useful for identifying entities that may play important roles in pathological mechanisms. Enrichment analysis of molecular signatures is a natural extension of the study of individual genes or proteins.
The general idea of all set-level statistics is to assess whether a certain category $C$ is significantly enriched or depleted in the analyzed data. A category is a set of biological entities, such as genes, proteins, or metabolites, that are associated with a certain biological process, molecular function, or any other molecular signature of interest. The category is used to divide the input data into two groups: entries that are contained in the category and entries that are not. Based on this information, a statistical test is applied that quantifies the difference between these two groups.
Focusing on groups rather than on individual biological entities has several benefits. From a mathematical point of view, the analysis of groups instead of individual entities is advantageous, as it typically increases power and reduces the dimensionality of the underlying statistical problem [8]. From the biological perspective, identifying molecular signatures that differ between two conditions can have more explanatory power than a simple list of differentially expressed genes or miRNAs [9].
In the end a category is declared significantly enriched if the upper-tailed p-value of a test is significant and depleted if the lower-tailed p-value is significant.
$$P_{enriched} = P(X \ge x)$$
$$P_{depleted} = P(X \le x)$$
- Over-representation analysis (ORA)
- Weighted gene set enrichment analysis
- Gene set enrichment analysis (GSEA)
- Averaging methods (mean, median, sum)
- Maxmean statistic
- One sample t-test
- Welch's t-test
- Wilcoxon rank-sum test
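The upper- and lower-tailed p-values defined above can be illustrated for an over-representation analysis. The sketch assumes that the number of category members in the test set follows a hypergeometric distribution, which is a common choice for ORA but is not stated in this text; all function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, population, category_size, test_set_size):
    """Probability of drawing exactly k category members without replacement."""
    return (
        comb(category_size, k)
        * comb(population - category_size, test_set_size - k)
        / comb(population, test_set_size)
    )


def ora_p_values(hits, test_set_size, category_size, population):
    """Return (p_enriched, p_depleted) = (P(X >= hits), P(X <= hits))."""
    lowest = max(0, test_set_size - (population - category_size))
    highest = min(test_set_size, category_size)
    if not lowest <= hits <= highest:
        raise ValueError("hits is outside the possible range")
    p_enriched = sum(
        hypergeom_pmf(k, population, category_size, test_set_size)
        for k in range(hits, highest + 1)
    )
    p_depleted = sum(
        hypergeom_pmf(k, population, category_size, test_set_size)
        for k in range(lowest, hits + 1)
    )
    return p_enriched, p_depleted


# 2 of 3 test-set genes fall into a category of 4 genes, out of 10 genes overall
p_enriched, p_depleted = ora_p_values(hits=2, test_set_size=3, category_size=4, population=10)
print(p_enriched, p_depleted)
```

Both tails include the observed value x, so the two p-values sum to more than 1.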
Extensive reviews on enrichment analysis ([8], [10], [11], [12], [13], [14], [15]) have been published and reveal that no real gold standard for judging set-level statistics exists. This is because each of the proposed methods is based on a different definition of enriched categories (a different null hypothesis), which makes their results incomparable in general. Instead of using a single “magic bullet”, an appropriate algorithm needs to be chosen carefully for each individual research task.
Pieces of Advice
- If you want to analyze a small set of biological entities (like the most significant ones), or if there are no scores that indicate the importance of each entry in the data, an Over-Representation Analysis (ORA) has to be performed.
- However, if information about the extent of regulation (e.g., fold-changes, t-scores, etc.) is present, one of the other methods should be used instead. For non-expert users we recommend using the Gene Set Enrichment Analysis (GSEA), as this is a popular and robust method.
- GeneTrail 2 offers the possibility to perform multiple enrichments and to compare them in order to reach a higher sensitivity or specificity. For this purpose, two modes are available: the union mode displays all categories that are significant in at least one enrichment, whereas the intersection mode only displays categories that are significant in all of them. The union mode is useful for detecting variability between related enrichments, while the intersection mode can be used to reduce the number of false positives by computing and comparing two or more enrichments that use different algorithms. Using these modes, the user can balance the sensitivity and specificity of an analysis in a straightforward manner.
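The union and intersection modes described above amount to simple set operations on the significant categories of each enrichment. The following sketch uses made-up category names and p-values; the threshold of 0.05 and the comparison `p <= alpha` are assumptions for illustration.

```python
def significant_categories(p_values, alpha=0.05):
    """Categories whose (adjusted) p-value does not exceed alpha."""
    return {category for category, p in p_values.items() if p <= alpha}


def combine_enrichments(enrichments, mode="union", alpha=0.05):
    """Combine the significant categories of several enrichments."""
    sets = [significant_categories(e, alpha) for e in enrichments]
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets) if sets else set()
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical results of two enrichment algorithms on the same data
first = {"Apoptosis": 0.01, "Cell cycle": 0.20, "DNA repair": 0.03}
second = {"Apoptosis": 0.04, "Cell cycle": 0.02, "DNA repair": 0.30}

print(combine_enrichments([first, second], mode="union"))
print(combine_enrichments([first, second], mode="intersection"))
```

The union reports every category that any algorithm flags (higher sensitivity), whereas the intersection keeps only the categories on which all algorithms agree (higher specificity).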
5. Multiple Testing Correction
In an enrichment analysis, multiple categories are tested simultaneously. For each individual test, the same significance threshold α is used to judge whether a category is significant. This means that α is the probability that a single test makes a false positive prediction (Type-I-Error) when its null hypothesis is true. The problem with multiple testing is that this probability accumulates over all tests.
For k independent tests whose null hypotheses are all true, the probability of obtaining at least one false positive result is:
$$ P(\text{at least one significant result}) = 1-(1-\alpha)^k $$
Multiple testing procedures adjust the p-values derived from multiple statistical tests to correct for this accumulation of false positive predictions (Type-I-Errors). See [16], [17] for a general overview of p-value adjustment algorithms.
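To get a feeling for how quickly this probability grows, the formula can be evaluated for a few values of k. The function name is hypothetical, and the formula assumes independent tests for which all null hypotheses are true.

```python
def prob_at_least_one_false_positive(alpha, k):
    """1 - (1 - alpha)^k for k independent tests at significance level alpha."""
    return 1 - (1 - alpha) ** k


for k in (1, 20, 100):
    print(k, round(prob_at_least_one_false_positive(0.05, k), 4))
```

With α = 0.05, testing 20 categories already gives a probability of roughly 0.64 for at least one false positive, and for 100 categories it exceeds 0.99.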
Familywise Error Rate Controlling p-Value Adjustments
When performing multiple hypothesis tests, the familywise error rate (FWER) is the probability of making at least one false positive prediction, or Type-I-Error, among all the tested null hypotheses [16].
$$FWER = Pr(|FP| > 0)$$
GeneTrail2 supports the following procedures controlling the FWER:
- Bonferroni
- Sidak
- Holm adjustment
- Holm-Sidak
- Finner
- Hochberg
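Two of the listed procedures, the Bonferroni and the Holm adjustment, are sketched below in their standard textbook form. The function names are hypothetical and the code is not taken from GeneTrail2; it returns adjusted p-values in the order of the input.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


def holm(p_values):
    """Holm's step-down adjustment."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # ascending p-values
    adjusted = [0.0] * k
    running_max = 0.0
    for rank, index in enumerate(order):
        # the smallest p-value is multiplied by k, the next one by k - 1, ...
        candidate = min(1.0, (k - rank) * p_values[index])
        # enforce that adjusted p-values never decrease along the ordering
        running_max = max(running_max, candidate)
        adjusted[index] = running_max
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p_values))
print(holm(p_values))
```

Holm's adjusted p-values are never larger than the Bonferroni ones, so the procedure rejects at least as many hypotheses while still controlling the FWER.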
False Discovery Rate Controlling p-Value Adjustments
The false discovery rate (FDR) is the expected proportion of Type-I-Errors among all rejected null hypotheses [16].
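$$FDR=E\left(\frac{|FP|}{|FP|+|TP|}\right), \mathrm{with}\, |FP|=|TP|=0\Rightarrow \frac{|FP|}{|FP|+|TP|}=0$$
Here, FP and TP denote the sets of false positive and true positive predictions, so that |FP| + |TP| is the number of rejected null hypotheses. FDR-controlling adjustments are less conservative than adjustments controlling the familywise error rate [18] [16].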
GeneTrail2 supports the following procedures controlling the FDR:
- Benjamini-Hochberg
- Benjamini-Yekutieli
All methods described above can be used to control the number of false positive predictions, which generally makes the results easier to interpret. The choice of the p-value adjustment method can be used to tune the results towards a higher sensitivity or a higher specificity. Conservative methods like the Bonferroni correction or the Benjamini-Yekutieli procedure can be used to obtain a higher specificity, while more liberal methods, like the Hochberg method or the Benjamini-Hochberg adjustment, can be used to achieve a high sensitivity while still reducing the number of false positive predictions.
Pieces of Advice
- The Benjamini-Hochberg method is a common and well-accepted choice.
- Use conservative methods like Benjamini-Yekutieli if your application requires few false positives.
- Some of the more powerful methods impose restrictions, for example on the dependence between the individual tests, in order to control the FWER or the FDR properly and therefore need to be chosen carefully.
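The two FDR-controlling procedures can be sketched in their standard textbook form as follows. The function names are hypothetical and the code is an illustration, not GeneTrail2's implementation. The Benjamini-Yekutieli adjustment is obtained here by multiplying the Benjamini-Hochberg result by the harmonic sum 1 + 1/2 + ... + 1/k.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * k
    running_min = 1.0
    for position, index in enumerate(order):
        rank = k - position  # rank of this p-value in ascending order
        # enforce that adjusted p-values never increase towards smaller ranks
        running_min = min(running_min, p_values[index] * k / rank)
        adjusted[index] = running_min
    return adjusted


def benjamini_yekutieli(p_values):
    """Benjamini-Yekutieli adjustment: BH scaled by the harmonic sum, capped at 1."""
    harmonic = sum(1 / i for i in range(1, len(p_values) + 1))
    return [min(1.0, p * harmonic) for p in benjamini_hochberg(p_values)]


p_values = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p_values))
print(benjamini_yekutieli(p_values))
```

For these four p-values the Benjamini-Yekutieli results are about twice as large as the Benjamini-Hochberg ones, which reflects the more conservative behaviour mentioned above.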
6. Subgraph Analysis
The deregulation of biochemical pathways is known to play a crucial role in diseases like cancer or Parkinson's disease. Hence, calculating such deregulated pathways may help to gain new insights into pathogenic mechanisms and may open novel avenues for therapy stratification in the sense of personalized medicine. Subgraph analysis algorithms make it possible to detect deregulated pathways in biological networks such as KEGG [19] or STRING [20] based on gene expression data.
The following algorithms are currently integrated:
- Subgraph ILP
- FiDePa
Pieces of Advice
- FiDePa detects the most deregulated paths in the network.
- The Subgraph ILP detects the most deregulated subgraphs.
- For non-expert users we recommend using our Subgraph ILP, as this method delivers more interpretable results.
Bibliography
[1] Modes of parametric statistical inference. John Wiley and Sons.
[2] Parametric and nonparametric: Demystifying the terms. Mayo Clinic CTSA BERD Resource.
[3] The probable error of a mean. Biometrika. JSTOR.
[4] Taschenbuch der Statistik. Harri Deutsch Verlag.
[5] 100 statistical tests. Sage.
[6] Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Statistical Applications in Genetics and Molecular Biology.
[7] Nonparametric statistics for non-statisticians: a step-by-step approach. John Wiley & Sons.
[8] A general modular framework for gene set enrichment analysis. BMC Bioinformatics.
[9] Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. Oxford Univ Press.
[10] Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research. Oxford Univ Press.
[11] Gene set enrichment analysis: performance evaluation and usage guidelines. Briefings in bioinformatics. Oxford Univ Press.
[12] Ten years of pathway analysis: current approaches and outstanding challenges. PLoS computational biology. Public Library of Science.
[13] Rigorous assessment of gene set enrichment tests. Bioinformatics. Oxford Univ Press.
[14] Gene-set approach for expression pattern analysis. Briefings in bioinformatics. Oxford Univ Press.
[15] Microarray-based gene set analysis: a comparison of current methods. BMC bioinformatics. BioMed Central Ltd.
[16] p-Value Adjustments - SAS/STAT(R) 9.22 User's Guide.
[17] Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.
[18] More powerful procedures for multiple significance testing. Statistics in medicine. Wiley Online Library.
[19] KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. Oxford Univ Press.
[20] STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic acids research. Oxford Univ Press.