# GeneTrail2 1.5

#### Statistical analysis of molecular signatures

## Identifier-level statistics

Whereas identifier lists and score lists can be used directly as input for computing enrichments, expression matrices must first be processed into identifier-level scores. This step quantifies the amount of differential expression for each biological entity. This section describes methods for quantifying the difference between two groups, each of which can contain several measurements (samples). We use $X$ and $Y$ to denote the random variables from which the samples are drawn and $x_i$ and $y_i$ for the samples themselves. The numbers of drawn samples are denoted by $n$ and $m$, respectively.
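The processing step can be sketched as follows. This is a minimal illustration, not the GeneTrail2 implementation: the matrix layout (a dictionary from identifier to a list of measurements), the function names, and the toy values are all assumptions made for the example, and the difference of group means stands in for any of the scores described in this section.

```python
from statistics import mean

def mean_difference(x, y):
    """Simplest possible score: difference of the two group means."""
    return mean(x) - mean(y)

def score_identifiers(matrix, cols_x, cols_y, score=mean_difference):
    """Reduce each row of an expression matrix to one identifier-level score.

    matrix: dict mapping identifier -> list of measurements (assumed layout)
    cols_x, cols_y: column indices of the samples in group X and group Y
    score: any function taking the two groups and returning a number
    """
    scores = {}
    for identifier, row in matrix.items():
        x = [row[i] for i in cols_x]
        y = [row[i] for i in cols_y]
        scores[identifier] = score(x, y)
    return scores

# Hypothetical toy matrix: 3 identifiers, 2 samples per group.
matrix = {
    "geneA": [5.0, 7.0, 1.0, 3.0],
    "geneB": [2.0, 2.0, 2.0, 2.0],
    "geneC": [1.0, 2.0, 6.0, 7.0],
}
scores = score_identifiers(matrix, cols_x=[0, 1], cols_y=[2, 3])
```

Passing a different `score` function swaps the scoring scheme without changing how the matrix is traversed.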

### Commonly used scoring schemes

#### Fold change

The fold change is a measure describing how much the sample means $\bar{x}$ and $\bar{y}$ differ.

$$fc = \frac{\bar{x}}{\bar{y}}$$

The standard fold change has the following problem. If $\bar{y}$ is larger than $\bar{x}$, all possible fold changes lie between 0 and 1, whereas if $\bar{y}$ is smaller than $\bar{x}$, the fold changes can take any value between 1 and infinity. This asymmetric scale makes it hard to interpret and compare different values. To overcome this difficulty, the fold change is often computed on logarithmized values. The logarithm maps the fold change onto a symmetric scale, where a decrease is indicated by negative values and an increase by positive ones.
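The asymmetry can be seen with a small numerical sketch. The sample values are made up, and the base-2 logarithm is an assumption for illustration; the text does not fix a base.

```python
import math
from statistics import mean

def fold_change(x, y):
    """Plain fold change: ratio of the two sample means."""
    return mean(x) / mean(y)

def log_fold_change(x, y):
    """Log fold change. Base 2 is an assumption of this sketch."""
    return math.log2(mean(x) / mean(y))

# Hypothetical measurements: group X is four times higher than group Y.
x = [8.0, 8.0, 8.0]
y = [2.0, 2.0, 2.0]

up = fold_change(x, y)        # increase, lands somewhere in (1, infinity)
down = fold_change(y, x)      # same change reversed, squeezed into (0, 1)
log_up = log_fold_change(x, y)
log_down = log_fold_change(y, x)
```

On the plain scale the same fourfold change appears as 4 in one direction and 0.25 in the other, while the logarithmized values differ only in sign.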

$$fc = \log\left(\frac{\bar{x}}{\bar{y}}\right) = \log(\bar{x}) - \log(\bar{y})$$

#### Z-score

The Z-score measures the distance between a score $x$ and the mean $\mu$ of a normal population in units of the standard deviation $\sigma$ [1]. A negative value of $z$ denotes that $x$ is below the mean and a positive value that it is above. With $\mu$ and $\sigma$ estimated by the sample mean $\bar{y}$ and the sample standard deviation $s_y$, the Z-score of a score $x$ is defined as:

$$z=\frac{x-\bar{y}}{s_y}$$

#### Signal-to-noise ratio

The signal-to-noise ratio (SNR) is a measure that compares the level of a desired signal to the level of background noise [2]. Here, it describes the difference of the sample means in units of the summed sample standard deviations $s_x$ and $s_y$.

$$SNR = \frac{\bar{x}-\bar{y}}{s_x + s_y}$$

#### Pearson correlation coefficient

The Pearson correlation coefficient [3] is a measure of linear dependence between two variables $X$ and $Y$. It ranges from $-1$ to $1$. A value of $1$ implies that the relationship between $X$ and $Y$ is perfectly described by a linear function, with all data points lying on a line for which both $X$ and $Y$ increase. A value of $-1$ implies that all data points lie on a line for which $Y$ decreases as $X$ increases. A value of $0$ implies that there is no linear dependence between the variables. The correlation coefficient for two samples $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{n})$ is defined as:

$$r= \frac{1}{n-1} \sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

#### Spearman correlation coefficient

The Spearman correlation coefficient [4] is a non-parametric measure for dependence between two variables $X$ and $Y$. It is defined as the Pearson correlation coefficient between the ranked variables [5]. It assesses how well the relationship between two variables can be described using a monotonic function. The correlation coefficient $\rho$ for two samples $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{n})$ is defined as:

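$$\rho = 1- \frac{6 \sum\limits_{i=1}^{n}(r(x_i) - r(y_i))^2}{n(n^2-1)}$$

The rank $r(x_i)$ of a sample $x_i$ is its position in the decreasingly ordered sequence of all samples of $X$; $r(y_i)$ is defined analogously for $Y$. This closed form assumes that there are no ties within either sample.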
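The difference between the two correlation coefficients can be sketched in a few lines. The functions follow the two formulas above literally, so the rank helper assumes tie-free data; all names and the toy data are illustrative assumptions.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation via standardized products, as in the formula above."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations (n - 1)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (n - 1)

def ranks_decreasing(values):
    """Rank 1 is the largest value. Assumes there are no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(x, y):
    """Spearman correlation via the closed form (valid without ties)."""
    n = len(x)
    rx, ry = ranks_decreasing(x), ranks_decreasing(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data: y grows monotonically but not linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cubic = [v ** 3 for v in x]

r_cubic = pearson(x, y_cubic)        # below 1: the relation is not linear
rho_cubic = spearman(x, y_cubic)     # exactly 1: the relation is monotonic
rho_reversed = spearman(x, list(reversed(x)))  # -1: perfectly decreasing
```

For data with ties, the closed form no longer applies and the Pearson coefficient of the (mid)ranks would have to be computed instead.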

### Parametric tests

Parametric tests are hypothesis tests that assume the data to be generated by a certain probability distribution and estimate the parameters of this distribution from the given samples [6]. If the assumptions about the probability distribution are correct, parametric tests can achieve higher accuracy and precision than non-parametric ones [7]. If the assumptions are incorrect, however, their results can be misleading.

#### F-test

The F-test can be used to test whether the variances of two populations are equal [8]. Its test statistic is the ratio of the two sample variances:

$$F = \frac{\text{Var}(X)}{\text{Var}(Y)}$$

A p-value for the test statistic $F$ can be derived from an F-distribution with $n-1$ degrees of freedom in the numerator and $m-1$ degrees of freedom in the denominator.
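A minimal sketch of the statistic and its degrees of freedom, using only the Python standard library. The function name and sample values are assumptions for illustration, and the p-value lookup is left out because the standard library has no F-distribution.

```python
from statistics import variance

def f_statistic(x, y):
    """Return (F, numerator df, denominator df) for two samples."""
    f = variance(x) / variance(y)  # ratio of sample variances (n - 1 denominators)
    return f, len(x) - 1, len(y) - 1

# Hypothetical samples of size n = 5 and m = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0]

f, df_num, df_den = f_statistic(x, y)
```

If SciPy is available, a one-sided p-value could then be obtained with `scipy.stats.f.sf(f, df_num, df_den)`.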

#### T-tests

In this section, we focus on parametric tests called t-tests. T-tests are a family of statistical hypothesis tests [9] whose test statistics follow a Student's t-distribution [10]. They can be used to test hypotheses about population means. All presented t-tests are accurate if the populations are normally distributed and may be regarded as approximate if this is not the case [11].
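As a rough sketch of the two classical variants described in the following subsections, the functions below compute the test statistic and degrees of freedom for Welch's t-test (using the Welch-Satterthwaite approximation, with the squared sum of the two variance terms in the numerator) and for the dependent t-test. The function names and toy data are illustrative assumptions. Only the Python standard library is used, so the final p-value lookup is left out.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # s_x^2 / n_x
    vy = variance(y) / len(y)  # s_y^2 / n_y
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    nu = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, nu

def paired_t(x, y):
    """Dependent (paired) t statistic and its n - 1 degrees of freedom."""
    d = [a - b for a, b in zip(x, y)]  # pairwise differences
    n = len(d)
    t = mean(d) / math.sqrt(variance(d) / n)
    return t, n - 1

# Hypothetical independent samples with unequal sizes and variances.
t_welch, nu_welch = welch_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0])

# With equal sizes and equal variances, nu reduces to n + m - 2.
_, nu_equal = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])

# Hypothetical paired measurements.
t_paired, df_paired = paired_t([5.0, 7.0, 9.0, 6.0], [4.0, 5.0, 6.0, 5.0])
```

Given `t` and the degrees of freedom, a two-sided p-value could be obtained from a t-distribution, for example with `2 * scipy.stats.t.sf(abs(t), df)` if SciPy is available.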

##### Welch's t-test

Welch's t-test is a generalization of the independent two-sample Student's t-test that can be used when the two samples have possibly unequal variances [12]. This test is used to investigate the significance of the difference between the means of two populations [11]. Suppose we have two populations with means $\mu_x$ and $\mu_y$. From these populations, two independent random samples of size $n_x$ and $n_y$ are taken, from which the sample means $\overline{x}$ and $\overline{y}$ and the sample variances $s_x^2$ and $s_y^2$ can be calculated. Then, the test statistic $t$ is defined as:

$$t=\frac{\overline{x}-\overline{y}}{\sqrt{\frac{s_x^2}{n_x}+\frac{s_y^2}{n_y}}}$$

A p-value for the test statistic $t$ can be derived from a t-distribution with $\nu$ degrees of freedom. The degrees of freedom for this test can be approximated by the Welch-Satterthwaite equation [13]:

$$\nu = \frac{\left(\frac{s^2_x}{n_x} + \frac{s^2_y}{n_y}\right)^2}{\left(\frac{s_x^2}{n_x}\right)^2/(n_x-1) + \left(\frac{s_y^2}{n_y}\right)^2/(n_y-1)}$$

##### Dependent t-test

The dependent or paired t-test [11] can be used to investigate the significance of the difference between two population means, $\mu_x$ and $\mu_y$. In this test, no assumptions are made about the population variances. All samples of the two groups must be obtained in pairs. Apart from population differences, the observations in each pair should be carried out under identical, or almost identical, conditions [11]. Then, the mean difference can be calculated as:

$$\bar{d}=\frac{1}{n}\sum\limits_{i=1}^{n}(x_{i} - y_{i})$$

If $d_i = x_i - y_i$ are the pairwise differences between the two groups, the variance of the differences is given by:

$$s^2_d=\sum\limits_{i=1}^{n}\frac{(d_i - \bar{d})^2}{n-1}$$

Accordingly, the test statistic $t$ can be defined as:

$$t=\frac{\overline{d}}{\sqrt{\frac{s_d^2}{n}}}$$

A p-value for the test statistic $t$ can be derived from a t-distribution with $n-1$ degrees of freedom.

##### Shrinkage t-test

The Shrinkage t-test [14] is a regularized version of Welch's t-test that replaces the sample variances $s_x^2$ and $s_y^2$ by estimates that are shrunk towards the median variance of all observed data points. Since this approach limits the influence of outliers, the Shrinkage t-test is more robust than the standard t-tests and is generally preferable when sample sizes are small. Unlike the other t-tests, the Shrinkage t-test needs to be performed on all data points simultaneously in order to find an appropriate shrinkage estimator. Whenever the test is applied to only one data point, or all variances are identical, it reduces to the standard Welch's t-test.

Suppose we have $p$ data points. Then the shrinkage estimator for the variance $\nu_k$ of data point $k$ is formulated as follows:

$$\nu_k^*= \hat \lambda^* \nu_{\text{median}} + (1- \hat \lambda^*)\nu_k$$

where $\hat \lambda^*$ is the estimated pooling parameter:

$$\hat \lambda^* = \min \left( 1, \frac{\sum_{k=1}^{p}\widehat{Var}(\nu_k)}{\sum_{k=1}^{p}(\nu_k - \nu_{\text{median}})^2} \right)$$

The quantities $\nu_k$ and $\widehat{Var}(\nu_k)$ are computed from the $n$ samples $x_{ik}$ of data point $k$:

$$\bar x_k = \frac{1}{n} \sum_{i=1}^{n}x_{ik}$$

$$w_{ik} = (x_{ik} - \bar x_{k})^2$$

$$\bar w_k = \frac{1}{n} \sum_{i=1}^{n}w_{ik}$$

$$\nu_k = \frac{n}{n-1} \bar w_k$$

$$\widehat{Var}(\nu_k)= \frac{n}{(n-1)^3} \sum_{i=1}^{n}(w_{ik} - \bar w_k)^2$$

The estimates for the second group are obtained analogously from its $m$ samples $y_{ik}$. The test statistic can be obtained by using Welch's t-test with the shrinkage variance estimates $\nu_{kx}^*$ and $\nu_{ky}^*$ of the two groups:

$$t_k^* = \frac{\bar x_{k} - \bar y_{k}}{\sqrt{\frac{\nu_{kx}^*}{n} + \frac{\nu_{ky}^*}{m}}}$$

The corresponding p-value can be obtained accordingly.
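The shrinkage step can be sketched as follows. This is a direct transcription of the formulas above under stated assumptions, not the reference implementation: each group is assumed to be a list of rows (one row of samples per data point), the function names are made up, and the pooling parameter is set to 1 when all variances coincide (in which case shrinking has no effect anyway).

```python
import math
from statistics import mean, median

def variance_and_its_variance(samples):
    """Return (nu_k, estimated Var(nu_k)) for the samples of one data point."""
    n = len(samples)
    x_bar = mean(samples)
    w = [(v - x_bar) ** 2 for v in samples]
    w_bar = mean(w)
    nu = n / (n - 1) * w_bar  # unbiased sample variance
    var_nu = n / (n - 1) ** 3 * sum((wi - w_bar) ** 2 for wi in w)
    return nu, var_nu

def shrink_variances(rows):
    """Return (lambda, shrunken variances) for all data points of one group."""
    stats = [variance_and_its_variance(r) for r in rows]
    nus = [s[0] for s in stats]
    nu_median = median(nus)
    numerator = sum(s[1] for s in stats)
    denominator = sum((nu - nu_median) ** 2 for nu in nus)
    # Assumption of this sketch: lambda = 1 if all variances are identical.
    lam = 1.0 if denominator == 0 else min(1.0, numerator / denominator)
    return lam, [lam * nu_median + (1 - lam) * nu for nu in nus]

def shrinkage_t(x_rows, y_rows):
    """Welch-type statistic per data point, using the shrunken variances."""
    _, vx = shrink_variances(x_rows)
    _, vy = shrink_variances(y_rows)
    return [
        (mean(x) - mean(y)) / math.sqrt(a / len(x) + b / len(y))
        for x, y, a, b in zip(x_rows, y_rows, vx, vy)
    ]

# Hypothetical data: 3 data points, 4 samples per group.
x_rows = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 3.0], [0.0, 5.0, 10.0, 15.0]]
y_rows = [[2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 2.0, 2.0], [3.0, 6.0, 9.0, 12.0]]

lam_x, shrunk_x = shrink_variances(x_rows)
t_values = shrinkage_t(x_rows, y_rows)
```

In the toy data, the small variance of the second row is pulled up towards the median and the large variance of the third row is pulled down, while the row that defines the median is left unchanged.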

### Non-parametric tests

In comparison to parametric tests, non-parametric methods make fewer assumptions about the analyzed data; for example, they do not assume that the assessed variables follow a particular probability distribution [15]. Because they rely on fewer assumptions, these approaches are more robust and may be applied in situations where less is known about the analyzed data. For example, non-parametric methods can be applied to samples that have a ranking but no clear numerical interpretation, such as when assessing preferences.

#### Wilcoxon rank-sum test

The Wilcoxon rank-sum test [10] is a non-parametric alternative to the independent two-sample t-test which is based solely on the order of the values in the two samples. It can be used to test if two samples are drawn from populations with the same underlying distribution. The test statistic can be obtained as follows:

The values of the two groups $X = (x_{1},x_{2},\ldots,x_{n})$ and $Y = (y_{1},y_{2},\ldots,y_{m})$ are pooled and sorted in increasing order. Each element receives its rank in the sorted list as its new value. If multiple entries have the same value, the mean of the ranks they span is assigned to all of them. Based on this information, the test statistic is defined as:

$$W = \sum_{i=1}^{n}R(x_{i})$$

where $R(x_i)$ is the rank of value $x_i$ in the sorted and pooled list of values.

For $n > 25$ and $m > 25$, $W$ is approximately normally distributed [10].

The Z-score in this case is:

$$Z = \frac{W - m_W}{s_W}$$

$$m_W = \frac{n(n + m +1)}{2}$$

$$s_W = \sqrt{\frac{n \cdot m (n + m + 1)}{12}}$$

A p-value for the $Z$-score can be derived from the standard normal distribution. If a normal approximation is not possible, the p-values for the test statistic $W$ can be looked up in a table [11].
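The procedure above can be sketched with the Python standard library. The function names are assumptions for illustration, the two-sided p-value is one possible choice, and the toy samples are far smaller than the sizes for which the normal approximation is recommended, so the numbers only show the mechanics.

```python
import math
from statistics import NormalDist

def midranks(values):
    """Ranks in increasing order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (W, Z, two-sided p) using the normal approximation."""
    n, m = len(x), len(y)
    ranks = midranks(list(x) + list(y))  # rank the pooled values
    w = sum(ranks[:n])                   # rank sum of group X
    m_w = n * (n + m + 1) / 2
    s_w = math.sqrt(n * m * (n + m + 1) / 12)
    z = (w - m_w) / s_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p

# Hypothetical samples in which every x is smaller than every y.
w, z, p = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
tied = midranks([1.0, 2.0, 2.0, 3.0])
```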

#### Wilcoxon matched-pairs signed-ranks test

The Wilcoxon matched-pairs signed-ranks test [10] is a non-parametric hypothesis test that can be used to test if two paired samples are drawn from populations with the same underlying distribution. It is a non-parametric alternative to the dependent t-test. The test statistic can be defined as follows:

Let $d_i = x_{i} - y_{i}$ be the pairwise differences between the two groups. The absolute differences $|d_i|$ are sorted increasingly and ranks are assigned; $R(d_i)$ denotes the rank of $|d_i|$. Ties receive a rank equal to the average of the ranks they span. Using the signs of the differences, we can obtain the two rank sums $W_+$ and $W_-$:

$$W_+ = \sum_{i=1}^{n}I(d_i > 0)R(d_i)$$

$$W_- = \sum_{i=1}^{n}I(d_i < 0)R(d_i)$$

$$W = \min(W_+,W_-)$$

$$I(x) = \begin{cases} 1, & \text{if }x\text{ is true,}\\ 0, & \text{if }x\text{ is false}\\ \end{cases}$$

For $n > 20$, $W$ is approximately normally distributed [10]. The Z-score in this case is:

$$Z = \frac{W - m_W}{s_W}$$

$$m_W = \frac{n(n +1)}{4}$$

$$s_W = \sqrt{\frac{n(n + 1)(2n+1)}{24}}$$

A p-value for the test statistic $Z$ can be derived from the standard normal distribution. If a normal approximation is not possible, the p-values for the test statistic $W$ can be looked up in a table [11].
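A minimal sketch of the signed-ranks procedure, assuming signed differences $d_i = x_i - y_i$ whose absolute values are ranked. The function names and toy pairs are illustrative assumptions. Zero differences are not treated specially here (they are ranked but count towards neither sum), and the sample is much smaller than the size recommended for the normal approximation.

```python
import math
from statistics import NormalDist

def midranks(values):
    """Ranks in increasing order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_test(x, y):
    """Return (W_plus, W_minus, Z, two-sided p) via the normal approximation."""
    d = [a - b for a, b in zip(x, y)]        # signed pairwise differences
    ranks = midranks([abs(v) for v in d])    # ranks of the absolute differences
    w_plus = sum(r for v, r in zip(d, ranks) if v > 0)
    w_minus = sum(r for v, r in zip(d, ranks) if v < 0)
    w = min(w_plus, w_minus)
    n = len(d)
    m_w = n * (n + 1) / 4
    s_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - m_w) / s_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w_plus, w_minus, z, p

# Hypothetical paired measurements (differences: 2, -1, 3, 4, -0.5).
x = [10.0, 12.0, 9.0, 14.0, 11.0]
y = [8.0, 13.0, 6.0, 10.0, 11.5]
w_plus, w_minus, z, p = signed_rank_test(x, y)
```

Without zero differences, `w_plus + w_minus` equals $n(n+1)/2$, which is a quick sanity check on the ranking.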

### Bibliography

1. Biostatistical analysis. Pearson Education India.
2. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. Nature Publishing Group.
3. Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London. The Royal Society.
4. The proof and measurement of association between two things. The American journal of psychology. JSTOR.
5. Research design and statistical analysis. Routledge.
6. Modes of parametric statistical inference. John Wiley and Sons.
7. Parametric and nonparametric: Demystifying the terms. Mayo Clinic CTSA BERD Resource.
8. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press.
9. The probable error of a mean. Biometrika. JSTOR.
10. Taschenbuch der Statistik. Harri Deutsch Verlag.
11. 100 statistical tests. Sage.
12. The generalization of student’s problem when several different population variances are involved. Biometrika. JSTOR.
13. An approximate distribution of estimates of variance components. Biometrics bulletin. JSTOR.
14. Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Statistical Applications in Genetics and Molecular Biology.
15. Nonparametric statistics for non-statisticians: a step-by-step approach. John Wiley & Sons.