Statistical approaches for genomic data analysis

Indian Agricultural Statistics Research Institute
Library Avenue, New Delhi-110 012
The past decade has seen tremendous growth in the availability of both computer hardware and statistical software. As a result, the use of multivariate statistical techniques has spread to most fields of scientific research and to many areas of business and public management. In both research and management there is increasing recognition of the need to analyze data in a manner that takes into account the interrelationships among variables.

Variables can be classified as quantitative or qualitative. A quantitative variable is one in which the variates differ in magnitude, for example, income, age and weight. A qualitative variable is one in which the variates differ in kind rather than in magnitude, for example, marital status, sex, nationality and hair colour. Obtaining values for quantitative variables involves measurement along a scale and a unit of measure. A unit of measure may be infinitely divisible (for example, kilometers or meters) or indivisible (for example, family size). When the unit of measure is infinitely divisible the variable is said to be continuous. In the case of an indivisible unit of measure the variable is said to be discrete.

Scales of measurement can also be classified on the basis of the relations among the elements composing the scale. For example, an ordinal scale is one in which the elements along the scale can be ordered from low to high. A nominal scale corresponds to qualitative data. An example is the variable marital status, which has the categories married, single, divorced, widowed and separated. The five categories can be assigned coded values such as 1, 2, 3, 4 and 5. Although these coded values are numerical, they must not be treated as quantitative. On occasion, quantitative variables are treated in an analysis as if they were nominal. In general, we use the term categorical to denote a variable that is used as if it were nominal.
The variable age, for example, can be divided into 6 levels and coded 1, 2, 3, 4, 5 and 6. Principal component analysis and factor analysis are primarily designed for the analysis of data on continuous variables, whereas correspondence analysis is designed for categorical data. Before describing correspondence analysis in detail, we explain a few terms that are commonly used in it.
Two-Dimensional Contingency Tables: When a sample of n observations is simultaneously cross-classified with respect to two categorical random variables (X, Y), the joint frequencies can be summarized in a table called a two-dimensional contingency table.
The random variable X is assumed to have a range of values consisting of r categories, whereas
the variable Y is assumed to have c categories. The cell density or joint density for cell (i, j) is
denoted by fij, i = 1, 2, …, r; j= 1, 2, …, c; where it is understood that the first subscript refers
to the row and the second subscript to the column. The marginal densities are denoted by fi. and
f.j for the row and column variables respectively. The conditional densities for the rows given
column j will be denoted by fi.(i|j) and for the columns given row i by f.j (j|i).
Row and column proportions:
The conditional densities f.j (j|i) are often referred to as row proportions, and the marginal
density f.j is called the column total proportions. In a similar fashion the conditional densities
fi.(i|j) are often referred to as column proportions, and the marginal density fi. is called the row
total proportions.
Row and column profiles:
The row and column proportions are also commonly referred to as row and column profiles.
The term profile is often used in connection with the graphical displays of relationships in a
contingency table.
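The definitions above can be illustrated with a minimal sketch. The 3 x 4 table of counts below is hypothetical (it is not taken from the text), and the variable names are chosen here only for illustration.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts (illustrative only, not from the text).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)
n = counts.sum()

f = counts / n            # joint densities f_ij
f_row = f.sum(axis=1)     # marginal densities f_i. (row total proportions)
f_col = f.sum(axis=0)     # marginal densities f_.j (column total proportions)

# Row profiles: conditional densities of the columns given row i.
row_profiles = f / f_row[:, None]
# Column profiles: conditional densities of the rows given column j.
col_profiles = f / f_col[None, :]

print(row_profiles.round(3))
print(col_profiles.round(3))
```

Each row of `row_profiles` and each column of `col_profiles` sums to one, which is what makes them comparable across categories with different totals.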
Singular value decomposition (SVD): A real (n × p) matrix A of rank k can be expressed as the
product of three matrices that have a useful interpretation. This decomposition of A is referred
to as a singular value decomposition and is given by

$$A = U D V'$$

where
1. D (k × k) is a diagonal matrix with positive diagonal elements $\alpha_1, \alpha_2, \ldots, \alpha_k$, which are
called the singular values of A (without loss of generality we assume that the $\alpha_j$, j = 1,
2, …, k, are arranged in descending order).
2. The k columns of U (n × k), u1, u2, …, uk, are called the left singular vectors of A and
the k columns of V (p × k), v1, v2, …, vk, are called the right singular vectors of A.
3. The matrix A can be written as the sum of k matrices, each with rank 1:
$A = \sum_{j=1}^{k} \alpha_j u_j v_j'$. The subtraction of any one of these terms from the sum results in a
singular matrix for the remainder of the sum.
4. The matrices U (n × k) and V (p × k) have the property that U'U = V'V = I; hence the
columns of U form an orthonormal basis for the columns of A in n-dimensional space
and the columns of V form an orthonormal basis for the rows of A in p-dimensional space.
5. Let A(l) denote the first l terms of the singular value decomposition for A; hence
$A_{(l)} = \sum_{j=1}^{l} \alpha_j u_j v_j'$. This expression minimizes
$\mathrm{tr}[(A - X)(A - X)'] = \sum_{i=1}^{n}\sum_{j=1}^{p} (a_{ij} - x_{ij})^2$ among
all (n × p) matrices X of rank l. Thus the singular value decomposition can be used to
provide a matrix approximation to A.
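The properties listed above can be checked numerically. The following is a minimal sketch using NumPy's `numpy.linalg.svd` on a hypothetical random matrix; the helper `rank_l_approx` is a name introduced here for illustration and is not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))   # hypothetical (n x p) matrix; rank k = 4 here

# Thin SVD: U is (n x k), alpha holds the singular values in descending
# order, and Vt is V' with shape (k x p).
U, alpha, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

def rank_l_approx(l):
    """Sum of the first l rank-1 terms alpha_j * u_j * v_j'."""
    return (U[:, :l] * alpha[:l]) @ Vt[:l, :]

A2 = rank_l_approx(2)
# Squared error tr[(A - X)(A - X)'] of the rank-2 approximation.
err = np.sum((A - A2) ** 2)

print(alpha.round(3))
print(err, np.sum(alpha[2:] ** 2))
```

The two printed error values agree: the squared error of the rank-l approximation equals the sum of the squared singular values that were left out.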

Biplots: A biplot is used to provide a two-dimensional representation for a data matrix X. Only
two dimensions are usually employed to keep the presentation simple. It is assumed that a
singular value decomposition approximation for X based on r = 2 dimensions is adequate.
This of course should be evaluated by examining the magnitudes of the singular values beyond
r = 2. The sum of these remaining residual singular values should ideally represent only a
small proportion of trD.
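One way to carry out the check described above is to compute the share of tr D carried by the singular values beyond the first r. This is only a sketch; the function name `residual_share` and the example matrix are assumptions made for illustration.

```python
import numpy as np

def residual_share(X, r=2):
    """Share of tr D (sum of singular values) beyond the first r."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values, descending
    return d[r:].sum() / d.sum()

# Hypothetical data matrix built as a sum of two rank-1 terms, so two
# dimensions reproduce it exactly and the residual share is (numerically) zero.
X = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0]) \
    + np.outer([1.0, -1.0, 1.0, -1.0], [0.0, 1.0, 0.0])

print(residual_share(X, r=2))
print(residual_share(np.eye(3), r=2))
```

A share close to zero suggests that a two-dimensional biplot loses little information; a large share suggests that more dimensions are needed.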
A singular value decomposition approximation for X based on two dimensions is given by
$\hat{X} = U D V'$, where the rows of V′ (2 × p) are the eigenvectors of X'X and the columns of
U (n × 2) are the eigenvectors of XX'. There are several ways of employing the three
elements of the right-hand side of the equation for $\hat{X}$. The most common form is called the principal components plot.

Correspondence analysis is a technique that uses singular value decomposition to analyze a matrix of nonnegative data. The technique simultaneously characterizes the relationship among the rows and also among the columns of the data matrix. The outcome of a correspondence analysis is a pair of bivariate plots. One bivariate plot is based on the first two principal axes derived from the row profiles, and the second plot is based on the first two principal axes obtained from the column profiles. Points representing the row categories are plotted using the row principal axes and points representing the column categories are plotted using the column principal axes. The spatial relationships within each of the two sets of categories can then be studied using the two bivariate plots. By using the same pair of axes to denote both pairs of principal axes, the two bivariate plots can be superimposed on one another. With both plots appearing on the same axes, the spatial relationship between the row categories and the column categories can also be studied. The SAS computer software procedure CORRESP will be used throughout this section to perform the necessary data analysis.

Table 1. Correspondence matrix of observed cell densities for an (r × c) contingency table
Row    1     2     3     …    c     Total
1      O11   O12   O13   …    O1c   O1.
2      O21   O22   O23   …    O2c   O2.
3      O31   O32   O33   …    O3c   O3.

Correspondence analysis for two-dimensional contingency tables
Correspondence analysis can be used to study interaction in a two-dimensional contingency table. Table 1 shows the observed cell proportions or cell densities. Let us denote the cell density for cell (i, j) as Oij = nij/n, where nij denotes the sample frequency in cell (i, j); i = 1, 2, …, r and j = 1, 2, …, c. The row and column marginal densities are given by Oi. = ni./n and O.j = n.j/n respectively, where ni. and n.j are the row and column marginal frequencies.
Example: The data given in Table 2 pertain to a student-run legal advice service for the poor. The table examines the relationship between the type of criminal charge and the eventual outcome of the case for both males and females.

Table 2. Contingency Table for Criminal Charge Data
Columns: Convicted, Sex, Impaired, Theft Under, Mischief, Possession, Other, Totals

Table 3. Correspondence Matrix for Criminal Charge Data
Columns: Convicted, Sex, Impaired, Theft Under, Mischief, Possession, Other, Totals

The corresponding matrix of cell densities and the row and column marginal densities are shown
in Table 3. The numbers are given as percentages and hence represent 100 Oij. The column of
row masses on the right presents the row marginals as percentages, 100 Oi., and the row of column
masses (last row) displays the column marginals, 100 O.j. The majority of the clients were
convicted males, and 30.3% of the sample were convicted females. The two most common
offences were impaired driving (37.2%) and theft under $1000 (28.6%). The most common
offence for males was impaired driving (28.1% of the sample) and the most common female
offence was theft under $1000 (17.8% of the sample).
Correspondence Matrix and Row and Column Masses
The (r × c) matrix of cell densities shown in Table 1 is denoted by O and is called the
correspondence matrix. The (r × 1) vector of row marginals Oi., i = 1, 2, …, r, is denoted by r and
similarly the (c × 1) vector of column marginals O.j, j = 1, 2, …, c, is denoted by c. These row and
column marginal vectors can be written as r = Oec and c = O′er, where ec (c × 1) and er (r × 1)
are vectors of unities. The vectors r and c are also referred to respectively as the row and column
masses. Diagonal matrices constructed from the row and column masses are denoted by Dr
(r × r) and Dc (c × c) respectively. The diagonal elements of Dr are the elements of r and the
diagonal elements of Dc are the elements of c.
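As a minimal sketch of these definitions, the correspondence matrix, the masses and the diagonal mass matrices can be formed as follows. The table of counts is hypothetical. Note that the column masses are obtained from the transpose of O, so that the (c × r) by (r × 1) product conforms.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts (illustrative only, not from the text).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()     # correspondence matrix of cell densities O_ij

e_c = np.ones(O.shape[1])     # (c x 1) vector of unities
e_r = np.ones(O.shape[0])     # (r x 1) vector of unities

r = O @ e_c                   # row masses O_i.
c = O.T @ e_r                 # column masses O_.j

D_r = np.diag(r)              # (r x r) diagonal matrix of row masses
D_c = np.diag(c)              # (c x c) diagonal matrix of column masses

print(r.round(3), c.round(3))
```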
Table 4. Matrix R for Row Profiles
Column mass (last row): n.1/n, n.2/n, n.3/n, …, n.c/n, 1

Row and Column Profiles

Beginning with the table of cell frequencies nij, for each row i the (c × 1) vector of row
conditional densities is determined from nij/ni., j = 1, 2, …, c, and is denoted by ri. These row
conditional densities are called row profiles. The complete set of r row profiles is denoted
by the (r × c) matrix R with rows given by ri′, i = 1, 2, …, r. Similarly the vector of column
conditional densities nij/n.j, i = 1, 2, …, r, for column j is denoted by the (r × 1) vector cj, j = 1, 2,
…, c, and the complete set of column profiles forms the (r × c) matrix C with columns cj. The matrices R and C are illustrated in Tables 4 and 5 respectively. These row and
column profile matrices are useful for judging the departure from independence.

For the criminal charge data the row profile matrix R and the column profile matrix C are summarized in
Tables 6 and 7. The row profiles in Table 6 compare the four sex/conviction categories. The two female profiles (no and yes) are quite similar to each other, but the two male profiles are different from each other. For the column profiles in Table 7 the impaired driving and possession of narcotics profiles are similar to each other. The mischief and other profiles are also similar. The profile for theft under $1000 is quite different from the other four column profiles. Since theft under $1000 is the only offence dominated by females, we shall see that this provides a partial explanation for this different column profile.

Table 6. Row Profiles for Criminal Charge Data
Columns: Convicted, Sex, Impaired, Theft Under, Mischief, Possession, Other, Totals

Table 7. Column Profiles for Criminal Charge Data
Convicted  Sex   Impaired  Theft Under  Mischief  Possession  Other  Row mass
Yes        Male  0.700     0.278        0.440     0.697       0.463  0.516

Departure from Independence
The purpose of correspondence analysis in the study of contingency tables is usually to study the departure of the observed cell frequencies from the cell frequencies expected under independence. Although it is possible to compare the observed cell frequencies with those expected under other models, the independence model is the most commonly used base for comparisons. Under the independence assumption, the theoretical row profiles for each row should be equal to the column marginals and, equivalently, the true column profiles for each column should be equal to the row marginals.

Table 8. Row profile deviation from independence
Table 9. Column profile deviation from independence
For the sample correspondence matrix, therefore, the matrix differences (R − er c′) and (C − r ec′)
measure the degree of departure or deviation from independence in the sample (Tables 8 and
9). Equivalently, under independence the product rc′ of the sample row and column
marginal vectors or masses should be approximately equal to the correspondence matrix O of
observed cell densities. The matrix difference (O − rc′) is therefore also a measure of the
deviation from independence.
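The three deviation matrices can be computed directly. The sketch below uses a hypothetical table of counts and also builds an exactly independent table from the same masses to confirm that its deviations vanish.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts (illustrative only, not from the text).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()
r = O.sum(axis=1)                 # row masses
c = O.sum(axis=0)                 # column masses

R = O / r[:, None]                # row profiles (rows of R)
C = O / c[None, :]                # column profiles (columns of C)

dev_R = R - np.outer(np.ones_like(r), c)   # R - e_r c'
dev_C = C - np.outer(r, np.ones_like(c))   # C - r e_c'
dev_O = O - np.outer(r, c)                 # O - r c'

# A table that is exactly independent with the same masses: O_ind = r c'.
O_ind = np.outer(r, c)
R_ind = O_ind / O_ind.sum(axis=1)[:, None]

print(dev_R.round(3))
print(np.abs(R_ind - c).max())    # zero up to rounding error
```

Every row of `dev_R` and every column of `dev_C` sums to zero, because each profile and the corresponding vector of masses both sum to one.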
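Pearson Chi-Square Statistic and Total Inertia: The Pearson Chi-Square Statistic for testing independence is given as

$$G^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(n_{ij} - n_{i.}n_{.j}/n)^2}{n_{i.}n_{.j}/n} = n \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - O_{i.}O_{.j})^2}{O_{i.}O_{.j}}.$$

The above versions of the Pearson Chi-Square Statistic can also be expressed as

$$G^2 = \sum_{i=1}^{r} n_{i.}\,(r_i - c)' D_c^{-1} (r_i - c) = \sum_{j=1}^{c} n_{.j}\,(c_j - r)' D_r^{-1} (c_j - r).$$

The statistic G²/n is called the total inertia. It can be viewed as a measure of the magnitude of the total row squared deviations or, equivalently, the magnitude of the column squared deviations. Total inertia can also be expressed in the form

$$\mathrm{tr}\left[D_r^{-1}(O - rc')\,D_c^{-1}(O - rc')'\right].$$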
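The equivalence of these expressions can be verified numerically. The following sketch uses a hypothetical table of counts; it computes the statistic from the cell densities, from the row profiles, and from the column profiles, and compares the total inertia with the trace form.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts (illustrative only, not from the text).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)
n = counts.sum()

O = counts / n
r = O.sum(axis=1)                 # row masses
c = O.sum(axis=0)                 # column masses
E = np.outer(r, c)                # densities expected under independence

# Cell-density form of the Pearson statistic.
G2 = n * np.sum((O - E) ** 2 / E)

# Row-profile and column-profile forms (n_i. = n * r_i, n_.j = n * c_j).
R = O / r[:, None]
C = O / c[None, :]
Dc_inv = np.diag(1.0 / c)
Dr_inv = np.diag(1.0 / r)
G2_rows = sum(n * r[i] * (R[i] - c) @ Dc_inv @ (R[i] - c) for i in range(len(r)))
G2_cols = sum(n * c[j] * (C[:, j] - r) @ Dr_inv @ (C[:, j] - r) for j in range(len(c)))

# Trace form of the total inertia.
inertia = np.trace(Dr_inv @ (O - E) @ Dc_inv @ (O - E).T)

print(G2, G2_rows, G2_cols)
print(inertia, G2 / n)
```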
Table 10 shows the cell contributions to the total Chi-square statistic.

Table 10. Contribution to Chi-Square statistic for criminal charge data

Coordinates of row and column profiles: For the singular value decomposition of (O − rc′)
given by ADμB′, the columns of the matrices A and B provide the principal axes for the columns
and rows of (O − rc′) respectively. Each row of (O − rc′) can be expressed as a linear
combination of the rows of B′ (columns of B), and hence the coordinates for the rows of (O −
rc′) in the space generated by the rows of B′ are given by ADμ. The coordinates for the ith
row of (O − rc′) are given by the ith row of ADμ. Similarly the coordinates for the columns of
(O − rc′) with respect to the space generated by the columns of A are provided by the columns
of DμB′.
To obtain the coordinates for the row and column profile deviations, the relationships

$$(R - e_r c') = D_r^{-1}(O - rc'), \qquad (C - r e_c') = (O - rc')D_c^{-1}$$

can be used. The required coordinates for the row and column profile deviations are therefore given by

$$V\,(r \times k) = D_r^{-1} A D_\mu, \qquad W\,(c \times k) = D_c^{-1} B D_\mu$$

respectively. The coordinates for row profiles on row principal axes and coordinates for column profiles on column principal axes are given in Tables 11 and 12.

Table 11. Coordinates for row profiles on row principal axes
Row principal axes (columns of B): 1, 2, 3, …

Table 12. Coordinates for column profiles on column principal axes
Column principal axes (columns of A): 1, 2, 3, …
For the criminal charge data the coordinates for the row and column profile deviations on their respective dimensions are shown in Tables 13 and 14. For the row profiles it would appear that the first dimension reflects a contrast between females charged and males convicted. The second row dimension is primarily a measure of males charged but not
convicted. For the column profiles the first dimension represents a contrast between theft
under $1000 and the crimes of narcotics possession and impaired driving. The second
dimension for the column profile deviations seems to reflect a contrast between the three
charges mischief, narcotics possession and other offences and the charge impaired driving.

Table 13. Coordinates for row profiles on row principal axes for criminal charge data
Table 14. Coordinates for column profiles on column principal axes for criminal charge data
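The coordinate matrices V = Dr⁻¹ADμ and W = Dc⁻¹BDμ can be sketched in NumPy as follows. This is not the SAS CORRESP procedure used in the text. It assumes the normalization A′Dr⁻¹A = B′Dc⁻¹B = I that is common in correspondence analysis, which is not stated in the text, and obtains A, Dμ and B from an ordinary SVD of the standardized residuals. The table of counts is hypothetical.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts (illustrative only, not from the text).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()
r = O.sum(axis=1)                      # row masses
c = O.sum(axis=0)                      # column masses
resid = O - np.outer(r, c)             # O - r c'

# Assumption: A' D_r^{-1} A = B' D_c^{-1} B = I, so A, D_mu and B come from
# the ordinary SVD of S = D_r^{-1/2} (O - r c') D_c^{-1/2}.
S = resid / np.sqrt(np.outer(r, c))
U, mu, Vt = np.linalg.svd(S, full_matrices=False)

k = int(np.sum(mu > 1e-12))            # keep the nonzero singular values only
U, mu, Vt = U[:, :k], mu[:k], Vt[:k, :]

A = np.sqrt(r)[:, None] * U            # principal axes for the columns
B = np.sqrt(c)[:, None] * Vt.T         # principal axes for the rows

V = (A / r[:, None]) * mu              # row coordinates    D_r^{-1} A D_mu
W = (B / c[:, None]) * mu              # column coordinates D_c^{-1} B D_mu

total_inertia = np.sum(resid ** 2 / np.outer(r, c))
print(V.round(3))
print(W.round(3))
print(np.sum(mu ** 2), total_inertia)
```

Under this normalization the squared singular values sum to the total inertia, and plotting the first two columns of V and W on the same axes gives the superimposed display described earlier.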
Jobson, J.D. (1992). Applied multivariate data analysis. Vol II, Categorical and multivariate


