## Statistical approaches for genomic data analysis

**CORRESPONDENCE ANALYSIS**
**Indian Agricultural Statistics Research Institute**
**Library Avenue, New Delhi-110 012**
The past decade has seen tremendous growth in the availability of both computer hardware and statistical software. As a result, the use of multivariate statistical techniques has spread to most fields of scientific research and many areas of business and public management. In both research and management there is increasing recognition of the need to analyze data in a manner that takes into account the interrelationships among variables.

Variables can be classified as quantitative or qualitative. A quantitative variable is one in which the variates differ in magnitude, for example, income, age and weight. A qualitative variable is one in which the variates differ in kind rather than in magnitude, for example, marital status, sex, nationality and hair colour. Obtaining values for quantitative variables involves measurement along a scale with a unit of measure. A unit of measure may be infinitely divisible (for example, kilometers or meters) or indivisible (for example, family size). When the unit of measure is infinitely divisible the variable is said to be continuous. In the case of an indivisible unit of measure the variable is said to be discrete.

Scales of measurement can also be classified on the basis of the relations among the elements composing the scale. For example, an ordinal scale is one in which the elements along the scale can be ordered from low to high. A nominal scale corresponds to qualitative data. An example would be the variable *marital status*, which has the categories married, single, divorced, widowed and separated. The five categories can be assigned coded values such as 1, 2, 3, 4 and 5. Although these coded values are numerical, they must not be treated as quantitative. On occasion, quantitative variables are treated in an analysis as if they were nominal. In general, we use the term *categorical* to denote a variable that is used as if it were nominal. The variable age, for example, can be divided into 6 levels and coded 1, 2, 3, 4, 5 and 6. Principal component analysis and factor analysis are primarily designed for the analysis of data on continuous variables, whereas correspondence analysis is designed for categorical data. Before discussing correspondence analysis in detail, we explain a few terms that are commonly used in it.

*Two-dimensional contingency tables*: When a sample of $n$ observations is simultaneously cross-classified with respect to two categorical random variables $(X, Y)$, the joint frequencies can be summarized in a table called a two-dimensional contingency table.

The random variable $X$ is assumed to have a range of values consisting of $r$ categories, whereas the variable $Y$ is assumed to have $c$ categories. The cell density or joint density for cell $(i, j)$ is denoted by $f_{ij}$, $i = 1, 2, \ldots, r$; $j = 1, 2, \ldots, c$, where it is understood that the first subscript refers to the row and the second subscript to the column. The marginal densities are denoted by $f_{i\cdot}$ and $f_{\cdot j}$ for the row and column variables respectively. The conditional densities for the rows given column $j$ will be denoted by $f_{i\cdot}(i \mid j)$ and for the columns given row $i$ by $f_{\cdot j}(j \mid i)$.

Row and column proportions:

The conditional densities $f_{\cdot j}(j \mid i)$ are often referred to as *row proportions*, and the marginal density $f_{\cdot j}$ is called the *column total proportions*. In a similar fashion the conditional densities $f_{i\cdot}(i \mid j)$ are often referred to as *column proportions*, and the marginal density $f_{i\cdot}$ is called the *row total proportions*.

Row and column profiles:

The row and column proportions are also commonly referred to as *row* and *column profiles*. The term profile is often used in connection with the graphical displays of relationships in a contingency table.
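The definitions above can be illustrated with a minimal NumPy sketch. The 3 × 4 table of counts is hypothetical (the values are made up for illustration and are not taken from the text), and the variable names are assumptions of this sketch.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts n_ij (illustrative values only).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

n = counts.sum()                    # sample size
f = counts / n                      # joint (cell) densities f_ij
f_row = f.sum(axis=1)               # row total proportions f_i.
f_col = f.sum(axis=0)               # column total proportions f_.j

# Row proportions (row profiles): density of column j given row i.
row_profiles = f / f_row[:, None]   # each row sums to 1
# Column proportions (column profiles): density of row i given column j.
col_profiles = f / f_col[None, :]   # each column sums to 1

print(np.round(row_profiles, 3))
print(np.round(col_profiles, 3))
```

Dividing by the marginals with broadcasting keeps the table shape, so row profiles are read along rows and column profiles down columns.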

*Singular value decomposition* (*SVD*): A real $(n \times p)$ matrix $\mathbf{A}$ of rank $k$ can be expressed as the product of three matrices that have a useful interpretation. This decomposition of $\mathbf{A}$ is referred to as a *singular value decomposition* and is given by

$$\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}'$$

where

1. $\mathbf{D}$ $(k \times k)$ is a diagonal matrix with positive diagonal elements $\alpha_1, \alpha_2, \ldots, \alpha_k$, which are called the singular values of $\mathbf{A}$ (without loss of generality we assume that the $\alpha_j$, $j = 1, 2, \ldots, k$, are arranged in descending order).
2. The $k$ columns of $\mathbf{U}$ $(n \times k)$, $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k$, are called the *left singular vectors* of $\mathbf{A}$, and the $k$ columns of $\mathbf{V}$ $(p \times k)$, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$, are called the *right singular vectors* of $\mathbf{A}$.
3. The matrix $\mathbf{A}$ can be written as the sum of $k$ matrices, each with rank 1: $\mathbf{A} = \sum_{j=1}^{k} \alpha_j \mathbf{u}_j \mathbf{v}_j'$. Subtracting any one of these terms from the sum reduces the rank of the remainder to $k - 1$.
4. The matrices $\mathbf{U}$ $(n \times k)$ and $\mathbf{V}$ $(p \times k)$ have the property that $\mathbf{U}'\mathbf{U} = \mathbf{V}'\mathbf{V} = \mathbf{I}$; hence the columns of $\mathbf{U}$ form an orthonormal basis for the columns of $\mathbf{A}$ in $n$-dimensional space and the columns of $\mathbf{V}$ form an orthonormal basis for the rows of $\mathbf{A}$ in $p$-dimensional space.
5. Let $\mathbf{A}_{(l)}$ denote the first $l$ terms of the singular value decomposition for $\mathbf{A}$; hence $\mathbf{A}_{(l)} = \sum_{j=1}^{l} \alpha_j \mathbf{u}_j \mathbf{v}_j'$. This expression minimizes $\operatorname{tr}[(\mathbf{A} - \mathbf{X})(\mathbf{A} - \mathbf{X})'] = \sum_{i}\sum_{j}(a_{ij} - x_{ij})^2$ among all $(n \times p)$ matrices $\mathbf{X}$ of rank $l$. Thus the singular value decomposition can be used to provide a *matrix approximation* to $\mathbf{A}$.
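The properties listed above can be checked numerically. The following is a minimal sketch using `numpy.linalg.svd` on a hypothetical random matrix; the matrix, the seed and the helper name `rank_l_approx` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))          # hypothetical (n x p) matrix, here of rank k = 4

# Thin SVD: U is (n x k), alpha holds the singular values in descending
# order, and Vt is V' with shape (k x p).
U, alpha, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
k = alpha.size

# Property 3: A is the sum of k rank-1 matrices alpha_j * u_j v_j'.
A_sum = sum(alpha[j] * np.outer(U[:, j], V[:, j]) for j in range(k))

def rank_l_approx(l):
    """Sum of the first l terms of the singular value decomposition."""
    return (U[:, :l] * alpha[:l]) @ Vt[:l, :]

# Property 5: the squared error of the rank-2 approximation equals the
# sum of the squared singular values that were left out.
A2 = rank_l_approx(2)
err = np.sum((A - A2) ** 2)

print(np.allclose(A_sum, A))
print(np.allclose(U.T @ U, np.eye(k)), np.allclose(V.T @ V, np.eye(k)))
print(err, np.sum(alpha[2:] ** 2))
```

Requesting `full_matrices=False` returns the thin factors, which match the $(n \times k)$ and $(p \times k)$ shapes used in the text when the matrix has full column rank.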

*Biplots*: A biplot is used to provide a two-dimensional representation of a data matrix $\mathbf{X}$. Only two dimensions are usually employed to keep the presentation simple. It is assumed that a singular value decomposition approximation for $\mathbf{X}$ based on $r = 2$ dimensions is adequate. This should of course be evaluated by examining the magnitudes of the singular values beyond $r = 2$. The sum of these remaining singular values should ideally represent only a small proportion of $\operatorname{tr}\mathbf{D}$.

A singular value decomposition approximation for $\mathbf{X}$ based on two dimensions is given by $\hat{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}'$, where the rows of $\mathbf{V}'$ $(2 \times p)$ are eigenvectors of $\mathbf{X}'\mathbf{X}$ and the columns of $\mathbf{U}$ $(n \times 2)$ are eigenvectors of $\mathbf{X}\mathbf{X}'$. There are several ways of employing the three elements of the right-hand side of the equation for $\hat{\mathbf{X}}$. The most common form is called the *principal components plot*.
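One way to judge whether two dimensions are adequate is to compute the share of $\operatorname{tr}\mathbf{D}$ carried by the singular values beyond the second. The sketch below does this for a hypothetical data matrix built as a rank-2 signal plus small noise; the data, the seed and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical (8 x 5) data matrix: a rank-2 signal plus small noise.
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 5))
X = X + 0.01 * rng.normal(size=(8, 5))

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the first two dimensions for the biplot.
U2, d2, Vt2 = U[:, :2], d[:2], Vt[:2, :]
X_hat = (U2 * d2) @ Vt2                      # two-dimensional approximation

# Share of tr(D) carried by the singular values beyond the second.
residual_share = d[2:].sum() / d.sum()
print(round(residual_share, 4))

# The rows of V' are eigenvectors of X'X and the columns of U are
# eigenvectors of XX', with eigenvalues equal to the squared singular values.
print(np.allclose(X.T @ X @ Vt2.T, Vt2.T * d2 ** 2))
print(np.allclose(X @ X.T @ U2, U2 * d2 ** 2))
```

A small `residual_share` suggests that the two-dimensional display loses little of the structure in the data matrix.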

*Correspondence analysis* is a technique that uses singular value decomposition to analyze a matrix of nonnegative data. The technique simultaneously characterizes the relationship among the rows and also among the columns of the data matrix. The outcome of a correspondence analysis is a pair of *bivariate plots*. One bivariate plot is based on the first two principal axes derived from the row profiles, and the second plot is based on the first two principal axes obtained from the column profiles. Points representing the row categories are plotted using the row principal axes and points representing the column categories are plotted using the column principal axes. The spatial relationships within each of the two sets of categories can then be studied using the two bivariate plots. By using the same pair of axes to denote both pairs of principal axes, the two bivariate plots can be superimposed on one another. With both plots appearing on the same axes, the spatial relationship between the row categories and the column categories can also be studied. The SAS computer software procedure CORRESP will be used throughout this section to perform the necessary data analysis.

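Table 1. Correspondence matrix of observed cell densities for an $(r \times c)$ contingency table

| Row | 1 | 2 | 3 | … | $c$ | Row mass |
|---|---|---|---|---|---|---|
| 1 | $O_{11}$ | $O_{12}$ | $O_{13}$ | … | $O_{1c}$ | $O_{1\cdot}$ |
| 2 | $O_{21}$ | $O_{22}$ | $O_{23}$ | … | $O_{2c}$ | $O_{2\cdot}$ |
| 3 | $O_{31}$ | $O_{32}$ | $O_{33}$ | … | $O_{3c}$ | $O_{3\cdot}$ |
| ⋮ | ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ |
| $r$ | $O_{r1}$ | $O_{r2}$ | $O_{r3}$ | … | $O_{rc}$ | $O_{r\cdot}$ |
| Column mass | $O_{\cdot 1}$ | $O_{\cdot 2}$ | $O_{\cdot 3}$ | … | $O_{\cdot c}$ | 1 |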

*Correspondence analysis for two-dimensional contingency tables*

Correspondence analysis can be used to study interaction in a two-dimensional contingency table. Table 1 shows the observed *cell proportions* or *cell densities*. Let us denote the cell density for cell $(i, j)$ as $O_{ij} = n_{ij}/n$, where $n_{ij}$ denotes the sample frequency in cell $(i, j)$; $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$. The row and column marginal densities are given by $O_{i\cdot} = n_{i\cdot}/n$ and $O_{\cdot j} = n_{\cdot j}/n$ respectively, where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column marginal frequencies.

Example: The data given in Table 2 pertain to a student-run legal advice service for the poor. The table examines the relationship between the type of criminal charge and the eventual outcome of the case for both males and females.

Table 2. Contingency Table for Criminal Charge Data

| Convicted | Sex | Impaired | Theft Under | Mischief | Possession | Other | Totals |
|---|---|---|---|---|---|---|---|

Table 3. Correspondence Matrix for Criminal Charge Data

| Convicted | Sex | Impaired | Theft Under | Mischief | Possession | Other | Totals |
|---|---|---|---|---|---|---|---|

The corresponding matrix of cell densities and the row and column marginal densities are shown in Table 3. The numbers are given as percentages and hence represent $100\,O_{ij}$. The column of row masses on the right presents the row marginals as percentages, $100\,O_{i\cdot}$, and the row of column masses (last row) displays the column marginals, $100\,O_{\cdot j}$. The majority of the clients were convicted males, and 30.3% of the sample were convicted females. The two most common offences were impaired driving (37.2%) and theft under \$1000 (28.6%). The most common offence for males was impaired driving (28.1% of the sample) and the most common offence for females was theft under \$1000 (17.8% of the sample).

*Correspondence Matrix and Row and Column Masses*

The $(r \times c)$ matrix of cell densities shown in Table 1 is denoted by $\mathbf{O}$ and is called the *correspondence matrix*. The $(r \times 1)$ vector of row marginals $O_{i\cdot}$, $i = 1, 2, \ldots, r$, is denoted by $\mathbf{r}$, and similarly the $(c \times 1)$ vector of column marginals $O_{\cdot j}$, $j = 1, 2, \ldots, c$, is denoted by $\mathbf{c}$. These row and column marginal vectors can be written as $\mathbf{r} = \mathbf{O}\mathbf{e}_c$ and $\mathbf{c} = \mathbf{O}'\mathbf{e}_r$, where $\mathbf{e}_c$ $(c \times 1)$ and $\mathbf{e}_r$ $(r \times 1)$ are vectors of ones. The vectors $\mathbf{r}$ and $\mathbf{c}$ are also referred to respectively as the *row and column masses*. Diagonal matrices constructed from the row and column masses are denoted by $\mathbf{D}_r$ $(r \times r)$ and $\mathbf{D}_c$ $(c \times c)$ respectively. The diagonal elements of $\mathbf{D}_r$ are the elements of $\mathbf{r}$ and the diagonal elements of $\mathbf{D}_c$ are the elements of $\mathbf{c}$.
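The masses and their diagonal matrices follow directly from the correspondence matrix. The sketch below assumes a hypothetical 3 × 4 table of counts (illustrative values, not the criminal charge data) and uses NumPy rather than the SAS procedure named in the text.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts n_ij (illustrative values only).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()        # correspondence matrix of cell densities

e_r = np.ones(O.shape[0])        # (r x 1) vector of ones
e_c = np.ones(O.shape[1])        # (c x 1) vector of ones

r = O @ e_c                      # row masses,    r = O e_c
c = O.T @ e_r                    # column masses, c = O' e_r

D_r = np.diag(r)                 # (r x r) diagonal matrix of row masses
D_c = np.diag(c)                 # (c x c) diagonal matrix of column masses

print(np.round(r, 3), np.round(c, 3))
```

Multiplying by a vector of ones is the matrix form of summing across a row or down a column, which is why `r` and `c` agree with `O.sum(axis=1)` and `O.sum(axis=0)`.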

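Table 4. Matrix $\mathbf{R}$ for Row Profiles

| Row | 1 | 2 | 3 | … | $c$ | Total |
|---|---|---|---|---|---|---|
| $i$ | $n_{i1}/n_{i\cdot}$ | $n_{i2}/n_{i\cdot}$ | $n_{i3}/n_{i\cdot}$ | … | $n_{ic}/n_{i\cdot}$ | 1 |
| Column mass | $n_{\cdot 1}/n$ | $n_{\cdot 2}/n$ | $n_{\cdot 3}/n$ | … | $n_{\cdot c}/n$ | 1 |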

*Row and Column Profiles*

Beginning with the table of cell frequencies $n_{ij}$, for each row $i$ the $(c \times 1)$ vector of row conditional densities is determined from $n_{ij}/n_{i\cdot}$, $j = 1, 2, \ldots, c$, and is denoted by $\mathbf{r}_i$. These row conditional densities are called *row profiles*. The complete set of $r$ row profiles will be denoted by the $(r \times c)$ matrix $\mathbf{R}$ with rows given by $\mathbf{r}_i'$, $i = 1, 2, \ldots, r$. Similarly, the vector of column conditional densities $n_{ij}/n_{\cdot j}$, $i = 1, 2, \ldots, r$, for column $j$ is denoted by the $(r \times 1)$ vector $\mathbf{c}_j$, $j = 1, 2, \ldots, c$, and the complete set of $c$ column profiles is denoted by the $(r \times c)$ matrix $\mathbf{C}$ with columns given by $\mathbf{c}_j$. The matrices $\mathbf{R}$ and $\mathbf{C}$ are illustrated in Tables 4 and 5 respectively. These row and column profile matrices are useful for judging the departure from independence.

For the criminal charge data the row profile matrix $\mathbf{R}$ and the column profile matrix $\mathbf{C}$ are summarized in Tables 6 and 7. The row profiles in Table 6 compare the four sex/conviction categories. The two female profiles (no and yes) are quite similar to each other, but the two male profiles are different from each other. For the column profiles in Table 7, the impaired driving and possession of narcotics profiles are similar to each other. The mischief and other profiles are also similar. The profile for theft under \$1000 is quite different from the other four column profiles. Since theft under \$1000 is the only offence dominated by females, we shall see that this provides a partial explanation for this different column profile.
Table 6. Row Profiles for Criminal Charge Data

| Convicted | Sex | Impaired | Theft Under | Mischief | Possession | Other | Totals |
|---|---|---|---|---|---|---|---|

Table 7. Column Profiles for Criminal Charge Data

| Convicted | Sex | Impaired | Theft Under | Mischief | Possession | Other | Row |
|---|---|---|---|---|---|---|---|
| Yes | Male | 0.700 | 0.278 | 0.440 | 0.697 | 0.463 | 0.516 |

*Departure from Independence*

The purpose of correspondence analysis in the study of contingency tables is usually to study the departure of the observed cell frequencies from the cell frequencies expected under independence. Although it is possible to compare the observed cell frequencies with those expected under other models, the independence model is the most commonly used basis for comparison. Under the independence assumption, the theoretical row profile for each row should be equal to the vector of column marginals, and equivalently the theoretical column profile for each column should be equal to the vector of row marginals.

Table 8. Row profile deviation from independence

Table 9. Column profile deviation from independence

For the sample correspondence matrix, therefore, the matrix differences $(\mathbf{R} - \mathbf{e}_r\mathbf{c}')$ and $(\mathbf{C} - \mathbf{r}\mathbf{e}_c')$ measure the degree of departure or deviation from independence in the sample (Tables 8 and 9). Equivalently, under independence the outer product $\mathbf{r}\mathbf{c}'$ of the sample row and column marginal vectors or masses should be approximately equal to the correspondence matrix $\mathbf{O}$ of observed cell densities. The matrix difference $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ is therefore also a measure of the deviation from independence.
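The three deviation matrices can be computed side by side. This is a minimal sketch on a hypothetical 3 × 4 table (illustrative counts only); the names `dev_R`, `dev_C` and `dev_O` are assumptions of the sketch. It also confirms that a table constructed to be exactly independent has zero deviations.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts n_ij (illustrative values only).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()            # correspondence matrix
r = O.sum(axis=1)                    # row masses
c = O.sum(axis=0)                    # column masses
e_r = np.ones_like(r)
e_c = np.ones_like(c)

R = O / r[:, None]                   # row profiles (rows of R)
C = O / c[None, :]                   # column profiles (columns of C)

dev_R = R - np.outer(e_r, c)         # R - e_r c'
dev_C = C - np.outer(r, e_c)         # C - r e_c'
dev_O = O - np.outer(r, c)           # O - r c'

# A table built as the outer product of its masses is exactly independent,
# so its row profile deviations vanish.
O_indep = np.outer(r, c)
dev_R_indep = O_indep / O_indep.sum(axis=1)[:, None] - np.outer(e_r, c)

print(np.round(dev_R, 3))
print(np.allclose(dev_R_indep, 0.0))
```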

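*Pearson Chi-Square Statistic and Total Inertia*: The Pearson chi-square statistic for testing independence is given as

$$G^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(n_{ij} - n_{i\cdot}n_{\cdot j}/n)^2}{n_{i\cdot}n_{\cdot j}/n} = n\sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - O_{i\cdot}O_{\cdot j})^2}{O_{i\cdot}O_{\cdot j}}.$$

The above versions of the Pearson chi-square statistic can also be expressed as

$$G^2 = \sum_{i=1}^{r} n_{i\cdot}\,(\mathbf{r}_i - \mathbf{c})'\,\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{c}) = \sum_{j=1}^{c} n_{\cdot j}\,(\mathbf{c}_j - \mathbf{r})'\,\mathbf{D}_r^{-1}(\mathbf{c}_j - \mathbf{r}).$$

The statistic $G^2/n$ is called the *total inertia*. Further, it can be viewed as a measure of the magnitude of the total row squared deviations or, equivalently, the magnitude of the column squared deviations. Total inertia can also be expressed in the form

$$\frac{G^2}{n} = \operatorname{tr}\left[\mathbf{D}_r^{-1}(\mathbf{O} - \mathbf{r}\mathbf{c}')\,\mathbf{D}_c^{-1}(\mathbf{O} - \mathbf{r}\mathbf{c}')'\right].$$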
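The equivalence of these expressions can be verified numerically. The following sketch uses a hypothetical 3 × 4 table of counts (illustrative values, not the criminal charge data) and computes the statistic from the cell contributions, from the row profiles, from the column profiles, and through the trace form of the total inertia. All variable names are assumptions of the sketch.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts n_ij (illustrative values only).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

n = counts.sum()
O = counts / n
r = O.sum(axis=1)                       # row masses
c = O.sum(axis=0)                       # column masses
E = np.outer(r, c)                      # densities expected under independence

# Cell contributions to the chi-square statistic and their total.
cell_contrib = n * (O - E) ** 2 / E
G2 = cell_contrib.sum()

# Row-profile form: sum_i n_i. (r_i - c)' D_c^{-1} (r_i - c).
R = O / r[:, None]
G2_rows = sum(n * r[i] * (R[i] - c) @ np.diag(1 / c) @ (R[i] - c)
              for i in range(r.size))

# Column-profile form: sum_j n_.j (c_j - r)' D_r^{-1} (c_j - r).
C = O / c[None, :]
G2_cols = sum(n * c[j] * (C[:, j] - r) @ np.diag(1 / r) @ (C[:, j] - r)
              for j in range(c.size))

# Total inertia through the trace form.
dev = O - E
inertia = np.trace(np.diag(1 / r) @ dev @ np.diag(1 / c) @ dev.T)

print(G2, G2_rows, G2_cols, n * inertia)
```

The array `cell_contrib` has the same layout as the contingency table, so large entries point to the cells that depart most from independence.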
Table 10. Contribution to Chi-Square statistic for criminal charge data
Table 10 shows the cell contributions to the total Chi-square statistic.

*Coordinates of row and column profiles*: For the singular value decomposition of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ given by $\mathbf{A}\mathbf{D}_\mu\mathbf{B}'$, the columns of the matrices $\mathbf{A}$ and $\mathbf{B}$ provide the principal axes for the columns and rows of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ respectively. Each row of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ can be expressed as a linear combination of the rows of $\mathbf{B}'$ (columns of $\mathbf{B}$), and hence the coordinates for the rows of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ in the space generated by the rows of $\mathbf{B}'$ are given by $\mathbf{A}\mathbf{D}_\mu$. The coordinates for the $i$th row of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ are given by the $i$th row of $\mathbf{A}\mathbf{D}_\mu$. Similarly, the coordinates for the columns of $(\mathbf{O} - \mathbf{r}\mathbf{c}')$ with respect to the space generated by the columns of $\mathbf{A}$ are provided by the columns of $\mathbf{D}_\mu\mathbf{B}'$.

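To obtain the coordinates for the row and column profile deviations, the relationships

$$(\mathbf{R} - \mathbf{e}_r\mathbf{c}') = \mathbf{D}_r^{-1}(\mathbf{O} - \mathbf{r}\mathbf{c}'), \qquad (\mathbf{C} - \mathbf{r}\mathbf{e}_c')' = \mathbf{D}_c^{-1}(\mathbf{O} - \mathbf{r}\mathbf{c}')'$$

can be used. The required coordinates for the row and column profile deviations are therefore given by

$$\mathbf{V}\,(r \times k) = \mathbf{D}_r^{-1}\mathbf{A}\mathbf{D}_\mu, \qquad \mathbf{W}\,(c \times k) = \mathbf{D}_c^{-1}\mathbf{B}\mathbf{D}_\mu$$

respectively. The coordinates for row profiles on row principal axes and coordinates for column profiles on column principal axes are given in Tables 11 and 12.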
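The coordinate formulas can be sketched in NumPy under one stated assumption: the decomposition $\mathbf{A}\mathbf{D}_\mu\mathbf{B}'$ is taken to be the generalized singular value decomposition normalised so that $\mathbf{A}'\mathbf{D}_r^{-1}\mathbf{A} = \mathbf{B}'\mathbf{D}_c^{-1}\mathbf{B} = \mathbf{I}$, which is a common convention in correspondence analysis but is not stated in the text. It is obtained here from an ordinary SVD of the standardized deviations. The table of counts is hypothetical and the names `V_rows` and `W_cols` are illustrative.

```python
import numpy as np

# Hypothetical 3 x 4 table of counts n_ij (illustrative values only).
counts = np.array([[20, 10,  5, 15],
                   [10, 25, 10,  5],
                   [ 5, 10, 30,  5]], dtype=float)

O = counts / counts.sum()
r = O.sum(axis=1)                        # row masses
c = O.sum(axis=0)                        # column masses
dev = O - np.outer(r, c)                 # O - r c'

# Ordinary SVD of the standardized deviations D_r^{-1/2} (O - rc') D_c^{-1/2}.
S = dev / np.sqrt(np.outer(r, c))
U, mu, Vt = np.linalg.svd(S, full_matrices=False)

k = int(np.sum(mu > 1e-12))              # number of non-trivial dimensions
U, mu, Vt = U[:, :k], mu[:k], Vt[:k, :]

# Assumed normalisation: A' D_r^{-1} A = B' D_c^{-1} B = I.
A = np.sqrt(r)[:, None] * U              # principal axes for the columns
B = np.sqrt(c)[:, None] * Vt.T           # principal axes for the rows

V_rows = (A / r[:, None]) * mu           # D_r^{-1} A D_mu, shape (r x k)
W_cols = (B / c[:, None]) * mu           # D_c^{-1} B D_mu, shape (c x k)

print(np.round(V_rows, 3))
print(np.round(W_cols, 3))
print(np.round(mu ** 2, 4))              # inertia carried by each dimension
```

Under this normalisation the squared singular values sum to the total inertia, so `mu ** 2 / np.sum(mu ** 2)` gives the share of inertia displayed by each dimension of the plot.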
Table 11. Coordinates for row profiles on row principal axes
Row principal axes (columns of $\mathbf{B}$): 1, 2, 3, …

Table 12. Coordinates for column profiles on column principal axes
Column principal axes (columns of $\mathbf{A}$): 1, 2, 3, …

For the criminal charge data the coordinates for the row and column profile deviations on their respective dimensions are shown in Tables 13 and 14. For the row profiles it would appear that the first dimension reflects a contrast between females charged and males convicted. The second row dimension is primarily a measure of males charged but not convicted. For the column profiles the first dimension represents a contrast between theft under \$1000 and the crimes of narcotics possession and impaired driving. The second dimension for the column profile deviations seems to reflect a contrast between the three charges of mischief, narcotics possession and other offences and the charge of impaired driving.

Table 13. Coordinates for row profiles on row principal axes for criminal charge data

Table 14. Coordinates for column profiles on column principal axes for criminal charge data


Source: http://nabg.iasri.res.in/EManual/manual/Correspondence%20Analysis.pdf
