CORRESPONDENCE ANALYSIS
Indian Agricultural Statistics Research Institute, Library Avenue, New Delhi-110 012
The past decade has seen tremendous growth in the availability of both computer hardware and statistical software. As a result, the use of multivariate statistical techniques has spread to most fields of scientific research and to many areas of business and public management. In both research and management there is increasing recognition of the need to analyze data in a manner that takes into account the interrelationships among variables.

Variables can be classified as quantitative or qualitative. A quantitative variable is one whose values differ in magnitude, for example income, age and weight. A qualitative variable is one whose values differ in kind rather than in magnitude, for example marital status, sex, nationality and hair colour. Obtaining values for a quantitative variable involves measurement along a scale with a unit of measure. A unit of measure may be infinitely divisible (for example, kilometers or meters) or indivisible (for example, family size). When the unit of measure is infinitely divisible the variable is said to be continuous; when it is indivisible the variable is said to be discrete.

Scales of measurement can also be classified on the basis of the relations among the elements composing the scale. For example, an ordinal scale is one in which the elements along the scale can be ordered from low to high. A nominal scale corresponds to qualitative data. An example is the variable marital status, which has the categories married, single, divorced, widowed and separated. The five categories can be assigned coded values such as 1, 2, 3, 4 and 5. Although these coded values are numerical, they must not be treated as quantitative. On occasion, quantitative variables are treated in an analysis as if they were nominal. In general, we use the term categorical to denote a variable that is used as if it were nominal.
The variable age, for example, can be divided into 6 levels coded 1, 2, 3, 4, 5 and 6. Principal component analysis and factor analysis are primarily designed for data on continuous variables, whereas correspondence analysis is designed for categorical data. Before discussing correspondence analysis in detail, we explain a few terms that are commonly used in it.

Two-Dimensional Contingency Tables: When a sample of n observations is simultaneously cross-classified with respect to two categorical random variables (X, Y), the joint frequencies can be summarized in a table called a two-dimensional contingency table.
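The cross-classification described above can be sketched in a few lines of Python. The observations below are hypothetical values made up for illustration, and the helper name `contingency_table` is not from the source.

```python
from collections import Counter

# Hypothetical paired observations (x, y); made up for illustration only.
observations = [
    ("male", "theft"), ("female", "theft"), ("male", "impaired"),
    ("male", "impaired"), ("female", "theft"), ("female", "mischief"),
    ("male", "mischief"), ("male", "impaired"),
]

def contingency_table(pairs):
    """Cross-classify (x, y) pairs into an r x c table of joint frequencies."""
    row_labels = sorted({x for x, _ in pairs})
    col_labels = sorted({y for _, y in pairs})
    counts = Counter(pairs)  # missing (x, y) combinations count as 0
    table = [[counts[(x, y)] for y in col_labels] for x in row_labels]
    return row_labels, col_labels, table

row_labels, col_labels, table = contingency_table(observations)
print("columns:", col_labels)
for label, row in zip(row_labels, table):
    print(label, row)
```

Each cell of `table` holds a joint frequency, and the cell frequencies add up to the sample size n.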
The random variable X is assumed to have a range of values consisting of r categories, whereas the variable Y is assumed to have c categories. The cell density or joint density for cell (i, j) is denoted by fij, i = 1, 2, …, r; j = 1, 2, …, c, where the first subscript refers to the row and the second subscript to the column. The marginal densities are denoted by fi. and f.j for the row and column variables respectively. The conditional densities for the rows given column j are denoted by fi.(i|j) and for the columns given row i by f.j(j|i).

Row and column proportions: The conditional densities f.j(j|i) are often referred to as row proportions, and the marginal density f.j is called the column total proportion. In a similar fashion the conditional densities fi.(i|j) are often referred to as column proportions, and the marginal density fi. is called the row total proportion.

Row and column profiles: The row and column proportions are also commonly referred to as row and column profiles. The term profile is often used in connection with the graphical displays of relationships in a contingency table.

Singular value decomposition (SVD): A real (n × p) matrix A of rank k can be expressed as the product of three matrices that have a useful interpretation. This decomposition of A is referred to as a singular value decomposition and is given by A = U D V′, where the matrices U, D and V have the following properties.
1. D (k × k) is a diagonal matrix with positive diagonal elements α1, α2, …, αk, which are called the singular values of A (without loss of generality we assume that the αj, j = 1, 2, …, k, are arranged in descending order).
2. The k columns of U (n × k), u1, u2, …, uk, are called the left singular vectors of A, and the k columns of V (p × k), v1, v2, …, vk, are called the right singular vectors of A.
3. The matrix A can be written as the sum of k matrices, each with rank 1:
$$A = \sum_{j=1}^{k} \alpha_j u_j v_j'$$
Subtracting any one of these terms from the sum leaves a remainder of reduced rank (k − 1).
4. The matrices U (n × k) and V (p × k) have the property that U′U = V′V = I; hence the columns of U form an orthonormal basis for the columns of A in n-dimensional space and the columns of V form an orthonormal basis for the rows of A in p-dimensional space.
5. Let A(l) denote the sum of the first l terms of the singular value decomposition of A; hence
$$A_{(l)} = \sum_{j=1}^{l} \alpha_j u_j v_j'$$
This expression minimizes
$$\mathrm{tr}[(A - X)(A - X)'] = \sum_{i=1}^{n}\sum_{j=1}^{p} (a_{ij} - x_{ij})^2$$
among all (n × p) matrices X of rank l. Thus the singular value decomposition can be used to provide a matrix approximation to A.
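As a small worked illustration of properties 3 to 5, consider the 2 × 2 matrix below. The matrix is chosen here for convenience and is not taken from the source; every step can be checked by hand.

```latex
A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}, \qquad
A'A = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}
\;\Rightarrow\; \alpha_1^2 = 45,\quad \alpha_2^2 = 5 .

v_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix},\quad
v_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix},\quad
u_1 = \frac{A v_1}{\alpha_1} = \tfrac{1}{\sqrt{10}} \begin{pmatrix} 1 \\ 3 \end{pmatrix},\quad
u_2 = \frac{A v_2}{\alpha_2} = \tfrac{1}{\sqrt{10}} \begin{pmatrix} 3 \\ -1 \end{pmatrix}.

\alpha_1 u_1 v_1' = \tfrac{3}{2} \begin{pmatrix} 1 & 1 \\ 3 & 3 \end{pmatrix}, \qquad
\alpha_2 u_2 v_2' = \tfrac{1}{2} \begin{pmatrix} 3 & -3 \\ -1 & 1 \end{pmatrix}, \qquad
\alpha_1 u_1 v_1' + \alpha_2 u_2 v_2' = A .

A_{(1)} = \alpha_1 u_1 v_1', \qquad
\mathrm{tr}\left[(A - A_{(1)})(A - A_{(1)})'\right]
  = \tfrac{1}{4}\,(9 + 9 + 1 + 1) = 5 = \alpha_2^2 .
```

The error of the rank-1 approximation equals the square of the discarded singular value, which is why small trailing singular values indicate a good low-rank approximation.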
Biplots: A biplot is used to provide a two-dimensional representation for a data matrix X. Only two dimensions are usually employed to keep the presentation simple. It is assumed that a singular value decomposition approximation for X based on r = 2 dimensions is adequate. This should of course be evaluated by examining the magnitudes of the singular values beyond r = 2: the sum of these remaining singular values should ideally represent only a small proportion of tr D. A singular value decomposition approximation for X based on two dimensions is given by
$$\hat{X} = U D V'$$
where the rows of V′ (2 × p) are eigenvectors of X′X and the columns of U (n × 2) are eigenvectors of XX′. There are several ways of employing the three elements on the right-hand side of this equation. The most common form is called the principal components plot.

Correspondence analysis is a technique that uses singular value decomposition to analyze a matrix of nonnegative data. The technique simultaneously characterizes the relationship among the rows and also among the columns of the data matrix. The outcome of a correspondence analysis is a pair of bivariate plots. One bivariate plot is based on the first two principal axes derived from the row profiles, and the second plot is based on the first two principal axes obtained from the column profiles. Points representing the row categories are plotted using the row principal axes and points representing the column categories are plotted using the column principal axes. The spatial relationships within each of the two sets of categories can then be studied using the two bivariate plots. By using the same pair of axes to denote both pairs of principal axes, the two bivariate plots can be superimposed on one another. With both plots appearing on the same axes, the spatial relationship between the row categories and the column categories can also be examined. The SAS computer software procedure CORRESP will be used throughout this section to perform the necessary data analysis.
Table 1. Correspondence matrix of observed cell densities for an (r × c) contingency table

Row    1      2      3      …    c      Row mass
1      O11    O12    O13    …    O1c    O1.
2      O21    O22    O23    …    O2c    O2.
3      O31    O32    O33    …    O3c    O3.
Correspondence analysis for two-dimensional contingency tables

Correspondence analysis can be used to study interaction in a two-dimensional contingency table. Table 1 shows the observed cell proportions or cell densities. Let us denote the cell density for cell (i, j) as Oij = nij/n, where nij denotes the sample frequency in cell (i, j); i = 1, 2, …, r and j = 1, 2, …, c. The row and column marginal densities are given by Oi. = ni./n and O.j = n.j/n respectively, where ni. and n.j are the row and column marginal frequencies.

Example: The data given in Table 2 pertain to a student-run legal advice service for the poor. The table examines the relationship between the type of criminal charge and the eventual outcome of the case for both males and females.

Table 2. Contingency Table for Criminal Charge Data
______________________________________________________________
Convicted  Sex  Impaired  Theft Under  Mischief  Possession  Other  Totals

Table 3. Correspondence Matrix for Criminal Charge Data
________________________________________________________________________
Convicted  Sex  Impaired  Theft Under  Mischief  Possession  Other  Totals
The corresponding matrix of cell densities and the row and column marginal densities are shown in Table 3. The numbers are given as percentages and hence represent 100 Oij. The column of row masses on the right presents the row marginals as percentages, 100 Oi., and the row of column masses (last row) displays the column marginals, 100 O.j. The majority of the clients were convicted males, and 30.3% of the sample were convicted females. The two most common offences were impaired driving (37.2%) and theft under $1000 (28.6%). The most common offence for males was impaired driving (28.1% of the sample) and the most common offence for females was theft under $1000 (17.8% of the sample).

Correspondence Matrix and Row and Column Masses

The (r × c) matrix of cell densities shown in Table 1 is denoted by O and is called the correspondence matrix. The (r × 1) vector of row marginals Oi., i = 1, 2, …, r, is denoted by r, and similarly the (c × 1) vector of column marginals O.j, j = 1, 2, …, c, is denoted by c. These row and column marginal vectors can be written as r = O ec and c = O′ er, where ec (c × 1) and er (r × 1) are vectors of ones. The vectors r and c are also referred to as the row and column masses respectively. Diagonal matrices constructed from the row and column masses are denoted by Dr (r × r) and Dc (c × c) respectively. The diagonal elements of Dr are the elements of r and the diagonal elements of Dc are the elements of c.
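A minimal sketch of how the correspondence matrix and the masses follow from a table of counts is given below. The counts are hypothetical (they are not the criminal charge data), and the function name `correspondence_matrix` is an illustrative choice.

```python
# Hypothetical 3 x 4 table of counts n_ij (made up for illustration,
# not the criminal charge data).
counts = [
    [20, 10, 5, 15],
    [10, 30, 5, 5],
    [5, 5, 20, 20],
]

def correspondence_matrix(counts):
    """Return O (cell densities), r (row masses) and c (column masses)."""
    n = sum(sum(row) for row in counts)
    O = [[n_ij / n for n_ij in row] for row in counts]
    r = [sum(row) for row in O]         # r = O e_c (row sums of O)
    c = [sum(col) for col in zip(*O)]   # c = O' e_r (column sums of O)
    return O, r, c

O, r, c = correspondence_matrix(counts)
print("row masses   :", [round(x, 3) for x in r])
print("column masses:", [round(x, 3) for x in c])
```

Both mass vectors sum to one, since each is a set of marginal proportions of the same table.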
Table 4. Matrix R for Row Profiles
Column mass    n.1/n    n.2/n    n.3/n    …    n.c/n    1
Row and Column Profiles
Beginning with the table of cell frequencies nij, for each row i the (c × 1) vector of row conditional densities nij/ni., j = 1, 2, …, c, is determined and denoted by ri. These row conditional densities are called row profiles. The complete set of r row profiles is denoted by the (r × c) matrix R, whose rows are ri′, i = 1, 2, …, r. Similarly, the vector of column conditional densities nij/n.j, i = 1, 2, …, r, for column j is denoted by the (r × 1) vector cj, j = 1, 2, …, c, and the column profiles form the columns of the (r × c) matrix C. The matrices R and C are illustrated in Tables 4 and 5 respectively. These row and column profile matrices are useful for judging the departure from independence.

For the criminal charge data the row profile matrix R and the column profile matrix C are summarized in Tables 6 and 7. The row profiles in Table 6 compare the four sex/conviction categories. The two female profiles (no and yes) are quite similar to each other, but the two male profiles differ from each other. For the column profiles in Table 7, the impaired driving and possession of narcotics profiles are similar to each other, as are the mischief and other profiles. The profile for theft under $1000 is quite different from the other four column profiles. Since theft under $1000 is the only offence dominated by females, we shall see that this provides a partial explanation for this different column profile.
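The profile matrices can be computed directly from the counts, as in the sketch below. The table is hypothetical (not the criminal charge data) and the function names are illustrative.

```python
# Hypothetical counts (made up for illustration).
counts = [
    [20, 10, 5, 15],
    [10, 30, 5, 5],
    [5, 5, 20, 20],
]

def row_profiles(counts):
    """R: each row of counts divided by its row total n_i."""
    return [[n_ij / sum(row) for n_ij in row] for row in counts]

def column_profiles(counts):
    """C: each column of counts divided by its column total n_.j (same r x c shape)."""
    col_totals = [sum(col) for col in zip(*counts)]
    return [[n_ij / col_totals[j] for j, n_ij in enumerate(row)] for row in counts]

R = row_profiles(counts)
C = column_profiles(counts)
for profile in R:
    print([round(x, 3) for x in profile])
```

Every row of R and every column of C sums to one, so profiles remove the effect of unequal row or column totals and can be compared directly.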
Table 6. Row Profiles for Criminal Charge Data
_________________________________________________________
Convicted Sex Impaired Theft Under Mischief Possession other Totals
Table 7. Column Profiles for Criminal Charge Data
_________________________________________________________
Convicted  Sex   Impaired  Theft Under  Mischief  Possession  Other  Row
Yes        Male  0.700     0.278        0.440     0.697       0.463  0.516
Departure from Independence

The purpose of correspondence analysis in the study of contingency tables is usually to study the departure of the observed cell frequencies from the cell frequencies expected under independence. Although it is possible to compare the observed cell frequencies with those expected under other models, the independence model is the most commonly used base for comparison. Under the independence assumption, the theoretical row profile for each row should be equal to the vector of column marginals, and equivalently the theoretical column profile for each column should be equal to the vector of row marginals.
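The statement about profiles under independence can be checked numerically. The sketch below uses two hypothetical tables (both made up for illustration): one built as an outer product of margins, so that it is exactly independent, and one with an obvious association.

```python
def masses(counts):
    """Row masses r and column masses c of a table of counts."""
    n = sum(sum(row) for row in counts)
    r = [sum(row) / n for row in counts]
    c = [sum(col) / n for col in zip(*counts)]
    return r, c

def row_profile_deviations(counts):
    """Rows of R - e_r c': each row profile minus the vector of column masses."""
    _, c = masses(counts)
    return [[n_ij / sum(row) - c_j for n_ij, c_j in zip(row, c)] for row in counts]

# Hypothetical table built as an outer product of margins: exactly independent.
independent = [[2 * 3, 2 * 5, 2 * 2], [4 * 3, 4 * 5, 4 * 2]]
# Hypothetical table with a clear row-column association.
associated = [[30, 5, 5], [5, 30, 5]]

dev_ind = row_profile_deviations(independent)
dev_assoc = row_profile_deviations(associated)
print("independent:", [[round(x, 3) for x in row] for row in dev_ind])
print("associated :", [[round(x, 3) for x in row] for row in dev_assoc])
```

For the independent table every deviation is zero up to rounding error, while the associated table shows large deviations in the cells that carry the association.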
Table 8. Row profile deviation from independence
Table 9. Column profile deviation from independence
For the sample correspondence matrix, therefore, the matrix differences (R − er c′) and (C − r ec′) measure the degree of departure or deviation from independence in the sample (Tables 8 and 9). Equivalently, under independence the cross product rc′ of the sample row and column marginal vectors or masses should be approximately equal to the correspondence matrix O of observed cell densities. The matrix difference (O − rc′) is therefore also a measure of the deviation from independence.

Pearson Chi-Square Statistic and Total Inertia: The Pearson Chi-Square Statistic for testing independence is given as
$$G^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(n_{ij} - n_{i.}n_{.j}/n)^2}{n_{i.}n_{.j}/n} = n \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - O_{i.}O_{.j})^2}{O_{i.}O_{.j}}$$
The above versions of the Pearson Chi-Square Statistic can also be expressed as
$$G^2 = \sum_{i=1}^{r} n_{i.}\,(r_i - c)' D_c^{-1} (r_i - c) = \sum_{j=1}^{c} n_{.j}\,(c_j - r)' D_r^{-1} (c_j - r)$$
The statistic G²/n is called the total inertia. It can be viewed as a measure of the magnitude of the total row squared deviations or, equivalently, of the magnitude of the column squared deviations. Total inertia can also be expressed in the form
$$\frac{G^2}{n} = \mathrm{tr}\left[D_r^{-1}(O - rc')\,D_c^{-1}(O - rc')'\right] = \mathrm{tr}\left[D_c^{-1}(O - rc')'\,D_r^{-1}(O - rc')\right]$$
Table 10. Contribution to Chi-Square statistic for criminal charge data
Table 10 shows the cell contributions to the total Chi-square statistic.
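The cell contributions and the total inertia can be computed as in the following sketch. The counts are hypothetical (not the criminal charge data), and the second function evaluates the same statistic through the row profiles, as a numerical check that the two forms agree.

```python
# Hypothetical counts (made up for illustration).
counts = [
    [20, 10, 5, 15],
    [10, 30, 5, 5],
    [5, 5, 20, 20],
]

def chi_square_contributions(counts):
    """Cell contributions (n_ij - e_ij)^2 / e_ij with e_ij = n_i. n_.j / n."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    return [
        [(n_ij - rt * ct / n) ** 2 / (rt * ct / n) for n_ij, ct in zip(row, col_totals)]
        for row, rt in zip(counts, row_totals)
    ]

def chi_square_from_row_profiles(counts):
    """Same statistic written as sum_i n_i. (r_i - c)' D_c^{-1} (r_i - c)."""
    n = sum(sum(row) for row in counts)
    c = [sum(col) / n for col in zip(*counts)]
    total = 0.0
    for row in counts:
        n_i = sum(row)
        total += n_i * sum((n_ij / n_i - c_j) ** 2 / c_j for n_ij, c_j in zip(row, c))
    return total

n = sum(sum(row) for row in counts)
contributions = chi_square_contributions(counts)
chi_square = sum(sum(row) for row in contributions)
total_inertia = chi_square / n
print("chi-square   :", round(chi_square, 3))
print("total inertia:", round(total_inertia, 4))
```

Cells with large contributions are the ones that depart most from independence, which is the information a table of contributions such as Table 10 is meant to convey.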
Coordinates of row and column profiles: For the singular value decomposition of (O − rc′) given by A Dμ B′, the columns of the matrices A and B provide the principal axes for the columns and rows of (O − rc′) respectively. Each row of (O − rc′) can be expressed as a linear combination of the rows of B′ (columns of B), and hence the coordinates for the rows of (O − rc′) in the space generated by the rows of B′ are given by A Dμ; the coordinates for the ith row of (O − rc′) are given by the ith row of A Dμ. Similarly, the coordinates for the columns of (O − rc′) with respect to the space generated by the columns of A are provided by the columns of Dμ B′.
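The link between this decomposition and the total inertia can be made explicit with a short derivation. It assumes that the decomposition is normalized so that A′Dr⁻¹A = B′Dc⁻¹B = I (the usual generalized form of the singular value decomposition in correspondence analysis); this normalization is an assumption here, since it is not stated in the text above. The singular values are written μ1, …, μk for the diagonal elements of Dμ.

```latex
\mathrm{tr}\left[D_r^{-1}(O - rc')\,D_c^{-1}(O - rc')'\right]
  = \mathrm{tr}\left[D_r^{-1} A D_\mu B' D_c^{-1} B D_\mu A'\right]
  = \mathrm{tr}\left[D_r^{-1} A D_\mu^2 A'\right]
  = \mathrm{tr}\left[D_\mu^2 A' D_r^{-1} A\right]
  = \mathrm{tr}\left[D_\mu^2\right]
  = \sum_{j=1}^{k} \mu_j^2 .
```

Under this assumption the total inertia is the sum of the squared singular values, so the share of inertia carried by the first two axes indicates how well a two-dimensional plot represents the table.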
To obtain the coordinates for the row and column profile deviations, the relationships
$$(R - e_r c') = D_r^{-1}(O - rc') = D_r^{-1} A D_\mu B'$$
$$(C - r e_c')' = D_c^{-1}(O - rc')' = D_c^{-1} B D_\mu A'$$
can be used. The required coordinates for the row and column profile deviations are therefore given by
$$V\,(r \times k) = D_r^{-1} A D_\mu, \qquad W\,(c \times k) = D_c^{-1} B D_\mu$$
respectively. The coordinates for row profiles on row principal axes and for column profiles on column principal axes are given in Tables 11 and 12.
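As a rough numerical sketch, the first column of V and of W can be obtained without a full SVD routine by power iteration on the standardized matrix S = Dr^(-1/2) (O − rc′) Dc^(-1/2). This is a swapped-in technique chosen to keep the example dependency-free, not the method used by PROC CORRESP. The counts are hypothetical, the function name is illustrative, and the sketch assumes the table is not exactly independent (otherwise S is zero and the normalization fails).

```python
import math

# Hypothetical counts (made up for illustration).
counts = [
    [20, 10, 5, 15],
    [10, 30, 5, 5],
    [5, 5, 20, 20],
]

def first_axis_coordinates(counts, iterations=500):
    """Leading singular triple of S = D_r^{-1/2} (O - rc') D_c^{-1/2} by power iteration.

    Returns (mu, row_coords, col_coords) for the first principal axis only.
    Assumes the table is not exactly independent (S must be nonzero).
    """
    n = sum(sum(row) for row in counts)
    r = [sum(row) / n for row in counts]
    c = [sum(col) / n for col in zip(*counts)]
    S = [
        [(n_ij / n - r_i * c_j) / math.sqrt(r_i * c_j) for n_ij, c_j in zip(row, c)]
        for row, r_i in zip(counts, r)
    ]
    # Start from the row of S with the largest norm so the start vector is nonzero.
    v = max(S, key=lambda row: sum(x * x for x in row))[:]
    for _ in range(iterations):
        u = [sum(s * v_j for s, v_j in zip(row, v)) for row in S]                 # u = S v
        v = [sum(S[i][j] * u[i] for i in range(len(S))) for j in range(len(c))]   # v = S' u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(s * v_j for s, v_j in zip(row, v)) for row in S]
    mu = math.sqrt(sum(x * x for x in u))   # leading singular value
    u = [x / mu for x in u]
    row_coords = [mu * u_i / math.sqrt(r_i) for u_i, r_i in zip(u, r)]  # first column of D_r^{-1} A D_mu
    col_coords = [mu * v_j / math.sqrt(c_j) for v_j, c_j in zip(v, c)]  # first column of D_c^{-1} B D_mu
    return mu, row_coords, col_coords

mu, row_coords, col_coords = first_axis_coordinates(counts)
print("first principal inertia:", round(mu ** 2, 4))
print("row coordinates   :", [round(x, 3) for x in row_coords])
print("column coordinates:", [round(x, 3) for x in col_coords])
```

The sign of an axis is arbitrary, so only the relative positions of the categories along it carry meaning. The coordinates are centred when weighted by the masses, and their mass-weighted sum of squares equals the inertia of the axis.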
Table 11. Coordinates for row profiles on row principal axes
Row principal axes (columns of B) 1 2 3 …
Table 12. Coordinates for column profiles on column principal axes
Column principal axes (columns of A) 1 2 3 …
For the criminal charge data the coordinates for the row and column profile deviations on their respective dimensions are shown in Tables 13 and 14. For the row profiles it would appear that the first dimension reflects a contrast between females charged and males convicted. The second row dimension is primarily a measure of males charged but not convicted. For the column profiles the first dimension represents a contrast between theft under $1000 and the crimes of narcotics possession and impaired driving. The second dimension for the column profile deviations seems to reflect a contrast between the three charges mischief, narcotics possession and other offences and the charge of impaired driving.

Table 13. Coordinates for row profiles on row principal axes for criminal charge data
Table 14. Coordinates for column profiles on column principal axes for criminal charge data

References
Jobson, J.D. (1992). Applied multivariate data analysis. Vol II, Categorical and multivariate