Mind, Body, Cultural Evolution Lab

Tag Archives: Spss

How to run a Conditional ANOVA

Today is a wee bit heavier on the stats side again. If you are interested in Differential Item Functioning (DIF) and how to test for it with an easy-to-use tool, this is for you…

Aim: Identify differential item functioning in numerical scores across groups in order to decide whether the items are unbiased and can be used for cross-cultural comparisons.

General approach: Van de Vijver and Leung (1997) describe a conditional technique that can be used with Likert-type scales. It relies on traditional ANOVA techniques. The independent variables are (1) the groups to be compared and (2) score levels on the total score (across all items), which serve as an indicator of the underlying or ‘latent’ trait (please note that technically it is an observed score, not a latent variable). The dependent variable is the score for each individual item. Since we are using the total score (divided into score levels) as an IV, the analysis is called ‘conditional’.

Advantages of Conditional ANOVA: It can be easily run in standard programmes such as SPSS. It is simple. It highlights some key issues and principles of differential item functioning. One particular advantage is that working through these procedures, you can easily find out whether score distributions are similar or different (e.g., is an item bias analysis warranted and even possible?).

Disadvantages of Conditional ANOVA: There are many arbitrary choices in splitting variables and score groups (see below) that can make big differences. It is not very elegant. Better approaches that circumvent some of these problems and that can be implemented in SPSS and other standard programmes include Logistic Regression. Check out Bruno Zumbo’s website and manual. I will also try and put up some notes on this soon.

What do we look for? There are three effects that we look for.

First, a significant main effect of score level would indicate that individuals with a low overall score also show a lower score on the respective item. This would be expected and therefore is generally not of theoretical interest (think of it as equivalent to a significant factor loading of the item on the ‘latent’ factor).

Second, a significant main effect of country or sample would indicate that scores on this item for at least one group are significantly higher or lower, independent of the true variable score. This indicates ‘uniform DIF’. (Note: this type of item bias can NOT be detected in Exploratory Factor Analysis with Procrustean Rotation).

Third, a significant interaction between country and score level on the item mean indicates that the item discriminates differently across groups. This indicates ‘non-uniform DIF’. The item is differently related to the true ‘latent’ variable across groups. For example, think of an item of extroversion. In one group (let’s say New Yorkers), ‘being the centre of attention at a cocktail party’ is a good indicator of extroversion, whereas for a group of Muslim youth from Mogadishu in Somalia it is not a relevant item of extroversion (since they are not allowed to drink alcohol and probably have never been at a cocktail party, for obvious reasons).

Note: Such biases MAY be detected through Procrustean Rotation, if examining differentially loading items.

Important: What is our criterion for deciding whether an item shows DIF or not?

Statistical Procedure:

The procedure requires in most cases at least four steps.

Step 1: Calculate the sum score of your variable. For example, if you have an extraversion scale with ten items measured on a scale from 1 to 5, you should create the total sum. This can obviously vary between 10 and 50 for any individual. Use the syntax used in class.

For example:

Compute extroversion=sum(extraversion1, extraversion2,…,extraversion10).

Step 2: You need to create score levels. You would like to group equal numbers of individuals into groups according to their overall extroversion score.

Van de Vijver and Leung (1997) recommend having at least 50 individuals per score group and sample. For example, if you have 100 individuals in each group, you can form at most 2 score levels. If you have 5,000 individuals in each of your cultural samples, you could theoretically form up to 100 score levels (well, actually not, because a sum score that runs from 10 to 50 can take only 41 distinct values). Therefore, it is up to you how many score levels you create. Having more levels will obviously allow more fine-grained analyses (you can make finer distinctions between extroversion levels in both groups) and is probably more powerful (you are more likely to detect DIF). However, because you have fewer people in each score level, the results might also be less stable. Hence, there is a clear trade-off, but don’t despair. If an item is strongly biased, it should show up in your analysis regardless of whether you use fewer or more score levels. If the bias is less severe, results might change across different options.

One issue is that if you have fewer than 50 people in each score group and cultural sample, the results might become quite unstable and you may find interactions that are hard to interpret. In any case, it is important to consider both statistical significance and effect sizes when interpreting item bias.

A simple way of getting the desired number of equal groups is to use the rank cases option. You find this under ‘Transform’ -> ‘Rank cases’. Transfer your sum score into the variables box. Click on ‘Rank types’. First, unclick ‘Rank’ (it will rank your sample, but this is something that you do not need). Second, click on ‘Ntiles’ and specify the number of groups you want to create. For example, if you have 200 individuals, you could create 4 groups. If you have larger samples, the discussion from above applies (you have to decide about the number of levels, facing the aforementioned trade-off between power and stability).
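For readers who want to check the grouping outside SPSS, here is a minimal Python sketch of Steps 1 and 2. The function names (`sum_scores`, `score_levels`, `smallest_cell`) and the toy data are made up for illustration. Ties are simply broken by file order, which is an assumption of this sketch and not necessarily how the SPSS Ntiles option treats tied scores.

```python
from collections import Counter


def sum_scores(item_rows):
    """Step 1: total score per respondent (one row of item answers each)."""
    return [sum(row) for row in item_rows]


def score_levels(totals, n_levels):
    """Step 2: split respondents into n_levels rank-based groups of (near) equal size.

    Assumption: tied totals are separated by their order in the data.
    """
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    levels = [0] * len(totals)
    for rank, idx in enumerate(order):
        levels[idx] = rank * n_levels // len(totals) + 1
    return levels


def smallest_cell(samples, levels):
    """Smallest number of respondents in any sample-by-level cell."""
    counts = Counter(zip(samples, levels))
    all_cells = [(s, lvl) for s in set(samples) for lvl in set(levels)]
    return min(counts[cell] for cell in all_cells)


# Hypothetical toy data: 8 respondents, 3 items scored 1 to 5, two samples.
items = [[1, 2, 1], [2, 2, 3], [3, 3, 3], [4, 3, 4],
         [5, 4, 4], [5, 5, 4], [2, 1, 2], [4, 5, 5]]
samples = ["UK", "UK", "UK", "UK", "DE", "DE", "DE", "DE"]

totals = sum_scores(items)
levels = score_levels(totals, n_levels=2)
print(totals, levels, smallest_cell(samples, levels))
```

In a real analysis you would compare the smallest cell with the 50 respondents recommended above. If it falls short, use fewer score levels, or accept that the score distributions overlap too little for this kind of analysis.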

As discussed above, it is strongly advisable to interpret effect sizes (how big is the effect) in addition to statistical significance levels. This is particularly important if you have large sample sizes, in which even minute differences can become significant. SPSS gives you partial eta-squared values routinely (if you click on ‘effect sizes’ under ‘options’). Cohen (1988) differentiated between small (0.01), medium (0.06), and large (0.14) effect sizes for eta-squared. Please note that SPSS gives you partial eta-squared values (the variance due to an effect relative to that effect plus error, that is, with the other effects partialled out), whereas classical eta-squared expresses the effect relative to the total variance, which still includes the other effects. Partial eta-squared values are therefore often larger than the traditional eta-squared values (overestimating the effect), but at the same time there is much to be recommended for using partial instead of traditional eta-squared values (see Pierce, Block & Aguinis, 2004, in Educational and Psychological Measurement).
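The difference between the two effect sizes is easiest to see with numbers. The following Python sketch uses invented sums of squares. The function names and the ‘negligible’ label for values below 0.01 are my own additions for illustration, not part of Cohen’s scheme or of SPSS.

```python
def eta_squared(ss_effect, ss_total):
    """Classical eta-squared: effect relative to the total variation."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """Partial eta-squared: effect relative to effect plus error only."""
    return ss_effect / (ss_effect + ss_error)


def cohen_label(value):
    """Rough label using the 0.01 / 0.06 / 0.14 benchmarks quoted above."""
    if value >= 0.14:
        return "large"
    if value >= 0.06:
        return "medium"
    if value >= 0.01:
        return "small"
    return "negligible"  # label assumed for this sketch


# Hypothetical sums of squares from a country x score level ANOVA on one item.
ss = {"country": 6.0, "level": 16.0, "country x level": 0.5}
ss_error = 20.0
ss_total = sum(ss.values()) + ss_error

for effect, value in ss.items():
    classical = eta_squared(value, ss_total)
    partial = partial_eta_squared(value, ss_error)
    print(f"{effect:16s} eta2={classical:.3f} partial eta2={partial:.3f} "
          f"({cohen_label(partial)})")
```

Because the denominator of the partial version leaves out the other effects, it can never be smaller than the classical value for the same effect.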

Step 3: Run your ANOVA for each item separately. The IVs are country/sample and score level (the variable created using ranking procedures). Transfer your IVs into the ‘Fixed Factor’ boxes. As described above, the important things to look out for are a significant main effect of country/sample (indicating uniform DIF) and/or a significant interaction between country/sample and score level (indicating non-uniform DIF). You can use plots produced by SPSS to identify the nature and direction of the bias (under plots, transfer your score level to the ‘horizontal axis’ and the country/sample to ‘separate lines’, click ‘add’ and then ‘continue’). Van de Vijver and Leung (box 4.3) describe a different way of plotting the results. However, the results are the same; it is only a different way of visualising the main effect and/or interaction.
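To see what the ANOVA does with one item, here is a small Python sketch that computes the sums of squares by hand. It assumes a balanced design (the same number of respondents in every country by score level cell); for unbalanced data the sums of squares have to be computed differently, so treat this as an illustration and not as a replacement for the SPSS output. The data and the function name `two_way_ss` are invented.

```python
from statistics import mean


def two_way_ss(cells):
    """Sums of squares for a balanced country x level design.

    cells maps (country, level) -> list of item scores; every cell must
    hold the same number of scores (balanced-design assumption).
    """
    countries = sorted({c for c, _ in cells})
    levels = sorted({lvl for _, lvl in cells})
    n = len(next(iter(cells.values())))
    if any(len(scores) != n for scores in cells.values()):
        raise ValueError("this sketch only handles balanced designs")

    grand = mean(x for scores in cells.values() for x in scores)
    c_mean = {c: mean(x for lvl in levels for x in cells[(c, lvl)]) for c in countries}
    l_mean = {lvl: mean(x for c in countries for x in cells[(c, lvl)]) for lvl in levels}

    ss = {
        "country": n * len(levels) * sum((c_mean[c] - grand) ** 2 for c in countries),
        "level": n * len(countries) * sum((l_mean[lvl] - grand) ** 2 for lvl in levels),
        "interaction": 0.0,
        "error": 0.0,
    }
    for (c, lvl), scores in cells.items():
        cell = mean(scores)
        ss["interaction"] += n * (cell - c_mean[c] - l_mean[lvl] + grand) ** 2
        ss["error"] += sum((x - cell) ** 2 for x in scores)
    return ss


# Hypothetical item with uniform DIF: country B scores one point higher
# at every score level, and there is no interaction.
noise = [-0.5, 0.5, -0.5, 0.5]
cells = {(country, level): [level + 1 + shift + e for e in noise]
         for country, shift in (("A", 0.0), ("B", 1.0))
         for level in (1, 2, 3)}

ss = two_way_ss(cells)
for effect in ("country", "level", "interaction"):
    partial = ss[effect] / (ss[effect] + ss["error"])
    print(f"{effect:12s} SS={ss[effect]:.2f} partial eta2={partial:.3f}")
```

In this made-up item the country effect is sizeable while the interaction is zero, which is the pattern described above as uniform DIF. A non-zero interaction term would point to non-uniform DIF instead.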

This little figure for example shows evidence of both uniform and nonuniform bias. The item is overall easier for the East German sample and it does not discriminate equally well across all score levels. Among higher score levels, it does not differentiate well for the UK sample.

Step 4: Ideally, you would not like to have DIF. However, it is likely that you will encounter some biased items. I would run all analyses first and identify the most biased items. If all items are biased, you are in trouble (well, unless you are a cultural psychologist, in which case you rejoice and party). In this case, there is probably little you can do at this point except trying to understand the mechanisms underlying the processes (how do people understand these questions, what does this say about the culture of both groups, etc.).

If you have only a few biased items, remove them (you can either remove the item with the largest partial eta-squared or all of the DIF items in one fell swoop – I would recommend the former procedure though) and recompute the sum score (step 1). Go through steps 2 and 3 again to see whether your scale is working better now. You may need to repeat this analysis several times, since different items may show up as biased at each iteration of your analysis.


My factor analysis showed that one factor is not working in at least one sample: In this case, there is no point in running the conditional ANOVA with that sample included. You are interested in identifying those items that are problematic in measuring the latent score. You therefore assume that the factor is working in all groups included in the analysis.

My overall latent scores do not overlap: This will lead to situations where the latent scores are so dramatically different that you cannot find score levels with at least 50 participants in each sample. In this case, your attempt to identify Differential ITEM functioning is problematic, since something else is happening. One option is to use fewer, broader score levels (making the groups larger – obviously this involves a loss of sensitivity and power to detect effects). Sometimes, even this might not be possible.

At a theoretical level, it could be that you have a situation where you have generalized uniform item bias in at least one sample (for example because one group gives acquiescent answers that are consistently higher or lower). It also might indicate method bias (for example, translation problems that make all items significantly easier in one group compared to the others) or construct bias (for example, you might have tapped into some religious or cultural practices that are more common in one group than in another – in this case your items might load on the intended factor but conceptually the factor is measuring something different across cultural groups). Of course, it can also indicate true differences. Any number of explanations (construct or method bias or substantive effects that lead to different cultural scores) could be possible.

What happens if most items are biased and only a few unbiased items remain? In this situation you run into the paradox that you cannot actually determine whether your biased items are actually unbiased or your unbiased items are biased. This type of analysis only functions properly if you have a small number of biased items, up to probably half the number of items in your latent variable. Once you move beyond this, it means that there is a problem with your construct. If you mainly find uniform bias, but no interactions, you can still compare correlations or patterns of scores (since your instrument most likely satisfies metric equivalence). If you have interactions, you do not satisfy metric equivalence and you may need to investigate the structure and function of your theoretical and/or operationalized construct (functional and structural equivalence).

Any questions? Email me 😉

How to do a pancultural factor analysis – a simple option

I am going to demonstrate a simple way of doing what is often called a pan-cultural or culture-free factor analysis in the cross-cultural literature (even though I do not like those terms) in SPSS. In the methods literature, this is also sometimes called a pooled-within analysis.

The basic problem is: How can you analyze the data from a large number of samples in an efficient way without giving priority to any data set? This is particularly interesting when you deal with data from lots of different cultures and you would like to find a solution that is averaged across all samples or ‘culture-free’ – capturing the average human being.

Such a solution could be interesting in its own right. It can also be useful as a reference structure for further Procrustean analyses (see my earlier blog post here).

Let’s work with an example. I took the 1995 World Value Survey scores for Morally Debatable Behaviour (see a published analysis of the data here).

You will need to create the average correlation matrix first. The simplest way in SPSS is via Discriminant Function Analysis. Go to Classify (under ‘Analyze’) and select ‘Discriminant’. Transfer the variables that you want to analyze into the Variables box. Then transfer your cluster or independent variable (your samples from different countries or cultures) into the ‘Grouping Variable’ box. You need to tell SPSS what the range of your country/sample codes is. In this case, the first sample is 1 (France) and the last sample in the database is 101 (Bosnian Serb sample).

To request the average correlation, click on statistics. There you need to click on ‘Pooled-Within Correlation’. Not much else that we need right now, so click ‘Continue’ and ‘Ok’. In the output, you will see the table with the pooled-within correlation matrix right after the lengthy group statistics.
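If you are curious what ‘pooled-within’ means computationally, here is a hedged Python sketch: deviations are taken from each sample’s own means, the resulting sums of squares and cross-products are added up across samples, and the total is rescaled to a correlation matrix. The two tiny samples and the function name `pooled_within_corr` are invented for illustration; SPSS does all of this for you in the step above.

```python
import math


def pooled_within_corr(groups):
    """Pooled-within correlation matrix.

    groups: list of samples; each sample is a list of rows, one row of
    variable values per respondent.
    """
    p = len(groups[0][0])
    w = [[0.0] * p for _ in range(p)]      # pooled within-sample cross-products
    for rows in groups:
        means = [sum(col) / len(rows) for col in zip(*rows)]
        for row in rows:
            dev = [x - m for x, m in zip(row, means)]
            for i in range(p):
                for j in range(p):
                    w[i][j] += dev[i] * dev[j]
    return [[w[i][j] / math.sqrt(w[i][i] * w[j][j]) for j in range(p)]
            for i in range(p)]


# Two hypothetical samples measured on two variables.
sample_a = [(1, 2), (2, 4), (3, 6)]
sample_b = [(5, 1), (6, 3), (7, 2)]

r = pooled_within_corr([sample_a, sample_b])
print([[round(v, 3) for v in row] for row in r])
```

Because each sample is centred on its own means first, mean differences between the samples do not leak into the correlations.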

There are two options now. Either way, you need to get the correlation matrix.
One option is to open a syntax file in SPSS and to type this command and include the proper correlation matrix from your output as well as the overall N:

* Add the overall N from your output before running, for example via the /N subcommand of MATRIX DATA.
MATRIX DATA VARIABLES=benefits publictransport tax stolengoods bribe homosexual prostitution abortion divorce euthanasia suicide.
BEGIN DATA
1.000
.434 1.000
.422 .516 1.000
.329 .429 .427 1.000
.338 .410 .428 .482 1.000
.232 .232 .244 .239 .267 1.000
.216 .249 .247 .274 .282 .544 1.000
.218 .238 .248 .266 .256 .334 .424 1.000
.204 .259 .252 .268 .273 .286 .355 .492 1.000
.220 .216 .235 .220 .233 .308 .295 .315 .327 1.000
.180 .210 .213 .239 .231 .275 .323 .315 .314 .430 1.000
END DATA.

Once you have it all typed out (or copied from SPSS), highlight it all and press the Play button (or ‘Ctrl’ + ‘R’).
A new SPSS window will open (probably best to save this new data file with a proper name). As you can see in this picture, this looks a bit different from your average SPSS data spreadsheet.

The first two columns are system variables (ROWTYPE_ and VARNAME_). The first line contains the sample size. If you don’t want to use the syntax, the other option is to create this SPSS data file directly. The first variable in the SPSS matrix file is called ROWTYPE_ (specify it as a string variable) and identifies the content in each row of the file (CORR, for correlations, in this example). The second variable is called VARNAME_ (again, specify it as a string variable) and contains the variable name corresponding to each row of the matrix. The FACTOR procedure also expects a row of sample size (N) values preceding the correlation matrix rows. Then type or copy the full correlation matrix.

We are nearly ready for the analysis. Unfortunately, SPSS does not support factor analysis of matrices directly via the graphical interface. In order to run the analysis, you need to use syntax (again).

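Type the following command into the same syntax window (it will run a standard PCA, with Varimax rotation, print the scree test, sort the factor loadings and suppress loadings smaller than .3). A command along these lines does the job:

FACTOR
  /MATRIX=IN(COR=*)
  /PRINT=INITIAL EXTRACTION ROTATION
  /FORMAT=SORT BLANK(.3)
  /PLOT=EIGEN
  /CRITERIA=MINEIGEN(1) ITERATE(25)
  /EXTRACTION=PC
  /ROTATION=VARIMAX.
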

Again, highlight the whole Factor command bit and hit play (or ‘Ctrl’ + ‘R’). You should see the output of the factor analysis based on the average correlation matrix. As you can see in the output, there are two factors that correspond to the ‘socio-sexual’ and the ‘dishonest-illegal’ factors. The scree test and Kaiser’s eigenvalue > 1 criterion both support a solution with only 2 factors.

Now you can either interpret this factor structure in your report or use it as a reference for further comparisons against each of the samples.


How to do Procrustean Factor Rotation with more than 2 groups

Today, I am continuing the torture with a bit more detail on options for comparing factor loadings across three or more groups within SPSS. This is a crucial issue for cross-cultural research and is becoming increasingly important, because researchers are starting to study more than two groups. More complex designs are more powerful in uncovering processes that can explain emerging behavioural differences, so this research should be strongly encouraged!

Aim: Compare the factor structure when you have more than two cultural groups, get an estimate of factor similarity

Why are we concerned with Procrustean Rotation? Factor rotation is arbitrary, therefore apparently dissimilar factor structures might be more similar than we think; procrustean rotation is necessary to judge structural and metric equivalence

Statistical Procedure:

The same syntax as for the two group case can be run with SPSS, but the greater number of countries adds additional problems. You have various options:

  1. Run all pairwise comparisons. However, this will lead to a substantial number of comparisons (especially if you have many samples). This also leads to a number of statistical problems (remember the family-wise error rate and inflated Type I errors).
  2. Select one country as your target group. For example, if an instrument was developed in the US, you may want to compare each group to the US.
  3. Compute the average correlation matrix and use it for your factor analysis. The average is sometimes called the pooled-within matrix. You would then compare each sample with the average structure across all samples (this can be done via discriminant function analysis in SPSS; you can then read the resulting correlation matrix into SPSS and use it as input for your factor analysis – see my discussion of how to do this here). This is highly appealing if you have many samples.

Computing the average correlation matrix as input to the factor analysis can be simplified if (a) you have samples with similar sample sizes (no sample is dominating the others; e.g., if you have one sample of 10,000 and three samples of 50 participants each, the large sample is driving the factor structure) and (b) you mean centre each item within each sample prior to the overall factor analysis. The centring is necessary to account for any group mean differences that might obscure relationships if the samples are pooled. See below for a graphical explanation of why this might be a problem.

As you can see, the relationship within each sample is negative: more sleep problems within each sample are associated with less laughter by participants. However, one group is consistently higher, for both the reported sleep problems and the laughing. There may be reasons why this is the case (I will come back to this example when talking about multilevel analysis), but for our analysis, combining the two samples would mean that we have a positive relationship across both samples combined (compared to negative relationships within both samples separately). This effect is due to the mean differences across both groups (I will post something soon on the beautiful complexity of these multi-level problems in psychology – very fascinating stuff).

As a consequence of this confounding of group differences with individual differences, we need to take any such mean differences into account before we can combine the samples. This can easily be done using the z-transformation option in SPSS (‘Save standardized values as variables’ under ‘Analyze’ -> ‘Descriptive Statistics’ -> ‘Descriptives’), applied within each sample.
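The confounding described above is easy to reproduce with made-up numbers. In this Python sketch (all values and names are invented), sleep problems and laughter are perfectly negatively related within each sample, yet the naive pooled correlation is strongly positive. Standardizing within each sample before pooling recovers the negative relationship.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def z_within(values):
    """z-transform one variable within a single sample."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: sample 2 is higher on both variables.
sleep = {"sample 1": [1, 2, 3], "sample 2": [6, 7, 8]}
laugh = {"sample 1": [3, 2, 1], "sample 2": [8, 7, 6]}

within = {g: pearson(sleep[g], laugh[g]) for g in sleep}
pooled_raw = pearson(sleep["sample 1"] + sleep["sample 2"],
                     laugh["sample 1"] + laugh["sample 2"])
pooled_z = pearson([z for g in sleep for z in z_within(sleep[g])],
                   [z for g in laugh for z in z_within(laugh[g])])

print("within each sample:", within)
print("pooled, raw scores:", round(pooled_raw, 3))
print("pooled, z-scored within sample:", round(pooled_z, 3))
```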

I believe the last option is the most appealing with large data sets.

However, cross-cultural psych never stops being complicated. What happens if you find that some samples show good factor congruence with the average factor structure and others do not? Ideally, you would exclude the poorly fitting samples from the average factor structure and re-run the analysis. Proceed iteratively till no sample shows any problems with factor similarity anymore.

If you have lots of cultural samples, you are really curious (and stats savvy) and want to find out what is happening in the strange worlds of culture, you may want to run cluster analysis on the congruence coefficients to identify clusters of samples that show greater similarity with each other. This might provide some interesting insights from a cross-cultural perspective. However, it is computationally demanding and relies on purely statistical criteria. There is a neat paper discussing various options and strategies, written by Welkenhuysen-Gybels and van de Vijver (2001, published in the Proceedings of the Annual Meeting of the American Statistical Association – I think this gives you an idea about what level of analysis we are talking about). You can also download a SAS macro (the link is in the paper) that does much of the computational work for you. I have never worked with SAS, it seems a parallel universe to me and I am fascinated, but scared of it. But there are people who think it is easy. Conceptually, it is a nice tool.

How to do Procrustean Factor Rotation

Today, it is a little bit less light-hearted, but hopefully a bit more practical.

Aim: To make factor structures maximally comparable & provide a statistical estimate of factor similarity

Why are we concerned with Procrustean Rotation? Factor rotation is arbitrary, therefore apparently dissimilar factor structures might be more similar than we think; procrustean rotation is necessary to judge structural and metric equivalence

Statistical Procedure:

An SPSS routine to carry out target rotation needs to be run (adapted from van de Vijver & Leung, 1997).

The following routine can be used to carry out a target rotation and evaluate the similarity between the original and the target-rotated factor loadings. One cultural group is assigned as the source and the second group is the target group. The varimax rotated (or unrotated) factor loadings for at least two factors obtained in two groups need to be inserted. The loadings need to be inserted, separated by commas, and each line is ended with a semicolon. The last line is not to end with a semicolon, but with a ‘}’. Failure to pay attention to this will result in an error message and no rotation will be carried out. To use an example, Fischer and Smith (2006) measured self-reported extra-role behaviour in British and East German samples. Extra-role behaviour is related to citizenship behaviour: voluntary and discretionary behaviour that goes beyond what is expected of employees, but helps the larger organization to survive and prosper. These items were supposed to measure a more passive component (factor 1) and a more proactive component (factor 2). The selection of the target solution is arbitrary; in this case we rotated the East German data towards the UK matrix.

Table 1. Items and varimax-rotated loadings in each sample separately

Item | UK Factor 1 | UK Factor 2 | Germany Factor 1 | Germany Factor 2
I am always punctual. | .783 | -.163 | .778 | -.066
I do not take extra breaks. | .811 | .202 | .875 | .081
I follow work rules and instructions with extreme care. | .724 | .209 | .751 | .079
I never take long lunches or breaks. | .850 | .064 | .739 | .092
I search for causes for something that did not function properly. | -.031 | .592 | .195 | .574
I often motivate others to express their ideas and opinions. | -.028 | .723 | -.030 | .807
During the last year I changed something. in my work…. | .388 | .434 | -.135 | .717
I encourage others to speak up at meetings. | .141 | .808 | .125 | .738
I continuously try to submit suggestions to improve my work. | .215 | .709 | .060 | .691


This cannot be done using the Windows interface within SPSS. You should run a factor analysis in each sample separately first. Use Varimax (orthogonal) rotation. Then insert the loadings into the LOADINGS and NORMS matrices in the SPSS syntax described in Fischer and Fontaine (2011, in Matsumoto and Van de Vijver’s Cross-Cultural Research Methods in Psychology). I can also email this syntax to you (contact me at Ronald.Fischer@vuw.ac.nz).

The start of the syntax is printed below. Be careful to separate the loadings by a ‘,’ and to end the row for each item with a ‘;’. The last row of each matrix ends with ‘}.’ instead.


compute LOADINGS={
.778, -.066;
.875, .081;
.751, .079;
.739, .092;
.195, .574;
-.030, .807;
-.135, .717;
.125, .738;
.060, .691}.

compute NORMs={
.783, -.163;
.811, .202;
.724, .209;
.850, .064;
-.031, .592;
-.028, .723;
.388, .434;
.141, .808;
.215, .709}.
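As a rough cross-check outside SPSS, the rotation step can be sketched in plain Python. This is not the published routine but a minimal orthogonal Procrustes solution under stated assumptions: it only handles two factors (it relies on a closed-form square root of a 2 x 2 matrix), the helper names are made up, and the loadings are the ones from Table 1. Rounded to two decimals, the rotated loadings and the root mean squared differences should agree closely with the output reported in the next section.

```python
import math

# Varimax loadings from Table 1: East German sample (source) and UK sample (target).
LOADINGS = [(.778, -.066), (.875, .081), (.751, .079), (.739, .092), (.195, .574),
            (-.030, .807), (-.135, .717), (.125, .738), (.060, .691)]
NORMS = [(.783, -.163), (.811, .202), (.724, .209), (.850, .064), (-.031, .592),
         (-.028, .723), (.388, .434), (.141, .808), (.215, .709)]


def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rotation_matrix_2f(source, target):
    """Orthogonal 2 x 2 matrix T that brings source as close as possible to target.

    T = S (S'S)^(-1/2) with S = source' * target. The closed-form matrix
    square root used here only works for two factors.
    """
    s = matmul(transpose(source), target)
    m = matmul(transpose(s), s)
    root_det = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    scale = math.sqrt(m[0][0] + m[1][1] + 2 * root_det)
    root = [[(m[0][0] + root_det) / scale, m[0][1] / scale],
            [m[1][0] / scale, (m[1][1] + root_det) / scale]]
    det = root[0][0] * root[1][1] - root[0][1] * root[1][0]
    root_inv = [[root[1][1] / det, -root[0][1] / det],
                [-root[1][0] / det, root[0][0] / det]]
    return matmul(s, root_inv)


T = rotation_matrix_2f(LOADINGS, NORMS)
rotated = matmul(LOADINGS, T)
# Differences are taken as rotated source loadings minus target loadings.
diff = [[r - n for r, n in zip(r_row, n_row)] for r_row, n_row in zip(rotated, NORMS)]
rms_per_factor = [math.sqrt(sum(row[j] ** 2 for row in diff) / len(diff))
                  for j in range(2)]

for row in rotated:
    print("{:6.2f} {:6.2f}".format(*row))
print("RMS difference per factor:", [round(v, 2) for v in rms_per_factor])
```

With more than two factors the same idea applies, but the matrix square root would need a general eigenvalue or singular value decomposition, which is what a matrix language or a numerical library is for.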

Output and Interpretation:

The edited output for this example is shown below. It shows the rotated matrix of the group (East Germany in our case) that was rotated to maximal similarity:


Run MATRIX procedure:

FACTOR LOADINGS AFTER TARGET ROTATION
 .77  -.10
 .88   .04
 .75   .05
 .74   .06
 .22   .57
 .00   .81
-.10   .72
 .16   .73
 .09   .69

DIFFERENCE IN LOADINGS AFTER TARGET ROTATION
-.01   .06
 .07  -.16
 .03  -.16
-.11   .00
 .25  -.03
 .03   .08
-.49   .29
 .02  -.08
-.13  -.02

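Square Root of the Mean Squared Difference per Variable (Item)

Square Root of the Mean Squared Difference per Factor
.19   .13

IDENTITY COEFFICIENT per Factor
.94   .97

ADDITIVITY COEFFICIENT per Factor
.86   .92

PROPORTIONALITY COEFFICIENT per Factor
.94   .97

CORRELATION COEFFICIENT per Factor
.86   .93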

The output shows the factor loadings following rotation, the difference between the rotated loadings and the loadings of the target structure (the UK sample here), as well as the differences of each loading squared and then averaged across all factors (the square root of the mean squared difference per variable column).

The first matrix could be pasted into a new table, showing the rotated loadings (instead of using the loadings from the original analysis as reported above in the table). The second matrix shows the differences after rotation. You should look for large values, because they indicate that some items are problematic. A low value would indicate good correspondence.

The column of values entitled Square Root of the Mean Squared Difference per Variable (Item) gives you information about each item. The larger the value, the more problematic an individual item is. The next row (Square Root of the Mean Squared Difference per Factor) shows the same information per factor. Again, smaller values are better; larger values indicate trouble for a particular factor. There are no hard and fast criteria for any of these indices, so you should look at the relative values and at particularly discrepant values.

The most important information is reported in the last four lines, namely the various agreement coefficients. As can be seen there, the values are all above .85 and generally are beyond the commonly accepted value of .90. The most common indicator is Tucker’s Phi which is called Proportionality coefficient here.
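For reference, here is a hedged Python sketch of how the four agreement coefficients are commonly defined, applied to the first factor. It uses the rotated East German loadings as printed in the output (two decimals) and the UK loadings from Table 1, so the results are approximate and should only land close to the first column of coefficients above. The function name `agreement` is made up.

```python
def agreement(x, y):
    """Four agreement coefficients between two columns of loadings."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    mx, my = sum(x) / n, sum(y) / n
    # Mean-corrected sums, as used for covariances and variances.
    cxy = sxy - n * mx * my
    cxx = sxx - n * mx * mx
    cyy = syy - n * my * my
    return {
        "identity": 2 * sxy / (sxx + syy),
        "additivity": 2 * cxy / (cxx + cyy),
        "proportionality": sxy / (sxx * syy) ** 0.5,   # Tucker's phi
        "correlation": cxy / (cxx * cyy) ** 0.5,
    }


# Factor 1: rotated East German loadings (from the output) and UK loadings (Table 1).
rotated_f1 = [.77, .88, .75, .74, .22, .00, -.10, .16, .09]
uk_f1 = [.783, .811, .724, .850, -.031, -.028, .388, .141, .215]

coefficients = agreement(rotated_f1, uk_f1)
for name, value in coefficients.items():
    print(f"{name:16s} {value:.2f}")
```

The identity and proportionality coefficients work on the raw loadings, whereas additivity and correlation first remove the column means, which is why the latter two tend to come out lower here.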

It is also worth noting that the first factor shows lower congruence and that the estimates vary across indicators. An examination of the differences between the loadings shows that one item (During the last year I changed something. in my work….) in particular shows somewhat different loadings. In the British sample, it loads moderately on both factors, whereas it loads highly on the proactivity factor in the German sample. Therefore, among the British participants making some changes in their workplace is a relatively routine and passive task, whereas for German participants this is a behaviour that is associated more with proactivity and initiative (e.g., Frese et al., 1996). We might want to exclude this item and re-run the analyses. Overall, we could cautiously conclude that our scales meet structural equivalence and most items might even meet metric equivalence (although this syntax routine does not provide a statistical test for this higher level of equivalence).