I realise that not everybody will take the time to read this article in its entirety. For those who do not, the two main points that you should nonetheless take away are these:
- statistical significance is not the same thing as what we might call ‘practical significance’ – while we might be confident, based on a statistical test, that an observed effect (e.g. a difference between two groups) is ‘real’, and not just the result of random variations in our limited sample, this does not necessarily mean that the effect is large and, indeed, we may not be able to say anything much about the size of the effect; and
- p-values are not ‘magic bullets’ that can tell us exactly what to think about our data, based upon some fixed threshold (e.g. p < 0.05) – it is necessary to consider them in the context of the data, and of the importance and consequences of any conclusions that may be drawn and/or decisions made based upon the results.
A Hypothetical Example – Who Likes Anchovies on a Pizza?
In order to explain the statistical analysis of the ‘multiple choice’ rating scales, I have prepared a set of fictional responses to the statement ‘I enjoy anchovies on a pizza’. In this hypothetical example, respondents are divided into three mutually exclusive categories: those aged between 12 and 17 years (28 respondents); those aged between 18 and 25 years (97 respondents); and those aged over 25 years (68 respondents). The responses are summarised in the following chart.
At a glance, you might infer that respondents in the 18-25 age group are the most enthusiastic about anchovies, those over 25 are fairly ambivalent, and those in the 12-17 age group tend to be a little more anchovy-averse. Indeed, if we take an average of the responses in each group (using the ‘1’ to ‘5’ scale), the 18-25’s score 3.41, the over-25’s score 3.08, and the 12-17’s score 2.96. But if you think about it a little, you will realise that there are a number of factors complicating the interpretation of these results.
- What does an ‘average score’ tell us? While an average is a convenient way to ‘summarise’ the results in each group, it does not necessarily tell us much about the distribution of responses in each case. For example, the averages for the 12-17’s and the over-25’s are very similar, even though the distributions of responses look quite different.
- The data is not strictly ‘numerical’. While we can ascribe an integer value to each response, the data is not really numerical in nature. In fact it is categorical, i.e. respondents are required to choose between five discrete categories. And while the categories have a distinct ordering, from ‘least’ to ‘most’, the ‘size’ of the steps is not defined in a numerical sense. Additionally, because the choices are discrete, the averages are also discrete, i.e. the average of each group can only take on a certain set of possible values, which depends upon the number of respondents in the group.
- The groups have different numbers of respondents. Each group represents a sample of an overall population of people within the corresponding category. Assuming that the sample is representative and random, we can assume that as the sample size becomes larger, the distribution of responses will tend towards the ‘true’ distribution of the underlying population. However, small samples are quite likely to deviate from the population simply by chance. On the face of it, we would be more confident of the sample distribution for the 18-25’s (N=97) being representative of that population than we would of the sample distribution for the 12-17’s (N=28).
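The first two points can be made concrete with a short sketch. The response counts below are invented purely for illustration (they are not the survey data), and the names `polarised`, `ambivalent` and `mean_score` are my own. The two made-up groups have very different distributions but exactly the same average, and the average itself can only move in steps of 1/N.

```python
# Hypothetical response counts on the 1-5 scale (index 0 = response '1').
# These numbers are invented for illustration and are NOT the survey data.
polarised = [10, 2, 4, 2, 10]   # mostly strong dislikes and strong likes
ambivalent = [1, 5, 16, 5, 1]   # mostly 'neutral' responses


def mean_score(counts):
    """Average response when counts[i] respondents chose response i + 1."""
    total = sum(counts)
    return sum((i + 1) * c for i, c in enumerate(counts)) / total


# Two very different distributions, one identical average.
print(mean_score(polarised))   # 3.0
print(mean_score(ambivalent))  # 3.0

# The sum of N integer responses is an integer, so the average of a group
# of N respondents can only change in steps of 1/N.
n = sum(polarised)
smallest_step = 1 / n
print(n, smallest_step)
```

An average of 3.0 therefore says nothing about whether a group is indifferent to anchovies or sharply divided over them.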
Asking ‘Sensible’ Questions of Our Data
To address the above issues, we need to be clear about the questions we are trying to answer using the data. I might, for example, be curious about whether people in the 18-25 age group like anchovies more than those in the 12-17 age group. But what does that really mean? Clearly, there are some survey respondents in both groups who really like anchovies, as well as some who really hate them! What I need to do is to ask a question that I can answer meaningfully using statistical methods. Here is such a question:
If I were to select one person, at random, from each of the 12-17 and 18-25 age groups, is it more likely than not (i.e. is the probability over 50%) that the person in the 18-25 age group enjoys anchovies on a pizza more than the person in the 12-17 age group?
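For a sample, the quantity in this question can be computed directly by comparing every possible pairing of one respondent from each group. The following is a minimal sketch, again using invented counts rather than the survey data; the function name `compare_groups` and the two example groups are hypothetical. Because the responses are discrete, some pairs will be tied, so the sketch reports ties separately.

```python
from fractions import Fraction


def compare_groups(counts_a, counts_b):
    """Compare one random member of group a with one random member of group b.

    counts_x[i] is the number of respondents in group x who chose response
    i + 1. Returns exact probabilities (a higher, a lower, tied).
    """
    pairs = sum(counts_a) * sum(counts_b)
    higher = lower = ties = 0
    for i, ca in enumerate(counts_a):
        for j, cb in enumerate(counts_b):
            if i > j:
                higher += ca * cb
            elif i < j:
                lower += ca * cb
            else:
                ties += ca * cb
    return Fraction(higher, pairs), Fraction(lower, pairs), Fraction(ties, pairs)


# Hypothetical counts, invented for illustration only.
older = [1, 1, 2, 3, 3]
younger = [3, 3, 2, 1, 1]

p_higher, p_lower, p_tie = compare_groups(older, younger)
print(float(p_higher), float(p_lower), float(p_tie))
```

This only describes the sample. Whether the same is true of the wider population is the question that the hypothesis test has to answer.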
Answering Our Question With Hypothesis Testing
We now have a question that is a testable expression of the idea of people in the 18-25 age group liking anchovies more than those in the 12-17 age group.
To perform a statistical test based on this question, the first step is to set up a ‘null hypothesis’ (H0), which is essentially the thing that I am going to conclude in the absence of ‘convincing’ evidence to the contrary (we will come to what counts as ‘convincing’ later). You might think of H0 as what I ‘ought’ to believe in the absence of any evidence, and assuming I can free myself of any existing bias. Thus, although I might be inclined personally to think that older people are more likely to enjoy anchovies than younger people, I should probably choose H0 as the hypothesis that persons in the 18-25 group do not enjoy anchovies more than the 12-17’s.
Having determined my null hypothesis, I then need to run some test to see whether the available data tells me that I should be persuaded to change my mind, i.e. reject H0, and accept instead the alternative hypothesis (H1). In this case, H1 is that the answer to the question posed above is ‘yes’ (i.e. that it is more likely than not that a random person in the 18-25 age group enjoys anchovies on a pizza more than a random person in the 12-17 age group).
As it turns out, a statistical test exists that is applicable to this situation: the Mann-Whitney U test. This test generates a statistic (called, unsurprisingly, U) which is not very interesting in its own right, since it is not as intuitive as something like a mean, or a standard deviation. However, it is also possible to compute a p-value for the U statistic.
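For readers who want to see the mechanics, here is a rough sketch of a one-sided Mann-Whitney U test written in plain Python. It rests on several assumptions of my own: the p-value comes from a normal approximation with a correction for ties and no continuity correction, the helper names (`expand`, `mann_whitney_greater`) are hypothetical, and the data is invented. Statistical packages may use exact methods or a continuity correction, so their p-values can differ slightly from those produced here.

```python
import math
from collections import Counter


def expand(counts):
    """Turn counts per response (index 0 = response '1') into a list of scores."""
    return [i + 1 for i, c in enumerate(counts) for _ in range(c)]


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of whether x tends to exceed y.

    Returns (u, p). u counts the (x, y) pairs in which x is larger, with
    ties counted as one half. p uses a normal approximation with a tie
    correction and no continuity correction. Assumes the pooled data
    contains at least two distinct values (otherwise the variance is zero).
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    n = n1 + n2
    # Tie correction: t is the number of pooled observations sharing a value.
    tie_term = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return u, p


# Hypothetical data, invented for illustration only.
older = expand([1, 1, 2, 3, 3])
younger = expand([3, 3, 2, 1, 1])

u, p = mann_whitney_greater(older, younger)
print(u, p)
```

Dividing U by the number of pairs (here 10 × 10) gives the proportion of pairings in which the first group scores higher, with ties counted as half, which is why the test suits a question phrased in terms of ‘more likely than not’.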
What is a p-Value?
In the context of hypothesis testing, a p-value can be interpreted, informally, as the probability, under a specified statistical model (such as the one described above), that a given summary statistic would be equal to or more extreme than its observed value, assuming that the null hypothesis is true.
How does this apply to our hypothetical survey question? We can agree that it ‘looks like’ 18-25’s (average survey score 3.41) like anchovies more than 12-17’s (average survey score 2.96). But let’s be good, open-minded, statisticians and assume that H0 is true, i.e. that this appearance of difference is just an artefact of the limited sample sizes. So we compute the (one-sided) Mann-Whitney U statistic, which indicates that it does indeed appear more likely than not, based on the data, that a random person in the 18-25 age group enjoys anchovies on a pizza more than a random person in the 12-17 age group. Furthermore, we find that the corresponding p-value is 0.0217. That is, if H0 were true, the probability of obtaining a result at least as extreme as this one purely ‘by chance’ would be just 2.17%. (Note that this is not the same thing as the probability that H0 is true.)
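One way to make this definition tangible is to simulate the null hypothesis. If age group really made no difference, the group labels would be arbitrary, so we can shuffle them many times and count how often the shuffled data produces a statistic at least as extreme as the one observed. The sketch below does this for invented data; it is a permutation test rather than the calculation used for the figures quoted in this article, and the function names, the number of shuffles and the seed are all my own choices.

```python
import random


def u_statistic(x, y):
    """Number of (x, y) pairs in which x is larger, counting ties as half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def permutation_p_value(x, y, n_shuffles=5000, seed=0):
    """Estimate the one-sided p-value by shuffling the group labels.

    Returns the fraction of shuffles whose U statistic is equal to or more
    extreme (here: at least as large) than the observed value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = u_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        if u_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical responses on the 1-5 scale, invented for illustration only.
older = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5]
younger = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5]

p_estimate = permutation_p_value(older, younger)
print(p_estimate)
```

The result is an estimate of how often chance alone would produce a difference this large, which is exactly what the p-value is meant to express.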
What Do We Do With Our p-Value?
Now comes the hard part – we have to decide what to do with this information! In particular, should we change our minds and abandon the null hypothesis in favour of the alternative conclusion that 18-25’s really do, as a group, tend to enjoy anchovies on pizzas more than 12-17’s? In other words, is a p-value of 2.17% small enough to convince us to jump ship from the null hypothesis? (Now you should be able to see that the p-value is an objective parameter that we can use to assess how ‘convincing’ our evidence is.)
Contrary to what you might have heard, or read in social science research papers, there is no ‘magic’ p-value threshold (e.g. 0.05 is commonly quoted) for rejecting the null hypothesis. You need to look at all the relevant circumstances, including factors such as how much confidence you have in the assumptions underlying your statistical model, the quality and reliability of the raw data, and the importance of any decision to be made based on your conclusion, in order to make a determination of what you consider sufficiently ‘convincing’.
In this case, I would be quite comfortable making a statement such as: the survey response data provides reasonably strong evidence (p=0.0217) that a person in the 18-25 age group is more likely to enjoy anchovies on a pizza than a person in the 12-17 age group.
Performing the same test between the over-25’s and 12-17’s, we find p=0.194. I don’t know about you, but this is nowhere near sufficient to persuade me to abandon the null hypothesis in this case, so until and unless somebody produces additional evidence to the contrary, I am going to stick with believing that there is no real difference in anchovy preferences between these two groups!
Finally, comparing the 18-25’s with the over-25’s, we find p=0.0408. Taken by itself, I might regard this as a marginal result. However, we have already concluded that there is a difference between 18-25’s and 12-17’s, and have found no convincing evidence of a difference between over-25’s and 12-17’s. It would therefore be consistent to reject the null hypothesis again, and conclude that people in the 18-25 age group are more likely to enjoy anchovies on a pizza than people in the over-25 age group.
How Do Our Initial Impressions Stack Up?
Recall our initial thoughts on the data: that it looked as if respondents in the 18-25 age group are the most enthusiastic about anchovies, those over 25 are fairly ambivalent, and those in the 12-17 age group tend to be a little more anchovy-averse.
Well, we now have a statistical basis to support our belief that the 18-25’s in our hypothetical survey are the most enthusiastic anchovy-topped pizza consumers. However, it turns out that the data does not provide support for there being any difference between the 12-17’s and the over-25’s.
p-Values – Is There Anything They Can’t Do? (Yes, Quite a Lot!)
One point that is very important to understand about the p-value is that it only tells you something about the specific question that you asked. In this case, a small p-value might convince me that persons in one age group enjoy anchovies more than those in another age group, but it tells me nothing about how much more. For relatively small sample sizes (as here), the differences may need to be quite substantial in order to obtain a small p-value. However, for very large samples even a fairly small difference in preferences may be detectable, and result in a small p-value.
To put it another way, a p-value measures the statistical significance of a result, such as a difference in preferences. It does not directly measure what we might colloquially refer to as the ‘significance’ of the difference, i.e. its absolute or relative magnitude (although for given sample sizes, a smaller p-value is generally indicative of a larger difference). This type of ‘practical significance’ is something that you may then need to judge by other methods.
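The effect of sample size is easy to demonstrate. In the sketch below, two invented groups differ only slightly in their preferences. Multiplying every count by 100 leaves the proportions, and therefore the size of the difference, unchanged, yet the p-value falls dramatically. As before, this is an illustration under my own assumptions: the counts are made up, `u_test_from_counts` is a hypothetical helper, and the p-value uses a normal approximation with a tie correction and no continuity correction.

```python
import math


def u_test_from_counts(counts_a, counts_b):
    """One-sided Mann-Whitney U test that group a tends to score higher.

    counts_x[i] is the number of respondents in group x choosing response
    i + 1. Returns (effect, p), where effect is the proportion of
    (a, b) pairs in which a scores higher, counting ties as half.
    """
    n1, n2 = sum(counts_a), sum(counts_b)
    u = 0.0
    for i, ca in enumerate(counts_a):
        for j, cb in enumerate(counts_b):
            if i > j:
                u += ca * cb
            elif i == j:
                u += 0.5 * ca * cb
    n = n1 + n2
    # Tie correction: every respondent choosing the same response is tied.
    tie_term = sum((ca + cb) ** 3 - (ca + cb) for ca, cb in zip(counts_a, counts_b))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u / (n1 * n2), 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts with only a slight difference in preferences.
group_a = [4, 5, 6, 5, 5]
group_b = [5, 5, 6, 5, 4]

effect_small, p_small = u_test_from_counts(group_a, group_b)

# Same proportions, one hundred times as many respondents.
effect_large, p_large = u_test_from_counts(
    [100 * c for c in group_a], [100 * c for c in group_b]
)

print(effect_small, p_small)
print(effect_large, p_large)
```

The proportion of pairings favouring the first group is the same in both cases, so the difference is no more ‘important’ in the larger sample; the larger sample merely makes us more confident that the difference is not zero.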
Comparing average response values between groups is one approach to measuring the magnitude of a difference. However, I have already noted some difficulties with interpreting such averages, and assessing differences in averages adds yet another level of complexity. In particular, standard tests that are commonly used to compare statistical parameters of continuous variables – such as Student’s t-test, or analysis of variance (ANOVA) – are not valid for categorical variables. There does not appear to be a generally agreed approach for performing such comparisons, and potential methods that I have considered typically require relatively large sample sizes across all categories and response values in order to produce meaningful results.
I have, accordingly, not endeavoured to perform any statistical analysis of the magnitude of differences between different groups of respondents to the attorney survey, and leave it to readers to make their own careful judgments based upon the data presented!