1

 Considerations for planning a research study

2

 The question "How many subjects do I need?" is possibly the most
frequently asked question in research design. The answer, however,
requires an understanding of several interrelated topics:
 Hypothesis testing
 Effect size
 Variability
 Statistical Power
 The emphasis in this portion of the tutorial is to refamiliarize you
with these concepts, and introduce you to the theory of power analysis.

3

 Hypothesis testing (HT) is a formal methodology that researchers use to
understand whether a "statistically significant" result is evident in
their collected data.
 As you know, HT sets up two competing "hypotheses," the null versus the
alternative.
 Distribution theory (normal, Student's t, chi-square, etc.) helps us
assess how likely the observed data would be if the null hypothesis
described the population from which the sample data were obtained.

4

 HTs are conducted by assuming the null hypothesis to be true, and
determining the probability that the collected data and resulting
statistics could reasonably be obtained, given this null hypothesis.
 If this probability turns out to be "small," usually less than .05, we
reject the null hypothesis in favor of the alternative.
 The obtained probability is referred to as "p," and the cutoff
probability (.05 as used here) is referred to as the level of
significance, or "α," for the HT.

5

 For ease of introduction of concepts, we will imagine a simple example
where we want to know if a group of underachieving students improves
above the expected average in academic self-confidence after
participating in a mentoring program.
 A valid self-confidence instrument (SCI) yielding reliable data is used
to summarize the students' ratings as a single score.
 To determine if the mentoring program is successful, we want to know if
the mean self-confidence for these students is greater than the normed
average of 60 for the SCI.

6

 Assuming that the variability of the sample of students is no different
from the variability of the normative sample, this example requires a
simple one-sample Z-test:
 H_{0}: μ = 60
 H_{1}: μ > 60
 Rejecting the null hypothesis lends support to the notion that the
mentoring program was helpful at improving the average score (i.e., above
the norm) for this group of students.
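
This one-sample Z-test can be sketched in a few lines of Python. The sample mean of 64 below is a hypothetical value chosen for illustration; σ = 10.995 and n = 30 are taken from the example developed in the following slides:

```python
from statistics import NormalDist

def one_sample_z_test(xbar, mu0, sigma, n):
    """One-sided, one-sample Z-test of H0: mu = mu0 vs H1: mu > mu0."""
    z = (xbar - mu0) / (sigma / n ** 0.5)   # observed Z statistic
    p = 1 - NormalDist().cdf(z)             # upper-tail p-value
    return z, p

# Hypothetical sample mean of 64 from n = 30 students
z, p = one_sample_z_test(xbar=64, mu0=60, sigma=10.995, n=30)
# z is about 1.99, p is about .023: reject H0 at alpha = .05
```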

7

 When we conduct our test, several conclusions are possible.
 We could choose to reject the null when the null is really true. This is
a bad decision, called Type I error. For any HT, P(Type I error) = α.
 We could choose not to reject the null, when the null is really false
and we should have rejected it. This is also a bad decision, called
Type II error. P(Type II error) = β.
 We could decide to reject the null when it is false and should be
rejected. This is called the power of a statistical test: power = 1 − β.
 We could decide not to reject the null when it is really true and
shouldn't be rejected. This is referred to as the confidence
level: 1 − α.

8

 Clearly, if the null hypothesis does not adequately represent our
sample, we hope to reject it.
This means that we need to use careful research design in order
to be confident in our decision to reject the null.
 As with every decision, we are never guaranteed that our decision is
the "right" one.
 We need, then, to understand how research design – and specifically
sample size decisions – contribute to improving the power of our
statistical tests.

9

 How can power be measured? We will work through a very simple example to
help you understand how design and sample considerations affect the
calculation of power.
 Power:
 1 − β = P(correct decision to reject H_{0})
 1 − β = P(reject H_{0} | H_{0} really false)
 1 − β = P(reject H_{0} | H_{1} true)

10

 The self-confidence inventory (SCI) historically has a mean of 60 in the
general population.
 H_{0}: μ = 60
 H_{1}: μ > 60
 H_{1} states that we think our students, on average, actually
score higher than the norm on the SCI.
 But even if we reject H_{0} in favor of H_{1}, then what is the "real"
mean that's represented by the alternative? Is it 65? 61? 70?

11

 In order to determine power, we need to specify the value we think the
mean has shifted to, i.e., the alternate mean, μ_{1}.
 This is what makes power calculations so complex.
 For our example, suppose we have the following information:
 n = 30
 σ = 10.995
 μ_{1} = ? (could be anything)

12

 Consider a situation where we hypothesize that the true mean may have
increased by 5 points. This represents a change of about .45 standard
deviations.
 d = (μ_{1} − μ_{0})/σ = 5/10.995 ≈ .45
 This value, d, is referred to as Cohen's d, a measure of effect
size. As you may recall, d's of around .2 are considered small effects,
d's around .5 are considered moderate, and d's around .8 are considered
large effects.
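
The same arithmetic in Python, using the values from the running example (a 5-point shift with σ = 10.995):

```python
def cohens_d(mu1, mu0, sigma):
    """Cohen's d: the hypothesized shift in standard-deviation units."""
    return (mu1 - mu0) / sigma

d = cohens_d(mu1=65, mu0=60, sigma=10.995)
# d is about 0.45: a moderate effect by Cohen's benchmarks
```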

13

 1. The first step in measuring power involves describing your hypotheses
in terms of the desired or expected effect size.
 H_{0}: μ_{0} = 60
 H_{1}: μ_{1} = 65
 d = (65 − 60)/10.995 ≈ .45 (moderate)
 We want to determine the power of our test if the true mean really
shifts to 65.
 Power is the probability that our HT will actually detect this change so
that we decide to reject the null.

14

 Next we'll need to decide on the level of significance. For this example
we'll use α = .05 for a one-sided hypothesis test.
 α = P(reject H_{0} | μ = 60)

15

 Now define the rejection region (RR) for your test. Here, we will
reject H_{0} for Z_{obs} > Z_{α}.
 The critical value for a one-tailed α of .05 is 1.645. Our RR becomes:
Reject H_{0} if Z_{obs} > 1.645.
 But what does 1.645 represent in terms of the sample mean? What sample
mean would our data have to exhibit in order to reject the null?
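
The rejection region can be translated to the sample-mean scale with a short Python sketch (parameters taken from the example):

```python
from statistics import NormalDist

mu0, sigma, n, alpha = 60, 10.995, 30, 0.05

z_crit = NormalDist().inv_cdf(1 - alpha)       # 1.645 for a one-tailed .05 test
xbar_crit = mu0 + z_crit * sigma / n ** 0.5    # cutoff on the sample-mean scale
# xbar_crit is about 63.3: reject H0 whenever the sample mean is 63.3 or larger
```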

16

 Converting the critical Z back to the scale of the sample mean:
 x̄_{crit} = μ_{0} + Z_{α}(σ/√n) = 60 + 1.645(10.995/√30) ≈ 63.3

17

 [Figure: sampling distribution of the sample mean under H_{0} (μ = 60),
with the rejection region x̄ ≥ 63.3 shaded.]

18

 According to the RR, we will reject H_{0} if we find a sample
mean of 63.3 or larger.
 It seems reasonable that if the population mean truly shifted to 65, our
test should be able to detect this effect (d = .45) with high
probability.
 How do we know this?
 We can use the Z-table to calculate the probability of getting a
sample mean of 63.3 or more, given that the "true" mean is considered
to be 65.

19

 Using the Z-table: Z = (63.3 − 65)/(10.995/√30) ≈ −0.85
 P(x̄ ≥ 63.3 | μ = 65) = P(Z ≥ −0.85) = .8023

20

 Power of our test = .8023
 P(reject H_{0} | H_{1}: μ = 65 true) = .8023
 This means that IF the population mean has truly shifted to 65,
sample values less than 63.3 would be relatively unlikely (≈ 20%).
 And IF the true mean shifted to 65, the test would be likely to detect
it (≈ 80%).
 Power ≥ .80 in the social sciences is great! But remember that this
power level was determined for a moderate effect size (d = .45).
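
This power figure can be reproduced in Python. Note that the slide's .8023 comes from rounding the cutoff to 63.3 and the Z value to −0.85 before consulting the table; the exact computation lands slightly lower, around .801:

```python
from statistics import NormalDist

mu0, mu1, sigma, n, alpha = 60, 65, 10.995, 30, 0.05

se = sigma / n ** 0.5                                 # standard error of the mean
xbar_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
power = 1 - NormalDist(mu1, se).cdf(xbar_crit)        # P(reject H0 | mu = 65)
# power is about .80
```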

21

 These power calculations hold only for the study parameters as we've
defined them:
 n = 30
 σ = 10.995
 α = .05, one-sided
 d = .45
 In general, when a research study calls for detection of small effect
sizes, high power is harder to achieve, and consequently we require a
larger sample size.
 In planning a study, we usually work backwards: first determining a
meaningful effect size, deciding on an acceptable power level, and then
determining the sample size that fits these guidelines.
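
For this one-sided, one-sample Z-test, "working backwards" has a closed form, n = ((Z_α + Z_β)/d)²; other analyses need different formulas. A sketch using the rounded d = .45 from the example:

```python
from math import ceil
from statistics import NormalDist

def n_for_power(d, power, alpha):
    """Smallest n giving a one-sided, one-sample Z-test the target power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical Z for the test
    z_beta = NormalDist().inv_cdf(power)        # Z matching the desired power
    return ceil(((z_alpha + z_beta) / d) ** 2)

n = n_for_power(d=0.45, power=0.80, alpha=0.05)
# n = 31 students for d = .45 with 80% power at alpha = .05
```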

22

 A meaningful effect size is one that is "substantively" meaningful: the
smallest effect you would hope to find while still feeling confident in
concluding that your research has found something practical and useful.
 Once a meaningful effect size and desired power are decided on,
statistics packages such as nQuery can help you determine the sample
size necessary to have a test that meets these constraints.
 The following graphs are designed to help you see what happens to power
as some of the parameters we’ve been discussing are changed.

23

 Suppose we have a sample of 100 rather than 30, but all else remains the
same (α = .05, one-sided):
 n = 100, σ = 10.995, d = .45, μ_{0} = 60, μ_{1} = 65

24

 Suppose we return to our sample of 30 but have population variability of
5 rather than 10.995 (α = .05, one-sided):
 n = 30, σ = 5, d = (65 − 60)/5 = 1.0, μ_{0} = 60, μ_{1} = 65

25

 Suppose we have our initial parameters but use α = .01 rather than .05:
 n = 30, σ = 10.995, d = .45, μ_{0} = 60, μ_{1} = 65

26

 Suppose we have our initial parameters but consider a change from 60 to
63 as "meaningful," rather than 60 to 65 (α = .05, one-sided):
 n = 30, σ = 10.995, d = (63 − 60)/10.995 ≈ .27, μ_{0} = 60, μ_{1} = 63
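
The four scenarios above can be recomputed with the same closed-form Z-test power calculation used earlier:

```python
from statistics import NormalDist

def power(n, sigma, mu0, mu1, alpha):
    """Power of a one-sided, one-sample Z-test against the alternate mean mu1."""
    se = sigma / n ** 0.5
    xbar_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu1, se).cdf(xbar_crit)

baseline = power(30, 10.995, 60, 65, 0.05)   # about .80
bigger_n = power(100, 10.995, 60, 65, 0.05)  # larger n: power rises toward 1
less_var = power(30, 5, 60, 65, 0.05)        # smaller sigma: power rises toward 1
strict_a = power(30, 10.995, 60, 65, 0.01)   # smaller alpha: power drops (about .57)
small_es = power(30, 10.995, 60, 63, 0.05)   # smaller shift: power drops (about .44)
```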

27

 These examples show how the following five factors impact power
determination (all other factors held constant):
 Sample size: as n increases, the power (the likelihood of your test to
detect a false H_{0}) also increases.
 Variability in your scores or measures (σ): with a lot of variability,
any shift is harder to detect, so power decreases as variability
increases.
 Alpha level: allowing a larger α increases power, but at the expense of
more Type I error. As α gets smaller (less error), power decreases; as α
gets larger (more error), power increases.

28

 Effect size refers to the smallest shift you want your test to detect:
small effects are hard to detect, and large effects are easier to
detect. As the effect size of interest decreases, power decreases.
 One-tailed tests are more powerful than two-tailed tests (because you
hypothesize the direction of the effect a priori).

29

 The best approach to the complex question of "how many subjects do I
need?" is to design studies to attain a desired power before collecting
any data, based on an effect size that we feel is meaningful for our
research problem.
 There are programs (such as nQuery) and tables for power based on n, α,
type of test, and effect size.
 Each type of analysis has a different way of calculating effect size.
 An efficient sample size depends on a balance between n, α, power, σ,
and effect size.

30

 In Lessons 1–5, we have worked out a series of examples to demonstrate
use of the nQuery software for calculating power.
 These examples should help you learn how to design research studies
using an efficient or appropriate sample size (not too small, not too
large).
 Thanks for reviewing these materials. Please start by using the menu
headings to click on the analysis you are reviewing today.
 THANKS
