
Some Food for Thought
About Effect Size
Howard S. Bloom
Mark W. Lipsey
Vanderbilt University
March 11, 2004
The following examples highlight some important facts of life about the metric of effect size. The definition of effect size that is used equals the difference
in mean outcomes for two groups (usually a treatment group and a control group)
divided by the standard deviation of the outcome measure for a relevant group
(usually the control group). We will use these examples in our discussion to
illustrate how one should think about the precision that is needed for a study of
program impacts and thus, the sample size and allocation that is required to
produce this precision. The examples are presented in no particular order and do
not lead to a specific conclusion that applies to all situations. Rather they are
intended to illustrate the range of situations that might exist and the corresponding
range of considerations that must be taken into account when designing an
experimental study of the effects of an intervention.
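
The definition above can be sketched as a small function. This is only an illustration: it assumes the control group's sample standard deviation is the denominator, and the function name `effect_size` and the scores are hypothetical, not taken from any study discussed here.

```python
from statistics import mean, stdev


def effect_size(treatment, control):
    """Difference in group means divided by the control group's standard deviation."""
    return (mean(treatment) - mean(control)) / stdev(control)


# Hypothetical outcome scores, made up purely for illustration.
control = [10.0, 12.0, 14.0, 16.0, 18.0]
treatment = [12.0, 14.0, 16.0, 18.0, 20.0]

es = effect_size(treatment, control)
print(f"effect size = {es:.2f} standard deviations")
```

Using a different reference group (for example, a pooled standard deviation) would change the denominator and therefore the reported effect size.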
1. The Effect of Aspirin on Heart Attacks1
In 1987 a randomized clinical test of aspirin’s ability to prevent heart
attacks was ended prematurely because the observed reduction in heart
attacks produced by aspirin was too large to justify continuing to give a
placebo to control group members. This test was based on a sample of
22,071 physicians, half of whom were randomized to an aspirin regimen
and half of whom were randomized to a placebo. 1.71 percent of the
physicians who received the placebo experienced a heart attack, whereas
only 0.94 percent of the physicians who received the aspirin regimen did
so. Hence, aspirin reduced the rate of heart attacks by 0.77 percentage
points. This represents an effect size of 0.06 standard deviations.
Question: Why was this effect size large enough to stop the test and prompt
a declaration of victory for aspirin? What other way/s might this impact be
presented?
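
The arithmetic behind the figures above can be sketched as follows. The sketch assumes the effect size was computed with the placebo group's standard deviation of a 0/1 outcome, sqrt(p(1 - p)); the text does not say which standard deviation was used, so treat this as one plausible reconstruction. The relative reduction at the end is just one alternative way to express the same impact.

```python
from math import sqrt

p_placebo = 0.0171  # heart attack rate in the placebo group (from the text)
p_aspirin = 0.0094  # heart attack rate in the aspirin group (from the text)

# Assumption: a 0/1 outcome with mean p has standard deviation sqrt(p * (1 - p)).
sd_placebo = sqrt(p_placebo * (1 - p_placebo))

difference = p_placebo - p_aspirin           # reduction as a proportion
es = difference / sd_placebo                 # reduction in standard deviation units
relative_reduction = difference / p_placebo  # reduction relative to the placebo rate

print(f"reduction: {100 * difference:.2f} percentage points")
print(f"effect size: {es:.2f} standard deviations")
print(f"relative reduction: {100 * relative_reduction:.0f} percent")
```

Under this assumption the same impact reads as about 0.06 standard deviations or as a reduction of roughly 45 percent relative to the placebo rate.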

1 Rosenthal, Robert (1994) “Parametric Measures of Effect Size” in Harris Cooper and Larry V.
Hedges, The Handbook of Research Synthesis (New York: Russell Sage Foundation).
2. Cohen’s Rules of Thumb2

Jacob Cohen is renowned for promulgating the characterization of effect
sizes of roughly 0.2 standard deviations as “small”, 0.5 standard deviations
as “medium” and 0.8 standard deviations as “large”. Cohen (p. 532) states
that: “these proposed conventions were set forth throughout with much
diffidence, qualifications, and invitations not to employ them if possible.
The values chosen had no more reliable a basis than my own intuition.
They were offered as conventions because they were needed in a research
climate characterized by a neglect of attention to issues of magnitude. The
ES measures and conventions have been successful, widely adopted not
only for power analysis, but more widely, for example, in ES surveys and
in meta-analyses. But there are difficulties and much room for
misunderstanding.”
Question: Why has Cohen’s characterization been used so widely by social
scientists for the past several decades and what does this mean for future
research on high school programs?
3. Lipsey’s Empirical Benchmarks3
Mark Lipsey has provided an empirical basis for judging the magnitudes of
effect sizes from studies of psychological, educational and behavioral
treatments. His findings represent the distribution of mean effect size
estimates from 102 meta-analyses, which summarize the results of 6700
individual studies (most of which were nonexperimental) involving almost
800,000 subjects. The bottom third of this distribution ranges from 0.00 to
0.32 standard deviations, the middle third ranges from 0.33 to 0.55 standard
deviations, and the top third ranges from 0.56 to 1.20 standard deviations.
The midpoints of these three categories are 0.15, 0.45 and 0.90 standard
deviations, respectively. This is astonishingly consistent with Cohen’s rules
of thumb.

Question: Does this mean that Cohen’s rules of thumb are a good basis for
judging the magnitudes of effects for all programs? For high school
programs?
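
The comparison between Lipsey's thirds and Cohen's conventions can be checked with a short sketch. The function name `lipsey_third` is hypothetical, and the cut points are simply the ranges reported above; values between the reported ranges (for example 0.325) are assigned to the lower third as an arbitrary convention.

```python
def lipsey_third(es):
    """Place an effect size in the thirds of Lipsey's distribution as reported in the text."""
    if es < 0.00 or es > 1.20:
        return "outside reported range"
    if es <= 0.32:
        return "bottom third"
    if es <= 0.55:
        return "middle third"
    return "top third"


# Cohen's conventional values for "small", "medium" and "large".
cohen = {"small": 0.2, "medium": 0.5, "large": 0.8}

for label, value in cohen.items():
    print(f"Cohen's {label} ({value}) falls in Lipsey's {lipsey_third(value)}")
```

Each of Cohen's three values lands in the corresponding third of the empirical distribution.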

2 Cohen, Jacob (1988) Statistical Power Analysis for the Behavioral Sciences 2nd ed. (Hillsdale,
NJ: Lawrence Erlbaum).
3 Lipsey, Mark W. (1990) Design Sensitivity: Statistical Power for Experimental Research
(Newbury Park, CA: Sage Publications).

4. The Effect of Career Academies on Future Earnings4
Career Academies, which were established more than 30 years ago as an
alternative to large comprehensive high schools, now exist in more than
2,500 locations across the U.S. Some of the key distinguishing features of
Career Academies are that they: combine academic and technical
instruction that is focused on career themes, build partnerships with
employers that provide linkages for students, and offer many types of work-
based learning opportunities. Since 1993 MDRC has been conducting a
randomized experimental study of Career Academies in nine high schools
from around the country. The most recent results from this study examine
labor market impacts during the first four years after high school.5
Although little or no impact was observed for young women in the study
sample, the impacts observed for young men were pronounced. Career
Academies increased their average earnings during the follow-up period by
$212 per month. This is 18 percent higher than their earnings would have
been without the program. And it is more than twice the roughly $100
monthly earnings difference that exists between young workers with one or
two years of post-secondary education and those with only a high school
diploma or a GED.6 When expressed as an effect size, the observed impact
equals 0.30 standard deviations.

Question: What utility, if any, does an effect size metric have for this
outcome measure and why? What should one conclude from assessing the
observed impact in the several ways presented?
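
The three presentations of the earnings impact are tied together by simple arithmetic, sketched below. The implied counterfactual earnings and implied standard deviation are back-calculated from the rounded figures in the text, so they are approximations for illustration and not values reported by the study.

```python
impact = 212.0        # dollars per month (from the text)
percent_gain = 0.18   # impact as a share of counterfactual earnings (from the text)
effect_size = 0.30    # impact in standard deviations (from the text)
benchmark_gap = 100.0 # approximate monthly earnings gap cited as a benchmark

# Back-calculated quantities (approximate, because the inputs are rounded).
implied_counterfactual = impact / percent_gain
implied_sd = impact / effect_size
ratio_to_benchmark = impact / benchmark_gap

print(f"implied counterfactual earnings: about ${implied_counterfactual:.0f} per month")
print(f"implied standard deviation: about ${implied_sd:.0f} per month")
print(f"impact relative to benchmark gap: {ratio_to_benchmark:.2f} times")
```

The same $212 looks modest as 0.30 standard deviations because monthly earnings vary widely across individuals, which makes the implied standard deviation large relative to the mean.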
4 Based on computations for Kemple, James (forthcoming) “Career Academies: Impacts on Labor
Market Outcomes and Educational Attainment” (New York: MDRC).
5 More precisely, they focus on labor market outcomes during the first four years after what
would be the date of on-time graduation for sample members.
6 Pond, Nathan, Andrew Sum, Turub’sky, Mykhaylo and Frank Meredith (2002) “Trends in the
Level and Distribution of the Weekly and Annual Earnings of Young Adult Men and Women in
the U.S., 1973-2001” (Washington, DC: National League of Cities Institute on Youth,
Education and Young Families).

5. The Longer-term Effects of Small Classes7
Project STAR (Student/Teacher Achievement Ratio), the Tennessee Class
Size Experiment, is widely hailed as providing the most compelling
evidence that exists on the effect of class size on student achievement.
This landmark study randomized teachers and students in kindergarten
through third grade from 79 schools located in 42 Tennessee school
districts to either a smaller class, with 13 to 17 students, or a larger
class, with 22 to 26 students. Depending on when students entered the
schools in the study, they spent between one and four years in either a
smaller class or a larger class (with some students shifting between these
alternatives). Follow-up data for these students made it possible to
measure the impacts of class size in grades K-3 on their performance in
reading, mathematics and science tests in grades 4, 6 and 8. These
follow-up data indicate that: “the average effect of small classes was
statistically significant and positive for both mathematics and reading
achievement at every grade level, ranging from 0.11 to 0.20 standard
deviation units. The small class effect was positive for science
achievement at all grades (ranging from 0.10 to 0.17 standard deviation
units) and was statistically significant for both Grades 6 and 8. There was
little evidence of interaction between gender and class size ... there was
no evidence that small class effects varied across schools” (p. 132).
Question: These effects are statistically significant, but are they of
practical significance? What else would you want to know before deciding
whether this is an educationally meaningful effect?

7 Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of
Small Classes: A Five-Year Follow-Up of the Tennessee Class Size Experiment,” Educational
Evaluation and Policy Analysis, 21:2, 127-142.
6. The Black-White Test Score Difference8
Controversy has raged for decades over the magnitude, the causes and the
consequences of the difference between standardized test scores for Black
students and white students. The table below reports this difference for
national samples of seventeen-year-old students in selected years between
1971 and 1994. This information, which was obtained from the National
Assessment of Educational Progress (NAEP), is reported as an effect size.
Effect size was defined as the mean score for Blacks minus the mean score
for whites divided by the standard deviation of individual scores for
students of all races, adjusted for measurement error.

Black-White Differences in Mean NAEP Scores
For Seventeen-Year-Old Students
(Mean Score Difference in Standard Deviations)
[Table values not preserved in this copy; surviving column heading: Mathematics]
SOURCE: Table 5-2 of Hedges and Nowell (1998)
NOTE: Standard deviations were computed for students of all races and
adjusted for measurement error.
Question: What are at least two ways that this information could be used to
assess impact estimates for high school reforms? What is implied by the
fact that effect size is adjusted for measurement error?

8 Hedges, Larry V. and Amy Nowell (1998) “Black-White Test Score Convergence Since 1965,”
in Christopher Jencks and Meredith Phillips, editors, The Black-White Test Score Gap
(Washington, DC: Brookings Institution Press).
7. The Impact of Welfare-to-Work Programs on Earnings9

MDRC recently conducted a quantitative research synthesis of three of its
largest multi-site welfare-to-work experiments: California’s Greater
Avenues for Independence (GAIN) program, Florida’s Project Independence
(PI), and the National Evaluation of Welfare-to-Work Strategies (NEWWS).
Data for this analysis represent random assignment experiments in 59 local
welfare offices from seven states. These data reflect the experiences of
69,399 female single parents who had applied for, or were receiving,
welfare when they were randomized to a program group, which was offered a
wide range of special employment and training services, or a control group,
which only had access to standard services provided by existing welfare
systems. A separate program impact was estimated for each of the 59 local
program offices. The table below presents a summary of the findings
obtained for impacts on average total earnings during the first two years
after random assignment. This summary reports the 25th, 50th and 75th
percentile values for the 59 impacts in three different metrics: (1) total
constant dollars, (2) a percentage of what earnings would have been without
the program (its counterfactual), and (3) an effect size in units of the
standard deviation of the outcome measure for individual control group
members.

Impacts of Welfare-to-Work Programs
On Average Two-Year Follow-up Earnings
[Table values not preserved in this copy; columns: Percentile, followed by
the impact in each of the three metrics listed above, the last headed
Impact in Standard Deviations]
SOURCE: Computations by Bloom, Hill and Riccio (2003)
Question: What do these findings suggest about the use of effect size for
reporting the impacts of welfare-to-work programs? What do the findings
suggest about interpreting effect sizes from research on high schools?

9 Bloom, Howard S., Carolyn J. Hill and James A. Riccio (2003) “Linking Program
Implementation and Effectiveness: Lessons from a Pooled Sample of Welfare-to-Work
Experiments”, Journal of Policy Analysis and Management, 22:4, 551-575.

Source: http://demos.usal.es/courses/DOCCOMB0608DOC001/document/Clases_SPSS/Some_food_for_thought_about_effect_size.pdf?cidReq=DOCCOMB0608DOC001
