Demonstration experiments: their value, limitations, and relevance to replicability issues

What makes an experimental finding in psychology important? What significance should we attach to a failed, or for that matter a successful, attempt to replicate that finding? My starting point in addressing these questions is the acknowledgment that findings in most areas of psychology are demonstrations of what can happen under a given set of circumstances. They are not demonstrations of what must happen, and may not be demonstrations of what is most likely to happen, or what happens more often than not. The classic demonstration experiments in social psychology, in particular, I suggest, were valuable because they went beyond showing that particular manipulations of situations or interpretations of those situations could influence research participants’ responses. Collectively, they illustrated ways in which common assumptions and predictions about the relative impact and predictive value of particular influences are not only imperfect but also biased in specific, systematic respects.

In particular, researchers showed that the influence of perceived peer group norms, and of what Kurt Lewin (1958) called “channel factors” and Thaler and Sunstein (2008) called “nudges,” proves more impactful in determining human choices and actions than most people recognize. Implicitly, but occasionally also explicitly, researchers further showed that ordinary people, especially those in individualistic societies like the US, are led astray by their assumptions about the stability and cross-situational consistency of individual differences in personality traits and dispositions. To some extent, classic demonstrations in the situationist tradition were surprising because most of the everyday behavior we observe and attribute to personal traits confirms our expectations.

The actors in our world generally do behave as they have in the past and as we expect them to behave in the present. However, much of that predictability reflects the fact that we mostly see people fulfilling continuing roles and commitments, and making choices and decisions that are constrained by the effects of their prior choices and circumstances. In other words, in our everyday social experiences, the factors of person and situation are generally confounded. Rarely do we see random samples of people exposed to a similar set of situations or given a similar set of choices. More typically, we see particular people in particular situations, which may or may not be of their choosing, but in which, regardless of how much initial choice they may have exercised, they are constrained by the expectations they are striving to meet. What clever and inventive researchers did in various classic studies was to demonstrate the effect of particular experimental manipulations when actors face novel situations, challenges, or choices, unconstrained by their normal roles, commitments, reputational concerns, and histories.

Led by Leon Festinger, researchers took particular pride in findings that seemed not only non-obvious but counter-intuitive. Students today express surprise, as did many researchers in the 1960s and 1970s, when they read demonstrations that extrinsic rewards can, in certain educational contexts, diminish rather than enhance intrinsic motivation (Lepper, Greene, & Nisbett, 1973), or that smaller incentives or weaker threats can prove more productive than larger ones (Festinger & Carlsmith, 1959; Aronson & Carlsmith, 1963). That surprise, I suggest, reflects a similar limitation in lay psychology, one the investigators cleverly exploited in the design and particular procedures they employed in their experiments.

Replicability issues

Demonstration experiments, both the classics of yesteryear and newer priming and embodiment studies, figure heavily in the replicability debates currently raging in our field. Some critics point to post hoc theorizing presented as if it reflected initial hypotheses, and to various dubious statistical practices1 employed to reach conventional standards for statistical significance. Others maintain that Milgram’s obedience experiments, Zimbardo’s Stanford Prison Experiment (Haney, Banks, & Zimbardo, 1973), and other less famous studies produced their dramatic results in large part because of the participants’ awareness of, and conformity to, the goals and expectations of the investigators. Many critics claim that the priming and embodiment findings of the last couple of decades are unreliable and exaggerate the impact and everyday significance of implicit, non-conscious influences on behavior.

In reflecting on the demonstration experiments featured in the history of our field, which depended on face-to-face contact between an experimenter and participant, it is worth recalling that most necessarily utilized small sample sizes. For investigators to obtain statistically significant p values, effect sizes thus had to be relatively large. However, the effects demonstrated often were large in a non-statistical sense as well. Many studies featured consequential behavior in contexts wherein most people imagine that actors’ personal dispositions, values, prior experiences, and present life circumstances would matter more than the manipulated situational feature.
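For readers who want the arithmetic behind that point, here is a short Python sketch (my illustration, not anything from the original studies) using the standard normal approximation for a two-sample comparison. With the small cell sizes typical of that era, only quite large standardized effects had a reasonable chance of reaching significance.

```python
# A back-of-the-envelope calculation (an illustration, not from the text)
# of the effect size a small study could reliably detect: the standardized
# difference d needed for 80% power in a two-sample test, using the
# normal approximation d = (z_{1-alpha/2} + z_{power}) * sqrt(2/n).
from math import sqrt
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_total = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # about 2.80

for n_per_cell in (10, 20, 40, 100):
    d = z_total * sqrt(2 / n_per_cell)
    print(f"n = {n_per_cell:>3} per condition -> detectable d of about {d:.2f}")
# n = 20 per condition requires d of roughly 0.89, a large effect by
# Cohen's conventions, whereas n = 100 suffices to detect d of about 0.40.
```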

Direct replications were rare. In part, this was because testing individual participants one at a time was labor-intensive and time-consuming, and researchers found it more expedient to move on to new studies. In part, replications were rare because journal editors were reluctant to give precious space to such undertakings, regardless of whether the replication attempts were successful or unsuccessful. Instead, investigators typically conducted a series of experiments that differed in their exact procedures and dependent variable measures but converged to demonstrate some general point.2

In hindsight, some limitations of the research and journal articles of that era were unavoidable. Procedures often were not, and could not be, fully standardized, as experimenters (and often experimental confederates) had to improvise in response to the various queries and actions of the relevant research participants. For the same reason, research reports typically omitted many details about procedures and research contexts. Such specific methodological choices, often made on opportunistic or intuitive rather than theory-based grounds, inevitably played a role in the outcome of the studies. Furthermore, while investigators faithfully reported results of primary dependent variable measures and statistical analyses, they often omitted results of secondary analyses, not to deceive but to meet print journals’ word limits.

Some of the practices of that era, as noted by Simmons, Nelson, and Simonsohn (2011) and others, may well have inflated the probability of Type 1 errors. Researchers recruited what they considered appropriate numbers of participants, but if the statistical tests they conducted at that point yielded only marginally significant p values, they often ran additional participants in each condition. Decisions about whether to report results that controlled for sources of variability unrelated to the hypotheses being tested frequently were made on pragmatic grounds. If the simplest test produced statistically significant results, investigators typically streamlined their report by omitting the more complex analyses. (Ethical investigators, of course, acknowledged the results of such analyses when they moved their p values to the wrong side of the .05 divide that journal editors treated as sacrosanct.) These practices no doubt contributed to the troubling p distribution critics point to, whereby results with p values just below the .05 level appeared in papers less frequently than values either just outside that level (e.g., p = .06 or .07) or comfortably below it (e.g., p = .03 or .02).
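A short simulation makes the inflationary effect of one such practice easy to see. The sketch below (a hedged illustration with arbitrary cell sizes, not a reconstruction of any particular study) simulates studies in which the null hypothesis is true and more participants are run whenever the first test is merely marginal, the practice described above and analyzed by Simmons, Nelson, and Simonsohn (2011).

```python
# A minimal simulation (an illustration with assumed cell sizes) of
# "optional stopping": test 20 participants per condition, and if the
# result is merely marginal (.05 < p < .10), run 10 more per condition
# and test again. Even with a true null hypothesis, the realized Type 1
# error rate creeps above the nominal .05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_study(n_initial=20, n_extra=10):
    """One two-condition 'study' in which the null hypothesis is true."""
    a = rng.normal(size=n_initial)
    b = rng.normal(size=n_initial)
    p = stats.ttest_ind(a, b).pvalue
    if 0.05 < p < 0.10:  # marginal result: recruit additional participants
        a = np.concatenate([a, rng.normal(size=n_extra)])
        b = np.concatenate([b, rng.normal(size=n_extra)])
        p = stats.ttest_ind(a, b).pvalue
    return p < 0.05

rate = np.mean([run_study() for _ in range(20_000)])
print(f"Realized Type 1 error rate: {rate:.3f}")  # typically ~.06, not .05
```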

As replicability debates go on today, there are some issues regarding best practices about which we can reasonably disagree, such as the value of “pre-registering” predictions and details of proposed data analyses, and the wisdom of making such pre-registration obligatory. The uses and abuses of mediational analyses to suggest causality are another potentially contentious issue. However, there should be no disagreement about the importance of transparency. The obligation to describe all procedures and measures fully, and to make one’s data available for further consideration by colleagues, should be clear to all. Access to the internet as a resource for storing and sharing information not described in detail in the main text of an article makes this task manageable.

Notwithstanding replicability issues and other limitations of the type of demonstration experiments that drew so many of us into the field, we should not lose sight of the lessons they offered about human behavior. Those experiments continue to offer valuable insights about the power of situations, about the importance of actors’ interpretation of those situations, and also about the sometimes subtle and generally unappreciated impact of those influences on our everyday actions. They make us more sophisticated in our understanding of human strengths and frailties, our species’ capacity for adaptation and generosity, but also its capacity for rationalization, biased assimilation of evidence, and inclination to take unwise risks to avoid or recoup losses. They highlight the benefits of cooperation, but also the ways in which in-group favoritism and racism, xenophobia, and sexism, both blatant and subtle, make their influence felt in our society.

I have noted that demonstration experiments illustrate what can, but not necessarily what must, or even what is most likely to, happen. Accordingly, I would contend that focusing on box scores of replication successes and failures (to say nothing of simulations with dubious assumptions about the a priori likelihood of false hypotheses)3 is not a fruitful exercise. Non-replications, in my view, do not oblige us to disregard earlier experimental findings. Rather, when the research is well conducted, they enlighten us about robustness and offer useful clues about expected effect size and potential moderators and boundary conditions. What they rarely do is indicate that the prior findings were chance outcomes that no longer merit our attention.

The decline of the lab experiment

A glance at contemporary journals in our field, or an examination of the research featured in the many popular books now being published that mine the insights of our field, makes it clear that the classic demonstration experiment has faded from prominence. The change in part reflects the costs and difficulties of doing multiple large-N studies of the sort journal editors and reviewers prefer in light of the replicability debate. Today, experiments that involve responses occurring in face-to-face interaction between participants, or between participants and role-playing confederates, rather than MTurk questionnaire responses, offer highly unattractive risk-to-benefit prospects.4 These considerations discourage researchers who still enjoy the challenge of designing such experiments. For younger researchers in particular, analyzing large data sets that can be accessed or created while sitting at their computers seems a better career bet than the painstaking labors required to craft and conduct a good laboratory or field study.

Beyond these pragmatic concerns, however, there has also been increasing recognition of the limitations of traditional lab studies. The most obvious limitation involves the over-reliance on samples of college undergraduates and other convenience samples that fail to reflect the demographic diversity of the populations about whom the investigators seek to generalize. An even greater limitation involves the range of behaviors and social contexts explored. Until recently, the great majority of studies focused on short-term interactions between strangers or the responses of individuals facing unfamiliar choices or challenges, rather than on people functioning in social contexts involving ongoing relationships, roles, and constraints.

In urging psychologists to address the social dynamics involved in societal problems, Kurt Lewin famously claimed that if we want to understand the status quo we should try to change it. Doing so, he claimed, gives us a better grasp of the forces and constraints currently at play. Natural experiments that allow us to contrast the effects of different policies in different locales or observe outcomes after changes in policies and practices, and interventions that feature appropriate control conditions, serve this Lewinian strategy. They not only focus our attention on actions and outcomes with significant personal and social consequences; they also allow us to study cumulative consequences unfolding over time and to note similarities and differences in the outcomes we see in different social contexts, cultures, subcultures, and economic circumstances.

Non-obvious demonstrations in different research domains

The Dissonance Experiments. A feature of experimental social psychology that distinguishes it from almost all other research in the behavioral sciences has been its longstanding emphasis on non-obvious, or better still counter-intuitive, findings. This emphasis was particularly strong in the heyday of dissonance research under the leadership of Leon Festinger. Laypeople, and even colleagues not fully versed in the dissonance tradition, were apt to predict that condition X would produce more or less of some response than condition Y, when the reverse proved to be the case. The “trick” that the Festingerians employed in such counter-intuitive demonstrations involved exploiting people’s overly simplistic views about reinforcement effects—their knee-jerk tendency to assume that bigger rewards would produce more liking and that worse experiences or scarier threats would produce less liking, without recognizing that the manipulations being employed were prompting evaluations that reduced dissonance. However, the basic phenomenon that the Festingerians were exploring is familiar enough. Laypeople and researchers alike recognize that their peers (if not they personally) strive to justify their actions and beliefs, to see themselves as coherent, rational, moral actors, and to rationalize or otherwise reduce their dissonance when they feel it necessary to do so.

Situationist Classics. The non-obviousness of the classic experiments in social psychology that I noted earlier involved erroneous lay notions about the relative magnitude of influences. Laypeople recognize that people are more comfortable conforming than dissenting, and are especially uncomfortable being the lone dissenter in a group. They recognize that people are inclined to obey the requests of an experimenter or other seemingly legitimate authority figures, especially when the situation is unfamiliar. They would also expect people, including Princeton divinity students, to offer assistance more readily when they have time to spare than when they are in a rush. They might even recognize that complying with an initially modest request makes compliance with a related, less modest request later more likely. What they fail to recognize is the power of the situational manipulations in those studies relative to the types of individual differences in personality traits that they normally assume dictate behavioral choices.

That lack of recognition in turn would prompt unwarranted attributions about the dispositions and values of the targets of the relevant experimental manipulations. Learning about a randomly recruited Milgram participant who agreed to deliver a powerful electric shock to a peer who had given a succession of wrong answers in a learning experiment would lead to unwarranted attributions of callousness or servility. Seeing an Asch participant conform to an obviously erroneous judgment would prompt attributions of weakness of character and lack of independence. Hearing about a divinity student in the Darley and Batson study who failed to assist someone in obvious need of help while rushing to give a sermon on the Good Samaritan parable would lead to accusations of hypocrisy and unsuitability for the ministry. Learning about the willingness of a neighbor in the Freedman and Fraser study to erect a huge, unsightly sign about auto safety on her front lawn would foster the assumption that the neighbor had some special, personal reason for caring deeply about that issue.

The message offered by these and many other classic studies is that small manipulations can have surprisingly big effects. The further implication, in the case of small interventions that facilitate desirable actions and outcomes, especially if the interventions can be successfully “scaled up,” is that they address needs and motivations that are more important than most people, including current policymakers, recognize. The broader, cumulative message of more than a half-century of such demonstrations, both in the laboratory and in schools, neighborhoods, workplaces, and homes (and now at computer terminals) outside the university, is clear. Human actions and outcomes, both positive and negative, are more a reflection of the influence of the specific circumstances the actors face, and less an expression of the character or other stable personal attributes of those actors, than lay intuition leads us to believe.

Priming and Embodiment Experiments. The priming and embodiment studies, which have been an especially frequent target of replicability critics, feature a particular kind of non-obviousness. The investigators who conducted those studies sought to illustrate the subtle impact of non-conscious or implicit influences on behavior. It is a truism in the social sciences, and an important continuing insight, that people respond not to the objective features of situations and choices but to the way in which they interpret or construe those features. Thus, if you want to understand, predict, and influence people’s behavior, including their behavior in research contexts, you must take into account how they will interpret the situation they are facing and the meaning they attach, and/or assume others will attach, to their responses.

What the priming studies showed is that experimenters can manipulate such meaning without the research participant recognizing that any such manipulation has taken place. Indeed, the success of the study may depend on the participants not making a connection between that relevant manipulation and the meaning it is prompting them to attach to the situation they are facing. As such, a successful demonstration may be more of a testament to the investigator’s skill and inventiveness than it is an increment to our understanding of human psychology. It is not remarkable that meaning can be manipulated (that is what both propaganda and moral education seek to accomplish) or that we sometimes are unaware of why we are thinking about a particular situation in a particular way. Nor is it remarkable that differences in the meaning people attach to a choice or challenge can mediate differences in their decisions. What is remarkable is the simultaneous demonstration of all three of these phenomena—something akin to making a tough three-cushion billiard shot. Small wonder that such demonstrations prove difficult to replicate!

Critics of such demonstrations rightly question the relevance of manipulations of embedded words, fishy smells, or physical postures adopted in sterile environments to everyday behavior. A single successful priming or embodiment demonstration, or for that matter an unsuccessful one, arguably tells us little about the role that non-conscious influences play in people’s everyday lives. In my opinion, the cumulative record of findings from the corpus of studies of implicit or non-conscious influences on things like honesty in claiming payment for one’s labors, adherence to sanitation rules, and generosity to those in need leaves little doubt that such effects are real. However, their potential for application will become clearer only when we have a better idea about their robustness and about the factors that mediate and moderate their impact.

Differing Replicability Issues in Different Domains

Large-scale replication projects, especially ones that involve multiple sites, with different investigators at each site, pose some thorny replicability issues. Some of those issues are unavoidable because of changes in prevailing norms and changes in participants’ knowledge about the world and views about its problems. Procedures and experimental materials well suited to test a particular hypothesis in 1968 would not have been well suited a half-century before and would not be well suited today, a half-century later. Demonstrations involving displays of racism, sexism, ingroup favoritism, and a host of other social issues are obvious cases in point. Studies that test the impact of persuasive messages on attitudes about immigration, environmental protection, or free speech would be subject to similar constraints on the investigator’s ability to conduct exact or “true” replications.

For a variety of practical reasons, including the wish to minimize differences in the way studies are conducted at different sites, researchers investigating replicability are inclined to strip down and standardize procedures, minimize contact between participants and researchers, and try to create context-free environments. Too often, the result is the stripping away of procedural features that the original investigators deemed important to the success of their demonstration. Consider, for example, Fritz Strack’s pen-in-mouth facial feedback study (Strack, Martin, & Stepper, 1988). In testing the replicability of the relevant effect on mood, it appears that some investigators dropped the pen manipulation and simply told the research participants to move their face into a smile position or a frown position. While this change in procedure simplified the investigators’ (and participants’) task and made standardization of the procedure easier to accomplish, the simplified procedure no longer kept participants from guessing the investigators’ intent and expectation in the study.

Research participants are not automatons. They process instructions and respond to experimental stimuli or events in light of their understanding of the relevant research context. Skilled experimenters must consider this fact in designing the specific procedural features of their study. Doing so complicates mass-replication studies (particularly when the same participants serve in multiple studies). Without due attention to such considerations, however, applying the label “true” or especially “exact” to any replication project is misleading.

“Bottling” phenomena and personal reflections on the replicability debate. Let me share a couple of examples that speak to the challenges of the demonstration experiment tradition and to issues of replicability in the context of my own research. Very early in our careers, Mark Lepper and I wanted to produce a convincing demonstration of the phenomenon of unwarranted belief perseverance (and, secondarily, to suggest the inadequacy of standard "debriefing" procedures in deception experiments). Our goal was to show that the perseverance phenomenon can be more dramatic than the majority of laypeople and most of our colleagues would expect, and that it can be demonstrated even when the discrediting of the information that initially prompted an erroneous assessment or belief is logically, if not psychologically, complete.

In a study with our graduate student Michael Hubbard, we gave undergraduate research participants the task of discriminating real suicide notes from fakes. Feedback was provided, one note at a time, leading some participants to believe they had failed miserably (only 10 of 25 correct) and others to believe that they had succeeded brilliantly (24 of 25 correct). Our participants then received an extensive debriefing, which included seeing the precise schedule of correct and incorrect feedback they would have heard regardless of which notes they had judged to be authentic. The results were striking, and for those concerned about the ethics of some deception studies that rely on standard debriefing, somewhat alarming. About half the difference between the success and failure conditions remained even after that thorough debriefing—not only for the participants but (in a second study) for peers who watched the events in question from behind a one-way mirror and received the same information as the participants at all points in the study.

We went on to make the a fortiori point that such perseverance should be particularly apparent in cases where beliefs about ability are based on subjective evaluations rather than objective measures of success versus failure and the discrediting of the information that gave rise to those beliefs is less decisive. We also did a follow-up study showing the incremental effectiveness of “process debriefing”—a procedure that includes not only an acknowledgment of the deception that had been employed but also an explanation of the ways in which initial erroneous perceptions can bias subsequent processing of information. When the suicide note study is cited, it is usually done parenthetically (with no reference to the task or procedure used) to buttress the banal point that first impressions are sticky, or that it is very difficult to undo the harm done by initial bad experiences and so we must spare people, including research participants, such experiences. The task we used is rarely mentioned, and its specific features are not elaborated.

Those features, I believe, contributed significantly to the success of our demonstration. The task was highly engaging and seemed to be about something important. The initial success or failure feedback given to the research participants was a specific numerical outcome rather than a subjective evaluation. Finally, and perhaps most importantly, while our Stanford participants had no prior experience at judging the authenticity of suicide notes or any task like it, and no real basis beforehand for anticipating their ability at the task, they could readily generate reasons why they might be either good or bad at it. (I generally do well on multiple-choice quizzes. I am taking psychology courses and think a lot about why people behave the way they do. I never knew anyone who committed suicide and I’m a pretty upbeat person.) In short, I think we did a good job of "bottling" the phenomenon of post-debriefing perseverance, and post-discrediting perseverance of beliefs more generally, in a way that was interesting and convincing both for the research participants and for readers of the article reporting our results.

I would wager a good portion of my retirement fund on the success of a pure replication—one that used the same task and procedure and research participants from Stanford psychology classes or from some other population in which both those told they were succeeding and those told they were failing could readily generate explanations for either success or failure. If you asked me about a replication attempt using a different task, a different sample of participants, and a different means of communicating initial success or failure, although such a study might provide further clues about the belief perseverance phenomenon, I would be reluctant to wager on the outcome.

How general and how powerful is the phenomenon of belief perseverance after decisive challenges to the basis for an erroneous belief? What factors moderate that phenomenon? What kinds of post-experiment debriefing procedures might be most effective? These remained open questions in the aftermath of our study. We urged colleagues to pursue them and we pursued some of them ourselves. Again, what we offered was a demonstration of what can happen and would be likely to happen again if the same population was sampled and the same task and procedure were followed, but not how often it would happen, much less the specific conditions when it would or would not happen, with other tasks and research procedures.

Most researchers of yesteryear could tell similar stories about the phenomena they investigated. Follow-up studies with new procedures and/or dependent variable measures give us some idea about the robustness of phenomena across circumstances that differ to a greater or lesser degree from the seminal study, and also about potential applications. Some paradigms will no doubt yield more consistent and robust demonstrations of a given phenomenon than others. Truly pure replications are impossible to conduct, but worth approximating when a phenomenon is of sufficient theoretical and/or practical interest. However, non-replications, when procedures and research contexts differ in significant ways from the original ones, provide little if any basis for challenging the validity of the phenomenon itself or the importance of the processes postulated to underlie it.

In the case of priming effects or other phenomena that rely on non-conscious influences—or perhaps more accurately on sources of influence that research participants have little if any memory of when they respond to the task or situation at hand—the relevant question currently being debated in our field is whether we have a formula for successful demonstrations. If we do, what are the necessary and/or sufficient intermediate processes that the investigator must instigate to make the demonstration “work”?5 Understanding and testing those factors will contribute not only to the building of more complete theories about priming but also to the application of those theories to influence consequential real-world behavior.

I have had one experience manipulating construals of a situation with an explicit labeling procedure and one experience manipulating those construals with various implicit priming procedures. In the former case (Liberman, Samuels, & Ross, 2004), we conducted an experiment for which I am again willing to offer good odds of successful replication—provided that the experiment is run with even minimal competence using a sample of Americans with at least some familiarity with our economic institutions. The manipulation in that study was straightforward and could be reproduced by any would-be replicator. It involved labeling a standard Prisoner’s Dilemma task as the Wall Street Game versus the Community Game. (The participants, I hasten to add, did not know the label usually attached to the “game,” and had no reason to suspect that we were attempting to influence their choices.) This simple labeling manipulation produced a dramatic difference in rates of cooperation throughout the game—roughly 70% versus roughly 35%.

To drive home a related point about the impact of our manipulation, we had recruited students nominated by their dorm counselors (and, in an Israeli follow-up, where the labels were Bursa Game and Kommuna Game, military officer trainees nominated by their instructors) as particularly likely to cooperate or particularly likely to defect. Our findings also provided a dramatic demonstration of the phenomenon I had much earlier (Ross, 1977) termed the “fundamental attribution error.” The nominators’ designations of likely cooperators versus likely defectors, made with very high and very low probability estimates, failed to predict the actual choices their nominees made in the game. The dorm counselors and military instructors failed to give due weight to the “name of the game,” which we had specified in our description of the task.

I trust that even colleagues who have reservations about the status of some widely cited priming studies would have little quarrel with the validity of these “name of the game” findings. Indeed, upon reflection, the demonstration seems rather obvious. Again, it is only the magnitude of the effect demonstrated that challenges everyday intuitions about human behavior. The labels we provided influenced what participants thought the game was “about” and hence the norms that were salient to them, both as they speculated about the choice of the other player and as they made their own choice to cooperate or defect. What few participants, or nominators, recognized was the degree to which a different name of the game would have produced different choices.

The feature of the study that made it notable was the iconic status of the PD game and our ability both to manipulate construals with a single label and to convey our findings with a single sentence. What made the study “work,” I believe, is that our participants, like most members of our community, hold both cooperative and competitive schemas about choices in interpersonal contexts, and either schema can be brought to the fore when they play the PD game. What we did was make one of those two schemas more salient or cognitively “available.” We also took advantage of the fact that most people are willing to cooperate with rather than exploit fellow cooperators; what they are not willing to do is cooperate when they expect that they will be exploited by defectors, or to continue cooperating in the face of defection.

In a series of related studies with Aaron Kay, who did all the work in designing and running the studies and convincing me that the approach might bear fruit, we demonstrated similar effects with far subtler priming manipulations. In one study (Kay & Ross, 2003), participants completed a sentence unscrambling task prior to playing the (unlabeled) PD game. In a subsequent study (Kay, Wheeler, Bargh, & Ross, 2004), we primed participants by exposing them to material objects designed to suggest competitive business contexts (for example, pictures of an executive briefcase, a fountain pen, or a boardroom table). This priming manipulation influenced the way participants completed word fragments, interpreted ambiguous vignettes, and made decisions in non-zero-sum games.

I cannot imagine the original Wall Street/Community Game study not working—provided that the participants are drawn from a population that associates “Wall Street” with competitive norms and “Community” with cooperative norms. By contrast, I am not at all sure about the robustness or replicability of our studies or others that attempt to prime particular schemas with prior sentence unscrambling tasks or the presence of material objects. The success of such studies, I believe, depends on hitting a cognitive “sweet spot” whereby the priming manipulation influences the participants’ construal of the situation and the associations it brings to mind, but does so without the participants’ awareness of the link between the manipulation and those construals (or the intent of the investigator).6

Nevertheless, the results of the two priming studies we conducted, combined with many other successful studies by other investigators, convince me that priming effects can and do take place. The issue that remains unclear, and in need of further explication, is which types of priming inductions are highly likely, versus only somewhat likely or even unlikely, to exert detectable effects on consequential behavior. I look forward to studies that examine the capacity of priming manipulations to promote greater honesty, healthier dietary choices, and more environmentally responsible resource consumption.

Natural experiments and wise interventions as demonstrations

In recent years, journalists and popular authors have mined social psychology as a source of material for books and articles in newspapers and magazines. Increasingly, eminent researchers write books describing their own work and the lessons offered by our field for popular mass audiences as well. What is notable about these publications is the extent to which their emphasis is not on classic or current laboratory research but on “natural experiments” that occur when different practices and policies are followed in different communities, towns, states, or countries.

The most cited report of such a natural experiment (Johnson & Goldstein, 2003) involved a comparison of organ donation rates in European countries that differ in how motorists make their organs available for medical use in the event of their accidental death. Some countries utilize an opt-in program whereby motorists must provide their signature on a line on their driver’s license, while other countries utilize an opt-out program whereby motorists fail to become potential donors only if they sign a line on their license indicating their unwillingness to become a donor. The dramatic finding was that the great majority of motorists in opt-in countries failed to affirm their willingness to become potential donors, while virtually all motorists in opt-out countries became potential donors because they failed to stipulate their unwillingness to do so.

To some extent, this difference in participation rates reflects simple inertia and choice architectures that constitute behavioral “nudges.” There is also some evidence (Davidai, Gilovich, & Ross, 2012) that default provisions convey norms about the behavior expected of ordinary citizens. However, in light of the evidence that fewer than 5% of motorists in opt-in Denmark make their organs available for harvesting compared to more than 80% of motorists in their opt-out neighbor Sweden, or that the figure for opt-out France is over 90% compared to 12% in opt-in Germany, there can be little doubt about the importance of default provisions, and little basis for attributing those differences to cultural factors. Even if there are many cases wherein failures to opt out do not result in actual donations at the time of death, it seems clear that a country’s default option can literally be a matter of life or death.

Comparisons of opt-in versus opt-out provisions for participation in payroll deduction plans (whereby no employer contributions take place for those who opt out) or for enrollment in various health-enhancing programs similarly demonstrate the impact of default provisions on important real-world decisions. They also show that choice architectures may matter more, and differences in traits, values, risk tolerance, or economic factors matter less, than either policymakers or ordinary citizens had assumed.

When I hear doubts expressed about the value of social psychological insights, I like to cite the work of Stanford colleagues who have pioneered theory-based interventions to help disadvantaged and stigmatized students. Carol Dweck, Geoff Cohen, and Greg Walton have tested the interventions they designed first in small-scale laboratory studies, then in modest field studies, and ultimately in “scaled up” programs serving large student populations. This progression of studies demonstrates first the potential applicability and then the feasibility and impact of theory-based interventions that address barriers to academic success. In some cases, interventions have targeted “fixed ability” mindsets that lead students to doubt their capacity to succeed through sustained effort. In other cases, the target has been the threat of confirming negative stereotypes, and the tax on cognitive resources imposed by that threat. In still others, the researchers addressed the problems that arise when students feel that they simply don’t belong in the programs or schools they have entered.

These intervention studies serve a dual purpose. They test the efficacy and the scalability of a particular intervention. When they bear fruit, they also suggest that the problem being addressed is in fact a significant barrier to accomplishment—one more important, relative to other more obvious barriers such as lack of academic preparation or ability, or lack of family and institutional resources and commitment, than conventional wisdom would dictate. Let me offer but one example of a successful attempt at “wise” intervention that I believe served that dual purpose well—an undertaking by Cohen, Garcia, Apfel, and Master (2006) testing the efficacy of a self-affirmation manipulation in overcoming stereotype threat. In three classrooms, middle school students wrote a series of essays over the course of the year affirming the personal values they regarded as most important to them (most often the importance of family, but sometimes personal interests, such as music).

This intervention improved the grades of black students (who faced stereotype threat) but not those of white students (who did not face it), thereby reducing the racial gap in performance by 40 percent. Even more impressive were the 30 percent reduction in the relevant gap in overall GPA over the two years of middle school, and the drop in the number of black students who had to repeat a grade or were assigned to a remedial program from 9 percent to 3 percent. That success was by no means unique. In a later self-affirmation study, Cohen and collaborators (Cohen et al., 2009) produced a decrease in the number of low-achieving African American seventh and eighth graders who had to repeat their grade or were assigned to remediation classes from 18 percent to 5 percent.

Not every theory-based intervention produces such dramatic results. However, scaled-up intervention failures, as well as successes, almost always provide useful clues about which types of interventions are most effective in particular settings, and what “tweaks” of the interventions may be necessary to maximize their success.7 Cumulatively, there have been enough noteworthy demonstrations to suggest that while such interventions cannot eliminate the effects of poverty, underfunding, or toxic home or neighborhood environments, educators have often given too little attention to changes in practice that can be useful even in the absence of fundamental social reforms. Such demonstrations can appeal to cost-conscious educators, legislators, and voters, and may be politically palatable to liberals and conservatives alike despite their disagreements about deeper sources of our society’s failure to serve the most disadvantaged of our students.

1 Some historical perspective is valuable in appreciating this particular issue. Lewin, and many of the investigators he influenced, frequently and unapologetically used internal analyses to show that the more closely the underlying assumptions in a test of a prediction were met, the stronger the support for that prediction became. They did not attach particular significance to whether the insight prompting them to consider particular manipulation checks or particular moderating variables came before the study, in the course of data analysis, or only after viewing initially disappointing results. This practice no doubt increased the likelihood of Type 1 errors. Today, we can debate whether such post hoc practices were reasonable or unreasonable and, accordingly, the confidence that we can place in the conclusions offered by the investigators. The great playwright George Bernard Shaw (in Saint Joan) famously observed that “mortal eyes” have difficulty distinguishing “a heretic from a saint.” By the same token, it can be hard to distinguish an opportunistic “p-hacker” from a “savant” who is paving the way for greater insight.

2 Failures to replicate can occur because an original result was spurious or, more typically, because the relevant effect size proves over time to be so small that it makes non-replications likely. However, many failures arguably occur because the replication failed to create the mental or physical state required to provide an adequate test of the original prediction. Changes in cultural norms, or knowledge schemas made salient by recent events, may also make earlier experimental procedures ill-suited to replicate a previously demonstrated phenomenon. Research designed to reveal particular kinds of outgroup bias and animus is an obvious case in which it may be necessary to use different procedures to reveal the phenomena of concern.

3 The most debatable assumption in many of the replicability analyses regarding the likelihood of Type 1 errors (i.e., false positives) that utilized computer simulations involved the likelihood that the null hypothesis is correct. As will be apparent in this essay, I believe that, when expressed in general or abstract terms, the great majority of hypotheses that social psychologists have tested over the years are correct and are consistently, if not inevitably, confirmed when appropriately operationalized. Some hypotheses, of course, may simply be wrong—even when expressed in general terms. (Most psychologists believe that to be the case for predictions regarding ESP, clairvoyance, precognition, or other psi phenomena.) More typically, however, the underlying hypothesis in a study is not simply “wrong.” What is being tested is whether a given phenomenon will be detectable in a particular context with a particular sample size.
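A simple calculation shows why that assumption does so much work. The sketch below (my own illustration; the power and prior values are assumptions, not estimates from any cited analysis) computes the share of “significant” findings that would be false positives under different priors.

```python
# A hedged illustration (with assumed inputs) of why the prior matters:
# the share of "significant" findings that are false positives, given a
# significance level, an assumed average power, and an assumed prior
# probability that the tested hypotheses are correct.
alpha, power = 0.05, 0.50  # assumed values, not estimates from the text

for p_true in (0.1, 0.5, 0.9):  # assumed P(hypothesis is correct)
    false_pos = (1 - p_true) * alpha   # true nulls declared significant
    true_pos = p_true * power          # true effects detected
    fdr = false_pos / (false_pos + true_pos)
    print(f"P(true) = {p_true:.1f} -> false discovery rate of about {fdr:.2f}")
# P(true) = 0.1 yields an FDR of about .47; P(true) = 0.9 yields about .01.
# The same testing machinery supports very different verdicts.
```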

4 The issue of “standardization” is most obvious in the case of the small laboratory dramas that were the hallmark of many old school experiments. The purpose of such dramas was to get all or most participants into some particular state and it often took some pretesting and tweaking of procedures to create and maintain that state. Complete standardization in the case of those experiments was impossible because the experimenter (and often an experimental confederate) had to respond without delay to the different questions individual participants might ask. This “latitude for improvisation” required on the part of experimenters and experimental confederates introduced potential problems of experimental demand that the investigators had to circumvent. Exclusive reliance on written or video-based instructions avoids problems of experimental demand, but it sacrifices the opportunity to fine-tune procedures in light of the research participants’ concerns and queries. I am reminded of the tradeoffs involved in having judges read standardized jury instructions but refuse to explain the things the jurors do not understand. This practice immunizes the judge against criticism, but it forces jury members to deliberate without a clear understanding of their task and the procedures they are to follow.

5 I would be remiss if I failed to acknowledge that at least some young investigators continue to do demonstration experiments that show the value of that tradition. Let me just note two recent examples, both with a Stanford connection. In a remarkable laboratory study, Howe, Goyer, and Crum (2017) showed that displays of warmth and competence by a physician could lessen allergic reactions. In another such study, Cheryan, Plaut, Davies, and Steele (2009) showed that women who enter a computer science environment featuring objects stereotypically associated with the field (e.g., Star Trek posters, video games) are less likely to consider pursuing computer science than women who enter an environment with non-stereotypical objects (e.g., art posters, water bottles).

6 Whether the effect of a priming manipulation is likely to be increased, or decreased, by the individuals’ awareness of how they are construing their task depends on the specifics of the judgment or decision to be made and the relevant context. Most ordinary attempts at persuasion involve an effort to shape the way the targeted individual sees some issue or decision, with the would-be influencer making no effort to conceal his or her intent. In a priming study, by contrast, the researcher does conceal that intent and avoids creating any motivation either to accept or reject an effort at influence.

7 In the case of intervention research, demonstrating what can happen is not sufficient. The box score of successes versus failures is undeniably important, as are lessons learned about robustness and boundary conditions. In this regard, it is important to recognize that effect sizes become larger when factors that contribute to variance in performance are eliminated or controlled for, and they become smaller when the samples tested include both participants whom the relevant theory identifies as likely to benefit from the intervention and participants identified as unlikely to benefit.
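A toy simulation (again my own; the effect size and responder proportion are assumed for illustration) shows how including participants a theory identifies as unlikely to benefit dilutes the observed effect size, both by lowering the mean treatment effect and by adding variance.

```python
# A toy simulation (an illustration, not the author's) of footnote 7:
# mixing participants the theory says will benefit ("responders") with
# participants it says will not dilutes the observed effect size.
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
delta = 0.8  # assumed true effect among responders (in SD units)

responders = rng.normal(delta, 1.0, n)      # benefit by delta
non_responders = rng.normal(0.0, 1.0, n)    # no benefit at all
control = rng.normal(0.0, 1.0, 2 * n)

def cohens_d(treated, ctrl):
    """Cohen's d with a pooled standard deviation."""
    pooled_sd = np.sqrt((treated.var() + ctrl.var()) / 2)
    return (treated.mean() - ctrl.mean()) / pooled_sd

print(f"d among responders only: {cohens_d(responders, control):.2f}")  # ~0.80
mixed = np.concatenate([responders, non_responders])
print(f"d in the mixed sample:   {cohens_d(mixed, control):.2f}")       # ~0.38
```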

References

Asch, S. E. (1951). Effects of group pressure upon the modification and distortion of judgments. In H. Guetzkow (Ed.), Groups, leadership and men (pp. 177–190). Pittsburgh, PA: Carnegie Press.

Aronson, E. (1969). A theory of cognitive dissonance. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 4, pp. 1–34). New York: Academic Press.

Aronson, E., & Carlsmith, J. M. (1963). Effect of the severity of threat on the devaluation of forbidden behavior. Journal of Abnormal and Social Psychology, 66, 584–588.

Brehm, J. W., & Cohen, A. R. (1962). Explorations in cognitive dissonance. New York: Wiley.

Cheryan, S., Plaut, V. C., Davies, P. G., & Steele, C. M. (2009). Ambient belonging: How stereotypical cues impact gender participation in computer science. Journal of Personality and Social Psychology, 97, 1045–1060.

Cohen, G. L., Garcia, J., Apfel, N., & Master, A. (2006). Reducing the racial achievement gap: A social-psychological intervention. Science, 313, 1307–1310.

Cohen, G. L., Garcia, J., Purdie-Vaughns, V., Apfel, N., & Brzustoski, P. (2009). Recursive processes in self-affirmation: Intervening to close the minority achievement gap. Science, 324, 400–403.

Cooper, J. (2007). Cognitive dissonance: 50 years of a classic theory. London: Sage Publications.

Darley, J. M., & Batson, C. D. (1973). “From Jerusalem to Jericho”: A study of situational and dispositional variables in helping behavior. Journal of Personality and Social Psychology, 27, 100–108.

Davidai, S., Gilovich, T., & Ross, L.D. (2012). The meaning of defaults for potential organ donors. Proceedings of the National Academy of Sciences, 109(38), 15201–15205.

Festinger, L. (1957). A theory of cognitive dissonance. Stanford, CA: Stanford University Press.

Festinger, L., & Carlsmith, J. M. (1959). Cognitive consequences of forced compliance. Journal of Abnormal and Social Psychology, 58, 203–210.

Freedman, J. L., & Fraser, S. C. (1966). Compliance without pressure: The foot-in-the-door technique. Journal of Personality and Social Psychology, 4, 195–202.

Haney, C., Banks, W. C., & Zimbardo, P. G. (1973). A study of prisoners and guards in a simulated prison. Naval Research Reviews, 30, 4–17.

Howe, L. C., Goyer, J. P., & Crum, A. J. (2017). Harnessing the placebo effect: Exploring the influence of physician characteristics on placebo response. Health Psychology.

Johnson, E. J., & Goldstein, D. (2003). Do defaults save lives? Science, 302, 1338–1339.

Kay, A. C., & Ross, L. (2003). The perceptual push: The interplay of implicit cues and explicit situational construal in the prisoner's dilemma. Journal of Experimental Social Psychology, 39, 634-643.

Kay, A. C., Wheeler, S. C., Bargh, J. A., & Ross, L. (2004). Material priming: The influence of mundane physical objects on situational construal and competitive behavioral choice. Organizational Behavior and Human Decision Processes, 95, 83–96.

Lepper, M. R., Greene, D., & Nisbett, R. E. (1973). Undermining children’s intrinsic interest with extrinsic reward: A test of the “overjustification” hypothesis. Journal of Personality and Social Psychology, 28, 129–137.

Lewin, K. (1958). Group decision and social change. In E. E. Maccoby, T. M. Newcomb, & E. L. Hartley (Eds.), Readings in social psychology (pp. 197–211). New York: Holt and Company.

Liberman, V., Samuels, S. M., & Ross, L. (2004). The name of the game: Predictive power of reputations versus situational labels in determining Prisoner’s Dilemma game moves. Personality and Social Psychology Bulletin, 30, 1175–1185.

Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378.

Milgram, S. (1974). Obedience to authority. New York: Harper & Row.

Ross, L. (1977). The intuitive psychologist and his shortcomings: Distortions in the attribution process. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 10, pp. 173–240). New York: Academic Press.

Ross, L., Lepper, M. R., & Hubbard, M. (1975). Perseverance in self-perception and social perception: Biased attributional processes in the debriefing paradigm. Journal of Personality and Social Psychology, 32, 880–892.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54, 768–777.

Thaler, R. H., & Sunstein, C. R. (2008). Nudge: Improving decisions about health, wealth, and happiness. New Haven, CT: Yale University Press.

Yeager, D. S., & Walton, G. M. (2011). Social-psychological interventions in education: They’re not magic. Review of Educational Research, 81, 267–301.
