AFP Journal Club Toolkit

The following are “evidence-based medicine pointers” for analyzing research studies, culled from AFP’s Journal Club series, which ran from November 1, 2007, through May 15, 2015.

There are two major sections: Types of Studies and Key Concepts When Looking at Research Studies, organized as shown below. Within each section, key words are listed in bold.

Note: See also the EBM Glossaries and MDCalc's glossary of EBM terms for additional explanation of terms, studies, and statistical concepts.





  • Clinical decision rules need to go through a rigorous process of derivation, internal validation, and external validation before widespread adoption.
  • A hierarchy exists for clinical decision rules. No clinical decision rule should be widely used until it has been clearly shown to be beneficial in external validation studies.


  • Primary care guidelines tend to be more methodologically sound than specialty society guidelines.
  • The level of evidence of clinical guidelines should be reviewed before widespread implementation.
  • Clinical trials should be designed to have only one predesignated primary outcome. Studies with multiple outcomes run the risk that a statistically significant outcome occurred by chance alone.
  • Secondary outcomes should be used only to generate thoughts/ideas/hypotheses for future studies. Otherwise, they might give a false sense of an intervention’s efficacy.
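
The risk that comes with multiple outcomes can be illustrated with a short calculation. This is a minimal sketch that assumes the outcomes are independent and each is tested at a significance level of 0.05; both assumptions and the function name are illustrative and do not come from the Journal Club series.

```python
def chance_of_false_positive(num_outcomes, alpha=0.05):
    """Probability that at least one of several independent outcomes
    reaches significance by chance alone when no true effect exists."""
    if num_outcomes < 1:
        raise ValueError("num_outcomes must be at least 1")
    return 1 - (1 - alpha) ** num_outcomes


# Assumed scenario: every outcome is tested at alpha = 0.05
for k in (1, 5, 10, 20):
    chance = chance_of_false_positive(k)
    print(f"{k:2d} outcomes: {chance:.0%} chance of a spurious finding")
```

Under these assumptions, a study with 10 outcomes has roughly a 40% chance of producing at least one "significant" result by chance, which is why a single predesignated primary outcome matters.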


  • A good meta-analysis requires the following (among other things): an analysis of the quality of studies included in the meta-analysis (one set of criteria is published by the Cochrane Collaboration); and data extraction by more than one person in each study (do the reviewers agree on something as basic as the data to be analyzed?).
  • A multiple-treatments meta-analysis allows you to compare treatments directly (e.g., head-to-head trials) and indirectly (e.g., against a first-line treatment). This increases the number of comparisons available and may allow the development of decision tools for effective treatment prioritization.
  • Similar to other meta-analyses, a multiple-treatments meta-analysis can be limited by low numbers of patients and poorly designed or heterogeneous studies (the “garbage-in, garbage-out” phenomenon).
  • Meta-analyses can come to the wrong conclusion for several reasons. First is publication bias—typically only positive trials are published and included in the meta-analysis. The second is the garbage-in, garbage-out phenomenon; to avoid this, the authors of a meta-analysis must evaluate the quality of the trials they are using so that only higher-quality studies are included.


  • Randomized controlled trials should meet the following criteria: they should provide an appropriate description of randomization; they should be double blind; and they should provide a description of withdrawals and dropouts.
  • A study using block randomization assigns patients in small groups. This type of randomization is done to decrease the likelihood of too many patients being randomized to a single treatment. In one study, randomization was in blocks of four, so one person in each block was randomized to each of the four treatment protocols.
  • Because of ethical constraints, randomized controlled trials will not be available to answer all clinical questions, particularly those that explore long-term risks, such as cancer. For these types of clinical questions, we must rely on mathematical modeling or abstraction of data from other sources.
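
Block randomization as described above can be sketched in a few lines. This is an illustration only: the arm labels, the seed, and the function name are hypothetical, and the block size is assumed to equal the number of treatment arms.

```python
import random


def block_randomize(num_participants, treatments, seed=None):
    """Assign participants to treatments in shuffled blocks.

    Each block contains every treatment exactly once, so group sizes
    never differ by more than one at any point during enrollment.
    """
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < num_participants:
        block = list(treatments)
        rng.shuffle(block)  # random order within the block
        assignments.extend(block)
    return assignments[:num_participants]


arms = ["A", "B", "C", "D"]  # four hypothetical treatment protocols
schedule = block_randomize(12, arms, seed=1)
for start in range(0, len(schedule), len(arms)):
    print(schedule[start:start + len(arms)])
```

Each printed block contains all four arms in a random order, so no single treatment can accumulate a disproportionate share of patients.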


  • Retrospective studies are generally suspect because they must rely on complete and accurate medical record documentation, which usually is lacking.
  • Retrospective studies need to follow common, agreed-upon methods of data abstraction.
  • Propensity matching is used to remove confounders in retrospective studies. The idea is to balance the groups being compared in their likelihood of needing a therapy.



  • Confounding variables bias. These are characteristics that are distributed differently among study groups and that can affect the outcome being assessed.
  • Exclusion bias. Excluding patients in whom the study drug has already failed biases the study in favor of that drug. If these patients had been included (e.g., a random sample of patients), the response rate to the study drug would have been less than observed.
  • History bias. Controlling for temporal trends in disease incidence is important when doing a comparison between contemporary and historical groups.
  • Industry bias. This refers to the fact that studies and reviews published by industry are more likely to present positive (favorable) outcomes. Look for non–industry-sponsored studies—they are less likely to have publication bias or use inappropriate comparisons.
  • Interrupted time series design bias. Studies with an interrupted time series design, also known as “before and after” studies, are subject to numerous potential confounders because of uncontrolled changes that occur in the study environment over time.
  • Lack of placebo bias. Failure to use a true placebo may jeopardize the validity of the results of a trial. In one study, parents who got an empty syringe knew their child was not getting treatment and may have been biased toward a more negative interpretation when they answered follow-up questions regarding their child’s cough and sleep quality. Almost anything will look better than placebo. A study should compare the study drug with a real-world scenario, such as (in the case of another study) increasing the dose, switching antidepressants, or using another drug for augmentation.
  • Lead-time bias is an important consideration when evaluating a screening intervention. Diagnosing disease earlier with a screening test can appear to prolong survival without actually changing outcomes. The only thing that changes is the period of time during which the patient is diagnosed with the disease, not the actual survival time.
  • Observation bias (also known as the Hawthorne effect) occurs when individuals temporarily modify their behavior (and consequently change study outcomes) when they know they are being observed. Gains achieved during the study period often regress when the study ends.
  • Open-label study bias occurs when patients’ or researchers’ knowledge of the condition or treatment can influence their judgment. An example is when a researcher is adjudicating an outcome, and knows which treatment group a patient was assigned to. A stronger study is blinded for patients, researchers, and, if used, evaluators.
  • Publication bias. Studies that show treatments in a positive light are more likely to be submitted for publication. For example, the published studies on levalbuterol look good. But if you look at all of the studies submitted to the U.S. Food and Drug Administration, levalbuterol and albuterol are equivalent, and albuterol costs less. Negative studies are less likely to get published than positive studies, and this results in the overwhelmingly positive nature of the literature. Even when negative studies are published, they are less likely to receive the attention by the media and medical establishment that positive studies do. In addition, more and more information is being hidden in online supplemental protocol information or in appendices.
  • Review bias occurs when the reader of a test (e.g., radiologist, electrocardiographer) knows the patient’s history. The history may change the way a test is read.
  • Run-in bias. Run-in periods to assess compliance and ensure treatment responsiveness create a bias in favor of the treatment in question, and yield results in a patient population that will not be the same as in your patients. Thus, the results of these studies may not be generalizable.
  • Sampling bias. Representative inclusion of all possible study participants helps to eliminate differences between groups so that more appropriate comparisons can be made. If only some subsets of patients are included (i.e., “sampled”), they are less likely to represent the general population.
  • Selection bias occurs when the patients in a study are not representative of the patients you see in practice.
  • Spectrum bias occurs when the group being studied is either sicker or not as sick as the patients you see in your practice. A diagnostic test can perform differently in dissimilar patient populations. You cannot apply a test standardized in an inpatient population to your outpatient population (or vice versa) and expect it to have the same sensitivity and specificity.
  • Straw man comparison bias. In head-to-head treatment trials, watch out for the straw man comparison. For example, make sure that an article that evaluates treatments uses equipotent dosages of the drugs being compared. Obviously, if you use an adequate dose of a study drug and a suboptimal comparison, the study drug is going to win. In one study, the doses of prasugrel and clopidogrel were not equivalent. As a corollary, a placebo-controlled trial should not typically change your practice; any drug comparison should be against a known effective therapy, if one exists.
  • Verification bias or workup bias exists when not everyone in a study gets the definitive, criterion standard test. This generally makes the new test look better because real cases of disease are missed when patients with a negative new test are sent home.


  • Causation vs. association. A risk factor and outcome are associated if they occur together. Case-control studies can suggest an association, but not causation. Causation is more difficult to establish and generally requires a prospective randomized study.
  • Case-control studies may not be able to control for all patient variables. This is a potential source of error. A good example of this would be case-control studies that suggested that postmenopausal estrogen was cardioprotective. Subsequent randomized controlled trials proved that this is not the case.
  • An association does not confer causation. But when multiple criteria are met (e.g., strength of association, consistency, specificity, temporality, dose-response relationship, biologic plausibility, coherence, experimental evidence, analogy), the likelihood of a causal relationship increases.
  • Hill’s criteria for causation (listed above) are a broadly accepted set of nine criteria to establish causality between an exposure or incidence and an effect or consequence. In general, the more criteria that are met, the more likely the relationship is causal.
  • Reverse causation. A reverse causality error occurs when the outcome, or some component of it, causes the intervention or exposure in question. For example, in nonrandomized studies, participants may select their “intervention” behavior based on early symptoms or prior knowledge, which then may affect the outcome in question.


  • Watch for “spins” promoted by pharmaceutical salespersons. Testimonials, isolated experiences, small company-sponsored studies with surrogate markers or “fuzzy” end points, and selected reprints are often used to try to convince physicians to use a certain drug or intervention.
  • Be skeptical of industry-sponsored studies. These studies are often “spun” to favor the sponsor’s drug.


  • The number needed to treat (NNT), number needed to harm (NNH), and the magnitude of the benefit are critical information if you are going to make an educated decision about treatment options. NNT and NNH are powerful tools in documenting an intervention’s effect.
  • Whenever you see an NNT, look for the corresponding NNH.
  • Calculate NNT and NNH. They tell you the real magnitude of benefit and harm. P values only tell you that there is a difference between two groups; this difference can be clinically meaningless. NNT gives a better sense of the strength of treatment effect.
  • How to calculate NNT or NNH: 1/(absolute difference in event rates between treated and untreated patients). For example, if 5% of treated patients have a heart attack and 10% of untreated patients have a heart attack, the NNT is 1/(0.10 − 0.05) = 1/0.05 = 20. Working in percentages instead of proportions, the calculation simplifies to 100/(absolute difference) = 100/(10 − 5) = 100/5 = 20.
  • Let your patients know the magnitude of benefit and risk in language that is easy for them to understand. For instance, in one study, 33 patients need to be treated for one year to prevent one hospitalization, at a risk of one in 41 patients developing pneumonia.
  • Large numbers of patients are typically required to demonstrate a deleterious side effect of a drug or intervention (i.e., the NNH).
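
The NNT arithmetic in the heart attack example above can be written as a small helper. This is a sketch; the function name is hypothetical, and rates are assumed to be entered as proportions (0.10 rather than 10%).

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """Return 1 / (absolute difference in event rates).

    The same arithmetic gives the NNH when the event is a harm that is
    more common in the treated group.
    """
    absolute_difference = abs(control_event_rate - treated_event_rate)
    if absolute_difference == 0:
        raise ValueError("No difference between groups; NNT is undefined")
    return 1 / absolute_difference


# Worked example from the text: 10% of untreated vs. 5% of treated patients
nnt = number_needed_to_treat(0.10, 0.05)
print(f"NNT = {nnt:.0f}")
```

With the rates from the text, this reproduces the NNT of 20.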


  • Noninferiority trials are designed to show that an alternative treatment is not substantially worse than the standard intervention. They do not meet the same rigorous design and statistical format of traditional superiority trials.
  • Showing that one drug is noninferior to another does not mean that these drugs are equivalent.
  • Authors of noninferiority trials must declare a margin of how far outside the acceptable outcome the treatment can perform and still be considered noninferior to the standard treatment.
  • The noninferiority margin allows researchers to choose their own benchmark for what is considered a clinically significant difference between two drugs. This can lead to a drug being called noninferior when other researchers not associated with the study would call it inferior.
  • The efficacy of the standard treatment (for instance, warfarin) shown in the trials that established its efficacy must be preserved in any noninferiority trials. In a study comparing warfarin and rivaroxaban, time in therapeutic range was not within established norms for many of the patients—this would make warfarin perform worse and allow rivaroxaban to appear noninferior.


  • An odds ratio tells us the odds of an outcome in one group compared with another group, but does not give us the magnitude of this changed outcome. It is usually used in case-control studies and not in randomized trials, where relative risk and absolute risk are used instead.
  • Case-control studies are not interventional studies and are retrospective, so we use odds ratio rather than relative risk as a measure of the association. Odds ratio is calculated by dividing the odds of disease in those who were exposed to a given factor by the odds of disease in those who were not exposed.
  • Relative risk is the ratio of the probability of an event in an exposed population to the probability in an unexposed population. This calculation is useful in comparisons in which there is a low probability of the event occurring.
  • Attributable risk is the difference in the rate of an event between an exposed population and an unexposed population. This is usually calculated in cohort studies.
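
The three measures defined above can be compared side by side on one 2×2 table. The counts below are invented purely for illustration, and the function name is hypothetical; the sketch assumes no cell of the table is zero.

```python
def two_by_two_measures(exposed_events, exposed_total,
                        unexposed_events, unexposed_total):
    """Relative risk, odds ratio, and attributable risk from 2x2 counts."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    # Odds = events / non-events within each group
    odds_exposed = exposed_events / (exposed_total - exposed_events)
    odds_unexposed = unexposed_events / (unexposed_total - unexposed_events)
    return {
        "relative_risk": risk_exposed / risk_unexposed,
        "odds_ratio": odds_exposed / odds_unexposed,
        "attributable_risk": risk_exposed - risk_unexposed,
    }


# Hypothetical data: 20 of 100 exposed and 10 of 100 unexposed have the event
measures = two_by_two_measures(20, 100, 10, 100)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

In this made-up example the relative risk is 2.00, the odds ratio is 2.25, and the attributable risk is 0.10. The odds ratio drifts away from the relative risk as the event becomes more common.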


  • Remember the difference between disease-oriented evidence (DOE) and patient-oriented evidence that matters (POEMs).
  • POEMs refers to clinical outcomes that mean something to patients (e.g., death, fracture, myocardial infarction). DOE is an indirect measure of a pathologic or physiologic process that may or may not correlate with clinical outcomes (such as changes in blood glucose, electrocardiogram abnormalities, carotid intima thickening). Family physicians should concentrate on POEMs because they have a direct influence on patients’ health.
  • A lot of studies use surrogate markers as outcomes (e.g., fasting or postprandial blood glucose, FEV1). These are DOEs. What we care about are POEMs (e.g., stroke rates, myocardial infarction rates, quality of life). Be wary of DOEs; they are surrogate markers of disease and may or may not correlate with important clinical end points, such as morbidity and mortality. For example, in certain clinical situations, it is possible to lower blood pressure, but not help patients. It could even harm them.
  • A study should change your practice only if it is applicable to your patient population (e.g., study patients presented to a family physician or the emergency department, not a subspecialist).


  • Recognize that end points can be statistically significant without being clinically significant (e.g., A1C difference of 0.08%). Another example: in one study, 23 mg of donepezil was statistically better than 10 mg, but only in one of three tests, and by only two points on a 100-point scale. This is clinically imperceptible, yet it will be touted as superior by pharmaceutical companies.
  • When reviewing a study, you must know what the scale measures, whether the scale has been validated, and what change in the scale is actually clinically significant.


  • Absolute risk reduction quantifies the actual difference between two outcomes; this is what the physician should be interested in. Relative risk reduction demonstrates the change in outcome relative to a baseline or control; it will often exaggerate a benefit.
  • Abstracts of articles should not be relied on when deciding whether a therapy is good, because they often have misleading information and conclusions. The data and conclusion in the abstract of an article may not be the same as the data in the paper. If you read only the abstract, you may be misled.
  • Blinding is ensuring that participants, clinicians, and/or investigators do not know which participants are assigned to each study group. Using seemingly identical products (in terms of taste, appearance, odor, and even texture) with the same dosing regimen is a common practice to help achieve blinding.
  • Clinical scoring systems should be externally validated (“road tested”) before widespread implementation.
  • Dates of the study. Look at the dates that data were generated and not the date of publication. If the data are old, updates in technology and treatments may affect the outcomes.
  • “Double-dummy” design is used when the two drugs being tested look different from each other, so that group assignment cannot be blinded. In a double-dummy design, there are matching placebos for both administered drugs (two “dummy” drugs) and every patient gets an active drug and a placebo.
  • Durability of an intervention (i.e., the ability to provide sustained results) is an important concept when considering implementing the intervention based on a study’s results.
  • Efficacy is how a test or drug performs in a study setting. Effectiveness is how it performs in the general population. Usually, results are better for efficacy than for effectiveness.
  • External validity refers to the ability to generalize the results of a study to other settings. The demonstrated efficacy of a drug or intervention in a clinical trial may not translate to effectiveness in the community or in your particular practice. Check to see if the population in the study is similar to the population you see in your practice. For example, the average age of patients in a certain study was 77 years. These findings should only be generalized to younger patients with caution.
  • Internal validity means you have evidence that what you did in the study (i.e., the treatment) caused what you observed (i.e., the outcome). Internal validity can be threatened by confounding variables.
  • Kappa is a measure of interobserver reliability (e.g., the probability that two radiologists reading the same film will get the same answer beyond chance alone). It is generally scored as: 0 = no agreement; 0 to 0.2 = slight agreement; 0.2 to 0.4 = fair agreement; 0.4 to 0.6 = moderate agreement; 0.6 to 0.8 = substantial agreement; and 0.8 to 1.0 = almost perfect agreement. When a study depends on the reading of films, the more methodologically sound strategy is to have two readers read each film, then have a third party adjudicate if the first two readers disagree. This is generally accepted methodology.
  • Logistic regression attempts to control for confounders between the experimental groups or participants. However, it is at best inexact and cannot control for every potential confounder.
  • Multiple comparisons. As the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in at least one attribute, if only by chance alone.
  • Observational studies are not the best design for testing some hypotheses. If we really wanted to know how good electrocardiography is, for example, we would design a study in which initial electrocardiography was performed in the emergency department, followed by cardiac catheterization in all of the patients. This is the only way to get the true sensitivity and specificity of electrocardiography. Observational studies let things take their course without a prescribed, randomized intervention. This adds a lot of uncertainty to the data (in this hypothetical case) by not controlling the subsequent workup.
  • Post hoc cutoff values are often selected to maximize the sensitivity and specificity of a test. The test may not perform as well in another group of patients. Receiver operating characteristic curves are used to figure out the cutoff values to achieve optimal sensitivity vs. specificity.
  • Post-marketing monitoring is often required to find a difference in adverse outcomes that are associated with the drugs.
  • Power of a study is the probability of finding a real difference when one exists, and is based on the minimum number of participants needed to show a difference, the size of the treatment effect desired, and a predetermined level of significance. When you have fewer events in a study than you expect, you need to increase the number of participants, not decrease it.
  • Prevalence of a disease in a population changes the interpretation of a test (i.e., the positive and negative predictive values).
  • Retraction Watch is a blog about retractions in the scientific literature. There is currently no database that catalogs retractions.
  • Type II (falsely negative) errors often occur when a study is too small to find a real difference between two treatments. A type I error occurs when the study shows a difference when in fact there is none (falsely positive).
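
The kappa statistic described above can be computed directly from two readers' calls. This sketch assumes Cohen's kappa for two readers (the text does not name a specific variant); the readings are invented and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Agreement between two readers beyond what chance alone would give."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("Both readers must rate the same nonempty set of films")
    n = len(reader_a)
    # Proportion of films on which the two readers agree
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's own call frequencies
    counts_a = Counter(reader_a)
    counts_b = Counter(reader_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement is complete")
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 10 films by two radiologists
reader_a = ["pos"] * 4 + ["neg"] * 6
reader_b = ["pos"] * 3 + ["neg"] * 6 + ["pos"]
print(f"kappa = {cohens_kappa(reader_a, reader_b):.2f}")
```

Here the readers agree on 8 of 10 films, but because 52% agreement would be expected by chance, kappa is about 0.58, which falls in the moderate agreement range.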


  • All tests follow a Bayesian model. They are more likely to be true-positive in a sick patient and false-positive in a well patient.
  • Sensitivity and specificity only tell part of the story. Always look at the false-positive and false-negative rates of a test.
  • False-positive studies. How many patients who really don’t have the disease are you treating simply because the test was positive? One study showed that 42% of low-risk patients had a positive result on chest computed tomography, but no evidence of pulmonary embolism on any confirmatory test. If the chest computed tomography result had been accepted as valid, a significant number of these patients would have been needlessly exposed to anticoagulation.
  • False-negative studies. How many patients with true disease are you missing and not treating because the test was negative?
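
The effect of how sick the tested population is on false-positive and false-negative results can be shown with Bayes' theorem. This is a minimal sketch; the sensitivity, specificity, and prevalence values are hypothetical, as is the function name.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical test with 90% sensitivity and 90% specificity
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}, NPV {npv:.0%}")
```

With the same test, a positive result is correct about 8% of the time at 1% prevalence, 50% of the time at 10% prevalence, and 90% of the time at 50% prevalence.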


  • Intention-to-treat analysis. This type of analysis evaluates all patients in the group to which they were assigned, whether or not they completed the study. Thus, persons who didn’t tolerate a treatment, for example, are still analyzed in their original group. This reflects our real-world experience. Some of our patients are going to do badly or stop a medication. They still need to be included in our equation when we decide whether to use a drug. This gives a more realistic sense of the treatment’s effectiveness, and reduces bias that would otherwise make the treatment look better than it actually is. Make sure that a study protocol makes sense. Don’t assume that a study uses an appropriate dose or schedule of a drug.
  • Per-protocol analysis allows the researcher to throw out any data from patients who don’t tolerate a drug (for example). So a per-protocol analysis may not reflect the kind of results we will see in our practice and typically overestimates the net benefit of an intervention. Avoid basing your clinical decisions on articles that use per-protocol analysis.
  • Post hoc and subgroup analyses should only be used to generate a hypothesis (the derivation set). This hypothesis then needs to be tested in a separate randomized study (the validation set), and should not be used to show the harm or benefit of a therapy.
  • Power analysis ensures that there are enough participants enrolled in a study to find a difference, if there is one. This guards against a type II error, in which a study with too few participants fails to detect a real difference and produces a false-negative result.
  • Sensitivity analysis excludes outliers, such as large, heavily weighted studies and studies of marginal quality, to check whether the results are the same.
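
The difference between intention-to-treat and per-protocol analysis can be seen on a toy data set. Everything below is invented for illustration (the trial, the counts, and the names); it is a sketch of the idea, not the method of any particular study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    assigned_group: str
    completed_protocol: bool
    good_outcome: bool


def response_rate(participants, group, per_protocol=False):
    """Share of a group with a good outcome.

    Intention-to-treat (default) counts everyone assigned to the group;
    per-protocol counts only those who completed the protocol.
    """
    selected = [
        p for p in participants
        if p.assigned_group == group and (p.completed_protocol or not per_protocol)
    ]
    if not selected:
        raise ValueError(f"No participants to analyze in group {group!r}")
    return sum(p.good_outcome for p in selected) / len(selected)


# Hypothetical drug arm: 10 assigned, 4 stopped the drug and did poorly
trial = (
    [Participant("drug", True, True)] * 5
    + [Participant("drug", True, False)] * 1
    + [Participant("drug", False, False)] * 4
)

print(f"Intention-to-treat: {response_rate(trial, 'drug'):.0%}")
print(f"Per-protocol:       {response_rate(trial, 'drug', per_protocol=True):.0%}")
```

In this made-up arm the drug helps 50% of patients by intention-to-treat but appears to help 83% per protocol, because the four patients who could not tolerate it are dropped from the denominator.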