Strength of Recommendation Taxonomy (SORT): A Patient-Centered Approach to Grading Evidence in the Medical Literature

MARK H. EBELL, M.D., M.S., JAY SIWEK, M.D., BARRY D. WEISS, M.D., STEVEN H. WOOLF, M.D., M.P.H., JEFFREY SUSMAN, M.D., BERNARD EWIGMAN, M.D., M.P.H., AND MARJORIE BOWMAN, M.D., M.P.A.

Am Fam Physician. 2004;69(3):548-556

See editorial on page 483.

A large number of taxonomies are used to rate the quality of an individual study and the strength of a recommendation based on a body of evidence. We have developed a new grading scale that will be used by several family medicine and primary care journals (required or optional), with the goal of allowing readers to learn one taxonomy that will apply to many sources of evidence. Our scale is called the Strength of Recommendation Taxonomy. It addresses the quality, quantity, and consistency of evidence and allows authors to rate individual studies or bodies of evidence. The taxonomy is built around the information mastery framework, which emphasizes the use of patient-oriented outcomes that measure changes in morbidity or mortality. An A-level recommendation is based on consistent and good-quality patient-oriented evidence; a B-level recommendation is based on inconsistent or limited-quality patient-oriented evidence; and a C-level recommendation is based on consensus, usual practice, opinion, disease-oriented evidence, or case series for studies of diagnosis, treatment, prevention, or screening. Levels of evidence from 1 to 3 for individual studies also are defined. We hope that consistent use of this taxonomy will improve the ability of authors and readers to communicate about the translation of research into practice.

Review articles (or overviews) are highly valued by physicians as a way to keep up-to-date with the medical literature. Sometimes, though, these articles are based more on the authors' personal experience, anecdotes, or incomplete surveys of the literature than on a comprehensive collection of the best available evidence. As a result, there is an ongoing effort in the medical publishing field to improve the quality of review articles through the use of more explicit grading of the strength of evidence on which recommendations are based.¹⁻⁴

Several journals, including American Family Physician and The Journal of Family Practice, have adopted evidence-grading scales that are used in some of the articles published in those journals. Other organizations and publications also have developed evidence-grading scales. The diversity of these scales can be confusing for readers. More than 100 grading scales are in use by various medical publications.⁵ A level B recommendation in one journal may not mean the same thing as a level B recommendation in another. Even within journals, different evidence-grading scales sometimes are used in separate articles within the same issue. Journal readers do not have the time, energy, or interest to interpret multiple grading scales, and more complex scales are difficult to integrate into daily practice.

Therefore, the editors of the U.S. family medicine and primary care journals (i.e., American Family Physician, Family Medicine, The Journal of Family Practice, Journal of the American Board of Family Practice, and BMJ-USA) and the Family Practice Inquiries Network (FPIN) came together to develop a unified taxonomy for the strength of recommendations based on a body of evidence. The new taxonomy should: (1) be uniform in most family medicine journals and electronic databases; (2) allow authors to evaluate the strength of recommendation of a body of evidence; (3) allow authors to rate the level of evidence for an individual study; (4) be comprehensive and allow authors to evaluate studies of screening, diagnosis, therapy, prevention, and prognosis; (5) be easy to use and not too time-consuming for authors, reviewers, and editors who may be content experts but not experts in critical appraisal or clinical epidemiology; and (6) be straightforward enough that primary care physicians can readily integrate the recommendations into daily practice.

Definitions

A number of relevant terms must be defined for clarification.

Disease-Oriented Outcomes

These outcomes include intermediate, histopathologic, physiologic, or surrogate results (e.g., blood sugar, blood pressure, flow rate, coronary plaque thickness) that may or may not reflect improvement in patient outcomes.

Patient-Oriented Outcomes

These are outcomes that matter to patients and help them live longer or better lives, including reduced morbidity, reduced mortality, symptom improvement, improved quality of life, or lower cost.

Level of Evidence

The validity of an individual study is based on an assessment of its study design. According to some methodologies,⁶ levels of evidence can refer not only to individual studies but also to the quality of evidence from multiple studies about a specific question or the quality of evidence supporting a clinical intervention. For purposes of maintaining simplicity and consistency in this proposal, we use the term “level of evidence” to refer to individual studies.

Strength of Recommendation

The strength (or grade) of a recommendation for clinical practice is based on a body of evidence (typically more than one study). This approach takes into account the level of evidence of individual studies; the type of outcomes measured by these studies (patient-oriented or disease-oriented); the number, consistency, and coherence of the evidence as a whole; and the relationship between benefits, harms, and costs.

Practice Guideline (Evidence-Based)

These guidelines are recommendations for practice that involve a comprehensive search of the literature, an evaluation of the quality of individual studies, and recommendations that are graded to reflect the quality of the supporting evidence. All search, critical appraisal, and grading methods should be described explicitly and be replicable by similarly skilled authors.

Practice Guideline (Consensus)

Consensus guidelines are recommendations for practice based on expert opinions that typically do not include a systematic search, an assessment of the quality of individual studies, or a system to label the strength of recommendations explicitly.

Research Evidence

This evidence is presented in publications of original research, involving collection of original data or the systematic review of other original research publications. It does not include editorials, opinion pieces, or review articles (other than systematic reviews or meta-analyses).

Review Article

A nonsystematic overview of a topic is a review article. In most cases, it is not based on an exhaustive, structured review of the literature and does not evaluate the quality of included studies systematically.

Systematic Reviews and Meta-Analyses

A systematic review is a critical assessment of existing evidence that addresses a focused clinical question, includes a comprehensive literature search, appraises the quality of studies, and reports results in a systematic manner. If the studies report comparable quantitative data and have a low degree of variation in their findings, a meta-analysis can be performed to derive a summary estimate of effect.

Existing Strength-of-Evidence Scales

In March 2002, the Agency for Healthcare Research and Quality (AHRQ) published a report that summarized the state-of-the-art in methods of rating the strength of evidence.⁵ The report identified a large number of systems for rating the quality of individual studies: 20 for systematic reviews, 49 for randomized controlled trials, 19 for observational studies, and 18 for diagnostic test studies. It also identified 40 scales that graded the strength of a body of evidence consisting of one or more studies.

The authors of the AHRQ report proposed that any system for grading the strength of evidence should consider three key elements: quality, quantity, and consistency. Quality is the extent to which the identified studies minimize the opportunity for bias and is synonymous with the concept of validity. Quantity is the number of studies and subjects included in those studies. Consistency is the extent to which findings are similar between different studies on the same topic. Only seven of the 40 systems identified addressed all three of these key elements.⁶⁻¹¹

Strength of Recommendation Taxonomy (SORT)

The authors of this article represent the major family medicine journals in the United States and a large family medicine academic consortium. Our process began with a series of e-mail exchanges, was developed during a meeting of the editors, and continued through another series of e-mail exchanges.

We decided that our taxonomy for rating the strength of a recommendation should address the three key elements identified in the AHRQ report: quality, quantity, and consistency of evidence. We also were committed to creating a grading scale that could be applied by authors with varying degrees of expertise in evidence-based medicine and clinical epidemiology, and interpreted by physicians with little or no formal training in these areas. We believed that the taxonomy should address the issue of patient-oriented evidence versus disease-oriented evidence explicitly and be consistent with the information mastery framework proposed by Slawson and Shaughnessy.²

After considering these criteria and reviewing the existing taxonomies for grading the strength of a recommendation, we decided that a new taxonomy was needed to reflect the needs of our specialty. Existing grading scales were focused on a particular kind of study (e.g., prevention or treatment), were too complex, or did not take into account the type of outcome.

Disease or condition	Disease-oriented outcome	Patient-oriented outcome
Doxazosin for blood pressure¹²	Reduces blood pressure in black patients	Increases mortality
Encainide or flecainide for arrhythmia following acute myocardial infarction¹³	Suppresses arrhythmias	Increases mortality
Finasteride for benign prostatic hypertrophy¹⁴	Improves urinary flow rate	No clinically important change in symptom scores
Arthroscopic surgery for osteoarthritis of the knee¹⁵	Improves appearance of cartilage after débridement	No change in function or symptoms at one year
Sleeping infants on their stomach or side¹⁶	Knowledge of anatomy and physiology suggests that this will decrease the risk of aspiration	Increases risk of sudden infant death syndrome
Vitamin E for heart disease¹⁷	Reduces levels of free radicals	No change in mortality
Histamine antagonists and proton pump inhibitors for nonulcer dyspepsia¹⁸	Significantly reduce gastric pH levels	Little or no improvement in symptoms in patients with nongastroesophageal reflux disease, nonulcer dyspepsia
Hormone therapy¹⁹	Reduces low-density lipoprotein cholesterol levels, increases high-density lipoprotein cholesterol	No decrease in cardiovascular or all-cause mortality and an increase in cardiovascular events in women older than 60 years (Women's Health Initiative) with combined hormone therapy
Insulin therapy in type 2 diabetes mellitus²⁰	Keeps blood glucose levels below 120 mg per dL (6.7 mmol per L)	Does not reduce overall mortality
Sodium fluoride for fracture prevention²¹	Increases bone density	Does not reduce fracture rate
Lidocaine prophylaxis following acute myocardial infarction²²	Suppresses arrhythmias	Increases mortality
Clofibrate for hyperlipidemia²³	Reduces lipid levels	Does not reduce mortality
Beta blockers for heart failure²⁴	Reduce cardiac output	Reduce mortality in moderate to severe disease

Our proposed taxonomy is called the Strength of Recommendation Taxonomy (SORT). It is shown in Figure 1. The taxonomy includes ratings of A, B, or C for the strength of recommendation for a body of evidence. The table in the center of Figure 1 explains whether a body of evidence represents good-quality or limited-quality evidence, and whether evidence is consistent or inconsistent. The quality of individual studies is rated 1, 2, or 3; numbers are used to distinguish ratings of individual studies from the letters A, B, and C used to evaluate the strength of a recommendation based on a body of evidence. Figure 2 provides information about how to determine the strength of recommendation for management recommendations, and Figure 3 explains how to determine the level of evidence for an individual study. These two algorithms should be helpful to authors preparing papers for submission to family medicine journals. The algorithms are to be considered general guidelines, and special circumstances may dictate assignment of a different strength of recommendation (e.g., a single, large, well-designed study in a diverse population may warrant an A-level recommendation).

Recommendations based only on improvements in surrogate or disease-oriented outcomes are always categorized as level C, because improvements in disease-oriented outcomes are not always associated with improvements in patient-oriented outcomes, as exemplified by several well-known findings from the medical literature. For example, doxazosin lowers blood pressure in black patients, a seemingly beneficial outcome, but it also increases mortality rates.¹² Similarly, encainide and flecainide reduce the incidence of arrhythmias after acute myocardial infarction, but they also increase mortality rates.¹³ Finasteride improves urinary flow rates, but it does not significantly improve urinary tract symptoms in patients with benign prostatic hypertrophy,¹⁴ while arthroscopic surgery for osteoarthritis of the knee improves the appearance of cartilage but does not reduce pain or improve joint function.¹⁵ Additional examples of clinical situations where disease-oriented evidence conflicts with patient-oriented evidence are shown in Table 1.¹²⁻²⁴ Examples of how to apply the taxonomy are given in Table 2.

We believe there are several advantages to our proposed taxonomy. It is straightforward and comprehensive, is easily applied by authors and physicians, and explicitly addresses the issue of patient-oriented versus disease-oriented evidence. The latter attribute distinguishes SORT from most other evidence-grading scales. These strengths also create some limitations. Some clinicians may be concerned that the taxonomy is not as detailed in its assessment of study designs as others, such as that of the Centre for Evidence-Based Medicine (CEBM).²⁵ However, the primary difference between the two taxonomies is that the CEBM version distinguishes between good and poor observational studies while the SORT version does not. We concluded that the advantages of a system that provides the physician with a clear recommendation that is strong (A), moderate (B), or weak (C) in its support of a particular intervention outweigh the theoretic benefit of distinguishing between lower quality and higher quality observational studies, particularly because there is no objective evidence that the latter distinction carries important differences in clinical recommendations.

Example 1: While a number of observational studies (level of evidence—2) suggested a cardiovascular benefit from vitamin E, a large, well-designed, randomized trial with a diverse patient population (level of evidence—1) showed the opposite. The strength of recommendation against routine, long-term use of vitamin E to prevent heart disease, based on the best available evidence, should be A.
Example 2: A Cochrane review finds seven clinical trials that are consistent in their support of a mechanical intervention for low back pain, but the trials were poorly designed (i.e., unblinded, nonrandomized, or with allocation to groups unconcealed). In this case, the strength of recommendation in favor of these mechanical interventions is B (consistent but lower quality clinical trials).
Example 3: A meta-analysis finds nine high-quality clinical trials of the use of a new drug in the treatment of pulmonary fibrosis. Two of the studies find harm, two find no benefit, and five show some benefit. The strength of recommendation in favor of this drug would be B (inconsistent results of good-quality, randomized controlled trials).
Example 4: A new drug increases the forced expiratory volume in one second (FEV₁) and peak flow rate in patients with an acute asthma exacerbation. Data on symptom improvement are lacking. The strength of recommendation in favor of using this drug is C (disease-oriented evidence only).
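
The basic A, B, and C definitions can be written as a short decision rule. The following Python sketch is only an illustration under stated assumptions: the `BodyOfEvidence` structure, its field names, and the `strength_of_recommendation` function are hypothetical and are not part of SORT itself, and each field stands for a judgment the author has already made about the body of evidence.

```python
from dataclasses import dataclass


@dataclass
class BodyOfEvidence:
    """Author's judgments about a body of evidence (hypothetical structure)."""
    patient_oriented: bool   # False if only disease-oriented or surrogate outcomes
    good_quality: bool       # False for limited-quality studies
    consistent: bool         # False if findings conflict across studies
    opinion_or_case_series_only: bool = False  # consensus, usual practice, opinion, case series


def strength_of_recommendation(evidence: BodyOfEvidence) -> str:
    """Return 'A', 'B', or 'C' following the definitions given in the text."""
    # Disease-oriented evidence, opinion, consensus, and case series are level C.
    if evidence.opinion_or_case_series_only or not evidence.patient_oriented:
        return "C"
    # Consistent, good-quality patient-oriented evidence is level A.
    if evidence.good_quality and evidence.consistent:
        return "A"
    # Inconsistent or limited-quality patient-oriented evidence is level B.
    return "B"


# Examples 2 to 4 above, encoded as judgments.
example_2 = BodyOfEvidence(patient_oriented=True, good_quality=False, consistent=True)
example_3 = BodyOfEvidence(patient_oriented=True, good_quality=True, consistent=False)
example_4 = BodyOfEvidence(patient_oriented=False, good_quality=True, consistent=True)

for name, ev in [("Example 2", example_2), ("Example 3", example_3), ("Example 4", example_4)]:
    print(name, strength_of_recommendation(ev))
```

Example 1 is deliberately left out of the sketch: rating it A depends on the judgment that a single large, well-designed randomized trial outweighs earlier observational studies, which a three-field rule of this kind cannot capture.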

Any publication applying SORT (or any other evidence-based taxonomy) should describe carefully the search process that preceded the assignment of a SORT rating. For example, authors could perform a comprehensive search of MEDLINE and the gray literature, a comprehensive search of MEDLINE alone, or a more focused search of MEDLINE plus secondary evidence-based sources of information.

SORT		CEBM		BMJ's Clinical Evidence
A.	Recommendation based on consistent and good-quality patient-oriented evidence	A.	Consistent level 1 studies	Beneficial
B.	Recommendation based on inconsistent or limited-quality patient-oriented evidence	B.	Consistent level 2 or 3 studies or extrapolations from level 1 studies	Likely to be beneficial; likely to be ineffective or harmful (recommendation against)
		C.	Level 4 studies or extrapolations from level 2 or 3 studies	Unlikely to be beneficial (recommendation against)
C.	Recommendation based on consensus, usual practice, disease-oriented evidence, case series for studies of treatment or screening, and/or opinion	D.	Level 5 evidence or troublingly inconsistent or inconclusive studies of any level	Unknown effectiveness
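
For readers who keep ratings in electronic form, the walkover in this table can be expressed as two lookup tables. This is a minimal Python sketch, not an official mapping: the dictionary and function names are hypothetical, and placing CEBM grade C and the "unlikely to be beneficial" category under SORT B is an assumption based on reading the row with the empty SORT cell as a continuation of row B.

```python
# Walkover from CEBM grades of recommendation to SORT letters.
# Assumption: CEBM grade C falls under SORT B (row with an empty SORT cell).
CEBM_GRADE_TO_SORT = {"A": "A", "B": "B", "C": "B", "D": "C"}

# Walkover from BMJ Clinical Evidence categories to SORT letters.
# Assumption: "unlikely to be beneficial" falls under SORT B for the same reason.
CLINICAL_EVIDENCE_TO_SORT = {
    "beneficial": "A",
    "likely to be beneficial": "B",
    "likely to be ineffective or harmful": "B",
    "unlikely to be beneficial": "B",
    "unknown effectiveness": "C",
}


def sort_from_cebm_grade(grade: str) -> str:
    """Map a CEBM grade (A to D) to a SORT strength of recommendation."""
    return CEBM_GRADE_TO_SORT[grade.strip().upper()]


def sort_from_clinical_evidence(category: str) -> str:
    """Map a BMJ Clinical Evidence category to a SORT strength of recommendation."""
    return CLINICAL_EVIDENCE_TO_SORT[category.strip().lower()]


print(sort_from_cebm_grade("d"))
print(sort_from_clinical_evidence("Likely to be beneficial"))
```

An unrecognized grade or category raises `KeyError`, which seems preferable to silently guessing a rating.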

Walkovers: Creating Linkages with SORT

Some organizations, such as the CEBM,²⁵ the Cochrane Collaboration,⁷ and the U.S. Preventive Services Task Force,⁶ have developed their own grading scales for the strength of recommendation based on a body of evidence and are unlikely to abandon them. Other organizations, such as the FPIN,²⁶ publish their work in a variety of settings and must be able to move between taxonomies. We have developed a set of optional walkovers that suggest how authors, editors, and readers might move from one taxonomy to another. Walkovers for the CEBM and BMJ Clinical Evidence taxonomies are shown in Table 3.

Many authors and experts in evidence-based medicine use the “Level of Evidence” taxonomy from the CEBM to rate the quality of individual studies.²⁵ A walkover from the five-level CEBM scale to the simpler three-level SORT scale for individual studies is shown in Table 4.

Final Comment

The SORT is a comprehensive taxonomy for evaluating the strength of a recommendation based on a body of evidence and the quality of an individual study. If applied consistently by authors and editors in the family medicine literature, it has the potential to make it easier for physicians to apply the results of research in their practice through the information mastery approach and to incorporate evidence-based medicine into their patient care.

Like any such grading scale, it is a work in progress. As we learn more about biases in study design, and as the authors and readers who use the taxonomy become more sophisticated about principles of information mastery, evidence-based medicine, and critical appraisal, it is likely to evolve. We remain open to suggestions from the primary care community for refining and improving SORT.

	CEBM
SORT Level	Treatment/screening	Other categories
1	Levels 1a to 1c	Levels 1a to 1c
2	Level 2 or 3	Levels 2 to 4
3	Level 4 or 5 and any study that measures intermediate or surrogate outcomes	Level 5 and any study that measures intermediate or surrogate outcomes
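
The walkover for individual studies in this table can also be sketched as a small function. The Python below is an illustration only: the function name and parameters are hypothetical, and it assumes CEBM levels are written as a digit with an optional letter (for example "1b", "2a", "4").

```python
def sort_level_from_cebm(cebm_level: str, treatment_or_screening: bool,
                         surrogate_outcome: bool = False) -> int:
    """Map a CEBM level of evidence to a SORT level (1, 2, or 3).

    treatment_or_screening selects the table column; False means "other categories".
    """
    # Any study measuring intermediate or surrogate outcomes is SORT level 3.
    if surrogate_outcome:
        return 3
    major = int(cebm_level.strip()[0])  # leading digit, e.g. "2b" -> 2
    if major < 1 or major > 5:
        raise ValueError(f"unexpected CEBM level: {cebm_level!r}")
    if major == 1:
        return 1
    if treatment_or_screening:
        # Level 2 or 3 maps to SORT 2; level 4 or 5 maps to SORT 3.
        return 2 if major in (2, 3) else 3
    # Other categories: levels 2 to 4 map to SORT 2; level 5 maps to SORT 3.
    return 2 if major in (2, 3, 4) else 3


print(sort_level_from_cebm("1b", treatment_or_screening=True))
print(sort_level_from_cebm("4", treatment_or_screening=True))
print(sort_level_from_cebm("4", treatment_or_screening=False))
```

The only asymmetry between the two columns is CEBM level 4, which maps to SORT level 3 for treatment or screening studies and to SORT level 2 for other categories.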

1. Evidence-based medicine. A new approach to teaching the practice of medicine. JAMA. 1992;268:2420-5.

2. Slawson DC, Shaughnessy AF, Bennett JH. Becoming a medical information master: feeling good about not knowing everything. J Fam Pract. 1994;38:505-13.

3. Shaughnessy AF, Slawson DC, Bennett JH. Becoming an information master: a guidebook to the medical information jungle. J Fam Pract. 1994;39:489-99.

4. Siwek J, Gourlay ML, Slawson DC, Shaughnessy AF. How to write an evidence-based clinical review article. Am Fam Physician. 2002;65:251-8.

5. Systems to rate the strength of scientific evidence. Summary, evidence report/technology assessment: number 47. AHRQ publication no. 02-E015, March 2002. Agency for Healthcare Research and Quality, Rockville, Md. Accessed November 13, 2003, at: http://www.ahrq.gov/clinic/epc-sums/strengthsum.htm.

6. Harris RP, Helfand M, Woolf SH, Lohr KN, Mulrow CD, Teutsch SM, et al. Current methods of the U.S. Preventive Services Task Force: a review of the process. Am J Prev Med. 2001;20(3 suppl):21-35.

7. Clarke M, Oxman AD. Cochrane reviewers' handbook 4.2.0. The Cochrane Collaboration, 2003. Accessed November 13, 2003, at: http://www.cochrane.org/resources/handbook/handbook.pdf.

8. Gyorkos TW, Tannenbaum TN, Abrahamowicz M, Oxman AD, Scott EA, Millson ME, et al. An approach to the development of practice guidelines for community health interventions. Can J Public Health. 1994;85(suppl 1):S8-13.

9. Briss PA, Zaza S, Pappaioanou M, Fielding J, Wright-De Aguero L, Truman BI, et al. Developing an evidence-based guide to community preventive services: methods. Am J Prev Med. 2000;18(1 suppl):35-43.

10. Greer N, Mosser G, Logan G, Halaas GW. A practical approach to evidence grading. Jt Comm J Qual Improv. 2000;26:700-12.

11. Guyatt GH, Haynes RB, Jaeschke RZ, Cook DJ, Green L, Naylor CD, et al. Users' guides to the medical literature: XXV. Evidence-based medicine: principles for applying the users' guides to patient care. JAMA. 2000;284:1290-6.

12. Major cardiovascular events in hypertensive patients randomized to doxazosin vs chlorthalidone: the antihypertensive and lipid-lowering treatment to prevent heart attack trial (ALLHAT) [published correction in JAMA 2002;288:2976]. JAMA. 2000;283:1967-75.

13. Echt DS, Liebson PR, Mitchell LB, Peters RW, Obias-Manno D, Barker AH, et al. Mortality and morbidity in patients receiving encainide, flecainide, or placebo. N Engl J Med. 1991;324:781-8.

14. Lepor H, Williford WO, Barry MJ, Brawer MK, Dixon CM, Gormley G, et al. The efficacy of terazosin, finasteride, or both in benign prostatic hyperplasia. N Engl J Med. 1996;335:533-9.

15. Moseley JB, O'Malley K, Petersen NJ, Menke TJ, Brody BA, Kuykendall DH, et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. N Engl J Med. 2002;347:81-8.

16. Dwyer T, Ponsonby AL. Sudden infant death syndrome: after the “back to sleep” campaign. BMJ. 1996;313:180-1.

17. Yusuf S, Dagenais G, Pogue J, Bosch J, Sleight P. Vitamin E supplementation and cardiovascular events in high-risk patients. N Engl J Med. 2000;342:154-60.

18. Moayyedi P, Soo S, Deeks J, Delaney B, Innes M, Forman D. Pharmacological interventions for non-ulcer dyspepsia. Cochrane Database Syst Rev. 2003(1):CD001960.

19. Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, et al. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women's Health Initiative randomized controlled trial. JAMA. 2002;288:321-33.

20. Intensive blood-glucose control with sulphonylureas or insulin compared with conventional treatment and risk of complications in patients with type 2 diabetes (UKPDS 33). Lancet. 1998;352:837-53.

21. Meunier PJ, Sebert JL, Reginster JY, Briancon D, Appelboom T, Netter P, et al. Fluoride salts are no better at preventing new vertebral fractures than calcium-vitamin D in post-menopausal osteoporosis: the FAVOStudy. Osteoporos Int. 1998;8:4-12.

22. MacMahon S, Collins R, Peto R, Koster RW, Yusuf S. Effects of prophylactic lidocaine in suspected acute myocardial infarction. An overview of results from the randomized, controlled trials. JAMA. 1988;260:1910-6.

23. Grumbach K. How effective is drug treatment of hypercholesterolemia? A guided tour of the major clinical trials for the primary care physician. J Am Board Fam Pract. 1991;4:437-45.

24. Heidenreich PA, Lee TT, Massie BM. Effect of beta-blockade on mortality in patients with heart failure: a meta-analysis of randomized clinical trials. J Am Coll Cardiol. 1997;30:27-34.

25. Centre for Evidence-Based Medicine. Levels of evidence and grades of recommendation. Accessed November 13, 2003, at: http://www.cebm.net/levels_of_evidence.asp.

26. Family Practice Inquiries Network (FPIN). Accessed November 13, 2003, at: http://www.fpin.org.