Abstract
Objective. To examine the level of evidence for criterion-concurrent validity of spinal mobility assessments in patients with ankylosing spondylitis (AS).
Methods. Guidelines proposed in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses were used to undertake a search strategy involving 3 sets of keywords: accura*, truth, valid*; ankylosing spondylitis, spondyloarthritis, spondyloarthropathy, spondylarthritis; and mobility, spinal measure* (a further 16 keywords with similar meaning were also used in the third set). Seven databases were searched from their inception to February 2014: AMED, Embase, ProQuest, PubMed, Science Direct, Scopus, and Web of Science. The Quality Assessment of Diagnostic Accuracy Studies (with modifications) was used to assess the quality of articles reviewed. An article was considered high quality when it received “yes” in at least 9 of the 13 items.
Results. From the 741 records initially identified, 10 articles were retained for our systematic review. Only 1 article was classified as high quality, and this article suggests that 3 variants of the Schober test (original, modified, and modified-modified) poorly reflect lumbar range of motion where radiographs were used as the reference standard.
Conclusion. The level of evidence considering criterion-concurrent validity of clinical tests used to assess spinal mobility in patients with AS is low. Clinicians should be aware that current practice when measuring spinal mobility in AS may not accurately reflect true spinal mobility.
The importance of assessing spinal mobility in patients with ankylosing spondylitis (AS) was emphasized after its recommendation as an inclusion criterion for diagnosing the disease in 1966 in the New York symposium1. Since then, measurements of spinal mobility have been widely used in the assessment of patients with AS, assisting with diagnosis, monitoring disease progression, and determining the efficacy of treatment interventions2,3,4. Limitation of spinal mobility may be a predictor of poor outcome in AS5. Structural damage, inflammation, and age have already been shown to affect spinal mobility6,7. However, before a clinical test is accepted as an assessment tool, it should demonstrate acceptable reliability, responsiveness, and validity. The latter is a particularly important feature for clinical tests8, and it takes several forms. Face validity and content validity are subjective judgments of how well a test represents the intended concept, and are often applied to questionnaire-based assessments. Construct validity represents the ability of an instrument to measure an abstract construct, such as the level of health, capacity, or physical function. Criterion-related validity is divided into 2 types. Criterion-concurrent validity is assessed when 2 tests or instruments, the criterion (a reference standard) and the target (index test), are performed concurrently. Criterion-predictive validity establishes how successful the outcome of the target test is as a predictor of a future status8.
Following the Outcome Measures in Rheumatoid Arthritis Clinical Trials, “truth” has been identified as 1 of the 3 key criteria for any outcome measure (the other components are discrimination and feasibility)9,10. The truth of a measure represents the ability or accuracy of an instrument or clinical test to assess the intended variable9,10. Spinal mobility tests are used on the assumption that they reflect spinal range of motion. Although the term “truth” may cover both face and content validity, the most objective way of assessing the “truth” of a clinical test or instrument is through criterion-concurrent validity, which requires that a measure be compared to a reference standard8. The reference standard for range of motion is widely acknowledged to be radiographic measurements11,12,13,14,15. However, this approach is relatively time-consuming and expensive, and exposes the patients to radiation11. As a consequence, some noninvasive low-cost methods that are easy to apply and interpret, such as goniometry16,17, tape measures12,17,18,19,20,21, and inclinometry18 have been used. These measurements are frequently used when assessing spinal mobility in patients with AS, together with cervical rotation, tragus-to-wall distance, lateral lumbar flexion, modified Schober test, chest expansion, and finger-tip-to-floor distance. Some indices combine several clinical measures to provide a composite clinical index and an assessment of spinal movement as a whole2,16,22,23. These include the Bath Ankylosing Spondylitis Metrology Index (BASMI)23, the Edmonton Ankylosing Spondylitis Metrology Index16, and the University of Cordoba Ankylosing Spondylitis Metrology Index22.
BASMI is recommended by the Assessment of SpondyloArthritis International Society2,24,25 and has been widely used4,6,25,26,27,28,29,30,31,32,33. The initial study that proposed the BASMI selected the 5 clinical tests that compose the index, and considered these the most accurate to reflect spine mobility23. However, the term “accurate” was not defined. Although the BASMI study appears to deal with face validity, its aims were to calculate the reproducibility and responsiveness of these clinical tests, and no data on any kind of validity were presented23. Nonetheless, based on this study23, many authors have erroneously claimed the validity of the BASMI4,16,25,28,29,34. Moreover, previous studies observed associations between individual spinal mobility tests or compound indices and compound radiological indices using plain radiographic scoring systems such as the modified Stokes Ankylosing Spondylitis Spinal Score or the Bath Ankylosing Spondylitis Radiology Index35,36,37,38,39. However, although these findings are undoubtedly important, these studies deal with construct validity and were not designed to assess the extent to which compound indices or individual spinal mobility tests reflect true spinal movement. Thus, there is a common misconception regarding the criterion-concurrent validity of mobility tests in patients with AS. In addition, criterion-concurrent validity in the context of spinal measures can only be assessed for individual tests. Thus, the analysis of compound indices is not appropriate when criterion-concurrent validity is the aim. Our current systematic review aims to examine the level of evidence for criterion-concurrent validity of spinal mobility assessments in patients with AS.
MATERIALS AND METHODS
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines were used as the basis for our systematic review40.
Literature search strategy
Based on the current literature, 3 sets of keywords were derived: (1) accura* OR truth OR valid*; (2) “ankylosing spondylitis” OR spondyloarthritis OR spondyloarthropathy OR spondyl-arthritis; (3) “mobility” OR BASMI OR “spinal measure*” OR “hip measure*” OR Goniomet* OR Inclinomet* OR “tape measure*” OR “cervical rotation” OR “tragus to wall” OR “lumbar flexion” OR Schober OR “intermalleolar distance” OR “chest expansion” OR “finger* to floor” OR “finger* to ground” OR “internal rotation” OR “range of motion” OR “range of movement”. Initially, keywords were entered into the Cochrane database to identify any previous systematic reviews with a similar aim to our present study. Subsequently, 7 databases were searched from their inception to February 2014: AMED, Embase, ProQuest, PubMed, Science Direct, Scopus, and Web of Science. Some Medical Subject Headings terms and filters were applied (Figure 1).
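As an illustration only, the 3 keyword sets above could be assembled into a single Boolean search string as sketched below. The helper names (`or_group`, `build_query`) are hypothetical, and joining the 3 sets with AND is an assumption: the text lists the sets but does not state the operator between them, and each database has its own syntax, field tags, and filters (Figure 1).

```python
# Keyword sets copied from the search strategy; phrase terms keep their quotes.
VALIDITY_TERMS = ["accura*", "truth", "valid*"]
CONDITION_TERMS = [
    '"ankylosing spondylitis"', "spondyloarthritis",
    "spondyloarthropathy", "spondyl-arthritis",
]
MOBILITY_TERMS = [
    '"mobility"', "BASMI", '"spinal measure*"', '"hip measure*"',
    "Goniomet*", "Inclinomet*", '"tape measure*"', '"cervical rotation"',
    '"tragus to wall"', '"lumbar flexion"', "Schober",
    '"intermalleolar distance"', '"chest expansion"', '"finger* to floor"',
    '"finger* to ground"', '"internal rotation"', '"range of motion"',
    '"range of movement"',
]


def or_group(terms):
    """Join one keyword set with OR and wrap it in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*keyword_sets):
    """Combine the keyword sets (assumption: sets are intersected with AND)."""
    return " AND ".join(or_group(terms) for terms in keyword_sets)


query = build_query(VALIDITY_TERMS, CONDITION_TERMS, MOBILITY_TERMS)
print(query)
```

The third set contains 18 terms, which matches the abstract's description of "mobility", "spinal measure*", and a further 16 keywords.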
Eligibility criteria
Studies were considered if they met the following inclusion criteria: (1) assessing human adult subjects (> 18 yrs old); (2) assessing participants with a diagnosis of AS; (3) a design that assessed criterion-concurrent validity of spine mobility measures; (4) full-text availability; (5) assessing individual tests of spinal mobility; and (6) articles in peer-reviewed journals.
Articles were excluded when (1) the spinal mobility was related only to measures of structural damage or quality of life, and (2) results were presented only as total scores from compound indices.
To avoid missing relevant articles, there were no restrictions on language and publication dates.
Study selection
All articles retrieved from the database searches were imported into EndNote X4 (Thomson Reuters). One author (MPC) removed duplicated references, editorials, letters to the editor, short reports, abstracts, and reviews. Two independent reviewers (MPC and MDB) screened the titles and abstracts to identify the articles that would potentially meet the inclusion criteria. The full text of articles was then reviewed and those that met the criteria for inclusion were determined. Finally, the reference lists from included articles were screened to identify further relevant articles. Disagreement between the 2 reviewers regarding the relevance of a study for inclusion was settled with a consensus-agreement approach, and if consensus could not be reached, a third reviewer provided arbitration.
Methodological quality assessment and risk of bias
Each article included from the databases was rated for methodological quality and risk of bias by 2 examiners (MPC and MDB). The Quality Assessment of Diagnostic Accuracy Studies (QUADAS)41 was used. The original scale is composed of 14 items, each answered “yes”, “no”, or “unclear”, and covers 3 topics: (1) reporting of selection criteria; (2) selection, execution, and interpretation of the index test and reference standard; and (3) data analysis. In accordance with the suggestions of the authors of QUADAS41, the scale required adaptation depending on the nature of the study. Because our present study aimed to assess criterion-concurrent validity of spinal measurement rather than diagnostic accuracy, adaptations were made to QUADAS (Table 1): 3 items were removed (items 7, 12, and 13), 2 items were modified (items 3 and 5), and 2 items were added, which were derived from a scale of quality previously proposed to evaluate criterion-concurrent validity of cervical range of motion42. Therefore, the scale used was composed of 13 items.
Included articles were rated independently for quality by 2 reviewers (MPC and MDB). A maximum score of 13 points was assigned to each article. After initial scoring, the articles were classified as either low or high quality. An article was considered high quality when it met 2 criteria: (1) receiving “yes” for item 3 (Is the reference standard likely to correctly measure the target joint range of motion?), for which we considered radiographic analysis to be the reference standard for spinal range of motion; and (2) receiving “yes” for at least 9 of the 13 items assessed. We set this threshold because previous studies have indicated that about 70% of positive scores (10 out of the 14 QUADAS items) discriminate between high-quality and low-quality studies regarding diagnostic accuracy43,44.
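The 2-part classification rule can be written out as a short sketch. The function name, the dictionary layout, and the example answers are hypothetical and serve only to illustrate the rule; they are not data from any reviewed article.

```python
def is_high_quality(item_answers, reference_item=3, min_yes=9):
    """Apply the 2 criteria described above.

    item_answers maps each item number (1 to 13) of the modified QUADAS
    to "yes", "no", or "unclear". This layout is an assumption made for
    the illustration.
    """
    yes_count = sum(1 for answer in item_answers.values() if answer == "yes")
    reference_ok = item_answers.get(reference_item) == "yes"
    return reference_ok and yes_count >= min_yes


# Hypothetical article: "yes" on items 1 to 10, "unclear" on items 11 to 13.
example = {item: "yes" for item in range(1, 11)}
example.update({item: "unclear" for item in range(11, 14)})
print(is_high_quality(example))

# The same answers, but with a reference standard other than radiography.
example_no_radiograph = dict(example)
example_no_radiograph[3] = "no"
print(is_high_quality(example_no_radiograph))
```

For comparison, 9 of 13 items is about 69% and 10 of 14 items is about 71%, so both cutoffs sit close to the 70% figure cited above.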
Data analysis
To assess the level of association between the reference standard and the index test, either Pearson or Spearman correlation coefficients were considered. Values of r between 0 and 0.20 were designated as a low level of agreement, 0.21 to 0.40 as fair agreement, 0.41 to 0.60 as moderate agreement, 0.61 to 0.80 as substantial agreement, and 0.81 or higher as excellent agreement45.
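A minimal sketch of this interpretation scale follows. Two details are assumptions, since the scale is stated only to 2 decimal places and only for values from 0 upward: values falling between bands (for example 0.205) are assigned to the higher band, and the magnitude of a negative coefficient is used. The function name is hypothetical.

```python
def agreement_level(r):
    """Map a correlation coefficient to the agreement bands in the text."""
    magnitude = abs(r)  # assumption: the sign is ignored
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude <= 0.20:
        return "low"
    if magnitude <= 0.40:
        return "fair"
    if magnitude <= 0.60:
        return "moderate"
    if magnitude <= 0.80:
        return "substantial"
    return "excellent"


# Coefficients reported by Miller, et al17 for the new tape measure method.
for r in (0.82, 0.77, 0.86):
    print(r, agreement_level(r))
```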
RESULTS
Study selection
No systematic reviews addressing the question posed by our present paper were discovered. The primary search yielded a total of 741 articles. After title, abstract, and/or full-text screening, 715 were excluded because they were duplicates, were review articles, were not original full-text articles, or were not directly related to the topic of the present review. Twenty-six articles were retained for full-text analysis. Seventeen articles were excluded because they did not compare 2 methods or tests for assessing spinal mobility, or they did not assess criterion-concurrent validity. Thus, 9 articles12,18,19,20,21,46,47,48,49 were retained for our current systematic review. After screening the reference list from these selected articles, another article17 met the inclusion criteria and was also included (Figure 1).
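The counts in this selection flow can be checked with simple arithmetic. The variable names below are hypothetical; the numbers are those reported in the paragraph above.

```python
# Counts reported in the study selection paragraph.
records_identified = 741
excluded_at_screening = 715
excluded_at_full_text = 17
added_from_reference_lists = 1

# Articles remaining after each stage of the selection flow.
full_text_assessed = records_identified - excluded_at_screening
retained_from_databases = full_text_assessed - excluded_at_full_text
total_included = retained_from_databases + added_from_reference_lists

print(full_text_assessed, retained_from_databases, total_included)
```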
Study characteristics
The earliest studies identified were published in the 1970s47,49, while the most recent study, by Rezvani, et al12, was published in 2012. The number of participants assessed in the studies ranged from 3 (Miller, et al17) to 263 (Maksymowych, et al21). All studies used tape measures as one of the methods, in some cases as the index test12,17,18,19,20,21,47,49, while in others as the reference standard19,46,48. One study used an electromagnetic 3-D tracking system in the index test48. Radiographic analysis was the most common reference standard used12,20,47,49, while inclinometry18 and goniometry17,21 were also used (Table 2).
The range of motion of different segments of the spine was assessed. The thoraco-lumbar mobility was assessed by thoraco-lumbar lateral flexion49, thoraco-lumbar extension47, thoraco-lumbar forward flexion12,17,20, Schober indices12,17,19,20,48, fingertip-to-floor distance17,46,48, and thoraco-lumbar rotation19. The upper thoracic and lower cervical mobility were assessed by the tragus (occiput)-to-wall distance46, and the cervical spine mobility was assessed by cervical lateral flexion18,48, cervical rotation18,21,48, cervical flexion18,48 (chin-to-chest distance18), and cervical extension18,48 (Table 2).
Quality assessment
In the present systematic review, 82.3% of the items (107) were scored in agreement by the 2 examiners, while for the other 23 items, consensus agreement was reached after discussion (Table 3). The lowest quality rating was 4 points (Miller, et al17) and the highest was 10 points (Rezvani, et al12), out of a maximum of 13 points. Four studies included a very small number of participants (fewer than 8)17,46,47,49, 2 studies considered male patients alone18,19, and 3 studies considered a small number of female participants12,20,48. Therefore, only 1 study was considered to assess a representative spectrum of patients (Item 1)21. Further, most articles (7 out of 10) did not clearly describe their selection criteria of participants (Item 2)17,18,19,21,46,47,49.
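The raw agreement between the 2 examiners follows from the item counts: 10 articles rated on 13 items each gives 130 item ratings, of which 107 were scored identically and 23 required discussion. A short sketch of that calculation, with hypothetical variable names:

```python
articles = 10
items_per_article = 13
items_in_agreement = 107
items_resolved_by_consensus = 23

# Every item rating is either an initial agreement or a consensus decision.
total_item_ratings = articles * items_per_article
assert items_in_agreement + items_resolved_by_consensus == total_item_ratings

# Raw (uncorrected) percentage agreement between the 2 examiners.
percent_agreement = 100 * items_in_agreement / total_item_ratings
print(round(percent_agreement, 1))
```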
Four studies used radiographic analysis as the reference standard for assessing the criterion-concurrent validity of tape measures (Item 3)12,20,47,49. The reference standard used in the other selected studies included a goniometer17,21, inclinometer18, or tape measures19,46,48. None of the studies using radiographic analysis clearly described the time interval between the radiograph and index test (Item 4)12,20,47,49.
All participants were assessed using both the reference standard and index test (Item 5), and were assessed using the same reference standard regardless of the index test used (Item 6). The execution of the index test was described with enough detail to be reproduced in 9 out of the 10 studies (Item 7)12,18,19,20,21,46,47,48,49, while the description of the execution of the reference standard was clear in 6 articles (Item 8)12,17,19,20,21,46. Only 1 study measured or interpreted the index test without knowledge of the results of the reference standard (Item 9)12. In 3 studies, the reference standard was interpreted without knowledge of the index test12,47,49; in another 3 studies, this information was not presented20,21,48; and in 4 articles, the examiners were aware of the results of the index test (Item 10)17,18,19,46.
In 5 studies, either there were no withdrawals or the reasons for withdrawals from the study were clearly presented18,19,47,48,49, while in the other 5, there was insufficient detail regarding withdrawals (Item 11)12,17,20,21,46. Most studies clearly showed their descriptive statistics by mean, median, and/or variance measures12,18,19,21,47,49; however, in 4 articles, the results were not clearly presented (Item 12)17,20,46,48. Only the 3 most recently conducted studies used inferential statistics, presenting coefficients of correlation between the reference standard and the index test12,21,48. In the other studies, the data from patients and control groups were pooled17,47,49, no statistical analysis was performed18,19,46, or the procedures described in the methods did not match those presented in the results (Item 13)20.
Criterion-concurrent validity
Of the 4 studies using radiographic analysis as reference standard12,20,47,49, only 1 was rated as high quality. This study by Rezvani, et al12 assessed the correlation between 3 variants of the Schober test (original, modified, and modified-modified) and 2 techniques for calculating lumbar range of motion by radiography (the angle between L1 and S1, and between L3 and S1) in patients with AS and control participants. For the Schober tests, 2 reference points were marked on the patients’ low back region while they were in erect position: original Schober (marks at the lumbosacral junction and 10 cm above), modified Schober (marks 5 cm below and 10 cm above the lumbosacral junction), and modified-modified Schober (marks at the lumbosacral junction and 15 cm above). Then the patients performed forward flexion and the distance between these marks was measured. Greater distances between marks suggested greater lumbar flexion. Poor correlations were observed for all analyses (Table 4)12. Another study that also assessed lumbar spine forward flexion found moderate and substantial correlations between Schober tests (modified and original) and radiographic lumbar range of motion20. Finally, Moll, et al47,49 assessed lumbar spine inclination and extension, and observed excellent correlation coefficients between the range of motion measured by tape measures and radiographs. However, it was not clear whether these coefficients related to patients with AS because analyses were performed on pooled data from both patients and control groups (Table 4)47,49.
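The landmark positions for the 3 Schober variants can be summarized in a small sketch. Positions are given in cm relative to the lumbosacral junction, as described above (negative values are below it). Reporting the result as the increase over the upright distance is an assumption made for the illustration, as the text states only that the distance between the marks was measured after forward flexion. All names and the example value are hypothetical.

```python
# (lower mark, upper mark) in cm relative to the lumbosacral junction.
SCHOBER_VARIANTS = {
    "original": (0.0, 10.0),
    "modified": (-5.0, 10.0),
    "modified-modified": (0.0, 15.0),
}


def upright_distance_cm(variant):
    """Distance between the 2 marks while the patient stands erect."""
    lower, upper = SCHOBER_VARIANTS[variant]
    return upper - lower


def schober_increase_cm(variant, flexed_distance_cm):
    """Increase in the distance between the marks after forward flexion.

    Assumption: the result is expressed as an increase over the upright
    distance rather than as the raw flexed distance.
    """
    return flexed_distance_cm - upright_distance_cm(variant)


# Hypothetical measurement: marks are 14.5 cm apart in full flexion.
print(schober_increase_cm("original", 14.5))
```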
Radiographic analysis was not used as the reference standard in 6 studies. Miller, et al17 assessed the validity of a new tape measure (three 10-cm segment method) used for recording thoraco-lumbar spinal range of motion in the sagittal plane and found substantial to excellent correlations between the new method and goniometry (r = 0.82), modified Schober test (0.77), and fingertip-to-floor distance (0.86). Stokes, et al46 evaluated a new instrument developed to measure the fingertip-to-floor distance and occiput-to-wall distance (an “L” scale) by comparing it with tape measures. No correlation coefficients were presented, but the authors observed differences between these 2 instruments and stated they were not interchangeable46. Viitanen, et al19 described a new tape measure method based on the measurement of the distance between the tip of the xiphoid process and the first sacral spinous process before and after rotation (Pavelka rotation method) and compared it with the needle rotation method, modified Schober test, and whole thoracolumbar spine19. While the authors claimed that the Pavelka rotation method was valid, no statistical analysis verifying the relationships among the 4 assessed methods was provided19.
The remaining 3 studies explored the motion of the cervical spine. Viitanen, et al18 evaluated 9 tests (6 by tape measures and 3 by goniometry), but no statistical data regarding the relationship between those tests were presented. Jordan, et al48 observed moderate to substantial correlations between 3-D kinematic measurement of cervical spine movement and 2 other tape measures (modified Schober test and fingertip-to-floor distance). Finally, Maksymowych, et al21 found moderate correlation between cervical spinal rotation recorded by goniometry (reference standard) and tape measure (index test).
DISCUSSION
Although a wide range of clinical tests using simple measurement procedures such as goniometry17,21,23, inclinometry18,23, and tape measures12,18,19,20,21 have been used to assess spinal mobility in people with AS, literature-based evidence regarding criterion-concurrent validity reflecting true spinal mobility is unclear. Therefore, the objective for our systematic review was to investigate the level of evidence for criterion-concurrent validity for spinal mobility tests in patients with AS.
Only 1 study12 met the 2 criteria required for high quality. This study used radiograph assessment of range of motion as the reference standard, and met 10 of the 13 items suggested by the modified QUADAS quality assessment tool. Although classified as high quality, it is important to consider the 3 items where the study scored poorly. Rezvani, et al12 assessed a representative population of males, with patients covering a wide range of disease severity and an adequate number of participants (n = 41); however, only a small number of female patients were included (n = 9). Therefore, the generalizability of their findings for a population of patients with AS is uncertain. Two other issues regarding this article were a lack of clarity regarding the time interval between the radiographs and the tape measures, and the number of withdrawals from the study. Although radiological and physical measures for AS are not likely to demonstrate any substantial day-to-day or short-term differences, a potential source of error relates to the known diurnal variability of both symptoms and physical measures in AS if performed at different periods of the day. Variation in time of physical measurement may have affected the results of the study, even though the risk of such bias may be acceptable. Clear documentation of the time of day at which symptoms and range of motion were recorded would have been desirable. Notwithstanding these limitations, the authors concluded that tape measures poorly reflect lumbar spine mobility12.
Contrasting results were observed by Rahali-Khachlouf, et al20, who suggested tape measures have acceptable psychometric properties to assess patients with AS. However, it is important to note that this study failed in most items (2 “no” and 5 “unclear”). The study was rated as having a high risk of bias and their conclusions appear to be fragile20. Radiographic analysis was also used as a reference standard for criterion-concurrent validation of tape measures in 2 further studies; Moll, et al assessed the lumbar extension47 and thoraco-lumbar lateral flexion49 in 2 unrepresentative samples (6 and 7 patients, respectively) without mentioning the sex of the participants. Moreover, the data recorded from the patients with AS were pooled with those recorded in a larger sample of control participants (18 in the lumbar extension study47 and 36 in the lateral flexion study49). Therefore, the substantial association between the clinical tests and radiograph presented by the authors does not appear to represent criterion-concurrent validity of lumbar extension or lumbar lateral flexion specifically for patients with AS.
The remaining 6 studies included in our systematic review met neither of the 2 criteria required to be considered high quality17,18,19,21,46,48; their reference standards were not radiograph analysis, and they received “yes” in fewer than 9 items. The former criterion was established because radiograph analyses are widely accepted as the ideal reference standard for measuring range of motion11,12,13,14,15. Although inclinometry18, goniometry17,21, and tape measures19,46,48 were used as reference standards, these measures have never been subject to criterion-concurrent validation as assessments of spinal mobility in AS. Overall, these studies observed substantial to excellent relationships between goniometry and tape measures to assess spinal range of motion in the sagittal plane17, moderate association to record cervical mobility21, and excellent associations between an electromagnetic 3-D tracking system and tape measures to record cervical range of motion48. However, they did not reflect acceptable data regarding criterion-concurrent validity.
There appears to be no robust evidence regarding criterion-concurrent validity for clinical tests used to measure spinal mobility in patients with AS. Three out of the 4 studies using a proper reference standard included in our systematic review were classified as low quality20,47,49, and the only high-quality study12 suggested clinical tests for assessing mobility of lumbar spine (original, modified, and modified-modified Schober test) poorly reflect the range of motion of this spinal segment in patients with AS. This is of concern because these mobility tests are widely used in routine clinical practice and research2 where they are considered as validated tools4,16,25,28,29,34.
Some limitations should be considered. Although we adapted a well-accepted index of quality assessment (QUADAS), these adaptations have never been previously used or validated. Moreover, although some studies have supported a cutoff score of about 70% to consider a study as high quality44,50, others have questioned this kind of quality score for classifying studies41. Finally, while there were no restrictions of language for our systematic review, only keywords in English were used.
The level of evidence considering criterion-concurrent validity of clinical tests commonly used to assess spinal mobility in patients with AS is low. There is only an acceptable level of evidence for criterion-concurrent validity for tests used to assess the lumbar spine, which suggests these tests poorly reflect the mobility of this segment. Based on current literature, there are no high-quality studies supporting the criterion-concurrent validity of clinical tests for spinal mobility in patients with AS. However, these clinical measures are in widespread use because they are highly feasible in that they are easy to perform, low in cost, and rapidly implemented. Therefore, we suggest further research addressing criterion-concurrent validity of mobility tests to establish which of these clinical tests most accurately reflects spinal mobility in patients with AS.
Footnotes
- Supported by the University of Otago (PhD Scholarship).
- Accepted for publication October 3, 2014.