Abstract
Objective. To evaluate the association between patient-reported outcome (PRO) and performance-based (PB) measures of physical functioning (PF) among individuals with self-identified arthritis to inform decisions of which to use when evaluating the effectiveness of a physical activity intervention.
Methods. Secondary data analysis of a nonrandomized 2-arm pre-post community trial of 462 individuals who self-identified as having arthritis and participated in the Walk With Ease (WWE) intervention. Two PRO and 8 PB assessments were collected at baseline (preintervention) and at 6-week followup. We calculated correlations between PB and PRO measures, assessed how measures identified changes in PF from baseline to followup, and compared the correlations of PRO and PB measures with arthritis symptoms of pain, stiffness, and fatigue.
Results. Strength of correlations between PB and PRO measures varied depending on the PB measure, ranging from 0.21 to 0.54. PRO and PB measures identified PF improvements from baseline to followup, but none showed significant differences between the 2 WWE modalities (instructor-led or self-directed groups). Correlations with arthritis symptoms were stronger for PRO (0.30–0.46) than PB measures (0.03–0.31).
Conclusion. PRO measures may provide insights into aspects of PF that are not identified by PB measures alone. Use of PRO measures allows patients to communicate their perceptions of PF, which may provide a more accurate representation of overall PF. Our study does not suggest abandoning the use of PB measures to characterize PF in patients with self-identified arthritis, but suggests that PRO measures may serve as complementary or surrogate endpoints for some studies.
Researchers conducting evaluations of behavioral interventions such as exercise often include physical functioning (PF) as an outcome to determine effectiveness. PF is a unique endpoint because it can be assessed by a variety of methods, including observable performance-based (PB) tasks and patient-reported outcome (PRO) assessments using measures such as the National Institutes of Health Patient-Reported Outcomes Measurement Information System (PROMIS).
Each assessment method has strengths and limitations. PB measures of PF are directly observable and use well-accepted metrics such as “time to complete” a task measured by a stopwatch; however, PB measures require an observer to be present. Thus, PB assessments are often scheduled in clinic. Additionally, for multisite studies, there may be observer-to-observer measurement error when setting up the task or using the timer. PRO measures of PF have the advantage of a consistent measure (e.g., everyone uses the PROMIS PF short form) and the measure can be completed in clinic, at assessment sites, or from home as often as justifiably needed. Also, PRO measures allow patients to evaluate their own performance and limitations, and improving a PRO score may be more meaningful than improving a PB score. However, PRO measures are considered subjective and may be more likely than PB measures to be biased by respondent characteristics such as sex, age, or race/ethnicity1. Sources of bias may relate to over- or underestimation of PF or differences in the ways groups of people interpret particular items2. Some PRO questions such as “going up or down stairs” may also present problems if the patient has not had the opportunity to do the task.
Given the advantages and limitations of both types of assessments, researchers may feel inclined to include both in a research study; however, selecting both poses additional costs to collect data and further challenges to determine which method (PRO or PB) should be used as the primary endpoint to determine treatment effectiveness. If choosing 1 assessment method, which one? The goal of our study was to inform this decision through an analysis of secondary data generated from the evaluation of a walking physical activity intervention in adults with arthritis.
Arthritis, the most common cause of disability in the United States, especially among older adults, is an exemplar for addressing these issues3. The prevalence of arthritis has grown with the increase in obesity and it is projected that by 2030, 67 million adults will be affected by arthritis3. In addition to several uncomfortable symptoms, functional disability is a serious consequence of arthritis, which many clinical tests are unable to identify4.
The parent study evaluated a 6-week, community-based program Walk With Ease (WWE), which was developed to help individuals with arthritis to reduce symptoms5,6. WWE aimed to educate those affected by arthritis about the benefits of physical activity, increase awareness of symptom management, and offer a convenient, low-cost, moderate-intensity fitness regimen5,7. The parent study collected PRO and PB assessments of PF, along with other arthritis-relevant symptoms, at baseline and 6-week followup (end of study). Published results used PB and Health Assessment Questionnaire (HAQ) measures and found that WWE improved PF over time6.
With these data, our study addressed the following research questions that will inform future decisions on the use of specific PB and PRO-based measures of PF in research studies:
Research Question 1: Do PB and PRO measures identify the same concept of PF? This will be addressed through looking at associations between PB and PRO measures.
Research Question 2: Do PB and PRO measures provide similar results of the evaluation of the effectiveness of the WWE intervention over time and between modalities? This will be evaluated using standardized effect size estimates of change over time and differences-in-differences estimates between modalities for PB and PRO measures individually.
Research Question 3: Which method for measuring PF is more associated with arthritis-relevant symptoms of pain, stiffness, and fatigue? This will be evaluated through looking at associations of PB and PRO measures with self-reported symptom measures.
MATERIALS AND METHODS
Data
Data came from a nonrandomized 2-arm pre-post community trial of individuals with self-reported arthritis6. The study enrolled nearly 500 participants and was conducted at 33 sites. Participants were aged 18 years and older, English-speaking, cognitively able, and did not have serious medical conditions beyond arthritis6. Participants self-selected to be in the instructor-led or self-directed group. Baseline PB and PRO-based assessments were collected on the same day and followup assessments were collected 6 weeks later6. Institutional Review Board permission was obtained from the University of North Carolina.
PRO measures
The PROMIS PF measure was collected using Computerized Adaptive Testing (CAT) technology available in the Assessment Center and used a current recall period8. Participants completed about 5 questions per CAT9. Questions in the CAT were selected using maximum posterior-weighted information criteria10, and the CAT stopped when the maximum standard error was 0.3. CAT tailored assessments based on an individual’s response to each question so administered items maximized the ability of the PROMIS CAT to measure a person’s PF with the minimal number of questions8. Completed items were scored based on item response theory-calibrated variables to derive a PROMIS PF T score metric using the expected a posteriori estimator11. T scores have a mean of 50 (SD 10) in the US general population with higher scores reflecting better PF. An example of a PROMIS item is: “Are you able to walk a block on flat ground?” Response options include “1. Unable to do,” “2. With much difficulty,” “3. With some difficulty,” “4. With a little difficulty,” and “5. Without any difficulty.9”
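As a minimal illustration of the T score metric described above, the sketch below rescales an ability estimate to a scale with mean 50 and SD 10. It assumes the ability estimate (called theta here) is already expressed in reference-population SD units with mean 0; the function name is hypothetical, and the sketch does not reproduce the expected a posteriori estimation itself.

```python
def theta_to_t_score(theta: float) -> float:
    """Rescale an ability estimate (assumed mean 0, SD 1 in the
    reference population) to the T score metric (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


# A respondent 1.5 SD below the reference mean has a T score of 35;
# a respondent at the reference mean has a T score of 50.
below_average = theta_to_t_score(-1.5)
at_mean = theta_to_t_score(0.0)
```

On this metric, a 5-point difference corresponds to half an SD in the reference population.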
The HAQ measure of PF used a 7-day recall period and was collected using paper-based surveys at community sites6. The HAQ includes 20 items that sum together (0 to 60), with higher scores representing poorer PF12. Of the 20 items, 11 cover upper body mobility (e.g., shampoo hair, open a new milk carton) and 9 cover lower body mobility (e.g., climb up 5 steps, walk outdoors on flat ground)12. An example HAQ item is: “What is your ability to carry out daily activities?” Response options are “0. Without any difficulty,” “1. With a little difficulty,” “2. With some difficulty,” “3. With much difficulty,” and “4. Unable to do.12”
A visual analog scale (VAS) was used to measure 3 symptoms reported by patients with arthritis: pain, stiffness, and fatigue13,14. VAS uses a 100-mm line and participants are asked to mark a spot on the line reflecting pain experienced in the last 7 days13,14. The line ranges from “no pain” (furthest point left) to “pain as bad as it could be” (furthest point right)13,14. The same type of scale was used for stiffness and fatigue symptoms with higher scores indicating greater stiffness or fatigue6,13,14.
PB measures
Eight PB measures administered by a trained assessor were intended to assess several PF components. The assessor was blinded to intervention assignment for each patient. The 8 tests included timed chair stands, timed left and right turns, left and right single-leg standing assessments, 4 walking speed tests over a 20-foot stretch, and finally the 2-min step test15,16,17,18. Timed chair stands, turn tests, and single-leg stands were measured in seconds15,16. Walking speed was measured in meters per second, and the 2-min step test was measured in number of steps in 120 s15,16,17. Three timed chair stands assessed lower extremity strength6. The 360° turn tests and single-leg stands assessed balance15,16,18. The normal (average of 2 tests) and fast walking (average of 2 tests) scores measured one’s functional mobility, and lastly, the step test measured an individual’s aerobic endurance15,16. For most PB measures, higher scores indicated better PF, but for chair stands and right and left turns, higher scores indicated worse PF. Traditionally, the 8 PB measures were used together to reliably assess an individual’s PF15,16,17,18.
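For readers who want to see how a timed walk becomes a speed in meters per second, the following sketch converts the time taken to cover the 20-foot stretch and averages 2 trials. Whether the parent study averaged the speeds or the times is not stated, so averaging speeds is an assumption here, and the function names and trial times are hypothetical.

```python
FEET_TO_METERS = 0.3048  # exact by definition


def walking_speed_m_per_s(time_seconds: float, distance_feet: float = 20.0) -> float:
    """Speed over a fixed course, in meters per second."""
    if time_seconds <= 0:
        raise ValueError("time_seconds must be positive")
    return distance_feet * FEET_TO_METERS / time_seconds


def average_speed(times_seconds):
    """Mean of the per-trial speeds (assumption: speeds, not times, are averaged)."""
    speeds = [walking_speed_m_per_s(t) for t in times_seconds]
    return sum(speeds) / len(speeds)


# Two illustrative normal-pace trials of 5.0 s and 6.0 s (not study data).
normal_walk = average_speed([5.0, 6.0])
```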
Missing data
For both PRO and PB measures, most missing data were from followup assessments. At baseline, 4% were missing HAQ or PROMIS scores, and at followup, 14% were missing a HAQ score and 29% were missing a PROMIS score. At baseline, 2–8% of PB measures had missing data, and at followup, 33–37% of PB measures had missing data. Using chi-square tests to evaluate associations between missing data and covariates listed in Table 1, we could not determine a missing-data pattern by demographics or baseline PF, so we assumed the data were missing at random and used complete case analysis.
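The kind of chi-square check described above can be sketched for a single dichotomous covariate as follows. This is an illustration under stated assumptions, not the analysis code used in the study (which was run in Stata): it applies the Pearson chi-square test without continuity correction to a 2 x 2 table, the function name is hypothetical, and the counts are invented for demonstration.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    table is [[a, b], [c, d]], e.g. rows = covariate level and
    columns = (missing, observed). Returns (statistic, p_value), 1 df.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts: missingness is equally common in both covariate levels,
# so the statistic is 0 and there is no evidence of an association.
stat_equal, p_equal = chi_square_2x2([[40, 160], [10, 40]])

# Invented counts with a clear imbalance in missingness.
stat_unequal, p_unequal = chi_square_2x2([[10, 20], [20, 10]])
```

A nonsignificant result for every covariate is consistent with, but does not prove, the missing-at-random assumption.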
Sample characteristics
Self-reported demographic characteristics included age, sex, marital status, race/ethnicity, level of education, and body mass index (BMI). In analyses, age and BMI were continuous. Education was grouped as less than high school, high school graduate, and more than high school. Marital status was dichotomized as married or not. Race/ethnicity was grouped as non-Hispanic white, African American, and other (Hispanic, Asian, multi-race/unknown race).
Statistical analysis
Unadjusted comparisons of demographic characteristics between instructor-led and self-directed walking groups were conducted using chi-square and Student t tests.
Research Question 1: Pearson correlations were calculated between PRO and PB measures at baseline and followup for the entire cohort and stratified by WWE modality. Correlation strength was defined as weak (0.0 to < 0.1), modest (0.1 to < 0.3), moderate (0.3 to < 0.5), strong (0.5 to < 0.8), and very strong (0.8–1.0)19.
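A minimal sketch of this step is given below: a Pearson correlation computed from paired scores, followed by a label based on its absolute value. The band edges follow the classification above, reading the weak band as 0.0 to < 0.1 so that the bands are contiguous (an assumption); the function names are hypothetical, and the paired-score lists in the test are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r):
    """Label a correlation by its absolute value (sign only shows direction)."""
    magnitude = abs(r)
    if magnitude < 0.1:
        return "weak"
    if magnitude < 0.3:
        return "modest"
    if magnitude < 0.5:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"
```

Using the absolute value matters here because measures scored in opposite directions (for example, PROMIS and HAQ) are expected to correlate negatively.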
Research Question 2: To compare measures with different metrics, PB and PRO measures were individually standardized to Z-scores by subtracting each measure’s mean and dividing by the SD. Standardized effect sizes from baseline to 6-week followup were calculated for PB and PRO measures and stratified by WWE modality. Within-modality effect size (ES) was calculated as the difference between average baseline and followup scores divided by the baseline SD6. We used Cohen’s classification of ES magnitude: < 0.32 was considered “small,” 0.33–0.55 “medium,” and 0.56–1.2 “large”20. A difference-in-difference model using standardized scores evaluated differences between PB and PRO measures of PF in identifying change in PF between WWE modalities. Sensitivity analyses adjusting for demographic covariates were conducted, but results are not shown.
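The calculations for this research question can be sketched as follows. This is a simplified illustration rather than the model fitted in the study: the difference-in-differences is computed as a plain difference of mean changes with no standard error or covariate adjustment, scores are standardized against the pooled mean and SD of all observations (an assumption, because the reference mean and SD are not specified), and the ES bands are closed at 0.33 and 0.56 to remove the gaps in the published cutoffs (also an assumption). All function names and scores are hypothetical.

```python
import statistics


def within_group_effect_size(baseline, followup):
    """(mean followup - mean baseline) / SD of the baseline scores."""
    change = statistics.mean(followup) - statistics.mean(baseline)
    return change / statistics.stdev(baseline)


def effect_size_label(es):
    """Label ES magnitude; band edges at 0.33 and 0.56 are an assumption."""
    magnitude = abs(es)
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.56:
        return "medium"
    return "large"


def difference_in_differences(base_a, follow_a, base_b, follow_b):
    """Mean standardized change in group A minus that in group B."""
    pooled = list(base_a) + list(follow_a) + list(base_b) + list(follow_b)
    mean, sd = statistics.mean(pooled), statistics.stdev(pooled)

    def mean_z(scores):
        # Mean of the Z-scores for one group at one time point.
        return statistics.mean((s - mean) / sd for s in scores)

    change_a = mean_z(follow_a) - mean_z(base_a)
    change_b = mean_z(follow_b) - mean_z(base_b)
    return change_a - change_b


# Invented scores for illustration only.
baseline = [40, 45, 50, 55, 60]
es = within_group_effect_size(baseline, [42, 47, 52, 57, 62])
did = difference_in_differences(baseline, [44, 49, 54, 59, 64],
                                baseline, [42, 47, 52, 57, 62])
```

For measures where higher scores indicate worse PF (chair stands, turns, HAQ), the sign of the change would need to be reversed before comparing improvements across measures.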
Research Question 3: Pearson correlations between PB and PRO measures with VAS symptoms (pain, fatigue, and stiffness) were calculated. We expected there would be moderate associations between PF and these symptoms.
Analyses were performed in Stata (version 13.1) with 2-sided statistical tests and a significance level of 5%.
RESULTS
Participant characteristics
There were 462 adults who self-identified as having arthritis and who participated in our study. Respectively, the instructor-led and self-directed groups included 192 and 270 adults. Demographic characteristics are shown in Table 1. Marital status, BMI, and race were similarly distributed between WWE modalities. In both groups, there were higher proportions of women (85% and 90%). Level of education was significantly different between groups, with the instructor-led being less educated than the self-directed group. Further, the instructor-led group tended to be older, with mean age of 70.6 years compared with the self-directed group mean age of 64.4 years.
PROMIS and HAQ correlations
Baseline PROMIS and HAQ scores ranged from 26–63 and 0–56.25, respectively. Followup PROMIS and HAQ scores ranged from 25–61.75 and 0–58.75, respectively. Correlations between PROMIS and HAQ scores were negative because higher PROMIS scores indicate better PF, while larger HAQ scores indicate worse PF. At baseline, correlations between PRO measures of PF were strong at −0.68 for the overall cohort (instructor-led group: −0.64, self-directed group: −0.72). At followup, correlations were slightly stronger at −0.72 overall (instructor-led group: −0.71, self-directed group: −0.73).
Research Question 1: Do PB and PRO measures identify the same concept of PF?
Because higher PROMIS scores indicate better PF and higher HAQ scores indicate worse PF, expected correlations between PRO and PB measures were positive for PROMIS and negative for HAQ for most PB outcomes. However, for chair stands and right/left turns, expected correlations between PB measures were negative with PROMIS and positive with HAQ.
As noted in Table 2, strength of correlations between PB and PRO measures varied depending on the PB measure. Modest associations with PRO measures were observed for single-leg stances. Moderate correlations with PRO measures were observed for steps, chair stands, and right/left turns. Strong associations were observed for normal/fast walk. All correlations between PB and PRO measures were statistically significant (p < 0.05).
Research Question 2: Do PB and PRO provide similar conclusions of the evaluation of the effectiveness of the WWE intervention over time and between modalities?
Unadjusted standardized ES for PB measures (except number of steps) showed small improvements from baseline to followup for the self-directed group and moderate improvements for the instructor-led group (Table 3). PRO measures of PF were consistent, with small ES of 0.20–0.22 for PROMIS and 0.18–0.20 for the HAQ. Unadjusted standardized ES presented are similar in magnitude and direction to adjusted ES reported in the parent study6.
Standardized differences-in-differences results are shown in Table 4. Neither PB nor PRO measures showed statistically significant differences in PF changes between WWE modalities over time: although both types of measures identified improvements, the improvements did not differ significantly between modalities. This conclusion was consistent across PB and PRO measures when demographic characteristics were included.
Research Question 3: Which method for measuring PF is more associated with VAS symptoms of pain, stiffness, and fatigue?
As expected, correlations between PROMIS and VAS measures were negative and correlations between HAQ and VAS measures were positive (Table 5). Both PRO measures had stronger correlations with VAS measures than any PB measure (Table 5). However, correlations between PRO and VAS measures were only moderate, ranging in absolute value from 0.30 to 0.46.
PB measures generally had poor correlations with VAS measures and some correlations were not statistically significant (Table 5). PB and VAS correlations ranged in absolute value from 0.03 to 0.31, with chair stands having the strongest correlations.
DISCUSSION
Within the context of evaluating the effectiveness of a walking intervention program for individuals with arthritis, our study examined 2 types of measures (PB and PRO) of PF. The overall goal is to inform investigators wishing to include similar endpoints in future studies. Our study examines how 8 types of PB and 2 PRO measures of PF are related and how they perform when measuring changes in PF over time and between 2 intervention modalities.
The first question we address is the extent to which PB and PRO measures in our study identify the same concept of PF. Fair to moderate correlations (0.21 to 0.49) were observed, with higher associations between PB measures of normal and fast walking and both PRO measures. The lack of stronger associations between PB and PRO measures is not surprising because PB measures assess specific body parts or particular skills, while PRO measures include questions combining several body parts and skills to provide a comprehensive representation of PF. For instance, timed chair stands home in on lower extremity strength, which depicts 1 aspect of PF. PRO measures relate PF to activities of daily living in an attempt to convey a holistic view of PF.
The second question we examined was the ability of PB and PRO measures to detect changes over time and between intervention modalities. The WWE program was expected to improve PF from baseline to 6-week followup. Previous studies found that moderate-intensity exercises resulted in notable improvements in the strength, balance, and functional status of patients with arthritis5. In our study, we found that 6 of 8 PB measures showed small to moderate improvements (0.12–0.49) with slightly higher effect sizes for the instructor-led arm, while both PRO measures found small improvements (0.18–0.22) in both arms over time. Thus, most PB measures and both PRO measures correctly detected improvements in PF consistent with prior literature. To our knowledge, no prior studies have been conducted to determine differences between WWE modalities tested in the parent study. Neither PB nor PRO found statistically significant differences between modalities. Together, our comparisons show either PRO measure or the set of PB measures could be used to determine the WWE program effectiveness with respect to the study design.
The third question examined the association between PB and PRO measures of PF with VAS symptoms of pain, fatigue, and stiffness, which the WWE program aimed to reduce. Consistently, we found that PRO measures had stronger correlations than PB measures with pain (PRO 0.38–0.46, PB 0.03–0.31), fatigue (PRO 0.33–0.38, PB 0.05–0.20), and stiffness (PRO 0.30–0.41, PB 0.05–0.30). What may partly drive higher associations with PRO measures is that pain, fatigue, and stiffness were measured by self-report and we do not have clinical measures of each; however, it is accepted that the gold standard for measuring these symptoms is by self-report. Thus, findings support stronger evidence for construct validity (i.e., convergent validity) of PRO-based measures based on their association with clinically important arthritis symptoms.
Our findings appear to be consistent with published literature, confirming that PRO measures are a viable way to measure PF. A study in patients with multiple sclerosis suggested PRO and PB measures assess independent constructs of PF because of poor correlations between types of measures21. The study explained that PB measures focus on specific movements and do not allow us to identify overall quality of life, while PRO measures reflect PF beyond symptom effect21. Another study of patients with osteoarthritis following joint replacement recognized that PRO measures better represent patient satisfaction because they relay the patient’s own perception of PF22. Finally, a study of patients with osteoporosis compared different PRO and PB measures, found moderate correlations, and concluded PRO instruments identified changes in daily activities of PF “quite well”23.
Limitations
There are limitations to our analyses. First, participants self-identified as having arthritis and we did not have clinical confirmation of diagnosis, which is a limitation if they are not similar to individuals with clinically diagnosed arthritis. In addition, participants self-selected to be in the instructor-led or self-directed group, which could lead to selection bias because treatment is not random. Another concern is discrepancy in sample size between men and women; however, this did not vary by WWE modality. There may be measurement error in the way PB and PRO measures were collected, which may affect results. We also do not know the extent to which these results generalize to PB and PRO measures not used in our study or to other therapeutic areas. We used an intent-to-treat approach and have no information about compliance with WWE, which could bias results if 1 group was less likely to comply. There were some missing data in the HAQ, PROMIS, and PB measures, with the self-directed group having a greater proportion of missing data than the instructor-led group. We could not find a relationship between missing data and demographics or baseline PF (i.e., people with worse PF having more missing data). We were also unable to adjust for clinical characteristics that could affect PF because these measures are not available in our dataset.
The parent study evaluation of the effectiveness of 2 WWE modalities would have yielded similar findings had it used 8 PB measures or 1 PRO measure. If time, costs, and participant and administrator burdens are irrelevant, investigators may wish to include both PB and PRO measures to provide comprehensive evaluations of the effect of the intervention on PF. However, time, costs, and burden are often challenges for studies. PB measures require (1) a trained assessor, thus necessitating that measurements take place in a clinic or assessment site, (2) training of assessors in multisite studies and followup to maintain data collection consistency, (3) time to complete tasks, and (4) funds to pay assessors and participant incentives. In comparison, PRO measures (1) do not require an observer, although one may have to be available for technical problems accessing surveys, (2) are shorter to complete (e.g., PROMIS CAT administered about 5 questions), and (3) can be completed by participants more often, in the convenience of the clinic or at home. Biased responses to PRO measures based on group characteristics such as age, sex, and race/ethnicity can be reduced using strong PRO measure design principles and psychometric evaluations.
Noting previously discussed limitations, including that the study sample self-reported their arthritis conditions, our study provides support for the use of PRO measures of PF as indicators of treatment effectiveness in research studies. Relative to PB measures, collecting PRO data from patients at their convenience saves costs and time, especially when other PRO endpoints such as fatigue and pain are also being collected. Multisite trials will also benefit from consistent measures used across sites with electronic PRO data automatically stored in coordinating centers.
The use of PRO PF measures may allow us to glean insight into aspects of PF that are not identified by PB measures alone. Use of PRO measures allows patients to communicate their own perceptions of PF, which may lead to more accurate representations. Although our conclusions do not suggest abandoning the use of PB measures to identify PF, they suggest that PRO measures serve as complementary or surrogate endpoints.
Footnotes
Supported by a cooperative agreement between the US Centers for Disease Control and Prevention and the Association of American Medical Colleges (MM-0975-07/07).
Accepted for publication September 30, 2015.