1 Introduction
In the clinical and mental health domain, depression has likely received one of the greatest investments of research hours and resources in the history of scientific research. As noted by Fried et al. (2022), two depression rating scales rank among the hundred most-cited scientific papers across all research fields, highlighting the global interest and substantial resources devoted to measuring depressive symptoms and to understanding and treating depression. However, despite extensive empirical efforts, considerable gaps remain in understanding depression’s nature (Klein, 2024; Fried et al., 2022) and in developing effective intervention strategies to reduce its substantial individual and societal burdens (James et al., 2018). One particularly neglected area concerns methodological decisions in depression measurement, a foundational aspect of our ability to understand and treat depression (Fried et al., 2022). Among these decisions, the optimal choice of response scale format for depression screening tools, a factor with important implications for measurement quality, remains insufficiently explored.
Self-report instruments are a common method for depression screening, typically using severity- or frequency-based response scales (e.g., PHQ-9, Kroenke et al., 2001; IDS, Nolen & Dingemans, 2004; BDI-II, Beck et al., 1996; CES-D, Radloff, 1977). These scales vary considerably in their number of response scale points, usually ranging between two (Yesavage et al., 1982) and six (Bech et al., 2001). Occasionally, even the same depression assessment instrument employs response scales of varying widths across different items (Shi et al., 2021). The Patient Health Questionnaire (PHQ-9) is one of the most frequently used among these measures due to its brevity, ease of administration, and robust psychometric properties demonstrated across diverse populations and clinical settings. The PHQ-9’s prominence is such that it is even becoming a mandatory measure for funding certain depression research programs (Farber et al., 2020). Despite its widespread use and commendable psychometric record, fundamental methodological decisions underlying its design, such as the rationale behind its chosen response scale, remain empirically understudied.
Crucially, unresolved issues persist regarding the internal structure of the PHQ-9, raising important psychometric concerns. Although many studies support a unidimensional factor structure (Chae et al., 2024), acceptable model fit typically requires numerous correlated residuals, compromising interpretability and raising concerns about replicability across diverse samples. Alternative two-factor solutions dividing items into somatic and cognitive-affective symptom dimensions (Krause et al., 2008; Lamela et al., 2020) often provide superior model fit, yet discrepancies persist concerning the allocation of certain items, particularly psychomotor disturbance and concentration difficulties, between these two factors. University student populations have frequently demonstrated poor fit for the unidimensional PHQ-9 unless multiple residual correlations are modeled (Makhubela & Khumalo, 2023; Lingán-Huamán et al., 2023; Monteiro et al., 2019), again underscoring the persistent ambiguity regarding its internal structure. A possible contributing factor to this ambiguity may be the choice and structure of the response scale itself, as previous psychometric research suggests that varying response scale widths may substantially influence factor analytic outcomes.
The PHQ-9 traditionally employs a 4-point unipolar symptom-frequency scale, designed to capture respondents’ experience of depressive symptoms over the past two weeks, ranging from “not at all” to “nearly every day” (Kroenke et al., 2001). Although this specific response format became the default and most widely used for PHQ-9 assessments, the original authors provided limited empirical rationale for selecting precisely four response categories. Recent studies applying item response theory (IRT) to the PHQ-9 indicate redundancy or indistinctiveness between its middle response categories, prompting some researchers to combine categories either before administering the questionnaire (Gothwal et al., 2014) or during subsequent data analyses (Christensen et al., 2017; Dyer et al., 2016; Pedersen et al., 2016). Such adjustments, while addressing psychometric concerns, introduce additional variability in practice and potentially affect the comparability and interpretability of depression scores across different studies and clinical contexts. Although, to the best of our knowledge, the PHQ-9 has not been administered with more than 4 scale points, there has been an increasing proliferation of ecological momentary assessment (EMA) approaches in depression research, characterized by frequent, repeated assessments of depressive symptoms within participants’ daily lives. EMA studies frequently utilize the PHQ-9 or subsets of its items but have shown considerable flexibility in adapting response scale formats, employing scales with wider ranges such as bipolar 7-point scales (e.g., Baryshnikov et al., 2023) or even visual analog scales (e.g., Bowen et al., 2017). This emerging practice underscores researchers’ implicit assumption that the quality of depression assessment is not crucially altered by these changes in response scale format.
More broadly, across psychological science, marketing, health, and the social sciences, extensive empirical research has explored how response scale widths influence the psychometric properties of self-report instruments. Considerable evidence suggests that some distributional properties (Leung, 2011; Wu & Leung, 2017), as well as reliability, discriminative capacity, and structural validity (Preston & Colman, 2000; Lozano et al., 2008; Wakita et al., 2012), improve with up to 7 points or sometimes even beyond, although these findings are not universally consistent and the results depend on multiple factors, including the construct being researched (Garner, 1960; Cox III, 1980; Preston & Colman, 2000; Abulela & Khalaf, 2024).
Psychometric research has consistently indicated that the number of response scale points influences the reliability and structural validity of psychological measures. Increasing the number of response categories up to approximately seven generally enhances internal consistency reliability by capturing greater variability between respondents (Abulela & Khalaf, 2024). Structural validity, particularly fulfillment of factor analytic assumptions, factor loadings, model fit, and variance explained, also appears sensitive to the choice of response scale width (Maydeu-Olivares et al., 2017; Abdelsamea, 2020; Xu & Leung, 2018). However, findings remain inconsistent, with some studies suggesting that narrower scales (e.g., fewer than four categories) increase skewness, negatively impact loadings, and inflate measurement error, particularly in confirmatory factor analytic contexts (Xu & Leung, 2018; Dolan, 1994; Hall, 2017). Evidence concerning the influence of response scale width on convergent, criterion-related, and external validity remains scarce and inconsistent. For instance, studies assessing personality or clinical constructs have found minimal or inconsistent differences in convergent validity correlations when altering scale widths (Simms et al., 2019; Rakhshani et al., 2024).
In clinical assessment, the effects of response scale width on psychometric properties have been examined less frequently. Most of the findings revolve around the MMPI-2-R RC scales (Cox et al., 2012; Finn et al., 2015; Cox et al., 2017; Courrégé & Weed, 2019), where augmenting the original true/false scale to a 4-point scale increased reliability without noticeable trends in convergent validity change. In research on the consequences of trauma, studying anger control in veterans, Hawthorne et al. (2006) condensed the Dimensions of Anger Reactions scale (DAR; Forbes et al., 2004) from 9 to 5 scale points, which they concluded reduced response bias without compromising sensitivity. In ADHD assessment, the number of scale points affected several CFA indicators, including loadings and standard errors, fit, individual scores, and their reliability, which is why Shi et al. (2021) advise researchers to be cautious when condensing the response scale width from 4 to 2 scale points.
Taken together, current research practices reflect an implicit assumption that variations in response scale widths do not significantly influence the quality of depression assessment, particularly for the PHQ-9. However, extensive psychometric literature from other fields, as well as from clinical assessment, suggests that the number of scale points impacts measurement, and that 2-point scales in particular could be detrimental to various psychometric indices (Shi et al., 2021). Ambiguities surrounding the internal structure of the PHQ-9 and inconsistent reliability and validity findings from broader psychometric research collectively underscore the importance of empirically examining response scale width effects specifically in depression screening contexts. Addressing these gaps through a rigorous, within-participant experimental design will provide critical insights into optimizing PHQ-9 response scale design, thus enhancing measurement accuracy, diagnostic reliability, and clinical utility.
1.1 Aims of the study
The primary aim of this study is to systematically investigate how varying the number of response scale points affects the psychometric properties of the PHQ-9.
We will address the following research questions:
How do distributional characteristics (e.g., ceiling and floor effects, mean, variability, skewness, and kurtosis) and the percentage of participants above the PHQ-9 cutoff vary across response scales with different numbers of scale points?
How does the reliability of the PHQ-9 change when administered with different response scale widths?
How do the internal factor structure and model fit indices of the PHQ-9 differ across varying response scale widths?
How do convergent and external validity correlations of the PHQ-9 scores vary as a function of the number of response scale points?
Clarifying these methodological considerations will offer guidance for researchers and practitioners regarding optimal response scale selection, ultimately enhancing the accuracy, interpretability, and clinical utility of PHQ-9 assessments.
2 Methods
2.1 Participants
2.2 Instruments
Patient Health Questionnaire – PHQ-9 (Kroenke et al., 2001) is a questionnaire designed for depression screening. It consists of 9 items corresponding to the 9 symptoms listed in the DSM-IV/DSM-5. The response options indicate the frequency of each symptom over the last 2 weeks and are given on a 4-point Likert scale: 0 = ‘not at all’, 1 = ‘several days’, 2 = ‘more than half the days’, and 3 = ‘nearly every day’.
Center for Epidemiological Studies Depression Scale Revised – CESD-R-10 (Björgvinsson et al., 2013) is a depression symptoms scale designed primarily for epidemiological screening purposes. The original version of the instrument consists of 20 items (Radloff, 1977), while the revised version selected for this research has 10 items (Björgvinsson et al., 2013). The items correspond to various states, thoughts, and feelings characteristic of depression. The responses are given on a Likert-type scale, which indicates the frequency of symptoms on these four levels: ‘Rarely or none of the time (less than 1 day)’, ‘Some or a little of the time (1‐2 days)’, ‘Occasionally or a moderate amount of time (3‐4 days)’, ‘All of the time (5‐7 days)’.
Big Five Inventory – BFI-S (Lang et al., 2011) is a 15-item shortened form of the larger BFI-44/BFI-54 scale versions (John et al., 1991). It measures the five widely accepted Big Five personality traits: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The BFI-S revision proposes a 7-point Likert scale, but the version used in this research will follow the original 5-point format (from 1 = ‘disagree strongly’ to 5 = ‘agree strongly’) used in the BFI-44/BFI-54. The original BFI-44/BFI-54 scale format is selected because the 5-point format is closer to the 4-point formats used by the PHQ-9 and the CESD-R-10, all of which will be given together in the first measurement questionnaire battery.
PANAS (Watson et al. (1988)) is used for the assessment of two mood dimensions – positive affect (positive feelings and engaging in pleasurable activities) and negative affect (feeling of distress and a selection of aversive states of emotion, such as anger, disgust, guilt or nervousness). The scale contains 20 items, 10 items assessing each of the two mood dimensions. The answers are given on a 5-point Likert scale ranging from 1 = ‘not at all’ to 5 = ‘extremely’.
Each scale was end-anchored, i.e., verbal labels were provided only for the endpoints (similar to Preston & Colman, 2000).
2.3 Procedure
2.4 Statistical analysis
To assess distributional properties, we examined ceiling and floor effects by calculating the proportion of respondents receiving the minimum and maximum possible score across the range of response scale points assessed, as well as the number of participants scoring above the cutoff for each response scale condition, where the PHQ-9 cutoff score of >= 8 (Mihić et al., 2024) was used. To ease comparison between varying PHQ-9 response scale widths, we converted sum scores and the cutoff score to the Percentage of the Maximum possible score (POMP; Cohen et al., 1999) by dividing the difference between the observed and minimum score by the difference between the maximum and minimum score and multiplying the result by 100. We then tested for trends in the POMP mean scores, skewness, and kurtosis using polynomial ANOVA.
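As a concrete illustration, the POMP conversion described above amounts to a simple linear rescaling. The Python sketch below is illustrative only (the study's analyses were not necessarily run in Python); the example values use the default PHQ-9 scoring:

```python
def pomp(score, min_score, max_score):
    """Percentage Of Maximum Possible (POMP; Cohen et al., 1999):
    0 corresponds to the scale minimum, 100 to the scale maximum."""
    return 100 * (score - min_score) / (max_score - min_score)

# PHQ-9 with the default 4-point (0-3) response scale has 9 items,
# so sum scores range from 0 to 27; the >= 8 cutoff in POMP units:
cutoff_pomp = pomp(8, 0, 27)   # ~29.6
```

Because the transformation is anchored at each condition's own minimum and maximum, sum scores and cutoffs from response scales of different widths land on the same 0-100 metric.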
Reliability of the PHQ-9 was assessed using three commonly used coefficients: Cronbach’s Alpha, McDonald’s Omega Total, and Omega Categorical. We calculated the coefficients for each response scale width and compared the results across the range of scale points, using 99% BCa bootstrap confidence intervals to assess the precision of the reliability estimates. Test-retest reliability was calculated as the Pearson correlation between the baseline PHQ-9 scores and the scores obtained with the same response scale width in the last measurement. Additionally, the intraclass correlation coefficient was calculated as a measure of test-retest reliability between the baseline and repeated 4-point PHQ-9 scales.
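For reference, coefficient Alpha can be computed directly from the item and sum-score variances. The following is a minimal Python sketch on toy data, not the study's actual scores or analysis code:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the sum scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

toy = [[0, 1, 0], [1, 2, 1], [2, 3, 2], [3, 3, 3]]  # 4 respondents, 3 items
alpha = cronbach_alpha(toy)  # ~0.98 for these strongly covarying items
```

The Omega coefficients and the BCa bootstrap intervals require a fitted factor model and resampling, respectively, and are therefore not sketched here.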
Internal structure of the instrument was examined using confirmatory factor analysis (CFA): we tested the one-factor model of the PHQ-9 with a diagonally weighted least squares (DWLS) estimator for 2 to 12 scale points and with a robust maximum likelihood (MLR) estimator for the entire span of response scale widths examined. We used the following fit indices to assess model fit: chi-square/df ratio, RMSEA, CFI, TLI, and SRMR. Following the recommendations of Abulela & Khalaf (2024) and Shi et al. (2021), we examined parameter estimates, i.e., standardized factor loadings and the corresponding standard error estimates, of the final solution.
For model fit assessment, we used the commonly applied fit index cutoffs (RMSEA <= .06, SRMR <= .08, CFI and TLI >= .95) suggested by Hu & Bentler (1999). Since these traditional cutoff values were established to achieve sensitivity to misspecification of at least 90% in continuous data, for RMSEA, SRMR, and CFI we also calculated Dynamic Fit Index cutoffs (DFI; McNeish, 2024), proposed to ensure comparable sensitivity when the data are of a Likert type, for the purpose of comparison.
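Because RMSEA is derived from the model chi-square, its dependence on the degrees of freedom and the sample size can be made explicit with its standard point-estimate formula. The numbers below are hypothetical, chosen only to illustrate the computation:

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of RMSEA from the model chi-square statistic,
    its degrees of freedom, and the sample size n; truncated at 0."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical example: chi2 = 120 on 27 df with n = 400 respondents
rmsea(120, 27, 400)   # ~0.093, above the conventional .06 cutoff
```

Robust and scaled estimators adjust the chi-square before this computation, which is one reason fit values differ between the DWLS and MLR approaches compared later in the Results.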
3 Results
3.1 PHQ-9 score distributional properties
Figure 1 presents descriptive statistics of PHQ-9 scores across different scale point conditions. The top panel indicates a decline in the percentage of participants remaining above the cutoff as the number of scale points increases from 2 to 20. A quadratic regression analysis examined the relationship between the number of scale points and the percentage of participants scoring above the PHQ-9 clinical cutoff. The model was statistically significant (R² = .77, F(2, 16) = 27.08, p < .001), with a significant linear effect (β = -1.55, t(16) = -3.998, p = .001) and a significant quadratic effect (β = 0.04, t(16) = 2.566, p = .021), implying that the rate of decrease diminishes as the number of scale points increases. No ceiling effect was observed, while the floor effect was pronounced for 2-point response scales, somewhat less so for 3-point scales, and leveled off from 4 points onwards. The mean and standard deviation panels show that raw PHQ-9 sum scores demonstrate a linear increase in average scores and variability as the number of scale points increases across repeated measurements, a predictable artefact of the expanding response scale range. In contrast, POMP-transformed means display a consistent decrease from 2 to approximately 15 scale points, after which the decline appears to stabilize. This trend indicates that response scale granularity systematically influences mean estimates. Standard deviations of the POMP-transformed scores remain stable, with the notable exception of the 2-point response scale, which exhibits disproportionately higher variability. Skewness remains relatively stable across scale points with a slight increasing trend, while kurtosis fluctuates more markedly, with a modest increase as the number of scale points increases.
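The shape of the quadratic trend described above can be illustrated with a noise-free Python sketch; the generating coefficients are chosen to match the reported estimates, and the data are synthetic, not the study's:

```python
import numpy as np

points = np.arange(2, 21, dtype=float)              # 2..20 scale points
pct_above = 40 - 1.55 * points + 0.04 * points**2   # hypothetical % above cutoff

# Fit a quadratic polynomial: returns (quadratic, linear, intercept) coefficients
b2, b1, b0 = np.polyfit(points, pct_above, deg=2)
```

With exact quadratic data, `polyfit` recovers the generating coefficients; a negative linear term with a small positive quadratic term, as in the model reported here, describes a decline whose rate flattens at higher scale point counts.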
3.2 Reliability of the PHQ-9 across scale points
Figure 2 displays the reliability coefficients of the PHQ-9 across varying response scale widths. The internal consistency of the PHQ-9 demonstrates good psychometric properties (exceeding .80) across all examined response scale formats. It is interesting to observe almost identical values of all three coefficients at baseline measurement. In repeated measurements where response scale width varied, a notable increase in reliability is evident from 2 to 5 scale points, after which the coefficients stabilize and remain relatively constant throughout the remainder of the observed scale range. This pattern is consistent across all reliability indices examined. Coefficient Alpha and McDonald’s Omega Total yield similar values, while Omega Categorical consistently demonstrates higher coefficients by approximately .05 units, though it follows the same overall trend. The 99% BCa bootstrap confidence intervals narrow as scale points increase, suggesting greater precision in reliability estimation with more granular response formats. These findings suggest that while scales with at least 5 response options optimize measurement reliability for the PHQ-9, additional scale points beyond this threshold provide minimal incremental psychometric benefit.
Test-retest reliability of the PHQ-9 closely follows the pattern of internal consistency up to 5 scale points, shows a decline at 6 points followed by a slight increase at 7 and 9 scale points, and then an increasing trend from 17 to 20 scale points, reaching its highest value of .80 at 20 scale points.
3.3 Internal structure of the PHQ-9 across scale points
Prior to exploring modification indices to allow for correlated errors, the one-factor model fit at baseline measurement was comparable to recent findings in university student populations, with a significant \(\chi^2\) and an RMSEA value above .06, but with the remaining fit indices within acceptable boundaries (\(\chi^2(27)=98.26\), \(p<0.001\), \(\mathrm{CFI}=0.98\), \(\mathrm{RMSEA}=0.069\) [90% CI: \(0.055\), \(0.084\)], and \(\mathrm{SRMR}=0.040\) [90% CI: \(0.029\), \(0.051\)]).
The modification indices observed were similar to those reported in recent university student samples (Makhubela & Khumalo, 2023; Lingán-Huamán et al., 2023), and the model was adjusted for correlated errors between items 2 and 6, followed by items 6 and 9, and items 2 and 9. With this adjustment, fit improved for the baseline measurement as well as for several of the varied scale point measurements, although the chi-square test remained statistically significant for all models.
Figure 3 presents model fit indices for one-factor CFA models of the PHQ-9 across varying response scale widths, comparing ordinal and continuous estimation approaches. The chi-square/df ratios (top panel) reveal divergent patterns between estimation methods. Ordinal CFA models demonstrate a pronounced curvilinear relationship with scale granularity, with better-fitting values at baseline and for scales with 2-4 points, followed by an increase in index values for scales with 6-9 points (exceeding 6.0) and subsequent improvement for scales with 10 and 11 points. In contrast, continuous models display more stable chi-square/df values across all scale points, generally remaining below or slightly above the conventional threshold of 3.0, with more pronounced elevations for scales with 5-7, 9, and 19 points. RMSEA values, being chi-square based, mirror the pattern observed in chi-square/df ratios for ordinal models, showing a marked increase for scales with 6-9 points (exceeding .10) before improving at 10 and 11 scale points. Continuous models demonstrate more stable RMSEA values, though scales with 6-7, 9, and 19 points show elevations slightly above .08. Confidence intervals for RMSEA are notably wider for ordinal models with mid-range scale points, indicating less precision in fit estimation. SRMR values remain consistently low across all scale points and both estimation approaches. Ordinal models display slightly higher SRMR values for scales with 6-9 points, but the differences between estimation methods are less pronounced for this index than for the others. Comparative fit indices (CFI and TLI) remain consistently high across both estimation approaches, with CFI values exceeding the acceptable threshold (>.95) for most response scale widths, and only 6, 7, and 19 scale points falling slightly below it.
TLI values remain above .95 under the ordinal CFA approach for all scale points assessed, while under the continuous-data estimator they remain above .90, with 4, 10-11, and 17 scale points surpassing .95. Ordinal models consistently yield marginally higher CFI and TLI values than their continuous counterparts, particularly at mid-range scale points.
3.4 Standardized Parameter Estimates and Standard Errors
Figure 4 presents standardized parameter estimates for the modified one-factor PHQ-9 model. Overall, estimates tend to be slightly higher for scales with 5 or more response options compared to those with narrower formats. The most pronounced difference between categorical (DWLS) and continuous (MLR) estimation methods appears in the loading of the thoughts of suicide item, which consistently shows a lower loading of around 0.4 under MLR. For the remaining items, loadings estimated with both methods become relatively stable from 5 points onward, typically ranging between 0.6 and 0.9, with slightly lower values observed for 2- to 4-point scales. A similar pattern is evident in the standard errors of the standardized loadings (Figure 5), which tend to decrease from 2 to 5 scale points and then stabilize. Finally, Figure 6 compares correlations between estimated factor scores and participant sum scores across both estimation methods. While MLR-based estimates show consistently high correlations (approximately 0.98), a visible decline in correlation is observed for DWLS estimates.
3.5 Factor Scores
3.6 Convergent and external correlations
Figure 7 displays correlations between the PHQ-9 and the CESD-R-10 at baseline, correlations between PHQ-9 measurements with varying scale points and the baseline CESD-R-10, and concurrent correlations where the CESD-R-10 was assessed simultaneously using the same response scale width. The correlation between the PHQ-9 and the CESD-R-10 at baseline was 0.80. Average correlations between PHQ-9 scores using variable-width response scales and baseline CESD-R-10 measurements were 0.71, lower than the baseline correlation, an expected finding given that fluctuations in depression symptomatology can occur within measurement timeframes. Notably, concurrent correlations between PHQ-9 and CESD-R-10 scores obtained using identical response scale widths at the same measurement point averaged 0.89, exceeding the baseline correlation. Our analysis revealed a modest positive trend in correlations as response scale points increased from 2 to 5, after which correlations stabilized and remained relatively constant across higher numbers of scale points.
Figure 8 illustrates the relationship patterns between the PHQ-9 and the Positive and Negative Affect Schedule (PANAS) components across different measurement conditions. This figure presents baseline correlations, correlations between PHQ-9 with varying scale points and PANAS scores measured at baseline, and correlations where both instruments were administered concurrently using identical response scale widths. The PHQ-9 showed a substantial positive correlation of 0.68 with PANAS negative affect at baseline, while exhibiting an inverse relationship of -0.44 with PANAS positive affect. These opposite-signed associations align with theoretical expectations regarding depression’s relationship with affective states, as well as with empirical findings (ADD REFERENCE). When examining PHQ-9 scores using variable-width response formats against baseline PANAS measurements, correlations averaged 0.60 for negative affect and -0.43 for positive affect, reflecting a modest reduction in association strength for negative affect but no change for positive affect. When both instruments were administered simultaneously using matched response formats, the associations of the PHQ-9 with negative affect were at the level measured at baseline, with an average correlation of 0.70, while the PHQ-9’s relationship with positive affect again remained unchanged, with a correlation of -0.46. The figure additionally reveals that correlation magnitude shows initial sensitivity to scale granularity between 2 and 5 points for negative affect, and at 4 points compared to 2 and 3 points for positive affect, before stabilizing across wider scales, suggesting that response formats with at least 4 points adequately capture the relationship between depressive symptoms and affective states.
Figure 9 presents the relationships between PHQ-9 and the Big Five personality dimensions as measured by the BFI-S across different measurement conditions. The figure captures baseline correlations, correlations between PHQ-9 with varying scale points and baseline BFI-S scores, and concurrent correlations where both measures employed identical response scale widths.
The PHQ-9 showed distinct patterns of correlation with each personality dimension at baseline. The strongest association was observed with Neuroticism (0.32), consistent with established links between emotional instability and depressive symptoms. Negative correlations were found with Extraversion (-0.20), Conscientiousness (-0.28), and Agreeableness (-0.18), while Openness demonstrated no relationship with PHQ-9 (0.01).
When examining the PHQ-9 with variable response formats against baseline BFI-S measurements, average correlations were 0.27 for Neuroticism, -0.16 for Extraversion, -0.25 for Conscientiousness, -0.15 for Agreeableness, and -0.04 for Openness. These values indicate that the PHQ-9, assessed with varying scale points, maintained relatively stable relationships with baseline personality scores.
Concurrent administration using matched response formats, excluding the visibly diminished correlation at 2 scale points, yielded stronger correlations for Neuroticism (average 0.42) and Conscientiousness (-0.38). The average correlation with Openness (-0.12) was the lowest, while the average concurrent correlations with Extraversion (-0.22) and Agreeableness (-0.20) were slightly above the corresponding averages with baseline personality measurements.
4 Discussion
In this study, we examined the psychometric consequences of varying the number of scale points in depression screening using the PHQ-9. Regarding distributional properties, our results revealed that the mean and the number of participants above the cutoff score, in POMP equivalents, decline in an almost linear fashion as the number of scale points increases; for the means, the quadratic regression term was also significant, indicating that the decline slows down across the scale point span. Reliability estimates increase from 2 to 5 scale points, and the same general trend can be observed for the standardized factor loadings and corresponding standard error estimates of a one-factor PHQ-9 model, regardless of the estimation method.
Conclusions based on fit indices diverge. The first noticeable difference arises from the methodology used: RMSEA, CFI, and TLI values are higher for all scale points (except for RMSEA with the dichotomous PHQ-9 measure), while SRMR values are virtually the same for both continuous and categorical methodology. Nye & Drasgow (2011) and Xia & Yang (2019), using a simulation approach, addressed the issue of applying conventional fit index cutoffs, established with continuous methodology, to categorical methodology, concluding that both unscaled and scaled categorical CFI and TLI indices are insensitive to model misspecification, clustering above .95 under the manipulated conditions. Our results using the DWLS estimator closely mimic these simulation results. While the CFI and TLI fit indices under the MLR estimator exhibit more variable values, implying the best fit for 4, 10, and 11 scale points, under DWLS estimation they remain high and stable, pointing to a possible lack of sensitivity to model misfit. Trends in RMSEA values, on the other hand, show more volatility and higher values under DWLS estimation, paired with a sharp increase from 2 to 9 scale points and a drop at 10 and 11 scale points, a result that differs from simulation studies, which showed that RMSEA values are smaller under categorical methodology than under continuous methodology for the same level of misspecification.
Finally, convergent correlations with the CESD-R-10 seem to increase slightly from 2 to 5 scale points, most likely reflecting the increase in score reliability in this range as a consequence of reduced restriction of variance, leveling off after 5 scale points. When comparing convergent correlations with the baseline CESD-R-10 measure, taken with the default 4-point response scale, to correlations with concurrent same-scale-point measures, we observe coefficient values higher by about .01 for the concurrent measures, likely a consequence of variation in depression values over time. Criterion correlations with the PANAS positive and negative affect scales show a similar increase in correlation strength from 2- to 5-point scales for negative affect, and less visible trends for positive affect. The difference between correlations with baseline and concurrent PANAS negative affect measures is present but less pronounced than for the convergent correlations with the CESD-R-10, and this finding is not observed for positive affect despite the high correlation value. Correlations of baseline BFI-S values with scale-point-varied PHQ-9 were generally stable across scale points and generally lower than concurrent correlations, with the most pronounced differences for Neuroticism and Conscientiousness. Albeit less pronounced, differences between concurrent and baseline correlations were also found for Openness, Extraversion, and Agreeableness. The absence of this difference for PANAS positive affect, coupled with its presence for the BFI-S dimensions, is an unexpected finding.
4.1 Limitations
The study has several limitations. First, the sample consisted of university students, which may limit the generalizability of the findings to other populations; given the sample's higher education level, and presumably a greater capacity to make finer-grained distinctions among the assessed symptoms, it probably also marks an upper limit for the feasible PHQ-9 response scale width. Second, the demanding nature of the data collection led to lower adherence across the repeated measurements, with some participants not completing all scale-point conditions and the duration of the full set of assessments varying between participants. Consequently, measurement error is likely higher in the repeated measurements than at baseline. However, the randomization of response scale order across participants mitigates the risk of systematic bias associated with these variations. Third, the number of participants warrants caution in interpreting the fit index values: for adequate Type I and Type II error control in DWLS estimation, and for expected fit index values to be reasonably close to their population values, some simulation studies suggest sample sizes from 800 (Nye & Drasgow (2011)) up to 1,000 participants (Hoogland & Boomsma (1998)), although other researchers suggest that as few as 150 suffice for dichotomous and trichotomous response scales (Savalei & Rhemtulla (2013)).
References