Issues of instrument reliability and validity lie at the heart of any measurement activity, and a basic understanding of these concepts is necessary to make informed decisions during the planning, implementation and analysis phases of any research or evaluation project. Reliability refers to the consistency with which a measurement tool or method gathers information. If an instrument is internally consistent then subjects will tend to respond consistently to similar items on that instrument. If an instrument is temporally consistent then, if measured twice (i.e., at different times), the same subjects will tend to score at about the same level (assuming that the characteristics of interest have not changed). Validity refers to the meaningfulness of the instrument and the resultant data. Questions such as: What is actually being measured? What do the results mean? and Do the results apply to other persons? all inquire as to the meaning or validity of measurement, methods, and results.
There are two basic types of response error which can introduce inconsistency into data sets (lowering reliability) and undermine the confidence one can place in interpretation of the findings. A first type of error is non-systematic or random and occurs when subjects' scores are influenced by chance. This type of error decreases the clarity or focus of measurement by increasing the unexplained variation (sometimes unflatteringly called "slop") in the data set. Excessive muddiness in a data set increases the chance of missing any significant differences which might be present (type II errors in inference). A second type of error is systematic, and occurs when there is a systematic and unexpected pull on measures which tends to skew the results in a particular direction, leading to false conclusions (type I errors in inference). The possibility of systematic bias challenges the meaningfulness or validity of any significant findings.
The sources of random error and systematic bias are numerous and every attempt should be made to sharpen measurement and increase control for systematic error during the planning phase. Although a comprehensive discussion of this process is beyond the scope of this document (and readers are referred to a good research design text such as Portney & Watkins, 1993), there are several sources of random and systematic error which are inseparably related to instrument design and utilization. These topics are addressed below.
Internal Consistency: The internal consistency of a scale refers to the degree to which items of the scale "hang together". If a scale is composed of closely related items, then its internal consistency estimate (typically Cronbach's alpha; Cronbach, 1951) is higher than for a scale where questions are disjointed or the elicited responses are unrelated. All scales should be conceptually consistent, that is, have good internal consistency ( > .80 as a rough rule), since the degree of internal consistency sets the upper limit for any other reliability or validity estimate for the scale. However, there are some factors to consider when applying this rough guide universally.
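As a rough illustration of the internal consistency estimate described above, the following Python sketch computes Cronbach's alpha from a table of item scores using the standard formula: alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the sample data are hypothetical and are not drawn from any instrument reviewed here.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Estimate Cronbach's alpha for one scale.

    scores: list of rows, one per respondent; each row holds that
    respondent's score on each of the k items of the scale.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Variance of respondents' total (summed) scale scores.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three items that rise and fall together.
consistent = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
# Hypothetical data: two items that are unrelated to each other.
unrelated = [[1, 3], [2, 1], [3, 4], [4, 2]]

alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
```

With these made-up scores, the perfectly related items give an alpha of 1, while the unrelated items give an alpha of 0. Real scales fall between these extremes.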
The reliability (internal consistency) of a scale increases with the number of items in that scale. Thus, all other things equal, a QoL questionnaire which has fewer items will be less consistent and will result in patient scores which fluctuate more, due to random responses, than a longer instrument. As a result, it is not enough simply to compare Cronbach's (1951) alpha levels when looking for a reliable instrument, since a low alpha coefficient may reflect a small number of items as well as genuine internal inconsistencies in the scale. Reliability coefficients may be adjusted (standardized) for item number using the Spearman-Brown Prophecy Formula, well documented in any text on testing (Aiken, 1991 for example). Adjusted values, however, are not often included in the literature (and most readers do not carry out a statistical reanalysis of the reported data).
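The adjustment for item number mentioned above can be sketched as follows. The Spearman-Brown formula projects the reliability of a scale whose length is multiplied by a factor n as n x r / (1 + (n - 1) x r), on the assumption that the added items behave like the existing ones. The function name and the two example scales are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Project reliability for a scale whose length is multiplied by length_factor.

    Assumes the added (or removed) items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical comparison: a 5-item scale with alpha = .75 and a
# 20-item scale with alpha = .85, both expressed at a common length of 20 items.
short_scale_at_20_items = spearman_brown(0.75, 20 / 5)
long_scale_at_20_items = spearman_brown(0.85, 20 / 20)
```

Under these assumptions the shorter scale, once projected to equal length, would be expected to reach about .92 against .85 for the longer scale. A comparison of the raw alpha values alone would hide this.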
Besides mathematical explanations, there are various reasons why multiple item scales are generally more consistent than measurements using single items. One interesting theory suggests that multiple item measures tend to prompt individuals to search their memory for relevant experiences on which to base their response. This searching for relevant information is thought to reduce the influence of "snap judgements" and preconscious thoughts, which are only tangentially related to the intended purpose of the question (Pavot & Diener, 1993a). Another way to increase response consistency is to have respondents determine and list their own most important domains (cf., Schedule for the Evaluation of Individual Quality of Life - see XXVI, Chapter 5) or to have subjects relate items to recent life experiences. This approach also has other advantages, since interviewers can quickly ascertain an individual's comprehension of an item and its relevance to the respondent.
Test-Retest & Alternate Forms Reliability: Another estimate of a scale's clarity or focus is the degree of overlap between two measurements taken at two different points in time using the same scale and with the same respondents: that is, the degree to which results are correlated. This, of course, assumes that the time period is short enough to ensure that characteristics being measured have not actually changed between the two measurement occasions; yet just long enough to ensure that the subjects are not responding from rote memory. The Test-Retest Reliability Coefficient estimates the degree to which the same items elicit the same response from one testing time to another. Typically test-retest reliability estimates are slightly lower than internal consistency estimates ( > .75 as a rough rule for adequate test-retest reliability). Occasionally, alternate forms of the same instrument (e.g., Life Satisfaction Index - see IX, Chapter 5) are used so as to eliminate the possibility that respondents' recall of items would affect their retest scores at a later assessment time. Alternate forms reliability coefficients are usually calculated as the correlation between respondents' scores on two different forms of an instrument completed at the same time. Good alternate forms reliability should also exceed .75.
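The test-retest coefficient described above is usually computed as a Pearson correlation between the two sets of scores. The sketch below shows that calculation. The function name and the scores are hypothetical, and the same function could be applied to scores from two alternate forms.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scale scores for the same five respondents at two occasions.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 12, 14, 19, 21]

retest_r = pearson_r(time_1, time_2)
# Compare against the rough .75 rule for adequate test-retest reliability.
meets_rough_rule = retest_r > 0.75
```

For these made-up scores the coefficient is about .98, comfortably above the rough .75 guideline.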
The validity of an instrument or method refers to its "truth value". The meaning or truthfulness of a measurement is a complex issue which requires careful consideration as to an instrument's performance with respect to other (well tested) instruments or standards and as to how well it is suited to the intended purpose of evaluative activities. There are several types of validity, including face/content validity, construct validity, concurrent validity and predictive validity (Weiner & Stewart, 1984; Aiken, 1991). Face/content and predictive/concurrent validation are discussed below as concepts with which evaluators should be particularly familiar.
Face and Content Validity: The degree to which an instrument seems to ask about content which is relevant to both the task at hand and the respondent's experience is known as face validity. If a QoL instrument appears irrelevant and not well suited to a person's life experiences, then responses to the instrument are more likely to contain error due to misinterpretation and/or lack of motivation, possibly resulting in carelessness. Moreover, irrelevant items may render results which are uninterpretable - for example, when asking a group of persons with schizophrenia about their satisfaction with their spouse when very few are actually married. Content validity is similar to face validity, the main difference being that a panel of experts examine the instrument and determine the degree to which its items address the topic(s) which the instrument purports to measure. The agreement of persons who are very familiar with an issue, about the content of a measure, provides some support for the validity of its content (Streiner, 1993).
A common threat to instrument validity occurs when respondents' understanding of an item's meaning is not uniform. If, for example, one person reads the following statement: "How close are you to your family members?" to mean contact with the family of procreation whereas another person reads the same statement to refer to the family of origin, the resulting group data become uninterpretable. Close attention should be paid to the referential meaning of items from the standpoint of the persons expected to respond to the QoL instrument. It is equally important to read through an instrument and determine whether items and scales are suited to the respondent's level of comprehension (e.g., vocabulary and reading level).
Construct Validity: Throughout this document we have referred to the term "construct" which is a "theoretical idea developed to explain and organize some aspects of existing knowledge [and observations]" (American Psychological Association, 1974, p. 29). Some constructs used in QoL instruments are satisfaction, importance and functioning within the domains of personal, family, social and community experience. The development of instruments or measurement scales so that they meaningfully measure such constructs is referred to as a process of construct validation.
Measurements taken with an instrument which possesses construct validity will correlate with results from other instruments which are theoretically related to the construct in question. For example, if an objectively oriented QoL researcher were to develop a new measure of mental functioning, one would (theoretically) expect results using this instrument to correlate with other tried-and-true measures of mental ability, functioning and performance. Another method which is commonly used to investigate the construct validity of an instrument is confirmatory factor analysis. In this statistical procedure, responses to items on an instrument are assessed to see whether they cluster together in theoretically expected ways. Items, for example, tapping the construct of global satisfaction should be found to be highly correlated, whereas items associated with the construct of physical functioning would not be highly correlated with items tapping the construct of global satisfaction.
Concurrent and Predictive Validity: The concurrent and predictive validity of an instrument is an assessment of agreement between an instrument and other commonly accepted indicators, either measured at the time (i.e., concurrently) or at a later point in time (i.e., predictively). For example, a creator of a new QoL instrument who wishes to determine how well it actually measures patients' life quality could choose a well known QoL instrument with good psychometric properties, for purposes of comparison. Such a comparison might involve asking patients to complete both instruments and then correlating the results; the overlap is an indication of association of the new instrument with the chosen standard. Determination of predictive validity also involves the use of an external standard to assess the truth value of an instrument. In this case, however, the standard is either a criterion or an event which is theoretically associated with the dimension that an instrument is thought to tap. For example, scores on a newly developed QoL independence scale might be compared with ratings of patients' independence from health services personnel (a criterion) or with their future employment status (a prediction).
External Validity: A form of validity which is often taken for granted is how well inferences made from results describe the population at large. More specifically, external validity rests on how well the instruments and methods are suited to the purpose(s) of inquiry. Overlooking the match between instrument and purpose calls into question the meaning of any observation, and increases the likelihood of failing to detect effects which are actually present (Type II error). Choice of instrument and research/evaluation design will affect the way in which change is detected among the individuals included in the evaluation.
As indicated earlier, quality of life instruments have been developed according to different models of health and illness. Some instruments, for example, utilize a functional model of health whereas others utilize an experiential model and address individuals' experiences associated with illness (cf., Costain et al., 1993). Each class of instruments provides a different perspective on consumer concerns. It is also wise to consider the specificity (or narrowness) and sensitivity (or responsiveness) of an instrument on the dimensions targeted for inquiry (Health Canada, 1994b). Some measures are more global and less disease or treatment specific. These instruments attempt to characterise the impact of health on individuals' wider life experiences. While results from such measures are often comparable across programs, their relationship with specific treatment effects is less clear. Some instruments, on the other hand, are very specific to particular courses of disease and highly sensitive when it comes to detecting the impact of disease-specific treatments. Such specificity, however, can hinder cross-program comparisons. Moreover, highly sensitive instruments may detect statistically significant differences which are too small to have much clinical significance (see Health Canada, 1994b, p.13, for a discussion of "minimally important differences").
Reliability & Validity Ratings of QoL Instruments: Table 4 (see next page) presents the 28 instruments from our five year review and our ratings of their reliability and validity. Our determination of instrument quality was not a simple matter. As mentioned, reliability coefficients vary with the number of items in a scale. We did not attempt to adjust these coefficients for item number. Moreover, selection of a short, easy to implement, and less reliable scale may be a more important factor during evaluation than choosing a long, highly stable and reliable QoL indicator. Likewise, scales which address persons' behaviour or symptoms tend to be less homogeneous, and thus have lower internal consistency, than scales which are composed of more global and evaluative items. The selection and availability of psychometric reviews also influenced our decisions. For some very widely used instruments (e.g., the Medical Outcomes Study instrument - see X, Chapter 5) we selected representative studies out of hundreds available based on relevance to mental health populations or issues. For other instruments, we could only find one or two citations which may or may not have addressed the topic of instrument validity in any comprehensive way. Adding to the tentativeness of our validity rating we restate that the validity of any instrument should best be determined after a careful consideration of the intended purpose of the study. A very sensitive and specific QoL instrument, for example, would not be valid for use in a cross-program and multi-site comparative study.
For these reasons, this table is only provided as a guide and is not intended to offer a definitive solution since instruments vary in how they elicit information, in the psychometric strength of the subscales, and in the degree and depth to which they have been validated for the purposes and specific populations for which they are to be used. Chapter 5 provides more detail on, and sources for, specific instruments.
Table 4: Psychometric Evaluation of QoL Instruments
| QoL Instrument | Used with MH Pop | Reliability | Validity | Comments |
| ComQoL Scale | No | Fair | Fair | Contains a test of patients' discriminative abilities, used with intellectually disabled persons. |
| General Health Questionnaire | Yes | Fair to Good | Fair to Good | Most widely used instrument in MH. Strength in assessing neurotic distress and anxiety. |
| Gottenberg QoL Instrument | No | Fair to Good | Fair to Good | Strong symptomatology emphasis, measuring anxiety, concentration, depression and fatigue. |
| Health Measurement Questionnaire | Yes | Fair to Good | Fair to Good | Utility for use within liaison psychiatry settings, with in- and outpatient acute and chronic psychiatric populations. |
| Lancashire QoL Profile | Yes | Fair to Good | Fair to Good | Contains a measure of respondent error - a professional opinion about reliability of patient reports, for schizophrenic and mixed populations. |
| Lehman's QoL Interview | Yes | Fair to Good | Good | Well researched, some instability with respect to objective scales, all severe and chronic populations examined. |
| Life-As-A-Whole Index (uniscale) | No | Fair | Fair | A Uniscale, quickly administered. |
| Life Experiences Checklist | No | Fair | Fair | Easily administered, psychometrics not fully developed. |
| Life Satisfaction Index | Yes | Fair to Good | Fair to Good | Often used with cognitively impaired populations and can also be used with psychiatric populations to obtain measures of well-being. |
| Medical Outcomes Study (MOS) SF-36 | Yes | Good | Good | Very widely used, not specific to mental health but has been used with a depressed patient population. |
| Multifaceted Lifestyle Satisfaction Scale | No | Fair to Good | Fair to Good | Validation required for use with MH populations. Used with cognitively impaired patients. |
| Nottingham Health Profile | Yes | Fair to Good | Fair | Problems with respondents indicating "no" problem to all scales, thus scores may bottom-out with more normal populations. |
| QoL in Depression Scale | Yes | Good | Good | High face validity for patients, easy to answer, only applicable for a population suffering from depression. |
| QoL Enjoyment and Satisfaction Questionnaire | Yes | Good | Good | Solid evaluation with high clinical relevance for a depressed population. |
| QoL Index (5 global scales) | No | Fair to Good | Fair to Good | Very short administration time using 5 global clinical rating scales. |
| QoL Index for Mental Health | Yes | Good | Fair to Good | A new, promising, instrument, goals and symptoms scales useful for patient-therapist comparisons, used on mixed chronic & severe population. |
| QoL Interview Schedule | Yes | Fair to Good | Fair to Good | Internal consistency problems with some scales, mixed severe and chronic population. |
| QoL Inventory | Yes | Fair to Good | Good | Highly evaluative content, discrepancy ratings between importance and satisfaction. |
| QoL Questionnaire (Shalock) | No | Fair to Good | Fair to Good | Developed to evaluate developmental disabilities |
| QoL Questionnaire/ Interview (Bigelow) | Yes | Fair to Good | Good | A strong employment emphasis, contains substance use/abuse and stress tolerance scales. Some apparent scale instability, mixed population. |
| QoL Scale | Yes | Good | Fair to Good | Includes various scales tapping intrapsychic dimensions useful for clinical assessment - patients w/ schizophrenia |
| QoL Self-Assessment Inventory | Yes | Fair to Good | Fair to Good | A promising new scale needing more development, used with schizophrenic and mixed chronic and severe populations. |
| QoL Systemic Inventory | No | Fair to Good | Fair to Good | Reliability boosted by use of visuals. More validation with mental illness populations required. A novel and enjoyable interactive interview approach. |
| Quality of Well-being Scale | No | Good | Good | Well developed health status index for purposes of managed care (costing & planning), not mental health specific. |
| Satisfaction with Life Scale | Yes | Good | Good | Five quick global scales focus on the judgement component of subjective well-being, used with depressed patients. |
| Schedule for the Evaluation of Individual QoL | No | Good | Good | Very sensitive to individual response consistency. Factors affecting judgment will impact on validity of results. |
| Sickness Impact Profile | Yes | Good | Good | Widely used; two factors identified with a MH population (a psychosocial and a physical health construct); used with a depressed population. |
| SmithKline Beecham QoL Scale | Yes | Good | Good | Ideal-Self, Sick-Self & Self-Now evaluation of overall physical and social function, used with depressed & gen. anxiety disordered population. |
Each instrument in Table 4 was given a rating from fair to good as to the reliability of its scales and the overall instrument (i.e., internal consistency, test-retest and alternate forms reliabilities) as well as the degree to which its validity had been established (i.e., face, content, construct, and concurrent/predictive validity). It was our opinion that "good" instruments (and scales) should possess internal consistencies greater than .85 and test-retest or alternate forms reliability coefficients greater than .75. Instruments which were rated "fair to good" had lower reliability coefficients, with the internal consistency of one or more scales falling into the .75 - .85 range and/or test-retest reliability of less than .75. Instruments rated as "fair" lacked adequate internal consistency and/or test-retest stability.
Validity ratings based on information in our review are also provided. Well-validated instruments were expected to include items with good face and content validity, have a stable and coherent scale structure (construct validity), and have moderate to high ( > .60) convergent correlation with other QoL measures and clinical scales. Those instruments given a "fair to good" rating were either lacking in one or more of the assessed areas or possessed weaker validity coefficients (.50 - .60). Instruments rated as "fair" lacked a clear demonstration of validity. Some instruments in our review were still in the development phase. When this was the case, it was noted in the comments section of Table 4.
While it is tempting to provide a list of the best instruments for use in mental health settings, we cannot. Evaluation coordinators within clinical settings are in the best position to know the type(s) of population they are working with, and best able to grapple with context-specific measurement and implementation issues, such as determining the QoL domains of interest, specifying the need for specific vs global measures, balancing the cost-effectiveness vs the rigour of the investigative method, deciding to use public domain vs purchased tools, and determining the overall purpose of evaluation measurement (e.g., treatment vs epidemiological vs administrative objectives). Several instruments were widely used and validated for a mixed mental health population - a fair indicator of their psychometric strength. These instruments include the Health Measurement Questionnaire, Lehman's QoL Interview, QoL Interview Schedule, QoL Inventory and QoL Questionnaire/ Interview. Other instruments which show promise with specific populations include the General Health Questionnaire, MOS or SF-36, QoL in Depression Scale, QoL Enjoyment and Satisfaction Questionnaire, Satisfaction with Life Scale, Sickness Impact Profile, and the SmithKline Beecham QoL Scale. Two newly developed scales show particular promise, the QoL Index for Mental Health and the QoL Self-Assessment Inventory.
Various complex issues face those involved with the implementation of an evaluation project or program. Yet, not all settings are able to afford an evaluation coordinator and the tasks of data collection and project management may fall to the clinical staff. This is likely to become more common as population health monitoring initiatives are launched. Irrespective of who accepts primary responsibility for project management, if things are to proceed smoothly, several issues deserve attention.
Evaluative Rigour: Determining the reasons for any observed differences or changes in the chosen measures is the primary objective of experimental and quasi-experimental research designs. Experimental rigour is critical if evaluators or researchers want to be quite confident in the meaning of their observations. There are numerous threats to the validity of findings, and while it is beyond the scope of this paper to address them all, several issues illustrate the importance of sound methodologies when implementing and interpreting the results of evaluation studies.
A pre-/post-measurement design is often used to determine whether a measured attribute has changed between two points in time. Treatment effects, however, are not the only reasons for such changes. The choice of when to take time 1 (pre-) and time 2 (post-) measurements may systematically bias results towards detection of differences. If patients, for example, are measured when they enter a program and again after eight weeks of being in the program, observed improvement may have been caused by an inevitable reduction in acuity (of symptomatology, distress, etc.) due to referrals being made to the program at the worst period of patients' illness cycle. How then can the effects due to temporal changes which would otherwise have occurred spontaneously during patients' illness cycles be sorted out from the effects of the treatment program? One solution might be to use a comparison group who are identified at the same point in their illness cycle but are not provided any treatment. Rigour could be further increased if patients in the treatment and control groups are either matched on characteristics which might confound the results or randomly assigned to the treatment or control group. This increased rigour requires a doubling of the sample size and a matching of treatment and controls, and introduces significant ethical concerns (e.g., the withholding of a potentially effective treatment from the control group). It is possible that the expense, effort and complexities associated with rigorous inquiry may make it unfeasible as a widespread and routine para-clinical activity.
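The logic of using a comparison group can be sketched numerically: the change seen in the untreated group stands in for the change that would have occurred spontaneously, and it is subtracted from the change seen in the treated group. The following is a minimal illustration only. The function names and all scores are hypothetical, and the calculation is a simple difference in mean change, not a significance test or a substitute for matching or random assignment.

```python
from statistics import mean


def mean_change(pre, post):
    """Average post-minus-pre change for one group of paired scores."""
    return mean(after - before for before, after in zip(pre, post))


def change_net_of_comparison(treat_pre, treat_post, comp_pre, comp_post):
    """Treated group's mean change minus the comparison group's mean change."""
    return mean_change(treat_pre, treat_post) - mean_change(comp_pre, comp_post)


# Hypothetical QoL scores at program entry and eight weeks later.
treat_pre, treat_post = [30, 35, 40, 45], [42, 45, 52, 57]
comp_pre, comp_post = [31, 36, 39, 44], [37, 41, 45, 51]

raw_improvement = mean_change(treat_pre, treat_post)
net_improvement = change_net_of_comparison(treat_pre, treat_post, comp_pre, comp_post)
```

In this made-up example the treated group improves by 11.5 points on average, but the comparison group also improves by 6 points, leaving a net difference of 5.5 points that might plausibly be attributed to the program.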
Data Quality and Collaboration: In addition to instrument error due to poor reliability and validity, data quality can be affected by a host of social, economic and political factors. It is important that the evaluation methodology have credibility in the eyes of those responsible for the front-line management of evaluation projects. Patients and clinicians may view data collection activities as lacking clinical utility, as an excessive administrative imposition, or even as a mechanism to facilitate cutbacks and loss of resources. Such beliefs will negatively impact the quality of any data that are gathered. At best, passive resistance is likely to result in carelessly collected and incomplete data sets. Systematic error may be introduced as respondents bias the data set towards the specific interests of their program. Such "gaming" becomes more likely when there are apparently justifiable reasons to manipulate results.
There are several principles that evaluation coordinators/organizers should follow to maximize staff buy-in and help ensure data quality. The purposes of data collection should be discussed with staff before any implementation has begun. During these initial sessions, the information needs of staff and of collaborators should be ascertained and included as an important objective of evaluation. At least some of the data being collected should directly benefit the clinical process, thus allowing clinicians and patients to see the relevance of data collection and view themselves as co-investigators in the project. The intentions of other stakeholders, with different interests in the data, should be discussed with staff. If staff are not already familiar with the concept of formative and summative evaluation this should also be explained (see Guba and Lincoln, 1989). Finally, those directly involved with data gathering should be aware of possible signs of problems with the reliability or validity of data collection tools or collection procedures. These individuals should in turn play an active role in ongoing project management.
To supplement the discussion above and to help stimulate thinking and planning for evaluation initiatives, Table 5 contains a planning checklist which is based on work by Patrick & Erickson (1993, pp 206-208). This table summarizes the basic steps for evaluating the appropriateness of health-related quality of life measurement tools and methods.
Table 5: A Guide to evaluating health-related quality of life (HRQOL) measures
Define the purpose of measurement:
Specify available resources: (time, money, and personnel available for the assessment)
- or -
Measures are:
Determine:
Describe the population: (Level of disability, age, cognitive ability, and ethnic or cultural identity)
Conceptualize QOL outcomes: (instrument domains and length)
Assess methodological characteristics: (the reliability, validity, and responsiveness of measures)
Assess instrument responsiveness:
The measure detects minimally important differences in repeated evaluations of QOL of respondents in the target populations. Does this measure need to contain items or questions that directly assess change, e.g., transition questions?
Assess instrument construct validity:
Evaluate Choice of QOL measures:
Conduct pretest or pilot study:
Prepare for data collection and analysis:
Assure quality during data collection: (striving for improvement)
Prepare findings:
An additional source of pertinent information on methods and issues encountered within mental health program evaluation is a report by Ron Goeree prepared for Health Canada (1994b) entitled "Evaluation of Programs for the Treatment of Schizophrenia: A Health Economic Perspective". This document is an easy to read compendium of practical issues in the evaluation of treatment programs for the chronically mentally ill. It includes discussion of economic evaluations, multi-dimensional and quantitative approaches to outcomes, health-related quality of life, patient satisfaction, resource utilization and unit pricing, and utilization of comparison groups. Ron Goeree's follow-up report for Health Canada (1996a), entitled "Evaluation of Programs for the Treatment of Schizophrenia: Part II - A Review of Selected Programs in Canada", is useful in presenting concrete examples to illustrate the principles espoused in the earlier publication. Both serve as good companion pieces to the current document.