David Hill readings reprinted from The Politics of Schizophrenia: Psychiatric Oppression in the United States (University Press of America, 1983) with the permission of the author (currently out of print).


 

Chapter 18

INADEQUATE DATA: THE RESPONSE

 

The conclusions drawn from these various attempts to estimate reliability were consistent in acknowledging the inadequacy of psychiatric diagnostic procedures. Recommendations as to how to respond to this conclusion, however, were somewhat more varied. A few advocated the abandonment of classification, suggesting that we should be more concerned with the individual's unique life situation (Chapter 19). Others proposed alternative modes of classifying human behavior (Chapter 20). Before examining such recommendations we must first discuss the most common response - the suggestion that we should discover the sources of the disagreements between diagnosticians and attempt to control them. The majority of researchers in this area were not ready to consider the possibility that the low reliability findings meant that our constructs do not reflect reality, and decided, instead, to work towards proving their existence through improving their reliability. Boisen is typical of those who, while willing to admit that Kraepelinian constructs had failed to do what they were intended to do, were, nevertheless, unwilling to abandon them.

The fifty years which have elapsed since Kraepelin's contribution was given to the world have not substantiated the hypothesis on which the system was built. . . . It is clear that the Kraepelinian system is inadequate, but there is as yet nothing to put in its place which is likely to receive general acceptance. (1938, p.25)

 

In discussing the most common response to the problem of poor reliability we come to the core of this body of research literature. The typical study is characterized by an opening acknowledgment of the problem followed by some brief justification for not yet abandoning the constructs in question. The study itself involves either an attempt to discover the sources of variance, or to control them, or both. A call for such research came from Kreitman in 1961. In reviewing the research he focusses on the methodological inconsistencies hindering comparison of the various studies. The fact that he finds little actual support for the reliability of psychiatric diagnoses is clear from his conclusion that "clinical psychologists, testing the same patients by whatever means they chose, seem to succeed no better than psychiatrists in reaching an agreed diagnosis" (p.883). He goes on to emphasize the import of such failure, for these two professions, in the following way:

Those who wish to maintain the existence of a diagnostic entity must first demonstrate that by using specific criteria it is possible to achieve acceptable standards of reliability. . . . Failure to do so would mean that the category must be discarded or new criteria developed. (p. 884)

 

If the failure to establish acceptable standards of reliability necessitates a choice between discarding constructs and tightening criteria, the question becomes: how much documentation of failure, over what period of time, is required before we decide to discard? If methodological deficits are apparent in the studies which document the failure, it would, indeed, seem legitimate to remove those deficits before discarding. I would argue, however, that difficulty in comparing studies because of inconsistent methodologies becomes irrelevant if, as seems to be the case, the different approaches have produced consistently negative findings. Consistent findings emanating from different methodologies strengthen rather than weaken the credibility of those findings. I hope to demonstrate, moreover, that the methodological deficits tended to exaggerate rather than diminish the level of reliability obtained.

 

A Frontal Attack in the Wrong Direction

The typical response to the consistently negative reliability findings was the one recommended by Kreitman in 1961.

The cardinal need is for greater reliability in diagnosis itself, but to that end it is essential to know what aspects of the psychiatric examination have contributed the most reliable data, and which the least, to our diagnostic certainty. . . Does the present state of diagnostic confusion reflect differences in what is elicited and observed, or does it rather stem from conceptual difficulties? In essence then a plea is being made for adequate analysis. Progress cannot come from the contemplation of bald percentages of agreement; it is less valuable to know that so many psychiatrists agree on a particular diagnosis in a given proportion of cases than it is to understand why they disagree on the remainder. . . Perhaps it is not premature to urge that a frontal attack on these problems is likely to prove among the more rewarding of present research possibilities. (pp. 885,886)

 

The following year a group of Philadelphia psychiatrists (Ward et al., 1962) set out to discover the reasons for diagnostic disagreement. Their own particular acknowledgment of the problem took the following form:

Much current thinking holds psychiatric diagnosis to be "the soft underbelly of psychiatry" and "an indictment of the present state of psychiatry." Diagnosis is said to cause behavioral scientists "marked feelings of inferiority" because of their alleged inability to obtain agreement rates significantly better than chance. To the extent that these opinions are accurate, it is clearly an important question why a nomenclature, which is the distillation of so much experience over so many years, should so fail the test of clinical usefulness. (p. 60)

 

Ignoring the possibility that the answer to their question is that our nomenclature consists of scientifically meaningless constructs, the researchers asked four experienced psychiatrists to identify the reasons for their disagreements about 40 'patients.' Nine reasons were grouped into three categories responsible for the following amounts of disagreement: inadequacy of nosology (62.5%), inconstancy on the part of the diagnostician (32.5%), and inconstancy on the part of the 'patient' (5%). Within the 'diagnostician inconstancy' category, the most common reason for disagreement was their "weighing symptoms differently" (17.5%). Within the 'inadequate nosology' grouping, the major reasons cited were "forced choice of predominant major category" (30%) and "unclear criteria" (25%). The majority (80%) of disagreements caused by the absence of clear criteria involved the diagnosis 'chronic undifferentiated schizophrenia.' Their summary, furthermore, includes the statement that one of "three chief difficulties inherent in the present nomenclature" is "the lack of clear criteria, as in distinguishing certain reactions now labeled schizophrenia, from neuroses or schizoid personalities" (p.205).

The task, then, for those unwilling either to abandon or to radically revise the classification system, was two-fold. First, they would have to try to train diagnosticians to use a standardized assessment process designed to ensure, among other things, consistent weighting of symptomatology. Second, and more importantly, improvements would have to be made in the nomenclature, including the development of clearer definitions for the various diagnostic constructs. The same group of Philadelphia researchers had already begun the latter part of the task. The 40 'patients' mentioned in their 1962 paper had been drawn from a larger sample employed in a study reported the preceding year (Beck et al., 1961). The purpose of the original study had been to determine whether an attempt to remove the sources of disagreement emanating from the inadequacy of the nomenclature would, as predicted, result in improved inter-rater reliability.

Before engaging in the formal aspects of the investigation, the psychiatrists had several preliminary meetings during which they discussed the various diagnostic categories, ironed out semantic differences, and reached a consensus regarding the specific criteria for each of the nosological entities. . . It was found that considerable amplification of some of the diagnostic descriptions contained in the manual [D.S.M. II] was necessary in order to minimize differences. After having reached agreement on the criteria to be used in making their clinical evaluations, the psychiatrists compiled a list of instructions to serve as a guide in diagnosis. (pp. 351,352)

 

Under these "more optimal conditions" the average degree of agreement between pairs of psychiatrists, when employing only six nosological categories (calculated by dividing the number of concordant selections of each category by the total number of times that category was employed), was 54%. The level of agreement for the category 'schizophrenic reaction' was 53%. When the task was simplified by asking the diagnosticians to choose only between the three major groupings, 'psychosis,' 'neurosis' and 'personality disorder,' the inter-rater reliability rose to 70%. After the study was two-thirds complete it was decided to include an additional measure of reliability which would allow the diagnosticians to offer an alternate diagnosis along with their first choice. I can't help wondering if such a change in plans might have arisen from the disappointing results emerging from 'plan A.' Whatever the motivation, even when equal weighting was given to agreement between two 'primary diagnoses' and agreement between a primary and an alternate diagnosis, the psychiatrists still disagreed on 27% of the cases. When agreement between two alternate diagnoses was included, disagreement was only reduced to 18%. Remember that we are only dealing with six categories in the first place. Nevertheless, Beck et al. feel justified in concluding from 'plan B' that "the diagnosticians may have been closer in their appraisals than indicated by the scoring of only the preferred diagnosis" (p.357).

Their conclusion about the major portion of the study focusses on the fact that the degree of agreement on 'specific diagnoses', 54%, was greater than that expected by chance (p<.001) and greater than that obtained by previous studies--which had produced agreement ranging from 32% to 42%. They suggest that their own findings result from their having minimized "factors that would artificially lower or inflate the rate of concordance" (p. 356). There appears, however, to be something of a contradiction between their claiming to have minimized 'artificial influences' and their earlier statement that:

These comparisons indicate that the previous studies, while perhaps reflecting the concordance rate in actual practice at various institutions, underestimate the degree of agreement obtainable under more optimal conditions. (p. 355)

 

It seems perfectly clear that they explicitly set out to introduce 'artificial influences' with the intention of improving reliability. We shall return to this issue. It suffices, here, to quote the admission of Beck et al. that:

It seems apparent that the rate of agreement of 54% for the refined diagnostic categories is not adequate for research. Moreover it is questionable whether the rate of 70% for the major divisions would be considered adequate for research. (p. 355)

 

The research, however, continued.

An example of attempts to control for "inconstancy on the part of the diagnostician"--the second major source of disagreement identified by Ward et al. (1962)--is a study which had been undertaken ten years earlier. Seeman (1952) conducted "an investigation of interperson reliability after didactic instruction." Fifty-six medical students were presented with a series of lectures in descriptive psychiatry and then asked to diagnose six 'patients' presented by the lecturer. The degree of agreement was based on comparisons with "an official psychiatric diagnosis." Seeman reports that 58% of the students disagreed with the official diagnoses on at least two out of six cases; and 14% disagreed on four or more cases. Only 7% agreed on all six cases. (Again we must realize that these findings are derived from a situation in which an artificially low number of categories is being considered.) The most Seeman is willing to claim on the basis of his results is that "this degree of success (or agreement) was attributable to more than good fortune."

 

Biased Methodology: Attempting to Improve Reliability with Artificial Conditions

Before discussing the implications of these consistent findings of inadequate reliability, I wish to raise some serious conceptual and methodological problems involved in the studies mentioned thus far. These can be divided into two major issues. First, there is the question of whether these studies were realistic estimates of reliability as employed in actual clinical settings. Second, we must question the appropriateness of the statistical procedures employed in drawing conclusions from such studies. The Philadelphia group led by Beck and Ward draw a clear distinction between studies "reflecting the concordance rate in actual practice" and their own work which was designed to estimate "the degree of agreement obtainable under more optimal conditions." I have already divided, somewhat artificially, this whole body of literature into two phases: an early one in which the inadequacy of reliability was discovered by studies dealing with 'real life' diagnostic procedures and a later one involving studies designed explicitly to enhance reliability by improving training, definitions, or both. This distinction has been somewhat neglected in surveys of the reliability research. Perhaps it is irrelevant if we agree with the conclusion of Beck et al. that even under artificially optimal conditions, inter-rater reliability is still not adequate for research. Unfortunately, such conclusions have, as yet, had little or no impact on the extent to which the 'schizophrenia' notion is still employed. It seems important, therefore, to discuss the issue further.

In building a case for abandoning that notion, it seems necessary to point out exactly how, and to what extent, many of these studies produced artificially high estimates of its reliability.

There are at least six ways in which such studies biased their methodology in favor of increasing reliability. Two of them, we have already seen, were attempts to correct for the major sources of variance identified by Ward et al. (1962). In other words, diagnostic criteria were defined more clearly and diagnosticians trained more thoroughly. The practice of providing training of a kind not found in real clinical settings remains in vogue today. In the seventies, the group of researchers responsible for the development of the D.S.M. III (the most recent diagnostic handbook) conducted studies to determine the reliability of categories employing their specific 'Research Diagnostic Criteria' (RDC). One such study reports providing "considerable on-the-job training in the use of the RDC and many years of experience in interviewing psychiatric patients" (Spitzer et al., 1978, p.779). It is interesting, also, to note that members of the same research team had previously listed some reasons why many past studies would have been expected to produce reliability findings higher than those found in actual clinical practice. They included in their list the fact that "special efforts were made in some studies to have the participant diagnosticians come to some agreement on diagnostic principles prior to the beginning of the study" (Spitzer and Fleiss, 1974, p.344). Four years later they used these same artificial and biased techniques themselves.

The third area of bias comes in the form of non-independent observations. A core component of the methodology of inter-rater reliability is the expectation that the observations from which the level of agreement is estimated be completely independent from, i.e., unaffected by, each other. Beck et al. (1962) point out that at least four studies published prior to 1961, including the previously discussed study by Ash (1949), "permitted the second diagnostician in the pair to know the diagnosis made by the first... which would tend to inflate agreement" (p.355). More pervasive is a somewhat less obvious form of non-independence. Even where the diagnosticians are unaware of each other's actual diagnosis, it had been relatively common practice for researchers to present the same information to both, a practice that hardly reflects the level of independence employed in the 'real world', where different diagnosticians employ different types and amounts of data. Examples of researchers presenting the same data include the previously mentioned cross-cultural studies in which videotapes and written descriptions were employed. That some researchers fail to acknowledge that such an approach represents a methodological problem is evidenced by the following comment from Spitzer's team:

The reliability of the RDC categories with psychiatric inpatients has been tested in three studies. The first two involved joint interviews whereby one rater conducted the interview and the other merely observed. Both made independent ratings. (Spitzer et al., 1978)

 

Whether observing the same material violates the rule of independent observations is, perhaps, open to question; it definitely, however, violates the rule that research should reflect, as closely as possible, the real world. To describe sitting in on the other's interview as "merely" observing appears to involve an underestimation of the biasing effects of hearing the other's line of questioning. A comparable process was employed, again without any comment on its implications for the independence of the observations, in the previously cited study by Seeman, in which "the patients were presented and interviewed, and the medical students were permitted to ask whatever questions they saw fit" (1952, p.541). It is worthy of note that Spitzer et al. went on to describe the process "whereby the independent raters interviewed the patient at different times" as being "a more rarely used procedure" (1978, p.779). Furthermore, rather than viewing this procedure as the only way to measure inter-rater reliability while ensuring truly independent observations, they considered it, on the basis of the one- or two-day gap between the observations, to be an example of test-retest reliability. (And they did so despite the fact that the typical test-retest study employs the same diagnostician.) In either case, what they call test-retest reliability and what I call truly independent inter-rater reliability resulted in lower levels of agreement for most of the diagnostic categories, including 'schizophrenia,' than did the studies employing less-than-independent observations.

My fourth source of concern lies in the area of the amount of information available. Diagnoses, particularly those made on admission to large state institutions, are often made on the basis of cursory interviews. Ward et al. state that the main limitation of their own study "is that the diagnostician was confined to material elicited from the patient alone during an interview of approximately one hour" (p. 200). They suggest "that additional information. . . might have reduced diagnostic disagreement somewhat." Indeed it might have, but the fact remains that they have already exceeded the amount of information employed in the making of diagnoses in many real settings.

My fifth point has already been made by other reviewers.

The evidence for low agreement across specific diagnostic categories is all the more surprising since for the most part, the observers in any one study were usually quite similar in orientation, training and background. (Zubin, 1967, p.383)

I would disagree only with the suggestion that such low agreement is surprising. Spitzer and Fleiss (1974), in their own review, state that:

All the studies summarized here involved diagnosticians of similar background and training. . . . One can only assume, therefore, that agreement between heterogeneous diagnosticians of different orientations and backgrounds, as they act in routine clinical settings, is even poorer. (p.344)

 

I would add only that, with few exceptions (e.g. Seeman's medical students), the type of homogeneity employed with regard to level of experience was, as might be expected, in the direction of all diagnosticians being of the more experienced variety. The sixth way in which studies have been biased towards increased reliability consists of the small number of diagnostic categories typically employed. We have reviewed studies which involved a choice between the three major categories (and still found the level of agreement inadequate for research purposes). We have also seen how the degree of reliability dropped even further when 'specific' or 'refined' diagnoses were employed. What is of importance to realize is that such 'specific' diagnoses often required a choice between a greatly reduced number of categories (e.g. the six employed by Beck et al., 1961) compared to the extensive number of diagnoses between which a diagnostician chooses in a real life situation.

It must be acknowledged that the researchers discussed above stated explicitly that they were merely trying to see what could be achieved under optimal conditions. But the extent of the bias involved is easily overlooked. And I suggest that many of these optimal conditions are, in any case, impossible to implement in the 'real world.' While greater specificity of definition appears to be a worthwhile goal, it seems doubtful that diagnosticians in different hospitals, let alone different cultures, will ever be persuaded to employ identical criteria in anything resembling a uniform manner. (If this should ever be accomplished, psychometricians and computer programmers would quickly render mental health professionals, and their 'clinical judgement,' quite redundant.) The real world, it seems, is destined to continue dealing with truly independent observations, made by diagnosticians of heterogeneous training, background, experience and orientation, forced to choose between a vast array of diagnostic categories, each variously defined, on the basis--very often--of the briefest of interviews.

We must remember that, even under the most optimal conditions that our researchers could conjure up, the reliability of the 'schizophrenia' construct remained at an unacceptable level.

 

Biased Statistics

The second general issue I wish to discuss is that of the type of statistical analyses that have been used in estimating reliability. One of the major problems besetting this body of research has been the difficulty in comparing the findings of the various studies. Apart from there being four major approaches to estimating reliability--outlined in Chapter 17--there exist crucial differences within the approaches. We have already discussed, for instance, the varying extents to which researchers have attempted to provide optimal conditions in estimating inter-rater reliability. Comparison of studies employing various degrees of specificity of definition, training and homogeneity of diagnosticians, independence of observations, and varying numbers of categories, is somewhat futile. I have previously raised the question of whether we should be too concerned about the issue of comparability when almost all the studies demonstrate inadequate reliability anyway. There is, however, yet another way in which even these findings were overestimates. We are concerned, here, with inconsistencies in the methods employed to calculate the level of agreement. Spitzer and Fleiss (1974) identify the two most popular approaches:

Some studies . . . report the proportion of overall agreement, i.e., the proportion of all patients on whom there is agreement as to the presence or absence of the diagnosis. . . . Other studies . . . report the proportion of specific agreement, which is an index obtained by ignoring all subjects agreed upon as not having the given diagnosis. . . . This index can be interpreted as the probability that one diagnostician will make the specified diagnosis . . . given that the other has done so. The two indices are obviously not comparable. (pp. 341, 342)
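
As a concrete illustration of the difference between the two indices, the following sketch (in Python, with wholly invented counts for a single diagnosis rated by two diagnosticians) computes both. The 'specific agreement' formula used here is one common reading of the description quoted above, in which the patients both raters agree do not have the diagnosis are simply dropped from the calculation.

    # Invented counts for a single diagnosis rated by two diagnosticians
    # on 100 patients (purely illustrative).
    both_yes = 20    # both apply the diagnosis
    a_only   = 10    # first rater applies it, the second does not
    b_only   = 10    # second rater applies it, the first does not
    both_no  = 60    # neither applies it
    total = both_yes + a_only + b_only + both_no

    # Overall agreement: agreement on either the presence or the absence
    # of the diagnosis.
    overall = (both_yes + both_no) / total               # 0.80

    # Specific agreement: drop the patients both raters agree do not have
    # the diagnosis (one common reading of the index described above).
    specific = both_yes / (both_yes + a_only + b_only)   # 0.50

    print(overall, specific)    # 0.8 versus 0.5: plainly not comparable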

 

More important is the fact that both approaches artificially inflate the level of agreement by ignoring the fact that a certain level of agreement can be expected by chance; the exact level being determined by the base-rates of the categories under examination. For instance, if two diagnosticians employ the 'schizophrenia' construct in 70% of their diagnoses one would expect a .49 (.7 x .7) level of agreement by chance alone. Only results significantly greater than this should be interpreted as being evidence of reliability. Spitzer and Fleiss suggest that a solution to this problem would have been to employ a statistical procedure described by Cohen (1968). The statistic, "kappa", is calculated by the formula (Po - Pc) / (1 - Pc), Po being the observed and Pc being the expected proportions of agreement. Regardless of whether the observed proportion represents overall or specific agreement, one obtains identical values of "kappa." Apart from allowing comparison between studies using these different indices of observed agreement, "kappa" permits a reanalysis of the many studies which failed to allow for base-rates and, thereby, ran the risk of overestimating reliability. Before reporting the reanalysis offered by Spitzer and Fleiss, it is interesting to wonder why such a well-known artifact as the effects of base-rates on statistical analysis could have been ignored as consistently as it was by researchers considered to be experts on such issues. Again one wonders if the explanation is to be found, in part, in the extent to which psychology and psychiatry need to preserve their classification system in order to demonstrate their comparability to the natural sciences. Perhaps one of the more accepted constructs employed in our endless efforts to categorize human beings--that of 'scientist'--is as overinclusive and lacking in reliability as that of 'schizophrenia.' Certainly, this particular rater would have some difficulty agreeing with some of his colleagues on whether the construct could be applied to this body of researchers, especially if the criteria he was trained to utilize, prior to the study, included 'objectivity, defined by freedom from bias by personal, professional or political gain.'
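
To make the chance correction concrete, here is a minimal sketch of the arithmetic just described. The 70% base-rate comes from the example in the text; the 75% observed agreement is a purely hypothetical figure chosen for illustration.

    # Minimal sketch of Cohen's kappa as described above.

    def kappa(p_observed, p_chance):
        # Cohen's kappa: (Po - Pc) / (1 - Pc)
        return (p_observed - p_chance) / (1.0 - p_chance)

    # If both diagnosticians apply 'schizophrenia' to 70% of cases, chance
    # agreement on that label alone is .7 x .7 = .49; counting agreement on
    # its absence as well (.3 x .3 = .09), overall chance agreement is .58.
    p_chance = 0.7 * 0.7 + 0.3 * 0.3      # 0.58

    print(kappa(0.75, p_chance))          # roughly 0.40: far more modest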

Spitzer and Fleiss reanalyzed "the major studies of the reliability of psychiatric diagnoses", reporting the "kappa" statistic as well as the original results. As is obvious from the nature and purpose of this statistical procedure, the reliability of those categories with higher base-rates is particularly overestimated. For instance, the mean level of reliability for the three major categories, 'psychosis,' 'neurosis' and 'personality disorder,' fell to .55, .40 and .32, respectively, from a mean of 70% reported by Beck et al. (1961). In real life practices the diagnosis of 'schizophrenia' is among the most pervasive and we can assume that studies involving such practices, and failing to allow for base-rates, considerably overestimated the reliability of the construct. The 'schizophrenia' base-rates differed, of course, from study to study depending on the populations; and the extent to which reliability was overestimated varied accordingly. Beck et al., for instance, reported inter-rater reliability for this category to be .53. Spitzer and Fleiss, applying the kappa statistic, estimate the true level of agreement to have been .42. They report the average level of agreement for the construct among the major studies, allowing for base-rates, to be .57. They conclude, on the basis of their reanalysis, that "there are no diagnostic categories for which reliability is uniformly high" and that "the level of reliability is no better than fair for psychosis and schizophrenia." For the reasons already discussed, they also assert that agreement "in routine clinical settings is even poorer." Finally, they point out that, despite the introduction of blatantly biased research, on the basis of a survey covering the period 1956-1974, "there appears to have been no essential change in diagnostic reliability over time" (p.344).

The other question to be considered is whether a study that does produce agreement significantly greater than that expected by chance--even when allowing for base-rates--represents a clear indication of the level of reliability necessary to support continued usage of the construct under examination. Some studies (e.g. Seeman, 1952) report only that their findings are statistically significant and leave it up to the reader to draw their own conclusions as to the implications for continued usage. Others, such as Beck et al. (1961), explicitly state doubts as to the significance of 'significance'.

Although the degree of agreement was found to be statistically greater than could be expected from random pairing of diagnoses rendered independently by each of the psychiatrists according to his own system of preferences or biases, this does not in itself indicate that the reliability is high enough for research and treatment purposes. (p.355)

 

Kreitman (1961) offers the following example:

Clearly, if two psychiatrists examine a given patient and may use any one of ten diagnoses, then the probability of their concurring on a basis of random choice is 10 per cent. If, therefore, they examine 100 patients and reach the same diagnosis in 25, they are doing much better than would be expected by chance alone, and their performance would be very highly "significant." Far from being a source of satisfaction on this account, such a result would be little short of catastrophic from the viewpoint of clinical standards. In short, it is scarcely ever justified with questions of this kind to employ statistics involving the theory of chance or random effects, and where these are legitimately employed it must constantly be remembered that the level of confidence with which it may be said that a demonstrated association differs from chance expectations is a poor clue to its importance. (p. 879)
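
A brief sketch of Kreitman's arithmetic makes the point plain: the same figures that yield an impressively small p-value correspond to a chance-corrected coefficient that no one would call clinically adequate. (The figures are his hypothetical ones; the kappa-style correction is added here only for comparison with the preceding discussion.)

    # Kreitman's hypothetical figures: ten equiprobable diagnoses, 100
    # patients, agreement on 25 of them.
    from math import comb

    n, agreed, p_chance = 100, 25, 0.10

    # Exact binomial tail: probability of 25 or more agreements by chance.
    p_value = sum(comb(n, k) * p_chance**k * (1 - p_chance)**(n - k)
                  for k in range(agreed, n + 1))

    # The same figures expressed as a chance-corrected coefficient.
    kappa = (agreed / n - p_chance) / (1 - p_chance)

    print(p_value)   # far below .001: "highly significant"
    print(kappa)     # about 0.17: agreement is still dismal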

 

Thus, in the statistical analyses we find the seventh and eighth ways in which these researchers biased their investigations in favor of demonstrating an acceptable level of reliability. Yet even these artificially optimal conditions and biased statistical procedures failed to produce the desired results. Even with the assistance of biased procedures, our most highly trained experts cannot agree on who has 'schizophrenia' and who does not. In almost any other scientific discipline such a construct would have long since been abandoned.
