LEE H. ROSENTHAL, District Judge.
This Title VII disparate-impact suit challenges the City of Houston's system for promoting firefighters to the positions of captain and senior captain. Historically, the City has promoted firefighters based on their years of service with the Houston Fire Department ("HFD") and their scores on a multiple-choice exam. The format and general content of that exam are set out in the Texas Local Government Code ("TLGC") and in the collective bargaining agreement ("CBA") between the City and the firefighters' union, the Houston Professional Fire Fighters Association ("HPFFA"). Seven black firefighters sued the City, alleging that the promotional exams for the captain and senior-captain positions were racially discriminatory, in violation of the Fourteenth Amendment, 42 U.S.C. § 1981, and Title VII, 42 U.S.C. § 2000e-2. After mediation in February and March 2010, the City and the seven firefighters reached a settlement that included a proposed consent decree. The decree would require the City to implement changes to the captain and senior-captain promotion exams in two phases. The first phase required minor changes to the November 2010 captain exam. The second phase required more significant changes to the May 2011 senior-captain exam that would apply to future captain and senior-captain exams. The HPFFA intervened in the lawsuit and objected because the proposed consent decree changed the exams in ways inconsistent with the TLGC and the CBA. The HPFFA contended that the City and the plaintiffs had not shown discrimination that would permit this court to approve the consent decree.
This court bifurcated the proceedings to resolve the HPFFA's objections. The first stage addressed the narrow set of changes proposed for the November 2010 captain exam. The second stage addressed the broader changes proposed for subsequent captain and senior-captain exams. In the first stage, this court, with the HPFFA's agreement, approved changes to the November exam. This opinion addresses the HPFFA's objections to the proposed changes to the future captain and senior-captain exams.
The HPFFA vigorously objects that the proposed changes involve "a far-reaching and wholesale restructuring of the entire promotional process that goes beyond anything plaintiffs have even alleged in this lawsuit" and "bypass both the long-established protections of state law and the union's protected role in being the sole, collective voice for the city's firefighters." (Docket Entry No. 89, at 8). The City and the seven individual plaintiffs acknowledge that many of the proposed changes depart from the TLGC and the CBA but contend that the changes are necessary to remedy the Title VII violations they allege.
This court held an evidentiary hearing to consider the proposed changes to the May 2011 senior-captain exam and to future exams. The summary of the evidence shows the welter of expert opinions the parties presented on whether the existing format and content of the City's promotion exams for the captain and senior-captain positions have a disparate impact on African-American candidates; whether the existing exams are reliable and valid measures of the knowledge and qualities relevant to the promotion decisions; whether the existing exams are reliable and valid ways to compare candidates; and whether the proposed changes to the exams will provide reliable and valid exams and address disparate impact. The experts' testimony and submissions left the court with a sense of disquiet about the opinions expressed. The science of testing to measure and compare promotion-worthiness is admittedly imperfect. The expert witnesses, particularly for the City, acknowledged some errors and some incomplete aspects of their work in designing and administering the promotion exams. At best, all the witnesses' opinions amount to uncertain efforts to gauge how well different exam approaches measure, compare, and predict job performance. The analytical steps required by the applicable legal standards must be approached with a recognition of the limits of the expert testimony.
At the same time, courts clearly lack expertise in the area of testing validity. "'The study of employment testing, although it has necessarily been adopted by the law as a result of Title VII and related statutes, is not primarily a legal subject.' Because of the substantive difficulty of test validation, courts must take into account the expertise of test validation professionals." Gulino v. N.Y. State Educ. Dep't, 460 F.3d 361, 383 (2d Cir.2006) (quoting Guardians Ass'n of N.Y.C. Police Dep't, Inc. v. Civil Serv. Comm'n of City of N.Y., 630 F.2d 79, 89 (2d Cir.1980)). The combination of the lack of judicial expertise in this area and the limits of the expertise of those who do have training and experience supports a cautious and careful judicial approach.
Based on the parties' filings, the evidence, and the applicable law, this court finds that the City and the seven individual plaintiffs have shown that the captain and senior-captain exams violate Title VII. But this court also finds that some of the changes in the proposed consent decree violate the CBA and TLGC and that the City and the plaintiffs have not shown that all these changes are necessary to comply with Title VII. Based on these findings and conclusions, the proposed consent decree is accepted in part and denied in part. The use of situational-judgment questions and an assessment center are justified by the record evidence and are job-related and consistent with business necessity. But other parts of the proposed modified consent decree violate the TLGC and CBA, and the City and the plaintiffs have not shown that they are tailored to respond to the disparate impact alleged. Using the parties' descriptions of the proposed changes, the provisions that violate the TLGC and CBA without the necessary justification in the record, and which this court does not accept, are as follows:
(Docket Entry No. 69-2, at 29).
The reasons for finding these aspects of the proposed changes to the promotion examinations invalid, and the remaining aspects supported by the record and the applicable law, are explained below. This opinion first describes the promotion system in place before any changes; reviews the expert and other evidence relevant to assessing disparate impact; and analyzes whether, under the applicable law, the proposed settlement is tailored to remedying the disparate impact that is shown. A hearing is set for
Finally, because the terminology used by the HFD and the industrial psychologists who served as experts in this case produced a number of acronyms and abbreviations, a list of the most commonly used is attached to this Memorandum and Opinion.
The HFD has approximately 4,000 employees involved in firefighting. Ninety percent are in the Emergency Operations Division ("EOD"). Half of the EOD employees are at the "firefighter" level and perform "task-level jobs" such as retrieving and using fire hoses. The next rank above firefighter is "engineer operator" ("EO"). In addition to performing firefighters' tasks, EOs drive fire trucks and HFD ambulances and operate ladders and pumps. Firefighters outnumber EOs two-to-one. (Evidentiary Hr'g Tr. 115, Docket Entry No. 130).
Captains are ranked immediately above EOs. HFD captains are the "first line of supervisor position[s] in the fire department." A captain supervises the operation of fire engines, which are smaller fire trucks that carry hoses and pump water. Each HFD fire station has at least one fire engine and one captain. A captain supervises an EO and two firefighters assigned to an engine. When a captain misses a day of work, an EO may "ride up" and perform the absent captain's job duties.
Senior captains are ranked immediately above captains. A senior captain supervises the operations of "ladder trucks," which are large fire trucks with aerial ladders. Only half of the City's fire stations have a ladder truck with a senior captain in addition to a fire engine and captain. A senior captain may supervise up to eight firefighters, including EOs. When a senior captain misses a day of work, a captain may "ride up."
To summarize the promotional system that is discussed in detail below, promotion from EO to captain and from captain to senior captain depends largely on a candidate's score on a multiple-choice test. Any person meeting the experience requirement can take the test. An EO can apply for captain after four years in the fire department. A captain can apply for senior captain after two additional years of service as a captain. TEX. LOC. GOV'T CODE § 143.028(a). A candidate's length of service with the HFD will add some points to the test score, but the test score largely determines promotion.
The City makes promotion decisions based on a rank-order list of the candidates' test points added to their length-of-service points. For each captain or senior-captain position available during the three years after the exam, the top three candidates' names and scores are submitted to the HFD fire chief. The presumption is that the fire chief will select the candidate with the highest test score. If the fire chief selects the second or third highest scoring candidate, the chief must explain his reasons in writing. If a candidate is not selected for promotion within the three-year period, the candidate must retake the exam. These promotional procedures for the captain and senior-captain positions are based on the TLGC and the CBA.
The City of Houston adopted the Fire Fighter and Police Civil Service Act ("CSA"), codified as Chapter 143 of the TLGC, on January 31, 1948.
The promotional process begins when a city posts notice of an upcoming examination. Municipalities like the City of Houston, with populations greater than 1.5 million,
The TLGC requires that the test be in writing and forbids tests that "in any part consist of an oral interview." Id. § 143.032(c). The questions must "test the knowledge of the eligible promotional candidates about information and facts." Id. § 143.032(d). The information-and-fact questions "must" be based on:
Id. The questions must also be taken from the sources identified in the posted notices. Id. § 143.032(e). Finally, the "examination questions must be prepared and composed so that the grading of the examination can be promptly completed immediately after the examination is over." Id. § 143.032(f).
The exam grade determines whether the candidate will be placed on a promotion-eligibility list. Grading begins as soon as an individual candidate completes the exam. The candidate may remain present during the grading. Id. § 143.033(a). The multiple-choice exam score is based on a maximum grade of 100 points and is determined by the correctness of the answers to the questions. Id. § 143.033(c). Each candidate also receives one point for each year of seniority, with a maximum of 10 points. Id. § 143.033(b). In municipalities like Houston, a candidate must score at least 70 points on the exam to be eligible for promotion. Id. § 143.108(a).
All scores must be posted within 24 hours of the exam. Id. § 143.033(d). Each candidate may see the answers, grading, and source materials after the exam and can appeal a score within 5 days. Id. § 143.034(a). The City has 60 days to decide the appeal. Id. § 143.1015(a). A candidate who appeals is entitled to a hearing. Id. § 143.1015(b).
Once the scores are finalized, all candidates who pass are listed in rank order on a promotion-eligibility list. See id. § 143.021(c); id. § 143.108(f). When vacancies occur, the names of the three persons with the highest scores for the position are certified and provided to the head of the department with the vacancy. Id. § 143.036(b). This is known as the "Rule of Three." The TLGC provides that "[u]nless the department head has a valid reason" for not doing so, "the department head shall appoint the eligible promotional candidate having the highest grade on the eligibility list." Id. § 143.036(f). If the candidate with the highest grade is not selected, the department head must personally discuss the reason with that candidate and file a written explanation. Id.
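The Rule of Three reduces to a simple computation over the rank-ordered eligibility list. The following is a minimal sketch of that mechanic; the names, scores, and function name are hypothetical illustrations, not anything in the record.

```
def certify_rule_of_three(eligibility_list):
    """Certify the three highest-scoring candidates for a vacancy, as the
    TLGC's "Rule of Three" requires. eligibility_list holds (name, score)
    pairs; all names and scores here are hypothetical."""
    ranked = sorted(eligibility_list, key=lambda c: c[1], reverse=True)
    return ranked[:3]

candidates = [("Adams", 87.0), ("Baker", 86.0), ("Cruz", 85.0), ("Diaz", 84.0)]
top_three = certify_rule_of_three(candidates)
# Absent a valid written reason, the department head must appoint the
# first (highest-scoring) certified candidate.
print(top_three)
```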
Texas law establishes firefighters' right to collective bargaining. TEX. LOC. GOV'T CODE § 174.002(b) ("The policy of this state is that fire fighters and police officers, like employees in the private sector, should have the right to organize for collective bargaining, as collective bargaining is a fair and practical method for determining compensation and other conditions of employment. Denying fire fighters and police officers the right to organize and bargain collectively would lead to strife and unrest, consequently injuring the health, safety, and welfare of the public."); id. § 143.204(a) (stating that a firefighter association submitting a petition signed by the majority of the paid firefighters in the municipality "may be recognized ... as the sole and exclusive bargaining agent for all of the covered fire fighters"). The HPFFA is the sole and exclusive bargaining agent for the City's firefighters.
The TLGC allows the City and the HPFFA to enter into a written agreement binding when ratified by both. Id. § 143.206(a). Such an agreement can supersede the TLGC's provisions "concerning wages, salaries, rates of pay, hours of work, and other terms and conditions of employment to the extent of any conflict with the [written agreement]." Id. § 143.207(a). The agreement "preempts all contrary local ordinances, executive orders, legislation, or rules adopted by the state." Id. § 143.207(b).
The 2009-2010 CBA between the City of Houston and the HPFFA made few departures from the TLGC's exam provisions. Like the TLGC, the CBA required a grade of at least 70% for promotion eligibility. The CBA specified that the test must consist of "not less than 100 and not more than 150 questions." (Docket Entry No. 69-6, at 20). Unlike the TLGC, the CBA allowed only a .5-point increase in the score for each year of service, with a maximum of 10 points. The CBA also allowed a .5-point increase for each year of service in certain ranks. For example, an engineer operator applying to be a captain is awarded .5 points for each year of service as an engineer operator. (Id.). Aside from these changes, the 2009-2010 CBA provided that the TLGC "remain[s] in full force in the same manner as on the date [the CBA] became effective." (Id. at 13).
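To make the scoring arithmetic concrete, the following is a minimal sketch of a final-score calculation under these provisions. The function name is illustrative, and the premise that the 10-point maximum caps the combined seniority credit is an assumption; the excerpt quoted above does not spell out how the cap applies.

```
def cba_final_score(exam_score, years_of_service, years_in_rank):
    """Final promotion score under the 2009-2010 CBA as described above:
    .5 points per year of service plus .5 points per year in the
    qualifying rank, assumed capped at 10 points in total, added to the
    100-point exam score. A candidate needs an exam grade of at least
    70 to be promotion-eligible."""
    seniority = min(0.5 * years_of_service + 0.5 * years_in_rank, 10.0)
    eligible = exam_score >= 70
    return exam_score + seniority, eligible

# A hypothetical engineer operator with 12 years of service, 6 in rank,
# who scores 84 on the exam:
print(cba_final_score(84, 12, 6))  # (93.0, True)
```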
"Congress enacted Title VII of the Civil Rights Act of 1964, 42 U.S.C. § 2000e et seq., to assure equality of employment opportunities by eliminating those practices and devices that discriminate on the basis of race, color, religion, sex, or national origin." Alexander v. Gardner-Denver Co., 415 U.S. 36, 44, 94 S.Ct. 1011, 39 L.Ed.2d 147 (1974). Title VII's prohibitions include using "a particular employment practice that causes a disparate impact on the basis of race, color, religion, sex, or national origin" unless the employment practice "is job related for the position in question and consistent with business necessity." 42 U.S.C. § 2000e-2(k)(1)(A)(i). The plaintiffs alleged that the City's promotional procedures for captain and senior captain violated Title VII's disparate-impact provision.
"Congress intended voluntary compliance to be the preferred means of achieving the objectives of Title VII." Local No. 93, Int'l Ass'n of Firefighters, AFL-CIO v. City of Cleveland, 478 U.S. 501, 515, 106 S.Ct. 3063, 92 L.Ed.2d 405 (1986). To help employers comply with Title VII, Congress authorized the Equal Employment Opportunity Commission ("EEOC") to issue compliance guidelines (the "Guidelines"). The Guidelines "are not administrative regulations promulgated pursuant to formal procedures established by the Congress.
The Guidelines require employers who make promotional decisions based on test scores to maintain records of tests and test results. 29 C.F.R. § 1607.4(A). The Guidelines' rule of thumb for determining disparate impact is the "4/5 Rule." Under this Rule:
Id. § 1607.4(D). There are exceptions to the 4/5 Rule. The Guidelines state:
Id.
If analyzing an employer's test results under the 4/5 Rule shows "that the total selection process for a job has an adverse impact, the individual components of the selection process should be evaluated for adverse impact." Id. § 1607.4(C). The method for evaluating individual components is a "validity study." Id. § 1607.3(A). The Guidelines describe three types of validity studies: criterion-related-validity studies; content-validity studies; and construct-validity studies. Id. § 1607.5(A). A criterion-related-validity study analyzes whether test results correlate to "criteria that [are] predictive of job performance." Mark R. Bandsuch, Ten Troubles with Title VII and Trait Discrimination Plus One Simple Solution (A Totality of the Circumstances Framework), 37 CAP. U.L. REV. 965, 1089 (2009). A content-validity study analyzes whether test results correlate to "the knowledge, skills, and abilities related to that job." Id. A construct-validity study examines whether test results correlate to "general characteristics important to job performance." Id.
One court has summarized the content-validity and criterion-validity methods for evaluating a promotion or other employment test, as follows:
Banos v. City of Chicago, 398 F.3d 889, 893 (7th Cir.2005) (citations and internal quotations marks omitted).
Before conducting a validity study, an employer should conduct a "job analysis." 29 C.F.R. § 1607.14(A). Each type of validity study requires a different type of job analysis. Criterion-related-validity studies require "reviewing job information to determine measures of work behavior(s) or performance that are relevant to the job or group of jobs in question." Id. § 1607.14(B)(2). "These measures or criteria are relevant to the extent that they represent critical or important job duties, work behaviors or work outcomes as developed from the review of job information"; "[b]ias should be considered."
If one or more validity studies produces evidence "sufficient to warrant use of the procedure for the intended purpose under the standard of these guidelines," the promotional procedure is "properly validated." Id. § 1607.16(X). But if no validity study produces sufficient evidence, an employer "should initiate affirmative steps to remedy the situation." Id. § 1607.17(3). These steps, "which in design and execution may be race, color, sex, or ethnic 'conscious,' include, but are not limited to," the following:
Id.
On August 4, 2008, seven firefighters sued the City of Houston, alleging that the 2006 captain and senior-captain exams had a discriminatory effect on their promotion opportunities, in violation of § 1981 and 42 U.S.C. § 2000e-2. The seven individual plaintiffs contended that the 2006 exams had a disparate impact on the promotion of black firefighters to captain and senior-captain positions compared to white firefighters. Four plaintiffs — Dwight Bazile, Johnny Garrett, Trevin Hines, and Mundo Olford — were lieutenants denied promotion to captain. Three plaintiffs — George Runnels, Dwight Allen, and Thomas Ward — were captains denied promotion to senior captain.
The City and the plaintiffs settled. (Docket Entry No. 64). The HPFFA was not a party to the negotiations or settlement. The City agreed to promote Bazile, Olford, and Hines to captain; to promote Allen to senior captain; to allow Garrett to retire as a captain; and to allow Runnels and Ward to retire as senior captains. The City also agreed to pay each plaintiff backpay in amounts ranging from $376.80 to $23,075.46. (Docket Entry No. 69-2, at 2-8, 17-22, 26-27).
The settlement also contained a proposed consent decree to be submitted to the court for approval. (Id. at 9-10, 28-30). The decree required the City to implement changes to the captain and senior-captain exams in two phases. In the first phase, the City agreed to implement "modest" changes to the November 2010 captain exam. In the second phase, the City agreed to implement broader changes, beginning with the May 2011 senior-captain exam and applying to all future captain and senior-captain exams. The settlement agreement required the parties to give notice to the HPFFA of "this conceptual agreement" and to "meet in person or conference call [with the HPFFA] to explore potential adjustments of union suggestions prior to final settlement meeting." (Id. at 10). The settlement agreement also required approval by the Houston City Council and by this court.
The parties notified this court of the settlement and their intent to file the proposed consent decree. Before filing the decree, the parties moved to join the HPFFA to the suit because the proposed changes to the promotion exams conflicted with the TLGC and the CBA. (Docket Entry No. 69). The HPFFA moved to intervene and asked this court to bifurcate review of the proposed consent decree. The first step would be to consider the HPFFA's objections to the proposed changes to the November 2010 captain exam. The second stage would be to consider the HPFFA's objections to the proposed changes to the subsequent senior-captain and later captain and senior-captain exams. This court granted the motion and entered a scheduling order. (Docket Entry Nos. 70 & 71).
The HPFFA objected to certain proposed changes to the November 2010 captain exam. (Docket Entry No. 75). This court heard arguments and evidence on the objections on September 16, 2010. On the same date, and with the HPFFA's agreement, this court found that the existing captain exam disparately impacted black firefighters and entered an order allowing the City to implement the consent decree provisions changing the November 2010 captain exam. (Docket Entry Nos. 82 & 85). The proposed consent decree described those changes to the 2010 captain exam, as follows:
(Docket Entry No. 69-2, at 28-29).
The most significant change to the November 2010 captain exam was the inclusion of multiple-choice "situational-judgment" questions in addition to the "job-knowledge" questions used on previous exams. Situational-judgment questions present hypothetical situations encountered on the job and ask candidates how they would respond.
This court's order approved the inclusion of situational-judgment multiple-choice questions for the November 2010 captain exam based on a finding that "[t]he continued exclusive use of questions based on `fact' and `information' as stated in Local Government Code § 143.032(d) is likely to continue to result in adverse impact." (Docket Entry No. 85, at 2). The situational-judgment questions included in the November 2010 captain exam were developed by industrial-psychology consultants selected by the City, the plaintiffs, and the HPFFA (the "consultants").
This court also approved an additional consent-decree provision inconsistent with the TLGC and CBA. The TLGC and CBA allow a promotional candidate to be present while the candidate's exam is scored and require that scores be posted within 24 hours of the exam. TEX. LOC. GOV'T CODE § 143.033(a), (d). The consent decree required an "item analysis" of the score before it was finalized. Item analysis requires the consultants to aggregate data related to each question — or "item" — to eliminate questions that did not reliably measure an individual candidate's exam performance.
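The opinion does not detail the consultants' procedure, but the general technique can be sketched. The following illustration flags items that every candidate answers identically or whose corrected item-total correlation is low; the threshold and the sample answers are invented for illustration.

```
import numpy as np

def item_analysis(responses, min_item_total_r=0.15):
    """Flag exam items that do not reliably differentiate candidates.
    responses: 2-D array (candidates x items) of answers scored 0/1.
    The 0.15 threshold is an assumed value, not the consultants'."""
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)
    flagged = []
    for j in range(responses.shape[1]):
        difficulty = responses[:, j].mean()   # share answering correctly
        rest = totals - responses[:, j]       # total score excluding item j
        if responses[:, j].std() == 0 or rest.std() == 0:
            r = 0.0                           # no variation to correlate
        else:
            r = np.corrcoef(responses[:, j], rest)[0, 1]
        if difficulty in (0.0, 1.0) or r < min_item_total_r:
            flagged.append(j)
    return flagged

# Hypothetical: five candidates, four items; everyone answers item 3 correctly.
answers = [[1, 0, 1, 1],
           [1, 1, 0, 1],
           [0, 0, 1, 1],
           [1, 1, 1, 1],
           [0, 1, 0, 1]]
print(item_analysis(answers))
```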
Many of the TLGC and CBA requirements remained in place under the consent decree for the November 2010 captain exam. The consent decree still required job-knowledge questions. The exam that
The City administered the captain exam on November 17, 2010. The consultants conducted an item analysis after the exam. A panel consisting of representatives for the City, the individual plaintiffs, and the HPFFA met to review scoring. Initially, based on the item analysis, the consultants recommended giving candidates full credit for seven job-knowledge questions and for fourteen situational-judgment questions, effectively eliminating those questions as a way to differentiate among the candidates. In addition, the City's internal subject-matter experts ("SMEs") recommended giving full credit for one job-knowledge question and for six situational-judgment questions. The panel agreed with the SMEs' recommendation. (Docket Entry No. 94, at 2; Docket Entry No. 94-1 at 2-3). The panel also agreed that a candidate's score on the job-knowledge portion and the situational-judgment portion would be weighted equally in calculating the final score. (Docket Entry No. 94, at 2). Based on these decisions, a rank-order results list for the exam was created.
On January 12, 2011, the City advised this court that there were more than 200 appeals by the promotional candidates. On January 14, the City moved for additional time to finalize the scores and rank-order list. (Docket Entry Nos. 110 & 112). The City sought more time than the 60 days the TLGC allowed to decide whether to sustain the appeals. The HPFFA did not object to the request, and this court granted the motion. (Docket Entry No. 117). This court has not been updated on the status of the appeals or on promotions to captain under the November 2010 exam.
The HPFFA filed its objections to the proposed changes to the May 2011 senior-captain exam and to subsequent captain and senior-captain exams. (Docket Entry No. 89). The proposed changes are described as follows:
2. Job-Knowledge Written Test
3. Scenario-Based Computer-Objective Test
4. Assessment Center
(Docket Entry No. 69-2, at 29-30).
The parties agree that many of these proposed changes violate the TLGC and the CBA. Under the consent decree, the test designer "may" elect to use a "written job-knowledge test" depending on the job analyses for the captain and senior-captain positions. Whether such questions are included depends on the importance of the "knowledge" component compared to the skills, abilities, and other characteristics identified for the positions. If the test designer elects to use written job-knowledge questions, they are scored on a pass/fail basis. Only candidates who get a passing score remain promotion-eligible. A candidate's specific score on the job-knowledge test is otherwise irrelevant. The score is not used to produce a rank-order list and the Rule of Three is abandoned as to this part of the promotional process.
The remaining two parts of the captain and senior-captain exam do not consist exclusively of questions based on facts and information.
The scenario-based computer objective test is based on situational-judgment concepts, using a computer to present hypothetical situations that captains and senior captains would likely encounter on the job. One type of situational-judgment question identified in the consent decree is an "in-basket exercise." In such an exercise, a candidate is given documents or other information creating a hypothetical fact pattern and is asked to analyze or describe a response. An in-basket exercise testing training abilities might ask the candidate to review a firefighter's performance evaluations and identify what training that firefighter needs to improve. (Dr. Brink Report 48). The consent decree allows for scoring these situational-judgment questions on a full-credit, partial-credit, and zero-credit basis, provided that the "same responses" receive the same credit. The consent decree requires ranking the candidates according to their scores on this part. In the initial settlement agreement, the candidates' scores on this situational-judgment component determined whether the candidate could proceed to the final phase of the exam, but the modified settlement agreement provides that all candidates advance. (Compare Docket Entry No. 69-2, at 29, with Docket Entry No. 86-1, at 3).
The final exam component is an assessment center. "An assessment center consists of multiple exercises simulating job activities that are designed to allow trained observers, or assessors, to make judgments about candidates' behaviors as related to job performance." (Dr. Brink Report 47). One type of simulation used in assessment centers is a "role play." A role play "is a simulation of a face-to-face meeting between the candidate (playing the role of a job incumbent) and a trained role player acting as a person incumbents frequently encounter on the job (such as a subordinate or citizen)." (Id.). Assessment-center activities such as role play violate the CBA and TLGC. See TEX. LOC. GOV'T CODE § 143.032(c) (forbidding tests that "in any part consist of an oral interview"). "Assessors" score promotional candidates' performance on the assessment-center activities. Although there are preset criteria distinguishing better from worse performance, the scoring system is subjective and violates the TLGC and the CBA.
Another inconsistency between the TLGC and the CBA on the one hand and the consent decree provisions on the other is the requirement in the consent decree to "band" the promotional candidates' assessment center scores. "Banding" scores means adjusting the individual test scores based on statistical analyses showing the likelihood that: (1) a candidate could score higher or lower on the same exam; and (2) the individual assessor could have given the candidate a higher or lower score for the same performance. Banding tends to convert individualized score differences into homogenized "bands" of more uniform scores. For example, three candidates' scores of 85, 86, and 87 might be "banded" as one score of 86, depending on the results of the statistical analysis. Banding is like converting individual scores of 95%, 97%, and 100% on a 100-question multiple choice test into three "As" that are viewed as identical. The conversion is based on statistical analysis showing that an individual scoring 95% on the exam has the same chance of scoring 100% on the exam as the person scoring 100% on the exam has of scoring 95%.
(Evidentiary Hr'g Tr. 100, Docket Entry No. 130).
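The opinion does not reproduce the statistical analysis behind the proposed bands. One standard approach in the testing literature, sketched below, sets the band width from the standard error of the difference between two scores; the standard deviation and reliability figures here are assumptions, not values from the record.

```
import math

def sed_band_width(sd, reliability, z=1.96):
    """Band width based on the standard error of difference: two scores
    closer together than this cannot be distinguished with roughly 95%
    confidence, so they are treated as tied within a band."""
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    return z * sem * math.sqrt(2)           # standard error of difference

def band(scores, width):
    """Group rank-ordered scores into bands no wider than `width`."""
    bands, current, top = [], [], None
    for s in sorted(scores, reverse=True):
        if top is None or top - s <= width:
            current.append(s)
            top = s if top is None else top
        else:
            bands.append(current)
            current, top = [s], s
    bands.append(current)
    return bands

width = sed_band_width(sd=10.0, reliability=0.90)  # about 8.8 points
print(band([87, 86, 85, 74, 72], width))           # [[87, 86, 85], [74, 72]]
```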
Banding is inconsistent with the Rule of Three. Under the proposed consent decree, the final promotion decision is based on the banded assessment-center scores. Names are submitted by score "bands," not subject to the Rule of Three that would have applied under the TLGC and the CBA. Under the Rule of Three, if the three highest scores were 85, 86, and 87, the names of those applicants would be submitted. The person who scored the 87 would be selected unless the decision-maker provided a written reason for selecting the person who scored the 86 or 85. Under the banding system, the three individuals would be treated by the decision-maker as having the same score. The "band" might also be larger than three persons; its size would be determined by statistical analyses rather than a preset number. The consent decree requires the decision-maker to select one within the band and to provide a written explanation for the selection.
The consent decree does not state the role of a candidate's race or how the decision-maker may consider race in choosing whom to promote within a band. There was testimony that using race as a factor to select a candidate within a band could reduce the exam's disparate impact on African-American applicants. Within a band, all applicants are viewed as equal. (Evidentiary Hr'g Tr. 107-08, Docket Entry No. 130). But the consent decree does not explicitly authorize race-based promotional decisions.
At an evidentiary hearing, the parties presented evidence as to (1) whether the senior-captain exam disparately impacted black firefighters, and (2) whether the proposed changes to the captain and senior-captain exams were justified by business necessity.
On February 8, 2006, the City of Houston administered the senior-captain exam to 221 promotional candidates. Of the 221 candidates taking the exam, 172 were white, 15 were black, 33 were Hispanic, and 1 was "other." The 212 candidates who passed by scoring above 70 consisted of 166 Caucasians, 13 African-Americans, 32 Hispanics, and 1 "other." The City promoted 70 candidates based on the rank-order list of those who passed the exam. Of those promoted, 59 were Caucasian, 2 were African-American, 8 were Hispanic, and 1 was in the "other" category. (Evidentiary Hr'g Ex. 7, Dr. McPhail Report, at 5).
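The 4/5 Rule comparison for these figures is simple arithmetic. The sketch below reproduces the adverse-impact ratio for black and white candidates from the numbers recited above; the resulting 0.39 matches the 2006 entry in Dr. Lundquist's table set out later in this opinion.

```
# February 2006 senior-captain exam figures recited above.
black_candidates, black_promoted = 15, 2
white_candidates, white_promoted = 172, 59

black_rate = black_promoted / black_candidates   # about 0.133
white_rate = white_promoted / white_candidates   # about 0.343

air = black_rate / white_rate                    # adverse-impact ratio
print(f"AIR = {air:.2f}; 4/5 Rule violated: {air < 0.8}")  # AIR = 0.39; True
```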
The following experts submitted reports or testified as to whether the 2006 senior-captain exam disparately impacted black firefighters:
All the experts agreed that the "total selection process" for promoting HFD captains to senior captain showed disparate impact under the 4/5 Rule. The Guidelines require that "[a]dverse impact is determined first for the overall selection process for each job." Adoption of Questions and Answers to Clarify and Provide a Common Interpretation of the Uniform Guidelines on Employee Selection Procedures, 44 Fed.Reg. 11966, 11998 (1979) [hereinafter Guidelines Questions & Answers]. "The 'total selection process' refers to the combined effect of all selection procedures leading to the final employment decision such as hiring or promoting." Id. The experts agreed that the rate at which African-American candidates were promoted to senior captain was less than four-fifths of the rate at which Caucasian candidates were promoted.
Both parties' experts testified that a 4/5 Rule violation is an unreliable basis to find disparate impact when the population size of one group is small. Only 17 black firefighters were eligible for promotion to senior captain. The experts agreed that this is too small a number to make a 4/5 Rule violation sufficient to find disparate impact. (Dr. Brink Report 10; Dr. Lundquist Aff. 6; Dr. Arthur Aff. 2-3; Dr. McPhail Report 6-7; Evidentiary Hr'g Ex. 12, Dr. Morris Report, at 0010447). There was general agreement among the experts that when group populations are small, statistical analyses should be used to determine whether the 4/5 Rule violation is the product of "chance." This requires determining the statistical significance of the 4/5 Rule violation. Dr. Morris's report noted that the 4/5 Rule risks "sampling error," which statistical-significance analysis mitigates. Dr. Morris's report stated:
(Dr. Morris Report 0010446-47). Dr. Brink's report discussed how the 4/5 Rule risks "Type I" error by leading to the conclusion "that adverse impact exists, when in reality the difference in selection rates is a result of sampling error (or chance)." (Dr. Brink Report 12). Dr. Brink's report explained how statistical tests can "control the potential amount of Type I error":
(Id.).
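The significance tests the experts discussed can be illustrated on the same promotion figures arranged as a 2x2 table. The sketch below uses scipy's implementations of the Fisher exact and Pearson chi-square tests; the Z_IR statistic has no standard library implementation and is omitted here.

```
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

# Rows are race (black, white); columns are outcome (promoted, not
# promoted), from the 2006 senior-captain figures recited above.
table = np.array([[2, 13],      # black: 2 promoted of 15
                  [59, 113]])   # white: 59 promoted of 172

_, p_fisher = fisher_exact(table)
chi2, p_chi2, dof, expected = chi2_contingency(table)

# A p-value above the conventional .05 level means the 4/5 Rule violation
# could reflect sampling error, the Type I risk Dr. Brink described.
print(f"Fisher exact p = {p_fisher:.3f}; chi-square p = {p_chi2:.3f}")
```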
The experts also identified peer-reviewed journal articles critical of the 4/5 Rule. In 1979, Anthony Boardman and Irwin Greenberg authored analyses showing that the 4/5 Rule could lead to both Type I (falsely identifying disparate impact when none exists) and Type II (failing to identify disparate impact when it does exist) statistical errors. See Irwin Greenberg, An Analysis of the EEOCC 'Four-Fifths' Rule, 25 MGMT. SCI. 762 (1979); Anthony E. Boardman, Another Analysis of the EEOCC 'Four-Fifths' Rule, 25 MGMT. SCI. 770 (1979). A recent article similarly concluded that "there is a fairly high false-positive rate for the 4/5ths rule used by itself." Phillip L. Roth et al., Modeling the Behavior of the 4/5ths Rule for Determining Adverse Impact: Reasons for Caution, 91 J. APPLIED PSYCHOL. 507, 519 (2006). The article's authors cautioned that "other factors (e.g., sample size) were quite important" and recommended using "a test such as Fisher's exact test or a chi-square test to mitigate false-positives." Id. Also noting the 4/5 Rule's shortcomings, Scott Morris and Russell Lobsenz recently proposed a "more complex" statistical technique, the Z_IR, for testing the statistical significance of adverse-impact ratios.
The experts applied three tests to measure statistical significance: the Fisher exact; the Pearson chi-square; and the Z_IR. The Fisher exact test calculates the exact probability, given the totals observed, of selection outcomes at least as disproportionate as those actually observed, assuming selection is unrelated to race.
The Pearson chi-square test estimates the probability of obtaining the observed frequency table under the null hypothesis that race and selection are unrelated.
Unlike the Pearson chi-square and Fisher exact tests, the Z_IR test measures the statistical significance of the adverse-impact ratio itself.
Neither the Fisher exact, the Pearson chi-square, nor the Z_IR test showed a statistically significant disparity, at the conventional .05 level, in the promotion rates under the 2006 senior-captain exam.
Dr. Brink offered other grounds to support his conclusion that under the 4/5 Rule, there was valid evidence of disparate impact. One was the "N of 1" or "flip-flop rule." Dr. Brink stated that the "N of 1 rule calculates an adjusted impact ratio assuming one more person from the minority group ... and one less person from the majority group were hired (and, consequently, one less minority and one more majority were not hired). If the resulting selection rates are such that the minority selection rate is now larger than the majority selection rate, selection rate differences may be attributed to small sample sizes." (Dr. Brink Report 10).
The Guidelines illustrate the application of the flip-flop rule. The Guidelines present a hypothetical in which 80 Caucasians
Guidelines Questions & Answers, 44 Fed. Reg. at 11999.
Dr. Brink's report showed that when the flip-flop rule discussed in the Guidelines was applied to the February 2006 senior-captain exam, the outcome did not change: the selection rate for African-Americans remained below the selection rate for Caucasians. Dr. Brink concluded that this was further evidence that the sample size was not too small to invalidate the disparate-impact evidence provided by the 4/5 Rule.
Dr. Brink also cited the "one-person rule" as additional support for this conclusion. The "one-person rule is computed by taking the difference between actual minority hires ... and the expected minority hires .... If the difference is less than 1, then violations of the 4/5ths rule are likely due to small sample sizes." (Dr. Brink Report 10). When the one-person rule is applied to the results of the 2006 senior-captain exam, the difference between actual minority hires and expected minority hires is not fewer than one. Dr. Brink concluded that "[i]n all cases, the one-person rule indicates that the violations are not due to small samples." (Id.).
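Both of Dr. Brink's small-sample checks are short computations. The sketch below applies his formulations to the 2006 senior-captain figures; because the quoted passage does not specify the base rate used for "expected minority hires," the overall selection rate is assumed here.

```
def flip_flop(min_sel, min_n, maj_sel, maj_n):
    """N-of-1 rule: shift one selection from the majority to the minority
    group. True means the minority rate would then exceed the majority
    rate, so the 4/5 Rule violation may just reflect small samples."""
    return (min_sel + 1) / min_n > (maj_sel - 1) / maj_n

def one_person(min_sel, min_n, total_sel, total_n):
    """One-person rule: True means actual minority selections fall short
    of the expected number by less than one person, suggesting a
    small-sample artifact. Expected hires are computed from the overall
    selection rate, an assumption the quoted passage does not spell out."""
    expected = min_n * (total_sel / total_n)
    return (expected - min_sel) < 1

# 2006 senior-captain figures: 2 of 15 black and 59 of 172 white
# candidates promoted; 70 promotions among 221 candidates overall.
print(flip_flop(2, 15, 59, 172))    # False: the rates do not flip
print(one_person(2, 15, 70, 221))   # False: the shortfall exceeds one
```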
Dr. Brink's report deemphasized the value of statistical analysis to determine disparate impact for the total selection process for senior-captain promotions. (Id. at 15-16). Dr. Brink cautioned against "dogmatic adherence" to the social scientists' use of .05 as the statistical-significance level, stating that the .05 level should not be used in all contexts and noting that lower statistical results can be meaningful. (Id.). On this view, statistical analyses estimate only the likelihood that a different pool of applicants for promotion to senior captain would produce a similar disparity. But in the context of disparate-impact analysis, the applicant pool is fixed; there is no other relevant pool. Dr. Brink quoted from an article by Collins and Morris stating that:
(Id. at 12). Dr. Brink also testified that research from Collins and Morris suggested that the Fisher exact test was "overly conservative" and recommended abandoning the test as a measure of disparate impact. (Evidentiary Hr'g Tr. 157, Docket Entry No. 130).
Dr. Lundquist analyzed data relating to the senior-captain promotion process beyond the 2006 senior-captain exam.
Year      Black Candidates    White Candidates    Black Candidates Promoted    White Candidates Promoted    AIR
1993      17                  147                 0                            36                           0.00
1996      13                  84                  1                            21                           0.31
1999      12                  122                 0                            31                           0.00
2002      8                   104                 2                            64                           0.41
2006      15                  172                 2                            59                           0.39
2009      14                  136                 2                            31                           0.63
Overall   79                  765                 7                            242
Dr. Lundquist testified that a historical analysis using data from multiple test administrations provides "a much more accurate picture of adverse impact of a promotional process." (Dr. Lundquist Aff. 6, Docket Entry No. 93-2). Dr. Arthur, however, was dismissive of any historical-aggregation approach because the data would necessarily extend beyond the 2006 exam that was at issue. Dr. Arthur did state that if one were to look to historical data, "the correct analysis would be a Mantel-Haenszel chi-square test." (Dr. Arthur Aff. 3, Docket Entry No. 89-1). Dr. Lundquist agreed with Dr. Arthur that the Mantel-Haenszel test was the most methodologically appropriate analysis. The Mantel-Haenszel test allows statisticians to investigate the consistency of data trends over time while avoiding errors due to aggregation. Applying the Mantel-Haenszel test to the senior-captain exams, Dr. Lundquist found a statistically significant pattern of adverse impact against African-Americans.
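For illustration, the Mantel-Haenszel analysis both experts endorsed can be run over year-by-year 2x2 tables built from Dr. Lundquist's figures. The sketch below uses the StratifiedTable class from statsmodels; it demonstrates the method rather than reproducing Dr. Lundquist's actual computation.

```
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

# One entry per exam year: (black promoted, black candidates,
# white promoted, white candidates), from Dr. Lundquist's table above.
years = {
    1993: (0, 17, 36, 147),
    1996: (1, 13, 21, 84),
    1999: (0, 12, 31, 122),
    2002: (2, 8, 64, 104),
    2006: (2, 15, 59, 172),
    2009: (2, 14, 31, 136),
}
tables = [np.array([[bp, bn - bp],    # black: promoted, not promoted
                    [wp, wn - wp]])   # white: promoted, not promoted
          for bp, bn, wp, wn in years.values()]

st = StratifiedTable(tables)
result = st.test_null_odds()  # Mantel-Haenszel test of no association
print(f"MH statistic = {result.statistic:.2f}, p = {result.pvalue:.4f}")
print(f"pooled odds ratio = {st.oddsratio_pooled:.2f}")
```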
The City has not conducted a full validity study of its promotional exams. (Dr. Brink Report 20). The City has, however, produced job descriptions for both the captain and senior-captain positions. (Docket Entry No. 96-1, at 54). The job descriptions contain a detailed listing of the responsibilities for each. The responsibilities for the captain position include: (1) supervising "the emergency response of their assigned apparatus to ensure a safe and timely response to alarms"; (2) supervising
Under the "specifications" section are subheadings for "basic knowledge," "specific knowledge," "advanced skills," and "ability to." Bullet points beneath the subheadings describe the KSAOs for captains. The "basic" and "specific" knowledge identified for captains include: (1) knowledge directly related to firefighting — such as knowledge about municipal and private fire protection, building construction, and water supplies; (2) knowledge about federal, state, and local laws; (3) knowledge about HFD operational procedures; and (4) other types of knowledge, such as about Robert's Rule of Order. (Docket Entry No. 96-1, at 56). The advanced skills include management and supervision, organization, problem-solving, firefighting strategy and tactics, teaching, and public relations. (Id.). The abilities include: communication abilities, such as establishing and maintaining working relationships with subordinates as well as "impromptu public speaking"; tactical abilities, such as implementing firefighting strategies; leadership abilities, such as recognizing and responding to individual and group needs; and administrative abilities. (Id. at 57). The senior-captain KSAOs are similarly organized and described.
Dr. James C. Sharf prepared a report for the HPFFA concluding that the promotional exams are valid based on "validity generalization," a validation method not described in the Guidelines. Dr. Sharf, an employment consultant specializing in risk management, has published "a dozen professional publications including two peer-reviewed chapters: 1) in The Society for Industrial and Organizational Psychology Practice Series (2000), and 2) in American Psychological Association Books (2008)." Dr. Sharf served as Special Assistant to the Chairman of the EEOC from 1990 to 1993 and as the EEOC's chief psychologist from 1974 to 1978. Dr. Sharf has "over three decades' experience developing, implementing and defending selection and appraisal systems in both the public and private sector." He is a Fellow of the Society for Industrial and Organizational Psychology, a Fellow of the Association for Psychological Science, and a Fellow of the American Psychological Association. (Dr. Sharf Report 2, 5).
Acknowledging that the Guidelines identify content-related, criterion-related, and construct-validity studies as proper validation methods, Dr. Sharf argued that these studies establish only a starting point for validity analysis. Dr. Sharf pointed out that the EEOC did not intend the Guidelines to preclude "other professionally acceptable techniques with respect to validation of selection procedures" because test-validity science has evolved since the Guidelines' publication. 29 C.F.R. § 1607.14; see also Guidelines Questions & Answers, 44 Fed.Reg. at 12002 ("The validation provisions of the Guidelines are designed to be consistent with the generally accepted standards of the psychological profession."). Dr. Sharf also pointed out that since the Guidelines were published, the American Psychological Association has revised the Standards for Educational and Psychological Tests ("APA Standards") and the Society for Industrial and Organizational Psychology has revised the Principles for the Validation and Use of Personnel Selection Procedures (the "SIOP Standards").
Dr. Sharf argues that one basis for deemphasizing the Guidelines' validation methods is the criticism by industrial psychologists of the Guidelines' recommended approach to job analysis. Dr. Sharf characterized the Guidelines' job analysis as requiring a detailed list of job tasks based on "observable behaviors." Dr. Sharf's report identified several scholarly articles published in psychology journals suggesting that job analyses based on detailed task descriptions are unreliable or unhelpful. For example, a 1981 article in the Journal of Applied Psychology by Schmidt, Hunter, and Pearlman found that detailed job analyses based on observable behaviors created the appearance of large differences between jobs "that are not of practical significance in selection." (Id. at 11). Similarly, the SIOP Standards require only a "general" description of KSAOs and allow a "less detailed analysis... when there is already information descriptive of the work." (Id. at 15).
A second basis for Dr. Sharf's criticisms of the Guidelines was their emphasis — reflected, for example, in the job-analysis provisions — on "observable behaviors." Dr. Sharf contrasted "observable behaviors" with "unobservable cognitive skills," which the Guidelines do not emphasize. Dr. Sharf argued that the Guidelines' focus on validity studies measuring observable behaviors produces less reliable results than validity studies measuring cognitive skills. Dr. Sharf's report cited a number of scholarly articles finding that a promotional candidate's cognitive skills better predict performance after promotion than observable behaviors. (Id. at 30-31). He summarized the articles' findings as follows: "The conclusion from these studies is that pencil and paper tests of cognitive ability such as verbal, quantitative and technical/problem solving abilities not only predict job performance but that they predict job performance better than any alternative — the general case of validity generalization research empirically built upon measures of cognitive ability." (Id. at 32). Dr. Sharf acknowledged that some job learning occurs for any candidate, but he argued that recent research shows that individuals with greater cognitive ability will acquire the skills necessary to perform a job successfully. (Id. at 36-37). In light of these studies, Dr. Sharf concluded that the Guidelines' "emphasis on `observable behavior' is both illogical and out of touch with contemporary industrial psychology because there is no knowledge, skill or ability which does not depend on unobservable mental processes involving cognitive abilities." (Id. at 11).
Dr. Sharf urged that the more reliable method for analyzing validity is validity generalization. "Validity generalization is industrial psychology's science of the general case demonstrating empirically that the cognitive abilities most studied in industrial psychology — verbal, quantitative and technical abilities — are also the best predictors of job performance." (Id. at 14). Generally, validity generalization analyzes whether a test reliably measures the verbal, quantitative, and technical skills a job requires rather than its KSAOs. The SIOP Standards recognize validity generalization as one method for validating a cognitive-based test. A former EEOC senior attorney has also argued that validity generalization is a valid measure of a test's validity. (Id. at 13).
Dr. Sharf noted that the additional responsibilities identified in the HFD job description for the senior-captain position related to assuming control, maintaining control, coordinating, supervising, and evaluating the "most effective use." (Id. at 17). A captain is responsible for "siz[ing]-up the scene at emergency medical calls, as a first responder, in order to begin providing needed emergency medical intervention to mitigate the problems encountered within the HFD guidelines." (Id. at 19). A senior captain is also responsible for assuming "control of medical emergencies when arriving first," maintaining "control of patient care until the arrival of a higher medical authority," and supervising or performing "medical intervention in accordance with one's level of training." (Id. at 19). Based on the emphasis on control and supervision in the HFD job description and a 1986 article published by Hunter & Hunter in Psychological Bulletin suggesting that "the more complex the job, the better cognitive ability predicted job performance," Dr. Sharf concluded that the senior-captain position requires greater cognitive ability than the captain position.
Dr. Sharf also compared the City's job descriptions to the job analysis for municipal firefighters created by the United States Department of Labor (the "DOL Analysis").
Using the recategorized HFD job descriptions and the DOL Analysis, Dr. Sharf prepared a "combined job analysis." (Id. at 22-30). Dr. Sharf then discussed whether the HFD captain and senior-captain exams validly measured the cognitive skills identified in the combined job analysis and found that they did. Dr. Sharf's report does not provide the analysis that led to his conclusion or refer to statistical studies to provide support. Dr. Sharf relies on scholarly articles arguing that individual differences in cognitive performance predict corresponding differences in job performance.
Dr. Sharf's expert report also criticized noncognitive measures of test validity. He argued that video- or situational-based assessments are poor simulations of actual situations a captain or senior captain encounters. He colorfully explained that "[a] video depiction is hardly the stress of an adrenalin rush from the danger of a whiff of noxious chemicals or a lung full of searing smoke." (Id. at 37). Dr. Sharf also emphasized that the knowledge a candidate brings to the position of captain — "what you think" — will impact how he responds to emergency situations. He argued that having the technical knowledge required to respond to certain emergency situations is a prerequisite to making the appropriate response.
Dr. McPhail conducted a criterion-related validation of the 2006 captain exam. He did not conduct a similar validation study for the 2006 senior-captain exam. Dr. McPhail's criterion-related validation study compared candidates who were promoted to captain based on the 2006 exam with those who were not but who "rode up" as captain after the exam. The validation study analyzed whether there was a relation between success on the exam and performance as captain by comparing the promotional candidates' 2006 exam scores with performance evaluations created by Dr. McPhail and filled out by supervisors. Dr. McPhail concluded that the validation study showed only "equivocal" results as to the captain exam's validity.
Initially, Dr. McPhail identified "a set of performance dimensions appropriate and important to effective performance as a Captain," using the HFD's job description for the position, the exam, and information from internal SMEs. (Docket Entry No. 37-1, at 6). Dr. McPhail's report identified nine specific performance dimensions: "emergency operations"; "station management"; "technical knowledge"; "management resources"; "supervision"; "problem solving"; "interpersonal effectiveness"; "professional orientation & commitment"; and "overall job performance."
Using the performance dimensions, the behavioral anchors, and the SME evaluations, Dr. McPhail created a Performance Dimension Rating Form ("PDRF"). (Id. at 40). Each PDRF asks a supervisor to analyze one performance dimension of the individual to be scored. The performance dimension is identified at the top of the PDRF. The five rating categories are placed in a left-hand column and the behavioral anchors are listed beneath each. Within each rating category are twelve possible scores. The possible scores start at one, which corresponds to the "unacceptable" rating category — the lowest rating possible — and end at sixty, which corresponds to the "exemplary" rating. (Id.). An individual whose performance in emergency operations is "unacceptable" can receive a score from one to twelve and an individual whose performance is exemplary can receive a score from forty-nine to sixty. (Id.).
The PDRFs were circulated to HFD district chiefs, who supervise captains. The district chiefs were asked to score the performance dimensions for captains promoted after the January 2006 captain exam and EOs who were not selected for captain after the exam but had ridden up as captains after that. In January 2006, 438 firefighters took the exam and 157 were promoted to captain. Of those who took the exam but were not promoted, 281 EOs rode up as captains. Of the 84 district chiefs asked to evaluate performances, 77 submitted evaluations. The results of the supervisors' assessments of the captains and EOs were then compared to the scores on the January 2006 captain exam. (Id. at 60). The raw data for the PDRFs showed a mean score of 79.35, with a 10.13 standard deviation, for the 438 firefighters who took the January exam; a mean score of 90.10 with a 3.83 standard deviation, for the 157 firefighters promoted to captain based on the January exam; and a mean score of 73.35, with a 7.13 standard deviation, for EOs who took the January exam but who were not promoted and later rode up as captains. (Id. at 45).
Dr. McPhail constructed a "validation sample" of 199 firefighters to test the validity of the 2006 exam. Using the validation sample, Dr. McPhail conducted statistical analyses "to evaluate evidence for the validity of the promotional examination." (Id. at 52). The analyses included "zero-order (bivariate) correlations for three different samples: the entire validation sample, only those promoted to captain, and only those not promoted to captain." (Id.). Dr. McPhail found that the bivariate correlations "appeared to provide supporting evidence for the validity of the examination." He noted that the correlations "were all significant and ranged from r = .37 to r = .51." (Id.). But when Dr. McPhail placed the results of the bivariate analysis on a scatter plot, he noted that the relationships between the "examination scores and criteria indicated barbell shaped bivariate distributions, in which most of the performance ratings for captains were located in the upper end of the distribution and most of the performance ratings for engineers/operators were located in the lower end of the distribution." (Id.). He explained that this supported at least two inferences: (1) "the captain promotional examination effectively taps the intended construct domain which results in the observation that those scoring higher on the exam tend to have higher performance"; and (2) "because promotional examination scores were used as a basis for promotion ... scores should be correlated with captain performance because those at the formal captain rank have a greater opportunity to acquire knowledge and skills integral to effective functioning." (Id.).
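The zero-order correlations Dr. McPhail reported are ordinary Pearson correlations computed for the whole sample and within each subgroup. The sketch below uses invented data constructed to mimic the "barbell" pattern he described; only the method, not the numbers, reflects the record.

```
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Invented scores: promoted candidates cluster high on both the exam and
# the ratings, unpromoted EOs cluster low, producing a barbell-shaped
# bivariate distribution like the one Dr. McPhail observed.
exam = np.concatenate([rng.normal(85, 5, 60), rng.normal(70, 5, 60)])
rating = np.concatenate([rng.normal(90, 4, 60), rng.normal(73, 7, 60)])
promoted = np.array([True] * 60 + [False] * 60)

for label, mask in [("entire sample", slice(None)),
                    ("promoted only", promoted),
                    ("not promoted only", ~promoted)]:
    r, p = pearsonr(exam[mask], rating[mask])
    print(f"{label}: r = {r:.2f} (p = {p:.3f})")
# The whole-sample correlation is inflated by the separation between the
# two clusters; the within-group correlations are much weaker, which is
# the pattern that led Dr. McPhail to temper his initial reading.
```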
Dr. McPhail also used "a number of multiple regression models" to measure the January 2006 captain exam's validity. He explained these analyses, as follows:
(Id. at 55). The regression analyses showed that the January 2006 captain exam significantly predicted station management, resource management, and problem solving. (Id.).
Dr. McPhail concluded that the statistical analyses showed "equivocal evidence of the predictive capability of the 2006 examination." (Id. at 60). He noted that the analyses of the entire sample showed "substantial and statistically significant correlations... between the test and rated performance," but that the bivariate scatter plot moderated the correlations. (Id.). He also noted that both the bivariate analysis and the multiple-regression analysis showed significant correlations within the EO subgroup with station management, management of resources, and problem solving. Dr. McPhail concluded that "among a much less restricted sample, test scores provided incremental prediction of performance ... even after accounting for the relationship of promotion status with the performance ratings." (Id. at 61).
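Dr. McPhail's finding of incremental prediction "after accounting for the relationship of promotion status with the performance ratings" corresponds to a hierarchical regression: fit promotion status alone, then add the exam score and examine the gain in explained variance. A minimal sketch with invented data, using statsmodels:

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Invented data: ratings depend on promotion status and, more weakly, on
# exam score, so the score should add incremental prediction.
n = 120
promoted = rng.integers(0, 2, n).astype(float)
exam = 70 + 15 * promoted + rng.normal(0, 5, n)
rating = 60 + 20 * promoted + 0.3 * exam + rng.normal(0, 5, n)

# Step 1: promotion status alone.
r2_base = sm.OLS(rating, sm.add_constant(promoted)).fit().rsquared

# Step 2: add the exam score; the R-squared gain is the score's
# incremental validity beyond promotion status.
X2 = sm.add_constant(np.column_stack([promoted, exam]))
r2_full = sm.OLS(rating, X2).fit().rsquared

print(f"R-squared gain from the exam score: {r2_full - r2_base:.3f}")
```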
Dr. Brink's report, prepared for the City, used the Guidelines, SIOP Standards, and scholarly articles to criticize the 2006 captain and senior-captain exams. Dr. Brink's report criticized the job descriptions offered by the HFD as job analyses, the "linkage" between the HFD's job descriptions and the exam, the reliability of the exam, and the processes the City used to establish the promotional system. Based on these criticisms, Dr. Brink concluded that the captain and senior-captain exams were not content-valid and that the City's promotional process violated the Guidelines. Dr. Brink's expert report also identified alternative evaluation measures.
Dr. Brink's report identified numerous shortcomings in the HFD job analyses. The report explained that both the Guidelines and SIOP Standards emphasize the importance of a job analysis to determine content validity. The Guidelines contain
The report first discussed the City's failure to maintain records of "required information" for documenting validity. Dr. Brink described the records the City produced to document the promotional tests' validity as "almost 8,000 mishmash pages." (Id. at 20). Under the Guidelines, the City should record the users, locations, and dates of a job analysis and any purposes related to the analysis. 29 C.F.R. § 1607.15(C)(1), (2).
Dr. Brink's report concluded that the job descriptions based on questionnaire responses were insufficient as job analyses. A job analysis should incorporate information from a number of sources, including background research, observing subject-matter experts ("SMEs") performing the job, interviewing SMEs, SME focus groups, and job-analysis questionnaires. The more sources are incorporated, the stronger the job analysis will be. (Id. at 21). Dr. Brink found that the City relied exclusively on questionnaires. He noted that the job descriptions and questionnaires the City supplied stated that "any one position may not include all the duties listed, nor do the examples listed necessarily include all duties performed." (Id.). Dr. Brink also noted that it was not clear whether internal SMEs had participated in creating the job descriptions. He argued that the lack of SME input raised concerns about accuracy, noting as an example that a specific knowledge identified in the captain job description was "Robert's Rule of Order," but that "several" captains did not know what this meant. (Id. at 22). Finally, Dr. Brink's report emphasized the vagueness of both the questionnaires and the job descriptions. For example, the job descriptions for both captain and senior captain list "training" as a "knowledge." (Id.). Dr. Brink argued that these descriptions failed to meet the Guidelines' requirement that "an operational" definition be provided for each KSAO. 29 C.F.R. § 1607.15(C)(3).
Dr. Brink's report also found that the City's "incumbent frequency ratings, supervisor criticality ratings, and computer overall criticalities," which show the relative importance of the KSAOs identified in the job descriptions, did not correlate with the questions asked on the captain and senior-captain exams. His criticisms of the captain job-description assessments included the following:
(Dr. Brink Report 23). He had similar criticisms for the senior-captain job-description assessments:
(Id.).
After consulting with SMEs, Dr. Brink found that 63% of the captain exam content and 86% of the senior-captain exam content did not reflect knowledge or skills necessary for the first day of work. (Id. at 24). Dr. Brink found that this was evidence that the test violated the SIOP Standards requirement that a "selection procedure should be based on an analysis of work that defines the balance between the work behaviors, activities, and/or [KSAOs] the applicant is expected to have before placement on the job." (Id. at 24). The Guidelines similarly require that "[f]or any selection procedure measuring a knowledge, skill, or ability the user should show that (a) the selection procedure measures and is a representative sample of that knowledge, skill, or ability; and (b) that knowledge, skill, or ability is used in and is a necessary prerequisite to performance of critical or important work behavior(s)." 29 C.F.R. § 1607.14(C)(4). Referring to this as the "necessary-upon-promotion" requirement, Dr. Brink concluded that the test poorly assessed whether a promotional candidate has the KSAOs required to begin work as a captain or senior captain.
According to Dr. Brink, "[p]erhaps the most condemning fact regarding the job analysis is that it was completely irrelevant." (Dr. Brink Report 26). The report stated:
(Id.).
Dr. Brink's report also criticized the exam itself, concluding that the "linkage" of test questions to the captain and senior-captain positions was "too abstract." (Id. at 31). This was inconsistent with the Guidelines, which state as follows:
Guidelines Questions & Answers, 44 Fed. Reg. at 12007. Dr. Brink stated that the identified responsibilities should have been linked to KSAOs and that the responsibilities and KSAOs should in turn have been linked to specific exam questions. He noted that there was no documentation linking the source materials to individual questions. Although the City provided "matrices linking responsibilities and [KSAOs] to source material lists," Dr. Brink found that many of the matrices did not correlate with the cited portions of the source materials. (Dr. Brink Report 31). The City also provided documents linking responsibilities and KSAOs to the exam questions, but Dr. Brink found that the questions rarely correlated with the responsibilities and KSAOs. Dr. Brink gave the following examples:
(Id. at 32).
Dr. Brink also faulted the exam for failing to assess the identified KSAOs in "the context in which they are used on the job." (Id.). The basis for this conclusion was Dr. Brink's experience that objective multiple-choice exams poorly evaluate supervisory, leadership, and communication skills and that such exams fail to simulate situations using job-related abilities. During the evidentiary hearing, Dr. Brink distinguished between "high fidelity" tests, which closely resemble actual job behaviors, and "low fidelity" tests, which do not resemble job behaviors. Dr. Brink testified that multiple-choice tests are "low fidelity" and that only high-fidelity tests are likely to have content validity. (Evidentiary Hr'g Tr. 149, Docket Entry No. 130).
Dr. Brink's report also stated that low-fidelity tests are inconsistent with the Guidelines. The Guidelines state that:
29 C.F.R. § 1607.14(C)(4). Similarly, the Q & As in the Guidelines note that:
Dr. Brink identified a number of questions on both the captain and senior-captain exams to illustrate his arguments against the exclusive use of multiple-choice questions. (Dr. Brink Report 33). A candidate's ability to delegate is examined by asking for the definition of "delegate"; a candidate's ability to manage a station is examined by asking about the difference between "managers" and "leaders"; and a candidate's ability to ensure the safety of personnel is measured by asking about the definition of "human factors theory of accident causes." (Id.). Dr. Brink summarized this aspect of his findings about the exams:
(Id. at 34).
Dr. Brink also examined the City's use of time limits and a cutoff score, finding no validity support for either. As to time limits, the Guidelines state that "[e]stablishment of time limits, if any, and how these limits are related to the speed with which duties must be performed on the job, should be explained." 29 C.F.R. § 1607.15(C)(5). As to cutoffs, the Guidelines state the following:
Id. § 1607.15(C)(7). The Guidelines also state that:
Id. § 1607.5(H).
Dr. Brink also found that the City violated the Guidelines and SIOP Standards
Using the City's data on test scores, Dr. Brink also reviewed the "item analysis" the City conducted. An "item analysis" measures a test's reliability by looking at its measurement error. "[F]or an employment test to accurately predict job performance, it must be reliable; but having a reliable test does not guarantee accurate prediction of job performance." (Id. at 39). Dr. Brink measured reliability using a formula proposed in an article by Ghiselli, Campbell, and Zedeck. The formula produces a "validity coefficient" that ranges from 0 to 1. The higher the coefficient, the more valid the test. (Id. at 38). Relying on an article by Nunnally and Bernstein, Dr. Brink acknowledged that "the level of reliability that is considered satisfactory depends on how a test is being used," but that in all cases, the validity coefficient should be at least .70. (Id. at 39). Dr. Brink stated that for "high stakes testing" like the captain and senior-captain exams, which provide the most important aspect of the promotional decision, the validity coefficient should be .90 at a "bare minimum," and .95 is "desirable." (Id.).
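The record excerpt does not reproduce the Ghiselli, Campbell, and Zedeck formula. Coefficient alpha, a widely used internal-consistency statistic with the same 0-to-1 scale and the same "higher is better" reading, gives the flavor of the computation; the sketch below applies it to invented 0/1 response data.

```python
import numpy as np

def coefficient_alpha(items):
    """items: (n_examinees, n_items) matrix of 0/1 scored responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variance
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(0, 1, (200, 1))            # invented examinee ability
responses = (ability + rng.normal(0, 1, (200, 100)) > 0).astype(int)

alpha = coefficient_alpha(responses)
print(f"alpha = {alpha:.3f}")   # compare to the .70 / .90 / .95 benchmarks
```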
The City has conducted "some" item analysis for the captain exam and has conducted an item analysis for 7 items on the senior-captain exam. The City's item analysis showed validity coefficients ranging from .458 to .646 for criteria measured on those exams. Dr. Brink criticized the City's item analysis because it analyzed overly broad criteria. For example, on the senior-captain exam, the City measured validity against the following criteria: "strategic & tactical considerations on the fireground" (items 1-20); "fire service first responder" (items 21-45); "supervisor" (items 46-70); "terrorism response" (items 71-80); and "Houston Fire Department Guidelines" (items 81-100). (Id.). Dr. Brink's report stated that the City's item analysis failed to include any analysis based on the specific KSAOs within each large criterion group. Dr. Brink also stated that the relevant literature shows that measuring large numbers of items at the same time inflates the validity coefficient. By measuring the validity of all 100 test items together instead of within each subcriterion, the City inflated the validity coefficient while failing to capture important criteria.
Dr. Brink also measured the "item difficulty" of the exams. "Item difficulty is a statistic used in test analysis that indicates the percentage of applicants who answered an item correctly." (Id. at 40). Item difficulty measures reliability, not validity. Dr. Brink stated that "[t]he purpose of a valid promotional exam is to differentiate candidates based on job-related criteria; if all or most candidates get an exam question correct or incorrect, the item is useless for this purpose." (Id.). Items with difficulties above .9 (90% of applicants answered correctly) should be eliminated unless the exam is designed to separate the bottom 10% of applicants from the top 90%. Dr. Brink found that 34 items on the captain exam and 60 items on the senior-captain exam had item difficulties above .9.
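The computation is direct: item difficulty is the proportion of examinees answering the item correctly. A short sketch applying the .9 screen described above, on invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented 0/1 responses: 221 examinees, 100 items of varying easiness.
responses = (rng.random((221, 100)) < rng.uniform(0.4, 0.99, 100)).astype(int)

difficulty = responses.mean(axis=0)            # proportion correct per item
too_easy = np.flatnonzero(difficulty > 0.9)    # the report's .9 screen
print(f"{too_easy.size} items answered correctly by more than 90%")
```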
Another way to measure whether an item differentiates between candidates who will perform well after promotion and candidates who will not is through "item discrimination."
The two common methods for measuring item discrimination are the index of discrimination and the item-total correlation. The index of discrimination is computed by first dividing the examinees into upper and lower groups based on overall test scores, then subtracting the proportion of the lower group who answered the item correctly from the proportion of the upper group who answered the item correctly. This produces a value, "D." Crocker and Algina argue that questions with a "D" value lower than .2 should be eliminated. Dr. Brink found that the captain exam had 53 items with a "D" value below .2 and the senior-captain exam had 66 items with a "D" value below .2. (Id.).
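A sketch of the "D" computation just described, on invented data. The fraction used to form the upper and lower groups is an assumption here (a common convention takes the top and bottom 27% of examinees); the record does not state the fraction Dr. Brink used.

```python
import numpy as np

def discrimination_index(responses, frac=0.27):
    """D = proportion correct in the upper group minus the lower group."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    k = int(len(totals) * frac)               # assumed group fraction
    lower, upper = order[:k], order[-k:]
    return responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

rng = np.random.default_rng(4)
ability = rng.normal(0, 1, (221, 1))
responses = (ability + rng.normal(0, 1.5, (221, 100)) > -0.5).astype(int)

D = discrimination_index(responses)
print(f"{(D < 0.2).sum()} items with D below .2")   # Crocker & Algina cutoff
```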
The item-total correlation "represents the correlation between an item and the rest of the test (i.e., the correlation between the item and the total score on the exam calculated excluding that item)." (Id.). A low item-total correlation means that an item has little relationship with the overall test and does not discriminate between those who perform well and those who perform poorly. Dr. Brink's report stated that items with "low item-total correlations should be dropped ... because they are not operating in the intended manner and do not improve reliability." (Id.). Nunnally and Bernstein concluded that an item with an item-total correlation below .05 is "a very poorly discriminating item" and that items with item-total correlations of at least .2 "are at least moderately discriminating." (Id. at 42). Dr. Brink found that 45 items on the captain exam and 46 items on the senior-captain exam had item-total correlations below .2. He also found that 6 items on the captain exam had negative item-total correlations, indicating that high-scoring candidates were more likely to get the item wrong than were low-scoring candidates.
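The corrected item-total correlation is equally mechanical: each item is correlated with the total score computed without that item. A sketch on invented data, applying the .2 screen and flagging negative values:

```python
import numpy as np

def item_total_correlations(responses):
    """Correlate each item with the total score excluding that item."""
    totals = responses.sum(axis=1)
    r = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = totals - responses[:, j]     # total score without item j
        r[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return r

rng = np.random.default_rng(5)
ability = rng.normal(0, 1, (221, 1))
responses = (ability + rng.normal(0, 1.5, (221, 100)) > -0.5).astype(int)

r_it = item_total_correlations(responses)
print(f"{(r_it < 0.2).sum()} items below .2, {(r_it < 0).sum()} negative")
```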
Finally, Dr. Brink faulted the exams for producing statistically significant performance differences between black and white examinees. Dr. Brink found that 32 items on the captain exam and 12 items on the senior-captain exam showed statistically significant differences based on a chi-square analysis using p-values of less than .05. (Id.). Dr. Brink argued that "[a]lthough changes to tests should not be made based solely on significant group differences, these items should be [the] focus of further evaluation to ensure that they are functioning appropriately." (Id.). The City had conducted no such evaluation.
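The per-item screen Dr. Brink describes can be sketched as a chi-square test on a two-by-two table of group membership against correct/incorrect counts for each item. The data and group labels below are invented; only the mechanics are illustrated.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(6)
group = rng.integers(0, 2, 221)                        # invented 0/1 groups
responses = (rng.random((221, 100)) < 0.75).astype(int)

flagged = []
for j in range(responses.shape[1]):
    # 2x2 table: rows are groups, columns are correct / incorrect.
    table = np.array([
        [((group == g) & (responses[:, j] == c)).sum() for c in (1, 0)]
        for g in (0, 1)
    ])
    if table.min() == 0:
        continue                                       # skip degenerate tables
    _, p, _, _ = chi2_contingency(table)
    if p < 0.05:
        flagged.append(j)

print(f"{len(flagged)} of 100 items flagged at p < .05")
```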
Dr. Brink also calculated item bias through differential-item functioning ("DIF"). The SIOP Standards state that test developers should attempt to detect and eliminate aspects of test design, content, and format that may bias test scores for particular groups. DIF is intended to measure such sources of bias. Dr. Brink used the Mantel-Haenszel method to examine DIF. He stated that "[r]ace groups may differ with respect to performance on a particular item due to true differences (i.e., for some reason, there are real job-related differences between Blacks and Whites with respect to performance on the item within the sample) or race bias (i.e., there are not real job-related differences between Blacks and Whites with respect to performance on one item; for some reason, performance differences are occurring on the item because the item is biased)."
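A minimal sketch of the Mantel-Haenszel computation, under stated assumptions: examinees are stratified on a matching variable (in DIF practice, usually total test score; simulated ability is used below for brevity), and the statistic aggregates two-by-two tables of group membership against item performance across the strata. All data are invented.

```python
import numpy as np

def mantel_haenszel_chi2(correct, group, strata):
    """MH chi-square (with continuity correction) across matched strata.

    correct, group: 0/1 arrays per examinee; strata: stratum labels.
    """
    a_sum = e_sum = v_sum = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        a = ((group[in_s] == 1) & (correct[in_s] == 1)).sum()  # focal correct
        n1 = (group[in_s] == 1).sum()     # focal-group size in stratum
        n0 = (group[in_s] == 0).sum()     # reference-group size in stratum
        m1 = (correct[in_s] == 1).sum()   # total correct in stratum
        m0 = (correct[in_s] == 0).sum()   # total incorrect in stratum
        N = n1 + n0
        if N < 2 or min(n1, n0, m1, m0) == 0:
            continue                      # stratum carries no information
        a_sum += a
        e_sum += n1 * m1 / N                               # E[a] under no DIF
        v_sum += n1 * n0 * m1 * m0 / (N ** 2 * (N - 1))    # hypergeometric var
    return (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum

rng = np.random.default_rng(7)
ability = rng.normal(0, 1, 221)
group = rng.integers(0, 2, 221)           # invented focal/reference labels
correct = (ability + rng.normal(0, 1, 221) > 0).astype(int)
strata = np.digitize(ability, np.quantile(ability, [0.25, 0.5, 0.75]))

print(f"MH chi-square = {mantel_haenszel_chi2(correct, group, strata):.2f}")
# Compare to the chi-square(1) critical value of 3.84 at the .05 level.
```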
During the evidentiary hearing, Dr. Brink was asked about his findings on the multiple-choice examination format. Dr. Brink admitted that "to some degree," multiple-choice questions can measure more than job knowledge. (Evidentiary Hr'g Tr. 211, Docket Entry No. 130). But he also testified that skills such as communication and "interpersonal type of abilities" are poorly measured through such job-knowledge tests. (Id. at 212). Dr. Brink also testified that while written tests can measure leadership, command presence, and decision-making ability to a degree, there are better ways to measure these skills and abilities. (Id. at 221). He also testified that situational-judgment questions better measure these skills and abilities than written job-knowledge questions, though he preferred "high-fidelity" exercises such as those performed at an assessment center over any type of written questions. (Id. at 222).
Dr. Brink also cited empirical research by Dr. Arthur. Dr. Brink testified that this research showed that "written tests with open-ended responses [were] actually more reliable than a written test with the closed-ended multiple-choice type responses." (Id. at 223). Dr. Arthur's study used a criterion-related validity design to compare the reliability of a multiple-choice exam to a "constructed response exam" that required the individuals to generate — rather than select — responses to exam questions. The constructed-response questions were short-answer questions with a structured response format scored according to preestablished criteria. Winfred Arthur, Jr. et al., Multiple-Choice and Constructed Response Tests of Ability: Race-Based Subgroup Performance Differences on Alternative Paper-and-Pencil Test Formats, 55 PERSONNEL PSYCHOLOGY 985, 996 (2002). The study found that the constructed-response questions had higher reliability measures than the multiple-choice questions. Id. at 998, 1000. The study also found smaller subgroup differences on the constructed-response questions, though the authors acknowledged that the sample size was small. Id. at 1001-02, 1004.
Dr. Lundquist, a witness produced by the City, provided an affidavit and testimony on the validity of multiple-choice exams. She also testified about the validity of assessment centers. Dr. Lundquist concluded that the City's exclusive use of multiple-choice job-knowledge questions should be abandoned in favor of a promotional exam system incorporating assessment centers.
Pointing to the Guidelines, Dr. Lundquist stated that "[t]he emphasis for any promotional process should be on assessing the critical knowledge, skills, abilities, and other personal characteristics (KSAOs) identified through a job analysis as being required to perform the essential duties of the job." (Dr. Lundquist Aff. 3, Docket Entry No. 93-2). She acknowledged
Dr. Lundquist's affidavit stated that situational-judgment questions and assessment centers better measure many of the abilities senior captains need. Dr. Lundquist argued that the literature shows that situational-judgment questions assess leadership and supervisory skills and abilities and should supplement, not replace, the multiple-choice job knowledge questions. Dr. Lundquist pointed to journal articles supporting the fairness and validity of assessment centers, as well as their ability to minimize disparate impact. During the evidentiary hearing, Dr. Lundquist discussed an article by Dr. Arthur concluding that assessment centers can validly measure "organization and planning and problem solving, ... [and] influencing others." (Evidentiary Hr'g Tr. 233, Docket Entry No. 130). Dr. Arthur's study used "meta-analysis to empirically assess the criterion-related validity of separate dimensions tapped by assessment centers." Winfred Arthur, Jr. et al., A Meta-Analysis of the Criterion-Related Validity of Assessment Center Dimensions, 56 PERSONNEL PSYCHOLOGY 125, 128 (2003). The study found "true validities" for the following performance dimensions: problem-solving, influencing others, and organizing and planning. Id. at 140.
Dr. Lundquist was asked about the objection that assessment centers produce "subjective" scores. She testified that through scoring standards and effective assessor training, assessment centers can produce scores approximating the objectivity of multiple-choice tests. (Evidentiary Hr'g Tr. 260-61, Docket Entry No. 130). Dr. Lundquist admitted that scoring an assessment-center exercise is more subjective than scoring a multiple-choice test. But she argued that "reliability and consistency can be produced by certain controls in the design of the ... assessment center exercise itself." (Id.). She also explained that objectivity and subjectivity are best understood as existing on a continuum, and that subjective scoring becomes more "objective" by structuring the scoring to minimize an assessor's subjective evaluation of a promotional candidate's performance. (Id. at 262-63). Dr. Lundquist testified that providing assessors with "very detailed examples of what is high performance, what is average performance, [and] what is low performance" for a particular performance dimension minimizes the assessor's subjective evaluation. (Id. at 264-65). She also testified that using multiple assessors reduces subjectivity. (Id. at 265).
On cross-examination, Dr. Lundquist admitted that "supervisory skill and planning and coordination skills" are difficult to measure and noted that "[w]hether or not it's a good assessment depends entirely on how well-written the test is and how similar it is to the requirements of the job." (Id. at 268). She testified that other methods besides a written test might be better measures of such skills. (Id.).
In response to questions about the importance of measuring cognitive skills, Dr. Lundquist acknowledged that industrial-psychology literature shows that "cognitive ability ... underlies a lot of the performance, a lot of the learning that goes on in terms of any kind of job." (Id. at 235). But she emphasized that there are different types of cognitive skills and not all are important for the captain and senior-captain positions.
Dr. Morris testified about the job analysis he had created for the captain position and about the job analysis he was creating for the senior-captain position. Dr. Morris also testified about the validity of the City's captain and senior-captain exams. Based on his work creating the job analyses, Dr. Morris concluded that the captain and senior-captain exams were not valid and offered alternatives to the City's promotional system.
To perform the job analysis for the captain position, Dr. Morris relied on subject-matter experts — both internal to the HFD and external to it — as well as source material. Dr. Morris's job analysis contained a more detailed listing of knowledge, skills, and abilities for the captain position than did the City's job description. Most of the identified "knowledges" related to: (1) firefighting, such as knowledge about equipment, structures, fires and firefighting and rescue tactics; (2) HFD standard operating procedures ("SOPs") and administrative processes; and (3) supervising. (Evidentiary Hr'g Ex. 42, at 2-6). The identified skills included those related to firefighting and operating firefighting equipment; leadership and communication; and problem-solving and decision-making. (Id. at 7). The identified abilities included lengthy lists of: (1) leadership abilities, including directing subordinates, resolving conflict, and motivating subordinates; (2) decision-making and strategic abilities such as prioritizing and developing contingency plans; (3) communication abilities, ranging from communicating with superiors to recognizing grammatical errors; (4) critical-thinking abilities, such as comprehending "complex rules, regulations, and procedures" and recognizing "critical aspects" of a problem; (5) administrative abilities, such as recording and documenting information; and (6) tactical abilities
Based on the job analysis, Dr. Morris testified that the multiple-choice exam the City used does not reliably assess whether a promotional candidate is qualified for the captain or senior-captain position. Dr. Morris testified that while multiple-choice questions can be effective to test job knowledge, they have limited value in evaluating such skills and abilities as "communication, problem identification, ... interpersonal skills, decision-making, and so forth," and these are the skills and abilities captains and senior captains should have. (Id. at 45). Dr. Morris also identified oral communication, command presence, and supervisory or interpersonal skills as skills poorly tested by a multiple-choice exam. (Id. at 48). Dr. Morris testified that the promotion exam measured only "a very small portion of the job — I wouldn't say a very small. It measures an important part of the job. But, in fact, the other part that is not measured is, in fact, very important for promotional and supervisory positions." (Id. at 66). On cross-examination, Dr. Morris acknowledged that a written test could measure more than knowledge and some KSAOs, and could measure problem-solving skills, but he emphasized that such a test would be unlikely to measure supervisory skills or oral-communication skills. (Id. at 83-85).
Dr. Morris testified that a promotional system incorporating a job-knowledge test and an assessment center would be valid. Dr. Morris did not initially recommend including a situational-judgment test but was comfortable with including it. (Id. at 45). Based on his job analysis, Dr. Morris concluded that there were "technical knowledge" requirements for the captain position that could be measured through a job-knowledge test. He testified that a situational-judgment question might also measure technical knowledge, but that "[i]t might be more likely that we would use a knowledge-based test." (Id.). Dr. Morris admitted that "good practice" is usually to create a job analysis first, and then design a promotional test. (Id. at 78-79). He testified, however, that his experience helps him make accurate predictions about what type of promotional test will be more reliable. (Id. at 77-78).
Dr. Morris's testimony emphasized the need for an assessment center, particularly for senior-captain promotions. He testified that such skills and abilities as supervision and leadership are best measured in assessment centers. Because senior captains have a great "span of control" — they directly supervise a large number of firefighters — an assessment center is uniquely reliable for evaluating senior-captain candidates. (Id. at 47-49). Dr. Morris also testified that an assessment center would be appropriate for captain promotion decisions. Though senior captains exercise a greater span of control than captains, similar skills and abilities are required for both positions. Dr. Morris described the captain position as a "gatekeeper" position for senior captain and argued that this justified examining similar skills and abilities. (Id. at 63).
At an assessment center, SMEs would create "standards of performance," tested by asking the candidates to respond to simulations of situations typically faced by captains and senior captains. The candidates would be scored by "assessors" according to preestablished performance standards, which would include examples
Dr. Morris was also questioned about whether the proposed changes to the promotional system effectively measured cognitive abilities. Dr. Morris admitted that the proposed test changes would be designed to measure a candidate's ability to perform the job on the first day, not to test how the candidate would progress in the job. (Id. at 50). He testified that the Guidelines emphasize measuring a candidate's ability to start a job rather than to progress in it, based in part on the assumption that the candidate will receive additional training. (Id. at 52). But Dr. Morris disputed the contention that the proposed changes did not measure cognitive skills. He testified that cognitive abilities can be measured in ways other than a multiple-choice exam, such as through "structured interviews" and "structured oral exercises" that require exercising cognitive abilities to respond to questions and hypothetical situations. (Id. at 88).
Though Dr. Arthur did not submit an expert report, he testified on behalf of the HPFFA about the validity of many of the proposed changes to the captain and senior-captain exams. Dr. Arthur agreed with other experts' testimony that a job analysis should be the basis for developing an examination procedure. "[T]he method of assessment is informed by the job analysis process." (Evidentiary Hr'g Tr. 363, Docket Entry No. 131). Dr. Arthur was skeptical of the validity of the changes proposed because the job analysis for the captain position was incomplete. He also noted that the proposals did not specify the changes in detail. (Id. at 365).
As to the use of multiple-choice job-knowledge questions, Dr. Arthur echoed the other experts' testimony that such tests can measure more than job knowledge, though he believed that certain skills and abilities are more effectively measured by other types of examinations. Dr. Arthur also pointed out that multiple-choice tests tend to be "cognitively loaded," but he argued that the issue is not the method of examination but rather the examination's "constructs," or content. (Id. at 365-66). He acknowledged that cognitively loaded exams tend to create subgroup differences but argued that the cognitive loading does not result from the multiple-choice method itself. He explained that "assessment centers are likely to reduce subgroup differences not because you're using assessment centers, but because of the things that assessment centers measure." (Id. at 366; see also id. at 444). His criticism of this aspect of the proposed consent decree was that it focused on method and not content.
Dr. Arthur's testimony also provided a rough estimate, based on the results from the 2010 captain exam, of whether there would be disparate impact in the number of promotional candidates eventually promoted. Dr. Arthur found that based on historical data, approximately 115 candidates are promoted each cycle. Using the rank order from the 2010 exam, Dr. Arthur
Dr. Arthur also testified about the proposed changes to the scoring system under the consent decree. For a valid examination method, "the relationship between test scores and performance is by definition linear in most situations." (Id. at 375). Dr. Arthur testified that banding tends to obscure performance differences, contravening this "cardinal principle" of assessment. The better way to ensure score reliability is to ensure the test is valid to begin with, so that "the same people would get the same scores on the test no matter how many times they do it." (Id. at 376). Dr. Arthur was particularly critical of the pass/fail grading of the job-knowledge component. He acknowledged that such grading could be appropriate if only "minimal" job knowledge was required for the captain and senior-captain positions, but he had not seen evidence showing this to be true. (Id. at 416-17, 422-25).
Dr. Arthur also testified that the way in which a question is presented and the type of response the question demands can affect whether the exam produces subgroup differences. Asking a question in a format that requires reading, and requiring a written answer, cognitively loads the question; examining the same content by presenting the question in a video format can reduce cognitive loading. (Id. at 424-27). Dr. Arthur pointed to an article he authored finding that asking questions requiring promotional candidates to generate, rather than select, answers can minimize subgroup differences. (Id. at 386-87).
William Barry, an Assistant Chief in the Houston Fire Department, testified on the City's behalf that "the captains who scored high on the multiple-choice tests were not always as effective as captains who scored lower on these tests. In fact, I never found a direct correlation between the scores on the test and their performance." (Evidentiary Hr'g Tr. 120, Docket Entry No. 130). Barry, who currently works in HFD's "Member Support" — human resources — Division, explained that "taking tests and working in the four different ranks in the emergency operations and observing people who have been promoted through the system, the ability to memorize the correct 100 facts that are going to be asked on a test does not correlate to how these people perform in either complex personnel situations or emergency situations." (Id. at 121). Barry acknowledged that he has also seen firefighters who did well on the exams perform well after promotion, though he thought the other direction was more common. (Id. at 134). He did admit that those who put the most time, energy, effort, and sacrifice into test preparation scored higher. (Id. at 137).
As noted above, the array of expert opinions about the existing exams' disparate impact on African-American candidates and the reliability and validity of the existing and proposed exams reflects the limits of the science of testing to measure and compare promotion-worthiness. The witnesses' opinions are at best attempts to gauge how well different exam approaches measure, compare, and predict job performance. But the absence of judicial expertise in the area of testing validity is well recognized. The combination of the lack of judicial expertise in this area and the limits of the expertise of those who have training and experience supports a cautious and careful judicial approach. With this background and caution, this court applies the legal standards to the proposed consent decree.
"The parties to litigation may by compromise and settlement not only save the time, expense, and psychological toll but also avert the inevitable risk of litigation." United States v. City of Miami, 664 F.2d 435, 439 (5th Cir.1981) (en banc) (Rubin, J., concurring). "Litigants ... have sought to reinforce their compromise and to obtain its more ready enforceability by incorporating it into a proposed consent decree and seeking to have the court enter this decree." Id. "A consent decree, although founded on the agreement of the parties, is a judgment. It has the force of res judicata, protecting the parties from future litigation. It thus has greater finality than a compact. As a judgment, it may be enforced by judicial sanctions, including citation for contempt if it is violated." Id. at 439-40 (internal citation and footnotes omitted).
"[A] decree disposing of some of the issues between some of the parties may be based on the consent of the parties who are affected by it but ... to the extent the decree affects other parties or other issues, its validity must be tested by the same standards that are applicable in any other adversary proceeding."
The Fifth Circuit has advised district courts that:
City of Miami, 664 F.2d at 441 (footnotes omitted). A district court should play a particularly "active role" when the litigation and settlement were instigated by a class of private plaintiffs — as opposed to the United States — because private plaintiffs have no "responsibility toward third parties who might be affected by their actions." Williams v. City of New Orleans, 729 F.2d 1554, 1560 (5th Cir.1984) (en banc).
The threshold issue in the analysis is whether, under Title VII, the City's present testing method for senior-captain promotions has a disparate impact on African-American applicants.
The plaintiffs and the City argue that the senior-captain exam disparately impacted promotions from the rank of captain to senior captain in violation of Title VII. "Title VII ... prohibits employment discrimination on the basis of race, color, religion, sex, or national origin. Title VII prohibits both intentional discrimination (known as 'disparate treatment') as well as, in some cases, practices that are not intended to discriminate but in fact have a disproportionately adverse effect on minorities (known as 'disparate impact')." Ricci v. DeStefano, 557 U.S. 557, 129 S.Ct. 2658, 2672, 174 L.Ed.2d 490 (2009). "As enacted in 1964, Title VII's principal nondiscrimination provision held employers liable only for disparate treatment." Id. But in Griggs v. Duke Power Co., 401 U.S. 424, 431, 91 S.Ct. 849, 28 L.Ed.2d 158 (1971), the Supreme Court "interpreted the Act to prohibit, in some cases, employers' facially neutral practices that, in fact, are 'discriminatory in operation.'" Id. at 2672-73. "Twenty years after Griggs, the Civil Rights Act of 1991, 105 Stat. 1071, was enacted. The Act included a provision codifying the prohibition on disparate-impact discrimination." Id. at 2673.
A plaintiff can make a prima facie case of discrimination by showing that an employer uses "a particular employment practice" — in this case a promotional test — "that causes a disparate impact on the basis of race." 42 U.S.C. § 2000e-2(k)(1)(A)(i). "To establish a prima facie case of discrimination under a disparate-impact
Disparate impact requires "a specific practice or set of practices resulting in a significant disparity between the groups." Johnson v. Uncle Ben's, Inc., 965 F.2d 1363, 1367 (5th Cir.1992). To establish a prima facie case, plaintiffs "must engage in a 'systematic analysis' of the policy or practice," Frank v. Xerox Corp., 347 F.3d 130, 135 (5th Cir.2003) (quoting Munoz v. Orr, 200 F.3d 291, 299 (5th Cir.2000)), and "establish causation by offering statistical evidence to show that the practice in question has resulted in prohibited discrimination," Stout v. Baxter Healthcare Corp., 282 F.3d 856, 860 (5th Cir.2002). "Ordinarily, a prima facie disparate impact case requires a showing of a substantial 'statistical disparity between protected and non-protected workers in regards to employment or promotion.'" Id. (citing Munoz, 200 F.3d at 299-300); see also Ricci, 129 S.Ct. at 2678 (noting that a prima facie showing of disparate impact is "essentially, a threshold showing of a significant statistical disparity"); Herndon v. Coll. of Mainland, No. G-06-0286, 2009 WL 367500, at *28 (S.D.Tex. Feb. 13, 2009) ("[A prima facie showing] generally requires 'evidence of... observed statistical disparities,' but may include anecdotal evidence." (citations omitted)). While the Supreme Court has "emphasized the useful role that statistical methods can have in Title VII cases," it has "not suggested that any particular number of 'standard deviations' can determine whether a plaintiff has made out a prima facie case in the complex area of employment discrimination." Watson, 487 U.S. at 995 n. 3, 108 S.Ct. 2777.
It is a defense to liability if the challenged practice is shown to be "job related for the position in question and consistent with business necessity." 42 U.S.C. § 2000e-2(k)(1)(A)(i). "'The touchstone' for determining whether a test or qualification meets Title VII's measure, ... is not 'good intent or the absence of discriminatory intent'; it is 'business necessity.'" Ricci, 129 S.Ct. at 2697 (quoting Griggs, 401 U.S. at 431, 91 S.Ct. 849). "When an employment test 'select[s] applicants for hire or promotion in a racial pattern significantly different from the pool of applicants,'... the employer must demonstrate a 'manifest relationship' between test and job." Id. (quoting Albemarle Paper Co. v. Moody, 422 U.S. 405, 425, 95 S.Ct. 2362, 45 L.Ed.2d 280 (1975)); see also Frazier v. Garrison I.S.D., 980 F.2d 1514, 1526 n. 34 (5th Cir.1993) (explaining that the 1991 amendments to Title VII made clear that it is the employer's burden to show that a challenged practice is job-related and consistent with business necessity).
As set out above, on February 8, 2006, the City of Houston administered the senior-captain exam to 221 candidates. Of these, 172 were Caucasian, 15 were African-American, 33 were Hispanic, and 1 was "other." Of the 212 candidates passing by scoring 70 or above, 166 were Caucasian, 13 were African-American, 32 were Hispanic, and 1 was "other." The City promoted 70 individuals based on the rank order of those passing the exam. Of the 70 promoted, 59 were Caucasian, 2 were African-American, 8 were Hispanic, and 1 was "other." (Dr. McPhail Report 5).
The analysis starts with the Guideline "rule of thumb" for disparate impact, the 4/5 Rule. The expert witnesses in this case agreed that the total selection process for senior captains violated the 4/5 Rule. Under the Guidelines, a 4/5 Rule violation provides some evidence of disparate impact. See 29 C.F.R. § 1607.4(D) ("A selection rate for any race, sex, or ethnic group which is less than four-fifths (4/5) (or eighty percent) of the rate for the group with the highest rate will generally be regarded by the Federal enforcement agencies as evidence of adverse impact, while a greater than four-fifths rate will generally not be regarded by Federal enforcement agencies as evidence of adverse impact."); Guidelines Questions & Answers, 44 Fed.Reg. at 11999 (noting that when there is a 4/5 Rule violation, "[t]here usually is adverse impact"). But a 4/5 Rule violation is not enough to find disparate impact. Courts treat the 4/5 Rule as "no more than a 'rule of thumb' to aid in determining whether an employment practice has a disparate impact." Waisome v. Port Auth. of N.Y. & N.J., 948 F.2d 1370,
Courts have recognized that when a group's population is small, statistical tests should be used to determine whether a 4/5 Rule violation is the product of chance. See Black v. City of Akron, 831 F.2d 131, 134 (6th Cir.1987) ("[P]laintiffs are correct that the 4/5 rule has been criticized when used in the context of a small sample of employees being tested."); Fudge v. City of Providence Fire Dep't, 766 F.2d 650, 658 (1st Cir.1985) ("We think that in cases involving a narrow data base, the better approach is for the courts to require a showing that the disparity is statistically significant, or unlikely to have occurred by chance, applying basic statistical tests as the method of proof."). Even outside the context of small group sizes, courts have recognized that statistical analyses provide stronger evidence of disparate impact than do violations of the 4/5 Rule standing alone. See Clady, 770 F.2d at 1428 ("[T]he 80 percent rule has been sharply criticized by courts and commentators."); Isabel v. City of Memphis, 404 F.3d 404, 412-13 (6th Cir.2005) ("[W]e are grateful for statistics beyond the four-fifths rule analysis because we prefer to look to the sum of statistical evidence to make a decision in these kinds of cases."); Stagi, 391 Fed. Appx. at 138 ("The '80 percent rule' or the 'four-fifths rule' has come under substantial criticism, and has not been particularly persuasive, at least as a prerequisite for making out a prima facie disparate impact case. The Supreme Court has noted that 'this enforcement standard has been criticized on technical grounds and it has not provided more than a rule of thumb for the courts.'" (alterations omitted) (quoting Watson, 487 U.S. at 995 n. 3, 108 S.Ct. 2777)); see also 1 B. Lindemann & P. Grossman, EMPLOYMENT DISCRIMINATION LAW 130 (4th ed. 2007) (noting that the 80 percent rule "is inherently less probative than standard deviation analysis"); E. Shoben, Differential Pass-Fail Rates in
In this case, the expert witnesses agreed that because the African-American promotion applicants were a relatively small group, statistical tests should be used to determine whether the 4/5 Rule violation resulted from chance. (Dr. Brink Report 10; Dr. Lundquist Aff. 6; Dr. Arthur Aff. 2-3; Dr. McPhail Report 6-7; Dr. Morris Report 10446-47). The 4/5 Rule violation provides evidence of, but does not establish, disparate impact.
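The 4/5 Rule arithmetic on the 2006 figures set out above is straightforward. The sketch below treats the Caucasian selection rate as the benchmark; the nominally higher "other" rate rests on a single candidate and is disregarded here, an assumption consistent with the Guidelines' caution about small numbers.

```python
# 4/5 Rule arithmetic on the 2006 senior-captain promotions
# (figures from Dr. McPhail's report, summarized above).
black_rate = 2 / 15          # African-American candidates promoted
caucasian_rate = 59 / 172    # Caucasian candidates promoted

impact_ratio = black_rate / caucasian_rate
print(f"rates: {black_rate:.3f} vs. {caucasian_rate:.3f}")
print(f"impact ratio: {impact_ratio:.3f}")     # about .39
print("4/5 Rule violation" if impact_ratio < 0.8 else "no violation")
```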
As the HPFFA points out, courts frequently accept a .05 probability "to rule out the possibility that the disparity occurred at random." Stagi, 391 Fed.Appx. at 137-38; see also Page v. U.S. Indus., Inc., 726 F.2d 1038, 1047 n. 5 (5th Cir. 1984) (noting that in Castaneda v. Partida, 430 U.S. 482, 496-97 n. 17, 97 S.Ct. 1272, 51 L.Ed.2d 498 (1977), the Supreme Court's "guidance" was that "a disparity in the number of minority workers in upper and lower level jobs is statistically significant if the difference between the expected number of minority employees in higher level positions exceeds the actual number by more than two or three standard deviations"); Palmer v. Shultz, 815 F.2d 84, 92-96 (D.C.Cir.1987) (noting that "statistical evidence meeting the .05 level of significance is certainly sufficient to support an inference of discrimination" (citation, internal quotation marks, and alterations omitted)); Waisome, 948 F.2d at 1376 ("Social scientists consider a finding of two standard deviations significant, meaning there is about one chance in 20 that the explanation for a deviation could be random and the deviation must be accounted for by some factor other than chance." (citation omitted)). The three generally accepted statistical tests for determining whether a 4/5 Rule violation resulted from chance are the Fisher exact test, the Pearson chi-square test, and the Z test.
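Applied to the promoted and not-promoted counts for the African-American and Caucasian candidates from the 2006 senior-captain exam, the three tests can be sketched as follows. The two-sided p-values approximate the figures discussed below (roughly .15 for the Fisher exact test and .10 for the chi-square without continuity correction), both above the conventional .05 level.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact, norm

# 2x2 table from the record: rows are groups, columns are
# promoted / not promoted (African-American: 2 of 15;
# Caucasian: 59 of 172).
table = np.array([[2, 13],
                  [59, 113]])

_, p_fisher = fisher_exact(table)                    # Fisher exact test
chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)

# Two-proportion Z test, computed directly from the rates.
p1, p2 = 2 / 15, 59 / 172
p_pool = (2 + 59) / (15 + 172)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / 15 + 1 / 172))
z = (p1 - p2) / se
p_z = 2 * norm.sf(abs(z))

print(f"Fisher exact: p = {p_fisher:.3f}")
print(f"chi-square:   p = {p_chi2:.3f} (chi2 = {chi2:.2f})")
print(f"Z test:       p = {p_z:.3f} (z = {z:.2f})")
```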
The statistical evidence does not support the conclusion that the 4/5 Rule violation for the 2006 senior-captain exam is not the product of chance. But courts have properly cautioned that "[t]here is no minimum statistical threshold requiring a mandatory finding that a plaintiff has demonstrated a violation of Title VII" and "[c]ourts should take a `case-by-case approach' in judging the significance or substantiality of disparities, one that considers not only statistics but also all the surrounding facts and circumstances." Waisome, 948 F.2d at 1376; see also Int'l Bhd. of Teamsters v. United States, 431 U.S. 324, 340, 97 S.Ct. 1843, 52 L.Ed.2d 396 (1977) (noting that statistics "come in infinite variety and ... their usefulness depends on all of the surrounding facts and circumstances"). The "surrounding circumstances" of the 2006 senior-captain exam provide evidence of disparate impact. Historical data shows that the promotions of African-Americans from captain to senior captain have violated the 4/5 Rule for every promotional cycle since 1993. (Docket Entry No. 93-2, at 8-9, 13). In two of those cycles, no black captains were promoted out of a total of 29 African-American applicants. (Id. at 13).
Even without statistical data confirming that there was statistically significant disparate impact in each of these promotional cycles, the historical data substantially mitigates the risk that the 4/5 Rule violation from the 2006 examination resulted from chance, even though the probability that the 2006 violation was the product of chance exceeds the significance levels courts commonly accept. Dr. Lundquist's Mantel-Haenszel analysis confirms that the historical pattern of 4/5 Rule violations is statistically significant.
Dr. Arthur argued that the only relevant data is for the promotions from the 2006 senior-captain exam. But courts have looked to historical data to help assess whether the disparities a promotional procedure produces are the result of chance. See United Air Lines, Inc. v. Evans, 431 U.S. 553, 558, 97 S.Ct. 1885, 52 L.Ed.2d 571 (1977) ("A discriminatory act which is not made the basis for a timely charge ... may constitute relevant background evidence in a proceeding in which the status of a current practice is at issue...."); Commonwealth of Pa. v. Flaherty, 983 F.2d 1267, 1271 (3d Cir.1993) (looking to past evidence of disparate impact). The Guidelines also require analysis of historical data to assess disparate impact from current selection procedures. See 29 C.F.R. § 1607.4(D) ("Where the user's evidence concerning the impact of a selection procedure indicates adverse impact but is based upon numbers which are too small to be reliable, evidence concerning the impact of the procedure over a longer period of time and/or evidence concerning the impact which the selection procedure had when used in the same manner in similar circumstances elsewhere may be considered in determining adverse impact."). The historical data showing statistically significant disparate impact and the 4/5 Rule violation together support a finding of disparate impact from the 2006 senior-captain exam.
It is also worth noting that the Fisher exact test and the Pearson chi-square showed a 15% and 10% chance respectively that the 4/5 Rule violation was the product of chance. While higher than the 5% chance courts commonly accept, these percentages are not drastically higher. Dr. Brink's point that the 5% test is not a rigid standard to be applied under all circumstances applies here. One court made a similar point in finding age discrimination in firing decisions over objections that the
Kadas v. MCI Systemhouse Corp., 255 F.3d 359, 362 (7th Cir.2001); see also (Dr. Lundquist Test., Evidentiary Hr'g Tr. 276, Docket Entry No. 130 ("Well, there is a whole field of literature that talks about where to appropriately set that standard, should it be .05, should it be .10, should it be .01. So, although oftentimes in court you hear .05, and that is a common convention,... there's a lot of literature that suggests that depending on what the particular decision is that you're making, it may be more appropriate to use a different P value as your critical value for deciding whether something is statistically significant.")).
The HPFFA points out that some courts have also looked to whether small changes in the number of applicants in the groups at issue change the disparate-impact evidence based on the 4/5 Rule. See Deshields, 1989 WL 100664, at *1 (discrediting a 4/5 Rule violation as evidence of disparate impact because "[a] change in the race of only a few of those promoted could make a significant mathematical difference in the outcome of the four-fifths rule calculations"); Waisome, 948 F.2d at 1376 (noting that "if two additional black candidates passed the written examination the disparity would no longer be of statistical importance" and finding that the plaintiffs failed to make a prima facie showing of disparate impact). The record evidence shows that if two additional black captains had been promoted to senior captain in 2006, there would not have been a 4/5 Rule violation. Aside from the cases the HPFFA cites, however, there is no identified basis to conclude that because only two more black captains needed to be promoted for the promotional system to comply with the 4/5 Rule, this court should not find disparate impact. Given the small number of black firefighters in the HFD compared to the number of white firefighters, it makes sense that only a few additional black captains would need to be promoted for the City to achieve compliance with the 4/5 Rule. But the historical data shows that the minimum number of African-American captain promotions to avoid a 4/5 Rule violation has never been achieved in any promotion cycle. Application of the Guidelines' n-of-1 rule and Dr. Brink's testimony about the one-person rule further support this point.
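The sensitivity point reduces to simple arithmetic: holding the 70 promotions constant and shifting two of them from Caucasian to African-American candidates moves the impact ratio from roughly .39 to just above the .8 line.

```python
# Sensitivity of the 4/5 Rule result to two promotions, holding the
# total number of promotions constant at 70 (as the record describes).
for shifted in (0, 2):
    black_rate = (2 + shifted) / 15
    caucasian_rate = (59 - shifted) / 172
    ratio = black_rate / caucasian_rate
    verdict = "violation" if ratio < 0.8 else "no violation"
    print(f"{shifted} promotions shifted: ratio = {ratio:.3f} ({verdict})")
```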
This court finds that the City and the plaintiffs have shown that the 2006 senior-captain exam disparately impacted the African-American candidates for promotion to senior captain.
The HPFFA has offered two expert reports to validate the captain and senior-captain exams.
Dr. McPhail used a criterion-related validity study to analyze the results of the 2006 captain exam. The EEOC recognizes a criterion-related validity study as one of three accepted methods of establishing a promotional exam's validity. 29 C.F.R. § 1607.5(A). Courts have looked to the EEOC Guidelines validation procedures to analyze business necessity. See Banos, 398 F.3d at 892 ("The City can show that its process is 'job related' by any one of three tests: criterion-related, content validity, or construct validity."); EEOC v. Dial Corp., 469 F.3d 735, 743 (8th Cir.2006) (looking to the EEOC criterion validity requirements to evaluate business necessity); Isabel, 404 F.3d at 413 (relying on the EEOC Guidelines to determine validity); United States v. City of Garland, No. 3:98-CV-0307-L, 2004 WL 741295, at *9 (N.D.Tex. Mar. 31, 2004) (discussing criterion-related validation, content validation, and construct validation). Both the Guidelines and the case law recognize that an employer need demonstrate validity through only one method. Hearn v. City of Jackson, 340 F.Supp.2d 728, 736 (S.D.Miss.2003) ("'Neither the case law nor the Uniform Guidelines purports to require that an employer must demonstrate validity using more than one method.'" (quoting Williams v. Ford Motor Co., 187 F.3d 533, 544-45 (6th Cir. 1999))); 29 C.F.R. § 1607.5(A) ("For the purposes of satisfying these guidelines, users may rely upon criterion-related validity studies, content validity studies or construct validity studies...."); 29 C.F.R. § 1607.14(C)(1) ("Users choosing to validate a selection procedure by a content validity strategy should determine whether it is appropriate to conduct such a study in the particular employment context."); see also Washington v. Davis, 426 U.S. 229, 248 n. 13, 96 S.Ct. 2040, 48 L.Ed.2d 597 (1976) (stating that "[i]t appears beyond doubt by now that there is no single method for appropriately validating employment tests for their relationship to job performance," and that any of the three recognized basic methods of validation may be used).
Dr. McPhail used multiple-regression analysis to account for the additional training that promoted captains received, but he found statistically significant correlations for only three of the nine performance dimensions he measured. (Id. at 60). Similarly, when Dr. McPhail divided the validation sample into candidates who were promoted and candidates who were not promoted, he found statistically significant correlations for only three of the nine performance dimensions, and he found those correlations only among the candidates who were not promoted. Showing that the captain test produces significant correlations within one subgroup does not demonstrate that "the selection procedure is predictive of or significantly correlated with important elements of job performance." 29 C.F.R. § 1607.5(B). These correlations do not validate the test.
Dr. Sharf's validity-generalization study is no more persuasive. Dr. Sharf's report contains no analysis comparing the actual questions on the captain and senior-captain exams to the extensive list of cognitive skills he identified as essential to the positions. Instead, Dr. Sharf's report extensively describes literature supporting cognitive tests. He then describes the City's test as a cognitive test and concludes that it is valid. The Guidelines explicitly reject this form of validity analysis in the following section:
29 C.F.R. § 1607.9(A).
Even assuming that cognitive skills are a valid predictor of some aspects of job performance, Dr. Sharf's testimony does not provide a basis to conclude that the captain and senior-captain exams reliably measure the cognitive skills he identifies as necessary for those positions. As Dr. Brink's expert report showed, some of the exam questions at best measure a promotional candidate's ability to memorize what appear to be obscure facts.
There is also substantial evidence in the record rebutting the HPFFA's evidence that the captain and senior-captain exams are job-related. Most of the experts who submitted a report or testified stated that both the captain and senior-captain positions require skills and abilities poorly measured by a multiple-choice exam. The experts identified the following nonexclusive list of such skills and abilities: leadership; command presence; interpersonal communication; supervision; and decision-making. (See Dr. Morris Test., Evidentiary Hr'g Tr. 45-48 (testifying that while multiple-choice questions can be effective at testing job knowledge, they do not adequately assess supervisory skills, communication, problem identification, interpersonal skills, decision-making, and command presence, which are skills and abilities that captains and senior captains should have); Dr. Brink Test., Evidentiary Hr'g Tr. 211-12 (admitting that "to some degree" a written test can measure more than job knowledge, but emphasizing that skills such as communication and "interpersonal type of abilities" are poorly measured through written, multiple-choice job-knowledge tests); Dr. Lundquist Test., Evidentiary Hr'g Tr. 239 (stating that multiple-choice job-knowledge tests typically "cover ... a more limited set of skills than might be required for [the captain and senior captain] positions"); Dr. Lundquist Aff. 4 (acknowledging that the City's multiple-choice test could validly assess the technical knowledge required for the positions of captain and senior captain, but arguing that such a test "inadequately captures the range of KSAOs required for successful performance in a position such as Senior Captain"); see also Dr. Lundquist Test., Evidentiary Hr'g Tr. 240 (explaining that "as you move up through supervisory ranks, ... what you see is less of an emphasis on the technical or knowledge side and more of an emphasis on... leadership, supervisory, managerial, [and] strategic ... aspects of the job"); Dr. McPhail Validation Study of the 2006 Captain Exam, Docket Entry No. 37-1, at 37 (identifying "supervision," "problem solving," "interpersonal effectiveness," and "professional orientation & commitment," as comprising four of the nine performance dimensions for the captain position); Dr. Sharf Report 17-21 (listing management and supervision, problem solving, communication, command presence, and leadership as responsibilities of the captain and senior-captain positions)).
The evidence shows that the promotional tests for the captain and senior-captain positions did not test the entire "job domain." Courts have rejected promotional tests for similar reasons. See Isabel, 404 F.3d at 413 (upholding district court's determination that because the test only examined "job knowledge," it failed to test the entire "job domain"). The Guidelines similarly suggest that promotional examinations should replicate work behaviors. Guidelines Questions & Answers, 44 Fed. Reg. at 12007. The expert testimony demonstrates that the captain and senior-captain exams fail to test significant elements of the positions.
Dr. Brink's testimony also provided evidence that the captain and senior-captain exams did not measure the KSAOs identified in the captain and senior-captain job descriptions. The Guidelines require that "[f]or any selection procedure measuring a knowledge, skill, or ability the user should show that (a) the selection procedure measures and is a representative sample of that knowledge, skill, or ability; and (b) that knowledge, skill, or ability is used in and is a necessary prerequisite to performance of critical or important work behavior(s)." 29 C.F.R. § 1607.14(C)(4). Dr. Brink's report noted that 63% of the captain exam content and 86% of the senior-captain exam content did not reflect knowledge or skills necessary for the first day of work in the position.
The HPFFA has not demonstrated that the captain and senior-captain promotion exams are job-related and consistent with business necessity. Because this court has found disparate impact for both the captain and senior-captain exams, and because the HPFFA has failed to show that the exams are job-related and consistent with business necessity, the City may implement changes to the promotional system for the captain and senior-captain positions to the extent necessary to address the disparate impact, even if those changes are inconsistent with the CBA and state law.
The City and the plaintiffs have demonstrated that the captain and senior-captain exams disparately impacted black promotional candidates and that the promotional examinations for the positions are not justified by business necessity. This court's finding provides a basis to approve provisions of the proposed consent decree that conflict with the TLGC and the CBA. But this court must be cautious to approve only those conflicting provisions of the proposed consent decree that are necessary and tailored to remedy the demonstrated disparate impact.
"[A]ny federal decree must be a tailored remedial response to illegality." Clements, 999 F.2d at 847 (citing Shaw v. Reno, 509 U.S. 630, 113 S.Ct. 2816, 125 L.Ed.2d 511 (1993)). "A consent decree must arise from the pleaded case and further the objectives of the law upon which the complaint is based." Id. at 846; see also San Antonio Hispanic Police Officers' Org. v. City of San Antonio, 188 F.R.D. 433, 439 (W.D.Tex.1999) ("[T]he question for the courts is whether this part of the proposal has a sufficient nexus to the lawsuit to justify circumventing the collective bargaining process."). "`[F]ederal-court decrees exceed appropriate limits if they are aimed at eliminating a condition that does not violate [federal law] or does not flow from such a violation.'" Horne v. Flores, 557 U.S. 433, 129 S.Ct. 2579, 2595, 174 L.Ed.2d 406 (2009) (quoting Milliken v. Bradley, 433 U.S. 267, 282, 97 S.Ct. 2749, 53 L.Ed.2d 745 (1977)); see also Milliken, 433 U.S. at 281-82, 97 S.Ct. 2749 ("The well-settled principle that the nature and scope of the remedy are to be determined by the violation means simply that federal-court decrees must directly address and relate to the constitutional violation itself."). "Courts must be exceptionally cautious when litigants seek to achieve by consent decree what they could not achieve by their own authority. Consent is not enough when parties seek to grant themselves powers they do not hold outside of court." City of San Antonio, 188 F.R.D. at 458 (citing Clements, 999 F.2d at 846).
Courts must be particularly cautious when a consent decree conflicts with a collective-bargaining agreement. "[P]arties to a collective-bargaining agreement must have reasonable assurance that their contract will be honored." W.R. Grace & Co. v. Local Union 759, Int'l Union of United Rubber, Cork, Linoleum & Plastic Workers of Am., 461 U.S. 757, 771, 103 S.Ct. 2177, 76 L.Ed.2d 298 (1983). "[R]egardless of past wrongs, a court in considering prospective relief is not automatically empowered to make wholesale changes in agreements negotiated by the employees' exclusive bargaining agents in an obviously
The City and the plaintiffs have demonstrated that allowing the City to use situational-judgment questions and an assessment center to examine promotional candidates for the positions of captain and senior captain is tailored to remedy the disparate impact alleged in the plaintiffs' complaint and to ensure that the City's promotional processes for the positions are job-related and consistent with business necessity. The plaintiffs and the City have shown that incorporating situational-judgment tests and assessment centers diminishes the risk of disparate impact and increases the validity of the City's promotional processes. Dr. Brink credibly testified that situational-judgment tests better measure command presence, decision-making ability, and leadership than multiple-choice tests. (Dr. Brink Test., Evidentiary Hr'g Tr. 221-22). There was also testimony that incorporating a situational-judgment component into the promotional examination process reduces the risk of disparate impact. Both Dr. Brink and Dr. Arthur testified that situational-judgment questions can be designed to minimize cognitive loading. Dr. Arthur testified that displaying a hypothetical situation by video, instead of describing it in words, reduces the amount of cognitive skill required to answer the situational-judgment question correctly. Dr. Brink, describing an article by Dr. Arthur, showed that asking questions that require promotional candidates to generate, rather than select, the correct answer can help reduce subgroup differences. (Id. at 223). Dr. Arthur also testified that, based on the number of promotions made during past promotional cycles and the results from the 2010 captain exam incorporating situational-judgment questions, there may not be disparate impact for the 2010-2013 promotional cycle; that testimony provides circumstantial evidence that the situational-judgment component of the exam may help reduce the risk of disparate impact.
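The Guidelines' conventional screen for disparate impact, the "four-fifths rule," 29 C.F.R. § 1607.4(D), compares subgroup selection rates; a selection rate for one group that is less than four-fifths of the highest group's rate is generally regarded as evidence of adverse impact. The sketch below illustrates the calculation; the candidate and promotion counts are hypothetical, not the HFD figures in the record.

```python
# A minimal sketch of the four-fifths screen, 29 C.F.R. 1607.4(D).
# The counts are hypothetical; they are not HFD promotion figures.
def impact_ratio(minority_promoted, minority_candidates,
                 majority_promoted, majority_candidates):
    minority_rate = minority_promoted / minority_candidates
    majority_rate = majority_promoted / majority_candidates
    return minority_rate / majority_rate

ratio = impact_ratio(minority_promoted=6, minority_candidates=40,
                     majority_promoted=30, majority_candidates=100)
print(f"impact ratio = {ratio:.2f}")  # 0.50; below 0.80 suggests adverse impact
```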
But the evidence also showed that situational-judgment questions alone are unlikely to measure the entire job domain for the captain and senior-captain positions. Dr. Lundquist, Dr. Brink, and Dr. Arthur all testified that situational-judgment tests are not as effective as assessment centers at measuring skills and abilities the experts agreed were important, such as command presence, leadership, and interpersonal communication. (See Dr. Brink Test., Evidentiary Hr'g Tr. 221-22; Dr. Lundquist Test., Evidentiary Hr'g Tr. 233 (discussing an article by Dr. Arthur concluding that assessment centers can validly measure "organization and planning and problem solving, ... [and] influencing others")). Dr. Lundquist also explained that situational-judgment questions inevitably involve some cognitive loading, which dilutes their ability to measure noncognitive skills. (Id. at 256). The experts, including Dr. Arthur, acknowledged that limiting the role of the assessment center to measuring the types of skills best measured by such centers also reduces the risk of disparate impact.
The HPFFA's argument that a promotional system should not be designed until job analyses are complete is not a sufficient basis to reject the proposed consent decree. Most of the experts agreed that both the captain and senior-captain positions require skills and abilities, including but not limited to command presence, supervision, leadership, interpersonal communication, and decision-making, that are poorly measured by a multiple-choice test. Dr. Lundquist and Dr. Brink, citing empirical studies widely accepted by industrial psychologists, also showed that cognitively loaded multiple-choice tests tend to produce larger average score differences between racial subgroups, and thus a greater risk of disparate impact, than less cognitively loaded formats.
Dr. Arthur's distinction between the content of the tests and the multiple-choice format as factors in causing disparate impact is an important one. But as Dr. Arthur acknowledged, multiple-choice tests are generally used to analyze cognitively loaded content. The City's continued reliance on exclusively multiple-choice test formats risks continued disparate impact that is neither job-related nor justified by business necessity.
The HPFFA objects that using assessment centers risks subjective scoring. But this objection is not a basis for rejecting this part of the consent decree. There was credible evidence in the record that there are ways to make the assessment center's results less subjective. For example, Dr. Lundquist explained that by using scoring standards and effective assessor training, assessment centers can produce scores approximating the objectivity of multiple-choice tests. (Evidentiary Hr'g Tr. 260-61, Docket Entry No. 130). Establishing detailed, predetermined criteria for performance; providing thorough training to assessors; using multiple assessors; and using postexamination statistical analyses to account for subjective differences between individual assessors are all recognized as effective ways to reduce the risk that promotional candidates will be scored based on subjective judgments unrelated to the quality of their performance in the assessment-center exercises.
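One such postexamination adjustment can be as simple as re-expressing each assessor's ratings on a common scale, so that a systematically lenient or severe rater neither helps nor hurts the candidates assigned to that rater. The sketch below z-scores each assessor's ratings; the names, scores, and two-assessor design are invented for illustration, and this is one generic technique rather than the procedure any expert proposed for HFD.

```python
# A minimal sketch of one postexamination adjustment: z-score each
# assessor's ratings so every assessor's scores share a common mean
# and spread. Names and scores are hypothetical.
from statistics import mean, stdev

ratings = {
    "assessor_A": [("Smith", 88), ("Jones", 92), ("Lee", 80)],
    "assessor_B": [("Diaz", 70), ("Chen", 78), ("Park", 62)],
}

standardized = {}
for assessor, scored in ratings.items():
    raw = [score for _, score in scored]
    m, sd = mean(raw), stdev(raw)
    for candidate, score in scored:
        # each candidate is now expressed relative to his or her own assessor
        standardized[candidate] = round((score - m) / sd, 2)

print(standardized)  # scores are now comparable across assessors
```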
While the City and the plaintiffs have demonstrated that the use of situational-judgment questions and an assessment center is tailored both to minimize disparate impact and to validate the City's promotional processes for the captain and senior-captain positions, they have not demonstrated that the proposed consent decree's wholesale abandonment of promotion based on superior performance on other components of competitive exams is needed either to reduce disparate impact or to validate the promotional processes. The scoring system the consent decree proposes makes a promotional candidate's assessment-center score, which the evidence shows validly measures only part of the "job domain" for the captain and senior-captain positions, the most important score in the promotional process. The evidence did not show that the skills and abilities best measured by an assessment center are the only or the most important skills for the captain and senior-captain positions.
The proposed consent decree allows either no substantive job-knowledge test at all or a scoring system that makes the job-knowledge test pass/fail, negating the value of superior performance on that test. Dr. Arthur's testimony that a pass/fail test would be appropriate only if there were minimal competency requirements for the captain and senior-captain positions is credible and persuasive. (See Evidentiary Hr'g Tr. 416-17, Docket Entry No. 131 ("But I would say if you have some basis for wanting to use pass/fail, which is a minimal competency approach, then there ought to be some argument articulated as to why for this particular exam a minimal competency approach is appropriate, whereas it's not for others.")). The City and the plaintiffs have not demonstrated that "minimal knowledge" is sufficient for either position. To the contrary, the evidence and testimony were that both the captain and senior-captain positions have significant knowledge requirements, and there was no evidence or testimony that the knowledge required is minimal. (See Dr. McPhail Validation Study of the 2006 Captain Exam, Docket Entry No. 37-1, at 37 (identifying "technical knowledge" as one of nine performance dimensions for the captain position based on HFD job descriptions and interviews)).
It appears that once a candidate's assessment-center scores are determined, the candidate's performance on both the job-knowledge test and the situational-judgment test is irrelevant to the promotion decision. Instead, the assessment-center scores are "banded" and promotional decisions are made based on those bands. The evidence is that in addition to the substantive knowledge measured by the job-knowledge component, the situational-judgment component provides reliable and valid measures of many of the skills and abilities relevant to the captain and senior-captain "job domain." Under the proposed consent decree, these scores will be of only very limited value in determining whether a promotional candidate should be promoted. That is true for the situational-judgment component even though the evidence shows that it could validly measure KSAOs directly relevant to whether a promotional candidate will succeed as a captain or senior captain. The record does not justify these consent-decree provisions as tailored to remedy the disparate impact of the existing promotion-exam process.
To the contrary, the evidence shows that a job-knowledge test measuring the knowledge needed for promotion and a situational-judgment test measuring other KSAOs needed for promotion will likely produce some reliable and valid measures of a promotional candidate's ability to perform. The evidence also showed that incorporating situational-judgment questions and an assessment center into the promotion process, in addition to the job-knowledge tests, would reduce disparate impact. The evidence showed that a promotional examination using a job-knowledge component, a computer-based situational-judgment component, and an assessment-center component would measure the KSAOs required for promotion with more reliability and validity than using only one or even two of the components, and would reduce disparate impact. But there is insufficient evidence that giving primacy to the assessment center, as the modified consent decree proposes, is also needed to reduce adverse impact or to create a promotional system that is job-related and consistent with business necessity.
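Mechanically, the dispute over the assessment center's primacy is a dispute over weights. The sketch below shows a generic weighted composite of the three components; the weights are placeholders chosen to illustrate the structure and are not the weights in the proposed consent decree or in any expert's testimony.

```python
# A minimal sketch of a weighted composite of the three components.
# The weights are placeholders, not the proposed decree's weights.
def composite(job_knowledge, situational_judgment, assessment_center,
              weights=(0.4, 0.3, 0.3)):
    """Combine three standardized component scores into one score."""
    w_jk, w_sj, w_ac = weights
    return (w_jk * job_knowledge + w_sj * situational_judgment
            + w_ac * assessment_center)

# Under nonzero weights, superior job-knowledge performance still counts;
# a pass/fail job-knowledge component would zero out w_jk for everyone
# who passes, and assessment-center primacy would inflate w_ac.
print(composite(90, 80, 70))  # 81.0
```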
Similarly, there is not a sufficient evidentiary basis to abandon the Rule of Three codified in the TLGC and accepted in the CBA. Dr. Brink and Dr. Lundquist testified that statistical studies may show that for promotional candidates within a "band of scores," marginally better exam performance does not correlate to better job performance after promotion. Even accepting this testimony, however, the proposed consent decree replaces one clear and predictable method of selecting among promotional candidates with similar scores — the Rule of Three — with a decision-maker's subjective judgment and discretion. The choice of one candidate within a band is standardless. Dr. Brink and Dr. Lundquist both testified that within a band, no one candidate is viewed as more qualified than another. Neither testified that the decision-maker's discretion will produce better selections than the Rule of Three. None of the experts testified that race was to be a criterion for selecting from within a band.
There is no evidence that replacing the Rule of Three is necessary to reduce disparate impact or that banding is likely to be a more valid or reliable basis for identifying job-related qualifications than the Rule of Three. To the contrary, there was evidence that if the test is valid and reliable, there is a "linear relationship" between test scores and performance. (Evidentiary Hr'g Tr. 375, Docket Entry No. 131); accord Nash v. Consolidated City of Jacksonville, Duval Cnty., Fla., 895 F.Supp. 1536, 1551 (M.D.Fla.1995) (summarizing expert testimony showing that if a promotional test is valid and reliable, the test "served as a good predictor of success for the job"). The City has hired industrial psychology experts to design valid and reliable promotional exams. This evidence supports retaining the TLGC and CBA provisions relating to the Rule of Three, which requires the decision-maker to select the promotional candidate with the highest score unless there is a written explanation justifying a different choice. See id. at 1552 (upholding Jacksonville's "rule of one" where the promotional exam was "highly reliable and the City demonstrated the substantial job-relatedness of the exam"); cf. id. at 1553 (stating that eliminating Jacksonville's "rule of one" would "open the [promotional] process to favoritism, politics and tokenism, just what the City is trying to avoid by using the rule of one"). On the basis of the current record, the consent decree's provision abandoning the Rule of Three is not tailored to remedy disparate impact.
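The mechanical difference between the two selection rules can be made concrete. Under the Rule of Three, the decision-maker sees the top three scorers in rank order and must promote the highest absent a written justification; under banding, every candidate whose score falls within the band is treated as interchangeable. The sketch below illustrates both; the names, scores, and band width are hypothetical.

```python
# A minimal sketch contrasting the Rule of Three with band-based
# selection. Names, scores, and the band width are hypothetical.
scores = {"Adams": 94, "Baker": 93, "Cruz": 91, "Dean": 88, "Evans": 85}
ranked = sorted(scores, key=scores.get, reverse=True)

# Rule of Three: top scorer is promoted absent a written justification.
rule_of_three_pool = ranked[:3]       # ['Adams', 'Baker', 'Cruz']
default_pick = rule_of_three_pool[0]  # 'Adams'

# Banding: everyone within `width` points of the top score is treated
# as equally qualified, leaving the choice to the decision-maker.
width = 5
band = [c for c in ranked if scores[ranked[0]] - scores[c] <= width]
print(default_pick, band)  # widen `width` and Dean joins the band
```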
The proposed consent decree's provisions requiring a situational-judgment and an assessment-center component to the promotional exams are approved. Provision "1" of the proposed consent decree is also approved. Provision "4a" of the proposed consent decree, to which the HPFFA has not objected, and which requires the City to bargain for a different seniority system, is also approved.
The following provisions in the decree are not approved because they violate the TLGC and the CBA, and because the City and the plaintiffs have not shown that they are tailored to remedy the statutory violation:
(Docket Entry No. 69-2, at 29).
Devising the proper method for scoring the examination should be accomplished through collective bargaining, after the completion of the job analysis for the senior-captain position. That sequence is consistent with best methods for test development; with the principles embodied in Title VII, which favors voluntary compliance; and with the TLGC, the CBA, and Fifth Circuit precedent. A hearing is set for
29 C.F.R. § 1607.14(B)(3).
A situational-judgment question begins by describing a scenario and then asks the candidate to select, from a list of options, the most likely and least likely actions he or she would take as fire captain. An example is as follows:
(Id., Question 101). The candidate must choose from the following actions: (a) "[a]nnounce that this will be a defensive mode operation, order one of the firefighters to conduct a 360 of the structure, and order incoming trucks to spot the apparatus in preparation to fly pipe"; (b) "[o]rder the next-in Engine to establish a water supply and advance a line to the front door"; (c) "[a]nnounce that this will be an offensive operation, order the next-in engine to establish a water supply, and advance a 2 ½ [inch] line to the A side doors"; and (d) "[p]ass [c]ommand to the next-in company and investigate." (Id.).
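Scoring for a most-likely/least-likely item of this kind is typically keyed to subject-matter experts' consensus about the best and worst responses. The sketch below shows one generic scoring rule; the key and the point values are hypothetical and are not the City's scoring method.

```python
# A minimal sketch of scoring a most-likely/least-likely
# situational-judgment item. The key is hypothetical; it is not the
# City's scoring rule.
def score_sjt_item(most, least, keyed_most, keyed_least):
    points = 0
    if most == keyed_most:
        points += 1    # credit for identifying the keyed best action
    if least == keyed_least:
        points += 1    # credit for identifying the keyed worst action
    return points

# A candidate marks (a) most likely and (d) least likely; the
# hypothetical key agrees with both choices.
print(score_sjt_item(most="a", least="d", keyed_most="a", keyed_least="d"))  # 2
```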
"Item discrimination provides an indication of whether an item appropriately discriminates or differentiates between those examinees who perform well on the test and those who perform poorly." (Id. at 41). The premise for item discrimination analysis is that "[i]f performance on an item is unrelated to performance on the overall exam, then that item is not differentiating among candidates as intended." (Id.). There are two common methods for measuring item discrimination: the index of discrimination and the item-total correlation. The index of discrimination is computed by first dividing test-takers into upper and lower groups based on overall test scores, then subtracting the proportion of the lower group who answered the item correctly from the proportion of the upper group who answered the item correctly. This produces a value, D. One study suggests that questions whose D value is less than two should be eliminated. The item-total correlation "represents the correlation between an item and the rest of the test (i.e., the correlation between the item and the total score on the exam calculated excluding that item)." (Id.). A low item-total correlation means that an item has little relationship with the overall test score and does not discriminate between those who perform well and those who perform poorly. (Id. at 41). One study concludes that items with an item-total correlation below .05 are "very poorly discriminating item[s]" and that items with an item-total correlation greater than .2 "are at least moderately discriminating." (Id. at 42).
Guidelines Questions & Answers, 44 Fed.Reg. at 11998.
(Docket Entry No. 37-1, at 37).
Blise v. Antaramian, 409 F.3d 861, 868 (7th Cir.2005) (alterations omitted) (quoting Chapman v. A.I. Transport, 229 F.3d 1012, 1033-34 (11th Cir.2000)).