1984 Essay Test Construction

Constructing tests

Designing tests is an important part of assessing students’ understanding of course content and their level of competency in applying what they are learning. Whether you use low-stakes, frequent evaluations (quizzes) or high-stakes, infrequent evaluations (midterms and finals), careful design will help provide more calibrated results.

Here are a few general guidelines to help you get started:

  • Consider your reasons for testing.
    • Will this quiz monitor the students’ progress so that you can adjust the pace of the course?
    • Will ongoing quizzes serve to motivate students?
    • Will this final provide data for a grade at the end of the quarter?
    • Will this mid-term challenge students to apply concepts learned so far?

The reason(s) for giving a test will help you determine features such as length, format, level of detail required in answers, and the time frame for returning results to the students.

  • Maintain consistency between goals for the course, methods of teaching, and the tests used to measure achievement of goals. If, for example, class time emphasizes review and recall of information, then so can the test; if class time emphasizes analysis and synthesis, then the test can also be designed to demonstrate how well students have learned these things.
  • Use testing methods that are appropriate to learning goals. For example, a multiple choice test might be useful for demonstrating memory and recall, but it may require an essay or open-ended problem-solving for students to demonstrate more independent analysis or synthesis.
  • Help students prepare. Most students will assume that the test is designed to measure what is most important for them to learn in the course. You can help students prepare for the test by clarifying course goals as well as reviewing material. This will allow the test to reinforce what you most want students to learn and retain.
  • Use consistent language (in stating goals, in talking in class, and in writing test questions) to describe expected outcomes. If you want to use words like explain or discuss, be sure that you use them consistently and that students know what you mean when you use them.
  • Design test items that allow students to show a range of learning. That is, students who have not fully mastered everything in the course should still be able to demonstrate how much they have learned.

Multiple choice exams

Multiple choice questions can be difficult to write, especially if you want students to go beyond recall of information, but the exams are easier to grade than essay or short-answer exams. On the other hand, multiple choice exams provide less opportunity than essay or short-answer exams for you to determine how well the students can think about the course content or use the language of the discipline in responding to questions.

If you decide you want to test mostly recall of information or facts and you need to do so in the most efficient way, then you should consider using multiple choice tests.

The following ideas may be helpful as you begin to plan for a multiple choice exam:

  • Because question wording can be misleading and open to misinterpretation, try to have a colleague answer your test questions before the students do.
  • Be sure that the question is clear within the stem so that students do not have to read the various options to know what the question is asking.
  • Avoid writing items that lead students to choose the right answer for the wrong reasons. For instance, avoid making the correct alternative the longest or most qualified one, or the only one that is grammatically appropriate to the stem.
  • Try to design items that tap students’ overall understanding of the subject. Although you may want to include some items that only require recognition, avoid the temptation to write items that are difficult because they are taken from obscure passages (footnotes, for instance).
  • Consider a formal assessment of your multiple-choice questions with what is known as an “item analysis” of the test.
    For example:
    • Which questions proved to be the most difficult?
    • Were there questions which most of the students with high grades missed?

This information can help you identify areas in which students need further work, and can also help you assess the test itself: Were the questions worded clearly? Was the level of difficulty appropriate? If scores are uniformly high, for example, you may be doing everything right, or have an unusually good class. On the other hand, your test may not have measured what you intended it to.
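The two item-analysis questions above can be explored with a few lines of code. What follows is a minimal sketch, not a standard tool: it assumes each student's answers have already been scored as 1 (correct) or 0 (incorrect), and the function name `item_analysis`, the sample data, and the use of the top and bottom 27% of scorers as comparison groups (a common convention in classical item analysis) are all illustrative assumptions.

```python
def proportion_correct(rows, item):
    """Share of the given students who answered one item correctly."""
    return sum(row[item] for row in rows) / len(rows)


def item_analysis(responses, group_fraction=0.27):
    """Return difficulty and discrimination for every item.

    responses: one list per student, holding 0/1 scores for each item.
    difficulty: proportion of all students answering correctly
                (lower values mean harder items).
    discrimination: proportion correct among the highest scorers minus
                    proportion correct among the lowest scorers.
    """
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]

    results = []
    for item in range(n_items):
        results.append({
            "item": item + 1,
            "difficulty": proportion_correct(responses, item),
            "discrimination": proportion_correct(upper, item)
                              - proportion_correct(lower, item),
        })
    return results


# Hypothetical scored responses: six students, five items.
scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

results = item_analysis(scores)
for row in results:
    print(row)
```

In this made-up data, item 5 has the lowest proportion correct and a negative discrimination value, meaning the students with the highest totals missed it while a low scorer got it right. An item like that is worth rereading for ambiguous wording or a miskeyed answer.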

Essay questions


“Essay tests let students display their overall understanding of a topic and demonstrate their ability to think critically, organize their thoughts, and be creative and original. While essay and short-answer questions are easier to design than multiple-choice tests, they are more difficult and time-consuming to score. Moreover, essay tests can suffer from unreliable grading; that is, grades on the same response may vary from reader to reader or from time to time by the same reader. For this reason, some faculty prefer short-answer items to essay tests. On the other hand, essay tests are the best measure of students’ skills in higher-order thinking and written expression.”
(Barbara Gross Davis, Tools for Teaching, 1993, 272)

When are essay exams appropriate?

  • When you are measuring students’ ability to analyze, synthesize, or evaluate
  • When you have been teaching at these levels (e.g., writing-intensive courses, upper-division undergraduate seminars, graduate courses) or the content lends itself to critical analysis rather than recall of information

How do you design essay exams?

  • Be specific
  • Use words and phrases that alert students to the kind of thinking you expect; for example, identify, compare, or critique
  • Indicate with points (or time limits) the approximate amount of time students should spend on each question and the level of detail expected in their responses
  • Be aware of time; practice taking the exam yourself or ask a colleague to look at the questions

How do you grade essay exams?

  • Develop criteria for appropriate responses to each essay question
  • Develop a scoring guide that tells what you are looking for in each response and how much credit you intend to give for each part of the response
  • Read all of the responses to question 1, then all of the responses to question 2, and on through the exam. This will provide a more holistic view of how the class answered the individual questions

How do you help students succeed on essay exams?

  • Use study questions that ask for the same kind of thinking you expect on exams
  • During lecture or discussion emphasize examples of thinking that would be appropriate on essay exams
  • Provide practice exams or sample test questions
  • Show examples of successful exam answers

Assessing your test

Regardless of the kind of exams you use, you can assess their effectiveness by asking yourself some basic questions:

  • Did I test for what I thought I was testing for?
    If you wanted to know whether students could apply a concept to a new situation, but mostly asked questions determining whether they could label parts or define terms, then you tested for recall rather than application.
  • Did I test what I taught?
    For example, your questions may have tested the students’ understanding of surface features or procedures, while you had been lecturing on causation or relation–not so much what the names of the bones of the foot are, but how they work together when we walk.
  • Did I test for what I emphasized in class?
    Make sure that you have asked most of the questions about the material you feel is the most important, especially if you have emphasized it in class. Avoid questions on obscure material that are weighted the same as questions on crucial material.
  • Is the material I tested for really what I wanted students to learn?
    For example, if you wanted students to use analytical skills such as the ability to recognize patterns or draw inferences, but only used true-false questions requiring non-inferential recall, you might try writing more complex true-false or multiple-choice questions.


Chapter 3
Cognitive Test Construction


Good items are the building blocks of good tests, and the validity of cognitive test scores can hinge on the quality of individual test items. Unfortunately, test makers, both in low-stakes and high-stakes settings, often presume that good items are easy to come by. As Mark Reckase, former assistant vice president at ACT, noted in his 2009 NCME Presidential Address, item writing is often not given the attention it deserves. Research shows that effective item writing is a challenging process, and even the highest-stakes tests include poorly written items (Haladyna & Rodriguez, 2013).

This chapter summarizes the main stages of cognitive test construction, from conception to development, and the main features of cognitive test questions, and reviews the item writing guidelines presented in Haladyna and Downing (1989) and the style guides of major testing companies. The test construction process begins with a clear purpose statement, concise learning objectives, and a test outline or blueprint. The purpose, learning objectives, and test outline then provide a framework for item writing.

Validity and Test Purpose

As often happens in this course, we will begin our discussion of test construction with a review of validity and test purpose. Recall from Chapters 0 through 2 that validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of a test. In other words, validity indexes the extent to which test scores can be used for their intended purpose. These are generic definitions that apply to any type of educational or psychological measure.

In this chapter we’re focusing on cognitive tests, where the purpose of the test is to produce scores that can inform decision making in terms of aptitude and achievement, presumably of students. So, we need to define validity in terms of these more specific test uses. For example, taking the first quiz in an introductory measurement course, we could say that validity refers to the degree to which the content coverage of the test (as outlined in the blueprint, based on the learning objectives) supports the use of scores as a measure of student learning for topics covered in the first part of the course. Based on this definition of validity, what would you say is the purpose of the quiz? Note how test purpose and validity are closely linked.

Construction of a valid test begins with a test purpose. You need to be able to identify the three components of a test purpose, both when presented with a well-defined purpose, and when presented with a general description of a test. Later in the course you’ll be examining test reviews and test technical manuals, which may or may not include clear definitions of test purpose. You’ll have to take the available information and identify, to the best of your ability, what the test purpose is. Here are some verbs to look for: assess, test, and measure (obviously), but also describe, select, identify, examine, and gauge, to name a few.

Do your best to distill the lengthy description below into a one-sentence test purpose. This should be pretty straightforward. The information is all there. This description comes from the technical manual for the 2011 California Standards Test (CST), which is part of the Standardized Testing and Reporting (STAR) program for the state of California (see www.cde.ca.gov). These are more recent forms of the state tests that I took in school back in the 1980s!

California Standards Tests (CSTs) are produced for California public schools to assess the California content standards for ELA, mathematics, history–social science, and science in grades two through eleven.

A total of 38 CSTs form the cornerstone of the STAR program. The CSTs, given in English, are designed to show how well students in grades two through eleven are performing with respect to California’s content standards. These standards describe what students should know and be able to do at each grade level in selected content areas.

CSTs carry the most weight in school and district Academic Performance Index (API) calculations. In addition, the CSTs for ELA and mathematics (grades two through eight) are used in determining Adequate Yearly Progress (AYP), which is used to meet the requirement of the federal Elementary and Secondary Education Act (ESEA) that all students score at the proficient level or above by 2014.

You should have come up with something like this for the CST test purpose: the CST measures ELA, mathematics, history–social science, and science for students in grades two through eleven to show how well they are performing with respect to California’s content standards, and to help determine AYP.

Learning Objectives

To keep the description of the CSTs brief, I omitted details about the content standards. California, like all other states, has detailed standards or learning objectives defining what content/skills/knowledge/information/etc. must be covered by schools in the core subject areas. The standards specify what a student should know and be able to do after their educational experience. They establish the overarching goal for teaching and learning. Teachers, schools, and districts, to some extent, are then free to determine the best way to teach the standards.

In this chapter, we’ll talk about educational standards as a form of learning objectives, which identify the goals or purposes of instruction. Here’s a simplified example of a learning objective for this chapter: write and critique test items. This objective is extremely simple and brief. Can you describe why it would be challenging to assess proficiency or competency for this objective? How could the objective be changed to make it easier to assess?

Learning objectives that are broadly or vaguely defined lead to low-quality, unfocused test questions. The simple item-writing objective above does not include any qualifiers specifying how it is achieved or obtained or appropriately demonstrated. In state education systems, the standards are very detailed and much more numerous than what you’re seeing in this class (Nebraska defines more than 75 science standards in grade 11; for details, see www.education.ne.gov/academicstandards). For example, from the Nebraska State Standards, Grade 11 Abilities to do Scientific Inquiry:

Design and conduct investigations that lead to the use of logic and evidence in the formulation of scientific explanations and models.
Formulate a testable hypothesis supported by prior knowledge to guide an investigation.
Design and conduct logical and sequential scientific investigations with repeated trials and apply findings to new investigations.

Note that these standards reflect specific things students should be able to do, and some conditions for how students can do these things well. Such specific wording greatly simplifies the item writing process because it clarifies precisely the knowledge, skills, and abilities that should be measured.

Note also that the simplest way to assess the first science objective listed above would be to simply ask students to design and conduct an investigation that leads to the use of logic and evidence in the formulation of scientific explanations and models. The standard itself is almost worded as a test question. This is often the case with well-written standards. Unfortunately, the standardized testing process includes constraints, like time limits, that make it difficult or impossible to assess standards so directly. Designing and conducting an experiment requires time and resources. Instead, in a test we might refer students to an example of an experiment and ask them to identify correct or incorrect procedures; or we might ask students to use logic when making conclusions from experimental results. In this way, we use individual test questions to indirectly assess different components of a given standard.

Features of Test Items

Depth of knowledge

In addition to being written to specific standards or learning objectives, cognitive test items are also written to assess at a specific depth of knowledge (DOK). The depth of knowledge of an item indicates its level of complexity in terms of the knowledge and skills required to obtain a correct response. Bloom and Krathwohl (1956) presented the original framework for categorizing depth of knowledge in cognitive assessments. However, the majority of achievement tests nowadays use some version of the DOK categories presented by Webb (2002). These DOK levels differ somewhat by content area, but are roughly defined, in order of increasing complexity, as 1) recall and reproduction, 2) skills and concepts, 3) strategic thinking, and 4) extended thinking.

These simple DOK categories can be modified to meet the needs of a particular testing program. For example, here is the description of Level 1 DOK used in writing items for the standardized science tests in Nebraska:

Level 1 Recall and Reproduction requires recall of information, such as a fact, definition, term, or a simple procedure, as well as performing a simple science process or procedure. Level 1 only requires students to demonstrate a rote response, use a well-known formula, follow a set procedure (like a recipe), or perform a clearly defined series of steps. A “simple” procedure is well-defined and typically involves only one-step.

Verbs such as “identify,” “recall,” “recognize,” “use,” “calculate,” and “measure” generally represent cognitive work at the recall and reproduction level. Simple word problems that can be directly translated into and solved by a formula are considered Level 1. Verbs such as “describe” and “explain” could be classified at different DOK levels, depending on the complexity of what is to be described and explained.

DOK descriptions such as this are used to categorize items in the item writing process, and thereby ensure that the items together support the overall DOK required in the purpose of the test. Typically, higher DOK is preferable. However, lower levels of DOK are sometimes required to assess certain objectives, for example, ones that require students to recall or reproduce definitions, steps, procedures, or other key information. Furthermore, constraints on time and resources within the standardized testing process often make it impossible to assess the highest level of DOK, which requires extended thinking and complex cognitive demands.

Item Types

Cognitive test items come in a variety of types that differ in how material is presented to the test taker, and how responses are then collected. Most cognitive test questions begin with a stem or question statement, and then include one or more options for response. The classic multiple-choice test question includes a stem that ends with a question or some direction or indication that the test taker must choose one of a set of responses.

In general, what is the optimal number of response options in a cognitive multiple-choice test question?


Research shows that the optimal number of response options in a multiple-choice item is three (Rodriguez, 2005). Tradition leads many item writers to use four options consistently; however, a plausible fourth option is often difficult to write, so test takers can easily discount it, which makes it unnecessary.

A variety of selected-response item types are available. More popular types include:

  • true/false, where test takers simply indicate whether a statement is true or false;
  • multiple correct or select all that apply, where more than one option can be selected as correct;
  • multiple true/false, a simplified form of multiple correct where options consist of binary factual statements (true/false) and are preceded by a prompt or question statement linking them together in some way;
  • matching, where test takers select for each option in one list the correct match from a second list;
  • complex multiple-choice, where different combinations of response options can be selected as correct, resembling a constrained form of multiple correct (e.g., options A and B, A and C, or all of the above); and
  • evidence-based question, which can be any form of selected-response item where a follow-up question requires test takers to select an option justifying their response to the original item.

Evidence-based questions are becoming more popular in standardized achievement testing because, test makers claim, they can be used to assess more complex reasoning. This is achieved by nesting content from one question inside the follow-up. Here’s a simple evidence-based question on DOK.

Part I. In a constructed-response science question, students are given a hypothesis and must then describe with an essay an experiment that could be used to test the hypothesis. In their description they must identify the key components of the experiment and justify the importance of each component in testing the hypothesis.

What depth of knowledge level does this science question assess?


Part II. What task from the science question in Part I best supports the answer for Part I?

Describe an experiment.
Identify the key components of an experiment.
Justify the importance of each component.

A constructed-response item does not present options to the test taker. As the name implies, a response must be constructed. Constructed-response items include short-answer, fill-in-the-blank, graphing, manipulation of information, and essays. Standardized performance assessments, e.g., reading fluency measures, can also be considered constructed-response tasks.

The science question within Part I of the evidence-based DOK question above is an example of a simple essay question. Note that this science question could easily be converted to a selected-response question with multiple correct answers, where various components of an experiment, some correct and some incorrect, could be presented to the student. Parts I and II from the evidence-based DOK question could also easily be converted to a single constructed-response question, where test takers identify the correct DOK for the science question, and then provide their own supporting evidence.

There are some key advantages and disadvantages to multiple-choice or selected-response items and constructed-response items. In terms of advantages, selected-response items are typically easy to administer and score, and are more objective and reliable than constructed-response items. They are also more efficient, and can be used to cover more test content in a shorter period of time. Finally, selected-response items can provide useful diagnostic information about specific misconceptions that test takers might have.

Although they are more efficient and economical, selected-response items are more difficult to write well. They also tend to focus on lower-order thinking and skills, such as recall and reproduction, and they are more susceptible to test-wiseness and guessing. Constructed-response items address each of these issues. They are easier to write, especially for higher-level thinking, and they eliminate the potential for simple guessing.

The main benefit of constructed-response questions is that they can be used to test more practical, authentic, and realistic forms of performance and tasks, including creative skills and abilities. The downside is that these types of performance and tasks require time to demonstrate and are then complex and costly to score.

Consider these advantages and disadvantages for the different forms of the DOK question above, and the science question within it. Would the limitations of the selected-response forms be worth the gains in efficiency? Or would the gains in authenticity and DOK justify the use of the constructed-response forms?


As noted above, though constructed-response questions can be more effective at assessing higher DOK, scoring can be time-consuming, inefficient, and unreliable. These limitations in scoring are minimized, to the extent possible, through the use of scoring rubrics. Scoring rubrics provide an outline for what constitutes a correct response, or levels of correctness in a response.

Rubrics are typically described as either analytic or holistic. An analytic rubric breaks down a response into characteristics or components, each of which can be present or correct to different degrees. For example, an essay response may be scored based on its introduction, body, and conclusion. A required feature of the introduction, for example, could be a clear thesis statement. Rubrics that analyze components of a response are more time consuming to develop and use; however, they can provide a more detailed evaluation than rubrics that do not analyze the components of a response, i.e., holistic rubrics. A holistic rubric provides a single score based on an overall evaluation of a response. Holistic rubrics are simpler to develop and use; however, they do not provide detailed information about the strengths or weaknesses in a response.
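The difference between the two rubric types can be made concrete with a short sketch. Everything in it is an illustrative assumption rather than a prescribed scheme: the component names follow the essay example above, while the point values, the four-level holistic scale, and the function names `score_analytic` and `score_holistic` are hypothetical.

```python
# Hypothetical analytic rubric: maximum points for each essay component.
ANALYTIC_RUBRIC = {"introduction": 3, "body": 4, "conclusion": 3}


def score_analytic(ratings, rubric=ANALYTIC_RUBRIC):
    """Score a response component by component.

    ratings: points awarded for each component named in the rubric.
    Returns the component scores alongside the total, so the scorer
    can see where the response was strong or weak.
    """
    for part, max_points in rubric.items():
        if part not in ratings:
            raise ValueError(f"missing rating for {part!r}")
        if not 0 <= ratings[part] <= max_points:
            raise ValueError(f"{part!r} must be between 0 and {max_points}")
    return {
        "components": {part: ratings[part] for part in rubric},
        "total": sum(ratings[part] for part in rubric),
        "max_total": sum(rubric.values()),
    }


def score_holistic(level, max_level=4):
    """Score a response with a single overall judgment (assumed 1-4 scale)."""
    if not 1 <= level <= max_level:
        raise ValueError(f"level must be between 1 and {max_level}")
    return {"total": level, "max_total": max_level}


analytic = score_analytic({"introduction": 2, "body": 3, "conclusion": 1})
holistic = score_holistic(3)
print(analytic)
print(holistic)
```

The analytic result keeps a score for each component, which is what makes it more informative and also more work to produce. The holistic result is a single number with no record of why it was assigned.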

Test Outline

In its simplest form, a test outline is a table that summarizes how the items in a test are distributed in terms of key features such as content areas or subscales (e.g., quantitative reasoning, verbal reasoning), standards or objectives, item types, and depth of knowledge. Table 3.1 contains a simple example for a cognitive test with three content areas.

Table 3.1: Simple Example Test Blueprint

Scale     Learning Objective                                  DOK   Items
Reading   Define key vocabulary                               1     12
          Select the most appropriate word                    2     10
Writing   Write a short story                                 3     1
          Evaluate an argument and construct a rebuttal       4     2
Math      Solve equations with two unknowns                   4     8
          Run a linear regression and interpret the output    4     5

A test outline or blueprint is used to ensure that a test measures the content areas captured by the tested construct, and that these content areas are measured in the appropriate ways. For example, in Table 3.1 notice that we’re only assessing reading using the first two levels of DOK. Perhaps scores from this test will be used to select among student applicants for a summer reading program. The test purpose would then need to include some mention of reading comprehension, which would then be assessed at a deeper level of knowledge.
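A blueprint like Table 3.1 is easy to check once it is written down as data. The sketch below is one possible way to do that, under stated assumptions: the rows are transcribed from Table 3.1, with each objective assigned to the scale it appears to belong to, and the names `BLUEPRINT` and `summarize` are hypothetical.

```python
from collections import defaultdict

# Rows of Table 3.1 as (scale, learning objective, DOK level, number of items).
# The scale for the second row of each pair is inferred from the table layout.
BLUEPRINT = [
    ("Reading", "Define key vocabulary", 1, 12),
    ("Reading", "Select the most appropriate word", 2, 10),
    ("Writing", "Write a short story", 3, 1),
    ("Writing", "Evaluate an argument and construct a rebuttal", 4, 2),
    ("Math", "Solve equations with two unknowns", 4, 8),
    ("Math", "Run a linear regression and interpret the output", 4, 5),
]


def summarize(blueprint):
    """Tally items per scale and per DOK level, plus the highest DOK per scale."""
    items_per_scale = defaultdict(int)
    items_per_dok = defaultdict(int)
    max_dok_per_scale = defaultdict(int)
    for scale, _objective, dok, n_items in blueprint:
        items_per_scale[scale] += n_items
        items_per_dok[dok] += n_items
        max_dok_per_scale[scale] = max(max_dok_per_scale[scale], dok)
    return dict(items_per_scale), dict(items_per_dok), dict(max_dok_per_scale)


items_per_scale, items_per_dok, max_dok_per_scale = summarize(BLUEPRINT)
print("Items per scale:", items_per_scale)
print("Items per DOK level:", items_per_dok)
print("Highest DOK per scale:", max_dok_per_scale)
```

The summary makes the point about reading visible at a glance: reading accounts for most of the items, yet its highest DOK level is 2, while writing and math both reach level 4.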

The learning objectives in Table 3.1 are intentionally left vague. How can they be improved to make these content areas more testable? Consider how qualifying information could be included in these objectives to clarify what would constitute high-quality performance or responses.

Item Writing

The item writing guidelines presented in Haladyna, Downing, and Rodriguez (2002) are reproduced here for reference. The guidelines are grouped into ones addressing content concerns, formatting concerns, style concerns, issues in writing the stem, and issues in writing the response options.

Content concerns

1. Every item should reflect specific content and a single specific mental behavior, as called for in test specifications (two-way grid, test blueprint).
2. Base each item on important content to learn; avoid trivial content.
3. Use novel material to test higher level learning. Paraphrase textbook language or language used during instruction when used in a test item to avoid testing for simply recall.
4. Keep the content of each item independent from content of other items on the test.
5. Avoid over specific and over general content when writing multiple-choice (MC) items.
6. Avoid opinion-based items.
7. Avoid trick items.
8. Keep vocabulary simple for the group of students being tested.

Formatting concerns

9. Use the question, completion, and best answer versions of the conventional MC, the alternate choice, true-false, multiple true-false, matching, and the context-dependent item and item set formats, but AVOID the complex MC (Type K) format.
10. Format the item vertically instead of horizontally.

Style concerns

11. Edit and proof items.
12. Use correct grammar, punctuation, capitalization, and spelling.
13. Minimize the amount of reading in each item.

Writing the stem

14. Ensure that the directions in the stem are very clear.
15. Include the central idea in the stem instead of the choices.
16. Avoid window dressing (excessive verbiage).
17. Word the stem positively, avoid negatives such as NOT or EXCEPT. If negative words are used, use the word cautiously and always ensure that the word appears capitalized and boldface.

Writing the choices

18. Develop as many effective choices as you can, but research suggests three is adequate.
19. Make sure that only one of these choices is the right answer.
20. Vary the location of the right answer according to the number of choices.
21. Place choices in logical or numerical order.
22. Keep choices independent; choices should not be overlapping.
23. Keep choices homogeneous in content and grammatical structure.
24. Keep the length of choices about equal.
25. None-of-the-above should be used carefully.
26. Avoid All-of-the-above.
27. Phrase choices positively; avoid negatives such as NOT.
28. Avoid giving clues to the right answer, such as
    a. Specific determiners including always, never, completely, and absolutely.
    b. Clang associations, choices identical to or resembling words in the stem.
    c. Grammatical inconsistencies that cue the test-taker to the correct choice.
    d. Conspicuous correct choice.
    e. Pairs or triplets of options that clue the test-taker to the correct choice.
    f. Blatantly absurd, ridiculous options.
29. Make all distractors plausible.
30. Use typical errors of students to write your distractors.
31. Use humor if it is compatible with the teacher and the learning environment.

Construct Irrelevant Variance

Rather than review each item writing guideline, we’ll just summarize the main theme that they all address. This theme has to do with the intended construct that a test is measuring. Each guideline targets a different source of what is referred to as construct irrelevant variance that is introduced in the testing process.

For example, consider guideline 8, which recommends that we “keep vocabulary simple for the group of students being tested.” When vocabulary becomes unnecessarily complex, we end up testing vocabulary knowledge and related constructs in addition to our target construct. The complexity of the vocabulary should be appropriate for the audience and should not interfere with the construct being assessed. Otherwise, it introduces variability in scores that is irrelevant or confounding with respect to our construct.

Another simple example is guideline 17, which recommends that we “word the stem positively” and “avoid negatives such as NOT or EXCEPT.” The use of negatives, and worse yet, double negatives, introduces a cognitive load into the testing process that may not be critical to the construct we want to assess.

Summary and Homework

This chapter provides an overview of cognitive test construction and item writing. Effective cognitive tests have a clear purpose and are structured around well-defined learning objectives. These objectives are organized, potentially by content area, within a test outline that also describes key features of the test, such as the depth of knowledge assessed, and the types of items used. Together, these features specify the number and types of items that must be developed to adequately address the test purpose.

Learning objectives

Describe the purpose of a cognitive learning objective or learning outcome statement, and demonstrate the effective use of learning objectives in the item writing process.
Describe how a test blueprint or test plan is used in cognitive test development to align the test to the content domain and learning objectives.
Compare items assessing different cognitive levels or depth of knowledge, e.g., higher-order thinking such as synthesizing and evaluating information versus lower-order thinking such as recall and definitional knowledge.
Identify and provide examples of selected-response item types (multiple-choice, true/false, matching) and constructed-response item types (short-answer, essay).
Compare and contrast selected-response and constructed-response item types, describing the benefits and limitations of each type.
Identify the main theme addressed in the item writing guidelines, and how each guideline supports this theme.
Create and use a scoring rubric to evaluate answers to a constructed-response question.
Write and critique cognitive test items that match given learning objectives and depths of knowledge and that follow the item writing guidelines.


A careful review of any testing program will identify poorly worded test items, written by persons with minimal training and inadequate insights into their audience. We need to do much more work to produce quality test items.

— Mark Reckase, 2009 NCME Presidential Address

