Bias in Student Evaluations of Teaching

Bias and questions of validity have been ongoing concerns among researchers of student evaluations of teaching (SETs) since their inception over a century ago. The evolution of student evaluations into an integral component of faculty advancement has only increased scrutiny of their potential flaws as a measure of instructional quality. The following resource presents an overview of several forms of bias that research on SETs has examined.

This resource is intended to assist administrators and faculty who use SETs as summative evaluations of teaching performance. The categories of bias presented here are not exhaustive, but they cover some of the key areas of bias present in student evaluations. These biases do not negate the importance of collecting and acting on student feedback; they should, however, be weighed when evaluating instructors on the basis of that feedback.

Instructors can review our resource on Preparing for and Responding to Student Evaluations of Teaching for research-based strategies to mitigate many of the biases listed here. 

Bias and instructor characteristics 

The largest portion of research on bias in SETs is dedicated to the influence instructor characteristics may have on student ratings of instruction – and the results are clear. Instructors of color and female-identifying instructors tend to receive lower SET scores than their white, male-identifying peers. There is no definitive explanation for why these correlations exist, though some researchers speculate that students may be punishing instructors whose behavior violates gendered or racialized stereotypes. The intersection of these traits further expands or diminishes the scale of the bias, with white female-identifying instructors receiving higher average ratings than their Black female-identifying peers. Studies of other characteristics, such as an instructor’s age, appearance, accent, personality, teaching experience, and tenure status, have yielded more mixed results. Administrators should consider these findings as they evaluate all faculty, but particularly those with minoritized identities.

Bias and course characteristics 

Beyond their attitudes toward the instructor, students’ attitudes toward the course being taught can also influence their SET ratings. For example, higher-level courses, which tend to enroll more students majoring in the subject, tend to receive higher SET ratings, while courses that students perceive as more difficult, a perception that may be shaped by their view of the discipline as a whole, tend to receive lower ratings. Likewise, an instructor who implements equitable active learning components in their course may receive lower SET ratings if students are unconvinced that those activities enhance their learning.

In most instances, however, a given course characteristic may only prove a source of bias when combined with other factors. For instance, a lower-level introductory course may be taken by a student solely to satisfy a degree requirement. If that course also has a large class size and meets at 9:00 a.m., the confluence of these factors may leave the student with a slightly unfavorable impression of their learning experience. It is therefore important to consider the characteristics of a course, including its subject, level, size, meeting time, and modality, to adequately compensate for potential bias in student ratings of instruction.

Bias and student characteristics

Students’ own characteristics can also influence how they rate instructors, though the most prominent of these factors are procedural rather than demographic. A student’s race, gender, and academic year have not been found to strongly correlate with higher or lower SETs, while a student’s grade – received or expected – can significantly influence how they approach their instructor evaluations.  

There is a consensus within recent scholarship that students expecting to earn a low grade are more likely to attribute the grade to their instructor than to their own actions and to assign that instructor lower ratings. This correlation, and the bias it implies, has for decades led instructors to accuse SETs of fomenting grade inflation, and it should remain a serious concern for all who use SET data.

Bias and the SET process 

Studies examining bias within the logistics of the SET process have yielded a wide range of findings. Research into the design of SET instruments indicates that students read questions and response options carefully, and that the language of a survey can therefore significantly affect the validity of the SET process.

To obtain results with the greatest utility for instructors, researchers recommend that administrators include language that encourages students to focus on their learning and to rate their instructor in the context of how well they facilitated students’ achievement of their learning outcomes. Likewise, students are more likely to provide honest, productive feedback if they believe the evaluations reflect the quality of the course and that students and instructors – rather than administrators – are likely to benefit.

When examining how SETs are interpreted by stakeholders, however, researchers have uncovered persistent tendencies among instructors and administrators to treat minor differences in results as significant, despite warnings against doing so. These studies advise against using quantitative SET feedback to compare an instructor’s performance with that of their peers and instead suggest that instructors compare their SETs with the feedback they received in previous semesters.
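
As a purely illustrative sketch, the Python snippet below uses entirely hypothetical numbers (a 0.2-point gap between two section means on a 5-point scale, with a typical spread of individual ratings) to show why such small differences are usually indistinguishable from ordinary rating noise: the uncertainty around the difference is larger than the difference itself.

```python
import math

# Illustrative only: all numbers are hypothetical, not drawn from any cited study.
# Two sections rate the same instructor on a 5-point scale.
mean_a, mean_b = 4.1, 4.3   # section means
sd = 0.9                    # typical spread of individual ratings
n_a, n_b = 28, 31           # respondents per section

# Standard error of the difference between the two section means
se_diff = math.sqrt(sd**2 / n_a + sd**2 / n_b)

# Approximate 95% interval around the observed 0.2-point gap
diff = mean_b - mean_a
low, high = diff - 1.96 * se_diff, diff + 1.96 * se_diff
print(f"difference = {diff:.2f}, 95% CI ~ ({low:.2f}, {high:.2f})")
# The interval spans zero, so a gap of this size could easily
# arise from sampling noise rather than a real difference in teaching.
```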

Advice for reviewers of SETs 

SETs are influenced by a number of student biases that need to be considered when analyzing student feedback. This resource has presented evidence across four categories of bias (instructor characteristics, course characteristics, student characteristics, and the SET process itself) to consider when reviewing SETs.

SETs reflect student attitudes and can never be entirely free of bias. Therefore, all parties who use SETs as an assessment of teaching performance must consider the potential factors outside an instructor’s control that influence student ratings of instruction and develop realistic expectations for the practical utility of the feedback provided.

For additional strategies on the incorporation of SET data into teaching reviews, consult the American University Beyond SETs Task Force paper on Rethinking SETs. For research-based strategies to mitigate many of the biases listed here, consult our guide on Preparing for and Responding to Student Evaluations of Teaching.

Additional Information and Resources

Brief History of SETs

Student evaluations of teaching emerged out of the nascent study of psychometrics in the late 19th century. Initially developed by Francis Galton, psychometrics is concerned with the quantitative measurement of psychological traits.  

In the early 20th century, rating scales became the common method of psychometric analysis, particularly for assessing individuals’ perceptions. The first rating scale applied to the evaluation of instructors was developed by E.C. Elliott at the University of Wisconsin. Published in 1910, Elliott’s Scorecard for Measuring the Merit of Teachers assessed several traits, each assigned a numeric score, with the total of all scores equaling a maximum of 100.

By the 1920s, many competing rating scales had been published, including Leo Brueckner’s Scales for the Rating of Teaching Skill (1927), G. C. Brandenburg and H. H. Remmers’ Purdue Rating Scale for Instructors (1928), and Clara Brown’s Rating Scale for Teachers of Home Economics (1928). By the 1930s, however, rating scales for evaluating instructors had coalesced around the techniques of Rensis Likert, as outlined in his 1932 dissertation, A Technique for the Measurement of Attitudes.

For the next three decades, SETs were voluntary, administered only by instructors seeking formative feedback, with no obligation on the part of faculty to release the results to any party, including their institutions.

The widespread adoption of SETs began in the late 1960s and early 1970s in response to student protests against “irrelevant courses and uninspired teaching” (Gaff & Simpson, 1994, p. 168). These protests were in part a result of the diversification of the student population, which incorporated larger numbers of women, students of color, and older students who expected course material to reflect their lived experiences.

The economic recession of the 1970s and a declining college-age population forced higher education institutions to accede to student demands in order to maintain stable enrollment. The result was the creation of faculty development programs intended to promote the acquisition of instructional skills. One consequence of these programs was the introduction of systems of accountability for instructors, a key component of which was the SET. Student evaluations of instructors were subsequently reviewed by administrators, alongside the traditional metric of scholarly publications, when considering faculty retention and promotion.

References

Basow, S. A., Codos, S., & Martin, J. L. (2013). The effects of professors’ race and gender on student evaluations and performance. College Student Journal, 47(2), 352–363. 

Bendig, A. W. (1952). A preliminary study of the effect of academic level, sex, and course variables on student rating of psychology instructors. The Journal of Psychology, 33, 21–26. 

Blum, M. L. (1936). An investigation of the relation existing between students’ grades and their ratings of the instructor’s ability to teach. Journal of Educational Psychology, 27(3), 217–221. https://doi.org/10.1037/h0062859 

Boysen, G. A. (2015). Preventing the overinterpretation of small mean differences in student evaluations of teaching: An evaluation of warning effectiveness. Scholarship of Teaching and Learning in Psychology, 1(4), 269–282. https://doi.org/10.1037/stl0000042 

Cain, K. M., Wilkowski, B. M., Barlett, C. P., Boyle, C. D., & Meier, B. P. (2018). Do we see eye to eye? Moderators of correspondence between student and faculty evaluations of day-to-day teaching. Teaching of Psychology, 45(2), 107–114. https://doi.org/10.1177/0098628318762862 

Carpenter, S. K., Mickes, L., Rahman, S., & Fernandez, C. (2016). The effect of instructor fluency on students’ perceptions of instructors, confidence in learning, and actual learning. Journal of Experimental Psychology: Applied, 22(2), 161–172. https://doi.org/10.1037/xap0000077 

Clayson, D. E., & Haley, D. A. (2011). Are students telling us the truth? A critical look at the student evaluation of teaching. Marketing Education Review, 21(2), 101–112. https://doi.org/10.2753/MER1052-8008210201 

Costin, F., Greenough, W. T., & Menges, R. J. (1971). Student ratings of college teaching: Reliability, validity, and usefulness. Review of Educational Research, 41(5), 511–535. 

Doubleday, A. F., & Lee, L. M. J. (2016). Dissecting the voice: Health professions students’ perceptions of instructor age and gender in an online environment and the impact on evaluations for faculty. Anatomical Sciences Education, 9(6), 537–544. https://doi.org/10.1002/ase.1609 

Downing, V. R., Cooper, K. M., Cala, J. M., Gin, L. E., & Brownell, S. E. (2020). Fear of negative evaluation and student anxiety in community college. CBE—Life Sciences Education, 19(2). https://doi.org/10.1187/cbe.19-09-0186 

Gaff, J. G., & Simpson, R. D. (1994). Faculty development in the United States. Innovative Higher Education, 18(3), 167–176. 

Guthrie, E. R. (1949). The evaluation of teaching. The Educational Record, (April), 109–115. 

Hartley, E. L., & Hogan, T. P. (1972). Some additional factors in student evaluation of courses. American Educational Research Journal, 9(2), 241–250. https://doi.org/10.3102/00028312009002241 

Joye, S. W., & Wilson, J. H. (2015). Professor age and gender affect student perceptions and grades. Journal of the Scholarship of Teaching and Learning, 15(4), 126–138. 

Lewis, K. G. (1996). Faculty development in the United States: A brief history. International Journal for Academic Development, 1(2), 26–33. https://doi.org/10.1080/1360144960010204 

Littleford, L. N., & Jones, J. A. (2017). Framing and source effects on White college students’ reactions to racial inequity information. Cultural Diversity and Ethnic Minority Psychology, 23(1), 143–153. https://doi.org/10.1037/cdp0000102 

Liu, O. L. (2012). Student evaluation of instruction: In the new paradigm of distance education. Research in Higher Education, 53(4), 471–486. https://doi.org/10.1007/s11162-011-9236-1 

Magel, R. C., Doetkott, C., & Cao, L. (2017). A study of the relationship between gender, salary, and student ratings of instruction at a research university. NASPA Journal About Women in Higher Education, 10(1), 96–117. https://doi.org/10.1080/19407882.2017.1285792 

McClain, L., Gulbis, A., & Hays, D. (2018). Honesty on student evaluations of teaching: Effectiveness, purpose, and timing matter! Assessment & Evaluation in Higher Education, 43(3), 369–385. https://doi.org/10.1080/02602938.2017.1350828 

Meltzer, A. L., & Mcnulty, J. K. (2011). Contrast effects of stereotypes: “Nurturing” male professors are evaluated more positively than “nurturing” female professors. The Journal of Men’s Studies, 19(1), 57–64. https://doi.org/10.3149/jms.1901.57 

Nadler, J. T., Berry, S. A., & Stockdale, M. S. (2013). Familiarity and sex based stereotypes on instant impressions of male and female faculty. Social Psychology of Education: An International Journal, 16(3), 517–539. https://doi.org/10.1007/s11218-013-9217-7 

Narayanan, A., Sawaya, W. J. I., & Johnson, M. D. (2014). Analysis of differences in nonteaching factors influencing student evaluation of teaching between engineering and business classrooms. Decision Sciences Journal of Innovative Education, 12(3), 233–265. https://doi.org/10.1111/dsji.12035 

Nargundkar, S., & Shrikhande, M. (2014). Norming of student evaluations of instruction: Impact of noninstructional factors. Decision Sciences Journal of Innovative Education, 12(1), 55–72. https://doi.org/10.1111/dsji.12023 

Parks-Stamm, E. J., & Grey, C. (2016). Evaluating engagement online: Penalties for low-participating female instructors in gender-balanced academic domains. Social Psychology, 47(5), 281–287. https://doi.org/10.1027/1864-9335/a000277 

Patrick, C. L. (2011). Student evaluations of teaching: Effects of the big five personality traits, grades and the validity hypothesis. Assessment & Evaluation in Higher Education, 36(2), 239–249. 

Ray, B., Babb, J., & Wooten, C. A. (2018). Rethinking SETs: Retuning student evaluations of teaching for student agency. Composition Studies, 46(1), 34–56. 

Risser, H. S. (2010). Internal and external comments on course evaluations and their relationship to course grades. The Mathematics Enthusiast, 7(2-3), 401–412. 

Rodriguez, M. C. (2016). The origin and development of rating scales. 

Rucker, M. H., & Haise, C. L. (2012). Effects of variations in stem and response options on teaching evaluations. Social Psychology of Education: An International Journal, 15(3), 387–394. https://doi.org/10.1007/s11218-012-9186-2 

Samuel, M. L. (2019). Flipped pedagogy and student evaluations of teaching. Active Learning in Higher Education, 22(2), 159-168. https://doi.org/10.1177/1469787419855188  

Schueths, A. M., Gladney, T., Crawford, D. M., Bass, K. L., & Moore, H. A. (2013). Passionate pedagogy and emotional labor: Students’ responses to learning diversity from diverse instructors. International Journal of Qualitative Studies in Education, 26(10), 1259–1276. https://doi.org/10.1080/09518398.2012.731532 

Smith, D. L., Cook, P., & Buskist, W. (2011). An experimental analysis of the relation between assigned grades and instructor evaluations. Teaching of Psychology, 38(4), 225–228. https://doi.org/10.1177/0098628311421317 

Smith, B. P., & Hawkins, B. (2011). Examining student evaluations of Black college faculty: Does race matter? Journal of Negro Education, 80(2), 149–162. 

Socha, A. (2013). A hierarchical approach to students’ assessments of instruction. Assessment & Evaluation in Higher Education, 38(1), 94–113. https://doi.org/10.1080/02602938.2011.604713 

Stigall, L., & Blincoe, S. (2015). Student and instructor use of the Teacher Behavior Checklist. Teaching of Psychology, 42(4), 299–306. https://doi.org/10.1177/0098628315603061 

Zhao, J., & Gallant, D. J. (2012). Student evaluation of instruction in higher education: Exploring issues of validity and reliability. Assessment & Evaluation in Higher Education, 37(2), 227–235. https://doi.org/10.1080/02602938.2010.523819