By Ning Zeng, Atmospheric & Oceanic Science, UMCP

*Course evaluations have been with us for almost 100 years, and for most of those years there has been an uneasy relationship between faculty members and administrators, and between faculty members and their students. Are the ratings a valid scientific basis for determining the quality of teaching? Should they be relied upon for promotion, tenure, and salary decisions? Is the anticipation of ratings a cause of grade inflation? Prof. Zeng wades into this controversy. Other views are of course welcome.*

Overemphasis on student course evaluation numbers may cause an evolution towards less competent graduates. Over the last few years, UMCP has fully implemented the student CourseEval system in which students rate the classroom teacher. This information provides useful input on faculty teaching. It is particularly attractive for administrative purposes because the numbers present themselves as unbiased quantitative measures that can easily be used for ranking and decision-making. However, there are a number of potential pitfalls in using these evaluations. Here I offer a few observations.

**The ‘quantitativeness’ may lead to over-confidence in its information content:** Even though it comes as numbers, the evaluation starts from five subjective choices (0 for ‘strongly disagree,’ 4 for ‘strongly agree,’ etc.) that are digitized and averaged. The average student response rate is 60-70%. The standard deviation is often high. For a typical average of 3 points (out of 4) and a standard deviation of 1, an individual evaluation can be as high as 4 (but not higher), or lower than 2. On the other hand, the campus has a policy to toss out the evaluations from courses with fewer than five students. This makes sense statistically, but it misses many graduate courses that do not have large enrollments but nonetheless provide critically important advanced training, especially at the Ph.D. level.

**The sense of ‘unbiased’ evaluation is not warranted:** Because the evaluation is anonymous, and conducted before the final grades are given, it is considered less prone to bias compared to, e.g., solicitation during the course where unhappy students are afraid of speaking out for fear of a negative impact on their grades.

However, the situation may be significantly more complex. For example, in a ‘difficult’ class, some students may feel so challenged and unhappy that they give a poor evaluation of the instructor, only to discover later that their grades are not bad at all because the grades are largely relative. Ironically, I know of a situation where students in the class were highly challenged by the course material from the very beginning, and several of them complained bitterly. Realizing that they would do poorly in the class, they dropped out before the deadline for class withdrawal. The remaining students liked the class and gave high evaluations.

It is well known that the response rate is influenced by a ‘motivation gap’ as disgruntled students are more motivated to do the evaluation. Another observation is that negative comments are typically much longer with many complaints, while the positive comments are typically short, like ‘I enjoyed the class.’

**High impact of negative evaluations:** Potential consequences of this ‘motivation gap’ can be seen in an example illustrated in Figure 1. In a hypothetical situation with 30,000 students, if only the lower 75% of the class actually do the evaluation, i.e., the 25% that would give higher marks do not participate, the average score would be 0.55 points lower (2.45 as opposed to 3.00). If only the lower 50% participate, the average would be 0.80 points lower (2.20).
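The selective-response effect can be explored with a few lines of Python. This is a minimal sketch, not the code behind Figure 1: it assumes a continuous, unclipped Gaussian with mean 3 and standard deviation 1, and the helper name `lower_fraction_mean` is hypothetical. Under these assumptions the 50% case should come out close to the 0.80-point shift quoted above; results for other cutoffs depend on details the text does not specify (for example, whether responses are rounded to the 0-4 scale), so the output is illustrative rather than a reproduction of the figure.

```python
import random
import statistics

def lower_fraction_mean(scores, fraction):
    """Mean of the lowest `fraction` of scores (the assumed respondents)."""
    ordered = sorted(scores)
    n_respond = max(1, round(len(ordered) * fraction))
    return statistics.fmean(ordered[:n_respond])

# Assumption: continuous Gaussian scores, not rounded or clipped to 0-4.
rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [rng.gauss(3.0, 1.0) for _ in range(30_000)]
full_mean = statistics.fmean(scores)

# Shift in the reported average when only the lower part of the class responds.
shifts = {
    fraction: full_mean - lower_fraction_mean(scores, fraction)
    for fraction in (0.75, 0.50)
}

print(f"mean with everyone responding: {full_mean:.2f}")
for fraction, shift in shifts.items():
    print(f"lower {fraction:.0%} respond: average is {shift:.2f} points lower")
```

Sorting and slicing keeps the model deliberately crude: every student above the cutoff is silent and every student below it responds. A softer model, in which the probability of responding falls gradually with satisfaction, would give smaller shifts.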

When the sample size is small, this situation becomes more stochastic. A total of 60 Monte Carlo simulations were conducted for a class of 20, again drawn from a normal distribution with mean=3 and standard deviation=1, as above. The upper panels in Figure 2 show a realization similar to the large sample case above. A more interesting case is shown in the lower panels of Figure 2, where the spread is relatively large. Even though there are only three very low scores (0 and 1), their impact on the average is high: to bring the average back to 3, one 0 needs three 4s to balance it out, and one 1 needs two 4s (there are no 5s or 6s). When the ‘motivation gap’ is taken into account, so that the potentially higher scores are not included, this impact is even more drastic. At a response rate of 75% from the lower scores, the average is 0.62 points lower than the already not-so-high value of 2.70. The average is 0.9 points lower for the 50% response rate.
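A small-class experiment of this kind can be sketched as follows. The mapping from Gaussian draws to the five answer choices is an assumption (each draw is rounded to the nearest integer and clipped to 0-4), as are the seed and the hypothetical helper names `simulate_class` and `respondent_mean`; the article does not describe how its simulations were implemented, so the numbers printed here will not match Figure 2.

```python
import random
import statistics

def simulate_class(rng, n_students=20, mean=3.0, sd=1.0):
    """Draw one class; round each draw to the nearest choice on the 0-4 scale."""
    # Assumption: rounding and clipping turn Gaussian draws into 0-4 ratings.
    return [min(4, max(0, round(rng.gauss(mean, sd)))) for _ in range(n_students)]

def respondent_mean(scores, fraction):
    """Average when only the lowest `fraction` of scores is submitted."""
    ordered = sorted(scores)
    n_respond = max(1, round(len(ordered) * fraction))
    return statistics.fmean(ordered[:n_respond])

rng = random.Random(42)  # fixed seed so the run is repeatable
classes = [simulate_class(rng) for _ in range(60)]  # 60 simulated classes of 20

# Drop in the class average under 75% and 50% response from the lower scores.
drops_75 = [statistics.fmean(c) - respondent_mean(c, 0.75) for c in classes]
drops_50 = [statistics.fmean(c) - respondent_mean(c, 0.50) for c in classes]

for label, drops in (("75%", drops_75), ("50%", drops_50)):
    print(f"{label} response: drop from {min(drops):.2f} to {max(drops):.2f}, "
          f"typical {statistics.fmean(drops):.2f}")
```

Comparing the smallest and largest drops across the 60 classes gives a sense of how much the outcome for a single small class depends on chance.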

This high impact of extremely negative values on the average drives the instructor to try hard not to ‘alienate’ the poor-performing students. It also makes it less likely that the instructor will put great effort into material tailored to the better students. This is a particularly important dilemma for classes that are designed for motivated majors but also open to a broader student body.

**Evolution towards a feel-good education model?** Unfortunately, as the only ‘quantitative’ measure, these numbers are increasingly being used as a key criterion in decision-making processes, including merit increases and promotion. In contrast, traditional tools such as evaluation from peers, input from exiting students, and assessment by the unit are deemed unreliable because they may come from the instructor’s ‘friends’ or the happier students. They also require more effort to evaluate, and are thus used less and less.

Gradually, this will steer our teaching style towards getting the best student evaluation, not what is best for our students. Do the two goals coincide?

America is facing an increasingly competitive international environment. Our graduates are finding themselves not equipped with the strong technical skills demanded by a challenging information age, especially in the STEM areas. Overemphasis on CourseEval will steer our faculty members to tailor their teaching towards making the less-competitive students happy by, for example, focusing on easier and feel-good material, as opposed to challenging them and preparing them for the real world.

Figure 1 Distributions of a hypothetical course evaluation, assuming a Gaussian distribution with mean 3.00 and standard deviation 1.00. The sample size (number of students) is 30,000. Potential biases due to a ‘motivation gap’ lead to an apparent mean lower by 0.55 if only the lower 75% of students respond, and by 0.80 if only the lower 50% of students respond.

Figure 2 Two hypothetical but realistic examples. The number of students in the class is 20, and each response is one of five choices (0-4, corresponding to ‘strongly disagree’ to ‘strongly agree’). The random samples are drawn from the same Gaussian distribution as in Fig. 1 (mean=3, standard deviation=1). Upper three panels: the spread is moderate and the biases due to the ‘motivation gap’ are similar to the large sample case in Fig. 1. Lower three panels: the spread is large and the biases are heavily influenced by the few low ratings.