A conversation between Jo Boaler and Keith Devlin

On May 23, 2019, Stanford Mathematics Education Professor Jo Boaler, the founder and director of youcubed, and I sat down before a public audience in Cubberley Auditorium on the Stanford campus to have a discussion about the nature of 21st Century mathematics and the changes it requires to the way mathematics is taught in our schools. The (edited) video of our conversation is now available on this website and the youcubed website. (See the Videos page on either site.) Produced by youcubed in conjunction with SUMOP. Run time 31min 28sec.

Why straight A’s may indicate poor learning – report from an unusual study

This post is the promised sequel to its predecessor, On making omelets and learning math.

So you got an A. What does that say about how well you are able to apply your new-found knowledge a month from now?

There’s plenty of research into learning (from psychology, cognitive science, neuroscience, and other disciplines) that explains why learning mathematics (more precisely, learning it well, so you can use it later on) is intrinsically difficult and frustrating. But for non-scientists in particular, no amount of theoretical discussion will have quite the same impact as the hard evidence from a big study, particularly one run the same way pharmaceutical companies test the effectiveness (and safety) of a new drug.

Unfortunately, studies of that nature are hard to come by in education—for the simple reason that, unlike pharmaceutical research, they are all but impossible to run in the field of learning.

But there is one such study. It was conducted a few years ago, not in K-12 schools, but at a rather unique, four-year college. That means you have to be cautious when it comes to drawing conclusions about K-12 learning. So bring your own caution. My guess is that, like me, when you read about the study and the results it produced, you will conclude they do apply to at least Grades 8-12. (I can’t say more than that because I have no experience with K-8, either first-hand or second.)

The benefit of conducting the study at this particular institution was that it allowed the researchers to conduct a randomized control study on a group of over 12,000 students over a continuous nine-year period starting with their first four years in the college. That’s very much like the large scale, multi-year studies that pharmaceutical companies run (indeed, are mandated to run) to determine the efficacy and safety of a new drug. It’s impossible to conduct such a study in most K-16 educational institutions, for a whole variety of reasons.

Classroom at the United States Air Force Academy in Colorado Springs, Colorado

For the record, I’ll tell you the name of that particular college at the outset. It’s the United States Air Force Academy (USAFA) in Colorado Springs, Colorado. Later in this article, I’ll give you a full overview of USAFA. As you will learn, in almost all respects, its academic profile is indistinguishable from most US four-year colleges. The three main differences (all of which are important for running a massive study of the kind I am talking about) are that (1) the curriculum is standard across all instructors and classes, (2) grading is standardized across all classes, and (3) students have to serve five years in the Air Force after graduation, during which time they are subject to further standardized monitoring and assessment. This framework provided the researchers with a substantial amount of reliable data to measure how effective the four years of classes were as preparation for the graduates’ first five years in their chosen specialization within the Air Force.

True, the students at USAFA are atypical in wanting a career in the military (though for some it is simply a way to secure a good education “at no financial cost”, and after their five years of service are up they leave and pursue a different career). In particular, they enter having decided what they want to do for the next nine years of their lives. That definitely needs to be taken into account when we interpret the results of the study in terms of other educational environments. I’ll discuss that in due course. As I said, bring your own caution. But do look at, and reflect on, the facts before jumping to any conclusion.

If that last (repeated) warning did not get your attention, the main research finding from the study surely will: Students who perform badly on course assignments and end-of-course evaluations turn out to have learned much better than students who sail through the course with straight A’s.

There is, as you might expect, a caveat. But only one. This is an “all else being equal” result. But it is a significant finding, from which all of us in the math instruction business can learn a lot.

As I noted already, conducting a study that can produce such an (initially surprising) result with any reliability is a difficult task. In fact, in a normal undergraduate institution, it’s impossible on several counts!

First obstacle: To see how effective a particular course has been, you need to see how well a student performs when they later face challenges for which the course experience is—or at least, should be—relevant. That’s so obvious, in theory it should not need to be stated. K-16 education is meant to prepare students for the rest of their lives, both professional and personal. How well they do on a test just after the course ends would be significant only if it correlated positively with how well they do later when faced with having to utilize what the course purportedly taught them. But, as the study shows, that is not the case; indeed the correlation is negative. 

The trouble is, for the most part, those of us in the education system usually have no way of being able to measure that later outcome. At most we can evaluate performance only until the student leaves the institution where we teach them. But even that is hard. So hard, that measuring learning from a course after the course has ended and the final exam has been graded is rarely attempted.

Certainly, at most schools, colleges, or universities, it’s just not remotely possible to set up a pharmaceutical-research-like, randomized, controlled study that follows classes of students for several years, all the time evaluating them in a standardized, systematic way. Even if the course learning outcomes being studied are from a first-year course at a four-year college, leaving the student three further years in the institution, students drop out, select different subsequent elective courses, or even change major tracks.

That problem is what made the USAFA study particularly significant. Conducted from 1997 to 2007, it followed 12,568 USAFA students. The researchers were Scott E. Carrell, of the Department of Economics at the University of California, Davis, and James E. West, of the Department of Economics and Geosciences at USAFA.

As I noted earlier, since USAFA is a fairly unique higher education institute, extrapolation of the study’s results to any other educational environment requires knowledge of what kind of institution it is.

USAFA is a fully accredited undergraduate institution of higher education with an approximate enrollment of 4,200 students. It offers 32 majors, including humanities, social sciences, basic sciences, and engineering. The average SAT for the 2005 entering class was 1309 with an average high school GPA of 3.60 (Princeton Review 2007). Applicants are selected for admission on the basis of academic, athletic, and leadership potential, and a nomination from a legal nominating authority. All students receive 100 percent scholarship to cover their tuition, room, and board. Additionally, each student receives a monthly stipend of $845 to cover books, uniforms, computer, and other living expenses. All students are required to graduate within four years, after which they must serve for five years as a commissioned officer in the Air Force.

Approximately 17% of the study sample was female, 5% was black, 7% Hispanic, and 5% Asian. 

Academic aptitude for entry to USAFA is measured through SAT verbal and SAT math scores and an academic composite that is a weighted average of an individual’s high school GPA, class rank, and the quality of the high school attended. All entering students take a mathematics placement exam upon matriculation, which tests algebra, trigonometry, and calculus. The sample mean SAT math and SAT verbal are 663 and 632, with respective standard deviations of 62 and 66. 

USAFA students are required to take a core set of approximately 30 courses in mathematics, basic sciences, social sciences, humanities, and engineering. Grades are determined on an A, A-, B+, B, …, C-, D, F scale, where an A is worth 4 grade points, an A- is 3.7 grade points, a B+ is 3.3 grade points, etc. The average GPA for the study sample was 2.78. Over the ten-year period of the study there were 13,417 separate course-sections taught by 1,462 different faculty members. Average class size was 18 students per class and approximately 49 sections of each core course were taught each year.
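To make the grade-point arithmetic concrete, here is a minimal Python sketch. Only the values for A, A- and B+ come from the description above; the remaining values are my assumption that the scale continues in the conventional 4.0 pattern, and the simple unweighted average ignores any credit-hour weighting the Academy may apply. The function name and the sample grades are purely illustrative.

```python
# Grade points for A, A- and B+ are stated in the text.
# The remaining values are ASSUMED to follow the conventional 4.0 scale.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3,
    "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7,
    "D": 1.0, "F": 0.0,
}


def gpa(grades):
    """Return the unweighted mean grade points for a list of letter grades."""
    if not grades:
        raise ValueError("need at least one grade")
    return sum(GRADE_POINTS[g] for g in grades) / len(grades)


# Hypothetical transcript, invented for illustration only.
example = ["B+", "B", "C+", "B-", "A-"]
print(round(gpa(example), 2))
```

On this assumed scale, the sample mean GPA of 2.78 sits a little above a B-.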

USAFA faculty, which are both military officers and civilian employees, have graduate degrees from a broad sample of high quality programs in their respective disciplines, similar to a comparable undergraduate liberal arts college. 

Clearly, in many respects, this reads like the academic profile of many American four-year colleges and universities. The main difference is the nature of the student body, where USAFA students enter with a specific career path in mind (at least for nine years), albeit a career path admitting a great many variations, perhaps also, in many cases, with a high degree of motivation. While that difference clearly has to be kept in mind when using the study’s results to make inferences for higher education as a whole, the research benefits of such an organization are significant, leading to results highly reliable for that institution.

First, there is the sheer size of the study population. So large, that there was no problem randomly assigning students to professors over a wide variety of standardized core courses. That random assignment of students to professors, together with substantial data on both professors and students, enabled the researchers to examine how professor quality affects student achievement, free from the usual problems of student self-selection. 

Moreover, grades in USAFA core courses are a consistent measure of student achievement because faculty members teaching the same course use an identical syllabus and give the same exams during a common testing period. 

Student grades in mathematics courses are particularly reliable measures. Math professors grade only a small proportion of their own students’ exams, which vastly reduces the ability of “easy” or “hard” grading professors to affect their students’ grades. Math exams are jointly graded by all professors teaching the course during that semester in “grading parties” where Professor A grades question 1 for all students, Professor B grades question 2 for all students, and so on. Additionally, all professors are given copies of the exams for the course prior to the start of the semester. All final grades in all core courses are determined on a single grading scale and are approved by the department chair. Student grades can thus be taken to reflect the manner in which the course is taught by each professor.
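The “grading party” arrangement can be sketched in a few lines of Python. This is only an illustration of the idea, not the Academy’s actual procedure: the round-robin allocation of questions, the function names, and the four-professor, eight-question example are all my assumptions.

```python
def assign_graders(professors, num_questions):
    """Assign each exam question to one professor, cycling through the list.

    Every student's answer to a given question is graded by the same
    professor, whichever section the student is in.
    """
    if not professors:
        raise ValueError("need at least one professor")
    return {
        question: professors[(question - 1) % len(professors)]
        for question in range(1, num_questions + 1)
    }


def share_graded_by(assignment, professor):
    """Fraction of the exam's questions graded by the given professor."""
    graded = sum(1 for grader in assignment.values() if grader == professor)
    return graded / len(assignment)


# Hypothetical course: four professors, an eight-question exam.
assignment = assign_graders(["A", "B", "C", "D"], 8)
print(assignment)
print(share_graded_by(assignment, "A"))
```

In this made-up case each professor grades a quarter of every student’s exam, so an unusually lenient or harsh grader touches only a quarter of their own students’ marks, and touches every other section’s marks to exactly the same extent.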

A further significant research benefit of conducting the study at USAFA is that students are required to take, and are randomly assigned to, numerous follow-on courses in mathematics, humanities, basic sciences, and engineering, so that performance in subsequent courses can be used to measure effectiveness of earlier ones—which, as we noted earlier, is a far more meaningful measure of (real) learning than weekly assignments or an end-of-term exam.

It is worth noting also that, even if a student has a particularly bad introductory course instructor, they still are required to take the follow-on related curriculum.

If you are like me, given that background information, you will take seriously the research results obtained from this study. At a cost of focusing on a special subset of students, the statistical results of the study will be far more reliable and meaningful than for most educational studies. Moreover, the study will be measuring the important, long term benefits of the course. So what are those results?

First, the researchers found there are relatively large and statistically significant differences in student achievement across professors in the contemporaneous course being taught. A one-standard-deviation increase in the professor fixed effect (a variable like age, sex, ethnicity, or qualifications, that is constant across individuals) results in a 0.08 to 0.21 standard-deviation increase in student achievement. 

Introductory course professors significantly affect student achievement in follow-on related courses, but these effects are quite heterogeneous across subjects.

But here is the first surprising result. Students of professors who as a group perform well in the initial mathematics course perform significantly worse in the (mandatory) follow-on related math, science, and engineering courses. For math and science courses, academic rank, teaching experience, and terminal degree status of professors are negatively correlated with contemporaneous student achievement, but positively related to follow-on course achievement. That is, students of less experienced instructors who do not possess terminal degrees perform better in the contemporaneous course being taught, but perform worse in the follow-on related courses. 

Presumably, less academically qualified instructors may spur (potentially unsustained) interest in a particular subject through higher grades, but those students perform significantly worse in follow-on related courses that rely on the initial course for content.  (Interesting side note: for humanities courses, the researchers found almost no relationship between professor observable attributes and student achievement.)

Turning our attention from instructors to students, the study found that students who struggle and frequently get low grades tend to do better than the seemingly “good” students, when you see how much they remember, and how well they can perform, months or even years later.

This is the result I discussed in the previous post. On the face of it, you might still find that result hard to believe. But it’s hard to ignore the result of a randomized control study of over 12,000 students over a period of nine years.

For me, the big take-home message from the study is the huge disparity between course grades produced at the time and assessment of learning obtained much later. The only defense of contemporaneous course grades I can think of is that in most instances they are the only metric that is obtainable. It would be a tolerable defense were it not for one thing. Insofar as there is any correlation between contemporaneous grades and subsequent ability to remember and make productive use of what was learned in the course, that correlation is negative.

It makes me wonder why we continue, not only to use end-of-course grades, but to frequently put great emphasis on them and treat them as if they were predictive of future performance. Continuous individual assessment of a student by a well trained teacher is surely far more reliable.

A realization that school and university grades are poor predictors of future performance is why many large corporations that employ highly skilled individuals increasingly tend to ignore academic grades and conduct their own evaluations of applicants.

On making omelets and learning math

As the old saying goes, “You can’t make an omelet without breaking eggs.” Similarly, you can’t learn math without bruising your ego. Learning math is inescapably difficult, frustrating, and painful, requiring high tolerance of failure. Good teachers have long known this, but the message has never managed to get through to students and parents (and it appears, many system administrators who evaluate students, teachers, and schools).

The parallel (between making omelets and learning math) plays out in the classroom in a manner that many students and parents would find shocking, were they aware of it. It’s this.

All other factors being equal, when you test how well students have mastered course material some months or even years after the course has ended, students who do well in courses, getting mostly A’s on assignments and exams, tend to perform worse than students who struggled and got more mediocre grades at the time.

Yes, you read that correctly, the struggling students tend to do better than the seemingly “good” students, when you see how much they remember, and how well they can perform, months or even years later.

There is a caveat. But only one. This is an “all other things being equal” result, and assumes in particular that both groups of students want to succeed and make an effort to do so. I’ll give you the lowdown on this finding in just a moment. (And I will describe one particular, highly convincing, empirical demonstration in a follow-up post.) For now, let’s take a look at the consequences.

Since the purpose of education is to prepare students for the rest of their lives, those long term effects are far more important educationally than how well the student does in the course. I stressed that word “educationally” to emphasize that I am focusing on what a student learns. The grade a student gets from a course simply measures performance during the course itself. 

If the course grade correlated positively with (long-term) learning, it would be a valuable measure. But as I just noted, although there is a correlation, it is negative.  This means that educators and parents should embrace and celebrate struggle and mediocre results, and avoid the false reassurance of progress that is so often the consequence of a stellar classroom performance. 

Again, let me stress that the underlying science is an “all other things being equal” result. Assuming that requirement is met, a good instructor should pace the course so that each student is struggling throughout, constantly having to spend time correcting mistakes.

The simple explanation for this (perhaps) counter-intuitive state of affairs is that our brains learn as a result of trying to make sense of something we find puzzling, or struggling to correct an error we have made. 

Getting straight A’s in a course may make us feel good, but we are actually not learning something by so doing; we are performing. 

Since many of us discover that, given sufficient repetitive practice, we can do well on course assignments and ace the final exam regardless of how well we really understand what we are doing, a far more meaningful measure of how well we have learned something is to test us on it some time later. Moreover, that later test should not just be a variant of the course final exam; rather we should be tested on how well we can make use of what we studied, either in a subsequent course or in applying that knowledge or those skills in some other domain.

It is when subjected to that kind of down-the-line assessment that the student who struggled tends to do better than the one who performed well during the course.

This is not just some theoretical idea, removed from reality. In particular, it has been demonstrated in a large, randomized control study conducted on over 12,000 students over a nine-year period.

The students were of traditional college age, at a four-year institution, and considerable effort was put in to ensuring that the all-important “all other things being equal” condition was met. I’ll tell you about the study and the institution where it was carried out in a follow-on post to this one. For now, let’s look at its implications for math teaching (for students of all ages).

To understand what is going on, we must look to other research on how people learn. This is a huge topic in its own right, with research contributions from several disciplines, including neurophysiology.

Incidentally, neurophysiologists do not find the negative-correlation result counter-intuitive. It’s what they would expect, based on what they have learned about how the brain works. 

To avoid this essay getting too long, I’ll provide an extremely brief summary of that research, oriented toward teaching. (I’ll come back to all these general learning issues in future posts. It’s not an area I have worked in, but I am familiar with the work of others who do.) 

Learning occurs when we get something wrong and have to correct it. This is analogous to the much better known fact that when we subject our bodies to physical strain, say by walking, jogging, or lifting weights, the muscles we strain become stronger—we gain greater fitness.

The neurophysiologists explain this by saying that understanding something or solving a problem we have been puzzling over, is a consequence of the brain forming new connections (synapses) between neurons. (Actually, it would be more accurate to say that understanding or solving actually is the creation of those new connections.) So we can think of learning as a process to stimulate the formation of new connections in our brain. (More accurately, we should think of learning as being the formation of those new connections.)

Exactly what leads to those new connections is not really known—indeed, some of us regard this entire neurons and synapses model of brain activity as, to some extent, a scientific metaphor. What is known is that it is far more likely to occur after a period in which the brain repeatedly tries to understand something or to solve the problem, and keeps failing. (This is analogous to the way muscles get stronger when we repeatedly subject them to strain, but in the case of muscles the mechanism is much better understood.) In other words, repeatedly trying and failing is an essential part of learning.

In contrast, repeatedly and consistently performing well strengthens existing neuronal connections, which means we get better at whatever it is we are doing, but that’s not learning. (It can, however, prepare the brain for further learning.) 

Based on these considerations, the most effective way to teach something in a way that will stick is to put students in a position of having to arrive at the best answer they can, without hints, even if it’s wrong. Then, after they have committed, you can correct, preferably with a hint (just one) to prompt them to rectify the error. Psychologists who have studied this refer to the approach as introducing “desirable difficulties.” Google it if you have not come across it before. The term itself is due to the Stanford psychologist Robert Bjork. 

For sure, the result of this approach makes students (and likely their parents and their instructor) feel uncomfortable, since the student does not appear to be making progress. In particular, if the instructor gauges it well, their assignment work and end-of-term test will be littered with errors. (Instructors should grade on the curve. I frequently set the pass mark around 30%, with a score of 60% or more correct getting an A, though in an ideal world I would have preferred to not be obliged to assign a letter grade, at least based purely on contemporaneous testing.)
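As a rough illustration of the thresholds just mentioned, here is a minimal Python sketch. The 30% pass mark and the 60% cut-off for an A are the figures quoted above; the function name, the parameter names, and the decision to lump everything between the two thresholds together as a plain “pass” are assumptions, since the intermediate letter grades are not specified.

```python
def curved_result(percent_correct, pass_mark=30.0, a_mark=60.0):
    """Classify a raw score (0-100) using the thresholds quoted in the text.

    Scores between the pass mark and the A mark are reported only as
    "pass"; how they would split into B, C, etc. is not specified.
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percent_correct must be between 0 and 100")
    if percent_correct >= a_mark:
        return "A"
    if percent_correct >= pass_mark:
        return "pass"
    return "fail"


# Hypothetical scores, invented for illustration only.
for score in (25, 30, 45, 60, 85):
    print(score, curved_result(score))
```

A student with 45% of the work correct, which would be a failing mark under a conventional scale, passes comfortably here. That is the point: the assignments are meant to be hard enough that errors are the norm.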

Of course, the students are not going to be happy about this, and their frustration with themselves is likely to be offloaded onto the instructor. But, for all that it may seem counterintuitive, they will walk away from that course with far better, more lasting, and more usable learning than if they had spent the time in a feelgood semester of shallow reinforcement that they were getting it all right. 

To sum up: Getting things right, with the well-deserved feeling of accomplishment it brings, is a wonderful thing to experience, and should be acknowledged and rewarded—when you are out in the world applying your learning to do things.  But getting everything right is counterproductive if the goal is meaningful, lasting learning. 

Learning is what happens by correcting what you got wrong. Indeed, the learning is better if the correction occurs some time after the error is made. Stewing for a while in frustration at being wrong, and not seeing how to fix it, turns out to be a good thing. 

So, if you are a student, and your instructor refuses to put you out of your misery, at least be aware that the instructor most likely is doing so because they want you to learn. Remember, you can’t learn to ride a bike or skateboard without bruising your knees and your elbows. And you can’t learn math (and various other skills) without bruising your ego. 

Cracking your ego is an unavoidable part of learning.

What topics should be covered in school mathematics?

On May 25, Jo Boaler and I had a public conversation at Stanford about K-12 mathematics. (An edited video recording will be made available on the youcubed and SUMOP websites as soon as it is ready.) Our conversation was live-tweeted by @AliceKeeler, and that led to a lively Twitter debate that lasted several days, with much of the focus on what topics should be taught.

Having been a professional mathematician for close on fifty years (first in pure mathematics, then in the world of business and government service), my take, which I articulated in my conversation with Jo and you can find in the Blue Notepad videos on this site, is somewhat unusual. I actually believe it is (in many ways, but not all) not important what is taught, but rather how it is taught.

My recent post in my monthly Devlin’s Angle for the Mathematical Association of America explains why I have that view. In a follow-up post next month, I will connect the argument I present this month with the discussion Jo and I had and the ensuing Twitter debate.

New video posted in the VIDEOS section

The interview was recorded in February last year, in Belgrade, Serbia, but was only recently published on YouTube. You can find it on the site’s VIDEOS page. The accompanying text translates more or less as follows:

“An exclusive interview given in February last year to the 12th edition of the Elements magazine, published by the Center for the Promotion of Science.
In a comprehensive interview with Aleksandar Ravas and Tijana Markovic, published in the 12th Element, Devlin spoke about mathematical thinking, mathematics education, predicting the future, the difference between false stories in mathematics and the reality, and many other interesting and socially engaged topics.”