Regardless of how you approach AI, start by informing yourself about what it can do.
For the last two years, I’ve been exploring how generative AI can be used in historical research and in our classrooms. One of the key insights I’ve gained is that most people don’t know enough about AI’s actual capabilities and limitations to make informed decisions about how to approach it in the classroom. A key assumption should be that things are changing so rapidly that it is impossible to infer current capabilities from knowledge that is more than a few weeks old. That is hard for most people to grasp in a discipline where a book can take years to go from manuscript to print. My goal here is to dispel some of the more pervasive myths about AI that are relevant to historians and to provide some ideas about how to approach AI in the history classroom.
Myth 1: AI Hallucinates Too Much to Be Useful
Any conversation about AI in the classroom needs to start with an uncomfortable acknowledgement: although it’s far from perfect, genAI is now more accurate and reliable than media reports and pundits would have you believe. It’s true that in winter 2022, when you asked the free version of ChatGPT for 20 peer-reviewed books and articles on Canada and the First World War (or any other historical topic), only four or five would be correct. The rest sounded plausible but were entirely fabricated.
Today, if you ask the same question of a cutting-edge model such as Anthropic’s Claude 3.5 Sonnet or OpenAI’s o1, it can routinely cite 20 out of 20 correctly in Chicago style. OpenAI’s new Deep Research agent (not to be confused with DeepSeek; the confusingly similar names are a real problem) can even locate those sources on the internet, read them, and cite them accurately in an essay. It’s scary good, and unless you are constantly following AI developments you might have no idea this is even possible. But your students do. We just aren’t used to things changing so dramatically so quickly.
Myth 2: AI Can’t Write or Reason at Human Levels
In my own testing, I’ve found that the newest models—specifically the reasoners like o1-pro and OpenAI’s Deep Research—are about as competent on tasks like document analysis, historical interpretation, and literature reviews as a good PhD candidate. That is a big claim, I know, and an enormous change from a couple years ago, but it accords with research that shows LLMs are now better than most humans, including domain experts, at a range of highly specialized tasks requiring reasoning.
Just look at this example historiographical paper I had OpenAI’s Deep Research produce, comparing and contrasting Canadian and American approaches to the fur trade. Although it is undeniably good, you might notice that many of the citations are to websites rather than to the underlying sources. While that is the case today, remember that it is only a matter of weeks or months before it can access those sources directly: OpenAI is planning to give its agentic Deep Research model access to paywalled academic sources as well as user-provided resources in the very near future.
Don’t take my word for it. OpenAI’s new o3 model scores 87% on a test designed by top experts in chemistry, biology, and physics, on which those same human experts score only 65% (or 74% when they are allowed to correct obvious mistakes). In another recent study, 50 human physicians asked to diagnose diseases from real-world clinical case studies were correct in only 76% of cases, while GPT-4 achieved 92%. Yes, that is far from perfect, but we also need to remember that humans make mistakes too, and apparently more of them than the best AI systems.
Myth 3: You’ll Know AI Writing When You See It
This is also why, no matter what you’ve heard, it’s impossible to accurately detect AI writing. Many studies now show that automated AI detectors do not work, especially on papers written by the latest models. While we might intuit that we can do better ourselves by spotting words like “delve” and “tapestry,” studies repeatedly show that human educators actually fare even worse: we only tend to get it right between about half and two-thirds of the time, which is not much better than chance. Overall, we may be missing up to 94% of AI writing. That doesn’t mean you won’t sometimes get it right, but you won’t know how many papers you are missing.
AI and the Classroom
This is why a lot of people have decided to ban AI, but I am not sure a ban is feasible. Companies like ProQuest and JSTOR are beginning to integrate LLMs into their products, while Adobe, Microsoft, and Google have already built them into Acrobat, Word, and Docs. Archives are also starting to use them in their digitization projects.
The point is that LLMs are everywhere and we can’t pretend that they’re not. Students are going to want to use them because they know they are useful. To tell them otherwise would be disingenuous when recent studies suggest more than 75% of people are already using them in jobs involving information processing and analysis.
What Embracing AI Looks Like in the History Classroom
We’re all still figuring this stuff out and that’s ok. Despite my own interest in the technology, I actually take a somewhat middle-ground approach to AI in my history classes.
For starters, I still assign research papers and document analyses, but I assume that students will use AI in some capacity. That can mean helping them to identify sources, to develop an outline, or to refine their thesis. I actually encourage students to “chat” with an LLM about their projects, and I provide prompts that prime the model to act as a tutor of sorts.
LLMs are great editors too. Students who have trouble expressing their ideas in clear, cogent prose can ask an LLM to edit their work, paragraph by paragraph. What most people don’t know is that if you ask an LLM to edit or condense something, it doesn’t typically insert new ideas or arguments; it just edits the prose. It will also explain why it made the changes that it did. To my mind, this is no different than encouraging students to go to the writing centre, and it mirrors the fact that many students already got a friend, parent, or tutor to perform the same function. Because the free version of ChatGPT is now very good at these tasks, AI actually levels the playing field, especially for ESL students or those with accessibility needs.
To me the critical point is to teach students to be responsible for the content of their work. Any incorrect citations, misquotes, or misunderstood evidence can result in a failing grade, for real. In fact, I think AI forces us to raise the bar: there’s simply no excuse for a range of things we used to tolerate. Proper AI use should do away with the errors above, while poorly crafted theses, unsupported arguments, and narrative papers without an argument should become a thing of the past. The same goes for grammatical and stylistic errors. In a very real sense, this mirrors the expectations of the coming AI workplace: our students will be expected to use AI to improve their work and will be held to a higher standard as a result. The new bar will, effectively, be a model’s baseline output on the same task. The same will undoubtedly be true of students in graduate programs and of professional historians.
While I embrace AI in many areas, I also think that knowledge of content, process, and methodology is essential. For obvious reasons, I think this is important for historians in general, but I also think it will be essential in an AI-enhanced world. To that end, I still have in-person, handwritten exams, even in the digital humanities courses I teach on generative AI specifically. I am a woodworker in my spare time, and I know that to use any machine tool you first have to understand what that tool automates and how the same operation could be completed without it. Otherwise, fatal conceptual problems can quickly arise, and I think the same is true of history and knowledge work in general.
For historians, factual content knowledge and recall are also, perhaps paradoxically, going to become more important in a future where we will need to rise above the AI bar. It’s not just about detecting AI hallucinations, but also about being able to distinguish “good” outputs from bad. You might not personally care about teaching workplace skills, and that’s probably okay: the AI workplace will require us to double down on the things historians already value. And all of this requires that we test students in an environment where we can assess their skills and knowledge separately from the electronic tools they might normally be expected to use.
Conclusion
The technology behind generative AI can be terrifying because it challenges a lot of the assumptions we make about the uniqueness of our expertise, as well as about the time and effort required to do what we do. But I think we should have the confidence to face these challenges head-on, because I also believe that human-written history still matters. The more I learn about AI, the more I’ve come to realize that the skills needed to use it effectively are actually the same ones we’ve always claimed to teach. Traditionally, we taught students to think critically about the past, to formulate clear research questions, to answer them with thorough, open-ended research, and to construct evidence-based arguments while striving to minimize bias. These skills are all going to become even more important in a world where AI can be used to speed up the process. Knowing what questions to ask, how to design a research strategy, and how to evaluate evidence and arguments will all still be critical whether you believe that AI will transform society or fizzle out into a peripheral annoyance.
_________________________
Mark Humphries has written on public health, the First World War, and the fur trade in Canadian history. He is a Professor in the Department of History at Wilfrid Laurier University, where he is currently working on applying AI to historical research. He writes about AI, research, and teaching on a Substack called Generative History: https://generativehistory.substack.com/.