The Evidence Test: How to Tell AI Education Promises From AI Education Hype
A vendor sends you a deck. The first slide promises that their new AI tool gives students personalized feedback, improves performance, and builds confidence. The pricing page is two clicks away.
You have a decision to make. Adopt it. Pilot it. Recommend it to a colleague. Build it into the curriculum.
Before any of that, you have a quieter decision to make: are the claims real, or are they marketing?
The Standard We Already Use, Just Not Here
In clinical practice, no one introduces a new intervention because it “seems useful” or “might help.” Without evidence that the benefit is real, meaningful, and durable, the intervention does not move into use. That is not bureaucracy. That is how we protect the people on the other end of the decision.
Health educators at Monash University in Australia made a sharp public argument this week that the same standard should apply to AI in education. Their point: education is also a clinical practice in its own way. We are shaping judgement, professional identity, accountability, and the way someone will eventually make decisions that affect real people. The risks are not less serious because they take longer to appear.
Their argument scales beyond health. It applies to every classroom, every professional development workshop, every corporate AI rollout, every “AI-powered learning” pitch hitting your inbox right now. The standard is simple to say and harder to hold: the bigger the claim, the stronger the evidence needs to be.
The Evidence Test: Four Questions Before You Adopt
Most AI education tools fail one or more of these four questions. The ones worth your time and budget pass all four. Run any pitch, any pilot, any “this changed everything” testimonial through this filter before you decide.
What exactly is the claim?
“Improves learning” is not a claim. It is a vibe. A real claim names the outcome, the population, and the time horizon. “Students score 18 percent higher on a transfer task two weeks after instruction” is a claim. “Teachers report saving 4 hours per week on grading without a drop in feedback quality” is a claim. “This tool builds AI literacy” is marketing copy. If you cannot rewrite a vendor’s claim as a measurable sentence, the vendor has not actually made one.
Does the evidence match the size of the claim?
A modest claim (“students engage more during AI-assisted brainstorming”) only needs modest evidence. A bold claim (“this tool replaces a tutor”) needs a controlled study with comparable groups, clear outcome measures, and replication. A short-term bump in test scores is not, on its own, evidence of lasting learning. A glowing pilot at one institution with motivated staff is not evidence of scalability. Ask: what would have to be true for the claim to fail? Has anyone actually tested that?
Does it build a skill or hide one?
This is the question almost nobody asks. An AI feedback tool that produces a sharper essay can do that two ways. It can teach the student to recognize and fix the problem next time, or it can fix the problem for them. Same output. Completely different long-term effect. A tool that builds a skill should make the learner measurably better when the tool is removed. If the metric only improves while the tool is on, you have a dependency, not a skill.
Who is the evidence FROM, and who is the evidence FOR?
A case study published by the company that sells the tool is marketing, not evidence. A peer-reviewed study with independent funding is evidence. A teacher Twitter thread is a useful signal but not proof. Equally important: who benefits if this works? Vendors care about adoption. Administrators care about throughput. Students care about whether they can do the thing later, when nobody is grading them. Those three success criteria often pull in different directions.
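If you want a record of how a pitch fared, the four questions can be written down as a simple checklist. The sketch below is one way to do that in Python; the class name, the field names, and the example answers are illustrative assumptions, not part of any published tool.

```python
from dataclasses import dataclass

# Hypothetical field names; each maps to one of the four questions above.
QUESTIONS = {
    "claim_is_measurable": "What exactly is the claim?",
    "evidence_matches_claim": "Does the evidence match the size of the claim?",
    "builds_skill_without_tool": "Does it build a skill or hide one?",
    "evidence_is_independent": "Who is the evidence from, and who is it for?",
}


@dataclass
class EvidenceCheck:
    """Your own yes/no judgement on each of the four questions."""
    claim_is_measurable: bool
    evidence_matches_claim: bool
    builds_skill_without_tool: bool
    evidence_is_independent: bool

    def failed_questions(self) -> list[str]:
        # Return the text of every question answered "no".
        return [text for name, text in QUESTIONS.items() if not getattr(self, name)]

    def passes(self) -> bool:
        # A tool passes only if no question failed.
        return not self.failed_questions()


# Example: a pitch with a measurable claim but only a vendor-run pilot behind it.
pitch = EvidenceCheck(
    claim_is_measurable=True,
    evidence_matches_claim=False,
    builds_skill_without_tool=False,
    evidence_is_independent=False,
)
print(pitch.passes())
for question in pitch.failed_questions():
    print("Failed:", question)
```

The code does not judge anything for you. The yes/no answers are still your call; the checklist only makes it harder to skip a question.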
A Worked Example: The “AI Tutor” Pitch
Imagine you get pitched on an “AI tutor” that promises to give your students personalized instruction at scale. Run it through the test:
Question 1. What is the actual claim? “Personalized” and “tutor” are loaded words. Does the tool adapt the difficulty of practice problems based on the student’s last 10 answers, or does it just rephrase the same content in a chattier tone? Ask for the specifics. If the answer is “it talks to them,” that is a chatbot, not a tutor.
Question 2. Does the evidence match? A controlled study comparing students who use the tutor for a semester against a matched cohort that does not, measured on a transfer task three months later, is the kind of evidence that matches a bold tutoring claim. A pilot with 30 self-selected enthusiasts at one campus is not.
Question 3. Skill or shortcut? After a semester with the AI tutor, do students solve harder problems on their own, without it, better than students who studied the same content unassisted? That is the question. If no one has measured this, the tool may be teaching reliance, not skill.
Question 4. Source check. Who funded the study? Who designed the outcome measure? Who has a stake in the result? An independent replication is the only piece of evidence that survives a vendor’s marketing department.
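The comparison in Question 3 comes down to simple arithmetic once both groups have taken the same test without the tool. The Python sketch below shows that arithmetic; the function name and the scores are invented for illustration and do not come from any study.

```python
from statistics import mean


def unassisted_gain(tool_group_scores, comparison_group_scores):
    """Mean score of the tool group minus the comparison group,
    both measured on a test taken WITHOUT the tool."""
    return mean(tool_group_scores) - mean(comparison_group_scores)


# Made-up scores on a harder, unassisted problem set.
tool_group = [62, 70, 58, 66]        # used the AI tutor all semester
comparison_group = [64, 71, 60, 69]  # studied the same content unassisted

gain = unassisted_gain(tool_group, comparison_group)
print(gain)  # negative here: the tool group did worse once the tool was removed
```

A raw difference in means is only a starting point. With groups this small it says nothing about statistical significance, and it assumes the two groups were comparable to begin with.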
The Evidence Test is not about being negative. It is about being a fair judge of a serious technology.
Some AI education tools will pass all four questions. Those are the ones worth your time, your budget, and your students. The rest will fail at least one. The point is not to reject AI in education. The point is to stop accepting marketing copy as proof of value, and to demand the same standard for the next generation’s education that we already demand for their healthcare.
The SeedStacking View
SeedStacking has always been about building AI fluency through small, evidence-tested wins that compound over time. The Evidence Test is what that looks like at the point of decision. Every tool you adopt for yourself, your students, or your team passes the same four questions before it earns a place in the stack.
This is also the difference between a literacy practice and a hype practice. A literacy practice asks “what is true here, and how would I know?” A hype practice asks “what is new here, and how fast can I have it?” One builds capability. The other builds a graveyard of half-deployed pilots and disillusioned faculty.
Plant Ideas. Cultivate Skills. Harvest Results.
Professor Dean | Founder, Harvest Kernel
