5 Lessons Learned Building AI-Powered Assessments by John Swope

This blog post is excerpted from a longer talk given to the Open edX® Educators Working Group. The full recording, transcript, and summary can be found on the wiki page.


While diving deep into AI-powered assessments, especially within open online courses, we have distilled our most impactful early learnings into five key lessons that apply broadly to AI assessments. This blog post shares those insights with you.

1. AI Costs Are Highly Variable

One of the biggest misconceptions is that AI is either inherently expensive or inherently cheap. In reality, costs are highly variable and depend on the specific use case. For example, generating a single multiple-choice question with GPT-3.5 costs about 0.07 cents (that’s $0.0007), whereas using GPT-4 Turbo for more complex tasks can be significantly more expensive (in our tests, a single complex MCQ cost about 6 cents). Estimating costs in advance is crucial to avoid surprises.

The cost of generating a single multiple-choice question can vary by roughly 100x depending on the prompting approach and the model used.
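Back-of-the-envelope cost estimates like the ones above are easy to automate. The sketch below is illustrative only: the token counts and per-1K-token prices are assumptions, not real provider pricing, which changes frequently and should be checked against your provider's current rate card.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Return the dollar cost of one model call, given token counts
    and per-1K-token prices (all values are illustrative assumptions)."""
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (completion_tokens / 1000) * output_price_per_1k

# A short prompt on a cheap model vs. a long, complex prompt on a
# premium model. Token counts and prices are hypothetical.
simple_mcq = estimate_cost(300, 200, 0.0005, 0.0015)    # fractions of a cent
complex_mcq = estimate_cost(3000, 1500, 0.01, 0.03)     # several cents

print(f"simple MCQ:  ${simple_mcq:.4f}")
print(f"complex MCQ: ${complex_mcq:.4f}")
```

Even with made-up numbers, the same structure (tokens in, tokens out, price per token) reproduces the two-orders-of-magnitude spread described above.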

2. AI Does Some Things Well…and Some Things Poorly

AI excels at providing timely and relevant feedback, especially in writing exercises. Studies have shown that AI-generated feedback can be quite similar in quality to human feedback. However, AI struggles with tasks requiring high creativity and varied responses, such as generating multiple unique medical case studies. Understanding these strengths and weaknesses can help educators use AI more effectively.

3. AI Scoring is Limited

AI can be a useful tool for scoring, but it’s not yet reliable enough for grading. In small-scale experiments, GPT-4 Turbo was able to match human scoring on highly measurable rubrics, but it still made errors, usually being more generous than human scorers when it erred. AI scoring can be used to determine whether a student’s response meets a basic level of effort or correctness, but it should not be solely relied upon for final grades.

In small-scale experiments, GPT-4 Turbo did well scoring very specific, measurable rubrics. When AI scores diverged from human scores, the errors skewed more positive than human scoring.
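One way to use AI scoring within these limits is as a pass/fail gate on basic effort rather than as a final grade. A minimal sketch follows; `ai_score` is a hypothetical placeholder for a real rubric-based model call, stubbed here with a trivial length heuristic so the example is self-contained.

```python
def ai_score(response: str) -> int:
    """Placeholder for a real model call that scores a response 0-5
    against a rubric. Stubbed with a word-count heuristic for illustration."""
    return min(5, len(response.split()) // 10)

def meets_effort_bar(response: str, threshold: int = 2) -> bool:
    """Use the AI score only as a basic effort/correctness check.
    Final grades still come from a human reviewer."""
    return ai_score(response) >= threshold
```

The design point is the narrow contract: the AI's output is reduced to a boolean that gates submission, so an overly generous score costs little, while human judgment remains the source of the actual grade.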

4. Prompt Sequencing Is the Easiest Reliable Way to Get Better Outputs

When working with AI, how you structure your prompts greatly affects the quality of the output. Long, complex prompts can produce inadequate responses because the AI has to prioritize certain parts over others. This is especially true when we constrain the output format (through prompting) or the token budget (e.g., to control costs). Breaking tasks into smaller, specific prompts (prompt sequencing) can improve accuracy and consistency, ensuring the AI addresses each part of the task thoroughly.

5. AI Changes Really Fast

AI technology is evolving at a breakneck speed. New models and capabilities are being introduced constantly, making it challenging to keep up. Educators and technologists need to stay informed and be ready to adapt their tools and methods. Stable, long-term solutions are hard to come by in such a fast-moving field, and ongoing maintenance and oversight are necessary to leverage AI effectively.

Many AI models have been released in the last year alone, which means keeping up with the models that work best for our assessments requires ongoing maintenance. (Image Credit: Maxime Labonne)


AI-powered assessments offer incredible potential for enhancing online education. By understanding the costs, leveraging AI’s strengths, and being mindful of its limitations, educators can create more effective and engaging learning experiences. As AI continues to evolve, staying informed and adaptable will be key to making the most of these powerful tools.

