
How We Built AI-Powered Quiz Generation

A behind-the-scenes look at using LLMs to generate quiz questions - from prompt engineering to answer validation and quality filtering.

Bobby Iliev · 2026-04-08 · 7 min read

The Problem with Manual Question Writing

Creating high-quality quiz questions takes time. A subject matter expert can write about 10-15 solid questions per hour, and that is before review and editing. When your platform needs thousands of questions across dozens of topics, manual authoring becomes the bottleneck.

We set out to use LLMs to generate quiz questions at scale while maintaining the quality bar our users expect. This post covers the architecture, prompt engineering, validation pipeline, and lessons learned from building our AI quiz generation system.

Architecture Overview

The system has four stages:

  1. Generation - An LLM produces candidate questions based on topic and difficulty
  2. Validation - Automated checks verify answer correctness and question structure
  3. Quality filtering - A scoring model rates questions on clarity, difficulty accuracy, and educational value
  4. Human review - High-scoring questions go to a review queue for final approval

```
Topic + Difficulty
       |
       v
  [LLM Generation]
       |
       v
  [Structural Validation]
       |
       v
  [Answer Verification]
       |
       v
  [Quality Scoring]
       |
       v
  [Human Review Queue]
       |
       v
  Published Question
```

Prompt Engineering

The prompt design went through dozens of iterations. The final version uses structured output with explicit constraints.

Here is the core generation prompt:

```typescript
function buildGenerationPrompt(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  count: number
): string {
  return `You are a technical quiz question writer. Generate ${count} multiple-choice questions about ${topic} at ${difficulty} difficulty.

Requirements for each question:
- The stem must test a single, specific concept
- Exactly 4 answer options
- Exactly 1 correct answer
- Each incorrect answer (distractor) must represent a plausible misconception
- Include a concise explanation (under 80 words) that explains why the correct answer is right and addresses the most common distractor
- For code questions, include a runnable code snippet in the stem

Difficulty calibration:
- easy: tests recall and basic understanding
- medium: tests application to specific scenarios
- hard: tests edge cases, subtle distinctions, or multi-step reasoning

Return a JSON array with this exact structure:
[
  {
    "text": "question text with optional code block",
    "difficulty": "${difficulty}",
    "topic": "${topic}",
    "answers": [
      { "text": "answer text", "isCorrect": true },
      { "text": "answer text", "isCorrect": false },
      { "text": "answer text", "isCorrect": false },
      { "text": "answer text", "isCorrect": false }
    ],
    "explanation": "why the correct answer is correct"
  }
]

Do not include opinions, trick questions, or questions with ambiguous correct answers.`;
}
```

Key decisions in the prompt:

  • Explicit structure requirements prevent format variations that break parsing
  • Difficulty calibration guidance reduces the LLM's tendency to generate medium-difficulty questions regardless of the requested level
  • Distractor requirement forces the model to think about misconceptions rather than generating obviously wrong answers

Calling the LLM

We prompt for a fixed JSON shape and extract the array from the model's reply:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

interface GeneratedQuestion {
  text: string;
  difficulty: string;
  topic: string;
  answers: Array<{ text: string; isCorrect: boolean }>;
  explanation: string;
}

async function generateQuestions(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  count: number
): Promise<GeneratedQuestion[]> {
  const prompt = buildGenerationPrompt(topic, difficulty, count);

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });

  const content = response.content[0];
  if (content.type !== "text") {
    throw new Error("Unexpected response type");
  }

  // Extract JSON from the response
  const jsonMatch = content.text.match(/\[[\s\S]*\]/);
  if (!jsonMatch) {
    throw new Error("No JSON array found in response");
  }

  return JSON.parse(jsonMatch[0]);
}
```

Structural Validation

Before checking content quality, validate the basic structure:

```typescript
import { z } from "zod";

const QuestionSchema = z.object({
  text: z.string().min(20).max(2000),
  difficulty: z.enum(["easy", "medium", "hard"]),
  topic: z.string().min(1),
  answers: z
    .array(
      z.object({
        text: z.string().min(1).max(500),
        isCorrect: z.boolean(),
      })
    )
    .length(4)
    .refine(
      (answers) => answers.filter((a) => a.isCorrect).length === 1,
      "Exactly one answer must be correct"
    ),
  explanation: z.string().min(10).max(500),
});

function validateStructure(question: unknown): {
  valid: boolean;
  errors: string[];
} {
  const result = QuestionSchema.safeParse(question);

  if (result.success) {
    return { valid: true, errors: [] };
  }

  return {
    valid: false,
    errors: result.error.errors.map((e) => `${e.path.join(".")}: ${e.message}`),
  };
}
```

Answer Verification

The trickiest part is verifying that the marked correct answer is actually correct. For code questions, we can run the code. For conceptual questions, we use a separate LLM call as a verifier:

```typescript
async function verifyAnswer(question: GeneratedQuestion): Promise<{
  verified: boolean;
  confidence: number;
  issue: string | null;
}> {
  // For code output questions, try to execute
  if (question.text.includes("```") && question.text.includes("output")) {
    return verifyCodeQuestion(question);
  }

  // For conceptual questions, use a verification prompt
  const verificationPrompt = `Evaluate this quiz question for correctness.

Question: ${question.text}

Marked as correct: ${question.answers.find((a) => a.isCorrect)?.text}

Other options:
${question.answers
  .filter((a) => !a.isCorrect)
  .map((a) => `- ${a.text}`)
  .join("\n")}

Respond with JSON:
{
  "correctAnswerIsRight": true/false,
  "confidence": 0.0-1.0,
  "issue": "description of any problem, or null"
}`;

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 500,
    messages: [{ role: "user", content: verificationPrompt }],
  });

  const content = response.content[0];
  if (content.type !== "text") {
    return { verified: false, confidence: 0, issue: "Unexpected response" };
  }

  const result = JSON.parse(content.text.match(/\{[\s\S]*\}/)?.[0] ?? "{}");

  return {
    verified: result.correctAnswerIsRight === true,
    confidence: result.confidence ?? 0,
    issue: result.issue ?? null,
  };
}
```
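The `verifyCodeQuestion` helper is referenced above but not shown. As a rough sketch of what its front half might look like (our illustration, not the production implementation; `extractSnippet` and `sketchVerifyCodeQuestion` are hypothetical names), the first step is pulling the fenced snippet out of the stem so it can be executed separately:

```typescript
// Sketch only: extract the first fenced code block from a question stem
// so it can be run in a sandbox. The real verifyCodeQuestion is not
// shown in this post; names here are illustrative.
function extractSnippet(stem: string): string | null {
  const match = stem.match(/```(?:\w+)?\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

function sketchVerifyCodeQuestion(question: {
  text: string;
  answers: Array<{ text: string; isCorrect: boolean }>;
}): { verified: boolean; confidence: number; issue: string | null } {
  const snippet = extractSnippet(question.text);
  if (!snippet) {
    return { verified: false, confidence: 0, issue: "No code snippet found in stem" };
  }
  const correct = question.answers.find((a) => a.isCorrect);
  if (!correct) {
    return { verified: false, confidence: 0, issue: "No answer marked correct" };
  }
  // In production, `snippet` would be run in a sandbox (for example a
  // subprocess with a timeout) and its actual output compared against
  // `correct.text` before marking the question verified.
  return { verified: true, confidence: 0.5, issue: null };
}
```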

We found that using a different model or temperature for verification catches more errors than self-verification with the same settings.
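That separation can be enforced in code. This is a sketch under our own assumptions: the model alias and fallback temperatures are illustrative choices for this post, not what the production system uses.

```typescript
// Sketch: pick verifier settings that differ from the generator's.
// The model ID below is a placeholder choice, not an endorsement.
interface ModelSettings {
  model: string;
  temperature: number;
}

function buildVerifierSettings(generator: ModelSettings): ModelSettings {
  // Prefer a different model entirely; fall back to the same model at a
  // different temperature so the verifier never mirrors the generator's
  // exact sampling behavior.
  const alternate = "claude-3-5-haiku-latest"; // hypothetical choice
  if (generator.model !== alternate) {
    return { model: alternate, temperature: 0 };
  }
  return {
    model: generator.model,
    temperature: generator.temperature === 0 ? 0.7 : 0,
  };
}
```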

Quality Scoring

Each question gets a quality score from 0-100 based on multiple criteria:

```typescript
interface QualityScore {
  total: number;
  clarity: number;
  distractorQuality: number;
  difficultyAccuracy: number;
  explanationQuality: number;
}

function scoreQuestion(question: GeneratedQuestion): QualityScore {
  let clarity = 25;
  let distractorQuality = 25;
  let difficultyAccuracy = 25;
  let explanationQuality = 25;

  // Clarity checks
  if (question.text.length < 30) clarity -= 10;
  if (question.text.includes("which of the following")) clarity -= 5;
  if (/\b(not|never|none)\b/i.test(question.text)) clarity -= 5;

  // Distractor quality - answers should be similar length
  const lengths = question.answers.map((a) => a.text.length);
  const avgLength = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const lengthVariance =
    lengths.reduce((sum, l) => sum + Math.pow(l - avgLength, 2), 0) /
    lengths.length;
  if (lengthVariance > avgLength * 2) distractorQuality -= 10;

  // Check for "all of the above" or "none of the above"
  const hasMetaAnswer = question.answers.some((a) =>
    /all of the above|none of the above/i.test(a.text)
  );
  if (hasMetaAnswer) distractorQuality -= 15;

  // Explanation quality
  if (question.explanation.length < 30) explanationQuality -= 10;
  if (!question.explanation.toLowerCase().includes("because")) {
    explanationQuality -= 5;
  }

  const total =
    clarity + distractorQuality + difficultyAccuracy + explanationQuality;

  return { total, clarity, distractorQuality, difficultyAccuracy, explanationQuality };
}
```

Questions scoring above 75 go to the human review queue. Below 75, they are either regenerated or discarded.
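That routing can be sketched as a small triage function (the names and retry budget here are ours for illustration, not from the production code):

```typescript
// Sketch: route a scored question based on the 75-point cutoff.
// Below the cutoff, a question is regenerated until a retry budget
// runs out, then discarded.
type Triage = "review" | "regenerate" | "discard";

function triageQuestion(
  score: number,
  attempts: number,
  maxAttempts = 3
): Triage {
  if (score >= 75) return "review";
  return attempts < maxAttempts ? "regenerate" : "discard";
}
```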

The Full Pipeline

Tying everything together:

```typescript
async function generateAndValidate(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  targetCount: number
): Promise<GeneratedQuestion[]> {
  const batchSize = Math.ceil(targetCount * 1.5); // Over-generate to account for filtering
  const candidates = await generateQuestions(topic, difficulty, batchSize);

  const approved: GeneratedQuestion[] = [];

  for (const question of candidates) {
    // Step 1: Structural validation
    const structure = validateStructure(question);
    if (!structure.valid) continue;

    // Step 2: Answer verification
    const verification = await verifyAnswer(question);
    if (!verification.verified || verification.confidence < 0.85) continue;

    // Step 3: Quality scoring
    const quality = scoreQuestion(question);
    if (quality.total < 75) continue;

    approved.push(question);

    if (approved.length >= targetCount) break;
  }

  return approved;
}
```
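Note that the loop awaits each `verifyAnswer` call in turn. If throughput matters, a small concurrency helper can run verification in parallel without flooding the API. This is a generic sketch of ours, not part of the original pipeline:

```typescript
// Sketch: map over items with at most `limit` promises in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index. JS is
  // single-threaded, so the claim (next++) is race-free.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With this in place, the candidate loop could verify, say, four questions at a time and apply the score filter afterward.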

Lessons Learned

Over-generate and filter. Generating 50% more questions than needed and filtering down is cheaper and faster than regenerating failed questions.

Separate generation from verification. Using the same model and prompt for both generation and verification leads to the model agreeing with itself. A separate verification step catches real errors.

Difficulty is subjective. LLMs tend to generate medium-difficulty questions regardless of the prompt. Explicit examples of each difficulty level in the prompt helped, but human calibration is still needed for hard questions.
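Those exemplars can be appended to the generation prompt as a fixed block. The sample stems below are illustrative ones written for this post, not the calibration set we actually use:

```typescript
// Sketch: one worked exemplar per difficulty level, appended to the
// generation prompt so the model anchors on concrete calibration.
const DIFFICULTY_EXEMPLARS: Record<"easy" | "medium" | "hard", string> = {
  easy: "What does the typeof operator return for an array in JavaScript?",
  medium:
    "A fetch call resolves but response.ok is false. Which status range does that imply?",
  hard: "In what order do microtasks and setTimeout callbacks run after a resolved promise chain?",
};

function appendExemplars(prompt: string): string {
  const block = (
    Object.keys(DIFFICULTY_EXEMPLARS) as Array<keyof typeof DIFFICULTY_EXEMPLARS>
  )
    .map((level) => `${level} example: ${DIFFICULTY_EXEMPLARS[level]}`)
    .join("\n");
  return `${prompt}\n\nExample stems per difficulty:\n${block}`;
}
```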

Code questions need execution. LLMs make subtle mistakes in code output predictions. For any question that asks "What does this code output?", actually running the code is the only reliable verification.

Summary

AI-powered quiz generation works when you treat the LLM as a draft writer, not a finished product. The validation pipeline - structural checks, answer verification, quality scoring, and human review - is what turns raw LLM output into publishable questions. We generate questions 10x faster than manual authoring while maintaining our quality standard.
