
How We Built AI-Powered Quiz Generation

A behind-the-scenes look at using LLMs to generate quiz questions - from prompt engineering to answer validation and quality filtering.

Bobby Iliev · 2026-04-08 · 7 min read

The Problem with Manual Question Writing

Creating high-quality quiz questions takes time. A subject matter expert can write about 10-15 solid questions per hour, and that is before review and editing. When your platform needs thousands of questions across dozens of topics, manual authoring becomes the bottleneck.

We set out to use LLMs to generate quiz questions at scale while maintaining the quality bar our users expect. This post covers the architecture, prompt engineering, validation pipeline, and lessons learned from building our AI quiz generation system.

Architecture Overview

The system has four stages:

  1. Generation - An LLM produces candidate questions based on topic and difficulty
  2. Validation - Automated checks verify answer correctness and question structure
  3. Quality filtering - A scoring model rates questions on clarity, difficulty accuracy, and educational value
  4. Human review - High-scoring questions go to a review queue for final approval

```
Topic + Difficulty
       |
       v
  [LLM Generation]
       |
       v
  [Structural Validation]
       |
       v
  [Answer Verification]
       |
       v
  [Quality Scoring]
       |
       v
  [Human Review Queue]
       |
       v
  Published Question
```

Prompt Engineering

The prompt design went through dozens of iterations. The final version uses structured output with explicit constraints.

Here is the core generation prompt:

```typescript
function buildGenerationPrompt(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  count: number
): string {
  return `You are a technical quiz question writer. Generate ${count} multiple-choice questions about ${topic} at ${difficulty} difficulty.

Requirements for each question:
- The stem must test a single, specific concept
- Exactly 4 answer options
- Exactly 1 correct answer
- Each incorrect answer (distractor) must represent a plausible misconception
- Include a concise explanation (under 80 words) that explains why the correct answer is right and addresses the most common distractor
- For code questions, include a runnable code snippet in the stem

Difficulty calibration:
- easy: tests recall and basic understanding
- medium: tests application to specific scenarios
- hard: tests edge cases, subtle distinctions, or multi-step reasoning

Return a JSON array with this exact structure:
[
  {
    "text": "question text with optional code block",
    "difficulty": "${difficulty}",
    "topic": "${topic}",
    "answers": [
      { "text": "answer text", "isCorrect": true },
      { "text": "answer text", "isCorrect": false },
      { "text": "answer text", "isCorrect": false },
      { "text": "answer text", "isCorrect": false }
    ],
    "explanation": "why the correct answer is correct"
  }
]

Do not include opinions, trick questions, or questions with ambiguous correct answers.`;
}
```

Key decisions in the prompt:

  • Explicit structure requirements prevent format variations that break parsing
  • Difficulty calibration guidance reduces the LLM's tendency to generate medium-difficulty questions regardless of the requested level
  • Distractor requirement forces the model to think about misconceptions rather than generating obviously wrong answers

Calling the LLM

We prompt for a fixed JSON shape and extract the array from the model's reply:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

interface GeneratedQuestion {
  text: string;
  difficulty: string;
  topic: string;
  answers: Array<{ text: string; isCorrect: boolean }>;
  explanation: string;
}

async function generateQuestions(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  count: number
): Promise<GeneratedQuestion[]> {
  const prompt = buildGenerationPrompt(topic, difficulty, count);

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [{ role: "user", content: prompt }],
  });

  const content = response.content[0];
  if (content.type !== "text") {
    throw new Error("Unexpected response type");
  }

  // Extract JSON from the response
  const jsonMatch = content.text.match(/\[[\s\S]*\]/);
  if (!jsonMatch) {
    throw new Error("No JSON array found in response");
  }

  return JSON.parse(jsonMatch[0]);
}
```

Structural Validation

Before checking content quality, validate the basic structure:

```typescript
import { z } from "zod";

const QuestionSchema = z.object({
  text: z.string().min(20).max(2000),
  difficulty: z.enum(["easy", "medium", "hard"]),
  topic: z.string().min(1),
  answers: z
    .array(
      z.object({
        text: z.string().min(1).max(500),
        isCorrect: z.boolean(),
      })
    )
    .length(4)
    .refine(
      (answers) => answers.filter((a) => a.isCorrect).length === 1,
      "Exactly one answer must be correct"
    ),
  explanation: z.string().min(10).max(500),
});

function validateStructure(question: unknown): {
  valid: boolean;
  errors: string[];
} {
  const result = QuestionSchema.safeParse(question);

  if (result.success) {
    return { valid: true, errors: [] };
  }

  return {
    valid: false,
    errors: result.error.errors.map((e) => `${e.path.join(".")}: ${e.message}`),
  };
}
```

Answer Verification

The trickiest part is verifying that the marked correct answer is actually correct. For code questions, we can run the code. For conceptual questions, we use a separate LLM call as a verifier:

```typescript
async function verifyAnswer(question: GeneratedQuestion): Promise<{
  verified: boolean;
  confidence: number;
  issue: string | null;
}> {
  // For code output questions, try to execute
  if (question.text.includes("```") && question.text.includes("output")) {
    return verifyCodeQuestion(question);
  }

  // For conceptual questions, use a verification prompt
  const verificationPrompt = `Evaluate this quiz question for correctness.

Question: ${question.text}

Marked as correct: ${question.answers.find((a) => a.isCorrect)?.text}

Other options:
${question.answers
  .filter((a) => !a.isCorrect)
  .map((a) => `- ${a.text}`)
  .join("\n")}

Respond with JSON:
{
  "correctAnswerIsRight": true/false,
  "confidence": 0.0-1.0,
  "issue": "description of any problem, or null"
}`;

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 500,
    messages: [{ role: "user", content: verificationPrompt }],
  });

  const content = response.content[0];
  if (content.type !== "text") {
    return { verified: false, confidence: 0, issue: "Unexpected response" };
  }

  const result = JSON.parse(content.text.match(/\{[\s\S]*\}/)?.[0] ?? "{}");

  return {
    verified: result.correctAnswerIsRight === true,
    confidence: result.confidence ?? 0,
    issue: result.issue ?? null,
  };
}
```
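The `verifyCodeQuestion` helper is referenced above but not shown. As a rough sketch of what its front half might look like (our illustration, not the production implementation; `extractSnippet` and `sketchVerifyCodeQuestion` are hypothetical names), the first step is pulling the fenced snippet out of the stem so it can be executed separately:

```typescript
// Sketch only: extract the first fenced code block from a question stem
// so it can be run in a sandbox. The real verifyCodeQuestion is not
// shown in this post; names here are illustrative.
function extractSnippet(stem: string): string | null {
  const match = stem.match(/```(?:\w+)?\n([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}

function sketchVerifyCodeQuestion(question: {
  text: string;
  answers: Array<{ text: string; isCorrect: boolean }>;
}): { verified: boolean; confidence: number; issue: string | null } {
  const snippet = extractSnippet(question.text);
  if (!snippet) {
    return { verified: false, confidence: 0, issue: "No code snippet found in stem" };
  }
  const correct = question.answers.find((a) => a.isCorrect);
  if (!correct) {
    return { verified: false, confidence: 0, issue: "No answer marked correct" };
  }
  // In production, `snippet` would be run in a sandbox (for example a
  // subprocess with a timeout) and its actual output compared against
  // `correct.text` before marking the question verified.
  return { verified: true, confidence: 0.5, issue: null };
}
```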

We found that using a different model or temperature for verification catches more errors than self-verification with the same settings.
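That separation can be enforced in code. This is a sketch under our own assumptions: the model alias and fallback temperatures are illustrative choices for this post, not what the production system uses.

```typescript
// Sketch: pick verifier settings that differ from the generator's.
// The model ID below is a placeholder choice, not an endorsement.
interface ModelSettings {
  model: string;
  temperature: number;
}

function buildVerifierSettings(generator: ModelSettings): ModelSettings {
  // Prefer a different model entirely; fall back to the same model at a
  // different temperature so the verifier never mirrors the generator's
  // exact sampling behavior.
  const alternate = "claude-3-5-haiku-latest"; // hypothetical choice
  if (generator.model !== alternate) {
    return { model: alternate, temperature: 0 };
  }
  return {
    model: generator.model,
    temperature: generator.temperature === 0 ? 0.7 : 0,
  };
}
```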

Quality Scoring

Each question gets a quality score from 0-100 based on multiple criteria:

```typescript
interface QualityScore {
  total: number;
  clarity: number;
  distractorQuality: number;
  difficultyAccuracy: number;
  explanationQuality: number;
}

function scoreQuestion(question: GeneratedQuestion): QualityScore {
  let clarity = 25;
  let distractorQuality = 25;
  let difficultyAccuracy = 25;
  let explanationQuality = 25;

  // Clarity checks
  if (question.text.length < 30) clarity -= 10;
  if (question.text.includes("which of the following")) clarity -= 5;
  if (/\b(not|never|none)\b/i.test(question.text)) clarity -= 5;

  // Distractor quality - answers should be similar length
  const lengths = question.answers.map((a) => a.text.length);
  const avgLength = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const lengthVariance =
    lengths.reduce((sum, l) => sum + Math.pow(l - avgLength, 2), 0) /
    lengths.length;
  if (lengthVariance > avgLength * 2) distractorQuality -= 10;

  // Check for "all of the above" or "none of the above"
  const hasMetaAnswer = question.answers.some((a) =>
    /all of the above|none of the above/i.test(a.text)
  );
  if (hasMetaAnswer) distractorQuality -= 15;

  // Explanation quality
  if (question.explanation.length < 30) explanationQuality -= 10;
  if (!question.explanation.toLowerCase().includes("because")) {
    explanationQuality -= 5;
  }

  const total =
    clarity + distractorQuality + difficultyAccuracy + explanationQuality;

  return { total, clarity, distractorQuality, difficultyAccuracy, explanationQuality };
}
```

Questions scoring above 75 go to the human review queue. Below 75, they are either regenerated or discarded.
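That routing can be sketched as a small triage function (the names and retry budget here are ours for illustration, not from the production code):

```typescript
// Sketch: route a scored question based on the 75-point cutoff.
// Below the cutoff, a question is regenerated until a retry budget
// runs out, then discarded.
type Triage = "review" | "regenerate" | "discard";

function triageQuestion(
  score: number,
  attempts: number,
  maxAttempts = 3
): Triage {
  if (score >= 75) return "review";
  return attempts < maxAttempts ? "regenerate" : "discard";
}
```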

The Full Pipeline

Tying everything together:

```typescript
async function generateAndValidate(
  topic: string,
  difficulty: "easy" | "medium" | "hard",
  targetCount: number
): Promise<GeneratedQuestion[]> {
  const batchSize = Math.ceil(targetCount * 1.5); // Over-generate to account for filtering
  const candidates = await generateQuestions(topic, difficulty, batchSize);

  const approved: GeneratedQuestion[] = [];

  for (const question of candidates) {
    // Step 1: Structural validation
    const structure = validateStructure(question);
    if (!structure.valid) continue;

    // Step 2: Answer verification
    const verification = await verifyAnswer(question);
    if (!verification.verified || verification.confidence < 0.85) continue;

    // Step 3: Quality scoring
    const quality = scoreQuestion(question);
    if (quality.total < 75) continue;

    approved.push(question);

    if (approved.length >= targetCount) break;
  }

  return approved;
}
```
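Note that the loop awaits each `verifyAnswer` call in turn. If throughput matters, a small concurrency helper can run verification in parallel without flooding the API. This is a generic sketch of ours, not part of the original pipeline:

```typescript
// Sketch: map over items with at most `limit` promises in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index. JS is
  // single-threaded, so the claim (next++) is race-free.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

With this in place, the candidate loop could verify, say, four questions at a time and apply the score filter afterward.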

Lessons Learned

Over-generate and filter. Generating 50% more questions than needed and filtering down is cheaper and faster than regenerating failed questions.

Separate generation from verification. Using the same model and prompt for both generation and verification leads to the model agreeing with itself. A separate verification step catches real errors.

Difficulty is subjective. LLMs tend to generate medium-difficulty questions regardless of the prompt. Explicit examples of each difficulty level in the prompt helped, but human calibration is still needed for hard questions.
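Those exemplars can be appended to the generation prompt as a fixed block. The sample stems below are illustrative ones written for this post, not the calibration set we actually use:

```typescript
// Sketch: one worked exemplar per difficulty level, appended to the
// generation prompt so the model anchors on concrete calibration.
const DIFFICULTY_EXEMPLARS: Record<"easy" | "medium" | "hard", string> = {
  easy: "What does the typeof operator return for an array in JavaScript?",
  medium:
    "A fetch call resolves but response.ok is false. Which status range does that imply?",
  hard: "In what order do microtasks and setTimeout callbacks run after a resolved promise chain?",
};

function appendExemplars(prompt: string): string {
  const block = (
    Object.keys(DIFFICULTY_EXEMPLARS) as Array<keyof typeof DIFFICULTY_EXEMPLARS>
  )
    .map((level) => `${level} example: ${DIFFICULTY_EXEMPLARS[level]}`)
    .join("\n");
  return `${prompt}\n\nExample stems per difficulty:\n${block}`;
}
```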

Code questions need execution. LLMs make subtle mistakes in code output predictions. For any question that asks "What does this code output?", actually running the code is the only reliable verification.

Summary

AI-powered quiz generation works when you treat the LLM as a draft writer, not a finished product. The validation pipeline - structural checks, answer verification, quality scoring, and human review - is what turns raw LLM output into publishable questions. We generate questions 10x faster than manual authoring while maintaining our quality standard.
