Vision-First Study Buddy
Snap your notes, get a study guide — multimodal AI for students.
Overview
Vision-First Study Buddy is a mobile-friendly web application that transforms handwritten notes, whiteboard photos, PDFs, and EPUBs into structured study guides and interactive quizzes. Instead of a traditional OCR-then-process pipeline, raw images and documents are sent directly to Vertex AI Gemini 2.5 Flash as multimodal content parts — preserving spatial layout, diagrams, and handwriting context that text extraction would lose. The FastAPI backend runs on Cloud Run and orchestrates file storage in Firebase Storage, multimodal content assembly, and structured JSON generation via specialized prompt templates. The React 19 frontend with Material UI provides drag-and-drop upload, native device camera capture, quiz taking with instant grading, and local persistence of study guides and quiz history — all optimized for mobile-first studying.
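The structured-JSON output mentioned above can be sketched as a small parsing layer. This is a minimal illustration, not the app's actual schema: the field names (`sections`, `key_terms`, etc.) are assumptions about what the prompt template might request.

```python
import json
from dataclasses import dataclass, field

# Hypothetical study-guide schema; field names are illustrative,
# not the app's real prompt contract.
@dataclass
class KeyTerm:
    term: str
    definition: str

@dataclass
class Section:
    title: str
    summary: str
    key_terms: list[KeyTerm] = field(default_factory=list)

def parse_study_guide(raw: str) -> list[Section]:
    """Parse the model's JSON response into typed sections.

    Strips the ```json fences that models sometimes wrap
    around structured output.
    """
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    return [
        Section(
            title=s["title"],
            summary=s["summary"],
            key_terms=[KeyTerm(**t) for t in s.get("key_terms", [])],
        )
        for s in data["sections"]
    ]
```

Parsing into typed objects (rather than passing raw dicts to the frontend) is one way to catch malformed model output at the API boundary.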
The Problem
Students accumulate mountains of handwritten notes, whiteboard photos, and PDF handouts throughout a semester. When it comes time to study, they face a disorganized pile of materials in different formats with no easy way to synthesize them. Manually creating study guides is tedious, and most note-taking apps can't read handwriting or extract meaning from diagrams. Existing OCR tools produce raw text without understanding the educational context — losing spatial relationships, diagram annotations, and the connections between concepts that make study materials meaningful.
The Approach
The core architectural decision is sending raw images and documents directly to Gemini 2.5 Flash as multimodal content parts, bypassing OCR entirely. Images are base64-encoded and sent inline, PDFs are processed natively, and EPUBs have their text extracted via ebooklib. Gemini's 1M-token context window allows multiple materials to be processed in a single pass without chunking or vector retrieval.
Study guide generation uses a prompt template that instructs the model to produce structured JSON with hierarchical sections, key terms with definitions, and cross-topic relationships. Quiz generation supports configurable difficulty (easy/medium/hard/mixed), question count (5-25), and question types (multiple choice, short answer, true/false). Short-answer grading uses a separate Gemini call for semantic comparison rather than exact string matching.
The backend uses FastAPI with Pydantic models for request validation, dependency injection for services, and device-scoped material isolation via X-Device-ID headers. The React frontend persists generated study guides and quiz history to localStorage for offline access.
Architecture
From Notes to Study Materials
How your uploads become study guides and quizzes in seconds
System Architecture
How the frontend, backend, AI, and storage work together
Tech Stack
FastAPI
Async Python API on Cloud Run
Gemini 2.5 Flash
Native Multimodal Vision + JSON Output
Vertex AI
Managed Model Serving
Firebase Storage
CDN-Backed File Storage
React 19 + MUI 6
Mobile-First SPA on Firebase Hosting
Camera Capture
MediaDevices API Integration
Quiz Engine
Semantic Grading via Gemini
Cloud Run
Serverless Container Hosting
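The Quiz Engine's semantic grading amounts to a second model call that compares meaning rather than strings. A minimal sketch, assuming a simple JSON verdict contract (the prompt wording and `correct`/`feedback` fields are illustrative, not the app's actual schema):

```python
import json

def grading_prompt(question: str, expected: str, student: str) -> str:
    """Ask the model to judge meaning, not exact wording."""
    return (
        "You are grading a short-answer quiz question.\n"
        f"Question: {question}\n"
        f"Reference answer: {expected}\n"
        f"Student answer: {student}\n"
        "Reply with JSON only: "
        '{"correct": true|false, "feedback": "<one sentence>"}'
    )

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Extract the boolean verdict and feedback from the model reply."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    verdict = json.loads(text)
    return bool(verdict["correct"]), verdict.get("feedback", "")
```

Delegating grading to the model lets "It captures sunlight" match a reference answer like "Absorbs light for photosynthesis", which exact string matching would reject.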