Vision-First Study Buddy
Snap your notes, get a study guide — multimodal AI for students.
Overview
Vision-First Study Buddy is a mobile-friendly web application that transforms handwritten notes, whiteboard photos, PDFs, and EPUBs into structured study guides and interactive quizzes. Instead of a traditional OCR-then-process pipeline, raw images and documents are sent directly to Vertex AI Gemini 2.5 Flash as multimodal content parts — preserving spatial layout, diagrams, and handwriting context that text extraction would lose. The FastAPI backend runs on Cloud Run and orchestrates file storage in Firebase Storage, multimodal content assembly, and structured JSON generation via specialized prompt templates. The React 19 frontend with Material UI provides drag-and-drop upload, native device camera capture, quiz taking with instant grading, and local persistence of study guides and quiz history — all optimized for mobile-first studying.
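The structured-JSON output mentioned above can be sketched as a small parsing layer. This is a minimal illustration, not the app's actual schema: the field names (`sections`, `key_terms`, etc.) are assumptions about what the prompt template might request.

```python
import json
from dataclasses import dataclass, field

# Hypothetical study-guide schema; field names are illustrative,
# not the app's real prompt contract.
@dataclass
class KeyTerm:
    term: str
    definition: str

@dataclass
class Section:
    title: str
    summary: str
    key_terms: list[KeyTerm] = field(default_factory=list)

def parse_study_guide(raw: str) -> list[Section]:
    """Parse the model's JSON response into typed sections.

    Strips the ```json fences that models sometimes wrap
    around structured output.
    """
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    return [
        Section(
            title=s["title"],
            summary=s["summary"],
            key_terms=[KeyTerm(**t) for t in s.get("key_terms", [])],
        )
        for s in data["sections"]
    ]
```

Parsing into typed objects (rather than passing raw dicts to the frontend) is one way to catch malformed model output at the API boundary.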
The Problem
Students accumulate mountains of handwritten notes, whiteboard photos, and PDF handouts throughout a semester. When it comes time to study, they face a disorganized pile of materials in different formats with no easy way to synthesize them. Manually creating study guides is tedious, and most note-taking apps can't read handwriting or extract meaning from diagrams. Existing OCR tools produce raw text without understanding the educational context — losing spatial relationships, diagram annotations, and the connections between concepts that make study materials meaningful.
The Approach
The core architectural decision is sending raw images and documents directly to Gemini 2.5 Flash as multimodal content parts, bypassing OCR entirely. Images are base64-encoded and sent inline, PDFs are processed natively, and EPUBs have their text extracted via ebooklib. Gemini's 1M-token context window allows multiple materials to be processed in a single pass without chunking or vector retrieval.
Study guide generation uses a prompt template that instructs the model to produce structured JSON with hierarchical sections, key terms with definitions, and cross-topic relationships. Quiz generation supports configurable difficulty (easy/medium/hard/mixed), question count (5-25), and question types (multiple choice, short answer, true/false). Short-answer grading uses a separate Gemini call for semantic comparison rather than exact string matching.
The backend uses FastAPI with Pydantic models for request validation, dependency injection for services, and device-scoped material isolation via X-Device-ID headers. The React frontend persists generated study guides and quiz history to localStorage for offline access.
Architecture
From Notes to Study Materials
How your uploads become study guides and quizzes in seconds
System Architecture
How the frontend, backend, AI, and storage work together
Tech Stack
FastAPI
Async Python API on Cloud Run
Gemini 2.5 Flash
Native Multimodal Vision + JSON Output
Vertex AI
Managed Model Serving
Firebase Storage
CDN-Backed File Storage
React 19 + MUI 6
Mobile-First SPA on Firebase Hosting
Camera Capture
MediaDevices API Integration
Quiz Engine
Semantic Grading via Gemini
Cloud Run
Serverless Container Hosting
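The Quiz Engine's semantic grading amounts to a second model call that compares meaning rather than strings. A minimal sketch, assuming a simple JSON verdict contract (the prompt wording and `correct`/`feedback` fields are illustrative, not the app's actual schema):

```python
import json

def grading_prompt(question: str, expected: str, student: str) -> str:
    """Ask the model to judge meaning, not exact wording."""
    return (
        "You are grading a short-answer quiz question.\n"
        f"Question: {question}\n"
        f"Reference answer: {expected}\n"
        f"Student answer: {student}\n"
        "Reply with JSON only: "
        '{"correct": true|false, "feedback": "<one sentence>"}'
    )

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Extract the boolean verdict and feedback from the model reply."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    verdict = json.loads(text)
    return bool(verdict["correct"]), verdict.get("feedback", "")
```

Delegating grading to the model lets "It captures sunlight" match a reference answer like "Absorbs light for photosynthesis", which exact string matching would reject.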