AI Education Platform

Overview

A self-hosted, scalable platform for teaching and experimenting with artificial intelligence in controlled educational environments. The platform gives students and researchers isolated Jupyter-based development environments on GPU-backed infrastructure. Inspired by Google Colab, the system is tailored for on-premises GPU use, with full container control, secure web access, integrated chat, assignment delivery, and detailed system monitoring.

Designed and implemented by Alexander Kim in 2023–2024 for a national educational research project in Korea. The system supported over 40 concurrent users during live AI classes and removed the need for commercial cloud services.


Objectives

  • Democratize access to GPU resources for students in a classroom/lab setting
  • Provide isolated, reproducible, and secure environments for each user
  • Enable real-time interaction, file sharing, and code collaboration
  • Offer per-user monitoring of GPU and system usage
  • Reduce cost compared to cloud-based alternatives like Google Colab or AWS

Core Features

🔹 Frontend

  • Next.js-based Web UI: Clean, responsive interface with student-friendly layout
  • GraphQL Client: Interactive communication with backend for notebooks, sessions, and metadata
  • Socket.IO Chat: Real-time communication for support, announcements, and Q&A
  • Notebook Management: Create, restart, and delete containers from the frontend
  • Authentication: Role-based access (admin/teacher/student)
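The role model above can be illustrated with a small permission check. This is a minimal sketch in Python, not the platform's actual code: the policy that teachers and admins may manage any notebook while students may manage only their own is an assumption, and `MANAGE_ANY` and `can_manage_notebook` are hypothetical names.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "admin"
    TEACHER = "teacher"
    STUDENT = "student"


# Assumed policy: these roles may act on notebooks they do not own.
MANAGE_ANY = {Role.ADMIN, Role.TEACHER}


def can_manage_notebook(role: Role, user_id: str, owner_id: str) -> bool:
    """Owners manage their own notebook; privileged roles manage any."""
    return user_id == owner_id or role in MANAGE_ANY
```

A check like this would sit in front of the create, restart, and delete actions, so the frontend can hide controls the current user is not allowed to use.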

🔹 Backend (Dual-Service Architecture)

  • NestJS API Server:
    • GraphQL API (user, session, notebook management)
    • MongoDB integration for persistent user/session storage
  • FastAPI Control Server:
    • REST endpoints for Docker container lifecycle management
    • GPU allocation logic (static MIG mapping per user)
    • Server-Sent Events (SSE) streams and Prometheus-compatible exporters for monitoring
    • Custom GPU scripts using nvidia-smi, pynvml
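To make the static MIG mapping concrete, here is a minimal sketch of a one-slice-per-user allocator. It is an illustration under assumptions, not the control server's implementation: the class name `MigAllocator`, the in-memory dictionary, and the placeholder slice identifiers are all hypothetical. In practice the identifiers would come from the driver (for example, the output of `nvidia-smi -L`).

```python
from dataclasses import dataclass, field


@dataclass
class MigAllocator:
    """Static one-slice-per-user allocator (hypothetical helper)."""

    # MIG device identifiers known to the host.
    slices: list[str]
    # Maps user_id -> MIG identifier.
    assignments: dict[str, str] = field(default_factory=dict)

    def allocate(self, user_id: str) -> str:
        """Return the user's slice, assigning the first free one if needed."""
        if user_id in self.assignments:
            # Idempotent: a user keeps the same slice across calls.
            return self.assignments[user_id]
        taken = set(self.assignments.values())
        for mig_id in self.slices:
            if mig_id not in taken:
                self.assignments[user_id] = mig_id
                return mig_id
        raise RuntimeError("no free MIG slice")

    def release(self, user_id: str) -> None:
        """Free the user's slice, e.g. when their container is deleted."""
        self.assignments.pop(user_id, None)


allocator = MigAllocator(slices=["MIG-aaaa", "MIG-bbbb"])  # placeholder IDs
first = allocator.allocate("student-1")
second = allocator.allocate("student-2")
again = allocator.allocate("student-1")  # same slice as `first`
```

Because each identifier is handed out at most once, two users never share a slice, and a request beyond capacity fails loudly instead of oversubscribing the GPU.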

🔹 Dockerized Learning Environments

  • Jupyter Notebook + Terminal
  • User-specific volumes (isolated home folders)
  • Read-only shared volumes for distributing teaching materials
  • Preinstalled packages (PyTorch, TensorFlow, Scikit-learn, etc.)
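The volume layout above maps naturally onto the keyword arguments of the Docker SDK for Python (`client.containers.run`). The sketch below only builds that argument dictionary, so it runs without a Docker daemon. The host paths, mount points, container naming scheme, and the use of the `NVIDIA_VISIBLE_DEVICES` variable with the `nvidia` runtime are assumptions for illustration, not details confirmed by the project.

```python
from pathlib import PurePosixPath


def notebook_run_kwargs(user_id: str, mig_uuid: str,
                        homes_root: str = "/srv/homes",
                        materials_dir: str = "/srv/materials") -> dict:
    """Build keyword arguments for docker-py's `client.containers.run`.

    All paths and names here are hypothetical defaults.
    """
    home = PurePosixPath(homes_root) / user_id
    return {
        "name": f"notebook-{user_id}",
        "detach": True,
        # Assumes the NVIDIA container runtime is installed on the host.
        "runtime": "nvidia",
        # Expose only this user's MIG slice inside the container.
        "environment": {"NVIDIA_VISIBLE_DEVICES": mig_uuid},
        "volumes": {
            # Per-user home folder, writable.
            str(home): {"bind": "/home/student", "mode": "rw"},
            # Shared teaching materials, read-only.
            materials_dir: {"bind": "/materials", "mode": "ro"},
        },
    }


kwargs = notebook_run_kwargs("alice", "MIG-1234")  # placeholder identifiers
```

With the SDK installed, the dictionary could be passed as `docker.from_env().containers.run(image, **kwargs)`, where `image` is whatever notebook image the deployment uses.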

🔹 GPU Infrastructure

  • Tesla V100 x4 with MIG slicing (static allocation)
  • Each container mapped to a MIG slice (guaranteed isolation)
  • Scripts to manage container ↔ MIG mapping

🔹 Monitoring

  • Prometheus + Grafana Dashboards
  • Custom GPU exporters (per MIG slice)
  • Node Exporter for system-level metrics
  • Dashboards for teachers: active users, GPU load, notebook status
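A custom exporter ultimately has to serve its readings in the Prometheus text exposition format. The following sketch shows only that rendering step. The metric name `mig_gpu_utilization_percent`, the `mig` label, and the sample values are assumptions, and the code assumes slice identifiers contain no characters that need escaping in label values. How the readings are collected (for example, through `nvidia-smi` or `pynvml`) is left out.

```python
def render_mig_metrics(samples: dict[str, float],
                       metric: str = "mig_gpu_utilization_percent") -> str:
    """Render per-slice gauge samples in the Prometheus text format."""
    lines = [
        f"# HELP {metric} GPU utilization per MIG slice.",
        f"# TYPE {metric} gauge",
    ]
    # One sample line per slice, labelled with its identifier.
    for mig_id, value in sorted(samples.items()):
        lines.append(f'{metric}{{mig="{mig_id}"}} {value}')
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


text = render_mig_metrics({"MIG-b": 12.0, "MIG-a": 37.5})  # made-up readings
```

Serving this string from an HTTP endpoint that Prometheus scrapes yields one time series per slice, which is what lets a Grafana dashboard break GPU load down by student rather than by whole GPU.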

🔹 Networking & Deployment

  • Nginx reverse proxy with SSL termination
  • Static IP and dynamic DNS (e.g., via DuckDNS)
  • Self-hosted on bare metal (Ubuntu server)
  • Docker Compose orchestration for all services

Architecture Diagram

AI Platform Architecture

Frontend: Next.js + GraphQL Client + Socket.IO
Backend 1 (NestJS): GraphQL API + MongoDB
Backend 2 (FastAPI): REST API + Docker SDK + GPU Metrics
Containers: Jupyter Notebook + Terminal + User Volumes
Monitoring: Prometheus + Node Exporter + Custom GPU Exporter + Grafana
Infrastructure: Tesla GPUs x4 + MIG slices + Nginx + Static IP + Docker Compose

User Interface

AI Platform UI

Clean, responsive interface designed for educational use with student-friendly layout and intuitive navigation.


Use Cases

  • University-level AI coursework (20–40 students)
  • Workshops or bootcamps in deep learning
  • Research labs requiring multi-user Jupyter access
  • Safe alternative to commercial notebooks for schools with privacy/legal constraints

Key Technologies

Layer         Technologies
Frontend      Next.js, TypeScript, Apollo Client, Socket.IO
Backend       NestJS (GraphQL), FastAPI (REST), MongoDB, Docker SDK
Monitoring    Prometheus, Grafana, Node Exporter, Custom GPU Exporters
GPU Tools     NVIDIA MIG, nvidia-smi, pynvml
Infra         Nginx, Docker, Docker Compose, SSL, Static IP + DDNS
DevOps        Docker Volumes, shell scripts, recovery automation

Engineering Highlights

  • Designed a dual-backend architecture to separate API logic from low-level GPU/container control
  • Enabled multi-user isolation using MIG and Docker volumes
  • Implemented fault recovery and container health monitoring using restart scripts and health checks
  • Built a chat layer with optional teacher moderation
  • Used Prometheus exporters to trace GPU load per student in real time
  • Integrated auto-mounting of teaching materials into student containers

Challenges Solved

  • MIG management: Static allocation and mapping without breaking container health
  • GPU resource fairness: One MIG slice per student, enforced via control server
  • Monitoring granularity: Metrics by GPU slice, not global GPU usage
  • Security: Nginx reverse proxy, role-based access, isolated volumes
  • Uptime: Restart scripts and health checks for critical services
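The uptime item follows a common probe-and-restart pattern, sketched below. This is an illustrative example rather than the project's scripts: `supervise`, its `max_restarts` parameter, and the fake service used in the demo are hypothetical. The probe and restart actions are passed in as callables so the logic stays independent of how a service is actually checked or restarted.

```python
from typing import Callable


def supervise(probe: Callable[[], bool], restart: Callable[[], None],
              max_restarts: int = 3) -> bool:
    """Probe a service; restart and re-probe until healthy or out of attempts.

    Returns True if the service ends up healthy, False otherwise.
    """
    if probe():
        return True
    for _ in range(max_restarts):
        restart()
        if probe():
            return True
    return False


# Demo with a fake service that becomes healthy after two restarts.
state = {"restarts": 0}


def fake_probe() -> bool:
    return state["restarts"] >= 2


def fake_restart() -> None:
    state["restarts"] += 1


recovered = supervise(fake_probe, fake_restart)
```

Capping the number of restarts keeps a permanently broken service from looping forever, and a `False` result is the point at which an operator would be alerted.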

Future Plans

  • Auto-grading via Jupyter nbconvert + test runners
  • Admin dashboard for real-time control of sessions
  • User activity logs and performance heatmaps
  • OAuth login for LMS integration (Google Classroom, Moodle)
  • Container auto-scaling based on load