AI Education Platform

Overview

A self-hosted, scalable platform for teaching and experimenting with artificial intelligence in controlled educational environments. The platform enables students and researchers to access isolated Jupyter-based development environments on GPU-backed infrastructure. Inspired by Google Colab, the system is tailored for on-premise GPU use, with full container control, secure web access, integrated chat, assignment delivery, and detailed system monitoring.

Designed and implemented by Alexander Kim in 2023–2024 for a national educational research project in Korea. The system supported over 40 concurrent users during live AI classes and successfully replaced the need for commercial cloud services.

Objectives

Democratize access to GPU resources for students in a classroom/lab setting
Provide isolated, reproducible, and secure environments for each user
Enable real-time interaction, file sharing, and code collaboration
Offer per-user monitoring of GPU and system usage
Reduce cost compared to cloud-based alternatives like Google Colab or AWS

Core Features

🔹 Frontend

Next.js-based Web UI: Clean, responsive interface with student-friendly layout
GraphQL Client: Interactive communication with backend for notebooks, sessions, and metadata
Socket.io Chat: Real-time communication for support, announcements, and Q&A
Notebook Management: Create, restart, and delete containers from the frontend
Authentication: Role-based access (admin/teacher/student)

🔹 Backend (Dual-Service Architecture)

NestJS API Server:
- GraphQL API (user, session, notebook management)
- MongoDB integration for persistent user/session storage
FastAPI Control Server:
- REST endpoints for Docker container lifecycle management
- GPU allocation logic (static MIG mapping per user)
- SSE/Prometheus-compatible exporters for monitoring
- Custom GPU scripts using nvidia-smi, pynvml

🔹 Dockerized Learning Environments

Jupyter Notebook + Terminal
User-specific volumes (isolated home folders)
Read-only shared volumes for distributing teaching materials
Preinstalled packages (PyTorch, TensorFlow, Scikit-learn, etc.)

🔹 GPU Infrastructure

Tesla V100 x4 with MIG slicing (static allocation)
Each container mapped to a MIG slice (guaranteed isolation)
Scripts to manage container ↔ MIG mapping

🔹 Monitoring

Prometheus + Grafana Dashboards
Custom GPU exporters (per MIG slice)
Node Exporter for system-level metrics
Dashboards for teachers: active users, GPU load, notebook status

🔹 Networking & Deployment

Nginx reverse proxy with SSL termination
Static IP and dynamic DNS (e.g., via DuckDNS)
Self-hosted on bare metal (Ubuntu server)
Docker Compose orchestration for all services

Architecture Diagram

AI Platform Architecture

Frontend: Next.js + GraphQL Client + Socket.IO
Backend 1 (NestJS): GraphQL API + MongoDB
Backend 2 (FastAPI): REST API + Docker SDK + GPU Metrics
Containers: Jupyter Notebook + Terminal + User Volumes
Monitoring: Prometheus + Node Exporter + Custom GPU Exporter + Grafana
Infrastructure: Tesla GPUs x4 + MIG slices + Nginx + Static IP + Docker Compose

User Interface

AI Platform UI

Clean, responsive interface designed for educational use with student-friendly layout and intuitive navigation.

Use Cases

University-level AI coursework (20–40 students)
Workshops or bootcamps in deep learning
Research labs requiring multi-user Jupyter access
Safe alternative to commercial notebooks for schools with privacy/legal constraints

Key Technologies

Layer	Technologies
Frontend	Next.js, TypeScript, Apollo Client, socket.io
Backend	NestJS (GraphQL), FastAPI (REST), MongoDB, Docker SDK
Monitoring	Prometheus, Grafana, Node Exporter, Custom GPU Exporters
GPU Tools	NVIDIA MIG, nvidia-smi, pynvml
Infra	Nginx, Docker, Docker Compose, SSL, Static IP + DDNS
DevOps	Docker Volumes, shell scripts, recovery automation

Engineering Highlights

Designed a dual-backend architecture to separate API logic from low-level GPU/container control
Enabled multi-user isolation using MIG and Docker volumes
Implemented robust fault-recovery and container health monitoring
Built a chat layer with optional teacher moderation
Used Prometheus exporters to trace GPU load per student in real time
Integrated auto-mounting of teaching materials into student containers

Challenges Solved

MIG management: Static allocation and mapping without breaking container health
GPU resource fairness: One MIG slice per student, enforced via control server
Monitoring granularity: Metrics by GPU slice, not global GPU usage
Security: Nginx reverse proxy, role-based access, isolated volumes
Uptime: Restart scripts and health checks for critical services

Future Plans

Auto-grading via Jupyter nbconvert + test runners
Admin dashboard for real-time control of sessions
User activity logs and performance heatmaps
OAuth login for LMS integration (Google Classroom, Moodle)
Container auto-scaling based on load

Overview​

Objectives​

Core Features​

🔹 Frontend​

🔹 Backend (Dual-Service Architecture)​

🔹 Dockerized Learning Environments​

🔹 GPU Infrastructure​

🔹 Monitoring​

🔹 Networking & Deployment​

Architecture Diagram​

User Interface​

Use Cases​

Key Technologies​

Engineering Highlights​

Challenges Solved​

Future Plans​