NEW: Abserny v1.0 is out, AI vision for Android, now available · Gemini 2.0 Flash Lite · Arabic & English · Offline ML Kit fallback · Four detection modes · Voice-first onboarding
Download

Abserny

Voice-activated vision assistant for the visually impaired. No screen required, ever.

Discover how it works

Understanding Through AI

A graduation project that gives visually impaired individuals a fully spoken, gesture-driven window into the world around them, powered by Gemini AI and working even offline.

The Problem

Over 2.2 billion people worldwide live with visual impairment. Most AI assistive tools still assume a sighted user, requiring menus, icons, and screens to set up. Even the first launch demands someone who can see.

Our Answer

Abserny is built entirely around a voice-first philosophy. Every interaction, from first launch and language selection to daily use, is driven by simple gestures and spoken feedback. No screen. No visual menus. No sighted assistance needed.

How Abserny Works

A deterministic finite state machine governs every interaction: no race conditions, no ambiguity. A sketch of the transition logic follows the state list.

BOOT: Initialize camera, TTS engine & FSM supervisor
READY: Idle, awaiting the double-tap gesture
SCANNING: Capture frame → route to Gemini or ML Kit
SPEAKING: TTS speaks the description; long press to repeat
ERROR: API failure → auto-recover to READY
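The logic behind these five states can be pictured as a single pure transition function, shown in the TypeScript sketch below. It is a minimal illustration built only from the states listed above; the type, event, and function names are assumptions, not the project's actual code.

```ts
// Minimal FSM sketch (illustrative names, not Abserny's real source).
type AppState = 'BOOT' | 'READY' | 'SCANNING' | 'SPEAKING' | 'ERROR';

type AppEvent =
  | { type: 'INIT_DONE' }                    // camera, TTS engine & supervisor ready
  | { type: 'DOUBLE_TAP' }                   // user requests a scan
  | { type: 'RESULT'; description: string }  // Gemini or ML Kit returned a description
  | { type: 'SPEECH_DONE' }                  // TTS finished speaking
  | { type: 'API_FAILURE' }                  // cloud call failed
  | { type: 'RECOVERED' };                   // error handled, back to idle

// Exactly one next state per (state, event) pair: a second double tap while
// SCANNING is simply ignored, so two scans can never run at once.
function transition(state: AppState, event: AppEvent): AppState {
  switch (state) {
    case 'BOOT':     return event.type === 'INIT_DONE'   ? 'READY'    : state;
    case 'READY':    return event.type === 'DOUBLE_TAP'  ? 'SCANNING' : state;
    case 'SCANNING':
      if (event.type === 'RESULT')      return 'SPEAKING';
      if (event.type === 'API_FAILURE') return 'ERROR';
      return state;
    case 'SPEAKING': return event.type === 'SPEECH_DONE' ? 'READY'    : state;
    case 'ERROR':    return event.type === 'RECOVERED'   ? 'READY'    : state;
  }
}
```

Because every transition is deterministic, overlapping scans and conflicting gestures are ruled out by construction rather than by runtime checks.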

Core Features

01
Gesture-Driven Control Five gestures cover every function: double tap to scan, long press to repeat the last result, swipe left or right to change mode, triple tap for settings. No buttons, no visual interface, ever.
02
Gemini AI Vision Google Gemini 2.0 Flash Lite analyzes the camera feed and produces spatial spoken descriptions, hazards first, in your language, with a 1,430 ms median latency.
03
Four Detection Modes Scene for broad environmental awareness, Object for close-range identification, Read for OCR text recognition, People for social navigation. Swipe left or right to cycle between them at any time.
04
Full Offline Support Automatic, silent failover to on-device ML Kit when offline: 380 ms median latency, 0% failure rate (see the routing sketch after this list). AbserneyVision, our custom MobileNetV2 model, adds a further offline detection layer.
05
Bilingual, Arabic & English Complete Arabic (ar-SA) and English (en-US) support at every layer: TTS voices, per-language AI prompts, UI text, and RTL layout. Language chosen during spoken onboarding, no reading required.
06
Voice-First Onboarding First launch speaks a language picker, then walks through each gesture interactively; the user must perform each gesture to advance. Zero sighted assistance required from the very first second.
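The silent failover in feature 04 comes down to a single routing decision around each scan. The sketch below is a hypothetical illustration of that routing; describeFrame and the two provider callbacks are placeholder names, not Abserny's actual modules, and the latencies in the comments are the medians reported further down this page.

```ts
// Hypothetical routing sketch for online → offline failover.
type Describe = (imageUri: string) => Promise<string>;

async function describeFrame(
  imageUri: string,
  online: boolean,
  gemini: Describe,  // cloud path: Gemini 2.0 Flash Lite (~1,430 ms median)
  mlkit: Describe,   // offline path: on-device ML Kit (~380 ms median, no API key)
): Promise<string> {
  if (online) {
    try {
      return await gemini(imageUri);
    } catch {
      // Any API failure falls through silently; the user only ever hears a result.
    }
  }
  return mlkit(imageUri);
}
```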

Detection Modes

Four modes for four daily needs. Swipe left or right in the app to cycle between them at any time.

Scene "Double tap to describe your surroundings"

Broad environmental awareness. Hazards and obstacles are mentioned first, followed by spatial context: up to four items per description, always with directional words such as ahead, to your left, or nearby.

Object "Hold close and double tap"

Close-range object identification. Returns the precise name plus one functional detail, ideal for identifying items on a table, in a bag, or on a shelf.

Read "Point at text and double tap"

Full text recognition via on-device OCR. Reads all visible text verbatim, top to bottom, for signs, labels, documents, packaging, and displays.

People "Double tap to detect people"

Social and navigation awareness. Reports the count of people in frame, their spatial location, and observable activity, for navigating crowds, entering rooms, or approaching conversations.
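One straightforward way to realize these four modes is a table of per-mode prompt templates for the vision model, plus a small helper that cycles modes on swipe. The snippet below is only an illustration of that idea; the prompt wording and identifiers are assumptions, not the app's actual prompts (English shown; the app keeps Arabic variants too, per the per-language AI prompts mentioned above).

```ts
// Illustrative per-mode prompt templates (wording is not Abserny's actual prompts).
type Mode = 'scene' | 'object' | 'read' | 'people';

const PROMPTS: Record<Mode, string> = {
  scene:  'Describe the surroundings. Mention hazards and obstacles first, then up to ' +
          'four items, each with a direction such as ahead, to your left, or nearby.',
  object: 'Name the closest object precisely and add one functional detail.',
  read:   'Read all visible text verbatim, from top to bottom.',
  people: 'Report how many people are in frame, where they are, and what they appear to be doing.',
};

// Swiping left or right cycles through the four modes in a fixed order.
const MODE_ORDER: Mode[] = ['scene', 'object', 'read', 'people'];

function nextMode(current: Mode, direction: 1 | -1): Mode {
  const i = MODE_ORDER.indexOf(current);
  return MODE_ORDER[(i + direction + MODE_ORDER.length) % MODE_ORDER.length];
}
```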

Technology Stack

React Native / Expo SDK 54 Cross-platform mobile framework with a hooks-based architecture: speech, gesture, detection, and language each live in their own independent, swappable hook.
Google Gemini 2.0 Flash Lite The lowest-latency vision-capable Gemini model. Natively multimodal: analyzes images and generates spatial natural-language descriptions in Arabic and English. Free tier, so no cost barrier.
ML Kit (Google) On-device image labeling and text recognition. Fully offline, no network, no API key required. Activates automatically and silently when Gemini is unavailable.
AbserneyVision Custom MobileNetV2 model trained via transfer learning on ImageNet and COCO, serving as an additional offline detection layer targeting 13 indoor categories. Currently at 37% validation accuracy; the dataset is being expanded toward 80%+.
expo-speech (Native TTS) Interfaces with the Android native TTS engine. Custom speech queue with priority interruption: a new scan immediately replaces ongoing speech without any perceptible delay.
PanResponder + FSM Full-screen gesture detection governed by a deterministic finite state machine, making it structurally impossible for two scans to run simultaneously or for gestures to conflict.
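The priority-interruption behavior described for the speech layer can be sketched in a few lines on top of expo-speech. The wrapper below is an illustrative stand-in, not Abserny's actual speech hook; only Speech.stop, Speech.speak, and the language/onDone options come from the real expo-speech API.

```ts
import * as Speech from 'expo-speech';

// Illustrative sketch: a new description cuts off whatever is currently being
// spoken, then reports completion (e.g. so the FSM can move SPEAKING → READY).
function speakDescription(
  text: string,
  language: 'en-US' | 'ar-SA',
  onFinished: () => void,
): void {
  Speech.stop();              // priority interruption: cancel any ongoing utterance
  Speech.speak(text, {
    language,                 // matches the language chosen during spoken onboarding
    onDone: onFinished,       // tell the caller (e.g. the FSM) that speech has ended
  });
}
```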

Performance Results

Measured across 50 trials on a mid-range Android device (Snapdragon 665, WiFi connection).

1,430 ms

Median end-to-end latency from gesture to first spoken word, online via Gemini. Offline via ML Kit achieves 380 ms median.

87% Spatial Rate

87% of AI-generated scene descriptions include at least one spatial orientation term. 100% of descriptions with hazards mention them first.

Get Abserny on Android

Real-time, AI-powered scene description with gesture-driven control. Available now for Android; iOS is in progress.

Download Now