This is On Screen. It can carry out native tasks on your phone so you do not have to handle every small step yourself.
I wanted a phone assistant that could do more than answer questions. On Screen is my first attempt at that: it looks at what is visible, understands the controls on the screen, and uses Android's accessibility APIs to move through native apps.
The first version is for people who want to test the idea, give feedback, and help figure out what would make a phone agent more reliable.
Most phone assistants still stop at advice. They can answer a question or explain where a setting might be, but they usually do not inspect the current Android screen and act inside the app for you.
On Screen is my attempt at an autonomous Android agent: something that can read the visible UI, reason about the next action, and use accessibility to tap, type, scroll, and navigate through real phone workflows.
The goal is not just guidance. The goal is a phone agent that can complete native Android tasks while still being easy to supervise, pause, or stop when needed.
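To make that read-and-act loop concrete, here is a minimal Kotlin sketch of the kind of AccessibilityService an agent like this builds on. It is not On Screen's actual code, and the class and helper names are made up, but rootInActiveWindow, AccessibilityNodeInfo, and dispatchGesture are the standard Android APIs for reading the visible UI and injecting taps.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.graphics.Rect
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Illustrative "read the screen, then act" loop; not On Screen's real implementation.
class ScreenAgentService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent) {
        // 1. Read: flatten the visible UI tree into short text descriptions
        //    that could be sent to a model as context.
        val root = rootInActiveWindow ?: return
        val elements = mutableListOf<String>()
        collectNodes(root, elements)
        // `elements` would then go into the model prompt for the next action.
    }

    private fun collectNodes(node: AccessibilityNodeInfo, out: MutableList<String>) {
        if (node.isVisibleToUser) {
            val bounds = Rect().also { node.getBoundsInScreen(it) }
            val label = node.text ?: node.contentDescription ?: ""
            out.add("${node.className} \"$label\" at $bounds clickable=${node.isClickable}")
        }
        for (i in 0 until node.childCount) {
            node.getChild(i)?.let { collectNodes(it, out) }
        }
    }

    // 2. Act: tap a screen coordinate chosen by the model.
    private fun tap(x: Float, y: Float) {
        val path = Path().apply { moveTo(x, y) }
        val gesture = GestureDescription.Builder()
            .addStroke(GestureDescription.StrokeDescription(path, 0, 50))
            .build()
        dispatchGesture(gesture, null, null)
    }

    override fun onInterrupt() = Unit
}
```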
Bring your own API key

The public app is mainly the cloud version for now. Paste your OpenAI API key in the app, optionally add a Brave Search key, then press play to launch the agent.
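To be clear about what "bring your own key" means: the key you paste is used from the device itself, in the Authorization header of requests to the provider, with no server of mine in between. Here is a rough sketch of that pattern using OkHttp and OpenAI's public chat completions endpoint for simplicity (the app itself talks to real-time model APIs); the function names, model name, and payload are illustrative, not what On Screen actually sends.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

// Illustrative bring-your-own-key call: the request goes straight from the
// phone to the provider, so only OpenAI sees the prompt and the key.
fun askModel(openAiKey: String, prompt: String): String? {
    val client = OkHttpClient()
    // Model name and payload are just examples.
    val body = """
        {"model": "gpt-4o-mini",
         "messages": [{"role": "user", "content": ${jsonString(prompt)}}]}
    """.trimIndent().toRequestBody("application/json".toMediaType())

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .addHeader("Authorization", "Bearer $openAiKey")
        .post(body)
        .build()

    val response = client.newCall(request).execute()
    return response.use { if (it.isSuccessful) it.body?.string() else null }
}

// Tiny helper to quote the prompt as a JSON string; real code would use a JSON library.
fun jsonString(s: String) =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
```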
Agent memory is stored on your phone. I do not store your data on any server; there is no backend on my side. In cloud mode, the model provider still receives whatever you send to it.
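On the memory side, "stored on your phone" can be as simple as a file in the app's private storage. A minimal sketch under that assumption; the file name and format are illustrative, not the app's actual layout.

```kotlin
import android.content.Context
import java.io.File

// Illustrative on-device memory: notes the agent keeps between tasks are
// appended to a file in the app's private storage and never uploaded.
class LocalAgentMemory(context: Context) {
    private val file = File(context.filesDir, "agent_memory.jsonl")

    fun remember(note: String) {
        file.appendText(note.replace("\n", " ") + "\n")
    }

    fun recall(limit: Int = 20): List<String> =
        if (file.exists()) file.readLines().takeLast(limit) else emptyList()
}
```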
There are two directions: cloud and on-device. The demo version uses cloud real-time model APIs because that is the usable path today.
I also experimented with On Screen Offline: whisper.cpp for speech to text, Gemma as the vision-language model, and Kitten TTS for text to speech. The idea is exciting because privacy is better and there are no model API costs, but it is currently very slow and makes the phone heat up.
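For a sense of how those offline pieces fit together, here is a sketch of one turn of that pipeline. The three interfaces are hypothetical stand-ins for whisper.cpp, Gemma, and Kitten TTS bindings, not their real APIs; only the wiring is the point.

```kotlin
import android.graphics.Bitmap

// Hypothetical stand-ins for the on-device components; the real projects
// (whisper.cpp, Gemma, Kitten TTS) each need their own bindings.
interface SpeechToText { fun transcribe(audio: ShortArray): String }                        // whisper.cpp
interface VisionLanguageModel { fun nextAction(screen: Bitmap, request: String): String }   // Gemma
interface TextToSpeechEngine { fun speak(text: String) }                                    // Kitten TTS

// One turn of the fully offline loop: everything runs on the phone, which is
// why it is private and free of API costs, but also slow and hot right now.
fun offlineTurn(
    stt: SpeechToText,
    vlm: VisionLanguageModel,
    tts: TextToSpeechEngine,
    audio: ShortArray,
    screenshot: Bitmap,
) {
    val request = stt.transcribe(audio)               // speech -> text
    val action = vlm.nextAction(screenshot, request)  // screen + text -> next action
    tts.speak(action)                                 // narrate the chosen action
}
```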
For a future version, I want to explore smaller on-device models, mobile GUI grounding, supervised fine-tuning on mobile action data, and RL on AndroidWorld-style tasks.
Technical feedback would be especially useful. If you know Android, accessibility services, phone systems, on-device ML, GUI grounding, or agent training, tell me what could make this faster and more robust.
I spent about five days building and experimenting with this version. Your comments and ideas will decide whether version two should exist.