I built my own offline speech-to-text for macOS. Here is why.

I talk faster than I type. Most people do. So for years the obvious move was to use one of the cloud dictation services: open the app, talk, get clean text back. The transcription quality got genuinely good. And every single time I used one, a small voice in the back of my head asked the same question: who else just heard that?

Because that is the deal you sign without reading it. You speak into your machine, the audio gets shipped to someone else's server, a model you cannot inspect turns it into text, and you take it on faith that the recording of your voice, your half-formed ideas, the client name you said out loud, gets thrown away. Maybe it does. Maybe it trains the next model. You do not get to know.

I self-host basically everything I can, so this bothered me more than it probably bothers a normal person. I decided to stop complaining and build the thing I actually wanted. It is called Dictate, it is open source, and it lives at github.com/0x0ndra/dictate.

What it does, in one sentence

Hold a hotkey, speak, release. The text you said gets typed into whatever app you are in right now: your editor, a chat box, the address bar, a commit message, it does not matter. No window pops up, no copy-and-paste dance, no "open the dictation app first." It feels less like a tool and more like a key on your keyboard that happens to understand speech.

Under the hood it runs Whisper locally. The audio never leaves your Mac. There is no cloud, no API key, no account, no telemetry. The recording is held in memory just long enough to transcribe, and then it is gone. That last part is the whole point: the safest data is the data that does not exist a second after you are done with it.
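To make "held in memory, then gone" concrete, here is a minimal Swift sketch of the idea. It is not lifted from the Dictate source: `EphemeralRecording` and `transcribeAndDiscard` are names invented for this example, and the closure stands in for a real Whisper call.

```swift
// Illustrative sketch only: names here are hypothetical, not Dictate's real API.
final class EphemeralRecording {
    // Samples live only in this in-memory array; nothing is written to disk.
    private var samples: [Float] = []

    var sampleCount: Int { samples.count }

    func append(_ chunk: [Float]) {
        samples.append(contentsOf: chunk)
    }

    /// Passes the samples to `transcribe`, then clears the buffer,
    /// whether the closure returns normally or throws.
    func transcribeAndDiscard<T>(_ transcribe: ([Float]) throws -> T) rethrows -> T {
        defer { wipe() }
        return try transcribe(samples)
    }

    // Best-effort cleanup: overwrite with zeros, then release the storage.
    // This is not a defense against memory forensics.
    private func wipe() {
        for i in samples.indices { samples[i] = 0 }
        samples.removeAll(keepingCapacity: false)
    }
}

// Usage with a stand-in "transcriber" that just counts samples.
let recording = EphemeralRecording()
recording.append([0.1, -0.2, 0.3])
let text = recording.transcribeAndDiscard { "heard \($0.count) samples" }
print(text)                    // prints "heard 3 samples"
print(recording.sampleCount)   // prints "0"
```

The useful property is that cleanup sits in a `defer`, so the buffer is emptied on every exit path, including a transcription error. The only thing that outlives the call is the text.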

Why local-first, and not "the privacy-friendly cloud option"

Every cloud service now has a privacy page. They all promise. And I am sure most of them mean it. But "trust us" is not a security model, it is a marketing strategy. The only version of privacy I actually believe in is the one where the data physically cannot go anywhere, because there is no network call to send it.

That is the difference between a policy and an architecture. A policy can change with the next funding round. An architecture where the audio never touches a socket cannot quietly betray you, because there is nothing there to betray. When I dictate a message about a client system, or think out loud about something half-baked, I want the guarantee to come from the code, not from a paragraph written by a legal team.

There is a second reason that is purely selfish: it works on a train, on a plane, in a cabin with no signal, in a basement server room with zero bars. Local-first is not only more private, it is more reliable. No outage, no rate limit, no "service temporarily unavailable" on the one morning you needed to fire off ten emails fast.

What it is actually like to build a tiny native macOS tool

I wrote it in Swift, as a menubar app. No dock icon, no window taking up space, just a little icon up top that sits there until you summon it. I have a soft spot for this category of software: the kind that does one thing, stays out of the way, and that you forget is even running. That is the goal. The best utility is the one you stop noticing.

The fun surprise is how much macOS gives you for free if you build native. A global hotkey that works no matter which app has focus. Microphone access handled by the system permission dialog. Inserting text into the frontmost app like a real keystroke. None of this is glamorous, but it is the stuff that makes a tool feel like it belongs on the machine instead of fighting it.

The less fun part, and the honest part, is the glue:

The press-and-hold timing. Where does a recording start and stop? Hold to talk feels right, but you have to be forgiving about the edges: the quick tap, the long pause mid-sentence, the release that comes a beat too early. That is feel work, and feel work means using it for a week and fixing what annoys you.
Loading the model without freezing the app. A local Whisper model is not tiny. You want the first transcription to be fast and the UI to never block while the thing spins up. Getting that to feel instant took more care than the recording logic did.
Permissions, the macOS rite of passage. Microphone, accessibility for inserting text, the whole dance of system prompts. Native means you live inside Apple's rules, and that is a feature, not a bug: the user can see exactly what the app is allowed to touch.
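
The press-and-hold edges from the first point can be sketched as a small decision function. This is a hedged illustration, not Dictate's code: `HoldToTalkPolicy`, its two thresholds, and the default values are all assumptions picked for the example, and the timestamps are plain seconds rather than anything tied to a real event API.

```swift
// Illustrative sketch: thresholds are guesses to tune by feel, not Dictate's values.
struct HoldToTalkPolicy {
    var minimumHold: Double = 0.25  // seconds; shorter presses count as accidental taps
    var releaseTail: Double = 0.25  // seconds of extra audio kept after the key is released

    enum Outcome: Equatable {
        case ignore
        case transcribe(audioSeconds: Double)
    }

    /// Decides what to do with one press, given key-down and key-up timestamps in seconds.
    /// Pauses while the key is held never end the recording; only the release does.
    func outcome(pressedAt: Double, releasedAt: Double) -> Outcome {
        let held = releasedAt - pressedAt
        guard held >= minimumHold else { return .ignore }
        return .transcribe(audioSeconds: held + releaseTail)
    }
}

let policy = HoldToTalkPolicy()
print(policy.outcome(pressedAt: 0.0, releasedAt: 0.1))    // quick tap: ignored
print(policy.outcome(pressedAt: 10.0, releasedAt: 12.0))  // normal dictation: transcribed
```

Keeping the decision in a pure function like this means the "feel" lives in two numbers you can adjust after a week of use, without touching the recording code.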

None of this is hard computer science. All of it is the kind of thing you only get right by being the first annoyed user. I built it for myself, so the bar was simple: would I keep it installed? It passed, because now I genuinely reach for it without thinking.

The thing nobody tells you about local models in 2026

A few years ago "run a good speech model on your laptop, offline, fast enough to feel instant" was a research demo, not a Tuesday. Now it is just a dependency. The model is good enough, the hardware is fast enough, and the tooling has quietly gotten boring in the best way. That shift is the actual story here. The reason I could build this in my spare time is that local AI crossed the line from impressive to usable, and most people have not updated their instincts to match.

We have been trained to assume that anything smart must live in the cloud. For a lot of tasks, that assumption is now just wrong. Transcription is the perfect example: it is exactly the kind of small, well-scoped, high-frequency job that has no business leaving your device. Sending your voice to a datacenter to get back some text is like mailing a letter to the post office across the street so they can hand it back to you.

Why I think this matters beyond one menubar app

Dictate is small. It is not going to change anyone's life. But it is a concrete answer to a question I think more builders should ask: does this actually need the cloud, or did I just default to it? Most of the time the honest answer is that we reached for someone else's server out of habit, not necessity.

Local-first is not nostalgia and it is not paranoia. It is just taking the obvious position that your voice, your text, and your half-formed thoughts are yours, and that the burden of proof should be on anyone who wants to copy them somewhere you cannot see. Build the small private thing. Keep your data on your own machine. The tooling is ready now, and the only thing missing is the decision to stop shipping yourself off to a server you do not control.

The code is open. Read it, run it, fork it, tear it apart: github.com/0x0ndra/dictate. That openness is the point too. You should not have to trust me either. You should be able to check.