
What does Katana actually do under the hood?

Sam Bhattacharyya, Founder of Katana

The UI is simple, so the details go here

When you upload a video to Katana, you'll see a progress bar saying that it's "Analyzing the video..."

Uploading a video to Katana

I didn't explain what's going on, either on that page or on the landing page - not because I'm trying to hide anything, but because adding a laundry list of features would clutter the page. Some other tools mention "AI this" and "AI that". It's annoying, and I like when things are simple.

That said, I want to also be up front and explicit about what I / Katana actually does, so the compromise I came up with is to keep it simple on the landing page, and put the details in an explainer article. This is that article.

The laundry list

I'll start with the laundry list of what happens, and then add some details as to how it's implemented in Katana.

  • Transcribe the audio (transcription)
  • Find timestamps for each spoken word
  • Figure out who said what (diarization)
  • Find the names of the speakers
  • Find the out-takes
  • Break the video into sections
  • Identify filler words
  • Figure out who is speaking when

I hate when people keep throwing around the term "AI". A lot of tools just say they do AI this or AI that, which makes them even less transparent. So rather than mentioning AI, I'll talk about it in terms of tasks.

I won't go into details of how each of these steps is implemented at a deep code level, partly because most users don't care, and partly because this industry is so ruthless that well-funded, venture-capital-backed companies will happily steal any good ideas from a solo developer offering free tools, and then use their marketing budget to make money off of them.

Suffice it to say that all of those tasks involve neural networks and/or other (non-neural-network) machine-learning models. Some of those neural networks were built by others, and some were built by me.

I also may change how they are implemented in the future, but I'll talk about some high-level implementation details that are relevant to the user experience.

Video, Audio and Transcript

For each of the tasks above, you have some source data that you feed to the model. The model then "transforms" the data and gives you something you can parse to get the desired output.

Some tasks only require the audio - for example, the transcription tasks.

Some tasks require the transcript, like figuring out the names of the speakers.

Some tasks require a combination of audio, video and transcript, like figuring out who is speaking when.
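
To make that concrete, here's a rough sketch of the task-to-input mapping in TypeScript. The names and most of the groupings are illustrative guesses, not Katana's actual code - only the three cases called out above come from this article:

```typescript
// Hypothetical sketch of the task/input mapping described above -
// these names don't come from Katana's actual codebase.
type Source = 'audio' | 'video' | 'transcript';

type Task =
  | 'transcribe'
  | 'word-timestamps'
  | 'diarization'
  | 'speaker-names'
  | 'out-takes'
  | 'sections'
  | 'filler-words'
  | 'active-speaker';

// Each task declares which source data it needs before it can run.
const requiredInputs: Record<Task, Source[]> = {
  'transcribe':      ['audio'],
  'word-timestamps': ['audio'],
  'diarization':     ['audio'],
  'speaker-names':   ['transcript'],
  'out-takes':       ['transcript'],
  'sections':        ['transcript'],
  'filler-words':    ['transcript'],
  'active-speaker':  ['audio', 'video', 'transcript'],
};
```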

Who is speaking when?

When I designed Katana, I spent a lot of time thinking about how to build it in a way that accomplishes all of these tasks with a high degree of accuracy (the last thing you want is to show the wrong speaker when doing multi-camera switching) while also being fast. That design has some quirks, and while I hope they don't harm the user experience, they're worth mentioning.
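
To illustrate what's at stake: once diarization has produced labeled speech segments, multi-camera switching reduces to mapping "who is speaking at time t" to a camera. A toy sketch - the types and logic are illustrative, not Katana's actual model or code:

```typescript
// Toy sketch of diarization-driven camera switching.
interface SpeechSegment {
  speaker: string; // a diarization label, e.g. "SPEAKER_01"
  start: number;   // seconds
  end: number;     // seconds
}

// Map each speaker label to the camera that frames that person.
const cameraForSpeaker: Record<string, number> = {
  SPEAKER_00: 0,
  SPEAKER_01: 1,
};

// Pick which camera should be live at time t. A single mislabeled
// segment here means showing the wrong face on screen.
function activeCamera(segments: SpeechSegment[], t: number): number {
  const current = segments.find(s => t >= s.start && t < s.end);
  // Fall back to camera 0 (e.g. a wide shot) during silence.
  return current ? (cameraForSpeaker[current.speaker] ?? 0) : 0;
}
```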

Server-side vs. client-side processing

Running AI models and rendering video are both computationally intensive tasks. The computation needed for AI-augmented video editing has to happen somewhere, so many services like Descript, OpusClips and Riverside rent servers from cloud providers like AWS or Google Cloud.

The first step to using those tools is usually to upload a video or send a link to a video, and so all of the AI processing and video rendering happens on their servers before providing you with results.

Not every tool does everything on their servers. ClipChamp is a really user-friendly video editing tool where all the processing happens on your computer. You might think that would slow down your computer, but if you actually use the tool, you'll find it's much smoother and faster than all the other tools I mentioned above.

ClipChamp can get away with this because AI features aren't a core part of its product. For Katana, I wanted to make an experience as smooth as Canva or ClipChamp, but with AI features to automate everything.
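
I won't spell out exactly how Katana splits the work, but as an illustration, a browser-based tool can check at runtime whether the machine has the APIs needed for local processing before committing to it. A minimal sketch, where the gating criteria and routing decision are assumptions (VideoDecoder from WebCodecs and MediaRecorder are real browser APIs):

```typescript
// Illustrative capability check before committing to client-side
// processing. The decision logic itself is just an example.
function canProcessLocally(): boolean {
  const hasWebCodecs = 'VideoDecoder' in globalThis;
  const hasRecorder = 'MediaRecorder' in globalThis;
  return hasWebCodecs && hasRecorder;
}

// Run models and rendering locally when the browser supports it,
// otherwise fall back to a server.
const strategy = canProcessLocally() ? 'client' : 'server';
console.log(`Processing strategy: ${strategy}`);
```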

Analyzing the video on your computer, as you watch it

The compromise I settled on was to do some processing (like rendering the video) on your computer, and some processing (like transcribing the audio) on the server side. If you see a white bar in the video player, that represents how much of the video has been analyzed.

Debugging the AI model
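
As for what the white bar tracks: presumably just how far the analysis has progressed through the file. A minimal sketch of chunked analysis with a progress callback - the function names are hypothetical, not Katana's API:

```typescript
// Illustrative sketch: analyze a recording in fixed-size chunks and
// report progress so the UI can widen a bar as work completes.
async function analyzeInChunks(
  durationSec: number,
  chunkSec: number,
  analyzeChunk: (startSec: number, endSec: number) => Promise<void>,
  onProgress: (fraction: number) => void,
): Promise<void> {
  for (let start = 0; start < durationSec; start += chunkSec) {
    const end = Math.min(start + chunkSec, durationSec);
    await analyzeChunk(start, end); // run the models on this slice
    onProgress(end / durationSec);  // e.g. grow the white bar
  }
}

// Usage: analyzeInChunks(video.duration, 30, runModels, updateBar);
```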

I agree, it's annoying that you can't watch the full video right away, but (1) it's less annoying than waiting 15 minutes for a video on OpusClips (with Katana, you can see results in less than a minute), and (2) I'm working on speeding everything up.

Video is rendered on your computer

The other quirk worth mentioning is that when you export a video, it's rendered on your computer.

Exporting the video

On slower computers it can take a while to render the video. In that case, exporting the video takes much longer than any AI processing, and there's probably nothing I can do to speed up 4K rendering on a Chromebook.
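
For context on why export speed depends so much on your machine: in-browser rendering means drawing every frame and re-encoding the result locally. Here's a minimal sketch using standard browser APIs (canvas.captureStream plus MediaRecorder); the drawFrame compositing step is a hypothetical placeholder, and this is not how Katana necessarily does it:

```typescript
// Minimal in-browser render-and-encode sketch using standard APIs.
function exportVideo(
  canvas: HTMLCanvasElement,
  drawFrame: (timeSec: number) => void, // hypothetical compositing step
  durationSec: number,
  fps = 30,
): Promise<Blob> {
  const stream = canvas.captureStream(fps);
  const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
  const chunks: Blob[] = [];
  recorder.ondataavailable = e => chunks.push(e.data);

  return new Promise(resolve => {
    recorder.onstop = () => resolve(new Blob(chunks, { type: 'video/webm' }));
    recorder.start();
    let frame = 0;
    const totalFrames = Math.ceil(durationSec * fps);
    const tick = () => {
      drawFrame(frame / fps); // composite this frame onto the canvas
      if (++frame < totalFrames) requestAnimationFrame(tick);
      else recorder.stop();
    };
    requestAnimationFrame(tick);
  });
}
```

Because captureStream records in real time, a naive version like this takes at least as long as the video itself; real editors can encode faster (e.g. with WebCodecs encoders), but the encode still runs on your hardware, which is why 4K on a Chromebook is slow.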

It's free

That said, the fact that the work happens on your computer means I can provide this tool for free. Other tools charge you because they also spend money on servers and need to cover their costs.

I figured out how to get you the same thing without needing to rent servers, and so I'm passing on the savings to you by providing this tool for free.

Not losing your work

Another benefit of doing everything on your computer is that it's a lot harder to mess up. I've seen a number of people complain about tools like Riverside and Streamyard just losing recordings or messing up edits, and knowing how those apps work on the backend, I can see why.

I don't think Descript has this issue, because although they do stuff on their servers, it's built very differently - closer to how ClipChamp and Katana work, just with some 'magic' to gloss over server<>browser communication.
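
The reason local-first is harder to mess up is simple: if the project state lives on your machine, a flaky connection or a server hiccup can't eat your edits. As an illustration (not Katana's actual storage scheme), persisting editor state locally might look like:

```typescript
// Illustrative local persistence of editor state via localStorage.
// A real app would likely use IndexedDB for anything large; the key
// format and state shape here are made up.
interface ProjectState {
  cuts: Array<{ startSec: number; endSec: number }>;
  activeCameraTrack: number[];
}

function saveProject(id: string, state: ProjectState): void {
  localStorage.setItem(`project:${id}`, JSON.stringify(state));
}

function loadProject(id: string): ProjectState | null {
  const raw = localStorage.getItem(`project:${id}`);
  return raw ? (JSON.parse(raw) as ProjectState) : null;
}
```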

Suggestions and feedback

I hope Katana is helpful! I was previously the head of AI at Streamyard, and have been editing video podcasts for myself, friends and clients on Upwork. I built this tool because of how annoying it was to add multi-cam and create good-looking podcast videos with existing editing tools.

Otherwise, if you have any thoughts or feedback, feel free to reach out to me on LinkedIn or Twitter, or at sam@katana.video.