convertToCaptions() (v4.0.131)

danger

This API assumes a newer version of Whisper.cpp than the stable release in order to support tokenLevelTimestamps. The downside is that this version may crash unexpectedly. If you prefer stability, use an older version of Whisper.cpp (1.0.54 or earlier) and forgo tokenLevelTimestamps support.

Opinionated function that converts the output of transcribe() into easily digestible captions.
It can also combine words with close timestamps.
This is useful for TikTok- or Reel-style videos that animate captions word by word.

transcribe.mjs
tsx
import path from "path";
import { transcribe, convertToCaptions } from "@remotion/install-whisper-cpp";

const { transcription } = await transcribe({
  inputPath: "/path/to/audio.wav",
  whisperPath: path.join(process.cwd(), "whisper.cpp"),
  model: "medium.en",
  tokenLevelTimestamps: true,
});

const { captions } = convertToCaptions({
  transcription,
  combineTokensWithinMilliseconds: 200,
});

for (const line of captions) {
  console.log(line.text, line.startInSeconds);
}

Options

transcription

The transcription object that you retrieved from transcribe().
The tokenLevelTimestamps option must have been set to true.

combineTokensWithinMilliseconds

Combines words whose timestamps are within this many milliseconds of each other.
If words are not combined, some may be displayed only very briefly when captions are shown word by word.
Set to 0 to disable combining.
Recommendation: 200.
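
To illustrate the idea of combining tokens by time proximity, here is a minimal sketch. It is not the library's implementation: combineCloseTokens and TimedToken are hypothetical names, and the rule used here (merge a token into the previous group when it starts within the threshold of the previous token's start) is an assumption made for illustration only.

```ts
type TimedToken = { text: string; startInSeconds: number };

// Hypothetical helper, not part of @remotion/install-whisper-cpp.
// Merges a token into the previous group when it starts within
// `thresholdMs` of the previous token's start time.
const combineCloseTokens = (
  tokens: TimedToken[],
  thresholdMs: number,
): TimedToken[] => {
  const combined: TimedToken[] = [];
  let lastStart = -Infinity;

  for (const token of tokens) {
    const previous = combined[combined.length - 1];
    const gapMs = (token.startInSeconds - lastStart) * 1000;

    if (previous && thresholdMs > 0 && gapMs <= thresholdMs) {
      // Close enough: append the text, keep the group's start time.
      previous.text += token.text;
    } else {
      // Too far apart (or combining disabled): start a new group.
      combined.push({ ...token });
    }

    lastStart = token.startInSeconds;
  }

  return combined;
};
```

In this sketch, a threshold of 0 leaves every token as its own caption, mirroring the documented way to disable combining.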

Return value

An object of the following shape:

ts
type Caption = {
  text: string;
  startInSeconds: number;
};

type ReturnValue = {
  captions: Caption[];
};
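
Because a Caption carries only a start time, the moment a caption stops being shown has to be derived, for example from the start of the next caption. The following sketch looks up which caption is active at a given time. findActiveCaption is a hypothetical helper, not part of the package, and it assumes the captions are sorted by startInSeconds.

```ts
type Caption = { text: string; startInSeconds: number };

// Hypothetical helper: returns the last caption that has already started
// at `timeInSeconds`, or null if no caption has started yet.
// Assumes `captions` is sorted by startInSeconds in ascending order.
const findActiveCaption = (
  captions: Caption[],
  timeInSeconds: number,
): Caption | null => {
  let active: Caption | null = null;

  for (const caption of captions) {
    if (caption.startInSeconds > timeInSeconds) {
      break;
    }
    active = caption;
  }

  return active;
};
```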

Suggested usage

This shows how, given a data structure produced by convertToCaptions(), word-by-word captions can be rendered in a Remotion project.
See our TikTok template for a full reference implementation.

note

@remotion/install-whisper-cpp is a Node.js API and cannot be imported on the frontend.
Only the TypeScript type is imported in this example.

tsx
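import type { Caption } from "@remotion/install-whisper-cpp";
import { Sequence, useVideoConfig } from "remotion";

// <Subtitle> is assumed to be your own component that renders one caption.
const Captions: React.FC<{
  subtitles: Caption[];
}> = ({ subtitles }) => {
  const { fps } = useVideoConfig();

  return (
    <>
      {subtitles.map((subtitle, index) => {
        const nextSubtitle = subtitles[index + 1] ?? null;
        const subtitleStartFrame = subtitle.startInSeconds * fps;
        // Show each caption until the next one starts, but for at most one second.
        const subtitleEndFrame = Math.min(
          nextSubtitle ? nextSubtitle.startInSeconds * fps : Infinity,
          subtitleStartFrame + fps,
        );
        const durationInFrames = subtitleEndFrame - subtitleStartFrame;
        // <Sequence> needs a positive duration, so skip captions that would get none.
        if (durationInFrames <= 0) {
          return null;
        }

        // The key belongs on the outermost element returned from map().
        return (
          <Sequence
            key={index}
            from={subtitleStartFrame}
            durationInFrames={durationInFrames}
          >
            <Subtitle text={subtitle.text} />
          </Sequence>
        );
      })}
    </>
  );
};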
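
The timing arithmetic in the component can also be pulled out into a pure function, which makes it easy to unit-test outside of Remotion. This is a sketch under assumptions: getCaptionFrameRange and FrameRange are hypothetical names, and rounding start times to whole frames is a choice made here, not something the example above does.

```ts
type Caption = { text: string; startInSeconds: number };
type FrameRange = { from: number; durationInFrames: number };

// Hypothetical helper: computes when the caption at `index` should be shown.
// A caption is shown until the next one starts, but for at most one second.
// Returns null if the index is out of range or the duration would not be positive.
const getCaptionFrameRange = (
  captions: Caption[],
  index: number,
  fps: number,
): FrameRange | null => {
  const caption = captions[index];
  if (!caption) {
    return null;
  }

  const next = captions[index + 1];
  // Assumption: snap start times to whole frames.
  const from = Math.round(caption.startInSeconds * fps);
  const end = Math.min(
    next ? Math.round(next.startInSeconds * fps) : Infinity,
    from + fps,
  );
  const durationInFrames = end - from;

  return durationInFrames > 0 ? { from, durationInFrames } : null;
};
```

The returned from and durationInFrames values can be passed directly to a Sequence, and a null result signals that the caption should be skipped.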

See also