[Feature request] Additional API parameters for VAD threshold/silence and prompt reset to improve long audio punctuation quality #332

@dev-jesser

Hi, thanks for maintaining this fantastic project. We're using whisper-asr-webservice with the faster-whisper backend and have been seeing great results overall.

I’ve run into a known Whisper behavior on long audio files: as transcription progresses, the output loses punctuation and capitalization and degrades into all-lowercase text. This appears to happen when the model drifts without a context reset, or when the audio is segmented either too aggressively or not aggressively enough.

Results varied between the two GPUs tested (RTX 3060 and RTX 3090). One tester successfully processed 10 test files on the RTX 3060 with the small.en model using the following settings:

vad_threshold = 0.9 # (RTX 3090 worked great with 0.7)
vad_min_silence = 10000 
asr_prompt_reset_on_temperature = 0.3

It would be great to support these parameters as part of the API request. This would allow for more flexible handling of different hardware environments and audio types.
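
To make the request concrete, here is a minimal sketch of how the three settings above might be mapped onto the arguments of faster-whisper's `transcribe()` (`vad_filter`, `vad_parameters`, `prompt_reset_on_temperature`). The helper name `build_transcribe_kwargs` is hypothetical and not part of whisper-asr-webservice, and the sketch assumes `vad_min_silence` is given in milliseconds and corresponds to `min_silence_duration_ms`.

```python
from typing import Any, Dict, Optional


def build_transcribe_kwargs(
    vad_threshold: Optional[float] = None,
    vad_min_silence: Optional[int] = None,
    asr_prompt_reset_on_temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Translate optional request parameters into transcribe() keyword arguments.

    Hypothetical helper: parameters left as None are omitted so the
    backend defaults stay in effect.
    """
    kwargs: Dict[str, Any] = {}
    vad_parameters: Dict[str, Any] = {}

    if vad_threshold is not None:
        # The VAD threshold is a speech probability, so it must lie in [0, 1].
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be between 0 and 1")
        vad_parameters["threshold"] = vad_threshold

    if vad_min_silence is not None:
        # Assumption: the value is in milliseconds.
        if vad_min_silence < 0:
            raise ValueError("vad_min_silence must be non-negative")
        vad_parameters["min_silence_duration_ms"] = int(vad_min_silence)

    if vad_parameters:
        # VAD options only take effect when the VAD filter is enabled.
        kwargs["vad_filter"] = True
        kwargs["vad_parameters"] = vad_parameters

    if asr_prompt_reset_on_temperature is not None:
        kwargs["prompt_reset_on_temperature"] = asr_prompt_reset_on_temperature

    return kwargs


# Settings that worked on the RTX 3060 with small.en
rtx3060_kwargs = build_transcribe_kwargs(
    vad_threshold=0.9,
    vad_min_silence=10000,
    asr_prompt_reset_on_temperature=0.3,
)
```

The resulting dictionary could then be passed along as `model.transcribe(audio, **rtx3060_kwargs)`. Leaving unset parameters out means existing requests would behave exactly as they do today.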
