Hi, thanks for maintaining this fantastic project. We're using whisper-asr-webservice with the faster-whisper backend and have been seeing great results overall.
I’ve encountered a known Whisper behavior on long audio files: over time, the transcription loses punctuation and capitalization and degrades into lower-quality, all-lowercase text. This appears to happen when the model drifts without a context reset, or when the audio is segmented either too aggressively or not aggressively enough.
Results varied across the two GPUs I tested (an RTX 3060 and an RTX 3090). One of our users was able to transcribe all 10 test files successfully on the RTX 3060 with the small.en model using the following settings:
vad_threshold = 0.9 # (RTX 3090 worked great with 0.7)
vad_min_silence = 10000
asr_prompt_reset_on_temperature = 0.3
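
For context, here is a minimal sketch of how I believe these settings map onto faster-whisper's `transcribe()` call. The names above are just the ones from my config; the mapping to `vad_parameters` and `prompt_reset_on_temperature` is my reading of the faster-whisper API, so please correct me if the webservice would wire them differently:

```python
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda")

segments, info = model.transcribe(
    "long_audio.wav",
    vad_filter=True,
    vad_parameters={
        "threshold": 0.9,                 # vad_threshold (0.7 worked on the RTX 3090)
        "min_silence_duration_ms": 10000, # vad_min_silence
    },
    prompt_reset_on_temperature=0.3,      # asr_prompt_reset_on_temperature
)

for segment in segments:
    print(segment.text)
```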
It would be great to expose these parameters in the API request so they can be tuned per call. That would make it easier to accommodate different hardware environments and audio characteristics.
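
As a rough sketch of what this could look like from the client side, assuming the existing `/asr` endpoint and `audio_file` form field are kept as-is (the three VAD/prompt parameters below are only my proposed names, not something the API supports today):

```python
import requests

# Hypothetical request shape: vad_threshold, vad_min_silence, and
# prompt_reset_on_temperature are the proposed per-request parameters.
with open("long_audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:9000/asr",
        params={
            "output": "json",
            "vad_filter": True,
            "vad_threshold": 0.9,                # proposed
            "vad_min_silence": 10000,            # proposed (ms)
            "prompt_reset_on_temperature": 0.3,  # proposed
        },
        files={"audio_file": f},
    )

print(resp.json())
```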