[Freeswitch-users] Detecting the origin of voice activity using VAD
Steve Underwood
steveu at coppice.org
Mon Mar 2 16:08:27 PST 2009
Andy Spitzer wrote:
> Woof!
>
> On Sun, 01 Mar 2009 21:28:18 -0500, Brian West <brian at freeswitch.org> wrote:
>
>
>> NO. You want something that people THINK exists and works well...
>> Reliable human/voice detection doesn't exist in ANY form.
>>
>
> I beg to differ. See http://www.freepatentsonline.com/5521967.html for one way to do it. It works rather well and can quickly discriminate between voice and tone. I've no idea who owns that patent now (not me, for sure).
>
Since when did a patent mean a problem is solved? For things like speech
recognition you can achieve pretty high accuracy in voice detection, but
in that case you can delay the audio and make decisions that span the
start of the speech burst. For most telephony purposes you need to make
a decision on the very first frame of speech, as you can't afford to add
latency. That turns it into a tough problem. Something like the VAD in
G.729 is about the best people can currently do, but it's far from perfect.
> There is a simpler, less reliable way of differentiating voice from tone that, as far as I know, isn't patented. If you compare the RMS power levels of sequential 40 ms periods, call progress tones will have very consistent power levels from one period to the next. So if 5 or more 40 ms periods have about the same power measurement (within, say, 2%), it's a tone. Voice will have dramatic power level differences over that same period. This works very well in today's telephony environment, where tones are computer generated. In the old days, when ringback tone was generated off the audio hum from the 20 Hz ring voltage generator...not so well.
>
That is *not* VAD. What you describe just says "is its energy steady". It
will trigger on music, background noise and maybe even some of the fast
pulsed tone signals. A proper VAD won't.
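
For reference, the steady-energy check described in the quoted text could be
sketched roughly as below. This is only an illustration of that heuristic, not
a VAD and not code from any FreeSWITCH module. The function names, the 8 kHz
sample rate and the choice to compare each period against the first period of
the current run are all assumptions; the 40 ms period, the count of 5 and the
2% tolerance come from the description above.

```python
import math

FRAME_MS = 40          # period length taken from the quoted description
MIN_STEADY_FRAMES = 5  # "5 or more" periods
TOLERANCE = 0.02       # "within, say, 2%"


def frame_rms(samples):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_steady_energy(samples, sample_rate=8000):
    """Return True if MIN_STEADY_FRAMES consecutive frames have an RMS level
    within TOLERANCE of the first frame in that run.

    Hypothetical helper: it only answers "is the energy steady", which is
    the objection raised above. The 8 kHz default rate is an assumption.
    """
    frame_len = sample_rate * FRAME_MS // 1000
    run_ref = None   # RMS of the first frame in the current run
    run_len = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        rms = frame_rms(samples[start:start + frame_len])
        if run_ref and abs(rms - run_ref) <= TOLERANCE * run_ref:
            run_len += 1
        else:
            # Level changed (or the reference was silence): start a new run.
            run_ref = rms
            run_len = 1
        if run_ref and run_len >= MIN_STEADY_FRAMES:
            return True
    return False


if __name__ == "__main__":
    # One second of a constant-amplitude 400 Hz tone at 8 kHz.
    tone = [1000 * math.sin(2 * math.pi * 400 * n / 8000) for n in range(8000)]
    print(is_steady_energy(tone))
```

Any signal whose level stays flat across five periods passes this test, whether
or not it is a call progress tone, so it cannot tell voice from non-voice on
its own.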
Regards,
Steve