Thursday, May 19, 2011 | By: 六便士之歌

Spectrogram review

Spectrograms are basic tools for speech and audio analysis.
  • formant: formants are the resonant frequencies of the vocal tract that appear in speech.
  • speech elements:
    • vowels: voiced sounds, such as a, e, etc.
    • fricatives: fricatives are phonemes produced by a constriction in the vocal tract. They don't usually contain many resonant frequencies (formants), but have content across the frequency spectrum.
    • plosives: plosives are transient bursts created by closure of the vocal tract followed by release; they can be voiced or unvoiced.
    • diphthongs / glides: glides are characterised by movement of the formants over time.
    • nasals: nasals are resonant sounds produced by vibration within the nasal cavity.
  • identifying these elements in a spectrogram:
    • formant: horizontal bands
    • fricatives: vertical bands of flat spectrum 'noise'
    • vowels: several formants sustained over a long period
    • glides / diphthongs: moving formants; gradual vertical movement of a horizontal formant band
    • plosives: a stop (gap) in the spectrum, typically before a vowel
    • nasals: two or more formants, usually with a fairly large gap in between where a formant is missing; like vowels but with a 'hole' in the lower spectrum
  • signal processing stages to produce a spectrogram:
    • signal segmentation and windowing
    • transformation to frequency domain via DFT with zero padding
    • log-magnitude spectrum of each segment, with the resulting vectors stacked into a matrix
    • magnitude to colour mapping and display
  • four main parameters / choices in producing a spectrogram:
    • sampling rate
    • DFT length
    • segmentation window length / zero padding length
    • overlap
  • spectral resolution
    • narrowband (long window) spectrogram makes harmonic structure clear
      • bandwidth of 45~50 Hz, e.g. for Fs = 44.1 kHz the FFT size should be ~1024
      • associated with the glottal source
    • wideband (short window) spectrogram makes formant structure clear
      • bandwidth of 300~500 Hz, e.g. for Fs = 44.1 kHz the FFT size should be ~128
      • dark formant bands that change with vowels, not pitch
      • formants are associated with the 'filter properties' of the vocal tract above the larynx
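
The processing stages and the narrowband / wideband trade-off above can be sketched in a few lines of NumPy. This is only a rough illustration: the function name `spectrogram`, the Hann window, the 50% overlap and the 1 kHz test tone are my own assumptions, and Fs divided by the window length is used as a rough bandwidth figure. The final colour-mapping and display stage is left out.

```python
import numpy as np

def spectrogram(signal, fs, win_len, hop, n_fft):
    """Log-magnitude spectrogram: rows are frequency bins, columns are frames."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Stage 1: segmentation and windowing
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # Stage 2: DFT of each frame; rfft zero-pads to n_fft when n_fft > win_len
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Stage 3: log magnitude (dB); small offset avoids log(0)
    log_mag = 20 * np.log10(np.abs(spectrum) + 1e-10)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    return log_mag.T, freqs

fs = 44100
t = np.arange(fs) / fs                 # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)    # assumed test signal: 1 kHz sine

# Narrowband: long window (1024 samples), no extra zero padding
narrow, f_narrow = spectrogram(tone, fs, win_len=1024, hop=512, n_fft=1024)
# Wideband: short window (128 samples), zero-padded to 1024 points
wide, f_wide = spectrogram(tone, fs, win_len=128, hop=64, n_fft=1024)

print("narrowband: about %.0f Hz per window" % (fs / 1024))
print("wideband:   about %.0f Hz per window" % (fs / 128))
```

The long window gives fewer, finer-resolution frames; the short window gives many more frames with coarser frequency detail, which is why it blurs the harmonics and leaves the formant bands.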

Wednesday, May 18, 2011 | By: 六便士之歌

Automatic Speech Recognition (ASR) review

1、Problem Formulation: Frequency representations need to be invariant to pitch changes
Factors that must be considered when building an ASR system:
- timing variation
- loud / quiet speech
- speaker effects, such as gender, accents and vocal mannerisms
- contextual effects

2、speech feature extraction
① speech features are compact representations that highlight the distinguishing information extracted from the audio signal.
② source-filter theory
the source is the excitation signal, such as the oscillation of the glottis; the filter is the effect of the time-varying vocal tract; a speech signal can be considered as the convolution of the source and the filter.
③ a typical speech encoder: how source and filter are estimated
the source can be generated using a white noise generator for an unvoiced sound, or a pitch detector / pulse generator for a voiced sound. The filter can be estimated using an LPC filter or a suitably defined filter-bank.
④ cepstrum analysis: to separate the source and filter elements of speech using spectral methods:
     idft (log |dft (s(t))|)
⑤ LPC features: all-pole parameter estimation
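
To make the source-filter idea and the cepstrum formula in ④ more concrete, here is a small NumPy sketch. Everything in it is an assumption made for illustration: the 8 kHz sampling rate, the 100 Hz impulse-train source, the single 700 Hz decaying resonance standing in for the vocal tract, the Hamming window and the 50 to 400 Hz pitch search range. It builds a toy voiced sound by convolution and then looks for the pitch peak in idft(log|dft(s)|).

```python
import numpy as np

fs = 8000    # assumed sampling rate (Hz)
f0 = 100     # assumed pitch of the voiced source (Hz)
n = 2048     # number of samples analysed

# Source: impulse train with one pulse per pitch period
source = np.zeros(n)
source[:: fs // f0] = 1.0

# Filter: toy vocal-tract impulse response, one decaying resonance at 700 Hz
k = np.arange(200)
vocal_tract = np.exp(-k / 30.0) * np.cos(2 * np.pi * 700 * k / fs)

# Speech as the convolution of source and filter
speech = np.convolve(source, vocal_tract)[:n]

def real_cepstrum(x):
    """idft(log|dft(x)|) of a Hamming-windowed frame."""
    spectrum = np.fft.fft(x * np.hamming(len(x)))
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)).real

cep = real_cepstrum(speech)

# The filter lives at low quefrency; the source shows up as a peak at the
# pitch period. Search quefrencies corresponding to 400 Hz down to 50 Hz.
lo, hi = fs // 400, fs // 50
peak = lo + np.argmax(cep[lo:hi])
pitch_estimate = fs / peak
print("estimated pitch: %.1f Hz" % pitch_estimate)
```

Because the logarithm turns the product of source and filter spectra into a sum, the slowly varying filter part and the rapidly varying harmonic part end up in different regions of the cepstrum, which is what allows them to be separated.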

3、Linguistic categories for speech recognition
- phone and phoneme
- IPA
- allophone: in phonetics, an allophone is one of a set of multiple possible spoken sounds (or phones) used to pronounce a single phoneme. For example, [pʰ] (as in pin) and [p] (as in cap) are allophones of the phoneme /p/ in the English language. Speakers treat them as the same phoneme, even though they are pronounced differently.

4、statistical sequence recognition: Hidden Markov Models