心喜你暖然似春

spectrograms are basic tools for speech and audio analysis.

formant: formants are the resonant frequencies of the vocal tract, which appear as peaks in the speech spectrum.
speech elements:

vowels: voiced sounds, such as a, e, etc.
fricatives: fricatives are the phonemes that are produced by a constriction in the vocal tract. They don't usually contain many resonant frequencies (formants), but have content across the frequency spectrum.
plosives: plosives are transient bursts created by closure of the vocal tract followed by release, which can be voiced or unvoiced.
diphthongs / glides: glides are characterised by spectral movement of formants over time.
nasals: nasals are resonant sounds produced by vibration within the nasal cavity.

identifying these elements in a spectrogram:

formant: horizontal bands
fricatives: vertical bands of flat spectrum 'noise'
vowels: several formants sustained over a long period
glides / diphthongs: moving formants; a horizontal formant band that drifts gradually up or down
plosives: a stop (gap) in the spectrum before a vowel or other sound
nasals: two or more formants, usually with a fairly large gap in between where a formant is missing; like vowels but with a 'hole' in the lower spectrum

signal processing stages to produce a spectrogram:

signal segmentation and windowing
transformation to frequency domain via DFT with zero padding
Log magnitude spectrum and stacking vectors in a matrix
magnitude to colour mapping and display
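
The first three stages above can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name `spectrogram`, the Hann window, and the default values for `win_len`, `hop` and `n_fft` are all assumptions chosen for illustration, and the 1 kHz test tone is synthetic.

```python
import numpy as np

def spectrogram(signal, win_len=256, hop=128, n_fft=512):
    """Return a (frames x bins) log-magnitude spectrogram in dB.

    win_len, hop and n_fft are illustrative defaults, not values from the notes.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    # Stage 1: segmentation and windowing
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # Stage 2: DFT; rfft zero-pads each frame from win_len up to n_fft
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Stage 3: log magnitude, one row per frame stacked into a matrix
    return 20.0 * np.log10(np.abs(spectrum) + 1e-10)

# Synthetic test signal: one second of a 1 kHz tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)

S = spectrogram(tone)
peak_bin = int(np.argmax(S.mean(axis=0)))
peak_hz = peak_bin * fs / 512

print(S.shape)   # (61, 257)
print(peak_hz)   # 1000.0
```

The last stage, colour mapping and display, is left out here; the matrix `S` (transposed so that frequency runs vertically) could be handed to a plotting routine such as matplotlib's `imshow`.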

four main parameters / choices in producing a spectrogram:

sampling rate
DFT length
segmentation window length / zero padding length
overlap

spectral resolution

narrowband (long window) spectrogram makes harmonic structure clear

bandwidth of 45~50 Hz, e.g. for Fs = 44.1 kHz the FFT size should be ~1024
harmonics are associated with the glottal source

wideband (short window) spectrogram makes formant structure clear

bandwidth of 300~500 Hz, e.g. for Fs = 44.1 kHz the FFT size should be ~128
shows dark formant bands that change with the vowel, not the pitch
formants are associated with the 'filter properties' of the vocal tract above the larynx
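
The FFT sizes quoted above can be checked with simple arithmetic, assuming the bandwidth is approximated by the DFT bin spacing Fs / N. That approximation is an assumption of this sketch (the true analysis bandwidth also depends on the window shape), and the helper names are hypothetical.

```python
def bin_spacing_hz(fs, n_fft):
    """Approximate spectral resolution as the DFT bin spacing Fs / N."""
    return fs / n_fft

def window_ms(fs, n_fft):
    """Duration of an n_fft-sample window in milliseconds."""
    return 1000.0 * n_fft / fs

fs = 44100
narrow = bin_spacing_hz(fs, 1024)  # about 43 Hz, resolves individual harmonics
wide = bin_spacing_hz(fs, 128)     # about 345 Hz, smears harmonics into formants

print(round(narrow, 2), round(window_ms(fs, 1024), 1))  # 43.07 23.2
print(round(wide, 2), round(window_ms(fs, 128), 1))     # 344.53 2.9
```

The long window (about 23 ms) spans several pitch periods, so harmonics are resolved; the short window (about 3 ms) is shorter than a typical pitch period, so only the broad formant envelope survives.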


1、Problem Formulation: Frequency representations need to be invariant to pitch changes
Factors that must be considered when building an ASR system:
- timing variation
- loud / quiet speech
- speaker effects, such as gender, accents and vocal mannerisms
- contextual effects

2、speech feature extraction
① speech features are compact representations, extracted from the audio signal, that highlight its distinguishing information.
② source-filter theory
source is the excitation signal, such as oscillation of the glottis; filter is the effect of the time-varying vocal tract; a speech signal can be considered as the convolution of the source and the filter.
③ a typical speech encoder: how source and filter are estimated
source can be generated using a white noise generator for an unvoiced sound or a pitch detector / pulse generator for a voiced sound. filter can be estimated using an LPC filter or a suitably defined filter-bank.
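④ cepstrum analysis: separates the source and filter elements of speech using spectral methods. Convolution in time becomes multiplication in the frequency domain, and the log turns that product into a sum, so the two elements end up in different regions of the cepstrum:
IDFT( log |DFT( s(t) )| )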
⑤ LPC features
all-pole parameter estimation
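
Items ② to ⑤ can be tied together in one small experiment. The following is a minimal sketch under stated assumptions: the source is a pulse train (voiced) or white noise (unvoiced), the filter is a made-up two-pole resonance `a_true`, the cepstrum is used to read off the pitch period, and LPC is done with the autocorrelation method by solving the normal equations directly. All function names, the pitch period of 80 samples and the filter coefficients are hypothetical choices for illustration, not values from these notes.

```python
import numpy as np

def all_pole_filter(source, a):
    """Apply 1/A(z) with A(z) = 1 + a[0] z^-1 + a[1] z^-2 + ...

    Difference equation: y[n] = x[n] - sum_k a[k] * y[n-1-k].
    """
    y = np.zeros(len(source))
    for n in range(len(source)):
        past = sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        y[n] = source[n] - past
    return y

def real_cepstrum(frame, n_fft):
    """IDFT(log |DFT(frame)|), the formula from item 4."""
    log_mag = np.log(np.abs(np.fft.fft(frame, n=n_fft)) + 1e-10)
    return np.fft.ifft(log_mag).real

def lpc_autocorrelation(frame, order):
    """Estimate all-pole coefficients from the normal equations R a = -r."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, -r[1:])

# Hypothetical "vocal tract": one stable resonance (pole radius about 0.89)
a_true = np.array([-1.3, 0.8])
pitch_period = 80  # samples, an assumed value

# Voiced sound: pulse-train source convolved with the filter, then windowed
voiced_source = np.zeros(1024)
voiced_source[::pitch_period] = 1.0
voiced = all_pole_filter(voiced_source, a_true) * np.hamming(1024)

# Source shows up at high quefrency: search past the filter's low-quefrency part
cep = real_cepstrum(voiced, n_fft=1024)
estimated_period = 40 + int(np.argmax(cep[40:400]))

# Unvoiced sound: white-noise source through the same filter
unvoiced_source = np.random.default_rng(0).standard_normal(5000)
unvoiced = all_pole_filter(unvoiced_source, a_true)
a_est = lpc_autocorrelation(unvoiced, order=2)

print(estimated_period)  # expected to land at or next to pitch_period
print(a_est)             # expected to be close to a_true
```

The cepstral search starts at quefrency 40 because the filter's contribution decays quickly and sits near zero quefrency, while the pulse-train source produces a peak at the pitch period. Solving the normal equations with `np.linalg.solve` is the simplest route for a sketch; the Levinson-Durbin recursion is the usual, faster way to solve the same system.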

3、Linguistic categories for speech recognition
- phone and phoneme
- IPA
- allophone: in phonetics, an allophone is one of a set of multiple possible spoken sounds (or phones) used to pronounce a single phoneme. For example, [pʰ] (as in pin) and [p] (as in cap) are allophones of the phoneme /p/ in the English language. Speakers treat them as the same phoneme, even though they are pronounced differently.

4、statistical sequence recognition: Hidden Markov Models