Multi-Scale Modulation Filtering in
Automatic Detection of Emotions in Telephone Speech
Jouni Pohjalainen, Paavo Alku
Department of Signal Processing and Acoustics, Aalto University, Finland
Focus
◮ Detection of emotion classes in telephone speech is becoming important in, e.g., call-center applications
◮ Broad classes (activation/arousal, valence) are possibly more useful than specific classes (joy, boredom, anger, surprise)
◮ Different emotional speech classes generally differ in both short-term timbre distribution and long-term modulation characteristics
  ◮ Short-term: peripheral auditory processing
  ◮ Long-term: neural auditory processing on multiple time scales
◮ A general-purpose method for modeling the long-term acoustic properties of speech classes
◮ Perceptual consideration: model the long-term dynamics on multiple time scales simultaneously

Test Material
◮ The 535 utterances of the Berlin database
◮ Seven emotion classes: anger, boredom, disgust, fear, joy, sadness and neutral
◮ Detection along emotional dimensions:
  ◮ Activation: {anger, fear, happiness} vs {boredom, disgust, neutral, sadness}
  ◮ Valence: {anger, disgust, fear, sadness} vs {boredom, happiness, neutral}
◮ Noise from the NOISEX-92 database is added to simulate far-end noise corruption
◮ Transmission over the GSM channel is simulated
Multi-Scale Filtering
◮ Use N autoregressive filters to generate predictions for the jth feature at the tth frame (a code sketch follows this list):
$$\hat{x}_{j,t,n} = c_{j,n} + \sum_{k=1}^{r} b_{j,k,n}\, x_{j,\,t - s_n k}, \qquad 1 \le n \le N$$
◮ Select the prediction with the lowest squared prediction error:
$$\hat{x}_{j,t} = \underset{\hat{x}_{j,t,n}}{\arg\min}\, \left( x_{j,t} - \hat{x}_{j,t,n} \right)^2$$
◮ After generating the predictions, replace the original features with them
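As a concrete illustration, here is a minimal NumPy sketch of this filtering step. It assumes the per-class AR coefficients b_{j,k,n} and offsets c_{j,n} have already been estimated (e.g., by least squares on that class's training frames); the function name and array layout are illustrative assumptions, not from the poster.

```python
import numpy as np

def multiscale_ar_filter(X, filters, scales):
    """Replace each feature value with the best of N AR predictions.

    X       : (T, J) array of feature trajectories (e.g. MFCCs over frames)
    filters : list of N (b, c) pairs; b has shape (r, J) holding b_{j,k,n},
              c has shape (J,) holding the offsets c_{j,n}
    scales  : list of N lag spacings s_n, one per filter
    """
    T, J = X.shape
    r = filters[0][0].shape[0]
    start = max(s * r for s in scales)      # first frame with full history
    Y = X.copy()
    for t in range(start, T):
        # x_hat[n, j] = c_{j,n} + sum_{k=1..r} b_{j,k,n} * x_{j, t - s_n * k}
        preds = np.stack([
            c + sum(b[k] * X[t - s * (k + 1)] for k in range(r))
            for (b, c), s in zip(filters, scales)
        ])
        err = (X[t] - preds) ** 2           # squared error of each prediction
        best = np.argmin(err, axis=0)       # winning scale n for each feature j
        Y[t] = preds[best, np.arange(J)]    # keep the lowest-error prediction
    return Y
```

Each of the N filters covers a different time scale via its lag spacing s_n, so slow modulations are captured without a prohibitively long predictor at the shortest scale.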
Results

EER scores (%) for the detection of the high-activation emotions anger, fear and happiness. The scores that are statistically significantly better than the baseline in the corresponding noise and channel conditions are indicated in boldface. The maximum prediction lag (determined by the maximum s_n) is varied between 400 and 600 ms in order to investigate the effect of low modulation frequencies; with r = 10, maximum lags of 40-60 frames correspond to 400-600 ms under a typical 10 ms frame shift. For the telephone cases, the system has been trained using telephone speech with high-SNR (30 dB) car noise.

                            Channel and noise condition
                            Original   Telephone (SNR 0 dB)
                            clean      car     factory   babble
baseline MFCCs                7.1      12.7    34.0      22.1
AR: r = 50, s_n = 1           9.0      12.7    32.5      23.9
AR: r = 10, 1 ≤ s_n ≤ 4       7.5      12.3    21.7      17.9
AR: r = 10, 1 ≤ s_n ≤ 6       6.7      10.5    24.3      17.9
AR: r = 10, 1 ≤ s_n ≤ 5       7.1      10.5    22.1      17.9
+ training data selection     7.1       8.2    20.2      16.5
The Detection System
◮ Feature extraction: 12 MFCCs + log energy normalized over the utterance + ∆/∆∆ = 39 features
◮ Feature postprocessing: the proposed filtering method is applied separately for each specific emotion class
◮ Classification decisions: according to frame-averaged log-likelihood ratio statistics $L_X$, the target class is decided if $\max_{X \in A} L_X - \max_{X \in B} L_X > T$ (see the sketch after this list), where
  ◮ A is the set of target emotion classes
  ◮ B is the set of other emotion classes
  ◮ T is the detection threshold
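The decision rule itself is simple. A minimal sketch, assuming per-frame log-likelihoods under each class model are already available (the function name and data layout are assumptions):

```python
import numpy as np

def detect(frame_loglik, A, B, T=0.0):
    """Return True if the utterance is assigned to the target set A,
    i.e. if max_{X in A} L_X - max_{X in B} L_X > T.

    frame_loglik : dict mapping class name -> array of per-frame
                   log-likelihoods under that class's model
    A, B         : target and non-target emotion class names
    T            : detection threshold
    """
    L = {X: float(np.mean(ll)) for X, ll in frame_loglik.items()}  # L_X
    return max(L[X] for X in A) - max(L[X] for X in B) > T
```

Sweeping T trades miss rate against false-alarm rate, which is how the EER operating point in the tables is found.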
◮ Automatic selection of training data: the squared prediction residual $E(t) = \sum_j e_{j,t}^2$, where $e_{j,t} = x_{j,t} - \hat{x}_{j,t}$, is clustered into "low" and "high" clusters, and the low-error frames are used for training (a sketch of one possible realization follows)
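The poster does not specify the clustering method; as a hedged sketch, a scalar two-means split of E(t) could look like this (the function name, initialization, and iteration count are assumptions):

```python
import numpy as np

def select_training_frames(X, X_hat, iters=20):
    """Keep frames whose residual falls into the 'low' cluster of a
    scalar two-means split of E(t) = sum_j e_{j,t}^2.

    X, X_hat : (T, J) arrays of original and predicted features
    Returns a boolean mask over frames.
    """
    E = np.sum((X - X_hat) ** 2, axis=1)    # E(t), one value per frame
    lo, hi = E.min(), E.max()               # initial "low"/"high" centers
    for _ in range(iters):                  # simple scalar 2-means
        low_mask = np.abs(E - lo) <= np.abs(E - hi)
        if low_mask.all() or not low_mask.any():
            break                           # degenerate split; stop early
        lo, hi = E[low_mask].mean(), E[~low_mask].mean()
    return low_mask
```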
EER scores (%) for the detection of the negative-valence emotions anger, disgust, fear and sadness. The scores that are statistically significantly better than the baseline in the corresponding noise and channel conditions are indicated in boldface.

                            Channel and noise condition
                            Original   Telephone (SNR 0 dB)
                            clean      car     factory   babble
baseline MFCCs               22.2      27.8    45.2      36.4
AR: r = 50, s_n = 1          23.5      29.5    41.7      34.8
AR: r = 10, 1 ≤ s_n ≤ 5      21.3      27.8    40.0      32.1
+ training data selection    20.0      25.2    39.4      34.2
Example
◮ Top panel: mel-scale spectrogram, with 40 bins, transformed back from MFCCs for a neutral telephone utterance (original label 03a01Nc) corrupted by car interior noise (SNR 0 dB)
◮ Lower panels: mel-scale spectrograms for the same utterance after filtering the original MFCCs with multi-scale autoregressive predictors for the classes "anger", "neutral" and "happiness"
Conclusions
◮ Filtering improved the robustness of emotion detection under noise mismatch
◮ Valence detection was also improved in the matched clean-speech condition
◮ The method can be adapted to model any class according to the problem at hand
http://www.acoustics.hut.fi
Jouni Pohjalainen <jouni.pohjalainen@aalto.fi>