Exploration of small enrollment speaker verification on handheld devices, M. Eng. thesis, MIT Department of Electrical Engineering and Computer Science, 2005.
[36] J. Ming, T. J. Hazen, and J. R. Glass, “Speaker verification over handheld devices with realistic noisy speech data,” submitted to ICASSP 2006.
[37] L. Deng, A. Acero, M. Plumpe, and X.-D. Huang, “Large-vocabulary speech recognition under adverse acoustic environments,” in Proc. ICSLP 2000, Beijing, China, 2000, pp. 806–809.
November 10, 2005
DRAFT
Fig. 1. Illustration of the system used to generate multi-condition training data for the UC model, with wide-band noise at different levels added acoustically to the clean training data. This system is also used in the experiments to produce noisy test data, by replacing the wide-band noise source with a test noise source.
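For the electronic-mixing counterpart of this setup (which Fig. 4 later compares against acoustic mixing), noise is typically scaled and added to the clean waveform so that a target SNR is met. A minimal sketch of that scaling step, assuming simple power-based gain computation; the function name and the synthetic signals are illustrative assumptions, not the paper's implementation:

```python
import math
import random

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the target SNR,
    then return the noisy signal. Pure-Python sketch for illustration."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Gain that makes the clean-to-scaled-noise power ratio equal the target SNR
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]

# Illustrative signals: a 440 Hz tone at 8 kHz plus white Gaussian noise
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
noise = [random.gauss(0, 1) for _ in range(8000)]
noisy = mix_at_snr(clean, noise, 10)  # 10 dB, one of the SNRs used in Table I
```

Repeating this for each noise type and SNR level produces the kind of multi-condition training set the caption describes.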
Fig. 2. Noises used in identification experiments, showing the spectra over a period of about three seconds. From left to right,
top to bottom: jet engine, restaurant, street, mobile-phone ring, pop song, broadcast news.
TABLE I
Identification accuracy (%) for the universal compensation model (UC) and baseline multi-condition model (BSLN-Mul) trained using simulated, acoustically mixed multi-condition data at seven different SNRs, and for the baseline model trained using clean data (BSLN-Cln). The number associated with each model indicates the number of Gaussian mixtures in the model.
Noise              SNR (dB)   UC-32   UC-64   UC-128   BSLN-Mul-128   BSLN-Cln-32
Clean                 —       90.64   94.84   96.51       95.79          98.41
Engine                20      83.81   87.06   88.89       86.35          62.46
                      15      78.26   81.75   81.59       77.62          29.05
                      10      51.27   52.30   51.35       53.57           7.78
Restaurant            20      85.87   91.27   93.89       94.44          93.10
                      15      80.56   85.95   88.33       87.46          78.97
                      10      67.54   73.25   75.08       67.70          43.57
Street                20      86.75   91.27   92.86       94.29          91.83
                      15      79.76   85.08   86.51       86.83          70.32
                      10      61.11   63.57   64.05       68.17          34.60
Mobile phone ring     20      73.57   80.64   84.68       68.02          56.90
                      15      63.65   72.30   76.35       46.90          34.05
                      10      48.10   57.38   62.46       26.43          15.56
Pop song              20      87.54   92.22   93.41       86.19          88.57
                      15      78.26   85.71   88.07       64.44          66.98
                      10      58.49   64.21   67.70       33.65          30.87
Broadcast news        20      87.22   92.54   93.89       82.78          84.92
                      15      79.05   86.03   88.97       59.84          61.75
                      10      57.87   66.75   70.00       27.62          26.19
Fig. 3. Identification accuracy in clean and six noisy conditions averaged over SNRs of 10–20 dB, and the overall
average accuracy across all the conditions, for UC and BSLN-Mul trained using simulated, acoustically mixed multi-condition
data at seven different SNRs, and for BSLN-Cln trained using clean data. The number associated with each model indicates the
number of Gaussian mixtures in the model.
Fig. 4. Absolute improvement in identification accuracy by the UC model trained using multi-condition data with acoustically
added noise, compared to a UC model trained using the data with electronically added noise, for test data with acoustically
added noise. Both UC models used 128 Gaussian mixtures per speaker.
Fig. 5. Spectra of utterances recorded in an office (left) and at a street intersection (right), using the internal microphone.
TABLE II
Equal error rates (%) for UC and BSLN-Mul trained using simulated narrow-band noise (NB), wide-band noise (WB), and their combination (NB+WB) at ten different SNRs, and for BSLN-Cln trained using clean data (Index: O–office, S–street intersection, H–headset, I–internal microphone).
Training–Testing   UC-NB   UC-WB   UC-NB+WB   BSLN-Mul-NB   BSLN-Mul-WB   BSLN-Cln
OH – OH             6.50    8.45     7.79        7.29          12.65         8.85
OI – SI            11.98   15.63    13.51       15.63          23.96        20.83
OI – SH            14.06   17.71    14.62       22.40          30.73        30.21
Fig. 6. DET curves for matched training and testing (office/headset), for UC and BSLN-Mul trained using simulated narrow-band noise (NB) and wide-band noise (WB) at ten different SNRs, and for BSLN-Cln trained using clean data.
Fig. 7. DET curves with mismatch in environments: training–office, testing–street intersection, both using internal microphone,
for UC and BSLN-Mul trained using simulated narrow-band noise (NB) and wide-band noise (WB) at ten different SNRs, and
for BSLN-Cln trained using clean data.
Fig. 8. DET curves with mismatch in both environments and microphones: training–office/internal microphone, testing–street
intersection/headset, for UC and BSLN-Mul trained using simulated narrow-band noise (NB) and wide-band noise (WB) at ten
different SNRs, and for BSLN-Cln trained using clean data.
Fig. 9. Comparison between the UC models trained using simulated narrow-band noise (NB) and mixed narrow-band and wide-band noise (NB+WB), for different training–testing environment/microphone conditions (Index: O–office, S–street intersection, H–headset, I–internal microphone).