## 6.4.10 SpectralMeanNormalizationIncremental

### 6.4.10.1 Outline of the node

This node subtracts the mean of the features from the input acoustic features. For real-time processing, however, the mean of the current utterance cannot be computed before the utterance ends, so it must be estimated or approximated from other values.

The  SpectralMeanNormalization node achieves real-time mean subtraction by using the mean of previous utterances from the same sound source direction. The  SpectralMeanNormalizationIncremental node differs in that it achieves real-time mean subtraction by recalculating and applying the mean of the current utterance at regular intervals.
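The incremental idea can be illustrated with a minimal sketch (hypothetical code, not the HARK implementation): the subtracted mean is refreshed at a regular interval from the frames observed so far.

```python
# Hypothetical sketch of incremental mean subtraction: the mean that is
# subtracted from each frame is refreshed every `period` frames, using
# all frames observed so far.  Not the actual HARK code.
def normalize_stream(frames, period):
    """frames: list of feature vectors (lists of floats); returns normalized frames."""
    dim = len(frames[0])
    mean = [0.0] * dim            # initial mean (a ZERO-like start is assumed here)
    out = []
    for i, f in enumerate(frames, start=1):
        out.append([f[d] - mean[d] for d in range(dim)])
        if i % period == 0:       # refresh the mean at regular intervals
            mean = [sum(g[d] for g in frames[:i]) / i for d in range(dim)]
    return out

frames = [[2.0], [4.0], [6.0], [8.0]]
out = normalize_stream(frames, period=2)
# The first two frames are passed through unchanged; from the third frame
# on, the mean of the first two frames (3.0) is subtracted, and so on.
```

In the real node the initial mean before the first refresh is chosen by the SM_ALGORITHM parameter rather than fixed to zero.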

### 6.4.10.2 Necessary file

No files are required.

### 6.4.10.3 Usage

When to use

This node is used to subtract the mean of acoustic features. It can remove the mismatch between the mean features of the audio data used for acoustic model training and those of the audio data to be recognized, which arises from differences in their recording environments.

Microphone properties often cannot be standardized across speech recording environments. In particular, the recording environments for acoustic model training and for recognition are not necessarily the same. Since different people are usually in charge of creating the speech corpus for training and of recording the audio data for recognition, it is difficult to arrange identical environments. It is therefore necessary to use features that do not depend on the recording environment.

For example, the microphones used to acquire training data and those used for recognition usually differ. The difference in microphone properties appears as a mismatch in the acoustic features of the recorded sound, which degrades recognition performance. Since the difference in microphone properties does not change over time, it appears as a difference in the mean spectra. Therefore, the components that depend only on the recording environment can be removed by subtracting the mean spectrum from the features.
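The basic (offline) operation can be sketched as follows; this is illustrative code, not the HARK implementation, and assumes the whole utterance is available:

```python
# Hypothetical sketch of (offline) spectral mean normalization:
# subtract the per-dimension mean over the whole utterance from
# every frame, removing any constant offset such as a microphone's
# stationary transfer characteristic.
def spectral_mean_normalize(frames):
    """frames: list of feature vectors (lists of floats), all of equal length."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = spectral_mean_normalize(frames)
# The per-dimension mean [3.0, 4.0] has been removed from every frame.
```

This node approximates the same operation in real time, since the full-utterance mean is not available while the utterance is still in progress.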

Typical connection

### 6.4.10.4 Input-output and property of the node

Table 6.85: Parameter list of SpectralMeanNormalizationIncremental
| Parameter name | Type | Default value | Unit | Description |
|---|---|---|---|---|
| FBANK_COUNT | int | 13 | | Dimension number of input feature parameters |
| PERIOD | int | 20 | [frames] | Period for calculating the average |
| SM_ALGORITHM | string | INCREMENTAL | | Algorithm for calculating the initial mean value |
| SM_FILENAME | string | | | CSV file name of the initial average |
| SM_HISTORY_FILENAME | string | | | CSV file name for the initial average history |
| IGNORE_FRAMES | int | 0 | [frames] | Number of frames to exclude from the calculation of the average spectrum |
| BASENAME | string | "smn_" | | Base name of the SMN file |
| OUTPUT_FNAME | string | "out.txt" | | Name of the output file |
| SM_EXPORT_FILENAME | string | | | Name of the CSV file to output the average |
| SM_EXPORT_ALGORITHM | string | LAST_SRC | | Algorithm for calculating the average to be output |

Input

INPUT

: Map<int, ObjectRef>  type. A pair of the sound source ID and feature vector as Vector<float>  type data.

SM_HISTORY

: Map<int, ObjectRef>  type (ObjectRef is Vector<ObjectRef> , and each ObjectRef in the Vector<ObjectRef>  is Vector<float> ). This input is optional. The average spectrum history for each sound source.

NOT_EOF

: bool  type. This input is optional. It is used only when the OUTPUT_FNAME or SM_EXPORT_FILENAME parameter is specified.

Output

OUTPUT

: Map<int, ObjectRef>  type. A pair of the sound source ID and feature vector as Vector<float>  type data.

Parameter

FBANK_COUNT

: int  type. Its range is 0 or a positive integer. Specify the number of filter banks.

PERIOD

: int  type. Its range is 0 or a positive integer. Specify the period, in frames, for calculating the average spectrum; if 1 is specified, the average spectrum is recalculated every frame. If 0 is specified, the initial calculation continues indefinitely, i.e., the algorithm specified by SM_ALGORITHM is used for all frames.

SM_ALGORITHM

: string  type. Default value is INCREMENTAL, the possible values are [INCREMENTAL, PREV_SM, ZERO, FILE]. Determines the algorithm for calculating the initial average spectrum until the $PERIOD + IGNORE\_FRAMES$-th frame is reached. In the case of INCREMENTAL, the average spectrum is calculated using all frames from the first frame of the sound source. In the case of PREV_SM, the average spectrum calculated for the immediately preceding sound source is used. In the case of ZERO, the average spectrum is assumed to be 0. In the case of FILE, the average spectrum is read from the file specified by the SM_FILENAME parameter.

SM_FILENAME

: string  type. Specifies the CSV file name that gives the initial average spectrum when FILE is specified in SM_ALGORITHM.

SM_HISTORY_FILENAME

: string  type. Specify the CSV file name to which the initial average spectrum history is saved until the $PERIOD + IGNORE\_FRAMES$-th frame is reached.

IGNORE_FRAMES

: int  type. Its range is 0 or a positive integer. Specifies the number of frames to ignore for calculating the average spectrum.

BASENAME

: string  type. Default value is "smn_". The base name of the SMN file. The file name will be $BASENAME + id + ".csv"$.

OUTPUT_FNAME

: string  type. Default value is "out.txt". Specifies the name of the output file.

SM_EXPORT_FILENAME

: string  type. Default value is an empty string. Specify the CSV file name of the average spectrum to be exported by SM_EXPORT_ALGORITHM. The export is enabled only if a file name is specified, and is disabled if an empty string is specified.

SM_EXPORT_ALGORITHM

: string  type. Default value is LAST_SRC, the possible values are [LAST_SRC, SRC_AVERAGE, FRAME_AVERAGE]. Determines the algorithm for calculating the average spectrum to be exported. In the case of LAST_SRC, the average spectrum of the last sound source is stored. In the case of SRC_AVERAGE, the average of the per-source average spectra is stored. In the case of FRAME_AVERAGE, the average over all frames of all sound sources is stored.
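The difference between SRC_AVERAGE and FRAME_AVERAGE can be sketched as follows. This is illustrative code under the interpretation that SRC_AVERAGE weights every source equally while FRAME_AVERAGE weights each source by its frame count; the function names and the `(frame_count, mean_vector)` representation are hypothetical, not part of HARK.

```python
# Hypothetical illustration of the two export-averaging interpretations.
# sources: list of (frame_count, per-source mean vector) pairs.

def src_average(sources):
    # Equal weight per source, regardless of how many frames each had.
    dim = len(sources[0][1])
    n = len(sources)
    return [sum(m[d] for _, m in sources) / n for d in range(dim)]

def frame_average(sources):
    # Each source's mean is weighted by its number of frames, which is
    # equivalent to averaging over all frames of all sources.
    dim = len(sources[0][1])
    total = sum(c for c, _ in sources)
    return [sum(c * m[d] for c, m in sources) / total for d in range(dim)]

# A short source (1 frame, mean 0.0) and a long source (3 frames, mean 4.0):
sources = [(1, [0.0]), (3, [4.0])]
sa = src_average(sources)    # equal per-source weight
fa = frame_average(sources)  # per-frame weight
```

With these inputs the two algorithms give different results, which is why the choice matters when source lengths vary widely.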

### 6.4.10.5 Details of the node

This node subtracts the mean of the features from the input acoustic features. For real-time processing, however, the mean of the current utterance cannot be computed before the utterance ends, so it must be estimated or approximated from other values.

Real-time mean subtraction is realized by using the mean of a previous utterance as an approximation and subtracting it instead of the mean of the current utterance. In this method, the sound source direction must also be considered. Since transfer functions differ depending on the sound source direction, when the current utterance and the previous utterance come from different directions, the mean of the previous utterance is a poor approximation of the mean of the current utterance.

In such a case, the mean of an utterance spoken before the current utterance from the same direction is used as the approximation of the current utterance's mean. Finally, the mean of the current utterance is calculated and kept in memory as the mean for that direction, for use in subsequent mean subtraction. When the sound source moves by more than 10 [deg] during an utterance, it is treated as a different sound source and a separate mean is calculated.
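The per-direction bookkeeping described above can be sketched as follows. This is a hypothetical illustration, assuming directions are grouped into 10-degree bins as the text suggests; the class and method names are not part of HARK.

```python
# Hypothetical sketch of a direction-keyed mean cache: means are stored
# per quantized direction, so an utterance reuses the mean of an earlier
# utterance from (approximately) the same direction.
def direction_key(azimuth_deg, bin_width=10):
    """Quantize a direction into a bin index (10-degree bins assumed)."""
    return int(azimuth_deg // bin_width)

class DirectionMeanCache:
    def __init__(self):
        self.means = {}  # bin index -> mean feature vector

    def get(self, azimuth_deg, default):
        """Return the stored mean for this direction, or `default`."""
        return self.means.get(direction_key(azimuth_deg), default)

    def put(self, azimuth_deg, mean):
        """Store the finished utterance's mean under its direction."""
        self.means[direction_key(azimuth_deg)] = mean

cache = DirectionMeanCache()
cache.put(42.0, [1.0, 2.0])          # mean of a finished utterance at 42 deg
m = cache.get(47.0, [0.0, 0.0])      # 47 deg falls in the same 10-deg bin
```

A new direction with no stored mean would fall back to the default, which corresponds to the initial-mean behavior selected by SM_ALGORITHM.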