Recognition Performance Test Standard¶

Overview¶

Our company has established enterprise testing standards for speech recognition performance of intelligent voice products. We actively participate in team and industry standards development led by various social organizations. Our speech testing methods and evaluation criteria have gained recognition from the industry and major enterprises. Customers can directly adopt our speech recognition performance testing methods to evaluate the speech recognition capabilities of their developed products.

This document specifies the terminology, definitions, testing specifications (including technical requirements, testing metrics, test items, test content, test equipment, and test environment), testing methods, procedures, as well as test result reporting and traceability for speech module recognition performance testing.

The content of this document is selected from our enterprise standard document Offline ASR Module Recognition Performance Testing Standard, Standard No.: QQYTL001-2018. The descriptions in this document apply to recognition performance testing for all AI offline and offline+online ASR modules.

Terms and Definitions¶

The following terms and definitions are used throughout this document:

1. Artificial Intelligence (AI)
An interdisciplinary field of computer science that develops systems demonstrating human-like intelligence capabilities including reasoning and learning.

2. Automatic Speech Recognition (ASR)
The technological process of converting speech signals into machine-readable text representations.

3. Natural Language Processing (NLP)
The computational techniques for analyzing, understanding and generating human language.

4. Voice Command
A spoken instruction that can be recognized and processed by voice-enabled systems.

5. Artificial Mouth
Also known as a simulated mouth. A high-fidelity audio playback device that plays voice commands in place of a human speaker, serving as a standardized test sound source.

6. Recognition Rate

The percentage of correctly recognized voice commands out of the total commands tested on the speech module.

7. False Recognition Count

The frequency of incorrect recognitions by the speech module during simulated real-world usage scenarios over a specified time period.

8. False Wake-up

An unintended activation of the wake-up system that occurs when either (1) no audio stream is present, or (2) the audio stream contains no wake-up features or events.

9. Signal-to-Noise Ratio (SNR)

The ratio, expressed in decibels, of the voice command signal power to the ambient noise power.
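The definition above can be illustrated with a minimal sketch. The function names are hypothetical and are not part of the standard; the second helper assumes both levels are measured in dB at the same point (for example, at the module's microphone), in which case the SNR is simply their difference.

```python
import math


def snr_from_power(signal_power: float, noise_power: float) -> float:
    """SNR in dB from two linear power values expressed in the same unit."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("power values must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def snr_from_levels(voice_db: float, noise_db: float) -> float:
    """SNR in dB when voice and noise levels are already given in dB."""
    return voice_db - noise_db


# A 65 dB voice command over 55 dB ambient noise corresponds to an SNR of 10 dB.
print(snr_from_levels(65.0, 55.0))
```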

10. Residential Environment

The operational setting for voice modules, including the bedroom, living room, kitchen, bathroom, and balcony.

11. Quiet Environment

Defined as operational environments with ambient noise levels between 25dB-45dB.

12. Moderate Noise Environment

Defined as operational environments with ambient noise levels between 55dB-65dB.

13. High Noise Environment

Defined as operational environments with ambient noise levels between 65dB-80dB.
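As a rough sketch, the three noise classes defined above can be expressed as a lookup on a measured ambient level. The function name is hypothetical. Two assumptions are made here that the definitions do not settle: a reading of exactly 65 dB is assigned to the moderate class, and levels that fall outside all three ranges (for example 50 dB) are reported as unclassified.

```python
def classify_noise_environment(level_db: float) -> str:
    """Map an ambient noise level in dB to the environment classes defined above."""
    if 25 <= level_db <= 45:
        return "quiet"
    if 55 <= level_db <= 65:
        # Assumption: the shared 65 dB boundary is treated as moderate noise.
        return "moderate"
    if 65 < level_db <= 80:
        return "high"
    # The definitions leave some levels (such as 45-55 dB) without a class.
    return "unclassified"


for level in (40, 60, 70, 50):
    print(level, classify_noise_environment(level))
```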

14. Mechanical Noise in Voice Modules

Noise generated by (1) component vibration during operation, (2) friction or impact between mechanical parts, and (3) mechanical components integrated into voice recognition systems.

By source, this noise is categorized as aerodynamic noise, mechanical noise, or electromagnetic noise.

15. Background Noise

Background noise refers to human or human-like voices in the background (such as crowd noise in a venue or shopping mall), or to interfering sound played by audio devices other than the voice module, such as music, news, television, and movies.

16. Echo Noise

Echo noise refers to the sound played by the voice module through its own speaker, which interferes with the speech recognition results.

17. Reverberation Noise

The voice received by the voice module after the target speaker’s voice is reflected by a smooth surface (such as a wall or object surface).

18. Environmental Noise

The operating environment of a voice module contains background noise and reverberation, and the background noise often comes from one or more sources. For example, a kitchen has range hood and cooking noise; a bathroom has wind noise, shower water noise, and voice reverberation from smooth walls; a living room has voices, TV, and other sounds at the same time; a balcony has wind noise and outdoor noise (such as vehicle horns); and a vehicle has engine noise and road noise.

19. Test Audio Data

A set of voice command audio, not drawn from the training set, that is used for speech testing.

20. Noise Audio Data

A set of noise audio recordings used for voice testing.

21. Operation

The voice module is in functional operation.

22. Not in Operation

The voice module is not functioning.

23. Voice Prompt Mode

The voice module is actively playing back audio prompts or responses.

24. Non-Voice Prompt Mode

The voice module is silent, with no audio being broadcast.

25. Wake-up Words

Special phrases that activate the voice recognition system, allowing it to respond to subsequent commands.

26. Command Words

The complete set of all voice commands recognized by the system, including both wake-up phrases and operational instructions.

27. Microphone Array Configurations

Voice modules can utilize various microphone array configurations:

By Quantity:
Dual-microphone array (2 mics)
Quad-microphone array (4 mics)
Hex-microphone array (6 mics)
Octo-microphone array (8 mics)
By Geometry:
Linear array (microphones in a straight line)
Circular array (microphones arranged in a circle)
Planar array (2D microphone arrangement)

28. Intelligent Voice Device

Any electronic device that incorporates voice recognition capabilities for user interaction.

Test Specifications¶

Test Categories¶

Recognition/Wake-up Rate Testing¶

Measures the system’s ability to accurately recognize voice commands and wake words
Conducted in both quiet (35-45dB) and noisy (55-75dB) environments
Evaluates performance across different acoustic conditions and distances

False Wake-up Testing¶

Quantifies unintended activations caused by:
Non-wake-word speech
Environmental sounds
System noise
Excludes phonetically similar phrases to wake words
Performed during both active use and idle states

Response Time Testing¶

Measures latency from command completion to system response
Tests conducted at various distances and noise levels
Includes both initial wake-up and subsequent command processing

System Stability Testing¶

Long-duration testing under continuous operation
Performance monitoring across temperature variations
Assessment of memory usage and resource management

Test Environment Requirements¶

Testing environments are designed to simulate real-world conditions where voice-enabled devices operate. Each environment is characterized by specific acoustic properties that affect speech recognition performance.

Environment	Application scenario	Ambient noise (dB)	Reverberation (s)	Minimum distance (m)	Maximum distance (m)	Reference area of application scenario (m²)	Applicable voice equipment
Quiet environment	Unlimited	35-45	0.45-0.55	1	5	15-35	All voice recognition devices
Working condition environment	Kitchen	55-60	0.65-0.75	1	2	5-10	Kitchen voice recognition equipment (such as microwave oven, range hood, rice cooker, etc.)
Working condition environment	Bathroom	55-60	0.65-0.75	1	2	5-10	Bathroom voice recognition equipment (such as bathtub, air heater, toilet, etc.)
Working condition environment	Balcony	55-60	NA	1	2	5-10	Balcony voice recognition equipment (such as washing machine, clothes dryer, balcony lamp, etc.)
Working condition environment	Living room (hall)	55-60	0.45-0.55	1	5	15-35	Living room voice recognition equipment (such as air conditioner, central control, remote controller, tea set, oxygen generator, living room lamp, TV, etc.)
Working condition environment	Bedroom	55-60	0.45-0.55	1	5	10-20	Bedroom voice recognition equipment (such as air conditioner, remote controller, desk lamp, TV, etc.)
Working condition environment	Strong noise	65-75	NA	0.5	2	5-10	Strong-noise equipment or environment (such as a fan running at its highest speed setting)

Note: For devices with audio playback capabilities (e.g., smart speakers), additional testing is required during active media playback to ensure reliable voice recognition performance.

Table 1 Speech Recognition Test Environment Description

Note: The reference areas of the application scenarios in Table 1 conform to the specifications for the kitchen, bathroom, balcony, living room (hall), and bedroom in Chapter 5, "Interior Space of the Dwelling Unit", of GB 50096-2011 Design Code for Residential Buildings.

Microphone Configuration Requirements¶

Supported Configurations¶

Voice modules support various microphone configurations to accommodate different product requirements:

Single Microphone: Basic configuration for cost-sensitive applications; limited noise cancellation capabilities; ideal for close-range voice capture.
Microphone Arrays: Multiple microphones arranged in specific geometric patterns; enable advanced beamforming and noise reduction; support far-field voice capture; require a minimum spacing of 10 mm between adjacent microphones.

Reference Configurations¶

Single Microphone Setup¶

Figure 1: Single Microphone Configuration

Dual-Microphone Array¶

Figure 2 Schematic Diagram of Double Microphone Array

Figure 3 Schematic Diagram of Four Microphone Array

Test Language Requirements¶

The voice commands for the test shall be spoken in the standard official language. For Chinese, the speaker's Putonghua proficiency shall be Level 2, Grade B (Level 2B) or above.

Test Speech Speed Requirements¶

Normal speaking speed: 150-180 characters per minute for Mandarin Chinese.

Speech Recognition Rate/Wake-up Rate Test Index¶

Test Item	Environment	Working Status of Voice Equipment	Signal and Noise Levels	Noise Set	Test Set	Index	Applicability Description
Local recognition rate/wake-up rate test	Quiet environment	Not operating/not broadcasting	Voice: 60-70 dB; Noise: 35-45 dB	NA	Wake-up word set; command word set	Minimum distance: ≥ 97%; Maximum distance: ≥ 95%	Applicable to all voice recognition equipment
Local recognition rate/wake-up rate test	Working condition environment	Operating	Voice: 65-75 dB; Noise: 55-60 dB	Environmental noise + mechanical noise of voice equipment	Wake-up word set; command word set	Minimum distance: ≥ 92%; Maximum distance: ≥ 85%	Applicable to voice recognition equipment that can generate mechanical noise
Local recognition rate/wake-up rate test	Working condition environment	Not operating/not broadcasting	Voice: 65-75 dB; Noise: 55-60 dB	Environmental noise	Wake-up word set; command word set	Minimum distance: ≥ 92%; Maximum distance: ≥ 88%	Applicable to voice recognition equipment exposed to medium or higher ambient noise under working conditions
Local recognition rate/wake-up rate test	Working condition environment	Broadcasting	Voice: 65-75 dB; Noise: 55-60 dB	Environmental noise + echo noise	Wake-up word set; command word set	Minimum distance: ≥ 92%; Maximum distance: ≥ 85%	Applicable to voice recognition equipment with long broadcasts or audio playback
Local recognition rate/wake-up rate test	Working condition environment	Operating (strong noise)	Voice: 65-75 dB

Table 2 Local Speech Recognition Rate/Wakeup Rate Test Indicators

Note:

Minimum distance: the specific distance is taken from Table 1 according to the environment and application scenario.
Maximum distance: the specific distance is taken from Table 1 according to the environment and application scenario.
Voice equipment that generates strong mechanical noise (such as a range hood) can produce noise of 75 ± 5 dB.
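Because the distances in Table 2 are resolved through Table 1, a small lookup structure can make the dependency explicit. The sketch below encodes the Table 1 noise range and distance columns; the class name, dictionary keys, and function name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Scenario:
    noise_db: Tuple[float, float]  # ambient noise range from Table 1
    min_distance_m: float          # minimum test distance from Table 1
    max_distance_m: float          # maximum test distance from Table 1


# Values copied from Table 1; the keys are illustrative labels.
TABLE_1: Dict[str, Scenario] = {
    "quiet": Scenario((35, 45), 1.0, 5.0),
    "kitchen": Scenario((55, 60), 1.0, 2.0),
    "bathroom": Scenario((55, 60), 1.0, 2.0),
    "balcony": Scenario((55, 60), 1.0, 2.0),
    "living room": Scenario((55, 60), 1.0, 5.0),
    "bedroom": Scenario((55, 60), 1.0, 5.0),
    "strong noise": Scenario((65, 75), 0.5, 2.0),
}


def distances_for(scenario: str) -> Tuple[float, float]:
    """Return (minimum, maximum) test distance in meters for a scenario."""
    entry = TABLE_1[scenario]
    return entry.min_distance_m, entry.max_distance_m


print(distances_for("kitchen"))       # distances used for a range hood test
print(distances_for("strong noise"))  # distances used for strong-noise tests
```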

False Wake-up Test Index¶

Test Item	Noise Set	Indicator	Description of False Wakeup Noise Set
False wake-up test	False wake-up noise set	≤ 3 times/24 h	1) The false wake-up noise set is a 24-hour noise corpus consisting of a 4-hour TV noise set (with voice), 4 hours of music (instrumental music or songs), an 8-hour environmental noise set (recorded where the equipment is located), and 8 hours of quiet environment. 2) The false wake-up noise set contains no wake-up word speech, and the noise level is 55-65 dB.

Table 3 False Wakeup Test Indicators
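The Table 3 criterion can be checked with a few lines of code. This is a sketch under stated assumptions: the names are hypothetical, and normalizing a count from a shorter or longer run to a 24-hour equivalent is an illustrative choice, since the table only specifies a full 24-hour corpus.

```python
# Corpus composition from Table 3, in hours.
CORPUS_HOURS = {
    "tv_with_voice": 4,
    "music": 4,
    "environmental_noise": 8,
    "quiet": 8,
}
MAX_FALSE_WAKEUPS_PER_24H = 3


def corpus_is_complete(hours: dict) -> bool:
    """The false wake-up noise set must add up to 24 hours."""
    return sum(hours.values()) == 24


def false_wakeup_passes(count: int, duration_h: float = 24.0) -> bool:
    """Compare a false wake-up count against the limit of 3 per 24 hours.

    Assumption: runs of other lengths are scaled linearly to 24 hours.
    """
    if duration_h <= 0:
        raise ValueError("duration must be positive")
    per_24h = count * 24.0 / duration_h
    return per_24h <= MAX_FALSE_WAKEUPS_PER_24H


print(corpus_is_complete(CORPUS_HOURS))
print(false_wakeup_passes(3), false_wakeup_passes(4))
```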

Response Time Test Index¶

Response time: The duration between the moment a voice command is played by the artificial mouth at close range (less than 50 cm) and the moment the voice recognition module delivers the recognized command to the device’s control or communication interface. The response time should be less than 1.0 second.
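A minimal sketch of the response time check follows. The function and parameter names are hypothetical. It assumes that both timestamps come from the same clock and that timing starts when playback of the command ends, which follows the "command completion" wording used for the response time test category; adjust the start point if a different convention is adopted.

```python
def response_time_s(command_end_s: float, output_s: float) -> float:
    """Seconds between the end of the played command and the module's output."""
    if output_s < command_end_s:
        raise ValueError("module output cannot precede the command")
    return output_s - command_end_s


def response_time_ok(rt_s: float, limit_s: float = 1.0) -> bool:
    """The index requires a response time of less than 1.0 second."""
    return rt_s < limit_s


rt = response_time_s(command_end_s=10.0, output_s=10.5)
print(rt, response_time_ok(rt))
```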

Stability Test Index¶

The recognition stability of the speech recognition module is tested in both the wake-up and non-wake-up states under ambient noise.
Recognition stability test in the wake-up state: play a wake-up word every 1 second for 72 hours; the module must not crash or restart and must continue to recognize normally.
Recognition stability test in the non-wake-up state: play a wake-up word every T_wakeup_time for 72 hours; the module must not crash or restart and must continue to recognize normally.
T_wakeup_time equals the time from wake-up to exiting the wake-up state, plus 1 second.
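The playback schedule for both stability tests follows directly from these definitions. In the sketch below, the function names are hypothetical and the 10-second wake window is only an example value; the actual time from wake-up to exiting the wake-up state depends on the module under test.

```python
def t_wakeup_time_s(wake_window_s: float) -> float:
    """T_wakeup_time: time from wake-up to exiting the wake-up state, plus 1 s."""
    return wake_window_s + 1.0


def planned_plays(test_hours: float, interval_s: float) -> int:
    """Number of wake-up word plays that fit into the test duration."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return int(test_hours * 3600 // interval_s)


# Wake-up state: one wake-up word every second for 72 hours.
print(planned_plays(72, 1.0))

# Non-wake-up state, assuming a hypothetical 10 s wake window.
interval = t_wakeup_time_s(10.0)
print(interval, planned_plays(72, interval))
```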

Audio Collection and Standardization for Test¶

Wake Word Audio Collection and Recording: The wake word set includes wake words and command words that can be used for direct control without a wake word. Ten adult speakers (five male and five female) read the wake word set, which is recorded using high-fidelity recording equipment. The speech sampling rate is 44.1 kHz, environmental noise is < 30 dB, reverberation is < 0.3 s, and speakers are positioned 20-30 cm from the microphone. There is a 2-3 second pause between words, and the reading is done in the standard official language. For Mandarin Chinese, the required proficiency is Level 2B or above, with a speaking speed of 150-180 characters per minute.

Command Word Audio Collection and Recording: The command word set includes wake words and all other command words. Ten adult speakers (five male and five female) read the command word set, which is recorded using high-fidelity recording equipment. The speech sampling rate is 44.1 kHz, environmental noise is < 30 dB, reverberation is < 0.3 s, and speakers are positioned 20-30 cm from the microphone. There is a 2-3 second pause between words, and the reading is done in the standard official language. For Mandarin Chinese, the required proficiency is Level 2B or above, with a speaking speed of 150-180 characters per minute.
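Some of the recording requirements above can be verified automatically. The sketch below uses Python's standard `wave` module to read the sample rate of a WAV recording and checks the stated recording conditions. All function names are hypothetical, the recordings are assumed to be stored as WAV files, and `make_silent_wav` exists only to give the example something to inspect.

```python
import io
import wave

REQUIRED_SAMPLE_RATE_HZ = 44_100  # 44.1 kHz, as required above


def wav_sample_rate_hz(source) -> int:
    """Return the sample rate of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate()


def recording_conditions_ok(noise_db: float, reverb_s: float, distance_cm: float) -> bool:
    """Check noise < 30 dB, reverberation < 0.3 s, and 20-30 cm mouth distance."""
    return noise_db < 30 and reverb_s < 0.3 and 20 <= distance_cm <= 30


def make_silent_wav(sample_rate_hz: int) -> io.BytesIO:
    """Build a one-second, mono, 16-bit silent WAV in memory for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * sample_rate_hz)
    buf.seek(0)
    return buf


print(wav_sample_rate_hz(make_silent_wav(44_100)) == REQUIRED_SAMPLE_RATE_HZ)
print(recording_conditions_ok(noise_db=28, reverb_s=0.25, distance_cm=25))
```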

Test Equipment and Instrumentation¶

Test Equipment Specifications¶

The following equipment is required for comprehensive speech recognition testing. Equivalent models with similar specifications may be substituted with prior approval.

S/N	Category	Equipment	Equipment Model	Equipment Brand	Function
01	Computer	Desktop/laptop	Any	Any	Monitor the voice module's output and verify that the test results are reported accurately
02	Sound source	Artificial mouth	4227-A	Brüel & Kjær	Play audio signal
03	Noise monitoring	Precision noise meter (sound level meter)	1357	TES	Measure the sound pressure level reaching the microphone
04	Noise source	Speaker/TV	Recommended model of monitor speaker: FX8	Fluid Audio	Play noise and simulate external interference
05	Audio Collection	High Fidelity Recording Equipment	R44	Roland	Audio Recording

Table 4 Speech Recognition Test Equipment

Artificial Mouth¶

Model: 4227-A
Performance index:

Rated output sound pressure level (SPL):
200 Hz - 2 kHz: 110 dB
100 Hz - 8 kHz: 100 dB
Distortion (at 94 dB):
200 Hz - 250 Hz: < 2%
> 250 Hz: < 1%
Impedance: 4 Ω
Maximum power handling: 10 W
Instantaneous withstand power: 50 W
Nozzle diameter: 20 mm

High-precision Noise Meter¶

Model: TES 1357
Performance index:

Resolution: 0.1 dB;
Measuring range: 30 dB to 130 dB;
1/1, 1/3, 1/6, 1/12, 1/24 octave spectrum analysis software (optional);
Accuracy: ± 1.5 dB (ref. 94 dB at 1 kHz);
A-weighted measuring range: 30 dB to 130 dB;
C-weighted measuring range: 35 dB to 130 dB;
Measuring ranges: 30-80 dB, 50-100 dB, 60-110 dB, 80-130 dB;
Frequency response: 31.5 Hz to 8 kHz;
Display: 4-digit LCD, 0.1 dB resolution, updated every 0.5 s;
AC/DC signal output: 2 Vrms at full scale of each range, 10 mV/dB.

Noise Source: Monitoring Speaker¶

Model: Fluid Audio FX8
Performance index:

Frequency response: 35 Hz - 22 kHz (± 3 dB);
Crossover frequency: 2.4 kHz;
Low-frequency amplifier power: 80 W;
High-frequency amplifier power: 50 W;
Signal-to-noise ratio: > 100 dB (typical, A-weighted);
Polarity: a positive signal at the + input produces an outward low-frequency driver displacement;
Input impedance: 20 kΩ (balanced), 10 kΩ (unbalanced);
Input sensitivity: with the volume control at maximum (maximum sound pressure of 102 dB), an 85 mV pink noise input produces an output sound pressure of 95 dBA;
Power supply: 115 V ~ 50/60 Hz or 230 V ~ 50/60 Hz (user-switchable);
Protection: RF interference, output current limiting, over-temperature protection, turn-on/turn-off transient protection, subsonic filter, external mains fuse;
Enclosure: vinyl-laminated medium-density fiberboard (MDF);
Size (single monitor speaker): 340 mm (height) x 254 mm (width) x 270 mm (depth);
Weight (single monitor speaker): 9.8 kg.

Speech Recognition Test Environment¶

As shown in the figure below, the artificial mouth (sound source) is placed directly in front of the microphone of the voice module, at a horizontal straight-line distance of L meters and a height of 120-150 cm above the ground. The noise source (monitor speaker/TV), the voice module, and the precision noise meter are located in the same plane, 80-100 cm above the ground. The distance between the noise source (monitor speaker/TV) and the microphone of the voice module is ≥ 150 cm. The precision noise meter is placed as close as possible to the microphone of the voice module (≤ 5 cm apart) but must not touch it.

Figure 4 Test room layout (fixed-point test)

Figure 5 Test room layout (non fixed point test)

Note:

Front: the position and angle of the artificial mouth can be set according to the customer's actual usage scenario;
L: depends on the actual scenario;
Noise source: the noise can be played through a monitor speaker or a TV. The location and angle of the noise source can be set according to the customer's actual situation.
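The default placement constraints described in this section can be checked before a test run. The sketch below is illustrative: the class, field, and function names are hypothetical, and it validates only the fixed numeric limits stated above (heights, noise source distance, and noise meter distance), not the customer-specific position, angle, or distance L.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layout:
    mouth_height_cm: float   # artificial mouth height above the ground
    plane_height_cm: float   # shared height of noise source, module, and meter
    noise_to_mic_cm: float   # noise source to module microphone
    meter_to_mic_cm: float   # noise meter to module microphone


def layout_violations(layout: Layout) -> List[str]:
    """Return a list of violated placement rules (empty if the layout is valid)."""
    problems = []
    if not 120 <= layout.mouth_height_cm <= 150:
        problems.append("artificial mouth must be 120-150 cm above the ground")
    if not 80 <= layout.plane_height_cm <= 100:
        problems.append("noise source, module, and meter must be 80-100 cm above the ground")
    if layout.noise_to_mic_cm < 150:
        problems.append("noise source must be at least 150 cm from the microphone")
    if not 0 < layout.meter_to_mic_cm <= 5:
        problems.append("noise meter must be within 5 cm of the microphone without touching it")
    return problems


print(layout_violations(Layout(135, 90, 200, 3)))
print(layout_violations(Layout(100, 90, 100, 0)))
```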

Recognition Rate/Wake-up Rate Test Methods and Steps¶

According to the test requirements, change the position and angle of the artificial mouth relative to the voice module to build different acoustic scenes. The noise source (monitor speaker/TV) plays the noise set, the artificial mouth plays the corresponding test set, and the test data is recorded.

Calculation Method:

Recognition rate = (number of correctly recognized commands / total number of input commands) × 100%
Wake-up rate = (number of correct wake-ups / total number of input wake-up commands) × 100%
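Both formulas reduce to the same percentage calculation, which can then be compared with the applicable index from Table 2. The sketch below uses hypothetical function names and example counts; the 97% and 95% thresholds are the quiet-environment values for the minimum and maximum distance.

```python
def rate_percent(correct: int, total: int) -> float:
    """Recognition or wake-up rate as a percentage of the commands played."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return 100.0 * correct / total


def meets_index(rate: float, threshold: float) -> bool:
    """Table 2 indices are expressed as 'greater than or equal to' thresholds."""
    return rate >= threshold


# Example: 200 commands at the minimum distance in a quiet environment.
rate = rate_percent(194, 200)
print(rate, meets_index(rate, 97.0))

# Example: 200 commands at the maximum distance in a quiet environment.
rate = rate_percent(189, 200)
print(rate, meets_index(rate, 95.0))
```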

Steps:

Use the noise source (monitor speaker/TV) to play the noise set continuously while the artificial mouth plays the commands in the test set one by one at a fixed time interval;
Record the test data;
Compile statistics and calculate the test results.

False Wake-up¶

According to the requirements of the test indicators, change the position and angle of the artificial mouth relative to the voice module to build different acoustic scenes. The noise source (monitor speaker/TV) plays the noise set, the artificial mouth plays the corresponding test set, and the number of false wake-ups is counted.

Step:

Use the noise source (monitor speaker/TV) to continuously broadcast the noise set, and use the artificial mouth to broadcast the test set;
Count the number of false wakeups.

Response Time¶

Set up the test environment, start the audio recording tool, and play the test set. After the voice module broadcasts its response, use the recording to measure the time interval between the voice command and the module's broadcast; this interval is the response time.

Figure 6 Response Time

Step:

Use artificial mouth to play test set;
Record test data;
Calculate the response time.

Recognition Stability¶

Set up the test environment. The noise source (monitor speaker/TV) plays different kinds of noise while the artificial mouth plays the test set. The voice modules under test must run normally for 168 h with no recorded restarts and a response time of less than 1.0 s.

Step:

Use artificial mouth to play test set;
Record the test data.

Note: This document serves as a reference standard and method for testing general speech recognition equipment and can be adjusted according to actual application scenarios and conditions. If no artificial mouth is available, a human speaker can read the commands instead. If the test results need to be analyzed, high-fidelity recording equipment, mobile phones, or other recording devices can be used to record the test environment for speech recognition analysis and optimization.

Appendix¶

Speech recognition equipment	Corresponding mechanical noise of speech equipment	Corresponding environmental noise
Range hood	Range hood operating noise	Kitchen ambient noise
Dishwasher	Dishwasher operating noise	Kitchen ambient noise
Rice cooker	None	Kitchen ambient noise
Microwave oven	Microwave oven operating noise	Kitchen ambient noise
Soymilk maker	Soymilk maker operating noise	Kitchen ambient noise
Coffee maker	Coffee maker operating noise	Kitchen ambient noise
Refrigerator	Refrigerator operating noise	Kitchen ambient noise
Air conditioner	Air conditioner operating noise	Living room ambient noise
Electric fan	Electric fan operating noise	Living room ambient noise
Vacuum cleaner	Vacuum cleaner operating noise	Living room ambient noise
Humidifier	None	Living room ambient noise

Table 5 Partial noise sets

Note: The noise can be collected according to the actual application scenario of the terminal equipment