Recognition effect test¶
Overview¶
Our company has formulated an enterprise test standard for the recognition performance of voice products and actively participates in the group and industry standards led by various organizations. Our voice test methods and evaluation criteria have been recognized by the industry and by major enterprises. Users can apply these recognition effect test methods directly to evaluate the voice recognition performance of their own products.
This document specifies the terms and definitions, test-related instructions (test technical requirements, indicators, items, contents, equipment and environment), test methods and steps, and the reporting and traceability of test results for the recognition effect and performance testing of speech modules.
The content of this document is excerpted from our company’s enterprise standard, Local voice module recognition effect and performance test standard (standard number: QQYTL001-2018). It applies to the recognition effect test of all AI offline voice modules.
Terms and Definitions¶
The following are some terms and definitions that may be used in this document.
- Artificial intelligence
Artificial intelligence (AI) is an interdisciplinary field, usually regarded as a branch of computer science, that studies models and systems exhibiting functions associated with human intelligence (such as reasoning and learning).
- Speech recognition
Automatic speech recognition (ASR): the conversion, by a functional unit, of a voice signal into a representation of the voice content.
- NLP - Natural Language Processing
Natural language processing.
- Voice command
Voice commands recognized by the voice module.
- Artificial mouth
High-fidelity playback equipment that plays voice commands in place of a human speaker and serves as the standard test sound source.
- Recognition rate
After voice commands are played to the voice module under test, the number of correctly recognized commands as a percentage of the total number of commands played.
- Misrecognition count
In a simulated living environment in which the voice module is actually used, the number of false recognitions made by the voice module within a given period.
- False wakeup
The voice wake-up system wakes up although there is no audio stream, or although the audio stream contains no feature or event required for wake-up.
- Signal-to-noise ratio
Signal-to-noise ratio (SNR or S/N): the ratio of voice command power to ambient noise power, expressed in decibels.
- Home environment
The working environment of the voice module is the home, including the bedroom, living room, kitchen, bathroom, balcony, etc.
- Quiet environment
A working environment of the voice module in which the noise level is between 25dB and 45dB.
- Moderate noise environment
A working environment of the voice module in which the noise level is between 55dB and 65dB.
- Strong noise environment
A working environment of the voice module in which the noise level is between 65dB and 80dB.
- Mechanical noise of voice module
The mechanical noise of the voice module is the noise radiated by vibrating mechanical components and housings, caused by friction, impact or unbalanced forces between components, while the machinery of the voice module (including mechanical components integrated in the voice recognition equipment) is running. By sound source, it can be divided into three categories: aerodynamic noise, mechanical noise and electromagnetic noise.
- Background noise
Background noise refers to background speech or speech-like sound (such as crowd noise in a venue or shopping mall) and to interfering sound played by audio devices other than the voice module, such as music, news, television and movies.
- Echo noise
Echo noise is the sound played by the voice module through its own loudspeaker, which interferes with the speech recognition results.
- Reverberation noise
The target speaker’s voice as received by the voice module after it has been reflected by a smooth surface (such as a wall or object surface).
- Environmental noise
The environment of the voice module contains background noise and reverberation, and the background noise often comes from one or more noise sources. For example, the kitchen has the sound of the range hood and of cooking; the bathroom has wind noise, shower water noise and voice reverberation from smooth walls; the living room has speech, TV and other sounds at the same time; the balcony has wind noise and outdoor noise (such as vehicle horns); the vehicle environment has engine noise and road noise.
- Test audio data
A set of audio instructions, not drawn from the training set, used for speech testing.
- Noise audio data
A set of noise audio used for speech testing.
- Operation
The voice module is in functional operation.
- Not in operation
The voice module is not functioning.
- Broadcasting
The voice module is in its own voice broadcast.
- Not broadcasting
The voice module is not broadcasting voice.
- Wake-up words
A corpus containing the wake-up words and the instruction words that can control the device directly without wake-up.
- Command words
A corpus containing the wake-up words and all other instruction words.
- Multi-microphone
The voice module uses multiple microphones (two or more) to collect multi-channel voice data. By number of microphones, arrays are classified as two-microphone, four-microphone, six-microphone, eight-microphone arrays, etc.; by arrangement, they are classified as linear or ring microphone arrays.
- Speech recognition equipment
Electrical equipment with integrated voice recognition module.
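The noise-level bands and the signal-to-noise ratio defined above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the standard: the names `NOISE_BANDS`, `classify_noise`, `snr_db` and `snr_from_power_db` are hypothetical. The definitions leave the 45dB-55dB range unnamed and place 65dB in both the moderate and strong bands; the sketch reports that as-is instead of resolving it.

```python
import math

# Noise bands in dB, taken from the definitions above.
# 45-55 dB is not named, and 65 dB falls into two bands.
NOISE_BANDS = {
    "quiet": (25, 45),
    "moderate": (55, 65),
    "strong": (65, 80),
}


def classify_noise(level_db):
    """Return every band whose closed interval contains level_db."""
    return [name for name, (lo, hi) in NOISE_BANDS.items() if lo <= level_db <= hi]


def snr_db(voice_db, noise_db):
    """SNR in dB when both levels are already given in dB.

    A ratio of powers becomes a difference of levels on the decibel scale.
    """
    return voice_db - noise_db


def snr_from_power_db(voice_power, noise_power):
    """SNR in dB computed from linear power values (same unit for both)."""
    return 10.0 * math.log10(voice_power / noise_power)
```

For example, a 65dB voice command over 55dB noise gives `snr_db(65, 55)`, that is, 10dB.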
Description of test items¶
Speech recognition test items¶
- Recognition rate/wake-up rate test
Test the recognition rate of voice instructions and the recognition rate of wake-up words in quiet and noisy environments.
- False wake-up test
Test the number of times the voice module is woken up by anything other than the wake-up word (excluding speech that is pronounced the same as the wake-up word or is difficult to distinguish from it) in quiet and noisy environments.
- Response time test
Test the time from the end of receiving a voice instruction to the output of the correct recognition result in quiet and noisy environments.
- Stability test
Test the speech recognition stability of the speech module.
Speech recognition test environment¶
The speech recognition test environment shall simulate the real environment and working conditions in which the speech recognition equipment is routinely used. The home environment usually also includes a quiet environment in which the equipment is not working. The test set is therefore the instruction corpus of the voice module, and the noise set is the environmental noise and mechanical noise present while the equipment is working, for example kitchen noise for kitchen appliances and bathroom noise for bathroom appliances.
If the voice recognition equipment is audio equipment or needs to broadcast voice for long periods, recognition must also be tested while the equipment is playing.
Environment | Application scenario | Ambient noise (dB) | Reverberation (s) | Minimum distance (m) | Maximum distance (m) | Reference area of application scenario (m²) | Applicable voice equipment |
---|---|---|---|---|---|---|---|
Quiet environment | Unlimited | 35-45 | 0.45-0.55 | 1 | 5 | 15-35 | All voice recognition devices |
Working condition environment | Kitchen | 55-60 | 0.65-0.75 | 1 | 2 | 5-10 | Kitchen voice recognition equipment (such as microwave oven, range hood, rice cooker, etc.) |
Working condition environment | Toilet | 55-60 | 0.65-0.75 | 1 | 2 | 5-10 | Bathroom voice recognition equipment (such as bathtub, air heater, toilet, etc.) |
Working condition environment | Balcony | 55-60 | NA | 1 | 2 | 5-10 | Balcony voice recognition equipment (such as washing machine, clothes dryer, balcony lamp, etc.) |
Working condition environment | Living room (hall) | 55-60 | 0.45-0.55 | 1 | 5 | 15-35 | Living room voice recognition equipment (such as air conditioner, central control, remote controller, tea set, oxygen generator, living room lamp, TV, etc.) |
Working condition environment | Bedroom | 55-60 | 0.45-0.55 | 1 | 5 | 10-20 | Bedroom voice recognition equipment (such as air conditioner, remote controller, desk lamp, TV, etc.) |
Working condition environment | Strong noise | 65-75 | NA | 0.5 | 2 | 5-10 | Strong noise equipment or environment (such as a fan running at its high speed setting) |
Note: The reference area of the application scenario in Table 1 conforms to the specifications for the space area of the kitchen, toilet, balcony, living room (hall) and bedroom suite in the “Five Sets of Interior Space” in GB50096-2011 Code for Residential Design.
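As a convenience for test scripts, the rows of Table 1 can be held in a small lookup structure. The sketch below is one possible encoding, with hypothetical names (`Scenario`, `TABLE_1`, `distances_for`, `noise_in_range`) and lower-case scenario keys chosen for illustration; the values are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    noise_db: Tuple[float, float]                   # ambient noise range (dB)
    reverberation_s: Optional[Tuple[float, float]]  # None where Table 1 says NA
    min_distance_m: float
    max_distance_m: float
    area_m2: Tuple[float, float]                    # reference area range (m²)


TABLE_1 = {
    "quiet": Scenario((35, 45), (0.45, 0.55), 1, 5, (15, 35)),
    "kitchen": Scenario((55, 60), (0.65, 0.75), 1, 2, (5, 10)),
    "toilet": Scenario((55, 60), (0.65, 0.75), 1, 2, (5, 10)),
    "balcony": Scenario((55, 60), None, 1, 2, (5, 10)),
    "living room": Scenario((55, 60), (0.45, 0.55), 1, 5, (15, 35)),
    "bedroom": Scenario((55, 60), (0.45, 0.55), 1, 5, (10, 20)),
    "strong noise": Scenario((65, 75), None, 0.5, 2, (5, 10)),
}


def distances_for(scenario_name):
    """Return (minimum, maximum) test distance in metres for a scenario."""
    s = TABLE_1[scenario_name]
    return s.min_distance_m, s.max_distance_m


def noise_in_range(scenario_name, measured_db):
    """Check a sound level meter reading against the scenario's noise range."""
    lo, hi = TABLE_1[scenario_name].noise_db
    return lo <= measured_db <= hi
```

A test script could call `distances_for("kitchen")` to place the artificial mouth and `noise_in_range("kitchen", reading)` to confirm the noise level before each run.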
Microphone requirements for the voice module to be tested¶
The voice module can collect voice with a single microphone or with a microphone array (an array has a defined geometric size and structure, such as a circular or linear array). The following structural diagrams of single-microphone, dual-microphone and four-microphone layouts (adjacent microphones shall be at least 10mm apart) are for reference.
Test language requirements¶
The voice instructions for the test shall be in the standard official language; for Chinese, standard Mandarin at Grade II B or above is required.
Test speed requirements¶
Normal speaking speed, 150-180 words/minute for Chinese mandarin.
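The speaking-speed requirement can be checked from the length of a recording. The following is a rough sketch with hypothetical function names; it assumes that the "word" count of a Mandarin utterance is its character count, which the text does not state explicitly.

```python
def speaking_rate_wpm(word_count, duration_s):
    """Words per minute for an utterance of word_count words lasting duration_s seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return word_count * 60.0 / duration_s


def is_normal_rate(word_count, duration_s, low=150, high=180):
    """True if the rate lies within the 150-180 words/minute range given above."""
    return low <= speaking_rate_wpm(word_count, duration_s) <= high
```

For instance, an 8-word command spoken in 3 seconds corresponds to 160 words/minute and falls within the range.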
Speech recognition rate/wake-up rate test index¶
Test item | Environment | Working status of voice equipment | Signal-to-noise situation | Noise set | Test set | Index | Applicability description |
---|---|---|---|---|---|---|---|
Local recognition rate/wake-up rate test | Quiet environment | Non-operating/non-broadcasting | Voice: 60dB~70dB; Noise: 35dB~45dB | NA | Wake-up word set, instruction word set | Minimum distance: ≥ 97%; Maximum distance: ≥ 95% | Applicable to all speech recognition equipment |
Local recognition rate/wake-up rate test | Working condition environment | Operation | Voice: 65dB~75dB; Noise: 55dB~60dB | Environmental noise + mechanical noise of voice equipment | Wake-up word set, instruction word set | Minimum distance: ≥ 92%; Maximum distance: ≥ 85% | Applicable to voice recognition equipment that can generate mechanical noise |
Local recognition rate/wake-up rate test | Working condition environment | Non-operating/non-broadcasting | Voice: 65dB~75dB; Noise: 55dB~60dB | Environmental noise | Wake-up word set, instruction word set | Minimum distance: ≥ 92%; Maximum distance: ≥ 88% | Applicable to voice recognition equipment with medium or higher ambient noise under operating conditions |
Local recognition rate/wake-up rate test | Working condition environment | Broadcast | Voice: 65dB~75dB; Noise: 55dB~60dB | Ambient noise + echo noise | Wake-up word set, instruction word set | Minimum distance: ≥ 92%; Maximum distance: ≥ 85% | Applicable to voice recognition equipment with long broadcast and audio playback |
Local recognition rate/wake-up rate test | Working condition environment | Operation (strong noise) | Voice: 65dB~75dB | | | | |
Note:
- The minimum distance is determined from “Table 1” according to the “environment” and “application scenario”.
- The maximum distance is determined from “Table 1” according to the “environment” and “application scenario”.
- The noise of voice equipment that generates strong mechanical noise (such as a range hood) will reach 75 ± 5dB.
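A measured result can be compared against the indices above with a small helper. The sketch below covers only the four rows whose index values are given in full; the condition labels, `THRESHOLDS` and `meets_index` are hypothetical names chosen for illustration.

```python
# (threshold at minimum distance, threshold at maximum distance), in percent.
# Keys are illustrative labels for the four complete rows of the index table.
THRESHOLDS = {
    "quiet": (97, 95),
    "operation": (92, 85),
    "non-operating, ambient noise": (92, 88),
    "broadcast": (92, 85),
}


def meets_index(condition, rate_at_min_pct, rate_at_max_pct):
    """True if both measured rates reach the thresholds for the condition."""
    need_min, need_max = THRESHOLDS[condition]
    return rate_at_min_pct >= need_min and rate_at_max_pct >= need_max
```

Both distances must pass: a module that reaches 97.5% at the minimum distance but only 94.9% at the maximum distance does not meet the quiet-environment index.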
False wake-up test index¶
Test item | Noise set | Indicator | Description of false wake-up noise set |
---|---|---|---|
False wake-up test | False wake-up noise set | ≤ 3 times/24h | 1) False wake-up noise set: a 24-hour noise corpus consisting of a 4-hour TV noise set (with voice) + 4 hours of music (pure music or songs) + an 8-hour environmental noise set (for the place where the equipment is located) + 8 hours of quiet environment. 2) The false wake-up noise set contains no wake-up word speech, and the noise level is 55dB - 65dB. |
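The composition of the false wake-up noise set and its indicator can be written down as data, which makes it easy to confirm that the segments add up to 24 hours. This is a minimal sketch; `FALSE_WAKEUP_NOISE_SET`, `total_hours` and `passes_false_wakeup` are hypothetical names.

```python
# Segment durations in hours, as listed for the false wake-up noise set.
FALSE_WAKEUP_NOISE_SET = [
    ("TV noise (with voice)", 4),
    ("music (pure music or songs)", 4),
    ("environmental noise", 8),
    ("quiet environment", 8),
]

MAX_FALSE_WAKEUPS_PER_24H = 3


def total_hours(segments):
    """Sum the durations of all noise segments."""
    return sum(hours for _, hours in segments)


def passes_false_wakeup(count_in_24h):
    """True if the number of false wake-ups in 24 hours is within the indicator."""
    return count_in_24h <= MAX_FALSE_WAKEUPS_PER_24H
```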
Response time test index¶
Response time: the interval from when the artificial mouth plays the voice command at close range (<50cm) to when the voice recognition module pushes the recognized command to the device control or communication port. Required: response time < 1.0s.
Stability test index¶
Recognition stability of the speech recognition module is tested in the wake-up and non-wake-up states under ambient noise.
Recognition stability test in the wake-up state: play the wake-up word once every 1 second for 72 hours; the module must not crash or restart and must continue to recognize normally.
Recognition stability test in the non-wake-up state: play the wake-up word once every T_wakeup_time seconds for 72 hours; the module must not crash or restart and must continue to recognize normally.
T_wakeup_time equals the time from wake-up to exiting the wake-up state plus 1 second.
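The playback interval for the non-wake-up state and the number of wake-up word plays over a 72-hour run follow directly from the description above. The sketch below uses hypothetical function names, and the 15-second wake window in the usage note is an assumed example value, not a figure from this document.

```python
def t_wakeup_time_s(wake_window_s):
    """Interval between plays in the non-wake-up test.

    wake_window_s is the time from wake-up to exiting the wake-up state;
    one extra second is added so the module has left the wake-up state.
    """
    return wake_window_s + 1


def plays_in_run(interval_s, run_hours=72):
    """Number of complete playback intervals that fit in the run."""
    return int(run_hours * 3600 // interval_s)
```

With an assumed wake window of 15 seconds, the interval is 16 seconds and the wake-up word is played 16200 times in 72 hours; in the wake-up state test (1-second interval) it is played 259200 times.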
Audio Collection and Standardization Method for Speech Recognition Test¶
Wake up word audio set recording¶
The wake-up word set contains the wake-up words and the instruction words that can control the device directly without wake-up. Five men and five women read the wake-up word set, 10 readings in total, recorded with high-fidelity recording equipment. The speech sampling rate is 44.1kHz, the ambient noise is <30dB, the reverberation is <0.3s, the speaker is 20-30cm from the microphone, and the interval between words is 2-3 seconds. The standard official language is used; for Chinese, standard Mandarin at Grade II B or above is required, with a reading speed of 150-180 words/minute.
Command word audio set recording¶
The command word set is a corpus containing the wake-up words and all other instruction words. Five men and five women read the command word set, 10 readings in total, recorded with high-fidelity recording equipment. The speech sampling rate is 44.1kHz, the ambient noise is <30dB, the reverberation is <0.3s, the speaker is 20-30cm from the microphone, and the interval between words is 2-3 seconds. The standard official language is used; for Chinese, standard Mandarin at Grade II B or above is required, with a reading speed of 150-180 words/minute.
Test equipment required for speech recognition¶
The equipment and model used in speech recognition test are shown in the following table (for reference), and the parameters of main equipment are given here.
S/N | Category | Equipment | Equipment Model | Equipment Brand | Function |
---|---|---|---|---|---|
01 | Computer | Desktop/laptop | Unlimited | Unlimited | Monitor whether the voice module feedback accurately outputs the test results |
02 | Sound source | Artificial mouth | 4227-A | Brüel & Kjær | Play audio signal |
03 | Noise monitoring | Precision noise meter (sound level meter) | 1357 | TES | Test the sound pressure reaching the microphone |
04 | Noise source | Speaker/TV | Recommended model of monitor speaker: FX8 | Fluid Audio | Play noise and simulate external interference |
05 | Audio collection | High-fidelity recording equipment | R44 | Roland | Audio recording |
Table 4 Speech Recognition Test Equipment
Artificial mouth¶
Model: 4227-A
Performance index:
- Rated output sound pressure (SPL):
  - 200Hz - 2kHz: 110dB
  - 100Hz - 8kHz: 100dB
- Distortion (@ 94dB):
  - 200Hz - 250Hz: <2%
  - Above 250Hz: <1%
- Impedance: 4 Ω
- Maximum continuous power: 10W
- Maximum instantaneous power: 50W
- Mouth opening diameter: 20mm
Precision noise meter¶
Model: TES 1357
Performance index:
- 0.1dB resolution;
- The measuring range is 30 to 130dB;
- 1/1, 1/3, 1/6, 1/12, 1/24 octave spectrum analysis software (optional);
- Accuracy ± 1.5dB (ref 94dB @ 1kHz);
- A-weighted measuring range: 30dB to 130dB;
- C-weighted measuring range: 35dB to 130dB;
- Measuring ranges: 30-80dB, 50-100dB, 60-110dB, 80-130dB;
- Frequency response 31.5Hz to 8kHz;
- Digital display 4-digit LCD, 0.1dB resolution, updated every 0.5s;
- AC/DC signal output 2Vrms/full scale of each gear, 10mV/dB.
Noise source: monitor speaker¶
Model: Fluid Audio FX8
Performance index:
- Frequency response: 35Hz - 22kHz (± 3dB);
- Crossover frequency: 2.4kHz;
- Low frequency amplifier power: 80 watts;
- High frequency amplifier power: 50 watts;
- Signal-to-noise ratio: >100dB (typical, A-weighted);
- Polarity: a positive signal at the + input produces an outward low-frequency displacement;
- Input impedance: 20 kOhm (balanced), 10 kOhm (unbalanced);
- Input sensitivity: when the volume control is set to the maximum value (102dB of maximum sound pressure), the input of 85 mV pink noise will produce an output sound pressure of 95dBA;
- Power supply: 115V~50/60 Hz or 230V~50/60 Hz (user switchable);
- Protection: RF interference, output current limiting, over-temperature protection, turn-on/off transient protection, subsonic filter, external mains fuse;
- Cabinet: vinyl-laminated medium density fiberboard;
- Size (single monitor speaker): 340mm (height) x 254mm (width) x 270mm (depth);
- Weight (single monitor speaker): 9.8kg.
Speech recognition test environment¶
As shown in the figure below, the artificial mouth (sound source) is directly in front of the microphone of the voice module, at a horizontal straight-line distance of L meters, and 120 - 150cm above the ground. The noise source (monitor speaker/TV), the voice module and the precision noise meter are in the same plane (80 - 100cm above the ground). The distance between the noise source (monitor speaker/TV) and the microphone of the voice module is ≥ 150cm. The precision noise meter is placed as close as possible to the microphone of the voice module (≤ 5cm apart) but must not touch it.
Note:
- Front: the position and angle of the artificial mouth can be determined according to the actual scene of the customer;
- L: It depends on the actual scene;
- Noise source: the noise can be played by the monitor speaker or a TV. The location and angle of the noise source can be determined according to the actual situation of the customer.
Recognition rate/wake-up rate test methods and steps¶
According to the test requirements, change the position and angle of the artificial mouth relative to the voice module to build different acoustic scenes. The noise source (monitor speaker/TV) plays the noise set, the artificial mouth plays the corresponding test set, and the test data are recorded.
Calculation method:
- Recognition rate = (number of correctly recognized instructions / total number of input instructions) * 100%
- Wake-up rate = (number of correct wake-ups / total number of input wake-up words) * 100%
Step:
- Use the noise source (monitor speaker/TV) to play the noise set continuously while the artificial mouth plays the commands in the test set one by one at a fixed time interval;
- Record test data;
- Compile statistics and calculate the test results.
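The calculation step above can be sketched as follows. The function names and the `(expected, recognized)` record format are assumptions made for illustration; a real test log may carry more fields.

```python
def rate_percent(correct, total):
    """(correct / total) * 100%, as in the calculation method above."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def recognition_rate(results):
    """Recognition rate from (expected, recognized) pairs, one per played command.

    recognized may be None when the module gave no result for a command.
    """
    correct = sum(1 for expected, recognized in results if expected == recognized)
    return rate_percent(correct, len(results))


def wake_up_rate(correct_wakeups, wakeup_words_played):
    """Wake-up rate from the count of correct wake-ups and wake-up words played."""
    return rate_percent(correct_wakeups, wakeup_words_played)
```

For example, three correct results out of four played commands give a recognition rate of 75%.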
Test methods and steps for false wake-up¶
According to the requirements of the test indicators, change the position and angle of the artificial mouth relative to the voice module to build different acoustic scenes. The noise source (monitor speaker/TV) plays the noise set, the artificial mouth plays the corresponding test set, and the number of false wakeups is counted.
Step:
- Use the noise source (monitor speaker/TV) to continuously broadcast the noise set, and use the artificial mouth to broadcast the test set;
- Count the number of false wakeups.
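Counting can be automated if the test computer logs a timestamp for every wake-up event. The sketch below assumes such a log exists as a list of seconds elapsed since the start of the run; the function name and the log format are hypothetical.

```python
from collections import Counter


def false_wakeups_per_day(event_times_s, window_s=24 * 3600):
    """Count wake-up events in each consecutive 24-hour window.

    event_times_s holds the elapsed time in seconds of every false wake-up.
    Returns a Counter keyed by window index (0 is the first 24 hours).
    """
    return Counter(int(t // window_s) for t in event_times_s)
```

The busiest window, `max(counts.values())`, is the figure to compare with the false wake-up indicator.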
Response time test methods and steps¶
Set up the test environment, start the voice recording tool, and play the test set. After playback, use the recording to measure the interval between the voice command and the voice module's broadcast; this interval is the response time.
Step:
- Use artificial mouth to play test set;
- Record test data;
- Calculate the response time.
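Once the two time points have been read from the recording, the calculation is a subtraction followed by a comparison with the limit. The sketch below assumes each measurement is a pair (end of the voice command, start of the module's response), both in seconds on the recording's timeline; the function names are hypothetical.

```python
def response_time_s(command_end_s, response_s):
    """Interval between the end of the command and the module's response."""
    if response_s < command_end_s:
        raise ValueError("response cannot precede the end of the command")
    return response_s - command_end_s


def summarize_response_times(pairs, limit_s=1.0):
    """Summarize a non-empty list of (command_end_s, response_s) pairs."""
    times = [response_time_s(end, resp) for end, resp in pairs]
    return {
        "max_s": max(times),
        "mean_s": sum(times) / len(times),
        "all_within_limit": all(t < limit_s for t in times),
    }
```

The limit is strict: a response time of exactly 1.0 s does not satisfy "less than 1.0 s".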
Stability test methods and procedures¶
Set up the test environment. The noise source (monitor speaker/TV) plays different kinds of noise and the artificial mouth plays the test set. The voice module under test must run normally for 168 h with no restart recorded and with a response time of less than 1.0 s.
Step:
- Use artificial mouth to play test set;
- Record the test data.
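The recorded data can then be evaluated against the three conditions stated for this test (run length, no restart, response time). This is a minimal sketch; `StabilityRun` and `stability_passed` are hypothetical names and the record layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StabilityRun:
    hours_run: float                 # total running time of the module
    restarts: int                    # number of restarts recorded
    response_times_s: List[float] = field(default_factory=list)


def stability_passed(run, required_hours=168, limit_s=1.0):
    """True if the run is long enough, had no restart, and every response was fast enough."""
    return (
        run.hours_run >= required_hours
        and run.restarts == 0
        and all(t < limit_s for t in run.response_times_s)
    )
```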
Note: This document serves as a reference standard and method for testing general speech recognition equipment and can be adjusted to actual application scenarios and conditions. If no artificial mouth is available, a human speaker can be used instead. If the test results need to be analyzed, high-fidelity recording equipment, a mobile phone or other recording equipment can be used to record the test environment for recognition analysis and optimization.
Appendix¶
Speech recognition equipment | Corresponding mechanical noise of speech equipment | Corresponding environmental noise |
---|---|---|
Range hood | Range hood noise | Kitchen ambient noise |
Dishwasher | Dishwasher noise | Kitchen ambient noise |
Rice cooker | None | Kitchen ambient noise |
Microwave oven | Microwave oven noise | Kitchen ambient noise |
Soymilk maker | Soymilk maker noise | Kitchen ambient noise |
Coffee maker | Coffee maker noise | Kitchen ambient noise |
Refrigerator | Refrigerator noise | Kitchen ambient noise |
Air conditioner | Air conditioner noise | Living room ambient noise |
Electric fan | Electric fan noise | Living room ambient noise |
Vacuum cleaner | Vacuum cleaner noise | Living room ambient noise |
Humidifier | None | Living room ambient noise |
Note: The noise can be collected according to the actual application scenario of the terminal equipment
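The appendix table can be used to select the noise set for a device under test. The following sketch encodes the table as a dictionary; `APPENDIX` and `noise_set_for` are hypothetical names, and the lower-case keys are an illustrative choice.

```python
KITCHEN = "kitchen ambient noise"
LIVING_ROOM = "living room ambient noise"

# device -> (mechanical noise or None, environmental noise), from the appendix table
APPENDIX = {
    "range hood": ("range hood noise", KITCHEN),
    "dishwasher": ("dishwasher noise", KITCHEN),
    "rice cooker": (None, KITCHEN),
    "microwave oven": ("microwave oven noise", KITCHEN),
    "soymilk maker": ("soymilk maker noise", KITCHEN),
    "coffee maker": ("coffee maker noise", KITCHEN),
    "refrigerator": ("refrigerator noise", KITCHEN),
    "air conditioner": ("air conditioner noise", LIVING_ROOM),
    "electric fan": ("electric fan noise", LIVING_ROOM),
    "vacuum cleaner": ("vacuum cleaner noise", LIVING_ROOM),
    "humidifier": (None, LIVING_ROOM),
}


def noise_set_for(device):
    """Return the noise recordings to play for a device: ambient first, then mechanical."""
    mechanical, ambient = APPENDIX[device]
    return [noise for noise in (ambient, mechanical) if noise is not None]
```

Devices listed with no mechanical noise, such as the rice cooker, are tested with the environmental noise only.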