Data Collection

Building a Multilingual Speech Corpus

DataForce supports a global audio hardware leader with high-quality data for fine-tuning its ASR engine.

The Challenge

Automatic speech recognition (ASR) systems convert spoken user commands into text, which natural language processing systems then process. An effective ASR implementation has to account for variation in sound and voice across genders, age groups, accents, and dialects, as well as the background noise of the environment where the system will be used. In this case, the client needed training and test data collected from multiple demographic groups in English, Hindi, German, French, and Italian.

The Solution

DataForce collected voice data and background noise across several scenarios using our proprietary mobile app, DataForce Contribute. The app ensured that the audio files met all technical requirements, such as signal-to-noise ratio and sampling rate. Once all voice commands and ambient noise had been collected in parked, driving, and windows-open and windows-closed conditions, convolving the recorded signals produced data sets that simulated a real environment. With DataForce’s solution, the client developed and tested an efficient ASR engine capable of understanding voice commands in several languages across different scenarios.
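To make the idea of simulating a real environment concrete, here is a minimal sketch of how clean voice commands and recorded ambient noise could be combined. It is not DataForce's actual pipeline: the function names (`rms`, `snr_db`, `convolve`, `mix_at_snr`), the synthetic test signals, and the 10 dB target are all illustrative assumptions. The sketch shows two common building blocks: convolution with an impulse response (for example, a car cabin) and additive mixing of noise at a chosen signal-to-noise ratio.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels between two recordings."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


def convolve(signal, impulse_response):
    """Full discrete convolution, e.g. speech with a (hypothetical) cabin impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so the speech-to-noise ratio equals target_snr_db, then add it.

    Returns the mixed signal and the scaled noise that was added.
    """
    if len(noise) < len(speech):
        raise ValueError("noise recording must be at least as long as the speech")
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise recording is silent")
    gain = rms(speech) / (noise_rms * 10 ** (target_snr_db / 20.0))
    scaled = [n * gain for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Illustrative stand-ins for real recordings (assumed 16 kHz sampling rate).
SAMPLE_RATE = 16000
speech = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

# Hypothetical two-tap impulse response: direct sound plus one weaker reflection.
reverberant = convolve(speech[:200], [1.0, 0.3])

# Mix the speech with noise at an assumed target of 10 dB SNR.
mixed, scaled_noise = mix_at_snr(speech, noise, target_snr_db=10.0)
```

In a real project the synthetic sine wave and random noise would be replaced by the collected voice commands and the parked, driving, or windows-open recordings, and the target SNR would come from the client's technical requirements.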

DataForce has a global community of over 1,000,000 members and linguistic experts in over 250 languages. DataForce operates its own platform but can also work with client or third-party tools. This way, your data is always under control.

Request a consultation.