Data Collection · Conversational AI Technology · Generative AI
Creating Authentic, Real-Life Text Content at Scale
The Challenge
Our client sought to commission 180,000 pieces of real-life text content across six languages, including email, chat-style conversations, and paragraph-style texts. The dataset needed to demonstrate topical diversity with minimal repetition, and each entry had to meet strict requirements for length, number of conversational turns, and treatment of personally identifiable information (PII). A strong emphasis was placed on natural, everyday language reflective of each target locale.
The Solution
The DataForce team designed a customized and scalable process to deliver high volumes of natural, high-quality content. This included:
- Streamlined Writer Qualification and Training:
- Developed a scalable, multi-step qualification and training process that verified nativeness, writing experience, and overall quality through sample reviews.
- Collaborated across teams to refine the process, ensuring efficiency and scalability to meet the high writer volume required.
- Monitored continuously for fraudulent writer applicants and implemented evolving strategies to prevent unqualified writers from passing the qualification process.
- Customized Tech Workflow with Automations:
- Created a specialized schema on the DataForce platform for content submission.
- Leveraged automations for text normalization (including placeholder tags for PII), spelling/grammar checks, semantic similarity analysis, and file processing.
- Multipronged Approach to Ensure Diversity of Content:
- Developed a list of thousands of unique writing topics per language, suitable for each content type and locale.
- Devised a structured workflow with assigned topics and tones so that each submitted piece was unique and a wide variety of topics was covered.
- Established a process to evenly distribute topics across writers to further ensure diversity and uniqueness of data.
- Implemented semantic similarity checks to ensure content submitted wasn’t overly repetitive or duplicated.
- Applied multiple tones to encourage diversity.
- Quality Assurance & Performance Monitoring:
- Maintained close engagement with the QA team to monitor quality at the individual writer level.
- Implemented a sample review process for both content creators and the QA team to identify trends in quality and take corrective actions as necessary.
- Analyzed trends from client feedback and implemented corrective actions as necessary to achieve close alignment with client expectations.
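The text normalization step above can be illustrated with a small sketch. The placeholder tag names and regex patterns here are hypothetical, not the actual DataForce schema; a production pipeline would cover many more PII categories and locale-specific formats.

```python
import re

# Hypothetical placeholder tags and detection patterns -- illustrative only.
PII_PATTERNS = {
    "<PII_EMAIL>": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "<PII_PHONE>": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def normalize_pii(text: str) -> str:
    """Replace raw PII spans with placeholder tags before submission."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(tag, text)
    return text

print(normalize_pii("Reach me at jane.doe@example.com or +1 (555) 123-4567."))
# → Reach me at <PII_EMAIL> or <PII_PHONE>.
```

Tagging rather than deleting PII preserves sentence structure, so the resulting text still reads naturally for training purposes.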
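The semantic similarity checks described above can be approximated in a few lines. A production system would likely compare sentence embeddings; this stdlib-only sketch uses token-set Jaccard similarity as a stand-in, with a hypothetical threshold, to show the gating logic.

```python
import re

def token_set(text: str) -> set[str]:
    """Lowercase word tokens as a rough lexical fingerprint."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between two texts' token sets (0.0 to 1.0)."""
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_near_duplicate(candidate: str, accepted: list[str],
                      threshold: float = 0.8) -> bool:
    """Flag a submission that is too similar to any already-accepted piece."""
    return any(jaccard(candidate, prior) >= threshold for prior in accepted)
```

Submissions scoring above the threshold against any accepted piece would be routed back for rework, keeping the corpus free of near-duplicates.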
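The even distribution of topics across writers can be sketched as a simple round-robin assignment. The function name and data shapes are illustrative assumptions, not the actual workflow tooling.

```python
from itertools import cycle

def distribute_topics(topics: list[str], writers: list[str]) -> dict[str, list[str]]:
    """Assign topics to writers round-robin so no writer's queue
    dominates and every topic is covered exactly once."""
    assignments: dict[str, list[str]] = {w: [] for w in writers}
    for topic, writer in zip(topics, cycle(writers)):
        assignments[writer].append(topic)
    return assignments

print(distribute_topics(["t1", "t2", "t3", "t4", "t5"], ["ana", "ben"]))
# → {'ana': ['t1', 't3', 't5'], 'ben': ['t2', 't4']}
```

Because each topic is handed out exactly once, this scheme also supports the uniqueness goal: two writers never draft the same topic independently.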
Results
The team achieved an acceptance rate of over 99% for the initial five languages. Following the completion of initial delivery milestones, the scope expanded to include a sixth language, along with additional email and chat content incorporating offensive terminology across all six languages.
Throughout the project, the team maintained a 94% average acceptance rate across all languages. Impressed by the results, the client expressed interest in expanding the project to cover eight additional languages.
Thanks to DataForce’s customized workflows and QA process, the client obtained millions of words of diverse, natural, and high-quality email, chat, and paragraph-style content within just four months. This dataset became instrumental in training more accurate, inclusive, and balanced AI features and services.