Building Culturally Adapted Datasets for Global Conversational AI

Multilingual data is essential for enabling global conversational AI. Rich datasets are used to build systems that can deliver experiences that feel natural in a target culture. However, datasets often focus too heavily on “standard” language usage and don’t take into account local market realities and the rich variation in human language production. In this talk, Aaron Schliem, Senior AI Solutions Architect at Welocalize, will offer insights to help ML and AI teams source datasets that are more representative of local cultural realities. These tips will help you move past basic fluency to truly adapted multilingual experiences.

Key topics that will be covered include:
- Dimensions of culture that should be considered in datasets
- Designing data collection tasks that are culturally adapted
- Planning for bilingualism, code-switching, and lingua franca
- Traditional grammar versus real natural language production
- Full-spectrum linguistic inclusivity: race, age, education, sexual identity
- Sourcing the right kinds of data generation workers
- Applications for NLP in building datasets and ensuring data quality

Speakers: