Open Language Data Initiative

Our Values & Positionality

We acknowledge that data expansion and availability are a small piece of the giant puzzle of language equity in NLP research. As part of our commitment to responsible science, we stress the importance of adopting a community-centric approach to NLP, where the involvement, wellbeing, and interests of speakers are elevated.

First, we believe that the corpora of any given language belong to the people who speak it. Particularly for endangered languages, where the speaker population might be small, language data compilation by outside groups without gauging the interest of native speakers could be viewed as a form of exploitation. As such, whenever possible, we advocate for data contributors to carefully deliberate over their methodological choices, document ethical decisions, and potentially devise deployment strategies that amplify underserved communities’ ability to directly benefit from technologies built using their language data.

Related to the concept of community-centeredness, we also advocate for data contribution that captures the sociolinguistic diversity of how languages are used across place and setting. More specifically, instead of relying on frameworks that have conventionally worked for high-resource languages, we encourage contributions reflecting how languages are used in real, situated contexts (e.g., data that includes regional variants, dialects, colloquialisms, code-mixing, etc.).

We also believe that interdisciplinarity, where humanities and social science researchers work together with technical practitioners, can give rise to more ethically and socially aligned forms of data collection and NLP development. For instance, sociologists and anthropologists have long grappled with the epistemological implications of power and sociohistorical dynamics in the context of research, and having their perspectives on how to use participatory methods could be essential for sustainable language data collection in NLP.

Finally, we want to acknowledge our researcher positionality. While some of the organizers speak underserved languages, most of us are trained in Western institutions and are or were affiliated with major universities or AI research labs in the US and the UK. Occupying such positions may skew our stances on issues pertaining to language accessibility and NLP development, and we note that those who do not share our milieus may adopt different levels of criticality than ours.