How African NLP Experts Are Navigating the Challenges of Copyright, Innovation, and Access

AI producers need to better consider the communities directly or indirectly providing the data used in AI development. Case studies explore tensions in reconciling the need for open and representative data while preserving community agency.

by Chijioke Okorie and Vukosi Marivate
Published on April 30, 2024


As an ideal or a practice, openness in artificial intelligence (AI) involves sharing, transparency, reusability, and extensibility that can enable third parties to access, use, and reuse data and to deploy and build upon existing AI models. This includes access to developed datasets and AI models for purposes of auditing and oversight, which can help to establish trust and accountability in AI when done well.

Certain common sayings in African languages encapsulate how issues of agency and community ownership are implicated or threatened when openness is embraced in a bid to include Africa and other parts of the Global South in discussions about the responsible use and development of AI. In the Igbo and Setswana languages, these sayings include expressions that speak to how discussions about taking (or bringing) often revolve around other people’s property. The Igbo saying wete wete ka nma n’akpa onye ozo means that people always recommend the sharing of property when such property is not theirs.1 In Setswana, the saying pelo e senang phufa, selo e a be e se sa yona essentially means that it is easy to misuse, abuse, or care less about property that does not belong to you because if something is yours, you will certainly see to it that it is taken care of.

Actors in the Global North have been the primary drivers of discussions about responsible AI, and they have focused such discussions on concepts like openness, privacy, and copyright protections. However, in recent times, there have been increased efforts to amplify perspectives from underrepresented and/or unrepresented jurisdictions including ones in the Global South so they can help shape discussions about responsible AI use and development. Within this atmosphere of inclusion (referred to here as the Global South inclusion project), openness, privacy, and copyright have continued to feature as important and indispensable considerations.

In the context of digital technology and software, the Global South inclusion project has often been underpinned by a requirement of openness. The intention has been to promote broader access and address and/or sidestep privacy and copyright issues arising from both the data needed to build AI systems and the datasets that are one outcome of building and using such systems.2 Essentially, the Global South inclusion project benefits from pushing for openness because in many instances, once data has been made open, it allows for a sidestepping of privacy and copyright issues by users of such data. However, there are more factors related to the Global South inclusion project to consider and grapple with.

First, builders of AI systems need to give greater consideration to the communities directly or indirectly providing the data used in commercial and noncommercial settings for AI development. These communities may include owners of traditional cultural expressions and traditional knowledge; data scientists and AI developers from African countries working on data collection, collation, curation, and annotation; linguists working on African languages; and users who provide or upload content (data) on African languages and practices on social media and other internet platforms.3 However, while openness in developing and deploying AI models offers transparency and shared learning, it can sometimes conflict with privacy and proprietary rights. By contrast, closed models prioritize proprietary information but can limit shared innovation.

Furthermore, fair use and representation—meaning fair and equitable access to and use of data and the inclusion of critical ethical considerations specific to diverse groups of people or contexts related to AI development and governance—are vital in AI, especially for those in the Global South.4 Ensuring that AI is used ethically and represents diverse populations can help improve fairness and minimize bias. This requires collective efforts, considering the broader impacts on society and on the individuals who contribute data.

The act of resharing data, while crucial for collaborative innovation and improvement, complicates dialogue about the privacy, copyright, ownership, and commercialization of such data. Perhaps as a result of the mistaken view that countries in the Global South are monolithic, these dynamics are often overlooked in the framing of the Global South inclusion project. But they underscore a complex ecosystem where data is not just a resource but a bridge connecting diverse communities, each with distinct, often conflicting, interests and concerns.

To highlight these interests and concerns, this work features a study of the African natural language processing (NLP) community, presenting insights from the work of the Masakhane Research Foundation (a distributed research organization with the mission to advance African NLP),5 Ghana NLP (an open-source initiative focused on NLP involving Ghanaian languages),6 and KenCorpus (a community-driven project to create large Kenyan language datasets).7 The experiences of this community help to ground the practical trade-offs and challenges that arise in this discipline.

The Development of African NLP

Language data representing the wide variety of spoken African languages is scarce. A continent-spanning community is emerging to address this digital data scarcity, a community composed primarily of African AI and NLP researchers interested in applying AI to solve problems prevalent on the African continent. These researchers rely heavily on the use, resharing, and reuse of African language- and context-focused data (that is, openness) to fuel their innovations, analysis, and developments in AI.

In the Global South, but particularly on the African continent, state institutions, public bodies (such as state broadcasters), and private organizations (such as commercial news services) are pivotal repositories of valuable data about local languages.8 These entities often express concerns about the potential commercial viability of the data they hold. For instance, in response to requests from data science researchers to use local language data from South Africa’s public service broadcaster, it was suggested that such data, when used to train NLP models, could be commercialized and that prior licensing arrangements must therefore be undertaken.9 The struggle is to strike a delicate balance between preserving the integrity and proprietary rights of the data (such as copyright protections) and acknowledging the need for access to data, especially data regarding African contexts and other parts of the Global South.

The African NLP and AI research community referenced in this work—Masakhane, Ghana NLP, and KenCorpus—is caught in a tough spot, juggling the need to access and share data with legal rules that protect the privacy and ownership of data. Laws like copyright and data protection can sometimes limit the sharing of information needed for innovation. There is widespread recognition that, while these laws are essential for keeping data proprietary, secure, and private,10 such laws can also make it challenging for professionals to access the data they need for their work.11 As such, a demonstration of openness—meaning, the waiver or nonretention of (some) proprietary rights as a practice—is a necessary and viable practice for counteracting these restrictions and addressing these challenges.12

Yet even though openness can help address copyright and privacy concerns, the idea that such openness is a panacea for Global South inclusion and for collaborative innovation and improvement ignores the threats that openness may present to the agency and community ownership of affected stakeholders in the Global South.13 This article also explores questions about whether existing copyright or privacy frameworks are sufficient to capture the issues of agency and community ownership that are implicated or threatened when openness is embraced.

The African Community Landscape of AI Development

The proliferation of grassroots AI organizations across Africa directly addresses the gaps left by the swift advancement of AI on the continent, which other actors have been largely driving. Initiatives like Masakhane, which seeks to strengthen African NLP by and for Africans, recognize that impending innovations could sideline African languages while enabling inaccurate or suboptimal models to permeate the region. For example, many multilingual models claim to support African languages but are not fit for purpose for the communities they claim to serve. Incorrect translations may have life-changing effects depending on how they are used.14 This challenge has already materialized in content moderation of online services.15

As in most AI systems, high-quality data is essential for developing NLP tools and systems, yet many African languages lack robust digital resources and are considered low-resource.16 Low-resource languages lack large monolingual or parallel corpora (collections of linguistic data in the form of written text or transcriptions of recorded speech) and/or manually crafted linguistic resources sufficient for building statistical NLP applications. Data about African languages and culture connects the diverse disciplines working to advance languages. Linguists collect corpora to study languages, while community archivists document languages and culture. Journalists communicate with readers while trying to capture their perspectives. And AI researchers use data to build models. Without cross-disciplinary and cross-domain collaboration (between the areas of linguistics, journalism, and AI), communities may lose their ability to guide how their languages progress amid the AI revolution. More open communication channels between communities, researchers, private actors, and government actors are imperative for articulating societal needs and priorities in an evolving technological landscape.

Communities of AI researchers with limited access to financial resources face inherent challenges in generating the data necessary for AI development. This data scarcity particularly impacts linguistic diversity, as the effects of colonialism and global power structures often sideline under-resourced languages even when they have millions of speakers. For example, most chatbots are built on high-resourced languages such as English because of the availability of data in those languages, sidelining access for people who can only speak, read, or write in an African language. To overcome this divide, grassroots NLP collectives leverage collaborative social and human capital rather than financial means. Initiatives like Masakhane, which has amassed a network of more than 2,000 African researchers actively engaged in publishing research, and the KenCorpus project unite researchers to elevate local languages.

By embracing the principles of openness and transparency in sharing experiences, data, code, and resources, these communities are making remarkable strides. Masakhane, for example, has been recognized for its impact on democratizing the internet,17 while Ghana NLP’s Khaya app (which translates Ghanaian languages) has thousands of users,18 and KenCorpus has now been downloaded more than 500,000 times.19 Their approach also emphasizes participatory methodologies, with many coauthors contributing to these collective efforts. Importantly, grassroots groups make data findable, accessible, interoperable, and reusable (principles that together constitute the FAIR framework) with guidance from allies.20 In driving their languages forward on their own terms, grassroots NLP groups demonstrate that barriers to access can be overcome through inclusive cooperation and innovation. With sustainability in mind, groups such as Ghana NLP offer some of their tools under commercial access models while continuing, as far as they are able, to contribute to the open resources available to all researchers. In recognition of and support for these approaches, funders of NLP and AI data projects in the Global South should proceed from the understanding that financial support must serve the public good by encouraging responsible data practices.

The examples above illustrate a vital need for open collaboration and knowledge sharing to harness collective human capital while building social capital within grassroots AI movements. However, a complex tension emerges when copyright restrictions limit access to existing language resources or bar the open distribution of resulting tools and models. Tensions may equally arise when AI researchers in the Global South face pressure to distribute or make available the tools and models from their work without any copyright restrictions. Some stakeholders prioritize financial incentives or control over linguistic assets they have developed. For example, the decision by the copyright holder of the JW300 dataset—which contains translations of biblical texts in more than 300 languages—to remove this rich dataset from the public domain has had a major impact on NLP development for African languages.21

Others such as grassroots groups may be concerned about ensuring that local communities benefit as directly as possible from linguistic and other assets developed from data about these communities. Strict proprietary limitations can severely curb the progress of these stakeholder groups in terms of the Global South inclusion project. There is an urgent need to align priorities among creators seeking reasonable returns and communities enabling access, so that more people can equitably build technologies preserving the intangible cultural heritages tied to certain languages. With compromises valuing both open innovation and systems to recognize and reward the contributions of local communities, it is possible to formulate policies nurturing advancement rather than slowing progress through legal constraints. Ultimately, ethical frameworks should promote using language technology in ways that put the public good above profits.

Tensions Between Openness and Agency in African AI Development

“Wete wete ka nma n’akpa onye ozo.”—“Bring [this], bring [that] is an easy request if it is from someone else’s [not the person asking] bag/pocket.”

“Pelo e senang phufa, selo e a be e se sa yona.”—“It is easy to misuse/abuse or care less about property that does not belong to you, for if something is yours, you will certainly see to it that it is taken care of.”

As indicated earlier, there are several aspects of agency and community ownership that are implicated or threatened when the norm of openness is embraced in relation to the Global South inclusion project. In the context of AI, data openness speaks to mechanisms that allow or require access to data and underlying metadata to be free (no cost) and free from restrictions that could make such data inaccessible. Enforcing, imposing, or according copyright protections to such data could impose restrictions on accessibility, given the exclusive nature of such protections. And from the perspective of privacy rights, there is a need to secure private information from public exposure. Framed in this way, these two regimes—one prioritizing copyright protections, and one prioritizing privacy rights—could be justifiably opposed to practices that make data accessible, sometimes at no cost. For example, in data donation projects, various persons contribute or donate voice data, which could qualify as personal information, to a platform or database. Such a database could be the subject of copyright protection, but some of its contents are considered personal information and are therefore subject to privacy rights.

Openness as a practice seeks to address these accessibility issues in part through licensing mechanisms that do not assert copyright protections or restrictions to data. Openness also facilitates the findability, accessibility, interoperability, and reusability of data—the full FAIR spectrum.22 However, while the experiences of Masakhane, Ghana NLP, and KenCorpus as shared above offer evidence regarding the benefits of openness as a practice, there is also recognition (and evidence) that openness needs to be nuanced for specific contexts and should also account, for example, for community rights and agency in relation to data ownership and use.

Current approaches to openness among the community of African AI researchers as highlighted above involve the use of open licensing regimes that have a viral nature. The very fact of using or reusing these datasets means consenting to the proprietary nature of the data and other terms upon which the data is made available. Some of these terms may mean that, while the proprietary nature of the data is acknowledged, such proprietary rights are given up in their entirety.

For example, this is the case when licenses such as the Creative Commons’ CC0 are used. The CC0 license designed by the Creative Commons movement seeks to enable creators and owners of copyright- or database-protected content to waive those interests in their works and thereby place them as completely as possible in the public domain, so that others may freely build upon, enhance, and reuse the works for any purposes without restriction under copyright or database law.23

Sometimes, the terms of such approaches to openness treat data sources—including African AI researchers, communities with traditional cultural expressions and traditional knowledge, state institutions, public bodies (including state broadcasters), and private organizations (including commercial news services)—and their needs and interests as though they were the same. While these communities are—and in the case of the grassroots community of African AI researchers, have become—pivotal repositories of valuable local language data, their needs and interests may vary.

This is an important issue. Access to data on African languages has proven difficult over the years, leading to the birth of the African NLP communities highlighted above. However, using data scarcity as the major reason to adopt existing forms of openness could have unintended consequences by solving one problem and leaving other problems unsolved. It is necessary to take a holistic look at the continent’s context and to consider the consequences for communities such as African NLP researchers (with their need for African language data), indigenous communities, and the users who generate data about traditional cultural expressions. There is an inventor’s paradox to address here: it may prove easier in the long run to solve the whole problem of language development than to solve only the narrower problem of data access.

AI innovation as it has been defined to date has tended to sideline African languages. For instance, the low-resourced state of African languages consequently leads to fewer AI products, services, and tools made for the African context. The grassroots movement is responding and seeking to counter this trend by authorizing the reproduction, reuse, and dissemination of local language data. Given these objectives, the seemingly available choices of open licensing regimes for the community of AI researchers become quite narrow. This community tends to focus on licensing regimes that allow free distribution, the making of derivative works (meaning reuse in the same or different environments), and attribution.

On the other hand, communities focused on the commercial viability of the local language data in their custody would prefer a licensing regime that, while being open and permitting free access, leaves room for commercialization wherever feasible. Currently, the Creative Commons Attribution-NonCommercial license—which requires re-users of a given material to give credit to the creator and also allows such re-users to distribute, remix, adapt, and build upon the material in any medium or format, for non-commercial purposes only—is intended to leave room for commercialization. However, the extent to which commercialization is feasible is questionable, particularly for materials such as data that may be hard to track once released and used as training data or in NLP/AI models.

For individuals who are members of a community recognized for specific cultural practices embodying valuable traditional cultural expressions, a preferred openness approach would be one that would not permit commercialization that excludes them from tools developed from the use of their local language data.24 In essence, depending on the approach to openness that is embraced, the agency and autonomy of some of these communities to propose alternatives may be significantly diminished.

The Masakhane initiative is an appropriate example. The MakerereNLP project involved the delivery of open, accessible, and high-quality text and speech datasets for East African languages from Uganda, Tanzania, and Kenya. The datasets comprised corpora and speech datasets obtained from various sources including free, crowdsourced voice contributions. These datasets were licensed under a Creative Commons BY-SA license, which requires giving credit to the creator. Under this license, the dataset can be used for any purpose, including commercial purposes, and adaptations or derivative data outputs must be shared under identical terms. In essence, the license allows commercial uses, which may lead to products derived from the datasets being sold for a fee to the communities who contributed voice data for free.

Conversely, a commercial enterprise may feel constrained in using such outputs and investing in their further development given the requirement that they must make derivative datasets publicly available under similar terms. In the case of a CC0 license, there is no requirement to likewise share under identical terms or to attribute or acknowledge the source of a dataset, and there are no restrictions on commercial or noncommercial purposes. In such instances, the autonomy and agency of data contributors and data sources to be part of the decisionmaking processes for the (possible) varied uses of the data they have contributed may be negatively impacted.

This paper does not seek to discredit the principle of openness; rather, it seeks to argue for a practice of openness that addresses the concerns of a diverse range of stakeholders and that does not threaten their agency or autonomy. The experiences shared in this research show that openness has contributed to the growth of grassroots movements for AI development in Africa. However, to be meaningful, the inclusion project should consider and address the ways in which exclusion or exploitation could happen amid such inclusion attempts. There must be recognition that, while these communities share an affinity in terms of the same kinds of local language data, their interests and objectives may differ. Inherent in this recognition is also an acknowledgment of the diversity of the data sources. Conflating these realities in the choice of open licensing regimes misses the point.

Having made giant strides with their grassroots movement and open sharing culture, African NLP researchers are still left with the mismatch between their adopted ideology and strategy of openness on the one hand and the diverse breadth of the region’s AI community on the other. This diverse range includes, for example, African NLP researchers, data contributors who participate in and contribute to crowdfunded data projects, commercial entities, local communities who may provide context for data, and funders who facilitate the creation of datasets. One key source of tensions is the AI commercial pipeline. It involves demands and/or pressure from many quarters to adopt a licensing regime that does not interfere with the commercial pipeline or make it untenable.

Similarly, tensions arise from the absence of a real choice of licenses to address the concerns stated above. Focusing on the commercial pipeline of these datasets, licenses that restrict commercial uses are not feasible. In some cases, licenses that require attribution may also not be feasible because attribution requires that users are transparent about the provenance of their data. This may be an issue for privacy considerations, particularly in cases where personal information is used. Share-alike licenses (which require re-users to share their derivative outputs with the same license as the original/source material) may suffer the same fate because, although they are good for ensuring that the diverse community continues to have access, they create problems for the commercial pipeline. In light of these issues, licenses with no restrictions present the only choice.
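
The narrowing of choices described above can be sketched as a simple elimination over license terms. The following Python snippet is only an illustration, not a legal analysis: the `License` class, the function name, and the `attribution_feasible` flag are hypothetical names introduced here, and the three boolean fields are a deliberate simplification of what each Creative Commons instrument actually says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class License:
    """Simplified view of a license; the field names are illustrative assumptions."""
    name: str
    requires_attribution: bool
    requires_share_alike: bool
    allows_commercial_use: bool


# Core terms of four Creative Commons instruments discussed in this article.
LICENSES = [
    License("CC0", False, False, True),
    License("CC BY", True, False, True),
    License("CC BY-SA", True, True, True),
    License("CC BY-NC", True, False, False),
]


def acceptable_to_commercial_pipeline(lic, attribution_feasible=False):
    """Apply the three objections raised from the commercial-pipeline viewpoint."""
    if not lic.allows_commercial_use:
        return False  # restricting commercial uses is treated as not feasible
    if lic.requires_share_alike:
        return False  # share-alike creates problems for the commercial pipeline
    if lic.requires_attribution and not attribution_feasible:
        return False  # attribution may expose data provenance (privacy concern)
    return True


remaining = [lic.name for lic in LICENSES if acceptable_to_commercial_pipeline(lic)]
print(remaining)  # ['CC0']
```

Under these assumptions only the no-restriction option survives. If attribution were treated as feasible (`attribution_feasible=True`), CC BY would also remain, which shows that the outcome depends on which stakeholder's constraints are encoded as hard requirements.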

Although the need to include communities in the Global South in decisionmaking processes on AI governance and development is recognized, the reality of the experiences recounted in this paper shows that such inclusion may sometimes be at the expense of the agency and autonomy of some members of communities in the Global South. Founded on the tenets of openness, such an inclusion approach focuses on preaching “wete wete” (bring, bring) while ignoring the exclusion that may arise from commercial tools built through the use of openly available data. This is the issue that South Africa’s Constitutional Court observed in its ruling in a relevant 2022 case when it said:

to avoid unfair discrimination, [the state] must treat people in the same way or make available the same entitlements. But sometimes what is required of the state is to recognise the differences between persons and to provide different or more favourable treatment to some, so as to secure non-discriminatory outcomes for all.25


This research has highlighted some of the opportunities and challenges presented by considerations around openness as a way to address copyright and privacy concerns that curtail perspectives from underrepresented and unrepresented jurisdictions including in the Global South when it comes to shaping discussions about responsible AI use and development. Openness must be practiced in a manner that considers the communities directly or indirectly providing the data used in commercial and noncommercial settings for AI development. The interests of these communities may, depending on the use case, involve financial benefits, social benefits, or (mere) attribution or acknowledgment.

Copyright and privacy rules may, as a result of their proprietary and rule-based nature, result in practices that discourage openness. Yet addressing the restrictive and proprietary nature of these rules through openness does not and should not mean that openness is adopted without attending to the nuances of specific concerns, contexts, and people. In adapting openness to the nuances of the contexts of Africa (and the Global South), consideration must be given to the agency and autonomy of specific stakeholders to make decisions about the uses of their data contributions, created and annotated datasets, and the needs that AI tools and development are designed to address in the first place. The intersectionality of these concerns necessitates a comprehensive approach to data governance, one that addresses the multifaceted challenges and opportunities presented by Africa’s evolving data landscape.

From a regulatory standpoint, copyright and privacy laws may need internal reforms, or there may be a need for a specific sui generis piece of legislation such as the European Union has undertaken with the recently passed Artificial Intelligence Act. However, of more immediate benefit, given the protracted nature of legislative reforms, is the use of contracts and private ordering regimes. The doctrine of freedom of contract means that changes and tweaks can be made in existing open licensing regimes to address relevant challenges and harness relevant opportunities.

The good news is that private actors can directly make changes and tweaks to open licensing regimes to address the challenges and harness the opportunities outlined in this paper.


1 The literal translation means “bring [this], bring [that] is an easy request if it is from someone else’s bag or pocket [not that of the person asking].”

2 One inevitable outcome of joining a forum when the meeting is already underway is that one tends to adopt existing agendas and considerations or else risks being labeled someone who wants to destroy the progress already made.

3 A 2020 European Parliament report noted, “AI builds on data that capture socio-cultural expressions represented by music, videos, images, text, and social interactions, and then makes predictions based on these profoundly non-neutral and context-specific data.” See Baptiste Caramiaux, “The Use of Artificial Intelligence in the Cultural and Creative Sectors,” European Parliament Policy Department for Structural and Cohesion Policies, 2020. See also Maori Data Sovereignty Network, “What Is Maori Data Sovereignty?”

4 Not necessarily in the sense of copyright law.

5 Masakhane is a grassroots organization whose mission is to strengthen and spur NLP research in African languages for Africans by Africans.

6 Ghana NLP is an open-source initiative focused on NLP of Ghanaian languages and its applications to local problems.

7 The Kenya Language Corpus was founded by Maseno University, the University of Nairobi, and Africa Nazarene University early in 2021. These universities have been jointly creating a language corpus, and while using machine learning and NLP, are creating tomorrow’s African language chatbot.

8 Vukosi Marivate, “Why African Natural Language Processing Now? A View From South Africa #AfricaNLP,” in Leap 4.0: African Perspectives on the Fourth Industrial Revolution, ed. Zamanzima Mazibuko-Makena (Johannesburg, South Africa: Mapungubwe Institute for Strategic Reflection, 2021): 126.

9 “P1 Computational Research: Africa Examples, Right to Research in Africa Conf., Pretoria 23Jan2023,” YouTube video, 2:21, posted by “Recreate ZA,” January 23, 2023, accessed September 12, 2023.

10 Including through copyright ownership.

11 Benefits of openness including Creative Commons Licenses. It is also worth acknowledging that there is extensive literature on the use of openness to counteract these restrictions.

12 Chijioke Ifeoma Okorie, Multi-Sided Music Platforms and the Law: Copyright, Law and Policy in Africa (London: Taylor and Francis, November 2019), 34–36.

13 As opposed to external/for-profit extraction of data.

14 Johana Bhuiyan, “Lost in AI Translation: Growing Reliance on Language Apps Jeopardizes Some Asylum Applications,” Guardian, September 7, 2023.

15 Gabriel Nicholas and Aliya Bhatia, “Lost in Translation: Large Language Models in Non-English Content Analysis,” Center for Democracy and Technology, May 23, 2023.

16 Alon Halevy, Peter Norvig, and Fernando Pereira, “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems 24, no. 2 (2009): 8–12; and Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury, “The State and Fate of Linguistic Diversity and Inclusion in the NLP World,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Kerrville, Texas: Association for Computational Linguistics, July 2020), 6282–6293.

17 “Wikimedia Foundation Research Award of the Year,” Wikimedia Research.

18 “Khaya: Translate African Languages,” GhanaNLP.

19 “Kencorpus: Kenyan Languages Corpus,” Harvard Dataverse.

20 Mark D. Wilkinson et al., “The FAIR Guiding Principles for Scientific Data Management and Stewardship,” Scientific Data 3 (2016).

21 “A ‘Blatant No’ From a Copyright Holder Stops Vital Linguistic Research Work in Africa,” Walled Culture, May 16, 2023.

22 Thomas Margoni and Luca Schirru, “The Role of Licensing in Data FAIRization,” Presses Universitaires du Septentrion, 2023.

23 Creative Commons, “CC0.”

24 See, for example, Maori Data Sovereignty Network, “What Is Maori Data Sovereignty?”

25 See paragraphs 67–69 of Constitutional Court of South Africa, “Blind SA v Minister of Trade, Industry, and Competition and Others,” Constitutional Court of South Africa, 2022.