Data is the lifeblood of artificial intelligence systems, from specific architectures to trained models. Hence, with the wide selection of industrial and commercial applications emerging from the ongoing modern AI revolution, a simple question arises: where does the data used to build AI systems come from? A group of over 50 researchers from both academia and industry set out to answer this through an undertaking called the Data Provenance Initiative.
The Data Provenance Initiative: Researchers Found That Current Data Practices Used to Build AI Systems Risk Concentrating Power in the Hands of a Few
The concentration of power in a few tech companies, coupled with Western-centric biases in training data, risks creating unrepresentative and inequitable AI systems. This is the main finding of the Data Provenance Initiative. The researchers noted that addressing these challenges requires greater transparency, regulation, and a global perspective on data collection and usage.
Background
Advancing artificial intelligence depends heavily on the quality and scale of training datasets. This is also why developing and deploying AI systems demands considerable time and other resources. However, despite the emergence of more sophisticated algorithms and larger models, including foundation models and frontier models, which have brought forth numerous AI applications like autonomous systems and generative artificial intelligence products, there has been limited empirical study of datasets beyond text-based ones.
The people behind the Data Provenance Initiative conducted the largest longitudinal audit to date of datasets spanning text, speech, and video. They covered about 4,000 public datasets produced from 1990 to 2024. These included data from 608 languages, 798 sources, 659 organizations, and 67 countries. Their findings, which were detailed in a 2024 paper and shared with MIT Technology Review, revealed a troubling trend. The researchers noted that current data practices used in building AI systems risk concentrating power overwhelmingly in the hands of a few.
Shift in Data Sources
Through their longitudinal audit, the researchers provided the historical context behind the evolution of data collection in the field of artificial intelligence. Findings showed a shift in data sources. Data practices during the 2010s involved curating data from diverse and purpose-specific sources like encyclopedias, parliamentary transcripts, earnings calls or financial reports, and weather data. Some data was also culled from the web. The resulting datasets were smaller and tailored for AI systems designed for specific tasks.
However, following the invention of a deep learning architecture called the transformer in 2017 by a team of researchers at Google Brain, which is now part of Google DeepMind, developers started to focus on larger and more heterogeneous datasets. The industry shifted to indiscriminate data scraping from the internet. Web content has become the main data source across content types like text, image, audio, and video. YouTube emerged as the leading source for multimodal models, with over 70 percent of video data originating from the platform.
Transformers have paved the way for the development and deployment of large language models, which power generative text and speech applications, and multimodal language models, which extend the capabilities of generative applications. The shift in data sources has centralized data collection around a few large platforms and has given large tech companies significant control over critical AI training data. For example, Google, which owns YouTube, has significant leverage over AI training efforts that require video and speech data.
Concentration of Power
A few major companies, such as Google, OpenAI, and Meta Platforms, have disproportionate control over AI data through exclusive data-sharing agreements with publishers, social media platforms, and forums. These companies have the resources to access, process, and store vast datasets. Their financial leverage also enables them to outbid smaller entities for exclusive rights or to negotiate large-scale data-sharing agreements. These factors sideline smaller players like startup AI companies, nonprofits, and independent researchers.
It is also worth noting that a significant portion of data carries restrictive licenses. These include non-commercial restrictions that limit access for small players and reinforce the competitive edge of big tech companies. Over 33 percent of datasets are restrictively licensed, and over 80 percent of data sources have non-commercial restrictions. Because of these conditions, the researchers noted, the current data practices in the field create a power imbalance in which larger companies control both the development of and access to AI capabilities.
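The kind of license audit described above can be sketched in a few lines. The dataset records, license identifiers, and schema below are invented for illustration; they are not the initiative's actual data or methodology.

```python
# Hypothetical sketch of a dataset license audit. All records and the
# "non_commercial" flag are made-up examples, not real audit data.

def license_summary(records):
    """Return totals and the share of records carrying a
    non-commercial restriction."""
    total = len(records)
    nc = sum(1 for r in records if r["non_commercial"])
    return {"total": total, "non_commercial": nc, "nc_share": nc / total}

datasets = [
    {"name": "corpus-a", "license": "CC-BY-4.0", "non_commercial": False},
    {"name": "corpus-b", "license": "CC-BY-NC-4.0", "non_commercial": True},
    {"name": "corpus-c", "license": "custom-restricted", "non_commercial": True},
    {"name": "corpus-d", "license": "CC0-1.0", "non_commercial": False},
]

summary = license_summary(datasets)
print(summary)  # half of these toy records are restricted
```

At the scale the researchers worked at, the hard part is not this arithmetic but resolving each dataset's actual license terms, which is precisely the traceability problem the study highlights.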
Geographic and Linguistic Bias
The researchers also found that over 90 percent of datasets are from Europe and North America. Fewer than 4 percent come from Africa and other underrepresented geographic regions. English is also the dominant language, a dominance that reflects the fact that about 90 percent of all content available on the internet is in English. This geographic and linguistic bias reinforces a Western-centric worldview in available AI models. This means that these models tend to reflect the cultural, social, and historical contexts of Europe and North America.
Limited representation results in bias reinforcement and specifically risks disregarding or even erasing non-Western traditions and non-English languages from AI-generated outputs. There are several examples to illustrate this. For instance, when prompted, models may struggle to generate content about non-Western cultures or events because they default to Western analogies or perspectives. These biased models might also prioritize Western values and norms while neglecting or misinterpreting those from other geographic regions and cultures.
It is still worth underscoring that the number of languages represented in datasets has grown. This growth may be due to broader data collection efforts or new initiatives targeting underrepresented languages. The fact remains, however, that coverage has not significantly improved in proportional terms. Many cultures and languages remain excluded. This continued exclusion might stem from the absence of content written in other languages, from efforts focusing on widely spoken or economically influential languages, and from specific language barriers.
Ethical Concerns
Another notable issue in current data practices used in building AI systems is that companies are unable to trace the origins of all data used in their models due to the complexity of how datasets are aggregated. The study found that 25 percent of text, 33 percent of speech, and 32 percent of video datasets carry non-commercial licenses and therefore cannot be commercialized. However, because companies cannot trace the exact origins of all data, they have no way to ensure that their models comply with copyright laws or ethical standards.
The rising reliance on synthetic data also raises concerns about the quality and realism of datasets. Note that synthetic data are artificially generated rather than collected from real-world sources. These are created using algorithms, simulations, or generative models. Poorly generated synthetic data may not accurately represent the nuances of real-world scenarios. Furthermore, if they lack realism, models trained on them may perform poorly in real-world applications. The researchers noted that the issue may amplify and compound existing biases.
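The realism risk described above can be illustrated with a minimal standard-library sketch. The "real" data here is a toy skewed distribution, and the "synthetic" generator is a plain Gaussian fitted to its mean and standard deviation, a deliberately crude stand-in for a generative model; none of this reflects the study's actual data.

```python
# Toy illustration of synthetic data missing real-world nuances.
import random
import statistics

random.seed(0)

# Toy "real-world" data: strictly positive, with a heavy right tail.
real = [random.lognormvariate(0, 1) for _ in range(10_000)]

# Naive synthetic generator: a Gaussian matched to mean/stdev only.
mu, sigma = statistics.mean(real), statistics.stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(10_000)]

# The mismatch shows up at the edges: the Gaussian produces impossible
# negative values, while underrepresenting the real data's extreme tail.
print("smallest real value:", min(real))
print("negative synthetic samples:", sum(v < 0 for v in synthetic))
```

A model trained on the synthetic sample would see plausible-looking averages but the wrong tail behavior, which is one way poorly generated synthetic data can degrade real-world performance and compound existing biases.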
Findings from the 2024 Data Provenance Initiative in a Nutshell: Conclusion and Implications of Current Data Practices in Building Modern AI Systems
The longitudinal audit of about 4,000 public datasets produced from 1990 to 2024 provides a unique, ecosystem-level view of trends in current data practices, including data sourcing, usage restrictions, and representational biases. It underscored critical gaps in dataset diversity and licensing clarity, as well as a heavy reliance on web-dominant sources. These collectively create a power imbalance in which large corporations control both the development of and access to AI capabilities while reinforcing biases in AI systems.
Findings showed that the current data practices used in developing AI algorithms and training AI models prioritize scale over diversity and ethics. The dominance of big tech companies, together with the bias toward Western culture and the English language, concentrates power in the hands of a few dominant organizations and creates a West-leaning monoculture. These trends underscore the need for responsible data collection practices, more diverse datasets, and stricter accountability to ensure AI benefits a broader spectrum of humanity.
FURTHER READINGS AND REFERENCES