The UK government has a treasure trove of valuable datasets. We’re talking about official statistics, cultural heritage records, and crucial NHS health data. These datasets have fueled scientific discoveries, sparked business innovations, and led to better public services.
Now, with the launch of the much-awaited AI Opportunities Action Plan, we see just how important government data is for harnessing AI. But there’s a catch. Recent research from the Open Data Institute (ODI) points to major flaws in how these datasets are prepared and shared for AI use.
Let’s dive into the reliability of government data in the AI realm. Foundation models like ChatGPT and Gemini have become popular tools for accessing information on public services and policies, but the ODI found that they often miss the mark. Rather than drawing on government data repositories, they frequently rely on dubious sources such as social media, or fabricate answers entirely. This matters: if someone uses AI to work out their benefit entitlements and receives misleading advice, it can shake their trust in both AI and government services, especially when the government aims to use AI to enhance those very services.
The AI Opportunities Action Plan, authored by Matt Clifford, identifies the National Data Library (NDL) as key to unlocking government data for AI developers. Yet many government datasets are published in formats that don’t work well for AI. For instance, an ODI analysis found that CommonCrawl, a major repository of AI training data, had scraped 13,556 pages from data.gov.uk, yet those pages did little to improve the accuracy of model outputs.
One major issue is that government data isn’t always published in AI-friendly formats. Technologies like DCAT exist to assist in making data discoverable, but scraping tools such as CommonCrawl don’t leverage these technologies fully. As a result, AI models often fetch information from lesser-quality sources, reinforcing misinformation. If the UK wants to lead in AI innovation, this gap in data quality must be fixed.
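To make the discoverability point concrete, here is a minimal sketch of the kind of machine-readable catalogue entry DCAT enables, serialised as JSON-LD in Python. The vocabulary terms follow the W3C DCAT specification, but the dataset, publisher, and URL are hypothetical examples, not real data.gov.uk records.

```python
import json

# A minimal, hypothetical DCAT dataset description serialised as JSON-LD.
# Property names follow the W3C DCAT vocabulary; the dataset itself is invented.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example official statistics dataset",
    "dct:publisher": "Example government department",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
        "dcat:downloadURL": "https://example.gov.uk/data/example.csv",
    },
}

# Publishing this alongside the data tells crawlers what the dataset is,
# who publishes it, and where to fetch a structured copy.
print(json.dumps(dataset, indent=2))
```

A scraper that understood this description could fetch the structured CSV directly rather than extracting text from rendered web pages.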
The ODI ran two experiments to see how government data influences AI models. The first looked at how crucial UK government websites are for AI performance. Using a machine-unlearning technique, the researchers removed these websites from foundation model training data. The outcome? The models’ inaccuracy shot up by 42.6%, producing serious errors. For instance, without access to government websites, one model gave wrong information about Child Benefit eligibility.
In a second experiment, researchers found that AI models have little awareness of much government data in the first place. They tested how well models could recall specific statistics from data.gov.uk and found that, out of 195 queries, models referenced official statistics correctly only five times. These findings show a clear need for better utilisation of government datasets.
Moving forward, we must adopt FAIR principles—making data findable, accessible, interoperable, and reusable. Tools like Croissant, designed for machine-readable metadata, can improve data discovery and integration. By enhancing dataset descriptions, we can make these resources work better for everyone, machines and humans alike.
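As an illustration of what a Croissant-style description might look like, here is a minimal sketch in Python. Croissant is an MLCommons format that extends the schema.org Dataset vocabulary with ML-specific structure; the field names below follow its public examples, but the dataset name, licence, and URL are hypothetical.

```python
import json

# A minimal, hypothetical Croissant-style metadata record as JSON-LD.
# Croissant builds on schema.org's Dataset type; the record here is a
# sketch for illustration, not a real published dataset.
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "example-uk-statistics",
    "description": "Hypothetical AI-ready official statistics dataset.",
    "license": "https://example.gov.uk/licence",
    "distribution": [
        {
            # FileObject describes a concrete downloadable resource.
            "@type": "cr:FileObject",
            "name": "stats.csv",
            "contentUrl": "https://example.gov.uk/stats.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

print(json.dumps(metadata, indent=2))
```

Because the record is plain JSON-LD, it can be indexed by search engines, validated by tooling, and loaded directly by ML pipelines, which is the findability and interoperability that FAIR calls for.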
The government also needs to encourage responsible data sharing to ensure everyone gets fair access to quality data. This could mean tax breaks for private companies sharing data, requirements for publicly funded projects to make their data open, or even a fund from AI-generated content to support trusted information.
Using privacy-enhancing technologies like Solid, which give individuals control over their data, including health information, can enable smart use of sensitive data while maintaining privacy. Data Trusts built on top of Solid can gather this data, which can then be compiled into datasets equipped with Croissant metadata for research.
The Action Plan’s focus on high-quality data aligns closely with the ODI’s commitment to connecting advanced data structures with public trust. To develop systems that work together, provide AI-ready datasets, and protect privacy, the ODI advocates for a ten-year National Data Infrastructure Roadmap that supports the Action Plan’s goals.
However, the Action Plan leaves some important areas unaddressed. It doesn’t explain how user feedback will be integrated into the National Data Library or how it will engage diverse stakeholders. There’s also a lack of clear standards for data quality, essential for creating AI-ready datasets. While it encourages AI innovators, it could do more to nurture startups focused on data preparation and governance tools.
On a global scale, data-centric governance in AI is crucial, yet it’s not a priority for many countries, which could hinder the growth of open data practices. The ODI has initiated the Global AI Policy Data Observatory to tackle this challenge, providing resources to help policymakers create effective data governance strategies.
Access to high-quality government data is vital for leveraging AI in public services. By enhancing data publication practices and investing in long-term infrastructure, the UK can lead the way in making government data work for AI. This could unlock significant benefits for society, aligning with the aspirations of the AI Opportunities Action Plan.
The full report, The UK Government as a Data Provider for AI, is available from the ODI.
Elena Simperl is the director of research at the ODI. Neil Majithia is a researcher at the ODI.