The UK's grand plan to fuel AI with public data faces an uphill battle, according to a recent study from the Open Data Institute (ODI). The study highlights a critical issue: the data currently available is often misleadingly titled and lacks proper metadata, making it nearly impossible for AI systems to derive meaningful insights. This is a significant challenge for the UK's ambitious National Data Library (NDL) initiative, which aims to provide researchers and businesses with powerful data-driven insights to drive growth and innovation, including in the field of AI.
The NDL was announced in the 2024 Autumn Budget with a £100 million investment, part of a larger £1.9 billion allocation to the Department for Science, Innovation and Technology (DSIT) through 2028/29. DSIT says it has completed a discovery phase that identified key opportunities and priorities for systemic reform across the public sector. However, the ODI's 'NDL-Lite' prototype, which offers access to over 100,000 public datasets, shows how far the existing data is from that ambition.
The prototype's findings are concerning. Many datasets, particularly on data.gov.uk, are poorly labeled, outdated, or invisible to AI tools. This lack of quality and accessibility forces AI systems to resort to less reliable sources, such as news reports or commercial data, which may not always provide accurate information. The study processed and standardized 100,000 files from six public sector sources, amassing 38 GB of data, and it underscores the significant work required to make the data AI-ready.
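To make "poorly labeled" and "outdated" concrete, here is a minimal Python sketch of the kind of check a catalogue audit might run. It is not the ODI's method: the field names, the two-year staleness threshold, and the sample records are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical set of fields a record needs before a tool can interpret it.
REQUIRED_FIELDS = ("title", "description", "publisher", "last_updated")


def metadata_issues(record, today, max_age_days=365 * 2):
    """Return a list of problems found in one catalogue record."""
    issues = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Treat absent values and blank strings as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            issues.append(f"missing {field}")
    updated = record.get("last_updated")
    # Flag records older than the (assumed) staleness threshold.
    if isinstance(updated, date) and (today - updated).days > max_age_days:
        issues.append("stale")
    return issues


# Made-up records, not real catalogue entries.
records = [
    {
        "title": "Recorded crime by area",
        "description": "",
        "publisher": "Example Council",
        "last_updated": date(2018, 3, 31),
    },
    {
        "title": "Road casualties",
        "description": "Annual counts by severity",
        "publisher": "Example Department",
        "last_updated": date(2025, 1, 15),
    },
]

today = date(2025, 6, 1)
report = {r["title"]: metadata_issues(r, today) for r in records}
for title, issues in report.items():
    print(title, "->", issues or "ok")
```

In this sketch the first record is flagged for a blank description and for being stale, while the second passes. A real audit would also need to judge whether a title actually matches the contents, which a field-presence check cannot do.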
One of the critical issues identified is the difficulty of analyzing and tracking broad topics such as 'crime.' Local authority statistical releases, for instance, cannot be combined because they lack shared standards, and national datasets are often outdated or inaccessible. A major Home Office crime dataset, for example, has not been updated since 2018, while the more recent version cannot be accessed via the ONS API.
Professor Elena Simperl, director of research at the ODI, emphasizes the growing gap between the volume of public data available and its practical usability. She notes that, because of poor metadata and missing values, AI agents will often bypass the available data and seek information elsewhere, such as social media or news reports. This highlights the need for better data quality and accessibility if the NDL is to succeed.
The government, however, remains optimistic. A spokesperson said it is committed to maximizing the benefits of public sector data, with the aim of making services more efficient and boosting the economy. According to the spokesperson, the government is overhauling digital public infrastructure, including through the NDL, to make data easier to share and use, upgrading outdated systems, and providing new guidance on the safe and ethical use of public data.
The NDL is not the first attempt of this kind. The Secure Research Service (SRS), launched in 2004, offers curated, research-ready datasets to accredited researchers. However, the government's plan to replace the SRS with the ONS's Integrated Data Service (IDS) from 2020 was only partially funded and ran into legacy IT system issues, leading to budget cuts and a missed opportunity. The NDL must learn from these challenges if it is to avoid becoming another missed opportunity in the UK's data-driven AI journey.