Data infrastructure: unlocking more value from fragmented public data - your views wanted

Cross-posted from the mySociety Research blog.

Data users in the UK often encounter fragmented public data, where public authorities are each spending money to publish data independently, but their outputs are difficult to find and join together. This means that a lot of effort is going into creating data which cannot be used to its full potential.

As a joint project between mySociety and the Centre for Public Data, we are developing what we hope is a simple approach to help address this problem.

Our current thinking is that some level of central coordination is required to accompany a central mandate to publish. Public bodies need support and coordination to publish data to a lightweight common standard and in a common location.

We want to kick the tyres of this conclusion, identify existing attempts to fix the problem, and talk to people who produce and use the data to see what the obstacles are. If you’re interested, get in touch at research-publicdata@mysociety.org, or drop your views in this survey.

What is fragmented public data?

Fragmented public data is when many public authorities are required to publish the same data, but not to a common standard (structure) or in a single location – so data becomes fragmented across multiple locations and multiple formats. 

For example, every English local authority is currently supposed to publish all spending over £500 each month. From Adur to Wyre Forest, council officers are working hard to publish monthly spending. 

In theory, this data should be used by companies, researchers and journalists to provide insight into spending, spot fraud, and find opportunities to sell to councils.

But in practice, to use the data, you’ll need to search all 333 council websites each month, then import each spreadsheet into a central database – and you’ll spend a lot of time pulling your hair out, because the spreadsheets don’t use a consistent format. 
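To give a flavour of what that import work involves, here is a minimal sketch in Python. It assumes two made-up council files that hold the same kind of information under different column headers; the header names, the alias table and the `normalise_rows` function are all hypothetical and are not taken from any real council's output.

```python
import csv
import io

# Hypothetical common headers; not taken from any real publishing standard.
COMMON_HEADERS = ["supplier", "amount", "date"]

# Hypothetical header variants that different councils might use.
HEADER_ALIASES = {
    "supplier": "supplier",
    "supplier name": "supplier",
    "vendor": "supplier",
    "amount": "amount",
    "amount (gbp)": "amount",
    "net amount": "amount",
    "date": "date",
    "payment date": "date",
}


def normalise_rows(csv_text, council):
    """Map one council's CSV onto the common headers, tagging rows with their source."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {"council": council}
        for header, value in raw.items():
            if header is None:
                continue  # surplus cells with no header; ignore them
            common = HEADER_ALIASES.get(header.strip().lower())
            if common is not None:
                row[common] = (value or "").strip()
        missing = [h for h in COMMON_HEADERS if h not in row]
        if missing:
            raise ValueError(f"{council}: no column found for {missing}")
        rows.append(row)
    return rows


# Two made-up files: the same kind of data, laid out differently.
council_a = "Supplier Name,Amount (GBP),Payment Date\nAcme Ltd,1200.00,2021-03-01\n"
council_b = "Date,Vendor,Net Amount\n2021-03-02,Bolt Co,640.50\n"

combined = normalise_rows(council_a, "Council A") + normalise_rows(council_b, "Council B")
```

Every new header variant means another entry in the alias table, and a file that matches none of them stops the import until someone adds one. Repeat that across hundreds of publishers and the maintenance cost becomes clear.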

As a result, not much has actually been done with all this data. And the grand promises that spending data would unleash an ‘army of armchair auditors’ have largely failed to materialise. 

Why does this matter?

This matters because most of the effort needed to do the job well is already being spent, but on doing it ineffectually. Councils do a lot of work to produce this data, and companies and analysts waste time fixing import scripts or crowdsourcing data, rather than creating new products or insights – and for many organisations, the skills and resources required to create national-level datasets are beyond them.

This is not an isolated problem – there are many other examples of fragmented data. 

From assets of community value to election information to council land and property assets, data is often published in a fragmented and hard to reassemble way. 

For many datasets, while individual disclosures are useful, the combined data is much more than the sum of its parts because it allows real understanding of the picture across the whole country, and makes it easier to draw comparisons between different areas.

Across all these datasets the potential loss is huge – and just a bit of extra work could unlock much of the overall value of the data. We want to fix this for data that is already being published, and make sure that datasets in future are published in the best possible way.

So what can we do?

The big problem is one of coordination. We think the UK’s central data teams just need to help public authorities do two things:

  1. Use a lightweight common standard

  2. Report the location of the data in a central register.

If we take what individual authorities are doing anyway, and agree a common format and location, all the individual datasets become far more useful. The details of the standard matter less than the principle: the government and legislators should be as interested in this side of the problem as they are in requiring the data to be published in the first place.
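As a rough illustration of how little machinery the two steps might need, here is a sketch in Python. It is not a proposed design: the header list, the register fields, the example URL and both function names are assumptions made up for this example.

```python
import csv
import io

# Hypothetical common standard: the agreed column headers.
STANDARD_HEADERS = ["council", "supplier", "amount", "date"]


def check_headers(csv_text):
    """Return a list of problems with a file's header row (empty if it matches)."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader, [])
    problems = []
    for expected in STANDARD_HEADERS:
        if expected not in headers:
            problems.append(f"missing column: {expected}")
    for found in headers:
        if found not in STANDARD_HEADERS:
            problems.append(f"unexpected column: {found}")
    return problems


# Hypothetical central register: one entry per published file,
# recording who published it, the period it covers, and where it lives.
register = []


def register_dataset(publisher, period, url):
    """Record the location of a published file in the register."""
    entry = {"publisher": publisher, "period": period, "url": url}
    register.append(entry)
    return entry


# A publisher checks a file against the standard, then reports its location.
sample = "council,supplier,amount,date\nCouncil A,Acme Ltd,1200.00,2021-03-01\n"
if not check_headers(sample):
    register_dataset("Council A", "2021-03", "https://example.org/council-a/2021-03.csv")
```

With something like this in place, a data user would read one register and fetch files that already share a header row, rather than searching each publisher's website and reconciling formats by hand.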

We’re keen to learn lessons from previous attempts to do this. From reviewing old publication guides, our initial conclusion is that they over-complicated the idea of what open data is (with a fixation on file formats and linked standards), rather than focusing on simple interventions that help both publishers and users (where, generally, the best approach is probably an Excel template with common headers). Common standards need to be a compromise between technical requirements and the needs of the people in local authorities who produce this data.

We’re encouraged by an approach taken by the Scottish Parliament, where publication of compliance with climate change duties was mandated in a particular format – and a consultation on this process found most organisations involved thought standard reporting was an improvement. We’re interested in any other examples of this kind of approach. 

What are we doing now?

We’re trying to explore this problem a bit more, to understand its scale and the viability of our approach.

We are interested in talking to:

  • People who have run into this issue: what they were trying to do, whether they were able to overcome it, or whether they had to give up.

  • People who publish information in public bodies, to understand what their restrictions are.

  • People or organisations with experience in trying to coordinate a single standard or location for public data releases.

Get in touch at research-publicdata@mysociety.org, or drop your views in this survey.