Swimming in the UX data lake: Using machine learning to give new life to government UX research data

A cartoon robot sitting on a stool, reading a book. There is a stack of books on the floor to the left of the robot.
Image by Mohamed Hassan from Pixabay

Back in 2009, the now-defunct User Experience Working Group was a fledgling community in the Government of Canada (GC). We were a bunch of practitioners who self-organized into a team working to make user experience design a common practice in the GC. We met to work on UX projects, learn methods, and share research, and we ultimately drove the development of the government-wide web standards for usability, accessibility and interoperability.

One of the discussions in the early days centered around the fact that many departments were conducting usability tests and user research on the same website design (since we were all required to implement the same template for consistency). We created a research repository on the government wiki to share our research results in the hopes of reducing the potential duplication of effort. Sharing research meant that tests could be more incremental and cost effective, focusing on the content design or features specific to the individual department or project, rather than re-testing the exact same template features as everyone else.

In the past 11 years, 40 projects were shared on this wiki page. As someone who championed this approach (since I had so little money for testing back then), I have long thought about how the government could create a user research repository and mine it for insights that would serve all departments and agencies. A place to store more than just usability testing data; think: user research interviews, ethnographic data, socio-economic research, service and program feedback, and countless other sources of user research that are sitting unused across departments.

Service Canada, Immigration, and Public Services and Procurement all have front-end client service interactions that are multi-regional. Client research and feedback on service interactions is scalable and can be applied to blueprint service workflows and develop new service designs. When I was hired in my current job, I had to blueprint my program by myself. If I'd been able to refer to data about client-facing interactions from other programs, it would have helped me evaluate our blueprint and feed into the new service model I designed. Instead, I had to do it alone from scratch as a UX team of one.

An opportunity for machine learning

Several years ago, I first heard of the concept of data lakes, and the idea of a central UX data repository for the GC seemed close. Since then, with the falling costs of machine learning and artificial intelligence engines, it seems even closer.

I have written before about using artificial intelligence (AI) to improve the findability of unstructured data. I even led an AI implementation which improved the search for Access to Information requests by indexing the content in the responses. The engine could also recommend which department was likely to be best suited to respond to a request, based on the search terms provided by the user.

In my current job, I receive unstructured user feedback from a variety of sources, including user research led by my team, service feedback forms, and data from our call centre. We struggle to sort through and catalogue this information: to pull insights we can act on now, and to make it searchable so we can draw insights over the longer term.

This year we worked on cataloguing our data, documenting insights, creating a metadata model to tag the content for reuse, and loading tagged content into a single repository. Filtering the content down for re-use is necessary at this point because we lack the tools and resources to create our own searchable data pool. But we know that we are leaving so much value behind by picking and choosing what goes into our insights repository.

Our data only amounts to thousands of lines, which isn't enough to reliably train an ML engine. While we interact with 40,000+ clients a year, we only have data on a fraction of them, and that volume is not expected to increase significantly over time.

Which is what brought me back to the idea of a government-wide UX research data pool. With enough structured and unstructured data, there is the opportunity to train an ML engine for insights and create a repository about the government client experience to serve all of the departments and agencies who need it.

The key is a metadata model.

I once sat in on a meeting where a number of departments were working on creating a central repository for departments to submit a specific kind of content which would be published on a central portal. What I couldn’t figure out was why they hadn’t established a metadata model and required all of those groups to publish their content using RSS. If everyone’s RSS was structured the exact same way, the central site could just pull in all the feeds and publish the data with a consistent look and feel regardless of the source. When I asked the question, the response was: “Where were you 2 years ago? You could have saved the government $1,000,000.”

Along the same lines, we can use standard tags and a standard structure for the data, in order to combine it all into a single repository.

This metadata model doesn’t need to be complex to be effective, especially if we are working with unstructured data. Consider if we even had just a few common tags:

  • Fiscal year (standardized list)
  • Department name (standardized list)
  • Scope (standardized list e.g. website, application, program, service)
  • Topic (unstructured)
  • Project title (unstructured)
  • Project description (unstructured)
  • Type of research (standardized list e.g. usability testing, feedback)
  • Research question or task (unstructured) - one entry per question or task
  • Response (unstructured) - one entry per question or task

A basic set of tags like this, with both standardized and unstructured elements, would enable merging all of the data into a central pool. Additional tags could be proposed by the machine learning system as it analyzes and learns the data over time.
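As a sketch, the tag set above could be expressed as a simple record type with validation against the standardized lists. All field names and vocabulary values here are illustrative assumptions, not an actual GC standard:

```python
from dataclasses import dataclass

# Hypothetical controlled vocabularies; a real model would maintain
# these standardized lists centrally.
FISCAL_YEARS = {"2023-24", "2024-25"}
SCOPES = {"website", "application", "program", "service"}
RESEARCH_TYPES = {"usability testing", "feedback", "interview"}

@dataclass
class ResearchRecord:
    fiscal_year: str          # standardized list
    department: str           # standardized list
    scope: str                # standardized list
    topic: str                # unstructured
    project_title: str        # unstructured
    project_description: str  # unstructured
    research_type: str        # standardized list
    question: str             # one entry per question or task
    response: str             # one entry per question or task

    def validate(self) -> list[str]:
        """Return a list of problems with the standardized fields."""
        errors = []
        if self.fiscal_year not in FISCAL_YEARS:
            errors.append(f"unknown fiscal year: {self.fiscal_year}")
        if self.scope not in SCOPES:
            errors.append(f"unknown scope: {self.scope}")
        if self.research_type not in RESEARCH_TYPES:
            errors.append(f"unknown research type: {self.research_type}")
        return errors

# Example record (contents invented for illustration only)
record = ResearchRecord(
    fiscal_year="2024-25",
    department="Service Canada",
    scope="service",
    topic="benefit application status",
    project_title="Intake usability study",
    project_description="Moderated sessions on the intake flow",
    research_type="usability testing",
    question="Find the status of your application",
    response="7 of 9 participants located the status page",
)
assert record.validate() == []
```

The split between standardized fields (validated) and unstructured fields (free text) is the whole trick: the standardized fields make records from different departments mergeable and filterable, while the unstructured fields carry the research content itself.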

The data repository could be accessed by government departments and agencies through a lookup interface, and by the public via the government open data website. As the pool becomes a lake, a dynamic interface similar to USAFacts.org could provide a more visual experience of the data for both internal and external users.

Is this feasible?

There are plenty of people collecting user research across the GC, and plenty of others working on information architecture, machine learning and AI policy: some of the subject areas required to implement a central data repository.

The actual data transfer doesn't have to be complicated. It does mean reworking historical research reports, converting many from document or presentation formats into data sets (think Excel spreadsheets or comma-separated files). For future research, departments could simplify their lives by requiring that contractors and in-house teams use the metadata model to structure their data and deliver it in formats that can be fed straight into the system.
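Converting a research report into a comma-separated data set could look something like this minimal sketch, where each question-and-response pair becomes one row under a shared set of column names (the column names and row contents are assumptions for illustration):

```python
import csv
import io

# Hypothetical column order shared by every department's exports.
COLUMNS = [
    "fiscal_year", "department", "scope", "topic", "project_title",
    "project_description", "research_type", "question", "response",
]

# One row per question or task, per the metadata model.
rows = [
    {
        "fiscal_year": "2024-25",
        "department": "PSPC",
        "scope": "website",
        "topic": "tender search",
        "project_title": "Tender search study",
        "project_description": "Unmoderated tree test",
        "research_type": "usability testing",
        "question": "Find open tenders for office furniture",
        "response": "12 of 20 participants succeeded",
    },
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Flat files like this are deliberately boring: any department can produce them from a spreadsheet, and the central repository can ingest them without custom tooling per source.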

The central repository could pull data in through bulk data imports or APIs and data feeds from existing data repositories. A team would need to monitor data quality and structure, and test the imports. And the ML would need to constantly crawl the content to learn patterns, yield insights and propose additional tags.
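A bulk import step along those lines might merge records from several departmental feeds into one pool, rejecting rows that are missing required standardized fields so the quality-monitoring team can follow up. This is a sketch under assumed field names, with plain Python lists standing in for real API feeds:

```python
# Required standardized fields every incoming record must carry
# (names are illustrative, not an actual GC standard).
REQUIRED = {"fiscal_year", "department", "scope", "research_type"}

def merge_feeds(feeds: dict[str, list[dict]]) -> tuple[list[dict], list[tuple]]:
    """Merge per-department feeds into one pool, tagging each record
    with its source and setting aside records missing required fields."""
    pool, rejected = [], []
    for source, records in feeds.items():
        for rec in records:
            if REQUIRED.issubset(rec):
                pool.append({**rec, "source_feed": source})
            else:
                rejected.append((source, rec))
    return pool, rejected

feeds = {
    "dept-a": [{
        "fiscal_year": "2024-25", "department": "Dept A",
        "scope": "service", "research_type": "feedback",
        "response": "Call wait times too long",
    }],
    "dept-b": [{"department": "Dept B"}],  # missing required fields
}
pool, rejected = merge_feeds(feeds)
print(len(pool), len(rejected))  # -> 1 1
```

Tagging each merged record with its source feed preserves provenance, which matters both for the data-quality team chasing down bad imports and for the ML engine weighing how representative each department's data is.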

We would need a central repository and data standards. We would need a simple data structure for the short term and the ability to expand it over time. We would need a network of departments willing to trial this and procure or build the repository, and then feed into it. We would need a machine learning engine that processes natural language and the ability to train it continuously. We would need a requirement for departments to share their data into the central repository. We would need policy, procurement, and a host of other groups to work together.

Most importantly, government would need to think bigger and realize that UX isn't only for web and apps, and that UX data is more than just usability test results. Across the GC, the general approach to UX seems to forget that service- and program-level design are much bigger and broader than websites, and that this kind of design needs lots of data to inform decisions. A repository like this would be invaluable to the folks working on end-to-end service design, who could use the insights and feedback from other government services to validate their work and broaden the base of data they work from.

So where do we begin?

I'm putting this into the universe to start the discussion. Do you know of any other governments working on something similar? Does your department have a data repository? Would you use or contribute to a central one? Pros? Cons?

And hey, if you’ve got the funds to test this out definitely reach out. ;)

Let's talk about it over on Twitter: ping me @spydergrrl.