MLOps Conference

On DataOps, productionizing ML models, and running experiments at scale.



Schaun Wheeler


Story-telling as a method for building production-ready machine-learning systems

Submitted Jun 16, 2021

This presentation is about the essential but overlooked role storytelling plays in machine learning productionization. Although storytelling is often assumed to be at best a tangential concern, machine learning systems succeed only by building trust in the model, the data, and the system in which the model and data operate. That trust-building happens through storytelling. Most often, successful storytelling happens only by accident, which is why so many machine learning systems fail to make good on the hopes and promises that motivated their construction in the first place. By learning to tell better stories - which requires us to build better stories - we can build more stable, maintainable, and successful machine learning systems.


A big part of productionizing machine learning is building trust that the models actually do what they claim to do. That trust-building is different depending on whose trust you’re trying to win.

On one level, the data scientists and engineers building, training, and deploying the model need to have confidence that it actually does what it’s supposed to do. This trust is built through tools that are both widely available and commonly taught in training courses. These include things like goodness-of-fit measures, precision and recall, checks against overfitting, cross-validation, and basic model parameterization such as methods for dealing with imbalanced classes. All of these tools build trust in the model itself.
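As a minimal illustration of why several of these measures are used together, the sketch below computes accuracy, precision, and recall by hand on a made-up imbalanced dataset. The labels, predictions, and function names are all hypothetical and chosen only for illustration; in practice a library implementation would normally be used.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels given as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision_recall(y_true, y_pred):
    """Precision and recall; each is defined as 0.0 when its denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical imbalanced data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
always_negative = [0] * 100

# A model that flags 10 negatives by mistake and catches 4 of the 5 positives.
some_signal = [1] * 10 + [0] * 85 + [1] * 4 + [0]

majority_accuracy = accuracy(y_true, always_negative)
majority_pr = precision_recall(y_true, always_negative)
signal_accuracy = accuracy(y_true, some_signal)
signal_pr = precision_recall(y_true, some_signal)
```

On this toy data the always-negative model has the higher accuracy even though it never identifies a single positive case, which is the kind of gap that precision, recall, and explicit handling of imbalanced classes are meant to expose.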

On another level, data scientists, and often product managers and other stakeholders, need to have confidence that the data-generating process that feeds the model does not have any major structural features that will thwart the model and bias its results. The toolset for this level of trust-building is not very robust, and is usually subsumed under the heading of “exploratory data analysis”. Many companies have learned, to their detriment, that systemic biases are easy to overlook, even by skilled practitioners. The productionization of machine learning requires a more robust set of tools for helping model designers and other stakeholders to thoroughly explore the data for hidden biases. We need tools to assess the trustworthiness of the data, regardless of how much trust we have in the model.

At still another level, stakeholders, particularly non-technical ones, need to have confidence in how various models fit into one another, into other technical systems, and into other business processes, regardless of the extent to which those processes are facilitated by technology. There are no well-defined tools for this kind of trust-building, though there are relevant skills. At the very least, we need regular attempts to assess the trustworthiness of a machine-learning model’s place in the larger system that is supposed to produce the results that justify the implementation of machine learning in the first place.


As we move up through the levels of trust-building - from trust in the model, to trust in the data, to trust in the system - we increasingly transition from primarily writing code to primarily telling stories. Productionized machine learning - the automation of all or parts of decisions that would otherwise be left up to human judgement - inherently involves the active, iterative negotiation of a story about what is happening, why it is happening, and what we can do about it. The role of storytelling in machine-learning productionization is so often overlooked that it’s rarely considered part of the relevant skill set at all. Many practitioners limit it to notions of report-writing and data visualization, or even dismiss all concerns about storytelling as a marketing ploy that is hardly related to the machine learning problem at all.

A greater focus on the storytelling aspect of productionizing machine learning increases buy-in from both stakeholders and customers, but it also makes for better models and better code. As we build trust in the system, we discover assumptions about the data and its intended uses that inform our exploration of the data-generating process. As we build trust in our understanding of the data-generating process, we make better-informed choices about the models we use (and whether we really need to use a machine learning model at all) and can often mitigate data issues before they become problems.


I will only briefly discuss the issue of building trust in models, as that topic is generally already very familiar to practitioners. One issue I will focus on is the fact that the selection of model-evaluation metrics often depends only a little on the nature of the model itself and more on the nature of the problem the model is supposed to solve. I’ll illustrate this point using a case study from a charter school network, where my team designed a model to predict student need for an academic intervention (either negative - special accommodations or being held back in school - or positive - extra challenges or being skipped ahead a grade). Our evaluation of the model’s performance was wholly tied up in our stakeholders’ judgements about the cost of different kinds of wrong decisions - namely, how many students were they willing to give special accommodations to, even though those students didn’t really need them, in order to get the accommodations to one student who really did need them? The success of our model depended upon minimizing both false positives and false negatives, but with a vastly different priority assigned to one than to the other.
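One way to make that kind of trade-off concrete is to attach an explicit cost to each type of wrong decision and pick the decision threshold that minimizes total cost. The sketch below is an illustration under assumed numbers, not the method used in the case study: the scores, labels, cost values, and function names are all hypothetical.

```python
def total_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Sum the cost of every wrong decision made at the given threshold.

    A case is flagged for intervention when its score >= threshold.
    fp_cost: cost of flagging a case that did not need the intervention.
    fn_cost: cost of missing a case that did need it.
    """
    cost = 0.0
    for needs_help, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and not needs_help:
            cost += fp_cost
        elif needs_help and not flagged:
            cost += fn_cost
    return cost


def best_threshold(y_true, scores, fp_cost, fn_cost):
    """Return the lowest-cost threshold among the observed scores.

    float("inf") is included as a candidate meaning "flag nobody".
    Ties are resolved in favor of the lowest threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: total_cost(y_true, scores, t, fp_cost, fn_cost),
    )


# Hypothetical model scores and true needs (1 = really needed the intervention).
scores = [0.9, 0.8, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1]
y_true = [1, 0, 1, 0, 1, 0, 0, 0]

# Stakeholders judge a missed student to be ten times worse than an unneeded accommodation.
cautious = best_threshold(y_true, scores, fp_cost=1, fn_cost=10)

# The reverse judgement: an unneeded intervention is ten times worse than a miss.
strict = best_threshold(y_true, scores, fp_cost=10, fn_cost=1)
```

With the same model and the same scores, the first set of costs selects a low threshold that flags every student who needs help along with two who do not, while the second selects a high threshold that flags only the single highest-scoring student. The model is unchanged; only the stakeholders' stated priorities differ.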

I will first approach the topic of building trust in data through the perspective of numerous scandals and exposés that have roiled the tech industry over the last several years. While some of these scandals involved misuse of technical systems by malicious actors, much more commonly, data scientists and engineers seem to have been content to trust their model without thoroughly exploring whether they had fed that model trustworthy data. I will discuss a couple of tools that have been developed to evaluate data trustworthiness. In each case, the tool is really facilitating a conversation between the data and the person using the data, allowing the data to offer “surprising” commentary on itself, even if the practitioner doesn’t ask for it. This back-and-forth between data and scientist is the core pattern for successfully building trust in data. I will illustrate this principle with a case study from a digital advertising firm, where I built and open-sourced a tool for rapid exploration of geospatial data. The tool helped me discover multiple biases in the company’s mobile-device location data, which in turn allowed us to improve several downstream systems as well as develop entirely new products.

I will approach the topic of building trust in systems with case studies from my current company, Aampe, where we enable companies to constantly adapt their push notifications to their users’ preferences through continuous, massively-parallel experimentation. Most of the real breakthroughs we have had in the design of our machine learning systems have come through our attempts to frame our capabilities in a way that allows non-technical customers to understand how Aampe fits into their business and their other technical systems. Repeatedly, as we’ve struggled to tell the story of how a piece of Aampe’s analytic pipeline works, we have discovered that a crucial piece of that pipeline was either missing or poorly designed. I will focus on two examples: our decision to make our system learn about one particular thing - app content - in an entirely different way than we learn about everything else; and our ongoing efforts to design KPI tracking and multi-KPI optimization in our learning systems. This last effort has been greatly informed by a recent project that has been the most obvious storytelling experience of my career: the creation of a children’s book, The User Story, that explains our entire analytic pipeline through the perspective of a little orange dot that represents an app user.

I will end with some guidelines for building better stories for machine learning systems, with an emphasis on ethnography, a method from anthropology that prioritizes iterative, negotiated construction of stories. The ability of a story to positively shape a machine learning system depends less on how the story is told (though that matters, too), and more on how the story was built. When stories are systematically iterated upon - in a process very similar to code review - that process itself provides most of the information needed to build better systems.


