Decisions once made by loan officers, care providers, judges and human resources professionals are increasingly made by machines using sophisticated analytic tools. But what happens when the data feeding the algorithms is inaccurate, outdated or of low quality? In this blog post, QOMPLX CEO Jason Crabtree explains why more attention must be paid to the provenance of data on which decisions are made.
Making better decisions is a top priority in the C-suite. For financial services and the insurance industry, the advent of big data brought with it the promise of cheap, accurate and real-time decision-making. This has the potential not only to impact the bottom line, but also to open up access to otherwise underrepresented consumer groups for credit, insurance products and other financial innovations.
Humans Out of the Loop
We can now automate tasks such as underwriting to the point that a human in the loop is optional. Decisions to approve or deny loans, mortgages and leases can now rest solely on a data-driven model. The challenge is that not enough business leaders are paying attention to the provenance of the data on which their companies are basing business-critical decisions.
We live in a time in which it's no longer the local bank teller or loan officer who knows us and our personal circumstances. Entire systems and datasets now hold information associated with our identity. However, when that data is inaccurate, it potentially exposes us to errors and omissions from many different counterparties.
An Expiration Date for Data?
As companies use data to make significant decisions about their customers, the shelf-life of those decisions also needs to be taken into account. When an individual is underwritten or denied insurance, a loan or a job opportunity, how and where that decision is stored has a ripple effect.
When it comes to data, we need to be asking the right questions. How is the data obtained and curated? How is it used to train models? Who owns and maintains those models? How do we validate that they're fit for the purpose for which they're being asked to contribute to a decision? Perhaps the biggest question we need to address is: how do we even record decision outcomes? Reproducibility matters.
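To make the question of recording decision outcomes concrete, here is a minimal sketch in Python of what a reproducible decision record might capture. It is an illustration under stated assumptions, not a description of any particular product: the names `DecisionRecord`, `record_decision`, `fingerprint` and `is_stale`, and every field shown, are hypothetical. The idea is simply to store, alongside each outcome, a hash of the exact inputs, the model version, the data sources and license, and a date after which the decision should be revisited.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable object."""
    # Sorting keys makes the hash independent of dictionary ordering.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical record of one automated decision and its provenance."""
    subject_id: str
    outcome: str
    model_version: str
    input_hash: str       # fingerprint of the exact inputs the model saw
    data_sources: tuple   # where the inputs came from
    data_license: str     # terms under which the data was obtained
    decided_at: str       # ISO 8601 timestamp
    review_after: str     # ISO 8601 timestamp: the decision's "shelf life"


def record_decision(subject_id, inputs, outcome, model_version,
                    data_sources, data_license, decided_at, review_after):
    """Build an immutable record so the decision can be audited later."""
    return DecisionRecord(
        subject_id=subject_id,
        outcome=outcome,
        model_version=model_version,
        input_hash=fingerprint(inputs),
        data_sources=tuple(data_sources),
        data_license=data_license,
        decided_at=decided_at.isoformat(),
        review_after=review_after.isoformat(),
    )


def is_stale(record: DecisionRecord, now: datetime) -> bool:
    """True once the decision has passed its review date."""
    return now > datetime.fromisoformat(record.review_after)


# Illustrative values only; none of these come from a real system.
rec = record_decision(
    subject_id="applicant-001",
    inputs={"income": 50000, "term_months": 36},
    outcome="approved",
    model_version="underwriting-model-v1",
    data_sources=["credit-bureau-feed", "application-form"],
    data_license="vendor-license-2024",
    decided_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    review_after=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

With a record like this, an auditor could re-run the named model version on the same inputs and check that the hash and the outcome match, which is one modest way to make a decision reproducible.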
Given the untold impact data-driven decisions have on our lives, understanding the provenance of information is vital. It's no longer enough to know how data is being stored; how it's sourced, manipulated and ultimately licensed is also material. Licensure is a particularly thorny issue. If I license a data set, should I be allowed to train a model on it? Does the legal concept of "derivative work" mean that I can keep using that model only as long as I continue to pay for the data set? Alas, these are questions to which there is not yet a sufficiently clear-cut answer.
Consumers in the Dark
Meanwhile, as technology companies continue to grapple with these issues, consumers are consistently left in the dark. When a consumer signs a terms of service agreement, it's not always clear what they are giving their consent to, no matter how closely they read the terms and conditions. Consenting to a company using personal or business data only scratches the surface; too often, it's unclear what happens to derivative works. What consumers think they're giving consent to and what they are actually agreeing to are often worlds apart. Consider the well-documented cases around consumer genetic testing, and the ongoing research goals that commercial partners pursue against the growing data sets.
This issue is becoming more pressing as the value of our data increases. Right now, incumbent technology companies have not done a good enough job of informing consumers about what they are consenting to when they are participating in schemes that involve these end-to-end transformation processes. Many start-ups are intentionally opaque about their sources, transformations, and uses of data.
Until there are means to ensure that all consumers can understand the end-to-end use of their data, it's very difficult for them to judge which downstream uses they ought to be compensated for, or even which rights they are granting when asked to sign a user agreement. Before we can make better decisions, we first have to decide how to better understand and evidence where our data comes from in the first place.