Credit Karma’s Ryan Graciano on Data Marts, Data Models, and Disrupting the Credit Landscape

Ryan Graciano

CTO

Credit Karma

Episode Overview

Ryan Graciano is the co-founder and CTO of Credit Karma, a company that is aligning technology and data to help bring transparency to the credit lending process. On this episode of The Data Chief, Ryan explains how Credit Karma survived early struggles such as the financial crisis of 2008. Ryan also touches on how Credit Karma navigated its journey to the cloud, moving from the comfort of on-premises data centers to the elasticity of the cloud, and on the importance of grooming outside data sources to keep insights consistent. That and more on today’s episode with Ryan Graciano.

Key Takeaways:

Explainable algorithms drive success: As third-party datasets become more readily available, data professionals increasingly need to understand where that data comes from and how it will affect their models. While these datasets can make it easier to spin up models quickly, teams must be able to account for how and why those algorithms generate particular answers.
Clean data leads to reliable answers: Data analysts must spend time making sure the data they use is not only clean but also reliable. When an analyst uses dirty or untrustworthy data, algorithms tend to behave unpredictably, which leads to high variance in answer quality and consistency.
Keep data fluency a priority: Even for organizations that believe they are data literate, understanding data at an organizational level is an ongoing process. A best practice for maintaining data literacy is to create a standardized set of conventions for how data is recorded and reported internally. When practices like this are standardized, organizations can avoid issues like data bias.

Key quotes

The risk models that people have been using for the past years (a lot of banks will keep the same risk model in place over a long period of time) are still in place. Those models have seen good times and bad times, so that they're able to react. What people forget about the whole system, and why it's so confusing, is that the risk models are there to be consumed by the lenders. They were never designed for people to see. They were never designed to be explainable to people.

The end user is the company. They're the ones that are paying for the model to be created, driving its creation. And the person is the subject of the model. The challenging part is that it can be hard to warrant that the model is acting in a fair way. There's a bunch of legislation around this that tries to ensure that companies are doing fair things, and they aren't doing things like redlining or discriminating against certain populations. So the explainability problem is a major one, both for regulators and for people. But for the banks, they just want a good risk score.

Some of it is in the data that you allow the models to use. If you just let the algorithm run unconstrained, it can't necessarily tell the difference between correlation and causation. If there's a gender pay gap, then there's probably also a gap in the abilities for the genders to pay back, so there's probably a gender risk gap. The algorithm will pick that up and just say, ‘Oh well, there's a gap here. I don't know why. I just know that it exists, so I'm going to tell my banks not to lend to this other gender that's having a hard time.’ So to prevent that, you actually have to remove some of the data from the models because they can't figure that out on their own. They don't know why, they only know what.
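As a rough illustration of the point in this quote, the sketch below removes a protected attribute from each record before the data would reach a model. It is not Credit Karma's method; the field names, the sample records, and the `strip_protected` helper are all hypothetical. Dropping a column also does not remove other fields that are correlated with it, so this would only be a first step.

```python
# Hypothetical set of attributes the model must never see.
PROTECTED_FIELDS = {"gender"}


def strip_protected(record, protected=PROTECTED_FIELDS):
    """Return a copy of the record without any protected attributes."""
    return {key: value for key, value in record.items() if key not in protected}


# Made-up applicant records, for illustration only.
applicants = [
    {"income": 52000, "open_accounts": 4, "gender": "F"},
    {"income": 61000, "open_accounts": 2, "gender": "M"},
]

# Only these filtered rows would be handed to model training.
training_rows = [strip_protected(applicant) for applicant in applicants]
```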

What was very challenging for us though was just keeping up with the pace of data platforms. We went from distributed computers that we had set up on our own, to managing Spark instances. As deep learning came about, that created totally different demands, and re-platforming on your own hardware every time is incredibly difficult.

For data you really have to start with: What are we trying to accomplish for the end user? What should the experience be for them? And you have to think through, ‘Okay, what data will power that?’ Before we even think about the algorithms. It's just do we have all the data that we need? And often you'll find, no. A lot of A.I. is actually just gathering data and grooming it.

The actual data that the analysts use, that has to be really clean, and so you have to spend a lot of time making sure that if analyst A is looking at a problem, and analyst B is looking at a problem, they're going to both use the same data to get there. If your data is ungroomed and all over the place, even if they're both trying their best, they might actually pull the data differently and come to different conclusions.
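One way to picture the "same data" idea in this quote is a single shared definition that every analyst calls, instead of each analyst writing their own filter. The sketch below is only an assumption about how that could look; the `active_members` function, the 30-day rule, and the sample rows are hypothetical and do not come from the episode.

```python
from datetime import date


def active_members(rows, as_of, window_days=30):
    """Shared definition: a member is active if their last login
    falls within window_days on or before the as_of date."""
    return sorted(
        row["member_id"]
        for row in rows
        if 0 <= (as_of - row["last_login"]).days <= window_days
    )


# Made-up login records, for illustration only.
rows = [
    {"member_id": 1, "last_login": date(2020, 5, 1)},
    {"member_id": 2, "last_login": date(2020, 3, 1)},
    {"member_id": 3, "last_login": date(2020, 5, 20)},
]

# Both analysts call the same function, so they pull the data the same way.
analyst_a = active_members(rows, as_of=date(2020, 5, 31))
analyst_b = active_members(rows, as_of=date(2020, 5, 31))
```

Because the rule lives in one place, changing the window or the definition of "active" changes it for everyone at once.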

Bio:

Ryan Graciano is a co-founder and CTO of Credit Karma, a company dedicated to re-engineering one of the largest industries in the world – consumer finance. Credit Karma’s mission is to help consumers have a better future by simplifying decision-making and management of personal credit and finances.

Credit Karma scaled aggressively to become a major disruptor of the consumer finance industry, and today it is valued in the billions with more than 50 million members. As Chief Technology Officer, Ryan designed the engineering framework and organization to support this growth, and he manages a team that has grown from just one person to hundreds of engineers and counting.

Prior to joining Credit Karma, Ryan worked for a small company that was acquired by IBM, where he subsequently spent a few years on search engines and enterprise software.
