Disrupting the credit landscape with data: A Q&A with Credit Karma CTO, Ryan Graciano

Financing in America can be a confusing and complex process. The myriad offerings, rates, and forms are daunting for even the savviest consumer.

Credit Karma simplifies the lending process by anonymizing individual borrower data and procuring multiple financing offers depending on what the consumer is looking to finance. Whether it be a sofa, a car, or a house, customers no longer need to fill out multiple forms; Credit Karma is their one-stop credit application.

The company built this intelligent platform through years of iteration, testing which systems worked best. Their journey involved migrating from a data center to the cloud, reworking their algorithms to account for major economic changes, and fostering a workplace environment that recognized the importance of the data being generated and collected.

On a recent episode of The Data Chief, Cindi Howson sat down with Ryan Graciano, Chief Technology Officer at Credit Karma, to understand his involvement in the company’s data journey and how the company leveraged its customers’ data to revolutionize the credit industry. Read on for insights from their conversation.

Cindi Howson: In the last year, between mass unemployment and people having to defer their rent, what's been the impact on Credit Karma? Have the algorithms been adapted to account for these unusual times?

Ryan Graciano: I wouldn't say that they've been adapted. The risk models that people have been using for many years are still in place. Those models have seen good times and bad times, so they're able to react accordingly.

What people forget about the whole system, and why it's so confusing, is that the risk models are there to be consumed by the lenders. They were never designed for people to see. They were never designed to be explainable to people. That's something that our company came up with after the fact because those lending decisions are so impactful to people's lives; that demand was created. But the system wasn't designed with that in mind.

Cindi: If you think about your company's role in explaining the lending process, where do you view explainable AI in this?

Ryan: It's very tricky because who's the end-user? Is it the person that's having the decision made about them? Or is it the company that's buying that score for some reason? And really, the end-user is the company. They're the ones that are paying for the model to be created, driving its creation. So then, the person is the subject of the model. 

The challenge is that it can be hard to warrant that the model is acting fairly. There's a bunch of legislation around this that tries to ensure that companies are doing fair things, not doing things like redlining or discriminating against certain populations. The explainability problem is a major one, both for regulators and people. The banks just want a good risk score.

Cindi: Credit Karma was quoted as saying there are biases in credit -- a credit gender gap. How do we prevent this?

Ryan: Some of it is in the data that you allow the models to use, so I would start there. I said some, but it's honestly most of it. If you let the algorithm run unconstrained, it can't necessarily tell the difference between correlation and causation. If there's a gender pay gap, then there's also a gap in each gender's ability to pay back, so there's a gender risk gap. The algorithm will pick that up and say, "Oh, well, there's a gap here. I don't know why. I just know it exists, so I'm going to tell my banks not to lend to this other gender that's having a hard time."

To prevent that, you have to remove some of the data from the models because they can't figure that out on their own. They don't know why. They only know what.
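
As a rough illustration of removing data before a model sees it, here is a minimal Python sketch. The field names, the sample applicants, and the `strip_protected` helper are all hypothetical and chosen for this example; they are not Credit Karma's schema or method.

```python
# Hypothetical set of attributes the model is not allowed to use.
PROTECTED_FIELDS = {"gender"}


def strip_protected(record, protected=PROTECTED_FIELDS):
    """Return a copy of the record without any protected attributes."""
    return {key: value for key, value in record.items() if key not in protected}


# Made-up applicant rows, for illustration only.
applicants = [
    {"income": 52000, "on_time_payments": 34, "gender": "F"},
    {"income": 61000, "on_time_payments": 29, "gender": "M"},
]

# Only the stripped rows would be passed on to model training.
training_rows = [strip_protected(applicant) for applicant in applicants]
```

The original records are left untouched, so the protected field stays available for purposes other than training.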

Cindi: Why did you start using Google Cloud so early? And where do you think all of this is going?

Ryan: We started in a data center back in 2007. Then we moved to the cloud, so we became cloud-native, but it was an effort to get there. We had developed so much infrastructure to make the site run, and when we looked at other platforms, a lot of their service offerings were around things that we didn't have an issue with. I don't need a service bus if I already have a service bus and it works fine. What was challenging for us was keeping up with the pace of data platforms. We went from distributed computers that we had set up on our own to managing Spark instances. There was even Hadoop at one point. Then, as deep learning came about, that created different demands. Re-platforming on your own hardware every time is incredibly difficult.

Even if you abstract the hardware to a cloud, the software on top of the migration is crazy. So we said, "We want to be on a platform that seems like it's going to be on the bleeding edge of this because we're always going to be on the bleeding edge of this since our business depends on it." That’s what Google was doing the most at the time. I was impressed with how BigQuery had managed data access for a lot of internal use cases and how they were moving towards almost a TensorFlow-native platform, developing chips to make that more efficient to run. We thought that would be the future for us, and they were most likely to stay at the forefront.

Cindi: Take us through the R&D process when you're designing new data products.

Ryan: You have to start with: What are we trying to accomplish for the end-user? What should the experience be for them? Then you have to think, “What data will power that?” Before we think about the algorithms, it's “Do we have all the data that we need?” Often, it will be no. A lot of AI is gathering data and grooming it. 

That's where a lot of our time is spent, and where Google has been very helpful, because BigQuery and other tools make it easy for us, the folks doing the R&D, to groom and manage data. We have a sophisticated data science team that's able to apply standard techniques. We can get to a proof of concept fairly rapidly once we've figured out all those data hurdles.

Cindi: Are there particular processes or reinforcements to make sure that people are using the data well?

Ryan: I think one of the data layers, the actual data that the analysts use, has to be clean. You have to spend a lot of time making sure that if analyst A is looking at a problem, and analyst B is looking at a problem, they're going to use the same data to get there. If your data is un-groomed and all over the place, even if they're both trying their best, they might pull the data differently and come to different conclusions, so that's a lot of work in its own right.
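
One way to picture a shared, groomed data layer is a single cleaning function that every analyst calls instead of pulling raw rows independently. The Python sketch below assumes a hypothetical account table and a hypothetical `groomed_accounts` helper; it illustrates the idea only and is not a description of Credit Karma's pipeline.

```python
def groomed_accounts(raw_rows):
    """Apply one shared cleaning rule set: normalize IDs, drop duplicates and missing balances."""
    seen_ids = set()
    clean_rows = []
    for row in raw_rows:
        account_id = row["account_id"].strip().upper()
        if account_id in seen_ids or row.get("balance") is None:
            continue  # skip duplicates and rows with no balance
        seen_ids.add(account_id)
        clean_rows.append({"account_id": account_id, "balance": float(row["balance"])})
    return clean_rows


# Made-up raw rows, for illustration only.
raw = [
    {"account_id": " a1 ", "balance": "100.0"},
    {"account_id": "A1", "balance": "100.0"},  # duplicate once the ID is normalized
    {"account_id": "b2", "balance": None},     # missing value
    {"account_id": "c3", "balance": "250.5"},
]

# Analyst A and analyst B ask different questions but start from the same groomed rows.
analyst_a_total = sum(row["balance"] for row in groomed_accounts(raw))
analyst_b_count = len(groomed_accounts(raw))
```

Because both analysts go through the same function, any disagreement between their conclusions comes from their analysis rather than from how each one pulled the data.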