Transforming DataOps in Banking

By: Sandeep Uttamchandani

CDO Sessions: A Chat with DBS Bank

On July 9th, 2020, Unravel CDO Sandeep Uttamchandani joined Matteo Pelati, Executive Director at DBS, in a fireside chat to discuss transforming DataOps in Banking. Watch the 45-minute video below or review the transcript.


Transcript

Sandeep: Welcome, Matteo. Really excited to have you. Can you briefly introduce yourself and tell us a bit about your background and your role at DBS?

Matteo: Thank you for the opportunity. I lead DBS’s data platform and have been with the bank for the last three years. In that time, we’ve built the entire data platform from the ground up. We started from nothing with a team of about five people, and now we are over one hundred. I’ve spent the last 20 years in this field at many different companies, mostly startups. DBS is actually my first bank.

Sandeep: That’s phenomenal. DBS is a big company and one of the key banks in Singapore, so it’s really impressive that you and your team kicked off the building of your data platform.

Matteo: DBS is going through a company-wide digitization initiative. We initially outsourced development, but now we retain much more development in-house through the use of open source technologies. We’re also contributing back to open source!

Sandeep: That’s phenomenal, and I have seen some of your other work. Matteo, you’re truly a thought leader in this space, really spearheading a whole bunch of things!

Matteo: Thank you very much.

Sandeep: Top of mind for everyone is COVID-19. It’s something every enterprise is grappling with, so it’s probably a good starting point. Matteo, can you share your thoughts on how the COVID-19 situation is impacting banks in Singapore?

Matteo: Obviously there is an economic impact, which is inevitable everywhere in the world. There is definitely a big impact on the organization, because banks don’t traditionally have a remote workforce. All of a sudden we found ourselves having to work remotely, as ordered by the government. We had to adapt, and we’ve done well adapting to the transition to home-based working. There were challenges in the beginning, such as remote access to systems and the suddenness of all of this, but we handled it, and are handling it, pretty well.

Sandeep: That’s definitely good news. Matteo, do you have thoughts on other industries in Singapore and how they are recovering?

Matteo: In Singapore we didn’t really have the strict lockdown other countries experienced. We did, however, limit social contact, and the government instructed people to stay at home. Some businesses have been directly impacted by this, for example Singapore Airlines; the airlines have all shut down. I’m from Italy, and COVID-19 has been hugely disruptive to people’s lives there, as all businesses were shut down. It did happen here in Singapore, but with a lesser impact. As things start to ease up and restrictions begin to loosen, hopefully the situation will get better soon.

Sandeep: From a DBS standpoint, Matteo, what is top of mind for you as you plan ahead around the impacts of the pandemic?

Matteo: Planning ahead, we’re looking at remote working as a long-term arrangement as there are many requests for it. We’re also exploring cloud migration as most of our systems have always been on-premise. As you already know, banks and companies with PII data may find it challenging to move sensitive data to the cloud.

The current COVID-19 pandemic has accelerated the planning for the cloud. It is a journey that will take time and won’t happen overnight, but having a remote workforce has helped us understand actual use cases. We’re investing in tokenization and encryption of data so that it can be in the cloud. There are lots of investments in that direction, and they have probably been sped up by the pandemic.

How DBS Bank Leverages Unravel Data

Sandeep: In addition to moving to the cloud, what new data project priorities can you shed light on?

Matteo: As you know, we are investing a lot in the data group as we’re running the data platform. Building a platform is very much about putting pieces together and making them work with each other. We decided at the beginning to invest a lot in building a complete solution, and we started doing a lot of development as well. We built this platform from the ground up, adopting open source software and building, to an extent, a full end-to-end self-service portal for the users. This took time, obviously, but the ROI was worth it because our users are now able to ingest data more easily, which enables them to build compute jobs simply.

Let me give you an example of where we leveraged Unravel. We have compute jobs that are built by our users directly on our cluster, through a web UI. Once they have done the build, they can take that artifact and easily promote it to the next environment. We can go to User Acceptance Testing (UAT), QA, and production in a fully automated way using the UI of a component that we wrote. This has now become our application lifecycle manager, where we have integrated with Unravel, giving us the ability to automatically validate the quality of jobs.

We leverage Unravel to analyze jobs and their previous runs, and basically block the promotion of a job if it doesn’t satisfy certain criteria. For us, it’s not just bringing in tools and installing them, but building an entire ecosystem of integrated tools. We have integration with Airflow and many other applications that we’re building and fully automating. Having gone through this experience, we’ve learned a lot, and we are bringing the same user experience to model productionization. What we’ve done with Spark data pipelines we are going to do with models.

We want our users to be able to build and iterate on machine learning models and to productionize them more easily, replicating for machine learning models the same one-button experience we have for ETL and compute jobs. That’s the next step we’re working on right now.
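
To make the promotion gate Matteo describes more concrete, here is a minimal sketch of such a check in Python. The endpoint path, metric names, and thresholds below are illustrative assumptions, not DBS’s actual integration or the Unravel Data API.

```python
# Hypothetical promotion gate: query recent runs of a job from a monitoring
# API (Unravel-like) and block promotion if the job looks too expensive or
# too flaky. Endpoint, fields, and thresholds are illustrative assumptions.
import sys
import requests

MONITORING_API = "https://unravel.example.internal/api/v1"  # assumed base URL
MAX_AVG_VCORE_HOURS = 50        # assumed cost threshold per run
MAX_FAILED_RUN_RATIO = 0.2      # assumed reliability threshold


def fetch_recent_runs(app_id: str, limit: int = 10) -> list:
    """Fetch metrics for the most recent runs of a job (illustrative endpoint)."""
    resp = requests.get(f"{MONITORING_API}/apps/{app_id}/runs", params={"limit": limit})
    resp.raise_for_status()
    return resp.json()


def can_promote(app_id: str) -> bool:
    """Return True only if recent runs satisfy the quality criteria."""
    runs = fetch_recent_runs(app_id)
    if not runs:
        return False  # no run history, nothing to validate against
    failed = sum(1 for r in runs if r.get("status") == "FAILED")
    avg_vcore_hours = sum(r.get("vcore_hours", 0) for r in runs) / len(runs)
    return (failed / len(runs) <= MAX_FAILED_RUN_RATIO
            and avg_vcore_hours <= MAX_AVG_VCORE_HOURS)


if __name__ == "__main__":
    app = sys.argv[1]
    if not can_promote(app):
        print(f"Promotion of {app} blocked: recent runs do not meet quality criteria")
        sys.exit(1)
    print(f"{app} passed validation, promoting to the next environment")
```

A CI/CD pipeline would call a script like this between the build and deploy stages, which is one way the "block the promotion" behavior described above could be wired in.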

Sandeep: You talked a bit about the whole self-service portal and making it easy for users to extract whatever insights they’re looking for, which is really interesting. When you think about your existing users, how are you satisfying their needs? What are some of the gaps that you’re seeing and working towards closing?

Matteo: There are definitely gaps because, as with any product, there are always new feature requests and new bugs. There are always new feature requests coming from customers and users. We do try to preempt what they need by analyzing their behavior, such as usage patterns and historic requests.

For example, we’re heavily investing in streaming now, but historically banks have always been oriented towards batch processing, restricted by legacy systems like mainframes and databases. Fast forward to now: we are starting to have systems that can produce real-time streams, and we changed the platform to support streaming data, which we introduced more than two years ago.

This changes the whole paradigm, because you don’t just want to build a platform that supports streaming, but one that supports it natively, so that we can have end-to-end streaming applications. Traditionally all the applications were built using batch processing, SQL, and so on. Now the paradigm has shifted, which also changes the requirements for machine learning: the deployment of a model becomes independent of the serving means, the transport layer, and so on.
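
As one way to picture the shift from batch SQL jobs to natively streaming applications, here is a small Spark Structured Streaming sketch. The Kafka broker address, topic names, and schema are assumptions for illustration only, and it presumes the Spark Kafka connector is available; it is not DBS’s pipeline.

```python
# A batch-style filter rewritten as a continuously running streaming job.
# Broker, topics, schema, and checkpoint path are placeholder assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = StructType().add("account_id", StringType()).add("amount", DoubleType())

transactions = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.internal:9092")  # assumed broker
    .option("subscribe", "transactions")                                # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("txn"))
    .select("txn.*"))

# What used to be a nightly batch query now runs end to end as a stream,
# writing its results straight back to another topic.
query = (transactions.filter(col("amount") > 10000)
    .selectExpr("to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.example.internal:9092")
    .option("topic", "large-transactions")
    .option("checkpointLocation", "/tmp/checkpoints/large-transactions")
    .start())

query.awaitTermination()
```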

While many organizations package the model into the application and deploy it as a REST API, here we say, “okay, let’s isolate the model itself from the application”. So basically once the data scientist builds a model, we can deploy it and build the tools for the discoverability of models too. This enables me to take my model and deploy it as a REST API, embed it into my application, deploy it as a streaming component, or deploy it as a UDF inside a Spark job. This is how we facilitate reusability of models; it’s the journey we’re going through, and it has started to pay back.
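
To illustrate one of the deployment options Matteo lists, the sketch below embeds a single trained model artifact in a Spark job as a UDF. The model path, feature column, and datasets are placeholder assumptions, not DBS’s model-serving framework; the same artifact could just as well sit behind a REST API or inside a streaming component.

```python
# Illustrative only: reuse one trained model artifact as a UDF inside a Spark
# job. Paths, column names, and the scikit-learn model are assumptions.
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("model-as-udf").getOrCreate()

model = joblib.load("/models/churn/model.joblib")  # assumed artifact location

@pandas_udf(DoubleType())
def score(features: pd.Series) -> pd.Series:
    # Each element of `features` is assumed to be an array of numeric features.
    frame = pd.DataFrame(features.tolist())
    return pd.Series(model.predict_proba(frame)[:, 1])

customers = spark.read.parquet("/data/customers")  # assumed input dataset
(customers
    .withColumn("churn_score", score(customers["feature_vector"]))
    .write.mode("overwrite")
    .parquet("/data/customer_scores"))
```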

Three years ago we started with a very simple UI to allow users to clean their SQL, so as to ease the migration of existing Teradata and Exadata jobs to our data platform. As the users became more skilled, they needed more features around reusability. So the platform evolved with the users at heart, and now we are at a very good stage. We get good feedback on what we have built.

Cloud Migration Strategy

Sandeep: I’ve heard some of your talks and it’s good stuff. Can you share some of the detailed challenges that you’re facing when you think about the cloud migration?

Matteo: We’re at the early stages of cloud migration; you could say the exploration phase. The biggest challenge is access to data. We are working on using encryption and tokenization at large scale and expanding their use throughout the entire data platform, so that data access, wherever the data lives, will be governed by these technologies.

We have to handle it holistically, incorporating our own data ingestion framework. To an extent we’re helped by the work that we have done previously, because every component that reads or writes data on our platform goes through a layer that we built, our data access layer, which handles all these details. For example, all the tokenization, access validation, and authorization is handled by the data access layer. Since all our users go through this data access layer, it gives us an opportunity to implement a feature across all the users in a very easy way. That’s basically our abstraction layer.
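
A conceptual sketch of that kind of access layer is shown below. The entitlement table, PII column list, and hash-based "tokenization" are toy stand-ins for real policy and tokenization services; this is not DBS’s implementation, only the shape of the pattern.

```python
# Conceptual sketch: one component that every read/write goes through, so
# authorization and tokenization are enforced in a single place rather than
# in each job. Policy tables and hashing are simplified placeholders.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, sha2

ENTITLEMENTS = {("alice", "transactions", "read"), ("etl_job", "transactions", "write")}
PII_COLUMNS = {"transactions": ["account_id", "customer_name"]}


class DataAccessLayer:
    """Single entry point for reads and writes; governance lives here, not in jobs."""

    def __init__(self, spark: SparkSession, principal: str):
        self.spark = spark
        self.principal = principal

    def _authorize(self, dataset: str, action: str) -> None:
        if (self.principal, dataset, action) not in ENTITLEMENTS:
            raise PermissionError(f"{self.principal} may not {action} {dataset}")

    def _tokenize(self, df: DataFrame, dataset: str) -> DataFrame:
        # Stand-in for reversible tokenization: hash each configured PII column.
        for c in PII_COLUMNS.get(dataset, []):
            df = df.withColumn(c, sha2(col(c).cast("string"), 256))
        return df

    def write(self, df: DataFrame, dataset: str) -> None:
        self._authorize(dataset, "write")
        self._tokenize(df, dataset).write.mode("append").parquet(f"/data/{dataset}")

    def read(self, dataset: str) -> DataFrame:
        self._authorize(dataset, "read")
        return self.spark.read.parquet(f"/data/{dataset}")
```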

Security and the hybrid cloud model are a challenge at the moment. How are we going to share the data between on-prem and cloud? How are we going to handle the movement of data? Part of the data will be in the cloud and part will be on-prem, so it’s not easy to define the rules that determine what is going to be on-prem and what is going to be in the cloud. So we are evaluating different technologies to help us move the data across data centers, such as abstraction layers and abstracting the file system using a caching layer. I must say that these are probably the two challenges we’re facing now, and we are at the very beginning of that, so I see many more challenges on our journey.

Sandeep: Having done cloud migration a few times before, I can totally vouch for the complexity. Can you share the other ways in which Unravel is providing visibility into data operations and helping you out?

Matteo: We use Unravel for two different purposes. One, as I mentioned, is the integration with CI/CD for all the validation of the jobs, and the other is more for analyzing and debugging the jobs. We definitely leverage Unravel while building our ingestion framework. I also see a lot of usage from users that are writing jobs and deploying them onto the platform, so they can leverage Unravel to understand more about the quality of their jobs, whether they did something wrong, and so on.

Unravel has become really useful for understanding the impact of our users’ queries on the system. It’s very easy to migrate a SQL query that was written for Oracle or Teradata and encounter operations like joining twenty or thirty tables; these operations are very expensive on a distributed system like Spark, and the user might not necessarily know it. Unravel has become extremely useful for letting users understand the impact of the operations they’re orchestrating. As you know, we have our own CI/CD integration that prevents users from putting, let’s say, expensive jobs into production. So this plus Unravel is a very powerful combination, as we empower the user. First we stop the user from messing up the platform, and second we empower the user to debug their own things and analyze their own jobs. Unravel gives users that have traditionally been DBMS users the possibility to understand more about their complex jobs.

Sandeep: Can you share what things were like prior to deploying Unravel? What were some of the key operational challenges you were facing, and what was the starting point for the Unravel deployment?

Matteo: Through the control checks that we implemented recently, we saw too many poor-quality jobs on the platform, and that obviously had a huge impact. Before introducing Unravel, we saw the platform using too many cores and jobs being executed very inefficiently.

We taught users how to use Unravel, which enabled them to spend time understanding their jobs and going back to Unravel to find out what the issues were. People were not following that process previously; as you know, optimization is always a hard task, as people want to deliver fast. So the control checks basically started pushing users back to Unravel to gain performance insights before putting jobs into production.

Advice for Big Data Community

Sandeep: Matteo, what do you see coming in terms of technology evolution in DataOps? Earlier you mentioned the adoption of machine learning and AI; can you share some thoughts on how you’re thinking about building out the platform and potentially extending some of the capabilities you have in that domain?

Matteo: We have had different debates about how to organize the platform, and we have built a lot of stuff in-house. Now that we are challenged with moving to the cloud, the biggest question is, “Shall we keep the current stack that we have and stay cloud agnostic, or should we rely on the services provided by the cloud providers?”

We don’t have a clear answer yet and we’re still debating. I believe that you can get a lot of benefit from what the cloud gives you natively and basically make your platform work on top of that.

Talking about technology, we are investing a lot in Kubernetes, and most of our workload is on Spark, so that’s where we’re planning to go. Right now our entire platform runs Spark on YARN, and we are investing a lot in experimenting with Spark on Kubernetes and migrating all apps to Kubernetes. This will simplify the integration with machine learning jobs as well. Running machine learning jobs is much easier on Kubernetes, because you have containers, and that full integration is what we need.
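
For readers unfamiliar with the switch, the sketch below shows roughly what pointing a Spark session at Kubernetes instead of YARN can look like. The API server URL, namespace, container image, and service account are placeholders, not DBS’s configuration.

```python
# Minimal Spark-on-Kubernetes sketch: the same application code, but the
# cluster manager is Kubernetes and executors run as pods from a container
# image. All endpoints and names below are placeholder assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master("k8s://https://kube-apiserver.example.internal:6443")
    .config("spark.kubernetes.namespace", "data-platform")
    .config("spark.kubernetes.container.image", "registry.example.internal/spark:3.4.1")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate())

# A trivial job to confirm executors come up as pods and do work.
spark.range(1_000_000).selectExpr("sum(id) AS total").show()
```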

We are also exploring technologies like Kubeflow, for example, for productionizing machine learning pipelines. To an extent, it’s like scrapping a lot of stuff that has been built over the last three years and rebuilding it, because we are using different technologies.

I also see a lot of hype around other languages. Traditionally the Hadoop and Spark stack has revolved around the JVM, Java, and Scala, and more recently I started exploring Python. We’ve also seen a lot of work using other languages like Golang and Rust. So I think there will be a huge change in the entire ecosystem because of the limitations the JVM has. People are starting to realize that going back to a much smaller executable, as in Golang or Rust, with much simpler garbage collection, or no garbage collection at all, can simplify things a lot.

Sandeep: I think there’s definitely a revival of the functional programming languages. You made an interesting point about a cloud agnostic platform, and one of the things that Unravel focuses a lot on is supporting technologies across on-prem as well as the cloud. For instance, we support all three major cloud providers as well as their technologies. One of the aspects we’ve added is the migration planner. Any thoughts on that, Matteo? Knowing what data to move to the cloud versus what data to keep local, how are you solving that?

Matteo: We are exploring different technologies and different patterns, and we have some technical limitations and policy limitations. To give you an example, all the data that we encrypt and tokenize, if it is tokenized on-prem and needs to be moved to the cloud, actually needs to be re-encrypted and re-tokenized with different access keys. So that’s one of the things we are discussing that obviously makes the data movement harder.

One thing that we are exploring is having a virtualized layer, not just over the file system, but a virtual cluster across on-prem and the cloud. For example, to virtualize our file system, we’re using Alluxio. With Alluxio we are experimenting with having a cluster that spans the two data centers, on-prem and the cloud. We are doing the same with database technologies: we are heavily using Aerospike and experimenting with the same approach there.
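
As a rough illustration of reading through such a virtualized layer, the snippet below addresses data through an alluxio:// path so the caching layer decides whether the bytes come from on-prem storage or a cloud bucket. The master address and dataset paths are placeholders, and it assumes the Alluxio client is already available to Spark; it is not DBS’s setup.

```python
# Illustrative only: Spark reads through Alluxio's virtual file system, so the
# job does not need to know which data center currently holds the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-read-sketch").getOrCreate()

# Same logical path whether the blocks are cached on-prem or fetched from a
# cloud store mounted into Alluxio (address and path are placeholders).
customers = spark.read.parquet(
    "alluxio://alluxio-master.example.internal:19998/warehouse/customers")

customers.groupBy("segment").count().show()
```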

We have to be really careful because, being across data centers, the bandwidth between on-prem and cloud is not unlimited. I’m not sure if this will be our final solution, because we have to face some technical challenges, like the re-tokenization of data.

Re-tokenization and re-encryption of data with this automatic movement is too expensive, so we are also exploring ingesting the data both on-prem and in the cloud, and letting the user decide where the data should be. So we are experimenting with these two options. We haven’t come to any conclusion yet because it’s in the R&D phase now.

Sandeep: Thank you so much for sharing. To wrap up, Matteo, I just wanted to end with this: do you have any final thoughts on some of the topics we discussed, or anything you’d like to add to the discussion?

Matteo: No, not particularly. To summarize, we run the platform like a product company. We have product managers, and we have our own roadmap that is decided by us and not by the users.

This has turned out to be very successful in two ways. One aspect is the integration: building a product, we make sure that every piece is fully integrated with the others and that we can give the user a unified experience, from the infrastructure to the UI. The second is that it has helped a lot with the retention of the engineering team, because building a product creates much more engagement than doing random projects. This has been very impactful.

I think about all the integrations we’ve done, the automation we’ve built, and there are multiple aspects to it. For us, building our platform and our services as a product has been extremely beneficial, with payback after some time. You need to give the investment time to return, but once you get to that stage, you’re going to get your ROI.

Sandeep: That’s a phenomenal point, Matteo, especially treating the platform as a product and really investing and driving it. I couldn’t agree more. I think that’s really very insightful and, from your own experience, you have been clearly driving this very successfully. Thank you so much, Matteo.

Matteo: Thank you, that was great.