Running Large Scale Database Applications – Percona Podcast #62 /w Yahav Biran

by Yahav Biran, Matt Yonkovit

Link to listen and subscribe: PodBean

Yahav Biran, Principal Solutions Architect at AWS, joins Matt Yonkovit, the Head of Open Source Strategy (HOSS) at Percona, to talk about some of the challenges and nuances of running large-scale applications in the gaming industry. Yahav has a deep background not only in gaming but in technology. Join the HOSS and Yahav as they discuss technology, open-source, and more.

YouTube

Link: https://youtu.be/F8txf0Z6e0A

Yahav Biran

Principal Solutions Architect at AWS

Yahav Biran is a Principal Solutions Architect at AWS, focused on game tech at scale. Yahav enjoys contributing to open-source projects and publishes in the AWS blog and academic journals. He currently contributes to the K8s Helm community, Percona Live, AWS databases and compute blogs, and Journal of Systems Engineering. He delivers technical presentations at technology events and works with customers to design their applications in the cloud. He received his PhD (Systems Engineering) from Colorado State University.


Matt Yonkovit

The HOSS, Percona

Matt is currently working as the Head of Open Source Strategy (HOSS) for Percona, a leader in open source database software and services. He has over 15 years of experience in the open source industry, including over 10 years of executive-level experience leading open source teams. Matt’s experience merges the technical and business aspects of the open source database experience with a passion for hands-on development, management, and building strong teams. During his time he has created or managed business units responsible for service delivery (consulting, support, and managed services), customer success, product management, marketing, and operations. He currently leads Percona’s OSPO, community, and developer relations efforts. He hosts the HOSS talks FOSS podcast, writes regularly, and shares his MySQL and PostgreSQL knowledge as often as possible.


Transcript

Matt Yonkovit:
Hey everybody, welcome to another episode of the HOSS talks FOSS. I’m the HOSS, head of open source strategy here at Percona, Matt Yonkovit. And today, I am here with Yahav Biran from AWS. How are you today, Yahav?

Yahav Biran:
Thank you. I’m doing great. It’s Monday.

Matt Yonkovit:
It’s Monday. That’s right. It’s Monday. And right now, it always happens when I record something, someone decides to do work outside. So my neighbor is running the lawnmower, and he’s trimming some bushes, and he’s doing all kinds of other stuff. So I’m like, Who? So, hopefully that doesn’t get picked up on the mic. But if it’s going to happen, it’s going to happen during one of my recordings here. But that’s okay. So, Yahav, he’s gonna be joining us at Percona Live, and I wanted to sit down and talk to him about some of the things that he’s done, talk to him about his career, where he got started, and what interests him. So yeah, maybe you can just give us a little introduction to yourself.

Yahav Biran:
Okay, so thank you for having me. I am a solutions architect. For the last three years, I’ve been a solutions architect at AWS, supporting DNB customers, digital-native businesses, as well as gaming customers. Now, the reason that I like to support these two types of customers is that these two customer personas offer the things that I’m very interested in, which is a large scale of everything: large-scale databases, large-scale compute. That is the challenge that I usually like to solve. Before that, I was working in a large gaming company as a product manager, and before that, working for Microsoft as a product manager on compute. So, the way that I like to see myself is as a generalist that likes to specialize in everything.

Matt Yonkovit:
Yes, well, yeah, I mean, and that’s great to get your hands into a lot of different activities. I’m sure that that’s something that a lot of us appreciate, because getting stuck just doing one thing over and over again can be a little mundane. So having the ability to float around is often a good thing. Now, you mentioned some focus on gaming. And I wanted to talk to you a little bit about that because that’s a very interesting topic; I’ve given a couple of talks on that. I’ve actually published a couple of articles in a few places on the gaming industry and databases as a whole. And it’s interesting. So my experience in that space is that you tend to have kind of a two-fold problem. And I’m curious if you’ve seen this as well, where you’ve got not only the game itself that needs its infrastructure and its data-related activities, but you also have all of the ancillary pieces that they might outsource, or they might purchase or license from someone else. Whether it’s matchmaking services, chat, or something else, all of them have their databases. All of them have a potential point of failure. And I found that, from a testing perspective, gaming companies tend to do a lot on the core game, but a lot of those ancillary services tend to be under-tested. I don’t know if you’ve experienced that as well.

Yahav Biran:
Yes, yes, absolutely. That’s exactly one of the architectural problems that customers usually have when they are designing game workloads, and it’s exactly what you said, because the critical path of the game is you and I just shooting or building things, right? But then if you want to make a microtransaction or do other things that are not on the critical path, it is usually delegated to other services, and by design and architecture it is decoupled from the game, and that sometimes gets neglected in the design. The reason that it’s neglected is not just that it’s not on the critical path; it’s also a part that is not the most important part of the game, because, if you think about it, those things are just transactional, like any transactions, right? Purchasing loot or any other things. I want to mention that at the previous Percona Live in 2020, I presented a talk that helps game customers detect fraud in gaming with machine learning. So what we’ve done is we basically looked at all the datasets, including the game events and the microtransactions, we combined all of them, we had to build the model, right, the machine learning model, and help the game live ops team to access these machine learning inferences through MySQL or Postgres, right? So you can actually run SQL functions that, on the back end, are going to call machine learning models to detect fraud.

Matt Yonkovit:
Yeah, and we’ll put a link in the video here, just so people can check that out because I believe that’s recorded on our YouTube channel for those who are interested. Or we’ll put a link to the slides because we’ll have those from the Percona Live, I do remember that talk, as well. And I think that that’s a really interesting segue. Because I believe that as we talk about the need for databases in the gaming industry, it’s always growing, if it’s not the actual critical services, or even the ancillary services, it’s the metadata, it’s the auditing, it’s the trying to find those problems. And I think that that’s an exciting space because gaming is one of the spaces where you could have, one game has dozens, if not even hundreds of different workloads that it needs to support from an infrastructure perspective, it’s not just an OLTP. It’s not just an in-memory engine, it’s not just an analytics engine, it’s all combined into one. And a lot of times if one of those pieces breaks, it then causes cascading impacts.

Yahav Biran:
Right, right. And this is why you would like to design your system in a resilient way, so it’s not going to be impacted. There is another pattern that I’m suggesting. Let’s try to think about a large-scale game, right, that spans across many regions, and it stores its state somewhere in some database, right? So you need to make sure that this state will propagate across all the regions. And it’s not just from a resiliency perspective; you also want the game to be more performant, right? You don’t want to write to Asia Pacific when the player is playing in Virginia, right? Or on the East Coast, I’m sorry. So things like that. So you want to think about the architectural aspects, to spread it nicely for both performance and resiliency. And that needs to be applied not just on the database, but also on the compute, right? And these two layers need to be decoupled, right, so they will not experience the cascading effect that you just mentioned.
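
The region-local write pattern described in this answer can be sketched in a few lines of Python. This is only an illustration and not something shown in the episode: the location-to-region table, the default region, and both function names are assumptions made up for the example, and a real game would rely on its database's own cross-region replication rather than application code.

```python
from typing import List

# Hypothetical mapping from a player's location to the database region that
# should accept that player's writes. The entries are illustrative only.
HOME_REGION = {
    "virginia": "us-east-1",
    "oregon": "us-west-2",
    "tokyo": "ap-northeast-1",
}
DEFAULT_REGION = "us-east-1"  # assumed fallback for unknown locations


def write_region(player_location: str) -> str:
    """Return the region whose database should take this player's writes."""
    return HOME_REGION.get(player_location.lower(), DEFAULT_REGION)


def replication_targets(player_location: str) -> List[str]:
    """Return the other regions the state still has to propagate to."""
    all_regions = set(HOME_REGION.values()) | {DEFAULT_REGION}
    return sorted(all_regions - {write_region(player_location)})


if __name__ == "__main__":
    print(write_region("Virginia"))
    print(replication_targets("Virginia"))
```

The point of the sketch is the split between the two functions: the write stays close to the player for performance, while propagation to the remaining regions covers resiliency.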

Matt Yonkovit:
Yeah. And I think that one of the things that I have seen, especially from the problems that occur in, like, week zero or week one of a game launch, is that those architectural issues that weren’t fleshed out, where people didn’t think through the design or the implications, tend to bubble up very quickly, because the traffic ramps up so fast, right? And that’s when you start to experience that pain. And unfortunately, if it’s a design issue, that becomes a more complex issue to fix. I mean, you can add additional hardware reasonably quickly, you can grow instance sizes, and maybe, if you’ve added the ability to scale a certain way and add nodes, you can scale horizontally as well. But if the schema or the infrastructure you’ve designed is a little wonky, or maybe you haven’t built in all of the concepts to scale quickly, then that really can hinder your launch.

Yahav Biran:
Yeah, absolutely. So if you think about it from a compute perspective, that’s easy. I think it’s easy because, by nature, compute is stateless, so you just need to think about spinning resources up and down to address player needs. But what you mentioned now is much more critical on the database side, right? You want to make sure that the players will have the right performance, so that they will get the proper reaction from the system on time. And then there is another thing, right? You mentioned that in the first two weeks, things ramp up really, really fast. But what if the game is not good and is not successful? What happens if, after two weeks, everything goes down? You also want to react fast, and not spend too much time modifying those things when things are trending down. So elasticity is becoming critical, especially for games, right? Because it’s very trendy. It’s not like a bank, which will have stable patterns; games are very emotional in workload. So we need to think about those things as well when we design games.

Matt Yonkovit:
Yeah, events and other things, new patches, updates, add-ons can cause those spikes at any given time. And once you launch, sometimes you have to support it for a really long time. And that means that over time, you will see not only the increase, but the decrease. Hopefully, if you’re successful, it’s a slow decrease, but eventually, it does happen. But speaking of design, we just went through COVID the last couple of years, which meant that there were a lot more people online, a lot more people looking for alternative ways to escape, which means that from a gaming perspective, from a streaming perspective, from an online perspective, the traffic went ballistic, it went crazy. Everyone lived on the internet for the last few years. I mean, we lived on the internet before, but we took advantage of living on the internet the last few years. And your talk at Percona Live is about how that spike in traffic and the additional workload on Postgres databases started to manifest, especially with larger datasets, as some locking contention. And that was something that you started to see and worked with others to help address. Maybe talk to us a little bit about what you saw there, as the traffic started to ramp up and those data sizes started to get bigger.

Yahav Biran:
Yes, so it is a combination of two things. The first one is what you mentioned, right? People just spent more time at home, and they were doing things from home. So instead of going out to order something, they were ordering items online, right? So that’s one thing. But the second part is that in AWS, in Aurora specifically, we built a platform that allows you to store a lot of data, right, a lot of data. And when I’m saying a lot of data, I’m talking about tens up to hundreds; actually, the official limits are up to hundreds of terabytes of data set in your cluster, which is unheard of. Usually, it does not need to be a hundred, and the hundreds of terabytes do not need to be stored in the hot path of the application. But theoretically, you can do that. And then the combination of these two started to manifest in locks that we hadn’t seen in Postgres. As a matter of fact, when we were troubleshooting the customer issues, we had very little reference in the Postgres community that talked about these symptoms. And this is one of the reasons that my partner, Sammy, who is a database engineer, and I chose to experimentally investigate it, blog about it, and present it here.

Matt Yonkovit:
That’s great. Yeah, and I think that this is one of the interesting patterns in databases that I think application developers might take for granted. Especially now, maybe not before, but I’ve seen that as people are moving faster to get applications out and do new releases, they tend to try and treat databases as a commodity, and they do less design work. And so they rely more on tools, ORMs, or other things, and so it works, and that checks the box for them. But then it doesn’t necessarily work at scale. And it’s like one of those problems that’s just bubbling underneath the surface. And when it reaches a point where the underlying structure can’t handle it anymore, it bursts, and it causes quite significant slowdowns. And, fortunately or unfortunately, we don’t have a lot of patience for slowdowns on the web, or in games or applications; we kind of expect things just to work. And we want it now, right? So even a few milliseconds or 100 milliseconds or a few seconds is going to cause a bad user experience. And that can really hinder a company’s overall value and its perception by the community.

Yahav Biran:
Yeah, and you’re absolutely right, this is exactly what we’ve seen. You mentioned the ORM, right? The customers want to start fast, and they just go with some naive approach; they focus on their business, right? They just want to make sure that the ordering system is in place, or whatever the business logic is. And they assume many, many things, or basically delegate that to the platform. And the platform is the ORM but also the infrastructure, which is AWS, or whatever your favorite cloud is. And then when they hit this spike, they are crying for help, and rightfully. Now, there was another talk that I gave at re:Invent, the last re:Invent in Vegas, that talks about what to do just before you jump into this naive design, which is good, because by design you want to start fast. You don’t want to think about your whole schema and how the referential integrity will work, also because you don’t know those things at this stage. So the talk that I gave at re:Invent actually gives you four tips to just think about those things, right? Before you go and write your application, make some assumptions, and then you just go, and you’re going to be safe when you hit the terabyte mark in the hot data set. And now that I think about it, maybe it’s going to be an interesting talk for the next Percona.

Matt Yonkovit:
No, yeah, definitely. Because honestly, I think that one of the big areas is a lack of design skills or understanding of the choices you make early on and their impact. Like I said, it’s infinitely more difficult to make design changes to your application after it’s live than it is to add additional infrastructure, to add additional nodes, to scale up, to scale out. As an industry, we’ve done a good job in providing you tools to do some of that scale. I mean, the Amazon interface for adding nodes or replicas is very straightforward and simple, but your application has to support that. So the design is where the focus area needs to be in the future, in order to ensure that you have great scalability.

Yahav Biran:
Yeah, it’s all spot on, and the canonical example of what we just discussed is partitioning, right? So you begin a new application, right? And you start with the ORM, you just define your schema, and you just go and start building. You launch the application, it does lots of inserts, does a lot of updates, and your database is growing. And now what? Now you need to partition your database to reduce the hot dataset. And doing that when you have data is open-heart surgery, right? But if you do it before, it’s just a modification in the ORM spec, and that’s it, and you’re done. And that is thanks to Postgres, right? The Postgres engine has evolved a lot in its support for partitioning, and this is another topic that I’m investigating right now with my buddy, who is going to present with me. And that is what we want to do: we want this partitioning thing and other patterns to be the naive approach. When you start, you just put it there, right? It’s just there, and you don’t have to go through this open-heart surgery and make these heavy changes when you grow. And if you didn’t hit the growth that you wanted, it’s not a big deal. You’re not going to suffer from the new naive approach, which is just partitioning your database.
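
As an illustration of what partitioning "from day one" can look like, here is a minimal Python sketch that generates PostgreSQL declarative range-partitioning DDL, one partition per month. It is not taken from the episode or from the talk: the table name `game_events`, its columns, and the helper names are hypothetical, and only the `PARTITION BY RANGE` and `PARTITION OF ... FOR VALUES FROM ... TO` statements are standard PostgreSQL syntax.

```python
from datetime import date


def next_month(first_day: date) -> date:
    """Return the first day of the month after first_day."""
    if first_day.month == 12:
        return date(first_day.year + 1, 1, 1)
    return date(first_day.year, first_day.month + 1, 1)


def parent_ddl(table: str, key: str) -> str:
    """DDL for a parent table range-partitioned on `key` (columns are illustrative)."""
    return (
        f"CREATE TABLE {table} (\n"
        f"    id bigint NOT NULL,\n"
        f"    {key} timestamptz NOT NULL,\n"
        f"    payload jsonb\n"
        f") PARTITION BY RANGE ({key});"
    )


def partition_ddl(table: str, first_day: date) -> str:
    """DDL for the monthly partition starting at first_day (upper bound is exclusive)."""
    name = f"{table}_{first_day:%Y_%m}"
    upper = next_month(first_day)
    return (
        f"CREATE TABLE {name} PARTITION OF {table} "
        f"FOR VALUES FROM ('{first_day.isoformat()}') TO ('{upper.isoformat()}');"
    )


if __name__ == "__main__":
    print(parent_ddl("game_events", "created_at"))
    print(partition_ddl("game_events", date(2022, 12, 1)))
```

Creating the parent table this way before any data exists is the cheap step; attaching or creating new monthly partitions later does not require rewriting rows that are already stored.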

Matt Yonkovit:
Yeah. I mean, I think that that’s a critically big mistake that people make if they don’t think upfront. And I think it’s great advice that you need to think through some of the implications early on to avoid the problems later on. Now, changing gears just a little bit, one of the other things that I know that you have some passion for is contributing and working within the Kubernetes community. And going hand in hand with the large scale applications is also that kind of building out the ability for applications, not just the databases, to scale out, and a lot of people have been using Kubernetes, and microservices as kind of the de facto standard, as you’re designing new applications to roll out the ability to scale those applications. So I’m curious, what sort of work are you doing in the community, and maybe tell us a little bit about some of the things that are going on there.

Yahav Biran:
Right. So if we move up a little bit to the application level, AWS and Kubernetes allow you to spin up as much compute as you want or even wish to run. But with that comes a problem, right? There is an increased cognitive load to understand how to operate and manage all of this. To give you an example: Kubernetes is a blessing and a curse, right? It allows you to define all kinds of operators; I think Percona has an operator for their stateful application as well. But it offers you a lot of flexibility. Now, this flexibility is manifested in many, many configuration files, and all kinds of things that are supposed to help you to go and deploy. Helm is the project that I’m contributing to. I’m looking at patterns that customers are adopting and the challenges that they are facing on how to deploy a complex application that includes all kinds of Kubernetes primitives, but also their operators, right, and deploy it in an easy way, and help them to operate it and manage it more effectively. So my main contribution is around looking at patterns from customers and translating them into Helm charts that make the customer’s DevOps life easier.

Matt Yonkovit:
Yeah, and I think that that’s super critical. I mean, as we talked about the shift in design paradigm, so many companies now are all about enabling developers, and developers are all about building their individual components, using what’s out there. You end up with a tapestry of many different technologies across many different environments and many different implementations, and it’s getting more and more complex. So anything that we can do to kind of rein in the complexity and give some tools to make it a little simpler is critical. Because there’s just so much. I mean, I’ve talked to people, and they might have an application or a set of applications that has seven or eight different databases that are part of the infrastructure, they might have a couple of different development stacks within that same infrastructure, they might have all kinds of microservices, dozens, hundreds even, and they all do slightly different things, they all have slightly different requirements. So it’s really that shift from managing just a few servers to managing hundreds or even thousands of servers. And I think that that shift requires a lot more focus on the ease of use and what you can do, and I think Helm and Kubernetes definitely go hand in hand in helping to kind of reduce that craziness.

Yahav Biran:
Yes, the servers that you mentioned, I will say that this is the x dimension. The y dimension is the complexity, right? Microservices take the monolith and decompose it, right? Every component in the microservice has its own config and its own binaries because you separate them, and that makes up the cognitive load that I was referencing before. So the complexity is multi-dimensional, and Helm specifically helps you take all the abstractions that you made, put them in one chart or two, and just hand it to the DevOps team to go and operate, instead of going and figuring out seven clouds and config maps, etc., in your Git repository.

Matt Yonkovit:
Yeah, no, I mean, reducing that complexity is so important, so critical. I want to thank you for joining me today. This has been a great conversation. I just love talking tech, I love talking about database stuff, I love talking about the industry. So this has been spectacular. I hope that those who are listening will join us at Percona Live. Yahav is gonna be there, and we will be able to get together, maybe have a drink or chat a little bit more. I’d love to hear about the work that you have uncovered in debugging and kind of unblocking some of those low-level locks in Postgres.

Yahav Biran:
Perfect. Yes. Thanks for having me. And I am really, really looking forward to meeting all of you. Please come by; even if you don’t have the time to listen to my talk, I would love to hear about issues that you have, and see if I can just take some information and build some fun things.

Matt Yonkovit:
Awesome. All right.

Yahav Biran:
Thank you. Bye-bye.
