Modernize Relational Databases Through a Cloud-Native Approach – Opensource Podcast 67 /w Denis Magda

by Denis Magda, Matt Yonkovit

Link to listen and subscribe: PodBean

Distributed SQL (also called NewSQL) aims to modernize relational databases by bringing better availability, scale, and performance through a cloud-native approach. Yugabyte is one of the leaders at the forefront of this movement, marrying the rock-solid PostgreSQL client and protocol with a brand-new cloud-native backend. Denis Magda stops by and chats with the HOSS not only about Yugabyte but also about his work on various Apache projects.

YouTube

Link: https://youtu.be/4BqHpFpoolM

Denis Magda

Head of Developer Relations at Yugabyte

Denis Magda has spent half of his career working on distributed systems, applications, and databases. His experience spans from the development of distributed database engines and high-performance applications to training and education on the topic of distributed and cloud computing. Presently, Denis runs the Developer Relations team at Yugabyte and serves as a PMC Member for Apache Ignite. He started his professional career at Sun Microsystems and Oracle, where he led a Java development group and worked on technology evangelism efforts.


Matt Yonkovit

The HOSS, Percona

Matt is currently working as the Head of Open Source Strategy (HOSS) for Percona, a leader in open source database software and services. He has over 15 years of experience in the open source industry, including over 10 years of executive-level experience leading open source teams. Matt’s experience merges the technical and business aspects of the open source database world, combining a passion for hands-on development with management and the leadership needed to build strong teams. During his time he has created or managed business units responsible for service delivery (consulting, support, and managed services), customer success, product management, marketing, and operations. He currently leads Percona’s OSPO, community, and developer relations efforts. He hosts the HOSS Talks FOSS podcast, writes regularly, and shares his MySQL and PostgreSQL knowledge as often as possible.


Transcript

Matt Yonkovit:
Hey everybody, welcome to another HOSS Talks FOSS. I’m the HOSS, Matt Yonkovit, Head of Open Source Strategy here at Percona. Today, I’m joined by Denis Magda from Yugabyte. Denis, how are you doing?

Denis Magda:
Great. All right.

Matt Yonkovit:
Good, good. Now, Denis, I don’t know if you knew this, but we actually worked at the same company at the same time.

Denis Magda:
No, seriously?

Matt Yonkovit:
Serious, serious, serious. So Denis doesn’t know this, but I did look up his background. While he was at Sun Microsystems, I was at Sun Microsystems. I was actually on the MySQL team when we got acquired, and then I was with Sun up until the Oracle acquisition, and then I left for Percona.

Denis Magda:
Wow. Okay, so we’re ex-colleagues. Excellent. We didn’t even know it. I think we never met during the Sun days, though I did meet some of the ex-MySQL folks. In those days I lived in Russia, because Sun used to have a development center there, and many guys from the MySQL team would come to that office and we would have nice conversations. But great to know that we were colleagues.

Matt Yonkovit:
So I wanted to maybe talk a little bit about what you’re doing now. You’re in charge of DevRel at Yugabyte, you’re in charge of community outreach and getting people excited and educated on what Yugabyte can do. But you didn’t start in the database space. You started in Java. Maybe tell us a little bit about what eventually brought you to databases and how that evolution worked.

Denis Magda:
Yeah, the story of my professional career is a little bit different. I started at Sun and Oracle, and in those days I was on one of the Java development teams. If you guys remember, there was Java Micro Edition and Java Embedded. I was on the team developing the JVM and JDK for different mobile phones and embedded devices. Why did I join and decide to develop Java itself? Because I was always curious about the internals. Before joining Sun and Oracle, I was a professional Java developer. I used to create different back-end applications, web applications, etc., and I played with mobile applications. But I always wanted to look inside. I was curious what Java engineers do to make Java work on different operating systems. Eventually I was lucky enough: I passed an interview and joined the development team. After that, I barely wrote code in Java, because when you’re working on the porting team, you’re literally taking the JDK and JVM and using C, C++, or even assembly language to make sure that your JVM works on different microcontrollers and mobile devices. Now, in Java we can create multiple threads, and those threads can execute different tasks in parallel using your CPUs. The thing with Java Micro Edition is that regardless of how many threads you create on the Java level, those threads will be mapped to a single operating system thread. So generally you’re not creating highly concurrent applications for mobile phones using Java Micro Edition, because it didn’t make any sense. But during my free time I was studying Java concurrency and trying to create highly concurrent applications. And then, when I decided to switch gears, I looked for companies that were using Java to create cutting-edge technologies and that had a real open-source spirit.
That’s how I came across GridGain. GridGain is the company that donated Apache Ignite to the Apache Software Foundation, and it still remains one of the major contributors to the project in the community. Apache Ignite, for those who don’t know, is a distributed database for high-performance, in-memory computing. So that’s what they do. I joined GridGain, and I joined the Apache Ignite community. As a senior software engineer, I was contributing to the source code of Apache Ignite, to the networking layer and the storage layer, for more than half a year. But then the company and I decided that I wanted to be in the field. I wanted to talk to the users and to the customers. That’s basically what I used to do at Sun, where I belonged to the so-called Sun Campus Ambassador program. I was one of the Sun campus ambassadors, an evangelist at my university, helping developers and graduates learn Java and other Sun technologies. And that’s something I started to do at GridGain as well. I tried many roles in that company. I was on the support team, I was on the professional services team, and eventually I was leading product management and marketing. For my last two or three years, though, I was leading developer relations. Over my last seven years, Apache Ignite became one of the top five projects of the Apache Software Foundation. I like this community, I love these guys, and I wish them luck. But eventually I sensed that I needed to move forward. I wanted to explore something else while remaining within, let’s say, the database area, and this is how I came across Yugabyte. So right now I am at Yugabyte, and this is my third month. So I’m new. Yep, I’m excited.
I know we have a lot of things to do here. After joining the company, I see that developers truly can benefit from distributed databases such as Yugabyte. And Yugabyte is not the only one, right? That’s good, because you always have to care about the competition in the market. This way you can innovate faster, and you listen to your developers and your users. So that’s my quick story of how I went from being a Java engineer to working at a database company.

Matt Yonkovit:
It’s an interesting journey. Because, I mean, obviously Apache Ignite is a database; it’s just a different type of database than Postgres. So it’s a little bit different to get used to, and I’m sure that was a bit of a switch. But relational databases have been around for so long that it’s not that difficult to make that kind of migration from where you were. Now, I am curious. For those who are listening, many people might not have heard about what distributed SQL is. I know what it is, but maybe you could just give an overview of distributed SQL for us, for those listeners who might be new to this space.

Denis Magda:
Yeah, sure. That’s probably the best question to start with if you want to talk about distributed SQL databases. My explanation is quite straightforward. If you have ever created any web application, mobile application, or enterprise application, chances are high that you used a relational database. It could be Postgres, Oracle, MySQL, IBM DB2, or any other. And what do you usually use? You use SQL statements, you use joins, you use stored procedures, you use different triggers, because historically that’s how we introduced functionality for requesting and processing the data that resides in our database. When it comes to distributed SQL, the concept is simple. You still have the same SQL, the same joins, the same stored procedures, the same triggers, the same transactions with the same isolation levels, but now you want them to work at a global scale. To talk about global scale, take a standard single-server relational database as an example, say Postgres. It usually runs as one instance, on one virtual machine or one physical server, and once the application connects to this server, it can start issuing SQL requests, calling different stored procedures, etc. With distributed SQL you also have your relational database, but now it spans multiple nodes. You can have 5-node clusters, 10-node clusters, 15-node clusters. If you’re talking about the cloud, those nodes can reside in the same availability zone or in the same region, or they can be spread across multiple regions if you want to survive, let’s say, different outages.
But regardless of the deployment mode of your database, for the application the experience remains the same. The application still usually connects to a single endpoint. Even if you have, let’s say, a 10- or 15-node cluster, your application still connects to this cluster with one IP address. And then the magic happens on the database layer. Your application keeps sending the same SQL statements, the same joins, the same transactions. But now, if you’re executing a transaction and that transaction spans multiple nodes, the database is responsible for the consistency, atomicity, and other characteristics of transactional processing. So, to make things short, distributed SQL is the same SQL you’re familiar with. There are many dialects: you might be dealing with a distributed database that is just ANSI SQL-99 or SQL:2011 compliant, or with a database that wants to be compliant with the Postgres dialect or the MySQL dialect. But eventually, from the application developer’s standpoint, it is the same SQL, right? You execute your statements, and ideally most of the queries and most of the features should work out of the box.

Matt Yonkovit:
Now, a little bit deeper into that. One of the big things, though, is that architecturally it is different than standard Postgres or even other clusters, because it’s more reliant on sharding in the back end, correct? So it’s not a shared-everything design, right? Because everything’s not on every node. There are clustered systems or replicas that you can build that will have a copy of every database on every node. But when you’re talking about a 5- to 15-node cluster, each of those nodes contains a portion of the data, correct? So it’s more akin to a MongoDB-type architecture in the back end than it is to, like, MySQL clustering or a Galera cluster in MySQL, or things like that. And I guess it follows a similar pattern to Citus in that regard, does it?

Denis Magda:
Yeah, exactly. There are a lot of similarities, even though there are also differences. During my session we can discuss the differences between the databases, but generally you’re right: when we are talking about a distributed database, usually the data is partitioned. You mentioned MongoDB; Cassandra is also a good example from the NoSQL space. Yugabyte and CockroachDB also shard your data, or partition it, if you prefer the term partitioning. Let’s say you have 10 nodes in your cluster and 1 million records. The partitioning algorithm makes sure that all those 1 million records are spread uniformly across the cluster, so each node ideally gets about 100,000 records. That’s what happens, but there are also some nuances. When you partition or shard the data this way, you also need a query layer, the layer that receives your requests. That query layer knows which node is responsible for the data and which nodes should be involved in processing your query. Take Apache Ignite, for instance. Apache Ignite, same as Yugabyte, Cassandra, and MongoDB, also shards or partitions data across a cluster of machines. However, when you execute a SQL query with Apache Ignite, this SQL query will be sent to all of your nodes, because on every node you have a single process, and that process is your storage plus your processing layer. In Yugabyte the architecture is a little bit different. We have two different processes. We have the so-called tablet server. That’s the process that is your data storage; that’s where your data resides, and it is a sort of container for your data processing. But on top of it, we have the query layer.
And it’s also highly scalable, resilient, etc. Then there is what we call the master process. You have replicas of that master, and the master is aware of all the data distribution in your cluster. Whenever you execute any DDL statements, those requests go through the master. But when you start using a YugabyteDB cluster for inserts, selects, and deletes, your queries will usually go directly to your tablet servers, because those can cache the information about how the data is distributed. There are some caching algorithms involved with regard to the data distribution.
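
The partitioning idea Denis describes can be sketched in a few lines of Python. This is a minimal illustration, not how YugabyteDB or any other database mentioned here actually places rows: the function names `owner_node` and `distribute`, the `user-N` key format, and the choice of SHA-256 modulo the node count are all assumptions made for the example.

```python
import hashlib


def owner_node(key: str, num_nodes: int) -> int:
    """Map a row key to a node index with a stable hash (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap to the node count.
    return int.from_bytes(digest[:8], "big") % num_nodes


def distribute(keys, num_nodes: int) -> list:
    """Count how many keys land on each node."""
    counts = [0] * num_nodes
    for key in keys:
        counts[owner_node(key, num_nodes)] += 1
    return counts


NUM_NODES = 10
NUM_RECORDS = 1_000_000

# Hypothetical keys; any string identifier would do.
counts = distribute((f"user-{i}" for i in range(NUM_RECORDS)), NUM_NODES)

for node, count in enumerate(counts):
    print(f"node {node}: {count} records")
```

With 1 million keys and 10 nodes, each node ends up with roughly 100,000 records, and a query layer that knows the same mapping can send a single-key request straight to the owning node. Real systems typically avoid plain modulo placement, because changing the node count would remap most keys.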

Matt Yonkovit:
You mentioned the similarities with Cassandra, which is funny, because the CTO at Yugabyte was one of the original creators of that project when he was at Facebook, right? Karthik has been on our podcast here as well, and he talked a little bit about the good old days there. So it makes a lot of sense that there would be some similarities. Now I’m curious: is distributed SQL for every application? Is it overkill, in your opinion, for some applications? Or is it something that you could start with and then ease into that larger-scale setup? Distributed SQL tends to be really good at mass scale, but a lot of people might start with just standard Postgres and then do that migration. So I’m curious about your take on that.

Denis Magda:
There is a recommendation from Google here. They have Google Spanner, which is a distributed SQL database, and they also have Cloud SQL offerings, where they run managed services for MySQL, etc. Their recommendation is to use Google Spanner when you need scale, and they usually judge by the amount of data you need to store. Let’s say you need global resiliency, or you need to comply with data residency requirements so that your database keeps European citizens’ data in Europe and never writes this data to America: then you use Google Spanner. Or if you need to keep petabytes of data. In other cases, explore the Cloud SQL offerings. My recommendation, when it comes to choosing between vanilla PostgreSQL and YugabyteDB or any other Postgres-compliant database: just do your homework. Do you really need this scale right now? But you also have to think about the future. What happens with your application, your team, your department a year from now, two years from now? For instance, if you expect that your application has to work at a global scale, that you will have customers in Asia, in Australia, in Europe, then it’s inevitable that at some point you would need multiple Postgres instances, or you would need to arrange different Postgres sharding techniques on your own. At this point, yeah, you probably need to start with a distributed database, whose deployment can start small. So generally, my first recommendation remains Postgres, if all the data fits on a single server machine and you do not expect that you would have a much bigger load or need much bigger capacity in the next, let’s say, five-plus years.
But in other cases, even if you think that right now everything fits nicely in Postgres, but in two, three, or five years you will be running across the globe or will have a much bigger load, then probably start with a distributed SQL database. Or with something like Amazon Aurora, which is basically another way of transitioning from standard relational databases. It has a single writer: all the writes go to one node, so you cannot scale beyond the capacity of that node. But at least you have read replicas, and those read replicas help you remain resilient, and they can be deployed in different regions closer to your customers. So that’s my thinking.

Matt Yonkovit:
Well, for those who are listening: Denis, you alluded to something, but we didn’t talk about it. You said, oh yeah, I’ll be discussing this during my talk. So yes, Denis is going to be at Percona Live, and he’s going to be talking about the SQL compatibility between different databases in the distributed SQL space. It’s interesting. I don’t consider Aurora distributed SQL. I mean, it’s just me, but I think Yugabyte, Cockroach, and Spanner are a different architecture than Aurora is, personally. But you mentioned the word compatible, and this is what’s going to be kind of interesting: the compatibility of each of the different databases that are out there. It varies, right? Some might have 60% compatibility, some might have 99% compatibility, but it’s going to vary depending on the implementation. And there are some things that may or may not work. I think that’s interesting, and it should be a really good talk. I’m looking forward to listening in on that as you go through that ecosystem and talk about the nuances. Sometimes it doesn’t matter, because most of the normal features are used and nobody uses these real edge cases. Sometimes it does.

Denis Magda:
Yeah, speaking about Amazon Aurora, we are on the same page here. Amazon Aurora is a scalable solution for Postgres, but it’s not a distributed database, at least in my thinking. That’s just my personal opinion. Forgive me, Amazon folks, if you disagree. It’s a great solution if you need to scale your reads and you want to tolerate different region-level outages. That’s what Amazon Aurora is designed for. But it’s not designed for the case when you need to outgrow your single-server capacity. For instance, let’s say your master node that accepts writes holds 30 terabytes of data, but you need 50 terabytes. If Amazon is not able to provide a machine that big, you don’t have a choice. That’s the thing. Talking about compatibility, yeah, but first a kind reminder: the only database that is 100% compatible with Postgres is Postgres.

Matt Yonkovit:
Yes, yes. Yes.

Denis Magda:
There are other vendors and other databases, and some of them are trying harder to achieve a higher compatibility level. When I talk about Postgres compatibility, I’ll be using language like highly compatible or less compatible, etc. But remember, the only 100% compatible database is Postgres itself. And you’re right, you cannot just select one criterion and judge all the databases by that criterion. It’s unfair, and we don’t want to diminish other companies. I work for Yugabyte, right, but I want to be authentic. I don’t want to disguise things, and I don’t want to diminish competing technologies. We will be using four different criteria. The first one is wire compatibility: can I connect to this Postgres-compliant database using pgAdmin or any other tool and execute commands that I have been using for ages? For instance, I want to connect to, let’s say, Google Spanner, and I want to see the structure of my database: the schema, the tables, indexes, etc. That’s wire compatibility, just the ability to connect and use the Postgres networking protocol. It’s all about the ability to deserialize messages and network packets that are serialized according to the Postgres networking protocol. The next one is syntax compatibility. I have an application, and this application is designed to work with Postgres. Can I really use a particular syntax after connecting to your Postgres-compliant database, which is wire compliant? Will I be able to use the same Postgres syntax, or do I have to switch to, let’s say, another SQL dialect of your database? That also matters a lot.
If you try hard to support at least the Postgres SQL syntax, then many of your application developers can be successful with the lift-and-shift exercise: I have an application, and I want to move this application over. The third one is feature compatibility. You have wire compatibility, you have syntax compatibility, but what about features? One of the reasons why Postgres is gaining so much popularity these years is that it’s one of the most feature-rich open-source relational databases. It certainly has the basic stuff like stored procedures, materialized views, triggers, etc. But also, and I cannot speak for other databases, it supports features that usually do not exist in a relational database, like JSON, full-text search, time series, and document-based queries. That’s what Postgres provides. And you have, let’s say, an army of application developers who use those features right now, and they want to move. Then your Postgres-compliant database will be judged by this criterion: how feature-compatible are you?
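
Wire compatibility, as described above, comes down to producing and parsing the same bytes Postgres does. As a rough sketch, the first message a client sends in version 3.0 of the Postgres frontend/backend protocol is a StartupMessage: a 32-bit length that counts itself, the 32-bit protocol version 196608, then null-terminated name/value pairs and a final zero byte. The helper names `build_startup_message` and `parse_startup_message` are hypothetical, and the sketch ignores authentication, SSL negotiation, and every later message.

```python
import struct

PROTOCOL_VERSION_3_0 = 196608  # major version 3 in the high 16 bits, minor 0


def build_startup_message(user: str, database: str) -> bytes:
    """Serialize a protocol 3.0 StartupMessage (illustrative helper)."""
    params = b""
    for name, value in (("user", user), ("database", database)):
        params += name.encode("utf-8") + b"\x00" + value.encode("utf-8") + b"\x00"
    params += b"\x00"  # terminator after the last name/value pair
    body = struct.pack("!i", PROTOCOL_VERSION_3_0) + params
    # The length field counts its own four bytes.
    return struct.pack("!i", len(body) + 4) + body


def parse_startup_message(data: bytes) -> dict:
    """Deserialize a StartupMessage, as a wire-compatible server would have to."""
    length, version = struct.unpack("!ii", data[:8])
    if length != len(data):
        raise ValueError("length field does not match message size")
    if version != PROTOCOL_VERSION_3_0:
        raise ValueError("unsupported protocol version")
    # Drop the final terminator and the last value's trailing null, then split.
    fields = data[8:-2].split(b"\x00")
    names, values = fields[0::2], fields[1::2]
    return {n.decode("utf-8"): v.decode("utf-8") for n, v in zip(names, values)}


message = build_startup_message("alice", "app")
print(message.hex())
print(parse_startup_message(message))
```

Any database that claims wire compatibility has to accept messages like this one from unmodified Postgres drivers and tools, which is why existing clients can connect to it without changes.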

Matt Yonkovit:
Are you considering compatibility with extensions part of feature compatibility, or are extensions separate from that? Because sometimes features are considered synonymous with extensions, and sometimes they’re not. Postgres has a really rich extension ecosystem, but extensions are not part of core Postgres, right? So it’s a little fuzzy.

Denis Magda:
Yeah, that’s a fair question. Extensions also have to be included. Even when I’m talking about Yugabyte, right: we are not 100% compatible with Postgres. Obviously we’re not, but we’re using the Postgres source code as much as we can. We’re using the Postgres query layer, but our storage layer is different. We use different storage technologies. That’s why, when you’re talking about extensions that were created for the query layer and not for the storage layer, it is highly likely that they will work without any issues in Yugabyte. But if you created an extension for the Postgres storage layer, then it’s highly likely that it won’t fly on YugabyteDB, because our storage architecture is different. And finally, the fourth one is runtime compatibility: does your application actually behave similarly to how it behaves on Postgres? How do your queries execute? What’s the write path? What’s the planner, what’s the optimizer, what’s the executor? Generally, that’s probably the hardest test. If you have a high level of compatibility for the syntax and features, the lift-and-shift exercise should be smooth: you take your application, you just change the connection endpoint, you connect it to another database, and you can use all the same drivers. That’s one of the biggest benefits of being compatible. If you have wire, syntax, and feature compatibility, use the existing drivers and you are good to go. You don’t need to create your own. But the runtime is how you’re going to be tested in production: whether some of the queries start to fail, whether some of the indexes don’t work as expected, etc.

Matt Yonkovit:
It’s interesting, because Postgres is becoming the base ingredient for so many new databases, right? And people are approaching this slightly differently. Yugabyte, for instance: you took the client side and kept it, and then replaced the back end so you could do the distributed part. I just heard at Postgres Silicon Valley last week, EdgeDB was talking about their stuff, where they’re basically putting a GraphQL interface over the top of Postgres and doing some things there. I know FerretDB is building a MongoDB compatibility layer on top of the storage for Postgres. So you’ve got lots of different people starting with Postgres and then extending Postgres to do things that are really cool and innovative. Postgres seems to be that seed component that people are starting to build around, which is really cool. Especially if you already know Postgres, it makes it easier to get involved and jump into these different technologies. It is such a solid, awesome core that it makes it easier for people to develop those cool things.

Denis Magda:
You know, I asked myself a similar question when I decided to join Yugabyte and start working actively with the Postgres community and Postgres developers, because I wanted to figure that out. I was watching one of the presentations by Yugabyte folks, and they were showing Postgres’s accelerated rise and growth. If you go to the DB-Engines website, where we usually go just to see what the most popular databases are, Postgres is in the top five list, but its growth is probably one of the fastest. And I was just thinking, why did this happen? Then I checked the other, let’s say, top four databases. You have Oracle, you have MySQL, you have SQL Server by Microsoft, etc. And what struck me is that Postgres is the Linux of relational databases. Linux is one of the operating systems that is governed by a true open source community. The community is not under the control of any specific vendor. You have the Linux core, right? And out of that Linux core, you have different Linux distributions: Ubuntu, CentOS, Red Hat, etc. But the core doesn’t belong to anybody. Even if, say, Red Hat or any other company wants to introduce something into the Linux core, it will be dealing directly with the community, and the community decides. I think the same happened with Postgres. We know that, unfortunately, even though MySQL is an open-source database, it’s governed by Oracle. We have Oracle Database, IBM DB2, etc. All those databases belong to someone, but Postgres is an open-source project. I also attended the Postgres Silicon Valley conference. I think we missed each other.
I was in one of the sessions where the speaker was talking about situations where some companies, big companies or small companies, want to introduce some features into Postgres, but the community usually says: you need to come and talk to us using our protocols, you need to use our channels. Some people are just not happy that the project is still using, let’s say, dinosaur-style mailing lists. But that’s the way they interact, and it’s quite similar to the Apache Software Foundation. I heard that a lot: hey guys, why don’t you use, let’s say, Slack? Why do we use this old-fashioned mailing list? That’s how things work, because our contributors sit around the world, and they cannot answer everything instantaneously. Another thing: say someone introduced a feature in Postgres or Apache Ignite a year ago, and I want to go back to the discussion and see how a conversation in the community led to this decision. That’s extremely easy to do with mailing lists. And that helps to balance the speed of innovation and the quality of the product. It also helps push back on vendors who are too aggressive, who want to come and take control of the community. Luckily, from what I see, that did not happen with Linux, and as far as I understand, it didn’t happen to Postgres. That’s why with Postgres I think we see this proliferating growth. We have so many companies, big companies and enterprise vendors, introducing new products and new solutions that use Postgres as a core. You have Amazon Aurora. Google Spanner, in October last year, in 2021, announced support for the Postgres dialect. And that’s a big signal. Even Google recognizes this.
So that’s my thinking. I think that Postgres became the Linux of relational databases thanks to its truly open source community, its policies, its governance, etc.

Matt Yonkovit:
No, that’s an interesting take. And with that, we’re running out of time here. So I wanted to thank you for coming on today, chatting with us, even a little preview of your talk, talking about your background, and giving us some information on Yugabyte, and where it fits into the ecosystem. It’s been great chatting with you today.

Denis Magda:
Thanks, Matt. It was a pleasant conversation.

Matt Yonkovit:
And for those listening, we would love it if you come out to Percona Live and see us out there. Denis will be there, and he’ll be giving a talk. It is May 16 through 18, and you can see it there. We’ll also have some sessions online, and some things will show up after the conference as well. If you do like this kind of content, please make sure you subscribe to the YouTube channel and subscribe in your favorite podcast app. Put comments in the comment section of your favorite app and let us know what we can do, who we can bring on, who we should talk to, and what topics we should cover. We’re always interested in what you have to say. Until next time, this is Matt. We’ll see you then.

Denis Magda:
Thanks. Bye-bye. And don’t forget to subscribe.

