PostgreSQL Database Management, Backup Tools, Migration From Oracle - Percona Podcast 31

by Jobin Augustine, Sergey Kuzmichev, Matt Yonkovit

Link to listen and subscribe: PodBean

Jobin Augustine (PostgreSQL Escalation Specialist) and Sergey Kuzmichev (Senior Support Engineer) both from Percona stop by the HOSS talks FOSS to talk with Matt Yonkovit about open source, PostgreSQL, Migrations, and more. In this episode we cover a large number of topics from the challenges they are facing in their day to day jobs, recommendations and best practices for managing your databases, the differences in backup tools, challenges when migrating from other databases like Oracle, and more! They also give us an overview of their talk during Percona Live 2021. Finally we get a sneak preview of Sergey’s and his fellow colleague Vinicius’s upcoming book “Learning MySQL” that will be published in Autumn.



Jobin Augustine

PostgreSQL Escalation Specialist, Percona

Jobin Augustine is a PostgreSQL expert and Open Source advocate and has more than 19 years of working experience as consultant, architect, administrator, writer, and trainer in PostgreSQL, Oracle and other database technologies. He has always been an active participant in the Open Source communities and his main focus area is database performance and optimization. He is a contributor to various Open Source Projects and is an active blogger and loves to code in C++ and Python. Jobin holds a Masters in Computer Applications and joined Percona in 2018 as a Senior Support Engineer. Prior to joining Percona, he worked at OpenSCG for 2 years as Architect and was part of the BigSQL core team, a complete PostgreSQL distribution offering. Previous to his work at OpenSCG, Jobin worked at Dell as Database Senior Advisor for 10 years and 5 years with TCS/CMC.

See all talks by Jobin Augustine »

Sergey Kuzmichev

Senior Support Engineer, Percona

Sergey is a support engineer in Percona. Interested in all things databases, he’s currently working mainly with MySQL and PostgreSQL. He started his career working as an Oracle DBA, later moving to a DevOps engineer role supporting a Java-based trading platform running on PostgreSQL. After being a jack of all trades for a while, he’s now focusing on what he enjoys most: open source databases, systems performance, and reliability.

See all talks by Sergey Kuzmichev »

Matt Yonkovit

The HOSS, Percona

Matt is currently working as the Head of Open Source Strategy (HOSS) for Percona, a leader in open source database software and services. He has over 15 years of experience in the open source industry including over 10 years of executive-level experience leading open source teams. Matt’s experience merges the technical and business aspects of the open source database experience with both a passion for hands on development and management and the leadership of building strong teams. During his time he has created or managed business units responsible for service delivery ( consulting, support, and managed services ), customer success, product management, marketing, and operations. He currently leads efforts around Percona’s OSPO, community, and developer relations efforts. He hosts the HOSS talks FOSS podcast, writes regularly, and shares his MySQL and PostgreSQL knowledge as often as possible.

See all talks by Matt Yonkovit »


Matt Yonkovit: Hi, everybody, welcome to another HOSS Talks FOSS. I’m Matt Yonkovit, Percona’s HOSS here to talk FOSS with Sergei and Jobin, two of our key support engineers here at Percona, who have been working on all kinds of awesome things in the Postres space. And we wanted to catch up with them and talk to them a little bit about some of the work that they’ve been doing and some of the recent conversations they’ve had and presentations they’ve done. Hi, Sergei, hi, Jobin. How are you guys doing?

Jobin Augustine: Hi, we’re good. How are you?

Sergey Kuzmichev: Good. Good. Yeah. Okay, I’m doing quite good. Thank you for inviting. Thank you.

Matt Yonkovit: So you two have been at Percona for a while now. And you’ve been doing quite a bit around the Postgres space. And so I wanted to ask you just off the bat and this is a conversation so take this where you want it. What kind of interesting things are you starting to see from the people you’re working with? From a Postgres perspective? Are there interesting trends you’re starting to see and topics that are popping up?

Jobin Augustine: Yeah, the majority of the people currently are migrating from proprietary databases like Oracle. So they are getting trained in terms of other open source and open source culture and how the open source works. So many times that the conflict of culture is more, the Oracle provides everything completely packed. But when it comes to open source, we assemble pieces.

Matt Yonkovit: So it so people, in other words, so people are expecting kind of this really polished UI and everything just to work together. And they might need three or four different projects that work together. Yeah, that’s where they’re running into issues.

Jobin Augustine: Yeah, even when we say for high availability, we recommend this particular framework, then say they are coming from a world where they have only one choice. But we have a number of choices for everything. Say backup, at least half a dozen backup tools are there for Postgres. Yeah. So then we choose the best and give the okay this works should be working. Perfect for you.

Matt Yonkovit: Then, I mean, choices. choices are good. But they’re also paralysing when you’re not used to that. Right. So it’s very difficult to make that choice. Now, when we talk about whether it’s high availability or backups, typically, we have a recommendation and that recommendation, what is that?

Sergey Kuzmichev: Yeah, well, we recommend we extensively use Patroni for the high availability solution. And for the backups, our preference is pgBackRest. Although we actually do support a few other tools as well. I actually wanted to pick up on what Jobin said previously, because I actually switched from Oracle to PostgreSQL in 2015. And I was in the same position as a lot of people are now and so it is actually beneficial for me, the position I have right now, to have walked the walk like five years ago. So switching from Oracle like you have HARMAN, right. But then on PostgreSQL side, you have Barman, which sounds similar, but it’s not HARMAN actually, right. Then you have pgBackRest and quite a lot of different projects. Then there are forks of projects, and it’s an excellent project, but the and everything is, yeah, it’s a bit of open source in the good and the bad. That details like the documentation might be lacking sometimes. And you have to go and try the waters basically quite a lot. Yeah, but so far in the last five years and six years, the infrastructure and everything every project matures. And like right now, I think it is easier than ever to do this switch.

Matt Yonkovit: So you mentioned pgBackRest, right. So from a backup tool perspective. Now, you also mentioned we support other backup tools. Now is that because you know customers just want to use a specific backup or other specific use cases that one tool is better than others.

Jobin Augustine: So the pgBackRest has been around for some time, it’s matured, and it is widely used. So and that has some features which are not there in other tools, but the other backup solutions are addressing certain specific areas. Say for example, the cloud integration was мery good in. But the pgBackRest was targeting the real enterprise users who are actually like customers. Yeah. So that’s to enter a different approach. But now what we are seeing is all the tools are coming to a common domain of expertise. Say the PG backrest is gaining cloud integration.Step by step. Now they have Azure and Google Cloud support.

Matt Yonkovit: So what you’re seeing is, each of these different backup tools have different feature sets. And but slowly as they mature, the feature sets are getting integrated between the tools. Еhere becomes less and less differentiation. Now Sergeн, coming from the Oracle space, let me jump back to that, because this is an interesting perspective that we don’t normally get here because a lot of folks who I talked to have been in the open source space for 10-20 years. Now, I come from an Oracle background way, way in the back. So I mean, I remember those days, I have I think PTSD and I shellshock from when I had to handle those, the those cases in those issues, but as, as a former Oracle DBA, and as you’re talking to these folks, besides the issues that just just understanding, hey, this open source is different than a single vendor controlled project. Are there some technical things that continue to kind of worry or are hard to grasp for Oracle DBA is kind of coming over to this space?

Sergey Kuzmichev: Yeah, absolutely. For PostgreSQL, especially like the way of getting real technical, the way Oracle and PostgreSQL work with memory and the disk and everything like that they are deceivingly similar, like, you have the wall files, it’s like the redo right, then you can archive them. And it’s all pretty similar, you have a buffer pool, but now the checkpoints are not really similar, right? The IO profile that database generates is completely dissimilar. And you, you just cannot give and give all the memory to PostgreSQL whereas you could probably pull that off with Oracle, and a lot of people do indeed. So yeah, just from the, like, sending queries perspective, those are quite similar. But then when you start to kind of a prime applying real pressure to PostgreSQL, things start to crumble a lot. And conceptually, they are different. Like the MVCC model is completely different in a vacuum that is jarring for a lot of newcomers and PostgreSQL. Like it’s not bad, it’s not good. It’s just different, right? It’s completely different from anything that anyone expects didn’t work with PostreSQL before. So yeah, coming from Oracle, it’s a bit of… As I experienced it, it looks similar. So actually switched to MySQL later. And I mainly work with MySQL right now. So MySQL is quite different. It’s different from Oracle. It’s different from PostgreSQL, PostgreSQL seems to be similar to Oracle, and in a lot of places logically, it was similar. But now the right a lot of details, like a lot of fine grain than you have. But you have to understand, that you have to fully grasp. So yeah, it can be a pretty steep hill to climb.

Matt Yonkovit: Yeah, and I think that one of the things that I’ve seen over the last few years is the number of people who are looking to do that migration job. And you said, this is probably the biggest area that we’re seeing is the growth of migrations. There, they’re choosing Postgres because it’s similar. And does that lead to people then making this assumption that like, I’ve lifted I’ve shifted, maybe I’ve rewritten some code, but it should just work like the same and all of a sudden you go from, hey, I was able to handle 15,000 transactions a second now, I can only handle 5000 because I didn’t tune it right or I didn’t properly do certain things that I mean, is that what we’re seeing?

Jobin Augustine: Yeah. Yeah, first of all, the first part of the question, what is driving the migration of course, overlap as Postgres is claiming to be the most feature rich. So we get more overlap with Oracle in terms of stored procedures and functions. Say the PL/pgSQL, which we have in Postgres is somewhat similar to Oracle’s PL SQL code. So by the end, we have very rich data types. So all those things are driving the migration. The selection comes automatically do it Postres. But the other surprises like something which was working in Oracle is working poor in Postgres. That’s a very common problem. And that happens when we, when we move from Oracle to Postgres, or, if we move from Postgres to Oracle, because these two things are entirely different, when we have an application, which is fine tuned for Oracle, and if you just take that code and put it into Postgres, that makes it on some other edge cases, like Sergei mentioned, the MVCC is one major difference. So in Oracle, we have this undo area, where they move the old images to an undo space, and that’s how they manage their MVCC. But in Postgres, we leave that in the same place, the old versions are left in the same place where the data is stored. So the consequences, the inserts are fast, because insert don’t have to do anything, it creates another row and put it there. But the the demerit of this approach is, the updates are costly. The update says nothing but insert and delete. So the things which work in Oracle world may not be that efficient in Postgres, but there are areas where Postgres exceeds, especially this insert, kind of load. And we are seeing lot of applications that have complex architectures. And say, when the right very complex queries, they tuned that query for or it will, say, views and quarries on the top of us. And over a period of time, they might have recorded multiple times and optimise that query for Oracle some way it is working there, okay. And when we take that, that complex piece of code and put it into Postgres surprise comes.

Matt Yonkovit: I gotcha. So, so Okay, so it is different, and you have to tune it and make it make sure that it’s that it’s optimised and one of those things is making sure that you’re using the right indexes, because obviously, the indexes between what’s in Oracle and what is in Postgres are going to be different. And this gets back to your Percona Live session that you just gave which is talking specifically about the power use of indexes, and Postgres and kind of giving that user perspective, because there are many different indexes available in Postgres right now. And each one of them has a very unique operation and a unique thing that they’re trying to solve. So maybe if you can give us a quick overview of that talk, like what, what did you talk specifically about, and people can find the talk, so they can watch the whole thing online, but give us that quick snippet of what people can expect if they go watch that. Okay, so

Sergey Kuzmichev: As we gave the talk, I started talking, I’m travelling picked up, so I guess we will follow suit here. And the talk, we mainly targeted an audience who is not really experienced with PostgreSQL, we gave an overview of indexes, trying to give specific examples where you would like to use them, for instance, while preparing them while working on bass, and while discussing with children, the real highlight of the talk for me was the cash index, which was for a long time, like an ugly duckling in the PostgreSQL world, because it was not well logged, it was not replicated. It was not crash safe. Like it was, it was okay. You could use it, but the documentation told you not to, like from PostgreSQL 10 onwards, it works perfectly fine. It’s fine. So yeah even if you know indexes, if even if you think you know indexes, it looks the same as PostgreSQL sometimes Sometimes it’s worth revisiting basic stuff.

Matt Yonkovit: So, but let me ask you there, right, right there. So okay, I come from the MySQL space. And there’s a lot of people who aren’t familiar with the Postgres index types. You mentioned the cash index, you mentioned, it’s not crash safe. You mentioned that you can’t replicate. Right? So there’s a few things that like, right away, you’re like, Oh, I never want to use that. But he said, Oh, but tonight, it works. So you have to explain now, the cache hit next and what it does and what the benefits are, because you know, now I’m left hanging in, I will go you know, listen to the session, other people will but let’s give them…

Sergey Kuzmichev: Absolutely so the hash indexes applying the hash function to an entry. It’s really simple in what it does, it actually allows you only a single operation and the quality operation like atomically indexes for that if something equals something, right. And the length, the benefit here is that while it is limited in the use case, it actually has a peculiar property that the hash function always produces a uniform length of a key. Like, if you give it half a table of warranties, it will give you 122 characters,I don’t know exactly how many executives but still, like I mentioned that you are indexing something where you have URLs, maybe people’s names are not a good idea. But URLs, right URLs can be lengthy URL below for STEM database of blog posts, right probe, you probably are not going to look for URLs with larger than operation, right. So in the HTTP is not larger or smaller than Right? But you want to look for a specific record, the hash index will be compact, that will be fast. It has a pretty low overhead of the inserts as well compared to B-tree. And yeah, so pretty interesting, a good addition to the portfolio of indexes that you can use again from the 10 onwards. But given PostgreSQL 10 is, what, four years now, probably everyone should check it out.

Matt Yonkovit: Okay, okay. Sorry, I needed to stop you. Because you mentioned it and wanted to get everybody here to listen. Okay, so so you’re talking about hash index, I thought you said cash, but it’s hash. So my, my bad speakers here. So from a hash perspective that that’s, that’s useful information. But you know, there’s also you’re just indexes and your agenda indexes, and B-tree, which everybody knows B-tree, it’s kind of the standard in most databases. So you’re going to go through, and you’re talking about these different indexes in how they’re used. And so this is an important topic, but people just think I add an index in them done. Is that the case? I mean, like, like, Is there any more thought process, then? I have this in my WHERE clause? And I’m going to go ahead and index.

Sergey Kuzmichev: Yeah. Right. Yeah, I’ll jump in. The default index, as you mentioned, is the B-tree. And, like, the more data you add to it, you cannot expect it to perform as well as if performed before. And especially if your data profile, let’s put like the data distribution changes, for no reason, or for maybe unexpected reasons, you suddenly may have a really suboptimal index, and the B-tree index has a couple of tricks not sleeve, it’s not really specific to be three, but you can build partial indexes covering probably some keys, right. But you definitely, even if you’re just indexing and expecting everything to work, and it works like with an MVP, an early prototype, when you get real data, you might want to consider different indexes. And so you need to keep an eye on that, and go are mentioned a few other indexes that are like could be useful.

Jobin Augustine: Yeah, yeah. So normally, people just create an index and say I have an index. But they’re not really worried about whether that’s the right choice they made. Like Sergei mentioned, the hash index that really works wonderfully in many cases, actually showed a demonstration of how it can be optimised so because that can have workflows The index itself, so only certain criteria will be used for creating an index. So the index size will become small. And so especially the status codes and things like that are really great. Yeah, and coming to the specialised indexes, so the database general purpose, right and like Postgres database, so we need indexes, like a text indexing. This is altogether a different area of specialisation. So the general between decks may not work properly. So yeah, the Postgres gives this kind of index, that’s a totally different approach to the indexing, and the keywords are extracted and they create that index. Yeah. That’s it.

Matt Yonkovit: So that’s like full text searching to search a block of text for a keyword or two.

Jobin Augustine: Yeah, the beauty of the worldwide communities anywhere in the world. If there is research going on about some better indexing, the first choice for implementation could be on Postgres because it’s open and free. And somebody wants to try that on the poster. So we have a lot of these kinds of index implementations. So the bloom index is something really interesting because, yeah.

Matt Yonkovit: And, yeah, so what tell us what the bloom index is.

Jobin Augustine: Yeah, so, like Sergey mentioned that the hash index, it prepares one hash, but the problem with the one hash is, it can be used only for one column, the one column needs to be hashed to one value. But what about the multi column index, multiple columns need to be indexed, need to be hashed? So the bloom provides that feature. So it is similar to signing on the top of another signature. So we will have a very complex signature. So once we have that complex in signature, very easy to map with the actual data, and the beauty of this bloom index is that the order of the column is not important. Generally, we have the problem with a B-tree index. B-tree index, the first column and second column enter, like that the order is important right in in the index, because that’s how the B-trees are prepared. But this bloom index, because it is a hash function, the order of the column is not important. So these are the specialised indexes, solves specific purposes. Yeah.

Matt Yonkovit: Okay, okay, that makes sense. Now, let’s take a step back. And let’s speak to those who aren’t necessarily DBAs in the space here. So I have just started a new application. I’ve chosen Postgres as my database of choice. And so I built this application, how do I go about what we should do to find what needs to be indexed? And what indexes should be used? Like? How do I start there? What are the tools? And walk me through some of those decisions, or some of the things you might look for, and recommend people do in order to find which columns need to be indexed? And which is the appropriate index?

Sergey Kuzmichev: Yep. Well, ideally, if you’re only writing to the database, more likely than not that he might not need any indices at all right? So if you don’t actually depend on the retrieval of the data, what’s possible SQL, you don’t need a clustered index, right, like MySQL was probably need. So you just have heaps, like heaps of tables, and you can insert an insert and insert forever, and you don’t actually need the indexes. So you shouldn’t start with your tables and columns. I believe that you should start this query, right. So you have specific access patterns, you have specific queries that hit the database, like maybe if your application is doing some kind of a login, and it needs to get a page with your customer profile, which gets by an ID from a user table, right, then probably that user table needs an index on the ID column, whatever the index type may be. Proudly approach learning which columns need to be indexes. Well, again, I firmly believe that you should start with queries, PostgreSQL provides built in facilities for looking at slow queries range on which you can set up logging duration statement, correct me if I’m wrong, too. I don’t know maybe to zero like to complete query audit, or you may use pg_stat activity, right. Or you can set up pg_stat statements like a really popular extension that just does all the analytics for you. And it will break down the queries and give you the max duration, maximum duration average duration, right? I, if you are brave, and if you want to try new things, you can try pg_stat_monitor from Percona. Right, that adds a bit of more new features to the idea of pg_stat statements. Or you can actually go ahead and try PMM which builds on top of the data from pg_stat statements or pg_stat_monitor to give you more of a polished, unique UX experience on to your queries in your PostgreSQL database like the other talk I prepared for Percona Live with other support engineer was about usage of PMM for PostgreSQL, so you might want to check it out. But from my point of view, you should probably start this..

Jobin Augustine: Yep. Yeah. My approach also will be similar, say, a little more accuracy in that nature, avoid indexes wherever possible. Because what drives that is, people first look into the scheme and start designing indexes for that, but without realising that the index has its own cost. So as we know, he actually slows down the transaction processing, because indexes also need to be updated the inserts, updates, deletes. And the problem won’t stop there in Postgres, the indexes older also need to be maintained by auto vacuum and the cleanup needs to be happening. And the indexes also need to be there in the cache in memory, for better performance. So it occupies memory and it is associated disk i o will be there. So there is a cost associated with indexes. So try to live without an index. if it is necessary. Yeah.

Matt Yonkovit: So are you suggesting? I think there’s an interesting discussion here, because I can see, like the process of when you’re doing your application, you’re developing it, there’s not a lot of data. So yeah, the performance isn’t necessarily going to be needed in a test environment. So I can see going zero index there. So would the process collect your query log, you mentioned, Sergei doing a query audit, and setting it to zero, like the threshold, so collecting everything, and then analyse that and then see what indexes you need based on that? Or are you saying like, move to production with? And let’s just be the wild cowboys that we are, because we love to test in production? And what scale?

Jobin Augustine: Also, when we collect the customer information, we have an analyse script, and we see that normally we’ll see hundreds of indexes, which is never used, because the analysis report comes with the zero usage. So those are the areas where we can go the other way, we can remove indexes, which are not used in real production, and save space and memory and things like that, at the same time that. Yeah, at the same time, we have this problem as well. So the creation of indexes, kind of continuous tuning. So that the things which used to work earlier may not work later. So the indexing requires a continuous monitoring and the schema needs to be optimised. Maybe, yeah, even symbol B-tree index may later need to be converted into partial index.

Matt Yonkovit: Okay, so, so start with very few, if any, do some tuning to figure out what you need, but realise that this is an iterative process, and what you needed yesterday won’t be today. Because we all know that we totally love to go back in and remove things that we added before. And we just don’t leave them sitting there. I’m being sarcastic. Of course, we do. Like what technology out there doesn’t have some weird code that you’re like, I have no idea why that’s in there. And I think that applies here as well. So as you implement new features, as you implement new indexes, as you go down these rabbit holes to tune, every release cycle of a new application changes the access patterns, which means that what indexes are being used or not being used could change at every cycle. And with the continuous integration and continuous deployment that happens now an index you used yesterday might not be used tomorrow, but it might be used three days from now. And that presents kind of a weird challenge, right? Because you know, you have these evolving indexes. So, I don’t know, maybe we need to figure out the automation to build and remove indexes on the fly is needed or something you know, I don’t know. But it is an interesting space for sure. So, Sergey, you’re in favour of index?

Sergey Kuzmichev: Yeah, we should follow an approach of minimal. So just do as little as possible, but not less than a minimum, messing up with your performance. But yeah, I see, I definitely see i’d saw in practice and I created redundant indexes, right. You can never anticipate, like, unless you’re dealing with a pure IoT, I believe stuff, then you will probably not anticipate the amount of data that will grow and how exactly it will ground? Well, where it will. Great. Right. So the database they involve and trying to optimise for that before you practice, optimise for a database that is size and terabytes, when you have 10 gigabytes of data, it’s probably not something that should be done, like the way it is a bit premature, and definitely the issues will arise

Matt Yonkovit: Yeah, but isn’t there a trade off here? I’m thinking from the user perspective, or that those who are, unfortunately, tasked with managing these large systems in production. The question is, as you start to see these applications grow, adding indexes takes time, and it actually makes performance problems worse. So there is some trade off here, right, where it’s like, do I index early, hoping to get ahead of those problems? Or do I wait until I need it? Knowing that when I need it, it might cause a bigger issue than what’s happening in production right now?

Sergey Kuzmichev: Yeah, absolutely. I mean, there’s a fine line. I don’t think there is a silver bullet approach here. Yeah, absolutely. I don’t think that he should go live without any. Right. I don’t hate him. But let’s like, let’s take the other extreme, like you take every column every day, every table in your database, and you index every column and every combination of columns in both orders, right. So you will have a lot of indexes, but you will probably not have a quick database like, right. Sowhat I think is should be like, see comments on the situations that you described, you think you should probably try to not end up with a situation where you have a burning production that you have to tie in to extinguish was another index, right? So with continuous integration, and this is a continuous delivery, there should be continuous monitoring as well. Right? So you should probably be keeping their finger like, whatever is happening with your database. Right? So are the queries kind of changing, like it’s a source of a living organism, right? This is growing all the time. But uh, walls? Yeah, I mean, issues will happen. Absolutely. Like maybe tomorrow, some, you haven’t gotten unticipated the release and a new access pattern is now in production. And you absolutely have to include an index in place. But again, maybe 10 years ago, when you were designing the schema, you didn’t even know that this access pattern would be there. Right. So this table was half a year ago, and now we need to add an index there. Like it happens. Absolutely. This probably speaks to the need to test with real data. Like one thing that I see a lot is a really small development database. Like we have 1000 entries there. We run a query and maybe do a full table scan. Right, but it’s fine. It’s like 20 milliseconds, that’s on production. That right medium records, does not run 20 milliseconds, right?

Matt Yonkovit: Yeah. Yeah. I mean, you think about it from the like let’s pick Amazon, for instance. It’s like testing on a Sunday morning at 3 AM. You know you know, like that type of workload versus your black friday or your Prime Day that just happened like workload, very different workloads, and probably very different needs and requirements on how much data is being accessed when, right. And so I think it’s a very valid point that testing with a production sized workload is important. Now, are there any tools that people you’ve seen use, because, honestly, testing applications at scale is a challenge in and of itself, especially in the database side. I know that there are some testing tools that you can use to replay query workload but what’s your experience in doing that sort of testing job and why don’t you start?

Jobin Augustine: Yeah, so more than a simulated test environment, what we love to get is the real workload data. So like Sergei mentioned about the LogMeIn duration statement by which we capture actual the curries, which is executing in the system. And if you look at the new versions of Postgres, that that is the area which is getting improved a lot. Now, we have provision for sampling the quarries, different levels, say, a chorus, which is taking more than 100 milliseconds. I want a set of samples for each of those queries, I can configure the parameters and we will get those queries. And when and even Postgres provides ways by which we can look at the explained plan have those things behind the scenes, so, we have real data, real performance data coming out of real production environment, and that gives the real picture as to whether things are getting slow or repeated, we need to look at multiple aspects, there could be a query, which is very fast, but it could be executed multiple times, say millions of times, repeatedly. So an index on that will give a huge benefit. But if there is a very complex query, but if it runs in as part of a mundane report, yeah, we don’t have a problem there from the business perspective, or from the so. Yeah.

Matt Yonkovit:
But how about how you replay that workload? How do you test those queries? Like how do you know, like test that in a test environment?

Jobin Augustine: See, we request the inducer to get the most information like the analyse information, explain analyse information about those things. Yeah. And we have Postgres later tooling, like a PDB reports, which actually parses the post this log file, and propres sample with the bind variables, everything we can just pass it to customer, okay, this is where we see the problem, please prepare this explain, analyse and show us we can help you to improve. Yeah.

Matt Yonkovit:
Sergey, any thoughts on query testing or replaying query workload?

Sergey Kuzmichev: You know, it’s difficult, like, I’m not a big believer in the query play field, I know the rock wider view tools and mice curl, field worker has a query play capabilities writer can record the workload and replay it and give you a nice wrapper. But the problem is that most of those tools as far as I can tell, do not respect the concurrency of the claris. Right? You do not get the full… You cannot replicate how exactly those queries heat your database, right? And how, what was the concurrency the worst case, they’re just executed serially, one after one after one after another? Right. So that is a completely unrealistic picture. What I think is like a good approach could be made it through. Like I believe a lot of testing is being done like QA, automatic testing for applications is a part of pretty much any development process right now. So I do believe that at least is one part of that. There should be, I’m not saying that there should be exactly the query, like the queries from production, like the amount of queries from production, I’m more towards that there should be a database size and data distribution of production, right. So you should probably concentrate on either taking care of production data and masking it. And like creating a maybe a sample of it. If you like to have a gazillion databases, every one of which is really huge, you probably should not be replicating that environment. But if you’d like to have 100 gigabytes of data in production, you could probably replicate it, and mask it and maybe hash everyone’s names, right, of course, does not violate GDPR. But then you test and you run queries, and you see, like with a new release, whether maybe a new query appears in your query audit that was not there before or that is now like executing for two seconds instead of 100 milliseconds, right?

Matt Yonkovit:
Yeah, and this is where it’s interesting. So there’s this nuance, and you mentioned, like the QA testing and stuff. So I see a lot of companies that skimp on scalability testing or stress testing their systems. So you know, they’ll run through like their test suite and it will run through the like a function test where it tests all the functions individually. But it doesn’t actually stress test the system, which means you never see the concurrency you mentioned. But even if you do have a large data set, the likelihood of you running through and just choosing like a single string of functional tests actually doing anything that’s going to move the needle on the database side to find performance issues is very minimal. Now, those that do stress tests typically focus on the application. And here’s a fun fact, I’ve actually seen quite a few companies who will make their test suite, they’ll stop out and they’ll create false database calls. So they’ll make a call to an API that should go to a database, but then they’ll just return the database record directly from that, and they’ll always return the same thing. So they’ll get a consistent result. So their tests won’t fail. But they’ll emulate the database. So you never hit the database.

Sergey Kuzmichev: You know, you are like, leaving aside that pretty huge part of your performance testing like, yeah, that is probably not what you want to do.

Matt Yonkovit:
I think part of the issue is, like from a developer perspective they view databases as commodities in a lot of cases. And so it’s a necessary evil, they would love it, if it was just an API call in and out. And that’s all they need to worry about ever. Everything else is handled behind the scenes, they don’t care what database technology, they don’t care what I did, because it’s fast. My data is out. It’s in outage. And I think that’s kind of the Nirvana state and a lot of architecture developers treat the database like that just a data store. Right. And I think that that causes some of these issues to manifest in production, because you haven’t thought through all those potential issues,

Sergey Kuzmichev: Right. It could be like, if you have, if you have an SLA for an API call, if your application behind that API call, you are going to the database, I probably should include your round trip to the database and the query into your SLA. Right. So if you are stress testing without the database, then now what are you testing anyway?

Matt Yonkovit:
Yeah, it’s an interesting topic. And it gets really deep, because you know, you can peel the layers of this onion, multiple layers deep here,

Sergey Kuzmichev: I don’t think like I don’t think that he should be guarding behind. What I see sometimes is that the database is kind of isolated from the application. Right? So the application is a user of the database, but the database kind of provides a service like, like, you mentioned commodity, right. I don’t think that that is a good approach, like a database. If the database uses a database, if the application is using the database, the database, the schema that application uses, it’s probably an integral part of an application and should be treated as such. Yeah, my couple of cents here.

Matt Yonkovit:
All right. So Sergey, one last thing, you have a book coming up, what’s your book guy?

Sergey Kuzmichev: Yeah. Me and another colleague from support, we are writing the book. And the book, surprisingly, is not about PostgreSQL. Since I was learning MySQL here in Percona. And I had to pick up quite a lot of knowledge from MySQL. I got a chance to write a book that is called Learning MySQL. It’s the second edition of a moulder book. And as I was learning MySQL, I was able to apply that journey into a book form. Yeah, it’s closing to the publishing date, which is sometime late in autumn. But yeah, I teach. It’s a huge project. Writing a book is kind of a bucket list item that I had. I couldn’t skip on that opportunity. And, yeah, it made my COVID year much more exciting that it could be otherwise.

Matt Yonkovit:
I understand that. I understand that. So Jobin, Sergey, thank you for showing up and chatting with me. You know, a job and you’ve got I think four sessions from Percona Live that are online are online now. Sergey, I believe you have to. You know, so you can hear Jovan and Sergey talk about different Postgres topics, PMM monitoring, all kinds of fun things, I would encourage you to check out Percona Live website or our YouTube channel where you can get all these videos for free. You know, take a look. Watch them, learn with us. So thank you guys, for coming onto the show today and appreciate it.

Wow, what a great episode that was! We really appreciate you coming and checking it out. We hope that you love open source as much as we do. If you like this video, go ahead and subscribe to us on the YouTube channel. Follow us on Facebook, Twitter, Instagram and LinkedIn. And of course, tune into next week’s episode. We really appreciate you coming and talking open source with us.

Did you like this post? Why not read more?

✎ Edit this page on GitHub