Finding Poorly Designed Schemas and How to Fix Them - Percona Community MySQL Live Stream & Chat - Sept 30th

Percona Community MySQL Live Stream Sept 30th

Percona Community Live Stream for MySQL will focus on finding poorly designed schemas and how to fix them. This is part of our bi-weekly meetup to develop open-source database skills. Bring your questions and get straightforward answers. Add this event to your Google Calendar.

VIDEO

TRANSCRIPT

Dave Stokes: This is a podcast, or stream cast, we're doing on finding poorly designed schemas, featuring Marcos Albe, who is our resident expert. And I'd like to introduce myself: I'm Dave Stokes, a Technology Evangelist for Percona. And we have the other two technology evangelists for Percona here; this is our first time out in public together. We're doing some cross-training, so that when one of us is on the road, one of the others will hopefully be able to take over. I'd like to introduce Edith and Mario. What we're gonna do is let Marcos go, and the three of us will ask questions. And the first question is about finding poorly designed schemas. Don't we automatically find those anytime we open up someone else's database?

Marcos Albe:
Yeah, you can notice everything goes very slowly. That's how you know the schema is probably to blame, right? So I think our last webinar was precisely on the key aspects of designing good schemas. We mentioned that good primary keys, appropriate data types, and appropriate normalization are key to obtaining good performance, and to actually being able to write decent queries later. So what we're going to go through today are the ways we can locate tables that have either over-indexing, lack of indexing, or too-big data types, and a few other details that are well known to cause performance degradation. I will also try to show you a now-deprecated feature, I was talking with Dave about it: a feature called PROCEDURE ANALYSE that was deprecated and is gone in MySQL 8.0. Basically, it will tell you what the fittest data types are for a given set of data. It was removed, apparently, because it was too obscure and little used. But it can help you, and I will be trying to write a stored procedure to replace it, so perhaps we will present that in some future webinar. I will show you what it does and why it's important to look at this, and then I will try a few manual queries, composed ad hoc, that do the same. So let me share my screen.
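For reference, this is roughly what the deprecated feature looks like on a server that still has it; a minimal sketch, assuming a pre-8.0 server (5.6/5.7) and the employees sample table used in the demo:

```sql
-- PROCEDURE ANALYSE inspects the values actually stored and suggests an "optimal"
-- data type per column. The two arguments cap how many distinct values it tracks
-- and how much memory it may use while doing so.
SELECT *
FROM employees.employees
PROCEDURE ANALYSE(16, 256);
```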

Dave Stokes:
Okay. And folks, I am monitoring the YouTube live feed, so if you have questions, please bring them up there.

Marcos Albe:
All right, can you see my screen? Awesome. Okay, clear is not working here, but nobody cares about that. So I have an instance where I created a couple of databases, employees and test, and I know that some of these tables aren't optimal. In some cases I introduced a few errors on purpose; others were just there by nature: the employees sample schema from MySQL already came with these design defects, let's say. So the first thing I will look at is tables without primary keys. As we said, today in MySQL the main storage engine is InnoDB. The rows are stored at the bottom of the B-tree for the primary key; that's called a clustered index, and your rows are ordered physically by the primary key values. So what you want to make sure of is that the primary key is as small as possible. You also want to make sure it's monotonically increasing in value, which allows the B-tree to grow organically: the data arrives already ordered, it doesn't require reordering of the rows, and it reduces fragmentation. A table without a primary key will have a hidden primary key created by InnoDB, and it will be 36 bytes long, which is a huge key when you compare it to a four-byte integer that can hold up to 4 billion values, which should be enough for almost any table. Those 32 extra bytes also add extra weight to your secondary keys. So this starts to add up, and it starts to make your schema heavier: it will fit fewer rows and fewer index entries in the same amount of memory, and you actually want the opposite, to fit as much as you can in memory, and as much as you can in a single 16-kilobyte page. So there is a well-known query for this from a friend of ours known as the Data Charmer. Basically, I'm going to use the information schema. The information schema is a collection of metadata about your database that is automatically maintained by MySQL. In that schema you can find a table called tables, which contains metadata about all your tables, and another table called columns, which contains the metadata about the columns. We will join those tables on table name and table schema, and ignore all the system tables; we are never going to want to touch those. So we will always be ignoring mysql, information_schema, performance_schema, and sys. And I'm only going to look at base tables; I don't want to look at views, because you cannot alter a view, rather you have to alter the underlying tables. Then I'm going to check whether they have a primary key or a unique key, and if they have neither, the table is going to be returned by my query. And there you go. It's telling me: okay, Marcos, you have these tables that apparently do not have a primary key.
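A sketch of that kind of check, in the spirit of the Data Charmer query Marcos mentions (the exact query used in the demo may differ):

```sql
-- List base tables with neither a PRIMARY KEY nor a UNIQUE key, skipping system schemas.
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
JOIN information_schema.columns c
  ON c.table_schema = t.table_schema
 AND c.table_name   = t.table_name
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN ('mysql', 'information_schema',
                             'performance_schema', 'sys')
GROUP BY t.table_schema, t.table_name
HAVING SUM(c.column_key IN ('PRI', 'UNI')) = 0;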

Dave Stokes: So now you do a SHOW CREATE TABLE.

Marcos Albe:
And indeed, it lacks a primary key. With MySQL 8.0, there is a way to add a hidden primary key. Because when you add a primary key column, you are changing the count of columns in your table, and that might not be desirable if your queries rely on the position of the columns for inserts and such. So adding a column might not always be a direct, easy fix. In this case, I can add the primary key because there is already a good candidate for the key, right? But if I don't have a good candidate, like in this other case — I'm not sure if this will be easy to optimize; this one looks easy because the column is called id, which gives me a hint that it's an identifier and should be a reliably unique value, and I'm gonna double-check that. But the point is, if you don't have a good candidate and you have to add a new column to create the primary key, that might be a problem. In 8.0, though, you can add a hidden, invisible primary key, and that allows you to have both things: the primary key, and no problem with a different column count for your application.
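A hedged sketch of what that looks like in MySQL 8.0 (the table and column names here are hypothetical):

```sql
-- 8.0.23+: add an invisible auto-increment column as the primary key, so the visible
-- column count your application sees does not change.
ALTER TABLE my_table
  ADD COLUMN my_row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT INVISIBLE,
  ADD PRIMARY KEY (my_row_id);

-- 8.0.30+: have the server generate an invisible primary key automatically for new
-- InnoDB tables created without one.
SET GLOBAL sql_generate_invisible_primary_key = ON;
```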

Dave Stokes:
With an invisible primary key, if the system has already grabbed another unique column to be the primary key, does it automatically redo the B-tree for me? Or does that happen behind the scenes?

Marcos Albe:
You got me on that one. I would believe it has to recreate the table because, again, the table is organized on the primary key, right? It's a clustered engine, and thus the rows will always live at the bottom of the B-tree for the primary key. And while it could have used a unique key before, it will have to relocate the rows under the hidden primary key. I believe that will be the case; in all honesty, I haven't checked exactly whether it does that or not, but I believe it does. Thanks. Oh, UUID — yeah, so UUID is a bad idea. I think we spoke about this last time, and I showed some slides. Let me see if I can find those very quickly. Yeah, I will need to share my screen again, I guess. There. Can you see? Fantastic. Let me show you what happens with UUIDs, because it's an interesting question. So first, this is how primary keys are structured: at the bottom of this B-tree you have the full row. Those are the values of the primary key — 15, 18, 20, 30 — and then the employee ID and the employee's name. So at the bottom you have the full row. But the important thing is that this tree is sorted by the values of the primary key, and a UUID is all random. So you will insert one row here, then the other row over there, and you will be jumping from page to page. You have to load the page and insert into the B-tree structure, which might mean rebalancing the tree. In general, you're going to end up touching many more pages. Take a look at this: this is a table whose primary key is an integer. You can see the chart at the bottom, how it's color-coded, and you can see that the table is mostly ordered — there is a little bit of green here, but it's mostly ordered, it was all sequential through the table — and it only takes 700 pages or so to fit all the data. If we use a "good" UUID, a UUID where we flip the bits so that it actually remains monotonically increasing, then it will still be neatly ordered, but it now takes like 400 more pages to fit my data. Why? Because the primary keys are bigger, and that takes up space; and the secondary keys refer to the primary key, so that takes even more space. So I nearly doubled the size of my table — this was a small table, I admit, but still: just the UUID was a mess. But if you use a natural UUID, without ordering, without flipping the bits so that it becomes monotonically increasing, the UUID is purely random, and then you get this: it simply takes thousands of pages, like 4x or 6x more pages, to fit the same data, because you have to order that B-tree by the primary key values, and the values are totally random, and you end up with this. So it is a rather bad idea for performance. And also, it's an illusion that it's going to save you from data consistency issues. It can save you from data consistency issues for inserts, but updates and deletes are almost never done on the primary key value — or, let me say this another way, it's common to do updates and deletes on keys other than the primary key.
So when you are writing to two data centers and you use UUIDs to say "I will guarantee data consistency," you're not guaranteeing anything, because if I do an update on employee name on one data center, and on the other data center I do an update on employee department, I will have records affected on both sides, and then the updates are going to conflict. So it's totally shenanigans that UUIDs are going to save you from inconsistencies in a multi-master setup. Don't do multi-master setups unless you use truly synchronous replication; otherwise they're super expensive and are not going to help you with performance. Hope that answers your question. Okay, back to this again: I'm not really sure, but my hunch is that yes, it will rebuild the table and put the rows under the new hidden primary key. I hope that makes sense. So in this case, what you would do is SELECT COUNT(*), COUNT(DISTINCT id) FROM the table. Right, that's a good candidate, right? There are as many distinct values as I have rows, so every value is unique; this is a good candidate for a primary key. So what I will do to avoid blocking is use pt-online-schema-change. I will say, okay, pt-online-schema-change — I always butcher the options, so be patient — and execute the alter to add the primary key. So this tool is telling you how to check; it's basically recommending the same query I just did, but this one is nicer — I forgot pt-online-schema-change had this, let me show you very quickly what it's suggesting. Yeah, you decided to add a unique index. So basically it's just doing the same check: if SELECT COUNT(DISTINCT) is equal to COUNT(*), it prints out whether it's okay or not. And it's telling us: you're trying to add a unique key, and this can result in data loss if the data is not unique. And, to be honest... check unique key change, I guess.
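The pre-flight check Marcos runs, as a minimal sketch (table and column names are hypothetical):

```sql
-- If total_rows = distinct_ids (and the column is NOT NULL), it is a safe candidate
-- for a PRIMARY KEY or UNIQUE key; otherwise adding a unique key could lose rows.
SELECT COUNT(*)           AS total_rows,
       COUNT(DISTINCT id) AS distinct_ids
FROM some_table;
```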

Dave Stokes: I always forget, what is the right parameter? Is it --no-check-unique-key-change?

Marcos Albe: Right, there we go. By using pt-online-schema-change, I'm avoiding blocking the table entirely, and I will end up with my primary key in place while my workload is still running — this sysbench. Oh, well, take a look at that: this was my performance before adding the primary key, and this is the performance after adding the primary key. So you can see what kind of a difference having proper primary keys can make. That was about primary keys. All right, let's continue. I'm not sure I have a sample table for this, but if I don't, I will find one. The UUID question was a good one, and I have precisely a query that will detect primary keys that are not integers or timestamps. Integers and timestamps are monotonically increasing and are eight bytes or less, so they are very good candidates for primary keys; a timestamp with enough precision could perfectly well work as the sole primary key, or as the leftmost part of a primary key. Again, the most crucial part is that they are increasing monotonically, always going up and up, and that way you're always inserting on the right side of the tree. At some point, when the tree grows too large on the right, it's going to pivot — imagine the tree was only growing towards the left, it's the same thing — and it will say, okay, bring this part up as the top and rebalance using it as the root of the tree, and by making this the root, the other part goes down. What I'm trying to say is, rebalancing doesn't require rebuilding the whole tree, but it's an expensive operation anyway, and you want to do as little rebalancing as possible. Using UUIDs is going to cause much more rebalancing, while inserting always on the right side only causes rebalancing after a lot of inserts, and the pages are going to be mostly full, actually well utilized. With UUIDs you will end up with mostly empty pages — the pages are going to be rather empty — and that's why we ended up with so many thousands of pages to store the same data. So with that, I crafted a very simple query, again on information_schema.columns: I'm just checking whether the data type of the column is not an integer or a timestamp, whether the column is being used as a primary or unique key, and whether the position of that column is the leftmost. With that I was able to identify tables where the primary key, or the sole unique key, is using character types that are more likely to be random than not. So, let me see, for example, this one... I did this query just for this presentation and I obviously didn't pay enough attention, and this one is a unique key on more than one column.
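A sketch of that kind of detection query, here written against information_schema.statistics rather than columns (either approach works; the one used in the demo may differ):

```sql
-- Leftmost columns of PRIMARY/UNIQUE keys whose type is not an integer or timestamp,
-- i.e. keys unlikely to be monotonically increasing.
SELECT s.table_schema, s.table_name, s.index_name, s.column_name, c.data_type
FROM information_schema.statistics s
JOIN information_schema.columns c
  ON c.table_schema = s.table_schema
 AND c.table_name   = s.table_name
 AND c.column_name  = s.column_name
WHERE s.non_unique   = 0        -- primary and unique keys only
  AND s.seq_in_index = 1        -- leftmost column of the key
  AND c.data_type NOT IN ('tinyint', 'smallint', 'mediumint', 'int', 'bigint', 'timestamp')
  AND s.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
```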

Dave Stokes: Let’s check that. SHOW CREATE TABLE it is…

Marcos Albe: And yeah — oh no, it's a single key. Okay. So it already has a primary key, so this unique key is not important and we can ignore it. Then we have Incident Record. Let's take a look at that. This one, I believe, is a humongous table.

Dave Stokes: That’s a big table.

Marcos Albe:
It has a primary key on (id, status). The id is a VARCHAR(15) — and it's using latin1, okay, so 15 bytes. Plus, what was the other column? Status. Status is a VARCHAR(50). So, Jesus, that's a bad primary key: this is 65 bytes, right? Let's take a look at the size of this table and see if we can improve on this. Okay, so roughly 800 kilobytes; it doesn't have many rows. But let's see how much it improves. I'm just gonna go ahead and use ALTER TABLE... it's gonna complain about... ah, now I see why they did this, okay. Sometimes there are restrictions on what columns need to be in the primary key; in this case, someone is partitioning by status, and that mandates the status column to be in the primary key. So apparently I have no escape from this. So what can we do? Well, let's see... SELECT... sorry, no pager... so the data is only eight bytes, but the column says it needs 50 bytes. What I can do is make the field smaller. In this case, it's not going to save us disk space, because status is a VARCHAR, a variable-length column, so it's not going to take that space on disk. But when you do sorting, it's going to allocate all 50 bytes in memory, so it's going to be more expensive for memory at least, and it should impact performance when doing sorting and grouping. I don't have a benchmark to test it now, but making columns as small as your data allows is the lesson here. Again, in this case we cannot get rid of the primary key as it is, because status is part of the partitioning key, and that prevents us from just recreating the primary key without it — it's going to tell me: no, all the columns in the partitioning key must be part of the primary key. So this is a no-go in this case; sometimes you are trapped and you can only do so much. Then, let's see, we had some other one: oopsie. Oopsie... I haven't inserted anything into oopsie, that's a problem. And yeah, it's a CHAR(36) in this case. Again, I don't have many columns, but let's load some random data, 700 or 800 rows... access denied, all right, why can't I use the socket... okay, there we go. Let me add a few thousand rows so it's gonna have some effect. Okay, here is oopsie — and it's a MyISAM table, oh my god. But take a look, it's almost nine megs. Let's see how big it gets, and how much we can shrink it by adding a proper primary key; then we can do the ALTER TABLE. Actually, I forgot to do something. Let me do something on the table first: I'm just going to add another column, just to make it heavier... now we've got a bigger index. And if I go ahead — of course, my MyISAM table is not gonna cut it, but let's see how it goes. Actually, it reduces the index size and likely increases my table size, but this is an effect of MyISAM. Let me recreate the table one more time — CREATE TABLE ... SELECT — all right, and then add a few more thousand rows. Very good. And now let's take a look at the size of the thing. And here we have it: oopsie is 20 megs.
All right. So it literally wiped out 25% of the data size, or 20%, give or take. It shrank the size even though we actually added one column — I added one column and still the table is smaller. So you can see the kind of impact this has. Again, what I did here is use that simple query to find tables whose primary keys were not integers. Some of them we could fix, others we couldn't, and others were my bad, because this query is not perfect: it picks up unique keys without checking whether there is also a primary key on the table. Oh well, it should still help you locate bad stuff. Next, what I will do is also try to find tables that don't use InnoDB. Why? Because while MyISAM isn't strictly poisonous, for any transactional workload it is — and by transactional I mean you are actively reading and writing to the table. If you're only reading, it's not a transactional table; it's an archive table, a data warehouse table, or a facts table, whatever you want to call it, but it's not transactional, you're not doing transactions on it. MyISAM is bad for transactional operations because it adds a lot of contention: only one thread can write at a time, versus InnoDB, where multiple threads can write to the same table at the same time — MyISAM has table-level locking, InnoDB has row-level locking. So locating tables that use MyISAM, or any engine other than InnoDB, can lead you to find hotspots that might otherwise go unnoticed. Again, in this case I would also use pt-online-schema-change to convert it to an InnoDB table. And it will normally take more space — MyISAM is more space-efficient. Let's experiment: oopsie was nine, let's say 10, versus 15 megabytes, right? Well, this table has more data actually. So let's see: it's slightly bigger, like some 40% bigger or so — this was 10 or 11 and this is 15, so it is a solid 40% bigger just because you made it InnoDB. But that's the price of row-level locking and the other benefits. So disk footprint is not what InnoDB is going to improve. If you're looking for disk footprint improvements, you should look at RocksDB; that will make your database smaller, and still allow you row-level locking and everything else — ACID compliance, crash safety, most of the goodness that InnoDB brings you can find in RocksDB. Okay.
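A sketch of the engine check (any non-InnoDB engine in user schemas is worth a second look for transactional workloads):

```sql
-- List user tables that are not stored in InnoDB (MyISAM, MEMORY, ARCHIVE, ...).
SELECT table_schema, table_name, engine, table_rows
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND engine <> 'InnoDB'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
```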

Dave Stokes:
I should mention that RocksDB is built into the Percona release of the MySQL server, but not the standard MySQL release from Oracle. So you'd get it from us.

Marcos Albe:
Yeah. And I want to make this clear: it's not a generic replacement for InnoDB, it's not for everybody. It's good that you consider it, for example, for archival tables, or for systems where the disk footprint is so obscenely large that the cost of adding disk is more than the cost of slowing down queries a bit. RocksDB is super efficient for inserting — the insert rates for RocksDB are unparalleled — but the range reads for RocksDB aren't so spectacular. So it trades insert performance and disk footprint for read performance. But again, for people with such massive databases — we have customers, I was talking with one yesterday, they have 1,000 shards, or 500 shards, across two data centers, and each shard could have anywhere from one or two terabytes to 14 or 15 terabytes. When you have 500 machines and you need to add 15 terabytes to each one, that adds up to a lot of money. So for those cases it might make sense — and not for every table; perhaps only for some big tables where you have to keep the data for compliance and you're not actively reading it. For those kinds of tables, it makes a lot of sense. Again, it's not for every table, and it's not an easy drop-in replacement for InnoDB. But you can use it with replication, XtraBackup will take a backup of it, and Percona Server has it built in, so you can deploy Percona Server and start testing it. Absolutely. Okay, next one. When we talk about poor schemas, there is one thing that comes to my mind: I couldn't care less about a schema that is horribly designed if it's not inflicting performance pain on me, right? The number one thing I care about is performance, so when I think about what a bad schema is, I think of one that's causing me performance issues. And I have something for that — some goodness from the sys schema. This comes built into MySQL, and it's called schema_table_statistics. Let's see... oh, I have a stray space in there... very good, let's limit it. So this is telling us: look, the table that is inducing the most latency is sbtest6. Now, it could be because it's simply a more popular or more used table than the other ones, or it could be because it really has a worse design — it could have more rows, or bigger rows, we don't know yet. But the point is, it's taking 4x what the others are taking. And actually, if I look at it, I see 3 million rows here, and this one shows me 3 million rows fetched in the same amount of time, and this one is also inserting millions of rows, and this one too. And I see they're similarly named — well, again, you have to know your application a bit, it's hard to work blindly on this — but I start to see that these tables have very different performance profiles. So I'm just gonna go ahead and take a look: what the heck is sbtest6 doing? I'm going to SHOW CREATE TABLE it. Right, it looks like a simple table, but it seems over-indexed, and badly indexed. I'm going to show you a couple of queries before I optimize this table, because I actually purposely built this one. Basically, the table is suffering performance-wise, possibly because of the duplicate indexes — we're gonna see a few more things — and I also see some BIGINTs here.
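The sys view being read here, as a sketch (assumes MySQL 5.7+ or 8.0 with the sys schema; the view is already sorted by total wait time, so the hottest tables come first):

```sql
-- Tables ranked by the latency they induce, with basic row-activity counters.
SELECT table_schema, table_name, total_latency,
       rows_fetched, rows_inserted, rows_updated, rows_deleted
FROM sys.schema_table_statistics
LIMIT 10;
```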
And I always remember the famous words someone once told me: a BIGINT is enough to count the ants on Earth. And that is likely too much for almost any table in your database, unless your customers are ants, or your scientists count ants. We also have another query that gives us index statistics — the same kind of latency statistics, but for indexes. Let's see what we find. And indeed, the index with the most latency is that primary index, and the update latency, look at that, it's almost four times bigger than on this other table's index. Now let me show you — this table will make a big difference, and I want to show you the difference in size between these tables, so keep this in mind; I'm gonna save this somewhere. Let's go back to MySQL, but I want you to keep in mind this is roughly seven times larger, okay? We already verified this index is also the slowest index, right? And I will also show you table number five. That's number six, and this is number five. You can see they don't have much difference: they have id, k, and c, but this one uses BIGINT and CHAR(191), and that one is INT, INT, and CHAR(120) and CHAR(60). And we can do SELECT MAX(k), MAX(LENGTH(c)) and so on from sbtest6 to see what the data actually needs. Yeah. So we will try to reduce the latency of that table, sbtest6. Yep. Again, what I see is that the performance on this table is far and away worse than on its sister table: the update latency is up to 11 minutes, while on this one it's only 2.87; the delete latency is 4.56 versus... so it's between three and four times bigger latency. And that's because of all the extra work it's doing to maintain indexes and to maintain the extra pages needed to fit the unnecessarily long data types. So I'm trying to find out whether these data types are appropriate for our needs: do I really need the maximum character length for these? Because of index restrictions and whatnot, 191 is the magic number with utf8: basically, it's the biggest index prefix you can have without enabling long index prefixes. So do I really need this much? And do I really need the BIGINT? I can immediately tell that I don't need the 191 characters — this one should fit in 120, and this one should fit in 60. So I would go to the business people at this point; this is a business thing, right? While I see that the maximum length of the values in the table is 60, I should go to someone in the business area and ask: hey, do you ever expect to insert a value here longer than 60 characters? And they will tell me yes, no, or we need to investigate, or whatever. But at some point we're going to conclude whether we need 60 or 191, and once we reach that conclusion, we should make this smaller if we don't need the 191. Again, you always want the smallest data type that fits your values, and in this case a much smaller data type than BIGINT will fit the id, and a much smaller data type than CHAR(191) will fit this character string. So what I'm trying to do is find out what is the leanest data type I can use to fit my values. And again, this depends on the business, on what data you're storing. So for example, I see an id with a max of 100,000 — yeah, this will fit in an unsigned MEDIUMINT, alright. But will we ever grow beyond 128,000 rows?
Or do we already have all the rows we're ever going to have here? Well, again, it's all business-dependent, so don't take this at face value. You have to talk with developers and data owners to determine whether the proposed alternatives are good. And once you know your alternatives are good, you can do the alter. So in this case, for example, let's say I know that this table will never have more than 128,000 rows. Do the SHOW CREATE again... So I will do an alter, and I will do this in pieces to show you a bit of the impact: I'd first get rid of the BIGINTs, and this is the biggest change, the biggest gain we're gonna get. Take a look at this ALTER TABLE; it's running now.
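A hypothetical sketch of that pass on a sysbench-style table (the column names id, k, c, pad and the target types are assumptions; the real targets depend on the business answers, and in production this would normally go through pt-online-schema-change):

```sql
-- First, check what the data actually needs.
SELECT MAX(id)          AS max_id,
       MAX(k)           AS max_k,
       MAX(LENGTH(c))   AS max_c_len,
       MAX(LENGTH(pad)) AS max_pad_len
FROM sbtest6;

-- Then, after confirming the limits with the data owners, shrink the columns.
ALTER TABLE sbtest6
  MODIFY id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  MODIFY k   INT UNSIGNED NOT NULL DEFAULT 0,
  MODIFY c   CHAR(120) NOT NULL DEFAULT '',
  MODIFY pad CHAR(60)  NOT NULL DEFAULT '';
```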

Dave Stokes:
By the way, knowing the size limits of your data is something you need to do when you're designing the schema. And it helps to keep a list of the assumptions, like using an INT instead of a BIGINT, so that when you do get to those, what, four and a half billion people, you can say, hey, we need to change our tables.

Marcos Albe: Yeah, again, good planning for data types should be done at the very beginning, at business inception, at application inception — but that's not always possible. And I guess that's what people like about MongoDB: you are not bound by data types or schemas (even though, in a way, you are), and you don't have to do this and take 25 seconds of downtime to modify your table. But yeah, planning ahead is a good idea. Oh, crap, I ruined my... okay, take a look. This table was 290 megabytes, okay, 289 megabytes. Let's take a look at how much it is now... 260-something. I would expect it to be smaller, obviously. Okay, so I already cut down a good 10%. Let's see how much more we can get down from there. So now I'm going to modify the rest of the data types that I found were too large for my table. I'm just gonna go ahead and say ALTER TABLE — again, I would normally do this with pt-online-schema-change; I'm just saving us some time, because pt-online-schema-change is slower. Okay, let's see what that did for us. Now we've actually cut down the table size by like 40% versus the previous size, right? So you can see the kind of difference that good data types can make. And let's go even further: I have another marvelous query that is not going to give us statistics, but is going to tell us which indexes are redundant or duplicate. And again, it's another sys schema view — these are views in the sys schema — and it's called schema_redundant_indexes. You can see it's actually showing me every table where I have indexes covered by other indexes. So it's telling me: you have a redundant index called c_k, on column c and k, and that's covered by another index called c_k_p; and you have a redundant index k_v, whose redundant column is covered by the index k_2_v. Alright? And it's giving me a ready-to-go ALTER TABLE I can use to get rid of those indexes. There are really few occasions where you don't want to drop such indexes, and one of the few I can think of is, for example, when you have an integer column and then a very big char column, and the composite index carries the whole char column up to the maximum prefix, which is 767 bytes or so — if you have a 2,000-character column, you can only index the first 767. If you have an index that is, I don't know, department name plus some big piece of data, then you might want to keep the index that is only the integer, even if the other composite index covers it, because loading the composite index is more expensive: it has a huge piece of data that increases the memory footprint. So that would be one of the few occasions where I would not drop the redundant index. Otherwise, these are usually good to go, and dropping these indexes should result in a better table, so I'm just gonna go ahead and do it.
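The sys view in question, as a sketch; the sql_drop_index column is the ready-to-go statement Marcos mentions:

```sql
-- Redundant indexes and the dominant indexes that already cover them.
SELECT table_schema, table_name,
       redundant_index_name, redundant_index_columns,
       dominant_index_name,  dominant_index_columns,
       sql_drop_index
FROM sys.schema_redundant_indexes;
```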

Dave Stokes:
Now, what do you recommend: instead of dropping the indexes, making them invisible for a while, and then, after a few weeks of making sure all is well, dropping them?

Marcos Albe:
Yeah, I didn't think of that, and it's a great idea. Yes, in MySQL 8.0 you can also hide indexes. I don't remember the syntax off the top of my head — I barely remember my name, so apologies — but yeah, hiding an index is a good idea, because that's a metadata-only alter, so it's very quick, and it allows you to verify that your application didn't rely on that index for something, and that dropping it is not going to cause a performance hole. So that's a great idea. Let me very quickly check the syntax for that... one second... right, it's an invisible index.
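The MySQL 8.0 invisible-index syntax being looked up here, as a sketch (table and index names are hypothetical):

```sql
-- Hide the index first; drop it later only if nothing regresses.
ALTER TABLE sbtest6 ALTER INDEX k_2 INVISIBLE;   -- metadata-only, instant
-- ...watch the workload for a while, then either bring it back...
ALTER TABLE sbtest6 ALTER INDEX k_2 VISIBLE;
-- ...or, if all is well, drop it for good:
ALTER TABLE sbtest6 DROP INDEX k_2;
```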

Dave Stokes: Okay, and how do you alter the index? So you would do ALTER TABLE, then ALTER INDEX k_v INVISIBLE, something like that?

Marcos Albe:
Alright. I had two duplicate keys, so I dropped one and altered the other one. And again, it still hasn't finished dropping the index — that operation was an instant alter, you can see it didn't rebuild my table immediately, so it's going to take a while for the data size change to show. What I'm going to do now is drop this one too, because I don't like that index. That's going to finally get rid of both indexes, and it's going to rebuild my table. You can see the difference between this execution time and this other execution time: this ALTER TABLE was instant, and the drop of the index was happening in the background, while this one was obviously not instant and actually finished with my table rebuilt. There we go, I got rid of another 30 megabytes or so there. So at this point, let's see how our table looks. Great, thanks. Okay, I still have some large indexes, right? These indexes are pretty large because they index the whole column. And one more thing you can do — and then I will stop talking — is index prefixes. Let's do something: let's drop, say, c_k_p... actually, sorry, let me show you with a SELECT first. So now we have to look at the prefixes. These are longer strings, CHAR(120); you are indexing the whole string, you're adding 120 bytes for every index entry, and that might not be necessary. Let's take a look: SELECT COUNT(DISTINCT LEFT(c, ...)) — let's see, eight; and, because I'm lazy, 16; and then 32; and finally one more with 64. And I'm just gonna see what comes out of this... I guess that's right. Look at that: with eight characters alone, I'm able to identify 99.94% of my rows. So I don't truly need to index the whole c column; it should be enough to index the first eight bytes to get very good cardinality. And I'm going to do the same with k — and I'm lazy, so I'm not changing the names, I'm sorry. Again, with eight characters I am absolutely done. Let me show you the effect of four characters, so you can see how this goes... oops, there you go: if I only use four characters, then I only get 10,000 unique values, meaning this will not give me very good cardinality. Let's see what happens if I use six. Again, the most important thing for an index is what cardinality, what selectivity, it provides. And you can see six characters already provide 95% selectivity, and eight provide 99%. So let's make these indexes smaller: for c I'm going to use eight, and for k I'm also going to use eight. And let's see what difference that makes. Our last table size was 125 megabytes; let's see if this helps. Oh, that didn't help — I guess I fragmented the table. That's weird.
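The selectivity check and the prefix change, as a minimal sketch (column and index names here are assumptions based on the sysbench-style demo table):

```sql
-- How much selectivity does each prefix length keep?
SELECT COUNT(DISTINCT LEFT(c, 4))  AS u4,
       COUNT(DISTINCT LEFT(c, 6))  AS u6,
       COUNT(DISTINCT LEFT(c, 8))  AS u8,
       COUNT(DISTINCT LEFT(c, 16)) AS u16,
       COUNT(*)                    AS total_rows
FROM sbtest6;

-- If u8 / total_rows is ~0.999, an 8-character prefix keeps almost all the
-- selectivity at a fraction of the index size.
ALTER TABLE sbtest6
  DROP INDEX c_k,
  ADD  INDEX c_k (c(8), k);
```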

So I'm just rebuilding the table to defragment it; I assume it got fragmented. Again, I do have the sysbench running in the background, so these tables are getting updates, deletes, and inserts at all times — it's not a steady-state table, it's getting traffic all the time. You can actually see... it's weird, but it didn't block it; OPTIMIZE should be blocking, it should be rebuilding the whole table. Okay, that makes more sense. So we are slowly approaching the previous size, and we still have more indexes, and we have another one that is very long. Let's make those better too: again, we're going to use the eight-character prefix, and I'm going to do the same for the other one.

Dave Stokes: We are doing it again, right?

Marcos Albe:
Well, I'm doing different indexes now, right? Let's see what the net effect of doing that is. Again, I assume it will require an OPTIMIZE TABLE, in this case, to see the full benefits — let's do it. I still have a few more queries to show you, so I'm gonna go through this more or less quickly.
And now we've dropped 50% more of the table size. So you can see how choosing proper prefixes for those indexes helps you save a lot of space. And this is all performance, because those are reads that are going to be smaller, writes that are going to be smaller, a better fit in the pages — fewer pages to be manipulated, smaller B-trees to be traversed, more fits in memory. It's all beneficial, right? It is so important to have proper data types. We went from a table that was — how big was it? — like 276 megs, to one with 50 megs. And we didn't get rid of any index or any column; we only reduced the footprint. And again, it can make a night-and-day difference in performance.

Dave Stokes:
Reading from memory is 100,000 times faster than reading from disk even if it’s an SSD.

Marcos Albe:
Yeah, forget it, SSDs are overrated. You only want to read from memory when possible. So yeah, this is really important; I cannot emphasize enough how critical this is, and most people happily ignore it. I tell this to customers day and night, and they just say, yeah, we're gonna do it one day — because they don't realize the potential gains, and that old server you're running could have a couple more years of life if your schemas are good. Okay, next one: unused indexes. This one is a bit harder, and it's a question we get a lot. Finding duplicate indexes is easy, it's a static thing: I see an index has the same leftmost columns as another, so it's a duplicate. But an unused index — how can you tell whether it's used or not? Well, after your database has been running for long enough, and your application has been able to issue every query it could issue, you should have statistics about the utilization of the indexes. And this will give you the list of indexes that have not been used in any query recently — or actually, since the start of the database. So you start the database and there are no statistics about index usage; if you run this query right then, it will bring up every index in your database. Days later... let's take a look at these two. Now, if I do this, we should actually see that this one is no longer there, right? I ran a query — which index was it? Let me show you again; it was one of these two... basically, the department one, yeah, this one, sorry. So you can see employees.dept_manager, department number: I did a query on employees.dept_manager and ordered it by department number, so I knew it was going to use that index. And effectively, it disappeared from here: it's no longer in the list of unused indexes. So the information from this query — again, don't take it at face value. It could be that an index is used by a query, but that query has not been executed since the last start of the database. So, like Dave suggested earlier, you could hide the index, see if your application is still okay, and after a couple of weeks go ahead and drop it. But there are, again, very big gains ahead: if you can get rid of a whole index, beautiful, please do it. This view is telling me: okay, Marcos, you're not using any of these indexes, so just get rid of them. And you can go ahead and hide them, or drop them if you're super sure you're not gonna need them. The point is, this helps you identify indexes that are potentially not being used; it's on you to verify they're truly not being used and, later, to get rid of them. Mainly, you need to be sure: if you know your queries, then you should probably be okay with this; if you don't know your queries, then you should hide the index and wait for some time.
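The sys view for this check, as a sketch; the caveats above apply (statistics reset at server restart, and rarely-run queries may simply not have executed yet):

```sql
-- Indexes with no recorded usage since the server started.
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes;
```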

Dave Stokes: Write down the index creation code before you delete the index, just in case three months later your boss comes back to you and says, hey, how come the quarterly reports aren't working?

Marcos Albe:
Yeah, yeah. And application queries could run once per year, or once per quarter, or God knows — there's some reporting, or summaries, or events that only happen once per year, or even less often. I don't have an example right now, but I could probably think of queries that run only once every year. And if you don't wait long enough before dropping the index, at the end of the year you're gonna find: oh my god, I don't have that index, this report is horribly slow. So yeah, you should wait a sensible amount of time before doing the drop. Right. And I just deleted my last query... and here is my last query. I will give Dave the queries I ran — I have a little document with me — so that people can benefit from them. This last query is a query I wrote, and you can change this multiplier. Basically, what I'm trying to do is find tables where the indexes are taking more space than the actual data. Now that I modified sbtest6, I'm not sure I'm gonna have any of those, but let's find out. Oh, sbtest6 is still being reported. So these other ones... no, they look wrong, I need to check — it does have quite a few indexes, so it could simply be that, but 4x seems too much to me; I don't trust these numbers, I would need to investigate more. I trust this one better, because I know the table and I know the indexes. And this one, of course, is just an artifact of bad statistics — this table has no rows. These numbers are not exact; they are projections based on the index statistics, and index statistics can sometimes be off by as much as 40%, depending on fragmentation of the table and bad luck: you could get a sample for the index on a branch of the B-tree where there is a lot of fragmentation and the pages are half empty, and then the data length and index length come out with half the real value. So again, don't take these as perfect numbers; they are approximations that should give you a good idea and guide you, and sometimes they're not gonna tell you the truth, and you're gonna find out later that the numbers were wrong and that you need to update them — you need to run ANALYZE. Let me show you on sbtest6: running ANALYZE TABLE is a fairly cheap operation. And you can do this ALTER TABLE — I never remember the syntax, but the default is 20 pages, which is really small — and this metadata-only alter allows me to increase the sample pages for the index statistics of this table. And now I can do ANALYZE TABLE sbtest6, and let's see if that changes my stats... oh my god, it's gone. It was lying to me. And I figured it could be lying, because we shrank the indexes so much that it looked weird.
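A sketch of that last check plus the statistics refresh (the 1.5 multiplier and the sample-page count are the tunables mentioned; the table name is from the demo):

```sql
-- Flag tables whose indexes take noticeably more space than the data itself.
SELECT table_schema, table_name,
       data_length, index_length,
       ROUND(index_length / NULLIF(data_length, 0), 2) AS index_to_data_ratio
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND index_length > data_length * 1.5;

-- These numbers come from sampled index statistics; a bigger sample plus ANALYZE
-- makes them more trustworthy (20 pages is the default sample size).
ALTER TABLE sbtest6 STATS_SAMPLE_PAGES = 200;
ANALYZE TABLE sbtest6;
```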

Dave Stokes: I bet that if I do this, it's still there. Oh, no.

Marcos Albe: Well, that's weird. Okay, I was expecting to still see sbtest6, but not at like 4x or 5x the size. So again, you just saw it: I improved my index statistics, I ran ANALYZE, the statistics were corrected, and now it's no longer there. So do take this with a pinch of salt — these are guidelines, and verification of the actual facts is required. But if you find a table where your indexes are more than one and a half times the data, you are probably over-indexing, and you should try to find out which indexes are unused, duplicate, or whatever. This query would have been better to run before the duplicate and unused index queries I ran — it would have made more sense for this presentation, so I apologize for that. But again, it's a guideline; when you have hundreds or thousands of tables, looking at them one by one is not feasible, and this should give you a good starting point. So with that, I conclude my collection of queries to find and correct bad schemas. I hope you enjoyed it.

Dave Stokes: Yeah. All right. Thank you, Marcos.

Marcos Albe:
Of course. And again, I will be sharing the queries I used today with the community folks here, so they can actually spread the word. I'm not sure if anybody has more questions — you can post them in the chat.

Dave Stokes: That's it, folks. Once again, thanks to Marcos, and thanks to my two colleagues, Mario and Edith. So now you've seen the entire tech evangelist team in one place. You'll be seeing more of us in the future — maybe not all together. But thanks for watching, and we'll be back in two weeks with Marcos on another subject.

Marcos Albe: All right. Well, it’s a pleasure being with you folks. Thank you. ∎

Speakers

Marcos Albe

Principal Support Engineer, Percona

After 12 years working as a developer for local and remote firms, Marcos decided to pursue true love and become a full-time DBA. He has been doing MySQL Support at Percona for the past 8+ years, providing leading web properties with advice on anything MySQL and in-depth system performance analysis.

See all talks by Marcos Albe »

Edith Puclla

Technical Evangelist, Percona

See all talks by Edith Puclla »

Mario GarcΓ­a

Technical Evangelist, Percona

See all talks by Mario GarcΓ­a »

Dave Stokes

Technology Evangelist at Percona

Dave joined Percona in February 2022. Dave is MySQL Community Manager and the author of MySQL & JSON - A Practical Programming Guide.

See all talks by Dave Stokes »
