This is one of those "Back to Basics" posts, but one with a good number of years and projects of Experience written into it.
I’ve been playing a little with Postgres (again) and Yugabyte (link), notably with the explain-plans of queries. Of course, I am looking with the biased eyes of an oracle-DBA, and that comes with some prior knowledge and possibly a strong personal preference from my education in the big-red environment.
Several findings come to mind; here are the 4 items I think you need to be able to properly operate or diagnose an RDBMS:
- Know The Concepts, and have a mental picture of your RDBMS.
- SQL-access to ALL relevant information.
- Measure Effort (and time)
- Establish “Normality”, your reference, your benchmark.
Anyone working with Databases, RDBMSes, especially as a troubleshooter (“Consultant” if you wear a tie), should be aware of these items and have them Mastered to some degree.
Let me elaborate on those..
re-1, The Concepts. You need a mental model, a “concept description” of the underlying layers, to be able to understand your RDBMS. For something as simple as:
select ename from emp where id = :emp_id ;
In this case, you should realise that this query needs to be parsed and executed (different stages), and that the execution-phase will traverse various “objects”. The RDBMS needs to use its knowledge of the emp-table and decide to either scan the table or use an index, if one is available. This also means the RDBMS will have to read data, from memory or from disk. And if a query is too slow, or not doing what you want, you need this background information to diagnose and fix your problem.
Hence you need some conceptual knowledge of your RDBMS.
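To make this concrete, here is a minimal sketch in Postgres; the emp table, its index and the plan-output are assumptions, purely for illustration:

-- a hypothetical emp table, just to illustrate the planner's choice:
create table emp (id bigint primary key, ename text);

-- with a usable index on id, the planner can pick an index scan:
explain select ename from emp where id = 42;
--   Index Scan using emp_pkey on emp  (cost=... rows=1 ...)

-- without a usable index, it has no choice but to read the whole table:
explain select ename from emp where ename = 'SCOTT';
--   Seq Scan on emp  (cost=... rows=... ...)

Two different plans for two very similar queries: that is the table-scan versus index decision from the concept-description at work.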
re-2: SQL-access to all relevant information is strongly recommended. Am I biased in this? It is not just me: re-read the 12 rules of Codd!
The best examples are the catalog-tables that expose how your RDBMS actually stores its objects. But you can dig much deeper, using things like pg_stat_statements or the explain-plans of your queries.
With the knowledge of the “concepts” (see above), you should be able to search for, and query, the relevant tables and views from the RDBMS to help you diagnose and fix your problem.
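For example, in Postgres (assuming the pg_stat_statements extension is installed; the column total_exec_time is the Postgres 13+ name, older versions call it total_time):

-- catalog: how does Postgres store the emp table?
select relname, relpages, reltuples
from   pg_class
where  relname = 'emp';

-- per-statement statistics, exposed as a plain view:
select query, calls, total_exec_time, shared_blks_hit, shared_blks_read
from   pg_stat_statements
order  by total_exec_time desc
limit  5;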
As an ex-oracle-DBA, I know I can use:
- v$sql : for basic info about the SQL.
- v$sql_plan : which steps were included in the explain plan.
- v$sql_bind_capture : I can even find recent values of :emp_id.
With so much information available, I can even re-create some of the executions of the statement above (future link to demo...)
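A minimal sketch of that routine (the &sql_id is a SQL*Plus substitution variable; fill in whatever identifier you find for your statement):

-- find the statement and its basic statistics:
select sql_id, executions, elapsed_time, buffer_gets
from   v$sql
where  sql_text like 'select ename from emp%';

-- the plan that was actually used:
select * from table(dbms_xplan.display_cursor('&sql_id'));

-- recently captured values of :emp_id :
select name, value_string, last_captured
from   v$sql_bind_capture
where  sql_id = '&sql_id';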
I want to stress the point of using the SQL interface here, over, for example, using dtrace or examining+grepping log- or tracefiles. In many circumstances, you will not have access to the backend where those files live. So apart from the “Codd-correctness” of exposing all data via SQL, there is the practical need for an RDBMS to expose its elements and its workings as tables/views/functions.
(Note: this is something that may pose a problem for a multi-node system, a clustered system, or a serverless RDBMS: the work is done in many places, and the SQL-front-end may not have easy access to all the components. Yuga, Cockroach, pay attention here.)
re-3 Measure Effort (and time). Don't make the mistake of just keeping a stopwatch next to your system. Any complex IT system is notorious for inconsistent response times. It is relevant to measure the (elapsed) time, but that time is only really helpful if you also know “what happened” in that time. For example, a select query may be slow on the first retrieval but faster on the next due to caching. Or an individual full-table-scan query may be fast in testing just because your test-data is cached and the system is idle, whereas under higher load that response will slow down, because the inefficiency of the full-scan quickly deteriorates your system performance. Your hardware- or cloud-vendor will smile at you, though, as you will need more of their resources.
(this topic is worth a separate blog, sometime...)
(future: link to example...)
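In the meantime, a quick Postgres illustration: explain (analyze, buffers) reports the time and the effort together (the output below is hypothetical and abbreviated, shown as comments):

explain (analyze, buffers)
select ename from emp where id = 42;
--   Index Scan using emp_pkey on emp (actual time=0.020..0.022 rows=1 loops=1)
--     Buffers: shared hit=3
-- "shared hit" counts blocks found in cache, "read" counts blocks fetched from disk;
-- the elapsed time alone would not tell you which of the two happened.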
re-4 Establish what is “Normal”: what you can expect from a certain system. This takes some time to get to know your environment, some time for logical reasoning, and some experience to recognise whether your observations are reasonable or not. What was “fast storage” in 1995 is now considered abysmal, and what was considered a slow network in 2005 is now considered Fast, due to the system being half a continent away in some cloud-region.
I tend to look for typical elapsed times (my “normal” is: better than 1-10 milliseconds per SQL), for used-effort per SQL (cost or block-reads per SQL), and for some overall system metrics: the idle-percentage of the CPU should typically be 30% or better. But for some systems, an idle of near-zero can be the norm, and that can be fine as well, as long as the users are happy and no degradation occurs.
The best examples I know for “normality” come from the Oracle-world: I can easily recognise a quiet looking OEM-graph, or a “regular looking” AWR report for databases that I am familiar with.
And an anecdote about "normality": in my view, 5ms is a perfectly “normal” response-time for an employee lookup (actually a bit slow in 2022, but hey: blame network- and storage-latency for some of it…). Now, if a system has a legitimate need to do lookups of 1000 employee-records per second, then at 5ms each it is burning 5 seconds of processing time every second: that needs at least 5 CPUs, and in practice more, to leave some headroom. There is no magic bullet to fix this on a 4-CPU system. However, if those 1000 lookups are out of touch with the actual 2 people working on the system, something else is wrong.
Why all this...
Well, I have some reflections on this “basic stuff”, and some wishful thinking.
With single-server RDBMSes, like conventional MS-SQL, MySQL, Oracle or Postgres, we learned how they were built; most experienced DBAs know the concepts of their systems, and we can diagnose them and reason out where errors or bottlenecks are located. In the case of Oracle (which I am most familiar with), we can do that using purely SQL (e.g. I can diagnose a system by using only SQL-queries and the scripts that others and I have developed for that diagnostic purpose).
Also, for those “known” systems, we have enough experience to say whether something is “normal” or not (e.g. 2ms disk-response is probably normal; slower raises questions, faster shows good SSDs or efficient use of cache).
For the newer systems, and notably the “serverless” systems, we are still in the process of learning this.
And if I am allowed some pedantic advice to the builders of those systems: help us to find the 4 items mentioned.
Here is what makers/vendors of an RDBMS need to think about:
Help us at item 1, Concepts, by explaining your “concepts” clearly and in some detail. Documentation.
Help us at item 2, SQL-access, by making All Information accessible via SQL (yes, I know this is complicated if systems span multiple nodes).
Help us at item 3, Measure Effort, by opening up the “effort underneath” via SQL (views). A good example is the Oracle “wait interface”, where 99% of the “time spent” by the system can be attributed to activity using views like v$system_event and v$session_event (a sketch follows below).
Help us at item 4, benchmarking Normality, by allowing easy reporting, and possibly by adding some statistics-info to the RDBMS.
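As a sketch of what that wait interface gives you, this is the kind of query an oracle-DBA runs against v$system_event (top-10 non-idle waits by time spent; the fetch first syntax needs a 12c or later version):

select event, wait_class, total_waits,
       round ( time_waited_micro / 1e6, 1 ) as seconds_waited
from   v$system_event
where  wait_class <> 'Idle'
order  by time_waited_micro desc
fetch first 10 rows only;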
(Disclaimer: disillusioned-old-oracle-DBA here…)
It took Oracle some 10-15 years (Oracle versions 6 and 7, around 1994) to stabilise their concept, document it in a “Concepts Guide”, and to seriously open up and publish the v$ views, although the original bstat/estat scripts go back further. I’m expecting the new breed of RDBMSes, especially the ones that are Open Source and claim to be Planet-Scalable, to do that a little speedier.
TBC...