pdvfirstblog: August 2023

TL;DR: a podcast on Sharding set me out to Think. I summarised some of the info I learned, and I offer some comments, notably for distributed systems.

First I should Thank Nicolay and Michael for their podcast (links below). I learned a few things, and I re-studies some of the material on the topic.

As they point out, naming (taxonomy) is important. It is helpful to state good definitions. Not sure if I agree with all of the wiki-page as it seems written mainly from a noSQL point of view (normalisation is not the same as vertical partitioning, in my book??). But the definitions on that page (Aug 2023) are is a good start.

https://en.wikipedia.org/wiki/Shard_(database_architecture)

Note: Using a wiki-page on a somewhat vague and fluid concept is risky, I know...

Definitions related to Sharded systems

What I now (re)understand (and thought to know since mid-90s):

Sharding always involves Multiple Nodes.

Nuance: Often Sharding implies that one Shard corresponds exactly to one Node (or container or machine), but this is (IMHO) not a given: a Node can hold multiple shards, and a shard can be held on multiple Nodes.

Horizontal Sharding: records are kept in shards as "whole records" (e.g. empno, employee-name, emp-email, date_of_birth are kept together on 1 shard, on 1 node). The record is assigned to a shard depending on some property, often based on key or hash-of-key, but can also be date-range-based, or list-based on, say ISO-Country code.

Vertical Sharding: columns or groups of columns are assigned to shards. In this case, the contents of a single records can be divided over multiple shards or nodes. Entering the domain of Columnar Databases.

A Node: Generally a combination of compute-capacity and storage-capacity, can be raw-iron, VM, or container, or other incarnations of IT-capabilities. I sometimes say: Node=workhorse, with given capacity (cpu, memory, storage)

A Shard: I'll see that as a collection of stored records or data, grouped by some property (hash, range, list...). Sometimes also called Segment or Tablet. The relation between Shard and Node doesn't have to be 1:1, but can be any of n:m (in my opinion...).

A Table: a set of records adhering to "some" definitions or constraints. In noSQL it is "anything goes" (greybeard-DB view), and in an RDBMS the definitions tend to be fairly strikt, with well-defined tables, columns and relations between them. A Table can or cannot be sharded, e.g. can be in 1 or more Shards, and thus on 1 or more Nodes if there are multiple.

When confronted with term "Sharded", I have generally observed some hash-based, even distribution over 2, 4, 16, 27 or 128 shards, and mostly over multiple machines or nodes. I recall one notable exception: a system with "alphabetical shards" with ... 27 shards. The buckets for Jones/Johnson and Sing/Smith got rather larger than the others. Long story, very Cool Idea, Vendor made a Fortune, but system never was great success.

Note the overlap with Partitioning:

Partitioning has its own, more elaborate wiki-page.

https://en.wikipedia.org/wiki/Partition_(database)

And there you find (Aug 2023), very similar definitions of "horizontal partitioning" and "vertical partitioning", and Possibly spread over 1 or more Nodes. But no mention of Shards.

When speaking of Partitioning in a database, it meant generally: creating smaller tables and using select-union (possibly a view) to treat them as 1 table. In most modern RDBMS this is now nicely hidden "under the SQL layer". It would classify as Horizontal Partitioning, or even Horizontal Sharding if designated nodes were involved.

Downsides of sharding an my view on how to "mitigate":

The mentioned negatives of Sharding (wiki + other sources) are notably: Complexity, SPoF and Re-Balancing. Allow me comment on them.

SQL-complexity:

I would reply: Hidden Under the SQL. The underlying SQL-engine and the storage mechanism Should be completely responsible for storing and finding the data from Inserts, updates, Selects and Deletes. Finding the correct shard and returning the set should happen outside of Dev-View. (but the DBA may want some insight, some monitoring and Controls...). Admittedly, there will often be impacts when data from far-out components needs to be joined, but I would expect Zero impact on the SQL-code when writing queries.

Single Points of Failure (a shard/node/storageblock and the metadata):

Same mitigations apply as for conventional RDBMS or other data-stores. Ideally, data is verified, checksummed, logged, stored, possibly replicated, and maybe even guarded by a quorum-mechanism. I would expect some monitoring notably by the team responsible for infrastructure (the provider) but also some by the Dev/Ops team. Users will definitely do "monitoring work", and will complain if things go missing.

Re-sharding or re-balancing:

Ideally this also happens under the covers, and should not be a concern of the Dev/Ops team. But "manual controls" are always appreciated. A good implementation will allow for (semi-)automatic and near-invisible re-sharding If Needed. (I appreciate this can be a Big Ask, and the timing+impact of re-sharding will generally be something an Ops would want to manage).

In my (arrogant?) opinion, these downsides (and there are others) are by now "solvable problems". And the vendors of RDBMS and other storage-solutions will be happy to Suggest and Provide. The role of the customer is to Be Informed, and make Sensible Selections from the many options available.

Instrumentation: Challenging with Distributed systems

One more word on "Observability". You (we) Want to be able to see, to Monitor, to Diagnose the Sharding system. And Ideally as an old greybeard-DBA, I want that data to be Query-Able via views or tables or functions, and preferably "Instant" and not with minutes++ of delay via logfile-parsers and log-mergers (which will probably fail under pressure!).

Postgres has the catalog-views, with for example pg_views, pg_tables and pg_database, mostly based on pg_class. SQL-server and Oracle have similar views to peek-inside.

So, for a Sharded system, I would expect views to expose the relevant entities: Tables, Shards, Nodes, and whatever relevant objects the specific system has. I would also expect to be able to join those views to compose information ad-hoc to investigate a problem.

Join data to provide Information, that is the Nature of a Database.

And in the near future I would expect those views on internal objects to expose counters and timing-info as well.

This system uses Shards? So... can I do something like:


select 
  n.nodename              node_name
, sum   ( s.elapsed_sec ) scnds_ela
, count (*)               nr_shards
from sh_shard s
join sh_node n on ( n.id = s.node_id )
group by n.nodename
order by 2 ;

If not yet, why not?

I understand some of the answer is in the "Distributed" nature. If a Database is running, distributed, over 42+ nodes (machines, containers?), then fetching that data would possibly be more then 42 RPC calls.

I must assume some of the required data, basic info on nodes, shards, tables is already "known" and available as metadata: The processing-software just below the SQL engine knows where to go to find the Shards, it knows which node they are supposed to be on...

And like stated above: if I have to extract it from "merged logfiles" or data scraped off some admin-interface at port 13001 of every node, that is too much of a roundabout-way, and too slow. Web-interfaces are good, Necessary. But some SQL-Access to the information is in the long run, much more valuable.

It took Oracle 15+ years (from 1990 to 2005) to create their AWR and expose the wait-times of most components. I dont quite know how long it took Postgres to expose the catalog (and still not much of timing-info in there!). I'm expecting some of the new, distributed systems to Learn from the Past...

To finalize, I'll repeat my motto:

In Databases, Everything is SQL, and SQL is Everything.

Links:

Anyway, the Podcast that started this over-thinking is here (link to Nicolay and Michael). Bookmark and Enjoy!

PS - Some Further Comments on some of the contents:

If too much pushback + flames come out, I might just remove this. Internet-expert-gurus and high-value sales ppl should be allowed their moment of fame...

2PC and XA transaction: Just Dont. Too many problems.

Need for Sharding: You Are Not Netflix (or Amazon). Most systems that thought they needed Sharding had an over-inflated architect, or a pushy (hardware) vendor, or could solve their problem by waiting a few yrs: Hardware and Engineering have done some surprising Catchups.

Bi-Directional replication: I'm Very Skeptical, especially when promised "over continents". I _Know_ it is Possible. But I've Never seen that work properly, or not for long anyway. Maybe, just Maybe, some systems that have a mechanism "from design built in" may get this practical + working.