Distributed Database Fundamentals

Posted by Yanchao MURONG on 2023-03-27

Distributed Database Fundamentals

What

  • A distributed database is a database that is spread (“distributed”) across multiple machines.
  • For an end-user, interacting with a distributed database should be indistinguishable from a non-distributed one.

Why

  • Performance
    • With data sizes growing exponentially, the need for fast data processing is outgrowing individual machines
  • Elasticity
    • The database can be quickly & flexibly scaled to fit the requirements by adding (or removing) resources
  • Fault-Tolerance
    • Running on more than one node allows the system to better recover from hardware failures

How do we classify distributed databases

Scaling Up vs Scaling Out

  • scale up
    • Move the database to a bigger box
      (faster CPU, more cores, more memory, faster disk, FPGAs/GPUs).
    • Typically better performance, but expensive to buy & inflexible to scale
  • scale out
    • Distribute the database across multiple nodes.
    • Often slower (due to operational overhead), but a lot cheaper, more flexible, and more fault-tolerant.

Parallel vs Distributed Databases

  • Parallel Database
    • Runs on tightly-coupled nodes (e.g. a cluster, or a multi-processor / multi-core system)
    • Typically a scale-up architecture.
  • Distributed Database
    • Runs on loosely-coupled nodes (e.g. individual machines, cloud resources)
    • Typically a scale-out architecture.

Most distributed databases are also parallel

Analytical vs Operational Databases

  • Online Analytical Processing (OLAP)
    • Focus on a few, complex, long-running analytical queries
  • Online Transactional Processing (OLTP)
    • Focus on multiple concurrent, simple, short-running transactional queries

Architecture

  • Shared Memory
    • All nodes have shared access to both memory & disk
    • Typical architecture found in scale-up, parallel databases
    • Can achieve very high performance, but is hard to scale when running out of resources
    • Postgres, Oracle, SQL Server •Main-Memory DBMS like Apache Ignite, Hyper, SAP Hana
  • Shared Disk
    • Nodes have their own CPU & memory, but share the same disk
    • Most commonly found in traditional, enterprise-grade RDBMs systems
    • (Oracle, MS SQL Server)
  • Shared Nothing
    • Data is spread across independent nodes that only communicate via the network (Enterprise Mainframe with NAS (network-attached storage))
    • Typical architecture found in “web-scale”, scale-out systems:
      • Dataflow systems like Apache Hadoop / Spark / Flink
      • Distributed Databases, Key-Value Stores / snowflake
    • Robust architecture that offers availability & scalability, but can be slower than shared-memory

Distributed Query Processing

Data Shuffling Primitives (Partitions)

Broadcasting

  • Each node sends a copy of all their data to all other nodes

Range Partitioning

  • Each node receives a predefined range of the key space
  • may have unbalanced distribution of data
  • each nodes needs to know the ranging

Hash Partitioning

  • Each node receives a portion of the key space
    determined by a hash function
  • each node does not need to known the ranging (hash function enough)

Distributed Selection / Projection

No need to shuffle

1
SELECT col+1 WHERE col > 4

Distributed GroupBy / Aggregation

1
SELECT col, COUNT(*) GROUP BY col
  • each node has the hash function so they know data should be sent to which node based on that hash function (tuples of the same key would be sent to the same node)
  • Hash partition on grouping key to collect all tuples with same key (each node will receive all tuples of the same key thanks to hashing function)
  • Compute aggregation locally on each node

Distributed Joins

  • Shuffle data around to ensure that matching pairs are on the same node
  • Then run a local join algorithm

Optimal strategy depends on: - How data is partitioned / distributed across the nodes - The size of the individual tables

Co-Located Join
  • both tables are partitioned by the join keys
Asymmetric Repartition Join
  • If only one of the tables is partitioned by the join key:
  • hash-partition the other one by the join key, run join locally
Symmetric Repartition Join
  • the tables are partitioned differently
  • If both tables are roughly the same size, then we hash-partition both by the join key, then run the join locally
Broadcast Join
  • the tables are partitioned differently
  • but one table is a lot smaller than the other
  • broadcast the small table, then run the join locally

Examples for Distributed (and Parallel) Databases

  • In-Memory Database
    • Scale-up, shared-memory, parallel database engine
    • analytical purpose
    • Focus on CPU efficiency / multi-threading
    • Typical applications are time-critical systems
      • Real-time systems, Critical Business Intelligence Solutions, Dashboarding Backends, Trading Systems, ...
    • SAP Hana, Hyper, Apache Ignite
  • Distributed Key-Value Stores
    • Scale-out, shared-nothing, distributed, operational database engine
    • Focus on fault-tolerance and transaction speed
    • Amazon DynamoDB, Apache Cassandra, FoundationDB
  • Data Warehousing Systems
    • Shared-nothing, scale-out, distributed, analytical database engine
    • Typical use cases are Business Intelligence (BI), Reporting, Operational Management, ...
    • Redshift, Teradata, Vertica, Oracle Exadata, Postgres
  • Cloud RDBMs
    • Architectural evolution of Data Warehousing Systems for modern Cloud Environments
    • Builds on Shared Nothing, but keeps data in cloud storage
      • Nodes do not “own” data, they only access what they need to process the query from cloud storage.
      • Transactions and access consistency are handled centrally via a distributed key value store. ????????
    • focus on extreme elasticity
      • Cloud Resources are “infinite”, can be provisioned within seconds.
      • Allows accessing the data from 1000s of nodes concurrently.
      • Scale resources up & down exactly as and when needed.