An intro to ClickHouse for Postgres Developers
By ClickHouse
Summary
Topics Covered
- Postgres Transactions ClickHouse Analytics
- ClickHouse Sparse Index Sorts Data
- CDC Syncs Postgres to ClickHouse
- ReplacingMergeTree Handles Updates
- Deletes Become Versioned Inserts
Full Transcript
This video is an introduction to Click House for Postgress developers. Let's
start by comparing the two databases across some high-level aspects. So we'll
start with workloads. So for Postgress that's going to be transactional workloads whereas for Clickhouse it's more analytical in terms of data storage. Postgress stores this in a row
storage. Postgress stores this in a row based format whereas Clickhouse stores data in columns data access patterns. So
with postgres we'll typically be doing single row lookups and maybe small range scans whereas in clickass it's more typical to do aggregations on a subset
of columns with updates and deletes in postgres they'll be immediate and in place whereas in click it's append only and it will then be eventually
consistent and the typical use case in postgres it's cred operations whereas in click it's typically largecale analytics both databases store data in tables and
each field has a type. There's a mapping from most Postgress types to Clickhouse types but they do use slightly different names. There are some differences in how
names. There are some differences in how we configure tables in Clickhouse and Postgress. So when it comes to indexing
Postgress. So when it comes to indexing in Postgres you might typically use a B tree but there are other indexes supported whereas in click we have the concept of a sparse index where only the
first row of a granular which is like a subset of the data will be indexed. For
primary keys in Postgres they that's used to ensure uniqueness whereas in click it's used to define the columns that are go going to go into that sparse
index that we just talked about data sorting in Postgress it's in insert order and we don't really need to think about that whereas in click house it is quite an important thing and so we need
to focus on our sorting key table engines in postgres we just have a single one whereas in clickas there are a wide range of engines to do different things and then finally ly null handling. So Postgres has native null
handling. So Postgres has native null support whereas in clickass we need to use the nullable type. The contrasting
strengths of Postgres and ClickUs create an ideal pairing for modern applications. Our data will begin its
applications. Our data will begin its journey in Postgres where our application handles transactions. So for
example, we might have a table like this users one and you see it has an ID as a primary key. We have a reputation, we
primary key. We have a reputation, we have a creation date, display name and a bunch of other fields as well. To make
this data available for analytics in Clickhouse, we need a way to sync changes in real time. A technique called change data capture has emerged as the way to do that. There are many products
for doing change data capture, both open source and commercial. I'm a bit biased, but I'd recommend using PDB if you're running Click House yourself and
Postgress CDZ via Click Pipes if you're using Clickhouse Cloud. Now let's have a look at the definition for that same users table in clickass. As mentioned
earlier, ClickUs has the concept of table engines. We'll be using the
table engines. We'll be using the replacing merge tree engine which allows us to handle updates. This table engine removes duplicate entries with the same sorting key value. It will keep the row
that has the highest version. Data is
still appending into the table, but in the background a merge process is running and that will take care of dduplicating the data. We also specify an order by expression which controls
the order in which click sorts the data.
We'd usually choose a field or fields that we're going to be querying a lot.
The three fields with the underscorep prefix are used to track various metadata around the CDC process. This
diagram here gives a conceptual understanding of what's happening when you use Postgress CDC in clickp. Step
one shows the initial snapshot of the two rows in Postgres and Click pipes performing the initial load of those two rows to click. Both rows are copied as is to click. Step two shows three
operations on the users table. So we've
got inserting a new row, updating an existing row, and deleting another row.
Click pipes will consume those operations from Postgress's logical replication slot. It's not shown in the
replication slot. It's not shown in the diagram for simplicity sake, but this data is written into an S3 bucket before being ingested into click house as versioned inserts in step three. Let's
zoom in and put the postgres and click queries side by side. So we can see the initial insert is pretty much the same on each side except the click house query has some extra metadata. The
update statement appears as a new version of the row with ID2 and the delete appears as a new version of ID 1 with is deleted marked as true. Because
of this, ClickUp has three additional rows compared to Postgress. If you want to see all of this in action, check out this video next.
Loading video analysis...