LongCut logo

An intro to ClickHouse for Postgres Developers

By ClickHouse

Summary

Topics Covered

  • Postgres Transactions ClickHouse Analytics
  • ClickHouse Sparse Index Sorts Data
  • CDC Syncs Postgres to ClickHouse
  • ReplacingMergeTree Handles Updates
  • Deletes Become Versioned Inserts

Full Transcript

This video is an introduction to Click House for Postgress developers. Let's

start by comparing the two databases across some high-level aspects. So we'll

start with workloads. So for Postgress that's going to be transactional workloads whereas for Clickhouse it's more analytical in terms of data storage. Postgress stores this in a row

storage. Postgress stores this in a row based format whereas Clickhouse stores data in columns data access patterns. So

with postgres we'll typically be doing single row lookups and maybe small range scans whereas in clickass it's more typical to do aggregations on a subset

of columns with updates and deletes in postgres they'll be immediate and in place whereas in click it's append only and it will then be eventually

consistent and the typical use case in postgres it's cred operations whereas in click it's typically largecale analytics both databases store data in tables and

each field has a type. There's a mapping from most Postgress types to Clickhouse types but they do use slightly different names. There are some differences in how

names. There are some differences in how we configure tables in Clickhouse and Postgress. So when it comes to indexing

Postgress. So when it comes to indexing in Postgres you might typically use a B tree but there are other indexes supported whereas in click we have the concept of a sparse index where only the

first row of a granular which is like a subset of the data will be indexed. For

primary keys in Postgres they that's used to ensure uniqueness whereas in click it's used to define the columns that are go going to go into that sparse

index that we just talked about data sorting in Postgress it's in insert order and we don't really need to think about that whereas in click house it is quite an important thing and so we need

to focus on our sorting key table engines in postgres we just have a single one whereas in clickas there are a wide range of engines to do different things and then finally ly null handling. So Postgres has native null

handling. So Postgres has native null support whereas in clickass we need to use the nullable type. The contrasting

strengths of Postgres and ClickUs create an ideal pairing for modern applications. Our data will begin its

applications. Our data will begin its journey in Postgres where our application handles transactions. So for

example, we might have a table like this users one and you see it has an ID as a primary key. We have a reputation, we

primary key. We have a reputation, we have a creation date, display name and a bunch of other fields as well. To make

this data available for analytics in Clickhouse, we need a way to sync changes in real time. A technique called change data capture has emerged as the way to do that. There are many products

for doing change data capture, both open source and commercial. I'm a bit biased, but I'd recommend using PDB if you're running Click House yourself and

Postgress CDZ via Click Pipes if you're using Clickhouse Cloud. Now let's have a look at the definition for that same users table in clickass. As mentioned

earlier, ClickUs has the concept of table engines. We'll be using the

table engines. We'll be using the replacing merge tree engine which allows us to handle updates. This table engine removes duplicate entries with the same sorting key value. It will keep the row

that has the highest version. Data is

still appending into the table, but in the background a merge process is running and that will take care of dduplicating the data. We also specify an order by expression which controls

the order in which click sorts the data.

We'd usually choose a field or fields that we're going to be querying a lot.

The three fields with the underscorep prefix are used to track various metadata around the CDC process. This

diagram here gives a conceptual understanding of what's happening when you use Postgress CDC in clickp. Step

one shows the initial snapshot of the two rows in Postgres and Click pipes performing the initial load of those two rows to click. Both rows are copied as is to click. Step two shows three

operations on the users table. So we've

got inserting a new row, updating an existing row, and deleting another row.

Click pipes will consume those operations from Postgress's logical replication slot. It's not shown in the

replication slot. It's not shown in the diagram for simplicity sake, but this data is written into an S3 bucket before being ingested into click house as versioned inserts in step three. Let's

zoom in and put the postgres and click queries side by side. So we can see the initial insert is pretty much the same on each side except the click house query has some extra metadata. The

update statement appears as a new version of the row with ID2 and the delete appears as a new version of ID 1 with is deleted marked as true. Because

of this, ClickUp has three additional rows compared to Postgress. If you want to see all of this in action, check out this video next.

Loading...

Loading video analysis...