Understanding PostgreSQL’s “ctid": A Hidden Gem for Efficient Row Updates 💎

Hagen Hübel
4 min readSep 28, 2024

PostgreSQL is widely known for its powerful features, including advanced indexing, full-text search, and the ability to handle massive datasets efficiently. Among its lesser-known but highly useful features is the ctid system column. Although invisible to the average user, the ctid plays a crucial role in PostgreSQL’s ability to identify rows quickly and efficiently.

In this blog post, we’ll explore the concept of ctid, how it works, and provide a practical example where using ctid can improve performance in large-scale operations.

What is ctid in PostgreSQL?

ctid is a system column that exists in every PostgreSQL table. It provides the physical location of a row within its table by indicating the block number (or page) and the tuple offset (position within the block).

In simple terms, ctid tells PostgreSQL where a row is stored physically on disk. For example, if a row’s ctid is `(3, 1)`, it means that the row is located in block 3, position 1 of that block.

Example `ctid`:

ctid = (block_number, tuple_offset)

The ctid is unique for a row at any given point in time, but it’s not persistent. This means it can change if the row is updated, moved, or modified due to table maintenance operations like `VACUUM`.

Why is ctid Useful?

Normally, when updating or deleting rows in PostgreSQL, the database has to find the row first using some form of a lookup, like searching by a primary key or performing a scan of the table. Using ctid provides a direct reference to the row’s physical location, making certain operations, like updates or deletes, more efficient.

This is especially important when dealing with batch updates or mass updates on large datasets. Instead of relying on complex queries or scanning through indexes, you can use ctid to reference the exact row in the table and update it immediately.

When Should You Use ctid?

- Batch updates where performance is critical.
- When performing operations that require direct access to specific rows, such as during ETL (Extract, Transform, Load) processes.
- For temporary, efficient row identification during intermediate steps in a transaction or update operation.

Practical Example: Batch Updates Using `ctid`

In one of our projects, concrete at infobud.news, a personalized News platform, we faced a scenario where we needed to update many rows. The rows that required updates were in an article table, and we needed to update the category_uuid column based on a lookup from another table, category. Since this operation involved millions of rows, performance was a critical concern and we decided to perform batches of 1000 records per loop.

The optimized Update Query:

The typical approach would have been to join the two tables on a shared key and then update the rows. However, in this case, we decided to use ctid to speed up the operation.

Here’s what the query looks like using ctid:

WITH to_update AS (
SELECT a.ctid AS article_ctid, c.uuid AS category_uuid
FROM public.article a
JOIN public.category c ON a.category_id = c.id
WHERE a.category_id IS NOT NULL
AND a.category_uuid IS NULL
LIMIT 1000 -- Process updates in batches
)
UPDATE public.article a
SET category_uuid = to_update.category_uuid
FROM to_update
WHERE a.ctid = to_update.article_ctid;

Explanation:

- Batching: The `WITH` clause (“CTE”) limits the update to 1000 records at a time, making the process more manageable for large datasets.
- Direct Row Reference: The ctid is used to match rows directly based on their physical location in the table (`a.ctid = to_update.article_ctid`), bypassing the need to perform an index lookup on logical keys like `id`.

Performance Benefits:

Using ctid allows PostgreSQL to find and update rows faster, as it no longer needs to search for rows by logical keys. Instead, it knows exactly where the row is located on disk, making the update process more efficient.

In this specific use case, the update time per batch was significantly reduced compared to using traditional key-based joins. This performance gain is particularly noticeable when dealing with millions of rows.

Key Points to Remember:

1. `ctid` is not persistent: The `ctid` value of a row may change if the row is updated or moved by PostgreSQL during internal operations like `VACUUM`.
2. Efficient for large updates: For large datasets or batch processing, using `ctid` can significantly improve performance by directly referencing the row’s physical location.
3. Not a replacement for primary keys: While `ctid` is useful for temporary identification of rows in a transaction or batch process, it should not replace logical identifiers like primary keys (`id`).

How to Check `ctid` Values:

You can query the `ctid` of rows directly to see the physical location of each row:

SELECT ctid, * FROM public.article WHERE id = 123;

This will return the `ctid` alongside all other columns for the row with `id = 123`. You’ll see something like:

ctid  | id   | title           | category_id | category_uuid
- - - | - - -| - - - - - - - - | - - - - - - | - - - - - - - -
(3,1) | 123 | Example Article | 5 | NULL

Conclusion

The `ctid` system column is a powerful feature in PostgreSQL, providing a direct reference to the physical location of rows. While it’s not meant for long-term row identification, it can greatly improve the performance of batch updates and other operations that involve direct row manipulation.

In scenarios like ours — updating thousands of rows in batches — using `ctid` allowed for faster updates by directly referencing the rows that needed to be updated. Although `ctid` is generally hidden, understanding and leveraging it in the right situations can provide significant performance gains.

Have you used `ctid` in your PostgreSQL operations? Let us know how it worked for you!

--

--

Hagen Hübel
Hagen Hübel

Responses (2)