Enhancing Performance in PostgreSQL: The Journey to Optimized Cosine Distance Searches
Many are unaware that PostgreSQL can be leveraged for similarity searches with vectors, such as embeddings, to build Retrieval-Augmented Generation (RAG) applications powered by Large Language Models (LLMs). In the current tech landscape, we encounter numerous specialized vector databases like Qdrant, designed specifically for such tasks. Yet, the guiding principle I adhere to when assembling my tech stack is the pursuit of simplicity. The fewer the components, the lower the chance of failure, leading to more peaceful nights for any CTO or engineer. Since we already rely on PostgreSQL for nearly all our needs, it is wise to avoid overburdening our stack until absolutely necessary. In this article, I will show a database solution built on PostgreSQL and the remarkable pgvector extension, filling the role nowadays occupied by Qdrant and similar technologies.
Developing the RAG pipeline for infobud.ai presented a unique set of challenges. Infobud.ai is an intelligent news aggregator powered by AI. It detects relevant topics across all media platforms and auto-generates articles tailored to users’ special interests. Early in the development process, I encountered perplexing performance bottlenecks when dealing with embeddings and vectors. These issues were soon followed by even more baffling challenges, particularly sorting issues that arose when using LIMIT statements in PostgreSQL. Such problems were especially pronounced in conjunction with Common Table Expressions (CTEs), underscoring the technical hurdles we needed to overcome to efficiently process and deliver highly relevant content to our users.
Before delving into the technical intricacies, it’s important to outline the core problem we aimed to solve: Our goal was to enhance the user experience by rendering not only a specific article and its summary in a selected language but also additional articles and their summaries that bear a high degree of similarity to the original article. This approach aims to detect the article’s relevance across the entire landscape of global news. To achieve this, we employed a method known as cosine similarity search (more accurately described as cosine distance search) on a table pre-populated with vector embeddings for each article.
Cosine similarity (or distance) is a metric used to measure how similar two documents are, irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In the context of our application, these vectors represent the embeddings of articles, which are essentially numerical representations capturing the essence of the articles’ content. By computing the cosine distance between these vectors, we can identify articles whose content is most similar to that of the original article, thereby suggesting relevant reads to our users, or treating this similarity as a signal of higher relevance when multiple news agencies are covering the same story.
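For reference, the distance that pgvector’s <=> operator computes is simply one minus the cosine similarity of the two embedding vectors A and B:

cosine_distance(A, B) = 1 − cos(θ) = 1 − (A · B) / (‖A‖ · ‖B‖)

A distance of 0 therefore means the vectors point in the same direction (maximum similarity), while larger values indicate less related content. This “1 minus” relationship will come up again further down.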
The implementation of this search functionality within a PostgreSQL database presents unique challenges, particularly when it comes to efficiently querying large datasets and ensuring that the results are both accurate and fast. The performance and sorting issues we encountered shed light on the complexities of working with advanced database functionalities and vectorized data. This is especially true regarding the use of indexes and the circumstances under which they might be bypassed, leading to the reliance on the more time-consuming quicksort algorithm.
I initially developed a function that searches for similar summaries in the database, utilizing the powerful pgvector extension. This function filters results in a specific language (query_lang) to include only those summaries that are not older than a specified maximum number of days (max_days) and that meet or exceed a certain similarity factor (match_threshold). Finally, I am only interested in returning the very first match_count records. As we will later see, this was the biggest challenge.
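The original function is not reproduced here, but a minimal sketch of that first attempt could look roughly like the following. All table and column names (summaries, embeddings, embedding, lang, pub_date, summary_id) as well as the vector dimension are assumptions made for illustration, and the sketch deliberately contains the two patterns we will identify as problematic further down:

CREATE OR REPLACE FUNCTION match_summaries(
    query_embedding vector(1536),      -- dimension is an assumption
    query_lang      text,
    max_days        integer,
    match_threshold double precision,
    match_count     integer
)
RETURNS TABLE (summary_id bigint, similarity double precision)
LANGUAGE sql STABLE
AS $$
    SELECT s.id,
           1 - (e.embedding <=> query_embedding) AS similarity
    FROM   summaries s
    JOIN   embeddings e ON e.summary_id = s.id
    WHERE  s.lang = query_lang
      AND  s.pub_date >= now() - make_interval(days => max_days)
      AND  1 - (e.embedding <=> query_embedding) > match_threshold
    ORDER  BY similarity DESC
    LIMIT  match_count;
$$;

A call such as SELECT * FROM match_summaries(some_embedding, 'en', 30, 0.8, 10); would then return the ten best matches for that embedding.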
This PostgreSQL function performs admirably, its design inspired by extensive research and readings I undertook beforehand. However, it faces one significant challenge: poor performance.
Let’s examine the metrics: In a database containing 100,000 records, any search operation exceeded 1200 milliseconds, whether conducted on my MacBook Pro M1 or on Hetzner Cloud servers equipped with 4 vCPUs and 16 GB of RAM allocated for PostgreSQL. Within our system, designed to cluster articles and their related topics through relevance search, such queries are executed sequentially, cumulatively requiring several seconds to yield a result. Clearly, there was substantial room for optimization. Let’s delve into the solutions:
All necessary columns were covered by an index, whether it was the language, the publication date (“pub_date”), columns containing foreign keys that create relationships between tables, or the vector column used for storing embeddings. The latter was indexed as follows:
create index on embeddings using ivfflat (embedding vector_cosine_ops) with (lists = 100);
*) Later in this post, we will identify a method that is better and more efficient than ‘ivfflat’.
However, the execution plan was a mess, revealing significant inefficiencies: almost no indexes were utilized, with the exception of primary keys for certain relations.
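If you want to reproduce such an analysis on your own data, EXPLAIN (ANALYZE, BUFFERS) is the tool of choice. Depending on how the function is defined, EXPLAIN on a function call may only show an opaque Function Scan node, so it is often easier to analyze the query body directly; a sketch, using the assumed names from above and a tiny stand-in vector literal:

EXPLAIN (ANALYZE, BUFFERS)
SELECT s.id,
       1 - (e.embedding <=> '[0.01, 0.02, 0.03]'::vector) AS similarity  -- replace with a real query embedding of the proper dimension
FROM   summaries s
JOIN   embeddings e ON e.summary_id = s.id
WHERE  s.lang = 'en'
  AND  s.pub_date >= now() - interval '30 days'
  AND  1 - (e.embedding <=> '[0.01, 0.02, 0.03]'::vector) > 0.8
ORDER  BY similarity DESC
LIMIT  10;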
Further investigation led me to two significant issues one should be aware of when working with PostgreSQL:
1. The following expression will never properly utilize the index, not even partially:
where 1 - (e.embedding <=> query_embedding) > match_threshold
The reason for that is the calculation “1 minus a term using the <=> operator”: as soon as the <=> operator is wrapped in a mathematical expression, the index can no longer be used for the comparison.
2. The ORDER BY clause on the similarity column using the descending (DESC) keyword will also bypass any index when the column is based on the <=> operator:
SELECT ...,
       (e.embedding <=> query_embedding) AS similarity
FROM   embeddings e
WHERE  (e.embedding <=> query_embedding) > match_threshold
ORDER  BY similarity DESC
We observe a significant improvement in the query plan and, consequently, in query performance when we alter the ORDER BY statement from DESC to ASC:
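Expressed as a minimal sketch (names as assumed earlier, query_embedding standing in for the function parameter or a concrete vector literal), the index-friendly formulation orders by the bare distance in ascending order, since the nearest neighbors are simply the smallest distances:

SELECT e.summary_id,
       e.embedding <=> query_embedding AS distance
FROM   embeddings e
ORDER  BY e.embedding <=> query_embedding ASC
LIMIT  10;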
While a quicksort algorithm is still in use, it now consumes significantly less memory and processes fewer rows, only a quarter of what was previously required, thanks to a more effective pre-selection of records. We now leverage the ivfflat index, which dramatically reduces the number of processed embeddings from 128,628 records to just 2. The memoization step has also disappeared, as it is no longer necessary. Importantly, the parallel sequential scan on embeddings has been relegated to the final step, applied to only a small subset of records instead of the vast majority.
Let’s examine the revised and optimized query, which now spans three times as many lines of code as the original. This expansion significantly helps PostgreSQL and pgvector to work more efficiently and save time.
Postgres offers a great feature for this, called WITH queries:
Common Table Expressions (CTEs / WITH-queries)
Common Table Expressions (CTEs), also known as WITH queries, are a powerful feature in PostgreSQL that allow for more organized and flexible SQL queries. CTEs enable you to define temporary result sets that you can reference within a larger query, acting like a temporary table that exists only during the execution of the query. This feature is particularly useful for breaking down complex queries into simpler, more manageable parts, making the SQL statements easier to read and maintain. Additionally, CTEs can be recursive, allowing them to reference themselves, which is ideal for querying hierarchical or tree-structured data. By using CTEs, developers can improve the performance and efficiency of their queries, as they allow for better optimization by the database engine and can eliminate the need for multiple separate queries or the use of temporary tables.
By utilizing multiple CTEs, stacked sequentially, we can pre-select our embeddings using low-cost operations, such as a simple language and date comparison, before funneling these filtered results into the more costly operations:
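Sketched with the assumed table and column names from earlier (not the verbatim production query), the first CTE does nothing but this inexpensive pre-selection; the snippet continues in the next one:

WITH filtered AS (
    -- cheap filters only: language and publication date, no vector math yet
    SELECT s.id AS summary_id,
           e.embedding
    FROM   summaries s
    JOIN   embeddings e ON e.summary_id = s.id
    WHERE  s.lang = query_lang
      AND  s.pub_date >= now() - make_interval(days => max_days)
),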
… followed by the query using cosine distance for similarity comparison:
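Continuing the sketch, the distance computation gets its own CTE, while the “1 minus” arithmetic and the threshold are pushed into a subsequent one:

nearest AS (
    -- the <=> operator is used bare and ordered ascending,
    -- so the planner is free to use the pgvector index
    SELECT f.summary_id,
           f.embedding <=> query_embedding AS distance
    FROM   filtered f
    ORDER  BY f.embedding <=> query_embedding ASC
),
scored AS (
    -- only here do we turn the distance into a similarity and apply the threshold
    SELECT n.summary_id,
           n.distance,
           1 - n.distance AS similarity
    FROM   nearest n
    WHERE  1 - n.distance > match_threshold
)
SELECT summary_id, similarity
FROM   scored
ORDER  BY distance ASC      -- ascending distance = descending similarity
LIMIT  match_count;

As the next section shows, the placement of ORDER BY and LIMIT in this last step still hides a subtle catch.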
The most significant improvement here is our use of the <=> operator in a manner that enables the corresponding index to be utilized, because we a) do not embed the <=> operator within a mathematical expression, and b) do not sort by this column in descending order. Instead, we shift the necessary calculation of 1 minus the distance (which yields the similarity) to a subsequent CTE and perform an ORDER BY ASC. (The "ASC" might be redundant and is included here solely for the sake of completeness and clarity.)
This is what nobody tells you, the insight that often goes unmentioned! I discovered it through extensively reading various issues on the GitHub repository dedicated to the pgvector extension.
LIMIT and ORDER BY in conjunction can mess up your results
But there is another wild issue. What we have achieved so far is to reduce the query time from 1200 ms to 120 ms, which is a nice effect.
However, comparing the result sets before and after the optimization reveals a small difference which is hard to detect at first sight:
When ordering by similarity and returning only, let’s say, the very first 10 records, we receive similarities of 0.93, 0.91, 0.89, 0.85, and so on, while the un-optimized query instead returns 0.93, 0.92, 0.91, 0.90, 0.89, 0.88, and so on.
As we observe in these results, the third row from the previous query is no longer present. This change is due to the LIMIT 10 statement. Depending on the total number of records, increasing the value in the LIMIT statement may or may not bring it back. Notably, when I set the limit to 1000 records in my dataset, the row reappears. What exactly occurred?
Stacked CTE queries are very sensitive when it comes to ORDER BY statements in conjunction with LIMIT. Internally, the order of rows can change if we do not explicitly tell PostgreSQL how we want them to be sorted.
It took me some time to devise a workaround, but eventually I developed another approach involving a wrapped SELECT statement that applies the final LIMIT after all nested queries. However, it is crucial to replicate all ORDER BY statements throughout the entire sequence of stacked CTEs. Failing to do so will lead us back to the previously mentioned issue.
It may seem unconventional, but with this method we encapsulate all nested and stacked CTEs within a subquery and apply the LIMIT statement at the very end. Essentially, this ensures that all records are correctly ordered by descending similarity before we limit the number of records.
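Sketched with the same assumed names as before, the final shape looks roughly like this: every step carries its own ORDER BY, and the LIMIT is applied only by the outermost SELECT:

SELECT summary_id, similarity
FROM (
    WITH filtered AS (
        SELECT s.id AS summary_id, e.embedding
        FROM   summaries s
        JOIN   embeddings e ON e.summary_id = s.id
        WHERE  s.lang = query_lang
          AND  s.pub_date >= now() - make_interval(days => max_days)
    ),
    nearest AS (
        SELECT f.summary_id,
               f.embedding <=> query_embedding AS distance
        FROM   filtered f
        ORDER  BY f.embedding <=> query_embedding ASC
    ),
    scored AS (
        SELECT n.summary_id,
               n.distance,
               1 - n.distance AS similarity
        FROM   nearest n
        WHERE  1 - n.distance > match_threshold
        ORDER  BY n.distance ASC        -- the ORDER BY is replicated in every step
    )
    SELECT summary_id, distance, similarity
    FROM   scored
    ORDER  BY distance ASC
) AS ranked
ORDER  BY similarity DESC               -- descending similarity, enforced at the very end
LIMIT  match_count;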
Update
Gaylord Aulke was a great help to me throughout this journey. Recently, he sent me an article stating that the ‘ivfflat’ index has poor performance and should be replaced with the ‘hnsw’ index type. Although switching to ‘hnsw’ significantly speeds up similarity searches, the practices described in this article still make sense and remain necessary.
CREATE INDEX ON embeddings
USING hnsw(embedding vector_cosine_ops)
WITH (m = 8, ef_construction = 100);
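If you experiment with this index type, two session-level settings are worth knowing about; the concrete values below are merely examples, not recommendations. maintenance_work_mem influences how fast the index can be built, and hnsw.ef_search controls the recall/speed trade-off at query time:

-- more memory for the index build (session-level; adjust to your node)
SET maintenance_work_mem = '1GB';

-- how many candidates hnsw inspects per query: higher = better recall, slower search
SET hnsw.ef_search = 100;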
Each of the two index types has its own strengths and use cases, especially in the context of approximate nearest neighbor (ANN) search, which is crucial for efficiently handling high-dimensional data such as ours.
hnsw stands for “Hierarchical Navigable Small World”. There are some advantages to preferring hnsw over ivfflat, but also some downsides we have to deal with:
(positive) Accuracy vs. Speed: HNSW generally offers higher accuracy at the cost of higher resource consumption (memory and initial indexing time), making it suitable for applications where search precision is crucial. IVFFlat provides a more flexible balance between speed and accuracy, making it effective for large-scale applications where some compromise on accuracy is acceptable for faster search performance.
(negative) Resource Usage: HNSW tends to be more memory-intensive due to its complex graph structure, whereas IVFFlat can be more efficient in memory usage, depending on the number and size of the clusters. We had to adjust our maximum memory consumption settings for Postgres on the node and also tweak some internal memory configurations, which will be detailed in a follow-up post.
Conclusion
Through this exploration, we’ve uncovered that the <=> operator will not utilize any index when embedded within a mathematical expression. Furthermore, we discovered that using ORDER BY xx DESC prevents PostgreSQL from leveraging the cosine similarity index provided by the pgvector extension.
Common Table Expressions (CTEs) have proven invaluable, allowing us to decompose complex queries into more manageable segments, thereby assisting PostgreSQL in handling vast numbers of records and intricate comparison mechanisms.
Most importantly, we've learned the critical importance of applying the LIMIT statement with careful consideration, especially when combined with ORDER BY clauses and CTEs, to ensure not only optimal query performance but also correct and complete results.