A learned cost model to assist the query optimization process
With the rise of Deep Learning on domains like Natural Language Processing and Computer Vision, many approaches have been proposed in Databases Systems Community to integrate these methods into many aspects of the functionality of a database system. The query optimization process is a well-studied problem and many efforts have been made to improve the execution of query workloads in such a system, by improving the cardinality selectivity estimations of the query optimizer or improving the query execution time and cost predictions. However, most of these approaches make a strong assumption about the quality of the query plan provided by the optimizer. QPSeeker's goal is to learn the aforementioned distributions for a set of plans per query and provide the optimizer a better view of the query plan space, during the construction of the query plan, combining both the underlying table data and the provided query workload.
QPSeeker consits of four main components:
- Query Encoder, a feed forward network, which receives the relations and the joins in the query and outputs a query embedding vector.
- TaBERT, a BERT based model for sructural data, which receives the query and for each table outputs the joint representation of the query and the underlying structured data.
- Query Plan Encoder, constructs a left-depth tree query plan tree using LSTM cells, similar to the optimizer's plan and the root outputs the a query plan embedding vector.
- Variational Autoencoder, an autoencoder, which receives the query and query’s plan embedding vectors and computes the reconstruction loss of the common query plan space embedding vector. QPSeeker enforces a Gaussian structure to the latent space of the variational autoencoder in order to simulate the generation process of the sets of the query plans (Join Ordering) along with their statistics (Cost Model) for a given query. In this manner, QPSeeker can effectively predict the distributions of the cardinalities, costs and execution times of the query plans sets over the queries in the workload.