Can Selections on Cartesian Products be re-written as Theta Joins.?

137 Views Asked by At

I have read it somewhere in article on query optimization that instead applying selections on cartesion product, it is better to use selection theta joins. So basically if we have 2 employee tables with age as column then

select name from A inner join B on A.age>B.age

is faster than

select name from A,B where A.age>B.age

But we all know that first query is equivalent or has same meaning as query 2. Then why it is said that applying selections on theta join is faster than applying select on cartesion product I am trying to learn query optimization. And I read following statement

Selection(E1 *E2)= E1(theta join operator)E2

kindly read point 4 of this article--- https://www.geeksforgeeks.org/query-optimization-in-relational-algebra/

What is the difference between the 2 queries in terms of performance?

2

There are 2 best solutions below

7
Thorsten Kettner On

This is the first time I have heard the term "Theta Join". Database developers simply call these inner joins. Your first query

select name from A inner join B on A.age > B.age

shows an inner join. Your second query

select name from A,B where A.age > B.age

also shows an inner join, only with deprecated syntax. Before explicit joins ([INNER] JOIN, LEFT [OUTER] JOIN, CROSS JOIN) etc. got introduced in the SQL standard in 1992, we just listed the tables comma-separated in the FROM clause and used the WHERE clause to state the join criteria. While this old syntax still works, it is looked upon with disdain if you still use it.

The following is also an inner join:

select name from A cross join B where A.age > B.age

Only, it is a hidden (or better: obfuscated) inner join. The developer makes it look like a cross join of the tables and then adds the join criteria later. This is syntactically correct, but semantically it isn't. So, we shouldn't write such queries.

To the DBMS is should make no difference, if you write query #1, #2 or #3. It is all the same query in the end. It would speak against the DBMS, didn't it recognize this and come up with the same execution plan for all these queries. The article mentioning that query #1 is faster than query #2 is either wrong or is talking about some particular DBMS and probably even a particular version of that DBMS then.

Maybe still, the article didn't want to talk about how to write queries, but how to process them. If a DBMS would always build a cartesian product and only then filter the data, it might often be slow. It is usually better to filter early and work with smaller data sets.

0
SQLpro On

In a perfect world there will be no differences in terms of performances between those two queries, because the algrebizer part of the query engine, rewrite your query by separating the different operations into elementary operations of relational algebra. However, the WHERE clause should be a simple filter and not a match of information coming from different tables, which, of course, is a disguised form of join.

In other words your request:

select name from A,B where A.age > B.age

Starts by being rewritten in relational algebra in the form:

select name from A inner join B on A.age > B.age

Because a join and a cartesian products are two different operations in terms of relational algebrae...

It is from there that the optimizer will try different strategies to find the shortest execution path.

It therefore follows that, as it is ultimately the same query in terms of "root", the optimization will be strictly identical...

However, as the devil is in the details and as the implementation of an RDBMS is in the physical domain, and as SQL Server, even if it is equipped with one of the best optimizers there are, there are limits. In particular, the time to find an optimal execution plan is an operation which must imperatively be completed “before a certain time”. Then the query excution plan resulting could be not paefectly optimal, but sufficiently good to be used..

Then when your write your query into an appropriate syntax, you will help the optimizer to spend more time to do a better job...

The role of the developer should be to apply good practices... Which is rarely the case, because the set world of relational databases is often ignored by developers who think iteratively...

For example, functions in queries should only be used when there is no other solution. However, often a function can be replaced by a call to a table. For example, date calculations, and generally all date functions, gain performance by being replaced by a calendar table (the famous "tally" tables).