SQL Natural Join Dilemma re Order, Customer & Salesman tables

2.1k Views Asked by At

I was given the following question:

Write a SQL statement to make a join on the tables salesman, customer and orders in such a form that the same column of each table will appear once and only the relational rows will come.

I executed the following query:

SELECT * FROM orders NATURAL JOIN customer NATURAL JOIN salesman;

However, I was not expecting the following result:

enter image description here

enter image description here

My doubt lies in step 2.

Why didn't I get the rows with salesman_id 5002, 5003 & 5007?

I know that natural join uses the common columns to finalize the rows.

Here all the Salesman_ids are present in the result from step 1.

Why isn't the final result equal to the table resulting from step 1 with non duplicate columns from salesman added to it?

2

There are 2 best solutions below

9
On

... the same column of each table will appear once

Yes Natural Join does that.

... and only the relational rows will come.

I don't know what that means.

I disagree with those who are saying: do not use Natural Join. But it is certainly true that if you plan to use Natural Join for your queries, you must design the schema so that (loosely speaking) 'same column name means same thing'.

Then this exercise is teaching you the dangers of having same-named columns that do not mean the same thing. The danger is sometimes called the 'connection trap' or 'join trap'. (Not really a trap: you just need to learn ways to write queries over poorly-designed schemas.)

A more precise way to put that: if you have columns named the same in two different tables, the column must be a key of at least one of them. So:

  • city is not a key in any of those tables, so should not get 'captured' in a Natural Join.
  • salesman_id is not a key in table customer, so should not get 'captured' in the join from table orders.

The main way to fix up this query is by renaming some columns to avoid 'capture' (see below). It's also worth mentioning that some dialects of SQL allow:

SELECT *
FROM orders
NATURAL JOIN customer ON customer_id
...

The ON column(s) phrase means: validate that the only columns in common between the two tables are those named. Otherwise reject the query. So your query would be rejected.

Renaming means that you shouldn't use SELECT *. (Anyway, that's dangerous for 'production code' because your query might produce different columns each time there's a schema change.) The easiest way to tackle this might be to create three Views for your three base tables, with the 'accidental' same-named columns given some other name. For this one query:

SELECT ord_no, purch_amt, ord_date, customer_id,
       salesman_id AS order_salesman_id
FROM orders
NATURAL JOIN (SELECT customer_id, cust_name,
                     city AS cust_city, grade,
                     salesman_id AS cust_salesman_id
              FROM customer) AS customer_grr
NATURAL JOIN (SELECT salesman_id, name,
                     city AS salesman_city,
                     commission
              FROM salesman) AS salesman_grr

I'm using explicit AS to show renaming. Most dialects of SQL allow you to omit that keyword; just put city cust_city, ....

0
On

Why isn't the final result equal to the table resulting from step 1 with [...]?

Because natural join doesn't work how you expect--whatever that is, since you don't say.

In terms of relational algebra: Natural join returns the rows
• whose column set is the union of the input column sets and
• that have a subrow in both inputs.

In business terms: Every table & query result holds the rows that make some statement template--its (characteristic) predicate--its "meaning"--into a true statement. The designer gives the predicates for the base tables. Here, something like:

Orders = rows where
    order [ord_no] ... and was sold by salesman [salesman_id] to customer [customer_id] 
Customer = rows where
    customer [customer_id] has name [cust_name] and lives in city [city]
        and ... and is served by salesman [salesman_id]
Salesman = rows where
    salesman [salesman_id] has name [name] and works in city [city] ...

Natural join is defined so that if each input holds the rows that make its predicate into a true statement then their natural join holds the rows that make the AND/conjunction of those predicates into a true statement. So (your query):

Orders natural join Customer natural join Salesman = rows where
    order [ord_no] ... and was sold by salesman [salesman_id] to customer [customer_id] 
and customer [customer_id] has name [cust_name] and lives in city [city]
        and ... and is served by salesman [salesman_id]
and salesman [salesman_id] has name [name] and works in city [city] ...

So that natural join is asking for rows where, among other things, the customer lives in the city that the salesman works in. If that's not what you want, then you shouldn't use that expression.

Note how the meaning of a natural join of tables is a (simple) function of the meanings of its input tables. That's so for all the relational operators. So every query expression has a meaning, built from its base table meanings & relational operators.
Is there any rule of thumb to construct SQL query from a human-readable description?

Why didn't I get the rows with salesman_id 5002, 5003 & 5007?

Because those salesmen don't work a city in which one of their customers lives.