Join size estimatoin

69 Views Asked by At
  • Relations

    • w(A,B):nw =1000,V(w,A)=20,V(w,B)= 60
    • x(B,C):nx =2000,V(x,B)=50,V(x,C)=100
    • y(C,F,D):ny =3000,V(y,C)=50,V(y,D)= 50
    • z(D,E):nz =4000,V(z,D)=40,V(w,E)=100
  • notations used

    • w,x,y,z are relations
    • nw,nx,ny,nz are number of tuples in respective relations
    • V(relation,attribute)= distinct attribute in a relation
  • Method used

    1.) (|r ⨝ s|) = nr * (ns/V(A,s))
    
  • Estimating the size of the results

    1.) w ⨝ x ⨝ y ⨝ z

  • w ⨝ x = nw * (nx/V(B,x)) = 1000 * (2000/50) = 40000

  • w ⨝ x ⨝ y = 40000 * (3000/50) = 2400000
  • w ⨝ x ⨝ y ◃▹ z = 2400000 * (4000/40) = 240000000

    2.) σC=20(y) ⨝ z

  • σC=20(y) = 3000/50 = 60

  • σC=20(y) ⨝ z = 60 * (4000/100) = 6000

Am I doing it right? can anyone please correct me if I'm wrong

Thanks in advance.

1

There are 1 best solutions below

1
AntC On

(Not really an answer, but too long for a comment.)

I'm guessing "V(relation,attribute)= distinct attribute in a relation" means the count of distinct values for attribute attribute in relation relation. This really helps very little, because it doesn't tell us whether/how many any of the attribute values counted in (say) V(w,B) are equal values to those counted in V(x,B). So to make a start on question

1.) w ⨝ x ...

The attribute in common between w, x is B. We're given there are 60 distinct B values in w i.e. V(w,B); and 50 distinct B values in x i.e. V(x,B).

  • There might be no values in common between the two relations, so the cardinality of joining them is zero; so the cardinality of the overall result of 1.) is zero.
  • or all 50 of the values in V(x,B) might be in common amongst the 60 values in V(w,B), so the cardinality of the join is at least 50. In that case, all we can say is that 10 of the values from V(w,B) can't appear, so the maximum cardinality is (1000 - 10) times the number of tuples matched in x.
  • then the actual cardinality just for joining w, x could be anything in between.

The same uncertainty applies for joining on y, z. Then your 'method used' (|r ⨝ s|) = nr * (ns/V(A,s)) must include a whole bunch of assumptions. I'm not going to guess at those assumptions, but they might be something about the distribution/spread of A, B values within w; and the spread of B values between w, x.

Does your teaching material include any such assumptions/explanation for the notation?

Likewise consider question

2.) σC=20(y) ⨝ z

We're given V(y,C) = 50; this doesn't tell us whether any of those 50 distinct values is C=20; if one is, the given doesn't tell us whether all 50 of the V(y,D) distinct values appear for the tuples with C=20, perhaps only one does.

So the cardinality of σC=20(y) is anywhere between zero and (3000 - 49), again before we join on z. I don't see that your 'method used' tells us anything about the cardinality resulting from a restriction to a single value σC=20 -- unless I again make a bunch of assumptions and guess some 'deep background' about what you might have been told about restrictions.

In general with actual values in actual tables, for values to be spread evenly is highly unrealistic. Usually commercial data exhibits both clustering of same values repeated in many rows and sparseness of many possible values not appearing at all. If you're joining relations, usually in realistic commercial data there are FOREIGN KEY constraints across same-named attributes, which means (say) all 50 V(w,B) values REFER TO 50 of the 60 values in V(x,B).

This whole question has the feel of an abstract, academic exercise of no practical use whatsoever.

Oh, one extra point. Natural join is a commutative operation: w ⨝ x == x ⨝ w; and associative (w ⨝ x) ⨝ y == w ⨝ (x ⨝ y), which is why Q 1. can write w ⨝ x ⨝ y ⨝ zwithout parens and without ambiguity. The 'method used' formula is not commutative, furthermore if you extend it to apply for a join of 4 relations, you're going to get a different result depending how you group up (parenthesise) the expression.