Relations
- w(A,B):nw =1000,V(w,A)=20,V(w,B)= 60
- x(B,C):nx =2000,V(x,B)=50,V(x,C)=100
- y(C,F,D):ny =3000,V(y,C)=50,V(y,D)= 50
- z(D,E):nz =4000,V(z,D)=40,V(w,E)=100
notations used
- w,x,y,z are relations
- nw,nx,ny,nz are number of tuples in respective relations
- V(relation,attribute)= distinct attribute in a relation
Method used
1.) (|r ⨝ s|) = nr * (ns/V(A,s))Estimating the size of the results
1.) w ⨝ x ⨝ y ⨝ z
w ⨝ x = nw * (nx/V(B,x)) = 1000 * (2000/50) = 40000
- w ⨝ x ⨝ y = 40000 * (3000/50) = 2400000
w ⨝ x ⨝ y ◃▹ z = 2400000 * (4000/40) = 240000000
2.) σC=20(y) ⨝ z
σC=20(y) = 3000/50 = 60
- σC=20(y) ⨝ z = 60 * (4000/100) = 6000
Am I doing it right? can anyone please correct me if I'm wrong
Thanks in advance.
(Not really an answer, but too long for a comment.)
I'm guessing "V(relation,attribute)= distinct attribute in a relation" means the count of distinct values for attribute
attributein relationrelation. This really helps very little, because it doesn't tell us whether/how many any of the attribute values counted in (say)V(w,B)are equal values to those counted inV(x,B). So to make a start on questionThe attribute in common between
w, xisB. We're given there are 60 distinctBvalues inwi.e.V(w,B); and 50 distinctBvalues inxi.e.V(x,B).V(x,B)might be in common amongst the 60 values inV(w,B), so the cardinality of the join is at least 50. In that case, all we can say is that 10 of the values fromV(w,B)can't appear, so the maximum cardinality is (1000 - 10) times the number of tuples matched inx.w, xcould be anything in between.The same uncertainty applies for joining on
y, z. Then your 'method used'(|r ⨝ s|) = nr * (ns/V(A,s))must include a whole bunch of assumptions. I'm not going to guess at those assumptions, but they might be something about the distribution/spread ofA, Bvalues withinw; and the spread ofBvalues betweenw, x.Does your teaching material include any such assumptions/explanation for the notation?
Likewise consider question
We're given
V(y,C) = 50; this doesn't tell us whether any of those 50 distinct values isC=20; if one is, the given doesn't tell us whether all 50 of theV(y,D)distinct values appear for the tuples withC=20, perhaps only one does.So the cardinality of
σC=20(y)is anywhere between zero and (3000 - 49), again before we join onz. I don't see that your 'method used' tells us anything about the cardinality resulting from a restriction to a single valueσC=20-- unless I again make a bunch of assumptions and guess some 'deep background' about what you might have been told about restrictions.In general with actual values in actual tables, for values to be spread evenly is highly unrealistic. Usually commercial data exhibits both clustering of same values repeated in many rows and sparseness of many possible values not appearing at all. If you're joining relations, usually in realistic commercial data there are
FOREIGN KEYconstraints across same-named attributes, which means (say) all 50V(w,B)valuesREFER TO50 of the 60 values inV(x,B).This whole question has the feel of an abstract, academic exercise of no practical use whatsoever.
Oh, one extra point. Natural join
⨝is a commutative operation:w ⨝ x==x ⨝ w; and associative(w ⨝ x) ⨝ y == w ⨝ (x ⨝ y), which is why Q 1. can writew ⨝ x ⨝ y ⨝ zwithout parens and without ambiguity. The 'method used' formula is not commutative, furthermore if you extend it to apply for a join of 4 relations, you're going to get a different result depending how you group up (parenthesise) the expression.