I have a set of tables aimed at representing a profile storage system. I have created a view that should display all the important parts of this profile, that being things like the name, email, number of followers and the list of activities they are a part of.
However, I've noticed that the values in the activities cell are duplicating. The same thing appears multiple times when each activity name should only appear once. In the below table the Nulls at the bottom are intentional.
Weight | Height | Activities | TotalFollowing | TotalFollowers |
---|---|---|---|---|
68 | 170 | Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Horse Riding, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Bird Watching, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting, Hunting | 4 | 2 |
63 | 179 | Horse Riding, Horse Riding, Horse Riding, Hiking, Hiking, Hiking, Swimming, Swimming, Swimming, Fishing, Fishing, Fishing | 1 | 3 |
72 | 130 | NULL | 3 | 1 |
NULL | NULL | NULL | 1 | 2 |
There is a user that is identified by it's MemberID, a link table called User_Activity which only contains MemberID and ActivityID, the latter of which points to the activity table that holds the name for an activity under a certain ID.
ActivityID | ActivityName |
---|---|
1 | Horse Riding |
2 | Hiking |
3 | Bird Watching |
4 | Backpacking |
5 | Swimming |
6 | Fishing |
7 | Hunting |
And the user activity tables holds these.
MemberID | ActivityID |
---|---|
1 | 1 |
1 | 3 |
1 | 7 |
2 | 1 |
2 | 2 |
2 | 5 |
2 | 6 |
The SQL for this view is as follows.
I assume the issue has come from the fact that the user follows table and the references I use there and up pointing back to the user, so it counts their activities again and adds them to the column depending on how many followers they have, since these two example records repeat a different number of times.
CREATE VIEW [View_Profile_Information] AS
SELECT U.MemberID,
U.FirstName,
U.LastName,
L.Town,
L.County,
L.Country,
U.Email,
U.AboutMe,
U.DistanceUnit,
U.ActivityTimePreference,
U.Weight,
U.Height,
STRING_AGG(A.ActivityName, ', ') AS Activities,
COUNT(DISTINCT UF.Follows_MemberID) AS TotalFollowing,
COUNT(DISTINCT UF2.This_MemberID) AS TotalFollowers
FROM CW1.[User] U
LEFT JOIN CW1.[Location] L
ON U.LocationID = L.LocationID
LEFT JOIN CW1.[User_Activity] UA
ON U.MemberID = UA.MemberID
LEFT JOIN CW1.[Activity] A
ON UA.ActivityID = A.ActivityID
LEFT JOIN CW1.[User_Follows] UF
ON U.MemberID = UF.This_MemberID
LEFT JOIN CW1.[User_Follows] UF2
ON U.MemberID = UF2.Follows_MemberID
GROUP BY U.MemberID, U.FirstName, LastName, Town, County, Country, Email, AboutMe, DistanceUnit, ActivityTimePreference, Weight,
Height
What you have is a query that contains multiple independent one-to-many joins. This is near guaranteed to cause data duplication in the results. Using
DISTINCT
orGROUP BY
to try to resolve the duplicates is rarely the correct solution for unexpected duplicate data.What you need to do is to moved each relationship branch into separate subqueries. In this case those subqueries can either be included directly in the select list or wrapped up in a
CROSS APPLY
.Something like:
You can also use
CROSS APPLY
to calculate the aggregated values separately down in theFROM
section of the query, so that the final select list is cleaner and more readable.Note that I have removed the
DISTINCT
s andGROUP BY
, andCOUNT(*)
is now just counting rows.