This is a question I have been wondering about for quite some time now. I'll try to explain by means of an example, but it's a general question.
Say you have two tables:
- users, which holds the first name, last name, ... of a user
- posts, which holds the posts written by a user, with fields such as title, text, ...
Now, assume I want to display all posts which have been created in the last 24 hours. Next to each post, I want to display the first name, last name, ... of the user who created it. The query will probably look something like:
SELECT ... FROM posts, users WHERE posts.user_id = users.id AND [last 24 hours]
Now, on to my question. Since it is very likely that one user has created multiple posts in the last 24 hours, we are basically retrieving his/her first name, last name, ... over and over again. In other words, the result set of the query above contains duplicate data (but not duplicate rows).
Isn't it better to:
- SELECT ... FROM posts WHERE [last 24 hours]
- SELECT ... FROM users WHERE id IN (SELECT DISTINCT user_id FROM posts WHERE [last 24 hours])
- map result of first query with result of second query at application level or in the sql procedure, to find out the first name, last name, ... of a post - this can easily be done if the identifiers (primary keys) are the index/key of some kind of hash map, array or similar.
?
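To make the three steps concrete, here is a minimal sketch in Python using the standard-library sqlite3 module and an in-memory database. The column names (first_name, last_name, title, created_at), the sample rows, and the datetime('now', '-1 day') filter standing in for [last 24 hours] are all assumptions made up for illustration, not part of any real schema.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER,
                        title TEXT, created_at TEXT);
    INSERT INTO users VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO posts VALUES
        (1, 1, 'First',  datetime('now', '-1 hour')),
        (2, 1, 'Second', datetime('now', '-2 hours')),
        (3, 2, 'Third',  datetime('now', '-3 hours')),
        (4, 2, 'Old',    datetime('now', '-3 days'));
""")

# Assumed stand-in for the [last 24 hours] condition.
RECENT = "created_at >= datetime('now', '-1 day')"

# Query 1: the recent posts, without any user columns.
posts = conn.execute(
    f"SELECT id, user_id, title FROM posts WHERE {RECENT} ORDER BY id"
).fetchall()

# Query 2: each author of a recent post, fetched exactly once.
users = conn.execute(
    "SELECT id, first_name, last_name FROM users "
    f"WHERE id IN (SELECT DISTINCT user_id FROM posts WHERE {RECENT})"
).fetchall()

# Step 3: map posts to users through a dict keyed by the primary key.
users_by_id = {uid: (first, last) for uid, first, last in users}
rows = [(title, *users_by_id[user_id]) for _post_id, user_id, title in posts]
```

Each user's name crosses the wire once, at the cost of a second round trip and the mapping code in the application.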
I understand this is a very general question, but any insights are welcome. Thanks!
Either method should work, but you hit on the important part:
Do it at the application level.
For me, I would pull the duplicate data so that each row of my result set contained all the data I needed. SQL is more efficient at JOINs and set operations than pretty much any procedural language. If you keep the data together, you will have an easier time breaking it up downstream if needed, and you only have to make one call to the database instead of two.
The benefits of this diminish as the duplicated data grows. If it's just a couple of fields it's not a big impact. If it's dozens of fields of redundant data the performance difference will be a lot more noticeable.
For your specific example, it's much better to do it all in a single query.
You can eliminate duplication at the application level if it bothers you, but 2 or 3 additional fields returned multiple times for the same user won't be very significant compared to making multiple database calls.
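As a rough sketch of that approach, the snippet below runs a single joined query and then collapses the repeated user columns in the application. It uses Python's standard-library sqlite3 module with an in-memory database; the schema, the sample rows, and the datetime('now', '-1 day') filter are hypothetical stand-ins for the tables and the [last 24 hours] condition in the question.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER,
                        title TEXT, created_at TEXT);
    INSERT INTO users VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO posts VALUES
        (1, 1, 'First',  datetime('now', '-1 hour')),
        (2, 1, 'Second', datetime('now', '-2 hours')),
        (3, 2, 'Third',  datetime('now', '-3 hours')),
        (4, 2, 'Old',    datetime('now', '-3 days'));
""")

# One round trip: every row already carries the author's name.
joined = conn.execute("""
    SELECT p.id, p.title, u.id, u.first_name, u.last_name
    FROM posts AS p
    JOIN users AS u ON u.id = p.user_id
    WHERE p.created_at >= datetime('now', '-1 day')
    ORDER BY p.id
""").fetchall()

# Optional: remove the duplication in the application if it bothers you.
authors = {}          # user id -> (first_name, last_name), stored once
titles_by_author = {} # user id -> list of post titles
for _post_id, title, user_id, first, last in joined:
    authors.setdefault(user_id, (first, last))
    titles_by_author.setdefault(user_id, []).append(title)
```

The explicit JOIN ... ON form is equivalent to listing both tables in FROM with the join condition in WHERE; it just keeps the join condition separate from the time filter.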