What is the difference between MoreLINQ's DistinctBy and LINQ's GroupBy?


I have two versions of code that reduce a list of items to one item per key. The first uses DistinctBy:

List<m_addtlallowsetup> xlist_distincted = xlist_addtlallowsetups.DistinctBy(p => new { p.setupcode, p.allowcode }).OrderBy(y => y.setupcode).ThenBy(z => z.allowcode).ToList();

and the second uses GroupBy:

List<m_addtlallowsetup> grouped = xlist_addtlallowsetups.GroupBy(p => new { p.setupcode, p.allowcode }).Select(grp => grp.First()).OrderBy(y => y.setupcode).ThenBy(z => z.allowcode).ToList();

These two look the same to me, but there must be a layman's explanation of how they differ, how they perform, and what their disadvantages are.


Mrinal Kamboj (best answer)

Let's review the MoreLinq API first. The following is the code for DistinctBy:

MoreLinq - DistinctBy

Source Code

public static IEnumerable<TSource> DistinctBy<TSource, TKey>(this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));

    return _(); IEnumerable<TSource> _()
    {
        var knownKeys = new HashSet<TKey>(comparer);
        foreach (var element in source)
        {
            if (knownKeys.Add(keySelector(element)))
                yield return element;
        }
    }
}

Working

  • It uses a HashSet<TKey> internally: for each element it tries to add the element's key to the set. The first element with a given key is returned; every later element with the same key is ignored, because that key is already in the HashSet
  • It is the simplest way to get the first element for every unique key in the collection, where the key is defined by the Func<TSource, TKey> keySelector
  • Its use case is limited (a subset of what GroupBy can achieve, which is also clear from your code)
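
To make the HashSet behaviour described above concrete, here is a minimal sketch. FirstPerKey is a hypothetical stand-in that mirrors the quoted logic (it is not MoreLINQ's actual method), and the sample rows are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctBySketch
{
    // Hypothetical stand-in for MoreLINQ's DistinctBy, mirroring the HashSet logic quoted above.
    public static IEnumerable<T> FirstPerKey<T, TKey>(this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var knownKeys = new HashSet<TKey>();
        foreach (var element in source)
        {
            // Add returns false when the key is already present, so later duplicates are skipped.
            if (knownKeys.Add(keySelector(element)))
                yield return element;
        }
    }

    public static void Main()
    {
        // Made-up sample data: (setupcode, allowcode, note).
        var rows = new[]
        {
            (setupcode: "S1", allowcode: "A1", note: "first"),
            (setupcode: "S1", allowcode: "A1", note: "duplicate"),
            (setupcode: "S1", allowcode: "A2", note: "other"),
        };

        var viaHashSet = rows.FirstPerKey(r => new { r.setupcode, r.allowcode }).ToList();
        var viaGroupBy = rows.GroupBy(r => new { r.setupcode, r.allowcode })
                             .Select(g => g.First())
                             .ToList();

        // Both approaches keep the "first" and "other" rows and drop the "duplicate" row.
        Console.WriteLine(viaHashSet.SequenceEqual(viaGroupBy));
        foreach (var r in viaHashSet)
            Console.WriteLine($"{r.setupcode} {r.allowcode} {r.note}");
    }
}
```

The anonymous type used as the key compares by value, which is why two rows with the same setupcode and allowcode count as the same key in both approaches.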

Enumerable - GroupBy

(Source Code)

public static IEnumerable<IGrouping<TKey, TElement>> GroupBy<TSource, TKey, TElement>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector) {
    return new GroupedEnumerable<TSource, TKey, TElement>(source, keySelector, elementSelector, null);
}

internal class GroupedEnumerable<TSource, TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>
{
    IEnumerable<TSource> source;
    Func<TSource, TKey> keySelector;
    Func<TSource, TElement> elementSelector;
    IEqualityComparer<TKey> comparer;

    public GroupedEnumerable(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, Func<TSource, TElement> elementSelector, IEqualityComparer<TKey> comparer) {
        if (source == null) throw Error.ArgumentNull("source");
        if (keySelector == null) throw Error.ArgumentNull("keySelector");
        if (elementSelector == null) throw Error.ArgumentNull("elementSelector");
        this.source = source;
        this.keySelector = keySelector;
        this.elementSelector = elementSelector;
        this.comparer = comparer;
    }

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator() {
        return Lookup<TKey, TElement>.Create<TSource>(source, keySelector, elementSelector, comparer).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() {
        return GetEnumerator();
    }
}

Working

  • As can be seen, it internally uses a Lookup data structure to group all the data for each key
  • It offers flexibility in element and result selection via projection, so it can meet many different use cases
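
As a rough illustration of that flexibility, the sketch below uses the GroupBy overload with an element selector and then aggregates each group. The row shape and the amount field are assumptions invented for the example; they do not come from the question.

```csharp
using System;
using System.Linq;

public static class GroupByFlexibility
{
    public static void Main()
    {
        // Made-up sample data: (setupcode, allowcode, amount).
        var rows = new[]
        {
            (setupcode: "S1", allowcode: "A1", amount: 10m),
            (setupcode: "S1", allowcode: "A1", amount: 5m),
            (setupcode: "S2", allowcode: "A1", amount: 7m),
        };

        // Element selector: each group holds only the amounts, not the whole row.
        var amountsPerKey = rows.GroupBy(r => new { r.setupcode, r.allowcode }, r => r.amount);

        foreach (var grp in amountsPerKey)
        {
            // Every element of the group is available, so counts and sums are possible.
            Console.WriteLine($"{grp.Key.setupcode}/{grp.Key.allowcode}: count={grp.Count()}, total={grp.Sum()}");
        }
    }
}
```

A DistinctBy-style operator cannot produce these counts or totals, because it discards every element after the first one for each key.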

Summary

  1. MoreLinq's DistinctBy achieves a small subset of what Enumerable.GroupBy can achieve. If your use case is that specific, use the MoreLinq API
  2. For your use case, DistinctBy should be faster because its scope is narrower: GroupBy first collects every element into a group and only then lets you select the first element for each unique key, whereas DistinctBy keeps only the keys it has seen and ignores every element after the first for each key
  3. If the requirement is this specific use case and no data projection is needed, then MoreLinq is the better choice
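
The difference in point 2 can be observed with a small sketch that counts how many source elements each approach pulls before it can hand back its first result. FirstPerKey is again a hypothetical stand-in for the HashSet logic, not MoreLINQ's own method, and the counting source is invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StreamingVsBuffering
{
    public static int Pulled;

    // Source of 1000 integers that counts how many elements the consumer has pulled.
    public static IEnumerable<int> Source()
    {
        for (int i = 0; i < 1000; i++)
        {
            Pulled++;
            yield return i % 10;
        }
    }

    // Hypothetical stand-in for the HashSet-based DistinctBy logic.
    public static IEnumerable<T> FirstPerKey<T, TKey>(IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var knownKeys = new HashSet<TKey>();
        foreach (var element in source)
        {
            if (knownKeys.Add(keySelector(element)))
                yield return element;
        }
    }

    public static void Main()
    {
        Pulled = 0;
        FirstPerKey(Source(), x => x).First();
        Console.WriteLine($"HashSet approach pulled {Pulled} element(s) before the first result");

        Pulled = 0;
        Source().GroupBy(x => x).Select(g => g.First()).First();
        Console.WriteLine($"GroupBy approach pulled {Pulled} element(s) before the first result");
    }
}
```

The HashSet approach streams, so it can yield as soon as it sees a new key, while GroupBy has to read the whole source to build its groups. In the code from the question, OrderBy and ToList consume the entire sequence in both versions anyway, so there the practical difference is mainly the extra memory and work of holding every element in a group rather than just the keys.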

This is a classic case in LINQ where more than one API can produce the same result, but we need to be wary of the cost, since GroupBy is designed for a much wider task than what you are expecting from DistinctBy.

Bagus Tesa

The Differences

GroupBy produces a sequence of groups, each containing a key (the grouping criterion) and its values. That is why you need Select(grp => grp.First()) afterwards to get a single item back from each group.

You might suspect that MoreLinq just provides a shorthand for this. Judging by the source, however, DistinctBy is actually done in memory by picking every item whose key is new to the HashSet. HashSet.Add adds the key and returns true if it is a new element for the HashSet, and the yield then returns the corresponding element into the resulting enumerable.

Which One?

SQL Related

Based on the difference above, you could say that doing GroupBy and then projecting with Select is the safer approach, as it can be translated into a SQL command if you are using Entity Framework (or Linq2Sql, I suppose), provided your provider and version can translate that query shape. Being translatable into SQL is a great advantage: it takes the burden off your application and delegates the operation to the database server instead.

However, you have to understand that GroupBy in Entity Framework actually uses an OUTER JOIN, which is considered a complex operation, and in some cases it may cause your query to be dropped immediately. It is a pretty rare case; the query I threw at it had lots of columns, around four GroupBys, and a bunch of orderings and Wheres.

Linq to Object

Roughly speaking, this applies when you are dealing with enumerables that are already in memory. Running GroupBy and then Select means your data goes through two operations: first every element is collected into groups, then the first element is picked from each group. Using DistinctBy from MoreLinq directly saves that extra work, as it is a single operation backed by a HashSet, as explained in Mrinal Kamboj's answer with an in-depth analysis of the source code.