How can I check for duplicates faster in a Python list of over 9,000 elements?


I'm trying to check whether any duplicate values exist in an integer list in Python. The logic below works, but I found that the execution time grows quickly as the list gets larger. How can I improve the run time?

def containsDuplicate(nums):
    if len(nums) < 2:
        return False
    length = len(nums)
    cnt = 0
    while cnt < length:
        # compare nums[cnt] against every later element: O(n^2) overall
        p = cnt + 1
        while p < length:
            if nums[cnt] == nums[p]:
                return True  # duplicate found; no need to keep scanning
            p += 1
        cnt += 1
    return False

There are 3 answers below.

BEST ANSWER

You could use a set:

>>> lst = [1, 2, 3]
>>> len(set(lst)) != len(lst)
False
>>> lst = [1, 2, 2]
>>> len(set(lst)) != len(lst)
True

EDIT: a faster version that stops as soon as a duplicate is seen, as pointed out in the comments:

>>> lst = [3] + list(range(10**4))
>>> seen = set()
>>> any(x in seen or seen.add(x) for x in lst)
True
>>> seen
{0, 1, 2, 3}
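
If you prefer a named helper, here is a minimal sketch wrapping the same early-exit idiom in a function (the name contains_duplicate is just illustrative). The one-liner above works because seen.add(x) returns None, so the or expression yields True only when x is already in seen:

def contains_duplicate(nums):
    # O(n) time, O(n) extra space: each element is hashed at most once
    seen = set()
    for x in nums:
        if x in seen:
            return True  # stop scanning at the first repeated value
        seen.add(x)
    return False

print(contains_duplicate([3] + list(range(10**4))))  # True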
ANSWER 2

A set is more efficient because it forces all of its elements to be unique. Converting the list to a set therefore removes any duplicates as a side effect, and comparing the set's length with the list's length reveals whether anything was removed. A dictionary has the same property for its keys, but would be somewhat slower than a set in current Python implementations.
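
As a quick illustration of the dictionary variant mentioned above, here is a sketch using collections.Counter (a dict subclass); this is not part of the original answer:

from collections import Counter

lst = [1, 2, 2]
counts = Counter(lst)  # maps each value to its frequency
print(any(c > 1 for c in counts.values()))  # True: some value occurs more than once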

The main inefficiency in your attempt is that you rescan the tail of the list for every element you examine, which is O(n^2) overall. A data structure with unique keys avoids this: the program hashes each value directly and sees whether it is already present, in O(1) average time per lookup. See e.g. http://www.pythoncentral.io/hashing-strings-with-python/ for a somewhat more detailed discussion.

If you don't want to use a set or a related data structure (a rather unusual constraint -- if it is real, you should explain why it applies), an alternative is to sort the data and then check for adjacent duplicates, as sketched below. This is still much faster than repeatedly traversing the tail of the list: sorting costs O(n log n), while the repeated traversal costs O(n^2), and the gap widens as the list grows. See e.g. http://interactivepython.org/runestone/static/pythonds/AlgorithmAnalysis/BigONotation.html for a Pythonic treatment of efficiency analysis.
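
Here is a minimal sketch of that sort-and-check-adjacent approach (the function name is illustrative):

def contains_duplicate_sorted(nums):
    nums = sorted(nums)               # O(n log n); sorted() copies, leaving the input untouched
    for a, b in zip(nums, nums[1:]):  # walk adjacent pairs in sorted order
        if a == b:
            return True               # equal neighbours mean a duplicate
    return False

print(contains_duplicate_sorted([1, 5, 3, 5]))  # True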

ANSWER 3

Melt the DataFrame, then count occurrences of each value. If any grouped result has a count > 1, duplication has occurred:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 5, 6, 7, 8, 1]})
df.reset_index(inplace=True)             # keep the row index as a column
df = df.melt(id_vars="index")            # melt only the data column, not the index
print(df)
grouped = df.groupby("value").count()    # rows per distinct value
print(grouped)
print((grouped["index"] > 1).any())      # True: at least one value repeats
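
For what it's worth, a more direct pandas idiom (a sketch, not from the original answer) skips the melt/groupby steps entirely:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 5, 6, 7, 8, 1])
print(s.duplicated().any())  # True: duplicated() flags repeated values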