speeding up deleting duplicates when they're adjacent

646 Views Asked by At

I'm looking for something like #'delete-duplicates, but I know that all elements of the list are already sorted, or inversely sorted, or at least arranged so that duplicates will already be adjacent to each other. I wish to use that knowledge to ensure that execution speed is not proporational to the square of the number of elements in the list. It's trivial to use #'maplist to grow my own solution, but is there something already in the language? It would be embarrassing to reinvent the wheel.

To be clear, for largish lengths of lists, I would like the running time of the deletion to be proportional to the length of the list, not proportional to the square of that length. This is the behavior I wish to avoid:

 1 (defun one-shot (cardinality)
 2   (labels ((generate-list (the-count)
 3              (let* ((the-list (make-list the-count)))
 4                (do ((iterator 0 (1+ iterator)))
 5                  ((>= iterator the-count))
 6                  (setf (nth iterator the-list) iterator))
 7                the-list)))
 8     (let* ((given-list (generate-list cardinality))
 9            (stripped-list)
10            (start-time)
11            (end-time))
12       (setf start-time (get-universal-time))
13       (setf stripped-list (delete-duplicates given-list :test #'eql))
14       (setf end-time (get-universal-time))
15       (princ "for n = ")
16       (princ cardinality)
17       (princ ", #'delete-duplicates took ")
18       (princ (- end-time start-time))
19       (princ " seconds")
20       (terpri))))
21 (one-shot 20000)
22 (one-shot 40000)
23 (one-shot 80000)
for n = 20000, #'delete-duplicates took 6 seconds
for n = 40000, #'delete-duplicates took 24 seconds
for n = 80000, #'delete-duplicates took 95 seconds
5

There are 5 best solutions below

2
On

There is nothing like that in the language standard. However, you can do that either with a loop:

(defun remove-adjacent-duplicates (list &key (test #'eql))
  (loop for obj in list 
        and prev = nil then obj 
        for take = t then (not (funcall test obj prev))
        when take collect obj))

or with reduce (exercise left to the reader).

See the other answer for a destructive implementation.

PS. Unless you are doing something tricky with timing, you are much better off using time.

6
On

There's nothing like this in the language, but something like this makes just one pass through the list:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (loop
     for head = list then (cdr head)
     until (endp head)
     finally (return list)
     do (setf (cdr head)
              (member (if (null key) (car head)
                          (funcall key (car head)))
                      (cdr head)
                      :key key :test-not test))))

As, @wvxvw pointed out, it might be possible to simplify this iteration using (loop for head on list finally (return list) do ...). However, 3.6 Traversal Rules and Side Effects says that modifying the cdr chain of a list during an object-traversal leads to undefined behavior. However, it's not clear whether loop for head on list is technically an object-traversal operation or not. The documentation about loop says in 6.1.2.1.3 The for-as-on-list subclause that

In the for-as-on-list subclause, the for or as construct iterates over a list. … The variable var is bound to the successive tails of the list in form1. At the end of each iteration, the function step-fun is applied to the list; the default value for step-fun is cdr. … The for or as construct causes termination when the end of the list is reached.

This says that the step function is always applied at the end of the iteration, so it sounds like loop for head on list should be OK. At any rate, any possible issues could be avoided by using do loop instead:

(defun delete-adjacent-duplicates (list &key key (test 'eql))
  (do ((head list (cdr head)))
      ((endp head) list)
    (setf (cdr head)
          (member (if (null key) (car head)
                      (funcall key (car head)))
                  (cdr head)
                  :key key :test-not test))))

The idea is to start with head being the list, then setting its cdr to the first tail that starts with a different element, then advancing the head, and continuing until there's nothing left. This should be linear in the length of the list, assuming that member is implemented in a sensible way. The use of member means that you don't have to do any extra work to get :key and :test working in the appropriate way. (Do note that :test for del-dups is going to be the :test-not of member.) Note: there's actually a slight issue with this, in that the key function will called twice for each element in the final list: once when it's the first element of a tail, and once when it's the car of head.

CL-USER> (delete-adjacent-duplicates (list 1 1 1 1 2 2 3 3 3))
(1 2 3)
CL-USER> (delete-adjacent-duplicates (list 1 2 2))
(1 2)
CL-USER> (delete-adjacent-duplicates (list 1 3 5 6 4 2 3 5) :key 'evenp)
(1 6 3)

I expect that any linear time solution is going to take a similar approach; hold a reference to the current head, find the next tail that begins with a different element, and then make that tail the cdr of the head.

4
On

For the record: your test code is basically just this:

(defun one-shot (n &aux (list (loop for i below n collect i)))
  (time (delete-duplicates list))
  (values))

It might also be useful to talk to the implementation maintainers in the case of a slow delete-duplicates.

For example (one-shot 1000000) runs in a second in CCL on my Mac. In LispWorks it runs in 0.155 seconds.

6
On

A bit different approach:

(defun compress-duplicates (list &key (test #'eql))
  (labels ((%compress-duplicates (head tail)
             (if (null tail)
               (setf (cdr head) tail)
               (progn (unless (funcall test (car head) (car tail))
                        (setf (cdr head) tail head (cdr head)))
                      (%compress-duplicates head (cdr tail))))))
    (%compress-duplicates list (cdr list)) 
    list))
                  
(compress-duplicates (list 1 1 1 2 2 3 4 4 1 1 1))
;; (1 2 3 4 1)

Test of SBCL delete-duplicates implementation:

(defun test-delete-duplicates ()
  (labels ((%test (list)
             (gc)
             (time (delete-duplicates list))))
    (loop
       :repeat 6
       :for list := (loop :for i :from 0 :below 1000
                       :collect (random 100))
       :then (append list list) :do (%test (copy-list list)))))

;; (test-delete-duplicates)

;; Evaluation took:
;;   0.002 seconds of real time
;;   0.002000 seconds of total run time (0.002000 user, 0.000000 system)
;;   100.00% CPU
;;   3,103,936 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.003 seconds of real time
;;   0.003000 seconds of total run time (0.003000 user, 0.000000 system)
;;   100.00% CPU
;;   6,347,431 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.006 seconds of real time
;;   0.006000 seconds of total run time (0.005000 user, 0.001000 system)
;;   100.00% CPU
;;   12,909,947 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.012 seconds of real time
;;   0.012000 seconds of total run time (0.012000 user, 0.000000 system)
;;   100.00% CPU
;;   25,253,024 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.023 seconds of real time
;;   0.022000 seconds of total run time (0.022000 user, 0.000000 system)
;;   95.65% CPU
;;   50,716,442 processor cycles
;;   0 bytes consed
  
;; Evaluation took:
;;   0.049 seconds of real time
;;   0.050000 seconds of total run time (0.050000 user, 0.000000 system)
;;   102.04% CPU
;;   106,747,876 processor cycles
;;   0 bytes consed

Shows linear speed.


Test of ECL delete-duplicates implementation:

;; (test-delete-duplicates)
;; real time : 0.003 secs
;; run time  : 0.003 secs
;; gc count  : 1 times
;; consed    : 95796160 bytes
;; real time : 0.007 secs
;; run time  : 0.006 secs
;; gc count  : 1 times
;; consed    : 95874304 bytes
;; real time : 0.014 secs
;; run time  : 0.014 secs
;; gc count  : 1 times
;; consed    : 95989920 bytes
;; real time : 0.028 secs
;; run time  : 0.027 secs
;; gc count  : 1 times
;; consed    : 96207136 bytes
;; real time : 0.058 secs
;; run time  : 0.058 secs
;; gc count  : 1 times
;; consed    : 96617536 bytes
;; real time : 0.120 secs
;; run time  : 0.120 secs
;; gc count  : 1 times
;; consed    : 97412352 bytes

Linear time increase too.

7
On

I would expect REMOVE-DUPLICATES to have a linear time implementation. (And indeed it does* on my local SBCL install.)

Note that REMOVE-DUPLICATES and DELETE-DUPLICATES are specified to have the same return value, and that the side effects of DELETE-DUPLICATES are not guaranteed.

* The linear time code path is only taken when the :test is #'eq,#'eql, #'equal, or #'equalp (it relies on a hash table) and there is no :key or :test-not argument supplied.