Scala algorithm: Remove duplicates from a sorted list (Sliding)
Published
Algorithm goal
In Streaming data, data may come in duplicated; it could be due to various factors such as duplicated data from sources and idempotency for redundancy; for consumption though we may need to deduplicate the data for at-most-once processing. Some deduplicators retain state of long-gone elements (in which case .distinct will suffice, but have a memory cost), but in this case we are looking at only consecutive duplicate elements.
This is an alternative approach to RemoveDuplicatesFromSortedListStateMachine
Test cases in Scala
assert(removeDuplicates(List.empty[Int]) == List.empty[Int])
assert(removeDuplicates(List(1)) == List(1))
assert(removeDuplicates(List(1, 1)) == List(1))
assert(removeDuplicates(List(1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 1, 2)) == List(1, 2))
assert(removeDuplicates(List(1, 1, 2, 3, 3)) == List(1, 2, 3))
assert(removeDuplicatesWithMarker(List(1, 1, 2, 3, 3)) == List(1, 2, 3))
Algorithm in Scala
23 lines of Scala (compatible versions 2.13 & 3.0), showing how concise Scala can be!
Explanation
The approach here is a little different to the State Machine approach:
The idea is that by running a sliding window, we can check pairs of elements for their equality, and if they are not equal, emit the newly found one. (this is © from www.scala-algorithms.com)
Scala concepts & Hints
Collect
'collect' allows you to use Pattern Matching, to filter and map items.
Option Type
The 'Option' type is used to describe a computation that either has a result or does not. In Scala, you can 'chain' Option processing, combine with lists and other data structures. For example, you can also turn a pattern-match into a function that return an Option, and vice-versa!
Pattern Matching
Pattern matching in Scala lets you quickly identify what you are looking for in a data, and also extract it.
Sliding / Sliding Window
Get fixed-length sliding sub-sequences (sliding windows) from another sequence