18 December 2017

# SetMap

## Introduction

Ruby comes with a Set` class`

as part of its standard library, and there is also a `Multiset`

class available as a gem. Both classes uses hashes internally. However, the two libraries are completely separate, so, of course, they do not share any code. This is somewhat regrettable, since—as we will see during the course of our discussion—sets and multisets have quite a bit in common. For this reason, I thought that it would be a nice exercise to write a more generic set class from scratch that would allow us to derive the functionality provided by the above-mentioned classes by inheritance. For good measure, I decided to throw fuzzy sets into the mix, another type of set with useful applications. The result could be called a polymorphic approach to modeling various types of sets in Ruby. This post is a tutorial-style presentation of what I came up with. If you don’t care much for lengthy explanations, head straight to Github to have a look at the code.

The first section introduces our topic by discussing the three types of sets we would like to capture. We then develop a unified model for those types. The third and final section discusses how to implement the typical operations on sets in this model.

## Three Types of Sets

Let’s first get an overview of the types of collections we are interested in by means of some quick examples.

### Classical sets

Supppose we wanted to concisely represent the letters occuring in a word, while disregarding their sequential order as well as the number of times they occur in the word. A classical set would be a good choice of data structure for this task. For instance, the word “learner” could be represented by the set

1

{ a, e, l, r, n }

The word “learn” corresponds to the same classical set, as it contains the same letters. The word “land”, on the other hand, corresponds to the set

1

{ a, l, n, d }

Taking the intersection of the two sets, which collects the elements occuring in both of them, we obtain the set `{ a, l, n }`

, which captures the overlap of the two words in terms of letters contained—a crude measure of what they have in common. So sets allow us to zoom in on questions of membership, while disregarding other aspects of particular entities we wish to study.

### Multisets

Suppose now we would like to count *how often* letters occur in a given word or text, while (still) disregarding their order. A multiset would be an appropriate data structure to accomplish this. For example, the above word “learner” corresponds to the multiset

1

{ a, e, e, l, r, r, n }

Notice that the word “learn” would be represented by a different multiset, namely

1

{ a, e, l, r, n }

So multisets make finer distinctions than classical sets.

A more concise way to represent a multiset is by means of key-value notation. In this notation, our multiset representation of the word “learner” would be written as

1

{ a: 1, e: 2, l: 1, r: 2, n: 1 }

In a programming context, the elements of a set (be it a classical set or a multiset) are often called *keys*. The above multiset has five distinct keys: “a”, “e”, “l”, “r” and “n”. In a multiset, each key comes with a count, the value associated with the key, which we will refer to as the *score* for that key. The score for the key “e” in the above multiset, for example, is 2.

### Fuzzy sets

Fuzzy sets also make finer distinctions compared to classical sets. However, they generalize classical sets in a different direction. As we just saw, in a multiset, the score for a particular key represents *multiplicity* of membership (how many times does the key occur in the set?). In a fuzzy set, the score indicates *degree of membership*. So besides an element being “fully contained” in the set or “not contained”, it may also be “partially contained” in the set, so to speak.

To see how this can be useful, take this example: consider the words “learner”, “learn”, “learned”, and “lean”. We might wish to capture how similar each of these words is to some other word, let’s say the word “learner” again. The similarity of “learner” to itself is perfect, since no edit (letter change) is required to transform one into the other. They are the same, after all. On to the more interesting cases: the word “learned” is very similar to “learner” – substituting the last letter will transform the former into the latter. “learn” is also very similar to “learner”, but a little less so – deleting the last two letters from “learner” results in “learner”, requiring two steps rather than one. The word “lean” is again a bit less similar to “learner”, requiring one additional deletion. So to transform each of the words given into “learner” requires:

- “learner”: 0 edits
- “learned”: 1 edit
- “learn”: 2 edits
- “lean”: 3 edits

For the sake of exposition, let us settle on 3 as the—pretty arbitrarily chosen—treshold from which onwards two words to be considered completely dissimilar, or “not similar at all”. Then we can map our edit counts to a scale from 0 to 1, and represent our findings as a fuzzy set which scores the key “learner” with 1.0, “learned” with 0.66, “learn” with 0.33 and “lean” with 0.0, or, using key-value notation:

1

{ learner: 1.0, learned: 0.66, learn: 0.33, lean: 0.0 }

In this fuzzy set, the degree of each element is to be interpreted as the degree of similarity to our target word “learner”.

## A Unified Model For Sets in Ruby

Equipped with some basic understanding of our problem domain established in part 01 of this series, let us begin to develop the main ingredients for a Ruby model of sets that encompasses the types of sets we have discussed (as well as potentially other ones). We start by discussing a `SetMap`

class that captures the commonalities of classical sets, fuzzy sets, and multisets, while allowing us to easily define each of these specific types via inheritance.

### Hash tables

What do the three types of sets have in common? At first glance, it seems that their internal structure is pretty different: multisets and fuzzy sets have been presented above as consisting of key-value pairs, while classical sets simply consist of a bunch of keys. However, this is merely a matter of representation. In fact, it is rather common to represent a classical set by means of a *characteristic function*, which maps the members of the set to 1, while all other objects from a given domain are mapped to 0. Taking a cue from this, we extend our key-value notation to classical sets, writing the set `{ 0, 1, 2 }`

, for example, as

1

{ 0: 1, 1: 1, 2: 1 }

From this perspective, it becomes obvious that the membership information for a set—be it a fuzzy set, a classical set, or a multiset—may be stored in a hash table:

1
2
3

{ 'a' => 1, 'b' => 1, 'c' => 1 } # 'classical set'
{ 'a' => 1, 'b' => 1, 'c' => 2 } # 'multiset'
{ 'a' => 0.3, 'b' => 1, 'c' => 0.6 } # 'fuzzy set'

Hash tables will form the basis of our representation of sets. Of course, we would not want to directly expose such a table to the user of our set class. The user need not even be aware that we are using a hash table to store her set. Rather, the hash instance that stores the set keys and their associated scores will be a collaborator object to our set object. The `initialize`

method of our `SetMap`

class sets the stage for this:

1
2
3
4
5
6

class SetMap
def initialize
@hash = {}
@size = 0
end
end

Besides the `@hash`

instance variable, we also decide to maintain an instance variable `@size`

, in the interest of being able to look up the size of our set in constant time. The size of a set is commonly defined as the sum of the scores of its keys, and the idea is that `@size`

will always store this value in an up-to-date fashion. We also make available a getter method `size`

that returns the current value of `@size`

, omitted here.

### Valid scores

What distinguishes the types of sets we have seen above from each other? It is primarily what counts as a valid score according to each type:

- A classical set either contains or does not contain a particular key, so the only valid scores are 0 and 1.
- A multiset may contain a given key
`n`

times, where`n`

is a non-negative integer. - A fuzzy set scores a given key to a degree in the unit interval from 0 to 1.

So in each case, we have to be able to express a range of possible values. We set up a bunch of class methods and class instance variables for this purpose:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

class SetMap
def self.score_type
@score_type || raise(SetError, '@score_type not initialized')
end
def self.min_score
@min_score || raise(SetError, '@min_score not initialized')
end
def self.max_score
@max_score || raise(SetError, '@max_score not initialized')
end
def self.valid_score?(val)
val.is_a?(score_type) && (min_score..max_score).cover?(val)
end
end

The first three class methods defined above are getters (on the level of the class object) for the class instance variables `@score_type`

(the kind of object we may use as a score for a key), `@min_score`

(the smallest value that may be used as a score), and `@max_score`

(the largest value that maye be used as a score). The fourth method uses these getters and describes what constitutes a valid score as a predicate.

As the second disjunct of each of the above getter methods tells us loud and clear, we are missing something so far: our class instance variables have not been set to any value! Initializing those class instance variables is precisely the job description of our target classes.

### Target classes

`SetMap`

is meant to be subclassed, with each subclass defining a particular set type by specifying a range of legal scores via the class instance variables `@score_type`

, `@min_score`

and `@max_score`

. Here is the code for classical sets:

1
2
3
4
5

class ClassicalSet < SetMap
@score_type = Integer
@min_score = 0
@max_score = 1
end

In other words, the only valid scores for the keys of classical sets are the integers `0`

and `1`

. For fuzzy sets, we write:

1
2
3
4
5

class FuzzySet < SetMap
@score_type = Numeric
@min_score = 0
@max_score = 1
end

So any `Numeric`

instance in the closed interval `[0, 1]`

is a valid score for a fuzzy set key (we choose `Numeric`

so as to allow both floats and integers). Finally, for multisets:

1
2
3
4
5

class MultiSet < SetMap
@score_type = Integer
@min_score = 0
@max_score = Float::INFINITY
end

The `Float::INFINITY`

constant has the property that `x < Float::INFINITY`

for any numeric `x`

. Setting `@max_score`

to `Float::INFINITY`

is thus a way of saying that, for multisets, there is no maximal score: any non-negative integer is allowed.

And this is really all there is to it! Specializing the capabilities of `SetMap`

to a particular target class boils down to providing appropriate values for a bunch of class instance variables.

Of course, we have not yet demonstrated what the interface for `SetMap`

actually looks like. But from now on, we will write methods that work equally well for all three set types under consideration: classical sets, fuzzy sets, and multisets.

### Key insertion

The most basic part of the interface of any set class is arguably the capability of inserting scores for particular keys. Here is the `SetMap#insert`

method:

1
2
3
4
5
6
7

def insert(key, val = 1)
raise(SetError, 'Illegal value') unless self.class.valid_score?(val)
old_score = self[key]
@hash[key] = [self[key] + val, self.class.max_score].min
@size = @size + (self[key] - old_score)
self
end

The general idea of this method is to increment the score of `key`

by `val`

. As per line 2 of the snippet, this will work only if `val`

is a valid score according to the implementation of `valid_score?`

(which in turn depends on the values of the class instance variables `@score_type`

, `@min_score`

and `@max_score`

). If that is the case, we use what is called a *bounded sum* to add `val`

to `self[key]`

, capping off the sum at `@max_score`

(line 4).

Let’s try this out using our target classes:

1
2
3
4
5
6
7
8
9
10
11
12
13
14

fuzzy_set = FuzzySet.new
fuzzy_set.insert('a', 0.5)
fuzzy_set.insert('a', 0.3)
fuzzy_set #=> #<FuzzySet: {"a": 0.8}>
multi_set = MultiSet.new
multi_set.insert('a')
multi_set.insert('a', 2)
multi_set #=> #<MultiSet: {"a": 3}>
classical_set = ClassicalSet.new
classical_set.insert('a')
classical_set.insert('a')
classical_set #=> #<ClassicalSet: {"a": 1}>

These are the desired results (assuming a—standard—`inspect`

method which we have not shown). Notice that the score range we have specified for classical sets in tandem with the bounded sum ensures that inserting the same key twice has the same effect as inserting it once: the sum of `1`

and `1`

bounded by `1`

is again `1`

.

Returning to the earlier snippet, we also need to keep track of the size of our set (line 5). Here, we also neutralize rounding errors that might occur for types of sets that allow floating point numbers as scores.

### Key retrieval

Next, consider `SetMap#retrieve`

:

1
2
3
4

def retrieve(key)
@hash[key] ? @hash[key] : 0
end
alias [] retrieve

The `retrieve`

method (which we alias as `[]`

) wraps the element reference method of our internal hash. If the hash does not contain a certain key, `@hash[key]`

will return `nil`

. In that case, `retrieve(key)`

(or, equivalently as per our alias, `self[key]`

) will return `0`

. Alternatively, we could haver set a default value for `@hash`

, but the current way seems slightly more explicit.

Observe that key retrieval is fast: accessing a hash key takes constant time on average (disregarding some fine-print), i.e., as the number of keys in a hash increases, the average time necessary to recover the value for a key does not increase. This is one of the main reasons why using hash tables to model sets is an attractive choice.

### Key removal

While it would be possible to tweak our approach and express removal of a key as insertion with a negative score, we prefer to keep things simple here:

1
2
3
4
5
6
7

def remove(key, val = 1)
raise(SetError, 'Illegal value') unless self.class.valid_score?(val)
old_score = self[key]
@hash[key] = [self[key] - val, self.class.min_score].max
@size = @size - (old_score - self[key])
self
end

`SetMap#remove`

is perfectly symmetric to the earlier `insert`

method, using a bounded difference instead of a bounded sum. For our three target classes, this ensures that negative scores cannot occur.

### Enumeration

The below `SetMap#each_pair`

method enumerates set keys and their associated scores in a straightforward manner, piggybacking on `Hash#each_pair`

. Notice that we only yield key-value pairs for which the value is non-zero, since a key scored with value `0`

is not considered part of our set.

1
2
3
4
5
6
7
8
9
10

def each_pair
return to_enum(:each_pair) unless block_given?
@hash.each_pair do |key, val|
yield([key, val]) if val != 0
end
self
end
alias each each_pair

As we will see in the next post, `each_pair`

forms the basis for all our methods that iterate over sets. This includes pretty much all the interesting operations on sets—`union`

, `intersection`

, and the like. Since `each_pair`

is aliassed as `each`

, it also allows us to include the `Enumerable`

module, which any respectable Ruby collection class should have access to.

## Operations on Sets

The `SetMap`

class discussed in the previous entry essentially serves as a wrapper around our hash table. We have also implemented a mechanism for specifying what constitutes a valid score for a given type of set. And we have provided our three target classes that inherit from `SetMap`

. The basics of our model are thus in place.

What remains to be implemented is all the interesting operations on sets! The remaining part of the interface will, however, not interact with the internally used hash table directly, but only through the interface developed so far.

For the purpose of implementing those additional operations, we open a new module `SetLike`

, which we include in `SetMap`

. Since our target classes `ClassicalSet`

, `FuzzySet`

and `MultiSet`

inherit from `SetMap`

, they will also be able to access the functionality provided by `SetLike`

.

The division of labor we adopt here takes a cue from the place the `Enumerable`

module occupies in Ruby’s design. `Enumerable`

provides a number of useful methods for working with collections to any class that chooses to include it. In doing so, `Enumerable`

assumes that the class implements an `each`

method, which forms the basis for all the methods `Enumerable`

defines. Beyond this, however, `Enumerable`

does not (need to) know anything about the class using it. Here, we do something very similar:

`SetLike`

provides most of the functionality commonly associated with sets.`SetLike`

assumes that any class that uses it implements the instance methods`retrieve`

,`insert`

,`delete`

,`each`

and`size`

.`SetLike`

does not make any further assumptions about the internals of the class using it.

In particular, none of the methods in `SetLike`

need to know that we are using a hash for internal storage. This draws a distinction between methods that do need to know that we have chosen to represent sets with hash tables (they go in the `SetMap`

class), and methods that don’t need to know this (they go in the `SetLike`

module), and thus encapsulates the internal state of a set in `SetMap`

. As long as we keep the public interface of `SetMap`

stable, we could just as well reimplement all its methods using a binary search tree instead of a hash for storing a set, say: `SetLike`

will not care.

### Union, intersection and difference

We would like to implement the usual operations on sets, like `union`

, `intersection`

, and `difference`

. Since they all follow essentially the same pattern, we focus on just one of them, `union!`

(the destructive version of `union`

).

Following the example established by Ruby’s `Set`

class, we first implement a helper method `SetLike#do_with`

that yields to a block:

1
2
3
4
5
6
7
8
9

def do_with(other)
unless other.instance_of?(self.class)
raise SetError, "#{self.class} instance needed"
end
return other.each unless block_given?
other.each { |key, val| yield([key, val]) }
end
private :do_with

This methods takes care of the type-checking for us. We would not, e.g., want to subtract a fuzzy set from a classical set, as the result will not in general be a classical set. The above guard clause ensures that the operations we will define are only carried out for two objects that belong to the same target class. Beyond this, `do_with`

simply yields the key-value pairs of the set passed to it as an argument. Using `do_with`

, we implement `union!`

as follows:

1
2
3
4
5
6

def union!(other)
do_with(other) do |key, _|
self[key] = [self[key], other[key]].max
end
self
end

The union of a given set with another one is defined by maximizing scores for given keys across the two sets. This is how the union is usually defined, and it yields the expected results.

1
2
3
4
5
6
7
8

# SetMap::from_hash(hsh) creates a new set instance and populates it
# with the key-value pairs from the given hash `hsh`
multi_set1 = MultiSet.from_hash(2 => 1, 3 => 2)
multi_set2 = MultiSet.from_hash(2 => 2, 4 => 1)
multi_set1.union!(multi_set2)
multi_set1 == MultiSet.from_hash(2 => 2, 3 => 2, 4 => 1) # true

`union!`

implements a notion of “combining what is contained in two given sets”. There is a similar, yet slightly different notion which is interesting from the perspective of our polymorphic approach: the *sum* of two sets. So let’s briefly digress:

1
2
3
4

def sum!(other)
do_with(other) { |key, val| insert(key, val) }
self
end

Rather than maximizing scores, as `union!`

did, `sum!`

adds the scores given by the other set to the scores of the receiver. For classical sets, `sum!`

and `union!`

amount to the very same thing:

1
2
3
4
5
6
7
8
9
10

# SetMap::[](*list) creates a new set instance and turns the
# members of `list` into keys of the new instance.
set1 = ClassicalSet[1, 2, 3]
set2 = ClassicalSet[2, 3, 4]
set3 = ClassicalSet[1, 2, 3]
set4 = ClassicalSet[2, 3, 4]
set1.union!(set2) == set3.sum!(set4) # true

This is why Ruby’s own `Set`

class simply defines `+`

(the sum operator) to be an alias of `|`

(the union operator). However, for other types of sets, sum and union are not always the same. This is because taking the maximum of two scores is not generally the same as summing up those two scores! For example:

1
2
3
4
5
6
7
8
9
10
11

set1 = FuzzySet.from_hash(2 => 0.4)
set2 = FuzzySet.from_hash(2 => 0.3)
set3 = FuzzySet.from_hash(2 => 0.4)
set4 = FuzzySet.from_hash(2 => 0.3)
set1.sum!(set2)
set3.union!(set4)
set1 == FuzzySet.from_hash(2 => 0.7) # true
set3 == FuzzySet.from_hash(2 => 0.4) # true

Our model captures all of this correctly.

### Set predicates

A classical set `A`

is a subset of a classical set of `B`

if any element of `A`

is also an element of `B`

. This can be expressed in terms of keys and their associated scores by saying that the score for any key in `A`

is less than or equal to the score for that same key in `B`

. This definition also applies to multisets, and fuzzy sets, so that, again, a common implementation is possible. Here is a first stab at the `SetLike#subset?`

method:

1
2
3
4
5
6
7
8

def subset?(other)
return false unless other.instance_of?(self.class)
all? do |key, _|
self[key] <= other[key]
end
end
alias <= subset?

Following the pattern established by the `do_with`

method discussed above, however, it makes sense to extract the “key comparison” functionality to a separate `compare_with?`

method that takes a block (you may want to check out the code of the multiset gem—the gem author does exactly this).

1
2
3
4
5
6
7
8
9
10
11

def keys
each.map(&:first)
end
def compare_with?(other)
return false unless other.instance_of?(self.class)
(keys | other.keys).all? do |key|
yield(self[key], other[key])
end
end

According to line 2 above, the keys of a set object are given by the first component of each key-value pair. `SetLike#compare_with?`

then iterates over the keys of both `self`

and `other`

, and yields the corresponding values to the block. This allows us to implement `subset?`

as follows:

1
2
3
4
5
6

def subset?(other)
compare_with?(other) do |s, o|
s <= o
end
end
alias <= subset?

Definitions for the other common set predicates (`proper_subset?`

, `superset?`

and `proper_superset?`

) are similar, so we omit them here.

### Equivalence

When are two sets `A`

and `B`

the same? Again, there is an answer that works for all three target classes: the two sets should be in a mutual inclusion relation, i.e., `A`

should be a subset of `B`

, and `B`

a subset of `A`

. However, invoking `subset?`

twice seems slightly redundant, since in the worst case, this amounts to performing every comparison twice. Using the `compare_with?`

method defined above, we can more simply write the following code:

1
2
3
4
5
6
7

def equivalent?(other)
compare_with?(other) do |s, o|
s == o
end
end
alias == equivalent?
alias eql? equivalent?

Notice that we have aliassed the `equivalent?`

method both as `==`

and `eql?`

. We have already taken the `==`

method for granted in some earlier snippets. Now what about `eql?`

? This brings us to our final topic for today:

### Nested sets

Unless overridden, the `Object#eql?`

method considers two objects to be the same if they are identical (i.e., are stored at the same location in memory). In the current context, overriding `Object#eql?`

is critical, because `eql?`

is the method that Ruby uses when accessing hash keys. Let’s leave the context of our `SetLike`

module for a moment, and consider this line of Ruby code:

1

some_hash[some_obj]

When executing this line, Ruby will check if there is a key `key`

to be found in `some_hash`

with the property that `some_obj.eql?(key)`

. If so, the value for `key`

will be returned.

For our purposes, this process of looking up keys in a hash is crucial because (1) we are using hashes for storing set membership information, and (2) we would like to be able to model *nested* sets, which are sets that have sets among their keys. Consider:

1
2
3
4
5
6
7
8
9

# SetMap::[](*list) creates a new set instance and turns the
# members of `list` into keys of the new instance.
set1 = ClassicalSet[1, 2, 3]
set2 = ClassicalSet[1, 2, 3]
set3 = ClassicalSet[set1, 4]
set4 = ClassicalSet[set2, 4]
set3 == set4 #=> ?

Now the question is whether `set3`

and `set4`

are the same set. It seems that the answer should be yes, because, after all, they *contain the same elements*. Assume, however, for a moment that we had not overridden `eql?`

. Then `set3`

and `set4`

do *not* come out the same in the sense of `==`

because `set1`

and `set2`

do not reference the same object, which implies that `set3[set1] == set4[set1]`

will return false, for the simple reason that `set1`

is not considered a key in the set `other`

, since it is not the case that `set2.eql?(set1)`

. So overriding `eql?`

, like we did above, is indeed critical.

### The `hash`

method

As a final aside: For our set comparisons to work, we also need to override the `Object#hash`

method so as to ensure that two set objects that are `eql?`

have the same return value when `hash`

is called on them. This can be achieved, e.g., like this:

1
2
3

def hash
each.map(&:hash).sum
end

Here, we simply map each key-value pair (a two-element array) yielded by `each`

to its `hash`

value and sum up the result, trusting that `Array#hash`

is implemented in a meaningful way. And this indeed ensures that `eql?`

sets have the same `hash`

return value.

## Conclusion

What has been achieved? As mentioned at the beginning of part 01, the set functionality that we have discussed is either readily available as part of Ruby’s Standard Library (for classical sets), or via an easily accesible Ruby gem (for multisets). However, the code presented here presents a *uniform* perspective on classical sets and multisets. While I have written `SetMap`

and its child classes as an exercise for myself, I consider this uniformity an advantage over the pre-existing implementation. We have also seen how easily the approach generalizes to further use cases by considering fuzzy sets.

While coming up with a first working implementation of the types of sets discussed here was pretty straightforward, arriving at a way to structure and modularize my code that I found convincing myself required me to go back to the drawing board several times. If you have a chance to check out my code, your feedback would be greatly appreciated.