Cover photo for post Python. Good, bad, evil -2-: Native sets

Python. Good, bad, evil -2-: Native sets

One thing I really liked in Python is the data structure set, which is built into the language core. A set in Python is defined as known from mathematics and beside mathematical use cases, it is helpful in everyday programming live. In this episode of my series on Python, I give you a short insight into how sets work in there, how you can achieve similar results in PHP and why I like the Python approach.

This entry is part of my series Python. Good, bad, evil, where I discuss my experiences with Python from a PHP developers point of view. Find references to the successor episodes in the trackbacks of this post. The previous episodes are:

  1. Missing braces

A set is one of the most fundamental mathematics concepts: A collection of arbitrary, distinct objects - i.e. each object can only occur once in a set. Several standard operations are defined on sets, like the union and intersection of two or more sets. Just to know what we're talking about from now on. Wikipedia can tell you more about sets, if you are interested.

Sets in PHP

PHP does not offer the data structure set natively. As you know, there are basically arrays and objects as data structures in PHP. That means, the only way is to emulate sets by making use of one of these structures. The SplObjectStorage class, being part of PHP since some time, delivers a quite good basis for this task. As the name indicates, it only works with objects:

<?php $set = new SplObjectStorage(); $o1 = new stdClass(); $o1->id = 23; $o2 = new stdClass(); $o2->id = 42; $o3 = new stdClass(); $o3->id = 23; $set->attach( $o1 ); $set->attach( $o2 ); echo "The set contains ", count( $set ), " objects\n"; $set->attach( $o1 ); echo "The set contains ", count( $set ), " objects\n"; $set->attach( $o3 ); echo "The set contains ", count( $set ), " objects\n"; ?>

This simple example adds 3 objects to an SplObjectStorage. The first and second echo statements print The set contains 2 objects, while the third one indicates three contained elements. Note that although $o1 and $o3 are of the same class and have the same attribute values, they are still distinct objects!

So far, SplObjectStorage mimics the behavior of a set quite fine. But not all common operations on sets are supported by SplObjectStorage. The only two supported ones are union and subtraction (i.e. set-theoretic difference), while for both the PHP code does not reflect the mathematical intuition:

<?php // … $o1, $o2 and $o3 … $set1 = new SplObjectStorage(); $set2 = new SplObjectStorage(); $set1->attach( $o1 ); $set1->attach( $o2 ); $set2->attach( $o2 ); $set2->attach( $o3 ); echo "Set 1 contains ", count( $set1 ), " objects\n"; echo "Set 2 contains ", count( $set2 ), " objects\n"; $set3 = clone $set1; $set3->addAll( $set2 ); echo "Set 3 contains ", count( $set3 ), " objects\n"; $set4 = clone $set1; $set4->removeAll( $set2 ); echo "Set 4 contains ", count( $set4 ), " objects\n"; ?>

This example creates two SplObjectStorage instances whith different combinations of the objects $o1, $o2 and $o3, which you already saw before. $set1 is cloned to avoid manipulating the original version [1] . This is against the intuition of working with sets, where a binary operation on sets would result in a new set.

All objects contained in $set2 are now added to $set3, using the addAll() method. Since an object can only occur once in an SplObjectStorage, this $set3 contains three objects after this operation, not 4 as one could expect. That means addAll() actually performs a union, while it manipulates the original set and does not create a new one for the result.

From $set4, another clone of $set1, all objects contained in $set2 are removed. The result is a set with only one element. This is basically subtraction of sets, while the original set is again manipulated.

The shown behavior is not exactly what you expect from operations on sets. As already mentioned, the original sets are modified by the methods, while one would expect from set theory, that a new set is created. However, the method names indicate this behavior quite fine. Second, the number of operations is quite limited. But on basis of these two methods, it is possible to implement all common set operations. I created a little example in my php-snippets repository on Github to show you how that could work. A usage example can also be found there, together with the SplObjectStorage example seen above.

Sets in Python

As noted before, Python provides a native set data structure. The huge advantage against PHP is, that Python sets also work with scalars, not only with objects as the SplObjectStorage approach presented before. But let's first look into examples which are comparable to the last ones:

class Foo: id = None def __init__(self, id): self.id = id mySet = set() o1 = Foo(23) o2 = Foo(42) mySet.add(o1) mySet.add(o2) print "The set contains ", len(mySet), " objects.\n" mySet.add(o1) print "The set contains ", len(mySet), " objects.\n" mySet.add(o3) print "The set contains ", len(mySet), " objects.\n"

The above shown code has the exact same output as the first PHP example. A set is created and three different objects are stored in it. Note again, that o1 and o3 are of the same class and have the same attribute values. Still, they are distinct objects and therefore do not count as duplicates in the set. [2]

The next code example shows the first differences to PHP:

# … o1, o2 and o3 … set1 = set() set2 = set() set1.add(o1) set1.add(o2) set2.add(o2) set2.add(o3) print "Set 1 contains ", len(set1), " objects.\n" print "Set 2 contains ", len(set2), " objects.\n" set3 = set1 | set2 print "Set 3 contains ", len(set3), " objects.\n" set4 = set1 - set2 print "Set 4 contains ", len(set4), " objects.\n"

As you can see, it is possible to manipulate sets through operators in Python, not only using methods. But the methods union() and difference() exist on the set objects, too. The second difference is, that these operations conform to the mathematical intuition, in the sense that they create a new set instead of manipulating an existing one. Still, using the operators |= and -= you can perform an assignment in the same operation.

Python also provides other basic set operations natively. As can be seen in the following example, intersection and symmetric difference are for example supported:

# … continued … set5 = set1 & set2 print "The intersection of set 1 and 2 contains ", len(set5), " objects.\n" set6 = set1 ^ set2 print "The symmetric difference of set 1 and 2 contains ", len(set6), " objects.\n"

To achieve these in PHP you need to implement user land code. However, sets in Python support one thing which is not possible in PHP, if you work on basis of SplObjectStorage: Storing scalar values in sets.

numberSet1 = set() numberSet2 = set() numberSet1.add(23) numberSet1.add(42) numberSet2.add(42) numberSet2.add(1337) print "Set 1 contains ", len(numberSet1), " integers.\n" print "Set 2 contains ", len(numberSet2), " integers.\n" print "Union of set 1 and 2 contains ", len(numberSet1 | numberSet2), " integers.\n" print "Intersection of set 1 and 2 contains ", len(numberSet1 & numberSet2), " integers.\n"

Conclusion

I used sets extensively in my little Pagger app, to keep track of genre strings. The tool retrieves genres from different sources (Last.FM and Freebase), merges the result and maps them to known genres if possible. Using set operations, I can easily determine mapped genres, unmapped ones and so on. Once you get used to that such a data structure is available, you find more and more applications for it.

A compliment from my side for sets in Python. Sets are a very useful data structure in Python and I really wish we'd have them in PHP, too. However, you can still write a user land implementations of sets in PHP. If you only need to store objects, you can base it on SplObjectStorage as shown above. Otherwise you need to work on basis of arrays. This is more work and probably less performant.

What you cannot achieve (without nasty extension magic) is the use of PHPs operators to manipulate sets. This makes PHP code that works with sets appear longer, but it is no real drawback. Some people might even argue that calling methods is more expressive than using operators. I don't share that opinion in this case, mostly due to a mathematical background from computer science studies, I suppose.

1
No deep clone. None of the contained objects is cloned.
2
You can implement the __eq__() and __hash__() methods on your Python class to change the meaning of distinctness here. An example for this are String objects, which are recognized to be the same, if they contain the same string.

Update 2010-03-19

Let me clarify some things, that might be misleading in this blog entry:

I do not say that you can solve any problem with Python that cannot be solved on PHP. This is simply not true, since both languages are turing complete, which means that both can be used to calculate any problem that is computable.

In addition, I do not say that there is no way in PHP to solve the particular problems tackled by a set data structure. PHP simply does not provide such on its own, so you need to implement it yourself. One way to do this is shown above, using SplObjectStorage. Other ways involve array and functions like array_unique(), array_intersect() and other array functions. Surely you can implement a PHP class which encapsulates such operations, so you get something that is close to Pythons set data structure.

However, you cannot overload operators in PHP without the specific, nasty extension from PECL, mentioned above. That means, you cannot manipulate your custom set objects intuitively, using operators. I simply like the fact the Python offers such a useful data structure natively and I like how you can work with it. I can also be presumed, that a generic set data structure in the core or as an extension would be faster than any such user land implementation.

Comments

Python is better than PHP!1!!

Arno at 2010-03-18

well, for the functionality of Pagger that is described in this post a a couple of calls to array_intersect would be sufficient enough, wouldn't it?

Tomek at 2010-03-18

The point here is not if you can realize a certain task or not. Both languages allow you to deal greatfully with such a task without much hassle.

Anyway, the solution with real sets is much easier to read than using array_unique(), array_intersect() and friends all over the place. A set also keeps constantly track of distinctiveness, which you can only achieve in PHP with multiple calls to array_unique() or using nasty tricks.

However, in the end it comes down to taste, as so often. :)

Toby at 2010-03-18

I've never used SplObjectStorage, but will now consider it when trying to solve similar problems. So far, I've been using arrays and array functions to do the same sort of thing. Although not quite the same, you can use array keys to ensure some form of uniqueness in the array.

David at 2010-03-19

Exactly this is what I meant with "nasty tricks". ;)

I tend to use array keys only (with a dummy value like true), when I need to maintain a unique set of scalars. Forcing uniqueness on identifiable objects also works fine through array keys. Even better than SplObjectStorage, if you can have multiple instances with the same identity.

Thanks for mentioning!

Toby at 2010-03-19

Maybe I'm overlooking some problem, but I see two alternative approaches in PHP. One is to use arrays with the set members as keys. This way the values are automatically unique, but it only works with scalar values. The other way is to write your own set class to encapsulate and hide all that array function nastiness.

Dagfinn Reiersøl at 2010-03-19