Heap, heap, hooray!

2010-02-09

Tobias Schlitt

php, data structure, performance, heap, top k

Presenting an efficient way to extract the top/bottom k elements of a large amount of data using a max heap in PHP.

I recently had the problem that I wanted to retrieve the smallest items from a stream of data. When talking about a stream here, I refer to a data set that I do not want to load into memory completely, since it has quite a few elements. The best way to process such data is a stream approach, where you work always on a single item at a time, iteratively, without loading the full data set.

In my special case, I had a database with 140,000 records. The processing of these records could not happen in the DB, since I needed to create vectors from text and perform calculation on these. Basically, I needed to check each vectors distance to a reference vector and keep only the k closest ones.

So, what is a good approach to solve such a task? I decided to implement a custom data structure based on a max heap to solve the problem. In this article, I present the solution and compare it to two different other approaches in terms of a small benchmark.

Naive approach

The naive solution would neither consider speed nor memory consumption:

Store all vectors with their distance values in an array.
Sort that array in ascending order.
Keep the k first items of the sorted array.

I wrote a little benchmark for the naive approach, to compare its performance to my final solution. The important code looks basically like this:

<?php

// … generate items and store them in $this->list …

usort(
    $this->list,
    function ( $a, $b )
    {
        return $a->compareTo( $b );
    }
);
array_splice( $this->list, 10 );
?>

Maintaining a list

As expected, the naive approach has a rather high memory consumption. The speed performance cannot be judged at this point, since there is no reference. So, what would be a better approach to reduce memory consumption? Constantly maintaining a list of the k smallest elements would be an idea. That means:

Constantly add elements to an array.
If the array gets to large, sort it and only keep the k smallest.

A benchmark script for the list maintaining approach is available, too. As expected, the memory consumption is lower.

Unleashing the heap

Still, my computer scientist intuition would not be satisfied with the previous approach. Continuously sorting an array is not a fast approach. But there exist approaches for fast sorted access to elements. The probably most common one is using a heap.

A heap is a tree data structure, which conforms to the so-called heap condition. This condition basically ensures that the largest (for a max-heap, smallest for a min-heap) element always resides at the top. The SPL extension luckily provides classes for both heap variants.

To realize maintaining a list of the minimal k elements, I used a derived max heap, which is limited to k elements:

<?php
class BottomK extends \SplMaxHeap
{
    protected $limit;

    public function __construct( $limit )
    {
        $this->limit = (int) $limit;
    }

    public function compare( $a, $b )
    {
        if ( !( $a instanceof Comparable ) )
        {
            throw new RuntimeException(
                'Value a must implement \\ts\\Datastructures\\Comparable.'
            );
        }
        return $a->compareTo( $b );
    }

    public function insert( $value )
    {
        parent::insert( $value );

        while ( count( $this ) > $this->limit )
        {
            $this->extract();
        }
    }
}
?>

First of all, to maintain custom elements in a heap, you need to overwrite the compare() method. This method is used in the heap to compare two elements $a and $b. It returns a positive integer value, if $a is greater than $b, a negative value if $a is smaller than $b and 0 if both objects are equal. To make my implementation generic, I defined an interface Comparable which objects stored in the BottomK structure must implement.

Whenever an element is inserted into the heap, the insert() method checks if the maximum number of elements is reached. If this is the case, the top most element is extracted and discarded. Since the class extends a SplMaxHeap, this is always the biggest element, i.e. the k smallest elements are always kept in the heap. Obviously, this results in a data structure which never contains more than k elements.

A nice side effect with the heap implementation is, that it natively implements the Countable and Iterator interfaces, so you can comfortably iterate the desired elements at any time.

Statistics

As said before, I implemented a tiny benchmark [1] to see how the selected approach performs. Find the numbers below:

	Time	Memory
Naive	1.159 (± 0.01)	2926937.6 (± 134036.264)
Maintain list	2.699 (± 0.049)	5444.8 (± 18.4)
Heap	0.918 (± 0.006)	5119.2 (± 9.6)

In the benchmark scenario I generated 10,000 simple item objects that contain a random integer between 0 and PHP_INT_MAX and determined the 10 lowest ones. I repeated each such run 10 times for each implementation. The table above shows the average and standard deviation values of the runs in terms of time (micro seconds) and memory (bytes) consumption.

As can be seen, the naive approach is more than twice as fast as constantly maintaining a sorted array, but also takes 500 times more memory at average. In addition, the naive approach varies quite much regarding its memory consumption.

The heap approach is generally the fastest, although not much faster than the naive approach, and consumes the least amount of memory, although not much less than continuously maintaining a sorted list. In addition, this approach is the most stable one, regarding its deviation from average.

Code

You can find the code of the KBottom class as well as the benchmarks in my code-snippet repository in GitHub. I did not put it under a license, yet, so it's basically all rights reserved. However, feel free to copy the code and use it for educational purposes. ;)

1: Disclaimer: Benchmarks are never objective! They highly depend on the computer they are run on and the other processes running! Benchmarks can never proof performance, but only give slight indicators. To proof performance, try computational complexity theory.

Comments

From the title i thought you meant 'parsing xml backwars' : )
Sure some heap or linked list seems like a nartutal solution. Cleanness of presented approach seems to be in the reuse of standard implementation. I like it.
ps i assume you were reading entire stream from top to bottom not parts of it?
Thanks for the article.
art

Artur Esjmont at 2010-02-08

Hi Artur,
indeed, using a Heap here is the natural solution, but I know many programmers who sadly don't know such data structures.
I'm reading the complete stream, yes. Otherwise I wouldn't get the bottom K elements from it.
Regards, Toby

Toby at 2010-02-08

Hy Toby I tried something similar to your solution about a month ago (by implementing some sort of ternary tree and a version using a simple list). Unfortunately building the Heap/Tree takes a lot of time in comparison to simply pushing objects to a list. What are your experiences e.g. for building the SPL Heap?
Regards Michael

Michael at 2010-06-16