- Revision 3
- by Tobias Schlitt
- Tue, 9 Feb 2010
- 13:49:25 +0100

I recently had the problem that I wanted to retrieve the smallest items from a stream of data. When talking about a stream here, I refer to a data set that I do not want to load into memory completely, since it has quite a few elements. The best way to process such data is a stream approach, where you work always on a single item at a time, iteratively, without loading the full data set.

In my special case, I had a database with 140,000 records. The processing of these records could not happen in the DB, since I needed to create vectors from text and perform calculation on these. Basically, I needed to check each vectors distance to a reference vector and keep only the k closest ones.

So, what is a good approach to solve such a task? I decided to implement a custom data structure based on a **max heap** to solve the problem. In this article, I present the solution and compare it to two different other approaches in terms of a small benchmark.

The naive solution would neither consider speed nor memory consumption:

Store all vectors with their distance values in an array.

Sort that array in ascending order.

Keep the k first items of the sorted array.

I wrote a little benchmark for the naive approach, to compare its performance to my final solution. The important code looks basically like this:

```
<?php
// … generate items and store them in $this->list …
usort(
$this->list,
function ( $a, $b )
{
return $a->compareTo( $b );
}
);
array_splice( $this->list, 10 );
?>
```

As expected, the naive approach has a rather high memory consumption. The speed performance cannot be judged at this point, since there is no reference. So, what would be a better approach to reduce memory consumption? Constantly maintaining a list of the k smallest elements would be an idea. That means:

Constantly add elements to an array.

If the array gets to large, sort it and only keep the k smallest.

A benchmark script for the list maintaining approach is available, too. As expected, the memory consumption is lower.

Still, my computer scientist intuition would not be satisfied with the previous approach. Continuously sorting an array is not a fast approach. But there exist approaches for fast sorted access to elements. The probably most common one is using a heap.

A heap is a tree data structure, which conforms to the so-called *heap
condition*. This condition basically ensures that the largest (for a max-heap, smallest for a min-heap) element always resides at the top. The SPL extension luckily provides classes for both heap variants.

To realize maintaining a list of the minimal k elements, I used a derived max heap, which is limited to k elements:

```
<?php
class BottomK extends \SplMaxHeap
{
protected $limit;
public function __construct( $limit )
{
$this->limit = (int) $limit;
}
public function compare( $a, $b )
{
if ( !( $a instanceof Comparable ) )
{
throw new RuntimeException(
'Value a must implement \\ts\\Datastructures\\Comparable.'
);
}
return $a->compareTo( $b );
}
public function insert( $value )
{
parent::insert( $value );
while ( count( $this ) > $this->limit )
{
$this->extract();
}
}
}
?>
```

First of all, to maintain custom elements in a heap, you need to overwrite the `compare()`

method. This method is used in the heap to compare two elements `$a`

and `$b`

. It returns a positive integer value, if `$a`

is greater than `$b`

, a negative value if `$a`

is smaller than `$b`

and 0 if both objects are equal. To make my implementation generic, I defined an interface `Comparable`

which objects stored in the `BottomK`

structure must implement.

Whenever an element is inserted into the heap, the `insert()`

method checks if the maximum number of elements is reached. If this is the case, the top most element is extracted and discarded. Since the class extends a SplMaxHeap, this is always the biggest element, i.e. the k smallest elements are always kept in the heap. Obviously, this results in a data structure which never contains more than k elements.

A nice side effect with the heap implementation is, that it natively implements the `Countable`

and `Iterator`

interfaces, so you can comfortably iterate the desired elements at any time.

As said before, I implemented a tiny benchmark [1] to see how the selected approach performs. Find the numbers below:

Time | Memory | |
---|---|---|

Naive | 1.159 (± 0.01) | 2926937.6 (± 134036.264) |

Maintain list | 2.699 (± 0.049) | 5444.8 (± 18.4) |

Heap | 0.918 (± 0.006) | 5119.2 (± 9.6) |

In the benchmark scenario I generated 10,000 simple item objects that contain a random integer between 0 and `PHP_INT_MAX`

and determined the 10 lowest ones. I repeated each such run 10 times for each implementation. The table above shows the average and standard deviation values of the runs in terms of time (micro seconds) and memory (bytes) consumption.

As can be seen, the naive approach is more than twice as fast as constantly maintaining a sorted array, but also takes 500 times more memory at average. In addition, the naive approach varies quite much regarding its memory consumption.

The heap approach is generally the fastest, although not much faster than the naive approach, and consumes the least amount of memory, although not much less than continuously maintaining a sorted list. In addition, this approach is the most stable one, regarding its deviation from average.

You can find the code of the `KBottom`

class as well as the benchmarks in my code-snippet repository in GitHub. I did not put it under a license, yet, so it's basically *all rights reserved*. However, feel free to copy the code and use it for educational purposes. ;)

- 1
- Disclaimer: Benchmarks are never objective! They highly depend on the computer they are run on and the other processes running! Benchmarks can never proof performance, but only give slight indicators. To proof performance, try computational complexity theory.

If you liked this blog post or learned something, please consider using flattr to contribute back: .

Fields with bold names are mandatory.

## Artur Esjmont

From the title i thought you meant 'parsing xml backwars' : )

Link to commentSure some heap or linked list seems like a nartutal solution. Cleanness of presented approach seems to be in the reuse of standard implementation. I like it.

ps i assume you were reading entire stream from top to bottom not parts of it?

Thanks for the article.

art

## Toby

Hi Artur,

Link to commentindeed, using a Heap here is the natural solution, but I know many programmers who sadly don't know such data structures.

I'm reading the complete stream, yes. Otherwise I wouldn't get the bottom K elements from it.

Regards,

Toby

MichaelHy Toby

Link to commentI tried something similar to your solution about a month ago (by implementing some sort of ternary tree and a version using a simple list). Unfortunately building the Heap/Tree takes a lot of time in comparison to simply pushing objects to a list. What are your experiences e.g. for building the SPL Heap?

Regards Michael

No nameNice article but I guess you should have tried it with really really big numbers to see the actual difference among your approaches. For ex: Find top 100K among 100 million numbers

Link to commentMamsaacMichael, while insertion into a heap is O(log2(n)), the extraction is O(1). If you compare it to the O(1) of insertion and O(nlog2(n)) of pushing and sorting, you will easily see that using a heap is much much more efficient =)

Link to comment## read more

The best way to do this would be to use a linked list; I mean a double linked list. This way you can easily traverse through the tree using the nodes and also get the data set one by one without affecting the array set.

Link to comment