schlitt.info - php, photography and private stuff
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:Author: Tobias Schlitt
:Date: Tue, 09 Feb 2010 13:49:25 +0100
:Revision: 3
:Copyright: CC by-nc-sa
===================
Heap, heap, hooray!
===================
:Keywords: php, data structure, performance, heap, top k
:Description:
Presenting an efficient way to extract the top/bottom k elements of a large
amount of data using a max heap in PHP.
:Abstract:
I recently had the problem that I wanted to retrieve the smallest items from a
stream of data. When talking about a stream here, I refer to a data set that I
do not want to load into memory completely, since it has quite a few elements.
The best way to process such data is a stream approach, where you work always
on a single item at a time, iteratively, without loading the full data set.
In my special case, I had a database with 140,000 records. The processing of
these records could not happen in the DB, since I needed to create vectors
from text and perform calculation on these. Basically, I needed to check each
vectors distance to a reference vector and keep only the k closest ones.
So, what is a good approach to solve such a task? I decided to implement a
custom data structure based on a **max heap** to solve the problem. In this
article, I present the solution and compare it to two different other
approaches in terms of a small benchmark.
I recently had the problem that I wanted to retrieve the smallest items from a
stream of data. When talking about a stream here, I refer to a data set that I
do not want to load into memory completely, since it has quite a few elements.
The best way to process such data is a stream approach, where you work always
on a single item at a time, iteratively, without loading the full data set.
In my special case, I had a database with 140,000 records. The processing of
these records could not happen in the DB, since I needed to create vectors
from text and perform calculation on these. Basically, I needed to check each
vectors distance to a reference vector and keep only the k closest ones.
So, what is a good approach to solve such a task? I decided to implement a
custom data structure based on a **max heap** to solve the problem. In this
article, I present the solution and compare it to two different other
approaches in terms of a small benchmark.
--------------
Naive approach
--------------
The naive solution would neither consider speed nor memory consumption:
1. Store all vectors with their distance values in an array.
2. Sort that array in ascending order.
3. Keep the k first items of the sorted array.
I wrote a little `benchmark for the naive approach`__, to compare its
performance to my final solution. The important code looks basically like
this::
list …
usort(
$this->list,
function ( $a, $b )
{
return $a->compareTo( $b );
}
);
array_splice( $this->list, 10 );
?>
__ http://github.com/tobyS/php-snippets/blob/master/datastructures/examples/bottom_k/test_naive.php
------------------
Maintaining a list
------------------
As expected, the naive approach has a rather high memory consumption. The speed
performance cannot be judged at this point, since there is no reference. So,
what would be a better approach to reduce memory consumption? Constantly
maintaining a list of the k smallest elements would be an idea. That means:
1. Constantly add elements to an array.
2. If the array gets to large, sort it and only keep the k smallest.
A `benchmark script for the list maintaining approach`__ is available, too. As
expected, the memory consumption is lower.
__ http://github.com/tobyS/php-snippets/blob/master/datastructures/examples/bottom_k/test_maintain_list.php
-------------------
Unleashing the heap
-------------------
Still, my computer scientist intuition would not be satisfied with the previous
approach. Continuously sorting an array is not a fast approach. But there exist
approaches for fast sorted access to elements. The probably most common one is
using a heap__.
__ http://en.wikipedia.org/wiki/Heap_%28data_structure%29
A heap is a tree data structure, which conforms to the so-called *heap
condition*. This condition basically ensures that the largest (for a max-heap,
smallest for a min-heap) element always resides at the top. The SPL__ extension
luckily provides classes for both heap variants.
__ http://php.net/spl
To realize maintaining a list of the minimal k elements, I used a derived max
heap, which is limited to k elements::
limit = (int) $limit;
}
public function compare( $a, $b )
{
if ( !( $a instanceof Comparable ) )
{
throw new RuntimeException(
'Value a must implement \\ts\\Datastructures\\Comparable.'
);
}
return $a->compareTo( $b );
}
public function insert( $value )
{
parent::insert( $value );
while ( count( $this ) > $this->limit )
{
$this->extract();
}
}
}
?>
First of all, to maintain custom elements in a heap, you need to overwrite the
``compare()`` method. This method is used in the heap to compare two elements
``$a`` and ``$b``. It returns a positive integer value, if ``$a`` is greater
than ``$b``, a negative value if ``$a`` is smaller than ``$b`` and 0 if both
objects are equal. To make my implementation generic, I defined an interface
``Comparable`` which objects stored in the ``BottomK`` structure must
implement.
Whenever an element is inserted into the heap, the ``insert()`` method checks
if the maximum number of elements is reached. If this is the case, the top most
element is extracted and discarded. Since the class extends a SplMaxHeap__,
this is always the biggest element, i.e. the k smallest elements are always
kept in the heap. Obviously, this results in a data structure which never
contains more than k elements.
__ http://php.net/manual/en/class.splmaxheap.php
A nice side effect with the heap implementation is, that it natively implements
the ``Countable`` and ``Iterator`` interfaces, so you can comfortably iterate
the desired elements at any time.
----------
Statistics
----------
As said before, I implemented a tiny benchmark [#]_ to see how the selected
approach performs. Find the numbers below:
+----------------+-----------------------+----------------------------+
| | Time | Memory |
+================+=======================+============================+
| Naive | 1.159 (± 0.01) | 2926937.6 (± 134036.264) |
+----------------+-----------------------+----------------------------+
| Maintain list | 2.699 (± 0.049) | 5444.8 (± 18.4) |
+----------------+-----------------------+----------------------------+
| Heap | 0.918 (± 0.006) | 5119.2 (± 9.6) |
+----------------+-----------------------+----------------------------+
In the benchmark scenario I generated 10,000 simple item objects that contain a
random integer between 0 and ``PHP_INT_MAX`` and determined the 10 lowest ones.
I repeated each such run 10 times for each implementation. The table above
shows the average and standard deviation values of the runs in terms of time
(micro seconds) and memory (bytes) consumption.
As can be seen, the naive approach is more than twice as fast as constantly
maintaining a sorted array, but also takes 500 times more memory at average. In
addition, the naive approach varies quite much regarding its memory
consumption.
The heap approach is generally the fastest, although not much faster than the
naive approach, and consumes the least amount of memory, although not much
less than continuously maintaining a sorted list. In addition, this approach is
the most stable one, regarding its deviation from average.
----
Code
----
You can find the code of the ``KBottom`` class as well as the benchmarks in my
`code-snippet repository in GitHub`__. I did not put it under a license, yet,
so it's basically *all rights reserved*. However, feel free to copy the code
and use it for educational purposes. ;)
__ http://github.com/tobyS/php-snippets
.. [#] Disclaimer: Benchmarks are never objective! They highly depend on the
computer they are run on and the other processes running! Benchmarks can
never proof performance, but only give slight indicators. To proof
performance, try `computational complexity theory`__.
__ http://en.wikipedia.org/wiki/Computational_complexity_theory
..
Local Variables:
mode: rst
fill-column: 79
End:
vim: et syn=rst tw=79
Trackbacks
==========
Comments
========
- Artur Esjmont at Mon, 08 Feb 2010 13:19:45 +0100
From the title i thought you meant 'parsing xml backwars' : )
Sure some heap or linked list seems like a nartutal solution. Cleanness of
presented approach seems to be in the reuse of standard implementation. I
like it.
ps i assume you were reading entire stream from top to bottom not parts of
it?
Thanks for the article.
art
- Toby at Mon, 08 Feb 2010 13:52:17 +0100
Hi Artur,
indeed, using a Heap here is the natural solution, but I know many
programmers who sadly don't know such data structures.
I'm reading the complete stream, yes. Otherwise I wouldn't get the bottom K
elements from it.
Regards, Toby
- Michael at Wed, 16 Jun 2010 11:18:06 +0200
Hy Toby I tried something similar to your solution about a month ago (by
implementing some sort of ternary tree and a version using a simple list).
Unfortunately building the Heap/Tree takes a lot of time in comparison to
simply pushing objects to a list. What are your experiences e.g. for
building the SPL Heap?
Regards Michael
- No name at Sun, 08 Aug 2010 02:57:41 +0200
Nice article but I guess you should have tried it with really really big
numbers to see the actual difference among your approaches. For ex: Find top
100K among 100 million numbers
- Mamsaac at Sun, 24 Oct 2010 08:13:49 +0200
Michael, while insertion into a heap is O(log2(n)), the extraction is O(1).
If you compare it to the O(1) of insertion and O(nlog2(n)) of pushing and
sorting, you will easily see that using a heap is much much more efficient
=)
- read more at Wed, 10 Apr 2013 14:29:45 +0200
The best way to do this would be to use a linked list; I mean a double
linked list. This way you can easily traverse through the tree using the
nodes and also get the data set one by one without affecting the array set.