Heap, heap, hooray!
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html
I recently had the problem that I wanted to retrieve the smallest items from a stream of data. By a stream, I mean a data set that I do not want to load into memory completely, since it has quite a few elements. The best way to process such data is a streaming approach, where you always work on a single item at a time, iteratively, without loading the full data set.

In my specific case, I had a database with 140,000 records. The processing of these records could not happen in the DB, since I needed to create vectors from text and perform calculations on them. Basically, I needed to check each vector's distance to a reference vector and keep only the k closest ones.

So, what is a good approach to solve such a task? I decided to implement a custom data structure based on a <strong>max heap</strong> to solve the problem. In this article, I present the solution and compare it to two other approaches in a small benchmark.
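The core idea the article describes, a bounded max-heap that keeps only the k closest items while the stream is consumed once, can be sketched roughly as follows. The original solution was PHP built on SPL's heap classes; this is a hedged Python illustration, and the function name `k_closest` and the `distance` callback are my own, not names from the article:

```python
import heapq

def k_closest(stream, reference, k, distance):
    """Keep the k items closest to `reference` from an iterable, using a
    bounded max-heap. heapq is a min-heap, so distances are negated to
    make the *largest* kept distance sit at the root for quick comparison."""
    heap = []  # entries: (-distance, sequence_number, item)
    for i, item in enumerate(stream):
        d = distance(item, reference)
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, item))
        elif -d > heap[0][0]:  # closer than the worst item currently kept
            heapq.heapreplace(heap, (-d, i, item))
    # Sort the survivors by ascending distance before returning them.
    return [item for _, _, item in sorted(heap, reverse=True)]
```

For example, `k_closest([5, 1, 9, 3, 7], 0, 3, lambda a, b: abs(a - b))` returns `[1, 3, 5]`, the three values nearest to 0. The heap never grows past k entries, so memory stays O(k) no matter how long the stream is.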
CC by-nc-sa · Tobias Schlitt <toby@php.net> · Sat, 06 Feb 2010 11:10:40 +0000

Comment at Wed, 10 Apr 2013 14:29:45 +0200
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_6
The best way to do this would be to use a linked list; I mean a doubly linked list. This way you can easily traverse the tree using the nodes and also get the data set one item at a time without affecting the array.

Mamsaac at Sun, 24 Oct 2010 08:13:49 +0200
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_5
Michael, while insertion into a heap is O(log2(n)), peeking at the top element is O(1). If you compare that to O(1) list insertion followed by an O(n log2(n)) sort, you will easily see that using a heap is much much more efficient =)

No name at Sun, 08 Aug 2010 02:57:41 +0200
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_4
Nice article, but I guess you should have tried it with really, really big numbers to see the actual difference among your approaches. For example: find the top 100K among 100 million numbers.

Michael at Wed, 16 Jun 2010 11:18:06 +0200
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_3
Hi Toby,

I tried something similar to your solution about a month ago (by implementing a sort of ternary tree and a version using a simple list). Unfortunately, building the heap/tree takes a lot of time compared to simply pushing objects onto a list. What are your experiences, e.g. for building the SPL heap?

Regards, Michael

Toby at Mon, 08 Feb 2010 13:52:17 +0100
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_2
Hi Artur,
indeed, using a Heap here is the natural solution, but I know many programmers who sadly don't know such data structures.
I'm reading the complete stream, yes. Otherwise I wouldn't get the bottom K elements from it.
Regards,
Toby

Artur Esjmont at Mon, 08 Feb 2010 13:19:45 +0100
http://schlitt.info/opensource/blog/0718_heap_heap_hooray.html#comment_1
From the title I thought you meant 'parsing XML backwards' : )

Sure, some heap or linked list seems like a natural solution. The cleanness of the presented approach seems to lie in the reuse of a standard implementation. I like it.
PS: I assume you were reading the entire stream from top to bottom, not parts of it?
Thanks for the article.
art
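The complexity trade-off discussed in the comments above (pushing each item through a bounded heap as you stream, versus appending everything to a list and sorting at the end) can be sketched like this. Again a Python illustration with made-up function names, not code from the article or the commenters:

```python
import heapq
import random

def k_smallest_via_heap(values, k):
    # Bounded max-heap of size k (values negated, since heapq is a min-heap):
    # O(n * log k) time, O(k) memory -- the heap never grows past k entries.
    heap = []
    for v in values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif -v > heap[0]:  # v is smaller than the largest value we kept
            heapq.heapreplace(heap, -v)
    return sorted(-x for x in heap)

def k_smallest_via_sort(values, k):
    # Collect everything, then sort: O(n) memory, O(n * log n) time.
    return sorted(values)[:k]

if __name__ == "__main__":
    data = [random.random() for _ in range(100_000)]
    assert k_smallest_via_heap(data, 100) == k_smallest_via_sort(data, 100)
```

Both return the same result; the difference is that the heap version keeps only k items in memory and does its O(log k) work per item, which is exactly why it suits a stream that is too large to hold in memory.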