Indexing XML: What would you do?

2008-11-19

Tobias Schlitt

In a recent project one of the duties is to make a huge amount of XML documents searchable. I dealt with XML in several situations and used parts of XPath, XSL and Co., but this is a new challange. Be now I have made up my mind and thought about what to do, with so clear result, so I'm trying to ask the public for opinions.

Let me introduce the situation a bit more. I'm dealing with 5 different XML structures (basically) of which about 10.000 to 100.000 files and above have to become searchable. With that, I mainly need fulltext indexes on the contents of single tags and/or multiple tags at once. Search should include phrase and boolean expressions. The XML files maybe updated about once a day (only small parts per update).

So, my ideas to manage that are (so far):

1. Using an XML database

I have to admit that I a) never dealt with that so far and b) never dealt with XQuery, what doesn't seem so heavy. Problematic is, that utilizing an XML DB would mean large efforts in learning (setup, connect to with PHP, call syntax,...).

2. Parsing the XML into a DB (same structure)

That would be the most easiest part, I guess: Setting up 5 tables, that have the columns which I like to index and pumping the whole data into it. That's not the real purpose of XML infact and makes extending the searchable data a mess.

3. Parsing XML into a DB (tree structure)

An idea would be to create an abstract database layout, that mirrors the tree structure of XML data in general and to pump all the data into it. That will become a really huge table what may cause really horrible performance and getting the data back is quite uncomfortable. In comparison to 2. the extensibillity problem is solved here, but still that's not the sense of XML.

4. Using a real search engine

That's also a nice idea, but I'm still missing the tool, which does what I want. I could simply run an indexer engine on the files and query its search engine for results. Problematic is, that I have neither plain text as a format, nor HTML (both supported by most search engines), but XML. I took a short look at several standard search engines (htdig, mnoGoSearch,...) but they all seem to index either plain text or HTML (including rating techniques taking e.g. the title tag into acount). The problem is, to these engines to perform searches on specific text, to ignore the tags themselves and so on...

So, what do you say? I think version 2. and 3. are the worst and ugliest below those. 4. would be my prefference, if possible. Further ideas? Comments? Tips? Thanks in advance!

Comments

Hey Tobias
You know swish-e.org? It's an indexer/search engine which can limit your queries to certain xml tags. It indexes not only xml but also various other formats.
Greets phpwheezl

phpwheezl at 2005-07-31

Reg. 1) For further administration and extending the software this would be my choice - if time and the budget allows it.
Reg. 2)Sounds like an ugly and hard to maintain workaround...
Reg. 3) This indeed could work pretty well... setting up A NestedSet-Table for the structure and a relational table for the content... you could then search the content table or the structure-table (i.e. if the queried string should only be searched in a special path...). Since you have a NestedSet-structure, you bypass the need of doing recursive queries. However, I think this would only pay out if you do not have really tricky structures.
Reg. 4) Naah... looks like a trial-and-error thing to me ;)
-> I'd use 3 if time and money is short and a quick solution is welcome, else 1.

Thorsten at 2005-08-01

In our experience, when working with XML, you will inevitably want to do more with it. So while your current requirement is for searching, I think it would be wise to look down stream a bit to determine any other applications.
If that is the case, I recommend using a native XML DB as the long term benefits will outway the short term learning curve. Sleepycat's DB XML is a great platfrom in which to develop on. Though it can be a little tricky getting it to run with PHP. And it only currently supports Apache 1.33. However, if you want to get some quick experience with a NXD, try http://exist.sourceforge.net/. I'm not sure I'd use it in a server environment (haven't tried yet though), but it is very easy to install, setup and get running. Just point it to your xml files and it does all the work.
Hope that helps.
David Stevens Cheif Architect Conduit IT http://www.conduit-it.com/

David Stevens at 2005-08-01

About swish-e, be aware that it does not give the possibility to search for certain characters, like ++ (or I did not find how to do it). Searching for several words (the terms "red car" side by side) is also barely possible.
Always something to keep in mind.

Arnaud at 2005-08-01

More of a sidenote: You might want to do XSLT translation on your XML data to make htdig/mnogosearch be able to index "HTML" markup. This way you can also apply keyword weighting by using HTML structure tags.
Anyways - please tell us what it went down to in the end. :-)

Garvin at 2005-08-01

You can phrase search with Swish-e -- e.g., find the phrase "red car". You can also specify what qualifies as a valid word character -- e.g., a "+" -- and what qualifies as a word-break character -- e.g. a space. Places it can let you down are when tags over lap -- e.g., "red <b>car</b>" -- but you can work around this. And, although it uses libxml2 it doesn't properly support even UTF8. Otherwise, I would recommend it. It's accurate and fast and it's quite easy to hook into.

lakeheavenearth at 2005-08-02

Is XML a good way to store so much data? Why has there been a desire to replace relational databases with massive quantities of XML? I'd understand if it didn't require so many resources but it's so bloated, ugly and not rationalized. I am happy to be completely wrong as it gives me the oportunity to grow, but it seems like you are having to solve a problem that shouldn't even exist. I recomend that you put all the data in a relational database (mysql) and use some kind of templating engine(smarty) to spit out XML when you need it to transfer data from one aplication to another. Am I a ludite? Maybe. But aren't we solving real world problems? This isn't all about adopting new technology, is it? I'm afraid this isn't very helpful and I appologize for being negative, but I find this whole thing very troublesome.
Did you notice that a solution above involves HTML? That is when you know the crack smoking party has really started. Could you really bring yourself to put your data into HTML format so you can search it? Wow. I love the creativity, but it seems misapplied. Right?
Go team and best of luck and peace and all that.

Anonomous CEO Wus who can't spell at 2005-08-02

You're completly right: I receive that huge amount of XML while receiving data from differenet other applications. But since I already got a data storage format I thought it might be cleaner to just index it and get i transformed to XHTML for viewing. :)

Toby at 2005-08-02

Hmmm... that would make me loose the option to search only insde specific tags, so that's not feasible... :/

Toby at 2005-08-02

A set of XML files is a very, very poor replacement of a relational database, in fact it is not a replacement at all.
Can't you see what has been going on over the past few years? People are trying to re-invent the wheel, and a square wheel at it. First there came XPath as XML's query language, now people are wondering how to index their XML files??
These problems have been solved before: in SQL database systems. There is absolutely no advantage in using XML files instead of a relational database, only disadvantages and more work for the developer.

Captain Proton at 2005-08-02

[quote]That's also a nice idea, but I'm still missing the tool, which does what I want.[/quote]
You are not missing the tool, you are missing the education on databases that will tell you that XML for databases is not a good idea. No XML tool will ever do a better job than even today's SQL databases.

Captain Proton at 2005-08-02

I agree here, but the major problem is, that I will (definitly) have the data basis as XML files at first. The question now is, if I will convert them to a database for dealing with data in my application or if I leave the XML?
Leaving the XML will save me lots of time. Indexing the data directly from XML, too (I can simply run XSL or XML_Transformer or similar on them to get my XHTML representation) and using a search engine to index the files will save me time hacking all those sql-statements for different searches (like phrase-search, boolean logics,...).
I fully agree that having a database in place is a very valid solution to the problem, the issue is if the benefuit spent for bringing a DB in place is worth the effort?

Toby at 2005-08-02

Of course it won't, but that's not the problem described in my article. In fact I'm having XML (nothing will change that) and I wanted to know, how to move on. If importing the XML into a RDBS is the best solution, I'll go for that. :)

Toby at 2005-08-02

We have found that if you're inbound and outbound data format is going to be XML it is best to use native XML tools. We have an advanced product catalog software for the creation of web and print catalogs. It's running entirely on XML for the catalog structure. The full text indexing that a NXD like eXist gives us is very powerful.

David Stevens at 2005-08-03

Christopher Jones from Oracle was so nice to provide me some examples on the use of Oracle XML DB for such cases.
I think this looks like a very convenient way of handling things. Sadly the project I spoke about is open source, so Oracle is not feasible for me here. But maybe it's for you?

Toby at 2005-08-16

www.mnogosearch.org and www.dataparksearch.org both claims support for text/xml indexing, for dataparksearch you need libexpat to be installed at your system, while recent versions of mnogosearch are supports text/xml natively. For dataparksearch you may match any part of document by regex pattern and index it seperately, and then search this section only. Thus, you may test both these engines to select one appropriate.

Maxime at 2005-09-16

BTW, DataparkSearch doesn't need libexpat anymore to index XML or RSS data, it supports them natively.

Maxime at 2009-07-07