Indexing XML: What would you do?
In a recent project, one of my duties is to make a huge number of XML documents searchable. I have dealt with XML in several situations and used parts of XPath, XSL and co., but this is a new challenge. By now I have thought about what to do, without a clear result, so I'm asking the public for opinions.
Let me introduce the situation a bit more. I'm dealing with basically 5 different XML structures, of which about 10,000 to 100,000 files or more have to become searchable. For that, I mainly need fulltext indexes on the contents of single tags and/or multiple tags at once. Search should support phrase and boolean expressions. The XML files may be updated about once a day (only small parts per update).
So, my ideas to manage that are (so far):
1. Using an XML database
I have to admit that I a) have never dealt with one so far and b) have never dealt with XQuery, although that doesn't seem so hard. The problem is that using an XML DB would mean a large learning effort (setup, connecting to it from PHP, query syntax, ...).
2. Parsing the XML into a DB (same structure)
That would be the easiest option, I guess: setting up 5 tables with the columns I'd like to index and pumping all the data into them. But that's not really the purpose of XML, and it makes extending the searchable data a mess.
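To make option 2 concrete, here is a minimal sketch in Python with SQLite (not the PHP setup of the project). The `article` table and the `title` and `body` tags are hypothetical stand-ins for one of the 5 structures, and the `LIKE` query is only a placeholder for a proper fulltext index.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; the tag names are illustrative only.
SAMPLE = (
    "<article id='a1'><title>Indexing XML</title>"
    "<body>Full text search on tags</body></article>"
)

def load_flat(conn, xml_text):
    """Copy the tags chosen for indexing into fixed columns of one table."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO article (doc_id, title, body) VALUES (?, ?, ?)",
        (
            root.get("id"),
            root.findtext("title", default=""),
            root.findtext("body", default=""),
        ),
    )

conn = sqlite3.connect(":memory:")
# One table per XML structure; every searchable tag needs its own column.
conn.execute("CREATE TABLE article (doc_id TEXT PRIMARY KEY, title TEXT, body TEXT)")
load_flat(conn, SAMPLE)

# Substring match as a stand-in for a real fulltext query.
rows = conn.execute(
    "SELECT doc_id FROM article WHERE body LIKE ?", ("%text search%",)
).fetchall()
```

The sketch also shows the drawback: making another tag searchable means changing both the table definition and the loader.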
3. Parsing XML into a DB (tree structure)
One idea would be to create an abstract database layout that mirrors the tree structure of XML data in general and to pump all the data into it. That will become a really huge table, which may cause really horrible performance, and getting the data back is quite uncomfortable. Compared to 2., the extensibility problem is solved here, but it's still not the point of XML.
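A rough sketch of what such a generic layout could look like, again in Python with SQLite. The `node` table and its columns are assumptions made for illustration; attributes and text that follows a child element are ignored to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store_tree(conn, doc_id, elem, parent_id=None):
    """Recursively insert one row per element, linked to its parent row."""
    cur = conn.execute(
        "INSERT INTO node (doc_id, parent_id, tag, text) VALUES (?, ?, ?, ?)",
        (doc_id, parent_id, elem.tag, (elem.text or "").strip()),
    )
    for child in elem:
        store_tree(conn, doc_id, child, cur.lastrowid)

conn = sqlite3.connect(":memory:")
# A single table holds every element of every document, whatever its structure.
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, doc_id TEXT, "
    "parent_id INTEGER, tag TEXT, text TEXT)"
)
root = ET.fromstring(
    "<article><title>Indexing XML</title>"
    "<section><para>tree layout</para></section></article>"
)
store_tree(conn, "a1", root)

node_count = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
hits = conn.execute(
    "SELECT doc_id, tag FROM node WHERE text LIKE ?", ("%tree%",)
).fetchall()
```

Even this tiny document produces four rows, which hints at how quickly the table grows, and rebuilding a document means walking the `parent_id` links back up.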
4. Using a real search engine
That's also a nice idea, but I'm still missing a tool that does what I want. I could simply run an indexer on the files and query its search engine for results. The problem is that my data is neither plain text nor HTML (both supported by most search engines), but XML. I took a short look at several standard search engines (htdig, mnoGoSearch, ...), but they all seem to index either plain text or HTML (including ranking techniques that take e.g. the title tag into account). The difficulty is getting these engines to search the text of specific tags, to ignore the tags themselves and so on...
So, what do you say? I think options 2. and 3. are the worst and ugliest of these. 4. would be my preference, if possible. Further ideas? Comments? Tips? Thanks in advance!
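To illustrate what I would want from such an engine, here is a toy in-memory index in Python that is keyed by tag, ignores the markup itself and answers boolean AND and phrase queries. It is not how htdig or mnoGoSearch work; the `TagIndex` class and all its method names are made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

class TagIndex:
    """Toy positional index: (tag, term) -> doc_id -> word positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, xml_text):
        offsets = defaultdict(int)  # next free position per tag in this document
        for elem in ET.fromstring(xml_text).iter():
            terms = tokenize(elem.text or "")
            base = offsets[elem.tag]
            for i, term in enumerate(terms):
                self.postings[(elem.tag, term)][doc_id].append(base + i)
            # Leave a gap so a phrase cannot span two elements with the same tag.
            offsets[elem.tag] = base + len(terms) + 1

    def docs_with(self, tag, term):
        return set(self.postings.get((tag, term.lower()), {}))

    def search_and(self, tag, *terms):
        """Boolean AND: documents whose tag contains every term."""
        sets = [self.docs_with(tag, t) for t in terms]
        return set.intersection(*sets) if sets else set()

    def search_phrase(self, tag, phrase):
        """Documents whose tag contains the terms as consecutive words."""
        terms = tokenize(phrase)
        hits = set()
        for doc_id in self.search_and(tag, *terms):
            positions = [set(self.postings[(tag, t)][doc_id]) for t in terms]
            if any(
                all(p + i in positions[i] for i in range(len(terms)))
                for p in positions[0]
            ):
                hits.add(doc_id)
        return hits

idx = TagIndex()
idx.add("d1", "<doc><title>Indexing XML files</title>"
              "<body>search engines index plain text</body></doc>")
idx.add("d2", "<doc><title>XML and indexing</title>"
              "<body>plain HTML text</body></doc>")

and_hits = idx.search_and("body", "plain", "text")
phrase_hits = idx.search_phrase("body", "plain text")
title_hits = idx.search_phrase("title", "indexing xml")
```

Here both documents match the AND query on `body`, but only `d1` matches the phrase. Searching several tags at once would be a union over per-tag results. Persistence, ranking and the daily incremental updates are exactly the parts a real engine would have to bring along.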
Comments