Posted by Peter Vanbroekhoven on Jun 26, 2010
A client recently challenged us to build, in a few days' time, a scalable full-text search across more than one SharePoint system. We always like a challenge, and since we see more and more clients making the move to SharePoint, the experience will serve us well. So we accepted.
The setup we used is really simple. We use ActiveSP, our home-made Ruby library for talking to SharePoint through its web services. Every indexing operation is placed on a queue, which is polled by the index agent that incrementally builds the index. We provide a web interface to configure which SharePoint document libraries or lists are to be indexed and to perform a full-text search on the indexed documents.
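A minimal sketch of this queue-and-agent flow might look like the following. It is not the production code: Ruby's built-in in-process `Queue` stands in for the real message queue, and `IndexOp`, `ToyIndex`, and `run_agent` are hypothetical names invented for illustration.

```ruby
require "set"

# Hypothetical operation record; action is :add or :delete.
IndexOp = Struct.new(:action, :doc_id, :text)

# Toy in-memory inverted index standing in for the real search index.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, term| hash[term] = Set.new }
    @terms_by_doc = {}
  end

  def add(doc_id, text)
    delete(doc_id) # re-indexing replaces the previous version
    terms = text.downcase.scan(/\w+/).uniq
    terms.each { |term| @postings[term] << doc_id }
    @terms_by_doc[doc_id] = terms
  end

  def delete(doc_id)
    (@terms_by_doc.delete(doc_id) || []).each do |term|
      @postings[term].delete(doc_id)
    end
  end

  def search(term)
    @postings.fetch(term.downcase, Set.new).to_a.sort
  end
end

# Drains the queue, applying each operation to the index in order.
def run_agent(queue, index)
  until queue.empty?
    op = queue.pop
    case op.action
    when :add    then index.add(op.doc_id, op.text)
    when :delete then index.delete(op.doc_id)
    end
  end
end

queue = Queue.new
queue << IndexOp.new(:add, "doc-1", "Quarterly sales report")
queue << IndexOp.new(:add, "doc-2", "Sales meeting notes")
queue << IndexOp.new(:delete, "doc-1", nil)

index = ToyIndex.new
run_agent(queue, index)
p index.search("sales") # prints ["doc-2"]
```

Because each operation is applied on its own, the index is updated incrementally and never needs to be rebuilt as a whole.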
The indexer supports PDF, MS Office, HTML, XML, and a few miscellaneous formats (e.g., image metadata). In our test setup there is a one-minute delay between changes in SharePoint and updates to the index, but other SharePoint configurations may specify larger minimum intervals between requests for what has changed. The indexer only indexes what the configured user can access; no other security is enforced.
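One way to picture the per-format handling is a lookup table from file extension to a text extractor. The sketch below is an assumption, not the actual indexer: `EXTRACTORS` and `extract_plain_text` are hypothetical names, the tag stripper is deliberately crude, and formats such as PDF or MS Office would need dedicated extraction libraries that are not shown here.

```ruby
# Crude markup stripper, for illustration only; a real agent would use a proper parser.
STRIP_MARKUP = ->(raw) { raw.gsub(/<[^>]*>/, " ").split.join(" ") }

# Hypothetical mapping from file extension to an extractor.
EXTRACTORS = {
  ".html" => STRIP_MARKUP,
  ".xml"  => STRIP_MARKUP
}.freeze

# Returns the plain text for a supported format, or nil when no extractor is registered.
def extract_plain_text(filename, raw)
  extractor = EXTRACTORS[File.extname(filename).downcase]
  extractor && extractor.call(raw)
end

puts extract_plain_text("page.html", "<h1>Budget</h1><p>Draft  2010</p>")
# prints: Budget Draft 2010
```

Returning nil for unknown formats lets the agent skip a document instead of failing the whole indexing run.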
Scaling this setup comes down to three parts:
- Scaling the index: we use the Lucene-based Solr, which has several options for optimizing and scaling.
- Scaling the queue: we use Apache ActiveMQ, which scales very well and can accommodate large numbers of consumers.
- Scaling the index agent: we can start many index agents on different machines, enabling us to index many documents in parallel.
In a single-machine setup, the index agent is the definite bottleneck. It needs to extract a plain-text version of the content, which is more time-consuming than updating the index or working with the queue. This matters most when initially indexing a SharePoint system from scratch, and much less when incrementally indexing changes, except during peak activity. That makes it an ideal use case for cloud computing.
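The idea of several index agents draining one shared queue can be sketched as follows. This is only an analogy under stated assumptions: threads inside one Ruby process stand in for agents on separate machines, the built-in `Queue` stands in for the message queue, and `extract_text` and `index_in_parallel` are hypothetical names.

```ruby
# Hypothetical stand-in for the slow plain-text extraction step.
def extract_text(doc_id)
  "plain text of #{doc_id}"
end

# Starts agent_count workers that all consume from the same work queue.
def index_in_parallel(doc_ids, agent_count)
  work = Queue.new
  doc_ids.each { |id| work << id }
  work.close # once drained, pop returns nil and the workers stop

  results = Queue.new
  agents = Array.new(agent_count) do |agent_number|
    Thread.new do
      while (doc_id = work.pop)
        results << [doc_id, extract_text(doc_id), agent_number]
      end
    end
  end
  agents.each(&:join)

  Array.new(results.size) { results.pop }
end

processed = index_in_parallel(%w[a b c d e f], 3)
puts processed.map(&:first).sort.inspect
```

Each document is handled by exactly one agent, so adding agents spreads the extraction work without any coordination beyond the queue itself.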
We have also looked at monitoring. Our preferred choice for monitoring nowadays is New Relic RPM. It can monitor various aspects of Ruby/Rails and Java applications, and it has recently gained the ability to monitor Solr indexes as well. The only part that is currently unmonitored is the queue. The queue is far from the bottleneck in our setup, but it is something we will need to look into.
We are definitely not the only ones interested in full-text searching SharePoint. Since its acquisition of FAST, Microsoft has its own enterprise search solution. Since we have only a little experience with FAST (from Documentum), we would like to know if you, our valued readers, have any experience with FAST search on SharePoint. What are its features? What are its limitations? How well does it scale? How does it handle security?