As I've been working on integrating our internal help-desk product PowerHelp with Community Server, one of the big challenges was to figure out how to get our internal wiki to not only display as an integrated part of Community Server 2008, but also to allow users to search our wiki content from the main Community Server search system.
Internally, Community Server's "Enterprise Search" functionality leverages Lucene.Net to handle full-text searches. Lucene (and its .NET port) is very well known in the search world, and Lucene.Net makes a great platform for supporting full-text and metadata searches across almost any kind of content.
After spending some time learning the basics of Lucene and how it works, I dove into Google's full-text index to see if there were any easily accessible examples of how to integrate with Community Server's Enterprise Search. No such luck.
The rest of this article describes what I did to get my data indexed, and some of the techniques I used to discover how Community Server's Enterprise Search interacts with Lucene, so that I could get my data out of the index and onto the screen.
The best information I found in my search was at Marc Mecca's blog, which had some basic information about what class you may have to derive from in order to get your index included in the Community Server index (it's CommunityServer.Enterprise.Search.IndexTask if you're impatient).
Undeterred, and armed with my favorite code spelunking tool (Reflector, now by Red Gate Software), I began my efforts to get my custom wiki content indexed.
At first glance, this doesn't seem so hard, and the implementation of an IndexTask-derived class is pretty simple: override GetDocuments() and return a collection of Lucene Document objects, one for each article you want indexed. Telligent has provided a task scheduler component that will run your indexer (along with the out-of-box indexers). You add your component to the correct configuration file, restart the indexing service, and you're off to the races. Or so it seemed at first.
My first attempt
Being naive in the ways of Lucene.Net and Community Server, I thought I would start by looking at what Telligent did for their Wiki post indexing task and see if I could do something similar to get my data into the index. So, I created my own WikiSearchTask, overrode GetDocuments(), and looped through my data, adding each document to the return collection, something like this:
public class WikiSearchTask : IndexTask
{
    public WikiSearchTask(int settingsid, int itemsToIndex)
        : base(settingsid, itemsToIndex)
    {
    }
    // Called by the indexing service; returns one Lucene Document per wiki article
    public override DocumentCollection GetDocuments()
    {
        DocumentCollection dc = new DocumentCollection();
        foreach (WikiArticle a in GetWikiArticles())
        {
            Document doc = CreateDocument(a);
            dc.Add(doc);
        }
        return dc;
    }
}
Seems simple enough (I've omitted the CreateDocument() method for brevity here - we'll get to that later), and it looked at first glance like it might actually work. I added my type to the tasks.config file in the "Telligent.Tasks for Enterprise Search" folder, like so:
<task name = "ES.SearchJob" type = "CommunityServer.Enterprise.Search.SearchJob, CommunityServer.Enterprise.Search" count = "100" enabled = "true" optimize = "false" enableShutDown = "false">
<!-- <add type = "CommunityServer.Enterprise.Search.WeblogIndexTask, CommunityServer.Enterprise.Search" />
<add type = "CommunityServer.Enterprise.Search.ForumsIndexTask, CommunityServer.Enterprise.Search" />
<add type = "CommunityServer.Enterprise.Search.HubIndexTask, CommunityServer.Enterprise.Search" /> -->
<!--<add type = "CommunityServer.Enterprise.Search.WikiIndexTask, CommunityServer.Enterprise.Search" /> -->
<add type = "VSI.CommunityServer.EnterpriseSearch.WikiSearchTask, VSI.CommunityServer.EnterpriseSearch" />
<!-- <add type = "CommunityServer.Enterprise.Search.MediaGalleriesIndexTask, CommunityServer.Enterprise.Search" /> -->
</task>
I commented out all of the other tasks so I could see if my task was doing something useful and, sure enough, I saw the Lucene index get created the next time the indexing service ran. However, when I did a search for text I knew was in the documents I was indexing, nothing came up in Community Server's search results.
Knowing that I had Enterprise Search working before I started this exercise, I uncommented the other Community Server tasks above and re-ran the indexing service. This time, all of my CS content showed up when I searched, so I knew that fundamentally the indexing service was working. I also attached a debugger to Telligent.Tasks.Console.exe to make sure that my task was running. Sure enough, I saw my document collection being created and returned from the GetDocuments method, so I knew I was handing back the data to be indexed. My next trick was to figure out why none of my information was getting into the index.
When all else fails, unit test...
Next, I created a fairly simple set of NUnit tests that would call my GetDocuments method, use Lucene.Net itself to create the index, and then search to make sure my data returned correctly. This gave me some more insight into how the individual pieces all fit together, and I came up with a few test methods (note I have recreated this code from memory and it may not be exactly correct or even compile):
[TestFixture]
public class SearchTests
{
    // GetSearchDirectory() is not shown here; it returns the path of the index directory
    private static FileInfo INDEX_DIR = new FileInfo(GetSearchDirectory());
    [Test]
    public void CreateIndex()
    {
        // First, create a new index
        WikiSearchTask task = new WikiSearchTask(1, 100);
        DocumentCollection dc = task.GetDocuments();
        IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
        foreach (Document d in dc)
        {
            writer.AddDocument(d);
        }
        task.ResetIndexStatus();
        // Optimize the index
        // writer.Optimize();
        // Close the writer
        writer.Close();
        // no asserts - at this point we just want to set up the test
    }
    [Test]
    public void SearchContent()
    {
        // Searcher searcher = new IndexSearcher(@"Z:\CommunityServer\SearchIndex");
        Searcher searcher = new IndexSearcher(GetSearchDirectory());
        Analyzer analyzer = new StandardAnalyzer();
        string queryStr = "wiki";
        QueryParser qp = new QueryParser("body", analyzer);
        // VspFilter and its argument "es" are not defined in this listing (the code was recreated from memory)
        Hits hits = searcher.Search(qp.Parse(queryStr), new VspFilter(es));
        Assert.IsTrue(hits.Length() > 0, "Should have been at least one hit");
        for (int i = 0; i < hits.Length(); i++)
        {
            Document doc = hits.Doc(i);
            DumpDoc(doc);
        }
        searcher.Close();
    }
    private void DumpDoc(Document doc)
    {
        // Sanity check: GetField returns null for a missing field, so this throws if GroupID is absent
        doc.GetField(Fields.GroupID).StringValue();
        Console.WriteLine("-----------------------------");
        foreach (Field f in doc.GetFields())
        {
            Console.WriteLine(string.Format("{0}:{1}", f.Name(), f.StringValue()));
        }
        Console.WriteLine("-----------------------------");
    }
}
I fired up NUnit, ran CreateIndex, ran SearchContent, and, sure enough, my documents showed up in the Hits returned from Lucene.Net. So where was Community Server disconnecting my index results from the rest of its results? The key is in a few specific document fields you must include in the Document objects you return from GetDocuments(). Enterprise Search uses an obfuscated class derived from Lucene.Net's Filter class that, from what I can tell, handles all of the post-search security features of Enterprise Search, excluding documents to which the currently searching user has no access rights.
Although I didn't completely analyze the post-search filtering, I did find that several fields are included in that filtering and, in my case, the one that had caused my search results to be excluded was HubSectionID. This is an internal ID that links parent/child hubs/sections together in Community Server. Thankfully, my indexing task already filtered out the non-public documents before they were added to the index (as we only want to expose public documents at this time), so I could simply set HubSectionID to 0, which allowed my documents to show up in the search results.
Almost there...
Once I got this far, I started seeing the dreaded NullReferenceException show up when I did a search. Now I was sure my documents were in the result set, because they were blowing up the search results screen. As Community Server is really designed around the concept of a Post, everything that is displayed on the search results screen appears to be converted to an ESIndexPost object before being bound to the result list. If your Lucene document doesn't include every field that conversion expects, it will throw exceptions trying to build the results. Using Reflector, I found the fields in the Lucene document that are used to build the search results. As you can see, this code (from Reflector) does not confirm that the field data is actually there before trying to convert the fields to the appropriate data type:
private ESIndexPost CreateESIndexPost(Document document1)
{
    ESIndexPost post = new ESIndexPost();
    post.ESId = 1;
    post.ApplicationKey = document1.GetField(Fields.ApplicationKey).StringValue();
    post.ApplicationType = (ApplicationType)Enum.Parse(typeof(ApplicationType),
        document1.GetField(Fields.ApplicationType).StringValue(), true);
    post.Body = document1.GetField(Fields.RawBody).StringValue();
    post.GroupID = int.Parse(document1.GetField(Fields.GroupID).StringValue());
    post.Name = document1.GetField(Fields.Name).StringValue();
    post.PostDate = DateTools.StringToDate(document1.GetField(Fields.PostDate).StringValue());
    post.PostID = int.Parse(document1.GetField(Fields.PostID).StringValue());
    post.SectionID = int.Parse(document1.GetField(Fields.SectionID).StringValue());
    post.SettingsID = int.Parse(document1.GetField(Fields.SettingsID).StringValue());
    post.Title = document1.GetField(Fields.Title).StringValue();
    post.Url = document1.GetField(Fields.Url).StringValue();
    post.UserName = document1.GetField(Fields.Author).StringValue();
    post.ThreadID = int.Parse(document1.GetField(Fields.ThreadID).StringValue());
    post.SectionName = document1.GetField(Fields.Name).StringValue();
    post.PostCategories = document1.GetValues(Fields.Tag);
    return post;
}
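Because the result-building code above reads every field without a null check, one defensive option is to verify each document before returning it from GetDocuments(). What follows is only a minimal sketch, not part of Community Server or Lucene.Net: RequiredFieldCheck, FindMissing, and the lookup delegate are hypothetical names made up for illustration, and the names passed in would be the Fields constants read above (assuming they are plain strings).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Community Server or Lucene.Net type).
public static class RequiredFieldCheck
{
    // Returns every required name for which the lookup yields null,
    // i.e. the fields that unchecked result-building code would trip over.
    public static List<string> FindMissing(Func<string, string> lookup,
                                           IEnumerable<string> requiredNames)
    {
        List<string> missing = new List<string>();
        foreach (string name in requiredNames)
        {
            if (lookup(name) == null)
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

With a Lucene.Net Document, the lookup could be something like name => doc.Get(name), since Get returns null when a field is absent. Any names that come back are fields that would make the search results page throw.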
So, with all of this spelunking done, here's what my final CreateDocument() method (previously omitted) contained:
private Document CreateDocument(WikiArticle a)
{
    Document doc = new Document();
    HtmlDocument html = new HtmlDocument();
    html.ContentType = "text/html";
    html.Extension = "aspx";
    html.Encoding = "utf-8";
    html.Html = a.ArticleHtml;
    doc.Add(new Field(Fields.Body, html.WordsOnly, Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field(Fields.Name, a.Descr, Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field(Fields.Title, a.Descr, Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field(Fields.Url, string.Format("/cs/PowerHelpWiki/WikiPage.aspx?id={0}", a.ArticleID),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.ApplicationType, ApplicationType.Unknown.ToString(), Field.Store.YES,
        Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.ApplicationKey, a.ArticleID, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.RawBody, a.ArticleHtml, Field.Store.YES, Field.Index.NO));
    doc.Add(new Field(Fields.Author, "admin", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.AuthorID, "admin", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.PostDate, DateTools.DateToString(a.UpdateDate, DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Add "Everyone" permission so anonymous can search
    doc.Add(new Field(Fields.Role, "Everyone", Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Add required CS fields that aren't available in VSP Wiki but will fail the search
    doc.Add(new Field(Fields.GroupID, "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.PostID, "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.SectionID, "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.HubSectionID, "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.SettingsID, this.SettingsID.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field(Fields.ThreadID, "0", Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Todo: deal with tags once we have them in VSP
    // Do a foreach on each tag and add it to the document (Lucene supports multiple entries for a single field)
    // doc.Add(new Field(Fields.Tag, "", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.SetBoost(CalculateBoost(a));
    return doc;
}
This should get anyone interested in integrating their content into Telligent's Enterprise Search results well on their way toward their goal. Next time I'll talk about some additional gotchas I ran into once the basic functionality was working, including things like deleting updated or removed documents from the index and only indexing articles when they have been updated, as your index fills up with duplicates fairly quickly if you don't pay attention to these things. Additionally, I'll be writing about my efforts to integrate our web-services based security system with Community Server, which makes some assumptions about data being in specific tables even when it isn't the membership provider, and how I worked around them to get our two systems working together.
I hope this helps somebody else avoid some of the pain I went through getting this all to work.