Friday, May 20, 2011

Google Search Appliance

I recently designed and developed a complete internal search solution using the Google Search Appliance (GSA).
The GSA box crawls every content URL of the given domain and serves the search results.
An example of the search can be found on SapSite.

Developers can either use the out-of-the-box front ends, which are XSLT driven, or develop a custom user interface in their own technology.
You can also download the latest C# .NET API, which talks to the Google box and its result collections, and build a .NET server control on top of it.
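
As a rough illustration of the custom front end route, the sketch below parses a GSA XML response (the output=xml_no_dtd format used in the query examples further down) with Python's standard library. Python is used for brevity rather than the .NET API mentioned above. The sample XML is made up and the parse_results helper is hypothetical; the tag names (GSP, RES, R, U, T, S, MT) are the ones described in the XML reference linked at the end of this post.

```python
import xml.etree.ElementTree as ET

# Made-up sample response, trimmed to the elements used below.
SAMPLE_XML = """<GSP VER="3.2">
  <Q>crm</Q>
  <RES SN="1" EN="2">
    <M>2</M>
    <R N="1">
      <U>http://www.abcd.com/solutions/crm.aspx</U>
      <T>CRM Solutions</T>
      <S>Overview of CRM solutions.</S>
      <MT N="sourcetype" V="webpage"/>
    </R>
    <R N="2">
      <U>http://www.abcd.com/company/press/crm-launch.aspx</U>
      <T>CRM Launch</T>
      <S>Press release.</S>
      <MT N="sourcetype" V="Press"/>
    </R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Return a list of dicts, one per <R> result element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "url": r.findtext("U", default=""),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
            # Meta tags appear as <MT N="name" V="value"/> when getfields is set.
            "meta": {mt.get("N"): mt.get("V") for mt in r.findall("MT")},
        })
    return results


if __name__ == "__main__":
    for hit in parse_results(SAMPLE_XML):
        print(hit["title"], "->", hit["url"])
```

In a real response the title and snippet usually carry escaped highlight markup, so a front end would likely need to unescape or strip it before display.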

Here are the steps I followed:
 Set up a Google Search Appliance box (6.8 or older).
 Select a web server domain (www.abcd.com) where all of the company's content is published.
Content types could be 'web-pages', 'word', 'pdf', 'videos', 'webcasts', or any other custom types, like press and events, that have a meta tag on the page.

a. Log in to the GSA box and create crawl rules such as:
# Rule: Home Page
http://www.abcd.com/default.aspx
This rule crawls all the content of the domain.
b. Create a new collection and write regular expressions for the URL patterns that you want in the result set.
For example, to include specific folders in the collection you would write:
# Rule: General Content
regexp:^http(s?)://www\.abcd\.com/(([^/]*$)|about|industries|solutions|services|partners|platform|company|careers)
Furthermore, you can write rules to exclude certain patterns that you do not want to serve in the search results.
For instance:
regexp:^http(s?)://(www)(\.)abcd(\.)com/services/education/curriculum.aspx.*$
You can create as many collections as you need to serve different departments, regions, or countries, and specify the one you want in the query (the site parameter in the query examples below).
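
To sanity-check patterns like these before pasting them into the admin console, they can be tried locally. This is a minimal sketch, assuming Python's re module treats these particular patterns the same way the appliance does (the GSA has its own pattern syntax, so treat it as an approximation only). The in_collection helper and the test URLs are hypothetical, and the GSA-specific "regexp:" prefix is dropped.

```python
import re

# Include pattern from the "General Content" rule, without the "regexp:" prefix.
INCLUDE = re.compile(
    r"^http(s?)://www\.abcd\.com/"
    r"(([^/]*$)|about|industries|solutions|services|partners|platform|company|careers)"
)
# Exclude pattern for the curriculum page, copied from the rule above.
EXCLUDE = re.compile(
    r"^http(s?)://(www)(\.)abcd(\.)com/services/education/curriculum.aspx.*$"
)


def in_collection(url):
    """True if the URL matches the include rule and not the exclude rule."""
    return bool(INCLUDE.search(url)) and not EXCLUDE.search(url)


if __name__ == "__main__":
    for url in (
        "http://www.abcd.com/default.aspx",
        "https://www.abcd.com/services/consulting.aspx",
        "http://www.abcd.com/services/education/curriculum.aspx?id=3",
        "http://www.abcd.com/blog/post.aspx",
    ):
        print(in_collection(url), url)
```

Note that the folder names in the include rule are prefix matches, so a path such as /aboutus would also be included; add a trailing slash to the pattern if that is not wanted.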

c. Create a front end and add any keywords and/or synonyms that you want the query to return in the result set.
d. After the GSA has finished crawling, the collections will start returning results for search queries.

Query Examples:


a. http://gsahostaddress.com/search?q=crm&access=p&entqr=0&sort=date%3AD%3AL%3Ad1&output=xml_no_dtd&client=country_global&ud=1&oe=UTF-8&ie=UTF-8&entsp=0&site=country_global
b. http://gsahostaddress.com/search?q=crm&filter=1&getfields=*&ie=utf-8&numgm=5&output=xml_no_dtd&oe=utf-8&start=0&num=5&client=country_global&site=country_global&requiredfields=(-(sourcetype:Smart).-(SourceType:Press).-(sourcetype:Events).(sourcetype:webpage))
c. http://gsahostaddress.com/search?access=p&output=xml_no_dtd&oe=UTF-8&client=country_global&start=2&q=crm&as_sitesearch=&filter=1&num=20&getfields=*&site=default_collection
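
Query strings like the ones above are easier to maintain when built from named parameters than by hand. The sketch below assembles one with Python's urllib.parse. The build_search_url helper and its defaults are hypothetical; the parameter names (q, site, client, output, start, num, getfields, requiredfields) are taken from the examples above, and gsahostaddress.com is the same placeholder host.

```python
from urllib.parse import urlencode

GSA_HOST = "http://gsahostaddress.com"  # placeholder host from the examples above


def build_search_url(query, collection, frontend, start=0, num=10,
                     required_fields=None):
    """Build a GSA search URL that asks for XML output (hypothetical helper)."""
    params = {
        "q": query,              # search terms
        "site": collection,      # collection to search
        "client": frontend,      # front end to use
        "output": "xml_no_dtd",  # raw XML results without a DTD reference
        "ie": "UTF-8",
        "oe": "UTF-8",
        "filter": "1",
        "getfields": "*",        # return all meta tags with each result
        "start": str(start),     # zero-based offset of the first result
        "num": str(num),         # page size
    }
    if required_fields:
        # Meta tag filter, e.g. "sourcetype:webpage".
        params["requiredfields"] = required_fields
    # urlencode percent-escapes reserved characters such as ":" and "*".
    return GSA_HOST + "/search?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("crm", "country_global", "country_global",
                           num=5, required_fields="sourcetype:webpage"))
```

In the requiredfields expression of example b, "." combines conditions with AND and a leading "-" negates a condition, so that query keeps web pages and drops the Smart, Press, and Events source types.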



Documentation Links
1. http://code.google.com/apis/searchappliance/documentation/60/xml_reference.html
2. http://code.google.com/apis/searchappliance/opensource/index.html
3. http://code.google.com/apis/searchappliance/documentation/52/asr_reference.html
