org.opencms.search.documents
Class CmsHighlightFinder

java.lang.Object
  extended by org.opencms.search.documents.CmsHighlightFinder

public final class CmsHighlightFinder
extends java.lang.Object

Adapted from Maik Schreiber's LuceneTools.java,v 1.5 2001/10/16 07:25:55. Alterations include:
+ Changed to support the Lucene 1.3 release. This requires no change to the Lucene code base, but as a consequence highlighting of MultiTermQuery, RangeQuery and PrefixQuery is currently not supported.
+ Performance enhancement: CmsHighlightExtractor caches the query terms and can therefore be called repeatedly to highlight multiple results more efficiently.
+ New feature: can extract the most relevant parts of large bodies of text, with a user-defined extract size.

Author:
Maik Schreiber

Constructor Summary
CmsHighlightFinder(I_CmsTermHighlighter highlighter, org.apache.lucene.search.Query query, org.apache.lucene.analysis.Analyzer analyzer)
           
 
Method Summary
 java.lang.String[] getBestFragments(java.lang.String text, int fragmentSize, int maxNumFragments)
          Highlights a text according to the given query, extracting the most relevant sections.
 java.lang.String getBestFragments(java.lang.String text, int fragmentSize, int maxNumFragments, java.lang.String separator)
          Highlights a text according to the given query, extracting the most relevant sections and joining them with a separator.
static void getTerms(org.apache.lucene.search.Query query, java.util.HashSet terms, boolean prohibited)
          Extracts all term texts of a given Query.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CmsHighlightFinder

public CmsHighlightFinder(I_CmsTermHighlighter highlighter,
                          org.apache.lucene.search.Query query,
                          org.apache.lucene.analysis.Analyzer analyzer)
                   throws java.io.IOException
Parameters:
highlighter - I_CmsTermHighlighter used to highlight terms in the text
query - Query which contains the terms to be highlighted in the text
analyzer - Analyzer used to construct the Query
Throws:
java.io.IOException - if something goes wrong
Method Detail

getTerms

public static void getTerms(org.apache.lucene.search.Query query,
                            java.util.HashSet terms,
                            boolean prohibited)
                     throws java.io.IOException
Extracts all term texts of a given Query. Term texts will be returned in lower-case.

Parameters:
query - Query to extract term texts from
terms - HashSet that receives the extracted term texts (elements: String)
prohibited - true to also extract "prohibited" terms
Throws:
java.io.IOException - if something goes wrong
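
The effect of the prohibited flag can be illustrated with a minimal sketch. It does not use Lucene: the Clause class and the collectTerms method are hypothetical stand-ins for a parsed query and for getTerms, and it assumes that terms of prohibited clauses are skipped unless the flag is true.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration, not the OpenCms implementation.
class TermCollectionSketch {

    // Simplified query clause: one term text plus a "prohibited" marker (as in -term).
    static final class Clause {
        final String text;
        final boolean prohibited;

        Clause(String text, boolean prohibited) {
            this.text = text;
            this.prohibited = prohibited;
        }
    }

    // Adds the lower-cased term texts to the given set.
    // Terms of prohibited clauses are added only if includeProhibited is true (assumption).
    static void collectTerms(List<Clause> clauses, HashSet<String> terms, boolean includeProhibited) {
        for (Clause clause : clauses) {
            if (includeProhibited || !clause.prohibited) {
                terms.add(clause.text.toLowerCase(Locale.ROOT));
            }
        }
    }

    public static void main(String[] args) {
        List<Clause> query = Arrays.asList(new Clause("OpenCms", false), new Clause("Draft", true));
        HashSet<String> terms = new HashSet<String>();
        collectTerms(query, terms, false);
        System.out.println(terms); // prints [opencms]
    }
}
```

Passing the set in as a parameter, as getTerms does, lets a caller accumulate terms from several queries into one collection.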

getBestFragments

public java.lang.String[] getBestFragments(java.lang.String text,
                                           int fragmentSize,
                                           int maxNumFragments)
                                    throws java.io.IOException
Highlights a text according to the given query, extracting the most relevant sections. The document text is analysed in chunks of fragmentSize to record hit statistics across the document. Once the statistics have been accumulated, the fragments with the highest scores are returned as an array of strings, ordered by score.

Parameters:
text - text to highlight terms in
fragmentSize - the size in bytes of each fragment to be returned
maxNumFragments - the maximum number of fragments.
Returns:
highlighted text fragments (between 0 and maxNumFragments number of fragments)
Throws:
java.io.IOException - if something goes wrong
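
The chunk-and-score idea described above can be sketched without Lucene. This is a simplified illustration, not the class's actual algorithm: it assumes the score of a chunk is the number of query terms it contains, measures fragmentSize in characters, cuts chunks at fixed offsets (possibly mid-word), and omits the highlighting step. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of fragment scoring, not the OpenCms implementation.
class FragmentScoringSketch {

    // Returns up to maxNumFragments chunks that contain at least one term,
    // highest hit count first. The terms are expected in lower case.
    static String[] bestFragments(String text, Set<String> terms, int fragmentSize, int maxNumFragments) {
        if (fragmentSize <= 0) {
            throw new IllegalArgumentException("fragmentSize must be positive");
        }
        final List<String> chunks = new ArrayList<String>();
        final List<Integer> scores = new ArrayList<Integer>();
        for (int start = 0; start < text.length(); start += fragmentSize) {
            String chunk = text.substring(start, Math.min(text.length(), start + fragmentSize));
            int hits = 0;
            // Split on anything that is not a letter or digit and count term hits.
            for (String word : chunk.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (terms.contains(word)) {
                    hits++;
                }
            }
            if (hits > 0) {
                chunks.add(chunk);
                scores.add(hits);
            }
        }
        // Sort chunk indexes by descending score; the sort is stable,
        // so chunks with equal scores keep their document order.
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < chunks.size(); i++) {
            order.add(i);
        }
        Collections.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return scores.get(b) - scores.get(a);
            }
        });
        int count = Math.max(0, Math.min(maxNumFragments, order.size()));
        String[] result = new String[count];
        for (int i = 0; i < count; i++) {
            result[i] = chunks.get(order.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<String>(Arrays.asList("cat"));
        String[] best = bestFragments("cat sat on the mat. cat cat ok", terms, 10, 2);
        System.out.println(Arrays.toString(best)); // prints [cat cat ok, cat sat on]
    }
}
```

Chunks without any hit are dropped, which is why the result can hold fewer than maxNumFragments entries, or none at all.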

getBestFragments

public java.lang.String getBestFragments(java.lang.String text,
                                         int fragmentSize,
                                         int maxNumFragments,
                                         java.lang.String separator)
                                  throws java.io.IOException
Highlights a text according to the given query, extracting the most relevant sections. The document text is analysed in chunks of fragmentSize to record hit statistics across the document. Once the statistics have been accumulated, the fragments with the highest scores are returned in order, joined into a single string and delimited by separator.

Parameters:
text - text to highlight terms in
fragmentSize - the size in bytes of each fragment to be returned
maxNumFragments - the maximum number of fragments.
separator - the separator used to intersperse the document fragments (typically " ... ")
Returns:
highlighted text
Throws:
java.io.IOException - if something goes wrong
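
The relation between the two getBestFragments variants can be sketched as a plain join. The sketch assumes the String result is equivalent to the array result with the separator placed between consecutive fragments; the class and method names are hypothetical.

```java
// Hypothetical sketch, not the OpenCms implementation.
class JoinFragmentsSketch {

    // Places the separator between consecutive fragments; an empty array yields "".
    static String joinFragments(String[] fragments, String separator) {
        return String.join(separator, fragments);
    }

    public static void main(String[] args) {
        String[] fragments = {"first hit", "second hit"};
        System.out.println(joinFragments(fragments, " ... ")); // prints first hit ... second hit
    }
}
```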