The goal of search is to get readers to relevant content. Many things happen behind the scenes to make this possible. As much as possible, our goal is to provide a search that works well out of the box while also giving you visibility into and control over your search results.
The heart of any search engine is the index, which is where (and how) your content is stored for retrieval. KnowledgeOwl automatically indexes the most important fields of your content for search. You do not need to add tags or keywords for search to work. Any words you use in the indexable fields will automatically be searchable.
Below is an overview of what is and what is not indexed for search.
| |Indexed / searchable|Not indexed / not searchable|
|Types of pages|Articles, Custom content categories|Other category types|
Category / breadcrumbs
You can control your search scores and results by:
- customizing your search weights (the weights applied to each indexed field; see Search relevance scoring for more detail on the algorithms we use)
- optimizing your content: include terms in major indexed fields, such as the article title, body, permalink, meta description, and search phrases. Generally, the shorter the field, the higher the weight, so a match in the article title, for example, tends to rank an article higher than a match in other fields. See Search relevance scoring for more info.
- adding synonyms, which let you capture terms your readers might use in search that are equivalent to terms you already have in your documentation
The knowledge base index is where we store your article content and metadata after we process and optimize it for searching.
When you save an article, we automatically digest and analyze your content for the index. This process involves...
- Stripping out HTML
- Breaking up the text along word boundaries (tokenization)
- Breaking up tokens into small fragments, called N-grams, for partial matches and autosuggestions
- Removing punctuation
- Lowercasing all tokens
- Stemming all tokens based on your chosen primary search language
- Adding the synonyms from your library for matching tokens
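To make the steps above concrete, here is a minimal Python sketch of an analysis pipeline. It is an illustration only, not KnowledgeOwl's implementation: the function names, the `SYNONYMS` library, the toy suffix-stripping stemmer, and the choice of leading-edge N-grams are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    # Split along word boundaries; punctuation is dropped as a side effect.
    return re.findall(r"[A-Za-z0-9]+", text)


def edge_ngrams(token, min_len=2):
    # Leading fragments of a token, usable for partial matches and autosuggest.
    return [token[:i] for i in range(min_len, len(token) + 1)]


def toy_stem(token):
    # Hypothetical, deliberately naive stemmer; real stemmers are language-aware.
    for suffix in ("ing", "es", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical synonym library: stem -> equivalent terms.
SYNONYMS = {"doc": ["article"]}


def analyze(html):
    """Run the full sketch pipeline on one HTML fragment."""
    tokens = [t.lower() for t in tokenize(strip_html(html))]
    ngrams = [g for t in tokens for g in edge_ngrams(t)]
    terms = []
    for stem in map(toy_stem, tokens):
        terms.append(stem)
        terms.extend(SYNONYMS.get(stem, []))
    return {"terms": terms, "ngrams": ngrams}
```

In this sketch, `analyze("<p>Searching <b>Docs</b>!</p>")` yields the terms `search`, `doc`, and `article`, plus fragments such as `sear` that a partial query could match.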
The initial index is automatically created when you save an article. If you make a change to your search settings that affects how the data is processed, your knowledge base will need to be reindexed. Reindexing is the process of reanalyzing and redigesting all of your content. Changes that require a reindex are...
During a reindex, full search is disabled, but autosuggest search based on article titles still works. Reindexing normally takes less than a minute, but larger sites or sites with very lengthy articles could take significantly longer.
Stemming is the process of reducing words to their base or root form. For example, "searching", "searches", and "searcher" can all be reduced to the word "search". This is called the stem. In our search, stemming happens automatically in the background as part of indexing and reindexing processes.
Stemming helps make search smarter. It's how search engines can find the results you want even when you don't type the exact words.
Stemming is a core task of Natural Language Processing (NLP). NLP is a branch of artificial intelligence (AI) that focuses on how computers process and analyze natural human language.
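As a rough illustration of why stemming helps, the sketch below reduces both the query and the article text to stems before comparing them. The `toy_stem` function is a hypothetical suffix stripper written for this example; production stemmers apply much richer, language-specific rules and will not always agree with it.

```python
def toy_stem(word):
    """Hypothetical suffix-stripping stemmer; not the stemmer used in production."""
    word = word.lower()
    for suffix in ("ing", "es", "er", "s"):
        # Only strip when a reasonably long stem is left behind.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, text):
    """True when every stemmed query word appears among the stemmed text words."""
    text_stems = {toy_stem(w) for w in text.split()}
    return all(toy_stem(w) in text_stems for w in query.split())
```

With this sketch, a query for "searching" matches the text "How the searcher works" because both words reduce to the stem "search", even though the exact word never appears.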
When a search is performed, relevance algorithms are used to weight the results. The algorithms assign a score to each matching search result, and the search results are ranked by score.
The following are simplified explanations of the types of algorithms used:
- The exact match
All words in the search term must appear in a searchable field for it to be considered a match. This algorithm has no tolerance for typos, but it allows for some variation in proximity of the search terms (words don't have to be in the exact order and can have a few words between them). Any matching result gets a significant boost in score because it exactly matches what a user typed into the search.
- The mostly match
At least 75% of the words in the search term must appear in the searchable fields. This algorithm has no tolerance for typos. Matching results get a slight boost in score.
- The somewhat match
At least 50% of the words in the search term must appear in the searchable fields. This algorithm is tolerant of typos. Matching results receive the lowest boost.
- The maybe match
Some of the search terms must match. This algorithm is tolerant of typos. No boost is applied.
In order for an article to appear in search results, it must be considered a match by at least 2 of the algorithms. Scores from all algorithms are combined to determine the article ranking.
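The tiers and the "at least 2" rule can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual scoring code: the boost values in `BOOSTS`, the 0.8 similarity cutoff used for typo tolerance, and all function names are hypothetical, and word proximity is ignored entirely.

```python
from difflib import SequenceMatcher

# Hypothetical boost values; the real values are not documented here.
BOOSTS = {"exact": 3.0, "mostly": 1.5, "somewhat": 1.0, "maybe": 0.0}


def _fraction_found(query_words, field_words, fuzzy):
    """Fraction of query words present in the field, optionally allowing typos."""
    def found(word):
        if word in field_words:
            return True
        # Assumed typo tolerance: accept words that are at least 80% similar.
        return fuzzy and any(
            SequenceMatcher(None, word, other).ratio() >= 0.8
            for other in field_words
        )
    return sum(found(w) for w in query_words) / len(query_words)


def tier_matches(query, field_text):
    """Return the names of the tiers that consider this field a match."""
    query_words = query.lower().split()
    if not query_words:
        return []
    field_words = set(field_text.lower().split())
    strict = _fraction_found(query_words, field_words, fuzzy=False)
    loose = _fraction_found(query_words, field_words, fuzzy=True)
    tiers = []
    if strict == 1.0:
        tiers.append("exact")      # every word present, no typos allowed
    if strict >= 0.75:
        tiers.append("mostly")     # 75% of words, no typos allowed
    if loose >= 0.5:
        tiers.append("somewhat")   # 50% of words, typos tolerated
    if loose > 0:
        tiers.append("maybe")      # anything matched, typos tolerated
    return tiers


def is_result(query, field_text):
    # An article appears in results only if at least two tiers match.
    return len(tier_matches(query, field_text)) >= 2


def combined_boost(query, field_text):
    # Boosts from every matching tier are added together.
    return sum(BOOSTS[t] for t in tier_matches(query, field_text))
```

In this model, a misspelled query such as "serch" still qualifies against "search settings" through the two typo-tolerant tiers, while a query where only one of three words is found matches just the "maybe" tier and is left out of the results.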
The algorithms consider three main factors when calculating scores:
- Term frequency
This is how often the search terms appear in an article. The more often the terms appear, the higher the score.
- Inverse document frequency
This is how often the search term appears across all your articles. If the search term shows up in many articles, it will have less weight in scoring. If the search term only appears in a few articles, it will result in a higher score.
- Field length
This is how long the searchable field is. If the field is short, like an article title, matches carry more weight because each word in it is more likely to be significant.
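To show how these three factors can interact, here is a small sketch based on a classic TF-IDF formula with length normalization. The exact formula used by the search is not stated in this article, so the square-root damping, the logarithmic inverse document frequency, and the `score` function itself are assumptions chosen for illustration.

```python
import math


def score(term, field_tokens, all_fields):
    """Toy relevance score for one term in one field.

    field_tokens: list of tokens in the field being scored.
    all_fields: list of token lists, one per article, for the same field.
    """
    if not field_tokens:
        return 0.0
    # Term frequency: more occurrences raise the score, with diminishing returns.
    tf = math.sqrt(field_tokens.count(term))
    # Inverse document frequency: terms found in fewer articles weigh more.
    containing = sum(1 for tokens in all_fields if term in tokens)
    idf = 1.0 + math.log(len(all_fields) / (1 + containing))
    # Field length: a match in a short field weighs more than in a long one.
    length_norm = 1.0 / math.sqrt(len(field_tokens))
    return tf * idf * length_norm
```

Under these assumptions, the same term scores higher in a two-word title than in a six-word title, and a term that appears in only one article outscores a term that appears in most of them.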
Note: Tags do not impact search weights or relevancy scores.