Let's discuss the upcoming FlexSearch v0.8 here: https://github.com/nextapps-de/flexsearch/discussions/415
| Build | File | CDN |
|---|---|---|
| flexsearch.bundle.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.bundle.js |
| flexsearch.light.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.light.js |
| flexsearch.compact.js | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.compact.js |
| flexsearch.es5.js * | Download | https://rawcdn.githack.com/nextapps-de/flexsearch/0.7.31/dist/flexsearch.es5.js |
| ES6 Modules | Download | The `/dist/module/` folder of this GitHub repository |
| Feature | flexsearch.bundle.js | flexsearch.compact.js | flexsearch.light.js |
|---|:-:|:-:|:-:|
| Presets | ✓ | ✓ | - |
| Async Search | ✓ | ✓ | - |
| Workers (Web + Node.js) | ✓ | - | - |
| Contextual Indexes | ✓ | ✓ | ✓ |
| Index Documents (Field-Search) | ✓ | ✓ | - |
| Document Store | ✓ | ✓ | - |
| Partial Matching | ✓ | ✓ | ✓ |
| Relevance Scoring | ✓ | ✓ | ✓ |
| Auto-Balanced Cache by Popularity | ✓ | - | - |
| Tags | ✓ | - | - |
| Suggestions | ✓ | ✓ | - |
| Phonetic Matching | ✓ | ✓ | - |
| Customizable Charset/Language (Matcher, Encoder, Tokenizer, Stemmer, Filter, Split, RTL) | ✓ | ✓ | ✓ |
| Export / Import Indexes | ✓ | - | - |
| File Size (gzip) | 6.8 kb | 5.3 kb | 2.9 kb |
| Rank | Library | Memory | Query (Single Term) | Query (Multi Term) | Query (Long) | Query (Dupes) | Query (Not Found) |
|---|---|---:|---:|---:|---:|---:|---:|
| 1 | FlexSearch | 17 | 7084129 | 1586856 | 511585 | 2017142 | 3202006 |
| 2 | JSii | 27 | 6564 | 158149 | 61290 | 95098 | 534109 |
| 3 | Wade | 424 | 20471 | 78780 | 16693 | 225824 | 213754 |
| 4 | JS Search | 193 | 8221 | 64034 | 10377 | 95830 | 167605 |
| 5 | Elasticlunr.js | 646 | 5412 | 7573 | 2865 | 23786 | 13982 |
| 6 | BulkSearch | 1021 | 3069 | 3141 | 3333 | 3265 | 21825569 |
| 7 | MiniSearch | 24348 | 4406 | 10945 | 72 | 39989 | 17624 |
| 8 | bm25 | 15719 | 1429 | 789 | 366 | 884 | 1823 |
| 9 | Lunr.js | 2219 | 255 | 271 | 272 | 266 | 267 |
| 10 | FuzzySearch | 157373 | 53 | 38 | 15 | 32 | 43 |
| 11 | Fuse | 7641904 | 6 | 2 | 1 | 2 | 3 |
| Option | Values | Description | Default |
|---|---|---|---|
| preset | "memory"<br>"performance"<br>"match"<br>"score"<br>"default" | The configuration profile, used as a shortcut or as a base for your custom settings. | "default" |
| tokenize | "strict"<br>"forward"<br>"reverse"<br>"full" | The indexing mode (tokenizer). Choose one of the built-ins or pass a custom tokenizer function. | "strict" |
| cache | Boolean<br>Number | Enables/disables the cache and/or sets the capacity of cached entries. When a number is passed as limit, the cache automatically balances stored entries by their popularity. Note: when just using `true`, the cache has no limit and grows unbounded. | false |
| resolution | Number | Sets the scoring resolution. | 9 |
| context | Boolean<br>Context Options | Enables/disables contextual indexing. When `true` is passed, the default context options are used. | false |
| optimize | Boolean | When enabled, the index uses a memory-optimized stack flow. | true |
| boost | function(arr, str, int) => float | A custom boost function applied while indexing content. The function has the signature `Function(words[], term, index) => Float` and receives three parameters: the array of all words, the current term, and the position of the term in the word array. You can apply your own calculation, e.g. based on the occurrences of a term, and return the factor (<1 lowers relevance, >1 raises it). Note: this feature is currently limited to the "strict" tokenizer. | null |
| **Language-specific Options and Encoding:** | | | |
| charset | Charset Payload<br>String (key) | Provide a custom charset payload or pass one of the keys of the built-in charsets. | "latin" |
| language | Language Payload<br>String (key) | Provide a custom language payload or pass the language shorthand flag (ISO-3166) of a built-in language. | null |
| encode | false<br>"default"<br>"simple"<br>"balance"<br>"advanced"<br>"extra"<br>function(str) => [words] | The encoding type. Choose one of the built-ins or pass a custom encoding function. | "default" |
| stemmer | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom object. | false |
| filter | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom array. | false |
| matcher | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom object. | false |
| **Additional Options for Document Indexes:** | | | |
| worker | Boolean | Enables/disables worker threads and sets their count. | false |
| document | Document Descriptor | Contains the definitions for the document index and storage. | |
| Option | Values | Description | Default |
|---|---|---|---|
| resolution | Number | Sets the scoring resolution for the context. | 1 |
| depth | false<br>Number | Enables/disables contextual indexing and sets the contextual distance of relevance. Depth is the maximum number of words/tokens a term may be away from another term to still be considered relevant. | 1 |
| bidirectional | Boolean | Enables bidirectional search results. If enabled and the source text contains "red hat", it will be found for both the queries "red hat" and "hat red". | true |
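The effect of `depth` and `bidirectional` can be illustrated in plain JavaScript. This is a sketch of the matching idea only, not FlexSearch's internal implementation; the pair encoding used here is an assumption for demonstration:

```javascript
// Illustrative sketch: which term pairs fall inside the contextual
// "depth", and how "bidirectional" normalizes their order.
function contextPairs(words, depth, bidirectional) {
  const pairs = new Set();
  for (let i = 0; i < words.length; i++) {
    // every term within `depth` positions counts as context
    for (let j = i + 1; j <= Math.min(i + depth, words.length - 1); j++) {
      const a = words[i], b = words[j];
      // bidirectional: store one canonical order so "hat red" matches "red hat"
      pairs.add(bidirectional && b < a ? b + ":" + a : a + ":" + b);
    }
  }
  return pairs;
}

const pairs = contextPairs(["the", "red", "hat"], 1, true);
console.log(pairs.has("hat:red")); // true ("red hat" stored in canonical order)
```

With `bidirectional: true` the pairs from "red hat" and from the query "hat red" collapse to the same canonical key, which is why both queries match.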
| Option | Values | Description | Default |
|---|---|---|---|
| id | String | The name of the id field inside the document. | "id" |
| tag | false<br>String | The name of the tag field inside the document. | "tag" |
| index | String<br>Array&lt;String&gt;<br>Array&lt;Object&gt; | The field(s) of the document which should be indexed. | |
| store | Boolean<br>String<br>Array&lt;String&gt; | Enables the document store and optionally restricts which fields are stored. | false |
| Option | Values | Description | Default |
|---|---|---|---|
| split | false<br>RegExp<br>String | The rule used to split content into words when using a non-custom tokenizer (built-ins, e.g. "forward"). Use a string/char or a regular expression. | /[\W_]+/ |
| rtl | Boolean | Enables right-to-left encoding. | false |
| encode | function(str) => [words] | The custom encoding function. | /lang/latin/default.js |
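The default split rule can be tried out directly. A minimal sketch using the documented default pattern:

```javascript
// Demonstrates the default split rule /[\W_]+/: any run of
// non-word characters or underscores acts as a word boundary.
const DEFAULT_SPLIT = /[\W_]+/;

function splitWords(str) {
  // filter(Boolean) drops empty strings caused by leading/trailing separators
  return str.split(DEFAULT_SPLIT).filter(Boolean);
}

console.log(splitWords("Hello_World, foo-bar!")); // ["Hello", "World", "foo", "bar"]
```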
| Option | Values | Description |
|---|---|---|
| stemmer | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom object. |
| filter | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom array. |
| matcher | false<br>String<br>Function | Disable, or pass a language shorthand flag (ISO-3166) or a custom object. |
| Option | Values | Description | Default |
|---|---|---|---|
| limit | Number | Sets the maximum number of results. | 100 |
| offset | Number | Applies an offset (skips the given number of results). | 0 |
| suggest | Boolean | Enables suggestions in results. | false |
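The effect of `limit` and `offset` is plain result paging. Sketched in isolation (a hypothetical helper for illustration, not a FlexSearch API):

```javascript
// Pages through a result id list the way limit/offset behave.
function page(ids, { limit = 100, offset = 0 } = {}) {
  return ids.slice(offset, offset + limit);
}

console.log(page([1, 2, 3, 4, 5], { limit: 2, offset: 2 })); // [3, 4]
```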
| Option | Values | Description | Default |
|---|---|---|---|
| index | String<br>Array&lt;String&gt;<br>Array&lt;Object&gt; | Sets the document fields which should be searched. When no field is set, all fields are searched. Custom options per field are also supported. | |
| tag | String<br>Array&lt;String&gt; | Sets the tag(s) which results must match. | false |
| enrich | Boolean | Enriches the IDs in the results with their corresponding documents. | false |
| bool | "and"<br>"or" | Sets the logical operator used when searching through multiple fields or tags. | "or" |
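How the `bool` operator combines per-field results can be sketched as follows (a simplified model, not the library's actual merge logic, which also handles scoring):

```javascript
// Simplified model of the "bool" option: "or" unions the id lists
// returned per field, "and" intersects them.
function mergeResults(fieldResults, bool) {
  if (bool === "and") {
    // keep only ids that appear in every field's result list
    return fieldResults.reduce((acc, ids) => acc.filter((id) => ids.includes(id)));
  }
  // "or": union, preserving first-seen order
  return [...new Set([].concat(...fieldResults))];
}

console.log(mergeResults([[1, 2, 3], [2, 3, 4]], "or"));  // [1, 2, 3, 4]
console.log(mergeResults([[1, 2, 3], [2, 3, 4]], "and")); // [2, 3]
```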
| Option | Description | Example | Memory Factor (n = length of word) |
|---|---|---|---|
| "strict" | index whole words | `foobar` | * 1 |
| "forward" | incrementally index words in forward direction | `fo`obar, `foob`ar | * n |
| "reverse" | incrementally index words in both directions | foob`ar`, fo`obar` | * 2n - 1 |
| "full" | index every possible combination | fo`oba`r, f`oob`ar | * n * (n - 1) |
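The partials each tokenizer produces can be sketched like this. It illustrates the scheme above and is not FlexSearch's source; a minimum partial length of 2 is assumed for readability:

```javascript
// Generates the partials each tokenizer mode would index for one word.
function partials(word, mode) {
  const out = [];
  switch (mode) {
    case "strict": // whole words only
      return [word];
    case "forward": // all prefixes
      for (let i = 2; i <= word.length; i++) out.push(word.slice(0, i));
      return out;
    case "reverse": // all prefixes plus all suffixes
      for (let i = 2; i <= word.length; i++) out.push(word.slice(0, i));
      for (let i = word.length - 2; i > 0; i--) out.push(word.slice(i));
      return out;
    case "full": // every inner partial
      for (let i = 0; i < word.length; i++)
        for (let j = i + 2; j <= word.length; j++) out.push(word.slice(i, j));
      return out;
  }
}

console.log(partials("foobar", "forward"));
// ["fo", "foo", "foob", "fooba", "foobar"]
```

Counting the generated partials makes the memory factors in the table plausible: "forward" grows linearly with the word length, "full" roughly quadratically.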
| Option | Description | False Positives | Compression |
|---|---|:-:|:-:|
| false | Turn off encoding | no | 0% |
| "default" | Case-insensitive encoding | no | 0% |
| "simple" | Case-insensitive encoding<br>Charset normalizations | no | ~ 3% |
| "balance" | Case-insensitive encoding<br>Charset normalizations<br>Literal transformations | no | ~ 30% |
| "advanced" | Case-insensitive encoding<br>Charset normalizations<br>Literal transformations<br>Phonetic normalizations | no | ~ 40% |
| "extra" | Case-insensitive encoding<br>Charset normalizations<br>Literal transformations<br>Phonetic normalizations<br>Soundex transformations | yes | ~ 65% |
| function() | Pass a custom encoding function: `function(string):[words]` | | |
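The first stages of the encoder chain (case-insensitive matching plus charset normalization, as listed for "simple") can be sketched like this. The mapping table below is an illustrative assumption, not the one bundled with the library:

```javascript
// Minimal sketch of the idea behind the "simple" encoder:
// lowercase everything, then normalize a few charset variants.
const NORMALIZE = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };

function encodeSimple(str) {
  return str
    .toLowerCase()
    .replace(/[äöüß]/g, (c) => NORMALIZE[c]) // charset normalization
    .split(/[\W_]+/)                         // default split rule
    .filter(Boolean);
}

console.log(encodeSimple("Björn-Phillipp")); // ["bjorn", "phillipp"]
```

This is why a query like "bjorn" already matches at the "simple" level in the comparison further below, while "bjoern" needs the phonetic normalizations of "advanced".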
## Enable Contextual Scoring

Create an index and use the default context:

```js
var index = new FlexSearch({
    tokenize: "strict",
    context: true
});
```

Create an index and apply custom options for the context:

```js
var index = new FlexSearch({
    tokenize: "strict",
    context: {
        resolution: 5,
        depth: 3,
        bidirectional: true
    }
});
```

> Only the tokenizer "strict" is currently supported by the contextual index.

> The contextual index requires an additional amount of memory, depending on the depth.

### Auto-Balanced Cache (By Popularity)

You need to initialize the cache and its limit when creating the index:

```js
const index = new Index({ cache: 100 });
```

```js
const results = index.searchCache(query);
```

A common scenario for using a cache is autocomplete or instant search while typing.

> When a number is passed as limit, the cache automatically balances stored entries by their popularity.

> When just using `true`, the cache is unbounded and actually performs 2-3 times faster (because the balancer does not have to run).

## Worker Parallelism (Browser + Node.js)

The new worker model from v0.7.0 is divided into the "fields" of a document (1 worker = 1 field index). This way a worker can solve its tasks (subtasks) completely on its own. The downside of this paradigm is that the workers might not be perfectly balanced in stored content (fields may hold contents of different lengths). On the other hand, there is no indication that balancing the storage provides any advantage (in total, they all require the same amount).
When using a document index, just apply the option "worker":

```js
const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});

index.add({
    id: 1, tag: "cat", name: "Tom", title: "some", text: "some"
}).add({
    id: 2, tag: "dog", name: "Ben", title: "title", text: "content"
}).add({
    id: 3, tag: "cat", name: "Max", title: "to", text: "to"
}).add({
    id: 4, tag: "dog", name: "Tim", title: "index", text: "index"
});
```

```
Worker 1: { 1: "cat", 2: "dog", 3: "cat", 4: "dog" }
Worker 2: { 1: "Tom", 2: "Ben", 3: "Max", 4: "Tim" }
Worker 3: { 1: "some", 2: "title", 3: "to", 4: "index" }
Worker 4: { 1: "some", 2: "content", 3: "to", 4: "index" }
```

When you perform a field search through all fields, the task is balanced perfectly across all workers, which can solve their subtasks independently.

### Worker Index

Above we have seen that document indexes automatically create one worker per field. You can also create a WorkerIndex directly (just like using `Index` instead of `Document`).

Use as an ES6 module:

```js
import WorkerIndex from "./worker/index.js";

const index = new WorkerIndex(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");
```

Or when the bundled version is used instead:

```js
var index = new FlexSearch.Worker(options);
index.add(1, "some")
     .add(2, "content")
     .add(3, "to")
     .add(4, "index");
```

Such a WorkerIndex works pretty much like a created instance of `Index`.

> A WorkerIndex only supports the `async` variant of all methods. That means calling `index.search()` on a WorkerIndex performs async the same way `index.searchAsync()` does.
### Worker Threads (Node.js)

The worker model for Node.js is based on "worker threads" and works exactly the same way:

```js
const { Document } = require("flexsearch");

const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});
```

Or create a single worker instance for a non-document index:

```js
const { Worker } = require("flexsearch");
const index = new Worker(options);
```

### The Worker Async Model (Best Practices)

A worker always performs async. On a query method call you should always handle the returned promise (e.g. use `await`) or pass a callback function as the last parameter.

```js
const index = new Document({
    index: ["tag", "name", "title", "text"],
    worker: true
});
```

All requests and sub-tasks will run in parallel (prioritizing "all tasks completed"):

```js
index.searchAsync(query, callback);
index.searchAsync(query, callback);
index.searchAsync(query, callback);
```

Also (prioritizing "all tasks completed"):

```js
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
index.searchAsync(query).then(callback);
```

Or when you want just one callback after all requests are done, simply use `Promise.all()`, which also prioritizes "all tasks completed":

```js
Promise.all([
    index.searchAsync(query),
    index.searchAsync(query),
    index.searchAsync(query)
]).then(callback);
```

Inside the callback of `Promise.all()` you will also get an array of results as the first parameter, one entry for each query you passed in.

When using `await` you can prioritize the order ("first task completed") and resolve requests one by one, while just the sub-tasks run in parallel:

```js
await index.searchAsync(query);
await index.searchAsync(query);
await index.searchAsync(query);
```

The same applies to `index.add()`, `index.append()`, `index.remove()` and `index.update()`. Here is a special case which isn't disabled by the library, but which you need to keep in mind when using workers.
When you call the "synced" version on a worker index:

```js
index.add(doc);
index.add(doc);
index.add(doc);
// contents aren't indexed yet,
// they are just queued on the message channel
```

Of course you can do that, but keep in mind that the main thread does not have an additional queue for distributed worker tasks. Running these in a long loop fires content massively to the message channel via `worker.postMessage()` internally. Luckily the browser and Node.js handle such incoming tasks automatically (as long as enough free RAM is available). When using the "synced" version on a worker index, the content isn't indexed one line below, because all calls are treated as async by default.

> When adding/updating/removing large bulks of content (or at high frequency), it is recommended to use the async version along with `async/await` to keep a low memory footprint during long processes.

## Export / Import

### Export

The export has changed slightly. It now consists of several smaller parts instead of one large bulk. You need to pass a callback function which has two arguments, "key" and "data". This callback is called once for each part, e.g.:

```js
index.export(function(key, data){
    // you need to store both the key and the data!
    // e.g. use the key for the filename and save your data
    localStorage.setItem(key, data);
});
```

Exporting data to localStorage isn't really good practice, but if size is not a concern then use it if you like. The export primarily exists for usage in Node.js or for storing indexes you want to delegate from a server to the client.

> The size of the export corresponds to the memory consumption of the library. To reduce the export size you have to use a configuration with a smaller memory footprint (use the table at the bottom for information about configurations and their memory allocation).
When your save routine runs asynchronously, you have to return a promise:

```js
index.export(function(key, data){
    return new Promise(function(resolve){
        // do the saving as async
        resolve();
    });
});
```

> You cannot export the additional tables used by the "fastupdate" feature. These tables consist of references, and when stored they get fully serialized and become too large. The library handles this automatically for you: when importing data, the index automatically disables "fastupdate".

### Import

Before you can import data, you need to create your index first. For document indexes, provide the same document descriptor you used when exporting the data. This configuration isn't stored in the export.

```js
var index = new Index({ ... });
```

To import the data, just pass a key and its data:

```js
index.import(key, localStorage.getItem(key));
```

You need to import every key! Otherwise your index will not work. You need to store the keys from the export and use these keys for the import (the order of the keys can differ).

This is just for demonstration and is not recommended, because you might have other keys in your localStorage which aren't supported as an import:

```js
var keys = Object.keys(localStorage);

for(let i = 0, key; i < keys.length; i++){
    key = keys[i];
    index.import(key, localStorage.getItem(key));
}
```

## Languages

Language-specific definitions are divided into two groups:

1. Charset
   1. ___encode___, type: `function(string):string[]`
   2. ___rtl___, type: `boolean`
2. Language
   1. ___matcher___, type: `{string: string}`
   2. ___stemmer___, type: `{string: string}`
   3. ___filter___, type: `string[]`

The charset contains the encoding logic; the language contains the stemmer, the stopword filter and matchers. Multiple language definitions can use the same charset encoder. This separation also lets you manage different language definitions for special use cases (e.g. names, cities, dialects/slang, etc.).
To fully describe a custom language __on the fly__ you need to pass:

```js
const index = new FlexSearch({
    // mandatory:
    encode: (content) => [words],
    // optional:
    rtl: false,
    stemmer: {},
    matcher: {},
    filter: []
});
```

When no parameter is passed, the `latin:default` schema is used by default.
| Field | Category | Description |
|---|---|---|
| encode | charset | The encoder function. Has to return an array of separated words (or an empty string). |
| rtl | charset | A boolean property which indicates right-to-left encoding. |
| filter | language | Filters are also known as "stopwords"; they completely exclude words from being indexed. |
| stemmer | language | The stemmer removes word endings and is a kind of "partial normalization". A word ending only matches when the word is longer than the matched partial. |
| matcher | language | The matcher replaces all occurrences of a given string regardless of its position and is also a kind of "partial normalization". |
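How these three language fields interact can be sketched with illustrative data (the definitions below are made up for demonstration, not taken from a bundled language pack):

```javascript
// matcher: replace everywhere; stemmer: replace word endings only;
// filter: drop stopwords entirely.
const matcher = { "ph": "f" };
const stemmer = { "ization": "ize" };
const filter = new Set(["and", "or"]);

function applyLanguage(words) {
  return words
    .map((w) => w.replace(/ph/g, matcher["ph"]))
    .map((w) => {
      for (const end in stemmer) {
        // a word ending only matches when the word is longer than it
        if (w.length > end.length && w.endsWith(end)) {
          return w.slice(0, -end.length) + stemmer[end];
        }
      }
      return w;
    })
    .filter((w) => !filter.has(w));
}

console.log(applyLanguage(["philosophy", "and", "organization"]));
// ["filosofy", "organize"]
```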
#### Custom Pipeline

First take a look at the default pipeline in `src/common.js`. It is very simple and straightforward. The pipeline processes as some sort of inversion of control: the final encoder implementation has to handle the charset and also the language-specific transformations. This workaround is left over from many tests.

Inject the default pipeline e.g. like this:

```js
this.pipeline(
    /* string: */ str.toLowerCase(),
    /* normalize: */ false,
    /* split: */ split,
    /* collapse: */ false
);
```

Use the pipeline schema from above to understand the iteration and the difference between pre-encoding and post-encoding. Stemmers and matchers need to be applied after charset normalization but before language transformations; the same goes for filters.

A good example of extending pipelines: `src/lang/latin/extra.js` → `src/lang/latin/advanced.js` → `src/lang/latin/simple.js`.

### How to contribute?

Search for your language in `src/lang/`. If it exists, you can extend it or provide variants (like dialect/slang). If the language doesn't exist, create a new file and check whether any of the existing charsets (e.g. latin) fits your language. When no charset exists, you need to provide one as a base for the language.

A new charset should provide at least:

1. `encode` — a function which normalizes the charset of the passed text content (removes special chars, applies lingual transformations, etc.) and __returns an array of separated words__. The stemmer, matcher or stopword filter also needs to be applied here. When the language has no words, make sure to provide something similar, e.g. each Chinese sign could also be a "word". Don't return the whole text content without splitting.
2. `rtl` — a boolean flag which indicates right-to-left encoding.

Basically the charset just needs to provide an encoder function along with an indicator for right-to-left encoding:

```js
export function encode(str){ return [str] }
export const rtl = false;
```

## Encoder Matching Comparison

> Reference String: __"Björn-Phillipp Mayer"__
| Query | default | simple | advanced | extra |
|---|:-:|:-:|:-:|:-:|
| björn | yes | yes | yes | yes |
| björ | yes | yes | yes | yes |
| bjorn | no | yes | yes | yes |
| bjoern | no | no | yes | yes |
| philipp | no | no | yes | yes |
| filip | no | no | yes | yes |
| björnphillip | no | yes | yes | yes |
| meier | no | no | yes | yes |
| björn meier | no | no | yes | yes |
| meier fhilip | no | no | yes | yes |
| byorn mair | no | no | no | yes |
| (false positives) | no | no | no | yes |
| Modifier | Memory Impact * | Performance Impact ** | Matching Impact ** | Scoring Impact ** |
|---|---|---|---|---|
| resolution | +1 (per level) | +1 (per level) | 0 | +2 (per level) |
| depth | +4 (per level) | -1 (per level) | -10 + depth | +10 |
| minlength | -2 (per level) | +2 (per level) | -3 (per level) | +2 (per level) |
| bidirectional | -2 | 0 | +3 | -1 |
| fastupdate | +1 | +10 (update, remove) | 0 | 0 |
| optimize: true | -7 | -1 | 0 | -3 |
| encoder: "icase" | 0 | 0 | 0 | 0 |
| encoder: "simple" | -2 | -1 | +2 | 0 |
| encoder: "advanced" | -3 | -2 | +4 | 0 |
| encoder: "extra" | -5 | -5 | +6 | 0 |
| encoder: "soundex" | -6 | -2 | +8 | 0 |
| tokenize: "strict" | 0 | 0 | 0 | 0 |
| tokenize: "forward" | +3 | -2 | +5 | 0 |
| tokenize: "reverse" | +5 | -4 | +7 | 0 |
| tokenize: "full" | +8 | -5 | +10 | 0 |
| document index | +3 (per field) | -1 (per field) | 0 | 0 |
| document tags | +1 (per tag) | -1 (per tag) | 0 | 0 |
| store: true | +5 (per document) | 0 | 0 | 0 |
| store: [fields] | +1 (per field) | 0 | 0 | 0 |
| cache: true | +10 | +10 | 0 | 0 |
| cache: 100 | +1 | +9 | 0 | 0 |
| type of ids: number | 0 | 0 | 0 | 0 |
| type of ids: string | +3 | -3 | 0 | 0 |