A component to integrate authorization-aware full-text search into a mu.semte.ch stack using Elasticsearch.
The mu-search service uses Elasticsearch as a backend. Elasticsearch needs a large number of memory map areas, so raise the kernel limit vm.max_map_count on the Docker host by executing the following command (root privileges required):
sysctl -w vm.max_map_count=262144
Next, add the mu-search and accompanying elasticsearch service to docker-compose.yml:
services:
  search:
    image: semtech/mu-search:0.10.0
    links:
      - db:database
    volumes:
      - ./config/search:/config
  elasticsearch:
    image: semtech/mu-search-elastic-backend:1.0.0
    volumes:
      - ./data/elasticsearch/:/usr/share/elasticsearch/data
    environment:
      - discovery.type=single-node
The indices will be persisted in ./data/elasticsearch. The search service needs to be linked to an instance of the mu-authorization service.
Create the ./config/search directory and create a config.json with the following contents:
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : "http://purl.org/dc/elements/1.1/title",
"description" : "http://purl.org/dc/elements/1.1/description"
}
},
{
"type" : "user",
"on_path" : "users",
"rdf_type" : "http://xmlns.com/foaf/0.1/Person",
"properties" : {
"fullname" : "http://xmlns.com/foaf/0.1/name"
}
}
]
}
Finally, add the following rules to your dispatcher configuration in ./config/dispatcher.ex to make the search endpoint available:
define_accept_types [
json: [ "application/json", "application/vnd.api+json" ]
]
@json %{ accept: %{ json: true } }
get "/search/*path", @json do
Proxy.forward conn, path, "http://search/"
end
Restart the dispatcher service to pick up the new configuration:
docker-compose restart dispatcher
Restart the stack using docker-compose up -d. The elasticsearch and search services will be created.
Search queries can now be sent to the /search endpoint. Make sure the user has access to the data according to the authorization rules.
By default, search indexes are deleted on (re)start of the mu-search service. This guide describes how to persist search indexes across restarts. This configuration is recommended for production environments.
First, make sure the search indexes are written to a mounted volume by specifying a bind mount to /usr/share/elasticsearch/data on the Elasticsearch container:
services:
  elasticsearch:
    image: semtech/mu-search-elastic-backend:1.0.0
    volumes:
      - ./data/elasticsearch/:/usr/share/elasticsearch/data
Recreate the elasticsearch container by executing the following command:
docker-compose up -d
Next, enable the persistent indexes flag in the root of the search configuration file ./config/search/config.json of your project:
{
"persist_indexes": true,
"types": [
// index type specifications
]
}
Restart the search service to pick up the new configuration:
docker-compose restart search
Search indexes will be persisted in the ./data/elasticsearch folder and will not be deleted on restart of the search service.
The search API provided by mu-search is authorization-aware, i.e. search results will only contain resources the user is allowed to access. To this end, mu-search organises its search indexes per access right. Based on the user's allowed groups set on the incoming search request, mu-search determines which indexes to search in.
Indexes that don't exist yet will be created before the search operation is performed. Depending on the number of documents to index this may be a time-consuming operation.
Mu-search allows configuring authorization groups for which the indexes must already be created on startup. This saves time when the first search query for that profile arrives.
Configuration is done via the eager_indexing_groups property in the search configuration file ./config/search/config.json. The eager indexing groups are tightly related to the GroupSpec objects configured in mu-authorization.
The eager_indexing_groups property is an array of group specifications. Each group specification is an array of objects in which each object consists of:
- name : name of the group specification (GroupSpec) in mu-authorization
- variables : array of string values used to construct the graph URI for the group. These variables should match the possible result values of the vars in case of an AccessByQuery access rule in the GroupSpec. In case of an AlwaysAccessible access rule, this should be an empty array.
If the application only provides public data for unauthenticated users in the graph http://mu.semte.ch/graphs/public, the following eager indexing groups must be configured:
[
[ { "name": "public", "variables" : [] } ],
[ { "name": "clean", "variables": [] } ]
]
If, next to the public data, data is organized per organization unit in graphs like http://mu.semte.ch/graphs/<unit-name>, the following eager indexing groups must be configured:
[
[ { "name": "public", "variables" : [] }, { "name": "organization-unit", "variables" : ["finance"] } ],
[ { "name": "public", "variables" : [] }, { "name": "organization-unit", "variables" : ["legal"] } ],
[ { "name": "clean", "variables": [] } ]
]
In case a group contains a variable, an eager index must be configured for each possible value if you want all search indexes to be prepared upfront.
Eager indexes may be combined at search time to match the user's allowed groups. For example, if some users have access to the data of the finance department as well as the legal department, both indexes will be queried when the user performs a search operation.
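Because one entry is needed per variable value, the list can get long. The sketch below generates the eager indexing groups from a list of unit names, following the shape of the example above. The function name eager_groups_for_units is hypothetical, and the group names public, organization-unit and clean are taken from the example rather than from any fixed mu-search vocabulary.

```python
import json


def eager_groups_for_units(unit_names):
    """Build one eager indexing entry per unit, each combined with the public group."""
    groups = [
        [
            {"name": "public", "variables": []},
            {"name": "organization-unit", "variables": [unit]},
        ]
        for unit in unit_names
    ]
    # The 'clean' group is listed on its own, as in the examples above.
    groups.append([{"name": "clean", "variables": []}])
    return groups


# JSON text that could be pasted under "eager_indexing_groups" in config.json.
snippet = json.dumps(eager_groups_for_units(["finance", "legal"]), indent=2)
```

Generating the list this way only helps when the set of variable values is known at configuration time.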
This how-to guide explains how to integrate mu-search with the delta-notifier in order to automatically update search index entries when data in the triplestore is modified.
This guide assumes the mu-authorization and delta-notifier components have been added to your stack as explained in their respective installation guides.
Open the delta-notifier rules configuration ./config/delta/rules.js and add the following rule:
{
match: {
// listen to all changes
},
callback: {
url: 'http://search/update',
method: 'POST'
},
options: {
resourceFormat: "v0.0.1",
gracePeriod: 10000,
ignoreFromSelf: true
}
}
Enable automatic index updates (not only invalidation) in mu-search by setting the automatic_index_updates flag at the root of ./config/search/config.json:
{
"automatic_index_updates": true,
"types": [
// definition of the indexed types
]
}
Restart the search and delta-notifier services:
docker-compose restart search delta-notifier
Any change you make in your application will now trigger a request to the /update endpoint of mu-search. Depending on the indexed resources and properties, mu-search will update the appropriate search index entries.
This guide explains how to make the content of files attached to a project resource searchable in the index.
This guide assumes you have already integrated mu-search in your application and configured an index for resources of type schema:Project.
To index files, mu-search requires a Tika server to extract the content. Add the tika service next to the search and elasticsearch services in docker-compose.yml:
services:
  search:
    ...
  elasticsearch:
    ...
  tika:
    image: apache/tika:1.25-full
Next, add the following mounted volumes to the mu-search service in docker-compose.yml:
- /data : folder containing the files to be indexed
- /cache : folder to persist the cache of files processed by Tika
services:
  search:
    image: semtech/mu-search:0.10.0
    volumes:
      - ./config/search:/config
      - ./data/files:/data
      - ./data/search/cache:/cache
Next, add a property files in the project type index configuration. The property files will hold the content and metadata of the files.
{
"types" : [
{
"type" : "project",
"on_path" : "projects",
"rdf_type" : "http://schema.org/Project",
"properties" : {
"name" : "http://schema.org/name",
"files" : {
"via" : [
"http://purl.org/dc/terms/hasPart",
"^http://www.semanticdesktop.org/ontologies/2007/01/19/nie#dataSource"
],
"attachment_pipeline" : "attachment"
}
}
}
]
}
via expresses the path from the indexed resource to the file(s) having a URI like <share://path/to/your/file.pdf>.
Recreate the mu-search service using
docker-compose up -d
After the reindex has completed, each indexed project will contain a property files holding the content and metadata of the files linked to the project via dct:hasPart/^nie:dataSource.
Searching the file's content is done using the nested property content on the defined field name, files in this case:
GET /projects/search?filter[files.content]=open-source
The content of a search index can be inspected by running a Kibana dashboard on top of Elasticsearch. Add the following snippet to your docker-compose.override.yml:
services:
  kibana:
    image: docker.elastic.co/kibana/kibana-oss:7.6.2
    ports:
      - 127.0.0.1:5601:5601
    user: root
    command: |
      sh -c "/usr/local/bin/kibana-docker --allow-root;"
Start the container:
docker-compose up -d kibana
Once Kibana has started, the dashboard is available at http://localhost:5601.
Make sure not to expose the Kibana dashboard in a production environment!
[To be completed...]
Elasticsearch is used as a search engine. It indexes documents according to a specified configuration and provides a REST API to search documents. The mu-search service is a layer in front of Elasticsearch that allows specifying the mapping between RDF triples and the Elasticsearch documents/properties. It also integrates with mu-authorization, making sure users can only search for documents they're allowed to access.
This section describes how to configure the resources and properties to be indexed and how to pass Elasticsearch-specific settings and mappings in the mu-search configuration file.
This section describes how the mapping between RDF triples and Elasticsearch documents can be specified in the mounted /config/config.json configuration file.
The config.json file contains a JSON object with a property types. This property contains an array of objects, one per document type that must be searchable.
{
"types": [
// object per searchable document type
]
}
Note that these types do not map one-to-one to the search indexes in Elasticsearch. For each document type in the list, a search index will be created per authorization group.
Each type object in the types array consists of the following properties:
- type : name of the type
- on_path : path on which the search endpoint will be published
- rdf_type : URI of the rdf:Class of the documents to index
- properties : mapping of RDF predicates to document properties
- settings : type specific Elasticsearch settings
- mappings : type specific Elasticsearch mapping
properties contains a JSON object with a key per property in the resulting Elasticsearch document. These are the properties that will be searchable via the search API for the given resource type. The value of each key defines the mapping to RDF predicates starting from the root resource.
WARNING: there are two protected fields that should not be used as property keys: uuid and uri. Both are used internally by the mu-search service to store the uuid and URI of the root resource.
In the simplest scenario, the properties that need to be searchable map one-to-one onto a predicate (path) of the resource.
In the example below, a search index per user group will be created for documents and users. The documents index contains resources of type foaf:Document with a title and description. The users index contains resources of type foaf:Person with only fullname as searchable property.
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : "http://purl.org/dc/elements/1.1/title",
"description" : "http://purl.org/dc/elements/1.1/description"
}
},
{
"type" : "user",
"on_path" : "users",
"rdf_type" : "http://xmlns.com/foaf/0.1/Person",
"properties" : {
"fullname" : "http://xmlns.com/foaf/0.1/name"
}
}
]
}
If multiple values are found in the triplestore for a given predicate, the resulting value for the property in the search document will be an array of all values.
A property of the search document may also map to an inverse predicate, i.e. the resource to be indexed is the object instead of the subject of the triple. An inverse predicate is indicated in the mapping by prefixing the predicate URI with ^, as done in a SPARQL query.
In the example below the users index contains a property group that maps to the inverse predicate foaf:member relating a group to a user.
{
"types" : [
{
"type" : "user",
"on_path" : "users",
"rdf_type" : "http://xmlns.com/foaf/0.1/Person",
"properties" : {
"fullname" : "http://xmlns.com/foaf/0.1/name",
"group": "^http://xmlns.com/foaf/0.1/member"
}
}
]
}
Properties can also be mapped to lists of predicates, corresponding to a property path in RDF. In this case, the mapping is an array of strings, one string per path segment. The array starts from the indexed resource and may also include inverse predicate URIs.
In the example below the documents index contains a property topics that maps to the label of the document's primary topic and a property publishers that maps to the names of the publishers via the inverse foaf:publications predicate.
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : "http://purl.org/dc/elements/1.1/title",
"description" : "http://purl.org/dc/elements/1.1/description",
"topics" : [
"http://xmlns.com/foaf/0.1/primaryTopic",
"http://www.w3.org/2004/02/skos/core#prefLabel"
],
"publishers": [
"^http://xmlns.com/foaf/0.1/publications",
"http://xmlns.com/foaf/0.1/name"
]
}
}
]
}
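To make the three mapping styles concrete (single predicate, inverse predicate, property path), the sketch below resolves a mapping against a small in-memory list of triples. It only illustrates the semantics described above and is not how mu-search queries the triplestore. The helper names follow and resolve, and the example.org URIs, are hypothetical.

```python
def follow(triples, start, predicate):
    """Return the nodes reached from `start` via one predicate; '^' marks an inverse predicate."""
    if predicate.startswith("^"):
        uri = predicate[1:]
        # Inverse: the indexed resource is the object, so collect the subjects.
        return [s for (s, p, o) in triples if p == uri and o == start]
    return [o for (s, p, o) in triples if p == predicate and s == start]


def resolve(triples, resource, mapping):
    """Resolve a property mapping: a single predicate string or a list forming a path."""
    path = [mapping] if isinstance(mapping, str) else mapping
    nodes = [resource]
    for predicate in path:
        nodes = [n for node in nodes for n in follow(triples, node, predicate)]
    return nodes  # always a list; several matches yield several values


FOAF = "http://xmlns.com/foaf/0.1/"
DOC = "http://example.org/docs/1"  # hypothetical document URI
triples = [
    ("http://example.org/people/1", FOAF + "publications", DOC),
    ("http://example.org/people/1", FOAF + "name", "Jane Doe"),
    ("http://example.org/people/2", FOAF + "publications", DOC),
    ("http://example.org/people/2", FOAF + "name", "John Roe"),
]

# Same path as the "publishers" property in the configuration above.
publishers = resolve(triples, DOC, ["^" + FOAF + "publications", FOAF + "name"])
```

With two people publishing the same document, publishers holds both names, which mirrors the rule that multiple values end up as an array in the search document.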
To make the content of a file searchable, it needs to be indexed as a property in a search index. Basic indexing of PDF, Word etc. files is provided using a local Apache Tika instance. A default ingest pipeline named attachment is created on startup of the mu-search service. Note that this is under development and liable to change.
Defining a property to index the content of a file requires the following keys:
- via : mapping of the RDF predicate (path) that relates the resource with the file(s) to index. The file the predicate path leads to must have a URI starting with share:// indicating the location of the file, e.g. <share://path/to/your/file.pdf>.
- attachment_pipeline : attachment pipeline to use for indexing the files. Set to attachment to use the default ingest pipeline.
The example below adds a property files in the project type index configuration. The property files will hold the contents of the files related to the project via dct:hasPart/^nie:dataSource.
{
"types" : [
{
"type" : "project",
"on_path" : "projects",
"rdf_type" : "http://schema.org/Project",
"properties" : {
"name" : "http://schema.org/name",
"files" : {
"via" : [
"http://purl.org/dc/terms/hasPart",
"^http://www.semanticdesktop.org/ontologies/2007/01/19/nie#dataSource"
],
"attachment_pipeline" : "attachment"
}
}
}
]
}
For each file retrieved through the via definition, the Tika processing results in an object containing the extracted text (as content). Other extracted metadata may be added in the future. Such an object may look like this:
{
"content": "Extracted text here"
}
These objects are structured in the same way as the attachment objects resulting from Elasticsearch's Ingest Attachment Processor Plugin. Keep in mind that this implies you need to specify the path to a specific property of the attachment object when defining an Elasticsearch mapping. E.g. mapping the file's content for the files field from the example above may look as follows:
{
"types": [
{
"type": "project",
"on_path": "projects",
...
"mappings" : {
"properties": {
"name" : { "type" : "text" },
"files.content" : { "type" : "text" }
}
}
},
// other type definitions
]
}
Currently, only indexing of local files is supported. The files' logical path as well as other metadata is expected to be in the format specified by the file-service. Files must be present in the Docker volume /data inside the container.
Attachments processed by Tika are cached in the directory /cache (by SHA256 of the file contents). This must be defined as a shared volume for the cache to be persistent.
See also "How to specify a file's content as property".
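The cache is keyed on the SHA256 of the file contents, so two files with identical bytes share one cache entry regardless of their path. A minimal sketch of computing such a key with Python's standard library follows. The function name cache_key and the chunked reading are illustrative assumptions; how mu-search names the entries inside /cache is not specified here.

```python
import hashlib


def cache_key(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object returned at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat for large attachments.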
It's possible to map several resources of different RDF classes onto one index where that makes sense, e.g. if they share the same properties.
To do so, specify rdf_type as an array in config.json:
"rdf_type": [
"http://data.vlaanderen.be/ns/besluit#Bestuurseenheid",
"http://data.lblod.info/vocabularies/erediensten/CentraalBestuurVanDeEredienst",
"http://data.lblod.info/vocabularies/erediensten/BestuurVanDeEredienst",
"http://data.lblod.info/vocabularies/erediensten/RepresentatiefOrgaan"
],
Note that this is different from a composite index, where each type has its own index, as well as being indexed in the composite index. Another difference is that the composite index allows mapping different properties from the sub indexes onto one property in the composite index.
A search document can contain nested objects up to an arbitrary depth. For example for a person you can nest the address object as a property of the person search document.
A nested object is defined by the following properties:
- via : mapping of the RDF predicate that relates the resource with the nested object. May also be an inverse URI, or a list of predicates (a property path) as in non-nested properties
- rdf_type : URI of the rdf:Class of the nested object
- properties : mapping of RDF predicates to properties for the nested object
The properties object is defined in the same way as the properties of the root document, but the properties of a nested object cannot contain file attachments.
Elasticsearch mappings for nested objects must be specified in the mappings object at the root type using a path expression as key.
In the example below the document's creator is nested in the author property of the search document. The nested person object contains the properties fullname and project, the latter holding the title of the person's current project.
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : "http://purl.org/dc/elements/1.1/title",
"description" : "http://purl.org/dc/elements/1.1/description",
"author" : {
"via" : "http://purl.org/dc/elements/1.1/creator",
"rdf_type" : "http://xmlns.com/foaf/0.1/Person",
"properties" : {
"fullname" : "http://xmlns.com/foaf/0.1/name",
"project": [
"http://xmlns.com/foaf/0.1/currentProject",
"http://purl.org/dc/elements/1.1/title"
]
}
}
},
"mappings": {
"properties": {
"title" : { "type" : "text" },
"author.fullname": { "type" : "text" }
}
}
}
]
}
NOTE: currently mu-search does not take the rdf_type of the nested object into account. In the above example, any resource linked via the dc:creator predicate would be included in the Elasticsearch document.
Mu-search has experimental support for multilingual values. This is enabled by setting the type of a property to language-string. Background on this feature can be found in rfcs/multi-language-search.md.
For example:
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : {
"via": "http://purl.org/dc/elements/1.1/title",
"type": "language-string"
}
},
"mappings": {
"properties": {
"title.default" : { "type" : "text" },
"title.en": { "type" : "text" }
}
}
}
]
}
When setting a property type to language-string, mu-search will include the language tag of the literal in the search index. In the above example the title field would be expanded to a language container in the document:
{
"title": {
"en": ["the english title"],
"default": ["this literal had no language tag"]
}
}
Literals without a language tag are mapped onto the "default" field.
For searching, make sure to either specify the appropriate field (filter[title.en]=xyz) or make use of a wildcard: filter[title.*]=xyz.
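The expansion into a language container can be pictured as grouping literals by their language tag. The sketch below reproduces the container shown above from (value, tag) pairs. The function name language_container and the input format are assumptions made for illustration, not mu-search internals.

```python
def language_container(literals):
    """Group (value, language_tag) pairs by tag; untagged literals go under 'default'."""
    container = {}
    for value, lang in literals:
        container.setdefault(lang or "default", []).append(value)
    return container


title = language_container([
    ("the english title", "en"),
    ("this literal had no language tag", None),
])
```

Each tag maps to a list, so several literals in the same language are all kept.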
It's often advisable to configure language-specific analyzers for each language. This can be done in the mappings section of the configuration.
A search index can contain documents of different types, e.g. documents (foaf:Document) as well as creative works (schema:CreativeWork). Currently, each simple type the composite index consists of must also be defined separately in the index configuration.
A definition of a composite type index consists of the following properties:
- type : name of the composite type
- composite_types : list of simple type names that constitute the index
- on_path : path on which the search endpoint will be published
- properties : mapping of RDF predicates to document properties for each simple type
In contrast to the properties of a simple index, the properties of a composite index is an array. Each entry in the array is an object with the following properties:
- name : name of the property of the search document
- mappings : mapping to the simple type property per simple type. If the mapping for a simple type is absent, the same property name as the composite document is assumed.
The example below contains 2 simple indexes for documents and creative works, and a composite index dossier containing both simple index types. The composite index contains (1) a property name mapping to the document's title and the creative work's name property respectively, and (2) a property description mapping to the description property for both simple types.
{
"types" : [
{
"type" : "document",
"on_path" : "documents",
"rdf_type" : "http://xmlns.com/foaf/0.1/Document",
"properties" : {
"title" : "http://purl.org/dc/elements/1.1/title",
"description" : "http://purl.org/dc/elements/1.1/description"
}
},
{
"type" : "creative-work",
"on_path" : "creative-works",
"rdf_type" : "http://schema.org/CreativeWork",
"properties" : {
"name": "http://schema.org/name",
"description": "http://schema.org/description"
}
},
{
"type" : "dossier",
"composite_types" : ["document", "creative-work"],
"on_path" : "dossiers",
"properties" : [
{
"name" : "name",
"mappings" : {
"document" : "title",
"creative-work" : "name"
}
},
{
"name" : "description",
"mappings" : {
"document" : "description"
// mapping for 'creative-work' is missing, hence same property name 'description' is assumed
}
}
]
}
]
}
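The fallback rule for absent mappings is easy to miss, so here is a small sketch that projects a simple-type document onto the composite properties of the dossier example. The function name to_composite and the sample documents are hypothetical; only the mapping rule itself comes from the description above.

```python
def to_composite(simple_type, document, composite_properties):
    """Project a simple-type document onto the properties of a composite index."""
    result = {}
    for prop in composite_properties:
        # Without a mapping for this simple type, the composite property name is assumed.
        source = prop.get("mappings", {}).get(simple_type, prop["name"])
        if source in document:
            result[prop["name"]] = document[source]
    return result


dossier_properties = [
    {"name": "name", "mappings": {"document": "title", "creative-work": "name"}},
    {"name": "description", "mappings": {"document": "description"}},
]

from_document = to_composite(
    "document", {"title": "Annual report", "description": "Figures"}, dossier_properties
)
from_work = to_composite(
    "creative-work", {"name": "Poster", "description": "A3 print"}, dossier_properties
)
```

Both results expose the same two keys, name and description, which is what makes them searchable through one composite index.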
Elasticsearch provides many index configuration settings for analysis, logging, etc. Mu-search allows providing this configuration for the whole domain and/or overriding it on a per-type basis. Note that type-specific settings replace the defaults; they are currently not merged.
To specify Elasticsearch settings for all indexes, use default_settings next to the types specification:
"types" : [
// definition of the indexed types
],
"default_settings" : {
"analysis": {
"analyzer": {
"dutchanalyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding", "dutchstemmer"]
}
},
"filter": {
"dutchstemmer": {
"type": "stemmer",
"name": "dutch"
}
}
}
}
The content of the default_settings object is not elaborated here but can be found in the official Elasticsearch documentation. All settings provided in settings in the Elasticsearch configuration can be used verbatim in the default_settings of mu-search.
To specify Elasticsearch settings for a single type, use settings on the type index specification:
{
"types": [
{
"type": "document",
"on_path": "documents",
...
"settings" : {
"analysis": {
"analyzer": {
"dutchanalyzer": {
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding", "dutchstemmer"]
}
},
"filter": {
"dutchstemmer": {
"type": "stemmer",
"name": "dutch"
}
}
}
},
// other type definitions
]
}
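Since per-type settings override the defaults without being merged, a type that defines settings must repeat anything it still needs from default_settings. The sketch below spells out that reading. The function name effective_settings is hypothetical, and number_of_shards is used only as a sample Elasticsearch setting.

```python
def effective_settings(config, type_name):
    """Return the Elasticsearch settings that apply to one index type."""
    type_def = next(t for t in config["types"] if t["type"] == type_name)
    # Type-level settings replace default_settings wholesale; the two are not merged.
    if "settings" in type_def:
        return type_def["settings"]
    return config.get("default_settings", {})


config = {
    "default_settings": {
        "analysis": {"analyzer": {"dutchanalyzer": {"tokenizer": "standard"}}}
    },
    "types": [
        {"type": "document", "settings": {"number_of_shards": 1}},
        {"type": "user"},
    ],
}

document_settings = effective_settings(config, "document")
user_settings = effective_settings(config, "user")
```

Here the document type loses the dutchanalyzer definition because its own settings object does not repeat it, while the user type inherits the defaults unchanged.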
Elasticsearch provides the option to configure a mapping per index to specify how the properties of a document are stored and indexed, e.g. the type of the property value (string, date, boolean, ...), the text analysis to be applied on the value, etc.
In the mu-search configuration the Elasticsearch mappings can be passed via the mappings property per index type specification.
{
"types": [
{
"type": "document",
"on_path": "documents",
...
"mappings" : {
"properties": {
"title" : { "type" : "text" },
"description" : { "type" : "text" }
}
}
},
// other type definitions
]
}
The content of the mappings object is not elaborated here but can be found in the official Elasticsearch documentation. All settings provided in mappings.properties in the Elasticsearch configuration can be used verbatim in the mappings of mu-search.
In the base scenario, indexes are created on an as-needed basis, whenever a new search profile (authorization rights and data type) is received. The first search query for a new search profile may therefore take more time to complete, because the index still needs to be built. Indexes can be manually re-indexed by triggering the POST /:type/index endpoint (see below).
When an index is created, it is registered in the triplestore in the <http://mu.semte.ch/authorization> graph.
[To be completed... describe used model in the triplestore]
By default, on startup or restart of mu-search, all existing indexes are deleted, since data might have changed in the meantime. However, regenerating indexes might be a costly operation, especially in production environments.
Persistence of indexes can be enabled via the persist_indexes flag at the root of the mu-search configuration file:
{
"persist_indexes": true,
"types": [
// index type specifications
]
}
Possible values are true and false. Defaults to false.
Note that if set to true, the indexes may be out-of-date if data has changed in the application while mu-search was down.
Indexes can be configured to be pre-built when the application starts. For each user search profile for which the indexes need to be prepared, the authorization group names and their corresponding variables need to be passed.
{
"eager_indexing_groups": [
[
{ "variables": ["company-x"], "name": "organization-read" },
{ "variables": ["company-x"], "name": "organization-write" },
{ "variables": [], "name": "public" }
],
[
{ "variables": ["company-y"], "name": "organization-read" },
{ "variables": [], "name": "public" }
],
[
{ "variables": [], "name": "clean" }
]
],
"types": [
// index type specifications
]
}
Note that if you want to prepare indexes for all user profiles in your application, you will have to provide an entry in the eager_indexing_groups list for each possible variable value. For example, if you have an authorization group defining that a user can only access the data of their company (hence, the company name is a variable of the authorization group), you will need to define an eager index group for each of the possible companies in your application.
Additive indexes are indexes that may be combined to respond to a search query in order to fully match the user's authorization groups. If a user is granted access to multiple groups, indexes will be combined to calculate the response. Therefore, it's strongly advised that the indexes contain non-overlapping data. Otherwise the result set may contain duplicates (see also: removing duplicate results).
Only indexes that are defined in the eager_indexing_groups will be used in combinations. If no combination can be found that fully matches the user's authorization groups, a single index will be created for the request's authorization groups.
If data that is needed to build documents of a search index is stored across different authorization groups (e.g. public and an organization-specific group), these groups need to be specified together in an eager group and not separately. Otherwise the search index will only contain 'partial' documents.
Assume your application contains a company-specific user group in the authorization configuration; 2 companies: company X and company Y; and mu-search contains one search index definition for documents. A search index will be generated for documents of company X and another index will be generated for documents of company Y. If a user is granted access to documents of company X as well as to documents of company Y, a search query performed by this user will be answered by combining both search indexes.
A typical group to be specified as a single eager_indexing_group is { "variables": [], "name": "clean" }. The index will not contain any data, but will be used in the combination to fully match the user's allowed groups.
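The combination rule described above can be sketched as follows: keep every eager group whose members are all among the user's allowed groups, and fall back to one dedicated index when the kept groups do not cover the allowed groups completely. This is a simplified greedy illustration under stated assumptions, not the selection algorithm mu-search implements. The function names group_key and select_indexes are hypothetical.

```python
def group_key(group):
    """Hashable identity of a group: its name plus its variable values."""
    return (group["name"], tuple(group["variables"]))


def select_indexes(allowed_groups, eager_indexing_groups):
    """Pick eager groups that together match the allowed groups, else fall back."""
    allowed = {group_key(g) for g in allowed_groups}
    selected, covered = [], set()
    for eager in eager_indexing_groups:
        keys = {group_key(g) for g in eager}
        # Usable only if the user holds every group in it; skip it if it adds nothing new.
        if keys <= allowed and not keys <= covered:
            selected.append(eager)
            covered |= keys
    if covered == allowed:
        return selected
    # No full match: a single index for exactly the request's groups is needed.
    return [allowed_groups]


public = {"name": "public", "variables": []}
clean = {"name": "clean", "variables": []}
finance = {"name": "organization-unit", "variables": ["finance"]}
legal = {"name": "organization-unit", "variables": ["legal"]}
eager = [[public, finance], [public, legal], [clean]]

combined = select_indexes([public, finance, legal, clean], eager)
fallback = select_indexes([public, {"name": "organization-unit", "variables": ["hr"]}], eager)
```

In the first call all three eager groups are combined, including the empty clean index that completes the match. In the second call no eager group fits, so one index for the request's own groups is returned.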
In some cases you may want to ignore certain allowed groups when looking for matching indexes, typically because they do not relate to data that has to be indexed and you want to avoid having many empty indexes. In this case you will have to provide an entry in the ignored_allowed_groups list for each group. Currently this means including each possible variable value.
For example, the clean group can be added to ignored_allowed_groups by adding { "variables": [], "name": "clean" } to the list.
In some cases you may encounter variables which are not known up front. The "variables" array accepts a "*" to indicate a wildcard for an attribute. This is currently supported in ignored_allowed_groups. In eager_indexing_groups this is supported, but only if the eager_indexing_group array contains a single group. Within eager_indexing_groups this allows creating a dynamic index for an access right whilst still indicating that this index does not impact other indexes. For example, you may want to index the user's message history ([{ "name": "user", "variables": ["*"] }]), which does not impact the index of the code-lists in public ([{ "name": "public", "variables": [] }]).
An example for ignored groups may be to ignore all of the anonymous sessions' information, which could be done as: "ignored_allowed_groups": [ { "name": "anonymous-session", "variables": ["*"] } ].
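A sketch of how a "*" entry could be compared against a concrete group is shown below. It assumes the wildcard matches exactly one variable position, which is an interpretation of the description above rather than documented behaviour. The function names matches and strip_ignored are hypothetical.

```python
def matches(pattern, group):
    """True if `group` fits `pattern`; a '*' in the pattern matches any single variable."""
    if pattern["name"] != group["name"]:
        return False
    if len(pattern["variables"]) != len(group["variables"]):
        return False
    return all(p == "*" or p == v for p, v in zip(pattern["variables"], group["variables"]))


def strip_ignored(allowed_groups, ignored_allowed_groups):
    """Drop allowed groups that match any ignored pattern before looking up indexes."""
    return [
        g for g in allowed_groups
        if not any(matches(p, g) for p in ignored_allowed_groups)
    ]


ignored = [{"name": "anonymous-session", "variables": ["*"]}]
allowed = [
    {"name": "public", "variables": []},
    {"name": "anonymous-session", "variables": ["f3a9"]},
]
remaining = strip_ignored(allowed, ignored)
```

After stripping, only the public group is left, so every anonymous session can share the same index.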
Mu-search integrates with the deltas generated by mu-authorization and dispatched by the delta-notifier.
Follow the "How to integrate mu-search with delta's to update search indexes" guide to set up delta notification handling for mu-search. Deltas are expected in the v0.0.1 format of the delta-notifier.
By default, when a delta notification is received by mu-search, all indexes containing data related to the changes are invalidated. The index will be rebuilt the next time it is searched.
Note that a change on one resource may trigger the invalidation of multiple indexes depending on the authorization groups.
As an alternative to full index invalidation, indexes can be dynamically updated on a per-document basis according to received deltas. When a delta is received, the document corresponding to the delta is updated (or deleted) in every index corresponding to the delta. This update is not a blocking operation: an update will not lock the index, so a search request received at the same time may run against the not-yet-updated index.
Note that a change on one resource may trigger the update of multiple indexes depending on the authorization groups.
Partial index updates are enabled by setting the automatic_index_updates flag at the root of the search configuration:
{
"automatic_index_updates": true,
"types": [
// definition of the indexed types
]
}
When a delta notification is handled, the update to be performed is pushed onto the update queue. By default the queue is processed every minute. This interval can be configured via update_wait_interval_minutes in the root of the search configuration:
{
"automatic_index_updates": true,
"update_wait_interval_minutes": 8,
"types": [
// definition of the indexed types
]
}
Increasing the interval has the advantage that updates on the same document will be applied only once, but has the downside that search results will not be up-to-date for a longer time. The optimal value depends on the application (number of updates, indexed properties, user expectations, etc.)
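The benefit of a longer interval comes from coalescing: repeated changes to one document between two processing runs collapse into a single update. The sketch below shows that idea with a dictionary keyed on index and document. The class UpdateQueue and its methods are hypothetical and do not describe mu-search's internal data structure.

```python
class UpdateQueue:
    """Collects pending document updates and hands them out once per interval."""

    def __init__(self):
        # Dict keys keep insertion order and make duplicate entries collapse.
        self._pending = {}

    def push(self, index_name, document_uri):
        """Register that a document needs updating in an index."""
        self._pending[(index_name, document_uri)] = True

    def drain(self):
        """Return all pending updates once and empty the queue."""
        items = list(self._pending)
        self._pending.clear()
        return items


queue = UpdateQueue()
queue.push("documents", "http://example.org/docs/1")
queue.push("documents", "http://example.org/docs/1")  # same document changed again
queue.push("documents", "http://example.org/docs/2")
batch = queue.drain()
```

Three pushes result in two updates in the batch. The longer the interval, the more such repeats are absorbed.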
This section describes the REST API provided by mu-search.
In order to take access rights into account, each request requires the MU_AUTH_ALLOWED_GROUPS
and MU_AUTH_USED_GROUPS
headers to be present.
Endpoint to search the given :type
index. The request format is JSON-API compliant and intended to match the request format of mu-cl-resources. Search filters are passed using query params.
A subset of the Elasticsearch Query DSL is supported via the filter, page, and sort query parameters. More complex queries should be sent via the POST /:type/search endpoint.
To search for documents on all fields:
GET /documents/search?filter[_all]=fish
To search for documents on the field name:
GET /documents/search?filter[name]=fish
To search for documents on multiple fields, combined with 'OR':
GET /documents/search?filter[name,description]=fish
To search for documents by their URI:
GET /documents/search?filter[:uri:]=http://data.semte.ch/documents/c020b82b-61f6-4264-93c5-aba0d09812d3
To search on a field that indexes a file, a specific property of the resulting attachment object must be specified as filter key using dot notation.
Currently the following properties are available on an attachment object:
- content : text content of the file
For example, for a property attachment indexing a file, searching the content of the file is done using the following filter query:
GET /documents/search?filter[attachment.content]=Adobe
More advanced search options, such as term, range and fuzzy searches, are supported via flags. Flags are written in the filter key between colons (:), before the field name(s). E.g. the term search flag looks as follows:
GET /documents/search?filter[:term:tag]=fish
The following sections list the flags that are currently implemented:
- :id: : Filter documents by their uuid. Multiple values should be comma-separated, such as filter[:id:]=c9e0fe90-3785-4221-9c4b-bda70bd8d83b,e8cbc03a-97e0-4b97-931b-97caa720db14
- :uri: : Filter documents by their URI. Multiple values should be comma-separated.
- :term: : Term query
- :terms: : Terms query, terms should be comma-separated, such as: filter[:terms:tag]=fish,seafood
- :prefix: : Prefix query
- :wildcard: : Wildcard query
- :regexp: : Regexp query
- :fuzzy: : Fuzzy query with fuzziness set to "AUTO" and allowing to match multiple fields.
- :gt:, :lt:, :gte:, :lte: : Range query
- :lt,gt:, :lte,gte:, :lt,gte:, :lte,gt: : Combined range query, range limits should be comma-separated, such as: GET /documents/search?filter[:lte,gte:importance]=3,7
- :has: : Filter on documents having any value for the supplied field. To enable the filter, its value must be t. E.g. filter[:has:translation]=t.
- :has-no: : Filter on documents not having a value for the supplied field. To enable the filter, its value must be t. E.g. filter[:has-no:translation]=t.
- :phrase: : Match phrase query
- :phrase_prefix: : Match phrase prefix query
- :query: : Query string query
- :sqs: : Simple query string query
- :common: : Common terms query. The flag takes additional options cutoff_frequency and minimum_should_match appended with commas, such as :common,{cutoff_frequency},{minimum_should_match}:{field}. The cutoff_frequency can also be set application-wide in the configuration file.
- :fuzzy_phrase: : A fuzzy phrase query based on span_near and span_multi. See also this Stack Overflow issue or the code.
Currently searching on multiple fields is only supported for the following flags:
- :phrase:
- :phrase_prefix:
- :fuzzy:
Multiple filter parameters are supported.
Examples
GET /documents/search?filter[:common:description]=a+cat+named+Barney
GET /documents/search?filter[:common,0.002:description]=a+cat+named+Barney
GET /documents/search?filter[:common,0.002,2:description]=a+cat+named+Barney
GET /documents/search?filter[:sqs:name]=Barney&filter[:has:address]=t
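Filter keys follow a regular pattern (optional flag between colons, then comma-separated field names), so they can be assembled programmatically. The sketch below is a client-side illustration only: the filter_param and search_path helpers are hypothetical and not part of mu-search.

```python
from urllib.parse import quote_plus


def filter_param(fields, value, flag=None):
    """Build one filter[...] query parameter.

    fields: list of field names (may be empty, e.g. for the :id: flag)
    flag:   flag name including any options, e.g. "term" or "common,0.002"
    """
    key = f":{flag}:" if flag else ""
    key += ",".join(fields)
    # Keep commas readable; spaces become '+' as in the examples above
    return f"filter[{key}]={quote_plus(value, safe=',')}"


def search_path(type_path, *params):
    """Join query parameters into a search request path for the given type."""
    return f"/{type_path}/search?" + "&".join(params)


path = search_path(
    "documents",
    filter_param(["name"], "Barney", flag="sqs"),
    filter_param(["address"], "t", flag="has"),
)
```

Here path equals /documents/search?filter[:sqs:name]=Barney&filter[:has:address]=t, combining two filter parameters in one request.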
Sorting is specified using the sort query parameter, providing the field to sort on and the sort direction (asc or desc). Multiple sort query parameters may be provided.
GET /documents/search?filter[name]=fish&sort[priority]=asc&sort[budget]=desc
Flags can be used to specify Elasticsearch sort modes to sort on multi-valued fields. The following sort mode flags are supported: :min:, :max:, :sum:, :avg:, :median:.
GET /documents/search?filter[name]=fish&sort[:avg:score]=asc
Note that sorting cannot be done on text fields, unless fielddata is enabled (not recommended). Keyword and numerical data types (declared in the type mapping) are recommended for sorting.
Pagination is specified using the page[number] and page[size] query parameters:
GET /documents/search?filter[name]=fish&page[number]=2&page[size]=20
The page number is zero-based.
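Because the page number is zero-based, page[number]=2 with page[size]=20 refers to the third page. A minimal client-side sketch, with hypothetical helper names, makes the arithmetic explicit:

```python
def page_query(number, size):
    """Build the pagination query parameters for a zero-based page number."""
    if number < 0 or size <= 0:
        raise ValueError("page number must be >= 0 and page size must be > 0")
    return f"page[number]={number}&page[size]={size}"


def item_range(number, size):
    """Return the zero-based (first, last) item positions covered by a page."""
    first = number * size
    return first, first + size - 1


query = page_query(2, 20)        # parameters for the third page
first, last = item_range(2, 20)  # positions of the items on that page
```

For page 2 with size 20, the page covers item positions 40 through 59 of the full result set.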
By default the search endpoint doesn't return exact result counts if the result set contains more than 10,000 items. To enable exact counts, pass count=exact as query param (at the cost of some performance).
Highlighting is specified using the highlight[:fields:] query parameter, which takes a comma-separated list of the fields you want highlighted.
You can use * as field name to highlight all fields.
No settings are currently supported.
See also https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html.
GET /documents/search?filter[:sqs:]=fish&highlight[:fields:]=name,description
GET /documents/search?filter[:sqs:]=fish&highlight[:fields:]=*
When querying multiple indexes (with additive indexes), identical documents may be returned multiple times. Unique results can be assured using Elasticsearch's search result collapsing on the uuid field. The search result collapsing can be toggled using the collapse_uuids query parameter:
GET /documents/search?filter[name]=fish&collapse_uuids=t
Note, however, that the count property in the response still reports the total number of non-unique results.
Accepts a raw Elasticsearch Query DSL as request body to search the given :type index.
This endpoint is mainly intended for testing purposes and for sending more complex queries than can be expressed with the GET /:type/search endpoint.
For security reasons, the endpoint is disabled by default. It can be enabled by setting the enable_raw_dsl_endpoint flag in the root of the configuration file:
{
"enable_raw_dsl_endpoint": true,
"types": [
// definition of the indexed types
]
}
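As an illustration of what such a request body could look like, the sketch below serializes a standard Elasticsearch bool query. It assumes the endpoint accepts a regular Elasticsearch search body with a top-level query key; the field names title and importance are taken from the examples in this document and may not exist in your own type mapping.

```python
import json

# A standard Elasticsearch bool query: full-text match combined with a range filter
raw_query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "fish"}}],
            "filter": [{"range": {"importance": {"gte": 3}}}],
        }
    }
}

# JSON string to send as the body of POST /documents/search
body = json.dumps(raw_query)
```

The authorization headers are still required on this request, as for every other endpoint.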
The admin endpoints can be used to manage the indexes. These endpoints should not be publicly exposed in your application, since they allow 'root' access when no authorization headers are specified on the request.
Updates the index(es) for the given :type. If the request is sent with authorization headers, only the authorized indexes are updated. Otherwise, all indexes for the type are updated.
Type _all will update all indexes.
Invalidates the index(es) for the given :type. If the request is sent with authorization headers, only the authorized indexes are invalidated. Otherwise, all indexes for the type are invalidated.
Type _all will invalidate all indexes.
An invalidated index will be updated before executing a new search query on it.
Note that the search index is only marked as invalid in memory, i.e. the index is removed neither from Elasticsearch nor from the triplestore. Hence, on restart of mu-search, the index will be considered valid again.
Deletes the index(es) for the given :type in Elasticsearch and the triplestore. If the request is sent with authorization headers, only the authorized indexes are deleted. Otherwise, all indexes for the type are deleted.
Type _all will delete all indexes.
A deleted index will be recreated before executing a new search query on it.
Processes an update of the delta-notifier. See delta integration.
Currently only delta format v0.0.1 is supported.
This section gives an overview of all configurable options in the search configuration file config.json. Most options are explained in more depth in other sections.
- (*) persist_indexes : flag to enable the persistence of search indexes on startup. Defaults to false. See persist indexes.
- (*) automatic_index_updates : flag to apply automatic index updates instead of invalidating indexes on receiving deltas. Defaults to false. See delta integration.
- eager_indexing_groups : list of user search profiles (list of authorization groups) to be indexed at startup. Defaults to []. See eager indexes.
- (*) batch_size : number of documents loaded from the RDF store and indexed together in a single batch. Defaults to 100.
- (*) max_batches : maximum number of batches to index. May result in an incomplete index and should therefore only be used during development. Defaults to 1.
- (*) number_of_threads : number of threads to use during indexing. Defaults to 1.
- (*) update_wait_interval_minutes : number of minutes to wait before applying an update. Helps to prevent duplicate updates of the same documents. Defaults to 1.
- (*) common_terms_cutoff_frequency : default cutoff frequency for a Common terms query. Defaults to 0.0001. See supported search methods.
- (*) enable_raw_dsl_endpoint : flag to enable the raw Elasticsearch DSL endpoint. This endpoint is disabled by default for security reasons.
- (*) attachments_path_base : path inside the Docker container where files for the attachment pipeline are mounted. Defaults to /data.
All options prefixed with (*) can also be configured using an UPPERCASED variant as Docker environment variables on the mu-search container. E.g. the batch_size option can be set via the environment variable BATCH_SIZE. Environment variables take precedence over settings configured in config.json.
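The precedence rule can be summarized with a short sketch. This is an illustration only, not mu-search's own code: the resolve_option helper is hypothetical, and it assumes the lookup order is environment variable, then config.json, then the built-in default.

```python
import os


def resolve_option(name, config, default, cast=str, environ=None):
    """Resolve an option: environment variable first, then config, then default.

    name:    option name as used in config.json, e.g. "batch_size"
    config:  dictionary parsed from config.json
    cast:    converts the environment string to the option's type
    environ: mapping to read variables from (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    env_name = name.upper()  # batch_size -> BATCH_SIZE
    if env_name in environ:
        return cast(environ[env_name])
    return config.get(name, default)


config = {"batch_size": 50}
from_env = resolve_option("batch_size", config, 100, cast=int, environ={"BATCH_SIZE": "200"})
from_config = resolve_option("batch_size", config, 100, cast=int, environ={})
from_default = resolve_option("batch_size", {}, 100, cast=int, environ={})
```

In this example the environment value (200) wins over the configured value (50), which in turn wins over the default (100).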
In development mode (setting the environment variable RACK_ENV to development), the application will listen for changes in config.json. Any change will trigger a complete reload of the full application, including deleting existing indexes and building any default indexes specified in eager indexing. This behaviour overrules the persist_indexes flag.
Log messages are logged in a specific scope. A different log level can be configured per scope via environment variables named LOG_SCOPE_{scopeName}.
E.g.
search:
  environment:
    LOG_SCOPE_TIKA: "warn"
    LOG_SCOPE_DELTA: "debug"
The following scopes are known:
- SETUP: system setup and initialization (default: info)
- INDEX_MGMT: creation, updates and deletion of indexes (default: info)
- INDEXING: indexing of documents (default: info)
- SEARCH: execution of search queries (default: warn)
- TIKA: extraction and indexing of file content using Tika (default: warn)
- ELASTICSEARCH: all communication with Elasticsearch (default: error)
- SPARQL: all communication with the database (default: warn)
- AUTHORIZATION: incoming access rights on requests (default: warn)
- DELTA: handling of incoming deltas (default: info)
- UPDATE_HANDLER: processing of the updates triggered by deltas (default: info)
The same log levels as the mu-ruby-template are available:
- debug
- info
- warn
- error
- fatal
This section gives an overview of the options that can only be configured via environment variables. Options that can also be configured in the config.json file are not repeated here.
- MAX_REQUEST_URI_LENGTH : maximum length of an incoming request URL. Defaults to 10240.
- MAX_REQUEST_HEADER_LENGTH : maximum length of the headers of an incoming request. Defaults to 1024000.
- MAXIMUM_FILE_SIZE : maximum size in bytes of files to extract and index content from. Defaults to 209715200.
- ELASTIC_READ_TIMEOUT : timeout in seconds of requests to Elasticsearch. Defaults to 180.
The mu-semtech/search-elastic-backend is a custom Docker image based on the official Elasticsearch image. Providing a custom image allows better control over the version of Elasticsearch (currently v7.2.0) used in combination with the mu-search service.
The custom image also makes sure the required Elasticsearch plugins, such as the ingest-attachment plugin, are pre-installed, making the integration of mu-search in your stack a lot easier.
Access rights are determined according to the contents of two headers, MU_AUTH_ALLOWED_GROUPS and MU_AUTH_USED_GROUPS.
Currently, a separate Elasticsearch index is created for each combination of document type and authorization group.
[To be completed...]