This page describes how Lucene and the Core API are connected, and how one might move Lucene onto a dedicated indexing machine that serves one or more Core tenants.
The indexer listens for change events from the API via subscriptions to the pubsub bus and uses those events to pull the relevant data from the API for indexing. The events themselves do not carry the data; they merely tell the indexer what to retrieve. Since Lucene may be asked to index a lot of data, events are persistently queued so that a restart does not lose track of what still needs to be indexed. Indexing itself uses the Lucene engine to decompose the data into a search-optimized form and write it to its own storage files. For both queueing and storage, Lucene requires physical disk I/O: it relies on file locking and random read/write access, so storage cannot be abstracted behind S3 or a similar service.
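The notification-only pattern described above can be sketched as follows. This is a minimal illustration, not the LuceneService code: `ChangeEvent`, `Indexer`, the `fetch` callback, and the `"update"`/`"delete"` action names are all hypothetical, and a plain dictionary stands in for the Lucene index.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ChangeEvent:
    """A notification only: it names what changed and carries no content."""
    resource_id: str  # hypothetical identifier, e.g. "page/42"
    action: str       # hypothetical values: "update" or "delete"


class Indexer:
    def __init__(self, fetch: Callable[[str], str]):
        # `fetch` stands in for a request to the API that returns the
        # current content of a resource.
        self._fetch = fetch
        self.index: Dict[str, str] = {}  # stand-in for the Lucene index

    def handle(self, event: ChangeEvent) -> None:
        if event.action == "delete":
            self.index.pop(event.resource_id, None)
            return
        # The event holds no data, so pull the current state from the API.
        self.index[event.resource_id] = self._fetch(event.resource_id)


# Usage: a dictionary plays the role of the API's data store.
api_store = {"page/42": "hello"}
indexer = Indexer(fetch=api_store.__getitem__)
indexer.handle(ChangeEvent("page/42", "update"))
```

Because the indexer always fetches the current state, a delayed or repeated event still indexes the latest version of the resource.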
Each indexing event received by the LuceneService is persisted into a queue and also added to an accumulation queue. The accumulation queue combines events for the same resource into a single event, so that a page changed several times within the accumulation period is re-indexed once rather than over and over. An event is not removed from the persistent queue until it has been dispatched by the accumulation queue and the indexer has either processed it to completion or rejected it after retries. Should the indexer be restarted or crash, the persistent queue recovers all unprocessed items.
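A rough sketch of the accumulate-then-dispatch behavior, under simplifying assumptions: it persists the set of pending resource ids to a JSON file rather than journaling every event, it is single-threaded, and it leaves out the retry and rejection logic. The class name, the file format, and the explicit `now` parameter (used instead of a real clock to keep the example deterministic) are all hypothetical.

```python
import json
import os
from typing import Callable, Dict, List


class AccumulatingQueue:
    def __init__(self, journal_path: str, window_seconds: float):
        self._path = journal_path
        self._window = window_seconds
        self._due: Dict[str, float] = {}  # resource id -> earliest dispatch time
        # Recover anything still pending when the process last stopped;
        # recovered items are due immediately.
        for resource_id in self._load():
            self._due[resource_id] = 0.0

    def _load(self) -> List[str]:
        if not os.path.exists(self._path):
            return []
        with open(self._path, "r", encoding="utf-8") as f:
            return json.load(f)

    def _save(self) -> None:
        # Write to a temp file and rename, so a crash never leaves a
        # half-written journal behind.
        tmp = self._path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(sorted(self._due), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self._path)

    def enqueue(self, resource_id: str, now: float) -> None:
        # Repeated events for one resource collapse into one pending entry.
        if resource_id not in self._due:
            self._due[resource_id] = now + self._window
            self._save()

    def dispatch(self, now: float, process: Callable[[str], None]) -> int:
        ready = [r for r, due in self._due.items() if due <= now]
        for resource_id in ready:
            process(resource_id)        # if this raises, the entry stays queued
            del self._due[resource_id]  # removed only after processing finished
            self._save()
        return len(ready)
```

Two changes to the same page inside the window produce one `process` call, and an entry that was pending at shutdown is dispatched again after a restart because it is still in the journal.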
The underlying queue can only be used by a single process at a time and locks its file accordingly. While it is possible to move the queue to shared storage, it can never be used by multiple indexers at the same time.
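The single-owner constraint can be illustrated with a small sketch. The page does not say how the queue locks its file, so the lock-file approach below is an assumption chosen for portability; `QueueLock` and the `.lock` suffix are hypothetical names.

```python
import os


class QueueLock:
    """Marks exclusive ownership of a queue file with a sibling lock file."""

    def __init__(self, queue_path: str):
        self._lock_path = queue_path + ".lock"  # hypothetical naming
        self._fd = None

    def acquire(self) -> bool:
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # so only one owner can succeed.
            self._fd = os.open(
                self._lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY
            )
        except FileExistsError:
            return False
        os.write(self._fd, str(os.getpid()).encode())
        return True

    def release(self) -> None:
        if self._fd is not None:
            os.close(self._fd)
            os.remove(self._lock_path)
            self._fd = None
```

A second indexer pointed at the same queue path gets `False` from `acquire()` until the first one releases. A lock file of this kind is left behind if the owner crashes, which is one reason a real implementation would more likely rely on an OS-level file lock.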
Lucene itself uses a directory to manage a number of files for the indexed data. Lucene's locking makes it possible to have multiple writers and readers, i.e. the index can live on a shared disk, but this is inefficient, since the indexer has to be constantly opened and closed. Alternatively, a single writer with many readers could be set up to add more CPU power for searching.
Each tenant (API instance) receives its own queue and index storage. This means that for multi-tenant installs it should be possible to have a separate physical indexer for each tenant. However, the location of the Lucene service is a per-host config rather than a per-tenant config, so some changes would be required to make this possible.
By default, each API server has its own copy of LuceneService. Until 10.0, when using multiple servers, the index could be stored on a shared disk. This had performance implications, both because of lock contention on the shared disk and because indexers were constantly opened and closed. In 10.0 the performance issues were removed, but as a side effect the lock is now held by the first instance to grab it.
For this reason, and because it is generally more performant anyway, multi-server setups should use a single dedicated LuceneService instance. This is supported by providing a URI to the Lucene service in the API config. If a URI is provided, the API does not start its own Lucene service; instead it pings the remote Lucene service to subscribe it to the API's pubsub service, so that change events are properly propagated. Note: this behavior has not been properly tested.
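The startup decision described above amounts to a simple branch. In this sketch the config key `lucene-uri` and the two callbacks are hypothetical stand-ins; the page does not give the actual config key name or the calls involved.

```python
from typing import Callable, Mapping


def start_indexing(
    config: Mapping[str, str],
    start_local: Callable[[], None],
    subscribe_remote: Callable[[str], None],
) -> str:
    """Return "remote" or "local" depending on which path was taken."""
    uri = config.get("lucene-uri")  # hypothetical key name
    if uri:
        # A dedicated Lucene service exists: ask it to subscribe to this
        # API's pubsub instead of starting a local copy.
        subscribe_remote(uri)
        return "remote"
    # No URI configured: fall back to the default per-server service.
    start_local()
    return "local"
```

Every API server in a multi-server setup would carry the same URI, so all of them register the one dedicated indexer as a subscriber.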
The current implementation of PubSub is fire-and-forget and memory-only. If you are not listening when an event happens, or if the server is restarted while events are still being processed through the pubsub system, those events will never arrive at the subscriber. In general this is fine, because events are more like UI events: notifications that something happened, not carriers of data that exists only in the message. However, it might be desirable for pubsub to be a persistent message bus that stores messages for a listener that drops off and comes back later, and that guarantees delivery, rather than relying on the time between arrival and dispatch being short enough that a restart is unlikely to catch a message mid-dispatch.
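To make the fire-and-forget limitation concrete, here is a minimal in-memory bus. It is an illustration of the general pattern only, not the actual PubSub implementation; the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class InMemoryPubSub:
    def __init__(self) -> None:
        # Subscriptions live only in memory; nothing is written to disk.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        """Deliver to whoever is listening right now; return the count."""
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(message)
        # The message is not stored: with no listeners it is simply dropped.
        return len(handlers)
```

A message published before `subscribe` is called is never seen by that subscriber. A persistent bus would instead store undelivered messages per subscriber and replay them on reconnect.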
Copyright © 2011 MindTouch, Inc.