Property Search

    Introduction

    Page properties have been available in the core product since the Lyons (9.02) release; however, there is no mechanism to search through their values. Extending the MindTouch API to index and search property content builds on current features and capabilities and opens the door to new applications.

    Intended Audience

    Developers, developers, developers! ... and some UI components for the end user.

    Status

    Initial implementation included with 9.12.0.

    This spec explains the current implementation and short-term improvements.

    Functional Specification

    Users are able to save page metadata into page properties and then use search to find pages with metadata matching a variety of supported criteria. 

    Only properties of type text/plain are supported.

    Use Cases and Query Examples

    Use page properties like tags

    Tags are the easiest way to aggregate a group of pages; however, they are easily edited by users viewing the pages. Aggregating pages in a tag-like manner via page properties would prevent users from altering the properties. It would also hide the aggregation mechanism from plain sight.

    Find pages that have a property named 'sales' with this query:

    property:#sales

    Search for pages by a property completely matching a certain value

    Users can locate all pages that have a certain property with a known value with the following query:

    #reviewedby:"maxm"

    Search for pages by a property containing a certain value

    A property can have zero, one, or more text values separated by a delimiter. To find pages where this property contains at least one known value, use a query like this:

    #reviewedby:maxm

    Note that there are no quotes around maxm, indicating that it is a token that must exist within the property value, though the value may contain other tokens.

    Search for pages by a property with logical criteria

    Boolean operators can be combined to perform complex searches of page properties.

    +#reviewedby:maxm -#reviewedby:royk +#status:"approved"

    This can also be rewritten as

    #reviewedby:(+maxm -royk) AND #status:"approved"
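    The boolean forms above can also be assembled programmatically. The following is a minimal sketch, not part of the MindTouch API: the helper names property_clause and combine are hypothetical, and no escaping of special characters inside values is attempted.

```python
def property_clause(name, value, exact=False):
    """Return a Lucene clause for a custom property field such as #reviewedby."""
    # Quotes request a match on the complete value; a bare term matches one token.
    term = '"{}"'.format(value) if exact else value
    return "#{}:{}".format(name, term)


def combine(required=(), prohibited=()):
    """Join clauses with Lucene's + (must match) and - (must not match) prefixes."""
    parts = ["+" + clause for clause in required]
    parts += ["-" + clause for clause in prohibited]
    return " ".join(parts)


query = combine(
    required=[
        property_clause("reviewedby", "maxm"),
        property_clause("status", "approved", exact=True),
    ],
    prohibited=[property_clause("reviewedby", "royk")],
)
print(query)  # +#reviewedby:maxm +#status:"approved" -#reviewedby:royk
```

    The clause order differs from the hand-written query, but the + and - prefixes carry the same meaning regardless of order.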

    Search for pages by a property value within a range

    Range searches find pages with a property value that falls between a minimum and a maximum bound.

    #rating:[3 TO 5]

    This matches all pages that have a 'rating' property with a value of 3, 4, or 5. Range searches are done lexicographically (in dictionary order), so properties intended for numeric range searches need their numbers zero-padded to a fixed number of digits, with the most significant digit first. For example, with two digits, 3 is stored as 03 so that it sorts before 10.
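    To illustrate why padding matters for lexicographic ranges, here is a small sketch. The helper names and the default width of 4 are assumptions chosen for the example; the stored property values would need the same padding as the query bounds.

```python
def pad_number(value, width=4):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if value < 0:
        raise ValueError("negative numbers do not sort correctly with zero padding")
    return str(value).zfill(width)


def rating_range_query(low, high, width=4):
    """Build a range query over a 'rating' property with padded bounds."""
    return "#rating:[{} TO {}]".format(pad_number(low, width), pad_number(high, width))


# Unpadded strings sort in dictionary order, so "10" lands before "3".
assert sorted(["3", "10", "5"]) == ["10", "3", "5"]
# Padded to a fixed width, dictionary order and numeric order agree.
assert sorted(pad_number(n) for n in (3, 10, 5)) == ["0003", "0005", "0010"]

print(rating_range_query(3, 5))  # #rating:[0003 TO 0005]
```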

    Search for pages by a property containing a timestamp

    Another application of a range search is chronotagging. Pages describing events that happen on a given date can be found with a range search like this:

    #event_date:[20091225 TO 20100125]

    This finds pages with an 'event_date' property whose value falls between December 25, 2009 and January 25, 2010, inclusive.
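    A date range like the one above can be generated from date objects. This is a sketch only: the function names are hypothetical, and it assumes dates are stored in the property as yyyymmdd strings, which keeps dictionary order and chronological order in agreement.

```python
from datetime import date


def date_token(day):
    """Format a date as yyyymmdd so dictionary order matches chronological order."""
    return day.strftime("%Y%m%d")


def event_range_query(start, end, prop="event_date"):
    """Build an inclusive range query over a date-valued property."""
    if start > end:
        raise ValueError("start must not be after end")
    return "#{}:[{} TO {}]".format(prop, date_token(start), date_token(end))


print(event_range_query(date(2009, 12, 25), date(2010, 1, 25)))
# #event_date:[20091225 TO 20100125]
```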

    Search for users by a property

    Users can be found based on their user properties in the same ways as pages. 

    Non-goals

    Only text/plain properties are searchable. 

    Technical Specification

    UI requirements

    No UI changes are needed in order to support page property search. User property search requires search results to render a user result differently than a page result.  

    API requirements

    As with pages, attachments, comments, and tags, properties are indexed by Lucene. This allows property queries to be combined with existing queries. Since properties are essentially key/value pairs associated with a resource, they mesh well with Lucene, which already treats all resources as a set of key/value pairs.

    Refer to Lucene's Query Syntax guide for examples.
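    When a property query is sent over HTTP, the '#' and '+' characters need percent-encoding, since '#' otherwise starts a URL fragment and '+' decodes as a space. The sketch below shows one way to do this. The host name, the endpoint path, and the 'q' parameter name are assumptions for illustration and should be checked against the API reference for your installation.

```python
from urllib.parse import urlencode

# Hypothetical host and endpoint path, used only for illustration.
API_SEARCH = "http://wiki.example.com/@api/deki/site/search"


def search_url(query, base=API_SEARCH):
    """Return a search URL with the Lucene query percent-encoded."""
    # urlencode escapes '#', '+', ':' and quotes so the query survives transport.
    return base + "?" + urlencode({"q": query})


print(search_url("+#reviewedby:maxm"))
# http://wiki.example.com/@api/deki/site/search?q=%2B%23reviewedby%3Amaxm
```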

    Content Type Limit

    Only the contents of properties with a Content-Type of text/plain are indexed. Indexing structured data (XML, JSON, etc.) or binary data (images, octet-stream, etc.) is not useful since it can't be meaningfully searched using Lucene. In the future, additional MIME types such as application/xml, CSV, and JSON may be supported, with indexing done by custom logic for specific document types.
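    The content type check can be pictured as follows. This is an illustrative sketch rather than the actual indexer code; the function name is hypothetical, and ignoring parameters such as charset is an assumption about how the comparison is done.

```python
def is_indexable(content_type):
    """Return True when a property's Content-Type is text/plain."""
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


print(is_indexable("text/plain; charset=utf-8"))  # True
print(is_indexable("application/json"))           # False
```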

    Tag-Like Behavior with Properties

    Although properties are not meant to replace tags, it is possible to find resources based on their association with a given property name independent of the value. All property names of a resource are added to a Lucene field "property" to allow this behavior. For example, property:(+#foo -#bar) will find all pages with a property named foo but without bar. 

    Naming and Inclusion Into the Index

    Lucene indexes documents by a set of named fields. Pages and files already have a set of fields defined, and although Lucene allows duplicate field names within a document, it's important that properties and their values are not added to existing fields. This prevents false positives from being introduced into search results when querying by a certain field such as 'author'.

    Only 'custom' properties are added to the Lucene index (property names starting with urn:custom.mindtouch.com#). These properties are visible and modifiable via the UI (although the namespace isn't displayed). The namespace prefix is removed from the property name, so the indexed field name is '#' followed by the name. For example, a property with the full name urn:custom.mindtouch.com#foo is indexed in Lucene as #foo. This avoids overloading existing fields with values from properties.
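    The naming rules can be summarized in a short sketch. It models a resource's index entry as a list of (field, value) pairs and is not the actual indexer code. The function name is hypothetical, and storing the '#'-prefixed name in the "property" field is inferred from the property:#foo query form described earlier.

```python
CUSTOM_NAMESPACE = "urn:custom.mindtouch.com#"


def index_fields(properties):
    """Map {full property name: text value} to Lucene-style (field, value) pairs."""
    fields = []
    for full_name, value in properties.items():
        if not full_name.startswith(CUSTOM_NAMESPACE):
            continue  # non-custom properties are not indexed
        field = "#" + full_name[len(CUSTOM_NAMESPACE):]
        fields.append((field, value))       # e.g. ("#foo", "some text")
        fields.append(("property", field))  # enables property:#foo lookups
    return fields


props = {
    "urn:custom.mindtouch.com#foo": "some text",
    "urn:other.example.com#bar": "ignored",  # hypothetical non-custom namespace
}
print(index_fields(props))
# [('#foo', 'some text'), ('property', '#foo')]
```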

    Indexing Trigger

    Page Properties

    The Lucene service listens for page change notifications such as those triggered by page property changes. Specifically, the channel

    event://*/deki/pages/dependentschanged/properties/*

    is subscribed to by the Lucene service in order to know when a page needs indexing due to a property change. 

    User properties

    TODO: define/lookup notification channel for user property updates

    Lucene Tokenizing

    Currently, the delimiters are any whitespace character and the comma. All other punctuation is therefore considered part of the token. It's likely that this list of non-token characters will be expanded.

    Multivalue properties

    Since Lucene tokenizes strings as described above, it's possible to find resources by one or more of the tokens in a property's value. For example, if you store the results of a multi-select box in a property, you can find resources that contain (or don't contain) one or more of the values, as long as each value is a token. Currently this means that values must be delimited by whitespace or a comma.
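    The tokenizing and multivalue behavior described above can be approximated outside Lucene. The sketch below is only a model of the stated rules: the function names are hypothetical, and any case normalization the real analyzer may apply is ignored.

```python
import re

# Delimiters per the text above: any whitespace, plus the comma.
_DELIMITERS = re.compile(r"[\s,]+")


def tokenize(value):
    """Split a property value into tokens; other punctuation stays in the token."""
    return [token for token in _DELIMITERS.split(value) if token]


def matches(value, required=(), prohibited=()):
    """Mimic a query such as #colors:(+red -blue) against one property value."""
    tokens = set(tokenize(value))
    has_required = all(term in tokens for term in required)
    has_prohibited = any(term in tokens for term in prohibited)
    return has_required and not has_prohibited


print(tokenize("red, green blue"))  # ['red', 'green', 'blue']
print(tokenize("timothy.high"))     # ['timothy.high']
print(matches("red, green", required=["red"], prohibited=["blue"]))  # True
```

    Because the period is not a delimiter, a value such as timothy.high forms a single token, so a search for only part of it would not match under these rules.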

    FUTURE WORK

    User Properties

    Adding users to Lucene

    Just as pages and files are treated as resources in Lucene, users must become resources as well. This allows user properties to be associated with them and users to be located based on their properties, which contain personal information. As with pages, only custom user properties will be included.

    Visibility of user properties

    Since custom user properties may contain publicly accessed personal information that is indexed, it makes sense for the information to be publicly visible as well. Users with global READ access will have the permissions to see custom user properties of other users. This would allow storage of private information that only you can edit while allowing other properties to be seen and referenced by others (and by applications).

    Attachment Properties

    May be added to the index at a later date.

    Comments
    It would be nice for the property search to index compound items. For example, a page on a company might have a list of the Board of Trustees. There would be a property with some standardized format (JSON was suggested or possibly meta format in the namespace or property name that would specify how to index) that would list the board. A use case for searching would have a user searching the wiki for all companies that "Jay Jameson" served on the board. Company A has Jay Jameson, Jill Jones, and John Jordon. Company B has Gary Gould, Greg Gomez and Jay Jameson. Keeping the board in a single property keeps the user from having to search BoardMember1, BoardMember2, etc. or getting general search results for Company N which Jay Jameson is the founder.

    A DekiScript function could be used to store generic DS variables in the format used for compound properties.

    If this should be a separate spec, I can do that but it's pretty tightly coupled to this one.
    Posted 08:14, 16 Nov 2009
    @maxm can you update this spec with implementation details?
    Posted 09:15, 8 Jan 2010
    Done and Done.
    Posted 17:44, 27 Jan 2010
    p.s. Love the background on the H3 tags :)
    Posted 00:27, 3 Feb 2010
    @maxm and @guerric - was the implementation of properties as described? i'd like to move this into specs (implemented)
    Posted 11:47, 19 Mar 2010
    I'm trying this with my MindTouch installation (9.12.1), and it seems that the exact match works, but not partial matches. More specifically, for a property like:

    owner:"timothy.high"

    The following searches work:
    #owner:"timothy.high"
    #owner:timothy.high

    But not the following:
    #owner:tim
    #owner:high
    #owner:(tim)

    After some serious playing around, I found out that the following works:
    #owner:tim*
    #owner:(tim*)

    But none of the following:
    #owner:*tim*
    #owner:(*tim*)
    #owner:*high
    #owner:(*high)

    What's the deal with the wildcards, and how can I do a partial match on anything but the beginning of the value??
    Posted 06:16, 20 Jul 2010 (edited 06:17, 20 Jul 2010)
    @timothy.high wildcard searches in the start of a search term are now supported in 10.0 but are disabled by default due to performance implications. Arne is going to document this setting soon and I'll update this thread once he does.
    Posted 10:56, 12 Aug 2010

    Copyright © 2011 MindTouch, Inc.