Resource Data Model

    This spec describes a database schema that has been introduced with the Lyons release. This model allows storage and optimized retrieval of resources with support for hierarchies, change tracking, metadata (with revisioning), and content independence.

    Resource Model

    Motivation

    The pre-Lyons storage model for pages and attachments have significant drawbacks that limit the introduction of new features while hampering existing functionality of attachments and change tracking. Although MediaWiki's data model has since evolved and improved, Deki inherited many of the problems by forking from an older release. This model removes the workarounds that stemmed from limitations imposed by MySql 4.x as well as introducing concepts better suited for Deki's use cases.

    Model Design

    The design of the new model is driven by the need to address known issues in the pre-Lyons model, as well as meeting requirements for the architectural vision.

    Know Issues

    • Timestamps are stored using MySql's native format saving space and improving lookup speed
    • Removing statistics tracking from the core pages tables which was causing abysmal query cache effectiveness
    • Eliminate data movement when creating, removing, or restoring a revision. This was causing significantly more complex middle tier code while increasing the cost of making db model changes as changes have to be preserved in three places with all migration queries requiring updates as well.
    • Avoid name collisions on attachments when uploading a file of the same name as one that has been previously removed.

    Design Desiderata

    • Since attachments and pages are essentially the same concepts, use the same model for storing them and eliminate code duplication. Same db queries can be reused for storing and retrieving revisions of pages, attachments, properties, etc. This significantly lowers the bar for porting Deki to other databases.
    • Separate page content from page revisions removes unncessary data duplication in the db in cases when a revision is created but content isn't changed. Furthermore separating revision info from contents allows better usage of MySQL's query cache, faster table reads, and smaller result sets from the db. Page data is only retrieved when it's necessary.
    • Reduce the need to perform data model changes when additional info needs to be stored for a given resource by allowing simple key/value storage in attributes or properties.
    • Eliminate the need for costly table updates and deletes. A single insert code path is used when any revisionable resource is modified. Allow fast and simple lookup to current data by mirroring the head revision of the resource separately from the table containing all revisions.
    • Revision everything! Looking at the history of a resource should show all information about the page at each revision including all elements associated with the resource such as language, name, location, etc. Quite a bit of data is currently overwritten as pages are moved, removed, restored.
    • Allows more than ~(128 - length(filename)) revisions of attachments
    • Support for page, attachment, and user properties

    Model Definition

    Table: resources

    The resources table contains aggregate information such as the type of resource, whether or not it's currently deleted, when it was created, and the current head revision. It also contains a mirror of the head revision for the resource in the resourcerevs table. Note that unlike Deki's and MediaWiki's 'old' table, the resourcerevs table contains the head revision as well. This data duplication exists to simplify and optimize for the  majority of queries where only the current version of the resource is relevant. Other than representing the current state of the resource, this table describes aspects of the resource that are revision-independent such as the type of resource.

    Column
    Type
    Default
    Description
    res_id int(4) unsigned NOT NULL auto_increment primary key
    res_headrev int(4) NOT NULL   current revision number of head
    res_type tinyint(1) unsigned NOT NULL 0 type of resource
    0: undefined
    2: file
    4: property
    res_deleted tinyint(1) unsigned NOT NULL 0 whether or not the resource is currently marked as removed
    res_create_timestamp datetime NOT NULL 0001-01-01 00:00:00 timestamp when the first revision was created
    res_update_timestamp datetime NOT NULL 0001-01-01 00:00:00 timestamp of the current head revision (same as resrev_timestamp)
    res_create_user_id int(4) unsigned NOT NULL 0 user id of who created the first revision
    res_update_user_id int(4) unsigned NOT NULL 0 user id of who created the head revision (same as resrev_user_id)
           
    resrev_rev int(4) NOT NULL   (copy of head revision row, see description below)
    resrev_user_id int(4) unsigned NOT NULL   (copy of head revision row, see description below)
    resrev_parent_id int(4) unsigned NOT NULL   (copy of head revision row, see description below)
    resrev_parent_page_id int(4) unsigned NULL (copy of head revision row, see description below)
    resrev_parent_user_id int(4) unsigned NULL (copy of head revision row, see description below)
    resrev_change_mask tinyint(1) unsigned NOT NULL 0 (copy of head revision row, see description below)
    resrev_name varchar(255) NOT NULL '' (empty string) (copy of head revision row, see description below)
    resrev_change_description varchar(255) NULL (copy of head revision row, see description below)
    resrev_timestamp datetime NOT NULL 0001-01-01 00:00:00 (copy of head revision row, see description below)
    resrev_content_id int(4) unsigned NOT NULL 0 (copy of head revision row, see description below)
    resrev_deleted tinyint(1) unsigned NOT NULL 0 (copy of head revision row, see description below)
    resrev_changeset_id int(4) unsigned NOT NULL 0 (copy of head revision row, see description below)
    resrev_size int(4) unsigned NOT NULL 0 content size in bytes
    resrev_mimetype varchar(255) NOT NULL '' (empty string) (copy of head revision row, see description below)
    resrev_language varchar(255) NULL (copy of head revision row, see description below)
    resrev_meta text NULL (copy of head revision row, see description below)

    Primary: res_id

    Secondary: 

    • transaction: resrev_transaction_id
    • parent_resource: resrev_parent_id
    • parent_page: resrev_parent_page_id
    • parent_file: resrev_parent_file_id
    • parent_user: resrev_parent_user_id

    Table: resourcerevs

    The resourcerevs (resource revisions) table contains information about all revisions of all resources. Revisions of a given resource are related by the resource id.

    Column
    Type
    Default
    Description
    resrev_id int(4) unsigned NOT NULL auto_increment primary key
    resrev_res_id int(4) unsigned NOT NULL 0 resource id that revision applies to; foreign key into resources table
    resrev_rev int(4) NOT NULL 0 revision number; starts at 1 and increments by 1; unique per resrev_res_id
    resrev_user_id int(4) unsigned NOT NULL 0 user who performed the change that caused the revision creation
    resrev_parent_id int(4) unsigned NOT NULL 0 parent resource id
    resrev_parent_page_id int(4) unsigned NULL parent page id; used by resources associated with a page (this column will be dropped as pages are migrated)
    resrev_parent_user_id int(4) unsigned NULL parent user id; used by resources associated with a user such as properties
    resrev_change_mask tinyint(1) unsigned NOT NULL 0 bitfield describing which columns were changed:
    0: none
    1: content
    2: name
    4: language
    8: meta
    16: deleteflag
    32: parent
    resrev_name varchar(255) NOT NULL '' (empty string) the name of the resource, such as filename or property key
    resrev_change_description varchar(255) NULL human readable description of operation that created this revision
    resrev_timestamp datetime NOT NULL 0001-01-01 00:00:00 timestamp when revision was created
    resrev_content_id int(4) unsigned NOT NULL 0 foreign key to resourcecontents table
    resrev_deleted tinyint(1) unsigned NOT NULL 0 whether or not this resource is marked as deleted
    resrev_changeset_id int(4) unsigned NOT NULL 0 foreign key to change-set which groups up multiple specific changes for an operation
    resrev_size int(4) unsigned NOT NULL 0 content size in bytes
    resrev_mimetype varchar(255) NOT NULL '' (empty string) content MIME type
    resrev_language varchar(255) NULL language code following standard language-culture convention
    resrev_meta text NULL xml describing custom generated attributes for the revision such as attachment image width/height

    Primary: resrev_id

    Unique: resid_rev, (resrev_res_id, resrev_rev)

    Table: resourcecontents

    The resource contents table stores either the content for a resources revision or a description for how to get to it outside of the db.

    Column
    Type
    Default
    Description
    rescontent_id int(4) unsigned NOT NULL auto_increment primary key
    rescontent_res_id int(4) unsigned NULL resource id that content applies to; foreign key into resources table
    rescontent_res_rev smallint(2) unsigned NULL revision number when content was added
    rescontent_value mediumblob   binary resource contents; empty when rescontent_location is used
    rescontent_mimetype varchar(255) NOT NULL '' (empty string) content MIME type
    rescontent_location varchar(255) NULL content location; NULL when rescontent_value is used
    rescontent_size int(4) unsigned NOT NULL 0 content size in bytes when known, otherwise NULL

    Primary: rescontent_id

    Unique: rescontent_res_id, (rescontent_res_id, rescontent_res_rev)

    Deleted resources

    A resource is marked as deleted with the res_deleted flag being set to 1. If a resource is deleted the only valid next revision would be to restore the resource. This allows the assumption that if the resource is marked as deleted, it's always the current head revision that performed the deletion and the user id and timestamp of the deletion are stored within that head revision.

    Future Changes

    • As pages get migrated to resources, the column resrev_parent_page_id will be dropped with the code simplified accordingly. In it's place, the resrev_parent_id column will be used to refer to a page parent.

    Permissions/ACL Integration

    Currently permissions are coupled with pages in two ways. Grants (user and group) reference pageids and restrictions are stored in the page table as a restrictionid.  As part of the page migration, restrictions shouldn't be migrated into resources schema but should instead become auxiliary. Keeping permissions outside of resources would have a number of benefits

    • Promotes a more modular design allowing systems that can be decoupled to evolve on their own with less friction
    • Improves performance of permission lookups by reducing complexity of queries and allowing better cacheability of smaller specialized tables
    • Eliminate costlier updates on resources table when permissions are modified

    Statistics Integration

    Statistics tracking of pages is currently done in the pages table. This also needs to be decoupled into its own tables for most of the same reasons as permissions.

    Name Resolution (was "Titles/URIs to resources and redirects")

    Deficiencies of current design:

    • Page titles are currently stored with pages meaning that any hierarchy operations operate directly on pages.

    Benefits of current design:

    • Creating/editing a redirect is subsumed by the

    Desiderata:

    • Decouple name resolution from resource storage
      • allows a resource to be known by multiple visible names (aliasing)
      • a "redirect" is just a type of alias that is invisible to navigation queries (concept: ghost alias?)
      • less data duplication since titles would not be stored with every single resource and resource revision
    • Avoid multiple rows being affected by tree rename operations
      • Only update a fixed number of rows when a tree is being moved
    • Efficiently query for sub-trees for a given name
    • Preserve a history of name creation/modification/deletion operations just like for resources
    • Intuitive mapping of name editing into the UI (i.e. don't require an additional UI component for it)
    • Redirects may need to be taggable
    • Broaden name resolution to apply to any concept in the wiki
      • User:Steveb -> maps to a user resource
      • Group:Admins -> maps to the admins group
      • Special:Search -> maps to special search page (should really not be special in the future though)
      • :Sandbox/Test -> maps to page with path "Sandbox/Test"
      • Talk:Sandbox/Test -> maps to talk page with path "Sandbox/Test"

    Schema (work-in-progress)

    Proposed title table columns: (very work in progress)

    • title_id
    • title_text
    • title_text_code
    • title_redirect_to_resource_id (nullable)
    • ?? title_parent_title_id

    Resources has a title_id foreign key.

    Redirect point to resources with the title_redirect_to_resource_id column.

    Title table is only updated for redirects otherwise no entries are ever removed or updated.

    TODO: Look at other name resolution mechanisms such as XRDS for inspiration.

    Tag page
    You must login to post a comment.

    Copyright © 2011 MindTouch, Inc. Powered by