
The purpose of the page_children and page_level is to provide a fast hierarchy lookup from both ancestor to all children and child to all parents. It's represented in new tables, in denormalized form, so as to not affect any existing data structure, but server purely as an additional reference.
page_children contains a record for every two pages that have a straight line relationship in the page tree. This means that the deeper a page lives in the hierarchy the more lookup records it requires, i.e. the storage overhead for any one page in the lookup tables is
number of records = page level
Since most hierarchies tend to be more wide than deep this is not a significant problem. Even for deeper hierarchies, page_children basically being a two way lookup still does not create a significant burden. The storage for any page 10 levels deep may be ten-fold, but it is still just two integers per level. The lookup burden, being pure integer lookups will always result in an index lookup. Even lookup maintenance will result in index lookups, rather than scans.
page_level could just as well be another column in pages, but keeping it separate allows creation of the lookup without changing the existing schema and lookup of page paths without having to join against pages.
As stated above the, each page requires as many entries in page_children as it has parents. For a site with n pages, the contrived and absolutely worst case scenario is a hierarchy that is n levels deep and results in a number represented by
I.e. for 100,000 pages, this would result in ~700 million rows. However, most document hierarchies are unlikely to exceed an average depth of 5 (an informal survey of this site didn't find any hierarchies deeper than 4, so the average would be around 2 - 3), which would only result in 500,000 rows for 100,000 pages.
Selecting all children of any page is accomplished via:
select page_child_id
from page_children
where page_id = {page_id};
Retrieving the page parents (including depth of each ancestor) is accomplished via:
select pc.page_id, pl.level
from page_children as pc
left join page_level as pl
on pc.page_id = pl.page_id
where pc.page_child_id = {page_id}
order by level;
Creating a new page requires knowning the parent_page_id and page_id:
insert into page_level
select {page_id},[level]+1
from page_level
where page_id = {parent_page_id};
insert into page_children values({parent_page_id},{page_id});
insert into page_children
select page_id,{page_id} as page_child_id
from page_children where page_child_id = {parent_page_id};
coming soon
This example assumes that the page has no children:
delete from page_level where page_id = {page_id};
delete from page_children where page_child_id = {page_id};
For the complete scenario, all children of the page need to be selected, ordered by level and the above deletes cascaded for all children, deepest level first.
Since it is denormalized data, it needs to be updated in sync with page creates, moves and deletes. I.e. all code that changes a page's place in the hierarchy need to lock page_children and page_level and perform the appropriate maintenance in sync. The assumptions for this system are:
Given these assumptions the burden of data maintenance falls on the write actions, allowing lookups to be cheap and easy.
There is nothing in either page_children or page_level that cannot be derived from the titles in pages (as is done right now by string comparison). This means that both a back-fill sync script can easily be constructed, as well as a consistency checker. Both can be used during development and testing to make sure that we do not have any stray SQL that changes the page hierarchy without also updating the hierarchy lookups.
| File | Version | Size | Modified | |
|---|---|---|---|---|
| ||||
| Images 1 | ||
|---|---|---|
Page Hierarchy Schemapagehierarchy.gif | ||
Copyright © 2011 MindTouch, Inc. Powered by