
TableParse: parse HTML tables into DekiScript data

    This extension is now installed on this wiki, so you can use it here if you like.  Although the extension works on 9.02, version 9.08 or higher is recommended because xml.text() behaves better there.

    Discuss this extension on the forums.

     

    Author Neil Weinstock
    Type Script
    Categories Tables
    Requires MindTouch Core 1.8.3 or later
    Status stable
    License Free/Open Source
    Manifest http://developer.mindtouch.com/@api/deki/files/4979/=tableparse.xml

     




    See also How to add a script, Using the Extension Dialog, Learn about DekiScript, Extensions Directory.



    Version History

    Version  Date         Author  Description
    0.9.0    9-Dec-2009   neilw   First public beta version
    1.0.0    6-Jan-2010   neilw   Official Release
                                  • Added "xpath" arg to each function
                                  • Implemented "format" function
    1.1.0    29-Jan-2010  neilw   Added "RecordList" function

    Usage Examples

    We'll use the following table for our examples (this table has id="ex1"):

    Name   TableParse
    Description A little slice of fun

    A very important few words on XPath behavior

    This applies to all of the functions.

    Usually, I expect you'll pass the function the page XML containing the table, omit the XPath, and let TableParse find the first table.

    If the table you want to parse might not be the first table in the page XML, then you'll want to provide an XPath argument.  The XPath must specify the exact table you want to parse.

    If you want to simply pass the table XML to the function, then you must provide an XPath of ".".  This is because, in the absence of an XPath argument, the functions look for a table inside the given XML.  If the XML is the table, then the functions will fail to see it.  If you think this sounds pretty stupid, you're right, and I hope to resolve this as soon as possible.  If you'd like to contribute your XPath expertise to help me fix this, stop by the forums.

    So, to summarize, you can pass the functions any of the following:

    1. XML containing one or more tables, and no XPath argument.  The functions will find and use the first table found inside the XML.
    2. XML containing one or more tables, and an XPath argument specifying exactly which one.
    3. XML that is the table to be parsed, and an XPath of ".".
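
The three cases above can be modeled outside the wiki. The following is a minimal Python sketch using the standard library's ElementTree, not the extension's DekiScript source; `PAGE` and `find_table` are hypothetical names, and the sketch assumes that the default lookup searches only the descendants of the XML it is given. Note that ElementTree wants a relative path (`.//table`), whereas the DekiScript examples on this page use `//table`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page markup: two tables, the second one has id="ex1".
PAGE = ET.fromstring(
    "<body>"
    "<table id='other'><tr><td>first</td></tr></table>"
    "<table id='ex1'><tr><td>Name</td><td>TableParse</td></tr></table>"
    "</body>"
)

def find_table(xml, xpath=None):
    """Model of the lookup: with no XPath, search inside xml for a table."""
    if xpath is None:
        return xml.find(".//table")  # descendants only, never xml itself
    return xml.find(xpath)

# 1. Containing XML, no XPath: the first table wins.
case1 = find_table(PAGE).get("id")

# 2. Containing XML plus an XPath naming the exact table.
case2 = find_table(PAGE, ".//table[@id='ex1']").get("id")

# 3. The table itself: "." is required, because the default search
#    looks only at descendants and therefore misses the table.
table = PAGE.find(".//table[@id='ex1']")
case3_default = find_table(table)
case3_dot = find_table(table, ".")

print(case1, case2, case3_default, case3_dot is table)
```

In this model the third case returns nothing without an XPath and returns the table itself with ".", which matches the behavior described above.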

    TableParse.full() Examples

    Code Output Notes
    {{
      tableParse.full(page.xml, "text",
        "//table[@id='ex1']")
    }}
    [ [ "Version", "Date", "Author", "Description" ], [ "0.9.0", "9-Dec-2009", "neilw", "First public beta version" ], [ "1.0.0", "6-Jan-2010", "neilw", "Official Release Added \"xpath\" arg to each function Implemented \"format\" function" ], [ "1.1.0", "29-Jan-2010", "neilw", "Added \"RecordList\" function" ] ]
    • Here we provide XPath to be sure we pick the correct table
    • The return value is a list of rows, each of which is a list of cell contents.
    • This demonstrates the default "text" return type
    • Note that text is trimmed, so leading space on "TableParse" is removed
    • On 9.02, the bottom right corner would have yielded only "A little slice of"
    {{
      tableParse.full(page.xml, "xml", 
        "//table[@id='ex1']")
    }}
    [ [ <html><body><strong>("Version"; <br>nil</br>; " ")</strong></body></html>, <html><body><strong>("Date"; <br>nil</br>; " ")</strong></body></html>, <html><body><strong>("Author"; <br>nil</br>; " ")</strong></body></html>, <html><body><strong>("Description"; <br>nil</br>; " ")</strong></body></html> ], [ <html><body>"0.9.0"</body></html>, <html><body>"9-Dec-2009"</body></html>, <html><body>"neilw"</body></html>, <html><body>"First public beta version"</body></html> ], [ <html><body>"1.0.0"</body></html>, <html><body>"6-Jan-2010"</body></html>, <html><body>"neilw"</body></html>, <html><body>(" "; <p>"Official Release"</p>; " "; <ul>(" "; <li>"Added \"xpath\" arg to each function"</li>; " "; <li>"Implemented \"format\" function"</li>; " ")</ul>; " ")</body></html> ], [ <html><body>"1.1.0"</body></html>, <html><body>"29-Jan-2010"</body></html>, <html><body>"neilw"</body></html>, <html><body>"Added \"RecordList\" function"</body></html> ] ]
    • This demonstrates the "xml" return type
    • Note that returned XML is always wrapped in <html><body>.  The wrapper ensures that the returned XML is well-formed, and it is enforced by DekiScript.
    {{
      tableParse.full(page.xml, "both",
        "//table[@id='ex1']")
    }}
    [ [ { text : "Version", xml : <html><body><strong>("Version"; <br>nil</br>; " ")</strong></body></html> }, { text : "Date", xml : <html><body><strong>("Date"; <br>nil</br>; " ")</strong></body></html> }, { text : "Author", xml : <html><body><strong>("Author"; <br>nil</br>; " ")</strong></body></html> }, { text : "Description", xml : <html><body><strong>("Description"; <br>nil</br>; " ")</strong></body></html> } ], [ { text : "0.9.0", xml : <html><body>"0.9.0"</body></html> }, { text : "9-Dec-2009", xml : <html><body>"9-Dec-2009"</body></html> }, { text : "neilw", xml : <html><body>"neilw"</body></html> }, { text : "First public beta version", xml : <html><body>"First public beta version"</body></html> } ], [ { text : "1.0.0", xml : <html><body>"1.0.0"</body></html> }, { text : "6-Jan-2010", xml : <html><body>"6-Jan-2010"</body></html> }, { text : "neilw", xml : <html><body>"neilw"</body></html> }, { text : "Official Release Added \"xpath\" arg to each function Implemented \"format\" function", xml : <html><body>(" "; <p>"Official Release"</p>; " "; <ul>(" "; <li>"Added \"xpath\" arg to each function"</li>; " "; <li>"Implemented \"format\" function"</li>; " ")</ul>; " ")</body></html> } ], [ { text : "1.1.0", xml : <html><body>"1.1.0"</body></html> }, { text : "29-Jan-2010", xml : <html><body>"29-Jan-2010"</body></html> }, { text : "neilw", xml : <html><body>"neilw"</body></html> }, { text : "Added \"RecordList\" function", xml : <html><body>"Added \"RecordList\" function"</body></html> } ] ]
    • This demonstrates the "both" return type.  Each table element is returned as a map containing "text" and "xml" elements.
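
To make the three return types concrete, here is a rough Python model of what full() produces. It is a sketch under assumptions, not the extension's code: the table markup (including the `<b>` around "fun") is made up for illustration, whitespace collapsing is assumed, and the `<html><body>` wrapper that DekiScript adds is not reproduced. The helper `cell_text_902` models the older xml.text() behavior mentioned in the notes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a small table; the <b> around "fun" is an assumption.
TABLE = ET.fromstring(
    "<table id='ex1'>"
    "<tr><td>Name</td><td>  TableParse</td></tr>"
    "<tr><td>Description</td><td>A little slice of <b>fun</b></td></tr>"
    "</table>"
)

def cell_text(cell):
    """All nested text, whitespace-collapsed and trimmed (9.08-style)."""
    return " ".join("".join(cell.itertext()).split())

def cell_text_902(cell):
    """Only the text before the first nested element (9.02-style)."""
    return (cell.text or "").strip()

def full(table, kind="text"):
    """Return the table as a list of rows, each a list of cells."""
    def convert(cell):
        text = cell_text(cell)
        xml = ET.tostring(cell, encoding="unicode")  # raw cell markup, no wrapper
        options = {"text": text, "xml": xml, "both": {"text": text, "xml": xml}}
        return options[kind]
    return [[convert(cell) for cell in row] for row in table.iter("tr")]

rows = full(TABLE)
legacy = cell_text_902(TABLE[1][1])
print(rows)    # list of lists of trimmed strings
print(legacy)  # text cut off at the first nested element
```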

    TableParse.keyValue() Examples

    Code Output Notes
    {{
      tableParse.keyvalue(page.xml,
        "text",
        "//table[@id='ex1']")
    }}
    { "0.9.0" : "9-Dec-2009", "1.0.0" : "6-Jan-2010", "1.1.0" : "29-Jan-2010", Version : "Date" }
    • Keyvalue() interprets the table as a column of keys and a column of values, and returns a map which reflects that structure.
    • Keys are always trimmed text of the first column entries.
    • This example shows the values returned using the default type "text"
    {{
      tableParse.keyvalue(page.xml, 
        "xml", "//table[@id='ex1']")
    }}
    { "0.9.0" : <html><body>"9-Dec-2009"</body></html>, "1.0.0" : <html><body>"6-Jan-2010"</body></html>, "1.1.0" : <html><body>"29-Jan-2010"</body></html>, Version : <html><body><strong>("Date"; <br>nil</br>; " ")</strong></body></html> }
    • This demonstrates the "xml" return type
    {{
      tableParse.keyvalue(page.xml,
        "both", 
        "//table[@id='ex1']")
    }}
    { "0.9.0" : { text : "9-Dec-2009", xml : <html><body>"9-Dec-2009"</body></html> }, "1.0.0" : { text : "6-Jan-2010", xml : <html><body>"6-Jan-2010"</body></html> }, "1.1.0" : { text : "29-Jan-2010", xml : <html><body>"29-Jan-2010"</body></html> }, Version : { text : "Date", xml : <html><body><strong>("Date"; <br>nil</br>; " ")</strong></body></html> } }
    • This demonstrates the "both" return type.
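
The key/value interpretation can be sketched in a few lines of Python. This is only a model of the behavior described above; the table markup and the names `text_of` and `keyvalue` are hypothetical, and skipping rows with fewer than two cells is an assumption rather than documented behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-column table: keys on the left, values on the right.
TABLE = ET.fromstring(
    "<table id='ex1'>"
    "<tr><td>Name</td><td>  TableParse</td></tr>"
    "<tr><td>Description</td><td>A little slice of fun</td></tr>"
    "</table>"
)

def text_of(cell):
    """Trimmed, whitespace-collapsed text of a cell."""
    return " ".join("".join(cell.itertext()).split())

def keyvalue(table):
    """Map first-column text to second-column text, one entry per row."""
    result = {}
    for row in table.iter("tr"):
        cells = list(row)
        if len(cells) >= 2:  # assumption: rows without a value cell are skipped
            result[text_of(cells[0])] = text_of(cells[1])
    return result

pairs = keyvalue(TABLE)
print(pairs)
```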

    TableParse.format() Examples

    This function is good when you have a table from which you need to extract a known subset of the data.  By defining a format for the table (which is passed to the function, as described below), you put the table format in one place, rather than hard-coding it into a bunch of different places in your code.  So, in one fell swoop, you take all the required data from the table and throw it into a map.

    The "format" argument is a map specifying how to parse each desired cell of the table.  Each map element is a list with two or three items:

    1. The first item is the row number of the desired cell (starting from 0).
    2. The second item is the column number of the desired cell (starting from 0).
    3. The third item is optional, and specifies the type of cell: "text" (default), "xml", or "both".  By now you know how this works.

    If no cell exists at the specified location, then it is assigned the value nil.

    So let's see what we can do with our sample table:

    Code Output Notes
    {{
      var format = {
        name:        [ 0,1 ],
        description: [ 1,1,"xml" ]
      };

      tableParse.format{
        xml: page.xml,
        format: format,
        xpath: "//table[@id='ex1']"
      }
    }}
    { description : <html><body>"9-Dec-2009"</body></html>, name : "Date" }
    • The format variable compactly describes the exact format of usable data in the table.
    • Unlike the other functions, this one allows you to have some cells evaluated as text, others as xml, and others as both.
    • For this particular example, keyvalue() would have worked almost as well, but in the real world the table format is often not so tidy.
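
The format map can also be modeled in Python to show how each `[row, column, type]` entry is resolved, including the nil result for a missing cell. This is a sketch under assumptions, not the extension's implementation: the table markup and the name `format_table` are hypothetical, Python's `None` stands in for nil, and the XML value is the raw cell markup without the `<html><body>` wrapper.

```python
import xml.etree.ElementTree as ET

# Hypothetical table used only for this illustration.
TABLE = ET.fromstring(
    "<table id='ex1'>"
    "<tr><td>Name</td><td>  TableParse</td></tr>"
    "<tr><td>Description</td><td>A little slice of fun</td></tr>"
    "</table>"
)

def text_of(cell):
    """Trimmed, whitespace-collapsed text of a cell."""
    return " ".join("".join(cell.itertext()).split())

def format_table(table, fmt):
    """Pick cells by [row, column, optional type]; missing cells become None."""
    rows = [list(row) for row in table.iter("tr")]
    result = {}
    for name, spec in fmt.items():
        r, c = spec[0], spec[1]
        kind = spec[2] if len(spec) > 2 else "text"  # "text" is the default
        if r >= len(rows) or c >= len(rows[r]):
            result[name] = None  # stands in for DekiScript's nil
            continue
        cell = rows[r][c]
        text = text_of(cell)
        xml = ET.tostring(cell, encoding="unicode")
        options = {"text": text, "xml": xml, "both": {"text": text, "xml": xml}}
        result[name] = options[kind]
    return result

fmt = {"name": [0, 1], "description": [1, 1, "xml"], "missing": [5, 0]}
record = format_table(TABLE, fmt)
print(record)
```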

    TableParse.recordList() Examples

    The recordList() function treats the first row as the list of field names, and each row below it as a record of values of those fields.  The results are put into a list of maps.  To illustrate this function we'll use this table:

    Name
    Bugs Bunny     Being a stinker
    Elmer J. Fudd  Hunting wabbits

    In this particular case, the first row of the table is a normal row (inside <tbody> with <td>s), but it could equally well be a <thead> with <th> elements.  No difference!  Also, just to show what happens, we've omitted a fieldname from the second column (it should say "Occupation").  The id attribute for this table is "exrl".  Here's how it works:

     

    Code Output Notes
    {{
     tableParse.recordList(
      page.xml,
      "text",
      "//table[@id='exrl']"
     );
    }}
    [ { COLUMN1 : "Being a stinker", Name : "Bugs Bunny" }, { COLUMN1 : "Hunting wabbits", Name : "Elmer J. Fudd" } ]
    • If a column has a blank header, the field name is set equal to COLUMN plus the column number.
    • The "text" type is shown, but "xml" and "both" are available as well, just like with the other functions.
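
The same result can be reproduced with a short Python model. It is a sketch, not the extension's code: the table markup and the name `record_list` are hypothetical, and the zero-based column number in `COLUMN1` is inferred from the output shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the id="exrl" table; the second header cell is blank.
TABLE = ET.fromstring(
    "<table id='exrl'>"
    "<tr><td>Name</td><td></td></tr>"
    "<tr><td>Bugs Bunny</td><td>Being a stinker</td></tr>"
    "<tr><td>Elmer J. Fudd</td><td>Hunting wabbits</td></tr>"
    "</table>"
)

def text_of(cell):
    """Trimmed, whitespace-collapsed text of a cell."""
    return " ".join("".join(cell.itertext()).split())

def record_list(table):
    """First row gives the field names; every later row becomes one record."""
    rows = [[text_of(cell) for cell in row] for row in table.iter("tr")]
    header, body = rows[0], rows[1:]
    # A blank header falls back to COLUMN plus the zero-based column number.
    fields = [name or f"COLUMN{i}" for i, name in enumerate(header)]
    return [dict(zip(fields, row)) for row in body]

records = record_list(TABLE)
print(records)
```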
    Hi Neil - I am curious to know what the changes in behaviour of xml.text that you speak of are between 9.02 and 9.08? Sean
    Posted 04:08, 10 Dec 2009
    In 9.02 and earlier, it only returned the text in the selected element up until the first nested element. So, if you have a table cell with a link in it, or some text styling, it would break xml.text()'s ability to get the whole thing. A workaround is possible but ugly; for this extension I'm happy to rely on the new behavior in 9.08, where xml.text() returns all the text of all the elements inside the selection. Did that convoluted explanation make any sense? It's hard to post examples inside the limitations of the comment system here....
    Posted 06:12, 10 Dec 2009
    Hi Neil - Thanks for that. The doc page for xml.text gives no indication that the behaviour changed. We are still running 9.02 here and use xml.text extensively - so I am obviously curious to learn if this is going to be a 'breaking' change for us or not when we upgrade. Thanks again - Sean
    Posted 12:18, 10 Dec 2009
    You're right, the docs make no mention; I only noticed by accident when one of my scripts started working better after the upgrade, and subsequently saw the change documented in the bug log. My guess is that it will not break your stuff, but only you can determine that.
    Posted 12:24, 10 Dec 2009
    Something to watch for novices is a logistical problem in the Edit form for this extension. When I came to dump the settings for this to use on another test wiki, the manifest field was inexplicably blank thus the dump did not contain the manifest path. If you subsequently click the red 'Edit' button at the foot of the page (with a view to editing page contents) you then in fact save the contents of the form and effectively remove the manifest path from your config - thus breaking your extension - and having no record of the previous manifest setting!
    Fortunately the path for this extension was the same as another extension on our server.
    Would suggest the red 'EDIT' button is renamed 'SAVE' with a warning prompt for overwriting data - and the empty manifest field issue is looked at! edited 01:47, 17 Nov 2010
    Posted 10:05, 16 Nov 2010

    Copyright © 2011 MindTouch, Inc.