SGMLReader - Convert any HTML to valid XML

SGMLReader is a versatile C# .NET library written by Chris Lovett for parsing HTML/SGML files. The original SGMLReader community was hosted on GotDotNet, but that site was phased out (update: the code appears to have resurfaced on MSDN Code Gallery, but without any updates). MindTouch Dream and MindTouch Core use the SGMLReader library extensively. Over the last few years we have made many improvements to the code, making us the de facto maintainers of the library. In the spirit of the original author, we are contributing these changes back on the MindTouch Developer Center site.

Sample Usage

The following code parses an HTML document into an XmlDocument:

using System.IO;
using System.Xml;

XmlDocument FromHtml(TextReader reader) {

    // set up SgmlReader for HTML input
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;

    // fold element and attribute names to lower case
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document; keep whitespace as-is and do not resolve external resources
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
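
A minimal usage sketch for the method above, assuming the SgmlReader assembly is referenced by the project. The class name, the sample HTML string, and the XPath query are illustrative and not part of the library; the FromHtml setup is repeated only so the sketch compiles on its own.

```csharp
using System;
using System.IO;
using System.Xml;

public static class SgmlReaderUsageSketch {

    // Same setup as the FromHtml sample, repeated so this sketch is self-contained.
    public static XmlDocument FromHtml(TextReader reader) {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }

    public static void Main() {
        // Hypothetical input: upper-case tag names, unclosed <P> and <BR>.
        string html = "<P>Hello<BR>world <A HREF='http://example.com'>link</A>";

        using (TextReader reader = new StringReader(html)) {
            XmlDocument doc = FromHtml(reader);

            // Print the XML that SgmlReader produced from the HTML.
            Console.WriteLine(doc.OuterXml);

            // Names were folded to lower case, so the query uses lower-case names.
            foreach (XmlNode href in doc.SelectNodes("//a/@href")) {
                Console.WriteLine(href.Value);
            }
        }
    }
}
```

Because CaseFolding is set to ToLower, XPath queries against the resulting document should use lower-case element and attribute names, regardless of how the source HTML was written.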

Sample Output

Visit the HTML-to-XML Conversion Examples page to see how SGMLReader converts HTML source into valid XML.

Community

If you find/fix issues in SGMLReader, please post them in the SGMLReader forum.

Download

The latest version of SGMLReader can be downloaded on GitHub.

Release History

Note: all 1.8.x releases are compatible with 1.8.0.  Use assembly redirection to account for newer versions when recompilation is not an option.
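
As a rough illustration of assembly redirection, an application configuration file can map requests for older 1.8.x builds to the version that is actually deployed. The assembly name, public key token, and four-part version numbers below are placeholders, not values taken from the SGMLReader distribution; substitute the values of the assembly you ship. Binding redirects apply only to strong-named assemblies.

```xml
<configuration>
  <runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <!-- name and publicKeyToken are placeholders: use your assembly's values -->
        <assemblyIdentity name="SgmlReaderDll"
                          publicKeyToken="0000000000000000"
                          culture="neutral" />
        <!-- version numbers are assumed: redirect older 1.8.x requests to the deployed build -->
        <bindingRedirect oldVersion="1.8.0.0-1.8.6.0" newVersion="1.8.7.0" />
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>
```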

Release notes for 1.8.7 (2010-Apr-27)

  • (#7536) Provide setting to ignore DTD in parsed document (again)

Release notes for 1.8.6 (2010-Feb-19)

  • (#7536) Provide setting to ignore DTD in parsed document
  • (#7505) An attribute with a missing value should be assumed to have the name of the attribute as value
  • (#7678) SGMLReader ExpandEntity mishandles entities not ending in ';' and skips a character
  • (#7631) SGMLReader adds 65535 character at the end of the string
  • (#7181) Add test showing behavior of > char in string literals in XML

Release notes for 1.8.5 (2009-Jul-19)

  • (#6512) unable to parse UTF-32 entities
  • (#6547) Use StringComparison.OrdinalIgnoreCase instead of StringComparison.InvariantCultureIgnoreCase

Release notes for 1.8.4 (2009-May-19)

  • (#6228) corrupt attributes may lead to invalid attribute names, which make the produced XML unparseable
  • (#6306) error when content contains prefixed XML processing instructions (e.g. <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" />)
  • added -ignore flag so tests known to fail can be ignored from the suite
  • added test to re-parse output to make sure it's valid XML (.Net sometimes was able to generate invalid XML)

Release notes for 1.8.3 (2009-Apr-03)

  • (#5552) fixed CData section parsing skips over characters

Release notes for 1.8.2 (2008-Nov-26)

  • fixed regression introduced by fixing bug 5150
  • (#5443) an extra open quote/double-quote prevents the entire element from being read properly
  • replaced == string equality with culture invariant string.Compare
  • return 'null' as NameTable since none is used
  • added '-noformat' switch for regression tests to suppress automatic reformatting (useful for formatting tests)

Release notes for 1.8.1 (2008-Oct-08)

  • (#5144) Unclosed HTML comment causes infinite loop
  • (#5150) don't use XmlNameTable with object comparisons; it becomes unreliable after a while

Release notes for 1.8.0 (2008-Jul-28)

  • BREAKING CHANGE: requires .NET 2.0
  • major code clean-up (thx jamesgmbutler for the contribution!)
  • (#4606) Add XML-only entity &apos; to HTML DTD

Release notes for 1.7.5 (2008-Jul-01)

  • (#4410) Missing quote in attribute value causes catastrophic failure
  • (#4409) Unknown prefixes cannot be mapped to the same namespace

Release notes for 1.7.4 (2008-Jun-03)

  • (#4179) &sup2; entity is not recognized correctly
  • added test for entities with digits

Release notes for 1.7.3 (2008-May-05)

  • never close the BODY tag early (it causes loss of content)
  • remove "<![CDATA[" inside CDATA sections
  • remove "]]>" inside CDATA sections
  • (#3513) convert elements with invalid tag names into text (e.g. <foo@bar.com>)

Release notes for 1.7.2 (2007-Dec-07)

  • fixed bug where parsing CDATA section skipped first character
  • don't double parse commented out CDATA sections
  • added support for namespaces on elements and attributes
  • unknown prefixes on attributes and elements resolve to '#unknown' namespace
  • fix bug when parsing down-level comments, like <![if IE]>
  • don't allow attribute with invalid names (e.g. <p foo:="invalid" ;="bad">, etc.)

Release notes for 1.7.1 (2007-Sep-25)

  • added 'GetLiteralEntitiesLookup()' method
  • fixed bugs with namespace prefixes on attributes and elements; prefixes are now stripped automatically
  • added SGMLReader constructor with XmlNameTable argument to avoid failed comparisons when reusing the DTD
  • ensured that SGMLReader is initialized identically when reusing a DTD

Release notes for 1.7

  • Fix bug reported by chriswang - MoveToAttribute didn't save state properly.
  • Fix bug reported by starascendent - build on Visual Studio 2003 was broken.
  • Fix bug reported by sanchen - ExpandCharEntity was messed up on hex entities.
  • Fix bug reported by kojiishi - off by one bug in SniffName()
  • Fix bug reported by kojiishi - bug in loading XmlDocument from SGMLReader - this was caused by the HTML document containing an embedded <?xml version='1.0'?> declaration, so the SGMLReader now strips these.
  • Added special stripping of punctuation characters between attributes like ",".

Release notes for 1.6

  • Improve wrapping of HTML content with auto-generated <html></html> container tags.

Release notes for 1.5

  • Fix detection of ContentType=text/html and switch to HTML mode.
  • Fix problems parsing DOCTYPE tag when case folding is on.
  • Fix reading of XHTML DTD.
  • Fix parsing of content of type CDATA that resulted in the error message 'Cannot have ']]>' inside an XML CDATA block'.
  • Fix parsing of http://www.virtuelvis.com/download/162/evilml.html.
  • Fix parsing of attributes missing the equals sign: height"4"  (thanks to Ulrich Schwanitz for his fix).
  • Fix 'SniffWhitespace' thanks to "Windy Winter".
  • Added TestSuite project.

Release notes for 1.4

  • Added UserAgent string "Mozilla/4.0 (compatible;);" so that SGMLReader gets the right content from webservers.
  • Fixed handling of HTML that does not start with root <html> element tag.
  • Fixed handling of built-in HTML entities.

Release notes for 1.3

  • Changed ToUpper to CaseFolding enum and added support for "auto-folding" based on input.
  • Added support for <![CDATA[...]]> blocks.
  • Added proper encoding support, including support for the HTML <META http-equiv="content-type"> tag.  This means output now has the correct XML declaration (unless you specify the new -noxml option) and any existing XML declarations in the input are stripped out so you don't end up with two.
  • Added support for ASP <%...%> blocks (thanks to Dan Whalin).
  • Now strips out DOCTYPE by default since HTML DocTypes can cause problems for XmlDocument when it tries to load the HTML DTD.  Added a "-doctype" switch for those who really need it to come through.
  • Fix handling of Office 2000 <?xml:namespace .../> declarations.
  • Remove bogus attributes that have no name, in cases like <class= "test">.

Release notes for 1.2

  • Converted back to Visual Studio 7.0 since this is the lowest common denominator.
  • Added ToUpper switch for upper case folding, instead of the default lower case.
  • Fix handling of UNC paths.
  • Added OFX test suite.
  • Fixed bug in parsing CDATA type elements (like <script><!-- --></script>)

Release notes for 1.1

  • Upgraded project to Visual Studio 7.1.
  • Fixed bug in accessing https authenticated sites.
  • Fixed bug in handling of content that contains nulls.
  • Improved handling of <!DOCTYPE with PUBLIC and no SYSTEM literal.
  • Fixed bug in losing attributes when auto-closing tags.
  • Fixed pretty printing output by adding WhitespaceHandling flag to SGMLReader.

Release notes for 1.0.4

  • Added -encoding option so you can change the encoding of the output file.

Release notes for 1.0.3.26932

  • Implemented ReadOuterXml and ReadInnerXml and fix some bugs in dealing with xmlns attributes and dealing with non-HTML tags.

Release notes for 1.0.3

  • Fixed some CLS compliance problems with using SGMLReader from VB and a null reference exception bug when loading SGMLReader from XmlDocument

Release notes for 1.0.2.21225

  • Fixed bug in handling of encodings. Now uses the correct encoding returned from the HTTP server

Release notes for 1.0.2.21105

  • Fixed bug in handling of input that contains blank lines at the top.

Release notes for 1.0.2

  • Added fix for the way IE & Netscape deal with characters in the range 0x80 through 0x9F in HTML.

Release notes for 1.0.1

  • Fixed bug in handling of empty elements, like <INPUT>

Release notes for 1.0

  • Add wildcard support for command line utility.

Release notes for 0.5

  • Initial release

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
 
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.
 
You should have received a copy of the GNU General Public License along with this program.  If not, see http://www.gnu.org/licenses/.

Comments

The SGML parser does not detect and properly correct for end-of-line characters.
When parsing an OFX message, one of the servers sent a formatted response with the nested sections indented with spaces.
The parser (reasonably) treats end-of-line characters as whitespace and, although they are appended to the text field, they must be filtered out.
However, once the parser moves on to the next line, it starts consuming the indenting spaces and appends these to the text field of the item above it!

e.g. instead of the desired resultant XML of <VER>1</VER> I get <VER>1 </VER>
The extra spaces are the indentation space of the following line.
This is an error in the way the text-fields are parsed and would require a bit of work to fix "properly".
A quick hack is to trim the string.
In SgmlReader.cs, near the end of the ParseText() function, on or near line 2096 change the code
string value = this.m_sb.ToString();
to
string value = this.m_sb.ToString().Trim();
Posted 19:29, 26 Jan 2012 (edited 19:30, 26 Jan 2012)
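
The effect of the workaround described in the comment can be seen in isolation, without the parser. This is only a sketch: the StringBuilder stands in for the reader's text buffer (m_sb in the comment), and the class and method names are made up for illustration. Note that Trim() removes leading whitespace as well as trailing whitespace, so TrimEnd() may be the more conservative choice when leading spaces in a text value are significant.

```csharp
using System;
using System.Text;

public static class TrimSketch {

    // Wraps a text value in an element, mimicking the <VER>...</VER> output in the comment.
    public static string Wrap(string name, string value) {
        return "<" + name + ">" + value + "</" + name + ">";
    }

    public static void Main() {
        // Stand-in for the parser's buffer: the value, a line break,
        // then the indentation of the following line.
        StringBuilder sb = new StringBuilder();
        sb.Append(" 1");
        sb.Append("\r\n");
        sb.Append("    ");
        string raw = sb.ToString();

        // Trim() strips whitespace (including \r and \n) from both ends.
        Console.WriteLine(Wrap("VER", raw.Trim()));     // <VER>1</VER>

        // TrimEnd() strips only the trailing whitespace and keeps the leading space.
        Console.WriteLine(Wrap("VER", raw.TrimEnd()));  // <VER> 1</VER>
    }
}
```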

Copyright © 2011 MindTouch, Inc.