
HTML-to-XML Conversion Examples


The following tests show how SgmlReader converts malformed HTML into valid XML. Each test lists the input ("Before") and the converted output ("After"). Note that extended characters may appear incorrectly because this page is generated on the fly from the HTML test file on GitHub.

Test 1

Before
<html>
<body><span text />
</body>
</html>
After
<html>
  <body>
    <span text="text" />
  </body>
</html>
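
To make the difference between the two forms concrete, the following sketch feeds the Test 1 input and output to a strict XML parser. It uses Python's standard `xml.etree.ElementTree` module, not SgmlReader itself (which is a .NET library), and the helper names `is_well_formed` and `span_text_attribute` are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Test 1 input and output, copied from the listing above.
BEFORE = "<html>\n<body><span text />\n</body>\n</html>"
AFTER = '<html>\n  <body>\n    <span text="text" />\n  </body>\n</html>'


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def span_text_attribute(markup: str) -> str:
    """Read the value of the 'text' attribute on the <span> element."""
    root = ET.fromstring(markup)
    return root.find("body/span").get("text")


if __name__ == "__main__":
    print(is_well_formed(BEFORE))      # the valueless attribute is rejected
    print(is_well_formed(AFTER))       # the converted output parses
    print(span_text_attribute(AFTER))  # the attribute name doubles as its value
```

The same check can be pointed at other Before/After pairs, though outputs that this page renders with extra line breaks or damaged characters may need to be taken from the test file directly.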

Test 2

Before
<html>
<body><span text="foo>bar"/>
</body>
</html>
After
<html>
  <body>
    <span text="foo">bar"/&gt;
</span>
  </body>
</html>

Test 3

Before
<html>
<body><span text="foo<bar"/>
</body>
</html>
After
<html>
  <body>
    <span text="foo&lt;bar" />
  </body>
</html>

Test 4

Before
<html>
<body>
<tag>&test&nbsp&nbsp blah blah</tag>
</body>
</html>
After
<html>
  <body>
    <tag>&amp;test  blah blah</tag>
  </body>
</html>

Test 5

Before
<html>
<body>
<tag>&nbsp&nbsp&nbsp blah blah</tag>
</body>
</html>
After
<html>
  <body>
    <tag>   blah blah</tag>
  </body>
</html>

Test 6

Before
<html>
<body>
<p>bad char: <span>&#1048576;</span></p>
</body>
</html>
After
<html>
  <body>
    <p>bad char: <span>��</span></p>
  </body>
</html>

Test 7

Before
<html>
<body>
<P class=MsoNormal dir=ltr 
style="MARGIN: 0pt;" align=left><?xml:namespace 
prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" 
/><ST1:PERSONNAME></ST1:PERSONNAME></P>
</body>
</html>
After
<html>
  <body>
    <P class="MsoNormal" dir="ltr" style="MARGIN: 0pt;" align="left">
      <?namespace 
prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" 
?>
      <ST1:PERSONNAME xmlns:ST1="#unknown">
      </ST1:PERSONNAME>
    </P>
  </body>
</html>

Test 8

Before
<html>
<body>
<DIV STYLE="top:214px; left:139px; position:absolute; font-size:26px;"><NOBR><SPAN STYLE="font-family:"Wingdings 2";"></SPAN></NOBR></DIV>
</body>
</html>
After
<html>
  <body>
    <DIV STYLE="top:214px; left:139px; position:absolute; font-size:26px;">
      <NOBR>
        <SPAN STYLE="font-family:" Wingdings="Wingdings" _x0032_=";">
        </SPAN>
      </NOBR>
    </DIV>
  </body>
</html>

Test 9

Before
<html>
<body>
<script type="text/javascript">/*<![CDATA[*/
var test = '<div>"test"</div>';
/*]]>*/</script>
<p>test</p>
</body>
</html>
After
<html>
  <body>
    <script type="text/javascript"><![CDATA[
var test = '<div>"test"</div>';
]]></script>
    <p>test</p>
  </body>
</html>

Test 10

Before
<html>
<body>This <P>is bad </P> XHTML.</body>
</html>
After
<html>
  <body>This <p>is bad </p> XHTML.</body>
</html>

Test 11

Before
<html>
<body><span>some text</span> <span>more text</span></body>
</html>
After
<html>
<body><span>some text</span> <span>more text</span></body>
</html>

Test 12

Before
<html>
<body><a href="http://www.cnn.com/"' title="cnn.com">cnn</a></body>
</html>
After
<html>
  <body>
    <a href="http://www.cnn.com/">cnn</a>
  </body>
</html>

Test 13

Before
<html>
<head>
<style>
<!--
</style>
</head>
</html>
After
<html>
  <head>
    <style>
      <!--
</style>
</head>
</html>
-->
    </style>
  </head>
</html>

Test 14

Before
<html>
  <body>&apos;</body>
</html>
After
<html>
  <body>'</body>
</html>

Test 15

Before
<script type="text/javascript></script>
After
<html>
  <script type="text/javascript">
  </script>
</html>

Test 16

Before
<html xmlns="http://www.w3.org/1999/xhtml"><head /><body><table u1:str="" x:str=""></table></body></html>
After
<html xmlns="http://www.w3.org/1999/xhtml">
  <head />
  <body>
    <table u1:str="" x:str="" xmlns:x="#unknown1" xmlns:u1="#unknown">
    </table>
  </body>
</html>

Test 17

Before
<html>
    <body>&sup2;</body>
</html>
After
<html>
  <body>²</body>
</html>

Test 18

Before
<html>
    <body>
       <something@something.com>
    </body>
</html>
After
<html>
  <body>&lt;something@something.com&gt;</body>
</html>

Test 19

Before
<html>
    <body>
        <script type="text/javascript">/*<![CDATA[*/ /*<![CDATA[*/ test /*]]>*/ /*]]&gt;*/</script>
    </body>
</html>
After
<html>
  <body>
    <script type="text/javascript"><![CDATA[  test  /*]]&gt;*/]]></script>
  </body>
</html>

Test 20

Before
<html>
	<body>
		<style>div.wiki { float: right; }</style>
		<em>foo</em>
	</body>
</html>
After
<html>
  <body>
    <style><![CDATA[div.wiki { float: right; }]]></style>
    <em>foo</em>
  </body>
</html>

Test 21

Before
<html><body><title>Title</title><foo>foo</foo></body></html>
After
<html>
  <body>
    <title>Title</title>
    <foo>foo</foo>
  </body>
</html>

Test 22

Before
<html><body>
<p class="MsoNormal">
	<span style="font-size: 10pt;" arial="" ,="" sans-serif="" ;;="" font-family:dummy:="" font-family:="" font-family:foo:="" arial;="" font-size:="" 13.3333px;="">
		<span class="Apple-style-span" style="font-family: Arial; font-size: 13.3333px;">-lm</span>
	</span>
</p>
</body></html>
After
<html>
  <body>
    <p class="MsoNormal">
      <span style="font-size: 10pt;" arial="" sans-serif="">
        <span class="Apple-style-span" style="font-family: Arial; font-size: 13.3333px;">-lm</span>
      </span>
    </p>
  </body>
</html>

Test 23

Before
<html><body>do <![if !supportLists]>not<![endif]> lose this text</body></html>
After
<html>
  <body>do not lose this text</body>
</html>

Test 24

Before
<html xmlns="http://implicit" xmlns:n="http://explicit"><foo attr1="1" n:attr2="2" /><n:foo attr1="1" n:attr2="2" /></html>
After
<html xmlns="http://implicit" xmlns:n="http://explicit">
  <foo attr1="1" n:attr2="2" />
  <n:foo attr1="1" n:attr2="2" />
</html>

Test 25

Before
<html xmlns:n="http://explicit"><foo attr1="1" n:attr2="2" /><n:foo attr1="1" n:attr2="2" /></html>
After
<html xmlns:n="http://explicit">
  <foo attr1="1" n:attr2="2" />
  <n:foo attr1="1" n:attr2="2" />
</html>

Test 26

Before
<html xmlns:n="http://explicit"><foo attr1="1" n:attr2="2" /><n:foo attr1="1" n:attr2="2" /></html>
After
<html xmlns:n="http://explicit">
  <foo attr1="1" n:attr2="2" />
  <n:foo attr1="1" n:attr2="2" />
</html>

Test 27

Before
<html><foo xmlns:n="http://explicit" attr1="1" n:attr2="2" /></html>
After
<html>
  <foo xmlns:n="http://explicit" attr1="1" n:attr2="2" />
</html>

Test 28

Before
<html><foo xmlns:n="http://explicit" attr1="1" n:attr2="2" /></html>
After
<html>
  <foo xmlns:n="http://explicit" attr1="1" n:attr2="2" />
</html>

Test 29

Before
<html xmlns:o="http://microsoft.com"><body>A<o:p></o:p>B<o:p></o:p></body></html>
After
<html xmlns:o="http://microsoft.com">
  <body>A<o:p></o:p>B<o:p></o:p></body>
</html>

Test 30

Before
<html xmlns:o="http://microsoft.com"><body>A<o:p></o:p>B<o:p></o:p></body></html>
After
<html xmlns:o="http://microsoft.com">
  <body>A<o:p />B<o:p /></body>
</html>

Test 31

Before
<html><body>A<o:p></o:p>B<o:p></o:p></body></html>
After
<html>
  <body>A<o:p xmlns:o="#unknown"></o:p>B<o:p xmlns:o="#unknown"></o:p></body>
</html>
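
The `xmlns:o="#unknown"` declaration in the output matters to namespace-aware XML parsers, which reject a prefix that has not been declared. The sketch below illustrates this with Python's standard `xml.etree.ElementTree` as a stand-in consumer; it is not part of SgmlReader, the markup is the Test 31 pair with indentation removed, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Test 31 input and output, with the output's indentation removed.
BEFORE = "<html><body>A<o:p></o:p>B<o:p></o:p></body></html>"
AFTER = (
    '<html><body>A<o:p xmlns:o="#unknown"></o:p>'
    'B<o:p xmlns:o="#unknown"></o:p></body></html>'
)


def parse_error(markup: str):
    """Return the parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as exc:
        return str(exc)


def child_tags(markup: str):
    """List the namespace-qualified tag names of the children of <body>."""
    root = ET.fromstring(markup)
    return [child.tag for child in root.find("body")]


if __name__ == "__main__":
    print(parse_error(BEFORE))  # an error message: the 'o' prefix is unbound
    print(parse_error(AFTER))   # None
    print(child_tags(AFTER))    # tags qualified with the placeholder namespace
```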

Test 32

Before
<html><body>A<o:p></o:p>B<o:p></o:p></body></html>
After
<html>
  <body>A<o:p xmlns:o="#unknown" />B<o:p xmlns:o="#unknown" /></body>
</html>

Test 33

Before
<html><body>
After
<html>
  <body>
  </body>
</html>

Test 34

Before

<html>
After


<html>
</html>

Test 35

Before
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> 
<html>
After
<html>
</html>

Test 36

Before
<html>
<body>
<table><tr><td>row1<tr><td>row2</td>
After
<html>
  <body>
    <table>
      <tr>
        <td>row1</td>
      </tr>
      <tr>
        <td>row2</td>
      </tr>
    </table>
  </body>
</html> 

Test 37

Before
<html> 
<head> 
<script language="JavaScript"> 
<!-- 
--></script> 
</head> 
<body> 
<p>hello</p> 
</body> 
</html> 
After
<html>
  <head>
    <script language="JavaScript">
      <!-- 
-->
    </script>
  </head>
  <body>
    <p>hello</p>
  </body>
</html>

Test 38

Before
<html>
<![CDATA[this is a CDATA block with markup <table><tr><td> ]]>
</html>
After
<html><![CDATA[this is a CDATA block with markup <table><tr><td> ]]></html>

Test 39

Before
<p>This is really <messed_up.< p>.
After
<html>
  <p>This is really <messed_up.>&lt; p&gt;.
</messed_up.></p>
</html>

Test 40

Before
<html><class="black">Text………</html>
After
<html>
  <class>Text………</class>
</html>

Test 41

Before
<p>&copy;</p>
<br/>
After
<html>
  <p>©</p>
  <br />
</html>

Test 42

Before
<html> 
  <img src="img.gif" height"4" width= 2 > 
</html>
After
<html>
  <img src="img.gif" height="4" width="2" />
</html>

Test 43

Before
<html>
  <script><![CDATA[this is a test]]></script>
</html>
After
<html>
  <script><![CDATA[this is a test]]></script>
</html>

Test 44

Before
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML></HTML>
After
<html>
</html>

Test 45

Before
<b>foo</b>
After
<html>
  <b>foo</b>
</html>

Test 46

Before
blah <b>foo</b>
After
<html>blah <b>foo</b></html>

Test 47

Before
<!-- top --> <b>foo</b>
After
<!-- top -->
<html>
  <b>foo</b>
</html>

Test 48

Before
<html>
<body>
<p>&#x5a;&#90;&#90 test &#90</p>
After
<html>
  <body>
    <p>ZZZ test Z</p>
  </body>
</html>
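
For comparison, Python's standard `html.unescape` function, which follows the HTML5 character-reference rules, resolves the references in Test 48 and in Tests 14, 17, and 41 to the same characters shown in the outputs on this page. This is only a cross-check against a different implementation; it says nothing about how SgmlReader decodes references internally, and the `decode` wrapper is a hypothetical name.

```python
import html


def decode(text: str) -> str:
    """Resolve HTML character references, including ones missing the ';'."""
    return html.unescape(text)


if __name__ == "__main__":
    # Test 48: hexadecimal and decimal references, two without a semicolon.
    print(decode("&#x5a;&#90;&#90 test &#90"))  # ZZZ test Z
    # Tests 41, 17 and 14: named references.
    print(decode("&copy;"), decode("&sup2;"), decode("&apos;"))
```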

Test 49

Before
<html>
  <?xml version="1.0" encoding="UTF-16"?>
</html>
After
<html>
</html>

Test 50

Before
<html><?xml:namespace prefix="st1" ns="urn:schemas-microsoft-com:office:smarttags" />
<body>
After
<html>
  <?namespace prefix="st1" ns="urn:schemas-microsoft-com:office:smarttags" ?>
  <body>
  </body>
</html>

Test 51

Before
<html xmlns:portal="http://schemas.microsoft.com/msn/portal/controls"><head><title>Welcome to MSN.com</title>
After
<html xmlns:portal="http://schemas.microsoft.com/msn/portal/controls">
  <head>
    <title>Welcome to MSN.com</title>
  </head>
</html>

Test 52

Before
<html xmlns:portal="http://schemas.microsoft.com/msn/portal/controls"><head><title>Welcome to MSN.com</title>
After
<html xmlns:portal="http://schemas.microsoft.com/msn/portal/controls">
  <head>
    <title>Welcome to MSN.com</title>
  </head>
</html>

Viewing 4 of 4 comments: view all
It seems like the examples have gone, but it would be useful if they reappeared :)

- Regin
Posted 03:58, 13 May 2011
@kvakulo the examples are dynamically rendered from the repo. we forgot to update the link when we moved the code to github. thanks for pointing it out!
Posted 05:46, 13 May 2011
@SteveB thanks for the quick fix!
Posted 05:48, 13 May 2011
Hi Team, the examples are gone again. Could you bring them back?

Thanks
Posted 16:36, 24 Jun 2011
