In this tutorial, we create a screen scraping application that does NOT use Dream. Its purpose is to go to a website, pull the search results from its page, and output a list of all the returned articles as a webpage. In this example we use the .NET libraries, including the XML and Uri classes that already ship with C#, together with SgmlReader, to fulfill this task. For a more efficient, less convoluted way to produce the same results, see the Dream-based version in the Search-App tutorial.



This tutorial shows you how to:
  • Create and build URIs using the Uri and UriBuilder class
  • Send HTTP requests using the WebRequest/WebResponse class
  • Read an HTML document using the SgmlReader, XmlDocument, XmlNode, XmlNodeReader, and XmlNodeList Classes
  • Build an XML document using the XmlTextWriter Class
  • Convert the built XML document into an HTML file using the XslCompiledTransform class

 

In this tutorial we will also compare writing this application without the Dream library and with the Dream library.

Getting Started

As in the Search-App tutorial, we want to set up a SearchForm with a text box and a button that triggers the processing of the user's input.

Since we aren't using the Dream library and we need to parse HTML, we're going to need the open source SgmlReader library. It's available from MindTouch's public svn directory. Once it is downloaded, add SgmlReaderDll.dll (located in the /redist folder) and System.Web to the project References and import their namespaces with the using keyword.

 

using Sgml; 
using System.Web;

 

We start off with creating a form that prompts the user with a search field. It should look something like this, but feel free to customize it to look any way you want it to:

[Figure: example search form]

In the Search... button's properties, under Events, activate the Click event. The handler function this generates is where we invoke the function that handles the user's input.

 


private void SearchClicked(object sender, EventArgs e){
    HandleSearch(searchBox.Text.Trim());
}

With the form set up, we create the function HandleSearch(string user_input) that handles the user's request. In our example, we use the New York Times' article search engine, but feel free to use any one you would like.

(Note: The site you are screen scraping is not aware that you are doing so, which means it may change without any warning to you. In addition, we recommend you use a site that produces structurally correct DOM trees. If it does not, the XmlDocument class (or XDoc, in Dream) may have trouble traversing it and fail to return the wanted results.)

Let's fill in the code to handle the user's input.

Creating URIs using C#'s Uri/UriBuilder Classes

We want to first create a base Uri that holds the site domain without any additional paths or queries. After that, we put the base in the UriBuilder constructor and create a new UriBuilder so that we can attach paths and queries to our URI.

(Note: the "Without Dream" code is the focus of this tutorial. The "With Dream" code shows how the same section could have been written using the Dream library.)

Without Dream:

Uri nyt_uri = new Uri("http://query.nytimes.com/");
UriBuilder uri_build = new UriBuilder(nyt_uri);

Now we add the path:

uri_build.Path = "search/query";

...encode the user_input and then add the query string:

string encode_input = HttpUtility.UrlEncode(user_input);
uri_build.Query = "query=" + encode_input + "&srchst=nyt&n=100";

With Dream:

First, we build the Uri using XUri. Then, using chaining, we append the path and the query to the Uri:

XUri nyt_xuri = new XUri("http://query.nytimes.com/");
XUri nyt_full_uri = nyt_xuri.At("search", "query").With("query", user_input).With("srchst", "nyt").With("n", "100");

 

Both the "Without Dream" and the "With Dream" code produce this URI (where encode_input stands for the URL-encoded search text):

        http://query.nytimes.com/search/query?query=encode_input&srchst=nyt&n=100

However, the Dream version does exactly the same thing in only two lines.
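The Uri/UriBuilder steps can be collected into one small helper, shown here as a minimal sketch. The class and method names (SearchUriExample, BuildSearchUri) are hypothetical, and Uri.EscapeDataString stands in for HttpUtility.UrlEncode so that the example does not need a System.Web reference. Note that the two encoders differ slightly: EscapeDataString turns a space into %20, while UrlEncode produces +.

```csharp
using System;

public static class SearchUriExample
{
    // Hypothetical helper: builds the search URI from the base address,
    // the path, and the encoded query, following the tutorial's steps.
    public static Uri BuildSearchUri(string userInput)
    {
        UriBuilder builder = new UriBuilder(new Uri("http://query.nytimes.com/"));
        builder.Path = "search/query";

        // Assumption: Uri.EscapeDataString is used instead of
        // HttpUtility.UrlEncode; it percent-encodes reserved characters.
        string encoded = Uri.EscapeDataString(userInput);
        builder.Query = "query=" + encoded + "&srchst=nyt&n=100";

        // The Uri property returns the assembled, normalized URI.
        return builder.Uri;
    }

    public static void Main()
    {
        Console.WriteLine(BuildSearchUri("climate change").AbsoluteUri);
    }
}
```

Returning builder.Uri rather than builder.ToString() keeps the result as a Uri object, which is what WebRequest.Create expects in the next step.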

Handling Request/Response using .NET's WebRequest and WebResponse

Next, we create a temporary file, which we will need later when we open the result in the browser. After that, we take the Uri, send a request using .NET's WebRequest class, and handle the response using .NET's WebResponse class:

 

Without Dream:

First, generate a temporary file in which we will store our result (you will need this later):

string fname = Path.GetTempFileName();
string filename = fname + ".xml";

 

Next, we create the WebRequest and specify that we're using HTTP GET. Then we send the request with the GetResponse() call and handle the response as a WebResponse:

WebRequest request = WebRequest.Create(uri_build.Uri);
request.Method = "GET";
using(WebResponse response = request.GetResponse()) {

Since we can't use the WebResponse in its raw form, we need to hand its stream to a StreamReader:

using(StreamReader reader = new StreamReader(response.GetResponseStream())) {

Moreover, because we are using the Sgml library that we added to our references earlier, we now need to initialize the SgmlReader and pass it the StreamReader we just created:

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "html";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;

With Dream:

In Dream, we create a Plug with the XUri we just built and request a response from the webpage using a Get() call:

Plug plug = Plug.New(nyt_full_uri);
DreamMessage message = plug.Get();

 

Comparing the two versions, the Dream code sends the request and receives the response in only two lines.
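For reference, the request/response handling without Dream can be reduced to a single helper that returns the response body as a string. This is only a sketch: the names FetchExample and Fetch are hypothetical, it skips the SgmlReader step, and it does no error handling. WebRequest is the class this tutorial uses; newer .NET versions mark it as obsolete in favor of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;

public static class FetchExample
{
    // Hypothetical helper: sends a GET request and returns the body as text.
    public static string Fetch(Uri uri)
    {
        WebRequest request = WebRequest.Create(uri);
        request.Method = "GET";

        // The using statements dispose the response and the reader
        // even if reading throws an exception.
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: FetchExample <url>");
            return;
        }
        Console.WriteLine(Fetch(new Uri(args[0])));
    }
}
```

Because both objects are wrapped in using statements, no explicit Close() calls are needed in this variant.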

Creating the XML Document

To write the XML document, we use .NET's XmlTextWriter class to build the tags for our search app result:

Without Dream:

We initialize the XmlTextWriter by passing the temporary file we created earlier to its constructor:

Encoding unicode = Encoding.UTF8; // set XML encoding
using(XmlTextWriter xtw = new XmlTextWriter(filename, unicode)) {

Note: ideally, nothing would be written to the file until the entire document has been built, in case an error occurs partway through. For simplicity, we start writing right away (the false argument sets standalone="no" in the XML declaration):

xtw.WriteStartDocument(false);

Now, we begin to construct our HTML document:

xtw.WriteStartElement("html");
xtw.WriteStartElement("head");
xtw.WriteElementString("title", user_input);
xtw.WriteEndElement();
xtw.WriteStartElement("body");

xtw.WriteElementString("h3", "Search Results for: " + user_input);
xtw.WriteStartElement("table");
xtw.WriteAttributeString("border", "1");

With Dream:

XDoc output = new XDoc("html"); // html wrapper
output.Start("head").Elem("title", user_input).End();
output.Start("body");
output.Elem("h3", "Search Results for: " + user_input);
output.Start("table").Attr("border", "1"); // table for results

Note: we don't need to create a temporary file until later, after we're done and ready to flush the entire XML document to a file.

 

Logically, building the application's XML document with Dream is no different from building it without Dream. However, notice how much more readable it is to handle tags and attributes by chaining the methods that belong together (e.g. output.Start("table").Attr("border", "1");).
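One way to avoid writing a half-finished document to disk is to build it in memory first. The sketch below writes the same tags with XmlTextWriter into a StringWriter instead of a file. The names ResultPageWriterExample and BuildPage are hypothetical, the XML declaration is left out, and the rows are filled with plain titles rather than scraped nodes.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ResultPageWriterExample
{
    // Hypothetical helper: builds the result page in memory and returns it.
    public static string BuildPage(string userInput, string[] titles)
    {
        StringWriter buffer = new StringWriter();
        using (XmlTextWriter xtw = new XmlTextWriter(buffer))
        {
            xtw.WriteStartElement("html");
            xtw.WriteStartElement("head");
            xtw.WriteElementString("title", userInput);
            xtw.WriteEndElement(); // close head
            xtw.WriteStartElement("body");
            xtw.WriteElementString("h3", "Search Results for: " + userInput);
            xtw.WriteStartElement("table");
            xtw.WriteAttributeString("border", "1");

            // one table row per title
            foreach (string title in titles)
            {
                xtw.WriteStartElement("tr");
                xtw.WriteElementString("td", title);
                xtw.WriteEndElement(); // close tr
            }

            xtw.WriteEndElement(); // close table
            xtw.WriteEndElement(); // close body
            xtw.WriteEndElement(); // close html
        }
        return buffer.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildPage("dream", new string[] { "First article", "Second article" }));
    }
}
```

The returned string can then be written to the temporary file in one call (for example with File.WriteAllText) once the whole document has been built without errors.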

Parsing the Response

Now that our XML document has its opening HTML tags in place, we can fill the table with the article information we parse from the response:

 

Without Dream:

Create an XmlDocument to handle the XPath needed to reach the content you wish to extract from the response HTML:

try {
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);

    XmlNode xroot = doc.DocumentElement;

    // traverse the path to the particular list element you want to look at
    XmlNodeList nList = xroot.SelectNodes("body/div/div/div/ol/li");

Parse the main content for the article title/link, article snippet, and article information:

    foreach(XmlNode xNode in nList) {
        xtw.WriteStartElement("tr");
        xtw.WriteStartElement("td");
        XmlNode h3 = xNode["h3"]; // find title/link to article
        if(h3 != null) {
            xtw.WriteNode(new XmlNodeReader(h3), true);
        }
        XmlNode p = xNode["p"]; // find article snippet
        if(p != null) {
            xtw.WriteNode(new XmlNodeReader(p), true);
        }
        XmlNode div = xNode["div"]; // find article information
        if(div != null) {
            xtw.WriteNode(new XmlNodeReader(div), true);
        }
        xtw.WriteEndElement(); // close td
        xtw.WriteEndElement(); // close tr
    }
} catch(Exception e) {
    Console.WriteLine(e);
    return;
}

Finally, we close the tags, flush the content out into our prepared file and then close the file:

    xtw.WriteEndElement(); // close table
    xtw.WriteEndElement(); // close body
    xtw.WriteEndElement(); // close html
    xtw.Flush();
    xtw.Close();
}

With Dream:

We create another XDoc to handle the XPath for the HTML document we're passing in from DreamMessage:

XDoc doc = message.AsDocument();
foreach(XDoc entry in doc["body/div/div/div/ol/li"]) {
    output.Start("tr").Start("td");
    output.Add(entry["h3"]);    // retrieve article title and link
    output.Add(entry["p"]);     // retrieve article snippet
    output.Add(entry["div"]);   // retrieve article info (author, date, number of words)
    output.End().End();         // close off td and tr
}

 

Close the end of the table and the body:

output.End().End();
}

 

Create a temporary file and flush the result out to it in XHTML format:

    string filename = Path.GetTempFileName() + ".html";
    File.WriteAllText(filename, output.ToXHtml());
}
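To see the XPath selection in isolation, here is a small self-contained sketch that runs SelectNodes over a hand-written, well-formed sample instead of the SgmlReader output. The names ResultParserExample and ExtractTitles, the sample markup, and the shortened path "body/div/ol/li" are all assumptions made for illustration; the real page needs the longer path used in the tutorial.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class ResultParserExample
{
    // Hypothetical helper: returns the text of the h3 child of every list item.
    public static List<string> ExtractTitles(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.XmlResolver = null; // do not fetch external DTDs
        doc.LoadXml(xhtml);

        List<string> titles = new List<string>();
        // the path is relative to the root <html> element
        XmlNodeList items = doc.DocumentElement.SelectNodes("body/div/ol/li");
        foreach (XmlNode item in items)
        {
            // the indexer returns the first child element named h3, or null
            XmlNode h3 = item["h3"];
            if (h3 != null)
            {
                titles.Add(h3.InnerText);
            }
        }
        return titles;
    }

    public static void Main()
    {
        // Assumed sample markup; the second item has no h3 and is skipped.
        string sample =
            "<html><body><div><ol>" +
            "<li><h3><a href=\"/a\">First</a></h3><p>snippet</p></li>" +
            "<li><p>no title here</p></li>" +
            "<li><h3>Second</h3></li>" +
            "</ol></div></body></html>";

        foreach (string title in ExtractTitles(sample))
        {
            Console.WriteLine(title);
        }
    }
}
```

The null check matters for the same reason as in the tutorial code: a list item that lacks the expected child should be skipped rather than cause an exception.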

Turning an Xml Document into an Html Document

Unfortunately, without Dream there is no easy way to convert the XML document we just created into XHTML. Therefore, we need to take an extra step in order to create a validating HTML document that works in all browsers. Here, we create an XSL stylesheet and use .NET's XslCompiledTransform class to produce our HTML page.

 

We call ConvertToHtml(fname, filename), a method we write to handle the XSL transform of the XML document into an HTML document:

ConvertToHtml(fname, filename);

The method itself:

// cached transform, loaded on first use
private XslCompiledTransform _transform;

public void ConvertToHtml(string fname, string xmlName) {

    // initialize the transform once
    if(_transform == null) {
        _transform = new XslCompiledTransform();

        // load format.xsl (which you have to create on your own) from the
        // assembly's embedded resources to use as the stylesheet
        using(System.IO.Stream stream = System.Reflection.Assembly.GetExecutingAssembly().GetManifestResourceStream("SearchSample_withoutDream.format.xsl")) {
            using(XmlReader reader = new XmlTextReader(stream)) {
                _transform.Load(reader);
            }
        }
    }

    // name of the HTML file to produce
    string htmlName = fname + ".html";

    // perform the transformation, turning the written XML into validating HTML
    _transform.Transform(xmlName, htmlName);
}

The ConvertToHtml method loads a pre-created XSL file to act as the stylesheet. These are the contents of format.xsl, which we need to include in our project (as an embedded resource, since the code loads it with GetManifestResourceStream):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <!-- indicates what our output DocType is going to be -->
    <xsl:output method="html" version="4.0" doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN" doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>

    <!-- include all original nodes in the html element, except for the XML declaration at the top -->
    <xsl:template match="/html">
        <xsl:copy-of select="current()"/>
    </xsl:template>
</xsl:stylesheet>
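The transform can also be tried out entirely in memory, which is handy for checking a stylesheet before embedding it. The sketch below is an assumption-laden simplification: the names XsltExample and ToHtml are hypothetical, the stylesheet is inlined as a string, and the doctype attributes are omitted.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltExample
{
    // Simplified version of format.xsl (doctype attributes omitted).
    private const string Stylesheet =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
        "<xsl:output method=\"html\"/>" +
        "<xsl:template match=\"/html\"><xsl:copy-of select=\".\"/></xsl:template>" +
        "</xsl:stylesheet>";

    // Hypothetical helper: applies the stylesheet to an XML string.
    public static string ToHtml(string xml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(sheet);
        }

        StringWriter output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        {
            // no stylesheet parameters are needed, so the argument list is null
            transform.Transform(input, null, output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\"?><html><body><p>hi</p></body></html>";
        Console.WriteLine(ToHtml(xml));
    }
}
```

Because the output method is html, the XML declaration from the input does not appear in the result; the exact whitespace of the output depends on the serializer.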

 

Finishing Up

After we're done transforming our XML document into an HTML document, we need to close the streams that we opened at the beginning of the function.

    reader.Close(); // close the StreamReader
} // end using(StreamReader)
    response.Close(); // close the response stream
} // end using(WebResponse)

 

Opening the Browser

Since the Async and Result classes don't exist outside of Dream, we cannot open the temporary file in a browser with a one-line call. Instead, we start a process that launches the browser and passes it the temporary file we created earlier:

 

Without Dream:

Create a LaunchProgram method and call it in HandleSearch:

LaunchProgram(fname + ".html");

The contents of our LaunchProgram method are:

public void LaunchProgram(string fname) {

    // create a new process
    System.Diagnostics.Process proc = new System.Diagnostics.Process();
    proc.StartInfo.FileName = "explorer";
    proc.StartInfo.Arguments = fname;
    proc.Start();
}

With Dream:

We ask the operating system to open a browser on the file containing the screen-scraped article list we just created:

Async.ExecuteProcess("explorer.exe", filename, Stream.Null, new Result<Tuple<int, Stream, Stream>>());

 

Now, all we need to do is build and run the application.

In Conclusion

Comparing the two implementations of screen scraping, without Dream and with Dream, we have seen that:

  • Without Dream, we have to write at least three times as much code.
  • Without Dream, we have to take extra steps and write an XSL file to convert our XML document into an HTML document.
  • With Dream, we only need three classes to create the search-app. Without Dream, we used at least eleven classes.
  • With many of the more trivial tasks abstracted away in Dream, thinking about the logic for search-app becomes more streamlined and less weighed down by the constraints of needing to call many other classes to handle small problems.
