Web-based data mining
Automatically extract information with HTML, XML, and Java
Level: Advanced

Jussi Myllymaki (mailto:jussi@almaden.ibm.com?subject=Web-based data mining), Researcher, IBM

01 Jun 2001

The World Wide Web is now undeniably the richest and most dense source of information the world has ever seen, yet its structure makes it difficult to use that information in a systematic way. The methods and tools described in this article will enable developers familiar with the most common technologies of the Web to quickly and easily extract the Web-delivered information they need.

The rapid growth of the World Wide Web has led to a prolific distribution of a wide variety of public information. Unfortunately, while HTML, the major carrier of this information, provides a convenient way to present information to human readers, it can be a challenging structure from which to automatically extract information relevant to a data-driven service or application.

A variety of approaches have been taken to solve this problem. Most take the form of a proprietary query language that maps sections of an HTML page into code that populates a database with information from the Web page. While these approaches may offer some advantages, most are impractical for two reasons: first, they require a developer to take the time to learn a query language that cannot be used in any other setting, and second, they are not robust enough to survive the simple, inevitable changes to the Web pages they target.

In this article, a method for Web-based data mining is developed using the standard technologies of the Web -- HTML, XML, and Java. This method is as powerful as, if not more powerful than, proprietary solutions, and for those already familiar with these technologies it takes little effort to produce robust results. As an added bonus, much of the code needed to begin data extraction is included with this article.

HTML is often a difficult medium to work with programmatically.
The majority of the content of Web pages describes formatting irrelevant to a data-driven system, and document structure can change as often as every connection to the page, due to dynamic banner ads and other server-side scripting. The problem is further compounded by the fact that a major portion of all Web pages is not well-formed, a result of the leniency of modern Web browsers' HTML parsing.

Despite these problems, there are advantageous aspects of HTML for data miners. Interesting data can often be isolated to a single region of the page, such as a table, so that the bulk of the document can be ignored.
The key to the data mining technique described here is to convert existing Web pages into XML, or perhaps more appropriately XHTML, and then use one of the many tools for working with XML-structured data to retrieve the relevant pieces. Fortunately, a solution exists for correcting much of the uneven design of HTML pages. Tidy, available as a library in several programming languages, is a free tool for correcting common mistakes in HTML documents and producing equivalent documents that are well-formed. Tidy can also render these documents in XHTML, a reformulation of HTML that is well-formed XML (see Resources). The code examples in this article are written in Java and require the Tidy jar file to be on the classpath.
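The article's own conversion listing is not reproduced in this excerpt. As a rough sketch of the Tidy step, the JTidy port of the library (package org.w3c.tidy) can be driven as below; the class and method names here are illustrative, not the article's, and the example assumes the JTidy jar is on the classpath.

```java
// Sketch only: assumes the JTidy library (org.w3c.tidy) is available.
import org.w3c.tidy.Tidy;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public class HtmlToXhtml {
    // Fetch a page over HTTP and write a corrected, well-formed XHTML copy of it.
    public static void convert(String pageUrl, OutputStream out) throws IOException {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);         // emit XHTML rather than plain HTML
        tidy.setQuiet(true);         // suppress progress messages
        tidy.setShowWarnings(false); // ignore the (many) warnings a real page produces
        try (InputStream in = new URL(pageUrl).openStream()) {
            tidy.parse(in, out);     // repairs common HTML errors on the way through
        }
    }
}
```

Pointing `convert` at the weather page URL and a `FileOutputStream` would produce the kind of XHTML view shown in Figure 3.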
Overview of the approach and introduction of an example

We introduce the method of data extraction by means of an example. Suppose we are interested in tracking the temperature and humidity levels of Seattle, Washington, at various times of the day over the course of a few months. Supposing no off-the-shelf software for this kind of reporting fits our needs, we are still left with the opportunity to glean this information from one of many public Web sites.

Figure 1 illustrates an overview of the extraction process. Web pages are retrieved and processed until a data set is created that can be incorporated into an existing data set.

Figure 1. An overview of the extraction process

In a few short steps, we will have a reliable system in place that gathers just the information we need. The steps, listed here to give a brief overview of the process, are:

1. Identify the source and obtain it as XHTML.
2. Find a reference point, or anchor, for the data.
3. Write an XSL file that maps the data near the anchor into the XML output we want.
4. Repeat the extraction, merging and processing the results.
Each of these steps will be explained in detail and the code necessary to execute them will be provided.
Obtaining the source information as XHTML

In order to extract data, we of course need to know where to find it. In most cases the source will be obvious. If we wanted to keep a collection of the titles and URLs of articles from developerWorks, we would use http://www.ibm.com/developerworks/ as our target. In the case of the weather, we have several sources to choose from. We will use Yahoo! Weather in the example, though others would have worked equally well. In particular, we will be tracking the data on the URL http://weather.yahoo.com/forecast/Seattle_WA_US_f.html. A screen shot of this page is shown in Figure 2.

Figure 2. The Yahoo! Weather Web page for Seattle, Washington

In considering a source, it is important to keep these factors in mind: how reliable and stable the page and its URL are, and whether the data you need appears in a consistent, content-identifiable region of the page.
While we are looking for robust solutions that will work in dynamic environments, our work will be easiest when extracting from the most reliable and stable sources available. Once the source is determined, our first step in the extraction process is to convert the data from HTML to XML. We will accomplish this and other XML-related tasks by constructing a helper Java class, using the functionality provided by the Tidy library to perform the conversion. The converted page is shown in Figure 3.

Figure 3. The Yahoo! Weather Web page converted to XHTML
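Once Tidy has produced well-formed XHTML, any standard JAXP parser can load it into a DOM tree for the steps that follow. A minimal sketch (the class name `XhtmlLoader` is hypothetical, not one of the article's listings):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XhtmlLoader {
    // Parse a well-formed XHTML string into a DOM Document.
    public static Document load(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // XHTML declares a namespace on real pages
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

In a real pipeline the input would be the stream Tidy produced rather than an in-memory string.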
Finding a reference point for the data

Notice that the vast majority of information in either the Web page or the source XHTML view is of absolutely no concern to us. Our next task, then, is to locate a specific region in the XML tree from which we can extract our data without concerning ourselves with the extraneous information. For more complex extractions, we may need to find several instances of these regions on a single page.

Accomplishing this is usually easiest by first examining the Web page and then working with the XML. Simply looking at the page shows us that the information we are looking for is in a section in the upper-middle part of the page. With even a limited familiarity with HTML, it is easy to infer that the data we are looking for is probably all contained under the same parent element, most likely a table.

Making note of our observations, we now consider the XHTML that the page produced. A text search for "Appar Temp" reveals, as shown in Figure 4, that the text is indeed enclosed in a table containing all of the data we need. We will make this table our reference point, or anchor.

Figure 4. The anchor is found by looking for a table containing the text "Appar Temp"

Now we need a way to locate this anchor. Since we are going to be using XSL to transform the XML we have, we can use XPath expressions for this task. The trivial choice would be an absolute path from the root element all the way down to the anchor table. Such a path breaks as soon as any element along it changes, so a better choice is to select the table by the content it carries -- or, even better, to take advantage of the way XSL converts XML trees to strings and match the table whose string value contains our marker text.
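The article's original XPath expressions were lost from this excerpt, so the following is a reconstruction of the content-based approach using the JDK's XPath engine: it selects the table whose string value contains the marker text, taking the last match in document order so that nested tables resolve to the innermost one. The class name `AnchorFinder` is hypothetical.

```java
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class AnchorFinder {
    // Return the innermost table whose text content contains the marker string,
    // or null if no such table exists.
    public static Node findAnchor(Document doc, String marker) throws Exception {
        // contains(., '...') tests the table's string value, i.e. all nested text,
        // so the expression survives formatting changes inside the table.
        String expr = "(//table[contains(., '" + marker + "')])[last()]";
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODE);
    }
}
```

This is exactly the robustness trade described in the text: the anchor is tied to the words "Appar Temp" rather than to the page's layout.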
With this anchor in hand, we can create the code that will actually extract our data. This code takes the form of an XSL file. The goal of the XSL file is to identify the anchor, specify how to get from that anchor to the data we are looking for (in short hops), and construct an XML output file in the format we want. This process is really much simpler than it sounds. The code for the XSL that will do this is given in Listing 2 and is also available as an XSL text file.

Of course, just writing the XSL will not get the job done; we also need a tool that performs the transformation. For this, we rely once more on functionality in our helper class. The code is given in Listing 3.
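Listings 2 and 3 are not reproduced in this excerpt. As a sketch of the transformation step, the JDK's built-in TrAX API (javax.xml.transform) can apply an XSL stylesheet without any third-party code; the class name `XslTransformer` is illustrative.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import java.io.StringReader;
import java.io.StringWriter;

public class XslTransformer {
    // Apply an XSL stylesheet to an XML document, both supplied as strings,
    // and return the transformed output.
    public static String transform(String xml, String xsl) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

In the article's pipeline, the XML input would be the Tidy-converted page and the stylesheet would be the extraction XSL of Listing 2.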
Merging and processing the results

If we were only performing the data extraction once, we would now be done. However, we don't just want to know the temperature at one time, but at several different times. All we need to do now is to repeat our extraction process over and over again, merging the results into a single XML data file. We could again use XSL to do this, but instead we will create one last method for merging XML files in our helper class. The code that drives the whole process simply repeats the extraction and calls this merge method; the accumulated results are shown in Figure 5.

Figure 5. The results of our Web extraction
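The merge method itself is not reproduced here; a method along these lines, using DOM's `importNode` to copy each new batch of results under a single accumulating root, is one plausible shape for it (the class name `XmlMerger` is hypothetical):

```java
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class XmlMerger {
    // Append every child of the source document's root element under the
    // target document's root element, so repeated extractions accumulate
    // in one XML data set.
    public static void merge(Document target, Document source) {
        Node targetRoot = target.getDocumentElement();
        Node child = source.getDocumentElement().getFirstChild();
        while (child != null) {
            // importNode(..., true) deep-copies the node into the target document.
            targetRoot.appendChild(target.importNode(child, true));
            child = child.getNextSibling();
        }
    }
}
```

Serializing the target document after each run yields the single growing XML file the article describes.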
In this article, we have described and demonstrated the fundamentals of a robust approach for extracting information from the largest source of information in existence, the World Wide Web. We have also included the coding tools necessary to enable any Java developer to begin his or her own extraction work with a minimum of effort and extraction experience. While the example in the article focused on extracting weather information about Seattle, Washington, nearly all of the code presented here is reusable for any data extraction; aside from minor source-specific changes, little needs to be rewritten for a new task.

The method is as simple as it is sound. By wisely choosing data sources that are reliable, and by picking anchors within those sources that are tied to content rather than format, you can have a low-maintenance, reliable data extraction system -- and, depending on your level of experience and the amount of data to extract, you could have it up and running in less than an hour.