Overview
This section describes the motivation behind Web-Harvest and the notions and concepts it uses.
Rationale
The World Wide Web, though by far the largest knowledge base, is rarely regarded as a database in the traditional sense, that is, as a source of information used for further computing. Web-Harvest is inspired by the practical need to have the right data at the right time. Very often, the Web is the only source that publicly provides the wanted information.
Basic concept
The main goal behind Web-Harvest is to empower the usage of already existing extraction technologies. Its purpose is not to propose a new method, but to provide a way to easily use and combine the existing ones. Web-Harvest offers a set of processors for data handling and control flow. Each processor can be regarded as a function: it has zero or more input parameters and gives a result after execution. Processors can be combined in a pipeline, forming a chain of execution. For easier manipulation and data reuse, Web-Harvest provides a variable context in which named variables are stored. The following diagram describes one pipeline execution:
The result of extraction is available in files created during execution or, if Web-Harvest is used programmatically, from the variable context.
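The idea of processors as functions chained into a pipeline, with results kept in a named variable context, can be sketched in a few lines of Python. This is only an analogy under stated assumptions, not how Web-Harvest is implemented: the helper run_pipeline and the three stand-in processors (fetch, strip_whitespace, strip_tags) are hypothetical names invented for this illustration.

```python
from typing import Any, Callable, Dict, List

# A "processor" is modelled as a plain function: one input, one result.
Processor = Callable[[Any], Any]


def run_pipeline(processors: List[Processor], initial: Any = None) -> Any:
    """Feed the result of each processor into the next one in the chain."""
    value = initial
    for processor in processors:
        value = processor(value)
    return value


# Hypothetical stand-ins for real processors (illustrative names only).
def fetch(_: Any) -> str:
    # Pretends to download a page; returns canned content instead.
    return "  <b>Hello</b>  "


def strip_whitespace(text: str) -> str:
    return text.strip()


def strip_tags(text: str) -> str:
    return text.replace("<b>", "").replace("</b>", "")


# The variable context is modelled as a dictionary of named values.
context: Dict[str, Any] = {}
context["greeting"] = run_pipeline([fetch, strip_whitespace, strip_tags])
print(context["greeting"])
```

Each stage only sees the result of the previous one, which is what makes it possible to recombine existing stages freely; the dictionary plays the role of the variable context from which a caller could read results after execution.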
Configuration language
Every extraction process is defined in one or more configuration files, using a simple XML-based language. Each processor is described by a specific XML element or a structure of XML elements. As an illustration, here is an example configuration file:
<?xml version="1.0" encoding="UTF-8"?>

<config charset="UTF-8">
    <var-def name="urlList">
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://news.bbc.co.uk"/>
            </html-to-xml>
        </xpath>
    </var-def>

    <loop item="link" index="i" filter="unique">
        <list>
            <var name="urlList"/>
        </list>
        <body>
            <file action="write" type="binary" path="images/${i}.gif">
                <http url="${sys.fullUrl('http://news.bbc.co.uk', link)}"/>
            </file>
        </body>
    </loop>
</config>
This configuration contains two pipelines. The first pipeline performs the following steps:
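1. The content at http://news.bbc.co.uk is downloaded,
2. The downloaded HTML is cleaned up,
3. An XPath expression finds the URLs of the images on the page, giving a sequence of URLs,
4. A new variable named urlList is defined, containing the sequence of image URLs.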
The second pipeline uses the result of the previous execution in order to collect all page images:
1. The loop processor iterates over the sequence of URLs and, for each item:
2. downloads the image at the current URL,
3. saves the image to the file system.
This example illustrates some procedural-language elements of Web-Harvest, such as variable definition and list iteration, a few data management processors (file and http) and a couple of HTML/XML processing instructions (the html-to-xml and xpath processors).
For a slightly more complex example of image download, where some other features of Web-Harvest are used, see the Examples page. For technical coverage of the supported processors, see the User manual.
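For readers who prefer to see the data flow of the two pipelines in a general-purpose language, the following Python sketch mirrors the same steps using only the standard library. It is a rough analogy, not a translation of Web-Harvest internals. Several things are assumptions made for the illustration: the page content is a canned string (SAMPLE_PAGE) instead of a real download, html.parser stands in for the combination of HTML cleaning and the //img/@src XPath query, filter="unique" is assumed to drop duplicate items, the loop index is assumed to start at 1, and urljoin stands in for sys.fullUrl. The sketch stops short of downloading and writing the images and only computes which URL would go to which file.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def unique(items: List[str]) -> List[str]:
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Canned page content used instead of a real HTTP download (assumption).
SAMPLE_PAGE = """
<html><body>
<img src="/img/logo.gif">
<img src="http://example.org/banner.gif">
<img src="/img/logo.gif">
</body></html>
"""
BASE_URL = "http://news.bbc.co.uk"

# First pipeline: page content -> list of image URLs ("urlList").
collector = ImgSrcCollector()
collector.feed(SAMPLE_PAGE)
url_list = collector.sources

# Second pipeline: iterate over the unique URLs and pair each target
# file name with the absolute URL that would be downloaded into it.
targets: Dict[str, str] = {}
for i, link in enumerate(unique(url_list), start=1):
    targets[f"images/{i}.gif"] = urljoin(BASE_URL, link)

for path, url in targets.items():
    print(path, "<-", url)
```

The relative link is resolved against the base URL while the absolute one is left unchanged, which is the behaviour the sample configuration relies on when it builds the full image URL.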
All data produced and consumed during the extraction process in Web-Harvest has three representations: text, binary and list. There is also a special data value, empty, whose textual representation is an empty string, whose binary representation is an empty byte array and whose list representation is a zero-length list. Which form of the data is used depends on the processor that consumes it. In the previous configuration, the html-to-xml processor uses the downloaded content as text in order to transform it to XML, the loop processor uses the variable urlList as a list in order to iterate over it, and the file processor treats the downloaded images as binary data when saving them to files. In most cases the proper representation of the data is chosen by Web-Harvest. However, in some situations it must be explicitly stated which one to use. One example is the file processor, where the default data type is text and binary content must be explicitly specified with type="binary".
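The idea of one value with three representations can be illustrated with a small Python class. This is a hypothetical sketch, not the Web-Harvest data model: the class name Value and its methods are invented here, and the conversion rules (decoding bytes with a charset, joining list items with newlines, wrapping a single value in a one-element list) are assumptions chosen for the illustration. Only the behaviour of the empty value follows directly from the text above.

```python
from typing import List, Union

Raw = Union[str, bytes, list, None]


class Value:
    """A value that can be viewed as text, as binary data or as a list."""

    def __init__(self, raw: Raw = None, charset: str = "UTF-8") -> None:
        self.raw = raw          # None models the special "empty" value
        self.charset = charset

    def as_text(self) -> str:
        if self.raw is None:
            return ""
        if isinstance(self.raw, bytes):
            return self.raw.decode(self.charset)
        if isinstance(self.raw, list):
            # Assumption: list items are joined with newlines.
            return "\n".join(item.as_text() for item in self.raw)
        return self.raw

    def as_binary(self) -> bytes:
        if self.raw is None:
            return b""
        if isinstance(self.raw, bytes):
            return self.raw
        return self.as_text().encode(self.charset)

    def as_list(self) -> List["Value"]:
        if self.raw is None:
            return []
        if isinstance(self.raw, list):
            return self.raw
        # Assumption: a single value behaves as a one-element list.
        return [self]


empty = Value()
print(repr(empty.as_text()), repr(empty.as_binary()), empty.as_list())

page = Value("<html/>")
print(page.as_binary(), len(page.as_list()))
```

The consumer decides which view it asks for, which corresponds to the file processor needing type="binary" when the content should not be treated as text.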
Web-Harvest provides the variable context for storing and using variables. Unlike most programming languages, it imposes no special convention for naming variables. Thus, names like arr[1], 100 or #$& are valid. However, if such variables were used in scripts or templates (see next section), where expressions are dynamically evaluated, an exception would be thrown. It is therefore recommended to follow the usual programming-language naming rules in order to avoid any difficulties.
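The difference between storing a name and referring to it in an expression can be shown with a short Python sketch. It uses Python's own identifier rules as a stand-in, which is an assumption: the exact rules of the scripting languages supported by Web-Harvest may differ in detail. The function usable_in_expression is a hypothetical helper written for this illustration.

```python
import keyword


def usable_in_expression(name: str) -> bool:
    """True if the name could be referenced as a plain identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Any string works as a key in the context itself...
context = {"arr[1]": "a", "100": "b", "#$&": "c", "urlList": []}

# ...but only conventional names can be referenced from an expression.
for name in context:
    status = "ok" if usable_in_expression(name) else "not usable in expressions"
    print(f"{name!r}: {status}")
```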
When Web-Harvest is used programmatically (from Java code, not from the command line), the variable context may be initially populated by the user in order to add custom values and functionality. Similarly, after execution, the variable context is available for reading variables from it.
When user-defined functions are called (see the User manual), a separate local variable context is created (as in many programming languages, including Java). The valid way to exchange data between the caller and the called function is through the function parameters.
Besides the set of powerful text and XML manipulation processors, Web-Harvest supports real scripting languages whose code can be easily integrated within scraper configurations. The languages currently supported are BeanShell, Groovy and JavaScript. BeanShell is probably the closest to Java in syntax and power, but Groovy and JavaScript have some other advantages. It is up to the developer to use the preferred language or even to mix different languages in a single configuration.
Templating allows evaluation of marked parts of the text (text "islands" surrounded with ${ and }). Evaluation is performed using the chosen scripting language. In Web-Harvest, the attributes of all elements are implicitly passed to the templating engine. In the configuration above, there are two places where the templater does its job:
1. path="images/${i}.gif" in the file processor, producing file names based on the loop index,
2. url="${sys.fullUrl('http://news.bbc.co.uk', link)}" in the http processor, where built-in functionality is called to calculate the full URL of the image (see the User manual for all built-in objects).
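The mechanics of evaluating ${...} islands can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Web-Harvest templating engine: the function render is a hypothetical name, Python's eval stands in for the chosen scripting language, and the sys object below is a stand-in whose fullUrl simply delegates to urljoin. Because eval executes arbitrary expressions, a sketch like this should only ever be fed trusted templates.

```python
import re
from types import SimpleNamespace
from typing import Any, Dict
from urllib.parse import urljoin

# Matches one ${...} island; the expression must not contain "}".
ISLAND = re.compile(r"\$\{(.*?)\}")


def render(template: str, context: Dict[str, Any]) -> str:
    """Replace every ${expr} island with the evaluated value of expr."""

    def evaluate(match: "re.Match[str]") -> str:
        # Expressions see only the variable context, no Python builtins.
        return str(eval(match.group(1), {"__builtins__": {}}, context))

    return ISLAND.sub(evaluate, template)


# Stand-in for the built-in sys object (assumption: fullUrl ~ urljoin).
context = {
    "i": 3,
    "link": "/img/logo.gif",
    "sys": SimpleNamespace(fullUrl=urljoin),
}

print(render("images/${i}.gif", context))
print(render("${sys.fullUrl('http://news.bbc.co.uk', link)}", context))
```

Text outside the islands is copied through unchanged, so an attribute without any ${...} marker is simply returned as it is.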