Documentation
Introduction
Templates in the Spb Insight project are used to download and parse web documents of different kinds,
mostly news sites. Since web sites differ widely in page formatting, a template must be created
for every site. A template contains information about the site and the code that parses its content
and cleans it of menus, advertisements and so on.
The template language was designed with ease of use, popularity and conformance to internet standards in mind.
The widespread scripting language JScript
(an implementation of ECMAScript)
was chosen as the base language, wrapped in XML
to enable processing of metadata such as channel names.
One template can contain definitions for multiple channels, because a site usually has several channels that
share the same formatting, so the parsing code for them should be shared.
Sample template
Here is a real-life sample template that parses two BBC channels. (I wish I could post
yet another "Hello, World!" template here, but that seems impossible. If you have an idea for one, please
drop me a note.)
01 <?xml version="1.0"?>
02 <template version="1.0">
03   <channel>
04     <name>BBC - Business</name>
05     <url>http://news.bbc.co.uk/2/low/business/default.stm</url>
06   </channel>
07
08   <channel>
09     <name>BBC - Science &amp; Nature</name>
10     <url>http://news.bbc.co.uk/2/low/science/nature/default.stm</url>
11   </channel>
12
13   <parse_channel> <![CDATA[
14     var d = new Document(channel.url);
15     var tags = d.getElementsByTagName("a");
16
17     for (var i in tags)
18     {
19       var href = tags[i].getAttribute("href");
20       if ( href &&
21            href.indexOf( "default.stm" ) == -1 &&
22            href.indexOf( "/2/low/" ) != -1 &&
23            href.indexOf( "/help/" ) == -1 )
24       {
25         var article = new Article();
26
27         article.id = href;
28         article.url = href;
29         article.header = tags[i].innerText;
30
31         channel.articles.push( article );
32       }
33     }
34   ]]> </parse_channel>
35
36   <parse_article> <![CDATA[
37     var d = new Document(article.url);
38     var nodeHeader = d.getElementsByTagName("h2")[0];
39
40     article.header = nodeHeader.innerText;
41     article.body = "";
42
43     for (var node = nodeHeader.nextSibling; node; node = node.nextSibling)
44       if (node.nodeName != "hr")
45         article.body += node.outerHTML;
46       else
47         break;
48   ]]> </parse_article>
49 </template>
Template representation
A template can be represented as a file or as an online template:
- As a file, it must have the extension ".xml" and be placed in the
/<directory to Spb Insight>/Templates directory on the Pocket PC
(usually /Program Files/Spb Insight/Templates).
The channels from the template can then be added using the New/Local templates menu option.
This representation is typically used to create and debug templates and to share them without the online catalog.
- Online templates can be added by registered users through the
Developer/Add template menu item on the
www.spbinsight.com web site. Once the template has been added and a
category and language have been provided for its channels, the channels become available to all users of the Spb Insight program
through the New/Online catalog menu option.
This method is preferred when a template is worth sharing with the majority of users.
Updating process workflow
Updating a channel from the web is the main purpose of a template. Usually a channel on a web site consists of a
"list of articles" page with the names of and links to the articles, and many "article" pages, each with a single
article with full text and images. These two types of pages are processed in the <parse_channel> and
<parse_article> tags respectively. The code in each of these tags usually downloads an HTML page, parses
it and extracts the relevant information. For <parse_channel> this is a list of articles (with at least
a URL and a header for each article); for <parse_article> it is the full article text, full header, date and
other related information.
As you can see in the sample template above, all the article links from the HTML page are added to the
channel.articles array, although some of them may already have been parsed in previous updates. Spb Insight
handles this automatically: it checks which articles have already been downloaded and parsed and which still need
to be downloaded and parsed during this update session. This behavior is configurable through the id
and dynamicid properties of the article object (see reference below).
There are also sites with no "article" pages, where the full text of all articles is placed on one page.
In this case the <parse_article> tag can be omitted and all information can be extracted directly
in the <parse_channel> tag.
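For the single-page case, a minimal sketch of what the <parse_channel> code might look like is given below. The page markup (each article in a <div class="item"> with an <h3> title) and the helper name collectInlineArticles are assumptions made for illustration only; the Article class, channel.articles and the updated property are the ones described in the reference below.

```javascript
// Sketch for a site that prints every article in full on the list page.
// Assumed (hypothetical) markup: <div class="item"><h3>Title</h3> ...text... </div>
function collectInlineArticles(doc, channel) {
    var items = doc.getElementsByClassName("item");
    for (var i = 0; i < items.length; i++) {
        var titles = items[i].getElementsByTagName("h3");
        if (titles.length == 0)
            continue;                      // no title found, not an article block

        var article = new Article();
        article.header  = titles[0].innerText;
        article.id      = article.header;  // assumes headers are unique within the channel
        article.body    = items[i].innerHTML;
        article.updated = "1";             // full text is already here, skip <parse_article>
        channel.articles.push(article);
    }
}

// Inside <parse_channel> this would be called as:
//   collectInlineArticles(new Document(channel.url), channel);
```

Because every article is marked as updated, no <parse_article> section is needed for such a template.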
XML structure
Note! In the current implementation the encoding of the XML is UTF-8 and cannot be changed.
A template must be a well-formed XML document.
The root tag should be named <template> and has a required attribute,
version. The current version of the templates is 1.0 and corresponds to the current version of the
Spb Insight project.
The following child tags are possible (all of them can be specified multiple times, in any order):
- <channel> - each such tag corresponds to a single channel parsed by this template.
It has two required sub-tags:
  - <name> - the name of the channel.
  - <url> - the URL of the "list of articles" page. It is used in the online catalog to detect
  multiple templates that parse the same site, and is also available as the channel.url member in JScript
  (see reference below).
It can also have other sub-tags, which are available as channel.<name_of_the_tag> members in JScript.
- <parse_channel> - JScript code that is executed when a channel is updated. All channel properties are
available through the global variable channel of type Channel (see reference below).
- <parse_article> - JScript code that is executed to update a new article. Article properties are
available through the global variable article of type Article (see reference below).
The parsing tags contain JScript code, and because the file must be well-formed XML
(text inside an XML tag cannot include a bare '&' symbol; it must be written as '&amp;'),
it is recommended to wrap the code in a CDATA section. Sample:
<parse_article> <![CDATA[

    ... Some JScript code ...

]]> </parse_article>
JScript code
The language we use is an extension of
ECMAScript.
All syntax and semantics of the language, including built-in classes such as String, RegExp and Date, are supported
according to the standard. If you need a tutorial for this language, please search the web; there are plenty of them.
Note that the standard DOM implementation is not included in this language
(i.e. there are no document or window objects).
The extensions to ECMAScript consist of the following three parts:
- Channel/Article classes give simple access to the internal channel/article database. Through the methods
of these classes you can add new articles and edit existing ones. Typical usage is as follows:
  - In the <parse_channel> section there is a global variable channel of type Channel.
  It has the channel.url property, which is used to download the document (line 14 in the sample template above).
  It also has a channel.articles array, which is used to store all objects of type Article that
  can be created and filled from the current HTML page (lines 25-31). Usually article.url and
  article.header are filled in at this stage.
  After the <parse_channel> section has been executed, Spb Insight determines which articles in the
  channel.articles array are new and executes the <parse_article> section for them.
  - In the <parse_article> section there is a global variable article of type Article.
  Usually in this section the article object is modified to contain article.body, which is the
  full article text (lines 41-47). All other properties can also be modified, including article.header
  (line 40) if additional information about it is available at this point.
- Document/Node classes - a simple replacement for the DOM, adapted to the needs
of the application. Typical usage is:
  - Downloading. An HTML page can be downloaded and parsed in one line: var d = new Document(url) (lines 14 and 37),
  resulting in the variable d being the root node of the page.
  - Searching. Using getElementsByTagName, getElementsByAttr and other methods of
  Node objects is a simple way to search for a specific node.
  - String conversion. Each node can be converted to a string representation using the Microsoft-style
  inner/outerText and inner/outerHTML properties.
- log(str)/alert(str) functions - debugging facilities.
Reference
class Channel
The one and only instance of this class is the global variable channel in the <parse_channel> section.
It stores channel-wide settings such as the channel name, URL, etc. Template authors may add custom string fields to this
object, and they will be stored between updates.
The properties of the channel object:
- url (string) - the URL of the "list of articles" page. Taken from the XML node <url>, this property
shows where to download the main page from.
- thumbnail (string) - the URL of the channel icon. Set automatically from the url property as
http://<servername>/favicon.ico, but can be changed if the site's icon has a different URL.
- articles (array of Article objects) - the array that stores all articles that were found on the main page.
After execution completes, Spb Insight determines which of the articles are new (by their id and
dynamicid properties, see below) and downloads them.
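The default thumbnail rule can be illustrated with a small function. This is only a sketch of the rule as stated above, not the program's actual code; the name defaultThumbnail and the regular expression are assumptions.

```javascript
// Derives http://<servername>/favicon.ico from a channel URL.
// Returns undefined when the URL has no recognizable server name.
function defaultThumbnail(url) {
    var m = /^[a-z]+:\/\/([^\/]+)/i.exec(url);
    return m ? "http://" + m[1] + "/favicon.ico" : undefined;
}

// defaultThumbnail("http://news.bbc.co.uk/2/low/business/default.stm")
//   gives "http://news.bbc.co.uk/favicon.ico"
```

If the site keeps its icon elsewhere, a template can simply assign another URL to channel.thumbnail in the <parse_channel> section.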
class Article
This object represents an article. Usually these objects are created in the <parse_channel> section and
filled with basic information there. They are then usually extended in the <parse_article> section to
include the full text.
The system determines whether an article is new by its id and dynamicid properties. If no
article with the same id is present in the database, the article is new and is processed in the
<parse_article> section. Otherwise the dynamicid properties are compared. If they differ,
the article on the web site has changed and needs to be re-processed. If they are equal, the article has not changed and is
not processed at all.
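The decision described above can be sketched as a small function. This is an illustration of the rule only, not the program's internal code: the function name classifyArticle and the shape of the stored argument (a map from id to the dynamicid saved in the database) are assumptions.

```javascript
// stored:  hypothetical map { id: dynamicid } of articles already in the database
// article: an object with id and dynamicid properties
function classifyArticle(stored, article) {
    if (!Object.prototype.hasOwnProperty.call(stored, article.id))
        return "new";        // unknown id: process in <parse_article>
    if (stored[article.id] !== article.dynamicid)
        return "changed";    // same id, different version: re-process
    return "unchanged";      // same id, same version: skip
}
```

Only articles classified as "new" or "changed" would be passed on for downloading and parsing.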
The properties of article objects:
- id (string) - the string that uniquely identifies an article in a channel. This property, along with
dynamicid (see below), is used to skip parsing or adding an article that already exists in a
channel (which saves traffic and CPU time). When the property is not set, the program generates
an id automatically from a hash of all article properties. This means that article objects generated during
different updates from the same data (url, header, etc.) receive the same
auto-generated id.
- dynamicid (string) - the string complementary to id. When the program determines whether two articles
are the same article, it compares their ids. However, there are cases when you need to modify an
article added in a previous update (for example, the article content has changed on the web site or
additional comments on the article have appeared).
Such "dynamic" articles can be processed correctly by putting a string that identifies
the version of the article into the dynamicid property.
If an article with the same id already exists but the dynamicids are different,
the old article is replaced with the newer one.
- url (string) - the URL of the full-text article page.
In the <parse_article> section this value is used to download the page with the full article text.
- header (string) - the header of an article. This string is displayed in the list of articles and at the
top of the article HTML. It is highly recommended to fill in this value in the <parse_channel> section,
because it is the only information the user has when deciding whether to download the article's full text
or not.
- body (string) - the full text of the article. If there is a synopsis of the article on the "list of articles" page, it
is recommended to place it in this property in the <parse_channel> section. This property usually
contains HTML tags.
- thumbnail (string) - the URL of the image that is used as the article thumbnail. If this property is not
set, the program automatically looks for the most appropriate image in the article.
- date (Date object) - the UTC date of the article, given as an ECMAScript Date object.
- read (string) - can be set to "1" if the article should be added as already read.
- manualupdate (string) - can be set to "1" if the article should be downloaded only manually
(the user has to click on its icon or choose "update" from the context menu).
- updated (string) - when set to "1", the article is considered fully updated and is not passed
to the <parse_article> section. Use this flag when the "list of articles" page contains
the entire content of all articles.
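As an illustration of the id and dynamicid properties working together, the sketch below builds an article whose version is tied to its comment count. The helper name makeDynamicArticle and the idea of scraping a comment count from the list page are assumptions made for this example.

```javascript
// Hypothetical helper for <parse_channel>: commentCount is assumed to have
// been extracted from the "list of articles" page next to each link.
function makeDynamicArticle(href, title, commentCount) {
    var article = new Article();
    article.id        = href;                  // stays the same across updates
    article.dynamicid = String(commentCount);  // changes when new comments appear
    article.url       = href;
    article.header    = title;
    return article;
}

// Inside <parse_channel>:
//   channel.articles.push(makeDynamicArticle(href, title, count));
```

With this setup an article is re-processed only when its comment count differs from the one recorded in a previous update.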
class Document
This class is derived from the Node class, so it has all the same methods and properties. In addition, it
has a constructor, which downloads and parses HTML pages:
- new Document(url [, encoding]) - downloads the HTML document from the internet and builds the Node tree inside.
If the automatically determined encoding is wrong, the second argument can be used to set
the right one explicitly (e.g. "win-1251"). Returns an object of class Node.
class Node
A Node object corresponds to a single tag in an HTML file. The current implementation includes the following properties:
- firstChild (Node) - returns first child node of this node, undefined if no children.
- lastChild (Node) - returns last child node of this node, undefined if no children.
- nextSibling (Node) - returns next sibling node, undefined if this was the last.
- previousSibling (Node) - returns previous sibling node, undefined if this was the first.
- parentNode (Node) - returns parent node, undefined if this is the root.
- nodeName (string) - returns tag name of the current node.
- outerHTML (string) - returns HTML representation of this node and all its subnodes.
- innerHTML (string) - returns HTML representation of all subnodes of this node.
- outerText (string) - returns string representation of this node and all its subnodes.
- innerText (string) - returns string representation of all subnodes of this node.
- all (Array) - returns all child nodes (recursively) of this node.
- childNodes (Array) - returns all immediate child nodes of this node.
- attributes (Object) - returns collection of attributes of this node. Can be accessed like this: node.attributes.href or node.attributes["href"]
- nodeType (int) - returns the type of this node. Can be one of the constants defined below.
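The sibling-walking loop from the sample template can be written as a reusable function built on the nextSibling, nodeName and outerHTML properties. This is a sketch; the name htmlUntil is an assumption, and the tag name is compared case-insensitively as a precaution, since the case returned by nodeName is not specified here.

```javascript
// Concatenates the HTML of all siblings that follow startNode,
// stopping at the first sibling whose tag name equals stopTag.
function htmlUntil(startNode, stopTag) {
    var html = "";
    for (var node = startNode.nextSibling; node; node = node.nextSibling) {
        if (String(node.nodeName).toLowerCase() == stopTag)
            break;
        html += node.outerHTML;
    }
    return html;
}

// In <parse_article>, with nodeHeader being the article's <h2> node:
//   article.body = htmlUntil(nodeHeader, "hr");
```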
Along with the properties, the Node class has the following functions:
- getAttribute(attributeName) - returns the string value of an attribute by its name.
If the attribute is not found, returns undefined.
- getElementsByClassName(name) - returns an array of nodes that have the corresponding class (for example,
<div class='ArticleBody'>).
- getElementsById(name) - returns an array of nodes with the corresponding id
(for example, <div id='ArticleBody'>).
- getElementsByTagName(name) - returns an array of nodes with the corresponding tag name
(for example, all <b> tags).
- getElementsByAttr(tagname, attrname, attrvalue) - returns an array of nodes with the corresponding tag name and attribute
(for example, all <td class="main_page"> tags are found by calling doc.getElementsByAttr("td", "class", "main_page")).
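Since each search function returns an array, a template can try several of them in turn until one yields a result. The sketch below chains the three example lookups mentioned above; the function name findArticleBody and the choice of fallback order are assumptions.

```javascript
// Tries progressively looser lookups and returns the first matching node,
// or undefined when nothing matches.
function findArticleBody(doc) {
    var hits = doc.getElementsById("ArticleBody");
    if (hits.length == 0)
        hits = doc.getElementsByClassName("ArticleBody");
    if (hits.length == 0)
        hits = doc.getElementsByAttr("td", "class", "main_page");
    return hits.length > 0 ? hits[0] : undefined;
}
```

Checking the result for undefined before reading innerHTML or innerText avoids a script error when a site changes its layout.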
The Node class also defines the following type constants:
- DOCUMENT_NODE
- DOCUMENT_TYPE_NODE
- COMMENT_NODE
- PROCESSING_INSTRUCTION_NODE
- TEXT_NODE
- ELEMENT_NODE
- ELEMENT_END_NODE
- ELEMENT_START_END_NODE
- CDATA_SECTION_NODE
- SECTION_NODE
- XML_DECLARATION_NODE
Debugging templates
The main facility for debugging templates is the log file at /<directory to Spb Insight>/insight_log.txt on the Pocket PC
(usually /Program Files/Spb Insight/insight_log.txt).
By default many messages are written to the log, showing internal actions and parsing errors.
Other messages can be added using the log(message) global function in JScript.
An alert(message) function is also provided to show a message box (although we discourage using it
in public templates).
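When a template does not find the nodes it expects, it can help to write an outline of the parsed page to the log. The helper below is a sketch that relies only on log(), nodeName and childNodes; the name dumpTree is an assumption, and the fallback to an empty array covers the case where a leaf node has no childNodes value.

```javascript
// Writes one log line per node, indented by depth in the tree.
function dumpTree(node, indent) {
    indent = indent || "";
    log(indent + node.nodeName);
    var kids = node.childNodes || [];
    for (var i = 0; i < kids.length; i++)
        dumpTree(kids[i], indent + "  ");
}

// In <parse_channel> or <parse_article>:
//   dumpTree(new Document(channel.url));
```

On large pages this produces many log lines, so it is best removed before a template is published.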
Frequently Asked Questions
Q: How can I ask a question about this documentation?
A: Please use our forum at www.spbdn.com.