Spb Insight
Pocket PC version | Smartphone version | Channels | Developer

Documentation


Introduction

Templates in the Spb Insight project are used to download and parse web documents of various kinds, mostly news sites. Since web sites differ widely in page formatting, a template needs to be created for every site. A template contains information about the site and code that parses the content and strips menus, advertisements and so on.

The template language was designed with ease of use, popularity and conformance to internet standards in mind. The widespread internet language JScript (an implementation of ECMAScript) was chosen as the base language, wrapped in XML to enable processing of metadata such as channel names.

One template can contain definitions for multiple channels, because one site usually has several channels that share the same formatting, and the parsing code for them should be shared.

Sample template

Here is a sample of a real-life template that parses two BBC channels. (I wish I could post yet another "Hello, World!" template here, but that seems impossible. If you have an idea for one, please drop me a note.)

01 <?xml version="1.0"?>
02 <template version="1.0">
03      <channel>
04          <name>BBC - Business</name>
05          <url>http://news.bbc.co.uk/2/low/business/default.stm</url>
06      </channel>
07 
08      <channel>
09          <name>BBC - Science &amp; Nature</name>
10          <url>http://news.bbc.co.uk/2/low/science/nature/default.stm</url>
11      </channel>
12 
13      <parse_channel> <![CDATA[
14         var d = new Document(channel.url);
15         var tags = d.getElementsByTagName("a");
16 
17         for (var i in tags)
18         {
19             var href = tags[i].getAttribute("href");
20             if ( href &&
21                  href.indexOf( "default.stm" ) == -1 &&
22                  href.indexOf( "/2/low/" ) != -1 &&
23                  href.indexOf( "/help/" ) == -1 )
24             {
25                 var article = new Article();
26 
27                 article.id     = href;
28                 article.url    = href;
29                 article.header = tags[i].innerText;
30 
31                 channel.articles.push( article );
32             }
33         }
34     ]]> </parse_channel>
35 
36     <parse_article> <![CDATA[
37         var d = new Document(article.url);
38         var nodeHeader = d.getElementsByTagName("h2")[0];
39 
40         article.header = nodeHeader.innerText;
41         article.body = "";
42 
43         for (var node = nodeHeader.nextSibling; node; node = node.nextSibling)
44             if (node.nodeName != "hr")
45                 article.body += node.outerHTML;
46             else
47                 break;
48     ]]> </parse_article>
49 </template> 
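
The link test on lines 20-23 of the sample can also be read as a standalone predicate. The following is a minimal sketch in plain ECMAScript that runs outside Spb Insight; the function name isArticleLink and the sample paths are hypothetical and are not part of the program.

```javascript
// Hypothetical helper: the same test as lines 20-23 of the sample template.
// Returns true when href looks like a link to an article page.
function isArticleLink(href) {
    if (!href) {
        return false;                           // missing or empty href attribute
    }
    return href.indexOf("default.stm") == -1 && // skip "list of articles" pages
           href.indexOf("/2/low/") != -1 &&     // stay inside the low-graphics site
           href.indexOf("/help/") == -1;        // skip help pages
}

// Made-up paths, used only to exercise the predicate.
var samples = [
    "/2/low/business/7300000.stm",
    "/2/low/business/default.stm",
    "/2/low/help/3281815.stm",
    undefined
];
var kept = [];
for (var i = 0; i < samples.length; i++) {
    if (isArticleLink(samples[i])) {
        kept.push(samples[i]);
    }
}
```

Keeping the test in one function makes it easier to adjust when the site changes its URL scheme.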

Template representation

A template can be represented as a file or as an online template:

  • As a file, it must have the extension ".xml" and be placed in the /<directory to Spb Insight>/Templates directory on the Pocket PC (usually /Program Files/Spb Insight/Templates). After that, the channels from the template can be added using the New/Local templates menu option.
    This representation is usually used to create and debug templates and to share them without the online catalog.
  • Online templates can be added by registered users through the Developer/Add template menu item on the www.spbinsight.com web site. After the template is added and a category and language are provided for its channels, the channels become available to all users of Spb Insight in the New/Online catalog menu option.
    This method is preferred when a template is worth sharing with the majority of users.

Updating process workflow

Updating a channel from the web is the main purpose of a template. Usually a channel on a web site consists of a "list of articles" page with the names of and links to the articles, and many "article" pages, each holding a single article with full text and images. These two types of pages are processed in the <parse_channel> and <parse_article> tags respectively. The code in each of these tags usually downloads an HTML page, parses it and extracts the relevant information. For <parse_channel> this is a list of articles (with at least a URL and a header for each article); for <parse_article> it is the full article text, full header, date and other related information.

As you can see in the sample template above, all the article links from the HTML page are added to the channel.articles array, although some of them may already have been parsed in previous updates. Spb Insight handles this automatically: it checks which articles have already been downloaded and parsed and which still need to be processed during this update session. This behavior is configurable through the id and dynamicid properties of the article object (see reference below).

Also, some sites have no "article" pages, because the full text of all articles is placed on one page. In this case the <parse_article> tag can be omitted and all information can be extracted directly in the <parse_channel> tag.
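
For such a single-page site, the <parse_channel> code might fill each article completely and mark it as finished. The following is a sketch under stated assumptions: the Article constructor and the channel object are stand-ins defined here so the code runs outside Spb Insight (inside a template the program provides them), and the items array with its anchor, title and html fields is hypothetical data standing for whatever the template extracted from the page.

```javascript
// Stand-ins so the sketch runs outside Spb Insight; inside a template the
// program provides Article and channel itself.
function Article() {}
var channel = { url: "http://example.com/news.html", articles: [] };

// Hypothetical data, standing for what the template extracted from the page.
var items = [
    { anchor: "story1", title: "First story",  html: "<p>Text of the first story.</p>" },
    { anchor: "story2", title: "Second story", html: "<p>Text of the second story.</p>" }
];

for (var i = 0; i < items.length; i++) {
    var article = new Article();
    article.id      = channel.url + "#" + items[i].anchor; // unique within the channel
    article.header  = items[i].title;
    article.body    = items[i].html;
    article.updated = "1";   // already complete: not passed to <parse_article>
    channel.articles.push(article);
}
```

Setting an explicit id here is a design choice for the sketch: all articles share one page URL, so the url property alone would not tell them apart.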

XML structure

Note! In the current implementation the encoding of the XML is UTF-8 and cannot be changed.

A template must be a well-formed XML document. The root tag should be named <template> and has a required attribute, version. The current template version is 1.0 and corresponds to the current version of the Spb Insight project.
The following child tags are possible (all of them can be specified multiple times, in any order):

  • <channel> - each such tag corresponds to a single channel parsed by this template.
    It has two required sub-tags:
    • <name> - the name of the channel.
    • <url> - the URL of the "list of articles" page. It is used in the online catalog to detect multiple templates that parse the same site, and is also available as the channel.url member in JScript (see reference below).
    It can also have other sub-tags, which are available as channel.<name_of_the_tag> members in JScript.
  • <parse_channel> - JScript code that is executed when a channel is updated. All channel properties are available through the global variable channel of type Channel (see reference below).
  • <parse_article> - JScript code that is executed to process a new or changed article. Article properties are available through the global variable article of type Article (see reference below).

The parsing tags contain JScript code, and because the file must be well-formed XML (text inside an XML tag cannot contain a bare '&' or '<' symbol; these must be written as '&amp;' and '&lt;'), it is recommended to wrap the code in a CDATA section. Sample:

01 <parse_article> <![CDATA[
02 
03  ... Some JScript code...
04 
05 ]]> </parse_article> 
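
One detail of CDATA is worth keeping in mind: the section ends at the first "]]>", so script text that contains this sequence (for example nested array indexing followed by a comparison) would close it early. The simplest fix is to add a space in the script. The usual general XML workaround is to split the sequence across two adjacent CDATA sections, sketched below in plain ECMAScript. The function name wrapInCdata is hypothetical, and whether the Spb Insight XML parser accepts adjacent CDATA sections is an assumption based on standard XML behavior.

```javascript
// Hypothetical helper: wraps script text in a CDATA section.
// A literal "]]>" inside the script would close the section early, so it is
// split across two adjacent CDATA sections (the usual XML workaround).
function wrapInCdata(code) {
    return "<![CDATA[" + code.split("]]>").join("]]]]><![CDATA[>") + "]]>";
}

// '<' and '&' need no escaping inside CDATA.
var plain  = wrapInCdata("if (a < b && c) log('ok');");

// This script contains "]]>" and is therefore split.
var tricky = wrapInCdata("if (x[y[0]]>1) log('big');");
```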

JScript code

The language we use is an extension of ECMAScript. All syntax and semantics of the language, including built-in classes such as String, RegExp and Date, are supported according to the standard. If you need a tutorial for the language, please search the web; there are plenty of them. Note that the standard DOM implementation is not part of this language (i.e. there are no document or window objects).

The extensions to ECMAScript consist of the following three parts:

  • Channel/Article classes give simple access to the internal channel/article database. Through these classes you can add new articles and edit existing ones. Typical usage is as follows:
    • In the <parse_channel> section there is a global variable channel of type Channel. Its channel.url property is used to download the document (line 14 in the sample template above). It also has a channel.articles array, which stores all the Article objects that can be created and filled from the current HTML page (lines 25-31). Usually article.url and article.header are filled in at this stage. After the <parse_channel> section has been executed, Spb Insight determines which articles in the channel.articles array are new and executes the <parse_article> section for them.
    • In the <parse_article> section there is a global variable article of type Article. Usually this section modifies the article object to contain article.body, the full article text (lines 41-47). All other properties can be modified as well, including article.header (line 40), if additional information about it is available at this point.
  • Document/Node classes - a simple replacement for the DOM, adapted to the needs of the application. Typical usage is:
    • Downloading: an HTML page can be downloaded and parsed in one line: var d = new Document(url) (lines 14 and 37), after which the variable d is the root node of the page.
    • Searching: getElementsByTagName, getElementsByAttr and other methods of Node objects are a simple way to search for a specific node.
    • String conversion: each node can be converted to a string representation using the Microsoft-style inner/outerText and inner/outerHTML properties.
  • log(str)/alert(str) functions - debugging facilities.

Reference

class Channel

The one and only instance of this class is the global variable channel in the <parse_channel> section. It stores channel-wide settings such as the channel name and url. Template authors may add custom string fields to this object, and they will be stored between updates.

The properties of channel objects:

  • url (string) - the URL of the "list of articles" page. Taken from the XML node <url>, this property shows where to download the main page from.
  • thumbnail (string) - the URL of the channel icon. Set automatically from the url property as http://<servername>/favicon.ico, but can be changed if the site's icon has a different URL.
  • articles (array of Article objects) - the array that stores all articles found on the main page. After execution completes, Spb Insight determines which of the articles are new (by their id and dynamicid properties, see below) and downloads them.
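
The default thumbnail rule can be illustrated in plain ECMAScript. This is only a sketch of the documented rule, not the program's actual code; the function name defaultThumbnail is hypothetical. Inside a template you would normally just assign a different URL to channel.thumbnail when the default is wrong.

```javascript
// Hypothetical helper illustrating the documented default:
// http://<servername>/favicon.ico derived from the channel url.
function defaultThumbnail(url) {
    var m = /^(\w+:\/\/[^\/]+)/.exec(url);   // scheme and server name
    return m ? m[1] + "/favicon.ico" : undefined;
}

var icon = defaultThumbnail("http://news.bbc.co.uk/2/low/business/default.stm");
```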

class Article

This object represents an article. Usually these objects are created in the <parse_channel> section and filled with initial information there. Afterwards they are usually extended in the <parse_article> section to include the full text.

The system determines whether an article is new by its id and dynamicid properties. If no article with the same id is present in the database, the article is new and is processed in the <parse_article> section. Otherwise the dynamicid properties are compared. If they differ, the article on the web site has changed and needs to be re-processed. Otherwise it has not changed and is not processed at all.
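
The decision described above can be sketched in plain ECMAScript. This is an illustration only, not the program's code: a plain object keyed by id stands in for the article database, and the function name classify, the sample ids and the use of a comment count as dynamicid are all hypothetical.

```javascript
// Illustration only: a plain object keyed by id stands in for the
// program's article database.
function classify(stored, incoming) {
    if (!Object.prototype.hasOwnProperty.call(stored, incoming.id)) {
        return "new";                       // unknown id: run <parse_article>
    }
    if (stored[incoming.id].dynamicid != incoming.dynamicid) {
        return "changed";                   // same id, new version: re-process
    }
    return "unchanged";                     // skipped entirely
}

var stored = {
    "/story/1": { id: "/story/1", dynamicid: "3 comments" }
};
var a = classify(stored, { id: "/story/2", dynamicid: "0 comments" });
var b = classify(stored, { id: "/story/1", dynamicid: "5 comments" });
var c = classify(stored, { id: "/story/1", dynamicid: "3 comments" });
```

Any string that changes when the article changes can serve as a dynamicid; a comment count or a "last updated" stamp taken from the page are typical candidates.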

The properties of article objects:

  • id (string) - the string that uniquely identifies an article within a channel. This property, along with dynamicid (see below), is used to skip parsing or adding an article that already exists in the channel (which saves traffic and CPU time). When the property is not set, the program generates an id automatically from a hash of all article properties. This means that article objects generated during different updates from the same data (url, header, etc.) receive the same auto-generated id.
  • dynamicid (string) - the string complementary to id. When the program determines whether two articles are the same, it compares their ids. However, there are cases when an article added in a previous update needs to be modified (for example, the article content has changed on the web site or new comments have been added to it). Such "dynamic" articles can be processed correctly by putting a string that identifies the version of the article into the dynamicid property. If an article with the same id already exists but the dynamicids differ, the old article is replaced with the newer one.
  • url (string) - the URL of the full-text article page. In the <parse_article> section this value is used to download the page with the full article text.
  • header (string) - the header of the article. This string is displayed in the list of articles and at the top of the article HTML. It is highly recommended to fill this value in the <parse_channel> section, because it is the only information the user has when deciding whether to download the article's full text.
  • body (string) - the full text of the article. If the "list of articles" page contains a synopsis of an article, it is recommended to place it in this property in the <parse_channel> section. This property usually contains HTML tags.
  • thumbnail (string) - the URL of the image that will be used as the article thumbnail. If this property is not set, the program automatically looks for the most appropriate image in the article.
  • date (Date object) - the UTC date of the article, given as an ECMAScript Date object.
  • read (string) - can be set to "1" if the article should be added as already read.
  • manualupdate (string) - can be set to "1" if the article should be downloaded only manually (the user has to click its icon or choose "update" from the context menu).
  • updated (string) - when set to "1", the article is considered fully updated and is not passed to the <parse_article> section. Use this flag when the "list of articles" page contains the entire content of all articles.
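
Since the date property takes a Date object in UTC, date text found on a page has to be converted first. The following is a minimal sketch using only the standard RegExp and Date classes. The function name parseUtcDate, the MONTHS table and the assumed text format ("12 March 2008 14:30", already in GMT) are hypothetical; a real site will need its own pattern.

```javascript
// Month names as they are assumed to appear on the page.
var MONTHS = {
    January: 0, February: 1, March: 2, April: 3, May: 4, June: 5,
    July: 6, August: 7, September: 8, October: 9, November: 10, December: 11
};

// Hypothetical helper: finds text such as "12 March 2008 14:30" (assumed to
// be GMT) and returns a Date suitable for article.date, or undefined.
function parseUtcDate(text) {
    var m = /(\d{1,2}) (\w+) (\d{4}) (\d{1,2}):(\d{2})/.exec(text);
    if (!m || typeof MONTHS[m[2]] != "number") {
        return undefined;                    // format not recognised
    }
    return new Date(Date.UTC(+m[3], MONTHS[m[2]], +m[1], +m[4], +m[5]));
}

var date = parseUtcDate("Last updated: 12 March 2008 14:30 GMT");
```

Date.UTC is used instead of the local-time Date constructor so that the result does not depend on the time zone set on the device.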

class Document

This class is derived from the Node class, so it has all the same methods and properties. In addition, it has a constructor that downloads and parses HTML pages:

  • new Document(url [, encoding]) - downloads the HTML document from the internet and builds the Node tree from it. If the automatically detected encoding is wrong, the second argument can be used to set the right one explicitly (e.g. "win-1251"). Returns an object of class Node.

class Node

A node object corresponds to a single tag in the HTML file. The current implementation includes the following properties:

  • firstChild (Node) - returns the first child node of this node, or undefined if there are no children.
  • lastChild (Node) - returns the last child node of this node, or undefined if there are no children.
  • nextSibling (Node) - returns the next sibling node, or undefined if this is the last one.
  • previousSibling (Node) - returns the previous sibling node, or undefined if this is the first one.
  • parentNode (Node) - returns the parent node, or undefined if this is the root.
  • nodeName (string) - returns the tag name of the current node.
  • outerHTML (string) - returns the HTML representation of this node and all its subnodes.
  • innerHTML (string) - returns the HTML representation of all subnodes of this node.
  • outerText (string) - returns the string representation of this node and all its subnodes.
  • innerText (string) - returns the string representation of all subnodes of this node.
  • all (Array) - returns all child nodes of this node, recursively.
  • childNodes (Array) - returns all immediate child nodes of this node.
  • attributes (Object) - returns the collection of attributes of this node. It can be accessed like this: node.attributes.href or node.attributes["href"]
  • nodeType (int) - returns the type of this node. Can be one of the constants defined below.
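
The sibling properties are what the sample template uses on lines 43-47 to collect the article body. The same walk can be sketched as a reusable function. Plain objects stand in for Node objects here so the code runs outside Spb Insight; the function name collectUntil and the mock nodes are hypothetical.

```javascript
// Hypothetical helper: concatenates the outerHTML of the siblings that follow
// `start`, stopping at the first node whose nodeName equals `stopName`.
function collectUntil(start, stopName) {
    var html = "";
    for (var node = start.nextSibling; node; node = node.nextSibling) {
        if (node.nodeName == stopName) {
            break;
        }
        html += node.outerHTML;
    }
    return html;
}

// Plain objects standing in for Node objects, linked through nextSibling.
var footer = { nodeName: "div", outerHTML: "<div>Footer</div>", nextSibling: undefined };
var hr     = { nodeName: "hr",  outerHTML: "<hr>",              nextSibling: footer };
var p2     = { nodeName: "p",   outerHTML: "<p>Two</p>",        nextSibling: hr };
var p1     = { nodeName: "p",   outerHTML: "<p>One</p>",        nextSibling: p2 };
var h2     = { nodeName: "h2",  outerHTML: "<h2>Title</h2>",    nextSibling: p1 };

var body = collectUntil(h2, "hr");
```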

Along with the properties, the Node class has the following functions:

  • getAttribute(attributeName) - returns the string value of an attribute by its name. If the attribute is not found, returns undefined.
  • getElementsByClassName(name) - returns an array of nodes that have the corresponding class (for example, <div class='ArticleBody'>).
  • getElementsById(name) - returns an array of nodes with the corresponding id (for example, <div id='ArticleBody'>).
  • getElementsByTagName(name) - returns an array of nodes with the corresponding tag name (for example, all <b> tags).
  • getElementsByAttr(tagname, attrname, attrvalue) - returns an array of nodes with the corresponding tag name and attribute (for example, all <td class="main_page"> tags are found by calling doc.getElementsByAttr("td", "class", "main_page")).
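
When the built-in search functions do not cover a condition, a template can filter a node array (such as the all property) by hand. The following sketch mimics the kind of match getElementsByAttr performs, using plain objects in place of Node objects. The function name filterByAttr is hypothetical, and exact, case-sensitive comparison is an assumption; the documentation does not say how the built-in function compares names and values.

```javascript
// Hypothetical helper: keeps the nodes whose tag name and attribute value
// both match exactly. `nodes` stands for an array such as node.all.
function filterByAttr(nodes, tagname, attrname, attrvalue) {
    var found = [];
    for (var i = 0; i < nodes.length; i++) {
        var n = nodes[i];
        if (n.nodeName == tagname && n.attributes &&
            n.attributes[attrname] == attrvalue) {
            found.push(n);
        }
    }
    return found;
}

// Plain objects standing in for Node objects.
var nodes = [
    { nodeName: "td",  attributes: { "class": "main_page" } },
    { nodeName: "td",  attributes: { "class": "sidebar" } },
    { nodeName: "div", attributes: { "class": "main_page" } }
];
var cells = filterByAttr(nodes, "td", "class", "main_page");
```

The comparison inside the loop is the part to change, for example to test attribute values with a RegExp instead of equality.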

The Node class also defines the following type constants:

  • DOCUMENT_NODE
  • DOCUMENT_TYPE_NODE
  • COMMENT_NODE
  • PROCESSING_INSTRUCTION_NODE
  • TEXT_NODE
  • ELEMENT_NODE
  • ELEMENT_END_NODE
  • ELEMENT_START_END_NODE
  • CDATA_SECTION_NODE
  • SECTION_NODE
  • XML_DECLARATION_NODE

Debugging templates

The main facility for debugging templates is the log file at /<directory to Spb Insight>/insight_log.txt on the Pocket PC (usually /Program Files/Spb Insight/insight_log.txt).

By default the log contains many messages that show internal actions and parsing errors. Other messages can be added using the log(message) global function in JScript. The alert(message) function is also provided to show a message box (although we discourage using it in public templates).
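
Because the log already holds many messages, it can help to give a template's own messages a recognizable prefix. The following is a minimal sketch: the log function defined here is only a stand-in that records messages so the code runs outside Spb Insight (inside a template the program's own log is used instead), and the helper name logStep, the prefix text and the sample counts are hypothetical.

```javascript
// Stand-in for the log(message) function that Spb Insight provides;
// here it only records the messages in an array.
var logged = [];
function log(message) {
    logged.push(String(message));
}

// Hypothetical helper: one prefixed log line per parsing step.
function logStep(step, count) {
    log("[my template] " + step + ": " + count);
}

logStep("links found", 42);
logStep("articles kept", 17);
```

Searching insight_log.txt for the prefix then shows only the lines written by the template.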

Frequently Asked Questions

  1. Q: How can I ask a question about this documentation?
    A: Please use our forum at www.spbdn.com.
© 2008 Spb Software House