Classes derived from this class handle the generic parsing of HTML documents:
the parser scans the document and divides it into blocks of tags (where one
block consists of a beginning tag, an ending tag and the text between these
two tags).
It is independent of HtmlWindow and can be used as a stand-alone parser
(Julian Smart’s idea of a speech-only HTML viewer or a wget-like utility;
see the InetGet sample for an example).
It uses a system of tag handlers to parse the HTML document. Tag handlers
are not statically shared by all instances but are created for each
HtmlParser instance. The reason is that a handler may contain
document-specific temporary data used during parsing (e.g. complicated
structures like tables).
Typically the user calls only the Parse method.
Derived from: Object
See also: Cells Overview, Tag Handlers Overview, HtmlTag
This may (but need not) be overridden in a derived class.
This method is called each time a new tag is about to be added.
tag contains information about the tag (see HtmlTag for details).
The default (HtmlParser) behaviour is this:
first it finds a handler capable of handling this tag, and then it calls
that handler’s HandleTag method.
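The default dispatch described above can be sketched in standalone C++. This is only an illustration of the idea, not the library's implementation: `SketchParser`, `SketchHandler`, `add_handler` and `add_tag` are hypothetical names, and a tag is reduced to its name. Note that the handler table lives inside each parser instance, so per-document state kept by a handler is never shared between parsers.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a tag handler: it lists the tags it supports
// and remembers which tags it was asked to handle.
struct SketchHandler {
    std::vector<std::string> supported;  // tag names this handler accepts
    std::vector<std::string> handled;    // document-specific state

    explicit SketchHandler(std::vector<std::string> tags)
        : supported(std::move(tags)) {}

    bool handle_tag(const std::string& name) {
        handled.push_back(name);
        return true;
    }
};

// Hypothetical stand-in for the parser. The table is a member, so handlers
// are registered per instance and not for the whole class.
class SketchParser {
public:
    // Registers the handler under every tag name it supports.
    void add_handler(std::shared_ptr<SketchHandler> h) {
        assert(h != nullptr && "a null handler is a programming error");
        for (const auto& name : h->supported) table_[name] = h;
    }

    // Finds a handler for the tag and forwards to it. Returns false when
    // no handler is registered for that tag name.
    bool add_tag(const std::string& name) {
        auto it = table_.find(name);
        if (it == table_.end()) return false;
        return it->second->handle_tag(name);
    }

private:
    std::map<std::string, std::shared_ptr<SketchHandler>> table_;
};
```

A second `SketchParser` created alongside the first starts with an empty table, which mirrors the per-instance registration described for AddTagHandler.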
Adds the handler to the internal list (and hash table) of handlers. This
method should not be called directly by the user but rather by the derived
class’s constructor.
This adds the handler to this instance of HtmlParser, not to
all objects of this class! (A static front-end to AddTagHandler is provided
by HtmlWinParser.)
All handlers are deleted when the object is deleted.
Must be overridden in a derived class.
This method is called by do_parsing
each time a part of the text is parsed. txt is NOT just one word; it is a
substring of the input. It is not formatted or preprocessed (so white space is
unmodified).
Parses m_Source from begin_pos to end_pos-1.
(The version without parameters parses the whole of m_Source.)
This must be called after DoParsing().
Returns a pointer to the file system. Because each tag handler has a
reference to its parent parser, it can easily request the file by
calling
Returns the product of parsing, that is, the result of parsing
the document. The type of this result depends on the internal
representation in the derived parser (but it must be derived from Object!).
See HtmlWinParser for details.
Returns a pointer to the source being parsed.
Sets up the parser for parsing the source string. (Should be overridden
in a derived class.)
Opens the given URL and returns an FSFile
object that can be used to read data
from it. This method may return NULL in one of two cases: either the URL doesn’t
point to any valid resource or the URL is blocked by an overridden implementation
of OpenURL in a derived class.
HTML_URL_PAGE | Opening an HTML page. |
HTML_URL_IMAGE | Opening an image. |
HTML_URL_OTHER | Opening a resource that doesn’t fall into any other category. |
Always use this method in tag handlers instead of GetFS()->OpenFile()
because it can block the URL and is thus more secure.
The default behaviour is to call HtmlWindow#on_opening_url
of the associated HtmlWindow object, if there is one (it may decide to block
the URL or redirect it to another one), and to always open the URL if the
parser is not used with an HtmlWindow.
The returned FSFile
object is not guaranteed to point to url; it might
have been redirected!
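The open, block or redirect decision described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `UrlType`, `Decision`, `Verdict`, `Hook`, `OpenResult` and `open_url` are hypothetical names, the hook stands in for the associated window's callback, and a redirect is followed only once.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical mirror of the three resource categories listed above.
enum class UrlType { Page, Image, Other };

// What the hook may decide about a URL.
enum class Decision { Open, Block, Redirect };

struct Verdict {
    Decision decision;
    std::string redirect;  // only meaningful for Decision::Redirect
};

// Stand-in for the window's callback; an empty Hook means "no window".
using Hook = std::function<Verdict(UrlType, const std::string&)>;

struct OpenResult {
    bool opened;      // false when the URL was blocked
    std::string url;  // URL actually opened (may differ from the request)
};

OpenResult open_url(UrlType type, const std::string& url, const Hook& hook) {
    assert(!url.empty() && "open_url needs a non-empty URL");
    if (!hook) return {true, url};  // no associated window: always open
    Verdict v = hook(type, url);
    switch (v.decision) {
        case Decision::Block:    return {false, std::string()};
        case Decision::Redirect: return {true, v.redirect};
        case Decision::Open:     break;
    }
    return {true, url};
}
```

Because the result carries the URL that was actually opened, a caller can tell when the request was redirected, which is the caveat the text raises about the returned FSFile.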
Parses the document. This is the end-user method: simply
call it when you need to obtain the parsed output (which is parser-specific).
The method calls InitParser, DoParsing, GetProduct and DoneParser for you and
returns the product, so you shouldn’t use those methods directly.
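The relationship between the end-user method and the four internal steps can be sketched like this. It is an illustration only: `SketchParser` and its snake_case methods are hypothetical stand-ins, the call order follows the order in which the text names the steps, and the "product" is just a string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical base class: parse() is the only public entry point and
// drives the protected steps in a fixed order.
class SketchParser {
public:
    virtual ~SketchParser() = default;

    std::string parse(const std::string& source) {
        init_parser(source);
        do_parsing();
        std::string result = get_product();
        done_parser();
        return result;
    }

    std::vector<std::string> log;  // records call order, for illustration only

protected:
    virtual void init_parser(const std::string& source) {
        source_ = source;
        ready_ = true;
        log.push_back("init_parser");
    }

    virtual void do_parsing() {
        assert(ready_ && "do_parsing called before init_parser");
        product_ = "parsed:" + source_;
        log.push_back("do_parsing");
    }

    virtual std::string get_product() {
        log.push_back("get_product");
        return product_;
    }

    // Releases per-document state once parsing has finished.
    virtual void done_parser() {
        source_.clear();
        product_.clear();
        ready_ = false;
        log.push_back("done_parser");
    }

    std::string source_;
    std::string product_;
    bool ready_ = false;
};
```

Keeping the steps protected means a derived parser can customise each one while callers only ever see `parse`.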
Forces the handler to handle additional tags
(not returned by get_supported_tags).
The handler should already be added to this parser.
Imagine you want to parse a pseudo-HTML structure in which the same tag, PARAM,
appears inside more than one kind of enclosing tag (MYITEMS being one of them)
and means something different in each.
It is obvious that you cannot use only one tag handler for the PARAM tag.
Instead you must use context-sensitive handlers, one for PARAM inside each
enclosing tag.
This is the preferred solution:
TAG_HANDLER_BEGIN(MYITEM, "MYITEMS")
    TAG_HANDLER_PROC(tag)
    {
        // …something…
        m_Parser->PushTagHandler(this, "PARAM");
        ParseInner(tag);
        m_Parser->PopTagHandler();
        // …something…
    }
TAG_HANDLER_END(MYITEM)
Restores the parser’s state to what it was before the last call to
push_tag_handler.
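The push and restore behaviour can be pictured as a stack of saved handler tables. The sketch below is an assumption about how such a mechanism might work, not the library's implementation: `SketchParser`, `add_handler` and `handler_for` are hypothetical, handlers are represented by plain names, and the tag list is assumed to be comma-separated.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

class SketchParser {
public:
    void add_handler(const std::string& tag, const std::string& handler) {
        table_[tag] = handler;
    }

    // Saves the current table, then routes every tag in the
    // comma-separated list to `handler`.
    void push_tag_handler(const std::string& handler, const std::string& tags) {
        saved_.push_back(table_);
        std::istringstream in(tags);
        std::string tag;
        while (std::getline(in, tag, ',')) {
            if (!tag.empty()) table_[tag] = handler;
        }
    }

    // Restores the table saved by the most recent push_tag_handler.
    void pop_tag_handler() {
        assert(!saved_.empty() && "pop_tag_handler without a matching push");
        table_ = saved_.back();
        saved_.pop_back();
    }

    // Returns the handler name for `tag`, or an empty string when none
    // is registered.
    std::string handler_for(const std::string& tag) const {
        auto it = table_.find(tag);
        return it == table_.end() ? std::string() : it->second;
    }

private:
    std::map<std::string, std::string> table_;
    std::vector<std::map<std::string, std::string>> saved_;
};
```

Saving the whole table makes nested pushes unwind in the reverse order they were made, at the cost of one copy per push.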
Sets the virtual file system that will be used to request additional
files. (For example, the <IMG>
tag handler requests an FSFile with the
image data.)
Call this function to interrupt parsing from a tag handler. No more tags
will be parsed afterward. This function may only be called from
HtmlParser#parse or any function called
by it (i.e. from tag handlers).
[This page automatically generated from the Textile source at 2023-06-13 21:31:42 +0000]