Wx::HtmlParser

Classes derived from this class handle the generic parsing of HTML documents:
the parser scans the document and divides it into blocks of tags (where one
block consists of a beginning tag, an ending tag, and the text between these
two tags).

It is independent of HtmlWindow and can be used as a stand-alone parser
(for example for Julian Smart’s idea of a speech-only HTML viewer, or for a
wget-like utility; see the InetGet sample).

It uses a system of tag handlers to parse the HTML document. Tag handlers
are not statically shared by all instances but are created for each
HtmlParser instance, because a handler may contain document-specific
temporary data used during parsing (e.g. complicated structures like
tables).

Typically the user calls only the parse method.

Derived from

Object

See also

Cells Overview,
Tag Handlers Overview,
HtmlTag

Methods

HtmlParser.new

HtmlParser#add_tag

add_tag(HtmlTag tag)

This method may, but need not, be overridden in a derived class.

It is called each time a new tag is about to be added.
tag contains information about the tag. (See HtmlTag
for details.)

The default (HtmlParser) behaviour is to find a handler capable of
handling this tag and then call that handler’s handle_tag method.
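The default dispatch described above can be pictured with a small plain-Ruby model. This is only a sketch under stated assumptions: ToyTag, ToyHandler and DispatchParser are hypothetical stand-ins written for illustration, they do not use wxRuby, and the real lookup may differ in detail.

```ruby
# Hypothetical stand-ins; none of these classes come from wxRuby.
ToyTag = Struct.new(:name)

class ToyHandler
  attr_reader :seen

  def initialize(*tags)
    @tags = tags
    @seen = []
  end

  # Comma-separated list of tag names this handler accepts.
  def get_supported_tags
    @tags.join(",")
  end

  # Records the tag name; a real handler would build output here.
  def handle_tag(tag)
    @seen << tag.name
    true
  end
end

class DispatchParser
  def initialize
    @handlers = {} # tag name => handler, owned by this instance only
  end

  def add_tag_handler(handler)
    handler.get_supported_tags.split(",").each { |t| @handlers[t] = handler }
  end

  # Modelled default behaviour: find a capable handler, then call handle_tag.
  def add_tag(tag)
    handler = @handlers[tag.name]
    handler ? handler.handle_tag(tag) : false
  end
end

parser = DispatchParser.new
bold = ToyHandler.new("B", "STRONG")
parser.add_tag_handler(bold)
parser.add_tag(ToyTag.new("B"))
unhandled = parser.add_tag(ToyTag.new("I")) # no handler registered for I
```

Because the handler table lives in an instance variable, a second DispatchParser would start with no handlers, which mirrors the per-instance ownership described for HtmlParser.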

HtmlParser#add_tag_handler

add_tag_handler(HtmlTagHandler handler)

Adds handler to the internal list (and hash table) of handlers. This
method should not be called directly by the user but rather by the
constructor of a derived class.

This adds the handler to this instance of HtmlParser, not to
all objects of this class! (A static front-end to add_tag_handler is
provided by HtmlWinParser.)

All handlers are deleted on object deletion.

HtmlParser#add_text

add_text(String txt)

Must be overridden in a derived class.

This method is called by do_parsing each time a part of the text is
parsed. txt is NOT just one word; it is a substring of the input. It is
not formatted or preprocessed, so white space is left unmodified.
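To make the point about raw substrings concrete, here is a minimal plain-Ruby sketch of a scanner that splits a source string into tag pieces and text pieces. The scan_pieces helper and its regular expression are assumptions made for illustration; this is not how wxRuby tokenizes HTML, only a model of what an add_text implementation can expect to receive.

```ruby
# Hypothetical scanner: splits input into [:tag, "..."] and [:text, "..."]
# pieces without touching white space. Not wxRuby's implementation.
def scan_pieces(source)
  pieces = []
  source.scan(/<[^>]*>|[^<]+/) do |piece|
    kind = piece.start_with?("<") ? :tag : :text
    pieces << [kind, piece]
  end
  pieces
end

pieces = scan_pieces("Hello  <b>big\n world</b>")

# Collect only the text runs, as an add_text override would see them.
texts = pieces.select { |kind, _| kind == :text }.map { |_, text| text }
```

Each text run contains several words and keeps its original spaces and newline, so any word splitting or white-space collapsing is the job of the derived class.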

HtmlParser#do_parsing

do_parsing(Integer begin_pos, Integer end_pos)
do_parsing()

Parses the source from begin_pos to end_pos - 1.
The version without parameters parses the whole source.

HtmlParser#done_parser

done_parser()

This must be called after do_parsing.

HtmlParser#get_fs

FileSystem get_fs()

Returns the file system. Because each tag handler has a reference to
its parent parser, it can easily request a file by calling (in C++
notation)

FSFile *f = m_Parser->GetFS()->OpenFile("image.jpg");

HtmlParser#get_product

Object get_product()

Returns the product of parsing, that is, the result of parsing the
document. The type of this result depends on the internal representation
used by the derived parser (but it must be derived from Object!).

See HtmlWinParser for details.

HtmlParser#get_source

String get_source()

Returns the source string being parsed.

HtmlParser#init_parser

init_parser(String source)

Sets up the parser for parsing the source string. (Should be overridden
in a derived class.)

HtmlParser#open_url

FSFile open_url(HtmlURLType type, String url)

Opens the given URL and returns an FSFile object that can be used to read
data from it. This method may return nil in one of two cases: either the URL
doesn’t point to any valid resource, or the URL is blocked by an overridden
implementation of open_url in a derived class.

Parameters

type The type of resource being opened, one of:
HTML_URL_PAGE Opening an HTML page.
HTML_URL_IMAGE Opening an image.
HTML_URL_OTHER Opening a resource that doesn’t fall into any other category.
url The URL being opened.

Notes

Always use this method in tag handlers instead of GetFS()->OpenFile()
because it can block the URL and is thus more secure.

The default behaviour is to call HtmlWindow#on_opening_url
of the associated HtmlWindow object, if there is one (it may decide to
block the URL or redirect it to another one), and to always open the URL
if the parser is not used with an HtmlWindow.

The returned FSFile object is not guaranteed to point to url, because
the request might have been redirected!
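The open, block and redirect outcomes described in these notes can be modelled in a few lines of plain Ruby. Everything here is hypothetical: ToyWindow, ToyUrlOpener, the symbols used for resource types, and the array returned by the toy on_opening_url are conventions invented for this sketch, not the wxRuby API, and the model returns a URL string where the real method returns an FSFile.

```ruby
# Hypothetical model of the open_url policy; none of this is wxRuby API.
class ToyWindow
  # Returns [:open], [:block] or [:redirect, new_url] (an assumed convention).
  def on_opening_url(type, url)
    return [:block] if url.start_with?("http://ads.")
    return [:redirect, url.sub("http://", "https://")] if type == :image
    [:open]
  end
end

class ToyUrlOpener
  def initialize(window = nil)
    @window = window
  end

  # Returns the URL that would actually be opened, or nil if it is blocked.
  def open_url(type, url)
    return url if @window.nil? # no associated window: always open

    verdict, new_url = @window.on_opening_url(type, url)
    case verdict
    when :block then nil
    when :redirect then new_url
    else url
    end
  end
end

standalone = ToyUrlOpener.new
windowed = ToyUrlOpener.new(ToyWindow.new)

opened = standalone.open_url(:page, "http://ads.example/x")
blocked = windowed.open_url(:page, "http://ads.example/x")
redirected = windowed.open_url(:image, "http://example.org/a.png")
```

The redirected case is why a caller should not assume the returned object corresponds to the URL it asked for.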

HtmlParser#parse

Object parse(String source)

Parses the document. This is the end-user method: simply call it when
you need to obtain the parsed output (which is parser-specific).

The method does these things:

  1. calls init_parser
  2. calls do_parsing
  3. calls get_product
  4. calls done_parser
  5. returns the value returned by get_product

You shouldn’t use init_parser, do_parsing, get_product or done_parser directly.
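The five steps above can be sketched as a plain-Ruby class that records the order of calls. SequenceParser is a hypothetical illustration that does not use wxRuby; its "product" is simply the number of tags found, chosen only to keep the example short.

```ruby
# Hypothetical model of the parse sequence; not the wxRuby implementation.
class SequenceParser
  attr_reader :calls

  def initialize
    @calls = []
  end

  def init_parser(source)
    @calls << :init_parser
    @source = source
  end

  def do_parsing
    @calls << :do_parsing
    @product = @source.scan(/<[^>]*>/).length # toy product: tag count
  end

  def get_product
    @calls << :get_product
    @product
  end

  def done_parser
    @calls << :done_parser
    @source = nil # release per-document state
  end

  # The end-user entry point: runs the steps in the documented order.
  def parse(source)
    init_parser(source)
    do_parsing
    product = get_product
    done_parser
    product
  end
end

parser = SequenceParser.new
tag_count = parser.parse("<p>Hello <b>world</b></p>")
```

Fetching the product before calling done_parser matters in this model, because done_parser discards the per-document state.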

HtmlParser#push_tag_handler

push_tag_handler(HtmlTagHandler handler, String tags)

Forces the handler to handle additional tags
(not returned by get_supported_tags).
The handler should already be added to this parser.

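Parameters

handler The handler.
tags List of tags (in the same format as the return value of get_supported_tags). The parser will redirect these tags to handler until pop_tag_handler is called.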

Example

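Imagine you want to parse a pseudo-HTML structure in which a <PARAM> tag
appears inside <MYITEMS> and also inside a second, unrelated enclosing tag.

You obviously cannot use a single tag handler for the <PARAM> tag.
Instead you must use context-sensitive handlers: one for <PARAM> inside
<MYITEMS> and another for <PARAM> inside the other enclosing tag.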

This is the preferred solution:

TAG_HANDLER_BEGIN(MYITEM, "MYITEMS")
    TAG_HANDLER_PROC(tag)
    {
        // …something…

        m_Parser->PushTagHandler(this, "PARAM");
        ParseInner(tag);
        m_Parser->PopTagHandler();

        // …something…
    }
TAG_HANDLER_END(MYITEM)

HtmlParser#pop_tag_handler

pop_tag_handler()

Restores the parser’s state to what it was before the last call to
push_tag_handler.
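The pairing of push_tag_handler and pop_tag_handler behaves like a stack of handler tables. The following plain-Ruby sketch models that idea; StackParser, its register and handler_for helpers, and the use of symbols in place of handler objects are all assumptions for illustration, not part of wxRuby.

```ruby
# Hypothetical model of push/pop semantics; not the wxRuby implementation.
class StackParser
  def initialize
    @handlers = {} # tag name => handler
    @saved = []    # earlier handler tables, most recent last
  end

  # Hypothetical helper standing in for normal handler registration.
  def register(tags, handler)
    tags.split(",").each { |t| @handlers[t] = handler }
  end

  # Saves the current table, then redirects the given tags to handler.
  def push_tag_handler(handler, tags)
    @saved.push(@handlers.dup)
    register(tags, handler)
  end

  # Restores the table saved by the most recent push, if any.
  def pop_tag_handler
    @handlers = @saved.pop unless @saved.empty?
  end

  def handler_for(tag)
    @handlers[tag]
  end
end

parser = StackParser.new
parser.register("PARAM", :generic_param)

parser.push_tag_handler(:myitems_param, "PARAM,ITEM")
during = parser.handler_for("PARAM")

parser.pop_tag_handler
after = parser.handler_for("PARAM")
item_after = parser.handler_for("ITEM")
```

Saving a copy of the whole table makes nested push calls unwind in reverse order, which matches the "state before last call" wording above.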

HtmlParser#set_fs

set_fs(FileSystem fs)

Sets the virtual file system that will be used to request additional
files. (For example, the <IMG> tag handler requests an FSFile with the
image data.)

HtmlParser#stop_parsing

stop_parsing()

Call this function to interrupt parsing from a tag handler. No more tags
will be parsed afterward. This function may only be called from
HtmlParser#parse or any function called
by it (i.e. from tag handlers).
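As a rough plain-Ruby model of this behaviour, the parser can be pictured as a loop that checks a stop flag before handling each tag. StoppableParser and the block used as a tag handler are hypothetical and exist only for this sketch; wxRuby is not involved.

```ruby
# Hypothetical model of stop_parsing; not the wxRuby implementation.
class StoppableParser
  attr_reader :handled

  # The block plays the role of a tag handler and receives (parser, tag).
  def initialize(&on_tag)
    @on_tag = on_tag
    @stop = false
    @handled = []
  end

  # Meant to be called from inside the handler block during parse.
  def stop_parsing
    @stop = true
  end

  def parse(tags)
    @stop = false
    tags.each do |tag|
      break if @stop # no more tags are handled once a stop is requested
      @handled << tag
      @on_tag.call(self, tag)
    end
    @handled
  end
end

parser = StoppableParser.new { |p, tag| p.stop_parsing if tag == "BODY" }
handled = parser.parse(%w[HTML HEAD BODY P P])
```

The tag whose handler requests the stop is still handled in full; only the tags after it are skipped.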

[This page automatically generated from the Textile source at 2023-06-03 08:07:43 +0000]