1. Introduction
1.1 General
The StringParser_BBCode class makes it possible to parse strings containing BBCode and convert them to, for example, HTML code. BBCode is a kind of markup "language" with which one may structure and format text. It is similar to HTML but uses square brackets instead of angle brackets. Another difference is that invalid BBCode is simply ignored, whereas the validity of the code matters when using HTML.
Here is an example of a text that was structured with BBCode:
This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!
This text could now be converted to HTML:
This is a <b>bold <i>text in italics</i> that has no meaning</b>!
This would look like:
This is a bold text in italics that has no meaning!
The simplest way to convert the BBCode to HTML would be to replace [b], [i], [/b] and [/i] with <b>, <i>, </b> and </i>. This would work fine for the example above, but it would cause problems if somebody mistyped the BBCode. An example:
This is a [b]bold [i]text in italics[/io] which has no meaning[/b]!
The author of the text slipped while typing the [/i], also hitting the o key and therefore producing [/io] instead of [/i]. With the simple replacements mentioned above, the following HTML code would be generated:
This is a <b>bold <i>text in italics[/io] which has no meaning</b>!
This is invalid HTML since the elements are not correctly nested and the <i> element is never closed.
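The failure can be reproduced with a few lines of Python. This is a sketch of the naive replacement approach only, not of how StringParser_BBCode works; the names REPLACEMENTS and naive_convert are made up for illustration.

```python
# Naive tag-for-tag replacement. Hypothetical helper, not part of the class.
REPLACEMENTS = {
    "[b]": "<b>",
    "[/b]": "</b>",
    "[i]": "<i>",
    "[/i]": "</i>",
}


def naive_convert(text: str) -> str:
    """Replace each known BBCode tag with its HTML counterpart, ignoring nesting."""
    for bbcode, html in REPLACEMENTS.items():
        text = text.replace(bbcode, html)
    return text


good = "This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!"
bad = "This is a [b]bold [i]text in italics[/io] which has no meaning[/b]!"

print(naive_convert(good))  # well-formed HTML
print(naive_convert(bad))   # <i> is opened but never closed
```

Because the replacement never looks at which elements are open, it has no way to notice that the mistyped [/io] left the <i> element dangling.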
There are other approaches to converting BBCode that use regular expressions to make sure every element is closed correctly. But those approaches cannot ensure the correct nesting of the elements.
This is why this class takes a different approach. The text is read character by character and converted into a tree structure. Only after the complete text has been converted to the tree is the tree itself converted to HTML. The text
This is a [b]bold [i]text in italics[/i] that has no meaning[/b]!
would be converted into the following tree structure:
- Text: "This is a "
- Element: [b]
  - Text: "bold "
  - Element: [i]
    - Text: "text in italics"
  - Text: " that has no meaning"
- Text: "!"
Now if a text such as the one above, where the [/i] is missing, is to be converted, the class will notice this at the [/b]. This is because the [/b] appears while the class is still waiting for a [/i], since it knows that the [i] is still open. The programmer using the class may choose between two ways of handling this. The first option is to declare the [i] invalid, append it as plain text just after "bold" and continue parsing at that location. The second option is to imply a [/i] directly before the [/b] in order to close both elements. What the class does not do is guess that [/io] could have meant [/i]: that would lead it to make errors elsewhere, "correcting" input that was not actually wrong.
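The tree approach and the two recovery options can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the class's implementation: it tokenises with a regular expression instead of reading character by character, it knows only [b] and [i], it does no HTML escaping, and option 1 is approximated by splicing the invalid element's content into its parent rather than re-parsing. All names (Node, parse, to_html, imply_close) are hypothetical.

```python
import re
from dataclasses import dataclass, field

TAG = re.compile(r"\[(/?)([a-z]+)\]")
KNOWN = {"b", "i"}  # codes this sketch recognises (an assumption)


@dataclass
class Node:
    name: str  # empty string marks the root
    children: list = field(default_factory=list)  # items are str or Node


def add_text(node, text):
    """Append text to node, merging it with a preceding text child."""
    if not text:
        return
    if node.children and isinstance(node.children[-1], str):
        node.children[-1] += text
    else:
        node.children.append(text)


def demote(stack):
    """Option 1: declare the innermost open element invalid, keep its tag as text."""
    node = stack.pop()
    parent = stack[-1]
    parent.children.pop()  # an open element is always its parent's last child
    add_text(parent, f"[{node.name}]")
    for child in node.children:
        if isinstance(child, str):
            add_text(parent, child)
        else:
            parent.children.append(child)


def parse(text, imply_close):
    """Build the tree; imply_close=True selects option 2 (implied end tags)."""
    root = Node("")
    stack = [root]
    pos = 0
    for match in TAG.finditer(text):
        add_text(stack[-1], text[pos:match.start()])
        pos = match.end()
        closing, name = match.group(1) == "/", match.group(2)
        if name not in KNOWN:
            add_text(stack[-1], match.group(0))  # e.g. [/io] stays plain text
        elif not closing:
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
        elif any(open_node.name == name for open_node in stack[1:]):
            while stack[-1].name != name:  # inner elements are still open
                if imply_close:
                    stack.pop()
                else:
                    demote(stack)
            stack.pop()
        else:
            add_text(stack[-1], match.group(0))  # end tag without start tag
    add_text(stack[-1], text[pos:])
    while len(stack) > 1:  # elements still open at the end of the input
        if imply_close:
            stack.pop()
        else:
            demote(stack)
    return root


def to_html(node):
    inner = "".join(c if isinstance(c, str) else to_html(c) for c in node.children)
    return f"<{node.name}>{inner}</{node.name}>" if node.name else inner


bad = "This is a [b]bold [i]text in italics[/io] which has no meaning[/b]!"
print(to_html(parse(bad, imply_close=False)))
print(to_html(parse(bad, imply_close=True)))
```

Either way the output is correctly nested, and in neither case is [/io] reinterpreted as [/i]; it simply remains text.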
The class itself does not prescribe which codes are looked for. [b] and [i] are popular examples and are therefore quoted here. The class does, however, make it possible to define your own codes as long as they occur in square brackets. The following chapter shows how to define your own codes.
In addition to plain tags it is also possible to use attributes, as in HTML. For example:
This is a [b strength=really_bold]really bold text[/b]!
The class detects the following attribute syntaxes:
- [code attribute=value], [code attribute="value"], [code attribute='value']
  This is the form most similar to HTML. Furthermore, it is the only form that allows setting more than one attribute at the same time. The expression in front of the equals sign is the attribute name and the expression after it is the attribute value. If a value is put in double or single quotes, spaces and a closing square bracket (]) are also allowed inside the attribute value. If the attribute value should contain a quote character itself, that character must be escaped with \. Example: [code attribute="value ] we are still inside the value\" yes, this also belongs to the value"]
- [code=value], [code = value], [code="value"], [code='value']
  In this form it is possible to set only one attribute. This attribute always has the name default. The syntax [code=value] is identical to [code default=value]. This syntax is very similar to classical BBCode.
- [code:value], [code: value]
  This is another possible syntax; the attribute name is also default here.
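To make the three syntaxes concrete, here is a rough Python sketch that splits an already isolated tag into its name and attributes. It is an approximation for illustration only: the regular expressions, the allowed characters in names, and the function names parse_tag and unquote are assumptions, and the class's actual rules may differ in detail.

```python
import re

# A value is double-quoted, single-quoted (both with \ escapes) or a bare word.
VALUE = r'''"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[^\s"']+'''
NAME_RE = re.compile(r"\s*([A-Za-z*]+)")
DEFAULT_RE = re.compile(rf"\s*[=:]\s*({VALUE})\s*$")   # [code=value], [code: value]
PAIR_RE = re.compile(rf"\s+([A-Za-z_]+)=({VALUE})")     # [code attribute=value]


def unquote(raw):
    """Strip surrounding quotes and resolve backslash escapes."""
    if raw[0] in "\"'" and raw[-1] == raw[0]:
        return re.sub(r"\\(.)", r"\1", raw[1:-1])
    return raw


def parse_tag(tag):
    """Split '[name ...]' into (name, attributes); return None if malformed."""
    body = tag[1:-1]
    match = NAME_RE.match(body)
    if not match:
        return None
    name, rest = match.group(1), body[match.end():]
    if not rest.strip():
        return name, {}
    default = DEFAULT_RE.match(rest)
    if default:
        return name, {"default": unquote(default.group(1))}
    attributes = {}
    pos = 0
    while pos < len(rest):
        pair = PAIR_RE.match(rest, pos)
        if not pair:
            return None
        attributes[pair.group(1)] = unquote(pair.group(2))
        pos = pair.end()
    return name, attributes


print(parse_tag("[b strength=really_bold]"))
print(parse_tag("[code = value]"))
print(parse_tag("[code: value]"))
```

The sketch assumes the end of the tag has already been found. Locating it is the harder part in practice, because a quoted value may itself contain a closing square bracket.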
1.2 Nesting
As seen above, elements must be nested correctly, and the class guarantees this. Nevertheless, that check is only a formal one. The following example shows the problem clearly:
[b]This is a list:
[list]
[*] List item
[*] List item
[/list]
[/b]
If this is converted to HTML the following would be the output:
<b>This is a list:
<ul>
<li> List item</li>
<li> List item</li>
</ul>
</b>
This HTML code is indeed formally correctly nested, but in HTML the <b> element must not contain a <ul> element, so the result is invalid HTML as well. Because of this it is possible to tell the class which element may contain which other element. For this purpose there are so-called content types. Every element is assigned a content type. Furthermore, for each element it is possible to specify the content types inside which it is allowed. An example:
[a][b][c]Text[/c][/b][/a]
In this example the [b] element is inside the [a] element and the [c] element is inside the [b] element. The following tree would be created:
- Element: [a]
  - Element: [b]
    - Element: [c]
      - Text: "Text"
Now we assign a content type to each element. The [a] element receives the content type alpha, the [b] element the content type beta and the [c] element the content type gamma. To make sure the parser converts every element, the [b] element must be allowed inside the alpha content type, because this is the content type of the [a] element, inside which the [b] element resides. In exactly the same manner the [c] element must be allowed inside the beta content type, because that is the content type of the [b] element, inside which the [c] element resides. But the [c] element need not be allowed inside the alpha content type, because only the directly enclosing level is relevant.
If no element has been opened yet, the so-called root content type is applied. This content type is block by default but it can be changed. Have a look at the chapter on parser functions, in which content types play another role.
Besides specifying the content types in which an element is allowed, it is also possible to forbid an element inside certain content types. A link inside another link, for instance, would not be reasonable. Forbidding a link directly inside another link is easy with the list of allowed content types. However, putting another element in between works around that list. Example:
[link][b][link]Text[/link][/b][/link]
There is no reason to forbid [b] inside of [link], and there is also no reason to forbid [link] inside of [b], but there certainly is a reason to forbid this construction. This is where the list of disallowed content types comes in. This list is applied to all levels, whereas the list of allowed content types is only applied to the innermost level. With this method it is possible to prevent constructions like the one above.
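The interplay of the two lists can be sketched in Python. This is an illustrative model, not the class's API: the Code record, the content type names inline and link, and the function may_open are assumptions, and the sketch assumes the root content type is not consulted for the disallowed list.

```python
from dataclasses import dataclass

ROOT_CONTENT_TYPE = "block"  # the default root content type named in the text


@dataclass(frozen=True)
class Code:
    content_type: str                      # content type this element provides
    allowed_in: frozenset                  # checked against the innermost level only
    not_allowed_in: frozenset = frozenset()  # checked against every open level


# Hypothetical definitions for the [link][b][link] example.
CODES = {
    "b": Code("inline", frozenset({"block", "inline", "link"})),
    "link": Code("link", frozenset({"block", "inline"}), frozenset({"link"})),
}


def may_open(name, open_elements):
    """Decide whether element `name` may start inside the currently open elements."""
    code = CODES[name]
    ancestors = [CODES[n].content_type for n in open_elements]
    current = ancestors[-1] if ancestors else ROOT_CONTENT_TYPE
    if current not in code.allowed_in:
        return False  # allowed list: only the directly enclosing level counts
    # disallowed list: any enclosing level may veto the element
    return not any(t in code.not_allowed_in for t in ancestors)


print(may_open("link", []))             # link at the root
print(may_open("b", ["link"]))          # [b] inside [link]
print(may_open("link", ["link", "b"]))  # [link] inside [b] inside [link]
```

With only the allowed list, the third call would succeed because the innermost content type is inline; the disallowed list is what rejects the nested link.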
1.3 Special codes
Sometimes it can be useful to deactivate code detection for a short stretch of text. Many forums offer the [code] element, in which portions of source code can be marked up and the parsing of [b] and similar codes is suppressed. That part may only be terminated by [/code]. The class provides a very simple way to achieve this behaviour:
[code]
// this would be example code that replaces
the [b] bbcode:
// ...[/code]
In this example the [b] should certainly not be converted, since it is part of the source code that is to be shown literally. For this there is a so-called processing type, usecontent, that causes the class to look only for the end tag of this specific element and ignore every other code.
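The idea behind usecontent can be illustrated with a short Python sketch: once the start tag is seen, the only thing searched for is the matching end tag. The function name split_usecontent is hypothetical, and case-sensitive, attribute-free tags are an assumption made to keep the example small.

```python
def split_usecontent(text, name="code"):
    """Return (before, raw_content, after), or None if the element is missing or unclosed."""
    open_tag, close_tag = f"[{name}]", f"[/{name}]"
    start = text.find(open_tag)
    if start == -1:
        return None
    content_start = start + len(open_tag)
    # Everything up to the end tag is taken literally; other codes are not inspected.
    end = text.find(close_tag, content_start)
    if end == -1:
        return None
    return text[:start], text[content_start:end], text[end + len(close_tag):]


sample = "See: [code]replace the [b] bbcode[/code] done"
print(split_usecontent(sample))
```

The [b] inside the element is returned as part of the raw content and never reaches the code detection.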