com.goodeast.webobjects.switchablestring
Class SSHTMLFilter

java.lang.Object
  |
  +--com.goodeast.webobjects.switchablestring.SSHTMLFilter
Direct Known Subclasses:
SSAllowedTagsFilter, SSDisallowedTagsFilter, SSNoTagsAllowedFilter

public abstract class SSHTMLFilter
extends java.lang.Object

Implements the base functionality for a scheme that filters out HTML tags based on a test. Subclasses must implement the method tagShouldBeEscaped(), which takes a String containing the tag and returns true if the tag should be escaped using the &lt; HTML entity, or false if it should be left as is.
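As a rough illustration of the subclassing contract described above, the sketch below defines a hypothetical allow-list filter. The real class is not reproduced here, so FilterBase is a stand-in that is assumed to mirror only the abstract method. AllowListFilter and its tag set are invented for the example, and the argument is assumed to be the tag name (the tag's first word).

```java
import java.util.Locale;
import java.util.Set;

// Stand-in for SSHTMLFilter: only the abstract hook is mirrored here (assumption).
abstract class FilterBase {
    abstract boolean tagShouldBeEscaped(String tag);
}

// Hypothetical subclass: escapes every tag whose name is not on the allow list.
final class AllowListFilter extends FilterBase {
    // Allowed tag names, expected in upper case (for example "B", "I").
    private final Set<String> allowed;

    AllowListFilter(Set<String> allowed) {
        this.allowed = allowed;
    }

    @Override
    boolean tagShouldBeEscaped(String tag) {
        // Tag names are compared case-insensitively, so "b" and "B" both match "B".
        return !allowed.contains(tag.toUpperCase(Locale.ROOT));
    }
}
```

With an allow list of B and I, this sketch would report that "b" is fine and that "script" should be escaped.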


Nested Class Summary
static class SSHTMLFilter.Factory
          Factory class that implements a caching scheme to avoid the overhead of creating a new object each time a new filter is desired with the same allowed or disallowed list.
 
Constructor Summary
protected SSHTMLFilter()
          Used by SSHTMLFilter subclasses.
 
Method Summary
 java.lang.String filterForAllowedHTML(java.lang.String rawString)
          This is a complex method that parses the raw string looking for tags.
abstract  boolean tagShouldBeEscaped(java.lang.String tag)
          Method called to determine if a particular tag should be escaped or not.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SSHTMLFilter

protected SSHTMLFilter()
Used by SSHTMLFilter subclasses.

Method Detail

filterForAllowedHTML

public java.lang.String filterForAllowedHTML(java.lang.String rawString)
This is a complex method that parses the raw string looking for tags. Each tag that is found is passed to tagShouldBeEscaped() to determine whether it will be escaped. At its core it is a LALR(2) parser built on a three-item FIFO buffer, plus a stack that tracks which tags are currently open so that they can be closed at the end of the string if needed. The state machine has two states: outside a tag and inside a tag. It handles HTML comments and declarations (anything that starts with "<!") as well as regular tags. The raw string is split into tokens on the characters "<", "/", "!", and ">".
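The splitting step can be sketched with the standard java.util.StringTokenizer, which can return the delimiters as tokens of their own. Whether the actual implementation uses StringTokenizer is an assumption; the TagTokenizer class and its tokenize method are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper illustrating the tokenizing step only.
final class TagTokenizer {
    // Splits on "<", "/", "!" and ">" and keeps each delimiter as its own token.
    static List<String> tokenize(String raw) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes the tokenizer hand back the delimiters as well.
        StringTokenizer st = new StringTokenizer(raw, "</!>", true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }
}
```

For the input "a <b>hi</b>", this yields the tokens "a ", "<", "b", ">", "hi", "<", "/", "b", ">".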

The algorithm in detail is:
three-token FIFO buffer A1, A2, A3
stack B
output buffer result
tokenize the raw string, splitting on "<", "/", "!", and ">"; include the split characters as tokens
for each token
    append A3 to result
    shift A2 to A3
    shift A1 to A2
    load next token into A1

    if inside a tag
        if A1 == ">"
            change state to outside a tag
    else if outside a tag
        if A2 == "<" and A1 != "/"
            if A1.firstWord in disallowedList or A1.firstWord not in allowedList
                change A2 to "&lt;"
            else if A1.firstWord is not BR or LI
                push A1.firstWord onto B
            change state to inside a tag
        if A3 == "<" and A2 == "/"
            if A1.firstWord in disallowedList or A1.firstWord not in allowedList
                change A3 to "&lt;"
            else if B.top == A1.firstWord
                pop B
            change state to inside a tag

After all tokens have been processed, the edge cases at the end of the string are handled:

if outside a tag
    if A1 == "<"
        convert A1 to "&lt;"
    if A2 == "<" and A1 == "/"
        convert A2 to "&lt;"
    if A2 == "<" and A1 == "!"
        convert A2 to "&lt;"
flush A3, A2, and A1 to result, oldest first

if inside a tag
    append ">" to result
pop elements off of B and create close tags for them

Note: some tags (like <BR>) don't have a closing tag, but these get closed off anyway, since it doesn't hurt in general and it makes the algorithm much simpler. BR and LI are special cased, since they are in the Slashdot set of allowed tags and we're going to recommend that set to developers.
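The final step, closing whatever is still open on the stack, can be sketched as follows. This is a minimal illustration and not the class's actual code: OpenTagCloser and closeOpenTags are hypothetical names, and java.util.ArrayDeque stands in for stack B.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper illustrating the end-of-string cleanup only.
final class OpenTagCloser {
    // Pops every remaining tag name and appends a matching close tag, innermost first.
    static String closeOpenTags(Deque<String> openTags) {
        StringBuilder out = new StringBuilder();
        while (!openTags.isEmpty()) {
            out.append("</").append(openTags.pop()).append(">");
        }
        return out.toString();
    }
}
```

If B was pushed and then I, the sketch produces "</I></B>", so the most recently opened tag is closed first and the output stays properly nested.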

Parameters:
rawString - String containing possible undesired and/or malformed HTML
Returns:
String with undesired HTML tags escaped out

tagShouldBeEscaped

public abstract boolean tagShouldBeEscaped(java.lang.String tag)
Method called to determine if a particular tag should be escaped or not. Subclasses must override this method to provide the correct behavior.

Parameters:
tag - Tag to be checked
Returns:
true if the tag is undesired and should be escaped, false if it is OK