public final class HtmlUtils extends Object
Because the HtmlParser will be open-sourced, these utilities are kept in this package rather than relying on equivalents that may exist in the google3 code base.
The functionality exposed is designed to be 100% compatible with the corresponding logic in the C version of the HtmlParser, so cross-language compatibility is a particular concern.
Note: The words Javascript and ECMAScript are used interchangeably unless otherwise noted.
Modifier and Type | Class and Description |
---|---|
static class | HtmlUtils.META_REDIRECT_TYPE: Indicates the type of content contained in the content HTML attribute of the meta HTML tag. |
Modifier and Type | Method and Description |
---|---|
static String | encodeCharForAscii(char chr): Encodes the specified character using Ascii for convenient insertion into a single-quote enclosed String. |
static boolean | isAttributeJavascript(String attribute): Determines if the HTML attribute specified expects javascript for its value. |
static boolean | isAttributeStyle(String attribute): Determines if the HTML attribute specified expects a style for its value. |
static boolean | isAttributeUri(String attribute): Determines if the HTML attribute specified expects a URI for its value. |
static boolean | isHtmlSpace(char chr): Determines if the specified character is an HTML whitespace character. |
static boolean | isJavascriptIdentifier(char chr): Determines if the specified character is a valid character in an ECMAScript identifier. |
static boolean | isJavascriptRegexpPrefix(String input): Determines if the input token provided is a valid token prefix to a javascript regular expression. |
static boolean | isJavascriptWhitespace(char chr): Determines if the specified character is an ECMAScript whitespace or line terminator character. |
static HtmlUtils.META_REDIRECT_TYPE | parseContentAttributeForUrl(String value): Parses the given String to determine if it contains a URL in the format followed by the content attribute of the meta HTML tag. |
public static boolean isAttributeJavascript(String attribute)
Determines if the HTML attribute specified expects javascript for its value, as is the case for the onclick attribute.
Currently returns true for any attribute name that starts with "on". This is not exactly correct, but we trust developers not to use non-spec-compliant attribute names (e.g. onbogus).
Parameters: attribute - the name of an HTML attribute
Returns: false if the input is null or is not an attribute that expects javascript code; true otherwise
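The prefix rule described for isAttributeJavascript can be sketched roughly as follows. The class name AttributeJavascriptSketch is hypothetical, and the case-sensitive comparison is an assumption, since the text does not say how letter case is handled.

```java
/** Hypothetical stand-in that mirrors the documented rule; not the HtmlUtils source. */
final class AttributeJavascriptSketch {

    /** Returns true for any non-null attribute name that starts with "on". */
    static boolean isAttributeJavascript(String attribute) {
        // Assumption: the comparison is case-sensitive and the name is already trimmed.
        return attribute != null && attribute.startsWith("on");
    }

    public static void main(String[] args) {
        System.out.println(isAttributeJavascript("onclick")); // true
        System.out.println(isAttributeJavascript("onbogus")); // true: the prefix rule over-matches
        System.out.println(isAttributeJavascript("href"));    // false
        System.out.println(isAttributeJavascript(null));      // false
    }
}
```

The onbogus case shows the trade-off the text accepts: a prefix test is cheap but cannot tell real event handlers from invented names.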
public static boolean isAttributeStyle(String attribute)
Determines if the HTML attribute specified expects a style for its value. Currently this is only true for the style HTML attribute.
Parameters: attribute - the name of an HTML attribute
Returns: true iff the attribute name is one that expects a style for a value; otherwise false
public static boolean isAttributeUri(String attribute)
Determines if the HTML attribute specified expects a URI for its value. For example, both href and src expect a URI but style does not. Returns false if the attribute given was null.
Parameters: attribute - the name of an HTML attribute
Returns: true if the attribute name is one that expects a URI for a value; otherwise false
See Also: ATTRIBUTE_EXPECTS_URI
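A set lookup is one plausible way to implement such a check. The sketch below is an assumption throughout: the class name AttributeUriSketch is hypothetical, and the set contains only the two attributes named in the text (href and src), whereas the set behind ATTRIBUTE_EXPECTS_URI is presumably larger.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for illustration; not the HtmlUtils source. */
final class AttributeUriSketch {

    // Assumption: only the attributes mentioned in the text are listed here.
    private static final Set<String> URI_ATTRIBUTES =
        new HashSet<>(Arrays.asList("href", "src"));

    /** Returns true if the attribute name is in the set; false for null or unknown names. */
    static boolean isAttributeUri(String attribute) {
        return attribute != null && URI_ATTRIBUTES.contains(attribute);
    }
}
```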
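public static boolean isHtmlSpace(char chr)
Determines if the specified character is an HTML whitespace character. The characters treated as HTML whitespace are:
- Space character
- Tab character
- Line feed character
- Carriage Return character
- Zero-Width Space character (U+200B), which is not included in the C version
Parameters: chr - the char to check
Returns: true if the character is an HTML whitespace character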
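The character list above maps naturally onto a switch statement. This is a minimal sketch under that reading, with a hypothetical class name (HtmlSpaceSketch); it is not the HtmlUtils source.

```java
/** Hypothetical stand-in that follows the documented character list. */
final class HtmlSpaceSketch {

    /** Returns true only for the five characters listed in the documentation. */
    static boolean isHtmlSpace(char chr) {
        switch (chr) {
            case ' ':    // Space
            case '\t':   // Tab
            case '\n':   // Line feed
            case '\r':   // Carriage Return
            case 0x200B: // Zero-Width Space (the text says the C version omits this one)
                return true;
            default:
                return false;
        }
    }
}
```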
See Also: White space
public static boolean isJavascriptWhitespace(char chr)
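Determines if the specified character is an ECMAScript whitespace or line terminator character:
- Whitespace: Tab, Vertical Tab, Form Feed, Space, No-break space
- Line terminators: Line Feed, Carriage Return, Line separator, Paragraph Separator
Encompasses the characters in sections 7.2 and 7.3 of ECMAScript 3. In particular, this list is quite different from that in Character.isWhitespace.
See the ECMAScript Language Specification.
Parameters: chr - the char to check
Returns: true if the character is an ECMAScript whitespace or line terminator character; otherwise false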
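To make the difference from Character.isWhitespace concrete, here is a minimal sketch that accepts exactly the nine characters named in the text. The class name JavascriptWhitespaceSketch is hypothetical, and the sketch follows the documented list only; it does not try to cover other Unicode space separators.

```java
/** Hypothetical stand-in that follows the documented character list. */
final class JavascriptWhitespaceSketch {

    /** Returns true for the whitespace and line terminator characters listed in the text. */
    static boolean isJavascriptWhitespace(char chr) {
        switch (chr) {
            // Whitespace (ECMAScript 3, section 7.2)
            case '\t':   // Tab
            case 0x0B:   // Vertical Tab
            case '\f':   // Form Feed
            case ' ':    // Space
            case 0xA0:   // No-break space
            // Line terminators (ECMAScript 3, section 7.3)
            case '\n':   // Line Feed
            case '\r':   // Carriage Return
            case 0x2028: // Line separator
            case 0x2029: // Paragraph Separator
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        char noBreakSpace = (char) 0xA0;
        // Character.isWhitespace excludes the no-break space, so the two checks disagree here.
        System.out.println(isJavascriptWhitespace(noBreakSpace)); // true
        System.out.println(Character.isWhitespace(noBreakSpace)); // false
    }
}
```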
public static boolean isJavascriptIdentifier(char chr)
Determines if the specified character is a valid character in an ECMAScript identifier. Character.isJavaIdentifierStart and Character.isJavaIdentifierPart are an alternative, given that Java and Javascript follow similar identifier naming rules, but using them would lose compatibility with the C version.
Parameters: chr - char to check
Returns: true if chr is a valid character in an ECMAScript identifier; otherwise false
public static boolean isJavascriptRegexpPrefix(String input)
Determines if the input token provided is a valid token prefix to a javascript regular expression. The token is looked up in a Set of identifiers that can precede a regular expression in the javascript grammar, and the method returns true if the provided String is in that Set.
Parameters: input - the String token to check
Returns: true iff the token is a valid prefix of a regexp
public static String encodeCharForAscii(char chr)
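Encodes the specified character using Ascii for convenient insertion into a single-quote enclosed String. Printable characters are returned as-is, with the exception of backslash and single quote, which are backslash-escaped along with Carriage Return, Line Feed and Horizontal Tab. All other characters are returned hex-encoded.
Parameters: chr - char to encode
Returns: the encoded character as a String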
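The three encoding cases can be sketched as below. Two points are assumptions because the text does not specify them: "printable" is taken to mean the ASCII range 0x20 to 0x7E, and the hex encoding is written as a backslash, the letter u, and four lowercase hex digits. The class name EncodeCharSketch is hypothetical.

```java
/** Hypothetical stand-in for illustration; not the HtmlUtils source. */
final class EncodeCharSketch {

    /** Encodes one character for insertion into a single-quoted string literal. */
    static String encodeCharForAscii(char chr) {
        switch (chr) {
            case '\'': return "\\'";   // single quote
            case '\\': return "\\\\";  // backslash
            case '\n': return "\\n";   // Line Feed
            case '\r': return "\\r";   // Carriage Return
            case '\t': return "\\t";   // Horizontal Tab
            default:
                // Assumption: printable means ASCII 0x20 through 0x7E.
                if (chr >= 0x20 && chr <= 0x7E) {
                    return String.valueOf(chr);
                }
                // Assumption: four-digit lowercase hex behind a backslash-u prefix.
                return String.format("\\u%04x", (int) chr);
        }
    }
}
```

The special cases are tested first, so the single quote and backslash never reach the printable-range branch.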
public static HtmlUtils.META_REDIRECT_TYPE parseContentAttributeForUrl(String value)
Parses the given String to determine if it contains a URL in the format followed by the content attribute of the meta HTML tag.
This function expects to receive the value of the content HTML attribute. This attribute takes on different meanings depending on the value of the http-equiv HTML attribute of the same meta tag. Since we may not have access to the http-equiv attribute, we instead rely on parsing the given value to determine if it contains a URL.
The specification of the meta HTML tag can be found in: http://dev.w3.org/html5/spec/Overview.html#attr-meta-http-equiv-refresh
We return HtmlUtils.META_REDIRECT_TYPE indicating whether the value contains a URL and whether we are at the start of the URL or past the start. We are at the start of the URL if and only if one of the two conditions below is true:
Examples:
- A meta tag where the content attribute contains a URL [we are not at the start of the URL]:
  <meta http-equiv="refresh" content="5; URL=http://www.google.com">
- A meta tag where the content attribute contains a URL [we are at the start of the URL]:
  <meta http-equiv="refresh" content="5; URL=">
- A meta tag where the content attribute does not contain a URL:
  <meta http-equiv="content-type" content="text/html">
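The three examples can be reproduced with a small pattern match. Everything in this sketch is an assumption: the class MetaRefreshSketch and the enum RedirectType with its constants are hypothetical (the constants of HtmlUtils.META_REDIRECT_TYPE are not listed here), the accepted syntax is a guess at "optional delay, semicolon, URL=", and only the case where nothing follows "URL=" is treated as the start of the URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the HtmlUtils source. */
final class MetaRefreshSketch {

    /** Hypothetical result type standing in for HtmlUtils.META_REDIRECT_TYPE. */
    enum RedirectType { NONE, URL_START, URL }

    // Assumption: optional delay digits, a semicolon, then a case-insensitive "URL=" marker.
    private static final Pattern REFRESH = Pattern.compile(
        "^\\s*\\d*\\s*;\\s*URL\\s*=\\s*(.*)$",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Classifies a content attribute value as no URL, start of a URL, or inside a URL. */
    static RedirectType classify(String value) {
        if (value == null) {
            return RedirectType.NONE;
        }
        Matcher m = REFRESH.matcher(value);
        if (!m.matches()) {
            return RedirectType.NONE;
        }
        // Nothing after "URL=" means no URL characters have been seen yet.
        return m.group(1).isEmpty() ? RedirectType.URL_START : RedirectType.URL;
    }
}
```

Classifying on the value alone matches the approach described earlier, where the http-equiv attribute may not be available to the caller.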
Parameters: value - String to parse
Returns: HtmlUtils.META_REDIRECT_TYPE indicating the presence of a URL in the given value
Copyright © 2010–2015 Google. All rights reserved.