public class Tokenizer extends Lookahead<Token>
Turns the characters supplied by a Reader into a stream of Token, supporting lookahead.
Reads from the given input and parses it into a stream of tokens. By default all token types defined by Token are supported. Most of the features can be further tweaked by changing the default settings.
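By default the tokenizer operates as follows:
- Whitespace characters (see Char.isWhitepace()) are skipped.
- If the current character is a digit (see Char.isDigit()), a number is read. Also, if the current character is a '-' and the next is a digit, we try to read a number.
- If the current character is a letter (see Char.isLetter()), an ID is read. Once this is complete, check if the ID matches one of the supplied keywords, and convert if necessary.
- After a special id starter, all characters accepted by isIdentifierChar(Char) are consumed and returned as SPECIAL_ID.
Modifier and Type | Field and Description |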
---|---|
protected LookaheadReader | input |
Fields inherited from class Lookahead: endOfInputIndicator, endReached, itemBuffer, problemCollector
Constructor and Description |
---|
Tokenizer(Reader input): Creates a new tokenizer for the given input. |
Modifier and Type | Method and Description |
---|---|
void | addError(Position pos, String message, Object... parameters): Adds a parse error to the internal problem collector. |
void | addKeyword(String keyword): Adds a keyword which is recognized by the tokenizer from now on. |
void | addSpecialIdStarter(char character): Adds character as a special id starter. |
void | addSpecialIdTerminator(char character): Adds character as a special id terminator. |
void | addStringDelimiter(char stringDelimiter, char escapeCharacter): Adds a new string delimiter character along with the character used to escape characters within the string. |
void | addUnescapedStringDelimiter(char stringDelimiter): Boilerplate method for adding a string delimiter which does not support escape sequences. |
void | addWarning(Position pos, String message, Object... parameters): Adds a warning to the internal problem collector. |
boolean | atEnd(): Boilerplate method for current().isEnd(). |
protected boolean | canConsumeThisString(String string, boolean consume): Checks if the next characters, starting from the current, match the given string. |
void | clearStringDelimiters(): Removes all previously registered string delimiters. |
void | consumeExpectedKeyword(String keyword): Consumes the current token, expecting it to be a KEYWORD with the given content. |
void | consumeExpectedSymbol(String symbol): Consumes the current token, expecting it to be a SYMBOL with the given content. |
protected Token | endOfInput(): Creates the end of input indicator item. |
protected Token | fetch(): Fetches the next item from the stream. |
protected Token | fetchId(): Reads and returns an identifier. |
protected Token | fetchNumber(): Reads and returns a number. |
protected Token | fetchSpecialId(): Reads and returns a special id. |
protected Token | fetchString(): Reads and returns a string constant. |
protected Token | fetchSymbol(): Reads and returns a symbol. |
String | getBlockCommentEnd(): Returns the string which ends a block comment. |
String | getBlockCommentStart(): Returns the string which starts a block comment. |
char | getDecimalSeparator(): Returns the decimal separator used in decimal numbers. |
char | getEffectiveDecimalSeparator(): Returns the decimal separator used in the content of DECIMAL tokens. |
char | getGroupingSeparator(): Returns the grouping separator which can be used in numbers to group digits. |
String | getLineComment(): Returns the string which starts a line comment. |
protected Token | handleKeywords(Token idToken): Checks if the given identifier is a keyword and returns an appropriate Token. |
protected boolean | handleStringEscape(char separator, char escapeChar, Token stringToken): Evaluates a string escape like \n. |
protected boolean | isAtBracket(boolean inSymbol): Determines if the underlying input is looking at a bracket. |
protected boolean | isAtEndOfBlockComment(): Checks if the underlying input is looking at an end of block comment. |
protected boolean | isAtStartOfBlockComment(boolean consume): Checks if the underlying input is looking at a start of block comment. |
protected boolean | isAtStartOfIdentifier(): Determines if the underlying input is looking at a valid character to start an identifier. |
protected boolean | isAtStartOfLineComment(boolean consume): Checks if the underlying input is looking at a start of line comment. |
protected boolean | isAtStartOfNumber(): Determines if the underlying input is looking at the start of a number. |
protected boolean | isAtStartOfSpecialId(): Determines if the underlying input is looking at the start of a special id. |
protected boolean | isIdentifierChar(Char current): Determines if the given Char is a valid identifier part. |
boolean | isKeywordsCaseSensitive(): Determines if keywords are case sensitive. |
protected boolean | isSymbolCharacter(Char ch): Determines if the given Char is a symbol character. |
boolean | more(): Boilerplate method for current().isNotEnd(). |
void | setBlockCommentEnd(String blockCommentEnd): Sets the string which ends a block comment. |
void | setBlockCommentStart(String blockCommentStart): Sets the string which starts a block comment. |
void | setDecimalSeparator(char decimalSeparator): Sets the character which is recognized as decimal separator. |
void | setEffectiveDecimalSeparator(char effectiveDecimalSeparator): Sets the decimal separator used in the content of DECIMAL tokens. |
void | setGroupingSeparator(char groupingSeparator): Sets the grouping separator accepted in numbers. |
void | setKeywordsCaseSensitive(boolean keywordsCaseSensitive): Sets the case sensitiveness of keywords. |
void | setLineComment(String lineComment): Sets the string which starts a line comment. |
void | setProblemCollector(List<ParseError> problemCollector): Installs the given problem collector. |
protected void | skipBlockComment(): Skips the input up to the end of the current block comment. |
protected void | skipToEndOfLine(): Reads everything up to (and including) the next line break. |
void | throwOnError(): Throws a ParseException if an error occurred while parsing the input. |
void | throwOnErrorOrWarning(): Throws a ParseException if an error or warning occurred while parsing the input. |
String | toString() |
protected LookaheadReader input
public Tokenizer(Reader input)
input - the input to parse. The reader will be buffered by the implementation so that it can be efficiently read character by character.
public void setProblemCollector(List<ParseError> problemCollector)
Description copied from class: Lookahead
setProblemCollector in class Lookahead<Token>
problemCollector - the problem collector to be used from now on
protected Token endOfInput()
Description copied from class: Lookahead
This method will only be called once, as the indicator is cached.
endOfInput in class Lookahead<Token>
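The lookahead behaviour and the cached end of input indicator can be pictured with a small stand-alone buffer. This is only a sketch under stated assumptions: the class name LookaheadSketch and its next(int) and consume() methods are hypothetical and are not taken from the Lookahead class documented here. Only current(), more() and atEnd() mirror names that appear on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a lookahead stream; not the library's Lookahead class. */
final class LookaheadSketch<T> {
    private final Iterator<T> source;
    private final List<T> buffer = new ArrayList<>();
    private final T endIndicator; // supplied once and reused, i.e. "cached"
    private boolean endReached;

    /** The end indicator must be an object that never occurs in the source. */
    LookaheadSketch(Iterator<T> source, T endIndicator) {
        this.source = source;
        this.endIndicator = endIndicator;
    }

    /** Returns the item at the current position without consuming it. */
    T current() {
        return next(0);
    }

    /** Returns the item offset positions ahead, or the end indicator past the end. */
    T next(int offset) {
        while (buffer.size() <= offset && !endReached) {
            if (source.hasNext()) {
                buffer.add(source.next());
            } else {
                endReached = true;
            }
        }
        return offset < buffer.size() ? buffer.get(offset) : endIndicator;
    }

    /** Removes and returns the current item. */
    T consume() {
        T result = current();
        if (!buffer.isEmpty()) {
            buffer.remove(0);
        }
        return result;
    }

    /** Identity comparison is enough because the same indicator object is always returned. */
    boolean atEnd() {
        return current() == endIndicator;
    }

    boolean more() {
        return !atEnd();
    }

    public static void main(String[] args) {
        LookaheadSketch<String> tokens =
                new LookaheadSketch<>(Arrays.asList("a", "+", "b").iterator(), "<end>");
        while (tokens.more()) {
            System.out.println(tokens.consume() + " (next: " + tokens.current() + ")");
        }
    }
}
```

Because the same indicator object is handed out for every position past the end, callers can peek arbitrarily far ahead without null checks.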
protected Token fetch()
Description copied from class: Lookahead
protected boolean isAtStartOfSpecialId()
By default this is one of the given specialIdStarters.
protected boolean isAtStartOfNumber()
By default this is indicated by a digit, by a '-' followed by a digit, or by a '.' followed by a digit.
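The default rule can be written down as a check on two characters. The following is a minimal sketch working on a plain String, and the class NumberStartSketch and its method signature are hypothetical. The documented method takes no arguments and inspects the underlying input instead.

```java
/** Hypothetical helper mirroring the documented default rule; not the library's code. */
final class NumberStartSketch {

    /** Returns true if a number could start at index pos (assumed non-negative) of input. */
    static boolean isAtStartOfNumber(String input, int pos) {
        if (pos >= input.length()) {
            return false;
        }
        char current = input.charAt(pos);
        if (Character.isDigit(current)) {
            return true;
        }
        // A sign or a leading decimal point only counts when a digit follows.
        boolean nextIsDigit = pos + 1 < input.length()
                && Character.isDigit(input.charAt(pos + 1));
        return (current == '-' || current == '.') && nextIsDigit;
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"42", "-7", ".5", "-x", "a1"}) {
            System.out.println(sample + " -> " + isAtStartOfNumber(sample, 0));
        }
    }
}
```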
protected boolean isAtBracket(boolean inSymbol)
By default all supplied brackets are checked. If treatSinglePipeAsBracket is true, a single '|' is also treated as a bracket.
inSymbol - determines if we're already parsing a symbol or just trying to decide what the next token is
protected boolean canConsumeThisString(String string, boolean consume)
string - the string to check
consume - determines if the matched string should be consumed immediately
protected boolean isAtStartOfLineComment(boolean consume)
If a line comment is detected, any characters indicating this are consumed by this method if consume is true.
consume - determines if the matched comment start should be consumed immediately
protected void skipToEndOfLine()
protected boolean isAtStartOfBlockComment(boolean consume)
If a block comment is detected, any characters indicating this are consumed by this method if consume is true.
consume - determines if the block comment starter is to be consumed if found or not
protected boolean isAtEndOfBlockComment()
If an end of block comment is detected, any characters indicating this are consumed by this method.
protected void skipBlockComment()
protected Token fetchString()
protected boolean handleStringEscape(char separator, char escapeChar, Token stringToken)
The escape character is already consumed. Therefore the input points at the character to escape. This method must consume all escaped characters.
separator - the delimiter of this string constant
escapeChar - the escape character used
stringToken - the resulting string constant
protected boolean isAtStartOfIdentifier()
By default, only letters can start identifiers.
protected Token fetchId()
protected Token handleKeywords(Token idToken)
idToken - the identifier to check
protected boolean isIdentifierChar(Char current)
By default, letters, digits and '_' are valid identifier parts.
current - the character to check
protected Token fetchSpecialId()
protected Token fetchSymbol()
A symbol consists of one or two characters which don't match any other token type. In most cases, these will be operators like + or *.
protected boolean isSymbolCharacter(Char ch)
By default these are all non-control characters which don't match any other class (letter, digit, whitespace).
ch - the character to check
protected Token fetchNumber()
public boolean isKeywordsCaseSensitive()
By default, keywords aren't case sensitive. Therefore True and true are the same keyword.
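One plausible way to get this behaviour is to normalize keywords when they are registered and identifiers when they are looked up. The sketch below assumes exactly that. The class KeywordTableSketch and its lookup method are hypothetical, and the library's internal data structures may differ.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical keyword table; an assumption about how such a lookup could work. */
final class KeywordTableSketch {
    private final boolean caseSensitive;
    private final Map<String, String> keywords = new HashMap<>();

    KeywordTableSketch(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    /** Keys are normalized on insertion, so the mode has to be fixed before adding keywords. */
    private String key(String text) {
        return caseSensitive ? text : text.toLowerCase(Locale.ROOT);
    }

    void addKeyword(String keyword) {
        keywords.put(key(keyword), keyword);
    }

    /** Returns the registered keyword matching the identifier, or null for a plain ID. */
    String lookup(String identifier) {
        return keywords.get(key(identifier));
    }

    public static void main(String[] args) {
        KeywordTableSketch table = new KeywordTableSketch(false);
        table.addKeyword("true");
        System.out.println(table.lookup("True"));  // matches the keyword "true"
        System.out.println(table.lookup("truth")); // no keyword, stays an ID
    }
}
```

Storing normalized keys also shows why a table like this cannot switch its case mode after keywords have been added: the original spelling of the keys would already be lost.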
public void setKeywordsCaseSensitive(boolean keywordsCaseSensitive)
This must be set up before any call to addKeyword(String), as it determines the internal data structures.
keywordsCaseSensitive - true if keywords should be treated as case sensitive, false otherwise (default)
public void addKeyword(String keyword)
By default, detection is case insensitive (see setKeywordsCaseSensitive(boolean)). Only ID tokens (identifiers) are checked against the given keywords, therefore a keyword must be a valid identifier.
keyword - the keyword to be added to the list of known keywords.
public void addSpecialIdStarter(char character)
character - the character to be added as special id starter
public void addSpecialIdTerminator(char character)
character - the character to be added as special id terminator
public void clearStringDelimiters()
By default " and ' are registered as string delimiters, where strings enclosed by " can have characters escaped by \.
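To illustrate the two default delimiters, here is a minimal sketch that reads a string constant from a plain String. The class StringDelimiterSketch, its readString method and the treatment of \n as the only special escape are assumptions made for this example. They are not taken from fetchString() or handleStringEscape(...).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical string reader using the documented default delimiters. */
final class StringDelimiterSketch {
    // delimiter -> escape character; '\0' marks a delimiter without escape support
    private final Map<Character, Character> delimiters = new HashMap<>();

    StringDelimiterSketch() {
        delimiters.put('"', '\\');  // documented default: " with \ as escape character
        delimiters.put('\'', '\0'); // documented default: ' without escaping
    }

    /** Reads the string constant starting at index start and returns its content. */
    String readString(String input, int start) {
        char delimiter = input.charAt(start);
        Character escape = delimiters.get(delimiter);
        if (escape == null) {
            throw new IllegalArgumentException("Not a string delimiter: " + delimiter);
        }
        StringBuilder content = new StringBuilder();
        int pos = start + 1;
        // An unterminated string simply ends at the end of the input in this sketch.
        while (pos < input.length() && input.charAt(pos) != delimiter) {
            char c = input.charAt(pos++);
            if (escape != '\0' && c == escape && pos < input.length()) {
                char escaped = input.charAt(pos++);
                // Assumption: only \n is translated, anything else is taken literally.
                content.append(escaped == 'n' ? '\n' : escaped);
            } else {
                content.append(c);
            }
        }
        return content.toString();
    }

    public static void main(String[] args) {
        StringDelimiterSketch reader = new StringDelimiterSketch();
        System.out.println(reader.readString("\"say \\\"hi\\\"\"", 0));
        System.out.println(reader.readString("'C:\\temp'", 0));
    }
}
```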
public void addStringDelimiter(char stringDelimiter, char escapeCharacter)
stringDelimiter - the delimiter used to start and end string constants
escapeCharacter - the character used to start an escape sequence or \0 to indicate that escaping is not supported
public void addUnescapedStringDelimiter(char stringDelimiter)
stringDelimiter - the delimiter used to start and end string constants
public char getDecimalSeparator()
The default separator used is '.'
public void setDecimalSeparator(char decimalSeparator)
decimalSeparator - the character to be recognized as decimal separator
public char getEffectiveDecimalSeparator()
The default separator used is '.'. When adapting this for language dependent inputs (e.g. using ',' as decimal separator) this value should probably not be changed, as it is used in the output (content) of the Tokens and has no effect on which kinds of numbers are accepted.
public void setEffectiveDecimalSeparator(char effectiveDecimalSeparator)
This is separate from setDecimalSeparator(char), which is used to recognize decimal numbers. Therefore language dependent input can be parsed while the output stays language independent.
effectiveDecimalSeparator - the character used as decimal separator in the content of decimal tokens
public char getGroupingSeparator()
This character will be accepted in numbers, but ignored (not added to the content). The default value is '_'.
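The interplay of the three separators can be sketched as a simple character filter: grouping separators are dropped, and the recognized decimal separator is replaced by the effective one. The class NumberContentSketch and its normalize method are hypothetical and only illustrate the documented behaviour. They are not how fetchNumber() is implemented.

```java
/** Hypothetical normalizer showing how separators could map input to token content. */
final class NumberContentSketch {

    /** Builds the token content for the raw number text using the given separators. */
    static String normalize(String raw,
                            char decimalSeparator,
                            char effectiveDecimalSeparator,
                            char groupingSeparator) {
        StringBuilder content = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c == groupingSeparator) {
                continue; // accepted in the input but not added to the content
            }
            content.append(c == decimalSeparator ? effectiveDecimalSeparator : c);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // Defaults from this page: '.' as decimal separator, '_' for grouping.
        System.out.println(normalize("1_000_000.5", '.', '.', '_'));
        // Language dependent input with ',' as decimal and '.' as grouping separator.
        System.out.println(normalize("1.000,25", ',', '.', '.'));
    }
}
```

Keeping the effective separator at '.' means the content of a DECIMAL token can be handed to a standard number parser regardless of the input locale.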
public void setGroupingSeparator(char groupingSeparator)
groupingSeparator - the character which can be used to group digits in numbers
public String getLineComment()
The default value is "//".
public void setLineComment(String lineComment)
lineComment - the string used to detect a line comment
public String getBlockCommentStart()
The default value is "/*".
public void setBlockCommentStart(String blockCommentStart)
blockCommentStart - the string used to detect a block comment
public String getBlockCommentEnd()
The default value is "*/".
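How the three comment strings work together can be shown with a stand-alone filter over a plain String. This is a simplified sketch: the class CommentSkipSketch and its stripComments method are hypothetical, and it ignores string constants, so a comment marker inside quotes would also be treated as a comment here.

```java
/** Hypothetical comment filter driven by configurable comment markers. */
final class CommentSkipSketch {

    /** Returns input with line and block comments removed. */
    static String stripComments(String input,
                                String lineComment,
                                String blockStart,
                                String blockEnd) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith(lineComment, pos)) {
                // Skip everything up to and including the next line break.
                int eol = input.indexOf('\n', pos);
                pos = eol < 0 ? input.length() : eol + 1;
            } else if (input.startsWith(blockStart, pos)) {
                // Skip up to and including the end marker, or to the end of the input.
                int end = input.indexOf(blockEnd, pos + blockStart.length());
                pos = end < 0 ? input.length() : end + blockEnd.length();
            } else {
                out.append(input.charAt(pos++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "a + b // sum\n/* unused */ c";
        System.out.println(stripComments(source, "//", "/*", "*/"));
    }
}
```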
public void setBlockCommentEnd(String blockCommentEnd)
blockCommentEnd - the string used to detect the end of a block comment
public boolean more()
Boilerplate method for current().isNotEnd()
public boolean atEnd()
Boilerplate method for current().isEnd()
public void addError(Position pos, String message, Object... parameters)
It is preferred to collect as many errors as possible and then fail with an exception instead of failing at the first problem. Often the parser can recover from syntax errors, so a set of errors can be reported at once.
pos - the position of the error. Note that Token implements Position. Therefore the current token is often a good choice for this parameter.
message - the message to describe the error. Can contain formatting parameters like %s or %d as defined by String.format(String, Object...)
parameters - the parameters used to format the given message
public void addWarning(Position pos, String message, Object... parameters)
A warning indicates an anomaly which might lead to an error, but the parser can still continue and complete its work.
pos - the position of the warning. Note that Token implements Position. Therefore the current token is often a good choice for this parameter.
message - the message to describe the warning. Can contain formatting parameters like %s or %d as defined by String.format(String, Object...)
parameters - the parameters used to format the given message
public void consumeExpectedSymbol(String symbol)
symbol - the expected trigger of the current token
public void consumeExpectedKeyword(String keyword)
keyword - the expected content of the current token
public void throwOnErrorOrWarning() throws ParseException
Throws a ParseException if an error or warning occurred while parsing the input.
ParseException - if an error or warning occurred while parsing.
public void throwOnError() throws ParseException
Throws a ParseException if an error occurred while parsing the input. All warnings which occurred will be ignored.
ParseException - if an error occurred while parsing.
Copyright © 2020. All rights reserved.