Diffstat (limited to 'doc/manpage.rst.in')
-rw-r--r-- | doc/manpage.rst.in | 958 |
1 files changed, 29 insertions, 929 deletions
diff --git a/doc/manpage.rst.in b/doc/manpage.rst.in
index fa40d280..c682807b 100644
--- a/doc/manpage.rst.in
+++ b/doc/manpage.rst.in
@@ -21,1001 +21,102 @@ specifications inside of C/C++ comments and replaces them with a
 hard-coded DFA. The user must supply some interface code in order to control
 and customize the generated DFA.
 
+
 OPTIONS
 -------
 
-``-? -h --help``
-    Invoke a short help.
-
-``-b --bit-vectors``
-    Implies ``-s``. Use bit vectors as well in the
-    attempt to coax better code out of the compiler. Most useful for
-    specifications with more than a few keywords (e.g. for most programming
-    languages).
-
-``-c --conditions``
-    Used to support (f)lex-like condition support.
-
-``-d --debug-output``
-    Creates a parser that dumps information about
-    the current position and in which state the parser is while parsing the
-    input. This is useful to debug parser issues and states. If you use this
-    switch you need to define a macro ``YYDEBUG`` that is called like a
-    function with two parameters: ``void YYDEBUG (int state, char current)``.
-    The first parameter receives the state or ``-1`` and the second parameter
-    receives the input at the current cursor.
-
-``-D --emit-dot``
-    Emit Graphviz dot data. It can then be processed
-    with e.g. ``dot -Tpng input.dot > output.png``. Please note that
-    scanners with many states may crash dot.
-
-``-e --ecb``
-    Generate a parser that supports EBCDIC. The generated
-    code can deal with any character up to 0xFF. In this mode ``re2c`` assumes
-    that input character size is 1 byte. This switch is incompatible with
-    ``-w``, ``-x``, ``-u`` and ``-8``.
-
-``-f --storable-state``
-    Generate a scanner with support for storable state.
-
-``-F --flex-syntax``
-    Partial support for flex syntax. When this flag
-    is active then named definitions must be surrounded by curly braces and
-    can be defined without an equal sign and the terminating semi colon.
-    Instead names are treated as direct double quoted strings.
-
-``-g --computed-gotos``
-    Generate a scanner that utilizes GCC's
-    computed goto feature. That is ``re2c`` generates jump tables whenever a
-    decision is of a certain complexity (e.g. a lot of if conditions are
-    otherwise necessary). This is only useable with GCC and produces output
-    that cannot be compiled with any other compiler. Note that this implies
-    ``-b`` and that the complexity threshold can be configured using the
-    inplace configuration ``cgoto:threshold``.
-
-``-i --no-debug-info``
-    Do not output ``#line`` information. This is
-    useful when you want use a CMS tool with the ``re2c`` output which you
-    might want if you do not require your users to have ``re2c`` themselves
-    when building from your source.
-
-``-o OUTPUT --output=OUTPUT``
-    Specify the ``OUTPUT`` file.
-
-``-r --reusable``
-    Allows reuse of scanner definitions with ``/*!use:re2c */`` after ``/*!rules:re2c */``.
-    In this mode no ``/*!re2c */`` block and exactly one ``/*!rules:re2c */`` must be present.
-    The rules are being saved and used by every ``/*!use:re2c */`` block that follows.
-    These blocks can contain inplace configurations, especially ``re2c:flags:e``,
-    ``re2c:flags:w``, ``re2c:flags:x``, ``re2c:flags:u`` and ``re2c:flags:8``.
-    That way it is possible to create the same scanner multiple times for
-    different character types, different input mechanisms or different output mechanisms.
-    The ``/*!use:re2c */`` blocks can also contain additional rules that will be appended
-    to the set of rules in ``/*!rules:re2c */``.
-
-``-s --nested-ifs``
-    Generate nested ifs for some switches. Many
-    compilers need this assist to generate better code.
-
-``-t HEADER --type-header=HEADER``
-    Create a ``HEADER`` file that
-    contains types for the (f)lex-like condition support. This can only be
-    activated when ``-c`` is in use.
-
-``-u --unicode``
-    Generate a parser that supports UTF-32. The generated
-    code can deal with any valid Unicode character up to 0x10FFFF. In this
-    mode ``re2c`` assumes that input character size is 4 bytes. This switch is
-    incompatible with ``-e``, ``-w``, ``-x`` and ``-8``. This implies ``-s``.
-
-``-v --version``
-    Show version information.
-
-``-V --vernum``
-    Show the version as a number XXYYZZ.
-
-``-w --wide-chars``
-    Generate a parser that supports UCS-2. The
-    generated code can deal with any valid Unicode character up to 0xFFFF.
-    In this mode ``re2c`` assumes that input character size is 2 bytes. This
-    switch is incompatible with ``-e``, ``-x``, ``-u`` and ``-8``. This implies
-    ``-s``.
-
-``-x --utf-16``
-    Generate a parser that supports UTF-16. The generated
-    code can deal with any valid Unicode character up to 0x10FFFF. In this
-    mode ``re2c`` assumes that input character size is 2 bytes. This switch is
-    incompatible with ``-e``, ``-w``, ``-u`` and ``-8``. This implies ``-s``.
-
-``-8 --utf-8``
-    Generate a parser that supports UTF-8. The generated
-    code can deal with any valid Unicode character up to 0x10FFFF. In this
-    mode ``re2c`` assumes that input character size is 1 byte. This switch is
-    incompatible with ``-e``, ``-w``, ``-x`` and ``-u``.
-
-``--case-insensitive``
-    All strings are case insensitive, so all
-    "-expressions are treated in the same way '-expressions are.
-
-``--case-inverted``
-    Invert the meaning of single and double quoted
-    strings. With this switch single quotes are case sensitive and double
-    quotes are case insensitive.
-
-``--no-generation-date``
-    Suppress date output in the generated file.
-
-``--no-generation-date``
-    Suppress version output in the generated file.
-
-``--encoding-policy POLICY``
-    Specify how ``re2c`` must treat Unicode
-    surrogates. ``POLICY`` can be one of the following: ``fail`` (abort with
-    error when surrogate encountered), ``substitute`` (silently substitute
-    surrogate with error code point 0xFFFD), ``ignore`` (treat surrogates as
-    normal code points). By default ``re2c`` ignores surrogates (for backward
-    compatibility). Unicode standard says that standalone surrogates are
-    invalid code points, but different libraries and programs treat them
-    differently.
-
-``--input INPUT``
-    Specify re2c input API. ``INPUT`` can be one of the
-    following: ``default``, ``custom``.
+.. include:: @top_srcdir@/doc/manual/options/options_list.rst
 
-``-S --skeleton``
-    Instead of embedding re2c-generated code into C/C++
-    source, generate a self-contained program for the same DFA. Most useful
-    for correctness and performance testing.
-
-``--empty-class POLICY``
-    What to do if user inputs empty character
-    class. ``POLICY`` can be one of the following: ``match-empty`` (match empty
-    input: pretty illogical, but this is the default for backwards
-    compatibility reason), ``match-none`` (fail to match on any input),
-    ``error`` (compilation error). Note that there are various ways to
-    construct empty class, e.g: [], [^\\x00-\\xFF],
-    [\\x00-\\xFF][\\x00-\\xFF].
-
-``--dfa-minimization <table | moore>``
-    Internal algorithm used by re2c to minimize DFA (defaults to ``moore``).
-    Both table filling and Moore's algorithms should produce identical DFA (up to states relabelling).
-    Table filling algorithm is much simpler and slower; it serves as a reference implementation.
-
-``-1 --single-pass``
-    Deprecated and does nothing (single pass is by default now).
-
-``-W``
-    Turn on all warnings.
-
-``-Werror``
-    Turn warnings into errors. Note that this option along
-    doesn't turn on any warnings, it only affects those warnings that have
-    been turned on so far or will be turned on later.
-
-``-W<warning>``
-    Turn on individual ``warning``.
-
-``-Wno-<warning>``
-    Turn off individual ``warning``.
-
-``-Werror-<warning>``
-    Turn on individual ``warning`` and treat it as error (this implies ``-W<warning>``).
-
-``-Wno-error-<warning>``
-    Don't treat this particular ``warning`` as error. This doesn't turn off
-    the warning itself.
-
-``-Wcondition-order``
-    Warn if the generated program makes implicit
-    assumptions about condition numbering. One should use either ``-t, --type-header`` option or
-    ``/*!types:re2c*/`` directive to generate mapping of condition names to numbers and use
-    autogenerated condition names.
-
-``-Wempty-character-class``
-    Warn if regular expression contains empty
-    character class. From the rational point of view trying to match empty
-    character class makes no sense: it should always fail. However, for
-    backwards compatibility reasons ``re2c`` allows empty character class and
-    treats it as empty string. Use ``--empty-class`` option to change default
-    behaviour.
-
-``-Wmatch-empty-string``
-    Warn if regular expression in a rule is
-    nullable (matches empty string). If DFA runs in a loop and empty match
-    is unintentional (input position in not advanced manually), lexer may
-    get stuck in eternal loop.
-
-``-Wswapped-range``
-    Warn if range lower bound is greater that upper
-    bound. Default ``re2c`` behaviour is to silently swap range bounds.
-
-``-Wundefined-control-flow``
-    Warn if some input strings cause undefined
-    control flow in lexer (the faulty patterns are reported). This is the
-    most dangerous and common mistake. It can be easily fixed by adding
-    default rule ``*`` (this rule has the lowest priority, matches any code unit and consumes
-    exactly one code unit).
-
-``-Wuseless-escape``
-    Warn if a symbol is escaped when it shouldn't be.
-    By default re2c silently ignores escape, but this may as well indicate a
-    typo or an error in escape sequence.
+.. include:: @top_srcdir@/doc/manual/warnings/warnings_general.rst
+.. include:: @top_srcdir@/doc/manual/warnings/warnings_list.rst
 
 INTERFACE CODE
 --------------
 
-The user must supply interface code either in the form of C/C++ code
-(macros, functions, variables, etc.) or in the form of ``INPLACE CONFIGURATIONS``.
-Which symbols must be defined and which are optional
-depends on a particular use case.
-
-``YYCONDTYPE``
-    In ``-c`` mode you can use ``-t`` to generate a file that
-    contains the enumeration used as conditions. Each of the values refers
-    to a condition of a rule set.
-
-``YYCTXMARKER``
-    l-value of type ``YYCTYPE *``.
-    The generated code saves trailing context backtracking information in
-    ``YYCTXMARKER``. The user only needs to define this macro if a scanner
-    specification uses trailing context in one or more of its regular
-    expressions.
-
-``YYCTYPE``
-    Type used to hold an input symbol (code unit). Usually
-    ``char`` or ``unsigned char`` for ASCII, EBCDIC and UTF-8, ``unsigned short``
-    for UTF-16 or UCS-2 and ``unsigned int`` for UTF-32.
-
-``YYCURSOR``
-    l-value of type ``YYCTYPE *`` that points to the current input symbol. The generated code advances
-    ``YYCURSOR`` as symbols are matched. On entry, ``YYCURSOR`` is assumed to
-    point to the first character of the current token. On exit, ``YYCURSOR``
-    will point to the first character of the following token.
-
-``YYDEBUG (state, current)``
-    This is only needed if the ``-d`` flag was
-    specified. It allows one to easily debug the generated parser by calling a
-    user defined function for every state. The function should have the
-    following signature: ``void YYDEBUG (int state, char current)``. The first
-    parameter receives the state or -1 and the second parameter receives the
-    input at the current cursor.
-
-``YYFILL (n)``
-    The generated code "calls"" ``YYFILL (n)`` when the
-    buffer needs (re)filling: at least ``n`` additional characters should be
-    provided. ``YYFILL (n)`` should adjust ``YYCURSOR``, ``YYLIMIT``, ``YYMARKER``
-    and ``YYCTXMARKER`` as needed. Note that for typical programming languages
-    ``n`` will be the length of the longest keyword plus one. The user can
-    place a comment of the form ``/*!max:re2c*/`` to insert ``YYMAXFILL`` definition that is set to the maximum
-    length value.
-
-``YYGETCONDITION ()``
-    This define is used to get the condition prior to
-    entering the scanner code when using ``-c`` switch. The value must be
-    initialized with a value from the enumeration ``YYCONDTYPE`` type.
-
-``YYGETSTATE ()``
-    The user only needs to define this macro if the ``-f``
-    flag was specified. In that case, the generated code "calls"
-    ``YYGETSTATE ()`` at the very beginning of the scanner in order to obtain
-    the saved state. ``YYGETSTATE ()`` must return a signed integer. The value
-    must be either -1, indicating that the scanner is entered for the first
-    time, or a value previously saved by ``YYSETSTATE (s)``. In the second
-    case, the scanner will resume operations right after where the last
-    ``YYFILL (n)`` was called.
-
-``YYLIMIT``
-    Expression of type ``YYCTYPE *`` that marks the end of the buffer ``YYLIMIT[-1]``
-    is the last character in the buffer). The generated code repeatedly
-    compares ``YYCURSOR`` to ``YYLIMIT`` to determine when the buffer needs
-    (re)filling.
-
-``YYMARKER``
-    l-value of type ``YYCTYPE *``.
-    The generated code saves backtracking information in ``YYMARKER``. Some
-    easy scanners might not use this.
-
-``YYMAXFILL``
-    This will be automatically defined by ``/*!max:re2c*/`` blocks as explained above.
-
-``YYSETCONDITION (c)``
-    This define is used to set the condition in
-    transition rules. This is only being used when ``-c`` is active and
-    transition rules are being used.
-
-``YYSETSTATE (s)``
-    The user only needs to define this macro if the ``-f``
-    flag was specified. In that case, the generated code "calls"
-    ``YYSETSTATE`` just before calling ``YYFILL (n)``. The parameter to
-    ``YYSETSTATE`` is a signed integer that uniquely identifies the specific
-    instance of ``YYFILL (n)`` that is about to be called. Should the user
-    wish to save the state of the scanner and have ``YYFILL (n)`` return to
-    the caller, all he has to do is store that unique identifer in a
-    variable. Later, when the scannered is called again, it will call
-    ``YYGETSTATE ()`` and resume execution right where it left off. The
-    generated code will contain both ``YYSETSTATE (s)`` and ``YYGETSTATE`` even
-    if ``YYFILL (n)`` is being disabled.
-
+.. include:: @top_srcdir@/doc/manual/syntax/interface.rst_
 
 SYNTAX
 ------
 
-Code for ``re2c`` consists of a set of ``RULES``, ``NAMED DEFINITIONS`` and
-``INPLACE CONFIGURATIONS``.
+A program can contain any number of ``re2c`` blocks.
+Each block consists of a sequence of ``RULES``, ``NAMED DEFINITIONS`` and ``INPLACE CONFIGURATIONS``.
 
 RULES
 ~~~~~
 
-Rules consist of a regular expression (see ``REGULAR EXPRESSIONS``) along with a block of C/C++ code
-that is to be executed when the associated regular expression is
-matched. You can either start the code with an opening curly brace or
-the sequence ``:=``. When the code with a curly brace then ``re2c`` counts the brace depth
-and stops looking for code automatically. Otherwise curly braces are not
-allowed and ``re2c`` stops looking for code at the first line that does
-not begin with whitespace. If two or more rules overlap, the first rule
-is preferred.
-
-    ``regular-expression { C/C++ code }``
-
-    ``regular-expression := C/C++ code``
-
-There is one special rule: default rule ``*``
-
-    ``* { C/C++ code }``
-
-    ``* := C/C++ code``
-
-Note that default rule ``*`` differs from ``[^]``: default rule has the lowest priority,
-matches any code unit (either valid or invalid) and always consumes one character;
-while ``[^]`` matches any valid code point (not code unit) and can consume multiple
-code units. In fact, when variable-length encoding is used, ``*``
-is the only possible way to match invalid input character (see ``ENCODINGS`` for details).
-
-If ``-c`` is active then each regular expression is preceded by a list
-of comma separated condition names. Besides normal naming rules there
-are two special cases: ``<*>`` (such rules are merged to all conditions)
-and ``<>`` (such the rule cannot have an associated regular expression,
-its code is merged to all actions). Non empty rules may further more specify the new
-condition. In that case ``re2c`` will generate the necessary code to
-change the condition automatically. Rules can use ``:=>`` as a shortcut
-to automatically generate code that not only sets the
-new condition state but also continues execution with the new state. A
-shortcut rule should not be used in a loop where there is code between
-the start of the loop and the ``re2c`` block unless ``re2c:cond:goto``
-is changed to ``continue``. If code is necessary before all rules (though not simple jumps) you
-can doso by using ``<!>`` pseudo-rules.
-
-    ``<condition-list> regular-expression { C/C++ code }``
-
-    ``<condition-list> regular-expression := C/C++ code``
-
-    ``<condition-list> * { C/C++ code }``
-
-    ``<condition-list> * := C/C++ code``
-
-    ``<condition-list> regular-expression => condition { C/C++ code }``
-
-    ``<condition-list> regular-expression => condition := C/C++ code``
-
-    ``<condition-list> * => condition { C/C++ code }``
-
-    ``<condition-list> * => condition := C/C++ code``
-
-    ``<condition-list> regular-expression :=> condition``
-
-
-    ``<*> regular-expression { C/C++ code }``
-
-    ``<*> regular-expression := C/C++ code``
-
-    ``<*> * { C/C++ code }``
-
-    ``<*> * := C/C++ code``
-
-    ``<*> regular-expression => condition { C/C++ code }``
-
-    ``<*> regular-expression => condition := C/C++ code``
-
-    ``<*> * => condition { C/C++ code }``
-
-    ``<*> * => condition := C/C++ code``
-
-    ``<*> regular-expression :=> condition``
-
-
-    ``<> { C/C++ code }``
-
-    ``<> := C/C++ code``
-
-    ``<> => condition { C/C++ code }``
-
-    ``<> => condition := C/C++ code``
-
-    ``<> :=> condition``
-
-    ``<> :=> condition``
-
-
-    ``<! condition-list> { C/C++ code }``
-
-    ``<! condition-list> := C/C++ code``
-
-    ``<!> { C/C++ code }``
-
-    ``<!> := C/C++ code``
-
+.. include:: @top_srcdir@/doc/manual/syntax/rules.rst_
 
 NAMED DEFINITIONS
 ~~~~~~~~~~~~~~~~~
 
-Named definitions are of the form:
-
-    ``name = regular-expression;``
-
-If ``-F`` is active, then named definitions are also of the form:
-
-    ``name { regular-expression }``
+.. include:: @top_srcdir@/doc/manual/syntax/named_definitions.rst_
 
 INPLACE CONFIGURATIONS
 ~~~~~~~~~~~~~~~~~~~~~~
 
-``re2c:condprefix = yyc;``
-    Allows one to specify the prefix used for
-    condition labels. That is this text is prepended to any condition label
-    in the generated output file.
-
-``re2c:condenumprefix = yyc;``
-    Allows one to specify the prefix used for
-    condition values. That is this text is prepended to any condition enum
-    value in the generated output file.
-
-``re2c:cond:divider = "/* *********************************** */";``
-    Allows one to customize the devider for condition blocks. You can use ``@@``
-    to put the name of the condition or customize the placeholder using
-    ``re2c:cond:divider@cond``.
-
-``re2c:cond:divider@cond = @@;``
-    Specifies the placeholder that will be
-    replaced with the condition name in ``re2c:cond:divider``.
-
-``re2c:cond:goto = "goto @@;";``
-    Allows one to customize the condition goto statements used with ``:=>`` style rules. You can use ``@@``
-    to put the name of the condition or ustomize the placeholder using
-    ``re2c:cond:goto@cond``. You can also change this to ``continue;``, which
-    would allow you to continue with the next loop cycle including any code
-    between loop start and re2c block.
-
-``re2c:cond:goto@cond = @@;``
-    Spcifies the placeholder that will be replaced with the condition label in ``re2c:cond:goto``.
-
-``re2c:indent:top = 0;``
-    Specifies the minimum number of indentation to
-    use. Requires a numeric value greater than or equal zero.
-
-``re2c:indent:string = "\t";``
-    Specifies the string to use for indentation. Requires a string that should
-    contain only whitespace unless you need this for external tools. The easiest
-    way to specify spaces is to enclude them in single or double quotes.
-    If you do not want any indentation at all you can simply set this to "".
-
-``re2c:yych:conversion = 0;``
-    When this setting is non zero, then ``re2c`` automatically generates
-    conversion code whenever yych gets read. In this case the type must be
-    defined using ``re2c:define:YYCTYPE``.
-
-``re2c:yych:emit = 1;``
-    Generation of ``yych`` can be suppressed by setting this to 0.
-
-``re2c:yybm:hex = 0;``
-    If set to zero then a decimal table is being used else a hexadecimal table will be generated.
-
-``re2c:yyfill:enable = 1;``
-    Set this to zero to suppress generation of ``YYFILL (n)``. When using this be sure to verify that the generated
-    scanner does not read behind input. Allowing this behavior might
-    introduce sever security issues to you programs.
-
-``re2c:yyfill:check = 1;``
-    This can be set 0 to suppress output of the
-    pre condition using ``YYCURSOR`` and ``YYLIMIT`` which becomes useful when
-    ``YYLIMIT + YYMAXFILL`` is always accessible.
-
-``re2c:define:YYFILL = "YYFILL";``
-    Substitution for ``YYFILL``. Note
-    that by default ``re2c`` generates argument in braces and semicolon after
-    ``YYFILL``. If you need to make ``YYFILL`` an arbitrary statement rather
-    than a call, set ``re2c:define:YYFILL:naked`` to non-zero and use
-    ``re2c:define:YYFILL@len`` to denote formal parameter inside of ``YYFILL``
-    body.
-
-``re2c:define:YYFILL@len = "@@";``
-    Any occurrence of this text
-    inside of ``YYFILL`` will be replaced with the actual argument.
-
-``re2c:yyfill:parameter = 1;``
-    Controls argument in braces after
-    ``YYFILL``. If zero, agrument is omitted. If non-zero, argument is
-    generated unless ``re2c:define:YYFILL:naked`` is set to non-zero.
-
-``re2c:define:YYFILL:naked = 0;``
-    Controls argument in braces and
-    semicolon after ``YYFILL``. If zero, both agrument and semicolon are
-    omitted. If non-zero, argument is generated unless
-    ``re2c:yyfill:parameter`` is set to zero and semicolon is generated
-    unconditionally.
-
-``re2c:startlabel = 0;``
-    If set to a non zero integer then the start
-    label of the next scanner blocks will be generated even if not used by
-    the scanner itself. Otherwise the normal ``yy0`` like start label is only
-    being generated if needed. If set to a text value then a label with that
-    text will be generated regardless of whether the normal start label is
-    being used or not. This setting is being reset to 0 after a start
-    label has been generated.
-
-``re2c:labelprefix = "yy";``
-    Allows one to change the prefix of numbered
-    labels. The default is ``yy`` and can be set any string that is a valid
-    label.
-
-``re2c:state:abort = 0;``
-    When not zero and switch ``-f`` is active then
-    the ``YYGETSTATE`` block will contain a default case that aborts and a -1
-    case is used for initialization.
-
-``re2c:state:nextlabel = 0;``
-    Used when ``-f`` is active to control
-    whether the ``YYGETSTATE`` block is followed by a ``yyNext:`` label line.
-    Instead of using ``yyNext`` you can usually also use configuration
-    ``startlabel`` to force a specific start label or default to ``yy0`` as
-    start label. Instead of using a dedicated label it is often better to
-    separate the ``YYGETSTATE`` code from the actual scanner code by placing a
-    ``/*!getstate:re2c*/`` comment.
-
-``re2c:cgoto:threshold = 9;``
-    When ``-g`` is active this value specifies
-    the complexity threshold that triggers generation of jump tables rather
-    than using nested if's and decision bitfields. The threshold is compared
-    against a calculated estimation of if-s needed where every used bitmap
-    divides the threshold by 2.
-
-``re2c:yych:conversion = 0;``
-    When the input uses signed characters and
-    ``-s`` or ``-b`` switches are in effect re2c allows one to automatically convert
-    to the unsigned character type that is then necessary for its internal
-    single character. When this setting is zero or an empty string the
-    conversion is disabled. Using a non zero number the conversion is taken
-    from ``YYCTYPE``. If that is given by an inplace configuration that value
-    is being used. Otherwise it will be ``(YYCTYPE)`` and changes to that
-    configuration are no longer possible. When this setting is a string the
-    braces must be specified. Now assuming your input is a ``char *``
-    buffer and you are using above mentioned switches you can set
-    ``YYCTYPE`` to ``unsigned char`` and this setting to either 1 or ``(unsigned char)``.
-
-``re2c:define:YYCONDTYPE = "YYCONDTYPE";``
-    Enumeration used for condition support with ``-c`` mode.
-
-``re2c:define:YYCTXMARKER = "YYCTXMARKER";``
-    Allows one to overwrite the
-    define ``YYCTXMARKER`` and thus avoiding it by setting the value to the
-    actual code needed.
-
-``re2c:define:YYCTYPE = "YYCTYPE";``
-    Allows one to overwrite the define
-    ``YYCTYPE`` and thus avoiding it by setting the value to the actual code
-    needed.
-
-``re2c:define:YYCURSOR = "YYCURSOR";``
-    Allows one to overwrite the define
-    ``YYCURSOR`` and thus avoiding it by setting the value to the actual code
-    needed.
-
-``re2c:define:YYDEBUG = "YYDEBUG";``
-    Allows one to overwrite the define
-    ``YYDEBUG`` and thus avoiding it by setting the value to the actual code
-    needed.
-
-``re2c:define:YYGETCONDITION = "YYGETCONDITION";``
-    Substitution for
-    ``YYGETCONDITION``. Note that by default ``re2c`` generates braces after
-    ``YYGETCONDITION``. Set ``re2c:define:YYGETCONDITION:naked`` to non-zero to
-    omit braces.
-
-``re2c:define:YYGETCONDITION:naked = 0;``
-    Controls braces after
-    ``YYGETCONDITION``. If zero, braces are omitted. If non-zero, braces are
-    generated.
-
-``re2c:define:YYSETCONDITION = "YYSETCONDITION";``
-    Substitution for
-    ``YYSETCONDITION``. Note that by default ``re2c`` generates argument in
-    braces and semicolon after ``YYSETCONDITION``. If you need to make
-    ``YYSETCONDITION`` an arbitrary statement rather than a call, set
-    ``re2c:define:YYSETCONDITION:naked`` to non-zero and use
-    ``re2c:define:YYSETCONDITION@cond`` to denote formal parameter inside of
-    ``YYSETCONDITION`` body.
-
-``re2c:define:YYSETCONDITION@cond = "@@";``
-    Any occurrence of this
-    text inside of ``YYSETCONDITION`` will be replaced with the actual
-    argument.
-
-``re2c:define:YYSETCONDITION:naked = 0;``
-    Controls argument in braces
-    and semicolon after ``YYSETCONDITION``. If zero, both agrument and
-    semicolon are omitted. If non-zero, both argument and semicolon are
-    generated.
-
-``re2c:define:YYGETSTATE = "YYGETSTATE";``
-    Substitution for
-    ``YYGETSTATE``. Note that by default ``re2c`` generates braces after
-    ``YYGETSTATE``. Set ``re2c:define:YYGETSTATE:naked`` to non-zero to omit
-    braces.
-
-``re2c:define:YYGETSTATE:naked = 0;``
-    Controls braces after
-    ``YYGETSTATE``. If zero, braces are omitted. If non-zero, braces are
-    generated.
-
-``re2c:define:YYSETSTATE = "YYSETSTATE";``
-    Substitution for
-    ``YYSETSTATE``. Note that by default ``re2c`` generates argument in braces
-    and semicolon after ``YYSETSTATE``. If you need to make ``YYSETSTATE`` an
-    arbitrary statement rather than a call, set
-    ``re2c:define:YYSETSTATE:naked`` to non-zero and use
-    ``re2c:define:YYSETSTATE@cond`` to denote formal parameter inside of
-    ``YYSETSTATE`` body.
-
-``re2c:define:YYSETSTATE@state = "@@";``
-    Any occurrence of this text
-    inside of ``YYSETSTATE`` will be replaced with the actual argument.
-
-``re2c:define:YYSETSTATE:naked = 0;``
-    Controls argument in braces and
-    semicolon after ``YYSETSTATE``. If zero, both agrument and semicolon are
-    omitted. If non-zero, both argument and semicolon are generated.
-
-``re2c:define:YYLIMIT = "YYLIMIT";``
-    Allows one to overwrite the define
-    ``YYLIMIT`` and thus avoiding it by setting the value to the actual code
-    needed.
-
-``re2c:define:YYMARKER = "YYMARKER";``
-    Allows one to overwrite the define
-    ``YYMARKER`` and thus avoiding it by setting the value to the actual code
-    needed.
-
-``re2c:label:yyFillLabel = "yyFillLabel";``
-    Allows one to overwrite the name of the label ``yyFillLabel``.
-
-``re2c:label:yyNext = "yyNext";``
-    Allows one to overwrite the name of the label ``yyNext``.
-
-``re2c:variable:yyaccept = yyaccept;``
-    Allows one to overwrite the name of the variable ``yyaccept``.
-
-``re2c:variable:yybm = "yybm";``
-    Allows one to overwrite the name of the variable ``yybm``.
-
-``re2c:variable:yych = "yych";``
-    Allows one to overwrite the name of the variable ``yych``.
-
-``re2c:variable:yyctable = "yyctable";``
-    When both ``-c`` and ``-g`` are active then ``re2c`` uses this variable to generate a static jump table
-    for ``YYGETCONDITION``.
-
-``re2c:variable:yystable = "yystable";``
-    Deprecated.
-
-``re2c:variable:yytarget = "yytarget";``
-    Allows one to overwrite the name of the variable ``yytarget``.
-
+.. include:: @top_srcdir@/doc/manual/syntax/configurations.rst_
 
 REGULAR EXPRESSIONS
 ~~~~~~~~~~~~~~~~~~~
 
-``"foo"``
-    literal string ``"foo"``. ANSI-C escape sequences can be used.
-
-``'foo'``
-    literal string ``"foo"`` (characters [a-zA-Z] treated
-    case-insensitive). ANSI-C escape sequences can be used.
-
-``[xyz]``
-    character class; in this case, regular expression matches either ``x``, ``y``, or ``z``.
-
-``[abj-oZ]``
-    character class with a range in it; matches ``a``, ``b``, any letter from ``j`` through ``o`` or ``Z``.
-
-``[^class]``
-    inverted character class.
-
-``r \ s``
-    match any ``r`` which isn't ``s``. ``r`` and ``s`` must be regular expressions
-    which can be expressed as character classes.
-
-``r*``
-    zero or more occurrences of ``r``.
-
-``r+``
-    one or more occurrences of ``r``.
+.. include:: @top_srcdir@/doc/manual/syntax/regular_expressions.rst_
 
-``r?``
-    optional ``r``.
 
-``(r)``
-    ``r``; parentheses are used to override precedence.
-
-``r s``
-    ``r`` followed by ``s`` (concatenation).
-
-``r | s``
-    either ``r`` or ``s`` (alternative).
-
-``r`` / ``s``
-    ``r`` but only if it is followed by ``s``. Note that ``s`` is not
-    part of the matched text. This type of regular expression is called
-    "trailing context". Trailing context can only be the end of a rule
-    and not part of a named definition.
-
-``r{n}``
-    matches ``r`` exactly ``n`` times.
-
-``r{n,}``
-    matches ``r`` at least ``n`` times.
-
-``r{n,m}``
-    matches ``r`` at least ``n`` times, but not more than ``m`` times.
-
-``.``
-    match any character except newline.
-
-``name``
-    matches named definition as specified by ``name`` only if ``-F`` is
-    off. If ``-F`` is active then this behaves like it was enclosed in double
-    quotes and matches the string "name".
-
-Character classes and string literals may contain octal or hexadecimal
-character definitions and the following set of escape sequences:
-``\a``, ``\b``, ``\f``, ``\n``, ``\r``, ``\t``, ``\v``, ``\\``. An octal character is defined by a backslash
-followed by its three octal digits (e.g. ``\377``).
-Hexadecimal characters from 0 to 0xFF are defined by backslash, a lower
-cased ``x`` and two hexadecimal digits (e.g. ``\x12``). Hexadecimal characters from 0x100 to 0xFFFF are defined by backslash, a lower cased
-``\u`` or an upper cased ``\X`` and four hexadecimal digits (e.g. ``\u1234``).
-Hexadecimal characters from 0x10000 to 0xFFFFffff are defined by backslash, an upper cased ``\U``
-and eight hexadecimal digits (e.g. ``\U12345678``).
-
-The only portable "any" rule is the default rule ``*``.
-
-
-
-SCANNER WITH STORABLE STATES
-----------------------------
-
-When the ``-f`` flag is specified, ``re2c`` generates a scanner that can
-store its current state, return to the caller, and later resume
-operations exactly where it left off.
-
-The default operation of ``re2c`` is a
-"pull" model, where the scanner asks for extra input whenever it needs it. However, this mode of operation assumes that the scanner is the "owner"
-the parsing loop, and that may not always be convenient.
-
-Typically, if there is a preprocessor ahead of the scanner in the
-stream, or for that matter any other procedural source of data, the
-scanner cannot "ask" for more data unless both scanner and source
-live in a separate threads.
-
-The ``-f`` flag is useful for just this situation: it lets users design
-scanners that work in a "push" model, i.e. where data is fed to the
-scanner chunk by chunk. When the scanner runs out of data to consume, it
-just stores its state, and return to the caller. When more input data is
-fed to the scanner, it resumes operations exactly where it left off.
-
-Changes needed compared to the "pull" model:
-
-* User has to supply macros ``YYSETSTATE ()`` and ``YYGETSTATE (state)``.
-
-* The ``-f`` option inhibits declaration of ``yych`` and ``yyaccept``. So the
-  user has to declare these. Also the user has to save and restore these.
-  In the example ``examples/push_model/push.re`` these are declared as
-  fields of the (C++) class of which the scanner is a method, so they do
-  not need to be saved/restored explicitly. For C they could e.g. be made
-  macros that select fields from a structure passed in as parameter.
-  Alternatively, they could be declared as local variables, saved with
-  ``YYFILL (n)`` when it decides to return and restored at entry to the
-  function. Also, it could be more efficient to save the state from
-  ``YYFILL (n)`` because ``YYSETSTATE (state)`` is called unconditionally.
- ``YYFILL (n)`` however does not get ``state`` as parameter, so we would have
- to store state in a local variable by ``YYSETSTATE (state)``.
-
-* Modify ``YYFILL (n)`` to return (from the function calling it) if more input is needed.
-
-* Modify caller to recognise if more input is needed and respond appropriately.
-
-* The generated code will contain a switch block that is used to
- restores the last state by jumping behind the corrspoding ``YYFILL (n)``
- call. This code is automatically generated in the epilog of the first ``/*!re2c */``
- block. It is possible to trigger generation of the ``YYGETSTATE ()``
- block earlier by placing a ``/*!getstate:re2c*/`` comment. This is especially useful when the scanner code should be
- wrapped inside a loop.
-
-Please see ``examples/push_model/push.re`` for "push" model scanner. The
-generated code can be tweaked using inplace configurations ``state:abort``
-and ``state:nextlabel``.
+SUBMATCH EXTRACTION
+-------------------
+.. include:: @top_srcdir@/doc/manual/features/submatch/submatch.rst_
-SCANNER WITH CONDITION SUPPORT
-------------------------------
+STORABLE STATE
+--------------
-You can preceed regular expressions with a list of condition names when
-using the ``-c`` switch. In this case ``re2c`` generates scanner blocks for
-each conditon. Where each of the generated blocks has its own
-precondition. The precondition is given by the interface define
-``YYGETCONDITON()`` and must be of type ``YYCONDTYPE``.
+.. include:: @top_srcdir@/doc/manual/features/state/state.rst_
-There are two special rule types. First, the rules of the condition ``<*>``
-are merged to all conditions (note that they have lower priority than
-other rules of that condition). And second the empty condition list
-allows one to provide a code block that does not have a scanner part.
-Meaning it does not allow any regular expression. The condition value
-referring to this special block is always the one with the enumeration
-value 0. This way the code of this special rule can be used to
-initialize a scanner. It is in no way necessary to have these rules: but
-sometimes it is helpful to have a dedicated uninitialized condition
-state.
-Non empty rules allow one to specify the new condition, which makes them
-transition rules. Besides generating calls for the define
-``YYSETCONDTITION`` no other special code is generated.
-There is another kind of special rules that allow one to prepend code to any
-code block of all rules of a certain set of conditions or to all code
-blocks to all rules. This can be helpful when some operation is common
-among rules. For instance this can be used to store the length of the
-scanned string. These special setup rules start with an exclamation mark
-followed by either a list of conditions ``<! condition, ... >`` or a star
-``<!*>``. When ``re2c`` generates the code for a rule whose state does not have a
-setup rule and a star'd setup rule is present, than that code will be
-used as setup code.
+CONDITIONS
+----------
+.. include:: @top_srcdir@/doc/manual/features/conditions/conditions.rst_
 ENCODINGS
 ---------
-``re2c`` supports the following encodings: ASCII (default), EBCDIC (``-e``),
-UCS-2 (``-w``), UTF-16 (``-x``), UTF-32 (``-u``) and UTF-8 (``-8``).
-See also inplace configuration ``re2c:flags``.
-
-The following concepts should be clarified when talking about encoding.
-Code point is an abstract number, which represents single encoding
-symbol. Code unit is the smallest unit of memory, which is used in the
-encoded text (it corresponds to one character in the input stream). One
-or more code units can be needed to represent a single code point,
-depending on the encoding. In fixed-length encoding, each code point
-is represented with equal number of code units. In variable-length
-encoding, different code points can be represented with different number
-of code units.
-
-ASCII
- is a fixed-length encoding. Its code space includes 0x100
- code points, from 0 to 0xFF. One code point is represented with exactly one
- 1-byte code unit, which has the same value as the code point. Size of
- ``YYCTYPE`` must be 1 byte.
-
-EBCDIC
- is a fixed-length encoding. Its code space includes 0x100
- code points, from 0 to 0xFF. One code point is represented with exactly
- one 1-byte code unit, which has the same value as the code point. Size
- of ``YYCTYPE`` must be 1 byte.
-
-UCS-2
- is a fixed-length encoding. Its code space includes 0x10000
- code points, from 0 to 0xFFFF. One code point is represented with
- exactly one 2-byte code unit, which has the same value as the code
- point. Size of ``YYCTYPE`` must be 2 bytes.
-
-UTF-16
- is a variable-length encoding. Its code space includes all
- Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
- code point is represented with one or two 2-byte code units. Size of
- ``YYCTYPE`` must be 2 bytes.
-
-UTF-32
- is a fixed-length encoding. Its code space includes all
- Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
- code point is represented with exactly one 4-byte code unit. Size of
- ``YYCTYPE`` must be 4 bytes.
-
-UTF-8
- is a variable-length encoding. Its code space includes all
- Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One
- code point is represented with sequence of one, two, three or four
- 1-byte code units. Size of ``YYCTYPE`` must be 1 byte.
-
-In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not
-valid Unicode code points, any encoded sequence of code units, that
-would map to Unicode code points in the range 0xD800-0xDFFF, is
-ill-formed. The user can control how ``re2c`` treats such ill-formed
-sequences with ``--encoding-policy <policy>`` flag (see ``OPTIONS``
-for full explanation).
-
-For some encodings, there are code units, that never occur in valid
-encoded stream (e.g. 0xFF byte in UTF-8). If the generated scanner must
-check for invalid input, the only true way to do so is to use default
-rule ``*``. Note, that full range rule ``[^]`` won't catch invalid code units when variable-length encoding is used
-(``[^]`` means "all valid code points", while default rule ``*`` means "all possible code units").
-
-
-
-GENERIC INPUT API
------------------
+.. include:: @top_srcdir@/doc/manual/features/encodings/encodings.rst_
-``re2c`` usually operates on input using pointer-like primitives
-``YYCURSOR``, ``YYMARKER``, ``YYCTXMARKER`` and ``YYLIMIT``.
-Generic input API (enabled with ``--input custom`` switch) allows one to
-customize input operations. In this mode, ``re2c`` will express all
-operations on input in terms of the following primitives:
-
- +---------------------+-----------------------------------------------------+
- | ``YYPEEK ()``       | get current input character                         |
- +---------------------+-----------------------------------------------------+
- | ``YYSKIP ()``       | advance to the next character                       |
- +---------------------+-----------------------------------------------------+
- | ``YYBACKUP ()``     | backup current input position                       |
- +---------------------+-----------------------------------------------------+
- | ``YYBACKUPCTX ()``  | backup current input position for trailing context  |
- +---------------------+-----------------------------------------------------+
- | ``YYRESTORE ()``    | restore current input position                      |
- +---------------------+-----------------------------------------------------+
- | ``YYRESTORECTX ()`` | restore current input position for trailing context |
- +---------------------+-----------------------------------------------------+
- | ``YYLESSTHAN (n)``  | check if less than ``n`` input characters are left  |
- +---------------------+-----------------------------------------------------+
-
-A couple of useful links that provide some examples:
-
-1. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-13-input_model.html
-2. http://skvadrik.github.io/aleph_null/posts/re2c/2015-01-15-input_model_custom.html
+GENERIC API
+-----------
+.. include:: @top_srcdir@/doc/manual/features/generic_api/generic_api.rst_
 SEE ALSO
 --------
-You can find more information about ``re2c`` on the website: http://re2c.org.
+You can find more information about ``re2c`` at: http://re2c.org.
 See also: flex(1), lex(1), quex (http://quex.sourceforge.net).
-
 AUTHORS
 -------
-Peter Bumbulis peter@csg.uwaterloo.ca
-
-Brian Young bayoung@acm.org
-
-Dan Nuffer nuffer@users.sourceforge.net
-
-Marcus Boerger helly@users.sourceforge.net
-
-Hartmut Kaiser hkaiser@users.sourceforge.net
-
-Emmanuel Mogenet mgix@mgix.com
-
-Ulya Trofimovich skvadrik@gmail.com
+Originaly written by Peter Bumbulis in 1993;
+developed and maintained by Brain Young, Marcus Boerger, Dan Nuffer and Ulya Trofimovich.
+Below is a (more or less) full list of contributors retrieved from the Git history and mailing lists:
+.. include:: @top_srcdir@/doc/manual/contributors.rst_
 VERSION INFORMATION
@@ -1023,4 +124,3 @@ VERSION INFORMATION
 This manpage describes ``re2c`` version @PACKAGE_VERSION@, package date @PACKAGE_DATE@.
-