Diffstat (limited to 'doc/flex.texi')
-rw-r--r-- | doc/flex.texi | 8600 |
1 files changed, 8600 insertions, 0 deletions
diff --git a/doc/flex.texi b/doc/flex.texi new file mode 100644 index 0000000..f9a9e9e --- /dev/null +++ b/doc/flex.texi @@ -0,0 +1,8600 @@ +\input texinfo.tex @c -*-texinfo-*- +@c %**start of header +@setfilename flex.info +@settitle Lexical Analysis With Flex +@include version.texi +@set authors Vern Paxson, Will Estes and John Millaway +@c "Macro Hooks" index +@defindex hk +@c "Options" index +@defindex op +@dircategory Programming +@direntry +* flex: (flex). Fast lexical analyzer generator (lex replacement). +@end direntry +@c %**end of header + +@copying + +The flex manual is placed under the same licensing conditions as the +rest of flex: + +Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007 The Flex +Project. + +Copyright @copyright{} 1990, 1997 The Regents of the University of California. +All rights reserved. + +This code is derived from software contributed to Berkeley by +Vern Paxson. + +The United States Government has rights in this work pursuant +to contract no. DE-AC03-76SF00098 between the United States +Department of Energy and the University of California. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +@enumerate +@item + Redistributions of source code must retain the above copyright +notice, this list of conditions and the following disclaimer. + +@item +Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. +@end enumerate + +Neither the name of the University nor the names of its contributors +may be used to endorse or promote products derived from this software +without specific prior written permission. 
+ +THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR +IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED +WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. +@end copying + +@titlepage +@title @value{title} +@subtitle Edition @value{EDITION}, @value{UPDATED} +@author @value{authors} +@page +@vskip 0pt plus 1filll +@insertcopying +@end titlepage +@contents +@ifnottex +@node Top, Copyright, (dir), (dir) +@top flex + +This manual describes @code{flex}, a tool for generating programs that +perform pattern-matching on text. The manual includes both tutorial and +reference sections. + +This edition of @cite{The flex Manual} documents @code{flex} version +@value{VERSION}. It was last updated on @value{UPDATED}. + +This manual was written by @value{authors}. + +@menu +* Copyright:: +* Reporting Bugs:: +* Introduction:: +* Simple Examples:: +* Format:: +* Patterns:: +* Matching:: +* Actions:: +* Generated Scanner:: +* Start Conditions:: +* Multiple Input Buffers:: +* EOF:: +* Misc Macros:: +* User Values:: +* Yacc:: +* Scanner Options:: +* Performance:: +* Cxx:: +* Reentrant:: +* Lex and Posix:: +* Memory Management:: +* Serialized Tables:: +* Diagnostics:: +* Limitations:: +* Bibliography:: +* FAQ:: +* Appendices:: +* Indices:: + +@detailmenu + --- The Detailed Node Listing --- + +Format of the Input File + +* Definitions Section:: +* Rules Section:: +* User Code Section:: +* Comments in the Input:: + +Scanner Options + +* Options for Specifying Filenames:: +* Options Affecting Scanner Behavior:: +* Code-Level And API Options:: +* Options for Scanner Speed and Size:: +* Debugging Options:: +* Miscellaneous Options:: + +Reentrant C Scanners + +* Reentrant Uses:: +* Reentrant Overview:: +* Reentrant Example:: +* Reentrant Detail:: +* Reentrant Functions:: + +The Reentrant API in Detail + +* Specify Reentrant:: +* Extra Reentrant Argument:: +* Global Replacement:: +* Init and Destroy Functions:: +* Accessor Methods:: +* Extra Data:: 
+* About yyscan_t:: + +Memory Management + +* The Default Memory Management:: +* Overriding The Default Memory Management:: +* A Note About yytext And Memory:: + +Serialized Tables + +* Creating Serialized Tables:: +* Loading and Unloading Serialized Tables:: +* Tables File Format:: + +FAQ + +* When was flex born?:: +* How do I expand backslash-escape sequences in C-style quoted strings?:: +* Why do flex scanners call fileno if it is not ANSI compatible?:: +* Does flex support recursive pattern definitions?:: +* How do I skip huge chunks of input (tens of megabytes) while using flex?:: +* Flex is not matching my patterns in the same order that I defined them.:: +* My actions are executing out of order or sometimes not at all.:: +* How can I have multiple input sources feed into the same scanner at the same time?:: +* Can I build nested parsers that work with the same input file?:: +* How can I match text only at the end of a file?:: +* How can I make REJECT cascade across start condition boundaries?:: +* Why cant I use fast or full tables with interactive mode?:: +* How much faster is -F or -f than -C?:: +* If I have a simple grammar cant I just parse it with flex?:: +* Why doesn't yyrestart() set the start state back to INITIAL?:: +* How can I match C-style comments?:: +* The period isn't working the way I expected.:: +* Can I get the flex manual in another format?:: +* Does there exist a "faster" NDFA->DFA algorithm?:: +* How does flex compile the DFA so quickly?:: +* How can I use more than 8192 rules?:: +* How do I abandon a file in the middle of a scan and switch to a new file?:: +* How do I execute code only during initialization (only before the first scan)?:: +* How do I execute code at termination?:: +* Where else can I find help?:: +* Can I include comments in the "rules" section of the file?:: +* I get an error about undefined yywrap().:: +* How can I change the matching pattern at run time?:: +* How can I expand macros in the input?:: +* How can I build 
a two-pass scanner?:: +* How do I match any string not matched in the preceding rules?:: +* I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: +* Is there a way to make flex treat NULL like a regular character?:: +* Whenever flex can not match the input it says "flex scanner jammed".:: +* Why doesn't flex have non-greedy operators like perl does?:: +* Memory leak - 16386 bytes allocated by malloc.:: +* How do I track the byte offset for lseek()?:: +* How do I use my own I/O classes in a C++ scanner?:: +* How do I skip as many chars as possible?:: +* deleteme00:: +* Are certain equivalent patterns faster than others?:: +* Is backing up a big deal?:: +* Can I fake multi-byte character support?:: +* deleteme01:: +* Can you discuss some flex internals?:: +* unput() messes up yy_at_bol:: +* The | operator is not doing what I want:: +* Why can't flex understand this variable trailing context pattern?:: +* The ^ operator isn't working:: +* Trailing context is getting confused with trailing optional patterns:: +* Is flex GNU or not?:: +* ERASEME53:: +* I need to scan if-then-else blocks and while loops:: +* ERASEME55:: +* ERASEME56:: +* ERASEME57:: +* Is there a repository for flex scanners?:: +* How can I conditionally compile or preprocess my flex input file?:: +* Where can I find grammars for lex and yacc?:: +* I get an end-of-buffer message for each character scanned.:: +* unnamed-faq-62:: +* unnamed-faq-63:: +* unnamed-faq-64:: +* unnamed-faq-65:: +* unnamed-faq-66:: +* unnamed-faq-67:: +* unnamed-faq-68:: +* unnamed-faq-69:: +* unnamed-faq-70:: +* unnamed-faq-71:: +* unnamed-faq-72:: +* unnamed-faq-73:: +* unnamed-faq-74:: +* unnamed-faq-75:: +* unnamed-faq-76:: +* unnamed-faq-77:: +* unnamed-faq-78:: +* unnamed-faq-79:: +* unnamed-faq-80:: +* unnamed-faq-81:: +* unnamed-faq-82:: +* unnamed-faq-83:: +* unnamed-faq-84:: +* unnamed-faq-85:: +* unnamed-faq-86:: +* unnamed-faq-87:: +* unnamed-faq-88:: +* unnamed-faq-90:: +* unnamed-faq-91:: +* 
unnamed-faq-92:: +* unnamed-faq-93:: +* unnamed-faq-94:: +* unnamed-faq-95:: +* unnamed-faq-96:: +* unnamed-faq-97:: +* unnamed-faq-98:: +* unnamed-faq-99:: +* unnamed-faq-100:: +* unnamed-faq-101:: +* What is the difference between YYLEX_PARAM and YY_DECL?:: +* Why do I get "conflicting types for yylex" error?:: +* How do I access the values set in a Flex action from within a Bison action?:: + +Appendices + +* Makefiles and Flex:: +* Bison Bridge:: +* M4 Dependency:: +* Common Patterns:: + +Indices + +* Concept Index:: +* Index of Functions and Macros:: +* Index of Variables:: +* Index of Data Types:: +* Index of Hooks:: +* Index of Scanner Options:: + +@end detailmenu +@end menu +@end ifnottex +@node Copyright, Reporting Bugs, Top, Top +@chapter Copyright + +@cindex copyright of flex +@cindex distributing flex +@insertcopying + +@node Reporting Bugs, Introduction, Copyright, Top +@chapter Reporting Bugs + +@cindex bugs, reporting +@cindex reporting bugs + +If you find a bug in @code{flex}, please report it using +the SourceForge Bug Tracking facilities which can be found on +@url{http://sourceforge.net/projects/flex,flex's SourceForge Page}. + +@node Introduction, Simple Examples, Reporting Bugs, Top +@chapter Introduction + +@cindex scanner, definition of +@code{flex} is a tool for generating @dfn{scanners}. A scanner is a +program which recognizes lexical patterns in text. The @code{flex} +program reads the given input files, or its standard input if no file +names are given, for a description of a scanner to generate. The +description is in the form of pairs of regular expressions and C code, +called @dfn{rules}. @code{flex} generates as output a C source file, +@file{lex.yy.c} by default, which defines a routine @code{yylex()}. +This file can be compiled and linked with the flex runtime library to +produce an executable. When the executable is run, it analyzes its +input for occurrences of the regular expressions. 
Whenever it finds +one, it executes the corresponding C code. + +@node Simple Examples, Format, Introduction, Top +@chapter Some Simple Examples + +First some simple examples to get the flavor of how one uses +@code{flex}. + +@cindex username expansion +The following @code{flex} input specifies a scanner which, when it +encounters the string @samp{username} will replace it with the user's +login name: + +@example +@verbatim + %% + username printf( "%s", getlogin() ); +@end verbatim +@end example + +@cindex default rule +@cindex rules, default +By default, any text not matched by a @code{flex} scanner is copied to +the output, so the net effect of this scanner is to copy its input file +to its output with each occurrence of @samp{username} expanded. In this +input, there is just one rule. @samp{username} is the @dfn{pattern} and +the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the +beginning of the rules. + +Here's another simple example: + +@cindex counting characters and lines +@example +@verbatim + int num_lines = 0, num_chars = 0; + + %% + \n ++num_lines; ++num_chars; + . ++num_chars; + + %% + main() + { + yylex(); + printf( "# of lines = %d, # of chars = %d\n", + num_lines, num_chars ); + } +@end verbatim +@end example + +This scanner counts the number of characters and the number of lines in +its input. It produces no output other than the final report on the +character and line counts. The first line declares two globals, +@code{num_lines} and @code{num_chars}, which are accessible both inside +@code{yylex()} and in the @code{main()} routine declared after the +second @samp{%%}. There are two rules, one which matches a newline +(@samp{\n}) and increments both the line count and the character count, +and one which matches any character other than a newline (indicated by +the @samp{.} regular expression). 
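To make the effect of the two rules concrete, the following C sketch performs the same counting on an in-memory buffer, without using @code{flex} at all. The names @code{struct counts} and @code{count_input()} are invented for this illustration and are not part of flex or of the generated scanner; a real scanner reads from @code{yyin} and runs the actions inside @code{yylex()}.

```c
#include <stddef.h>

/* Hypothetical helper, not part of flex: the result of one counting pass. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Mimics the two rules of the example scanner on a buffer of len bytes.
 * Every byte is matched by exactly one rule:
 *   \n   ++num_lines; ++num_chars;
 *   .    ++num_chars;
 */
struct counts count_input(const char *buf, size_t len)
{
    struct counts c = { 0, 0 };

    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n')
            ++c.num_lines;  /* only the newline rule counts a line */
        ++c.num_chars;      /* both rules count the character */
    }
    return c;
}
```

Calling @code{count_input("ab\ncd\n", 6)} under these assumptions yields two lines and six characters, which is what the scanner above would report for the same input.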
+ +A somewhat more complicated example: + +@cindex Pascal-like language +@example +@verbatim + /* scanner for a toy Pascal-like language */ + + %{ + /* need this for the call to atof() below */ + #include <math.h> + %} + + DIGIT [0-9] + ID [a-z][a-z0-9]* + + %% + + {DIGIT}+ { + printf( "An integer: %s (%d)\n", yytext, + atoi( yytext ) ); + } + + {DIGIT}+"."{DIGIT}* { + printf( "A float: %s (%g)\n", yytext, + atof( yytext ) ); + } + + if|then|begin|end|procedure|function { + printf( "A keyword: %s\n", yytext ); + } + + {ID} printf( "An identifier: %s\n", yytext ); + + "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); + + "{"[^}\n]*"}" /* eat up one-line comments */ + + [ \t\n]+ /* eat up whitespace */ + + . printf( "Unrecognized character: %s\n", yytext ); + + %% + + main( argc, argv ) + int argc; + char **argv; + { + ++argv, --argc; /* skip over program name */ + if ( argc > 0 ) + yyin = fopen( argv[0], "r" ); + else + yyin = stdin; + + yylex(); + } +@end verbatim +@end example + +This is the beginning of a simple scanner for a language like Pascal. +It identifies different types of @dfn{tokens} and reports on what it has +seen. + +The details of this example will be explained in the following +sections. + +@node Format, Patterns, Simple Examples, Top +@chapter Format of the Input File + + +@cindex format of flex input +@cindex input, format of +@cindex file format +@cindex sections of flex input + +The @code{flex} input file consists of three sections, separated by a +line containing only @samp{%%}. 
+ +@cindex format of input file +@example +@verbatim + definitions + %% + rules + %% + user code +@end verbatim +@end example + +@menu +* Definitions Section:: +* Rules Section:: +* User Code Section:: +* Comments in the Input:: +@end menu + +@node Definitions Section, Rules Section, Format, Format +@section Format of the Definitions Section + +@cindex input file, Definitions section +@cindex Definitions, in flex input +The @dfn{definitions section} contains declarations of simple @dfn{name} +definitions to simplify the scanner specification, and declarations of +@dfn{start conditions}, which are explained in a later section. + +@cindex aliases, how to define +@cindex pattern aliases, how to define +Name definitions have the form: + +@example +@verbatim + name definition +@end verbatim +@end example + +The @samp{name} is a word beginning with a letter or an underscore +(@samp{_}) followed by zero or more letters, digits, @samp{_}, or +@samp{-} (dash). The definition is taken to begin at the first +non-whitespace character following the name and continuing to the end of +the line. The definition can subsequently be referred to using +@samp{@{name@}}, which will expand to @samp{(definition)}. For example, + +@cindex pattern aliases, defining +@cindex defining pattern aliases +@example +@verbatim + DIGIT [0-9] + ID [a-z][a-z0-9]* +@end verbatim +@end example + +Defines @samp{DIGIT} to be a regular expression which matches a single +digit, and @samp{ID} to be a regular expression which matches a letter +followed by zero-or-more letters-or-digits. A subsequent reference to + +@cindex pattern aliases, use of +@example +@verbatim + {DIGIT}+"."{DIGIT}* +@end verbatim +@end example + +is identical to + +@example +@verbatim + ([0-9])+"."([0-9])* +@end verbatim +@end example + +and matches one-or-more digits followed by a @samp{.} followed by +zero-or-more digits. 
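The textual substitution described above can be sketched in C. This is only an illustration of the rule that @samp{@{name@}} expands to @samp{(definition)}; it is not how flex itself is implemented. The names @code{struct name_def} and @code{expand_names()} are hypothetical, and the sketch deliberately ignores quoting and escapes. A brace group that names no definition, such as the repeat count in @samp{r@{2,5@}}, is copied through unchanged.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical table entry: one "name definition" line. */
struct name_def {
    const char *name;
    const char *definition;
};

/* Copies pattern to out, replacing each {name} that appears in defs with
 * (definition).  Returns 0 on success, -1 if out (outsz bytes) is too small.
 * Simplification: quoted strings and backslash escapes are not special. */
int expand_names(const char *pattern, const struct name_def *defs,
                 size_t ndefs, char *out, size_t outsz)
{
    size_t o = 0;

    while (*pattern) {
        const char *close = (*pattern == '{') ? strchr(pattern, '}') : NULL;
        const char *repl = NULL;

        if (close) {
            size_t n = (size_t)(close - pattern - 1);  /* length of name */
            for (size_t i = 0; i < ndefs; i++) {
                if (strlen(defs[i].name) == n &&
                    strncmp(defs[i].name, pattern + 1, n) == 0) {
                    repl = defs[i].definition;
                    break;
                }
            }
        }

        if (repl) {
            /* The parentheses keep the definition's precedence intact. */
            int w = snprintf(out + o, outsz - o, "(%s)", repl);
            if (w < 0 || (size_t)w >= outsz - o)
                return -1;
            o += (size_t)w;
            pattern = close + 1;
        } else {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = *pattern++;
        }
    }
    if (o >= outsz)
        return -1;
    out[o] = '\0';
    return 0;
}
```

With a table that maps @samp{DIGIT} to @samp{[0-9]}, this sketch turns @samp{@{DIGIT@}+"."@{DIGIT@}*} into @samp{([0-9])+"."([0-9])*}, matching the expansion shown above. The added parentheses are the reason a definition such as @samp{a|b} still behaves as a single unit when followed by @samp{*}.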
+ +@cindex comments in flex input +An unindented comment (i.e., a line +beginning with @samp{/*}) is copied verbatim to the output up +to the next @samp{*/}. + +@cindex %@{ and %@}, in Definitions Section +@cindex embedding C code in flex input +@cindex C code in flex input +Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} +is also copied verbatim to the output (with the %@{ and %@} symbols +removed). The %@{ and %@} symbols must appear unindented on lines by +themselves. + +@cindex %top + +A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except +that the code in a @code{%top} block is relocated to the @emph{top} of the +generated file, before any flex definitions @footnote{Actually, +@code{yyIN_HEADER} is defined before the @samp{%top} block.}. +The @code{%top} block is useful when you want certain preprocessor macros to be +defined or certain files to be included before the generated code. +The single characters, @samp{@{} and @samp{@}} are used to delimit the +@code{%top} block, as shown in the example below: + +@example +@verbatim + %top{ + /* This code goes at the "top" of the generated file. */ + #include <stdint.h> + #include <inttypes.h> + } +@end verbatim +@end example + +Multiple @code{%top} blocks are allowed, and their order is preserved. + +@node Rules Section, User Code Section, Definitions Section, Format +@section Format of the Rules Section + +@cindex input file, Rules Section +@cindex rules, in flex input +The @dfn{rules} section of the @code{flex} input contains a series of +rules of the form: + +@example +@verbatim + pattern action +@end verbatim +@end example + +where the pattern must be unindented and the action must begin +on the same line. +@xref{Patterns}, for a further description of patterns and actions. 
+ +In the rules section, any indented or %@{ %@} enclosed text appearing +before the first rule may be used to declare variables which are local +to the scanning routine and (after the declarations) code which is to be +executed whenever the scanning routine is entered. Other indented or +%@{ %@} text in the rule section is still copied to the output, but its +meaning is not well-defined and it may well cause compile-time errors +(this feature is present for @acronym{POSIX} compliance. @xref{Lex and +Posix}, for other such features). + +Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} +is copied verbatim to the output (with the %@{ and %@} symbols removed). +The %@{ and %@} symbols must appear unindented on lines by themselves. + +@node User Code Section, Comments in the Input, Rules Section, Format +@section Format of the User Code Section + +@cindex input file, user code Section +@cindex user code, in flex input +The user code section is simply copied to @file{lex.yy.c} verbatim. It +is used for companion routines which call or are called by the scanner. +The presence of this section is optional; if it is missing, the second +@samp{%%} in the input file may be skipped, too. + +@node Comments in the Input, , User Code Section, Format +@section Comments in the Input + +@cindex comments, syntax of +Flex supports C-style comments, that is, anything between @samp{/*} and +@samp{*/} is +considered a comment. Whenever flex encounters a comment, it copies the +entire comment verbatim to the generated source code. Comments may +appear just about anywhere, but with the following exceptions: + +@itemize +@cindex comments, in rules section +@item +Comments may not appear in the Rules Section wherever flex is expecting +a regular expression. This means comments may not appear at the +beginning of a line, or immediately following a list of scanner states. +@item +Comments may not appear on an @samp{%option} line in the Definitions +Section. 
+@end itemize + +If you want to follow a simple rule, then always begin a comment on a +new line, with one or more whitespace characters before the initial +@samp{/*}. This rule will work anywhere in the input file. + +All the comments in the following example are valid: + +@cindex comments, valid uses of +@cindex comments in the input +@example +@verbatim +%{ +/* code block */ +%} + +/* Definitions Section */ +%x STATE_X + +%% + /* Rules Section */ +ruleA /* after regex */ { /* code block */ } /* after code block */ + /* Rules Section (indented) */ +<STATE_X>{ +ruleC ECHO; +ruleD ECHO; +%{ +/* code block */ +%} +} +%% +/* User Code Section */ + +@end verbatim +@end example + +@node Patterns, Matching, Format, Top +@chapter Patterns + +@cindex patterns, in rules section +@cindex regular expressions, in patterns +The patterns in the input (see @ref{Rules Section}) are written using an +extended set of regular expressions. These are: + +@cindex patterns, syntax +@cindex patterns, syntax +@table @samp +@item x +match the character 'x' + +@item . +any character (byte) except newline + +@cindex [] in patterns +@cindex character classes in patterns, syntax of +@cindex POSIX, character classes in patterns, syntax of +@item [xyz] +a @dfn{character class}; in this case, the pattern +matches either an 'x', a 'y', or a 'z' + +@cindex ranges in patterns +@item [abj-oZ] +a "character class" with a range in it; matches +an 'a', a 'b', any letter from 'j' through 'o', +or a 'Z' + +@cindex ranges in patterns, negating +@cindex negating ranges in patterns +@item [^A-Z] +a "negated character class", i.e., any character +but those in the class. In this case, any +character EXCEPT an uppercase letter. + +@item [^A-Z\n] +any character EXCEPT an uppercase letter or +a newline + +@item [a-z]@{-@}[aeiou] +the lowercase consonants + +@item r* +zero or more r's, where r is any regular expression + +@item r+ +one or more r's + +@item r? 
+zero or one r's (that is, ``an optional r'') + +@cindex braces in patterns +@item r@{2,5@} +anywhere from two to five r's + +@item r@{2,@} +two or more r's + +@item r@{4@} +exactly 4 r's + +@cindex pattern aliases, expansion of +@item @{name@} +the expansion of the @samp{name} definition +(@pxref{Format}). + +@cindex literal text in patterns, syntax of +@cindex verbatim text in patterns, syntax of +@item "[xyz]\"foo" +the literal string: @samp{[xyz]"foo} + +@cindex escape sequences in patterns, syntax of +@item \X +if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or +@samp{v}, then the ANSI-C interpretation of @samp{\x}. Otherwise, a +literal @samp{X} (used to escape operators such as @samp{*}) + +@cindex NULL character in patterns, syntax of +@item \0 +a NUL character (ASCII code 0) + +@cindex octal characters in patterns +@item \123 +the character with octal value 123 + +@item \x2a +the character with hexadecimal value 2a + +@item (r) +match an @samp{r}; parentheses are used to override precedence (see below) + +@item (?r-s:pattern) +apply option @samp{r} and omit option @samp{s} while interpreting pattern. +Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}. + +@samp{i} means case-insensitive. @samp{-i} means case-sensitive. + +@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever. +@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}. + +@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless +it is backslash-escaped, contained within @samp{""}s, or appears inside a +character class. + +The following are all valid: + +@verbatim +(?:foo) same as (foo) +(?i:ab7) same as ([aA][bB]7) +(?-i:ab) same as (ab) +(?s:.) same as [\x00-\xFF] +(?-s:.) same as [^\n] +(?ix-s: a . 
b) same as ([Aa][^\n][bB]) +(?x:a b) same as ("ab") +(?x:a\ b) same as ("a b") +(?x:a" "b) same as ("a b") +(?x:a[ ]b) same as ("a b") +(?x:a + /* comment */ + b + c) same as (abc) +@end verbatim + +@item (?# comment ) +omit everything within @samp{()}. The first @samp{)} +character encountered ends the pattern. It is not possible for the comment +to contain a @samp{)} character. The comment may span lines. + +@cindex concatenation, in patterns +@item rs +the regular expression @samp{r} followed by the regular expression @samp{s}; called +@dfn{concatenation} + +@item r|s +either an @samp{r} or an @samp{s} + +@cindex trailing context, in patterns +@item r/s +an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is +included when determining whether this rule is the longest match, but is +then returned to the input before the action is executed. So the action +only sees the text matched by @samp{r}. This type of pattern is called +@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex +cannot match correctly. @xref{Limitations}, regarding dangerous trailing +context.) + +@cindex beginning of line, in patterns +@cindex BOL, in patterns +@item ^r +an @samp{r}, but only at the beginning of a line (i.e., +when just starting to scan, or right after a +newline has been scanned). + +@cindex end of line, in patterns +@cindex EOL, in patterns +@item r$ +an @samp{r}, but only at the end of a line (i.e., just before a +newline). Equivalent to @samp{r/\n}. + +@cindex newline, matching in patterns +Note that @code{flex}'s notion of ``newline'' is exactly +whatever the C compiler used to compile @code{flex} +interprets @samp{\n} as; in particular, on some DOS +systems you must either filter out @samp{\r}s in the +input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}. 
+ +@cindex start conditions, in patterns +@item <s>r +an @samp{r}, but only in start condition @code{s} (see @ref{Start +Conditions} for discussion of start conditions). + +@item <s1,s2,s3>r +same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}. + +@item <*>r +an @samp{r} in any start condition, even an exclusive one. + +@cindex end of file, in patterns +@cindex EOF in patterns, syntax of +@item <<EOF>> +an end-of-file. + +@item <s1,s2><<EOF>> +an end-of-file when in start condition @code{s1} or @code{s2} +@end table + +Note that inside of a character class, all regular expression operators +lose their special meaning except escape (@samp{\}) and the character class +operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}. + +@cindex patterns, precedence of operators +The regular expressions listed above are grouped according to +precedence, from highest precedence at the top to lowest at the bottom. +Those grouped together have equal precedence (see special note on the +precedence of the repeat operator, @samp{@{@}}, under the documentation +for the @samp{--posix} POSIX compliance option). For example, + +@cindex patterns, grouping and precedence +@example +@verbatim + foo|bar* +@end verbatim +@end example + +is the same as + +@example +@verbatim + (foo)|(ba(r*)) +@end verbatim +@end example + +since the @samp{*} operator has higher precedence than concatenation, +and concatenation higher than alternation (@samp{|}). This pattern +therefore matches @emph{either} the string @samp{foo} @emph{or} the +string @samp{ba} followed by zero-or-more @samp{r}'s. 
To match +@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use: + +@example +@verbatim + foo|(bar)* +@end verbatim +@end example + +And to match a sequence of zero or more repetitions of @samp{foo} and +@samp{bar}: + +@cindex patterns, repetitions with grouping +@example +@verbatim + (foo|bar)* +@end verbatim +@end example + +@cindex character classes in patterns +In addition to characters and ranges of characters, character classes +can also contain @dfn{character class expressions}. These are +expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which +themselves must appear between the @samp{[} and @samp{]} of the +character class. Other elements may occur inside the character class, +too). The valid expressions are: + +@cindex patterns, valid character classes +@example +@verbatim + [:alnum:] [:alpha:] [:blank:] + [:cntrl:] [:digit:] [:graph:] + [:lower:] [:print:] [:punct:] + [:space:] [:upper:] [:xdigit:] +@end verbatim +@end example + +These expressions all designate a set of characters equivalent to the +corresponding standard C @code{isXXX} function. For example, +@samp{[:alnum:]} designates those characters for which @code{isalnum()} +returns true - i.e., any alphabetic or numeric character. Some systems +don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a +blank or a tab. + +For example, the following character classes are all equivalent: + +@cindex character classes, equivalence of +@cindex patterns, character class equivalence +@example +@verbatim + [[:alnum:]] + [[:alpha:][:digit:]] + [[:alpha:][0-9]] + [a-zA-Z0-9] +@end verbatim +@end example + +A word of caution. Character classes are expanded immediately when seen in the @code{flex} input. +This means the character classes are sensitive to the locale in which @code{flex} +is executed, and the resulting scanner will not be sensitive to the runtime locale. +This may or may not be desirable. 
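The equivalence of the four classes listed above can be checked with the @code{<ctype.h>} functions they are defined in terms of. The following C sketch is an illustration only: the function names are invented here, the check runs in whatever locale the calling program has selected (the default "C" locale unless @code{setlocale()} has been called), and the explicit ranges assume an ASCII-compatible character set. It says nothing about how flex builds its tables, which happens once, when @code{flex} is run.

```c
#include <ctype.h>

/* [[:alnum:]] : characters for which isalnum() returns true */
int in_alnum(int c)
{
    return isalnum((unsigned char)c) != 0;
}

/* [[:alpha:][:digit:]] : union of the isalpha() and isdigit() sets */
int in_alpha_or_digit(int c)
{
    return isalpha((unsigned char)c) || isdigit((unsigned char)c);
}

/* [a-zA-Z0-9] : explicit ranges, assuming an ASCII-compatible charset */
int in_explicit_ranges(int c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Returns 1 if the three definitions agree on every byte value 0..255
 * in the current locale, 0 otherwise. */
int classes_agree(void)
{
    for (int c = 0; c < 256; c++) {
        if (in_alnum(c) != in_alpha_or_digit(c) ||
            in_alnum(c) != in_explicit_ranges(c))
            return 0;
    }
    return 1;
}
```

In the "C" locale @code{classes_agree()} is expected to return 1. In a locale whose @code{isalpha()} accepts additional bytes, the explicit range would no longer agree with the other two, which is the locale sensitivity the caution above refers to.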
+ + +@itemize +@cindex case-insensitive, effect on character classes +@item If your scanner is case-insensitive (the @samp{-i} flag), then +@samp{[:upper:]} and @samp{[:lower:]} are equivalent to +@samp{[:alpha:]}. + +@anchor{case and character ranges} +@item Character classes with ranges, such as @samp{[a-Z]}, should be used with +caution in a case-insensitive scanner if the range spans upper or lowercase +characters. Flex does not know if you want to fold all upper and lowercase +characters together, or if you want the literal numeric range specified (with +no case folding). When in doubt, flex will assume that you meant the literal +numeric range, and will issue a warning. The exception to this rule is a +character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you +want case-folding to occur. Here are some examples with the @samp{-i} flag +enabled: + +@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}} +@item Range @tab Result @tab Literal Range @tab Alternate Range +@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab +@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab +@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]} +@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]} +@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]} +@end multitable + +@cindex end of line, in negated character classes +@cindex EOL, in negated character classes +@item +A negated character class such as the example @samp{[^A-Z]} above +@emph{will} match a newline unless @samp{\n} (or an equivalent escape +sequence) is one of the characters explicitly present in the negated +character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other +regular expression tools treat negated character classes, but +unfortunately the inconsistency is historically entrenched. 
Matching +newlines means that a pattern like @samp{[^"]*} can match the entire +input unless there's another quote in the input. + +Flex allows negation of character class expressions by prepending @samp{^} to +the POSIX character class name. + +@example +@verbatim + [:^alnum:] [:^alpha:] [:^blank:] + [:^cntrl:] [:^digit:] [:^graph:] + [:^lower:] [:^print:] [:^punct:] + [:^space:] [:^upper:] [:^xdigit:] +@end verbatim +@end example + +Flex will issue a warning if the expressions @samp{[:^upper:]} and +@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is +unclear. The current behavior is to skip them entirely, but this may change +without notice in future revisions of flex. + +@item + +The @samp{@{-@}} operator computes the difference of two character classes. For +example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class +@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is +just the single character @samp{a}). The @samp{@{-@}} operator is left +associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful +not to accidentally create an empty set, which will never match. + +@item + +The @samp{@{+@}} operator computes the union of two character classes. For +example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator +is useful when preceded by the result of a difference operation, as in, +@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to +@samp{[A-Zq]} in the "C" locale. + +@cindex trailing context, limits of +@cindex ^ as non-special character in patterns +@cindex $ as normal character in patterns +@item +A rule can have at most one instance of trailing context (the @samp{/} operator +or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns +can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$}, +cannot be grouped inside parentheses. 
A @samp{^} which does not occur at +the beginning of a rule or a @samp{$} which does not occur at the end of +a rule loses its special properties and is treated as a normal character. + +@item +The following are invalid: + +@cindex patterns, invalid trailing context +@example +@verbatim + foo/bar$ + <sc1>foo<sc2>bar +@end verbatim +@end example + +Note that the first of these can be written @samp{foo/bar\n}. + +@item +The following will result in @samp{$} or @samp{^} being treated as a normal character: + +@cindex patterns, special characters treated as non-special +@example +@verbatim + foo|(bar$) + foo|^bar +@end verbatim +@end example + +If the desired meaning is a @samp{foo} or a +@samp{bar}-followed-by-a-newline, the following could be used (the +special @code{|} action is explained below, @pxref{Actions}): + +@cindex patterns, end of line +@example +@verbatim + foo | + bar$ /* action goes here */ +@end verbatim +@end example + +A similar trick will work for matching a @samp{foo} or a +@samp{bar}-at-the-beginning-of-a-line. +@end itemize + +@node Matching, Actions, Patterns, Top +@chapter How the Input Is Matched + +@cindex patterns, matching +@cindex input, matching +@cindex trailing context, matching +@cindex matching, and trailing context +@cindex matching, length of +@cindex matching, multiple matches +When the generated scanner is run, it analyzes its input looking for +strings which match any of its patterns. If it finds more than one +match, it takes the one matching the most text (for trailing context +rules, this includes the length of the trailing part, even though it +will then be returned to the input). If it finds two or more matches of +the same length, the rule listed first in the @code{flex} input file is +chosen. 
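The selection rule just described (longest match wins, and on a tie the rule listed first wins) can be modelled in a few lines of plain C. This is only a sketch of the rule itself, not of how @code{flex} implements matching. The names @code{matcher_fn}, @code{match_if}, @code{match_ident}, @code{demo_rules} and @code{pick_rule} are hypothetical; the two matchers stand in for a rule @samp{if} listed before a rule @samp{[a-z]+}.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pattern: returns how many leading
 * characters of `s` the pattern matches, or 0 if it does not match. */
typedef size_t (*matcher_fn)(const char *s);

/* Stand-in for the rule:  if  */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Stand-in for the rule:  [a-z]+  */
static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        ++n;
    return n;
}

/* Rules in the order they would appear in the input file. */
static const matcher_fn demo_rules[] = { match_if, match_ident };

/* Return the index of the winning rule, or -1 if nothing matches.
 * The matched length is stored in *match_len. */
static int pick_rule(const matcher_fn *rules, size_t nrules,
                     const char *input, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; ++i) {
        size_t len = rules[i](input);
        /* Strictly greater: an equally long match by a later rule
         * never displaces an earlier rule. */
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *match_len = best_len;
    return best;
}
```

With this table, the text @samp{if} is claimed by rule 0, because both rules match two characters and the earlier rule wins the tie, while @samp{iffy} is claimed by rule 1, because that rule matches more text.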
+ +@cindex token +@cindex yytext +@cindex yyleng +Once the match is determined, the text corresponding to the match +(called the @dfn{token}) is made available in the global character +pointer @code{yytext}, and its length in the global integer +@code{yyleng}. The @dfn{action} corresponding to the matched pattern is +then executed (@pxref{Actions}), and then the remaining input is scanned +for another match. + +@cindex default rule +If no match is found, then the @dfn{default rule} is executed: the next +character in the input is considered matched and copied to the standard +output. Thus, the simplest valid @code{flex} input is: + +@cindex minimal scanner +@example +@verbatim + %% +@end verbatim +@end example + +which generates a scanner that simply copies its input (one character at +a time) to its output. + +@cindex yytext, two types of +@cindex %array, use of +@cindex %pointer, use of +@vindex yytext +Note that @code{yytext} can be defined in two different ways: either as +a character @emph{pointer} or as a character @emph{array}. You can +control which definition @code{flex} uses by including one of the +special directives @code{%pointer} or @code{%array} in the first +(definitions) section of your flex input. The default is +@code{%pointer}, unless you use the @samp{-l} lex compatibility option, +in which case @code{yytext} will be an array. The advantage of using +@code{%pointer} is substantially faster scanning and no buffer overflow +when matching very large tokens (unless you run out of dynamic memory). +The disadvantage is that you are restricted in how your actions can +modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()} +function destroy the present contents of @code{yytext}, which can be a +considerable porting headache when moving between different @code{lex} +versions.
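Because @code{yytext} under @code{%pointer} refers to storage owned by the scanner, an action that needs the token text later (or after calling @code{unput()}) typically copies it first. The helper below is a minimal sketch of such a copy; @code{copy_token} is a hypothetical name, not part of the @code{flex} API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap copy of the first `len` bytes of
 * `text`, NUL-terminated, or NULL if allocation fails.  The caller is
 * responsible for calling free() on the result. */
static char *copy_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

Inside an action this would be called as @code{copy_token(yytext, yyleng)}. The copy stays valid whatever the scanner later does to @code{yytext}.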
+ +@cindex %array, advantages of +The advantage of @code{%array} is that you can then modify @code{yytext} +to your heart's content, and calls to @code{unput()} do not destroy +@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex} +programs sometimes access @code{yytext} externally using declarations of +the form: + +@example +@verbatim + extern char yytext[]; +@end verbatim +@end example + +This definition is erroneous when used with @code{%pointer}, but correct +for @code{%array}. + +The @code{%array} declaration defines @code{yytext} to be an array of +@code{YYLMAX} characters, which defaults to a fairly large value. You +can change the size by simply #define'ing @code{YYLMAX} to a different +value in the first section of your @code{flex} input. As mentioned +above, with @code{%pointer} yytext grows dynamically to accommodate +large tokens. While this means your @code{%pointer} scanner can +accommodate very large tokens (such as matching entire blocks of +comments), bear in mind that each time the scanner must resize +@code{yytext} it also must rescan the entire token from the beginning, +so matching such tokens can prove slow. @code{yytext} presently does +@emph{not} dynamically grow if a call to @code{unput()} results in too +much text being pushed back; instead, a run-time error results. + +@cindex %array, with C++ +Also note that you cannot use @code{%array} with C++ scanner classes +(@pxref{Cxx}). + +@node Actions, Generated Scanner, Matching, Top +@chapter Actions + +@cindex actions +Each pattern in a rule has a corresponding @dfn{action}, which can be +any arbitrary C statement. The pattern ends at the first non-escaped +whitespace character; the remainder of the line is its action. If the +action is empty, then when the pattern is matched the input token is +simply discarded. 
For example, here is the specification for a program +which deletes all occurrences of @samp{zap me} from its input: + +@cindex deleting lines from input +@example +@verbatim + %% + "zap me" +@end verbatim +@end example + +This example will copy all other characters in the input to the output +since they will be matched by the default rule. + +Here is a program which compresses multiple blanks and tabs down to a +single blank, and throws away whitespace found at the end of a line: + +@cindex whitespace, compressing +@cindex compressing whitespace +@example +@verbatim + %% + [ \t]+ putchar( ' ' ); + [ \t]+$ /* ignore this token */ +@end verbatim +@end example + +@cindex %@{ and %@}, in Rules Section +@cindex actions, use of @{ and @} +@cindex actions, embedded C strings +@cindex C-strings, in actions +@cindex comments, in actions +If the action contains a @samp{@{}, then the action spans till the +balancing @samp{@}} is found, and the action may cross multiple lines. +@code{flex} knows about C strings and comments and won't be fooled by +braces found within them, but also allows actions to begin with +@samp{%@{} and will consider the action to be all the text up to the +next @samp{%@}} (regardless of ordinary braces inside the action). + +@cindex |, in actions +An action consisting solely of a vertical bar (@samp{|}) means ``same as the +action for the next rule''. See below for an illustration. + +Actions can include arbitrary C code, including @code{return} statements +to return a value to whatever routine called @code{yylex()}. Each time +@code{yylex()} is called it continues processing tokens from where it +last left off until it either reaches the end of the file or executes a +return. + +@cindex yytext, modification of +Actions are free to modify @code{yytext} except for lengthening it +(adding characters to its end--these will overwrite later characters in +the input stream). This however does not apply when using @code{%array} +(@pxref{Matching}). 
In that case, @code{yytext} may be freely modified +in any way. + +@cindex yyleng, modification of +@cindex yymore, and yyleng +Actions are free to modify @code{yyleng} except they should not do so if +the action also includes use of @code{yymore()} (see below). + +@cindex preprocessor macros, for use in actions +There are a number of special directives which can be included within an +action: + +@table @code +@item ECHO +@cindex ECHO +copies yytext to the scanner's output. + +@item BEGIN +@cindex BEGIN +followed by the name of a start condition places the scanner in the +corresponding start condition (see below). + +@item REJECT +@cindex REJECT +directs the scanner to proceed on to the ``second best'' rule which +matched the input (or a prefix of the input). The rule is chosen as +described above in @ref{Matching}, and @code{yytext} and @code{yyleng} +set up appropriately. It may either be one which matched as much text +as the originally chosen rule but came later in the @code{flex} input +file, or one which matched less text. For example, the following will +both count the words in the input and call the routine @code{special()} +whenever @samp{frob} is seen: + +@example +@verbatim + int word_count = 0; + %% + + frob special(); REJECT; + [^ \t\n]+ ++word_count; +@end verbatim +@end example + +Without the @code{REJECT}, any occurrences of @samp{frob} in the input +would not be counted as words, since the scanner normally executes only +one action per token. Multiple uses of @code{REJECT} are allowed, each +one finding the next best choice to the currently active rule. 
For +example, when the following scanner scans the token @samp{abcd}, it will +write @samp{abcdabcaba} to the output: + +@cindex REJECT, calling multiple times +@cindex |, use of +@example +@verbatim + %% + a | + ab | + abc | + abcd ECHO; REJECT; + .|\n /* eat up any unmatched character */ +@end verbatim +@end example + +The first three rules share the fourth's action since they use the +special @samp{|} action. + +@code{REJECT} is a particularly expensive feature in terms of scanner +performance; if it is used in @emph{any} of the scanner's actions it +will slow down @emph{all} of the scanner's matching. Furthermore, +@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options +(@pxref{Scanner Options}). + +Note also that unlike the other special actions, @code{REJECT} is a +@emph{branch}. Code immediately following it in the action will +@emph{not} be executed. + +@item yymore() +@cindex yymore() +tells the scanner that the next time it matches a rule, the +corresponding token should be @emph{appended} onto the current value of +@code{yytext} rather than replacing it. For example, given the input +@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to +the output: + +@cindex yymore(), mega-kludge +@cindex yymore() to append token to previous token +@example +@verbatim + %% + mega- ECHO; yymore(); + kludge ECHO; +@end verbatim +@end example + +First @samp{mega-} is matched and echoed to the output. Then @samp{kludge} +is matched, but the previous @samp{mega-} is still hanging around at the +beginning of +@code{yytext} +so the +@code{ECHO} +for the @samp{kludge} rule will actually write @samp{mega-kludge}. +@end table + +@cindex yymore, performance penalty of +Two notes regarding use of @code{yymore()}. First, @code{yymore()} +depends on the value of @code{yyleng} correctly reflecting the size of +the current token, so you must not modify @code{yyleng} if you are using +@code{yymore()}. 
Second, the presence of @code{yymore()} in the +scanner's action entails a minor performance penalty in the scanner's +matching speed. + +@cindex yyless() +@code{yyless(n)} returns all but the first @code{n} characters of the +current token back to the input stream, where they will be rescanned +when the scanner looks for the next match. @code{yytext} and +@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now +be equal to @code{n}). For example, on the input @samp{foobar} the +following will write out @samp{foobarbar}: + +@cindex yyless(), pushing back characters +@cindex pushing back characters with yyless +@example +@verbatim + %% + foobar ECHO; yyless(3); + [a-z]+ ECHO; +@end verbatim +@end example + +An argument of 0 to @code{yyless()} will cause the entire current input +string to be scanned again. Unless you've changed how the scanner will +subsequently process its input (using @code{BEGIN}, for example), this +will result in an endless loop. + +Note that @code{yyless()} is a macro and can only be used in the flex +input file, not from other source files. + +@cindex unput() +@cindex pushing back characters with unput +@code{unput(c)} puts the character @code{c} back onto the input stream. +It will be the next character scanned. The following action will take +the current token and cause it to be rescanned enclosed in parentheses. + +@cindex unput(), pushing back characters +@cindex pushing back characters with unput() +@example +@verbatim + { + int i; + /* Copy yytext because unput() trashes yytext */ + char *yycopy = strdup( yytext ); + unput( ')' ); + for ( i = yyleng - 1; i >= 0; --i ) + unput( yycopy[i] ); + unput( '(' ); + free( yycopy ); + } +@end verbatim +@end example + +Note that since each @code{unput()} puts the given character back at the +@emph{beginning} of the input stream, pushing back strings must be done +back-to-front. 
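The back-to-front requirement follows from the fact that each pushed-back character goes in front of everything pushed before it. The sketch below models that behaviour with a small stack in plain C. It is not the scanner's real buffer handling, and @code{struct pushback}, @code{PUSHBACK_MAX}, @code{pb_unput}, @code{pb_next} and @code{pb_unput_string} are all hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64          /* arbitrary capacity for this sketch */

struct pushback {
    char   buf[PUSHBACK_MAX];
    size_t len;                  /* number of pushed-back characters */
};

/* Model of unput(c): `c` becomes the next character to be read.
 * Returns 0 on success, -1 if the model's fixed buffer is full. */
static int pb_unput(struct pushback *pb, char c)
{
    if (pb->len == PUSHBACK_MAX)
        return -1;
    pb->buf[pb->len++] = c;
    return 0;
}

/* Model of reading: returns the most recently pushed-back character,
 * or -1 when nothing is left. */
static int pb_next(struct pushback *pb)
{
    return pb->len ? (unsigned char)pb->buf[--pb->len] : -1;
}

/* Push back a whole string so that it is read in its original order:
 * the last character goes first, the first character goes last. */
static int pb_unput_string(struct pushback *pb, const char *s)
{
    for (size_t i = strlen(s); i > 0; --i)
        if (pb_unput(pb, s[i - 1]) != 0)
            return -1;
    return 0;
}
```

After @code{pb_unput_string(&pb, "(ab)")}, successive calls to @code{pb_next()} yield @samp{(}, @samp{a}, @samp{b}, @samp{)} in that order. Pushing the characters front-to-back instead would deliver them reversed.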
+ +@cindex %pointer, and unput() +@cindex unput(), and %pointer +An important potential problem when using @code{unput()} is that if you +are using @code{%pointer} (the default), a call to @code{unput()} +@emph{destroys} the contents of @code{yytext}, starting with its +rightmost character and devouring one character to the left with each +call. If you need the value of @code{yytext} preserved after a call to +@code{unput()} (as in the above example), you must either first copy it +elsewhere, or build your scanner using @code{%array} instead +(@pxref{Matching}). + +@cindex pushing back EOF +@cindex EOF, pushing back +Finally, note that you cannot put back @samp{EOF} to attempt to mark the +input stream with an end-of-file. + +@cindex input() +@code{input()} reads the next character from the input stream. For +example, the following is one way to eat up C comments: + +@cindex comments, discarding +@cindex discarding C comments +@example +@verbatim + %% + "/*" { + register int c; + + for ( ; ; ) + { + while ( (c = input()) != '*' && + c != EOF ) + ; /* eat up text of comment */ + + if ( c == '*' ) + { + while ( (c = input()) == '*' ) + ; + if ( c == '/' ) + break; /* found the end */ + } + + if ( c == EOF ) + { + error( "EOF in comment" ); + break; + } + } + } +@end verbatim +@end example + +@cindex input(), and C++ +@cindex yyinput() +(Note that if the scanner is compiled using @code{C++}, then +@code{input()} is instead referred to as @b{yyinput()}, in order to +avoid a name clash with the @code{C++} stream by the name of +@code{input}.) + +@cindex flushing the internal buffer +@cindex YY_FLUSH_BUFFER() +@code{YY_FLUSH_BUFFER()} flushes the scanner's internal buffer so that +the next time the scanner attempts to match a token, it will first +refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}). 
+This action is a special case of the more general +@code{yy_flush_buffer()} function, described below (@pxref{Multiple +Input Buffers}). + +@cindex yyterminate() +@cindex terminating with yyterminate() +@cindex exiting with yyterminate() +@cindex halting with yyterminate() +@code{yyterminate()} can be used in lieu of a return statement in an +action. It terminates the scanner and returns a 0 to the scanner's +caller, indicating ``all done''. By default, @code{yyterminate()} is +also called when an end-of-file is encountered. It is a macro and may +be redefined. + +@node Generated Scanner, Start Conditions, Actions, Top +@chapter The Generated Scanner + +@cindex yylex(), in generated scanner +The output of @code{flex} is the file @file{lex.yy.c}, which contains +the scanning routine @code{yylex()}, a number of tables used by it for +matching tokens, and a number of auxiliary routines and macros. By +default, @code{yylex()} is declared as follows: + +@example +@verbatim + int yylex() + { + ... various definitions and the actions in here ... + } +@end verbatim +@end example + +@cindex yylex(), overriding +(If your environment supports function prototypes, then it will be +@code{int yylex( void )}.) This definition may be changed by defining +the @code{YY_DECL} macro. For example, you could use: + +@cindex yylex, overriding the prototype of +@example +@verbatim + #define YY_DECL float lexscan( a, b ) float a, b; +@end verbatim +@end example + +to give the scanning routine the name @code{lexscan}, returning a float, +and taking two floats as arguments. Note that if you give arguments to +the scanning routine using a K&R-style/non-prototyped function +declaration, you must terminate the definition with a semi-colon (;). + +@code{flex} generates @samp{C99} function definitions by +default. However, flex does have the ability to generate obsolete, er, +@samp{traditional}, function definitions. This is to support +bootstrapping gcc on old systems.
Unfortunately, traditional +definitions prevent us from using any standard data types smaller than +int (such as short, char, or bool) as function arguments. For this +reason, future versions of @code{flex} may generate standard C99 code +only, leaving K&R-style functions to the historians. Currently, if you +do @strong{not} want @samp{C99} definitions, then you must use +@code{%option noansi-definitions}. + +@cindex stdin, default for yyin +@cindex yyin +Whenever @code{yylex()} is called, it scans tokens from the global input +file @file{yyin} (which defaults to stdin). It continues until it +either reaches an end-of-file (at which point it returns the value 0) or +one of its actions executes a @code{return} statement. + +@cindex EOF and yyrestart() +@cindex end-of-file, and yyrestart() +@cindex yyrestart() +If the scanner reaches an end-of-file, subsequent calls are undefined +unless either @file{yyin} is pointed at a new input file (in which case +scanning continues from that file), or @code{yyrestart()} is called. +@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which +can be NULL, if you've set up @code{YY_INPUT} to scan from a source other +than @code{yyin}), and initializes @file{yyin} for scanning from that +file. Essentially there is no difference between just assigning +@file{yyin} to a new input file or using @code{yyrestart()} to do so; +the latter is available for compatibility with previous versions of +@code{flex}, and because it can be used to switch input files in the +middle of scanning. It can also be used to throw away the current input +buffer, by calling it with an argument of @file{yyin}; but it would be +better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that +@code{yyrestart()} does @emph{not} reset the start condition to +@code{INITIAL} (@pxref{Start Conditions}). 
+ +@cindex RETURN, within actions +If @code{yylex()} stops scanning due to executing a @code{return} +statement in one of the actions, the scanner may then be called again +and it will resume scanning where it left off. + +@cindex YY_INPUT +By default (and for purposes of efficiency), the scanner uses +block-reads rather than simple @code{getc()} calls to read characters +from @file{yyin}. The nature of how it gets its input can be controlled +by defining the @code{YY_INPUT} macro. The calling sequence for +@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action +is to place up to @code{max_size} characters in the character array +@code{buf} and return in the integer variable @code{result} either the +number of characters read or the constant @code{YY_NULL} (0 on Unix +systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from +the global file-pointer @file{yyin}. + +@cindex YY_INPUT, overriding +Here is a sample definition of @code{YY_INPUT} (in the definitions +section of the input file): + +@example +@verbatim + %{ + #define YY_INPUT(buf,result,max_size) \ + { \ + int c = getchar(); \ + result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ + } + %} +@end verbatim +@end example + +This definition will change the input processing to occur one character +at a time. + +@cindex yywrap() +When the scanner receives an end-of-file indication from YY_INPUT, it +then checks the @code{yywrap()} function. If @code{yywrap()} returns +false (zero), then it is assumed that the function has gone ahead and +set up @file{yyin} to point to another input file, and scanning +continues. If it returns true (non-zero), then the scanner terminates, +returning 0 to its caller. Note that in either case, the start +condition remains unchanged; it does @emph{not} revert to +@code{INITIAL}. 
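The @code{YY_INPUT(buf,result,max_size)} contract described above (store at most @code{max_size} characters, report the count, and report @code{YY_NULL} at end of input) can be exercised in isolation. The following plain C sketch reads from a string instead of a file. @code{struct string_source} and @code{string_input} are hypothetical names, and @code{YY_NULL} is defined here only so the fragment compiles on its own, using the value 0 that the text gives for Unix systems.

```c
#include <stddef.h>
#include <string.h>

#ifndef YY_NULL
#define YY_NULL 0    /* assumption for this standalone sketch only */
#endif

/* Hypothetical in-memory input source. */
struct string_source {
    const char *text;   /* NUL-terminated text to hand to the scanner */
    size_t      pos;    /* how much of it has been consumed so far */
};

/* Copy up to `max_size` characters into `buf` and return how many were
 * copied, or YY_NULL once the source is exhausted.
 *
 * In a flex input file this could be wired in along the lines of
 *     #define YY_INPUT(buf,result,max_size) \
 *         { result = string_input( &my_source, buf, max_size ); }
 * where `my_source` is a hypothetical variable of this struct type. */
static int string_input(struct string_source *src, char *buf,
                        size_t max_size)
{
    size_t left = strlen(src->text) - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->text + src->pos, n);
    src->pos += n;
    return n > 0 ? (int)n : YY_NULL;
}
```

This is meant only to illustrate the calling convention. For actually scanning strings, the facilities referred to under @ref{Scanning Strings} are the better route.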
+ +@cindex yywrap, default for +@cindex noyywrap, %option +@cindex %option noyywrap +If you do not supply your own version of @code{yywrap()}, then you must +either use @code{%option noyywrap} (in which case the scanner behaves as +though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to +obtain the default version of the routine, which always returns 1. + +For scanning from in-memory buffers (e.g., scanning strings), see +@ref{Scanning Strings}. @xref{Multiple Input Buffers}. + +@cindex ECHO, and yyout +@cindex yyout +@cindex stdout, as default for yyout +The scanner writes its @code{ECHO} output to the @file{yyout} global +(by default, @file{stdout}), which may be redefined by the user simply by +assigning it to some other @code{FILE} pointer. + +@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top +@chapter Start Conditions + +@cindex start conditions +@code{flex} provides a mechanism for conditionally activating rules. +Any rule whose pattern is prefixed with @samp{<sc>} will only be active +when the scanner is in the @dfn{start condition} named @code{sc}. For +example, + +@c proofread edit stopped here +@example +@verbatim + <STRING>[^"]* { /* eat up the string body ... */ + ... + } +@end verbatim +@end example + +will be active only when the scanner is in the @code{STRING} start +condition, and + +@cindex start conditions, multiple +@example +@verbatim + <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */ + ... + } +@end verbatim +@end example + +will be active only when the current start condition is either +@code{INITIAL}, @code{STRING}, or @code{QUOTE}. + +@cindex start conditions, inclusive vs.@: exclusive +Start conditions are declared in the definitions (first) section of the +input using unindented lines beginning with either @samp{%s} or +@samp{%x} followed by a list of names. The former declares +@dfn{inclusive} start conditions, the latter @dfn{exclusive} start +conditions.
A start condition is activated using the @code{BEGIN} +action. Until the next @code{BEGIN} action is executed, rules with the +given start condition will be active and rules with other start +conditions will be inactive. If the start condition is inclusive, then +rules with no start conditions at all will also be active. If it is +exclusive, then @emph{only} rules qualified with the start condition +will be active. A set of rules contingent on the same exclusive start +condition describes a scanner which is independent of any of the other +rules in the @code{flex} input. Because of this, exclusive start +conditions make it easy to specify ``mini-scanners'' which scan portions +of the input that are syntactically different from the rest (e.g., +comments). + +If the distinction between inclusive and exclusive start conditions +is still a little vague, here's a simple example illustrating the +connection between the two. The set of rules: + +@cindex start conditions, inclusive +@example +@verbatim + %s example + %% + + <example>foo do_something(); + + bar something_else(); +@end verbatim +@end example + +is equivalent to + +@cindex start conditions, exclusive +@example +@verbatim + %x example + %% + + <example>foo do_something(); + + <INITIAL,example>bar something_else(); +@end verbatim +@end example + +Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in +the second example wouldn't be active (i.e., couldn't match) when in +start condition @code{example}. If we just used @code{<example>} to +qualify @code{bar}, though, then it would only be active in +@code{example} and not in @code{INITIAL}, while in the first example +it's active in both, because in the first example the @code{example} +start condition is an inclusive (@code{%s}) start condition. + +@cindex start conditions, special wildcard condition +Also note that the special start-condition specifier +@code{<*>} +matches every start condition.
Thus, the above example could also +have been written: + +@cindex start conditions, use of wildcard condition (<*>) +@example +@verbatim + %x example + %% + + <example>foo do_something(); + + <*>bar something_else(); +@end verbatim +@end example + +The default rule (to @code{ECHO} any unmatched character) remains active +in start conditions. It is equivalent to: + +@cindex start conditions, behavior of default rule +@example +@verbatim + <*>.|\n ECHO; +@end verbatim +@end example + +@cindex BEGIN, explanation +@findex BEGIN +@vindex INITIAL +@code{BEGIN(0)} returns to the original state where only the rules with +no start conditions are active. This state can also be referred to as +the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is +equivalent to @code{BEGIN(0)}. (The parentheses around the start +condition name are not required but are considered good style.) + +@code{BEGIN} actions can also be given as indented code at the beginning +of the rules section. For example, the following will cause the scanner +to enter the @code{SPECIAL} start condition whenever @code{yylex()} is +called and the global variable @code{enter_special} is true: + +@cindex start conditions, using BEGIN +@example +@verbatim + int enter_special; + + %x SPECIAL + %% + if ( enter_special ) + BEGIN(SPECIAL); + + <SPECIAL>blahblahblah + ...more rules follow... +@end verbatim +@end example + +To illustrate the uses of start conditions, here is a scanner which +provides two different interpretations of a string like @samp{123.456}. +By default it will treat it as three tokens, the integer @samp{123}, a +dot (@samp{.}), and the integer @samp{456}. 
But if the string is +preceded earlier in the line by the string @samp{expect-floats} it will +treat it as a single token, the floating-point number @samp{123.456}: + +@cindex start conditions, for different interpretations of same input +@example +@verbatim + %{ + #include <stdlib.h> /* atof(), atoi() */ + %} + %s expect + + %% + expect-floats BEGIN(expect); + + <expect>[0-9]+"."[0-9]+ { + printf( "found a float, = %f\n", + atof( yytext ) ); + } + <expect>\n { + /* that's the end of the line, so + * we need another "expect-floats" + * before we'll recognize any more + * floats + */ + BEGIN(INITIAL); + } + + [0-9]+ { + printf( "found an integer, = %d\n", + atoi( yytext ) ); + } + + "." printf( "found a dot\n" ); +@end verbatim +@end example + +@cindex comments, example of scanning C comments +Here is a scanner which recognizes (and discards) C comments while +maintaining a count of the current input line. + +@cindex recognizing C comments +@example +@verbatim + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); +@end verbatim +@end example + +This scanner goes to a bit of trouble to match as much +text as possible with each rule. In general, when attempting to write +a high-speed scanner, try to match as much as possible in each rule, as +it's a big win. + +Note that start condition names are really integer values and +can be stored as such. Thus, the above could be extended in the +following fashion: + +@cindex start conditions, integer values +@cindex using integer values of start condition names +@example +@verbatim + %x comment foo + %% + int line_num = 1; + int comment_caller; + + "/*" { + comment_caller = INITIAL; + BEGIN(comment); + } + + ...
+ + <foo>"/*" { + comment_caller = foo; + BEGIN(comment); + } + + <comment>[^*\n]* /* eat anything that's not a '*' */ + <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(comment_caller); +@end verbatim +@end example + +@cindex YY_START, example +Furthermore, you can access the current start condition using the +integer-valued @code{YY_START} macro. For example, the above +assignments to @code{comment_caller} could instead be written + +@cindex getting current start state with YY_START +@example +@verbatim + comment_caller = YY_START; +@end verbatim +@end example + +@vindex YY_START +Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that +is what's used by AT&T @code{lex}). + +For historical reasons, start conditions do not have their own +name-space within the generated scanner. The start condition names are +unmodified in the generated scanner and generated header. +@xref{option-header}. @xref{option-prefix}. 
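The activation rules for inclusive and exclusive start conditions described earlier in this chapter can be summarised as one small predicate. The plain C sketch below is a model of those rules only, not code taken from a generated scanner. @code{struct rule} and @code{rule_active} are hypothetical names, and @code{INITIAL} is defined locally with the value 0 that @code{BEGIN(0)} implies.

```c
#include <stdbool.h>
#include <stddef.h>

enum { INITIAL = 0 };   /* local definition for this sketch */

/* Hypothetical description of one rule's <...> qualifier. */
struct rule {
    const int *conds;   /* start conditions named in the qualifier */
    size_t     nconds;  /* 0 means the rule has no qualifier */
    bool       any;     /* true for the <*> qualifier */
};

/* Is `r` active while the scanner is in start condition `current`?
 * `current_is_exclusive` says whether `current` was declared with %x;
 * pass false for INITIAL and for conditions declared with %s. */
static bool rule_active(const struct rule *r, int current,
                        bool current_is_exclusive)
{
    if (r->any)
        return true;                     /* <*> matches everywhere */
    if (r->nconds == 0)
        return !current_is_exclusive;    /* unqualified rule */
    for (size_t i = 0; i < r->nconds; ++i)
        if (r->conds[i] == current)
            return true;                 /* named in the qualifier */
    return false;
}
```

Applied to the @code{example} condition used earlier: an unqualified @code{bar} rule is active inside @code{example} when it is declared with @samp{%s} but not when it is declared with @samp{%x}, while a rule qualified with @code{<INITIAL,example>} or @code{<*>} is active in both cases.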
+ + + +Finally, here's an example of how to match C-style quoted strings using +exclusive start conditions, including expanded escape sequences (but +not including checking for a string that's too long): + +@cindex matching C-style double-quoted strings +@example +@verbatim + %x str + + %% + char string_buf[MAX_STR_CONST]; + char *string_buf_ptr; + + + \" string_buf_ptr = string_buf; BEGIN(str); + + <str>\" { /* saw closing quote - all done */ + BEGIN(INITIAL); + *string_buf_ptr = '\0'; + /* return string constant token type and + * value to parser + */ + } + + <str>\n { + /* error - unterminated string constant */ + /* generate error message */ + } + + <str>\\[0-7]{1,3} { + /* octal escape sequence */ + unsigned int result; + + (void) sscanf( yytext + 1, "%o", &result ); + + if ( result > 0xff ) + { + /* error, constant is out-of-bounds */ + } + else + *string_buf_ptr++ = result; + } + + <str>\\[0-9]+ { + /* generate error - bad escape sequence; something + * like '\48' or '\0777777' + */ + } + + <str>\\n *string_buf_ptr++ = '\n'; + <str>\\t *string_buf_ptr++ = '\t'; + <str>\\r *string_buf_ptr++ = '\r'; + <str>\\b *string_buf_ptr++ = '\b'; + <str>\\f *string_buf_ptr++ = '\f'; + + <str>\\(.|\n) *string_buf_ptr++ = yytext[1]; + + <str>[^\\\n\"]+ { + char *yptr = yytext; + + while ( *yptr ) + *string_buf_ptr++ = *yptr++; + } +@end verbatim +@end example + +@cindex start condition, applying to multiple patterns +Often, such as in some of the examples above, you wind up writing a +whole bunch of rules all preceded by the same start condition(s). Flex +makes this a little easier and cleaner by introducing a notion of start +condition @dfn{scope}. A start condition scope is begun with: + +@example +@verbatim + <SCs>{ +@end verbatim +@end example + +where @code{SCs} is a list of one or more start conditions. Inside the +start condition scope, every rule automatically has the prefix +@code{<SCs>} applied to it, until a @samp{@}} which matches the initial +@samp{@{}.
So, for example, + +@cindex extended scope of start conditions +@example +@verbatim + <ESC>{ + "\\n" return '\n'; + "\\r" return '\r'; + "\\f" return '\f'; + "\\0" return '\0'; + } +@end verbatim +@end example + +is equivalent to: + +@example +@verbatim + <ESC>"\\n" return '\n'; + <ESC>"\\r" return '\r'; + <ESC>"\\f" return '\f'; + <ESC>"\\0" return '\0'; +@end verbatim +@end example + +Start condition scopes may be nested. + +@cindex stacks, routines for manipulating +@cindex start conditions, use of a stack + +The following routines are available for manipulating stacks of start conditions: + +@deftypefun void yy_push_state ( int @code{new_state} ) +pushes the current start condition onto the top of the start condition +stack and switches to +@code{new_state} +as though you had used +@code{BEGIN new_state} +(recall that start condition names are also integers). +@end deftypefun + +@deftypefun void yy_pop_state () +pops the top of the stack and switches to it via +@code{BEGIN}. +@end deftypefun + +@deftypefun int yy_top_state () +returns the top of the stack without altering the stack's contents. +@end deftypefun + +@cindex memory, for start condition stacks +The start condition stack grows dynamically and so has no built-in size +limitation. If memory is exhausted, program execution aborts. + +To use start condition stacks, your scanner must include a @code{%option +stack} directive (@pxref{Scanner Options}). + +@node Multiple Input Buffers, EOF, Start Conditions, Top +@chapter Multiple Input Buffers + +@cindex multiple input streams +Some scanners (such as those which support ``include'' files) require +reading from several input streams. As @code{flex} scanners do a large +amount of buffering, one cannot control where the next input will be +read from by simply writing a @code{YY_INPUT()} which is sensitive to +the scanning context. 
@code{YY_INPUT()} is only called when the scanner +reaches the end of its buffer, which may be a long time after scanning a +statement such as an @code{include} statement which requires switching +the input source. + +To negotiate these sorts of problems, @code{flex} provides a mechanism +for creating and switching between multiple input buffers. An input +buffer is created by using: + +@cindex memory, allocating input buffers +@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size ) +@end deftypefun + +which takes a @code{FILE} pointer and a size and creates a buffer +associated with the given file and large enough to hold @code{size} +characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It +returns a @code{YY_BUFFER_STATE} handle, which may then be passed to +other routines (see below). +@tindex YY_BUFFER_STATE +The @code{YY_BUFFER_STATE} type is a +pointer to an opaque @code{struct yy_buffer_state} structure, so you may +safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE) +0)} if you wish, and also refer to the opaque structure in order to +correctly declare input buffers in source files other than that of your +scanner. Note that the @code{FILE} pointer in the call to +@code{yy_create_buffer} is only used as the value of @file{yyin} seen by +@code{YY_INPUT}. If you redefine @code{YY_INPUT()} so it no longer uses +@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to +@code{yy_create_buffer}. You select a particular buffer to scan from +using: + +@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer ) +@end deftypefun + +The above function switches the scanner's input buffer so subsequent tokens +will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may +be used by @code{yywrap()} to set things up for continued scanning, instead of +opening a new file and pointing @file{yyin} at it. 
If you are looking for a +stack of input buffers, then you want to use @code{yypush_buffer_state()} +instead of this function. Note also that switching input sources via either +@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the +start condition. + +@cindex memory, deleting input buffers +@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer ) +@end deftypefun + +is used to reclaim the storage associated with a buffer. (@code{buffer} +can be NULL, in which case the routine does nothing.) + +@cindex pushing an input buffer +@cindex stack, input buffer push +@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer ) +@end deftypefun + +This function pushes the new buffer state onto an internal stack. The pushed +state becomes the new current state. The stack is maintained by flex and will +grow as required. This function is intended to be used instead of +@code{yy_switch_to_buffer}, when you want to change states, but preserve the +current state for later use. + +@cindex popping an input buffer +@cindex stack, input buffer pop +@deftypefun void yypop_buffer_state ( ) +@end deftypefun + +This function removes the current state from the top of the stack, and deletes +it by calling @code{yy_delete_buffer}. The next state on the stack, if any, +becomes the new current state. + +You can clear the current contents of a buffer using: + +@cindex clearing an input buffer +@cindex flushing an input buffer +@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer ) +@end deftypefun + +This function discards the buffer's contents, +so the next time the scanner attempts to match a token from the +buffer, it will first fill the buffer anew using +@code{YY_INPUT()}. + +@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size ) +@end deftypefun + +is an alias for @code{yy_create_buffer()}, +provided for compatibility with the C++ use of @code{new} and +@code{delete} for creating and destroying dynamic objects.
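The push and pop semantics described above can be pictured with a small stand-alone model. The following is only a sketch in plain C, not flex's implementation: the names @code{model_buffer}, @code{model_stack}, @code{model_push}, @code{model_pop} and @code{model_current} are hypothetical, the stack has a fixed capacity (flex's grows as required), and "deleting" a buffer merely sets a flag so that the effect can be observed.

```c
#include <stdlib.h>

/* Hypothetical stand-in for flex's opaque buffer type (an assumption for
 * illustration; this is not struct yy_buffer_state). */
struct model_buffer {
    int id;       /* identifies the buffer in this sketch */
    int deleted;  /* set to 1 where flex would call yy_delete_buffer() */
};

#define MODEL_STACK_MAX 16  /* fixed capacity; flex's own stack grows */

struct model_stack {
    struct model_buffer *slots[MODEL_STACK_MAX];
    int depth;  /* number of buffers currently on the stack */
};

/* Plays the role of YY_CURRENT_BUFFER: the top of the stack,
 * or NULL when no buffer is left. */
static struct model_buffer *model_current(const struct model_stack *s)
{
    return s->depth > 0 ? s->slots[s->depth - 1] : NULL;
}

/* Plays the role of yypush_buffer_state(): the pushed buffer becomes the
 * current one and the previous one is kept for later.
 * Returns 0 on success, -1 if the argument is NULL or the model is full. */
static int model_push(struct model_stack *s, struct model_buffer *b)
{
    if (b == NULL || s->depth >= MODEL_STACK_MAX)
        return -1;
    s->slots[s->depth++] = b;
    return 0;
}

/* Plays the role of yypop_buffer_state(): the current buffer is removed
 * and "deleted"; the next one on the stack, if any, becomes current. */
static void model_pop(struct model_stack *s)
{
    if (s->depth == 0)
        return;
    s->slots[--s->depth]->deleted = 1;
}
```

In this model, popping the last buffer leaves @code{model_current()} returning NULL, which mirrors the case an @code{<<EOF>>} action can detect by testing @code{YY_CURRENT_BUFFER} after calling @code{yypop_buffer_state()}.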
+ +@cindex YY_CURRENT_BUFFER, and multiple buffers +Finally, the macro +@code{YY_CURRENT_BUFFER} returns a @code{YY_BUFFER_STATE} handle to the +current buffer. It should not be used as an lvalue. + +@cindex EOF, example using multiple input buffers +Here are two examples of using these features for writing a scanner +which expands include files (the +@code{<<EOF>>} +feature is discussed below). + +This first example uses @code{yypush_buffer_state} and @code{yypop_buffer_state}. Flex +maintains the stack internally. + +@cindex handling include files with multiple input buffers +@example +@verbatim + /* the "incl" state is used for picking up the name + * of an include file + */ + %x incl + %% + include BEGIN(incl); + + [a-z]+ ECHO; + [^a-z\n]*\n? ECHO; + + <incl>[ \t]* /* eat the whitespace */ + <incl>[^ \t\n]+ { /* got the include file name */ + yyin = fopen( yytext, "r" ); + + if ( ! yyin ) + error( ... ); + + yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE )); + + BEGIN(INITIAL); + } + + <<EOF>> { + yypop_buffer_state(); + + if ( !YY_CURRENT_BUFFER ) + { + yyterminate(); + } + } +@end verbatim +@end example + +The second example, below, does the same thing as the first, but +manages its own input buffer stack manually (instead of letting flex do it). + +@cindex handling include files with multiple input buffers +@example +@verbatim + /* the "incl" state is used for picking up the name + * of an include file + */ + %x incl + + %{ + #define MAX_INCLUDE_DEPTH 10 + YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; + int include_stack_ptr = 0; + %} + + %% + include BEGIN(incl); + + [a-z]+ ECHO; + [^a-z\n]*\n? ECHO; + + <incl>[ \t]* /* eat the whitespace */ + <incl>[^ \t\n]+ { /* got the include file name */ + if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) + { + fprintf( stderr, "Includes nested too deeply" ); + exit( 1 ); + } + + include_stack[include_stack_ptr++] = + YY_CURRENT_BUFFER; + + yyin = fopen( yytext, "r" ); + + if ( ! yyin ) + error( ...
); + + yy_switch_to_buffer( + yy_create_buffer( yyin, YY_BUF_SIZE ) ); + + BEGIN(INITIAL); + } + + <<EOF>> { + if ( --include_stack_ptr < 0 ) + { + yyterminate(); + } + + else + { + yy_delete_buffer( YY_CURRENT_BUFFER ); + yy_switch_to_buffer( + include_stack[include_stack_ptr] ); + } + } +@end verbatim +@end example + +@anchor{Scanning Strings} +@cindex strings, scanning strings instead of files +The following routines are available for setting up input buffers for +scanning in-memory strings instead of files. All of them create a new +input buffer for scanning the string, and return a corresponding +@code{YY_BUFFER_STATE} handle (which you should delete with +@code{yy_delete_buffer()} when done with it). They also switch to the +new buffer using @code{yy_switch_to_buffer()}, so the next call to +@code{yylex()} will start scanning the string. + +@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str ) +scans a NUL-terminated string. +@end deftypefun + +@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len ) +scans @code{len} bytes (including possibly @code{NUL}s) starting at location +@code{bytes}. +@end deftypefun + +Note that both of these functions create and scan a @emph{copy} of the +string or bytes. (This may be desirable, since @code{yylex()} modifies +the contents of the buffer it is scanning.) You can avoid the copy by +using: + +@vindex YY_END_OF_BUFFER_CHAR +@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size) +which scans in place the buffer starting at @code{base}, consisting of +@code{size} bytes, the last two bytes of which @emph{must} be +@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not +scanned; thus, scanning consists of @code{base[0]} through +@code{base[size-3]}, inclusive.
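As a minimal sketch of preparing such a buffer, the plain C helper below copies some text and appends the two required NUL bytes. The name @code{make_scan_buffer} is hypothetical and not part of flex; the sketch assumes, as stated above, that @code{YY_END_OF_BUFFER_CHAR} is ASCII NUL, and it does not guard against @code{len + 2} overflowing.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not provided by flex): copy len bytes of text into
 * a freshly allocated buffer and append the two NUL end-of-buffer bytes.
 * On success, returns the buffer and stores its total size (len + 2) in
 * *size_out; returns NULL if the allocation fails.
 */
static char *make_scan_buffer(const char *text, size_t len, size_t *size_out)
{
    char *base = malloc(len + 2);

    if (base == NULL)
        return NULL;

    memcpy(base, text, len);
    base[len] = '\0';      /* first end-of-buffer byte */
    base[len + 1] = '\0';  /* second end-of-buffer byte */
    *size_out = len + 2;
    return base;
}
```

Inside a scanner, the result would then be handed to @code{yy_scan_buffer(base, size)}, with @code{size} cast to @code{yy_size_t}. Because the scanner works on the caller's memory in place, the caller presumably remains responsible for releasing @code{base} once the buffer handle has been deleted.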
+@end deftypefun + +If you fail to set up @code{base} in this manner (i.e., forget the final +two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()} +returns a NULL pointer instead of creating a new input buffer. + +@deftp {Data type} yy_size_t +is an integral type to which you can cast an integer expression +reflecting the size of the buffer. +@end deftp + +@node EOF, Misc Macros, Multiple Input Buffers, Top +@chapter End-of-File Rules + +@cindex EOF, explanation +The special rule @code{<<EOF>>} indicates +actions which are to be taken when an end-of-file is +encountered and @code{yywrap()} returns non-zero (i.e., indicates +no further files to process). The action must finish +by doing one of the following things: + +@itemize +@item +@findex YY_NEW_FILE (now obsolete) +assigning @file{yyin} to a new input file (in previous versions of +@code{flex}, after doing the assignment you had to call the special +action @code{YY_NEW_FILE}. This is no longer necessary.) + +@item +executing a @code{return} statement; + +@item +executing the special @code{yyterminate()} action. + +@item +or, switching to a new buffer using @code{yy_switch_to_buffer()} as +shown in the example above. +@end itemize + +<<EOF>> rules may not be used with other patterns; they may only be +qualified with a list of start conditions. If an unqualified <<EOF>> +rule is given, it applies to @emph{all} start conditions which do not +already have <<EOF>> actions. To specify an <<EOF>> rule for only the +initial start condition, use: + +@example +@verbatim + <INITIAL><<EOF>> +@end verbatim +@end example + +These rules are useful for catching things like unclosed comments. An +example: + +@cindex <<EOF>>, use of +@example +@verbatim + %x quote + %% + + ...other rules for dealing with quotes... 
+ + <quote><<EOF>> { + error( "unterminated quote" ); + yyterminate(); + } + <<EOF>> { + if ( *++filelist ) + yyin = fopen( *filelist, "r" ); + else + yyterminate(); + } +@end verbatim +@end example + +@node Misc Macros, User Values, EOF, Top +@chapter Miscellaneous Macros + +@hkindex YY_USER_ACTION +The macro @code{YY_USER_ACTION} can be defined to provide an action +which is always executed prior to the matched rule's action. For +example, it could be @code{#define}'d to call a routine to convert @code{yytext} to +lower-case. When @code{YY_USER_ACTION} is invoked, the variable +@code{yy_act} gives the number of the matched rule (rules are numbered +starting with 1). Suppose you want to profile how often each of your +rules is matched. The following would do the trick: + +@cindex YY_USER_ACTION to track each time a rule is matched +@example +@verbatim + #define YY_USER_ACTION ++ctr[yy_act] +@end verbatim +@end example + +@vindex YY_NUM_RULES +where @code{ctr} is an array to hold the counts for the different rules. +Note that the macro @code{YY_NUM_RULES} gives the total number of rules +(including the default rule, even if you use @samp{-s}), so a correct +declaration for @code{ctr} is: + +@example +@verbatim + int ctr[YY_NUM_RULES]; +@end verbatim +@end example + +@hkindex YY_USER_INIT +The macro @code{YY_USER_INIT} may be defined to provide an action which +is always executed before the first scan (and before the scanner's +internal initializations are done). For example, it could be used to +call a routine to read in a data table or open a logging file. + +@findex yy_set_interactive +The macro @code{yy_set_interactive(is_interactive)} can be used to +control whether the current buffer is considered @dfn{interactive}. An +interactive buffer is processed more slowly, but must be used when the +scanner's input source is indeed interactive to avoid problems due to +waiting to fill buffers (see the discussion of the @samp{-I} flag in +@ref{Scanner Options}).
A non-zero value in the macro invocation marks +the buffer as interactive, a zero value as non-interactive. Note that +use of this macro overrides @code{%option always-interactive} or +@code{%option never-interactive} (@pxref{Scanner Options}). +@code{yy_set_interactive()} must be invoked prior to beginning to scan +the buffer that is (or is not) to be considered interactive. + +@cindex BOL, setting it +@findex yy_set_bol +The macro @code{yy_set_bol(at_bol)} can be used to control whether the +current buffer's scanning context for the next token match is done as +though at the beginning of a line. A non-zero macro argument makes +rules anchored with @samp{^} active, while a zero argument makes +@samp{^} rules inactive. + +@cindex BOL, checking the BOL flag +@findex YY_AT_BOL +The macro @code{YY_AT_BOL()} returns true if the next token scanned from +the current buffer will have @samp{^} rules active, false otherwise. + +@cindex actions, redefining YY_BREAK +@hkindex YY_BREAK +In the generated scanner, the actions are all gathered in one large +switch statement and separated using @code{YY_BREAK}, which may be +redefined. By default, it is simply a @code{break}, to separate each +rule's action from the following rule's. Redefining @code{YY_BREAK} +allows, for example, C++ users to @code{#define YY_BREAK} to do nothing (while +being very careful that every rule ends with a @code{break} or a +@code{return}!). This avoids unreachable-statement warnings, which +arise when a rule's action ends with @code{return} and the +@code{YY_BREAK} that follows it can never be reached. + +@node User Values, Yacc, Misc Macros, Top +@chapter Values Available To the User + +This chapter summarizes the various values available to the user in the +rule actions. + +@table @code +@vindex yytext +@item char *yytext +holds the text of the current token. It may be modified but not +lengthened (you cannot append characters to the end).
+ +@cindex yytext, default array size +@cindex array, default size for yytext +@vindex YYLMAX +If the special directive @code{%array} appears in the first section of +the scanner description, then @code{yytext} is instead declared +@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition +that you can redefine in the first section if you don't like the default +value (generally 8KB). Using @code{%array} results in somewhat slower +scanners, but the value of @code{yytext} becomes immune to calls to +@code{unput()}, which potentially destroy its value when @code{yytext} is +a character pointer. The opposite of @code{%array} is @code{%pointer}, +which is the default. + +@cindex C++ and %array +You cannot use @code{%array} when generating C++ scanner classes (the +@samp{-+} flag). + +@vindex yyleng +@item int yyleng +holds the length of the current token. + +@vindex yyin +@item FILE *yyin +is the file which by default @code{flex} reads from. It may be +reassigned, but doing so only makes sense before scanning begins or after +an EOF has been encountered. Changing it in the midst of scanning will +have unexpected results since @code{flex} buffers its input; use +@code{yyrestart()} instead. Once scanning terminates because an +end-of-file has been seen, you can point @file{yyin} at the new input +file and then call the scanner again to continue scanning. + +@findex yyrestart +@item void yyrestart( FILE *new_file ) +may be called to point @file{yyin} at the new input file. The +switch-over to the new file is immediate (any previously buffered-up +input is lost). Note that calling @code{yyrestart()} with @file{yyin} +as an argument thus throws away the current input buffer and continues +scanning the same input file. + +@vindex yyout +@item FILE *yyout +is the file to which @code{ECHO} actions are done. It can be reassigned +by the user. + +@vindex YY_CURRENT_BUFFER +@item YY_CURRENT_BUFFER +returns a @code{YY_BUFFER_STATE} handle to the current buffer.
+ +@vindex YY_START +@item YY_START +returns an integer value corresponding to the current start condition. +You can subsequently use this value with @code{BEGIN} to return to that +start condition. +@end table + +@node Yacc, Scanner Options, User Values, Top +@chapter Interfacing with Yacc + +@cindex yacc, interface + +@vindex yylval, with yacc +One of the main uses of @code{flex} is as a companion to the @code{yacc} +parser-generator. @code{yacc} parsers expect to call a routine named +@code{yylex()} to find the next input token. The routine is supposed to +return the type of the next token and to put any associated +value in the global @code{yylval}. To use @code{flex} with @code{yacc}, +one specifies the @samp{-d} option to @code{yacc} to instruct it to +generate the file @file{y.tab.h} containing definitions of all the +@code{%token}s appearing in the @code{yacc} input. This file is then +included in the @code{flex} scanner. For example, if one of the tokens +is @code{TOK_NUMBER}, part of the scanner might look like: + +@cindex yacc interface +@example +@verbatim + %{ + #include "y.tab.h" + %} + + %% + + [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; +@end verbatim +@end example + +@node Scanner Options, Performance, Yacc, Top +@chapter Scanner Options + +@cindex command-line options +@cindex options, command-line +@cindex arguments, command-line + +The various @code{flex} options are categorized by function in the following +menu. To look up a particular option by name, see @ref{Index of Scanner Options}.
+ +@menu +* Options for Specifying Filenames:: +* Options Affecting Scanner Behavior:: +* Code-Level And API Options:: +* Options for Scanner Speed and Size:: +* Debugging Options:: +* Miscellaneous Options:: +@end menu + +Even though there are many scanner options, a typical scanner might only +specify the following options: + +@example +@verbatim +%option 8bit reentrant bison-bridge +%option warn nodefault +%option yylineno +%option outfile="scanner.c" header-file="scanner.h" +@end verbatim +@end example + +The first line specifies the general type of scanner we want. The second line +specifies that we are being careful: flex issues warnings and does not generate the default rule. The third line asks flex to track line +numbers. The last line tells flex what to name the files. (The options can be +specified in any order. We just divided them.) + +@code{flex} also provides a mechanism for controlling options within the +scanner specification itself, rather than from the flex command-line. +This is done by including @code{%option} directives in the first section +of the scanner specification. You can specify multiple options with a +single @code{%option} directive, and multiple directives in the first +section of your flex input file. + +Most options are given simply as names, optionally preceded by the +word @samp{no} (with no intervening whitespace) to negate their meaning. +The names are the same as their long-option equivalents (but without the +leading @samp{--}). + +@code{flex} scans your rule actions to determine whether you use the +@code{REJECT} or @code{yymore()} features. The @code{reject} and +@code{yymore} options are available to override its decision as to +whether you use the features, either by setting them (e.g., @code{%option +reject}) to indicate the feature is indeed used, or unsetting them to +indicate it actually is not used (e.g., @code{%option noyymore}). + + +A number of options are available for lint purists who want to suppress +the appearance of unneeded routines in the generated scanner.
Each of +the following, if unset (e.g., @code{%option nounput}), results in the +corresponding routine not appearing in the generated scanner: + +@example +@verbatim + input, unput + yy_push_state, yy_pop_state, yy_top_state + yy_scan_buffer, yy_scan_bytes, yy_scan_string + + yyget_extra, yyset_extra, yyget_leng, yyget_text, + yyget_lineno, yyset_lineno, yyget_in, yyset_in, + yyget_out, yyset_out, yyget_lval, yyset_lval, + yyget_lloc, yyset_lloc, yyget_debug, yyset_debug +@end verbatim +@end example + +(though @code{yy_push_state()} and friends won't appear anyway unless +you use @code{%option stack}). + +@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options +@section Options for Specifying Filenames + +@table @samp + +@anchor{option-header} +@opindex ---header-file +@opindex header-file +@item --header-file=FILE, @code{%option header-file="FILE"} +instructs flex to write a C header to @file{FILE}. This file contains +function prototypes, extern variables, and types used by the scanner. +Only the external API is exported by the header file. Many macros that +are usable from within scanner actions are not exported to the header +file. This is due to namespace problems and the goal of a clean +external API. + +While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy} +is substituted with the appropriate prefix. + +The @samp{--header-file} option is not compatible with the @samp{--c++} option, +since the C++ scanner provides its own header in @file{yyFlexLexer.h}. + + + +@anchor{option-outfile} +@opindex -o +@opindex ---outfile +@opindex outfile +@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"} +directs flex to write the scanner to the file @file{FILE} instead of +@file{lex.yy.c}.
If you combine @samp{--outfile} with the @samp{--stdout} option, +then the scanner is written to @file{stdout} but its @code{#line} +directives (see the @samp{-L} option) refer to the file +@file{FILE}. + + + +@anchor{option-stdout} +@opindex -t +@opindex ---stdout +@opindex stdout +@item -t, --stdout, @code{%option stdout} +instructs @code{flex} to write the scanner it generates to standard +output instead of @file{lex.yy.c}. + + + +@opindex ---skel +@item -SFILE, --skel=FILE +overrides the default skeleton file from which +@code{flex} +constructs its scanners. You'll never need this option unless you are doing +@code{flex} +maintenance or development. + +@opindex ---tables-file +@opindex tables-file +@item --tables-file=FILE +writes serialized scanner DFA tables to @file{FILE}. The generated scanner will not +contain the tables, and requires them to be loaded at runtime. +@xref{serialization}. + +@opindex ---tables-verify +@opindex tables-verify +@item --tables-verify +This option is for flex development. We document it here in case you stumble +upon it by accident or in case you suspect some inconsistency in the serialized +tables. Flex will serialize the scanner DFA tables but will also generate the +in-code tables as it normally does. At runtime, the scanner will verify that +the serialized tables match the in-code tables, instead of loading them. + +@end table + +@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options +@section Options Affecting Scanner Behavior + +@table @samp +@anchor{option-case-insensitive} +@opindex -i +@opindex ---case-insensitive +@opindex case-insensitive +@item -i, --case-insensitive, @code{%option case-insensitive} +instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The +case of letters given in the @code{flex} input patterns will be ignored, +and tokens in the input will be matched regardless of case.
The matched +text given in @code{yytext} will have the preserved case (i.e., it will +not be folded). For tricky behavior, see @ref{case and character ranges}. + + + +@anchor{option-lex-compat} +@opindex -l +@opindex ---lex-compat +@opindex lex-compat +@item -l, --lex-compat, @code{%option lex-compat} +turns on maximum compatibility with the original AT&T @code{lex} +implementation. Note that this does not mean @emph{full} compatibility. +Use of this option costs a considerable amount of performance, and it +cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or +@samp{-CF} options. For details on the compatibilities it provides, see +@ref{Lex and Posix}. This option also results in the name +@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner. + + + +@anchor{option-batch} +@opindex -B +@opindex ---batch +@opindex batch +@item -B, --batch, @code{%option batch} +instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of +@emph{interactive} scanners generated by @samp{--interactive} (see below). In +general, you use @samp{-B} when you are @emph{certain} that your scanner +will never be used interactively, and you want to squeeze a +@emph{little} more performance out of it. If your goal is instead to +squeeze out a @emph{lot} more performance, you should be using the +@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically +anyway. + + + +@anchor{option-interactive} +@opindex -I +@opindex ---interactive +@opindex interactive +@item -I, --interactive, @code{%option interactive} +instructs @code{flex} to generate an @i{interactive} scanner. An +interactive scanner is one that only looks ahead to decide what token +has been matched if it absolutely must. It turns out that always +looking one extra character ahead, even if the scanner has already seen +enough text to disambiguate the current token, is a bit faster than only +looking ahead when necessary. 
But scanners that always look ahead give +dreadful interactive performance; for example, when a user types a +newline, it is not recognized as a newline token until they enter +@emph{another} token, which often means typing in another whole line. + +@code{flex} scanners default to @code{interactive} unless you use the +@samp{-Cf} or @samp{-CF} table-compression options +(@pxref{Performance}). That's because if you're looking for +high-performance you should be using one of these options, so if you +didn't, @code{flex} assumes you'd rather trade off a bit of run-time +performance for intuitive interactive behavior. Note also that you +@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or +@samp{-CF}. Thus, this option is not really needed; it is on by default +for all those cases in which it is allowed. + +You can force a scanner to +@emph{not} +be interactive by using +@samp{--batch} + + + +@anchor{option-7bit} +@opindex -7 +@opindex ---7bit +@opindex 7bit +@item -7, --7bit, @code{%option 7bit} +instructs @code{flex} to generate a 7-bit scanner, i.e., one which can +only recognize 7-bit characters in its input. The advantage of using +@samp{--7bit} is that the scanner's tables can be up to half the size of +those generated using the @samp{--8bit}. The disadvantage is that such +scanners often hang or crash if their input contains an 8-bit character. + +Note, however, that unless you generate your scanner using the +@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit} +will save only a small amount of table space, and make your scanner +considerably less portable. @code{Flex}'s default behavior is to +generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF}, +in which case @code{flex} defaults to generating 7-bit scanners unless +your site was always configured to generate 8-bit scanners (as will +often be the case with non-USA sites). 
You can tell whether flex +generated a 7-bit or an 8-bit scanner by inspecting the flag summary in +the @samp{--verbose} output. + +Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still +defaults to generating an 8-bit scanner, since usually with these +compression options full 8-bit tables are not much more expensive than +7-bit tables. + + + +@anchor{option-8bit} +@opindex -8 +@opindex ---8bit +@opindex 8bit +@item -8, --8bit, @code{%option 8bit} +instructs @code{flex} to generate an 8-bit scanner, i.e., one which can +recognize 8-bit characters. This flag is only needed for scanners +generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to +generating an 8-bit scanner anyway. + +See the discussion of +@samp{--7bit} +above for @code{flex}'s default behavior and the tradeoffs between 7-bit +and 8-bit scanners. + + + +@anchor{option-default} +@opindex ---default +@opindex default +@item --default, @code{%option default} +generates the default rule. + + + +@anchor{option-always-interactive} +@opindex ---always-interactive +@opindex always-interactive +@item --always-interactive, @code{%option always-interactive} +instructs flex to generate a scanner which always considers its input +@emph{interactive}. Normally, on each new input file the scanner calls +@code{isatty()} in an attempt to determine whether the scanner's input +source is interactive and thus should be read a character at a time. +When this option is used, however, no such call is made. + + + +@opindex ---never-interactive +@item --never-interactive, @code{%option never-interactive} +instructs flex to generate a scanner which never considers its input +interactive. This is the opposite of @code{always-interactive}. + + +@anchor{option-posix} +@opindex -X +@opindex ---posix +@opindex posix +@item -X, --posix, @code{%option posix} +turns on maximum compatibility with the POSIX 1003.2-1992 definition of +@code{lex}.
Since @code{flex} was originally designed to implement the +POSIX definition of @code{lex}, this generally involves very few changes +in behavior. At the current writing the known differences between +@code{flex} and the POSIX standard are: + +@itemize +@item +In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower +precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}). +Most POSIX utilities use an Extended Regular Expression (ERE) precedence +that has the precedence of the repeat operator higher than concatenation +(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex} +places the precedence of the repeat operator higher than concatenation, +which matches the ERE processing of other POSIX utilities. When either +@samp{--posix} or @samp{-l} is specified, @code{flex} will use the +traditional AT&T and POSIX-compliant precedence for the repeat operator, +where concatenation has higher precedence than the repeat operator. +@end itemize + + +@anchor{option-stack} +@opindex ---stack +@opindex stack +@item --stack, @code{%option stack} +enables the use of +start condition stacks (@pxref{Start Conditions}). + + + +@anchor{option-stdinit} +@opindex ---stdinit +@opindex stdinit +@item --stdinit, @code{%option stdinit} +if set (i.e., @code{%option stdinit}), initializes @code{yyin} and +@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of +@file{NULL}. Some existing @code{lex} programs depend on this behavior, +even though it is not compliant with ANSI C, which does not require +@file{stdin} and @file{stdout} to be compile-time constant. In a +reentrant scanner, however, this is not a problem since initialization +is performed in @code{yylex_init} at runtime.
+ + + +@anchor{option-yylineno} +@opindex ---yylineno +@opindex yylineno +@item --yylineno, @code{%option yylineno} +directs @code{flex} to generate a scanner +that maintains the number of the current line read from its input in the +global variable @code{yylineno}. This option is implied by @code{%option +lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is +accessible regardless of the value of @code{%option yylineno}; however, its +value is not modified by @code{flex} unless @code{%option yylineno} is enabled. + + + +@anchor{option-yywrap} +@opindex ---yywrap +@opindex yywrap +@item --yywrap, @code{%option yywrap} +if unset (i.e., @code{--noyywrap}), makes the scanner not call +@code{yywrap()} upon an end-of-file, but simply assume that there are no +more files to scan (until the user points @file{yyin} at a new file and +calls @code{yylex()} again). + +@end table + +@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options +@section Code-Level And API Options + +@table @samp + +@anchor{option-ansi-definitions} +@opindex ---option-ansi-definitions +@opindex ansi-definitions +@item --ansi-definitions, @code{%option ansi-definitions} +instructs flex to generate ANSI C99 definitions for functions. +This option is enabled by default. +If @code{%option noansi-definitions} is specified, then the obsolete style +is generated. + +@anchor{option-ansi-prototypes} +@opindex ---option-ansi-prototypes +@opindex ansi-prototypes +@item --ansi-prototypes, @code{%option ansi-prototypes} +instructs flex to generate ANSI C99 prototypes for functions. +This option is enabled by default. +If @code{%option noansi-prototypes} is specified, then +prototypes will have empty parameter lists. + +@anchor{option-bison-bridge} +@opindex ---bison-bridge +@opindex bison-bridge +@item --bison-bridge, @code{%option bison-bridge} +instructs flex to generate a C scanner that is +meant to be called by a +@code{GNU bison} +parser.
The scanner has minor API changes for +@code{bison} +compatibility. In particular, the declaration of +@code{yylex} +is modified to take an additional parameter, +@code{yylval}. +@xref{Bison Bridge}. + +@anchor{option-bison-locations} +@opindex ---bison-locations +@opindex bison-locations +@item --bison-locations, @code{%option bison-locations} +instructs flex that +@code{GNU bison} @code{%locations} are being used. +This means @code{yylex} will be passed +an additional parameter, @code{yylloc}. This option +implies @code{%option bison-bridge}. +@xref{Bison Bridge}. + +@anchor{option-noline} +@opindex -L +@opindex ---noline +@opindex noline +@item -L, --noline, @code{%option noline} +instructs +@code{flex} +not to generate +@code{#line} +directives. Without this option, +@code{flex} +peppers the generated scanner +with @code{#line} directives so error messages in the actions will be correctly +located with respect to either the original +@code{flex} +input file (if the errors are due to code in the input file), or +@file{lex.yy.c} +(if the errors are +@code{flex}'s +fault -- you should report these sorts of errors to the email address +given in @ref{Reporting Bugs}). + + + +@anchor{option-reentrant} +@opindex -R +@opindex ---reentrant +@opindex reentrant +@item -R, --reentrant, @code{%option reentrant} +instructs flex to generate a reentrant C scanner. The generated scanner +may safely be used in a multi-threaded environment. The API for a +reentrant scanner is different than for a non-reentrant scanner +(@pxref{Reentrant}). Because of the API difference between +reentrant and non-reentrant @code{flex} scanners, non-reentrant flex +code must be modified before it is suitable for use with this option. +This option is not compatible with the @samp{--c++} option. + +The option @samp{--reentrant} does not affect the performance of +the scanner.
+ + + +@anchor{option-c++} +@opindex -+ +@opindex ---c++ +@opindex c++ +@item -+, --c++, @code{%option c++} +specifies that you want flex to generate a C++ +scanner class. @xref{Cxx}, for +details. + + + +@anchor{option-array} +@opindex ---array +@opindex array +@item --array, @code{%option array} +specifies that you want @code{yytext} to be an array instead of a @code{char *}. + + + +@anchor{option-pointer} +@opindex ---pointer +@opindex pointer +@item --pointer, @code{%option pointer} +specifies that @code{yytext} should be a @code{char *}, not an array. +The default is @code{char *}. + + + +@anchor{option-prefix} +@opindex -P +@opindex ---prefix +@opindex prefix +@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"} +changes the default @samp{yy} prefix used by @code{flex} for all +globally-visible variable and function names to instead be +@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of +@code{yytext} to @code{footext}. It also changes the name of the default +output file from @file{lex.yy.c} to @file{lex.foo.c}. Here is a partial +list of the names affected: + +@example +@verbatim + yy_create_buffer + yy_delete_buffer + yy_flex_debug + yy_init_buffer + yy_flush_buffer + yy_load_buffer_state + yy_switch_to_buffer + yyin + yyleng + yylex + yylineno + yyout + yyrestart + yytext + yywrap + yyalloc + yyrealloc + yyfree +@end verbatim +@end example + +(If you are using a C++ scanner, then only @code{yywrap} and +@code{yyFlexLexer} are affected.) Within your scanner itself, you can +still refer to the global variables and functions using either version +of their name; but externally, they have the modified name. + +This option lets you easily link together multiple +@code{flex} +programs into the same executable.
Note, though, that using this +option also renames +@code{yywrap()}, +so you now +@emph{must} +either +provide your own (appropriately-named) version of the routine for your +scanner, or use +@code{%option noyywrap}, +as linking with +@samp{-lfl} +no longer provides one for you by default. + + + +@anchor{option-main} +@opindex ---main +@opindex main +@item --main, @code{%option main} + directs flex to provide a default @code{main()} program for the +scanner, which simply calls @code{yylex()}. This option implies +@code{noyywrap} (@pxref{option-yywrap}). + + + +@anchor{option-nounistd} +@opindex ---nounistd +@opindex nounistd +@item --nounistd, @code{%option nounistd} +suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option +is meant to target environments in which @file{unistd.h} does not exist. Be aware +that certain options may cause flex to generate code that relies on functions +normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}). +If you wish to use these functions, you will have to inform your compiler where +to find them. +@xref{option-always-interactive}. @xref{option-read}. + + + +@anchor{option-yyclass} +@opindex ---yyclass +@opindex yyclass +@item --yyclass=NAME, @code{%option yyclass="NAME"} +only applies when generating a C++ scanner (the @samp{--c++} option). It +informs @code{flex} that you have derived @code{NAME} as a subclass of +@code{yyFlexLexer}, so @code{flex} will place your actions in the member +function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It +also generates a @code{yyFlexLexer::yylex()} member function that emits +a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if +called. @xref{Cxx}.
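Returning to the @samp{--prefix} option described earlier in this table: because the renamed @code{yywrap()} is no longer supplied by @samp{-lfl}, the scanner's user code has to define it. A minimal sketch, assuming the scanner was generated with the hypothetical prefix @samp{foo} (so the routine must be called @code{foowrap}):

```c
/* Assumption: the scanner was generated with --prefix=foo, so flex
   expects a routine named foowrap() instead of yywrap(). */
int foowrap(void)
{
    /* A non-zero return value tells the scanner that there is no
       further input to switch to, so scanning ends at end-of-file. */
    return 1;
}
```

If the scanner never needs to continue with another input file, @code{%option noyywrap} achieves the same effect without any user code.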
+ +@end table + +@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options +@section Options for Scanner Speed and Size + +@table @samp + +@item -C[aefFmr] +controls the degree of table compression and, more generally, trade-offs +between small scanners and fast scanners. + +@table @samp +@opindex -C +@item -C +A lone @samp{-C} specifies that the scanner tables should be compressed +but neither equivalence classes nor meta-equivalence classes should be +used. + +@anchor{option-align} +@opindex -Ca +@opindex ---align +@opindex align +@item -Ca, --align, @code{%option align} +(``align'') instructs flex to trade off larger tables in the +generated scanner for faster performance because the elements of +the tables are better aligned for memory access and computation. On some +RISC architectures, fetching and manipulating longwords is more efficient +than with smaller-sized units such as shortwords. This option can +quadruple the size of the tables used by your scanner. + +@anchor{option-ecs} +@opindex -Ce +@opindex ---ecs +@opindex ecs +@item -Ce, --ecs, @code{%option ecs} +directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets +of characters which have identical lexical properties (for example, if +the only appearance of digits in the @code{flex} input is in the +character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be +put in the same equivalence class). Equivalence classes usually give +dramatic reductions in the final table/object file sizes (typically a +factor of 2-5) and are pretty cheap performance-wise (one array look-up +per character scanned). + +@opindex -Cf +@item -Cf +specifies that the @dfn{full} scanner tables should be generated - +@code{flex} should not compress the tables by taking advantage of +similar transition functions for different states.
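To picture the equivalence classes described under @samp{-Ce}, here is a minimal hand-written C sketch. It is not @code{flex} output: it assumes a toy two-state automaton for the pattern @code{[0-9]+}, and the names @code{ec}, @code{next_state}, @code{init_ec} and @code{match_digits} are hypothetical. Without equivalence classes each state's row would need one entry per character (256 here); with them it needs one entry per class (2 here), at the price of one extra array look-up per character scanned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy automaton for [0-9]+ ; these are NOT flex's real tables. */
enum { EC_OTHER = 0, EC_DIGIT = 1, NUM_EC = 2 };
enum { JAM = -1 };                      /* no transition possible */

static unsigned char ec[256];           /* character -> equivalence class */

/* One column per equivalence class instead of one per character. */
static const int next_state[2][NUM_EC] = {
    /* state 0: start     */ { JAM, 1 },
    /* state 1: in digits */ { JAM, 1 },
};

/* Put all ten digits in one class and everything else in another. */
void init_ec(void)
{
    memset(ec, EC_OTHER, sizeof ec);
    for (int c = '0'; c <= '9'; ++c)
        ec[c] = EC_DIGIT;
}

/* Length of the longest prefix of s that matches [0-9]+, or 0 if none.
   init_ec() must have been called first. */
size_t match_digits(const char *s)
{
    assert(s != NULL);
    int state = 0;
    size_t n = 0;
    while (s[n] != '\0') {
        /* ec[...] is the extra look-up that equivalence classes cost */
        int next = next_state[state][ec[(unsigned char)s[n]]];
        if (next == JAM)
            break;
        state = next;
        ++n;
    }
    return state == 1 ? n : 0;
}
```

After calling @code{init_ec()}, @code{match_digits("2024-01")} stops at the hyphen and reports a four-character match.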
+ +@opindex -CF +@item -CF +specifies that the alternate fast scanner representation (described +above under the @samp{--fast} flag) should be used. This option cannot be +used with @samp{--c++}. + +@anchor{option-meta-ecs} +@opindex -Cm +@opindex ---meta-ecs +@opindex meta-ecs +@item -Cm, --meta-ecs, @code{%option meta-ecs} +directs +@code{flex} +to construct +@dfn{meta-equivalence classes}, +which are sets of equivalence classes (or characters, if equivalence +classes are not being used) that are commonly used together. Meta-equivalence +classes are often a big win when using compressed tables, but they +have a moderate performance impact (one or two @code{if} tests and one +array look-up per character scanned). + +@anchor{option-read} +@opindex -Cr +@opindex ---read +@opindex read +@item -Cr, --read, @code{%option read} +causes the generated scanner to @emph{bypass} use of the standard I/O +library (@code{stdio}) for input. Instead of calling @code{fread()} or +@code{getc()}, the scanner will use the @code{read()} system call, +resulting in a performance gain which varies from system to system, but +in general is probably negligible unless you are also using @samp{-Cf} +or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for +example, you read from @file{yyin} using @code{stdio} prior to calling +the scanner (because the scanner will miss whatever text your previous +reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect +if you define @code{YY_INPUT()} (@pxref{Generated Scanner}). +@end table + +The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense +together - there is no opportunity for meta-equivalence classes if the +table is not being compressed. Otherwise the options may be freely +mixed, and are cumulative. + +The default setting is @samp{-Cem}, which specifies that @code{flex} +should generate equivalence classes and meta-equivalence classes. This +setting provides the highest degree of table compression. 
You can obtain +faster-executing scanners at the cost of larger tables, with the +following generally being true: + +@example +@verbatim + slowest & smallest + -Cem + -Cm + -Ce + -C + -C{f,F}e + -C{f,F} + -C{f,F}a + fastest & largest +@end verbatim +@end example + +Note that scanners with the smallest tables are usually generated and +compiled the quickest, so during development you will usually want to +use the default, maximal compression. + +@samp{-Cfe} is often a good compromise between speed and size for +production scanners. + +@anchor{option-full} +@opindex -f +@opindex ---full +@opindex full +@item -f, --full, @code{%option full} +specifies a +@dfn{fast scanner}. +No table compression is done and @code{stdio} is bypassed. +The result is large but fast. This option is equivalent to +@samp{-Cfr}. + + +@anchor{option-fast} +@opindex -F +@opindex ---fast +@opindex fast +@item -F, --fast, @code{%option fast} +specifies that the @emph{fast} scanner table representation should be +used (and @code{stdio} bypassed). This representation is about as fast +as the full table representation (@samp{--full}), and for some sets of +patterns will be considerably smaller (and for others, larger). In +general, if the pattern set contains both @emph{keywords} and a +catch-all, @emph{identifier} rule, such as in the set: + +@example +@verbatim + "case" return TOK_CASE; + "switch" return TOK_SWITCH; + ... + "default" return TOK_DEFAULT; + [a-z]+ return TOK_ID; +@end verbatim +@end example + +then you're better off using the full table representation. If only +the @emph{identifier} rule is present and you then use a hash table or some such +to detect the keywords, you're better off using +@samp{--fast}. + +This option is equivalent to @samp{-CFr}. It cannot be used +with @samp{--c++}.
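The ``identifier rule plus keyword look-up'' arrangement mentioned under @samp{--fast} can be sketched in plain C as follows. This is only an illustration: it uses a binary search over a sorted table rather than a hash table, and the token codes, the @code{keywords} table and the function @code{classify_word} are hypothetical names, not part of @code{flex}.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real scanner would share them with its parser. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, because bsearch() requires sorted input. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

/* bsearch() passes the search key first and a table element second. */
static int cmp_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct keyword *)elem)->name);
}

/* Classify text that a single [a-z]+ rule has already matched. */
int classify_word(const char *text)
{
    assert(text != NULL);
    const struct keyword *kw = bsearch(text, keywords,
                                       sizeof keywords / sizeof keywords[0],
                                       sizeof keywords[0], cmp_keyword);
    return kw ? kw->token : TOK_ID;
}
```

With such a helper the rules section shrinks to a single identifier rule whose action is @code{return classify_word(yytext);}, which is the situation in which @samp{--fast} is expected to pay off.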
+ +@end table + +@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options +@section Debugging Options + +@table @samp + +@anchor{option-backup} +@opindex -b +@opindex ---backup +@opindex backup +@item -b, --backup, @code{%option backup} +Generate backing-up information to @file{lex.backup}. This is a list of +scanner states which require backing up and the input characters on +which they do so. By adding rules one can remove backing-up states. If +@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF} +is used, the generated scanner will run faster (see the @samp{--perf-report} flag). +Only users who wish to squeeze every last cycle out of their scanners +need worry about this option (@pxref{Performance}). + + + +@anchor{option-debug} +@opindex -d +@opindex ---debug +@opindex debug +@item -d, --debug, @code{%option debug} +makes the generated scanner run in @dfn{debug} mode. Whenever a pattern +is recognized and the global variable @code{yy_flex_debug} is non-zero +(which is the default), the scanner will write to @file{stderr} a line +of the form: + +@example +@verbatim + --accepting rule at line 53 ("the matched text") +@end verbatim +@end example + +The line number refers to the location of the rule in the file defining +the scanner (i.e., the file that was fed to flex). Messages are also +generated when the scanner backs up, accepts the default rule, reaches +the end of its input buffer (or encounters a NUL; at this point, the two +look the same as far as the scanner's concerned), or reaches an +end-of-file. + + + +@anchor{option-perf-report} +@opindex -p +@opindex ---perf-report +@opindex perf-report +@item -p, --perf-report, @code{%option perf-report} +generates a performance report to @file{stderr}. The report consists of +comments regarding features of the @code{flex} input file which will +cause a serious loss of performance in the resulting scanner.
If you +give the flag twice, you will also get comments regarding features that +lead to minor performance losses. + +Note that the use of @code{REJECT} and +variable trailing context (@pxref{Limitations}) entails a substantial +performance penalty; use of @code{yymore()}, the @samp{^} operator, and +the @samp{--interactive} flag entail minor performance penalties. + + + +@anchor{option-nodefault} +@opindex -s +@opindex ---nodefault +@opindex nodefault +@item -s, --nodefault, @code{%option nodefault} +causes the @emph{default rule} (that unmatched scanner input is echoed +to @file{stdout}) to be suppressed. If the scanner encounters input +that does not match any of its rules, it aborts with an error. This +option is useful for finding holes in a scanner's rule set. + + + +@anchor{option-trace} +@opindex -T +@opindex ---trace +@opindex trace +@item -T, --trace, @code{%option trace} +makes @code{flex} run in @dfn{trace} mode. It will generate a lot of +messages to @file{stderr} concerning the form of the input and the +resultant non-deterministic and deterministic finite automata. This +option is mostly for use in maintaining @code{flex}. + + + +@anchor{option-nowarn} +@opindex -w +@opindex ---nowarn +@opindex nowarn +@item -w, --nowarn, @code{%option nowarn} +suppresses warning messages. + + + +@anchor{option-verbose} +@opindex -v +@opindex ---verbose +@opindex verbose +@item -v, --verbose, @code{%option verbose} +specifies that @code{flex} should write to @file{stderr} a summary of +statistics regarding the scanner it generates. Most of the statistics +are meaningless to the casual @code{flex} user, but the first line +identifies the version of @code{flex} (same as reported by @samp{--version}), +and the next line the flags used when generating the scanner, including +those that are on by default. + + + +@anchor{option-warn} +@opindex ---warn +@opindex warn +@item --warn, @code{%option warn} +warn about certain things.
In particular, if the default rule can be +matched but no default rule has been given, @code{flex} will warn you. +We recommend using this option always. + +@end table + +@node Miscellaneous Options, , Debugging Options, Scanner Options +@section Miscellaneous Options + +@table @samp +@opindex -c +@item -c +A do-nothing option included for POSIX compliance. + +@opindex -h +@opindex ---help +@item -h, -?, --help +generates a ``help'' summary of @code{flex}'s options to @file{stdout} +and then exits. + +@opindex -n +@item -n +Another do-nothing option included for +POSIX compliance. + +@opindex -V +@opindex ---version +@item -V, --version +prints the version number to @file{stdout} and exits. + +@end table + + +@node Performance, Cxx, Scanner Options, Top +@chapter Performance Considerations + +@cindex performance, considerations +The main design goal of @code{flex} is that it generate high-performance +scanners. It has been optimized for dealing well with large sets of +rules. Aside from the effects on scanner speed of the table compression +@samp{-C} options outlined above, there are a number of options/actions +which degrade performance. These are, from most expensive to least: + +@cindex REJECT, performance costs +@cindex yylineno, performance costs +@cindex trailing context, performance costs +@example +@verbatim + REJECT + arbitrary trailing context + + pattern sets that require backing up + %option yylineno + %array + + %option interactive + %option always-interactive + + '^' beginning-of-line operator + yymore() +@end verbatim +@end example + +with the first two being quite expensive and the last two being +quite cheap. Note also that @code{unput()} is implemented as a routine +call that potentially does quite a bit of work, while @code{yyless()} is +a quite-cheap macro. So if you are just putting back some excess text +you scanned, use @code{yyless()}. + +@code{REJECT} should be avoided at all costs when performance is +important.
It is a particularly expensive option. + +There is one case when @code{%option yylineno} can be expensive. That is when +your patterns match long tokens that could @emph{possibly} contain a newline +character. There is no performance penalty for rules that cannot possibly +match newlines, since flex does not need to check them for newlines. In +general, you should avoid rules such as @code{[^f]+}, which match very long +tokens, including newlines, and may possibly match your entire file! A better +approach is to separate @code{[^f]+} into two rules: + +@example +@verbatim +%option yylineno +%% + [^f\n]+ + \n+ +@end verbatim +@end example + +The above scanner does not incur a performance penalty. + +@cindex patterns, tuning for performance +@cindex performance, backing up +@cindex backing up, example of eliminating +Getting rid of backing up is messy and often may be an enormous amount +of work for a complicated scanner. In principle, one begins by using +the @samp{-b} flag to generate a @file{lex.backup} file. For example, +on the input: + +@cindex backing up, eliminating +@example +@verbatim + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; +@end verbatim +@end example + +the file looks like: + +@example +@verbatim + State #6 is non-accepting - + associated rule line numbers: + 2 3 + out-transitions: [ o ] + jam-transitions: EOF [ \001-n p-\177 ] + + State #8 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ a ] + jam-transitions: EOF [ \001-` b-\177 ] + + State #9 is non-accepting - + associated rule line numbers: + 3 + out-transitions: [ r ] + jam-transitions: EOF [ \001-q s-\177 ] + + Compressed tables always back up. +@end verbatim +@end example + +The first few lines tell us that there's a scanner state in which it can +make a transition on an 'o' but not on any other character, and that in +that state the currently scanned text does not match any rule.
The +state occurs when trying to match the rules found at lines 2 and 3 in +the input file. If the scanner is in that state and then reads +something other than an 'o', it will have to back up to find a rule +which is matched. With a bit of headscratching one can see that this +must be the state it's in when it has seen @samp{fo}. When this has +happened, if anything other than another @samp{o} is seen, the scanner +will have to back up to simply match the @samp{f} (by the default rule). + +The comment regarding State #8 indicates there's a problem when +@samp{foob} has been scanned. Indeed, on any character other than an +@samp{a}, the scanner will have to back up to accept "foo". Similarly, +the comment for State #9 concerns when @samp{fooba} has been scanned and +an @samp{r} does not follow. + +The final comment reminds us that there's no point going to all the +trouble of removing backing up from the rules unless we're using +@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so +with compressed scanners. + +@cindex error rules, to eliminate backing up +The way to remove the backing up is to add ``error'' rules: + +@cindex backing up, eliminating by adding error rules +@example +@verbatim + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + fooba | + foob | + fo { + /* false alarm, not really a keyword */ + return TOK_ID; + } +@end verbatim +@end example + +Eliminating backing up among a list of keywords can also be done using a +``catch-all'' rule: + +@cindex backing up, eliminating with catch-all rule +@example +@verbatim + %% + foo return TOK_KEYWORD; + foobar return TOK_KEYWORD; + + [a-z]+ return TOK_ID; +@end verbatim +@end example + +This is usually the best solution when appropriate. + +Backing up messages tend to cascade. With a complicated set of rules +it's not uncommon to get hundreds of messages. 
If one can decipher +them, though, it often only takes a dozen or so rules to eliminate the +backing up (though it's easy to make a mistake and have an error rule +accidentally match a valid token. A possible future @code{flex} feature +will be to automatically add rules to eliminate backing up). + +It's important to keep in mind that you gain the benefits of eliminating +backing up only if you eliminate @emph{every} instance of backing up. +Leaving just one means you gain nothing. + +@emph{Variable} trailing context (where both the leading and trailing +parts do not have a fixed length) entails almost the same performance +loss as @code{REJECT} (i.e., substantial). So when possible a rule +like: + +@cindex trailing context, variable length +@example +@verbatim + %% + mouse|rat/(cat|dog) run(); +@end verbatim +@end example + +is better written: + +@example +@verbatim + %% + mouse/cat|dog run(); + rat/cat|dog run(); +@end verbatim +@end example + +or as + +@example +@verbatim + %% + mouse|rat/cat run(); + mouse|rat/dog run(); +@end verbatim +@end example + +Note that here the special '|' action does @emph{not} provide any +savings, and can even make things worse (@pxref{Limitations}). + +Another area where the user can increase a scanner's performance (and +one that's easier to implement) arises from the fact that the longer the +tokens matched, the faster the scanner will run. This is because with +long tokens the processing of most input characters takes place in the +(short) inner scanning loop, and does not often have to go through the +additional work of setting up the scanning environment (e.g., +@code{yytext}) for the action. 
Recall the scanner for C comments: + +@cindex performance optimization, matching longer tokens +@example +@verbatim + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* + <comment>"*"+[^*/\n]* + <comment>\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); +@end verbatim +@end example + +This could be sped up by writing it as: + +@example +@verbatim + %x comment + %% + int line_num = 1; + + "/*" BEGIN(comment); + + <comment>[^*\n]* + <comment>[^*\n]*\n ++line_num; + <comment>"*"+[^*/\n]* + <comment>"*"+[^*/\n]*\n ++line_num; + <comment>"*"+"/" BEGIN(INITIAL); +@end verbatim +@end example + +Now instead of each newline requiring the processing of another action, +recognizing the newlines is distributed over the other rules to keep the +matched text as long as possible. Note that @emph{adding} rules does +@emph{not} slow down the scanner! The speed of the scanner is +independent of the number of rules or (modulo the considerations given +at the beginning of this section) how complicated the rules are with +regard to operators such as @samp{*} and @samp{|}. + +@cindex keywords, for performance +@cindex performance, using keywords +A final example in speeding up a scanner: suppose you want to scan +through a file containing identifiers and keywords, one per line +and with no other extraneous characters, and recognize all the +keywords. A natural first approach is: + +@cindex performance optimization, recognizing keywords +@example +@verbatim + %% + asm | + auto | + break | + ... etc ... + volatile | + while /* it's a keyword */ + + .|\n /* it's not a keyword */ +@end verbatim +@end example + +To eliminate the back-tracking, introduce a catch-all rule: + +@example +@verbatim + %% + asm | + auto | + break | + ... etc ... 
+ volatile | + while /* it's a keyword */ + + [a-z]+ | + .|\n /* it's not a keyword */ +@end verbatim +@end example + +Now, if it's guaranteed that there's exactly one word per line, then we +can reduce the total number of matches by half by merging in the +recognition of newlines with that of the other tokens: + +@example +@verbatim + %% + asm\n | + auto\n | + break\n | + ... etc ... + volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + .|\n /* it's not a keyword */ +@end verbatim +@end example + +One has to be careful here, as we have now reintroduced backing up +into the scanner. In particular, while +@emph{we} +know that there will never be any characters in the input stream +other than letters or newlines, +@code{flex} +can't figure this out, and it will plan for possibly needing to back up +when it has scanned a token like @samp{auto} and then the next character +is something other than a newline or a letter. Previously it would +then just match the @samp{auto} rule and be done, but now it has no @samp{auto} +rule, only a @samp{auto\n} rule. To eliminate the possibility of backing up, +we could either duplicate all rules but without final newlines, or, +since we never expect to encounter such an input and therefore don't +care how it's classified, we can introduce one more catch-all rule, this +one which doesn't include a newline: + +@example +@verbatim + %% + asm\n | + auto\n | + break\n | + ... etc ... + volatile\n | + while\n /* it's a keyword */ + + [a-z]+\n | + [a-z]+ | + .|\n /* it's not a keyword */ +@end verbatim +@end example + +Compiled with @samp{-Cf}, this is about as fast as one can get a +@code{flex} scanner to go for this particular problem. + +A final note: @code{flex} is slow when matching @code{NUL}s, +particularly when a token contains multiple @code{NUL}s. It's best to +write rules which match @emph{short} amounts of text if it's anticipated +that the text will often include @code{NUL}s.
+ +Another final note regarding performance: as mentioned in +@ref{Matching}, dynamically resizing @code{yytext} to accommodate huge +tokens is a slow process because it presently requires that the (huge) +token be rescanned from the beginning. Thus if performance is vital, +you should attempt to match ``large'' quantities of text but not +``huge'' quantities, where the cutoff between the two is at about 8K +characters per token. + +@node Cxx, Reentrant, Performance, Top +@chapter Generating C++ Scanners + +@cindex c++, experimental form of scanner class +@cindex experimental form of c++ scanner class +@strong{IMPORTANT}: the present form of the scanning class is @emph{experimental} +and may change considerably between major releases. + +@cindex C++ +@cindex member functions, C++ +@cindex methods, c++ +@code{flex} provides two different ways to generate scanners for use +with C++. The first way is to simply compile a scanner generated by +@code{flex} using a C++ compiler instead of a C compiler. You should +not encounter any compilation errors (@pxref{Reporting Bugs}). You can +then use C++ code in your rule actions instead of C code. Note that the +default input source for your scanner remains @file{yyin}, and default +echoing is still done to @file{yyout}. Both of these remain @code{FILE +*} variables and not C++ @emph{streams}. + +You can also use @code{flex} to generate a C++ scanner class, using the +@samp{-+} option (or, equivalently, @code{%option c++}), which is +automatically specified if the name of the @code{flex} executable ends +in a '+', such as @code{flex++}. When using this option, @code{flex} +defaults to generating the scanner to the file @file{lex.yy.cc} instead +of @file{lex.yy.c}. The generated scanner includes the header file +@file{FlexLexer.h}, which defines the interface to two C++ classes. + +The first class, +@code{FlexLexer}, +provides an abstract base class defining the general scanner class +interface.
It provides the following member functions: + +@table @code +@findex YYText (C++ only) +@item const char* YYText() +returns the text of the most recently matched token, the equivalent of +@code{yytext}. + +@findex YYLeng (C++ only) +@item int YYLeng() +returns the length of the most recently matched token, the equivalent of +@code{yyleng}. + +@findex lineno (C++ only) +@item int lineno() const +returns the current input line number (see @code{%option yylineno}), or +@code{1} if @code{%option yylineno} was not used. + +@findex set_debug (C++ only) +@item void set_debug( int flag ) +sets the debugging flag for the scanner, equivalent to assigning to +@code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build +the scanner using @code{%option debug} to include debugging information +in it. + +@findex debug (C++ only) +@item int debug() const +returns the current setting of the debugging flag. +@end table + +Also provided are member functions equivalent to +@code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the +first argument is an @code{istream*} object pointer and not a +@code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and +@code{yyrestart()} (again, the first argument is an @code{istream*} +object pointer). + +@tindex yyFlexLexer (C++ only) +@tindex FlexLexer (C++ only) +The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer}, +which is derived from @code{FlexLexer}. It defines the following +additional member functions: + +@table @code +@findex yyFlexLexer constructor (C++ only) +@item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) +constructs a @code{yyFlexLexer} object using the given streams for input +and output. If not specified, the streams default to @code{cin} and +@code{cout}, respectively.
+ +@findex yylex (C++ version) +@item virtual int yylex() +performs the same role as @code{yylex()} does for ordinary @code{flex} +scanners: it scans the input stream, consuming tokens, until a rule's +action returns a value. If you derive a subclass @code{S} from +@code{yyFlexLexer} and want to access the member functions and variables +of @code{S} inside @code{yylex()}, then you need to use @code{%option +yyclass="S"} to inform @code{flex} that you will be using that subclass +instead of @code{yyFlexLexer}. In this case, rather than generating +@code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()} +(and also generates a dummy @code{yyFlexLexer::yylex()} that calls +@code{yyFlexLexer::LexerError()} if called). + +@findex switch_streams (C++ only) +@item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0) +reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to +@code{new_out} (if non-null), deleting the previous input buffer if +@code{yyin} is reassigned. + +@item int yylex( istream* new_in, ostream* new_out = 0 ) +first switches the input streams via @code{switch_streams( new_in, +new_out )} and then returns the value of @code{yylex()}. +@end table + +In addition, @code{yyFlexLexer} defines the following protected virtual +functions which you can redefine in derived classes to tailor the +scanner: + +@table @code +@findex LexerInput (C++ only) +@item virtual int LexerInput( char* buf, int max_size ) +reads up to @code{max_size} characters into @code{buf} and returns the +number of characters read. To indicate end-of-input, return 0 +characters. Note that @code{interactive} scanners (see the @samp{-B} +and @samp{-I} flags in @ref{Scanner Options}) define the macro +@code{YY_INTERACTIVE}.
If you redefine @code{LexerInput()} and need to +take different actions depending on whether or not the scanner might be +scanning an interactive input source, you can test for the presence of +this name via @code{#ifdef} statements. + +@findex LexerOutput (C++ only) +@item virtual void LexerOutput( const char* buf, int size ) +writes out @code{size} characters from the buffer @code{buf}, which, while +@code{NUL}-terminated, may also contain internal @code{NUL}s if the +scanner's rules can match text with @code{NUL}s in them. + +@cindex error reporting, in C++ +@findex LexerError (C++ only) +@item virtual void LexerError( const char* msg ) +reports a fatal error message. The default version of this function +writes the message to the stream @code{cerr} and exits. +@end table + +Note that a @code{yyFlexLexer} object contains its @emph{entire} +scanning state. Thus you can use such objects to create reentrant +scanners, but see also @ref{Reentrant}. You can instantiate multiple +instances of the same @code{yyFlexLexer} class, and you can also combine +multiple C++ scanner classes together in the same program using the +@samp{-P} option discussed above. + +Finally, note that the @code{%array} feature is not available to C++ +scanner classes; you must use @code{%pointer} (the default). + +Here is an example of a simple C++ scanner: + +@cindex C++ scanners, use of +@example +@verbatim + // An example of using the flex C++ scanner class. + + %{ + int mylineno = 0; + %} + + string \"[^\n"]+\" + + ws [ \t]+ + + alpha [A-Za-z] + dig [0-9] + name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])* + num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)? + num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)? 
+ number {num1}|{num2} + + %% + + {ws} /* skip blanks and tabs */ + + "/*" { + int c; + + while((c = yyinput()) != 0) + { + if(c == '\n') + ++mylineno; + + else if(c == '*') + { + if((c = yyinput()) == '/') + break; + else + unput(c); + } + } + } + + {number} cout << "number " << YYText() << '\n'; + + \n mylineno++; + + {name} cout << "name " << YYText() << '\n'; + + {string} cout << "string " << YYText() << '\n'; + + %% + + int main( int /* argc */, char** /* argv */ ) + { + FlexLexer* lexer = new yyFlexLexer; + while(lexer->yylex() != 0) + ; + return 0; + } +@end verbatim +@end example + +@cindex C++, multiple different scanners +If you want to create multiple (different) lexer classes, you use the +@samp{-P} flag (or the @code{prefix=} option) to rename each +@code{yyFlexLexer} to some other @samp{xxFlexLexer}. You then can +include @file{<FlexLexer.h>} in your other sources once per lexer class, +first renaming @code{yyFlexLexer} as follows: + +@cindex include files, with C++ +@cindex header files, with C++ +@cindex C++ scanners, including multiple scanners +@example +@verbatim + #undef yyFlexLexer + #define yyFlexLexer xxFlexLexer + #include <FlexLexer.h> + + #undef yyFlexLexer + #define yyFlexLexer zzFlexLexer + #include <FlexLexer.h> +@end verbatim +@end example + +if, for example, you used @code{%option prefix="xx"} for one of your +scanners and @code{%option prefix="zz"} for the other. + +@node Reentrant, Lex and Posix, Cxx, Top +@chapter Reentrant C Scanners + +@cindex reentrant, explanation +@code{flex} has the ability to generate a reentrant C scanner. This is +accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated +scanner is both portable and safe to use in one or more separate threads of +control. The most common use for reentrant scanners is from within +multi-threaded applications. Any thread may create and execute a reentrant +@code{flex} scanner without the need for synchronization with other threads.
+ +@menu +* Reentrant Uses:: +* Reentrant Overview:: +* Reentrant Example:: +* Reentrant Detail:: +* Reentrant Functions:: +@end menu + +@node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant +@section Uses for Reentrant Scanners + +However, there are other uses for a reentrant scanner. For example, you +could scan two or more files simultaneously to implement a @code{diff} at +the token level (i.e., instead of at the character level): + +@cindex reentrant scanners, multiple interleaved scanners +@example +@verbatim + /* Example of maintaining more than one active scanner. */ + + int tok1, tok2; + + do { + tok1 = yylex( scanner_1 ); + tok2 = yylex( scanner_2 ); + + if( tok1 != tok2 ) + printf("Files are different."); + + } while ( tok1 && tok2 ); +@end verbatim +@end example + +Another use for a reentrant scanner is recursion. +(Note that a recursive scanner can also be created using a non-reentrant scanner and +buffer states. @xref{Multiple Input Buffers}.) + +The following crude scanner supports the @samp{eval} command by invoking +another instance of itself. + +@cindex reentrant scanners, recursive invocation +@example +@verbatim + /* Example of recursive invocation. */ + + %option reentrant + + %% + "eval(".+")" { + yyscan_t scanner; + YY_BUFFER_STATE buf; + + yylex_init( &scanner ); + yytext[yyleng-1] = ' '; + + buf = yy_scan_string( yytext + 5, scanner ); + yylex( scanner ); + + yy_delete_buffer(buf,scanner); + yylex_destroy( scanner ); + } + ... + %% +@end verbatim +@end example + +@node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant +@section An Overview of the Reentrant API + +@cindex reentrant, API explanation +The API for reentrant scanners is different from that for non-reentrant +scanners. Here is a quick overview of the API: + +@itemize +@item +@code{%option reentrant} must be specified. + +@item +All functions take one additional argument: @code{yyscanner} + +@item +All global variables are replaced by their macro equivalents.
+(We tell you this because it may be important to you during debugging.) + +@item +@code{yylex_init} and @code{yylex_destroy} must be called before and +after @code{yylex}, respectively. + +@item +Accessor methods (get/set functions) provide access to common +@code{flex} variables. + +@item +User-specific data can be stored in @code{yyextra}. +@end itemize + +@node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant +@section Reentrant Example + +First, an example of a reentrant scanner: +@cindex reentrant, example of +@example +@verbatim + /* This scanner prints "//" comments. */ + + %option reentrant stack noyywrap + %x COMMENT + + %% + + "//" yy_push_state( COMMENT, yyscanner); + .|\n + + <COMMENT>\n yy_pop_state( yyscanner ); + <COMMENT>[^\n]+ fprintf( yyout, "%s\n", yytext); + + %% + + int main ( int argc, char * argv[] ) + { + yyscan_t scanner; + + yylex_init ( &scanner ); + yylex ( scanner ); + yylex_destroy ( scanner ); + return 0; + } +@end verbatim +@end example + +@node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant +@section The Reentrant API in Detail + +Here are the things you need to do or know to use the reentrant C API of +@code{flex}. + +@menu +* Specify Reentrant:: +* Extra Reentrant Argument:: +* Global Replacement:: +* Init and Destroy Functions:: +* Accessor Methods:: +* Extra Data:: +* About yyscan_t:: +@end menu + +@node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail +@subsection Declaring a Scanner As Reentrant + +@code{%option reentrant} (@samp{--reentrant}) must be specified. + +Notice that @code{%option reentrant} is specified in the above example +(@pxref{Reentrant Example}). Had this option not been specified, +@code{flex} would have happily generated a non-reentrant scanner without +complaining. You may explicitly specify @code{%option noreentrant}, if +you do @emph{not} want a reentrant scanner, although it is not +necessary. The default is to generate a non-reentrant scanner.
+ +@node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail +@subsection The Extra Argument + +@cindex reentrant, calling functions +@vindex yyscanner (reentrant only) +All functions take one additional argument: @code{yyscanner}. + +Notice that the calls to @code{yy_push_state} and @code{yy_pop_state} +both have an argument, @code{yyscanner} , that is not present in a +non-reentrant scanner. Here are the declarations of +@code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner: + +@example +@verbatim + static void yy_push_state ( int new_state , yyscan_t yyscanner ) ; + static void yy_pop_state ( yyscan_t yyscanner ) ; +@end verbatim +@end example + +Notice that the argument @code{yyscanner} appears in the declaration of +both functions. In fact, all @code{flex} functions in a reentrant +scanner have this additional argument. It is always the last argument +in the argument list, it is always of type @code{yyscan_t} (which is +typedef'd to @code{void *}) and it is +always named @code{yyscanner}. As you may have guessed, +@code{yyscanner} is a pointer to an opaque data structure encapsulating +the current state of the scanner. For a list of function declarations, +see @ref{Reentrant Functions}. Note that preprocessor macros, such as +@code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this +additional argument. + +@node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail +@subsection Global Variables Replaced By Macros + +@cindex reentrant, accessing flex variables +All global variables in traditional flex have been replaced by macro equivalents. + +Note that in the above example, @code{yyout} and @code{yytext} are +not plain variables. These are macros that will expand to their equivalent lvalue. +All of the familiar @code{flex} globals have been replaced by their macro +equivalents. 
In particular, @code{yytext}, @code{yyleng}, @code{yylineno}, +@code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc} +are macros. You may safely use these macros in actions as if they were plain +variables. We only tell you this so you don't expect to link to these variables +externally. Currently, each macro expands to a member of an internal struct, e.g., + +@example +@verbatim +#define yytext (((struct yyguts_t*)yyscanner)->yytext_r) +@end verbatim +@end example + +One important thing to remember about +@code{yytext} +and friends is that, because +@code{yytext} +is not a global variable in a reentrant +scanner, you cannot access it directly from outside an action or from +other functions. You must use an accessor method, e.g., +@code{yyget_text}, +to accomplish this (@pxref{Accessor Methods}). + +@node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail +@subsection Init and Destroy Functions + +@cindex memory, considerations for reentrant scanners +@cindex reentrant, initialization +@findex yylex_init +@findex yylex_destroy + +@code{yylex_init} and @code{yylex_destroy} must be called before and +after @code{yylex}, respectively. + +@example +@verbatim + int yylex_init ( yyscan_t * ptr_yy_globals ) ; + int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ; + int yylex ( yyscan_t yyscanner ) ; + int yylex_destroy ( yyscan_t yyscanner ) ; +@end verbatim +@end example + +The function @code{yylex_init} must be called before calling any other +function. The argument to @code{yylex_init} is the address of an +uninitialized pointer to be filled in by @code{yylex_init}, overwriting +any previous contents. The function @code{yylex_init_extra} may be used +instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}. +See the section on yyextra, below, for more details. + +The value stored in @code{ptr_yy_globals} should +thereafter be passed to @code{yylex} and @code{yylex_destroy}.
Flex +does not save the argument passed to @code{yylex_init}, so it is safe to +pass the address of a local pointer to @code{yylex_init} so long as it remains +in scope for the duration of all calls to the scanner, up to and including +the call to @code{yylex_destroy}. + +The function +@code{yylex} should be familiar to you by now. The reentrant version +takes one argument, which is the value returned (via an argument) by +@code{yylex_init}. Otherwise, it behaves the same as the non-reentrant +version of @code{yylex}. + +Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success, +or non-zero on failure, in which case @code{errno} is set to one of the following values: + +@itemize +@item ENOMEM +Memory allocation error. @xref{memory-management}. +@item EINVAL +Invalid argument. +@end itemize + + +The function @code{yylex_destroy} should be +called to free resources used by the scanner. After @code{yylex_destroy} +is called, the contents of @code{yyscanner} should not be used. Of +course, there is no need to destroy a scanner if you plan to reuse it. +A @code{flex} scanner (both reentrant and non-reentrant) may be +restarted by calling @code{yyrestart}. + +Below is an example of a program that creates a scanner, uses it, then destroys +it when done: + +@example +@verbatim + int main () + { + yyscan_t scanner; + int tok; + + yylex_init(&scanner); + + while ((tok=yylex(scanner)) > 0) + printf("tok=%d yytext=%s\n", tok, yyget_text(scanner)); + + yylex_destroy(scanner); + return 0; + } +@end verbatim +@end example + +@node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail +@subsection Accessing Variables with Reentrant Scanners + +@cindex reentrant, accessor functions +Accessor methods (get/set functions) provide access to common +@code{flex} variables. + +Many scanners that you build will be part of a larger project. Portions +of your project will need access to @code{flex} values, such as +@code{yytext}.
In a non-reentrant scanner, these values are global, so +there is no problem accessing them. However, in a reentrant scanner, there are no +global @code{flex} values. You can not access them directly. Instead, +you must access @code{flex} values using accessor methods (get/set +functions). Each accessor method is named @code{yyget_NAME} or +@code{yyset_NAME}, where @code{NAME} is the name of the @code{flex} +variable you want. For example: + +@cindex accessor functions, use of +@example +@verbatim + /* Set the last character of yytext to NULL. */ + void chop ( yyscan_t scanner ) + { + int len = yyget_leng( scanner ); + yyget_text( scanner )[len - 1] = '\0'; + } +@end verbatim +@end example + +The above code may be called from within an action like this: + +@example +@verbatim + %% + .+\n { chop( yyscanner );} +@end verbatim +@end example + +You may find that @code{%option header-file} is particularly useful for generating +prototypes of all the accessor functions. @xref{option-header}. + +@node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail +@subsection Extra Data + +@cindex reentrant, extra data +@vindex yyextra +User-specific data can be stored in @code{yyextra}. + +In a reentrant scanner, it is unwise to use global variables to +communicate with or maintain state between different pieces of your program. +However, you may need access to external data or invoke external functions +from within the scanner actions. +Likewise, you may need to pass information to your scanner +(e.g., open file descriptors, or database connections). +In a non-reentrant scanner, the only way to do this would be through the +use of global variables. +@code{Flex} allows you to store arbitrary, ``extra'' data in a scanner. +This data is accessible through the accessor methods +@code{yyget_extra} and @code{yyset_extra} +from outside the scanner, and through the shortcut macro +@code{yyextra} +from within the scanner itself. 
They are defined as follows: + +@tindex YY_EXTRA_TYPE (reentrant only) +@findex yyget_extra +@findex yyset_extra +@example +@verbatim + #define YY_EXTRA_TYPE void* + YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); + void yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner); +@end verbatim +@end example + +In addition, an extra form of @code{yylex_init} is provided, +@code{yylex_init_extra}. This function is provided so that the yyextra value can +be accessed from within the very first yyalloc, used to allocate +the scanner itself. + +By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You +may redefine this type using @code{%option extra-type="your_type"} in +the scanner: + +@cindex YY_EXTRA_TYPE, defining your own type +@example +@verbatim + /* An example of overriding YY_EXTRA_TYPE. */ + %{ + #include <sys/stat.h> + #include <unistd.h> + %} + %option reentrant + %option extra-type="struct stat *" + %% + + __filesize__ printf( "%ld", yyextra->st_size ); + __lastmod__ printf( "%ld", yyextra->st_mtime ); + %% + void scan_file( char* filename ) + { + yyscan_t scanner; + struct stat buf; + FILE *in; + + in = fopen( filename, "r" ); + stat( filename, &buf ); + + yylex_init_extra( &buf, &scanner ); + yyset_in( in, scanner ); + yylex( scanner ); + yylex_destroy( scanner ); + + fclose( in ); + } +@end verbatim +@end example + + +@node About yyscan_t, , Extra Data, Reentrant Detail +@subsection About yyscan_t + +@tindex yyscan_t (reentrant only) +@code{yyscan_t} is defined as: + +@example +@verbatim + typedef void* yyscan_t; +@end verbatim +@end example + +It is initialized by @code{yylex_init()} to point to +an internal structure. You should never access this value +directly. In particular, you should never attempt to free it +(use @code{yylex_destroy()} instead.)
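
The opaque-handle idea behind @code{yyscan_t} can be illustrated outside of @code{flex}. What follows is a minimal sketch in plain C, not the code that @code{flex} generates. The names @code{handle_t}, @code{struct guts}, @code{handle_init}, @code{handle_bump}, @code{handle_get_count} and @code{handle_destroy} are hypothetical and exist only for this illustration. The sketch shows why callers are expected to go through init, accessor and destroy functions rather than touch or free the pointer themselves.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical opaque handle, analogous in spirit to yyscan_t. */
typedef void *handle_t;

/* Internal state. In a real library this definition would be private,
   so callers could not depend on its layout. */
struct guts {
    int count;
};

/* Allocate the internal state and store its address in *out.
   Returns 0 on success and non-zero on failure. */
int handle_init(handle_t *out)
{
    struct guts *g;

    if (out == NULL)
        return 1;

    g = malloc(sizeof *g);
    if (g == NULL)
        return 1;

    g->count = 0;
    *out = g;
    return 0;
}

/* Modify the state that belongs to this handle only. */
void handle_bump(handle_t h)
{
    assert(h != NULL);
    ((struct guts *)h)->count++;
}

/* Accessor: read a value without exposing the structure. */
int handle_get_count(handle_t h)
{
    assert(h != NULL);
    return ((struct guts *)h)->count;
}

/* Release everything that handle_init allocated. */
void handle_destroy(handle_t h)
{
    free(h);
}
```

Two handles created with @code{handle_init} keep independent counters, in the same way that two @code{yyscan_t} values keep independent scanner state. Because only the library knows what the pointer refers to, cleanup belongs in the destroy function.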
+ +@node Reentrant Functions, , Reentrant Detail, Reentrant +@section Functions and Macros Available in Reentrant C Scanners + +The following Functions are available in a reentrant scanner: + +@findex yyget_text +@findex yyget_leng +@findex yyget_in +@findex yyget_out +@findex yyget_lineno +@findex yyset_in +@findex yyset_out +@findex yyset_lineno +@findex yyget_debug +@findex yyset_debug +@findex yyget_extra +@findex yyset_extra + +@example +@verbatim + char *yyget_text ( yyscan_t scanner ); + int yyget_leng ( yyscan_t scanner ); + FILE *yyget_in ( yyscan_t scanner ); + FILE *yyget_out ( yyscan_t scanner ); + int yyget_lineno ( yyscan_t scanner ); + YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); + int yyget_debug ( yyscan_t scanner ); + + void yyset_debug ( int flag, yyscan_t scanner ); + void yyset_in ( FILE * in_str , yyscan_t scanner ); + void yyset_out ( FILE * out_str , yyscan_t scanner ); + void yyset_lineno ( int line_number , yyscan_t scanner ); + void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner ); +@end verbatim +@end example + +There are no ``set'' functions for yytext and yyleng. This is intentional. + +The following Macro shortcuts are available in actions in a reentrant +scanner: + +@example +@verbatim + yytext + yyleng + yyin + yyout + yylineno + yyextra + yy_flex_debug +@end verbatim +@end example + +@cindex yylineno, in a reentrant scanner +In a reentrant C scanner, support for yylineno is always present +(i.e., you may access yylineno), but the value is never modified by +@code{flex} unless @code{%option yylineno} is enabled. This is to allow +the user to maintain the line count independently of @code{flex}. 
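
As a hedged illustration of maintaining the line count by hand, the sketch below uses only the accessor functions listed above. It assumes that @code{%option yylineno} is deliberately left off, so that @code{flex} never touches the counter. The rules and the @code{main} function are illustrative only.

```lex
/* Sketch: count lines by hand in a reentrant scanner.
   %option yylineno is deliberately NOT enabled here. */
%option reentrant noyywrap

%%
\n      { yyset_lineno( yyget_lineno( yyscanner ) + 1, yyscanner ); }
.       { /* ignore everything else */ }
%%

int main ( void )
{
    yyscan_t scanner;

    if ( yylex_init( &scanner ) != 0 )
        return 1;

    yylex( scanner );
    printf( "last line number: %d\n", yyget_lineno( scanner ) );
    yylex_destroy( scanner );
    return 0;
}
```

Inside an action the shorter form @code{yylineno++} should be equivalent, since @code{yylineno} is one of the macro shortcuts available there. The accessor form is shown because it also works from functions outside the rules section.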
+ +@anchor{bison-functions} +The following functions and macros are made available when @code{%option +bison-bridge} (@samp{--bison-bridge}) is specified: + +@example +@verbatim + YYSTYPE * yyget_lval ( yyscan_t scanner ); + void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner ); + yylval +@end verbatim +@end example + +The following functions and macros are made available +when @code{%option bison-locations} (@samp{--bison-locations}) is specified: + +@example +@verbatim + YYLTYPE *yyget_lloc ( yyscan_t scanner ); + void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner ); + yylloc +@end verbatim +@end example + +Support for yylval assumes that @code{YYSTYPE} is a valid type. Support for +yylloc assumes that @code{YYLTYPE} is a valid type. Typically, these types are +generated by @code{bison}, and are included in section 1 of the @code{flex} +input. + +@node Lex and Posix, Memory Management, Reentrant, Top +@chapter Incompatibilities with Lex and Posix + +@cindex POSIX and lex +@cindex lex (traditional) and POSIX + +@code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two +implementations do not share any code, though), with some extensions and +incompatibilities, both of which are of concern to those who wish to +write scanners acceptable to both implementations. @code{flex} is fully +compliant with the POSIX @code{lex} specification, except that when +using @code{%pointer} (the default), a call to @code{unput()} destroys +the contents of @code{yytext}, which is counter to the POSIX +specification. In this section we discuss all of the known areas of +incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX +specification. @code{flex}'s @samp{-l} option turns on maximum +compatibility with the original AT&T @code{lex} implementation, at the +cost of a major loss in the generated scanner's performance. We note +below which incompatibilities can be overcome using the @samp{-l} +option.
@code{flex} is fully compatible with @code{lex} with the +following exceptions: + +@itemize +@item +The undocumented @code{lex} scanner internal variable @code{yylineno} is +not supported unless @samp{-l} or @code{%option yylineno} is used. + +@item +@code{yylineno} should be maintained on a per-buffer basis, rather than +a per-scanner (single global variable) basis. + +@item +@code{yylineno} is not part of the POSIX specification. + +@item +The @code{input()} routine is not redefinable, though it may be called +to read characters following whatever has been matched by a rule. If +@code{input()} encounters an end-of-file the normal @code{yywrap()} +processing is done. A ``real'' end-of-file is returned by +@code{input()} as @code{EOF}. + +@item +Input is instead controlled by defining the @code{YY_INPUT()} macro. + +@item +The @code{flex} restriction that @code{input()} cannot be redefined is +in accordance with the POSIX specification, which simply does not +specify any way of controlling the scanner's input other than by making +an initial assignment to @file{yyin}. + +@item +The @code{unput()} routine is not redefinable. This restriction is in +accordance with POSIX. + +@item +@code{flex} scanners are not as reentrant as @code{lex} scanners. In +particular, if you have an interactive scanner and an interrupt handler +which long-jumps out of the scanner, and the scanner is subsequently +called again, you may get the following message: + +@cindex error messages, end of buffer missed +@example +@verbatim + fatal flex scanner internal error--end of buffer missed +@end verbatim +@end example + +To reenter the scanner, first use: + +@cindex restarting the scanner +@example +@verbatim + yyrestart( yyin ); +@end verbatim +@end example + +Note that this call will throw away any buffered input; usually this +isn't a problem with an interactive scanner. @xref{Reentrant}, for +@code{flex}'s reentrant API.
+ +@item +Also note that @code{flex} C++ scanner classes +@emph{are} +reentrant, so if using C++ is an option for you, you should use +them instead. @xref{Cxx}, and @ref{Reentrant} for details. + +@item +@code{output()} is not supported. Output from the @b{ECHO} macro is +done to the file-pointer @code{yyout} (default @file{stdout}). + +@item +@code{output()} is not part of the POSIX specification. + +@item +@code{lex} does not support exclusive start conditions (%x), though they +are in the POSIX specification. + +@item +When definitions are expanded, @code{flex} encloses them in parentheses. +With @code{lex}, the following: + +@cindex name definitions, not POSIX +@example +@verbatim + NAME [A-Z][A-Z0-9]* + %% + foo{NAME}? printf( "Found it\n" ); + %% +@end verbatim +@end example + +will not match the string @samp{foo} because when the macro is expanded +the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence +is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With +@code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?} +and so the string @samp{foo} will match. + +@item +Note that if the definition begins with @samp{^} or ends with @samp{$} +then it is @emph{not} expanded with parentheses, to allow these +operators to appear in definitions without losing their special +meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators +cannot be used in a @code{flex} definition. + +@item +Using @samp{-l} results in the @code{lex} behavior of no parentheses +around the definition. + +@item +The POSIX specification is that the definition be enclosed in parentheses. + +@item +Some implementations of @code{lex} allow a rule's action to begin on a +separate line, if the rule's pattern has trailing whitespace: + +@cindex patterns and actions on different lines +@example +@verbatim + %% + foo|bar<space here> + { foobar_action();} +@end verbatim +@end example + +@code{flex} does not support this feature.
+ +@item +The @code{lex} @code{%r} (generate a Ratfor scanner) option is not +supported. It is not part of the POSIX specification. + +@item +After a call to @code{unput()}, @emph{yytext} is undefined until the +next token is matched, unless the scanner was built using @code{%array}. +This is not the case with @code{lex} or the POSIX specification. The +@samp{-l} option does away with this incompatibility. + +@item +The precedence of the @samp{@{,@}} (numeric range) operator is +different. The AT&T and POSIX specifications of @code{lex} +interpret @samp{abc@{1,3@}} as ``match one, two, +or three occurrences of @samp{abc}'', whereas @code{flex} interprets it +as ``match @samp{ab} followed by one, two, or three occurrences of +@samp{c}''. The @samp{-l} and @samp{--posix} options do away with this +incompatibility. + +@item +The precedence of the @samp{^} operator is different. @code{lex} +interprets @samp{^foo|bar} as ``match either 'foo' at the beginning of a +line, or 'bar' anywhere'', whereas @code{flex} interprets it as ``match +either @samp{foo} or @samp{bar} if they come at the beginning of a +line''. The latter is in agreement with the POSIX specification. + +@item +The special table-size declarations such as @code{%a} supported by +@code{lex} are not required by @code{flex} scanners. @code{flex} +ignores them. +@item +The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be +written for use with either @code{flex} or @code{lex}. Scanners also +include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION} +and @code{YY_FLEX_SUBMINOR_VERSION} +indicating which version of @code{flex} generated the scanner. For +example, for the 2.5.22 release, these defines would be 2, 5 and 22 +respectively. If the version of @code{flex} being used is a beta +version, then the symbol @code{FLEX_BETA} is defined. + +@item +The symbols @samp{[[} and @samp{]]} in the code sections of the input +may conflict with the m4 delimiters. @xref{M4 Dependency}.
+ + +@end itemize + +@cindex POSIX compliance +@cindex non-POSIX features of flex +The following @code{flex} features are not included in @code{lex} or the +POSIX specification: + +@itemize +@item +C++ scanners +@item +%option +@item +start condition scopes +@item +start condition stacks +@item +interactive/non-interactive scanners +@item +yy_scan_string() and friends +@item +yyterminate() +@item +yy_set_interactive() +@item +yy_set_bol() +@item +YY_AT_BOL() +@item +<<EOF>> +@item +<*> +@item +YY_DECL +@item +YY_START +@item +YY_USER_ACTION +@item +YY_USER_INIT +@item +#line directives +@item +%@{@}'s around actions +@item +reentrant C API +@item +multiple actions on a line +@item +almost all of the @code{flex} command-line options +@end itemize + +The feature ``multiple actions on a line'' +refers to the fact that with @code{flex} you can put multiple actions on +the same line, separated with semi-colons, while with @code{lex}, the +following: + +@example +@verbatim + foo handle_foo(); ++num_foos_seen; +@end verbatim +@end example + +is (rather surprisingly) truncated to + +@example +@verbatim + foo handle_foo(); +@end verbatim +@end example + +@code{flex} does not truncate the action. Actions that are not enclosed +in braces are simply terminated at the end of the line. + +@node Memory Management, Serialized Tables, Lex and Posix, Top +@chapter Memory Management + +@cindex memory management +@anchor{memory-management} +This chapter describes how flex handles dynamic memory, and how you can +override the default behavior. + +@menu +* The Default Memory Management:: +* Overriding The Default Memory Management:: +* A Note About yytext And Memory:: +@end menu + +@node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management +@section The Default Memory Management + +Flex allocates dynamic memory during initialization, and once in a while from +within a call to yylex().
Initialization takes place during the first call to +yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a +buffer. As of version 2.5.9 Flex will clean up all memory when you call @code{yylex_destroy}. +@xref{faq-memory-leak}. + +Flex allocates dynamic memory for the purposes listed below. @footnote{The +quantities given here are approximate, and may vary due to host architecture, +compiler configuration, or due to future enhancements to flex.} + +@table @asis + +@item 16kB for the input buffer. +Flex allocates memory for the character buffer used to perform pattern +matching. Flex must read ahead from the input stream and store it in a large +character buffer. This buffer is typically the largest chunk of dynamic memory +flex consumes. This buffer will grow if necessary, doubling the size each time. +Flex frees this memory when you call yylex_destroy(). The default size of this +buffer (16384 bytes) is almost always too large. The ideal size for this +buffer is the length of the longest token expected, in bytes, plus a little more. Flex will allocate a few +extra bytes for housekeeping. Currently, to override the size of the input buffer +you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan +to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management +API. + +@item 64kB for the REJECT state. This will only be allocated if you use REJECT. +The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the +input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well. + +@item 100 bytes for the start condition stack. +Flex allocates memory for the start condition stack. This is the stack used +for pushing start states, i.e., with yy_push_state(). It will grow if +necessary.
Since the states are simply integers, this stack doesn't consume +much memory. This stack is not present if @code{%option stack} is not +specified. You will rarely need to tune this buffer. The ideal size for this +stack is the maximum depth expected. The memory for this stack is +automatically destroyed when you call yylex_destroy(). @xref{option-stack}. + +@item 40 bytes for each YY_BUFFER_STATE. +Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself +is about 40 bytes, plus an additional large character buffer (described above.) +The initial buffer state is created during initialization, and with each call +to yy_create_buffer(). You can't tune the size of this, but you can tune the +character buffer as described above. Any buffer state that you explicitly +create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You +must call yy_delete_buffer() to free the memory. The exception to this rule is +that flex will delete the current buffer automatically when you call +yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. +That way, flex will not try to delete the buffer a second time (possibly +crashing your program!) At the time of this writing, flex does not provide a +growable stack for the buffer states. You have to manage that yourself. +@xref{Multiple Input Buffers}. + +@item 84 bytes for the reentrant scanner guts +Flex allocates about 84 bytes for the reentrant scanner structure when +you call yylex_init(). It is destroyed when the user calls yylex_destroy(). + +@end table + + +@node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management +@section Overriding The Default Memory Management + +@cindex yyalloc, overriding +@cindex yyrealloc, overriding +@cindex yyfree, overriding + +Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree} +when it needs to allocate or free memory. 
By default, these functions are +wrappers around the standard C functions, @code{malloc}, @code{realloc}, and +@code{free}, respectively. You can override the default implementations by telling +flex that you will provide your own implementations. + +To override the default implementations, you must do two things: + +@enumerate + +@item Suppress the default implementations by specifying one or more of the +following options: + +@itemize +@opindex noyyalloc +@item @code{%option noyyalloc} +@item @code{%option noyyrealloc} +@item @code{%option noyyfree}. +@end itemize + +@item Provide your own implementation of the following functions: @footnote{It +is not necessary to override all (or any) of the memory management routines. +You may, for example, override @code{yyrealloc}, but not @code{yyfree} or +@code{yyalloc}.} + +@example +@verbatim +// For a non-reentrant scanner +void * yyalloc (size_t bytes); +void * yyrealloc (void * ptr, size_t bytes); +void yyfree (void * ptr); + +// For a reentrant scanner +void * yyalloc (size_t bytes, void * yyscanner); +void * yyrealloc (void * ptr, size_t bytes, void * yyscanner); +void yyfree (void * ptr, void * yyscanner); +@end verbatim +@end example + +@end enumerate + +In the following example, we will override all three memory routines. We assume +that there is a custom allocator with garbage collection. In order to make this +example interesting, we will use a reentrant scanner, passing a pointer to the +custom allocator through @code{yyextra}. + +@cindex overriding the memory routines +@example +@verbatim +%{ +#include "some_allocator.h" +%} + +/* Suppress the default implementations. */ +%option noyyalloc noyyrealloc noyyfree +%option reentrant + +/* Initialize the allocator. */ +#define YY_EXTRA_TYPE struct allocator* +#define YY_USER_INIT yyextra = allocator_create(); + +%% +.|\n ; +%% + +/* Provide our own implementations. 
*/ +void * yyalloc (size_t bytes, void* yyscanner) { + return allocator_alloc (yyextra, bytes); +} + +void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) { + return allocator_realloc (yyextra, ptr, bytes); +} + +void yyfree (void * ptr, void * yyscanner) { + /* Do nothing -- we leave it to the garbage collector. */ +} + +@end verbatim +@end example + + +@node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management +@section A Note About yytext And Memory + +@cindex yytext, memory considerations + +When flex finds a match, @code{yytext} points to the first character of the +match in the input buffer. The string itself is part of the input buffer, and +is @emph{NOT} allocated separately. The value of yytext will be overwritten the next +time yylex() is called. In short, the value of yytext is only valid from within +the matched rule's action. + +Often, you want the value of yytext to persist for later processing, e.g., by a +parser with non-zero lookahead. In order to preserve yytext, you will have to +copy it with strdup() or a similar function. But this introduces a headache +because your parser is now responsible for freeing the copy of yytext. If you +use a yacc or bison parser (commonly used with flex), you will discover that +the error recovery mechanisms can cause memory to be leaked. + +To prevent memory leaks from strdup'd yytext, you will have to track the memory +somehow. Our experience has shown that a garbage collection mechanism or a +pooled memory mechanism will save you a lot of grief when writing parsers. + +@node Serialized Tables, Diagnostics, Memory Management, Top +@chapter Serialized Tables +@cindex serialization +@cindex memory, serialized tables + +@anchor{serialization} +A @code{flex} scanner has the ability to save the DFA tables to a file, and +load them at runtime when needed. The motivation for this feature is to reduce +the runtime memory footprint.
Traditionally, these tables have been compiled into +the scanner as C arrays, and are sometimes quite large. Since the tables are +compiled into the scanner, the memory used by the tables can never be freed. +This is a waste of memory, especially if an application uses several scanners, +but none of them at the same time. + +The serialization feature allows the tables to be loaded at runtime, before +scanning begins. The tables may be discarded when scanning is finished. + +@menu +* Creating Serialized Tables:: +* Loading and Unloading Serialized Tables:: +* Tables File Format:: +@end menu + +@node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables +@section Creating Serialized Tables +@cindex tables, creating serialized +@cindex serialization of tables + +You may create a scanner with serialized tables by specifying: + +@example +@verbatim + %option tables-file=FILE +or + --tables-file=FILE +@end verbatim +@end example + +These options instruct flex to save the DFA tables to the file @var{FILE}. The tables +will @emph{not} be embedded in the generated scanner. The scanner will not +function on its own. The scanner will be dependent upon the serialized tables. You must +load the tables from this file at runtime before you can scan anything. + +If you do not specify a filename to @code{--tables-file}, the tables will be +saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix. + +If your project uses several different scanners, you can concatenate the +serialized tables into one file, and flex will find the correct set of tables, +using the scanner prefix as part of the lookup key. An example follows: + +@cindex serialized tables, multiple scanners +@example +@verbatim +$ flex --tables-file --prefix=cpp cpp.l +$ flex --tables-file --prefix=c c.l +$ cat lex.cpp.tables lex.c.tables > all.tables +@end verbatim +@end example + +The above example created two scanners, @samp{cpp}, and @samp{c}. 
Since we did +not specify a filename, the tables were serialized to @file{lex.cpp.tables} and +@file{lex.c.tables}, respectively. Then, we concatenated the two files +together into @file{all.tables}, which we will distribute with our project. At +runtime, we will open the file and tell flex to load the tables from it. Flex +will find the correct tables automatically. (See the next section.) + +@node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables +@section Loading and Unloading Serialized Tables +@cindex tables, loading and unloading +@cindex loading tables at runtime +@cindex tables, freeing +@cindex freeing tables +@cindex memory, serialized tables + +If you've built your scanner with @code{%option tables-file}, then you must +load the scanner tables at runtime. This can be accomplished with the following +function: + +@deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}]) +Locates scanner tables in the stream pointed to by @var{fp} and loads them. +Memory for the tables is allocated via @code{yyalloc}. You must call this +function before the first call to @code{yylex}. The argument @var{scanner} +only appears in the reentrant scanner. +This function returns @samp{0} (zero) on success, or non-zero on error. +@end deftypefun + +The loaded tables are @strong{not} automatically destroyed (unloaded) when you +call @code{yylex_destroy}. The reason is that you may create several scanners +of the same type (in a reentrant scanner), each of which needs access to these +tables. To avoid a nasty memory leak, you must call the following function: + +@deftypefun int yytables_destroy ([yyscan_t @var{scanner}]) +Unloads the scanner tables. The tables must be loaded again before you can scan +any more data. The argument @var{scanner} only appears in the reentrant +scanner. This function returns @samp{0} (zero) on success, or non-zero on +error.
+@end deftypefun + +@strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not +thread-safe.} You must ensure that these functions are called exactly once (for +each scanner type) in a threaded program, before any thread calls @code{yylex}. +After the tables are loaded, they are never written to, and no thread +protection is required thereafter -- until you destroy them. + +@node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables +@section Tables File Format +@cindex tables, file format +@cindex file format, serialized tables + +This section defines the file format of serialized @code{flex} tables. + +The tables format allows for one or more sets of tables to be +specified, where each set corresponds to a given scanner. Scanners are +indexed by name, as described below. The file format is as follows: + +@example +@verbatim + TABLE SET 1 + +-------------------------------+ + Header | uint32 th_magic; | + | uint32 th_hsize; | + | uint32 th_ssize; | + | uint16 th_flags; | + | char th_version[]; | + | char th_name[]; | + | uint8 th_pad64[]; | + +-------------------------------+ + Table 1 | uint16 td_id; | + | uint16 td_flags; | + | uint32 td_lolen; | + | uint32 td_hilen; | + | void td_data[]; | + | uint8 td_pad64[]; | + +-------------------------------+ + Table 2 | | + . . . + . . . + . . . + . . . + Table n | | + +-------------------------------+ + TABLE SET 2 + . + . + . + TABLE SET N +@end verbatim +@end example + +The above diagram shows that a complete set of tables consists of a header +followed by multiple individual tables. Furthermore, multiple complete sets may +be present in the same file, each set with its own header and tables. The sets +are contiguous in the file. The only way to know if another set follows is to +check the next four bytes for the magic number (or check for EOF). The header +and tables sections are padded to 64-bit boundaries. Below we describe each +field in detail. 
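As a rough illustration of the layout in the diagram above, the following C sketch decodes the fixed-size leading fields of a table-set header from a byte buffer and uses @code{th_ssize} to hop from one set to the next. It is not the loader that flex generates. The names @code{tbl_hdr}, @code{rd_u16}, @code{rd_u32}, @code{tbl_read_hdr}, and @code{tbl_count_sets} are hypothetical, and the sketch assumes that the four fixed fields appear in the order drawn in the diagram and are stored in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

#define TBL_MAGIC     0xF13C57B1u /* th_magic value given in this section */
#define TBL_FIXED_HDR 14          /* 4 + 4 + 4 + 2 bytes before th_version[] */

/* Hypothetical in-memory view of the fixed-size header fields. */
struct tbl_hdr {
    uint32_t th_magic;
    uint32_t th_hsize;
    uint32_t th_ssize;
    uint16_t th_flags;
};

/* Read big-endian (network order) integers from a byte buffer. */
static uint16_t rd_u16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t rd_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode one header. Returns 0 on success, or -1 if the buffer is too
 * short, the magic number is wrong, or the sizes are inconsistent. */
static int tbl_read_hdr(const unsigned char *buf, size_t len,
                        struct tbl_hdr *h)
{
    if (len < TBL_FIXED_HDR)
        return -1;
    h->th_magic = rd_u32(buf);
    h->th_hsize = rd_u32(buf + 4);
    h->th_ssize = rd_u32(buf + 8);
    h->th_flags = rd_u16(buf + 12);
    if (h->th_magic != TBL_MAGIC)
        return -1;
    if (h->th_hsize < TBL_FIXED_HDR || h->th_ssize < h->th_hsize ||
        h->th_ssize > len)
        return -1;
    return 0;
}

/* Count the contiguous table sets in buf by advancing th_ssize bytes
 * at a time. Returns -1 if any set is malformed. */
static int tbl_count_sets(const unsigned char *buf, size_t len)
{
    int n = 0;
    while (len > 0) {
        struct tbl_hdr h;
        if (tbl_read_hdr(buf, len, &h) != 0)
            return -1;
        buf += h.th_ssize;
        len -= h.th_ssize;
        n++;
    }
    return n;
}
```

A real loader would go on to read @code{th_version[]} and @code{th_name[]} and compare the name against the scanner prefix before expanding any tables. The sketch stops at the fixed fields because they are enough to walk the file.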
This format does not specify how the scanner will expand the +given data, i.e., data may be serialized as int8, but expanded to an int32 +array at runtime. This is to reduce the size of the serialized data where +possible. Remember, @emph{all integer values are in network byte order}. + +@noindent +Fields of a table header: + +@table @code +@item th_magic +Magic number, always 0xF13C57B1. + +@item th_hsize +Size of this entire header, in bytes, including all fields plus any padding. + +@item th_ssize +Size of this entire set, in bytes, including the header, all tables, plus +any padding. + +@item th_flags +Bit flags for this table set. Currently unused. + +@item th_version[] +Flex version in NULL-terminated string format. e.g., @samp{2.5.13a}. This is +the version of flex that was used to create the serialized tables. + +@item th_name[] +Contains the name of this table set. The default is @samp{yytables}, +and is prefixed accordingly, e.g., @samp{footables}. Must be NULL-terminated. + +@item th_pad64[] +Zero or more NULL bytes, padding the entire header to the next 64-bit boundary +as calculated from the beginning of the header. +@end table + +@noindent +Fields of a table: + +@table @code +@item td_id +Specifies the table identifier. Possible values are: +@table @code +@item YYTD_ID_ACCEPT (0x01) +@code{yy_accept} +@item YYTD_ID_BASE (0x02) +@code{yy_base} +@item YYTD_ID_CHK (0x03) +@code{yy_chk} +@item YYTD_ID_DEF (0x04) +@code{yy_def} +@item YYTD_ID_EC (0x05) +@code{yy_ec } +@item YYTD_ID_META (0x06) +@code{yy_meta} +@item YYTD_ID_NUL_TRANS (0x07) +@code{yy_NUL_trans} +@item YYTD_ID_NXT (0x08) +@code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen} +field below. +@item YYTD_ID_RULE_CAN_MATCH_EOL (0x09) +@code{yy_rule_can_match_eol} +@item YYTD_ID_START_STATE_LIST (0x0A) +@code{yy_start_state_list}. This array is handled specially because it is an +array of pointers to structs. See the @code{td_flags} field below. 
+@item YYTD_ID_TRANSITION (0x0B) +@code{yy_transition}. This array is handled specially because it is an array of +structs. See the @code{td_lolen} field below. +@item YYTD_ID_ACCLIST (0x0C) +@code{yy_acclist} +@end table + +@item td_flags +Bit flags describing how to interpret the data in @code{td_data}. +The data arrays are one-dimensional by default, but may be +two dimensional as specified in the @code{td_hilen} field. + +@table @code +@item YYTD_DATA8 (0x01) +The data is serialized as an array of type int8. +@item YYTD_DATA16 (0x02) +The data is serialized as an array of type int16. +@item YYTD_DATA32 (0x04) +The data is serialized as an array of type int32. +@item YYTD_PTRANS (0x08) +The data is a list of indexes of entries in the expanded @code{yy_transition} +array. Each index should be expanded to a pointer to the corresponding entry +in the @code{yy_transition} array. We count on the fact that the +@code{yy_transition} array has already been seen. +@item YYTD_STRUCT (0x10) +The data is a list of yy_trans_info structs, each of which consists of +two integers. There is no padding between struct elements or between structs. +The type of each member is determined by the @code{YYTD_DATA*} bits. +@end table + +@item td_lolen +Specifies the number of elements in the lowest dimension array. If this is +a one-dimensional array, then it is simply the number of elements in this array. +The element size is determined by the @code{td_flags} field. + +@item td_hilen +If @code{td_hilen} is non-zero, then the data is a two-dimensional array. +Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the +number of elements in the higher dimensional array, and @code{td_lolen} contains +the number of elements in the lowest dimension. + +Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or +@code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified +by the @code{td_flags} field. 
It is possible for both @code{td_lolen} and +@code{td_hilen} to be zero, in which case @code{td_data} is a zero length +array, and no data is loaded, i.e., this table is simply skipped. Flex does not +currently generate tables of zero length. + +@item td_data[] +The table data. This array may be a one- or two-dimensional array, of type +@code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or +@code{struct yy_trans_info*}, depending upon the values in the +@code{td_flags}, @code{td_lolen}, and @code{td_hilen} fields. + +@item td_pad64[] +Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as +calculated from the beginning of this table. +@end table + +@node Diagnostics, Limitations, Serialized Tables, Top +@chapter Diagnostics + +@cindex error reporting, diagnostic messages +@cindex warnings, diagnostic messages + +The following is a list of @code{flex} diagnostic messages: + +@itemize +@item +@samp{warning, rule cannot be matched} indicates that the given rule +cannot be matched because it follows other rules that will always match +the same text as it. For example, in the following @samp{foo} cannot be +matched because it comes after an identifier ``catch-all'' rule: + +@cindex warning, rule cannot be matched +@example +@verbatim + [a-z]+ got_identifier(); + foo got_foo(); +@end verbatim +@end example + +Using @code{REJECT} in a scanner suppresses this warning. + +@item +@samp{warning, -s option given but default rule can be matched} means +that it is possible (perhaps only in a particular start condition) that +the default rule (match any single character) is the only one that will +match a particular input. Since @samp{-s} was given, presumably this is +not intended. + +@item +@code{reject_used_but_not_detected undefined} or +@code{yymore_used_but_not_detected undefined}. These errors can occur +at compile time. 
They indicate that the scanner uses @code{REJECT} or +@code{yymore()} but that @code{flex} failed to notice the fact, meaning +that @code{flex} scanned the first two sections looking for occurrences +of these actions and failed to find any, but somehow you snuck some in +(via a #include file, for example). Use @code{%option reject} or +@code{%option yymore} to indicate to @code{flex} that you really do use +these features. + +@item +@samp{flex scanner jammed}. A scanner compiled with +@samp{-s} has encountered an input string that wasn't matched by any of +its rules. This error can also occur due to internal problems. + +@item +@samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array} +and one of its rules matched a string longer than the @code{YYLMAX} +constant (8K bytes by default). You can increase the value by +#define'ing @code{YYLMAX} in the definitions section of your @code{flex} +input. + +@item +@samp{scanner requires -8 flag to use the character 'x'}. Your scanner +specification includes recognizing the 8-bit character @samp{'x'} and +you did not specify the -8 flag, and your scanner defaulted to 7-bit +because you used the @samp{-Cf} or @samp{-CF} table compression options. +See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for +details. + +@item +@samp{flex scanner push-back overflow}. You used @code{unput()} to push +back so much text that the scanner's buffer could not hold both the +pushed-back text and the current token in @code{yytext}. Ideally the +scanner should dynamically resize the buffer in this case, but at +present it does not. + +@item +@samp{input buffer overflow, can't enlarge buffer because scanner uses +REJECT}. The scanner was working on matching an extremely large token +and needed to expand the input buffer. This doesn't work with scanners +that use @code{REJECT}. + +@item +@samp{fatal flex scanner internal error--end of buffer missed}.
This can +occur in a scanner which is reentered after a long-jump has jumped out +(or over) the scanner's activation frame. Before reentering the +scanner, use: +@example +@verbatim + yyrestart( yyin ); +@end verbatim +@end example +or, as noted above, switch to using the C++ scanner class. + +@item +@samp{too many start conditions in <> construct!} you listed more start +conditions in a <> construct than exist (so you must have listed at +least one of them twice). +@end itemize + +@node Limitations, Bibliography, Diagnostics, Top +@chapter Limitations + +@cindex limitations of flex + +Some trailing context patterns cannot be properly matched and generate +warning messages (@samp{dangerous trailing context}). These are +patterns where the ending of the first part of the rule matches the +beginning of the second part, such as @samp{zx*/xy*}, where the 'x*' +matches the 'x' at the beginning of the trailing context. (Note that +the POSIX draft states that the text matched by such patterns is +undefined.) For some trailing context rules, parts which are actually +fixed-length are not recognized as such, leading to the abovementioned +performance loss. In particular, parts using @samp{|} or @samp{@{n@}} +(such as @samp{foo@{3@}}) are always considered variable-length. +Combining trailing context with the special @samp{|} action can result +in @emph{fixed} trailing context being turned into the more expensive +@emph{variable} trailing context. For example, in the following: + +@cindex warning, dangerous trailing context +@example +@verbatim + %% + abc | + xyz/def +@end verbatim +@end example + +Use of @code{unput()} invalidates yytext and yyleng, unless the +@code{%array} directive or the @samp{-l} option has been used. +Pattern-matching of @code{NUL}s is substantially slower than matching +other characters. Dynamic resizing of the input buffer is slow, as it +entails rescanning all the text matched so far by the current (generally +huge) token. 
Due to both buffering of input and read-ahead, you cannot +intermix calls to @file{<stdio.h>} routines, such as @code{getchar()}, +with @code{flex} rules and expect it to work. Call @code{input()} +instead. The total number of table entries listed by the @samp{-v} flag excludes +the table entries needed to determine which rule has been +matched. The number of these excluded entries is equal to the number of DFA states if +the scanner does not use @code{REJECT}, and somewhat greater than the +number of states if it does. @code{REJECT} cannot be used with the +@samp{-f} or @samp{-F} options. + +The @code{flex} internal algorithms need documentation. + +@node Bibliography, FAQ, Limitations, Top +@chapter Additional Reading + +You may wish to read more about the following programs: +@itemize +@item lex +@item yacc +@item sed +@item awk +@end itemize + +The following books may contain material of interest: + +John Levine, Tony Mason, and Doug Brown, +@emph{Lex & Yacc}, +O'Reilly and Associates. Be sure to get the 2nd edition. + +M. E. Lesk and E. Schmidt, +@emph{LEX -- Lexical Analyzer Generator} + +Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles, +Techniques and Tools}, Addison-Wesley (1986). Describes the +pattern-matching techniques used by @code{flex} (deterministic finite +automata). + +@node FAQ, Appendices, Bibliography, Top +@unnumbered FAQ + +From time to time, the @code{flex} maintainer receives certain +questions. Rather than repeat answers to well-understood problems, we +publish them here.
+ +@menu +* When was flex born?:: +* How do I expand backslash-escape sequences in C-style quoted strings?:: +* Why do flex scanners call fileno if it is not ANSI compatible?:: +* Does flex support recursive pattern definitions?:: +* How do I skip huge chunks of input (tens of megabytes) while using flex?:: +* Flex is not matching my patterns in the same order that I defined them.:: +* My actions are executing out of order or sometimes not at all.:: +* How can I have multiple input sources feed into the same scanner at the same time?:: +* Can I build nested parsers that work with the same input file?:: +* How can I match text only at the end of a file?:: +* How can I make REJECT cascade across start condition boundaries?:: +* Why cant I use fast or full tables with interactive mode?:: +* How much faster is -F or -f than -C?:: +* If I have a simple grammar cant I just parse it with flex?:: +* Why doesn't yyrestart() set the start state back to INITIAL?:: +* How can I match C-style comments?:: +* The period isn't working the way I expected.:: +* Can I get the flex manual in another format?:: +* Does there exist a "faster" NDFA->DFA algorithm?:: +* How does flex compile the DFA so quickly?:: +* How can I use more than 8192 rules?:: +* How do I abandon a file in the middle of a scan and switch to a new file?:: +* How do I execute code only during initialization (only before the first scan)?:: +* How do I execute code at termination?:: +* Where else can I find help?:: +* Can I include comments in the "rules" section of the file?:: +* I get an error about undefined yywrap().:: +* How can I change the matching pattern at run time?:: +* How can I expand macros in the input?:: +* How can I build a two-pass scanner?:: +* How do I match any string not matched in the preceding rules?:: +* I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: +* Is there a way to make flex treat NULL like a regular character?:: +* Whenever flex can not match the input it says 
"flex scanner jammed".:: +* Why doesn't flex have non-greedy operators like perl does?:: +* Memory leak - 16386 bytes allocated by malloc.:: +* How do I track the byte offset for lseek()?:: +* How do I use my own I/O classes in a C++ scanner?:: +* How do I skip as many chars as possible?:: +* deleteme00:: +* Are certain equivalent patterns faster than others?:: +* Is backing up a big deal?:: +* Can I fake multi-byte character support?:: +* deleteme01:: +* Can you discuss some flex internals?:: +* unput() messes up yy_at_bol:: +* The | operator is not doing what I want:: +* Why can't flex understand this variable trailing context pattern?:: +* The ^ operator isn't working:: +* Trailing context is getting confused with trailing optional patterns:: +* Is flex GNU or not?:: +* ERASEME53:: +* I need to scan if-then-else blocks and while loops:: +* ERASEME55:: +* ERASEME56:: +* ERASEME57:: +* Is there a repository for flex scanners?:: +* How can I conditionally compile or preprocess my flex input file?:: +* Where can I find grammars for lex and yacc?:: +* I get an end-of-buffer message for each character scanned.:: +* unnamed-faq-62:: +* unnamed-faq-63:: +* unnamed-faq-64:: +* unnamed-faq-65:: +* unnamed-faq-66:: +* unnamed-faq-67:: +* unnamed-faq-68:: +* unnamed-faq-69:: +* unnamed-faq-70:: +* unnamed-faq-71:: +* unnamed-faq-72:: +* unnamed-faq-73:: +* unnamed-faq-74:: +* unnamed-faq-75:: +* unnamed-faq-76:: +* unnamed-faq-77:: +* unnamed-faq-78:: +* unnamed-faq-79:: +* unnamed-faq-80:: +* unnamed-faq-81:: +* unnamed-faq-82:: +* unnamed-faq-83:: +* unnamed-faq-84:: +* unnamed-faq-85:: +* unnamed-faq-86:: +* unnamed-faq-87:: +* unnamed-faq-88:: +* unnamed-faq-90:: +* unnamed-faq-91:: +* unnamed-faq-92:: +* unnamed-faq-93:: +* unnamed-faq-94:: +* unnamed-faq-95:: +* unnamed-faq-96:: +* unnamed-faq-97:: +* unnamed-faq-98:: +* unnamed-faq-99:: +* unnamed-faq-100:: +* unnamed-faq-101:: +* What is the difference between YYLEX_PARAM and YY_DECL?:: +* Why do I get "conflicting 
types for yylex" error?:: +* How do I access the values set in a Flex action from within a Bison action?:: +@end menu + +@node When was flex born? +@unnumberedsec When was flex born? + +Vern Paxson took over +the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it +was written in Ratfor. Around 1987 or so, Paxson translated it into C, and +a legend was born :-). + +@node How do I expand backslash-escape sequences in C-style quoted strings? +@unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings? + +A key point when scanning quoted strings is that you cannot (easily) write +a single rule that will precisely match the string if you allow things +like embedded escape sequences and newlines. If you try to match strings +with a single rule then you'll wind up having to rescan the string anyway +to find any escape sequences. + +Instead you can use exclusive start conditions and a set of rules, one for +matching non-escaped text, one for matching a single escape, one for +matching an embedded newline, and one for recognizing the end of the +string. Each of these rules is then faced with the question of where to +put its intermediary results. The best solution is for the rules to +append their local value of @code{yytext} to the end of a ``string literal'' +buffer. A rule like the escape-matcher will append to the buffer the +meaning of the escape sequence rather than the literal text in @code{yytext}. +In this way, @code{yytext} does not need to be modified at all. + +@node Why do flex scanners call fileno if it is not ANSI compatible? +@unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible? + +Flex scanners call @code{fileno()} in order to get the file descriptor +corresponding to @code{yyin}. The file descriptor may be passed to +@code{isatty()} or @code{read()}, depending upon which @code{%options} you specified. 
+If your system does not have @code{fileno()} support, to get rid of the +@code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()} +call, you must specify one of @code{%option always-interactive} or +@code{%option never-interactive}. + +@node Does flex support recursive pattern definitions? +@unnumberedsec Does flex support recursive pattern definitions? + +e.g., + +@example +@verbatim +%% +block "{"({block}|{statement})*"}" +@end verbatim +@end example + +No. You cannot have recursive definitions. The pattern-matching power of +regular expressions in general (and therefore flex scanners, too) is +limited. In particular, regular expressions cannot ``balance'' parentheses +to an arbitrary degree. For example, it's impossible to write a regular +expression that matches all strings containing the same number of '@{'s +as '@}'s. For more powerful pattern matching, you need a parser, such +as @cite{GNU bison}. + +@node How do I skip huge chunks of input (tens of megabytes) while using flex? +@unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex? + +Use @code{fseek()} (or @code{lseek()}) to position yyin, then call @code{yyrestart()}. + +@node Flex is not matching my patterns in the same order that I defined them. +@unnumberedsec Flex is not matching my patterns in the same order that I defined them. + +@code{flex} picks the +rule that matches the most text (i.e., the longest possible input string). +This is because @code{flex} uses an entirely different matching technique +(``deterministic finite automata'') that actually does all of the matching +simultaneously, in parallel. (Seems impossible, but it's actually a fairly +simple technique once you understand the principles.) + +A side-effect of this parallel matching is that when the input matches more +than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. 
This +is explained further in @ref{Matching}. + +If you want @code{flex} to choose a shorter match, then you can work around this +behavior by expanding your short +rule to match more text, then putting back the extra: + +@example +@verbatim +data_.* yyless( 5 ); BEGIN BLOCKIDSTATE; +@end verbatim +@end example + +Another fix would be to make the second rule active only during the +@code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive +by declaring it with @code{%x} instead of @code{%s}. + +A final fix is to change the input language so that the ambiguity for +@samp{data_} is removed, by adding characters to it that don't match the +identifier rule, or by removing characters (such as @samp{_}) from the +identifier rule so it no longer matches @samp{data_}. (Of course, you might +also not have the option of changing the input language.) + +@node My actions are executing out of order or sometimes not at all. +@unnumberedsec My actions are executing out of order or sometimes not at all. + +Most likely, you have (in error) placed the opening @samp{@{} of the action +block on a different line than the rule, e.g., + +@example +@verbatim +^(foo|bar) +{ <<<--- WRONG! + +} +@end verbatim +@end example + +@code{flex} requires that the opening @samp{@{} of an action associated with a rule +begin on the same line as does the rule. You need instead to write your rules +as follows: + +@example +@verbatim +^(foo|bar) { // CORRECT! + +} +@end verbatim +@end example + +@node How can I have multiple input sources feed into the same scanner at the same time? +@unnumberedsec How can I have multiple input sources feed into the same scanner at the same time?
+ +If @dots{} +@itemize +@item +your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag), +@item +AND you run your scanner interactively (@samp{-I} option; default unless using special table +compression options), +@item +AND you feed it one character at a time by redefining @code{YY_INPUT} to do so, +@end itemize + +then every time it matches a token, it will have exhausted its input +buffer (because the scanner is free of backtracking). This means you +can safely use @code{select()} at that point and only call @code{yylex()} for another +token if @code{select()} indicates there's data available. + +That is, move the @code{select()} out from the input function to a point where +it determines whether @code{yylex()} gets called for the next token. + +With this approach, you will still have problems if your input can arrive +piecemeal; @code{select()} could inform you that the beginning of a token is +available, you call @code{yylex()} to get it, but it winds up blocking waiting +for the later characters in the token. + +Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That +is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is +available. If input is available for the scanner, it reads and returns the +next byte. If input is available from another source, it calls whatever +function is responsible for reading from that source. (If no input is +available, it blocks until some input is available.) I've used this technique in an +interpreter I wrote that both reads keyboard input using a @code{flex} scanner and +IPC traffic from sockets, and it works fine. + +@node Can I build nested parsers that work with the same input file? +@unnumberedsec Can I build nested parsers that work with the same input file? + +This is not going to work without some additional effort. The reason is +that @code{flex} block-buffers the input it reads from @code{yyin}.
This means that the +``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K +of input available on yyin, and subsequent calls to other @code{yylex()}'s won't +see that input. You might be tempted to work around this problem by +redefining @code{YY_INPUT} to only return a small amount of text, but it turns out +that that approach is quite difficult. Instead, the best solution is to +combine all of your scanners into one large scanner, using a different +exclusive start condition for each. + +@node How can I match text only at the end of a file? +@unnumberedsec How can I match text only at the end of a file? + +There is no way to write a rule which is ``match this text, but only if +it comes at the end of the file''. You can fake it, though, if you happen +to have a character lying around that you don't allow in your input. +Then you redefine @code{YY_INPUT} to call your own routine which, if it sees +an @samp{EOF}, returns the magic character first (and remembers to return a +real @code{EOF} next time it's called). Then you could write: + +@example +@verbatim +<COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */ +@end verbatim +@end example + +@node How can I make REJECT cascade across start condition boundaries? +@unnumberedsec How can I make REJECT cascade across start condition boundaries? + +You can do this as follows. Suppose you have a start condition @samp{A}, and +after exhausting all of the possible matches in @samp{<A>}, you want to try +matches in @samp{<INITIAL>}. Then you could use the following: + +@example +@verbatim +%x A +%% +<A>rule_that_is_long ...; REJECT; +<A>rule ...; REJECT; /* shorter rule */ +<A>etc. +... +<A>.|\n { +/* Shortest and last rule in <A>, so +* cascaded REJECTs will eventually +* wind up matching this rule. We want +* to now switch to the initial state +* and try matching from there instead. 
+*/ +yyless(0); /* put back matched text */ +BEGIN(INITIAL); +} +@end verbatim +@end example + +@node Why cant I use fast or full tables with interactive mode? +@unnumberedsec Why can't I use fast or full tables with interactive mode? + +One of the assumptions +flex makes is that interactive applications are inherently slow (they're +waiting on a human after all). +It has to do with how the scanner detects that it must be finished scanning +a token. For interactive scanners, after scanning each character the current +state is looked up in a table (essentially) to see whether there's a chance +of another input character possibly extending the length of the match. If +not, the scanner halts. For non-interactive scanners, the end-of-token test +is much simpler, basically a compare with 0, so no memory bus cycles. Since +the test occurs in the innermost scanning loop, one would like to make it go +as fast as possible. + +Still, it seems reasonable to allow the user to choose to trade off a bit +of performance in this area to gain the corresponding flexibility. There +might be another reason, though, why fast scanners don't support the +interactive option. + +@node How much faster is -F or -f than -C? +@unnumberedsec How much faster is -F or -f than -C? + +Much faster (factor of 2-3). + +@node If I have a simple grammar cant I just parse it with flex? +@unnumberedsec If I have a simple grammar can't I just parse it with flex? + +Is your grammar recursive? That's almost always a sign that you're +better off using a parser/scanner rather than just trying to use a scanner +alone. + +@node Why doesn't yyrestart() set the start state back to INITIAL? +@unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL? + +There are two reasons. The first is that there might +be programs that rely on the start state not changing across file changes. 
+The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required, +so fixing the problem there doesn't solve the more general problem. + +@node How can I match C-style comments? +@unnumberedsec How can I match C-style comments? + +You might be tempted to try something like this: + +@example +@verbatim +"/*".*"*/" // WRONG! +@end verbatim +@end example + +or, worse, this: + +@example +@verbatim +"/*"(.|\n)*"*/" // WRONG! +@end verbatim +@end example + +The above rules will eat too much input, and blow up on things like: + +@example +@verbatim +/* a comment */ do_my_thing( "oops */" ); +@end verbatim +@end example + +Here is one way which allows you to track line information: + +@example +@verbatim +<INITIAL>{ +"/*" BEGIN(IN_COMMENT); +} +<IN_COMMENT>{ +"*/" BEGIN(INITIAL); +[^*\n]+ // eat comment in chunks +"*" // eat the lone star +\n yylineno++; +} +@end verbatim +@end example + +@node The period isn't working the way I expected. +@unnumberedsec The '.' isn't working the way I expected. + +Here are some tips for using @samp{.}: + +@itemize +@item +A common mistake is to place the grouping parentheses AFTER an operator, when +you really meant to place the parentheses BEFORE the operator, e.g., you +probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}. + +The first pattern matches the words @samp{foo} or @samp{bar} any number of +times, e.g., it matches the text @samp{barfoofoobarfoo}. The +second pattern matches a single instance of @code{foo} or +@code{ba} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}. +@item +A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period), +and NOT ``any character except newline''. +@item +Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}).
+If you really want to match ANY character, including newlines, then use @code{(.|\n)} +Beware that the regex @code{(.|\n)+} will match your entire input! +@item +Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."} +@end itemize + +@node Can I get the flex manual in another format? +@unnumberedsec Can I get the flex manual in another format? + +The @code{flex} source distribution includes a texinfo manual. You are +free to convert that texinfo into whatever format you desire. The +@code{texinfo} package includes tools for conversion to a number of formats. + +@node Does there exist a "faster" NDFA->DFA algorithm? +@unnumberedsec Does there exist a "faster" NDFA->DFA algorithm? + +There's no way around the potential exponential running time - it +can take you exponential time just to enumerate all of the DFA states. +In practice, though, the running time is closer to linear, or sometimes +quadratic. + +@node How does flex compile the DFA so quickly? +@unnumberedsec How does flex compile the DFA so quickly? + +There are two big speed wins that @code{flex} uses: + +@enumerate +@item +It analyzes the input rules to construct equivalence classes for those +characters that always make the same transitions. It then rewrites the NFA +using equivalence classes for transitions instead of characters. This cuts +down the NFA->DFA computation time dramatically, to the point where, for +uncompressed DFA tables, the DFA generation is often I/O bound in writing out +the tables. +@item +It maintains hash values for previously computed DFA states, so testing +whether a newly constructed DFA state is equivalent to a previously constructed +state can be done very quickly, by first comparing hash values. +@end enumerate + +@node How can I use more than 8192 rules? +@unnumberedsec How can I use more than 8192 rules? + +@code{Flex} is compiled with an upper limit of 8192 rules per scanner. 
+If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex} +with the following changes in @file{flexdef.h}: + +@example +@verbatim +< #define YY_TRAILING_MASK 0x2000 +< #define YY_TRAILING_HEAD_MASK 0x4000 +-- +> #define YY_TRAILING_MASK 0x20000000 +> #define YY_TRAILING_HEAD_MASK 0x40000000 +@end verbatim +@end example + +This should work okay as long as your C compiler uses 32-bit integers. +But you might want to think about whether using such a huge number of rules +is the best way to solve your problem. + +The following may also be relevant: + +With luck, you should be able to increase the definitions in @file{flexdef.h} for: + +@example +@verbatim +#define JAMSTATE -32766 /* marks a reference to the state that always jams */ +#define MAXIMUM_MNS 31999 +#define BAD_SUBSCRIPT -32767 +@end verbatim +@end example + +recompile everything, and it'll all work. Flex only has these 16-bit-like +values built into it because a long time ago it was developed on a machine +with 16-bit ints. I've given this advice to others in the past but haven't +heard back from them whether it worked okay or not... + +@node How do I abandon a file in the middle of a scan and switch to a new file? +@unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file? + +Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a +``fresh start'', since @code{yyrestart()} does NOT reset the start state back to @code{INITIAL}. + +@node How do I execute code only during initialization (only before the first scan)? +@unnumberedsec How do I execute code only during initialization (only before the first scan)? + +You can specify an initial action by defining the macro @code{YY_USER_INIT} (though +note that @code{yyout} may not be available at the time this macro is executed). Or you +can add to the beginning of your rules section: + +@example +@verbatim +%% + /* Must be indented! */ + static int did_init = 0; + + if ( !
did_init ){ +do_my_init(); + did_init = 1; + } +@end verbatim +@end example + +@node How do I execute code at termination? +@unnumberedsec How do I execute code at termination? + +You can specify an action for the @code{<<EOF>>} rule. + +@node Where else can I find help? +@unnumberedsec Where else can I find help? + +You can find the flex homepage on the web at +@uref{http://flex.sourceforge.net/}. See that page for details about flex +mailing lists as well. + +@node Can I include comments in the "rules" section of the file? +@unnumberedsec Can I include comments in the "rules" section of the file? + +Yes, just about anywhere you want to. See the manual for the specific syntax. + +@node I get an error about undefined yywrap(). +@unnumberedsec I get an error about undefined yywrap(). + +You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a} +(which provides one), or use + +@example +@verbatim +%option noyywrap +@end verbatim +@end example + +in your source to say you don't want a @code{yywrap()} function. + +@node How can I change the matching pattern at run time? +@unnumberedsec How can I change the matching pattern at run time? + +You can't, it's compiled into a static table when flex builds the scanner. + +@node How can I expand macros in the input? +@unnumberedsec How can I expand macros in the input? + +The best way to approach this problem is at a higher level, e.g., in the parser. + +However, you can do this using multiple input buffers. + +@example +@verbatim +%% +macro/[a-z]+ { +/* Saw the macro "macro" followed by extra stuff. */ +main_buffer = YY_CURRENT_BUFFER; +expansion_buffer = yy_scan_string(expand(yytext)); +yy_switch_to_buffer(expansion_buffer); +} + +<<EOF>> { +if ( expansion_buffer ) +{ +// We were doing an expansion, return to where +// we were. 
+yy_switch_to_buffer(main_buffer); +yy_delete_buffer(expansion_buffer); +expansion_buffer = 0; +} +else +yyterminate(); +} +@end verbatim +@end example + +You probably will want a stack of expansion buffers to allow nested macros. +Hopefully, though, the idea is clear from the above. + +@node How can I build a two-pass scanner? +@unnumberedsec How can I build a two-pass scanner? + +One way to do it is to filter the first pass to a temporary file, +then process the temporary file on the second pass. You will probably see a +performance hit, due to all the disk I/O. + +When you need to look far ahead like this, it almost always means +that the right solution is to build a parse tree of the entire input, then +walk it after the parse in order to generate the output. In a sense, this +is a two-pass approach, once through the text and once through the parse +tree, but the performance hit for the latter is usually an order of magnitude +smaller, since everything is already classified, in binary format, and +residing in memory. + +@node How do I match any string not matched in the preceding rules? +@unnumberedsec How do I match any string not matched in the preceding rules? + +One way to assign precedence is to place the more specific rules first. If +two rules would match the same input (same sequence of characters) then the +first rule listed in the @code{flex} input wins, e.g., + +@example +@verbatim +%% +foo[a-zA-Z_]+ return FOO_ID; +bar[a-zA-Z_]+ return BAR_ID; +[a-zA-Z_]+ return GENERIC_ID; +@end verbatim +@end example + +Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the +same amount of text as the more specific rules, and in that case the +@code{flex} scanner will pick the first rule listed in your scanner as the +one to match. + +@node I am trying to port code from AT&T lex that uses yysptr and yysbuf. +@unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf.
+ +Those are internal variables pointing into the AT&T scanner's input buffer. I +imagine they're being manipulated in user versions of the @code{input()} and @code{unput()} +functions. If so, what you need to do is analyze those functions to figure out +what they're doing, and then replace @code{input()} with an appropriate definition of +@code{YY_INPUT}. You shouldn't need to (and must not) replace +@code{flex}'s @code{unput()} function. + +@node Is there a way to make flex treat NULL like a regular character? +@unnumberedsec Is there a way to make flex treat NULL like a regular character? + +Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient +version of @code{flex}. The latest release is version @value{VERSION}. + +@node Whenever flex can not match the input it says "flex scanner jammed". +@unnumberedsec Whenever flex can not match the input it says "flex scanner jammed". + +You need to add a rule that matches the otherwise-unmatched text, +e.g., + +@example +@verbatim +%option yylineno +%% +[[a bunch of rules here]] + +. printf("bad input character '%s' at line %d\n", yytext, yylineno); +@end verbatim +@end example + +See @code{%option default} for more information. + +@node Why doesn't flex have non-greedy operators like perl does? +@unnumberedsec Why doesn't flex have non-greedy operators like perl does? + +A DFA can do a non-greedy match by stopping +the first time it enters an accepting state, instead of consuming input until +it determines that no further matching is possible (a ``jam'' state). This +is actually easier to implement than longest leftmost match (which flex does). + +But it's also much less useful than longest leftmost match. In general, +when you find yourself wishing for non-greedy matching, that's usually a +sign that you're trying to make the scanner do some parsing. That's +generally the wrong approach, since it lacks the power to do a decent job. 
+Better is to either introduce a separate parser, or to split the scanner +into multiple scanners using (exclusive) start conditions. + +You might have +a separate start state once you've seen the @samp{BEGIN}. In that state, you +might then have a regex that will match @samp{END} (to kick you out of the +state), and perhaps @samp{(.|\n)} to get a single character within the chunk ... + +This approach also has much better error-reporting properties. + +@node Memory leak - 16386 bytes allocated by malloc. +@unnumberedsec Memory leak - 16386 bytes allocated by malloc. +@anchor{faq-memory-leak} + +UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not +call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read +on. + +The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and +about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in +the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++ +scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed. + +However, the leak won't multiply since the buffer is reused no matter how many +times you call @code{yylex()}. + +If you want to reclaim the memory when you are completely done scanning, then +you might try this: + +@example +@verbatim +/* For non-reentrant C scanner only. */ +yy_delete_buffer(YY_CURRENT_BUFFER); +yy_init = 1; +@end verbatim +@end example + +Note: @code{yy_init} is an "internal variable", and hasn't been tested in this +situation. It is possible that some other globals may need resetting as well. + +@node How do I track the byte offset for lseek()? +@unnumberedsec How do I track the byte offset for lseek()? 
+ +@example +@verbatim +> We thought that it would be possible to have this number through the +> evaluation of the following expression: +> +> seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf +@end verbatim +@end example + +While this is the right idea, it has two problems. The first is that +it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during +an invocation of @code{YY_INPUT} (or that your input source will return less +even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem +is that when refilling its internal buffer, @code{flex} keeps some characters +from the previous buffer (because usually it's in the middle of a match, +and needs those characters to construct @code{yytext} for the match once it's +done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't +be exactly the number of characters already read from the current buffer. + +An alternative solution is to count the number of characters you've matched +since starting to scan. This can be done by using @code{YY_USER_ACTION}. For +example, + +@example +@verbatim +#define YY_USER_ACTION num_chars += yyleng; +@end verbatim +@end example + +(You need to be careful to update your bookkeeping if you use @code{yymore()}, +@code{yyless()}, @code{unput()}, or @code{input()}.) + +@node How do I use my own I/O classes in a C++ scanner? +@section How do I use my own I/O classes in a C++ scanner? + +When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier.
+ +@cindex LexerOutput, overriding +@cindex LexerInput, overriding +@cindex overriding LexerOutput +@cindex overriding LexerInput +@cindex customizing I/O in C++ scanners +@cindex C++ I/O, customizing +You can do this by passing the various functions (such as @code{LexerInput()} +and @code{LexerOutput()}) NULL @code{iostream*}'s, and then +dealing with your own I/O classes surreptitiously (i.e., stashing them in +special member variables). This works because the only assumption the +lexer makes about the iostreams is that they're +ultimately passed to @code{LexerInput()} and @code{LexerOutput()}, which then do whatever +is necessary with them. + +@c faq edit stopped here +@node How do I skip as many chars as possible? +@unnumberedsec How do I skip as many chars as possible? + +How do I skip as many chars as possible -- without interfering with the other +patterns? + +In the example below, we want to skip over characters until we see the phrase +@samp{endskip}. The following will @emph{NOT} work correctly (do you see why not?): + +@example +@verbatim +/* INCORRECT SCANNER */ +%x SKIP +%% +<INITIAL>startskip BEGIN(SKIP); +... +<SKIP>"endskip" BEGIN(INITIAL); +<SKIP>.* ; +@end verbatim +@end example + +The problem is that the pattern @code{.*} will eat up the word @samp{endskip}. +The simplest (but slow) fix is: + +@example +@verbatim +<SKIP>"endskip" BEGIN(INITIAL); +<SKIP>. ; +@end verbatim +@end example + +A faster fix involves making the second rule match more, without +making it match @samp{endskip} plus something else. So for example: + +@example +@verbatim +<SKIP>"endskip" BEGIN(INITIAL); +<SKIP>[^e]+ ; +<SKIP>. ;/* so you eat up e's, too */ +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node deleteme00 +@unnumberedsec deleteme00 +@example +@verbatim +QUESTION: +When was flex born? + +Vern Paxson took over +the Software Tools lex project from Jef Poskanzer in 1982. At that point it +was written in Ratfor.
Around 1987 or so, Paxson translated it into C, and +a legend was born :-). +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Are certain equivalent patterns faster than others? +@unnumberedsec Are certain equivalent patterns faster than others? +@example +@verbatim +To: Adoram Rogel <adoram@orna.hybridge.com> +Subject: Re: Flex 2.5.2 performance questions +In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT. +Date: Wed, 18 Sep 96 10:51:02 PDT +From: Vern Paxson <vern> + +[Note, the most recent flex release is 2.5.4, which you can get from +ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.] + +> 1. Using the pattern +> ([Ff](oot)?)?[Nn](ote)?(\.)? +> instead of +> (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.))) +> (in a very complicated flex program) caused the program to slow from +> 300K+/min to 100K/min (no other changes were done). + +These two are not equivalent. For example, the first can match "footnote." +but the second can only match "footnote". This is almost certainly the +cause in the discrepancy - the slower scanner run is matching more tokens, +and/or having to do more backing up. + +> 2. Which of these two are better: [Ff]oot or (F|f)oot ? + +From a performance point of view, they're equivalent (modulo presumably +minor effects such as memory cache hit rates; and the presence of trailing +context, see below). From a space point of view, the first is slightly +preferable. + +> 3. I have a pattern that look like this: +> pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd) +> +> running yet another complicated program that includes the following rule: +> <snext>{and}/{no4}{bb}{pats} +> +> gets me to "too complicated - over 32,000 states"... + +I can't tell from this example whether the trailing context is variable-length +or fixed-length (it could be the latter if {and} is fixed-length). 
If it's +variable length, which flex -p will tell you, then this reflects a basic +performance problem, and if you can eliminate it by restructuring your +scanner, you will see significant improvement. + +> so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about +> 10 patterns and changed the rule to be 5 rules. +> This did compile, but what is the rule of thumb here ? + +The rule is to avoid trailing context other than fixed-length, in which for +a/b, either the 'a' pattern or the 'b' pattern have a fixed length. Use +of the '|' operator automatically makes the pattern variable length, so in +this case '[Ff]oot' is preferred to '(F|f)oot'. + +> 4. I changed a rule that looked like this: +> <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN... +> +> to the next 2 rules: +> <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;} +> <snext8>{and}{bb}/{ROMAN} { BEGIN... +> +> Again, I understand the using [^...] will cause a great performance loss + +Actually, it doesn't cause any sort of performance loss. It's a surprising +fact about regular expressions that they always match in linear time +regardless of how complex they are. + +> but are there any specific rules about it ? + +See the "Performance Considerations" section of the man page, and also +the example in MISC/fastwc/. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Is backing up a big deal? +@unnumberedsec Is backing up a big deal? +@example +@verbatim +To: Adoram Rogel <adoram@hybridge.com> +Subject: Re: Flex 2.5.2 performance questions +In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT. +Date: Thu, 19 Sep 96 09:58:00 PDT +From: Vern Paxson <vern> + +> a lot about the backing up problem. +> I believe that there lies my biggest problem, and I'll try to improve +> it. + +Since you have variable trailing context, this is a bigger performance +problem. 
Fixing it is usually easier than fixing backing up, which in a +complicated scanner (yours seems to fit the bill) can be extremely +difficult to do correctly. + +You also don't mention what flags you are using for your scanner. +-f makes a large speed difference, and -Cfe buys you nearly as much +speed but the resulting scanner is considerably smaller. + +> I have an | operator in {and} and in {pats} so both of them are variable +> length. + +-p should have reported this. + +> Is changing one of them to fixed-length is enough ? + +Yes. + +> Is it possible to change the 32,000 states limit ? + +Yes. I've appended instructions on how. Before you make this change, +though, you should think about whether there are ways to fundamentally +simplify your scanner - those are certainly preferable! + + Vern + +To increase the 32K limit (on a machine with 32 bit integers), you increase +the magnitude of the following in flexdef.h: + +#define JAMSTATE -32766 /* marks a reference to the state that always jams */ +#define MAXIMUM_MNS 31999 +#define BAD_SUBSCRIPT -32767 +#define MAX_SHORT 32700 + +Adding a 0 or two after each should do the trick. +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Can I fake multi-byte character support? +@unnumberedsec Can I fake multi-byte character support? +@example +@verbatim +To: Heeman_Lee@hp.com +Subject: Re: flex - multi-byte support? +In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT. +Date: Fri, 04 Oct 1996 11:42:18 PDT +From: Vern Paxson <vern> + +> I assume as long as my *.l file defines the +> range of expected character code values (in octal format), flex will +> scan the file and read multi-byte characters correctly. But I have no +> confidence in this assumption. + +Your lack of confidence is justified - this won't work. + +Flex has in it a widespread assumption that the input is processed +one byte at a time. Fixing this is on the to-do list, but is involved, +so it won't happen any time soon. 
In the interim, the best I can suggest +(unless you want to try fixing it yourself) is to write your rules in +terms of pairs of bytes, using definitions in the first section: + + X \xfe\xc2 + ... + %% + foo{X}bar found_foo_fe_c2_bar(); + +etc. Definitely a pain - sorry about that. + +By the way, the email address you used for me is ancient, indicating you +have a very old version of flex. You can get the most recent, 2.5.4, from +ftp.ee.lbl.gov. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node deleteme01 +@unnumberedsec deleteme01 +@example +@verbatim +To: moleary@primus.com +Subject: Re: Flex / Unicode compatibility question +In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT. +Date: Tue, 22 Oct 1996 11:06:13 PDT +From: Vern Paxson <vern> + +Unfortunately flex at the moment has a widespread assumption within it +that characters are processed 8 bits at a time. I don't see any easy +fix for this (other than writing your rules in terms of double characters - +a pain). I also don't know of a wider lex, though you might try surfing +the Plan 9 stuff because I know it's a Unicode system, and also the PCCT +toolkit (try searching say Alta Vista for "Purdue Compiler Construction +Toolkit"). + +Fixing flex to handle wider characters is on the long-term to-do list. +But since flex is a strictly spare-time project these days, this probably +won't happen for quite a while, unless someone else does it first. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Can you discuss some flex internals? +@unnumberedsec Can you discuss some flex internals? +@example +@verbatim +To: Johan Linde <jl@theophys.kth.se> +Subject: Re: translation of flex +In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST. +Date: Mon, 11 Nov 1996 10:33:50 PST +From: Vern Paxson <vern> + +> I'm working for the Swedish team translating GNU program, and I'm currently +> working with flex. 
I have a few questions about some of the messages which +> I hope you can answer. + +All of the things you're wondering about, by the way, concerning flex +internals - probably the only person who understands what they mean in +English is me! So I wouldn't worry too much about getting them right. +That said ... + +> #: main.c:545 +> msgid " %d protos created\n" +> +> Does proto mean prototype? + +Yes - prototypes of state compression tables. + +> #: main.c:539 +> msgid " %d/%d (peak %d) template nxt-chk entries created\n" +> +> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?) +> However, 'template next-check entries' doesn't make much sense to me. To be +> able to find a good translation I need to know a little bit more about it. + +There is a scheme in the Aho/Sethi/Ullman compiler book for compressing +scanner tables. It involves creating two pairs of tables. The first has +"base" and "default" entries, the second has "next" and "check" entries. +The "base" entry is indexed by the current state and yields an index into +the next/check table. The "default" entry gives what to do if the state +transition isn't found in next/check. The "next" entry gives the next +state to enter, but only if the "check" entry verifies that this entry is +correct for the current state. Flex creates templates of series of +next/check entries and then encodes differences from these templates as a +way to compress the tables. + +> #: main.c:533 +> msgid " %d/%d base-def entries created\n" +> +> The same problem here for 'base-def'. + +See above. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unput() messes up yy_at_bol +@unnumberedsec unput() messes up yy_at_bol +@example +@verbatim +To: Xinying Li <xli@npac.syr.edu> +Subject: Re: FLEX ? +In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST. +Date: Wed, 13 Nov 1996 19:51:54 PST +From: Vern Paxson <vern> + +> "unput()" them to input flow, question occurs. 
If I do this after I scan +> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That +> means the carriage flag has gone. + +You can control this by calling yy_set_bol(). It's described in the manual. + +> And if in pre-reading it goes to the end of file, is anything done +> to control the end of curren buffer and end of file? + +No, there's no way to put back an end-of-file. + +> By the way I am using flex 2.5.2 and using the "-l". + +The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and +2.5.3. You can get it from ftp.ee.lbl.gov. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node The | operator is not doing what I want +@unnumberedsec The | operator is not doing what I want +@example +@verbatim +To: Alain.ISSARD@st.com +Subject: Re: Start condition with FLEX +In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST. +Date: Mon, 18 Nov 1996 10:41:34 PST +From: Vern Paxson <vern> + +> I am not able to use the start condition scope and to use the | (OR) with +> rules having start conditions. + +The problem is that if you use '|' as a regular expression operator, for +example "a|b" meaning "match either 'a' or 'b'", then it must *not* have +any blanks around it. If you instead want the special '|' *action* (which +from your scanner appears to be the case), which is a way of giving two +different rules the same action: + + foo | + bar matched_foo_or_bar(); + +then '|' *must* be separated from the first rule by whitespace and *must* +be followed by a new line. You *cannot* write it as: + + foo | bar matched_foo_or_bar(); + +even though you might think you could because yacc supports this syntax. +The reason for this unfortunate incompatibility is historical, but it's +unlikely to be changed. + +Your problems with start condition scope are simply due to syntax errors +from your use of '|' later confusing flex. + +Let me know if you still have problems.
+ + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Why can't flex understand this variable trailing context pattern? +@unnumberedsec Why can't flex understand this variable trailing context pattern? +@example +@verbatim +To: Gregory Margo <gmargo@newton.vip.best.com> +Subject: Re: flex-2.5.3 bug report +In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST. +Date: Sat, 23 Nov 1996 17:07:32 PST +From: Vern Paxson <vern> + +> Enclosed is a lex file that "real" lex will process, but I cannot get +> flex to process it. Could you try it and maybe point me in the right direction? + +Your problem is that some of the definitions in the scanner use the '/' +trailing context operator, and have it enclosed in ()'s. Flex does not +allow this operator to be enclosed in ()'s because doing so allows undefined +regular expressions such as "(a/b)+". So the solution is to remove the +parentheses. Note that you must also be building the scanner with the -l +option for AT&T lex compatibility. Without this option, flex automatically +encloses the definitions in parentheses. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node The ^ operator isn't working +@unnumberedsec The ^ operator isn't working +@example +@verbatim +To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de> +Subject: Re: Flex Bug ? +In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST. +Date: Tue, 26 Nov 1996 11:15:05 PST +From: Vern Paxson <vern> + +> In my lexer code, i have the line : +> ^\*.* { } +> +> Thus all lines starting with an astrix (*) are comment lines. +> This does not work ! + +I can't get this problem to reproduce - it works fine for me. 
Note +though that if what you have is slightly different: + + COMMENT ^\*.* + %% + {COMMENT} { } + +then it won't work, because flex pushes back macro definitions enclosed +in ()'s, so the rule becomes + + (^\*.*) { } + +and now that the '^' operator is not at the immediate beginning of the +line, it's interpreted as just a regular character. You can avoid this +behavior by using the "-l" lex-compatibility flag, or "%option lex-compat". + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Trailing context is getting confused with trailing optional patterns +@unnumberedsec Trailing context is getting confused with trailing optional patterns +@example +@verbatim +To: Adoram Rogel <adoram@hybridge.com> +Subject: Re: Flex 2.5.4 BOF ??? +In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST. +Date: Wed, 27 Nov 1996 10:56:25 PST +From: Vern Paxson <vern> + +> Organization(s)?/[a-z] +> +> This matched "Organizations" (looking in debug mode, the trailing s +> was matched with trailing context instead of the optional (s) in the +> end of the word. + +That should only happen with lex. Flex can properly match this pattern. +(That might be what you're saying, I'm just not sure.) + +> Is there a way to avoid this dangerous trailing context problem ? + +Unfortunately, there's no easy way. On the other hand, I don't see why +it should be a problem. Lex's matching is clearly wrong, and I'd hope +that usually the intent remains the same as expressed with the pattern, +so flex's matching will be correct. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Is flex GNU or not? +@unnumberedsec Is flex GNU or not? +@example +@verbatim +To: Cameron MacKinnon <mackin@interlog.com> +Subject: Re: Flex documentation bug +In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST. 
+Date: Sun, 01 Dec 1996 22:29:39 PST +From: Vern Paxson <vern> + +> I'm not sure how or where to submit bug reports (documentation or +> otherwise) for the GNU project stuff ... + +Well, strictly speaking flex isn't part of the GNU project. They just +distribute it because no one's written a decent GPL'd lex replacement. +So you should send bugs directly to me. Those sent to the GNU folks +sometimes find their way to me, but some may drop between the cracks. + +> In GNU Info, under the section 'Start Conditions', and also in the man +> page (mine's dated April '95) is a nice little snippet showing how to +> parse C quoted strings into a buffer, defined to be MAX_STR_CONST in +> size. Unfortunately, no overflow checking is ever done ... + +This is already mentioned in the manual: + +Finally, here's an example of how to match C-style quoted +strings using exclusive start conditions, including expanded +escape sequences (but not including checking for a string +that's too long): + +The reason for not doing the overflow checking is that it will needlessly +clutter up an example whose main purpose is just to demonstrate how to +use flex. + +The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node ERASEME53 +@unnumberedsec ERASEME53 +@example +@verbatim +To: tsv@cs.UManitoba.CA +Subject: Re: Flex (reg).. +In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST. +Date: Thu, 06 Mar 1997 15:54:19 PST +From: Vern Paxson <vern> + +> [:alpha:] ([:alnum:] | \\_)* + +If your rule really has embedded blanks as shown above, then it won't +work, as the first blank delimits the rule from the action. (It wouldn't +even compile ...) You need instead: + +[:alpha:]([:alnum:]|\\_)* + +and that should work fine - there's no restriction on what can go inside +of ()'s except for the trailing context operator, '/'. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq.
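As a rough illustration of the point about embedded blanks, the sketch below checks an identifier pattern with the C++ standard library's std::regex rather than with flex. It is only an analogy, since std::regex is a different engine with its own syntax. The pattern contains no blanks, and it writes the character class expressions inside a bracket expression, as in [[:alpha:]], which is the form flex documents for such expressions. It also assumes that a plain underscore is what the second alternative was meant to match. The function name is_identifier is made up for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` is a letter followed by any number of
// letters, digits, or underscores. No blanks appear inside the pattern.
bool is_identifier(const std::string& text) {
    static const std::regex pattern("[[:alpha:]]([[:alnum:]]|_)*");
    // regex_match succeeds only if the whole string matches the pattern.
    return std::regex_match(text, pattern);
}
```

Under these assumptions, is_identifier accepts a string such as foo_1 and rejects strings that start with a digit or contain a blank.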
+@node I need to scan if-then-else blocks and while loops +@unnumberedsec I need to scan if-then-else blocks and while loops +@example +@verbatim +To: "Mike Stolnicki" <mstolnic@ford.com> +Subject: Re: FLEX help +In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT. +Date: Fri, 30 May 1997 10:46:35 PDT +From: Vern Paxson <vern> + +> We'd like to add "if-then-else", "while", and "for" statements to our +> language ... +> We've investigated many possible solutions. The one solution that seems +> the most reasonable involves knowing the position of a TOKEN in yyin. + +I strongly advise you to instead build a parse tree (abstract syntax tree) +and loop over that instead. You'll find this has major benefits in keeping +your interpreter simple and extensible. + +That said, the functionality you mention for get_position and set_position +have been on the to-do list for a while. As flex is a purely spare-time +project for me, no guarantees when this will be added (in particular, it +for sure won't be for many months to come). + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node ERASEME55 +@unnumberedsec ERASEME55 +@example +@verbatim +To: Colin Paul Adams <colin@colina.demon.co.uk> +Subject: Re: Flex C++ classes and Bison +In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT. +Date: Fri, 15 Aug 1997 10:48:19 PDT +From: Vern Paxson <vern> + +> #define YY_DECL int yylex (YYSTYPE *lvalp, struct parser_control +> *parm) +> +> I have been trying to get this to work as a C++ scanner, but it does +> not appear to be possible (warning that it matches no declarations in +> yyFlexLexer, or something like that). +> +> Is this supposed to be possible, or is it being worked on (I DID +> notice the comment that scanner classes are still experimental, so I'm +> not too hopeful)? 
+ +What you need to do is derive a subclass from yyFlexLexer that provides +the above yylex() method, squirrels away lvalp and parm into member +variables, and then invokes yyFlexLexer::yylex() to do the regular scanning. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node ERASEME56 +@unnumberedsec ERASEME56 +@example +@verbatim +To: Mikael.Latvala@lmf.ericsson.se +Subject: Re: Possible mistake in Flex v2.5 document +In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT. +Date: Fri, 05 Sep 1997 10:01:54 PDT +From: Vern Paxson <vern> + +> In that example you show how to count comment lines when using +> C style /* ... */ comments. My question is, shouldn't you take into +> account a scenario where end of a comment marker occurs inside +> character or string literals? + +The scanner certainly needs to also scan character and string literals. +However it does that (there's an example in the man page for strings), the +lexer will recognize the beginning of the literal before it runs across the +embedded "/*". Consequently, it will finish scanning the literal before it +even considers the possibility of matching "/*". + +Example: + + '([^']*|{ESCAPE_SEQUENCE})' + +will match all the text between the ''s (inclusive). So the lexer +considers this as a token beginning at the first ', and doesn't even +attempt to match other tokens inside it. + +I think this subtlety is not worth putting in the manual, as I suspect +it would confuse more people than it would enlighten. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node ERASEME57 +@unnumberedsec ERASEME57 +@example +@verbatim +To: "Marty Leisner" <leisner@sdsp.mc.xerox.com> +Subject: Re: flex limitations +In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT. +Date: Mon, 08 Sep 1997 11:38:08 PDT +From: Vern Paxson <vern> + +> %% +> [a-zA-Z]+ /* skip a line */ +> { printf("got %s\n", yytext); } +> %% + +What version of flex are you using?
If I feed this to 2.5.4, it complains: + + "bug.l", line 5: EOF encountered inside an action + "bug.l", line 5: unrecognized rule + "bug.l", line 5: fatal parse error + +Not the world's greatest error message, but it manages to flag the problem. + +(With the introduction of start condition scopes, flex can't accommodate +an action on a separate line, since it's ambiguous with an indented rule.) + +You can get 2.5.4 from ftp.ee.lbl.gov. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node Is there a repository for flex scanners? +@unnumberedsec Is there a repository for flex scanners? + +Not that we know of. You might try asking on comp.compilers. + +@c TODO: Evaluate this faq. +@node How can I conditionally compile or preprocess my flex input file? +@unnumberedsec How can I conditionally compile or preprocess my flex input file? + + +Flex doesn't have a preprocessor like C does. You might try using m4, or the C +preprocessor plus a sed script to clean up the result. + + +@c TODO: Evaluate this faq. +@node Where can I find grammars for lex and yacc? +@unnumberedsec Where can I find grammars for lex and yacc? + +In the sources for flex and bison. + +@c TODO: Evaluate this faq. +@node I get an end-of-buffer message for each character scanned. +@unnumberedsec I get an end-of-buffer message for each character scanned. + +This will happen if your LexerInput() function returns only one character +at a time, which can happen either if your scanner is "interactive", or +if the streams library on your platform always returns 1 for yyin->gcount(). + +Solution: override LexerInput() with a version that returns whole buffers. + +@c TODO: Evaluate this faq. +@node unnamed-faq-62 +@unnumberedsec unnamed-faq-62 +@example +@verbatim +To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE +Subject: Re: Flex maximums +In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST.
+Date: Mon, 17 Nov 1997 17:16:15 PST +From: Vern Paxson <vern> + +> I took a quick look into the flex-sources and altered some #defines in +> flexdefs.h: +> +> #define INITIAL_MNS 64000 +> #define MNS_INCREMENT 1024000 +> #define MAXIMUM_MNS 64000 + +The things to fix are to add a couple of zeroes to: + +#define JAMSTATE -32766 /* marks a reference to the state that always jams */ +#define MAXIMUM_MNS 31999 +#define BAD_SUBSCRIPT -32767 +#define MAX_SHORT 32700 + +and, if you get complaints about too many rules, make the following change too: + + #define YY_TRAILING_MASK 0x200000 + #define YY_TRAILING_HEAD_MASK 0x400000 + +- Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-63 +@unnumberedsec unnamed-faq-63 +@example +@verbatim +To: jimmey@lexis-nexis.com (Jimmey Todd) +Subject: Re: FLEX question regarding istream vs ifstream +In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST. +Date: Mon, 15 Dec 1997 13:21:35 PST +From: Vern Paxson <vern> + +> stdin_handle = YY_CURRENT_BUFFER; +> ifstream fin( "aFile" ); +> yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) ); +> +> What I'm wanting to do, is pass the contents of a file thru one set +> of rules and then pass stdin thru another set... It works great if, I +> don't use the C++ classes. But since everything else that I'm doing is +> in C++, I thought I'd be consistent. +> +> The problem is that 'yy_create_buffer' is expecting an istream* as it's +> first argument (as stated in the man page). However, fin is a ifstream +> object. Any ideas on what I might be doing wrong? Any help would be +> appreciated. Thanks!! + +You need to pass &fin, to turn it into an ifstream* instead of an ifstream. +Then its type will be compatible with the expected istream*, because ifstream +is derived from istream. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. 
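The pointer conversion described in the reply above can be checked without a generated scanner. The sketch below is a minimal stand-in, not flex code: count_lines is a hypothetical function that, like yy_create_buffer in the C++ scanner as described in the e-mail, takes an istream* as its first argument. Passing the address of a derived stream object compiles because a pointer to a class publicly derived from std::istream converts implicitly to std::istream*. The helper count_lines_in_text is likewise invented here so that the example does not depend on a file being present.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <type_traits>

// ifstream is derived from istream, so ifstream* converts to istream*.
static_assert(std::is_convertible<std::ifstream*, std::istream*>::value,
              "a pointer to ifstream should convert to a pointer to istream");

// Hypothetical stand-in for a function that expects an istream*.
std::size_t count_lines(std::istream* in) {
    assert(in != nullptr);
    std::size_t lines = 0;
    std::string line;
    while (std::getline(*in, line)) {
        ++lines;
    }
    return lines;
}

// Usage sketch: pass the address of the stream object, not the object itself.
std::size_t count_lines_in_text(const std::string& text) {
    std::istringstream source(text);  // any istream-derived object works
    return count_lines(&source);
}
```

With a file, the call would read count_lines(&fin) for a std::ifstream named fin, which mirrors the &fin fix suggested in the reply.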
+@node unnamed-faq-64 +@unnumberedsec unnamed-faq-64 +@example +@verbatim +To: Enda Fadian <fadiane@piercom.ie> +Subject: Re: Question related to Flex man page? +In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST. +Date: Tue, 16 Dec 1997 14:17:09 PST +From: Vern Paxson <vern> + +> Can you explain to me what is ment by a long-jump in relation to flex? + +Using the longjmp() function while inside yylex() or a routine called by it. + +> what is the flex activation frame. + +Just yylex()'s stack frame. + +> As far as I can see yyrestart will bring me back to the sart of the input +> file and using flex++ isnot really an option! + +No, yyrestart() doesn't imply a rewind, even though its name might sound +like it does. It tells the scanner to flush its internal buffers and +start reading from the given file at its present location. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-65 +@unnumberedsec unnamed-faq-65 +@example +@verbatim +To: hassan@larc.info.uqam.ca (Hassan Alaoui) +Subject: Re: Need urgent Help +In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST. +Date: Sun, 21 Dec 1997 21:30:46 PST +From: Vern Paxson <vern> + +> /usr/lib/yaccpar: In function `int yyparse()': +> /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)' +> +> ld: Undefined symbol +> _yylex +> _yyparse +> _yyin + +This is a known problem with Solaris C++ (and/or Solaris yacc). I believe +the fix is to explicitly insert some 'extern "C"' statements for the +corresponding routines/symbols. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-66 +@unnumberedsec unnamed-faq-66 +@example +@verbatim +To: mc0307@mclink.it +Cc: gnu@prep.ai.mit.edu +Subject: Re: [mc0307@mclink.it: Help request] +In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST. +Date: Sun, 21 Dec 1997 22:33:37 PST +From: Vern Paxson <vern> + +> This is my definition for float and integer types: +> . . . 
+> NZD [1-9] +> ... +> I've tested my program on other lex version (on UNIX Sun Solaris an HP +> UNIX) and it work well, so I think that my definitions are correct. +> There are any differences between Lex and Flex? + +There are indeed differences, as discussed in the man page. The one +you are probably running into is that when flex expands a name definition, +it puts parentheses around the expansion, while lex does not. There's +an example in the man page of how this can lead to different matching. +Flex's behavior complies with the POSIX standard (or at least with the +last POSIX draft I saw). + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-67 +@unnumberedsec unnamed-faq-67 +@example +@verbatim +To: hassan@larc.info.uqam.ca (Hassan Alaoui) +Subject: Re: Thanks +In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST. +Date: Mon, 22 Dec 1997 14:35:05 PST +From: Vern Paxson <vern> + +> Thank you very much for your help. I compile and link well with C++ while +> declaring 'yylex ...' extern, But a little problem remains. I get a +> segmentation default when executing ( I linked with lfl library) while it +> works well when using LEX instead of flex. Do you have some ideas about the +> reason for this ? + +The one possible reason for this that comes to mind is if you've defined +yytext as "extern char yytext[]" (which is what lex uses) instead of +"extern char *yytext" (which is what flex uses). If it's not that, then +I'm afraid I don't know what the problem might be. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-68 +@unnumberedsec unnamed-faq-68 +@example +@verbatim +To: "Bart Niswonger" <NISWONGR@almaden.ibm.com> +Subject: Re: flex 2.5: c++ scanners & start conditions +In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST. 
+Date: Tue, 06 Jan 1998 19:19:30 PST +From: Vern Paxson <vern> + +> The problem is that when I do this (using %option c++) start +> conditions seem to not apply. + +The BEGIN macro modifies the yy_start variable. For C scanners, this +is a static with scope visible through the whole file. For C++ scanners, +it's a member variable, so it only has visible scope within a member +function. Your lexbegin() routine is not a member function when you +build a C++ scanner, so it's not modifying the correct yy_start. The +diagnostic that indicates this is that you found you needed to add +a declaration of yy_start in order to get your scanner to compile when +using C++; instead, the correct fix is to make lexbegin() a member +function (by deriving from yyFlexLexer). + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-69 +@unnumberedsec unnamed-faq-69 +@example +@verbatim +To: "Boris Zinin" <boris@ippe.rssi.ru> +Subject: Re: current position in flex buffer +In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST. +Date: Mon, 12 Jan 1998 12:03:15 PST +From: Vern Paxson <vern> + +> The problem is how to determine the current position in flex active +> buffer when a rule is matched.... + +You will need to keep track of this explicitly, such as by redefining +YY_USER_ACTION to count the number of characters matched. + +The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-70 +@unnumberedsec unnamed-faq-70 +@example +@verbatim +To: Bik.Dhaliwal@bis.org +Subject: Re: Flex question +In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST. +Date: Tue, 27 Jan 1998 22:41:52 PST +From: Vern Paxson <vern> + +> That requirement involves knowing +> the character position at which a particular token was matched +> in the lexer. 
+ +The way you have to do this is by explicitly keeping track of where +you are in the file, by counting the number of characters scanned +for each token (available in yyleng). It may prove convenient to +do this by redefining YY_USER_ACTION, as described in the manual. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-71 +@unnumberedsec unnamed-faq-71 +@example +@verbatim +To: Vladimir Alexiev <vladimir@cs.ualberta.ca> +Subject: Re: flex: how to control start condition from parser? +In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST. +Date: Tue, 27 Jan 1998 22:45:37 PST +From: Vern Paxson <vern> + +> It seems useful for the parser to be able to tell the lexer about such +> context dependencies, because then they don't have to be limited to +> local or sequential context. + +One way to do this is to have the parser call a stub routine that's +included in the scanner's .l file, and consequently that has access to +BEGIN. The only ugliness is that the parser can't pass in the state +it wants, because those aren't visible - but if you don't have many +such states, then using a different set of names doesn't seem like +too much of a burden. + +While generating a .h file like you suggest is certainly cleaner, +flex development has come to a virtual stand-still :-(, so a workaround +like the above is much more pragmatic than waiting for a new feature. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-72 +@unnumberedsec unnamed-faq-72 +@example +@verbatim +To: Barbara Denny <denny@3com.com> +Subject: Re: freebsd flex bug? +In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST. +Date: Fri, 30 Jan 1998 12:42:32 PST +From: Vern Paxson <vern> + +> lex.yy.c:1996: parse error before `=' + +This is the key, identifying this error. (It may help to pinpoint +it by using flex -L, so it doesn't generate #line directives in its +output.)
I will bet you heavy money that you have a start condition +name that is also a variable name, or something like that; flex spits +out #define's for each start condition name, mapping them to a number, +so you can wind up with: + + %x foo + %% + ... + %% + void bar() + { + int foo = 3; + } + +and the penultimate will turn into "int 1 = 3" after C preprocessing, +since flex will put "#define foo 1" in the generated scanner. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-73 +@unnumberedsec unnamed-faq-73 +@example +@verbatim +To: Maurice Petrie <mpetrie@infoscigroup.com> +Subject: Re: Lost flex .l file +In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST. +Date: Mon, 02 Feb 1998 11:15:12 PST +From: Vern Paxson <vern> + +> I am curious as to +> whether there is a simple way to backtrack from the generated source to +> reproduce the lost list of tokens we are searching on. + +In theory, it's straight-forward to go from the DFA representation +back to a regular-expression representation - the two are isomorphic. +In practice, a huge headache, because you have to unpack all the tables +back into a single DFA representation, and then write a program to munch +on that and translate it into an RE. + +Sorry for the less-than-happy news ... + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-74 +@unnumberedsec unnamed-faq-74 +@example +@verbatim +To: jimmey@lexis-nexis.com (Jimmey Todd) +Subject: Re: Flex performance question +In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. +Date: Thu, 19 Feb 1998 08:48:51 PST +From: Vern Paxson <vern> + +> What I have found, is that the smaller the data chunk, the faster the +> program executes. This is the opposite of what I expected. Should this be +> happening this way? + +This is exactly what will happen if your input file has embedded NULs. 
+From the man page: + +A final note: flex is slow when matching NUL's, particularly +when a token contains multiple NUL's. It's best to write +rules which match short amounts of text if it's anticipated +that the text will often include NUL's. + +So that's the first thing to look for. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-75 +@unnumberedsec unnamed-faq-75 +@example +@verbatim +To: jimmey@lexis-nexis.com (Jimmey Todd) +Subject: Re: Flex performance question +In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. +Date: Thu, 19 Feb 1998 15:42:25 PST +From: Vern Paxson <vern> + +So there are several problems. + +First, to go fast, you want to match as much text as possible, which +your scanners don't in the case that what they're scanning is *not* +a <RN> tag. So you want a rule like: + + [^<]+ + +Second, C++ scanners are particularly slow if they're interactive, +which they are by default. Using -B speeds it up by a factor of 3-4 +on my workstation. + +Third, C++ scanners that use the istream interface are slow, because +of how poorly implemented istream's are. I built two versions of +the following scanner: + + %% + .*\n + .* + %% + +and the C version inhales a 2.5MB file on my workstation in 0.8 seconds. +The C++ istream version, using -B, takes 3.8 seconds. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-76 +@unnumberedsec unnamed-faq-76 +@example +@verbatim +To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com> +Subject: Re: FLEX 2.5 & THE YEAR 2000 +In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT. +Date: Wed, 03 Jun 1998 10:22:26 PDT +From: Vern Paxson <vern> + +> I am researching the Y2K problem with General Electric R&D +> and need to know if there are any known issues concerning +> the above mentioned software and Y2K regardless of version. 
+ +There shouldn't be, all it ever does with the date is ask the system +for it and then print it out. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-77 +@unnumberedsec unnamed-faq-77 +@example +@verbatim +To: "Hans Dermot Doran" <htd@ibhdoran.com> +Subject: Re: flex problem +In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT. +Date: Tue, 21 Jul 1998 14:23:34 PDT +From: Vern Paxson <vern> + +> To overcome this, I gets() the stdin into a string and lex the string. The +> string is lexed OK except that the end of string isn't lexed properly +> (yy_scan_string()), that is the lexer dosn't recognise the end of string. + +Flex doesn't contain mechanisms for recognizing buffer endpoints. But if +you use fgets instead (which you should anyway, to protect against buffer +overflows), then the final \n will be preserved in the string, and you can +scan that in order to find the end of the string. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-78 +@unnumberedsec unnamed-faq-78 +@example +@verbatim +To: soumen@almaden.ibm.com +Subject: Re: Flex++ 2.5.3 instance member vs. static member +In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT. +Date: Tue, 28 Jul 1998 01:10:34 PDT +From: Vern Paxson <vern> + +> %{ +> int mylineno = 0; +> %} +> ws [ \t]+ +> alpha [A-Za-z] +> dig [0-9] +> %% +> +> Now you'd expect mylineno to be a member of each instance of class +> yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to +> indicate otherwise; unless I am missing something the declaration of +> mylineno seems to be outside any class scope. +> +> How will this work if I want to run a multi-threaded application with each +> thread creating a FlexLexer instance? + +Derive your own subclass and make mylineno a member variable of it. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. 
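A sketch of the advice in the reply above follows. So that it stays self-contained it does not include FlexLexer.h: StandInLexer is a hypothetical placeholder for yyFlexLexer, and LineCountingLexer and note_newline are names invented here. The point is only that a variable declared in the definitions section of the input file ends up as a single file-scope variable shared by every scanner object, whereas a member variable of a derived class gives each instance its own copy. Whether the rest of a generated scanner is safe to use from several threads is a separate question that this sketch does not address.

```cpp
// Hypothetical placeholder for yyFlexLexer; a real scanner would derive from
// the class declared in FlexLexer.h instead.
class StandInLexer {
public:
    virtual ~StandInLexer() = default;
};

// Each instance carries its own line counter, so two scanners (for example,
// one per thread) do not share or overwrite each other's count.
class LineCountingLexer : public StandInLexer {
public:
    // Would be called from the action of a rule that matches a newline.
    void note_newline() { ++mylineno_; }
    int mylineno() const { return mylineno_; }

private:
    int mylineno_ = 0;
};
```

Two LineCountingLexer objects created side by side keep separate counts, which is the behavior the original question was looking for.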
+@node unnamed-faq-79 +@unnumberedsec unnamed-faq-79 +@example +@verbatim +To: Adoram Rogel <adoram@hybridge.com> +Subject: Re: More than 32K states change hangs +In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT. +Date: Tue, 04 Aug 1998 22:28:45 PDT +From: Vern Paxson <vern> + +> Vern Paxson, +> +> I followed your advice, posted on Usenet bu you, and emailed to me +> personally by you, on how to overcome the 32K states limit. I'm running +> on Linux machines. +> I took the full source of version 2.5.4 and did the following changes in +> flexdef.h: +> #define JAMSTATE -327660 +> #define MAXIMUM_MNS 319990 +> #define BAD_SUBSCRIPT -327670 +> #define MAX_SHORT 327000 +> +> and compiled. +> All looked fine, including check and bigcheck, so I installed. + +Hmmm, you shouldn't increase MAX_SHORT, though looking through my email +archives I see that I did indeed recommend doing so. Try setting it back +to 32700; that should suffice that you no longer need -Ca. If it still +hangs, then the interesting question is - where? + +> Compiling the same hanged program with a out-of-the-box (RedHat 4.2 +> distribution of Linux) +> flex 2.5.4 binary works. + +Since Linux comes with source code, you should diff it against what +you have to see what problems they missed. + +> Should I always compile with the -Ca option now ? even short and simple +> filters ? + +No, definitely not. It's meant to be for those situations where you +absolutely must squeeze every last cycle out of your scanner. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-80 +@unnumberedsec unnamed-faq-80 +@example +@verbatim +To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com> +Subject: Re: flex output for static code portion +In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT. 
+Date: Mon, 17 Aug 1998 23:57:42 PDT +From: Vern Paxson <vern> + +> I would like to use flex under the hood to generate a binary file +> containing the data structures that control the parse. + +This has been on the wish-list for a long time. In principle it's +straight-forward - you redirect mkdata() et al's I/O to another file, +and modify the skeleton to have a start-up function that slurps these +into dynamic arrays. The concerns are (1) the scanner generation code +is hairy and full of corner cases, so it's easy to get surprised when +going down this path :-( ; and (2) being careful about buffering so +that when the tables change you make sure the scanner starts in the +correct state and reading at the right point in the input file. + +> I was wondering if you know of anyone who has used flex in this way. + +I don't - but it seems like a reasonable project to undertake (unlike +numerous other flex tweaks :-). + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-81 +@unnumberedsec unnamed-faq-81 +@example +@verbatim +Received: from 131.173.17.11 (131.173.17.11 [131.173.17.11]) + by ee.lbl.gov (8.9.1/8.9.1) with ESMTP id AAA03838 + for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 00:47:57 -0700 (PDT) +Received: from hal.cl-ki.uni-osnabrueck.de (hal.cl-ki.Uni-Osnabrueck.DE [131.173.141.2]) + by deimos.rz.uni-osnabrueck.de (8.8.7/8.8.8) with ESMTP id JAA34694 + for <vern@ee.lbl.gov>; Thu, 20 Aug 1998 09:47:55 +0200 +Received: (from georg@localhost) by hal.cl-ki.uni-osnabrueck.de (8.6.12/8.6.12) id JAA34834 for vern@ee.lbl.gov; Thu, 20 Aug 1998 09:47:54 +0200 +From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de> +Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de> +Subject: "flex scanner push-back overflow" +To: vern@ee.lbl.gov +Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST) +Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE +X-NoJunk: Do NOT send commercial mail, spam or ads to this address! 
+X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/ +X-Mailer: ELM [version 2.4ME+ PL28 (25)] +MIME-Version: 1.0 +Content-Type: text/plain; charset=US-ASCII +Content-Transfer-Encoding: 7bit + +Hi Vern, + +Yesterday, I encountered a strange problem: I use the macro processor m4 +to include some lengthy lists into a .l file. Following is a flex macro +definition that causes some serious pain in my neck: + +AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...]) + +The complete list contains about 10kB. When I try to "flex" this file +(on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased +some of the predefined values in flexdefs.h) I get the error: + +myflex/flex -8 sentag.tmp.l +flex scanner push-back overflow + +When I remove the slashes in the macro definition everything works fine. +As I understand it, the double quotes escape the slash-character so it +really means "/" and not "trailing context". Furthermore, I tried to +escape the slashes with backslashes, but with no use, the same error message +appeared when flexing the code. + +Do you have an idea what's going on here? + +Greetings from Germany, + Georg +-- +Georg Rehm georg@cl-ki.uni-osnabrueck.de +Institute for Semantic Information Processing, University of Osnabrueck, FRG +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-82 +@unnumberedsec unnamed-faq-82 +@example +@verbatim +To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE +Subject: Re: "flex scanner push-back overflow" +In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT. 
+Date: Thu, 20 Aug 1998 07:05:35 PDT +From: Vern Paxson <vern> + +> myflex/flex -8 sentag.tmp.l +> flex scanner push-back overflow + +Flex itself uses a flex scanner. That scanner is running out of buffer +space when it tries to unput() the humongous macro you've defined. When +you remove the '/'s, you make it small enough so that it fits in the buffer; +removing spaces would do the same thing. + +The fix is to either rethink how come you're using such a big macro and +perhaps there's another/better way to do it; or to rebuild flex's own +scan.c with a larger value for + + #define YY_BUF_SIZE 16384 + +- Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-83 +@unnumberedsec unnamed-faq-83 +@example +@verbatim +To: Jan Kort <jan@research.techforce.nl> +Subject: Re: Flex +In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200. +Date: Sat, 05 Sep 1998 00:59:49 PDT +From: Vern Paxson <vern> + +> %% +> +> "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); } +> ^\n { fprintf(stderr, "empty line\n"); } +> . { } +> \n { fprintf(stderr, "new line\n"); } +> +> %% +> -- input --------------------------------------- +> TEST1 +> -- output -------------------------------------- +> TEST1 +> empty line +> ------------------------------------------------ + +IMHO, it's not clear whether or not this is in fact a bug. It depends +on whether you view yyless() as backing up in the input stream, or as +pushing new characters onto the beginning of the input stream. Flex +interprets it as the latter (for implementation convenience, I'll admit), +and so considers the newline as in fact matching at the beginning of a +line, as after all the last token scanned an entire line and so the +scanner is now at the beginning of a new line. + +I agree that this is counter-intuitive for yyless(), given its +functional description (it's less so for unput(), depending on whether +you're unput()'ing new text or scanned text). 
But I don't plan to +change it any time soon, as it's a pain to do so. Consequently, +you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak +your scanner into the behavior you desire. + +Sorry for the less-than-completely-satisfactory answer. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-84 +@unnumberedsec unnamed-faq-84 +@example +@verbatim +To: Patrick Krusenotto <krusenot@mac-info-link.de> +Subject: Re: Problems with restarting flex-2.5.2-generated scanner +In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT. +Date: Thu, 24 Sep 1998 23:28:43 PDT +From: Vern Paxson <vern> + +> I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately +> trying to make my scanner restart with a new file after my parser stops +> with a parse error. When my compiler restarts, the parser always +> receives the token after the token (in the old file!) that caused the +> parser error. + +I suspect the problem is that your parser has read ahead in order +to attempt to resolve an ambiguity, and when it's restarted it picks +up with that token rather than reading a fresh one. If you're using +yacc, then the special "error" production can sometimes be used to +consume tokens in an attempt to get the parser into a consistent state. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-85 +@unnumberedsec unnamed-faq-85 +@example +@verbatim +To: Henric Jungheim <junghelh@pe-nelson.com> +Subject: Re: flex 2.5.4a +In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST. +Date: Tue, 27 Oct 1998 16:50:14 PST +From: Vern Paxson <vern> + +> This brings up a feature request: How about a command line +> option to specify the filename when reading from stdin? That way one +> doesn't need to create a temporary file in order to get the "#line" +> directives to make sense. + +Use -o combined with -t (per the man page description of -o). 
+ +> P.S., Is there any simple way to use non-blocking IO to parse multiple +> streams? + +Simple, no. + +One approach might be to return a magic character on EWOULDBLOCK and +have a rule + + .*<magic-character> // put back .*, eat magic character + +This is off the top of my head, not sure it'll work. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-86 +@unnumberedsec unnamed-faq-86 +@example +@verbatim +To: "Repko, Billy D" <billy.d.repko@intel.com> +Subject: Re: Compiling scanners +In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST. +Date: Thu, 14 Jan 1999 00:25:30 PST +From: Vern Paxson <vern> + +> It appears that maybe it cannot find the lfl library. + +The Makefile in the distribution builds it, so you should have it. +It's exceedingly trivial, just a main() that calls yylex() and +a yywrap() that always returns 1. + +> %% +> \n ++num_lines; ++num_chars; +> . ++num_chars; + +You can't indent your rules like this - that's where the errors are coming +from. Flex copies indented text to the output file, it's how you do things +like + + int num_lines_seen = 0; + +to declare local variables. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-87 +@unnumberedsec unnamed-faq-87 +@example +@verbatim +To: Erick Branderhorst <Erick.Branderhorst@asml.nl> +Subject: Re: flex input buffer +In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST. +Date: Tue, 09 Feb 1999 21:03:37 PST +From: Vern Paxson <vern> + +> In the flex.skl file the size of the default input buffers is set. Can you +> explain why this size is set and why it is such a high number. + +It's large to optimize performance when scanning large files. You can +safely make it a lot lower if needed. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. 
+@node unnamed-faq-88 +@unnumberedsec unnamed-faq-88 +@example +@verbatim +To: "Guido Minnen" <guidomi@cogs.susx.ac.uk> +Subject: Re: Flex error message +In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST. +Date: Thu, 25 Feb 1999 00:11:31 PST +From: Vern Paxson <vern> + +> I'm extending a larger scanner written in Flex and I keep running into +> problems. More specifically, I get the error message: +> "flex: input rules are too complicated (>= 32000 NFA states)" + +Increase the definitions in flexdef.h for: + +#define JAMSTATE -32766 /* marks a reference to the state that always j +ams */ +#define MAXIMUM_MNS 31999 +#define BAD_SUBSCRIPT -32767 + +recompile everything, and it should all work. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-90 +@unnumberedsec unnamed-faq-90 +@example +@verbatim +To: "Dmitriy Goldobin" <gold@ems.chel.su> +Subject: Re: FLEX trouble +In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT. +Date: Tue, 01 Jun 1999 00:15:07 PDT +From: Vern Paxson <vern> + +> I have a trouble with FLEX. Why rule "/*".*"*/" work properly,=20 +> but rule "/*"(.|\n)*"*/" don't work ? + +The second of these will have to scan the entire input stream (because +"(.|\n)*" matches an arbitrary amount of any text) in order to see if +it ends with "*/", terminating the comment. That potentially will overflow +the input buffer. + +> More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error +> 'unrecognized rule'. + +You can't use the '/' operator inside parentheses. It's not clear +what "(a/b)*" actually means. + +> I now use workaround with state <comment>, but single-rule is +> better, i think. + +Single-rule is nice but will always have the problem of either setting +restrictions on comments (like not allowing multi-line comments) and/or +running the risk of consuming the entire input stream, as noted above. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. 
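The @code{<comment>} start-state workaround mentioned in unnamed-faq-90 can be pictured as a two-state machine that never has to hold a whole comment in the buffer. The following is only a hand-written C sketch of that idea, not code that @code{flex} generates; the names @code{scan_state} and @code{strip_comments} are hypothetical, and string literals containing @samp{/*} are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Two states, mirroring INITIAL and an exclusive <comment> start condition. */
enum scan_state { ST_INITIAL, ST_COMMENT };

/*
 * Copy src to dst with C-style comments removed, one character at a time.
 * dst must have room for strlen(src) + 1 bytes.  Returns the final state,
 * so a caller can detect an unterminated comment (ST_COMMENT).
 */
static enum scan_state strip_comments(const char *src, char *dst)
{
    enum scan_state st = ST_INITIAL;
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);

    while (src[i] != '\0') {
        if (st == ST_INITIAL && src[i] == '/' && src[i + 1] == '*') {
            st = ST_COMMENT;        /* comparable to BEGIN(comment) */
            i += 2;
        } else if (st == ST_COMMENT && src[i] == '*' && src[i + 1] == '/') {
            st = ST_INITIAL;        /* comparable to BEGIN(INITIAL) */
            i += 2;
        } else if (st == ST_COMMENT) {
            i++;                    /* discard comment text, newlines included */
        } else {
            dst[o++] = src[i++];    /* ordinary text is passed through */
        }
    }
    dst[o] = '\0';
    return st;
}
```

Because each step looks at no more than two characters, the length of a comment does not matter, which is the property the single-rule patterns in the message above cannot offer.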
+@node unnamed-faq-91 +@unnumberedsec unnamed-faq-91 +@example +@verbatim +Received: from mc-qout4.whowhere.com (mc-qout4.whowhere.com [209.185.123.18]) + by ee.lbl.gov (8.9.3/8.9.3) with SMTP id IAA05100 + for <vern@ee.lbl.gov>; Tue, 15 Jun 1999 08:56:06 -0700 (PDT) +Received: from Unknown/Local ([?.?.?.?]) by my-deja.com; Tue Jun 15 08:55:43 1999 +To: vern@ee.lbl.gov +Date: Tue, 15 Jun 1999 08:55:43 -0700 +From: "Aki Niimura" <neko@my-deja.com> +Message-ID: <KNONDOHDOBGAEAAA@my-deja.com> +Mime-Version: 1.0 +Cc: +X-Sent-Mail: on +Reply-To: +X-Mailer: MailCity Service +Subject: A question on flex C++ scanner +X-Sender-Ip: 12.72.207.61 +Organization: My Deja Email (http://www.my-deja.com:80) +Content-Type: text/plain; charset=us-ascii +Content-Transfer-Encoding: 7bit + +Dear Dr. Paxon, + +I have been using flex for years. +It works very well on many projects. +Most case, I used it to generate a scanner on C language. +However, one project I needed to generate a scanner +on C++ lanuage. Thanks to your enhancement, flex did +the job. + +Currently, I'm working on enhancing my previous project. +I need to deal with multiple input streams (recursive +inclusion) in this scanner (C++). +I did similar thing for another scanner (C) as you +explained in your documentation. + +The generated scanner (C++) has necessary methods: +- switch_to_buffer(struct yy_buffer_state *b) +- yy_create_buffer(istream *is, int sz) +- yy_delete_buffer(struct yy_buffer_state *b) + +However, I couldn't figure out how to access current +buffer (yy_current_buffer). + +yy_current_buffer is a protected member of yyFlexLexer. +I can't access it directly. +Then, I thought yy_create_buffer() with is = 0 might +return current stream buffer. But it seems not as far +as I checked the source. (flex 2.5.4) + +I went through the Web in addition to Flex documentation. +However, it hasn't been successful, so far. 
+ +It is not my intention to bother you, but, can you +comment about how to obtain the current stream buffer? + +Your response would be highly appreciated. + +Best regards, +Aki Niimura + +--== Sent via Deja.com http://www.deja.com/ ==-- +Share what you know. Learn what you don't. +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-92 +@unnumberedsec unnamed-faq-92 +@example +@verbatim +To: neko@my-deja.com +Subject: Re: A question on flex C++ scanner +In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT. +Date: Tue, 15 Jun 1999 09:04:24 PDT +From: Vern Paxson <vern> + +> However, I couldn't figure out how to access current +> buffer (yy_current_buffer). + +Derive your own subclass from yyFlexLexer. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-93 +@unnumberedsec unnamed-faq-93 +@example +@verbatim +To: "Stones, Darren" <Darren.Stones@nectech.co.uk> +Subject: Re: You're the man to see? +In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT. +Date: Wed, 23 Jun 1999 09:01:40 PDT +From: Vern Paxson <vern> + +> I hope you can help me. I am using Flex and Bison to produce an interpreted +> language. However all goes well until I try to implement an IF statement or +> a WHILE. I cannot get this to work as the parser parses all the conditions +> eg. the TRUE and FALSE conditons to check for a rule match. So I cannot +> make a decision!! + +You need to use the parser to build a parse tree (= abstract syntax tree), +and when that's all done you recursively evaluate the tree, binding variables +to values at that time. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-94 +@unnumberedsec unnamed-faq-94 +@example +@verbatim +To: Petr Danecek <petr@ics.cas.cz> +Subject: Re: flex - question +In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT. +Date: Fri, 02 Jul 1999 16:52:13 PDT +From: Vern Paxson <vern> + +> file, it takes an enormous amount of time. 
It is funny, because the +> source code has only 12 rules!!! I think it looks like an exponencial +> growth. + +Right, that's the problem - some patterns (those with a lot of +ambiguity, which yours has because at any given time the scanner can +be in the middle of all sorts of combinations of the different +rules) blow up exponentially. + +For your rules, there is an easy fix. Change the ".*" that comes after +the directory name to "[^ ]*". With that in place, the rules are no +longer nearly so ambiguous, because then once one of the directories +has been matched, no other can be matched (since they all require a +leading blank). + +If that's not an acceptable solution, then you can enter a start state +to pick up the .*\n after each directory is matched. + +Also note that for speed, you'll want to add a ".*" rule at the end, +otherwise input that doesn't match any of the patterns will be matched +very slowly, a character at a time. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-95 +@unnumberedsec unnamed-faq-95 +@example +@verbatim +To: Tielman Koekemoer <tielman@spi.co.za> +Subject: Re: Please help. +In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT. +Date: Thu, 08 Jul 1999 08:20:39 PDT +From: Vern Paxson <vern> + +> I was hoping you could help me with my problem. +> +> I tried compiling (gnu)flex on a Solaris 2.4 machine +> but when I ran make (after configure) I got an error. +> +> -------------------------------------------------------------- +> gcc -c -I. -I. -g -O parse.c +> ./flex -t -p ./scan.l >scan.c +> sh: ./flex: not found +> *** Error code 1 +> make: Fatal error: Command failed for target `scan.c' +> ------------------------------------------------------------- +> +> What's strange to me is that I'm only +> trying to install flex now. I then edited the Makefile to +> and changed where it says "FLEX = flex" to "FLEX = lex" +> ( lex: the native Solaris one ) but then it complains about +> the "-p" option. 
Is there any way I can compile flex without +> using flex or lex? +> +> Thanks so much for your time. + +You managed to step on the bootstrap sequence, which first copies +initscan.c to scan.c in order to build flex. Try fetching a fresh +distribution from ftp.ee.lbl.gov. (Or you can first try removing +".bootstrap" and doing a make again.) + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-96 +@unnumberedsec unnamed-faq-96 +@example +@verbatim +To: Tielman Koekemoer <tielman@spi.co.za> +Subject: Re: Please help. +In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT. +Date: Fri, 09 Jul 1999 00:27:20 PDT +From: Vern Paxson <vern> + +> First I removed .bootstrap (and ran make) - no luck. I downloaded the +> software but I still have the same problem. Is there anything else I +> could try. + +Try: + + cp initscan.c scan.c + touch scan.c + make scan.o + +If this last tries to first build scan.c from scan.l using ./flex, then +your "make" is broken, in which case compile scan.c to scan.o by hand. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-97 +@unnumberedsec unnamed-faq-97 +@example +@verbatim +To: Sumanth Kamenani <skamenan@crl.nmsu.edu> +Subject: Re: Error +In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT. +Date: Tue, 20 Jul 1999 00:18:26 PDT +From: Vern Paxson <vern> + +> I am getting a compilation error. The error is given as "unknown symbol- yylex". + +The parser relies on calling yylex(), but you're instead using the C++ scanning +class, so you need to supply a yylex() "glue" function that calls an instance +scanner of the scanner (e.g., "scanner->yylex()"). + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-98 +@unnumberedsec unnamed-faq-98 +@example +@verbatim +To: daniel@synchrods.synchrods.COM (Daniel Senderowicz) +Subject: Re: lex +In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST. 
+Date: Tue, 23 Nov 1999 15:54:30 PST +From: Vern Paxson <vern> + +Well, your problem is the + +switch (yybgin-yysvec-1) { /* witchcraft */ + +at the beginning of lex rules. "witchcraft" == "non-portable". It's +assuming knowledge of the AT&T lex's internal variables. + +For flex, you can probably do the equivalent using a switch on YYSTATE. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-99 +@unnumberedsec unnamed-faq-99 +@example +@verbatim +To: archow@hss.hns.com +Subject: Re: Regarding distribution of flex and yacc based grammars +In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530. +Date: Wed, 22 Dec 1999 01:56:24 PST +From: Vern Paxson <vern> + +> When we provide the customer with an object code distribution, is it +> necessary for us to provide source +> for the generated C files from flex and bison since they are generated by +> flex and bison ? + +For flex, no. I don't know what the current state of this is for bison. + +> Also, is there any requrirement for us to neccessarily provide source for +> the grammar files which are fed into flex and bison ? + +Again, for flex, no. + +See the file "COPYING" in the flex distribution for the legalese. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. +@node unnamed-faq-100 +@unnumberedsec unnamed-faq-100 +@example +@verbatim +To: Martin Gallwey <gallweym@hyperion.moe.ul.ie> +Subject: Re: Flex, and self referencing rules +In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST. +Date: Sat, 19 Feb 2000 18:33:16 PST +From: Vern Paxson <vern> + +> However, I do not use unput anywhere. I do use self-referencing +> rules like this: +> +> UnaryExpr ({UnionExpr})|("-"{UnaryExpr}) + +You can't do this - flex is *not* a parser like yacc (which does indeed +allow recursion), it is a scanner that's confined to regular expressions. + + Vern +@end verbatim +@end example + +@c TODO: Evaluate this faq. 
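Although @code{flex} definitions cannot refer to themselves, the particular recursion quoted in unnamed-faq-100 only recurses at the tail, so the same strings are described by the ordinary regular expression @code{"-"*@{UnionExpr@}}. The C sketch below illustrates that equivalence by hand; it is not flex output. @code{match_union_expr} is a hypothetical stand-in that treats @code{UnionExpr} as one or more letters, since the original definition is not shown in the message.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for {UnionExpr}: one or more letters.
 * Returns the length of the match, or 0 if there is none. */
static size_t match_union_expr(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/* Recursive form, as a parser would express it:
 *   UnaryExpr: UnionExpr | '-' UnaryExpr                      */
static size_t match_unary_recursive(const char *s)
{
    assert(s != NULL);
    if (s[0] == '-') {
        size_t rest = match_unary_recursive(s + 1);
        return rest ? rest + 1 : 0;
    }
    return match_union_expr(s);
}

/* Iterative form, corresponding to the pattern "-"*{UnionExpr}. */
static size_t match_unary_iterative(const char *s)
{
    size_t minus = 0, body;

    assert(s != NULL);
    while (s[minus] == '-')
        minus++;
    body = match_union_expr(s + minus);
    return body ? minus + body : 0;
}
```

Both functions accept the same prefixes. Recursion that nests in the middle of a rule (balanced parentheses, for instance) has no such rewrite and belongs in the parser.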
+@node unnamed-faq-101 +@unnumberedsec unnamed-faq-101 +@example +@verbatim +To: slg3@lehigh.edu (SAMUEL L. GULDEN) +Subject: Re: Flex problem +In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST. +Date: Thu, 02 Mar 2000 23:00:46 PST +From: Vern Paxson <vern> + +If this is exactly your program: + +> digit [0-9] +> digits {digit}+ +> whitespace [ \t\n]+ +> +> %% +> "[" { printf("open_brac\n");} +> "]" { printf("close_brac\n");} +> "+" { printf("addop\n");} +> "*" { printf("multop\n");} +> {digits} { printf("NUMBER = %s\n", yytext);} +> whitespace ; + +then the problem is that the last rule needs to be "{whitespace}" ! + + Vern +@end verbatim +@end example + +@node What is the difference between YYLEX_PARAM and YY_DECL? +@unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL? + +YYLEX_PARAM is not a flex symbol. It is for Bison. It tells Bison to pass extra +parameters when it calls yylex() from the parser. + +YY_DECL is the Flex declaration of yylex. The default is similar to this: + +@example +@verbatim +#define YY_DECL int yylex () +@end verbatim +@end example + + +@node Why do I get "conflicting types for yylex" error? +@unnumberedsec Why do I get "conflicting types for yylex" error? + +This is a compiler error regarding a generated Bison parser, not a Flex scanner. +It means you need a prototype of yylex() at the top of the Bison file. +Be sure the prototype matches YY_DECL. + +@node How do I access the values set in a Flex action from within a Bison action? +@unnumberedsec How do I access the values set in a Flex action from within a Bison action? + +With $1, $2, $3, etc. These are called "Semantic Values" in the Bison manual. +See @ref{Top, , , bison, the GNU Bison Manual}. 
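To make the hand-off concrete, here is a small C sketch that imitates it without using Bison at all. Everything in it is a hypothetical stand-in: @code{SemanticValue} plays the role of @code{YYSTYPE}, @code{next_token} plays the role of @code{yylex}, and @code{parse_sum} plays the role of the action for a rule such as @code{sum: NUMBER '+' NUMBER}. It is meant only to show where the values behind @code{$1} and @code{$3} come from.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for what bison would generate in y.tab.h. */
enum { TOK_EOF = 0, TOK_NUMBER = 258 };
typedef union { int num; } SemanticValue;   /* plays the role of YYSTYPE */

/*
 * Plays the role of yylex(): returns a token code and, for TOK_NUMBER,
 * stores the semantic value through lvalp, much as a flex action
 * stores into yylval.
 */
static int next_token(const char **cursor, SemanticValue *lvalp)
{
    assert(cursor != NULL && *cursor != NULL && lvalp != NULL);

    while (**cursor == ' ')
        (*cursor)++;
    if (**cursor == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char)**cursor)) {
        int n = 0;
        while (isdigit((unsigned char)**cursor))
            n = n * 10 + (*(*cursor)++ - '0');
        lvalp->num = n;
        return TOK_NUMBER;
    }
    return (unsigned char)*(*cursor)++;     /* single-character token such as '+' */
}

/*
 * Plays the role of the parser action for:  sum: NUMBER '+' NUMBER
 * The two saved values correspond to $1 and $3.
 * Returns -1 on a syntax error.
 */
static int parse_sum(const char *text)
{
    SemanticValue v1, v3, unused;

    if (next_token(&text, &v1) != TOK_NUMBER) return -1;
    if (next_token(&text, &unused) != '+')    return -1;
    if (next_token(&text, &v3) != TOK_NUMBER) return -1;
    return v1.num + v3.num;                  /* comparable to { $$ = $1 + $3; } */
}
```

In a real scanner and parser pair, Bison keeps these values on its own stack; the sketch only shows that each value is whatever the scanner stored when it returned the corresponding token.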
+ +@node Appendices, Indices, FAQ, Top +@appendix Appendices + +@menu +* Makefiles and Flex:: +* Bison Bridge:: +* M4 Dependency:: +* Common Patterns:: +@end menu + +@node Makefiles and Flex, Bison Bridge, Appendices, Appendices +@appendixsec Makefiles and Flex + +@cindex Makefile, syntax + +In this appendix, we provide tips for writing Makefiles to build your scanners. + +In a traditional build environment, we say that the @file{.c} files are the +sources, and the @file{.o} files are the intermediate files. When using +@code{flex}, however, the @file{.l} files are the sources, and the generated +@file{.c} files (along with the @file{.o} files) are the intermediate files. +This requires you to carefully plan your Makefile. + +Modern @command{make} programs understand that @file{foo.l} is intended to +generate @file{lex.yy.c} or @file{foo.c}, and will behave +accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such +programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake} +may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want, +then you should provide an explicit rule in your Makefile.am}. The +following Makefile does not explicitly instruct @command{make} how to build +@file{foo.c} from @file{foo.l}. Instead, it relies on the implicit rules of the +@command{make} program to build the intermediate file, @file{scan.c}: + +@cindex Makefile, example of implicit rules +@example +@verbatim + # Basic Makefile -- relies on implicit rules + # Creates "myprogram" from "scan.l" and "myprogram.c" + # + LEX=flex + myprogram: scan.o myprogram.o + scan.o: scan.l + +@end verbatim +@end example + + +For simple cases, the above may be sufficient. For other cases, +you may have to explicitly instruct @command{make} how to build your scanner. 
+The following is an example of a Makefile containing explicit rules: + +@cindex Makefile, explicit example +@example +@verbatim + # Basic Makefile -- provides explicit rules + # Creates "myprogram" from "scan.l" and "myprogram.c" + # + LEX=flex + myprogram: scan.o myprogram.o + $(CC) -o $@ $(LDFLAGS) $^ + + myprogram.o: myprogram.c + $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ + + scan.o: scan.c + $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ + + scan.c: scan.l + $(LEX) $(LFLAGS) -o $@ $^ + + clean: + $(RM) *.o scan.c + +@end verbatim +@end example + +Notice in the above example that @file{scan.c} is in the @code{clean} target. +This is because we consider the file @file{scan.c} to be an intermediate file. + +Finally, we provide a realistic example of a @code{flex} scanner used with a +@code{bison} parser@footnote{This example also applies to yacc parsers.}. +There is a tricky problem we have to deal with. Since a @code{flex} scanner +will typically include a header file (e.g., @file{y.tab.h}) generated by the +parser, we need to be sure that the header file is generated BEFORE the scanner +is compiled. We handle this case in the following example: + +@example +@verbatim + # Makefile example -- scanner and parser. + # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c" + # + LEX = flex + YACC = bison -y + YFLAGS = -d + objects = scan.o parse.o myprogram.o + + myprogram: $(objects) + scan.o: scan.l parse.c + parse.o: parse.y + myprogram.o: myprogram.c + +@end verbatim +@end example + +In the above example, notice the line, + +@example +@verbatim + scan.o: scan.l parse.c +@end verbatim +@end example + +, which lists the file @file{parse.c} (the generated parser) as a dependency of +@file{scan.o}. We want to ensure that the parser is created before the scanner +is compiled, and the above line seems to do the trick. Feel free to experiment +with your specific implementation of @command{make}. 
+ + +For more details on writing Makefiles, see @ref{Top, , , make, The +GNU Make Manual}. + +@node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices +@section C Scanners with Bison Parsers + +@cindex bison, bridging with flex +@vindex yylval +@vindex yylloc +@tindex YYLTYPE +@tindex YYSTYPE + +This section describes the @code{flex} features useful when integrating +@code{flex} with @code{GNU bison}@footnote{The features described here are +purely optional, and are by no means the only way to use flex with bison. +We merely provide some glue to ease development of your parser-scanner pair.}. +Skip this section if you are not using +@code{bison} with your scanner. Here we discuss only the @code{flex} +half of the @code{flex} and @code{bison} pair. We do not discuss +@code{bison} in any detail. For more information about generating +@code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}. + +A compatible @code{bison} scanner is generated by declaring @samp{%option +bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex} +from the command line. This instructs @code{flex} that the macro +@code{yylval} may be used. The data type for +@code{yylval}, @code{YYSTYPE}, +is typically defined in a header file, included in section 1 of the +@code{flex} input file. For a list of functions and macros +available, @xref{bison-functions}. + +The declaration of yylex becomes, + +@findex yylex (reentrant version) +@example +@verbatim + int yylex ( YYSTYPE * lvalp, yyscan_t scanner ); +@end verbatim +@end example + +If @code{%option bison-locations} is specified, then the declaration +becomes, + +@findex yylex (reentrant version) +@example +@verbatim + int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner ); +@end verbatim +@end example + +Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers. +Support for @code{yylloc} is optional in @code{bison}, so it is optional in +@code{flex} as well. 
The following is an example of a @code{flex} scanner that +is compatible with @code{bison}. + +@cindex bison, scanner to be called from bison +@example +@verbatim + /* Scanner for "C" assignment statements... sort of. */ + %{ + #include <stdlib.h> /* atoi */ + #include <string.h> /* strdup */ + #include "y.tab.h" /* Generated by bison. */ + %} + + %option bison-bridge bison-locations + %% + + [[:digit:]]+ { yylval->num = atoi(yytext); return NUMBER;} + [[:alnum:]]+ { yylval->str = strdup(yytext); return STRING;} + "="|";" { return yytext[0];} + . {} + %% +@end verbatim +@end example + +As you can see, there really is no magic here. We just use +@code{yylval} as we would any other variable. The data type of +@code{yylval} is generated by @code{bison}, and included in the file +@file{y.tab.h}. Here is the corresponding @code{bison} parser: + +@cindex bison, parser +@example +@verbatim + /* Parser to convert "C" assignments to lisp. */ + %{ + /* Pass the argument to yyparse through to yylex. */ + #define YYPARSE_PARAM scanner + #define YYLEX_PARAM scanner + %} + %locations + %pure_parser + %union { + int num; + char* str; + } + %token <str> STRING + %token <num> NUMBER + %% + assignment: + STRING '=' NUMBER ';' { + printf( "(setf %s %d)", $1, $3 ); + } + ; +@end verbatim +@end example + +@node M4 Dependency, Common Patterns, Bison Bridge, Appendices +@section M4 Dependency +@cindex m4 +The macro processor @code{m4}@footnote{The use of m4 is subject to change in +future revisions of flex. It is not part of the public API of flex. Do not depend on it.} +must be installed wherever flex is installed. +@code{flex} invokes @samp{m4}, found by searching the directories in the +@code{PATH} environment variable. Any code you place in section 1 or in the +actions will be sent through m4. Please follow these rules to protect your +code from unwanted @code{m4} processing. 
+ +@itemize + +@item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define} +or @samp{m4_include}, since those are reserved for @code{m4} macro names. 
If for +some reason you need m4_ as a prefix, use a preprocessor #define to get your +symbol past m4 unmangled. + +@item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The +former is not valid in C, except within comments and strings, but the latter is valid in +code such as @code{x[y[z]]}. The solution is simple. To get the literal string +@code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]}, +use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and +escape them. However, it's best to avoid this complexity where possible, by +removing such sequences from your code. + +@end itemize + +@code{m4} is only required at the time you run @code{flex}. The generated +scanner is ordinary C or C++, and does @emph{not} require @code{m4}. + +@node Common Patterns, ,M4 Dependency, Appendices +@section Common Patterns +@cindex patterns, common + +This appendix provides examples of common regular expressions you might use +in your scanner. + +@menu +* Numbers:: +* Identifiers:: +* Quoted Constructs:: +* Addresses:: +@end menu + + +@node Numbers, Identifiers, ,Common Patterns +@subsection Numbers + +@table @asis + +@item C99 decimal constant +@code{([[:digit:]]@{-@}[0])[[:digit:]]*} + +@item C99 hexadecimal constant +@code{0[xX][[:xdigit:]]+} + +@item C99 octal constant +@code{0[01234567]*} + +@item C99 floating point constant +@verbatim + {dseq} ([[:digit:]]+) + {dseq_opt} ([[:digit:]]*) + {frac} (({dseq_opt}"."{dseq})|{dseq}".") + {exp} ([eE][+-]?{dseq}) + {exp_opt} ({exp}?) + {fsuff} [flFL] + {fsuff_opt} ({fsuff}?) 
+ {hpref} (0[xX]) + {hdseq} ([[:xdigit:]]+) + {hdseq_opt} ([[:xdigit:]]*) + {hfrac} (({hdseq_opt}"."{hdseq})|({hdseq}".")) + {bexp} ([pP][+-]?{dseq}) + {dfc} (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt})) + {hfc} (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt})) + + {c99_floating_point_constant} ({dfc}|{hfc}) +@end verbatim + +See C99 section 6.4.4.2 for the gory details. 
+ +@end table + +@node Identifiers, Quoted Constructs, Numbers, Common Patterns +@subsection Identifiers + +@table @asis + +@item C99 Identifier +@verbatim +ucn ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8}))) +nondigit [_[:alpha:]] +c99_id ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})* +@end verbatim + +Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for +"implementation-defined" characters. In practice, C compilers follow the above pattern, with the +addition of the @samp{$} character. + +@item UTF-8 Encoded Unicode Code Point +@verbatim +[\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2}) +@end verbatim + +@end table + +@node Quoted Constructs, Addresses, Identifiers, Common Patterns +@subsection Quoted Constructs + +@table @asis +@item C99 String Literal +@code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"} + +@item C99 Comment +@code{("/*"([^*]|"*"[^/])*"*/")|("/"(\\\n)*"/"[^\n]*)} + +Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief, +does not include the trailing @samp{\n} character. + +A better way to scan @samp{/* */} comments is by line, rather than matching +possibly huge comments all at once. This will allow you to scan comments of +unlimited length, as long as line breaks appear at sane intervals. This is also +more efficient when used with automatic line number processing. @xref{option-yylineno}. 
+ +@verbatim +<INITIAL>{ + "/*" BEGIN(COMMENT); +} +<COMMENT>{ + "*/" BEGIN(0); + [^*\n]+ ; + "*"[^/] ; + \n ; +} +@end verbatim + +@end table + +@node Addresses, ,Quoted Constructs, Common Patterns +@subsection Addresses + +@table @asis + +@item IPv4 Address +@code{(([[:digit:]]@{1,3@}".")@{3@}([[:digit:]]@{1,3@}))} + +@item IPv6 Address +@verbatim +hex4 ([[:xdigit:]]{1,4}) +hexseq ({hex4}(:{hex4})*) +hexpart ({hexseq}|({hexseq}::({hexseq}?))|::{hexseq}) +IPv6address ({hexpart}(":"{IPv4address})?) +@end verbatim + +See RFC2373 for details. + +@item URI +@code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?} + +This pattern is nearly useless, since it allows just about any character to +appear in a URI, including spaces and control characters. See RFC2396 for +details. + +@end table + + +@node Indices, , Appendices, Top +@unnumbered Indices + +@menu +* Concept Index:: +* Index of Functions and Macros:: +* Index of Variables:: +* Index of Data Types:: +* Index of Hooks:: +* Index of Scanner Options:: +@end menu + +@node Concept Index, Index of Functions and Macros, Indices, Indices +@unnumberedsec Concept Index + +@printindex cp + +@node Index of Functions and Macros, Index of Variables, Concept Index, Indices +@unnumberedsec Index of Functions and Macros + +This is an index of functions and preprocessor macros that look like functions. +For macros that expand to variables or constants, see @ref{Index of Variables}. + +@printindex fn + +@node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices +@unnumberedsec Index of Variables + +This is an index of variables, constants, and preprocessor macros +that expand to variables or constants. 
+ +@printindex vr + +@node Index of Data Types, Index of Hooks, Index of Variables, Indices +@unnumberedsec Index of Data Types +@printindex tp + +@node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices +@unnumberedsec Index of Hooks + +This is an index of "hooks" that the user may define. These hooks typically correspond +to specific locations in the generated scanner, and may be used to insert arbitrary code. + +@printindex hk + +@node Index of Scanner Options, , Index of Hooks, Indices +@unnumberedsec Index of Scanner Options + +@printindex op + +@c A vim script to name the faq entries. delete this when faqs are no longer +@c named "unnamed-faq-XXX". +@c +@c fu! Faq2 () range abort +@c let @r=input("Rename to: ") +@c exe "%s/" . @w . "/" . @r . "/g" +@c normal 'f +@c endf +@c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr> + +@bye |