Last Lecture Revisited What and Why to study this course? Def. of translator and compiler Structure of compiler, front-end, back-end “Pass”
Chapter 2. Lexical Analysis Lexical Analyzer Syntax Analyzer Symbol Table Token Get next token Source Code 词法分析器 语法分析器 符号表 记号 取下一个记号 源程序 2
What lexical analysis does: Lexical analyzer Tokens Source code
2.1 Token and attributes 2.1.1 Token、pattern、lexeme Lexeme:strings in source code。 Token:a token represent a type of strings subject to some rules (patterns) 。 Pattern: determines how a token relates to lexemes letters sequences lexeme token pattern
An example letters token var count : integer ; count = 5 ; lexeme sequences lexeme token pattern eg: var count : integer ; count = 5 ; lexeme
Token Examples Description of Patterns var var var for for for Tokens and patterns Token Examples Description of Patterns var var var for for for relation < , < = , = , … < or <= or = or … id sum, count, D5 letter and number string starting with a letter num 3.1, 10, 2.8 E12 any numerical value literal “seg. error” the string within “ ”
noun Dalian, Sea names of things Token Examples Pattern description relation < , < = , = , … < or <= or = or … id sum, count, D5 letter and number string starting with a letter Token Examples Pattern description Chinese Zemin Jiang male/female with China nationality American Clinton,Obama male/female with USA nationality
2.1 词法记号及属性 Some issues in the history Should we omit the blanks or not? DO 8 I 3. 75 DO8I 3. 75 DO 8 I 3, 75 Should we reserve the keywords or not? IF THEN THEN THEN=ELSE;ELSE …
2.1.2 Attributes of tokens position := initial + rate * 60 id,pointer to the position item in the symble table assign _ op, id,pointer to the initial item in the symble table add_op,+ id,pointer to the rate item in the symble table mul_ op, * num,60
Exercise of symbol table: Write a piece of code to calculate the number of words, the number of distinct words. eg: I love Dalian and I love DLUT Word number:7 Distinct word number:5
2.1.3 Lexical Errors It’s hard to find the following error, since lexical analysis is locally insightful. fi (a == f (x) ) … An exception: 123.
2.1.3 Lexical Errors Recovery “emergency model” “recovery try” delete an additional character insert a missing character replace an invalid character with an valid one swap two adjacent characters
2.2 Description and Identification of Tokens 2.2.1 String and Language Alphabet table:set of symbol,eg. = {0,1} String:sequence of symbols, eg.0110, Language:set of strings {,0,00,000,…}, {}, Sentence:string in a language empty string 字母 组合 串 语言 集合 字母表 词法单元 模式 记号
Operators over strings 2.2.1 String and Language Operators over strings concatenation : xy,s = s = s Product(exponent): s0=,si=si -1s(i > 0)
Operators over languages plus:L∪M = {s | s L 或 s M } concatenation :LM = {st | s L 且 t M} exponent:L0是{ },Li是Li -1L closure:L = L0 ∪ L1 ∪ L2 ∪… positive closure: L+ = L1 ∪ L2 ∪… e.g. 2.2(p14) L: { A, B, …, Z, a, b, …, z }, D: { 0, 1, …, 9 } L∪D, LD, L6, L*, L(L∪D )*, D+
2.2.2 Regular Expression (RE) 正规式是用于说明词法单元如何到对应到词法记号的模式。与非形式化的描述相比,正规式更具形式化,更加精确。 2.2.2 Regular Expression (RE) Regular expression: According to a set of rules, formal regulations show how to form languages from scratch.
RE language defined description {} a {a} a (r) | (s) L(r)∪L(s) r and s are FEs (r)(s) L(r)L(s) r and s are FEs (r)* (L(r))* r is a FE (r) L(r) r is a FE Priorities of operators: * > concatenation > |
Some examples of RE = {a, b} a | b {a, b} (a | b) (a | b ) {aa, ab, ba, bb} aa | ab | ba | bb {aa, ab, ba, bb} a* all the strings consisting of ‘a’ (a | b)* all the strings consisting of ‘a’ and ‘b’
这样就保证了,每个名字对应的正规式中使用的各种符号已经在前面定义了,从而可以避免递归定义的情况。 2.2.3 Names of RE For brevity, we can name some REs。 d1 r1 d2 r2 . . . dn rn di must be different from each other Every ri must be a FE over {d1, d2, …, di-1 } 这样就保证了,每个名字对应的正规式中使用的各种符号已经在前面定义了,从而可以避免递归定义的情况。
Examples of RE IDs in Pascal letter A | B | … | Z | a | b | … | z digit 0 | 1 | … | 9 id letter(letter|digit)*
Unsigned numbers in Pascal Simplified notations: r+=rr* r?=r| [a-z]=a|b|c|…|z Unsigned numbers in Pascal 1946,11.28,63.6E8,1.99E6 digit 0 | 1 | … | 9 digits digit digit* optional_fraction .digits| optional_exponent (E ( + | | ) digits ) | numdigits optional_fraction optional_exponent Simplified notations: num digit+ (.digit+)? (E(+|)? digit+)?
relop < | < = | = | < > | > | > = while while do do relop < | < = | = | < > | > | > = id letter (letter | digit )* num digit+ (.digit+)? (E (+ | )? digit+)? delim blank | tab | newline ws delim+ Actually, the tokens are the RE names
Summary 源程序字符流 顺序组合 词法单元 词法记号 模式 非形式化描述 形式化描述 正规式 字母 组合 串 语言 集合 字母表 名字 词法分析器工作原理: 源程序字符流 顺序组合 词法单元 词法记号 模式 非形式化描述 形式化描述 正规式 字母 组合 串 语言 集合 字母表 名字 词法分析器 记号(token)流 源代码