CSCE 434 Lecture 4
« previous | Monday, September 2, 2013 | next »
Compilers
Abstract View of a compiler:
Input: source code Output: errors and machine code
We are still pretty far away from computers understanding natural language
Implications:
- Recognize legal (and illegal) programs
- Generate correct code (for legal programs)
- Manage storage [1] of variables and code
- We need format for object/machine code
Compilers have two components: front and back ends
Front End
The first traversal of source code
Accepts source code and converts it into an intermediate language representation (usually some data structure), abbreviated IL
IL can be manipulated to produce functionally equivalent optimizations
Keep separate the algorithm and the implementation: The front end only understands the implementation. Algorithm recognition is reducible to graph recognition, which is NP-complete
Further consists of two sub-components: a scanner and a parser
Scanner
Accepts source code and outputs "tokens"
Tokens: "x = x + y;" becomes <id, x> = <id, x> + <id, y> ;
Character string for a token is a lexeme
eliminates whitespace (tabs, blanks, comments, unnecessary characters)
Generates errors for malformed tokens.
Speed is key issue: scanner has to read code quickly
Other parts:
- identifiers
- alphabetic followed by alphanumerics
- numbers
- 0 or digit from 1–9 followed by digits from 0–9
- decimals: integer '.' digits from 0–9
- reals: (decimal or integer) 'E' (+ or -) digits from 0–9
- complex: '(' real ',' real ')'
- precision is an issue: hardware might not support numbers that big
Parser
Accepts tokens and constructs IL
Puts words (tokens) into sentences based on language rules
Back End
maps IL onto target machine
Simplify retargeting (strategy pattern; allow swapping out front and back ends):
multiple front ends (taking in multiple languages like C, C++, Fortran, etc.) multiple back ends
Mapping any-to-any is not possible: Computer Hardware architecture may imply a certain style of programming, which not all languages can support. Intermediate representation must include support for all recognized languages
Regular Expressions
Let be an alphabet.
- is a RE denoting the set
- if , then is a RE denoting
- if and are REs, , (unfinished)
Footnotes
- ↑ storage is an artifact of computing, we don't know any other way to compute