CSCE 434 Lecture 4

« previous | Monday, September 2, 2013 | next »

Lecture Slides

Compilers

Abstract View of a compiler:

Input: source code
Output: errors and machine code

We are still pretty far away from computers understanding natural language

Implications:

Recognize legal (and illegal) programs
Generate correct code (for legal programs)
Manage storage ^[1] of variables and code
We need format for object/machine code

Compilers have two components: front and back ends

Front End

The first traversal of source code

Accepts source code and converts it into an intermediate ~~language~~ representation (usually some data structure), abbreviated IL

IL can be manipulated to produce functionally equivalent optimizations

Keep separate the algorithm and the implementation: The front end only understands the implementation. Algorithm recognition is reducible to graph recognition, which is NP-complete

Further consists of two sub-components: a scanner and a parser

Scanner

Accepts source code and outputs "tokens"

Note: terminology is borrowed from language analysis

Tokens: "x = x + y;" becomes <id, x> = <id, x> + <id, y> ;

Character string for a token is a lexeme

eliminates whitespace (tabs, blanks, comments, unnecessary characters)

Generates errors for malformed tokens.

Speed is key issue: scanner has to read code quickly

Other parts:

identifiers: alphabetic followed by $k$ alphanumerics
numbers: 0 or digit from 1–9 followed by digits from 0–9; decimals: integer '.' digits from 0–9; reals: (decimal or integer) 'E' (+ or -) digits from 0–9; complex: '(' real ',' real ')'; precision is an issue: hardware might not support numbers that big

Parser

Accepts tokens and constructs IL

Puts words (tokens) into sentences based on language rules

Back End

maps IL onto target machine

Simplify retargeting (strategy pattern; allow swapping out front and back ends):

multiple front ends (taking in multiple languages like C, C++, Fortran, etc.) multiple back ends

Mapping any-to-any is not possible: Computer Hardware architecture may imply a certain style of programming, which not all languages can support. Intermediate representation must include support for all recognized languages

Regular Expressions

Let $\Sigma$ be an alphabet.

$\epsilon$ is a RE denoting the set $\{\epsilon \}$
if $a\in \Sigma$ , then $a$ is a RE denoting $\{a\}$
if $r$ and $s$ are REs, $L(r)$ , (unfinished)

Footnotes

↑ storage is an artifact of computing, we don't know any other way to compute

[1] storage is an artifact of computing, we don't know any other way to compute

[1]