Compilers for High-Level Languages - Introduction
In this article, you will read about the design and operation of compilers for high-level programming languages. Let us begin with the basic compiler functions.
Several operations are essential for the compilation of a high-level computer program. We describe these functions in terms of a grammar for the high-level language, which serves as the basis for constructing a compiler.
In this blog, we will see the working of a single-pass compiler. However, compilers often need to make several passes over the source program in order to compile sophisticated code and to optimize it; multiple passes are also required by some languages and for program analysis.
Rather than specifying the source program as a character string, it is convenient to regard it as a sequence of tokens. Tokens are the fundamental building blocks of any language. For instance, a variable name, a keyword, an arithmetic operator, and an integer are all forms of tokens. What is lexical analysis? It is the task of scanning the source statements and recognizing and classifying the various tokens. The part of the compiler assigned this task is called the scanner.
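As a rough illustration of token classification, the sketch below breaks a simple statement into tokens. The token categories, keywords, and regular expressions here are assumptions chosen for the example, not the scheme of any particular compiler.

```python
import re

# Illustrative token categories; a real scanner would cover the full language.
# Order matters: KEYWORD is tried before IDENTIFIER.
TOKEN_PATTERNS = [
    ("KEYWORD",    r"(?:READ|WRITE|BEGIN|END)\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER",    r"\d+"),
    ("OPERATOR",   r":=|[+\-*/(),]"),
]

def classify(statement):
    """Return (kind, text) pairs for each token in the statement."""
    master = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_PATTERNS))
    return [(m.lastgroup, m.group()) for m in master.finditer(statement)]

print(classify("ALPHA := BETA + 100"))
```

Note that spacing does not matter to the scanner: `READ(VALUE)` yields the keyword `READ`, the operator `(`, the identifier `VALUE`, and the operator `)`.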
There are several subtopics under the topic of compilers which students need to read and understand for a better grasp of computer programming. One of the most important is the concept of interpreting and debugging a computer program.
After the tokens have been scanned, the language constructs must be recognized. This involves tasks such as recognizing statements and declarations as described by the grammar. This process is called parsing, and the part of the compiler that performs it is called the parser.
Now we have reached the last step of the basic translation process: object code generation. Rather than creating a symbolic program for later translation by an assembler, most compilers directly create programs in machine language.
Don’t confuse the word with its everyday meaning. For a programming language, a grammar is a formal description of the form, or syntax, of the individual statements and programs of the language. The grammar does not contain the semantics, or meaning, of the statements; that knowledge must be supplied by the code-generation routines. Are you confused about the difference between syntax and semantics? Let us consider an example.
I := J + K   and   X := Y + I
Here, I, J, and K are integer variables, while X and Y are real variables. The syntax of the two statements is identical: each is an assignment statement in which the value to be assigned is given by an expression consisting of two variables separated by the operator +. The semantics of the two expressions, however, are quite different.
The first statement specifies that the two variables in the expression are to be added using integer arithmetic operations.
The second statement specifies a floating-point addition: the value of the integer variable I must be converted to floating-point form before the addition of the two variables can be carried out.
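The difference in semantics can be made concrete with a small sketch of a type-driven code generator. The symbol table, the pseudo-instruction names (ADDI, ADDF, FLOAT, STORE), and the helper function below are hypothetical, invented only to show that syntactically identical statements can compile into different instruction sequences.

```python
# Hypothetical symbol table mapping each variable name to its type.
SYMTAB = {"I": "int", "J": "int", "K": "int", "X": "real", "Y": "real"}

def gen_add_assign(target, left, right):
    """Emit pseudo-instructions for `target := left + right`."""
    code = []
    if SYMTAB[left] == SYMTAB[right] == SYMTAB[target] == "int":
        code.append(f"ADDI {left}, {right}")     # pure integer addition
    else:
        for var in (left, right):
            if SYMTAB[var] == "int":
                code.append(f"FLOAT {var}")      # convert integer operand to real
        code.append(f"ADDF {left}, {right}")     # floating-point addition
    code.append(f"STORE {target}")
    return code

print(gen_add_assign("I", "J", "K"))   # integer arithmetic only
print(gen_add_assign("X", "Y", "I"))   # conversion, then real arithmetic
```

The same surface syntax `a := b + c` thus yields `ADDI` in one case and `FLOAT` followed by `ADDF` in the other, purely because of the operand types.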
This implies that the compilation of the two statements will be different: different machine instructions are required. The grammar describes only the form of the statements; the difference between the two appears during code generation. There are several notations for writing grammars. One of these is BNF (Backus-Naur Form). What does a grammar specify? The legal statements of the language. BNF is not powerful enough to describe every feature of real programming languages, but it has the prominent advantages of being simple, widely used, and sufficient for many purposes.
A BNF grammar consists of a set of rules, each of which defines the syntax of some construct in the programming language. Consider, for example, the following rule from a grammar for Pascal:
&lt;read&gt; ::= READ ( &lt;id-list&gt; )
• This is the grammar rule for a &lt;read&gt; statement.
• The symbol ::= can be read as “is defined to be”.
• On the left of ::= is the language construct being defined; on the right is a description of its syntax.
• Character strings enclosed between angular brackets are called nonterminal symbols. These are generalized names given to constructs of the grammar.
• Entries not enclosed in angular brackets are the grammar’s terminal symbols.
In the given example, &lt;read&gt; and &lt;id-list&gt; are nonterminal symbols, whereas the tokens READ, (, and ) are terminal symbols. The rule states that a &lt;read&gt; statement consists of the token READ, followed by the token (, followed by an &lt;id-list&gt;, followed by the token ).
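A grammar rule like this maps naturally onto a recognizer routine in the parser. The sketch below checks whether a token sequence matches the rule for a &lt;read&gt; statement; the list-of-strings token representation, and the assumption that an &lt;id-list&gt; is one or more identifiers separated by commas, are both made up for illustration.

```python
def match_read(tokens):
    """Return True if `tokens` matches:  READ ( <id-list> )
    where <id-list> is assumed to be one or more identifiers
    separated by commas."""
    if len(tokens) < 4 or tokens[0] != "READ" or tokens[1] != "(" or tokens[-1] != ")":
        return False
    id_list = tokens[2:-1]
    # Identifiers sit at even positions, commas between them,
    # so a valid <id-list> always has odd length.
    if len(id_list) % 2 == 0:
        return False
    for i, tok in enumerate(id_list):
        if i % 2 == 1:                 # positions 1, 3, ... must be commas
            if tok != ",":
                return False
        elif not tok.isalpha():        # positions 0, 2, ... must be identifiers
            return False
    return True

print(match_read(["READ", "(", "VALUE", ")"]))        # True
print(match_read(["READ", "(", "A", ",", "B", ")"]))  # True
print(match_read(["READ", "VALUE", ")"]))             # False
```

Each nonterminal on the right-hand side of a rule would, in a full recursive-descent parser, become a call to its own recognizer routine rather than the inline check used here.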
The blank spaces in the given example are not significant; their only purpose is to improve readability.
Lexical analysis is the process of scanning the program to be compiled and recognizing the tokens that make up its source statements. The scanner is designed to recognize operators, keywords, identifiers, character strings, integers, floating-point numbers, and the other items that can appear in the source program. The exact set of tokens to be recognized depends on the programming language being compiled and the grammar that describes it.
Items such as identifiers and integers are usually recognized directly by the scanner as single tokens. Alternatively, these tokens could be defined as part of the grammar; in that case the scanner would treat each individual character, such as 0, 1, A, B, and so on, as a token. The parser would then have to interpret such character sequences as identifiers, using its general parsing techniques. However, the scanner can perform this recognition far more efficiently than the parser.
Since a large part of the source program consists of multiple-character identifiers, this is a significant saving in compilation time. It is also easier to enforce restrictions such as a limited identifier length in the scanner than in a general-purpose parsing routine.
The scanner therefore recognizes both single- and multiple-character tokens directly. The output of the scanner is a sequence of tokens. For efficiency of later use, each token is usually represented by a fixed-length code, such as an integer, rather than as a variable-length character string.
What happens when the scanned token is a keyword or an identifier?
When the scanned token is a keyword or an operator, such a coding scheme supplies sufficient information. When it is an identifier, however, it is also necessary to specify the particular identifier that was scanned. The same is true for integers, floating-point values, character-string constants, and so on. This is accomplished by associating a token specifier with the type code for such tokens.
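One way to picture this coding scheme is the sketch below, which pairs each token’s type code with a specifier when one is needed. The particular integer codes and the token set are invented for illustration; a real compiler would also store identifier specifiers as pointers into a symbol table rather than as raw strings.

```python
# Hypothetical fixed-length type codes for a small token set.
TYPE_CODES = {"READ": 1, "(": 2, ")": 3, ",": 4, ":=": 5, "+": 6,
              "id": 22, "int": 23}

def encode(token):
    """Return (type_code, specifier). Keywords and operators need no
    specifier; identifiers and integers carry one that identifies the
    particular value that was scanned."""
    if token in TYPE_CODES:
        return (TYPE_CODES[token], None)
    if token.isdigit():
        return (TYPE_CODES["int"], int(token))   # specifier: the value itself
    return (TYPE_CODES["id"], token)             # specifier: the identifier name

print([encode(t) for t in ["READ", "(", "VALUE", ")"]])
```

The keyword and the parentheses are fully described by their type codes alone, while the identifier VALUE needs its specifier to distinguish it from every other identifier in the program.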