If you glanced at the title and thought, “I don’t care — I don’t write C code,” then hang on a minute. While it is true that C has a preprocessor and you can notoriously do strange and — depending on your point of view — horrible or wonderful things with it, there are actually other options and you don’t have to use any of them with a C program. You can actually use the C preprocessor with almost any kind of text file. And it’s not the only preprocessor you can abuse this way. For example, the m4 preprocessor is wildly complex, vastly underused, and can handle C source code or anything else you care to send to it.
Definitions
I’ll define a preprocessor as a program that transforms its input file into an output file, reacting to commands that are probably embedded in the file itself. Most often, that output is then sent to some other program to do the “real” work. That covers cpp, the C preprocessor. It also covers things like sed. Honestly, you can easily create custom preprocessors using C, awk, Python, Perl, or any other programming language. There are many other standard programs that you could think of as preprocessors, for example, tr. However, one of the most powerful, m4, is made specifically to preprocess complex input files. For some reason (maybe because of its complexity) you don’t see much m4 in the wild.
What Preprocessor?
If you’ve only used modern C compilers, you may wonder where the preprocessor even is. An ordinary system now does the entire compile in what is, as far as you can tell, one single pass. However, your compiler should still offer a standalone executable that does the preprocessor logic externally, if you prefer. For gcc (and many other compilers), the preprocessor is named, unsurprisingly, cpp. The preprocessor has four major tasks:
- Substitute one string for another, including “macros” that look like a function call.
- Evaluate expressions and include parts of the input or exclude them based on the expression’s value.
- Strip out comments.
- Read in other files.
Of course, usually, the input is C source code, and the output is headed for the compiler, but it doesn’t have to be that way.
A Simple Example
Suppose you have a configuration file of some sort that has messages in it, originally in English. The file looks like this:
message1: Good Morning
message2: Good Night
message3: The cat is white
We want to arrange it so we can easily change the messages and build a new configuration file. There are several ways you could do this, each with some advantages and disadvantages.
Imagine you have a file called langs:
#define ENGLISH 0
#define SPANISH 1
Obviously, you could add more languages here, and the numbers are arbitrary as long as they are unique.
Now, we can create a template for the final configuration file:
#include "langs"
#ifndef LANG
#define LANG ENGLISH
#endif
#include "xlat"
message1: GOOD_MORNING
message2: GOOD_NIGHT
message3: CAT(WHITE)
There are a few things to notice about this file. First, it includes our language definition file. It then defines LANG as one of those symbols unless something else has already defined it. We will soon see what that might be, but assume this sets LANG to ENGLISH for now.
The include of xlat populates the tags like GOOD_MORNING with the correct string in whatever language we choose. Here’s what xlat looks like:
#if LANG==ENGLISH
#define WHITE white
#define GOOD_MORNING Good Morning
#define GOOD_NIGHT Good Night
#define CAT(clr) The cat is clr
#endif
#if LANG==SPANISH
#define WHITE blanco
#define GOOD_MORNING Buenos Días
#define GOOD_NIGHT Buenas Noches
#define CAT(clr) El gato es clr
#endif
Note that the Spanish good morning message has a non-ASCII character in it (the í in Días). That’s one small issue with using tools like this. The character may come out as a C-style escape sequence. Depending on what you are going to use the output for, that may or may not be acceptable. In fact, there are several things the preprocessor does for the compiler that we probably want to suppress.
If you just run:
cpp template
You get:
# 0 "template"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "template"
# 1 "langs" 1
# 2 "template" 2
# 1 "xlat" 1
# 8 "template" 2
message1: Good Morning
message2: Good Night
message3: The cat is white
What we want is at the bottom, true, but there’s a lot of stuff to help the compiler generate error messages and other things.
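The trick is to put a few options on the command line:
cpp -undef -P template
These options are for gcc’s preprocessor: -undef stops it from predefining system-specific macros, and -P suppresses the linemarker lines that start with #. If you use something else, you may have to make your own decisions.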
Customizing
If you want the Spanish version, you could simply edit the file. But you can also tell the preprocessor to force the LANG symbol, and since the template won’t redefine it, you’ll get the language of your choice:
cpp -undef -P -D LANG=SPANISH template
As I mentioned, the Unicode character will look funny after this, depending on how you look at it.
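If you have more than a couple of languages, you might wrap that command in a small build script. The following is only a sketch: it assumes gcc’s cpp with the -undef and -P options, the file names template and langs from above, and a made-up output naming scheme (config.english, config.spanish). The build_all function is defined but not called, since it needs cpp and the example files to be present.

```python
import subprocess

# Must match the symbols defined in the langs file.
LANGUAGES = ["ENGLISH", "SPANISH"]

def cpp_command(template, lang):
    """Return the argument list that preprocesses the template for one language."""
    return ["cpp", "-undef", "-P", "-D", f"LANG={lang}", template]

def build_all(template="template"):
    """Run cpp once per language; output file names are hypothetical."""
    for lang in LANGUAGES:
        result = subprocess.run(cpp_command(template, lang),
                                capture_output=True, text=True, check=True)
        with open(f"config.{lang.lower()}", "w", encoding="utf-8") as out:
            out.write(result.stdout)

# Show the command that would be run for the Spanish build.
print(" ".join(cpp_command("template", "SPANISH")))
```

Keeping the command construction in its own function makes it easy to check what will run before anything touches the disk.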
Another Way
This isn’t the only way to use the preprocessor in this example. You could detect the language and then include a different file — ENGLISH or SPANISH — to get the same result. This would have the advantage of many small independent files you could send to different translators, for example.
There are probably dozens of other ways you could do this, too. The preprocessor is like a multitool. There are lots of ways to do almost anything.
Preprocessor on Steroids
If you really want to get fancy with the preprocessor, try m4. It is similar in idea to the C preprocessor but has many superpowers. It isn’t specific to C, so there’s not much you have to do to coax it to work with your files. Unlike the C preprocessor, m4 doesn’t care about lines. For example, consider this input:
Hello!
define(HACKADAY,1)
Testing our macro: HACKADAY
The End
If you run that through m4, you’ll notice there is a strange blank line between Hello and the line that says “Testing.” Why? Because the macro definition only consumes the characters up to the close parenthesis. Everything else is still in the file, including that newline at the end. If you type some text in after the definition, there’s no problem, and it will show up in the output.
If you want to ignore the rest of the line, you use dnl (delete to new line) like this:
define(HACKADAY,1)dnl
Arguments in m4 use the dollar sign notation, much like the shell. Quoting is strange, too, since you use the back quote to open and the apostrophe to close. Like this:
define(HACKADAY,`eval(10**$1)')
As you might expect, this allows you to say HACKADAY(2) and get 100 as the result — the double asterisk is exponentiation.
A Pleasant Diversion
One of the best features of m4 is that it has at least ten different output streams. The default is stream 0 and the rest are numbered from 1 to 9. You can write to any of the streams easily, or divert to a negative number like -1 to throw text away. At the end, the output streams are put together in numerical order. Hypothetically, then, you could have a macro that adds an item to a report, for example. The report has a header, a body, and a totals column. You could put all the header code into diversion 1 (a “diversion” is m4-speak for one of these streams). Then put the body code in diversion 2 and the total code in diversion 3.
At the end, the generated program would have all the headers, then all the body items, and, finally, the totals, and you could write them in any order you find convenient. Some m4 programs, including the GNU one, allow larger numbers of diversions than the standard.
As a simple example, consider this script:
dnl These comments will be discarded
dnl First, we are going to divert to #1
dnl Then we will print each word along with a count
dnl incrementing the count (_c)
dnl At the end, we will switch back to 0 and output the count
dnl This way, the header of the "report" will have the count
dnl followed by the words we wanted to count
divert(1)dnl
define(_c,0)dnl
define(WC,`
define(`_c',incr(_c))dnl
_c: $1')dnl
WC(Hello)
WC(There)
WC(Hackaday)
WC(2024)
divert(0)dnl
List of _c words:
Note that the lines that start with dnl are essentially comments. The rest is cryptic, but the idea is to define a macro to output a list of words with sequence numbers. The header contains a total count which, of course, we don’t know until the end. But since the header is put in diversion 0 and the rest in diversion 1, everything comes out in the right order.
There’s too much about m4 to cover in a single post, but you can read more about it on your own. Honestly, if you really need the power of m4, maybe you should be thinking about awk or Python anyway. You’ll probably have to recreate your own version of the divert system, though, so if you really need that functionality, maybe there is something to m4.
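For a sense of what recreating the divert system might involve, here is a rough Python sketch. The Diversions class and its method names are invented for this example and only loosely imitate m4: text goes into numbered buffers, negative numbers discard it, and the buffers are joined in numerical order at the end. It rebuilds the word-count report from earlier, although the spacing will not match m4’s output exactly.

```python
class Diversions:
    """Numbered output buffers that are joined in numerical order at the end."""

    def __init__(self):
        self.buffers = {}
        self.current = 0  # like m4, start out writing to diversion 0

    def divert(self, number):
        """Select which buffer later writes go to."""
        self.current = number

    def write(self, text):
        if self.current < 0:
            return  # a negative diversion throws the text away
        self.buffers.setdefault(self.current, []).append(text)

    def flush(self):
        """Return all buffered text, lowest diversion number first."""
        return "".join("".join(self.buffers[n]) for n in sorted(self.buffers))

out = Diversions()
words = ["Hello", "There", "Hackaday", "2024"]

out.divert(1)                      # body first, even though it prints last
for count, word in enumerate(words, start=1):
    out.write(f"{count}: {word}\n")

out.divert(0)                      # now the header, which needs the final count
out.write(f"List of {len(words)} words:\n")

report = out.flush()
print(report)
```

A dictionary of lists keeps the sketch short and, unlike the ten streams of standard m4, puts no fixed limit on the number of diversions.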
On the other hand, maybe try awk. Or mix awk, shell script, and the C preprocessor in terrible ways.
Linux Fu: Preprocessing Beyond Code
Source: Manila Flash Report