Linux Fu: Preprocessing Beyond Code

If you glanced at the title and thought, “I don’t care, I don’t write C code,” then hang on a minute. It is true that C has a preprocessor, and you can notoriously do strange and (depending on your point of view) horrible or wonderful things with it. But there are other options, and you don’t have to use any of them with a C program. You can use the C preprocessor with almost any kind of text file, and it’s not the only preprocessor you can abuse this way. For example, the m4 preprocessor is wildly complex, vastly underused, and can handle C source code or anything else you care to send to it.

Definitions

I’ll define a preprocessor as a program that transforms its input file into an output file, reacting to commands that are probably embedded in the file itself. Most often, that output is then sent to some other program to do the “real” work. That covers cpp, the C preprocessor. It also covers things like sed. Honestly, you can easily create custom preprocessors using C, awk, Python, Perl, or any other programming language. There are many other standard programs that you could think of as preprocessors, for example, tr. However, one of the most powerful is m4, which is made to preprocess complex input files. For some reason (maybe because of its complexity) you don’t see much m4 in the wild.
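To make the idea of a custom preprocessor concrete, here is a minimal Python sketch. The %set command syntax and the preprocess function are invented for this illustration and are not part of any standard tool. The sketch reads text, treats lines that start with %set as embedded commands, and expands the defined names everywhere else.

```python
import re

# Hypothetical command syntax, made up for this example:
#   %set NAME replacement text
SET_COMMAND = re.compile(r"%set\s+(\w+)\s*(.*)")


def preprocess(text):
    """Return text with %set lines removed and defined names expanded."""
    macros = {}   # maps NAME to its replacement text
    output = []
    for line in text.splitlines():
        command = SET_COMMAND.match(line)
        if command:
            # A command line updates the table and produces no output.
            macros[command.group(1)] = command.group(2)
            continue
        # Replace whole words only, so a name never matches inside a longer word.
        expanded = re.sub(
            r"\w+", lambda m: macros.get(m.group(0), m.group(0)), line
        )
        output.append(expanded)
    return "\n".join(output)


sample = (
    "%set GREETING Good Morning\n"
    "%set FAREWELL Good Night\n"
    "message1: GREETING\n"
    "message2: FAREWELL"
)
print(preprocess(sample))
```

Unlike cpp or m4, this toy does not rescan its own output or support arguments, which is a fair hint at how much work the real tools save you.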

What Preprocessor?

If you’ve only used modern C compilers, you may wonder where the preprocessor even is. An ordinary system now does the entire compile in what looks, as far as you can tell, like one single pass. However, your compiler should offer a standalone executable that runs the preprocessor on its own, if you prefer. For gcc (and many other compilers), that program is named, unsurprisingly, cpp. The preprocessor has four major tasks:

  1. Substitute one string for another, including “macros” that look like a function call.
  2. Evaluate expressions and include parts of the input or exclude them based on the expression’s value.
  3. Strip out comments.
  4. Read in other files.

Of course, usually, the input is C source code, and the output is headed for the compiler, but it doesn’t have to be that way.

A Simple Example

Suppose you have a configuration file of some sort that has messages in it, originally in English. The file looks like this:

message1: Good Morning
message2: Good Night
message3: The cat is white

We want to arrange it so we can easily change the messages and build a new configuration file. There are several ways you could do this, each with some advantages and disadvantages.

Imagine you have a file called langs:

#define ENGLISH 0
#define SPANISH 1

Obviously, you could add more languages here, and the numbers are arbitrary as long as they are unique.

Now, we can create a template for the final configuration file:

#include "langs"

#ifndef LANG
#define LANG ENGLISH
#endif

#include "xlat"

message1: GOOD_MORNING
message2: GOOD_NIGHT
message3: CAT(WHITE)

There are a few things to notice about this file. First, it includes our language definition file. It then defines LANG as one of those symbols unless something else has already defined it. We will soon see what that might be, but assume this sets LANG to ENGLISH for now.

The include of xlat populates the tags like GOOD_MORNING with the correct string in whatever language we choose. Here’s what xlat looks like:

#if LANG==ENGLISH
#define WHITE white
#define GOOD_MORNING Good Morning
#define GOOD_NIGHT Good Night
#define CAT(clr) The cat is clr

#endif

#if LANG==SPANISH
#define WHITE blanco
#define GOOD_MORNING Buenos Días
#define GOOD_NIGHT Buenas Noches
#define CAT(clr) El gato es clr
#endif

Note that the Spanish good morning message has a non-ASCII character (the í) in it. That’s one small issue with using tools like this. The character can come out as a C-style escape sequence. Depending on what you are going to use the output for, that may or may not be acceptable. In fact, there are several things the preprocessor does for the compiler that we probably want to suppress.

If you just run:

cpp template

You get:

# 0 "template" 
# 0 "<built-in>" 
# 0 "<command-line>" 
# 1 "/usr/include/stdc-predef.h" 1 3 4 
# 0 "<command-line>" 2 
# 1 "template" 
# 1 "langs" 1 
# 2 "template" 2 


# 1 "xlat" 1 
# 8 "template" 2 

message1: Good Morning 
message2: Good Night 
message3: The cat is white

What we want is at the bottom, true, but it is buried under linemarker lines that exist to help the compiler generate error messages, plus some stray blank lines.

The trick is to put a few options on the command line:

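cpp -undef -P template

These options are for gcc’s preprocessor: -P suppresses the linemarker lines, and -undef keeps cpp from predefining system-specific macros (such as linux or unix) that could otherwise replace ordinary words in your text. If you use something else, you will have to find the equivalent options.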

Customizing

If you want the Spanish version, you could simply edit the file. But you can also tell the preprocessor to force the LANG symbol, and since the template won’t redefine it, you’ll get the language of your choice:

cpp -undef -P -D LANG=SPANISH template

As I mentioned, the Unicode character will look funny after this, depending on how you look at it.
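If you end up generating several variants of a file, you may want to drive cpp from a script. The following Python sketch is one way to do that, assuming gcc’s cpp is on your PATH. The function names build_cpp_command and render are made up for this example. The only options used are -undef, -P, and -D.

```python
import subprocess


def build_cpp_command(template, defines=None):
    """Build the cpp argument list for a template and optional -D symbols."""
    command = ["cpp", "-undef", "-P"]
    for name, value in (defines or {}).items():
        # Same effect as typing: -D NAME=VALUE
        command += ["-D", f"{name}={value}"]
    command.append(template)
    return command


def render(template, defines=None):
    """Run cpp (assumed to be installed) and return the preprocessed text."""
    result = subprocess.run(
        build_cpp_command(template, defines),
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if cpp reports an error
    )
    return result.stdout


# Show the command that would be run for the Spanish build.
print(" ".join(build_cpp_command("template", {"LANG": "SPANISH"})))
```

Calling render("template", {"LANG": "SPANISH"}) from the directory that holds template, langs, and xlat should then return the Spanish configuration as a string.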

Another Way

This isn’t the only way to use the preprocessor in this example. You could detect the language and then include a different file — ENGLISH or SPANISH — to get the same result. This would have the advantage of many small independent files you could send to different translators, for example.

There are probably dozens of other ways you could do this, too. The preprocessor is like a multitool. There are lots of ways to do almost anything.

Preprocessor on Steroids

If you really want to get fancy with the preprocessor, try m4. It is similar in idea to the C preprocessor but has many superpowers. It isn’t specific to C, so there’s not much you have to do to coax it to work with your files. Unlike the C preprocessor, m4 doesn’t care about lines. For example, consider this input:

Hello!
define(HACKADAY,1)
Testing our macro:
HACKADAY
The End

If you run that through m4, you’ll notice there is a strange blank line between Hello and the line that says “Testing.” Why? Because the macro definition only consumes the characters up to the close parenthesis. Everything else is still in the file, including that newline at the end. If you type some text in after the definition, there’s no problem, and it will show up in the output.

If you want to ignore the rest of the line, you use dnl (delete to new line) like this:

define(HACKADAY,1)dnl

Arguments in m4 use the dollar sign notation, much like the shell. Quoting is strange, too, since you use the back quote to open and the apostrophe to close. Like this:

define(HACKADAY,`eval(10**$1)')

As you might expect, this allows you to say HACKADAY(2) and get 100 as the result — the double asterisk is exponentiation.

A Pleasant Diversion

One of the best features of m4 is that it has at least ten different output streams. The default is stream 0 and the rest are numbered from 1 to 9. You can write to any of the streams easily, or write to a negative stream like -1 to discard the text. At the end, the output streams are put together in numerical order. Hypothetically, then, you could have a macro that adds an item to a report, for example. The report has a header, a body, and a totals section. You could put all the header code into the first stream (or “diversion”, in m4-speak). Then put the body code in diversion 2 and the total code in diversion 3.

At the end, the generated program would have all the headers, then all the body items, and, finally, the totals, and you could write them in any order you find convenient. Some m4 programs, including the GNU one, allow larger numbers of diversions than the standard.

As a simple example, consider this script:

dnl These comments will be discarded
dnl First, we are going to divert to #1
dnl Then we will print each word along with a count
dnl incrementing the count (_c)
dnl At the end, we will switch back to 0 and output the count
dnl This way, the header of the "report" will have the count
dnl followed by the words we wanted to count
divert(1)dnl
define(_c,0)dnl
define(WC,`
define(`_c',incr(_c))dnl
_c: $1')dnl
WC(Hello)
WC(There)
WC(Hackaday)
WC(2024)
divert(0)dnl
List of _c words:

Note that the lines that start with dnl are essentially comments. The rest is cryptic, but the idea is to define a macro to output a list of words with sequence numbers. The header contains a total count which, of course, we don’t know until the end. But since the header is put in diversion 0 and the rest in diversion 1, everything comes out in the right order.

There’s too much about m4 to cover in a single post, but you can read more about it on your own. Honestly, if you really need the power of m4, maybe you should be thinking about awk or Python anyway. You’ll probably have to recreate your own version of the divert system, though, so if you really need that functionality, maybe there is something to m4.
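For a sense of what recreating the divert system involves, here is a rough Python sketch. The Diversions class and the word_report function are invented for this illustration and only imitate the behavior described above: text goes to the current stream, negative streams discard it, and everything is joined in numerical order at the end. One simplification to note is that real m4 writes diversion 0 straight to the output as it goes, while this sketch buffers it like the others.

```python
class Diversions:
    """Toy imitation of m4-style numbered output streams."""

    def __init__(self):
        self.buffers = {}   # stream number -> list of text chunks
        self.current = 0    # stream 0 is the default

    def divert(self, number):
        """Select the stream that later writes will go to."""
        self.current = number

    def write(self, text):
        """Append text to the current stream; negative streams discard it."""
        if self.current < 0:
            return
        self.buffers.setdefault(self.current, []).append(text)

    def render(self):
        """Join every stream in numerical order."""
        return "".join(
            "".join(self.buffers[number]) for number in sorted(self.buffers)
        )


def word_report(words):
    """Rebuild the word-count report: header first, although written last."""
    out = Diversions()
    out.divert(1)                      # body lines go to stream 1
    count = 0
    for word in words:
        count += 1
        out.write(f"{count}: {word}\n")
    out.divert(0)                      # header goes to stream 0
    out.write(f"List of {count} words:\n")
    return out.render()


print(word_report(["Hello", "There", "Hackaday", "2024"]), end="")
```

The header line is written last but comes out first, which is the whole point of diversions.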

On the other hand, maybe try awk. Or mix awk, shell script, and the C preprocessor in terrible ways.



Source: Manila Flash Report
