I got this query from a guy I met at an Arduino meet-up. He’s done a fair bit of PHP programming and was having difficulty getting to grips with C. In particular, data types are different in the two languages. I thought I’d try to write up some notes for non-C programming types to help them with Arduino programming.
A bit of history might help explain the background of C, and that could make it easier to understand why it is the way it is. C dates from the early 1970s. Its name is simply the next letter after its immediate precursor, B, which was itself a cut-down version of a language called BCPL. Like COBOL, Fortran and Algol, which were already in use at the time, C is a procedural language: a program is divided into a number of procedures that each perform some task. You might have a procedure to draw a line, or print a tax report. What C emphasises is that its procedures are functions. They also perform discrete tasks, but when a function has completed its task it can pass back a result to the function that called it. So you might have a function that calculates the GST on an item and returns that amount. (Confusingly, the term “functional language” is normally used for a different family of languages, such as Lisp, so C is not usually described that way.)
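The idea of a function handing a result back to its caller can be sketched in C like this. It is only an illustration: the function names and the 15% rate are made up for the example, and prices are held in whole cents to keep the arithmetic in integers.

```c
#include <stdint.h>

/* Hypothetical GST rate, chosen only for illustration. */
#define GST_PERCENT 15

/* Returns the GST, in cents, on a price given in cents (rounded down). */
int32_t gst_on(int32_t price_cents)
{
    return price_cents * GST_PERCENT / 100;
}

/* The caller receives the result and can use it in a further calculation. */
int32_t price_with_gst(int32_t price_cents)
{
    return price_cents + gst_on(price_cents);
}
```

Calling price_with_gst(2000) asks gst_on for the tax on $20.00, gets 300 back, and returns 2300.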
C is a compiled language, which is distinct from an interpreted language. You write your program in a human-readable form and process it using something called a compiler, which produces a runnable program. In Arduino terms the compiler generates machine code, which the Arduino development environment then uploads to the ATmega chip. The chip has a boot loader program already installed and whenever the board is switched on, it will attempt to run the machine code that was last uploaded to it.
An interpreted language also starts off as human-readable source code, but there is no compiler to turn it into machine instructions. Instead the source code is read line by line by a separate program (the interpreter), which attempts to execute the line of source code.
C tries to be very close to machine code. You can address the computer’s memory directly, perform bit-manipulation tasks and write complex Boolean logic. C’s first, and arguably still its most important, job was as the language in which Unix is written. Unix and Unix-like systems such as Linux are behind every Mac and a lot of PCs. Clearly it’s important that C programs can do very low-level tasks very quickly and without using large amounts of memory. When you are trying to work out how data types behave, it’s a good idea to think about just how the information is stored in the computer.
An integer is a whole number. Just how big a number it can hold depends on the computer. It is stored in computer memory as a number of bytes of data. A byte is 8 bits, and on an ATmega-based Arduino an integer (an int) is stored in 2 bytes. Integers are stored in binary format, so with 16 bits an Arduino could store any integer between 0 and 65535. In reality we also want negative numbers, so one bit is used for the sign: if this bit is 0 it’s a positive number, if it’s 1 it’s a negative number. This leaves 15 bits out of our original 16 for the size of the number. So an Arduino int can store any integer between -32768 and +32767. (The scheme actually used is called two’s complement. It has only one way of writing zero, which is why the negative range reaches one further than the positive range. If you only need positive numbers, an unsigned int gives you the full 0 to 65535.)
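A small sketch of those ranges follows. It assumes the fixed-width types from stdint.h (int16_t and uint16_t) as stand-ins for the Arduino’s 2-byte int, because on a desktop computer a plain int is usually 4 bytes. The function names are invented for the example.

```c
#include <stdint.h>

/* Largest value that fits in 16 bits when every bit is used for the number. */
uint16_t largest_unsigned(void) { return UINT16_MAX; }

/* Limits of a signed 16-bit integer, where the top bit carries the sign. */
int16_t smallest_signed(void) { return INT16_MIN; }
int16_t largest_signed(void)  { return INT16_MAX; }

/* Number of distinct values that fit in a given number of bits (fewer than 32). */
uint32_t values_in_bits(unsigned bits)
{
    return (uint32_t)1 << bits;
}
```

values_in_bits(16) is 65536, which covers either 0 to 65535 or -32768 to +32767.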
C does a similar thing with characters. There is only a finite number of characters on the keyboard, so C will store one character in 1 byte of memory (8 bits). Again characters are stored in binary form and each character is assigned a different number. These assignments use something called ASCII. For example the upper-case letter A is stored as ASCII number 65, and lower-case a is stored as 97. ASCII itself only defines codes 0 to 127, but 8 bits is enough space to store 256 different values, so an Arduino can store 256 different character codes.
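Because a character is just a small number, C lets you move between the two freely. A minimal sketch, assuming an ASCII system (which the Arduino is); the function names are made up for the example:

```c
/* Returns the number that is stored in memory for a character. */
int code_of(char c)
{
    return (int)c;
}

/* Returns the character that a given code stands for. */
char char_of(int code)
{
    return (char)code;
}
```

So code_of('A') gives 65, code_of('a') gives 97, and char_of(72) gives 'H'.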
So, to recap, computer memory is arranged in bytes. Characters are stored in single bytes and integers are stored in pairs of bytes. There are also schemes for storing numbers with a decimal point, but that’s a different story.
If I want to store the words Hello World, there are 11 characters and C allows me to store these very economically in 12 bytes of computer memory. Each character will be stored in one byte using ASCII and the 12 bytes will have the values 72,101,108,108,111,32,87,111,114,108,100,0. The 0 is a special value at the end of a C string; it means “this is the end of the string”. It’s important, because C doesn’t store the length of the string anywhere, it just stores a zero at the end. Any function that is reading the string has to know to stop when it hits the zero. ASCII does define a code 0 (the NUL character), but because C uses it as the end marker it can never appear inside a string, so a C string really only has 255 different characters to choose from.
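Here is a rough sketch of that layout. The variable and function names are invented for the example; string_length does by hand what the standard strlen function does, walking along the bytes until it meets the zero.

```c
#include <stddef.h>

/* The compiler adds the terminating zero byte for us. */
const char greeting[] = "Hello World";

/* Counts the characters before the terminating zero byte. */
size_t string_length(const char *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Here sizeof greeting is 12 (the 11 characters plus the zero), while string_length(greeting) is 11.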
Another way of thinking of character strings is as a one-dimensional array of characters. And C is consistent here, because if I wanted to store a list (or array) of integers C would store them in the same way as it stores arrays of characters. I might want to store the number of days in each month of the year (31,28,31,30,31,30,31,31,30,31,30,31). That’s 12 integers and it would be stored in 24 bytes, 2 bytes for each integer. There wouldn’t be a zero at the end though, since that’s only needed for character strings. C is a bit inconsistent here: the zero terminator lets any function find the end of a string, but nothing marks the end of an integer array, so you need to keep track of how many items you have stored in it yourself. That does make sense though. It would be unreasonable for the language to mark the end of an integer array with a zero; what if you wanted to store a zero as one of the numbers in the array?
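The days-in-month array might look like the sketch below. It uses int16_t as a stand-in for the Arduino’s 2-byte int, and the names are made up for the example. The sizeof trick for counting elements only works where the array itself is declared, which is one common way of keeping track of the length.

```c
#include <stddef.h>
#include <stdint.h>

/* Twelve 2-byte integers, stored one after another with no end marker. */
const int16_t days_in_month[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

/* Total size in bytes divided by the size of one element gives the count. */
#define MONTH_COUNT (sizeof days_in_month / sizeof days_in_month[0])

/* Adds up every element; the loop needs the count because nothing marks the end. */
int days_in_year(void)
{
    int total = 0;
    for (size_t i = 0; i < MONTH_COUNT; i++) {
        total += days_in_month[i];
    }
    return total;
}
```

sizeof days_in_month is 24, MONTH_COUNT is 12, and days_in_year() returns 365.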
Another important point to note is that C does not store integers and characters any differently from one another (apart from their size and the extra 0 at the end of a string): both are just binary numbers in memory. That means there are two ways to store a single digit. The character 2 is stored as the value 50 in a single byte, but the number 2 is stored in two bytes as the value 2. You can perform maths operations on the number, for example multiply it by 10 and the 2 becomes 20. But performing maths operations on the character will not always give you the results you’d expect. If you add one to the character 2 (which has been stored as the value 50) you get the value 51, because the arithmetic operation is performed on whatever is in memory. 51 is the value that the character 3 is stored as, so that arithmetic operation happened to work. However if you multiply the character 2 by 10 and store the result back in a character, the answer should be 500. This is outside the range of values that can be stored in 8 bits, so you definitely don’t get what you’d expect. What you actually get is the value 244, which is beyond the end of ASCII; in the common Latin-1 extension of ASCII it is the code for o with a circumflex. (On an Arduino a plain char is signed, so the same bit pattern is reported as -12 if you print it as a number.) This highlights another point about C. How can you get 244 if you multiply 50 by 10?
To answer that, remember that characters are stored in 8 bits of memory. When we multiply 50 by 10 we get the result 500. Storing 500 in binary requires 9 bits: 1 1111 0100. When the computer tries to store it in 8 bits, the top bit is lost and what remains is 1111 0100, which is 244 (that is, 500 minus 256).
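The arithmetic on the character 2 can be tried out with a sketch like this one. It uses uint8_t (an unsigned 8-bit value) rather than a plain char so that the result is the same on every machine, and the function names are invented for the example.

```c
#include <stdint.h>

/* Adds one to the stored code: '2' (50) becomes '3' (51). */
uint8_t next_char(uint8_t c)
{
    return (uint8_t)(c + 1);
}

/* Multiplies the stored code by ten and squeezes the result back into 8 bits. */
uint8_t times_ten(uint8_t c)
{
    return (uint8_t)(c * 10);
}

/* Turns a digit character into the number it represents: '2' becomes 2. */
int digit_value(char c)
{
    return c - '0';
}
```

times_ten('2') works out 50 times 10, then keeps only the low 8 bits of 500, giving 244. If what you really wanted was the number 20, digit_value('2') * 10 gets you there.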
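When you’re programming an Arduino, you are using a language based on C (strictly speaking it is C++, which is built on top of C). This is a statically typed language: every variable is declared with a data type, and that type never changes. So if a variable starts off as an integer you can’t then treat it as a string, and vice versa. C will quietly convert between its numeric types, as the character arithmetic above shows, but it will not turn text into a number or a number into text for you the way PHP does.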
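In practice this means that moving between text and numbers is something you ask for explicitly. A minimal sketch using two standard C library functions, atoi and snprintf; the wrapper names are made up for the example, and the caller is assumed to supply a buffer big enough for the digits plus the terminating zero.

```c
#include <stdio.h>
#include <stdlib.h>

/* Converts text such as "42" into the integer 42. */
int text_to_number(const char *text)
{
    return atoi(text);
}

/* Writes a number into a character buffer as text, e.g. 20 becomes "20". */
void number_to_text(int number, char *buffer, size_t size)
{
    snprintf(buffer, size, "%d", number);
}
```

The variable holding the text and the variable holding the number stay separate, each with its own type, which is quite different from PHP where one variable can hold either.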