Win32 Assembler Tutorial by T$ - Chapter 3.141 [deinmeister.de]

Hi there, welcome to the 4th chapter of the Win32Asm tutorial. This time the topic is synchronizing animations, that is, keeping the output of your code in time. There are 2 main reasons for using synchronization in demos and games:

1. to make the animations fit the background music and sound effects
2. to prevent the code from running too slow or too fast (which is especially important on a platform like the PC, because the speed ratio between the slowest and the fastest machine in use can easily reach 1:100, and that is the result of less than 10 years of development!)

This tutorial also gives an introduction to using multiple threads at once.

By the way, quite a lot of people asked about using DirectDraw in fullscreen. If you do not know how yet, take a look at the source of this tutorial (yes, it's inside the ZIP archive... ;)) and don't forget to check out the similarities to cooperative mode, which make using both modes in one application rather easy.

For those who did not figure out the number of the last tute: the number 2.718... is known as e. It is quite useful if you are working with logarithms, powers, and maybe sines as well. Take a look at it if you want to use the FPU efficiently.

1. Keeping stuff in order

One can say that, in general, there are 2 kinds of algorithms:

Interpolating algorithms:
like interpolation plasmas, fire effects, snow effects and everything else that calculates the new state from the current one. You cannot determine a certain state without calculating all previous ones. These algorithms normally produce their output in fixed increments.
Functions:
These algorithms produce any state you want from one or more simple input values, within the range and granularity the function accepts. You do not need the previous states. As the name suggests, they are typical for anything built on physical or mathematical formulas (movement calculation, sine plasmas, most sound generators).

Let's take a look at the behaviour of these algorithms while being synched:

Interpolations
always run smoothly. However, if the computer is too slow, the animation will be slow, too.
You have to include some kind of delay to prevent them from running too fast.
If the computer is too slow, the animation goes out of sync. You may then have to stop the animation before it reaches its end, stop the music until the calculations are finished, loop part of the music, or do whatever is least conspicuous to the user.

Function-based animations
always run in time if fed correctly. However, if the computer is too slow, they look rather jerky (the slide-show effect).
If the frame rate drops too far, you may increment the input value by less than synching would require in order to avoid the slide-show effect, but then you get the same sync problems as described above for interpolations.
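To make the idea of a function-based animation concrete, here is a minimal sketch in C (not taken from the sample code). The helper name sprite_x and its parameters are made up for illustration. It computes a sprite position purely from the elapsed time, so a slow machine simply shows fewer frames but never falls behind.

```c
#include <stdint.h>

/* Hypothetical helper: horizontal sprite position as a pure function of time.
   The sprite covers travel_px pixels within duration_ms milliseconds. */
static int32_t sprite_x(uint32_t elapsed_ms, uint32_t duration_ms, int32_t travel_px)
{
    if (elapsed_ms >= duration_ms)
        return travel_px;   /* clamp at the end of the path (also covers duration_ms == 0) */

    /* 64-bit intermediate so elapsed_ms * travel_px cannot overflow */
    return (int32_t)(((int64_t)elapsed_ms * travel_px) / duration_ms);
}
```

Assuming a 640 pixel wide path and a duration of 4000 ms, a frame drawn after 2000 ms lands at sprite_x(2000, 4000, 640), no matter how many frames were drawn before it.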

There are several ways to obtain a value that can be used as input for a function:

The system timer in Windows can be read with timeGetTime from winmm.dll (a multimedia component of Windows). It returns the time since Windows was started in milliseconds, which should be precise enough in most cases. You should not use this value directly. It is better to use the difference between 2 calls of the function (either between two frames or between the current frame and the start of the program), since the 32-bit counter wraps around after about 49.7 days of uptime. This is getting more and more important, think of the "suspend-to-disk" functions used today.
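Why the difference between two readings survives the wrap-around can be shown with a small C sketch. The function name elapsed_ms is hypothetical, and the two arguments stand for two values read from a 32-bit millisecond counter such as the one timeGetTime returns.

```c
#include <stdint.h>

/* Hypothetical helper: milliseconds between two readings of a 32-bit counter.
   Unsigned subtraction works modulo 2^32, so the result stays correct even if
   the counter wrapped around once between the two readings. */
static uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return (uint32_t)(now - start);
}
```

This only holds as long as less than one full counter period lies between the two readings.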

A similar way is using RDTSC (on Pentium and above, not on Cyrix M1), which reads a 64-bit counter that is incremented at every clock tick of the CPU. The drawbacks are that a) this instruction may be restricted to kernel code and b) you have to determine the frequency of the CPU clock first. Windows' QueryPerformanceCounter function is similar and may use RDTSC internally, depending on the system; its frequency is reported by QueryPerformanceFrequency.
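Once the counter frequency is known, turning two counter readings into milliseconds is plain arithmetic. The following C sketch assumes a hypothetical helper ticks_to_ms and a nonzero ticks_per_second. How the frequency is obtained is left out here.

```c
#include <stdint.h>

/* Hypothetical helper: convert the distance between two 64-bit tick readings
   into milliseconds. ticks_per_second must be nonzero (assumption). */
static uint64_t ticks_to_ms(uint64_t start_ticks, uint64_t end_ticks,
                            uint64_t ticks_per_second)
{
    uint64_t delta = end_ticks - start_ticks;

    /* Whole seconds and the remainder are converted separately so that
       delta * 1000 cannot overflow for long intervals. */
    return (delta / ticks_per_second) * 1000u
         + ((delta % ticks_per_second) * 1000u) / ticks_per_second;
}
```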

Another time source is the sound playback: you can get the current playback position from the sound playback engine, but the accuracy of the sound players is often rather poor. However, if you have written the player yourself, or at least modified it (e.g., all you need is the handle of the playback buffer and the number of times the buffer has looped in order to get the exact playback position), you'll get the best value for synching to the music. This works with modules, wave streams and mp3 as well.

Possible implementations of a delay:

Hey, a simple delay (like the one found in TP or in the Win95 driver being fucked up on faster CPUs) is the Nr.1 example of lame coding.

But there are quite nice varieties of it:
The most commonly used type of delay under DOS was the waitretrace. We can use it under DirectDraw, too: the function that flips the front and the back buffer can be used to wait for the retrace. Using the waitretrace is often a good idea since it prevents graphical artifacts. The problem is that it is not well suited for synching, because you cannot control the refresh rate, which typically lies somewhere between about 50 and 100 retraces per second. If the number of frames per second matters to your code, you have to measure the refresh rate and use it in your calculations. If your code manages to calculate a whole frame between two refreshes over a long period, you'll get the smoothest animation possible (normally, it doesn't). In most cases, an additional method of synchronization should be combined with waitretraces to achieve an optimal result.
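As a sketch of how a measured refresh rate could be used in the code, the following C function derives a per-frame step from it. The name step_per_frame_fp16, the 16.16 fixed-point format and the parameters are assumptions made for this example. Measuring the refresh rate itself is not shown.

```c
#include <stdint.h>

/* Hypothetical helper: pixels to advance per retrace, in 16.16 fixed point,
   so that travel_px pixels are covered in duration_ms milliseconds.
   refresh_hz and duration_ms must be nonzero (assumption). */
static uint32_t step_per_frame_fp16(uint32_t travel_px, uint32_t duration_ms,
                                    uint32_t refresh_hz)
{
    /* frames = refresh_hz * duration_ms / 1000, step = travel_px / frames */
    uint64_t numerator   = ((uint64_t)travel_px << 16) * 1000u;
    uint64_t denominator = (uint64_t)refresh_hz * duration_ms;

    return (uint32_t)(numerator / denominator);
}
```

The step is added to a fixed-point position after every flip, and the pixel position is that value shifted right by 16. Note that this is still an interpolation: if a retrace is missed, the animation falls behind.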

Another implementation of a delay can be done by using a timer.
Don't think you could put the animation's code into the timer's callback procedure! This only works as long as your code always finishes before the procedure is called again.
OK, you have a fast computer and everyone should buy a new one if your program crashes? Damn, you've been trapped. You'll be surprised that even on your machine the code may fuck up sometimes - since Windows is a multitasker, it is possible that another process is eating the cpu power you would have needed. Just think of what happens if the RAM is swapped to or from disk.
Better idea: every time you start calculating a new frame, set a flag indicating that a frame is under creation. When the frame is ready, check the flag repeatedly until it is cleared. The timer callback is what clears it. I call this technique a time-wasting loop. Such loops can also be implemented by waiting until a certain amount of time (see above for how to get the time) has elapsed. I know this is not an elegant approach to the problem, but by combining the timer callback with multiple threads we can turn it into one (take a look at the sample code coming along with this tute).
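The time-based variant of the time-wasting loop can be sketched in C as follows. This is not the sample code: the names are hypothetical, and the clock is passed in as a function pointer so that the loop can be tried without Windows. On Win32 it would be a thin wrapper around timeGetTime.

```c
#include <stdint.h>

/* Hypothetical clock callback returning a 32-bit millisecond counter */
typedef uint32_t (*clock_ms_fn)(void);

/* Spins until frame_ms milliseconds have passed since frame_start and
   returns how many times the clock was polled in vain. */
static uint32_t wait_for_frame_end(clock_ms_fn now_ms, uint32_t frame_start,
                                   uint32_t frame_ms)
{
    uint32_t polls = 0;

    /* unsigned difference, so a counter wrap-around does no harm */
    while ((uint32_t)(now_ms() - frame_start) < frame_ms)
        polls++;

    return polls;
}

/* Stand-in clock for experiments: advances by 1 ms on every call */
static uint32_t fake_time_ms = 0;
static uint32_t fake_clock_ms(void) { return ++fake_time_ms; }
```

With the stand-in clock starting at 0, wait_for_frame_end(fake_clock_ms, 0, 25) returns once the clock reads 25. A real loop like this burns CPU time while waiting, which is exactly why the text calls it time-wasting.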

2. Threads and multitasking

A thread is a piece of code being executed. Everything your code creates or uses can be accessed, and interfered with, by one, several or all of your threads. Several threads can run at the same time.

Every program started is also called a process. Each process consists of at least one thread and can create, stop, restart and kill at least all of its own threads (if it kills the last thread, the program terminates).

Note that in most cases using multiple threads in a program does not speed it up (this may even be the case in multi-CPU environments).

Let's see in which cases using several threads can be useful:

If you are using a window and its message loop in your program (you surely do), your message loop thread is blocked while it is processing the messages. This is not a problem with the graphics, but if it is blocked for a long time (read: dragging or resizing the window, navigating through menus,...), your audio output will sound like a machine gun.
If you are loading files that are not already cached, this may also take some time (just think of CD-ROM access times). So it may be useful to use a dedicated thread for it.
You can do precalculations or data decompression while processing user input (read: while waiting for the user).

For loading uncompressed files into memory, it is better to use file mapping (CreateFileMapping, MapViewOfFile) instead of a thread, since Windows will then load the file with the best performance: it reads the data from disk when you access the corresponding memory area, unless it is already in memory.

In most other situations threads are used just for convenience; the same job could also be done within a single thread. For example, a program can handle 10 windows with a single thread covering all of them, or create one thread per window.

All threads within a program access the same memory. So one can use global variables to communicate between the threads. The disadvantage is that if you create several identical threads, they cannot keep their private data in global variables, because the threads would overwrite each other's values. So you may either use the stack for storing local data or use the Thread Local Storage provided by Win32 (TlsAlloc, TlsSetValue, TlsGetValue), which gives every thread its own private value for each allocated slot.

Using the stack, the memory is normally referenced the following way:

mov eax,[esp + displacement]

For performance reasons, esp should always contain a multiple of 4 (dword alignment).
The displacement in bytes consists of a) the position of your data if you have several dwords on the stack and b) 4 bytes for every dword that push or call has put on the stack since your data was placed there.

If you do not want to keep track of all pushed and popped bytes on the stack, you may also use a register, preferably ebp, to store the initial stack position so that you only need to know the position of the data (this can be simplified using Variable equ RelativePosition equates, btw).
But if you use ebp like that you are wasting one of the 7 registers, and using registers as efficiently as possible is one of the main goals of asm, isn't it?

Note that esp and ebp use SS as their default segment register, so if you use any other register as the base for addressing the stack, put an SS segment override before it. Under Win32, this can be ignored since DS, ES and SS use the same selector.

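The following two instructions do the same, but at a different code size (the override prefix costs one byte):

mov eax,[esp+edi] ;base is esp, uses ss reg by default

and

mov eax,[ss:edi+esp] ;explicit ss override, required when the base register is one that defaults to ds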

The fact that threads practically run at the same time can cause problems when objects like files, locked surfaces, windows, ... are used by several threads. Imagine one thread opens a file or locks a surface for writing. Then the scheduler switches to another thread that uses the same surface or file, has finished its job, and closes the file or unlocks the surface. Now the first thread regains processing time and writes to the surface or file, which has already been released. Got the point? You cannot use a global flag variable to get around this problem, since a task switch may occur between reading the flag and acting on it. This happens rather seldom, but it causes random failures that are hard to track down.

A better approach is to use critical sections, which are set up with the appropriate Win32 functions (InitializeCriticalSection, EnterCriticalSection, LeaveCriticalSection). A critical section is a part of your code which does not run while another thread is inside a section guarded by the same critical section object. If needed, a thread is put to sleep until the other thread has left the critical section. Although they work quite efficiently and very reliably, critical sections reduce the amount of multitasking. So, if a large part of your code has to be inside a critical section, it may be better to put it into a single thread. Now I will show how threads can be used efficiently in practice while getting around the problems mentioned above. For multi-CPU environments better solutions are possible, but most of us only own a single-CPU machine.

Practical implementation of a multithreaded app

The first thread (created by the OS) is used to initialize the memory, DirectDraw (if needed) and the message loop. It also starts and controls all other threads. This thread can do normal calculations as well or may only be used for message processing and thread management. This scheme is also called the Master-Slave model.

Some slave threads which may be used with this App:

- a sound playback thread which initializes the complete sound stuff, enters the sound processing loop and runs until it gets informed by the main thread that it should shut down. It is also possible to let it shut down itself, e.g. because the music was completely played back.

- a graphics calculation thread behaving similarly to the thread above. Note that it does not make sense to let several threads calculate graphics unless you are running on a multiprocessor system, because accessing the framebuffer concurrently causes the problems mentioned above.

- a so-called worker thread which does things like precalculations or preparing other data and ends when it has done its job. If it has not finished by the time the data is needed, the main thread may increase the priority (see below) of this thread and/or pause the thread(s) needing the data it provides.

When the program should end, the main thread should signal all other threads still running that they should end and wait until they have closed before exiting. Otherwise the result may be improper deinitialisation and memory leakage.

Each thread also has a priority level which determines how the scheduler assigns processing time to it. All threads of a lower priority have to wait until the threads of higher priorities have done their job. Threads of the same priority get the same slice of CPU time.

A thread playing back sound is often given a HIGH priority level so that it can calculate the sound after the OS has done its job and before other work is done.

Most threads work fine with the default, the NORMAL priority level.

Worker threads may run at IDLE priority level, so they only run while all other threads are waiting (if you use waiting functions like GetMessage or MsgWaitForMultipleObjects) or are suspended. This allows calculations in the background without eating CPU time needed by the more important threads.

Not only the threads of an application have a priority level, the processes have one, too. The scheduler first takes a look at the priority of the process, then at the priority of its threads. Using the NORMAL priority for your application should work fine in most cases.

BTW: Some people still think setting the priority of their code to HIGH or REALTIME will make their program faster. In fact, in most cases it will just make the program less multitasking-compliant. So, while writing the code, always think of worst-case conditions with temporarily blocked code execution and a heavily varying amount of CPU power left to you. Even if your program is the only one on the screen, there are still several other threads running as well. In reality, the worst case will normally not happen, but your code should be able to handle it (one never knows...).

3. The sample app

The sample app consists of the initial thread, which handles the window and starts DDraw and the other threads, plus four animation threads: one synched via timer, two identical ones synched via timeGetTime, and one synched to the refresh. The last one also displays the output of the other threads by copying the shared system memory buffer into the surface. This is not a realistic setting (it causes graphical artifacts), but it is useful for showing how the individual techniques work.

The code should be quite self-explanatory with its comments.

All 4 sprites should traverse the screen within 4 seconds. You'll see that the 2 threads using the system time will never get out of sync.

The thread restarted by the timer every 25 milliseconds also does its job very well (even when the power management throttled down my CPU it still worked the same way). It is quite hard to get it out of sync.

The thread using the flips happening at refresh time is only useful for keeping the animation within a certain speed range; it is not as reliable as it was in plain DOS.

Check out how the example reacts if another process is eating cpu time as well or even blocking it for some time.

4. push dword EverythingRead ; ExitProcess

The next (and last) tutorial of the Win32Asm series will be a collection of useful snippets and functions, especially for coders who come from DOS or want to port code to Win32.
