Parallelize an Application on a Multicore CPU
The following topics are covered in this tutorial:
- Implementation of a Sobel filter with Preesm
- C Code generation
- Parallelization for multi-threaded environment
Prerequisite: Tutorial Introduction
The tutorial is composed of the following parts:
- 1. Initial project setup
- 2. Sequential implementation of the Sobel Filter
- 3. Parallelization of the Application
- 4. Multicore/Multithreaded execution
- Appendix: Change the input resolution
Last updated on 09.21.2016. Tutorial created on 07.31.2013 by K. Desnos.
1. Initial project setup
The first task of this tutorial consists of retrieving and compiling a dummy project that will serve as a basis for the remainder of this tutorial. This first application simply consists of 2 actors and 3 parameters:
- Read_YUV: Actor that reads a YUV video frame by frame and outputs the three components separately.
- display: Actor that displays a YUV frame in an SDL window.
- height and width parameters correspond to the dimensions of the read and displayed video frames.
- index parameter is an integer referencing the display window (it can be used to enable several simultaneous windows).
Download the following files (Pthread library and CMake may have already been downloaded for tutorial_introduction):
- Preesm Project: [link] (34 KB)
- YUV Sequence: [link] (9 MB)
- DejaVu TTF Font: [link] (757KB)
- CMake: [link]
- SDL Development Library (v. 2.0): Do take the "Development library" and not the runtime library [link]
- Pthread development library: This is a self extracting archive [link] (For Win only)
Uncompress the preesm project in a directory named "org.ietr.preesm.sobel".
If you are using Windows, follow the instructions given in "/Code/lib/ReadMe.md" to make sure that the required libraries are in the right place. Under Linux, SDL2 can be retrieved via the package manager (the library is named libsdl2-dev).
On Windows, for the CMake .bat scripts to work correctly, you will need to copy the content of the SDL2 lib directory from /Code/lib/SDL-2.0.xx/lib/x86 (or .../lib/x64) to /Code/lib/SDL-2.0.xx/lib.
1.2. Run Preesm Project
- Right-click in the "Package Explorer" of Preesm and import the "org.ietr.preesm.sobel" project.
- Right-click on the workflow "/Workflows/Codegen.workflow" and select "Run As > Preesm Workflow"
- In the scenario selection wizard, select "/Scenarios/1core.scenario" and click OK.
- During its execution, the workflow will log information into the Console of Preesm. When running a workflow, you should always check this console for warnings and errors (or any other useful information).
The workflow execution generates several intermediary dataflow graphs that can be found in the "/Algo/generated/" directory. The C code generated by the workflow is contained in the "/Code/generated/" directory.
1.3. Run the generated C Project
Before compilation, in "/Code/include/yuvRead.h", make sure that the PATH to the YUV file is correct (the YUV file that you downloaded).
To compile and run the generated C code, simply use the CMake project of the "/Code/" directory. We strongly advise you to generate the IDE projects and binaries in the "/Code/bin" directory so as not to mix the source code with OS/IDE specific files. In the "/Code/" directory, scripts (*.bat and *.sh) are available to automatically create the appropriate folder and launch the CMake project generation for Windows users of Code::Blocks (CMakeCodeblock.bat) and Visual Studio 2013 (CMakeVS2013.bat) as well as for Linux GCC users (CMakeGCC.sh).
The following figure shows what the running application should look like. At this point, the application does not do any "real" computation on the image. Note the performance figure displayed in frames per second (fps) in the console: it is an upper bound on the performance of the video processing application developed in this tutorial. Indeed, since you will now add new actors to the dataflow graph, the amount of computation will increase and the application performance will decrease.
2. Sequential implementation of the Sobel filter
The Sobel filter is an image transformation widely used in image processing applications to detect the edges of a 2-dimensional picture. Applying this filter consists of convolving the Y component of the original image with 2 matrices to obtain two intermediary images. These two images are then assembled to form the final image. More information on the Sobel filter can be found [here].
2.1. Original C code
A C implementation of the Sobel filter is given hereafter:
The C file and its corresponding header file can be downloaded [here]. The C file and the header file should respectively be placed in "/Code/src/" and "/Code/include/".
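As a minimal sketch of the filter described above, and not necessarily the code of the downloadable file, the two 3x3 convolutions and their assembly could look as follows. The function name `sobel_sketch`, its parameter order, the zeroed border pixels and the use of |gx|+|gy| to assemble the two intermediary images are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name and signature; the downloadable sobel.h may differ. */
void sobel_sketch(int width, int height,
                  const unsigned char *input, unsigned char *output)
{
    assert(width >= 3 && height >= 3);

    /* Border pixels have no full 3x3 neighborhood: set to 0 (assumption). */
    for (int x = 0; x < width; x++) {
        output[x] = 0;
        output[(height - 1) * width + x] = 0;
    }
    for (int y = 1; y < height - 1; y++) {
        output[y * width] = 0;
        output[y * width + width - 1] = 0;
        for (int x = 1; x < width - 1; x++) {
            const unsigned char *p = input + y * width + x;
            /* Horizontal gradient, kernel [-1 0 1; -2 0 2; -1 0 1]. */
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1] + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            /* Vertical gradient, kernel [-1 -2 -1; 0 0 0; 1 2 1]. */
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width] + p[width + 1];
            /* Assemble both images and saturate to the uchar range. */
            int g = abs(gx) + abs(gy);
            output[y * width + x] = (unsigned char)(g > 255 ? 255 : g);
        }
    }
}
```

Note that computing line n of the output reads lines n-1 and n+1 of the input, which matters when the image is later cut into slices.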
2.2. Preesm sequential Sobel
Add a new Sobel actor to the application graph. To do so:
- Double-click on "/Algo/top_display.diagram" to open the graph editor.
- Click on "Actor" in the palette (on the right side of the graph editor), then click in the graph to add a new Actor.
- In the "Create Actor" wizard, name the new actor "Sobel".
- Delete the existing fifo between "y" ports of "Read_YUV" and "display" actors.
- Click on "Fifo" in the palette and click successively on the "y" ports of the "Read_YUV" and on the "Sobel" actor.
- Name "input" the "Sobel" input port for the new fifo.
- Repeat the last 2 steps to add another fifo between an "output" port of the "Sobel" actor and the "y" port of the display actor.
- Set the type of the newly created fifos to "uchar" (right click on the fifo > Set the Data type).
- Click on "Dependency" in the palette and click successively on the "width" parameter and the "Sobel" actor. Name the new configuration input port "width".
- Add another dependency between the "height" parameter and the "Sobel" actor, and name the new configuration input port "height".
You now need to set the properties of the new ports (production and consumption rates). To do so,
- Locate the "Properties" view of Preesm. (If the view is not visible, go into "Menu bar > Window > Show View > Other..." and select "General > Properties").
- In the graph editor, select the fifo between actors "Read_YUV" and "Sobel".
- In the "Properties" view, set the expressions associated to the source and target ports to: "height*width".
- Repeat the last two steps for the fifo between actors "Sobel" and "display".
Next, you need to tell Preesm which C function should be called when the newly added Sobel actor is to be executed. To do so,
- Drag-and-drop the "sobel.h" file from the "Package Explorer" on the "Sobel" actor in the graph.
- In the dialog asking for the prototype of the loop function (the function called at each execution of the Sobel actor), select the only one proposed (named sobel).
- When asked to select an init function, click "Cancel" (this function is optional and, in the case of the Sobel actor, unnecessary).
- Save the diagram.
To complete the application configuration,
- Open the "/Scenarios/1core.scenario" by double-clicking on it in the "Package Explorer".
- In the "Constraints" tab, select the "Core0" operator and tick the "Sobel" box to allow the Sobel actor to execute on this core.
- Save the scenario.
- Execute the workflow. (Don't forget to check the Console for errors and warnings.)
2.3. Run the sequential Sobel
The objective of this step is to confirm the correct behavior of the sequential implementation of the filter before parallelizing and optimizing it. Before compiling the application, add the directive #include "sobel.h" in the "/Code/include/x86.h" header file.
The performance obtained with the sequential implementation will serve as a comparison point to measure the benefits of future optimizations. To run the application, simply follow the steps presented in sections 1.2 and 1.3 ("Run Preesm Project" and "Run the generated C Project").
If errors occurred in Preesm, they appear in red in the Console. Check the Preesm Console whenever the code has not been generated correctly.
The following figure shows what the running application should look like.
3. Parallelization of the Application
The objective of this section is to modify the original Sobel application so as to expose a parameterizable degree of data parallelism. The basic idea behind this modification is to split the original image into slices that can be processed in parallel.
3.1. Split/Merge Actors
The computation of the Sobel filter involves the convolution of the image with 3x3 matrices. This operation implies that computing the nth line of pixels of the output image requires access to the (n-1)th and (n+1)th lines of pixels of the input image. Consequently, the Split actor will produce slices with 2 extra lines of pixels: the last line of the previous slice and the first line of the next slice. A C implementation of the Split actor is given hereafter:
Hereafter is the C implementation of the Merge actor whose purpose is to assemble the processed slices into the output image.
In addition to the "input" and "output" pointers, those two actors receive 3 parameters:
- width and height: the dimensions of the sliced image
- nbSlice: the number of slices created/assembled by the actors. It is the developer's responsibility to ensure that height is a multiple of nbSlice.
The C and header files corresponding to the Split and Merge actors can be downloaded [here]. Uncompress them in the "include" and "src" directories of your project.
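The slicing scheme described in this section can be sketched as follows. This is an illustrative version under stated assumptions, not the downloadable code: the names `split_sketch` and `merge_sketch`, the parameter order, and the choice to fill the lines that fall outside the image (above the first slice and below the last one) with zeros are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names; the downloadable splitMerge.h may use other prototypes. */
void split_sketch(int nbSlice, int width, int height,
                  const unsigned char *input, unsigned char *output)
{
    assert(nbSlice > 0 && height % nbSlice == 0);
    int sliceLines = height / nbSlice;            /* lines owned by a slice */
    int outSliceSize = (sliceLines + 2) * width;  /* plus 2 extra lines     */

    for (int i = 0; i < nbSlice; i++) {
        unsigned char *dst = output + i * outSliceSize;
        int firstLine = i * sliceLines - 1;  /* last line of previous slice */
        for (int l = 0; l < sliceLines + 2; l++) {
            int srcLine = firstLine + l;
            if (srcLine < 0 || srcLine >= height) {
                /* Outside the image: zero padding (assumption). */
                memset(dst + l * width, 0, width);
            } else {
                memcpy(dst + l * width, input + srcLine * width, width);
            }
        }
    }
}

void merge_sketch(int nbSlice, int width, int height,
                  const unsigned char *input, unsigned char *output)
{
    assert(nbSlice > 0 && height % nbSlice == 0);
    int sliceSize = (height / nbSlice) * width;
    int inSliceSize = sliceSize + 2 * width;
    for (int i = 0; i < nbSlice; i++) {
        /* Skip the first extra line of each slice and drop the last one. */
        memcpy(output + i * sliceSize, input + i * inSliceSize + width, sliceSize);
    }
}
```

With these definitions, the output buffer of the split holds nbSlice*width*(height/nbSlice+2) pixels, and merging an unprocessed split gives back the original image.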
3.2. Preesm parameterizable parallel implementation
Following the same steps as those given in section 2.2, add the "Split" and "Merge" actors and the "nbSlice" parameter to the graph and connect them as shown in the following figure. Set the default value of the "nbSlice" parameter to 8 (edit the Expression field in the "Properties" view of the parameter).
Since the Sobel actor will now receive slices of images rather than entire images, we must give it the size of the slices as a parameter, not the size of the entire images. To do so, add a "sobel_height" parameter to the graph. Remove the dependency between the "height" parameter and the "Sobel" actor and replace it with three dependencies:
- one between "height" and "sobel_height";
- one between "nbSlice" and "sobel_height";
- and one between "sobel_height" and the "height" input port of the "Sobel" actor.
In the "Properties" view of the new parameter, enter the following expression: height/nbSlice+2.
Set the data type of all new FIFOs to "uchar", then define the following production and consumption rates for the new FIFOs of the graph.
|FIFO|Source Production|Target Consumption|
|---|---|---|
|Read_YUV → Split|height*width|height*width|
|Split → Sobel|nbSlice*width*(height/nbSlice+2)|height*width|
|Sobel → Merge|height*width|nbSlice*width*(height/nbSlice+2)|
|Merge → display|height*width|height*width|
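To see why these rates expose parallelism, the arithmetic can be worked through in a short sketch. It assumes that the "height*width" expression on the Sobel ports is evaluated with the value received on the "height" port of the Sobel actor, that is, sobel_height. The function names are hypothetical helpers, not part of Preesm.

```c
#include <assert.h>

/* Hypothetical helper: value of the "sobel_height" parameter. */
int sobel_height(int height, int nbSlice)
{
    assert(nbSlice > 0 && height % nbSlice == 0);
    return height / nbSlice + 2;
}

/* Tokens produced by one Split firing on the Split -> Sobel FIFO. */
long split_production(int width, int height, int nbSlice)
{
    return (long)nbSlice * width * sobel_height(height, nbSlice);
}

/* Tokens consumed by one Sobel firing (its "height" port is sobel_height). */
long sobel_consumption(int width, int height, int nbSlice)
{
    return (long)width * sobel_height(height, nbSlice);
}

/* Number of Sobel firings needed to consume one Split production. */
long sobel_firings(int width, int height, int nbSlice)
{
    return split_production(width, height, nbSlice)
         / sobel_consumption(width, height, nbSlice);
}
```

For a 352x288 image and nbSlice = 8, sobel_height is 38, Split produces 107008 tokens, each Sobel firing consumes 13376 tokens, and the Sobel actor therefore fires 8 times per frame.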
Before executing the workflow, you must:
- Associate the Split and Merge actors respectively with the loop function prototypes "split" and "merge" from the "/Code/include/splitMerge.h" file (still no init function there).
- Update the scenario to allow the execution of the new actors on "Core0".
3.3. Exposed parallelism
After executing the workflow on the mono-core scenario, open the graph generated in "/Algo/generated/singlerate/top_display.graphml". This graph results from the transformation of the input SDF graph into an equivalent single-rate graph where each edge has equal production and consumption rates. As expected, this graph reveals 8 duplicates of the Sobel actor, each responsible for the processing of one of the slices.
Before proceeding to the next step, we strongly advise you to compile and run the application on 1 core. Even though a monocore execution will not benefit from the exposed parallelism, this step is often necessary to ensure the correct functional behavior of the application. Indeed, once the application is parallelized on multiple threads/cores, debugging often becomes more complex and tiresome.
To compile the application, simply follow the steps presented in section 1.2 and do not forget to add the #include "splitMerge.h" directive to "/Code/include/x86.h".
4. Multicore/Multithreaded execution
In this section, we are going to define a new architecture model and a new scenario in Preesm to exploit the application parallelism revealed in section 3.
4.1. Multicore architecture model
To exploit the parallelism of the application, a new multicore architecture must be created. In Preesm, the System-Level Architecture Model (S-LAM) is used to model heterogeneous multiprocessor architectures with a high level of abstraction. This architecture model is used during the workflow execution to map and schedule the actors on the processing elements of the architecture and to route the inter-core communications. More information on the S-LAM model can be found [here].
To create a new multicore architecture model similar to the one in the figure, follow these steps:
- In the "Package Explorer", create a copy of "/Archi/1CoreX86.slam" and name it "4CoreX86.slam"
- Double-click on the new "4CoreX86.slam" to open it with the S-LAM Editor.
- Copy/Paste the "Core0" processing element 3 times. Name the new cores "Core1" to "Core3".
- Using the "undirectedDataLink" from the Palette, add connections between the "shared_mem" and the 3 new cores. Name all ports "shared_mem".
You can use these steps to add any number of processing elements to your architecture to best reflect the number of cores of your CPU. Note that a thread will be generated for each core of the architecture model on which some actors are mapped.
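The "one thread per core" principle can be pictured with the following minimal sketch, which assumes a POSIX threads environment. It is not the code generated by Preesm: the names `NB_CORES`, `CoreArgs`, `core_thread` and `run_on_cores` are hypothetical, the round-robin assignment of slices is purely illustrative (in Preesm the mapping is computed by the workflow), and any inter-thread synchronization beyond the final join is omitted.

```c
#include <assert.h>
#include <pthread.h>

#define NB_CORES 4  /* hypothetical constant matching the 4-core model */

typedef struct {
    int coreId;
    int nbSlice;
    int *sliceOwner;  /* sliceOwner[s] = core that processed slice s */
} CoreArgs;

/* Stand-in for the actors mapped on one core (round-robin assumption). */
static void *core_thread(void *arg)
{
    CoreArgs *a = (CoreArgs *)arg;
    for (int s = a->coreId; s < a->nbSlice; s += NB_CORES) {
        a->sliceOwner[s] = a->coreId;  /* a real thread would fire Sobel here */
    }
    return NULL;
}

/* Launch one thread per core and wait for all of them; 0 on success. */
int run_on_cores(int nbSlice, int *sliceOwner)
{
    assert(nbSlice > 0 && sliceOwner != NULL);
    pthread_t threads[NB_CORES];
    CoreArgs args[NB_CORES];

    for (int c = 0; c < NB_CORES; c++) {
        args[c] = (CoreArgs){ c, nbSlice, sliceOwner };
        if (pthread_create(&threads[c], NULL, core_thread, &args[c]) != 0) {
            /* Creation failed: wait for the threads already started. */
            for (int j = 0; j < c; j++) {
                pthread_join(threads[j], NULL);
            }
            return -1;
        }
    }
    for (int c = 0; c < NB_CORES; c++) {
        pthread_join(threads[c], NULL);
    }
    return 0;
}
```

Each thread writes to distinct entries of the shared array, so no lock is needed in this sketch. With 8 slices and 4 cores, every core handles 2 slices.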
4.2. Generation of a multicore schedule
Before generating a multicore schedule, you need to create a new scenario that will associate the Sobel algorithm with the new 4Core architecture. To do so:
- In the "Package Explorer", create a copy of "/Scenarios/1core.scenario" and name it "4core.scenario".
- Double-click on the new "4core.scenario" to open it with the Scenario Editor.
- In the "Overview" tab, set the "Architecture file path" to "/Archi/4CoreX86.slam".
- Save the scenario, close it and reopen it to take the new architecture into account in the editor.
- In the "Constraints" tab, allow the execution of all actors on all cores of the architecture. We advise you to allow the execution of the display actor only on Core0 as this makes the closure of the display window stop the whole program execution.
- In the "Simulation" tab, allow the execution of the broadcast/implode/explode actors on all cores.
- Save the updated scenario.
You can now run the "Codegen.workflow" with the new scenario. The generated schedule should make use of the 4 cores, as displayed in the produced Gantt chart.
4.3. Run the multithread code
When running the program, an fps counter is displayed (in the console and in the window). This counter makes it possible to measure the performance gain obtained when using multiple threads. For example, on a quad-core Intel Xeon CPU clocked at 3.10GHz, we observed a speedup of 2.67 (from ~375fps to ~1000fps) on a 352x288 image.
In order to have realistic timings in the Gantt chart, actor timings should be entered in the scenario. Actor timings must be measured by execution profiling.
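One simple way to obtain such measurements is to time repeated firings of an actor function. The sketch below is only an illustration, assuming a POSIX system providing `clock_gettime`; the names `time_actor_us` and `count_firing` are hypothetical, and the unit expected by the scenario editor is not specified here, so the returned value may need converting.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Average wall-clock duration of one firing, in microseconds. */
double time_actor_us(void (*fire)(void *), void *ctx, int iterations)
{
    assert(fire != NULL && iterations > 0);
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++) {
        fire(ctx);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_us = (double)(end.tv_sec - start.tv_sec) * 1e6
                      + (double)(end.tv_nsec - start.tv_nsec) / 1e3;
    return elapsed_us / iterations;
}

/* Stand-in actor used for illustration: counts how often it is fired. */
void count_firing(void *ctx)
{
    int *counter = (int *)ctx;
    (*counter)++;
}
```

In practice, `fire` would wrap a call to an actor function such as the Sobel filter on a representative slice. Averaging over many iterations reduces the influence of timer resolution.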
The final project resulting from all the modifications presented in this tutorial is available [here]. (Note that the external library, the YUV sequence and the generated C code are not included).
You can continue the tutorials by trying Tutorial 3: Code Generation for Multicore DSP: [Here].
Appendix: Change the input resolution
This section details the changes needed in order to run the Sobel application with YUV sequences of different resolution and length (such as the ones available [here]).
- In the application graph editor, set the new values of width and height, and check that height is a multiple of nbSlice.
- Save the graph modifications and run the workflows.
- In "/Code/include/yuvRead.h", set the pre-processor variable "NB_FRAME" to the number of frames of your sequence. You can also change the path to your YUV sequence in this file. Note that in the current implementation, the Read_YUV actor will not work for a large number of HD frames because of an overflow of the file pointer.
- In "/Code/include/displayYUV.h", set the pre-processor variable "DISPLAY_W" to "<your video width>*NB_DISPLAY" and "DISPLAY_H" to your video height.
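Before switching to a new sequence, the constraints mentioned above can be checked with a few lines of C. This sketch rests on two assumptions that are not stated in the tutorial: the sequence is stored as planar YUV 4:2:0 (so a frame occupies width*height*3/2 bytes), and the file pointer overflow corresponds to a signed 32-bit byte offset. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if height is a multiple of nbSlice, as required by Split/Merge. */
int slicing_is_valid(int height, int nbSlice)
{
    return nbSlice > 0 && height % nbSlice == 0;
}

/* Bytes per frame, ASSUMING planar YUV 4:2:0 (U and V at quarter size). */
int64_t frame_bytes_yuv420(int width, int height)
{
    assert(width > 0 && height > 0);
    return (int64_t)width * height * 3 / 2;
}

/* Returns 1 if the whole sequence fits in a signed 32-bit byte offset
 * (assumed cause of the overflow mentioned for HD sequences). */
int fits_in_32bit_offset(int width, int height, int nbFrame)
{
    return frame_bytes_yuv420(width, height) * nbFrame <= INT32_MAX;
}
```

Under these assumptions, a 352x288 frame takes 152064 bytes, and a 1920x1080 sequence stays within a 32-bit offset for up to 690 frames.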