OLLAF: a Fine Grained Dynamically Reconfigurable Architecture for OS Support

Author manuscript, published in "Workshop on Design and Architectures for Signal and Image Processing, Belgium (2008)"

Fine Grained Dynamically Reconfigurable Architectures (FGDRA) offer flexibility for embedded systems together with high processing power efficiency, by exploiting optimization opportunities at the architectural level thanks to their fine configuration granularity. But this increases design complexity, which should be abstracted by tools and an operating system. For a solution to be usable, tools, OS, and platform must fit closely together. In this paper we present OLLAF, an FGDRA specially designed to efficiently support an OS. The studies presented here show the contribution of this architecture in terms of hardware context management and preemption support, and the gain that can be obtained, by using OLLAF instead of a classical FPGA, in terms of context management and preemption overhead.


I. INTRODUCTION
This work takes place in the SMILE project. This project aims at providing a distributed middle layer to efficiently handle the complexity of tomorrow's RSoC (Reconfigurable System on Chip). Such a system may contain several computing units of different types. It will embed one or more General Purpose Processors (GPP), but also dynamically reconfigurable architectures (DRA) of different granularities, and especially FGDRA (Fine Grained Dynamically Reconfigurable Architecture). Tomorrow's computing systems have to comply with many constraints. Those constraints may be time related, to meet real time requirements, but also power consumption related, as power is, and will increasingly be, one of the primary concerns of electronic devices.
By fine grained, we mean here an architecture which is reconfigurable at the bit level. A dynamically reconfigurable architecture using single bit LUTs and flip-flops and providing a bit level reconfigurable interconnection matrix, such as the one presented here or the basic logic fabric of most commercial FPGAs, is an example of FGDRA. This kind of architecture can be adapted to any application more optimally than a coarser grain DRA. This feature makes it today the platform of choice when it comes to handling computational tasks in a highly constrained context.
In more general terms, FGDRA can achieve much better efficiency than a GPP, while offering the same versatility and, potentially, a very close flexibility. The counterpart is that it introduces a much greater complexity for application designers. This complexity could be lowered to an acceptable level in two ways. The first is to provide powerful CAD tools; a lot of research is thus conducted in the field of high level synthesis [1]. The second is to abstract the system complexity by providing a middle layer, e.g. an operating system, that abstracts the lower level of the system [2]. Moreover, an OS could manage new tasks at run time. This property is an important feature for DRA. For all those reasons, a specialized operating system is required for FGDRA.
Abbreviations: OLLAF stands for Operating system enabled Low LAtency Fgdra, RSoC for Reconfigurable System on Chip, and FGDRA for Fine Grained Dynamically Reconfigurable Architecture.
In our work we distinguish between an FGDRA, which is a general term, and an FPGA which, for us, refers to an actual silicon device sold under this designation, which can be used as an FGDRA but is not especially designed for that purpose.
The SMILE project follows a distributed approach to the system. Each computing unit of an RSoC (GPP, DSP, DRA, ...) has its own real time kernel. This topology allows the use of a specific, custom made real time kernel for each computing unit, and thus allows every specificity of each computing unit to be taken into account. A message passing communication scheme, based on MPI (Message Passing Interface), ensures a consistent operation of the whole system. In this frame of mind, we developed a dedicated real time kernel for an FGDRA.
This kernel is an adaptation to FGDRA of an abstract OS model which could be described as follows:
• It manages the execution of a set of tasks on a given versatile computational resource. More concretely, it periodically runs a special algorithm to evaluate where and when to run each task. This period is called the Tick and is a tradeoff between efficiency and flexibility; a typical value in a classical OS is tens of milliseconds.
• It offers an abstracted view of the platform to the task designer. In other terms, each task can be designed without worrying about other tasks, and sometimes even about the platform. It then offers a standardized set of services such as communication or synchronization between tasks.
This model slightly differs from most OS implementations proposed for FPGA management, even if the overall idea remains the same.
Both the history of microprocessor based systems and our previous work based on currently available FPGA devices led us to think that not only must an OS kernel be conceived to handle an FGDRA, but an FGDRA must also be designed to efficiently support this OS kernel. This article relates our original work in this direction. The FGDRA core that we have designed will be presented, as well as a more general view of our approach to an FGDRA and its related OS kernel. This paper is organized as follows. Section 2 discusses related work in the field of OS for FGDRA. Section 3 explains our original FGDRA platform proposition named OLLAF. Section 4 discusses more precisely the context management scheme and its extension to configuration management. Section 5 explains how our architecture can affect different OS services. Section 6 presents an analytic comparison between OLLAF and other methods used today in terms of preemption overhead and efficiency. Finally, conclusions are drawn in section 7.

A. OS for FGDRA
Several studies have been conducted in the field of OS for FGDRA [3], [4], [5], [6]. All of them present an OS more or less customized to enable specific FGDRA related services. Examples of such services are partial reconfiguration management, hardware task preemption and hardware task migration. They are all designed on top of a platform composed of a commercial FPGA and a microprocessor. This microprocessor may be a softcore processor, an embedded hardwired core or even an external processor.
In the 90's, some work was also published about the design of a specific architecture for dynamic reconfiguration. In [7], the authors discuss the first multi-context reconfigurable device. This concept has been implemented by NEC in the Dynamically Reconfigurable Logic Engine (DRLE) [8]. In the same period, the concept of DPGA was introduced; it was also proposed in [9] to implement a DPGA on the same die as a classic microprocessor to form one of the first SoCs including dynamically reconfigurable logic. In 1995, Xilinx even filed a patent on a multi-context programmable device, proposed as an XC4000E FPGA with multiple configuration planes [10].
More recently, in [11], the authors propose to add dedicated hardware to a DRA to support OS services; they worked on top of a classic FPGA.
The work presented in this paper tries to take advantage of this previous work on both hardware reconfigurable platforms and OS for FGDRA.

B. Previous work
Our first work on OS for FGDRA was related to preemption of hardware tasks on FPGA [12]. For that purpose we explored the use of a scanpath at the task level. In order to accelerate the context transfer, we explored the possibility of using multiple parallel scanpaths. We also provided the Context Management Unit, or CMU, which is a small IP able to manage the whole process of saving and restoring task contexts.
In that study, both the CMU and the scanpath were built to be implemented on top of any available commercial FPGA. This approach showed a number of limitations. They can be summarized in this way: implementing this kind of OS related hardware on top of an existing DRA introduces unacceptable overhead on both the task and the OS service. In other words, most of the OS related hardware should be hardwired into the platform's architecture as much as possible.

A. Specifications of a FGDRA with OS support
We have designed an FGDRA with OS support following these specifications.
It should first address the problem of the configuration speed of a task. This is one of the primary concerns because if the system spends more time configuring itself than actually running tasks, its efficiency will be poor. The configuration speed will thus have a big impact on the scheduling strategy.
In order to enable more choice of scheduling scheme, and to match some real time requirements, our FGDRA platform must also include preemption facilities. For the same reasons as for configuration, the speed of the context saving and restoring process will be one of our primary concerns. On this particular point, the previous work discussed in section 2 will be adapted and reused.
Scheduling on a single GPP system is just a matter of time: the problem is to distribute the computation time between different tasks. In the case of a DRA, the system must distribute both computation time and computation resources. Scheduling in such a system is then no longer a one dimensional problem, but a three dimensional one. One dimension is time and the two others are the surface of reconfigurable resources. Performing such a scheduling at run time with real time constraints is at this stage not conceivable, but the FGDRA should help getting close to that goal. The primary concern on this subject is to ensure easy task relocation. For that, the reconfigurable logic core should be split into several equivalent blocks. This will allow a task to be moved from one block to another.
hal-00665805, version 1 - 2 Feb 2012
Another aspect of an operating system is to provide inter task communication services. In our case we distinguish two cases. The first is a task running on top of our FGDRA and communicating with another task running on a different computing unit, for example a GPP. This case will not be covered here, as this problem concerns the whole heterogeneous platform, not only the particular FGDRA computing unit. The second case is when two or more tasks running on top of the same FGDRA communicate together. This communication channel should remain the same wherever the tasks are placed on the FGDRA reconfigurable core and whatever state those tasks are in (running, pending, waiting, ...). That means that the FGDRA platform must provide a rationalized communication medium including some sort of exchange memories.
The same arguments could also be applied to inputs/outputs. Here again two cases exist: first, I/O being a global resource of the whole platform; secondly, special I/O directly bound to the FGDRA. Figure 2 shows a global view of OLLAF, our original FGDRA designed to support the OS services as they have just been specified.

B. Proposed solutions
In the center stands the reconfigurable logic core of the FGDRA. This core is organized in columns; each column can be reconfigured separately and offers the same set of services. That means that a task uses an integer number of columns. This topology has been chosen for two reasons. First, using partial reconfiguration by column transforms the scheduling problem into a two dimensional problem (time + 1D space), which is easier to handle in real time situations. Secondly, as every column is the same and offers the same set of services, tasks can be moved from one column to another without any change to the configuration data.
In the figure, at the bottom of each column, you can notice two hardware blocks called CMU and HCM. The CMU, as said earlier, is an IP able to automatically manage the saving and restoring of task contexts. The HCM, standing for Hardware Configuration Manager, is pretty much the same but handles configuration data, also called the bitstream. On each column a local configuration/context memory is added. This memory can be seen as a first level of cache memory, storing contexts and configurations close to the column where they will most probably be required. The internal architecture of the core provides adequate hardware to work with the CMU and HCM. More about this will be discussed in the next section.
On the right of the figure stands a big block called "HW Sup + HW RTK + central memory". This block contains a classic microprocessor which serves as a hardware supervisor. It runs a custom real time kernel specially adapted to handle FGDRA related OS services and platform level communication services. Along with this hardware supervisor, a central memory is provided for OS use only. Basically, this memory stores the configuration and, where applicable, the context of every task that may run on the FGDRA. This supervisor communicates with all columns using a dedicated control bus.
Finally, at the top of figure 2, you can see the application communication medium. This communication medium provides a communication port to each column. Those communication ports will be directly bound to the reconfigurable interconnection matrix of the core. If I/O had to be bound to the FGDRA, they would be connected to this communication medium in the same way reconfigurable columns are.

C. Logic core overview
In order to make the description of the FGDRA core more understandable, we split its functionalities here between two points of view. The first one is the functional point of view; it consists of the information that a task designer may have to know in order to design the architecture. The second is the configuration point of view; it consists of information about the reconfiguration plane. As one of the main goals of the OS is to abstract configuration management, this point of view could be seen as the OS point of view.
The internal architecture of a Logic Element (LE) from the functional point of view can be seen in figure 3. This architecture integrates the elements that compose a classic Logic Element of an FGDRA. Improving the functional architecture should not change our conclusions on the configuration point of view.
A multiplexer based interconnect has been chosen instead of the MOS pass transistors used in most commercial FPGAs. In this way we can lower the number of configuration bits required to allow the same connection flexibility. With pass transistors, the number of configuration bits grows linearly with the number of interconnection possibilities, while with multiplexers it grows as a log2 function.
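The difference between the two growth rates can be made concrete with a small sketch. It assumes one configuration bit per pass transistor and a single multiplexer with binary encoded select lines; the function names are illustrative and do not come from the OLLAF design.

```python
def pass_transistor_bits(n_sources: int) -> int:
    """Assumption: one configuration bit per pass transistor, one per source."""
    return n_sources


def multiplexer_bits(n_sources: int) -> int:
    """Select bits needed to choose one of n_sources, i.e. ceil(log2(n_sources))."""
    if n_sources < 1:
        raise ValueError("at least one source is required")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    return (n_sources - 1).bit_length()


def comparison(sizes=(2, 4, 8, 16, 32)):
    """Return (sources, pass transistor bits, multiplexer bits) for each size."""
    return [(n, pass_transistor_bits(n), multiplexer_bits(n)) for n in sizes]
```

Under these assumptions, a connection point with 16 possible sources needs 16 bits with pass transistors and 4 bits with a multiplexer.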
At first, configuration memory points are modeled as D flip-flops. This allows us to rapidly apply our work on context management to configuration management. However, configuration and context management remain two separate paths: a context swap can be performed without any change in configuration. This can be interesting for checkpointing or when more than one instance of the same task is running.

IV. CONTEXT MANAGEMENT SCHEME
In [12] we proposed a context management scheme based on a scanpath, a local context memory and the CMU, which is a small IP able to automatically manage context transfers between the scanpath and the local memory. The context management scheme in OLLAF differs in two ways. First, all context management related hardware is hardwired into the platform. Secondly, we added two more stages in order to lower preemption overhead even further and to ensure the consistency of the system.
As context management hardware is added at platform level and no longer at task level, it needed to be split differently. As the Programmable Logic Core is column based, it was natural to implement context management at column level. A CMU and a local memory have thus been added to each column, and one scanpath is provided for each column's set of flip-flops.
In order to lower preemption overhead, our reconfigurable logic core uses a double memory plane. The flip-flops used in the LEs are thus replaced with two flip-flops and switching hardware. The architecture of this double plane flip-flop can be seen in figure 4. Run and scan are then no longer two working modes but two parallel planes which can be swapped at will. With this topology, the context of a task can be shifted in while the previous task is still running, and shifted out while the next one is already running. The effective task switching overhead is then brought down to one clock cycle, as illustrated in figure 6.
Contexts are transferred by the CMU into Local Context Memories using this hidden scanpath. Because the context of every column can be transferred in parallel, Local Context Memories are placed at column level. This is particularly useful when a task uses more than one column. Those memories can contain at this stage 10 contexts. They can be seen as local cache memories optimizing access to a bigger memory called the Central Context Repository. The Central Context Repository is a large memory space storing the context of each task instance run by the system. Local Context Memories should then store the contexts of the tasks that are most likely to be the next to run on the corresponding column.
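The cache-like role of a Local Context Memory can be illustrated with a minimal software model. This is only a sketch: the 10 slot capacity comes from the text, but the class name, the dictionary standing in for the Central Context Repository and the oldest-first eviction rule are assumptions made for the example, since the text leaves the replacement policy to the anticipated scheduling.

```python
from collections import OrderedDict


class LocalContextMemory:
    """Toy model of one column's Local Context Memory backed by a central repository."""

    def __init__(self, central_repository: dict, capacity: int = 10):
        self.central = central_repository  # stands in for the Central Context Repository
        self.capacity = capacity           # 10 contexts per column, as stated in the text
        self.slots = OrderedDict()         # task name -> context, oldest entry first

    def prefetch(self, task: str) -> None:
        """Copy a context from the central repository before it is needed."""
        if task in self.slots:
            self.slots.move_to_end(task)
            return
        if len(self.slots) >= self.capacity:
            # Assumed policy: evict the oldest entry to make room.
            self.slots.popitem(last=False)
        self.slots[task] = self.central[task]

    def fetch(self, task: str):
        """Return (context, hit); a miss falls back to the slower central repository."""
        if task in self.slots:
            return self.slots[task], True
        return self.central[task], False
```

In this model, a task whose context was prefetched is served locally, while any other task has to be fetched from the central repository, which is the unfavorable case that the anticipated scheduling tries to avoid.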
After a preemption of the corresponding task, a context can be stored in more than one Local Context Memory (LCM) in addition to the copy stored in the Central Context Repository (CCR). In such a situation, care must be taken to ensure the consistency of the task execution. For that purpose, each time a context is saved, the CMU tags it with a version number. The operating system keeps track of this version number and also increments it each time a context is saved. In this way the system can check the validity of a context before restoring it. The system must also try to update the context copy in the CCR as soon as possible after a context is saved.
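The version check can be sketched as follows. This is a simplified model under stated assumptions: a single counter per task plays the role of both the tag written by the CMU and the value tracked by the operating system, and the names ContextCopy and VersionTracker are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextCopy:
    """A saved context together with the version tag attached when it was saved."""
    task: str
    version: int
    data: bytes


class VersionTracker:
    """OS side record of the latest context version of each task."""

    def __init__(self):
        self.latest = {}  # task name -> version of the most recent save

    def on_save(self, task: str, data: bytes) -> ContextCopy:
        """Increment the version of the task and tag the newly saved context."""
        self.latest[task] = self.latest.get(task, 0) + 1
        return ContextCopy(task, self.latest[task], data)

    def is_valid(self, copy: ContextCopy) -> bool:
        """Only the most recently saved copy of a task may be restored."""
        return self.latest.get(copy.task) == copy.version
```

A stale copy left in another column's memory keeps its old tag, so it fails the check and is not restored.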
The Dual Plane Scanpath, Local Context Memory and Central Context Repository form a complex memory hierarchy specially designed to minimize preemption overhead. The same memory scheme is also used for configuration management, except that a configuration does not change during execution, so it does not need to be saved and no version control is required. The programmable logic core uses a dual configuration plane equivalent to the Dual Plane Scanpath used for context. Each column has a Hardware Configuration Manager, which is a simplified version of the CMU (without the saving mechanism). A Local Configuration Memory is provided beside the Local Context Memory; the name LCM is used, as in figure 3, to refer to both those memories. In the same way, CCR can refer to the Central Context/Configuration Repository.
In the best case, preemption overhead can then be bounded to one clock cycle.
A scenario of a typical preemption is presented here. In this scenario we consider the case where the context and configuration of both tasks are already stored in the right LCM. Let us consider that a task T1 is preempted to run another task T2. The preemption scenario is then as follows:
• T1 is running and the scheduler decides to preempt it to run T2 instead.
• T2's configuration, and possibly its context, is shifted into the second configuration plane.
• Once the transfer is completed, the two configuration planes are switched.
• T2 is now running and T1's context can be shifted out to be saved.
• T1's context is updated as soon as possible in the CCR.
Fig. 6. Typical preemption scenario
This scenario is illustrated in figure 6. It is the case when both the context and configuration of T2 are already stored in the LCM. That means that, in order to obtain this favorable case, we need an anticipated scheduling to manage our Context/Configuration Memory Hierarchy as a smart cache.
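The steps of this scenario can be replayed on a toy model of one dual plane column. It is a behavioral sketch only: the class and method names are hypothetical, tasks are represented by their names, and the only cost counted is the single cycle of the plane switch, as assumed in the favorable case described above.

```python
class DualPlaneColumn:
    """Toy model of a column with an active plane and a hidden (shadow) plane."""

    def __init__(self, running_task: str):
        self.active = running_task  # plane currently executing
        self.shadow = None          # hidden plane used for transfers
        self.stalled_cycles = 0     # cycles during which no task is running

    def shift_in(self, task: str) -> None:
        """Load the next task into the hidden plane while the active task keeps running."""
        self.shadow = task

    def switch_planes(self) -> None:
        """Swap the two planes; modeled as the only step that costs a cycle."""
        self.active, self.shadow = self.shadow, self.active
        self.stalled_cycles += 1

    def shift_out(self) -> str:
        """Save the preempted task from the hidden plane while the new task runs."""
        saved, self.shadow = self.shadow, None
        return saved


def preempt(column: DualPlaneColumn, next_task: str) -> str:
    """Replay the scenario: shift in, switch planes, shift out; return the saved task."""
    column.shift_in(next_task)
    column.switch_planes()
    return column.shift_out()
```

Calling preempt on a column running T1 with T2 as the next task leaves T2 active, returns T1 for saving and records one stalled cycle.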

V. CONFIGURATION, PREEMPTION AND OS INTERACTION
In the previous sections an architectural view of our FGDRA has been presented. In this section, we discuss the impact of this architecture on OS services. We consider here the three services most specifically related to the FGDRA.
First, the configuration management service. On the hardware side, each column provides a hardware configuration manager and an associated local memory. As stated earlier, that means that configurations have to be placed in advance in the local configuration memory. The associated service running on the hardware supervisor microprocessor thus needs to take that into account. That implies that this service must manage an intelligent cache to prefetch task configurations on the columns where they will most probably be placed. In order to do so, an anticipated scheduling must be performed.
Secondly, the preemption service. The same principles apply here as for configuration management, except that contexts also have to be saved. The context management service must ensure that there never exists more than one valid context for each task in the entire FGDRA. Contexts must thus be transferred as soon as possible from the local context memory to the centralized global memory of the hardware supervisor. This service will also have a big impact on the scheduling service, as the ability to perform preemption with a very low overhead allows the use of more flexible scheduling algorithms.
And last, the scheduling service, and in particular the space management part of the scheduling. It takes advantage of the column topology and of the centralized communication scheme. As stated, less computing power is required to manage a one dimensional space at run time. The problem here is similar to memory management in classical GPP based systems. The reconfigurable resource could then be managed as a virtually infinite space containing an undetermined number of columns. The job is then to dynamically map the required set of columns (task) onto the real space (the actual reconfigurable logic core of the FGDRA).
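To illustrate why a one dimensional space is cheap to manage, the sketch below places tasks on columns with a first fit search, in the same spirit as a simple memory allocator. It is not the scheduler of the OLLAF kernel: the first fit policy and the requirement that a task occupies adjacent columns are assumptions made for the example.

```python
def place_task(columns: list, task: str, width: int):
    """Place a task on `width` adjacent free columns; return the first index or None.

    `columns` holds the name of the task occupying each column, or None when free.
    """
    if width < 1:
        raise ValueError("a task needs at least one column")
    run = 0  # length of the current run of free columns
    for index, owner in enumerate(columns):
        run = run + 1 if owner is None else 0
        if run == width:
            start = index - width + 1
            columns[start:index + 1] = [task] * width
            return start
    return None


def release_task(columns: list, task: str) -> None:
    """Free every column occupied by the given task."""
    for index, owner in enumerate(columns):
        if owner == task:
            columns[index] = None
```

A single pass over the list of columns is enough to decide a placement, whereas a two dimensional surface would require a search over rectangles.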

VI. PREEMPTION COST COMPARISON
This section presents an analytic comparison of preemption efficiency in OLLAF and in other solutions from past work or the literature. We consider here six methods:
• XIL: a solution based on the Xilinx XAPP290 [13], using ICAP to transfer both context and configuration and using the readback bitstream for context extraction.
• Scan: a solution using a simple scanpath for context transfer, as described in both [14] and [12], and using the ICAP interface for configuration.
• PCS8: similar to the Scan solution but using 8 parallel scanpaths, as described in [12].
• DPScan: uses a dual plane scanpath similar to the one used in OLLAF for context, and ICAP for configuration. This method is also studied in [14], where it is referred to as a shadow scan chain.
• MM: uses once again ICAP for configuration, and the memory mapped solution proposed in [14] for context.
• OLLAF: uses separate, column distributed, dual plane scanpaths for configuration and context, as proposed in this article.
In this study we consider two parameters. The preemption overhead H is the cost of a preemption for the system in terms of time. The efficiency of the preemption process λ is then $\lambda = 1 - \frac{H}{P}$, where P is the minimum period at which preemption occurs; in our case P is the clock tick of the operating system. In this study we use a typical clock tick of 10 ms. In order to focus on the architectural view only, all times are expressed and estimated in numbers of clock cycles. Assuming a typical clock frequency of 100 MHz, the OS tick is $10^{6}\,t_{clk}$. Task sizes are expressed as n, the number of flip-flops used. The time cost of a preemption takes into account two context transfers and one configuration transfer.
Analytic expressions of H for each case are estimated as follows:
• XIL: in [14], the authors estimate that the bitstream contains 20 times more data than the context related data, so the bitstream of a task of size n is approximately 21n.
• OLLAF: both context and configuration transfers are hidden, so the total cost of a preemption is always 1 clock cycle, whatever the size of the task.
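As a worked instance of the efficiency definition, the sketch below computes the OS tick in clock cycles and the efficiency for the OLLAF case, where H is one clock cycle. The numeric parameters are those given in the text; the function names are illustrative only.

```python
TICK_MS = 10         # OS tick period from the text, in milliseconds
CLOCK_KHZ = 100_000  # 100 MHz expressed in kHz, i.e. clock cycles per millisecond


def tick_in_cycles(tick_ms: int = TICK_MS, clock_khz: int = CLOCK_KHZ) -> int:
    """OS tick period P expressed in clock cycles."""
    return tick_ms * clock_khz


def efficiency(overhead_cycles: float, period_cycles: int) -> float:
    """Preemption efficiency, lambda = 1 - H / P, with H and P in clock cycles."""
    return 1.0 - overhead_cycles / period_cycles


P = tick_in_cycles()
ollaf_efficiency = efficiency(1, P)  # H = 1 clock cycle for OLLAF, as stated in the text
```

With P equal to one million cycles, an overhead of one cycle costs one millionth of the tick, so the efficiency stays within 10^-6 of 1 whatever the task size.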
In order to make a concrete comparison, we consider two tasks T1 and T2, based on a DES56 cryptographic IP that requires 862 flip-flops and a 16-tap FIR filter that requires 563 flip-flops. Both of those IPs can be found at www.opencores.org. To ease the computation, we consider two tasks using the average number of flip-flops of the two considered IPs. So for T1 and T2 we get $n = \frac{862 + 563}{2} \approx 713$. Table I shows the overhead H and the efficiency λ for each method presented.
Those results show that in this case, using our method leads to a preemption overhead around 500 times smaller than the best of the other cases.
If we now consider that not only one task is preempted but the whole FGDRA, assuming a logic core of 1 million LEs, the estimations of overhead and efficiency for each method are shown in table II. Those results clearly show the benefit of the OLLAF platform over current FPGAs concerning preemption. Using current methods, preemption overhead depends linearly on the size of the task. In OLLAF, this overhead does not depend on the size of the task and is always only one clock cycle.
In OLLAF, both context and configuration transfers are hidden thanks to the use of a dual configuration plane. The latency L between the moment a preemption is requested and the moment the new task effectively begins to run can also be studied. This latency only depends on the size of the columns. That means that for a given platform, it is a constant. In the worst case this latency will be far shorter than the OS tick period. The OS tick period being in any case the shortest time in which the system can respond to an event, we can consider that this latency will not affect the system at all.

VII. CONCLUSION AND PERSPECTIVES
A global view of OLLAF, an FGDRA that enhances OS service support, has been presented, along with a more detailed view of its reconfigurable logic core. We claim that OS and platform must be closely linked to each other in order to perform as optimally as possible.
In this paper we presented in more detail our context management scheme and its extension to configuration management. It has been shown that this scheme permits a far better preemption efficiency than other methods in use today.
Today, the reconfigurable logic core has been designed and is being tested through several simulations. The rest of the FGDRA is also in progress. The dedicated custom OS services are written as an extension of µC/OS-II, a well proven real time OS. We are also working on the distributed management of the whole heterogeneous system including at least one of our FGDRA with its dedicated real time kernel, and one GPP.