AN IMPROVED DATA STREAMING MODEL FOR SYNTHETIC SYSTEM-ON-CHIP BENCHMARK CIRCUIT GENERATION

by

Jankiben Patel
B.E, Gujarat University, 1998

PROJECT SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF ENGINEERING

In the
School of Engineering Science

© Jankiben Patel 2010
SIMON FRASER UNIVERSITY
Summer 2010

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced, without authorization, under the conditions for Fair Dealing. Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.
APPROVAL

Name: Jankiben Patel
Degree: Master of Engineering
Title of Project: An improved data streaming model for Synthetic System-on-Chip benchmark circuit generation

Examining Committee:

Chair: 
Dr. Rick Hobson, P.Eng
Professor, School of Engineering Science

Dr. Lesley Shannon, P.Eng
Senior Supervisor
Assistant Professor, School of Engineering Science

Dr. Marek Syrzycki
Supervisor
Professor, School of Engineering Science

Dr. Glenn Chapman, P.Eng
Internal Examiner
Professor, School of Engineering Science

Date Defended/Approved: July 20, 2010
Declaration of Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library
Burnaby, BC, Canada
ABSTRACT

Field Programmable Gate Array (FPGA) researchers aim to improve the quality of the Computer-Aided Design (CAD) tools used to map designs onto FPGAs and to evaluate different possible architectures. To test new architectures and CAD tools, researchers need benchmark circuits representative of realistic applications, which are not freely available. Therefore, researchers have considered randomly generated benchmark circuits that can model the complexities of real systems. We use the existing Benchmark Circuit Generator (BCGEN), which generates circuits with System-on-Chip (SoC) architectures. While BCGEN provides a good framework for circuit generation, its existing dataflow communication patterns have some limitations. Specifically, they produce large numbers of inputs and outputs (I/Os), which does not scale, and they do not support data buffering. We aim to improve BCGEN’s dataflow communication model by reducing the number of I/Os, which makes the generated circuits scalable, and by adding data buffering between logic stages, which is typical of data-streaming applications.

Keywords: Field Programmable Gate Array; Computer-Aided Design; System-on-Chip; Synthetic Benchmarks
ACKNOWLEDGEMENTS

I am heartily thankful to my supervisor Dr. Lesley Shannon for this project. I would have been lost without her kind support and knowledge. I would also like to thank Dr. Steve Wilton (University of British Columbia) and his student Cindy Mark for sharing their work on the Benchmark Circuit Generator and for their permission to carry on the project which made this work successful.

I am also thankful to my husband Jasbir and my son Shubham for supporting me throughout my work and for bearing with my absence from home to make this work successful. I would also like to thank all my lab members (especially David, Jason and Jian) for their kind support on different fronts whenever I needed it.
# TABLE OF CONTENTS

Approval
Abstract
Acknowledgements
Table of Contents
List of figures
List of tables
Glossary

1: Chapter 1
   Introduction
   1.1 Motivation
   1.2 Objective
   1.3 Contribution
   1.4 Organization

2: Chapter 2
   Background
   2.1 FPGA Architecture
   2.2 CAD Tools
   2.3 BCGEN
   2.4 Existing Synthetic Benchmark Circuit Generators
       2.4.1 GEN
       2.4.2 GNL
       2.4.3 Other Generators

3: Chapter 3
   Data Streaming Model
   3.1 Updates in the BCGEN Data Streaming Model
   3.2 Details of data streaming modification
       3.2.1 Multiplexer
       3.2.2 Demultiplexer
       3.2.3 SRL FIFO

4: Chapter 4
   Updates to the BCGEN Code Base
   4.1 Software Organization
   4.2 Our Modifications

5: Chapter 5
   Testing and Discussion of Results
   5.1 Methodology
   5.2 Results

6: Chapter 6
   Conclusions and Future Work
   6.1 Conclusions
   6.2 Future work

REFERENCES

APPENDIX
   Our Modified Source code for BCGEN dataflow pattern
LIST OF FIGURES

Figure 2.1: Architecture of a logic block [12]
Figure 2.2: Architecture of an Island Style FPGA [12]
Figure 2.3: Typical CAD flow
Figure 2.4: Dataflow connection in original BCGEN
Figure 3.1: Proposed dataflow connection in the BCGEN
Figure 3.2: Modules connected using multiplexer
Figure 3.3: Modules connected using demultiplexer
Figure 4.1: Software organization flow for the BCGEN
Figure 5.1: Results of I/O
Figure 5.2: Results of Rent Parameters
Figure 5.3: Results of No. of Clusters
LIST OF TABLES

Table 5.1: Results from our work and original BCGEN for the Dataflow network
# GLOSSARY

<table>
<thead>
<tr>
<th>Abbreviation</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>ASIC</td>
<td>Application-Specific Integrated Circuit</td>
</tr>
<tr>
<td>BCGEN</td>
<td>Benchmark Circuit Generator</td>
</tr>
<tr>
<td>BLIF</td>
<td>Berkeley Logic Interchange Format</td>
</tr>
<tr>
<td>CAD</td>
<td>Computer-Aided Design</td>
</tr>
<tr>
<td>DEMUX</td>
<td>Demultiplexer</td>
</tr>
<tr>
<td>FIFO</td>
<td>First-In-First-Out</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field Programmable Gate Array</td>
</tr>
<tr>
<td>HDL</td>
<td>Hardware Description Language</td>
</tr>
<tr>
<td>IC</td>
<td>Integrated Circuit</td>
</tr>
<tr>
<td>I/O</td>
<td>Input/Output</td>
</tr>
<tr>
<td>IP</td>
<td>Intellectual Property</td>
</tr>
<tr>
<td>LUT</td>
<td>Look-Up Table</td>
</tr>
<tr>
<td>MCNC</td>
<td>Microelectronics Center of North Carolina</td>
</tr>
<tr>
<td>MUX</td>
<td>Multiplexer</td>
</tr>
<tr>
<td>NoC</td>
<td>Network-on-Chip</td>
</tr>
<tr>
<td>SoC</td>
<td>System-on-Chip</td>
</tr>
</tbody>
</table>
1: CHAPTER 1

Introduction

Field Programmable Gate Arrays (FPGAs) are Integrated Circuits (ICs) that can be configured by the user after manufacturing. The user describes a design using a Hardware Description Language (HDL), such as VHDL or Verilog, which is run through a Computer-Aided Design (CAD) flow to generate a bitstream for the FPGA configuration.

FPGAs have become a very popular implementation technology for digital circuits and an attractive alternative to Application-Specific Integrated Circuits (ASICs) due to their lower costs for low-volume designs. ASICs are now mainly used for high-volume, high-speed or low-power designs, as FPGAs have sufficient logic density to implement Systems-on-Chip (SoCs). FPGAs are used in many applications, including wireless applications [1], biomedical imaging [2], digital signal processing [3], automotive electronics [4] and computer vision [5].

In the past two decades, the logic capacity of FPGAs has increased dramatically. The FPGA market increased from $1.9 billion in 2005 to $2.75 billion by the year 2010 [6]. This increase is due to their increased logic density, performance capabilities and lower development costs. The performance of circuits on an FPGA depends on the quality of the FPGA architecture, the FPGA’s hardware fabric, and the quality of the Computer-Aided Design (CAD) tools used to map circuits onto the FPGA.

FPGA CAD tool research aims to improve the quality of the final design mapped onto an FPGA by improving the algorithms that convert the HDL design into the final bitstream. Similarly, new FPGA architectures are analyzed using these CAD tools, which are used to implement different circuits on the new architecture. To evaluate new CAD algorithms and architectures, a number of test circuits, known as benchmark circuits, are required. The benchmark circuits are mapped, placed and routed onto the different FPGA architectures using different CAD algorithms. The quality of an architecture is evaluated based on the area and delay measurements for the different benchmark circuits.

1.1 Motivation

There are many benchmark circuits available that can be implemented on an FPGA. Many vendors have their own large databases of such benchmark circuits. However, such vendor-specific circuits are not freely available to researchers and access to the source code is restricted. To overcome this limitation, randomly generated benchmark circuits can be used for research. In such circuits, random netlists are generated from parameters specified by the user. Large numbers of test circuits can be generated using a benchmark circuit generator because they do not require any design time. Previously introduced benchmark circuit generators are GEN [7] and GNL [8]. More recently, the Benchmark Circuit Generator (BCGEN), a stochastic benchmark circuit generator developed by Cindy Mark (UBC) [9], was created to generate circuits that are designed to emulate Systems-on-Chip (SoCs). BCGEN creates circuits by connecting intellectual property (IP) modules, such as processors and memory, using different communication patterns, including bus, star and dataflow communication patterns.

The BCGEN tool provides a good framework for generating benchmark circuits. However, there are limitations when generating circuits with a dataflow communication pattern. In the existing dataflow communication pattern, the number of Inputs and Outputs (I/Os) in the final circuit increases dramatically, and there is no support for data buffering. Hence, our aim is to reduce the number of inputs and outputs and to add data buffering capability to the dataflow communication pattern of the BCGEN tool.

1.2 Objective

The main objective of this project is to improve the representation of the dataflow system in BCGEN by including data buffering capabilities and minimizing I/O growth. Limiting the number of I/Os is achieved by inserting multiplexer and demultiplexer nodes between the original nodes (stages). Additionally, inserting a First-In-First-Out (FIFO) module between the original nodes provides storage for data in transit, which enables data buffering between stages.

The original BCGEN tool has the following characteristics:

- The final circuits are generated by stitching modules together using different communication patterns such as star, bus or dataflow.
- The component circuits are selected from existing benchmarks (such as circuits from the MCNC benchmark suite) or from other existing generators.
- The final circuits are generated based on either user selectable parameters or default parameters.
- The generator is able to handle combinations of multiple communication patterns in a single circuit.

Our modifications to the existing generator aim to improve the dataflow modelled circuits.

The characteristics of the modified dataflow pattern are:
- Multiplexer and demultiplexer nodes are used in between the existing nodes to reduce the I/O count.
- FIFO modules are inserted in between the existing nodes (stages) to implement data buffering, typically found in dataflow designs.

1.3 Contribution

Contributions for this work are as follows:

1) Firstly, multiplexer and demultiplexer modules are designed such that they can produce the desired number of output connections from a given number of input connections.

2) Secondly, an FSL FIFO module is used for data buffering. Its source code is obtained from OpenCores [10].

3) Thirdly, a more accurate and scalable representation of the dataflow communication pattern is implemented by stitching multiplexer, demultiplexer and FIFO modules between the original modules.

The existing BCGEN tool is improved by integrating the above-mentioned modifications. After successful implementation, the circuits generated using the improved BCGEN are tested using T-VPack and VPR [11]. The test results are compared with those of circuits generated using the original BCGEN. The analysis shows that the circuits generated using the improved BCGEN have fewer I/O connections than those generated using the original BCGEN.

1.4 Organization

Chapter 2 provides the background information related to this work, including the FPGA architecture, CAD tools, previous work on synthetic circuit generators and the detailed functionality of BCGEN. Chapter 3 details our updates to the BCGEN data streaming model to improve the dataflow circuit models. Chapter 4 describes the corresponding updates to the BCGEN code base. Chapter 5 covers testing and discusses the results. Finally, Chapter 6 concludes this report and suggests possible future work.
2: CHAPTER 2

Background

This chapter provides an overview of FPGA architecture as well as information about its CAD flow. The chapter also summarizes previous work on synthetic benchmark circuit generators, including BCGEN.

2.1 FPGA Architecture

FPGAs have three main components: logic blocks, programmable routing fabric, and input/output (I/O) blocks. Logic blocks are mainly used for implementing the combinational and sequential logic functionality of the desired circuit. Typically, a logic block contains Look-Up Tables (LUTs), Flip-Flops (FFs) and 2-to-1 multiplexers. Figure 2.1 shows a simple architecture of a logic block.

Figure 2.1: Architecture of a logic block [12]

Figure 2.1 shows a logic block based on a 4-input LUT, a flip-flop and a 2-to-1 multiplexer. An N-input LUT is implemented as an N-bit addressable \(2^N \times 1\) memory. Each single-bit memory cell can store two different values. Therefore, \(2^N\) bits can be used to store \(2^{2^N}\) different combinations representing \(2^{2^N}\) different logic functions. The flip-flop is used for implementing the sequential logic in a design. Finally, the multiplexer selects whether the implemented logic function is purely combinational or sequential. The programmable routing fabric provides the interconnect between the logic blocks, or between the logic blocks and the I/O blocks. I/O blocks behave as input or output pads as required by the circuit. FPGA architectures are mainly classified as:

- Island-Style FPGA
- Row-Based FPGA
- Hierarchical FPGA
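
Before turning to the island-style architecture, the logic block of Figure 2.1 can be illustrated with a small behavioural model. The sketch below is only an illustration under stated assumptions: the class `LogicBlock`, its fields and the helper `num_lut_functions` are hypothetical names introduced here, and the first LUT input is arbitrarily taken as the most significant address bit.

```python
from dataclasses import dataclass
from typing import List, Sequence


def num_lut_functions(n_inputs: int) -> int:
    """Number of distinct logic functions an N-input LUT can hold: 2^(2^N)."""
    return 2 ** (2 ** n_inputs)


@dataclass
class LogicBlock:
    """Hypothetical model of Figure 2.1: one LUT, one flip-flop, one 2-to-1 mux."""
    n_inputs: int
    truth_table: List[int]      # the 2**n_inputs configuration bits of the LUT
    registered: bool = False    # mux select: True picks the flip-flop output
    ff_state: int = 0           # current flip-flop contents

    def __post_init__(self) -> None:
        if len(self.truth_table) != 2 ** self.n_inputs:
            raise ValueError("truth table must hold 2**N bits")

    def lut(self, inputs: Sequence[int]) -> int:
        """Use the input bits as an address into the 2**N x 1 memory."""
        address = 0
        for bit in inputs:      # assumption: first input is the MSB of the address
            address = (address << 1) | (bit & 1)
        return self.truth_table[address]

    def clock(self, inputs: Sequence[int]) -> int:
        """Evaluate one clock cycle and return the block output."""
        comb = self.lut(inputs)
        out = self.ff_state if self.registered else comb
        self.ff_state = comb    # the flip-flop captures the LUT output
        return out


# A 2-input LUT configured as XOR; a 4-input LUT can hold 65536 functions.
xor2 = LogicBlock(n_inputs=2, truth_table=[0, 1, 1, 0])
print(xor2.lut([1, 0]), num_lut_functions(4))
```

With `registered=True`, the block output lags the LUT output by one clock cycle, which is the sequential mode selected by the multiplexer.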

As the Island-Style FPGA architecture (see Figure 2.2) is the most popular among researchers, we discuss it in detail below.

Figure 2.2: Architecture of an Island Style FPGA [12]

As shown in Figure 2.2, “channels” of routing wires surround the logic block on all sides (hence it is known as the “Island-Style FPGA”). Logic blocks are connected to the routing wires via a connection block (not shown in Figure 2.2). The switch block is a set of programmable switches, which allows the appropriate connection between a source and the related sinks (see Figure 2.2). Logic blocks are typically grouped in clusters, and the connections made within these groupings are known as *intra-cluster* connections. Conversely, the connections made between different clusters are known as *inter-cluster* connections. Generally, intra-cluster connections are faster than inter-cluster connections. The following section describes how a user programs the device to implement a design.

Figure 2.3: Typical CAD flow

2.2 CAD Tools

An FPGA user designs a digital circuit using either a Hardware Description Language (HDL) or schematic entry. Subsequently, the CAD tools convert this high-level design (HDL or schematic) into a bitstream file, which is used to program the FPGA. A typical CAD tool flow is shown in Figure 2.3. In the CAD tools, high-level synthesis converts the VHDL/Verilog circuit into a register transfer level (RTL) description. Logic optimization performs the required logic simplification. In the subsequent step, known as technology mapping, a netlist of logic gates is mapped into a netlist of logic blocks (LUTs + FFs) that implements the logic function. Then, the logic blocks are packed into clusters. Next, the clusters are placed onto a virtual mapping of the device. The cluster placement is optimized such that the connection length between connected clusters is minimized. Once the location of each cluster has been decided, the router determines the specific routing resources (wire tracks and switches) that should be used to connect all the logic blocks’ input and output pins as required by the circuit. Finally, the circuit is converted into the bitstream, which is used to program an FPGA.
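
To make the placement objective concrete, the sketch below estimates the connection length of a placement using the half-perimeter of each net's bounding box. This is a commonly used wirelength estimate and is chosen here only as an assumption for illustration; the text above does not state which cost function any particular placer uses, and the names `net_hpwl`, `total_wirelength` and the example clusters are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Position = Tuple[int, int]  # (x, y) grid location of a cluster


def net_hpwl(net: Sequence[str], placement: Dict[str, Position]) -> int:
    """Half-perimeter of the bounding box enclosing all clusters on one net."""
    xs = [placement[cluster][0] for cluster in net]
    ys = [placement[cluster][1] for cluster in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets: Sequence[Sequence[str]],
                     placement: Dict[str, Position]) -> int:
    """Sum of the per-net estimates: the quantity a placer would try to reduce."""
    return sum(net_hpwl(net, placement) for net in nets)


# Three clusters and three nets (all hypothetical).
nets = [["A", "B"], ["A", "C"], ["A", "B", "C"]]
spread_out = {"A": (0, 0), "B": (3, 0), "C": (0, 2)}
tighter = {"A": (0, 0), "B": (1, 0), "C": (0, 2)}

print(total_wirelength(nets, spread_out))  # larger estimate
print(total_wirelength(nets, tighter))     # smaller estimate after moving B
```

A placer would compare candidate placements by such a cost and keep moves that lower it; the router then has shorter connections to realize.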

Placement and routing are important aspects of CAD tools for FPGAs, and many algorithms have been developed for efficient placement and routing. To test these CAD algorithms accurately, large benchmark circuits must be mapped onto an FPGA. Most researchers use the MCNC (Microelectronics Center of North Carolina) benchmark circuits for this purpose. However, even the largest of these benchmark circuits is much smaller than the circuits implemented on modern commercial FPGAs. Other benchmark circuits are also available, but they also have size limitations. An efficient solution to this problem is to use randomly generated benchmark circuits that behave like real circuits. These circuits usually use a common format known as the Berkeley Logic Interchange Format (BLIF) to encapsulate the netlist of the circuit. This study explores the functionality of BCGEN (an existing benchmark circuit generator) and presents an improvement for its large dataflow circuits.

2.3 BCGEN

Mark [9] has developed a benchmark circuit generator (BCGEN) that generates benchmark circuits by combining a number of randomly selected circuits from the given MCNC library. The benchmark circuit is generated using a hierarchical SoC technique. In today’s FPGAs, the implemented circuits contain Intellectual Property (IP) modules, including processor modules, that are connected using buses and on-chip networks. Similarly, BCGEN generates benchmark circuits that can have many types of IP modules connected using a bus, star or dataflow network connection pattern. The circuits are generated using different hierarchical levels and different network connection patterns based on the user requirement. The library of MCNC benchmark circuits and the user constraint file are used for building the synthetic benchmark circuits.

The MCNC benchmark circuit library is divided into different sub-categories such as processors, interfaces, and controllers. The final circuit is constructed in three stages. In the first stage, the values of primary parameters such as hierarchical depth, number of networks, number of leaf modules and bus width are selected. These primary parameters determine the shape and the size of the final circuit. In the second stage, the circuit is constructed by selecting the size and the network pattern used to connect the modules at each hierarchical level. Finally, the leaf module pins are connected as appropriate to other module pins to realize the desired benchmark circuit model.

For the bus network type, one sub-module is selected as a master and all other sub-modules are selected as slaves. All these slave modules are connected to the master module using the bus network interface. In the star network type, one of the modules is selected as a head sub-module and all other modules are selected as tail sub-modules. The input and output communications between these modules are handled by an algorithm [13].

The dataflow network connection is illustrated in Figure 2.4. In this network pattern, connections are made between adjacent sub-modules. However, if a module (e.g. module Q) has more output connections than the next module (e.g. module R) has input connections, the additional connections can skip ahead to a subsequent module (e.g. module T) that has more input connections than the preceding module (module S) has output connections. Moreover, connections can be made as a feedback loop between stages (e.g. the Q and R sub-modules). Finally, in the first/last stage, any unconnected I/Os (if feedback or stage skipping is not possible) become primary I/Os of the final generated circuit. The resulting circuits are then validated using T-VPack and VPR [11].
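
The pin accounting described above can be sketched as follows. This is a deliberately simplified model and not BCGEN's actual algorithm: it assumes surplus outputs only skip forward (feedback loops are ignored), that the first module's inputs and the last module's outputs are always primary I/Os, and the pin counts for the hypothetical modules Q, R, S and T are invented for illustration.

```python
from typing import List, Tuple

Module = Tuple[int, int]  # (number of input pins, number of output pins)


def original_dataflow_io(modules: List[Module]) -> Tuple[int, int]:
    """Return (primary inputs, primary outputs) for a chain of modules.

    Adjacent modules are wired pin to pin. Surplus outputs are carried
    forward until a later module has spare inputs (stage skipping);
    whatever cannot be matched becomes a primary I/O.
    """
    primary_inputs = modules[0][0]  # the first stage is fed from outside
    carried = 0                     # surplus outputs waiting to skip ahead
    for (_, outs), (ins, _) in zip(modules, modules[1:]):
        direct = min(outs, ins)         # adjacent connections
        carried += outs - direct        # outputs the next stage cannot absorb
        spare_inputs = ins - direct     # inputs the previous stage cannot drive
        skipped = min(carried, spare_inputs)
        carried -= skipped
        primary_inputs += spare_inputs - skipped
    primary_outputs = modules[-1][1] + carried
    return primary_inputs, primary_outputs


# Hypothetical pin counts: Q has 4 outputs more than R has inputs,
# and T has 4 inputs more than S has outputs, so the surplus skips to T.
chain = [(4, 10), (6, 8), (8, 5), (9, 4)]   # Q, R, S, T
print(original_dataflow_io(chain))
```

When no later stage has spare inputs, the carried outputs end up as additional primary outputs, which is one way the I/O count of the generated circuit can grow.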
2.4 Existing Synthetic Benchmark Circuit Generators

This section describes the properties of the existing synthetic benchmark circuit generators:

- GEN
- GNL
- Other generators

Now, each of these generators is discussed in detail.

2.4.1 GEN

The GEN benchmark generator was originally designed by Hutton [7] in 1997. The CIRC tool analyzes an existing circuit and provides information about the circuit characteristics. The tool uses MCNC benchmark circuits as input circuits and records the measured circuit properties in a specification file. Subsequently, GEN uses this specification file as an input and builds a “clone” (circuit) according to the user specification. The post-routing results of the cloned circuit are compared with the results of the original circuits using the VPR tool.
combinational and sequential logic circuits [14]. Additionally, Kundarewich et al. [15] extended GEN so that it can use partitioning information to generate hierarchical circuits.

2.4.2 GNL

This generator, developed by Stroobandt et al. [8] in 1999, generates benchmark circuits using an analytical method. This method is based on Rent’s rule and the ratio of the circuit’s multi-terminal nets. The tool uses user-defined constraints such as the value of the Rent parameter, the ratio of input and output pins, and the number of input and output pins. The generator uses a bottom-up clustering approach to generate circuits. However, the delay characteristics are not controlled by the tool.
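
For reference, Rent's rule in its standard form relates the number of external terminals T of a group of B blocks to the average number of terminals per block t and the Rent exponent p as T = t · B^p. The sketch below evaluates this relation and recovers the exponent from two measurements. It illustrates the rule only; how GNL applies it internally is not described here, and the function names and numeric values are hypothetical.

```python
import math


def rent_terminals(avg_pins_per_block: float, num_blocks: int,
                   rent_exponent: float) -> float:
    """Rent's rule: T = t * B**p."""
    return avg_pins_per_block * num_blocks ** rent_exponent


def estimate_rent_exponent(blocks_a: int, terminals_a: float,
                           blocks_b: int, terminals_b: float) -> float:
    """Slope of log(T) versus log(B) between two partitions of one circuit."""
    return math.log(terminals_b / terminals_a) / math.log(blocks_b / blocks_a)


# Hypothetical values: 4 pins per block and a Rent exponent of 0.5.
small = rent_terminals(4.0, 16, 0.5)    # terminals of a 16-block group
large = rent_terminals(4.0, 256, 0.5)   # terminals of a 256-block group
print(small, large, estimate_rent_exponent(16, small, 256, large))
```

A lower exponent means the terminal count grows more slowly with circuit size, which is why the Rent parameter is a useful knob for a circuit generator.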

2.4.3 Other Generators

The method described by Tom et al. [16] in 2005 generates larger circuits by randomly stitching BLIF circuits together. This method generates a circuit with a large number of inputs and outputs. Furthermore, Grant et al. [17] use an edge swapping method to generate benchmark circuits. This method tries to preserve properties of the original circuits in the newly generated circuit. Moreover, Harlow’s method [18] generates a set of Binary Decision Diagrams (BDDs). One more generator, Partgen, was developed by Pistorius et al. [19] in 2000. This generator generates circuits from a library that already contains several kinds of circuits, such as regular and combinational, memory, controller and irregular, and combinational.
3: CHAPTER 3

Data Streaming Model

In this chapter, the details of our contributions are discussed. The modifications made to the existing dataflow pattern of BCGEN are also explained.

3.1 Updates in the BCGEN Data Streaming Model

As described in Chapter 2, the dataflow network of the original BCGEN (see Figure 2.4) is designed such that, when input or output pins remain unconnected on a sub-module after the direct connections to adjacent sub-modules are made, the additional connections either skip to a later sub-module or are connected as a feedback loop. Any inputs and outputs that still remain unconnected are converted to primary input or output pins of the circuit. Because of this implementation, the resulting circuit has a large number of I/Os. To restrict the number of I/Os in the resulting circuit, additional multiplexer and demultiplexer modules are inserted between the existing sub-modules. Furthermore, data buffering capability is added using a Shift Register Logic First-In-First-Out (SRL FIFO) module [10]. Stitching multiplexer, demultiplexer and FIFO modules between the original modules results in a more accurate and scalable representation of the dataflow communication pattern for data streaming.
3.2 Details of data streaming modification

The proposed modification in the existing dataflow connection is illustrated in Figure 3.1.

As shown in Figure 3.1, the P and Q modules are selected from the MCNC library for the dataflow network connection, as in the original BCGEN. Our proposed modification, however, inserts a multiplexer, demultiplexer and FIFO between the P and Q modules. The P module has 30 input pins and 20 output pins, whereas the Q module has 42 input pins and 25 output pins. When modules P and Q are connected using the dataflow connection (see Figure 3.1), module Q therefore has 22 more input connections than module P has output connections. In the original BCGEN, these additional input connections are either connected to the surplus output connections of a subsequent module that has more output connections than the module following it has input connections, or they become primary inputs of the final circuit. In the original implementation, the former is more likely, so the connections are made either through a feedback loop or by skipping modules.

In our approach, the feedback loops and the skipping between stages in the dataflow implementation are avoided. To achieve this, all the module outputs are first converted (using a multiplexer or demultiplexer) to the user-defined bus width. Similarly, all the module inputs are converted (using a demultiplexer or multiplexer) from the bus width. Additionally, a FIFO designed according to the data bus width is connected between these mux/demux or demux/mux modules. With this approach, the mismatch between the numbers of input and output connections of adjacent modules is handled efficiently without feedback loops or skipping between stages. This reduces the number of input and output connections in the final circuit.
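The width adaptation described above can be sketched in C as follows. This is only an illustration under stated assumptions, not code taken from BCGEN: the names `adapt_t`, `link_plan_t`, `choose_adapter` and `plan_link` are hypothetical, and the bus width of 16 used in the usage note is an arbitrary example value. The pin counts are those of modules P and Q.

```c
/* Hypothetical helper types; these names do not come from BCGEN. */
typedef enum { ADAPT_NONE, ADAPT_REDUCE, ADAPT_EXPAND } adapt_t;

typedef struct {
    adapt_t source_side; /* stage between the source outputs and the FIFO */
    adapt_t sink_side;   /* stage between the FIFO and the sink inputs */
    int fifo_width;      /* always the user-defined bus width */
} link_plan_t;

/* Choose the stage needed to go from `from` signals to `to` signals. */
static adapt_t choose_adapter(int from, int to)
{
    if (from > to)
        return ADAPT_REDUCE;  /* multiplexer: fewer signals out than in */
    if (from < to)
        return ADAPT_EXPAND;  /* demultiplexer: more signals out than in */
    return ADAPT_NONE;        /* widths already match */
}

/* Plan the chain: source module -> adapter -> FIFO -> adapter -> sink module. */
link_plan_t plan_link(int src_outputs, int sink_inputs, int bus_width)
{
    link_plan_t plan;
    plan.source_side = choose_adapter(src_outputs, bus_width);
    plan.sink_side = choose_adapter(bus_width, sink_inputs);
    plan.fifo_width = bus_width;
    return plan;
}
```

With the assumed bus width of 16, `plan_link(20, 42, 16)` selects a reducing stage after the 20 outputs of P and an expanding stage before the 42 inputs of Q, with a 16-bit FIFO in between.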

3.2.1 Multiplexer

The multiplexer module is designed to reduce the number of connections between two modules according to the system requirement. The multiplexer unit is mainly used where a module has more output connections than the subsequent module has input connections. The multiplexer can be customized to obtain any number of output connections from the given number of input connections. For example, suppose two BLIF modules are selected from the MCNC library to be joined side by side using BCGEN's data path algorithm, where one module has 32 output pins and the subsequent module has 23 input pins. Nine connections must then be removed to connect the two modules efficiently. This is easily achieved using different combinations of multiplexers (as shown in Figure 3.2).
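One way to picture the reduction is sketched below. It assumes a single level of 2:1 multiplexers with the remaining signals passed straight through, and it ignores select lines; this is an illustrative scheme only and is not necessarily the combination shown in Figure 3.2. The function names are hypothetical.

```c
/* Number of 2:1 multiplexers needed to reduce n_in signals to n_out signals,
 * assuming each multiplexer merges two signals into one and every other
 * signal passes straight through. Returns -1 when no reduction is needed or
 * when a single level of 2:1 stages is not enough (n_in > 2 * n_out). */
int mux2_count(int n_in, int n_out)
{
    if (n_out <= 0 || n_in <= n_out || n_in > 2 * n_out)
        return -1;
    return n_in - n_out;
}

/* Number of signals that bypass the multiplexers unchanged, or -1 on error. */
int passthrough_count(int n_in, int n_out)
{
    int muxes = mux2_count(n_in, n_out);
    return (muxes < 0) ? -1 : n_in - 2 * muxes;
}
```

For the example above, `mux2_count(32, 23)` gives nine multiplexers and `passthrough_count(32, 23)` gives 14 direct connections, for 23 signals in total.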
3.2.2 Demultiplexer

The demultiplexer module is designed to increase the number of connections between two modules according to the system requirement. The demultiplexer unit is mainly used where a module has fewer output connections than the subsequent module has input connections. Any number of output connections can be generated from the given number of input connections. For example, the demultiplexer is used if two BLIF modules are selected from the MCNC library to be joined side by side using BCGEN's data path algorithm, where one module has 23 output pins and the subsequent module has 32 input pins. Nine additional connections must then be generated to connect the two modules efficiently. This is easily achieved using different combinations of demultiplexers (as shown in Figure 3.3).
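The expansion can be pictured with a small pin-mapping sketch. It assumes a single level of 1:2 demultiplexers with the remaining signals passed straight through and no select lines; the function `demux2_map` is hypothetical and is not necessarily the arrangement shown in Figure 3.3.

```c
/* Fill src[o] with the index of the input that drives output o when
 * (n_out - n_in) inputs are each split by a 1:2 demultiplexer and the
 * remaining inputs pass straight through. The caller provides room for
 * n_out entries in src. Returns 0 on success, or -1 when no expansion is
 * needed or a single level of 1:2 stages is not enough. */
int demux2_map(int n_in, int n_out, int *src)
{
    int split, o;

    if (n_in <= 0 || n_out <= n_in || n_out > 2 * n_in)
        return -1;

    split = n_out - n_in; /* inputs that receive a demultiplexer */
    for (o = 0; o < n_out; o++)
        src[o] = (o < 2 * split) ? o / 2 : o - split;
    return 0;
}
```

For 23 inputs and 32 outputs, nine inputs are split into two outputs each and the remaining 14 inputs pass through, which yields the 32 required connections.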
3.2.3 SRL FIFO

The Shift Register Logic FIFO (SRL FIFO) is mainly used for data buffering and data overflow prevention. The VHDL code for the SRL FIFO was obtained from an open cores website [10]. The original FIFO is 8 bits wide and 32 bits long. It has been modified so that the user-specified bus width is used as the FIFO width in the proposed system. Furthermore, the clock and reset signals of the FIFO are connected to the system clock and reset signals to synchronize it with the overall system.
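The behaviour of a shift-register FIFO with a configurable width can be illustrated with a small software model. This is a sketch only: the actual core is VHDL, the type and function names below are hypothetical, and the model assumes that "32 bits long" means a depth of 32 words.

```c
#include <stdint.h>

#define SRL_DEPTH 32 /* assumed depth in words */

/* Hypothetical software model of a shift-register FIFO. */
typedef struct {
    uint32_t cell[SRL_DEPTH]; /* cell[0] always holds the newest word */
    uint32_t mask;            /* keeps only the low `width` bits */
    int count;                /* number of valid words stored */
} srl_fifo_t;

/* Reset the FIFO and set its data width (1 to 32 bits). */
void srl_fifo_reset(srl_fifo_t *f, int width)
{
    f->mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    f->count = 0;
}

/* Write: shift every stored word one place and store the new word at
 * cell[0]. Returns 0 on success, or -1 if the FIFO is full. */
int srl_fifo_push(srl_fifo_t *f, uint32_t word)
{
    int i;

    if (f->count == SRL_DEPTH)
        return -1;
    for (i = f->count; i > 0; i--)
        f->cell[i] = f->cell[i - 1];
    f->cell[0] = word & f->mask;
    f->count++;
    return 0;
}

/* Read: the oldest word sits at address count - 1. Reading only moves the
 * read address; the stored words are not shifted. Returns 0 on success,
 * or -1 if the FIFO is empty. */
int srl_fifo_pop(srl_fifo_t *f, uint32_t *word)
{
    if (f->count == 0)
        return -1;
    *word = f->cell[f->count - 1];
    f->count--;
    return 0;
}
```

Calling `srl_fifo_reset(&f, bus_width)` with the user-specified bus width mirrors the modification described above, in which only the data width of the core changes.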
CHAPTER 4

Updates to the BCGEN Code Base

This chapter gives a detailed description of the BCGEN software organization and highlights our contribution to BCGEN. The original BCGEN is written in C, and all our modifications to implement the proposed data streaming functionality are also written in C. The updated source code of the modified BCGEN is given in the Appendix. We added approximately 1165 lines of code to the original BCGEN to implement the modified dataflow communication pattern; the size of our code is approximately 41 KB.

4.1 Software Organization

Figure 4.1 describes how BCGEN generates synthetic circuits based on user constraints. As shown in the figure, BCGEN has two sets of inputs for circuit generation: the library of MCNC benchmark circuits and the user constraints file. In the user constraints file, the user can specify primary parameters such as the number of networks, the number of leaf modules, the hierarchy depth and the bus width. These parameters describe the overall size and shape of the generated circuit. For any parameter that the user does not provide, BCGEN uses its default value. The library of MCNC benchmarks contains different BLIF modules, including processors, interfaces and controllers, which BCGEN uses as the component modules of the final circuit.
Figure 4.1: Software organization flow for the BCGEN
Using these two sets of inputs, BCGEN constructs the user-specified network framework by arranging randomly selected modules from the MCNC library in the form of a tree. During this framework arrangement, BCGEN assigns one network to the top hierarchy level and assigns each remaining network to a random hierarchy level. Next, the algorithm selects each leaf module for the networks according to the probability function and generates different communication patterns depending on the user's specifications. The user is able to choose any combination of the available star, bus and dataflow communication patterns in a single design. The star, bus and dataflow network patterns generated according to the original BCGEN algorithm [13] are detailed in Section 2.3.

After generating the appropriate communication network, the algorithm attaches different leaf modules to the specified network and checks whether additional communication patterns are required. If so, the algorithm goes back to generate another communication pattern, as shown in Figure 4.1. After all the required communication patterns are generated, the algorithm generates the system-level circuitry and attaches the clock and reset signals to the final generated circuit. The final generated circuit is in BLIF format, which is used as an input to T-VPACK.

4.2 Our Modifications

Our modifications to the original BCGEN logic flow are highlighted in Figure 4.1 using bold and italic font. Our modification to the dataflow network generation is described in detail in Chapter 3. The other significant modification was to enable the dataflow communication pattern to support fixed bus widths. In the original method of stitching dataflow modules together, bus widths between modules were random. With multiplexers and demultiplexers reducing or increasing the number of I/Os, the bus width specified in the user constraints file is used as the width of the FIFO block. The benchmark circuits generated with the dataflow communication pattern by our modified version and by the original version of BCGEN are compared in the next chapter.
CHAPTER 5

Testing and Discussion of Results

This chapter presents our test parameters, our results and an evaluation of our updates to the dataflow communication pattern of the original BCGEN. We generated a set of circuits using both our updated version and the original version of BCGEN and compared the results of the two circuit generation methods. The user constraints file and the MCNC library remain the same for both versions. We used T-VPACK and VPR 5.0 [11] to map each circuit into a minimum-sized FPGA.

5.1 Methodology

This section describes the test parameters used to evaluate both the modified and the original BCGEN. We used the same test parameters as those used for testing the original BCGEN, so that the two BCGEN implementations can be compared directly. A clustered architecture is used, where each cluster (i.e. logic block) contains four 4-input LUTs and four FFs. Each cluster has 10 inputs and 4 outputs. The routing segments in this architecture are uniform, each spanning four logic blocks. T-VPACK is used to pack LUTs and FFs into the logic blocks. Because VPR reads the '.net' format, T-VPACK also converts each netlist from BLIF format to '.net' format. Throughout the testing process, clustering, placement and routing are timing-driven. For placement and routing, each circuit is first converted into the netlist (.net) format using T-VPACK. Finally, the circuit is placed and routed into an FPGA architecture using the VPR tool.

5.2 Results

The results obtained for 12 circuits generated with the dataflow communication pattern using the modified BCGEN, as well as the original BCGEN, are tabulated in Table 5.1. In Column 2, rows marked 'Mod.' (modified version) give our results and rows marked 'Orig.' (original version) give the results obtained with the original BCGEN. Columns 3 and 4 of Table 5.1 show that the number of inputs and the number of outputs in the circuits generated using our version of BCGEN are never larger than, and for most circuits smaller than, those generated using the original BCGEN. Figure 5.1 shows a graphical representation of the I/Os, where I/P BCGEN Modified indicates inputs generated using our version, I/P BCGEN Original indicates inputs generated using the original BCGEN dataflow pattern, O/P BCGEN Modified indicates outputs generated using our version and O/P BCGEN Original indicates outputs generated using the original BCGEN dataflow pattern. This demonstrates that adding multiplexers and demultiplexers between the existing modules has the desired effect of reducing the number of I/Os.
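The I/O counts can be totalled with a few lines of C. The arrays below copy the Input and Output columns of Table 5.1; the function names are hypothetical and the code is only a convenience for checking the arithmetic.

```c
/* Input and output counts copied from Table 5.1 (circuits 1 to 12). */
#define NUM_CIRCUITS 12

static const int mod_in[NUM_CIRCUITS]   = {47, 14, 31, 22, 46, 29, 71, 257, 76, 230, 261, 169};
static const int mod_out[NUM_CIRCUITS]  = {11, 11, 15, 18, 56, 20, 116, 26, 138, 8, 306, 22};
static const int orig_in[NUM_CIRCUITS]  = {47, 17, 31, 44, 70, 46, 71, 257, 76, 230, 261, 169};
static const int orig_out[NUM_CIRCUITS] = {39, 11, 40, 45, 56, 55, 277, 238, 259, 418, 633, 231};

/* Total I/O count (inputs plus outputs) of one circuit, 0-based index. */
int total_io(const int *in, const int *out, int circuit)
{
    return in[circuit] + out[circuit];
}

/* Sum of the I/O counts over all twelve circuits. */
int total_io_all(const int *in, const int *out)
{
    int i, sum = 0;

    for (i = 0; i < NUM_CIRCUITS; i++)
        sum += in[i] + out[i];
    return sum;
}
```

For Circuit 1, for example, the totals are 58 I/Os for the modified version and 86 for the original. Summed over all twelve circuits, the table gives 2000 I/Os for the modified version and 3621 for the original.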

Rent’s rule, given in Equation 1, relates the number of external I/Os of a sub-circuit to the number of modules it contains.

\[ A = \kappa B^{\rho} \qquad (1) \]

In Equation 1, $A$ is the number of external pins of a given sub-circuit after partitioning, $B$ is the number of modules in that sub-circuit, $\kappa$ is the Rent constant and $\rho$ is the Rent parameter. The Rent parameter indicates the complexity of the interconnect in a circuit. It depends on the circuit structure and on the selected partitioning algorithm. In the original BCGEN, the Rent parameter is calculated using a recursive Fiduccia-Mattheyses partitioning algorithm. We used the same method to calculate the Rent parameter for the final circuits generated using the modified BCGEN. A lower Rent parameter is desirable because it reduces congestion in the circuit layout. The results given in Table 5.1 and graphical

<table>
<thead>
<tr>
<th>Circuit</th>
<th>BCGEN Version</th>
<th>Input</th>
<th>Output</th>
<th>Rent Parameter</th>
<th>No. of Clusters</th>
<th>Avg. Net Length</th>
<th>Channel Width</th>
<th>Critical Path Delay</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Mod.</td>
<td>47</td>
<td>11</td>
<td>0.466</td>
<td>984</td>
<td>8.82</td>
<td>18</td>
<td>41</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>47</td>
<td>39</td>
<td>0.656</td>
<td>93</td>
<td>6.82</td>
<td>13</td>
<td>18.7</td>
</tr>
<tr>
<td>2</td>
<td>Mod.</td>
<td>14</td>
<td>11</td>
<td>0.429</td>
<td>936</td>
<td>8.35</td>
<td>19</td>
<td>37.9</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>17</td>
<td>11</td>
<td>0.529</td>
<td>45</td>
<td>5.78</td>
<td>10</td>
<td>20</td>
</tr>
<tr>
<td>3</td>
<td>Mod.</td>
<td>31</td>
<td>15</td>
<td>0.461</td>
<td>961</td>
<td>8.44</td>
<td>19</td>
<td>44.3</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>31</td>
<td>40</td>
<td>0.581</td>
<td>74</td>
<td>5.55</td>
<td>11</td>
<td>19.6</td>
</tr>
<tr>
<td>4</td>
<td>Mod.</td>
<td>22</td>
<td>18</td>
<td>0.531</td>
<td>1949</td>
<td>8.97</td>
<td>21</td>
<td>52.7</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>44</td>
<td>45</td>
<td>0.574</td>
<td>159</td>
<td>7.81</td>
<td>16</td>
<td>36.5</td>
</tr>
<tr>
<td>5</td>
<td>Mod.</td>
<td>46</td>
<td>56</td>
<td>0.519</td>
<td>503</td>
<td>7.68</td>
<td>17</td>
<td>42</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>70</td>
<td>56</td>
<td>0.545</td>
<td>239</td>
<td>6.98</td>
<td>17</td>
<td>25.5</td>
</tr>
<tr>
<td>6</td>
<td>Mod.</td>
<td>29</td>
<td>20</td>
<td>0.521</td>
<td>1741</td>
<td>8.93</td>
<td>19</td>
<td>41.9</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>46</td>
<td>55</td>
<td>0.6</td>
<td>328</td>
<td>8.29</td>
<td>17</td>
<td>37</td>
</tr>
<tr>
<td>7</td>
<td>Mod.</td>
<td>71</td>
<td>116</td>
<td>0.701</td>
<td>9682</td>
<td>21.32</td>
<td>48</td>
<td>85.1</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>71</td>
<td>277</td>
<td>0.707</td>
<td>8742</td>
<td>21.96</td>
<td>45</td>
<td>86.8</td>
</tr>
<tr>
<td>8</td>
<td>Mod.</td>
<td>257</td>
<td>26</td>
<td>0.582</td>
<td>2708</td>
<td>14.48</td>
<td>32</td>
<td>62.2</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>257</td>
<td>238</td>
<td>0.651</td>
<td>1737</td>
<td>19.27</td>
<td>27</td>
<td>70</td>
</tr>
<tr>
<td>9</td>
<td>Mod.</td>
<td>76</td>
<td>138</td>
<td>0.698</td>
<td>8654</td>
<td>22.43</td>
<td>48</td>
<td>85.6</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>76</td>
<td>259</td>
<td>0.708</td>
<td>7565</td>
<td>25.57</td>
<td>50</td>
<td>102</td>
</tr>
<tr>
<td>10</td>
<td>Mod.</td>
<td>230</td>
<td>8</td>
<td>0.693</td>
<td>9722</td>
<td>16.39</td>
<td>44</td>
<td>225</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>230</td>
<td>418</td>
<td>0.705</td>
<td>7820</td>
<td>18.08</td>
<td>44</td>
<td>227</td>
</tr>
<tr>
<td>11</td>
<td>Mod.</td>
<td>261</td>
<td>306</td>
<td>0.725</td>
<td>13142</td>
<td>19.48</td>
<td>45</td>
<td>226</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>261</td>
<td>633</td>
<td>0.729</td>
<td>12791</td>
<td>20.47</td>
<td>45</td>
<td>226</td>
</tr>
<tr>
<td>12</td>
<td>Mod.</td>
<td>169</td>
<td>22</td>
<td>0.700</td>
<td>9079</td>
<td>19.50</td>
<td>48</td>
<td>82.8</td>
</tr>
<tr>
<td></td>
<td>Orig.</td>
<td>169</td>
<td>231</td>
<td>0.708</td>
<td>7589</td>
<td>21.27</td>
<td>47</td>
<td>82.2</td>
</tr>
</tbody>
</table>
representation of results drawn in Figure 5.2 clearly show that the Rent parameter of our circuits is lower than that of the circuits generated using the original BCGEN, because the modified BCGEN reduces the number of I/Os. Lower Rent parameters reduce congestion in the layout.
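To give a feel for what a difference in the Rent parameter means, Rent's rule can be rewritten in logarithmic form and applied to the Rent parameters of Circuit 1 in Table 5.1. The following is only an illustration: it assumes the same Rent constant $\kappa$ for both versions and a hypothetical sub-circuit of $B = 100$ modules, neither of which is taken from our measurements.

\[ \log A = \log \kappa + \rho \log B \quad\Longrightarrow\quad \frac{A_{\mathrm{orig}}}{A_{\mathrm{mod}}} = B^{\rho_{\mathrm{orig}} - \rho_{\mathrm{mod}}} \]

\[ B = 100,\quad \rho_{\mathrm{orig}} = 0.656,\quad \rho_{\mathrm{mod}} = 0.466 \quad\Longrightarrow\quad \frac{A_{\mathrm{orig}}}{A_{\mathrm{mod}}} = 100^{0.190} \approx 2.4 \]

Under these assumptions, a sub-circuit of the original version would need roughly 2.4 times as many external pins as a sub-circuit of the same size in the modified version.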

Average post-routing net length affects the number of routing resources a net uses to connect logic blocks. Net length is defined as the number of consecutive clusters covered by a net. In Table 5.1, the circuits generated using the modified BCGEN have higher average net lengths for Circuits 1 to 6, because these circuits have greater internal connectivity due to the multiplexers, demultiplexers and FIFO blocks added between the original sub-modules; for the larger Circuits 7 to 12, the average net lengths of the modified circuits are lower than those of the original circuits. The number of clusters is shown graphically in Figure 5.3, where BCGEN original represents the number of clusters for the original version and BCGEN modified represents the number of clusters for our modified version; the modified circuits use more clusters in every case.

To route a circuit, VPR uses only the required routing resources, resulting in an FPGA architecture with a minimum channel width. As mentioned above, a higher average net length requires more routing resources, which tends to increase the minimum channel width. Accordingly, our circuits map to equal or larger channel widths in all cases except Circuit 9. The maximum speed of a circuit is generally determined by its critical path delay, which is predominantly due to routing. As such, the critical path delay of the modified circuits is higher than that of the original circuits where they have longer internal nets (Circuits 1 to 6; see Avg. Net Length, Table 5.1), and it is comparable or lower for Circuits 7 to 12.
Figure 5.1 Results of I/O

Figure 5.2 Results of Rent Parameters

Figure 5.3 Results of No. Of Clusters
CHAPTER 6

Conclusions and Future Work

This chapter concludes our work and provides some suggestions for possible future work.

6.1 Conclusions

Our work improves the dataflow communication pattern of the original BCGEN to better model real circuits. This is achieved by inserting multiplexer and demultiplexer modules between the original modules (stages). With the proposed approach, the number of I/Os in the generated circuit is reduced. Additionally, data buffering capability is added by inserting FIFO modules. Limiting the number of I/Os in the final generated circuit also reduces the Rent parameter of the circuits produced by the modified BCGEN. This should allow circuit sizes to scale to better represent actual SoCs. Furthermore, buffering between stages is a key component of actual dataflow circuits, so its inclusion further improves the quality of the synthetic data streaming benchmarks.

6.2 Future work

For future work, the star communication pattern should be updated to include real Network-on-Chip (NoC) architectures (mesh, torus, etc.) as well as the ability to represent application-specific topologies in BCGEN. BCGEN should also include the ability to generate circuits comprising multiple clock domains. Finally, in the original BCGEN, bus networks support only single-master architectures, whereas, based on design trends, multi-master bus architectures would be useful.
REFERENCES


APPENDIX

Our Modified Source code for BCGEN dataflow pattern

As discussed in Chapter 4, this appendix gives the source code we added or updated to implement our changes. Details are given for each of the files listed below:

- **GENERATE_CIRCUIT.C**
  This is an existing file from the original BCGEN, to which we have added code to update the dataflow pattern. When invoked, it adds multiplexers, demultiplexers and FIFOs between existing modules to reduce the number of I/Os and to include data buffering capabilities.

- **MODIFIED_DATAFLOW.C**
  This is our additional code for the modified data flow pattern to reduce the number of I/Os and to add the data buffering capability by adding multiplexers, demultiplexers and FIFOs between existing modules.

- **DATA_MUX.C**
  This code is used to generate multiplexers and demultiplexers to produce required number of outputs from the given number of inputs.

**GENERATE_CIRCUIT.C**

/* This is an existing case in the original BCGEN, to which we have added a call to our modified dataflow method. */

```
case e_DATAFLOW:
  //original code for DATAFLOW... here...
  //retval = generate_dataflow_subcircuit(fpout, inst, structure, handle);
  //clock test code..ADDED MODIFIED VERSION OF DATAFLOW
  retval = generate_dataflow_subcircuit_clocktest(fpout, inst, structure, handle);
  break;
```

**MODIFIED_DATAFLOW.H**

```
//header file for method description
circuit_t *generate_dataflow_subcircuit_clocktest(FILE *fpout, instance_t *inst, structure_t *structure, char *handle);
```

**MODIFIED_DATAFLOW.C**

```
#include "../include.h"
//janki's code..
circuit_node_t *generate_dataflow_multiplexer(FILE *fpout,int flag,char *handle,int no_input,int no_output);
circuit_t *generate_dataflow_fifo(FILE *fpout,int flag,char *handle,int no_input,int no_output);
void create_fifo(int in_nol,int out_nol,char *name);

//CODE FOR MODIFIED DATAFLOW COMMUNICATION PATTERN

```
```
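// Function header as declared in MODIFIED_DATAFLOW.H
circuit_t *generate_dataflow_subcircuit_clocktest(FILE *fpout, instance_t *inst, structure_t *structure, char *handle)
{
int flag; // variable from original code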
int buswidth;
int temp;
int clock;
int i, j, k, m, n; // counters
int node_index; // static
int total_nodes; // constant
char *name; // string holder
char *copy_handle;
int counter; // added for modified version for incrementing node value...
int rem_opins, common, min; // local variables
int *inter_level_empty_input_pins, *inter_level_empty_output_pins;
BOOL finished_inputs, clock_pin, irupt_ctrl;
int ffs[BUFFER_LENGTH], num_ffs;
structure_node_t *s_node; // pointers
circuit_t *circuit,*fifo_circuit;
circuit_node_t *node, *irupt_ctrl_node;
circuit_edge_t *edge;
int reset_source_index, reset_latch_node;
int irupt_sink_index, irupt_latch_node, irupt_ctrl_node_index;
circuit_node_t *node_1,*node_2,*node_3;
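structure_node_t *fifo_node; // added for modified version (for fifo); type matches its allocation and fields below
circuit_node_t *fifo_circuit_node; // assumed declaration: used below but not declared in this listing
int fifo_flag; // assumed declaration: used below but not declared in this listing
// added for FIFO (modified version)...
int fifo_counter, *f_node;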
f_node = (int*) my_malloc(sizeof(int));
fifo_counter=0;
// index is allocated after order is generated
fifo_node = (structure_node_t *) my_malloc (1 * sizeof (structure_node_t));
fifo_node->circuit = NULL;
fifo_node->numNodesI = 0;
fifo_node->numNodesO = 0;
fifo_node->nodesI = NULL;
fifo_node->nodesO = NULL;
fifo_node->rem_ipins = 0;
fifo_node->rem_opins = 0;
fifo_node->clock = FALSE;
// end of node initialization for fifo...
fifo_flag = 0;
num_ffs = 0;
total_nodes = structure->num_blocks + structure->num_structures;
node_index = total_nodes;
s_node = structure->nodes;
name = (char *) my_malloc(STRING*sizeof(char));
copy_handle = (char *) my_malloc(STRING*sizeof(char)); // mem allocation for copy_handle
inter_level_empty_output_pins = (int *) my_malloc (structure->length * sizeof(int));
inter_level_empty_input_pins = (int *) my_malloc (structure->length * sizeof(int));
buswidth = inst->constraints.bus_width;
counter = 0;
for (i = 0; i < structure->length; i ++)
{
    inter_level_empty_output_pins[i] = 0;
    inter_level_empty_input_pins[i] = 0;
}
buswidth = inst->constraints.bus_width;
printf("\n bus width = %d", buswidth);
sprintf(name, "s_%d", structure->index);
circuit = circuit_alloc_and_init(name, total_nodes, 0, 0, 0);
reset_source_index = -1;
reset_latch_node = -1;
irupt_sink_index = -1;
irupt_latch_node = -1;
for (i = 0; i < total_nodes; i ++)
{
if (s_node[i].substructure == inst->reset_source)
{
    reset_source_index = i;
}
if (s_node[i].substructure == inst->interrupt_sink)
{
    irupt_sink_index = i;
}
if (structure->nodes[i].type == e_BLOCK)
    sprintf(name, "b_%d", ((block_t *)structure->nodes[i].substructure)->index);
else
    sprintf(name, "s_%d", ((structure_t *)structure->nodes[i].substructure)->index);
node = circuit_node_alloc_and_init_from_subcircuit(structure->nodes[i].circuit, name, 0, 0, &i, &i);
    circuit_set_node(circuit, node, i);
}

// generate outputs from the previously indicated sources to the indicated sinks
for (i = 0; i < total_nodes; i++)
{
    if (s_node[i].numNodesO == 0)
    {
        // if it is at the end of the sequence allocate the output nodes
        if (s_node[i].sequence == structure->length - 1)
        {
            for (k = 0; k < s_node[i].rem_opins; k++)
            {
                node_index = generate_node_add_external_o(circuit, node_index, i);
            }
            s_node[i].rem_opins = 0;
        }
        // else
        // this is a node with no outputs, leave until later to maybe fill in gaps
    }
    else if (s_node[i].numNodesO == 1)
    {
        counter++;
        sprintf(copy_handle, "%s_%d", handle, counter);
        printf("\n name of copy_handle=%s", copy_handle);
        if (s_node[i].rem_opins > buswidth) //demultiplex
        {
            flag = 0;
        }
        else //multiplex..
        {
            flag = 1;
        }
        // first mux/demux: module outputs -> bus width
        node_1 = generate_dataflow_multiplexer(fpout, flag, copy_handle, s_node[i].rem_opins, buswidth);
        circuit_add_node(circuit, node_1);
        for (j = 0; j < s_node[i].rem_opins; j++)
        {
            edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, i);
            circuit_edge_set_sink_node(edge, 0, node_index);
            circuit_node_add_out(circuit_get_node(circuit, i), edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
        }
        node_index++;

        //add fifo here... between two node... code for test...
        counter++;
        sprintf(copy_handle, "%s_%d", handle, counter);
        flag = 2;
        f_node[fifo_counter] = node_index;
        fifo_counter = fifo_counter + 1;
        f_node = (int*) my_realloc(f_node, sizeof(int) * (fifo_counter + 1));
        // here node_index will be 3 ...

        // store node_index value to some variable so that it can be used to add reset and clock later on.
        fifo_circuit = generate_dataflow_fifo(fpout, flag, copy_handle, buswidth, buswidth);
        // check for clock....
        // if circuit does not have clock this function returns -1 otherwise it returns clock pin number

        clock = circuit_has_clock_pin(fifo_circuit);
        if (clock != -1)
        {
            fifo_node->clock = TRUE;
            circuit_swap_input_order(fifo_circuit, clock, fifo_circuit->num_inputs - 1);
        }

        // end of check for clock
        // convert into a node
        fifo_circuit_node = circuit_node_alloc_and_init_from_subcircuit(fifo_circuit, "FIFO", 0, 0, NULL, NULL);
        circuit_add_node(circuit, fifo_circuit_node);
        fifo_flag = 1;
        for (j = 0; j < buswidth; j++)
        {
            edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
            // original
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, node_index - 1);
            circuit_edge_add_sink(edge, node_index);
            circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
        }
        for (j = 0; j < 2; j++) // which will add two additional input to fifo...
        {
            circuit_edge_add_sink(edge, node_index);
            circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
        }
        node_index++;
        // end of fifo..
        // here (node_index - 1) = fifo node number ...
        for (j = 0; j < 3; j++) // connect three output from fifo to cindy's block..
        {
            edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, node_index - 1);
            circuit_edge_add_sink(edge, s_node[i].nodesO[0]->index);
            circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, s_node[i].nodesO[0]->index), edge->net);
        }
        s_node[i].nodesO[0]->rem_ipins = s_node[i].nodesO[0]->rem_ipins - 3;
        // because of connection from fifo to cindy's node..
        if (buswidth > s_node[i].nodesO[0]->rem_ipins)
        {
            flag = 0;
        }
        else
        {
            flag = 1;
        }
        counter++;
        sprintf(copy_handle, "%s_%d", handle, counter);
        // second mux/demux: bus width -> inputs of the next module
        node_3 = generate_dataflow_multiplexer(fpout, flag, copy_handle, buswidth, s_node[i].nodesO[0]->rem_ipins);
        circuit_add_node(circuit, node_3);
        for (j = 0; j < buswidth; j++)
        {
            edge = circuit_edge_alloc_and_init(1, 0, 1, NULL); //original
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, node_index - 1);
            circuit_edge_add_sink(edge, node_index);
            circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
        }
        for (j = 0; j < s_node[i].nodesO[0]->rem_ipins; j++)
        {
            edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, node_index - 1);
            circuit_edge_add_sink(edge, s_node[i].nodesO[0]->index);
            circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, s_node[i].nodesO[0]->index), edge->net);
        }
        // all pins of this link are now connected
        s_node[i].nodesO[0]->rem_ipins = 0;
        s_node[i].rem_opins = 0;
        node_index++;
    } //end of else if
else // outputs to many blocks
{
    for (j = 0; j < s_node[i].numNodesO; j++)
    {
        if (s_node[i].rem_opins != 0)
        {
            counter++;
            sprintf(copy_handle, "%s%d", handle, counter);
            printf("\n name of copy_handle =%s", copy_handle);
            if (s_node[i].rem_opins > buswidth) //demultiplex
            {
                flag = 0;
            }
            else //multiplex...
            {
                flag = 1;
            }
            node_1 = generate_dataflow_multiplexer(fpout, flag, copy_handle, s_node[i].rem_opins, buswidth);
            circuit_add_node(circuit, node_1);
            for (k = 0; k < s_node[i].rem_opins; k++)
            {
                edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
                circuit_add_edge(circuit, edge);
                circuit_edge_set_source_node(edge, 0, i);
                circuit_edge_set_sink_node(edge, 0, node_index);
                circuit_node_add_out(circuit_get_node(circuit, s_node[i].index), edge->net);
                circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
            }
            node_index++;
        }
    }
    //add fifo here... between two node... code for test...
    counter++;
    sprintf(copy_handle, "%s%d", handle, counter);
    flag = 2;
    f_node[fifo_counter] = node_index;
    fifo_counter = fifo_counter + 1;
    f_node = (int*) my_realloc(f_node, sizeof(int) * (fifo_counter + 1));
    fifo_circuit = generate_dataflow_fifo(fpout, flag, copy_handle, buswidth, buswidth);
    //check for clock....
}
// if circuit does not have clock this function returns -1 otherwise it returns clock pin
// number

clock = circuit_has_clock_pin(fifo_circuit);
if (clock != -1)
{
    fifo_node->clock = TRUE;
    circuit_swap_input_order(fifo_circuit, clock, fifo_circuit->num_inputs - 1);
}

// end of check for clock
// convert into a node

fifo_circuit_node = circuit_node_alloc_and_init_from_subcircuit(fifo_circuit, "FIFO", 0, 0, NULL, NULL);

for (j = 0; j < buswidth; j++)
{
    edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, node_index - 1);
circuit_edge_add_sink(edge, node_index);
circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
}

// which will add two additional input to fifo...

for(j=0;j<2;j++)
{
    circuit_edge_add_sink(edge, node_index);
circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
}

for (j = 0; j < buswidth; j++)
{ // connect three output from fifo to cindy's block...
    edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, node_index - 1);
circuit_edge_add_sink(edge, s_node[i].nodesO[0]->index);
circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
circuit_node_add_fanin(circuit_get_node(circuit, s_node[i].nodesO[0]->index), edge->net);
}

s_node[i].nodesO[0]->rem_ipins = s_node[i].nodesO[0]->rem_ipins - 3; // because of connection
// from fifo to cindy's node...

// for second node...

if (buswidth > s_node[i].nodesO[j]->rem_ipins)
    flag = 0;
else
    flag = 1;

node_2 = generate_dataflow_multiplexer(fpout, flag, copy_handle, buswidth, s_node[i].nodesO[j]->rem_ipins);

for (k = 0; k < buswidth; k++)
{ // circuit does not have clock this function returns -1 otherwise it returns clock pin
    edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, i);
circuit_edge_set_sink_node(edge, 0, node_index);
circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
for (k = 0; k < s_node[i].nodesO[j]->rem_ipins; k++)
{
  edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
  circuit_add_edge(circuit, edge);
  circuit_edge_set_source_node(edge, 0, i);
  circuit_edge_set_sink_node(edge, 0, s_node[i].nodesO[j]->index);
  circuit_node_add_out(circuit_get_node(circuit, node_index), edge->net);
  circuit_node_add_fanin(circuit_get_node(circuit, s_node[i].nodesO[j]->index), edge->net);
}

s_node[i].nodesO[j]->rem_ipins = 0;
s_node[i].rem_opins = 0;
node_index++; // end of if.
else
{
  s_node[i].numNodesO = j;
  s_node[i].nodesO[j]->numNodesI --;
} // end of else
} // end of for..
} // end of else
} // end of for...

// intra_level assignments
for (i = 0; i < structure->length; i++)
{
  for (j = structure->order[i]; j < structure->order[i + 1]; j++)
  {
    if (s_node[j].rem_opins != 0)
    {
      // the last nodes in the sequence should never reach here b/c opins
      // should always be asserted
      for (m = structure->order[i + 1]; m < structure->order[i + 2]; m++)
      // search for input pins
      {
        if (s_node[m].rem_ipins != 0)
        {
          counter++;
          sprintf(copy_handle, "%s_%d", handle, counter);
          printf("\n name of copy_handle =%s", copy_handle);
          if (s_node[j].rem_opins > buswidth)
            // demultiplex
            {
              flag = 0;
            }
            else // multiplex..
            {
              flag = 1;
            }
          node_1 = generate_dataflow_multiplexer(fpout, flag, copy_handle, s_node[j].rem_opins, buswidth);
          circuit_add_node(circuit, node_1);
          for (n = 0; n < s_node[j].rem_opins; n++)
          {
            edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
            circuit_add_edge(circuit, edge);
            circuit_edge_set_source_node(edge, 0, j);
            circuit_edge_set_sink_node(edge, 0, node_index);
            circuit_node_add_out(circuit_get_node(circuit, s_node[j].index),
                                  edge->net);
            circuit_node_add_fanin(circuit_get_node(circuit, node_index),
                                   edge->net);
          }
          node_index++;
          counter++;
          sprintf(copy_handle, "%s_%d", handle, counter);
          flag = 2;
        }
      }
    }
  }
} // end of for...
f_node[fifo_counter] = node_index;
fifo_counter = fifo_counter + 1;
f_node = (int*) my_realloc(f_node, sizeof(int)*(fifo_counter+1));

fifo_circuit = generate_dataflow_fifo(fpout, flag, copy_handle, buswidth, buswidth);
clock = circuit_has_clock_pin(fifo_circuit);
if (clock != -1)
{
    fifo_node->clock = TRUE;
    circuit_swap_input_order(fifo_circuit, clock, fifo_circuit->num_inputs - 1);
}

fifo_circuit_node = circuit_node_alloc_and_init_from_subcircuit(fifo_circuit, "FIFO", 0, 0, NULL, NULL);
circuit_add_node(circuit, fifo_circuit_node);
fifo_flag = 1;
for (j = 0; j < buswidth; j++)
{
    edge = circuit_edge_alloc_and_init(1, 0, 1, NULL); //original
    circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, node_index - 1);
    circuit_edge_add_sink(edge, node_index);
    circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
    circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
}
for (j = 0; j < 2; j++)
{
    circuit_edge_add_sink(edge, node_index);
    circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
}
node_index++;
for (j = 0; j < 3; j++)
{
    edge = circuit_edge_alloc_and_init(1, 0, 1, NULL);
    circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, node_index - 1);
    circuit_edge_add_sink(edge, s_node[i].nodesO[0]->index);
    circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
    circuit_node_add_fanin(circuit_get_node(circuit, s_node[i].nodesO[0]->index), edge->net);
}
s_node[i].nodesO[0]->rem_ipins = s_node[i].nodesO[0]->rem_ipins - 3;
//for second node...
if (buswidth > s_node[m].rem_ipins)
{
    flag = 0;
}
else
{
    flag = 1;
}
node_2 = generate_dataflow_multiplexer(fpout, flag, copy_handle, buswidth, s_node[m].rem_ipins);
circuit_add_node(circuit, node_2);
for (n = 0; n < buswidth; n++)
{
    edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
    circuit_add_edge(circuit, edge);
    circuit_edge_set_source_node(edge, 0, i);
    circuit_edge_set_sink_node(edge, 0, node_index);
    circuit_node_add_out(circuit_get_node(circuit, node_index - 1), edge->net);
    circuit_node_add_fanin(circuit_get_node(circuit, node_index), edge->net);
}
for (n = 0; n < s_node[m].rem_ipins; n++)
{  
  edge = circuit_edge_alloc_and_init(1, 1, 1, NULL);
  circuit_add_edge(circuit, edge);
  circuit_edge_set_source_node(edge, 0, i);
  circuit_edge_set_sink_node(edge, 0, s_node[m].index);
  circuit_node_add_out(circuit_get_node(circuit, node_index), edge->net);
  circuit_node_add_fanin(circuit_get_node(circuit, s_node[m].index),
                       edge->net);
  }
  s_node[m].rem_ipins = 0;
  s_node[j].rem_opins = 0;
  node_index++;
  } //end of if..
  } // for input pins in the next level
  break; // ie no more empty ipins in the next level to be allocated
  }
  }
  }
  }
  }
  }
  }
  node_index = generate_structure_add_external_io(structure, circuit, node_index);
  // INTERRUPT
  irupt_ctrl = TRUE;
  if (structure->level != inst->constraints.hier_depth){
    // HACK - THIS ASSUMES THERE IS ALWAYS AT LEAST 2 NODES ON EACH NETWORK
    if (irupt_sink_index == -1){
      if (total_nodes == 1){
        irupt_ctrl = FALSE;
      }
      }
      irupt_ctrl_node = generate_interrupt_controller_node(fpout, circuit, node_index,
                                 total_nodes, handle);
      circuit_add_node(circuit, irupt_ctrl_node);
      irupt_ctrl_node_index = node_index;
      node_index = generate_node_add_external_o(circuit, node_index, irupt_ctrl_node_index);
      }
      else{
        i = total_nodes - 1 + (structure->level != 0); // -1 for sink, +1
        //for the upper level
        if (i > 1){
          irupt_ctrl_node = generate_interrupt_controller_node(fpout, circuit, node_index, i, handle);
          circuit_add_node(circuit, irupt_ctrl_node);
          irupt_ctrl_node_index = node_index;
          node_index ++;
          if (structure->level != inst->level_interrupt_sink){
            edge = circuit_edge_alloc_and_init(1, 1, 1,
                          "irupt");
            circuit_add_edge(circuit, edge);
            generate_new_edge_connection(circuit, edge,
                          irupt_ctrl_node_index, irupt_sink_index);
          }
          else{
            irupt_latch_node = node_index;
            node_index =
            generate_node_add_flipflop_to_output(circuit, node_index, irupt_ctrl_node_index);
            edge = circuit_edge_alloc_and_init(1, 1, 1,
                          "irupt");
            circuit_add_edge(circuit, edge);
            generate_new_edge_connection(circuit, edge,
                          irupt_latch_node, irupt_sink_index);
          }
          if (structure->level != 0){
            node_index = generate_node_add_external_i(circuit, node_index, irupt_ctrl_node_index);
          }
        }
      }  
  }  
}
else{ // only happens if there are 2 nodes and on the 0th level
    irupt_ctrl = FALSE;
    if(structure->level == inst->level_interrupt_sink){
        irupt_latch_node = node_index;
        node_index = generate_node_add_flipflop_to_input(circuit, node_index, irupt_sink_index);
    }
}
}
for (i = 0; i < total_nodes; i ++){
    if (i != irupt_sink_index){
        sprintf(name, "b_%d_irupt", i);
        edge = circuit_edge_alloc_and_init(1, 1, 1, name);
        circuit_add_edge(circuit, edge);
        if (!irupt_ctrl){
            if (irupt_latch_node == -1){
                if (irupt_sink_index == -1){
                    circuit_remove_edge(circuit, circuit->num_edges - 1);  
                    node_index = generate_node_add_external_o(circuit, node_index, i);
                }
            } else{
                generate_new_edge_connection(circuit, edge, i, irupt_latch_node);
            }  
        } else{
            generate_new_edge_connection(circuit, edge, i, irupt_ctrl_node_index);
        }
    } else{
        if (irupt_sink_index != -1){
            irupt_latch_node = node_index;
            node_index = generate_node_add_flipflop_to_input(circuit, node_index, irupt_sink_index);
            node_index = generate_node_add_external_i(circuit, node_index, node_index - 1);
        }
    }  
}

// RESET
if (reset_source_index == -1)//this cond.. does not satisfy second time..
{
    node_index = generate_node_add_external_i(circuit, node_index, 0);
    edge = circuit_get_edge(circuit, circuit->num_edges - 1);
    for (i = 1; i < total_nodes; i ++){
        circuit_edge_add_sink(edge, i);
        circuit_node_add_fanin(circuit_get_node(circuit, i), edge->net);
    }
}  
else{
    if (structure->level == 0){ // if the source is at the top level
        reset_latch_node = node_index;
        node_index = generate_node_add_flipflop_to_output(circuit, node_index, reset_source_index);
        edge = circuit_edge_alloc_and_init(1, 0, 1, "reset");
        circuit_add_edge(circuit, edge);
        circuit_edge_set_source_node(edge, 0, reset_latch_node);
        circuit_node_add_out(circuit_get_node(circuit, reset_latch_node), edge->net);
    }
    else{ // if not at the top level
        if (structure->level == inst->level_reset_source){
            // add a latch at the source to prevent combinational cycles
            // this executes first time...
            reset_latch_node = node_index;
            node_index = generate_node_add_flipflop_to_output(circuit, node_index, reset_source_index);
            node_index = generate_node_add_external_o(circuit, node_index, node_index - 1);
            edge = circuit_get_edge(circuit, circuit_get_edge(circuit, circuit->nodes[reset_latch_node]->outs[0])->net);
        } else{ // add the output to the upper level
            node_index = generate_node_add_external_o(circuit, node_index, reset_source_index);
            edge = circuit_get_edge(circuit, circuit->num_edges - 1);
        }
    }

for (i = 0; i < total_nodes; i ++){ // add sinks to the rest of the nodes on this level
    if (i != reset_source_index){
        circuit_edge_add_sink(edge, i);
        circuit_node_add_fanin(circuit_get_node(circuit, i), edge->net);
    }
}

if (fifo_flag == 1)
{
    // connect the reset edge to every FIFO node recorded in f_node
    for (j = 0; j < fifo_counter; j++)
    {
        circuit_edge_add_sink(edge, f_node[j]);
        circuit_node_add_fanin(circuit_get_node(circuit, f_node[j]), edge->net);
    }
}

}

clock_pin = FALSE;

for (i = 0; i < total_nodes; i ++){
    if (s_node[i].clock == TRUE){
        if (!clock_pin){
            // first clocked node: create the external clock input
            node_index = generate_node_add_external_i(circuit, node_index, i);
            edge = circuit->edges[circuit->num_edges - 1];
            clock_pin = TRUE;
        }
        else{
            // later clocked nodes become sinks of the same clock edge
            circuit_edge_add_sink(edge, i);
            circuit_node_add_fanin(circuit_get_node(circuit, i), edge->net);
        }
    }
}

if (fifo_node->clock == TRUE)
{
    // connect the clock edge to every FIFO node recorded in f_node
    for (j = 0; j < fifo_counter; j++)
    {
        circuit_edge_add_sink(edge, f_node[j]);
        circuit_node_add_fanin(circuit_get_node(circuit, f_node[j]), edge->net);
    }
}

if (irupt_latch_node != -1){
    if (!clock_pin){
        node_index = generate_node_add_external_i(circuit, node_index, irupt_latch_node);
        clock_pin = TRUE;
    }
    else{
        edge = circuit_get_edge(circuit, circuit->num_edges - 1);
        circuit_edge_add_sink(edge, irupt_latch_node);
        circuit_node_add_fanin(circuit_get_node(circuit, irupt_latch_node), edge->net);
    }
}

if (reset_latch_node != -1){
    if (!clock_pin){
        node_index = generate_node_add_external_i(circuit, node_index, reset_latch_node);
        clock_pin = TRUE;
    }
    else{
        edge = circuit_get_edge(circuit, circuit->num_edges - 1);
        circuit_edge_add_sink(edge, reset_latch_node);
        circuit_node_add_fanin(circuit_get_node(circuit, reset_latch_node), edge->net);
    }
}

for (i = 0; i < num_ffs; i++){
    if (!clock_pin){
        node_index = generate_node_add_external_i(circuit, node_index, ffs[i]);
        clock_pin = TRUE;
    }
    else{
        edge = circuit_get_edge(circuit, circuit->num_edges - 1);
        circuit_edge_add_sink(edge, ffs[i]);
        circuit_node_add_fanin(circuit_get_node(circuit, ffs[i]), edge->net);
    }
}

circuit_sanity(circuit, fpout, TRUE);
free(f_node);
free(name);
free(inter_level_empty_output_pins);
free(inter_level_empty_input_pins);
return circuit;

//end of code to change...
circuit_node_t *generate_dataflow_multiplexer(FILE *fpout, int flag, char *handle, int no_input, int no_output)
{
FILE *fp;
char *cmd_string, *filename;
circuit_t *mux_circ;
circuit_node_t *mux_node,*demux_node;
char *name;
int clock;
int count;
name =(char *) my_malloc(2*STRING*sizeof(char));
cmd_string = (char *) my_malloc(2*STRING*sizeof(char));
filename = (char *) my_malloc(2*STRING*sizeof(char));
count =0;
printf("\n number of input =%d",no_input);
printf("\n number of output =%d",no_output);
// flag 0 then multiplex signals...mean input number is larger than output....
if(flag==0)
{
    sprintf(name,"mux_%s",handle);
    sprintf(filename, "multiplexer_%s", handle);
    create_mux(no_input,no_output,filename);
    #ifdef WIN32
    SetCurrentDirectory("C:\\Docume~1\\jjp11\\desktop\\research\\bcgen\\bcgen\\BUS_GENERATOR");
    sprintf(cmd_string, "perl module_gen_janki.pl --filename=%s --entityname=mux_%s --design=%s > %s.output 2>&1", filename, filename, filename, filename);
    system (cmd_string);
    #endif
    sprintf(filename, "BUS_GENERATOR/multiplexer_%s.blif", handle);
    fp = fopen(filename, "r");
    if (fp == (FILE *) NULL) fprintf(fpout,"Error generating multiplexer block\n");
    else
    {
        mux_circ = read_blif(fp, fpout, filename, FALSE, TRUE);
        fclose(fp);
        if (mux_circ == (circuit_t *)NULL)
        {
            fprintf(fpout,"Error reading mux block\n");
        }
    }
    // convert into a node
    mux_node = circuit_node_alloc_and_init_from_subcircuit(mux_circ, name, 0, 0, NULL, NULL);
} //end of if for flag 0
// flag 1 then demultiplex signals...mean input number is smaller than output....
if (flag == 1)
{
    sprintf(name,"demux_%s",handle);
    sprintf(filename, "demultiplexer_%s", handle);
    create_mux(no_input,no_output,filename);
    #ifdef WIN32
    SetCurrentDirectory("C:\\Docume~1\\jjp11\\desktop\\research\\bcgen\\bcgen\\BUS_GENERATOR");
    sprintf(cmd_string, "perl module_gen_janki.pl --filename=%s --entityname=mux_%s --design=%s > %s.output 2>&1", filename, filename, filename, filename);
    system (cmd_string);
    #endif
    sprintf(filename, "BUS_GENERATOR/demultiplexer_%s.blif", handle);
    fp = fopen(filename, "r");
    if (fp == (FILE *) NULL) fprintf(fpout,"Error generating demultiplexer block\n");
    else
    {
        mux_circ = read_blif(fp, fpout, filename, FALSE, TRUE);
        fclose(fp);
        if (mux_circ == (circuit_t *)NULL)
        {
            fprintf(fpout,"Error reading demux block\n");
        }
    }
    // convert into a node
    mux_node = circuit_node_alloc_and_init_from_subcircuit(mux_circ, name, 0, 0, NULL, NULL);
} //end of if for flag 1
free(cmd_string);
free(filename);
return mux_node;
} //end of generate_dataflow_multiplexer function...
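
The caller chooses the adapter type by comparing the number of source pins with the bus width, and create_mux repeats the same comparison on its own arguments. A minimal sketch of that decision is given below. The enum and the function name select_adapter are hypothetical and do not appear in the generator; the values 0 and 1 simply mirror the flag values assigned in the listing.

```c
#include <assert.h>

/* Hypothetical names, not part of bcgen. The values mirror the flags
 * assigned by the caller in the listing: 0 selects the multiplexer
 * branch, 1 the demultiplexer (wire replication) branch. */
enum adapter_kind { ADAPTER_MUX = 0, ADAPTER_DEMUX = 1 };

/* More inputs than outputs: the signals must be reduced (multiplex).
 * Otherwise the wires are passed through or replicated (demultiplex). */
static enum adapter_kind select_adapter(int no_input, int no_output)
{
    assert(no_input > 0 && no_output > 0);
    return (no_input > no_output) ? ADAPTER_MUX : ADAPTER_DEMUX;
}
```

Keeping the comparison in one place would avoid the caller and the generator disagreeing about which flag value means which block.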

circuit_t *generate_dataflow_fifo(FILE *fpout,int flag,char *handle,int no_input,int no_output)
{
    FILE *fp;
    char *cmd_string, *filename;


circuit_t *mux_circ;
char *name;
int clock;
int count;
name =(char *) my_malloc(2*STRING*sizeof(char));
cmd_string = (char *) my_malloc(2*STRING*sizeof(char));
filename = (char *) my_malloc(2*STRING*sizeof(char));
count =0;
printf("\n number of input =%d",no_input);
printf("\n number of output =%d",no_output);
//flag ==2 ..for fifo core...
if (flag ==2)
{
    sprintf(name,"fifo_%s",handle);
    sprintf(filename, "fifo_core_%s", handle);
    create_fifo(no_input,no_output,filename);
    #ifdef WIN32
    SetCurrentDirectory("C:\\Docume~1\\jjp11\\desktop\\research\\bcgen\\bcgen\\BUS_GENERATOR");
    sprintf(cmd_string, "perl module_gen_janki.pl --filename=%s --entityname=mux_%s --design=%s > %s.out", filename,filename,filename,filename);
    system (cmd_string);
    SetCurrentDirectory("..");
    #endif
    sprintf(filename, "BUS_GENERATOR/fifo_core_%s.blif", handle);
    fp = fopen(filename, "r");
    if (fp == (FILE *) NULL)
    {
        fprintf(fpout,"Error generating fifo block\n");
    }
    else
    {
        mux_circ = read_blif(fp, fpout, filename, FALSE, TRUE);
        fclose(fp);
        if (mux_circ == (circuit_t *)NULL)
        {
            fprintf(fpout,"Error reading fifo block\n");
        }
    }
} //end of if for flag ==2
free(cmd_string);
free(filename);
return mux_circ;
}//end of function
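
Both generator functions assemble the Perl command line with sprintf into a fixed-size buffer, which can overflow without warning if the handle is long. As a hedged sketch, the same command can be built with the standard snprintf, which makes truncation detectable. The helper name build_module_gen_cmd is hypothetical; the script name and its options are copied from the listing.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of bcgen. Returns 0 on success and
 * -1 if the command would not fit in buf (or on an encoding error). */
static int build_module_gen_cmd(char *buf, size_t size, const char *filename)
{
    int n = snprintf(buf, size,
                     "perl module_gen_janki.pl --filename=%s --entityname=mux_%s "
                     "--design=%s > %s.output 2>&1",
                     filename, filename, filename, filename);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A caller would check the return value before passing the buffer to system().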

//CODE FOR SRL FIFO (VHDL code borrowed from OPEN CORE)..
void create_fifo(int in_no1, int out_no1,char *name)
{
 int in_no,out_no;
 char *name_copy;
 FILE *f;
 SetCurrentDirectory("C:\\Docume~1\\jjp11\\desktop\\research\\bcgen\\bcgen\\BUS_GENERATOR");
 name_copy = (char *) my_malloc(2*STRING*sizeof(char));
 printf("\n name of the file =%s",name);
 strcpy(name_copy,name);
 strcat(name_copy,".vhd");
in_no =in_no1;
out_no = out_no1;
f= fopen(name_copy,"w");
if(f==NULL)
{
 printf("An error has occurred.\n");
 return; // create_fifo is void, so no value is returned
}
//start of vhdl file printing........
fprintf(f,"library IEEE;\n");
fprintf(f,"use IEEE.STD_LOGIC_1164.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_ARITH.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_UNSIGNED.ALL;\n");
fprintf(f,"use IEEE.NUMERIC_STD.ALL;\n\n");
fprintf(f,"entity %s is\n",name);
fprintf(f,"    generic(width : integer := %d); -- set to how wide fifo is to be\n",in_no);
fprintf(f,"    port(\n");
fprintf(f,"        data_in      : in     std_logic_vector (width -1 downto 0);\n");
fprintf(f,"        data_out     : out    std_logic_vector (width -1 downto 0);\n");
fprintf(f,"        reset        : in     std_logic;\n");
fprintf(f,"        write        : in     std_logic;\n");
fprintf(f,"        read         : in     std_logic;\n");
fprintf(f,"        full         : out    std_logic;\n");
fprintf(f,"        half_full    : out    std_logic;\n");
fprintf(f,"        data_present : out    std_logic;\n");
fprintf(f,"        clk          : in     std_logic);\n");
fprintf(f,"end %s ;\n\n",name);
fprintf(f,"architecture behavioural of %s is\n",name);
fprintf(f,"    constant srl_length  : integer := 32; -- set to srl 'type' 16 or\n");
fprintf(f,"    constant pointer_vec : integer := 5; -- set to number of bits needed to store pointer = log2(srl_length)\n");
fprintf(f,"    type srl_array is array ( srl_length -1 downto 0 ) of STD_LOGIC_VECTOR ( width -1 downto 0 );\n");
fprintf(f,"    signal fifo_store : srl_array;\n");
fprintf(f,"    signal pointer            : integer range 0 to srl_length;\n");
fprintf(f,"    signal pointer_zero        : std_logic;\n");
fprintf(f,"    signal pointer_full        : std_logic;\n");
fprintf(f,"    signal valid_write         : std_logic;\n");
fprintf(f,"    signal half_full_int       : std_logic_vector( pointer_vec -1 downto 0);\n");
fprintf(f,"    signal empty               : std_logic := '1';\n");
fprintf(f,"    signal valid_count         : std_logic;\n");
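fprintf(f,"begin\n");
fprintf(f,"    data_srl : process( clk )\n");
fprintf(f,"    begin\n");
fprintf(f,"        if rising_edge( clk ) then\n");
fprintf(f,"            -- Valid write, high when valid to write data to the store.\n");
fprintf(f,"            if valid_write = '1' then\n");
fprintf(f,"                fifo_store <= fifo_store( fifo_store'left -1 downto 0 ) & data_in;\n");
fprintf(f,"            end if;\n");
fprintf(f,"            if reset = '1' then\n");
fprintf(f,"                pointer <= pointer;\n");
fprintf(f,"            elsif empty = '1' and pointer_full = '0' then\n");
fprintf(f,"                pointer <= pointer;\n");
fprintf(f,"            else\n");
fprintf(f,"                pointer <= pointer + 1;\n");
fprintf(f,"            end if;\n");
fprintf(f,"        end if;\n");
fprintf(f,"    end process;\n");
fprintf(f,"    data_out <= fifo_store( pointer );\n");
fprintf(f,"end behavioural;\n");
fclose(f);
}//end of function create_fifo...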
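
The FIFO template hard-codes srl_length as 32 and pointer_vec as 5, and its own comment notes that pointer_vec is log2(srl_length). If the length were ever passed to create_fifo as a parameter, the pointer width could be derived rather than typed in. A minimal sketch follows, assuming a power-of-two length; pointer_bits is a hypothetical helper name that does not exist in the generator.

```c
#include <assert.h>

/* Hypothetical helper: number of bits needed to address srl_length
 * entries, i.e. log2(srl_length) for a power-of-two length. */
static int pointer_bits(int srl_length)
{
    int bits = 0;
    /* Assumption: the length is positive and a power of two. */
    assert(srl_length > 0 && (srl_length & (srl_length - 1)) == 0);
    while ((1 << bits) < srl_length)
        bits++;
    return bits;
}
```

The result could then be written into the template with a %d conversion in place of the literal 5.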
DATA_MUX.C

#include "../include.h"

// CODE TO CREATE MULTIPLEXER AND DEMULTIPLEXER

void create_mux(int in_no1, int out_no1, char *name);
void create_mux(int in_no1, int out_no1, char *name)
{
    int in_no, out_no, div1_result, div2_result, div3_result, mod, diff, value, rem_line, diff1;
    int flag, check;
    int var, counter;
    int *input = (int*)malloc(sizeof(int));
    int *output = (int*)malloc(sizeof(int));
    int copy_input, copy_output;
    int i, j, count, temp_value;
    char *mux_name, *name_copy, *copy_mux_name;
    FILE *f, *f1;
    SetCurrentDirectory("C:\\Docume~1\\jjp11\\desktop\\research\\bcgen\\bcgen\\BUS_GENERATOR");
    name_copy = (char *) my_malloc(2 * STRING * sizeof(char));
    mux_name = (char *) my_malloc(2 * STRING * sizeof(char));
    copy_mux_name = (char *) my_malloc(2 * STRING * sizeof(char));
    printf("\n name of the file =%s", name);
    strcpy(name_copy, name);
    sprintf(mux_name, "mux_%s", name_copy);
    strcat(name_copy, ".vhd");
    strcpy(copy_mux_name, mux_name);

    //code for 2 to 1 mux...
    in_no = in_no1;
    out_no = out_no1;
    *input = in_no;
    *output = out_no;
    // case 1 where *input > *output...
    if(*input > *output)
    {
        // start of mux.vhd...
        f1=fopen(copy_mux_name, "w");
        if(f1==NULL)
        {
            printf("An error has occurred.\n");
            return;
        }
        fprintf(f1,"library IEEE;\n");
        fprintf(f1,"use IEEE.STD_LOGIC_1164.ALL;\n");
        fprintf(f1,"use IEEE.STD_LOGIC_ARITH.ALL;\n");
        fprintf(f1,"use IEEE.STD_LOGIC_UNSIGNED.ALL;\n");
        fprintf(f1,"\n entity %s is \n",mux_name);
        fprintf(f1,"\nport\n\n(X : in std_logic;\nY :in std_logic;\nS : in std_logic;\n Z : out std_logic\n);\nend %s;\n\narchitecture Behavioral of %s is\nbegin\n  Z <= X when S = '0' else Y;\nend Behavioral;\n",mux_name,mux_name);
        fclose(f1);
        //end of mux.vhd file....
        //now create another file for reducing the signal by multiplexing...by connecting to
        //mux.vhd using port map....

f= fopen(name_copy,"w");
if(f==NULL)
{
  printf("An error has occurred.\n");
  return;
}
//start of vhdl file printing.......
fprintf(f,"library IEEE;\n");
fprintf(f,"use IEEE.STD_LOGIC_1164.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_ARITH.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_UNSIGNED.ALL;\n");
fprintf(f,"\n entity %s is \n",name);
//end of file printing
printf("welcome to program...\n");
//---------------PRINTING TO FILENAME.VHD FILE---------------------------
fprintf(f,"\n generic \n(\n M : integer := %d ;\n N : integer := %d \n );", in_no, out_no);
fprintf(f,"\nport\n(\nA : in std_logic_vector(0 to %d);\nB : out std_logic_vector(0 to %d)\n);\n end %s;\n\narchitecture Behavioral of %s is\ncomponent %s\n port\n(X : in std_logic;\nY :in std_logic;\nS : in std_logic;\nZ : out std_logic\n);\nend component;\n", in_no-1, out_no-1, name, name, mux_name);
o=2;
count =1;
M=in_no%2;
// count how many halving stages are needed before out_no lines remain
for(i=in_no/2;i>=1;i=i/2)
{
  if(M!=0)
  {
    out_no = out_no-1;
  }
  if(out_no >=i)
  {
    break;
  }
  count=count+1;
  M=i%2;
}//end of for...
printf("\n count =%d",count);
printf("\n out_no = %d",out_no);
temp_val=1;
for(i=1;i<=count;i++)
{ //generate internal signals
  fprintf(f,"\n signal X_%d  : std_logic_vector(0 to %d);\n",i,(in_no/temp_val-1));
  temp_val=temp_val*2;
}
fprintf(f,"\nbegin");
fprintf(f,"\n J :for i in 0 to %d generate",in_no-1);
fprintf(f,"\nX_1(i) <= A(i);");
fprintf(f,"\nend generate J;");
val =2;
for(i=1; i<count;i++)
{
    M= in_no%2;
    if(i>1)
    {
        //gives in_no input to multiplexer...
        fprintf(f,"\nG%d: for i in 0 to %d generate",i,in_no-1);
        fprintf(f,"U%d: %s port map(X =>X_%d(i*2) ,Y => X_%d((i*2)+1), S =>X_%d(i*2), Z =>B(i));",i,mux_name,i,i,i);
        fprintf(f,"\nend generate G%d;",i);
    }
    if(M!=0)
    {
        fprintf(f,"\nB(%d) <= X_%d(%d);",*output,in_no-1,i);
        *output = *output-1;
    }
in_no = in_no/2;
    val =val*2;
}
//end of for loop...
diff = (in_no*2) - out_no;
value = diff *2;//number of input goes to multiplexer...
rem_line = (in_no*2) -value;//out_no = rem_line +diff.
fprintf(f, "\nG: for i in 0 to %d generate",diff-1);
fprintf(f,"\nU : %s port map(X =>X_%d(i*2) ,Y => X_%d((i*2)+1), S =>X_%d(i*2), Z =>B(i));",mux_name,count,count,count);
fprintf(f,"\nend generate G ;");
j =value;
// the remaining lines bypass the multiplexers and connect straight to B
for(i = diff ;i <out_no;i++)
{
    fprintf(f, " \n B(%d) <= X_%d(%d);",i,count,j);
    j++;
}
fprintf(f, "\n end Behavioral;");
fclose(f); //CLOSE THE FILE
}//end of case 1..*input > *output..

//case 2.. where *input <= *output THEN demultiplex signals....
if(*input<=*output)
{
diff1= *output-*input;
flag = 1;
check=1;
counter =0;
var =0;
f= fopen(name_copy,"w");
if(f==NULL)
{
    printf("An error has occurred.\n");
    return;
}
//start of vhdl file printing........
fprintf(f,"library IEEE;\n");
fprintf(f,"use IEEE.STD_LOGIC_1164.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_ARITH.ALL;\n");
fprintf(f,"use IEEE.STD_LOGIC_UNSIGNED.ALL;\n");
fprintf(f,"\nentity %s is\n",name);
//means I need to increase wires upto diff1...from the *input..
//----------------------PRINTING TO FILENAME.VHD FILE---------------------
fprintf(f, "\ngeneric \n(M : integer := %d ;\nN : integer := %d\n);", in_no, out_no);
fprintf(f,"\nport(\nA : in std_logic_vector(0 to %d);\nB : out\nstd_logic_vector( 0 to %d)\n);", in_no-1, out_no-1);
fprintf(f,"\nend %s;\n\narchitecture Behavioral of %s is\nbegin\n", name, name);
//if diff1 is zero means both are same values ...so just connect from A to B...
if(diff1 == 0) {
    var = var + 1;
    fprintf(f,"\n B(0) <= A(0) xor A(1);\n");
    fprintf(f,"K%d:for i in 1 to %d generate \n",var,in_no-1);
    fprintf(f,"B(i) <= A(i);\n ");
    fprintf(f," end generate;\n");
}

//else means difference is greater than zero so we have to make copy of
//wires...which can connect to B
else
{
    fprintf(f,"\n B(0) <= A(0) xor A(1);\n");
    do
    {
        var = var + 1;
        if(check == 1)
        {
            fprintf(f,"K%d:for i in 1 to %d generate \n",var,in_no-1);
        }
        else
        {
            fprintf(f,"K%d:for i in 0 to %d generate \n",var,in_no-1);
        }
        check = 0;
        fprintf(f,"B(i+%d) <= A(i);\n",counter);
        fprintf(f," end generate;\n");
        if(flag == 0)
        {
            diff1 = diff1-in_no;
        }
        flag = 0;
        counter = counter + (in_no);
    } while(diff1 > in_no);

    if(diff1 < 0)
    {
        diff1 = (-diff1);
    }
    var = var + 1;
    fprintf(f,"K%d:for i in 0 to %d generate \n",var,(diff1-1));
    fprintf(f,"B(i+%d) <= A(i);\n",counter);
    fprintf(f," end generate;\n");
}
//end of else
fprintf(f,"end behavioral;\n");
fclose(f);
}//end of if... case 2......

}//end of function create_mux...
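
In the demultiplexing case the generated VHDL widens the bus by replicating input wires: each generate block assigns B(i+counter) <= A(i), and counter advances by in_no after every block. The intended mapping therefore appears to be that output bit j is driven by input bit j mod in_no, with the exception of B(0), which is driven by A(0) xor A(1). The sketch below models that mapping on plain integer arrays so that it can be checked without a VHDL simulator. The name replicate_wires is hypothetical, and the model assumes in_no >= 2 and out_no >= in_no.

```c
#include <assert.h>

/* Software model of the replication pattern (hypothetical helper,
 * not part of bcgen). a[] holds in_no input bits (0 or 1) and
 * b[] receives out_no output bits. */
static void replicate_wires(const int *a, int in_no, int *b, int out_no)
{
    int j;
    /* Assumptions of this model, not checked by the generator. */
    assert(in_no >= 2 && out_no >= in_no);
    for (j = 0; j < out_no; j++)
        b[j] = a[j % in_no];      /* B(i + k*in_no) <= A(i) */
    b[0] = a[0] ^ a[1];           /* B(0) <= A(0) xor A(1)  */
}
```

Comparing this model against the emitted generate ranges for a few input and output widths is a quick way to spot off-by-one errors in the counter arithmetic.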