
ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA

SCUOLA DI INGEGNERIA E ARCHITETTURA

Dipartimento di Ingegneria dell'Energia Elettrica

e dell'Informazione "Guglielmo Marconi" - DEI

Corso di laurea triennale in

Ingegneria elettronica e delle telecomunicazioni

TESI DI LAUREA

in

Fisica Generale T2

Validation and implementation of PCIe on a FPGA-based custom

board for pixel-detector LHC upgrade

CANDIDATO:

Silvio Zanoli

RELATORE:

Chiar.mo Prof. Alessandro Gabrielli

CORRELATORE:

Chiar.mo Prof. Mauro Villa

Anno Accademico 2016/2017

Sessione I

Abstract

This thesis mainly concerns the work of understanding, validating and implementing, for demonstration purposes, the readout electronics for pixel detectors installed on the high-energy physics experiments at CERN in Geneva. In particular, this thesis deals with the PCIe peripheral of a new board that implements two FPGAs as the core of the project, named Pixel-ROD (Pixel Read Out Driver), conceived as the natural follow-up of the current series of readout boards mounted today in the ATLAS Pixel Detector. It is a natural follow-up because the present ATLAS Pixel Detector readout boards were designed and built by, and are still under the responsibility of, the Department of Physics and Astronomy of Bologna and the Bologna Section of the Istituto Nazionale di Fisica Nucleare.

The Pixel-ROD project started three years ago, since the general trend in the evolution of the off-detector electronics of LHC (Large Hadron Collider) is to abandon the older VME interface in favour of newer and faster ones (such as PCIe). Moreover, since the ATLAS and CMS detectors, which are the main LHC experiments, will share the same readout chip interfacing the future Pixel Detectors, the Pixel-ROD could be used not only for the ATLAS upgrade but also for other experiments.

The main feature of the Pixel-ROD is that it can be used both as a standalone readout board and in a real data-acquisition chain, interfacing with third-party devices.

The work carried out in this thesis starts with a general understanding of the (already existing) board, continues with the development of test programs on Linux systems to verify the actual operation of the PCIe interface and thus validate this peripheral, and ends with the development of demonstration software that simulates a real data flow at high transfer speed.

Contents

Introduction
1 LHC
1.1 Main experiments at LHC
1.2 The ATLAS detector
1.3 ATLAS' structure
1.3.1 Inner Detector
2 Off-detector electronics
2.1 IBL's electronics
2.1.1 IBL BOC
2.1.2 IBL ROD
2.1.3 TIM
2.1.4 SBC
2.2 The road towards a PCIe-based board
3 PCI-Express overview
3.1 Old standard's crisis
3.2 PCI-Express standard
3.2.1 PCI-Express topology
3.2.2 PCI-Express Layers
4 Pixel-ROD board
4.1 VME vs PCIe
4.1.1 VME overview
4.1.2 PCIe choice
4.2 Pixel-ROD's making of
4.2.1 KC705
4.2.2 ZC702
4.3 Pixel-ROD overview
4.3.1 New Features
4.3.2 Layers' specifics
5 Pixel-ROD test results
5.1 Tests completed
5.1.1 Power-Up supply test
5.1.2 Kintex-Zynq internal bus test
5.1.4 Kintex and Zynq UART and memory test
5.2 PCI-Express Validation
5.2.1 First Attempt
5.2.2 XDMA IP core and firmware implementation
5.2.3 Linux custom driver
5.2.4 Echo Test
5.2.5 Buffer and throughput Test
5.2.6 BER Test
Conclusion and future development
Bibliography


Introduction

This thesis mainly concerns the understanding, validation and implementation, for demonstration purposes, of pixel readout electronics for high-energy physics experiments at CERN in Geneva. In particular, this thesis deals with the PCIe peripheral implemented on a new FPGA-based board, named Pixel-ROD (Pixel Read Out Driver), conceived as a natural follow-up of the present series of readout boards installed in the ATLAS Pixel Detector. It is a natural follow-up because the current ATLAS Pixel Detector readout cards were designed and built by, and are still under the responsibility of, the Department of Physics and Astronomy of Bologna and the Bologna Section of the National Institute of Nuclear Physics. The Pixel-ROD project started three years ago, since the general trend for the upgrade of the off-detector LHC (Large Hadron Collider) Phase-2 electronics is to leave the older VME interface for newer and faster buses (such as PCIe). Moreover, as the ATLAS and CMS experiments, which are the main LHC experiments, will share the same readout chip interfacing the future Pixel Detectors, the Pixel-ROD board could be used not only for the ATLAS upgrade, but also for other experiments. The main feature of the Pixel-ROD board is that it can be used both as standalone readout electronics and in real data acquisition chains, interfacing with third-party devices.

This thesis is intended to provide a brief overview of the environment in which the Pixel-ROD board was conceived. After this Introduction, Chapter 1 summarizes the LHC and the ATLAS experiment from the detector point of view; Chapter 2 describes the current off-detector electronics; Chapter 3 treats the PCIe standard; Chapter 4 presents the Pixel-ROD board; Chapter 5 describes the tests that have been carried out so far. Finally, the Conclusions depict the current status of the tests, the results obtained and the future analyses that will be performed in order to match a real application on a specific experiment.

In particular, the work performed in this thesis focuses on the development of simple test programs on Linux systems aimed at validating the PCIe interface present on the board, ending with the development of demo software that simulates a high-throughput data flow.


Chapter 1

LHC

LHC (Large Hadron Collider) is the largest and most powerful particle accelerator ever built, placed in the tunnel which housed the LEP (Large Electron-Positron collider) in Geneva and managed by the European Organization for Nuclear Research, also known as CERN (Conseil Européen pour la Recherche Nucléaire). LHC is made up of a ring 27 km long ([1]) placed at a depth of about 100 m. There are four points where protons are forced to collide and where the four huge detectors (known as ATLAS, ALICE, CMS and LHCb) are set up to record every detail of the particle collisions, providing a huge amount of data to analyse (see Figure 1.1).

Protons are not injected directly into the beam pipe of the main ring: they go through a sequence of accelerators thanks to which they reach the desired energy of 6.5 TeV and above. The nominal maximum proton collision energy in LHC is 14 TeV; however, the accelerator is currently working with a collision energy of 13 TeV, since with this expedient the powering system is optimized. At this energy, protons travel at a speed very close to the speed of light in vacuum. Under nominal conditions, each beam contains 2808 bunches of about 10^11 protons each. The beam is held in the accelerator ring by 1232 superconducting dipole magnets, which create a maximum magnetic field of 8.3 T, and is focused by 392 quadrupole magnets.

Figure 1.1: LHC overview

1.1 Main experiments at LHC

As already stated, there are four main interaction points along LHC's ring where the particles are forced to collide and where the detectors are placed. These detectors are:

ATLAS (A Toroidal LHC ApparatuS): one of the two general-purpose detectors at LHC. This detector investigates a wide range of physics problems, ranging from searches for the Higgs boson or dark matter to further measurements of the Standard Model parameters.

CMS (Compact Muon Solenoid): the second general-purpose detector at LHC. Together with ATLAS it shares the same scientific goals, although it uses different technical solutions and a different magnet-system design.

LHCb (Large Hadron Collider beauty): this experiment is specialized in investigating the slight differences between matter and antimatter by studying particles containing the beauty quark.

ALICE (A Large Ion Collider Experiment): a heavy-ion collision detector. It is designed to study the physics of strongly interacting matter at extreme energy densities, where matter forms a new phase called quark-gluon plasma.

1.2 The ATLAS detector

As stated above, the ATLAS experiment is a general-purpose particle detector installed at LHC, which is used to study different kinds of physical phenomena. The detector is 44 m long ([2]), with an outer diameter of 25 m, and weighs approximately 7000 tonnes. ATLAS has a cylindrical symmetry, as do the CMS and ALICE detectors.

The ATLAS detector describes the events using a coordinate system whose origin is set at the nominal interaction point, with the z-axis defined by the beam direction and the x-y plane transverse to it ([3]). The positive x-axis is defined as pointing from the interaction point to the centre of the ring, while the y-axis points upwards. The azimuthal angle ϕ is measured around the beam axis and the polar angle θ is the angle from the beam axis. Pseudorapidity η is then defined as η = -ln[tan(θ/2)]: its value ranges from -∞ to +∞, being zero for a vector lying in the transverse plane (for example along the y-axis) and going to infinity where the vector is along the z-axis. Using the pseudorapidity, it is possible to define the distance ΔR in the pseudorapidity-azimuthal angle space as ΔR = √(Δη² + Δϕ²).
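As a small numerical illustration of these two definitions, the C snippet below computes η from the polar angle θ and the distance ΔR between two objects. It is only a sketch: the function names are illustrative and do not belong to any ATLAS software package.

#include <math.h>
#include <stdio.h>

/* Pseudorapidity from the polar angle theta (in radians). */
static double pseudorapidity(double theta)
{
    return -log(tan(theta / 2.0));
}

/* Distance in the pseudorapidity-azimuthal angle space
 * (no wrap-around of the azimuthal difference, for simplicity). */
static double delta_r(double eta1, double phi1, double eta2, double phi2)
{
    double deta = eta1 - eta2;
    double dphi = phi1 - phi2;
    return sqrt(deta * deta + dphi * dphi);
}

int main(void)
{
    const double pi = 3.14159265358979323846;

    /* theta = 90 degrees (transverse plane) gives eta = 0,
     * while a small polar angle gives a large pseudorapidity. */
    printf("eta(90 deg) = %.3f\n", pseudorapidity(pi / 2.0));
    printf("eta(1 deg)  = %.3f\n", pseudorapidity(pi / 180.0));
    printf("deltaR      = %.3f\n", delta_r(0.5, 0.1, 1.0, 0.4));
    return 0;
}

(The snippet needs the math library at link time, e.g. gcc eta.c -lm.)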

1.3 ATLAS' structure

The structure of the ATLAS detector is presented in Figure 1.2. It is built with a cylindrical symmetry around the beam pipe axis, centered on the interaction point, allowing a large pseudorapidity (η) coverage. Ideally, it can be divided into five main regions: a barrel region, where low-|η| events are read, two end-cap regions that cover medium |η|, and two forward regions that take care of the area with higher |η|.

Figure 1.2: ATLAS detector overview


As shown in figure 1.2, ATLAS is made up of different groups of sub-detectors designed to track proton-proton collisions and the particles that these collisions produce. The innermost of these sub-detector systems is called the Pixel Detector; moving outwards, one then finds the calorimeters and the muon spectrometers. The inner detector is surrounded by a solenoid that provides a 2 T magnetic field, while the barrel and end-cap sections of the muon spectrometer are surrounded by toroidal magnets that provide, respectively, about 0.5 T and 1 T.

Particles produced in proton-proton collisions start their travel through ATLAS from the Inner Detector, which covers the region |η| < 2.5. Here, charged particles interact with the different layers of the detector producing "hits": packets of data that mark the passage of a particle and are used to reconstruct its trajectory. The momenta and the charge of these particles can be measured, as their trajectories are bent by the 2 T magnetic field provided by the central solenoid.

A schematic global section view of the ATLAS detector is provided in Figure 1.3, illustrating how and where different particles can interact. The inner detector, designed to have high granularity and high momentum-measurement resolution, provides essential information such as primary and secondary vertex recognition. In order to achieve the needed performance, semiconductor detectors are used close to the beam for precise measurements.

Figure 1.3: Section view of ATLAS detector in the transverse plane, illustrating layers' positioning

Moving away from the central beam pipe, two types of calorimeter can be found: the electromagnetic calorimeter and the hadronic calorimeter, needed to detect electrons/photons and hadrons respectively, by determining both their energy and position. Proceeding further outwards, the Muon Spectrometer is needed to identify muons that passed through the inner layers almost undisturbed. The Muon Spectrometer is made up of four different types of chambers: two types are intended to provide position and momentum measurements, while the other two are meant to grant precise and robust information for the hardware-based trigger decision making.

1.3.1 Inner Detector

The Inner Detector (ID) is the closest detector to the beam pipe, designed to be radiation hard and to have long-term stability. As shown in figure 1.4, the full ID is a cylinder 6.2 m long with a diameter of 2.1 m.

Figure 1.4: A section view of the ATLAS Inner Detector

The ID is segmented with cylindrical symmetry in the barrel region, while it has coaxial disks in the end-cap regions. As shown in figure 1.5, the ID is made up of three main different layers, explained in the following paragraphs.

Transition Radiation Tracker

The Transition Radiation Tracker (TRT) detector is made up of thin drift tubes, filled with a non-flammable gas mixture, in the outer part of the ID. These tubes follow the symmetry of the corresponding region: the tubes installed in the end-cap regions are oriented radially, while the other straws follow the beam-pipe direction. Transition radiation is a form of electromagnetic radiation emitted when a charged particle passes through an inhomogeneous medium, such as a boundary between two different media.

The TRT occupies the largest space of the ID and provides the majority of the hits per track, hence contributing strongly to the momentum measurement. The TRT offers more hits per track in order to retrieve the information about momentum, even though it has a lower precision compared to the silicon detectors. Unlike all the other active parts of the ID, the drift tubes do not need to be cooled and are not subject to radiation degradation, as the gas can simply be replaced.

Semiconductor Tracker

The SemiConductor Tracker (SCT) is a tracker made up of silicon strips, with a technology similar to the one employed in the Silicon Pixel Detector. The reasons for using such trackers, instead of pixels, are mainly economical, since the barrel of the SCT covers more than 30 times the surface of the Pixel Detector.

Pixel Detector

The innermost and most important detector is the Pixel Detector. This detector is designed to have the finest granularity among all the sub-detectors. The system consists of four cylindrical layers of silicon "pixels"; these four layers are named, from the inside to the outside: Insertable B-Layer (IBL), B-Layer (L0), Layer 1 (L1) and Layer 2 (L2). Each layer is made of several modules. A module consists of: sensors, 16 Front End (FE-I3) chips responsible for reading the charge signal from the pixels, a flex-hybrid, a Module Controller Chip (MCC) and a pigtail.

The board developed and tested in this thesis is proposed as a readout board for this detector ([4]).

Figure 1.7: Structure and arrangement of the layers of the Inner Detector in the barrel region.

Chapter 2

Off-detector electronics

In order to explain the reasons that led to the development of a new read-out board, we need to look at the current set-up of the off-detector electronics for the Insertable B-Layer. This layer is the newest and innermost of the four cylindrical pixel layers installed in the barrel. High-energy physics experiments usually distinguish between on-detector and off-detector electronics: the former is the front-end electronics implemented near the detector itself, where radiation resistance is a fundamental requirement and the electronics is often custom made for the experiment; the latter is the readout system implemented far from the detector, which does not need to be radiation-hard, allowing the use of more powerful commercial devices.

The following sections describe the off-detector electronics in order to understand the tasks that the Pixel-ROD board was designed for.

2.1 IBL’s electronics

The readout system for IBL requires an appropriate off-detector apparatus, which

we're going to describe. This readout system is made of several components:

Back of Crate (BOC) board;

Optical modules to interface FE-I4 chips with BOC board;

S-Link for sending data from the BOC board to the ATLAS TDAQ system;

Read Out Driver (ROD) board;

Gigabit Ethernet to send front-end calibration’s histograms;

VME Crate;

TTC Interface Module (TIM);

Single Board Computer (SBC).

Figure 2.1: Complete visual layout of the data acquisition system. In red the normal data path, in blue the deviation of the Histogram data path.

Each BOC-ROD pair can interface and route 5.12 Gb/s of data coming from 16 IBL modules (32 FE-I4 chips). With this throughput, the whole IBL readout requires 15 BOC-ROD pairs, placed in a single VME crate.

A smaller version of this setup is replicated here in Bologna, inside a VME crate that hosts a few BOC-ROD pairs. The data path is simplified in figure 2.1: 32 FE-I4 front-end chips drive 32 serial lines, each supporting 160 Mb/s, connected to the BOC board via optical links. Here the signal from each line is converted from optical to electrical, then demultiplexed to one 12-bit-wide bus, which proceeds towards the ROD board through the VME backplane connector. The ROD board performs the real data formatting, building up a ROD data frame that is sent to the TDAQ computers. On the ROD board, data can take two different paths (see Figure 2.1): the first routes the ROD data events back to the BOC, where four S-Link modules send the data towards the ATLAS TDAQ PCs, implementing a total output bandwidth of 5.12 Gb/s; the second route is toward the PC for histogram making, exclusively used during calibration of the FE-I4 chips ([4]).
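The quoted figure follows directly from the link count: 32 front-end links × 160 Mb/s per link = 5120 Mb/s = 5.12 Gb/s per BOC-ROD pair, which matches the total S-Link output bandwidth given above.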

2.1.1 IBL BOC

The BOC board, shown in Figure 2.2, is in charge of the control interface to the detector, as well as of the data interface from the detector itself. Another major task of the BOC is to provide the clock to the connected front-end chips: the clock is received by the BOC from the TIM and can be delayed if needed; furthermore, a Phase-Locked Loop (PLL) generates copies of this clock for the ROD and for the detector. The detector clock is then handled by the FPGAs and coded into the control streams for the individual detector modules. The IBL BOC contains three Xilinx Spartan FPGAs:


One BOC Control FPGA (BCF);

Two BOC Main FPGAs (BMF).

BOC Control FPGA

In the BOC Control FPGA a MicroBlaze embedded processor is instantiated, mainly in order to manage the Ethernet connection of the board and to perform some self-tests. The BCF is also responsible for the configuration of the FPGAs, which happens in two steps: first, the BCF loads its own configuration in "Master Serial Peripheral Interface" mode from a 64 Mbit SPI FLASH; then, the configuration data for the two main FPGAs are read from a second SPI FLASH memory and downloaded via the Slave Serial configuration ports. Depending on the configuration, the BCF can also load software from a third SPI FLASH.

BOC Main FPGA

The two BMFs encode the configuration data coming from the ROD and send them straight to the FE-I4 chips. For testing purposes, this configuration stream can also be generated directly by the BMFs. One of the main tasks that these two FPGAs execute is the deserialization of the incoming data from the front-end chips on the RX path (Figure 2.1), after the data collection and the word alignment. They are also in charge of sending the decoded data to the ROD board. On the TX side, they also manage four optical S-Link connections to the TDAQ system.

Figure 2.2: IBL BOC board


2.1.2 IBL ROD

The Insertable B-Layer Read Out Driver (IBL ROD) ([5]) is designed to accomplish several tasks, such as propagating timing and trigger signals to the front-end electronics, as well as sending an appropriate configuration to it. The most important task of the ROD is accomplished during physics runs: it receives data and event fragments from the 32 FE-I4 chips and transforms them into a ROD data frame, which is sent back to the ATLAS TDAQ through the BOC's S-Link connections (see Figure 2.1).

The IBL ROD (shown in Figure 2.3) is composed of:

One Digital Signal Processor MDSP (Texas Instruments TMS320C6201-GJC200), which is currently not used;

One Program Reset Manager (PRM) FPGA (Xilinx Spartan-6 XC6SLX45-FGG484);

One ROD Controller (Master) FPGA (Xilinx Virtex-5 XC5VFX70T-FF1136);

Two "Slave" FPGAs (Xilinx Spartan-6 XC6SLX150-FGG900);

One Phase-Locked Loop PLL (Lattice ispClock 5620);

32 MByte DDR SDRAM;

Four Mbit FLASH SST39VF040-70-4C-NH;

Two GByte DDR2 SODIMM;

64 Mbit FLASH Atmel AT45DB642D;

Three Gbit Ethernet interfaces with PHY DP83865.

Figure 2.3: The IBL ROD board

In order to understand the task of the ROD, the main devices are hereafter

described.

ROD Master

A Virtex-5 FPGA is the Master of the Read Out Driver. It interfaces with the front-end chips and with the triggers that come from the TTC module. This FPGA contains a PowerPC, an embedded hard processor, and has several tasks to accomplish; one of the most important is to process the trigger and event information and deliver it to the Spartan FPGAs (the Slaves).

ROD Slaves

The two Spartan-6 FPGAs work as slaves on the ROD board and implement a soft processor named MicroBlaze. All the data generated by the FE-I4 chips are collected by these FPGAs, which store them in the ROD's SSRAM. During calibration runs, histograms can be generated and sent to the histogram server through an Ethernet connection (one for each Spartan-6).


2.1.3 TIM

The TTC (Timing, Trigger and Control) Interface Module (TIM) interfaces the ATLAS Level-1 Trigger system signals to the pixel Read-Out Drivers. As far as the IBL off-detector electronics is concerned, the TIM sends the clock to the BOC board, which then propagates it to the ROD, as stated above. Furthermore, the TIM receives and propagates the triggers.

2.1.4 SBC

The Single Board Computer, as the name suggests, is actually a computer mounted on a 6U board with a VME interface chip. It is used to control all the VME operations on the ROD and to program the ROD FPGAs, usually after power-up. It can also be used to monitor the temperature, or the voltages, on the ROD's master device.

2.2 The road towards a PCIe-based board

The entire read-out of the IBL, as said before, implements 15 BOC-ROD pairs, installed in 2014, while the remaining layers need respectively 38 pairs for Layer 1, 26 pairs for Layer 2, 22 pairs for the B-Layer and 12 for the external B-Layer and Disks; these cards will be installed this year (the decision to implement this system also for all the other layers was taken only at a later time). This excellent system is, unfortunately, limited, especially looking towards the upgrade of the whole LHC detector, planned for 2023 ([6]), in order to achieve a much higher luminosity (up to 10 times the current one). Such a huge improvement in luminosity means that the electronics will need to withstand a much higher data rate.

Looking in this direction, many electronic boards have been proposed for the readout of such experiments ([4]). All the projects, from the electronic viewpoint, share a common feature: the implementation of an electronic board designed to be flexible and highly configurable, with powerful FPGAs connected to many optical transceivers and with a PCIe interface to guarantee an extremely high throughput.


Chapter 3

PCI-Express overview

Since the first communication protocols saw the light, the ever-increasing need to move huge amounts of data has driven research along the path of higher throughput, starting with really tiny data rates (for example, 56 kbit/s on old routers) and arriving, nowadays, at astonishing bandwidths obtained with new protocols and design strategies. One of these new protocols is PCI-Express (Peripheral Component Interconnect Express, from now on simply called PCIe). As the name suggests, PCIe is the third-generation high-performance I/O bus used to interconnect peripheral devices; this protocol achieves its required throughput and reliability through serial, packet-based communication at high speed over differential lines. This standard was studied and implemented in order to overcome the issues that were quickly growing in other protocols such as PCI or PCI-X. In the next section the aspects that made parallel protocols obsolete and drove the move towards PCIe are described.

3.1 Old standard's crisis

As stated above, PCIe is the third-generation bus used for peripheral connection. The first generation included buses such as ISA, EISA, VESA and Micro Channel, among many others; all these different protocols were slow and too numerous. The second-generation buses included PCI, AGP and PCI-X ([7]). When the PCI bus was first introduced in the early 1990s, it had a unifying effect on the plethora of first-generation I/O buses, and it also brought a lot of advantages, such as processor independence, buffered isolation, bus mastering, and true plug-and-play operation. Despite all of these benefits, PCI had several problems of increasing importance as the bandwidth requests increased, for example bandwidth limitations, host pin-count limitations, lack of real-time data transfer services and so on. Its derivatives, such as PCI-X and AGP, only pushed the problem farther away by raising the maximum throughput.

Figure 3.1: Bandwidth and maximum number of card slots per bus

The data-rate of the PCI bus and its derivatives can be significantly lower than the theoretical bandwidth, due to protocol overhead and to the parallel bus topology: more devices connected on the same bus meant a slower, or even non-functioning, connection, as shown in figure 3.1 ([8]). This led to PCIe, a different kind of protocol: in order to alleviate the problems of the parallel connection, PCI-Express offers a serial architecture using clock data recovery (CDR) and differential signalling; PCIe also employs a dual-simplex, point-to-point link topology to overcome the limitations of a shared bus (see figure 3.2). The links use high-speed serial transceivers with embedded clock and differential data signals operating at 2.5 GTransfer/s with 8b/10b encoding. A link can consist of a single (x1) lane providing a peak bandwidth of 500 Mbyte/s (2 directions x 2.5 Gbit/s x 8/10 encoding). One of the major facts that made PCIe so widespread is its backward compatibility with the PCI standard: current operating systems that are compatible with the PCI software model can boot and run on PCI Express systems without any change to device drivers or to the operating system itself. However, to take advantage of the new, advanced features of PCI Express, software modifications are necessary (for example, in order to access the extended configuration space).

Figure 3.2: PCI vs PCI-E topology

Figure 3.3: PCIe throughput vs PCIe version; the "Total Bandwidth" is calculated on the full-duplex communication

3.2 PCI-Express standard

As seen in the previous sections, PCIe is a serial, packet-based standard. Before describing the constitutive elements of this protocol, such as its topology and layers, it is necessary, in order to understand the main topics, to discuss the terminology used and the performance achieved.

Terminology

Each connection between two PCIe devices is called an Interconnect or Link: a point-to-point communication that allows bidirectional transmissions (ordinary PCI requests or interrupts). Each Link is composed of 1 or more Lanes (1, 2, 4, 8, 16 or 32 possible lanes). A PCIe card using a Link with, for example, 8 Lanes is indicated as PCIe-x8. Each Lane counts exactly 2 differential signal pairs, one for sending data and one for receiving, making each Lane a full-duplex communication channel composed of 4 wires ([8]).

Performance

The performance of PCIe strongly depends on the version of the standard we are referring to: nowadays there are 4 official versions with 4 different performance levels ([9]) (even if the fourth generation is still in its embryonic state, see figure 3.3):

V 1.0: with 2.5 GTransfer/s;

V 2.0: with 5 GTransfer/s;

V 3.0: with 8 GTransfer/s;

V 4.0: with 16 GTransfer/s.


V 1.0

The first version of the PCIe protocol allowed 2.5 GTransfer/s, which means that every Lane could transfer up to 2.5 Gbit/s, using the 8b/10b encoding format. For example, a PCIe-x8 Gen1 link is capable of a maximum throughput of (considering only sending or receiving):

2.5 GT/s × 8 lanes × 8/10 = 16 Gbit/s = 2 GB/s

V 2.0

The second version of the PCIe protocol changed the data-transfer rate, raising it up to 5 GTransfer/s, using the same 8b/10b encoding format. For example, a PCIe-x8 Gen2 link is capable of a maximum throughput of (considering only sending or receiving):

5 GT/s × 8 lanes × 8/10 = 32 Gbit/s = 4 GB/s

V 3.0

The third version of the PCIe protocol changed the data-transfer rate again, raising it up to 8 GTransfer/s, and also changed the encoding format from 8b/10b to 128b/130b. For example, a PCIe-x8 Gen3 link is capable of a maximum throughput of (considering only sending or receiving):

8 GT/s × 8 lanes × 128/130 ≈ 63 Gbit/s ≈ 7.9 GB/s

V 4.0

The fourth version of the PCIe protocol changed the data-transfer rate once again, raising it up to 16 GTransfer/s, without changing the encoding format, which remains 128b/130b. For example, a PCIe-x8 Gen4 link is capable of a maximum throughput of (considering only sending or receiving):

16 GT/s × 8 lanes × 128/130 ≈ 126 Gbit/s ≈ 15.8 GB/s
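The per-generation figures above can be reproduced with a few lines of C. The transfer rates and encoding efficiencies are hard-coded from the values quoted in this section; the program is only an illustration and is not part of the thesis test software.

#include <stdio.h>

/* Raw transfer rate (GT/s per lane) and encoding efficiency for each
 * PCIe generation, as quoted in this section. */
struct pcie_gen {
    const char *name;
    double gts;
    double efficiency;   /* 8b/10b = 8/10, 128b/130b = 128/130 */
};

int main(void)
{
    const struct pcie_gen gens[] = {
        { "Gen1",  2.5,   8.0 / 10.0  },
        { "Gen2",  5.0,   8.0 / 10.0  },
        { "Gen3",  8.0, 128.0 / 130.0 },
        { "Gen4", 16.0, 128.0 / 130.0 },
    };
    const int lanes = 8;   /* PCIe-x8, as in the examples above */

    for (unsigned i = 0; i < sizeof gens / sizeof gens[0]; i++) {
        /* Throughput in one direction only, as in the text. */
        double gbit_s = gens[i].gts * lanes * gens[i].efficiency;
        printf("%s x%d: %6.2f Gbit/s = %5.2f GB/s per direction\n",
               gens[i].name, lanes, gbit_s, gbit_s / 8.0);
    }
    return 0;
}

For the x8 Gen2 configuration this gives 32 Gbit/s, i.e. 4 GB/s per direction, the figure used again in chapter 4 for the Pixel-ROD.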


3.2.1 PCI-Express topology

As seen in the previous sections, the PCIe standard uses a point-to-point serial connection in order to achieve the desired performance. In order to start talking about the topology of the PCIe standard, it is necessary to understand its constitutive elements: Root Complex, Switch, Endpoint and the enumerating system (figure 3.4).

Root Complex

The Root Complex (RC) can be seen as the main controller of this standard ([8]): it denotes the device that connects the CPU and memory subsystem to the PCIe fabric and supports one or more PCIe ports. Each port can be connected to an endpoint device or to a switch; in this last case a sub-hierarchy is formed. The RC is responsible for generating transaction requests and initiating configuration transactions on behalf of the CPU; it is also capable of generating both memory and IO requests, as well as generating locked transaction requests without responding to them. The principal work of the RC is to transmit packets out of its ports and to receive packets on its ports, which it forwards to memory. A multi-port RC may work as a router in order to route a packet from port to port, even if this is not required. The RC also implements central resources such as hot-plug control, the power-management controller, the interrupt controller, and error detection and reporting logic. The RC is initialized, at t=0, with a bus number, device number and function number. The RC also defines the hierarchy and the hierarchy domain: the hierarchy is the ensemble of all the devices and associated Links that are either directly connected or indirectly connected via switches and bridges, while a hierarchy domain is the full tree connected to a single port.

Figure 3.4: PCIe example topology

Switch

Switches can be thought of as two or more PCI-to-PCI bridges, each bridge associated with a switch port and implementing one configuration header register; the configuration and enumeration software detects these registers and initializes them at boot time. These bridges are internally connected via a non-defined bus. The port pointing towards the Root Complex is the upstream port; all other ports are downstream ports. The main work of Switches is to forward all types of packets (transactions) from any ingress to any egress port, in a manner similar to PCI bridges, using memory, IO or configuration address-based routing. The logical bridges within the switch implement a PCI configuration header which holds the memory and IO base and limit address registers, as well as the primary bus number, secondary bus number and subordinate bus number registers. Switches implement two types of arbitration mechanism, port arbitration and VC arbitration, by which they determine the priority with which to forward packets from ingress to egress ports; Switches also support locked requests ([8]).

Endpoints

Endpoints are all the devices, other than the Root Complex and the switches, that are requesters or completers of PCIe transactions, such as Ethernet, USB or graphics cards lying on the PCIe bus. They initiate transactions as requesters or respond to transactions as completers. Two types of endpoint exist: PCI-Express endpoints and legacy endpoints. Legacy endpoint devices are not required to support 64-bit memory addressing capability; they may support IO transactions and they may also support locked transaction semantics as a completer, but not as a requester. Interrupt-capable legacy devices may support legacy-style interrupt generation using message requests, but must in addition support MSI generation using memory write transactions. PCIe endpoints must not support IO or locked transaction semantics and must support MSI-style interrupt generation and 64-bit memory addressing capability. Both kinds of endpoint implement Type 0 configuration headers and respond to configuration transactions as completers. Each endpoint is initialized with a device ID (both requester and completer ID) composed of a bus number, device number and function number. Endpoints are always device 0 on the bus assigned to them; this does not mean, anyway, that "dev 0" is the only possible device on a bus.

Enumerating System

Even if the enumerating system is something that does not strictly regard the topology of PCIe systems, it is necessary to understand at least its basics in order to be able to discuss the implementation of PCIe (in the next chapter). Standard PCI Plug-and-Play enumeration software enumerates PCIe systems. Each PCIe Link is equivalent to a logical PCI bus, which means that every Link is assigned a bus number. The PCIe endpoint is device 0 on the PCIe Link of a given bus number; only one device (dev 0) exists per PCIe Link. The enumeration of switches is slightly different: the internal bus of the switch, which connects all of the virtual bridges together, is also numbered. The first Link associated with the Root Complex is bus number 1, because number 0 is an internal virtual bus ([8]).
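On a Linux host, the result of this enumeration can be inspected from user space: every enumerated function appears under /sys/bus/pci/devices/ as a bus:device.function directory, whose config file exposes its configuration space. The minimal sketch below reads the Vendor ID and Device ID of one function; the address 0000:01:00.0 is only an example and depends on where the board gets enumerated.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Example bus:device.function; replace it with the address actually
     * assigned to the board (as reported, for instance, by lspci). */
    const char *path = "/sys/bus/pci/devices/0000:01:00.0/config";
    uint8_t cfg[4];

    FILE *f = fopen(path, "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    /* The first four bytes of configuration space hold the Vendor ID
     * (offset 0x00) and the Device ID (offset 0x02), little-endian. */
    if (fread(cfg, 1, sizeof cfg, f) != sizeof cfg) {
        fprintf(stderr, "short read\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    printf("Vendor ID: 0x%04x\n", (unsigned)(cfg[0] | (cfg[1] << 8)));
    printf("Device ID: 0x%04x\n", (unsigned)(cfg[2] | (cfg[3] << 8)));
    return 0;
}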

3.2.2 PCI-Express Layers

As said before, PCIe is a standard that uses a packet-based communication system and, like the majority of this type of communication, it bases its work on a stacked architecture. The stack of the PCIe standard is similar to a simplified ISO/OSI stack: it has, in fact, only 3 layers: Physical Layer, Data Link Layer and Transaction Layer (see figure 3.5) ([10]). This kind of architecture was chosen in order to achieve 2 major objectives: an ultra-reliable communication protocol and the isolation between different functional areas ([7]). This allows updating or upgrading one or more layers, often without requiring changes in the other layers. For example, new transaction types might be included in newer revisions of the protocol specification without affecting lower layers, or the physical medium might be changed with no major effects on the higher layers. On the reliability side, PCIe Gen3 ensures a Bit Error Rate (BER) of 10^-12, meaning that statistically only one bit in a thousand billion is wrong. There are two main packet types: Transaction Layer Packets (TLPs) and Data Link Layer Packets (DLLPs) ([8]). While DLLPs are used for service communications between the PCIe constitutive elements, TLPs are the packets that actually move the data from and to the devices. In the following paragraphs each layer will be examined; then an ideal TLP transaction will be analyzed in order to summarize all the layers' work.

Transaction Layer

The Transaction Layer (TL) is the upper layer of the PCI Express architecture, and its primary function is to accept, buffer, and disseminate transaction layer packets (TLPs). TLPs have four addressing spaces, which are: memory, IO, configuration and message. The transactions of this last space (the messages) need to be discussed in order to understand the passage from PCI to PCIe: PCI 2.2 introduced an alternate method of propagating system interrupts called message signaled interrupt (MSI). Here a special-format memory-write transaction was used instead of a hard-wired sideband signal, as an optional capability in a PCI 2.2 system. The PCI Express specification reuses the MSI concept as the primary method for interrupt processing and uses the message space to accept all prior sideband signals, such as interrupts, power-management requests, and resets, as in-band messages. Other "special cycles" within the PCI 2.2 specification, such as interrupt acknowledge, are also implemented as in-band messages.

Figure 3.5: The PCIe stack with a packet transfer example

Figure 3.6: TLP's header format

All requests are implemented as split transactions and fall into two big categories: posted and non-posted transactions. Posted transactions do not need any reply from the completer (the receiver): for example, write transactions or message transactions. Non-posted transactions, on the other hand, need a reply: for example, a memory read transaction or an IO write. In the case of a non-posted transaction, the TL receives the response packets from the link layer and matches them with the original software requests. Each packet has a unique identifier that enables response packets to be directed to the correct originator. The packet format offers 32-bit memory addressing and extended 64-bit memory addressing. Packets also have attributes such as "no-snoop", "relaxed ordering" and "priority", which may be used to route these packets optimally through the IO subsystem. As this layer is so important for the PCIe architecture, it is worth analyzing, even if superficially, the TLPs generated here. A Transaction Layer Packet (TLP) consists of a TLP header, composed of 3 or 4 double words (32 bits each), a data payload of 0 to 1023 double words (for the types of packets that need it), and an optional "TLP digest" tail. The header is composed of several fields (figure 3.6). The two most important fields of this header are "Fmt", which determines the length of the header and the presence (or absence) of data, and "Type", which determines the type of the transaction. "Length" determines the length of the data ([8]).
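As a concrete illustration of these fields, the sketch below decodes Fmt, Type and Length from the first double word of a TLP header. The bit positions used (Fmt in bits 30:29, Type in bits 28:24, Length in bits 9:0) are taken from the PCIe 1.x/2.x specification layout and are assumed to match Figure 3.6.

#include <stdio.h>
#include <stdint.h>

/* Decode the main fields of the first TLP header double word (DW0).
 * Bit positions follow the classic PCIe 1.x/2.x layout. */
static void decode_tlp_dw0(uint32_t dw0)
{
    unsigned fmt    = (dw0 >> 29) & 0x3;   /* header length and data presence */
    unsigned type   = (dw0 >> 24) & 0x1f;  /* transaction type                */
    unsigned length = dw0 & 0x3ff;         /* payload length in double words  */

    printf("Fmt=%u Type=0x%02x Length=%u DW\n", fmt, type, length);
}

int main(void)
{
    /* Example DW0 of a 3-DW memory write with data (Fmt=2, Type=0)
     * carrying a 16-DW (64-byte) payload; the value is built here,
     * not captured from a real link. */
    uint32_t dw0 = (2u << 29) | (0u << 24) | 16u;
    decode_tlp_dw0(dw0);
    return 0;
}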

Data Link Layer

The primary role of the link layer is to ensure reliable delivery of the packets across the PCI Express link(s). When a TLP is passed down from the Transaction Layer, the Data Link Layer prepends a 12-bit sequence number to it and then protects its contents with a 32-bit LCRC value. The Data Link Layer calculates the LCRC value based on the TLP received from the Transaction Layer and the sequence number it has just applied. The LCRC calculation uses each bit in the packet, including the reserved bits. This LCRC is then appended as a tail to the packet. The packet is then saved in a "retry buffer" and sent to the Physical Layer. A credit-based flow-control protocol ensures that packets are transmitted only when it is known that a buffer is available to receive the packet at the other end, which eliminates any waste of bus bandwidth due to resource constraints ([7]).

For incoming TLPs, the Data Link Layer accepts them from the Physical Layer and checks the sequence number and the LCRC. If they are correct, the Data Link Layer passes the TLP up to the Transaction Layer. If an error is detected (either a wrong sequence number or an LCRC mismatch), the Data Link Layer notifies the transmitter of the corrupted packet about the error and asks it to retry. These communications are made possible by Data Link Layer Packets (DLLPs). DLLPs are responsible, among other things, for communicating ACK or NAK to the transmitter. In order to see how this mechanism works, it is helpful to imagine a transaction between two devices, A and B. The Data Link Layer of the remote Device B receives the TLP and checks for CRC errors. If there is no error, the Data Link Layer of Device B returns an ACK DLLP with a sequence ID to Device A. Device A then has confirmation that the TLP has reached Device B successfully, and clears from its replay buffer the TLP associated with that sequence ID. If, on the other hand, a CRC error is detected in the TLP received at the remote Device B, then a NAK DLLP with a sequence ID is returned to Device A: an error has occurred during TLP transmission, and Device A's Data Link Layer replays the associated TLPs from the replay buffer. For a given TLP in the replay buffer, if the transmitter device receives a NAK 4 times and the TLP is replayed 3 additional times as a result, then the Data Link Layer logs the error, reports a correctable error, and re-trains the Link ([10]).
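The ACK/NAK bookkeeping just described can be summarized with a small conceptual sketch: the transmitter keeps a copy of every TLP in a replay buffer indexed by its 12-bit sequence number, an ACK drops the corresponding copy and a NAK replays the copies still held there. This is only a toy model of the mechanism, not the state machine actually defined by the specification.

#include <stdbool.h>
#include <stdio.h>

#define SEQ_MOD 4096u                    /* 12-bit sequence number space */

static bool in_replay_buffer[SEQ_MOD];   /* TLPs sent but not yet acknowledged */

/* Transmitter: a copy of the TLP is stored before it goes to the Physical Layer. */
static void tlp_sent(unsigned seq)
{
    in_replay_buffer[seq % SEQ_MOD] = true;
}

/* ACK DLLP received: the TLP with this sequence ID arrived intact,
 * so its copy can be removed from the replay buffer. */
static void on_ack(unsigned seq)
{
    in_replay_buffer[seq % SEQ_MOD] = false;
}

/* NAK DLLP received: replay every TLP still held in the buffer,
 * from the NAKed sequence ID up to the last one transmitted. */
static void on_nak(unsigned seq, unsigned next_seq)
{
    for (unsigned s = seq % SEQ_MOD; s != next_seq % SEQ_MOD; s = (s + 1) % SEQ_MOD)
        if (in_replay_buffer[s])
            printf("replaying TLP with sequence number %u\n", s);
}

int main(void)
{
    tlp_sent(0); tlp_sent(1); tlp_sent(2);
    on_ack(0);      /* TLP 0 confirmed by Device B          */
    on_nak(1, 3);   /* TLPs 1 and 2 must be re-transmitted  */
    return 0;
}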


Physical Layer

The Physical Layer connects to the Link on one side and interfaces to the Data Link Layer on the other side. The Physical Layer processes outbound packets before transmission on the Link and processes inbound packets received from the Link. The two sections of the Physical Layer associated with transmission and reception of packets are referred to as the "transmit logic" and the "receive logic". Two sub-blocks make up the Physical Layer: the logical Physical Layer and the electrical Physical Layer ([8]); both sub-blocks are split into independent transmit logic and receive logic, which allows dual-simplex communication. The transmit logic of the first sub-block (the logical Physical Layer) receives packets from the Data Link Layer and attaches to them a "start" header and a "stop" tail. Each byte of a packet is then scrambled with the aid of a Linear Feedback Shift Register (LFSR) type scrambler. By scrambling the bytes, repeated bit patterns on the Link are eliminated, thus reducing the average EMI noise generated. The resulting bytes are encoded into 8b/10b codes by the 8b/10b encoding logic. The primary purpose of encoding 8-bit characters into 10-bit symbols is to create a sufficient 1-to-0 and 0-to-1 transition density in the bit stream to facilitate the recreation of a receive clock with the aid of a PLL at the remote receiver device. Note that data is not transmitted along with a clock: instead, the bit stream contains sufficient transitions to allow the receiver device to recreate a receive clock ([10]). The parallel-to-serial converter generates a serial bit stream of the packet on each Lane and transmits it differentially at 2.5 Gbit/s (in the case of PCIe Gen1). The receiving side is, clearly, dual to the one just explained.
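To make the scrambling step more concrete, the sketch below advances a 16-bit LFSR with the Gen1/Gen2 polynomial X^16 + X^5 + X^4 + X^3 + 1 and XORs its output with the data stream. It is deliberately simplified: the real PCIe scrambler also fixes the bit ordering inside each symbol, leaves K characters unscrambled and re-seeds the register on COM symbols, none of which is modelled here.

#include <stdint.h>
#include <stdio.h>

/* Advance the LFSR by one bit (Galois form of X^16 + X^5 + X^4 + X^3 + 1)
 * and return the bit used to scramble the data stream. */
static unsigned lfsr_step(uint16_t *state)
{
    unsigned out = (*state >> 15) & 1u;
    *state = (uint16_t)(*state << 1);
    if (out)
        *state ^= 0x0039;               /* taps X^5, X^4, X^3 and X^0 */
    return out;
}

/* Scramble one byte, least significant bit first (simplified ordering). */
static uint8_t scramble_byte(uint16_t *state, uint8_t data)
{
    uint8_t result = 0;
    for (int bit = 0; bit < 8; bit++)
        result |= (uint8_t)((((data >> bit) & 1u) ^ lfsr_step(state)) << bit);
    return result;
}

int main(void)
{
    uint16_t lfsr = 0xFFFF;             /* scrambler reset value */
    const uint8_t payload[] = { 0x00, 0x00, 0xAA, 0xAA };

    /* Even a repetitive payload comes out with a richer bit pattern. */
    for (unsigned i = 0; i < sizeof payload; i++)
        printf("0x%02X -> 0x%02X\n", payload[i], scramble_byte(&lfsr, payload[i]));
    return 0;
}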

Before explaining the transmit logic of the second sub-block (the electrical Physical Layer), it is necessary to talk about the differential lines: each Lane of the PCIe standard is composed of 2 differential pairs (receiver and transmitter). A positive voltage difference between D+ and D- (the positive and negative differential lines) denotes a logical "1", while a negative voltage difference between D+ and D- denotes a logical "0"; no difference between these lines means that the driver is in the high-impedance tristate condition, which is referred to as the electrical-idle and low-power state of the Link ([10]).

The transmit/receive logic of the second sub-block (the electrical Physical Layer) is easy to understand by looking at figure 3.7. Every differential lane is AC-coupled and driven by a differential pair, which makes a PCIe board's driver and receiver short-circuit tolerant; furthermore, two devices at opposite ends of a Link can have their own ground and power planes. The AC coupling capacitor is between 75 and 200 nF. The transmitter DC common-mode voltage is established during Link training and initialization. The DC common-mode impedance is typically 50 ohms, while the differential impedance is typically 100 ohms. This impedance is matched with a standard FR4 board ([11]).

Figure 3.7: PCIe schematic physical layer

TLPs transaction example

The following steps refer to figure 3.5; the electrical Physical Layer stage is not included ([8]):

1. Device B's core passes a request for service to the PCI Express hardware interface. How this is done is not covered by the PCI Express Specification, and it is device-specific. General information contained in the request would include:
a. The PCI Express command to be performed
b. Start address or ID of the target (if address routing or ID routing are used)
c. Transaction type (memory read or write, configuration cycle, etc.)
d. Data payload size (and the data to send, if any)
e. Virtual Channel/Traffic Class information
f. Attributes of the transfer: No Snoop bit set?, Relaxed Ordering set?, etc.
2. The Transaction Layer builds the TLP header, data payload, and digest based on the request from the core. Before sending a TLP to the Data Link Layer, flow control credits and ordering rules must be applied.
3. When the TLP is received at the Data Link Layer, a Sequence Number is assigned and a Link CRC is calculated for the TLP (including the Sequence Number). The TLP is then passed on to the Physical Layer.
4. At the Physical Layer, byte striping, scrambling, encoding, and serialization are performed. STP and END control (K) characters are appended to the packet. The packet is sent out on the transmit side of the link.
5. At the Physical Layer receiver of Device A, de-serialization, framing symbol check, decoding, and byte un-striping are performed. Note that at the Physical Layer the first level of error checking is performed (on the control codes).
6. The Data Link Layer of the receiver calculates the CRC and checks it against the received value. It also checks the Sequence Number of the TLP for violations. If there are no errors, it passes the TLP up to the Transaction Layer of the receiver. The information is decoded and passed to the core of Device A. The Data Link Layer of the receiver will also notify the transmitter of the success or failure in processing the TLP by sending an Ack or Nak DLLP to the transmitter. In the event of a Nak (No Acknowledge), the transmitter will re-send all TLPs in its Retry Buffer.


Chapter 4

Pixel-ROD board

As stated at the end of chapter 2 (2.2 The road towards a PCIe-based board), the knowledge acquired with the IBL electronics made clear the limits of the BOC-ROD system, especially looking towards the future upgrade of the whole LHC detector. This led to the development of many electronic boards for the readout of such experiments. All the projects, from the electronic viewpoint, share many common features, two of which are the presence of a PCIe interface, to guarantee an extremely high throughput, and high-end FPGAs connected to many optical transceivers. Looking into high-speed devices, it was decided to keep working with FPGAs from Xilinx, upgrading to the 7-Series family. This decision was taken in order to exploit all the experience and the effort spent on the ROD board and to allow the portability of its firmware, after upgrading it to the newly introduced platform, onto this new board, named Pixel-ROD. Furthermore, given the success of the Master-Slave architecture of the ROD board, it was decided to use two FPGAs on the Pixel-ROD board. Since the process of creating and debugging such highly complex boards often turns out to be very time consuming, it was decided to design the Pixel-ROD starting from two evaluation boards made by Xilinx: the KC705 and the ZC702.

4.1 VME vs PCIe

Before discussing the path that led to the creation of the Pixel-ROD, it is necessary to understand the protocol currently used for the communication between the boards used for the readout of the ATLAS experiment, the VME bus. This knowledge will help to understand the reasons that led to the choice of the PCIe standard.


4.1.1 VME overview

The VMEbus (Versa Module Europa bus, figure 4.1) is a parallel bus introduced by Motorola in 1981. It was originally designed as the I/O bus for the then newly introduced 68000 CPU from Motorola ([12]). Since the early 1990s, VME has commanded almost half of the embedded computer board market. The 32-bit VMEbus offers a maximum bandwidth of up to 40 MB/s. The first significant change in the standard was the definition of a 64-bit version in 1995, which doubled the bandwidth to 80 MB/s. The first mechanical change was the introduction of the VME-64X specifications in 1998. The newly defined connectors were designed so that the same modules can be used in either legacy VMEbus or VME-64X backplanes, although a VME-64X board on a legacy VMEbus backplane loses some I/O functionality. Further extensions of VME signaling, 2eVME and 2eSST, have been standardized but have not been widely accepted. VME is an asynchronous bus, and the limiting factor on the number of modules per chassis is the obvious requirement of signal integrity: the number of modules in a single chassis is a function of the backplane and of how well the signals can propagate through it. The physical constraints of a rack-mounted card cage limit the maximum number of modules in a VME system to 21. The maximum data-rate achieved nowadays is 320 MB/s with the VME320 ([13]).

Figure 4.1: A VME board

4.1.2 PCIe choice

The reasons that led to the choice of a PCIe-based board are many and all valid. For the design of the Pixel-ROD it was chosen to adopt a PCIe-x8 Gen2 interface, which means, looking at the math made in chapter 3, an ideal throughput of 4 GB/s. That is an enormous data-rate if compared to the 320 MB/s of the VME standard. PCIe implements a packet-based communication with automatic re-transmission in case of error, making it more reliable and stable. Furthermore, PCIe uses a point-to-point serial communication system that overcomes the 21-board-per-crate limitation of VME. Moreover, PCIe needs fewer connections to be routed on the board and smaller connectors, which reduces the size of the board and the effort spent to route all the metal traces, making routing errors less likely to happen and easier to find. One of the most important aspects that makes PCIe the best choice for the ATLAS experiment is, anyway, the capability to be directly inserted (and even hot-plugged) into the TDAQ system without using optical links or other external boards. That gives the system plenty of flexibility.


Figure 4.2: Xilinx's KC705 evaluation board

4.2 Pixel-ROD's making of

As stated at the beginning of this chapter, since the process of creating and debugging such highly complex boards often turns out to be very time consuming, it was decided to design the Pixel-ROD from two high-end evaluation boards made by Xilinx, the KC705 and the ZC702, by merging them into a more complex and powerful board. The two boards are described hereafter in order to understand how the Pixel-ROD was made ([4]).

4.2.1 KC705

For the slave device it was decided to adopt a powerful FPGA from Xilinx's Kintex family, a good example of which is Xilinx's evaluation board named KC705. The KC705, shown in Figure 4.2, is an interesting board in many ways. The first reason for this interest is the fact that the KC705 is already a PCIe board: since the Kintex-7 FPGA on the KC705 board supports a PCIe connection up to Gen2 x8, the 8-lane PCI Express edge connector performs data transfers at the rate of 5 GT/s per lane. The PCIe transmit and receive signal traces have a characteristic impedance of 85 Ω ±10%, and the PCIe clock is routed as a 100 Ω differential pair. The KC705's main devices and features are listed below ([14]):


Kintex-7 28nm FPGA (XC7K325T-2FFG900C);

1GB DDR3 memory SODIMM;

PCI Express Gen2 8-lane endpoint connectivity;

SFP+ connector;

Two VITA 57.1 FMC Connectors (one HPC, one LPC);

10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1111 PHY;

128 MB Linear Byte Peripheral Interface (BPI) flash memory;

128 Mb Quad Serial Peripheral Interface (SPI) flash memory;

USB-to-UART bridge;

USB JTAG via Digilent module;

Fixed 200 MHz LVDS oscillator;

I2C programmable LVDS oscillator;

Kintex-7 FPGA

Kintex-7 is a powerful medium-range FPGA that can easily take over the work of the two Spartan-6 devices used on the ROD board. A comparison, in terms of basic components, between these two FPGAs is shown in figure 4.3. The Xilinx Kintex-7 XC7K325T-2FFG900 on the KC705 board has the following features ([15]):

Advanced high-performance FPGA logic based on real 6-input lookup table (LUT) technology, configurable as distributed memory;

High-performance DDR3 interface supporting up to 1866 Mb/s;

High-speed serial connectivity with 16 built-in gigabit transceivers (GTX), from 600 Mb/s up to a maximum rate of 12.5 Gb/s, offering a special low-power mode optimized for chip-to-chip interfaces;

A user-configurable analog interface (XADC), incorporating dual 12-bit analog-to-digital converters (ADC) with on-chip temperature and supply sensors;

Powerful clock management tiles (CMT), combining phase-locked loop (PLL) and mixed-mode clock manager (MMCM) blocks for high precision and low jitter;

Integrated block for PCI Express (PCIe), for up to x8 Gen2 Endpoint and Root-Port designs;

500 maximum user I/Os (excluding GTX) and about 16 Mb of block RAM (BRAM).

4.2.2 ZC702

As stated before, the second demo board taken as a starting point is the ZC702, which can be seen in figure 4.4. In chapter 1 it was shown that the ROD board uses a Virtex-5 FPGA as its master device; this FPGA embeds a hard PowerPC processor. A hard (or soft) processor allows writing software in C or C++ that can be run and changed without touching the firmware. The ZC702 Xilinx demo board was chosen because its FPGA, the Zynq-7000, embeds a hard ARM processor. Hereafter the main features of the ZC702 demo board are listed ([16]):

Zynq-7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex-A9 MPCore hard processors;

1 GB DDR3 component memory (Micron MT41J256M8HX-15E);

10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1116R PHY;

128 Mb Quad SPI flash memory;

USB-to-UART bridge;

USB JTAG interface using a Digilent module;

Two VITA 57.1 FMC LPC connectors;

Fixed 200 MHz LVDS oscillator;

Figure 4.3: Comparison between Kintex-7 and Spartan-6


Figure 4.4: ZC702 demo board

Zynq-7000 FPGA

The Zynq-7000 FPGA consists of an integrated Processing System (PS) and Programmable Logic (PL). The PS integrates two ARM Cortex-A9 MPCore processors running at up to 667 MHz, the AMBA bus, internal memories, external memory interfaces, and various peripherals including USB, Ethernet, and many others. The PS runs independently of the PL and boots at power-up or reset.


4.3 Pixel-ROD overview

The analysis of the two previous boards was essential in order to design the new Pixel-ROD board. As stated before, it was decided to design the Pixel-ROD from those two boards, by merging them into a more complex one. While merging the KC705 with the ZC702, many features had to be removed, since they are not useful for a readout board, while other features had to be redesigned as they need to be shared among all the hardware of the Pixel-ROD.

The main features removed from the KC705 were the LCD display, a few GPIO buttons and LEDs, the SD card reader and the HDMI port. From the ZC702, the SD card reader, the HDMI port, GPIO buttons and LEDs were removed, along with one of the two LPC FMC connectors, the USB port and the PMOD connectors. The removed features left space on the board for more useful ones, for example busses between the two FPGAs, which are essential in order to achieve a full "Master-Slave" architecture as desired. As a result of this merge, the principal devices on the Pixel-ROD board are:

Kintex-7 28nm FPGA (XC7K325T-2FFG900C);

Zynq-7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex-A9 MPCore processors;

2 GB DDR3 memory SODIMM (Kintex DDR3);

1 GB DDR3 component memory (Micron MT41J256M8HX-15E, Zynq DDR3);

PCI Express Gen2 8-lane endpoint connectivity;

SFP+ connector;

Three VITA 57.1 FMC Connectors (one HPC, two LPC);

10/100/1000 tri-speed Ethernet with Marvell Alaska PHY;

Two 128 Mb Quad SPI flash memory;

Two USB-to-UART bridges;

USB JTAG interface (using a Digilent module or header connection);

Two fixed 200 MHz LVDS oscillators;

I2C programmable LVDS oscillator;


4.3.1 New Features

As already mentioned, the Pixel-ROD implements redesigned or totally new features compared to the two single boards. These are: busses between the two FPGAs, a differential clock, a new JTAG chain and the power-up chain. There are three main types of busses: a 21-bit differential bus running between the two FPGAs, which provides the high-speed communication needed to achieve a ROD-like "Master-Slave" architecture; a single-bit differential line used to share a common clock between Zynq and Kintex; and a 5-bit general-purpose, single-ended bus. The JTAG chain was modified in several ways: first of all, a 12-pin (3x4) header was added in order to be able to exclude the Kintex from the JTAG chain. The second, bigger, change made to the JTAG chain is the introduction of an internal JTAG between Zynq and Kintex, in order to be able to program the slave FPGA from the master one. The power supply stage had to be redesigned: since the board is meant to be plugged inside a computer, the simple merge of the supply stages of the two demo boards would have been too space consuming ([17]). The result is shown in figure 4.5.

Figure 4.5: The Pixel-ROD board


4.3.2 Layers' specifics

The electrical and mechanical characteristics of a board are defined by two parameters: the stack-up, which defines the thickness and function of each layer, and the dimensions of the board itself. During the process of merging the two Xilinx demo boards, the first thing that proved impossible was a complete one-to-one mapping, since the total number of layers could not exceed 16. This limit is dictated by the PCIe standard, which requires a specific board thickness to fit the PCIe slot. In order to reach the required level of insulation and to reduce cross-talk phenomena, signal layers are alternated with ground layers, while power planes are placed in the innermost layers of the board, as shown in figure 4.6.

Figure 4.6: Pixel-ROD's stack-up


The size constraint was imposed by the PC case the board is designed for. It's important to remember that this board is designed to be inserted inside a TDAC computer. That imposed a maximum width of 30 cm (plus a little space for the connectors), while the height was left free in order to have enough area for all the devices ([17]). This merge and these space limits resulted in a complex layout for each layer, as can be seen in figure 4.7.

Figure 4.7: One of the 16 layers of the Pixel-ROD


Chapter 5

Pixel-ROD test results

The complexity of modern electronics requires appropriate and careful testing and a hardware wake-up phase, especially during prototyping: even a good, well-engineered project can result in a malfunctioning or even broken board, due to the limits of the manufacturing process. "Hardware wake-up" here means all the actions to be performed in order to ensure that the hardware works correctly when the board is powered up for the first time. This first step is needed to exclude major board faults and to make sure that the hardware can be correctly configured and the firmware can actually be tested. The work that has been done can be divided into two stages: the correct configuration of the power-up devices, and the validation of all the other devices and functionalities by implementing custom firmware (and software) to verify the expected performance. Before the work of this thesis, all the devices had been tested except the FMC and PCIe connections. The work accomplished in this thesis was the test of the PCIe interface (and the extension of the memory coverage, since only half of it had been validated), while the FMC connectors are still under test: only the HPC one has been validated so far.

5.1 Tests

In this section the tests already passed will be briefly discussed in order to give a complete picture of the situation. They include: the power-up supply test, the Kintex-Zynq internal bus test, the memory and UART tests of both Zynq and Kintex, and the SFP and GBTX tests. This last test won't be described here; for more information see ([4]).


5.1.1 Power-Up supply test

More sophisticated components on the board, like the FPGAs, need a special and well-designed power-up sequence in order to work correctly. This task is accomplished by the power-up stage, composed of several UCD9248 programmable switching controllers from Texas Instruments. While testing this stage, four errors were identified: the first two were due to the digital and analogue grounds that had accidentally been left floating on two different components. The third fault was caused by a too aggressive sectioning of the power supply rail. The last error was caused by the swap of three signals during the design phase ([4]). Once these errors had been resolved and the UCD9248 stack had been correctly programmed, the board was officially turned on in order to proceed with the other tests. An example of the power-up sequence can be seen in figure 5.1.

5.1.2 Kintex-Zynq internal bus test

In order to test the correct programmability via JTAG, the 200 MHz clock IC, and the correct functioning of the 21-bit differential bus that connects Kintex and Zynq (named KZ-Bus), a simple test was designed: the Zynq adds 1 to an internal 21-bit counter at every clock pulse and puts it on the KZ-Bus. The Kintex, on the other side, simply subtracts the old bus value from the newly read one; if the difference is equal to 1, a GPIO pin is pushed to 0, otherwise the same GPIO pin is pulled to 1. This means that when the Zynq count restarts and all the lanes of the bus pass from 1 to 0, a pulse appears on the designated GPIO pin. A periodic pulse is therefore seen on the GPIO pin, and its period, knowing the clock frequency, can be calculated to be about 10.5 ms, as can be seen in figure 5.2.

Figure 5.1: Power-up sequence.

Figure 5.2: GPIO pin analyzed with the oscilloscope while the described test is running.
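For reference, the pulse period quoted above follows directly from the 21-bit counter width and the 200 MHz clock:

$$ T = \frac{2^{21}}{200\ \text{MHz}} = \frac{2\,097\,152}{2 \times 10^{8}\ \text{Hz}} \approx 10.5\ \text{ms} $$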

5.1.4 Kintex and Zynq UART and memory test

All the tests that involved programming one or both FPGAs were accomplished with the help of the Vivado design suite, a development environment for Xilinx FPGAs that allows synthesizing and implementing HDL designs. While for a simple test, like the KZ-Bus test, it is possible to write the code manually in VHDL or Verilog, that's not practical for more complex designs, at least not in this phase. In order to help the designer, Vivado offers a graphic environment where complex blocks can be dropped on a canvas and connected, mostly using the AXI bus, one of the busses of the AMBA (Advanced Microcontroller Bus Architecture) family. For example, it's possible to implement a soft processor, called MicroBlaze, that can be programmed with C or C++ code. It's also possible to use a UART block that takes care of the UART connection, or a Memory Interface Generator (MIG) that interfaces the FPGA with the on-board RAM, and many other blocks. It's important to note that all the firmware tested on the Kintex was first tested on the KC705 demo board, because of the extreme similarity of the two boards. That was useful because if a firmware works fine on the KC705 and doesn't work on the Pixel-ROD, it means that the firmware is correct and an electrical error has occurred.

Memory test

In order to test the RAM connected to the Kintex, a MIG block was used to interface with the RAM, together with a MicroBlaze soft processor. A C program was designed to write 1 GB of known data inside the 2 GB RAM; then, through a custom xsct terminal, it was possible to read back the content of the area that had been written, in order to verify the correct execution of the program. This program was run on both Kintex and Zynq without any problem ([4]). An example of the output of this test can be found in figure 5.3.
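The pattern-writing part of such a program can be sketched as follows. This is only an illustrative sketch, assuming a hypothetical DDR base address and data pattern; the actual test wrote 1 GB of known data and verified it through the xsct terminal rather than in C.

#include <stdint.h>

/* Hypothetical DDR base address: the real value depends on the address map
 * chosen for the MIG in the Vivado block design. */
#define DDR_BASE   ((volatile uint32_t *)0x80000000u)
#define TEST_WORDS (256u * 1024u * 1024u)   /* 1 GB written as 32-bit words */

int ram_test(void)
{
    uint32_t i;

    /* Fill the region with known, address-dependent data. */
    for (i = 0; i < TEST_WORDS; i++)
        DDR_BASE[i] = i ^ 0xA5A5A5A5u;

    /* Read back and verify (in the thesis this step was done from the xsct terminal). */
    for (i = 0; i < TEST_WORDS; i++)
        if (DDR_BASE[i] != (i ^ 0xA5A5A5A5u))
            return -1;   /* mismatch found */

    return 0;            /* all words verified */
}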

UART test

To test the UART connectivity, a UART block and a MicroBlaze were implemented. On the MicroBlaze an echo server was programmed to accomplish a simple task: when a message is sent from the UART of a connected computer, the MicroBlaze on the Pixel-ROD takes the message and sends it back to the computer. This test worked fine on the Zynq but failed on the Kintex. That was due to the Ethernet driver IC that implements the link between the FPGA and the Ethernet connector: it was noticed that it wasn't powered, because its power supply rail wasn't connected to any power source. Once this bug was fixed, the test started to work on the Kintex side of the board as well ([4]). An example of the output of this test can be found in figure 5.4.

Figure 5.3: An example of the output given by the execution of the RAM test.

Figure 5.4: An example of the output given by the execution of the ECHO test.


5.2 PCI-Express Validation

In order to validate the PCIe port of the Pixel-ROD, presented in the previous chapters (chapter 3), a series of tests was designed and run. It was clear since the first attempt that this kind of work needs a more complex workbench than the one available at the beginning. To test the hardware in a reliable and safe way, the entire test was first tried on the KC705 development board and then implemented on the Pixel-ROD. That was possible because the Kintex side of the Pixel-ROD is very similar to the KC705. The workbench was composed of the KC705 (at the beginning), an open-air PC where it was possible to plug in a PCIe card, and a computer running the Vivado design suite to program the board. How Linux implements the required drivers is described later in this chapter.

5.2.1 First Attempt

The first attempt was aimed at verifying the presence of the hardware connections. In order to do so, a simple test was conceived, coded and programmed onto the KC705. To verify the connections between the FPGA and the connector, the VHDL code was designed to transmit a simple square wave on each PCIe pin, so that each pin could then be analyzed with an oscilloscope. This test never saw the light: it was impossible for the Vivado design suite to compile the project. The reason is quite simple: the pins used for the PCIe connection are special-purpose pins; they are 32 pins (PCIe-x8 Gen2, 8 lanes, 4 pins per lane) served by the GTX transceivers. That means it's impossible to drive the pins directly; to use them it's necessary to command the GTX transceivers instead. The first attempt was, hence, a failure, but a failure that led to a deeper understanding of the path to follow to validate and use the PCIe in a more efficient way. The next steps are divided into two main sides, a PC side and an FPGA side, both needed to create a communication system used for validation and performance measurements. The work described hereafter is based on a custom driver mounted on an Ubuntu PC and a firmware programmed onto the KC705. The Ubuntu PC was completely "naked", in order to easily plug in the PCIe board and make possible the communication between the custom program on the PC and the custom firmware. All this work was made possible by the Xilinx online firmware examples and driver source code, used and modified in order to accomplish this job ([18]), and with the help of the IP integrator (explained in section 5.1.4).

5.2.2 XDMA IP core and firmware implementation

The work-flow used to validate and measure the performance of the PCIe bus on the Pixel-ROD is described hereafter. First, a custom firmware was designed and loaded into the FPGA of the KC705, the Kintex. This firmware implements a Direct Memory Access (DMA) engine that works as a bridge from the PCIe port to the memory and from the memory to the PCIe port. Secondly, a C program was implemented on the Ubuntu PC. This program uses the functionalities provided by a custom driver and carries out read and write operations from and to the RAM of the board plugged into the PCIe slot of the PC. In this section, the DMA IP core and its firmware implementation are described.

DMA/Bridge subsystem for PCIe

The DMA/Bridge Subsystem for PCIe (also called XDMA) is an IP block that implements a high-performance, configurable Scatter-Gather DMA for use with PCIe Gen2.1 and Gen3.x. The IP provides an optional AXI4 or AXI4-Stream user interface. The XDMA can be configured to be either a high-performance direct memory access (DMA) data mover or a bridge between the PCI Express and AXI memory spaces. The master side of this block issues read and write requests on the PCIe, and its core enables the user to perform direct memory transfers, both Host to Card (H2C) and Card to Host (C2H). The core can be configured to have a common AXI4 memory-mapped interface shared by all channels or an AXI4-Stream interface per channel. Memory transfers are specified on a per-channel basis in descriptor linked lists, which the DMA fetches from host memory and processes. Events such as descriptor completion and errors are signalled using interrupts. The core also provides a configurable number of user interrupt wires that generate interrupts to the host ([19]). This IP core can be seen in figure 5.5 along with its ports.

Internally, the core can be configured to implement up to eight independent physical DMA engines. These DMA engines can be mapped to individual AXI4-Stream interfaces or to a shared AXI4 memory-mapped (MM) interface to the user application. On the AXI4 MM interface, the DMA Subsystem for PCIe generates requests and expects completions. The AXI4-Stream interface deals with data only. On the PCIe side, the DMA has internal arbitration and bridge logic to generate read and write transaction layer packets (TLPs) on the PCIe. This is done over the Integrated Block for PCIe Requester Request (RQ) bus, while completions from PCIe are accepted over the Integrated Block for PCIe Requester Completion (RC) bus.

Figure 5.5: Symbol of the XDMA IP core with its TX and RX ports visible on the right

The connection with the physical bus is made through the signals pci_exp_rxn[7:0], pci_exp_rxp[7:0], pci_exp_txn[7:0] and pci_exp_txp[7:0], while the communication with the memory starts from the M_AXI port. A series of parameters makes this IP core customizable and flexible. Inside the configuration menu it's possible to configure the data width, the number of lanes desired (2, 4 or 8), the reference clock, the type of DMA (memory mapped or stream), the Base Address Registers (BAR) used, the number of interrupts and channels desired, and many other options. Before explaining the firmware implementation, it's necessary to briefly discuss another block: the Memory Interface Generator.

Memory Interface Generator

The Memory Interface Generator (MIG) IP core ([20]) is a controller and physical layer for interfacing 7-series FPGAs, as well as other AXI4 slave devices, to DDR3 memory. Given the wide variety of DDR3 modules and components available, this IP core is very flexible and configurable. The IP core can be seen in figure 5.6.

Figure 5.6: Example of a MIG; on the right, the DDR3 port connects the board to the physical world, while S_AXI is the port on which data are received or transmitted.


Firmware implementation

As stated above, the firmware was entirely designed using the IP integrator. It's composed of an XDMA, a Memory Interface Generator and other support logic needed to correctly connect these two blocks and make them work properly; the design of this firmware can be found in figure 5.6. It's important to remember that for every block a corresponding chunk of code is generated. The flow of the data follows the path of a classical DMA implementation: in case of a write, the address and the data are presented on the pci_express_x8 port in the form of PCIe TLPs; the DMA core then communicates with the MIG through the AXI interconnect, and this last block writes into the memory at the given address using the previously configured options. In case of a read, only the address is given on the pci_express_x8 port; the request is passed to the MIG, which replies with the data requested.

Figure 5.6: Firmware obtained with the IP integrator. The input/output ports can easily be found on the right: ddr3_sdram and pci_express_x8. On the left it's possible to see the internal signals needed to bring in the system clock and reset.

In the following tests the XDMA is configured to work, on the PCIe side, as a PCIe-x8 Gen2 completer. These are the highest specifications that both the KC705 and the Pixel-ROD can reach, which gives the system an ideal throughput of 4 GB/s (see section 3.2). It is possible, anyway, to configure the whole system to work with lower specifications, like PCIe-x4 Gen1.

5.2.3 Linux custom driver

The firmware seen above gives the user the ability to read from and write to the memory through the PCIe protocol. Hence, once the firmware is designed and the board is plugged into the naked Linux PC, software that communicates via PCIe is needed to validate and measure this protocol. This kind of software is always divided into two major parts: a device driver and a C program that uses that driver. The device driver is needed in order to obtain a mid-layer placed between the hardware and the pure software; its main task is to provide a variety of functionalities coded at a higher level of abstraction.

Linux device driver

The choice of Ubuntu as operating system was driven by the need for an open operating system of the Linux family, because, as said above, a custom driver was needed to accomplish all the tasks. In Linux every driver is conceived as a module that can be inserted or removed at any time; each module is written in C and, in this thesis, compiled using the latest version of the GCC compiler. It's important to understand that a module, even if coded in C, doesn't use the classic C functions and libraries. A Linux OS can be divided into two major spaces: user space and kernel space. A module operates inside the kernel space, where it can access different libraries and the hardware. In order to have access to this program space, a C source file needs to be compiled with specific options and by specifying the precise path where the compiler can retrieve the libraries; this path is given by the folder that contains the source code of the OS kernel. During the process of coding and compiling both the module C code and the user programs, no IDE was used: a custom Makefile was used instead, in order to have complete control over the process and the products of the compilation. An enormous variety of modules exists, designed for different tasks. The kind of module used in this thesis is a device driver, and specifically a device driver called a char driver. A device driver is a module that makes available to user space a series of functions used to command the hardware beneath. As for modules, various kinds of device drivers exist; in this thesis a char driver was used. A char driver is a module that makes possible the communication with the hardware from user space by creating a special virtual file into which a user program can read or write ([21]). For example, if a user-space program wants to communicate with a hardware port (a PCIe port or any other), it only needs to read or write the message on this special virtual file.
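As a minimal illustration of these concepts, the skeleton below registers a char device from a loadable kernel module; the device name "pixelrod" and the empty operations table are placeholders only, since the real Xilinx module registers its own devices and fills the table with the functions discussed in the following paragraphs.

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

/* Placeholder operations table: a real driver fills in .open, .read, .write, ... */
static const struct file_operations pixelrod_fops = {
    .owner = THIS_MODULE,
};

static int major;   /* major number assigned by the kernel */

static int __init pixelrod_init(void)
{
    /* Register the char device; a /dev node can then be created for it. */
    major = register_chrdev(0, "pixelrod", &pixelrod_fops);
    return (major < 0) ? major : 0;
}

static void __exit pixelrod_exit(void)
{
    unregister_chrdev(major, "pixelrod");
}

/* The module is inserted with insmod and removed with rmmod. */
module_init(pixelrod_init);
module_exit(pixelrod_exit);
MODULE_LICENSE("GPL");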

The work of a char driver is mainly the implementation of three data structures: file, inode and file_operations.

File is a structure defined in <linux/fs.h>, and it is one of the most important data structures used in device drivers. Note that a file has nothing to do with the FILE pointers of user-space programs: a FILE is defined in the C library and never appears in kernel code, while a struct file is a kernel structure that never appears in user programs. The file structure represents an open file (it is not specific to device drivers; every open file in the system has an associated struct file in kernel space). It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure. In the kernel sources, a pointer to struct file is usually called either file or filp ("file pointer") ([21]).

Inode is a structure used by the kernel internally to represent files. It is therefore different from the file structure, which represents an open file descriptor: there can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

File_operations is a structure that defines how a char driver sets up its connection. The structure, defined in <linux/fs.h>, is a collection of function pointers. Each open file (represented internally by a file structure) is associated with its own set of functions, through a field called f_op that points to a file_operations structure. The operations are mostly in charge of implementing the system calls and are therefore named open, read, and so on. It's possible to consider the file to be an "object" and the functions operating on it to be its "methods", using object-oriented programming terminology to denote actions declared by an object to act on itself. This is one of the first signs of object-oriented programming that can be seen in the Linux kernel.

Custom driver

The development of a char driver intended to work with the previously described XDMA was simplified by an existing module released on the Xilinx web page ([18]). This char driver implements a basic PCI Linux driver, enriching it with PCIe-specific functions. In order to read and write data on the PCIe bus, two simple functions are available: read and write. The principal FOPS (File Operations Struct) is reported in figure 5.7. This structure, as can be seen, contains a series of pointers (.owner, .open, .release, etc.), each of which points to a specific function. For example, .write points to char_sgdma_write, which is defined as static ssize_t char_sgdma_write(struct file *file, const char __user *buf, size_t count, loff_t *pos);. This function is called every time a user-space write operation is performed on the file created by the char driver.
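In code, the structure shown in figure 5.7 has roughly the shape sketched below. Only char_sgdma_write is named explicitly in the text; the other handler names and the stub bodies are assumptions that mirror the same naming pattern.

#include <linux/fs.h>
#include <linux/module.h>

/* Stub handlers: the real Xilinx driver programs the Scatter-Gather DMA here. */
static int char_sgdma_open(struct inode *inode, struct file *file)
{
    return 0;
}

static int char_sgdma_close(struct inode *inode, struct file *file)
{
    return 0;
}

static ssize_t char_sgdma_read(struct file *file, char __user *buf,
                               size_t count, loff_t *pos)
{
    return 0;        /* would start a C2H transfer and copy the data to buf */
}

static ssize_t char_sgdma_write(struct file *file, const char __user *buf,
                                size_t count, loff_t *pos)
{
    return count;    /* would start an H2C transfer from buf */
}

/* Each pointer is invoked when the corresponding system call is issued
 * on the device file created by the char driver. */
static const struct file_operations sgdma_fops = {
    .owner   = THIS_MODULE,
    .open    = char_sgdma_open,
    .release = char_sgdma_close,
    .read    = char_sgdma_read,
    .write   = char_sgdma_write,
};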

Figure 5.7: one of the FOPS of the char driver


5.2.4 Echo Test

Once all the C source modules had been compiled and inserted into the OS, it was possible to start the first functional test. It is divided into two major parts: a Linux script and two C source files. The concept of the test is simple: the user writes a text string to the RAM of the board plugged into the naked PC, through the PCIe port, at a specific address; the program then reads back from the same address and compares the data read with the data written.

Script

Before starting to loop, a check is performed in order to find out whether the script is executed as super user or not; then a helper program is launched, used to retrieve the name of the device file created by the char driver. Then an infinite loop is executed: inside this loop, first a string is requested from the user, then this string is passed to a dedicated write program along with the desired write address and an offset. After the execution of the write program, a read program is called with the address, the offset, the name of the file to read and the string expected.

C source code

The C code implemented to read and write on the PCIe board consists of two source files: one used to write a given number of bytes at a given address, and one used to read with the same logic. The reason behind the choice of two separate programs instead of a single source is modularity, in order to be able to reuse the programs with only a few modifications. These two source files and the script itself were obtained by analyzing and understanding the example source code that Xilinx provides with its driver ([18]). The write program works as follows: first, all the arguments passed by the script are retrieved using specific functions, and the information they contain is extracted and saved (the name of the file, the string to write, etc.). The string is then copied into an aligned, dynamically allocated buffer, and the device file is opened using its name. Using the lseek function the program selects the address to write to; then, with a simple write, the program writes to the file. It's important to remember that all these functions (lseek, write and read) are system calls, not normal functions defined by the programmer. Such system calls, when used on this file, invoke the functions registered in the previously seen FOPS. The read program works in the same way, except that it uses the system call read instead of write and performs the comparison between the string read and the expected one.
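A minimal sketch of the write side is reported below, assuming a device node name like /dev/xdma0_h2c_0 and page alignment for the DMA buffer; the real programs take these values from the script arguments and from the Xilinx example code. The read program mirrors it, calling read instead of write and comparing the result with the expected string.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <device> <address> <string>\n", argv[0]);
        return 1;
    }

    const char *dev = argv[1];                       /* e.g. /dev/xdma0_h2c_0 (assumed) */
    off_t addr = (off_t)strtoull(argv[2], NULL, 0);  /* target address in board RAM */
    size_t len = strlen(argv[3]) + 1;

    /* Page-aligned buffer, as expected by the DMA engine. */
    char *buf = NULL;
    if (posix_memalign((void **)&buf, 4096, len) != 0)
        return 1;
    memcpy(buf, argv[3], len);

    int fd = open(dev, O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* lseek selects the target address, write triggers the host-to-card transfer. */
    if (lseek(fd, addr, SEEK_SET) < 0 || write(fd, buf, len) != (ssize_t)len) {
        perror("write");
        return 1;
    }

    close(fd);
    free(buf);
    return 0;
}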

The test was first carried out on the KC705 board programmed with the firmware seen above. Once it worked on the demo board, validating the software and the firmware, it was moved to the Pixel-ROD, where it worked flawlessly. The result of this test was of fundamental importance: its success meant that the hardware was functioning, even without the information about speed and the other data collected in the following tests.

5.2.5 Buffer and throughput Test

In order to measure the effective throughput of the board, it was first necessary to understand the most correct way to do it: when writing buffers of different sizes, the total time needed to transfer a fixed amount of data changes by two orders of magnitude in the worst case (the time was measured with a function that calls the system clock). That's due to two main reasons: the packet-oriented way the PCIe standard transfers data, and the implementation of a DMA that works more efficiently when it's in charge of large transfers. Having a big buffer and carrying out a burst transfer on the AXI bus towards the MIG is more efficient than repeating a small buffer many times. In order to find the optimal buffer size a simple test was designed: the concept is to transfer the same repeated sequence of data (the whole ASCII table, integers from 0 to 255) and the same total amount of bytes, 100 MB, each time with a different buffer size. The first buffer has a size of 1 KB, the second 2 KB, and so on in powers of 2, up to the last one, an 8 MB buffer.

The firmware used is the same for all three tests. The C code on the PC is very similar to the one used for the echo test: it first instantiates and initializes all the twelve buffers. For each of these buffers the program loops in a block where the current buffer is repeatedly written to the memory of the board at increasing addresses, until a complete 100 MB write is done. Secondly, the whole transfer written before is read back with a reading buffer of the same size and compared with its copy on the PC. This last step isn't used for the performance measurement (because the comparison of each buffer read requires a lot of time) but for error detection: the presence of memory failures could not be excluded.
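The structure of the write loop can be sketched as follows; the device file descriptor and the pattern buffer are assumed to be prepared as in the echo test, and clock_gettime stands in for the system-clock function mentioned above.

#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define TOTAL_BYTES (100u * 1024u * 1024u)   /* 100 MB written for every buffer size */

/* lseek to the target address on the device file, then write len bytes. */
static void write_chunk(int fd, off_t addr, const char *buf, size_t len)
{
    if (lseek(fd, addr, SEEK_SET) < 0 || write(fd, buf, len) != (ssize_t)len)
        perror("write_chunk");
}

void throughput_sweep(int fd, const char *pattern /* repeated ASCII 0..255, at least 8 MB long */)
{
    for (size_t buf_size = 1024; buf_size <= 8u * 1024u * 1024u; buf_size *= 2) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* Write the pattern at increasing addresses until 100 MB are covered. */
        for (off_t addr = 0; addr < (off_t)TOTAL_BYTES; addr += (off_t)buf_size)
            write_chunk(fd, addr, pattern, buf_size);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("buffer %8zu B: %.3f s -> %.1f MB/s\n",
               buf_size, s, TOTAL_BYTES / (1024.0 * 1024.0) / s);
    }
}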

The test, applied first to the KC705, revealed a throughput peak when using buffers of 2 MB. This peak corresponds to a data rate of about 2.5 GB/s, as can be seen in figure 5.8. Even if that's quite far from the theoretical throughput of 4 GB/s, the result is correct and as expected. In order to understand the reason for this difference, three points are essential: first, 4 GB/s is the total throughput, which includes the header and tail overhead of each PCIe packet, not only its data. Second, everything, from the firmware to the driver, is intended to validate the PCIe, not to achieve fast transactions. Third, 4 GB/s is the peak data rate of the PCIe-x8 Gen2 protocol; the effective data rate depends on many variables, first of all the speed of the RAM on the board. Once this software ran flawlessly on the KC705, it was ported unchanged to the Kintex of the Pixel-ROD. The test run on the two boards revealed the same throughput.


Figure 5.8: The output produced by the throughput test. It's easy to see that every time the buffer grows, the time needed to write 100 MB decreases, while the read operation is dominated by the time used for the comparison.

This test was intentionally limited to use a maximum buffer of 2 MB, following the conclusions reported above.


5.2.6 BER Test

The last test carried out is a pseudo-BER test. In order to estimate the BER (Bit Error Rate), a simple but effective test was produced. The core concept is to write and read large amounts of data, in the same way as in the previous test, for a long period of time. The test, executed first on the KC705, consists of a C program that, in a manner similar to the throughput test, writes a well-defined sequence into the RAM of the board, then reads it back, comparing the results and saving them to a file. There are, anyway, several differences between the two tests. First of all, the throughput test reads and writes only 100 MB for each buffer in order to find the "fastest" buffer; this second test, instead, writes and reads 1 GB (all the RAM of the KC705) at a time, using only the 2 MB buffer (the fastest one, as seen before), and runs relentlessly until the user stops it. The second main difference is the sequence used: during the throughput test the data sequence was of no interest, but to calculate a BER that accounts for all possible factors the sequence is of extreme importance. This test calculates, indeed, a cumulative BER that accounts for both PCIe and on-board RAM failures. In order to provoke all the possible errors, a special sequence was used: it re-writes each memory cell, within a short time, with a bit configuration at maximum Boolean distance from the previous one. For example, if during the first iteration the program writes 0x00 (in hexadecimal) into the first cell, at the second iteration it will write 0xFF, and so on for all the cells of the RAM. This is the pattern used by MemTest86 for testing high-end PC RAM ([22]).
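In code, the inversion step that generates each new pattern can be as simple as the sketch below (illustrative only; the real program applies it to the 2 MB transfer buffer before every full-RAM pass):

#include <stddef.h>
#include <stdint.h>

/* Flip every bit of every byte, so each cell is rewritten with the value at
 * maximum Boolean (Hamming) distance from the one it held before
 * (0x00 -> 0xFF -> 0x00 -> ...). */
void next_pattern(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)~buf[i];
}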

As seen before, the maximum speed achieved in writing is 2.5 GB/s but, since every buffer read must be compared with the expected one, the average speed dropped below 800 MB/s. The BER expected was in any case of the order of 10^-21. Considering also that, in order to obtain a good data interpretation, the registration of at least 20 errors is needed, it's easy to estimate the execution time required:
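A rough reconstruction of that estimate, assuming a sustained rate of 800 MB/s and a target of about 20 recorded errors at a BER of 10^-21, gives:

$$ t \approx \frac{20\ \text{errors}}{10^{-21}\ \tfrac{\text{errors}}{\text{bit}} \times 800\ \tfrac{\text{MB}}{\text{s}} \times 8\ \tfrac{\text{bit}}{\text{B}}} \approx 3 \times 10^{12}\ \text{s} \approx 10^{5}\ \text{years} $$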


That result is obviously absurd: in order to test such a BER, other methods are used. This test is therefore used to ensure that no error was introduced during the sizing of the parameters of the various interconnections. The test was first run on the KC705 in order to determine its stability and reliability; then it was implemented on the Pixel-ROD, this time with a small change: instead of writing only 1 GB at a time, the source code was changed to write and read the entire 2 GB RAM of the Pixel-ROD.

The test was executed for nearly 10 hours, transferring 24 TB with no errors. This result can be interpreted as evidence that the effective BER is below 10^-14 and that the sizing of the connections is solid and well made.
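As an order-of-magnitude check of that bound: with zero errors observed over 24 TB, the error rate can be no larger than roughly one error per number of bits transferred,

$$ \text{BER} \lesssim \frac{1}{24\ \text{TB} \times 8\ \tfrac{\text{bit}}{\text{B}}} \approx \frac{1}{1.9 \times 10^{14}\ \text{bit}} \approx 5 \times 10^{-15} < 10^{-14}. $$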


Conclusion and future development

At present (July 2017) two prototypes of the Pixel-ROD board have been produced and, with the work of this thesis, the PCIe validation and tests now join the already tested peripherals and devices. All the tests are carried out in the Electronic Design Laboratory of INFN (Istituto Nazionale di Fisica Nucleare) and of the Physics and Astronomy Department of the University of Bologna. Even if the validation work is nearing completion, some more tests need to be created and performed in order to achieve a full validation of both prototypes. Even if it's a first release, the Pixel-ROD board has already shown, during the tests, its capability to perform fast data acquisition and processing.

The work described in this thesis focused on understanding the PCIe protocol and on the implementation of firmware and software on the Pixel-ROD board. This allowed validating the peripherals, making the board a good candidate for data acquisition in high-energy physics thanks to its reliability and bandwidth. Therefore, future tests will aim to integrate the Pixel-ROD within a simple data acquisition chain that implements the well-known standard of the ATLAS Pixel Detector, or even the AURORA protocol of the CMS experiment.

In conclusion, looking towards the LHC upgrades and beyond, the Pixel-ROD board, or its future upgrades, represents an important step toward the modernization of data acquisition, especially for the readout of pixel detectors, and makes the Pixel-ROD a good candidate to be part of an electronic readout chain for tracking systems.


Acknowledgements

This thesis is the result of a wonderful collaboration with the Department of Physics and Astronomy of the University of Bologna; I would therefore like to thank those without whom none of this would have been possible.

First of all I thank Professor Gabrielli, supervisor, and Professor Mauro Villa, co-supervisor, for their essential support during this important period.

I then thank all my colleagues at the University of Bologna and my lifelong friends who supported me along this path, with particular mention of Giulio Masinelli, who shared the road with me.

Finally, I would like to thank my family and my girlfriend, the two compasses that have guided me this far and with whom I have not yet finished travelling.


Bibliography

[1] CERN. [Online] https://home.cern/.

[2] ATLAS website. [Online] http://atlas.cern/.

[3] The ATLAS Collaboration, ATLAS detector and physics performance Technical Design Report.

[4] C. Preti, A PCIe-based readout and control board to interface with new-generation detectors for the LHC upgrade, 2016.

[5] G. Balbi, D. Falchieri, A. Gabrielli, L. Lama, R. Travaglini, S. Zannoli, IBL ROD board rev C reference manual, November 2012, https://espace.cern.ch/.

[6] I. Béjar Alonso et al., High-Luminosity Large Hadron Collider (HL-LHC) Preliminary Design Report, CERN-2015-005, 17 December 2015, CERN, Geneva.

[7] National Instruments, http://www.ni.com/white-paper/3767/en/.

[8] Ravi Budruk, Don Anderson, Tom Shanley, PCI Express System Architecture, 2008.

[9] PCI-SIG, https://pcisig.com/specifications/pciexpress/.

[10] M. Dinesh Kumar, Implementation of PCS of Physical Layer for PCI Express, 2009.

[11] Rick Eads, Keysight Technologies, PCI Express Electrical Basics, 2014.

[12] National Instruments, http://www.ni.com/white-paper/5712/en/.

[13] VITA website, http://www.vita.com/page-1855175#G.

[14] Xilinx, KC705 Evaluation Board User Guide (UG810), 26 August 2016, https://www.xilinx.com/support/documentation/boards_and_kits/kc705/ug810_KC705_Eval_Bd.pdf.

[15] Xilinx, 7 Series FPGAs Overview, DS180 (v1.17), 27 May 2015, https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf.

[16] Xilinx, ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC, UG850 (v1.5), 4 September 2015, https://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pdf.

[17] L. Lama, G. Balbi, D. Falchieri, G. Pellegrini, C. Preti, A. Gabrielli, A PCIe DAQ board prototype for Pixel Detector in High Energy Physics, Journal of Instrumentation 12, C01073, 2017.

[18] Xilinx website, https://www.xilinx.com/support/answers/65444.html.

[19] Xilinx, DMA/Bridge Subsystem for PCI Express (PG195), https://www.xilinx.com/support/documentation/ip_documentation/xdma/v3_1/pg195-pcie-dma.pdf.

[20] Xilinx, AXI Traffic Generator, PG125 (v2.0), 6 April 2016, https://www.xilinx.com/support/documentation/ip_documentation/axi_traffic_gen/v2_0/pg125-axi-traffic-gen.pdf.

[21] Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman, Linux Device Drivers, third edition.

[22] MemTest86 website, http://www.memtest86.com/.