ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA
SCUOLA DI INGEGNERIA E ARCHITETTURA
Dipartimento di Ingegneria dell'Energia Elettrica
e dell'Informazione "Guglielmo Marconi" - DEI
Corso di laurea triennale in
Ingegneria elettronica e delle telecomunicazioni
TESI DI LAUREA
in
Fisica Generale T2
Validation and implementation of PCIe on a FPGA-based custom
board for pixel-detector LHC upgrade
CANDIDATO: Silvio Zanoli
RELATORE: Chiar.mo Prof. Alessandro Gabrielli
CORRELATORE:
Chiar.mo Prof. Mauro Villa
Anno Accademico 2016/2017
Sessione I
Abstract
This thesis mainly concerns the understanding, validation and demonstration-oriented implementation of readout electronics for pixel detectors installed on high-energy physics experiments at CERN, Geneva. In particular, this thesis deals with the PCIe peripheral of a new board built around two FPGAs, called Pixel-ROD (Pixel Read Out Driver), conceived as the natural successor of the current series of readout boards mounted today in the ATLAS Pixel Detector. It is a natural successor because the present ATLAS Pixel Detector readout boards were designed and built by, and are still under the responsibility of, the Department of Physics and Astronomy of Bologna and the Bologna Section of the Istituto Nazionale di Fisica Nucleare.
The Pixel-ROD project started three years ago, since the general trend in the evolution of the off-detector electronics of the LHC (Large Hadron Collider) is to abandon the older VME interface in favour of newer, faster ones (such as PCIe). Moreover, since the ATLAS and CMS detectors, which are the main LHC experiments, will share the same readout chip interfacing the future Pixel Detectors, the Pixel-ROD could be used not only for the ATLAS upgrade but also for other experiments.
The main feature of the Pixel-ROD is that it can be used both as a standalone readout board and in a real data-acquisition chain, interfacing with third-party devices.
The work described in this thesis starts with a general understanding of the (already existing) board, continues with the development of test programs on Linux systems to verify the actual operation of the PCIe interface, with the aim of validating that peripheral, and ends with the development of a demonstration software that simulates a realistic high-throughput data flow.
Contents
Introduction ...............................................................................................................1
1 LHC .........................................................................................................................2
1.1 Main experiments at LHC ...................................................................................3
1.2 The ATLAS detector ...........................................................................................3
1.3 ATLAS' structure ................................................................................................4
1.3.1 Inner Detector.............................................................................................6
2 Off-detector electronics............................................................................................9
2.1 IBL’s electronics .................................................................................................9
2.1.1 IBL BOC.....................................................................................................10
2.1.2 IBL ROD ....................................................................................................12
2.1.3 TIM...........................................................................................................14
2.1.4 SBC...........................................................................................................14
2.2 The road towards PCIe based board..................................................................14
3 PCI-Express overview..............................................................................................15
3.1 Old standard's crisis .........................................................................................15
3.2 PCI-Express standard .......................................................................................17
3.2.1 PCI-Express topology .................................................................................19
3.2.2 PCI-Express Layers .....................................................................................21
4 Pixel-ROD board.....................................................................................................28
4.1 VME vs PCIe ....................................................................................................28
4.1.1 VME overview ...........................................................................................29
4.1.2 PCIe choice ...............................................................................................29
4.2 Pixel-Rod's making of .......................................................................................31
4.2.1 KC705 .......................................................................................................31
4.2.2 ZC702 .......................................................................................................33
4.3 Pixel-Rod overview.......................................................................................... 35
4.3.1 New Features ........................................................................................... 36
4.3.2 Layers' specifics ........................................................................................ 37
5 Pixel-ROD tests results ........................................................................................... 39
5.1 Test completed ............................................................................................... 39
5.1.1 Power-Up supply test ............................................................................... 40
5.1.2 Kintex-Zynq internal bus test ..................................................................... 40
5.1.4 Kintex and Zynq UART and memory test..................................................... 41
5.2 PCI-Express Validation ..................................................................................... 43
5.2.1 First Attempt ............................................................................................ 43
5.2.2 XDMA IP core and firmware implementation.............................................. 44
5.2.3 Linux custom driver................................................................................... 48
5.2.4 Echo Test.................................................................................................. 51
5.2.5 Buffer and throughput Test ....................................................................... 52
5.2.6 BER Test ................................................................................................... 54
Conclusion and future development .......................................................................... 57
Bibliography............................................................................................................. 59
Introduction
This thesis mainly concerns the comprehension, validation and demonstration-oriented implementation of pixel readout electronics for high-energy physics experiments at CERN, Geneva. In particular, this thesis deals with the PCIe peripheral implemented on a new FPGA-based board, named Pixel-ROD (Pixel Read Out Driver), conceived as a natural follow-up of the present series of readout boards implemented in the ATLAS Pixel Detector. It is a natural follow-up because the current ATLAS Pixel Detector readout boards were designed and built by, and are still under the responsibility of, the Department of Physics and Astronomy of Bologna and the Bologna Section of the National Institute for Nuclear Physics. The Pixel-ROD project started three years ago, since the general trend for the off-detector LHC (Large Hadron Collider) Phase-2 electronics upgrade is to leave the older VME interface for newer and faster buses (such as PCIe). Moreover, as the ATLAS and CMS experiments, which are the main LHC experiments, will share the same readout chip that will interface the future Pixel Detectors, the Pixel-ROD board could be used not only for the ATLAS upgrade, but also for other experiments. The main feature of the Pixel-ROD board is that it can be used either as standalone readout electronics or in real data-acquisition chains, interfacing with third-party devices.
This thesis is intended to provide a brief overview of the environment in which the Pixel-ROD board was conceived. After this Introduction, Chapter One summarizes the ATLAS experiment, focusing on the detectors' point of view. Chapter Two describes the current off-detector electronics, Chapter Three gives an overview of the PCIe standard, Chapter Four presents the Pixel-ROD board, and Chapter Five reports the tests that have been carried out so far. Finally, the Conclusions depict the current test status, the results obtained, and the future analyses that will be performed in order to match a real application on a specific experiment.
In particular, the work performed in this thesis focuses on the development of simple test programs on Linux systems aimed at validating the PCIe interface present on the board, ending with the development of a demo software that simulates a high-throughput data flow.
Chapter 1
LHC

LHC (Large Hadron Collider) is the largest and most powerful particle accelerator ever built. It is placed in the tunnel that housed LEP (the Large Electron-Positron collider) in Geneva and is managed by the European Organization for Nuclear Research, also known as CERN (Conseil Européen pour la Recherche Nucléaire). LHC is made up of a 27 km long ring ([1]) placed at a depth of about 100 m. There are four points where protons are forced to collide and where four huge detectors (known as ATLAS, ALICE, CMS and LHCb) are set up to record every detail of the particle collisions, providing a huge amount of data to analyse (see Figure 1.1).
Protons are not injected directly into the beam pipe of the main ring; they first undergo a sequence of pre-accelerators, thanks to which they reach the desired energy of 6.5 TeV per beam. The nominal maximum proton collision energy in LHC is 14 TeV; however, the accelerator is currently working with a collision energy of 13 TeV, since this choice optimizes the powering system. At this energy, protons travel at a speed very close to the speed of light in vacuum. Under nominal conditions, each beam contains 2808 bunches of protons. The beam is held in the accelerator ring by 1232 superconducting dipole magnets, which create a maximum magnetic field of 8.3 T, and is focused by 392 quadrupole magnets.

Figure 1.1: LHC overview
1.1 Main experiments at LHC
As already stated, there are four main interaction points along the LHC ring where the particles are forced to collide and where the detectors are placed. These detectors are:
ATLAS, A Toroidal LHC ApparatuS: it is one of the two general-purpose detectors at LHC. This detector investigates a wide range of physics problems, ranging from searches for the Higgs boson or dark matter to further measurements of the Standard Model parameters.
CMS, Compact Muon Solenoid: it is the second general-purpose detector at LHC. It shares the same scientific goals as ATLAS, although it uses different technical solutions and a different magnet-system design.
LHCb, Large Hadron Collider beauty: this experiment is specialized in investigating the slight differences between matter and antimatter by studying particles containing the beauty quark.
ALICE, A Large Ion Collider Experiment: it is a heavy-ion collision detector. It is designed to study the physics of strongly interacting matter at extreme energy densities, where matter forms a new phase called the quark-gluon plasma.
1.2 The ATLAS detector
As stated above, the ATLAS experiment is a general-purpose particle detector installed at LHC, used to study different kinds of physical phenomena. The detector is 44 m long ([2]), with an outer diameter of 25 m, and weighs approximately 7000 t. ATLAS has a cylindrical symmetry, as do the CMS and ALICE detectors.
The ATLAS detector describes events using a coordinate system whose origin is set at the nominal interaction point, with the z-axis defined by the beam direction and the x-y plane transverse to it ([3]). The positive x-axis points from the interaction point to the centre of the ring, while the y-axis points upwards. The azimuthal angle ϕ is measured around the beam axis and the polar angle θ is the angle from the beam axis. Pseudorapidity is then defined as η = −ln[tan(θ/2)]: its value is 0 for a vector lying in the transverse plane and tends to ±∞ for a vector along the z-axis. Using the pseudorapidity, it is possible to define the distance ΔR in the pseudorapidity-azimuthal angle space as ΔR = √(Δη² + Δϕ²).
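These two definitions can be checked numerically; the following sketch (the function names are illustrative, not from the thesis) implements η(θ) and ΔR:

```python
import math

def pseudorapidity(theta: float) -> float:
    """eta = -ln(tan(theta/2)), with theta the polar angle from the beam axis."""
    return -math.log(math.tan(theta / 2))

def delta_r(eta1: float, phi1: float, eta2: float, phi2: float) -> float:
    """Distance in the pseudorapidity-azimuthal angle space."""
    dphi = (phi2 - phi1 + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    return math.hypot(eta2 - eta1, dphi)

# A particle emitted perpendicular to the beam (theta = 90 degrees) has eta = 0,
# while theta approaching 0 (along the z-axis) sends eta towards infinity.
print(pseudorapidity(math.pi / 2))
print(pseudorapidity(0.01))
```

Note the azimuthal wrap-around: two directions at ϕ = 0 and ϕ = 2π are the same direction, so Δϕ must be folded into [−π, π] before computing ΔR.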
1.3 ATLAS' structure
The structure of the ATLAS detector is presented in Figure 1.2. It is built with a cylindrical symmetry around the beam-pipe axis, centered on the interaction point, allowing a large pseudorapidity (η) range to be read. Ideally, it can be divided into five main regions: a barrel region, where low-η events are read, two end-cap regions that cover medium η, and two forward regions that take care of the areas with higher η.

Figure 1.2: ATLAS detector overview
As shown in Figure 1.2, ATLAS is made up of different groups of sub-detectors designed to track proton-proton collisions and the particles these collisions produce. The innermost of these sub-detector systems is called the Pixel Detector; moving outwards we find calorimeters and muon spectrometers. The inner detector is surrounded by a solenoid that provides a 2 T magnetic field, while the barrel and end-cap sections of the muon spectrometer are surrounded by toroidal magnets that provide about 0.5 T and 1 T respectively.
Particles produced in proton-proton collisions start their travel through ATLAS from the Inner Detector, which covers the region |η| < 2.5. Here, charged particles interact with the different layers of the detector producing "hits": packets of data that mark the passage of a particle and are used to reconstruct its trajectory. The momenta and charges of these particles can be measured, as their trajectories are bent by the 2 T magnetic field provided by the central solenoid.
A schematic global section view of the ATLAS detector is provided in Figure 1.3, illustrating how and where different particles can interact. The inner detector, designed to have high granularity and high momentum-measurement resolution, provides essential information such as first- and second-vertex recognition. In order to achieve the needed performance, semiconductor detectors are used close to the beam for precise measurements.

Figure 1.3: Section view of the ATLAS detector in the transverse plane, illustrating the layers' positioning
Moving away from the central beam pipe, two types of calorimeters can be found: the electromagnetic calorimeter and the hadronic calorimeter, needed to detect electrons/photons and hadrons respectively, by determining both their energy and position. Proceeding outwards, the Muon Spectrometer identifies muons, which pass through the inner layers untouched. The Muon Spectrometer is made up of four different types of chambers: two types are intended to provide position and momentum measurements, while the other two are meant to grant precise and robust information for the hardware-based trigger decision making.
1.3.1 Inner Detector
The Inner Detector (ID) is the closest detector to the beam pipe, designed to be radiation-hard and to have long-term stability. As shown in Figure 1.4, the full ID is a cylinder 6.2 m long with a diameter of 2.1 m.

Figure 1.4: A section view of the ATLAS Inner Detector

The ID is segmented with cylindrical symmetry in the barrel region, while it has coaxial disks in the end-cap regions. As shown in Figure 1.5, the ID is made up of three main layers, described in the following paragraphs.
Transition Radiation Tracker
The Transition Radiation Tracker (TRT) is made up of thin drift tubes filled with a non-flammable gas mixture, located in the outer part of the ID. These tubes follow the symmetry of the corresponding region: the tubes installed in the end-cap regions run radially, while the barrel straws follow the beam-pipe direction. Transition radiation is a form of electromagnetic radiation emitted when a charged particle passes through an inhomogeneous medium, such as a boundary between two different media.
The TRT occupies the largest volume of the ID and provides the majority of the hits per track, hence contributing strongly to the momentum measurement. The TRT offers more hits per track in order to retrieve the momentum information, even though it has a lower precision compared to the silicon detectors. Unlike all other active parts of the ID, the drift tubes do not need to be cooled and are not subject to radiation degradation, as the gas can simply be replaced.
Semiconductor Tracker
The SemiConductor Tracker (SCT) is a tracker made up of silicon strips, with a technology similar to the one employed in the Silicon Pixel Detector. The reasons for using such trackers, instead of pixels, are mainly economical, since the barrel of the SCT covers more than 30 times the surface of the Pixel Detector.
Pixel Detector
The innermost and most important detector is the Pixel Detector. This detector is designed to have the finest granularity among all the sub-detectors. The system consists of four cylindrical layers of silicon "pixels"; these four layers are named, from the inside out: Insertable B-Layer (IBL), B-Layer (L0), Layer 1 (L1) and Layer 2 (L2). Each layer is constituted by several modules. A module consists of: sensors, 16 Front-End (FE-I3) chips responsible for reading the charge signal from the pixels, a flex-hybrid, a Module Controller Chip (MCC) and a pigtail.
The board developed and tested in this thesis is proposed as a readout board for this layer ([4]).
Chapter 2
Off-detector electronics

In order to explain the reasons that led to the development of a new readout board, we need to look at the current set-up of the off-detector electronics for the Insertable B-Layer. This layer is the newest and innermost of the four cylindrical pixel layers implemented in the barrel. High-energy physics experiments usually distinguish between on-detector and off-detector electronics: the former is the front-end electronics implemented near the detector itself, where radiation hardness is a fundamental requirement and which is often custom-made for the experiment; the latter is the readout system implemented far from the detector, which does not need to be radiation-hard, allowing the use of more powerful commercial devices.
The following sections describe the off-detector electronics in order to understand the task that the Pixel-ROD board was designed for.
2.1 IBL’s electronics
The readout system for IBL requires an appropriate off-detector apparatus, which
we're going to describe. This readout system is made of several components:
Back of Crate (BOC) board;
Optical modules to interface FE-I4 chips with BOC board;
S-Link for sending data from the BOC board to the ATLAS TDAQ
system;
Read Out Driver (ROD) board;
Gigabit Ethernet to send front-end calibration’s histograms;
VME Crate;
TTC Interface Module (TIM);
Single Board Computer (SBC).
Figure 2.1: Complete visual layout of the data acquisition system. In red, the normal data path; in blue, the deviation of the histogram data path.

Each BOC-ROD pair can interface and route 5.12 Gb/s of data coming from 16 IBL modules (32 FE-I4 chips). With this throughput, the whole IBL readout requires 15 BOC-ROD pairs, placed in a single VME crate.
This setup is replicated, on a smaller scale, here in Bologna, inside a VME crate that hosts a few BOC-ROD interfaces. The data path is simplified in Figure 2.1: 32 FE-I4 front-end chips drive 32 serial lines, each supporting 160 Mb/s, connected to the BOC board via optical links. Here the signal from each line is converted from optical to electrical, then demultiplexed onto a 12-bit-wide bus, which proceeds towards the ROD board through the VME backplane connector. The ROD board performs the real data formatting, building up a ROD data frame that is sent to the TDAQ computers. On the ROD board, data can take two different paths (see Figure 2.1): the first routes the ROD data events back to the BOC, where four S-Link modules forward the data to the ATLAS TDAQ PCs, implementing a total output bandwidth of 5.12 Gb/s; the second route is toward the PC for histogram making, used exclusively during calibration of the FE-I4 chips ([4]).
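The 5.12 Gb/s figure follows directly from the link count quoted above; a minimal sketch of the arithmetic (variable names are illustrative):

```python
# Aggregate IBL bandwidth per BOC-ROD pair, from the numbers quoted in the text.
lines_per_pair = 32        # FE-I4 serial lines handled by one BOC-ROD pair
rate_per_line_mbps = 160   # Mb/s carried by each serial line

aggregate_gbps = lines_per_pair * rate_per_line_mbps / 1000
print(aggregate_gbps)      # 5.12 Gb/s per BOC-ROD pair
```

The same product also explains the S-Link output side: the four S-Link modules on the BOC must together sustain the same 5.12 Gb/s that enters the pair.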
2.1.1 IBL BOC
The BOC board, shown in Figure 2.2, is in charge of the control interface to the detector, as well as the data interface from the detector itself. Another major task of the BOC is to provide the clock to the connected front-end chips: the clock is received by the BOC from the TIM and can be delayed if needed; furthermore, a Phase-Locked Loop (PLL) generates copies of this clock for the ROD and the detector. The detector clock is then handled by the FPGAs and coded into the control streams for the individual detector modules. The IBL BOC contains three Xilinx Spartan FPGAs:
One BOC Control FPGA (BCF);
Two BOC Main FPGAs (BMF).
BOC Control FPGA
In the BOC Control FPGA a MicroBlaze embedded processor is instantiated, mainly in order to manage the board's Ethernet connection and to perform some self-tests. The BCF is also responsible for the FPGA configuration, which happens in two steps: first, the BCF loads its own configuration in "Master Serial Peripheral Interface" mode from a 64 Mbit SPI FLASH. Then, the configuration data for the two main FPGAs are read from a second SPI FLASH memory and downloaded via the Slave Serial configuration ports. Depending on the configuration, the BCF can also load software from a third SPI FLASH.
BOC Main FPGA
The two BMFs encode the configuration data coming from the ROD and send them straight to the FE-I4 chips. For testing purposes, this configuration stream can also be generated directly by the BMFs. One of the main tasks these two FPGAs execute is the deserialization of the incoming data from the front-end chips on the RX path (Figure 2.1), after data collection and word alignment. They are also in charge of sending the decoded data to the ROD board. On the TX side, they also manage four optical S-Link connections to the TDAQ system.
Figure 2.2: IBL BOC board
2.1.2 IBL ROD
The Insertable B-Layer Read Out Driver (IBL ROD) ([5]) is designed to accomplish several tasks, such as propagating timing and trigger signals to the front-end electronics, as well as sending an appropriate configuration to them. The most important task of the ROD is accomplished during physics runs: it receives data and event fragments from the 32 FE-I4 chips and transforms them into a ROD data frame, which is sent back to the ATLAS TDAQ through the BOC's S-Link connections (see Figure 2.1).
The IBL ROD (shown in Figure 2.3) is composed of:
One Digital Signal Processor MDSP (Texas Instruments TMS320C6201-GJC200), which is currently not used;
One Program Reset Manager (PRM) FPGA (Xilinx Spartan-6 XC6SLX45-FGG484);
One ROD Controller (Master) FPGA (Xilinx Virtex-5 XC5VFX70T-FF1136);
Two "Slave" FPGAs (Xilinx Spartan-6 XC6SLX150-FGG900);
One Phase-Locked Loop PLL (Lattice ispClock 5620);
32 MByte DDR SDRAM;
Four Mbit FLASH SST39VF040-70-4C-NH;
Two GByte DDR2 SODIMM;
64 Mbit FLASH Atmel AT45DB642D;
Three Gbit Ethernet interfaces with PHY DP83865.

Figure 2.3: The IBL ROD board
In order to understand the tasks of the ROD, the main devices are described hereafter.
ROD Master
A Virtex-5 FPGA is the Master of the Read Out Driver. It interfaces with the front-end chips and with the triggers that come from the TTC Module. This FPGA contains a PowerPC, an embedded hard processor, and has several tasks to accomplish; one of the most important is to process the trigger or event information and deliver it to the Spartan FPGAs (the Slaves).
ROD Slaves
The two Spartan-6 FPGAs work as slaves on the ROD board; they implement a soft processor named MicroBlaze. All the data generated by the FE-I4 chips are collected by these FPGAs, which store them in the ROD's SSRAM.
During calibration runs, histograms can be generated and sent to the histogram server through an Ethernet connection (one for each Spartan-6).
2.1.3 TIM
The TTC (Timing, Trigger and Control) Interface Module (TIM) interfaces the ATLAS Level-1 Trigger system signals to the pixel Read-Out Drivers. As far as the IBL off-detector electronics is concerned, the TIM sends the clock to the BOC board, which then propagates it to the ROD, as stated above. Furthermore, the TIM receives and propagates triggers.
2.1.4 SBC

The Single Board Computer, as the name suggests, is a computer mounted on a 6U board with a VME interface chip. It is used to control all the VME operations on the ROD and to program the ROD FPGAs, usually after power-up. It can also be used to monitor the temperatures, or voltages, on the ROD's master device.
2.2 The road towards PCIe based board
The entire readout of the IBL, as said before, implements 15 BOC-ROD pairs, installed in 2014. The remaining layers need, respectively, 38 pairs for Layer 1, 26 pairs for Layer 2, 22 pairs for the B-Layer and 12 for the external B-Layer and Disks; these cards will be installed this year (the decision to extend this system to all the other layers was taken only at a later time). This excellent system is, unfortunately, limited, especially in view of the upgrade of the whole LHC detector complex, planned for 2023 ([6]), which aims at a much higher luminosity (up to 10 times the current one). Such a huge improvement in luminosity means that the electronics will need to withstand a much higher data rate. Looking in this direction, many electronic boards have been proposed for the readout of such experiments ([4]). All the projects, from the electronics viewpoint, share a common feature: an electronic board designed to be flexible and highly configurable, with powerful FPGAs connected to many optical transceivers and with a PCIe interface to guarantee an extremely high throughput.
Chapter 3
PCI-Express overview

Since the first communication protocols saw the light, the ever-increasing need to move huge amounts of data has driven research towards higher throughput, starting from really tiny data rates (for example, 56 kbit/s on old dial-up links) and arriving, nowadays, at astonishing bandwidths thanks to new protocols and design strategies. One of these new protocols is PCI-Express (Peripheral Component Interconnect Express; from now on it will simply be called PCIe). As the name suggests, PCIe is the third-generation high-performance I/O bus used to interconnect peripheral devices; this protocol achieves its required throughput and reliability through serial, packet-based communication at high speed on differential lines. This standard was studied and implemented in order to overcome the issues that were quickly growing in older protocols such as PCI and PCI-X. The next sections describe the aspects that made parallel protocols obsolete and drove the evolution towards PCIe.
3.1 Old standard's crisis
As stated above, PCIe is the third-generation bus used for peripheral connection. The first generation included buses such as ISA, EISA, VESA, Micro Channel and many others: all these protocols were slow and too numerous. The second generation included PCI, AGP and PCI-X ([7]). When the PCI bus was first introduced in the early 1990s, it had a unifying effect on the plethora of first-generation I/O buses, and it also brought a lot of advantages, such as processor independence, buffered isolation, bus mastering and true plug-and-play operation. Despite all these benefits, PCI had several problems of increasing importance as the bandwidth demand grew: bandwidth limitations, host pin-count limitations, lack of real-time data-transfer services, and so on. Its derivatives, such as PCI-X and AGP, only pushed the problem further away by raising the maximum throughput.

Figure 3.1: Bandwidth and maximum number of card slots per bus

The actual data rate of the PCI bus and its derivatives can be significantly lower than the theoretical bandwidth, due to protocol overhead and the parallel bus topology: more devices connected on the same bus meant a slower, or not even functioning, connection, as shown in Figure 3.1 ([8]). This led to PCIe, a different kind of protocol: in order to alleviate the problems of the parallel connection, PCI-Express offers a serial architecture using clock data recovery (CDR) and differential signalling; PCIe also employs a dual-simplex point-to-point link topology to overcome the limitations of a shared bus (see Figure 3.2). The links use high-speed serial transceivers with embedded clock, with data differential signals operating at 2.5 GTransfer/s with 8b/10b encoding. A link can consist of a single (x1) lane providing a peak bandwidth of 500 Mbyte/s (2 directions × 2.5 Gbit/s × 8/10 encoding). One of the major facts that made PCIe so widespread is its backward compatibility with the PCI standard: current operating systems compatible with the PCI software model can boot and run on PCI-Express systems without any change to device drivers or to the operating system itself. However, to take advantage of the new, advanced features of PCI Express, software modifications are necessary (for example, in order to access the extended configuration space).

Figure 3.2: PCI vs PCIe topology
Figure 3.3: PCIe throughput vs PCIe version; "Total Bandwidth" is calculated on the full-duplex communication
3.2 PCI-Express standard
As seen in the previous sections, PCIe is a serial, packet-based standard. Before discussing the constitutive elements of this protocol, such as its topology and layers, it is necessary, in order to understand the main topics, to introduce the terminology used and the performance achieved.

Terminology
Each connection between two PCIe devices is called an Interconnect or Link: a point-to-point communication channel that allows bidirectional transmissions (ordinary PCI requests or interrupts). Each Link is composed of 1 or more Lanes (1, 2, 4, 8, 16 or 32 possible lanes). A PCIe card using a Link with, for example, 8 Lanes is indicated as PCIe x8. Each Lane counts exactly 2 differential signal pairs, one for sending data and one for receiving, making each Lane a full-duplex communication channel composed of 4 wires ([8]).
Performance
The performance of PCIe strongly depends on the version of the standard we are referring to: nowadays there are 4 official versions with 4 different performance levels ([9]) (even if the fourth generation is still in its embryonic state, see Figure 3.3):
V 1.0: with 2.5 GTransfer/s;
V 2.0: with 5 GTransfer/s;
V 3.0: with 8 GTransfer/s;
V 4.0: with 16 GTransfer/s.
V 1.0
The first version of PCIe protocol allowed a 2,5 GTransfer/s which means that
every Lane could transfer up to 2,5 Gbit/s, using an 8/10b encoding format. For
example, a PCIe-x8-gen1 is capable of a maximum throughput of (considering
only sending or receiving):
V 2.0
The second version of the PCIe protocol changed the data-transfer rate, raising
it up to 5 GTransfer/s, using the same 8b/10b encoding format. For example, a
PCIe-x8 Gen2 is capable of a maximum throughput of (considering only sending
or receiving):
8 lanes × 5 Gbit/s × 8/10 = 32 Gbit/s = 4 GB/s
V 3.0
The third version of the PCIe protocol changed the data-transfer rate again,
raising it up to 8 GTransfer/s, and also changed the encoding format from 8b/10b
to 128b/130b. For example, a PCIe-x8 Gen3 is capable of a maximum throughput
of (considering only sending or receiving):
8 lanes × 8 Gbit/s × 128/130 ≈ 63 Gbit/s ≈ 7.9 GB/s
V 4.0
The fourth version of the PCIe protocol changed the data-transfer rate once
again, raising it up to 16 GTransfer/s, while keeping the 128b/130b encoding
format. For example, a PCIe-x8 Gen4 is capable of a maximum throughput of
(considering only sending or receiving):
8 lanes × 16 Gbit/s × 128/130 ≈ 126 Gbit/s ≈ 15.8 GB/s
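The per-generation examples above all apply the same arithmetic: lanes × per-lane transfer rate × encoding efficiency, divided by 8 to obtain bytes. A small sketch (plain Python, with names invented for illustration) reproducing the x8 figures:

```python
# Per-lane transfer rate (GT/s) and encoding efficiency for each PCIe generation.
GENERATIONS = {
    "Gen1": (2.5, 8 / 10),     # 8b/10b: 2 overhead bits per 10 transmitted
    "Gen2": (5.0, 8 / 10),
    "Gen3": (8.0, 128 / 130),  # 128b/130b: only 2 sync bits per 130
    "Gen4": (16.0, 128 / 130),
}

def throughput_gbit(lanes, gen):
    """One-direction throughput in Gbit/s for a Link of `lanes` Lanes."""
    rate, efficiency = GENERATIONS[gen]
    return lanes * rate * efficiency

def throughput_gbyte(lanes, gen):
    """Same figure expressed in GB/s (8 bits per byte)."""
    return throughput_gbit(lanes, gen) / 8

# A PCIe-x8 Gen2 link, as chosen for the Pixel-ROD:
print(round(throughput_gbyte(8, "Gen2"), 3))  # → 4.0
```

This confirms the 4 GB/s figure quoted later for the Pixel-ROD's PCIe-x8 Gen2 interface.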
3.2.1 PCI-Express topology
As seen in the previous chapters, the PCIe standard uses a point-to-point serial
connection in order to achieve the desired performance. In order to start talking
about the topology of the PCIe standard, it is necessary to understand its
constitutive elements: Root Complex, Switch, Endpoint and the enumerating
system (figure 3.4).
Root Complex
The Root Complex (RC) can be seen as the main controller of this standard
([8]): it denotes the device that connects the CPU and memory subsystem to
the PCIe fabric and supports one or more PCIe ports. Each port can be connected
to an endpoint device or a switch; in the latter case a sub-hierarchy is formed.
The RC is responsible for generating transaction requests and initializing
configuration transactions on behalf of the CPU; it is also capable of generating
both memory and IO requests, as well as generating locked transaction requests
without responding to them. The principal work of the RC is to transmit packets
out of its ports and to receive packets on its ports, which it forwards to memory.
A multi-port RC may work as a router in order to route a packet from port to
port, even if that's not required.
The RC also implements central resources such as: hot plug control, power
management controller, interrupt controller, error detection and reporting logic.
The RC is initialized at boot with a bus number, a device number and a function
number. The RC also defines the hierarchy and the hierarchy domains: the
hierarchy is the ensemble of all devices and associated Links that are either
directly connected or indirectly connected via switches and bridges, while a
hierarchy domain is the full tree connected to a single port.
Figure 3.4: PCIe example topology
Switch
Switches can be thought of as two or more PCI-to-PCI bridges, each bridge
associated with a switch port and implementing one configuration header
register; configuration and enumeration software detects these registers and
initializes them at boot time. These bridges are internally connected via a non-
defined bus. The port pointing towards the Root Complex is the upstream port;
all other ports are downstream ports. The main work of Switches is to forward
all types of packets (transactions) from any ingress to any egress, in a manner
similar to PCI bridges, using memory, IO or configuration address based routing.
The logical bridges within the switch implement PCI configuration headers,
which hold memory and IO base and limit address registers, as well as primary
bus number, secondary bus number and subordinate bus number registers.
Switches implement two types of arbitration mechanisms, port arbitration and
VC (Virtual Channel) arbitration, by which they determine the priority with
which to forward packets from ingress to egress ports. Switches support locked
requests ([8]).
Endpoints
Endpoints are all devices other than the RC and switches that are requesters or
completers of PCIe transactions, such as the Ethernet, USB or graphics cards
that lie on the PCIe bus. They initiate transactions as requesters or respond to
transactions as completers. Two types of endpoints exist: PCI-Express endpoints
and legacy endpoints. Legacy endpoint devices are not required to support 64-bit
memory addressing capability; they may support IO transactions, and they may
also support locked transaction semantics as completers, but not as requesters.
Interrupt-capable legacy devices may support legacy-style interrupt generation
using message requests, but must in addition support MSI generation using
memory write transactions. PCIe Endpoints must not support IO or locked
transaction semantics and must support MSI-style interrupt generation and
64-bit memory addressing capability. Both endpoint types implement Type 0
configuration headers and respond to configuration transactions as completers.
Each endpoint is initialized with a device ID (both requester and completer ID)
composed of a bus number, device number and function number. Endpoints are
always device 0 on the bus assigned to them; that does not mean, however, that
there can be only one "dev-0" in the system, since each bus has its own device 0.
Enumerating System
Even if the enumerating system is not something that strictly regards the
topology of PCIe systems, it is mandatory to understand at least its basics in
order to be able to talk about the implementation of PCIe (in the next chapter).
Standard PCI Plug and Play enumeration software enumerates PCIe systems.
Each PCIe Link is equivalent to a logical PCI bus, which means that every Link
is assigned a bus number. The PCIe endpoint is device 0 on the PCIe Link of a
given bus number. Only one device (dev 0) exists per PCIe Link. The
enumeration of switches is slightly different: the internal bus of the switch,
which connects all of the virtual bridges together, is also numbered.
The first Link associated with the RC is bus number 1, because number 0 is
an internal virtual bus ([8]).
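The numbering rules just described (one logical bus per Link, the switch's internal virtual bus numbered as well, bus 0 internal to the RC) can be illustrated with a toy depth-first walk. This is only a sketch of the principle, not the real Plug and Play enumeration algorithm, and the dictionary-based topology is invented for the example:

```python
def enumerate_buses(root):
    """Depth-first bus numbering: the first Link below the RC is bus 1
    (bus 0 is the RC's internal virtual bus); a switch's internal virtual
    bus consumes a bus number before its downstream Links do."""
    assignment = {}
    counter = [1]  # next bus number to hand out

    def visit(node):
        assignment[node["name"]] = counter[0]
        counter[0] += 1
        for child in node.get("children", []):
            visit(child)

    visit(root)
    return assignment

# RC port -> switch upstream Link -> internal bus -> two endpoint Links.
topology = {
    "name": "link-to-switch",
    "children": [
        {
            "name": "switch-internal-bus",
            "children": [
                {"name": "link-to-endpoint-A"},
                {"name": "link-to-endpoint-B"},
            ],
        }
    ],
}
print(enumerate_buses(topology))
# link-to-switch = bus 1, switch-internal-bus = bus 2, endpoints = buses 3 and 4
```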
3.2.2 PCI-Express Layers
As said before, PCIe is a standard that uses a packet-based communication
system and, like the majority of these types of communications, it bases its work
on a stacked architecture. The stack of the PCIe standard is similar to a
simplified ISO/OSI stack: it has, in fact, only 3 layers: Physical Layer, Data
Link Layer and Transaction Layer (see figure 3.5) ([10]). This kind of
architecture was chosen in order to achieve 2 major objectives: an ultra-reliable
communication protocol and the isolation between different functional areas
([7]). This allows updating or upgrading one or more layers, often without
requiring changes in the other layers. For example, new transaction types might
be included in newer revisions of a protocol specification without affecting
lower layers, or the physical media might be changed with no major effects on
higher layers. On the side of reliability, PCIe Gen3 ensures a Bit Error Rate
(BER) of 10^-12, meaning that statistically only one bit in a thousand billion is
wrong. There are two main packet types: Transaction Layer Packets (TLPs) and
Data Link Layer Packets (DLLPs) ([8]). While DLLPs are for service
communications between PCIe constitutive elements, TLPs are the packets that
actually move the data from and to devices. In the following paragraphs each
layer will be examined; then an ideal TLP transaction will be analyzed in order
to summarize all the layers' work.
Transaction Layer
The Transaction Layer (TL) is the upper layer of the PCI Express architecture,
and its primary function is to accept, buffer, and disseminate transaction layer
packets (TLPs). TLPs use four address spaces: memory, IO, configuration and
message. The transactions of this last space (the messages) need to be discussed
in order to understand the passage from PCI to PCIe: PCI 2.2 introduced an
alternate method of propagating system interrupts called message signaled
interrupts (MSI). Here a special-format memory-write transaction was used
instead of a hard-wired sideband signal, as an optional capability in a PCI 2.2
system. The PCI Express specification reuses the MSI concept as a primary
method for interrupt processing and uses the message space to accept all prior
sideband signals, such as interrupts, power-management requests, and resets, as
in-band messages. Other "special cycles" within the PCI 2.2 specification, such
as interrupt acknowledge, are also implemented as in-band messages.
Figure 3.5: The PCIe stack with a packet transfer example
Figure 3.6: TLP's header format
All requests are implemented as split transactions and fall into two broad
categories: posted and non-posted transactions. Posted transactions don't need
any reply from the completer (receiver): for example, write transactions or
message transactions. Non-posted transactions, on the other hand, need a reply:
for example, a memory read transaction or an IO write. In the case of a
non-posted transaction, the TL receives response packets from the link layer and
matches these with the original software requests. Each packet has a unique
identifier that enables response packets to be directed to the correct originator.
The packet format offers 32-bit memory addressing and extended 64-bit memory
addressing. Packets also have attributes such as "no-snoop," "relaxed ordering,"
and "priority," which may be used to route these packets optimally through the
IO subsystem. As this layer is so important for the PCIe architecture, it is worth
analyzing, even if superficially, the TLPs generated here. A Transaction Layer
Packet (TLP) consists of a TLP header composed of 3 or 4 double words (32
bits each), a data payload of 0 to 1023 double words (for the types of packets
that need it), and an optional "TLP digest" tail. The header is composed of
several fields (figure 3.6). The two most important fields of this header are
"Fmt", which determines the length of the header and the presence (or absence)
of data, and the "Type" field, which determines the type of the transaction.
"Length" determines the length of the data ([8]).
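To make the header fields concrete, the sketch below packs Fmt, Type and Length into the first header double word, using the classic bit positions (Fmt at bits 30:29, Type at bits 28:24, Length at bits 9:0) and leaving every other field at zero; the helper name is invented, and real TLPs carry many more fields (TC, TD, EP, attributes):

```python
def tlp_header_dw0(fmt, tlp_type, length):
    """Pack Fmt (2 bits), Type (5 bits) and Length (10 bits) into the
    first 32-bit double word of a TLP header; every other field is
    left at zero in this simplified sketch."""
    assert fmt < 4 and tlp_type < 32 and length < 1024
    return (fmt << 29) | (tlp_type << 24) | length

# Fmt=0b10: 3-DW header *with* data -> a 32-bit Memory Write of 1 DW.
mem_wr32 = tlp_header_dw0(0b10, 0b00000, 1)
# Fmt=0b00: 3-DW header, no data   -> a 32-bit Memory Read of 1 DW.
mem_rd32 = tlp_header_dw0(0b00, 0b00000, 1)
print(hex(mem_wr32), hex(mem_rd32))  # → 0x40000001 0x1
```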
Data Link Layer
The primary role of the link layer is to ensure reliable delivery of the packet
across the PCI Express link(s). When a TLP is passed down from the Transaction
Layer, the Data Link Layer places before it a 12-bit sequence number, then
protects its contents with a 32-bit LCRC value. The Data Link Layer calculates
the LCRC value based on the TLP received from the Transaction Layer and the
sequence number it has just applied. The LCRC calculation utilizes each bit in
the packet, including the reserved bits. This LCRC is then appended as a tail to
the packet. The packet is then saved in a "retry buffer" and sent to the Physical
Layer. A credit-based flow-control protocol ensures that packets are transmitted
only when it is known that a buffer is available to receive the packet at the other
end, which eliminates any waste of bus bandwidth due to resource constraints
([7]).
For incoming TLPs, the Data Link Layer accepts them from the Physical Layer
and checks the sequence number and LCRC. If they are correct, the Data Link
Layer passes the TLP up to the Transaction Layer. If an error is detected (either
a wrong sequence number or an LCRC that does not match), the Data Link
Layer notifies the transmitter of the corrupted packet of the error and asks it to
retry. These communications are made possible through Data Link Layer
Packets (DLLPs). DLLPs are responsible, among other things, for
communicating ACK or NAK to the transmitter. In order to see how this
mechanism works, it is helpful to imagine a transaction between two devices, A
and B. The Data Link Layer of the remote Device B receives the TLP and
checks it for CRC errors. If there is no error, the Data Link Layer of Device B
returns an ACK DLLP with a sequence ID to Device A. Device A then has
confirmation that the TLP has reached Device B successfully, and clears its
replay buffer of the TLP associated with that sequence ID.
If, on the other hand, a CRC error is detected in the TLP received at the remote
Device B, a NAK DLLP with a sequence ID is returned to Device A: an error
has occurred during TLP transmission, and Device A's Data Link Layer replays
the associated TLPs from the replay buffer. For a given TLP in the replay buffer,
if the transmitter device receives a NAK 4 times, and the TLP is consequently
replayed 3 additional times, then the Data Link Layer logs the error, reports a
correctable error, and re-trains the Link ([10]).
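The ACK/NAK dialogue between devices A and B can be modeled in a few lines. The class below is an invented illustration that keeps only the retry buffer, treating ACKs as cumulative and ignoring the real protocol's 12-bit sequence wrap-around, replay timers and retrain logic:

```python
from collections import OrderedDict

class DllTransmitter:
    """Minimal model of the Data Link Layer transmit side: TLPs stay in
    the retry buffer until the far end ACKs their sequence number; a NAK
    triggers a replay of everything still buffered, in order."""
    def __init__(self):
        self.next_seq = 0
        self.retry_buffer = OrderedDict()   # seq -> TLP payload

    def send(self, tlp):
        seq = self.next_seq
        self.next_seq += 1
        self.retry_buffer[seq] = tlp        # keep a copy until ACKed
        return seq

    def on_ack(self, seq):
        # ACK is cumulative: it confirms everything up to `seq`.
        for s in [s for s in self.retry_buffer if s <= seq]:
            del self.retry_buffer[s]

    def on_nak(self):
        # Replay every TLP still awaiting acknowledgement.
        return list(self.retry_buffer.items())

tx = DllTransmitter()
tx.send("TLP-0"); tx.send("TLP-1"); tx.send("TLP-2")
tx.on_ack(0)                  # TLP-0 confirmed and purged from the buffer
print(tx.on_nak())            # [(1, 'TLP-1'), (2, 'TLP-2')] get replayed
```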
Physical Layer
The Physical Layer connects to the Link on one side and interfaces to the Data
Link Layer on the other side. The Physical Layer processes outbound packets
before transmission to the Link and processes inbound packets received from the
Link. The two sections of the Physical Layer associated with transmission and
reception of packets are referred to as the "transmit logic" and the "receive logic".
Two sub-blocks make up the Physical Layer: the logical Physical Layer and the
electrical Physical Layer ([8]); both sub-blocks are split into independent
transmit and receive logic, which allows dual simplex communication. The
transmit logic of the first sub-block (logical Physical Layer) receives packets
from the Data Link Layer and attaches to them a "start" header and a "stop"
tail. Each byte of a packet is then scrambled with the aid of a Linear Feedback
Shift Register (LFSR) type scrambler. By scrambling the bytes, repeated bit
patterns on the Link are eliminated, thus reducing the average EMI noise
generated. The resultant bytes are encoded into a 10-bit code by the 8b/10b
encoding logic. The primary purpose of encoding 8b characters into 10b
symbols is to create sufficient 1-to-0 and 0-to-1 transition density in the bit
stream to facilitate the recreation of a receive clock, with the aid of a PLL, at
the remote receiver device. Note that data is not transmitted along with a clock;
instead, the bit stream contains sufficient transitions to allow the receiver device
to recreate a receive clock ([10]). The parallel-to-serial converter generates a
serial bit stream of the packet on each Lane and transmits it differentially at 2.5
Gbit/s (in the case of PCIe Gen1). The receiving side is, clearly, dual to the one
explained.
Figure 3.7: PCIe schematic physical layer
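The self-inverting nature of the scrambler is easy to demonstrate: XORing the data with the same LFSR stream twice restores it. The sketch below uses the Gen1/Gen2 polynomial X^16 + X^5 + X^4 + X^3 + 1, but glosses over the specification's exact bit ordering, seed-reset rules and K-symbol bypasses, so it only shows the additive-scrambling principle:

```python
def scramble(data, seed=0xFFFF):
    """Additive scrambler: XOR each data byte with 8 bits drawn from a
    16-bit Galois LFSR (polynomial X^16 + X^5 + X^4 + X^3 + 1).
    Descrambling is the same operation with the same seed."""
    lfsr = seed
    out = bytearray()
    for byte in data:
        stream = 0
        for i in range(8):
            fb = (lfsr >> 15) & 1       # bit fed back from the X^16 term
            stream |= fb << i
            # shift, folding the feedback into the X^5, X^4, X^3, X^0 taps
            lfsr = ((lfsr << 1) & 0xFFFF) ^ (0x39 if fb else 0)
        out.append(byte ^ stream)
    return bytes(out)

payload = b"TLP payload bytes"
scrambled = scramble(payload)
assert scrambled != payload             # repeated bit patterns are broken up
assert scramble(scrambled) == payload   # XORing twice restores the data
```

The same routine thus serves as both scrambler and descrambler, which is exactly why transmitter and receiver only need to keep their LFSRs in step.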
Before explaining the transmit logic of the second sub-block (electrical Physical
Layer), it is necessary to talk about differential lines: each Lane in the PCIe
standard is composed of 2 differential pairs (receiver and transmitter). A
positive voltage difference between D+ and D- (positive and negative
differential lines) denotes a logical "1", while a negative voltage difference
between D+ and D- denotes a logical "0"; no difference between these lines
means that the driver is in the high-impedance tristate condition, which is
referred to as the electrical-idle and low-power state of the Link ([10]).
The transmit/receive logic of the second sub-block (electrical Physical Layer) is
easy to understand by looking at figure 3.7. Every differential lane is AC
coupled and driven by a differential pair; this makes a PCIe board's drivers and
receivers short-circuit tolerant, and furthermore two devices at opposite ends of
a Link can have their own ground and power planes. The AC coupling capacitor
is between 75 and 200 nF. The transmitter DC common mode voltage is
established during Link training and initialization. The DC common mode
impedance is typically 50 Ω, while the differential impedance is typically
100 Ω. This impedance is matched with a standard FR4 board ([11]).
TLP transaction example
The following steps refer to figure 3.5; the electrical Physical Layer passage is
not included ([8]):
1. Device B’s core passes a request for service to the PCI Express hardware
interface. How this is done is not covered by the PCI Express
Specification, and it is device-specific. General information contained in
the request would include:
a. The PCI Express command to be performed
b. Start address or ID of target (if address routing or ID routing are
used)
c. Transaction type (memory read or write, configuration cycle, etc.)
d. Data payload size (and the data to send, if any)
e. Virtual Channel/Traffic class information
f. Attributes of the transfer: No Snoop bit set?, Relaxed Ordering
set?, etc.
2. The Transaction Layer builds the TLP header, data payload, and digest
based on the request from the core. Before sending a TLP to the Data Link
Layer, flow control credits and ordering rules must be applied.
3. When the TLP is received at the Data Link Layer, a Sequence Number is
assigned and a Link CRC is calculated for the TLP (includes Sequence
Number). The TLP is then passed on to the Physical Layer.
4. At the Physical Layer, byte striping, scrambling, encoding, and
serialization are performed. STP and END control (K) characters are
appended to the packet. The packet is sent out on the transmit side of the
link.
5. At the Physical Layer receiver of Device A, de-serialization, framing
symbol check, decoding, and byte un-striping are performed. Note that at
the Physical Layer, the first level of error checking is performed (on the
control codes).
6. The Data Link Layer of the receiver calculates CRC and checks it against
the received value. It also checks the Sequence Number of the TLP for
violations. If there are no errors, it passes the TLP up to the Transaction
Layer of the receiver. The information is decoded and passed to the core of
Device A. The Data Link Layer of the receiver will also notify the
transmitter of the success or failure in processing the TLP by sending an
Ack or Nak DLLP to the transmitter. In the event of a Nak (No
Acknowledge), the transmitter will re-send all TLPs in its Retry Buffer.
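The six steps above can be condensed into a toy transmit/receive pipeline. Here zlib.crc32 stands in for the real LCRC, the framing markers are placeholders for the STP/END control characters, and striping, scrambling and 8b/10b encoding are omitted, so this mirrors only the shape of the flow, not the wire format:

```python
import zlib

def dll_wrap(tlp: bytes, seq: int) -> bytes:
    """Step 3: prepend a sequence number, append a CRC over both
    (zlib.crc32 as a stand-in for the real 32-bit LCRC)."""
    body = seq.to_bytes(2, "big") + tlp
    return body + zlib.crc32(body).to_bytes(4, "big")

def phy_frame(packet: bytes) -> bytes:
    """Step 4 (simplified): add start/end framing markers."""
    return b"STP" + packet + b"END"

def receive(frame: bytes):
    """Steps 5-6: strip framing, verify the CRC, recover seq and TLP."""
    assert frame.startswith(b"STP") and frame.endswith(b"END")
    inner = frame[3:-3]
    body, crc = inner[:-4], inner[-4:]
    ok = zlib.crc32(body).to_bytes(4, "big") == crc
    return ok, int.from_bytes(body[:2], "big"), body[2:]

ok, seq, tlp = receive(phy_frame(dll_wrap(b"MemWr payload", seq=7)))
print(ok, seq, tlp)   # → True 7 b'MemWr payload'
```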
Chapter 4
Pixel-ROD board
As stated at the end of chapter 2 (2.2 The road towards PCIe based board), the
knowledge acquired with the IBL electronics made clear the limits of the
BOC-ROD system, especially in view of the future upgrades of the whole LHC
detector. This led to the development of many electronic boards for the readout
of such experiments. All the projects, from the electronic viewpoint, share many
common features, two of which are the presence of a PCIe interface, to
guarantee an extremely high throughput, and of a high-end FPGA connected to
many optical transceivers. Looking into high-speed devices, it was decided to
keep working with FPGAs from Xilinx, upgrading to the 7-Series family. This
decision was taken in order to exploit all the experience and effort spent on the
ROD board, allowing the portability of its firmware onto the new board, named
Pixel-ROD, after upgrading it to the newly introduced platform. Furthermore,
given the success of the Master-Slave architecture of the ROD board, it was
decided to use two FPGAs on the Pixel-ROD board. Since the process of
creating and debugging such highly complex boards often turns out to be very
time consuming, it was decided to design the Pixel-ROD from two evaluation
boards made by Xilinx: the KC705 and the ZC702.
4.1 VME vs PCIe
Before discussing the path that led to the creation of the Pixel-ROD, it is
mandatory to understand the protocol currently used for the communication
between the boards used for the readout of the ATLAS experiment: the VME
bus. This knowledge will help to understand the reasons that led to the choice of
the PCIe standard.
4.1.1 VME overview
The VMEbus (Versa Module Europa bus, figure 4.1) is a parallel bus introduced
by Motorola in 1981. It was originally designed as the I/O bus for the then
newly introduced 68000 CPU from Motorola ([12]). Since the early 1990s,
VME has commanded almost half of the embedded computer boards market.
The 32-bit VMEbus offers a maximum bandwidth of up to 40 MB/s. The first
significant change in the standards was the definition of a 64-bit version in
1995; the result was a doubling of the bandwidth to 80 MB/s. The first
mechanical change was the introduction of the VME-64X specifications in
1998. The newly defined connectors were designed so that the same modules
can be used in either legacy VMEbus or VME-64X backplanes, although a
VME-64X board on a legacy VMEbus backplane loses some I/O functionality.
Further extensions to VME signaling, 2eVME and 2eSST, have been
standardized but have not been widely accepted. VME is an asynchronous bus,
and the limiting factor on the number of modules per chassis is the obvious
requirement of signal integrity: the number of modules in a single chassis is a
function of the backplane and how well the signals can propagate through it.
The physical constraints of a rack-mounted card cage limit the maximum
number of modules in a VME system to 21. The maximum data-rate achieved
nowadays is 320 MB/s with the VME320 ([13]).
4.1.2 PCIe choice
The reasons that led to the choice of a PCIe based board are many and all valid.
For the design of the Pixel-ROD it was chosen to adopt a PCIe-x8 Gen2, which
means, looking at the math done in chapter 3, an ideal throughput of 4 GB/s.
That is an enormous data-rate if compared to the 320 MB/s of the VME
standard. PCIe implements a packet based communication with automatic
re-transmission in case of error, making it more reliable and stable.
Furthermore, PCIe uses a point-to-point serial communication system that
overcomes the 21-boards-per-crate limitation of VME. Moreover, PCIe needs
fewer connections to be routed on the board and smaller connectors, which
reduces the size of the board and the effort spent routing all the metallic paths,
making routing errors less likely to happen and easier to find. One of the most
important aspects that make PCIe the best choice for the ATLAS experiment is,
anyway, the capability to be directly inserted (and even hot-plugged) into the
TDAQ system without using optical links and other external boards. That gives
the system plenty of flexibility.
Figure 4.1: A VME board
Figure 4.2: Xilinx's KC705 evaluation board
4.2 The making of the Pixel-ROD
As stated at the beginning of this chapter, since the process of creating and
debugging such highly complex boards often turns out to be very time
consuming, it was decided to design the Pixel-ROD from two high-end
evaluation boards made by Xilinx, the KC705 and the ZC702, by merging them
into a more complex and powerful board. The two boards are described
hereafter in order to understand how the Pixel-ROD was made ([4]).
4.2.1 KC705
For the slave device it was thought to adopt a powerful FPGA from Xilinx's
Kintex family, a good example of which was the Xilinx evaluation board named
KC705. The KC705, shown in figure 4.2, is an interesting board in many ways.
The first reason for this interest is the fact that the KC705 is already a PCIe
board: since the Kintex-7 FPGA on the KC705 board supports a PCIe
connection up to Gen2 x8, the 8-lane PCI Express edge connector performs data
transfers at the rate of 5 GT/s. The PCIe transmit and receive signal traces have
a characteristic impedance of 85 Ω ±10%, and the PCIe clock is routed as a
100 Ω differential pair. The KC705's main devices and features are listed below
([14]):
Kintex-7 28nm FPGA (XC7K325T-2FFG900C);
1GB DDR3 memory SODIMM;
PCI Express Gen2 8-lane endpoint connectivity;
SFP+ connector;
Two VITA 57.1 FMC Connectors (one HPC, one LPC);
10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1111 PHY;
128 MB Linear Byte Peripheral Interface (BPI) flash memory;
128 Mb Quad Serial Peripheral Interface (SPI) flash memory;
USB-to-UART bridge;
USB JTAG via Digilent module;
Fixed 200 MHz LVDS oscillator;
I2C programmable LVDS oscillator;
Kintex-7 FPGA
Kintex-7 is a powerful medium-range FPGA that can easily take over the work
of the two Spartan-6 devices situated on the ROD board. A comparison, in
terms of base components, between those two FPGAs is shown in figure 4.3.
The Xilinx Kintex-7 XC7K325T-2FFG900 on the KC705 board has the
following features ([15]):
Advanced high-performance FPGA logic based on real 6-input lookup
table (LUT) technology configurable as distributed memory;
High-performance DDR3 interface supporting up to 1866 Mb/s;
High-speed serial connectivity with 16 built-in gigabit transceivers (GTX)
running from 600 Mb/s up to a maximum rate of 12.5 Gb/s, offering a
special low-power mode optimized for chip-to-chip interfaces;
A user configurable analog interface (XADC), incorporating dual 12-bit
analog-to-digital converters (ADC) with on-chip temperature and supply
sensors;
Powerful clock management tiles (CMT), combining phase-locked loop
(PLL) and mixed-mode clock manager (MMCM) blocks for high precision
and low jitter;
Integrated block for PCI Express (PCIe), for up to x8 Gen2 Endpoint and
Root-Port designs;
500 maximum user I/Os (excluding GTX) and 16 Mb of Block RAM
(BRAM).
4.2.2 ZC702
As stated before, the second demo board that was taken as a starting point is the
ZC702, shown in figure 4.4. In chapter 1 it was seen that the ROD board uses a
Virtex-5 FPGA as its master device. This FPGA implements a hard PowerPC
processor. A hard (or soft) processor allows writing software in C or C++ that
can be run and changed without touching the firmware. The ZC702 Xilinx demo
board was chosen because its FPGA, the Zynq-7000, embeds a hard ARM
processor. Hereafter the main features of the ZC702 demo board are listed
([16]):
Zynq-7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex-
A9 MPCore hard processors;
1 GB DDR3 component memory (Micron MT41J256M8HX-15E);
10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1116R PHY;
128 Mb Quad SPI flash memory;
USB-to-UART bridge;
USB JTAG interface using a Digilent module;
Two VITA 57.1 FMC LPC connectors;
Fixed 200 MHz LVDS oscillator;
Figure 4.3: Comparison between Kintex-7 and Spartan-6
Figure 4.4: ZC702 demo board
Zynq-7000 FPGA
The Zynq-7000 FPGA consists of an integrated Processing System (PS) and
Programmable Logic (PL). The PS integrates two ARM Cortex-A9 MPCore
processors with a frequency up to 667 MHz, the AMBA bus, internal memories,
external memory interfaces, and various peripherals including USB, Ethernet,
and many others. The PS runs independently of the PL and boots at power-up
or reset.
4.3 Pixel-ROD overview
The analysis of the two previous boards was essential in order to design the new
Pixel-ROD board. As stated before, it was decided to design the Pixel-ROD
from those two boards, by merging them into a more complex one. While
merging the KC705 with the ZC702, many features had to be removed, since
they were not useful for a readout board, while other features had to be
redesigned as they needed to be shared among all the hardware of the
Pixel-ROD.
The main features removed from the KC705 were the LCD display, a few GPIO
buttons and LEDs, the SD card reader and the HDMI port. As for the features
removed from the ZC702, the SD card reader, HDMI port, GPIO buttons and
LEDs were removed, along with one of the two LPC FMCs, the USB port and
the PMOD connectors. The removed features left space on the board for more
useful ones, for example busses between the two FPGAs, which are essential in
order to achieve a full "Master-Slave" architecture as desired. As a result of this
merge, the principal devices on the Pixel-ROD board are:
Kintex-7 28nm FPGA (XC7K325T-2FFG900C);
Zynq 7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex-A9
MPCore;
2 GB DDR3 memory SODIMM (Kintex DDR);
1 GB DDR3 component memory (Micron MT41J256M8HX-15E, Zynq
DDR3);
PCI Express Gen2 8-lane endpoint connectivity;
SFP+ connector;
Three VITA 57.1 FMC Connectors (one HPC, two LPC);
10/100/1000 tri-speed Ethernet with Marvell Alaska PHY;
Two 128 Mb Quad SPI flash memory;
Two USB-to-UART bridges;
USB JTAG interface (using a Digilent module or header connection);
Two fixed 200 MHz LVDS oscillators;
I2C programmable LVDS oscillator;
4.3.1 New Features
As already mentioned, the Pixel-ROD implements redesigned or totally new
features compared to the two single boards. These are: busses between the two
FPGAs, a differential clock, a new JTAG chain and the power-up chain. There
are three main types of busses: the first is a 21-bit differential bus that runs
between the two FPGAs and provides a high-speed communication oriented to
achieving a ROD-like "Master-Slave" architecture. The second bus is a
single-bit differential line to share a common clock between Zynq and Kintex.
The last bus is a 5-bit general purpose, single-ended bus. The JTAG chain was
modified in several ways: first of all, a 12-pin (3x4) header was added in order
to be able to exclude the Kintex from the JTAG chain. The second, big change
made to the JTAG chain is the introduction of an internal JTAG between Zynq
and Kintex, in order to be able to program the slave FPGA from the master one.
The power supply stage had to be reinvented: since the board was designed to
be plugged inside a computer, the simple merge of the two supply stages of the
two demo boards would have been too space consuming ([17]). The result is
shown in figure 4.5.
Figure 4.5: The Pixel-ROD board
4.3.2 Layers' specifics
The electrical and mechanical characteristics of a board are defined by two
parameters: the stack-up, which defines the thickness and function of each layer,
and the dimensions of the board itself. During the process of merging the two
Xilinx demo boards, the first thing that proved impossible was a complete
one-to-one mapping, since the total number of layers had to be no more than 16.
This limit is dictated by the PCIe standard, which requires a determinate
thickness to fit the PCIe slot. In order to reach the requested level of insulation
and to reduce cross-talk phenomena, signal layers are alternated with ground
layers, while power planes are placed in the innermost layers of the board, as
shown in figure 4.6.
Figure 4.6: Pixel-ROD's stack-up
The size constraint was imposed by the PC case the board is designed for. It is
important to remember that this board is designed to be inserted inside a TDAQ
computer. That imposed a maximum width of 30 cm (plus a little space for the
connectors), while the height parameter was left free in order to have enough
area for all the devices ([17]). This merge and these space limits resulted in a
complex layout for each layer, as can be seen in figure 4.7.
Figure 4.7: One of the 16 layers of the Pixel-ROD
Chapter 5
Pixel-ROD test results
The complexity of modern electronics requires appropriate and careful testing
and a hardware wake-up phase, especially in the prototyping phase: even a
good, well engineered project could result in a malfunctioning or even broken
board, due to the limits of the constructive process. By "hardware wake-up" are
here intended all the actions to be performed in order to ensure that the
hardware is working correctly when the board is powered up for the first time.
This first step is needed to exclude major board faults and to make sure that the
hardware can be correctly configured and the firmware can actually be tested.
The work that has been done can be divided into 2 stages: the correct
configuration of the power-up devices, and the validation of all the other
devices and functionalities by implementing custom firmware (and software) to
verify the expected performance. Before the work of this thesis, all the devices
had been tested except the FMC and PCIe connections. The work accomplished
in this thesis was the test of the PCIe interface (and the extension of the
memory test, since only half of the memory had been validated), while the FMC
connectors are still under test: only the HPC one has been validated so far.
5.1 Tests
In this paragraph the tests already passed will be briefly discussed, in order to
give a complete picture of the situation. Tests already passed include: the
power-up supply test, the Kintex-Zynq internal bus test, the memory and UART
tests of both Zynq and Kintex, and the SFP and GBTX tests. This last test won't
be described; for more information see [4].
5.1.1 Power-Up supply test
The more sophisticated components on the board, like the FPGAs, need a
special and well designed power-up sequence in order to work correctly. This
task is accomplished by the power-up stage, composed of several UCD9248
programmable switching controllers from Texas Instruments. While testing this
stage, four errors were identified: the first two errors were due to the digital and
analogue grounds that were accidentally left floating on two different
components. The third fault was caused by a too aggressive sectioning of the
power supply rail. The last error was caused by the swap of three signals in the
design phase ([4]). Once these errors had been resolved and the UCD9248
stack had been correctly programmed, the board was officially turned on in
order to proceed with the other tests. An example of the power-up sequence can
be seen in figure 5.1.
5.1.2 Kintex-Zynq internal bus test
In order to test the correct programmability via JTAG, the 200 MHz clock IC, and the correct functioning of the 21-bit differential bus that connects Kintex and Zynq (named KZ-Bus), a simple test was designed: at every clock pulse the Zynq adds 1 to an internal 21-bit counter and puts the value on the KZ-Bus. The Kintex, on the other side, simply subtracts the old bus value from the newly read one; if the difference is equal to 1, a GPIO pin is pulled to 0, otherwise the same GPIO pin is pulled to 1. This means that when the Zynq count restarts and all the lanes of the bus pass from 1 to 0, a pulse appears on the designated GPIO pin. A periodic pulse is therefore seen on the GPIO pin and, knowing the clock frequency, its period can be calculated to be about 10.5 ms, as can be seen in figure 5.2.

Figure 5.1: Power-up sequence.

Figure 5.2: GPIO pin analyzed with the oscilloscope while the described test is running.
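The quoted ~10.5 ms figure follows directly from the counter width and the clock frequency: a 21-bit counter clocked at 200 MHz wraps every 2^21 cycles. A quick C check of this arithmetic (our own sketch, not part of the thesis code):

```c
#include <assert.h>

/* Expected period of the pulse on the GPIO pin: the Zynq counter wraps
 * every 2^21 clock cycles, and a pulse appears at each wrap-around.
 * Values taken from the text: 21-bit counter, 200 MHz KZ-Bus clock. */
static double kzbus_period_ms(void) {
    const double f_clk = 200e6;               /* clock frequency in Hz */
    return (double)(1u << 21) / f_clk * 1e3;  /* period in milliseconds */
}
```

Evaluating `kzbus_period_ms()` gives about 10.49 ms, matching the ~10.5 ms period observed on the oscilloscope in figure 5.2.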
5.1.4 Kintex and Zynq UART and memory test
All the tests that involved programming one or both FPGAs were accomplished with the help of the Vivado Design Suite, a development environment for Xilinx FPGAs that allows synthesizing and implementing HDL designs. While for simple tests, like the KZ-Bus test, it is possible to write the code manually in VHDL or Verilog, that is not feasible for more complex designs, at least not in this phase. In order to help the designer, Vivado offers a graphic environment (the IP integrator) where complex blocks can be dropped on a canvas and connected, mostly using the AXI bus, one of the buses of the AMBA (Advanced Microcontroller Bus Architecture) family. For example, it is possible to implement a soft processor, called Microblaze, that can be programmed in C or C++. It is also possible to use a UART block that takes care of the UART connection, a Memory Interface Generator (MIG) that interfaces the FPGAs with the on-board RAM, and many other blocks. It is important to note that all the firmware tested on the Kintex was first tested on the KC705 demo board because of the extreme similarity of the two boards. This was useful because if a firmware works fine on the KC705 and does not work on the Pixel-ROD, then the firmware is correct and an electrical error has occurred.
Memory test
In order to test the RAM connected to the Kintex, a MIG block was used to interface with the RAM, together with a Microblaze soft processor. A C program was designed to write 1 GB of known data inside the 2 GB RAM; then, through a custom xsct terminal, it was possible to read back the content of the written area in order to verify the correct execution of the program. This program was run on both Kintex and Zynq without any problem ([4]). An example of the output of this test can be found in figure 5.3.
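The write-then-verify logic of the RAM test can be sketched as below. This is our host-side illustration, not the original Microblaze source: a plain memory buffer stands in for the DDR3 region reached through the MIG, and the address-derived pattern is an assumption, since the actual "known data" used in the test is not specified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the RAM-test logic: fill a region with known, address-derived
 * data, then read it back and count mismatches.  On the board the same two
 * loops ran on the Microblaze, writing through the MIG into the DDR3.
 * The multiplier constant is an arbitrary choice of ours. */
static size_t ram_test(uint32_t *mem, size_t words) {
    for (size_t i = 0; i < words; i++)
        mem[i] = (uint32_t)i * 0x9E3779B9u;   /* known data, derived from the address */
    size_t errors = 0;
    for (size_t i = 0; i < words; i++)
        if (mem[i] != (uint32_t)i * 0x9E3779B9u)
            errors++;                          /* any mismatch is a memory error */
    return errors;
}
```

On healthy memory `ram_test` returns 0; any non-zero count pinpoints faulty cells.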
UART test
To test the UART connectivity, a UART block and a Microblaze were implemented. The Microblaze was programmed as an echo server with a simple task: when a message is sent from the UART of a connected computer, the Microblaze of the Pixel-ROD takes this message and sends it back to the computer. This test worked fine on the Zynq but failed on the Kintex. That was due to the Ethernet driver IC that implements the link between the FPGA and the Ethernet connector: it was noticed that it was not powered, because its power-supply rail was not connected to any power source. Once this bug was fixed, the test started to work on the Kintex side of the board as well ([4]). An example of the output of this test can be found in figure 5.4.
Figure 5.3: An example of the output given by the execution of the RAM test.

Figure 5.4: An example of the output given by the execution of the ECHO test.
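The echo-server idea can be sketched in C as follows. This is an illustrative host-side version with our own naming: two pipe ends stand in for the UART receive and transmit sides that the Microblaze serviced on the board.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the echo server: whatever arrives on the receive side is sent
 * straight back on the transmit side.  On the Pixel-ROD the Microblaze did
 * this between the UART RX and TX; here two file descriptors (e.g. pipe
 * ends) stand in for them. */
static void echo(int rx_fd, int tx_fd) {
    char c;
    while (read(rx_fd, &c, 1) == 1)   /* one byte at a time, like a UART */
        write(tx_fd, &c, 1);
}
```

Feeding a string into `rx_fd` and reading it back unchanged from the other end of `tx_fd` reproduces, in miniature, the behaviour verified in figure 5.4.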
5.2 PCI-Express Validation
In order to validate the PCIe port of the Pixel-ROD presented in the previous chapters (Chapter 3), a series of tests was designed and run. It was clear from the first attempt that this kind of work needs a more complex workbench than the one available at the beginning. To test the hardware in a reliable and reproducible way, the entire test was first tried on the KC705 development board and then implemented on the Pixel-ROD; this was possible because the Kintex side of the Pixel-ROD is very similar to the KC705. The workbench was composed of the KC705 (at the beginning), an open-air PC board where a PCIe card could be plugged in, and a computer with the Vivado Design Suite to program the board. How Linux implements the drivers involved is described in section 5.2.3.
5.2.1 First Attempt
The first attempt was aimed at verifying the presence of the hardware connections. To do so, a simple test was devised, coded and programmed onto the KC705: to verify the connections between the FPGA and the connector, a VHDL design was written to transmit a simple square wave on each PCIe pin, so that each pin could then be analyzed with an oscilloscope. This test never saw the light: it was impossible for the Vivado Design Suite to compile the project. The reason is quite simple: the pins used for the PCIe connection are special-purpose pins. They are 32 pins (PCIe x8 Gen2: 8 lanes, 4 pins per lane) served by the gigabit (GTX) transceivers, which means that it is impossible to drive the pins directly; to use them it is necessary to command the transceivers instead. The first attempt was, hence, a failure, but one that led to a deeper understanding of the path to follow to validate and use the PCIe port in a more efficient way. The next steps are divided into two main sides, a PC side and an FPGA side, both needed to create a communication system used for validation and performance measurements. The work hereafter described is based on a custom driver mounted on an Ubuntu PC and a firmware programmed onto the KC705. The Ubuntu PC was totally "naked", in order to easily plug in the PCIe board and make the communication between the custom program on the PC and the custom firmware possible. All this work was made possible by the Xilinx online firmware example and driver source code, used and modified to accomplish this job ([18]), and with the help of the IP integrator (introduced in section 5.1.4).
5.2.2 XDMA IP core and firmware implementation
The work-flow to validate and measure the performance of the PCIe bus on the Pixel-ROD is hereafter described. First, a custom firmware was designed and loaded into the FPGA of the KC705 (a Kintex-7). This firmware implemented a Direct Memory Access (DMA) engine that works as a bridge from the PCIe port to the memory and vice versa. Secondly, a C program was implemented on the Ubuntu PC. This program uses the functionalities provided by a custom driver and carries out read and write operations from and to the RAM of the board plugged into the PCIe slot of the PC. In this section, the DMA IP core and its firmware implementation are described.
DMA/Bridge subsystem for PCIe
The DMA/Bridge Subsystem for PCIe (also called XDMA) is an IP block that implements a high-performance, configurable scatter-gather DMA for use with PCIe Gen2.1 and Gen3.x. The IP provides an optional AXI4 or AXI4-Stream user interface. The XDMA can be configured to be either a high-performance direct memory access (DMA) data mover or a bridge between the PCI Express and AXI memory spaces. The master side of this block issues read and write requests on the PCIe bus, and its core enables the user to perform direct memory transfers, both Host to Card (H2C) and Card to Host (C2H). The core can be configured to have a common AXI4 memory-mapped interface shared by all channels or an AXI4-Stream interface per channel. Memory transfers are specified on a per-channel basis in descriptor linked lists, which the DMA fetches from host memory and processes. Events such as descriptor completion and errors are signalled using interrupts. The core also provides a configurable number of user interrupt wires that generate interrupts to the host ([19]). This IP core can be seen in figure 5.5 along with its ports.
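For illustration, one scatter-gather descriptor of such a linked list has roughly the following 32-byte shape. The field names are ours and the exact bit layout of the control word (magic value, stop/completed/EOP flags) is omitted, so the XDMA product guide ([19]) remains the authoritative reference.

```c
#include <assert.h>
#include <stdint.h>

/* Rough shape of one XDMA scatter-gather descriptor: the host builds a
 * linked list of these in its own memory and the DMA engine fetches and
 * processes them one by one.  Field names are ours; see PG195 ([19]) for
 * the authoritative bit-level format. */
struct xdma_desc {
    uint32_t control;   /* magic value + flags (stop, completed, EOP, ...) */
    uint32_t len;       /* transfer length in bytes */
    uint64_t src_addr;  /* source address (host memory for an H2C transfer) */
    uint64_t dst_addr;  /* destination address (card memory for H2C) */
    uint64_t nxt_addr;  /* address of the next descriptor in the list */
} __attribute__((packed));
```

Each descriptor thus describes one contiguous chunk; chaining them through `nxt_addr` is what makes the engine "scatter-gather".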
Internally, the core can be configured to implement up to eight independent physical DMA engines. These engines can be mapped to individual AXI4-Stream interfaces or to a shared AXI4 memory-mapped (MM) interface towards the user application. On the AXI4 MM interface, the DMA Subsystem for PCIe generates requests and expects completions; the AXI4-Stream interface deals with data only. On the PCIe side, the DMA has internal arbitration and bridge logic to generate read and write transaction layer packets (TLPs) over the Requester Request (RQ) bus of the Integrated Block for PCIe, and to accept completions from PCIe over the Requester Completion (RC) bus.
The connection with the physical bus is made through the signals pci_exp_rxn[7:0], pci_exp_rxp[7:0], pci_exp_txn[7:0] and pci_exp_txp[7:0], while the communication with the memory goes through the M_AXI port. A series of parameters makes this IP core customizable and flexible: inside the configuration menu it is possible to set the data width, the number of lanes (2, 4 or 8), the reference clock, the type of DMA (memory-mapped or stream), the Base Address Registers (BARs) used, the number of interrupts and channels, and many other options. Before explaining the firmware implementation, another block must be briefly discussed: the Memory Interface Generator.

Figure 5.5: Symbol of the XDMA IP core with its TX and RX ports visible on the right.
Memory Interface Generator
The Memory Interface Generator (MIG) IP core ([20]) is a controller and physical layer for interfacing 7-series FPGAs, as well as other AXI4 slave devices, to DDR3 memory. Given the wide variety of DDR3 modules and components available, this IP core is very flexible and configurable. It can be seen in figure 5.6.

Figure 5.6: Example of a MIG. On the right, the DDR3 port connects the board to the physical world, while S_AXI is the port on which data are received or transmitted.
Firmware implementation
As stated above, the firmware was entirely designed using the IP integrator. It is composed of a XDMA, a Memory Interface Generator and further support logic needed to correctly connect these two blocks and make them function properly; the design of this firmware can be found in figure 5.6. It is important to remember that for every block a corresponding chunk of code is generated. The flow of the data follows the path of a classical DMA implementation: in case of a write, the address and the data arrive on pci_express_x8 in the form of PCIe TLPs, then the DMA core communicates with the MIG through the AXI interconnect, and this last block writes into the memory at the given address using the previously configured options. In case of a read, only the address arrives on pci_express_x8; the request is passed to the MIG, which replies with the requested data. In the following tests the XDMA is configured to work, on the PCIe side, as a PCIe x8 Gen2 completer. These are the highest specifications that both the KC705 and the Pixel-ROD can reach, and they give the system an ideal throughput of 4 GByte/s (see chapter 3.2). It is possible, anyway, to configure the whole system to work with lower specifications, like PCIe x4 Gen1.

Figure 5.6: Firmware obtained with the IP integrator. The input/output ports can easily be found on the right (ddr3_sdram and pci_express_x8), while on the left it is possible to see the internal signals needed to bring in the system clock and reset.
5.2.3 Linux custom driver
The firmware described above gives the user the ability to read and write from and to the memory through the PCIe protocol. Hence, once the firmware was designed and the board plugged into the naked Linux PC, software that communicates via PCIe was needed to validate and measure this protocol. This kind of software is always divided into two major parts: a device driver and a C program that uses that driver. The device driver is needed in order to obtain a mid-layer placed between the hardware and the pure software; its main task is to provide a variety of functionalities coded at a higher level of abstraction.
Linux device driver
The choice of Ubuntu as operating system was driven by the need for an open operating system of the Linux family, because, as said above, a custom driver was needed to accomplish all the tasks. In Linux, every driver is conceived as a module that can be inserted or removed at any time; each module is written in C and, in this thesis, compiled using the latest version of the GCC compiler. It is important to understand that a module, even if coded in C, does not use the classic C functions and libraries. A Linux OS can be divided into two major spaces, user space and kernel space; a module operates inside the kernel space, where it can access different libraries and the hardware. In order to have access to this program space, a C source file needs to be compiled with specific options and by specifying a precise path from which the compiler can retrieve the libraries; this path is given by the folder that contains the source code of the OS. During the process of coding and compiling both the module C code and the user program, no IDE was used: a custom Makefile was used instead, in order to have complete control over the process and the product of the compilation. An enormous variety of modules exists, each designed for different tasks; the kind of module used in this thesis is a device driver, specifically a char driver. A device driver is a module that makes available to user space a series of functions used to command the hardware beneath. As with modules, various kinds of device drivers exist; in this thesis a char driver was used. A char driver is a module that makes communication with the hardware from user space possible by creating a special virtual file to which a user program can read or write ([21]). For example, if a user-space program wants to communicate with a hardware port (a PCIe port or any other), it only needs to read or write the message on this special virtual file.
The work of a char driver is mainly the implementation of three data structures: file, inode and file_operations.

file is a structure defined in <linux/fs.h>, and it is one of the most important data structures used in device drivers. Note that it has nothing to do with the FILE pointers of user-space programs: a FILE is defined in the C library and never appears in kernel code, while a struct file is a kernel structure that never appears in user programs. The file structure represents an open file. (It is not specific to device drivers; every open file in the system has an associated struct file in kernel space.) It is created by the kernel on open and is passed to any function that operates on the file, until the last close. After all instances of the file are closed, the kernel releases the data structure. In the kernel sources, a pointer to struct file is usually called either file or filp ("file pointer") ([21]).

inode is a structure used by the kernel internally to represent files. It is therefore different from the file structure, which represents an open file descriptor: there can be numerous file structures representing multiple open descriptors on a single file, but they all point to a single inode structure.

file_operations is the structure through which a char driver sets up the connection. The structure, defined in <linux/fs.h>, is a collection of function pointers. Each open file (represented internally by a file structure) is associated with its own set of functions through a field called f_op that points to a file_operations structure. The operations are mostly in charge of implementing the system calls and are therefore named open, read, and so on. It is possible to consider the file to be an "object" and the functions operating on it to be its "methods", using object-oriented programming terminology to denote actions declared by an object to act on itself. This is one of the first signs of object-oriented programming that can be seen in the Linux kernel.
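The "object with methods" idea can be illustrated in user space with a reduced struct of function pointers. This is our own demo, mirroring struct file_operations in miniature, not code from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* User-space illustration of the file_operations idea: a struct of function
 * pointers acts as the "methods" of a device, and the kernel dispatches each
 * system call through the corresponding pointer.  The real struct in
 * <linux/fs.h> has many more operations; two are enough to show the pattern. */
struct demo_fops {
    int  (*open)(void);
    long (*write)(const char *buf, size_t count);
};

static int  demo_open(void) { return 0; }
static long demo_write(const char *buf, size_t count) { (void)buf; return (long)count; }

/* The driver fills the table once; callers only ever go through it. */
static const struct demo_fops fops = {
    .open  = demo_open,
    .write = demo_write,
};
```

Calling `fops.write(...)` instead of `demo_write(...)` is exactly the indirection the kernel performs when a user-space write lands on the driver's file.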
Custom driver
The development of a char driver intended to work with the previously described XDMA was simplified by an existing module released on the Xilinx web page ([18]). This char driver implements a basic PCI Linux driver, enriching it with PCIe-specific functions. In order to read and write data on the PCIe bus, two simple operations are available: read and write. The principal FOPS (file operations struct) is reported in figure 5.7. This structure, as can be seen, contains a series of pointers (.owner, .open, .release, etc.), each of which points to a specific function. For example, .write points to char_sgdma_write, which is defined as static ssize_t char_sgdma_write(struct file *file, const char __user *buf, size_t count, loff_t *pos);. This function is called every time a user-space write operation is performed on the file created by the char driver.

Figure 5.7: One of the FOPS of the char driver.
5.2.4 Echo Test
Once all the module C sources had been compiled and the module inserted in the OS, it was possible to start the first functional test. It is divided into two major parts: a Linux script and two C programs. The concept of the test is simple: the user writes a text string to the RAM of the board plugged into the naked PC, through the PCIe port, at a specific address; the program then reads back from the same address and compares the data read with the data written.
Script
Before starting to loop, a check is performed to find out whether the script is being executed as super user; then a special program is launched to retrieve the name of the file created by the char driver. An infinite loop is then executed: inside this loop, a string is first requested from the user, then the string is passed to a dedicated write program along with the desired write address and an offset. After the execution of the write program, a read program is called with the address, the offset, the name of the file to read and the expected string.
C source code
The C code implemented to read and write on the PCIe board consists of two programs: one used to write a given number of bytes at a given address, and one used to read with the same logic. The reason behind the choice of two separate programs instead of a single one is modularity, so that they can be reused with only a few modifications. The two programs and the script itself were obtained by analyzing and understanding the example source code that Xilinx provides with its driver ([18]). The writing part works as follows: first, all the arguments passed by the script are retrieved using specific functions, and the information they contain is extracted and saved (the name of the file, the string to write, etc.). The string is then turned into an aligned, dynamically allocated vector, and the file is opened using its name. Using the lseek function, the program selects the address where to write; then, with a simple write, the program writes to the file. It is important to remember that all these functions (lseek, write and read) are system calls, not normal functions defined by the programmer; such system calls, when used on this file, invoke the functions associated in the previously seen FOPS. The read program works in the same way, except that it uses the system call read instead of write and compares the string read with the expected one.
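The write program can be sketched as below. This is our reduced version of the idea, with a hypothetical helper name; it works on any path, but on the bench the path would be the character-device file created by the driver, so the same open/lseek/write calls end up in the driver's FOPS.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the write half of the echo test: open the device file, lseek
 * to the target address, write the buffer.  With the XDMA char device as
 * `path`, the write() below is dispatched to char_sgdma_write through the
 * FOPS; with a regular file it simply writes to disk, which makes the
 * sketch runnable anywhere.  Helper name `pcie_write` is ours. */
static long pcie_write(const char *path, off_t addr, const void *buf, size_t n) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;
    long rc = -1;
    if (lseek(fd, addr, SEEK_SET) == addr)  /* select the target address */
        rc = (long)write(fd, buf, n);       /* system call -> driver .write */
    close(fd);
    return rc;
}
```

The read program mirrors this with read in place of write, followed by a memcmp against the expected string.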
The test was first carried out on the KC705 board programmed with the firmware seen above. Once it worked on the demo board, validating the software and the firmware, everything was moved to the Pixel-ROD, where it worked flawlessly. The result of this test was of fundamental importance: its success meant that the hardware was functioning, even without the speed figures or the other data collected in the following tests.
5.2.5 Buffer and throughput Test
In order to measure the effective throughput of the board, it was first necessary to understand the most correct way to do it: when writing buffers of different sizes, the total time needed to transfer a fixed amount of data changes by two orders of magnitude in the worst case (the time was measured with a function that calls the system clock). This is due to two main reasons: the packet-oriented way the PCIe standard transfers data, and the implementation of a DMA that works more efficiently when it is in charge of large transfers. Indeed, filling a big buffer and carrying out a burst transfer on the AXI bus towards the MIG is more efficient than repeating a small transfer many times. In order to find the optimal buffer size, a simple test was designed: the concept is to transfer the same repeated sequence of data (the whole ASCII table, the integers from 0 to 255) and the same total amount of bytes, 100 MBytes, each time with a different buffer size. The first buffer has a size of 1 KByte, the second 2 KByte, and so on in powers of 2 up to the last, an 8 MByte buffer.

The firmware used is the same for all three tests. The C code on the PC is very similar to the one used for the echo test: it first instantiates and initializes all the buffers. For each buffer the program loops in a block where the current buffer is first repeatedly written to the memory of the board at increasing addresses until a complete 100 MBytes write is done; secondly, the whole transfer just written is read back with a reading buffer of the same size and compared with its copy on the PC. This last step is not used for the performance measurement (because the comparison of each buffer read requires a lot of time) but for error detection: the presence of memory failures could not be excluded.
The test, applied first on the KC705, revealed a throughput peak when using buffers of 2 MByte. This peak corresponds to a data rate of about 2.5 GB/s, as can be seen in figure 5.8. Even if this is quite far from the theoretical throughput of 4 GB/s, the result is correct and as expected. To understand the reason for this difference, three things are essential: first, 4 GB/s is the total throughput, which includes the headers and tails of the PCIe packets, not only their data payload. Second, everything, from the firmware to the driver, is intended to validate the PCIe port, not to achieve fast transactions. Third, 4 GB/s is the peak data rate of the PCIe x8 Gen2 protocol; the effective data rate depends on many variables, first of all the speed of the RAM on the board. Once this software ran flawlessly on the KC705, it was carried unchanged onto the Kintex of the Pixel-ROD. The test run on the two boards revealed the same throughput.
Figure 5.8: The output produced by the throughput test. It is easy to see that every time the buffer grows, the time needed to write 100 MBytes decreases, while the read operation is dominated by the time used for the comparison.

Following these conclusions, the next test was intentionally limited to a maximum buffer of 2 MBytes.
5.2.6 BER Test
The last test carried out is a pseudo-BER test. In order to estimate the BER (Bit Error Rate), a simple but effective test was produced. The core concept is to write and read a large amount of data, in the same way as in the previous test, over a long period of time. The test, executed first on the KC705, consists of a C program that, in a manner similar to the throughput test, writes a predetermined sequence to the RAM of the board, then reads it back, comparing the results and saving them to a file. There are, anyway, several differences between the two tests. First of all, the throughput test reads and writes only 100 MBytes for each buffer in order to find the "fastest" buffer; this second test, instead, writes and reads 1 GB (all the RAM of the KC705) at a time, using only the 2 MByte buffer (the fastest one, as seen before), and runs relentlessly until the user stops it. The second main difference is the sequence used: during the throughput test, the data sequence was of no interest, but to calculate a BER that takes all possible factors into account, the sequence used is of extreme importance. This test indeed calculates a cumulative BER that accounts for both PCIe and on-board RAM failures. In order to provoke all the possible errors, a special sequence was used: it re-writes each memory cell, within a short time, with the bit configuration at maximum Boolean distance from the previous one. For example, if during the first iteration the program writes 0x00 (in hexadecimal) on the first cell, at the second iteration it will write 0xFF, and so on for all the cells of the RAM. This pattern is the one used by MemTest86 for the test of high-end PC RAM ([22]).
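The pattern logic can be sketched like this; it is our reduced, in-memory illustration, while on the board each pass travelled over the PCIe into the DDR3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the stress pattern: each pass writes, cell by cell, the bitwise
 * complement of what the previous pass wrote (0x00, then 0xFF, then 0x00,
 * ...), i.e. the value at maximum Hamming distance from the old content. */
static void fill_pass(uint8_t *mem, size_t n, unsigned pass) {
    uint8_t v = (pass % 2 == 0) ? 0x00 : 0xFF;
    for (size_t i = 0; i < n; i++)
        mem[i] = v;
}

/* Read-back half of a pass: count the individual bit errors against the
 * value expected for this pass.  On the bench the counts were logged to
 * a file instead of being returned. */
static size_t check_pass(const uint8_t *mem, size_t n, unsigned pass) {
    uint8_t expect = (pass % 2 == 0) ? 0x00 : 0xFF;
    size_t bit_errors = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t diff = mem[i] ^ expect;        /* flipped bits in this cell */
        while (diff) { bit_errors += diff & 1u; diff >>= 1; }
    }
    return bit_errors;
}
```

Accumulating `bit_errors` over the total number of bits transferred gives the cumulative BER estimate.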
As seen before, the maximum speed achieved in write is 2.5 GB/s, but, since every buffer read must be compared with the expected one, the average speed dropped below 800 MB/s. The BER expected was in any case of the order of 10^-21. Considering also that, in order to obtain a good data interpretation, the registration of at least 20 errors is needed, it is easy to estimate the time of execution:

t ≈ 20 errors / (10^-21 error/bit × 6.4×10^9 bit/s) ≈ 3×10^12 s ≈ 10^5 years

That is obviously absurd: to test such a BER, other methods are used. This test is instead used to ensure that no error was introduced while sizing the parameters of the various interconnections. The test was first run on the KC705 in order to determine its stability and reliability; then it was implemented on the Pixel-ROD, this time with a small change: instead of writing only 1 GB at a time, the source code was changed to write and read the entire 2 GB RAM of the Pixel-ROD.

The test was executed for nearly 10 hours, transferring 24 TByte with no error. This result can be interpreted as a mark that the effective BER is < 10^-14 and that the sizing of the connections is solid and well made.
Conclusion and future development

At present (July 2017) two prototypes of the Pixel-ROD board have been produced and, with the work of this thesis, the PCIe validation and tests now join the already tested peripherals and devices. All the tests are carried out in the Electronic Design Laboratory of INFN (Istituto Nazionale di Fisica Nucleare) and of the Physics and Astronomy Department of Bologna University. Even if the validation work is nearing completion, some more tests still need to be designed and performed in order to reach a full validation of both prototypes. Even though this is a first release, the Pixel-ROD board has already shown during the tests its capability to perform fast data acquisition and processing.

The work described in this thesis focused on understanding the PCIe protocol and on implementing firmware and software on the Pixel-ROD board. This allows validating the peripherals, making the board a good candidate for data acquisition in high-energy physics thanks to its reliability and bandwidth. Future tests will therefore aim to integrate the Pixel-ROD within a simple data-acquisition chain implementing the well-known standard of the ATLAS Pixel Detector, or even the AURORA protocol of the CMS experiment.

In conclusion, looking towards the LHC upgrades and beyond, the Pixel-ROD board, or its future upgrades, represents an important step toward the modernization of data acquisition, especially for the readout of pixel detectors, and makes the Pixel-ROD a good candidate to be part of an electronic readout chain for tracking systems.
Ringraziamenti
Questa tesi è frutto di una bellissima collaborazione col dipartimento di Fisica e
Astronomia dell'università di Bologna, Vorrei quindi ringraziare coloro senza i quali,
tutto questo non sarebbe stato possibile.
Ringrazio innanzitutto il professor Gabrielli, Relatore, ed il professor Mauro Villa,
Correlatore, per il loro essenziale supporto in questo importante periodo.
Ringrazio poi tutti i colleghi dell'università di Bologna e gli amici di una vita che mi
hanno sostenuto lungo questo percorso, con particolare attenzione a Giulio Masinelli che
ha condiviso con me la strada.
Vorrei infine ringraziare la mia famiglia e la mia ragazza, le due bussole che mi hanno
guidato fino a qui e con le quali non ho ancora finito di viaggiare.
Bibliography
[1] CERN. [Online] https://home.cern/.
[2] ATLAS website. [Online] http://atlas.cern/.
[3] The ATLAS Collaboration, ATLAS Detector and Physics Performance Technical Design Report.
[4] C. Preti, A PCIe-based readout and control board to interface with new-generation detectors for the LHC upgrade, 2016.
[5] G. Balbi, D. Falchieri, A. Gabrielli, L. Lama, R. Travaglini, S. Zannoli, IBL ROD board rev C reference manual, November 2012, https://espace.cern.ch/.
[6] Béjar Alonso I. et al., High-Luminosity Large Hadron Collider (HL-LHC) Preliminary Design Report, CERN-2015-005, 17 December 2015, CERN, Geneva.
[7] National Instruments, http://www.ni.com/white-paper/3767/en/.
[8] Ravi Budruk, Don Anderson, Tom Shanley, PCI Express System Architecture, 2008.
[9] PCI-SIG, https://pcisig.com/specifications/pciexpress/.
[10] M. Dinesh Kumar, Implementation of PCS of Physical Layer for PCI Express, 2009.
[11] Rick Eads, Keysight Technologies, PCI Express Electrical Basics, 2014.
[12] National Instruments, http://www.ni.com/white-paper/5712/en/.
[13] VITA's website, http://www.vita.com/page-1855175#G.
[14] Xilinx, KC705 Evaluation Board, UG810, 26 August 2016, https://www.xilinx.com/support/documentation/boards_and_kits/kc705/ug810_KC705_Eval_Bd.pdf.
[15] Xilinx, 7 Series FPGAs Overview, DS180 (v1.17), 27 May 2015, https://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf.
[16] Xilinx, ZC702 Evaluation Board for the Zynq-7000 XC7Z020 All Programmable SoC, UG850 (v1.5), 4 September 2015, https://www.xilinx.com/support/documentation/boards_and_kits/zc702_zvik/ug850-zc702-eval-bd.pdf.
[17] L. Lama, G. Balbi, D. Falchieri, G. Pellegrini, C. Preti, A. Gabrielli, A PCIe DAQ board prototype for Pixel Detector in High Energy Physics, 12-01-C01073, Journal of Instrumentation, 2017.
[18] Xilinx's website, https://www.xilinx.com/support/answers/65444.html.
[19] Xilinx, DMA/Bridge Subsystem for PCI Express (PG195), https://www.xilinx.com/support/documentation/ip_documentation/xdma/v3_1/pg195-pcie-dma.pdf.
[20] Xilinx, AXI Traffic Generator, PG125 (v2.0), 6 April 2016, https://www.xilinx.com/support/documentation/ip_documentation/axi_traffic_gen/v2_0/pg125-axi-traffic-gen.pdf.
[21] Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman, Linux Device Drivers, third edition.
[22] MemTest86 website, http://www.memtest86.com/.