
ALMA MATER STUDIORUM ● UNIVERSITÀ DI BOLOGNA

    Scuola di Ingegneria e Architettura

    Dipartimento di Ingegneria dell’Energia Elettrica

    e dell’Informazione “Guglielmo Marconi” - DEI

    Corso di Laurea in Ingegneria Elettronica e Telecomunicazioni

    Tesi di Laurea

    in

    Fisica Generale T-2

    Testing Platform for a PCIe-based readout and

    control board to interface with new-generation

    detectors for the LHC upgrade

    Anno Accademico 2016/2017

    Sessione I

    Relatore:

    Chiar.mo Prof. Mauro Villa

    Correlatore:

    Chiar.mo Prof. Alessandro Gabrielli

    Candidato:

    Giulio Masinelli

“Cui dono lepidum novum libellum

    arida modo pumice expolitum?

    Corneli, tibi: namque tu solebas

    meas esse aliquid putare nugas.”

Catullus, Carmen I

Abstract

This thesis is aimed at the understanding and validation of a PCIe board, named Pixel-ROD and designed by the electronics laboratory of the Istituto Nazionale di Fisica Nucleare and by professors of the Department of Physics and Astronomy, to meet the needs of the ATLAS experiment at CERN, in Geneva.

The thesis work consisted of the development of a testing platform to be used to verify the correct functionality of the board and its ability to respond to repeated, continuous stimuli, with particular emphasis on the memory subsystem and the PCIe interface. The board was designed as a replacement for the off-detector electronics currently installed for the ATLAS experiment, made up of VME boards known as Back of Crate (BOC) and Read Out Driver (ROD). The choice of the PCIe standard follows the now well-established trend of employing FPGA boards to speed up the real-time computation performed on PCs. These boards usually consist of dedicated PCBs and need to be connected to the computer motherboard through a suitable interface. Since this interface generally constitutes the bottleneck of such systems, latest-generation demo boards communicate through the high-performance PCIe interface. Moreover, PCIe boards can be plugged directly into the motherboard of PCs dedicated to data acquisition, thus allowing a faster response (having direct access to the main resources of the host machine) and an easier installation.

This thesis provides a brief overview of the environment for which the board was designed. In particular, after this introduction, the first chapter presents the ATLAS experiment, focusing on the detectors closest to the interaction point, where the collision between the accelerator's proton beams takes place. The second chapter describes the current off-detector electronics as well as its main limitations. The third chapter outlines the PCIe standard, focusing on the most important aspects that had to be dealt with during the testing phase. The fourth chapter provides the main motivations behind the design choices, such as the decision to build the board as the union of two demo boards, adopting the proven Master-Slave architecture of the BOC-ROD pair. Finally, the fifth chapter describes the testing platform that was developed and all the tests the board was subjected to.

The work carried out was mainly focused on the development of an intensive test involving the memories installed on the board, and on the creation of a testing platform providing the hardware needed to verify the possibility of performing data transfers between the host PC and the on-board memories through the PCIe interface.

Index

1 The LHC accelerator and the ATLAS experiment
1.1 The Large Hadron Collider
1.1.1 Machine Parameters
1.1.2 Main experiments at LHC
1.2 The ATLAS detector
1.2.1 The coordinate system of ATLAS
1.3 The Layout of the ATLAS detector
1.3.1 Inner Detector
1.3.2 Semiconductor Tracker
1.3.3 Pixel Detector
1.4 Structure of Pixel Detector
1.4.1 IBL
1.4.2 Sensors for IBL
1.4.3 FE-I4
2 Current off-detector electronics for IBL
2.1 IBL electronics
2.2 IBL BOC
2.2.1 BOC Control FPGA
2.2.2 BOC Main FPGA
2.3 IBL ROD
2.3.1 ROD Master
2.3.2 ROD Slaves
2.4 TIM
2.5 System limitations
3 PCIe specifications and usage model
3.1 Architecture of a PCIe system
3.2 Interconnection
3.3 Topology
3.4 Electrical specifications and I/O Lines
3.5 PCIe Address Space
3.6 Device Tree
3.6.1 Linux PCIe Subsystem
3.7 The three Layers of the protocol
3.7.1 Transaction Layer
3.7.2 Data Link Layer
3.7.3 Physical Layer
4 Pixel-ROD board
4.1 Xilinx KC705
4.1.1 Why PCIe
4.1.2 Kintex-7 FPGA
4.2 Xilinx ZC702
4.3 The Pixel-ROD board
4.3.1 Space constraints
5 Pixel-ROD test results
5.1 Power supply
5.2 Board interfaces and memory
5.2.1 Vivado Design Suite
5.3 Memory test
5.3.1 Vivado IP Integrator and AXI4 Interface
5.3.2 Architecture of the memory test
5.4 Hardware Testing Platform
5.5 PCIe interface test
5.5.1 Architecture of the PCIe interface test
Conclusions
Bibliography


    Introduction

This thesis concerns the understanding and validation of a PCIe board, named Pixel-ROD and developed by the Electronic Design Laboratory of INFN (Istituto Nazionale di Fisica Nucleare) and by professors of DIFA (Dipartimento di Fisica e Astronomia) to meet the needs of the ATLAS experiment at CERN, in Geneva.

The thesis work consisted of the development of a test bench used to verify the board's full functionality and its response to stressful stimuli, with emphasis on the memory subsystem and the PCIe interface. The board was designed as an upgrade for the current off-detector electronics at ATLAS, replacing the previous series of readout boards, which are mainly VME boards known as Back of Crate (BOC) and Read Out Driver (ROD). The choice of a PCIe card follows the growing trend of exploiting FPGA boards to speed up real-time calculations performed on a PC. Such boards are usually mounted on an external PCB, so they must be connected to the motherboard of the host computer via an appropriate interface. Since this interface is often the bottleneck of such systems, new-generation FPGA evaluation boards communicate via PCIe. But the advantages of PCIe go further: PCIe boards can be plugged directly into the motherboard of ATLAS TDAQ PCs, providing a faster response (through direct access to the main resources of the PCs) and an easier installation.

The thesis is intended to provide a brief overview of the environment the board was developed to be installed in. After this Introduction, Chapter 1 summarizes the ATLAS experiment, focusing on the detectors closest to the interaction point. Chapter 2 describes the off-detector electronics the board is meant to replace, concentrating on their limitations. Chapter 3 outlines the PCIe protocol, focusing on the aspects crucial to board validation. Chapter 4 illustrates how the board was developed as a merging of two demo boards, in order to exploit the Master-Slave architecture of the BOC-ROD pair and to narrow down the possible sources of mistakes. Finally, Chapter 5 describes the testing platform and the tests that have been carried out.

The work of this thesis was mainly focused on thoroughly testing the board's memory subsystem, in order to validate the device's response to stressful stimuli, and on developing a testing platform providing the hardware needed to fully verify the ability to perform PCIe data transactions with a host PC, together with the related performance.


    Chapter 1

    The LHC accelerator and the ATLAS experiment

    In this chapter, a brief overview of the LHC accelerator complex and of the ATLAS

    experiment is presented. More details are provided for the ATLAS pixel detector, for which

    the readout board described in the next chapters has been designed.

1.1 The Large Hadron Collider

The Large Hadron Collider (LHC) is the largest and most powerful particle accelerator on

    Earth. It is located in the circular tunnel which housed the LEP (Large Electron-Positron

    collider) in Geneva at the border between France and Switzerland and it is managed by the

    European Organization for Nuclear Research, also known as CERN (Conseil Européen

pour la Recherche Nucléaire). CERN is a collaboration between 20 European member

    states and non-member states from the rest of the world. The accelerator is approximately

    27 km in circumference and lies 100 m below the ground [1]. There are four interaction

    points where protons or lead ions are forced to collide at high energies (see figure 1.1). At

    the four interaction points, gigantic experiments (ALICE, ATLAS, CMS and LHCb) are

    set up to record every detail of the particle collisions. The acceleration chain consists of

    several steps (see figure 1.2). Indeed, protons are not directly inserted into the beam pipe

    of the main ring, but they begin their acceleration in the linear accelerator LINAC 4, where

    they are accelerated from rest to an energy of 50 MeV. After being sent to the Booster to

    be accelerated to 1.4 GeV, they are injected into the Proton Synchrotron to be accelerated

    to 25 GeV. The last acceleration before the Large Hadron Collider is provided by the Super

    Proton Synchrotron where they are accelerated to 450 GeV. Finally, in the LHC, protons

    are accelerated from 450 GeV to 6.5 TeV.

    Figure 1.1: LHC overview.


1.1.1 Machine Parameters

The nominal maximum collision energy for protons in the LHC is 14 TeV; however, the

    accelerator is now running at a collision energy of 13 TeV, 6.5 TeV per proton beam. At

    this level of energy, protons move with a speed very close to the speed of light in vacuum.

The proton beams consist of 2808 bunches of protons. Each bunch contains about 10¹¹ particles, so that many proton collisions can happen at each bunch crossing. The

    protons are held in the accelerator ring by 1232 superconducting dipole magnets that create

a maximum magnetic field of 8.3 T. Along the beam line, 392 quadrupole magnets are used

    to focus the particles in the interaction points and defocus them after. The Large Hadron

Collider is built to have a peak instantaneous luminosity of L = 10³⁴ cm⁻²s⁻¹ (the ratio

    of the number of events detected in a certain time to the interaction cross-section).

Figure 1.2: the accelerator complex at CERN: LHC protons start from LINAC4 and are accelerated by the Booster, the Proton Synchrotron and the Super Proton Synchrotron before entering the LHC.
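As a quick numerical illustration of "very close to the speed of light" (an addition to the text, assuming the standard proton rest energy of about 0.938 GeV):

```python
import math

# Illustration (not from the thesis): how close a 6.5 TeV proton is to c.
E_GeV = 6500.0                     # LHC beam energy per proton (from the text)
m_GeV = 0.938                      # proton rest energy (standard value)
gamma = E_GeV / m_GeV              # Lorentz factor, ~6930
beta = math.sqrt(1.0 - 1.0 / gamma**2)
print(f"gamma = {gamma:.0f}, v/c = {beta:.10f}")  # v/c = 0.9999999896
```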


1.1.2 Main experiments at LHC

As already stated, at the four interaction points, where protons or lead ions are collided,

    there are four detectors (fig. 1.1):

• ATLAS, A Toroidal LHC ApparatuS: one of the two general-purpose detectors at the LHC, it was designed for the search for the Higgs boson and it investigates a wide range of physics problems, ranging from searches for Beyond-Standard-Model physics (Dark Matter, Supersymmetry) to precision measurements of the Standard Model parameters.

    • CMS, Compact Muon Solenoid: the second general purpose detector at LHC. It

has the same purpose as the ATLAS experiment (same scientific goals), although it

    uses different technical solutions and different magnet-system designs.

• LHCb, Large Hadron Collider beauty: specialized in investigating the differences between matter and antimatter by studying particles containing the beauty quark.

    • ALICE, A Large Ion Collider Experiment: a heavy ion collision detector. It

studies the properties of strongly interacting matter at extreme energy densities,

    where the matter forms a new phase, called Quark-Gluon Plasma.

1.2 The ATLAS detector

The ATLAS experiment is a general-purpose particle detector installed at the LHC which

    is used to study various kinds of physical phenomena. It is 46 m long, 25 m high, 25 m

    wide and weighs 7000 tons [2]. It is operated by an international collaboration with

    thousands of scientists from all over the world: more than 3000 scientists from 174

    institutes in 38 countries work on the ATLAS experiment. ATLAS has a cylindrical

symmetry, as do the detectors of CMS and ALICE.

1.2.1 The coordinate system of ATLAS

To properly and precisely describe the ATLAS detector, a coordinate system is introduced. In this reference frame, collision events are described using right-handed spherical

    coordinates where the origin is set in the nominal interaction point, the z-axis along the

    beam direction and the x-y plane transverse to it (the positive x-axis pointing to the center


    of the LHC ring and the positive y-axis

    pointing upwards, see figure 1.3) [3].

Therefore, a vector can be described using the azimuthal angle φ, the polar angle θ and the radius r, as shown in figure 1.3. But instead of using θ, it is usual to take the pseudorapidity η, defined as η = −ln tan(θ/2). Its value ranges from −∞, corresponding to a vector along the negative semi-axis of z, to +∞, corresponding to a vector along the positive semi-axis of z. So, the value η = 0 corresponds to a vector in the x-y plane.
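As a small worked example (an addition, not part of the thesis text), the definition above can be evaluated directly:

```python
import math

def pseudorapidity(theta: float) -> float:
    """Pseudorapidity eta = -ln(tan(theta/2)) for a polar angle theta in radians."""
    return -math.log(math.tan(theta / 2.0))

# theta = 90 degrees lies in the x-y plane, so eta = 0;
# small angles (close to the beam axis) give large |eta|.
for deg in (90, 45, 10, 1):
    print(deg, round(pseudorapidity(math.radians(deg)), 3))
```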

Figure 1.3: LHC's ATLAS well and its coordinate system.

1.3 The Layout of the ATLAS detector

Figure 1.4: ATLAS detector overview

The structure of the ATLAS detector is illustrated in a cutaway view in Figure 1.4. It is built with a cylindrical symmetry around the interaction point and geometrically divided into a


    barrel region (low η region), two endcap regions (medium η region) and two forward

    regions (high η region). The full detector is made up of several groups of sub-detectors,

    designed to identify and record the particles coming out of the proton-proton collisions.

    From inwards to outwards, these sub-detectors form three systems: the Inner Detector

    (ID), the Calorimeters and the Muon Spectrometer. The Inner Detector is surrounded by a

central solenoid that provides a 2 T magnetic field, while the barrel and end-cap sections of the Muon Spectrometer are surrounded by toroidal magnets that provide fields of 0.5 T and

    1 T, respectively. Particles produced from the proton-proton collisions firstly arrive in the

    Inner Detector which covers the region of |η| < 2.5. The charged particles interact with

different layers of the detector, depositing part of their energy there. This energy is then transformed into electronic signals, and the resulting hits are used to reconstruct the particle

    trajectory. The momenta and charge of these charged particles can be measured by the

    bending of their trajectories inside the 2 T magnetic field provided by the central solenoid.

    As the innermost layer of ATLAS, the ID provides essential information, such as the

recognition of primary and secondary vertices from which charged particles are seen to emerge.

    The ID is therefore designed to have a high granularity and a high momentum measurement

    resolution. To meet the performance requirement, semiconductor detectors are used for

    precise measurement close to the beam (the Pixel Detector and the Semiconductor Tracker)

    and a noble gas detector is used in the outer layer (the Transition-Radiation Tracker).

    Further away from the collision point, the Calorimeters can be found. They are composed

    of the electromagnetic calorimeters and the hadronic calorimeters, which are designed to

identify electrons/photons or hadrons, respectively, and measure their energy and coordinates.

    The position information is obtained by segmenting the calorimeters longitudinally and

    laterally. The calorimeters will not stop muons as they interact very little with the

    calorimeter absorber. Therefore, muons will pass through the full detector and arrive in the

    outermost layers of the ATLAS detector: the muon spectrometer. Figure 1.5 shows the

    detector response to different particles, using a schematic transverse section view of the

    ATLAS detector.

1.3.1 Inner Detector

As stated, the Inner Detector (ID) is placed closest to the beam line; therefore, its design must allow excellent radiation hardness and long-term stability, in addition to ensuring

    adequate performance. As shown in figure 1.6, the full ID is a cylinder 6.2 m long and with

    a diameter of 2.1 m and a coverage of |η| < 2.5.


Figure 1.5: section view of the ATLAS detector in the transverse plane, illustrating the layers' positioning.

Figure 1.6: a section view of the ATLAS Inner Detector

    The ID is segmented into cylindrical structures in the barrel region while it has coaxial

sensor disks in the end-cap regions. As shown in figure 1.7, the ID in the barrel region is made up of three main layers. In the following paragraphs, the two innermost ones are

    presented, from the most external to the most internal one.


Figure 1.7: structure and arrangement of the layers of the Inner Detector in the barrel region.

1.3.2 Semiconductor Tracker

The SemiConductor Tracker (SCT) is a tracker made up of silicon strips, with a technology

    similar to the one employed in the Silicon Pixel Detector. Each SCT layer has two sets of

    SCT strips which are glued back-to-back with an angle of 40 mrad in between to measure

    both lateral and longitudinal coordinates. The choice of a silicon strip detector is mainly

motivated by two facts: the large area covered (about 63 m²) and the small particle occupancy: in the SCT region, less than one track crosses an SCT chip, allowing for a substantial reduction in the number of instrumented channels while keeping spatial accuracy and noise levels within the design limits.

1.3.3 Pixel Detector

The innermost and most important detector is the Pixel Detector. It is designed to have the finest granularity of all the sub-detectors; indeed, being very close to the interaction region, the track density is high. The system consists of four cylindrical layers, named as follows, going outwards: Insertable B-Layer (IBL), B-Layer (L0), Layer 1 (L1) and

    Layer 2 (L2).


1.4 Structure of Pixel Detector

As stated, the current configuration of the Pixel Detector consists of four layers, as shown

    in Figure 1.7. Together, the L0, L1, L2 layers are composed of 112 long staves that are

made of 13 modules, tilted on the z-axis by 1.1 degrees toward the interaction point (Figure

    1.8b); furthermore, to allow overlapping, the staves are tilted by 20 degrees on the x-y plane

    (see Figure 1.8a).

Regarding the sensors, 16 Front End (FE-I3) chips, a flex-hybrid, a Module Controller Chip

    (MCC) and a pigtail together form what is called a module. FE-I3s are responsible for

    reading the charge signal from pixels. Each FE-I3 is 195 µm thick, with a top surface of

1.09 cm by 0.74 cm, counting 3.5 million transistors in 250 nm CMOS technology. They

    are bump bonded over the sensors (Figure 1.9) and each one has an analog amplifier able

    to discriminate signals of 5000 electrons with a noise threshold of 200 electrons. The

    module collects signals from the sensors and packs them in a single data event which is

    sent to the ROD board.

1.4.1 IBL

The Insertable B-Layer is a pixel detector inserted, together with a narrower beam pipe, inside

    the B-Layer. The fact of being very close to the interaction point forces some constraints

that are not needed in other layers: the electronics must be much more radiation hard, and the sensitive area needs to cover more of the surface than the 70% achieved in the B-Layer. To achieve

Figure 1.8: staves disposition around the beam pipe (a), and modules layout inside (b).


    those objectives a new front-end readout chip, called FE-I4, was developed, leading to an

    active area of 90% [4].

1.4.2 Sensors for IBL

IBL's modules and sensors are different from those of the other ATLAS pixel layers because of the technology chosen for the pixels. There were two main candidates:

    • planar;

    • 3D.

    The main characteristics of these two technologies are explained hereafter, as well as the

upgrade from the FE-I3 chip to the FE-I4.

Planar sensors were used within the B-Layer too, but the requirements on the IBL ones were much stricter: the inactive border had to shrink from the 1 mm of the old sensors to 450 µm. Several studies have been performed since the B-Layer pixels were produced, and it is now clear that an irradiated sensor collects more charge if it is thinner. One of the

Figure 1.9: Silicon sensor and readout chip (FE-I3) bump bonded.

Figure 1.10: cross-sectional schematic of the n-in-p planar pixel sensor.


    adopted variants of planar sensors is the so-called conservative n-in-n design. It uses

    sensors which are already known to work, while trying to fulfill all requirements for IBL

    such as the 450 µm inactive edge limit. Moreover, the pixel length in z has been reduced to

    match the new 250 µm pixel cell length of the FE-I4 [4] (Figure 1.10).

On the other hand, the geometry of 3-D sensors is completely different from that of planar ones (see

    Figure 1.11). Their wafers take advantage of new silicon technology advances that produce

    column-like electrodes which penetrate the substrate instead of being implanted on the

    surface like the planar technology [5]. Indeed, they are built thanks to plasma micro-

    machining to etch deep narrow apertures in the silicon substrate to form the electrodes of

    the junctions [6]. Since the charge collected is low, there is the need of reading it from two

    electrodes at once. Another downside of these sensors is that noise increases with the

    number of electrodes and it is even affected by their diameter.

Nevertheless, the active area of full 3-D sensors extends much closer to the surface, reducing the non-sensitive volume. The faces of 3-D sensors, independently of the type, are much closer to one another, allowing a much lower bias voltage (150 V versus the 1000 V of a planar sensor). This also leads to a lower leakage current and thus to reduced cooling requirements. When a particle passes through the electrode area, the efficiency is diminished by 3.3%. This effect only concerns perpendicular tracks and thus will not affect the IBL, since its sensors are tilted by 20°.

1.4.3 FE-I4

FE-I4 (Figure 1.12) is the new ATLAS pixel chip developed to be used in upgraded

    luminosity environments, in the framework of the Insertable B-Layer (IBL) project but also

    for the outer pixel layers of Super-LHC. FE-I4 is developed using a 130 nm CMOS process,

    in an 8 metal option with 2 thick aluminum top layers for enhanced power routing. Care

Figure 1.11: two types of 3-D sensors, double-sided (a) and full 3-D (b).


    has been taken to separate analog and digital power

    nets. With the reduction of the thickness of the gate

    oxide, the 130 nm CMOS process shows an increased

    radiation tolerance with respect to previous less

    scaled processes. The reasons behind the redesign of

    the pixel Front-End FE-I3 came from several aspects

    related to system issues and physics performance of

    the pixel detector. With a smaller innermost layer

    radius for the IBL project and an increased

    luminosity, the hit rate increases to levels which the

FE-I3 architecture is not capable of handling, while the FE-I4 can sustain an average hit rate of 400 MHz/cm² with less than 1% data loss. It was shown that the FE-

    I3 column-drain architecture scales badly with high

    hit rates and increased FE area, leading to

    unacceptable inefficiencies for the IBL (see Figure

1.13). To avoid that, FE-I4 was designed to store hits locally, getting rid of the column-drain-based transfer.

The FE-I4 pixel size is also reduced, from 50 × 400 µm² to 50 × 250 µm², which reduces the pixel cross-section and enhances the single-point resolution in the z direction. FE-I4 is built up from an array of 80 by 336

pixels, each pixel being subdivided into an analog and a digital section. The

    total FE-I4 active size is 20 mm (z direction) by 16.8 mm (φ direction),

    with about 2 mm more foreseen for periphery, leading to an active area

    of close to 90% of the total. The FE is now a standalone unit avoiding

    the extra routing needed for a Module Controller Chip for

    communication and data output. Communication and output blocks are

    included in the periphery of the FE. Going to a larger FE size is

    beneficial with respect to active over total area ratio as well as for the

    building up of modules and staves. This leads to more integrated stave

    and barrel concepts, and therefore reduces the amount of material

    needed per detector layer. This material reduction provides a better

    overall tracking measurement, since the probability of small or large

scattering of particles inside the tracker itself is reduced. One of the main

    advantages of having a large FE is also the cost reduction.

    Figure 1.12: FE-I4 Layout

Figure 1.13: performance of the column-drain readout architecture [7].
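As a quick arithmetic check (an addition to the text), the quoted FE-I4 active size follows directly from the pixel pitch and the array dimensions given above:

```python
# Check (not from the thesis) that the quoted FE-I4 active size follows
# from the pixel pitch and the array dimensions given above.
pitch_z_um, pitch_phi_um = 250, 50      # pixel cell size: 250 um (z) x 50 um (phi)
n_cols, n_rows = 80, 336                # 80 columns (z) x 336 rows (phi)

active_z_mm = n_cols * pitch_z_um / 1000.0      # 80 * 250 um = 20.0 mm
active_phi_mm = n_rows * pitch_phi_um / 1000.0  # 336 * 50 um = 16.8 mm
print(active_z_mm, active_phi_mm)               # 20.0 16.8
```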


    Chapter 2

Current off-detector electronics for IBL

    Hereafter is presented the current set-up of the off-detector electronics for the Insertable

B-Layer, in order to describe the environment in which the new Pixel-ROD board was

    conceived and to understand the requirements that it needs to fulfill.

    High-energy physics experiments usually distinguish between on-detector and off-detector

    electronics referring to the front-end electronics implemented near the detector itself and

    to the readout system that can be implemented far from the detector. While in the first

    scenario radiation resistance is a fundamental parameter, in the second one there is a less

    compelling requirement of radiation resistance allowing the employment of more powerful

    devices.

2.1 IBL electronics

The IBL readout requires an appropriate off-detector system that is schematically shown in

    Figure 2.1.

    Figure 2.1: schematic block of the IBL readout system

This readout system is made of several components:

• Back of Crate (BOC) board:
  • optical modules to interface the FE-I4 chips with the BOC board;
  • S-Link for sending data from the BOC board to the ATLAS TDAQ system.
• Read Out Driver (ROD) board:
  • Gigabit Ethernet to send front-end calibration histograms.
• VME Crate;
• TTC Interface Module (TIM);
• Single Board Computer (SBC).

FE-I4 data are received by the BOC board via the RX optical modules, then 8b/10b decoding is performed before passing the data to the ROD, which processes them. During physics

    runs, events to be sent to the ATLAS TDAQ (Trigger and Data Acquisition) are sent back

    to the BOC, where 4 S-Link modules are implemented. S-Link stands for Simple LINK

    and it is a simple interconnection protocol that implements error reporting and test

    functions, too. Each BOC-ROD pair can interface and route data coming from 16 IBL

    modules (32 FE-I4 chips, for a total input bandwidth of 5.12 Gb/s). The whole IBL readout

requires 15 BOC-ROD pairs that can all be placed in a single VME crate: one BOC-ROD pair for each of the 14 staves of the IBL detector (each stave counting 32 FE-I4s), plus one pair to serve the diamond beam monitor detector [8].

Figure 2.2: visual layout of the data acquisition system. In red, the normal data path; in blue, the deviation for the generation of the histograms.

Figure 2.2 illustrates the data path: the 32 front-end FE-I4 chips drive 32 serial lines, each

    supporting 160 Mb/s, connected to the BOC board via optical links. Here the signal from

    each line is converted from optical to electrical, then demultiplexed to one 12-bit-wide bus,

    which proceeds towards the ROD board, through the VME backplane connector.
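As a quick sanity check (an addition, not from the thesis), the 5.12 Gb/s input bandwidth quoted above follows directly from these figures:

```python
# Check (not from the thesis) that 32 front-end links at 160 Mb/s each
# give the 5.12 Gb/s total input bandwidth quoted for a BOC-ROD pair.
n_links = 32            # one serial line per FE-I4 chip
rate_mbps = 160         # Mb/s per line
total_gbps = n_links * rate_mbps / 1000.0
print(total_gbps)       # 5.12
```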


    After that, in order to build the data frame that has to be sent to the TDAQ computers, the

ROD board begins the data formatting. Data, after being transmitted to the ROD board, can take two different paths, as shown in Figure 2.2.

    In the first one, ROD data are sent back to the BOC, where four S-Link modules forward

    the data towards the ATLAS TDAQ PCs, reaching a total output bandwidth of 5.12 Gb/s;

    in the second one, data are delivered to a PC for histogram processing (exclusively used

    during calibration runs to properly calibrate the FE-I4 chips).

2.2 IBL BOC

The BOC board (Figure 2.3) is responsible for handling the control interface to the detector,

as well as the data interface from the detector itself. Another of the main tasks of the BOC is to provide the clock to the connected front-end chips. Furthermore, a Phase Locked Loop

    (PLL) generates copies of this clock for the ROD and the detector. The detector clock is

    then handled by the FPGAs and coded into the control streams for the individual detector

    modules.

    The IBL BOC contains three Xilinx Spartan FPGAs:

    • one BOC Control FPGA (BCF);

    • two BOC Main FPGAs (BMF).

    Figure 2.3: IBL BOC board


2.2.1 BOC Control FPGA

The BOC Control FPGA is responsible for the overall control and data shipping of the

    board. An embedded processor (the Microblaze) is instantiated on this FPGA, mainly to

provide Ethernet access to the card, but it is also able to implement some self-tests for the board, as well as being responsible for configuring the other FPGAs by loading configuration data from a flash memory accessed via SPI.

2.2.2 BOC Main FPGA

The two main FPGAs encode the configuration data regarding the FE-I4 front-end chips

    connected to the ROD, into a 40 Mb/s serial stream that is sent straight to the FE-I4

themselves. These two FPGAs also manage the deserialization of the incoming data from

    the front-end chips; after the data collection and the word alignment, the decoded data are

    sent to the ROD board. On the transmission side, these two Spartan FPGAs also manage

    the optical connection via four S-Links to the ATLAS TDAQ system.

    Figure 2.4: IBL ROD board


2.3 IBL ROD

The Insertable B-Layer Read Out Driver (IBL ROD) [9] is a board developed to substitute the older ATLAS Silicon Read Out Driver (SiROD), which is used by the ATLAS off-detector electronics sub-system to interface with the Semiconductor Tracker (SCT) and the Pixel L0, L1 and L2 front-end detector modules.

The board's main tasks are data gathering and event-fragment building during physics runs, and histogramming during calibration runs.

    As stated, during runs, the board receives data and event fragments from the 32 FE-I4 chips

    and transforms them into a ROD data frame, which is sent back to the ATLAS TDAQ

    through the BOC S-Link connections (see Figure 2.2).

2.3.1 ROD Master

An FPGA (Xilinx Virtex-5 XC5VFX70T-FF1136) is the Master of the Read-Out Driver, which acts as the interface with the front-end chips. This FPGA also contains a PowerPC, an embedded hard processor. Its main task is sending the event information to the two slave FPGAs [10].

2.3.2 ROD Slaves

The two FPGAs that work as slaves on the ROD board are Xilinx Spartan-6 XC6SLX150-FGG900 devices. Each implements an embedded soft processor, the MicroBlaze.

    All data generated by IBL during ATLAS experiments pass through these two FPGAs and

    are collected inside the on-board RAM (SODIMM DDR2 2GB); moreover, during

    calibration runs, histograms can be generated and sent to the histogram server.

2.4 TIM

The TTC (Timing, Trigger and Control) Interface Module (TIM) acts as the interface between

    the ATLAS Level-1 Trigger system signals and the pixel Read-Out Drivers using the LHC-

    standard TTC and Busy system. In particular, the board is designed to propagate the TTC

    clock all over the experiment: for what concerns the IBL off-detector electronics, the TIM

    sends the clock to the BOC board, which then propagates it to the ROD, as stated above.

    Furthermore, the TIM receives and propagates triggers through the custom backplane.


2.5 System limitations

Since the ROD board described in the previous section met all the strict requirements the

    ATLAS experiment had, it was decided to implement the same system also for all other

    layers (L0, L1, L2). But while the IBL electronics required 15 ROD boards, the remaining

layers required about 110 boards. Even if, at the moment, the space occupation is high but still sustainable, it already shows the limits of this system: the future upgrade of the whole LHC detector to the higher-luminosity HL-LHC (whose luminosity will be raised by up to a factor of 10) will need more and more boards to face a much higher data rate [11].

    Indeed, the link bandwidth is proportional to the product of occupancy (which is a function

of the luminosity) and the trigger rate of the front-end devices [8]. It is expected that the hit rate for IBL at 140 pile-up will rise up to 3 GHz/cm² and the readout rate up to 4.8 Gb/s per chip for a 1 MHz trigger rate [12].
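The scaling rule quoted above (bandwidth proportional to occupancy times trigger rate) can be sketched as follows; the factors in the example are hypothetical, chosen only to show the multiplicative behavior:

```python
# Sketch (not from the thesis) of the scaling rule quoted above:
# required link bandwidth ~ occupancy (a function of luminosity) x trigger rate.
def scaled_bandwidth(current_bw_gbps: float,
                     occupancy_scale: float,
                     trigger_scale: float) -> float:
    return current_bw_gbps * occupancy_scale * trigger_scale

# Hypothetical example: 10x occupancy (an HL-LHC-like luminosity increase)
# and 10x trigger rate (100 kHz -> 1 MHz) multiply the needed bandwidth 100x.
print(scaled_bandwidth(0.16, 10, 10))  # 16.0 Gb/s from a 160 Mb/s link
```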


    Chapter 3

    PCIe specifications and usage model

    In the so-called Long Shutdown of the LHC accelerator foreseen for 2023, the whole LHC

    detector will be upgraded to the higher luminosity HL-LHC. In particular, the nominal

luminosity will be raised up to roughly ten times the current one [11], implying that the

    electronics has to withstand a much higher data rate. Although many different read-out

    electronics boards have been presented, they all share a common feature: high flexibility

and configurability with a PCIe interface, as well as powerful FPGAs connecting to many

    optical transceivers [13]. So, before the description of the board named Pixel-ROD, an

    overall architectural perspective of the PCIe technology is provided.

3.1 Architecture of a PCIe system

PCI Express (PCIe, which stands for Peripheral Component Interconnect Express) is a high-

    performance I/O device interconnection bus used in the mobile, workstation, desktop,

    server, embedded computing and communication platforms in order to expand the

    capabilities of a host system by providing slots where expansion cards can be installed. It

    has established itself as the successor to PCI providing higher performance, increased

flexibility and scalability. Indeed, since the so-called first generation of buses (ISA, EISA,

    VESA and Micro Channel) and the second one (PCI, AGP, and PCI-X), PC buses have

    doubled in performance roughly every three years. But processors have roughly doubled in

performance in half that time, following Moore's Law. In addition, although PCI has

enjoyed remarkable success, there was evidence that a multi-drop, parallel bus

    implementation was close to its practical limit of performance as it cannot be easily scaled

    up in frequency or down in voltage: it faced a series of challenges such as bandwidth

    limitations, host pin-count limitations, lack of real-time data transfer and signal skew

    limited synchronously clocked data transfer. Indeed, the bandwidth of the PCI bus and its

derivatives can be significantly less than the theoretical value due to protocol overhead and

    bus topology. All approaches to pushing these limits to create a higher bandwidth, general-

    purpose I/O bus result in large cost increases for little performance gain [14]. So, there was

    the need to engineer a new generation of PCI to serve as a standard I/O bus for future

    generation platforms. There have been several efforts to create higher bandwidth buses and

    this has resulted in a PC platform supporting a variety of application-specific buses


    alongside the PCI I/O expansion bus. Indeed, PCIe offers a serial architecture that alleviates

    some of the limitations of parallel bus architectures by using clock data recovery (CDR)

    and differential signaling. Using CDR as opposed to source synchronous clocking lowers

    pin-count, enables superior frequency scalability and makes data synchronization easier.

Moreover, PCIe was designed to be software-compatible with PCI (so older software systems are still able to detect and configure PCIe cards, although without PCIe

    features such as the access to the extended configuration space that will be discussed in the

    next paragraphs).

3.2 Interconnection

A PCIe interconnect that connects two devices together is referred to as a Link. It consists

of x1, x2, x4, x8, x12, x16 or x32 signal pairs in each direction. These signals are

    referred to as Lanes. So, a x1 Link consists of 1 Lane or 1 differential signal pair in each

    direction for a total of 4 signals (each Lane constitutes a full-duplex communication

    channel). A x32 Link consists of 32 Lanes or 32 signal pairs for each direction for a total

    of 128 signals [15]. In order to make the bus software backwards compatible with PCI and

    PCI-X systems (predecessor buses), it maintains the same usage model and load-store

    communication model. When it comes to the differences, PCI and PCI-X buses are multi-

    drop parallel interconnect buses in which many devices share the same bus, while PCIe

    implements a serial, point-to-point type interconnection for communication between

    devices. In systems requiring multiple devices to be interconnected, interconnections are

made possible thanks to switches. The point-to-point interconnection leads to a limited

electrical load on the Link, overcoming the limitations of a shared bus. Moreover, as stated,

    a serial interconnection results in fewer pins per device package which reduces PCIe board

    design cost and complexity. Another significant feature is the possibility to implement

    scalable numbers for pins and signal Lanes that allows a huge flexibility according to

    communication performance requirements (PCIe specifications defined operations for a

    maximum of 32 Lanes). So, the size of PCIe cards and slots vary depending upon the

    number of supported Lanes. During hardware initialization, the Link is automatically

initialized for Link width and frequency of operation by the devices on the opposite ends of

the Link without involving any kind of firmware. A packet-based communication protocol is used over the serial interconnection. Packets are serially transmitted and received, and byte-striped across the available Lanes. This feature contributes to keeping the device pin count low and to reducing system cost, since Hot Plug, power management, error handling and interrupt signaling are accomplished in-band using packet-based messaging instead


of side-band signals. Each packet to be transmitted over the Link consists of Bytes of

    information. The first generation of the standard has a transmission/reception rate of 2.5

    Gbits/s per Lane per direction (it has been doubled in the second generation). PCIe standard

    also specifies three clocking architectures: Common Refclk, Separate Refclk, and the

    already cited Clock Data Recovery. Common Refclk specifies a 100 MHz clock (Refclk),

with a frequency stability better than ±300 ppm at both the transmitting and receiving

    devices. It was the most widely supported architecture among the first commercially

    available devices. However, the same clock source must be distributed to every PCIe device

    while keeping the clock-to-clock skew to less than 12 ns between devices. This can be a

    problem with large circuit boards or when crossing a backplane connector to another circuit

board. In case a low-skew configuration is not workable, the Separate Refclk

    architecture, with independent clocks at each end, can be used. The clocks do not have to

    be more accurate than ±300 ppm, because the PCIe standard allows for a total frequency

    deviation of 600 ppm between transmitter and receiver. Finally, the Clock Data Recovery

    architecture is the simplest, as it requires only one clock source, at the transmitter [16] and

    it has become the most used configuration. In this scenario, since there is no clock signal

    on the Link, the receiver uses a PLL to recover a clock from the 0-to-1 and 1-to-0 transitions

of the incoming bit stream. To allow for clock recovery on the signal line independently of the data transmitted, a DC-balanced protocol is used. Every Byte of data to be

    transmitted is converted into 10-bit code via an 8b/10b encoder in the transmitter device

    (so 10-bit symbols are employed). The consequence is a 25% additional overhead to

    transmit a byte of data. All symbols, in order to be compatible with the Clock Data

    Recovery architecture, are guaranteed to have one-zero transitions. PCIe implements a

    dual-simplex Link capable of transmitting and receiving data simultaneously on a transmit

    and receive Lane. So, to obtain the aggregate bandwidth (which assumes simultaneous

traffic in both directions), the transmission/reception rate has to be multiplied by 2, then by the number of Lanes (the so-called Link Width), and finally divided by 10 to account for the 10-bits-per-Byte encoding. For example, a first-generation x1 PCI Express Link has an aggregate throughput of 0.5 GB/s, while a x32 PCI Express Link reaches 16 GB/s.

    In the case of the second version of the PCIe protocol (named PCIe gen. 2), the previous

values have to be multiplied by a factor of two. Indeed, in this case, the data-transfer rate is raised up to 5 GTransfers/s, which means that every Lane can transfer up to 5 Gb/s using the

    8b/10b encoding format [17]. As a side note, the third version of the protocol not only

    increased the data-transfer rate, raising it up to 8 GTransfer/s, but also changed the

    encoding format from 8b/10b to 128b/130b (to reduce the protocol overhead) [18].
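The aggregate-bandwidth rule above is easy to verify numerically. Below is a small sketch (an addition to the text) that reproduces the quoted figures; the per-lane rates and encodings are those stated above for the first three generations:

```python
# Sketch (not from the thesis) of the aggregate-bandwidth rule described above:
# rate per lane * 2 (dual simplex) * lanes, corrected for the line encoding.

GEN = {
    1: (2.5, 8 / 10),     # gen 1: 2.5 GT/s per lane, 8b/10b encoding
    2: (5.0, 8 / 10),     # gen 2: 5.0 GT/s per lane, 8b/10b encoding
    3: (8.0, 128 / 130),  # gen 3: 8.0 GT/s per lane, 128b/130b encoding
}

def aggregate_gbytes_per_s(gen: int, lanes: int) -> float:
    """Aggregate throughput (both directions) in GB/s for a PCIe link."""
    gt_per_s, efficiency = GEN[gen]
    return gt_per_s * efficiency * 2 * lanes / 8  # /8: bits -> Bytes

print(aggregate_gbytes_per_s(1, 1))   # 0.5  GB/s, as in the text
print(aggregate_gbytes_per_s(1, 32))  # 16.0 GB/s, as in the text
print(aggregate_gbytes_per_s(2, 32))  # 32.0 GB/s (gen 2 doubles gen 1)
```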


3.3 Topology

In order to understand the topology of the PCIe standard, some definitions are provided:

    PCIe end-point: PCIe device to be connected.

    Root complex: host controller that connects the CPU of the host machine to the rest of the

    PCIe devices. PCIe has its own address space consisting of either 32 or 64 bits depending

    upon the Root-Complex and it is only visible by PCIe components like the Root-Complex,

    end-points, switches and bridges. Root-complex can interrupt the CPU for any of the events

    generated by the Root-Complex itself or by any of the PCIe devices. Moreover, it can also

    access the memory without CPU intervention (acting as a sort of DMA). PCIe end-points

can use this feature to write/read data to/from the memory. In order to do so, the Root-Complex

    makes the end-point the bus master (giving the permission to access the memory) and

    generates the corresponding memory address.

    Bridge: it provides forward and reverse bridging allowing designers to migrate local bus,

    PCI, PCI-X and USB bus interfaces to the serial PCIe architecture.

    Switch: born to replace the multi-drop bus used in PCI and to provide fan-out for the I/O

    bus, it is also used to realize a peer-to-peer communication between different endpoints and

    this traffic, if it does not involve cache-coherent memory transfers, need not be forwarded

    to the host bridge.

    Figure 3.1 shows how the PCIe components (Root-Complex, bridges, end-points and

    switches) are interconnected to PCIe Links.

    Figure 3.1: PCIe topology


As stated, the Root-Complex allows the connection of many PCIe end-points. This task is

    accomplished thanks to root-ports that can be directly connected to end-points, to a bridge

    or to a switch connected to several end-points. In the case of Root-Complex or switches, in

    order to implement a point-to-point topology (which means that a single serial link connects

    two devices) multiple Virtual PCI to PCI bridges are used. These are the devices that

connect multiple buses together, providing a (virtual) PCI bridge for the up-stream PCIe

    connection and one (virtual) PCI bridge for each down-stream PCIe connection (Figure

    3.2). An identification number is assigned to each bus by the software during the

    enumeration process that is used by switches and bridges to identify the path of a

    transaction. Every switch or bridge must store the information about three bus numbers:

    the primary bus number (that reflects the number of the bus the switch is connected to), the

    secondary bus number (identifying the bus with the lowest number that can be reached)

and the subordinate bus number (the bus with the highest number that can be reached).

    Figure 3.2: SoC detail

In the case of the switch of the previous example (see figure 3.3), the primary bus number is 3, the secondary bus number is 4, and the subordinate bus number is 8. So, any transaction targeting a bus from 4 to 8 will be accepted and handled by the switch [19].
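As an illustration (an addition, not from the thesis), the downstream-routing decision described above can be sketched like this:

```python
# Sketch (not from the thesis) of how a switch uses its bus numbers to
# decide whether a transaction should be routed downstream.
from dataclasses import dataclass

@dataclass
class SwitchPort:
    primary: int      # bus the switch's upstream port sits on
    secondary: int    # lowest bus number reachable downstream
    subordinate: int  # highest bus number reachable downstream

    def routes_downstream(self, target_bus: int) -> bool:
        """A transaction is accepted if its target bus lies in
        [secondary, subordinate]; otherwise it is not for this switch."""
        return self.secondary <= target_bus <= self.subordinate

sw = SwitchPort(primary=3, secondary=4, subordinate=8)  # values from the example
print(sw.routes_downstream(6))  # True: bus 6 is behind this switch
print(sw.routes_downstream(9))  # False: not in [4, 8]
```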


3.4 Electrical specifications and I/O Lines

Having adopted a serial bus technology, PCIe uses far fewer I/O lines than PCI. As stated,

    PCIe devices employ differential drivers and receivers (a pair of differential TX lines and

    a pair of differential RX lines for each Lane) implementing the High-speed LVDS (Low-

Voltage Differential Signaling) electrical signaling standard [15]. The differential driver is DC-decoupled from the differential receiver at the opposite end of the Link thanks to a

    capacitor at the driver side. This means that two devices at the opposite end of a Link can

    use different DC common mode voltages (range: 0 V to 3.6 V). The differential signal is

    derived by measuring the voltage difference between two terminals. Logical values: a

    positive voltage difference between the positive terminal and the negative one implies

    Logical 1. On the other hand, a negative voltage difference between the same terminals

    implies a Logical 0. Finally, when the driver is put in a high-impedance tristate condition

    (also called Electrical-Idle or low-power state of the Link), the two terminals are driven at

the same potential. Let the voltages with respect to ground on the two conductors be VD+ and VD−.

Figure 3.3: Switch detail

The differential peak-to-peak voltage is defined as 2 · max|VD+ − VD−|. To signal a logical 1 or a logical 0, the differential peak-to-peak voltage driven by the transmitter must be between 800 mV (minimum) and 1200 mV (maximum). Conversely, during the Link

Electrical Idle state, the transmitter drives a differential peak voltage between 0 mV and

    20 mV. As stated, the receiver is able to sense a logical 1, a logical 0 as well as the Electrical

    Idle state of the Link, by detecting the voltage on the Link via a differential receiver

    amplifier. Due to signal loss along the Link, the receiver must be designed to sense an

    attenuated version of the differential signal driven by the transmitter. The receiver

sensitivity is fixed at a differential peak-to-peak voltage between 175 mV and 1200 mV,

    while the electrical idle detect threshold can range from 65 mV (minimum) to 175 mV

    (maximum). Any voltage less than 65 mV peak-to-peak implies that the Link is in the

    Electrical Idle state [15].
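These thresholds can be summarized in a small sketch (an addition to the text; the receiver-side values are those quoted above):

```python
# Sketch (not from the thesis) of the receiver-side thresholds quoted above.
def classify_link_state(vpp_mV: float) -> str:
    """Classify a received differential peak-to-peak voltage (in mV)."""
    if vpp_mV < 65:
        return "electrical idle"            # guaranteed idle below 65 mV
    if vpp_mV < 175:
        return "undefined / detect region"  # idle-detect threshold band
    if vpp_mV <= 1200:
        return "valid signal"               # within receiver sensitivity
    return "out of spec"

for v in (10, 100, 300, 1500):
    print(v, "mV ->", classify_link_state(v))
```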

The PCIe specifications also define other auxiliary signals: the differential clock REFCLK used in the Common Refclk clocking architecture, the +12 V supply, PERST# to indicate when power is stable, and presence signals (PRSNT1# and PRSNT2#) for hot-plug detection (see Figure 3.4). As stated, unlike PCI, PCIe does not use dedicated

    interrupt lines but relies on in-band signaling transmitted through the differential TX and

    RX lines.

3.5 PCIe Address Space

The host system can access any of the PCIe end-points only by using the PCIe Address

Space (Figure 3.5). It is important to note that this address space is virtual: there is no physical memory associated with it; it only represents a list of addresses used by the Transaction Layer (explained later) in order to identify the target of a transaction.

Figure 3.4: I/O Lines

The Root-Complex also has configuration registers (to configure the Link width, the frequency and the Address Translation Unit that translates CPU addresses to PCIe ones), a Configuration Space that contains all the information regarding end-points (such as device ID and vendor ID), and registers to configure the end-points (used, for example, to put the device

in low-power mode). The PCIe specifications defined the Configuration Space to be backward compatible with PCI, but increased its size from 256 B to 4 kB. The first 64 bytes are

    standard (they are called the standard headers) and both PCIe and PCI defined two types

    of standard headers: type 1 (containing info regarding root-ports, bridges and switches

    (such as primary, secondary and subordinate bus numbers)) and type 0 (containing info

    regarding end-points). Every PCIe component has its own Configuration Space. Figure 3.6

    shows the standardized type 0 header that is present in the Configuration Space of a PCIe

    end-point (only the first 64 Bytes are shown). It contains information regarding the device

    (device ID, vendor ID, Status and Command used by the host system to configure and

    control the end-point), the header type (that differentiates type 0 from type 1 headers) and

    Base Address Registers used to configure the Memory Space. The mechanism that

    determines the address to which the Configuration Space of a particular end-point should

    be mapped in the PCIe Address Space is called Enhanced Configuration Access

    Mechanism (ECAM).

    Figure 3.5: PCIe Address Space


    Figure 3.6: Configuration Space Header of an end-point

The ECAM address is a function of the bus number, device number, function number and register number (Figure 3.7).

    Figure 3.7: Enhanced Configuration Access Mechanism
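As an illustration of this mechanism, the following C sketch computes the ECAM address from the bus, device, function and register fields, using the standard ECAM bit layout (bus in bits 27–20, device in bits 19–15, function in bits 14–12, register offset in bits 11–0); it reproduces the 100000h address of the example that follows.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard ECAM bit layout: bus[27:20], device[19:15], function[14:12],
 * register offset[11:0]. Sketch for illustration. */
static uint64_t ecam_address(uint8_t bus, uint8_t device,
                             uint8_t function, uint16_t reg)
{
    return ((uint64_t)bus              << 20) |
           ((uint64_t)(device & 0x1F)  << 15) |
           ((uint64_t)(function & 0x7) << 12) |
           (uint64_t)(reg & 0xFFF);
}

int main(void)
{
    /* the first end-point of Figure 3.5: Bus 1, Device 0, Function 0 */
    printf("0x%llx\n",
           (unsigned long long)ecam_address(1, 0, 0, 0)); /* prints 0x100000 */
    return 0;
}
```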

For example (see Figure 3.5), the Configuration Space of the first end-point (Bus:1, Device:0, Function:0) is mapped to the address 100000h of the PCIe Address Space. The host system is capable of reading the Configuration Space of an end-point thanks to the Configurable Address Space present in the Root-Complex, which has a region (CFG0) of 4 kB (matching the size of the Configuration Space). Indeed, the CPU can only access the Root-Complex internal registers and cannot directly read the Configuration Spaces of PCIe devices. To do so, the Configurable Address Space has to be “connected” to the Configuration Space of the end-point. To make this possible, the Root-Complex implements an Address Translation Table which has to be programmed with the source address (A in the example of Figure 3.8), that is, an address in the Configurable Address Space, the destination address (the ECAM address, for example 100000h) and the size (4 kB in the case of an access to the Configuration Space). So, when the CPU accesses the CFG0 region of the Configurable Address Space, the Address Translation Unit makes sure that the Root-Complex accesses the ECAM address corresponding to the Configuration Space of the desired end-point.
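In C-like form, the entry of the Address Translation Table used for CFG0 accesses could be rendered as follows; the structure and the placement of address A are assumptions made for illustration, since the actual register layout depends on the Root-Complex implementation.

```c
#include <stdint.h>

/* Hypothetical descriptor for one Address Translation Table entry,
 * mirroring the example of Figure 3.8. Names are illustrative, not
 * taken from any specific Root-Complex implementation. */
struct att_entry {
    uint64_t src;  /* address in the Configurable Address Space (A)    */
    uint64_t dst;  /* address in the PCIe Address Space (ECAM address) */
    uint64_t size; /* size of the translated window                    */
};

/* Window used for CFG0 accesses: every CPU access to [A, A + 4 kB) is
 * forwarded by the Root-Complex to the ECAM address 100000h, i.e. to
 * the Configuration Space of Bus:1, Device:0, Function:0. */
static const struct att_entry cfg0_window = {
    .src  = 0xA0000000, /* "A": assumed placement, for illustration only */
    .dst  = 0x00100000, /* ECAM address of the end-point                 */
    .size = 4096,
};
```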

Naturally, multiple end-points can be connected to the Root-Complex: for example, a PCIe Bridge (see Figure 3.9). In the example, its ECAM address (mapped in the PCIe Address Space) is 200000h. To access the Configuration Space of the PCIe Bridge, the same region of the Configurable Address Space can be used (by programming the Address Translation Table differently). There can be other devices connected to the Bridge (for example, an end-point). Again, using the Enhanced Configuration Access Mechanism, the end-point Configuration Space is mapped in the PCIe Address Space. The PCIe specifications define a dedicated type of transaction to access devices connected beyond the bridge: a second region in the Configurable Address Space (CFG1) is dedicated to this task. The rest of the Configurable Address Space can be used for the I/O Space (generally 64 kB) and the Memory Space (where the peripheral registers and memory are mapped). The Memory Space of an end-point cannot be accessed in the same way CFG0 and CFG1 are accessed, because its size may vary from card to card. So, the host system must know the size of the Memory Space of each end-point. To do so, the host uses the information stored in the Base Address Registers present in the Configuration Space Header (see Figure 3.10). Once the host system has obtained the size of the Memory Space of an end-point, it allocates an equal amount of memory in the Configurable Address Space (in the region dedicated to Memory Space) that will be used to access the Memory Space of the PCIe end-point. In order to do so, the Configurable Address Space has to be mapped in the PCIe Address Space. Again, this is made possible thanks to the Address Translation Table.

Figure 3.8: addressing methods with the Address Translation Table

In the example of Figure 3.9,


    the Address Translation Table is programmed to have: source address B (the first address

    of the Configurable Address Space dedicated to Memory Space), destination address B

    (address in the PCIe Address Space) and size equal to 256 MB since the whole Memory

    Space of the Configurable Address Space needs to be mapped in the PCIe Address Space.

    Figure 3.9: PCIe Address Space

Now, whenever the CPU accesses these regions of the Configurable Address Space, because of the Address Translation Table, the Root-Complex will access the corresponding regions of the PCIe Address Space. It has to be noted that, at this point, the PCIe end-points will not respond to accesses in this region, because their Memory Space is not mapped to the PCIe Address Space yet. To do so, the Root-Complex has to access the Base Address Registers and write the starting address of the PCIe Address Space region to which the Memory Space of the end-point has to be mapped. After that, the Memory Space of the PCIe end-point will respond to any memory request addressed to this region of the PCIe Address Space, since there will be a one-to-one mapping between its Memory Address and the PCIe Address [19].
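The size discovery relies on the standard BAR-sizing handshake: the host writes all ones to a Base Address Register, reads it back, and derives the region size from the bits that remained clear. A minimal sketch for a 32-bit memory BAR is given below, assuming two hypothetical helpers that perform configuration accesses (for instance through the CFG0 window described above).

```c
#include <stdint.h>

/* Hypothetical helpers performing configuration reads/writes (e.g.
 * through the CFG0 window); their implementation is platform specific. */
uint32_t cfg_read32(uint16_t reg);
void     cfg_write32(uint16_t reg, uint32_t value);

/* Standard sizing handshake for a 32-bit memory BAR. */
static uint64_t bar_size(uint16_t bar_reg)
{
    uint32_t orig = cfg_read32(bar_reg);

    cfg_write32(bar_reg, 0xFFFFFFFFu);      /* write all ones             */
    uint32_t rb = cfg_read32(bar_reg);      /* read back the writable bits */
    cfg_write32(bar_reg, orig);             /* restore the original value */

    uint32_t mask = rb & 0xFFFFFFF0u;       /* drop the flag bits [3:0]   */
    return (uint64_t)((~mask) + 1u);        /* size = two's complement    */
}
```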

3.6 Device Tree

A device tree (a data structure used to describe the hardware to Linux-based operating systems) is created for devices that are not enumerated dynamically (Figure 3.11). So, in this case, device tree nodes are only created for the Root-Complex (the end-points are enumerated dynamically). The properties configured in the device tree are shared among all the end-points. One of the main properties that can be configured is a field of the structure called “ranges”, which is used to program the Address Translation Unit (see Figure 3.12).

    Figure 3.10: Configuration Space Header

    Figure 3.11: example of Device Tree


Figure 3.12: each cell is 32 bits wide. The first three cells are dedicated to the PCI address (the first one contains flags and the other two store the address itself, since the PCIe Address Space can be up to 64 bits deep), the fourth cell stores the CPU address (an address of the Configurable Address Space) and the last one contains the size information. This information is used to program the Address Translation Unit.
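As an aid to reading Figure 3.12, the five cells of one “ranges” entry can be modelled in C as below; this is only an illustrative view of the cell layout, not a structure taken from the Linux sources.

```c
#include <stdint.h>

/* Illustrative model of one "ranges" entry as described in Figure 3.12:
 * five 32-bit cells in total. */
struct pcie_range_entry {
    uint32_t pci_flags;   /* cell 1: flags (space type, prefetchable, ...)     */
    uint32_t pci_addr_hi; /* cell 2: upper 32 bits of the PCIe address         */
    uint32_t pci_addr_lo; /* cell 3: lower 32 bits of the PCIe address         */
    uint32_t cpu_addr;    /* cell 4: address in the Configurable Address Space */
    uint32_t size;        /* cell 5: size of the window                        */
};

/* Each such entry is turned into one Address Translation Unit window:
 * CPU accesses to [cpu_addr, cpu_addr + size) are forwarded to the
 * 64-bit PCIe address ((uint64_t)pci_addr_hi << 32) | pci_addr_lo. */
```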

    3.6.1 Linux PCIe Subsystem

    Figure 3.13: Linux PCIe Subsystem: at the bottom, the Root-Complex platform driver.


Each platform can have its own Root-Complex driver, which is responsible for initializing the Root-Complex registers, programming the Address Translation Unit, extracting the I/O and memory resource information from the “ranges” properties of the device tree and invoking an API in the PCI-BIOS layer in order to start the enumeration process. The PCI-BIOS layer performs BIOS-type initialization: it provides the aforementioned API used by the Root-Complex drivers and invokes the PCI Core to start bus scanning. The PCI Core then scans the bus by using the callbacks provided by the Root-Complex driver in order to read the Configuration Space of the PCIe end-points. During bus scanning, when the PCI Core finds a device with a known device ID and vendor ID, it binds the corresponding PCIe device driver. This driver stores all the information about the PCIe end-point and also provides the implementation of the interrupt handlers for any of the interrupts that can be raised by the PCIe card. Moreover, each end-point can also interact with its own domain-specific upper layer (for example, an Ethernet card can interact with the Ethernet stack).
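On the driver side, the binding step can be illustrated with a minimal Linux PCIe driver skeleton; the registration API shown (struct pci_driver registered through module_pci_driver) is the standard kernel one, while the device ID below is a placeholder.

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Placeholder IDs: the PCI Core binds this driver when bus scanning
 * finds an end-point whose Configuration Space matches this table. */
#define DEMO_VENDOR_ID 0x10EE   /* Xilinx vendor ID */
#define DEMO_DEVICE_ID 0x7028   /* hypothetical device ID */

static const struct pci_device_id demo_ids[] = {
    { PCI_DEVICE(DEMO_VENDOR_ID, DEMO_DEVICE_ID) },
    { 0, }
};
MODULE_DEVICE_TABLE(pci, demo_ids);

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err = pci_enable_device(pdev);
    if (err)
        return err;
    /* here the driver would map the BARs and register interrupt handlers */
    return 0;
}

static void demo_remove(struct pci_dev *pdev)
{
    pci_disable_device(pdev);
}

static struct pci_driver demo_driver = {
    .name     = "pixelrod-demo",
    .id_table = demo_ids,
    .probe    = demo_probe,
    .remove   = demo_remove,
};
module_pci_driver(demo_driver);

MODULE_LICENSE("GPL");
```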

3.7 The three Layers of the protocol

As stated, PCIe is a standard that uses a packet-based communication system and its architecture is specified in layers. Indeed, the protocol is made up of three layers: the Transaction, Data Link and Physical Layer, as shown in Figure 3.14. Layered protocols have been used for years in data communication, since they isolate the different functional areas of the protocol and allow upgrading one or more layers without requiring updates to the others. So, a revision of the protocol might affect the physical medium with no major effects on the higher layers [14].

    Figure 3.14: layers of the PCIe architecture


The software layers generate read and write requests that are transported by the Transaction Layer to the I/O devices using a packet-based, split-transaction protocol. There are two main packet types: Transaction Layer Packets (TLPs) and Data Link Layer Packets (DLLPs). While DLLPs are meant for service communication between the constituent elements of PCIe, TLPs are the packets that move data to and from the devices.

    3.7.1 Transaction Layer

The Transaction Layer is the highest layer of the PCI Express architecture. It receives read and write requests from the software layer and creates request packets (TLPs) for transmission to the Data Link Layer. On the transmitting side of a PCIe transaction, TLPs are formed with the protocol information (type of transaction, recipient address, transfer size, etc.) inserted in the header fields. As stated, PCIe does not use dedicated interrupt lines but relies on in-band signaling. This method of propagating system interrupts was introduced as an alternative to the hard-wired sideband signal in the PCI rev. 2.2 specifications and it was made the primary interrupt-processing method in the PCIe protocol. Transactions are divided into posted and non-posted transactions: while posted transactions do not need any response packet, non-posted transactions require a reply. In this case, the Transaction Layer has to receive the response packets from the Data Link Layer and match them with the original software requests. This task can be easily accomplished since each packet has a unique identifier that enables response packets to be directed to the correct originator. The packet format supports 32-bit memory addressing and extended 64-bit memory addressing. Packets also have attributes such as “no-snoop”, “relaxed-ordering” and “priority”, which may be used to prioritize the flow throughout the platform [14]; this is used, for example, to process streaming data first in order to avoid late real-time data.
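To make the header fields concrete, the following sketch packs a simplified 3-DW (32-bit addressing) memory-request TLP header; the field positions follow the general layout of the PCIe specification, but the structure and helper are illustrative, not a complete encoder.

```c
#include <stdint.h>
#include <string.h>

/* Simplified 3-DW (32-bit addressing) memory-request TLP header. */
struct tlp_mem_req {
    uint8_t  fmt_type;     /* Fmt/Type: e.g. memory read or write   */
    uint16_t length_dw;    /* payload length in 32-bit double words */
    uint16_t requester_id; /* bus/device/function of the originator */
    uint8_t  tag;          /* matches completions with requests     */
    uint32_t address;      /* target address (DW aligned)           */
};

/* Serialize into the first 3 double words of the header (TLP headers
 * are transmitted most-significant byte first). */
static void tlp_pack(const struct tlp_mem_req *r, uint8_t out[12])
{
    memset(out, 0, 12);
    out[0]  = r->fmt_type;
    out[2]  = (r->length_dw >> 8) & 0x03;  /* Length[9:8]            */
    out[3]  = r->length_dw & 0xFF;         /* Length[7:0]            */
    out[4]  = r->requester_id >> 8;
    out[5]  = r->requester_id & 0xFF;
    out[6]  = r->tag;
    out[8]  = r->address >> 24;
    out[9]  = (r->address >> 16) & 0xFF;
    out[10] = (r->address >> 8) & 0xFF;
    out[11] = r->address & 0xFC;           /* low 2 bits reserved    */
}
```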

3.7.2 Data Link Layer

The Data Link Layer acts as an intermediate stage between the Transaction Layer and the Physical Layer. Its primary duty is to provide a reliable mechanism for the exchange of TLPs by appending a 32-bit cyclic redundancy check (CRC-32) and a sequence ID for data integrity management (packet acknowledgement and retry mechanisms) (see Figure 3.15).


In order to reduce packet retries (and the associated waste of bus bandwidth), a credit-based fair queuing scheme is adopted: a “credit” is accumulated by queues as they wait for service and is spent by queues while they are being serviced; queues with positive credit are eligible for service [20] (see Figure 3.16). This scheduling algorithm ensures that packets are only transmitted when it is known that a buffer is available to receive the packet at the other end. For incoming TLPs, the Data Link Layer accepts them from the Physical Layer and checks the sequence number and CRC. If an error is detected, the layer communicates the need to resend; otherwise, the TLP is delivered to the Transaction Layer.

Figure 3.15: each Layer appends a header and a tail to the packet

Figure 3.16: credit-based fair queuing (traffic shaping)
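The transmit-side rule can be summarized in a few lines of C; the structure and names below are illustrative of the credit bookkeeping, not taken from the specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal sketch of a credit-based transmit gate: the receiver advertises
 * how many buffer credits it has (credit_limit); the transmitter tracks
 * how many it has consumed and sends a packet only when enough remain.
 * Names and credit granularity are illustrative. */
struct credit_state {
    uint32_t credit_limit;     /* advertised by the receiver       */
    uint32_t credits_consumed; /* already spent by the transmitter */
};

static bool can_transmit(const struct credit_state *cs, uint32_t pkt_credits)
{
    /* transmit only if the receiver is known to have buffer space */
    return (cs->credit_limit - cs->credits_consumed) >= pkt_credits;
}
```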


3.7.3 Physical Layer

The Physical Layer interfaces the Data Link Layer with the signaling technology used for link data interchange. The Layer has two sections, the transmit logic and the receive logic, responsible for the transmission and reception of packets, respectively. These sections, in turn, are made up of a logical layer and an electrical layer. As the description of the electrical layer has already been provided (see paragraph 3.1.1), only the logical layer will be discussed. Its main tasks are (on the transmission side): framing the packet with start-of-packet and end-of-packet bytes (see Figure 3.15); splitting the byte stream across the lanes (in multi-lane Links) (see Figure 3.17); byte scrambling, both to reduce electromagnetic emissions (by dispersing the power spectrum over a wider frequency band) and to facilitate clock recovery (by removing long sequences of only ‘0’s or ‘1’s); 8b/10b encoding and serialization of the 10-bit symbols before sending them across the Link to the receiving device. On the receiving side, the dual tasks are performed: deserialization, 8b/10b decoding, byte descrambling, data reassembly (in multi-lane Links) and unframing.

Figure 3.17: single-lane Link byte stream (on the left) and the splitting of the data across the lanes in the case of a 4-lane Link (on the right).
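For reference, the Gen1/Gen2 scrambler is based on a 16-bit LFSR with polynomial G(X) = X^16 + X^5 + X^4 + X^3 + 1. The following bit-serial C sketch advances it by one byte; the seed and the exact bit ordering are simplified assumptions for illustration.

```c
#include <stdint.h>

/* Bit-serial sketch of the PCIe Gen1/Gen2 scrambler: 16-bit LFSR with
 * polynomial G(X) = X^16 + X^5 + X^4 + X^3 + 1, assumed seed 0xFFFF.
 * The exact bit ordering is simplified for illustration. */
static uint16_t lfsr = 0xFFFF;

static uint8_t scramble_byte(uint8_t data)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t top = (lfsr >> 15) & 1;  /* serial scrambling bit */
        out |= (uint8_t)((((data >> i) & 1) ^ top) << i);
        lfsr = (uint16_t)(lfsr << 1);     /* advance the LFSR      */
        if (top)
            lfsr ^= 0x0039;               /* x^5 + x^4 + x^3 + 1   */
    }
    return out;
}
```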


    Chapter 4

    Pixel-ROD board

The following paragraphs describe the board, still in its prototype stage, developed as a replacement for the previous series of readout boards employed in the ATLAS Pixel Detector. To take advantage of all the experience and effort spent on the ROD board (also allowing firmware portability), it was decided to keep working with FPGAs from Xilinx, upgrading to the 7-Series family. Moreover, exploiting the successful Master-Slave architecture of the previous boards, the Pixel-ROD was conceived as a merging of two Xilinx evaluation boards: the KC705 (constituting the slave device) and the ZC702 (the master section). This way of proceeding also allows a huge speed-up of the design and debugging process, since the newly developed hardware and software testing platforms can be validated on already tested and highly reliable boards (to try out the testing platforms themselves) before applying them to the prototype. Since the tests discussed later are meant to validate the slave section of the board, a brief general overview of the KC705 is provided before the description of the Pixel-ROD.

4.1 Xilinx KC705

As stated, the slave unit of the Pixel-ROD board is mainly based on a Xilinx evaluation board, the KC705. This board, shown in Figure 4.1, covers a wide range of applications and its primary features are listed below [21]:

    • Kintex-7 28nm FPGA (XC7K325T-2FFG900C);

• 1 GB DDR3 SODIMM memory, 800 MHz / 1600 Mb/s;

    • PCIe gen. 2 8-lane endpoint connectivity;

    • SFP+ connector;

    • 10/100/1000 tri-speed Ethernet with Marvell Alaska 88E1111 PHY;

• 128 MB Linear BPI Flash for PCIe configuration;

    • USB-to-UART bridge;


    • USB JTAG via Digilent module;

    • Fixed 200 MHz LVDS oscillator;

    • I2C programmable LVDS oscillator;

    Figure 4.1: Xilinx KC705 demo board

4.1.1 Why PCIe

The KC705 is a PCIe board, and the reasons behind the adoption of the PCIe interface are not only to be found in the fact that PCIe is the best candidate to replace the slower VME buses (whose data rate is limited to the 320 MB/s of VME320), but also in the creation of a new installation configuration. Indeed, one or two PCIe boards can be directly connected to the motherboard of the TDAQ PCs, providing a faster response (giving direct access to the main resources of the PCs) and an easier installation. This configuration is the most likely to be adopted for the experimental phase that will start after the Long Shutdown scheduled for 2023, not only in the ATLAS experiment but also in CMS. It follows the trend established by KC705-like boards, which are mainly designed to speed up real-time calculations performed in a PC. They are usually mounted on an external PCB, so they must be connected via an appropriate interface to the motherboard of the host PC. Since this interface is often the bottleneck of such systems, new-generation FPGA evaluation boards communicate via PCIe.


    4.1.2 Kintex-7 FPGA

The Xilinx Kintex-7 XC7K325T-2FFG900 mounted on this board is a powerful medium-range FPGA that can replace both Spartan-6 devices of the ROD board [22, 23]. Its key features are presented hereafter:

• Advanced high-performance FPGA with logic elements based on true 6-input lookup tables (LUTs) that can implement combinational logic or be configured as distributed memory;

    • High-performance DDR3 interface supporting up to 1866 Mb/s;

    • High-speed serial connectivity with 16 built-in Gigabit transceivers (GTX)

    having rates from 600 Mb/s to a maximum of 12.5 Gb/s, offering a special low-

    power mode, optimized for chip-to-chip interfaces;

    • A user configurable analog interface (XADC), incorporating dual 12-bit analog-

    to-digital converters (ADC) with on-chip temperature and supply sensors;

    • Powerful clock management tiles (CMT), combining phase-locked loop (PLL) and

    mixed-mode clock manager (MMCM) blocks for high precision and low jitter;

    • Integrated block for PCI Express (PCIe), for up to x8 Gen2 Endpoint and Root Port

    designs;

• 500 maximum user I/Os (excluding GTX) and about 16 Mb of Block RAM (BRAM).

4.2 Xilinx ZC702

The demo board from which the master section was derived is the Xilinx ZC702. It was chosen among the many boards in the Xilinx catalogue because its FPGA embeds a hard processor (two ARM Cortex-A9 cores) [24] that will substitute the hard processor implemented on the Virtex-5 of the ROD board (see paragraph 2.31).


4.3 The Pixel-ROD board

As stated, the Pixel-ROD was conceived as a merging of the KC705 and the ZC702. Naturally, many features of the two boards had to be removed, since they are not necessary for a readout board (Xilinx demo boards are not application specific), while many others had to be redesigned or completely developed from scratch, as they needed to be shared within the whole new board. Removed features include the LCD display, the SD card reader, the HDMI port, and a few GPIOs and LEDs. On the other hand, to implement a ROD-like Master-Slave architecture, a 21-bit differential bus was added between the two FPGAs in order to provide the necessary communication, as well as a 1-bit differential line to provide a common clock. Moreover, another 5-bit-wide single-ended bus was introduced as a general-purpose interconnection bus. One of the other features that needed a complete redesign was the JTAG chain, which had to include both FPGAs. To do so, a 12-pin (3x4) header (see Figure 4.2) was added to allow the possibility of excluding the Kintex from the JTAG chain, in order to prevent unwanted programming of the slave FPGA. In addition, an internal JTAG link from the Zynq to the Kintex was added: it allows the programming of the slave FPGA with the desired firmware, using the Zynq FPGA as master. This has been very helpful during the debugging sessions. Indeed, since the Pixel-ROD board was installed inside a PC, it was very difficult to access the JTAG port.

Figure 4.2: custom JTAG configuration header. In blue the full JTAG chain, in red the internal JTAG chain that excludes the Kintex.


    The main devices and features implemented on the Pixel-ROD board are the following:

    • Kintex-7 28 nm FPGA (XC7K325T-2FFG900C);

    • Zynq 7000 FPGA (XC7Z020-1CLG484C), featuring two ARM Cortex A9

    MPCore;

    • 2 GB DDR3 memory SODIMM (Kintex DDR);

    • 1 GB DDR3 component memory (Micron MT41J256M8HX-15E, Zynq DDR3);

    • PCI Express Gen2 8-lane endpoint connectivity;

    • SFP+ connector;

    • Three VITA 57.1 FMC Connectors (one HPC, two LPC);

    • Two 10/100/1000 tri-speed Ethernet with Marvell Alaska PHY;

• Two 128 Mb Quad SPI flash memories;

    • Two USB-to-UART bridges;

    • USB JTAG interface (using a Digilent module or header connection);

    • Two fixed 200 MHz LVDS oscillators;

    • I2C programmable LVDS oscillator;

As of now, a single Pixel-ROD can interface up to 16 equivalent FE-I4 channels (half of the 32 channels of the BOC-ROD pair).

4.3.1 Space constraints

The stack-up of a board defines the composition, the thickness and the function of each layer of a Printed Circuit Board (PCB). As stated, the constraints of the Pixel-ROD were extracted from two Xilinx demo boards, but a one-to-one mapping was not possible, since all the resulting layers needed to be merged into a single 16-layer stack-up. In fact, the maximum number of layers is fixed by the PCIe standard in order to respect the constraint on the thickness of the board (otherwise, it would not fit into the PCIe slot): the allowable thickness ranges from 1.44 to 1.70 mm [25]. The stack-up adopted is shown in Figure 4.3: the 16 PCB layers were used to provide the necessary space for the high number of traces while ensuring the alternation of signal and ground layers, as well as the concentration of the power layers in the innermost section of the board, in order to reduce the cross-talk between planes and to reach the required level of insulation.

In Figure 4.4 an example of a PCB layer is presented. Another constraint on the size was due to the PC case the board will be installed in, so the maximum length was set to 30 cm, leaving little margin for device placement. Finally, the height was left free, to allow sufficient room for all the necessary devices. The result of all these efforts is presented in Figure 4.5.

    Figure 4.3: Stack-up of Pixel-ROD.


Figure 4.5: the Pixel-ROD prototype. In blue, the components connected to the Kintex FPGA; in red, the ones related to the Zynq FPGA; in yellow, the power stage.

Figure 4.4: one of the 16 layers.


    Chapter 5

    Pixel-ROD test results

Because of its complexity, the Pixel-ROD had to pass through several testing stages in order to verify its correct behavior. The stages were divided into: a hardware wake-up phase, to ensure the hardware works correctly and can be properly configured, and the validation of all the board functionalities, by configuring, debugging and testing each device installed on the board. Since this thesis work concerns the creation of a testing platform meant not only for the validation of the PCIe interface of the board, but also for presenting and achieving a high-performance PCIe system which can transfer data between the board and a PC, only a brief description of the hardware wake-up phase is provided. Conversely, the platform itself, as well as the firmware adopted, will be comprehensively covered.

5.1 Power supply

As stated, the first testing stage the board passed through involved the configuration of the board power-up, which led to the programming of the three UCD9248 power controllers. The Fusion Digital Power Designer tool from Texas Instruments allows the user to set many important parameters, such as the voltage of each rail and the power-up sequence of each of them (see Figure 5.1).

    Figure 5.1: Fusion Digital Power Designer GUI, configuration of the voltage rails


5.2 Board interfaces and memory

When we started developing the test platform to perform PCIe transactions between a host PC and the board, the Ethernet subsystem, the internal bus connecting the Zynq to the Kintex, the SFP port, as well as all the interfaces of the two FPGAs, had already been tested. Since all the previous tests, in conjunction with the newly developed ones, take advantage of the many tools provided by the Vivado Design Suite by Xilinx, a brief description of this CAD tool is provided.

5.2.1 Vivado Design Suite

Vivado Design Suite is a tool suite developed to increase the overall productivity of designing, integrating and implementing systems based on Xilinx devices, which come with a variety of recent technologies, including high-speed I/O interfaces, hardened microprocessors and peripherals, analog mixed signals and more. The Vivado Design Suite allows for the synthesis and implementation of HDL designs, enabling developers to synthesize their designs, perform timing analysis, examine RTL diagrams, simulate the reaction of a design to different stimuli and configure the target device with the programmer. The design implementation is accelerated thanks to place-and-route tools that analytically optimize for multiple concurrent design metrics, such as timing, congestion, total wire length, utilization and power [26].

5.3 Memory test

Once the functionality of the power stage and of the interfaces of the Pixel-ROD had been consolidated, a complete memory test was designed in order to prove not only the possibility of performing basic operations on the board, but also its ability to sustain high-speed memory accesses. In particular, the test was meant to verify whether the RAM module accessible by the Kintex FPGA (a 2 GB SODIMM DDR3 memory bank) is subject to disturbance errors when repeated accesses are performed on the same memory bank but on different rows within a short period of time. These disturbance errors are caused by charge leakage and occur when the repeated accesses cause charge loss in a memory cell before the cell contents can be refreshed at the next DRAM refresh interval. Moreover, as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other [27].
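This access pattern is the well-known “row hammer” effect; on a CPU it is typically reproduced with a loop like the C sketch below (illustrative only: the two addresses are assumed to map to different rows of the same bank, and on the Pixel-ROD the equivalent pattern is generated by the FPGA firmware rather than by a CPU).

```c
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (x86 SSE2) */

/* Illustrative row-hammer loop (CPU variant): repeatedly activate two
 * rows of the same bank, flushing the cache lines so that every read
 * actually reaches the DRAM. addr_a and addr_b are assumed to map to
 * different rows of the same bank. */
static void hammer(volatile uint8_t *addr_a, volatile uint8_t *addr_b,
                   unsigned long iterations)
{
    for (unsigned long i = 0; i < iterations; i++) {
        (void)*addr_a;                      /* activate row A */
        (void)*addr_b;                      /* activate row B */
        _mm_clflush((const void *)addr_a);  /* evict, so the next read hits DRAM */
        _mm_clflush((const void *)addr_b);
        _mm_mfence();
    }
}
```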

Another advantage of running a test of this kind is that it brings the board closer to implementing its full functionality. In fact, it involves not only the bare trace interconnections between devices, but also specific ICs and, especially, the development of a complex firmware (which can become very time consuming). In this respect, the smart design obtained by using the KC705 board as a reference (see chapter 4.3) sped up the entire process, since it made available a platform very similar to the Pixel-ROD board, where the firmware could be validated before being loaded on the tested board itself.

5.3.1 Vivado IP Integrator and AXI4 Interface

In order to develop the firmware of the test, the Intellectual Property (IP) Integrator tool (part of the Vivado Design Suite) was used. It has been defined by Xilinx as “the industry’s first plug-and-play system integration design environment”, since it allows the user to create complex system designs by instantiating and interconnecting IP cores from the Vivado IP catalog on a design canvas. In this way, the user can take advantage of the IP already available in the Vivado library to speed up the firmware development, which would otherwise take a considerable amount of time. Many IP subsystems for Ethernet, PCIe, HDMI, video processing and image sensor processing are available with the Vivado Design Suite. As an example, the AXI-4 PCIe subsystem is made up of multiple IP cores, including PCIe, DMA and AXI-4 Interconnect, and it is used to provide the software stack necessary for the developed testing platform. Therefore, before going into further details of the tests, a brief description of the main IP cores used, as well as of the AXI interface, is provided.

    The AXI protocol

AXI stands for Advanced eXtensible Interface [28] and it is part of the ARM Advanced Microcontroller Bus Architecture (AMBA), a family of open-standard microcontroller buses. AMBA is widely used on a range of ASIC and SoC parts, including the application processors used in modern portable mobile devices such as smartphones. Nowadays, it has become a de-facto standard for 32-bit embedded processors because of its exhaustive documentation and the absence of royalties. The AXI4 protocol is the default interface for IP cores and it was extensively used during the debug of the Pixel-ROD board. The AXI4 protocol presents three key features: firstly, it provides a standardized interface between the many IPs (thus allowing the user to concentrate on the system debug rather than on the protocol needed); secondly, the AXI4 protocol is flexible, meaning that it suits a variety of applications, from single, light data transactions to bursts of 256 data transfers with just a single address phase; finally, since AXI4 is an industrial standard, it also allows access to the whole ARM environment.

    There are three types of AXI4 interfaces:

    • AXI4, used for high-performance memory-mapped operations;

    • AXI4-Lite, used for simple, low-throughput memory-mapped communication;

    • AXI4-Stream, used for high-speed data streams.
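From the host point of view, an AXI4-Lite slave is ultimately seen as a block of memory-mapped registers. As an illustration, the following userspace C sketch accesses such a register through a PCIe BAR exported via sysfs; the device path and register offset are placeholders that depend on the actual enumeration and firmware address map.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Placeholder sysfs path and register offset: both depend on the
 * enumerated device and on the firmware address map. */
#define BAR_PATH   "/sys/bus/pci/devices/0000:01:00.0/resource0"
#define REG_OFFSET 0x0000u   /* hypothetical control register */

int main(void)
{
    int fd = open(BAR_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* one AXI4-Lite transaction carries a single data transfer */
    regs[REG_OFFSET / 4] = 0x1;                     /* register write   */
    printf("reg = 0x%08x\n", regs[REG_OFFSET / 4]); /* register read    */

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```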

The AXI4 interface uses a Master-Slave architecture, and all AXI4 masters and slaves can be connected by means of a specific IP named Interconnect. Both AXI4 and AXI4-Lite define the following independent transaction channels:

    • Read Address Channel;

    • Read Data Channel;

    • Write Address Channel;

    • Write Data Channel;

    • Write Response Channel.

The address channels carry control information that describes the nature of the data to be transferred. Data can simultaneously move in both directions between master and slave, and data transfer sizes can vary (see Figure 5.2). The limit in AXI4 is a burst transaction of up to 256 data transfers, while the AXI4-Lite interface allows only one data transfer per transaction.

    When the master needs to read data from a slave,

    it sends over the dedicated channel bo