
  • Università Cattolica del Sacro Cuore

    Sede di Brescia

    Faculty of Mathematical, Physical and Natural Sciences

    Master's Degree Course in Physics

    an analysis of low-dimensional entangled states with neural networks

    Supervisor:

    Prof. Fausto Borgonovi

    Co-supervisors:

    Dr. Chahan Kropf

    Dr. Marco Luigi Della Vedova

    Candidate: Marco Gullí

    Student ID: 4709669

    Academic Year 2018/2019

  • Contents

    1 Introduction

    2 Quantum entanglement
      2.1 A brief history of quantum entanglement
      2.2 Entangled and separable states
        2.2.1 Bipartite quantum entanglement
        2.2.2 Pure and mixed states
        2.2.3 Bipartite and multipartite quantum entanglement
        2.2.4 Schmidt decomposition and local unitary transformations
        2.2.5 Werner states
      2.3 Geometry of entanglement
        2.3.1 The complex projective space
        2.3.2 The Bloch sphere
        2.3.3 The set of density matrices
      2.4 Measures of entanglement
        2.4.1 PPT criterion
        2.4.2 Negativity
        2.4.3 Concurrence
        2.4.4 Purity of the reduced state
        2.4.5 Entropy of entanglement
        2.4.6 Entanglement witnesses
      2.5 The entanglement classification problem

    3 Neural networks
      3.1 Machine learning
      3.2 Supervised machine learning
        3.2.1 Steps of the supervised learning approach
      3.3 Artificial neural networks
        3.3.1 The perceptron
        3.3.2 Feed-forward neural networks
        3.3.3 Convolutional neural networks
        3.3.4 Some disadvantages of neural networks

    4 A study of low-dimensional entangled states with neural networks
      4.1 The sampling procedure
        4.1.1 Werner states
        4.1.2 Pure states
        4.1.3 Mixed states
      4.2 Werner states
        4.2.1 The training process of the perceptron
        4.2.2 Dependence of the perceptron performance on the states sampling
      4.3 Bipartite systems of qubits
        4.3.1 Werner states and rotated Werner states
        4.3.2 General mixed states
        4.3.3 General pure states
        4.3.4 General pure and mixed states
        4.3.5 Bloch vectors
      4.4 Bipartite systems of qutrits and ququarts

    5 Conclusions

    6 Acknowledgments

    References

  • Introduction

    Quantum physics was born in the first years of the twentieth century as a result of a crisis of classical physics, which could not explain many experimental results. To quote but a few, we mention the black-body electromagnetic spectrum [1] and the photoelectric effect [2]. As is well known, quantum mechanics proposed a solution to these issues based upon wave-particle duality. In quantum physics, particles are described by wavefunctions and behave differently from classical particles. Among such differences is quantum entanglement, which was first discussed by Schrödinger in 1935 [3]. Quantum entanglement is due to correlations that have no classical counterpart; for example, they do not depend on the distance. Quantum entanglement was debated starting with a famous paper by Einstein, Podolsky and Rosen (EPR), who questioned its physical validity with the so-called EPR paradox [4]. If quantum mechanics provided a complete description of nature, then the correlations behind quantum entanglement would seem to underpin an information transfer faster than light, which would contradict special relativity and the causality principle. EPR concluded that quantum theory is not complete and neglects some hidden variables, so that it should be replaced by a hidden-variables theory. In the 1960s, J. S. Bell [5] derived inequalities on correlations between observables whose violation would confirm the validity of quantum mechanics against hidden-variables theories. From the 1970s on, many experiments [6, 7, 8] have confirmed the violation of Bell's inequalities, which has credited quantum mechanics and ruled out hidden-variables theories.

    Over time, quantum entanglement has become increasingly important for several applications, such as quantum computers [9] and quantum cryptography [10], and it is thought to play a role in biological systems as well [11]. However, even today our knowledge of this phenomenon is incomplete; for example, there is no universal tool to determine exactly whether an unknown quantum state is entangled or not. This is the case both experimentally and theoretically. Experimentally, the problem is to find a universal observable that witnesses exactly all entangled states. The theoretical question, which we approach in this thesis, is the search for general analytical criteria that determine exactly the boundary between the set of entangled states and the set of separable states. This problem is solved exactly for some cases, such as any bipartite system described by a pure state, bipartite systems composed of two-dimensional particles, and special classes of states like Werner states. However, for multipartite states, for instance, the boundary between entangled and separable states is known only approximately. In particular, states close to the boundary are hard to classify. Besides, the number of degrees of freedom needed to describe the system increases with the dimension and the number of parties of the system itself. Another issue is that the higher the dimension, the smaller the volume of the set of separable states, so it is hard to sample high-dimensional separable states.

    In this thesis, we will use artificial neural networks to study the boundary between the set of entangled states and the set of separable states, which is a highly non-linear function defined in a space with many degrees of freedom. In this regard, since neural networks are theoretically universal approximators of any non-linear boundary [12, 13], they appear well suited for the task. They have also become popular recently as a result of their efficiency in dealing with complex problems like image classification and recognition [14], earthquake prediction and detection [15], medical diagnosis of some diseases [16] and many-particle quantum entanglement [17].

    Artificial neural networks belong to the class of supervised machine learning algorithms. Hence, they are typically fed with training data, used to learn a classification boundary, and their performance is then gauged on unseen data belonging to a test dataset. Usually, training and test sets come from external sources, but in our case they were both sampled artificially. In our work, the neural networks were fed with states which were predicted as entangled or separable. In the supervised approach, these predictions need to be compared with exact labels, which can be produced for some low-dimensional states such as those we considered in this work. As we will see, artificial neural networks are generally able to approximate the boundary between the set of entangled states and the set of separable states. Furthermore, their performance is efficient both for linear and for some non-linear boundaries. This thesis is structured as follows:

    • in the first Chapter, the mathematical definition of bipartite and multipartite entanglement for pure and mixed states is given; moreover, we show and discuss the validity of some criteria to classify quantum states and we investigate the geometry of entangled and separable states.

    • In the second Chapter, we present some fundamentals of machine learning, describing how some machine learning algorithms work. Next, we introduce the basic unit of artificial neural networks, the perceptron, which is the elementary building block of more complex structures such as feed-forward and convolutional neural networks. Aspects of neural networks related to their architecture and training process on data are then discussed further, highlighting some of their potentialities and limits.


    • Finally, the third Chapter concludes this thesis by showing some results about the classification of states with neural networks, in bipartite systems with a dimension up to 16. Some aspects and consequences of the sampling procedure are discussed further. Then, the performance of neural networks is analyzed both for a special class of mixed states, the Werner states, and for general pure and mixed states.

  • Quantum entanglement

    2.1 A brief history of quantum entanglement

    Quantum systems may exhibit properties that do not have any classical counterpart. This happens both for one-particle phenomena, such as superposition of states and tunneling, and for many-particle phenomena, like quantum entanglement. The concept of quantum entanglement was first introduced by Schrödinger in 1935 [3] to deal with the Einstein, Podolsky and Rosen (EPR) paradox [4]. In order to understand the importance of quantum entanglement in explaining non-classical correlations, we shall retrace some of the crucial historical steps of quantum mechanics.
    Historically speaking, quantum mechanics was born as the consequence of a series of events and discoveries in the 1920s. The scientific community was enthusiastic about the new quantum theory, which made it possible to explain phenomena such as the photoelectric effect [2, 18], but at the same time it was also skeptical.

    • Quantum mechanics appeared deceptive because measurement plays a central role in the theory. Since measuring a certain observable is synonymous with perturbing the system itself, such a theory would be dependent on the particular conditions under which measurements are performed. As a result, quantum mechanics would give rise to conclusions that are just a mere consequence of some experimental conditions and might not reveal any intrinsic property of a physical system.

    • The quantum statistical approach is based upon wave functions, which appeared to be mere formal devices conceived just for the sake of expressing the results of a measurement, and whose finer physical details are unknown.

    In his paper authored with Podolsky and Rosen in 1935 [4], Einstein dealt with the relation between completeness, reality and locality of quantum mechanics [19].

    • Saying that quantum mechanics is complete means that any element of the physical reality must have a counterpart in the physical theory. In other terms, any system must be described by an underlying physical state that expresses simultaneously every physical quantity (momentum, position, energy, time...).

    • The Einstein-Podolsky-Rosen (EPR) criterion of reality says, roughly speaking, that if we can know with certainty (i.e., with a probability equal to 1) the value of a certain physical quantity without disturbing the system, then this quantity is an element of reality.

    • Any physical theory must be EPR-local, which means that information cannot travel instantaneously over an arbitrary distance. Indeed, this would imply the existence of superluminal signals, which would violate the causality principle of special relativity.

    Assuming that quantum mechanics is complete, real and EPR-local at the same time would imply the so-called EPR paradox. In order to grasp this implication, let us make a gedankenexperiment where a source emits two electrons along two opposite directions towards two distant observers, Alice and Bob. Each pair of electrons has a total magnetic spin equal to 0, therefore their state is entangled and two alternatives are possible:

    1. Alice’s electron spin is positive and Bob’s one is negative: A+,B-,

    2. Alice’s electron spin is negative and Bob’s one is positive: A-,B+.

    Now, let us suppose that Alice measures a positive spin and next Bob measures a negative spin. Focusing on the time elapsed between the emission and Alice's measurement, a question arises: was the value of the magnetic spin determined at the moment of the emission or at the moment of the measurement?

    • According to quantum mechanics, the pair of electrons is in a superposition of the states A+,B- and A-,B+, i.e. it is entangled. Only once the single spins are measured does the system state collapse onto one of the above states.

    • EPR's argument is that the single magnetic spins of the electrons are well defined at the moment when the electrons are emitted, therefore the act of measurement just unveils their values without determining them.

    Thus, if the quantum-mechanical answer were correct, right after Alice's measurement a signal would have to travel from Alice to Bob to determine Bob's measurement outcome, such that the total spin is zero. Such an information transfer would have to be instantaneous and independent of the distance. Hence, it would travel faster than light. As a consequence, quantum entanglement would contradict the causality principle of special relativity, according to which no superluminal signals are physically possible. Assuming that quantum mechanics is a physically consistent theory, i.e. that it is real and EPR-local, would then exclude that it is complete. The quantum approach would not describe some quantities, the hidden variables, therefore there should exist a more general theory, a hidden-variable theory, that could predict any measurement result while satisfying the three principles of reality, EPR-locality and completeness. The EPR paradox remained an open question for some decades.

    In 1964, J. S. Bell [5] derived an inequality for triads of observables represented by their associated variables A, B and C:

    C(A,C)− C(B,A)− C(B,C) ≤ 1, (2.1)

    where C(A,C) is the correlation between A and C. Let C_h(A,C) be the correlation predicted by a hidden-variable theory, defined as the expectation value over some hidden parameter λ with probability distribution p(λ),

        C_h(A,C) = ∫ A(λ) C(λ) p(λ) dλ.

    The quantum-mechanical correlation between A and C, C_q(A,C), in a system whose state is |φ〉 is the expectation value of their product on this state, i.e.

        C_q(A,C) = 〈φ|AC|φ〉.

    It turns out that the correlations C_q do not always respect Bell's inequality (2.1). From the 1970s on, quantum correlations measured in several experiments [6, 7, 8] have been shown to violate Bell's inequalities (2.1). Besides, this violation is not exact and comes with some exceptions called "loopholes" [20]. In spite of these loopholes, the scientific community has confirmed the validity of quantum mechanics against any hidden-variable theory, which has to be ruled out. Thus, quantum mechanics cannot be reconciled with EPR-locality and realism at the same time. Coming back to our gedankenexperiment, we can conclude that in the time elapsed between the emission and the measurement, the pair of electrons is in a superposition of the two entangled alternatives A+,B- and A-,B+. Their spins become elements of reality only after a measurement has been performed, which determines the single spins and causes the collapse of the wavefunction onto either A+,B- or A-,B+.

    2.2 Entangled and separable states

    2.2.1 Bipartite quantum entanglement

    We aim at giving a first mathematical definition of quantum entanglement for systems split into two parties. Each party could be represented by one or many particles, like electrons or atoms. Here, we start from a bipartite system composed of two qubits, i.e. a pair of two-level particles. The single-particle wavefunctions |ψ1〉 and |ψ2〉 are elements of the Hilbert spaces HA and HB, to which we assign the two orthonormal bases {|0〉A, |1〉A} and {|0〉B, |1〉B} respectively. In this basis, the column-vector representation of the two vectors |0〉 and |1〉 is

        |0〉 = (1, 0)^T,   |1〉 = (0, 1)^T.

    Let |Ψ〉 ∈ H = HA ⊗ HB be a state of the total bipartite system. Here, H is a Hilbert space with an orthonormal basis composed of four vectors, which we define as

        |00〉AB = (1, 0, 0, 0)^T,   |01〉AB = (0, 1, 0, 0)^T,   |10〉AB = (0, 0, 1, 0)^T,   |11〉AB = (0, 0, 0, 1)^T.

    The system state |Ψ〉 is entangled if there does not exist any pair of states |ψ1〉 ∈ HA, |ψ2〉 ∈ HB such that |Ψ〉 = |ψ1〉 ⊗ |ψ2〉. Given α, β, γ, δ ∈ C, let us write

    |ψ1〉 = α|0〉A + β|1〉A,

    |ψ2〉 = γ|0〉B + δ|1〉B,

    then their tensor product is

    |ψ1〉 ⊗ |ψ2〉 = αγ|00〉AB + αδ|01〉AB + βγ|10〉AB + βδ|11〉AB. (2.2)

    The state |Ψ〉 is entangled if there does not exist any quartet α, β, γ, δ such that it can be decomposed as in Eq. (2.2). As an example, let us consider the states

        |Ψ1〉 = (1/√2)(|01〉AB + |00〉AB),
        |Ψ2〉 = (1/√2)(|01〉AB + |10〉AB).

    The state |Ψ1〉 is not entangled: as a matter of fact, it can be expressed according to the decomposition (2.2) with αγ = αδ = 1/√2 and βγ = βδ = 0. A possible solution of these equalities is α = 1, β = 0, γ = δ = 1/√2, so that

        |ψ1〉 = |0〉A,
        |ψ2〉 = (1/√2)(|0〉B + |1〉B).

    On the other hand, |Ψ2〉 is entangled: indeed, the equations αδ = βγ = 1/√2 and αγ = βδ = 0 do not have any solution, thus |Ψ2〉 cannot be factorized


    according to Eq. (2.2). The state |Ψ2〉 belongs to the class of Bell states |Bi〉 (i = 1, 2, 3, 4), which are maximally entangled states of a two-qubit system and are defined as

        |B1〉 = (1/√2)(|01〉AB − |10〉AB),   (2.3)
        |B2〉 = (1/√2)(|01〉AB + |10〉AB),   (2.4)
        |B3〉 = (1/√2)(|00〉AB − |11〉AB),   (2.5)
        |B4〉 = (1/√2)(|00〉AB + |11〉AB).   (2.6)
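    For pure two-qubit states, the factorization test used above is equivalent to checking that the 2×2 matrix of amplitudes has rank one (a single non-vanishing Schmidt coefficient, see Section 2.2.4). A minimal NumPy sketch of this check, applied to |Ψ1〉 and |Ψ2〉 (the helper name schmidt_rank and the numerical tolerance are our own choices, not part of the thesis):

        import numpy as np

        def schmidt_rank(psi, dA=2, dB=2, tol=1e-10):
            """Number of non-vanishing Schmidt coefficients of a bipartite pure state."""
            coeffs = np.asarray(psi, dtype=complex).reshape(dA, dB)   # amplitude matrix [[ag, ad], [bg, bd]]
            singular_values = np.linalg.svd(coeffs, compute_uv=False)
            return int(np.sum(singular_values > tol))

        # Amplitudes ordered as (|00>, |01>, |10>, |11>)
        psi1 = np.array([1, 1, 0, 0]) / np.sqrt(2)   # (|00> + |01>)/sqrt(2)
        psi2 = np.array([0, 1, 1, 0]) / np.sqrt(2)   # (|01> + |10>)/sqrt(2), the Bell state |B2>

        for name, psi in [("Psi1", psi1), ("Psi2", psi2)]:
            r = schmidt_rank(psi)
            print(name, "is", "separable" if r == 1 else "entangled", "(Schmidt rank", r, ")")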

    In the previous example it has been relatively simple to devise a short mathematical proof to determine if the quantum state of two qubits is entangled or not. Some issues arise in the more general case, in which for instance

    • in place of qubits, we choose parties of larger dimension d > 2,

    • in place of bipartite states, we analyze the entanglement of a many-party system,

    • the state of the system is mixed.

    2.2.2 Pure and mixed states

    In the previous Section 2.2.1 we dealt with systems whose state |ψ〉 is pure or, in other words, can be described by a wave function. More generally, the state of a system can correspond to a large number N of possible preparations |ψi〉〈ψi|, each occurring with probability pi. In this case, the system state is mixed and it is described by a density matrix ρ̂, whose expression is

        ρ̂ = ∑_{i=1}^{N} pi |ψi〉〈ψi|.   (2.7)

    While mixed states can be represented only by density matrices, pure states can be written down both as density matrices and as wavefunctions. Besides, we point out that the concept of mixed state must not be confused with the quantum superposition principle, according to which a linear combination of states is again a state.
    A density matrix ρ̂ is a complex matrix with three properties:

    • it is Hermitian: ρ̂ = ρ̂†,

    • it is positive semidefinite: ∀|ψ〉 : 〈ψ|ρ̂|ψ〉 ≥ 0,

    • it has trace one: Tr(ρ̂) = 1.


    An important property of a d×d density matrix is that it has d² − 1 degrees of freedom. For instance, the density matrix ρ̂ of a two-qubit system has 15 independent components,

        ρ̂ = [ a11           a12 + i b12   a13 + i b13   a14 + i b14
              a12 − i b12   a22           a23 + i b23   a24 + i b24
              a13 − i b13   a23 − i b23   a33           a34 + i b34
              a14 − i b14   a24 − i b24   a34 − i b34   1 − ∑_{i=1}^{3} aii ],   (2.8)

    where aij = Re(ρij) and bij = Im(ρij) with i, j = 1, 2, 3, 4.

    In order to determine whether a density matrix represents a pure or a mixed state, one could calculate its purity Pur(ρ̂) ∈ [1/d, 1], defined as

        Pur(ρ̂) = Tr(ρ̂†ρ̂) = Tr(ρ̂²),   (2.9)

    where ρ̂† is the Hermitian conjugate of ρ̂. If ρ̂ is pure, Pur(ρ̂) = 1; if ρ̂ is mixed, then Pur(ρ̂) < 1, and Pur(ρ̂) = 1/d when ρ̂ is the maximally mixed state.
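    The purity of Eq. (2.9) is straightforward to evaluate numerically; a minimal sketch (the helper name purity is our own):

        import numpy as np

        def purity(rho):
            """Pur(rho) = Tr(rho^2): 1 for pure states, 1/d for the maximally mixed state."""
            return float(np.real(np.trace(rho @ rho)))

        d = 4
        pure = np.zeros((d, d)); pure[0, 0] = 1.0   # the projector |00><00|, a pure state
        maximally_mixed = np.eye(d) / d             # I/d

        print(purity(pure))              # 1.0
        print(purity(maximally_mixed))   # 0.25 = 1/d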

    2.2.3 Bipartite and multipartite quantum entanglement

    Bipartite systems

    A bipartite system is represented by a Hilbert space H = H1 ⊗ H2, where H1 and H2 are the two Hilbert spaces corresponding to the single parties [21]. A pure state |Ψ〉 ∈ H is

    • separable if there exists at least one pair of states |ψ1〉 ∈ H1, |ψ2〉 ∈ H2 such that

    |Ψ〉 = |ψ1〉 ⊗ |ψ2〉, (2.10)

    • entangled otherwise.

    A mixed state ρ̂ is

    • separable if it can be expressed as a convex combination of local density matrices ρ̂_1^i and ρ̂_2^i,

        ρ̂ = ∑_i p_i ρ̂_1^i ⊗ ρ̂_2^i,   (0 ≤ p_i ≤ 1, ∑_i p_i = 1),   (2.11)

    • entangled otherwise.


    Multipartite systems

    The definition of entangled and separable states can be generalized to a multipartite system [22], described by a Hilbert space H = H1 ⊗ ... ⊗ Hm, where H1, ..., Hm are the Hilbert spaces of the single parties. The state ρ̂ defined on H is fully separable if and only if it can be decomposed as a convex combination of m local density matrices ρ̂_1^i, ..., ρ̂_m^i as

        ρ̂ = ∑_{i=1}^{N} p_i ρ̂_1^i ⊗ ... ⊗ ρ̂_m^i,   (0 ≤ p_i ≤ 1, ∑_i p_i = 1);

    otherwise, ρ̂ is entangled.
    A multipartite state can also be partially separable. Let A = {I1, ..., Ik} be a partition of the index set I = {1, ..., m}, such that all Ij are disjoint and I = ∪_{j=1}^{k} Ij. Under the condition k < m, a number k of Hilbert spaces H̃1, ..., H̃k can be defined starting from H = H1 ⊗ ... ⊗ Hm. For example, if m = 3 and k = 2, a possibility could be H̃1 = H1 ⊗ H2 and H̃2 = H3. The state ρ̂ is separable with respect to the partition A if and only if it can be written as a convex combination of k local density matrices ρ̂_1^i, ..., ρ̂_k^i, each defined on H̃1, ..., H̃k, thus

        ρ̂ = ∑_{i=1}^{N} p_i ρ̂_1^i ⊗ ... ⊗ ρ̂_k^i,   (0 ≤ p_i ≤ 1, ∑_{i=1}^{N} p_i = 1).

    A multipartite system divided into m parties can be split into k ≤ m parties, which form a k-partition [23]. It could happen that a certain state ρ̂ is entangled on the m parties but fully separable with respect to the k-partition. In this regard, a state is defined as k-separable if it is fully separable with respect to a certain k-partition, and two states that are k-separable with respect to the same partitions belong to the same separability class. States that are k-separable only on the trivial partition (namely, k = 1) are characterized as fully entangled or inseparable. A k-separable state is more entangled than any other l-separable state for k < l, therefore a hierarchy of separability can be established. Let us make an example of a hierarchy of separability for a three-qubit system, associated with five possible partitions: 123, 1|23, 2|13, 3|12, 1|2|3. Correspondingly, the state of the system could be

    • fully separable, so it is separable for all of the partitions,

    • 3-qubit biseparable, if it is separable for the partitions 123, 1|23, 2|13, 3|12 but not for 1|2|3,

    • 2-qubit biseparable, if it is separable for 123, 1|23 and 2|13 but not for 3|12 nor 1|2|3,


    • 1-qubit biseparable, if it is separable for 123 and 1|23,

    • fully entangled, which is separable just for the trivial partition 123.

    2.2.4 Schmidt decomposition and local unitary transformations

    In Eqs. (2.10), (2.11) we defined separability for pure states and mixed states. Let us focus on pure states, taking |Ψ〉 ∈ H with H = H1 ⊗ H2. Here, the two Hilbert spaces H1 and H2 are equipped with two orthonormal bases {|αi〉} and {|βj〉} respectively. The state |Ψ〉 can also be decomposed as a function of some non-negative coefficients pij, called Schmidt coefficients or entanglement spectrum, such that

        |Ψ〉 = ∑_{ij} √(pij) |αi〉 ⊗ |βj〉,   ∑_{ij} pij = 1,   (2.12)

    where {|αi〉 ⊗ |βj〉} are separable vectors that form the Schmidt basis. The Schmidt coefficients are related to the quantum entanglement of pure states. Let us suppose that |Ψ〉 has only one non-vanishing Schmidt coefficient pkl, thus

        |Ψ〉 = √(pkl) |αk〉 ⊗ |βl〉.   (2.13)

    Since Eqs. (2.10) and (2.13) are two decompositions of the same state |Ψ〉 on different bases, we can conclude that it is separable.

    On the two Hilbert spaces H1, H2 we can define two arbitrary local unitary transformations Û1, Û2 respectively, which fulfill the condition

        Ûi Ûi† = Ûi† Ûi = Î,   (i = 1, 2)

    with Î the identity operator. Let us define Û as

        Û = Û1 ⊗ Û2,   (2.14)

    and let us apply it to the state |Ψ〉, so that

        Û|Ψ〉 = ∑_{ij} √(pij) Û1|αi〉 ⊗ Û2|βj〉 = ∑_{ij} √(pij) |α̃i〉 ⊗ |β̃j〉,

    where |α̃i〉 = Û1|αi〉 and |β̃j〉 = Û2|βj〉. The two states Û|Ψ〉 and |Ψ〉 have equal Schmidt coefficients, or entanglement spectrum, hence the quantum entanglement of the state |Ψ〉 is invariant under local unitary transformations like Û. As a matter of fact, such a transformation maps the orthonormal bases {|αi〉}, {|βj〉} into {|α̃i〉}, {|β̃j〉} without transforming the Schmidt coefficients, which are uniquely defined. In this case, the two states |Ψ〉 and Û|Ψ〉 are said to be local unitary equivalent, or LU-equivalent.
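    The invariance of the entanglement spectrum under local unitaries is easy to verify numerically: the Schmidt coefficients are the squared singular values of the amplitude matrix of |Ψ〉. A minimal NumPy sketch (helper names are our own; the random unitaries are drawn from the QR decomposition of a complex Gaussian matrix):

        import numpy as np

        def schmidt_coefficients(psi, dA, dB):
            """Schmidt coefficients p_i = squared singular values of the amplitude matrix."""
            s = np.linalg.svd(np.asarray(psi, dtype=complex).reshape(dA, dB), compute_uv=False)
            return np.sort(s**2)[::-1]

        def random_unitary(d, rng):
            """Random unitary from the QR decomposition of a complex Gaussian matrix."""
            q, r = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
            return q * (np.diag(r) / np.abs(np.diag(r)))

        rng = np.random.default_rng(0)
        psi = np.array([0, 1, 1, 0]) / np.sqrt(2)                       # Bell state |B2>
        U = np.kron(random_unitary(2, rng), random_unitary(2, rng))     # U = U1 (x) U2

        print(schmidt_coefficients(psi, 2, 2))       # [0.5 0.5]
        print(schmidt_coefficients(U @ psi, 2, 2))   # unchanged up to numerical error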


    2.2.5 Werner states

    Werner states were first introduced by R. F. Werner in 1989 [24] as an example of d×d states whose bipartite entanglement depends upon a single parameter psym ∈ [0, 1]. A bipartite Werner state ρ̂W can be expressed as

        ρ̂W = (1/(d² − dα)) (Î − α P̂),   α = ((1 − 2 psym) d + 1)/(1 − 2 psym + d),   (2.15)

    where Î is the identity operator and P̂ = ∑_{ij} |i〉〈j| ⊗ |j〉〈i| is the flip (swap) operator defined on the orthonormal bases {|i〉}, {|j〉} of the single parties. The bipartite Werner state ρ̂W is

    • separable if psym ≥ 1/2,

    • entangled if psym < 1/2.

    The density matrix components of Werner states depend linearly on the parameter psym; for example, a two-qubit Werner state reads

        ρ̂W = [ psym/3   0               0               0
                0        (3 − 2psym)/6   (4psym − 3)/6   0
                0        (4psym − 3)/6   (3 − 2psym)/6   0
                0        0               0               psym/3 ].   (2.16)

    In 2003, Eggeling defined multipartite Werner states [25], which no longer depend on the single parameter psym but on N! − 1 independent parameters, and can be expressed as linear combinations of the N! permutations of N subsystems.
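    Eq. (2.15) translates directly into code; a minimal sketch (the function name werner_state is our own) that reproduces the two-qubit matrix of Eq. (2.16):

        import numpy as np

        def werner_state(p_sym, d=2):
            """Bipartite d x d Werner state of Eq. (2.15), parametrized by p_sym in [0, 1]."""
            alpha = ((1 - 2 * p_sym) * d + 1) / (1 - 2 * p_sym + d)
            swap = np.zeros((d * d, d * d))
            for i in range(d):
                for j in range(d):
                    swap[i * d + j, j * d + i] = 1.0     # P = sum_ij |ij><ji|
            return (np.eye(d * d) - alpha * swap) / (d * d - d * alpha)

        rho = werner_state(p_sym=0.3)        # p_sym < 1/2, hence entangled
        print(np.round(rho, 3))              # matches Eq. (2.16) with p_sym = 0.3
        print(np.isclose(np.trace(rho), 1))  # True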

    2.3 Geometry of entanglement

    In this Section, we focus on the geometrical characterization of quantum entanglement. As a general remark, the set of entangled states and the set of separable states are disjoint, and their boundary generally corresponds to a highly non-linear and irregular function. Furthermore, as we saw in Section 2.2.2, the density matrix of a mixed state can always be written as a convex combination of two or more pure states. Besides, density matrices represent a special class of the more general family of Hermitian matrices. Both these facts can be translated into geometrical terms. Pure states belong to the complex projective space, which is a subset of M_{d²}, the set of the d×d density matrices. In turn, M_{d²} is a convex subset of the set of the d×d Hermitian matrices.


    2.3.1 The complex projective space

    The complex projective space CPd is defined as the quotient space between the unit sphere S^{2d−1} and the unit circle S¹ [26], i.e.

        CPd = S^{2d−1}/S¹.   (2.17)

    According to this definition, CPd identifies all elements Z1, Z2 ∈ S^{2d−1} that differ by a phase, i.e. Z1 = e^{iϕ}Z2 with e^{iϕ} ∈ S¹. For example, the complex projective spaces CP0, CP1 and CP2 are a point, the unit circle and the unit sphere respectively.

    Let (z1, ..., zd) ∈ CPd, then we can write

        (z1, z2, ..., zd) = (n1, n2 e^{iϕ2}, ..., nd e^{iϕd}),   (∑_{i=1}^{d} ni² = 1).   (2.18)

    According to this parametrization, the generic element of the complex projective space is identified by d positive numbers n1, ..., nd and by d − 1 phases ϕ2, ..., ϕd ∈ [0, 2π[. Furthermore, the coefficients n1, ..., nd can be expressed as functions of certain angles θ1 ∈ [0, 2π[ and θ2, ..., θd ∈ [0, π] as

        n1 = cos θ1 sin θ2 ... sin θd
        n2 = sin θ1 sin θ2 ... sin θd
        ...
        nd = cos θd   (2.19)

    The parametrization of Eq. (2.19) defines the hyperoctant coordinates, which for d = 2 are the usual spherical coordinates

        n1 = cos θ1 sin θ2
        n2 = sin θ1 sin θ2
        n3 = cos θ2

    Among others, the complex projective space CPd is also called the projective Hilbert space. In fact, the tuple of Eq. (2.18) can be immediately thought of as a wavefunction |ψ〉 = (ψ1, ..., ψd) by putting zi = ψi for each i = 1, ..., d. Further, another correspondence can be established between the points of CPd and the Bloch vector of a state.

    2.3.2 The Bloch sphere

    Let us consider the simple case of a single qubit. The complex projective space CP2 corresponds to the Bloch sphere, i.e. the unit three-dimensional ball. The points on the surface describe pure states and the internal points are mixed states. Let σ = (σx, σy, σz), where σx, σy, σz are the Pauli matrices expressed as

        σx = [[0, 1], [1, 0]],   σy = [[0, −i], [i, 0]],   σz = [[1, 0], [0, −1]].


    Any one-qubit density matrix ρ can be written as a function of its Bloch vector B as

        ρ = I/2 + B · σ.

    A representation of the Bloch sphere can be given by choosing the Pauli matrices σx, σy, σz as the axes. The North and the South poles are the eigenstates |0〉z and |1〉z of σz, as illustrated in Fig. 2.1.

    Figure 2.1: Representation of the Bloch ball in the system σx − σy − σz. The North and the South poles correspond to the eigenstates |0〉z and |1〉z of the Pauli matrix σz. The loci of pure and mixed states are the surface and the internal points of the Bloch sphere, respectively.

    The Bloch ball can also be generalized to some bipartite systems [27]. In this case, the density matrix ρ is expressed as a function of its generalized Bloch vector B as

        ρ = I/d + B · Γ,   (2.20)

    where, for two-qubit systems, d = 4 and Γ ∈ {I, σx, σy, σz}^{⊗2} \ {I ⊗ I}. The centre of the Bloch sphere is the state I/d, which represents the maximally mixed state and is separable.
    The components Bi of the Bloch vector B can be obtained from Eq. (2.20) by calculating

        Bi = Tr(Γi ρ),   i = 1, ..., d² − 1.   (2.21)

    All of the components Bi are real and define the mixture coordinates, which are the Cartesian coordinates in the space of the d×d density matrices M_{d²}.
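    Eq. (2.21) gives a direct recipe for the feature vectors used later as neural-network inputs; a minimal two-qubit sketch (the helper name bloch_vector is our own, and normalization conventions for the operators Γi vary in the literature):

        import numpy as np
        from itertools import product

        # Single-qubit operator basis {I, sx, sy, sz}
        I2 = np.eye(2)
        sx = np.array([[0, 1], [1, 0]], dtype=complex)
        sy = np.array([[0, -1j], [1j, 0]])
        sz = np.array([[1, 0], [0, -1]], dtype=complex)

        def bloch_vector(rho):
            """The 15 real components B_i = Tr(Gamma_i rho) of Eq. (2.21) for a two-qubit state."""
            gammas = [np.kron(a, b) for a, b in product([I2, sx, sy, sz], repeat=2)][1:]  # drop I (x) I
            return np.array([np.real(np.trace(g @ rho)) for g in gammas])

        psi = np.array([0, 1, 1, 0]) / np.sqrt(2)    # Bell state |B2>
        rho = np.outer(psi, psi.conj())
        print(bloch_vector(rho))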


    2.3.3 The set of density matrices

    Let us consider the set of four-dimensional density matrices, M4, and the set of the four-dimensional separable density matrices, M4s, which is a convex subset of M4. The geometry of M4s can be characterized with the maximal separable ball Bs. The set Bs is the largest ball inscribed in M4 and centred at I/4. This set was proven to be separable [28]. As shown in Fig. 2.2, the three-dimensional cross section of the 15-dimensional separable ball can be represented inside a tetrahedron whose vertices are the four maximally entangled Bell states |Bi〉〈Bi| (here i = 1, 2, 3, 4) as defined in Eqs. (2.3), (2.4), (2.5), (2.6).

    Generally, the separability of Bs has been shown for bipartite systems of higher dimensions as well [29, 30]. As a general remark, the set of separable density matrices M_{d²}^{s} is larger than Bs, i.e. Bs ⊆ M_{d²}^{s}, and for multipartite systems the states of Bs are not only separable but fully separable. The maximal separable ball Bs is a set with non-zero measure, and the ratio between its volume V(Bs) and the volume of the set of density matrices V(M_{d²}) is

        V(Bs)/V(M_{d²}) = π^{(d²−1)/2} 2^{(d²−d⁴)/2} Γ(d⁴) / [Γ((d⁴+1)/2) d^{d⁴} (d²−1)^{(d⁴−1)/2} ∏_{k=1}^{d²} Γ(k)],   (2.22)

    where Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt is the Gamma function. The higher the dimension d, the smaller the ratio of Eq. (2.22). Even though the maximal separable ball does not contain all separable states, in high-dimensional spaces the volume of M_{d²}^{s} behaves as V(Bs). As a consequence, the number of entangled states is larger than the number of separable states. In the limit d → +∞, the set of separable density matrices M_{d²}^{s} collapses onto a set with zero measure.

    Let us now come back to the maximal separable ball inscribed in M4 and let us consider the two-dimensional cross section of Bs shown in Fig. 2.3. Generally, only the convex combination of two separable states is guaranteed to result in a separable state. Indeed, let us express the separable mixed states ρ(1), ρ(2) as

        ρ(1) = ∑_k p_k ρ_1^k ⊗ ρ_2^k,   ρ(2) = ∑_i p_i ρ_1^i ⊗ ρ_2^i.

    Their convex combination with a constant α ∈ [0, 1] is

        αρ(1) + (1 − α)ρ(2) = α ∑_k p_k ρ_1^k ⊗ ρ_2^k + (1 − α) ∑_i p_i ρ_1^i ⊗ ρ_2^i
                            = ∑_{i,k} [α p_k ρ_1^k ⊗ ρ_2^k + (1 − α) p_i ρ_1^i ⊗ ρ_2^i]
                            = ∑_{i,k} [a_k ρ_1^k ⊗ ρ_2^k + b_i ρ_1^i ⊗ ρ_2^i].   (2.23)


    Figure 2.2: Three-dimensional section of the 15-dimensional maximal separable ball Bs in the space of density matrices M4. The maximal separable ball is inscribed in the tetrahedron whose vertices are the Bell states |Bi〉〈Bi| (i = 1, 2, 3, 4). Here, ρ∗ = I/d is the maximally mixed state and r is the radius of Bs. The diagonal from the centre to the side |B2〉〈B2| contains the Werner states. Original figure taken from [26].

    Since ak, bi ∈ [0, 1] for each i, k, we can make the substitution i, k → j, hence Eq. (2.23) becomes

        αρ(1) + (1 − α)ρ(2) = ∑_j [a_j ρ_1^j ⊗ ρ_2^j + b_j ρ_1^j ⊗ ρ_2^j] = ∑_j p_j ρ_1^j ⊗ ρ_2^j,

    which is a separable state once again. Furthermore, nothing can be concluded a priori about the convex combination of two entangled states, or of an entangled state and a separable state, which may give either a separable or an entangled state.

    2.4 Measures of entanglement

    So far, we have defined entangled states as those states that cannot be written as a tensor product of states. While this definition is precise, it is impractical to verify whether an unknown quantum state is entangled or not. Hence, we would like to find simple computable measures to assess whether and how much a state is entangled. In order to illustrate the concept of a measure of entanglement, let us consider the following example for pure bipartite states. Let H = H1 ⊗ H2 be a Hilbert space and let us take |ψ1〉 ∈ H1, |ψ2〉 ∈ H2.


    Figure 2.3: Two-dimensional section of the 15-dimensional maximal separable ball. Only the convex combination of two separable states ρ(1) and ρ(2) results in a separable state (blue line), whereas convex combinations of one separable state ρ(1) and an entangled state ρ(3) (violet line) or of two entangled states ρ(3) and ρ(4) (green line) may give a separable or an entangled state. Original figure taken from [26].

    Any separable state |φ〉 ∈ H can be expressed as

    |φ〉 = |ψ1〉 ⊗ |ψ2〉.

    Let |ψ〉 ∈ H be a generic state; then the minimal distance D between |ψ〉 and any separable state |φ〉 is

        D = min_{|φ〉} || |φ〉 − |ψ〉 ||,   (2.24)

    where ||·|| could be, for instance, the canonical norm defined on the Hilbert space H. The quantity D is a measure of entanglement: if D = 0 then |ψ〉 is separable, otherwise it is entangled.
    More generally, we can give axiomatic properties that a suitable entanglement measure must fulfill [31, 32]. In this regard, we define three operations that can be performed on a system split into two parties A and B [33].

    • A local operation (LO) involves a measurement that can be performed locally on A or B. Let â and b̂ be two local operations defined on the parties A and B respectively. The operators â and b̂ fulfill the conditions ∑_i âi† âi = ÎA and ∑_j b̂j† b̂j = ÎB, where ÎA and ÎB are the identity operators defined on the parties A and B. A global measurement on the system that combines the two local operations â and b̂ can be expressed as ∑_{ij} âi ⊗ b̂j = (∑_i âi) ⊗ (∑_j b̂j).

    • Classical communication (CC) refers to correlations between the parties A and B. When a complete measurement on the global system is performed, it can be represented by an operator which is not necessarily decomposable as the tensor product of individual operators â and b̂ acting on the single parties A and B respectively. When the initial state ρ̂ of a system undergoes a transformation that involves both LO and CC, it will eventually be expressed as

        ∑_{ij} (âi ⊗ b̂j) ρ̂ (âi† ⊗ b̂j†),   (2.25)

      where ∑_{ij} âi† âi b̂j† b̂j = Î, with Î the identity operator defined on the global system.

    • Postselection (PS) is carried out once both LO and CC have already been performed; therefore, the final density matrix of Eq. (2.25) is renormalized as

        ρ̂ → ∑_{ij} (âi ⊗ b̂j) ρ̂ (âi† ⊗ b̂j†) / Tr[(âi ⊗ b̂j) ρ̂ (âi† ⊗ b̂j†)].

    An entanglement measure F must respect the following axiomatic properties [31, 33]:

    1. F : ρ̂ → m maps any state ρ̂ into a non-negative real number m ∈ [0, +∞[,

    2. F(ρ̂) = 0 for any separable state ρ̂,

    3. for any state ρ̂, F is invariant under LU operations ÛA ⊗ ÛB acting on any two parties A and B of the system, i.e.

        F(ρ̂) = F(ÛA† ⊗ ÛB† ρ̂ ÛA ⊗ ÛB),

    4. the entanglement of a system does not increase after performing LO+CC operations, hence

        F(ρ̂) ≥ F( (âi ⊗ b̂j) ρ̂ (âi† ⊗ b̂j†) / Tr[(âi ⊗ b̂j) ρ̂ (âi† ⊗ b̂j†)] ),

      where â and b̂ are the two operators acting on the parties A and B of the system, as previously defined in the PS procedure.


    Any function fulfilling the previous conditions is also called an entanglement monotone. Property (2) ensures that any entanglement measure vanishes on separable states. Condition (3) states that the amount of entanglement of a system is invariant under a change of basis, hence entanglement measures preserve the LU equivalence of entangled states. Statement (4) ensures that the amount of entanglement does not increase under LO and CC operations.
    There are several measures of entanglement, which can be continuous or discrete. A continuous measure equals 0 for any separable state and is strictly positive for any entangled state, whereas a discrete measure is 0 for separable states and 1 for entangled states.

    2.4.1 PPT criterion

    The positive partial transpose (PPT) criterion was first introduced by Peres and Horodecki [34, 35] as a necessary and sufficient discrete measure for two-qubit and qubit-qutrit density matrices to be separable. This criterion is only necessary for higher-dimensional states.

    Given a two-qubit or qubit-qutrit state ρ, the PPT criterion states that its partial transpose ρ^{T1} with respect to the first subsystem (or, symmetrically, ρ^{T2} with respect to the second subsystem) has non-negative eigenvalues µi (or λi) if and only if ρ is separable, thus

        ∀i : µi ≥ 0 (or λi ≥ 0) ⇐⇒ ρ is separable.   (2.26)

    The two partial transposes of a two-qubit state ρ can be expressed as

        ρ^{T1}_{ij,lk} = ρ_{lj,ik},   ρ^{T2}_{ij,lk} = ρ_{ik,lj},

    where ρ_{ij,kl} are the components of the density matrix, with i, k labelling the basis of the first subsystem and j, l that of the second. Therefore, let ρ be expressed as

        ρ = [ ρ11,11  ρ11,12  ρ12,11  ρ12,12
              ρ11,21  ρ11,22  ρ12,21  ρ12,22
              ρ21,11  ρ21,12  ρ22,11  ρ22,12
              ρ21,21  ρ21,22  ρ22,21  ρ22,22 ].

    The partial transpose of ρ on the first subsystem, ρ^{T1}, is

        ρ^{T1} = [ ρ11,11  ρ11,12  ρ21,11  ρ21,12
                   ρ11,21  ρ11,22  ρ21,21  ρ21,22
                   ρ12,11  ρ12,12  ρ22,11  ρ22,12
                   ρ12,21  ρ12,22  ρ22,21  ρ22,22 ],   (2.27)


    and, by the same token, the partial transpose on the second subsystem, ρ^{T2}, is

        ρ^{T2} = [ ρ11,11  ρ11,21  ρ12,11  ρ12,21
                   ρ11,12  ρ11,22  ρ12,12  ρ12,22
                   ρ21,11  ρ21,21  ρ22,11  ρ22,21
                   ρ21,12  ρ21,22  ρ22,12  ρ22,22 ].   (2.28)

    The PPT criterion can be applied to find the exact boundary of the set of separable states for two-qubit and qubit-qutrit systems. However, for higher-dimensional cases the PPT criterion is only a necessary condition: all separable states satisfy it, but there exist entangled states that it does not detect.
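    In practice, the PPT test amounts to reshaping the density matrix, transposing one index pair and checking the spectrum; a minimal NumPy sketch (the helper names and the standard row-major index ordering are our own choices):

        import numpy as np

        def partial_transpose(rho, dims=(2, 2), subsystem=0):
            """Partial transpose of a bipartite density matrix over one subsystem."""
            dA, dB = dims
            r = rho.reshape(dA, dB, dA, dB)    # indices (i, j, k, l) of rho_{ij,kl}
            r = r.transpose(2, 1, 0, 3) if subsystem == 0 else r.transpose(0, 3, 2, 1)
            return r.reshape(dA * dB, dA * dB)

        def is_ppt(rho, dims=(2, 2), tol=1e-10):
            """True if all eigenvalues of the partial transpose are non-negative (Eq. 2.26)."""
            return bool(np.min(np.linalg.eigvalsh(partial_transpose(rho, dims))) > -tol)

        psi = np.array([0, 1, 1, 0]) / np.sqrt(2)      # Bell state |B2>, entangled
        print(is_ppt(np.outer(psi, psi.conj())))       # False
        print(is_ppt(np.eye(4) / 4))                   # True: the maximally mixed state is separable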

    2.4.2 Negativity

    Given the partial transpose ρ^{T1} with eigenvalues µi, the negativity N of the state ρ is defined as

        N(ρ) = ∑_i (|µi| − µi)/2.   (2.29)

    For two-qubit and qubit-qutrit systems, the negativity criterion works as a necessary and sufficient continuous measure,

        N(ρ) = 0 ⇐⇒ ρ is separable.

    This entanglement measure is a direct consequence of the PPT criterion. Thus, it is again only necessary for bipartite systems with a dimension larger than 6.
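    A self-contained sketch of Eq. (2.29), with the partial transpose written inline (the names are our own):

        import numpy as np

        def negativity(rho, dA=2, dB=2):
            """N(rho) = sum_i (|mu_i| - mu_i)/2 over the partial-transpose spectrum (Eq. 2.29)."""
            pt = rho.reshape(dA, dB, dA, dB).transpose(2, 1, 0, 3).reshape(dA * dB, dA * dB)
            mu = np.linalg.eigvalsh(pt)
            return float(np.sum((np.abs(mu) - mu) / 2))

        psi = np.array([0, 1, 1, 0]) / np.sqrt(2)      # Bell state |B2>
        print(negativity(np.outer(psi, psi.conj())))   # 0.5
        print(negativity(np.eye(4) / 4))               # 0.0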

    2.4.3 Concurrence

    Another continuous entanglement measure is the concurrence C [36, 37, 38], which is necessary and sufficient for two-qubit states and only necessary for larger systems. The concurrence of a pure state |ψ〉 is defined as

        C(|ψ〉) = |〈ψ|ψ̃〉| = |〈ψ|(σy ⊗ σy)|ψ∗〉|,

    where |ψ̃〉 = (σy ⊗ σy)|ψ∗〉 is the spin-flipped state of |ψ〉 and |ψ∗〉 is the complex conjugate of |ψ〉. The spin-flip transformation σy ⊗ σy can be written as

        σy ⊗ σy = [[0, −i], [i, 0]]^{⊗2} = [  0   0   0  −1
                                              0   0   1   0
                                              0   1   0   0
                                             −1   0   0   0 ].


    Given α, β, γ, δ ∈ C, a two-qubit pure state |ψ〉 and its spin-flipped state |ψ̃〉 are

        |ψ〉 = (α, β, γ, δ)^T,   |ψ̃〉 = (−δ∗, γ∗, β∗, −α∗)^T.

    The concurrence of |ψ〉 is

        C(|ψ〉) = |−α∗δ∗ + β∗γ∗ + γ∗β∗ − δ∗α∗| = 2|βγ − αδ|,

    and if αδ = βγ, then |ψ〉 is separable. Given the density matrix ρ of a two-qubit system with no more than two null eigenvalues, its concurrence is also obtained as

        C(ρ) = max{0, √λ1 − √λ2 − √λ3 − √λ4},

    where λ1 ≥ λ2 ≥ λ3 ≥ λ4 are the eigenvalues, in decreasing order, of

        ρρ̃ = ρ (σy ⊗ σy) ρ∗ (σy ⊗ σy).
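    For two-qubit density matrices, the mixed-state formula above takes only a few lines of NumPy; a minimal sketch (the helper name concurrence and the round-off guard are our own):

        import numpy as np

        sy = np.array([[0, -1j], [1j, 0]])
        YY = np.kron(sy, sy)

        def concurrence(rho):
            """C(rho) = max{0, sqrt(l1) - sqrt(l2) - sqrt(l3) - sqrt(l4)} for a two-qubit state."""
            rho_tilde = YY @ rho.conj() @ YY
            lam = np.sort(np.real(np.linalg.eigvals(rho @ rho_tilde)))[::-1]
            sq = np.sqrt(np.clip(lam, 0.0, None))   # guard against tiny negative round-off
            return max(0.0, sq[0] - sq[1] - sq[2] - sq[3])

        psi = np.array([0, 1, 1, 0]) / np.sqrt(2)          # Bell state |B2>
        print(concurrence(np.outer(psi, psi.conj())))      # 1.0 (maximally entangled)
        print(concurrence(np.eye(4) / 4))                  # 0.0 (separable)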

    2.4.4 Purity of the reduced state

    The purity of the reduced state (PRS) is a continuous measure that works as necessary and sufficient for bipartite pure states [39]. The bipartite pure state ρ defined over a Hilbert space H = H1 ⊗ H2 can be expressed as

        ρ = ∑_{ijkl} cijkl |i〉1 1〈j| ⊗ |k〉2 2〈l|,

    where cijkl ∈ R and {|i〉1} and {|k〉2} are two orthonormal bases for H1 and H2 respectively. The reduced states ρ(1) = Tr2(ρ), ρ(2) = Tr1(ρ) are the partial traces over H2 and H1 respectively,

        ρ(1) = Tr2(ρ) = ∑_{ijkl} cijkl |i〉1 1〈j| 〈k|l〉2 = ∑_{ijk} cijkk |i〉1 1〈j|,   (2.30)
        ρ(2) = Tr1(ρ) = ∑_{ijkl} cijkl 〈i|j〉1 |k〉2 2〈l| = ∑_{ikl} ciikl |k〉2 2〈l|.   (2.31)

    The PRS criterion states that

    • ρ is separable ⇐⇒ Pur(ρ(1)) = Pur(ρ(2)) = 1,

    • ρ is entangled ⇐⇒ Pur(ρ(1)),Pur(ρ(2)) < 1.

    The PRS criterion fails to detect the quantum entanglement of mixed states, whose reduced states are in general mixed even in the absence of entanglement.
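    For a pure state written through its amplitude matrix, both reduced states and their purities follow from simple matrix products; a minimal sketch (helper names are our own):

        import numpy as np

        def reduced_states(psi, dA, dB):
            """Reduced density matrices rho(1) = Tr_2(rho) and rho(2) = Tr_1(rho) of a pure state."""
            c = np.asarray(psi, dtype=complex).reshape(dA, dB)   # amplitude matrix
            return c @ c.conj().T, c.T @ c.conj()

        def purity(rho):
            return float(np.real(np.trace(rho @ rho)))

        psi_sep = np.array([1, 1, 0, 0]) / np.sqrt(2)    # |0> (x) (|0> + |1>)/sqrt(2): separable
        psi_ent = np.array([0, 1, 1, 0]) / np.sqrt(2)    # Bell state |B2>: entangled

        for psi in (psi_sep, psi_ent):
            r1, r2 = reduced_states(psi, 2, 2)
            print(purity(r1), purity(r2))                # 1.0 1.0, then 0.5 0.5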


    2.4.5 Entropy of entanglement

    The entropy of entanglement (EE) is a continuous functional defined for bipartite pure states, whose quantum entanglement can be measured through their von Neumann entropy [40, 41], defined for a given state ρ as

        S(ρ) = −Tr(ρ ln ρ),   (2.32)

    where the logarithm of the density matrix ρ is expanded as

        ln ρ = ∑_{k=1}^{∞} (−1)^{k+1} (ρ − I)^k / k.

    The EE of a bipartite pure state, E(ρ), coincides with the von Neumann entropy (2.32) of its reduced states ρ(1), ρ(2),

        E(ρ) = S(ρ(1)) = S(ρ(2)).   (2.33)

    The EE of a certain state measures how much of the information contained in the state ρ, defined over a Hilbert space H = H1 ⊗ H2, is lost if it is separated into H1 and H2. Therefore, in case ρ is separable the amount of such loss is zero, because it can be split into the reduced states defined on H1 and H2. By the same token, splitting an entangled state ρ into the two Hilbert spaces H1, H2 necessarily involves a non-zero loss.
    Generally speaking, EE is also defined for bipartitions of multipartite systems described by a pure state. Besides, when it comes to pure states many other entanglement measures relate directly to the EE criterion: in this regard, we may just mention entanglement cost, distillable entanglement, squashed entanglement and relative entropy of entanglement as examples [31].
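    Numerically, E(ρ) is most easily evaluated from the Schmidt coefficients of the pure state, which avoids the series expansion of ln ρ; a minimal sketch (the names are our own):

        import numpy as np

        def entropy_of_entanglement(psi, dA, dB):
            """E = S(rho(1)) = -sum_i p_i ln p_i, with p_i the Schmidt coefficients of |psi>."""
            s = np.linalg.svd(np.asarray(psi, dtype=complex).reshape(dA, dB), compute_uv=False)
            p = s**2
            p = p[p > 1e-12]                  # drop vanishing coefficients (0 ln 0 = 0)
            return float(-np.sum(p * np.log(p)))

        psi_sep = np.array([1, 1, 0, 0]) / np.sqrt(2)
        psi_bell = np.array([0, 1, 1, 0]) / np.sqrt(2)
        print(entropy_of_entanglement(psi_sep, 2, 2))    # 0.0
        print(entropy_of_entanglement(psi_bell, 2, 2))   # ln 2 ~ 0.693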

    2.4.6 Entanglement witnesses

    An entanglement witness Ŵ is a Hermitian operator that is generally not positive definite, but whose expectation value on any separable state is non-negative [36, 42]. Geometrically, entanglement witnesses are represented by hyperplanes tangent to the set of separable states. As illustrated in Fig. 2.4, given a witness Ŵ and a separable state ρ̂s, then

    Tr(ρ̂sŴ ) ≥ 0. (2.34)

    If the expectation value of Ŵ on a density matrix ρ̂ is negative, i.e.

    Tr(ρ̂Ŵ ) < 0, (2.35)

    then ρ̂ is entangled.


    Figure 2.4: Simplified representation of entanglement witnesses in the state space. The witness Ŵ satisfies the conditions (2.34), (2.35) with respect to the separable state ρ̂s and the entangled state ρ̂.

    An entanglement witness Ŵ is identified by a set of states σ̂ such that Tr(σ̂Ŵ) = 0 or, equivalently,

        σ̂ = ∑_i αi Ôi,   Tr(σ̂Ŵ) = 0 ⟹ Tr(Ôi Ŵ) = 0.   (2.36)

    Here, the Ôi are operators orthogonal to Ŵ. Entanglement witnesses represent experimental observables and they generally provide just a necessary but not a sufficient criterion. Moreover, entanglement witnesses do not give a general method to classify entangled states, because each of them detects a limited set of entangled states and in general there is no one-to-all correspondence between witnesses and states, as illustrated in Fig. 2.5.

    Figure 2.5: Simplified representation of the correspondence between states and witnesses in the state space. Whereas Ŵ1 witnesses ρ̂1 as an entangled state, this does not hold for ρ̂2 which, in spite of being entangled, does not satisfy the conditions (2.34), (2.35) with respect to Ŵ1.

    2.5 The entanglement classification problem

    The entanglement classification problem consists in determining whether an unknown quantum state ρ is entangled or separable. Generally, this would require testing all possible tensor products of states and checking whether they describe ρ, which is impractical in general. Hence, as we discussed in the last Section, we need some practicable approach such as entanglement measures. However, no universal measure is currently known. Moreover, several measures, such as the geometric entanglement defined in Eq. (2.24), are also hard to compute. As discussed in Section 2.3, the boundary between entangled and separable states is highly non-linear. Consequently, all known measures involve non-linear operations, which is in sharp contrast with quantum theory, which is mostly linear. Non-linear classification problems are not unique to quantum entanglement, but are very common in real-world applications such as image recognition. These have been successfully treated using numerical methods and fast computers. While several numerical approaches to quantum entanglement exist, none has so far been able to solve the problem fully. There are two main issues.

    • It has been proven that the hardness of the classification task increases when a state is located within an inverse-exponential (with respect to the dimension) distance from the boundary between the sets of entangled and separable states [43]. In this case, the complexity of the problem and the computational cost of the solving algorithm are not polynomial with respect to the number of degrees of freedom.

    • A computer has finite numerical precision, which makes it impossible to encode exactly the components of a density matrix. Therefore, it may happen that a separable state sits just in a neighbourhood of the boundary and ends up being encoded as entangled. This issue might make the problem ill-defined [44].

    In this work, we propose to use artificial neural networks to approach the quantum entanglement classification task. Theoretically, artificial neural networks are able to reproduce any non-linear continuous function, therefore we would expect that they could potentially provide a successful method to investigate and explore the boundary between the set of separable states and the set of entangled states in some low-dimensional systems.

  • Neural networks

    Artificial neural networks (ANNs) have a structure inspired by the neural connections in the human brain and have recently become popular in dealing with a wide variety of non-linear problems. According to the universal approximation theorem [12, 13], the classification boundary of any non-linear problem can be approximated with an artificial neural network to arbitrary precision. Broadly speaking, ANNs are able to make classifications after they have been trained on some data, whose features are learned, generalized and used to devise a model able to make predictions on other, unseen data. This is the typical approach of machine learning (ML).

    3.1 Machine learning

    The term "machine learning" was first introduced by A. Samuel in 1959 [45]. Machine learning was born as a branch of artificial intelligence, and the intuitive idea behind it is to make computers able to learn from experience without being explicitly programmed. We could say that a computer learns a certain task from experience when its performance in doing the task improves with the amount of experience [46]. Practically speaking, the learning process involves feeding the computer with a training dataset, which is used to learn and devise a model, so as to gain experience. Next, the performance of the model needs to be certified on other datasets, called test datasets, to produce an output, an answer to a problem, so as to carry out a task. Some examples of tasks are classification, regression, prediction and forecasting, and depending upon the task, three approaches are possible [47, 48].

    • In the supervised learning approach, the training dataset is split into features and labels. The former are used to learn a task such as classification, the latter are compared to the predictions of the model. Classification and regression are two tasks typical of supervised learning. While classification involves qualitative labels categorized into a finite number of classes, in regression problems labels are quantitative and belong to a continuous range.

    • In the unsupervised learning approach, the training dataset is composed only of features. The algorithm is then used for pattern recognition, such as clustering tasks.


    • The semi-supervised learning approach is located halfway between the previous two approaches and usually involves a small portion of labelled data and a large portion of unlabelled data in the training set.

    In our work, we will use the supervised learning approach. The aim is to classify some low-dimensional entangled states for which labels can be assigned exactly with some of the necessary and sufficient entanglement measures introduced in Section 2.4.
    A simple example of supervised classification consists in classifying animals. For instance, let us consider a training set composed of 3 classes: parrot, dog, hamster. The dataset contains N samples, each of which has 5 features: 1. does it fly?, 2. does it live in a cage?, 3. does it eat seeds?, 4. does it bark?, 5. does it run on a wheel?. A certain classifier, like an artificial neural network, is fed with this training set, used to learn a classification pattern for parrots, dogs and hamsters. After the training process has finished, we can test the classifier on new labelled samples: for example, a parrot that flies, lives in a cage, eats seeds, does not bark and does not run on a wheel. Starting from these five features, the classifier should return the label "parrot". Note that we only have the three classes dog, hamster and parrot, hence the classifier can never conclude that a sample is something else, say a car.
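    As an illustration of this workflow, a minimal sketch using a decision-tree classifier (assuming scikit-learn is available; the binary feature encoding and the three training samples are our own toy choices, and any classifier, including the neural networks discussed below, could take its place):

        from sklearn.tree import DecisionTreeClassifier

        # Features: [flies, lives in a cage, eats seeds, barks, runs on a wheel]
        X_train = [
            [1, 1, 1, 0, 0],   # parrot
            [0, 0, 0, 1, 0],   # dog
            [0, 1, 1, 0, 1],   # hamster
        ]
        y_train = ["parrot", "dog", "hamster"]

        clf = DecisionTreeClassifier().fit(X_train, y_train)

        # An unseen sample: flies, lives in a cage, eats seeds, does not bark, no wheel.
        print(clf.predict([[1, 1, 1, 0, 0]]))   # ['parrot']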

    3.2 Supervised machine learning

    In our work, we focus on ANNs within the supervised learning approach. The feature sets contain the density matrix components or the Bloch vectors, as defined in Eqs. (2.7) and (2.21) respectively. The labels are qualitative and belong to two different classes, separable (0) or entangled (1). Dealing with some low-dimensional bipartite states is essential to understand how efficiently an ANN can learn the boundary between the set of entangled states and the set of separable states with a supervised approach. Indeed, since there exist necessary and sufficient entanglement measures for such states, as we discussed in Section 2.4, the boundary is known exactly and the predictions of the neural network can be compared to exact labels.

    3.2.1 Steps of the supervised learning approach

    The first step of supervised ML is the choice of an optimal training dataset D. This set is composed of pairs of the form

    D = {(x1, y1), ..., (xN , yN )},

    where N is the size of the training set, and xi and yi are the feature vector and the label of the i-th sample respectively. We denote the feature set as X and the label set as Y. The samples of the training dataset D should be generated randomly with some probability distribution; next, each xi is labelled with a labelling function f̃ : X → Y [48]. In our case, each xi ∈ X contains the density matrix components or the Bloch vector of the i-th state, its label yi ∈ Y is either separable (0) or entangled (1), and the labelling function is an entanglement measure. The characteristics of the optimal training set may also depend on the model. For example, artificial neural networks have been found to be efficient using training samples equally partitioned between classes and, in some cases, they do not require a large set size N [49].
    The second step consists in choosing the model, which in our case is an artificial neural network. This is equivalent to selecting a hypothesis function h : X → Y from the space of all possible hypothesis functions. Let P be a probability distribution defined on X × Y, used to sample the training set D. The true error (or risk) of the hypothesis function h is defined as

        L(h) = P({(xi, yi) ∈ X × Y : h(xi) ≠ yi}).

    The model does not have access to the probability distribution P but only to the training dataset D, therefore what is computed is the empirical error LD(h), which is

        LD(h) = |{i ∈ [1, N] : h(xi) ≠ yi}| / N.   (3.1)

    Here, the numerator is the cardinality of the set containing the misclassified samples. The model selection is usually made upon empirical risk minimization, so the optimal model should minimize both the empirical risk LD(h) and the true risk L(h). This condition is achieved when the training set D is ε/2-representative [48], namely

        ∀h, ∃ε > 0 : |LD(h) − L(h)| ≤ ε/2.

    The hypothesis function h depends on some parameters, which are statistical weights and biases in the case of ANNs. The weights of the optimal hypothesis function minimize the empirical error, which is measured with a loss function L : h(X) × Y → [0, +∞[. In other words, the optimal model has weights and biases that minimize the loss function L(h(xi), yi) over the points of the training set. An optimal model is found by applying an iterative optimization algorithm that updates weights and biases step by step, until the optimum is reached. The accuracy Ã of the model trained on a dataset with N samples is

        Ã = 1 − (1/N) ∑_{i=1}^{N} L(h(xi), yi).   (3.2)

    The obtained optimal model thus also depends on the chosen loss function. In regression problems, for instance, a common loss function is the mean squared error LMSE [48], defined as

    LMSE =1

    N

    N∑i=1

    |yi − h(xi)|2.


When it comes to binary classification, such as the problem we are interested in, a typical choice is the binary cross entropy loss function LBCE [50]. This loss function expresses the difference between two discrete probability distributions p and q. Here, p and q correspond to the probability distributions of the dataset labels and of the predicted labels, respectively. The probability distributions of the set labels equal to 1 and 0 are denoted as p(y=1) and p(y=0). On the other hand, the probability distributions of the labels predicted as y = 1 and y = 0 are q(y=1) and q(y=0) respectively. The classification rule is that the predicted label is 1 if q(y=1) > q(y=0) and 0 otherwise. The binary cross entropy loss function calculated on N samples is

LBCE = −(1/N) ∑_{i=1}^{N} [ pi(y=1) log qi(y=1) + pi(y=0) log qi(y=0) ]. (3.3)

Let us now make a fictitious example to compare LMSE and LBCE. Three samples have labels equal to 0, 1, 0 and the labels predicted by a model are 0, 1, 1. Their probability distributions are shown in Table 3.1.

                 p(y=1)   p(y=0)   q(y=1)   q(y=0)
First sample       0%      100%      40%      60%
Second sample    100%        0%      70%      30%
Third sample       0%      100%      60%      40%

Table 3.1: Fictitious example of probability distributions for three samples. Here, p(y=1) and p(y=0) are the probability distributions of the labels 1 and 0 from the dataset, which can be either 100% or 0%. On the other hand, q(y=1) and q(y=0) are the probability distributions underlying the labels predicted respectively as 1 and 0 by the model.

The MSE and the BCE loss functions are then

LMSE = (1/3) (|0 − 0|² + |1 − 1|² + |0 − 1|²) = 0.33,

LBCE = −(1/3) [log(0.6) + log(0.7) + log(0.4)] = 0.26,

where the logarithms are taken in base 10.

    The respective accuracies calculated according to Eq. (3.2) are

    ÃMSE = 1− 0.33 = 0.67,

    ÃBCE = 1− 0.26 = 0.74.

The former is smaller than the latter. In classification problems like this, the BCE loss function is preferred over the MSE because, while LBCE is sensitive to the difference between the probability distributions p and q, the loss function LMSE is not. For example, let us suppose that just for the third point we have q3(y=1) = 85%, q3(y=0) = 15%. Then, we would obtain

    LBCE = 0.4, ÃBCE = 0.6,

    LMSE = 0.33, ÃMSE = 0.67.

Even though the result for LBCE has changed, the MSE loss function would yield the same result as before.
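The toy comparison above can be reproduced with a short Python sketch. Following the numbers in the text, the logarithm is taken in base 10; the arrays below simply encode Table 3.1 and are not part of the thesis code.

```python
import numpy as np

# True labels p(y=1) and predicted probabilities q(y=1) for the three samples of Table 3.1.
p_y1 = np.array([0.0, 1.0, 0.0])      # p(y=1): 0%, 100%, 0%
q_y1 = np.array([0.40, 0.70, 0.60])   # q(y=1): 40%, 70%, 60%
predicted_labels = (q_y1 > 0.5).astype(float)   # label 1 when q(y=1) > q(y=0)

# Mean squared error between true and predicted labels.
mse = np.mean((p_y1 - predicted_labels) ** 2)

# Binary cross entropy, Eq. (3.3), with base-10 logarithms as in the text.
bce = -np.mean(p_y1 * np.log10(q_y1) + (1 - p_y1) * np.log10(1 - q_y1))

print(round(mse, 2), round(bce, 2))   # 0.33 0.26
```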

Let us now turn to the last step of the supervised ML approach. Once the empirical error as defined in Eq. (3.1) has been minimized, the accuracy of the model must be gauged on a test set. This is done by quantifying the generalization error LD,f̃(h) related to the test set and defined as

LD,f̃(h) = P{xi : h(xi) ≠ f̃(xi)}. (3.4)

The generalization error is expressed in terms of the probability P that the hypothesis function h differs from the correct labelling function f̃ for a certain input xi. Therefore, the goal of the supervised approach is to find a hypothesis function h that reproduces the labelling function f̃. In general, this is achieved only up to a certain variability, h = f̃ ± ∆f̃, where the variance ∆f̃ is not necessarily negligible. Thus, in order to reproduce the labelling function f̃ as accurately as possible, we wish to fulfill the condition ∆f̃ ≈ 0. This can be achieved by using more than one single pair of training and test datasets, according to the cross validation technique. Such a strategy can be applied in different ways. A first example of cross validation involves using more than one pair of training and test sets; the accuracy of the model is then computed as the arithmetic average of the accuracies obtained for each pair of sets. Another kind of cross validation is the k-fold cross validation. This algorithm consists of randomly partitioning the same dataset into k different folds in order to train the model on k − 1 of them and test it on the remaining fold, which is then called the validation dataset. As illustrated in Fig. 3.1, this procedure is iterated k times and the final accuracy Ã is expressed as the mean value of the accuracies Ãi calculated at each iteration, i.e.

Ã = (1/k) ∑_{i=1}^{k} Ãi.
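A compact sketch of k-fold cross validation using the KFold splitter of scikit-learn is given below; the classifier and the synthetic data are placeholders chosen only to make the snippet self-contained, not the networks or states used later in this thesis.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Perceptron

X = np.random.rand(100, 4)               # hypothetical feature vectors
y = (X.sum(axis=1) > 2.0).astype(int)     # hypothetical binary labels

accuracies = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(X):
    model = Perceptron()
    model.fit(X[train_idx], y[train_idx])                      # train on k-1 folds
    accuracies.append(model.score(X[val_idx], y[val_idx]))     # test on the remaining fold

print(np.mean(accuracies))   # final accuracy: average over the k iterations
```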

Moreover, a comparison between the empirical error and the generalization error of Eqs. (3.1) and (3.4) respectively can be useful to evaluate the generalization power of the model. Typically, when the empirical error is smaller than the generalization error, the model has been overtrained, so its predictions correspond too closely to the training data: fluctuations and variations of the training set have been learned as well. Such a model might fail to fit other datasets accurately, a situation known as overfitting. In order to overcome overfitting, a possible strategy is to start the training process over, tuning different hyperparameters or modifying the optimization process addressed to the empirical risk minimization.

Figure 3.1: Sketch of k-fold cross validation. A dataset is split into k folds, so the model is trained on k − 1 of them and tested on the remaining fold, the validation set. This procedure is iterated k times and the final accuracy is calculated as the average of the accuracies Ãi for i = 1, ..., k.

    3.3 Artificial neural networks

Artificial neural networks are amongst the most common ML algorithms used for classification. They are flexible, powerful, easy to use and have proven successful in solving non-linear problems [51, 52]. The first computational model was designed in 1943 by McCulloch and Pitts [53]. Later, Rosenblatt proposed the perceptron, the fundamental unit of neural networks [54]. Perceptrons could be applied successfully to solve linear problems but were shown to fail for non-linear tasks. In this regard, the first more complex models of artificial neural networks were published in the 1960s [55]. However, a limit to the practical implementation of complex and large neural structures was the lack of powerful computers. These limitations were largely overcome in the 1980s with the advent of parallel computing, and since then further developments in this field have been achieved. A neural network connects units called neurons. Each connection is associated to statistical weights learned in the training process.

In our work we focus on feed-forward neural networks, in which neurons are organized in three kinds of layers, as illustrated in Fig. 3.2. The neurons in the input layer correspond to the features set of the data. The hidden layers describe a series of non-linear transformations. Finally, in classification tasks as ours, the output layer produces the predicted labels. The adjective "feed-forward" means that the information can only travel in one direction, from the input to the output layer passing through the hidden layers.

Figure 3.2: Sketch of a feed-forward neural network, split into input, hidden and output layers. The size of the input layer equals the size of the features vector, whereas the size of the hidden layer is determined heuristically. A possible rule of thumb is that the number of hidden neurons should range between 1 and the size of the input layer [56].

    3.3.1 The perceptron

The perceptron is the fundamental unit of ANNs. Its structure is composed of only one input and one output layer, as shown in Fig. 3.3. First, the inputs of the perceptron are the n features of xi, the ith feature vector of the dataset. Each input xi,j is connected to the output neuron with weight wj. The output neuron then computes the sum si = ∑j wjxi,j + b, where b is a bias variable. The role of the bias b is shifting the decision boundary between the classes, as shown in Fig. 3.4. Subsequently, an activation function f is applied to si. The output f(si) then constitutes a discrete or a continuous label. Given an input xi, the output ŷi of the perceptron is expressed as

ŷi = f( ∑j wjxi,j + b ). (3.5)

Figure 3.3: Perceptron with 3 input neurons. Once the weighted sum of the inputs xi,j and weights wj has been calculated, the quantity ∑j wjxi,j + b is used as the argument of an activation function f. Depending upon the value of the activation function, the predicted label ŷi belongs to the class 0 or 1.

    A possible example of activation function f is the step function Θ:

Θ( ∑j wjxi,j + b ) = 1 if ∑j wjxi,j + b ≥ 0, and 0 otherwise. (3.6)

In this case, when the step function is equal to 1, the output neuron is activated and the predicted label is ŷ = 1; otherwise, when the step function is 0, we have ŷ = 0. More commonly, one could use a (semi-)continuous function such as the rectified linear unit, relu, defined as

relu( ∑j wjxi,j + b ) = ∑j wjxi,j + b if ∑j wjxi,j + b ≥ 0, and 0 otherwise. (3.7)

Figure 3.4: Shifting of a linear boundary (violet line) between the two classes 0 and 1 with the bias b. The bias b does not alter the direction of the linear boundary, which is determined by the weights.

Other examples of activation functions are the hyperbolic tangent and the sigmoid function σ, defined respectively as

tanh( ∑j wjxi,j + b ) = [ e^(∑j wjxi,j + b) − e^−(∑j wjxi,j + b) ] / [ e^(∑j wjxi,j + b) + e^−(∑j wjxi,j + b) ], (3.8)

σ( ∑j wjxi,j + b ) = 1 / [ 1 + e^−(∑j wjxi,j + b) ]. (3.9)

The activation functions (3.7)-(3.9) are usually preferred over the step function (3.6) since they perform better during network training. Their continuous output can also be cast into a discrete label by applying a Θ function a posteriori. The four activation functions above are illustrated in Fig. 3.5.
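For reference, the four activation functions of Eqs. (3.6)-(3.9) can be written as a short NumPy sketch, where s stands for the weighted sum ∑j wjxi,j + b:

```python
import numpy as np

def step(s):
    return np.where(s >= 0, 1.0, 0.0)      # Eq. (3.6)

def relu(s):
    return np.where(s >= 0, s, 0.0)        # Eq. (3.7)

def tanh(s):
    return np.tanh(s)                       # Eq. (3.8)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))         # Eq. (3.9)
```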

Given a training set D = {(x1, y1), ..., (xN, yN)}, let n be the dimension both of each feature vector xi and of the weights vector w. We now describe the training procedure of the perceptron, in which the weights vector w and the bias b are updated iteratively.

    1. Initialize w, b at w0, b0.

2. For each sample (xi, yi) ∈ D, first compute the prediction ŷi and then update the weights and the bias to w1, b1 according to the rule

    w1 = w0 + η(yi − ŷi)xi, (3.10)


Figure 3.5: Four possible activation functions for the output layer of the perceptron: (A) step function, (B) relu function, (C) tanh function, (D) sigmoid function.

b1 = b0 + η(yi − ŷi). (3.11)

Here η is the learning rate, or step size, a factor related to the speed of the learning process.

3. Repeat the previous step until the empirical error defined in Eq. (3.1) no longer improves by more than an iteration error threshold δ > 0.

The perceptron is a linear classifier, hence its performance is optimal when the training set is linearly separable, i.e. when the boundary between the classes is a hyperplane in the n-dimensional feature space. Under this condition the perceptron converges: after a finite number of iterations the weight updates of Eq. (3.10) no longer improve by more than δ > 0. If the training set is linearly separable with a hyperplane of margin γ > 0, there exists a weights vector w with ||w|| = 1 and a bias b such that, for all i = 1, ..., N,

yi = 1 : w · xi + b > γ,

yi = 0 : w · xi + b < −γ.

Given R = ||xi||∞ = max_{j=1,...,n} |xi,j|, the perceptron converges after a number of iterations of the order of O(R²/γ²) [57].
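The training procedure above translates almost line by line into code. Below is a minimal NumPy sketch of the perceptron with a step activation and the update rule of Eqs. (3.10)-(3.11); the training data are hypothetical and the stopping rule is simplified to a fixed number of passes over the dataset.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=20):
    """Perceptron with step activation, trained with the update rule (3.10)-(3.11)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)          # 1. initialize weights and bias
    b = 0.0
    for _ in range(epochs):           # 3. repeat (here: for a fixed number of epochs)
        for xi, yi in zip(X, y):      # 2. loop over the training samples
            y_hat = 1.0 if np.dot(w, xi) + b >= 0 else 0.0
            w += eta * (yi - y_hat) * xi    # Eq. (3.10)
            b += eta * (yi - y_hat)         # Eq. (3.11)
    return w, b

# Hypothetical linearly separable data: label 1 when x1 + x2 > 1.
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.3, 0.9], [0.2, 0.1]])
y = np.array([0, 1, 1, 0])
w, b = train_perceptron(X, y)
```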


    3.3.2 Feed-forward neural networks

Feed-forward neural networks are also known as multilayer perceptrons, since they concatenate many perceptrons without cycles into input, hidden and output layers [58], as shown in Fig. 3.2. Whereas perceptrons can only predict linear boundaries, feed-forward neural networks are able to approximate any continuous function, according to the universal approximation theorem [12, 13].

Universal approximation theorem. Let ϕ : R → R be a nonconstant, bounded and continuous function and let X ⊆ Rⁿ be compact. Given C(X), the space of real-valued continuous functions defined on X, ε > 0 and F ∈ C(X), there exist an integer M, constants aj, bj ∈ R and vectors wj ∈ Rⁿ such that the function

f(x) = ∑_{j=1}^{M} aj ϕ(wj · x + bj)

is an approximation of the function F, in the sense that

∀x ∈ X : |f(x) − F(x)| < ε.

In this statement of the theorem, ϕ is the activation function, f represents the action of a neural network with a single hidden layer and F models the actual boundary of the problem. Thus, the universal approximation theorem ensures theoretically that feed-forward neural networks with one hidden layer can approximate any boundary in a high-dimensional space. For example, feed-forward neural networks have been shown to reproduce complex operations such as the diagonalization and the eigenvalue calculation of Hermitian matrices as well [59].

Let us consider a feed-forward neural network with one hidden layer and one output neuron, as the one shown in Fig. 3.6. Let n and n1 be the numbers of input neurons and of hidden neurons respectively. The weights of the connections towards the hidden layer and towards the output layer are organized into two matrices W(1), W(2) respectively. The biases of the hidden layer and of the output neuron can be cast into two arrays b(1), b(2). Let f1 and f2 be the activation functions of the hidden neurons and of the output neuron respectively. The predicted label ŷi of an input xi is

ŷi = f2( ∑_{k1=1}^{n1} W(2)k11 f1( ∑_{j=1}^{n} W(1)k1j xi,j + b(1)k1 ) + b(2)1 ). (3.12)
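Equation (3.12) corresponds to the following forward pass. This is a sketch with random placeholder weights and sigmoid activations for both f1 and f2; the shapes and names are only illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

n, n1 = 4, 3                       # input size and hidden-layer size
W1 = np.random.randn(n1, n)        # weights towards the hidden layer, W(1)
b1 = np.random.randn(n1)           # hidden-layer biases, b(1)
W2 = np.random.randn(1, n1)        # weights towards the output neuron, W(2)
b2 = np.random.randn(1)            # output bias, b(2)

x = np.random.rand(n)              # hypothetical input feature vector
hidden = sigmoid(W1 @ x + b1)      # f1 applied to the hidden pre-activations
y_hat = sigmoid(W2 @ hidden + b2)  # Eq. (3.12): predicted label (before thresholding)
```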

Even though neural networks with one hidden layer are universal approximators, the more highly non-linear the problem, the exponentially larger


the size of the hidden layer must be, which comes with a high computational cost.

Figure 3.6: Example of a feed-forward neural network with one hidden layer. The neural network takes a two-dimensional input xi. The connections towards the hidden and the output layer are associated with two weights matrices W(1) and W(2) and with two activation functions f1 and f2, respectively. The statistical biases of the hidden neurons and of the output neuron are cast into b(1) and b(2) respectively.

Practically speaking, the architecture of a feed-forward neural network can be optimized into more than one single hidden layer, such as the one shown in Fig. 3.7. In neural networks with m > 1 hidden layers and one output neuron, the expression of the output ŷi of a given input xi can be generalized from Eq. (3.12). Let n be the number of input neurons and let n1, ..., nm be the sizes of the m hidden layers. Here we have m weights matrices W(1), ..., W(m), m bias arrays b(1), ..., b(m) and m activation functions f1, ..., fm. The predicted label ŷi of an input xi is

ŷi = fm( ∑_{km=1}^{nm} W(m)km1 fm−1( ... f2( ∑_{k1=1}^{n1} W(2)k2k1 f1( ∑_{j=1}^{n} W(1)k1j xi,j + b(1)k1 ) + b(2)k2 ) ... ) + b(m)1 ). (3.13)
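In code, the nested sums of Eq. (3.13) simply become a loop over the layers. The sketch below uses random placeholder weights and hypothetical layer sizes.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass through several layers: Eq. (3.13) written as a loop."""
    a = x
    for W, b, f in zip(weights, biases, activations):
        a = f(W @ a + b)
    return a

# Hypothetical network with layer sizes 4 -> 5 -> 3 -> 1.
sizes = [4, 5, 3, 1]
weights = [np.random.randn(m, n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(m) for m in sizes[1:]]
activations = [np.tanh, np.tanh, lambda s: 1.0 / (1.0 + np.exp(-s))]
y_hat = forward(np.random.rand(4), weights, biases, activations)
```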


Figure 3.7: Illustration of a feed-forward neural network with m hidden layers. So far, we have dealt with feed-forward neural networks with only one output neuron. More generally, there exist also models with different architectures in which the size of the output layer is equal to the number of classes.

    Backpropagation and gradient descent

The learning process of a feed-forward neural network is based upon the so-called backpropagation, an optimization approach first introduced in 1986 [60]. Backpropagation is iterative and recursive, and aims at tuning the weights such that a given loss function is progressively minimized. First, all the weights of the network are initialized randomly. An input xi undergoes non-linear transformations throughout the hidden layers until the output ŷi is calculated as in Eq. (3.13). For each input xi, the difference between the expected output yi and the predicted output ŷi is computed with a loss function. When the loss function has been calculated once for all N samples of the training set, one epoch has been completed. Next, the error expressed by the loss function is backpropagated to all hidden neurons. Backpropagation can be carried out with the gradient descent method, a first-order algorithm used to find the minimum of the loss function.

The intuition behind gradient descent is that a given function F, differentiable in the neighbourhood of a point x(0), decreases fastest moving from x(0) along the direction −∇F. Moving by a quantity α > 0 from x(0) along −∇F, we end up in a point

    x(1) = x(0) − α∇Fx=x(0) , (3.14)

  • 3. Neural networks 40

where ∇Fx=x(0) is the gradient of F calculated at x = x(0). Correspondingly, we have F(x(0)) ≥ F(x(1)). Iterating this procedure, we find a series of points x(2), x(3), x(4), ... such that the decreasing succession

    F (x(0)) ≥ F (x(1)) ≥ F (x(2)) ≥ F (x(3)) ≥ F (x(4))...

is created. It is this succession that represents the descent of the gradient ∇F, as illustrated in Fig. 3.8.

Figure 3.8: Illustration of the gradient descent of a function F towards the minimum. Here, the coloured circular surfaces describe different level surfaces of the function F, such that the darker the level surface, the closer to the minimum. The descent of the gradient over the level surfaces is represented with red arrows. The five points x(0), x(1), x(2), x(3), x(4) define a decreasing succession of the function F, which will eventually end up in the minimum.

Let us consider a feed-forward neural network composed of m hidden layers and with a loss function L(~W, ~b), where ~W = (W(1), ..., W(m)) and ~b = (b(1), ..., b(m)). The optimization of L(~W, ~b) can be done with the gradient descent method. The learning rate is η and the procedure can be split into three main steps.

    1. Initialize ~W , ~b at ~W0, ~b0.

    2. Update the weights and the bias to ~W1, ~b1 according to the rule

~W1 = ~W0 − η(∇WL)~W=~W0 , (3.15)

~b1 = ~b0 − η(∇bL)~b=~b0 , (3.16)


where (∇WL)~W=~W0 is the gradient of the loss function with respect to ~W calculated at ~W = ~W0, and (∇bL)~b=~b0 is the analogous gradient with respect to the biases.

3. Repeat the previous step for a number of times r such that the final weights and biases are

~Wr = ~Wr−1 − η(∇WL)~W=~Wr−1 , (3.17)

~br = ~br−1 − η(∇bL)~b=~br−1 . (3.18)

Here, (∇WL)~W=~Wr−1 is the gradient of the loss function with respect to ~W calculated at ~W = ~Wr−1. The number of iterations r is determined by the convergence condition ||∇WL|| < δ, where δ > 0 is the iteration error threshold.

In this procedure, the gradient of the loss function ∇WL in Eqs. (3.17), (3.18) can be computed against a generic number of inputs between 1 and N. Such a number is called the batch size.
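A schematic NumPy implementation of the gradient-descent loop described above is given below, for a single parameter vector collecting weights and biases. The gradient function, learning rate, threshold and momentum value are placeholders; the optional term p(~Wr−1 − ~Wr−2) of Eqs. (3.19), (3.20) discussed further on is included for completeness.

```python
import numpy as np

def gradient_descent(grad_L, W0, eta=0.01, delta=1e-6, max_iter=10_000, p=0.0):
    """Minimize a loss via gradient descent; p > 0 adds the momentum term of Eqs. (3.19)-(3.20)."""
    W_prev, W = W0.copy(), W0.copy()          # 1. initialize the parameters
    for _ in range(max_iter):
        g = grad_L(W)                          # gradient of the loss at the current parameters
        W_next = W - eta * g + p * (W - W_prev)   # 2. update step (optional momentum)
        W_prev, W = W, W_next
        if np.linalg.norm(g) < delta:          # 3. stop when the gradient falls below delta
            break
    return W

# Hypothetical quadratic loss L(W) = ||W||^2 / 2, whose gradient is W itself.
W_min = gradient_descent(lambda W: W, W0=np.array([1.0, -2.0]), eta=0.1)
```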

Generally, it is not guaranteed that convergence is reached at a global minimum of the loss function. If the loss function is convex, the gradient descent algorithm converges to a global minimum almost surely, i.e. with probability equal to 100%. In case the loss function is not convex, the convergence happens almost surely at a local minimum [61, 62]. A valid tool to check whether the gradient ∇WL improves or not beyond the threshold δ > 0 is to plot the loss function and the classification accuracy as defined in Eq. (3.2) as functions of the number of epochs.

In terms of convergence of the gradient descent, the learning rate η plays a crucial role because it controls the speed and rate of convergence. If the learning rate η is large, the speed of convergence is high but the global minimum may be missed and the algorithm might end up in another point. On the other hand, a small learning rate η makes the convergence slow and may require too many epochs to reach convergence. There is no general rule to determine the optimal learning rate, which should be evaluated case by case and whose value generally ranges between 10−6 and 1 [63]. A strategy to improve the convergence of gradient descent is adding a momentum term to Eqs. (3.17), (3.18) [60], which become

~Wr = ~Wr−1 − η(∇WL)~W=~Wr−1 + p(~Wr−1 − ~Wr−2), (3.19)

~br = ~br−1 − η(∇bL)~b=~br−1 + p(~br−1 − ~br−2). (3.20)

The momentum term speeds up the convergence especially when the loss function meets a ravine, a deep narrow valley with steep sides as the one sketched in Fig. 3.9. In this case, the ravine can make the descent towards the minimum slow, with oscillations on the ridges of the ravine itself. However, at each ridge the weight components perpendicular to the gradient direction are equal and opposite, and these transversal oscillations can be averaged out by the momentum term p(~Wr−1 − ~Wr−2) of Eqs. (3.19), (3.20), which sums up only the components parallel to the gradient direction −∇WL.

Figure 3.9: Comparison between the gradient descent effective path (red) and the ideal path (green) towards the minimum, represented inside a concave ravine (blue curves). In the four points A, B, C, D the effective path is decomposed along the longitudinal direction w1 (light blue arrows) and the transversal direction w2 (light blue dashed arrows). Taking the two pairs of points A, B and C, D, the gradient descent effective path has equal and parallel components along w1 and equal and opposite components along w2. The momentum term in Eqs. (3.19), (3.20) cancels the opposite transversal contributions and accounts only for the components along w1, which brings the effective path as close as possible to the ideal path.

In order to deal efficiently with large datasets in a high-dimensional feature space, other first-order optimization algorithms, like Adam, have been proposed [64]. The name "Adam" stands for "adaptive moment estimation". It is an iterative algorithm inspired by gradient descent, which includes two moments for which adaptive learning rates are computed. The Adam algorithm involves a learning rate η, an iteration error threshold δ > 0, two decay rates β1, β2 ∈ [0, 1[ and an infinitesimal number ε > 0. Using the same notation and conventions introduced for the gradient descent algorithm, we can summarize the optimization process of Adam in three steps.

1. Initialize ~W, ~b, m, v at ~W0, ~b0, m0, v0. Here, m and v are the first and the second moment.

2. Putting gr = (∇WL)~W=~Wr−1, with r ≥ 1 the iteration number, update the two moments as

mr = [ β1mr−1 + (1 − β1)gr ] / (1 − β1^r),

vr = [ β2vr−1 + (1 − β2)(gr)² ] / (1 − β2^r).

    3. Update the weights and the bias at ~Wr, ~br as

    ~Wr = ~Wr−1 − ηmr√vr + �

    , (3.21)

    ~br = ~br−1 − ηmr√vr + �

    . (3.22)

The number of iterations r is determined by the convergence condition ||∇WL|| < δ, where δ > 0 is the iteration error threshold.

Additionally, in the second step of the algorithm it is possible to update the learning rate. Therefore η at step r is updated as

ηr = η √(1 − β2^r) / (1 − β1^r).

In this case, the learning rate η in Eqs. (3.21), (3.22) is replaced with ηr. Among its advantages, Adam usually requires little memory, is computationally efficient and its parameters require little tuning with respect to the standard values: η = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^−8.
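The Adam update can be sketched in a few lines of NumPy, following the standard formulation of [64] in which the two moments are accumulated first and then bias-corrected; the loss gradient below is a placeholder and the parameters are set to the standard values quoted above.

```python
import numpy as np

def adam(grad_L, W0, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam optimizer: adaptive moment estimation for a parameter vector W."""
    W = W0.copy()
    m = np.zeros_like(W)         # first moment (running mean of the gradients)
    v = np.zeros_like(W)         # second moment (running mean of the squared gradients)
    for r in range(1, steps + 1):
        g = grad_L(W)
        m = beta1 * m + (1 - beta1) * g        # update the first moment
        v = beta2 * v + (1 - beta2) * g**2     # update the second moment
        m_hat = m / (1 - beta1**r)             # bias corrections
        v_hat = v / (1 - beta2**r)
        W = W - eta * m_hat / (np.sqrt(v_hat) + eps)   # update of Eq. (3.21)
    return W

# Hypothetical quadratic loss with gradient W.
W_final = adam(lambda W: W, W0=np.array([1.0, -2.0]))
```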

    3.3.3 Convolutional neural networks

Convolutional neural networks are commonly used for image classification [65, 66]. They were popularized in 2012 by Krizhevsky [67], who won the ImageNet classification contest of that year showing that convolutional neural networks can improve on the performance of feed-forward neural networks for such a task. An example of a convolutional neural network is shown in Fig. 3.10, where the input is an image split into 16 pixels. Each pixel is described by a number, therefore the image is represented as a tensor of numbers. Pixels in the input are convoluted with a filter, or kernel, which applies linear transformations, such as summations and multiplications, to its entries. The results of these operations are cast into a successive layer, represented by a matrix. Besides, filters can also perform maximum pooling, an operation in which the maximum over the filter window is mapped to an entry of another layer, which again consists in a matrix. The output of the convolutional neural network is flattened and used as the input of a feed-forward neural network, which produces the predicted label corresponding to the input image.


Figure 3.10: Example of a convolutional neural network. The input is an image of 16 pixels, represented as a 4x4 tensor of numbers. The input is passed through a convolutional filter of size 2x2 (red dashed square), which sums up its entries and maps them into a 3x3 matrix. Next, a max pooling filter of size 3x1 (light blue dashed square) selects the maximum of each 3x1 block and maps it into a 1x3 tensor. Its three components are the output of the convolutional neural network, which is sent to a feed-forward neural network composed of one hidden layer. (In the figure, the 4x4 input is [[4, 1, 0, 1], [3, 2, 10, 0], [2, 5, 8, 3], [4, 0, 2, 6]], the convolved 3x3 matrix is [[10, 13, 11], [12, 25, 21], [11, 15, 19]] and the pooled 1x3 output is [12, 25, 21].)

    3.3.4 Some disadvantages of neural networks

An important disadvantage of neural networks is that they are computationally expensive, especially when the problem is highly non-linear. For example, let us take a feed-forward neural network with 30 inputs, two hidden layers both composed of 30 neurons and one output neuron. This architecture has two 30x30 weights matrices, one 30-dimensional weights vector, two 30-dimensional bias vectors and one output bias. Therefore, using this architecture could be expensive, and this may require developing additional algorithms to reduce its computational cost [68]. Moreover, according to the universal approximation theorem introduced in Section 3.3.2, there exists at least one neural network f approximating the actual boundary F of the problem, but such an f is generally not unique. Practically speaking, finding an approximation at arbitrary precision may be time-consuming because it might imply a long process of hyperparameter tuning.


Another drawback of ANNs is related to their being "black boxes". A black box is a device whose input and output are immediately accessible but whose internal working principles are not known. Actually, an analysis of the internal working of neural networks can be provided [69, 70], but it generally requires additional algorithms created ad hoc, which depend upon the particular nature of the problem treated.

4 A study of low-dimensional entangled states with neural networks

In Section 3.3.2, we have introduced artificial neural networks as universal approximators of any continuous function. The question we deal with in this Chapter is whether and how well ANNs can approximate the boundary between the set of entangled states and the set of separable states. Our method consists in applying neural networks within the supervised approach, for which a benchmark to assess their predictions is needed. Hence, we focus on some low-dimensional bipartite systems for which entanglement measures can label entangled and separable states exactly. Specifically, we deal with bipartite systems of qubits, qutrits (i.e., three-level parties) and ququarts (i.e., four-level parties), analyzing four different kinds of states:

    1. Werner states ρW , as defined in Eq. (2.15).

2. Rotated Werner states ρrW, obtained by applying random unitary transformations U of the form of Eq. (2.14) to Werner states ρW, so that

    ρrW = U†ρWU. (4.1)

3. Pure states ρp, which we represented not as wavefunctions but as density matrices.

    4. Mixed states ρm, according to the definition of Eq. (2.7).

Each Werner state ρW as defined in Eq. (2.15) was labelled as entangled if psym < 1/2 and separable if psym ≥ 1/2. Its rotated ρrW is LU-equivalent, thus it has the same label, entangled (1) or separable (0). Pure states were labelled with the purity of the reduced state (PRS) criterion introduced in Section 2.4.4, which works as necessary and sufficient for any bipartite pure state. This entanglement measure is continuous. Thus, a pure state ρp was labelled as separable (0) only when the purity of both its redu