Arezzo, 19-21 January 2006. International seminar "Digital Philology and Medieval Texts". Roberto Rosselli Del Turco, Dipartimento di Scienze del Linguaggio, Università di Torino. The digitization of Germanic literary texts: problems and proposals

Arezzo, 19-21 January 2006

International seminar

Digital Philology and Medieval Texts

Roberto Rosselli Del Turco

Dipartimento di Scienze del Linguaggio

Università di Torino

The digitization of Germanic literary texts:

problems and proposals

2

Presentation Outline

• Introduction

• Character encoding

• Metrical markup

• Conclusion

The Digital Vercelli Book Project:

http://islp.di.unipi.it/bifrost/vbd/

3

Introduction

• Digital editions require “digital objects”

• Image digitizing and processing relies on reliable and mature techniques/tools

• Text encoding can be a very time-consuming and difficult process

• Literary texts belonging to the Old Germanic tradition present specific problems

• Problems range from character encoding (transcription level) to meter encoding (edition level)

4

Character Encoding

What does “text encoding” mean?

What are characters for a computer?

What does “character encoding” mean?

“code” really means “number”

A = 65 (dec.) or 41 (hex.) or 1000001 (bin.)

The first encoding standards: ASCII (7 bit, with 8-bit extensions), EBCDIC

The ISO ASCII-based standards: ISO 8859-1 etc.

more characters but interchange problems
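The points on this slide can be tried out directly. The following is a minimal sketch using only Python's built-in codecs; the helper name `encodable` and the choice of sample letters are assumptions made for illustration, not part of the original talk.

```python
# A character "code" is just a number; an encoding decides which numbers exist.
code = ord("A")
decimal, hexadecimal, binary = code, format(code, "x"), format(code, "b")
print(decimal, hexadecimal, binary)


def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


thorn_ascii = encodable("þ", "ascii")        # 7-bit ASCII has no thorn
thorn_latin1 = encodable("þ", "iso-8859-1")  # ISO 8859-1 adds it
wynn_latin1 = encodable("ƿ", "iso-8859-1")   # wynn is still missing

# The interchange problem: one byte, two readings, depending on the standard.
same_byte = bytes([0xFE])
as_latin1 = same_byte.decode("iso-8859-1")
as_latin2 = same_byte.decode("iso-8859-2")
print(repr(as_latin1), repr(as_latin2))
```

The last two lines show why the 8-bit ISO standards cause interchange problems: a file is only readable if sender and receiver agree on which of the ISO 8859 parts was used.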

5

Old English Characters

Ancient writing systems present very specific problems

For instance, scribes writing in Old English modified the Latin alphabet to reflect OE phonological features:

modified letters (æ œ ð)

new letters (þ ƿ)

unused letters (g v), with ʒ and f written in their place

Significant variation related to different times, places (scriptoria), scribal habits, and scripts

6

Problems in OE character visualization

• ASCII and ISO 8859-* miss a good number of important characters

• From an HTML page of the DOE corpus:

• The corresponding source code:

<img src="T04290_files/etail-uppercase.gif" align="top" border="0">fne swa he cwæde: Micel is gefea

7

The Unicode Standard

The Unicode site: http://www.unicode.org/

A “universal character encoding standard used for representation of text for computer processing”

Fully compatible and synchronized with the corresponding versions of International Standard ISO/IEC 10646

Latest major revision 4.1, 5.0 in beta

97720 different characters, room for many more (about 1 million)

Universal, efficient, unambiguous

Characters – glyphs distinction
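As an illustration of what "universal and unambiguous" means in practice, the sketch below looks up some Old English letters in the Unicode character database shipped with Python. The selection of letters is an assumption for the example; U+F127 is the Private Use Area code point that appears later in these slides.

```python
import unicodedata

letters = "æðþƿſ"

# Every encoded character has one code point and one official name.
names = {ch: unicodedata.name(ch) for ch in letters}
for ch in letters:
    print(f"U+{ord(ch):04X}", names[ch], len(ch.encode("utf-8")), "byte(s) in UTF-8")

# A Private Use Area code point is a legal "character" with no agreed meaning:
# it has a general category (Co) but no name, so a font must supply the glyph.
pua = "\uf127"
pua_category = unicodedata.category(pua)
pua_name = unicodedata.name(pua, None)
print(pua_category, pua_name)
```

The Private Use Area case is where the character/glyph distinction becomes a practical issue: the number is standard, but its appearance depends entirely on a shared convention such as a MUFI compliant font.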

8

The Unicode Standard

9

Characters in an Old English manuscript

Considerable variation of shapes for the same character:

a s y M

• Size variation:

a i e

• Special characters (abbreviations, punctuation):

10

Encoding OE Characters

Why encode “non standard” characters

– To allow for paleographical analysis

– To track scribal habits

– To obtain a high quality text-only facsimile

What to encode

Not every letter variation is meaningful

How to encode

Unicode + XML markup + MUFI compliant font

11

Entities

Entities are “empty boxes” (think about constants in programming languages)

Entities must be declared at the beginning of the XML document (or, more often, in a separate file):

<!ENTITY lows "&#61735;"> <!-- low s letter -->

<!ENTITY longs "ſ"> <!-- long, f shaped s letter -->

They allow for interchange with legacy operating systems and platforms

They simplify the handling of “special characters” (and more)
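To see how such declarations behave, here is a small sketch that parses a document with the two entities declared in an internal DTD subset, using Python's standard `xml.etree.ElementTree`. The `<text>` wrapper and its content are hypothetical; only the two entity declarations come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the declarations mirror the slide.
doc = """<!DOCTYPE text [
  <!ENTITY lows "&#61735;">
  <!ENTITY longs "ſ">
]>
<text>&longs; and &lows;</text>"""

root = ET.fromstring(doc)

# The parser replaces each entity reference with its declared value.
resolved = root.text
print([f"U+{ord(ch):04X}" for ch in resolved if ch not in " and"])

# After parsing, only the replacement characters are left in the tree.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Note that the serialized output no longer contains `&longs;` or `&lows;`: once the parser has expanded them, the entity names cannot be recovered from the tree.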

12

TEI P4 and Unicode

• How to use entities:

&longs; → “s”: not very useful

N.B.: entity names are “lost” forever!!!

&longs; → “ſ”: visualization

<c type='longs'>s</c>: visualization + search

• ... but what about “missing” characters?
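The advantage of the `<c type='longs'>s</c>` form can be sketched as follows: the same markup yields a normalized string for searching and a palaeographic string for display. The sample word, the `GLYPHS` table, and both function names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample word encoded with the <c> element from the slide.
line = ET.fromstring("<l>wi<c type='longs'>s</c>dom</l>")

# Assumed lookup table: display glyph for each <c> type.
GLYPHS = {"longs": "ſ"}


def search_text(el):
    """Plain text with the normalized letters, suitable for searching."""
    return "".join(el.itertext())


def display_text(el):
    """Text with each typed <c> replaced by its special glyph."""
    parts = [el.text or ""]
    for child in el:
        if child.tag == "c":
            parts.append(GLYPHS.get(child.get("type"), child.text or ""))
        else:
            parts.append(display_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


searchable = search_text(line)
displayed = display_text(line)
print(searchable, displayed)
```

A search for "wisdom" matches the first string, while the second keeps the long s for visualization; a bare entity can give only one of the two.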

13

TEI P5 and Unicode

Use the <g> element in the text:

... <g ref="#lows"/> ...

together with the <charDesc> one:

<charDesc>
  <char id="lows">
    <charName>LATIN SMALL LETTER S LOW UNDER THE LINE</charName>
    <charProp>
      <localName>entity</localName>
      <value>lows</value>
    </charProp>
    <mapping type="standardized">s</mapping>
    <mapping type="PUA">U+F127</mapping>
  </char>
</charDesc>
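How a processor might use this description can be sketched in a few lines: each `<g ref="..."/>` is looked up in the `<charDesc>` and replaced by whichever mapping the output needs. The wrapper element `<TEI>`, the sample line, and the functions `mappings` and `render` are assumptions for the example; the attribute name `id` follows the slide.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<TEI>
  <charDesc>
    <char id="lows">
      <charName>LATIN SMALL LETTER S LOW UNDER THE LINE</charName>
      <mapping type="standardized">s</mapping>
      <mapping type="PUA">U+F127</mapping>
    </char>
  </charDesc>
  <l>wi<g ref="#lows"/>dom</l>
</TEI>
""")


def mappings(root):
    """Collect {char id: {mapping type: value}} from every <char>."""
    table = {}
    for char in root.iter("char"):
        table[char.get("id")] = {m.get("type"): m.text for m in char.iter("mapping")}
    return table


def render(el, table, mapping_type):
    """Return the text of `el`, resolving each <g> through the chosen mapping."""
    out = [el.text or ""]
    for child in el:
        if child.tag == "g":
            value = table[child.get("ref").lstrip("#")][mapping_type]
            if value.startswith("U+"):  # code point notation, e.g. U+F127
                value = chr(int(value[2:], 16))
            out.append(value)
        else:
            out.append(render(child, table, mapping_type))
        out.append(child.tail or "")
    return "".join(out)


table = mappings(doc)
line = doc.find("l")
standardized = render(line, table, "standardized")
pua = render(line, table, "PUA")
print(standardized, [f"U+{ord(c):04X}" for c in pua])
```

The standardized rendering is searchable plain text, while the PUA rendering needs a font that defines U+F127 to display the intended letter form.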

14

TEI P5 and Unicode

Another example:

<charDesc>
  <glyph id="r1">
    <glyphName>LATIN SMALL LETTER R WITH ONE FUNNY STROKE</glyphName>
    <charProp>
      <localName>entity</localName>
      <value>r1</value>
    </charProp>
    <graphic url="r1img.png"/>
  </glyph>
</charDesc>

15

Metrical Markup

Old Germanic meter features:

– non-isosyllabic

– syllabic quantity not particularly relevant

– long verse composed of two half-lines

– half-lines bound by alliteration

– stress pattern

No specific solutions in the TEI guidelines

Several prosodic theories (Sievers to Hoover)

Problems with stylistic features

Risk of complex, overlapping markup

16

General Structure of Old Germanic Meter

A markup proposal:

<lg>
  <l>
    <hl>Hwæt! Ic swefna cyst</hl>
    <hl>secgan wylle</hl>
  </l>
</lg>

<lg> (line group) only needed where stanzas occur (Deor)

<hl> (half line) syntactic sugar for <seg type="halfline">

<hlA> and <hlB> not needed
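A short sketch of how this structure could be consumed: it rebuilds each long line from its two half-lines and then rewrites `<hl>` into the equivalent `<seg type="halfline">`. The function names and the three-space caesura are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

poem = ET.fromstring("""
<lg>
  <l>
    <hl>Hwæt! Ic swefna cyst</hl>
    <hl>secgan wylle</hl>
  </l>
</lg>
""")


def long_lines(root, caesura="   "):
    """Join the half-lines of every <l> with a visible caesura."""
    result = []
    for line in root.iter("l"):
        halves = [hl.text.strip() for hl in line.findall("hl")]
        result.append(caesura.join(halves))
    return result


def desugar(root):
    """Rewrite each <hl> as the generic <seg type="halfline">."""
    for hl in list(root.iter("hl")):
        hl.tag = "seg"
        hl.set("type", "halfline")
    return root


lines = long_lines(poem)
print(lines[0])

desugar(poem)
segments = poem.findall(".//seg[@type='halfline']")
print(len(segments), "half-lines as <seg>")
```

Because the order of the two `<hl>` children already says which half-line is first, no separate `<hlA>` and `<hlB>` elements are needed to rebuild the line.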

17

Meter encoding v. 1

A simple method to encode meter using <met> elements, with attributes, inside the <hl> element:

<hl>
  <met name="Sievers" code="D1" scan="//\x"/>
  <met name="Russom" code="x/Sx" scan="x|/\x"/>
  <met name="Hoover" code="nAn" scan="xx /\x"/>
  ...
  HWÆT! WE GARDENA
</hl>

Doesn't allow for alternative scansions using the same system

Doesn't take into account syllables (and disagreement in syllable counts/stress pattern)
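The simple model is easy to query, which the sketch below tries to show: it reads the three `<met>` elements into a table keyed by system name. The variable names are assumptions; the markup is the one from the slide, with the elided part left out.

```python
import xml.etree.ElementTree as ET

# Raw string, because the scansion notation uses backslashes.
half_line = ET.fromstring(r"""
<hl>
  <met name="Sievers" code="D1" scan="//\x"/>
  <met name="Russom" code="x/Sx" scan="x|/\x"/>
  <met name="Hoover" code="nAn" scan="xx /\x"/>
  HWÆT! WE GARDENA
</hl>
""")

# The verse text is whatever remains once the empty <met> elements are ignored.
text = "".join(half_line.itertext()).strip()

# One entry per metrical system: name -> (type code, scansion).
scansions = {
    met.get("name"): (met.get("code"), met.get("scan"))
    for met in half_line.findall("met")
}

print(text)
for system, (code, scan) in scansions.items():
    print(f"{system:8} {code:5} {scan}")
```

Keying the table by system name also makes the first limitation visible: a second Sievers scansion by another scholar would overwrite the first, since nothing but the name distinguishes them.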

18

Meter encoding v. 2

A more complete (and complex) method:

<hl n="3a"><met system="Sievers" resp="Schwab" totalSyllables="5" scansion="D-1" Anacrusis="0" Extrametrical="0" Lift="1,2,4" halfLift="3" dip="5" allitGlyph="w" allitSound="/w/" allitPosition="1,2" />

<met system="Sievers" resp="Fulk" totalSyllables="4" scansion="D-1" Anacrusis="0" Extrametrical="0" Lift="1,2" halfLift="3" dip="4" allitGlyph="w" allitSound="/w/" allitPosition="1,2" />

weorc wuldorfaeder</hl>

Scansion not associated with the actual text ...
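To make the difference between the two scansions concrete, the following sketch turns the positional attributes back into a stress string. It keeps only the attributes it needs from the slide's markup; the function names and the use of `/`, `\` and `x` for lift, half-lift and dip are assumptions borrowed from the notation of the simpler model.

```python
import xml.etree.ElementTree as ET

half_line = ET.fromstring("""
<hl n="3a">
  <met system="Sievers" resp="Schwab" totalSyllables="5" scansion="D-1"
       Lift="1,2,4" halfLift="3" dip="5"/>
  <met system="Sievers" resp="Fulk" totalSyllables="4" scansion="D-1"
       Lift="1,2" halfLift="3" dip="4"/>
  weorc wuldorfaeder
</hl>
""")


def positions(value):
    """Parse a comma separated list such as "1,2,4" into integers."""
    return [int(p) for p in value.split(",")] if value else []


def stress_pattern(met):
    """Rebuild one symbol per syllable from the positional attributes."""
    pattern = ["?"] * int(met.get("totalSyllables"))
    for attribute, symbol in (("Lift", "/"), ("halfLift", "\\"), ("dip", "x")):
        for position in positions(met.get(attribute)):
            pattern[position - 1] = symbol  # positions are 1-based
    return "".join(pattern)


patterns = {met.get("resp"): stress_pattern(met) for met in half_line.findall("met")}
for scholar, pattern in patterns.items():
    print(scholar, pattern)
```

Two scansions within the same system can now coexist, distinguished by `resp`, and the disagreement over the number of syllables shows up as patterns of different length.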

19

Meter encoding v. 2

... in fact you could take it out of the text:

<hl n="3a" id="CH.3a">weorc wuldorfaeder</hl>

...

<met target="CH.3a" system="Sievers" resp="Schwab" totalSyllables="5" scansion="D-1" Anacrusis="0" Extrametrical="0" Lift="1,2,4" halfLift="3" dip="5" allitGlyph="w" allitSound="/w/" allitPosition="1,2" />

<met target="CH.3a" system="Sievers" resp="Fulk" totalSyllables="4" scansion="D-1" Anacrusis="0" Extrametrical="0" Lift="1,2" halfLift="3" dip="4" allitGlyph="w" allitSound="/w/" allitPosition="1,2" />
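Stand-off scansions of this kind have to be joined back to the text through the `target` attribute. The sketch below shows one possible way to do that; the `<text>` wrapper, the reduced attribute set, and the variable names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<text>
  <hl n="3a" id="CH.3a">weorc wuldorfaeder</hl>
  <met target="CH.3a" system="Sievers" resp="Schwab" totalSyllables="5" scansion="D-1"/>
  <met target="CH.3a" system="Sievers" resp="Fulk" totalSyllables="4" scansion="D-1"/>
</text>
""")

# Index the half-lines by id so that each target can be resolved.
by_id = {hl.get("id"): hl for hl in doc.iter("hl")}

# Group the stand-off scansions by the half-line they point to.
scansions = {}
for met in doc.iter("met"):
    scansions.setdefault(met.get("target"), []).append(met)

report = [
    (by_id[target].text, met.get("resp"), int(met.get("totalSyllables")))
    for target, mets in scansions.items()
    for met in mets
]
for verse, scholar, count in report:
    print(f"{verse!r}: {scholar} counts {count} syllables")
```

Since the link depends only on matching `id` and `target` values, the `<met>` elements could equally live in a separate file maintained by a different editor.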

20

Meter encoding v. 2

To establish a direct connection between scansion and text you have to mark syllables

You could add this to the simple model:

<hl>
  <met name="Russom" scan="/x|/xx" sylls="1a.1.1 1a.1.2 1a.1.3 1a.1.4 1a.1.5"/>
  <met name="Bliss" scan="/|\xx" sylls="1a.1.1 1a.1.3 1a.1.4 1a.1.5"/>
  <syl id="1a.1.1">þe</syl><syl id="1a.1.2">od</syl><syl id="1a.1.3">cyn</syl><syl id="1a.1.4">in</syl><syl id="1a.1.5">ga</syl>
</hl>
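With syllables marked, each scansion symbol can be paired with the syllable it describes. The sketch below does this for both systems; the function `align` and the decision to treat `|` as a foot boundary that carries no syllable are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Raw string, because the scansion notation uses backslashes.
half_line = ET.fromstring(r"""
<hl>
  <met name="Russom" scan="/x|/xx" sylls="1a.1.1 1a.1.2 1a.1.3 1a.1.4 1a.1.5"/>
  <met name="Bliss" scan="/|\xx" sylls="1a.1.1 1a.1.3 1a.1.4 1a.1.5"/>
  <syl id="1a.1.1">þe</syl><syl id="1a.1.2">od</syl><syl id="1a.1.3">cyn</syl><syl id="1a.1.4">in</syl><syl id="1a.1.5">ga</syl>
</hl>
""")

syllables = {syl.get("id"): syl.text for syl in half_line.findall("syl")}


def align(met):
    """Pair each stress mark with the syllable id listed at the same position."""
    marks = [mark for mark in met.get("scan") if mark not in "| "]
    ids = met.get("sylls").split()
    if len(marks) != len(ids):
        raise ValueError(f"{met.get('name')}: {len(marks)} marks, {len(ids)} syllables")
    return [(syllables[syl_id], mark) for syl_id, mark in zip(ids, marks)]


aligned = {met.get("name"): align(met) for met in half_line.findall("met")}
for system, pairs in aligned.items():
    print(system, " ".join(f"{text}({mark})" for text, mark in pairs))
```

The Bliss entry simply omits `1a.1.2` from its list, which is how this model records that the two systems disagree on the syllable count of the first word.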

21

Meter encoding v. 3

The most complete (and complex!) method:

<fvLib id="PS" type="Prosodic Stress">
  <ignored id="x"/> <!-- ignored in scansion -->
  <dip id="SO"/>
  <dipResolution id="SOR"/> <!-- second half of resolved lift -->
  <halfLiftLongPosition id="S1LP"/> <!-- = V+CC -->
  <halfLiftLongNature id="S1LN"/> <!-- = long vowel -->
  <halfLiftShort id="S1S"/>
  <liftLongPosition id="S2LP"/> <!-- lift long by position -->
  ...
</fvLib>

22

Meter encoding v. 3

The Feature Structure looks complex, but need only be designed once:

<hl n="3a" id="CH.3a"><syll id="ch3a.1">weorc</syll> <syll id="ch3a.2">wul</syll><syll id="ch3a.3">dor</syll><syll id="ch3a.4">fae</syll><syll id="ch3a.5">der</syll></hl>

....

<linkGrp type="metrical prosody" domains="PS AT AP AG T1" targFunc="?">
  <!--...-->
  <link id="L1" targets="ch3a.1 S2LP A1 APW AGW"/>
  <link id="L2" targets="ch3a.2 S2LP A1 APW AGW"/>
  ...
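One way to read such links is to split each `targets` list into the syllable it points to and the feature ids attached to it. The sketch below assumes exactly that reading; the `<text>` wrapper, the closing of the `<linkGrp>`, and the treatment of ids such as `A1`, `APW` and `AGW` as opaque labels are assumptions, since the slides do not show their definitions.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<text>
  <hl n="3a" id="CH.3a"><syll id="ch3a.1">weorc</syll> <syll id="ch3a.2">wul</syll><syll id="ch3a.3">dor</syll><syll id="ch3a.4">fae</syll><syll id="ch3a.5">der</syll></hl>
  <linkGrp type="metrical prosody">
    <link id="L1" targets="ch3a.1 S2LP A1 APW AGW"/>
    <link id="L2" targets="ch3a.2 S2LP A1 APW AGW"/>
  </linkGrp>
</text>
""")

syllables = {syll.get("id"): syll.text for syll in doc.iter("syll")}

# Each link joins syllable ids to feature ids; tell them apart by lookup.
features = {}
for link in doc.iter("link"):
    targets = link.get("targets").split()
    feature_ids = [t for t in targets if t not in syllables]
    for syllable_id in (t for t in targets if t in syllables):
        features.setdefault(syllable_id, []).extend(feature_ids)

annotated = [(syllables[sid], features.get(sid, [])) for sid in syllables]
for text, labels in annotated:
    print(f"{text:6}", " ".join(labels) or "(no link yet)")
```

Syllables without a link simply come out unannotated, so a text can be enriched incrementally, one link at a time.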

23

Stylistic features: the kenning

• Main element:

<kenning>

Using the <kenning> element without further markup is the simplest way to mark up kenningar in a text

Examples:

<kenning>swanrād</kenning>

<kenning>beadolēoma</kenning>

24

Stylistic features: the kenning

Sub-elements

<bw> base word

To single out the base word in a kenning

<det> determinant

To single out the determinant

<refer> referent

Explicit markup of the object or person the kenning refers to

25

Stylistic features: the kenning

Attributes

type specifies the type of kenning

level specifies the level, i.e. whether the kenning is hosted by or hosts another kenning, and its position in the hierarchy

class specifies a general semantic class which the kenning belongs to

func specifies the stylistic function of the kenning

26

Stylistic features: the kenning

Examples:

<kenning>
  <det>beado</det><bw>lēoma</bw><refer>sweord</refer>
</kenning>

<kenning level="1">
  <det>
    <kenning level="2">
      <det>heofon</det><bw>engla</bw>
    </kenning>
  </det>
  <bw>cyning</bw>
</kenning>
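The nested example can be unpacked recursively, as the following sketch suggests. The functions `describe` and `depth`, the dictionary keys, and the default of level 1 for a kenning without a `level` attribute are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

simple = ET.fromstring(
    "<kenning><det>beado</det><bw>lēoma</bw><refer>sweord</refer></kenning>"
)
nested = ET.fromstring("""
<kenning level="1">
  <det>
    <kenning level="2"><det>heofon</det><bw>engla</bw></kenning>
  </det>
  <bw>cyning</bw>
</kenning>
""")


def describe(kenning):
    """Return determinant, base word and referent; recurse into a hosted kenning."""
    det = kenning.find("det")
    inner = det.find("kenning")
    return {
        "level": int(kenning.get("level", "1")),  # assumed default
        "determinant": describe(inner) if inner is not None else det.text,
        "base_word": kenning.findtext("bw"),
        "referent": kenning.findtext("refer"),  # None when not marked up
    }


def depth(kenning):
    """Count how many kennings are stacked inside the determinant."""
    inner = kenning.find("det/kenning")
    return 1 + (depth(inner) if inner is not None else 0)


simple_info = describe(simple)
nested_info = describe(nested)
print(simple_info)
print(nested_info["determinant"]["base_word"], nested_info["base_word"], depth(nested))
```

The `level` attribute and the computed depth carry the same information here, which suggests `level` is mainly a convenience for processors that do not want to walk the tree.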

27

A Work in Progress...

• Coming soon on the Digital Medievalist site:

http://www.digitalmedievalist.org/

Collaborative edition on the wiki

Metrical-markup list for discussion ([email protected])

Feel free to ask and/or suggest!

28

Conclusion

The Digital Vercelli Book team:

Federica Goria

Raffaele Cioffi

Emilia Di Maio

Roberto Rosselli Del Turco

The Metrical Markup team:

Dorothy Carr Porter

Daniel Paul O'Donnell

Roberto Rosselli Del Turco