Puglia marac-file formats-20111020

Revisiting File Formats for Digitization Steven T. Puglia Digital Conversion Services Manager Office of Strategic Initiatives Library of Congress 101 Independence Ave, SE Washington, DC 20540, USA Phone: 202-707-5726 Email: [email protected]
Date posted: 19-Oct-2014

Transcript of Puglia marac-file formats-20111020

Page 1: Puglia marac-file formats-20111020

Revisiting File Formats for Digitization
Steven T. Puglia
Digital Conversion Services Manager
Office of Strategic Initiatives
Library of Congress
101 Independence Ave, SE
Washington, DC 20540, USA
Phone: 202-707-5726
Email: [email protected]

Page 2: Puglia marac-file formats-20111020

Federal Agencies Digitization Guidelines Initiative
www.digitizationguidelines.gov

Working Groups:
•Still Image
•Audio-Visual
•Authentication

Page 3: Puglia marac-file formats-20111020

In general, within the digital library community, format and compression recommendations for master and derivative image files remain based on older perspectives regarding digitization, digital preservation, and IT/network/web technologies.

Page 4: Puglia marac-file formats-20111020

In the current, and likely future, constrained fiscal climate, file format and image compression recommendations should be reviewed and revised to:

Take into account cost-effectiveness and efficiency in:
•Mass digitization projects
•Short-term digital storage
•Derivative production
•Moving files into and within an IT, network, and web delivery infrastructure
•Digital preservation and long-term storage

Reflect current perspectives on digital preservation; take advantage of new computer, IT, and web technologies; and respond to the changing expectations of researchers and users.

Page 5: Puglia marac-file formats-20111020

For still or raster images, considering:

•TIFF
•JPEG 2000
•JPEG
•PNG
•PDF and PDF/A

Page 6: Puglia marac-file formats-20111020

Not just sustainability, considering:

•Authenticity
•Data resilience and error recovery
•Embedded metadata support
•Fiscal implications
•Sustainability Confidence and Pros/Cons
•Technical Pros/Cons
•Ease and accuracy of validation

Page 7: Puglia marac-file formats-20111020

David Rosenthal, Stanford University
http://blog.dshr.org/2011/09/whats-wrong-with-research-communication.html

“If we were starting with a blank sheet of paper to design a mechanism for communicating about research, what would be our requirements?

•Repeatability
•Reusability
•Immediacy
•Transparency
•Openness
•Sustainability
•Permanence
•Authenticity”

Page 8: Puglia marac-file formats-20111020

Recommended Data Formats for Preservation Purposes in the Florida Digital Archive
http://fclaweb.fcla.edu/uploads/Lydia%20Motyka/FDA_documentation/recFormats.pdf

Page 9: Puglia marac-file formats-20111020

Sustainability Confidence:

High
•TIFF – uncompressed
•JPEG 2000 – JP2 and lossless compression
•PNG
•PDF/A – FCLA recommends for text

Medium
•TIFF – lossless compressed
•JPEG
•JPEG 2000 – JP2 and lossy compression
•PDF – FCLA recommends for text

Page 10: Puglia marac-file formats-20111020

For audio:
•Broadcast WAV

For video:
•MXF (Media eXchange Format)

•Working on application specification (http://www.digitizationguidelines.gov/guidelines/MXF_app_spec.html)

•Library of Congress planning on lossless JPEG 2000 encoding (http://www.digitizationguidelines.gov/still-image/documents/Snyder.pdf and http://www.digitalpreservation.gov/news/events/other_meetings/storage11/docs/04_snyder.pdf)

SMPTE working on AXF or Archive eXchange Format

Page 11: Puglia marac-file formats-20111020

Image and File Compression?

•Yes! No! Maybe!

•Circling back to yes – when appropriate and using a reasoned approach

Page 12: Puglia marac-file formats-20111020

Howard Besser, NYU
File Compression Strategies, ALA Mid-Winter Panel, 1999 (RLG DigiNews)

The Short Life of Digital Information – The Scrambling Problem

•In order to solve short-term problems resulting from the use of digital technology, we've engaged in practices that may result in long-term peril.

•In the past, because large-scale storage was costly and bandwidth was fairly narrow, many repositories responded to these constraints by compressing their master images or multimedia.

Page 13: Puglia marac-file formats-20111020

•According to the reasoning that dominated until recently, compressed master files take up less storage, are easier to deliver to users with slow network connections, and are more convenient to handle internally.

•In recent years, a number of institutions have come to question this tenet as storage costs have plummeted and network speeds have dramatically increased.

•Compression creates a number of problems.

•Another very important issue is that both lossy and lossless compression add still another level of complexity to the encoding of a file, making it even more difficult for future archeologists trying to decipher its contents.

Page 14: Puglia marac-file formats-20111020

Lou Sharpe, Picture Elements
File Compression Strategies, ALA Mid-Winter Panel, 1999 (RLG DigiNews)

Advocated the use of visually lossless (but lossy) compression with certain types of originals. Because "practical people make practical decisions," time, conversion costs, staffing, etc., are all factors in the decision-making process for digital reformatting.

He called attention to a handout that framed his position as a debate proposition:

1. Resolved, that visually lossless (yet lossy) compression of tonal images of illustrated book pages can be used to create high-quality digital masters if…conditions are met.

Page 15: Puglia marac-file formats-20111020

2. Further resolved, that such images are of sufficient quality to serve as preservation images for books which are:
•Found in the stacks, not in the rare book room.
•Likely to remain available somewhere in physical form.

3. Further resolved, that such images are of comparable or superior quality to accepted preservation approaches such as microfilm.

4. Further resolved, that cost matters in digital library image conversion projects, even though it is other people's money.

With a final nod to the improvements proposed for JPEG 2000, Sharpe argued that at a minimum, the library and archival community should not close the door on the use of visually lossless compression.

Page 16: Puglia marac-file formats-20111020

Shift in thinking began in the mid-2000s.

A lot of consideration:
•In Europe
•By large university libraries
•By those dealing with mass digitization

It is a matter of:
•Scale and economics
•A better understanding of the realities and risks of digital preservation

Page 17: Puglia marac-file formats-20111020

“Exabytes: Documenting the ‘digital age’ and huge growth in computing capacity,” by Brian Vastag, Washington Post, Feb. 10, 2011
http://www.washingtonpost.com/wp-dyn/content/article/2011/02/10/AR2011021004916.html

Megabytes are dead.

Gigabytes are passe.

So much digital data now moves around the globe that those who endeavor to measure it employ a new - or new to non-nerds - term.

Meet the exabyte.

Page 18: Puglia marac-file formats-20111020

Rise of Information in the Digital Age
http://www.washingtonpost.com/wp-dyn/content/graphic/2011/02/11/GR2011021100614.html?sid=ST2011021100514

Page 19: Puglia marac-file formats-20111020

Really big data: The challenges of managing mountains of information, by John Brandon, October 18, 2011 http://www.computerworld.com/s/article/9220504/Really_big_data_The_challenges_of_managing_mountains_of_information?

The Library of Congress processes 2.5 petabytes of data each year, which amounts to 40TB per week.

Thomas Youkel, group chief of enterprise systems engineering at the Library, estimates the data load will quadruple in the next few years as the Library continues to carry out its dual mandates to serve up data for historians and preserve information in all its forms.
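As a quick check on the scale of these figures, the short sketch below converts an annual volume into a weekly rate. It assumes decimal units (1 PB = 1000 TB) and a 52-week year; neither assumption comes from the article. Under those assumptions 2.5 PB per year works out to roughly 48 TB per week, so the quoted 40TB figure is best read as an approximation.

```python
# Convert an annual data volume into an approximate weekly rate.
# Assumptions (not from the article): 1 PB = 1000 TB, 52 weeks per year.

TB_PER_PB = 1000
WEEKS_PER_YEAR = 52


def weekly_terabytes(petabytes_per_year: float) -> float:
    """Return the average weekly volume in TB for a yearly volume in PB."""
    return petabytes_per_year * TB_PER_PB / WEEKS_PER_YEAR


current = weekly_terabytes(2.5)
quadrupled = weekly_terabytes(2.5 * 4)  # the "quadruple" scenario in the quote

print(f"current load:    {current:.1f} TB/week")
print(f"quadrupled load: {quadrupled:.1f} TB/week")
```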

Page 20: Puglia marac-file formats-20111020

David Rosenthal, Stanford University
http://blog.dshr.org/2010/12/rob-sharpes-case-for-format-migration.html

My case is that, as we see from the last few years focus on "sustainability of digital preservation", the major problems in digital preservation are economic.

(Blue Ribbon Task Force on Sustainable Digital Preservation and Access - http://brtf.sdsc.edu/)

Page 21: Puglia marac-file formats-20111020

Andy Jackson, The British Library
http://www.openplanetsfoundation.org/blogs/2011-01-12-format-obsolescence-and-sustainable-access

This means that the long-term cost of preserving our collection scales not only with the size of the files, but also rises as the number of formats we are required to support is increased.

Page 22: Puglia marac-file formats-20111020

We cannot avoid dealing with compressed file formats, including lossy compressed formats:

•Currently, up to 375 billion digital photos are taken each year, and the number continues to increase (“How many photos have ever been taken?” by Jonathan Good, Sept. 15, 2011)

•This is orders of magnitude more raster image files than are being produced by digitization efforts

•We can expect that nearly all of these 375 billion digital photos per year are JPEG files

•The answer for digital preservation is not going to be insisting that all image files be saved in uncompressed formats

Page 23: Puglia marac-file formats-20111020

David Rosenthal, Stanford University
http://blog.dshr.org/2011/03/how-few-copies.html

Compression reduces the redundancy within a single copy and increases the risk of damage.

There are also techniques that increase the redundancy within a single copy and reduce the risk.
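The effect of a single damaged bit can be illustrated with a small experiment. The sketch below is only an illustration, not a measurement of any image format: it uses Python's standard zlib module as a stand-in for a lossless compressor, and the sample data and helper names (flip_bit, bytes_lost, worst_case_loss) are invented for the example. Because a zlib stream carries a checksum, this decoder rejects a damaged stream outright; a more lenient decoder might salvage part of it.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def bytes_lost(original: bytes, recovered: bytes) -> int:
    """Count bytes that differ or are missing after recovery."""
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


def recover_zlib(stream: bytes) -> bytes:
    """Decompress a stream, or return nothing if zlib rejects it."""
    try:
        return zlib.decompress(stream)
    except zlib.error:
        return b""


def worst_case_loss(original: bytes, samples: int = 32) -> tuple:
    """Flip one bit at several positions; return worst loss (raw, compressed)."""
    compressed = zlib.compress(original)
    raw_worst = 0
    packed_worst = 0
    for i in range(samples):
        raw_bit = i * (len(original) * 8 // samples)
        packed_bit = i * (len(compressed) * 8 // samples)
        raw_damaged = flip_bit(original, raw_bit)
        packed_damaged = recover_zlib(flip_bit(compressed, packed_bit))
        raw_worst = max(raw_worst, bytes_lost(original, raw_damaged))
        packed_worst = max(packed_worst, bytes_lost(original, packed_damaged))
    return raw_worst, packed_worst


# Hypothetical sample payload standing in for file content.
sample = "".join(f"record {n}: scanned page\n" for n in range(500)).encode()
raw_loss, packed_loss = worst_case_loss(sample)
print(f"uncompressed, worst case: {raw_loss} of {len(sample)} bytes damaged")
print(f"compressed, worst case:   {packed_loss} of {len(sample)} bytes damaged")
```

A flipped bit in the uncompressed data stays confined to one byte, while a flipped bit in the compressed stream can make the whole payload unreadable with this decoder.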

Page 24: Puglia marac-file formats-20111020

David Rosenthal, Stanford University
http://blog.dshr.org/2011/03/how-few-copies.html

If you ask the people who run large data centers what are the most important causes of data loss, you get a list like this:

•Operator error
•External Attack
•Insider Attack
•Economic Failure
•Organization Failure

Page 25: Puglia marac-file formats-20111020

Erik Hetzner, California Digital Library
http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c

I see no reason to store, as a matter of policy, uncompressed files on our disks. In fact, I think we should be more aggressive about compressing files.

(Hetzner focuses on lossless compression.)

Page 26: Puglia marac-file formats-20111020

Erik Hetzner, California Digital Library
http://groups.google.com/group/digital-curation/msg/b487a1b0188f9c0c

Even without error correcting codes, I don’t think the arguments for storing uncompressed data only as a matter of policy are strong at all.

When we take error correcting codes into account, not compressing your data as a policy in order to keep a higher level of redundancy seems like the worst way to increase the redundancy of the data.

Smart people have figured out how to make codes which can reliably correct limited errors in bytestreams. Why not use them?
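To make the idea of an error correcting code concrete, here is a minimal sketch of the classic Hamming(7,4) code, which stores 4 data bits in 7 bits and can repair any single flipped bit in that group. It is offered only as an illustration of the principle; Hetzner does not name a specific code, and production storage systems typically rely on stronger codes such as Reed-Solomon. The function names are invented for the example.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits (0..15) as a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3  # 1-based position; 0 means no error
    if error_pos:
        c[error_pos - 1] ^= 1  # repair the damaged bit
    d1, d2, d3, d4 = c[2], c[4], c[5], c[6]
    return d1 | (d2 << 1) | (d3 << 2) | (d4 << 3)


# Damage every bit position of every codeword and confirm recovery.
for value in range(16):
    clean = hamming74_encode(value)
    for position in range(7):
        damaged = list(clean)
        damaged[position] ^= 1
        assert hamming74_decode(damaged) == value
print("every single-bit error was corrected")
```

The redundancy here is added deliberately and in a known amount (3 extra bits per 4 data bits), which is the contrast Hetzner draws with relying on the incidental redundancy of uncompressed files.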

Page 27: Puglia marac-file formats-20111020

Data corruption is and will remain a problem.

An active part of digital preservation will be to overcome this problem.

The LOCKSS concept includes one approach for dealing with the problem – “…the bits and bytes are continually audited and repaired…to protect fragile digital content for the very long time.” http://www.eecs.harvard.edu/~mema/publications/SOSP2003.pdf

LOCKSS now has a 12-year track record.
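The general audit-and-repair idea can be sketched in a few lines. The example below is a deliberately simplified illustration and is not the LOCKSS protocol, which relies on polling among independent peers. It assumes all replicas are available in one place, compares SHA-256 digests, and overwrites any outlier with the majority copy. The replica names and the function audit_and_repair are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a replica's content."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict) -> list:
    """Overwrite outlier replicas with the majority copy; return repaired names."""
    digests = {name: digest(data) for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        # Without a clear majority there is no trustworthy copy to repair from.
        raise ValueError("no majority among replicas")
    good_copy = next(d for n, d in replicas.items() if digests[n] == winner)
    repaired = []
    for name in replicas:
        if digests[name] != winner:
            replicas[name] = good_copy
            repaired.append(name)
    return repaired


# Hypothetical replicas; site_c simulates silent corruption.
replicas = {
    "site_a": b"page image bytes",
    "site_b": b"page image bytes",
    "site_c": b"page image bytez",
}
print("repaired:", audit_and_repair(replicas))
```

Run on a schedule, a check of this kind shortens the time a damaged copy goes unnoticed.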

Page 28: Puglia marac-file formats-20111020

David Rosenthal, Stanford University
http://blog.dshr.org/2011/03/how-few-copies.html

Thus we can say that some digital content is going to get lost or damaged. This shouldn't be news; the same is true of analog content.

We have rules of thumb to guide us in trying to reduce the amount of loss and damage:

•The more copies the safer
•The less correlated the copies the safer
•The more reliable each copy the safer
•The faster failures are detected and repaired the safer
•The less aggressive the compression the smaller the effect of damage
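The first two rules of thumb can be made concrete with a small calculation. The sketch below assumes a hypothetical 1% chance that any one copy is lost in some period and assumes that failures are fully independent; both are illustrative assumptions, not figures from Rosenthal. Under independence, the chance of losing every copy is the per-copy probability raised to the number of copies.

```python
def loss_probability(per_copy_failure: float, copies: int) -> float:
    """Chance that every copy fails, assuming fully independent failures."""
    return per_copy_failure ** copies


# Hypothetical per-copy failure probability of 1%.
for n in (1, 2, 3, 4):
    print(f"{n} copies: {loss_probability(0.01, n):.0e}")
```

If the copies share a cause of failure (same operator, same site, same software), the independence assumption breaks down and the real risk sits closer to that of a single copy, which is why correlation matters as much as the count.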

Page 29: Puglia marac-file formats-20111020

If image files are being brought into a managed environment, compression, particularly lossless compression, is much less of a concern.

Conversely, if images are being stored on DVDs on a shelf, then compression raises the risks significantly.

Page 30: Puglia marac-file formats-20111020

One option for file format and compression (lossless and lossy) - JPEG 2000

Page 31: Puglia marac-file formats-20111020

Many organizations still face barriers to adopting JPEG 2000 (limited open source tools), along with concerns about potential risks (file corruption and possible legal issues).

These issues have been acknowledged within the broader cultural heritage digitization community.

Page 32: Puglia marac-file formats-20111020

A number of research studies have been conducted on the robustness of JPEG 2000.

Studies have seen similar results in terms of susceptibility to corruption.

Nevertheless, organizations have concluded that JPEG 2000 is an appropriate file format choice from a robustness perspective – “conclude that JPEG 2000 is a good current solution for our digital repositories.”
A Format for Digital Preservation of Images by Buonora and Liberati
http://www.dlib.org/dlib/july08/buonora/07buonora.html

Page 33: Puglia marac-file formats-20111020

It is worth noting the format includes some “resiliency” elements that add robustness and thereby counteract some effects of data loss.

These resiliency elements are described in the notes at the bottom of the Sustainability of Digital Formats – Planning for Library of Congress web page (http://www.digitalpreservation.gov/formats/fdd/fdd000138.shtml).

Page 34: Puglia marac-file formats-20111020

Wellcome Library
http://jpeg2000wellcomelibrary.blogspot.com/2010/06/we-need-how-much-storage.html

In 2009, the Wellcome Library set out an ambitious vision to digitise a large proportion of its historic collections. This would take the annual digitisation activities of the Library from hundreds, or at most, thousands of images per year to several million images per year.

…we realised this could see the generation of up to 30m images over 5 years. Exciting, but perhaps slightly daunting, considering we didn't yet have an infrastructure to fully support such a large collection of digital assets.

Page 35: Puglia marac-file formats-20111020

Wellcome Library-

Anyone reading this blog will understand why the scale of the programme is key to the blog topic.

When we asked our IT department to tell us how much it would cost to store 30m TIFF files - our de facto standard for the couple hundred thousand images in our existing picture library - we were stunned.

Two petabytes of online, spinning disk storage with a top-of-the-line enterprise management system and remote backup would cost how much?

We learned that the cost would be something like a fifth of our total budget for the entire digitisation programme.

Page 36: Puglia marac-file formats-20111020

Wellcome Library-

Should we consider a lower-cost storage solution? Even tape back-up was quite expensive for that scale, and you can't serve images up online from tape anyway.

We revised our image sizes, factoring in smaller and smaller resolutions and/or bit depths for material like the printed books, which didn't need full colour, high resolution images. We still couldn't afford the storage costs.

Finally, we saw the light and started looking into a relatively new image format called JPEG 2000.

Page 37: Puglia marac-file formats-20111020

JPEG 2000 Summit

•The thought process and rationale for a decision to adopt and implement JPEG 2000

•What are the advantages and disadvantages? When does JPEG 2000 make sense?

•What are the barriers to adoption and implementation for organizations interested in JPEG 2000?

•How do we as a community overcome the barriers?

•For organizations that decide JPEG 2000 is a good match for their needs and goals, what is needed to make adoption and implementation practical and feasible?

•Who is willing to work on these efforts and how do we move forward?

Page 38: Puglia marac-file formats-20111020

Organizations using or accepting JPEG 2000:

•British Library
•National Library of the Czech Republic
•Google
•Harvard University
•Internet Archive
•Library of Congress
  o National Audio Visual Conservation Center
  o National Digital Newspaper Program
•National Library of the Netherlands
•National Library of Norway
•Wellcome Library

Page 39: Puglia marac-file formats-20111020

It is quite possible that more of the digital images produced by mass digitization efforts are saved as JPEG 2000 files than in any other file format.

Despite concerns and a clear need for organizational support relating to implementing JPEG 2000, far more cultural heritage organizations are using JPEG 2000 for digitization than most people realize.

Page 40: Puglia marac-file formats-20111020

JPEG 2000 is also widely implemented in other communities:
•Digital cinema
•Medical imaging
•Geospatial/remote sensing
•Law enforcement (facial image compression for biometrics)

The Department of Defense and Intelligence Community have also adopted the ISO standard for JPEG 2000 for use in the National Imagery Transmission Format standard.

Page 41: Puglia marac-file formats-20111020

Conclusions:

There is not a single answer to the question of file format for raster image files produced by digitization projects.

There are a number of file formats worthy of consideration – suitable from technical, sustainability, fiscal, and other perspectives.

Compression can represent a reasonable risk for appropriate efforts, and is likely a practical reality as digitization and digital preservation efforts scale.

Not using compression likely represents a real risk, particularly given the dramatic and continued growth in digital data.