Technology Watch Report

Preserving the Data Explosion: Using PDF

Betsy A. Fanning
AIIM
1100 Wayne Avenue, Suite 1100, Silver Spring, MD 20910 USA
[email protected]

DPC Technology Watch Series Report 08-02
April 2008

© Digital Preservation Coalition & AIIM 2008

Executive Summary

The introduction of the Internet and desktop computing has transformed the way information is handled. The Internet placed a new level of urgency on the business world by creating the expectation that information would be made available immediately, much faster than the paper-based information world could handle. This need for immediate information led to the transition from a paper-based, document-centric environment to an environment of electronic documents and electronic mail. That transition in turn led the way for document file formats such as PDF (Portable Document Format) to be introduced. A PDF document is essentially an electronic equivalent of the paper document.

This report reviews the use of PDF, and more specifically PDF/Archive, as an archival file format to preserve an organization's knowledge. It should be noted that the use of PDF or PDF/Archive alone will not ensure the long-term preservation of electronic documents. When PDF/Archive is combined with a comprehensive records management program and formally established records policies and procedures, an organization can be sure that its electronic documents will be preserved.

Electronic documents, just like paper-based documents, are preserved for reasons beyond why they were created. Organizations preserve documents for their historical value, to preserve the organization's knowledge, for research, and for the uniqueness of the information. Not all information needs to be preserved. The archival value of the information must be assessed based upon the content of the documents being preserved.
While focused on PDF and PDF/Archive, this report also addresses some of the other file formats which may be used or considered for archiving electronic documents. It explores the PDF standards efforts and their suitability for long-term preservation, including:
− PDF/Engineering,
− PDF/Exchange,
− PDF/Universal Access, and
− PDF Healthcare.

PDF/Archive was developed because numerous organisations needed to preserve their electronic documents and to access and view them at a future time, with each document appearing as it had when it was produced. To ensure that the documents could be accessed, it was important that the standard addressed the goals of device independence, of files being self-contained and self-documenting, and of avoiding restrictions such as encryption that would impede access to the documents.

Given the wide acceptance of PDF, the development of PDF/Archive for long-term preservation of electronic documents is a logical use of the file format. Through the use of PDF/A, organizations can be sure that their documents will be preserved for the long term. While PDF/A may be a suitable file format today for long-term preservation of electronic documents, it should be noted that other file formats may be introduced in the future that better serve the needs of an organization. Therefore, organizations should continually review the available file formats to ensure they have selected the best format for their purposes.

Keywords: PDF, Archive, Preservation, Engineering, File Format, Healthcare, Portable Document Format

1.0 Introduction

The Internet has transformed the workplace and changed the way paper is used. The use of electronic documents has exploded and become normal business practice. It is estimated that over 90% of all business records are electronic [1].
This transition from paper to electronic documents has led the way for document file formats such as PDF (Portable Document Format) to be introduced. PDF became popular because it enabled users to see electronic documents as they did their paper counterparts. With a large amount of information now held in electronic formats, it is critical to be able to preserve or archive this knowledge for future generations. This report describes the PDF standards activities currently under development and their relevance to digital preservation.

2.0 Background

Records, whether paper- or electronic-based, are preserved for specific reasons beyond the reason they were created. The records may have historical value, need to be preserved to maintain corporate knowledge, provide a glimpse of the direction the organisation was taking at a specific point in time, or have some need for future research or value as new products are being introduced. An organisation cannot archive all the information it creates. Therefore, it is important to assess the uniqueness of the information and the requirement for archiving.

Assessing the archival value of the information requires a review of the content of the information or documents. This requires judgments to be made as to the value of the content for reference, research, validation or reinforcement of decisions; preserving the organisation's history; maintaining relevant information for legal, administrative, or fiscal purposes; or celebrating the history of the organisation at notable anniversaries.

Not many years ago, predictions were made that information was exploding at astronomical rates and that this information would be in paper format. At that time, the electronic document was used only for non-business purposes and was not the essence of business that it is today.
An electronic document is an "electronic representation of a page-oriented aggregation of text and graphic data, and metadata useful to identify, understand and render that data, that can be reproduced on paper or optical microform without significant loss of its information content" [2].

Shortly after these predictions were made, organisations made the distinct decision that they had to better manage their paper or be overcome by the tons of paper that were being generated in their organisations on a daily basis. At approximately the same time, the concept of the "paperless office" was born, which was to be a utopia for organisations. Take a moment to picture in your mind the vision that these early crusaders had. In this "paperless" society, organisations would not use paper for any business transactions. Instead they would use electronic documents, electronic forms, electronic mail (e-mail), and telephone calls to conduct business. Imagine what a typical office in this "paperless" world would look like: the desks would be bare of paper and have a computer monitor and a telephone on them. There would be no clutter, no chaos, no records filing rooms, no copiers, etc.

Soon after the concept was introduced, it was realised that a "paperless" office was nearly impossible to achieve, and the concept of a "less paper" office was born, which was thought to be easy to achieve. The paperless office concept could not be fully achieved due to our human dependence on paper to do our work and the lack of tools available for electronic documents. The "less paper" concept contributed to the overall acceptance of electronic documents.

[1] http://www.arma.org/erecords/index.cfm?view=publications
[2] ISO 19005-1:2005, Document management – Electronic document file format for long-term preservation – Part 1: Use of PDF 1.4 (PDF/A-1)
The explosion of information has been the result of improving digital technologies, including more powerful computation resources, faster data streams, and more storage resources. Data needs to be shared across both organisational and geographical boundaries, where storage tends to be independently managed. The ease and speed with which this data is generated and changed also make it increasingly difficult to ensure its quality. With all of this electronic information, the challenge is in finding the nuggets of information and knowledge buried under this avalanche of bits and bytes. Reliable tools to access and integrate legacy, raw and derived data, and to manage its transformation into knowledge, are in high demand. The process of transforming data into knowledge requires access and integration.

PDF is a file format originated by Adobe Systems in the early 1990s for the primary purpose of exchanging documents. PDF was developed because Adobe Systems needed to exchange documents electronically amongst its employees and the technologies at the time did not provide that capability. It was intended to make electronic documents essentially similar to their paper equivalents by being authentic, reliable and easy to use.

The PDF Reference is an open specification that defines the features and functions of the PDF file format. An open specification is one that is publicly available. Adobe made all the PDF Reference specifications freely available on its web site and allows any software developer to use the specification in designing their products.

When PDF was first introduced, it was used primarily by graphic artists, designers, and publishers for producing and exchanging proofs. Now, PDF is used to exchange all types of data including vector graphics (illustrations and designs), text, and raster graphics (photographs and other types of images).
With the evolution of the Internet and the growing use of PDF, the format has become a de facto standard for exchanging documents.

While the "paperless" office did not come to be, businesses learned a number of valuable lessons which have been implemented today. The experiment of the paperless office taught organisations that their employees were more comfortable with paper. Organisations learned that paper was easier to navigate through because it facilitated cross-referencing. It was easy to apply notes to the paper. Paper was found to facilitate collaboration efforts and made coordinating activities easier. It was not too long ago that meeting attendees used paper to follow the meeting. Now, meeting attendees rely on electronic documents and leave the paper back in the office. These characteristics provided a challenge to the software developers who were developing electronic document applications, as they needed to make it easy to navigate through the document as well as make the electronic documents as easy to use as possible.

As PDF evolved, it became the electronic equivalent of paper through the functionality and features added to the specification. This led the various PDF developers to introduce products that made the user experience of working with an electronic document seem very similar to the way users were accustomed to working with paper.

PDF as a document format is feature-rich. This richness can be a hindrance when developing applications to fit specific needs such as exchanging or archiving documents. As users became more proficient with electronic documents, they began to request functionality that built on the existing features of PDF, and new features and functionality were added to the PDF specification.
In 1994, PDF 1.1 introduced support for:
− External links
− Article threads
− Security features
− Device independent colour
− Notes

PDF 1.2, introduced in 1996, provided new features that enabled PDF to be beneficial in the prepress industry, including:
− Support for OPI (Open Prepress Interface) 1.3 specifications
− Support for the CMYK (colour model short for cyan, magenta, yellow and key (black)) colour space
− Spot colours could be maintained in PDF
− Halftone functions could be included as well as overprint instructions

In 1999, PDF 1.3 provided support for:
− 2-byte CID fonts
− OPI 2.0 specifications
− New colour space called DeviceN to improve support for spot colours
− Smooth shading, a technology that allows for efficient and very smooth blends (transitions from one colour or tint to another)
− Annotations

In May 2001, PDF 1.4 was released, providing:
− Transparency support that allows text or images to be seen through
− Improved security
− Improved support for JavaScript

PDF 1.5, released in May 2003, introduced:
− Improved compression techniques including object streams and JPEG2000 compression
− Support for layers
− Improved support for tagged PDF

In November 2006, PDF 1.7 was introduced, providing:
− Improved support for commenting and security
− Improved 3D support

In 2000, Adobe Systems initiated the first of what would become several efforts to standardise subsets of the PDF Reference for specific purposes. The first subset to be introduced became known as PDF/X, after which came numerous other ISO PDF standards.
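As an illustration of how the version history above surfaces in practice, the following is a minimal sketch, not a definitive implementation. It relies on the fact that a PDF file starts with a header comment of the form `%PDF-1.4`. The function names `pdf_header_version` and `describe` are hypothetical, and the year table simply repeats the dates given in this report (PDF 1.6 is not covered in the text, so it is not listed).

```python
import re

# A PDF file begins with a header comment such as "%PDF-1.4".
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")

# Introduction years exactly as listed in this report.
RELEASE_YEAR = {
    (1, 1): 1994,
    (1, 2): 1996,
    (1, 3): 1999,
    (1, 4): 2001,
    (1, 5): 2003,
    (1, 7): 2006,
}


def pdf_header_version(data: bytes):
    """Return (major, minor) declared in the first 1024 bytes, or None."""
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe(data: bytes) -> str:
    """Summarise the declared version using the years from this report."""
    version = pdf_header_version(data)
    if version is None:
        return "no PDF header found"
    label = "PDF %d.%d" % version
    year = RELEASE_YEAR.get(version)
    if year is None:
        return label + " (year not listed in this report)"
    return f"{label} (introduced {year})"


if __name__ == "__main__":
    print(describe(b"%PDF-1.4\n% rest of the file would follow"))
```

The header is only the version a file declares. It does not show which features the file actually uses, so a check like this is a first filter rather than a statement about archival suitability.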
The current PDF standards portfolio consists of:
− PDF – Portable Document Format
− PDF/A – Portable Document Format/Archive
− PDF/E – Portable Document Format/Engineering
− PDF/UA – Portable Document Format/Universal Access
− PDF Healthcare – Portable Document Format Healthcare
− PDF/X – Portable Document Format eXchange

2.1 Graphics Exchange

PDF/X (PDF/eXchange) is the family of standards that was, and continues to be, developed by ISO TC 130, Graphic Technology, which is responsible for developing standards for the printing and graphic technology fields. The PDF/X standard provides an efficient vehicle for exchanging files representing print-ready material. PDF/X is predominantly used by the graphic industry [3].

The family of PDF/X standards currently consists of:
− ISO 15929, Graphic technology – Prepress digital data exchange – Guidelines and principles for development of PDF/X standards
− ISO 15930-1, Graphic technology – Prepress digital data exchange – Use of PDF – Part 1: Complete exchange using CMYK data (PDF/X-1 and PDF/X-1a)
− ISO 15930-2, Graphic technology – Prepress digital data exchange – Use of PDF – Part 2: Partial exchange (PDF/X-2)
− ISO 15930-3, Graphic technology – Prepress digital data exchange – Use of PDF – Part 3: Complete exchange suitable for colour managed workflows (PDF/X-3)
− ISO 15930-4, Graphic technology – Prepress digital data exchange – Use of PDF – Part 4: Complete exchange of CMYK and spot colour printing data using PDF 1.4 (PDF/X-1a)
− ISO 15930-5, Graphic technology – Prepress digital data exchange – Use of PDF – Part 5: Partial exchange of printing data using PDF 1.4 (PDF/X-2)
− ISO 15930-6, Graphic technology – Prepress digital data exchange – Use of PDF – Part 6: Complete exchange of printing data suitable for colour-managed workflows using PDF 1.4 (PDF/X-3)

ISO 15930 specifies the use of PDF for the dissemination of complete digital data in a single exchange that contains all elements ready for final print production. This standard defines the data format and how it is to be used to ensure the file transmitted contains all the content information necessary to process and render the document as intended. Because the fonts and colours are embedded in the file, the colour is exchanged exactly as the designer intended and its hues or tones are not altered, resulting in a higher-quality image.

[3] PDF/X Application Notes

2.2 Archival

The family of PDF/Archive (PDF/A) standards was, and is being, developed by an ISO Joint Working Group under the auspices of ISO TC 171 SC2, Document Management Applications, Application Issues, in cooperation with representatives from ISO TC 130, Graphic Technology; ISO TC 46 SC11, Information and documentation – Archives/records management; and ISO TC 42, Photography. This joint working group brought together experts from the library, archival, document management, records management, graphics, government agency, industry, and software developer communities to develop this standard.

The family of PDF/A standards consists of:
− ISO 19005-1:2005, Document management – Electronic document file format for long-term preservation – Part 1: Use of PDF 1.4 (PDF/A-1)

This standardisation effort began in the United States as a joint project of AIIM, the Enterprise Content Management Association, and NPES, The Association for Suppliers of Printing, Publishing, and Converting Technologies, in response to a need raised by numerous organisations to be able to reliably preserve electronic documents. It quickly gained a great deal of visibility, with over 400 individuals requesting to be on the committee listserv to follow the committee activities. PDF/Archive began because several organisations were faced with large collections of electronic documents that they needed to manage, preserve, and make searchable and available for generations to come.
These groups began by formalising the need for a PDF archival format and identifying the business requirements. In addition to archiving the electronic documents, it was necessary to be able to reliably render the documents a hundred years from now, which meant that a format supporting long-term preservation was necessary.

In the initial stages of this effort, the committee discussed the various file formats that could be used for archiving electronic documents. These formats included TIFF (Tagged Image File Format), XML (eXtensible Markup Language), native file formats and PDF. PDF was chosen as the file format best suited for long-term preservation due to its wide adoption in numerous applications and the ease of creating PDF files from digitally born documents.

PDF is an open file format for electronic documents. While the format is considered proprietary, the specification for the file format is publicly available. Adobe Systems owns patents on the format but allows developers to use the specification, royalty free, to develop products that produce and render PDF files. Regardless of the operating system or tool used to create it, a PDF file is designed to display exactly as intended on any device using any operating system. Due to the de facto adoption of PDF, many organisations already mandate the retention of PDF documents.

PDF is recognised as being feature-rich, which can cause difficulties for specific uses such as long-term preservation. The committee recognised that PDF documents are not necessarily self-contained but may rely on system fonts and other content stored external to the file. These external links can change or break over time, so information can be lost, which creates a preservation problem.
The objectives of the PDF/A working group that guided the technical development of the standard included [4]:
− Device independence, so that files do not require a specific platform or operating system to render
− Self-contained files that include all the resources necessary for rendering
− Self-documenting files that contain their own description
− Absence of restrictive elements such as encryption that would hamper the rendering of the document years after it was originally created
− Disclosure
− Widespread use and adoption of the format

The PDF/A standard defines long-term as "the period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository, which may extend into the indefinite future" [5].

PDF/A was intended to [6]:
− Define a file format that preserves the static visual appearance of electronic documents over time
− Provide a framework for recording metadata about electronic documents
− Provide a framework for defining the logical structure and semantic properties of electronic documents

As the working group developed PDF/Archive, they identified the following as issues that needed to be resolved through the development process:
− Long-term preservation of electronic documents: PDF is feature-rich, which poses problems for the archiving of electronic documents
− PDF is not necessarily self-contained
− Lack of standardisation of PDF tools, which leads to inconsistent results
− Exchange barriers, including content such as external links which may be inaccessible
− In the engineering world, there are multiple proprietary formats which require individual and often expensive viewers
− Making PDFs fully accessible to individuals with disabilities
− Need for a secure, electronic container that can store and transmit relevant healthcare information without limitations on the type of information being transmitted while keeping the costs low
− Need to expand PDF
− Ease of use
− Reliability

The PDF/Archive file format is based on the functionality in the PDF Reference 1.4. However, in order to ensure long-term preservation, it was necessary to limit the functionality of PDF by establishing specific requirements. Therefore, the PDF/A standard specifies which features are allowed and which are not.

PDF/A-1 (ISO 19005-1) files allow [7]:
− Embedded fonts, limited to those fonts which may be legally embedded without a royalty fee
− Device-independent colour
− XMP Metadata

PDF/A-1 files do not allow [8]:
− Encryption, as the method of encryption may not be supported when the files are opened at a later date
− LZW Compression, due to intellectual property constraints
− Embedded files
− External content references, because the references may change or be broken

[4] PDF/A Application Notes
[5] ISO 19005-1:2005, Document management – Electronic document file format for long-term preservation – Part 1: Use of PDF 1.4 (PDF/A-1)
[6] PDF/A Frequently Asked Questions (FAQ)
[7] PDF/A Frequently Asked Questions (FAQ)
[8] PDF/A Frequently Asked Questions (FAQ)
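The list of features that PDF/A-1 does not allow can be pictured as a simple rule check. The following is a minimal sketch under stated assumptions: `PdfFeatures` is a hypothetical summary that some analysis tool would have to produce, the field names are invented for illustration, and the reasons are taken only from the list above. It is not a PDF/A validator and does not inspect a real file.

```python
from dataclasses import dataclass, fields


@dataclass
class PdfFeatures:
    """Hypothetical summary of one file; all field names are illustrative."""
    encrypted: bool = False
    lzw_compression: bool = False
    embedded_files: bool = False
    external_references: bool = False


# Reasons as stated in this report for each disallowed feature.
REASONS = {
    "encrypted": "Encryption: the method may not be supported at a later date",
    "lzw_compression": "LZW compression: intellectual property constraints",
    "embedded_files": "Embedded files",
    "external_references": "External content references: may change or break",
}


def pdfa1_violations(features: PdfFeatures) -> list:
    """List the disallowed features present, in field declaration order."""
    return [
        REASONS[f.name]
        for f in fields(features)
        if getattr(features, f.name)
    ]


if __name__ == "__main__":
    sample = PdfFeatures(encrypted=True, external_references=True)
    for problem in pdfa1_violations(sample):
        print("not allowed in PDF/A-1:", problem)
```

An empty result only means that none of the four listed restrictions was flagged. The allowed features (embedded fonts, device-independent colour, XMP metadata) are deliberately left out of this sketch, since the text above describes them as allowed rather than as checks.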