U.S. EPA Library Collections Digitization Process Report
September 24, 2007
Submitted to U.S. Environmental Protection Agency

TABLE OF CONTENTS

INTRODUCTION
1. DIGITIZATION PROCESS FOR THE EPA LIBRARY COLLECTIONS
1.1 PRE-SCANNING DOCUMENT AND MANIFEST RECEIPT AND PREPARATION
1.2 DOCUMENT TRACKING AND CONTROL DURING SCANNING TO DVD OUTPUT PROCESS
1.3 SCANNING TO DIGITAL IMAGE
1.4 OPTICAL CHARACTER RECOGNITION
1.5 CD/DVD CREATION
1.6 MEDIA DELIVERY QUALITY CONTROL
1.7 SHIPPING MATERIALS BACK TO EPA LIBRARIES
2. LOADING THE SCANNED IMAGES AND METADATA ON TO THE EPA NEPIS WEB SITE
2.1 RESTORE DATA TO LOCAL PC WORKSTATION AND INITIAL QC
2.2 EXPORTING DATA
2.3 RE-DISTRIBUTION OF DOCUMENTS TO INDEXES
2.4 HARDCOPY INDEX PROCESSING
2.5 PROCESS INDEXES TO BUILD SEARCHABLE DATABASES
2.6 POST-PROCESSING – GENERATING STATIC CONTENT PAGES
2.7 PREPARE DATA TO SHIP TO THE EPA NATIONAL COMPUTER CENTER
2.8 QA CHECKS FOR FUNCTIONALITY OF THE PUBLIC SERVER
3. COLLABORATION BETWEEN CONTRACTORS
4. STANDARDS AND EQUIPMENT
5. QUALITY CONTROL/QUALITY ASSURANCE
5.1 QUALITY CONTROL PLAN
5.2 QUALITY ASSURANCE IMPLEMENTATION
6. CAPABILITIES AND FEATURES OF NSCEP WEB SITE
6.1 BACKGROUND INFORMATION
6.2 CAPABILITIES AND FEATURES
CONCLUSION

LIST OF EXHIBITS

Exhibit 1. From the EPA Library to the NSCEP Web Site: A Total View of Document Digitization and Management Process Flow
Exhibit 2. Standard Scanning Specifications
Exhibit 3. Flowchart for Loading the Scanned Images and Metadata to the NEPIS Web Site
Exhibit 4. Industry Imaging Standards
Exhibit 5. Imaging Hardware and Software
Exhibit 6. NSCEP (NEPIS) ZyLab Imaging System Hardware and Software

INTRODUCTION

The U.S.
Environmental Protection Agency (EPA) requested that the EPA contractor prepare and submit a Digitization Process Report that outlines the applications, hardware, and processes and procedures used to digitize documents for inclusion in the National Service Center for Environmental Publications (NSCEP) Web site (http://www.epa.gov/NSCEP), a public Web site supported by the internal National Environmental Publications Internet Site (NEPIS). The following report describes the process of digitization of EPA documents at the EPA contractor’s Document Imaging Facility and the EPA NEPIS process. Also included in the report are discussions regarding relevant standards, an overview of quality control measures, a summary of hardware and software used, and the capabilities and features associated with the NSCEP Web site.

This report comprises six sections:
• Section 1: Digitization Process for the EPA Library Collections. This section provides an overview of the entire digitization workflow from receipt of the library collection boxes to the loading of the images and data onto the EPA NEPIS Web site.
• Section 2: Loading the Scanned Images and Metadata on to the EPA NEPIS Web Site. This section describes the process of loading the scanned documents to the NEPIS Web site.
• Section 3: Collaboration between Contractors. This section outlines the EPA contractor’s process to ensure the effective collaboration of the various units, staffs, and groups involved in the entire processing workflow.
• Section 4: Standards and Equipment. This section discusses the industry standards to which the EPA contractor adheres and identifies the equipment and software used to perform the work in a timely manner.
• Section 5: Quality Control/Quality Assurance. This section reviews the industry-standard quality assurance methodology and toolset that deliver high-quality imaging products, data, and services.
• Section 6: Capabilities and Features of the NSCEP Web Site.
A general overview of the NSCEP Web site is provided in this section.

1. DIGITIZATION PROCESS FOR THE EPA LIBRARY COLLECTIONS

This section describes the pre-scanning processing steps, including the important quality control steps of title verification and EPA publication number check. A detailed presentation of the scanning process from document preparation through optical character recognition (OCR) and output creation, including all of the critical quality control steps, is then provided. Exhibit 1 depicts the entire digitization workflow from receipt of the library collection boxes to the loading of the images and data onto the EPA NEPIS Web site. Each of the major process areas and steps is also described in further detail throughout sections 1 and 2.

1.1 PRE-SCANNING DOCUMENT AND MANIFEST RECEIPT AND PREPARATION

Upon receipt of the boxes of documents and manifests to be scanned from EPA libraries, the EPA contractor will perform the following pre-scanning preparation steps:
• Log each box in an EPA Library Document Log.
• Track each box as it goes through scanning steps.
• Compare the manifest with the contents of the boxes and note any discrepancies. Contact the library responsible for shipping the box to resolve all discrepancies.
• Check each document to ensure it has a properly formatted and standardized EPA publication number assigned. If needed, request a new publication number from appropriate NSCEP staff members.
• Compare the title on the physical document to the data entered on the manifest and update the manifest, if needed.
• Apply a publication number label to the upper right corner of the document cover, if needed.

1.2 DOCUMENT TRACKING AND CONTROL DURING SCANNING TO DVD OUTPUT PROCESS

Documents received for processing will be recorded in the Project Receiving Logs by the EPA contractor’s document preparation team.
Upon receipt, the Document Processing Supervisor and team leader will identify and inventory the containers and document files and estimate the size of the collection to optimize staffing and resource needs to meet the turnaround requirements. All materials that are received for document processing will be inspected against the transmittal documentation to ensure that all pieces have been received. Materials will be marked with unique container/box identifiers that enable a work history to be compiled that reflects all tasks completed on that container. The next step is to log materials in the container control and tracking database to maintain document chain-of-custody and integrity throughout the processing cycle. This step will also facilitate the location and retrieval of a document in the processing workflow in the event that the client requests that a particular document or publication be returned. Any information that is unique to a specific container or batch of documents will be captured in the tracking logs to ensure accountability of documents.

1.3 SCANNING TO DIGITAL IMAGE

The EPA contractor will produce clearly legible, high-quality digital images in the correct orientation and reduction that meet American National Standards Institute (ANSI) and Association for Information and Image Management (AIIM) electronic imaging standards, consistent with the specific requirements of the EPA. The EPA contractor uses scanning and imaging software packages that have been developed to implement digital imaging requirements and improved over the years for the delivery of digital images. Imaged pages will be returned to the EPA in the exact order, collation, and condition in which they were received.

1.3.1 IMAGING WORK ORDER

Each request for imaging services will be initiated with a Document Imaging Work Order completed by the project manager in consultation with the EPA.
The Document Imaging Work Order details the specific requirements and instructions and forms the basis for all work produced. The Work Order and project control procedure will ensure that product quality and workflow throughout the image processing are consistent with the requirements of the EPA. The instructions on the Document Imaging Work Order will determine the order or sequence in which materials are processed to ensure the orderly flow of documents through the imaging process. The EPA contractor supervisors will constantly monitor and track the processing pipeline to ensure that all material designated for imaging is processed as requested and that batches of documents can be immediately located and retrieved, if required. The use of processing control logs for each workflow task also enables the identification and location of documents throughout the processing cycle. 1.3.2 DOCUMENT PREPARATION Prior to scanning, staff will prepare the documents to facilitate high-speed scanning and maintain the physical integrity of batches in the collections. The EPA contractor’s imaging management will define unique document preparation and unitization instructions according to the requirements specified by the EPA. These instructions will ensure that the proper collation and integrity of the document collection, including the original source file configurations, are maintained throughout the document conversion process. Line supervisors will ensure that document preparation staff follow the instructions contained in the Document Imaging Work Order. In its preparation activities, the EPA contractor will follow where applicable the standards ANSI/AIIM TR15-1997–"Planning Considerations, Addressing Preparation of Documents for Image Capture" and ANSI/AIIM MS52-1991–"Recommended Practice for the Requirements and Characteristics of Original Documents Intended for Optical Scanning". 
The EPA contractor’s document preparation manuals contain specific instructions for materials that may require special handling, such as onionskin paper, brittle, fragile, damaged, old, or one-of-a-kind documents. Such documents may need to be scanned on the flatbed of the document scanners rather than through the auto-feed mechanism to prevent damage or mutilation during processing. Line supervisors will ensure that document preparation staff maintain records or control logs describing the nature, condition, and characteristics of documents and in accordance with the ANSI/AIIM TR31-1993–"Performance Guidelines for the Legal Acceptance of Records Produced by Information Technology Systems". Preparation staff are trained and required to use the utmost care in the handling of all client documents regardless of condition. The process of preparation and unitization will involve the removal of any binding elements (staples, paper clips, prong fasteners, metal slide clasps, etc.), and the substitution of target sheets in their place to facilitate indexing and reconstruction of documents after scanning. The EPA contractor will use a paper cutter to remove the spine from bound publications in preparation for scanning. Use of documented and established procedures, employee training, and supervision ensures that publications and other documents are not destroyed during the cutting operation. A highly efficient method of capturing metadata from document collection is the creation and use of target sheets. Specific metadata, such as document type, name, title, number, and date, will be captured from the shipping manifests and recorded in the target sheets. During a later stage in the processing workflow, the accuracy of the index information captured on the target sheets will be verified by experienced indexers. 
After scanning and image inspection have been completed, reconstruction of the scanned documents is performed by removing the target sheets and other slip sheets and re-binding the documents to the state and condition in which they were received. During document reconstruction, loose documents will be secured by multiple rubber bands or binder clips to prevent the documents from sliding out or shifting within the box during transportation. Line supervisors will inspect the reconstructed documents to ensure that no prepping target sheets are left behind in the collection and that fasteners have been replaced, where necessary. Supervision and quality control will ensure the integrity, order, and condition of the document collection prior to return to the EPA.

1.3.3 DOCUMENT SCANNING

The EPA contractor performs document scanning in accordance with ANSI and AIIM electronic imaging standards to ensure readability and admissibility in court. These standards include ANSI/AIIM MS44-1988 (R1993)–"Recommended Practice for Quality Control of Image Scanners"; ANSI/AIIM MS53-1993–"Recommended Practice; File Format for Storage and Exchange of Image; Bi-Level Image File Format: Part 1"; ANSI/AIIM TR19-1993–"Electronic Imaging Display Devices (for selecting imaging devices)"; ANSI/AIIM TR26-1993–"Resolution as it Relates to Photographic and Electronic Imaging"; and ANSI/AIIM TR38-1996–"Compilation of Test Targets for Document Imaging Systems". To ensure that the EPA receives the highest quality products and services, the EPA contractor will use the most current procedures and techniques to optimize the quality of the products of the scanning process. As part of the overall digital imaging quality control process, all of the EPA contractor’s scanning equipment will undergo regular calibration, testing, and maintenance by authorized manufacturer technicians.
The EPA contractor’s scanning software has on-screen displays that enable the operator to make the image setting adjustments needed to optimize the scanner’s output. The scanner operator will be responsible for the initial image quality review. As each page is scanned and displayed on a 19-inch monitor, the scanner operator will examine it for quality and completeness. At this time, the scanner operator can catch and rectify quality issues such as misfed pages, poor image contrast, and incomplete images. By catching errors at this step, the scanner operator can adjust the scanner’s settings to respond to the changing document conditions and sizes, and then rescan the documents immediately. During the scanning phase of a project, the scanning supervisor will remotely view random individual image files to check the quality of the scanner operator’s work. The EPA contractor’s scanning software assigns a unique, sequentially numbered file name to each imaged page as it is scanned, which becomes the identifier for the imaged page throughout the imaging process. Exhibit 2 summarizes various scanning specification standards common to the industry.

1.3.4 POST-SCANNING PROCESSING

The EPA contractor will process all scanned images through an image optimizer program running on dedicated image-enhancement servers. Imaging technicians will use the selectable functions of the program, such as black border cropping, de-skewing, rotation, de-speckling, automatic character repair, and line removal, to improve overall image quality. Image enhancement will make the images clearer for viewing, as well as provide a better source image for improved OCR accuracy. Image enhancement procedures do not change the informational content of the documents and are therefore in accordance with the ANSI/AIIM TR31-1993–"Performance Guidelines for the Legal Acceptance of Records Produced by Information Technology Systems".
Control logs will be used to record the file names of all images that are enhanced to ensure legal admissibility.

1.3.5 IMAGE INSPECTION AND QUALITY CONTROL

Following image enhancement, the images will undergo a visual quality control process in which a trained and experienced quality control staff member checks the scanned images against the original documents. The operator will check for the proper orientation and overall quality of the image. The operator will also check the image size in relation to the original document and confirm that all pages were scanned in sequence and that no pages (either single- or double-sided) were skipped. The operators will perform corrective re-scan actions, where possible. Any images deemed of poor or marginal quality will be flagged for correction. In checking for the possibility of missed pages, operators are trained to look for tip-offs such as non-sequential pagination or a missing first page of a document. As the operators identify these errors, they will re-scan, replace, and insert images, as necessary. For quality assurance, the supervisor will monitor the staff and take corrective action as required. The EPA contractor, in developing its image inspection and quality control plan, utilizes the standard ANSI/AIIM TR34-1996–"Sampling Procedures for Inspection by Attributes of Images in Electronic Image Management (EIM) and Micrographic Systems".

1.3.6 DOCUMENT INDEXING (METADATA)

Indexing of the metadata from the image collection is performed in the indexing module after image inspection. The data elements specified in the project requirements are captured using the target sheets during the document preparation phase described earlier. Document metadata created earlier in index files, such as the EPA publication number, title, number of pages, and date, will be presented in the indexing module for verification and correction. The original manifest data is used for validation.
In the module, the index code representing each data element pertaining to a given image range is verified and flagged with the code. The EPA publication number, title, number of pages, and year (in four-digit format) are thus assigned to the appropriate image ranges that make up a specific document. By comparing information available from the image of the document, the document indexers confirm the accuracy of the metadata. The number of images for each publication or document is pulled from the page-table database. 1.3.7 IMAGE ENDORSEMENT Even though the images will not be endorsed or stamped with any of the data elements, the EPA contractor’s endorsement software will generate an electronic file level index for each container of publications or documents. This cross-reference file-level index will consist of the EPA publication number, Tagged Image File Format (TIFF) image filename, document metadata (document title, four-digit year, and number of pages), document break indicator, and the container name. After page numbering and indexing, a scanning supervisor will perform an electronic quality control check, which includes an automated document data check including number format and duplication check. Any errors discovered will be recorded in an electronic log file and noted on a quality control sheet, corrected, and then reviewed with the imaging supervisor, who will discuss them with the original indexer. 1.3.8 IMAGE VOLUME STAGING Following image endorsement, selected groups of image data files will be organized and staged for volume staging based on the specifications of the Document Imaging Work Order. Images will be grouped or aggregated by document, and, because best practice dictates that documents not span container boundaries, a container of documents will always be located on a single image volume. The Premaster process will produce a cross-reference index file used for the tracking, location, storage, and retrieval of images. 
This index file may consist of the document number, title, page number, revision level, and date, as well as any other information specified in the Document Imaging Work Order. The index file may also be used to link the images to OCR text or database records, or for loading to an image viewer application.

1.4 OPTICAL CHARACTER RECOGNITION

The EPA contractor’s OCR processing software and workflow, based on state-of-the-art software and technologies, will ensure the production and delivery of optimum-quality text files with the best interpretation of each original source material. Previously scanned document digital image data will be converted to computer-readable ASCII text in the ZyLab format. The OCR volume version of a deliverable image volume will have a one-for-one image-to-OCR text correlation in an identical directory structure. This procedure ensures the order, integrity, and accountability for all source material while delivering a one-for-one image-to-text product. Preprocessing individual batches of images before full OCR production begins greatly improves the accuracy of recognition output. The preprocessing and image enhancement procedures do not change the informational content of the documents and are therefore in accordance with the ANSI/AIIM TR31-1993–"Performance Guidelines for the Legal Acceptance of Records Produced by Information Technology Systems". Control logs will be used to record the filenames of all enhanced images to ensure legal admissibility.

1.5 CD/DVD CREATION

Once all the verifications are completed, output media (CD, DVD, or removable hard drive), compliant with ISO 9660, will be produced from the premastered files. Unless otherwise requested, the output will be single-page TIFF Group IV images at 300 dpi with their corresponding OCR text files in the ZyLab format.
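The one-for-one correlation between single-page images and their OCR text files, held in identical directory structures, lends itself to an automated check. The following is a minimal sketch and not the contractor's actual tooling: the function name, the `.tif` and `.txt` extensions, and the box and page names in the usage note are assumptions made for illustration.

```python
from pathlib import Path


def unmatched_files(image_root, text_root, image_ext=".tif", text_ext=".txt"):
    """Compare an image tree with an OCR text tree.

    Returns two sorted lists of relative paths (without extension):
    images that have no text file, and text files that have no image.
    The extensions are assumptions; the report does not name them.
    """
    image_root, text_root = Path(image_root), Path(text_root)
    # Reduce every file to its path relative to the root, minus the extension,
    # so that the same page has the same key in both trees.
    images = {
        p.relative_to(image_root).with_suffix("").as_posix()
        for p in image_root.rglob(f"*{image_ext}")
    }
    texts = {
        p.relative_to(text_root).with_suffix("").as_posix()
        for p in text_root.rglob(f"*{text_ext}")
    }
    return sorted(images - texts), sorted(texts - images)
```

Calling `unmatched_files("images", "ocr")` on a volume where `images/BOX001/00000002.tif` has no counterpart under `ocr/BOX001/` would list `BOX001/00000002` in the first result. Two empty lists indicate that every image has exactly one text file and vice versa.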
After CD/DVD creation, every image on each copy of the deliverable media will be checked programmatically for compression/decompression errors, dots per inch (DPI) resolution, and page size to ensure that all images were recorded correctly on the media and in accordance with the specifications on the work order.

1.6 MEDIA DELIVERY QUALITY CONTROL

A final quality control check will be performed on all production deliverables before they are released for shipping. The work product will be cross-checked against the original files to ensure that the number of images and document records match the numbers indicated by the final scanning and image inspection logs and the database server. In addition, the scanning supervisor will perform random quality checks on the final product before it is delivered. Specified copies of the scanned images, including OCR text files and metadata, will be delivered on DVDs or other specified media. In addition to keeping backups of all images, the EPA contractor will store one copy of the delivered media to provide subsequent support services.

1.7 SHIPPING MATERIALS BACK TO EPA LIBRARIES

Upon completion of the scanning activities, the boxes of materials will be prepared for return to the EPA. Publications will be inspected to ensure the documents have been properly secured with multiple rubber bands and/or binder clips. The manifest will then be reviewed to ensure the information accurately reflects what publications are in that box and then placed into the container as a packing slip. Any disparity will be reconciled. Cushioning material will be placed in the boxes to ensure the material is secure and will not move or shift during transport, which protects it from damage in transit. Upon notification that the scanned document media has been successfully loaded into the NSCEP Web site, boxes of publications will then be shipped back to the libraries or designated locations as specified by the EPA.

2.
LOADING THE SCANNED IMAGES AND METADATA ON TO THE EPA NEPIS WEB SITE

This section details the process used to prepare the scanned images and OCR data output for the EPA National Computer Center to upload to the NEPIS Web site.

2.1 RESTORE DATA TO LOCAL PC WORKSTATION AND INITIAL QC

The data produced from the scanning and OCR operation is restored to a local workstation where quality control processing steps begin and additional manipulation is performed. The first step in this quality process is to sort through the data files to ensure that all documents are new to the system.

2.2 EXPORTING DATA

Each document file is processed to reorganize the text and place text and images on a local Cincinnati server that is used to stage the information for the eventual transition to the EPA Internet public access server at the National Computer Center (NCC) in Research Triangle Park, North Carolina. This process is called exporting, and it integrates the new document file data with the cumulative collection of over 26,000 digitized document files already in the national repository. A complete collection of EPA digitized document text, images, and metadata that can be displayed in its final form as it appears on the EPA Internet public access server is called an “index.” There are currently eight separate indexes based on publication year to increase the capacity of the system without reducing its performance. To accomplish this, the export process targets an intermediate index that serves as a workspace and distribution point. At this stage, document metadata is much more accessible and is checked to ensure completeness as another one of the quality control steps. Publication year data, for instance, will be added at this point if it is not already in place.
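The two checks described in sections 2.1 and 2.2, confirming that incoming documents are new to the system and that their metadata is complete, can be illustrated with a short sketch. This is not the NEPIS tooling: the field names, the function names, and the sample publication numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical field names; the report lists publication number, title,
# number of pages, and year as the captured metadata elements.
REQUIRED_FIELDS = ("publication_number", "title", "pages", "year")


def missing_fields(record):
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


def review_batch(records, existing_numbers):
    """Split a batch into new records and already-loaded publication numbers.

    Also returns a dict mapping each new record's publication number to the
    list of fields still to be filled in (for example, a missing year).
    """
    new_docs, already_loaded, incomplete = [], [], {}
    for record in records:
        number = record.get("publication_number")
        if number in existing_numbers:
            already_loaded.append(number)  # not new to the system
            continue
        gaps = missing_fields(record)
        if gaps:
            incomplete[number] = gaps
        new_docs.append(record)
    return new_docs, already_loaded, incomplete
```

In this sketch an incomplete record is still kept in the batch, on the assumption that gaps such as a missing publication year are filled in during the export step rather than causing the document to be rejected.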
2.3 RE-DISTRIBUTION OF DOCUMENTS TO INDEXES

The next processing step is to re-distribute the documents amongst the appropriate eight publication year range indexes that are part of the recent redesign effort implemented on the public access site. The process selects from the interim index based on the publication date year range into which the document fits.

2.4 HARDCOPY INDEX PROCESSING

The Monthly Hardcopy Document Title List is processed into a specific hardcopy system index (for hardcopy print mail orders), and metadata adjustments are made to the documents already in the new date range indexes to provide a cross correlation between documents that are available for hardcopy print mail orders and those available as digitized online page images. During this process, static content is created for the pages that indicate new documents available in hardcopy print copies and for the contents pages for foreign language documents.

2.5 PROCESS INDEXES TO BUILD SEARCHABLE DATABASES

Each of the nine index directories (the hardcopy index and the eight date range indexes) is then reprocessed by the ZyLab ZyImage software application to build the databases that enable the search of the text and its correlation to the associated images. The completed indexes are quality control spot checked to ensure that they are functioning properly with the new data.

2.6 POST-PROCESSING – GENERATING STATIC CONTENT PAGES

With the completed update indexes in hand in Cincinnati, the next step is to post-process the indexes using custom software scripting to generate the static contents pages that are provided to enumerate all available online documents in addition to the dynamically served search content.

2.7 PREPARE DATA TO SHIP TO THE EPA NATIONAL COMPUTER CENTER

All of the digitized data files required to deploy the new documents to the NCC public access system are now available.
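Each update in this step ships only the files that have changed or been added since the previous shipment. The report does not say how the comparison is made; the sketch below assumes one common approach, comparing content hashes between a saved snapshot of the working directories and their current state. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def files_to_ship(previous, current):
    """Return paths that are new, or whose contents differ, since `previous`."""
    return sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
```

A snapshot taken immediately after one shipment would be stored and passed as `previous` at the next update. This sketch does not report files that were deleted between snapshots, and it reads each file fully into memory, which may be unsuitable for very large image sets.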
Since the local development system and the public access system started with the same data, it is now possible, with each update, to compare the condition of the public system (the state of the development system immediately after shipping the last update) to the current state of the server to isolate those files that have changed or been added during the process. The identified files are harvested from their working directories and staged for shipping. The data is then once again posted to DVD format for shipment to the NCC for installation on the NCC Internet public access server at the earliest convenient date. Staff located at the NCC facility restore the data to the NEPIS public server. 2.8 QA CHECKS FOR FUNCTIONALITY OF THE PUBLIC SERVER After each update, the Web site will be checked to ensure full functionality and document appearance and quality. Any issues will be reported and action coordinated between the EPA and contractors to resolve the problem. As detailed in section 1.7, only once the final quality control review has taken place will the collection of library materials be returned to the originating library or other EPA designated repository. Exhibit 3 presents a graphical description of the digitization process workflow described in section 2. Included in the workflow diagram are the five quality control steps taken to ensure quality processing and that a high-quality image is displayed on the NEPIS Web site. Section 6 offers an overall description of the NEPIS Web site. 3. COLLABORATION BETWEEN CONTRACTORS The success of the EPA contractor’s process to digitize client material is dependent on the effective collaboration of the various units, staffs, and contractors involved in the entire processing workflow. The EPA contractor will ensure the compatibility of all system configurations, software versions, input, and output delivery formats. 
Initial testing of compatibility of inputs and outputs will be performed to ensure seamless, mistake-proof handoffs and the delivery of quality output. An initial batch of images and metadata will be delivered for uploading to the NEPIS public servers as validation of the digitization process. The EPA contractor’s approach to quality requires verification that the practices, procedures, and processes that inject quality into project operations and work products are always performed to documented standards.

4. STANDARDS AND EQUIPMENT

Throughout the digitization processes described in sections 1 and 2, applicable industry standards will be adhered to at all times. Knowledge of and disciplined adherence to applicable standards is critical to completing digitization work with acceptable quality. Exhibit 4 describes the applicable standards for the EPA library collection digitization process. The chart identifies the standard, describes the standard and its use, and also cites the applicable ISO-related document. Exhibit 5 presents hardware and software applicable for the EPA Library digitization project. The chart breaks down the equipment and software by process step and presents product information. EPA-furnished hardware and software used to prepare the scanned image for loading onto the NEPIS Web site is described in Exhibit 6. The primary suite of tools used in this process is ZyLab, an industry leader in information access solutions, specifically software for capturing, archiving, and searching, and support tools for the most common business-critical activities. ZyLab’s imaging platform applications have a unique combination of search technology, security, and business-focused content management functionality.

5. QUALITY CONTROL/QUALITY ASSURANCE

The quality assurance (QA) approach is not just a set of tools and processes, but a management philosophy where leaders are engaged at all phases of the quality process, ensuring project success.
It provides the opportunity to continuously improve and streamline processes, to eliminate waste from the system, and to determine when peak performance has been achieved. The application of proven processes will benefit EPA by improving task management, end-product quality, and significantly reducing overall project risk. 5.1 QUALITY CONTROL PLAN The EPA contractor’s Quality Control Plan (QCP) developed and implemented for all programs identifies QA functions, quality control processes and measures, and overall QA program responsibilities in support of each project. The plan includes quality assurance and quality control procedures that ensure that all deliverables conform to specifications and client expectations. These procedures start with the definition of project performance requirements and include procedures and processes to ensure the quality of all work performed. This includes, but is not limited to, developing procedures, recruiting and training personnel, developing quality measurement methods, analyzing performance data, obtaining client feedback, and developing and implementing corrective/preventive actions. For the EPA Library Digitization project, the EPA contractor will review the QCP to ensure that our processes and compliance requirements are fully documented, understood, and implemented. The updated plan will provide focus on our customer feedback and guidance on continuous process improvement, thus allowing optimized service and product delivery to the EPA. 5.2 QUALITY ASSURANCE IMPLEMENTATION The EPA contractor’s approach to quality requires verification that the practices, procedures, and processes that we use to engineer quality into project operations and work products have been performed to documented standards. 
As part of our overall quality assurance, the EPA Library Digitization project will have a QA plan that includes conducting a series of in-process, multilevel, and multiphase reviews and audits that ensure ongoing high-quality service delivery and provide our management and processing teams with predictive data that can be used to identify potential quality risks. To ensure that all products and services delivered under the EPA Library Digitization project conform to the appropriate performance and industry standards, we will employ our QA processes as follows: • Schedule meetings for the project manager and the program quality assurance manager to define the scope, schedule, and budget for quality assurance with the Contracting Officer’s Technical Representative (COTR). • Develop the quality measurement method required to ensure that all performance requirements are met. • Train staff in the specific QA procedures required to perform QA activities for products/services in their area of expertise. • Perform QA activities in accordance with established EPA procedures and regulations. • Develop and implement an aggressive improvement plan to meet all customer requirements when a quality issue is identified. The EPA contractor brings a culture of quality and a relentless pursuit of operating excellence through the systematic use of proven best practices, industry standards, and continuous process improvement initiatives that will be leveraged to provide accurate, consistent, and timely information and products for the EPA Library Digitization project. 6. CAPABILITIES AND FEATURES OF NSCEP WEB SITE The visible end product of the digitization of EPA publications resides on the EPA’s National Environmental Publications Internet Site (NEPIS)/National Service Center for Environmental Publications (NSCEP) Web site. This site is intended to help users identify and order EPA products. 
More than 7,000 in-stock and 26,000 digital titles are available free of charge to search and retrieve, download, print, and/or order. The Web site URLs are: http://www.epa.gov/nscep/ http://www.epa.gov/ncepihom/ http://nepis.epa.gov/ 6.1 BACKGROUND INFORMATION The NEPIS Web site started back in 1995–1996 with an initial scan effort of approximately 4,000 documents. By 2001, the site contained 7,000 titles. Between 2001 and 2007, the site grew to about 11,000 titles consisting of about one million pages. About 18 months ago, the EPA migrated all of the Web site content to the ZyLab application currently active and running the site. Two forces have been at work shaping NEPIS over the last year. First, at the beginning of the year, a desire was expressed to create a "one-stop" location that would combine both the NSCEP hard copy shopping cart feature and NEPIS image and text retrieval services as seen by the public, thus minimizing navigation. The original concept was to accomplish this with a Web site restructuring. When a prototype was presented in March 2006 that demonstrated the desired functionality entirely within the new ZyLab-based service, this was selected as the preferred approach. The second force in play is the influx of EPA documents from regional libraries that, in the space of six months, has doubled the size of the repository to 26,000 titles (approximately 2.4 million pages). This spurred a recent reorganization of the data into multiple repositories sorted by publication date. While the shift to multiple indexes was temporarily disruptive (it changed document URLs), it was necessary to ensure sufficient growth capacity and continued high performance into the future. 6.2 CAPABILITIES AND FEATURES The following section details the capabilities and features of the current NEPIS/NSCEP Web site. 6.2.1 NAVIGATION The navigation menus and tools are consistent and clearly labeled throughout the site. 
The site offers a left-hand navigation bar that includes links to: basic information, simple search, customer survey, where you live, resource tools, related links, foreign language publications, and a site map. To further help the user with navigation, a breadcrumb trail (e.g., EPA Home > NSCEP > Document Display) is displayed at the top of the page. This feature allows the user to follow their path in reverse order or jump back multiple pages. The site also uses consistent and clearly labeled headers and footers to further enhance navigability. The header displays a “Contact Us” link and a search of “All EPA” or “This Area.” The footer contains links to “EPA Home,” “Privacy and Security Notice,” and “Contact Us.” 6.2.2 SEARCH TOOLS Since the site is a publications database, the majority of the home page is devoted to searching for products. There are several tools to search the database, including: • Simple search. This search allows the user to enter keyword(s) and limit by date range(s) and/or if the publication is available in hard copy. • Advanced search. This robust search allows the user to enter keyword(s) and then limit or refine using any or all of eight parameters, including: 1. How to look for it. This drop-down menu allows the user to use Boolean operators (i.e., and, or, not), search for all, any, or exact phrase within the keyword(s) field. 2. Date document was added to the library. This drop-down menu allows the user to limit the search to today, yesterday, last week, last two weeks, last month, last three months, last six months, last year, and last two years. 3. Results precision (fuzziness). A fuzzy search is a search capability that locates all occurrences of a word, including those that are “close” in spelling. This drop-down menu allows the user to select from one to four character difference or exact match. 4. Select dates to search. These checkboxes allow the user to limit the search to a date range(s). 5. 
Control which page will be viewed first. The radio buttons allow the user to either select “first page of document” or “first page found with matching search keyword(s).” 6. Choose the number of results to be displayed. This drop-down menu allows the user to view 10 (default), 25, 50, or 100 search results on a page. 7. Rank search results. There are two drop-down menus that allow the user to rank the search results by “hit density” or “number of hits” and in descending or ascending order. Hit density describes how a publication is judged more relevant if a large number of the total number of words in it are query words or related terms. Number of hits describes how many times the query word(s) are found in a publication. 8. Choose what to display. The radio buttons allow the user to either select “page images” or “text format.” If the user selects “page images,” it can be refined to low, medium, or high quality; enlarged in size to 200 percent, 280 percent, or 400 percent; or a TIFF. • Fields search. This search allows the user to enter query terms within the area defined as the field. The user can search by EPA publication number, title, source, page count, and publication year, and limit the search by some of the parameters that were described in the advanced search section. • Title lists. This search provides static lists of publications representing the complete inventory. Lists are ordered by publication number for the associated Office of origin: •• 400 Series, Office of Air and Radiation. •• 500 Series, Office of Solid Waste and Emergency Response. •• 600 Series, Office of Research and Development. •• 700 Series, Office of Prevention, Pesticides and Toxic Substances. •• 800 Series, Office of Water. •• All other designations. •• The complete list. • New titles. This search allows the user to search a selection of newly received publications available to order or, in some instances, to view online. This list is updated monthly. 
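The fuzziness setting and the two ranking options of the advanced search can be illustrated with a minimal Python sketch. It is not the ZyLab implementation: the function names are hypothetical, fuzziness is modeled as Levenshtein edit distance, and hit density is approximated as the number of hits divided by the total number of words, which is an assumption based on the description above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def count_hits(words, query, fuzziness=0):
    """Count words within `fuzziness` character differences of the query.

    fuzziness=0 is an exact match (ignoring case).
    """
    q = query.lower()
    return sum(1 for w in words if edit_distance(w.lower(), q) <= fuzziness)


def rank(documents, query, by="hit density", fuzziness=0, descending=True):
    """Rank documents (title -> text) by hit density or by number of hits.

    Assumption: hit density = hits / total words in the document.
    Documents with no hits are left out of the result.
    """
    scored = []
    for title, text in documents.items():
        words = text.split()
        hits = count_hits(words, query, fuzziness)
        if hits == 0:
            continue
        score = hits / len(words) if by == "hit density" else hits
        scored.append((title, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)
```

Under these assumptions, a short publication with two hits can outrank a long publication with three hits when ranking by hit density, while the order reverses when ranking by number of hits.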
6.2.3 SEARCH TECHNIQUES The site allows for many search techniques that more advanced users employ in their queries, such as: • Wild cards. Wild card symbols added to content words add flexibility to search statements. Use wild cards to search for a prefix, root, or suffix, and to find variations in the spelling of a word. The NEPIS/NSCEP Web site uses two wild card symbols: the question mark (?) and the asterisk (*). The question mark replaces a single character (e.g., b?rn retrieves born, barn, and burn) and the asterisk replaces zero or more characters (e.g., *vert retrieves convert and revert). • Boolean operators. The operators AND, OR, and NOT create a relationship between the query terms. AND signifies that all of the terms should be present. OR means that at least one of the terms should be present. NOT states that the term(s) after it should not be present. • Positional operators. These operators identify either a required proximity between content words, or a content word’s proximity to other document elements. The WITHIN and PRECEDES operators help ensure that search terms are contextually related. • Number range operators. Users can search for numbers both as “terms” (alphanumeric character strings) and as numeric values. The following math operators can be used in number range searches: •• < less than •• <= less than or equal to •• = equal to •• <> not equal to •• > greater than •• >= greater than or equal to • Quorum operator. The quorum operator searches for a specified number of terms within a search query, from one to all. • Separators. Separators limit a search to a physically defined range of a text file. In this sense, they are similar to proximity search statements. •• EOG end of page •• EOL end of line •• EOP end of paragraph •• EOS end of sentence 6.2.4 SEARCH RESULTS The term “search results” refers to the list of publications retrieved based on the parameters of the search.
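The search techniques in section 6.2.3 determine which publications reach that list. The Python sketch below gives a rough illustration of three of them: wild cards, Boolean operators, and number range operators. It does not reproduce the ZyLab query syntax or engine; every function name is hypothetical, and documents are treated simply as lists of words.

```python
import operator
import re


def wildcard_to_regex(pattern):
    """Translate ? (exactly one character) and * (zero or more) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def match_words(pattern, words):
    """Return the words that match the whole wild card pattern."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]


def boolean_match(words, all_of=(), any_of=(), none_of=()):
    """AND: every term in all_of; OR: at least one of any_of; NOT: no none_of."""
    present = {w.lower() for w in words}

    def has(term):
        return term.lower() in present

    return (all(has(t) for t in all_of)
            and (not any_of or any(has(t) for t in any_of))
            and not any(has(t) for t in none_of))


# The six comparison operators used in number range searches.
NUMBER_OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "<>": operator.ne, ">": operator.gt, ">=": operator.ge,
}


def numbers_matching(words, op, value):
    """Return the numeric tokens that satisfy `token op value`."""
    compare = NUMBER_OPS[op]
    found = []
    for w in words:
        try:
            n = float(w)
        except ValueError:
            continue  # not a number; it can still be searched as a term
        if compare(n, value):
            found.append(n)
    return found
```

For example, `match_words("b?rn", ["born", "barn", "burn", "brn"])` keeps the first three words and drops "brn", because the question mark must stand for exactly one character.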
The search results list is initially sorted by hit density and includes the publication number, document title, date of publication, the number of pages in the publication, status (whether or not it is viewable), and a shopping cart icon, if it is available to order in hard copy. Illustrated below is an example of the search results display. (image of NSCEP table of selected publications) 6.2.5 DOCUMENT DISPLAY The site uses an icon bar that allows the user to move through a publication and through the search results. The document display lists information that is common in search results, such as a document counter, page counter, and direct page links. The document display also allows the user to zoom in and out on the publication, add to shopping cart, and download in several formats, including text, TIFF, and PDF. The document properties (e.g., hit density, rank, size, file name, etc.) can also be reached from this page by clicking on the (image of double-down arrow icon) icon. The user can easily find their search keyword(s) in the publication since the term(s) are highlighted in yellow. By clicking on the “next hit” icon (image of right-pointing arrow icon) the user can automatically go to the next page containing the keyword(s) without having to skip over intervening pages that do not contain the keyword(s). Depicted below are the features of the document display. (image of toolbar with horizontal collection of labeled control/selection icons) 6.2.6 NSCEP WEB SITE CUSTOMER SATISFACTION SURVEY The NEPIS/NSCEP Web site utilizes a structured survey form to find out how well it is meeting the needs and expectations of its users. The survey results can be used to make improvements to the site to better meet the needs of its users. The survey has been approved by the Office of Management and Budget (OMB) (control number 20900019).
6.2.7 FOREIGN LANGUAGE PUBLICATIONS The NEPIS/NSCEP Web site offers a selection of publications translated into 18 different languages, including: Arabic, Cambodian, Chinese, Chinese (Mandarin), French, German, Haitian (Creole), Hmong, Italian, Japanese, Korean, Laotian, Ilocano, Portuguese, Russian, Spanish, Tagalog, and Vietnamese. This dynamic list may change depending on the foreign language publications available through NSCEP. 6.2.8 RELATED LINKS The site offers additional resources through its “Related Links” page to help users find the information for which they are searching, but did not find on the NEPIS/NSCEP Web site. The related links include: • Government Printing Office • National Technical Information Service • EPA Program Publications • Information Products Bulletin • Test Methods • Federal Register • Laws and Regulations 6.2.9 SITE MAP The site map is a hierarchical view listing all of the sections of the Web site. Each listing in the site map is a hyperlink to its respective area. 6.2.10 RESOURCE TOOLS The “Resource Tools” section provides links to tools that help users understand EPA information. The tools in this section are: • Frequent Questions. There are 17 frequently asked questions and responses. Questions include: Is there a charge for my document?; Are documents available online?; and How do I order an EPA publication? • EPA Publications Numbering System. This table provides the alphanumeric codes and their corresponding definitions. • Understanding EPA Terminology. This section provides two links that go outside of the NEPIS/NSCEP Web site and help users understand terms, acronyms, abbreviations, and the organizational structure of the EPA. • Frequent EPA Questions. This link takes users outside of the NEPIS/NSCEP Web site to the EPA site’s “Frequent Questions.” 6.2.11 HELP This section of the Web site describes and provides examples of simple and advanced searches, as well as search techniques.
This section also includes a description of the icon tool bar. The Web site is also equipped with Bandwidth Conservation so that search functions accommodate both broadband and dial-up users. 6.2.12 FUTURE The site is being updated constantly. There is a pilot program currently in development that will enable PDF formats to be linked, providing access to the publications in color. Additional features may be added as the pilot progresses. CONCLUSION Since 1998, the EPA contractor has scanned more than 30 million pages for a variety of clients. Some of the EPA contractor’s more notable projects have included: • 26.9 million pages scanned since 1998, including 14.5 million pages since 2002, for the U.S. Department of Justice (DOJ) Office of Litigation Support Contracts Mega1 and Mega2. • 2.4 million pages scanned in 2002–2003 for the Headwaters project (Civil Division). • More than 553,000 pages scanned to PDF format for the NASA Columbia Accident Investigation Board. • 1.31 million pages scanned for the 9/11 Victims Compensation Fund adjudication effort in 2003. In addition, the EPA contractor has rendered OCR digital images on more than 12.9 million pages since 1998. The large volume of documents processed by the EPA contractor has allowed team members at the EPA contractor’s Document Imaging Facility to obtain extensive professional expertise. To complement this hands-on experience, the managers and supervisors at the EPA contractor’s Document Imaging Facility have a combined 39 years of experience in the document scanning and imaging services industry. And, to make the most of this professional experience, the EPA contractor equips its Document Imaging Facility with the latest technology resources. This report thoroughly describes the processes and procedures the EPA contractor employs to ensure the quality and integrity of scanned documents.
From receipt of the document to final delivery, this proven process allows the EPA contractor to carefully track the progress and ensure the quality of each document. The end result is a near identical image of the printed document. In fact, the EPA contractor has delivered high-quality imaging products on time with Image CD acceptance rates consistently higher than 98 percent since 2000. The EPA contractor believes that its experience and expertise, its commitment to producing and delivering high-quality imaging products and services on time, the emphasis on meeting critical customer values, and superior technical knowledge, capability, and responsiveness underscored by continuous process improvement initiatives will serve the EPA well. Exhibit 1. From the EPA Library to the NSCEP Web Site: A Total View of Document Digitization and Management Process Flow. (logic diagram image) Exhibit 2. Standard Scanning Specifications. SPECIFICATION: File Format STANDARD: CCITT Group IV TIFF DESCRIPTION: TIFF format is standard in document imaging and document management systems. It is normally used with CCITT Group IV 2D compression, which supports black-and-white (also called bitonal or monochrome) images. In high-volume environments, documents are typically scanned in black and white (rather than color or grayscale) to conserve storage capacity. An average A4 scan produces 30 kilobytes (KB) of data at 200 ppi (pixels per inch resolution) and 50 KB of data at 300 ppi. 300 ppi is far more common than 200 ppi. Because TIFF format supports multiple pages, multi-page documents can be saved as single TIFF files rather than as a series of files for each scanned page. SPECIFICATION: Resolution STANDARD: 300 DPI DESCRIPTION: Image resolution refers to the spacing of pixels in an image and is measured in pixels per inch (ppi), which is commonly referred to as dots per inch, or dpi. The higher the resolution, the more pixels in the image. 
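The relationship between resolution and pixel count can be worked through with a small Python sketch. It assumes a U.S. letter page of 8.5 x 11 inches and a bitonal image stored at one bit per pixel with each row padded to a whole byte; the function names are hypothetical, and treating 1 KB as 1,024 bytes is also an assumption.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bitonal_bytes(width_in, height_in, dpi):
    """Raw size of a 1-bit-per-pixel image, rows padded to whole bytes."""
    width_px, height_px = pixel_dimensions(width_in, height_in, dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


def compression_ratio(raw_bytes, compressed_kb):
    """Ratio of raw size to compressed size (assuming 1 KB = 1,024 bytes)."""
    return raw_bytes / (compressed_kb * 1024)


if __name__ == "__main__":
    for dpi, typical_kb in ((200, 30), (300, 50)):
        w, h = pixel_dimensions(8.5, 11, dpi)
        raw = uncompressed_bitonal_bytes(8.5, 11, dpi)
        ratio = compression_ratio(raw, typical_kb)
        print(f"{dpi} dpi: {w} x {h} pixels, {raw:,} bytes raw, "
              f"about {ratio:.0f}:1 against a {typical_kb} KB file")
```

At 300 dpi the letter page is 2,550 x 3,300 pixels, roughly 1 MB uncompressed, so a 50 KB Group IV file corresponds to a compression ratio of about 20:1 under these assumptions.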
Higher resolution allows for more detail and subtle color transitions in an image. A printed image that has a low resolution may look pixelated or made up of small squares with jagged edges and without smoothness. SPECIFICATION: Page Size STANDARD: 8.5" x 11" DESCRIPTION: Minimum auto-feed size is 2.1" x 2.9"; maximum auto-feed size is 11" x 17"; maximum flatbed size is 11" x 17". Other larger-sized documents may be scanned or digitized using specialized large-format scanning equipment. SPECIFICATION: Page Orientation STANDARD: Portrait or Landscape DESCRIPTION: Regular documents are captured in portrait mode. Other documents, such as spreadsheets, are captured in landscape mode (Top-to-Top) so that they are readable without the need to rotate the image. Exhibit 3. Flowchart for Loading the Scanned Images and Metadata to the NEPIS Web Site. (logic diagram image) Exhibit 4. Industry Imaging Standards. STANDARD: ANSI/AIIM TR15-1997 TITLE: Planning Considerations, Addressing Preparation of Documents for Image Capture DESCRIPTION AND USE: This report is indispensable for organizations considering image capture to convert existing record collections. The physical preparation of documents for image capturing systems is outlined in the planning considerations and described in a set of generic procedures. ISO Related Document: ISO 12652 STANDARD: ANSI/AIIM TR19-1993 TITLE: Electronic Imaging Display Devices (for selecting imaging devices) DESCRIPTION AND USE: This technical report provides information on various aspects of EIM display technology when selecting a computer display for EIM applications.
ISO Related Document: STANDARD: ANSI/AIIM TR26-1993 TITLE: Resolution as it Relates to Photographic and Electronic Imaging DESCRIPTION AND USE: This in-depth discussion of resolution in imaging systems describes what the term "resolution" means to various photographic and electronic imaging resolution and applies it to the evaluation of photographic and electronic imaging products. ISO Related Document: ISO 14989 STANDARD: ANSI/AIIM TR31-:2-1993 TITLE: Performance Guidelines for the Legal Acceptance of Records Produced by Information Technology Systems Part 2: Acceptance by Government Agencies DESCRIPTION AND USE: The guideline addresses laws enacted by government that affect personal or business recordkeeping practices and the provisions that require records to be kept available for audit or submission or establish the form of records. ISO Related Document: STANDARD: ANSI/AIIM TR34-1996 TITLE: Sampling Procedures for Inspection by Attributes of Images in Electronic Image Management (EIM) and Micrographic Systems DESCRIPTION AND USE: This technical report contains procedures that may be used to select and apply sampling inspection plans to determine if a batch of electronic images meets specified quality guidelines. ISO Related Document: STANDARD: ANSI/AIIM TR38-1996 TITLE: Compilation of Test Targets for Document Imaging Systems DESCRIPTION AND USE: This technical report is a directory of the most commonly used test charts and test patterns used in document imaging applications, such as electronic document imaging, facsimile, micrographics, or photocopying document imaging components. The directory includes information regarding the type of chart, its primary application, the characteristics tested, source, etc.
ISO Related Document: ISO 11141 STANDARD: ANSI/AIIM MS44-1988 (R1993) TITLE: Recommended Practice for Quality Control of Image Scanners DESCRIPTION AND USE: Adopted as a Federal Information Processing Standard (FIPS), MS44 provides procedures for the ongoing control of quality within an electronic image management (EIM) system from input to output. Regular use of these procedures should ensure that the established level of quality is maintained. ISO Related Document: ISO 216-1975 STANDARD: ANSI/AIIM MS52-1991 TITLE: Recommended Practice for the Requirements and Characteristics of Original Documents Intended for Optical Scanning DESCRIPTION AND USE: This standard describes the physical characteristics of original documents which will facilitate scanning of the documents. It also identifies those characteristics that will make scanning difficult or impossible. Furthermore, this standard provides general recommendations for the design of documents in order to make those documents easier to process. ISO Related Document: ISO 10196 STANDARD: ANSI/AIIM MS53-1993 TITLE: Recommended Practice; File Format for Storage and Exchange of Image; Bi-Level Image File Format: Part 1 DESCRIPTION AND USE: This standard specifies a file format for the exchange of bilevel electronic images coded using CCITT Recommendations T.4 and T.6, (Group 3) plus bit-mapped images (having no compression). This standard puts into one standard, a self-contained file format for bi-level image file transfer in imaging environments. 
ISO Related Document: STANDARD: ANSI/AIIM MS55-1994 TITLE: Recommended Practice for the Identification and Indexing of Page Components (Zones) for Automated Processing in an Electronic Image Management Environment (for Zone OCR quality control) DESCRIPTION AND USE: This document identifies a media- and application-independent document structure and indexing scheme that will allow necessary and sufficient description of document pages and zones (rectangular sub-areas) within a page. The document addresses the information required for automatic location of areas (zones) to be analyzed, enhanced, or compressed. Not applicable for EPA digitization. ZyLab OCR technology uses the coordinates within the image for character location and identification. ISO Related Document: STANDARD: ISO 19005-1 TITLE: Document Management-Electronic Document File Format for Long-Term Preservation-Part 1: Use of PDF (PDF/A) DESCRIPTION AND USE: PDF/A is suggested as a preferred format for page-oriented textual (or primarily textual) documents when layout and visual characteristics are more significant than logical structure. For PDFs based on page images digitized by scanning, the source images are considered the master format if available and PDFs created from those images may be optimized for access convenience. Not applicable for EPA digitization. Required delivery to EPA is TIFF images. ISO Related Document: Exhibit 5. Imaging Hardware and Software. 
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Kodak i660 (black-and-white and color) Rated Speed - pages/min: 120 Rated Resolution - resolution: Up to 800 DPI Software: EPA Contractor Proprietary
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Fujitsu fi-4750C (black-and-white and color)* Rated Speed - pages/min: 90 Rated Resolution - resolution: Up to 800 DPI Software: EPA Contractor Proprietary
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Fujitsu M4097D* Rated Speed - pages/min: 90 Rated Resolution - resolution: Up to 600 DPI Software: EPA Contractor Proprietary
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Kodak 1500 Rated Speed - pages/min: 50 Rated Resolution - resolution: Up to 600 DPI
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Kodak 3520 Rated Speed - pages/min: 90 Rated Resolution - resolution: Up to 600 DPI Software: EPA Contractor Proprietary
Equipment (Model) Description: Document Scanners (Use: Paper Scanning): Dell PC - Optiplex GX270 2.80 GHz, 1 GB RAM with MS Windows XP, Pentium 4 and 19" Monitor Rated Speed - pages/min: (blank) Rated Resolution - resolution: (blank) Software: (blank)
Equipment (Model) Description: Workstations (Use: Image Inspection; Document Metadata QC and Verification): Dell PC - Optiplex GX270 2.40 GHz, 1 GB RAM, with MS Windows XP, Pentium 4 and Dual 19" Monitors Rated Speed - pages/min: (blank) Rated Resolution - resolution: (blank) Software: EPA Contractor Proprietary
Equipment (Model) Description: Unattended Backend Workstations (Use: Digital Image Enhancement, Endorsement, Staging, CD/DVD Burning): Dell PC - Optiplex GX270 2.80 GHz, 1 GB RAM, with MS Windows XP, Pentium 4 Rated Speed - pages/min: (blank) Rated Resolution - resolution: (blank) Software: EPA Contractor Proprietary
Equipment (Model) Description: OCR Processing (Use: Process Digital Image to OCR Text): Dell PC - Optiplex GX270 2.80 GHz, 1 GB RAM, with MS Windows XP, Pentium 4 Rated Speed - pages/min: 1,050 Rated Resolution - resolution: 16 Software: ZyLab OCR and Index
Equipment (Model) Description: PDF Capture (Use: Convert Digital Images to PDF Format): Dell PC - Optiplex GX270 2.80 GHz, 1 GB RAM, with MS Windows XP, Pentium 4 Rated Speed - pages/min: 950 Rated Resolution - resolution: 16 Software: Adobe Acrobat Capture 3.1 Unlimited Cluster License
* Includes flat-bed capability.

Exhibit 6. NSCEP (NEPIS) ZyLab Imaging System Hardware and Software.
Equipment Type: Dell PowerEdge 2650 Dell Service Company: Dell
Equipment Type: Dell PowerEdge 2950 Dual Core Add-On Disk Storage Service Company: Dell
Equipment Type: Dell PowerEdge 2650 Service Company: Dell
Equipment Type: Dell PowerEdge 2950 Dual Core Add-On Disk Storage Service Company: Dell
ZyLab Cluster PCs Used to Attach to Scanner for Creating TIFF Files (Housed in Computer Room 308A)
Equipment Type: Dell PC - Optiplex GX270 2.80 GHz, 512K/533MHz, FSB with MS Windows 2000 Professional, Pentium 4 Service Company: Dell
Equipment Type: Dell PC - Optiplex GX270 2.80 GHz, 512K/533MHz, FSB with MS Windows 2000 Professional, Pentium 4 Service Company: Dell
ZyLab Cluster PC Used to Attach to Scanner for Creating TIFF Files
Equipment Type: Dell PC - Optiplex GX270 2.80 GHz, 512K/533MHz, FSB with MS Windows 2000 Professional, Pentium 4 Service Company: Dell
Equipment Type: Fujitsu Color Scanner # 1-Model No.:fi-5750C Service Company: Fujitsu
Equipment Type: Fujitsu Color Scanner # 2-Model No.:fi-5750C Service Company: Fujitsu
Equipment Type: Fujitsu Color Scanner # 3-Model No.:fi-5750C Service Company: Fujitsu
Equipment Type: Fujitsu Color Scanner # 4-Model No.:fi-5750C Service Company: Fujitsu
Equipment Type: ZyIMAGE Enterprise Webserver License Service Company: ZyLab NA
Equipment Type: ZyALERT Service Company: ZyLab NA
Equipment Type: ZyIMAGE v5 Basic License includes International OCR Engine Service Company: ZyLab NA
Equipment Type: ZyIMAGE Global Professional OCR Engine per CPU Service Company: ZyLab NA
Equipment Type: ZyIMAGE v5 Workstation License Service Company: ZyLab NA
Equipment Type: ZyIMAGE Global Standard OCR Engine per CPU Service Company: ZyLab NA
Equipment Type: ZyIMAGE Application Integrator (ZAI) SW license Service Company: ZyLab NA
Equipment Type: ZyIMAGE V5.0 XML Wrapper Module SW license Service Company: ZyLab NA
ZyLab Software to Operate PC Used to Attach to Scanner #1 to OCR ZyLab TIFF Files
Equipment Type: ZyScan License (1 workstation) Service Company: ZyLab NA
Equipment Type: ZyIMAGE Global Professional OCR engine software for each CPU Service Company: ZyLab NA
ZyLab Software to Operate PC Used to Attach to Scanner #2 to OCR ZyLab TIFF Files
Equipment Type: ZyScan License (1 workstation) Service Company: ZyLab NA
Equipment Type: ZyIMAGE Global Professional OCR engine software for each CPU Service Company: ZyLab NA
ZyLab Software to Operate PC Used to Attach to Scanner #3 to OCR ZyLab TIFF Files
Equipment Type: ZyScan License (1 workstation) Service Company: ZyLab NA
Equipment Type: ZyIMAGE Global Professional OCR engine software for each CPU Service Company: ZyLab NA
CA Unicenter Remote Software Packages (Installed on local and national NSCEP (NEPIS)/ZyLab PCs and servers.)