Impressions White Paper on Integrated Indexing

The trend toward electronic publishing has not diminished the need for indexing. Rather, it has put new demands on both indexes and indexers. An electronic text file, especially one that contains a structured mark-up language, can be put to many uses. A single book may be reprinted in different formats (different trim size, multiple volumes, on-line, CD, etc.). An index that can travel with the electronic text and be independent of page numbers is very useful. But how can that kind of index be created? Standard indexing relies on stable, printed pages with set page numbers. Insertion of the index terms into the text is now possible, but because of the lack of suitable, well-designed tools, creating an index with this method usually compromises the quality of the index, the speed of the indexing, or both. Our goal was to create a high-quality index, mark the text in the electronic files with index information, and keep indexing time and errors to a minimum. We brought together people with expertise in indexing, programming, and mark-up languages to develop a process to accomplish this goal.

Indexing encompasses a variety of purposes, from automatically generated lists of names to complex associations of concepts and facts, and all of these may be found within contexts as diverse as databases, on-line documents, and printed books. An understanding of the different types of indexes is crucial to efficient, accurate index creation.

The Basic Types of Indexes

Standard Index

The standard back-of-the-book index must be created by an indexer; its creation is an intellectual process. The index’s complexity can vary widely, but it normally includes the concepts covered in the book, a deliberate and planned structure including double-posting and cross references, and a carefully considered vocabulary. Although this type of index can be created completely without electronic tools and files, the indexer normally works from hard copies of electronically typeset book pages and uses dedicated indexing software. Such software provides a wide range of powerful tools for formatting and structuring the index, allowing the indexer to concentrate on the quality of the index while minimizing time and drudgery. A standard index depends on text and page numbers that are relatively final, so indexing must take place after the typesetter produces relatively final pages. Any changes to the index caused by reflow must be made by hand later. The index is useful for only that one rendition of the book; for example, reprinting the unaltered text in a larger trim-size format makes the original index’s page locators obsolete.

Extracted Index

An extracted index can be generated completely without the services of an indexer. Key words or types of constructions (italics, headings, etc.) can be searched for and extracted from typeset electronic files, along with their page numbers. This approach is useful for simple lists (names, headings) or a concordance of terms, but it does not include the intelligence of a standard index and is therefore of limited use. It is easily and quickly created but may contain superfluous entries or miss some information. It cannot include concepts or ranges of text, and it cannot include more than one level (main entry only). Like a standard index, it depends on relatively stable content and page numbers. A new extracted index must be generated for each rendition of a book.
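The extraction described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each typeset page is available as a string and that italics are marked with `<i>...</i>`; the sample pages and the function name `extract_index` are hypothetical. Note that the result has a single level and single-page locators only, which reflects the limitations of this type of index.

```python
import re
from collections import defaultdict

# Hypothetical input: one string per typeset page, italics marked as <i>...</i>.
pages = [
    "The <i>concordance</i> is the simplest form of index.",
    "A <i>locator</i> points to a page; a <i>concordance</i> lists every word.",
]

ITALIC = re.compile(r"<i>(.*?)</i>")

def extract_index(pages, first_page=1):
    """Collect each italicized term with the pages on which it appears."""
    entries = defaultdict(set)
    for page_number, text in enumerate(pages, start=first_page):
        for term in ITALIC.findall(text):
            entries[term.strip().lower()].add(page_number)
    # Sort terms alphabetically and page numbers numerically.
    return {term: sorted(nums) for term, nums in sorted(entries.items())}

extracted = extract_index(pages)
for term, nums in extracted.items():
    print(f"{term}, {', '.join(map(str, nums))}")
```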

Embedded Index

An embedded index consists of the index terms themselves embedded within the book’s text (using electronic files). Concepts can be included, but the indexing of ranges of text may or may not be allowed, depending on the software being used. Multiple levels are possible but may be cumbersome for the indexer to create and keep track of. It is harder to control vocabulary, cross references, double-posting, etc. because most of the tools found in dedicated indexing software are not available in embedded indexing software. For example, viewing the index while inserting entries is not possible, and a compiled version of the index can be viewed but not edited. Because the terms themselves are keyed in, errors are likely. The applications used to accomplish embedding have been developed primarily to allow insertion of tags into typesetting or word processing files; they vary widely and lack the features of dedicated indexing software. Embedded indexing takes more time than standard indexing (roughly three times longer) and results in a lower-quality index. Indexing can commence before typesetting if significant alterations in the text are not expected. Reflow is not a problem, but edits to the text can damage or eliminate the embedded index information.

Integrated Index

The types of indexes described above have significant limitations. In response, Impressions developed the integrated index, which combines the use of dedicated indexing software with mark-up language in the electronic text files. We can now create a high-quality, complex index that is integrated with the electronic text, effectively combining the best features of the standard index with those of the embedded index. It is created in a minimum of time with a minimum of error and interference with the indexing process. The index is not created within the typesetting software, and the index terms themselves are not embedded within the text. However, the typesetting software must understand structured mark-up language and be able to differentiate page breaks. No specific interface, except a standard text editor, is needed for entering the mark-up language structure into the text files. As the indexer creates the index within dedicated indexing software from paged hard copy of the text, she marks begin and end points for each entry on the hard copy and assigns each entry or subentry a unique identifier. These identifiers are later inserted into the electronic file as tags. One entry may span a word or many paragraphs, and entries may overlap without limit. Through programmatic analyses of the index file and the tagged files, the page locators are updated to reflect the typeset pages, so one index can be used for multiple renditions of a book. An integrated index is also ideal for on-line use. Index updates are easy to incorporate. The text must be relatively stable before indexing commences, but the indexer can work from proof pages or manuscript, and reflow is not a problem. This method takes approximately twice as long as standard indexing, but if the index will be used for at least two renditions of the book (print or on-line), it is cost-effective.

The Impressions Integrated Indexing Process

Pooling knowledge from indexers, mark-up language specialists, and programmers, Impressions has developed a process for integrating an index with a book. The programmatic application is complex, yet the concept is simple and the indexer’s work is minimally affected. Currently, the technology needed to accomplish the entire process is not collected into one standard interface, so multiple technologies are used.

The indexer creates an index from a printout of proof pages, using dedicated index creation software such as Cindex. The index may be as complex or as simple as the book requires, including cross references, double-posting, and complex sorting of entries, as needed. For the convenience of the indexer and to allow the use of the dedicated indexing software, the pages have dummy page numbers. As with any indexing project, nearly stable text is needed; this avoids laborious handwork later and minimizes possible disruption or destruction of the embedded information. If stability can be assured, indexing can take place at the manuscript stage, using the manuscript’s page numbers.

While creating the index, the indexer marks, on the paper copy of the proofs or manuscript, the beginning and end of the text to which each index entry applies. Each entry is assigned a unique identifier. This ID is included on the paper copy and is also inserted into the index file. A series of error-checking steps is included to ensure that each ID is unique and that each entry has an ID. The indexer finalizes the index in the indexing software. After copyediting, the index file and paper copy of the text are updated as needed, and a final word processing file is exported from the indexing software for typesetting.
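The two checks named above (every entry has an ID, and no ID is used twice) can be sketched as follows. This is only an illustration, not the actual Impressions procedure: the record layout, the ID format `ix0001`, and the function name `check_ids` are all assumptions.

```python
from collections import Counter

# Hypothetical index records: (entry text, identifier, dummy page locator).
index_records = [
    ("indexes, embedded", "ix0001", "12"),
    ("indexes, integrated", "ix0002", "14-15"),
    ("mark-up languages", "ix0002", "14"),  # duplicate ID
    ("page numbers", "", "3"),              # missing ID
]

def check_ids(records):
    """Return (entries without an ID, IDs used more than once)."""
    missing = [text for text, entry_id, _ in records if not entry_id.strip()]
    counts = Counter(entry_id for _, entry_id, _ in records if entry_id.strip())
    duplicates = sorted(entry_id for entry_id, n in counts.items() if n > 1)
    return missing, duplicates

missing, duplicates = check_ids(index_records)
print("Entries without an ID:", missing)
print("IDs used more than once:", duplicates)
```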

The identifiers are then placed in the typesetting file, following the marked-up manuscript or proof pages, thereby integrating the index with the text. Impressions uses XML syntax to accomplish this, and thus requires that the typesetting software understand the complexities of XML. A series of error-checking steps is included.
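One way such tagging could look is sketched below. The element names `ixstart` and `ixend` are hypothetical; the source does not give the actual tag names. Empty marker elements are assumed here because index ranges may overlap, and overlapping ranges cannot be expressed as properly nested enclosing elements in well-formed XML. The check confirms that every start marker has a matching end marker that follows it.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged text: empty <ixstart/> and <ixend/> markers carry the entry IDs.
tagged_text = """<chapter>
<p><ixstart id="ix0001"/>Embedded indexes store terms in the text.
<ixstart id="ix0002"/>Integrated indexes store only identifiers.<ixend id="ix0001"/></p>
<p>The identifiers point back to the index file.<ixend id="ix0002"/></p>
</chapter>"""

def check_markers(xml_text):
    """Return a list of problems found in the start/end markers."""
    problems = []
    open_ids, closed_ids = set(), set()
    # iter() visits elements in document order.
    for element in ET.fromstring(xml_text).iter():
        entry_id = element.get("id")
        if element.tag == "ixstart":
            if entry_id in open_ids or entry_id in closed_ids:
                problems.append(f"{entry_id}: started more than once")
            open_ids.add(entry_id)
        elif element.tag == "ixend":
            if entry_id not in open_ids:
                problems.append(f"{entry_id}: end marker without a start")
            else:
                open_ids.remove(entry_id)
                closed_ids.add(entry_id)
    problems.extend(f"{i}: start marker without an end" for i in sorted(open_ids))
    return problems

print(check_markers(tagged_text))
```

In the sample, entries ix0001 and ix0002 overlap, and the second one runs across a paragraph boundary; both cases pass the check.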

Typesetting proceeds with these tagged files; any substantive changes to the text will require careful addition, deletion, or movement of the index entry ranges in both the text file being typeset and the final index file. When pages are final, they are exported with page break codes inserted. Scripted processes are used to link up the actual page numbers, including ranges, with the dummy ones used in the index, creating a final index with the correct page numbers for that one rendition of the book. This step can be repeated as many times as needed for all the various renditions of a book. The final index file is typeset and paged into place in the final book.
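The page-linking step can be sketched as below. This is an illustration under stated assumptions, not the scripts Impressions uses: the page break code `<pb n="..."/>`, the marker names `ixstart` and `ixend`, and the function name `resolve_locators` are hypothetical. The script walks the exported text in order, tracks the current page, and records the page on which each entry begins and ends.

```python
import re

# Hypothetical export: <pb n="..."/> marks the start of each final page.
exported = (
    '<pb n="41"/>Intro text <ixstart id="ix0001"/>about embedded indexes. '
    '<ixstart id="ix0002"/>More text '
    '<pb n="42"/>continues here.<ixend id="ix0001"/> Integrated indexes '
    '<pb n="43"/>are discussed at length.<ixend id="ix0002"/>'
    '<ixstart id="ix0003"/>XML<ixend id="ix0003"/>'
)

MARKER = re.compile(r'<(pb) n="(\d+)"/>|<(ixstart|ixend) id="([^"]+)"/>')

def resolve_locators(text):
    """Map each entry ID to its final locator, e.g. '41-42' or '43'."""
    current_page = None
    first, last = {}, {}
    for match in MARKER.finditer(text):
        if match.group(1):                  # page break code
            current_page = int(match.group(2))
        elif match.group(3) == "ixstart":   # entry begins on the current page
            first[match.group(4)] = current_page
        else:                               # entry ends on the current page
            last[match.group(4)] = current_page
    locators = {}
    for entry_id, start in first.items():
        end = last.get(entry_id, start)
        locators[entry_id] = str(start) if start == end else f"{start}-{end}"
    return locators

# Index records as (entry text, ID); the dummy page numbers are simply replaced.
index_records = [
    ("indexes, embedded", "ix0001"),
    ("indexes, integrated", "ix0002"),
    ("XML", "ix0003"),
]
locators = resolve_locators(exported)
final_index = [f"{text}, {locators[entry_id]}" for text, entry_id in index_records]
print("\n".join(final_index))
```

Running the same function on the export from a different rendition (for example, a larger trim size with different page breaks) yields new locators without any change to the index entries themselves.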

If the book is destined to be used in an on-line format or made into an eBook, the index information in the electronic files can be used as the target for a linked, on-line index. The electronic index could even be generated dynamically rather than being static, thus allowing for high-powered searches that use the intelligence of the index as a guide.
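A linked index of this kind might be produced along the following lines. The sketch assumes hypothetical `<ixstart id="..."/>` and `<ixend id="..."/>` markers in the text and plain HTML as the on-line format; neither is specified in the source. Each start marker becomes an anchor, and each index entry becomes a link to the anchor with the same ID.

```python
import re
from html import escape

def to_html_anchors(tagged_text):
    """Turn each <ixstart id="..."/> marker into an HTML anchor and drop end markers."""
    with_anchors = re.sub(r'<ixstart id="([^"]+)"/>', r'<a id="\1"></a>', tagged_text)
    return re.sub(r'<ixend id="[^"]+"/>', "", with_anchors)

def linked_index(records):
    """Build an HTML list in which each entry links to its anchor in the text."""
    items = [
        f'<li><a href="#{escape(entry_id, quote=True)}">{escape(text)}</a></li>'
        for text, entry_id in records
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

# Hypothetical sample data.
body = to_html_anchors('<ixstart id="ix0001"/>Embedded indexes<ixend id="ix0001"/>')
index_html = linked_index([
    ("indexes, embedded", "ix0001"),
    ("mark-up & tagging", "ix0002"),
])
print(body)
print(index_html)
```

Because the links target identifiers rather than page numbers, the same index serves any on-line rendition without the page-linking step.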