The purpose of this document is to give a brief, high-level overview of the HDF document format. This document does not go into in-depth technical detail and is only meant as a supplement to the Microsoft Word 97 Binary File Format freely available at Wotsit.org.
The OLE file format is not discussed in this document. It is assumed that the reader has a working knowledge of the POIFS API.
A Word file is made up of the document text and data structures containing formatting information about the text. Of course, this is a very simplified picture: there are fields, macros, and other things that have not been considered. At this stage, HDF is mainly concerned with formatted text.
The entry point for HDF's reading of a Word file is the File Information Block (FIB). This structure records the locations and sizes of a document's text and data structures. The FIB is located at the beginning of the main stream.
The document's text is also located in the main stream. Its starting location is given as FIB.fcMin and its length is given in bytes by FIB.ccpText. These two values alone are not very useful in getting the text because of unicode: there may be unicode text intermingled with ASCII text. That brings us to the piece table.
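As a rough illustration, the two FIB fields named above can be read from the start of the main stream as little-endian 32-bit integers. This is only a sketch: the class `FibFields` and its methods are hypothetical, and the byte offsets 0x18 (fcMin) and 0x4C (ccpText) are assumptions recalled from the Word 97 FIB layout that should be checked against the specification.

```java
final class FibFields {
    // Assumed byte offsets inside the FIB; verify against the Word 97 spec.
    static final int FC_MIN_OFFSET = 0x18;
    static final int CCP_TEXT_OFFSET = 0x4C;

    /** Reads a 32-bit little-endian integer from buf at the given offset. */
    static int readInt(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
             | (buf[offset + 1] & 0xFF) << 8
             | (buf[offset + 2] & 0xFF) << 16
             | (buf[offset + 3] & 0xFF) << 24;
    }

    /** Starting location of the document text in the main stream. */
    static int fcMin(byte[] mainStream) {
        return readInt(mainStream, FC_MIN_OFFSET);
    }

    /** Length of the main document text as stored in the FIB. */
    static int ccpText(byte[] mainStream) {
        return readInt(mainStream, CCP_TEXT_OFFSET);
    }

    public static void main(String[] args) {
        // A fake, mostly empty FIB used only to exercise the readers.
        byte[] fake = new byte[0x50];
        fake[0x19] = 0x06; // fcMin = 0x600
        fake[0x4C] = 0x2A; // ccpText = 42
        System.out.println("fcMin=" + fcMin(fake) + " ccpText=" + ccpText(fake));
    }
}
```

In real use the byte array would come from the main stream obtained through POIFS rather than from a hand-built buffer.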
The piece table is used to divide the text into non-unicode and unicode pieces. Its offset and size are given in FIB.fcClx and FIB.lcbClx respectively. The piece table may contain Property Modifiers (prm). These are for complex (fast-saved) files and are skipped. Each text piece records the offset in the main stream where its text is stored. If the piece does not use unicode, a flag bit is set in that offset; you then have to clear the bit and divide by 2 to get the real file offset.
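A minimal sketch of the offset decoding, assuming the Word 97 convention in which bit 30 (0x40000000) of a piece's stored offset marks 8-bit (non-unicode) text. The class and method names are hypothetical and are not part of HDF.

```java
final class PieceOffset {
    // Assumed flag bit: when set, the piece holds 8-bit text, not unicode.
    static final int EIGHT_BIT_FLAG = 0x40000000;

    /** True when the stored offset describes a unicode (two bytes per character) piece. */
    static boolean isUnicode(int storedFc) {
        return (storedFc & EIGHT_BIT_FLAG) == 0;
    }

    /** Converts the stored offset into a real byte offset in the main stream. */
    static int realOffset(int storedFc) {
        if (isUnicode(storedFc)) {
            return storedFc; // unicode pieces store the offset directly
        }
        // 8-bit pieces: clear the flag bit, then halve the value.
        return (storedFc & ~EIGHT_BIT_FLAG) / 2;
    }

    public static void main(String[] args) {
        int stored = 0x40001000;
        System.out.println("unicode=" + isUnicode(stored)
                + " offset=0x" + Integer.toHexString(realOffset(stored)));
    }
}
```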
All text formatting is based on styles contained in the StyleSheet. The StyleSheet is a data structure containing, among other things, style descriptions. Each style description can contain a paragraph style and a character style, or simply a character style. Each style description is stored in a compressed form on file. Basically these are deltas from another style.
Eventually, you have to chain back to the nil style, which is an imaginary style with certain implied values.
Paragraph and character formatting properties for a document's text are stored on file as deltas from some base style in the StyleSheet. The deltas are used to create a complete uncompressed style in memory.
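The chaining described above can be sketched as a recursive lookup: resolve the base style first, then lay the current style's deltas on top. Everything here is illustrative. The `StyleChain` class, the property names, and the nil-style default values are assumptions made for the example, and real styles carry far more properties than a small map.

```java
import java.util.HashMap;
import java.util.Map;

final class StyleChain {
    // Hypothetical marker meaning "based on the nil style".
    static final int NIL_STYLE = 0x0FFF;

    /** One compressed style description: a base style index plus deltas. */
    static final class StyleDescription {
        final int baseIndex;
        final Map<String, Integer> delta;

        StyleDescription(int baseIndex, Map<String, Integer> delta) {
            this.baseIndex = baseIndex;
            this.delta = delta;
        }
    }

    /** Implied values of the imaginary nil style (illustrative numbers only). */
    static Map<String, Integer> nilStyle() {
        Map<String, Integer> props = new HashMap<>();
        props.put("bold", 0);
        props.put("halfPointSize", 20);
        return props;
    }

    /** Builds the complete, uncompressed style by walking back to the nil style. */
    static Map<String, Integer> resolve(StyleDescription[] sheet, int index) {
        if (index == NIL_STYLE) {
            return nilStyle();
        }
        StyleDescription sd = sheet[index];
        Map<String, Integer> props = resolve(sheet, sd.baseIndex);
        props.putAll(sd.delta); // this style's deltas override inherited values
        return props;
    }

    public static void main(String[] args) {
        Map<String, Integer> normal = new HashMap<>();
        normal.put("halfPointSize", 24);
        Map<String, Integer> strong = new HashMap<>();
        strong.put("bold", 1);
        StyleDescription[] sheet = {
            new StyleDescription(NIL_STYLE, normal),
            new StyleDescription(0, strong)
        };
        System.out.println(resolve(sheet, 1));
    }
}
```

The same pattern applies one level further down: the deltas stored for a paragraph or character run are applied on top of the resolved base style.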
Uncompressed paragraph styles are represented by the Paragraph Properties (PAP) data structure. Uncompressed character styles are represented by the Character Properties (CHP) data structure. The styles for the document text are stored in compressed format in the corresponding Formatted Disk Pages (FKP). A compressed PAP is referred to as a PAPX and a compressed CHP is a CHPX. The FKP locations are stored in the bin table. There are separate bin tables for CHPXs and PAPXs. The bin tables' locations and sizes are stored in the FIB.
An FKP is a 512-byte OLE page. It contains the offsets of the beginning and end of each paragraph/character run in the main stream and the compressed properties for that interval. The compressed PAPX is based on its base style in the StyleSheet. The compressed CHPX is based on the enclosing paragraph's base style in the StyleSheet.
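To make the page layout more concrete, here is a hedged sketch of reading a character-run FKP. It assumes the following layout, taken from my reading of the Word 97 spec rather than from this document: the last byte of the page holds the run count, the page starts with (run count + 1) four-byte file offsets, and these are followed by one byte per run giving the position of that run's CHPX in two-byte words, with the CHPX itself starting with a length byte. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class ChpxFkpReader {
    static final int PAGE_SIZE = 512;

    /** Number of runs described by this page (assumed to be the last byte). */
    static int runCount(byte[] page) {
        return page[PAGE_SIZE - 1] & 0xFF;
    }

    /** The i-th run boundary: a little-endian file offset into the main stream. */
    static int fcAt(byte[] page, int i) {
        int p = i * 4;
        return (page[p] & 0xFF)
             | (page[p + 1] & 0xFF) << 8
             | (page[p + 2] & 0xFF) << 16
             | (page[p + 3] & 0xFF) << 24;
    }

    /** Compressed property bytes for one run, or an empty array if none are stored. */
    static byte[] compressedPropsForRun(byte[] page, int run) {
        int crun = runCount(page);
        // The per-run word offsets follow the (crun + 1) boundaries.
        int wordOffset = page[(crun + 1) * 4 + run] & 0xFF;
        if (wordOffset == 0) {
            return new byte[0]; // no CHPX stored: the run uses its base style
        }
        int start = wordOffset * 2;
        int length = page[start] & 0xFF; // first byte is the CHPX length
        return Arrays.copyOfRange(page, start + 1, start + 1 + length);
    }
}
```

A run `i` then covers the main-stream interval from `fcAt(page, i)` up to `fcAt(page, i + 1)`. Paragraph FKPs use a different per-run entry and are not handled by this sketch.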
All compressed properties (CHPX, PAPX, SEPX) contain a grpprl. A grpprl is an array of sprms. A sprm defines a delta from some base property. There is a table of possible sprms in the Word 97 spec. Each sprm is a two-byte opcode followed by a parameter. The parameter size depends on the sprm. Each sprm describes an operation that should be performed on the base style. After every sprm in the grpprl is performed on the base style, you will have the style for the paragraph, character run, section, etc.
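Walking a grpprl can be sketched as follows. The sketch assumes that the top three bits of the two-byte sprm code select the parameter size, which is how I read the Word 97 sprm table; the handful of sprms that the spec treats as special cases are not handled. `GrpprlWalker` and `Sprm` are hypothetical names, and applying each sprm to a base style is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GrpprlWalker {
    /** One decoded sprm: its 16-bit code and the raw parameter bytes. */
    static final class Sprm {
        final int code;
        final byte[] param;

        Sprm(int code, byte[] param) {
            this.code = code;
            this.param = param;
        }
    }

    /** Parameter size in bytes, chosen by the top three bits of the code (assumed rule). */
    static int paramSize(int code, byte[] grpprl, int paramStart) {
        switch (code >>> 13) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default: // 6: variable size, the first parameter byte holds the count
                return 1 + (grpprl[paramStart] & 0xFF);
        }
    }

    /** Splits a grpprl into its sprms; stops quietly at a truncated entry. */
    static List<Sprm> parse(byte[] grpprl) {
        List<Sprm> sprms = new ArrayList<>();
        int pos = 0;
        while (pos + 2 < grpprl.length) {
            int code = (grpprl[pos] & 0xFF) | (grpprl[pos + 1] & 0xFF) << 8;
            int paramStart = pos + 2;
            int size = paramSize(code, grpprl, paramStart);
            if (paramStart + size > grpprl.length) {
                break; // truncated parameter
            }
            sprms.add(new Sprm(code, Arrays.copyOfRange(grpprl, paramStart, paramStart + size)));
            pos = paramStart + size;
        }
        return sprms;
    }

    public static void main(String[] args) {
        // Two character sprms: code 0x0835 with a one-byte parameter,
        // then code 0x4A43 with a two-byte parameter.
        byte[] grpprl = {0x35, 0x08, 0x01, 0x43, 0x4A, 0x18, 0x00};
        for (Sprm s : parse(grpprl)) {
            System.out.println("sprm 0x" + Integer.toHexString(s.code)
                    + " with " + s.param.length + " parameter byte(s)");
        }
    }
}
```

Producing the final style is then a matter of looping over the parsed sprms and applying each one to a copy of the base PAP, CHP, or SEP.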
HDF is the name of our port of the Microsoft Word 97 (-2002) file format to pure Java.
HDF is still in early development. It is in the scratchpad section of the CVS. Source code in the org.apache.poi.hdf.extractor tree is legacy code. Source in the org.apache.poi.hdf.model tree is the old legacy code refactored into an object model. Check the How-To page for detailed examples on using HDF.
We are looking for developers! If you are interested in helping with HDF, familiarize yourself with the source code and just start coding. Make sure you read the guidelines for getting involved.
HWPF Milestones
Milestones | Target Date | Owner |
---|---|---|
Read in a Word document with minimum formatting (no lists, tables, footnotes, endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000 | 07/11/2003 | Ryan |
Add support for Lists and Tables | 8/15/2003 | |
HWPF 1.0-alpha release with documentation and examples | 8/18/2003 | Praveen/Ryan |
Add support for Headers, Footers, endnotes, and footnotes | 8/31/2003 | ? |
Add support for forms and mail merge | September/October 2003 | ? |
HWPF Task Lists
Read in a Word document with minimum formatting (no lists, tables, footnotes, endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000
Task | Target Date | Owner |
---|---|---|
Create classes to read and write low level data structures with test cases | 7/10/2003 | Ryan |
Create classes to read and write FontTable and Font names with test case | 7/10/2003 | Praveen |
Final test | 7/11/2003 | Ryan |
Develop a user-friendly API so it is fun and easy to read and write Word documents with Java.
Task | Target Date | Owner |
---|---|---|
Develop a way for SPRMS to be compressed and uncompressed | | |
Override CHPAbstractType with a concrete class that exposes attributes with human readable names | | |
Override PAPAbstractType with a concrete class that exposes attributes with human readable names | | |
Override SEPAbstractType with a concrete class that exposes attributes with human readable names | | |
Override DOPAbstractType with a concrete class that exposes attributes with human readable names | | |
Override TAPAbstractType with a concrete class that exposes attributes with human readable names | | |
Override TCAbstractType with a concrete class that exposes attributes with human readable names | | |
Develop a VerifyIntegrity class for testing so it is easy to determine if a Word Document is well-formed. | | |
Develop general intuitive API to tie everything together | | |
Add support for lists and tables
Task | Target Date | Owner |
---|---|---|
Add data structures for reading and writing list data with test cases. | | |
Add data structures for reading and writing tables with test cases. | | |
HWPF 1.0-alpha release with documentation and examples
Task | Target Date | Owner |
---|---|---|
Document the user model API | | |
Document the low level classes | | |
Come up with detailed How-To’s | | |