Completed the third main section of the HPSF HOW-TO.

git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353000 13f79535-47bb-0310-9956-ffa450edef68
2003-02-05 19:33:27 +00:00 · 2003-02-05 19:33:27 +00:00 · 34357fe6b8
parent 9b34b0ded0
commit 34357fe6b8
4 changed files with 359 additions and 116 deletions
--- a/src/documentation/xdocs/hpsf/how-to.xml
+++ b/src/documentation/xdocs/hpsf/how-to.xml
@ -35,8 +35,7 @@
    <li>
     <p>The <link href="#sec3">third section</link> tells how to read
      non-standard properties. Non-standard properties are application-specific
-      name/value/type triples. <em>This section is still to be written. Look up
-      the API documentation for the time being!</em></p>
+      triples consisting of an ID, a type, and a value.</p>
     </li>
   </ol>

@ -303,39 +302,43 @@ else
   <section title="Reading Non-Standard Properties">

    <note>This section tells how to read non-standard properties. Non-standard
-     properties are application-specific name/type/value triples.</note>
+     properties are application-specific ID/type/value triples.</note>

-    <p>Now comes the really hardcode stuff. As mentioned above,
+    <section title="Overview">
+     <p>Now comes the real hardcore stuff. As mentioned above,
      <code>SummaryInformation</code> and
      <code>DocumentSummaryInformation</code> are just special cases of the
-     general concept of a property set. The general concept says that a
-     property set consists of <strong>properties</strong>. Each property is an
-     entity that has a <strong>name</strong>, a <strong>type</strong>, and a
-     <strong>value</strong>.</p>
+      general concept of a property set. This concept says that a
+      <strong>property set</strong> consists of properties and that each
+      <strong>property</strong> is an entity with an <strong>ID</strong>, a
+      <strong>type</strong>, and a <strong>value</strong>.</p>

     <p>Okay, that was still rather easy. However, to make things more
      complicated, Microsoft in its infinite wisdom decided that a property set
-     shalt be broken into <strong>sections</strong>. Each section holds a bunch
-     of properties. But since that's still not complicated enough: A section
-     can optionally have a dictionary that maps property IDs to property
-     names - we'll explain later what that means.</p>
+      shalt be broken into one or more <strong>sections</strong>. Each section
+      holds a bunch of properties. But since that's still not complicated
+      enough, a section may have an optional <strong>dictionary</strong> that
+      maps property IDs to <strong>property names</strong> - we'll explain
+      later what that means.</p>

-    <p>So the procedure to get to the properties is as follows:</p>
+     <p>The procedure to get to the properties is the following:</p>

     <ol>
-     <li>Use the <code>PropertySetFactory</code> to create a
-      <code>PropertySet</code> from an input stream. You can try this with any
-      input stream: You'll either <code>PropertySet</code> instance or an
+      <li>Use the <strong><code>PropertySetFactory</code></strong> class to
+       create a <code>PropertySet</code> object from a property set stream. If
+       you don't know whether an input stream is a property set stream, just
+       try to call <code>PropertySetFactory.create(java.io.InputStream)</code>:
+       You'll either get a <code>PropertySet</code> instance returned or an
       exception is thrown.</li>

      <li>Call the <code>PropertySet</code>'s method <code>getSections()</code>
-      to get a list of sections contained in the property set. Each section is
+       to get the sections contained in the property set. Each section is
       an instance of the <code>Section</code> class.</li>

      <li>Each section has a format ID. The format ID of the first section in a
       property set determines the property set's type. For example, the first
-      (and only) section of the SummaryInformation property set has a format ID
-      of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. You can
+       (and only) section of the SummaryInformation property set has a format
+       ID of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. You can
       get the format ID with <code>Section.getFormatID()</code>.</li>

      <li>The properties contained in a <code>Section</code> can be retrieved
@ -345,7 +348,9 @@ else
      <li>A property has a name, a type, and a value. The <code>Property</code>
       class has methods to retrieve them.</li>
     </ol>
+    </section>

+    <section title="A Sample Application">
     <p>Let's have a look at a sample Java application that dumps all property
      set streams contained in a POI file system. The full source code of this
      program can be found as <em>ReadCustomPropertySets.java</em> in the
@ -381,7 +386,9 @@ import org.apache.poi.util.HexDump;</source>
    <p>The <code>POIFSReader</code> is set up in a way that the listener
     <code>MyPOIFSReaderListener</code> is called on every file in the POI file
    system.</p>
+    </section>

+    <section title="The Property Set">
     <p>The listener class tries to create a <code>PropertySet</code> from each
     stream using the <code>PropertySetFactory.create()</code> method:</p>

@ -420,7 +427,9 @@ import org.apache.poi.util.HexDump;</source>
     other types of exceptions cause the program to terminate by throwing a
     runtime exception. If all went well, we can print the name of the property
     set stream.</p>
+    </section>

+    <section title="The Sections">
     <p>The next step is to print the number of sections followed by the
     sections themselves:</p>

@ -447,8 +456,8 @@ for (Iterator i = sections.iterator(); i.hasNext();)
      instances of the <code>Section</code> class in their proper order.</p>

     <p>The sample code shows a loop that retrieves the <code>Section</code>
-     objects one by one and prints some information about each one. Here is the
-     complete body of the loop:</p>
+      objects one by one and prints some information about each one. Here is
+      the complete body of the loop:</p>

     <source>/* Print a single section: */
 Section sec = (Section) i.next();
@ -473,12 +482,14 @@ for (int i2 = 0; i2 &lt; properties.length; i2++)
    out("      Property ID: " + id + ", type: " + type +
        ", value: " + value);
 }</source>
+    </section>

+    <section title="The Section's Format ID">
     <p>The first method called on the <code>Section</code> instance is
-     <code>getFormatID()</code>. As explained above, the format ID of the first
-     section in a property set determines the type of the property set. Its
-     type is <code>ClassID</code> which is essentially a sequence of 16
-     bytes. A real application using its own type of a custom property set
+      <code>getFormatID()</code>. As explained above, the format ID of the
+      first section in a property set determines the type of the property
+      set. Its type is <code>ClassID</code> which is essentially a sequence of
+      16 bytes. A real application using its own type of a custom property set
      should have defined a unique format ID and, when reading a property set
      stream, should check the format ID is equal to that unique format ID. The
      sample program just prints the format ID it finds in a section:</p>
@ -495,7 +506,9 @@ out("      Format ID: " + s);</source>
      the <code>org.apache.poi.util</code> package. Another helper method is
      <code>out()</code> which just saves typing
      <code>System.out.println()</code>.</p>
+    </section>

+    <section title="The Properties">
     <p>Before getting the properties, it is possible to find out how many
      properties are available in the section via the
      <code>Section.getPropertyCount()</code>. The sample application uses this
@ -525,12 +538,14 @@ out("      No. of properties: " + propertyCount);</source>
    out("      Property ID: " + id + ", type: " + type +
        ", value: " + value);
 }</source>
+    </section>

-    <p>The output of the sample program might look like the following. It shows
-     the summary information and the document summary information property sets
-     of a Microsoft Word document. However, unlike the first and second section
-     of this HOW-TO the application does not have any code which is specific to
-     the <code>SummaryInformation</code> and
+    <section title="Sample Output">
+     <p>The output of the sample program might look like the following. It
+      shows the summary information and the document summary information
+      property sets of a Microsoft Word document. However, unlike the first and
+      second section of this HOW-TO the application does not have any code
+      which is specific to the <code>SummaryInformation</code> and
      <code>DocumentSummaryInformation</code> classes.</p>

     <source>Property set stream "/SummaryInformation":
@ -604,13 +619,231 @@ No property set stream: "/1Table"</source>
      <li>The properties are not in any particular order in the section,
       although they slightly tend to be sorted by their IDs.</li>
     </ul>
+    </section>

-    <note>[To be continued.]</note>
+    <section title="Property IDs">
+     <p>Properties in the same section are distinguished by their IDs. This is
+      similar to variables in a programming language like Java, which are
+      distinguished by their names. But unlike variable names, property IDs are
+      simple integral numbers. There is another similarity, however. Just as
+      a Java variable has a certain scope (e.g. a member variable in a class),
+      a property ID also has its scope of validity: the section.</p>

-    <note>A last note: There are still some aspects of HSPF left which are not
-     documented in this HOW-TO. You should dig into the Javadoc API
-     documentation to learn further details. Since you struggled through this
-     document up to this point, you are well prepared.</note>
+     <p>Two property IDs in sections with different section format IDs
+      don't have the same meaning even though their IDs might be equal. For
+      example, ID 4 in the first (and only) section of a summary
+      information property set denotes the document's author, while ID 4 in the
+      first section of the document summary information property set means the
+      document's byte count. The sample output above does not show a property
+      with an ID of 4 in the first section of the document summary information
+      property set. That means that the document does not have a byte
+      count. However, there is a property with an ID of 4 in the
+      <em>second</em> section: This is a user-defined property ID - we'll get
+      to that topic in a minute.</p>
+
+     <p>So, how can you find out what the meaning of a certain property ID in
+      the summary information and the document summary information property set
+      is? The standard property sets as such don't have any hints about the
+      <strong>meanings of their property IDs</strong>. For example, the summary
+      information property set does not tell you that the property ID 4 stands
+      for the document's author. This is external knowledge. Microsoft defined
+      standard meanings for some of the property IDs in the summary information
+      and the document summary information property sets. As a help to the Java
+      and POI programmer, the class <code>PropertyIDMap</code> in the
+      <code>org.apache.poi.hpsf.wellknown</code> package defines constants
+      for the "well-known" property IDs. For example, there is the
+      definition</p>
+
+     <source>public final static int PID_AUTHOR = 4;</source>
+
+     <p>These definitions allow you to use symbolic names instead of
+      numbers.</p>
+
+     <p>To support the other direction as well, i.e. mapping property IDs
+      to property names, the class <code>PropertyIDMap</code>
+      defines two static methods:
+      <code>getSummaryInformationProperties()</code> and
+      <code>getDocumentSummaryInformationProperties()</code>. Both return
+      <code>java.util.Map</code> objects which map property IDs to
+      strings. Such a string gives a hint about the property's meaning. For
+      example,
+      <code>PropertyIDMap.getSummaryInformationProperties().get(4)</code>
+      returns the string "PID_AUTHOR". An application could use this string as
+      a key to a localized string which is displayed to the user, e.g. "Author"
+      in English or "Verfasser" in German. HPSF might provide such
+      language-dependent ("localized") mappings in a later release.</p>
+
+     <p>Usually you won't have to deal with those two maps. Instead you should
+      call the <code>Section.getPIDString(int)</code> method. It returns the
+      string associated with the specified property ID in the context of the
+      <code>Section</code> object.</p>
+
+     <p>Above you learned that property IDs have a meaning in the scope of a
+      section only. However, there are two exceptions to the rule: The property
+      IDs 0 and 1 have a fixed meaning in <strong>all</strong> sections:</p>
+
+     <table>
+      <tr>
+       <th>Property ID</th>
+       <th>Meaning</th>
+      </tr>
+
+      <tr>
+       <td>0</td>
+       <td>The property's value is a <strong>dictionary</strong>, i.e. a
+	mapping from property IDs to strings.</td>
+      </tr>
+
+      <tr>
+       <td>1</td>
+       <td>The property's value is the number of a <strong>codepage</strong>,
+	i.e. a mapping from character codes to characters. All strings in the
+	section containing this property must be interpreted using this
+	codepage. Typical property values are 1252 (8-bit "western" characters)
+	or 1200 (16-bit Unicode characters).</td>
+      </tr>
+     </table>
+    </section>
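The scoping rule described in this section can be sketched in plain Java without HPSF. This is only an illustration under stated assumptions: the class `PropertyIdScope`, the string labels used as section keys, and the `meaning()` helper are hypothetical and not part of the HPSF API. Only the two meanings of ID 4 and the fixed meanings of IDs 0 and 1 come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyIdScope {
    // Hypothetical table: section label -> (property ID -> meaning).
    private static final Map<String, Map<Long, String>> MEANINGS = new HashMap<>();

    static {
        Map<Long, String> summary = new HashMap<>();
        summary.put(4L, "author");
        MEANINGS.put("SummaryInformation", summary);

        Map<Long, String> docSummary = new HashMap<>();
        docSummary.put(4L, "byte count");
        MEANINGS.put("DocumentSummaryInformation", docSummary);
    }

    /** Resolves a property ID within the scope of one section. */
    static String meaning(String section, long id) {
        // IDs 0 and 1 have the same meaning in every section.
        if (id == 0L) return "dictionary";
        if (id == 1L) return "codepage";
        Map<Long, String> perSection = MEANINGS.get(section);
        String label = (perSection == null) ? null : perSection.get(id);
        return (label != null) ? label : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(meaning("SummaryInformation", 4));
        System.out.println(meaning("DocumentSummaryInformation", 4));
        System.out.println(meaning("DocumentSummaryInformation", 1));
    }
}
```

The same ID resolves differently depending on the section it is looked up in, which is why a property ID should never be interpreted without its section's format ID.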
+
+    <section title="Property types">
+     <p>A property is nothing without its value. It is stored in a property set
+      stream as a sequence of bytes. You must know the property's
+      <strong>type</strong> in order to properly interpret those bytes and
+      reasonably handle the value. A property's type is one of the so-called
+      Microsoft-defined <strong>"variant types"</strong>. When you call
+      <code>Property.getType()</code> you'll get a <code>long</code> value
+      denoting the property's variant type. The class
+      <code>Variant</code> in the <code>org.apache.poi.hpsf</code> package
+      holds most of those <code>long</code> values as named constants. For
+      example, the constant <code>VT_I4 = 3</code> means a signed integer value
+      of four bytes. Examples of other types are <code>VT_LPSTR = 30</code>
+      meaning a null-terminated string of 8-bit characters, <code>VT_LPWSTR =
+       31</code> which means a null-terminated Unicode string, or <code>VT_BOOL
+       = 11</code> denoting a boolean value.</p>
+
+     <p>In most cases you won't need a property's type because HPSF does all
+      the work for you.</p>
+    </section>
+
+    <section title="Property values">
+     <p>When an application wants to retrieve a property's value and calls
+      <code>Property.getValue()</code>, HPSF has to interpret the bytes making
+      up the value according to the property's type. The type determines how
+      many bytes the value consists of and what
+      to do with them. For example, if the type is <code>VT_I4</code>, HPSF
+      knows that the value is four bytes long and that these bytes
+      comprise a signed integer value in the little-endian format. This is
+      quite different from e.g. a type of <code>VT_LPWSTR</code>. In this case
+      HPSF has to scan the value bytes for a Unicode null character and collect
+      everything from the beginning to that null character as a Unicode
+      string.</p>
+
+     <p>The good news is that HPSF does another job for you, too: It maps the
+      variant type to an adequate Java type.</p>
+
+     <table>
+      <tr>
+       <th>Variant type:</th>
+       <th>Java type:</th>
+      </tr>
+
+      <tr>
+       <td>VT_I2</td>
+       <td>java.lang.Integer</td>
+      </tr>
+
+      <tr>
+       <td>VT_I4</td>
+       <td>java.lang.Long</td>
+      </tr>
+
+      <tr>
+       <td>VT_FILETIME</td>
+       <td>java.util.Date</td>
+      </tr>
+
+      <tr>
+       <td>VT_LPSTR</td>
+       <td>String</td>
+      </tr>
+
+      <tr>
+       <td>VT_LPWSTR</td>
+       <td>String</td>
+      </tr>
+
+      <tr>
+       <td>VT_CF</td>
+       <td>byte[]</td>
+      </tr>
+
+      <tr>
+       <td>VT_BOOL</td>
+       <td>java.lang.Boolean</td>
+      </tr>
+
+     </table>
+
+     <p>The bad news is that there are still a couple of variant types HPSF
+      does not yet support. If it encounters one of these types it
+      returns the property's value as a byte array and leaves it to be
+      interpreted by the application.</p>
+
+     <p>An application retrieves a property's value by calling the
+      <code>Property.getValue()</code> method. This method's return type is the
+      general <code>Object</code> class. The <code>getValue()</code> method
+      looks up the property's variant type, reads the property's value bytes,
+      creates an instance of an adequate Java type, assigns it the property's
+      value and returns it. Primitive types like <code>int</code> or
+      <code>long</code> will be returned as the corresponding wrapper class,
+      e.g. <code>Integer</code> or <code>Long</code>.</p>
+    </section>
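To make the type-driven interpretation more concrete, here is a minimal sketch that follows the simplified description in this section. It does not use or reproduce HPSF's own reader: the class `VariantValueSketch` and its `decode()` method are hypothetical, and the real layout of a property set stream may carry additional fields (such as length prefixes) that are ignored here. The constants `VT_I4 = 3` and `VT_LPWSTR = 31` and the mapping of `VT_I4` to `Long` are taken from the text above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class VariantValueSketch {
    static final int VT_I4 = 3;
    static final int VT_LPWSTR = 31;

    /** Interprets raw value bytes according to the given variant type. */
    static Object decode(int type, byte[] value) {
        switch (type) {
            case VT_I4:
                // Four bytes forming a signed little-endian integer.
                int i = ByteBuffer.wrap(value, 0, 4)
                                  .order(ByteOrder.LITTLE_ENDIAN)
                                  .getInt();
                return Long.valueOf(i);
            case VT_LPWSTR: {
                // Scan 16-bit units up to the Unicode null character.
                int end = 0;
                while (end + 1 < value.length
                        && !(value[end] == 0 && value[end + 1] == 0)) {
                    end += 2;
                }
                return new String(value, 0, end, StandardCharsets.UTF_16LE);
            }
            default:
                // Unsupported type: hand the raw bytes back to the caller.
                return value;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(VT_I4, new byte[] {0x2A, 0, 0, 0}));
        System.out.println(decode(VT_LPWSTR, new byte[] {0x48, 0, 0x69, 0, 0, 0}));
    }
}
```

An application would typically check the returned object with `instanceof` before casting, because unsupported types come back as a plain byte array.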
+
+
+    <section title="Dictionaries">
+     <p>The property with ID 0 has a very special meaning: It is a
+      <strong>dictionary</strong> mapping property IDs to property names. We
+      have seen already that the meanings of standard properties in the 
+      summary information and the document summary information property sets
+      have been defined by Microsoft. The advantage is that the labels of
+      properties like "Author" or "Title" don't have to be stored in the
+      property set. However, a user can define custom fields in, say, Microsoft
+      Word. For each field the user has to specify a name, a type, and a
+      value.</p>
+
+     <p>The names of the custom-defined fields (i.e. the property names) are
+      stored in the <strong>dictionary</strong> of the document summary
+      information's second section. The dictionary is a map which associates
+      property IDs with property names.</p>
+
+     <p>The method <code>Section.getPIDString(int)</code> returns not only
+      the well-known property names of the summary information and document
+      summary information property sets, but also the names of self-defined
+      properties. It should also work with self-defined properties in
+      self-defined sections.</p>
+    </section>
+
+    <section title="Codepage support">
+     <fixme author="Rainer Klute">Improve codepage support!</fixme>
+
+     <p>The property with ID 1 holds the number of the codepage which was used
+      to encode the strings in this section. The present HPSF codepage support
+      is still very limited: When reading property value strings, HPSF
+      distinguishes between 16-bit characters and 8-bit characters. 16-bit
+      characters should be Unicode characters and thus be okay. 8-bit
+      characters are interpreted according to the platform's default character
+      set. This is fine as long as the document being read has been written on
+      a platform with the same default character set. However, if you receive a
+      document from another region of the world and want to process it with
+      HPSF you are in trouble - unless the creator used Unicode, of course.</p>
+    </section>
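The difference between decoding with the platform default and decoding with the section's codepage can be sketched as follows. This is not what HPSF does at present. It only illustrates, under stated assumptions, what honoring property ID 1 could look like. The class `CodepageSketch` and its methods are hypothetical. The mapping of 1252 to the Java charset `windows-1252` and of 1200 to little-endian UTF-16 is an assumption based on the codepage numbers named above, and the sketch further assumes that the running JVM provides `windows-1252`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodepageSketch {
    /** Maps a codepage number to a Java charset (only two cases covered). */
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1252:
                return Charset.forName("windows-1252");
            case 1200:
                return StandardCharsets.UTF_16LE;
            default:
                // Unknown codepage: fall back to the platform default,
                // which mirrors the limitation described in the text.
                return Charset.defaultCharset();
        }
    }

    /** Decodes string bytes using the codepage stored in the section. */
    static String decode(byte[] raw, int codepage) {
        return new String(raw, charsetFor(codepage));
    }

    public static void main(String[] args) {
        byte[] eAcute = {(byte) 0xE9};
        System.out.println(decode(eAcute, 1252));
        System.out.println(new String(eAcute)); // depends on the platform
    }
}
```

With an explicit charset the result no longer depends on the machine the program runs on, which is the property the text asks for when documents cross regions.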
+
+    <section title="Further Reading">
+     <p>There are still some aspects of HPSF left which are not covered by this
+      HOW-TO. You should dig into the Javadoc API documentation to learn
+      further details. Since you've struggled through this document up to this
+      point, you are well prepared.</p>
+    </section>
   </section>
  </section>
 </body>
--- a/src/documentation/xdocs/hpsf/todo.xml
+++ b/src/documentation/xdocs/hpsf/todo.xml
@ -16,22 +16,25 @@

   <ol>
    <li>
-     <p>Add writing capability for property sets.</p>
+     <p>Add writing capability for property sets. Presently property sets can
+      be read only.</p>
    </li>
    <li>
-     <p>Add codepage support.</p>
-    </li>
-    <li>
-     <p>Add Unicode support.</p>
+     <p>Add codepage support: Presently the bytes making up the string in a
+      property's value are interpreted using the platform's default character
+      set.</p>
    </li>
    <li>
     <p>Add resource bundles to
      <code>org.apache.poi.hpsf.wellknown</code> to ease
-      localizations.</p>
+      localizations. This would be useful for mapping standard property IDs to
+      localized strings. Example: The property ID 4 could be mapped to "Author"
+      in English or "Verfasser" in German.</p>
    </li>
    <li>
     <p>Implement reading functionality for those property types that are not
-      yet supported (other than byte arrays).</p>
+      yet supported. HPSF should return proper Java types instead of just byte
+      arrays.</p>
    </li>
    <li>
     <p>Add WMF to <code>java.awt.Image</code> example code in <link
--- a/src/java/org/apache/poi/hpsf/TypeReader.java
+++ b/src/java/org/apache/poi/hpsf/TypeReader.java
@ -137,6 +137,11 @@ public class TypeReader
                 * Read a byte string. In Java it is represented as a
                 * String object. The 0x00 bytes at the end must be
                 * stripped.
+		 *
+		 * FIXME: Reading an 8-bit string should pay attention
+		 * to the codepage. Currently the bytes making up the
+		 * property's value are interpreted according to the
+		 * platform's default character set.
                 */
                final int first = offset + LittleEndian.INT_SIZE;
                long last = first + LittleEndian.getUInt(src, offset) - 1;
--- a/src/java/org/apache/poi/hpsf/wellknown/PropertyIDMap.java
+++ b/src/java/org/apache/poi/hpsf/wellknown/PropertyIDMap.java
@ -79,7 +79,8 @@ public class PropertyIDMap extends HashMap
 {

    /*
-     * The following definitions are for the Summary Information.
+     * The following definitions are for property IDs in the first
+     * (and only) section of the Summary Information property set.
     */
    public final static int PID_TITLE = 2;
    public final static int PID_SUBJECT = 3;
@ -103,7 +104,8 @@ public class PropertyIDMap extends HashMap


    /*
-     * The following definitions are for the Document Summary Information.
+     * The following definitions are for property IDs in the first
+     * section of the Document Summary Information property set.
     */

    /**