mirror of https://github.com/apache/poi.git
Completed the third main section of the HPSF HOW-TO.
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353000 13f79535-47bb-0310-9956-ffa450edef68
parent
9b34b0ded0
commit
34357fe6b8
|
@ -35,8 +35,7 @@
|
|||
<li>
|
||||
<p>The <link href="#sec3">third section</link> tells how to read
|
||||
non-standard properties. Non-standard properties are application-specific
|
||||
name/value/type triples. <em>This section is still to be written. Look up
|
||||
the API documentation for the time being!</em></p>
|
||||
triples consisting of an ID, a type, and a value.</p>
|
||||
</li>
|
||||
</ol>
|
||||
|
||||
|
@ -303,39 +302,43 @@ else
|
|||
<section title="Reading Non-Standard Properties">
|
||||
|
||||
<note>This section tells how to read non-standard properties. Non-standard
|
||||
properties are application-specific name/type/value triples.</note>
|
||||
properties are application-specific ID/type/value triples.</note>
|
||||
|
||||
<p>Now comes the really hardcode stuff. As mentioned above,
|
||||
<section title="Overview">
|
||||
<p>Now comes the real hardcore stuff. As mentioned above,
|
||||
<code>SummaryInformation</code> and
|
||||
<code>DocumentSummaryInformation</code> are just special cases of the
|
||||
general concept of a property set. The general concept says that a
|
||||
property set consists of <strong>properties</strong>. Each property is an
|
||||
entity that has a <strong>name</strong>, a <strong>type</strong>, and a
|
||||
<strong>value</strong>.</p>
|
||||
general concept of a property set. This concept says that a
|
||||
<strong>property set</strong> consists of properties and that each
|
||||
<strong>property</strong> is an entity with an <strong>ID</strong>, a
|
||||
<strong>type</strong>, and a <strong>value</strong>.</p>
|
||||
|
||||
<p>Okay, that was still rather easy. However, to make things more
|
||||
complicated, Microsoft in its infinite wisdom decided that a property set
|
||||
shalt be broken into <strong>sections</strong>. Each section holds a bunch
|
||||
of properties. But since that's still not complicated enough: A section
|
||||
can optionally have a dictionary that maps property IDs to property
|
||||
names - we'll explain later what that means.</p>
|
||||
shalt be broken into one or more <strong>sections</strong>. Each section
|
||||
holds a bunch of properties. But since that's still not complicated
|
||||
enough, a section may have an optional <strong>dictionary</strong> that
|
||||
maps property IDs to <strong>property names</strong> - we'll explain
|
||||
later what that means.</p>
|
||||
|
||||
<p>So the procedure to get to the properties is as follows:</p>
|
||||
<p>The procedure to get to the properties is the following:</p>
|
||||
|
||||
<ol>
|
||||
<li>Use the <code>PropertySetFactory</code> to create a
|
||||
<code>PropertySet</code> from an input stream. You can try this with any
|
||||
input stream: You'll either <code>PropertySet</code> instance or an
|
||||
<li>Use the <strong><code>PropertySetFactory</code></strong> class to
|
||||
create a <code>PropertySet</code> object from a property set stream. If
|
||||
you don't know whether an input stream is a property set stream, just
|
||||
try to call <code>PropertySetFactory.create(java.io.InputStream)</code>:
|
||||
You'll either get a <code>PropertySet</code> instance returned or an
|
||||
exception is thrown.</li>
|
||||
|
||||
<li>Call the <code>PropertySet</code>'s method <code>getSections()</code>
|
||||
to get a list of sections contained in the property set. Each section is
|
||||
to get the sections contained in the property set. Each section is
|
||||
an instance of the <code>Section</code> class.</li>
|
||||
|
||||
<li>Each section has a format ID. The format ID of the first section in a
|
||||
property set determines the property set's type. For example, the first
|
||||
(and only) section of the SummaryInformation property set has a format ID
|
||||
of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. You can
|
||||
(and only) section of the SummaryInformation property set has a format
|
||||
ID of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. You can
|
||||
get the format ID with <code>Section.getFormatID()</code>.</li>
|
||||
|
||||
<li>The properties contained in a <code>Section</code> can be retrieved
|
||||
|
@ -345,7 +348,9 @@ else
|
|||
<li>A property has an ID, a type, and a value. The <code>Property</code>
|
||||
class has methods to retrieve them.</li>
|
||||
</ol>
|
||||
</section>
|
||||
|
||||
<section title="A Sample Application">
|
||||
<p>Let's have a look at a sample Java application that dumps all property
|
||||
set streams contained in a POI file system. The full source code of this
|
||||
program can be found as <em>ReadCustomPropertySets.java</em> in the
|
||||
|
@ -381,7 +386,9 @@ import org.apache.poi.util.HexDump;</source>
|
|||
<p>The <code>POIFSReader</code> is set up in a way that the listener
|
||||
<code>MyPOIFSReaderListener</code> is called on every file in the POI file
|
||||
system.</p>
|
||||
</section>
|
||||
|
||||
<section title="The Property Set">
|
||||
<p>The listener class tries to create a <code>PropertySet</code> from each
|
||||
stream using the <code>PropertySetFactory.create()</code> method:</p>
|
||||
|
||||
|
@ -420,7 +427,9 @@ import org.apache.poi.util.HexDump;</source>
|
|||
other types of exceptions cause the program to terminate by throwing a
|
||||
runtime exception. If all went well, we can print the name of the property
|
||||
set stream.</p>
|
||||
</section>
|
||||
|
||||
<section title="The Sections">
|
||||
<p>The next step is to print the number of sections followed by the
|
||||
sections themselves:</p>
|
||||
|
||||
|
@ -447,8 +456,8 @@ for (Iterator i = sections.iterator(); i.hasNext();)
|
|||
instances of the <code>Section</code> class in their proper order.</p>
|
||||
|
||||
<p>The sample code shows a loop that retrieves the <code>Section</code>
|
||||
objects one by one and prints some information about each one. Here is the
|
||||
complete body of the loop:</p>
|
||||
objects one by one and prints some information about each one. Here is
|
||||
the complete body of the loop:</p>
|
||||
|
||||
<source>/* Print a single section: */
|
||||
Section sec = (Section) i.next();
|
||||
|
@ -473,12 +482,14 @@ for (int i2 = 0; i2 < properties.length; i2++)
|
|||
out(" Property ID: " + id + ", type: " + type +
|
||||
", value: " + value);
|
||||
}</source>
|
||||
</section>
|
||||
|
||||
<section title="The Section's Format ID">
|
||||
<p>The first method called on the <code>Section</code> instance is
|
||||
<code>getFormatID()</code>. As explained above, the format ID of the first
|
||||
section in a property set determines the type of the property set. Its
|
||||
type is <code>ClassID</code> which is essentially a sequence of 16
|
||||
bytes. A real application using its own type of a custom property set
|
||||
<code>getFormatID()</code>. As explained above, the format ID of the
|
||||
first section in a property set determines the type of the property
|
||||
set. Its type is <code>ClassID</code> which is essentially a sequence of
|
||||
16 bytes. A real application using its own type of a custom property set
|
||||
should have defined a unique format ID and, when reading a property set
|
||||
stream, should check that the format ID is equal to that unique format ID. The
|
||||
sample program just prints the format ID it finds in a section:</p>
|
||||
|
@ -495,7 +506,9 @@ out(" Format ID: " + s);</source>
|
|||
the <code>org.apache.poi.util</code> package. Another helper method is
|
||||
<code>out()</code> which just saves typing
|
||||
<code>System.out.println()</code>.</p>
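<p>As an illustration of the kind of helper mentioned above, here is a minimal sketch that turns the 16 bytes of a format ID into a hexadecimal string. The class name <code>FormatIdHexSketch</code> and the method <code>toHex</code> are hypothetical and not part of POI; the sketch prints the bytes in the order given and makes no claim about how HPSF orders them internally.</p>

```java
/** Hypothetical helper, not part of POI: prints bytes as upper-case hex. */
final class FormatIdHexSketch {

    /** Returns two hex digits per byte, in array order, without separators. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xF2, (byte) 0x9F, (byte) 0x85, (byte) 0xE0};
        System.out.println(toHex(sample));
    }
}
```

<p>Comparing two format IDs byte by byte (for example with <code>java.util.Arrays.equals</code>) is usually more robust than comparing their string forms.</p>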
|
||||
</section>
|
||||
|
||||
<section title="The Properties">
|
||||
<p>Before getting the properties, it is possible to find out how many
|
||||
properties are available in the section via the
|
||||
<code>Section.getPropertyCount()</code> method. The sample application uses this
|
||||
|
@ -525,12 +538,14 @@ out(" No. of properties: " + propertyCount);</source>
|
|||
out(" Property ID: " + id + ", type: " + type +
|
||||
", value: " + value);
|
||||
}</source>
|
||||
</section>
|
||||
|
||||
<p>The output of the sample program might look like the following. It shows
|
||||
the summary information and the document summary information property sets
|
||||
of a Microsoft Word document. However, unlike the first and second section
|
||||
of this HOW-TO the application does not have any code which is specific to
|
||||
the <code>SummaryInformation</code> and
|
||||
<section title="Sample Output">
|
||||
<p>The output of the sample program might look like the following. It
|
||||
shows the summary information and the document summary information
|
||||
property sets of a Microsoft Word document. However, unlike the first and
|
||||
second sections of this HOW-TO, the application does not have any code
|
||||
which is specific to the <code>SummaryInformation</code> and
|
||||
<code>DocumentSummaryInformation</code> classes.</p>
|
||||
|
||||
<source>Property set stream "/SummaryInformation":
|
||||
|
@ -604,13 +619,231 @@ No property set stream: "/1Table"</source>
|
|||
<li>The properties are not in any particular order in the section,
|
||||
although they tend to be roughly sorted by their IDs.</li>
|
||||
</ul>
|
||||
</section>
|
||||
|
||||
<note>[To be continued.]</note>
|
||||
<section title="Property IDs">
|
||||
<p>Properties in the same section are distinguished by their IDs. This is
|
||||
similar to variables in a programming language like Java, which are
|
||||
distinguished by their names. But unlike variable names, property IDs are
|
||||
simple integral numbers. There is another similarity, however. Just like
|
||||
a Java variable has a certain scope (e.g. a member variable in a class),
|
||||
a property ID also has its scope of validity: the section.</p>
|
||||
|
||||
<note>A last note: There are still some aspects of HSPF left which are not
|
||||
documented in this HOW-TO. You should dig into the Javadoc API
|
||||
documentation to learn further details. Since you struggled through this
|
||||
document up to this point, you are well prepared.</note>
|
||||
<p>Two property IDs in sections with different section format IDs
|
||||
don't have the same meaning even though their IDs might be equal. For
|
||||
example, ID 4 in the first (and only) section of a summary
|
||||
information property set denotes the document's author, while ID 4 in the
|
||||
first section of the document summary information property set means the
|
||||
document's byte count. The sample output above does not show a property
|
||||
with an ID of 4 in the first section of the document summary information
|
||||
property set. That means that the document does not have a byte
|
||||
count. However, there is a property with an ID of 4 in the
|
||||
<em>second</em> section: This is a user-defined property ID - we'll get
|
||||
to that topic in a minute.</p>
|
||||
|
||||
<p>So, how can you find out what the meaning of a certain property ID in
|
||||
the summary information and the document summary information property set
|
||||
is? The standard property sets as such don't have any hints about the
|
||||
<strong>meanings of their property IDs</strong>. For example, the summary
|
||||
information property set does not tell you that the property ID 4 stands
|
||||
for the document's author. This is external knowledge. Microsoft defined
|
||||
standard meanings for some of the property IDs in the summary information
|
||||
and the document summary information property sets. As a help to the Java
|
||||
and POI programmer, the class <code>PropertyIDMap</code> in the
|
||||
<code>org.apache.poi.hpsf.wellknown</code> package defines constants
|
||||
for the "well-known" property IDs. For example, there is the
|
||||
definition</p>
|
||||
|
||||
<source>public final static int PID_AUTHOR = 4;</source>
|
||||
|
||||
<p>These definitions allow you to use symbolic names instead of
|
||||
numbers.</p>
|
||||
|
||||
<p>To support the opposite direction as well - i.e. to map
|
||||
property IDs to property names - the class <code>PropertyIDMap</code>
|
||||
defines two static methods:
|
||||
<code>getSummaryInformationProperties()</code> and
|
||||
<code>getDocumentSummaryInformationProperties()</code>. Both return
|
||||
<code>java.util.Map</code> objects which map property IDs to
|
||||
strings. Such a string gives a hint about the property's meaning. For
|
||||
example,
|
||||
<code>PropertyIDMap.getSummaryInformationProperties().get(4)</code>
|
||||
returns the string "PID_AUTHOR". An application could use this string as
|
||||
a key to a localized string which is displayed to the user, e.g. "Author"
|
||||
in English or "Verfasser" in German. HPSF might provide such
|
||||
language-dependent ("localized") mappings in a later release.</p>
|
||||
|
||||
<p>Usually you won't have to deal with those two maps. Instead you should
|
||||
call the <code>Section.getPIDString(int)</code> method. It returns the
|
||||
string associated with the specified property ID in the context of the
|
||||
<code>Section</code> object.</p>
|
||||
|
||||
<p>Above you learned that property IDs have a meaning in the scope of a
|
||||
section only. However, there are two exceptions to the rule: The property
|
||||
IDs 0 and 1 have a fixed meaning in <strong>all</strong> sections:</p>
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<th>Property ID</th>
|
||||
<th>Meaning</th>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>0</td>
|
||||
<td>The property's value is a <strong>dictionary</strong>, i.e. a
|
||||
mapping from property IDs to strings.</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>1</td>
|
||||
<td>The property's value is the number of a <strong>codepage</strong>,
|
||||
i.e. a mapping from character codes to characters. All strings in the
|
||||
section containing this property must be interpreted using this
|
||||
codepage. Typical property values are 1252 (8-bit "western" characters)
|
||||
or 1200 (16-bit Unicode characters).</td>
|
||||
</tr>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
<section title="Property types">
|
||||
<p>A property is nothing without its value. It is stored in a property set
|
||||
stream as a sequence of bytes. You must know the property's
|
||||
<strong>type</strong> in order to properly interpret those bytes and
|
||||
reasonably handle the value. A property's type is one of the so-called
|
||||
Microsoft-defined <strong>"variant types"</strong>. When you call
|
||||
<code>Property.getType()</code> you'll get a <code>long</code> value
|
||||
which denotes the property's variant type. The class
|
||||
<code>Variant</code> in the <code>org.apache.poi.hpsf</code> package
|
||||
holds most of those <code>long</code> values as named constants. For
|
||||
example, the constant <code>VT_I4 = 3</code> means a signed integer value
|
||||
of four bytes. Examples of other types are <code>VT_LPSTR = 30</code>
|
||||
meaning a null-terminated string of 8-bit characters, <code>VT_LPWSTR =
|
||||
31</code> which means a null-terminated Unicode string, or <code>VT_BOOL
|
||||
= 11</code> denoting a boolean value.</p>
|
||||
|
||||
<p>In most cases you won't need a property's type because HPSF does all
|
||||
the work for you.</p>
|
||||
</section>
|
||||
|
||||
<section title="Property values">
|
||||
<p>When an application wants to retrieve a property's value and calls
|
||||
<code>Property.getValue()</code>, HPSF has to interpret the bytes making
|
||||
up the value according to the property's type. The type determines how
|
||||
many bytes the value consists of and what
|
||||
to do with them. For example, if the type is <code>VT_I4</code>, HPSF
|
||||
knows that the value is four bytes long and that these bytes
|
||||
comprise a signed integer value in the little-endian format. This is
|
||||
quite different from e.g. a type of <code>VT_LPWSTR</code>. In this case
|
||||
HPSF has to scan the value bytes for a Unicode null character and collect
|
||||
everything from the beginning to that null character as a Unicode
|
||||
string.</p>
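<p>The two cases described above can be sketched in plain Java. This is only an illustration of the byte-level idea, not HPSF's own code: the class <code>VariantDecodeSketch</code> and its methods are hypothetical, and the string reader simply scans for the null character as described in the text, ignoring any length information the stream may carry.</p>

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration, not part of POI. */
final class VariantDecodeSketch {

    /** Reads four bytes as a signed little-endian integer (the VT_I4 case). */
    static int readI4(byte[] src, int offset) {
        return (src[offset] & 0xFF)
             | (src[offset + 1] & 0xFF) << 8
             | (src[offset + 2] & 0xFF) << 16
             | (src[offset + 3] & 0xFF) << 24;
    }

    /**
     * Collects 16-bit little-endian characters up to, but not including,
     * the first Unicode null character (the VT_LPWSTR case).
     */
    static String readNullTerminatedUtf16(byte[] src, int offset) {
        int end = offset;
        while (end + 1 < src.length && !(src[end] == 0 && src[end + 1] == 0)) {
            end += 2; // advance one 16-bit character at a time
        }
        return new String(src, offset, end - offset, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] number = {0x2A, 0x00, 0x00, 0x00};
        byte[] text = {'H', 0, 'i', 0, 0, 0};
        System.out.println(readI4(number, 0));
        System.out.println(readNullTerminatedUtf16(text, 0));
    }
}
```

<p>The fixed-size case needs no scanning at all, while the string case has to walk the bytes to find out where the value ends; that difference is why the type must be known before the value can be read.</p>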
|
||||
|
||||
<p>The good news is that HPSF does another job for you, too: It maps the
|
||||
variant type to an adequate Java type.</p>
|
||||
|
||||
<table>
|
||||
<tr>
|
||||
<th>Variant type:</th>
|
||||
<th>Java type:</th>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_I2</td>
|
||||
<td>java.lang.Integer</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_I4</td>
|
||||
<td>java.lang.Long</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_FILETIME</td>
|
||||
<td>java.util.Date</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_LPSTR</td>
|
||||
<td>String</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_LPWSTR</td>
|
||||
<td>String</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_CF</td>
|
||||
<td>byte[]</td>
|
||||
</tr>
|
||||
|
||||
<tr>
|
||||
<td>VT_BOOL</td>
|
||||
<td>java.lang.Boolean</td>
|
||||
</tr>
|
||||
|
||||
</table>
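<p>To make one row of the table concrete: a Windows FILETIME counts 100-nanosecond intervals since January 1, 1601 (UTC), whereas <code>java.util.Date</code> counts milliseconds since January 1, 1970. The following sketch shows the arithmetic such a mapping requires. The class <code>FiletimeSketch</code> is hypothetical and is not HPSF's own conversion routine.</p>

```java
import java.util.Date;

/** Hypothetical illustration, not part of POI. */
final class FiletimeSketch {

    /** Milliseconds between 1601-01-01 and 1970-01-01 (UTC). */
    static final long EPOCH_DIFF_MILLIS = 11_644_473_600_000L;

    /** Converts a FILETIME tick count (100 ns units) to a java.util.Date. */
    static Date filetimeToDate(long filetime) {
        long millisSince1601 = filetime / 10_000L; // 10,000 ticks per millisecond
        return new Date(millisSince1601 - EPOCH_DIFF_MILLIS);
    }

    public static void main(String[] args) {
        // The tick count that corresponds to the Java epoch.
        System.out.println(filetimeToDate(116_444_736_000_000_000L).getTime());
    }
}
```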
|
||||
|
||||
<p>The bad news is that there are still a couple of variant types HPSF
|
||||
does not yet support. If it encounters one of these types it
|
||||
returns the property's value as a byte array and leaves it to be
|
||||
interpreted by the application.</p>
|
||||
|
||||
<p>An application retrieves a property's value by calling the
|
||||
<code>Property.getValue()</code> method. This method's return type is the
|
||||
abstract <code>Object</code> class. The <code>getValue()</code> method
|
||||
looks up the property's variant type, reads the property's value bytes,
|
||||
creates an instance of an adequate Java type, assigns it the property's
|
||||
value and returns it. Primitive types like <code>int</code> or
|
||||
<code>long</code> will be returned as the corresponding class,
|
||||
e.g. <code>Integer</code> or <code>Long</code>.</p>
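<p>Because <code>getValue()</code> is declared to return <code>Object</code>, a caller typically has to check the runtime type before using the value. The sketch below shows one way to do that for the Java types listed in the table above. The class <code>ValueDispatchSketch</code> and its <code>describe</code> method are hypothetical; the value is passed in directly so that the example runs without POI on the classpath.</p>

```java
import java.util.Date;

/** Hypothetical illustration, not part of POI. */
final class ValueDispatchSketch {

    /** Returns a short description of a value as getValue() might deliver it. */
    static String describe(Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return "number " + value;
        } else if (value instanceof Boolean) {
            return "boolean " + value;
        } else if (value instanceof String) {
            return "string \"" + value + "\"";
        } else if (value instanceof Date) {
            return "timestamp " + ((Date) value).getTime() + " ms";
        } else if (value instanceof byte[]) {
            // Also the fallback for variant types that are not yet supported.
            return "raw bytes, length " + ((byte[]) value).length;
        }
        return "unexpected " + (value == null ? "null" : value.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(describe(Long.valueOf(7)));
        System.out.println(describe("Title"));
        System.out.println(describe(new byte[3]));
    }
}
```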
|
||||
</section>
|
||||
|
||||
|
||||
<section title="Dictionaries">
|
||||
<p>The property with ID 0 has a very special meaning: It is a
|
||||
<strong>dictionary</strong> mapping property IDs to property names. We
|
||||
have seen already that the meanings of standard properties in the
|
||||
summary information and the document summary information property sets
|
||||
have been defined by Microsoft. The advantage is that the labels of
|
||||
properties like "Author" or "Title" don't have to be stored in the
|
||||
property set. However, a user can define custom fields in, say, Microsoft
|
||||
Word. For each field the user has to specify a name, a type, and a
|
||||
value.</p>
|
||||
|
||||
<p>The names of the custom-defined fields (i.e. the property names) are
|
||||
stored in the document summary information second section's
|
||||
<strong>dictionary</strong>. The dictionary is a map which associates
|
||||
property IDs with property names.</p>
|
||||
|
||||
<p>The method <code>Section.getPIDString(int)</code> returns not only
|
||||
the well-known property names of the summary information and document
|
||||
summary information property sets, but the names of self-defined properties,
|
||||
too. It should also work with self-defined properties in self-defined
|
||||
sections.</p>
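<p>The idea of resolving a property ID to a name can be sketched with two ordinary maps: the section's own dictionary and a table of well-known names. This is an assumption-laden illustration and not the actual implementation of <code>getPIDString</code>: the class <code>PidNameLookupSketch</code>, the lookup order (dictionary first) and the fallback string are all hypothetical.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration, not part of POI. */
final class PidNameLookupSketch {

    /** Placeholder returned when no name is known (an assumption). */
    static final String UNDEFINED = "[undefined]";

    /**
     * Looks up a name in the section dictionary first, then in the
     * well-known names, and falls back to a placeholder.
     */
    static String pidString(long id, Map<Long, String> dictionary,
                            Map<Long, String> wellKnown) {
        if (dictionary != null && dictionary.containsKey(id)) {
            return dictionary.get(id);
        }
        if (wellKnown.containsKey(id)) {
            return wellKnown.get(id);
        }
        return UNDEFINED;
    }

    public static void main(String[] args) {
        Map<Long, String> wellKnown = new HashMap<>();
        wellKnown.put(4L, "PID_AUTHOR");
        Map<Long, String> dictionary = new HashMap<>();
        dictionary.put(2L, "MyCustomField");

        System.out.println(pidString(4L, null, wellKnown));
        System.out.println(pidString(2L, dictionary, wellKnown));
    }
}
```

<p>A section without a dictionary is modelled here by passing <code>null</code>, which matches the statement that the dictionary is optional.</p>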
|
||||
</section>
|
||||
|
||||
<section title="Codepage support">
|
||||
<fixme author="Rainer Klute">Improve codepage support!</fixme>
|
||||
|
||||
<p>The property with ID 1 holds the number of the codepage which was used
|
||||
to encode the strings in this section. The present HPSF codepage support
|
||||
is still very limited: When reading property value strings, HPSF
|
||||
distinguishes between 16-bit characters and 8-bit characters. 16-bit
|
||||
characters should be Unicode characters and thus be okay. 8-bit
|
||||
characters are interpreted according to the platform's default character
|
||||
set. This is fine as long as the document being read has been written on
|
||||
a platform with the same default character set. However, if you receive a
|
||||
document from another region of the world and want to process it with
|
||||
HPSF you are in trouble - unless the creator used Unicode, of course.</p>
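<p>What codepage-aware decoding could look like is sketched below for the two codepage numbers mentioned earlier, 1252 and 1200. This is not what HPSF currently does (as described above, it relies on the platform's default character set for 8-bit strings); the class <code>CodepageSketch</code> is hypothetical, and the sketch assumes the Java runtime provides the <code>windows-1252</code> charset, which common JDKs do.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration, not part of POI. */
final class CodepageSketch {

    /** Maps a codepage number to a Java charset; only 1252 and 1200 are handled. */
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1252:
                return Charset.forName("windows-1252"); // 8-bit "western" characters
            case 1200:
                return StandardCharsets.UTF_16LE;       // 16-bit Unicode characters
            default:
                throw new UnsupportedOperationException("codepage " + codepage);
        }
    }

    /** Decodes the bytes with an explicit charset instead of the platform default. */
    static String decode(byte[] bytes, int codepage) {
        return new String(bytes, charsetFor(codepage));
    }

    public static void main(String[] args) {
        byte[] western = {(byte) 0xE4};   // one 8-bit character
        byte[] unicode = {0x41, 0x00};    // one 16-bit character
        System.out.println(decode(western, 1252));
        System.out.println(decode(unicode, 1200));
    }
}
```

<p>Passing the charset explicitly makes the result independent of the machine the code runs on, which is exactly what the default-charset approach cannot guarantee.</p>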
|
||||
</section>
|
||||
|
||||
<section title="Further Reading">
|
||||
<p>There are still some aspects of HPSF left which are not covered by this
|
||||
HOW-TO. You should dig into the Javadoc API documentation to learn
|
||||
further details. Since you've struggled through this document up to this
|
||||
point, you are well prepared.</p>
|
||||
</section>
|
||||
</section>
|
||||
</section>
|
||||
</body>
|
||||
|
|
|
@ -16,22 +16,25 @@
|
|||
|
||||
<ol>
|
||||
<li>
|
||||
<p>Add writing capability for property sets.</p>
|
||||
<p>Add writing capability for property sets. Presently property sets can
|
||||
be read only.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Add codepage support.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Add Unicode support.</p>
|
||||
<p>Add codepage support: Presently the bytes making up the string in a
|
||||
property's value are interpreted using the platform's default character
|
||||
set.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Add resource bundles to
|
||||
<code>org.apache.poi.hpsf.wellknown</code> to ease
|
||||
localizations.</p>
|
||||
localizations. This would be useful for mapping standard property IDs to
|
||||
localized strings. Example: The property ID 4 could be mapped to "Author"
|
||||
in English or "Verfasser" in German.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Implement reading functionality for those property types that are not
|
||||
yet supported (other than byte arrays).</p>
|
||||
yet supported. HPSF should return proper Java types instead of just byte
|
||||
arrays.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Add WMF to <code>java.awt.Image</code> example code in <link
|
||||
|
|
|
@ -137,6 +137,11 @@ public class TypeReader
|
|||
* Read a byte string. In Java it is represented as a
|
||||
* String object. The 0x00 bytes at the end must be
|
||||
* stripped.
|
||||
*
|
||||
* FIXME: Reading an 8-bit string should pay attention
|
||||
* to the codepage. Currently the bytes making up the
|
||||
* property's value are interpreted according to the
|
||||
* platform's default character set.
|
||||
*/
|
||||
final int first = offset + LittleEndian.INT_SIZE;
|
||||
long last = first + LittleEndian.getUInt(src, offset) - 1;
|
||||
|
|
|
@ -79,7 +79,8 @@ public class PropertyIDMap extends HashMap
|
|||
{
|
||||
|
||||
/*
|
||||
* The following definitions are for the Summary Information.
|
||||
* The following definitions are for property IDs in the first
|
||||
* (and only) section of the Summary Information property set.
|
||||
*/
|
||||
public final static int PID_TITLE = 2;
|
||||
public final static int PID_SUBJECT = 3;
|
||||
|
@ -103,7 +104,8 @@ public class PropertyIDMap extends HashMap
|
|||
|
||||
|
||||
/*
|
||||
* The following definitions are for the Document Summary Information.
|
||||
* The following definitions are for property IDs in the first
|
||||
* section of the Document Summary Information property set.
|
||||
*/
|
||||
|
||||
/**
|
||||
|
|