From 34357fe6b8fef0fb1857c5625def45c0365f54ce Mon Sep 17 00:00:00 2001
From: Rainer Klute
The third section tells how to read
+ The third section tells how to read
non-standard properties. Non-standard properties are application-specific
- name/value/type triples. This section is still to be written. Look up
- the API documentation for the time being!
Now comes the really hardcode stuff. As mentioned above,
- SummaryInformation
and
- DocumentSummaryInformation
are just special cases of the
- general concept of a property set. The general concept says that a
- property set consists of properties. Each property is an
- entity that has a name, a type, and a
- value.
Now comes the real hardcore stuff. As mentioned above,
+ SummaryInformation
and
+ DocumentSummaryInformation
are just special cases of the
+ general concept of a property set. This concept says that a
+ property set consists of properties and that each
+ property is an entity with an ID, a
+ type, and a value.
Okay, that was still rather easy. However, to make things more - complicated, Microsoft in its infinite wisdom decided that a property set - shalt be broken into sections. Each section holds a bunch - of properties. But since that's still not complicated enough: A section - can optionally have a dictionary that maps property IDs to property - names - we'll explain later what that means.
+Okay, that was still rather easy. However, to make things more + complicated, Microsoft in its infinite wisdom decided that a property set + shalt be broken into one or more sections. Each section + holds a bunch of properties. But since that's still not complicated + enough, a section may have an optional dictionary that + maps property IDs to property names - we'll explain + later what that means.
-So the procedure to get to the properties is as follows:
+The procedure to get to the properties is the following:
-PropertySetFactory
to create a
- PropertySet
from an input stream. You can try this with any
- input stream: You'll either PropertySet
instance or an
- exception is thrown.
PropertySetFactory
class to
+ create a PropertySet
object from a property set stream. If
+ you don't know whether an input stream is a property set stream, just
+ try to call PropertySetFactory.create(java.io.InputStream)
:
+ You'll either get a PropertySet
instance returned or an
+ exception is thrown.
PropertySet
's method getSections()
- to get a list of sections contained in the property set. Each section is
- an instance of the Section
class.
PropertySet
's method getSections()
+ to get the sections contained in the property set. Each section is
+ an instance of the Section
class.
F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9
. You can
- get the format ID with Section.getFormatID()
.
F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9
. You can
+ get the format ID with Section.getFormatID()
.
Section
can be retrieved
- with Section.getProperties()
. The result is an array of
- Property
instances.
Section
can be retrieved
+ with Section.getProperties()
. The result is an array of
+ Property
instances.
Property
- class has methods to retrieve them.
Property
+ class has methods to retrieve them.
-Let's have a look at a sample Java application that dumps all property - set streams contained in a POI file system. The full source code of this - program can be found as ReadCustomPropertySets.java in the - examples area of the POI source code tree. Here are the key - sections:
+Let's have a look at a sample Java application that dumps all property + set streams contained in a POI file system. The full source code of this + program can be found as ReadCustomPropertySets.java in the + examples area of the POI source code tree. Here are the key + sections:
The POIFSReader
is set up in a way that the listener
MyPOIFSReaderListener
is called on every file in the POI file
system.
The listener class tries to create a PropertySet
from each
The listener class tries to create a
+ PropertySet
from each
stream using the PropertySetFactory.create()
method:
The next step is to print the number of sections followed by the
+ The next step is to print the number of sections followed by the
sections themselves:
The PropertySet
's method getSectionCount()
- returns the number of sections.
The PropertySet
's method getSectionCount()
+ returns the number of sections.
To retrieve the sections, use the getSections()
- method. This method returns a java.util.List
containing
- instances of the Section
class in their proper order.
To retrieve the sections, use the getSections()
+ method. This method returns a java.util.List
containing
+ instances of the Section
class in their proper order.
The sample code shows a loop that retrieves the Section
- objects one by one and prints some information about each one. Here is the
- complete body of the loop:
The sample code shows a loop that retrieves the Section
+ objects one by one and prints some information about each one. Here is
+ the complete body of the loop:
The first method called on the Section
instance is
- getFormatID()
. As explained above, the format ID of the first
- section in a property set determines the type of the property set. Its
- type is ClassID
which is essentially a sequence of 16
- bytes. A real application using its own type of a custom property set
- should have defined a unique format ID and, when reading a property set
- stream, should check the format ID is equal to that unique format ID. The
- sample program just prints the format ID it finds in a section:
The first method called on the Section
instance is
+ getFormatID()
. As explained above, the format ID of the
+ first section in a property set determines the type of the property
+ set. Its type is ClassID
which is essentially a sequence of
+ 16 bytes. A real application using its own type of a custom property set
+ should have defined a unique format ID and, when reading a property set
+ stream, should check that the format ID equals that unique format ID. The
+ sample program just prints the format ID it finds in a section:
As you can see, the getFormatID()
method returns a
- ClassID
object. An array containing the bytes can be
- retrieved with ClassID.getBytes()
. In order to get a nicely
- formatted printout, the sample program uses the hex()
helper
- method which in turn uses the POI utility class HexDump
in
- the org.apache.poi.util
package. Another helper method is
- out()
which just saves typing
- System.out.println()
.
As you can see, the getFormatID()
method returns a
+ ClassID
object. An array containing the bytes can be
+ retrieved with ClassID.getBytes()
. In order to get a nicely
+ formatted printout, the sample program uses the hex()
helper
+ method which in turn uses the POI utility class HexDump
in
+ the org.apache.poi.util
package. Another helper method is
+ out()
which just saves typing
+ System.out.println()
.
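The format ID check described above boils down to comparing two 16-byte arrays, and the printout to formatting bytes as hex digits. Here is a minimal sketch that uses only the JDK. The class FormatIdSketch and both of its methods are made up for illustration; they are neither POI's HexDump class nor the sample program's hex() helper. The sketch assumes you already have the bytes at hand, e.g. from ClassID.getBytes().

```java
import java.util.Arrays;

/** Hypothetical helper, not part of POI: prints and compares format ID bytes. */
final class FormatIdSketch {

    /** Formats each byte as two uppercase hex digits, separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 3);
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that negative bytes are printed as 80..FF.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Returns true if the format ID found in a section equals the expected one. */
    static boolean isExpectedFormat(byte[] expected, byte[] actual) {
        return Arrays.equals(expected, actual);
    }
}
```

An application with its own property set type would keep its format ID as a constant byte array and skip every stream for which isExpectedFormat() returns false.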
Before getting the properties, it is possible to find out how many
- properties are available in the section via the
- Section.getPropertyCount()
. The sample application uses this
- method to print the number of properties to the standard output:
Before getting the properties, it is possible to find out how many
+ properties are available in the section via the
+ Section.getPropertyCount()
method. The sample application uses this
+ method to print the number of properties to the standard output:
Now its time to get to the properties themselves. You can retrieve a
- section's properties with the method
- Section.getProperties()
:
Now it's time to get to the properties themselves. You can retrieve a
+ section's properties with the method
+ Section.getProperties()
:
As you can see the result is an array of Property
- objects. This class has three methods to retrieve a property's ID, its
- type, and its value. The following code snippet shows how to call
- them:
As you can see, the result is an array of Property
+ objects. This class has three methods to retrieve a property's ID, its
+ type, and its value. The following code snippet shows how to call
+ them:
The output of the sample program might look like the following. It shows
- the summary information and the document summary information property sets
- of a Microsoft Word document. However, unlike the first and second section
- of this HOW-TO the application does not have any code which is specific to
- the SummaryInformation
and
- DocumentSummaryInformation
classes.
The output of the sample program might look like the following. It
+ shows the summary information and the document summary information
+ property sets of a Microsoft Word document. However, unlike the first and
+ second sections of this HOW-TO, the application does not have any code
+ which is specific to the SummaryInformation
and
+ DocumentSummaryInformation
classes.
There are some interestion items to note:
+There are some interesting items to note:
-Properties in the same section are distinguished by their IDs. This is + similar to variables in a programming language like Java, which are + distinguished by their names. But unlike variable names, property IDs are + simple integral numbers. There is another similarity, however. Just like + a Java variable has a certain scope (e.g. a member variable in a class), + a property ID also has its scope of validity: the section.
-Two property IDs in sections with different section format IDs + don't have the same meaning even though their IDs might be equal. For + example, ID 4 in the first (and only) section of a summary + information property set denotes the document's author, while ID 4 in the + first section of the document summary information property set means the + document's byte count. The sample output above does not show a property + with an ID of 4 in the first section of the document summary information + property set. That means that the document does not have a byte + count. However, there is a property with an ID of 4 in the + second section: This is a user-defined property ID - we'll get + to that topic in a minute.
+ +So, how can you find out what the meaning of a certain property ID in
+ the summary information and the document summary information property set
+ is? The standard property sets as such don't have any hints about the
+ meanings of their property IDs. For example, the summary
+ information property set does not tell you that the property ID 4 stands
+ for the document's author. This is external knowledge. Microsoft defined
+ standard meanings for some of the property IDs in the summary information
+ and the document summary information property sets. As a help to the Java
+ and POI programmer, the class PropertyIDMap
in the
+ org.apache.poi.hpsf.wellknown
package defines constants
+ for the "well-known" property IDs. For example, there is the
+ definition
These definitions allow you to use symbolic names instead of + numbers.
+ +To support the other direction as well - i.e. to map
+ property IDs to property names - the class PropertyIDMap
+ defines two static methods:
+ getSummaryInformationProperties()
and
+ getDocumentSummaryInformationProperties()
. Both return
+ java.util.Map
objects which map property IDs to
+ strings. Such a string gives a hint about the property's meaning. For
+ example,
+ PropertyIDMap.getSummaryInformationProperties().get(4)
+ returns the string "PID_AUTHOR". An application could use this string as
+ a key to a localized string which is displayed to the user, e.g. "Author"
+ in English or "Verfasser" in German. HPSF might provide such
+ language-dependent ("localized") mappings in a later release.
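Until HPSF offers something like that, an application can do the localization itself. The following sketch assumes the application keeps its own label tables keyed by the hint string. The class LocalizedLabels and its tables are hypothetical; only the key "PID_AUTHOR" and the labels "Author" and "Verfasser" are taken from the text above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical application-side lookup from a PID hint string to a display label. */
final class LocalizedLabels {

    /** Language code (e.g. "en") mapped to a table of hint string to label. */
    private static final Map<String, Map<String, String>> LABELS = new HashMap<>();

    static {
        Map<String, String> english = new HashMap<>();
        english.put("PID_AUTHOR", "Author");
        LABELS.put("en", english);

        Map<String, String> german = new HashMap<>();
        german.put("PID_AUTHOR", "Verfasser");
        LABELS.put("de", german);
    }

    /** Returns the label for the locale's language or the hint string itself as a fallback. */
    static String label(String pidString, Locale locale) {
        Map<String, String> table = LABELS.get(locale.getLanguage());
        String label = (table == null) ? null : table.get(pidString);
        return (label != null) ? label : pidString;
    }
}
```

Falling back to the hint string keeps the output readable for languages or property IDs the application has no label for.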
Usually you won't have to deal with those two maps. Instead you should
+ call the Section.getPIDString(int)
method. It returns the
+ string associated with the specified property ID in the context of the
+ Section
object.
Above you learned that property IDs have a meaning in the scope of a + section only. However, there are two exceptions to the rule: The property + IDs 0 and 1 have a fixed meaning in all sections:
+Property ID | Meaning |
---|---|
0 | The property's value is a dictionary, i.e. a mapping from property IDs to strings. |
1 | The property's value is the number of a codepage, i.e. a mapping from character codes to characters. All strings in the section containing this property must be interpreted using this codepage. Typical property values are 1252 (8-bit "western" characters) or 1200 (16-bit Unicode characters). |
A property is nothing without its value. It is stored in a property set
+ stream as a sequence of bytes. You must know the property's
+ type in order to properly interpret those bytes and
+ reasonably handle the value. A property's type is one of the so-called
+ Microsoft-defined "variant types". When you call
+ Property.getType()
you'll get a long
value
+ denoting the property's variant type. The class
+ Variant
in the org.apache.poi.hpsf
package
+ holds most of those long
values as named constants. For
+ example, the constant VT_I4 = 3
means a signed integer value
+ of four bytes. Examples of other types are VT_LPSTR = 30
+ meaning a null-terminated string of 8-bit characters, VT_LPWSTR =
+ 31
which means a null-terminated Unicode string, or VT_BOOL
+ = 11
denoting a boolean value.
In most cases you won't need a property's type because HPSF does all + the work for you.
+When an application wants to retrieve a property's value and calls
+ Property.getValue()
, HPSF has to interpret the bytes making
+ up the value according to the property's type. The type determines how
+ many bytes the value consists of and what
+ to do with them. For example, if the type is VT_I4
, HPSF
+ knows that the value is four bytes long and that these bytes
+ comprise a signed integer value in the little-endian format. This is
+ quite different from e.g. a type of VT_LPWSTR
. In this case
+ HPSF has to scan the value bytes for a Unicode null character and collect
+ everything from the beginning to that null character as a Unicode
+ string.
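To make the difference between the two types tangible, here is a small JDK-only sketch of both interpretations. It is not HPSF's implementation: the class VariantDecodeSketch and its methods are invented for this example, and the sketch deliberately ignores any headers, length fields or padding the real stream format may put around a value. It assumes the 16-bit characters are stored in little-endian order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical decoder sketch, not HPSF code. */
final class VariantDecodeSketch {

    /** Reads four bytes at the offset as a signed little-endian integer (the VT_I4 case). */
    static int readI4(byte[] src, int offset) {
        return ByteBuffer.wrap(src, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /**
     * Collects 16-bit characters from the offset up to, but not including, the
     * first Unicode null character (the VT_LPWSTR case as described above).
     * Stops at the end of the array if no null character is found.
     */
    static String readNullTerminatedUtf16(byte[] src, int offset) {
        int end = offset;
        while (end + 1 < src.length && !(src[end] == 0 && src[end + 1] == 0)) {
            end += 2; // advance one 16-bit character at a time
        }
        return new String(src, offset, end - offset, StandardCharsets.UTF_16LE);
    }
}
```

Note how the integer has a fixed size while the string's size is only known after scanning it. That is why the type has to be known before the value bytes can be touched.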
The good news is that HPSF does another job for you, too: It maps the + variant type to an appropriate Java type.
+Variant type | Java type |
---|---|
VT_I2 | java.lang.Integer |
VT_I4 | java.lang.Long |
VT_FILETIME | java.util.Date |
VT_LPSTR | String |
VT_LPWSTR | String |
VT_CF | byte[] |
VT_BOOL | java.lang.Boolean |
The bad news is that there are still a couple of variant types HPSF + does not yet support. If it encounters one of these types it + returns the property's value as a byte array and leaves it to be + interpreted by the application.
+ +An application retrieves a property's value by calling the
+ Property.getValue()
method. This method's return type is the
+ Object
class. The getValue()
method
+ looks up the property's variant type, reads the property's value bytes,
+ creates an instance of an appropriate Java type, assigns it the property's
+ value and returns it. Primitive types like int
or
+ long
will be returned as the corresponding wrapper class,
+ e.g. Integer
or Long
.
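Since the declared return type is just Object, an application that wants to treat values differently has to look at the runtime type. The sketch below shows one way to do that with instanceof checks for the Java types listed in the table above. The class ValueDescriber is hypothetical and works on any Object, so it does not depend on HPSF; in a real application the argument would be whatever Property.getValue() returned.

```java
import java.util.Date;

/** Hypothetical helper: turns a property value of unknown runtime type into a short text. */
final class ValueDescriber {

    static String describe(Object value) {
        if (value == null) {
            return "no value";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "integer: " + value;
        }
        if (value instanceof Boolean) {
            return "boolean: " + value;
        }
        if (value instanceof String) {
            return "string: \"" + value + "\"";
        }
        if (value instanceof Date) {
            return "timestamp: " + ((Date) value).getTime() + " ms since epoch";
        }
        if (value instanceof byte[]) {
            // Also the case for variant types that are handed over uninterpreted.
            return "raw bytes: " + ((byte[]) value).length;
        }
        return "other: " + value.getClass().getName();
    }
}
```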
The property with ID 0 has a very special meaning: It is a + dictionary mapping property IDs to property names. We + have seen already that the meanings of standard properties in the + summary information and the document summary information property sets + have been defined by Microsoft. The advantage is that the labels of + properties like "Author" or "Title" don't have to be stored in the + property set. However, a user can define custom fields in, say, Microsoft + Word. For each field the user has to specify a name, a type, and a + value.
+ +The names of the custom-defined fields (i.e. the property names) are + stored in the dictionary of the document summary information property + set's second section. The dictionary is a map which associates + property IDs with property names.
+ +The method Section.getPIDString(int)
returns not only
+ the well-known property names of the summary information and document
+ summary information property sets but also the names of self-defined
+ properties. It should also work with self-defined properties in self-defined
+ sections.
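Conceptually, such a lookup combines two maps: the section's own dictionary and a table of well-known names. The following sketch illustrates the idea with plain java.util.Map objects. It is not the code behind Section.getPIDString(int): the class PropertyNameLookup, the order of the two lookups and the "(unknown)" placeholder are assumptions made for this example.

```java
import java.util.Map;

/** Hypothetical sketch of a name lookup, not HPSF code. */
final class PropertyNameLookup {

    /**
     * Looks up a property name, assuming the section's dictionary (which may
     * be null if the section has none) takes precedence over the well-known names.
     */
    static String nameFor(long id, Map<Long, String> dictionary, Map<Long, String> wellKnown) {
        if (dictionary != null) {
            String custom = dictionary.get(id);
            if (custom != null) {
                return custom;
            }
        }
        String standard = wellKnown.get(id);
        return (standard != null) ? standard : "(unknown)"; // placeholder is an assumption
    }
}
```

With a well-known table that maps 4 to "PID_AUTHOR", a section without a dictionary yields "PID_AUTHOR" for ID 4, while a section whose dictionary maps 4 to a user-defined name yields that name instead.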
The property with ID 1 holds the number of the codepage which was used + to encode the strings in this section. The present HPSF codepage support + is still very limited: When reading property value strings, HPSF + distinguishes between 16-bit characters and 8-bit characters. 16-bit + characters should be Unicode characters and thus be okay. 8-bit + characters are interpreted according to the platform's default character + set. This is fine as long as the document being read has been written on + a platform with the same default character set. However, if you receive a + document from another region of the world and want to process it with + HPSF you are in trouble - unless the creator used Unicode, of course.
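If you get hold of the raw string bytes and the codepage number yourself, you can sidestep the platform default by choosing the character set explicitly. The sketch below does this for the two codepage numbers mentioned in this HOW-TO. It shows what an application could do and does not describe HPSF's behaviour; the class CodepageSketch is made up, and mapping 1200 to little-endian UTF-16 and 1252 to the JDK charset "windows-1252" is this example's choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: decode string bytes according to a codepage number. */
final class CodepageSketch {

    /** Maps the two codepage numbers mentioned in this HOW-TO to Java charsets. */
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1200:
                return StandardCharsets.UTF_16LE;
            case 1252:
                return Charset.forName("windows-1252");
            default:
                throw new IllegalArgumentException("Codepage not handled by this sketch: " + codepage);
        }
    }

    /** Decodes the bytes with the charset belonging to the codepage. */
    static String decode(byte[] bytes, int codepage) {
        return new String(bytes, charsetFor(codepage));
    }
}
```

Decoding this way gives the same result on every platform, which is exactly what the platform default character set cannot guarantee.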
+There are still some aspects of HPSF left which are not covered by this + HOW-TO. You should dig into the Javadoc API documentation to learn + further details. Since you've struggled through this document up to this + point, you are well prepared.
+