- Added first sections to HPSF HOW-TO.

git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@352153 13f79535-47bb-0310-9956-ffa450edef68
Rainer Klute 2002-03-06 09:03:53 +00:00
parent 4d54d7cb62
commit 641974525a
4 changed files with 800 additions and 19 deletions


@ -143,7 +143,7 @@
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="100%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+1"><font face="Arial,sans-serif"><b> 1.1-dev (March 3 2002)</b></font></font></td>
<td bgcolor="#525D76"><font color="#ffffff" size="+1"><font face="Arial,sans-serif"><b> 1.1-dev (March 6 2002)</b></font></font></td>
</tr>
<tr>
<td>


@ -74,9 +74,501 @@
<td>
<br>
<p align="justify">TODO: This documentation is still to be written. For the
time being, please see the API documentation (javadocs) of the
<code>org.apache.poi.hpsf</code> package.</p>
<p align="justify">This HOW-TO is organized in three sections. You should read them
sequentially because the later sections build upon the earlier ones.</p>
<ol>
<li>
<p align="justify">The <a href="#sec1">first section</a> explains how to read
the most important standard properties of a Microsoft Office
document. Standard properties are things like title, author, creation
date etc. It is quite likely that you will find here what you need and
don't have to read the other sections.</p>
</li>
<li>
<p align="justify">The <a href="#sec2">second section</a> goes a small step
further and focusses on reading additional standard properties. It also
talks about exceptions that may be thrown when dealing with HPSF and
shows how you can read properties of embedded objects.</p>
</li>
<li>
<p align="justify">The <a href="#sec3">third section</a> tells how to read
non-standard properties. Non-standard properties are application-specific
name/value/type triples.</p>
</li>
</ol>
<anchor id="sec1"></anchor>
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Standard Properties</b></font></font></td>
</tr>
<tr>
<td>
<br>
<note>This section explains how to read
the most important standard properties of a Microsoft Office
document. Standard properties are things like title, author, creation
date etc. Chances are that you will find here what you need and
don't have to read the other sections.</note>
<p align="justify">The first thing you should understand is that properties are stored in
separate documents inside the POI filesystem. (If you don't know what a
POI filesystem is, read its <a href="../poifs/index.html">documentation</a>.) A document in a POI
filesystem is also called a <em>stream</em>.</p>
<p align="justify">The following example shows how to read a POI filesystem's
"title" property. Reading other properties is similar. Consult the API
documentation of <code>org.apache.poi.hpsf.SummaryInformation</code>.</p>
<p align="justify">The standard properties this section focusses on can be
found in a document called <em>\005SummaryInformation</em> in the root of
the POI filesystem. The notation <em>\005</em> in the document's name
means the character with the decimal value of 5. In order to read the
title, an application has to perform the following steps:</p>
<ol>
<li>
<p align="justify">Open the document <em>\005SummaryInformation</em> located in the root
of the POI filesystem.</p>
</li>
<li>
<p align="justify">Create an instance of the class
<code>SummaryInformation</code> from that
document.</p>
</li>
<li>
<p align="justify">Call the <code>SummaryInformation</code> instance's
<code>getTitle()</code> method.</p>
</li>
</ol>
<p align="justify">Sounds easy, doesn't it? Here are the steps in detail.</p>
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="98%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="-1"><font face="Arial,sans-serif"><b>Open the document \005SummaryInformation in the root of the POI filesystem</b></font></font></td>
</tr>
<tr>
<td>
<br>
<p align="justify">An application that wants to open a document in a POI filesystem
(POIFS) proceeds as shown by the following code fragment. (The full
source code of the sample application is available in the
<em>examples</em> section of the POI source tree as
<em>ReadTitle.java</em>.)</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
import java.io.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
// ...
public static void main(String[] args)
throws IOException
{
final String filename = args[0];
POIFSReader r = new POIFSReader();
r.registerListener(new MyPOIFSReaderListener(),
"\005SummaryInformation");
r.read(new FileInputStream(filename));
}</pre>
</td>
</tr>
</table>
</div>
<p align="justify">The first interesting statement is</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>POIFSReader r = new POIFSReader();</pre>
</td>
</tr>
</table>
</div>
<p align="justify">It creates a
<code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
which we shall need to read the POI filesystem. Before the application
actually opens the POI filesystem we have to tell the
<code>POIFSReader</code> which documents we are interested in. In this
case the application should do something with the document
<em>\005SummaryInformation</em>.</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
r.registerListener(new MyPOIFSReaderListener(),
"\005SummaryInformation");</pre>
</td>
</tr>
</table>
</div>
<p align="justify">This method call registers a
<code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
interface specifies the method <code>processPOIFSReaderEvent</code>
which processes a document. The class
<code>MyPOIFSReaderListener</code> implements the
<code>POIFSReaderListener</code> interface and thus the
<code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
calls this method when it finds the <em>\005SummaryInformation</em>
document. (In the sample application <code>MyPOIFSReaderListener</code> is
a static class in the <em>ReadTitle.java</em> source file.)</p>
<p align="justify">Now everything is prepared and reading the POI filesystem can
start:</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>r.read(new FileInputStream(filename));</pre>
</td>
</tr>
</table>
</div>
<p align="justify">The following source code fragment shows the
<code>MyPOIFSReaderListener</code> class and how it retrieves the
title.</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
static class MyPOIFSReaderListener implements POIFSReaderListener
{
public void processPOIFSReaderEvent(POIFSReaderEvent e)
{
SummaryInformation si = null;
try
{
si = (SummaryInformation)
PropertySetFactory.create(e.getStream());
}
catch (Exception ex)
{
throw new RuntimeException
("Property set stream \"" +
e.getPath() + e.getName() + "\": " + ex);
}
final String title = si.getTitle();
if (title != null)
System.out.println("Title: \"" + title + "\"");
else
System.out.println("Document has no title.");
}
}
</pre>
</td>
</tr>
</table>
</div>
<p align="justify">The line</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>SummaryInformation si = null;</pre>
</td>
</tr>
</table>
</div>
<p align="justify">declares a <code>SummaryInformation</code> variable and initializes it
with <code>null</code>. We need an instance of this class to access the
title. The instance is created in a <code>try</code> block:</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>si = (SummaryInformation)
PropertySetFactory.create(e.getStream());</pre>
</td>
</tr>
</table>
</div>
<p align="justify">The expression <code>e.getStream()</code> returns the input stream
containing the bytes of the property set stream named
<em>\005SummaryInformation</em>. This stream is passed into the
<code>create</code> method of the factory class
<code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
less safe to cast this result to <code>SummaryInformation</code>, a
convenience class with methods like <code>getTitle()</code>,
<code>getAuthor()</code> etc.</p>
<p align="justify">The <code>PropertySetFactory.create</code> method may throw all sorts
of exceptions. We'll deal with them in the next sections. For now we just
catch all exceptions and throw a <code>RuntimeException</code>
containing the message text of the original exception.</p>
<p align="justify">If all goes well, the sample application retrieves the title and prints
it to the standard output. As you can see you must be prepared for the
case that the POI filesystem does not have a title.</p>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>final String title = si.getTitle();
if (title != null)
System.out.println("Title: \"" + title + "\"");
else
System.out.println("Document has no title.");</pre>
</td>
</tr>
</table>
</div>
<p align="justify">Please note that a Microsoft Office document does not necessarily
contain the <em>\005SummaryInformation</em> stream. The documents created
by the Microsoft Office suite have one, as far as I know. However, an
Excel spreadsheet exported from StarOffice 5.2 won't have a
<em>\005SummaryInformation</em> stream. In this case the application
does not throw an exception; the <code>POIFSReader</code> simply never calls the
<code>processPOIFSReaderEvent</code> method. You have been warned!</p>
</td>
</tr>
</table>
</div>
<br>
</td>
</tr>
</table>
</div>
<br>
<anchor id="sec2"></anchor>
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Additional Standard Properties, Exceptions And Embedded Objects</b></font></font></td>
</tr>
<tr>
<td>
<br>
<note>This section focusses on reading additional standard properties. It
also talks about exceptions that may be thrown when dealing with HPSF and
shows how you can read properties of embedded objects.</note>
<p align="justify">A couple of <em>additional standard properties</em> are not
contained in the <em>\005SummaryInformation</em> stream explained above,
for example a document's category or the number of multimedia clips in a
PowerPoint presentation. Microsoft has invented an additional stream named
<em>\005DocumentSummaryInformation</em> to hold these properties. With two
minor exceptions you can proceed exactly as described above to read the
properties stored in <em>\005DocumentSummaryInformation</em>:</p>
<ul>
<li>
<p align="justify">Instead of <em>\005SummaryInformation</em> use
<em>\005DocumentSummaryInformation</em> as the stream's name.</p>
</li>
<li>
<p align="justify">Replace all occurrences of the class
<code>SummaryInformation</code> by
<code>DocumentSummaryInformation</code>.</p>
</li>
</ul>
<p align="justify">And of course you cannot call <code>getTitle()</code> because
<code>DocumentSummaryInformation</code> has different query methods. See
the API documentation for the details!</p>
<p align="justify">In the previous section the application simply caught all
<em>exceptions</em> and was in no way interested in any
details. However, a real application will likely want to know what went
wrong and act appropriately. Besides any I/O exceptions there are three
exceptions specific to HPSF or POI that you should know about:</p>
<dl>
<dt>
<code>NoPropertySetStreamException</code></dt>
<dd>
<p align="justify">This exception is thrown if the application tries to create a
<code>PropertySet</code> or one of its subclasses
<code>SummaryInformation</code> and
<code>DocumentSummaryInformation</code> from a stream that is not a
property set stream. A faulty property set stream counts as not being a
property set stream at all. An application should be prepared to deal
with this case even if it opens only streams named
<em>\005SummaryInformation</em> or
<em>\005DocumentSummaryInformation</em>. These are just names. A
stream's name by itself does not ensure that the stream contains the
expected contents or that these contents are correct.</p>
</dd>
<dt>
<code>UnexpectedPropertySetTypeException</code>
</dt>
<dd>
<p align="justify">This exception is thrown if a certain type of property set is
expected somewhere (e.g. a <code>SummaryInformation</code> or
<code>DocumentSummaryInformation</code>) but the provided property
set is not of that type.</p>
</dd>
<dt>
<code>MarkUnsupportedException</code>
</dt>
<dd>
<p align="justify">This exception is thrown if an input stream that is to be parsed
into a property set does not support the
<code>InputStream.mark(int)</code> operation. The POI filesystem uses
the <code>DocumentInputStream</code> class which does support this
operation, so you are safe here. However, if you read a property set
stream from another kind of input stream things may be
different.</p>
</dd>
</dl>
<p align="justify">Many Microsoft Office documents contain <em>embedded
objects</em>, for example an Excel sheet on a page in a Word
document. Embedded objects may have property sets of their own. An
application can open these property set streams as described above. The
only difference is that they are not located in the POI filesystem's root
but in a nested directory instead. Just register a
<code>POIFSReaderListener</code> for the property set streams you are
interested in. For example, the <em>POIBrowser</em> application in the
contrib section tries to open each and every document in a POI filesystem
as a property set stream. If this operation succeeds, it displays the
properties.</p>
</td>
</tr>
</table>
</div>
<br>
<anchor id="sec3"></anchor>
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Non-Standard Properties</b></font></font></td>
</tr>
<tr>
<td>
<br>
<note>This section tells how to read
non-standard properties. Non-standard properties are application-specific
name/value/type triples.</note>
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td bgcolor="#c0c0c0"><font size="-1" color="#023264">Write this section!</font></td>
</tr>
</table>
</div>
</td>
</tr>
</table>
</div>
<br>
</td>
</tr>


@ -212,7 +212,8 @@
<li>Glen Stampoultzis (glens at apache.org)</li>
<li>Rainer Klute (klute at rainer-klute dot de)</li>
<li>
<a href="http://www.rainer-klute.de/">Rainer Klute</a> (klute at apache dot org)</li>
</ul>


@ -9,9 +9,297 @@
</header>
<body>
<s1 title="How To Use the HPSF APIs">
<p class="todo">TODO: This documentation is still to be written. For the
time being, please see the API documentation (javadocs) of the
<code>org.apache.poi.hpsf</code> package.</p>
<p>This HOW-TO is organized in three sections. You should read them
sequentially because the later sections build upon the earlier ones.</p>
<ol>
<li>
<p>The <link href="#sec1">first section</link> explains how to read
the most important standard properties of a Microsoft Office
document. Standard properties are things like title, author, creation
date etc. It is quite likely that you will find here what you need and
don't have to read the other sections.</p>
</li>
<li>
<p>The <link href="#sec2">second section</link> goes a small step
further and focusses on reading additional standard properties. It also
talks about exceptions that may be thrown when dealing with HPSF and
shows how you can read properties of embedded objects.</p>
</li>
<li>
<p>The <link href="#sec3">third section</link> tells how to read
non-standard properties. Non-standard properties are application-specific
name/value/type triples.</p>
</li>
</ol>
<anchor id="sec1" />
<s2 title="Reading Standard Properties">
<note>This section explains how to read
the most important standard properties of a Microsoft Office
document. Standard properties are things like title, author, creation
date etc. Chances are that you will find here what you need and
don't have to read the other sections.</note>
<p>The first thing you should understand is that properties are stored in
separate documents inside the POI filesystem. (If you don't know what a
POI filesystem is, read its <link
href="../poifs/index.html">documentation</link>.) A document in a POI
filesystem is also called a <strong>stream</strong>.</p>
<p>The following example shows how to read a POI filesystem's
"title" property. Reading other properties is similar. Consult the API
documentation of <code>org.apache.poi.hpsf.SummaryInformation</code>.</p>
<p>The standard properties this section focusses on can be
found in a document called <em>\005SummaryInformation</em> in the root of
the POI filesystem. The notation <em>\005</em> in the document's name
means the character with the decimal value of 5. In order to read the
title, an application has to perform the following steps:</p>
<ol>
<li>
<p>Open the document <em>\005SummaryInformation</em> located in the root
of the POI filesystem.</p>
</li>
<li>
<p>Create an instance of the class
<code>SummaryInformation</code> from that
document.</p>
</li>
<li>
<p>Call the <code>SummaryInformation</code> instance's
<code>getTitle()</code> method.</p>
</li>
</ol>
<p>Sounds easy, doesn't it? Here are the steps in detail.</p>
<s3 title="Open the document \005SummaryInformation in the root of the
POI filesystem">
<p>An application that wants to open a document in a POI filesystem
(POIFS) proceeds as shown by the following code fragment. (The full
source code of the sample application is available in the
<em>examples</em> section of the POI source tree as
<em>ReadTitle.java</em>.)</p>
<source>
import java.io.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;
// ...
public static void main(String[] args)
throws IOException
{
final String filename = args[0];
POIFSReader r = new POIFSReader();
r.registerListener(new MyPOIFSReaderListener(),
"\005SummaryInformation");
r.read(new FileInputStream(filename));
}</source>
<p>The first interesting statement is</p>
<source>POIFSReader r = new POIFSReader();</source>
<p>It creates a
<code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
which we shall need to read the POI filesystem. Before the application
actually opens the POI filesystem we have to tell the
<code>POIFSReader</code> which documents we are interested in. In this
case the application should do something with the document
<em>\005SummaryInformation</em>.</p>
<source>
r.registerListener(new MyPOIFSReaderListener(),
"\005SummaryInformation");</source>
<p>This method call registers a
<code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
interface specifies the method <code>processPOIFSReaderEvent</code>
which processes a document. The class
<code>MyPOIFSReaderListener</code> implements the
<code>POIFSReaderListener</code> interface and thus the
<code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
calls this method when it finds the <em>\005SummaryInformation</em>
document. (In the sample application <code>MyPOIFSReaderListener</code> is
a static class in the <em>ReadTitle.java</em> source file.)</p>
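<p>A side note on the stream name passed to <code>registerListener</code>:
in Java source code <code>\005</code> inside a string literal is an octal
escape, so the name starts with a single character whose value is 5, not
with a backslash followed by digits. The following minimal sketch
illustrates this with plain Java; the class <code>StreamNameDemo</code>
and its members are hypothetical names chosen for this illustration and
are not part of POI.</p>

```java
public class StreamNameDemo
{
    /** Name of the summary information stream; \005 is an octal escape. */
    public static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    /** Returns the numeric code of the first character of a stream name. */
    public static int firstCharCode(String streamName)
    {
        return streamName.charAt(0);
    }

    public static void main(String[] args)
    {
        // The escape yields one character, not the four characters \, 0, 0, 5.
        System.out.println(firstCharCode(SUMMARY_INFORMATION));   // prints 5
        System.out.println(SUMMARY_INFORMATION.length());         // prints 19
        System.out.println(SUMMARY_INFORMATION.substring(1));     // prints SummaryInformation
    }
}
```

<p>Keeping the name in a constant like this is one way to avoid typing
the escape sequence in several places.</p>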
<p>Now everything is prepared and reading the POI filesystem can
start:</p>
<source>r.read(new FileInputStream(filename));</source>
<p>The following source code fragment shows the
<code>MyPOIFSReaderListener</code> class and how it retrieves the
title.</p>
<source>
static class MyPOIFSReaderListener implements POIFSReaderListener
{
public void processPOIFSReaderEvent(POIFSReaderEvent e)
{
SummaryInformation si = null;
try
{
si = (SummaryInformation)
PropertySetFactory.create(e.getStream());
}
catch (Exception ex)
{
throw new RuntimeException
("Property set stream \"" +
e.getPath() + e.getName() + "\": " + ex);
}
final String title = si.getTitle();
if (title != null)
System.out.println("Title: \"" + title + "\"");
else
System.out.println("Document has no title.");
}
}
</source>
<p>The line</p>
<source>SummaryInformation si = null;</source>
<p>declares a <code>SummaryInformation</code> variable and initializes it
with <code>null</code>. We need an instance of this class to access the
title. The instance is created in a <code>try</code> block:</p>
<source>si = (SummaryInformation)
PropertySetFactory.create(e.getStream());</source>
<p>The expression <code>e.getStream()</code> returns the input stream
containing the bytes of the property set stream named
<em>\005SummaryInformation</em>. This stream is passed into the
<code>create</code> method of the factory class
<code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
less safe to cast this result to <code>SummaryInformation</code>, a
convenience class with methods like <code>getTitle()</code>,
<code>getAuthor()</code> etc.</p>
<p>The <code>PropertySetFactory.create</code> method may throw all sorts
of exceptions. We'll deal with them in the next sections. For now we just
catch all exceptions and throw a <code>RuntimeException</code>
containing the message text of the original exception.</p>
<p>If all goes well, the sample application retrieves the title and prints
it to the standard output. As you can see you must be prepared for the
case that the POI filesystem does not have a title.</p>
<source>final String title = si.getTitle();
if (title != null)
System.out.println("Title: \"" + title + "\"");
else
System.out.println("Document has no title.");</source>
<p>Please note that a Microsoft Office document does not necessarily
contain the <em>\005SummaryInformation</em> stream. The documents created
by the Microsoft Office suite have one, as far as I know. However, an
Excel spreadsheet exported from StarOffice 5.2 won't have a
<em>\005SummaryInformation</em> stream. In this case the application
does not throw an exception; the <code>POIFSReader</code> simply never calls the
<code>processPOIFSReaderEvent</code> method. You have been warned!</p>
</s3>
</s2>
<anchor id="sec2"/>
<s2 title="Additional Standard Properties, Exceptions And Embedded Objects">
<note>This section focusses on reading additional standard properties. It
also talks about exceptions that may be thrown when dealing with HPSF and
shows how you can read properties of embedded objects.</note>
<p>A couple of <strong>additional standard properties</strong> are not
contained in the <em>\005SummaryInformation</em> stream explained above,
for example a document's category or the number of multimedia clips in a
PowerPoint presentation. Microsoft has invented an additional stream named
<em>\005DocumentSummaryInformation</em> to hold these properties. With two
minor exceptions you can proceed exactly as described above to read the
properties stored in <em>\005DocumentSummaryInformation</em>:</p>
<ul>
<li><p>Instead of <em>\005SummaryInformation</em> use
<em>\005DocumentSummaryInformation</em> as the stream's name.</p></li>
<li><p>Replace all occurrences of the class
<code>SummaryInformation</code> by
<code>DocumentSummaryInformation</code>.</p></li>
</ul>
<p>And of course you cannot call <code>getTitle()</code> because
<code>DocumentSummaryInformation</code> has different query methods. See
the API documentation for the details!</p>
<p>In the previous section the application simply caught all
<strong>exceptions</strong> and was in no way interested in any
details. However, a real application will likely want to know what went
wrong and act appropriately. Besides any I/O exceptions there are three
exceptions specific to HPSF or POI that you should know about:</p>
<dl>
<dt><code>NoPropertySetStreamException</code></dt>
<dd><p>This exception is thrown if the application tries to create a
<code>PropertySet</code> or one of its subclasses
<code>SummaryInformation</code> and
<code>DocumentSummaryInformation</code> from a stream that is not a
property set stream. A faulty property set stream counts as not being a
property set stream at all. An application should be prepared to deal
with this case even if it opens only streams named
<em>\005SummaryInformation</em> or
<em>\005DocumentSummaryInformation</em>. These are just names. A
stream's name by itself does not ensure that the stream contains the
expected contents or that these contents are correct.</p></dd>
<dt><code>UnexpectedPropertySetTypeException</code></dt>
<dd><p>This exception is thrown if a certain type of property set is
expected somewhere (e.g. a <code>SummaryInformation</code> or
<code>DocumentSummaryInformation</code>) but the provided property
set is not of that type.</p></dd>
<dt><code>MarkUnsupportedException</code></dt>
<dd><p>This exception is thrown if an input stream that is to be parsed
into a property set does not support the
<code>InputStream.mark(int)</code> operation. The POI filesystem uses
the <code>DocumentInputStream</code> class which does support this
operation, so you are safe here. However, if you read a property set
stream from another kind of input stream things may be
different.</p></dd>
</dl>
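<p>The mark support mentioned under <code>MarkUnsupportedException</code>
can be checked with plain <code>java.io</code> classes before a stream is
handed to HPSF. The following is a minimal sketch under stated
assumptions: <code>MarkSupportDemo</code> and
<code>ensureMarkSupported</code> are hypothetical names that are not part
of POI, and whether wrapping a stream in a
<code>BufferedInputStream</code> is sufficient for your HPSF version is
something to verify against the API documentation.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class MarkSupportDemo
{
    /**
     * Returns the given stream if it supports mark/reset, otherwise a
     * BufferedInputStream around it. (Hypothetical helper, not part of POI.)
     */
    public static InputStream ensureMarkSupported(InputStream in)
    {
        if (in.markSupported())
            return in;
        return new BufferedInputStream(in);
    }

    public static void main(String[] args)
    {
        // A bare InputStream subclass inherits markSupported() == false.
        InputStream plain = new InputStream()
        {
            public int read()
            {
                return -1;
            }
        };
        InputStream inMemory = new ByteArrayInputStream(new byte[] {1, 2, 3});

        System.out.println(plain.markSupported());                        // false
        System.out.println(inMemory.markSupported());                     // true
        System.out.println(ensureMarkSupported(plain).markSupported());   // true
    }
}
```

<p>A stream obtained from the POI filesystem should not need this
treatment, since the text above states that
<code>DocumentInputStream</code> supports the operation.</p>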
<p>Many Microsoft Office documents contain <strong>embedded
objects</strong>, for example an Excel sheet on a page in a Word
document. Embedded objects may have property sets of their own. An
application can open these property set streams as described above. The
only difference is that they are not located in the POI filesystem's root
but in a nested directory instead. Just register a
<code>POIFSReaderListener</code> for the property set streams you are
interested in. For example, the <em>POIBrowser</em> application in the
contrib section tries to open each and every document in a POI filesystem
as a property set stream. If this operation succeeds, it displays the
properties.</p>
</s2>
<anchor id="sec3"/>
<s2 title="Reading Non-Standard Properties">
<note>This section tells how to read
non-standard properties. Non-standard properties are application-specific
name/value/type triples.</note>
<fixme author="Rainer Klute">Write this section!</fixme>
</s2>
</s1>
</body>
</document>