Update HWPF documentation to include the newly added word 6/95 text extraction support, as well as mention XWPF + Microsoft spec docs

git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@959384 13f79535-47bb-0310-9956-ffa450edef68
Nick Burch 2010-06-30 17:40:33 +00:00
parent fd922298ef
commit f71a66b0ab
4 changed files with 58 additions and 32 deletions

@ -35,8 +35,12 @@
<section><title>Overview</title>
<p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format
to pure Java. It <em>does not</em> support the new Word 2007 .docx
file format, which is not OLE2 based.</p>
to pure Java. It also provides limited read only support for the older
Word 6 and Word 95 file formats.</p>
<p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>.
Whilst HWPF and XWPF provide similar features, there is not a common
interface across the two of them at this time.</p>
<p>HWPF is still in early development. It is in the <link
href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/">
@ -53,6 +57,20 @@
code.
</p>
<section>
<title>XWPF Patches Required!</title>
<p>At the moment, XWPF covers many common use cases for reading and writing
.docx files. Whilst this is a great thing, it does mean that XWPF does
everything that the current POI committers need it to do, and so none of
the committers are actively adding new features.</p>
<p>If you come across a feature in XWPF that you need, and isn't currently
there, please do send in a patch to add the extra functionality! More details
on contributing patches are available on the <link
href="../getinvolved/index.html">"Contribution to POI" page</link>.</p>
</section>
<section>
<title>HWPF Pointman Needed!</title>
@ -65,12 +83,12 @@
<p>If <strong>you</strong> are interested in becoming the new HWPF
pointman, you should look into the Microsoft Word internals. A good
starting point seems to be Ryan Ackley's <link
href="docoverview.html">overview</link>. This document contains a link to
a detailled Word format description you can find somewhere at
<link href="http://www.wotsit.org/">http://www.wotsit.org/</link>. Please
do not contact Ryan Ackley directly, because he is working for a company
now that signed a NDA with Microsoft and thus he will be no longer able to
answer questions.</p>
href="docoverview.html">overview</link>. Full details on the Word format
are available from
<link href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">Microsoft</link>,
but the documentation can be a little hard to get into at first... Try reading the
<link href="docoverview.html">overview</link> first, and looking at the existing
code, then finally look up the documentation for specific missing features.</p>
<p>As a first step you should familiarize yourself with the source code,
examples, test cases, and the HWPF patches available at <link
@ -88,13 +106,14 @@
</ul>
<p>When you start coding, you will not yet have write access to the
CVS repository. Please submit your patches to <link
SVN repository. Please submit your patches to <link
href="http://issues.apache.org/">Bugzilla</link> and nag <link
href="mailto:klute@apache.org">Rainer Klute</link> until he commits
them. Besides the actual checking in of HWPF patches Rainer will also do
some minor reviews now and then of your source code patches, test cases
and documentation to help ensure software quality. But most of the time
you will be on your own.</p>
href="mailto:dev@poi.apache.org">the dev list</link> until someone commits
them. Besides the actual checking in of HWPF patches, current POI
committers will also do some minor reviews now and then of your source code
patches, test cases and documentation to help ensure software quality. But
most of the time you will be on your own. However, anyone offering useful
contributions over a period of time will be offered committership!</p>
<p>Please do not forget to write <link
href="http://www.junit.org/">JUnit</link> test cases and documentation!
@ -102,15 +121,9 @@
consider that other contributors should be able to understand your source
code easily. If you need any help getting started with JUnit test cases
for HWPF, please ask on the developers' mailing list! If you show that you
are prepared to stick at it you will most likely be given CVS commit
access.</p>
<p><strong>Important:</strong> It is legally vital for POI that you have
never seen any documentation or specification from Microsoft that required
you or your employer to sign an NDA to get it. Please do read the <link
href="../getinvolved/index.html">"Contribution to POI" page</link> for
details! This page also contains further information for you to start POI
development.</p>
are prepared to stick at it you will most likely be given SVN commit
access. See <link href="../getinvolved/index.html">"Contribution to POI" page</link>
for more details and help getting started.</p>
<p>Of course we will help you as best as we can. However, presently there
is no committer who is really familiar with the Word format, so you'll be

@ -86,7 +86,8 @@
provide this functionality. Examples include: <link href="http://xml.apache.org/cocoon">Cocoon</link> for
which there are serializers for HSSF;
<link href="http://www.openoffice.org">Open Office.org</link> with whom we collaborate in documenting the
XLS format; and <link href="http://lucene.apache.org/">Lucene</link>
XLS format; and <link href="http://tika.apache.org/">Tika</link> /
<link href="http://lucene.apache.org/">Lucene</link>,
for which we provide format interpreters. When practical, we donate
components directly to those projects for POI-enabling them.
</p>

@ -50,14 +50,16 @@
<section><title>HWPF and XWPF for Word Documents</title>
<p>
HWPF is our port of the Microsoft Word 97 (-2003) file format to pure
Java. It supports read, and limited write capabilities. Please see <link
href="./hwpf/index.html">the HWPF project page for more
Java. It supports read, and limited write capabilities. It also provides
simple text extraction support for the older Word 6 and Word 95 formats.
Please see <link href="./hwpf/index.html">the HWPF project page for more
information</link>. This component remains in early stages of
development. It can already read and write simple files.
</p>
<p>
We are also working on the XWPF for the WordprocessingML (2007+) format from the
OOXML specification.
OOXML specification. This provides read and write support for simpler
files, along with text extraction capabilities.
</p>
</section>
<section><title>HSLF and XSLF for PowerPoint Documents</title>
@ -108,8 +110,8 @@
<section><title>HSMF for Outlook Messages</title>
<p>
HSMF is our port of the Microsoft Outlook message file format to pure
Java. It currently only some of the textual content of MSG files.
Further support and documentation is expected over the comming weeks and months.
Java. It currently only handles some of the textual content of MSG files, and
some attachments. Further support and documentation is coming in slowly.
For now, users are advised to consult the unit tests for example use.
Please see <link href="./hsmf/index.html">the HSMF project page for more
information</link>.

@ -81,11 +81,15 @@
</section>
<section><title>Word</title>
<p>For .doc files, in scratchpad there is
<p>For .doc files from Word 97 - Word 2003, in scratchpad there is
<em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
return text for your document. Those using POI 3.5 can also use
return text for your document.</p>
<p>Those using POI 3.7 can also extract simple textual content from
older Word 6 and Word 95 files, using the scratchpad class
<em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
<p>Since POI 3.5, it is possible to use
<em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>, to perform
a similar task for .docx files.</p>
text extraction for .docx files.</p>
</section>
<section><title>PowerPoint</title>
@ -97,6 +101,12 @@
perform a similar task for .pptx files.</p>
</section>
<section><title>Publisher</title>
<p>For .pub files, in scratchpad there is
<em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which
will return text for your file.</p>
</section>
<section><title>Visio</title>
<p>For .vsd files, in scratchpad there is
<em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which