mirror of https://github.com/apache/poi.git
Various new bits of documentation on embeded files and text extraction
git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@647567 13f79535-47bb-0310-9956-ffa450edef68
parent
2ccd3e1311
commit
51859d65ab
|
@ -43,6 +43,7 @@
|
|||
<menu-item label="HDGF" href="hdgf/index.html"/>
|
||||
<menu-item label="POI-Ruby" href="poi-ruby.html"/>
|
||||
<menu-item label="POI-Utils" href="utils/index.html"/>
|
||||
<menu-item label="Text Extraction" href="text-extraction.html"/>
|
||||
<menu-item label="Download" href="ext:download"/>
|
||||
</menu>
|
||||
|
||||
|
|
|
@ -30,6 +30,7 @@
|
|||
<menu label="POIFS">
|
||||
<menu-item label="Overview" href="index.html"/>
|
||||
<menu-item label="How To" href="how-to.html"/>
|
||||
<menu-item label="Embedded Documents" href="embeded.html"/>
|
||||
<menu-item label="File System Documentation" href="fileformat.html"/>
|
||||
<menu-item label="Use Cases" href="usecases.html"/>
|
||||
</menu>
|
||||
|
|
|
@ -0,0 +1,95 @@
|
|||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!--
|
||||
====================================================================
|
||||
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
contributor license agreements. See the NOTICE file distributed with
|
||||
this work for additional information regarding copyright ownership.
|
||||
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
(the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
====================================================================
|
||||
-->
|
||||
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
|
||||
<document>
|
||||
<header>
|
||||
<title>Apache POI - POIFS - Documents embedded in other documents</title>
|
||||
<subtitle>Overview</subtitle>
|
||||
<authors>
|
||||
<person name="Nick Burch" email="nick@apache.org"/>
|
||||
<person name="Yegor Kozlov" email="yegor@apache.org"/>
|
||||
</authors>
|
||||
</header>
|
||||
<body>
|
||||
<section><title>Overview</title>
|
||||
<p>It is possible for one OLE 2 based document to have other
|
||||
OLE 2 documents embedded in it. For example, an Excel file
may have a Word document and a PowerPoint slideshow
embedded as part of it.</p>
|
||||
<p>Normally, these other documents are stored in subdirectories
|
||||
of the OLE 2 (POIFS) filesystem. The exact location of the
|
||||
embedded documents will vary depending on the type of the
|
||||
master document, and the exact directory names will differ
|
||||
each time. To figure out exactly which directory to look
|
||||
in, you will either need to process the appropriate OLE 2
|
||||
linking entry in the master document, or simply iterate
|
||||
over all the directories in the filesystem.</p>
|
||||
<p>As a general rule, the subdirectory of an embedded document
holds the same OLE 2 entries that you would find at the root
of the filesystem if the document were not embedded.</p>
|
||||
|
||||
<section><title>Files embedded in Excel</title>
<p>Excel normally stores embedded files in subdirectories
of the filesystem root. Typically these subdirectories
have names starting with MBD, followed by 8 hex characters.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Files embedded in Word</title>
<p>Word normally stores embedded files in subdirectories
of the ObjectPool directory, itself a subdirectory of the
filesystem root. Typically these subdirectories are named
starting with an underscore, followed by 10 digits.</p>
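<p>As a rough illustration of the naming conventions described for Excel
and Word, the sketch below checks whether a directory name looks like one
of these embedded-document directories. The class and method names
(EmbeddedDirNames, looksLikeExcelEmbedding, looksLikeWordEmbedding) are
hypothetical and not part of POI. The patterns only encode the typical
names mentioned here, so real files may well use other names.</p>

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper, not part of POI: guesses from a POIFS directory
 * name whether the directory typically holds an embedded document.
 */
public class EmbeddedDirNames {
    // Excel: "MBD" followed by 8 hex characters (either case is assumed to be allowed)
    private static final Pattern EXCEL = Pattern.compile("MBD[0-9A-Fa-f]{8}");
    // Word: an underscore followed by 10 digits, found under ObjectPool
    private static final Pattern WORD = Pattern.compile("_[0-9]{10}");

    public static boolean looksLikeExcelEmbedding(String name) {
        return EXCEL.matcher(name).matches();
    }

    public static boolean looksLikeWordEmbedding(String name) {
        return WORD.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Made-up names, purely for demonstration
        String[] names = { "MBD0002F1A3", "_1234567890", "Workbook" };
        for (String name : names) {
            System.out.println(name
                + " excel=" + looksLikeExcelEmbedding(name)
                + " word=" + looksLikeWordEmbedding(name));
        }
    }
}
```

<p>A name that matches neither pattern may still belong to an embedded
document, so iterating over all the directories in the filesystem
remains the fallback.</p>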
|
||||
</section>
|
||||
|
||||
<section><title>Files embedded in PowerPoint</title>
<p>PowerPoint does not normally store embedded files
|
||||
in the OLE2 layer. Instead, they are held within records
|
||||
of the main PowerPoint file. To get at them, you need to
|
||||
find the appropriate data within the PowerPoint stream,
|
||||
and work from that.</p>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section><title>Listing POIFS contents</title>
|
||||
<p>POIFS provides a simple tool for listing the contents of
|
||||
OLE2 files. This lets you see what your POIFS file
contains, and hence whether it has any embedded documents in it,
and where.</p>
|
||||
<p>The tool to use is <em>org.apache.poi.poifs.dev.POIFSLister</em>.
|
||||
This tool may be run from the command line, and takes a filename
|
||||
as its parameter. It will print out all the directories and
|
||||
files contained within the POIFS file.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Opening embedded files</title>
|
||||
<p>All of the POIDocument classes (HSSFWorkbook, HSLFSlideShow,
|
||||
HWPFDocument and HDGFDiagram) can either be opened from
|
||||
a POIFSFileSystem, or from a specific directory within a
|
||||
POIFSFileSystem. So, to open embedded files, simply locate the
appropriate DirectoryNode that represents the subdirectory
of interest, and pass it, together with the overall
POIFSFileSystem, to the constructor.</p>
|
||||
<p>If you want to extract the textual contents of the embedded file,
open the appropriate POIDocument and pass that to
the extractor class, instead of simply passing the POIFSFileSystem
to the extractor.</p>
|
||||
</section>
|
||||
</body>
|
||||
</document>
|
|
@ -0,0 +1,106 @@
|
|||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!--
|
||||
====================================================================
|
||||
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
contributor license agreements. See the NOTICE file distributed with
|
||||
this work for additional information regarding copyright ownership.
|
||||
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
(the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
====================================================================
|
||||
-->
|
||||
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.3//EN" "./dtd/document-v13.dtd">
|
||||
|
||||
<document>
|
||||
<header>
|
||||
<title>POI Text Extraction</title>
|
||||
<authors>
|
||||
<person id="NB" name="Nick Burch" email="nick@apache.org"/>
|
||||
</authors>
|
||||
</header>
|
||||
|
||||
<body>
|
||||
<section><title>Overview</title>
|
||||
<p>POI provides text extraction for all the supported file
|
||||
formats. In addition, it provides access to the metadata
|
||||
associated with a given file, such as title and author.</p>
|
||||
<p>In addition to providing direct text extraction classes,
|
||||
POI works closely with the
|
||||
<link href="http://incubator.apache.org/tika/">Apache Tika</link>
|
||||
text extraction library. Users may wish to simply utilise
|
||||
the functionality provided by Tika.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Common functionality</title>
|
||||
<p>All of the POI text extractors extend from
|
||||
<em>org.apache.poi.POITextExtractor</em>. This provides a common
|
||||
method across all extractors, getText(). For many cases, the text
|
||||
returned will be all you need. However, many extractors do provide
|
||||
more targeted text extraction methods, so you may wish to use
|
||||
these in some cases.</p>
|
||||
<p>All POIFS / OLE 2 based text extractors also extend from
|
||||
<em>org.apache.poi.POIOLE2TextExtractor</em>. This additionally
|
||||
provides common methods to get at the <link href="hpfs/">HPSF
|
||||
document metadata</link>.</p>
|
||||
<p>All OOXML based text extractors (available in POI 3.5 and later)
|
||||
also extend from
|
||||
<em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
|
||||
provides common methods to get at the OOXML metadata.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Text Extractor Factory - POI 3.5 or later</title>
|
||||
<p>A new class in POI 3.5,
<em>org.apache.poi.extractor.ExtractorFactory</em>, provides a
similar function to WorkbookFactory. You simply pass it an
InputStream, a file, a POIFSFileSystem or an OOXML Package. It
figures out the correct text extractor for you, and returns it.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Excel</title>
|
||||
<p>For .xls files, there is
|
||||
<em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will
|
||||
return text, optionally showing formulas instead of their computed values.
|
||||
Those using POI 3.5 can also use
|
||||
<em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
|
||||
a similar task for .xlsx files.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Word</title>
|
||||
<p>For .doc files, in scratchpad there is
|
||||
<em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
|
||||
return text for your document. Those using POI 3.5 can also use
|
||||
<em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>, to perform
|
||||
a similar task for .docx files.</p>
|
||||
</section>
|
||||
|
||||
<section><title>PowerPoint</title>
|
||||
<p>For .ppt files, in scratchpad there is
|
||||
<em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which
|
||||
will return text for your slideshow, optionally restricted to just
|
||||
slide text or notes text. Those using POI 3.5 can also use
|
||||
<em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to
|
||||
perform a similar task for .pptx files.</p>
|
||||
</section>
|
||||
|
||||
<section><title>Visio</title>
|
||||
<p>For .vsd files, in scratchpad there is
|
||||
<em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which
|
||||
will return text for your file.</p>
|
||||
</section>
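<p>The per-format sections above can be summarised as a small lookup from
file extension to the extractor class named for that format. The sketch
below is only such a summary: the ExtractorNames class and its forFileName
method are hypothetical and not part of POI. It returns class names as
plain strings, so it runs without POI on the classpath. The OOXML entries
(.xlsx, .docx, .pptx) assume POI 3.5 or later, as stated above.</p>

```java
import java.util.Locale;

/**
 * Hypothetical helper, not part of POI: maps a file name to the name of
 * the extractor class this page lists for that file type.
 */
public class ExtractorNames {
    public static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot == -1) {
            return null; // no extension, so nothing to look up
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "xls":  return "org.apache.poi.hssf.extractor.ExcelExtractor";
            case "xlsx": return "org.apache.poi.xssf.extractor.XSSFExcelExtractor";
            case "doc":  return "org.apache.poi.hwpf.extractor.WordExtractor";
            case "docx": return "org.apache.poi.xwpf.extractor.XWPFWordExtractor";
            case "ppt":  return "org.apache.poi.hslf.extractor.PowerPointExtractor";
            case "pptx": return "org.apache.poi.xslf.extractor.XSLFPowerPointExtractor";
            case "vsd":  return "org.apache.poi.hdgf.extractor.VisioTextExtractor";
            default:     return null; // format not covered on this page
        }
    }

    public static void main(String[] args) {
        // Made-up file names, purely for demonstration
        System.out.println(forFileName("report.XLS"));
        System.out.println(forFileName("notes.txt"));
    }
}
```

<p>A file extension is only a hint about the real format, so this lookup
is best treated as a quick reference rather than a detection
mechanism.</p>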
|
||||
</body>
|
||||
|
||||
<footer>
|
||||
<legal>
|
||||
Copyright 2005 The Apache Software Foundation or its licensors, as applicable.
|
||||
$Revision: 639487 $ $Date: 2008-03-20 22:31:15 +0000 (Thu, 20 Mar 2008) $
|
||||
</legal>
|
||||
</footer>
|
||||
</document>
|