[MRM-141] remove irrelevant documentation

git-svn-id: https://svn.apache.org/repos/asf/maven/archiva/trunk@591958 13f79535-47bb-0310-9956-ffa450edef68
Brett Porter 2007-11-05 11:05:08 +00:00
parent 7474f81764
commit ed87e673fe
2 changed files with 0 additions and 243 deletions


@ -1,153 +0,0 @@
-----
Indexer Design
-----
Brett Porter
-----
25 July 2006
-----
~~ Copyright 2006 The Apache Software Foundation.
~~
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~ NOTE: For help with the syntax of this file, see:
~~ http://maven.apache.org/guides/mini/guide-apt-format.html
Indexer Design
<<Note: The current indexer design is under review. This document will grow into what it should be, and the code and
tests will be refactored to match>>
~~TODO: separate API design from Lucene implementation design
* Standard Artifact Index
We currently want to index these elements from the repository:
* for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename (including path
from the repository base), checksums (md5, sha1) and size
* for each artifact POM: the packaging, licenses, dependencies, build plugins, reporting plugins
* plugin prefix
* Java classes within a JAR artifact (delimited by \n)
* filenames within an archive (delimited by \n)
* the identifier of the source repository
Each record in the index refers to an artifact. Since the content for a record can come from various sources, the
record may need to be updated when different files that are related to the same artifact are discovered (i.e. the
POM or, for plugins, the metadata that contains their prefix).
To simplify this, the process for discovery is as follows:
* When an artifact is discovered, the related POM and metadata are read from the repository for indexing, rather
than relying on them being discovered separately. This ensures that partial discovery still yields correct results
in all cases, and it is possible to construct the entire record without having to read back from the index.
* POMs whose packaging is not POM are not sent to the indexer.
A consequence of this process is that an update to a POM or to repository metadata that is not accompanied by an
update to the corresponding artifact(s) will not update the index. As POMs should not be modified, this is not
expected to be a major concern. Likewise, updates to metadata will only accompany updates to the artifact itself, so
they will not cause a problem.
The above process may run into problems if discovery happens in the middle of a deployment made outside of the
repository manager (where the artifact is present, but the metadata or POM is not yet). To avoid such cases, the
discoverer should only detect changes that are more than a minute old (this blackout period should be configurable).
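The blackout check could be sketched as follows. This is only an illustration: the class name <<<DiscoveryBlackout>>> and its method are hypothetical and are not taken from the Archiva code base.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper, not part of the Archiva code base. */
final class DiscoveryBlackout {
    private final Duration blackout;

    DiscoveryBlackout(Duration blackout) {
        this.blackout = blackout;
    }

    /** Returns true when the file was last modified strictly before (now - blackout). */
    boolean isEligible(Instant lastModified, Instant now) {
        return lastModified.isBefore(now.minus(blackout));
    }
}
```

With a blackout of one minute, a file modified 30 seconds ago is skipped on this pass and picked up on a later one.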
Other techniques were considered:
* Processing each artifact file individually, updating each record as needed. This would result in having to read
back each index record before writing. This is quite costly in Lucene as it would be "read, delete, add". You
must have a reader and writer open for that process, and it greatly complicates the code.
* Have three indices, one for each source of information (artifact file, POM and repository metadata). This would
complicate searching (and may affect ranking of results, though this was not analysed). While Lucene is
{{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} capable of
searching multiple indices}}, it is expected that the results would be in the form of a list of separate records
rather than the "table join" this effectively is. A derivative of this technique would be to store
everything in one index, using a field (previously, doctype) to identify each record.
Records in the index are keyed by their path from the repository root. While this is longer than the
dependency conflict ID, Lucene cannot delete by a combination of terms, so keying by the conflict ID would require
storing an additional field in the index, whereas the file path is already stored.
The plugin prefix could be found either from inside the plugin JAR (<<<META-INF/maven/plugin.xml>>>), or from the
repository metadata for the plugin's group. For simplicity, the first approach will be used. This means at present
there is no need to index the repository metadata, however that may be considered in future.
Note that archetypes currently don't have a packaging associated with them in Maven, so it is not recorded in the POM.
However, to be able to search by this type, the indexer will look for a <<<META-INF/maven/archetype.xml>>> file, and
if found set its packaging to <<<maven-archetype>>>. In the future, this handling will be deprecated as the POMs
can start using the appropriate packaging.
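The archetype rule can be sketched as below. The class <<<PackagingResolver>>> is hypothetical; it assumes the caller has already listed the entry names of the archive.

```java
import java.util.Collection;

/** Hypothetical helper, not part of the Archiva code base. */
final class PackagingResolver {
    static final String ARCHETYPE_DESCRIPTOR = "META-INF/maven/archetype.xml";
    static final String ARCHETYPE_PACKAGING = "maven-archetype";

    private PackagingResolver() {
    }

    /** Returns "maven-archetype" when the archive holds an archetype descriptor, else the POM packaging. */
    static String resolve(String pomPackaging, Collection<String> archiveEntryNames) {
        if (archiveEntryNames.contains(ARCHETYPE_DESCRIPTOR)) {
            return ARCHETYPE_PACKAGING;
        }
        return pomPackaging;
    }
}
```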
The index is shared among multiple repositories. The source repository is recorded in the index record. The
discovery/conversion/reporting mechanisms are expected to deal with duplicates before reaching the indexer, so if the
indexer encounters an artifact from a repository other than the one it was originally added from, it will simply
replace the record.
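The replace-on-duplicate behaviour can be pictured with an in-memory map standing in for the index. This is a sketch only: <<<SharedIndexSketch>>> and <<<IndexEntry>>> are hypothetical names, and the real index is a Lucene index rather than a map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the shared index. */
final class SharedIndexSketch {
    /** Hypothetical record holding only the key and the source repository. */
    static final class IndexEntry {
        final String path;
        final String repositoryId;

        IndexEntry(String path, String repositoryId) {
            this.path = path;
            this.repositoryId = repositoryId;
        }
    }

    private final Map<String, IndexEntry> entriesByPath = new HashMap<>();

    /** Adds the entry, replacing any existing entry with the same path. */
    void addOrReplace(IndexEntry entry) {
        entriesByPath.put(entry.path, entry);
    }

    IndexEntry find(String path) {
        return entriesByPath.get(path);
    }

    int size() {
        return entriesByPath.size();
    }
}
```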
When indexing metadata from a POM, the POM should be loaded using the Maven project builder so that inheritance and
interpolation are performed. This ensures that the record is as complete as possible, and that searching by
fields that are inherited will reveal both the parent and the children in the search results.
* Reduced Size Index
An additional index is maintained by the repository manager in the
{{{../apidocs/org/apache/maven/archiva/indexing/MinimalArtifactIndexRecord.html} MinimalIndex}} class. This
indexes all of the same artifacts as the first index, but stores them with shorter field names and less information to
maintain a smaller size. This index is appropriate for use by certain clients such as IDE integration for fast
searching. For a fuller interface to the repository information, the integration should use the XMLRPC interface.
The following fields are in the reduced index:
* <<<j>>>: The JAR filename
* <<<s>>>: The JAR size
* <<<d>>>: The last modified timestamp
* <<<c>>>: A list of classes in the JAR (\n delimited)
* <<<m>>>: md5 checksum of the JAR
* <<<pk>>>: the primary key of the artifact
Only JARs are indexed at present. The JAR filename is used as the key for later deleting entries.
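A record of the reduced index could be assembled as in the sketch below. The class <<<MinimalRecordSketch>>> is hypothetical, and storing the size and timestamp as decimal strings is an assumption; only the field names come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the Archiva code base. */
final class MinimalRecordSketch {
    private MinimalRecordSketch() {
    }

    /** Maps the values of one JAR onto the short field names of the reduced index. */
    static Map<String, String> toFields(String jarFilename, long size, long lastModifiedMillis,
                                        List<String> classes, String md5, String primaryKey) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("j", jarFilename);
        fields.put("s", Long.toString(size));               // assumption: decimal string
        fields.put("d", Long.toString(lastModifiedMillis)); // assumption: epoch milliseconds
        fields.put("c", String.join("\n", classes));        // classes are \n delimited
        fields.put("m", md5);
        fields.put("pk", primaryKey);
        return fields;
    }
}
```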
* Searching
Searching will be reasonably flexible, though the general use case will be to enter a single parsed query that is
applied to all fields in the index.
Some features that will be available:
* <Search through most fields for a particular keyword>: the general case described above.
* <Search by a particular field (exact match)>: This would be needed for search by checksum.
* <Search in a range of field values>: This would be needed for searching based on update time. Note that in
Lucene it may be better to search by other fields (or return all), and then filter the results by dates rather
than making dates part of a search query.
* <Limit search to particular fields>: It will be useful to only search Java classes and packages, for example.
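The suggestion to filter results by date after searching, rather than making dates part of the query, could look like the sketch below. <<<DateFilterSketch>>> and <<<Hit>>> are hypothetical; the real result type is not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical post-search filter, not part of the Archiva code base. */
final class DateFilterSketch {
    /** Hypothetical search hit carrying the stored last modified timestamp. */
    static final class Hit {
        final String path;
        final long lastModifiedMillis;

        Hit(String path, long lastModifiedMillis) {
            this.path = path;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    private DateFilterSketch() {
    }

    /** Keeps only the hits whose timestamp lies in [fromMillis, toMillis], preserving order. */
    static List<Hit> modifiedBetween(List<Hit> hits, long fromMillis, long toMillis) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.lastModifiedMillis >= fromMillis && hit.lastModifiedMillis <= toMillis) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```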
Another thing to note is that the search results should be able to be composed entirely from the index for performance
reasons. It should not have to read any metadata files or properties of files such as size and checksum from the disk.
This enables searching a repository remotely without having the physical repository available, which is useful for
IDE integration among other things.
Note that to be able to do an exact match search, a field must be stored untokenized. For fields where it makes sense
to search both tokenized and untokenized, they will be stored twice. This currently includes: artifact ID, group ID,
and version.
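The effect of storing a field twice can be illustrated without Lucene. In the sketch below the class <<<DualFieldSketch>>> is hypothetical, and the tokenizer (lower-case, split on anything that is not a letter or digit) is an assumption that only approximates what a Lucene analyzer would do.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical illustration of keeping a tokenized and an untokenized copy of a field. */
final class DualFieldSketch {
    private final Map<String, String> exact = new LinkedHashMap<>();
    private final Map<String, List<String>> tokens = new LinkedHashMap<>();

    /** Stores the value once as-is and once split into lower-case tokens. */
    void store(String field, String value) {
        exact.put(field, value);
        tokens.put(field, Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
    }

    /** Exact match against the untokenized copy. */
    boolean matchesExact(String field, String value) {
        return value.equals(exact.get(field));
    }

    /** Keyword match against the tokenized copy. */
    boolean matchesKeyword(String field, String keyword) {
        List<String> fieldTokens = tokens.get(field);
        return fieldTokens != null && fieldTokens.contains(keyword.toLowerCase(Locale.ROOT));
    }
}
```

A keyword search for <<<indexer>>> finds an artifact ID of <<<archiva-indexer>>> through the tokenized copy, while an exact search only matches the full value.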


@ -1,90 +0,0 @@
~~ Copyright 2006 The Apache Software Foundation.
~~
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
~~ NOTE: For help with the syntax of this file, see:
~~ http://maven.apache.org/guides/mini/guide-apt-format.html
ProxyManager
The ProxyManager is designed to be used as a simple object or bean for use by
a command-line application or web application.
Configuration
An instance of a ProxyManager requires a configuration object, called
ProxyConfiguration, that defines its behavior. The ProxyConfiguration is a
Plexus component and can be looked up to get an instance of it. Below is a sample
Plexus lookup statement:
----------
ProxyConfiguration config = (ProxyConfiguration) container.lookup( ProxyConfiguration.ROLE );
----------
Currently, a ProxyConfiguration lookup returns an empty instance, which means
it does not yet have any defaults for how the ProxyManager should behave. So
the next step is to define its behavior explicitly.
----------
// Look up the (initially empty) configuration component from the Plexus container.
ProxyConfiguration config = (ProxyConfiguration) container.lookup( ProxyConfiguration.ROLE );
// Directory in which proxied files are cached.
config.setRepositoryCachePath( "/user/proxy-cache" );
// Register a remote repository that uses the default Maven 2 layout.
ArtifactRepositoryLayout defLayout = new DefaultRepositoryLayout();
ProxyRepository repo1 = new ProxyRepository( "central", "http://www.ibiblio.org/maven2", defLayout );
config.addRepository( repo1 );
----------
The above statements set up the ProxyConfiguration to use the directory
<<</user/proxy-cache>>> as the location of the proxy's repository cache.
They then create a ProxyRepository instance with an id of <<<central>>> that
looks for remote files at ibiblio.org.
Instantiation
To create or retrieve an instance of a ProxyManager, one will need to use the
ProxyManagerFactory.
----------
ProxyManagerFactory factory = (ProxyManagerFactory) container.lookup( ProxyManagerFactory.ROLE );
ProxyManager proxy = factory.getProxyManager( "default", config );
----------
The factory requires two parameters. The first parameter is the proxy_type
that you want to use. The second parameter is the ProxyConfiguration
that was set up above. The proxy_type defines the client that the
ProxyManager is expected to service. Currently, only the <<<default>>>
ProxyManager type is available, and it is defined to be for Maven 2.x clients.
Usage
* The get() method
The ProxyManager get( target ) method is used to retrieve a file by its path.
This method first checks whether the target exists in the cache. If it does
not, the ProxyManager searches all the ProxyRepositories present in its
ProxyConfiguration. When the target path is found, the ProxyManager creates
a copy of it in its cache and returns a File instance of the cached copy.
* The getRemoteFile() method
The ProxyManager getRemoteFile( path ) method is used to force the
ProxyManager to ignore the contents of its cache and search all the
ProxyRepository objects for the specified path and retrieve it when
available. When successful, the ProxyManager creates a copy of the remote
file in its cache and then returns a File instance of the cached copy.
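The difference between the two methods can be sketched as below. This is not the ProxyManager implementation: <<<CachingProxySketch>>> and <<<RemoteSource>>> are hypothetical stand-ins, and throwing an IOException when no repository has the file is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of the cache-first and remote-only lookups. */
final class CachingProxySketch {
    /** Hypothetical stand-in for a ProxyRepository; returns null when the path is absent. */
    interface RemoteSource {
        byte[] fetch(String path) throws IOException;
    }

    private final Path cacheDir;
    private final List<RemoteSource> repositories;

    CachingProxySketch(Path cacheDir, List<RemoteSource> repositories) {
        this.cacheDir = cacheDir;
        this.repositories = repositories;
    }

    /** Serves the cached copy when it exists, otherwise falls back to the remote sources. */
    Path get(String path) throws IOException {
        Path cached = cacheDir.resolve(path);
        return Files.exists(cached) ? cached : getRemoteFile(path);
    }

    /** Ignores the cache, asks each remote source in order and caches the first hit. */
    Path getRemoteFile(String path) throws IOException {
        for (RemoteSource repository : repositories) {
            byte[] content = repository.fetch(path);
            if (content != null) {
                Path cached = cacheDir.resolve(path);
                Files.createDirectories(cached.getParent());
                Files.write(cached, content);
                return cached;
            }
        }
        // Assumption: the original does not say how a miss is reported.
        throw new IOException("Not found in any repository: " + path);
    }
}
```

Calling get() twice for the same path contacts the remote source once, whereas getRemoteFile() contacts it on every call.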