[MRM-127] update intended design

git-svn-id: https://svn.apache.org/repos/asf/maven/repository-manager/trunk@425872 13f79535-47bb-0310-9956-ffa450edef68
2006-07-26 22:06:23 +00:00 · 2006-07-26 22:06:23 +00:00 · 01cde262dd
parent 0d68e24a2e
commit 01cde262dd
1 changed file with 45 additions and 27 deletions
--- a/maven-repository-indexer/src/site/apt/design.apt
+++ b/maven-repository-indexer/src/site/apt/design.apt
@@ -11,16 +11,18 @@ Indexer Design
  <<Note: The current indexer design is under review. This document will grow into what it should be, and the code and
  tests refactored to match>>
  ~~TODO: separate API design from Lucene implementation design
 * Standard Artifact Index
  We currently want to index these elements from the repository:
-    * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename,
-      checksums (md5, sha1) and size
+    * for each artifact file: the artifact ID, version, group ID, classifier, type (extension), filename (including path
+      from the repository base), checksums (md5, sha1) and size
    * for each artifact POM: the packaging, licenses, dependencies, build plugins, reporting plugins
-    * plugin prefix from the repository metadata (in the future, more may be indexed)
+    * plugin prefix
    * Java classes and packages within a JAR artifact (delimited by \n)
@@ -32,23 +34,42 @@ Indexer Design
  record may need to be updated when different files that are related to the same artifact are discovered (ie, the
  POM, or for plugins the metadata that contains their prefix).
-  Records in the index are generally keyed by their dependency conflict ID (ie, a combination of group, artifact,
-  version, type  and classifier). The exception to this rule is the POM: if an entry already exists with a different
-  type but the same group, artifact, version and no classifier, then a POM entry is not added and the model fields are
-  applied to the existing entry. Conversely, if a POM is added first and an artifact with the same group, artifact,
-  version and no classifier is later added then it overwrites the record of the POM.
-  The above process, especially with regard to the handling of the POM, should be much simpler if the discoverer is
-  able to associate a POM to the artifact instead of feeding them in separately as it does at present.
-  While some of the information stored is specific to a particular type of file, it is all maintained in a single index
-  for simplicity. In the future, if the content of the various documents diverges greatly, it may be split into separate
-  indexes. In that case, we may consider using Lucene's
-  {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} multiple index
-  searching capabilities}}.
-  Currently, the discoverer returns POMs as separate artifact entries to the actual artifact, and any derived artifacts
-  in the repository. To accommodate this, when indexed
+  To simplify this, the process for discovery is as follows:
+    * Discovered artifacts will read the related POM and metadata from the repository to index, rather than relying on
+      it being discovered. This ensures that partial discovery still yields correct results in all cases, and it is
+      possible to construct the entire record without having to read back from the index.
+    * POMs that do not have a packaging of POM are not sent to the indexer.
+  The result of this process is that updates to a POM or repository metadata and not the corresponding artifact(s) will
+  not update the index. As POMs should not be modified, this will not be a major concern. Likewise, updates to metadata
+  will only accompany updates to the artifact itself, so will not cause a problem.
+  The above case may have a problem if the discovery happens during the middle of a deployment outside of the
+  repository manager (where the artifact is present, but the metadata or POM is not). To avoid such cases, the
+  discoverer should only detect changes more than a minute old (this blackout should be configurable).
+  Other techniques were considered:
+    * Processing each artifact file individually, updating each record as needed.  This would result in having to read
+      back each index record before writing. This is quite costly in Lucene as it would be "read, delete, add". You
+      must have a reader and writer open for that process, and it greatly complicates the code.
+    * Have three indices, one for each. This would complicate searching (and may affect ranking of results, though this
+      was not analysed). While Lucene is
+      {{{http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-b11296f9e7b2a5e7496d67118d0a5898f2fd9823} capable of
+      searching multiple indices}}, it is expected that the results would be in the form of a list of separate records
+      rather than the "table join" this effectively is. A similar derivative of this technique would be to store
+      everything in one index, using a field (previously, doctype) to identify each record.
+  Records in the index are keyed by their path from the repository root. While this is longer than using the
+  dependency conflict ID, Lucene cannot delete by a combination of terms, so would require storing an additional
+  field in the index where the file already exists.
+  The plugin prefix can be found either from inside the plugin JAR (<<<META-INF/maven/plugin.xml>>>), or from the
+  repository metadata for the plugin's group. For simplicity, the first approach will be used. This means at present
+  there is no need to index the repository metadata, however that may be considered in future.
  Note that archetypes currently don't have a packaging associated with them in Maven, so it is not recorded in the POM.
  However, to be able to search by this type, the indexer will look for a <<<META-INF/maven/archetype.xml>>> file, and
@@ -83,7 +104,7 @@ Indexer Design
    * <<<m>>>: md5 checksum of the JAR
-  Only JARs are indexed at present.
+  Only JARs are indexed at present. The JAR filename is used as the key for later deleting entries.
 * Searching
@@ -92,9 +113,13 @@ Indexer Design
  Some features that will be available:
-    * <Search by a particular field (exact match)>: This would be needed for search by checksum
-    * <Search in a range of field values>: This would be needed for searching based on update time
+    * <Search through most fields for a particular keyword>: the general case described above.
+    * <Search by a particular field (exact match)>: This would be needed for search by checksum.
+    * <Search in a range of field values>: This would be needed for searching based on update time. Note that in
+      Lucene it may be better to search by other fields (or return all), and then filter the results by dates rather
+      than making dates part of a search query.
    * <Limit search to particular fields>: It will be useful to only search Java classes and packages, for example
@@ -102,10 +127,3 @@ Indexer Design
  reasons. It should not have to read any metadata files or properties of files such as size and checksum from the disk.
  This enables searching a repository remotely without having the physical repository available, which is useful for
  IDE integration among other things.
- * Limitations
-  Currently, because the POM and artifacts are fed in separately, there is no way to associate an artifact with a
-  classifier to its POM, meaning there is less information about it in the index. It may be best that this occurs by
-  design - it seems that while it is desirable to search by classifier you only want to find the main artifact for
-  browsing and see the derived artifact listed under that. How this evolves should be carefully considered.
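
The updated design text says the discoverer should only detect changes more than a minute old, with a configurable blackout. A minimal sketch of that check follows. The class name BlackoutFilter and its method are hypothetical and are not taken from the repository manager code base; the sketch assumes a file's last-modified time is available as an Instant.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical helper (not part of the actual code base): decides whether a
 * file is old enough to be picked up by discovery.
 */
final class BlackoutFilter {
    private final Duration blackout;

    /** @param blackout configurable blackout period, e.g. one minute */
    BlackoutFilter(Duration blackout) {
        if (blackout.isNegative()) {
            throw new IllegalArgumentException("blackout must not be negative");
        }
        this.blackout = blackout;
    }

    /**
     * Returns true only when the file was modified strictly more than the
     * blackout period before 'now', so files from an in-progress deployment
     * are skipped until a later discovery run.
     */
    boolean isEligible(Instant lastModified, Instant now) {
        Duration age = Duration.between(lastModified, now);
        return age.compareTo(blackout) > 0;
    }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the method, is a choice made here only to keep the check easy to test.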