mirror of https://github.com/apache/lucene.git
Ref Guide: Change gatherNodes to nodes in the docs
parent ab319eab24
commit ce4a7911ea
|
@ -18,13 +18,13 @@
|
|||
// specific language governing permissions and limitations
|
||||
// under the License.
|
||||
|
||||
Graph traversal with streaming expressions uses the `gatherNodes` function to perform a breadth-first graph traversal.
|
||||
Graph traversal with streaming expressions uses the `nodes` function to perform a breadth-first graph traversal.
|
||||
|
||||
The `gatherNodes` function can be combined with the `scoreNodes` function to provide recommendations. `gatherNodes` can also be combined with the wider streaming expression library to perform complex operations on gathered node sets.
|
||||
The `nodes` function can be combined with the `scoreNodes` function to provide recommendations. `nodes` can also be combined with the wider streaming expression library to perform complex operations on gathered node sets.
|
||||
|
||||
`gatherNodes` traversals are distributed within a SolrCloud collection and can span collections.
|
||||
`nodes` traversals are distributed within a SolrCloud collection and can span collections.
|
||||
|
||||
`gatherNodes` is designed for use cases that involve zooming into a neighborhood in the graph and performing precise traversals to gather node sets and aggregations. In these types of use cases `gatherNodes` will often provide sub-second performance. Some sample use cases are provided later in the document.
|
||||
`nodes` is designed for use cases that involve zooming into a neighborhood in the graph and performing precise traversals to gather node sets and aggregations. In these types of use cases `nodes` will often provide sub-second performance. Some sample use cases are provided later in the document.
|
||||
|
||||
[IMPORTANT]
|
||||
====
|
||||
|
@ -34,13 +34,13 @@ This document assumes a basic understanding of graph terminology and streaming e
|
|||
[[GraphTraversal-BasicSyntax]]
|
||||
== Basic Syntax
|
||||
|
||||
We'll start with the most basic syntax and slowly build up more complexity. The most basic syntax for `gatherNodes` is:
|
||||
We'll start with the most basic syntax and slowly build up more complexity. The most basic syntax for `nodes` is:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to")
|
||||
----
|
||||
|
||||
Let's break down this simple expression.
|
||||
|
@ -55,20 +55,20 @@ The walk parameter also accepts a list of root node IDs:
|
|||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to")
|
||||
----
|
||||
|
||||
The `gatherNodes` function above finds all the edges with "johndoe@apache.org" or "janesmith@apache.org" in the `from` field and gathers the `to` field.
|
||||
The `nodes` function above finds all the edges with "johndoe@apache.org" or "janesmith@apache.org" in the `from` field and gathers the `to` field.
|
||||
|
||||
Like all <<streaming-expressions.adoc#streaming-expressions,Streaming Expressions>>, you can execute a `gatherNodes` expression by sending it to the `/stream` handler. For example:
|
||||
Like all <<streaming-expressions.adoc#streaming-expressions,Streaming Expressions>>, you can execute a `nodes` expression by sending it to the `/stream` handler. For example:
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
curl --data-urlencode 'expr=gatherNodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to")' http://localhost:8983/solr/emails/stream
|
||||
curl --data-urlencode 'expr=nodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to")' http://localhost:8983/solr/emails/stream
|
||||
----
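The same request can be prepared from a script. The following is a minimal Python sketch, not part of the reference guide: the helper name `build_stream_request` is hypothetical, and the host, port, and collection are simply copied from the `curl` example above. It only builds the request; sending it requires a running Solr instance.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_stream_request(base_url: str, collection: str, expr: str) -> Request:
    """Build (but do not send) a POST request for a collection's /stream handler.

    Hypothetical helper for illustration; it form-encodes the expression in the
    `expr` parameter, as `curl --data-urlencode` does in the example above.
    """
    url = f"{base_url}/{collection}/stream"
    body = urlencode({"expr": expr}).encode("utf-8")
    return Request(url, data=body, method="POST")

EXPR = ('nodes(emails, '
        'walk="johndoe@apache.org, janesmith@apache.org->from", '
        'gather="to")')

req = build_stream_request("http://localhost:8983/solr", "emails", EXPR)
print(req.full_url)
# To actually send it against a live server:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```

Keeping request construction separate from sending makes the encoding easy to inspect before anything reaches the server.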
|
||||
|
||||
The output of this expression would look like this:
|
||||
|
@ -107,14 +107,14 @@ The output of this expression would look like this:
|
|||
|
||||
All of the tuples returned have the `node` field. The `node` field contains the node IDs gathered by the function. The `collection`, `field`, and `level` of the traversal are also included in the output.
|
||||
|
||||
Notice that the level is "1" for each tuple in the example. The root nodes are level 0 (in the example above, the root nodes are "johndoe@apache.org, janesmith@apache.org"). By default the `gatherNodes` function emits only the _*leaf nodes*_ of the traversal, which is the outer-most node set. To emit the root nodes you can specify the `scatter` parameter:
|
||||
Notice that the level is "1" for each tuple in the example. The root nodes are level 0 (in the example above, the root nodes are "johndoe@apache.org, janesmith@apache.org"). By default the `nodes` function emits only the _*leaf nodes*_ of the traversal, which is the outer-most node set. To emit the root nodes you can specify the `scatter` parameter:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to",
|
||||
scatter="branches, leaves")
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to",
|
||||
scatter="branches, leaves")
|
||||
----
|
||||
|
||||
The `scatter` parameter controls whether to emit the _branches_ with the _leaves_. The root nodes are considered "branches" because they are not the outer-most level of the traversal.
|
||||
|
@ -164,14 +164,14 @@ Now the level 0 root node is included in the output.
|
|||
[[GraphTraversal-Aggregations]]
|
||||
== Aggregations
|
||||
|
||||
`gatherNodes` also supports aggregations. For example:
|
||||
`nodes` also supports aggregations. For example:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to",
|
||||
count(*))
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org, janesmith@apache.org->from",
|
||||
gather="to",
|
||||
count(*))
|
||||
----
|
||||
|
||||
The expression above finds the edges with "\johndoe@apache.org" or "\janesmith@apache.org" in the `from` field and gathers the values from the `to` field. It also aggregates the count for each node ID gathered.
|
||||
|
@ -182,35 +182,35 @@ Edges are uniqued as part of the traversal so the count will *not* reflect the n
|
|||
|
||||
The aggregation functions supported are `count(*)`, `sum(field)`, `min(field)`, `max(field)`, and `avg(field)`. The fields being aggregated should be present in the edges collected during the traversal. Later examples (below) will show aggregations can be a powerful tool for providing recommendations and limiting the scope of traversals.
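To make the aggregation semantics concrete, here is a small in-memory Python sketch. It is an illustration under stated assumptions, not Solr's implementation: the edge list, the numeric `size` field, and the helper `gather_with_aggregates` are all hypothetical, and the sample edges are assumed to be unique already.

```python
from collections import defaultdict

# Hypothetical edge documents; "size" is an invented numeric field on each edge.
EDGES = [
    {"from": "johndoe@apache.org", "to": "ann@apache.org", "size": 10},
    {"from": "janesmith@apache.org", "to": "ann@apache.org", "size": 30},
    {"from": "johndoe@apache.org", "to": "bob@apache.org", "size": 5},
    {"from": "carol@apache.org", "to": "bob@apache.org", "size": 99},
]

def gather_with_aggregates(edges, roots, walk_field, gather_field, metric_field):
    """Group matching edges by gathered node and aggregate one numeric field."""
    groups = defaultdict(list)
    for edge in edges:
        if edge[walk_field] in roots:          # the walk: root IDs matched against a field
            groups[edge[gather_field]].append(edge[metric_field])
    return {
        node: {
            "count(*)": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for node, values in groups.items()
    }

roots = {"johndoe@apache.org", "janesmith@apache.org"}
result = gather_with_aggregates(EDGES, roots, "from", "to", "size")
print(result["ann@apache.org"])
```

The edge from "carol@apache.org" is ignored because it does not start at a root node, so only values found on traversed edges contribute to the aggregates.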
|
||||
|
||||
[[GraphTraversal-NestinggatherNodesfunctions]]
|
||||
== Nesting gatherNodes functions
|
||||
[[GraphTraversal-Nestingnodesfunctions]]
|
||||
== Nesting nodes functions
|
||||
|
||||
The `gatherNodes` function can be nested to traverse deeper into the graph. For example:
|
||||
The `nodes` function can be nested to traverse deeper into the graph. For example:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to")
|
||||
----
|
||||
|
||||
In the example above the outer `gatherNodes` function operates on the node set collected from the inner `gatherNodes` function.
|
||||
In the example above the outer `nodes` function operates on the node set collected from the inner `nodes` function.
|
||||
|
||||
Notice that the inner `gatherNodes` function behaves exactly as the examples already discussed. But the `walk` parameter of the outer `gatherNodes` function behaves differently.
|
||||
Notice that the inner `nodes` function behaves exactly as the examples already discussed. But the `walk` parameter of the outer `nodes` function behaves differently.
|
||||
|
||||
In the outer `gatherNodes` function the `walk` parameter works with tuples coming from an internal streaming expression. In this scenario the `walk` parameter maps the `node` field to the `from` field. Remember that the node IDs collected from the inner `gatherNodes` expression are placed in the `node` field.
|
||||
In the outer `nodes` function the `walk` parameter works with tuples coming from an internal streaming expression. In this scenario the `walk` parameter maps the `node` field to the `from` field. Remember that the node IDs collected from the inner `nodes` expression are placed in the `node` field.
|
||||
|
||||
Put more simply, the inner expression gathers all the people that "\johndoe@apache.org" has emailed. We can call this group the "friends of \johndoe@apache.org". The outer expression gathers all the people that the "friends of \johndoe@apache.org" have emailed. This is a basic friends-of-friends traversal.
|
||||
|
||||
This construct of nesting `gatherNodes` functions is the basic technique for doing a controlled traversal through the graph.
|
||||
This construct of nesting `nodes` functions is the basic technique for doing a controlled traversal through the graph.
|
||||
|
||||
[[GraphTraversal-CycleDetection]]
|
||||
== Cycle Detection
|
||||
|
||||
The `gatherNodes` function performs cycle detection across the entire traversal. This ensures that nodes that have already been visited are not traversed again. Cycle detection is important for both limiting the size of traversals and gathering accurate aggregations. Without cycle detection the size of the traversal could grow exponentially with each hop in the traversal. With cycle detection only new nodes encountered are traversed.
|
||||
The `nodes` function performs cycle detection across the entire traversal. This ensures that nodes that have already been visited are not traversed again. Cycle detection is important for both limiting the size of traversals and gathering accurate aggregations. Without cycle detection the size of the traversal could grow exponentially with each hop in the traversal. With cycle detection only new nodes encountered are traversed.
|
||||
|
||||
Cycle detection *does not* cross collection boundaries. This is because internally the collection name is part of the node ID. For example, the node ID "\johndoe@apache.org" is really `emails/johndoe@apache.org`. When traversing to another collection, "\johndoe@apache.org" will be traversed.
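The effect of cycle detection on a nested, two-level traversal can be sketched in plain Python. This is a simplified single-collection model, not Solr's implementation: the edge list and the helper `hop` are hypothetical, and treating the root as already visited is an assumption of the sketch.

```python
# Hypothetical edge documents in a single "emails" collection.
EMAILS = [
    {"from": "johndoe@apache.org", "to": "ann@apache.org"},
    {"from": "johndoe@apache.org", "to": "bob@apache.org"},
    {"from": "ann@apache.org", "to": "carol@apache.org"},
    {"from": "bob@apache.org", "to": "carol@apache.org"},
    {"from": "bob@apache.org", "to": "johndoe@apache.org"},  # points back at the root
]

def hop(edges, roots, walk_field, gather_field, visited):
    """One traversal level: match roots on walk_field, gather only unseen nodes."""
    gathered = []
    for edge in edges:
        node = edge[gather_field]
        if edge[walk_field] in roots and node not in visited:
            visited.add(node)      # remember the node so later levels skip it
            gathered.append(node)
    return gathered

# Assumption: the level 0 root node is tracked as visited.
visited = {"johndoe@apache.org"}
friends = hop(EMAILS, {"johndoe@apache.org"}, "from", "to", visited)
friends_of_friends = hop(EMAILS, set(friends), "from", "to", visited)
print(friends, friends_of_friends)
```

In the second level, "carol@apache.org" is gathered once even though two edges reach her, and the edge back to "johndoe@apache.org" is skipped because that node was already visited.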
|
||||
|
||||
|
@ -221,10 +221,10 @@ Each level in the traversal can be filtered with a filter query. For example:
|
|||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
fq="body:(solr rocks)",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
fq="body:(solr rocks)",
|
||||
gather="to")
|
||||
----
|
||||
|
||||
In the example above only emails that match the filter query will be included in the traversal. Any Solr query can be included here. So you can do fun things like <<spatial-search.adoc#spatial-search,geospatial queries>>, apply any of the available <<query-syntax-and-parsing.adoc#query-syntax-and-parsing,query parsers>>, or even write custom query parsers to limit the traversal.
|
||||
|
@ -236,10 +236,10 @@ Any streaming expression can be used to provide the root nodes for a traversal.
|
|||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
search(emails, q="body:(solr rocks)", fl="to", sort="score desc", rows="20"),
|
||||
walk="to->from",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
search(emails, q="body:(solr rocks)", fl="to", sort="score desc", rows="20"),
|
||||
walk="to->from",
|
||||
gather="to")
|
||||
----
|
||||
|
||||
The example above provides the root nodes through a search expression. You can also provide arbitrarily complex, nested streaming expressions with joins, etc., to specify the root nodes.
|
||||
|
@ -262,88 +262,88 @@ Look closely at step 2. In large graphs, step 2 can lead to a very large travers
|
|||
. A large traversal that visits millions of unique nodes is slow and takes a lot of memory because cycle detection is tracked in memory.
|
||||
. High frequency nodes are also not useful in determining users with similar tastes. The content that fewer people have viewed provides a more precise recommendation.
|
||||
|
||||
The `gatherNodes` function has the `maxDocFreq` param to allow for filtering out high frequency nodes. The sample code below shows steps 1 and 2 of the recommendation:
|
||||
The `nodes` function has the `maxDocFreq` param to allow for filtering out high frequency nodes. The sample code below shows steps 1 and 2 of the recommendation:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(logs,
|
||||
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:view", qt="/export"),
|
||||
walk="articleID->articleID",
|
||||
gather="userID",
|
||||
fq="action:view",
|
||||
maxDocFreq="10000",
|
||||
count(*))
|
||||
nodes(logs,
|
||||
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:view", qt="/export"),
|
||||
walk="articleID->articleID",
|
||||
gather="userID",
|
||||
fq="action:view",
|
||||
maxDocFreq="10000",
|
||||
count(*))
|
||||
----
|
||||
|
||||
In the example above, the inner search expression searches the `logs` collection and returns all the articles viewed by "user1". The outer `gatherNodes` expression takes all the articles emitted from the inner search expression and finds all the records in the logs collection for those articles. It then gathers and aggregates the users that have read the articles. The `maxDocFreq` parameter limits the articles returned to those that appear in no more than 10,000 log records (per shard). This guards against returning articles that have been viewed by millions of users.
|
||||
In the example above, the inner search expression searches the `logs` collection and returns all the articles viewed by "user1". The outer `nodes` expression takes all the articles emitted from the inner search expression and finds all the records in the logs collection for those articles. It then gathers and aggregates the users that have read the articles. The `maxDocFreq` parameter limits the articles returned to those that appear in no more than 10,000 log records (per shard). This guards against returning articles that have been viewed by millions of users.
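The idea behind `maxDocFreq` can be illustrated with a short Python sketch. This is a simplified model, not Solr's code: the helper `filter_by_doc_freq` and the sample log records are hypothetical, and the single in-memory list stands in for one shard.

```python
from collections import Counter

# Hypothetical log records; one list stands in for the records on a single shard.
LOGS = (
    [{"userID": f"user{i}", "articleID": "viral", "action": "view"} for i in range(50)]
    + [
        {"userID": "user1", "articleID": "niche", "action": "view"},
        {"userID": "user2", "articleID": "niche", "action": "view"},
    ]
)

def filter_by_doc_freq(logs, article_ids, max_doc_freq):
    """Keep only the article IDs that appear in at most max_doc_freq log records."""
    freq = Counter(rec["articleID"] for rec in logs)
    return [a for a in article_ids if freq[a] <= max_doc_freq]

kept = filter_by_doc_freq(LOGS, ["viral", "niche"], max_doc_freq=10)
print(kept)
```

Dropping the high-frequency article before walking its edges avoids expanding the traversal to every user who viewed it.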
|
||||
|
||||
[[GraphTraversal-TrackingtheTraversal]]
|
||||
== Tracking the Traversal
|
||||
|
||||
By default the `gatherNodes` function only tracks enough information to do cycle detection. This provides enough information to output the nodes and aggregations in the graph.
|
||||
By default the `nodes` function only tracks enough information to do cycle detection. This provides enough information to output the nodes and aggregations in the graph.
|
||||
|
||||
For some use cases, such as graph visualization, we also need to output the edges. Setting `trackTraversal="true"` tells `gatherNodes` to track the connections between nodes, so the edges can be constructed. When `trackTraversal` is enabled a new `ancestors` property will appear with each node. The `ancestors` property contains a list of node IDs that pointed to the node.
|
||||
For some use cases, such as graph visualization, we also need to output the edges. Setting `trackTraversal="true"` tells `nodes` to track the connections between nodes, so the edges can be constructed. When `trackTraversal` is enabled a new `ancestors` property will appear with each node. The `ancestors` property contains a list of node IDs that pointed to the node.
|
||||
|
||||
Below is a sample `gatherNodes` expression with `trackTraversal` set to true:
|
||||
Below is a sample `nodes` expression with `trackTraversal` set to true:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(emails,
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to",
|
||||
trackTraversal="true"),
|
||||
walk="node->from",
|
||||
trackTraversal="true",
|
||||
gather="to")
|
||||
nodes(emails,
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to",
|
||||
trackTraversal="true"),
|
||||
walk="node->from",
|
||||
trackTraversal="true",
|
||||
gather="to")
|
||||
----
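As a rough model of what tracking the traversal adds, the Python sketch below records, for each gathered node, which nodes pointed to it. The edge list and the helper `hop_with_ancestors` are hypothetical; it shows only the shape of the `ancestors` property described above, not how Solr builds it.

```python
from collections import defaultdict

# Hypothetical edge documents in an "emails" collection.
EMAILS = [
    {"from": "ann@apache.org", "to": "carol@apache.org"},
    {"from": "bob@apache.org", "to": "carol@apache.org"},
    {"from": "bob@apache.org", "to": "dave@apache.org"},
    {"from": "erin@apache.org", "to": "dave@apache.org"},  # erin is not a root here
]

def hop_with_ancestors(edges, roots, walk_field, gather_field):
    """Gather nodes for one level and list the source nodes that pointed to each."""
    ancestors = defaultdict(list)
    for edge in edges:
        source = edge[walk_field]
        if source in roots:
            node = edge[gather_field]
            if source not in ancestors[node]:
                ancestors[node].append(source)
    return dict(ancestors)

level = hop_with_ancestors(EMAILS, {"ann@apache.org", "bob@apache.org"}, "from", "to")
print(level)
```

Each (ancestor, node) pair corresponds to one edge, which is the information a visualization tool needs to draw the graph.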
|
||||
|
||||
[[GraphTraversal-Cross-CollectionTraversals]]
|
||||
== Cross-Collection Traversals
|
||||
|
||||
Nested `gatherNodes` functions can operate on different SolrCloud collections. This allows traversals to "walk" from one collection to another to gather nodes. Cycle detection does not cross collection boundaries, so nodes collected in one collection will be traversed in a different collection. This was done deliberately to support cross-collection traversals. Note that the output from a cross-collection traversal will likely contain duplicate nodes with different collection attributes.
|
||||
Nested `nodes` functions can operate on different SolrCloud collections. This allows traversals to "walk" from one collection to another to gather nodes. Cycle detection does not cross collection boundaries, so nodes collected in one collection will be traversed in a different collection. This was done deliberately to support cross-collection traversals. Note that the output from a cross-collection traversal will likely contain duplicate nodes with different collection attributes.
|
||||
|
||||
Below is a sample `gatherNodes` expression that traverses from the "emails" collection to the "logs" collection:
|
||||
Below is a sample `nodes` expression that traverses from the "emails" collection to the "logs" collection:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(logs,
|
||||
gatherNodes(emails,
|
||||
search(emails, q="body:(solr rocks)", fl="from", sort="score desc", rows="20"),
|
||||
walk="from->from",
|
||||
gather="to",
|
||||
scatter="leaves, branches"),
|
||||
walk="node->user",
|
||||
fq="action:edit",
|
||||
gather="contentID")
|
||||
nodes(logs,
|
||||
nodes(emails,
|
||||
search(emails, q="body:(solr rocks)", fl="from", sort="score desc", rows="20"),
|
||||
walk="from->from",
|
||||
gather="to",
|
||||
scatter="leaves, branches"),
|
||||
walk="node->user",
|
||||
fq="action:edit",
|
||||
gather="contentID")
|
||||
----
|
||||
|
||||
The example above finds all people who sent emails with a body that contains "solr rocks". It then finds all the people these people have emailed. Then it traverses to the logs collection and gathers all the content IDs that these people have edited.
|
||||
|
||||
[[GraphTraversal-CombininggatherNodesWithOtherStreamingExpressions]]
|
||||
== Combining gatherNodes With Other Streaming Expressions
|
||||
[[GraphTraversal-CombiningnodesWithOtherStreamingExpressions]]
|
||||
== Combining nodes With Other Streaming Expressions
|
||||
|
||||
The `gatherNodes` function can act as both a stream source and a stream decorator. The connection with the wider stream expression library provides tremendous power and flexibility when performing graph traversals. Here is an example of using the streaming expression library to intersect two friend networks:
|
||||
The `nodes` function can act as both a stream source and a stream decorator. The connection with the wider stream expression library provides tremendous power and flexibility when performing graph traversals. Here is an example of using the streaming expression library to intersect two friend networks:
|
||||
|
||||
[source,plain]
|
||||
----
|
||||
intersect(on="node",
|
||||
sort(by="node asc",
|
||||
gatherNodes(emails,
|
||||
gatherNodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to",
|
||||
scatter="branches,leaves")),
|
||||
nodes(emails,
|
||||
nodes(emails,
|
||||
walk="johndoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to",
|
||||
scatter="branches,leaves")),
|
||||
sort(by="node asc",
|
||||
gatherNodes(emails,
|
||||
gatherNodes(emails,
|
||||
walk="janedoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to",
|
||||
scatter="branches,leaves")))
|
||||
nodes(emails,
|
||||
nodes(emails,
|
||||
walk="janedoe@apache.org->from",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
gather="to",
|
||||
scatter="branches,leaves")))
|
||||
----
|
||||
|
||||
The example above gathers two separate friend networks, one rooted with "\johndoe@apache.org" and another rooted with "\janedoe@apache.org". The friend networks are then sorted by the `node` field, and intersected. The resulting node set will be the intersection of the two friend networks.
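Sorting both networks on `node` is what makes a single-pass intersection possible. The Python sketch below shows a generic two-pointer merge over sorted, duplicate-free lists of node IDs. It is an illustration only: the helper `intersect_sorted` and the sample IDs are hypothetical, and it is not a description of how the `intersect` expression is implemented.

```python
def intersect_sorted(left, right):
    """Return values present in both inputs; both must be sorted ascending."""
    i = j = 0
    common = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            common.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1      # left value cannot appear later in right
        else:
            j += 1      # right value cannot appear later in left
    return common

# Hypothetical friend networks, already sorted by node ID.
johns_network = ["ann@apache.org", "bob@apache.org", "carol@apache.org"]
janes_network = ["bob@apache.org", "carol@apache.org", "dave@apache.org"]

shared = intersect_sorted(johns_network, janes_network)
print(shared)
```

Because each input is read once from front to back, neither network has to be held in memory as a lookup table.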
|
||||
|
@ -362,18 +362,18 @@ Here is the sample syntax:
|
|||
----
|
||||
top(n="5",
|
||||
sort="count(*) desc",
|
||||
gatherNodes(baskets,
|
||||
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
|
||||
walk="basketID->basketID",
|
||||
fq="-productID:ABC",
|
||||
gather="productID",
|
||||
count(*)))
|
||||
nodes(baskets,
|
||||
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
|
||||
walk="basketID->basketID",
|
||||
fq="-productID:ABC",
|
||||
gather="productID",
|
||||
count(*)))
|
||||
----
|
||||
|
||||
Let's break down exactly what this traversal is doing.
|
||||
|
||||
. The first expression evaluated is the inner `random` expression, which returns 500 random basketIDs, from the `baskets` collection, that have the `productID` "ABC". The `random` expression is very useful for recommendations because it limits the traversal to a fixed set of baskets, and because it adds the element of surprise into the recommendation. Using the `random` function you can provide fast sample sets from very large graphs.
|
||||
. The outer `gatherNodes` expression finds all the records in the `baskets` collection for the basketIDs generated in step 1. It also filters out `productID` "ABC" so it doesn't show up in the results. It then gathers and counts the productIDs across these baskets.
|
||||
. The outer `nodes` expression finds all the records in the `baskets` collection for the basketIDs generated in step 1. It also filters out `productID` "ABC" so it doesn't show up in the results. It then gathers and counts the productIDs across these baskets.
|
||||
. The outer `top` expression ranks the productIDs emitted in step 2 by the count and selects the top 5.
|
||||
|
||||
In a nutshell this expression finds the products that most frequently co-occur with product "ABC" in past shopping baskets.
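The three steps can be mirrored in a few lines of Python over an in-memory list. This is a simplified sketch under stated assumptions: the records and the helper `co_occurring` are hypothetical, and it uses every matching basket instead of a random sample of 500.

```python
from collections import Counter

# Hypothetical basket records: one record per (basket, product) pair.
BASKETS = [
    {"basketID": "b1", "productID": "ABC"},
    {"basketID": "b1", "productID": "X"},
    {"basketID": "b1", "productID": "Y"},
    {"basketID": "b2", "productID": "ABC"},
    {"basketID": "b2", "productID": "X"},
    {"basketID": "b3", "productID": "Z"},
    {"basketID": "b3", "productID": "X"},
]

def co_occurring(records, product_id, n):
    """Rank the products that share a basket with product_id by co-occurrence count."""
    # Step 1: baskets containing the product (no random sampling in this sketch).
    basket_ids = {r["basketID"] for r in records if r["productID"] == product_id}
    # Step 2: count the other products in those baskets, excluding the product itself.
    counts = Counter(
        r["productID"]
        for r in records
        if r["basketID"] in basket_ids and r["productID"] != product_id
    )
    # Step 3: keep the n most frequent products.
    return counts.most_common(n)

top_products = co_occurring(BASKETS, "ABC", n=5)
print(top_products)
```

Product "Z" never appears because basket "b3" does not contain "ABC", and the occurrence of "X" in that basket is not counted.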
|
||||
|
@ -388,7 +388,7 @@ Before diving into the syntax of the `scoreNodes` function it's useful to unders
|
|||
[[GraphTraversal-HowItWorks]]
|
||||
==== *How It Works*
|
||||
|
||||
The `scoreNodes` function assigns a score to each node emitted by the `gatherNodes` expression. By default the `scoreNodes` function uses the `count(*)` aggregation, which is the co-occurrence count, as the TF value. The IDF value for each node is fetched from the collection where the node was gathered. Each node is then scored using the TF*IDF formula, which provides a boost to nodes with a lower frequency across all market baskets.
|
||||
The `scoreNodes` function assigns a score to each node emitted by the `nodes` expression. By default the `scoreNodes` function uses the `count(*)` aggregation, which is the co-occurrence count, as the TF value. The IDF value for each node is fetched from the collection where the node was gathered. Each node is then scored using the TF*IDF formula, which provides a boost to nodes with a lower frequency across all market baskets.
|
||||
|
||||
Combining the co-occurrence count with the IDF provides a score that shows how important the relationship is between productID ABC and the recommendation candidates.
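A small numeric sketch shows why this weighting changes the ranking. The text above does not give the exact IDF formula that `scoreNodes` uses, so the smoothed logarithm below is an assumption chosen for illustration, and the helper `score_nodes`, the counts, and the document frequencies are all hypothetical.

```python
import math

def score_nodes(counts, doc_freqs, num_docs):
    """Score each node as TF * IDF, where TF is the co-occurrence count.

    Assumption: IDF = log((num_docs + 1) / (doc_freq + 1)); this common smoothed
    variant is used for illustration and may differ from Solr's formula.
    """
    return {
        node: tf * math.log((num_docs + 1) / (doc_freqs[node] + 1))
        for node, tf in counts.items()
    }

# Hypothetical inputs: X co-occurs slightly more often but is in almost every basket.
counts = {"X": 10, "Y": 8}        # count(*) from the traversal
doc_freqs = {"X": 900, "Y": 9}    # records containing each product in the collection
scores = score_nodes(counts, doc_freqs, num_docs=999)

ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Product "Y" outranks "X" despite the lower raw count, because "X" appears with nearly everything and therefore says little about product "ABC" specifically.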
|
||||
|
||||
|
@ -403,12 +403,12 @@ top(n="1",
|
|||
sort="nodeScore desc",
|
||||
scoreNodes(top(n="50",
|
||||
sort="count(*) desc",
|
||||
gatherNodes(baskets,
|
||||
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
|
||||
walk="basketID->basketID",
|
||||
fq="-productID:ABC",
|
||||
gather="productID",
|
||||
count(*)))))
|
||||
nodes(baskets,
|
||||
random(baskets, q="productID:ABC", fl="basketID", rows="500"),
|
||||
walk="basketID->basketID",
|
||||
fq="-productID:ABC",
|
||||
gather="productID",
|
||||
count(*)))))
|
||||
----
|
||||
|
||||
This example builds on the earlier example "Calculate market basket co-occurrence".
|
||||
|
@ -428,20 +428,20 @@ Here is the sample syntax:
|
|||
----
|
||||
top(n="5",
|
||||
sort="count(*) desc",
|
||||
gatherNodes(logs,
|
||||
top(n="30",
|
||||
sort="count(*) desc",
|
||||
gatherNodes(logs,
|
||||
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:read", qt="/export"),
|
||||
walk="articleID->articleID",
|
||||
gather="userID",
|
||||
fq="action:read",
|
||||
maxDocFreq="10000",
|
||||
count(*))),
|
||||
walk="node->userID",
|
||||
gather="articleID",
|
||||
fq="action:read",
|
||||
count(*)))
|
||||
nodes(logs,
|
||||
top(n="30",
|
||||
sort="count(*) desc",
|
||||
nodes(logs,
|
||||
search(logs, q="userID:user1", fl="articleID", sort="articleID asc", fq="action:read", qt="/export"),
|
||||
walk="articleID->articleID",
|
||||
gather="userID",
|
||||
fq="action:read",
|
||||
maxDocFreq="10000",
|
||||
count(*))),
|
||||
walk="node->userID",
|
||||
gather="articleID",
|
||||
fq="action:read",
|
||||
count(*)))
|
||||
----
|
||||
|
||||
Let's break down the expression above step-by-step.
|
||||
|
@ -449,11 +449,11 @@ Let's break down the expression above step-by-step.
|
|||
. The first expression evaluated is the inner `search` expression. This expression searches the `logs` collection for all records matching "user1". This is the user we are making the recommendation for.
|
||||
+
|
||||
There is a filter applied to pull back only records that match `action:read`. It returns the `articleID` for each record found. In other words, this expression returns all the articles "user1" has read.
|
||||
. The inner `gatherNodes` expression operates over the articleIDs returned from step 1. It takes each `articleID` found and searches them against the `articleID` field.
|
||||
. The inner `nodes` expression operates over the articleIDs returned from step 1. It takes each `articleID` found and searches them against the `articleID` field.
|
||||
+
|
||||
Note that it skips high frequency nodes using the `maxDocFreq` param to filter out articles that appear over 10,000 times in the logs. It gathers userIDs and aggregates the counts for each user. This step finds the users that have read the same articles that "user1" has read and counts how many of the same articles they have read.
|
||||
. The inner `top` expression ranks the users emitted from step 2. It will emit the top 30 users who have the most overlap with user1's reading list.
|
||||
. The outer `gatherNodes` expression gathers the reading list for the users emitted from step 3. It counts the articleIDs that are gathered.
|
||||
. The outer `nodes` expression gathers the reading list for the users emitted from step 3. It counts the articleIDs that are gathered.
|
||||
+
|
||||
Any article selected in step 1 (user1's reading list) will not appear in this step due to cycle detection. So this step returns the articles read by the users with the most similar reading habits to "user1" that "user1" has not read yet. It also counts the number of times each article has been read across this user group.
|
||||
. The outer `top` expression takes the top articles emitted from step 4. This is the recommendation.
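The five steps above can be sketched end to end in Python over an in-memory log. This is a simplified model, not Solr's implementation: the log records and the helper `recommend` are hypothetical, the `maxDocFreq` filter is omitted, and cycle detection is reduced to excluding the articles found in step 1.

```python
from collections import Counter

# Hypothetical log records: who read which article.
LOGS = [
    {"userID": "user1", "articleID": "a1", "action": "read"},
    {"userID": "user1", "articleID": "a2", "action": "read"},
    {"userID": "user2", "articleID": "a1", "action": "read"},
    {"userID": "user2", "articleID": "a2", "action": "read"},
    {"userID": "user2", "articleID": "a3", "action": "read"},
    {"userID": "user2", "articleID": "a4", "action": "read"},
    {"userID": "user3", "articleID": "a1", "action": "read"},
    {"userID": "user3", "articleID": "a4", "action": "read"},
    {"userID": "user4", "articleID": "a5", "action": "read"},
]

def recommend(logs, user, top_users=30, top_articles=5):
    # Unique (user, article) edges for read actions only.
    reads = {(r["userID"], r["articleID"]) for r in logs if r["action"] == "read"}
    # Step 1: the articles the user has read.
    mine = {a for u, a in reads if u == user}
    # Step 2: users who read the same articles, with overlap counts.
    overlap = Counter(u for u, a in reads if a in mine)
    # Step 3: the users with the most overlap (this includes the user themselves).
    similar = {u for u, _ in overlap.most_common(top_users)}
    # Step 4: their reading lists, minus articles already seen in step 1.
    candidates = Counter(a for u, a in reads if u in similar and a not in mine)
    # Step 5: the most frequently read candidates are the recommendation.
    return candidates.most_common(top_articles)

recommendation = recommend(LOGS, "user1")
print(recommendation)
```

Article "a5" is never recommended because "user4" shares no articles with "user1", while "a4" ranks first because two similar users have read it.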
|
||||
|
@ -467,49 +467,49 @@ The example below illustrates a protein pathway traversal:
|
|||
|
||||
[source,plain]
|
||||
----
|
||||
gatherNodes(proteins,
|
||||
gatherNodes(proteins,
|
||||
walk="NRAS->name",
|
||||
gather="interacts"),
|
||||
walk="node->name",
|
||||
gather="drug")
|
||||
nodes(proteins,
|
||||
nodes(proteins,
|
||||
walk="NRAS->name",
|
||||
gather="interacts"),
|
||||
walk="node->name",
|
||||
gather="drug")
|
||||
----
|
||||
|
||||
Let's break down exactly what this traversal is doing.
|
||||
|
||||
. The inner `gatherNodes` expression traverses in the `proteins` collection. It finds all the edges in the graph where the name of the protein is "NRAS". Then it gathers the proteins in the `interacts` field. This gathers all the proteins that "NRAS" interacts with.
|
||||
. The outer `gatherNodes` expression also works with the `proteins` collection. It gathers all the drugs that correspond to proteins emitted from step 1.
|
||||
. The inner `nodes` expression traverses in the `proteins` collection. It finds all the edges in the graph where the name of the protein is "NRAS". Then it gathers the proteins in the `interacts` field. This gathers all the proteins that "NRAS" interacts with.
|
||||
. The outer `nodes` expression also works with the `proteins` collection. It gathers all the drugs that correspond to proteins emitted from step 1.
|
||||
. Using this stepwise approach you can gather the drugs along the pathway of interactions any number of steps away from the root protein.
|
||||
|
||||
[[GraphTraversal-ExportingGraphMLtoSupportGraphVisualization]]
|
||||
== Exporting GraphML to Support Graph Visualization
|
||||
|
||||
In the examples above, the `gatherNodes` expression was sent to Solr's `/stream` handler like any other streaming expression. This approach outputs the nodes in the same JSON tuple format as other streaming expressions so that it can be treated like any other streaming expression. You can use the `/stream` handler when you need to operate directly on the tuples, such as in the recommendation use cases above.
|
||||
In the examples above, the `nodes` expression was sent to Solr's `/stream` handler like any other streaming expression. This approach outputs the nodes in the same JSON tuple format as other streaming expressions so that it can be treated like any other streaming expression. You can use the `/stream` handler when you need to operate directly on the tuples, such as in the recommendation use cases above.
|
||||
|
||||
There are other graph traversal use cases that involve graph visualization. Solr supports these use cases with the introduction of the `/graph` request handler, which takes a `gatherNodes` expression and outputs the results in GraphML.
|
||||
There are other graph traversal use cases that involve graph visualization. Solr supports these use cases with the introduction of the `/graph` request handler, which takes a `nodes` expression and outputs the results in GraphML.
|
||||
|
||||
http://graphml.graphdrawing.org/[GraphML] is an XML format supported by graph visualization tools such as https://gephi.org/[Gephi], which is a sophisticated open source tool for statistically analyzing and visualizing graphs. Using a `gatherNodes` expression, parts of a larger graph can be exported in GraphML and then imported into tools like Gephi.
|
||||
http://graphml.graphdrawing.org/[GraphML] is an XML format supported by graph visualization tools such as https://gephi.org/[Gephi], which is a sophisticated open source tool for statistically analyzing and visualizing graphs. Using a `nodes` expression, parts of a larger graph can be exported in GraphML and then imported into tools like Gephi.
|
||||
|
||||
There are a few things to keep in mind when exporting a graph in GraphML:
|
||||
|
||||
. The `/graph` handler can export both the nodes and edges in the graph. By default, it only exports the nodes. To export the edges you must set `trackTraversal="true"` in the `gatherNodes` expression.
|
||||
. The `/graph` handler currently accepts an arbitrarily complex streaming expression which includes a `gatherNodes` expression. If the streaming expression doesn't include a `gatherNodes` expression, the `/graph` handler will not properly output GraphML.
|
||||
. The `/graph` handler currently accepts a single arbitrarily complex, nested `gatherNodes` expression per request. This means you cannot send in a streaming expression that joins or intersects the node sets from multiple `gatherNodes` expressions. The `/graph` handler does support any level of nesting within a single `gatherNodes` expression. The `/stream` handler does support joining and intersecting node sets, but the `/graph` handler currently does not.
|
||||
. The `/graph` handler can export both the nodes and edges in the graph. By default, it only exports the nodes. To export the edges you must set `trackTraversal="true"` in the `nodes` expression.
|
||||
. The `/graph` handler currently accepts an arbitrarily complex streaming expression which includes a `nodes` expression. If the streaming expression doesn't include a `nodes` expression, the `/graph` handler will not properly output GraphML.
|
||||
. The `/graph` handler currently accepts a single arbitrarily complex, nested `nodes` expression per request. This means you cannot send in a streaming expression that joins or intersects the node sets from multiple `nodes` expressions. The `/graph` handler does support any level of nesting within a single `nodes` expression. The `/stream` handler does support joining and intersecting node sets, but the `/graph` handler currently does not.
|
||||
|
||||
[[GraphTraversal-SampleRequest]]
|
||||
=== Sample Request
|
||||
|
||||
[source,bash]
|
||||
----
|
||||
curl --data-urlencode 'expr=gatherNodes(enron_emails,
|
||||
gatherNodes(enron_emails,
|
||||
walk="kayne.coulter@enron.com->from",
|
||||
trackTraversal="true",
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
scatter="leaves,branches",
|
||||
curl --data-urlencode 'expr=nodes(enron_emails,
|
||||
nodes(enron_emails,
|
||||
walk="kayne.coulter@enron.com->from",
|
||||
trackTraversal="true",
|
||||
gather="to")' http://localhost:8983/solr/enron_emails/graph
|
||||
gather="to"),
|
||||
walk="node->from",
|
||||
scatter="leaves,branches",
|
||||
trackTraversal="true",
|
||||
gather="to")' http://localhost:8983/solr/enron_emails/graph
|
||||
----
|
||||
|
||||
[[GraphTraversal-SampleGraphMLOutput]]
|
||||
|
|
|
@ -213,9 +213,9 @@ features(collection1,
|
|||
numTerms=250)
|
||||
----
|
||||
|
||||
== gatherNodes
|
||||
== nodes
|
||||
|
||||
The `gatherNodes` function provides breadth-first graph traversal. For details, see the section <<graph-traversal.adoc#graph-traversal,Graph Traversal>>.
|
||||
The `nodes` function provides breadth-first graph traversal. For details, see the section <<graph-traversal.adoc#graph-traversal,Graph Traversal>>.
|
||||
|
||||
== knn
|
||||
|
||||
|
|