mirror of https://github.com/apache/lucene.git
Return the same input vector if its a unit vector in VectorUtil#l2normalize (#12726)
### Description

While going through the [VectorUtil](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java) class, I observed that `VectorUtil#l2normalize` has no check for unit vectors, so passing a unit vector goes through the whole L2 normalization, which is not required; it should exit early. I confirmed this by trying a simple example, `VectorUtil.l2normalize(VectorUtil.l2normalize(nonUnitVector))`, which performed every calculation twice.

We could argue that users should not call this method on a unit vector, but I believe there are cases where a user simply wants to perform the L2 normalization without checking the vector first, or where some values overflow.

TL;DR: We should exit early in `VectorUtil#l2normalize`, returning the same input vector if it is a unit vector.

The extra work is easily avoided with a light check on whether the squared sum of the input vector (its squared L2 norm, which this PR names `l1norm`) is equal to 1.0. In practice the check is `Math.abs(l1norm - 1.0d) <= 1e-5` (as in this PR), because the dot product `v x v` of a unit vector is often not exactly 1.0 but, for example, `0.9999999403953552`. With a delta of `1e-5`, any vector `v` whose `v x v` lies between `0.99999` and `1.00001` is assumed to be a unit vector, that is, already L2 normalized. This seems fine since the delta is really small and the check is not a heavy one.
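The tolerance check described above can be sketched in isolation. This is a minimal illustration, not Lucene code: the class `UnitVectorCheck` and its methods are hypothetical names, and the plain loop is an assumed stand-in for the internal `IMPL.dotProduct(v, v)` call (it accumulates in `double`, which the real implementation may not do).

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class UnitVectorCheck {
  // Tolerance taken from the PR description.
  static final double DELTA = 1e-5;

  // Assumed stand-in for IMPL.dotProduct(v, v): sum of squared components.
  static double sumOfSquares(float[] v) {
    double sum = 0.0;
    for (float x : v) {
      sum += (double) x * x;
    }
    return sum;
  }

  // True when the squared sum is within DELTA of 1.0.
  static boolean isApproximatelyUnit(float[] v) {
    return Math.abs(sumOfSquares(v) - 1.0d) <= DELTA;
  }

  public static void main(String[] args) {
    // Squared sum is 25, so this is clearly not a unit vector.
    System.out.println(isApproximatelyUnit(new float[] {3f, 4f}));
    // 0.6f and 0.8f are rounded floats, so the squared sum is close to,
    // but not exactly, 1.0; the tolerance absorbs the rounding error.
    System.out.println(isApproximatelyUnit(new float[] {0.6f, 0.8f}));
  }
}
```

An exact comparison against 1.0 would miss most already-normalized vectors because of float rounding, which is why a delta is used.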
parent 2a8d187a99
commit 7943b7ad1c
@@ -242,6 +242,8 @@ Optimizations
 * GITHUB#12702: Disable suffix sharing for block tree index, making writing the terms dictionary index faster
   and less RAM hungry, while making the index a bit (~1.X% for the terms index file on wikipedia). (Guo Feng, Mike McCandless)
 
+* GITHUB#12726: Return the same input vector if its a unit vector in VectorUtil#l2normalize. (Shubham Chaudhary)
+
 Changes in runtime behavior
 ---------------------
 
@@ -106,18 +106,21 @@ public final class VectorUtil {
    * @throws IllegalArgumentException when the vector is all zero and throwOnZero is true
    */
   public static float[] l2normalize(float[] v, boolean throwOnZero) {
-    double squareSum = IMPL.dotProduct(v, v);
-    int dim = v.length;
-    if (squareSum == 0) {
+    double l1norm = IMPL.dotProduct(v, v);
+    if (l1norm == 0) {
       if (throwOnZero) {
         throw new IllegalArgumentException("Cannot normalize a zero-length vector");
       } else {
         return v;
       }
     }
-    double length = Math.sqrt(squareSum);
+    if (Math.abs(l1norm - 1.0d) <= 1e-5) {
+      return v;
+    }
+    int dim = v.length;
+    double l2norm = Math.sqrt(l1norm);
     for (int i = 0; i < dim; i++) {
-      v[i] /= length;
+      v[i] /= (float) l2norm;
     }
     return v;
   }
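To see the effect of the change end to end, the post-change method can be approximated in a standalone class. This is a sketch under stated assumptions: `L2NormalizeSketch` and `sumOfSquares` are hypothetical names, and the plain loop replaces `IMPL.dotProduct(v, v)`, which is internal to Lucene; the variable is called `squaredSum` here rather than `l1norm` only for readability.

```java
import java.util.Arrays;

// Hypothetical standalone approximation of the patched method; not Lucene code.
final class L2NormalizeSketch {
  // Assumed stand-in for IMPL.dotProduct(v, v).
  static double sumOfSquares(float[] v) {
    double sum = 0.0;
    for (float x : v) {
      sum += (double) x * x;
    }
    return sum;
  }

  // Normalizes v in place and returns it, mirroring the control flow in the diff.
  static float[] l2normalize(float[] v, boolean throwOnZero) {
    double squaredSum = sumOfSquares(v);
    if (squaredSum == 0) {
      if (throwOnZero) {
        throw new IllegalArgumentException("Cannot normalize a zero-length vector");
      }
      return v;
    }
    // Early exit added by the PR: already unit length within the 1e-5 tolerance.
    if (Math.abs(squaredSum - 1.0d) <= 1e-5) {
      return v;
    }
    float length = (float) Math.sqrt(squaredSum);
    for (int i = 0; i < v.length; i++) {
      v[i] /= length;
    }
    return v;
  }

  public static void main(String[] args) {
    float[] v = {3f, 4f};
    l2normalize(v, true); // first call divides every component by 5
    float[] snapshot = v.clone();
    l2normalize(v, true); // second call takes the early exit
    System.out.println(Arrays.equals(snapshot, v)); // prints true
  }
}
```

One consequence of the tolerance is that an input whose squared sum is slightly off from 1.0, but within `1e-5`, is returned unmodified rather than rescaled.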