[Lucene/Solr 7.0] Background research for understanding the removal of index-time boost

Lucene/Solr 7 was released the other day! I am curious about the new features in Lucene/Solr 7 as well (I may summarize them in a separate entry), but I also caught the news that index-time boost is disabled in 7.

I wrote a blog post about index-time weighting no longer being available in Lucene/Solr 7.0!

関口宏司のLuceneブログ | LUCENE-6819: Good bye index-time boost https://t.co/fSDSHVudvb
Koji Sekiguchi (@kojisays), August 25, 2017

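The details are covered thoroughly in Sekiguchi-san's blog post above. However, there were parts of how boost affects the score that I did not understand well, so I could not gauge how much the removal would matter, and I went back and investigated. Reading this entry should deepen your understanding of how boost affects the score and give you some sense of how the removal of index-time boost will affect the Solr-based systems you manage.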

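Prerequisites

The TF-IDF formula
The BM25 formula
How to debug Solr queries and how to read explain output
- The following slides are easy to follow and recommended.
- 第16回Lucene/Solr勉強会 – ランキングチューニングと定量評価 #SolrJP (16th Lucene/Solr study meetup: ranking tuning and quantitative evaluation)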

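Question 1: Do index-time boost and query boost affect the score in the same way, or differently?

If index-time boost is going away, why not just use query boost instead? That was my first thought, but I wanted to check whether it actually works as a replacement. Specifically, I looked at whether the score differs between specifying a boost of 2 at index time and specifying the same boost in the query.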

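Question 2: How the behavior differs when omitNorms=true (BM25 / TF-IDF)

What happens when omitNorms=true is set? ClassicSimilarity computes the score based on TF-IDF, but Lucene's ClassicSimilarity modifies the original formula so that it also takes into account the length of the text in the field and the boost value. Setting omitNorms=true disables that "field length and boost value" part.

BM25Similarity, on the other hand, already accounts for field length in the original formula. Lucene's implementation presumably adds the boost value to this, so the question is whether omitNorms=true disables the corresponding part of the formula as it does for ClassicSimilarity.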

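Conclusion

To state the conclusions first:

On question 1: if you use index-time boost, you need to consider alternatives such as introducing query-time boost or boosting with doc values + function queries. However, query-time boost and index-time boost affect different parts of the formula, and the scores differ even when the boost value, the document, and the query are identical, so thorough testing is required. Also, index-time boost cannot be set on fields with omitNorms=true in the first place, so the removal has no effect on those fields and there is nothing to worry about there.
On question 2: like ClassicSimilarity, BM25Similarity drops its field-length handling from the score calculation when omitNorms=true. The behavior looks similar, but the effect on the score and on the scoring formula is completely different, so care is needed.

My understanding, therefore, is that when adopting Solr 7 the impact is limited to cases where index-time boost is used on fields that do not have omitNorms=true, but since no alternative that yields exactly the same score is provided, thorough testing is required.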

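I also learned the following.

Query-time boost and index-time boost affect completely different places. Index-time boost changes the value of a variable in the scoring formula, whereas query-time boost simply multiplies the plain score by the specified boost value, so its effect is stronger than that of index-time boost.
- With BM25, increasing the index-time boost makes the fieldLength variable smaller, which raises the score. With ClassicSimilarity, it makes the fieldNorm variable larger, which raises the score (see Investigation 1 below, on how the score changes when index-time boost is set).
- Both kinds of index-time boost, doc boost and field boost, affect fieldLength / fieldNorm in the same way, and neither affects any other variable.
- BM25's b is not a boost value in the usual sense (it can be specified separately in schema.xml). It cannot be adjusted from the query or the document.
How the score changes when omitNorms is set to true differs completely depending on the Similarity in use, so care is needed.
- With ClassicSimilarity, omitNorms=true fixes fieldNorm at 1.0. With BM25, omitNorms=true changes the formula so that the variables that account for field length, b / avgFieldLength / fieldLength, are no longer used (see Investigation 2 below, on how the score changes when omitNorms=true).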
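
To make the difference between the two kinds of boost concrete, here is a minimal Python sketch that plugs the values from the BM25 explain output in this post into the formula that explain prints. The helper name `bm25_score` and its `query_boost` parameter are illustrative, not part of Solr. The index-time case uses the fieldLength of 4.0 that explain reports for the field-boosted document. The query-time case is a hypothetical multiplication by 2 that was not run against Solr.

```python
import math

def bm25_score(freq, doc_freq, doc_count, field_length, avg_field_length,
               k1=1.2, b=0.75, query_boost=1.0):
    """BM25 score following the formula printed by Solr 6.4.2 explain (illustrative helper)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length))
    # A query-time boost appears in explain as a plain factor of the product.
    return query_boost * idf * tf_norm

# Values shared by every explain output in Investigation 1.
common = dict(freq=1.0, doc_freq=4, doc_count=13305, avg_field_length=30.368282)

baseline = bm25_score(field_length=16.0, **common)
# Index-time boost: only the fieldLength variable changes (16.0 -> 4.0).
index_boosted = bm25_score(field_length=4.0, **common)
# Query-time boost of 2 (hypothetical): the whole score is multiplied.
query_boosted = bm25_score(field_length=16.0, query_boost=2.0, **common)

print(f"baseline      {baseline:.4f}")
print(f"index boost   {index_boosted:.4f}")
print(f"query boost   {query_boosted:.4f}")
```

The baseline and index-boosted values come out close to the 9.91 and 12.39 shown by explain, while the query-boosted value is exactly twice the baseline. The same nominal boost therefore produces different scores depending on where it is applied.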

(By the way, the BM25 explain output even shows the formula, which is really helpful!)

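Investigation 1: How the score changes when index-time boost is set

This checks how the score changes when an index-time boost is set, and which part of the calculation changes. All results are from Solr 6.4.2.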

BM25Similarity

Increasing the index-time boost makes fieldLength smaller, and as a result the score goes up.

No index-time boost

9.910029 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  9.910029 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
    7.991893 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      4.0 = docFreq
      13305.0 = docCount
    1.2400103 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      30.368282 = avgFieldLength
      16.0 = fieldLength

field boost only

12.394508 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  12.394508 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
    7.991893 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      4.0 = docFreq
      13305.0 = docCount
    1.5508852 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      30.368282 = avgFieldLength
      4.0 = fieldLength

doc + field boost

13.330253 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  13.330253 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
    7.991893 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      4.0 = docFreq
      13305.0 = docCount
    1.6679718 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      30.368282 = avgFieldLength
      0.64 = fieldLength
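
The fieldLength values 16.0, 4.0 and 0.64 above are consistent with the boost being folded into the norm as boost / sqrt(length) and decoded back as 1 / norm^2. The sketch below checks that arithmetic. It assumes a field boost of 2 and a doc boost of 2.5 (the values that the figures in this post are consistent with), treats the unboosted fieldLength of 16.0 as the term count, and ignores the lossy one-byte encoding that Lucene 6 applies to norms. The function names are illustrative.

```python
import math

def effective_field_length(num_terms, boost=1.0):
    """fieldLength as BM25 sees it once the boost is folded into the norm.

    Assumption: norm = boost / sqrt(num_terms), decoded as 1 / norm**2.
    """
    norm = boost / math.sqrt(num_terms)
    return 1.0 / (norm * norm)

def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm term exactly as printed in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length))

NUM_TERMS = 16                # unboosted fieldLength from explain
AVG_FIELD_LENGTH = 30.368282  # from explain
FIELD_BOOST = 2.0             # assumed
DOC_BOOST = 2.5               # assumed

cases = {
    "no boost": 1.0,
    "field boost only": FIELD_BOOST,
    "doc + field boost": DOC_BOOST * FIELD_BOOST,
}
results = {}
for label, boost in cases.items():
    length = effective_field_length(NUM_TERMS, boost)
    results[label] = (length, tf_norm(1.0, length, AVG_FIELD_LENGTH))
    print(f"{label}: fieldLength={length:.2f} tfNorm={results[label][1]:.4f}")
```

Because the boost enters only through fieldLength, its effect saturates: tfNorm can never exceed k1 + 1 = 2.2 for freq = 1, however large the boost becomes.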

ClassicSimilarity

The fieldNorm value changes and the score increases.

No index-time boost

19.742617 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  19.742617 = score(doc=0,freq=1.0), product of:
    8.886533 = queryWeight, product of:
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      1.0 = queryNorm
    2.2216332 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      0.25 = fieldNorm(doc=0)

field boost only (2 * 0.25 = 0.5)

39.485233 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  39.485233 = score(doc=0,freq=1.0), product of:
    8.886533 = queryWeight, product of:
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      1.0 = queryNorm
    4.4432664 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      0.5 = fieldNorm(doc=0)

field + doc boost (2 * 2.5 * 0.25 = 1.25)

98.71308 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  98.71308 = score(doc=0,freq=1.0), product of:
    8.886533 = queryWeight, product of:
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      1.0 = queryNorm
    11.108166 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      1.25 = fieldNorm(doc=0)
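
With ClassicSimilarity the boost multiplies fieldNorm directly, so the score scales linearly with it, unlike the saturating behavior seen with BM25. The following sketch recomputes the three scores above from the formula in the explain output. It assumes 16 terms in the field (so an unboosted fieldNorm of 1 / sqrt(16) = 0.25) and total boosts of 1, 2 and 2 * 2.5 = 5. `classic_score` is an illustrative helper, not a Lucene API.

```python
import math

def classic_score(freq, doc_freq, doc_count, field_norm, query_norm=1.0):
    """ClassicSimilarity score as decomposed in the Solr 6.4.2 explain output."""
    idf = math.log((doc_count + 1) / (doc_freq + 1)) + 1
    query_weight = idf * query_norm
    # tf is sqrt(freq) in ClassicSimilarity; idf appears in both factors.
    field_weight = math.sqrt(freq) * idf * field_norm
    return query_weight * field_weight

base_norm = 1.0 / math.sqrt(16)  # 0.25, assuming 16 terms and no boost

scores = {
    boost: classic_score(1.0, 4, 13305, boost * base_norm)
    for boost in (1.0, 2.0, 5.0)
}
for boost, score in scores.items():
    print(f"boost={boost}: fieldNorm={boost * base_norm} score={score:.4f}")
```

A boost of 5 here yields five times the unboosted score, whereas in the BM25 output the same documents only move from 9.91 to 13.33.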

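Investigation 2: How the score changes when omitNorms=true

The same q is sent against a field with omitNorms=true and a field with omitNorms=false, and the score explain output is compared to see how the calculation differs. All results are from Solr 6.4.2.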

BM25Similarity

With omitNorms=true, b / avgFieldLength / fieldLength are no longer part of the formula.

1 : field with omitNorms=false

5.110843 = weight(Description_sand:ローソン in 29) [SchemaSimilarity], result of:
  5.110843 = score(doc=29,freq=1.0 = termFreq=1.0
), product of:
    0.5 = boost
    8.243133 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      3.0 = docFreq
      13304.0 = docCount
    1.2400244 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      30.369589 = avgFieldLength
      16.0 = fieldLength

2 : field with omitNorms=true

5.072106 = weight(Title_search_sand:ローソン in 965) [SchemaSimilarity], result of:
  5.072106 = score(doc=965,freq=2.0 = termFreq=2.0
), product of:
    0.5 = boost
    7.377609 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
      12.0 = docFreq
      19996.0 = docCount
    1.375 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
      2.0 = termFreq=2.0
      1.2 = parameter k1
      0.0 = parameter b (norms omitted for field)
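
The shortened tfNorm formula in the omitNorms=true output is what the full formula reduces to when b is 0, which is how explain labels it ("0.0 = parameter b"). The sketch below checks this equivalence and recomputes the score of 5.072106 from the values shown. The function names are illustrative, and the field lengths passed to the full formula are arbitrary test values.

```python
import math

def tf_norm_full(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as printed for a field that has norms."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length))

def tf_norm_norms_omitted(freq, k1=1.2):
    """tfNorm as printed for a field with omitNorms=true."""
    return (freq * (k1 + 1)) / (freq + k1)

# With b = 0 the length term vanishes, whatever the field length is.
for length in (1.0, 16.0, 1000.0):
    assert tf_norm_full(2.0, length, 30.0, b=0.0) == tf_norm_norms_omitted(2.0)

# Recompute the omitNorms=true score: boost * idf * tfNorm.
idf = math.log(1 + (19996 - 12 + 0.5) / (12 + 0.5))
score = 0.5 * idf * tf_norm_norms_omitted(2.0)
print(f"tfNorm={tf_norm_norms_omitted(2.0):.3f} score={score:.4f}")
```

Only termFreq and k1 remain in the term-frequency part, so neither document length nor any index-time boost can influence the score of such a field.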

ClassicSimilarity

With omitNorms=true, fieldNorm is fixed at 1.0.

1 : field with omitNorms=false

19.742617 = weight(Description_sand:ローソン in 0) [SchemaSimilarity], result of:
  19.742617 = score(doc=0,freq=1.0), product of:
    8.886533 = queryWeight, product of:
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      1.0 = queryNorm
    2.2216332 = fieldWeight in 0, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      8.886533 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        4.0 = docFreq
        13305.0 = docCount
      0.25 = fieldNorm(doc=0)

2 : field with omitNorms=true

96.589584 = weight(Title_search_sand:ローソン in 965) [SchemaSimilarity], result of:
  96.589584 = score(doc=965,freq=2.0), product of:
    8.26433 = queryWeight, product of:
      8.26433 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        13.0 = docFreq
        19997.0 = docCount
      1.0 = queryNorm
    11.687528 = fieldWeight in 965, product of:
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = termFreq=2.0
      8.26433 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        13.0 = docFreq
        19997.0 = docCount
      1.0 = fieldNorm(doc=965)
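
As a quick check of the last output, the sketch below recomputes the omitNorms=true score with fieldNorm pinned to 1.0, using only the values shown in explain. The helper name is illustrative.

```python
import math

def classic_score_norms_omitted(freq, doc_freq, doc_count):
    """ClassicSimilarity score when fieldNorm is fixed at 1.0 (omitNorms=true)."""
    idf = math.log((doc_count + 1) / (doc_freq + 1)) + 1
    tf = math.sqrt(freq)
    field_norm = 1.0  # length and index-time boost no longer contribute
    return idf * (tf * idf * field_norm)

score = classic_score_norms_omitted(2.0, 13, 19997)
print(f"score={score:.4f}")
```

The result depends only on termFreq, docFreq and docCount, which is why the score is so much larger than for the norm-carrying field, where fieldNorm was 0.25.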