Solving Elasticsearch Aggregation Problems

Incomplete data

ES aggregation: time zone issue

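When ES runs a date aggregation, it buckets and formats timestamps in UTC by default, not in UTC+8. For data recorded in UTC+8, each daily bucket is therefore shifted by eight hours and the per-day results look incomplete. The fix is to set time_zone to +08:00.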

Example

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "createTime": {
              "from": 1577462400,
              "to": 1577548799,
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "_source": false,
  "aggregations": {
    "enterpriseIdTerms": {
      "date_histogram": {
        "field": "createTime",
        "format": "yyyy-MM-dd",
        "interval": "1d",
        "offset": 0,
        "order": {
          "_key": "asc"
        },
        "time_zone": "+08:00",
        "keyed": false,
        "min_doc_count": 0
      },
      "aggregations": {
        "callCount": {
          "value_count": {
            "field": "uniqueId"
          }
        }
      }
    }
  }
}

Corresponding Java code

// Jest client
DateHistogramAggregationBuilder dateAggregation = AggregationBuilders.dateHistogram("dateAggregations")
    // field to aggregate on
    .field("createTime")
    // bucket interval
    .dateHistogramInterval(DateHistogramInterval.DAY)
    // format of the bucket keys
    .format("yyyy-MM-dd")
    // minimum document count per bucket (defaults to 0)
    .minDocCount(0L)
    // order by key (time) ascending
    .order(Histogram.Order.KEY_ASC)
    // time zone used for bucketing
    .timeZone(DateTimeZone.forTimeZone(TimeZone.getTimeZone("GMT+8")))
    // sub-aggregation
    .subAggregation(AggregationBuilders.count(StatFieldEnum.CALL_COUNT.getKey()).field(CtiCloudCdrField.UNIQUE_ID));

// Java example from the referenced blog post; it is unclear whether it targets the REST client.
AggregationBuilder dateAggs =  AggregationBuilders
    .dateHistogram("dateAggs") // aggregation name
    .field("@timestamp")  // date field to aggregate on
    .interval(DateHistogramInterval.DAY) // one bucket per day
    .minDocCount(0L) // defaults to 0
    .order(Histogram.Order.KEY_ASC) // order by time ascending
    .timeZone("+08:00") // time zone
    .subAggregation( // sub-aggregation
        AggregationBuilders
            .sum("sumAggs")
            .field("tx_count"));
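
The effect of time_zone can be checked without a cluster. The sketch below uses only java.time and takes the two boundary timestamps from the range filter in the query above, assuming createTime holds epoch seconds. The class TimeZoneBuckets and its methods are hypothetical helpers for illustration, not part of any Elasticsearch client.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; it only mimics how a 1d date_histogram assigns bucket keys.
class TimeZoneBuckets {

    // Returns the calendar day (yyyy-MM-dd) an epoch-second timestamp falls on in the given zone.
    static String dayKey(long epochSeconds, ZoneId zone) {
        LocalDate day = Instant.ofEpochSecond(epochSeconds).atZone(zone).toLocalDate();
        return day.toString(); // ISO-8601 date, i.e. yyyy-MM-dd
    }

    // Counts timestamps per day in the given zone.
    static Map<String, Long> histogram(long[] epochSeconds, ZoneId zone) {
        Map<String, Long> buckets = new TreeMap<>();
        for (long ts : epochSeconds) {
            buckets.merge(dayKey(ts, zone), 1L, Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // First and last second of the range used in the query above.
        long[] timestamps = {1577462400L, 1577548799L};
        System.out.println(histogram(timestamps, ZoneOffset.UTC));        // {2019-12-27=1, 2019-12-28=1}
        System.out.println(histogram(timestamps, ZoneOffset.ofHours(8))); // {2019-12-28=2}
    }
}
```

In UTC the single local day is split across two buckets, which is the "missing data" symptom; with +08:00 both timestamps land in the 2019-12-28 bucket.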

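Basic aggregations

ES: transform a document field first, then aggregate

Straight to the query.

Requirement: for documents where personName = hero, total up value1, rounding each value up to the next multiple of 100 (anything short of a full hundred counts as a hundred).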

{
  "size": 0,
  "query": {
    "match": {
      "personName": "hero"
    }
  },
  "aggregations": {
    "duration": {
      "sum": {
        "script": "(params._source.value1/100 + (params._source.value1%100!=0?1:0))*100"
      }
    }
  }
}
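
The arithmetic inside the script can be tried on its own. This is a minimal sketch assuming value1 is a non-negative integer (with a floating-point value the division would not truncate and the result would differ). RoundUpToHundred is a hypothetical class name used only for illustration.

```java
// Hypothetical helper mirroring the arithmetic of the script above.
class RoundUpToHundred {

    // Rounds a non-negative value up to the next multiple of 100.
    static long roundUp(long value) {
        return (value / 100 + (value % 100 != 0 ? 1 : 0)) * 100;
    }

    // Sums the rounded values, as the sum aggregation does over the matching documents.
    static long sumRounded(long[] values) {
        long total = 0;
        for (long v : values) {
            total += roundUp(v);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(roundUp(1));    // 100
        System.out.println(roundUp(100));  // 100
        System.out.println(roundUp(250));  // 300
        System.out.println(sumRounded(new long[]{1, 100, 250})); // 500
    }
}
```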

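ES: getting the distinct total of aggregation results

ES distinct count: cardinality (the equivalent of count(distinct))

To count how many different values a field takes across an ES index, use the cardinality aggregation (its result is approximate for high-cardinality fields):

Request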

GET http://localhost:9200/cdr_202103*/_search
Content-Type: application/json

{
  "aggregations": {
    "count": {
      "cardinality": {
        "field": "enterpriseId"
      }
    },
    "count2": {
      "terms": {
        "field": "enterpriseId"
      }
    }
  }
}
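
As an in-memory analogy for the two aggregations in this request, the sketch below computes a distinct count (what cardinality reports) and per-value counts (what terms reports). The enterpriseId values are invented sample data and DistinctCount is a hypothetical class. Unlike this exact version, cardinality is approximate on large data sets and terms returns only the top buckets (10 by default).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogy; no Elasticsearch client is involved.
class DistinctCount {

    // Analogue of cardinality: the number of different values.
    static int distinct(List<String> values) {
        return new HashSet<>(values).size();
    }

    // Analogue of terms: how many documents carry each value.
    static Map<String, Long> termCounts(List<String> values) {
        Map<String, Long> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample data, one enterpriseId per document.
        List<String> ids = Arrays.asList("7000001", "7000001", "7000002", "7000003", "7000003");
        System.out.println(distinct(ids));   // 3
        System.out.println(termCounts(ids)); // {7000001=2, 7000002=1, 7000003=2}
    }
}
```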

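ES nested field aggregation

ES nested sub-aggregations

When aggregating on a property of a nested field, you sometimes need the number of parent documents rather than the number of times a value occurs in the nested array. The reverse_nested aggregation covers this: placed inside a nested aggregation, it joins back to the parent document, so its sub-aggregations can use fields that sit outside the nested object.

Example

curl -XGET 'localhost:9200/ticket_202001*/_search' -H 'Content-Type:application/json' -d'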
{
  "size":0,
  "aggregations":{
    "record":{
      "nested": {
        "path": "record"
      },
      "aggregations":{
        "recordNo":{
          "terms": {
            "field": "record.no",
            "size": 10
          },
          "aggregations": {
            "thinkCount":{
              "reverse_nested":{},
              "aggregations":{
                "ids":{
                  "terms":{
                    "field":"uniqueId"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}'
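
The difference between the two counts can be illustrated in memory. In this sketch a ticket stands for a root document and recordNos for the record.no values of its nested entries. The Ticket class, the sample data, and both method names are hypothetical and only model the counting, not the ES API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of nested vs. reverse_nested counting.
class ReverseNestedSketch {

    // A root document together with the "no" values of its nested record entries.
    static final class Ticket {
        final String uniqueId;
        final List<String> recordNos;

        Ticket(String uniqueId, List<String> recordNos) {
            this.uniqueId = uniqueId;
            this.recordNos = recordNos;
        }
    }

    // Nested view: every nested entry counts once, so repeats inside one ticket add up.
    static Map<String, Integer> nestedCounts(List<Ticket> tickets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Ticket t : tickets) {
            for (String no : t.recordNos) {
                counts.merge(no, 1, Integer::sum);
            }
        }
        return counts;
    }

    // reverse_nested view: each root document counts once per record.no it contains.
    static Map<String, Integer> parentCounts(List<Ticket> tickets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Ticket t : tickets) {
            for (String no : new HashSet<>(t.recordNos)) {
                counts.merge(no, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Ticket> tickets = Arrays.asList(
                new Ticket("t1", Arrays.asList("A", "A", "B")),
                new Ticket("t2", Arrays.asList("A")));
        System.out.println(nestedCounts(tickets)); // {A=3, B=1}
        System.out.println(parentCounts(tickets)); // {A=2, B=1}
    }
}
```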

ES: getting the size of a nested field

Example

curl -XGET 'localhost:9200/ticket_*/_search?pretty' -H 'Content-Type:application/json' -d'
{
  "size": 0,
  "aggregations": {
    "ticketIds": {
      "terms": {
        "field": "uniqueId",
        "size": 100
      },
      "aggregations": {
        "taskCount": {
          "sum": {
            "script": "params._source.record.size()"
          }
        }
      }
    }
  }
}'

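ES bucket aggregations and paginated aggregations

Filtering, paginating, and sorting aggregation results in ES: bucket_selector (filtering, named bucket_filter in the query) and bucket_sort (sorting and truncation).

Version constraint: requires a version later than 6.0.0 (bucket_sort was added in 6.1).

Example query: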

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "type": "my_record"
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "_source": false,
  "aggregations": {
    "dateTerm": {
      "terms": {
        "field": "uniqueId",
        "size": 3000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false
      },
      "aggregations": {
        "recordCount": {
          "value_count": {
            "field": "uniqueId"
          }
        },
        "bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "recordCount": "recordCount"
            },
            "script": "params.recordCount < 100"
          }
        },
        "bucket_sort": {
          "bucket_sort": {
            "sort": [
              {
                "recordCount": {
                  "order": "desc"
                }
              }
            ],
            "size": 10,
            "from": 0
          }
        }
      }
    }
  }
}
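
The filter, sort, and truncate steps of this query can be mirrored with Java streams. The sketch below groups by uniqueId, keeps buckets whose count is below a threshold (the bucket_selector step), sorts by count descending, and then applies from and size (the bucket_sort step). BucketPipeline and select are hypothetical names, and the tie-break by key is an assumption added only to make the output deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory analogue of terms + bucket_selector + bucket_sort.
class BucketPipeline {

    static List<Map.Entry<String, Long>> select(List<String> uniqueIds, long maxExclusive, int from, int size) {
        // terms + value_count: one bucket per uniqueId with its document count.
        Map<String, Long> counts = uniqueIds.stream()
                .collect(Collectors.groupingBy(id -> id, Collectors.counting()));

        return counts.entrySet().stream()
                // bucket_selector: keep buckets where recordCount < maxExclusive.
                .filter(e -> e.getValue() < maxExclusive)
                // bucket_sort "sort": count descending; key order breaks ties (assumption).
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                // bucket_sort "from" and "size".
                .skip(from)
                .limit(size)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "a", "a", "b", "c", "c", "d", "d", "d", "d", "d");
        // "d" has 5 documents and is filtered out by the threshold of 5.
        System.out.println(select(ids, 5, 0, 2)); // [a=3, c=2]
    }
}
```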

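ES: paginating aggregation results

Techniques: bucket_sort (pagination) and cardinality (total count).

Preconditions:

ES document structure: {city, humanCount}
Requirement: report the population of each city, one page at a time.

Query: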

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "dateTerm": {
      "terms": {
        "field": "city",
        "size": 3000,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false
      },
      "aggregations": {
        "humanCount": {
          "value_count": {
            "field": "city"
          }
        },
        "bucket_sort": {
          "bucket_sort": {
            "sort": [
              {
                "humanCount": {
                  "order": "desc"
                }
              }
            ],
            "size": 10,
            "from": 0
          }
        }
      }
    },
    "totalCount": {
      "cardinality": {
        "field": "city"
      }
    }
  }
}
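
On the client side, a page number has to be turned into the from value of bucket_sort, and the cardinality result into a page count. A minimal sketch follows, assuming 1-based page numbers; AggPaging and its methods are hypothetical helpers. Note that bucket_sort only sees the buckets returned by the parent terms aggregation, so the terms size (3000 in the query above) must be at least page * size.

```java
// Hypothetical paging helpers for bucket_sort-based pagination.
class AggPaging {

    // Value for bucket_sort "from", given a 1-based page number.
    static int from(int page, int size) {
        return (page - 1) * size;
    }

    // Number of pages, given the distinct bucket total reported by cardinality.
    static long totalPages(long totalBuckets, int size) {
        return (totalBuckets + size - 1) / size;
    }

    // Smallest terms "size" that still covers the requested page.
    static int requiredTermsSize(int page, int size) {
        return page * size;
    }

    public static void main(String[] args) {
        System.out.println(from(3, 10));              // 20
        System.out.println(totalPages(25, 10));       // 3
        System.out.println(requiredTermsSize(3, 10)); // 30
    }
}
```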

References

Reference links:
- Solving the ES aggregation time zone issue