Aggregation Pipeline


The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms them into aggregated results. For example:

db.orders.aggregate([
   { $match: { status: "A" } },
   { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])

First Stage: The $match stage filters the documents by the status field and passes to the next stage only those documents whose status equals "A".

Second Stage: The $group stage groups the documents by the cust_id field and calculates the sum of amount for each unique cust_id.
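
To make the two stages concrete, here is a minimal in-memory sketch in plain JavaScript. It is not how MongoDB executes the pipeline; it only mirrors the logic of $match and $group on a small array. The sample documents are hypothetical, and only the field names (cust_id, amount, status) come from the example above.

```javascript
// Hypothetical sample documents; field names follow the example above.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" }
];

// Stage 1 ($match): keep only documents whose status is "A".
const matched = orders.filter((doc) => doc.status === "A");

// Stage 2 ($group): accumulate the sum of amount per distinct cust_id.
const totals = new Map();
for (const doc of matched) {
  totals.set(doc.cust_id, (totals.get(doc.cust_id) || 0) + doc.amount);
}

// Shape the output like the documents a $group stage emits.
const results = Array.from(totals, ([_id, total]) => ({ _id, total }));
console.log(results);
// [ { _id: 'A123', total: 750 }, { _id: 'B212', total: 200 } ]
```

The document with status "D" never reaches the grouping step, which is why A123 totals 750 rather than 1050. In MongoDB itself, the order of the documents produced by $group is not guaranteed.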

Pipeline

The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document; for example, some stages generate new documents and others filter documents out.

Pipeline stages can appear multiple times in the pipeline, with the exception of the $out, $merge, and $geoNear stages. For a list of all available stages, see Aggregation Pipeline Stages.

MongoDB provides the db.collection.aggregate() method in the mongo shell and the aggregate command to run the aggregation pipeline.

For example usage of the aggregation pipeline, see Aggregation with User Preference Data and Aggregation with the Zip Code Data Set.

Starting in MongoDB 4.2, you can use the aggregation pipeline for updates in:

Command          mongo Shell Methods
findAndModify
update
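
As a rough illustration of an update expressed as a pipeline, the sketch below passes an array of stages where a classic update document would normally go. The students collection and the quiz, exam, total, and modified fields are hypothetical. The guard on db only lets the same snippet load outside the mongo shell without error.

```javascript
// Hypothetical update pipeline: derive `total` from two existing fields
// and record the modification time using the $$NOW system variable.
const updatePipeline = [
  { $set: { total: { $add: ["$quiz", "$exam"] }, modified: "$$NOW" } }
];

// Run only when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  // Assumes MongoDB 4.2 or later and a hypothetical `students` collection.
  db.students.updateOne({ _id: 1 }, updatePipeline);
}
```

Because the update is a pipeline, the new value of total can be computed from the document's current field values, which a plain update document cannot do.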

Pipeline Expressions

Some pipeline stages take a pipeline expression as the operand. Pipeline expressions specify the transformation to apply to the input documents. Expressions have a document structure and can contain other expressions.

Pipeline expressions can only operate on the current document in the pipeline and cannot refer to data from other documents: expression operations provide in-memory transformation of documents.

Generally, expressions are stateless and are only evaluated when seen by the aggregation process, with one exception: accumulator expressions.

The accumulators, used in the $group stage, maintain their state (e.g. totals, maximums, minimums, and related data) as documents progress through the pipeline. Some accumulators are also available in the $project stage; however, when used in the $project stage, they do not maintain their state across documents.
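
The difference between the two uses of an accumulator can be sketched in plain JavaScript. This is an in-memory analogy rather than MongoDB's implementation, and the documents and field names (cls, quizzes, exam) are hypothetical.

```javascript
// Hypothetical documents: an array field and a scalar field per document.
const docs = [
  { cls: "x", quizzes: [8, 9], exam: 80 },
  { cls: "x", quizzes: [7, 6], exam: 70 }
];
const sum = (values) => values.reduce((acc, v) => acc + v, 0);

// Analogous to { $project: { quizTotal: { $sum: "$quizzes" } } }:
// the sum is computed inside each document, with no state carried over.
const projected = docs.map((d) => ({ quizTotal: sum(d.quizzes) }));

// Analogous to { $group: { _id: "$cls", examTotal: { $sum: "$exam" } } }:
// the running total persists as each document passes through.
const state = new Map();
for (const d of docs) {
  state.set(d.cls, (state.get(d.cls) || 0) + d.exam);
}
const grouped = Array.from(state, ([_id, examTotal]) => ({ _id, examTotal }));

console.log(projected); // [ { quizTotal: 17 }, { quizTotal: 13 } ]
console.log(grouped);   // [ { _id: 'x', examTotal: 150 } ]
```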

Starting in version 4.4, MongoDB provides the $accumulator and $function aggregation operators. These operators let users define custom aggregation expressions in JavaScript.
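
A minimal sketch of a custom expression built with $function follows. The players collection, the name field, and the isLongName helper are hypothetical. It assumes MongoDB 4.4 or later with server-side JavaScript enabled.

```javascript
// Hypothetical helper: the JavaScript body that $function will run.
const isLongName = function (name) {
  return name.length > 5;
};

// $function takes the function body, its arguments, and the language.
const functionStage = {
  $addFields: {
    longName: { $function: { body: isLongName, args: ["$name"], lang: "js" } }
  }
};

// Run only when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  db.players.aggregate([functionStage]).forEach(printjson);
}
```

Built-in expression operators are generally preferable where they exist, since custom JavaScript adds execution overhead.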

For more information on expressions, see Expressions.

Aggregation Pipeline Behavior

In MongoDB, the aggregate command operates on a single collection, logically passing the entire collection into the aggregation pipeline. To optimize the operation, use the following strategies wherever possible to avoid scanning the entire collection.

Pipeline Operators and Indexes

MongoDB's query planner analyzes an aggregation pipeline to determine whether indexes can be used to improve pipeline performance. For example, the following pipeline stages can take advantage of indexes:

Note

The following pipeline stages are not a complete list of the stages that can use an index.

$match
The $match stage can use an index to filter documents if it occurs at the beginning of a pipeline.
$sort
The $sort stage can use an index as long as it is not preceded by a $project, $unwind, or $group stage.
$group

The $group stage can sometimes use an index to find the first document in each group if all of the following criteria are met:

  • The $group stage is preceded by a $sort stage that sorts on the field to group by,
  • There is an index on the grouped field that matches the sort order, and
  • The only accumulator used in the $group stage is $first.

See Optimization to Return the First Document of Each Group for an example.
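
A pipeline that meets the three criteria above might look like the following sketch. The collection name foo and the fields x and y are hypothetical, and whether the planner actually applies the optimization depends on the server version and the available indexes.

```javascript
// Sort on the grouping field first, then take only $first per group.
const firstPerGroupPipeline = [
  { $sort: { x: 1, y: 1 } },
  { $group: { _id: "$x", y: { $first: "$y" } } }
];

// Run only when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  // Index on the grouped field, in the same order as the $sort stage.
  db.foo.createIndex({ x: 1, y: 1 });
  db.foo.aggregate(firstPerGroupPipeline).forEach(printjson);
}
```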

$geoNear
The $geoNear pipeline operator takes advantage of a geospatial index. When used, $geoNear must appear as the first stage in an aggregation pipeline.
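
As a sketch of the placement rule, the pipeline below puts $geoNear first and filters afterwards. The places collection, the location and category fields, and the coordinates are hypothetical; the example assumes a 2dsphere index on location.

```javascript
// $geoNear must be the first stage; later stages refine its output.
const nearbyPipeline = [
  {
    $geoNear: {
      near: { type: "Point", coordinates: [-73.99279, 40.719296] },
      distanceField: "dist.calculated", // output field holding the distance
      maxDistance: 2000,                // meters, for a GeoJSON point
      spherical: true
    }
  },
  { $match: { category: "Parks" } },
  { $limit: 5 }
];

// Run only when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  db.places.createIndex({ location: "2dsphere" });
  db.places.aggregate(nearbyPipeline).forEach(printjson);
}
```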

Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline because, even when the pipeline used an index, aggregation still required access to the actual documents.

Early Filtering

If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.

Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort and can use an index. When possible, place $match operators at the beginning of the pipeline.
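
The sketch below shows one way to arrange a pipeline for early filtering. The orders collection and the status, cust_id, and amount fields follow the earlier example; the order_date field and the compound index are assumptions made for illustration.

```javascript
// Filter and sort first so that later stages see fewer documents.
const earlyFilterPipeline = [
  { $match: { status: "A" } },
  { $sort: { order_date: -1 } },
  { $limit: 100 },
  { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
];

// Run only when a mongo shell `db` object is available.
if (typeof db !== "undefined") {
  // Hypothetical compound index that supports both the filter and the sort.
  db.orders.createIndex({ status: 1, order_date: -1 });
  db.orders.aggregate(earlyFilterPipeline).forEach(printjson);
}
```

With the leading $match and $sort stages, the start of the pipeline behaves like a single sorted query, so the grouping step only processes the 100 most recent matching orders.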

Considerations

Sharded Collections

The aggregation pipeline supports operations on sharded collections. See Aggregation Pipeline and Sharded Collections.

Aggregation Pipeline vs Map-Reduce

The aggregation pipeline provides better performance and a more coherent interface than map-reduce.

Various map-reduce operations can be rewritten using aggregation pipeline operators such as $group and $merge. For map-reduce operations that require custom functionality, MongoDB provides the $accumulator and $function aggregation operators starting in version 4.4. These operators let users define custom aggregation expressions in JavaScript.

See Map-Reduce Examples for details.

Limitations

The aggregation pipeline has some limitations on value types and result size. See Aggregation Pipeline Limits for details on these limits and restrictions.

Pipeline Optimization

The aggregation pipeline has an internal optimization phase that improves performance for certain sequences of operators. For details, see Aggregation Pipeline Optimization.