MongoDB Aggregation Framework an Overview

Mar 31, 2019

Mins to Read

All

The MongoDB Aggregation Framework is a way to query the data from MongoDB. It helps us to break the complex logics into a simple set of sequential operations. Just like a waterfall model, Output from one stage is fed as an input for the next stage until the desired result is achieved.

Each stage of transformation during this process is termed as pipelines.

Fig.1.1. Sample workflow on The Aggregation Framework.

How it works ?

Syntax:

 
db..aggregate()

Aggregation operation should be defined as an array, Each array value is a stage.

 
( [ {Stage 1}, {Stage 2}, {Stage 3} ] )

It will pass the set of existing documents or it will generate the computed result based on aggregation condition that result should be passed (piped) to the next stage.

Sample Query:

 
db.mydbops.aggregate( [

{$match : {"AUTHOR" : "SAKTHI"}},

{ $project : {AUTHOR : 1 ,BLOG : 1 ,TAGS : 1}},

{$group: { "_id" : "$BLOG", THOUGHTS: { $push: "$TAGS" }}},

{ "$skip": 1 },

{ "$limit": 1 }

] );

Guidelines:

Query for each stage has to be optimised for optimal performance.
- You may require only a subset of data from a collection to perform an aggregation operation. Therefore, use the $match, $limit, and $skip stages to filter the documents.
- Using $match operation in the start stages of the aggregation will be helpful in certain cases as it scans only the documents is satisfied the given condition and pass only the matching documents in a collection to the next stage.
- Ensure data set size is filtered during the initial stages than the later stages to improve the performance.

The most common stages in aggregation framework:

$project − Pass documents to specific field or newly computed field to the next stage

$match − Reduce the number of documents that is passed to the next stage

$group − Group document by a distinct key, the key can also be a computed key

$sort − Returns the input documents in a sorted order

$skip − It is possible to skip the particular amount of documents for a given amount of documents.

$limit − Limit the number of documents for the next stage

$unwind − This is used to unwind document that are using arrays. When using an array, the data is kind of pre-joined and this operation will be undone with this to have individual documents again. Thus with this stage we will increase the amount of documents for the next stage.

Explanation With Use Case:

Consider a collection named mydbops having the following documents with information containing Blogging information such as Blog name, Tag information, Author name etc.

Sample Document:

 
{
"_id"      : ObjectId("5c7b83fdfe01e9de2e7d4935"),
"BLOG"     : "MySQL INNODB ARCHITECTURE",
"AUTHOR"   : "SAKTHI",
"POSTED"   : ISODate("2019-03-03T07:36:29.879Z"),
"DATA"     : "{
. . . . . . . . . . . . . . . . . . . . . . . .
}",
"VIEWERS"  : 10,
"TAGS"     : [ "MySQL", "InnoDB" ],
"comments" : [ { "AUTHOR" : "KARTHIK", "TEXT" : "NICE THOUGHT!!!!" }, { "AUTHOR" : "SELVA", "TEXT" : "GOOD!!!!" } ]
}

We can run the aggregation queries against the above set of documents. Let us discuss about how the aggregation works on step by step.

Requirement:

Need to find the total number of hits for the blogs posted during this year so far, for each Author, With tag MySQL in order of highest hits (viewers) first.

Aggregate query:

 
db.mydbops.aggregate([
{ $match : { TAGS : "MySQL" , 'POSTED' : { '$gte' : ISODate("2018-12-31T23:59:59") }} },
{ $project : {AUTHOR : 1 ,VIEWERS :1}},
{"$group" : {_id:"$AUTHOR", HITS:{ $sum:"$VIEWERS"}}},
{$sort: {HITS: -1}}])

Stages Explained:

STAGE 1:

 
{ $match : {TAGS: "MySQL", 'POSTED': {'$gte': ISODate("2018-12-31T23:59:59")}} }

In Stage 1, the $match operator scan’s what are all the documents satisfies the above condition only that documents should pass to the next stage

STAGE 2:

 
{ $project: {AUTHOR: 1, VIEWERS:1} }

In Stage 2, What are all the fields mentioned in the $project operator only that fields should pass to the next stage

STAGE 3:

 
{ "$group": {_id:"$AUTHOR", HITS: {$sum: "$VIEWERS"}} }

In Stage 3, $group operator group the documents by some specified expression and outputs to the next stage a document for each distinct grouping. The output documents contain an _id field which contains the distinct group by key.

STAGE 4:

 
{ $sort: {HITS: -1} }

In Stage 4, $sort operator takes a document that specifies the field(s) to sort by and the respective sort order. 1 to specify ascending order and -1 to specify descending order.

Aggregation flow Diagram:

Production Problems:

Using $match in initial stage is a best approach, But if the document size is huge, Then further stages will suffer and a lot of CPU cycles will be wasted.
- In case of huge document size, It’s advisable to have required columns projected first than later.
- Aggregate supports $lookup which is comparative to JOINS in RDBMS, Should be used only when necessary.
- Not all complex logics can be fixed by Aggregate. Try to optimise the document structure by analysing the Frequency of operation and query pattern.

MongoDB

Features

No items found.