### Introduction

Mean average precision, often referred to as mAP, is a common evaluation metric for object detection.

In this blog post, I would like to discuss how mAP is computed.

### Detection Average Precision (AP)

The mean average precision is just the mean of the average precisions (AP), so let’s take a look at how to compute AP first.

#### Evaluation Algorithm

The AP evaluation algorithm could be described as follows.

- Rank all the predicted detections for all the images in the entire evaluation dataset by predicted confidence score, in descending order.
- Iterate through all the predicted detections from high rank (high predicted confidence score) to low rank (low predicted confidence score), and check whether each predicted detection matches one of the ground truth detections. If there is a match, mark the predicted detection as a correct detection and “remove” the ground truth detection so that it will not be matched twice. Otherwise, mark the predicted detection as an incorrect detection. The match criterion could be arbitrary, but the usual choice is an intersection-over-union (IOU), sometimes also referred to as Jaccard similarity, of over 50% between the predicted detection and the ground truth detection. Suppose the rank of the current predicted detection is $i$. Compute the precision and recall for the detections from rank 1 to rank $i$.
- Once the iteration is over, each predicted detection will have a (precision, recall) tuple. Plot all the (precision, recall) points from all the predicted detections in a 2D space, where recall is on the $x$-axis and precision is on the $y$-axis. Together, the (precision, recall) points form a precision-recall curve.
- The average precision, i.e., AP, is computed from the precision-recall curve. However, the exact formula could vary. The precision-recall curve could be further smoothed, and the AP could be defined as the area under the curve (AUC), the mean of precisions sampled at fixed recall intervals, etc. AP is always a value between 0 and 1. The higher the AP is, the better the detection is.
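The matching step above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: boxes are stored as `(x_min, y_min, x_max, y_max)`, each prediction is matched to the remaining ground truth with the largest IOU (the description above only requires *a* match), and the names `iou` and `match_detections` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detections(
        predictions: List[Tuple[int, float, Box]],  # (image_id, confidence, box)
        ground_truths: Dict[int, List[Box]],        # image_id -> ground truth boxes
        iou_threshold: float = 0.5) -> List[bool]:
    """Return one correct/incorrect flag per prediction, ordered by rank."""
    # Indices of the ground truths that have not been matched yet, per image.
    unmatched = {
        image_id: list(range(len(boxes)))
        for image_id, boxes in ground_truths.items()
    }
    flags = []
    # Visit predictions from the highest to the lowest confidence score.
    for image_id, _, box in sorted(predictions,
                                   key=lambda p: p[1],
                                   reverse=True):
        candidates = unmatched.get(image_id, [])
        # Assumption: pick the remaining ground truth with the largest IOU.
        best = max(candidates,
                   key=lambda j: iou(box, ground_truths[image_id][j]),
                   default=None)
        if best is not None and iou(
                box, ground_truths[image_id][best]) > iou_threshold:
            candidates.remove(best)  # "Remove" it so it cannot match twice.
            flags.append(True)
        else:
            flags.append(False)
    return flags
```

The returned flags are in rank order, so they can be turned directly into the running precision and recall values.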

#### Concrete Example

Given a detection evaluation dataset that has 5 ground truth detections, our detection algorithm submitted 6 predicted detections. Using the above evaluation algorithm, the evaluator computed the precision and recall for each of the predicted detections.

| Rank | Predicted Detection ID | Image ID | Predicted Confidence | Matched Ground Truth Detection ID | Number of Matched Detections So Far | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 2 | 0.95 | 3 | 1 | 1 | 0.2 |
| 2 | 1 | 1 | 0.9 | 1 | 2 | 1 | 0.4 |
| 3 | 3 | 1 | 0.8 | - | 2 | 0.667 | 0.4 |
| 4 | 2 | 1 | 0.75 | - | 2 | 0.5 | 0.4 |
| 5 | 4 | 2 | 0.7 | 4 | 3 | 0.6 | 0.6 |
| 6 | 6 | 3 | 0.65 | - | 3 | 0.5 | 0.6 |

Note that precision is just the “Number of Matched Detections So Far” divided by the “Rank”, and recall is just the “Number of Matched Detections So Far” divided by the total number of ground truth detections, which is 5 in our case.
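To make the two columns concrete, here is a small sketch that recomputes them from the match flags in the table. The function name `precision_recall_points` and the flag list are my own illustration, not part of any library.

```python
from typing import List, Tuple


def precision_recall_points(
        matched: List[bool],
        num_ground_truths: int) -> Tuple[List[float], List[float]]:
    """Compute running precision and recall for detections in rank order."""
    precisions, recalls = [], []
    num_matched = 0
    for rank, is_match in enumerate(matched, start=1):
        num_matched += int(is_match)
        # Precision: matched detections so far divided by the rank.
        precisions.append(num_matched / rank)
        # Recall: matched detections so far divided by all ground truths.
        recalls.append(num_matched / num_ground_truths)
    return precisions, recalls


# Ranks 1, 2 and 5 in the table have a matched ground truth detection.
matched = [True, True, False, False, True, False]
precisions, recalls = precision_recall_points(matched, num_ground_truths=5)
print([round(p, 3) for p in precisions])
print([round(r, 3) for r in recalls])
```

Rounded to three decimals, the printed lists should agree with the “Precision” and “Recall” columns of the table.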

The AP in our example is defined to be the AUC of the precision-recall curve. We could plot the precision-recall curve and compute the AUC using the following program.

```python
# precision_recall.py
from typing import List, Tuple

import matplotlib

matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from sklearn import metrics


def modify_precisions_recalls(
        precisions: List[float],
        recalls: List[float]) -> Tuple[List[float], List[float]]:
    """Add head and tail points to precisions and recalls for precision-recall curve plot.

    Args:
        precisions (List[float]): Precisions.
        recalls (List[float]): Recalls.

    Returns:
        Tuple[List[float], List[float]]: Modified precisions and recalls for precision-recall curve plot.
    """
    # Head point: (recall 0, first precision). Tail point: (last recall, precision 0).
    modified_precisions = [precisions[0]] + precisions.copy() + [0]
    modified_recalls = [0] + recalls.copy() + [recalls[-1]]

    return modified_precisions, modified_recalls


def plot_precision_recall(precisions: List[float], recalls: List[float],
                          filename: str) -> None:
    """Plot precision-recall curve and compute AP (AUC).

    Args:
        precisions (List[float]): Precisions.
        recalls (List[float]): Recalls.
        filename (str): Plot file name.
    """
    assert len(precisions) == len(recalls)

    # Using AUC as AP
    auc = metrics.auc(x=recalls, y=precisions)

    fig, ax = plt.subplots(figsize=(6, 6))
    ax.axis("square")
    # Blue: the measured points. Red: the head segment. Green: the tail segment.
    ax.plot(recalls[1:-1], precisions[1:-1], "-bo", label=None, clip_on=False)
    ax.plot(recalls[0:2], precisions[0:2], "-ro", label=None, clip_on=False)
    ax.plot(recalls[-2:], precisions[-2:], "-go", label=None, clip_on=False)
    ax.set_xlim([0, 1.0])
    ax.set_ylim([0, 1.0])
    ax.xaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(0.1))
    ax.set_xlabel("Recall", fontsize=12, fontweight="bold")
    ax.set_ylabel("Precision", fontsize=12, fontweight="bold")
    ax.set_title(f"AP = {auc}", fontsize=14, fontweight="bold")
    ax.fill_between(recalls[1:-1], precisions[1:-1], color="skyblue")
    ax.fill_between(recalls[0:2], precisions[0:2], color="salmon")
    ax.fill_between(recalls[-2:-1], precisions[-2:-1], color="lime")

    fig.savefig(filename + ".svg", format="svg", dpi=600, bbox_inches="tight")


def main():

    precisions = [1, 1, 0.667, 0.5, 0.6, 0.5]
    recalls = [0.2, 0.4, 0.4, 0.4, 0.6, 0.6]
    filename = "precision-recall"

    modified_precisions, modified_recalls = modify_precisions_recalls(
        precisions=precisions, recalls=recalls)

    plot_precision_recall(precisions=modified_precisions,
                          recalls=modified_recalls,
                          filename=filename)


if __name__ == "__main__":

    main()
```

In our example, the AP is 0.51.

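Note that the AUC from recall 0 to 0.2 is “free”: it comes from the head point that the program adds at recall 0, not from any predicted detection. In practice, this free AUC is very small because there are a lot of ground truth detections, which makes the first recall value close to 0, so it should not bias the evaluation significantly.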
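For readers who want to check the number without scikit-learn or matplotlib, the same trapezoidal AUC can be sketched in plain Python. The helper names `average_precision_auc` and `free_area` are hypothetical, and the head and tail points follow the same convention as the program above.

```python
from typing import List


def average_precision_auc(precisions: List[float],
                          recalls: List[float]) -> float:
    """Trapezoidal area under the precision-recall curve."""
    # Head point (recall 0, first precision) and tail point (last recall, 0).
    ps = [precisions[0]] + list(precisions) + [0.0]
    rs = [0.0] + list(recalls) + [recalls[-1]]
    area = 0.0
    for i in range(1, len(rs)):
        # Width of the recall step times the mean of the two precisions.
        area += (rs[i] - rs[i - 1]) * (ps[i] + ps[i - 1]) / 2
    return area


def free_area(precisions: List[float], recalls: List[float]) -> float:
    """Area contributed by the segment between recall 0 and the first recall."""
    return recalls[0] * precisions[0]


precisions = [1, 1, 0.667, 0.5, 0.6, 0.5]
recalls = [0.2, 0.4, 0.4, 0.4, 0.6, 0.6]
print(round(average_precision_auc(precisions, recalls), 2))  # 0.51
print(round(free_area(precisions, recalls), 2))  # 0.2
```

In this toy example the free segment accounts for 0.2 of the 0.51, which is large only because the dataset has just 5 ground truth detections.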

### Detection Mean Average Precision (mAP)

Once we know how to compute the average precision for object detection, we could further compute the mean average precision, i.e., the mAP that we commonly see in a lot of object detection benchmarks.

In different contexts, mAP could be computed differently. In some scenarios, AP is computed for each class independently, and the mAP is just the weighted average of the APs for all the classes. For example, we could compute three AP values, AP@Car, AP@Pedestrian, and AP@Bicycle, and the mAP is just the weighted average of the three AP values. In some other scenarios, many APs are generated by varying the match criterion, and the mAP is just the weighted average of all the APs. For example, by varying the IOU threshold for a match, we could compute five AP values, AP@IOU0.5, AP@IOU0.6, AP@IOU0.7, AP@IOU0.8, and AP@IOU0.9, and the mAP is just the weighted average of the five AP values. Because a stricter IOU threshold makes a match harder, we typically have AP@IOU0.5 ≥ AP@IOU0.6 ≥ AP@IOU0.7 ≥ AP@IOU0.8 ≥ AP@IOU0.9.
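As a small illustration of the weighted average, here is a sketch in Python. The function name `mean_average_precision`, the AP values, and the weights are all made up for this example; they are not results from any benchmark.

```python
from typing import Dict, Optional


def mean_average_precision(
        aps: Dict[str, float],
        weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of AP values; equal weights if none are given."""
    if not aps:
        raise ValueError("At least one AP value is required.")
    if weights is None:
        weights = {name: 1.0 for name in aps}
    total_weight = sum(weights[name] for name in aps)
    return sum(aps[name] * weights[name] for name in aps) / total_weight


# Hypothetical per-class AP values, for illustration only.
class_aps = {"Car": 0.6, "Pedestrian": 0.5, "Bicycle": 0.4}
print(round(mean_average_precision(class_aps), 3))  # 0.5

# Hypothetical weights that count the Car class twice as much.
class_weights = {"Car": 2.0, "Pedestrian": 1.0, "Bicycle": 1.0}
print(round(mean_average_precision(class_aps, class_weights), 3))  # 0.525
```

The same function works for the IOU-threshold variant by using keys such as `"IOU0.5"` and `"IOU0.6"` instead of class names.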

### Fool mAP Evaluation?

Is it possible to add random predicted detections in order to improve the AP/mAP?

Suppose we guess that some detections are missing in image 2 in the above example, so we randomly add two detections, with detection IDs 7 and 8. Unfortunately, neither of the two detections hits a ground truth detection.

| Rank | Predicted Detection ID | Image ID | Predicted Confidence | Matched Ground Truth Detection ID | Number of Matched Detections So Far | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 2 | 0.95 | 3 | 1 | 1 | 0.2 |
| 2 | 1 | 1 | 0.9 | 1 | 2 | 1 | 0.4 |
| 3 | 3 | 1 | 0.8 | - | 2 | 0.667 | 0.4 |
| 4 | 7 | 2 | 0.78 | - | 2 | 0.5 | 0.4 |
| 5 | 2 | 1 | 0.75 | - | 2 | 0.4 | 0.4 |
| 6 | 8 | 2 | 0.72 | - | 2 | 0.333 | 0.4 |
| 7 | 4 | 2 | 0.7 | 4 | 3 | 0.429 | 0.6 |
| 8 | 6 | 3 | 0.65 | - | 3 | 0.375 | 0.6 |

AP becomes about 0.476, which is lower than the original AP of 0.51. This means the precision of the detection is very important for reaching a high AP/mAP score. Conventional object detection models have a hard-coded post-processing step called *non-maximum suppression (NMS)*. If this step is not done well, there could be a lot of predicted detections that point to the same object. Since a ground truth detection can only be matched once, the extra detections are counted as false positives, which reduces the precision of the detection and in turn the AP/mAP.

Similarly, suppose we only submit the detections that have the highest predicted confidence scores, such as the following two.

| Rank | Predicted Detection ID | Image ID | Predicted Confidence | Matched Ground Truth Detection ID | Number of Matched Detections So Far | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 1 | 5 | 2 | 0.95 | 3 | 1 | 1 | 0.2 |
| 2 | 1 | 1 | 0.9 | 1 | 2 | 1 | 0.4 |

AP becomes 0.4, which is lower than the original AP of 0.51. This means the recall of the detection is also very important for reaching a high AP/mAP score.
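The three submissions can be compared side by side with a short sketch. It assumes the same AUC definition of AP as in the example above (trapezoids, with a head point at recall 0 and a tail point at precision 0); the function name `ap_from_matches` and the variable names are hypothetical.

```python
from typing import List


def ap_from_matches(matched: List[bool], num_ground_truths: int) -> float:
    """AP (trapezoidal AUC) from rank-ordered correct/incorrect flags."""
    precisions, recalls = [], []
    num_matched = 0
    for rank, is_match in enumerate(matched, start=1):
        num_matched += int(is_match)
        precisions.append(num_matched / rank)
        recalls.append(num_matched / num_ground_truths)
    # Head point (recall 0, first precision) and tail point (last recall, 0).
    ps = [precisions[0]] + precisions + [0.0]
    rs = [0.0] + recalls + [recalls[-1]]
    return sum((rs[i] - rs[i - 1]) * (ps[i] + ps[i - 1]) / 2
               for i in range(1, len(rs)))


# Match flags read off the three tables, in rank order.
original = [True, True, False, False, True, False]
padded = [True, True, False, False, False, False, True, False]
truncated = [True, True]

for name, flags in [("original", original), ("padded", padded),
                    ("truncated", truncated)]:
    print(name, round(ap_from_matches(flags, num_ground_truths=5), 3))
```

Both the padded and the truncated submission should score below the original one, for the precision and recall reasons described above.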

So the general conclusion is that it is very hard to fool the AP/mAP evaluation: adding random detections hurts precision, and dropping uncertain detections hurts recall.