Amazon Rekognition: What It Is and When to Use It

Definition

Amazon Rekognition is a fully managed computer-vision service that analyzes images and videos through a simple API. Behind the scenes it runs deep-learning models — trained and updated by AWS — that detect objects, scenes, text, faces, celebrities, unsafe content, and personal protective equipment. Rather than collecting training data and training your own CV models, you call DetectLabels, DetectFaces, DetectText, DetectModerationLabels, or CompareFaces, and pay per image (or per minute of video) processed. For workloads where AWS's pre-trained models are good enough — and they usually are for common scenes and objects — Rekognition is dramatically faster to deploy than a custom SageMaker CV model.
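
To make the call pattern concrete, here is a minimal boto3 sketch of DetectLabels against an image in S3; the bucket, key, and region are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-photos-bucket", "Name": "beach.jpg"}},
    MaxLabels=10,      # cap the number of labels returned
    MinConfidence=80,  # drop low-confidence guesses
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
```

That is the whole integration: no model training, no endpoint management, just a client call and a JSON response.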

How It Works

Rekognition has two main surfaces: image APIs and video APIs.

Image APIs (synchronous)

Call an operation with either an S3 object reference or a base64-encoded byte stream up to 5 MB. Rekognition returns a JSON response with labels, bounding boxes, confidence scores, and metadata. Key operations:

  • DetectLabels — identifies thousands of objects, scenes, and activities (for example "Dog", "Beach", "Wedding"). Returns hierarchical labels and bounding boxes.
  • DetectFaces — detects faces and returns attributes (age range, gender, emotions, landmarks, pose).
  • CompareFaces — compares a source face to target faces and returns similarity scores.
  • SearchFacesByImage / IndexFaces — build and query a face collection (a server-side index of face vectors); the workflow is sketched in code after this list.
  • DetectText — OCR for printed and stylized text in images.
  • DetectModerationLabels — flags explicit, suggestive, violent, or otherwise unsafe content using a hierarchical taxonomy.
  • DetectProtectiveEquipment — detects PPE (head cover, face cover, hand cover) for workplace safety.
  • RecognizeCelebrities — identifies well-known people in images.
  • Custom Labels — you upload a small labeled dataset and Rekognition trains a private model; inference runs through a dedicated DetectCustomLabels operation.
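
As referenced above, a hedged boto3 sketch of the face-collection workflow, enroll then search; the collection ID, buckets, image names, and threshold are assumptions for illustration:

```python
import boto3

rekognition = boto3.client("rekognition")

# One-time setup: a collection is a server-side index of face vectors.
rekognition.create_collection(CollectionId="employees")

# Enroll a known face; ExternalImageId maps matches back to your records.
rekognition.index_faces(
    CollectionId="employees",
    Image={"S3Object": {"Bucket": "hr-photos", "Name": "alice.jpg"}},
    ExternalImageId="alice",
)

# Later: probe the collection with a new image.
result = rekognition.search_faces_by_image(
    CollectionId="employees",
    Image={"S3Object": {"Bucket": "door-camera", "Name": "frame-0042.jpg"}},
    FaceMatchThreshold=90,  # assumed threshold for a match
)
for match in result["FaceMatches"]:
    print(match["Face"].get("ExternalImageId"), f"{match['Similarity']:.1f}%")
```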

Video APIs (asynchronous)

For stored video in S3, you call a Start* operation (for example StartLabelDetection), Rekognition processes the video and publishes a notification to SNS when done, then you call Get* to retrieve results. Supported jobs include label detection, face detection/search, celebrity recognition, content moderation, person tracking, and text detection.
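
A minimal boto3 sketch of that Start-then-Get pattern (bucket, topic, and role ARNs are placeholders; in production the Get call is typically triggered by the SNS notification via Lambda rather than run inline):

```python
import boto3

rekognition = boto3.client("rekognition")

job = rekognition.start_label_detection(
    Video={"S3Object": {"Bucket": "video-archive", "Name": "episode-01.mp4"}},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:rekognition-jobs",
        "RoleArn": "arn:aws:iam::123456789012:role/RekognitionPublishToSNS",
    },
)

# Once the SNS notification reports SUCCEEDED, fetch the results
# (paginated via NextToken for long videos):
results = rekognition.get_label_detection(JobId=job["JobId"], SortBy="TIMESTAMP")
for item in results["Labels"]:
    print(item["Timestamp"], item["Label"]["Name"])
```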

For live video, Rekognition integrates with Kinesis Video Streams: frames flow into Rekognition Video, and results stream out through a Kinesis Data Stream for real-time processing (for example detecting a known face at a door camera).
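
A sketch of wiring this up with the stream-processor operations; all names, ARNs, and the match threshold are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition")

rekognition.create_stream_processor(
    Name="door-camera-face-search",
    Input={"KinesisVideoStream": {
        "Arn": "arn:aws:kinesisvideo:us-east-1:123456789012:stream/door-cam/1600000000000"}},
    Output={"KinesisDataStream": {
        "Arn": "arn:aws:kinesis:us-east-1:123456789012:stream/face-matches"}},
    RoleArn="arn:aws:iam::123456789012:role/RekognitionStreamRole",
    Settings={"FaceSearch": {"CollectionId": "known-faces", "FaceMatchThreshold": 90}},
)

rekognition.start_stream_processor(Name="door-camera-face-search")
# Face-match records now flow into the Kinesis Data Stream, where a
# Lambda or KCL consumer can raise alerts in near real time.
```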

Key Features and Limits

  • Thousands of labels out of the box, spanning objects, scenes, activities, food, and more.
  • Face collections — up to 20 million faces per collection, searchable in milliseconds.
  • Moderation taxonomy — two-tier labels (Top Level "Violence" with sub-labels "Graphic Violence", etc.) for nuanced content policies.
  • Custom Labels — classification or object detection trained with 10–100+ labeled images per class.
  • Streaming Video — integrates with Kinesis Video Streams for real-time alerts.
  • Sync image size limits: 5 MB (base64) or 15 MB (S3 object), JPEG / PNG only. Minimum 80×80 pixels.
  • Async video: up to 10 GB per video, 6 hours duration. MPEG-4 and MOV containers with H.264 codec.
  • Face search latency: milliseconds against millions of faces.

Common Use Cases

  1. Content moderation at scale — social platforms flagging unsafe user uploads before human review. Rekognition moderates the firehose cheaply; humans review only flagged items (a Lambda sketch follows this list).
  2. Identity verification — compare a selfie to an ID photo for onboarding, fraud prevention, and access control.
  3. Media and entertainment search — tag a film archive by celebrity, scene, and object to make it searchable.
  4. Public safety and retail analytics — detect PPE compliance, count people, track dwell time via store cameras.
  5. Accessibility — auto-caption images and extract text for screen readers.
  6. Smart home and security — known-face alerts through streaming video + Kinesis.
  7. Document preprocessing — lightweight OCR before feeding to Textract or a custom NLP pipeline.
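
As a sketch of use case 1, a Lambda handler that moderates each S3 upload and publishes flagged items to a human-review topic; the topic ARN and confidence threshold are assumptions:

```python
import boto3
from urllib.parse import unquote_plus

rekognition = boto3.client("rekognition")
sns = boto3.client("sns")

# Hypothetical topic backing the human-review queue.
REVIEW_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:human-review"

def handler(event, context):
    """Triggered by S3 put events; flags unsafe uploads for human review."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        result = rekognition.detect_moderation_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MinConfidence=60,  # assumed threshold; tune per content policy
        )
        labels = result["ModerationLabels"]
        if labels:
            sns.publish(
                TopicArn=REVIEW_TOPIC_ARN,
                Subject="Upload flagged for review",
                Message=f"s3://{bucket}/{key}: " + ", ".join(l["Name"] for l in labels),
            )
```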

Pricing Model

Rekognition is billed per image analyzed or per minute of video processed:

  • Image APIs — per 1,000 images analyzed per operation; tiered, so the first million images per month bill at one rate and higher volumes bill at progressively lower rates.
  • Face Storage — per 1,000 face metadata objects stored per month in a face collection.
  • Stored video analysis — per minute of video analyzed, priced per operation.
  • Streaming video — per minute of video stream processed.
  • Custom Labels — per training hour, plus per inference hour while the custom model is running (you start/stop it explicitly).
  • Free Tier — 5,000 images analyzed per month and 1,000 face metadata objects for the first 12 months.

Custom Labels billing is the key gotcha: the model bills by the hour while started, whether or not you send requests. Stop the model when not in use, or use Rekognition's pre-trained APIs where possible.
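
A sketch of that start/stop discipline; the model ARN, capacity, and threshold are placeholders:

```python
import boto3

rekognition = boto3.client("rekognition")

# Hypothetical ARN of a trained Custom Labels model version.
MODEL_ARN = ("arn:aws:rekognition:us-east-1:123456789012:"
             "project/pallets/version/v1/1600000000000")

# Billing starts here and continues until stop_project_version.
rekognition.start_project_version(ProjectVersionArn=MODEL_ARN, MinInferenceUnits=1)

# ... wait until the model status is RUNNING, then run the batch ...
result = rekognition.detect_custom_labels(
    ProjectVersionArn=MODEL_ARN,
    Image={"S3Object": {"Bucket": "warehouse-images", "Name": "pallet-001.jpg"}},
    MinConfidence=70,
)
print(result["CustomLabels"])

# Stop the model as soon as the batch finishes to stop the meter.
rekognition.stop_project_version(ProjectVersionArn=MODEL_ARN)
```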

Pros and Cons

Pros

  • No ML expertise required — call an API, get results.
  • Models are maintained and improved by AWS over time.
  • Integrates cleanly with S3, Lambda, Kinesis, and EventBridge for event-driven pipelines.
  • Face collections scale to tens of millions with millisecond search.
  • Pay-per-use with a generous free tier for prototyping.

Cons

  • Accuracy is fixed by AWS's model — you can't fine-tune the pre-trained APIs (only Custom Labels).
  • Legal and regulatory scrutiny around face detection (particularly public-sector use) requires careful evaluation.
  • Custom Labels inference is billed hourly, not per request, so it's pricey for sporadic workloads; it only pays off when the running model stays busy.
  • Video analysis is async for stored files — not suited to millisecond-latency use cases without streaming APIs.
  • Some labels and moderation categories evolve over time, which can change downstream logic.

Comparison with Alternatives

| | Amazon Rekognition | Google Vision AI | Azure AI Vision | SageMaker Custom Model |
| --- | --- | --- | --- | --- |
| Setup | API-only | API-only | API-only | Train, deploy, maintain |
| Customization | Custom Labels | AutoML Vision | Custom Vision | Full control |
| Video | Yes, async + streaming | Yes | Yes | Custom |
| Pricing | Per image / per minute | Per image | Per transaction | Per instance-hour |
| Best for | AWS-native CV at scale | GCP-native CV | Azure-native CV | Highly specialized models |

Rule of thumb: use Rekognition first — only move to SageMaker if Custom Labels isn't accurate enough or you need a model architecture Rekognition doesn't offer.

Exam Relevance

  • Machine Learning Specialty (MLS-C01) — Rekognition vs Custom Labels vs SageMaker, choosing the right service for a given CV requirement, and the async video pattern (Start* → SNS → Get*).
  • AI Practitioner (AIF-C01) — Rekognition as one of the AWS AI services alongside Transcribe, Translate, Comprehend, Textract, Polly, Lex.
  • Solutions Architect Associate (SAA-C03) — event-driven patterns: S3 upload → Lambda → Rekognition → DynamoDB/SNS; moderation pipelines; Kinesis Video + Rekognition for real-time.

Exam trap: Rekognition detects text in scenes (street signs, license plates, product labels). Textract is for documents (forms, tables, receipts). If the question mentions structured documents, the answer is Textract, not Rekognition.

Frequently Asked Questions

Q: When should I use Rekognition Custom Labels instead of the built-in APIs?

A: Use built-in APIs whenever AWS's generic labels cover your use case — "detect a dog in a photo" or "flag nudity" work out of the box. Use Custom Labels when you need domain-specific classifications AWS doesn't know — for example "this is an in-stock pallet versus an out-of-stock pallet" or "this widget has scratches." Custom Labels needs as little as 10 images per class for classification or bounding-box object detection, but you pay for training plus per-hour inference while the model is running, so batch your requests and stop the endpoint when idle.

Q: What's the difference between Rekognition and Textract?

A: Rekognition analyzes general imagery and video — objects, scenes, faces, celebrities, moderation, and text-in-scene like street signs. Textract extracts structured data from documents — forms, tables, receipts, invoices, IDs. If the question is "what's in this picture?" use Rekognition; if the question is "pull the fields out of this invoice," use Textract. They're complementary and often chained.
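
For instance, a minimal scene-text sketch with Rekognition (bucket and key are placeholders); a document pipeline would call Textract's document-analysis APIs instead:

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "street-photos", "Name": "sign.jpg"}}
)
# Rekognition returns both LINE and WORD detections; lines are usually
# what downstream pipelines want.
for det in response["TextDetections"]:
    if det["Type"] == "LINE":
        print(det["DetectedText"], f"{det['Confidence']:.1f}%")
```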

Q: How do I analyze live streaming video?

A: Ingest frames into Kinesis Video Streams, then create a Rekognition Video stream processor pointed at a face collection or label detector. Rekognition processes frames and writes results to a Kinesis Data Stream, which you consume with Lambda or the Kinesis Client Library. This pattern is used for smart-home known-face alerts, retail analytics, and public-safety monitoring. Latency is typically a few seconds end-to-end.


This article reflects AWS features and pricing as of 2026. AWS services evolve rapidly — always verify against the official Amazon Rekognition documentation before making production decisions.

Published: 4/17/2026
