Text Extraction with AWS Lambda and Amazon Textract in .NET

If you’ve ever tried extracting text from PDF or image files, you know it’s not always straightforward, especially when working with scanned documents. In this article, let’s build a simple, serverless solution using AWS Textract, AWS Lambda, and .NET.

Here’s the idea: Whenever a PDF or image is uploaded to an S3 bucket, a Lambda function (written in .NET) gets triggered. This function uses Textract to pull out the text content from the document. No servers, no queues, just clean, event-driven architecture.

By the end of this guide, you’ll have a working setup that can extract text from PDFs without spinning up any backend services. If you’re a .NET developer looking to explore AWS and serverless processing, this is a great place to start.

What is AWS Textract?

AWS Textract is a fully managed service from AWS that can extract text, tables, and key-value pairs from scanned documents and images — including PDFs. Think of it as OCR (Optical Character Recognition) on steroids.

You don’t need to train any ML models or manage any infrastructure. Just send a document to Textract, and it returns structured data that you can directly use in your application.

Textract offers two main APIs:

- DetectDocumentText: This one is straightforward. It extracts plain text (words and lines) from the document. Perfect when you just need raw text from PDFs.
- AnalyzeDocument: This goes deeper. It can identify key-value pairs, tables, and form data, making it ideal for processing structured documents like invoices, forms, or receipts.

There’s also an asynchronous API called StartDocumentTextDetection, which is useful for handling large or multi-page documents. It lets you kick off the processing and fetch the results later with GetDocumentTextDetection, without running into timeouts.

For this guide, we’ll keep it simple and use DetectDocumentText. Here’s the basic idea: you upload a PDF to an S3 bucket → a Lambda function (written in .NET) gets triggered → Textract processes the document and returns the extracted text → you can log or store it as needed.

It’s clean, scalable, and requires zero server management. You can read more about it in the official AWS documentation.

How AWS Textract Processes PDFs and Images

When you pass a PDF file or image to AWS Textract, it doesn’t just give you a block of plain text. Instead, it returns a structured response containing different types of elements — called Blocks. Each block represents a piece of the document like a word, line, table cell, or selection element (checkboxes, etc.).

Here’s a quick breakdown of how it works:

Input Sources: Textract can read a document from either of two places:
- An S3 bucket (most common in serverless scenarios)
- A byte stream (the raw file bytes held in memory; the AWS SDKs typically handle the base64 encoding for you)
Processing: Once the document is received, Textract analyzes it and breaks it down into blocks with metadata. For example, a line of text comes with coordinates, a confidence score, and the text content. Here is a trimmed sample response block (the geometry fields are omitted for brevity):
```
{
  "BlockType": "LINE",
  "Text": "Invoice #1234",
  "Confidence": 98.45
}
```
Block Types: Some of the most common block types:
- PAGE: Represents each page in the document
- LINE: A full line of detected text
- WORD: Individual words with bounding boxes
- TABLE, CELL: Used in AnalyzeDocument for structured tabular data
- KEY_VALUE_SET: For form field detection (key + value)
Limitations: The synchronous API (DetectDocumentText) only accepts single-page documents, so a PDF or TIFF passed to it must contain a single page. This is a key limitation to remember when working with multi-page PDFs. For multi-page documents, you’ll need to switch to the asynchronous API (StartDocumentTextDetection), which we’ll cover in a separate article.
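To make the block structure a bit more concrete, here’s a minimal sketch that walks a Textract-style response and keeps only the LINE text. The JSON below is hand-written for illustration (it is not real Textract output), and it’s parsed with System.Text.Json so the snippet runs without any AWS dependencies. The raw string literal assumes C# 11 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hand-written sample shaped like a trimmed Textract response (not real output).
var sampleResponse = """
{
  "Blocks": [
    { "BlockType": "PAGE" },
    { "BlockType": "LINE", "Text": "Invoice #1234", "Confidence": 98.45 },
    { "BlockType": "WORD", "Text": "Invoice", "Confidence": 99.10 },
    { "BlockType": "WORD", "Text": "#1234", "Confidence": 97.80 },
    { "BlockType": "LINE", "Text": "Total: $250.00", "Confidence": 96.20 }
  ]
}
""";

using var doc = JsonDocument.Parse(sampleResponse);

// Keep only LINE blocks; WORD blocks repeat the same text at a finer granularity.
var lines = new List<string>();
foreach (var block in doc.RootElement.GetProperty("Blocks").EnumerateArray())
{
    if (block.GetProperty("BlockType").GetString() == "LINE")
    {
        lines.Add(block.GetProperty("Text").GetString() ?? string.Empty);
    }
}

// Join the lines into one newline-separated string.
var extractedText = string.Join("\n", lines);
Console.WriteLine(extractedText);
```

Filtering on LINE avoids duplicated text, since every word also shows up inside its parent line.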

What We’ll Build

In this section, let’s break down the architecture for processing PDFs using AWS Lambda, Textract, and .NET. We’ll follow a simple flow that involves AWS S3, Lambda functions, and Textract for text extraction.

Here’s how it works:

- PDF Upload to S3: The first step is uploading a PDF document to an S3 bucket. The upload can come from users through your web or mobile app, or from other automated workflows. Once the PDF is uploaded, S3 emits an Event Notification.
- Lambda Function Trigger: Amazon S3 can trigger a Lambda function every time a new PDF is uploaded. The Lambda function is where we’ll write the logic to call AWS Textract and process the document. The event data from S3, which includes information about the uploaded file, is passed to the Lambda function as input.
- Calling AWS Textract: Inside the Lambda function, we’ll call the DetectDocumentText API of Textract to extract the text from the PDF. Textract will return a list of blocks containing all the detected text (words, lines, etc.).
- Processing the Textract Output: The output from Textract is a JSON response that contains the blocks of text detected. We can then loop through these blocks and extract the text content, or process the information further.
- Final Output: After processing the Textract output, the Lambda function can store the extracted text in a database (like DynamoDB or RDS), log it for further inspection, or send it to other AWS services for additional processing (such as Amazon Comprehend for sentiment analysis, or OpenSearch for indexing).

Here’s a simple diagram of the architecture:

+------------------+        +-------------------------+        +-------------------+
|   S3 Bucket      |   ---> |  AWS Lambda Function    |   ---> |   AWS Textract    |
| (PDF Upload)     |        |  (Triggered by S3 Event)|        |   (Extract Text)  |
+------------------+        +-------------------------+        +-------------------+
                                        |                                   |
                                        v                                   v
                              +-------------------+         +--------------------+
                              |  Store Extracted  |         |  Process Further   |
                              |      Text         |         |  (Store, Analyze)  |
                              +-------------------+         +--------------------+

This architecture is simple but powerful. With AWS Lambda and Textract, you get an event-driven, serverless solution for processing documents without worrying about scaling or managing infrastructure.

In the next section, we’ll dive into how to set up the .NET Lambda project to make this all work.

Prerequisites

Before getting started, make sure you have an active AWS account with access to key services like Amazon S3, Textract, Lambda, IAM, and CloudWatch. You’ll need to create an S3 bucket where documents (PDFs or images) will be uploaded for processing. Keep this bucket in the same region as your Lambda function and the Textract endpoint you call: S3 can only invoke a Lambda function in the bucket’s own region, and Textract expects the S3 object to live in the region where the request is made.

Next, configure an IAM Role for your Lambda function with the necessary permissions. At a minimum, it should have AmazonTextractFullAccess, AmazonS3ReadOnlyAccess (or scoped permissions to the specific bucket), and access to write logs to CloudWatch.

On your local development environment, ensure that the latest .NET SDK is installed. Additionally, install the required AWS SDK NuGet packages such as AWSSDK.Textract, Amazon.Lambda.S3Events and AWSSDK.S3 to interact with AWS services programmatically. While optional, having the AWS CLI installed and configured can significantly simplify testing and debugging during development.

Setting Up the .NET Lambda Project

Let’s build our AWS .NET Lambda first. Open up Visual Studio (with the AWS Toolkit for Visual Studio installed) and create a new Lambda C# project. For this demonstration I went with a simple Empty Function.


To this new Lambda project, install the following NuGet packages.

Install-Package AWSSDK.S3
Install-Package Amazon.Lambda.S3Events
Install-Package AWSSDK.Textract

And here is the Lambda Code.

using Amazon;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;
using Amazon.S3;
using Amazon.Textract;
using Amazon.Textract.Model;

// Assembly attribute to enable the Lambda function's JSON input to be converted into a .NET class.
[assembly: LambdaSerializer(typeof(Amazon.Lambda.Serialization.SystemTextJson.DefaultLambdaJsonSerializer))]

namespace TextractDemo;
public class Function
{
    private static readonly AmazonTextractClient textractClient = new(RegionEndpoint.USEast1);
    private static readonly AmazonS3Client s3Client = new(RegionEndpoint.USEast1);

    // This handler will be triggered by the S3 event when a PDF is uploaded
    public async Task FunctionHandler(S3Event s3Event, ILambdaContext context)
    {
        foreach (var record in s3Event.Records)
        {
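            // Extract the S3 bucket name and object key from the event.
            // S3 URL-encodes keys in event notifications (a space arrives as '+'),
            // so decode the key before handing it to Textract.
            var bucketName = record.S3.Bucket.Name;
            var objectKey = System.Net.WebUtility.UrlDecode(record.S3.Object.Key);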

            try
            {
                // Call Textract to extract text from the PDF
                var documentText = await ExtractTextFromPdf(bucketName, objectKey);

                // Here, you can process the extracted text further (e.g., store it in a database)
                Console.WriteLine($"Extracted Text: {documentText}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Error processing file: {ex.Message}");
            }
        }
    }

    // Call Textract to process the PDF and extract text
    private async Task<string> ExtractTextFromPdf(string bucketName, string objectKey)
    {
        Console.WriteLine($"Bucket: {bucketName}, ObjectKey: {objectKey}");

        var request = new DetectDocumentTextRequest
        {
            Document = new Document
            {
                S3Object = new Amazon.Textract.Model.S3Object
                {
                    Bucket = bucketName,
                    Name = objectKey
                }
            }
        };

        try
        {
            var response = await textractClient.DetectDocumentTextAsync(request);

            // Log a summary of the response for debugging
            Console.WriteLine("Textract Response:");
            Console.WriteLine("Response Blocks Count: " + response.Blocks.Count);

            // Print details of each block for debugging, including coordinates (bounding box)
            foreach (var block in response.Blocks)
            {
                Console.WriteLine($"Block ID: {block.Id}, Block Type: {block.BlockType}");

                // Only process LINE blocks
                if (block.BlockType == "LINE")
                {
                    Console.WriteLine($"Line Text: {block.Text}");

                    // Print coordinates (bounding box) of the line
                    if (block.Geometry != null)
                    {
                        var boundingBox = block.Geometry.BoundingBox;
                        Console.WriteLine($"Bounding Box - Top: {boundingBox.Top}, Left: {boundingBox.Left}, Width: {boundingBox.Width}, Height: {boundingBox.Height}");
                    }
                }
            }

            // Extract and combine all text blocks into one string
            var extractedText = string.Join("\n", response.Blocks
                .FindAll(b => b.BlockType == "LINE")
                .ConvertAll(b => b.Text));

            return extractedText;
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error calling Textract: {ex}");
            throw;
        }
    }
}

The FunctionHandler method serves as the entry point for the AWS Lambda function. It gets triggered automatically whenever a new object is uploaded to an Amazon S3 bucket. For each file upload event, it extracts the bucket name and object key (the file path), then calls a helper method ExtractTextFromPdf to process the file. If the text extraction succeeds, the function logs the extracted content. If there’s an error during processing, it catches the exception and logs the error message for troubleshooting.
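If you’re curious what the incoming event looks like, here’s a rough sketch that reads the bucket name and object key out of a trimmed S3 event notification. The JSON is hand-written for illustration (the key and size are made up), and it’s parsed with System.Text.Json rather than the Lambda S3Event type so it can run on its own. One detail worth knowing: S3 URL-encodes object keys in event notifications, so a key with spaces or parentheses needs decoding before you use it. The raw string literal assumes C# 11 or later.

```csharp
using System;
using System.Net;
using System.Text.Json;

// Hand-written sample shaped like a trimmed S3 event notification (not captured from AWS).
var sampleEvent = """
{
  "Records": [
    {
      "eventName": "ObjectCreated:Put",
      "s3": {
        "bucket": { "name": "cwm-textract-demo" },
        "object": { "key": "invoices/March+report+%281%29.pdf", "size": 48213 }
      }
    }
  ]
}
""";

using var doc = JsonDocument.Parse(sampleEvent);
var s3 = doc.RootElement.GetProperty("Records")[0].GetProperty("s3");

var bucketName = s3.GetProperty("bucket").GetProperty("name").GetString();
var rawKey = s3.GetProperty("object").GetProperty("key").GetString() ?? string.Empty;

// Decode the URL-encoded key: '+' becomes a space, %28 and %29 become parentheses.
var objectKey = WebUtility.UrlDecode(rawKey);

Console.WriteLine($"Bucket: {bucketName}");
Console.WriteLine($"Raw key: {rawKey}");
Console.WriteLine($"Decoded key: {objectKey}");
```

The decoded key is the one that matches the object’s actual name in the bucket, which is what Textract needs in order to find the file.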

The ExtractTextFromPdf method is responsible for interacting with Amazon Textract to perform OCR (Optical Character Recognition) on the uploaded document. It constructs a DetectDocumentTextRequest by pointing to the S3 object (bucket and key), and sends this request to Textract via the DetectDocumentTextAsync method. The method then processes the response, which includes a list of “Blocks”, each representing a piece of detected content such as a page, line, or word. It logs every block’s ID and type, and for LINE blocks it also logs the text along with its bounding box coordinates (position and size on the page). Finally, it keeps only the LINE blocks and combines their text into a single newline-separated string to return.
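Each block also carries the confidence score we saw in the sample response earlier. As one possible refinement (not something the Lambda above does), you could keep only the lines above a cut-off and flag the rest for manual review. The values and the 90.0 threshold below are made up for illustration, and plain tuples stand in for Textract blocks so the sketch runs by itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up (text, confidence) pairs standing in for LINE blocks; not real Textract output.
var detectedLines = new List<(string Text, double Confidence)>
{
    ("Invoice #1234", 98.45),
    ("Date: 2024-03-01", 95.10),
    ("T0tal: $25O.OO", 61.30),
};

// Assumed cut-off for illustration; tune it against your own documents.
const double minConfidence = 90.0;

// Lines at or above the cut-off are accepted as-is.
var accepted = detectedLines
    .Where(l => l.Confidence >= minConfidence)
    .Select(l => l.Text)
    .ToList();

// Everything else is set aside for a human to check.
var needsReview = detectedLines
    .Where(l => l.Confidence < minConfidence)
    .Select(l => l.Text)
    .ToList();

Console.WriteLine(string.Join("\n", accepted));
Console.WriteLine($"Lines flagged for review: {needsReview.Count}");
```

Low-confidence lines often come from smudged or skewed scans, so routing them to a review step tends to be safer than silently storing them.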

Two AWS clients are initialized at the class level: AmazonTextractClient for calling Textract APIs, and AmazonS3Client, which is not used yet but is there in case S3 operations are needed in the future. Both are configured to use the us-east-1 region, so change this if your bucket lives elsewhere. The logging throughout the function helps in understanding what Textract is returning and debugging any potential issues with document processing.

That’s it. You can publish this Lambda to AWS by right-clicking the Lambda project and selecting Publish to AWS Lambda. For the permissions, I have selected the Basic Execution Role for now. Later in this article, we will add the additional required permissions.


Setting up the S3 Bucket & Triggers

I have created a new bucket named cwm-textract-demo in the us-east-1 region.

Let’s now configure the S3 Bucket Event Notification to trigger the Lambda. Navigate to the Properties tab of the S3 Bucket, and scroll down to Event Notifications. Click on Create event notification.


Give the notification a name, choose the object creation events (for example, All object create events) as the event types, and under the Destination section select your newly created Lambda function. From then on, whenever a file is uploaded to this S3 bucket, an event of type S3Event is passed to the selected Lambda function. This event contains a set of records that carry the metadata of the uploaded S3 object.
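Keep in mind that the trigger fires for any object that lands in the bucket, not just documents. S3 event notifications let you narrow this down with prefix and suffix filters, but a small guard in code is another option. Here’s a minimal sketch of such a check; IsSupportedDocument is a hypothetical helper (it isn’t part of the Lambda above), and the sample keys are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// File extensions for the formats Textract accepts: JPEG, PNG, PDF, and TIFF.
var supportedExtensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"
};

// Hypothetical helper: true when the object key ends in a supported extension.
bool IsSupportedDocument(string objectKey) =>
    supportedExtensions.Contains(Path.GetExtension(objectKey));

// Made-up object keys to show which uploads would be processed.
foreach (var key in new[] { "invoices/march.pdf", "scans/receipt.JPG", "notes/readme.txt", "uploads/" })
{
    Console.WriteLine($"{key} -> {(IsSupportedDocument(key) ? "process" : "skip")}");
}
```

Skipping unsupported files early saves a Textract call that would fail anyway and keeps the logs free of avoidable errors.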

From here on, the Lambda would handle the flow. But before that, we need to give the necessary permissions to the Lambda Function to access the object on S3 Bucket, as well as to use the Textract API.

Lambda Permissions

Open up the Lambda Function, and navigate to the Configuration tab, and then Permissions. Here, select the assigned role and open it up on AWS IAM.

The following are the permissions I have assigned to my IAM role. Since this is only a demo and I will be deleting these resources right away, I have given the Lambda broader permissions than it needs. In a real deployment, grant only the permissions the function actually requires.

[Screenshot: permissions attached to the Lambda’s IAM role]
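As a rough idea of what “only the required permissions” could look like for this demo, here’s a sketch of a tighter policy. It assumes the bucket is named cwm-textract-demo, that the function only calls DetectDocumentText, and that CloudWatch Logs access is already covered by the basic execution role. The Sid values are just labels I made up.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadUploadedDocuments",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::cwm-textract-demo/*"
    },
    {
      "Sid": "DetectText",
      "Effect": "Allow",
      "Action": "textract:DetectDocumentText",
      "Resource": "*"
    }
  ]
}
```

The s3:GetObject statement matters because Textract reads the document from S3 using the caller’s permissions, so the Lambda role itself needs read access to the object.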

Testing & CloudWatch Logs

With everything set up—Lambda deployed, S3 bucket created, permissions configured—it’s time to test the end-to-end pipeline.

To test, simply upload an image or PDF file with some visible text into your configured S3 bucket (in our case, cwm-textract-demo). This should automatically trigger the Lambda function.

If everything is configured correctly, you’ll start seeing logs in CloudWatch Logs under the corresponding Lambda log group.

[Screenshots: CloudWatch log output for the Lambda function]

In this first screenshot, each detected line of text is printed out along with its bounding box coordinates (top, left, width, and height). These values are incredibly useful if you ever plan to overlay text visually or analyze spatial document structure.

In the second screenshot, we see the final combined output — a clean string of all extracted lines, concatenated for easier readability or storage.

This confirms that our pipeline—from S3 upload to text extraction with Textract via Lambda—is working as intended.
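A quick note on those bounding box values: Textract reports them as ratios of the page width and height, not as pixels. If you ever want to draw a box over the original image, you’d scale them by the page size. Here’s a minimal sketch of that conversion; the page size and box values are made up, and ToPixels is a hypothetical helper rather than anything from the AWS SDK.

```csharp
using System;

// Assumed page size in pixels for illustration, e.g. a scan rendered at 1700 x 2200.
const int pageWidthPx = 1700;
const int pageHeightPx = 2200;

// Made-up bounding box in ratio form (each value is a fraction of the page).
var box = (Left: 0.10, Top: 0.05, Width: 0.40, Height: 0.02);

// Hypothetical helper: scale ratio-based values to pixel coordinates.
(int X, int Y, int Width, int Height) ToPixels(
    (double Left, double Top, double Width, double Height) b, int pageWidth, int pageHeight) =>
    ((int)Math.Round(b.Left * pageWidth),
     (int)Math.Round(b.Top * pageHeight),
     (int)Math.Round(b.Width * pageWidth),
     (int)Math.Round(b.Height * pageHeight));

var rect = ToPixels(box, pageWidthPx, pageHeightPx);
Console.WriteLine($"x={rect.X}, y={rect.Y}, width={rect.Width}, height={rect.Height}");
```

Horizontal values scale with the page width and vertical values with the page height, so the same ratios work for any rendering resolution of the page.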

Textract Pricing

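AWS Textract pricing is per page, and the rate depends on which API you call and how many pages you process each month. Plain text detection (DetectDocumentText) is the cheapest option, while AnalyzeDocument features such as tables and forms cost more per page. Check the Textract pricing page for the current rates in your region.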

Free Tier

AWS offers a free tier for the first 3 months after you start using Textract, which covers 1,000 pages per month for the Detect Document Text API used in this guide. That is plenty for initial testing and small-scale use cases.

Conclusion

We’ve now built a fully functional, serverless pipeline for extracting text from uploaded images or PDF files using AWS Textract, Lambda, and .NET. With minimal infrastructure overhead and a simple event-driven setup, this solution can be easily deployed in any cloud-native application that needs OCR or text extraction capabilities.

The current architecture handles synchronous text extraction well, making it suitable for lightweight documents and real-time use cases. It’s highly cost-efficient due to its on-demand nature and doesn’t require any long-running compute resources or background workers.

There’s also plenty of room for extension. You could enhance it to support asynchronous processing for multi-page documents, integrate the extracted text with search engines like OpenSearch, or even run semantic analysis using tools like Amazon Comprehend. EventBridge or Step Functions can also help build more complex workflows around the extracted data.

Overall, this setup provides a scalable, production-ready foundation for document automation or data intelligence pipelines—all with just a few services wired together.