CloudTextract - Text and Data Extractor

FAQs

Got questions? We have all the answers for you.

General Information

What is Amazon Textract?

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

What are the most common use cases for Amazon Textract?

The most common use cases for Amazon Textract include:

Import Documents and Forms into Business Applications
Create Smart Search Indexes
Build Automated Document Processing Workflows
Maintain Compliance in Document Archives
Extract Text for Natural Language Processing (NLP)
Extract Text for Document Classification

What type of text can Amazon Textract detect and extract?

Amazon Textract can detect printed text and handwriting in the standard English alphabet and ASCII symbols. It can also extract printed text in Spanish, Italian, French, Portuguese, and German. In addition, Amazon Textract extracts explicitly labeled data, implied data, and line items from an itemized list of goods or services on almost any invoice or receipt, without any templates or configuration. For example, customers can use Amazon Textract to extract the vendor name from the Amazon logo at the top of an invoice even though the invoice has no explicit label such as “Vendor: Amazon”. Similarly, if a table of line items does not include column headers, Amazon Textract infers the headers from the table content.
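To illustrate the difference between labeled and implied data, the sketch below reads fields out of an Analyze Expense style result. The sample dictionary is hand-written for this example and is not real service output, and the helper name find_summary_field is hypothetical. The key names (ExpenseDocuments, SummaryFields, Type, LabelDetection, ValueDetection) follow the AnalyzeExpense response shape as we understand it; check them against the current API reference before relying on them.

```python
from typing import Optional

# Hand-written sample shaped like an AnalyzeExpense response (assumption:
# trimmed to the keys used here; real responses carry many more).
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {
                    # Assumption: an implied field (e.g. vendor read from a
                    # logo) carries a value but no label text.
                    "Type": {"Text": "VENDOR_NAME"},
                    "ValueDetection": {"Text": "Amazon"},
                },
                {
                    # An explicitly labeled field carries both label and value.
                    "Type": {"Text": "TOTAL"},
                    "LabelDetection": {"Text": "Total due"},
                    "ValueDetection": {"Text": "$42.10"},
                },
            ]
        }
    ]
}


def find_summary_field(response: dict, field_type: str) -> Optional[dict]:
    """Return the first summary field of the given type, or None if absent."""
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            if field.get("Type", {}).get("Text") == field_type:
                return {
                    "value": field.get("ValueDetection", {}).get("Text"),
                    "label": field.get("LabelDetection", {}).get("Text"),
                }
    return None


if __name__ == "__main__":
    print(find_summary_field(sample_response, "VENDOR_NAME"))
    print(find_summary_field(sample_response, "TOTAL"))
```

A field whose label comes back as None here corresponds to data that was inferred rather than read from a printed label.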

What document formats does Amazon Textract support?

Amazon Textract currently supports PNG, JPEG, and PDF formats. For synchronous APIs, you can submit images either as an Amazon S3 object or as a byte array. For asynchronous APIs, you can submit S3 objects. If your document is already in one of the supported file formats (PDF, JPEG, PNG), don't convert or downsample it before uploading it to Amazon Textract.
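The two submission styles can be sketched in Python with the AWS SDK (boto3). This is a minimal sketch, not a complete integration: it assumes boto3 is installed and AWS credentials and a region are configured, and the helper names are hypothetical. The SDK calls used are detect_document_text (synchronous) and start_document_text_detection (asynchronous).

```python
def document_from_bytes(image_bytes: bytes) -> dict:
    """Document argument for a synchronous call that passes a byte array."""
    return {"Bytes": image_bytes}


def document_from_s3(bucket: str, key: str) -> dict:
    """Document argument that points at an object already stored in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def detect_text_sync(document: dict) -> dict:
    """Synchronous text detection; accepts either document shape above."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("textract")
    return client.detect_document_text(Document=document)


def start_text_detection_async(bucket: str, key: str) -> str:
    """Asynchronous text detection; the input must be an S3 object."""
    import boto3

    client = boto3.client("textract")
    response = client.start_document_text_detection(
        DocumentLocation=document_from_s3(bucket, key)
    )
    # The job ID is later passed to get_document_text_detection for results.
    return response["JobId"]
```

For example, detect_text_sync(document_from_bytes(open("scan.png", "rb").read())) would send a local image directly, while the asynchronous helper only needs a bucket and key.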

How can I get the best results from Amazon Textract?

Amazon Textract uses machine learning to read virtually any type of document and extract printed text, handwriting, and structured information. Keep the following tips in mind to get the best results:
• Make sure your document uses a language supported by Amazon Textract: currently English, Spanish, Italian, Portuguese, French, and German. Handwriting, invoice, and receipt processing is available for English only.
• Provide as high quality an image as you can, ideally at least 150 DPI.
• If your document is already in one of the file formats that Amazon Textract supports (PDF, JPG, PNG), don't convert or downsample it before uploading it to Amazon Textract.
• Amazon Textract's table feature works best when the tables in your document are visually separated from surrounding elements on the page (e.g. not overlaid on an image or complex pattern), and the text within the table is upright (e.g. not rotated relative to other text on the page).
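
The format and resolution tips above can be turned into a simple pre-upload check. The sketch below is illustrative only: the function name preflight is hypothetical, the file type is judged from the extension alone, and the DPI is estimated from the pixel width and the physical page width that the caller supplies.

```python
import os

SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".pdf"}
MIN_DPI = 150  # the "ideally at least 150 DPI" guideline


def preflight(filename: str, width_px: int, width_inches: float) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    if width_inches <= 0:
        raise ValueError("width_inches must be positive")

    problems = []

    # Check the format by extension only (assumption: extension is accurate).
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")

    # Estimate resolution as pixels per inch across the page width.
    dpi = width_px / width_inches
    if dpi < MIN_DPI:
        problems.append(f"resolution {dpi:.0f} DPI is below {MIN_DPI} DPI")

    return problems


if __name__ == "__main__":
    print(preflight("scan.png", 2550, 8.5))   # letter-width page at 300 DPI
    print(preflight("scan.tiff", 850, 8.5))   # wrong format and low resolution
```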

Billing Information

How does Amazon Textract count the number of pages processed?

An image (PNG or JPEG) counts as a single page. For PDFs, each page in the document is counted as a page processed.
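
As a worked example of this counting rule, the sketch below totals the pages for a mixed batch. The function names are hypothetical, and the PDF page counts are supplied by the caller rather than read from the files.

```python
import os

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def pages_processed(filename: str, pdf_page_count: int = 0) -> int:
    """Pages counted for one file: 1 per image, every page for a PDF."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return 1
    if extension == ".pdf":
        return pdf_page_count
    raise ValueError(f"unsupported format: {filename}")


def batch_pages(batch: list) -> int:
    """Total pages for a list of (filename, pdf_page_count) pairs."""
    return sum(pages_processed(name, count) for name, count in batch)


if __name__ == "__main__":
    # Two images (1 page each) plus one 12-page PDF.
    print(batch_pages([("a.png", 0), ("b.pdf", 12), ("c.jpeg", 0)]))
```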

How much does Amazon Textract cost?

Amazon Textract charges you based on the number of pages and images processed. For more information, visit the pricing page.

Does Amazon Textract participate in the AWS Free Tier?

Yes. As part of the AWS Free Usage Tier, you can get started with Amazon Textract for free. For the first three months, new customers can analyze up to 1,000 pages per month using the Detect Document Text API, and up to 100 pages per month each using the Analyze Document API and the Analyze Expense API.
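
The allowances above can be expressed as a small calculation of how many pages in a month fall outside the free tier. This is a sketch under stated assumptions: the dictionary keys are labels chosen for this example, months are counted from sign-up starting at 0, and no prices are included because they are listed only on the pricing page.

```python
# Free pages per month during the free-tier period, per API.
FREE_PAGES_PER_MONTH = {
    "DetectDocumentText": 1000,
    "AnalyzeDocument": 100,
    "AnalyzeExpense": 100,
}
FREE_TIER_MONTHS = 3  # the allowance applies for the first three months


def pages_beyond_free_tier(api: str, pages_this_month: int,
                           months_since_signup: int) -> int:
    """Pages in this month that are not covered by the free allowance.

    months_since_signup is 0 for the first month (assumption of this sketch).
    """
    if api not in FREE_PAGES_PER_MONTH:
        raise ValueError(f"unknown API: {api}")
    in_free_period = months_since_signup < FREE_TIER_MONTHS
    allowance = FREE_PAGES_PER_MONTH[api] if in_free_period else 0
    return max(0, pages_this_month - allowance)


if __name__ == "__main__":
    print(pages_beyond_free_tier("DetectDocumentText", 1500, 0))
    print(pages_beyond_free_tier("AnalyzeExpense", 80, 2))
    print(pages_beyond_free_tier("AnalyzeDocument", 80, 3))
```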