
Parsing a receipt: extracting text from a receipt image

November 4, 2022 | 01:44 AM

Drunk and angry: the motivation

A few weeks ago, I was at a bar with my coworkers from Fintoc discussing how uncomfortable the process of actually paying for the bill is. Typically bars don’t allow tables to split the bill between every participant, so what ends up happening is that one person pays the whole bill and then has to somehow charge the rest of the participants.

As we all know, this is a huge PITA. When you pay for the bill, the rest of the participants usually mumble something vague like “send me a picture of the receipt I will transfer you the money immediately” or some other form of the same 🐂shit. But you know that most of them will have forgotten about the damn receipt picture as soon as they get into their cars. Idiots.

At Fintoc, we use a different strategy: whoever pays the bill writes a Google Sheet that contains every item on the bill as rows and every participant as columns. This Google Sheet then gets sent to the rest of the participants, so all they have to do is find their column, write the amount that they consumed of each item on its corresponding row and get the amount they owe directly on the Google Sheet. The idea is to make it so simple for everyone that no one has any excuse not to pay.

As you can imagine, this works! But the participant that actually pays for the bill has to manually write down each one of the items consumed at the bar on the Google Sheet (this can sometimes be near 30 different items). No wonder no one ever wants to pay!

So that’s how, one night coming back from the bar, I decided to give it a shot. Bear in mind, I didn’t (and still fundamentally don’t) know anything about image processing or image-to-text recognition, so everything I talk about here is the result of me trying to make my friends (and me) have a better time when paying the complete bill (also, it was 1 AM).

The Application

Disclaimer: This post won’t really talk much about the application itself. More of that is coming soon…

Obviously, what I needed to do was to write a complete application. This application needed the following characteristics:

  1. The participant actually paying the bill simply uploads the receipt; the application must read the receipt image and create a URL for the rest of the participants with the detected items.
  2. Once participants open the generated URL, they can select the items they consumed directly on the UI.
  3. Finally, participants can see the total amount they owe directly on the UI.

But in order for me to build this application (which, by the way, I already built), I needed a way to first transform the receipt image into the items used by the application. Since I didn’t really find anything remotely resembling what I was looking for (and since I really love to learn about technology), I decided to write the module myself.

The Receipt Scanner

Disclaimer: Everything I talk about here I learned empirically, so I most certainly am wrong about most of it. If you see an error or a way in which I can improve the algorithm, please contact me! 💖

Enter: receipt-scanner. The idea was simple: write a basic module that receives an image path (or URL), processes said image and retrieves the text found on the receipt. Easier said than done, I found.

To achieve the goal of creating a receipt scanner, I discovered that there are roughly 3 steps involved in reading the text from a receipt image:

  1. Finding the borders
  2. Processing the image
  3. Actually reading the text

Throughout this section, I will show the incremental changes that some filters have over a receipt image example. I will be using the following image as the base:

Original receipt image
Notice that this image has been massively resized and compressed in order to be loaded faster on this blog, so using it as a base for text extraction might not yield the expected results

Finding the borders

To find the borders of the receipt, you first have to play with some magic over the image.

First, I apply a black border around the image. This is a hacky trick that makes it possible to detect receipts that don’t fit completely inside the frame; without it, one of the receipt’s edges wouldn’t be detected and no rectangle would be found.

Receipt not showing completely on the image
An example of an image where the hacky black border is needed. Notice how the top border of the receipt isn't visible on the image, so it wouldn't be captured by the algorithm

Then, you need to compress and resize the image. Handling a large image can unnecessarily eat up the RAM on your server (you don’t need much detail to find the largest rectangular border), and too much detail also makes it harder to accurately find borders (letters get confused with borders, for example). Since you are going to remove detail later anyway, you might as well compress and resize.

After resizing, you can proceed to soften the image as much as possible. My implementation starts by applying a morphological closing operation, which removes things like text and textures from the receipt. After that, I apply some blurs over the image, followed by the Canny edge detection algorithm. Finally, I apply a dilation filter so that edges that are almost touching get connected.

Morphological closing operation applied
Image after the morphological closing operation has been applied
Blurs applied
Image after the blurs have been applied
Canny filter applied
Image after the Canny filter has been applied
Dilation operation applied
Image after the dilation operation has been applied

Once the image is all chewed up, I find the best contour for the receipt.

Image with every contour outlined
Image with every contour found outlined
Image with the best contour outlined and straightened
Image with the best contour found outlined and straightened

The code that executes each of the steps mentioned above looks something like this:

original_image = open_image(file_name)

chewed_image = Filter.apply(
    original_image,
    CompressFilter(),
    ResizeFilter(EDGE_DETECTION_TARGET_WIDTH),
    MorphologicalCloseFilter(iterations=4),
    MedianBlurFilter(),
    GaussianBlurFilter(size=3),
    CannyFilter(),
    DilateFilter(),
)

contour = find_contour(chewed_image)

Tip: notice how abstracted every filter is. You can read the implementation of each filter (or the custom-made filter system, for that matter) on the receipt-scanner repository.

Processing the image

Once I (hopefully) find the contour on the chewed image, the original image is ready to be processed.

I start by warping the perspective of the original image (like stretching it; think CamScanner). In this process, I take the contour found on the chewed image, project it over the original image, cut that portion out and transform it into a perfect rectangle.

Image of the final contour projected to the original image and its perspective warped
This image in particular was already quite a good rectangle to start with, but even here you can see that the image got a bit deskewed

Once the perspective has been warped, I resize the resulting image to a fixed target width. This helps a bit with small images, and doesn’t really hurt huge images, so I can save some RAM for those receipts.

With my resized image, I proceed to apply a median blur, denoise, and then apply a Gaussian blur (this was the order of application that behaved best during my not-so-thorough tests).

Blurs and denoising applied
Image after the blurs and denoising have been applied. It might look a bit weird to blur the image, but it helps with the binarization

Finally, I convert the image to grayscale and binarize it (which means that any pixel above a threshold gets transformed to pure white and every other pixel gets transformed to pure black).

Transformation to black and white applied
Image after the color transformation to black and white has been applied
Binarization applied
Image after the binarization has been applied

The code that executes each of the steps mentioned above looks something like this:

processed_image = Filter.apply(
    original_image,
    PerspectiveWrapperFilter(contour),
    ResizeFilter(TEXT_CLEANUP_TARGET_WIDTH),
    MedianBlurFilter(),
    DenoiseFilter(),
    GaussianBlurFilter(),
    GrayscaleFilter(),
    BinarizeFilter(),
)

Reading the text

The last part is so simple it might sound like a joke, but reading the text from the image is actually something of a “solved problem” (not really, but the solutions are pretty awesome).

This part (to the extent of my knowledge) needs to be executed by an ML algorithm. Luckily, a very robust one already exists: Tesseract. So the last bit of code actually looks like this:

import pytesseract

text = pytesseract.image_to_string(processed_image, config="--psm 4 -l spa+eng")

Nothing fancy; just using the available tools will do. What I found is that Tesseract is very grumpy, and needs to receive an almost perfectly processed image to work OK.
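Tesseract only returns plain text, so the application still has to split that text into items. A naive, hypothetical line parser (not part of receipt-scanner; the pattern is an assumption about how receipt lines look) could start like this:

```python
import re

# A line is an "item" if it ends in something price-shaped
LINE_PATTERN = re.compile(r"^(?P<name>.+?)\s+\$?(?P<price>[\d.,]+)$")


def parse_items(receipt_text: str) -> list[tuple[str, str]]:
    """Extract (description, price) pairs from Tesseract's raw output."""
    items = []
    for line in receipt_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            items.append((match.group("name"), match.group("price")))
    return items
```

Real receipts need more care than this (quantities, subtotals, OCR noise), but it shows the general shape of the text-to-items step.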

Closing

Oh boy did I learn a thing or two about image processing and using Tesseract. But I was able to learn only because of the awesome resources I found online on the subject (countless blogs and how-to’s about many different pieces of the algorithm).

Because I had to work so much to be able to write this module, I released it under the MIT license and uploaded it to PyPI, so you can use it for your projects too!

I invite you to try Split (the application I wrote using the Receipt Scanner) and to write your own tools using receipt-scanner! I enjoyed the journey, and hope to continue learning about this subject 💖.



If you want to talk to me about anything I mentioned on this post or if you simply want to chat, contact me on one of my socials!