Image by Vincent van Dam

Solving Sudoku the hard way

Sudokus are fun! For this reason, we included one in one of our traditional Easter puzzles, and we have even made nostalgic 8-bit video games of them in the past. If you enjoy solving them the old-fashioned way on a hard copy, it’s recommended to have a pencil handy, as you might need to backtrack when faced with multiple options.

A couple of years ago, I didn’t have a pencil nearby and used a pen instead. Wrong choice! I ended up in a situation where I had to guess, and my pen made that guess final. The good thing about guessing in sudokus is that you can progress quite quickly; the bad thing is that you only see the consequences of your guess at the very last moment. And that’s exactly what happened to me: I had to guess, and by the time my error became obvious, it was too late to go back. This frustrated the hell out of me, and I felt I needed to compensate.

What would be the best way to compensate? From a user’s perspective: take a picture of the sudoku, parse it, and solve it. That would be perfect! Considering this is not the most original idea and has been done before, why not raise the bar and implement it in an unconventional AI stack, such as Go! That will surely compensate enough for the stupidity of using a pen instead of a pencil. If you are impatient, the end result of this effort can be found here.

It takes three steps to solve a sudoku from a picture. The initial step involves identifying the sudoku grid in the picture. Afterward, you translate the numbers in each cell of the picture into a digital format that we can use to solve the sudoku. Finally, you solve the sudoku and present the solution to the user.

Find the focus area

With OpenCV it is ‘easy’ to perform the first step. OpenCV is a library that can be used to detect shapes and identify objects within images, amongst other more traditional image-processing features. The perfect toolbox to detect and dissect a sudoku image! In our case, we need to detect the biggest square in the image, assuming the picture’s main object is the sudoku. In OpenCV vocabulary this means finding the biggest contour within the image and validating that what you found could be considered a square (meaning the contour should have 4 sides). We can then use OpenCV to transpose the warped square onto a fixed square, as well as resample the image to smaller dimensions. Assuming this is a sudoku, we divide the box into 9x9 cells and proceed to the next step.
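
To make this concrete, here is a minimal sketch of how the grid detection could look in Go using the gocv bindings for OpenCV. The flow (threshold, find contours, keep the biggest 4-sided one, warp it onto a fixed square) follows the description above; the function name findSudokuGrid, the 450x450 target size, and the omitted corner ordering are illustrative assumptions, not the exact code from the repository.

```go
package main

import (
	"image"

	"gocv.io/x/gocv"
)

// findSudokuGrid looks for the biggest 4-sided contour in the image and
// warps it onto a fixed 450x450 square. Error handling for "no square
// found" is omitted for brevity.
func findSudokuGrid(img gocv.Mat) gocv.Mat {
	// Grayscale, blur, and threshold so the grid lines stand out.
	gray := gocv.NewMat()
	defer gray.Close()
	gocv.CvtColor(img, &gray, gocv.ColorBGRToGray)
	gocv.GaussianBlur(gray, &gray, image.Pt(5, 5), 0, 0, gocv.BorderDefault)
	gocv.AdaptiveThreshold(gray, &gray, 255,
		gocv.AdaptiveThresholdMean, gocv.ThresholdBinaryInv, 11, 2)

	// Find the biggest contour that approximates to 4 corners.
	contours := gocv.FindContours(gray, gocv.RetrievalExternal, gocv.ChainApproxSimple)
	defer contours.Close()
	var best gocv.PointVector
	bestArea := 0.0
	for i := 0; i < contours.Size(); i++ {
		c := contours.At(i)
		if area := gocv.ContourArea(c); area > bestArea {
			approx := gocv.ApproxPolyDP(c, 0.02*gocv.ArcLength(c, true), true)
			if approx.Size() == 4 { // 4 sides: plausibly our sudoku
				best, bestArea = approx, area
			}
		}
	}

	// Transpose the warped square onto a fixed square. In practice the
	// four corners must first be ordered to match dst; omitted here.
	src := gocv.NewPointVectorFromPoints(best.ToPoints())
	dst := gocv.NewPointVectorFromPoints([]image.Point{
		{0, 0}, {450, 0}, {450, 450}, {0, 450},
	})
	m := gocv.GetPerspectiveTransform(src, dst)
	defer m.Close()
	out := gocv.NewMat()
	gocv.WarpPerspective(img, &out, m, image.Pt(450, 450))
	return out
}
```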

The next step is the most challenging part of the project. How do we detect which numbers are inside each of the cells? Obviously, this can be done with OCR libraries such as Tesseract. However, considering the seemingly small scope of the problem (number recognition), why not train a network that can detect these numbers instead? Since we decided to use Golang, there are not that many libraries that will help with this problem. The best available option seems to be Gorgonia.

Decode the numbers inside the cells

One of the classic ‘Hello World’ machine learning examples is detecting handwritten digits from a dataset called MNIST. This dataset is often used to test different machine learning approaches and their effectiveness. Since we want to decode the numbers inside the cells, this is a good starting point. Without going into detail on neural network architectures, the best performance on image recognition is typically achieved with convolutional neural networks (CNNs). The Gorgonia framework also includes an MNIST example that is based on a CNN architecture.

Using the MNIST example Gorgonia provides, we have our network architecture available. The next step is to train the network, so we can actually use it to get the desired output. In our case we want to train a classifier: given a certain input, we want to know how likely each possible output is. To do this, the inputs of the network represent the pixels inside the image, and we have a vector of 10 outputs that represent the digits 0 to 9. Once the network is trained, each value inside the output vector will be a number between 0 and 1 indicating how likely the input is to be that digit.
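
As a rough illustration, a compressed version of such a classifier graph could look like the sketch below, loosely in the spirit of Gorgonia’s convnet example: one convolution layer instead of several, illustrative layer sizes, and a SoftMax at the end to turn the 10 outputs into values between 0 and 1.

```go
package main

import (
	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// buildNet wires input pixels to a 10-way SoftMax output and also
// returns the trainable weights. Layer sizes are illustrative.
func buildNet(g *gorgonia.ExprGraph, batch int) (x, out *gorgonia.Node, params gorgonia.Nodes) {
	// Input: a batch of 28x28 grayscale cell images.
	x = gorgonia.NewTensor(g, tensor.Float64, 4,
		gorgonia.WithShape(batch, 1, 28, 28), gorgonia.WithName("x"))

	// 3x3 convolution with 32 filters, ReLU, then 2x2 max-pooling.
	w0 := gorgonia.NewTensor(g, tensor.Float64, 4,
		gorgonia.WithShape(32, 1, 3, 3), gorgonia.WithName("w0"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	c0 := gorgonia.Must(gorgonia.Conv2d(x, w0, tensor.Shape{3, 3},
		[]int{1, 1}, []int{1, 1}, []int{1, 1}))
	a0 := gorgonia.Must(gorgonia.Rectify(c0))
	p0 := gorgonia.Must(gorgonia.MaxPool2D(a0, tensor.Shape{2, 2},
		[]int{0, 0}, []int{2, 2}))

	// Flatten and map to the 10 digit classes.
	flat := gorgonia.Must(gorgonia.Reshape(p0, tensor.Shape{batch, 32 * 14 * 14}))
	w1 := gorgonia.NewMatrix(g, tensor.Float64,
		gorgonia.WithShape(32*14*14, 10), gorgonia.WithName("w1"),
		gorgonia.WithInit(gorgonia.GlorotN(1.0)))
	dense := gorgonia.Must(gorgonia.Mul(flat, w1))

	// SoftMax turns the 10 outputs into values between 0 and 1.
	out = gorgonia.Must(gorgonia.SoftMax(dense))
	return x, out, gorgonia.Nodes{w0, w1}
}
```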

With the network in place, we need to train it first. Since we have the input data (the MNIST dataset) and know the desired outputs, we can use supervised learning as opposed to unsupervised learning. The difference between the two is that with supervised learning you train on known outputs, while with unsupervised learning you don’t feed the network any expected output (only input). A third form of learning also exists, known as reinforcement learning. In that scenario, the model receives a reward for success and incurs a penalty for failure, much like humans learn not to make the same mistake twice.
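
A supervised training loop with Gorgonia could then look roughly like the sketch below: feed a batch with its known one-hot labels, compute a cross-entropy loss between prediction and label, derive gradients, and let a solver update the weights. The train function and batch plumbing are assumptions building on the buildNet sketch above.

```go
package main

import (
	"log"

	"gorgonia.org/gorgonia"
	"gorgonia.org/tensor"
)

// train runs a simple supervised loop over a single prepared batch;
// a real implementation would iterate over many shuffled batches.
func train(inputBatch, labelBatch tensor.Tensor, batch, epochs int) error {
	g := gorgonia.NewGraph()
	x, out, params := buildNet(g, batch) // from the previous sketch

	// The known outputs that make this supervised: one-hot labels.
	y := gorgonia.NewMatrix(g, tensor.Float64,
		gorgonia.WithShape(batch, 10), gorgonia.WithName("y"))

	// Cross-entropy loss: -mean(y * log(out)).
	ce := gorgonia.Must(gorgonia.HadamardProd(y, gorgonia.Must(gorgonia.Log(out))))
	cost := gorgonia.Must(gorgonia.Neg(gorgonia.Must(gorgonia.Mean(ce))))

	if _, err := gorgonia.Grad(cost, params...); err != nil {
		return err
	}

	vm := gorgonia.NewTapeMachine(g, gorgonia.BindDualValues(params...))
	defer vm.Close()
	solver := gorgonia.NewRMSPropSolver()

	for epoch := 0; epoch < epochs; epoch++ {
		gorgonia.Let(x, inputBatch) // images
		gorgonia.Let(y, labelBatch) // matching expected outputs
		if err := vm.RunAll(); err != nil {
			return err
		}
		if err := solver.Step(gorgonia.NodesToValueGrads(params)); err != nil {
			return err
		}
		vm.Reset()
		log.Printf("epoch %d done", epoch)
	}
	return nil
}
```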

Not the right data

When we train the network on the MNIST dataset and try it out on real-world sudokus, the performance is low. The model seems to confuse certain numbers (like 1s and 7s). Increasing the number of training cycles (epochs) does not improve the results either. It could be that we are overfitting or underfitting our network and need to adjust the network architecture. However, examining the data we used for training reveals a more fundamental mismatch: a dataset of handwritten numbers doesn’t align with the use case. We want to detect printed numbers, not handwritten ones. Our network might be good at recognising the MNIST data, but that’s not the data we want to detect.

To improve the classifier’s performance, we can create our own dataset that better represents real-world data. In this case, the dataset would be a labeled set of images with numbers, making sure these numbers are rendered in fonts commonly used for print. To make detection more robust, we can rotate the images a bit, vary the font sizes, and add some random noise. This tweaking process takes time and is based on experimentation: changing datasets, changing network architectures and their characteristics. Unfortunately, unlike the Python data science ecosystem, Go has no tooling that invites you to do this kind of experimentation; a toolchain like Jupyter greatly helps in this exploration/experimentation phase.
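
As a sketch, generating one such synthetic sample could look like this, using Go’s standard image packages. The fixed basicfont face is a stand-in for real print fonts (which you could load with the freetype package), the jitter and noise values are arbitrary illustrations, and rotation is left out.

```go
package main

import (
	"image"
	"image/color"
	"math/rand"

	"golang.org/x/image/font"
	"golang.org/x/image/font/basicfont"
	"golang.org/x/image/math/fixed"
)

// renderDigit draws digit d into a 28x28 grayscale image with a small
// random offset and some random noise pixels.
func renderDigit(d int, rng *rand.Rand) *image.Gray {
	img := image.NewGray(image.Rect(0, 0, 28, 28)) // starts all black

	// Jitter the position so the network doesn't learn a fixed location.
	drawer := &font.Drawer{
		Dst:  img,
		Src:  image.White,
		Face: basicfont.Face7x13, // stand-in for common print fonts
		Dot: fixed.Point26_6{
			X: fixed.I(10 + rng.Intn(5)),
			Y: fixed.I(16 + rng.Intn(5)),
		},
	}
	drawer.DrawString(string(rune('0' + d)))

	// Sprinkle random noise pixels over the sample.
	for i := 0; i < 20; i++ {
		img.SetGray(rng.Intn(28), rng.Intn(28), color.Gray{Y: uint8(rng.Intn(256))})
	}
	return img
}
```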

Inference

With the pre-trained network, we can decode images into a data structure that represents the sudoku. Using a trained network to obtain a result is called inference. In the implementation that comes with this blog, the inference is simply a web service that takes an image as input and returns the solved sudoku. Once we have the data structure containing the actual sudoku, solving it becomes easy. There are many algorithmic approaches, and considering the enormous number of possible sudokus, some are more efficient than others. It goes beyond the scope of this blog to provide the best solution; the example code uses backtracking, which is good enough for most situations.
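
For illustration, a straightforward backtracking solver fits in a few dozen lines of Go: find an empty cell, try each digit that doesn’t conflict with its row, column, or 3x3 box, recurse, and undo the guess on failure. This is a generic sketch of the technique, not necessarily the exact code from the repository.

```go
package main

// solve fills grid in place using backtracking. The grid is a 9x9
// array where 0 marks an empty cell.
func solve(grid *[9][9]int) bool {
	for r := 0; r < 9; r++ {
		for c := 0; c < 9; c++ {
			if grid[r][c] != 0 {
				continue
			}
			for n := 1; n <= 9; n++ {
				if valid(grid, r, c, n) {
					grid[r][c] = n
					if solve(grid) {
						return true
					}
					grid[r][c] = 0 // undo the guess and try the next digit
				}
			}
			return false // no digit fits here: backtrack
		}
	}
	return true // no empty cells left: solved
}

// valid reports whether n can be placed at (r, c) without clashing
// with its row, column, or 3x3 box.
func valid(grid *[9][9]int, r, c, n int) bool {
	for i := 0; i < 9; i++ {
		if grid[r][i] == n || grid[i][c] == n {
			return false
		}
	}
	br, bc := r/3*3, c/3*3
	for i := br; i < br+3; i++ {
		for j := bc; j < bc+3; j++ {
			if grid[i][j] == n {
				return false
			}
		}
	}
	return true
}
```

The worst case is exponential, but for well-posed sudokus the constraints prune the search so aggressively that this terminates almost instantly, which is why backtracking is good enough here.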

Conclusion

In hindsight, it probably would have been easier to take the time to get myself a pencil with an eraser and solve the sudoku by hand, but where’s the joy in that? Also, Golang isn’t the best ecosystem for machine learning projects. Simply said, the majority of data scientists and machine learning experts work with Python, so if you get stuck on a problem, your copilot will not be able to help you much. There is also no tooling that helps with the exploration phase, speeding up analysis and experimentation. However, off-the-shelf products like OpenCV, which have machine learning capabilities as well, will work just fine if they are supported by the language of your choice.

If you want more, visit the repository to dive deep into the code, or experiment with the solution by training and testing it!
