Fingerprinting-Content-With-Python-Blog

Fingerprinting Content With Python

Overview


How much time could you save by fingerprinting file content with Python before batch processing your files?

This article will show a simple strategy to skip files that have not changed. Retrieve a prior fingerprint and compare them. Then we can determine if the actual content has changed from one time that you run your process to the next.



What is a fingerprint?


A fingerprint is a hash value generated through a hash function. There are a number of different hash functions - we will use md5. A hash function takes an input of bytes and converts it to a fixed-length sequence. The value returned is referred to as a hash, message digest, hash value, checksum, or fingerprint. In our case, we will generate a fingerprint from the file’s content.


Fingerprinting content with Python

Let's take a look how we can convert any file's content into a byte stream, and generate a fingerprint from it.
We define a BLOCK_SIZE which specifies the number of bytes to read in for each read operation in case we encounter large files.
The hash method continuously updates itself with the byte stream read (notice the 'rb' - read binary)

When we have the fingerprint, we need to compare it to the previous state of the file. You can store your fingerprint in a database or any file file format you like.

In this example, we will store our finger prints in a json file.


Checking previous state & updating the json file


We can create a function that manages the state of the JSON and returns True, False statement whether a certain file's content has changed from one run to another. You could also separate out the state management if you want to ensure some process is successful until updating the state of the JSON file.

The code below manages both updating the JSON as well as returning True, False state for a file having been updated or not. If you are curious about how to structure your framework, you can check out this post on that topic.



Conclusion


This way of fingerprinting content with Python and looking for changes is a really fast way to ensure that you only spend time batch processing files when the files have changed. It has saved me countless hours of needless processing.


About the Author

Christian Akesson

Facebook Twitter Google+

Senior Technical Artist at Nvidia. Maya user since beta 1 in 1999. Mostly in Houdini and Omniverse these days...