Overview
How much time could you save by fingerprinting file content with Python before batch processing your files?
This article shows a simple strategy for skipping files that have not changed: generate a fingerprint of each file's content, compare it to the fingerprint stored from a previous run, and only process the file if the two differ.
What is a fingerprint?
A fingerprint is a hash value generated through a hash function. There are a number of different hash functions - we will use md5. A hash function takes an input of bytes and converts it to a fixed-length sequence. The value returned is referred to as a hash, message digest, hash value, checksum, or fingerprint. In our case, we will generate a fingerprint from the file’s content.
import hashlib

hash_object = hashlib.md5(b'Hello World')
print(hash_object.hexdigest())
# b10a8db164e0754105b7a99be72e3fe5
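md5 is plenty for change detection, but if you prefer a different algorithm, hashlib exposes them all through the same interface. A minimal sketch, swapping in SHA-256 instead of md5:

import hashlib

hash_object = hashlib.sha256(b'Hello World')
print(hash_object.hexdigest())
# prints a 64-character hex digest instead of md5's 32 characters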
Fingerprinting content with Python
# Usage: fingerprint = get_fingerprint('C:/temp/mayaFile.ma')
import hashlib

BLOCK_SIZE = 65536


def get_fingerprint(filepath):
    """
    :param filepath: Path to the file you would like to hash
    :type filepath: str
    :return: str
    """
    hash_method = hashlib.md5()
    with open(filepath, 'rb') as input_file:
        # Read the file in chunks so large files don't have to fit in memory
        buf = input_file.read(BLOCK_SIZE)
        while len(buf) > 0:
            hash_method.update(buf)
            buf = input_file.read(BLOCK_SIZE)
    return hash_method.hexdigest()
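To confirm the function behaves the way you expect, you can hash the same file twice and check that the digests match. A quick sketch, using a hypothetical temporary file rather than a real Maya scene:

import tempfile

# Write some throwaway content to a temporary file
with tempfile.NamedTemporaryFile(suffix='.ma', delete=False) as temp_file:
    temp_file.write(b'// Maya ASCII scene\n')
    temp_path = temp_file.name

first = get_fingerprint(temp_path)
second = get_fingerprint(temp_path)
print(first == second)  # True - identical content always produces the same digest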
Once we have the fingerprint, we need to compare it to the previous state of the file. You can store your fingerprints in a database or any file format you like.
In this example, we will store our fingerprints in a JSON file.
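For illustration, the store is just a flat mapping of key paths to hex digests. After a couple of runs, the JSON file might look something like this (the paths and hashes below are made-up examples):

{
    "scenes/props/chair.ma": "b10a8db164e0754105b7a99be72e3fe5",
    "scenes/sets/kitchen.ma": "0cc175b9c0f1b6a831c399e269772661"
}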
Checking the previous state & updating the JSON file
We can create a function that manages the state of the JSON file and returns True or False depending on whether a file's content has changed from one run to the next. You could also separate out the state management if you want to make sure a process succeeded before updating the JSON file (see the sketch after the usage example below).
The code below handles both updating the JSON file and returning the True/False state for whether a file has changed. If you are curious about how to structure your framework, you can check out this post on that topic.
import json
import os


def has_changed(filepath, key_path, json_path, update_json=True):
    """
    :param filepath: Path to the file you would like to hash
    :type filepath: str
    :param key_path: Key the fingerprint is stored under - this should be the same for all users (relative path)
    :type key_path: str
    :param json_path: Path to the JSON file that stores the fingerprints
    :type json_path: str
    :param update_json: Flag to update the JSON file or not
    :type update_json: bool
    :return: bool
    """
    fingerprint = get_fingerprint(filepath)

    # First run: no JSON file exists yet, so treat the file as changed
    if not os.path.exists(json_path):
        if update_json:
            json_dict = {key_path: fingerprint}
            _update_json_file(json_dict=json_dict, json_path=json_path)
        return True

    with open(json_path, 'r') as json_file:
        json_dict = json.loads(json_file.read())

    # No fingerprint stored for this file yet, so treat it as changed
    if key_path not in json_dict:
        if update_json:
            json_dict[key_path] = fingerprint
            _update_json_file(json_dict=json_dict, json_path=json_path)
        return True

    # Stored fingerprint differs from the current one - the content has changed
    if json_dict[key_path] != fingerprint:
        if update_json:
            json_dict[key_path] = fingerprint
            _update_json_file(json_dict=json_dict, json_path=json_path)
        return True

    return False


def _update_json_file(json_dict, json_path):
    """
    :param json_dict: Dictionary of key_path:fingerprint - key_path should be the same for all users (relative path)
    :type json_dict: dict
    :param json_path: Path to the JSON file to be written
    :type json_path: str
    :return: None
    """
    with open(json_path, 'w') as json_file:
        json_file.write(json.dumps(json_dict, sort_keys=True, indent=4))
import lcg.fingerprint


def some_process(tons_of_maya_files):
    """ Entry point to some process being run on files """
    for maya_file in tons_of_maya_files:
        # lcg.WORKSPACE_ROOT is a static where we can calculate the unique workspace path for each user
        relative_path = maya_file.replace(lcg.WORKSPACE_ROOT, '')
        # lcg.JSON_PIPELINE_X_FINGERPRINTS - resolved full path to the json file
        json_path = lcg.JSON_PIPELINE_X_FINGERPRINTS
        if not lcg.fingerprint.has_changed(filepath=maya_file, key_path=relative_path,
                                           json_path=json_path, update_json=True):
            continue
        do_some_process(maya_file)
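As mentioned above, you may want to record the new fingerprint only after the process actually succeeds, so a failed run gets retried next time. A minimal sketch of that pattern, reusing the lcg helpers from the example above and assuming do_some_process raises an exception on failure:

import lcg.fingerprint


def some_safer_process(tons_of_maya_files):
    """ Only record new fingerprints for files that were processed successfully """
    for maya_file in tons_of_maya_files:
        relative_path = maya_file.replace(lcg.WORKSPACE_ROOT, '')
        json_path = lcg.JSON_PIPELINE_X_FINGERPRINTS
        # Check without touching the JSON file yet
        if not lcg.fingerprint.has_changed(filepath=maya_file, key_path=relative_path,
                                           json_path=json_path, update_json=False):
            continue
        try:
            do_some_process(maya_file)
        except Exception:
            # Leave the old fingerprint in place so the file is retried on the next run
            continue
        # Success - now store the new fingerprint
        lcg.fingerprint.has_changed(filepath=maya_file, key_path=relative_path,
                                    json_path=json_path, update_json=True)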
Conclusion
Fingerprinting content with Python and checking for changes this way is a fast, reliable way to make sure you only spend time batch processing files that have actually changed. It has saved me countless hours of needless processing.