Year of Python (YOP) – Week Thirteen


Hello Reader!

This week’s piece of code is actually a script I’m preparing to run at a later date.  I have some data on a system at work that I’m looking to move to our SAN.  The end goal is consolidating some files that all belong to the same category.  However, because of the nature of the data, I need to make sure the contents remain unchanged from point A to point B.  Now there are some tools out there (SafeCopy is a favorite of mine) that will copy files from one place to another, and MD5 hash the source and destination file…

But this is YOP, so we must figure out how to do the same thing in Python!!!!

Now this script assumes the files have already been copied; I haven’t written the code to do that part yet.  But essentially what you do is feed it a source and destination directory, then it will traverse both locations (including subfolders) and MD5 hash all the files located in those directories.  Finally, it writes each filename and its corresponding hash to an output file so you have an audit log of the end result.
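As a rough sketch of that last audit-log step: the function name `write_audit_log`, the tab-separated layout, and the `audit.txt` file name below are placeholders chosen for illustration, not the script's actual output format.

```python
import os
import tempfile


def write_audit_log(hashes, output_path):
    """Write one 'filename<TAB>md5' line per entry, sorted by filename."""
    with open(output_path, "w", encoding="utf-8") as log:
        for name in sorted(hashes):
            log.write(f"{name}\t{hashes[name]}\n")


if __name__ == "__main__":
    # Hypothetical entry, just to show the layout of the log.
    example = {"report.doc": "5d41402abc4b2a76b9719d911017c592"}
    log_path = os.path.join(tempfile.mkdtemp(), "audit.txt")
    write_audit_log(example, log_path)
    with open(log_path, encoding="utf-8") as log:
        print(log.read(), end="")
```

Sorting the names keeps the log stable from run to run, which makes two audit files easy to diff.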

It was interesting working on this script.  When I started writing it earlier this week, I really just needed to look up the hashlib library code.  The rest of the code I was familiar with from other code I’ve written so far as part of this project.  So I was pleased that it seemed to be a bit easier to code than when I started this endeavor.

And then I got to testing the code to make sure it would work, and got knocked back down a few steps.  First, and I think this is a common mistake for people new to the hashlib library, I realized I was not actually hashing the contents of the file in both locations, but the file name itself.  A few Google searches later (along with some Stack Overflow), and I figured out what I needed to do.
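To make the filename-versus-contents mix-up concrete, here is a small sketch. The helper names `md5_of_name` and `md5_of_contents` and the temporary files are made up for the demonstration; only `hashlib.md5` and `hexdigest()` come from the library itself.

```python
import hashlib
import os
import tempfile


def md5_of_name(path):
    """The mistake: this hashes the characters of the path string."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def md5_of_contents(path):
    """The fix: open the file in binary mode and hash the bytes inside it."""
    with open(path, "rb") as file_to_hash:
        return hashlib.md5(file_to_hash.read()).hexdigest()


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    original = os.path.join(folder, "original.txt")
    copy = os.path.join(folder, "copy.txt")
    # Two files with identical contents but different names.
    for path in (original, copy):
        with open(path, "wb") as handle:
            handle.write(b"hello")
    print(md5_of_name(original) == md5_of_name(copy))          # False
    print(md5_of_contents(original) == md5_of_contents(copy))  # True
```

Hashing the name says nothing about whether the copy is intact: a file copied to a different path gets a different hash even when every byte matches, and a corrupted copy at an identical path would still "match".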

My second problem came up with traversing through the directories.  What I discovered is that os.walk hands back only the bare file name, so you need to reassemble the path to the file along with the file name.  Otherwise, Python raises a file-not-found error.  Ironically, this problem took me longer to figure out than the MD5 hashing issue.
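The path issue can be seen in isolation with a sketch like the one below. `os.walk` yields each folder together with the bare names of the files inside it, and `open()` resolves a bare name against the current working directory, so the name has to be joined back onto the folder first. The helper name `list_file_paths` and the temporary folder layout are assumptions for the example.

```python
import os
import tempfile


def list_file_paths(root):
    """Return the full path of every file under root, including subfolders."""
    full_paths = []
    for path, subdirs, files in os.walk(root):
        for name in files:
            # name is only the bare file name; join it back onto the
            # folder that os.walk is currently visiting.
            full_paths.append(os.path.join(path, name))
    return sorted(full_paths)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "data.bin"), "wb") as handle:
        handle.write(b"example")
    for full_path in list_file_paths(root):
        print(full_path)
```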

In the end, I wound up with this piece of code:

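import hashlib
import os

# args.source_dir and source_file_list are defined elsewhere in the script.
for path, subdirs, files in os.walk(args.source_dir):
    for name in files:
        # Rebuild the full path; os.walk only yields bare file names.
        hash_filepath = os.path.join(path, name)
        with open(hash_filepath, "rb") as file_to_hash:
            # Hash the file's contents, not its name.
            md5_buff = file_to_hash.read()
            md5_returned = hashlib.md5(md5_buff).hexdigest()
        source_file_list[name] = md5_returned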

Now I know I can turn this into a function, since I’m running the code twice.  That’s a plan for a later date.  I’ve also added some code that’s not included here to inform the user which files matched.  I’m still tweaking that part of the code so it wasn’t ready for this week.  But I’m hoping to add it in as an additional part of the output audit file.
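One possible shape for that function, plus the matching step, is sketched below. This is not the finished script: `hash_tree`, `compare_trees`, and the `chunk_size` parameter are hypothetical names. Two choices here are assumptions rather than anything from the code above: files are read in chunks instead of all at once, and the dictionary is keyed by path relative to the root, because keying by bare name would let two same-named files in different subfolders overwrite each other.

```python
import hashlib
import os
import tempfile


def hash_tree(root, chunk_size=65536):
    """Map each file's path relative to root to the MD5 of its contents."""
    hashes = {}
    for path, subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(path, name)
            md5 = hashlib.md5()
            with open(full_path, "rb") as file_to_hash:
                # Read in chunks so a large file is never held in memory whole.
                for chunk in iter(lambda: file_to_hash.read(chunk_size), b""):
                    md5.update(chunk)
            hashes[os.path.relpath(full_path, root)] = md5.hexdigest()
    return hashes


def compare_trees(source_dir, dest_dir):
    """Return (matched, mismatched, missing) lists of relative paths."""
    source, dest = hash_tree(source_dir), hash_tree(dest_dir)
    matched = sorted(p for p in source if dest.get(p) == source[p])
    mismatched = sorted(p for p in source if p in dest and dest[p] != source[p])
    missing = sorted(p for p in source if p not in dest)
    return matched, mismatched, missing


if __name__ == "__main__":
    src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
    for folder in (src, dst):
        with open(os.path.join(folder, "a.txt"), "wb") as handle:
            handle.write(b"same")
    print(compare_trees(src, dst))  # (['a.txt'], [], [])
```

The three lists could then be written into the audit file alongside the hashes, so the log records not just what was hashed but whether each file survived the move.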

Until next time!


One Response to “Year of Python (YOP) – Week Thirteen”

  1. Rudy

    Good catch on hashing the filename vs content. I ran into the same issue when I started with the hashlib library.
