Ensuring Integrity of Data Files with Checksums

Integrity of data files is critical for the verifiability of computational and lab-based analyses. The standard way to seal a data file’s content at a point in time is to generate a checksum: a small, fixed-size value produced by running the file through an algorithm called a cryptographic hash function. As long as the data file does not change, recalculating the checksum will always produce the same value. If you recalculate the checksum and it differs from a past calculation, then you know the file has been altered or corrupted in some way.
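The behaviour described above can be illustrated with a short Python sketch using the standard hashlib module. The file name and contents below are arbitrary examples created by the script itself, not files from the original text.

```python
import hashlib

def sha256_checksum(path):
    """Compute the SHA-256 checksum of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read())
    return h.hexdigest()

# Create a small example data file, then checksum it twice.
with open("my_file.csv", "w") as f:
    f.write("a,b\n1,2\n")

first = sha256_checksum("my_file.csv")
second = sha256_checksum("my_file.csv")
print(first == second)   # unchanged file -> identical checksums (True)

# Altering the file changes the checksum.
with open("my_file.csv", "a") as f:
    f.write("3,4\n")
print(sha256_checksum("my_file.csv") == first)  # False
```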

Below are typical situations that call for checksum generation:

  • A data file has been newly downloaded or received from a collaborator.
  • You have copied data files to a new storage location, for instance when moving data from a local computer to an HPC system to start an analysis.
  • You want to create a snapshot of your data, for instance when assembling a supplementary material folder for a paper or report. Note: if you snapshot your data by depositing it in a data repository, then checksum generation will typically be taken care of by the repository.

In the remainder of this section we provide instructions for checksum generation on macOS, Windows and Linux platforms.

Linux and macOS

The command-line instructions for checksumming on Linux and macOS (Terminal) are the same, as follows.

shasum -a 256 name_of_file

As an alternative to shasum, you may also use the commands md5sum (Linux) and md5 (macOS). Note that MD5 is no longer considered collision-resistant, but it remains adequate for detecting accidental corruption.

An example execution of the shasum command is given below.

$ shasum -a 256 my_file.csv
0a1802c47c9c7fb29d8a6116dc40250c33321b56767125de332a862078570364  my_file.csv

The recommended practice when generating checksums is to redirect the checksum output to a file, ideally with the same base name as the data file. An example is given below.

$ shasum -a 256 my_file.csv > my_file.sha256

The .sha256 file extension denotes the algorithm that generated the checksum. Other common extensions are .sha1 or .md5.

Given a data file and its checksum, one can verify the file against the checksum with the following command.

$ shasum -c my_file.sha256
my_file.csv: OK
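The same verification can be scripted. Below is a sketch, in Python, of a checker that reads a shasum-style checksum file (one line per file: hex digest, two spaces, filename) and recomputes each file's SHA-256 hash; the data and checksum files are example fixtures created by the script.

```python
import hashlib

def verify_checksum_file(checksum_path):
    """Verify files against a shasum-style checksum file.

    Each line has the form: <hex digest>  <filename>
    Returns a dict mapping filename to True (OK) or False (mismatch).
    """
    results = {}
    with open(checksum_path) as f:
        for line in f:
            expected, name = line.strip().split(None, 1)
            with open(name, "rb") as data:
                actual = hashlib.sha256(data.read()).hexdigest()
            results[name] = (actual == expected)
    return results

# Create an example data file and its checksum file, then verify.
with open("my_file.csv", "wb") as f:
    f.write(b"a,b\n1,2\n")
digest = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
with open("my_file.sha256", "w") as f:
    f.write(f"{digest}  my_file.csv\n")

print(verify_checksum_file("my_file.sha256"))  # {'my_file.csv': True}
```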

Finally, it is important to note that checksum calculation uses only a file’s contents, not its name or location. When you create a copy of a file, the checksum calculator will therefore generate the same value for the copy. See the example below.

$ shasum -a 256  my_file.csv
0a1802c47c9c7fb29d8a6116dc40250c33321b56767125de332a862078570364  my_file.csv
$ cp my_file.csv copy_file.csv
$ shasum -a 256 copy_file.csv
0a1802c47c9c7fb29d8a6116dc40250c33321b56767125de332a862078570364  copy_file.csv

When you have several data files, checksum creation and verification need to be automated. The following free and open-source utilities can be used for this purpose:

  • Checksums macOS Automator workflow and shell script
  • md5deep command-line utility for recursive checksum operations (requires compilation on your platform)
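Recursive checksumming is also straightforward to script yourself. The sketch below is one possible approach, not a definitive tool: the directory and manifest names are arbitrary examples, and the script reads files in chunks so that large files do not need to fit in memory.

```python
import hashlib
from pathlib import Path

def checksum_tree(root, manifest="checksums.sha256", chunk_size=1 << 20):
    """Write a shasum-style SHA-256 line for every file under root.

    Returns the number of files checksummed.
    """
    root = Path(root)
    lines = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in chunks so large files are not loaded whole.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        lines.append(f"{h.hexdigest()}  {path}\n")
    Path(manifest).write_text("".join(lines))
    return len(lines)

# Example: checksum every file under an example data/ folder.
Path("data").mkdir(exist_ok=True)
Path("data/a.csv").write_bytes(b"1,2\n")
Path("data/b.csv").write_bytes(b"3,4\n")
print(checksum_tree("data"))  # number of files checksummed
```

The resulting manifest can then be verified in one step with shasum -c checksums.sha256 on Linux or macOS.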

Windows

CertUtil is the command-line tool that Windows provides for checksum calculation. It is available from Windows 7 onwards.

To access CertUtil, open Command Prompt from the Start menu. Navigate to the folder containing your data file (my_file.csv) and run the following command:

> certutil -hashfile my_file.csv SHA256
SHA256 hash of my_file.csv:
0a1802c47c9c7fb29d8a6116dc40250c33321b56767125de332a862078570364
CertUtil: -hashfile command completed successfully.

You can run the certutil command with a different algorithm by changing the last parameter, SHA256, to MD5 or SHA512. You can also redirect the result of the checksum operation to a file as follows:

> certutil -hashfile my_file.csv SHA256 > my_file.sha256

CertUtil does not provide an option to automatically verify a given checksum against a file. Therefore, in order to do verification, you’d need to re-run the “certutil -hashfile …“ command and manually compare the result with the previously generated checksum. Note also that a file produced by redirecting CertUtil’s output contains the header and footer lines shown above in addition to the digest itself.
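One workaround is a small script that compares a freshly computed hash against the value saved earlier. The Python sketch below assumes the checksum file has the three-line layout of a redirected CertUtil report shown above, with the hex digest on the second line; the file names are examples, and here the script writes a stand-in report by hand, whereas normally CertUtil would produce it.

```python
import hashlib

def verify_certutil_report(report_path, data_path):
    """Compare a file's SHA-256 hash against a redirected CertUtil report.

    Assumes the report places the hex digest on its second line, between
    the "SHA256 hash of ..." header and the "completed successfully" footer.
    """
    with open(report_path) as f:
        lines = f.read().splitlines()
    expected = lines[1].strip().lower()
    with open(data_path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected

# Example fixtures: a data file and a hand-written stand-in report.
with open("my_file.csv", "wb") as f:
    f.write(b"a,b\n1,2\n")
digest = hashlib.sha256(b"a,b\n1,2\n").hexdigest()
report = (f"SHA256 hash of my_file.csv:\n{digest}\n"
          "CertUtil: -hashfile command completed successfully.\n")
with open("my_file.sha256", "w") as f:
    f.write(report)

print(verify_certutil_report("my_file.sha256", "my_file.csv"))  # True
```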

MD5Summer is a free MD5 checksum tool for Windows. It is operated via a GUI and can perform recursive checksumming of files in folders. A step-by-step tutorial on using MD5Summer is provided by the UK Data Archive.