Naming files

(Re)naming a file is a very easy operation, usually one or two clicks away (right click + rename, F2, …). Maybe that's why people do not pay enough attention to choosing a proper file name, even though it can have a big impact on their ability to find those files later and to understand what they contain.

A good file name follows three basic principles:

  • machine readable
  • human readable
  • plays well with default ordering

If you are looking for information on how to organize and structure your folders, you may find this dedicated card helpful.

Machine readable

Special characters can have different meanings in different operating systems or software. The most commonly found are

#$%&'(")*+,-./:;<=>?@[\]^_`{|}~ and whitespace characters such as space or tab.

The only two which are recommended in file names are the hyphen “-” and the underscore “_”. Use the underscore to separate the elements of a name and the hyphen to combine words within one element. The file name 2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv already tells us the date of creation (2013-06-26), the assay (BRAFWTNEGASSAY), the sample set (Plasmid-Cellline-100-1MutantFraction) and the well (A01), while the following names

2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv
.csv
2013_06_26_BRAFWTNEGASSAY_Plasmid_Cellline_100_1MutantFraction_A01.csv

are much more prone to misinterpretation.
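To see why the two separators should have distinct roles, here is a minimal Python sketch (Python is used for the added examples on this card; the helper name `split_fields` is illustrative, not part of any tool). It splits the well-formed name at underscores and the hyphen-only variant at hyphens.

```python
# File names are the examples from the text above; split_fields is a hypothetical helper.
good = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"
hyphens_only = "2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv"


def split_fields(file_name, separator):
    """Drop the extension and split the rest of the name at the separator."""
    stem = file_name.rsplit(".", 1)[0]
    return stem.split(separator)


good_fields = split_fields(good, "_")
ambiguous_fields = split_fields(hyphens_only, "-")

# Four clean fields: date, assay, sample set, well.
print(good_fields)
# Nine fragments: the date and the sample set are torn apart.
print(len(ambiguous_fields), ambiguous_fields)
```

With a single separator, a program can no longer tell where one piece of information ends and the next begins.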

Accented characters

Your language might be very rich in accented or special characters, but both your colleagues and your machines will have a hard time working with them. Letters such as ç, ä, ô, ě, ŕ require special encoding and can cause trouble when used in file names.
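A quick check for problematic characters could look like the following Python sketch. It assumes that "safe" means ASCII letters, digits, hyphen, underscore and the dot before the extension; adjust the pattern to your own naming policy. The function name is illustrative.

```python
import re

# Assumption: only ASCII letters, digits, "_", "-" and "." are considered safe.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_machine_readable(file_name):
    """Return True if the whole name consists of safe characters only."""
    return SAFE_NAME.fullmatch(file_name) is not None


for name in ["2013-06-26_Plasmid_A01.csv", "my data (final).csv", "plazmid_č1.csv"]:
    print(name, is_machine_readable(name))
```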

Beware of typos and avoid using multiple names that differ only in small ways unless the difference is meaningful. The following file names are distinct, but can you tell where exactly?

2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv
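A machine can spot such differences more reliably than the eye. The sketch below uses Python's standard `difflib` module to compare each name against a reference spelling; the reference name and the helper `describe_differences` are assumptions made for illustration.

```python
import difflib

# Assumed "correct" spelling, taken from the earlier examples on this card.
reference = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv"
candidates = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv",
]


def describe_differences(expected, actual):
    """List (operation, expected_text, actual_text) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


differences = [describe_differences(reference, name) for name in candidates]
for name, diff in zip(candidates, differences):
    print(name, diff)
```

The three names turn out to contain an extra "s", a missing "l" and a "z" in place of an "s", respectively.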

Exploiting machine readable names

You may already have a lot of files collected for your project, or you may have received a big dataset from one of your collaborators. Then you might think about organizing and renaming them to comply with your new or existing naming policy. If the names are consistent and you don’t want to lose time renaming them by hand, you can try dedicated tools (e.g. PSRenamer) or simple commands on your command line (rename on Mac and Linux, ren on Windows).
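Batch renaming can also be scripted. The following Python sketch only illustrates the idea: it assumes the sole problem is spaces in names, and the functions `plan_renames` and `apply_renames` are hypothetical helpers, not part of any of the tools mentioned above. The plan is computed first so that it can be reviewed before anything on disk changes.

```python
from pathlib import Path


def plan_renames(file_names):
    """Map each name containing spaces to a version with hyphens instead."""
    plan = {}
    for name in file_names:
        new_name = name.replace(" ", "-")
        if new_name != name:
            plan[name] = new_name
    return plan


def apply_renames(folder, plan):
    """Rename files inside folder according to plan, refusing to overwrite."""
    folder = Path(folder)
    for old_name, new_name in plan.items():
        target = folder / new_name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        (folder / old_name).rename(target)


# Dry run on a list of names: print the plan and inspect it before applying.
plan = plan_renames(["plasmid a01.csv", "plasmid-a02.csv", "cell line b01.csv"])
print(plan)
```

Calling `apply_renames("some_folder", plan)` would then perform the renames; keep a backup until you have verified the result.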

Once your skills develop, you will be able to use machine readable file names to perform advanced operations, e.g. searching with regular expressions. Imagine a folder with thousands of files. Running a simple R command

flist <- list.files(pattern = "Plasmid")

will give you all file names containing the word “Plasmid”.

2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv

This result can easily be processed further into a metadata table by splitting the names at each underscore and dot:

# "[_.]" matches an underscore or a dot; str_split_fixed() returns a character matrix
flist_df <- stringr::str_split_fixed(flist, "[_.]", 5)
colnames(flist_df) <- c("Date", "Assay", "Sample_set", "Well", "Format")

Date          Assay             Sample_set                              Well   Format
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “A01”  csv
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “A02”  csv
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “A03”  csv
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “B01”  csv
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “B02”  csv
“2013-06-26”  “BRAFWTNEGASSAY”  “Plasmid-Cellline-100-1MutantFraction”  “B03”  csv

Of course, similarly simple and powerful commands can be found in every programming language or interpreter (Python, Bash, …).
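As an illustration, a rough Python counterpart of the R example might look like this. It is a sketch, not a drop-in replacement: the function name `metadata_table` is made up, the column names are the ones used above, and the list of names is passed in directly (in a real folder it could come from `os.listdir()`).

```python
import re

COLUMNS = ("Date", "Assay", "Sample_set", "Well", "Format")


def metadata_table(file_names, pattern="Plasmid"):
    """Keep names matching the pattern and split them at "_" and "." into columns."""
    rows = []
    for name in file_names:
        if re.search(pattern, name):
            # maxsplit=4 yields at most five pieces, one per column.
            pieces = re.split(r"[_.]", name, maxsplit=4)
            rows.append(dict(zip(COLUMNS, pieces)))
    return rows


names = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv",
    "notes.txt",  # hypothetical extra file that does not match the pattern
]
table = metadata_table(names)
for row in table:
    print(row)
```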

Case sensitivity

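It is generally recommended not to use upper case letters. Firstly, matching patterns and splitting names that contain upper case letters is harder and more error prone. Secondly, file systems treat case differently: the default file systems on Windows and macOS are case insensitive, whereas Linux file systems are typically case sensitive, so two names that differ only in case can collide when files move between systems.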
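Names that differ only in letter case are easy to detect before they cause trouble on a case-insensitive file system. A minimal Python sketch, with an illustrative function name and made-up file names:

```python
from collections import defaultdict


def case_collisions(file_names):
    """Group names that become identical once letter case is ignored."""
    groups = defaultdict(list)
    for name in file_names:
        groups[name.casefold()].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical folder listing.
listing = ["Plasmid_A01.csv", "plasmid_a01.csv", "plasmid_b01.csv"]
collisions = case_collisions(listing)
print(collisions)
```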

If you really want to extend the hyphen-underscore semantic separation, you can use so-called camelCase: drop the spaces between words and upper-case the first letter of each following word.

Machine readable names allow us to:

  • easily search for files later
  • easily narrow file lists based on names
  • easily extract info from file names, e.g. by splitting

Remember that the rules on machine readability also apply to naming your folders (now containing your nicely named files). In fact, it is good practice to stick to these rules even when naming variables in your data files.

Human readable

  • Be specific. It is generally better to create a longer file name that fulfils its purpose than to use short abbreviations which might be hard to grasp for your colleagues, or even for yourself after some time. Stay away from cryptic names and non-standard or unclear abbreviations.

Bad name                   Better name
myabstract.txt             John-White_Sensitivity-of-PLFA-analyses_abstract.txt
samples_project_start.csv  PA324_samples_2019-12-11.csv
ms_cresp_final.doc         John-White_Cell-respiration-manuscript_2019-12-11.doc
fig_1.png                  John-White_Cell-respiration_fig-1_2019-12-11.png

  • Usually, the file extension already tells you something about the file itself, so there is no need to repeat it in the name.

Here are some examples of file names which are unnecessarily long and could easily be shortened:

Iris-setosa_table.csv
video_2019_annual-meeting.avi
2019-12-11_notes.log
ATAC_seq1_London_mapped.bam
A2452_description-tutorial.info

  • Never use suffixes (or prefixes) like “final”, “old”, “new”, “current”, “obsolete”, “recent”, “latest”, “best”… A file is hardly ever truly in such a state, and it will change sooner or later anyway.

  • The name should naturally explain why the file exists. If you have to search for additional information (either by asking your colleagues or by reading README files), the file name is probably not well chosen. Name the file in a way that even a total stranger could understand easily.

  • Leave out meaningless or redundant words, e.g. “the”, “and”, “a”, “file”, “data” …

  • Do not be too creative, do not pun and stay professional. Bad examples:

bio-rect_UM.csv - data related to bio-reactors at University of Michigan
PEPA_d-pic.jpeg - a fourth picture from your paper on Performance Evaluation Process Algebra
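Some of the rules above can be checked automatically. The Python sketch below flags the state words and filler words listed in this section; the word lists are copied from the text, while the function name `flag_words` and the choice to split names at underscores, hyphens and dots are assumptions.

```python
import re

# Word lists taken from the recommendations above.
STATE_WORDS = {"final", "old", "new", "current", "obsolete", "recent", "latest", "best"}
FILLER_WORDS = {"the", "and", "a", "file", "data"}


def flag_words(file_name):
    """Return the words in a file name that the guidelines advise against."""
    tokens = re.split(r"[_.\-]", file_name.lower())
    return [token for token in tokens if token in STATE_WORDS | FILLER_WORDS]


for name in ["ms_cresp_final.doc", "the-new_data.csv", "PA324_samples_2019-12-11.csv"]:
    print(name, flag_words(name))
```

A check like this cannot judge whether a name is meaningful, but it catches the most common offenders cheaply.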

Semantic versioning

If your files or documents change very often and you want to track the versions manually instead of using dedicated versioning software, you can follow the semantic versioning scheme widely used in software development. It is based on adding several numbers (three is the standard) as a suffix of your file name, where:

  • the first number, called the MAJOR version, is increased once the document has undergone significant changes
  • the second number, called the MINOR version, is incremented once some new information is added to the document or something is deleted
  • the last number, called PATCH, refers to very minor changes like fixing typos or rephrasing a sentence.

The numbers can be preceded by the letter “V” to indicate that version information follows.
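Incrementing such a suffix by hand invites mistakes, so it can be scripted. The following Python sketch assumes the version is written as `_V<MAJOR>.<MINOR>.<PATCH>` directly before the extension, and it resets the lower numbers to zero when a higher one is increased, as is the convention in software versioning; both are assumptions, and `bump_version` is a hypothetical helper.

```python
import re

# Assumed layout: "<name>_V1.4.2.<extension>"
VERSION_PATTERN = re.compile(r"_[vV](\d+)\.(\d+)\.(\d+)(?=\.[^.]+$)")


def bump_version(file_name, part):
    """Return the file name with the MAJOR, MINOR or PATCH number increased."""
    match = VERSION_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    major, minor, patch = (int(group) for group in match.groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError("part must be 'major', 'minor' or 'patch'")
    suffix = f"_V{major}.{minor}.{patch}"
    return file_name[:match.start()] + suffix + file_name[match.end():]


print(bump_version("Cell-respiration-manuscript_V1.4.2.doc", "patch"))
print(bump_version("Cell-respiration-manuscript_V1.4.2.doc", "minor"))
print(bump_version("Cell-respiration-manuscript_V1.4.2.doc", "major"))
```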

Human readable names allow us to:

  • easily understand what the file is and what it contains
  • easily share files with others

Default ordering

Built-in tools (e.g. a file explorer) allow you to order files by name in alphanumeric order. Make the most of this feature.

  • Put the terms in general-to-specific order. That way, files are grouped in a logical order and related files naturally end up next to each other.

Ares-hordeum_samples_redundant_2010-05-12.csv
Ares-triticum_samples_redundant_2010-04-12.csv
Iris-setosa_samples_1927-05-12.csv
Iris-setosa_samples_1954-06-24.csv
Iris-versicolor_samples_1945-04-12.csv

  • Put the date first to get chronological ordering:

2013-06-26_Plasmid_A01.csv
2014-06-26_Plasmid_C02.csv
2015-06-30_Plasmid_A03.csv
2015-07-12_Plasmid_B01.csv
2015-07-13_Plasmid_B02.csv
2015-11-10_Plasmid_B03.csv

  • Put a number defining the explicit order first. Remember that ordering is done character by character, not by numeric value, so add leading zeros to make sure the order stays correct as the number of files grows.

01_Plasmid_A01_2013-06-26.csv
02_Plasmid_C02_2014-06-26.csv
03_Plasmid_A03_2015-06-30.csv
10_Plasmid_B01_2015-07-12.csv
11_Plasmid_B02_2015-07-13.csv
25_Plasmid_B03_2015-11-10.csv
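The effect of leading zeros on character-by-character ordering is easy to demonstrate. In this Python sketch the unpadded names and the helper `pad_leading_number` are made up for illustration; a padding width of two digits is assumed.

```python
import re


def pad_leading_number(file_name, width=2):
    """Left-pad the number at the start of a file name with zeros."""
    return re.sub(r"^\d+", lambda match: match.group().zfill(width), file_name)


# Hypothetical names without leading zeros.
names = ["1_Plasmid_A01.csv", "2_Plasmid_C02.csv", "10_Plasmid_B01.csv"]

unpadded_order = sorted(names)
padded_order = sorted(pad_leading_number(name) for name in names)

print(unpadded_order)  # "10_..." sorts before "1_..." and "2_..."
print(padded_order)    # "01_...", "02_...", "10_..."
```

Choose the width for the largest number of files you expect: two digits stop working at file 100.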

Dates

Including the date in your file names allows you to sort them easily and to find exactly the one you want in a very short time. Remember that recording dates with anything other than numbers (e.g. month abbreviations) can, depending on the language, result in formats like “11dic2019” or “11Dez2019”, which may not be recognized as dates at all. It is much better to use a purely numeric format, but even then a date can be written in endless variations which are hard to read or, more importantly, ambiguous, like the 11th of December 2019 in the following examples:

19/11/12
19/12/11
20191112
11.12.2019
11-12-19
...

Luckily, there is a standard date format, YYYY-MM-DD (ISO 8601), which complies nicely with all three principles above. Therefore, the only correct format for the 11th of December 2019 is:

2019-12-11
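Most languages can produce this format directly. The Python sketch below uses the standard `datetime` module; the helper `to_iso` is illustrative and only works if you already know which format an ambiguous date was written in.

```python
from datetime import date, datetime

# Writing a date in ISO 8601 form for use in a file name.
iso_stamp = date(2019, 12, 11).isoformat()
print(iso_stamp)


def to_iso(text, known_format):
    """Convert a date written in a known format to YYYY-MM-DD."""
    return datetime.strptime(text, known_format).date().isoformat()


# The format string has to be supplied by a human; it cannot be guessed reliably.
converted = to_iso("11.12.2019", "%d.%m.%Y")
print(converted)

# ISO dates sort chronologically even when compared as plain text.
stamps = ["2015-11-10", "2013-06-26", "2015-07-12"]
print(sorted(stamps))
```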

Final notes

When starting your project or creating a new repository, give yourself time to set up a proper naming scheme. Remember that it should also be accepted by your teammates and other collaborators accessing the files. To make dissemination of the naming scheme as easy as possible, don’t forget to document it and include it in the policies of your group or project.

Adopting the proposed recommendations might seem like a lot of work now, but it will pay off once your projects get more complex and your skills evolve. Choosing good names takes time but saves more than it takes.

If you don’t agree with the naming rules adopted in your group, either follow them anyway or make an effort to change them for the whole group. Consistency is much more important than your preferred naming.

Resources

  • Jenny Bryan’s slides on “Naming things” from Reproducible Science Workshop, Duke, 2015
  • Semantic versioning - semverdoc.org
  • LCSB IT101 training presentation