Data scientist, your variable names are a mess. Clean up your code. (2023)

Quick, what does the following code do?

    for i in range(n):
        for j in range(m):
            for k in range(l):
                temp_value = X[i][j][k] * 12.5
                new_array[i][j][k] = temp_value + 150

It's impossible to say, right? If you tried to modify or debug this code, you would be lost unless you could read the author's mind. Even if you were the author, you might not remember what the code does a few days after writing it, because of the unhelpful variable names and the use of magic numbers.

When working with data science code, I often see examples like the one above (or worse): code with variable names like X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, and ii, and numerous unnamed constant values. To be blunt, data scientists (myself included) are terrible at naming variables.

Clean up variable names in 3 steps

  1. A variable name must describe the entity that the variable represents.
  2. When writing your code, prioritize readability over typing speed.
  3. Use consistent patterns throughout the project to minimize the cognitive load of small decisions.

As I've grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intelligence), I've had to improve my programming by unlearning practices from data science books, courses, and labs. There are significant differences between machine learning code that can be deployed and the way data scientists learn to code, but we'll start here by focusing on two common and easy-to-fix problems:

  1. Useless, confusing, or vague variable names

  2. Nameless Magic Constant Numbers

Both of these issues contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter notebook that runs once, but when you have mission-critical machine learning pipelines that must run hundreds of times a day without errors, you have to write readable and understandable code. Fortunately, there are software engineering best practices that data scientists can adopt, including the ones we'll cover in this article.

Note: I will focus on Python, as it is by far the most widely used language in the data science industry. Some Python-specific naming rules (see the PEP 8 style guide for details) include:

  • Variable and function names are lower_case and separated_with_underscores

  • Named constants are in ALL_CAPITAL_LETTERS

  • Classes are in CamelCase
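As a minimal sketch of the three conventions above side by side (every name here is hypothetical and chosen only for illustration, and the conversion factor is approximate):

```python
# Named constant: ALL_CAPITAL_LETTERS (approximate square feet per square meter)
SQFT_PER_SQUARE_METER = 10.764


class HouseListing:  # class name: CamelCase
    def __init__(self, floor_area_sqft, room_count):
        self.floor_area_sqft = floor_area_sqft
        self.room_count = room_count


def convert_sqft_to_square_meters(area_sqft):  # function name: lower_case_with_underscores
    return area_sqft / SQFT_PER_SQUARE_METER


# Variable names: lower_case_with_underscores
house_listing = HouseListing(floor_area_sqft=1076.4, room_count=3)
floor_area_square_meters = convert_sqft_to_square_meters(house_listing.floor_area_sqft)
```

A reader can tell from the casing alone which names are constants, classes, or ordinary variables.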

Naming variables

When naming variables, there are three basic ideas to remember:

  1. The variable name must describe the information represented by the variable. A variable name should say succinctly in words what the variable represents.

  2. Your code is read more often than it is written. Prioritize the readability of your code over how fast it is written.

  3. Adopt standard naming conventions so you can make one global decision in one codebase instead of making multiple local decisions.

What does this look like in practice? Let's look at some improvements to variable names.

X and y

If you've seen these hundreds of times, you know that they usually refer to features and targets in a data science context, but that may not be obvious to other developers reading your code. Instead, use names that describe what these variables represent, such as house_features and house_prices.


value

What does the value represent? It could mean velocity_mph, customers_served, efficiency, or revenue_total. A name like value says nothing about the purpose of the variable and just creates confusion.


temp

Even if you're only using a variable as a temporary store for a value, give it a meaningful name. Maybe it's a value whose units you need to convert, so in that case make it explicit:

    # Don't do this
    temp = get_house_price_in_usd(house_sqft, house_room_count)
    final_value = temp * usd_to_aud_conversion_rate

    # Do this instead
    house_price_in_usd = get_house_price_in_usd(house_sqft, house_room_count)
    house_price_in_aud = house_price_in_usd * usd_to_aud_conversion_rate

usd, aud, mph, kwh, sqft

If you use abbreviations like these, establish them in advance. Agree on common abbreviations with the rest of your team and write them down. Then, during code review, make sure those written standards are applied.

tp, tn, fp, fn

Avoid machine learning-specific abbreviations. These values represent true_positives, true_negatives, false_positives, and false_negatives, so make that explicit. Shorter variable names are not only harder to understand but also easier to mistype. It's very easy to type tp when you mean tn, so write out the full description.
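To illustrate, here is a small sketch of precision and recall written with the full names. The function names and the counts are made up for the example, and the `_count` suffix is just one possible convention:

```python
def calculate_precision(true_positive_count, false_positive_count):
    """Fraction of predicted positives that are actually positive."""
    predicted_positive_count = true_positive_count + false_positive_count
    return true_positive_count / predicted_positive_count


def calculate_recall(true_positive_count, false_negative_count):
    """Fraction of actual positives that the model found."""
    actual_positive_count = true_positive_count + false_negative_count
    return true_positive_count / actual_positive_count


# Hypothetical confusion-matrix counts, for illustration only
true_positive_count = 40
false_positive_count = 10
false_negative_count = 60

precision = calculate_precision(true_positive_count, false_positive_count)
recall = calculate_recall(true_positive_count, false_negative_count)
```

With the full names, swapping two arguments by accident is much easier to spot than with tp, fp, and fn.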

The examples above are examples of how to prioritize the readability of your code over the speed at which you can write it. Poorly written code takes much longer to read, understand, test, modify, and debug than well-written code. In general, trying to write code faster using shorter variable names will increase the development and debugging time of your program. If you don't believe me, go back to some code you wrote six months ago and try changing it. If you need to decode your own old code, that's an indication to focus on better naming conventions.

xs and ys

These are often used for plotting, in which case the values represent x_coordinates and y_coordinates. However, I've seen these names used for many other tasks, so avoid confusion by using specific names that describe the purpose of the variables, such as times and distances, or temperatures and energy_in_kwh.

What makes a variable name bad?

Most variable naming problems arise from:

  • The desire to keep variable names short

  • A direct translation of formulas into code.

Regarding the first point, while languages like Fortran limited the length of variable names (to six characters), modern programming languages have no such limitation, so don't feel obligated to use made-up abbreviations. Don't use excessively long variable names either, but if you have to favor one side, favor readability.

Regarding the second point, when you write an equation or use a model, and this is a point that schools forget to emphasize, remember that the letters or inputs represent real values!

Let's look at an example that makes both mistakes. Suppose we have a polynomial equation to find the price of a house from a model. You might be tempted to write the math formula directly in your code:

    temp = m1 * x1 + m2 * (x2 ** 2)
    final = temp + b

This is code that appears to have been written by a machine for a machine. While a computer runs your code, humans will read it, so write code made for humans!

To do this, we need to think not about the formula itself (the how) but about the real-world objects being modeled (the what). Let's write out the complete equation (this is a good test of whether you understand the model):

    house_price = room_price * rooms + \
                  square_floor_price * (floors ** 2)
    house_price = house_price + expected_average_house_price

If you're having trouble naming your variables, it means you don't know the model or your code well enough. We write code to solve real-world problems, and we need to understand the problem our model represents.
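A runnable sketch of the same idea follows. The coefficient values, the constant names, and the function name are all hypothetical placeholders, not values from a real model:

```python
# Hypothetical model coefficients, for illustration only
PRICE_PER_ROOM_USD = 50_000
PRICE_PER_SQUARED_FLOOR_USD = 10_000
EXPECTED_AVERAGE_HOUSE_PRICE_USD = 100_000


def estimate_house_price_in_usd(room_count, floor_count):
    """Estimate a house price from its room count and squared floor count."""
    room_contribution = PRICE_PER_ROOM_USD * room_count
    floor_contribution = PRICE_PER_SQUARED_FLOOR_USD * (floor_count ** 2)
    return room_contribution + floor_contribution + EXPECTED_AVERAGE_HOUSE_PRICE_USD


house_price_in_usd = estimate_house_price_in_usd(room_count=3, floor_count=2)
```

Each term now reads as a statement about houses rather than as m1, x1, and b.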

Descriptive variable names let you work at a higher level of abstraction than a formula, which helps you focus on the problem domain.

Additional considerations for naming variables

One of the most important points to remember when naming variables is that consistency counts. Being consistent with variable names means you spend less time worrying about names and more time solving problems. This point is especially relevant when you add aggregations to variable names.

Variable names: do's and don'ts

  • Use meaningful variable names
  • Use function parameters or named constants instead of magic numbers.
  • Use variable names to describe what an equation or model represents.
  • Put aggregations at the end of variable names.
  • Use item_count instead of num.
  • Use descriptive loop indices instead of i, j, k.
  • Adopt naming and formatting conventions in a project.
  • Do not use machine learning-specific abbreviations.

Aggregations in variable names

So you have the basic idea of using descriptive names: changing xs to distances, e to efficiency, and v to velocity. Now, what happens when you take the average of velocity? Should it be average_velocity, velocity_mean, or velocity_average? Following these two rules resolves the situation:

  1. Decide on common abbreviations: avg for average, max for maximum, std for standard deviation, and so on. Make sure all team members agree and write them down. (Alternatively, you can avoid abbreviating aggregations.)

  2. Put the abbreviation at the end of the name. This puts the most relevant information, the entity described by the variable, at the beginning.

Following these rules, your set of aggregated variables might be velocity_avg, distance_avg, velocity_min, and distance_max. Rule two is a matter of personal choice, and if you disagree, that's fine. Just make sure you consistently apply the rules you choose.
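As a short sketch of the suffix convention, using made-up measurements and Python's standard `statistics` module:

```python
from statistics import mean

# Hypothetical measurements, for illustration only
velocities_mph = [30.0, 45.0, 60.0]
distances_miles = [1.5, 2.0, 4.0]

# The entity comes first and the aggregation comes last
velocity_avg = mean(velocities_mph)
velocity_min = min(velocities_mph)
distance_avg = mean(distances_miles)
distance_max = max(distances_miles)
```

Because the entity leads the name, related variables sort and autocomplete together (velocity_avg, velocity_min, and so on).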

A tricky point arises when you have a variable that represents the number of an item. You may be tempted to use building_num, but does this refer to the total number of buildings or to the index of one particular building?

To avoid ambiguity, use building_count to refer to the total number of buildings and building_index to refer to a specific building. You can adapt this to other problems, for example item_count and item_index. If you don't like count, then item_total is also a better choice than num. This approach resolves the ambiguity and keeps the convention of placing aggregations at the end of names.
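A minimal sketch of the count and index distinction, with made-up building names:

```python
building_names = ["library", "gym", "lab"]  # hypothetical sample data

building_count = len(building_names)  # total number of buildings

building_labels = []
for building_index in range(building_count):  # position of one specific building
    building_name = building_names[building_index]
    building_labels.append(f"{building_index}: {building_name}")

last_building_index = building_count - 1
```

Here building_count can never be mistaken for a position, and building_index can never be mistaken for a total.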


Loop indices

For some unfortunate reason, the typical loop variables have become i, j, and k. This may be the source of more errors and frustration than any other practice in data science. Combine uninformative variable names with nested loops (I've seen nested loops that go as far as ii, jj, and even iii) and you have the perfect recipe for unreadable, error-prone code. This may be controversial, but I never use i or any other single letter for loop variables, choosing instead to describe what I'm iterating over, like:

    for building_index in range(building_count):
        ....

    for row_index in range(row_count):
        for column_index in range(column_count):
            ....

This is especially useful when you have nested loops, so you don't have to remember whether i stands for the row or the column, or whether that was j or k. You want to spend your mental resources figuring out how to build the best model, not trying to work out the specific order of array indices.

(If you don't use a loop variable in Python, use _ as a placeholder. That way you won't be confused about whether the variable is used for indexing or not.)
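For instance, here is a small sketch of the `_` placeholder in a loop whose index is never read. The constant name and the dice-rolling task are invented for the example:

```python
import random

DICE_ROLL_COUNT = 5  # hypothetical number of rolls

random.seed(0)  # fixed seed so repeated runs give the same rolls

# The loop variable is never used, so it is named "_"
dice_rolls = [random.randint(1, 6) for _ in range(DICE_ROLL_COUNT)]
```

Seeing `_` tells the reader immediately that nothing inside the loop depends on the iteration number.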

Variable Names - Conventions to Avoid

  • Numbers in variable names
  • Commonly misspelled words in English
  • Names with ambiguous characters
  • Names with similar meanings
  • Abbreviations in names
  • Similar-sounding names

All these rules follow from the principle of prioritizing ease of reading over ease of writing. Coding is primarily a way of communicating with other programmers, so help your team members make sense of your programs.

Never use magic numbers

A magic number is a constant value with no variable name. I see these used for tasks like converting units, changing time intervals, or adding an offset:

    final_value = unconverted_value * 1.61
    final_quantity = quantity / 60
    value_with_offset = value + 150

(By the way, those variable names are all bad!)

Magic numbers are a big source of error and confusion because:

  • Only one person, the author, knows what they represent.

  • To change the value, you have to look everywhere it's used and manually enter the new value.

Instead of using magic numbers in this situation, we can define a conversion function that accepts the unconverted value and the conversion rate as parameters:

    def convert_usd_to_aud(price_in_usd, usd_to_aud_conversion_rate):
        price_in_aud = price_in_usd * usd_to_aud_conversion_rate
        return price_in_aud

If we use the conversion rate in multiple functions in a program, we can define a named constant in one place:

    USD_TO_AUD_CONVERSION_RATE = 1.61

    def convert_usd_to_aud(price_in_usd):
        price_in_aud = price_in_usd * USD_TO_AUD_CONVERSION_RATE
        return price_in_aud

(Remember, before we start the project, we should establish with our team that usd = US dollars and aud = Australian dollars. Standards matter!)

Here's another example:

    # Conversion function approach
    def get_revolution_count(minutes_passed, revolutions_per_minute):
        revolution_count = minutes_passed * revolutions_per_minute
        return revolution_count

    # Named constant approach
    REVOLUTIONS_PER_MINUTE = 60

    def get_revolution_count(minutes_passed):
        revolution_count = minutes_passed * REVOLUTIONS_PER_MINUTE
        return revolution_count

Using a NAMED_CONSTANT defined in a single location makes changing the value easier and more consistent. If the conversion rate changes, you don't have to hunt through your entire codebase to change every occurrence, because you defined it in only one place. It also tells anyone reading your code exactly what the constant represents. A function parameter is also an acceptable solution if the name describes what the parameter represents.

As a real-world example of the dangers of magic numbers, I worked on a university research project using building energy data that initially came in at 15-minute intervals. Nobody gave much thought to the possibility that this could change, and we wrote hundreds of functions using the magic number 15 (or 96 for the number of daily observations). This worked fine until we started getting data at five-minute and one-minute intervals. We spent weeks modifying all of our functions to accept an interval parameter, and even so, we were still fighting bugs caused by magic numbers for months.
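One way the interval could have been parameterized from the start is sketched below. The function and constant names are hypothetical and are not taken from the project described above:

```python
MINUTES_PER_DAY = 24 * 60


def get_daily_observation_count(observation_interval_minutes):
    """Number of observations per day for a given sampling interval."""
    return MINUTES_PER_DAY // observation_interval_minutes


# The same function covers every interval mentioned above
daily_count_at_15_minutes = get_daily_observation_count(15)
daily_count_at_5_minutes = get_daily_observation_count(5)
daily_count_at_1_minute = get_daily_observation_count(1)
```

The 96 is now derived rather than typed, so a change of interval touches one argument instead of hundreds of functions.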

Real-world data has a habit of changing on you. Currency conversion rates fluctuate by the minute, and hard-coding specific values means you'll spend a lot of time rewriting and debugging your code. There is no place for magic in programming, not even in data science.

The importance of standards and conventions

The benefit of adopting standards is that you can make one global decision instead of many local ones. Instead of choosing where to put the aggregation every time you name a variable, make the decision once at the start of your project and apply it consistently. The goal is to spend less time on concerns only marginally related to data science (naming, formatting, style) and more time solving important problems (like how to use machine learning to fight climate change).

If you're used to working alone, it can be hard to see the benefits of adopting standards. But even when working on your own, you can practice defining your own conventions and applying them consistently. You'll still benefit from making fewer small decisions, and it's good practice for when you inevitably need to develop on a team. Whenever you have more than one programmer on a project, standards become a must!

You may not agree with some of the choices I've made in this article, and that's okay! Adopting a consistent set of standards is more important than choosing exactly how many spaces to use or the maximum length of a variable name. The key is to stop spending so much time on accidental difficulties and instead concentrate on essential difficulties. (Fred Brooks, author of the software engineering classic The Mythical Man-Month, has an excellent essay on how we've gone from addressing accidental problems in software engineering to concentrating on essential problems.)

Now let's go back to the original code we started with and fix it.

    for i in range(n):
        for j in range(m):
            for k in range(l):
                temp_value = X[i][j][k] * 12.5
                new_array[i][j][k] = temp_value + 150

We use descriptive variable names and named constants.

    PIXEL_NORMALIZATION_FACTOR = 12.5
    PIXEL_OFFSET_FACTOR = 150

    for row_index in range(row_count):
        for column_index in range(column_count):
            for color_channel_index in range(color_channel_count):
                normalized_pixel_value = (
                    original_pixel_array[row_index][column_index][color_channel_index]
                    * PIXEL_NORMALIZATION_FACTOR
                )
                transformed_pixel_array[row_index][column_index][color_channel_index] = (
                    normalized_pixel_value + PIXEL_OFFSET_FACTOR
                )

Now we can see that this code normalizes the pixel values in an array and adds a constant offset to create a new array (ignore the inefficiency of the implementation!). If we share this code with our colleagues, they can understand and change it. We will also know exactly what we did when we come back to the code to test it and fix our bugs.
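For readers who want to run the idea end to end, here is a self-contained sketch that wraps the same transformation in a function. The function name and the sample array are assumptions added for illustration, and it iterates over rows and pixels directly rather than over indices, which is one possible design choice and not the only one:

```python
PIXEL_NORMALIZATION_FACTOR = 12.5
PIXEL_OFFSET_FACTOR = 150


def transform_pixel_array(original_pixel_array):
    """Scale every color channel value and add a constant offset.

    Expects a nested list indexed as [row][column][color_channel].
    """
    transformed_pixel_array = []
    for pixel_row in original_pixel_array:
        transformed_pixel_row = []
        for pixel in pixel_row:
            transformed_pixel = [
                channel_value * PIXEL_NORMALIZATION_FACTOR + PIXEL_OFFSET_FACTOR
                for channel_value in pixel
            ]
            transformed_pixel_row.append(transformed_pixel)
        transformed_pixel_array.append(transformed_pixel_row)
    return transformed_pixel_array


# Hypothetical image with one row, two pixels, and three color channels
original_pixel_array = [[[0, 2, 4], [1, 1, 1]]]
transformed_pixel_array = transform_pixel_array(original_pixel_array)
```

Iterating over the data itself removes the index variables altogether, which sidesteps the row-versus-column question entirely.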

Cleaning up variable names may seem like a chore, but if you spend time reading about software engineering, you'll realize that what separates the best programmers from the rest is the repeated application of mundane techniques: using good variable names, keeping routines short, testing every line of code, refactoring, and so on. These are the techniques you need to take your code from research or exploration to production readiness, and when you get there, you'll see how exciting it can be to have your data science models influence real-life decisions.


Article information

Author: Wyatt Volkman LLD

Last Updated: 03/07/2023
