Seven Easy Data Validation Checks

If you work with data, you know how crucial it is to validate it before modeling. Yet data validation is often overlooked during a project because it’s perceived as less interesting than other aspects of the modeling process.

As I shared in a previous post, I believe that checklists are useful tools in overcoming failures and reducing errors, especially in routine tasks. This time, I want to share one of the checklists I often use as soon as I receive a new dataset. Keep in mind: I don’t intend to offer an overly reductive account of data cleaning and validation. This is a checklist of the simplest and most important validation checks that one can perform easily to quickly find and fix some of the most obvious errors in the data.

If you have a table, check the data types of the columns to ensure that they are what they’re supposed to be. This may not seem intuitive, but ZIP codes, for example, should be strings instead of a numeric type. Because many languages and systems drop leading zeros from numbers, you may mistakenly retrieve a ZIP code in New Jersey (such as 07885 for Wharton, NJ) as 7885, which is actually the postal code for the West Coast of New Zealand! So keep an eye on the data types of ZIP code fields in your data (Holtsville, NY, and some places in Puerto Rico and the Virgin Islands have ZIP codes with two leading zeros).
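To make the leading-zero problem concrete, here is a minimal Python sketch. The helper name `looks_like_zip` is made up for illustration, and the zero-padding repair assumes every value is a plain 5-digit US ZIP code (no ZIP+4, no foreign postal codes).

```python
def looks_like_zip(value):
    """Return True if value is a 5-character string of digits."""
    return isinstance(value, str) and len(value) == 5 and value.isdigit()

stored_as_text = "07885"
stored_as_number = int(stored_as_text)  # the leading zero is lost here

# If a column was already read as integers, zero-padding restores the code.
# Assumption: every value is a 5-digit US ZIP code.
repaired = str(stored_as_number).zfill(5)

print(stored_as_number)                  # 7885
print(repaired)                          # 07885
print(looks_like_zip(stored_as_text))    # True
print(looks_like_zip(stored_as_number))  # False
```

The safer route is to never let the field become numeric in the first place; the repair step only works when you know the expected length.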

Check the dates and times and ensure they are in the right format. When in doubt, read everything in as text instead of numeric. It’s easier to convert text to any data type you need.
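One way to check the format is to keep the raw values as text and try to parse each one. This is a sketch, not a complete solution: the expected format string and the sample values are assumptions, and `parse_timestamp` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed expected format for this sketch: "YYYY-MM-DD HH:MM:SS".
EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_timestamp(text, fmt=EXPECTED_FORMAT):
    """Return a datetime if text matches fmt, otherwise None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None

# Hypothetical raw values, kept as text until they are validated.
raw_values = [
    "2023-03-01 08:15:00",  # matches the expected format
    "03/01/2023 08:15",     # different format
    "2023-02-30 10:00:00",  # right format, impossible date
]

bad_values = [v for v in raw_values if parse_timestamp(v) is None]
print(bad_values)
```

Parsing catches both kinds of problems at once: values in the wrong layout and values that look right but name a date that doesn’t exist.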

Check durations as well. A negative duration means your end date/time is before your start date/time, which usually isn’t a good sign. Also, look at the maximum and minimum values of your date fields. If you are dealing with data from last year, then logically you shouldn’t see dates from next year or from a decade ago.
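Both checks take only a few lines. In this sketch the trip records and the reporting window (calendar year 2023) are invented for illustration.

```python
from datetime import datetime

# Hypothetical records: (trip id, start, end).
trips = [
    ("a", datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 45)),
    ("b", datetime(2023, 5, 2, 14, 0), datetime(2023, 5, 2, 13, 30)),  # ends before it starts
    ("c", datetime(2013, 5, 3, 8, 0), datetime(2013, 5, 3, 8, 20)),    # a decade too early
]

# Assumed reporting window: calendar year 2023.
window_start = datetime(2023, 1, 1)
window_end = datetime(2024, 1, 1)

negative_durations = [tid for tid, start, end in trips if end < start]

out_of_window = [
    tid for tid, start, end in trips
    if not (window_start <= start < window_end and window_start <= end < window_end)
]

print(negative_durations)  # ['b']
print(out_of_window)       # ['c']
```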

Do a quick check on the minimum and maximum of each column or feature, and check for “Null,” “None,” or missing values as well. These statistics give you a better overall picture of errors in the features. For example, if you expect to receive daily data on trucks leaving a warehouse, very small weights like 1 lb, or weights above the truck’s capacity, are questionable.
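Here is a rough sketch of that check for the truck example. The weights, the plausibility bounds, and the list of missing-value markers are all assumptions chosen for illustration, not real limits.

```python
# Hypothetical daily truck weights in pounds; None and "" stand in for missing values.
weights_lb = [41250.0, 38900.5, None, 1.0, 95000.0, "", 40010.0]

# Assumed plausibility bounds for this sketch, not real regulatory limits.
MIN_PLAUSIBLE_LB = 1000.0
MAX_CAPACITY_LB = 80000.0

# Markers that often represent a missing value in raw exports.
MISSING_MARKERS = {None, "", "Null", "None", "NULL", "NaN"}

missing_count = sum(1 for w in weights_lb if w in MISSING_MARKERS)
present = [w for w in weights_lb if w not in MISSING_MARKERS]

# Values outside the plausible range deserve a closer look.
suspicious = [w for w in present if not (MIN_PLAUSIBLE_LB <= w <= MAX_CAPACITY_LB)]

print(min(present), max(present))  # 1.0 95000.0
print(missing_count)               # 2
print(suspicious)                  # [1.0, 95000.0]
```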

Know what latitude and longitude are supposed to show, and use them to perform a sanity check. Latitude shows how far North or South a point is, with positive and negative numbers, respectively. Longitude shows how far East or West a point is, with positive and negative numbers, respectively. So, if your data covers a location in North America and you see anything other than positive latitude (it’s in the Northern Hemisphere) and negative longitude (it’s in the Western Hemisphere), then your data is most likely incorrect.
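The hemisphere check can be sketched as two small functions. The function names and the sample points are hypothetical, and the coordinates are approximate.

```python
def valid_coordinate(lat, lon):
    """Latitude must lie in [-90, 90] and longitude in [-180, 180]."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def plausibly_north_america(lat, lon):
    """Rough check: north of the equator and west of the prime meridian."""
    return valid_coordinate(lat, lon) and lat > 0 and lon < 0

# Hypothetical points with approximate coordinates.
points = {
    "Wharton, NJ": (40.89, -74.58),
    "swapped lat/lon": (-74.58, 40.89),
    "missing minus sign": (40.89, 74.58),
}

flagged = [
    name for name, (lat, lon) in points.items()
    if not plausibly_north_america(lat, lon)
]
print(flagged)  # ['swapped lat/lon', 'missing minus sign']
```

Swapped columns and dropped minus signs are the two mistakes this kind of check tends to surface first.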

Use aggregate functions like sum, count, min, max, and other related operations to help validate data completeness and accuracy.
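As an illustration, the sketch below compares simple aggregates against control totals. The rows and the expected figures are made up; in practice the expected values would come from whoever supplied the data.

```python
# Hypothetical extract.
rows = [
    {"order_id": 1, "amount": 120.50},
    {"order_id": 2, "amount": 75.25},
    {"order_id": 2, "amount": 75.25},  # accidental duplicate
    {"order_id": 3, "amount": 310.00},
]

# Assumed control totals reported by the source system.
expected_count = 3
expected_total = 505.75

amounts = [r["amount"] for r in rows]
summary = {
    "count": len(rows),
    "distinct_ids": len({r["order_id"] for r in rows}),
    "sum": round(sum(amounts), 2),
    "min": min(amounts),
    "max": max(amounts),
}

count_matches = summary["count"] == expected_count
total_matches = abs(summary["sum"] - expected_total) < 0.005

print(summary)
print(count_matches, total_matches)  # False False
```

Here the row count is higher than the number of distinct IDs, which points to the duplicated order as the reason the totals don’t reconcile.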

Finally, if you can choose between an Excel file (.xls or .xlsx format) and a CSV (.csv), pick the CSV. Excel automatically tries to guess the best type for the data it opens, which can often cause trouble. Also, be aware that opening a CSV in Excel in an attempt to validate the data could alter the data by incorrectly casting field types or even truncating cell values.
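If you read the CSV with code instead, you decide the types yourself. A minimal sketch with Python’s standard `csv` module follows; the column names and sample content are hypothetical, and in practice the text would come from a file opened with `open(path, newline="")`.

```python
import csv
import io

# Hypothetical CSV content.
raw = (
    "id,zipcode,part_number,weight_lb\n"
    "1,07885,1E5,41250.5\n"
    "2,00501,0012,38900\n"
)

# csv.DictReader returns every field as a string, so nothing is cast implicitly.
records = list(csv.DictReader(io.StringIO(raw)))

# Convert only the fields that are truly numeric; leave identifiers as text.
for rec in records:
    rec["id"] = int(rec["id"])
    rec["weight_lb"] = float(rec["weight_lb"])

print(records[0])
# {'id': 1, 'zipcode': '07885', 'part_number': '1E5', 'weight_lb': 41250.5}
```

Values such as `07885` and `1E5` stay exactly as they were written, which is not guaranteed when a spreadsheet guesses the types for you.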
