Deleting Duplicate Records in MySQL (Step-by-Step Guide)

As the title suggests, today I will share an SQL query to solve a commonly encountered issue in MySQL databases.

To explain the topic more effectively, let’s use an example. Suppose we have a table named “users” with a column called “civilization_number” where we store social security numbers. Let’s assume there are thousands of records in the “users” table, and we want to find the values that appear more than once in the “civilization_number” column and delete the extra records, keeping only the one with the smallest ID value in each group. If we simply deleted every record with a duplicated value, we would lose the original record along with its copies.
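To make the scenario concrete, here is a small sample table you could use for experimenting. Only the table name and the “id” and “civilization_number” columns come from the example above; the column types and the sample values are my assumptions, so adjust them to your real schema.

```sql
-- Hypothetical schema: column types and sample values are assumptions.
CREATE TABLE users (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    civilization_number VARCHAR(20) NOT NULL
);

-- Explicit ids make it easy to see which rows should survive.
INSERT INTO users (id, civilization_number) VALUES
    (1, '111'),  -- smallest id for '111', should be kept
    (2, '222'),  -- unique value, should be kept
    (3, '111'),  -- duplicate of id 1
    (4, '333'),  -- smallest id for '333', should be kept
    (5, '333'),  -- duplicate of id 4
    (6, '111');  -- duplicate of id 1
```

With this data, a correct cleanup should remove the rows with ids 3, 5 and 6 and leave ids 1, 2 and 4 in place.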

First, let’s take a look at the SQL query we’ll be using to solve the problem:

DELETE FROM users
WHERE id NOT IN (SELECT * FROM (SELECT MIN(id) FROM users GROUP BY civilization_number HAVING COUNT(*) > 1) AS t)
  AND civilization_number IN (SELECT * FROM (SELECT civilization_number FROM users GROUP BY civilization_number HAVING COUNT(*) > 1) AS t2);

Although this query may seem complex, the idea behind it is simple. Let’s examine how each part of the query works.

Firstly, we need to identify the records in the “users” table where the “civilization_number” column is duplicated. For this purpose, we can use the following query:

SELECT civilization_number, COUNT(*) FROM users GROUP BY civilization_number HAVING COUNT(*) > 1;

This query will list the count and values of the duplicate entries in the “civilization_number” column. However, this query doesn’t provide us with the corresponding ID values of the duplicate records. To solve this problem, we need to include the “id” column as well:

SELECT civilization_number, MIN(id), COUNT(*) FROM users GROUP BY civilization_number HAVING COUNT(*) > 1;

This query lists each duplicated value in the “civilization_number” column together with the number of copies and the smallest ID in its group. Because of the GROUP BY, we get one row per duplicated value rather than one row per record, and “MIN(id)” gives us the ID of the record we want to keep.
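If you would like to see every ID in a duplicate group and not only the smallest one, a variation along these lines may help. It is just a sketch for inspection: the aliases “id_to_keep”, “all_ids” and “copies” are names I made up, and GROUP_CONCAT output is cut off at the “group_concat_max_len” setting (1024 characters by default), so very large groups may appear truncated.

```sql
-- Inspection only: nothing is modified by this query.
SELECT civilization_number,
       MIN(id)                      AS id_to_keep,  -- the record that would survive
       GROUP_CONCAT(id ORDER BY id) AS all_ids,     -- every id sharing this value
       COUNT(*)                     AS copies
FROM users
GROUP BY civilization_number
HAVING COUNT(*) > 1;
```

Every ID in “all_ids” other than “id_to_keep” is a candidate for deletion.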

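Now, by using this query as a subquery, we can determine the records to be deleted. This is where the lengthy SQL query shared above comes into play. Its second condition selects every record whose “civilization_number” value is duplicated, and its first condition excludes the record with the smallest “id” value within each “civilization_number” group, so only the extra copies are deleted. Each subquery is wrapped in an additional SELECT * FROM (...) because MySQL otherwise refuses to delete from a table that the same statement also reads in a subquery (error 1093).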
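Before deleting anything, it can be reassuring to put the same WHERE clause into a SELECT and look at the rows it matches. The following is a minimal sketch of that idea; the extra derived-table wrappers are not strictly needed in a SELECT, but I kept the conditions identical to the DELETE so the preview shows the same rows the DELETE would match.

```sql
-- Preview: lists the rows the DELETE would remove, without changing any data.
SELECT id, civilization_number
FROM users
WHERE id NOT IN (
        SELECT * FROM (
            SELECT MIN(id) FROM users
            GROUP BY civilization_number
            HAVING COUNT(*) > 1
        ) AS t
      )
  AND civilization_number IN (
        SELECT * FROM (
            SELECT civilization_number FROM users
            GROUP BY civilization_number
            HAVING COUNT(*) > 1
        ) AS t2
      )
ORDER BY civilization_number, id;
```

If the smallest ID of any group shows up in this list, something is wrong and the DELETE should not be run.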

You can modify this scenario according to your needs or requirements. However, keep in mind:

Since this operation can be slow and resource-intensive, especially for large databases, I strongly recommend taking a backup and testing the query before executing it.
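One possible way to follow this advice is sketched below. It assumes the “users” table uses a transactional storage engine such as InnoDB, since ROLLBACK cannot undo changes otherwise, and “users_backup” is a hypothetical table name. The copy is created before the transaction starts because CREATE TABLE causes an implicit commit in MySQL.

```sql
-- 1. Keep a copy of the data (copies rows only, not indexes).
CREATE TABLE users_backup AS SELECT * FROM users;

-- 2. Try the cleanup inside a transaction.
START TRANSACTION;

DELETE FROM users
WHERE id NOT IN (SELECT * FROM (SELECT MIN(id) FROM users
                                GROUP BY civilization_number
                                HAVING COUNT(*) > 1) AS t)
  AND civilization_number IN (SELECT * FROM (SELECT civilization_number FROM users
                                             GROUP BY civilization_number
                                             HAVING COUNT(*) > 1) AS t2);

-- Number of rows removed by the DELETE above.
SELECT ROW_COUNT() AS deleted_rows;

-- Should return no rows if every duplicate group was cleaned up.
SELECT civilization_number, COUNT(*)
FROM users
GROUP BY civilization_number
HAVING COUNT(*) > 1;

-- 3. Undo the test run. Replace with COMMIT once the result looks right.
ROLLBACK;
```

For a real database, a full dump taken with your usual backup tool is still the safer option; the copy table is only a quick safety net for testing.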



