Someth Victory

Ruby, Rails, and Javascript Developer

Large Data Migration Tips

Active Record Migration is one of the greatest features in Rails. It actually help to reduce a very large and complicated amount of stuffs in development process. Most actions in migration such as creating table, add fields, remove fields, add index, remove index to existing table are simple and easy to do since the existing document in Rails Guides is very straight forward.

On the other hand, something like importing large amount of data, or moving data from one to another large table, migration become very complicated and panic.

Given that we have 1 millions record of products in the database. And each product belongs_to one global_category.

As developers, we want to change the relation belongs_to :global_category to be has_many :categories instead.

If the existing product contains no global_category, then product categories should contain a category named Uncategorized.

See example below:

Example 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class MigrateProductCategories < ActiveRecord::Migration
  def up
    Product.all.each do |product|
      global_category = product.global_category

      category = if global_category.present?
        product.categories.find_or_create_by(name: global_category.name)
      else
        product.categories.find_or_create_by(name: 'Uncategorized')
      end

      product.categories << category
      product.save
    end
  end
end

The code looks fine and simple from the outside.

It is actually not really fine, and good practice for large amount of data migration at all.

The Product.all.each, will load all products from your database into the memory. Once you got 1 million record of products, you will be running out of memory and will cause and error.

The suggestion is to use Product.find_each instead, which will load only 1000 record continuously and will not consume too much memory.

Example 2
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class MigrateProductCategories < ActiveRecord::Migration
  def up
    Product.find_each do |product|
      global_category = product.global_category

      category = if global_category.present?
        product.categories.find_or_create_by(name: global_category.name)
      else
        product.categories.find_or_create_by(name: 'Uncategorized')
      end

      product.categories << category
      product.save
    end
  end
end

Next, what will happen when there is expected errors happen? Actually, the program will crash and exit in the middle on the road. We don’t know what cause the problem and where the problem might happen. So we have to update the exisitng migration code to be rerunable, and rerun the whole migration again.

This is not an ideal solution at all. We can’t make sure that the new updated code won’t cause any other problems.

The suggestion for this is to find out any products that cause the problem during migration process, and store those products some where for later debug. Then recue from any error happen on running time and continue the working until all data are migrated.

See the refactorred example below:

Example 3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
class MigrateProductCategories < ActiveRecord::Migration
  def up
    unprocessable_products = []

    puts "=============Processing============="

    Product.find_each do |product|
      global_category = product.global_category
      begin
        category = if global_category.present?
          product.categories.find_or_create_by(name: global_category.name)
        else
          product.categories.find_or_create_by(name: 'Uncategorized')
        end
        product.categories << category
        product.save!
      rescue
        unprocessable_products << product.id
      end
    end

    puts "=============Done============="

    system "echo #{unprocessable_products} >> error.txt" if unprocessable_products.present?
  end
end

From the example above, all exceptions raised will be rescued with the code in the rescue block which will push any error products into the unprocessable_products that will be writted into the error.txt file, and the migration still continue to run for the rest of the records.

Once the migration is finished, you just need to check in the error.txt file which will list all the error product ids. With the ids in hand, you could now debug your code easily. Anyway, if you couldn’t find the error.txt file, it means there is no error at all, yeee.

Hope this help! Happy migrating!