Large Data Migration Tips

Aug 25th, 2015 9:15 pm

Active Record Migrations are one of the greatest features in Rails. They take a lot of large, complicated work out of the development process. Most migration tasks, such as creating a table or adding and removing fields and indexes on an existing table, are simple and easy to do, since the documentation in the Rails Guides is very straightforward.

On the other hand, for something like importing a large amount of data, or moving data from one large table to another, a migration becomes complicated and stressful.

Say we have 1 million product records in the database, and each product belongs_to one global_category.

As developers, we want to change the relation belongs_to :global_category to be has_many :categories instead.

If the existing product contains no global_category, then product categories should contain a category named Uncategorized.

See example below:

Example 1

class MigrateProductCategories < ActiveRecord::Migration
  def up
    Product.all.each do |product|
      global_category = product.global_category

      category = if global_category.present?
        product.categories.find_or_create_by(name: global_category.name)
      else
        product.categories.find_or_create_by(name: 'Uncategorized')
      end

      product.categories << category
      product.save
    end
  end
end

The code looks fine and simple from the outside.

It is actually not fine, and not good practice for migrating a large amount of data.

Product.all.each loads all products from your database into memory at once. With 1 million product records, you will likely run out of memory, and the migration will fail with an error.

The suggestion is to use Product.find_each instead, which loads records in batches (1000 at a time by default), so it does not consume too much memory.

Example 2

class MigrateProductCategories < ActiveRecord::Migration
  def up
    Product.find_each do |product|
      global_category = product.global_category

      category = if global_category.present?
        product.categories.find_or_create_by(name: global_category.name)
      else
        product.categories.find_or_create_by(name: 'Uncategorized')
      end

      product.categories << category
      product.save
    end
  end
end
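
To see why batching keeps memory use flat, here is a minimal plain-Ruby sketch of the idea behind find_each. It does not use Active Record: FAKE_TABLE and each_in_batches are made-up names for illustration, and the array only stands in for the products table. The real implementation differs in detail, but it likewise walks the table in primary key order, one batch at a time.

```ruby
# Stand-in for a products table: 2,500 rows ordered by id (hypothetical data).
FAKE_TABLE = (1..2_500).map { |id| { id: id, name: "Product #{id}" } }

# Yields every row, one batch of at most `batch_size` rows at a time.
# Each round asks only for rows whose id is greater than the last id seen.
def each_in_batches(table, batch_size: 1000)
  last_id = 0
  batch_sizes = []

  loop do
    batch = table.lazy.select { |row| row[:id] > last_id }.first(batch_size)
    break if batch.empty?

    batch_sizes << batch.size
    batch.each { |row| yield row }
    last_id = batch.last[:id]
  end

  batch_sizes
end

processed = 0
batch_sizes = each_in_batches(FAKE_TABLE) { |_row| processed += 1 }

puts "processed #{processed} rows in batches of #{batch_sizes.inspect}"
```

Because each round starts from the last id seen, the loop never needs the whole table in hand, only the current batch.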

Next, what happens when an unexpected error occurs? The program will crash and exit in the middle of the road. We don’t know what caused the problem or where it happened. So we have to update the existing migration code to be rerunnable, and rerun the whole migration.

This is not an ideal solution at all. We can’t be sure that the updated code won’t cause other problems.

The suggestion is to find the products that cause problems during the migration, and store them somewhere for later debugging. Then rescue any error raised at run time and keep working until all the data is migrated.

See the refactored example below:

Example 3

class MigrateProductCategories < ActiveRecord::Migration
  def up
    unprocessable_products = []

    puts "=============Processing============="

    Product.find_each do |product|
      global_category = product.global_category
      begin
        category = if global_category.present?
          product.categories.find_or_create_by(name: global_category.name)
        else
          product.categories.find_or_create_by(name: 'Uncategorized')
        end
        product.categories << category
        product.save!
      rescue
        unprocessable_products << product.id
      end
    end

    puts "=============Done============="

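    # Append the failed ids to error.txt without going through the shell
    if unprocessable_products.present?
      File.open('error.txt', 'a') { |file| file.puts unprocessable_products.join(', ') }
    end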
  end
end

In the example above, any exception raised is caught by the rescue block, which pushes the id of the failing product into unprocessable_products. That list is written to the error.txt file at the end, and the migration continues to run for the rest of the records.

Once the migration is finished, you just need to check the error.txt file, which lists the ids of all the products that failed. With the ids in hand, you can debug your code easily. And if you can’t find the error.txt file, it means there were no errors at all, yeee.
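
One possible refinement, sketched here in plain Ruby without Active Record, is to store the error message next to each id so that debugging does not start from the id alone. Everything in this sketch is hypothetical: migrate_record stands in for the per-product work, the records array stands in for the table, and a Tempfile is used instead of error.txt just to keep the example self-contained.

```ruby
require 'tempfile'

# Hypothetical stand-in for the per-product work done inside the migration.
def migrate_record(record)
  raise ArgumentError, 'name is missing' if record[:name].nil?

  record[:name].upcase
end

records = [
  { id: 1, name: 'chair' },
  { id: 2, name: nil }, # this one will fail
  { id: 3, name: 'table' }
]

failures = {}

records.each do |record|
  begin
    migrate_record(record)
  rescue StandardError => e
    # Keep the reason together with the id of the failing record.
    failures[record[:id]] = "#{e.class}: #{e.message}"
  end
end

# Write one "id<TAB>reason" line per failed record.
log = Tempfile.new('migration_errors')
failures.each { |id, reason| log.puts "#{id}\t#{reason}" }
log.flush

puts File.read(log.path)
```

Rescuing StandardError explicitly does the same thing as a bare rescue, but it makes the intent visible and gives access to the exception object.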

Hope this helps! Happy migrating!

Leaving Yoolk Inc.

Aug 10th, 2015 7:32 pm

I have been working at Yoolk for more than 5 years, and today is my last day here. Time flew so fast that I can’t remember everything. What I will never forget from Yoolk is the great time we had with the best people there. Developers at Yoolk are very skillful and talented. They are very helpful and friendly. Yoolk itself is one of the best tech companies in Cambodia, I would say. It uses and contributes to the latest open source technologies from the community. Yoolk has very good design principles in software development, followed by best practices such as OO design, Design Patterns, TDD/BDD, etc.

Yoolk provided me with very good opportunities to grow my professional skills from a ground-floor position. I learned a lot while working there. People at Yoolk not only gave me chances to improve my skills, but also gave me a very good time. They are not only colleagues, but good friends as well. Really, thank you guys!

It feels very tough inside, actually. Yoolkers celebrated my birthday today without letting me know in advance, 2 days after my real birth date, which surprised me very much. Imagine everyone coming to you, shaking your hand, and saying ‘Happy Birthday to you’ and ‘Goodbye!’ at the same time. All I could do was smile back at them while I felt very tense inside. I could see their sadness, just like mine. I’m sorry, guys, for bringing this feeling to you all :(

Actually, leaving Yoolk is the hardest decision I have ever made. It is the most frustrating time, I would say, maybe because Yoolk is the first company I worked for and I was there for quite a long time. It feels very tough. But nothing remains the same. Life is about moving forward. I left Yoolk in order to find further improvement and opportunity, and to explore a new world.

Singleton Pattern in Ruby

Aug 1st, 2015 9:51 pm

The Singleton Pattern is a design pattern that allows a class to be instantiated only once. The benefit of a singleton is the guarantee that one and the same object is used every time. Ruby supports the Singleton Pattern out of the box.

In order to create a singleton class in Ruby, the Singleton module needs to be included.

Check out the code below:

Singleton Example

# Singleton lives in Ruby's standard library
require 'singleton'

class OpenSRSServer
  include Singleton

  def connection
    @connection ||= OpenSRS::Server.new(
      key:      ENV['OPENSRS_ACCESS_KEY'],
      server:   ENV['OPENSRS_SERVER'],
      username: ENV['OPENSRS_REG_USERNAME']
    )
  end
end

The example above creates a Ruby singleton class called OpenSRSServer, with an instance method called connection. This class is responsible for creating a connection object using the OpenSRS gem.

In order to get the object of a singleton class in Ruby, we need to use the .instance method rather than .new.

Singleton Object

server = OpenSRSServer.instance

No matter how many times OpenSRSServer.instance gets called, the object returned will be the same.
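
Here is a small self-contained sketch that checks this behaviour. AppConfig and api_host are hypothetical names used only for the demonstration; they are not part of the OpenSRS example above.

```ruby
require 'singleton'

# Hypothetical class used only to demonstrate the Singleton module.
class AppConfig
  include Singleton

  attr_accessor :api_host
end

first  = AppConfig.instance
second = AppConfig.instance
first.api_host = 'example.test'

# Both variables point at the very same object, so state is shared.
puts first.equal?(second)
puts second.api_host

# Singleton makes .new private, so direct instantiation is rejected.
begin
  AppConfig.new
rescue NoMethodError => e
  puts "new is not available: #{e.class}"
end
```

Since the instance is shared by every caller, any state stored on it is effectively global, which is worth keeping in mind when testing.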

There Is No Ternary Syntax in CoffeeScript

Jul 15th, 2015 10:01 pm

The ternary operator does not behave in CoffeeScript the way you might expect. There is no error, but your application will behave in unexpected ways.

See example below:

Example 1

hasName = name ? true : false

The hasName variable should be true or false based on whether or not the name variable has a value. And yes, it works like that in pure JavaScript. But CoffeeScript has no ternary operator: ? is its existential operator, so when the code is written like this, hasName gets the value of name whenever name exists. If name stores the value ‘Victory’, then hasName gets the value ‘Victory’ too, which can make your application behave strangely in some cases.

Fortunately, CoffeeScript has a very nice one-line syntax which behaves exactly like the ternary operator in pure JavaScript.

Example 2

hasName = if name then true else false

Now hasName will be true or false based on whether or not the name variable has a value.

Data Transfer Object in Ruby

Jun 30th, 2015 9:17 pm

A Data Transfer Object (DTO) is an object used to encapsulate data. It is commonly used in the service layer, which requests data from a third-party API or from the system itself. The benefit of a DTO is that it converts raw data into an object and drops unnecessary information. It also makes a great model in MVC. Moreover, DTOs make the code much easier to maintain and test.

Say we are writing some code to perform a domain check against a third-party API.

Let’s see the example below:

Request data to an API

result = API::Domain.check('some-domain.com')

The response looks like this:

Response result from an API

{
  "response": {
    "status": {
      "code": 200,
      "message": "OK"
    },
    "headers": {
      "Date": "Fri, 20 Jun 2014 02:41:57 GMT",
      "Content-Type": "text/json"
    },
    "body": {
      "type": "domain",
      "name": "some-domain.com",
      "price": "11.00",
      "status": "Available"
    }
  }
}

In order to access the attributes of the response, we have to go through the whole key hierarchy:

Access field name of the response

result = JSON.parse(result)
result['response']['body']['name']
# => 'some-domain.com'
result['response']['body']['type']
# => 'domain'

Let’s say that we want to check whether or not the domain is available (a domain is available only if its status is ‘Available’ and its price is less than 15).

The code will look like this:

result['response']['body']['status'] == 'Available' &&
result['response']['body']['price'].to_f < 15

As you can see, the code above is neither readable nor easy to scale.

In addition, what happens when a field in the hierarchy needs to change? Let’s say result['response']['body'] changes to result['response']['data']. The code above will no longer work, so it needs to be changed everywhere the result object is used.

Perform with DTO

First, create a new class called DTO::Domain. This class is responsible for translating the data returned from the API into an object.

DTO::Domain

class DTO::Domain
  attr_reader :type, :name, :price, :status

  def initialize(data)
    @type   = data['body']['type']
    @name   = data['body']['name']
    @price  = data['body']['price'].to_f # the API sends the price as a string
    @status = data['body']['status']
  end

  def available?
    status == 'Available' && price < 15
  end

  def taken?
    status == 'Taken'
  end
end

With DTO::Domain, we can instantiate a new object by passing in the data returned from the API.

result = JSON.parse(result)
domain = DTO::Domain.new(result['response'])

All the information returned from the API is encapsulated inside the DTO::Domain class. Only the needed information, plus some additional domain logic, is included in the object. If a field in the API changes, only the DTO::Domain class needs to change.

domain.name
# => 'some-domain.com'
domain.taken?
# => false
domain.available?
# => true
domain.price
# => 11.0

The code is now much cleaner and more maintainable, testable, and scalable.
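
To try the idea end to end outside of Rails, here is a self-contained sketch. It is a variant of the class above, not the original one: the from_json factory method and the keyword arguments are my own additions for illustration, and the JSON string is a trimmed copy of the sample response.

```ruby
require 'json'

module DTO
  # Hypothetical variant of DTO::Domain with a JSON factory method.
  class Domain
    attr_reader :name, :price, :status

    # The only place that knows where the fields live in the response.
    def self.from_json(json)
      body = JSON.parse(json).fetch('response').fetch('body')
      new(name: body['name'], price: body['price'].to_f, status: body['status'])
    end

    def initialize(name:, price:, status:)
      @name = name
      @price = price
      @status = status
    end

    def available?
      status == 'Available' && price < 15
    end
  end
end

raw = '{"response":{"body":{"type":"domain","name":"some-domain.com","price":"11.00","status":"Available"}}}'
domain = DTO::Domain.from_json(raw)

puts domain.name
puts domain.available?
```

Using fetch instead of [] is a design choice in this sketch: if the API renames body, the DTO fails loudly with a KeyError in one place instead of passing nil values around.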

Reduce Code Duplication With Metaprogramming

Jun 16th, 2015 8:53 pm

Metaprogramming is a wonderful tool for producing DRY (Don’t Repeat Yourself) code in highly dynamic languages. It is commonly defined as “code that produces code”. Metaprogramming reduces the amount of unnecessary code and keeps your code clean, DRY, and easy to scale and maintain.

Let’s take a look at the example below:

Without Metaprogramming

class Domain < ActiveRecord::Base
  validates :status, presence: true, inclusion: { in: %w(pointed new subdomain) }

  def self.pointed
    where(status: 'pointed')
  end

  def self.new
    where(status: 'new')
  end

  def self.subdomain
    where(status: 'subdomain')
  end

  def pointed?
    status == 'pointed'
  end

  def new?
    status == 'new'
  end

  def subdomain?
    status == 'subdomain'
  end
end

It is obvious that the example above is a maintenance issue in the making: the code is not DRY, and scaling it takes more work.

Let’s say that we want to add another domain status, 'transferred'. Two more methods would need to be created: self.transferred and transferred?.

Now let’s see how we can resolve this issue with metaprogramming.

See example below:

With Metaprogramming

class Domain < ActiveRecord::Base
  STATUSES = %w( pointed new subdomain )
  validates :status, presence: true, inclusion: { in: STATUSES }

  # create self.pointed, self.new, self.subdomain
  class << self
    STATUSES.each do |status_name|
      define_method "#{status_name}" do
        where(status: status_name)
      end
    end
  end

  # create pointed?, new?, subdomain?
  STATUSES.each do |status_name|
    define_method "#{status_name}?" do
      status == status_name
    end
  end

end

Now the code has been refactored to use metaprogramming, with define_method. No matter how many domain statuses need to be added, only the STATUSES constant needs to change, and everything will keep working. The functionality of the With Metaprogramming example is exactly the same as in the Without Metaprogramming example. All the methods are created automatically at runtime.

As a tip, when writing code with metaprogramming, add some comments about what the code does so that it is easier to understand. One caveat with these particular status names: a class method called new replaces Domain.new, so in real code a status named 'new' needs a differently named scope.
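
The predicate half of the technique can be tried in plain Ruby, without Active Record. In this sketch DomainRecord is a hypothetical stand-in for the model and the status names are only illustrative; the class-level query methods are left out because they depend on where.

```ruby
# Plain-Ruby stand-in for the model; no Active Record involved.
class DomainRecord
  STATUSES = %w[pointed parked subdomain transferred].freeze

  attr_reader :status

  def initialize(status)
    @status = status
  end

  # Creates pointed?, parked?, subdomain?, transferred?
  STATUSES.each do |status_name|
    define_method("#{status_name}?") do
      status == status_name
    end
  end
end

record = DomainRecord.new('transferred')

puts record.transferred?
puts record.pointed?
puts DomainRecord.public_instance_methods(false).sort.inspect
```

Adding a status to STATUSES is enough to get its predicate method, which is the whole point of the refactoring.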


Someth Victory

Ruby, Rails, and Javascript Developer
