Magical ORMs

Like many things in software nowadays, you have to ask yourself a question: should I use a nice existing library or roll my own? I hate to date myself, but a lot of great tools have appeared in web development over the past decade, which makes this much harder to decide. I prefer to roll my own simple solution. Here is why.

Most projects these days seem to use an Object-Relational Mapping, or ORM. You can read more about it here (https://en.wikipedia.org/wiki/Object-relational_mapping), but simply put, it's about mapping relational data to objects, usually in an Object-Oriented Programming (OOP) style. That is all an ORM is at its core. However, ORMs also provide a way to query, insert, update, and delete data in the database, not just a way to map relational data to objects. A few of the large ORMs are ActiveRecord, SQLAlchemy, the Django ORM, and Entity Framework.

Benefits

The benefits of using a solid ORM like SQLAlchemy are ease of use, documentation, and development speed. Using one of these ORMs can get a project up and going very quickly. The developers do not need to worry much about designing the database or how it works; they can focus on the application and its feature set.

Drawbacks

Many ORMs have a vast array of features. This might seem good, but it makes them overly complex, and most of the features are just not needed. I am a huge fan of keeping things lean and simple.

A major drawback of using one of these libraries is the technical debt it will grow into. A project starts off trying to get a minimum viable product out as quickly as possible, so a popular ORM is used. Everything is fine and dandy until the load starts to increase. The data layer of the application starts to suffer due to poorly written queries by developers who don't know how the database actually works. They end up accidentally running hundreds of queries per web request to gather the data needed to service it. The causes range from bad relational database design, to how the chosen ORM library builds its queries, to different developers writing different parts of the response handler. These issues are usually papered over by giving the database more processing power; the real fix is to clean up and optimize the data layer of the application.

Part of this comes from the fact that the ORM masks the actual queries it is performing. It's very easy to use an ORM to write something that ends up pounding the database with hundreds of queries when all the developer (should have) wanted was a simple select. Not to mention how easy it is to write Cartesian product queries, which will bring a database to its knees. Calls to the database are expensive regardless of the hardware and networking capabilities; the database is a source of data outside of the application.

Let's look at an example of using an ORM to build a quick query, using ActiveRecord this time. First, let's query for all active users:

active_users = User.where(active: true)

Pretty simple. Now let's say a user can have multiple email addresses, making this a one-to-many relationship. We want to list all of the users' email addresses.

active_users_emails = User.where(active: true).includes(:emails)

Here is where things can get tricky and magical. If done incorrectly, you can run X queries where X is the number of users. It would start like this:

-- Get the active users
SELECT id, name FROM users WHERE active = 1;

Then ActiveRecord would loop over each user to get their email addresses. If we had 5 users, that would be 5 more queries. How would this look with thousands of records?

SELECT user_id, email FROM emails WHERE user_id = 1;
SELECT user_id, email FROM emails WHERE user_id = 2;
SELECT user_id, email FROM emails WHERE user_id = 3;
SELECT user_id, email FROM emails WHERE user_id = 4;
SELECT user_id, email FROM emails WHERE user_id = 5;

To do this without ORM query fanciness, it would look like this, in one query:

SELECT
  u.id,
  u.name,
  e.email
FROM users u
JOIN emails e ON u.id = e.user_id
WHERE u.active = 1;

I also want to mention that ActiveRecord is really smart; in most cases it will probably be able to do this in one or two queries. However, with complex relationships, and without knowing how ActiveRecord works under the hood, it's easy to accidentally run into issues like this. I don't want to bash ActiveRecord, just to show how easy it is to accidentally overload a database using these tools.
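To make the N+1 pattern concrete outside of Ruby, here is a minimal Python sketch using the standard library's sqlite3 module. The table names match the SQL above; the setup data is invented for illustration. Both functions return the same result, but the first makes one round trip per user while the second makes a single round trip:

```python
import sqlite3

def setup():
    # In-memory database with the users/emails schema from the SQL above
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER)")
    db.execute("CREATE TABLE emails (user_id INTEGER, email TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?, 1)",
                   [(i, "user{}".format(i)) for i in range(1, 6)])
    db.executemany("INSERT INTO emails VALUES (?, ?)",
                   [(i, "user{}@example.com".format(i)) for i in range(1, 6)])
    return db

def emails_n_plus_one(db):
    # One query for the users, then one query per user: N+1 round trips
    users = db.execute("SELECT id, name FROM users WHERE active = 1").fetchall()
    result = {}
    for user_id, _name in users:
        rows = db.execute("SELECT email FROM emails WHERE user_id = ?",
                          (user_id,)).fetchall()
        result[user_id] = [r[0] for r in rows]
    return result

def emails_single_join(db):
    # The same data in one round trip with a JOIN
    rows = db.execute(
        "SELECT u.id, e.email FROM users u "
        "JOIN emails e ON u.id = e.user_id WHERE u.active = 1"
    ).fetchall()
    result = {}
    for user_id, email in rows:
        result.setdefault(user_id, []).append(email)
    return result
```

With five users the difference is six queries versus one; with thousands of users it is the difference between a fast response and a pounded database.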

Personal Experience

With over a decade of web development experience, I have seen a lot of things in this area. Here are a few approaches I have seen, and a few that, unfortunately, I have used.

SQL Soup

I call this SQL soup because the queries and the logic are all mingled together in the same code base. There is no definition of layers in the application; everything is done in one really messy place. Here is a quick example of some SQL soup in C#/.NET:

var sql = new StringBuilder();
sql.Append("SELECT id, name, email FROM users WHERE 1=1 ");
if (request.Active)
{
  sql.Append("AND active = 1 ");
}
if (request.Name != null)
{
  sql.AppendFormat("AND name LIKE '{0}%' ", request.Name);
}

return Db.Execute(sql.ToString());

This is really gross, and in addition it's wide open to SQL injection. It's also very difficult to debug: to get the full query you have to somehow pull the SQL string out so you can run it against the database to figure out what is wrong with it. Another common variant of this pattern uses string concatenation to generate HTML instead of SQL. This sort of pattern is hard to maintain and is prone to lots of errors.
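The injection risk, at least, is fixable even in this query-building style: bind parameters instead of interpolating strings. Here is a sketch of the same conditional query builder in Python with sqlite3 (the function and column names are illustrative, not taken from any real code base):

```python
import sqlite3

def find_users(db, active=None, name_prefix=None):
    # Build the SQL and a parallel list of bound parameters; the driver
    # escapes each value, so user input never becomes part of the SQL text.
    sql = "SELECT id, name, email FROM users WHERE 1=1"
    params = []
    if active is not None:
        sql += " AND active = ?"
        params.append(1 if active else 0)
    if name_prefix is not None:
        sql += " AND name LIKE ?"
        params.append(name_prefix + "%")
    return db.execute(sql, params).fetchall()
```

The structure is the same as the StringBuilder version, but a malicious `name_prefix` is just a value the driver quotes, not executable SQL.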

Stored Procedures

Instead of writing our SQL inside our code as constant strings, we can put it in the database as stored procedures and call those. An example Oracle stored procedure would look like this:

CREATE PROCEDURE PR_USERS_BY_ID
(
   P_ID     IN  INTEGER,
   P_RESULT OUT SYS_REFCURSOR
) AS
BEGIN
   OPEN P_RESULT FOR
     SELECT
       id,
       name,
       email
     FROM users
     WHERE id = P_ID;
END;

Then we can call it and it will return the user record if it exists. This is better in certain respects, but it also locks us into supporting a single database, whether that is Oracle, SQL Server, MySQL, or something else. The example above is written in PL/SQL, which is Oracle-specific syntax; stored procedures are not generic SQL and are different for each database technology, which locks us into supporting only that database.

Using stored procedures like this makes the code easier to read and manage, but it still has its downfalls. It's difficult to keep the code under source control since it lives in the database. This can be alleviated by storing the SQL in .sql files in source control, but that requires a way to keep the source files in sync with what is actually in the database. On top of that, it increases deployment complexity.

Things get really complicated when the stored procedures are inconsistent: they are named poorly and try to do too many things. Ever heard of a 20,000-line stored procedure? I have, and it's not fun. pr_QueryAllTheThings.

A better approach is to make a stored procedure for each type of CRUD operation per table. Each one is specific to performing that operation and that operation alone, which fixes the pr_QueryAllTheThings stored procedure.

The data layer or ORM is in charge of calling the stored procedure for each CRUD operation and mapping the results to an object. This adds more overhead because it's not as dynamic: when you need to add a column or update a table, you have to edit all of the related stored procedures. A really cool side effect of this approach is that the data layer can set a flag in the database enabling row-level security. This can be good depending on the situation and how the application is architected. Oracle has a great row-level security system where it will magically append to the WHERE clause of the query, limiting the records the user can see. Of course, magic is not always better; it can be difficult to debug and for other developers to learn.

Smart Database

I say smart database, but it's not smart to use something like this. One application I worked on lived almost entirely in the database: the queries and stored procedures would actually generate HTML forms from data in the database. The actual application layer was really simple, until the requirements changed so much that the generic form generation was insufficient. That led to adding band-aids to the code without ever really fixing the root issue. Sometimes being too dynamic can be really bad; things just don't always fit. It was a nightmare to maintain an application where all the developer did was update a field in the database to change a form in the front end. Source control is hard this way. You could put the insert SQL scripts for the data into a file and keep it that way, but it's messy.

How we roll at Authentise

At Authentise we have decided to roll our own data layer in our microservices, which we call the platform layer. Since we use the great Python framework Flask for our web services, we have chosen SQLAlchemy as our tool for running queries against the database. However, we intentionally do not use the full ORM that SQLAlchemy provides; we just use SQLAlchemy Core. This gives us its great features for interacting with the database without its magic ORM capabilities. There are a few reasons we have opted to go this route. First, we can craft exactly the queries required, which look very similar to the actual SQL. Second, instead of stored procedures, we can leverage SQLAlchemy's abstraction layer to easily support more than one database service. To us, SQLAlchemy Core is a way to generate SQL queries in a Pythonic way. Instead of fancy class objects, the platform layer uses plain Python dictionaries and can assemble a query through multiple layers.

Things have become more complex and actually more generic than when they started out. But here is an example of a way to retrieve a record; it is easy to see that it queries a record in the database by its UUID (primary key):

def by_uuid(uuid):
    query = sqlalchemy.select([
        users_table.c.uuid,
        users_table.c.name,
        users_table.c.email,
        users_table.c.created_at,
        users_table.c.updated_at,
    ]).where(users_table.c.uuid == uuid)
    result = engine.execute(query).first()
    return dict(result)

Other functions would be update, create, delete, and other select functions that retrieve specific records. This is not as dynamic as using the full-blown SQLAlchemy ORM, but it consolidates, simplifies, and is easier to maintain than letting the ORM do whatever it wants with the database. We can even assert in our tests that we run a constant number of database queries when servicing requests.

ORMs are overkill for simple REST microservices. REST details are out of scope for this post, but when REST services are designed correctly, the SQL queries that need to occur get really simple: they are basically just CRUD actions. This keeps the data layer in our services simple. A web request usually needs only one, maybe two, queries to the database to build its response. Looked at this way, the complexity of the data layer goes way down. Requests to, say, a user endpoint at https://users.service/users/ end up being a simple SELECT for a GET request, an INSERT for a POST, an UPDATE for a PUT, and a DELETE for a DELETE. This basically maps each request to a single database query. Permissions and authorization make it a little more complex, and if we need complex reports we can create a specific endpoint for that.

The usual GET request requires a unique id in the URL and returns a single record based on that id; for example, https://users.service/users/42 will return the user with an id of 42. At Authentise we also have a concept of a LIST action on our REST endpoints alongside GET. This allows a GET on the collection endpoint to return a list of results instead of just one. The list endpoint for users would be https://users.service/users/, which returns a list of users instead of the single user 42. Using the list endpoint we can provide filters in the query string to filter the data. Say I want to list all the users that are active; I can use the list endpoint for this: https://users.service/users/?filter[active]=1. How does this relate to the data layer? We have a fancy function that maps the query-string filters to a dictionary, so we end up with an object that looks like this:

filters = {
  "active": True,
}
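That mapping function is not shown in this post, but a minimal sketch of one (my guess at the behavior, not the actual Authentise implementation) could look like this. Note that values come out of a query string as strings, so coercing "1" to True would happen in a later step, and repeated keys are collected into lists:

```python
import re
from urllib.parse import parse_qsl

# Pattern for query-string keys of the form filter[column]
_FILTER_KEY = re.compile(r"^filter\[(\w+)\]$")

def parse_filters(query_string):
    """Map filter[column]=value pairs in a query string to a dict."""
    filters = {}
    for key, value in parse_qsl(query_string):
        match = _FILTER_KEY.match(key)
        if match:
            # Collect repeated keys into a list of values
            filters.setdefault(match.group(1), []).append(value)
    return filters
```

For example, `parse_filters("filter[active]=1")` yields `{"active": ["1"]}`, which matches the list-of-values shape that a filter-applying function expects.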

Then the API layer can pass this into the platform layer. Like the by_uuid function, we have a by_filter function that applies the filters dictionary to the SQLAlchemy Core query, allowing users of the endpoint to filter the results. Here is an example of this; I also included a new version of by_uuid that uses our by_filter function:

def by_uuid(uuid):
    results = by_filter({'uuid': [uuid]})
    if not results:
        raise RecordNotFound("Could not find a capture record by uuid: {}".format(uuid))
    return results[0]

def by_filter(filters):
    # Convert UUID objects to strings for querying
    formatted_filters = chryso.queryadapter.format_filter(filters, {
        'uuid'        : lambda x: [str(u) for u in x],
    })

    # Use mapping function to map column names to filters
    filter_map = chryso.queryadapter.map_column_names([Capture], formatted_filters)

    # Apply filters to query
    query = chryso.queryadapter.apply_filter(select([Capture]), filter_map)

    # Execute Query
    results = engine.execute(query)

    # Return results as dictionaries
    return [dict(result) for result in results]

def _apply_filter(select, column, value):
    if isinstance(value, list) or isinstance(value, tuple):
        return select.where(column.in_(value))
    else:
        return select.where(column == value)

def apply_filter(select, _filter):
    query = select
    for column, value in _filter.items():
        query = _apply_filter(query, column, value)
    return query

def map_column_names(tables, _filter):
    _mapped = {}
    for column_name, value in _filter.items():
        column = _get_column_by_name(tables, column_name)
        if column is None:
            raise Exception(("Unrecognized column reference '{}'."
                " None of the tables provided have a column by that name").format(column_name))
        _mapped[column] = value
    return _mapped

Conclusion

Using an ORM can be a quick way to get a project up and running, but it can quickly become a crutch and lead to a lot of technical debt that will have to be paid (see the ActiveRecord example above). Using a simple data layer and architecture from the start avoids that debt. In the long run, the convenience of a full-blown ORM does not outweigh the benefits of rolling your own simple solution.