Design Patterns

Why I love the repository pattern

Persisting application state is not an easy job, to say the very least. We have an entire cottage industry of vendors who promise to simplify these issues for us. But there is a long-existing design pattern that which provides a logical separation in your application to help tackle this problem, without needing to commit to a specific database technology or library.

The repository pattern aims to isolate the complexities concerning the persistence of data (complexities which are often vendor specific) from the the business-logic of your application.

In your app, you will usually:

  1. 🔍 Find some specific entity in question
  2. 🛠️ Perform some operation on that entity, obeying a set of rules to ensure that it doesn’t enter an invalid state
  3. 💾 Persist that entity

Steps (1) and (3) both concern communication with the storage layer (e.g. the database) that you are using. It will involve mapping the data to/from storage and into objects/data structures that your app understands.

Step (2) is where the business-logic lies. A rule like "a customer’s current account balance cannot exceed its overdraft" should sit here, not in the database. These rules could be quite complex — especially if there are many rules that interact with one another — and we don’t want them to bleed into the already complicated logic surrounding (1) and (3).

Imagine the above in terms of a physical system, rather than an application. Accessing your entities is like pulling them out of a filing cabinet 🗄️: you (1) find them by a simple criteria (usually a name/ID), (2) change whatever is needed on that document and (3) then put it back into place where they were.

Repositories hence provide a simple, collection-like interface for accessing and persisting entities. E.g, if I have Accounts in my domain model, then I can create an AccountRepository:

namespace Domain;

interface AccountRepository
    public function findById(string $id): Account;
    public function save(Account $account): void;

Then in my application/command model, I don’t need to care about the implementation details of the database that is being used to persist my accounts. I can just use the repository:

namespace Command;

use Domain\AccountRepository;

class WithdrawFunds
    private AccountRepository $accounts;

    public function __construct(AccountRepository $accounts)
        $this->accounts = $accounts;

    public function __invoke(string $accountId, int $amount): void
        $account = $this->accounts->findById($accountId); // (1)
        $account->withdraw(new Money($amount));           // (2)
        $this->accounts->save($account);                  // (3)

and the implementation details are hidden behind the concrete class that I choose to use. I could have an ArrayAccountRepository that just stores them in an array – this would make it very easy to write unit tests for WithdrawFunds that don’t depend on database access, for instance. In a production environment, I could have PostgresAccountRepository, which communicates to an instance of a Postgres database. This implementation would contain all of the SQL queries needed to find the account data, and map it into an Account object. It does so without changing anything about how WithdrawFunds is implemented. I would just pass it a PostgresAccountRepository instance instead:

$withdraw = new Command\WithdrawFunds(new PostgresAccountRepository(
    // config parameters for Postgres communication...
$withdraw("my-account", 100);

So why do I personally use it?

I’ve always loved databases growing up. Part of that is due to my parents tech-related careers. My mum’s role as a data analyst involves building complex SQL reports for clients, and my dad’s role as an IT systems engineer resulted in me having access to lots of "enterprise" software in the house as a kid. Between them, I was able to work with Adobe Go-Live and Microsoft Access way back in the early 2000s to build my own (locally hosted, never deployed!) websites and store data about them. I went on to take a databases course in my second year at university, and enjoyed it thoroughly.

I say all of this to convey to you that I’m not some luddite who hates databases on principle or is afraid of integrating with them or wants to avoid writing SQL. But when building a proof-of-concept, I often get frustrated when I’m trying to quickly hash out the working logic; I get dragged down thinking about how the database schema is supposed to look before the app is even in a suitable shape to start integrating one. Sometimes the thing that I’m building may not even need a database at all when I’m done (at least, not a relational one).

What the repository pattern allows me to do is focus on the business logic of the application first, and defer the decisions on persistence until much later in the project. This is exactly the benefit that Uncle Bob has proselytized in his Clean Architecture series for years now, but it specifically comes out of an intentional decision to separate persistence logic from the other logic in your system.

It also makes my domain and application model much easier to test. Since I can create test implementations of the repositories very quickly, this encourages me to actually write those unit tests for my code, since they’re quick-wins instead of burdens. I can even use the test implementations of the repositories for local persistence to start out with, and gradually introduce the type of storage that is most relevant to the stage in the project that I’m in.

Finally, by deferring the decision on the database until much later in the project’s lifecycle, I can make a much more informed decision about the storage choice than I would have to do up-front. All too often, we reach immediately for a relational database system (RDBS) simply because these are what we tend to find ORM vendors design their solutions around. We couple other aspects of the system into these technologies on the basis that it makes it "faster" to get to a complete system, but don’t give ourselves enough time to think if said solution is appropriate to our use cases, or to think about the consequences of early coupling.

Sometimes document storage is more suitable (e.g. Redis, MongoDB); other times we need something that is better tuned for plain-text indexing (e.g. ElasticSearch), or graph-specific queries (e.g. GraphQL) or keeping an immutable audit-trail of events (e.g. Kafka, Event Store). When you’re building CRUD apps, then maybe RDBSs are fine for your use cases, but I like to have the confidence in making that decision after I’ve spent more time fleshing out the use cases and technical requirements of the system.