Video Discussion: Doctrine Migration

This recording, from the February 7, 2023 Community Call, contains a presentation about VuFind®'s use of a database migration layer and plans to migrate from the deprecated Laminas\Db library to the more widely-used Doctrine platform.

Video is available as an mp4 download or through YouTube.

PowerPoint slides are also available.

Transcript

This is an edited version of an automated transcript. In the near-future, I hope to edit this conversation to attribute to each speaker. Apologies for any errors.

So today, I wanted to talk about a project that's going to be a big part of VuFind release 10, which is migration to doctrine as a new database abstraction layer. And in order to do that, I also wanted to provide a little context about why VuFind uses a database and how we currently interact with the database so that some of the changes that are coming up will be more apparent.

So first of all, why does VuFind use a relational database at all? We're primarily interested in searching Solr or using external APIs. So why a database? Well, because even though VuFind relies on lots of external data, it does need to keep track of some persistent things internally. Primarily, user accounts, even if you're not using the database as the primary source of user data, if you're using an external authentication method, VuFind still has to keep track of users so that it has something that it can associate with user-generated content, which is also stored in the database. So things like tags, favorites, comments, etc. We also use the database to track search history. And there are some optional things that we can use the database for like storing sessions, or tracking record changes. As you can see, some of these things are optional. A lot of these things can be disabled if you don't actually want to track them in the database. But there are good useful reasons for having this data in VuFind in many situations.

So having established that we need a database, why use a database abstraction layer? There are a few benefits for this. One is, of course, to protect against vulnerabilities. SQL injection is a really common way of breaking into sites. And if you use an abstraction layer that forces proper escaping and so forth, you're less likely to accidentally leave some vulnerability somewhere. Abstraction layers also make cross-platform support a lot easier. So by using abstraction, VuFind is able to support MySQL or Postgres using the same code. That would be harder to do without an extra layer in there. And it also ensures some consistency of style or approach throughout the code base, because there are many ways to interact with databases. But if we pick one particular way, then code in one place is more likely to look like code in another place, which makes everything easier to maintain.

So currently, our database abstraction layer is something called LaminasDB, which used to be called ZendDB. So why did we choose this? Well, this was originally part of Zend Framework. And when we converted VuFind's code to use Zend Framework more than a decade ago, the PHP ecosystem was a lot different. And buying into a framework was more about using that framework for absolutely everything, because we hadn't yet gotten wonderful things like Composer and the current ecosystem of packages. So LaminasDB was selected largely because it was there and it just was the thing to use, which is not to say that it's a bad library. It, of course, offered all the advantages of abstraction layers that I already talked about. And it has a kind of neat object-oriented query builder that has some benefits like making it possible to extend a method that builds a query and then inject parameters without having to do crazy string manipulation. So, you know, LaminasDB wasn't bad. So why are we talking about using doctrine now? Well, the big reason is that LaminasDB is no longer really being maintained and the LaminasDB community suggests you should use doctrine. And, of course, doctrine is a library that's been around for a very long time. It's widely adopted and it has good support. So it is a logical replacement if we're going to have our existing layer taken away from us. And, of course, it still offers all of the advantages of using an abstraction layer. So let's talk about how VuFind's current architecture with LaminasDB is set up. There are currently two types of classes that manage interaction with the database. There are what we call table gateways, which are, you know, classes representing tables in the database. So there's a one-to-one relationship between these classes and the database tables. And the classes contain methods that are used for fetching data from the database and to some extent writing data back in again. The table gateways are complemented by row gateways, which, of course, represent rows in the database. And these also contain methods for reading and writing data. So a common pattern is that you call a method on a table gateway that returns a set of row gateways back to you. And then you can manipulate these row gateways.

And each one has a save method that you can use to write it back into the database. Some of our custom row gateways also have methods that do all kinds of stuff like pull related things from other tables or whatever. So there's a lot of power, but it's also a little bit unstructured and potentially confusing. So I wouldn't say that this is a strong design that we love. I already mentioned that query building is done through a fluent object oriented interface, which is kind of neat, but also sometimes kind of unreadable. And the data that comes back through the row gateways and table gateways basically behaves like PHP array objects, which means that each data element can be interacted with as if it were an array or as if it was an object, more or less interchangeably, which is really convenient, but also tends to result in inconsistent code because we have some templates that treat things as arrays and some things that treat things as objects. And there's not a lot of rhyme or reason to it. So this is something else we could stand to clean up.

So what am I proposing to change if we switch to doctrine? I still think we want to have two types of database classes, but I think there's a better division now. So instead of row and table gateways, I think we should have database services. A database service is just a class that contains logically related methods that interact with the database and handle all of the read and write functionality. So probably it makes sense to start off with our database services being more or less equivalent to the table gateways in the old code. But in this model, we're no longer locked in with a one-to-one correlation between table and class. So instead we can think about things more logically, and we can have classes that handle related functionality, regardless of what parts of the database it's interacting with. And I think that this will just be a little bit cleaner in the long run. And instead of row gateways, we have entity classes, which is a core part of doctrine. So these represent table rows, just like the LaminasDB row gateways do. But they also contain annotations, which explain the structure of the database.

This is a lot of how the doctrine framework actually operates, is that by examining these entity classes and looking at the annotations within them, it understands how to read and write the database sort of somewhat magically. But in any case, the entities are just there to represent data and describe the structure of the database, but they do not contain action-oriented methods. You're not going to save things to the database by calling a method on an entity. You're going to pass an entity to a database service, and the service is going to do that work. And I think that by having a stricter separation of concerns here, we'll end up with cleaner code.

Additionally, query building is done through something called DQL, doctrine query language, which is basically just SQL, except instead of table names, you put in the names of entity classes, and then the doctrine code works its magic to translate that into real SQL. There's also an object-oriented query builder that can generate DQL, so that kind of thing is still an option. But I think we'll only use it when we need to use it, as opposed to all the time, because for simple things, just writing a little string of DQL is probably more straightforward and readable. And in the end, all data is either going to be represented with entity classes or simple arrays, no more array objects, which will make it much more clear how to interact with the data. In fact, I'm thinking we might consider putting a two-array method on all of our entity classes so that everything can just be treated as an array, and then we know what we're doing, templates can be more consistent, we can use better typing, et cetera.

So advantages of migration. I've already talked about some of these as I've been proposing the new structure, but to highlight a few more, first of all, we have to do it. At least we have to leave LaminasDB because it's an end of life, we can't stay there forever. I think that switching to the idea of database services instead of table gateways requires us to rewrite a lot of code, which lets us move away from some very old early view find code that we still have in place that's a little bit more magical than I would like.

Just as one example, all of our controllers inherit a get table method, which they use to get access to table gateways, and it all happens with sort of low-level calls to service managers, and it's just not very explicit. I'm hoping we can move to more dependency injection that makes the relationships between the parts of our code more clear and easier to understand as we're refactoring all of this. As I said before, that's also an opportunity to use more consistent data representations in our templates. And I think this is an opportunity to greatly increase our test coverage because if we're using services to do all of our reading and writing instead of this sort of mix of doing some work on row classes that come out of table gateways and some work on the table gateways themselves, et cetera, it gets really hard to mock and simulate that. But I think we can come up with cleaner interfaces that are easier to test, and that potentially could greatly increase our project's test coverage, which is not currently one of our greatest strengths.

Of course, there are challenges to doing all this in addition to advantages, the biggest being that this is a huge volume of work. We have a lot of database code in VuFind, even though VuFind is not the most database-driven of applications. So we're going to have to rewrite a lot of code. We're going to have to test a lot of things. It's going to take time. I'm hoping that the work can also be distributed among a few people because it's a lot for one person to do. Additionally, there are some advanced features that may be particularly tricky. For example, the way that right now the user login system, those user objects are tied up with user table row gateways, there's going to be a lot to refactor to make sure that all works cleanly. We also have our database upgrader, which is part of the upgrade tool that automatically examines the database schema and applies changes to make it match VuFind's latest released schema. That's using really low-level LaminasDB stuff that may not have exact equivalence in Doctrine, maybe some work, though this challenge may also be an opportunity because right now, for example, the database upgrader is totally linked to MySQL and Postgres users have to do things in a more manual way.

Perhaps we can use Doctrine's DBAL library database abstraction layer to do some of the same things in a more platform agnostic way that will require some investigation.

In any case, that's sort of the high-level and maybe not so high-level look at what we have and what's coming.

I also have a link here to the pull request, which has some work in progress on getting Doctrine to work.

And though Dharma is not here today, he's also working on making some progress on that pull request.

And of course, you can contact me anytime if you have any thoughts or questions or concerns.