Is there an ER Diagram and Data Dictionary?

PaulHop · April 4, 2022, 12:45pm

Does Duplicati have an ER Diagram and Data Dictionary for its database?

I have just made performance improvements to one of Duplicati’s queries, a task that would have been far easier if I’d found documentation on the structure and usage of the database. Does Duplicati have an ER Diagram and Data Dictionary, if not I’ll start the development of one.

Duplicati’s database structure would benefit from some work. The work would improve the performance, reliability and maintainability of the application. I can commence this work, but I’ll only commence it if others support the work, I’d hate to waste my time if the work is never released. Would others support overall improvements to Duplicati’s database structure, database integrity, and query design?

ts678 · April 4, 2022, 4:16pm

Improvements to performance reliability and maintainability historically almost always get released, provided there are not downsides, but you already covered maintainability, so perhaps there aren’t.

An amusing exchange on maintainability, readability, and correctness of rewrite is in a pull request.

It would ideally be maintainable by volunteers who aren’t always top database experts unless we’re fortunate enough to always have them volunteering. Another question would be how standalone the database work is. If it gets into big revisions in the C# design or the destination files, risks increase.

I’ll describe the historical model, although this has hit snags recently due to a limit from a volunteer.
There has been a code signing certificate issue, but really what we’d like is a new release manager,
exact duties TBD, but it probably includes doing release notes, releases, and perhaps moving code.

Generally a Canary release is the first after automated testing done at pull request submission, then willing volunteers on that channel run the release. After enough confidence, it would move to a Beta.

It’s certainly a bad thing if breakages get this far, exposing the entire user base, so care is needed…
I’m trying to get you and @tarianjed working together, and I now notice that I forgot about @vmsh0

Database rebuild did an analysis, followed by my comments and my concise DB writeup at end here.

Not exactly (that I know of). There are some less formal pieces around, such as the above and the Wiki:

Documentation for the local database format

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Database/Database%20schema/Schema.sql

Anything that can be done to smooth the entry path for volunteers would help. It’s difficult to get started. Maintainability of documentation matters too. I think the above Wiki doc is slightly out of date already… Anything that needs special software to create might be more difficult for even willing helpers to update.

What’s not to like about that (if it can be done safely)? If by support, you mean help with, resources are unfortunately limited – and some volunteers would deeply benefit things, but I mentioned possible help.

The original Duplicati author was not a database professional, participates less these days, and would probably be thrilled (I know I would be) if a team of DB volunteers could make headway on such things.

tarianjed · April 4, 2022, 4:47pm

As long as you track your work with a github issue so others can collaborate and prevent double work I have no issues.
This will have a pretty large code refactoring requirement which might also overlap with the push to update the code to .net 5 (.NET 5 Migration · GitHub).

If you read this forum thread you can learn from some of the research and work I have done so far:

PaulHop · April 5, 2022, 12:08am

I 100% agree, which is why I’m proposing it. I push for a data dictionary and ER diagram on all projects I work on. Duplicati’s database is fortunately extremely simple but after I opened this topic I read an amusing post where someone was trying to work out how the database worked and contributors were trying to piece it together.

A database, like all code has been designed for a particular reason. It’s important to understand that design intent to understand why the database is designed the way it is and how the database should be used. This is what’s captured in the data dictionary.

PaulHop · April 5, 2022, 12:24am

I’ll answer this from the model “maintainability” perspective… I develop data dictionaries using Visual Paradigm. The ER diagramer is in Community Edition so is available for all. In any case I’d do the first version in VP, this would produce the initial data dictionary as a Word document and the ER diagram as an image, this would give us a good start. The Word document text is easily maintained, we’d need to look at whether we continue to use VP to maintain the model… normally I’d say “yes” but it’s a very simple model so we could use something else.

I suggest that we do get into the C# code… but not all at once . The database’s structure can be improved and then views can be used to expose the old tables naming these views the same as the current table names… therefore breaking no code. Slowly the C# code would moved to the new tables.

I had a look at the C# code last night and it’s going the way that all projects go in this position… everyone is writing their own query, hitting the database for every little piece of information. The result is a very chatty database interaction which kills performance and litters the code with snippets of SQL. I suggest this be reviewed as we understand how the database is to be used.

I totally agree that care is needed, the database is the core of the project. I would love to work with @tarianjed and @vmsh0 on this moving forward. For now, I might knock a document together as a quick start and we can go from there.

vmsh0 · April 5, 2022, 6:06pm

I’m happy to see more and more people stepping up to this task! I’m also happy to be kept in the loop with the mentions on the forum and the issues, and I’ll gladly give my two cents here.

Just for a matter of future-proofing the project I would suggest to do this using a more flat format such as Markdown.

I see two gotchas here:

Views don’t allow inserts at all in SQLite, so they don’t help for queries that write data. Those parts would have to be refactored all at once.
It’s not trivial to design views that give exactly the same behaviour as the original tables - especially while retaining similar performance. Since afaiu database size is an explicit aim of the project, materialized views are surely not an option.

If someone is willing to put the time into it, maybe the best model of developement would be to just branch out. Based on my previous experience (and especially mistakes) with refactoring a large codebase, the process could look a bit like the following:

Branch and design the new data model with space and time contraints in mind
Identify all parts of code that interact with the database, and explicitly document what those interactions achieve (critical!)
Decide whether it’s time to adopt an ORM (just putting it on the table, since no one mentioned that so far)
Implement the new design throughout the codebase, using the documentation that was produced earlier
Throughout the whole process, re-integrate master regularly into the development branch (critical!), taking care to react specifically to commits that deal with database stuff (which ideally would be minimized during the time of the refactor)

+1

I’m not making any promises, I have a pretty busy period ahead! But I’ll be glad to help however I can in what time I will have available.

PaulHop · April 6, 2022, 3:48am

Agreed, we’d need to look at this. I’ll use VP initially but long term we’d need to port the doco elsewhere.

Agreed, thank you, I missed that … once we have an idea as to where we are heading (i.e. the amount of change) then we’d need to look at this. I’m not concerned at this stage.

I mainly deal with SQL Server but from what I’ve seen SQLite’s performance is reasonable. I agree with what you’re saying… I’m pretty good at queries but we’d need to look at it. This database is very small both in terms of number and complexity of tables… anyway, we’d need to look at it as we moved forward. No, I wouldn’t be looking at materialized views.

I’m happy with (most of - see caveat below) this. We’ll look at the degree of change first and then go from there. The important thing at this stage is to document the current database design, understand it, come up with improvements then work out how to implement them. The code is starting to get query duplication through it, there’s cleanup to be done… but stage one first.

Please don’t, I beg of you . I have worked on some very large projects implementing ORMs and none of them have performed well. The reason is that the developers create an “Ok” database then throw an ORM at it and think “job done” they’ll just use the ORM. The problem is that ORMs create a very chatty database interface and all that chat has a significant overhead. The ORM isn’t the problem as such but it encourages a problem. The developers still need to develop efficient queries that use the database with the intent to which it has been designed, they still need to understand the database and what they are doing to it… the ORM hides all of that. The typical approach is that “we’ll just fix the areas that don’t perform well”, but the ORMs transactions created in the business layer combined with a chatty interface are increasing lock times in areas that are perceived as fast creating lock contention. This is a small database, if we want the database to fly then I suggest we avoid ORMs :-).

Not a problem. I have so much work on that I shouldn’t have suggested it at all so I know where you’re coming from… but I can see areas for improvement so I’m going there.

I’m on “holidays” for the next week. I usually get a lot of work done when I’m on holiday (I’m a sick puppy ) so I’m planning on starting some doco with what I think might be happening at the database and I’ll send it out for comment. If time permits I might also put together some suggested improvements based on my understanding… we’ll see how we go.