Brendan Abolivier

What a year, huh? (2023 Recap)

Brendan Abolivier — Wed, 31 Jan 2024 00:00:00 +0000

Hey there!

It’s been a while since I last posted anything on here. A couple of years actually. This time, though, I’m not writing about something techy (well, mostly), or a project. This time, it’s more of a personal update. 2023 has been a pretty eventful year for me, and even though I haven’t really done that in the past, and I’m not sure how interesting this will be for anyone except myself, I thought it was worth commemorating the milestones that happened in the past year on my own little corner of the World Wide Web.

Also, yes, I know, we’re already almost a month into 2024. But some say it’s still fine to wish someone a happy new year as long as it’s still January, so surely this is also fine 🙂

A rough start

2023 didn’t exactly start on a positive note. Around mid-December 2022, I learned that I had lost my job at Element following a round of redundancies. Although I’m not going to publicly blame anyone responsible for this decision, it would be wrong to pretend this did not come as a huge shock for me.

Element had been part of my life for over 4 years (over 5 if you start counting from when I was still an intern), and had been the only company I worked for since I decided to emigrate from my hometown in France to the UK. This meant I needed to rethink my life as a British resident, now that a significant part of it had been ripped away. Worry also quickly settled in, as even though I left with a comfortable severance package, money can flow pretty fast in the 2nd most expensive city in the world.

This meant job hunting could not wait. However, within a few weeks, it became obvious that I had been caught up in one of the most brutal crises this industry had seen in a while, with big corporations laying off employees left and right. The job market was overcrowded with more and more people looking for work, while more and more companies took the cautious approach and scaled back their hiring.

So things went on for a while without much success. Sometimes it became tricky to prevent the Baserow table I was tracking my applications in from being filled with red.

But still, I trudged forward, exploring ways in which I could find my next step. I did some freelancing to help Unwired Networks improve their X509 certificate verification and handling tooling (while writing open source!), and even worked a bit for a credit card company. There have been several crossroads that could have taken me to a completely different outcome than the one I ended up with (which we’ll get to in a minute), and although I don’t really want to share too many details here, I’m immensely grateful to everyone who has followed me throughout that journey, even if just part of it.

The big 60

At the same time, my love for cycling was picking up. I’ve already shared bits of it online, and lots of it privately: cycling has been something I’ve loved doing for a very long time now. As I suddenly found myself with some free time on my hands, and I was coming off a year or two during which I had slowly but surely been getting better at taking care of myself and getting into better shape, I thoroughly enjoyed spending time on my bike.

Towards the start of the year, I finally did something I had been pondering for a while: signing up for a cycling event. RideLondon had been happening close to me for quite a while at that point, and I thought it would be a nice challenge to take on one of their rides, signing up for 60 miles (even though I had initially settled on 30). This gave me extra motivation to cycle again, and to start doing longer and longer distances. I spent some amazing weekends cycling through the British countryside alongside a fellow cyclist friend (thanks a lot, Michael!), smashing pretty long distances in the process.

The event itself arrived in a timely fashion. Due to some personal events I won’t get into here, the month of May 2023 for me started with one of the worst mental health breakdowns I had ever experienced. While I spent most of the first half of the month barely leaving my flat, this upcoming deadline was what I needed to get out and train.

The ride was honestly one of the best events I’ve ever attended. While it was challenging, I surprised myself with how much fun I was having, crossing milestones while enjoying parts of the countryside I had never been through.

Crossing the finish line on Tower Bridge was probably one of the most memorable moments of my life. It’s difficult to put emotions into words, and I’m aware that a lot of people cycle longer distances during those events. But when I was cycling down Tower Bridge after spending close to 5 hours on my bike, I couldn’t help but think back to a time that wasn’t even that far away, barely 3 years back, when I was struggling with anxiety, had very low self-esteem, and in no way could be motivated to take much care of myself, let alone exercise. And that same me had just smashed 60 miles, 100 kilometres, of cycling in one day.

Spoiler alert: I’ve already signed up for RideLondon 2024, for 100 miles this time. And I’m so looking forward to it.

A new beginning

As a number of readers might be aware by this point, I joined the Thunderbird team at Mozilla in June 2023. This came after months of applying to job after job and getting rejection after rejection, which was slowly eating away at my confidence and mental health. In total, I applied to 76 positions, leading to over 50 rejections, and I’m not counting applications I never heard back from (though those statistics still pale in comparison to the ones my sister once shared with me about her job hunting in the marketing sector). So this news felt like a ray of sunlight in the middle of a storm: not only did I find a role that I was happy with and interested in, but it’s also to work on software I’ve been using for well over a decade, as part of an organisation I had been looking up to for so long.

While getting used to Thunderbird was challenging, it was a welcome challenge. It felt like the first time in a while that I was able to put my skills to use to not only improve a project that I feel personally attached to, but also have a positive impact in areas that matter to me. And to be part of a global community that I share common values and interests with.

A few months in, the focus of my work started shifting from fixing regressions (which was a great way to get more familiar with the project) to more feature-oriented work. I helped research existing code architecture, and design part of the project’s future. As the work to integrate Rust into the code base (led by Ikey) started to come to a close, and my own work started to gravitate around it, I got to build on top of it to create some documentation and code infrastructure that will help Thunderbird developers for years to come. Being able to have this kind of impact on a project that matters so much to me, within barely more than 6 months on the job, is something I’m immensely proud of.

(pssst, if you want to hear more about this, we’re giving a talk on this at FOSDEM this weekend 👀)

This section of the post is probably a bit shorter than the others. To be honest, I’m still struggling to figure out where the last 6 months have gone, because I don’t feel like it’s been that long. And it’s still just the beginning. Although I’m approaching my professional life with more caution than I used to, thanks to past experiences, it feels good to be doing something I enjoy again.

To 2024

I think it’s fair to say 2023 has included some of the highest highs and some of the lowest lows of my life (some I haven’t included here). If you’d asked me a year ago where I’d be now, I very likely wouldn’t have been able to guess where the year took me, and all of the hurdles and joys I encountered along the way. I want to reiterate how grateful I am to everyone who’s been with me throughout the whole journey, or just part of it.

If 2024 is as eventful as 2023 was, maybe I’ll make another one of these in a year. Though frankly I kinda hope it isn’t. But maybe I’ll write something up regardless; who knows 🙂

In the meantime, I hope everyone reading this has a smashing 2024. I’m specifically wishing all the best to everyone I know who’s looking for work at the moment, whether due to layoffs (public and not) or unhappiness with what you have right now. It’s still rough out there, but I know you’ll pull through.

See ya!

Implementing support for message retention policies in Matrix

Brendan Abolivier — Mon, 11 Jan 2021 00:00:00 +0200

Hello there, long time no see!

As you may know, I’m currently working at Element, as part of the backend team working on Matrix’s server-side implementations. The main project involved in this work, at least from my side of things, is Synapse, the reference Matrix homeserver implementation. If you don’t know what a homeserver is, you may want to check out my post Enter the Matrix, in which I give an extensive introduction to Matrix. Parts of this blog post need a basic comprehension of how Matrix works, so if you don’t already have that, you’ll probably want to give that post a read before continuing with this one.

In this context, in 2019, I got to implement in Synapse a feature that had been actively requested by the Matrix community for a while now: message retention policy support. It allows any server or room admin to define a period of time after which a message gets hidden from clients and eventually deleted.

This feature is fairly complex to implement and document, due to different moving parts needing to interact with one another. The current documentation is a good place to start, especially if you’re mainly interested in knowing how to configure a retention policy on your server or in your room. But I thought it might be interesting to get a bit deeper into its implementation and explain some design choices and shortcomings.

In other words, the goal of this post isn’t to explain how to get set up with this feature, but to be a technical breakdown explaining the internal design of this implementation. For instance, I’m not going to dump and explain the necessary bits of configuration right away but rather try to explain in as much detail as I can what they do. If you’re a complete stranger to this feature you might want to have a very quick skim through the documentation I’ve linked to in the previous paragraph, though I’m going to repeat a bunch of what’s being said there.

So here I go.

A quick look at the spec

Message retention policies are defined in Matrix in MSC1763 (an MSC being a proposal to the Matrix specification, not unlike RFCs), which defines them as state events (of type m.room.retention) sent to the room the administrator (or moderator) wants to regulate. Therefore, a policy is scoped to a room.

The content for this state event looks like this:

{
    "max_lifetime": 2419200000,
    "min_lifetime": 86400000
}

This event has two properties: max_lifetime and min_lifetime. Their values are time durations expressed in milliseconds. The combination of these properties defines the lifetime of a non-state event after it’s been sent to a room:

max_lifetime defines how long after the message is sent the homeservers participating in the room can keep it around. In the example above (where its value is 2419200000), it means a homeserver must delete events sent to the room at the latest 28 days after they’ve been sent (though we’ll see later that, in the current implementation in Synapse, it can vary a bit around that value).
min_lifetime defines the minimum amount of time after the event is sent during which homeservers should store the event and not delete it. This is particularly helpful for e.g. governmental organisations that are required (through laws like the Freedom Of Information Act) to keep a record of messages sent, or for moderation purposes. In the example above (where its value is 86400000), it means a homeserver should store events at least 24h after they’ve been sent.

In other words, these parameters are limits to the total lifetime of an event. If a message retention policy has no min_lifetime then the homeserver is free to delete events as soon as it wants, and if it’s got no max_lifetime then the homeserver is free to never delete any event. It’s then up to the homeserver to decide when to delete events using these constraints.
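To make these two bounds a bit more concrete, here is a minimal sketch of how an event’s age could be classified against a policy. The function name and the three labels are my own for illustration and aren’t part of the MSC or of Synapse:

```python
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000


def retention_status(
    age_ms: int, min_lifetime: Optional[int], max_lifetime: Optional[int]
) -> str:
    """Classify an event of a given age against a retention policy.

    Hypothetical helper: "keep" means the event is younger than min_lifetime,
    "expired" means it is older than max_lifetime, and "either" means the
    homeserver is free to decide.
    """
    if min_lifetime is not None and age_ms < min_lifetime:
        return "keep"
    if max_lifetime is not None and age_ms > max_lifetime:
        return "expired"
    return "either"


# The policy from the example above: 28 days maximum, 24 hours minimum.
policy = {"max_lifetime": 28 * DAY_MS, "min_lifetime": DAY_MS}

print(retention_status(DAY_MS // 2, policy["min_lifetime"], policy["max_lifetime"]))
print(retention_status(10 * DAY_MS, policy["min_lifetime"], policy["max_lifetime"]))
print(retention_status(29 * DAY_MS, policy["min_lifetime"], policy["max_lifetime"]))
```

With an empty policy (both values set to None), every event falls in the "either" bucket, which matches the idea that the homeserver is then free to do what it wants.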

Processing policies

So let’s have a look at how this is implemented in Synapse. Before going any further on the implementation side of things, I need to mention that this implementation is still currently considered experimental, and is disabled by default in any new install. We’ll see a bit later in this post how to enable and tweak it using Synapse’s configuration file. Currently, the MSC still needs some clarification and discussion, and so future iterations on it might cause the implementation to change. This is why we’re not declaring it stable as is and enabling it by default yet. On top of that, its main goal is to perform bulk deletion of data, so we want to make extra sure it’s done right before flicking the switch in order to prevent any irreversible breakage.

First, let’s see how Synapse keeps track of retention policies for all the rooms it’s in. That bit is rather simple: every time a state event is sent to a room with the type m.room.retention, Synapse will insert a row into its room_retention database table. This row will include some data about the policy, including the min_lifetime and max_lifetime properties. Note that both those properties are NULLable, allowing for either (or both) property to be omitted (we’ll see later what Synapse does in this case). As far as Synapse is concerned, a room with a retention policy with an empty content ({}) is the same thing as a room with no retention policy.

Now that Synapse knows the retention policy for each room it’s in, it can apply it to the events in the room. It’s worth noting that a current point of discussion on the MSC, and somewhere the implementation differs from the spec, is that the MSC mentions events should be purged according to the retention policy of the room as it was when the event was sent. Synapse, on the other hand, will purge events based on the retention policy the room currently has, because it creates fewer technical complications, provides better performance and seems to better fit the expectations of users.

Configuring message retention policy support

Let’s take a quick break from the technical breakdown to clarify a thing or two. In the next few sections, I’ll take a look at different parts of Synapse’s implementation of message retention policy support. I’ll also explain how they tie into the feature’s configuration.

Message retention policy support can be enabled and tweaked in Synapse’s YAML configuration file. All of the configuration related to this feature can be found in the retention section. I’m not going to get into too much detail about what the different sub-sections and settings mean and how they’re used, as the rest of this post already covers this. One thing I will mention here, however, is that you can enable the feature by setting enable to true in this section.

Note that if this setting is missing or set to false, Synapse will still store new message retention policies. It will not, however, delete any event from the database.

Now let’s see how Synapse deletes messages when this feature is enabled.

Synapse and its many jobs

Because it would be too expensive and complex to track the lifetime of each event individually, and set a timer to purge them from the database, Synapse purges events by running regularly scheduled jobs. Doing so also allows merging code paths with another feature, which is the purge history admin API. The frequency and scope of these jobs are defined in Synapse’s configuration as such:

purge_jobs:
  - longest_max_lifetime: 3d
    interval: 12h
  - shortest_max_lifetime: 3d
    interval: 1d

This example describes two purge jobs. Each job’s definition includes a frequency, set by the required interval setting, which defines the time between two runs of that job.

In the example above, Synapse will run a job every 12 hours purging expired events in rooms whose retention policy features a max_lifetime with a value of 3 days or less; as well as another job every day purging expired events in rooms whose retention policy features a max_lifetime with a value of more than 3 days (note that longest_max_lifetime is inclusive but shortest_max_lifetime isn’t).

The reason Synapse allows multiple jobs to be defined in the same configuration is that not all rooms have the same sensitivity with regards to their retention policy. Some might have their policy dictate that no event can live longer than a day, whereas others might only require events to be purged after a year.
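The inclusive/exclusive detail is easy to get wrong, so here is a small sketch of the range check as I’ve described it. The job_handles function is a hypothetical helper written for this post, not something that exists in Synapse:

```python
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000
THREE_DAYS = 3 * DAY_MS


def job_handles(
    max_lifetime_ms: int,
    shortest_max_lifetime: Optional[int],
    longest_max_lifetime: Optional[int],
) -> bool:
    """Tell whether a purge job's range covers a room's max_lifetime.

    The lower bound is exclusive and the upper bound is inclusive; a bound
    set to None means there's no limit in that direction.
    """
    if shortest_max_lifetime is not None and max_lifetime_ms <= shortest_max_lifetime:
        return False
    if longest_max_lifetime is not None and max_lifetime_ms > longest_max_lifetime:
        return False
    return True


# A room with a max_lifetime of exactly 3 days belongs to the 12h job only.
print(job_handles(THREE_DAYS, None, THREE_DAYS))  # True
print(job_handles(THREE_DAYS, THREE_DAYS, None))  # False
```

Because one bound is inclusive and the other isn’t, two jobs sharing the same boundary value never both pick up the same room.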

Another thing to keep in mind is that running a purge job might be an expensive task to run as it can involve deleting a lot of data, so you don’t want to run a job every minute purging all expired events in all rooms.

Defining multiple jobs allows making sure rooms get processed according to the sensitivity of their policy, as well as ensuring the best performance possible. You could see it as sharding (or partitioning) the load of purging history of rooms across all of the jobs based on a room’s retention policy. This also allows sufficient flexibility in the configuration.

It’s worth noting that both shortest_max_lifetime and longest_max_lifetime are optional here. As with retention policies, omitting one limit simply means no limit is applied in that direction. For instance, the following example defines a purge job without any limit on the range of max_lifetime values it handles:

purge_jobs:
  - interval: 12h

It is also possible to bind a job to a precise scope by specifying both settings:

purge_jobs:
  - interval: 12h
    shortest_max_lifetime: 6h
    longest_max_lifetime: 1d

Heads up that it’s highly recommended to configure a job with an open limit on each side of the range of max_lifetime values - this can be either a job with no limit (as shown above) or two jobs, each limiting in one direction.
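If you want to sanity-check a list of purge jobs against this recommendation, something along these lines could do. It’s a sketch of my own (has_open_ends isn’t a Synapse function), and it only checks that both ends of the range are open; it doesn’t look for gaps in the middle:

```python
from typing import Any, Dict, List


def has_open_ends(jobs: List[Dict[str, Any]]) -> bool:
    """Check that at least one job has no lower limit and at least one job
    has no upper limit (it can be the same job).

    Hypothetical helper: this does NOT detect gaps between jobs.
    """
    open_lower = any(job.get("shortest_max_lifetime") is None for job in jobs)
    open_upper = any(job.get("longest_max_lifetime") is None for job in jobs)
    return open_lower and open_upper


# Two jobs, each limiting in one direction: fine.
two_jobs = [
    {"longest_max_lifetime": "3d", "interval": "12h"},
    {"shortest_max_lifetime": "3d", "interval": "1d"},
]

# A single job bound on both sides: rooms outside of it are never purged.
bounded_job = [
    {"shortest_max_lifetime": "6h", "longest_max_lifetime": "1d", "interval": "12h"},
]

print(has_open_ends(two_jobs))     # True
print(has_open_ends(bounded_job))  # False
```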

But wait, you’ll then say, isn’t it bad if Synapse might delete expired messages hours, possibly days, after they’ve expired? To which I’d answer that yes, probably, however this is mitigated by another feature of this implementation: when a message expires, Synapse will stop sending it to clients. This means that, even though Synapse might not purge events immediately when they expire, it will prevent clients from seeing them. Note that clients that have already downloaded and stored the event might continue to show it, unless they themselves implement support for message retention policies, and no homeserver can do anything about that.

The same mechanism applies if an expired event is sent to Synapse by another homeserver through federation, for example when backfilling, if the remote server doesn’t implement this feature (or doesn’t have it enabled). In this case this feature, when enabled, will prevent this event from reaching clients, letting it sit in its database until the next run of a relevant purge job clears it up.
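To illustrate the idea of hiding expired events before they’re purged, here is a rough sketch of such a filter. It assumes events are plain dictionaries and that their age is measured from origin_server_ts; the visible_events function is made up for this example and isn’t how Synapse structures its own filtering code:

```python
from typing import Any, Dict, List, Optional

DAY_MS = 24 * 60 * 60 * 1000


def visible_events(
    events: List[Dict[str, Any]], max_lifetime_ms: Optional[int], now_ms: int
) -> List[Dict[str, Any]]:
    """Return the events that haven't expired yet.

    Hypothetical helper: expired events stay in storage, they're just not
    returned to clients.
    """
    if max_lifetime_ms is None:
        # No max_lifetime: nothing ever expires.
        return list(events)

    oldest_allowed_ts = now_ms - max_lifetime_ms
    return [e for e in events if e["origin_server_ts"] >= oldest_allowed_ts]


now = 10 * DAY_MS
events = [
    {"event_id": "$old", "origin_server_ts": 1 * DAY_MS},
    {"event_id": "$recent", "origin_server_ts": 8 * DAY_MS},
]

# With a max_lifetime of 3 days, only the recent event is shown.
print([e["event_id"] for e in visible_events(events, 3 * DAY_MS, now)])
```

The nice thing about doing it at read time is that it doesn’t matter where the event came from, whether a local user or a remote homeserver.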

Under the hood

Right, now that we understand how to configure a purge job, let’s see how it actually works. I’m not going to go into detail on the specific SQL deletion that happens, the main reason being this code was already there when implementing the feature, as part of the purge history admin API, and the purge jobs just hook into it.

Quick heads up, in this section I’ll be moving from linking to Synapse’s code to sharing snippets of the code directly here, because I believe it’s nicer to understand what’s going on. For reference, all of these snippets will come from Synapse’s pagination handler and should be located in the top half of the file, if you want to contextualise them.

When Synapse starts, it will start a looping call for each purge job configured:

# Run the purge jobs described in the configuration file.
for job in hs.config.retention_purge_jobs:
    logger.info("Setting up purge job with config: %s", job)

    self.clock.looping_call(
        run_as_background_process,
        job["interval"],
        "purge_history_for_rooms_in_range",
        self.purge_history_for_rooms_in_range,
        job["shortest_max_lifetime"],
        job["longest_max_lifetime"],
    )

If you’re running Synapse in a worker setup that isn’t configured to run background tasks on the main process, these purge jobs will be scheduled on whichever worker run_background_tasks_on is pointing to in your configuration file.

As expected, we can see that each job is run in a looping call. As its name might suggest, in Synapse, a looping call is a function that is called in an infinite loop (asynchronously) with a given interval between two calls to that function. In this instance, we can see that for each looping call we use the configured interval for the associated purge job configuration. We also provide the function with the purge job’s range.
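If you’ve never come across this pattern, here is a tiny standalone sketch of what a looping call does. Note that I’m using asyncio here rather than Twisted (which is what Synapse is built on), and that the looping_call function and its max_runs parameter are my own, the latter only being there so the example terminates:

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional, Tuple


async def looping_call(
    func: Callable[..., Awaitable[None]],
    interval_s: float,
    *args: Any,
    max_runs: Optional[int] = None,
) -> None:
    """Call func(*args) repeatedly, waiting interval_s seconds before each call.

    Sketch only: max_runs is not something Synapse has, it just stops the
    loop so this example doesn't run forever.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        await asyncio.sleep(interval_s)
        await func(*args)
        runs += 1


calls: List[Tuple[Optional[int], Optional[int]]] = []


async def fake_purge_job(min_ms: Optional[int], max_ms: Optional[int]) -> None:
    # Stand-in for the real purge job: just record the range it was given.
    calls.append((min_ms, max_ms))


# Run the fake job twice, 10ms apart, with an open lower bound.
asyncio.run(looping_call(fake_purge_job, 0.01, None, 259200000, max_runs=2))
print(calls)
```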

Specifying a default retention policy

Now let’s see what a purge job actually does. First it retrieves the rooms it will be purging, and their retention policies, from Synapse’s database:

# We want the storage layer to include rooms with no retention policy in its
# return value only if a default retention policy is defined in the server's
# configuration and that policy's 'max_lifetime' is either lower (or equal) than
# max_ms or higher than min_ms (or both).
if self._retention_default_max_lifetime is not None:
    include_null = True

    if min_ms is not None and min_ms >= self._retention_default_max_lifetime:
        # The default max_lifetime is lower than (or equal to) min_ms.
        include_null = False

    if max_ms is not None and max_ms < self._retention_default_max_lifetime:
        # The default max_lifetime is higher than max_ms.
        include_null = False
else:
    include_null = False

logger.info(
    "[purge] Running purge job for %s < max_lifetime <= %s (include NULLs = %s)",
    min_ms,
    max_ms,
    include_null,
)

rooms = await self.store.get_rooms_for_retention_period_in_range(
    min_ms, max_ms, include_null
)

The first part of this code doesn’t actually do any retrieval from the database, but figures out what to retrieve. More specifically, it figures out whether this purge job needs to process rooms with no retention policy stored as well as rooms whose retention policies are within the range of this job. A room with no retention policy will still be stored in the room_retention table, with a NULL retention policy, hence the name of the boolean variable indicating whether we need to retrieve these as well (include_null).

The reason we might want to process these rooms is because it is possible in Synapse to define a default policy for all rooms that don’t have one in their state, using the following configuration in the retention section of Synapse’s configuration file:

default_policy:
  min_lifetime: 1d
  max_lifetime: 1y

This example is equivalent to adding this m.room.retention event into the state of any room that doesn’t already specify a retention policy:

{
    "min_lifetime": 86400000,
    "max_lifetime": 31557600000
}

If a room already specifies a retention policy, Synapse will use that policy and not the default one.

Note that there is one difference with actually inserting the policy into the room’s state: this default policy will only be applied on your homeserver, so if other homeservers are in the room they won’t necessarily apply the same policy. However, as we’ve seen before, if another homeserver sends yours events that should be deleted according to your default policy, Synapse will hide them from clients and just wait for the relevant purge job to delete them.

This check is actually quite simple: we only need to process rooms without a retention policy if a default server-wide retention policy has been configured (because it then applies to any room without a policy). On top of that, we check whether this default policy specifies a value for max_lifetime that’s within the job’s range.
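Here is the same check pulled out into a standalone function, with a few worked cases. The should_include_null name is mine, and I’m assuming a "year" of 365 days just to have a number to play with:

```python
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000


def should_include_null(
    default_max_lifetime: Optional[int],
    min_ms: Optional[int],
    max_ms: Optional[int],
) -> bool:
    """Decide whether a purge job should also process rooms with no policy.

    Standalone rewrite of the check above; the function name is hypothetical.
    """
    if default_max_lifetime is None:
        # No default policy: rooms without a policy are never purged.
        return False
    if min_ms is not None and min_ms >= default_max_lifetime:
        # The default max_lifetime is below the job's (exclusive) lower bound.
        return False
    if max_ms is not None and max_ms < default_max_lifetime:
        # The default max_lifetime is above the job's (inclusive) upper bound.
        return False
    return True


default = 365 * DAY_MS  # assumed value for a default max_lifetime of "1y"

print(should_include_null(default, None, 3 * DAY_MS))  # False
print(should_include_null(default, 3 * DAY_MS, None))  # True
print(should_include_null(None, None, None))           # False
```

With the two purge jobs from earlier, only the daily job (the one handling a max_lifetime above 3 days) ends up processing rooms that rely on the default policy.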

We then call get_rooms_for_retention_period_in_range on Synapse’s storage layer, which returns a dictionary associating a room’s ID with its retention policy, for example:

{
    "!someroom:example.com": {
        "max_lifetime": 2419200000,
        "min_lifetime": 86400000
    }
}

Once we have these rooms, we iterate over them.

Capping the policy

We first check whether there’s already a purge in progress in that room, and if so skip it to prevent any damage due to a conflict:

if room_id in self._purges_in_progress_by_room:
    logger.warning(
        "[purge] not purging room %s as there's an ongoing purge running"
        " for this room",
        room_id,
    )
    continue

We then proceed to cap the room’s retention policy. This is done by another bit of configuration in the retention section of Synapse’s configuration file:

allowed_lifetime_min: 1d
allowed_lifetime_max: 1y

The rationale on capping a room’s policy is that your homeserver might run under different requirements with regards to data retention than the other homeservers in the room. You might want to make sure you keep messages long enough for e.g. audit or other legal purposes, or you might want to make sure you don’t keep them too long so they don’t take up too much space on your disk and/or for privacy-related reasons. Whatever your reason is for doing so, Synapse allows you to override a room’s retention policy before purging it to ensure it doesn’t purge what you want to keep around, or it purges what you don’t want around anymore.

Both allowed_lifetime_min and allowed_lifetime_max are optional configuration parameters. They apply to both min_lifetime and max_lifetime, however when running a purge job, we only care about the policy’s max_lifetime value, so that’s the one Synapse will cap if necessary:

# If max_lifetime is None, it means that the room has no retention policy.
# Given we only retrieve such rooms when there's a default retention policy
# defined in the server's configuration, we can safely assume that's the
# case and use it for this room.
max_lifetime = (
    retention_policy["max_lifetime"] or self._retention_default_max_lifetime
)

# Cap the effective max_lifetime to be within the range allowed in the
# config.
# We do this in two steps:
#   1. Make sure it's higher or equal to the minimum allowed value, and if
#      it's not replace it with that value. This is because the server
#      operator can be required to not delete information before a given
#      time, e.g. to comply with freedom of information laws.
#   2. Make sure the resulting value is lower or equal to the maximum allowed
#      value, and if it's not replace it with that value. This is because the
#      server operator can be required to delete any data after a specific
#      amount of time.
if self._retention_allowed_lifetime_min is not None:
    max_lifetime = max(self._retention_allowed_lifetime_min, max_lifetime)

if self._retention_allowed_lifetime_max is not None:
    max_lifetime = min(max_lifetime, self._retention_allowed_lifetime_max)

logger.debug("[purge] max_lifetime for room %s: %s", room_id, max_lifetime)

We first figure out what the effective value for max_lifetime is in the room; it’s either the value from the room’s policy, or from the homeserver’s default policy if no specific policy is defined for this room.

Then we:

1. take the maximum value between allowed_lifetime_min and max_lifetime, so we use the effective value if it’s within the allowed range, and the minimum allowed value if it’s not.
2. take the minimum value between the result of step 1 and the maximum allowed value, so we use the value from step 1 if it’s within the allowed range, and the maximum allowed value if it’s not.

That way we ensure that, if the effective value of max_lifetime is within the allowed range, it stays the same, otherwise it’s changed to the bound it goes over.
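For those who prefer numbers, here is that capping logic as a standalone function with three worked cases. The cap_max_lifetime name is hypothetical, and I’m again assuming 1y means 365 days for the sake of the example:

```python
from typing import Optional

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


def cap_max_lifetime(
    max_lifetime: int, allowed_min: Optional[int], allowed_max: Optional[int]
) -> int:
    """Cap a room's effective max_lifetime to the allowed range.

    Hypothetical wrapper around the two steps described above.
    """
    if allowed_min is not None:
        # Step 1: never go below the minimum allowed value.
        max_lifetime = max(allowed_min, max_lifetime)
    if allowed_max is not None:
        # Step 2: never go above the maximum allowed value.
        max_lifetime = min(max_lifetime, allowed_max)
    return max_lifetime


allowed_min = DAY_MS        # allowed_lifetime_min: 1d
allowed_max = 365 * DAY_MS  # allowed_lifetime_max: 1y (assumed to be 365 days)

print(cap_max_lifetime(HOUR_MS, allowed_min, allowed_max) == DAY_MS)            # too short
print(cap_max_lifetime(30 * DAY_MS, allowed_min, allowed_max) == 30 * DAY_MS)   # in range
print(cap_max_lifetime(730 * DAY_MS, allowed_min, allowed_max) == 365 * DAY_MS) # too long
```

All three lines print True: a policy of one hour is raised to a day, a policy of 30 days is left alone, and a policy of two years is lowered to one.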

Note that a previous implementation of this configuration refused outright to process any incoming event describing a policy that didn’t abide by this range. This is no longer the case: a few months ago, it was changed to the implementation I’ve just described.

The purge

Now the time has come to get rid of these nasty old events. Let’s look at the final preparation before we do that:

# Figure out what token we should start purging at.
ts = self.clock.time_msec() - max_lifetime

stream_ordering = await self.store.find_first_stream_ordering_after_ts(ts)

r = await self.store.get_room_event_before_stream_ordering(
    room_id, stream_ordering,
)
if not r:
    logger.warning(
        "[purge] purging events not possible: No event found "
        "(ts %i => stream_ordering %i)",
        ts,
        stream_ordering,
    )
    continue

(stream, topo, _event_id) = r
token = "t%d-%d" % (topo, stream)

purge_id = random_string(16)

self._purges_by_id[purge_id] = PurgeStatus()

logger.info(
    "Starting purging events in room %s (purge_id %s)" % (room_id, purge_id)
)

# We want to purge everything, including local events, and to run the purge in
# the background so that it's not blocking any other operation apart from
# other purges in the same room.
run_as_background_process(
    "_purge_history", self._purge_history, purge_id, room_id, token, True,
)

First, we need to figure out the timestamp we need to start purging at, which is just now minus the room’s policy’s max_lifetime, and convert that into a stream ordering.

Matrix rooms are DAGs, which means it’s not always possible to have a straight line from one point of the history to another. To address that, Synapse orders events with their unique index in its streams of incoming events, which is what we call the stream ordering of that event. Retrieving a stream ordering allows us to translate the timestamp into a location in that stream we can then use.

However, here we’re retrieving the first stream ordering Synapse can find after the timestamp, but the events stream isn’t scoped to the room we want to purge. This means we need to get some data on the closest event in that room, and we do that by calling get_room_event_before_stream_ordering, which will return some metadata on the event sent to that room before the given stream ordering (so the most recent event to purge from that room). This will return, beside the event’s ID, its topological and stream ordering.
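To picture these two lookups, here is a toy version working on an in-memory list instead of Synapse’s database. Everything in it is made up for the example (the STREAM data, the function names, and the choice of treating "before" as inclusive), so take it as an illustration of the idea rather than of the actual storage code:

```python
import bisect
from typing import List, Optional, Tuple

# Hypothetical events stream, sorted by stream ordering:
# (stream_ordering, received_ts_ms, room_id, topological_ordering, event_id)
STREAM: List[Tuple[int, int, str, int, str]] = [
    (1, 1000, "!a:example.com", 1, "$a1"),
    (2, 2000, "!b:example.com", 1, "$b1"),
    (3, 3000, "!a:example.com", 2, "$a2"),
    (4, 4000, "!b:example.com", 2, "$b2"),
    (5, 5000, "!a:example.com", 3, "$a3"),
]


def first_stream_ordering_after_ts(stream, ts: int) -> Optional[int]:
    """Return the stream ordering of the first event received after ts,
    regardless of which room it's in (None if there isn't one)."""
    timestamps = [row[1] for row in stream]
    index = bisect.bisect_right(timestamps, ts)
    return stream[index][0] if index < len(stream) else None


def room_event_before_stream_ordering(
    stream, room_id: str, stream_ordering: int
) -> Optional[Tuple[int, int, str]]:
    """Return (stream, topological, event ID) for the most recent event in
    the room at or before the given stream ordering (inclusive by assumption)."""
    for so, _ts, rid, topo, event_id in reversed(stream):
        if rid == room_id and so <= stream_ordering:
            return (so, topo, event_id)
    return None


so = first_stream_ordering_after_ts(STREAM, 3500)
print(so)  # 4, which is an event in room !b, not in room !a
print(room_event_before_stream_ordering(STREAM, "!a:example.com", so))
# (3, 2, '$a2')
```

The first lookup lands on an event from another room, which is exactly why the second one is needed.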

Now, we already know what a stream ordering is, but what about a topological ordering? Well it’s roughly the same thing, except that instead of being the index of the event in Synapse’s events stream, it’s its index in the room’s topology. For example, the first event of the room will have a topological ordering of 1, the next one 2, etc.

The main difference with a stream ordering is that a topological ordering isn’t always unique because a DAG can sometimes branch. This is why we’re getting both the topological ordering of the event and its stream ordering, so we can tell the purge code exactly what event we want the purge to start at.
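Here is a small sketch showing how a branch leads to two events sharing a topological ordering. I’m assuming the topological ordering behaves like a depth (one more than the deepest parent), and the event IDs and the topological_orderings function are invented for the example:

```python
from typing import Dict, List


def topological_orderings(prev_events: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute a depth for each event of a DAG.

    prev_events maps an event ID to the IDs of the events it references.
    An event with no parent gets 1, any other event gets one more than its
    deepest parent. Hypothetical helper, not Synapse's actual algorithm.
    """
    memo: Dict[str, int] = {}

    def depth(event_id: str) -> int:
        if event_id not in memo:
            parents = prev_events[event_id]
            memo[event_id] = 1 + max((depth(p) for p in parents), default=0)
        return memo[event_id]

    for event_id in prev_events:
        depth(event_id)
    return memo


# $b and $c were sent at the same time and both reference $a, creating a
# branch that $d later merges.
room_dag = {
    "$create": [],
    "$a": ["$create"],
    "$b": ["$a"],
    "$c": ["$a"],
    "$d": ["$b", "$c"],
}

print(topological_orderings(room_dag))
# {'$create': 1, '$a': 2, '$b': 3, '$c': 3, '$d': 4}
```

Both $b and $c end up with 3, so the topological ordering alone can’t tell them apart, whereas their stream orderings would differ.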

From these two integers we create a token, using the format t[topological ordering]-[stream ordering] (starting with t to make it clear which ordering comes first), and we run the _purge_history function in a background process, which is another way of saying we’re running that function in a non-blocking way, so we can start processing the next room.
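As a quick illustration of that token format, here is a pair of helpers to build and read one back. Both make_token and parse_token are hypothetical names of mine; only the "t%d-%d" format comes from the snippet above:

```python
from typing import Tuple


def make_token(topological_ordering: int, stream_ordering: int) -> str:
    """Build a token in the t[topological ordering]-[stream ordering] format."""
    return "t%d-%d" % (topological_ordering, stream_ordering)


def parse_token(token: str) -> Tuple[int, int]:
    """Read a token back into (topological ordering, stream ordering).

    Hypothetical helper that only understands the "t" format shown above.
    """
    if not token.startswith("t"):
        raise ValueError("not a topological token: %r" % (token,))
    topo, stream = token[1:].split("-", 1)
    return int(topo), int(stream)


token = make_token(2, 3)
print(token)               # t2-3
print(parse_token(token))  # (2, 3)
```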

Now I’m not going to go any further, because as I’ve already said the rest of this code was initially introduced when implementing the purge history admin API; and I didn’t work much on this code except for making sure it was doing what I would expect it to do.

Though what you’ll probably want to know about the code that’s actually clearing off events is that it takes some precautions to make sure it doesn’t completely break the rooms it’s purging, namely:

it won’t delete state events to prevent the room from getting into a broken state
it won’t delete the most recent event in the room; that’s, again, because a room’s history is a DAG and each event needs to reference previous events (with the exception of m.room.create, which creates the room) - therefore if you don’t have any event in the room to reference, nobody will be able to send any new event in that room (or Synapse might try to reference an older state event but then the new event will probably appear out of order on other homeservers)

However, despite not being able to delete these events, Synapse will still hide them from clients, which should be enough of a mitigation in most cases.
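The two precautions can be sketched as a simple filter. This is a simplified model with made-up names (RoomEvent, purgeable), assuming a state event is one that carries a state key and that "most recent" means the highest stream ordering; the real purge code is more involved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RoomEvent:
    event_id: str
    stream_ordering: int
    state_key: Optional[str] = None  # set (even to "") on state events

    @property
    def is_state(self) -> bool:
        return self.state_key is not None


def purgeable(events: List[RoomEvent]) -> List[RoomEvent]:
    """Events that can be deleted: no state events, never the latest event."""
    if not events:
        return []
    latest = max(events, key=lambda e: e.stream_ordering)
    return [
        e for e in events
        if not e.is_state and e.event_id != latest.event_id
    ]


ROOM = [
    RoomEvent("$create", 1, state_key=""),
    RoomEvent("$join", 2, state_key="@alice:example.tld"),
    RoomEvent("$msg1", 3),
    RoomEvent("$msg2", 4),
]

to_delete = [e.event_id for e in purgeable(ROOM)]
```

Here only $msg1 would be deleted: the two state events are kept, and $msg2 stays so that the next event has something to reference.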

A note on media

As you might have noticed, this feature only manages the retention of messages, not state events or, a more requested variant, media. Media retention is an entirely different problem (tracked here) for a few reasons. For a brief point of context, the way uploading media into a room works in Matrix is that you first upload your media to the homeserver, then send an event into the room with data on how to reach (and possibly decrypt) the media.

The first issue with this is that in end-to-end encrypted rooms the homeserver won’t be able to read the event listing the media’s URL and metadata (in fact it’s not even capable of distinguishing it from a text message), so it’s not always possible to map a media with the room it’s been sent to. On top of that, some third-party media stores such as TravisR’s matrix-media-repo implement some deduplication logic so the same file might be used in two different rooms, which complicates things even more.

This means a separate feature needs to be implemented for media. The details and design still need to be ironed out, but it’s on the team’s roadmap. You might notice, however, that while this feature isn’t deleting media entirely, it removes references to them from the room, which at least would still prevent members of the room from accessing them easily.

What a journey

Message retention policies can be a super useful feature, and some bits can be a bit tricky to understand, or a bit curious in terms of design. So I hope this deep dive into how that feature works and was implemented was helpful. If it’s still a bit hazy or unclear, feel free to reach out over Matrix or Twitter! 🙂

Note that this isn’t a technical documentation on how to use the feature, therefore I didn’t specifically outline the limitations, important bits of config, etc. related to this feature, but instead spread them through the post. If you just need to make it work and skim across its shortcomings then the documentation is the right place to look.

I sure had fun writing this post, it was nice revisiting one of my first big features in Synapse, and it motivated me to look with fresh new eyes into all of the implementation’s details (and even find a few bugs), which was welcome 😀 Huge thanks to Thibaut, Andrew and Dan for proofreading it!

See you in the next post! 🙂

Install Party 1.0

Brendan Abolivier — Fri, 01 Nov 2019 00:00:00 +0200

A few weeks ago, I attended Ubucon Europe in Sintra with two of my colleagues from the Matrix core team (oh, yes, if you didn’t know already, I joined New Vector about a year ago, and I’ve been working on Matrix as my full-time job since then). We had a few chats with very nice people, and also hosted two Matrix-related workshops.

One of these workshops, which happened on the morning of the last day, was about installing Synapse, the reference Matrix homeserver implementation. The goal was to give attendees a presentation about what Matrix is, get them to install their own homeserver, and, if possible, to get everyone’s server to federate with everyone else’s.

This is not a trivial thing to do, especially when the technical expertise of the attendance looks quite diverse. After a quick brainstorm on how to do that, Ben suggested that we give everyone access to a VPS that is accessible from the Internet, with SSH access and a domain name.

This would make things much easier than trying to get attendees to install a server on their own machine (no need to set up a custom CA, or a local DNS server), but I know well enough how making a workshop or a talk rely on Internet connectivity can really jinx it. The connectivity seemed good enough there though, so I figured I’d give a try at automating the provisioning of such servers. This evolved into a project I’ve kept working on afterwards named “Install Party”.

Install Party

Install Party is a Python module that allows users to provision a server by creating an instance (a physical or virtual machine), attaching a DNS A record to it, and running a script that installs and configures Riot and Caddy on that instance.

This can be done by simply running python -m install_party create -N x, with the number of servers to create as x:

$ python -m install_party create -N 3
INFO - Provisioning server coogl (expected domain name coogl.ubucon.abolivier.bzh)
INFO - Creating instance...
INFO - Waiting for instance to become active...
INFO - Host is active, IPv4 address is 54.38.70.225
INFO - Creating DNS record...
INFO - Created DNS record coogl.ubucon.abolivier.bzh
INFO - Waiting for post-creation script to finish...
INFO - Done!
INFO - Provisioning server czxcx (expected domain name czxcx.ubucon.abolivier.bzh)
INFO - Creating instance...
INFO - Waiting for instance to become active...
INFO - Host is active, IPv4 address is 54.38.70.93
INFO - Creating DNS record...
INFO - Created DNS record czxcx.ubucon.abolivier.bzh
INFO - Waiting for post-creation script to finish...
INFO - Done!
INFO - Provisioning server jswho (expected domain name jswho.ubucon.abolivier.bzh)
INFO - Creating instance...
INFO - Waiting for instance to become active...
INFO - Host is active, IPv4 address is 54.38.71.74
INFO - Creating DNS record...
INFO - Created DNS record jswho.ubucon.abolivier.bzh
INFO - Waiting for post-creation script to finish...
INFO - Done!

All servers have been created:
	- coogl.ubucon.abolivier.bzh
	- czxcx.ubucon.abolivier.bzh
	- jswho.ubucon.abolivier.bzh
$

(I’ve trimmed the log lines' length here; they’re usually longer and feature the date and the name of the module sending the line, but that would have been unreadable in this post. This will also be the case for other similar sections of the post.)

The workshop’s host can then hand out the domain name attached to a server to each attendee, who can then log in via SSH to the server and install and configure a Matrix homeserver (including, if applicable, its built-in ACME support for automatic provisioning of the certificate needed for federation). As an example, here are the instructions we got the attendees to follow during our workshop at Ubucon Europe.

One of the domains I handed out during our workshop at Ubucon

From there, the attendee can use the instance of Riot to register on their new homeserver, and federate with every other attendee’s homeserver, but also every other homeserver on the Internet.

The servers federating between themselves but also with some from the wider Internet

Once the workshop is done, the host can then delete every server with python -m install_party delete --all:

$ python -m install_party delete --all
INFO - Deleting instance for name jswho...
INFO - Deleting domain name for name jswho...
INFO - Deleting instance for name czxcx...
INFO - Deleting domain name for name czxcx...
INFO - Deleting instance for name coogl...
INFO - Deleting domain name for name coogl...
INFO - Applying the instances deletion...
INFO - Applying the DNS changes...
INFO - Done!
$

Of course, this deletion mode also has a dry-run mode, which can be turned on with -d.

They can also delete specific servers with -s foo -s bar (which would only delete the servers foo and bar), or delete every server except one or more with -a -e foo -e bar (which would delete every server but foo and bar). This became very handy when one person arrived late to the workshop, and didn’t get the time to finish it, so I could just give them some extra time to work on it and exclude their server’s domain from the deletion I performed shortly afterwards.
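The selection rules described above boil down to something like the following sketch. The select_for_deletion function and its parameters are hypothetical names picked for this example; they aren't taken from Install Party's codebase.

```python
from typing import Iterable, List


def select_for_deletion(
    existing: Iterable[str],
    names: Iterable[str] = (),
    delete_all: bool = False,
    excluded: Iterable[str] = (),
) -> List[str]:
    """Pick the servers to delete, mimicking -s, -a and -e."""
    if delete_all:
        # -a: everything, minus the servers excluded with -e
        skip = set(excluded)
        return [name for name in existing if name not in skip]
    # -s: only the servers explicitly listed
    wanted = set(names)
    return [name for name in existing if name in wanted]


SERVERS = ["coogl", "czxcx", "jswho"]

only_two = select_for_deletion(SERVERS, names=["coogl", "jswho"])
all_but_one = select_for_deletion(SERVERS, delete_all=True, excluded=["czxcx"])
```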

At any time, the host can also list every server that is still active with python -m install_party list:

$ python -m install_party list
+--------+-----------------+------------------+----------+-------+
| Name   | Instance name   | Domain           | Status   | IPv4  |
|--------+-----------------+------------------+----------+-------|
| jswho  | ubucon-jswho    | jswho.ubucon.... | ACTIVE   | ...   |
| czxcx  | ubucon-czxcx    | czxcx.ubucon.... | ACTIVE   | ...   |
| coogl  | ubucon-coogl    | coogl.ubucon.... | ACTIVE   | ...   |
+--------+-----------------+------------------+----------+-------+
$

This mode can also detect orphaned domain names (i.e. domain names whose target IP address isn’t a known instance) and orphaned instances (i.e. instances that don’t have a domain name targeting their IP address):

$ python -m install_party list
+--------+-----------------+------------------+----------+-------+
| Name   | Instance name   | Domain           | Status   | IPv4  |
|--------+-----------------+------------------+----------+-------|
| jswho  | ubucon-jswho    | jswho.ubucon.... | ACTIVE   | ...   |
+--------+-----------------+------------------+----------+-------+

ORPHANED INSTANCES
+--------+-----------------+----------+-------------+
| Name   | Instance name   | Status   | IPv4        |
|--------+-----------------+----------+-------------|
| czxcx  | ubucon-czxcx    | ACTIVE   | 54.38.70.93 |
+--------+-----------------+----------+-------------+

ORPHANED DOMAINS
+--------+----------------------------+--------------+
| Name   | Domain                     | Target       |
|--------+----------------------------+--------------|
| coogl  | coogl.ubucon.abolivier.bzh | 54.38.70.225 |
+--------+----------------------------+--------------+
$
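The orphan detection can be thought of as comparing two sets of IP addresses. Here is a minimal sketch of that comparison, reusing the names and addresses from the output above; the find_orphans function and the shape of its inputs are assumptions made for the example, not Install Party's real internals.

```python
from typing import Dict, List, Tuple


def find_orphans(
    instances: Dict[str, str],  # instance name -> IPv4 address
    records: Dict[str, str],    # domain name -> target IPv4 address
) -> Tuple[List[str], List[str]]:
    """Return (orphaned instances, orphaned domains)."""
    instance_ips = set(instances.values())
    record_targets = set(records.values())
    orphaned_instances = sorted(
        name for name, ip in instances.items() if ip not in record_targets
    )
    orphaned_domains = sorted(
        domain for domain, target in records.items() if target not in instance_ips
    )
    return orphaned_instances, orphaned_domains


INSTANCES = {"jswho": "54.38.71.74", "czxcx": "54.38.70.93"}
RECORDS = {
    "jswho.ubucon.abolivier.bzh": "54.38.71.74",
    "coogl.ubucon.abolivier.bzh": "54.38.70.225",
}

orphaned_instances, orphaned_domains = find_orphans(INSTANCES, RECORDS)
```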

The big 1.0

A short while after Ubucon, I finished a basic implementation of the three modes I’ve described above (I had already implemented the creation and was halfway done implementing the listing when the workshop happened), and released v0.3.0 with these.

Since then, I’ve been iterating on improving the codebase, adding new features to the modes, and polishing the whole thing in order to make it easier to use and contribute to. All of the improvements I’ll describe here are documented in the project’s README.

A major change is that I’ve added the ability for users to use their favourite DNS provider, as well as their favourite instances provider (i.e. the API to use to create instances), instead of the hardcoded OVH and OpenStack (which are the ones I personally use). These providers are still available, but users can now add their own providers by creating a Python class that implements the correct API, dropping it as a file in the correct location, and start using it right away.
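To give a rough idea of what such a provider could look like, here is a sketch of a DNS provider interface with a toy implementation. The DNSProvider class and its create_record and delete_record methods are hypothetical and only illustrate the general shape; the actual API to implement is the one documented in the project's README.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DNSProvider(ABC):
    """Hypothetical interface, NOT Install Party's real provider API."""

    @abstractmethod
    def create_record(self, name: str, ip_address: str) -> str:
        """Create an A record for `name` and return the full domain."""

    @abstractmethod
    def delete_record(self, name: str) -> None:
        """Remove the A record attached to `name`."""


class InMemoryDNSProvider(DNSProvider):
    """Toy provider keeping records in a dict instead of calling an API."""

    def __init__(self, zone: str) -> None:
        self.zone = zone
        self.records: Dict[str, str] = {}

    def create_record(self, name: str, ip_address: str) -> str:
        domain = f"{name}.{self.zone}"
        self.records[domain] = ip_address
        return domain

    def delete_record(self, name: str) -> None:
        self.records.pop(f"{name}.{self.zone}", None)


provider = InMemoryDNSProvider("ubucon.abolivier.bzh")
domain = provider.create_record("coogl", "54.38.70.225")
```

The point of going through an abstract class is that the rest of the code only ever talks to the interface, so swapping one provider for another doesn't require touching the creation, listing or deletion logic.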

The creation mode also got two main improvements. The first one is the ability for multiple instances to be created in the same run with the -N/--number command-line argument. This is already something I’ve described above, but didn’t exist in previous versions of Install Party (indeed, for the Ubucon workshop, I had to run Install Party multiple times in multiple terminals in order to generate the number of instances I was going for).

Another new feature of the creation mode is the ability to provide a post-install script with the -s/--post-install-script command-line argument. This script will be run on the new server(s) after the installation of Riot and Caddy.

The rest of the work was mostly about cleaning up the codebase (e.g. getting rid of some inconsistencies in the names of some variables or classes), adding some proper logging, adding docstrings to (almost) every function of the project, and improving and updating the user-facing documentation in the README.

That’s all folks!

If you’re interested in following along further developments of Install Party (though I expect it to become much calmer now), want to report an issue when using it, or want to ask a question about it, feel free to join the project’s Matrix room #install-party:abolivier.bzh, or to check out its GitHub repo!

When I’m writing this post, the video of the workshop Install Party was born for hasn’t been released yet. However, I’ve noticed the staff are starting to publish videos of the event’s talks, so I should update this post in a few days/weeks when the video is released.

I hope you enjoyed reading through this post, see you next time!

Make your own Google Drive+Docs with Nextcloud, Collabora Online and object storage

Brendan Abolivier — Fri, 13 Jul 2018 00:00:00 +0200

As someone who values privacy (mine and others'), I usually try to find new ways of getting rid of the now infamous GAFAM and their friends, the biggest of them all being Google. Among every Google service, there’s one that is hugely used among both individuals and organisations, which is Google Docs. Add to that the whole storage service they also provide, and you get Google Drive, the best way to directly feed Google with all your files and data, including personal and administrative documents, music collections, photos from your smartphone… In other words, a huge volume of data that either is literally personal data, or can be used to extract personal data about you without your explicit consent (in Google’s case).

In my case, I’m not really at ease with the fact that I need to upload administrative documents containing personal data to Google in order to register for my boat driving license, nor do I with the fact that my phone will automatically upload every picture I take to Google’s servers so it can be processed and have any data it contains extracted and stored in a database I don’t have control over.

Just as many reasons for me to look for a Google Drive-like solution that I could entirely control. Here’s what I came up with so far:

Nextcloud 13 as the whole cloud management solution
OpenStack’s object storage (Swift) as the scalable storage backend
Caddy as the web server
Collabora Online (without Docker) as the collaborative office suite

So far I got all of that working together on my personal space, so let me walk you through how to do that yourself. Because of the whole stack’s current state, it does require some technical skills, though. Sorry about that.

(Disclaimer: because I know that some who read this blog might ask the question, let me get things straight first: yup, I’m using Nextcloud as my Google Drive replacement, even though I currently work at CozyCloud. Although people can seem to think it might be hypocritical from me to do so, my take on the matter is that Cozy and Nextcloud aren’t competitors, and although they have some features in common, one can easily complete the other, as Cozy (which I also use) has features Nextcloud doesn’t have and vice-versa. Diversity is great, as people from both CozyCloud and Nextcloud would tell you. And no, I haven’t been asked to write that paragraph 😉)

Update: right after I published this post, Antoine made an amazing Ansible playbook based on it, which you can check out right here!

LCPP: (GNU/)Linux-Caddy-PostgreSQL-PHP

Get the right hosting

First things first, because I want to have control over the whole thing, I’m self-hosting most of the pieces of software I previously mentioned on a personal VPS. Mine is a START 1-S from Scaleway, because, since the storage backend will be remote (I’ll come back to that later on), I needed a better bandwidth than the best-effort 100Mbps one I have on my OVH cloud instances (and I know that because I previously tried this setup on one of these, which resulted in poor performance), so mine has a 200Mbps bandwidth (I couldn’t find any piece of info about whether that was guaranteed or best-effort, though).

For the record, OVH does offer instances with guaranteed 250Mbps bandwidth, though given their rates it’s not something I can afford only for that use, especially given what competitors offer, including Scaleway.

Anyway, try to keep the bandwidth requirement in mind while choosing where you’ll host your Nextcloud instance.

Also for the record, the operating system on my VPS at the time of writing is GNU/Linux, more precisely Ubuntu 16.04 LTS, so the processes I’ll describe in this blog post might slightly differ if you’re running another GNU/Linux distribution or another operating system.

Caddy

As I mentioned, I’m using Caddy as the web server for the whole thing. In case the name doesn’t ring a bell, Caddy is a lightweight web server with simple configuration, HTTP/2 as its default, automatic HTTPS through Let’s Encrypt, and a lot of available plugins.

One downside of using Caddy is that, unlike popular free software projects out there, it doesn’t provide any official packaging (because of complications brought by the plugins system), so installation must be done manually. This is not that hard, though, because the whole process is pretty much straightforward:

# Download caddy
curl -L "https://caddyserver.com/download/linux/amd64?license=personal&telemetry=off" > caddy.tar.gz

# Extract the whole thing in a specific directory
mkdir caddy
cd caddy
tar xvf ../caddy.tar.gz

# Install Caddy's binary and allow it to use the 80 and 443 ports
sudo cp ./caddy /usr/local/bin/
sudo setcap cap_net_bind_service=+ep /usr/local/bin/caddy

# Create Caddy's certificates directory
sudo mkdir /etc/ssl/caddy
sudo chown -R www-data:www-data /etc/ssl/caddy

# Create Caddy's configuration directory
sudo mkdir /etc/caddy /etc/caddy/caddy.conf.d
echo "import caddy.conf.d/*.conf" | sudo tee /etc/caddy/Caddyfile > /dev/null
sudo chown -R www-data:www-data /etc/caddy

# Create, enable and start Caddy's systemd service
sudo mkdir -p /usr/lib/systemd/system
sudo cp ./init/linux-systemd/caddy.service /usr/lib/systemd/system
sudo systemctl enable caddy
sudo systemctl start caddy

Note: either the curl line or the setcap one (or both of them) might not work because of missing packages depending on your GNU/Linux distribution and your provider’s image. On Debian-based distributions, curl can be installed by installing the curl package, and setcap can be installed by installing the libcap2-bin package. If you’re running another GNU/Linux distribution, the packages' names can differ a bit.

Now that you have Caddy installed, let’s install another very important component we’ll need to run Nextcloud: PHP.

PHP

Because Caddy can only interact with PHP using FastCGI (as far as I’m aware), we’ll need to install its FPM (FastCGI Process Manager) version, named php-fpm. Along with that, you’ll need a fair amount of PHP extensions Nextcloud depends on. We’ll also be using PostgreSQL to manage Nextcloud’s database, so we’ll also need drivers for that.

On most Debian-based systems, this is done by doing the following:

sudo apt install php7.0-fpm php7.0-gd php7.0-json php7.0-pgsql php7.0-curl php7.0-mbstring php7.0-intl php7.0-mcrypt php-imagick php7.0-xml php7.0-zip

PHP’s FPM should now be installed, configured with the right extensions, and running. You can make sure of that by checking whether the file /var/run/php/php7.0-fpm.sock exists.

In order to be sure that all PHP extensions we just installed are loaded, let’s restart its FPM:

sudo systemctl restart php7.0-fpm

PostgreSQL

Nextcloud needs a database, so let’s install a database management system! I personally use PostgreSQL because of its performance and ease of use. You can install it by simply running:

sudo apt install postgresql

Now, we need to do some basic configuration within PostgreSQL so Nextcloud has a user and a database to connect to. PostgreSQL provides an interactive shell (/usr/bin/psql), with the user postgres acting as a superadmin by default.

Regarding authentication, and more precisely access to the shell from the local host, PostgreSQL defaults to using its peer authentication method, which means you can only authenticate and access the shell as a user if you’re trying from a system user with the same name. As the postgresql package has already created a postgres system user, all that’s left to do is to call the PostgreSQL shell as that user:

sudo -u postgres psql

Note: if that doesn’t work, it might be because your system user lacks privileges. Try that again as a user with more privileges (and/or root).

Now create Nextcloud’s user and database from that shell using SQL queries:

postgres=# CREATE USER nextcloud WITH password 'ncpassword';
CREATE ROLE
postgres=# CREATE DATABASE nextcloud WITH owner 'nextcloud';
CREATE DATABASE
postgres=# \q

In PostgreSQL’s shell, \q is the exit instruction.

Don’t forget to change ncpassword in my example above with a real, strong password for Nextcloud’s database user. Also, keep that password somewhere, because you’ll need it when installing Nextcloud. And while we’re at it…
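If you need a hand coming up with such a password, one option is to generate it with Python's standard secrets module, as in the short sketch below. The generate_password helper is just a name chosen for this example, and the 32-byte length is an arbitrary choice.

```python
import secrets


def generate_password(n_bytes: int = 32) -> str:
    """Random URL-safe password built from `n_bytes` bytes of randomness."""
    # token_urlsafe only emits letters, digits, "-" and "_", so the result
    # contains no quote character and fits in the SQL string literal as-is.
    return secrets.token_urlsafe(n_bytes)


password = generate_password()
print(password)
```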

Installing Nextcloud

Because Nextcloud is only a PHP web app, installing it is as simple as downloading a ZIP archive (you might need to run sudo apt install unzip in order to be able to open the archive). The link of the archive to download can be found here as the target of the big blue “Download Nextcloud” button.

As an example, here’s how your install should go with Nextcloud 13.0.4, authenticated as root:

# Change /srv/http with your web server root (e.g. /var/www)
cd /srv/http

# Download Nextcloud's archive
curl -LO https://download.nextcloud.com/server/releases/nextcloud-13.0.4.zip

# Extract the archive's content
unzip nextcloud-13.0.4.zip

# Make the Caddy user owner of the extracted directory
chown -R www-data:www-data nextcloud

And here you go, with a brand new Nextcloud install located at /srv/http/nextcloud.

Now let’s create Nextcloud’s configuration file in Caddy’s configuration folder. You must first have a domain, or sub-domain, Nextcloud should be accessible at (e.g. cloud.example.tld). Here’s the content of my own /etc/caddy/caddy.conf.d/nextcloud.conf file:

cloud.example.tld {
	tls letsencrypt@example.tld

	root   /srv/http/nextcloud

	fastcgi / /var/run/php/php7.0-fpm.sock php {
		env PATH /bin
	}

	# checks for images
	rewrite {
		ext .svg .gif .png .html .ttf .woff .ico .jpg .jpeg
		r ^/index.php/(.+)$
		to /{1} /index.php?{1}
	}

	rewrite {
		r ^/index.php/.*$
		to /index.php?{query}
	}

	# client support (e.g. os x calendar / contacts)
	redir /.well-known/carddav /remote.php/carddav 301
	redir /.well-known/caldav /remote.php/caldav 301

	# remove trailing / as it causes errors with php-fpm
	rewrite {
		r ^/remote.php/(webdav|caldav|carddav|dav)(\/?)$
		to /remote.php/{1}
	}

	rewrite {
		r ^/remote.php/(webdav|caldav|carddav|dav)/(.+?)(\/?)$
		to /remote.php/{1}/{2}
	}

	rewrite {
		r ^/public.php/(dav|webdav|caldav|carddav)(\/?)$
		to /public.php/{1}
	}

	rewrite {
		r ^/public.php/(dav|webdav|caldav|carddav)/(.+)(\/?)$
		to /public.php/{1}/{2}
	}

	# .htaccess / data / config / ... shouldn't be accessible from outside
	status 403 {
		/.htacces
		/data
		/config
		/db_structure
		/.xml
		/README
	}

	header / Strict-Transport-Security "max-age=31536000;"
}

This file is directly adapted from one of Caddy’s example configuration files, and can be found right here. I made a few changes, which are the domain I gave Nextcloud on the first line, and the email address Let’s Encrypt should send me certificates expiration notices on the second line. I also removed the logging (basically because I don’t want it nor need it, since it’s my personal instance) and moved the vhost’s root to some other place on the disk since I want my web server’s root at /srv/http rather than /var/www (but that doesn’t really matter).
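To get a feel for what the two remote.php rewrite rules in this file do, here is a small sketch that replays their regular expressions with Python's re module. It only approximates Caddy's behaviour (Caddy uses Go's regexp engine, and the rewrite_remote function is made up for this example), but these particular patterns behave the same way in both.

```python
import re

# Same patterns as in the two remote.php rewrite blocks of the Caddyfile.
EXACT = re.compile(r"^/remote.php/(webdav|caldav|carddav|dav)(\/?)$")
NESTED = re.compile(r"^/remote.php/(webdav|caldav|carddav|dav)/(.+?)(\/?)$")


def rewrite_remote(path: str) -> str:
    """Strip the trailing slash from remote.php DAV paths, if there is one."""
    match = EXACT.match(path)
    if match:
        return f"/remote.php/{match.group(1)}"
    match = NESTED.match(path)
    if match:
        return f"/remote.php/{match.group(1)}/{match.group(2)}"
    return path  # anything else is left untouched


examples = {
    path: rewrite_remote(path)
    for path in (
        "/remote.php/dav/",
        "/remote.php/dav/files/me/",
        "/remote.php/webdav/notes.txt",
        "/index.php",
    )
}
```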

Now let’s reload Caddy’s configuration by running sudo systemctl reload caddy. Your Nextcloud should now be accessible from the domain (or sub-domain) you gave it, and opening that in a browser should display a configuration wizard. It can take some time, though, as Caddy needs to talk with Let’s Encrypt to generate the required certificates since it’s the first time it encounters this domain.

Within this wizard, you can provide the login and password for your administrator account. Before clicking “Finish setup”, click “Storage & database” (if you can’t see any other field than the ones requesting the administrator’s credentials), because we need to configure Nextcloud’s access to PostgreSQL.

Leave the “Data folder” field as it is, and fill in the user and password of Nextcloud’s PostgreSQL user (which, in our example, should be “nextcloud” and the password you previously set, respectively), and Nextcloud’s database (which, in our example, should be “nextcloud”). Because you might already have a MySQL/MariaDB/SQLite PHP driver installed, Nextcloud might offer you different database management systems; select “PostgreSQL”.

Now you can click “Finish setup”, and here you have a working Nextcloud instance waiting for you!

Note that Nextcloud also provides a way to do the whole setup process with a command line interface, as documented here.

If you just want a simple Nextcloud instance with local storage, you can stop here. I chose to use OpenStack Swift as Nextcloud’s primary storage backend, so if that’s what you’re looking for, don’t upload anything yet and bear with me as I walk you through that setup!

OpenStack Swift as primary storage backend

As I want to store music and videos in Nextcloud, along with backups from my phone’s camera, I might get quite limited with the local disk space on my VPS. Disk storage can get quite expensive, and is usually really limited on cheap VPSs. A great solution to that problem is cloud-based object storage. The main advantage of such a solution is that it’s very inexpensive (usually around 0.01€/GB/month) and scales almost infinitely.

There are two main solutions of the sort, as far as I’m aware, which are Amazon S3, and OVH’s Public Cloud Storage, which is based on OpenStack Swift. Because I don’t want any of my data on Amazon’s servers (for the same reason I don’t want it on Google’s), and because I usually trust OVH with my data (mainly because I know a few people there, thus have a small peek at what internal use is done with that same data), I decided to go with OVH’s solution.

Before I go any further, I’ll assume you already have an OVH account (which you can create on their website). Direct your browser to OVH’s Cloud manager, then order a new “Cloud project” using the blue “Order” button on the top left corner of the page. Once your project is created, it should appear under “Servers” in the navigation bar on the left (which you might need to click on to uncover the list of projects), click on its name, then on the “Storage” item that just appeared. Click the white “Create a container” button, which takes you to an interface letting you create an object storage container.

Select the datacentre that best fits your use (i.e. the one located the closest to you). Also make sure your container’s type is set to “Private” so the container can’t be accessed without authentication. Name your container and confirm the creation.

Now that you created your container, you’ll need Nextcloud to be able to access it. First, you need to retrieve OpenStack credentials for your project. In order to get that, click the “OpenStack” entry corresponding to your project in the navigation bar on the left of the screen, then “Add user” and input a description (e.g. “nextcloud”). Copy the user’s password somewhere (because OVH’s manager will never display it again), then click on the “…” button on the right of the user’s row, then “Downloading an Openstack configuration file”, and select the datacentre you chose for your object storage container.

This will download a file named openrc.sh containing all remaining pieces of information required for Nextcloud to access your container. Now let’s give it these data, by editing Nextcloud’s configuration file (/srv/http/nextcloud/config/config.php in my case) and adding this block to the config (make sure to keep the last line (which should just be );) at the very end of the file in order to keep the file’s syntax correct):

'objectstore' => array(
	'class' => 'OC\\Files\\ObjectStore\\Swift',
	'arguments' => array(
		'username' => 'OS_USERNAME',
		'password' => 'OS_PASSWORD',
		'bucket' => 'nextcloud',
		'autocreate' => false,
		'region' => 'OS_REGION_NAME',
		'url' => 'https://auth.cloud.ovh.net/v2.0',
		'tenantName' => 'OS_TENANT_NAME',
		'serviceName' => 'swift',
	),
),

With:

OS_USERNAME being the username (or “ID”, as OVH’s manager calls it) of your OpenStack user.
OS_PASSWORD being the password of your OpenStack user.
OS_REGION_NAME being the datacentre’s identifier, which would be GRA3 in the example we’ve seen before. This information can be found on the very last line of the openrc.sh file.
OS_TENANT_NAME being the name of the OpenStack tenant to use, which can be found in the openrc.sh file on the line starting with export OS_TENANT_NAME.
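If you'd rather not hunt for these values by hand, a few lines of Python can pull them out of the file. This is a minimal sketch assuming the file is made of lines looking like export OS_SOMETHING="value"; the parse_openrc function and the sample content below are made up for the example, so double check the result against your own openrc.sh.

```python
import re
from typing import Dict

# Matches lines such as: export OS_REGION_NAME="GRA3"
_EXPORT_RE = re.compile(r"""^\s*export\s+(OS_[A-Z0-9_]+)=["']?(.*?)["']?\s*$""")


def parse_openrc(text: str) -> Dict[str, str]:
    """Collect the OS_* variables exported by an openrc.sh file."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        match = _EXPORT_RE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


# Hypothetical file content, with placeholder values.
SAMPLE = """#!/bin/bash
export OS_AUTH_URL=https://auth.cloud.ovh.net/v2.0
export OS_TENANT_NAME="1234567890"
export OS_USERNAME="abcDEF"
# comments and other lines are ignored
export OS_REGION_NAME="GRA3"
"""

config = parse_openrc(SAMPLE)
```

With your real file, you'd call parse_openrc(open("openrc.sh").read()) instead. The password isn't part of the values read here, which is why it needs to be copied from OVH's manager when creating the user.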

Now you can refresh Nextcloud’s tab in your browser, and, if everything went well, it should display Nextcloud’s interface. Because the storage backend is now remote, it might be slow to display completely, though. Let’s see how we can improve performance using caching.

Caching, caching everywhere

All of these solutions work pretty well together, and are part of recommendations made by Nextcloud regarding caching.

Zend OPCache

PHP already comes bundled with a cache mechanism, which is a PHP opcache named Zend OPCache. Basically, a PHP opcache stores compiled PHP scripts so they don’t need to be re-compiled every time they are called. To enable it and get it to match Nextcloud’s recommendations, uncomment the following lines and adjust the necessary values in your /etc/php/7.0/fpm/php.ini file in this way:

opcache.enable=1
opcache.enable_cli=1
opcache.memory_consumption=128
opcache.interned_strings_buffer=8
opcache.max_accelerated_files=10000
opcache.revalidate_freq=1
opcache.save_comments=1

Then all you need to do is restart PHP’s FPM in order for the changes to be applied:

sudo systemctl restart php7.0-fpm
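Since it's easy to leave one of these lines commented out, here is a small sketch that checks the content of a php.ini file against the values listed above. The opcache_mismatches function is a made-up helper with a deliberately naive parser (it ignores sections and inline comments), so take it as a quick sanity check rather than a proper INI reader.

```python
from typing import Dict, Optional

# The values listed above.
RECOMMENDED = {
    "opcache.enable": "1",
    "opcache.enable_cli": "1",
    "opcache.memory_consumption": "128",
    "opcache.interned_strings_buffer": "8",
    "opcache.max_accelerated_files": "10000",
    "opcache.revalidate_freq": "1",
    "opcache.save_comments": "1",
}


def opcache_mismatches(ini_text: str) -> Dict[str, Optional[str]]:
    """Map each setting that differs to its current value (None if unset)."""
    found: Dict[str, str] = {}
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines, commented-out lines and section headers.
        if not line or line.startswith(";") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        if sep:
            found[key.strip()] = value.strip()
    return {
        key: found.get(key)
        for key, wanted in RECOMMENDED.items()
        if found.get(key) != wanted
    }


# Hypothetical excerpt where one line is still commented out.
SAMPLE = """[opcache]
;opcache.enable=1
opcache.enable_cli=1
opcache.memory_consumption=64
"""

problems = opcache_mismatches(SAMPLE)
```

On a real system, you'd feed it the content of /etc/php/7.0/fpm/php.ini and expect an empty dict back once everything is set.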

The APCu PHP extension

Now we need to add some cache for Nextcloud’s data. The easiest way to achieve that is by installing the APCu PHP extension:

sudo apt install php-apcu

This package automatically installs and configures the extension, so all that’s left to do is restart PHP for this extension to be loaded:

sudo systemctl restart php7.0-fpm

Now let’s tell Nextcloud to use APCu for local cache by adding this line to its configuration file (/srv/http/nextcloud/config/config.php in my case):

'memcache.local' => '\OC\Memcache\APCu',

Again, make sure to keep the file’s last line (which should just be );) at the very end in order to keep the file’s syntax correct.

Redis

With its default configuration, Nextcloud can run into some troubles handling file locks. After switching my instance’s storage backend to OVH’s Public Cloud Storage, I had quite a huge volume of data to re-send from my disk to my instance, which sometimes would result in Nextcloud not being able to upload some files because of (probably) errored file locks.

A single Redis instance can act as a great cache for file locks, and installing it is as simple as running:

sudo apt install redis-server php-redis

Once again, restart PHP’s FPM:

sudo systemctl restart php7.0-fpm

Then let’s tell Nextcloud to use Redis to cache file locks (and how to reach the Redis instance) by adding this block to Nextcloud’s configuration file:

'memcache.locking' => '\OC\Memcache\Redis',
'redis' => array(
	 'host' => 'localhost',
	 'port' => 6379,
),

Again, watch out for the file’s last line, you know how it goes by now 😄

Refresh Nextcloud’s tab in your browser, and it should all run much faster from now on. You can also safely and efficiently start synchronising your data using Nextcloud’s or ownCloud’s client (since the APIs are the same between both systems).

Collabora Online, Docker-less

Now let’s talk about what took me the longest to figure out: Collabora Online (even though once you do, it’s pretty simple, so it shouldn’t take too long to get you set up with it).

Collabora Online is a FLOSS online collaborative office suite based on LibreOffice. It makes an amazing replacement for Google Docs’s text document and spreadsheet editors, and Nextcloud even provides an integration for it.

One of my biggest troubles, though, was that the currently recommended way to install Collabora Online was through Docker. I’m personally not a huge fan of Docker, and find it has some awful design flaws when it comes to resource management.

Unfortunately, although it’s possible to install Collabora Online from Collabora’s Debian/Ubuntu/CentOS/openSUSE repositories, the process of setting it up is barely documented (and, for some parts, not documented at all). After several hours of searching and experimenting with it, I finally managed to get it to work, so here’s an attempt at documenting the whole thing.

The first step to take is to install Collabora’s repository and the required packages. Please bear in mind that the host is running Ubuntu 16.04, so the process might differ if you’re running another GNU/Linux distribution. Instructions for all supported distributions can be found here.

# Install support for HTTPS APT repositories
sudo apt install apt-transport-https

# Import Collabora's signing key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0C54D189F4BA284D

# Add the URL for Collabora Online's repository to /etc/apt/sources.list
# (the append itself must run as root, hence "sudo tee -a" rather than "sudo echo ... >>")
echo 'deb https://www.collaboraoffice.com/repos/CollaboraOnline/CODE ./' | sudo tee -a /etc/apt/sources.list

# Perform the installation
sudo apt update && sudo apt install loolwsd code-brand

Now edit (as root) the /etc/loolwsd/loolwsd.xml file, such that:

- In the ssl section, enable should have false as its value, and termination must be set to true. Because we’ll serve Collabora Online behind Caddy, acting as an HTTPS reverse proxy, we don’t want Collabora Online to serve its content over HTTPS itself (which might cause some trouble with certificates, as the way Caddy handles them wouldn’t let another program access them), but we do want it to tell its clients to use HTTPS URLs instead of HTTP ones, which is exactly what termination does.
- In the wopi section (itself located under the storage section), add a host line containing the domain name you gave Nextcloud, with dots (.) escaped with backslashes (\) because the value is expected to be a regular expression. Make sure the allow attribute is set to true. The storage section contains access control lists (ACLs) telling Collabora Online where it can access files and where it can’t. Because Nextcloud uses WOPI to that end, this line allows Nextcloud to provide storage for Collabora Online using the WOPI protocol. As an example, if Nextcloud was served at cloud.example.tld, the line would look like:

<host allow="true">cloud\.example\.tld</host>
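If you’d rather not escape the dots by hand, sed can do it for you. This is only a small sketch: escape_dots is a helper name I made up for the occasion, and it only takes care of dots, which should be enough for a regular domain name.

```shell
# escape_dots is a hypothetical helper: it prints its first argument with
# every dot prefixed by a backslash, so the result can be used as a regex.
escape_dots() {
	printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Prints cloud\.example\.tld
escape_dots 'cloud.example.tld'
```

The output is the value to put in the host line.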

Let’s restart Collabora Online to let it know about the new configuration:

sudo systemctl restart loolwsd

Now we need to configure Caddy to act as an HTTPS reverse proxy for Collabora Online. Just like for Nextcloud, you need a domain (or sub-domain) to give Collabora (here collabora.example.tld). Here’s what I have in /etc/caddy/caddy.conf.d/collabora.conf:

collabora.example.tld {
	tls letsencrypt@example.tld

	proxy /loleaflet http://127.0.0.1:9980 {
		transparent
	}

	proxy /hosting/discovery http://127.0.0.1:9980 {
		transparent
	}

	proxy /lool http://127.0.0.1:9980 {
		transparent
		websocket
	}

	header / {
		Strict-Transport-Security "max-age=31536000;"
		Content-Security-Policy "default-src 'none'; frame-src 'self' blob:; connect-src 'self' wss://cloud.example.tld; script-src 'unsafe-inline' 'self'; style-src 'self' 'unsafe-inline'; font-src 'self' data:; object-src blob:; img-src 'self' data: https://cloud.example.tld:443; frame-ancestors https://cloud.example.tld:443 'self'"
	}
}

While the whole file is pretty basic, let’s talk about the last instruction, which contains the Content Security Policy HTTP header. Because it’s quite a complex header, I don’t really know the details of how to tweak it to make it work with Collabora Online, and had neither the time nor the motivation to dive into it. Therefore, the header shown here is a plain copy of the one sent by Nextcloud’s demo servers, in which I replaced every instance of demo.nextcloud.com with Nextcloud’s domain (which, in the example shown here, is cloud.example.tld, and which you should replace with the domain you gave Nextcloud). I also improved the last part, allowing Collabora to open iframes pointing to itself, which is required for slideshow presentations. It works fine, but I wanted to point out that this part of the configuration isn’t my own work.
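If you start from a header containing the demo hostname, the replacement can be done with sed rather than by hand. Here’s a rough sketch of the idea: adapt_csp is a name I made up, and the value passed to it below is a shortened example, not the full header.

```shell
# adapt_csp is a hypothetical helper: it replaces every occurrence of the
# demo hostname in a header value ($1) with your own Nextcloud domain ($2).
adapt_csp() {
	printf '%s\n' "$1" | sed "s/demo\.nextcloud\.com/$2/g"
}

# Shortened example value; the real header is much longer.
adapt_csp "img-src 'self' data: https://demo.nextcloud.com:443" 'cloud.example.tld'
```

The result can then be pasted as the header’s value in Caddy’s configuration.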

Now let’s reload Caddy to let it know of these changes in its configuration:

sudo systemctl reload caddy

And let’s make Nextcloud talk to Collabora Online, by installing the Collabora Online Nextcloud app. Documentation on installing Nextcloud apps can be found here. Then, browse to your Nextcloud settings, and click “Collabora Online” under the administration settings section. Now all that’s left to do is input your Collabora Online instance’s URL (which, with the example shown here, would be https://collabora.example.tld), hit “Save” and there you go! Nextcloud is now correctly hooked up with Collabora Online! 😁

If you want to try it out, you can upload a .doc/.docx/.odt/.ods/etc. file to Nextcloud and open it, or create a new document using the “+” button at the top of the Files app and open it. Nextcloud should then redirect you to Collabora Online’s interface, and you can start editing right away!

Conclusion

Now we have a wonderful self-hosted personal Google Drive/Docs alternative only made of FLOSS projects, which is pretty neat!

However, as you might have noticed, this setup does require fairly deep and broad technical knowledge to manage, which is quite sad as it puts the whole thing out of reach of non-technical people. For some parts, the lack of documentation even puts it out of reach of some people who actually have a technical background, though there are other ways to end up with such a setup that involve less wrestling with nonexistent documentation (such as using Docker to run Collabora Online, or using nginx or Apache as the HTTPS reverse proxy).

Because of all that, waving Google & friends goodbye won’t be easy, if not unmanageable, for most people in the current state of things. But to me the future isn’t that dark regarding the rise of privacy-respecting alternatives, since we just saw that the tools themselves exist, and what’s left to do is find a way to make them more accessible. And although this blog post focuses entirely on file hosting and document editing, there are still a lot of other services to build alternatives to, and some projects are already on it, such as PeerTube, an amazing decentralised, federated and peer-to-peer alternative to YouTube, which recently ended a very successful crowdfunding campaign at over 250% of its initial goal.

In my opinion, all that shows quite a promising future for the Internet, in which anyone would at some point be able to benefit from amazing services without having to make unfair compromises such as handing over all of their personal data just to send a message to some friends. It does require some work, though, but as members of the French non-profit organisation Framasoft would say, the journey is long, but the path is free.

Aaaand that’s all for this week! I hope you’ve enjoyed reading this post (which is now my longest one as it’s slightly longer than Enter the Matrix), and if you did, or if you have any kind of feedback regarding it, please feel free to hit me up on Twitter, Mastodon or Matrix, I would love to see what you thought of it!

Regarding the rate these posts go out at, it’s likely that I won’t be able to keep up a weekly schedule forever. This small paragraph isn’t in any way the announcement of a new rate, but is rather here to make it clear that I might skip a week when I deem it necessary. I’ll try to keep it as close to weekly as possible, though, and whatever happens, the best way not to miss any post is to subscribe to the RSS feed or follow me on Twitter or Mastodon (which I’ll try to share more stuff on).

Anyway, see you next week (hopefully) for a brand new blog post! 😄

Random tools #1

Brendan Abolivier — Sun, 08 Jul 2018 00:00:00 +0200

Over time, I came to encounter a few tools, addressing different use cases and/or issues. Recently, I started listing these tools in order to share some of them with you, whether you already know them or not, in smaller posts like this one, without going too much in depth with them.

I’m giving posts like this one two objectives: help people discover tools that often help me with small or bigger tasks, sometimes on a daily basis, and allow me to effectively share some of my knowledge with everyone (which is the main goal of this blog) while requiring less work and effort than the other posts you can usually find around here, which mainly focus on one specific topic.

So let’s get things started with this first episode of “Random tools”!

Tilix

Let’s start with one tool that a lot of people already know: Tilix. It’s a terminal for GNU/Linux systems using GTK+3, which you might also know by its older name: Terminix.

Tilix is currently my main terminal on my computer, and almost the only one I use, period (which, given my current job as a systems engineer, makes it one of my main work tools). Among the amazing features it offers is the ability to split the screen into as many panes as you want, making it super easy to work on a task requiring data from several hosts at the same time, or to check in real time the effects of a patch on a system while applying it, without losing sight of either, for example.

Add to that Tilix’s “Quake mode”, allowing you to make Tilix appear at the top of your screen and disappear from it without losing the current session (just like the Quake console minus the animation), and that makes the perfect tool for me to work with, because I never have to waste loads of time trying to bring my terminal back to my desktop’s foreground while juggling a few other windows.

Keyboard shortcuts for splitting are Ctrl+Alt+R for a vertical split, and Ctrl+Alt+S for a horizontal one. Quake mode can be enabled by editing the keyboard shortcut bound to Tilix in your system settings (or adding one if there’s none yet) so it runs tilix --quake (or tilix -q) instead of tilix.

TabSearch

This tool is a Firefox extension that will help you save a lot of time if you always keep loads of tabs open at once, sometimes even spread across multiple windows. In this kind of configuration, it’s easy to waste time trying to remember where a specific tab is located on a tab bar, and what window this tab bar belongs to.

If added to Firefox’s top bar, TabSearch will provide you with a small graphical user interface that will allow you to browse through every tab that is open in the current window, open in another window, or has been recently closed. This interface also provides a search feature, and can be toggled by hitting Ctrl+Shift+F. You can then browse through the tabs using the arrow keys or search within them, and hit Enter when you find the tab you want to switch to.

This screenshot isn’t mine and comes from the extension’s page on AMO

The extension also provides you with other ways to interact with tabs, which are listed in its documentation. It will also add to Firefox’s top bar a count of the tabs that are currently open in the window.

In the next episode

In order to make these posts quick to write, I’ve decided to only cover two tools per episode, so that’s the end of this first one! I’m not sure when the next one will be released, but I already have in mind the tools I’ll talk about.

If you like that one or want to share some feedback on it with me, feel free to hit me up on Twitter, Mastodon or Matrix!

I’ll see you before then, next week in fact, for what should be a technical walkthrough. Until then, have fun, and see you next week! 😄

Making party time

Brendan Abolivier — Sun, 01 Jul 2018 00:00:00 +0200

A month ago today, the first edition of immersion{s}, a new events brand mainly focused on trance parties in Brest (Brittany), happened. This first edition was organised, as all the next ones will be, by Trancendances, a French non-profit I co-founded and have been the president of since 2014, focused on promoting trance music all around France.

In the past, we worked on a few events around Paris and promoted others all around the country, but this was the very first event we organised from scratch, as all of the gigs we previously worked on were produced by people outside of our organisation, who sometimes already had some experience with such things and had already handled all the planning and project management.

Organising an event isn’t an easy task, and that’s even more true when it’s your first one. It took us over 6 months of hard work to make this one happen, and I’m not even counting all the previous failed attempts. During all that time, up to the few hours after the party ended, we’ve learned a lot on a lot of topics we sometimes didn’t even expect to have to deal with, and this new knowledge is what I’d like to share with you this week.

Alex Wackii playing at immersion{s} - photo credit: Joffrey Lartigaud / Trancendances

Your party will be a financial failure

Let’s get the bad news out of the way first. If the event you’re working on is your first one, you’re unlikely to earn a single cent from it. There’s one really good reason for that: you don’t know what you’re doing, and even in the friendliest environment, you’re likely to screw up more than once. But that’s okay.

In my opinion, the best way to learn how to do something is to do it, not with the expectation to crush it but to learn the processes, the rules to follow, who to speak to, etc. And that’s exactly what happened with this event. Financially speaking, it was kind of a disaster, since we only sold enough tickets (including at the door) to cover about 50% of our total expenses.

On the other hand, however, we spent the last months creating our own brand, putting together our own team, learning how this kind of stuff is done, who to get in touch with for each specific need, what questions to ask, etc. We tried some stuff that worked and some that didn’t (I’ll dive deeper into that later), so that now we know how to do some important tasks right because we got them wrong the last time, or because we got them right on the first try. And we learned all of that stuff so that tasks that took us weeks to work out then will only take us five minutes and an email now.

It can be a hard thing to process, because such an event usually costs a lot of money, representing thousands of euros, and you’re not likely to make them back, especially if it’s your first shot and you don’t really know what you’re doing. But it’s important to be aware of that as early in the process as possible, because it will allow you to set better, more realistic objectives, which will allow you to look at the upcoming event with a better focus on what actually matters: creating reproducible processes, building a team and learning about the local audience and how to interact with it. In my opinion, it’s better to see this not as a financial loss, but rather as an investment in successes that will eventually come in the long run.

There’s something you can do to avoid losing too much, though, and that’s finding sponsors. It might be quite easy or the hardest thing ever depending on the local ecosystem. Sponsoring basically consists of contacting companies and asking them to give you money to make your event possible in exchange for promoting them in your communication. Although we didn’t have any sponsor for this one (because we messed up somewhere with our planning and reached out too late), we did get the opportunity to get in touch with a few potential sponsors with whom we might work on future events.

Quite similar to sponsorships are public subsidies, which come from public services. In fact, we got the chance to work in partnership with the city, which supports local event organisers. Although we didn’t get money from them, they handled the renting of the venue and most of the sound and lighting equipment, which represents a saving we estimate at around 1k€ at least (which is a lot, given the event’s budget).

Both sponsorships and public subsidies do happen, and some companies or public services might be more than happy to promote your event. Do not hesitate to reach out to as many people as you can: the venue’s city, the state, the government, the EU, a big local company, a bank, a clothing brand, etc. One of the things I’ve learned not only from this event, but from running Trancendances in general, is that a “no” is very unlikely to hurt you, and a “yes” might offer you opportunities you’ve never dreamt of, so do not give up on getting in touch with an entity because “they’re too big and I’m too small”, “they can’t be interested in this kind of thing”, or any other reason. And if you can’t think of anyone, the way to find potential sponsors or public services that might support your event would be, in my opinion, to have a look at flyers or posters for other similar events, since sponsorships and public subsidies usually involve including some logos there.

We are human after all

One of the most important parts of every project is the people you surround yourself with. Because such a project usually represents a lot of work and effort, especially if it’s your first time, you can’t really afford to be the only person on board. And for the same reason, you need the people you surround yourself with to be trustworthy, because you can’t afford to always be looking over someone else’s shoulder.

In the case of immersion{s}, we were actually a small team of two people: Raphaëlle, who manages everything related to events at Trancendances, and myself. Fortunately, Raphaëlle and I have known each other since we were children, and have been friends since our teenage years, so we both know pretty well how to trust and work with each other. Working in such conditions makes the experience much better, as neither of us would ever question the other’s work (once we had discussed how we were going to do things), and we had quite some fun working together, making what would otherwise have been an exhausting, stressful and trying experience much more bearable.

In such an adventure, it’s the smallest things that matter, such as being able to tell when the others aren’t in the mood to work, are going through a difficult time, etc. Sometimes, a work meeting would turn into a good time between a couple of friends who had some pressure to release, nibbling pistachios and drinking some tea.

One very important thing that you must never forget is that, in the end, an event is just a group of people, usually not even that many, who get along quite well and are passionate about something they want to share with the rest of the world. It’s not technology or money that drives such machinery, it’s the humans who work hard behind the scenes, often as volunteers, so everyone has a good time.

As part of the organising team, one of your de facto responsibilities is to manage the other humans you’re working with. Always pay attention to them, see how they’re handling the workload and pressure, and be able to detect and react to something going wrong. For example, if one of your teammates is having a hard time in their personal life, and starts panicking over what would appear as a silly thing to you, you need to be able to see it and reassure them, and to take the necessary actions so they won’t be confronted with the struggle they’re panicking about, rather than teasing or making fun of them about it. Working under pressure isn’t an easy thing, and not everyone reacts the same way in such conditions. Never mock, always empathise.

Over the course of the process, we eventually ended up requiring more pairs of hands, especially on the event’s night as there are many things to take care of, such as selling tickets, ensuring the artists have all they need, or other specific tasks depending on the venue, partners, etc. We reached out to a few friends of ours who offered us their help, friends we’ve known for at least a few years, long enough for us to trust them entirely. As a result, we were able to run the event without having to worry for a single second about whether someone was doing their job the right way. The carefree feeling it brought was something really amazing, which spared us from wasting a lot of time looking over everyone’s shoulder, and saved us a lot of energy and effort during the night. For some of these friends, this one-off help even ended up turning into a longer-term investment as they’re joining the permanent team to work with us on future events.

Always trust, always care, and always think bigger; to me, that’s how you build a great team.

Ask the right questions

The most frustrating part about doing something for the very first time is not knowing what you’re doing. You might find yourself a crazy number of times in a position where you realise there’s one piece of information you’re missing to complete an important task, or that you’ve been wrongly assuming something, forcing you to improvise when you could have planned the whole thing perfectly if only you had asked.

Situations like these happened quite a lot during the months leading up to immersion{s}, from having to open the ticket shop without knowing the room’s capacity, to setting the timetable before being made aware of the hours during which the venue would allow us to play music, including planning the setting up of a cloakroom before learning, on the event’s day, that the venue already had one that they operate themselves, and other situations like those.

One great way to prevent those from happening is, before you even start planning the event, to sit with your teammate(s) and think about how everything is going to work out, as far as you know, and start listing everything you need to know for every step. Here’s a little example of what an extract of such a list might look like:

We want to do a music event at this specific venue
- What dates would be great for that?
- How do you book the venue?
- What’s the venue’s capacity?
- When can we access the venue? When do we need to leave?
- When is the soonest the music can start? When is the latest it needs to stop?
- Does the venue deduct a percentage from the ticket sales? If yes, how much?
- Does the venue give us a percentage from the bar’s sales? If yes, how much?
- Does the venue provide us with the security staff and technicians? If not, who do they usually work with?
- If they don’t provide us with the security staff, what are the security requirements: how many people, with what certifications, etc.?
- Do they operate a cloakroom? If not, is there somewhere at the venue we can set one up during the event?
We want to receive some support from the city
- Who do we need to speak to?
- What would they offer us?
- If they rent the venue for us, is there something we need to do, or do they handle that with the venue directly?
- If they rent the equipment for us, can they provide us with a list of everything that includes? If not, who do we need to ask for it?
- If the help doesn’t consist of them directly giving us money, how much is the help worth?
[…]

Of course, because “as far as you know” might not lead you very far, the list might not be exhaustive, but the goal here is to create at least a base on which you can iterate afterwards. Then, each time you start figuring out how something might happen, think of all the pieces of information you need for that and add them to the list.

Not all questions are for the event’s partners (venue, sponsors, etc.). You must also add to the list all the questions you need to ask yourself and your teammates during the process, from “in what time frame do we need to have this specific task done” to the simple “we need a stamp to identify the people who already paid for their tickets, where do we find one”, including “how many posters do we need to order, in what size and for which use”, and anything that might be relevant to your project.

Asking questions is the first step towards efficient planning, so make sure to make it your very first step into the project, to take the time to make the list as exhaustive as possible, and to continuously update it with new questions and answers. The goal here is to have the broadest, most complete picture of your project as soon as possible so you don’t move forward blindly and are able to make the best decisions. Knowledge is power, here as much as everywhere else.

The not-so-retro-spective

On top of answering questions, one very important thing to do is to often, if not continuously, ask new ones. You need to question everything you’ve done as a team, and how it turned out. Was it a good idea to put up half of our very pricey outdoor posters three weeks before the event, not knowing when the monthly cleanup by municipal services would happen, and with other events happening before ours, whose organisers might also want to put up posters? Probably not. Was it a good idea to pay a security company to provide us with the security staff, which would be more expensive but wouldn’t require our team, which has no legal expertise, to find qualified people and write them contracts? Most likely.

Obviously, a retrospective on the whole event can only happen once the event is over, but you also need to do smaller ones, reacting in almost real time to every one of your choices and their consequences, because a retrospective on the whole project will help you make the next one better, but smaller ones during the process will help you make fewer mistakes between then and the big date.

One very important, and to some obvious, topic these smaller retrospectives must address is your communication campaigns. Alongside planning, communicating efficiently about your event is one of the most important things, and doing it wrong can ruin months of work. That’s why it’s important to constantly ask yourself whether you made the right decisions up to that moment, and how to fix things if not. Always ask yourself if your target audience is defined correctly, if your campaigns match such an audience, etc. One of our biggest mistakes might have been not asking ourselves such questions, and wanting to do too much while not knowing what we were doing. This led to a few mistakes, such as investing way too much in online promotion and not enough in local advertisement, messing up our local outdoor display, running short of flyers too early, etc.

All that resulted in a huge loss of money and effectiveness, and while most of my friends were telling me they were constantly reminded of the upcoming event, most of our local target audience wasn’t even aware such an event was happening. This could have been avoided by asking ourselves more questions such as “what audience is the most likely to show up if they’re aware there’s a party going on that night” or “how do we interact with them”, by not settling for vague answers with a “we’ll figure this out later” mindset, and by planning every single step of the communication process beforehand: how much we should spend on digital advertisement and when the promotions should go live, at which exact locations we should display physical outdoor advertisements, how many posters we need for each location, etc. These questions might not be obvious the first time, so it’s easy to get it wrong at the beginning, and taking the time as often as possible to ask yourself if you did everything right, taking into account what you’ve learned in the process, might help you set things right while there’s still time. And don’t stop at broad answers thinking the details will figure themselves out later, but rather dig as deep as you can. It requires more work, but in the end you’ll have more control over your project and less frustration caused by not knowing what’s going on.

Towards the next adventure

Big projects such as organising an event from scratch for the first time are really exciting and interesting, as they teach you a lot about multiple topics. In this case, working on an electronic music party, which can be very briefly summed up as DJs playing DJ sets for a few hours, has involved project management, human management and resources, communication, branding, budgeting and a few other fields that might not look related to it at first sight.

It has been a fascinating adventure, a long, tiring, trying journey that we spent months following, which gave us some amazing results. I’ve rarely been more proud of achieving something, and it’s been a very fun and rewarding experience which will act as a baseline for future events.

I’m really happy to have been able to make this dream come true with a team of amazing people, and I really look forward to the next one, which we’re already thinking about.

Before concluding this post, I’d like to thank, once again, everyone who worked on this project by my side, along with everyone who helped us make it happen with advice, feedback, support, by sharing the event and spreading the news, or just by being there with us that night, enjoying the moment. It all means so much to me.

I’d also like to apologise for my absence from this space over the past few weeks. The last couple of weeks before immersion{s} were a heck of a rush, and I very much needed the month of June to recover (not even considering the infection I got in the middle of the month). Now I’m back on track, and I’ll try to keep up with my “One post a week” challenge during the summer.

As always, if you liked this post or want to share some feedback on it with me, feel free to hit me up on Twitter, Mastodon or Matrix!

See you next week for a new post!

Manage your passwords with pass

Brendan Abolivier — Mon, 21 May 2018 00:00:00 +0200

Let’s talk about passwords. Basically, those are the things you’re supposed to keep different for each account you have on the Internet. Either you don’t do it, do it partially (like a mix between a leet-speak version of the service’s name and a fixed part, with an uppercase letter and a character that’s neither a letter nor a number somewhere, such as mySup3rw3bs!t3MyUsualPassword), or have a password manager do it for you.

I’ve had quite a hard time finding a password manager that fits my needs. During the past few years, I’ve tried quite a few of them, and eventually stopped using them one after the other. LastPass because of its poor UX on points that mattered to me, and because I couldn’t feel safe putting that much trust in such a centralised and closed service. KeePass because it was a pain to synchronise my database between all devices. Passbolt because it focuses on a team use case and I want something designed for individuals. You name it.

After a while, I started trying to get a description of what I wanted. To me, the ideal password manager must be:

free software
security audited
synchronisable across devices
self-hostable
easy to set up
easy/quick to use

I realised that was quite an idealistic description, and thought I was done with password managers. To be fair, to this day, I still haven’t found one that matches all of my criteria, though the one I’ll be talking about in this post gets quite close.

Also, let me get things straight first: the last two points in the list above are using the relative definition of “easy”, i.e. what’s easy to set up/use to me, as someone who has some technical knowledge and background. Specifically, the solution I’ll be writing about in this post would be labelled as quite painful to use by somebody who isn’t used to bash, git et al.

It’s all about simplicity

Pass is a minimal and very simple password manager which consists of a 699-line bash script (including comments). It stores your passwords as files in a given directory (the “store”), and encrypts them using GnuPG. That way, you can organise your passwords as you want, in as many sub-directories as you wish, and they will be stored, possibly along with some metadata, in a somewhat-secure fashion.

Notice that I made a compromise on my criteria for an ideal password manager here because, as far as I know, pass hasn’t had a security audit yet (only GnuPG has). I consider it safe enough for my personal use, though.

Pass also has both CLI and GUI clients for most platforms, including OS X, Android, iOS and Windows, and also some browser extensions, but I’ll only cover the basic command-line use of the bash script here. All clients and extensions can be found here, though.

Creating the store

I won’t cover installation, which is already covered on pass’s website and should be quite easy on most systems.

You’ll also need a GPG key, which is the pass equivalent of the store’s master key/passphrase. If you haven’t got one, you’ll have to generate it, which I also won’t cover here since there are already great resources for that on the Internet.

Once pass is installed, let’s initialise a store with

pass init GPG-ID

Here, GPG-ID is the identifier of the key you’ll use to encrypt your passwords. It can be the key’s fingerprint (in the case of my own key, E1D4B7457A829D771FBA8CACE860157274A28D7E) or one of its associated email addresses (which, in my case, can be hello@brendanabolivier.com).

It will then initialise a store in the ~/.password-store directory, which is created if it doesn’t exist. This is the directory pass will work in for every call you make in the future. Its path can be overridden by setting the environment variable PASSWORD_STORE_DIR.

Adding an existing password

Because you had accounts on the Internet before you started using pass, you might want to store their passwords in your brand new password store.

To insert a password into your password store, just run

pass insert PASSWORD-NAME

Where PASSWORD-NAME is the name you’ll give to this entry. If you want to manage your entries with sub-directories, the entry name can also be a path relative to the password store (e.g. pass insert hostProviders/ovh will create an entry in the sub-directory “hostProviders”). If a sub-directory doesn’t exist, pass will create it for you.

It will then prompt you for your password, which you can just paste and validate, and an encrypted copy of it will be stored in the password store. For example, if the entry name is hostProviders/ovh, it will store an encrypted copy of my password in ~/.password-store/hostProviders/ovh.gpg.
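To make the mapping between entry names and files a bit more concrete, here’s a minimal Python sketch of it. It isn’t part of pass (which is a bash script); the store_dir and entry_path helpers are names I made up for the illustration, and the only things taken from pass are the default directory, the PASSWORD_STORE_DIR variable and the .gpg suffix.

```python
import os
from pathlib import Path


def store_dir(env=os.environ):
    """Return the store directory, honouring PASSWORD_STORE_DIR if it is set."""
    override = env.get("PASSWORD_STORE_DIR")
    return Path(override) if override else Path.home() / ".password-store"


def entry_path(name, env=os.environ):
    """Map an entry name such as 'hostProviders/ovh' to its encrypted file."""
    return store_dir(env) / (name + ".gpg")


# Hypothetical helper names; only the path layout comes from pass itself.
print(entry_path("hostProviders/ovh"))
```

Nothing is read or decrypted here, the sketch only shows where pass is expected to put the file.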

You might also want to add metadata to your password, such as the account’s login, or the service’s URL, which some pass clients can use. You can do that by appending the -m flag to your pass insert call (before the PASSWORD-NAME), which allows you to write your entry using more than just one single line and save it using Ctrl+D.

In case of multiline entries, it’s usually better to start an entry with the password as the first line’s only content, and then add your metadata on the following lines. The reason is that, to pass, a non-multiline entry is just a one-line long file with the password as the only content. Having the first line only include the password will help pass handle multiline entries the same way as single-line ones.

In the end, your multiline entry would look like this:

mySup3rw3bs!t3P4ssw0rd
login: me@me.tld
url: mysuperwebsite.com
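If you ever want to read such an entry from your own scripts, the layout above is easy to split. Here’s a small Python sketch, assuming the entry has already been decrypted into a string and that the metadata follows the “key: value” convention shown above (which is a habit shared by some clients, not something pass enforces). The parse_entry function is a hypothetical helper of mine.

```python
def parse_entry(text):
    """Split a decrypted entry into its password and a metadata dict."""
    lines = text.splitlines()
    password = lines[0] if lines else ""
    metadata = {}
    for line in lines[1:]:
        key, sep, value = line.partition(":")
        if sep:  # lines that are not "key: value" pairs are ignored
            metadata[key.strip()] = value.strip()
    return password, metadata


entry = "mySup3rw3bs!t3P4ssw0rd\nlogin: me@me.tld\nurl: mysuperwebsite.com\n"
password, metadata = parse_entry(entry)
print(metadata["login"], metadata["url"])
```

Splitting on the first colon only means a value such as a full URL keeps its own colons.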

It might be worth noting that if you come from another password manager, there might be a script to migrate all of your entries to pass at once instead of doing it manually, one at a time. Migration scripts for most password managers can be found here.

Creating passwords

Of course, one of the good things with having a password manager is having it generate different strong passwords for each service you have an account on. Generating a password with pass is as easy as calling:

pass generate PASSWORD-NAME

As with pass insert, this will create a .gpg file at the desired location, and will this time fill it with a 25-character long password. If you want the password length to be something other than 25, you can tell pass by appending the desired length after the PASSWORD-NAME.

Once the password is generated, pass will print it into the terminal, so you can copy it. If you don’t want it to appear on your screen, you can also append the -c flag to your call, right before the PASSWORD-NAME. Pass will then copy it into your clipboard, which it will clear after 45 seconds (the delay can be changed by setting the environment variable PASSWORD_STORE_CLIP_TIME to the number of seconds you want).

Another useful trick is appending metadata to the newly generated password, like we’ve seen before. It’s obviously possible to edit an existing password (using pass edit PASSWORD-NAME, which will open an unencrypted copy of the password entry in a text editor: vi, unless the EDITOR environment variable says otherwise), but I personally prefer to never have pass printing my password on a screen.

To achieve that, we can first call pass insert -m PASSWORD-NAME, which will prompt for the password and its metadata. Leave the first line blank, fill the following ones with metadata, and hit Ctrl+D. We can then call pass generate -ci PASSWORD-NAME. Note the -i flag (which stands for “in place”): it tells pass that the entry we want to generate a password for already exists, in which case pass will replace the entry’s first line with the newly generated password and leave the rest of the file as it was.

You now have your newly generated strong password copied to your clipboard, and the desired metadata in its file.
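For the curious, the effect of that “in place” generation can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not how pass does it internally (pass is written in bash and works on the encrypted file); the generate_password and replace_first_line names and the character set are my own assumptions.

```python
import secrets
import string


def generate_password(length=25):
    """Draw `length` random characters from letters, digits and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))


def replace_first_line(entry, new_password):
    """Swap the first line of an entry, keeping every other line untouched."""
    _, sep, rest = entry.partition("\n")
    return new_password + sep + rest


# An entry created with a blank first line, followed by its metadata.
entry = "\nlogin: me@me.tld\nurl: mysuperwebsite.com\n"
updated = replace_first_line(entry, generate_password())
```

After this, updated starts with a fresh 25-character password and still ends with the two metadata lines.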

Retrieving passwords

It would be quite useless to have all your passwords stored in your store without being able to retrieve them and use them. As with everything in pass, this is quite easy:

pass show PASSWORD-NAME

Which you can even shorten as:

pass PASSWORD-NAME

Pass will then print out the corresponding password, along with its metadata (if it has any), in the terminal. If you’d rather have the password copied to your clipboard than printed out, just add -c before the PASSWORD-NAME, just like with pass generate. Here too, pass will clear the clipboard after 45 seconds, a delay that can be overridden using the PASSWORD_STORE_CLIP_TIME environment variable.

You might also prefer not having to fire up a terminal and type a command line in order to get a password you’ll then copy to the website. In that case, you might be interested in using one of the few browser extensions available, such as passwe for Firefox and Chrome, PassFF for Firefox or Browserpass for Chrome, which you can use to automatically fill in login forms using passwords from your store and their metadata. For what it’s worth, I’ve been using PassFF for quite a while now, and it works pretty well.

Synchronising passwords

Because I always have more than one device, one thing I’m really looking for in a password manager is its ability to synchronise with other devices easily. This is the reason I stopped using KeePass: having to manually copy your database across all of your devices each time you add/remove/change an entry was really painful.

Where I become really picky is that I don’t want to be stuck with a proprietary service’s hosting such as LastPass’s or Dashlane’s. I want to control where I send my passwords, who can access them, etc.

Once again, pass chooses simplicity by integrating nicely with git and letting it do all of the versioning and networking. This is, obviously, optional.

If you want to synchronise your own password store with a git repository, create an empty one somewhere (I personally did that on one of my own servers, but a GitHub/GitLab/Gitea/etc. repository will, of course, work as well), grab its URL and run

pass git init
pass git remote add origin REMOTE-URL

Where REMOTE-URL is the repo’s URL.

This will initialise a local git repository at the root of your password store, and also create a commit containing all your store’s content. Note that the syntax of the pass git commands follows that of the standard git commands. That is because pass git will actually run every git command you give it in the store, whatever your current working directory is. This means that you can basically use every git command you want: as long as you prefix them with pass, the commands will affect your password store and nothing else.

Now that the git repository is initialised in the password store, each time you create, remove or edit a password, pass will automatically create a commit for that, so you only have to run pass git push now and then to synchronise your local password store with your remote copy.

In my case, I like to have a copy of my password store on my phone, and to manage it using the Password Store Android app (available on F-Droid and Google Play), to which I just have to give the URL and credentials required to clone the repository, and the GPG key to use when trying to decrypt passwords, and I can instantly use my passwords on my smartphone.

Of course, since pass manages your passwords files and directories, you can have multiple sub-directories in your password store, each one of them having a different git remote. For example, most of my passwords are pushed to a remote repository on a server I own, except for one folder containing internal passwords we use at CozyCloud, which are synchronised with an internal repository we have.

To infinity and beyond

Of course, I haven’t described all the features pass has. This post only describes the few I personally use, along with some setup instructions, and doesn’t really cover the various ways in which one can use it. Now it’s your turn to play with it! 😉

Thanks for reading through this post, and huge thanks for the amazing feedback and attention you gave following my latest post on Matrix, that’s hugely appreciated. As always, if you wish to chat with me about this post, feel free to hit me up on Twitter, Mastodon or Matrix, I’d love to hear your thoughts about this one!

Also, the length and complexity of the said latest post brought some fatigue with it, which explains this one’s lateness. Taking that into account, and given the fact that I’m working really hard on the Trancendances presents immersion{s} party in Brest that’s taking place in less than two weeks, I don’t think I’ll be publishing any more posts in the next couple of weeks (except maybe a very small one on a couple tools I discovered recently, but that’s far from sure).

I’ll see you after that, most likely in a bit less than three weeks, for a brand new blog post (of which I already know the topic, and it’ll be a completely non-tech one, for a change!). See you then!

Enter the Matrix

Brendan Abolivier — Sun, 13 May 2018 00:00:00 +0200

As you might know if you’ve been following me on Twitter for some time (or if you know me in real life), I’m very fond of free software and decentralisation. I love free software because it matches the philosophy I want to live by, and decentralisation because it enlarges a user’s freedom and individuality, and I find working on decentralised systems fascinating. Doing so forces one to change their way of designing a system entirely, since most of the Internet now consists of centralised services, which leads people to only learn how to design and engineer these.

Today I want to tell you about one of my favorite decentralised free software projects right now: Matrix. Let’s get things straight first: I’m talking about neither the science-fiction franchise, nor the nightclub in Berlin. Matrix is a protocol for decentralised, federated and secure communications, created and maintained by New Vector, a company split between London, UK and Rennes, France (which I joined for an internship in London during the last summer). It’s based on RESTful HTTP/JSON APIs, documented in open specifications, and is designed to be usable for anything that requires real-time-ish communications, from instant messaging to IoT. Some people are also experimenting with using Matrix for blogs, RSS readers, and other stuff that’s quite far from what you’d expect to see with such a project. Despite that, however, it’s currently mainly used for instant messaging, especially through the Riot client (which is also developed by New Vector).

Matrix also distances itself from the “yet another comms thing” argument with its philosophy: it’s not another standard for communications, but one that aims at binding all communications services together, using bridges, integration et al. For example, at CozyCloud, we have a Matrix room that’s bridged to our public IRC channel, meaning that every message sent to the Matrix room will get in the IRC channel as well, and vice-versa. I’m even fiddling around in my free time to bridge this room with a channel on our Mattermost instance, to create a Mattermost<->Matrix<->IRC situation and allow the community to interact with the team without members from the latter having to lose time firing up another chat client and looking at it in addition to internal communications.

There’s also been quite some noise around Matrix lately with the French government announcing its decision to go full Matrix for their internal communications, using a fork of Riot they might also release as free software to the wide world in the future.

Under the hood

It’s great to introduce the topic, but I guess you were expecting more of a technical and practical post, so let’s get into how Matrix works. Quick disclaimer, though: I won’t go too much in depth here on how Matrix works (because if I did, the post would be far too long and I’d never get time to even finish it in a week), and will mainly focus on its core principles and how to use it in the most basic way.

As I mentioned before, Matrix is decentralised and federated. The decentralised bit means that you can run a Matrix server on your own server (quite like other services such as Mattermost), and the federated one means that two Matrix servers will be able to talk to one another. This means that, if someone (let’s call her Alice) hosts her own Matrix server at matrix.alice.tld, and wants to talk to a friend of hers (let’s call him Bob), who also hosts his own Matrix server at matrix.bob.tld, that’s possible and matrix.alice.tld will know how to talk to matrix.bob.tld to forward Alice’s message to Bob.

Glossary break:

There are a few server types in the Matrix specifications. The homeservers (HS) are the servers that implement the client-server and federation APIs, i.e. the ones that allow actual messages to be sent from Alice to Bob. In my example, in which I was referring to homeservers as “Matrix servers”, matrix.alice.tld and matrix.bob.tld are homeservers. Among the other server types are the identity servers (IS), which allow one to host third-party identifiers (such as an email address or a phone number) so people can reach them using one of them, and application services (AS), which are mainly used to bridge an existing system to Matrix (but are not limited to that). In this post, I’m only going to cover the basic use of homeservers, since knowledge about the other types isn’t required to understand the basics of how Matrix works.

In the Matrix spec, both Alice and Bob are identified by a Matrix ID, which takes the form @localpart:homeserver. In our example, their Matrix IDs could respectively be @Alice:matrix.alice.tld and @Bob:matrix.bob.tld. Matrix IDs' form actually follows a broader one, taken by any Matrix entity, which is *localpart:homeserver, where * is a “sigil” character which is used to identify the entity’s type. Here, the sigil character @ states that the entity is a Matrix ID.

Three roomies on three servers

Now that we have our two users talking with each other, let’s take a look at how a third user (let’s call him Charlie), also hosting his own homeserver (at matrix.charlie.tld), can chat with both of them. This is done using a room, which can be defined as the Matrix equivalent of an IRC channel. Like any entity in Matrix, the room has an ID which takes the general form with the ! sigil character. However, although it contains a homeserver’s name in its ID, and unlike a user ID, a room isn’t bound to any homeserver. Actually, the homeserver in the room ID is the homeserver hosting the user that created the room.

Technically speaking, if Alice wants to send a message in the room where both Bob and Charlie are, she’ll ask her homeserver to send a message in that room. Her homeserver will look up in its local database which homeservers are also part of that room (in our example, Bob’s and Charlie’s), and will send the message to each of them individually (and each of them will display the message to their users in the room, i.e. Bob’s server will display it to Bob). Then, each homeserver will keep track of the message in its local database. This means two things:

Every homeserver in a room keeps a copy of the room’s history.
If a homeserver in a room goes down for any reason, even if it’s the homeserver which has its name in the room’s ID, all of the other homeservers in the room can keep on talking with each other.
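These two properties can be illustrated with a toy, in-memory model in Python. To be clear, this is nowhere near a real homeserver: it ignores the federation API, signatures, retries and everything else, and the Homeserver class and its methods are names I made up just to show the fan-out.

```python
class Homeserver:
    def __init__(self, name):
        self.name = name
        self.rooms = {}  # room ID -> {"servers": set of names, "history": list}

    def join(self, room_id, servers):
        """Record the room locally, with every participating homeserver."""
        self.rooms[room_id] = {"servers": set(servers), "history": []}

    def receive(self, room_id, event):
        """Store an event, whether it was sent locally or by another server."""
        self.rooms[room_id]["history"].append(event)

    def send(self, room_id, event, network):
        """Store the event, then forward it to each other reachable server."""
        self.receive(room_id, event)
        for peer in self.rooms[room_id]["servers"] - {self.name}:
            if peer in network:  # a server that is down is simply skipped
                network[peer].receive(room_id, event)


names = ["matrix.alice.tld", "matrix.bob.tld", "matrix.charlie.tld"]
network = {name: Homeserver(name) for name in names}
room_id = "!room:matrix.alice.tld"
for server in network.values():
    server.join(room_id, names)

network["matrix.alice.tld"].send(room_id, "Hello Bob and Charlie!", network)

# Alice's homeserver goes down: Bob and Charlie can still talk in the room.
del network["matrix.alice.tld"]
network["matrix.bob.tld"].send(room_id, "Still here, Charlie?", network)
print(network["matrix.charlie.tld"].rooms[room_id]["history"])
```

Each server ends up with its own copy of the history, and losing the server named in the room ID doesn’t stop the two others.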

Broadly speaking, a room can be schematised as follows:

This image is a capture of the interactive explanation on how Matrix works named “How does it work?” on Matrix’s homepage, which I’d really recommend checking out. That’s why the Matrix IDs and homeservers' names aren’t the same as in my example.

For what it’s worth, I took a shortcut earlier since, in the Matrix spec, 1-to-1 chats are also rooms. So technically speaking, Alice and Bob were already in a room before Charlie wanted to chat with them.

It might also be worth noting that a room can have an unlimited number of aliases, acting as addresses for the room, which users can use to join it if it’s public. Their syntax takes the general form we saw earlier, using # as the sigil character. This way, !wbtZVAjTSFQzROqLrx:matrix.org becomes #cozy:matrix.org, which, let’s be honest, is much easier to read and remember. As with a room’s ID, its homeserver part is the homeserver hosting the user who created the alias, which means that I can create #cozycloud:matrix.trancendances.fr if I have a high enough power level, as I’m using this homeserver.

As I quickly hinted at, a room can be either public or private. Public rooms can be joined by anyone knowing one of the room’s aliases (or getting it from the homeserver’s public rooms directory if it’s published there), and private rooms work on an invite-only basis. In both cases, if the homeserver doesn’t already have a user in the room, it will ask another homeserver to make the join happen (either the homeserver whose name is in the homeserver part of the alias for a public room, or the homeserver the invite is originating from for a private room).
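Since users, rooms and aliases all share the same sigil + localpart:homeserver shape, telling them apart in code is straightforward. Here’s a small Python sketch limited to the three sigils we’ve met so far; parse_identifier is a hypothetical helper of mine, not something from a Matrix library, and it doesn’t validate the localpart or the server name beyond checking they’re present.

```python
SIGILS = {"@": "user", "!": "room", "#": "room alias"}


def parse_identifier(identifier):
    """Return (kind, localpart, homeserver) for a sigil-prefixed identifier."""
    sigil, rest = identifier[:1], identifier[1:]
    if sigil not in SIGILS:
        raise ValueError("unknown sigil in %r" % identifier)
    # Split on the first colon only, as a server name may carry a port.
    localpart, sep, homeserver = rest.partition(":")
    if not sep or not localpart or not homeserver:
        raise ValueError("expected localpart:homeserver in %r" % identifier)
    return SIGILS[sigil], localpart, homeserver


for identifier in ("@Alice:matrix.alice.tld",
                   "!wbtZVAjTSFQzROqLrx:matrix.org",
                   "#cozy:matrix.org"):
    print(parse_identifier(identifier))
```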

Events, events everywhere

Now that we know what a room is, let’s talk about what’s passing inside of one. Earlier, I’ve been talking about messages, which are actually called “events”. Technically speaking, a Matrix event is a JSON object that’s sent in a room and dispatched to all other members of the room. It, of course, has an ID that’s generated by the homeserver hosting the user who sent the message, taking the general form we saw earlier and the $ sigil character. This JSON has metadata, such as a class name to identify different event types, an author, a creation timestamp, etc. It basically looks like this:

{
  "origin_server_ts": 1526072700313,
  "sender": "@Alice:matrix.alice.tld",
  "event_id": "$1526072700393WQoZb:matrix.alice.tld",
  "unsigned": {
    "age": 97,
    "transaction_id": "m1526072700255.17"
  },
  "content": {
    "body": "Hello Bob and Charlie! Welcome to my room :-)",
    "msgtype": "m.text"
  },
  "type": "m.room.message",
  "room_id": "!TCnDZIwFBeQyBCciFD:matrix.alice.tld"
}

The example above is an event sent from Alice to Bob and Charlie in the room they’re all in. It’s a message, as hinted at by the m.room.message class name in the type property. The content property, which must be an object, contains the event’s actual content. In this case, we can see the message is text, and the text itself. This precision is needed because m.room.message can be a text, but also an image, a video, a notice, etc. as mentioned in the spec.

The unsigned property here only means the data in it mustn’t be taken into account when computing and verifying the cryptographic signature used by a homeserver to pass the event to another homeserver.

The Matrix spec defines three kinds of events that can pass through a room:

Timeline events, such as messages, which form the room’s timeline that’s shared between all homeservers in the room.
State events, which contain an additional state_key property and form the current state of the room. They can describe room creation (m.room.create), topic edition (m.room.topic), join rules (i.e. either invite-only or public, m.room.join_rules), or membership updates (i.e. join, leave, invite or ban, m.room.member with the Matrix ID of the user whose membership is being updated as the state_key). Just like timeline events, they’re part of the room’s timeline, but unlike them, the latest event for a {type, state_key} duo is easily retrievable, as well as the room’s current state, which is actually a JSON array containing the latest events for all {type, state_key} duos. The Matrix APIs also allow one to easily retrieve the full state the room was in when a given timeline message was propagated through the room, and each state event refers to its parent.
Ephemeral events, which aren’t included in the room’s timeline, and are used to propagate information that doesn’t last in time, such as typing notifications ("[…] is typing…").
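The idea of a “current state” made of the latest event for each {type, state_key} duo can be sketched in a few lines of Python. This is a simplification under a big assumption: the events are already in a single, agreed-upon order from oldest to newest. How homeservers actually agree on the state between them is more involved and not covered here, and current_state is a name I picked for the example.

```python
def current_state(state_events):
    """Keep only the latest event for each (type, state_key) pair."""
    latest = {}
    for event in state_events:  # assumed to be ordered from oldest to newest
        latest[(event["type"], event["state_key"])] = event
    return list(latest.values())


events = [
    {"type": "m.room.topic", "state_key": "", "content": {"topic": "Hi"}},
    {"type": "m.room.member", "state_key": "@Bob:matrix.bob.tld",
     "content": {"membership": "invite"}},
    {"type": "m.room.member", "state_key": "@Bob:matrix.bob.tld",
     "content": {"membership": "join"}},
]
state = current_state(events)
print([event["content"] for event in state])
```

Bob’s invite is superseded by his join because both events share the same type and state key, while the topic event is left alone.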

Now, one of the things I really like about Matrix is that, besides the base event structure, you can technically put whatever you want into an event. There’s no constraint on its class name (except it can’t start with m., which is a namespace reserved for events defined in the spec), nor on its content, so you’re free to create your own events as you see fit, whether they are timeline events, state events or both (I’m not sure about ephemeral events, though). That’s how you can create whole systems using only Matrix as the backend.

Matrix events can also be redacted. This is the equivalent of a deletion, except the event isn’t actually deleted but stripped of its content so it doesn’t mess with the room’s timeline. The redacted event is then dispatched to every homeserver in the room so they can redact their local copy of the event as well. Regarding editing an event’s content, it’s not possible yet, but it’s a highly requested feature and should be available in the not so distant future.

A very basic client

Now I guess you’re wondering how you can use Matrix for your project, because learning the core principles is great but that doesn’t explain how to use the whole thing.

In the following steps, I’ll assume a few things:

The homeserver you’re working with is matrix.project.tld, and its client-server API is available on port 443 through HTTPS.
Your user is named Alice. Note that you must change this value for real life tests, because the Matrix ID @Alice:matrix.org is already taken.
Your user’s password is 1L0v3M4tr!x.

Note that I’ll only cover some basic use of the client-server spec. If you want to go further, you should have a look at the full spec or ask any question in the #matrix-dev room. I also won’t cover homeserver setup here (though I might do just that in a future post). My goal here is mainly to give you a look at how the client-server APIs globally work rather than creating a whole shiny app, which would take too long for a single blog post.

It might also be worth noting that each Matrix API endpoint I’ll name in the rest of this post is a clickable link to the related section of the Matrix spec, which you can follow if you want more complete documentation on a specific endpoint.

Registering

Of course, your user doesn’t exist yet, so let’s register it against the homeserver.

The endpoint for registration is /_matrix/client/r0/register, which you should request using a POST HTTP request. In our example, the request’s full URL is https://matrix.project.tld/_matrix/client/r0/register.

Note that every endpoint in the Matrix spec always starts with /_matrix/.

The request body is a JSON which takes the following form:

{
  "username": "Alice",
  "password": "1L0v3M4tr!x"
}

Here, the username and password properties are exactly what you think they are. The Matrix ID generated for a new user contains what’s provided in the username property as the localpart.

Fire this request. You’ll now get a 401 status code along with some JSON, which looks like this:

{
    "flows": [
        {
            "stages": [
                "m.login.dummy"
            ]
        },
        {
            "stages": [
                "m.login.email.identity"
            ]
        }
    ],
    "params": {},
    "session": "HrvSksPaKpglatvIqJHVEfkd"
}

Now, this endpoint uses a part of the spec called the User-Interactive Authentication API. This means that authentication can be seen as flows of consecutive stages. That’s exactly what we have here: two flows, each containing one stage. This example is a very simple one, but it can get quite a bit more complex, such as:

{
    "flows": [
        {
            "stages": [
                "m.login.recaptcha"
            ]
        },
        {
            "stages": [
                "m.login.email.identity",
                "m.login.recaptcha"
            ]
        }
    ],
    "params": {
        "m.login.recaptcha": {
            "public_key": "6Le31_kSAAAAAK-54VKccKamtr-MFA_3WS1d_fGV"
        }
    },
    "session": "qxATPqBPdTsaMBmOPkxZngXR"
}

Here we can see two flows, one with a single stage, the other one with two stages. Note that there’s also a parameter in the params object, to be used with the m.login.recaptcha stage.

Because I want to keep it as simple as possible here, let’s get back to our initial simple example, and use the first one-stage flow. The only stage in there is m.login.dummy, which describes a stage that will succeed every time you send it a correct JSON object.
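A client facing such a response has to decide which flow it’s able to go through. Here’s a minimal Python sketch of that choice, assuming the client knows the set of stages it can handle; pick_flow is a hypothetical helper of mine, and the response is the one shown above, copied as a Python dict.

```python
def pick_flow(response, supported):
    """Return the stages of the first flow made only of supported stages."""
    for flow in response["flows"]:
        if all(stage in supported for stage in flow["stages"]):
            return flow["stages"]
    return None  # no flow can be completed with what the client supports


response = {
    "flows": [
        {"stages": ["m.login.recaptcha"]},
        {"stages": ["m.login.email.identity", "m.login.recaptcha"]},
    ],
    "params": {
        "m.login.recaptcha": {
            "public_key": "6Le31_kSAAAAAK-54VKccKamtr-MFA_3WS1d_fGV"
        }
    },
    "session": "qxATPqBPdTsaMBmOPkxZngXR",
}
print(pick_flow(response, {"m.login.recaptcha"}))
```

A client that only knows m.login.dummy would get None back here, meaning it can’t register against this homeserver.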

To register against this stage, we’ll only add a few lines to our initial request’s JSON:

{
  "auth": {
    "type": "m.login.dummy",
    "session": "HrvSksPaKpglatvIqJHVEfkd"
  },
  "username": "Alice",
  "password": "1L0v3M4tr!x"
}

Note that the value of the session property in the newly added auth object is the value of session taken from the homeserver’s response to our initial request. This auth object will tell the homeserver that this request is a follow-up to the initial request, using the stage m.login.dummy. The homeserver will automatically recognise the flow we’re using, and will succeed (because we use m.login.dummy), returning this JSON along with a 200 status code:

{
  "access_token": "olic0yeVa1pore2Kie4Wohsh",
  "device_id": "FOZLAWNKLD",
  "home_server": "matrix.project.tld",
  "user_id": "@Alice:matrix.project.tld"
}

Let’s see what we have here:

The home_server property contains the address of the homeserver you’ve registered on. This can feel like a duplicate, but the Matrix spec allows for a homeserver’s name to differ from its address, so here’s why it mentions it.
The user_id property contains the newly generated Matrix ID for your user.
The device_id property contains the ID for the device you’ve registered with. A device is bound to an access token and E2E encryption keys (which I’m not covering in this post).
The access_token property contains the token you’ll use to authenticate all your requests to the Matrix client-server APIs. It’s usually much longer than the one shown in the example, I’ve shortened it for readability’s sake.
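To recap the registration exchange, here’s how the follow-up request body could be derived from the initial one and from the homeserver’s 401 response. It’s a Python sketch that only builds the JSON and doesn’t talk to any server; follow_up_body is a name I made up, and the values are the ones from the example above.

```python
import json


def follow_up_body(first_body, response, stage="m.login.dummy"):
    """Add the auth object for `stage` to a copy of the initial body."""
    body = dict(first_body)
    body["auth"] = {"type": stage, "session": response["session"]}
    return body


first_body = {"username": "Alice", "password": "1L0v3M4tr!x"}
response = {
    "flows": [{"stages": ["m.login.dummy"]},
              {"stages": ["m.login.email.identity"]}],
    "params": {},
    "session": "HrvSksPaKpglatvIqJHVEfkd",
}
second_body = follow_up_body(first_body, response)
print(json.dumps(second_body, indent=2))
```

Using json.dumps to produce the body also avoids hand-written JSON mistakes such as a stray trailing comma.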

Registering a user instantly logs it in, so you don’t have to do it right now. If, for any reason, you get logged out, you can log back in using the endpoint documented here.

Creating our first room

Now that we have an authenticated user on a homeserver, let’s create a room. This is done by sending a POST request to the /_matrix/client/r0/createRoom endpoint. In our example, the request’s full URL is https://matrix.project.tld/_matrix/client/r0/createRoom?access_token=olic0yeVa1pore2Kie4Wohsh. Note the access_token query parameter, which must contain the access token the homeserver previously gave us.

There are a few JSON parameters available which I won’t cover here because none of them are required to perform the request. So let’s send the request with an empty object ({}) as its body.

Before responding, the homeserver will create the room and fire a few state events in it (such as the initial m.room.create state event or a join event for your user). It should then respond with a 200 status code and a JSON body looking like this:

{
    "room_id": "!RtZiWTovChPysCUIgn:matrix.project.tld"
}

Here you are, you have created and joined your very first room! As you might have guessed, the value for the room_id property is the ID of the newly created room.
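If you’d rather script this than craft the request by hand, here’s a sketch using only Python’s standard library. It builds the request without sending it, so it can be inspected safely; create_room_request is a hypothetical helper, and the homeserver address and token are the made-up ones from this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def create_room_request(base_url, access_token):
    """Build (without sending) the POST request that creates a room."""
    url = base_url + "/_matrix/client/r0/createRoom?" + urlencode(
        {"access_token": access_token})
    return Request(url, data=json.dumps({}).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="POST")


request = create_room_request("https://matrix.project.tld",
                              "olic0yeVa1pore2Kie4Wohsh")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it, which of course only makes sense against a homeserver that exists.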

Messing with the room’s state

Browsing the room’s state is completely useless at this stage, but let’s do it anyway. Fetching the whole room state, for example, is as easy as a simple GET request on the /_matrix/client/r0/rooms/{roomId}/state endpoint, where {roomId} is the room’s ID. If you’re following these steps using curl requests in bash, you might want to replace the exclamation mark (!) in the room’s ID with its URL-encoded variant (%21). Don’t forget to append your access token to the full URL as shown above.
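The URL-encoding step doesn’t have to be done by hand. As a small Python sketch (room_state_path being a name of mine), the standard library’s quote function takes care of the exclamation mark, and of the colon too when told that no character is safe:

```python
from urllib.parse import quote


def room_state_path(room_id):
    """Return the state endpoint path with the room ID percent-encoded."""
    return "/_matrix/client/r0/rooms/" + quote(room_id, safe="") + "/state"


print(room_state_path("!RtZiWTovChPysCUIgn:matrix.project.tld"))
```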

The request should return a JSON array containing state events such as:

{
  "age": 654742,
  "content": {
    "join_rule": "public"
  },
  "event_id": "$1526078716401exXBQ:matrix.project.tld",
  "origin_server_ts": 1526078716874,
  "room_id": "!RtZiWTovChPysCUIgn:matrix.project.tld",
  "sender": "@Alice:matrix.project.tld",
  "state_key": "",
  "type": "m.room.join_rules",
  "unsigned": {
    "age": 654742
  }
}

Now let’s try to send our own state event in the room, shall we? In order to do that, you’ll need to send a PUT request to the /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey} endpoint, replacing the room’s ID, the event’s type and its state key with the right values. Note that if your state key is an empty string, you can just omit it from the URL. Again, don’t forget to append your access token!

The body for our request is the event’s content object.

Let’s create a tld.project.foo event with bar as its state key, and {"baz": "qux"} as its content. To achieve that, let’s send a PUT request to /_matrix/client/r0/rooms/!RtZiWTovChPysCUIgn:matrix.project.tld/state/tld.project.foo/bar?access_token=olic0yeVa1pore2Kie4Wohsh (from which I’ve stripped the protocol scheme and FQDN so it doesn’t appear too long in the post) with the following content:

{
  "baz": "qux"
}

The homeserver then responds with an object only containing an event_id property, which contains the ID of the newly created state event.
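Here’s the same PUT request sketched with Python’s standard library, for those who prefer a script. As before, the request is only built, not sent; state_event_request is a hypothetical helper, and it percent-encodes each path segment, which is why the room ID looks a bit different from the URL written above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def state_event_request(base_url, room_id, event_type, state_key,
                        content, access_token):
    """Build (without sending) the PUT request for a state event."""
    path = "/_matrix/client/r0/rooms/%s/state/%s" % (
        quote(room_id, safe=""), quote(event_type, safe=""))
    if state_key:  # an empty state key is simply left out of the URL
        path += "/" + quote(state_key, safe="")
    url = base_url + path + "?" + urlencode({"access_token": access_token})
    return Request(url, data=json.dumps(content).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="PUT")


request = state_event_request(
    "https://matrix.project.tld", "!RtZiWTovChPysCUIgn:matrix.project.tld",
    "tld.project.foo", "bar", {"baz": "qux"}, "olic0yeVa1pore2Kie4Wohsh")
print(request.get_method(), request.full_url)
```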

If we retry the request we previously made to retrieve the whole room state, we can now see our event:

{
    "age": 58357,
    "content": {
        "baz": "qux"
    },
    "event_id": "$1526080218403sbpku:matrix.project.tld",
    "origin_server_ts": 1526080218639,
    "room_id": "!RtZiWTovChPysCUIgn:matrix.project.tld",
    "sender": "@Alice:matrix.project.tld",
    "state_key": "bar",
    "type": "tld.project.foo",
    "unsigned": {
        "age": 58357
    }
}

Note that sending an update of a state event is done the same way as sending a new state event with the same class name and the same state key.

Sending actual messages

Sending timeline events is almost the same thing as sending state events, except it’s done through the /_matrix/client/r0/rooms/{roomId}/send/{eventType}/{txnId} endpoint, and it uses one parameter we haven’t seen yet: the txnId, aka transaction ID. That’s simply a unique ID allowing identification for this specific request among all requests for the same access token. You’re free to place whatever you want here, as long as you don’t use the same value twice with the same access token.

Regarding the request’s body, once again, it’s the event’s content.
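As a rough illustration, here’s a small Python sketch that builds the path for such a request and picks a fresh transaction ID each time. The build_send_path helper is a name I made up, and using a random UUID as the transaction ID is just one convenient way to guarantee uniqueness, not something the API requires. The message content is a made-up example.

```python
import urllib.parse
import uuid


def build_send_path(room_id, event_type, txn_id=None):
    """Return the path of the send endpoint and the transaction ID used."""
    # A random UUID is one easy way to never reuse a transaction ID.
    if txn_id is None:
        txn_id = uuid.uuid4().hex
    path = "/_matrix/client/r0/rooms/{}/send/{}/{}".format(
        urllib.parse.quote(room_id, safe=""),
        urllib.parse.quote(event_type, safe=""),
        urllib.parse.quote(txn_id, safe=""),
    )
    return path, txn_id


room = "!RtZiWTovChPysCUIgn:matrix.project.tld"
# Made-up content for a plain text message; this is what goes in the body.
content = {"msgtype": "m.text", "body": "Hello Bob!"}
first_path, first_txn = build_send_path(room, "m.room.message")
second_path, second_txn = build_send_path(room, "m.room.message")
print(first_path)
print(second_path)
```

Each call yields a different path, so sending the same content twice results in two distinct events rather than a retry of the first one.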

Retrieving timeline events, though, is a bit more complicated and is done using a GET request on the /_matrix/client/r0/sync endpoint. Where it gets tricky is in the fact that this endpoint isn’t specific to a room, so it returns every event received in any room you’re in, along with some presence events, invites, etc.

Once you’ve done such a request (again, with your access token appended to it), you can locate timeline events from your room in the JSON it responds with by looking at the rooms object, which contains an object named join which contains one object for each room you’re in. Locate the !RtZiWTovChPysCUIgn:matrix.project.tld room (the one we’ve created earlier), and in the corresponding object you’ll see the state, timeline and ephemeral events for this room.
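To make that lookup a bit more concrete, here’s a short Python sketch that digs the timeline events of one room out of an already decoded sync response. The room_timeline_events helper is hypothetical, the sample response is made up and heavily trimmed, and I’m assuming each of the state, timeline and ephemeral objects wraps its events in an events array (as the invite_state object does in the invite example further down).

```python
def room_timeline_events(sync_response, room_id):
    """Return the timeline events of one joined room from a sync response."""
    joined_rooms = sync_response.get("rooms", {}).get("join", {})
    room = joined_rooms.get(room_id, {})
    return room.get("timeline", {}).get("events", [])


# Trimmed-down, made-up sync response with only the parts discussed above.
sample_sync = {
    "rooms": {
        "join": {
            "!RtZiWTovChPysCUIgn:matrix.project.tld": {
                "state": {"events": []},
                "timeline": {
                    "events": [
                        {
                            "type": "m.room.message",
                            "sender": "@Alice:matrix.project.tld",
                            "content": {"msgtype": "m.text", "body": "Hello!"},
                        }
                    ]
                },
                "ephemeral": {"events": []},
            }
        }
    }
}

events = room_timeline_events(sample_sync, "!RtZiWTovChPysCUIgn:matrix.project.tld")
for event in events:
    print(event["sender"], event["content"]["body"])
```

Using .get() with empty defaults at every level means a room you haven’t joined simply yields an empty list instead of raising an error.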

Inviting a folk

So far, Alice has registered on the homeserver and created her room, but she feels quite alone, to be honest. Let’s cheer her up by inviting Bob in there.

Inviting someone into a room is also quite simple, and only requires a POST request on the /_matrix/client/r0/rooms/{roomId}/invite endpoint. The request’s body must contain the invited Matrix ID as such:

{
  "user_id": "@Bob:matrix.bob.tld"
}

Note that the request is the same if Bob has registered on the same server as Alice.

If all went well, the homeserver should respond with a 200 status code and an empty JSON object ({}) as its body.

The next time he sends a request to /_matrix/client/r0/sync, Bob will see an invite object inside the rooms one. It contains the invite Alice sent him, in the form of a few events including the invite event itself:

{
  "invite": {
    "!RtZiWTovChPysCUIgn:matrix.project.tld": {
      "invite_state": {
        "events": [
          {
            "sender": "@Alice:matrix.project.tld",
            "type": "m.room.name",
            "state_key": "",
            "content": {
              "name": "My very cool room"
            }
          },
          {
            "sender": "@Alice:matrix.project.tld",
            "type": "m.room.member",
            "state_key": "@Bob:matrix.bob.tld",
            "content": {
              "membership": "invite"
            }
          }
        ]
      }
    }
  }
}

Now Bob will be able to join the room by sending a simple POST request to the /_matrix/client/r0/rooms/{roomId}/join endpoint.
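Here’s a hedged Python sketch of what a client could do on Bob’s side: look through the rooms object of the sync response for invites addressed to him, then work out which path to POST to in order to join. Both helper names (pending_invites and join_path) are mine, and the rooms value is the example response shown above with the room name event left out.

```python
import urllib.parse


def pending_invites(rooms, user_id):
    """Map each room the user is invited to onto the sender of the invite."""
    invites = {}
    for room_id, room in rooms.get("invite", {}).items():
        for event in room.get("invite_state", {}).get("events", []):
            is_invite = (
                event.get("type") == "m.room.member"
                and event.get("state_key") == user_id
                and event.get("content", {}).get("membership") == "invite"
            )
            if is_invite:
                invites[room_id] = event["sender"]
    return invites


def join_path(room_id):
    """Path of the endpoint to POST to in order to join the room."""
    return "/_matrix/client/r0/rooms/{}/join".format(
        urllib.parse.quote(room_id, safe="")
    )


# The "rooms" object from Bob's sync response (trimmed to the invite event).
rooms = {
    "invite": {
        "!RtZiWTovChPysCUIgn:matrix.project.tld": {
            "invite_state": {
                "events": [
                    {
                        "sender": "@Alice:matrix.project.tld",
                        "type": "m.room.member",
                        "state_key": "@Bob:matrix.bob.tld",
                        "content": {"membership": "invite"},
                    }
                ]
            }
        }
    }
}

for room_id, inviter in pending_invites(rooms, "@Bob:matrix.bob.tld").items():
    print(inviter, "invited Bob; to join, POST to", join_path(room_id))
```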

Alice meets Bob

So here we are, with a fresh room where Alice and Bob are able to interact with one another, with everything done using HTTP requests that you could do with your terminal using curl. Of course, you don’t always have to do it that manually, and there are Matrix SDKs for various languages and platforms, including JavaScript, Go, Python, Android, iOS, and a lot more. The full list is available right here.

If you want to dive a bit deeper into the Matrix APIs, I’d advise you to have a look at the spec (even though it still needs a lot of work) and what the community has done with it on the Try Matrix Now! page on Matrix’s website.

I hope you found this journey into Matrix’s APIs as interesting as I did when I first heard of the project. Matrix is definitely something I’ll keep playing with for a while, and I might have some big news about Matrix-related projects I’m working on to share here in the coming months.

As always, I’d like to thank Thibaut for proofreading this post and giving me some useful early feedback on it. If you want to share your feedback on this post with me too, don’t hesitate to do so, either via Twitter or through Matrix, my own Matrix ID being @brendan:abolivier.bzh!

See you next week for a new post 🙂

Centralising logs with rsyslog and parsing them with Graylog extractors

Brendan Abolivier — Sat, 05 May 2018 00:00:00 +0200

Once again, we’re up for a monitoring-related post. This time, let’s take a look at logs. Logs are really useful for a lot of things, from investigating issues to monitoring stuff that can’t be watched efficiently by other monitoring tools (such as detailed traffic stats), and some of us even live in a country where it’s illegal to trash logs that were emitted before a given time limit.

When it comes to storing them, a lot of solutions are available, depending on what you need. At CozyCloud, our main need was to be able to store them somewhere safe, preferably outside of our infrastructure.

Earth, lend me your logs! says syslog-dev

We started by centralising logs using rsyslog, an open logs management system that’s described by its creators as a “swiss army knife of logging”. One of its features I’ll be writing the most about in this post is UDP and TCP forwarding. Using that, we (well, my colleagues, since I wasn’t there at that time) created a host for each of our environments whose task would be to keep a copy of every log emitted from every host and by every application in the given environment.

I’ll take a quick break here to explain what I mean by “environment” in case it’s not clear: our infrastructure’s architecture is replicated 4 times in 4 different environments, each with a different purpose: dev (dedicated to experimentation and prototyping, aka our playground), int (dedicated to running the developers' integration tests, aka their playground), stg (dedicated to battle-testing features before we push them to production) and prod (I’ll let you guess what its purpose is). End of the break.

On each host of the whole infrastructure, we added this line to rsyslog’s configuration:

*.* @CENTRAL_LOG_HOST:514

Here, CENTRAL_LOG_HOST is the IP of the host that is centralising the logs for the given environment, in the infrastructure’s local private network. What it does is to tell rsyslog to forward every log it gets to the given host using UDP on port 514, which is rsyslog’s default port for UDP forwarding.

Then a colleague set up a Graylog instance to try and work out the processing part. He did all the setup and plugged in the dev environment’s logs output before getting drowned under a lot of higher-priority tasks, and since I was just finishing setting up a whole monitoring solution, we figured I’d take over from there.

Let’s plug things

Of course, the first thing to do on your own setup is to install and configure Graylog, along with its main dependencies (which are MongoDB and Elasticsearch). The Graylog documentation covers this quite nicely with a general documentation and a few step-by-step guides offering some useful details on installation and configuration. Once your Graylog instance is set up, open your browser on whatever you set as the Web UI’s URI. In most cases, it will look like http://YOUR_SERVER:9000.

Once you’re authenticated, you’ll need to add an input source. Click on “System” in the navigation bar, then “Inputs” in the dropdown menu that just appeared. You’ll then be taken to a page from which you’ll be able to configure Graylog’s inputs.

Click on the “Select Input” dropdown, look for “Syslog TCP” and click “Launch new input”. Fill in the form that appears according to your needs; however, you might want to check “Store full message” at the very bottom. Graylog understands the Syslog protocol’s syntax, and the message it stores is a stripped version of what (r)syslog actually sent. Because you might want to use some of the stripped out parts, it can be wise to tell Graylog to store the full message somewhere before processing it.

You’ll then have to configure rsyslog to send the logs it gets to Graylog. Because we centralise all of our logs, we only need to configure one rsyslog daemon, by adding this line to its configuration:

*.* @@GRAYLOG_HOST:PORT;RSYSLOG_SyslogProtocol23Format

Here, the host is your Graylog server’s address and the port is the one you previously configured while setting up your Syslog TCP input.

There are two things to notice here. First, there are two @ symbols before Graylog’s host name, which means the logs are going to be forwarded to Graylog using TCP. We previously saw a forwarding configuration line with a single @ sign, which means rsyslog will use UDP. The second thing to notice is the ;RSYSLOG_SyslogProtocol23Format part. The semicolon (;) tells rsyslog that what follows defines how to format the logs it sends, and RSYSLOG_SyslogProtocol23Format is a built-in template telling rsyslog to send logs using the Syslog protocol as defined in RFC 5424.
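To sum up how such a forwarding line reads, here’s a little Python sketch that splits one into its parts. This is purely illustrative: parse_forward_rule is a helper I made up, it has nothing to do with how rsyslog parses its own configuration, and it only handles the simple selector-then-target form used in this post (no IPv6 addresses, no options in parentheses). The host names and the 5140 port are placeholders, and falling back to port 514 when none is given is an assumption on my part.

```python
def parse_forward_rule(line):
    """Split a simple rsyslog forwarding rule into its parts."""
    selector, action = line.split(None, 1)
    # "@@" means TCP, a single "@" means UDP.
    if action.startswith("@@"):
        protocol, target = "tcp", action[2:]
    elif action.startswith("@"):
        protocol, target = "udp", action[1:]
    else:
        raise ValueError("not a forwarding rule: " + line)
    # Anything after the semicolon names the format (template) to use.
    target, _, template = target.partition(";")
    host, _, port = target.partition(":")
    return {
        "selector": selector,
        "protocol": protocol,
        "host": host,
        "port": int(port) if port else 514,  # assumed default when omitted
        "template": template or None,
    }


udp_rule = parse_forward_rule("*.* @10.0.0.1:514")
tcp_rule = parse_forward_rule(
    "*.* @@graylog.example:5140;RSYSLOG_SyslogProtocol23Format"
)
print(udp_rule)
print(tcp_rule)
```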

Restart rsyslog to apply the new configuration, and check it works by generating some logs while running

tcpdump -Xn host GRAYLOG_HOST and port PORT

with the same values for GRAYLOG_HOST and PORT as in the bit of configuration above. This tcpdump command line can be called from either the Graylog host or the rsyslog host. If those are the same, remember to add -i lo between tcpdump and -Xn to watch the loopback interface (in this case you can also remove the host GRAYLOG_HOST and part of the command line).

Once you’ve created your input, you might want to add streams. I’m not covering this part in this post as I didn’t get to play with these, and there’s a default stream where all messages go anyway.

Now that logs are coming in, let’s process them!

Stranger in a Strange Land

There are several ways to configure logs processing in Graylog. One of them is pipelines, which are, as you can guess by the name, processing pipelines you can plug into a stream. I played around with them a bit, but gave up on them quite quickly because I couldn’t figure out how to make them work properly, and I was getting some weird behaviour with their rules editor.

Another way to process logs is to set up extractors. A Graylog extractor is a set of rules which defines how logs coming from a given input will be processed, using one of many possible processing mechanisms, from JSON parsing to plain copy, including splitting, substring, regular expressions or Grok patterns.

Now let’s talk about the latter in case it doesn’t ring a bell, because I’ll be talking a lot about this type of pattern in the rest of the post. Grok patterns are kind of an overlay for regular expressions, addressing the issue of their complexity. I’m sure that, just like me, you don’t find the thought of parsing 300-character long log entries using a custom format with standard regular expressions very exciting.

Grok patterns take the form of a string (looking like %{PATTERN}) you include in your parsing instruction that will correspond to either a plain regular expression, or a concatenation between other Grok patterns. For example, %{INT}, a common pattern matching any positive or negative integer, corresponds to the regular expression (?:[+-]?(?:[0-9]+)). Another pattern, included in Graylog’s base patterns, is %{DATESTAMP} which is defined as %{DATE}[- ]%{TIME}, which is a concatenation between a regular expression and two Grok patterns. These patterns are very useful as they make your parsing instructions way easier to read than if they were only made of common regular expressions.

Graylog, like other pieces of software, allows you to describe a log entry as a concatenation of patterns and regular expressions. For example, here’s the line we’re using to parse Apache CouchDB’s logs:

%{DATA} %{NOTSPACE:couchdb_user} %{NOTSPACE:couchdb_method} %{NOTSPACE:couchdb_path} %{NUMBER:couchdb_status_code} %{NOTSPACE} %{NUMBER}

Note the colons inside the patterns' brackets followed by lower case text: these are named captures, which means that what’s captured by the pattern will be labelled with this text. In this case, it will create a new field in the log entry’s Elasticsearch document (since Graylog uses Elasticsearch as its storage backend) with this label as the field’s name. We can even tell Graylog to ignore all un-named captures when creating an extractor.
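If you’re wondering what happens under the hood, here’s a toy Python sketch of how a Grok expression can be expanded into a plain regular expression with named groups. It’s not Graylog’s implementation, just an illustration of the idea. The INT definition is the one quoted earlier; NUMBER, NOTSPACE and DATA are simplified stand-ins I wrote for the example, and the log line is made up (it only loosely resembles what CouchDB emits).

```python
import re

# A tiny subset of Grok patterns. INT is the definition quoted above; the other
# three are simplified stand-ins, not Graylog's actual definitions.
PATTERNS = {
    "INT": r"(?:[+-]?(?:[0-9]+))",
    "NUMBER": r"(?:[+-]?[0-9]+(?:\.[0-9]+)?)",
    "NOTSPACE": r"\S+",
    "DATA": r".*?",
}

GROK_REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(expression):
    """Expand every %{NAME} or %{NAME:label} into a plain regular expression."""

    def expand(match):
        name, label = match.group(1), match.group(2)
        # Patterns may reference other patterns, so expand recursively.
        regex = grok_to_regex(PATTERNS[name])
        if label:
            return "(?P<{}>{})".format(label, regex)
        return "(?:{})".format(regex)

    return GROK_REFERENCE.sub(expand, expression)


couchdb_grok = (
    "%{DATA} %{NOTSPACE:couchdb_user} %{NOTSPACE:couchdb_method} "
    "%{NOTSPACE:couchdb_path} %{NUMBER:couchdb_status_code} %{NOTSPACE} %{NUMBER}"
)
# Made-up log line for the sake of the example.
line = (
    "[notice] 2018-05-05T10:00:00.000000Z couchdb@127.0.0.1 <0.123.0> "
    "abcdef1234 localhost:5984 127.0.0.1 admin GET /_all_dbs 200 ok 12"
)

fields = re.search(grok_to_regex(couchdb_grok), line).groupdict()
print(fields)
```

Only the labelled patterns end up in fields, which mirrors what happens when you ask Graylog to keep named captures only.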

Dissecting logs

The easiest way to create a new extractor is to browse to Graylog’s search, which can be done by clicking on the related button in the navigation bar. There you’ll see a list of all messages sent from your input.

Find a log entry you want to be processed, and click on it. If you have more than one input set up, you might want to double check that the entry comes from the input you want to plug the extractor into, in order to avoid plugging it into the wrong input. Now locate the field you want to process (here we’ll use the full_message field, which is only available if “Store full message” is checked in the input’s configuration). Click on the down arrow icon on its right.

A dropdown menu appears; move your cursor over “Create extractor for field…”. Because that’s close to being the only extractor I got to use while working with Graylog, I’ll only cover extractors using Grok patterns here, so select “Grok pattern”.

Clicking on it will take you to the extractor creation page, using the entry you previously selected as an example to test the extractor against.

You can then enter your Grok pattern in the “Grok pattern” field. You can even ask Graylog to extract named captures only by checking the related checkbox.

Now you might think of an issue with this setup: your extractor will be applied against all incoming messages from this input. To tackle that issue, let’s look at two points. First, extractors fail silently, meaning that if a log entry doesn’t match an extractor’s pattern, Graylog will just stop trying this extractor against this specific entry.

Making sure only entries from a specific program and/or host match is the reason we’re creating the extractor for the full_message field, since it contains the original host and the program which emitted the entry at the beginning of the message. These pieces of info are, of course, parsed as soon as the log reaches Graylog and saved in appropriate fields, but Graylog doesn’t allow an extractor to define execution conditions based on other fields’ values.

Using values contained in the full_message field, the Grok pattern parsing CouchDB log entries I used as an example above now looks like:

%{COUCHDBHOST} couchdb %{DATA} %{NOTSPACE:couchdb_user} %{NOTSPACE:couchdb_method} %{NOTSPACE:couchdb_path} %{NUMBER:couchdb_status_code} %{NOTSPACE} %{NUMBER}

Now that’s a first step, but it still means every log entry will be tested against the pattern, which is a waste of CPU resources. That’s where my second point comes in.

Graylog allows you to set some basic conditions that will define whether a log entry must be tested against the pattern. You can check whether the field contains a given string, or matches a given regular expression which can be very basic. I chose the string check because of lack of time, but I’d recommend checking against a basic regular expression to better match the log entries you want to target.
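The idea behind this condition can be sketched in a few lines of Python: run a cheap substring test first, and only pay for the full pattern match when it passes. Again, this is an illustration rather than Graylog’s code; run_extractor is a hypothetical helper, the pattern is a heavily simplified stand-in that only captures a status code, and the messages are made up.

```python
import re


def run_extractor(message, must_contain, pattern):
    """Apply `pattern` only to messages that pass the cheap condition first."""
    # Cheap substring test: most messages are rejected here, before the regex runs.
    if must_contain not in message:
        return None
    match = pattern.search(message)
    # Like an extractor, fail silently when the pattern doesn't match.
    return match.groupdict() if match else None


# Simplified stand-in for the CouchDB extractor: it only captures a status code.
status_pattern = re.compile(r" couchdb .* (?P<couchdb_status_code>[0-9]{3}) ")

messages = [
    "db-host-1 couchdb [notice] admin GET /_all_dbs 200 ok 12",
    "db-host-1 sshd Accepted publickey for admin",
    "db-host-1 couchdb starting up",
]
results = [run_extractor(m, " couchdb ", status_pattern) for m in messages]
print(results)
```

The second message never reaches the regular expression at all, and the third one reaches it but fails silently.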

One last thing to choose is the “Extraction strategy”, which I usually set to “Copy” to better comply with the WORM (Write Once, Read Many) philosophy. You must also give the extractor a name so you can easily identify it in the list of existing extractors.

Now your extractor should look like this:

All that’s left to do is to click “Create extractor” and that’s it! Your extractor is up and running!

You might want to check if it runs correctly by going back to the “Search” page and selecting a log entry the extractor should target. If the extractor ran correctly, you should see your new fields added to the entry. Note that an extractor only runs against entries received after its creation.

If you want to edit an extractor, click on the “System” link in the navigation bar, then select “Inputs” in the dropdown menu that appears. Locate the input your extractor is plugged into, and click on the blue “Manage extractors” button next to it. You’ll then be taken to a list of existing extractors for this input:

Click “Edit” next to the extractor you want to edit and you’ll be taken to a screen very similar to the creation screen, where you’ll be able to edit your extractor.

In the next episode

Now, we have a copy of all of our logs at the same place, and process them at a single location in our infrastructure, which is great but creates a sort-of SPOF (single point of failure). Well, only a partial one, since the logs are only copied from their original hosts, so if something happens to one of these locations, “only” the processing can be permanently impacted. Anyway, it doesn’t address one of our needs, which is to do all this outside of our infrastructure.

But this is a story for another week, since this post is already quite long. Next time I’ll tell you about logs, we’ll see how we moved our logs processing and forwarding to a remote service, without losing all the work we did with rsyslog and Graylog. This won’t be next week, though, because I already have next week’s topic, and it’s not even monitoring-related!

Anyway, thanks for bearing with me as I walked you through an interesting (I hope) journey into logs processing. If you’re not aware of it, this post was part of my One post a week series, in which I challenge myself to write each week a whole blog post in order for me to re-evaluate the knowledge I have and get better at sharing it. If you’ve enjoyed it, or if you have any feedback about it, make sure to hit me up on Twitter, I’ll be more than happy to discuss it with you 🙂

Thanks to Thibaut and Sébastien for giving this post a read before I got to publish it and getting me some nice feedback.

See you next week!

Grafana Dashboards Manager

Brendan Abolivier — Sat, 28 Apr 2018 00:00:00 +0200

At CozyCloud, most of my work orbits around monitoring and supervision. That’s the main reason explaining why I was tasked with dealing with Zabbix supervision on a remote infrastructure we’re setting up, and it also explains why I’ll write some more on monitoring solutions in the future.

As you already know, some of it is done using Zabbix, and the rest of it is done using OVH’s Metrics Data Platform, which, once again, I’ll write about in a future post. Since OVH hosts a Grafana instance to let their customers visualise their data, we use it to do just that. We actually have one dashboard for each kind of metrics we’re sending to the platform, e.g.:

a dashboard named “Infra” to visualise system metrics from each host in our infrastructure
a dashboard named “CouchDB” to visualise metrics specific to our CouchDB clusters, including nodes status, databases reads/writes, etc.
a dashboard named “Cozy Stack” to visualise metrics specific to Cozy, CozyCloud’s product, including the evolution of the number of created instances, resources usage from the stack, etc.
etc.

I created most of these dashboards myself as part of prototyping and deploying the solution we’re using to push metrics to OVH’s platform (which I won’t be describing here as it deserves its own post). In fact, for my first couple of months working on this task, I was the only person creating, modifying or deleting dashboards in our Grafana organisation.

Then Nicolas started to work with dashboards too, and we stumbled across one big issue: because Grafana doesn’t embed a version control system (aka VCS, i.e. what Git, SVN et al. are), it became quite difficult to work on a dashboard: if a colleague modifies a dashboard you’re currently working on, you can only either overwrite their changes, or give up yours (or merge both manually, which can be really painful).

Another situation where I disliked the lack of a VCS was when I was editing huge and complex WarpScripts: if you save the dashboard with a faulty script by mistake, you’re going to have a very painful time finding it and fixing it. Add to this that the dashboard is actively used by other teams in your company, which adds pressure on you to patch it quickly, and compare that to the ease of reverting to an older version and investigating calmly.

Considering all the burden this lack could create, I decided to start working on a tool for my team, which I later released as free software as the Grafana Dashboards Manager.

What is it?

The Grafana Dashboards Manager is a tool written in Go aiming at helping you manage your Grafana dashboards using Git. It takes advantage of the fact that Grafana describes a dashboard as JSON, making it easy to save and edit in a file.

Its goal is to let you retrieve your existing dashboards to a Git repository, and then edit them within your local Git repository, so merging two versions of the same dashboard doesn’t become a living hell. Once changes have been committed and pushed to the Git repository’s master branch, the Grafana Dashboards Manager can handle synchronising the changes with your Grafana instance. And since only the master branch is watched, it means that you can take advantage of Git’s workflows, such as working on a separate branch, then merging it with the master one, either with a Pull/Merge request or not, and only then will its changes be synchronised with Grafana (if you want them to, of course).

So that’s the big picture, now let’s look at how it works. It is split in two parts: a puller and a pusher. Basically, the whole thing is designed to work like this:

In this schema, the puller, a CLI tool, will fetch changes in the current Grafana dashboards, commit them to a local Git repository, push to a Git remote then exit.

In the meantime, the pusher will look for new commits in the repository to retrieve them and push changed files to Grafana as new or changed dashboards. If requested, it will also delete from Grafana all dashboards that were removed from the Git repository. It will, of course, ignore all commits created by the puller.

This check for new commits can be done in two ways: the first one will start a small web server which will only expose a route that can be used to send web hooks. Because we use GitLab internally, which means our dashboards will be versioned there, the dashboards manager currently only supports GitLab webhooks (and that’s also the reason the Grafana Dashboards Manager uses Git rather than another VCS). Does this mean you can only use the pusher with GitLab, you may ask? Of course not, I answer! The second available mode allows you to specify any Git repository URL which it will poll at a given frequency. In both modes, it will run as a daemon.

By the way, thanks to the refactoring work required to implement this “git pull” mode, if you really want to use a GitHub/Bitbucket/etc. webhook, it shouldn’t be too hard to add support for that in the pusher’s code. Any pull request is, of course, more than welcome!

I don’t want all dashboards to be pulled and pushed, how can I do that?

The configuration allows you to mention a prefix that defines ignored dashboards. If a dashboard’s slug starts with this prefix, it will be ignored by both the puller and the pusher.
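In other words, the filtering boils down to something like the following Python sketch. The actual tool is written in Go and its code may well look different; filter_ignored, the slugs and the test- prefix are all made up for the sake of the example.

```python
def filter_ignored(slugs, ignore_prefix):
    """Drop dashboards whose slug starts with the ignore prefix."""
    # An empty prefix means nothing is ignored.
    if not ignore_prefix:
        return list(slugs)
    return [slug for slug in slugs if not slug.startswith(ignore_prefix)]


slugs = ["infra", "couchdb", "test-couchdb-rework", "cozy-stack"]
kept = filter_ignored(slugs, "test-")
print(kept)
```

Here the test-couchdb-rework dashboard would be left alone by both the puller and the pusher, while the three others would be synchronised as usual.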

Let’s say you want to edit a complex dashboard whose JSON representation is thousands of lines long, so you’d rather edit it using Grafana’s GUI. Using this setting, you can change its name in the JSON file (which is at the end of the file) so it starts with the given prefix, import it, and you won’t be bothered by the puller committing your WIP changes or the pusher overwriting them.

It’s worth keeping in mind that this “ignore prefix” will be replaced with a regular expression in a future release.

What if I just want a back-up tool?

The reason the Grafana Dashboards Manager is split in two parts is that each is independent from the other. If you want it to work only one way, that’s possible. If you want to use it to only upload JSON descriptions of your dashboards to Grafana, that’s possible. If you want to use it to only back-up your dashboards and push them to a Git repository, that’s possible. Just run the appropriate binary with the appropriate configuration.

Wait, and if I don’t want to use Git at all?

Of course, if you don’t want to get a Git repository involved, the pusher won’t work, since its main feature is to interact with one.

But if you just want to back-up your dashboards on your disk, well, that’s also possible! The puller has a second mode that only writes files to disk, which is called the “simple sync” mode, and allows you to back-up your dashboards as JSON files on your disk.
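Conceptually, this mode does little more than the Python sketch below, which writes each dashboard’s JSON description to its own file. It’s an illustration of the idea and not the tool’s Go code: backup_dashboards is a hypothetical helper, naming files after the dashboard’s slug is my assumption, and the two dashboards are made-up stand-ins for what would be fetched from Grafana.

```python
import json
import pathlib
import tempfile


def backup_dashboards(dashboards, directory):
    """Write each dashboard's JSON description to <directory>/<slug>.json."""
    directory = pathlib.Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, dashboard in dashboards.items():
        path = directory / (slug + ".json")
        # Stable key order and indentation keep diffs between two backups small.
        text = json.dumps(dashboard, indent=2, sort_keys=True) + "\n"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


# Two made-up dashboards standing in for what would be fetched from Grafana.
dashboards = {
    "infra": {"title": "Infra", "panels": []},
    "couchdb": {"title": "CouchDB", "panels": []},
}
backup_dir = tempfile.mkdtemp()
files = backup_dashboards(dashboards, backup_dir)
print([path.name for path in files])
```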

I’m sold! How do I get it?

The whole thing is available on GitHub as free software (AGPLv3-licensed), with instructions on how to build it, configure it and run it. If you want to skip the “building” part, here are some built linux-amd64 binaries. All that’s left for you is to download them, create a configuration file from the existing example and run the puller, the pusher or both in the configuration you want.

Thanks a lot to Nicolas who gave me the idea to work on this tool, and to Gilles who gave me a lot of amazing feedback on it 🙂 And as with the latest post, thanks also to Thibaut for his early feedback on this post.

See you next week for a new post, and in the meantime feel free to tweet me some feedback about this one!

Zabbix supervision on a remote infrastructure with proxy and PSK-based encryption

Brendan Abolivier — Fri, 20 Apr 2018 00:00:00 +0200

All of CozyCloud’s production and development infrastructure is hosted in OVH’s datacenters. We monitor this infrastructure in two ways: by sending data points on various metrics to OVH’s Metrics Data Platform (I’ll write about that in a future post), and also by using a self-hosted Zabbix server.

All of our OVH hosts are connected to a virtual local network (vRack) that cannot be accessed from the outside world, so on-host Zabbix agents use it to send their unencrypted data to the Zabbix server, which is also connected to this local network. It was a very simple setup, which looked like this:

A new challenger approaches

Recently, we’ve been tasked with the setup of a new production infrastructure on another hosting provider. The question of how we were going to set up Zabbix’s monitoring in this new environment came up quickly. We decided not to set up another Zabbix server on the new hosting provider’s infrastructure as it would make things painful to set up and we’d have two places to watch instead of only one. So we decided that all Zabbix agents monitoring hosts on the remote infrastructure would send their data to the Zabbix server we already had set up on OVH’s infrastructure.

Now, this brought up an issue that needed solving before we could do anything: there’s no private local network linking the two hosting providers, so the traffic between the two goes through the Internet with neither encryption nor checksum. Luckily, Zabbix provides an encryption feature, and a proxy software which forwards data from agents to a server, so we decided that we would set up a Zabbix proxy on the remote infrastructure and would turn encryption on between the proxy and the Zabbix server. The resulting setup would look like this:

Let’s encrypt stuff

Let’s have a look at how we’ll encrypt the traffic between the proxy and the server. Zabbix actually provides three modes to describe encryption for incoming or outgoing connections:

unencrypted: the data is sent in plain text over the Internet (aka what we don’t want).
PSK (aka Pre-Shared Key): an encryption key that must be shared between the proxy and the server and is used to encrypt and decrypt the data.
Certificate-based: a PEM certificate signed by a certification authority (either public or in-house) must be generated; the CA’s certificate must be provided to the Zabbix server and is used to validate the certificates used by the proxy.

Because it was simpler to set up, we went with the PSK option. However, our Zabbix server was built and installed from the sources, with the --with-openssl option, and Zabbix’s doc on encryption states the following:

If you plan to use pre-shared keys (PSK) consider using GnuTLS or mbed TLS libraries in Zabbix components using PSKs. GnuTLS and mbed TLS libraries support PSK ciphersuites with Perfect Forward Secrecy. OpenSSL library (versions 1.0.1, 1.0.2c) does support PSKs but available PSK ciphersuites do not provide Perfect Forward Secrecy.

And since we had to update the server anyway, one of my colleagues thought he would create an unofficial package (for internal use) from the sources. Why not use the official Debian packages, you ask? Because the packages coming from the official Debian repos are outdated, and we couldn’t find whether the packages coming from Zabbix’s official repos were built using OpenSSL or GnuTLS. This way, we were sure to use the latest Zabbix version with the best encryption settings.

I’m explaining this because it means we’re not using the official packages, which means that, although the setup process should be roughly the same, some steps may differ from the official from-packages install.

At this point, we had our internal packages of the Zabbix server, proxy and agent, and I was tasked with setting up the whole thing on the remote infrastructure.

The proxy: a walkthrough

I’ll begin with the assumption that you already have a running Zabbix server somewhere on the Internet.

First, you need to install the Zabbix proxy. This should be as simple as running

sudo apt install zabbix-proxy-BACKEND

but can be a bit more complicated if you’re installing the proxy from the sources. Either way, it’s all documented.

In my case, once I created the proxy’s PostgreSQL user and database, I also had to manually load the database schema into PostgreSQL, or else the proxy wouldn’t start. If that’s your case, find the schema.sql or schema.sql.gz file installed on the proxy’s host by the sources or the package, un-compress it using gunzip if necessary, then enter the PostgreSQL shell (psql -U PROXY USER -d PROXY DATABASE), and run \i /path/to/schema.sql. This will do all the necessary operations to make the database usable by the proxy.

Now let’s configure the proxy. The configuration file we use, located at /etc/zabbix/zabbix_proxy.conf looks like this:

# Proxy operating mode.
# 0 - proxy in the active mode
# 1 - proxy in the passive mode
ProxyMode=0

# IP address (or hostname) of Zabbix server.
Server=ZABBIX SERVER IP/HOSTNAME

# Unique, case sensitive Proxy name.
Hostname=zabbix-proxy

# Log file name
LogFile=/var/log/zabbix-proxy/zabbix_proxy.log

# Database name.
DBName=POSTGRES DB NAME

# Database user.
DBUser=POSTGRES USER

# Database password.
DBPassword=POSTGRES PASSWORD

# How often proxy retrieves configuration data from Zabbix Server in seconds.
# For a proxy in the passive mode this parameter will be ignored.
# The default is 3600, which is an hour. We don't want to wait up to an hour
# for a new host to start being supervised.
ConfigFrequency=300

# How long we wait for agent, SNMP device or external check (in seconds).
Timeout=4

# How long a database query may take before being logged (in milliseconds).
# Only works if DebugLevel set to 3 or 4.
LogSlowQueries=3000

# How the proxy should connect to Zabbix server, aka the encryption mode we want
# to use.
TLSConnect=psk

# Unique, case sensitive string used to identify the pre-shared key.
TLSPSKIdentity=psk_remote
# Full pathname of a file containing the pre-shared key.
TLSPSKFile=/etc/zabbix/zabbix_proxy.psk

Some values have been censored because they contain sensitive data (such as secrets or passwords).

Let’s give a closer look at some parts of this file.

# Proxy operating mode.
# 0 - proxy in the active mode
# 1 - proxy in the passive mode
ProxyMode=0

This means that the proxy runs in the active mode, and will fetch its configuration from the server by itself. This mainly means we don’t have to restart the proxy each time we add a host.

# IP address (or hostname) of Zabbix server.
Server=ZABBIX SERVER IP/HOSTNAME

# Unique, case sensitive Proxy name.
Hostname=zabbix-proxy

This part tells the proxy which server it should contact, and which name it must give to identify itself. The first parameter would be ignored if we were running in passive mode.

# Database name.
DBName=POSTGRES DB NAME

# Database user.
DBUser=POSTGRES USER

# Database password.
DBPassword=POSTGRES PASSWORD

This part tells the proxy how to connect to its database. In this case we’re using PostgreSQL.

# How the proxy should connect to Zabbix server, aka the encryption mode we want
# to use.
TLSConnect=psk

# Unique, case sensitive string used to identify the pre-shared key.
TLSPSKIdentity=psk_remote
# Full pathname of a file containing the pre-shared key.
TLSPSKFile=/etc/zabbix/zabbix_proxy.psk

Now here’s the interesting part: the part where we set up encryption for outgoing connections. We don’t set up any encryption for incoming connections, because we’re running our proxy in active mode, which means any connection between the proxy and the server is always initiated by the proxy.

The first parameter is TLSConnect, which tells the proxy what mode it should use to connect to the server. It can either be unencrypted, psk or cert.

Once we’ve told our proxy we want to use a PSK to talk with the server, there are two parameters we must define:

TLSPSKIdentity: the “identity” of the pre-shared key, aka a non-secret string identifier. You can basically input whatever you want here.
TLSPSKFile: the file containing your secret pre-shared key.

Zabbix’s documentation provides two ways to generate the PSK, which is basically a random 32-byte string (written as 64 hexadecimal characters), using either OpenSSL or GnuTLS. I used GnuTLS, which looked like this:

$ psktool -u psk_identity -p database.psk -s 32
Generating a random key for user 'psk_identity'
Key stored to database.psk

$ cat database.psk
psk_identity:9b8eafedfaae00cece62e85d5f4792c7d9c9bcc851b23216a1d300311cc4f7cb

Let’s just clarify a point here: the key isn’t the one we’re using. The code block above is just an exact copy from Zabbix’s documentation.
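For reference, the OpenSSL route mentioned earlier is a one-liner. This is only a sketch, assuming the openssl command is installed; proxy.psk is an arbitrary file name picked for the example. A file created this way contains only the key, so the transformation described next is not needed for it.

```shell
# Generate 32 random bytes, printed as 64 hexadecimal characters.
# proxy.psk is an arbitrary file name chosen for this example.
# The subshell keeps the stricter umask from leaking into the session;
# umask 077 makes the new file readable by its owner only.
(umask 077; openssl rand -hex 32 > proxy.psk)
```

Since the file ends up readable by its owner only, you may need to change its ownership so that the user the proxy runs as can read it.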

Now that we have generated our database.psk file, we’ll need to transform it a bit so Zabbix can read it, by removing the identity and the colon, leaving only the key in the file. Using the file generated in the previous example, it should now look like this:

$ cat database.psk
9b8eafedfaae00cece62e85d5f4792c7d9c9bcc851b23216a1d300311cc4f7cb
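The manual edit above can also be scripted. A minimal sketch, where strip_psk_identity is a name I made up for the example: it keeps only what follows the first colon, and leaves a file that has already been cleaned up unchanged.

```shell
# Hypothetical helper: drop the "identity:" prefix written by psktool,
# leaving only the key in the file.
strip_psk_identity() {
    # $1: path to the file generated by psktool
    tmp=$(mktemp) || return 1
    # cut prints lines without a colon unchanged, so running this twice is safe.
    # The replaced file inherits mktemp's owner-only permissions.
    cut -d: -f2- "$1" > "$tmp" && mv "$tmp" "$1"
}
```

With the example from the documentation, strip_psk_identity database.psk leaves the file containing just the 64-character key.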

You may of course rename the file and move it to the proxy’s host. The next step is to re-open the proxy configuration file, copy the .psk file’s absolute path as the value for the TLSPSKFile parameter, restart the proxy and voilà! The proxy should now be able to talk to the server! Or at least try to, because the server doesn’t know our proxy yet. Let’s see how we can fix this.

Server meets proxy

Now you’ll need to log into your Zabbix server’s web interface (as an administrator), and click on the “Proxies” sub-menu from the “Administration” menu. From there, click “Create proxy”.

Fill in your proxy’s name, but don’t click “Add” yet. Also, make sure the name is exactly the same as the Hostname you specified in the proxy’s configuration (it’s case-sensitive).

Then click “Encryption” (at the top of the gray block, next to “Proxy”), uncheck “No encryption”, check “PSK”, fill in the PSK’s identity (again, this needs to be exactly the same as the value you set to TLSPSKIdentity, and is case-sensitive), and the PSK (which is the content of the .psk file we generated just before).

Now you can click “Add”, and voilà! Your server now knows your proxy and will be happy to talk to it, using the PSK to encrypt all communications.
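Since both the proxy name and the PSK identity must match the configuration file character for character, it can help to print them straight from the file and copy-paste them into the form rather than retyping them. A small sketch, with conf_value being a hypothetical helper name of my own:

```shell
# Hypothetical helper: print the value of a "Key=value" parameter from a
# Zabbix-style configuration file. Commented lines start with "#", so
# they never match the anchored pattern.
conf_value() {
    # $1: parameter name, $2: configuration file
    sed -n "s/^$1=//p" "$2"
}
```

For instance, conf_value Hostname /etc/zabbix/zabbix_proxy.conf and conf_value TLSPSKIdentity /etc/zabbix/zabbix_proxy.conf print the two values the web interface expects.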

A few words on the agents

Now this whole setup won’t disturb on-host agents that much. They talk to a proxy the same way they talk to a server. However, you’ll need to make them talk to the proxy, and this is done in two parts:

In the agent’s configuration file, set the Server parameter to the proxy’s address, not the Zabbix server’s.
In the server’s web interface, when creating the host, make sure to select the proxy in the “Monitored by proxy” dropdown at the bottom of the main view.

There’s one special case, though: the agent running on the proxy’s host. If you give it the same configuration as the other agents in your remote infrastructure, the proxy will end up forwarding its own monitoring data, which is not good if you want to be able to investigate incidents efficiently (and can lead to countless issues). So I’d advise making it talk (in an encrypted fashion) directly to the Zabbix server. The agent’s configuration is almost exactly the same as the proxy’s; in fact, we can even use the same encryption key. At CozyCloud, we only append these lines to the proxy’s agent configuration:

TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=psk_remote
TLSPSKFile=/etc/zabbix/zabbix_proxy.psk

Also, don’t forget to set the agent’s Server configuration parameter to your server’s public address instead of the proxy’s internal address.
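If you’d rather script the change to the agent’s configuration, here is a minimal sketch. The function name add_psk_settings is made up, the values are the ones from this post, and the path used in the usage line below is the usual packaged location, which may differ on your system. It skips any parameter that is already set, so it can be run more than once.

```shell
# Hypothetical helper: append the four PSK lines to an agent configuration
# file, skipping any parameter that is already set there.
add_psk_settings() {
    # $1: agent configuration file
    for line in \
        'TLSConnect=psk' \
        'TLSAccept=psk' \
        'TLSPSKIdentity=psk_remote' \
        'TLSPSKFile=/etc/zabbix/zabbix_proxy.psk'
    do
        key=${line%%=*}   # parameter name, i.e. everything before the "="
        grep -q "^$key=" "$1" || printf '%s\n' "$line" >> "$1"
    done
}
```

On the proxy’s host, this would be run as add_psk_settings /etc/zabbix/zabbix_agentd.conf, followed by a restart of the agent.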

And voilà!

There you go, the whole thing is set up and ready to work! You can make sure encryption is turned on using tcpdump like this:

$ tcpdump -X -i eth0 dst host ZABBIX SERVER IP/HOSTNAME and dst port 10051

Make sure to run this command from the proxy’s host. You may want to change the interface (here eth0) and the port the Zabbix server listens on (here 10051) according to your own setup.

If encryption is indeed turned on, all of the decoded content sent from the proxy to the server (the right-hand part of the output) should be unintelligible gibberish.

If no traffic goes between your proxy and your server (i.e. if tcpdump shows nothing), you might want to update the firewall rules on your Zabbix server’s host to allow incoming connections on port 10051 (or any other port you might have configured the server to listen on).
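Before touching the firewall, a quick reachability check from the proxy’s host can confirm whether the port is the problem. This is only a sketch: port_open is a hypothetical helper name, and it assumes bash (for its /dev/tcp pseudo-device) and the timeout command from GNU coreutils are available.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 3 seconds, fail otherwise.
port_open() {
    # $1: host, $2: port
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Calling port_open with your server’s address and 10051 returns a non-zero status if the connection is refused or times out. Keep in mind that a failure can also mean the Zabbix server itself isn’t running, not only that a firewall is in the way.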

If you were not aware of it, this blog post was the first episode of my One post a week series, in which I’m trying to keep up with writing a blog post a week to help me get better at sharing my knowledge. If you have any feedback on this post, make sure to hit me up on Twitter, I’ll be more than happy to discuss it with you 🙂

I’d also like to thank Nicolas, who spent so much time helping me with this setup and explaining so many things about Zabbix to me, along with Thibaut and Sébastien for their early feedback on this post, which helped me make it even better.

See you next week for a new post!

One post a week

Brendan Abolivier — Sat, 14 Apr 2018 00:00:00 +0000

My name is Brendan Abolivier. I’m a young guy from Brest, France, working as a junior system administrator at CozyCloud, a small French company working on an open personal cloud platform that aims to give people back ownership of their personal data.

When I was at BreizhCamp, a 3-day long tech conference in the West of France that happened a couple of weeks ago, I attended a talk called “Teaching is learning: become a better dev by sharing your knowledge”. During this talk, the speaker, Céline Martinet Sanchez, spoke about her journey in software development and how she used knowledge that was shared by others and slowly became the one to share her own knowledge with random people on Internet forums. The full 28-min long talk is available right here.

In the “sharing” part of the talk, she described the different ways in which you can share knowledge with other people (forum posts, blog posts, talks, etc.), and remarked that we usually refrain from sharing such knowledge. We sometimes use excuses such as “I’m not good at explaining” or “I don’t have anything interesting to share with people”. She actually listed most of the excuses she used to either hear or say herself, and explained how most of them were just that, excuses with no real basis. She explained that you won’t get better at explaining stuff by not doing anything about it, and that most of the time you actually have something interesting to share (you must have learned something at work this week, or while talking with friends or colleagues, that helped you in your projects), but you usually consider it not interesting enough to share with the rest of the world.

While listening to her speak, I noticed that whenever I was considering going to a conference, I always had a small moment when I was undecided about how to attend (speaker? attendee? volunteer?), and always quickly rejected the speaker option because I thought I had nothing worth sharing. The same goes for writing blog posts. Most of the excuses she listed during her talk were excuses I had heard coming from myself, and it made me think that maybe I devalue what I know too much, and maybe what I learn each day/week/month is worth sharing with the rest of the world. This thought felt even more real when I got to speak with Céline Martinet Sanchez later that day, and she told me she was actually pushed by her colleagues towards doing a talk, went through this whole thought process and came up with an amazing talk that really stood out to me.

Realising all of this, I thought it would be a great exercise to finally make use of this blog I set up without a real goal a few months ago, and, each week, share something I learned at work or while working on personal projects, or just something I have in mind and want to share on this space. The posts can be tutorials, feedback, or even reflections on non-technical parts of stuff I work on. Some weeks there might even be nothing, because I won’t write random stuff if I have nothing to talk about (even though that’s very unlikely).

I hope you’ll stick around, and I’ll see you next week for the first post in this series!