SHIELD Roadmap Call

Joining the Call

On the fourth Thursday of every month, we hold a conference call to discuss the SHIELD project, and its overall direction. Topics include latest news, development progress, and future direction.

The next SHIELD Roadmap Call is
Thursday, June 25th at 11:00am EDT.

We use zoom: https://us02web.zoom.us/j/7165551212

Calls are not recorded.

Previous Calls

We don't record the roadmap calls, but we do take copious amounts of notes and attempt to summarize them below. Wouldn't your rather skim a report on a meeting, rather than watch a static recording of the meeting??

May 28th, 2020

These are the notes from our sixth community call, held at 11:00am EDT on Thursday, May 28th, 2020.

The Future of SHIELD

Most of this Roadmap call was devoted to the discussion of structural cracks and flaws in the core architecture of SHIELD, and of systemic failures in past choices. The world moves on, technology improves, and sometimes the ideas that seemed good at the time become worse, until they are no longer viable.

To that end, we are planning an ambitious, unfunded open source push to fix most or all of these problems. Namely:

  1. Replace Tenants with Folders
  2. Vue.js for the Web UI (dropping lens.js / AEGIS)
  3. A Better CLI Experience
  4. Externalize the Vault
  5. Externalize the Database
  6. Externalize the Scheduler
  7. SSH Active Fabric

Tenants will be replaced with Folders

Few people we have spoken to actually use tenancy, and the leakiness of the ACLs involved contributed to that. In the end, those who were using tenancy were doing so for organizing systems into groups / categories / taxonomies (i.e. all of the production databases, this CF, that K8s)

Vue.js for the Web UI (dropping lens.js / AEGIS)

We've had nothing but issues (bugs, inconsistencies, inaccuracies) with the SHIELD Web UI. While we still think its a useful thing to have, we no longer believe that it is (a) required or (b) benefits from a non-standard front-end framework.

So, we'll be rewriting it in Vue.js, with more node-iness than before, and accompanied by library-level unit tests that are part of CI/CD runs. This means that AEGIS goes away (to be replaced by native Vue.js data management), and the websocket gets an overhaul to accommodate (more on that later)

A Better CLI Experience

The CLI experience for SHIELD needs smoothed out. Our new goal is to be able to dispense entirely with the UI, and use the CLI for everything, if we so choose.

Externalize the Vault

SHIELD will no longer manage its own Vault. When we first started this, it was easier and better for us to ship our own Vault instance, dedicated to SHIELD. But in the modern era of Helm charts and other automation, it's much easier to properly set up a Vault. Plus, most people are already running one anyway.

Removing the responsibility of managing the Vault from SHIELD will simplify our codebase, and make deployments more flexible. Distributions of SHIELD may opt to continue running their own dedicated Vault process, at their discretion.

Externalize the Database

SQLite doesn't scale, not the way we're using it. This change moves forward on two fronts: getting task logs out of the database proper, and switching to a database-agnostic ORM layer like gorm (which has come a long way since we first evaluated it). We don't have hefty performance requirements from our database layer, and programming effort (and complexity!) should be concentrated elsewhere.

Our initial target is PostgreSQL (which is about as easy to spin as SQLite these days anyway), but if we end up effortlessly supporting other RDBMSes via an ORM, all the better.

Externalize the Scheduler

Currently, we cannot scale the SHIELD Core, either for availability or performance reasons, because the scheduler is entirely in memory. To support horizontal scale-out, we will be externalizing the state of the scheduler into something like Redis (or the database) to allow multiple cores to operate concurrently.

SSH Active Fabric

The time has come to solve the NAT problem, and allow SHIELD agents from behind a network address translation scheme to work with a core on the other side. To do this, we will be flipping the direction of communication, and using the sfab library to do it.

Under this new scheme, SHIELD agents will SSH into the SHIELD core, authenticate themselves and await instructions. The SHIELD core will then hand out tasks (dispatched via the sfab framework) to agents, or skip task scheduling for unknown / unreachable agents.

This is a very large shift, architecturally, and it will guarantee that we are not able to handle backwards compatibility with SHIELD v8.

The v9 Project

Given all the change we want to make, we are going to break with the v8.x series and implement SHIELD v9 as a new system, with similar concepts and the same supported data systems. This new SHIELD will not be a drop-in replacement for v8.x and prior releases. As such we expect to be able to move forward more quickly with the large changes we are about to implement.

Work will proceed under a v9 branch. This work will be 100% open source, volunteer-driven. We will coordinate effort via the GitHub issue tracker.

K8s, SIGs, and Go Libraries

We've been watching the core teams in the Kubernetes project for a while now, and their federated approach to both code and governance seems to be working out quite well. While we don't have the same requirements of a governance structure (BDFL seems to be working), we will begin to emulate the federate code structure by pulling parts of SHIELD Core implementation into standalone(-ish) code bases that act as libraries.

When this is all said and done, the core binary as we know it today will mostly be glue code, implementing the API and data persistence layer, but otherwise relying on the libraries of SHIELD and SHIELD-like functionality to do all the heavy lifting.

This (we believe) will grant us the ability to test some obscure edge cases that are particularly tricky to reproduce with a full complement of SHIELD core, storage system, agent, and target system. Our goal is to get 80% or better C0 code coverage during an extraction port, but more importantly to have a harness in place for writing regression tests for library-bound bugs, going forward.

Work has been proceeding, in skunkworks projects, on this front for a month or so now, and the results are promising.

Kubernetes Controller + CRDs

Extracting core functionality into one or more (tested) libraries also enables us to experiment with other substrates for the ideas behind SHIELD. Namely, we'll be able to quickly experiment with Kubernetes, by turning the database objects into Kubernetes Custom Resources, and swapping out our API for that of the Kubernetes API.

January 23nd, 2020

These are the notes from our fifth community call, held at 11:00am EDT on Thursday, January 23nd, 2020.

Happy New Year!

Spring 2020 Roadmap Goals

Our main goals for SHIELD in 2020:

  1. Scalability:

    • We currently are hitting a ulimit of about 180 agents.

    • We will be exploring Postgres as a replacement for Sqlite to better handle the database lockout problem with lots of agents.

    • We will be looking to stop writing task logs to the database to further reduce strain on the database.

    • One of our goals is HA SHIELD. (Externalize vault/scheduler).

    • Active ssh fabric for agents.

  2. Containers:

    • Continued improvements with SHIELD on k8s via helm.

    • Move bosh release to the containers release + SHIELD docker image.

    • Changes to CI/CD to reflect new packaging

After reviewing our roadmap we had discussions about some of the obstacles that we need to overcome to accomplish our goals for 2020. We talked at length about how we should best handle our transition from Sqlite to Postgres.

August 22nd, 2019

These are the notes from our fourth community call, held at 11:00am EDT on Thursday, August 22nd, 2019.

Rebooting the Roadmap Call

Sometimes life happens, and gets in the way of other things. We're okay now though, so the SHIELD Team is going to resume the Roadmap Calls monthly.

SHIELD v8.4 is out — v8.5 On The Horizon

The SHIELD Team released version 8.4 of SHIELD on August 9th, and is busily preparing for a mid-September release of version 8.5.

etcd!

Stark & Wayne sponsored an internship over the Summer of 2019 in which four enterprising University of Buffalo CS Students built a backup plugin for CoreOS' etcd key-value system. Congratulations to Sriniketh, Pururva, Jason, and Naveed!

Website Redesign

This is our website, but you may not recognize it. We just got a nifty new redesign. Now that the day-to-day annoyance of modifying the website content is more well-in-hand, the team expects to start putting up more and more blog posts and documentation!

Blogs, Blogs, and More Docs!

The SHIELD Team has been putting forth a concerted effort to write more approachable "first timer" content to get people interested in deploying SHIELD to protect their valuable cloud data assets.

Hopefully, by the time of our next Community Call (September 26th), we'll have something more concrete to show off.

September 13th, 2018

These are the notes from our third community call, held at 11:00am EDT on Thursday, September 13th, 2018.

New BackBlaze Storage Plugin

Thanks to the efforts of Mr. Dan Molik (Stark & Wayne), the SHIELD project now has support for storing backup archives in BackBlaze's B2 endpoint, via the new b2 storage plugin!

Download (and re-encryption) of Backup Archives

The SHIELD Team is working on building new functionality into SHIELD for (securely) downloading archives from Cloud Storage, via the web interface or the CLI.

This used to be a lot easier in SHIELD v6, but when we introduced encryption into the mix, we lost the ability to inspect individual archives. Furthermore, SHIELD complicates the issue by randomizing the keys and initialization vectors (IVs) for each archive, and stores them in the local Vault, which operators do not have access to. Even for "fixed-key" backups, the key that the operators are given is used to derive the key and IV used to encrypt the archives.

To remedy this, and yet maintain the security of archives, SHIELD is being extended to include a new API endpoint for downloading and re-encrypting a single archive (subject to tenant rights and access control). This new endpoint will take encryption parameters from the operator, and re-encrypt the archive as it is streamed from storage.

Open Forum

Jordan and Phillip asked several very good questions. Good on them!

Smoking Hole SHIELD Restores - Do we have documentation for the recovery of a SHIELD core itself?

We have a process, and it's sort of documented, but not well. We're working on fleshing out the SHIELD Operations Manual with these details.

Authentication Tokens - Authentication Tokens seem to be tied to individual SHIELD user accounts. Do those persist across SHIELD re-deployments?

Yes, Authentication Tokens are tied to individual accounts. Behind the scenes, those are implemented as non-expiring sessions associated with the user who created them.

Because they are resident in the sessions table in the database, they should persist across BOSH deploys, assuming (a) you are using persistent disks and (b) you are not deleting the BOSH deployment first.

Bootstrapping Local Users - Can The SHIELD BOSH Release Bootstrap Local Users?

No it cannot, but it should.

Mutual Network Visibility of SHIELD Agents - In v6, the only communication channel was unidirectional, wherein the SHIELD core connected to each SHIELD agent to orchestrate a backup. This does not seem to be the case in v8.

This isn't really a question, but we'll let that slide.

In v8, the SHIELD core still initiates an SSH session to the agent involved to orchestrate it. What's new in v8 is an HTTPS registration ping from the agents to the core. This "ping" puts the agent in the SHIELD core's database, so that the core can SSH back into the agent and retrieve metadata from them.

This poses a severe problem with NAT'ed installations which lack the required mutual network visibility to make this work. What we (Stark & Wayne) have been doing is just colocating the SHIELD core in the same NAT "scope" as the systems being orchestrated.

We've designed a few alternatives to this. One is a SHIELD-aware NAT traversal proxy that would handle connections to/from the agents across the NAT boundary. This is a fair bit of engineering work, but it does preserve backwards compatibility.

A far superior solution is to break backwards compatibility, and invert the orchestration flows. If the SHIELD agents SSH into the SHIELD core, announce themselves, and then await orders, we can handle both the metadata retrieval / agent discovery, and traverse NATs with ease.

August 9th, 2018

These are the notes from our second community call, held at 11:00am EDT on Thursday, August 9th, 2018.

Lean Table View

SHIELD now supports a Lean Table View, for collapsing lots of cards into a more compact, tabular view. We'll be working on making that view also available on the wizards (configure backup, run ad hoc, and restore).

Optional Compression

Bzip2 Compression of archives is now optional. The implementation makes it modular, so we can add in new compression schemes (gzip, zip, lzma, etc.)

Plugin Reference Documentation

We've started documenting the SHIELD Plugins and their configuration.

Thanks all for joining! See you next month!

July 12th, 2018

These are the notes for our first ever community call, held at 11:00am EDT on Thursday, July 12th, 2018.

The Website

We have a website at https://shieldproject.io.

We are putting all of our documentation, guides, and manuals there

Work is underway on fleshing out the SHIELD Operator's Manual, and the Plugin Reference.

We're also hoping to build out recipe-based docs like How do I backup Cloud Foundry?

The Trello Board

We have started using a Trello Board for coordinating developer activity. If you want to get involved in hacking on SHIELD, ask in the #help channel on Slack.

GitHub Issues will still be the place to report bugs and ask for feature requests. Trello is more for things that require multiple different states, prioritization, backlog management, etc.

Every month we will be highlighting delivered features and fixed bugs that we feel are important enough to announce (big, operator-visible stuff), and discussing our focus for the next month.

Current Future Direction

We've got a lot of minor bugs in the Trello backlog; we're focusing on those first, to get them fixed and get the fixes shipped.

We've also conducted an internal review, which we're calling Gap Analysis. Put simply, it's all the things we haven't finished, and all the features we know we need but haven't implemented. This includes things like being able to edit systems from the web interface, viewing tasks, rescheduling backup jobs, etc.

Once we've got the bug backlog tamed, we'll be moving onto the gap tickets.

Release Cadence

We're hoping to hit a weekly release schedule with SHIELD, the SHIELD BOSH release, and the SHIELD Genesis Kit. Our current plan is to cut new point releases on Friday afternoons.

Open Forum

Jordan asked a bunch of questions. Thanks, mate!

CLI Help Documentation - Curious whether or not we were going to provide more documentation for CLI usage, either via the website, or inline via --help flags.

Short answer: yes.

Long answer: definitely yes, via both methods. We'll review the state of SHIELD help inside the CLI, and see if there are logical places to flesh out what we've got, cover more abstract topics, etc.

Can SHIELD itself be recovered via the CLI only?

Currently, no. The SHIELD recovery mechanism (with the fixed key backup) relies on visual elements that are part of the Web UI. However, it's a good feature (one we're missing, so it's a "gap"), and we'll be looking into support for that particular workflow.

Manual Archive Management - Having deleted archives manually from backing cloud storage, Jordan noticed that the SHIELD Web UI didn't update its storage usage counts.

This is normal, and we'll address it specifically in the documentation, since SHIELD expects to have full and complete control over cloud storage. We did talk through a UI view (CLI and Web) for browsing cloud storage archives, and deleting them there, with an eye towards reclaiming space.

Monitoring and Notifications - Is it possible to notify about storage usage limits being reached, failing storage, failing jobs, etc.?

Short answer: yes, via the /v2/health endpoint, in an external monitoring system.

Long answer: we have plans for a separate project / product, for handling notifications more globally, with business logic and dispatch rules built into that system, which SHIELD would then integrate with.