Split mongo architecture and rollout options

Executive Summary & Action page

Issue: how to wire up split mongo ASAP with as little risk as possible?

Real issue: how to get a RESTful API ASAP for SPOCs and other uses?

Goal RESTful API use cases:

SPOC subsetting
1. Create new course from existing course
2. Set dates
3. Delete some chapters, sections, units, &/or components
4. Publish
SPOC compilation
1. Create new course from existing course
2. Add chapter from another course
3. Set dates
4. Publish

Split mongo use cases whose scope is unknown:

Flag conflicting edits (forks)
Undo edits
Publish transactionally rather than inadvertent dribble

There are 2 RESTful APIs:

Studio's existing one
The proposed central one.

The proposed central one depends upon split mongo or at least a full integration of the new Locator syntax.

Of the above use cases, Studio's existing RESTful API supports:

Create new course (but from scratch not from existing)
Set dates (any xblock field editing)
Create, update, or delete any xblocks
Publish subtree

Risks:

Performance: effect of any data migrations if done lazily
Non-invertibility of migration: what to do if a migrated course has a defect, since migration only goes from old to split?

Decisions/options requiring action:

Have Studio support both back ends at the same time for not only read but also write to enable gradual and deliberate course migration (hybrid split)?
broadcast updates to both to enable reversion to old if needed?
just assign courses to one or the other (split v old)?
Course migration from old to split mongo:
Big bang: migrate all courses or all that may be edited?
Lazy: migrate upon attempt to write to old mongo?
Controlled dribble: explicitly migrate some subset and increase that subset over time
1. Does Studio need to support unmigrated courses for more than read access? (hybrid split)
2. Will this strategy only apply to edx or also edge and other sites?
Use & extend Studio's existing restful api or implement the more general one we proposed (in the short-run)?
Choose where to put the locator - location mapping (see locator-location-locus)

Punchlist:

xml export from split
have mixed modulestore figure out whether to read & write to split v old mongo v xml
if using broadcast model of updates, implement that.
command line or admin page to invoke course migration from old to split mongo (unless using lazy migration only)
what if any of the split mongo use cases above to support in Studio? What to do w/ that functionality in case of hybrid split b/c old won't support the use cases?
hook up Studio to split &/or hybrid
hook up lms to hybrid
test, test, test
extend the studio api or implement the general one

Architectural depictions with options

To illustrate the differences among these architectures, I will use a combined studio and lms use case. You may want to imagine what you think the students should see at each point:

Teacher creates course, sections, and subsections (Studio)
Student1 registers for course (LMS)
Student1 looks at course content (LMS)
Teacher creates units and components (Studio)
Teacher edits titles and dates for the course, sections, and subsections (Studio)
Teacher configures grading policy and marks some subsections as graded (Studio)
Student1 looks at course content (LMS)
Teacher makes some units (u_0..u_i) public (Studio)
Student1 looks at course content (LMS)
Student1 works through u_0..u_i (LMS)
Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i (Studio)
Student2 looks at course content (LMS)
Student2 works through u_0..u_i (LMS)
Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
Student3 looks at course content (LMS)
Student3 works through u_0..u_i (LMS)

Pre-split mongo

This section covers how the system worked before split mongo and the location mapper.

Pre-split architecture stack

Pre-split mongo architecture stack

This document does not yet fully explain this stack, but here are some notes on the diagram.

The top (yellow) are the user facing clients: currently just browser clients.
The next layer (light green) shows the app layer which is primarily restful and non-restful url handlers with any client models and app logic (e.g., most grading).
The dark green shows (external) grading and analytics as disconnected services purely as a reminder that these and others like them (e.g., drupal) exist, not to show how they use the back end. It would be good to get diagrams of how these plug into the back ends.
The cyan layer is the data access and modeling layer. It handles figuring out the identities and repositories, serializing and deserializing data, determining authorization, etc.
The xblock runtime currently is subordinate to the modulestore layer which instantiates it, computes addresses, and feeds it data models. lms writes directly to it for student state data which the runtime then persists directly in SQL; however, all courseware writes go through the modulestore layer.

Use case

Teacher creates course, sections, and subsections (Studio)
Studio uses MixedModulestore to create entries in Mongo
Student1 registers for course (LMS)
LMS uses auth svcs to create entries in SQL
Student1 looks at course content (LMS)
LMS uses MixedModulestore to access all of the courseware from step 1
Student just sees an outline of the course with no content
Teacher creates units and components (Studio)
Studio uses MixedModulestore to create draft entries in Mongo
Teacher edits titles and dates for the course, sections, and subsections (Studio)
Studio uses MixedModulestore to update the entries in Mongo
Teacher configures grading policy and marks some subsections as graded (Studio)
Studio uses MixedModulestore to update the entries in Mongo
Student1 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student1 sees an outline of the course with no content but with grading, dates, and new titles
Teacher makes some units (u_0..u_i) public (Studio)
Studio uses MixedModulestore to rename draft entries as non-draft ones in Mongo
Student1 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student1 sees content
Student1 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
Studio uses MixedModulestore to update the entries in Mongo
Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units and components (Studio)
Studio uses MixedModulestore to copy non-draft entries into ones marked draft and update the draft entries in Mongo
Studio updates the children of the subsections for the inserts and reorders.
Student2 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student2 sees content in its new chapter, section, subsection, and unit order, but not the new component order, and sees neither the new units nor the changes to u_0..u_i
Student2 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
Studio uses MixedModulestore to convert draft entries to non-draft overwriting the existing non-drafts (and removing the drafts) in Mongo
Student3 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student3 sees all content in its new order and material
Student3 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
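
The draft/non-draft behaviour in the walkthrough above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `OldMongoSketch` and all of its methods are hypothetical names, not the real modulestore API, and the rule that only blocks with a draft copy are hidden from the LMS is inferred from the steps above.

```python
# Toy model of the pre-split draft/non-draft behaviour (hypothetical names,
# not the real modulestore API).

class OldMongoSketch:
    def __init__(self):
        # Keyed by (location, revision); revision is None (live) or "draft".
        self.docs = {}

    def create(self, location, fields, draft=False):
        self.docs[(location, "draft" if draft else None)] = dict(fields)

    def update(self, location, fields):
        """Edit a block; a block without a draft copy is edited live."""
        draft_key = (location, "draft")
        key = draft_key if draft_key in self.docs else (location, None)
        self.docs[key].update(fields)

    def convert_to_draft(self, location):
        """Copy the live entry into a draft entry so later edits stay hidden."""
        self.docs[(location, "draft")] = dict(self.docs[(location, None)])

    def publish(self, location):
        """Overwrite the live entry with the draft and remove the draft."""
        self.docs[(location, None)] = self.docs.pop((location, "draft"))

    def lms_view(self, location):
        """What the LMS sees: only non-draft entries."""
        return self.docs.get((location, None))


store = OldMongoSketch()
store.create("subsection", {"title": "Week 1"})       # visible at once
store.create("u_0", {"title": "Unit 0"}, draft=True)  # hidden until published
store.publish("u_0")
store.convert_to_draft("u_0")
store.update("u_0", {"title": "Unit 0 (revised)"})    # students still see "Unit 0"
```

The sketch shows why the pre-split model "dribbles": edits to non-draft blocks such as subsections reach students immediately, while unit edits wait for an explicit publish.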

Long-term split mongo architecture

Eventually split mongo will completely replace the current mongo, so the diagram will look just like the one above except that Mongo Modulestore will be Split Mongo Modulestore. Its 3 collections give it the ability to support undoing edits, reusing content among courses, tracking changes over time (who and when), adding organizational governance over course id namespaces, and running courses over and over without export, rename, and import. The version tracking, for example, will enable the lms to know that student1 did not see the subsequently inserted material and to mark which of u_0..u_i changed since the student saw them, so the student can decide whether to check out the changes. It will enable analytics to compare performance before and after a courseware change. It will enable course authors to compare versions.

The eventual split mongo architecture will execute the use case above as follows:

Teacher creates course, sections, and subsections (Studio)
Studio uses MixedModulestore to create entries in draft course version in Mongo
Student1 registers for course (LMS)
LMS uses auth svcs to create entries in SQL
Student1 looks at course content (LMS)
LMS uses MixedModulestore to notice that there is no published content yet for the course
Student sees that the course has no content nor outline yet
Teacher creates units and components (Studio)
Studio uses MixedModulestore to create draft entries in Mongo
Teacher edits titles and dates for the course, sections, and subsections (Studio)
Studio uses MixedModulestore to update the entries in Mongo
Teacher configures grading policy and marks some subsections as graded (Studio)
Studio uses MixedModulestore to update the entries in Mongo
Student1 looks at course content (LMS)
LMS uses MixedModulestore to notice that there is no published content yet for the course
Student sees that the course has no content nor outline yet
Teacher publishes some units (u_0..u_i) and their parents (Studio)
Studio uses MixedModulestore to create a published branch and version and then copies the draft entries into it via Mongo
Student1 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student sees content
Student1 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
Teacher edits titles, dates, and subsection order for the course, sections, and subsections (Studio)
Studio uses MixedModulestore to update the entries in draft branch in Mongo
Teacher edits u_0..u_i adding new units u_k..u_l between 0 and i and changing the order of some units (Studio)
Studio uses MixedModulestore to update the entries in draft branch in Mongo
Student2 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student2 sees same content as Student1 saw in the same order
Student2 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
Teacher "publishes" u_0..u_i including u_k..u_l (Studio)
Studio uses MixedModulestore to create a new live version in Mongo containing the changes from the draft entries being published
Student3 looks at course content (LMS)
LMS uses MixedModulestore to access the courseware
Student3 sees content in its new order all the way down and new content
Student3 works through u_0..u_i (LMS)
LMS records student state via xblock runtime to SQL
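
The branch-and-version behaviour in this walkthrough can be sketched as follows. It is a rough model only: `SplitSketch`, its fields, and the idea of storing a `previous` pointer on each version are assumptions made for illustration, not a description of the actual split mongo collections.

```python
import copy
import itertools

# Toy model of draft/published branches whose heads point at versions.
# All names are hypothetical.

class SplitSketch:
    def __init__(self):
        self._ids = itertools.count(1)
        self.structures = {}   # version id -> {"blocks": {...}, "previous": id}
        self.branches = {}     # branch name -> head version id

    def _head_blocks(self, branch):
        head = self.branches.get(branch)
        if head is None:
            return None, {}
        return head, copy.deepcopy(self.structures[head]["blocks"])

    def _commit(self, branch, blocks, previous):
        version = next(self._ids)
        self.structures[version] = {"blocks": blocks, "previous": previous}
        self.branches[branch] = version

    def edit_draft(self, block_id, fields):
        """Every edit creates a new draft version; old versions are kept."""
        head, blocks = self._head_blocks("draft")
        blocks.setdefault(block_id, {}).update(fields)
        self._commit("draft", blocks, head)

    def publish(self, block_ids):
        """Copy the named draft blocks into a new published version."""
        head, blocks = self._head_blocks("published")
        draft = self.structures[self.branches["draft"]]["blocks"]
        for block_id in block_ids:
            blocks[block_id] = copy.deepcopy(draft[block_id])
        self._commit("published", blocks, head)

    def undo_draft(self):
        """Move the draft head back one version."""
        head = self.branches["draft"]
        self.branches["draft"] = self.structures[head]["previous"]

    def lms_view(self):
        head = self.branches.get("published")
        return None if head is None else self.structures[head]["blocks"]
```

Because a publish creates one new version in a single step, the LMS either sees the whole set of published changes or none of them, which is the "publish transactionally" use case listed in the summary. Keeping earlier versions around is what makes undo and version comparison possible in this model.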

Hybrid intermediate state of split running with old mongo

The focus of this document is which of several intermediate state options we should support. The reason for the intermediate hybrid state is to deploy functionality incrementally and to avoid requiring a big bang conversion of all existing course material and records.

Roadblocks to big bang conversion:

To enable reusing content among courses and versioning content, the new representation has a richer and slightly incompatible addressing scheme (Locators). This complicates:
student state which uses the old locations
analytics using the old locations
references within a course to other course locations
1. especially if the material is now being referenced in a different course than the original course because the references will reference the original course not the course-invariant address nor the new-course relative address.
similarly references to assets because for some reason they're identified relative to the creating course.
Risk around the data migration scripts, which have unit tests but have not yet been tested against real course content.
The length of time it's taking to finish writing the code for the hypothetical future state.
The absence of Studio UX design and development to take advantage of the new functionality (reusing content, undo, comparison, controlled publication, etc)

Strategy for mitigating the risks and roadblocks:

To mitigate the addressing scheme change,
we've implemented and made live an address scheme mapping service (loc_mapper) which we need to wire wherever needed (it's currently wired at the highest levels of the Studio App and the Studio Client is using the new address scheme).
we decided to temporarily not use the new address scheme in lms so that student state, analytics, grading, drupal, and other such things won't need to be aware of the new address scheme.
we'll use loc_mapper to generate the asset addresses for now
we'll use loc_mapper to try to recode cross-course references (assets and xblocks) into within-course relative references. n Locations can map to the same Locator; loc_mapper knows how to convert each of them to the Locator and then from the Locator to the Location for any of the n course ids which map to it.
to mitigate migration risk,
we'll manually invoke migration on a subset of courses and ensure they work well in practice before migrating the others
this some-migrated-some-not state will require ensuring Studio can write to either back end depending on which repository owns the course.
we may make mixed modulestore broadcast each write to each representation (which means it needs both addressing schemes simultaneously and a strategy for handling old mongo's restrictive capabilities).
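
The n-to-1 mapping that loc_mapper performs (several Locations resolving to one Locator, and back again per course id) can be sketched roughly as below. This is a minimal illustration, not the real loc_mapper interface: `LocMapperSketch`, the tuple fields, and the course id strings are all hypothetical.

```python
from collections import namedtuple

# Hypothetical, simplified address types for illustration only.
Location = namedtuple("Location", "org course category name")
Locator = namedtuple("Locator", "block_id")


class LocMapperSketch:
    def __init__(self):
        self._to_locator = {}    # (old course id, Location) -> Locator
        self._to_location = {}   # (old course id, Locator) -> Location

    def register(self, course_id, location, locator):
        self._to_locator[(course_id, location)] = locator
        self._to_location[(course_id, locator)] = location

    def to_locator(self, course_id, location):
        return self._to_locator[(course_id, location)]

    def to_location(self, course_id, locator):
        return self._to_location[(course_id, locator)]

    def recode_reference(self, location, source_course_id, target_course_id):
        """Two passes: source Location -> Locator -> target-course Location."""
        locator = self.to_locator(source_course_id, location)
        return self.to_location(target_course_id, locator)


mapper = LocMapperSketch()
original = Location("OrgX", "CS101", "html", "intro")
borrowed = Location("OrgX", "CS101_SPOC", "html", "intro")
shared = Locator("intro_block")
mapper.register("OrgX/CS101/2013", original, shared)
mapper.register("OrgX/CS101_SPOC/2014", borrowed, shared)
```

Note that every lookup needs the course id. Without it, the Locator-to-Location direction is ambiguous as soon as two courses share a block.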

Current Architecture (work-in-progress):

Location mapping in studio app. Using new Locators in studio client.

Location mapping for studio app diagram

The difference here is the insertion of the loc_mapper and its store. The studio app takes each outgoing Location and uses the loc_mapper to convert it to a Locator so that all the client sees are Locators. It takes each incoming Locator and reconverts it back to a Location so that all the MixedModulestore sees is Locations.

This change has no effect on the current use case other than the form of the urls (which the use case does not discuss).

The problem is how to wire split mongo which uses Locators and old mongo and xml which use Locations without having the applications know which of the two addressing schemes the underlying data access and modeling layer uses. This problem is complicated by the fact that addresses are usually passed around merely as strings without any hint to their semantics and often hidden within other structures. Another complication is that those using Locations must also provide the unique id for the course to get a valid Locator. The loc_mapper will give a mapping even if it doesn't know which course is really in effect, but that mapping may be wrong. In practice, we don't allow more than one course with the same org and "course name"; so, most mappings will be correct; however, we cannot guarantee that they will.

The above diagram's depiction of converting at the app tier does not work for split mongo, which does not want the conversion.

Considered approaches:

use only the new Locators wherever we know that a field is an address instead of using strings, Locations, or other inert types (dicts, tuples, arrays).
add behavior to these Locators for them to mock old Location functionality for read as well as create and write (calling loc_mapper as necessary)
ensure every code place does not merely pass these around as assumed strings or ensure that the objects present such strings wherever such assumptions lie.
move the mapping functionality to the low level modulestore methods having them accept any address form and converting it to whatever representation that modulestore needs via the loc_mapper.
to ensure existing higher level code does not trip on alternative representations, we'd have to
1. ensure those functions just pass the address around inertly,
2. duplicate each xblock field which we know holds addresses and have a version of the field for each repr,
3. ensure each access stipulates what type of address it wants (and provides the course_id), or
4. tell the modulestore which representation to populate into the reference fields according to which app requested it.

Of the 2 approaches above, the first seems cleanest. It does have some risks: performance, because our code frequently calls Location methods; race conditions, if the code requests a translation before the loc_mapper knows about the course; and the need to do 2-pass conversions for inadvertent wrong-course hard-coded references (see above where I described asset and in-course references for things borrowed by other courses). This dual conversion problem exists in both approaches. For the second approach, none of the sub-approaches is sufficient in and of itself. The last (encoding the address according to the application's preference) may be the closest to sufficient; however, because the code will not know how to find each reference in an xblock, some will leak to the upper layers, which will need to catch address failures and attempt conversions.
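
To make the second approach concrete, here is a minimal sketch of a low-level store method that accepts either address form and converts it to the form the store needs. Everything in it is an assumption for illustration: `coerce_address`, `OldStoreSketch`, `AddressError`, and the use of a plain dict in place of loc_mapper are hypothetical, not existing code.

```python
from collections import namedtuple

# Hypothetical, simplified address types for illustration only.
Location = namedtuple("Location", "org course category name")
Locator = namedtuple("Locator", "block_id")


class AddressError(KeyError):
    """Raised when an address cannot be converted to the needed form."""


def coerce_address(address, wanted_type, mapper, course_id):
    """Return the address in wanted_type, converting through mapper if needed."""
    if isinstance(address, wanted_type):
        return address
    try:
        return mapper[(course_id, address)]
    except KeyError:
        raise AddressError((course_id, address)) from None


class OldStoreSketch:
    """Stands in for a store that persists Locations only."""
    address_type = Location

    def __init__(self, mapper):
        self.mapper = mapper   # dict standing in for loc_mapper
        self.items = {}        # Location -> block fields

    def get_item(self, address, course_id):
        location = coerce_address(address, self.address_type,
                                  self.mapper, course_id)
        return self.items[location]


intro_location = Location("OrgX", "CS101", "html", "intro")
intro_locator = Locator("intro_block")
store = OldStoreSketch({("OrgX/CS101/2013", intro_locator): intro_location})
store.items[intro_location] = {"title": "Intro"}
```

Callers can then pass whichever form they hold, but as noted above this only covers addresses passed as arguments. References stored inside xblock fields would still surface in the store's native form, so upper layers would need to catch something like `AddressError` and retry with a conversion.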

For either approach, we'll need to decide whether to convert the existing mongo (aka, "old mongo") to read and write persisted addresses in either representation or only use Locations because that's what old mongo uses now.

In the long run, I'd like to deprecate the old Location and its behavior; however, it's not clear how we get there.

Whichever approach we use for addresses, the architecture becomes the following where most of the location mapping is done at the modulestore layer and only inadvertent references get mapped in the apps. The xblock runtime may need to use the loc_mapper as well.

Location translation at the modulestore layer

Hybrid approaches

The hybrid approaches for running split mongo alongside old mongo have several control dimensions:

Which courses persist in split and which into old mongo?
All courses: one time big bang conversion--unlikely approach.
All current and future courses: leave archived courses alone but don't allow access from Studio--also unlikely.
Any course being edited in Studio: proactively move any course which should be accessible in Studio and have Studio only use split mongo.
Lazily any course being edited in Studio: read from either store, but only allow writes to split mongo. This was the approach I was working on. It would force migration from old to split upon first update attempt.
All new courses, but leave old ones in old mongo: this strategy doesn't save any work but may reduce risk for running courses by ensuring that no addresses change. It requires having Studio able to read and write to both stores and having LMS able to read from both (all of the below do as well).
All new courses plus a gradually increasing set of other deliberately migrated courses.
Should Studio use split but LMS use old mongo?
Requires writing a publish mechanism from split to old mongo.
Still requires determining strategy for when to move courses to split for Studio.
Should Studio broadcast updates to both stores to enable easy roll-back?
Will require some additional work as well as analysis of what information is lost in the old mongo version and whether we care about that loss.
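
The lazy and controlled-migration options above can be compared with a small routing sketch. It assumes, purely for illustration, that each store is a dict of course id to blocks; `HybridRouterSketch` and its methods are hypothetical names and not part of MixedModulestore.

```python
import copy

# Toy router: read from either store, write to split only.
# All names are hypothetical.

class HybridRouterSketch:
    def __init__(self, old_store, split_store):
        self.old = old_store      # course id -> {block id: fields}
        self.split = split_store  # same shape

    def owner(self, course_id):
        """A course belongs to split as soon as it has been migrated."""
        return "split" if course_id in self.split else "old"

    def read(self, course_id):
        if course_id in self.split:
            return self.split[course_id]
        return self.old[course_id]

    def create_course(self, course_id):
        """New courses are created in split only."""
        self.split[course_id] = {}

    def migrate(self, course_id):
        """One-way copy from old to split; calling this explicitly for a
        chosen subset of courses models the controlled dribble."""
        if course_id not in self.split:
            self.split[course_id] = copy.deepcopy(self.old[course_id])

    def write(self, course_id, block_id, fields):
        """Lazy option: the first write forces migration, then updates split."""
        self.migrate(course_id)
        self.split[course_id].setdefault(block_id, {}).update(fields)


router = HybridRouterSketch({"c_old": {"u_0": {"title": "Unit 0"}}}, {})
router.write("c_old", "u_0", {"title": "Unit 0 v2"})  # migrates, then edits
```

In this sketch the old copy is left untouched after migration, so it goes stale as soon as the first write lands. That is the gap the broadcast option is meant to close, at the cost of deciding what to do with edits that old mongo cannot represent.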

Whatever choice we make is an interim choice; so, we need to patch together a path from all old mongo to all split no matter how hypothetical that end point may be.
