I'm in the mood to tell a little behind-the-scenes story about the complex technical journey of a seemingly simple interactive video player and how it works today. This article will be split into two parts: a history lesson and an overview of the current technical architecture.

This piece will focus on the process of identifying and solving problems, and how we used architectural design to reorganize business requirements and create sensible abstractions. For me, this is a record of a chapter in my life; for you, the reader, I hope it can be inspiring.

The History

I'd like to tell the historical part as a story, following a timeline.

Chaotic Beginnings

The project kicked off three years ago. Back then, nobody had a clear definition of what an "interactive video" was, so the entire project was in an experimental state. We had a rough consensus on what this "interactive video" thing should be:

  • It was a type of course where a video segment would be followed by an "interaction" (something like a mini-game), similar to ads on YouTube, except the interruption was a program, not another video.
  • The "interaction" was a fixed-size modal dialog that couldn't scale with the screen. Due to "technical limitations," we couldn't achieve a "seamless connection" with the video content. We also didn't consider the mobile experience, assuming all users would be on a desktop computer.
  • For mobile, the idea at the time was, "We'll just make a mini-program, but it can only play videos, no interactions."

At this stage, the project structure was simple. All the code, including the interactive courses, lived in a single repository. The interactive programs were simply imported from a directory.

The initial architectural model

The team followed this direction for a long time. However, the first boomerang hit us right in the face when a new product manager joined and requested that the interactions and videos be seamlessly connected. This meant we couldn't just use a simple modal dialog; the interaction had to align perfectly with the video frame. The most intuitive way to scale something is with CSS Transform—just scale the content to match the video size and tweak the coordinates. But it wasn't that simple.

To pursue the "ultimate fluid experience," the entire interactive video was rendered directly on a Canvas, which has a different rendering mechanism than standard DOM elements. If you use CSS Transform to scale a button, the button can still be clicked normally, and its coordinates will be correctly mapped. However, when a Canvas element is transformed, its event coordinates and hitboxes are not mapped. This meant developers would potentially have to manually recalculate event coordinates based on CSS properties. Two technical solutions emerged from the team:

  • Re-implement a synthetic event system. This would involve parsing the CSS transform operations of the entire element tree, calculating the final transform matrix, and modifying the event properties through something like a decorator before handling mouse events.
  • Put the Canvas inside an iframe and scale the iframe. This way, the browser would handle the mouse event mapping correctly.

Both solutions had significant technical complexity. For instance, the first approach required:

  • Manually composing CSS Transform values. If several container elements each have different transforms, composing them into a final matrix becomes incredibly difficult.
  • Manually injecting a decorator for every event. This would have to be done for all events in all frameworks, leading to a massive amount of boilerplate code and making the project a nightmare to maintain.
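
To give a concrete sense of why that first option scared us, here is a minimal, hypothetical sketch of "compose the ancestors' transforms and remap a pointer event." It ignores transform-origin, scroll offsets, and nested stacking contexts, all of which a real synthetic event system would also have to handle:

```ts
// Illustrative only: map a viewport-space pointer position into a transformed
// canvas's local space by composing the CSS transforms of its ancestor chain.
function mapPointerToCanvas(canvas: HTMLCanvasElement, e: MouseEvent) {
  // Collect elements from the document root down to the canvas.
  const chain: HTMLElement[] = [];
  for (let el: HTMLElement | null = canvas; el; el = el.parentElement) {
    chain.unshift(el);
  }

  // Compose every ancestor's computed transform into a single matrix.
  let matrix = new DOMMatrix();
  for (const el of chain) {
    const transform = getComputedStyle(el).transform;
    if (transform && transform !== 'none') {
      matrix = matrix.multiply(new DOMMatrix(transform));
    }
  }

  // Invert the composed matrix to go from screen space back to local space.
  const local = matrix
    .inverse()
    .transformPoint(new DOMPoint(e.clientX, e.clientY));
  return { x: local.x, y: local.y };
}
```

Now imagine wiring a decorator like this into every event handler, in every framework the team used, and keeping it correct as layouts changed.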

Using an iframe also brought its own set of annoying problems:

  • Media Playback: In some browsers, a page within an iframe is treated as a completely new page. This means autoplay permissions are not inherited, making it difficult to automatically play audio the moment an interaction starts, which undermines the "seamless" feeling of transitioning from a video scene.
  • Communication: Communicating between the iframe and the parent window isn't as simple as parent-child communication in React. All events must be passed via postMessage, and the parent window has no awareness of the iframe's internal state. In other words, you have no way of knowing if the program inside has started successfully or crashed.

After a lengthy discussion and weighing the trade-offs, we chose the iframe solution. As it turned out, this decision brought many additional benefits, which I'll discuss later.

At that point, the project architecture looked something like this:

The first architectural change

A Reckless Collision

This was probably one of the most unpopular things I did on the team. In the early stages of the project, the Canvas rendering engine we used was "in-house." A team member had designed their own drawing management library and semi-coercively pushed for its adoption. But this library was plagued with problems:

  • Missing Features: The hitbox management was incomplete, and many drawing functions were missing. If you needed a missing feature, you had to contact the maintainer and wait for them to add it, which was a frustrating bottleneck.
  • TypeScript Support: I have to reflect on whether I was too reckless in pushing for TypeScript adoption without reaching a consensus with everyone. To this day, however, I still firmly believe that for any complex modern web project, not using TypeScript is a serious engineering flaw, unless there's a clear reason other than "I can't learn it." Unfortunately, this library had no TypeScript support, and its maintainer was very resistant to it.
  • Necessity: There were already many feature-complete, well-established libraries on the market. Building our own, aside from satisfying the "reinventing the wheel" itch, offered no real benefit to the team's business goals and instead added a significant maintenance burden.

The maintainer of this in-house library once said, "I tried to implement 3D rendering with the Canvas 2D API, but I gave up after implementing back-face culling because the performance wasn't good enough. That's why we use Three.js for 3D." This statement made me realize something: he lacked the professional graphics programming and engineering skills to maintain this thing. Continuing down this path would only drag down the entire team's development speed. So, we scheduled a meeting where all the developers sat down to discuss the necessity of maintaining this library.

Fun fact: the engine initially used setInterval instead of requestAnimationFrame to control redraws. The documentation proudly stated that if you set the interval low enough, the frame rate would be very high and the experience smoother, which is a pretty strange thing to boast about.

The argument for keeping it was, "We need to build our own technical assets and not just use off-the-shelf solutions. If we build this ourselves, we can open-source it and let others use it, allowing us to lead its development instead of being passively controlled by others." My counterargument was simple: as a small startup, this kind of endeavor would bring no short-term benefits and would only slow down development, which could be fatal. Moreover, no other team member was involved in maintaining this library. Having such a critical piece of infrastructure depend entirely on one person is an extremely risky situation for any commercial company.

In the end, the development plan for the library was shelved, and the team switched to a tech stack of PIXI + Three + D3.js.

An Ordinary Leap Forward

Then, the machine learning project was released. It just... was released.

In the downtime before the next project kicked off, we had a chance to breathe and address the existing problems in our codebase. In my view, the biggest issue at the time was the obsession with reinventing the wheel. Besides the aforementioned rendering library, the team had all sorts of custom tools and unique technical tastes. Nearly every member was using their own set of wrappers to manage rendering, even though everyone was using PIXI. This was a serious problem:

  • From a business perspective, a small startup like ours must focus its energy on actual business development, not on each member building their own little tools to satisfy a sense of accomplishment.
  • From a development management perspective, everyone maintaining their own set of tools meant that each tool had its own unique bugs to fix, and one person's maintenance work didn't benefit anyone else. Over time, this would drain a massive amount of energy.
  • From an engineering health perspective, it prevented us from converging on a development paradigm or establishing unified coding standards. This made cross-maintenance between team members difficult and left new members feeling lost when faced with N different development styles.

So, we held a dedicated meeting to discuss this. We laid out all the custom tools and held a "bake-off," with the goal of keeping only one and retiring the rest. My argument was simple:

  • I'm not against creating business-specific wrappers, but they must make the code more expressive by hiding unnecessary details and highlighting the core business logic. Due to tight deadlines, the machine learning project was filled with copy-pasted code. The team even had a term for it: "copying homework." When one person created a wrapper or solved a technical problem, everyone else would just copy that code into their own work. This drowned the core logic in useless implementation details, making future maintenance extremely difficult.
  • We needed to leave ample room for centralized management. In other words, things like frame management (Ticker), state management, lifecycle, and communication must be centrally managed and properly encapsulated, with clear boundaries separating them from specific business logic. This would allow us to make stability improvements, performance optimizations, and bug fixes without major changes to the business code, avoiding situations where one change requires a million edits.
  • Respect the API designs of PIXI and Three. We shouldn't over-engineer rendering management or engage in thankless tasks like "unifying the PIXI and Three APIs" or "making their APIs fit our aesthetic." Our energy should be focused on the business of connecting "interactive video" and "WebGL rendering."

I've always believed that in commercial software development, it's more important to study how to properly segment a project, allowing each layer to purely express "what it does" while hiding "how it does it." Stacking these layers one by one to create a product is the hallmark of good commercial development. It's an old tune, but it bears repeating.

Eventually, the architecture I proposed became the standard toolset for interactive development. We also made adjustments to the "sandbox model" for interactions.

The iframe can be considered the "ancestor" of micro-frontends. It has some excellent properties:

  • For instance, simply setting the URL to about:blank wipes out all memory leaks and forgotten requestAnimationFrame calls.
  • Excellent security. Considering the project might evolve into a platform with external developers, it's crucial to protect the main site from being tampered with by third-party code. Sandboxing with an iframe allows us to expose only the necessary APIs, preventing uncontrolled code from running on the main site.
  • Each interactive course series can have its dependencies version-locked. Once a course is developed and stable, we can permanently fix all its dependency versions to avoid the myriad of explosions caused by version bumps.

However, these features were not well-utilized in the previous setup. The iframe contained a React app, and switching between interactions was handled by React Router. That was the entirety of React's role in the whole interactive program. This was clearly unnecessary, and making the interactive program a single-page application (SPA) only allowed memory leaks to spill over into the next interaction. So, we adjusted the project's organization and bundling strategy: each "interaction point" became a separate HTML file. When the interaction task between videos was completed, the entire HTML page would be destroyed, clearing all its resident memory resources along with it. This greatly alleviated the long-standing memory usage problems.
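
A rough sketch of that teardown flow, with made-up names: the player points the sandbox iframe at the standalone HTML file for the next interaction point, then resets it to about:blank when the interaction finishes so the browser reclaims everything inside.

```ts
// Sketch of the "one HTML file per interaction point" idea (names are invented).
class InteractionSandbox {
  constructor(private readonly frame: HTMLIFrameElement) {}

  // Load the standalone HTML page built for a single interaction point.
  load(interactionUrl: string) {
    this.frame.src = interactionUrl;
  }

  // Destroy the whole page: leaked listeners, forgotten requestAnimationFrame
  // loops, and resident textures all go away with the document.
  destroy() {
    this.frame.src = 'about:blank';
  }
}
```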

At this point, a layer of the architecture was cut out:

The model with one layer removed

Despite all these "micro-optimizations," the project as a whole was still a mess. By the end of its lifecycle, our technical debt was so severe that it required overtime just to maintain. Unable to bear the weight of so much "historical baggage," we had no choice but to hire new team members, blow up the interactive player, and rebuild it from scratch.

Oh, right, the Steam client was also something I hacked together during that period. I thought it would be a simple job, but because all the business logic was tightly coupled and tangled, it created a magnificent picture of shit wrapped in more shit.

After adding an Electron layer, the whole thing looked like this:

The model with two more layers added

The Modernization Process

When rebuilding the project, the first step was, of course, to organize all the business requirements. We had to sort all the features into different modules. In the following sections, I will list the problems with the product at the time and present our solutions.

First, let's look at the modernized player architecture:

A "modern architecture"

Each module was designed to solve specific engineering or user experience problems. For example, the boss often complained, "Why does this thing have a black screen so often? Why is it so laggy?" The black screens and lag actually had to be broken down into several parts, but the main culprits were flawed implementations of the lifecycle, communication, and resource management mechanisms.

Lifecycle

Let's first look at the lifecycle issues:

  • The fundamental architecture of the old player was flawed. It was initially designed with the assumption that a video would always be followed by an interaction, and an interaction by a video. Much of the switching logic was hardcoded based on this assumption, and the lifecycles for video and interaction were implemented as two separate parts.
  • Although we later modified it to support arbitrary sequences of videos and interactions, many of the original, poorly considered parts were not properly handled or redesigned.

Our solution for this part was as follows:

  • Interactions and videos became peer concepts, categorized and placed into a component called "Stage." This component operates on a "plugin system." The client downloads a configuration file from a remote server, which determines the resources for the current episode and their types. Based on the resource type, we call different "plugin components" for rendering, but every plugin uses the same, consistent lifecycle.
  • Each plugin registers itself with a "Core Manager," which handles unified lifecycle management. The Core Manager controls the state machine for the entire system: which plugin starts preloading when, when it should be destroyed, and when it should resume from a hidden state.
  • This plugin mechanism also leaves a lot of room for future product feature expansion. For example, you could insert an arbitrary webpage or even an article in the middle of a video.
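
Conceptually, the contract every plugin signs with the Core Manager looks something like the sketch below. The names are hypothetical and the real interface carries much more detail; the point is only that videos, interactions, and anything we add later all share the same lifecycle:

```ts
// Hypothetical shape of a Stage plugin: every resource type (video, interaction,
// web page, article, ...) implements the same lifecycle, and the Core Manager
// drives the state machine.
interface StagePlugin {
  /** Which resource types from the episode config this plugin can render. */
  readonly handles: readonly string[];

  /** Start fetching whatever the plugin needs before it becomes visible. */
  preload(resource: ResourceDescriptor): Promise<void>;

  /** Mount into the Stage and start playback / interaction. */
  show(container: HTMLElement): void;

  /** Temporarily hidden (the next segment took over); keep internal state. */
  hide(): void;

  /** Release everything; the Core Manager will not call this plugin again. */
  destroy(): void;
}

// Minimal placeholder for an entry in the per-episode configuration file.
interface ResourceDescriptor {
  id: string;
  type: string; // 'video' | 'interaction' | ...
  url: string;
}
```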

Additionally, the interaction's lifecycle was hidden behind a well-designed, unified API. I've always been strongly against developers manually controlling low-level things like lifecycle in a fully managed environment. So, from the beginning, I was quite aggressive and disabled manual lifecycle reporting at the framework level. The reasons were simple:

  • First, the framework itself has its own resource pre-warming tasks that the interactive program is completely unaware of. When you report "I'm ready" to the player, it's a false "ready."
  • Second, exposing such a low-level API could lead to a non-standard initialization process, resulting in all sorts of bizarre initialization methods (especially if you have people on your team who love to "micro-innovate"). This, in turn, creates a variety of undefined behaviors. When the framework needs a unified adjustment later, it could break many things (we had already experienced this pain).
  • Developer-written preload tasks could potentially pile up with the framework's preload tasks, causing severe frame drops in the first few seconds of the program.

For a long time, our interactive program's lifecycle was incomplete. It only considered the framework-level resource pre-warming and didn't account for resources managed by the interactive program itself (like fonts, textures, audio). The philosophy was basically, "If we can't do it well, let's not do it at all and wait for a better solution."

Later, this part of the API was properly designed:

  • During initialization, the interactive program can submit its own callbacks to add its tasks to the framework's initialization queue. The framework then manages each sub-task using Time Slicing to ensure they don't block rendering.
  • The only development convention is that the submitted initialization tasks must be granular enough to be manageable by the Time Slicing Queue.
  • Once both the framework's and the program's own preload tasks are complete, we resolve a Promise and notify both the player and the interactive program that all preloading work is done.

Through this method, we successfully hid the dangerous operation of "manually reporting lifecycle," achieving the goal of hiding implementation details.
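
From the interactive program's side, the surface that remains is tiny. A minimal sketch, assuming hypothetical names like addInitializeTask and ready (the real API differs):

```ts
// Hypothetical framework API: the interactive program never reports "ready"
// itself; it only contributes small tasks to the framework's init queue.
declare const framework: {
  addInitializeTask(task: () => void | Promise<void>): void;
  /** Resolves once framework pre-warming AND all submitted tasks are done. */
  readonly ready: Promise<void>;
};

declare function preloadAsset(url: string): Promise<void>;
declare function startInteraction(): void;

// Tasks must be granular enough for the Time Slicing Queue to interleave them
// with rendering, so load assets one by one instead of in a single big task.
for (const url of ['font.woff2', 'sprite.png', 'bgm.m4a']) {
  framework.addInitializeTask(() => preloadAsset(url));
}

framework.ready.then(() => {
  // Both the framework's and our own preloading are finished; start the show.
  startInteraction();
});
```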

Resource Preloading Mechanism

The problems with the preloading mechanism were mainly in these areas:

  • All videos were preloaded when the player loaded. This meant a new <video> tag was created for every video segment, leading to very inefficient network and memory usage. This preloading mechanism also caused the graphics driver to crash on all AMD GPU devices when the website loaded: the screen would flash black, the system would fall back to the integrated GPU, and Chrome would disable hardware acceleration, breaking all WebGL rendering.
  • The timing of resource loading was poorly orchestrated. Very large resources couldn't be loaded before the program started, leaving a large black background on the screen for a long time until the textures finally loaded. It looked like the program had crashed, but it was actually running fine.

The optimizations for this part were a bit more niche:

  • With the lifecycle redesign, the problem of video preloading crashing GPU drivers was eliminated.
  • For the interactive program's resource preloading, we designed a resource management tool to centralize all textures, audio, and other assets. It allowed us to mark which ones needed to be loaded with high priority and even cached to the local disk. Depending on the settings, resources would start caching to IndexedDB, the Cache API, or the browser cache as soon as the player loaded, even before the iframe for the interaction was loaded.
  • Since the player and the interactive program don't necessarily run on the same domain, browser caches might not be shared depending on the implementation. Even if the browser cache is shared, IndexedDB and Cache API are hard to access across origins. So, we used a Service Worker to intercept all resource preload requests from within the interaction and delegate them to the parent window. For mobile apps, there was an even more reliable Resource Loader Native Backend for caching.
  • To ensure program stability, both the interactive program and its resource files were designed with a failover mechanism. It would try a "preference list" of local caches and various CDNs one by one. If a resource was unavailable, it would fall back to the next one. Although this mechanism might seem a bit redundant, it was a great help in maintaining API consistency. And on one or two occasions, a CDN actually went down, but our online service was unaffected. It goes to show that an extra layer of client-side resilience is very useful.
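
The failover described in the last bullet is simple enough to sketch; assume each entry in the preference list is a function that tries one source (a local cache, a specific CDN) and returns null or throws when it can't deliver:

```ts
// Sketch of the "preference list" failover: try local caches and CDNs in order
// and fall back to the next source if one is unavailable. Names are illustrative.
async function loadResource(
  id: string,
  sources: Array<(id: string) => Promise<Response | null>>,
): Promise<Response> {
  for (const trySource of sources) {
    try {
      const response = await trySource(id);
      if (response && response.ok) return response;
    } catch {
      // A dead CDN or a cold cache is expected; just move on to the next source.
    }
  }
  throw new Error(`All sources failed for resource ${id}`);
}
```

Because the same signature covers IndexedDB, the Cache API, and plain fetch, the caller never knows or cares where the bytes actually came from, which is what kept the API consistent.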

We were also working on this part to add an "offline course caching" feature to the website. Unfortunately, due to tight deadlines and the project being halted later, this was only half-finished. I felt quite down when I saw a similar feature appear on YouTube later.

Resource Management Mechanism

This mechanism was designed to consolidate a bunch of miscellaneous problems, such as:

  • Because audio and video tracks were separate, they had to be uploaded as two separate files to the CDN. This meant the content creation team had to separate the video and audio tracks beforehand, which was an annoying and pointless task.
  • The interactive programs contained a large number of hardcoded resource URLs. Although a simple "resource upload backend" was introduced later, it could only handle video and audio uploads. Images and other resources still had to be manually uploaded to the Alibaba Cloud OSS console, and then the URL had to be pasted into the program.
  • Our system's operating environment was quite complex. Online, we had a testing platform, a production platform, and an education platform. Later, we added a Steam client and various other environments. However, the architecture of the "resource upload backend" was not designed for this complexity. It simply stored a URL in the database, making data migration between platforms an impossible task. At the time, if we wanted to release content on a new platform, we had to resort to brute-force methods like exporting and importing SQL dumps or simply re-uploading all the resources.
  • Furthermore, the user experience of this "resource upload backend" was atrocious. It involved layer after layer of modal dialogs with various forms to fill out. I developed PTSD from organizing and uploading resources with it.
  • On the Steam client, it was even more ridiculous. I had to use regex and custom scripts to parse the project's build output, download all the URLs to the local disk, and then use Electron's API to intercept network requests and make them read from the disk. So, every time the project was updated, I had to go through this bizarre process of "offline-izing" the online project (since I was always the one handling the Steam version releases, I know exactly how painful it was).
  • Multi-device support. The design mockups often specified "use this texture for mobile" and "use that texture for desktop." Such requirements were handled by writing custom logic in each place, which made the codebase very messy.
  • Multi-language support. With hardcoded URLs, different texture variations for multi-device support, and the addition of multi-language requirements, the maintenance costs skyrocketed. This should be self-evident.

To solve this chaotic mess of requirements, we scrapped the old resource management system entirely and rebuilt it. The new logic works like this:

  • Both videos and media resources within interactions are no longer treated as separate concepts. They are all called "Resource Files," and each resource file has "tags" that can be weighted.
  • "Resource Files" can be bundled into "Resource Groups." Within a group, a selector can be used to choose a resource file based on its tags. This makes things much clearer:
    • For a typical video, I just need to specify the resource group and then provide information like "resource type is video, audio, or subtitle," "language," and "platform" to automatically filter for the corresponding file.
    • However, most of this information doesn't need to be added manually. You just need to select a video with an audio track, along with subtitle files, and drop them into the resource manager. A preprocessing plugin will automatically separate the video and audio tracks (it can even auto-detect and fix encoding issues for Safari, though this feature is only half-implemented), bundle the videos into a resource group, and automatically tag them by resource type. All you need to do is go into the editor and fill in the few remaining details.
    • These resources are stored locally before being "published." Only when the publish button is pressed are they distributed to various CDNs. The distribution process is automatic; you don't need to upload them to each CDN manually.
    • The same principle applies to textures within interactions. Select a group of resources, drag them into the resource manager, and if it detects a set of textures, it will automatically group them. Then you fill in the various tags. During development, you just need to write the group ID; you don't have to worry about the specific logic for switching languages or platforms. These implementation details are consolidated and well-hidden by the "selector" mechanism.
    • The resource manager has its own small server. During local development, you can load local resources directly without first uploading them to a CDN and hardcoding the URL into the project.
  • Resource metadata processing was also "flattened" into a peer concept. For offline clients, the resource manager directly generates a data package. For online clients, it uploads the data to each platform separately. This avoids the hassle of horizontal data migration across different platforms.
  • Finally, with a few simple operations, we even achieved one-click generation of installation packages for the Steam and Android clients. The entire packaging and publishing workflow was completely streamlined.
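
To make the "weighted tags plus selector" idea concrete, here is a deliberately simplified sketch. The field names are invented, and the real selector scores far more dimensions (resource type, language, platform, device class):

```ts
// Hypothetical model of "Resource Files" with weighted tags, grouped into a
// "Resource Group" that a selector queries.
interface ResourceFile {
  url: string;
  tags: Record<string, { value: string; weight: number }>;
}

function selectFromGroup(
  group: ResourceFile[],
  query: Record<string, string>, // e.g. { type: 'video', lang: 'en', platform: 'mobile' }
): ResourceFile | undefined {
  let best: { file: ResourceFile; score: number } | undefined;
  for (const file of group) {
    let score = 0;
    for (const [key, want] of Object.entries(query)) {
      const tag = file.tags[key];
      if (!tag) continue;                                  // untagged dimension: neutral
      if (tag.value !== want) { score = -Infinity; break; } // hard mismatch: exclude
      score += tag.weight;                                  // weights break ties
    }
    if (score >= 0 && (!best || score > best.score)) best = { file, score };
  }
  return best?.file;
}
```

During development you only ever write the group ID and a query like { type: 'texture', lang: 'en', platform: 'mobile' }; which file actually ships is the selector's problem.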

Communication Mechanism

The communication mechanism between the interactive program and the main site had always been problematic. If you opened the old project in two tabs simultaneously, you'd find their communication getting mixed up. The reason is simple: the entire communication process worked like a fire-and-forget UDP broadcast. You didn't know where a message was going, or whether it even arrived. All communication ran on the raw postMessage API, so the main site would often send a message to the interactive program before the program had even started. The message was never delivered, and the entire loading process would just hang there.

This also created significant risks for future feature implementations. A colleague once proposed an idea to intersperse mini interactive programs within an e-magazine to create an interactive magazine. But with the current communication method, we could only insert one interactive program per page, which was clearly unacceptable.

Another major problem with postMessage is that it's not type-safe. No matter how you wrap it, it's difficult to implement an elegant type-safety strategy. Additionally, to alleviate performance issues on "low-end devices," we collaborated with a cloud gaming provider, which required cloud-based web page integration. This changed the internal-external communication from postMessage to WebSockets. If the communication layer wasn't decoupled, the entire project would become a chaotic mess.

To solve the reliability issues at the communication layer, we introduced Jack Works' JSON RPC Call. This split the communication method and the communication content into two layers. Now, it doesn't matter much whether the underlying transport is postMessage, WebSockets, or even HTTP.
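
The sketch below is not the API of the RPC library we adopted; it only illustrates the layering idea, i.e. the RPC layer talks to an abstract channel, and the channel can be backed by postMessage, a WebSocket, or anything else:

```ts
// Transport layer: the RPC layer only ever sees this interface, so swapping
// postMessage for WebSockets (or HTTP) never touches business code.
interface MessageChannelLike {
  send(data: unknown): void;
  onMessage(handler: (data: unknown) => void): void;
}

const postMessageChannel = (target: Window, origin: string): MessageChannelLike => ({
  send: (data) => target.postMessage(data, origin),
  onMessage: (handler) =>
    window.addEventListener('message', (e) => {
      if (e.origin === origin) handler(e.data);
    }),
});

const webSocketChannel = (socket: WebSocket): MessageChannelLike => ({
  send: (data) => socket.send(JSON.stringify(data)),
  onMessage: (handler) =>
    socket.addEventListener('message', (e) => handler(JSON.parse(e.data))),
});

// Content layer: typed request/response semantics sit on top of whichever
// channel was injected; this is where delivery tracking and type safety live.
declare function createRpc<T>(channel: MessageChannelLike): T;
```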

Autoplay

Oh my gosh... this was definitely the most torturous part. Because the W3C doesn't have a very restrictive standard for media playback behavior, each browser engine has come up with its own creative ways to prevent your page from making sound. Chrome is okay; you just need to interact with the page once to enable sound. Safari, on the other hand, simply refuses to autoplay any video with sound, which meant that video segments that had been split up couldn't be played one after another.

In the old player, to determine whether the current environment would allow autoplay, a silent audio URL was hardcoded into the code. The player would try to play this audio as a probe (Chrome gates this behavior on the Media Engagement Index, or MEI). However, I hate this kind of hardcoding: if the hardcoded URL went down one day, or if the client's network was unstable, the autoplay detection would fail. It's very unreliable.

The interactive program was also a major disaster area. Because the content inside an iframe is treated as a separate page, autoplay permissions are not inherited. You have to interact with the iframe at least once to trigger audio autoplay. This placed certain constraints on content design. For example, a common method to create a "seamless transition" feel is:

  • The video portion plays a scene, and on the last frame, it transitions into the interaction.
  • Inside the interaction, an identical scene to the last frame is displayed, the audio continues to play, and some simple animations are shown.
  • Suddenly, you're told that you can interact with the screen.
  • The naive and innocent user exclaims, "Whoa!"

Yeah, this actually happened, more than once. Even my boss was fooled.

Many good creative ideas were limited by the browsers' aggressive autoplay prevention policies. To solve this problem, we built a global singleton Audio Station and something called the Evil Trigger. In simple terms, the Evil Trigger tries to bind events to every button and anchor tag on the page. As soon as one of them is pressed, the AudioContext managed by the Audio Station is resumed. All subsequent media playback requests go through the Audio Station, allowing for all sorts of autoplay without having to worry about the browser's mood.

Ah, well, that was the ideal. In reality, the Evil Trigger was never finished. Autoplay is currently tied to the player's play button, which is good enough.

The cost of implementing "autoplay" was actually quite high. The Web Audio API requires all audio files to be fully decoded into PCM and held in memory. This is very memory-intensive, to the point where we had to convert most of the audio to mono and reduce the sample rate to keep mobile browsers from running out of memory.
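
At its core, the Audio Station is the classic "resume the AudioContext on a real user gesture, then route every playback request through that one context" trick. A stripped-down sketch (the Evil Trigger would simply call unlock from listeners attached to every button and anchor):

```ts
// Minimal sketch of a global, singleton "Audio Station".
class AudioStation {
  private static instance: AudioStation;
  readonly context = new AudioContext();

  static get shared() {
    return (this.instance ??= new AudioStation());
  }

  /** Call from any real user gesture (click, tap) to satisfy autoplay policies. */
  unlock() {
    if (this.context.state === 'suspended') void this.context.resume();
  }

  /** All playback goes through the already-unlocked context. */
  play(buffer: AudioBuffer) {
    const source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.context.destination);
    source.start();
  }
}

// Today, the player's play button doubles as the unlock gesture.
document.querySelector('#play')?.addEventListener('click', () => {
  AudioStation.shared.unlock();
});
```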

Time Management

Another very granular requirement was time management. Many small, corner-case features needed something like this:

  • For example, syncing the time of the Audio Context and a muted video (if the progress of the two drifts too far apart, it automatically realigns them).
  • For example, loading video subtitles.
  • For example, something the boss wanted: at a certain point in the video, his avatar would pop up from the side of the video and say something to guide the user.
  • For example, the ridiculously complex BGM system that was talked about at the launch event.

In the previous player, all of these things were implemented separately and scattered throughout the codebase. But in reality, we could consolidate them into a single module: Time Management.

Syncing audio and video tracks seems straightforward. A very talented colleague of mine implemented something similar to NTP to correct the time for video, audio, and even the time inside and outside the interaction. But this correction process involves some psychological principles (which is my old field of study) and requires setting some priorities. It's written in the internal documentation, but since it's quite detailed, I won't go into it here.
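
Stripped of those priority rules, the baseline behavior from the first bullet above ("if the two drift too far apart, realign them") is just a periodic check. A hedged sketch:

```ts
// Illustrative drift correction between the Web Audio clock and a muted <video>.
// The real implementation is NTP-like with priority rules; this only shows the
// "realign when the drift exceeds a threshold" baseline.
const MAX_DRIFT_SECONDS = 0.25; // hypothetical tolerance

function syncVideoToAudio(
  video: HTMLVideoElement,
  audio: AudioContext,
  audioStartedAt: number,
) {
  const audioTime = audio.currentTime - audioStartedAt; // seconds since playback began
  const drift = Math.abs(video.currentTime - audioTime);
  if (drift > MAX_DRIFT_SECONDS) {
    // Seek the cheap side (the muted video) back onto the audio clock.
    video.currentTime = audioTime;
  }
}
```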

The rest of the features can be understood as a series of events happening on a timeline, whether it's switching BGM, showing and hiding subtitles, or popping up a dialog box. So, we attached a configuration list to each video that describes what to do at which keyframe. These frames come in two types:

  • At a specific moment: When the time progress crosses this moment, an event is executed. This can be set to execute only once or every time it's crossed. Dialog box events can be done this way.

  • During a time period: When the time progress is within this period, a certain state is enabled; otherwise, it's disabled. BGM and subtitles use this method. In fact, features like YouTube's video chapters can also be implemented with this mechanism.

What specific event is triggered is handled by a plugin mechanism. The player checks which plugin to call based on the task type and then executes the corresponding task.
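
In data terms, the configuration list attached to each video is just a sequence of those two frame types, dispatched by task type to a plugin. A rough sketch with hypothetical field names:

```ts
// Hypothetical shape of the timeline configuration attached to each video.
type TimelineKeyframe =
  | {
      kind: 'moment';     // fires when playback crosses this point
      at: number;         // seconds
      once: boolean;      // fire only the first time, or on every crossing
      task: TimelineTask; // e.g. pop up the avatar dialog
    }
  | {
      kind: 'period';     // a state that is on inside the range, off outside
      from: number;
      to: number;
      task: TimelineTask; // e.g. show this subtitle line, play this BGM
    };

interface TimelineTask {
  type: string;    // which timeline plugin handles it ('subtitle', 'bgm', 'dialog', ...)
  payload: unknown; // plugin-specific data
}
```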

Subtitles received some extra treatment. SRT format subtitles are imported and converted into keyframes on the timeline, making it easier for the video creation team to add subtitles. Again, due to time constraints, this part was only half-finished. Not all timeline task plugins have a corresponding implementation yet; it depends on whether we encounter the corresponding needs in the future.

A tragic fact: in the earliest implementation of the player's BGM, all the tracks to be played were hardcoded directly into the player, audio URLs included. No matter which project you were watching, the BGM configurations for every project were loaded into memory.

And the BGM URL wasn't even a complete address; it was a string that had to be passed through a certain function to produce a result. So when we migrated projects, we had to manually pick out these addresses one by one and process them. It was very painful.

I was the one who recommended "temporarily" writing the addresses into the player back then. The trailer had already been released, the deadline was looming, and there was no time to polish many things, so we just had to pile them on first. The original thought was "we'll separate it out later," but unfortunately, the person in charge of scheduling later didn't care about this matter at all, so this piece of crap remained in the project.

Multi-Client & Skinning System

This thing caused my colleague a lot of grief. I'm sorry! (ORZ——)

Late in the project, we encountered several new requirements. First, we needed a Steam client. This Steam client had to be completely disconnected from the main site, which meant all content metadata had to be offline. Additionally, the boss wanted the video player on the Steam version to be different from the one on the website. How different? So different that its own mother wouldn't recognize it. It was definitely not something you could just slap some CSS on and call it a day. We couldn't just fork the player, make some changes, and be done with it. As more projects came along, more and more player variations would appear, and the maintenance cost would explode exponentially.

On top of that, the website also had different modes, like "mini-interactions" and "interactive videos," which were two completely different things. As mentioned earlier, we had also considered embedding interactive content in an e-magazine, which would likely require yet another skin. To handle this problem, we reorganized the architecture again, separating the data fetching layer into its own SDK. Different SDKs would handle different types of data sources, and the SDK would also manage the hot-loading of skins.

Skin hot-loading is a particularly clever part of the system. We used the great Remote Component to implement another form of "micro-frontend": remote component loading. The overall implementation idea is as follows:

  • The remotely loaded skin module returns a Hook and a Component. The Hook tells the player which properties to inject, and the Component is responsible for wrapping the entire player, allowing it to handle content that spans across episodes.
  • All elements of the player itself are implemented via plugins. The Stage layer, the loading animation layer, the subtitle layer, the dialog area layer—each is a component with a completely consistent API. They communicate with the Core Manager to perform their respective functions. Developers can use the Hook to directly modify the layer order, or even replace or add certain layers, which gives content development a great deal of freedom.

But every beautiful flower has its thorns. Because component hot-loading is implemented with eval, if you don't put a console.log in the hot-loaded module, you can't even get the devtools into the module's evaluation context, let alone set breakpoints for debugging. In short, it's a rather difficult thing to work with. Plus, the entire system is so complex that a change in any single package can ripple outward, and the player's behavior can always change in unexpected ways. If you're not very familiar with the player's codebase, you can easily be hit by an unexpected breaking change. This is why we still haven't tagged any of the libraries as 1.0.0. But this is a problem that must be solved eventually, and we need to put in the effort to do so.

Performance & Heat Issues

The initial product design was actually "no need for mobile." But then, just before the closed beta, they suddenly said, "We need to consider the mobile user experience!" This change in requirements caught us completely off guard. Many of the optimizations were added piecemeal as patches. To the bosses, I love you guys ♥~

Back on topic, these optimizations are related to the drawing mechanisms of PIXI / Three, as well as the WebGL implementations of various browser vendors. Since we used Three less later on, and it was mainly my colleague who was tuning it, I'll skip that part for now and focus on the PIXI side, which I was working on.

Rendering engines on the web don't have advanced features like partial rendering. Every frame is completely redrawn from scratch, so performance and power consumption are somewhat problematic. Additionally, Safari's WebGL used to be their own implementation, which not only didn't support WebGL2 but also had terrible performance, so performance optimization was an urgent matter.

The new version of Safari has finally ditched its own implementation and started using Google's ANGLE. Now, Firefox, Chromium, and Safari all use the same WebGL to Native binding. Not only is there a significant performance improvement, but the landscape of bugs will also become less complex. Hooray, have a Coke.

But remember to turn off anti-aliasing. The ANGLE version in iOS 15 has a bug that messes up the draw call order and fails to clear the render buffer, causing visual glitches.

The simplest solution is definitely dynamic frame rates. Thanks to the unified wrapper mentioned earlier, I just needed to secretly change the Ticker's behavior behind the scenes. The colleagues working on business logic were not affected at all, which was very convenient.

The approach to dynamic frame rates is as follows:

  • First, lower the base frame rate. For high-resolution screens, cap the maximum frame rate at 60fps to free up time for the main thread to do other work. If there is no user interaction, the frame rate drops to 15fps.
  • Implement a frame rate request mechanism. If needed, a component can request a higher frame rate from the Ticker, for example, when playing an animation or for special requirements.
  • In the unified animation interface (a simple binding for anime.js), request a high frame rate of 60fps to ensure animations don't stutter.
  • When the user's mouse or finger interacts with the screen, provide the full 60fps frame rate, similar to Apple's ProMotion dynamic frame rate mechanism, but we didn't go up to 120fps. After all, it's a web tech stack; you can't ask for too much.

After adding this mechanism, the heat issue on mobile devices was immediately alleviated. A colleague once complained to me, "Even Genshin Impact doesn't get as hot as your app." Now it doesn't, which is great.

Another performance issue was related to task scheduling. PIXI actually has two Tickers: the APP Ticker for rendering and the System Ticker for global mouse event management. The System Ticker is not well-documented; I only discovered it through flame graphs and digging into the source code. The problem we faced was that the System Ticker's performance was very poor: with a lot of interactive elements on the screen, its bounding box calculations made everything very choppy. To handle this, we rescheduled the tasks so that one frame is used for rendering and the next (otherwise skipped) frame is given to the System Ticker, spreading the computational load across frames.
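
The rescheduling itself is just frame interleaving. A sketch of the idea, driving both tickers manually (a simplification, not what PIXI does out of the box):

```ts
import { Ticker } from 'pixi.js';

// Interleave the render ticker and PIXI's System Ticker (which drives pointer
// hit-testing) so their work never lands on the same frame.
Ticker.shared.autoStart = false;
Ticker.shared.stop();
Ticker.system.autoStart = false;
Ticker.system.stop();

let frame = 0;
const drive = (now: number) => {
  if (frame % 2 === 0) {
    Ticker.shared.update(now); // even frames: render the scene
  } else {
    Ticker.system.update(now); // odd frames: bounding-box / interaction upkeep
  }
  frame += 1;
  requestAnimationFrame(drive);
};
requestAnimationFrame(drive);
```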

Task Slicing

This is another rather esoteric matter. We all know that JavaScript is a single-threaded language, which means that without Web Workers, most tasks have to be piled onto one thread. If a task can't be completed within one frame, your page will start to drop frames.

Handling this is a bit tricky. We adopted a similar approach to React's Time Slicing, breaking down large tasks into smaller ones and putting them in a queue. All tasks have to wait in line. After a batch is completed, we check the time. If there's enough time left, we continue; if not, we skip and wait for the next frame. It sounds simple, but it's not so easy in practice. It involved various polyfills and even modifying the standard Promise API to add a couple of features. In the end, the problem was solved, and the result was good.
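
The heart of it is a deadline-checked queue, something in the spirit of this sketch (minus the polyfills and the Promise surgery mentioned above):

```ts
// Minimal time-sliced task queue: run small tasks until the frame budget is
// spent, then yield and continue on the next frame. Illustrative only.
type Task = () => void;

class TimeSlicingQueue {
  private readonly tasks: Task[] = [];
  private running = false;
  constructor(private readonly frameBudgetMs = 8) {}

  add(task: Task) {
    this.tasks.push(task);
    if (!this.running) {
      this.running = true;
      requestAnimationFrame(() => this.flush());
    }
  }

  private flush() {
    const deadline = performance.now() + this.frameBudgetMs;
    // Run queued tasks until we are out of time for this frame.
    while (this.tasks.length > 0 && performance.now() < deadline) {
      this.tasks.shift()!();
    }
    if (this.tasks.length > 0) {
      requestAnimationFrame(() => this.flush()); // resume on the next frame
    } else {
      this.running = false;
    }
  }
}
```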

And don't talk to me about Web Workers. It's not that we haven't tried. While working on the machine learning project, I even built a message queue based on Web Workers. It worked, but the developer experience was indescribable.

If your rendering lags, users will blame you for writing a bad website. But if your screen doesn't lag but the tasks run slowly, users might just blame the browser for being trash.

This optimization was used in two places: first, when the player is preloading resources, it needs to schedule a task queue; second, when the interactive program is preloading, it needs to queue tasks so that too much work inside the iframe doesn't freeze the parent page. The browser model here is quite complex. On one hand, the security model for a cross-origin iframe differs from a same-origin one, leading to slightly different behavior. On the other hand, some versions of some browsers don't run requestAnimationFrame when the iframe is not visible, which prevents the internal task scheduler from running, so we had to rely on the player's queue system to pull tasks. These were relatively easy things to handle: written in a few strokes, they worked on the first try with almost no bugs.

Multi-Project Maintenance

Later on, we had several projects running in parallel, but syncing many of the basic configurations required manual copy-pasting. To solve this problem, we packaged all our bundler configurations into a separate package. This package also integrated parts of the development environment and Service Worker configurations. This way, developers only need to upgrade the bundler integration package, and all the build configurations would be updated, and the development environment would be upgraded to the latest version, reducing a lot of miscellaneous issues during development.

Q&A

So, why not Unity?

This is a very nuanced question, with both historical and practical reasons. When I first joined the team, I was the one shouting the loudest for Unity and Cocos. But later, when it came to making technical decisions, I was also the first to raise objections. My thoughts were roughly as follows:

From what I knew at the time, the project was positioned as a platform with a unified entry point, from which users would access various courses. With this requirement, using Unity was simply not feasible.

  • If we were to build a platform, it would inevitably involve hot-loading:

    • You can't get away with this kind of hot-loading under Apple's nose. While a regular project might be able to secretly bypass some of Apple's restrictions, Apple would absolutely not tolerate hot-loading as a feature.
    • Version management between the main application and the interactive video programs would become very complicated. Changes to the main app's API and the versions of the interactive video programs would create a mutually blocking dependency. It's not like a website deployment where you can just upload a new version and everyone gets the latest player and interactive content.
  • There were also some historical reasons:

    • It was unlikely that we could just shut down the existing live site. As is well known, Unity's loading performance on the web is terrible. Loading a heap of those little bean-like interactive programs would cause an unmanageable performance disaster (although we did leave room in the player's API design for Unity integration later, how to actually implement it in the business was still highly uncertain).
    • How to handle the relationship between the player and Unity? Should we scrap the entire player and rewrite it in Unity, or go with a hybrid approach, half web and half Unity? If we chose a hybrid model, the mobile app would become a WebView embedding a player that then runs Unity on WebGL on a Mobile Browser. That's not a viable solution.
    • If the entire player were to be rebuilt with Unity, how would we do it? How would we engineer it? How would we re-handle the engineering complexity brought by the platformization?
  • And some rather practical issues:

    • If we completely abandoned the Mobile Web, we would also give up the possibility of using WeChat as a traffic entry point.
    • The issue of money (Unity developers are very expensive) and what to do with the people we had already hired. Not everyone is willing to learn Unity, and not everyone can learn C#.

Overall, introducing Unity would have brought a great deal of uncertainty to the team. This uncertainty would require very experienced developers to handle, as well as additional R&D manpower and funding. But unfortunately, by the time we reached that point, both manpower and funding were gone, so we could only bite the bullet and push forward.

Fortunately, through a series of optimizations, we (recently) succeeded in achieving a near-native user experience with web technologies. This research experience is actually very valuable. We have to face a reality: no matter how much you hate Electron, how much you hate everything being wrapped in a WebView, how much you hate mini-programs, these things are inevitably invading your life. The reason is simple: for businesses, the web is definitely the most cost-effective solution. It can meet the needs of rapid product development and high-speed iteration, and it is the technical choice with the lowest development cost. That's why we see more and more software joining the web-ification trend, starting to mix in web tech stacks or even being completely rewritten with web technologies. This trend is not going to reverse anytime soon; it will only intensify.

I personally also dislike the trend of using WebView for everything, of everything being Electron, but when it comes to cash, business people are honest. That's life.

Conclusion

Such a massive undertaking is not something I could have built alone. In fact, throughout the entire project, I mainly defined the architecture, coded some of the business logic that required coordination between different parties, and did a bit of UI and product design on the side. The vast majority of the development work was done by our super reliable colleagues. I can only say that after all the twists and turns, the team that stabilized is full of top-notch people. Without them, the entire project would not have been able to function properly.

I still vaguely remember the night of the first production push after the biggest architectural change. I had expected it to be full of bugs as usual, maybe even a whole new spectrum of bugs. But surprisingly, the player had no bugs—at least, all the bugs that affected the main user flow were eliminated, and no new ones even appeared. This had never happened in my two years of work experience, and I was truly moved.

Overall, you might find that behind every business requirement lies a complex technical decision. If the product's early technical foundation is weak and the architecture isn't designed to accommodate change, then later development can only pile layer upon layer of crap on top, until the shit mountain quickly collapses. After these years of work, I have two main takeaways:

First, on technology: the true value of technology is not in the features you can show off on the surface, but in the logic of thinking about problems and the methods of solving them. Code is just a means to solve a problem; without good thinking to support it, it can also become the source of problems.

Second, on product: when implementing a product, the roles and work between different parties should be clear and distinct:

  • The boss is responsible for clarifying what the product is, its core concepts, and its boundaries.
  • The product manager should truly work around the product's core concepts; in other words, their job is to "realize the product's core concepts."
  • The designer's main job is to "realize the product solution."
  • The engineer's job is to "realize the design requirements."

When each level performs its duties without diverging, the entire product can move forward quickly. Let me give a few examples of divergence:

  • The boss might not know what they're doing, dabbling in this and that, and if the core concept is never nailed down, the underlying technical architecture will experience major earthquakes daily, and the entire product will become very unstable.
  • If the product manager does not serve the product's core concept and gets lost in their own fantasy world, creating useless peripheral features every day, then the product's form will not be stable and clear.
  • If the designer does not understand the product's goals and merely pursues satisfying their own "aesthetic taste" and desire to be different, then the developers will waste a lot of time on useless boilerplate logic.
  • And if the developer cannot grasp the design intent behind the mockups and product requirements, cannot cut straight to the heart of a problem, or even stalls for the "thrill of reinventing the wheel," then the team's progress will be hampered.

Each workflow step is to provide an "implementation" for the previous one. In the process of advancing a commercial project, it is very serious and important to put the product's core goals first, not your own sentiments, emotions, or complexes. After all, if you look at the history of internet startups in China, those who tried to make a living off of sentiment mostly did not end well.

That's the behind-the-scenes story of the tech team over the past three years. I hope it brings you some inspiration, or perhaps a surprise.

Happy Coding!