Statistical Analysis of Immunicum Phase II MERECA Results
Disclaimer: All statistical modeling has inherent uncertainties and potential errors, and does not guarantee the predicted results. This post should not be seen as final evidence about the outcome of the studies, but only as a reasonable estimation based on a model that is, in my judgement, suitable for the problem. This is not a recommendation about investing in the company (something that should be done on better grounds than a blog post on the Internet).
Disclaimer 2: I’ve seen references to / discussions about this analysis together with the term “significance”. Therefore I want to make the following very clear: the term “significance” pertains to statistical hypothesis testing, which is not what I’m doing here. In the procedure for statistical hypothesis testing you must 1) define the hypotheses before starting the study, 2) perform the study exactly as planned (e.g. to the end of the pre-defined period) before judging the results, 3) afterwards perform a significance test against the null hypothesis (that there is no difference). A reasonable way of performing a study like this would be to run the study for a pre-defined time and then perform a “log-rank test” on the survival curves with a pre-defined significance level, e.g. 0.05. I think it is entirely reasonable and expected that Immunicum is careful in its language about interpretation of the results, and follows the scientific method for how to perform studies like this one. That does not mean that we, as external parties, shouldn’t make statistical estimations based on the existing data. My analysis can be interpreted as a qualified guess about what the results / effect might be, together with a measure of how surprised we would be to see very different results. To obtain a watertight estimate you would also have to complement that with prior estimates of how probable the underlying model is, preconceptions about the probability of an effect, etc. Even that would not hold as a scientific conclusion, since the scientific method places greater importance on following a preset procedure that is comparable to other studies than on making the best possible guess as to what the result might be. In everyday life, however, as in investment decisions, I believe it is more useful to base decisions on statistical estimations such as the one performed in this analysis than on the scientific method.
Immunicum reported updated results from their Phase II MERECA-study February 6th 2020, comprising individual survival data up until December 2019 as well as the respective Kaplan-Meier estimate of the survival curve (i.e. the average survival function).
Simply put, a survival curve shows what proportion of subjects is still alive at time T, e.g. 18 months. A survival function shows, for a specific subject with certain properties, the probability of that subject being alive at least until T.
When we make a Kaplan-Meier estimate, we make a very general “best guess” about what the average survival curve looks like, based on data that is incomplete (censored), since not all subjects have been evaluated over the time interval of interest. Below is a K-M estimate from the December 2019 follow-up:
The black, dashed line corresponds to the median (median survival occurs when the black line crosses a vertical colored line, although this requires the colored line to be fixed, which it is only up to 24 months in this case).
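For readers who want to see how a Kaplan-Meier estimate is built from censored data, here is a minimal sketch in plain Python. The function name `kaplan_meier` and the toy durations are my own illustration and are not MERECA data; a subject with event flag 0 is right-censored, i.e. still alive (or lost to follow-up) at the last time we saw them.

```python
def kaplan_meier(durations, events):
    """Return the Kaplan-Meier estimate as a list of (time, survival) steps.

    durations: follow-up time per subject (e.g. months)
    events:    1 if a death was observed at that time, 0 if right-censored
    """
    # The estimate only changes at times where at least one death was observed.
    death_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = [(0.0, 1.0)]
    for t in death_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical, NOT from the MERECA study).
durations = [5, 8, 8, 12, 15, 20, 24, 24]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(durations, events)
```

Note how every single death moves the curve by a visible step when the group is small. Censored subjects never cause a step, but they shrink the at-risk count, so later deaths cause larger steps.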
The Kaplan-Meier curve gives a passable indication of what the survival curve will look like, but is sensitive to individual events and difficult to draw further conclusions from, such as:
- How certain are we of a potential difference between two groups?
- If we repeated the study multiple times, what would the expected outcome be?
- What is the survival effect of combinations of factors like prognosis and treatment options?
To answer such questions, we can create a statistical model of the system under study that matches our understanding of cause and effect, and use the available data to estimate the parameters of that model. That in turn allows us to estimate the effect of combinations of properties like prognosis and treatment, and to simulate repeated studies.
A basic assumption of a survival model is that the actual lifetime of an individual subject is a random outcome resulting from an underlying, time-dependent hazard function that can depend on subject / treatment attributes. If the hazard is high, survival will on average be short. The hazard can also change over time, e.g. when the disease isn’t lethal until it has progressed to a certain stage.
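To make the link between hazard and survival concrete, here is a small sketch that turns a hazard given month by month into a survival function, using the standard relation that survival is the exponential of the negative accumulated hazard. The function name and the two hazard profiles are made up for illustration and are not estimates from the study.

```python
import math


def survival_from_hazard(monthly_hazard):
    """Convert a piecewise-constant monthly hazard into S(t) at month ends."""
    cumulative = 0.0
    survival = [1.0]  # everyone is alive at t = 0
    for h in monthly_hazard:
        cumulative += h  # accumulated hazard up to the end of this month
        survival.append(math.exp(-cumulative))
    return survival


# Hypothetical profile 1: the same hazard every month for 48 months.
constant = survival_from_hazard([0.03] * 48)

# Hypothetical profile 2: no risk for the first year, higher risk afterwards.
delayed = survival_from_hazard([0.0] * 12 + [0.04] * 36)
```

Both profiles accumulate the same total hazard over 48 months and therefore end at the same survival, but the paths there differ: in the second profile nobody dies during the first 12 months.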
Our goal, then, is to estimate the hazard function and how it depends on things like:
- Treatment with Ilixadencel
If we achieve this, we can also statistically simulate the expected outcome of the phase II study, given current data, and / or a hypothetical phase III study with a higher number of participants.
Our model can give us answers such as:
- What is the “best guess” for the hazard function, and how does it depend on different attributes?
- What is our uncertainty about this guess, given our limited current data?
The Kaplan-Meier curve only answers question #1, and only in the form of an average value for a group, instead of revealing the relation between individual properties.
The most common model for the intended purpose is the Cox proportional hazards model. A weakness of that model is that it does not take into account that the different factors affecting survival can vary over time in relative strength, e.g. when a treatment needs a minimum time before taking effect. Instead, we can use a model called Aalen’s additive hazards model.
The model assumes that the risk of dying over a certain time interval is affected by different factors, and that the relation between these factors can change over time.
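For reference, the two models can be written in their usual textbook forms as below. The covariates x_1, ..., x_p stand for subject attributes such as treatment or prognosis; exactly which covariates enter the model is not spelled out here, so treat the symbols as placeholders.

```latex
% Cox proportional hazards: covariates scale a common baseline hazard,
% with coefficients that are constant over time
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right)

% Aalen's additive hazards: each covariate adds a contribution
% whose size is allowed to change over time
h(t \mid x) = \beta_0(t) + \beta_1(t)\,x_1 + \dots + \beta_p(t)\,x_p

% Survival function implied by either hazard
S(t \mid x) = \exp\!\left(-\int_0^t h(s \mid x)\,ds\right)
```

In the additive form, a treatment that only starts working after some delay simply has a coefficient function that stays near zero early on and turns negative later.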
We feed all our existing data into the model and get as a result a “best guess” of the interdependencies between different properties and survival, as well as a measure of our uncertainty given the limited quantity of data. One result is a graph that looks like this:
The plot shows how the different factors affect the hazard over time. Interestingly, treatment with Ilixadencel seems to visibly lower the risk of dying, but the effect isn’t prominent until 20 months.
Update (2020-02-12): Another thing we can check is to complement the fitting of the model with the information from Phase I/II, where the subjects have been followed for a longer period of time, and which includes deaths after 48 months. This could give a fairer view of what happens in the late time intervals, where data is lacking. Summarized, the difference in those results compared to the graph above is that the red and yellow lines rise somewhat in the late parts (due to deaths of subjects with poor prognosis / sarcomatoid features), and that the hazard-reducing effect of Ilixadencel reaches a plateau at about -0.4 hazard and stays there between 40 and 66 months. The rest of the results are very similar to what is presented below, where only Phase II data is used.
Now we can use the best guess of the hazard functions (the dark lines in figure 2) to calculate a theoretical survival curve where we compare Ilixadencel + Sunitinib to only Sunitinib, based on the distribution of the other properties (IMDC-prognosis, sarcomatoid features) that occur in the studied group:
This is our best guess of the expected survival function (in general, i.e. not specifically for the Phase II study) with, and without, treatment with Ilixadencel, for subject groups with the same composition as that of the Phase II study.
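The calculation can be sketched roughly as follows: for every subject in the group, add up the hazard contributions that apply to that subject, convert the accumulated hazard to a survival probability, and average over the group. This is a simplified illustration under my own assumptions (monthly steps, made-up numbers, a hypothetical `group_survival` helper), not the exact code behind the figure.

```python
import math


def group_survival(baseline, effects, subjects):
    """Average survival curve of a group under an additive hazard model.

    baseline: monthly baseline hazard
    effects:  dict attribute -> monthly additive hazard contribution
    subjects: list of dicts attribute -> 0/1 indicator for each subject
    """
    cumulative = [0.0] * len(subjects)
    curve = [1.0]
    for month in range(len(baseline)):
        total = 0.0
        for i, subject in enumerate(subjects):
            hazard = baseline[month] + sum(
                effects[name][month] * value for name, value in subject.items()
            )
            # An additive model can produce negative values; clip at zero.
            cumulative[i] += max(hazard, 0.0)
            total += math.exp(-cumulative[i])
        curve.append(total / len(subjects))
    return curve


# Hypothetical numbers, chosen only to show the mechanics.
baseline = [0.02] * 48
effects = {
    "poor_prognosis": [0.03] * 48,
    "treated": [0.0] * 20 + [-0.015] * 28,  # effect sets in after 20 months
}
composition = [{"poor_prognosis": 0}, {"poor_prognosis": 0}, {"poor_prognosis": 1}]

without = group_survival(baseline, effects, [dict(s, treated=0) for s in composition])
with_treatment = group_survival(baseline, effects, [dict(s, treated=1) for s in composition])
```

Because both curves are computed for the same composition of subjects, any gap between them comes from the treatment term alone.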
To get a complete picture, we have a few steps left:
- We want to perform a “sanity check” of our estimated model against previous information
- We want to include our uncertainty about the estimated model in the survival function, to make a judgement about how certain we can be that the treatment has an effect
- We want to simulate possible outcomes of the Phase II study, including our uncertainty about the effect
- We want to simulate possible outcomes of a hypothetical Phase III study with more participants, including our uncertainty about the effect
Let’s begin by simulating a Phase III study and thereafter comparing it to previous data for Sunitinib. Here I perform a simulation of a study with 300 participants (with the same distribution of properties as the Phase II study). I repeat the study 100 times. NB: To include the uncertainty about our estimate of the effect of Ilixadencel, a new hazard effect curve is randomly generated for each repeat of the study, based on the uncertainty intervals shown in figure 2. The alternative would have been to always use the “best guess” in figure 2, but now we instead include both sources of randomness:
- The randomness in outcome that happens due to random survival of individual subjects
- The uncertainty in outcome due to our uncertainty in the effect of Ilixadencel, due to our limited set of available data
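A stripped-down sketch of such a simulation is shown below. It is only meant to show how the two sources of randomness enter, and it makes assumptions of its own: hazards are given per month, the uncertainty about the effect is represented by one random scaling factor per study (the real analysis draws from the model's uncertainty intervals, which is not reproduced here), and all numbers, the arm size and the function names are hypothetical.

```python
import math
import random


def simulate_arm(n, monthly_hazard, rng):
    """Simulate n subjects month by month; return fraction alive after each month."""
    alive = n
    fractions = [1.0]
    for h in monthly_hazard:
        # Probability of dying during the month, given alive at its start.
        p_death = 1.0 - math.exp(-max(h, 0.0))
        # Source 1: random survival of the individual subjects.
        deaths = sum(1 for _ in range(alive) if rng.random() < p_death)
        alive -= deaths
        fractions.append(alive / n)
    return fractions


def simulate_studies(repeats, n_per_arm, baseline, effect, effect_sd, seed=0):
    rng = random.Random(seed)
    control_curves, treated_curves = [], []
    for _ in range(repeats):
        # Source 2: uncertainty about the effect, redrawn once per study.
        scale = rng.gauss(1.0, effect_sd)
        treated_hazard = [b + scale * e for b, e in zip(baseline, effect)]
        control_curves.append(simulate_arm(n_per_arm, baseline, rng))
        treated_curves.append(simulate_arm(n_per_arm, treated_hazard, rng))
    return control_curves, treated_curves


def percentile(values, q):
    """Nearest-rank percentile of a list, for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def band(curves, month):
    """5th percentile, median and 95th percentile of survival at a given month."""
    at_month = [curve[month] for curve in curves]
    return percentile(at_month, 5), percentile(at_month, 50), percentile(at_month, 95)


# Hypothetical inputs, not the fitted model.
baseline = [0.03] * 48
effect = [0.0] * 20 + [-0.02] * 28
control, treated = simulate_studies(100, 150, baseline, effect, effect_sd=0.3)
control_band_48 = band(control, 48)
treated_band_48 = band(treated, 48)
```

Calling `band` for every month gives the lower edge, the middle line and the upper edge of a shaded interval of the kind described in this post.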
Here is the outcome of 100 simulated Phase III studies that include our uncertainty about the effect. The dark line is the best guess of the outcome, while the shaded areas represent the area between the 5th and 95th percentiles (i.e. in 9 out of 10 simulations, the outcome was within the shaded area):
Now we can perform a “sanity check” by comparing the red curve to existing data from other studies that use Sunitinib, with a similar distribution of subjects. I’ll just overlay one image on the other:
We can see that the red curve conforms reasonably to previous data.
Another “validation” is to look at figure 2 and judge whether the effect curves are reasonable. For example, is it reasonable, given previous information, that the effect curve of Ilixadencel is close to zero the first 18 months, and decreases the risk mainly after that? You could say yes, there are things that suggest that this is expected when a treatment relies on the activation of the immune system.
Update (2020-02-12): One thing that can seem curious in figure 2 is that the effect curves are “bumpy”, which is due to the same effect as is present in K-M curves, i.e. that individual events affect the model (even though it is more robust to this than K-M). To minimize the effect of this, I have rerun all simulations with a “smoothing penalty” that forces removal of the bumps (and which corresponds to our preconception that hazard shouldn’t change suddenly from one month to another). The results are very similar to the results presented below and are therefore not included in the post.
In figure 5 we can also read the estimated median survival. The group treated with Sunitinib is expected to have between 21 and 29 months median survival in 90% of the cases, across repeated hypothetical Phase III studies. The “best guess” of median survival for the Ilixadencel group exceeds 48 months and is potentially unbounded (i.e. at least half of the subjects survive until the end of the study). We can also see that the lower 5% limit of median survival in the Ilixadencel group is about 24 months, which is a result of the combination of:
- That we have included a lot of uncertainty about how much effect Ilixadencel has
- That the median survival for the group (without Ilixadencel) is very close to the time where Ilixadencel treatment has a clear effect
More interesting, based on the data we have, is actually the difference in long-term survival, e.g. after 48 months, where we can see a very strong effect of Ilixadencel according to the best guess, and a strong effect even when including our uncertainty about the estimates.
It is in the nature of the median survival statistic that it is very sensitive to exactly what level a plateau in the survival curve occurs at, and when it occurs. The measure is more suitable for groups where the long-term survival is much lower than 50%.
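A small example shows how fragile the median becomes near a plateau. The two curves below are invented for illustration: both decline for 30 months and then flatten, one just above 50% and one just below. The helper names are hypothetical.

```python
def median_survival(curve):
    """First time at which survival is at or below 50%, or None if never reached.

    curve: list of (time, survival) pairs in increasing time order
    """
    for time, survival in curve:
        if survival <= 0.5:
            return time
    return None  # median not reached within follow-up


def plateau_curve(level, decline_months=30, total_months=48):
    """Toy curve: linear decline to `level`, then a flat plateau."""
    curve = []
    for month in range(total_months + 1):
        fraction = min(month, decline_months) / decline_months
        curve.append((month, 1.0 - (1.0 - level) * fraction))
    return curve


just_above = plateau_curve(0.51)  # plateau at 51% survival
just_below = plateau_curve(0.49)  # plateau at 49% survival

median_above = median_survival(just_above)
median_below = median_survival(just_below)
```

The two curves differ by only two percentage points of long-term survival, yet one has a median of 30 months and the other has no median within the 48 months at all. The 48 month survival itself barely changes between the two.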
The simulated Phase III results and corresponding hazard curves are really the most interesting parts when judging the effect of Ilixadencel from available information, but it can also be enlightening to review simulations of completed Phase II-studies, given current survival status, to get estimates of the expected median survival effect, since this is presumed to be important for treatment approval.
Here I simulate the completion of the Phase II study 100 times, given current survival data, including our uncertainty about the effect due to limited data:
This image makes it very clear why there is so much back-and-forth about the median survival: both groups have a plateau at exactly the level of median survival, making the uncertainty intervals very large. Note, however, the strong effect on 48 month survival per the following:
- Best guess (dark lines): About 26% more of the total group survive > 4 y
- If the group is “unlucky” due to random deaths, or because the effect curve is in the “bad” part of the interval: about 10% more of the total group survive > 4 y
Here is an image showing the 5th to 95th percentile interval and the median of the simulated median survival (with a ceiling of 48 months; the true maximum is potentially unbounded).
Correction (2020-02-12, 10:15 CET, 12:15 CET):
The median survival below was incorrectly calculated due to a rounding error, which gave a somewhat higher median survival for both groups. This is now corrected, and the study has been repeated 1000 times to reduce the variability in the measures. The result is that the lower 5% limit for the Ilixadencel group is lowered from ~36 mo to ~35 mo, while the upper 95% limit of the Sunitinib group is lowered from ~33 mo to ~29 mo. The separation is thus larger than before. Note that the dropouts are here regarded as right-censored and are included in the simulation and the resulting statistic. If both dropouts were regarded as deceased, the lower limit of the interval for the Ilixadencel group would be reduced by ~1 mo, while if both dropouts were excluded from the calculations, the interval would remain the same as shown below.
It is clear that a median survival effect as outcome of the Phase II study is very probable, including random events and the uncertainty that we have about the effect of Ilixadencel given current data. Even more relevant, however, is the dramatic effect on long-term survival that seems to exist, and which should be clear from this analysis.
Moreover, it should be apparent from the graphs that updates in e.g. 6 months won’t provide much additional information relative to what we already have (which is, however, sufficient to statistically judge the existence of an effect). You should expect that at least 12-18 additional months of follow-up are needed to make the picture substantially clearer than what is presented in the above analysis. However, the conclusion of this analysis is that we can already see the effect of Ilixadencel with strong confidence, so that some conclusions about the eventual success of the Phase II study can already be drawn.
Addendum: It is also worth noting that the above analysis does not rely on the positive information about “Complete Response” (CR) that Immunicum have presented, since information about the exact time intervals of response was not available and thus cannot be easily integrated into the model. The available information on the proportion of complete responses should therefore be seen as an additional contribution to the confidence in the underlying model presented here.