The Wicked Problem of Program Evaluation
Why learning from federal energy programs is harder than it looks
“In complex systems, cause and effect are not closely related in time and space.” — Peter Senge
Program evaluation is a foundational part of good governance. It’s how organizations assess impact, learn from experience, and improve future efforts. At the Department of Energy, evaluation is taken seriously: programs are designed with multi-year goals, results are reported to Congress through the Government Performance and Results Act (GPRA) processes, and there is genuine cultural commitment to learning and doing better over time.
And yet, in practice, evaluation in energy programs often struggles to deliver on its promise.
At least part of the reason is that DOE energy programs operate in what psychologist Robin Hogarth calls a “wicked learning environment”—one in which learning from experience is unreliable, feedback is delayed or misleading, and outcomes are difficult to attribute to skill rather than luck. David Epstein later popularized the concept in Range: Why Generalists Triumph in a Specialized World, contrasting fields where expertise develops reliably with those where it doesn’t.
Understanding that wickedness—and its implications—is essential if we want evaluation to meaningfully inform energy policy and investment, and to avoid wasting effort on evaluation when it cannot.
Kind vs. Wicked
Hogarth distinguishes between “kind” and “wicked” learning environments.
In kind environments, rules are stable, feedback is timely and accurate, and cause-and-effect relationships are relatively clear. Chess is a classic example—you know almost immediately whether your move worked. So is tennis. Learning is straightforward: observe what worked, adjust, and improve.
Wicked learning environments are the opposite. They are characterized by:
Rules that are unclear or changing. There may be no rules at all, or they may change without notice.
Data that is messy. Information is often incomplete, inconsistent, or misleading.
Feedback that is delayed or inaccurate. It’s difficult to connect an action to its consequence.
Evolving contexts. What worked in the past may not work in the future due to volatile, game-changing events.
Difficulty distinguishing skill from luck. Outcomes can be misleading, and it’s hard to tell if success is due to ability or chance.
The stock market is a classic wicked environment. So is politics. And, as it turns out, so is federal energy program management.
Why Energy Programs Are Wicked
To set the stage, this discussion focuses on applied research where the expectation is market impact, not scientific research where the explicit goal is to increase knowledge in a domain. Consider what it takes to develop and commercialize hard technology. A typical DOE award lasts 2-4 years for earlier-stage work, 5-10 years for later-stage demonstrations. Most technologies need three or four projects across different stages to really reach the market. From initial research to commercial deployment, we’re often talking about 15-20 years.
And the world does not stand still. Over those same time horizons:
Political administrations turn over
Policy environments shift
Competing technologies advance
Markets evolve in unpredictable ways
International competitors move quickly
A cost target set in 2010 may be meaningless by 2025, either because the market moved faster than expected or because the competitive landscape changed entirely. The assumptions that shaped program design may no longer hold by the time outcomes are visible.
This creates a fundamental problem: the feedback loops are too slow. By the time you know whether a program actually worked—whether it moved the needle on commercialization, market adoption, or system transformation—the people who designed it are long gone. The context that shaped the original decisions has changed. And the lessons, even if someone bothers to extract them, may not apply to current conditions.
This is the essence of wickedness: the learning signal is delayed, distorted, missed, irrelevant, or all of the above.
The Culture of Evaluation at DOE
DOE frequently talks internally about the importance of evaluation and impact. The Department has multi-year goals reported to Congress through GPRA. There is genuine commitment to learning and improving.
But this commitment runs into two challenges. First, it consistently takes a back seat to more immediate pressures: launching programs, obligating funds, overseeing projects, responding to near-term oversight demands. Staff resources are always stretched thin. The urgent crowds out the important. And evaluation work—the slow, careful analysis of what actually happened and why—never quite makes it to the top of the priority list. Second, GPRA goals tend to be high-level, such as reducing the installed cost ($/watt) of solar energy. While this kind of goal can help track an industry’s evolution, it is too coarse to assess the impact of DOE-funded technologies.
This creates predictable problems.
Staff and leadership want wins. So they focus on metrics that are easily tracked and show leading indicators of engagement rather than metrics that would actually demonstrate system impact. The easy metrics include things like number of applications submitted, competitiveness ratios, cost share amounts, and technology performance against interim targets. These are useful operational indicators, particularly for early R&D, but they’re not the same as commercial impact.
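To make the gap concrete, here is a minimal sketch in Python. The record structure, field names, and numbers are hypothetical, not drawn from any DOE system; the point is simply that the operational indicators are computable the day selections are made, while the impact-oriented fields stay empty unless someone keeps collecting data after the award closes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AwardRecord:
    # Known at selection or award time (easy to track)
    applications_received: int
    awards_made: int
    cost_share_fraction: float         # performer share of total project cost
    interim_target_met: bool           # e.g., a technical milestone
    # Only knowable years later, and only if someone collects it
    follow_on_investment_usd: Optional[float] = None
    commercial_deals_signed: Optional[int] = None


def operational_indicators(r: AwardRecord) -> dict:
    """Leading indicators of engagement: cheap and immediate, but not impact."""
    return {
        "competitiveness_ratio": r.applications_received / r.awards_made,
        "cost_share_fraction": r.cost_share_fraction,
        "interim_target_met": r.interim_target_met,
    }


def impact_indicators(r: AwardRecord) -> dict:
    """Impact-oriented indicators: undefined unless post-award data exists."""
    return {
        "follow_on_investment_usd": r.follow_on_investment_usd,
        "commercial_deals_signed": r.commercial_deals_signed,
    }


record = AwardRecord(applications_received=120, awards_made=15,
                     cost_share_fraction=0.20, interim_target_met=True)
print(operational_indicators(record))  # fully populated on day one
print(impact_indicators(record))       # all None: nothing was collected post-award
```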
Impact data isn’t defined up front—and collection ends when projects do. The data required to assess long-term impact is often not formally included in project reporting requirements. When it is included, the burden on performers is high enough that compliance is spotty. You can’t evaluate what you didn’t measure.
But even when the right data is defined, there’s a structural problem: for most projects, reporting requirements end when the project does. Typical DOE contract structures haven’t included extended data collection, either by DOE staff or by a third party. The award closes out, the performer moves on, and the thread is lost. By the time the outcomes that actually matter start to materialize—commercial traction, follow-on investment, market adoption, workforce effects—there’s no mechanism in place to capture them.
This is an opportunity for Other Transactions* (OTs) and other flexible contracting mechanisms. Unlike traditional grants and cooperative agreements, OTs can be structured to include post-project data collection requirements, third-party evaluation relationships, or other provisions that extend visibility beyond the award period. The metrics that would actually matter—relative competitiveness against incumbent solutions, follow-on investment and commercial deals, expansion of organizational capacity, workforce development and spillover effects—require this kind of extended visibility. They unfold over years, not months. Without contractual structures that maintain the relationship past project closeout, they mostly don’t get tracked.
*OTs as a funding tool will be discussed in a future post
Staffing changeovers break continuity. Over the duration of a multi-year program, staff turn over. New people inherit projects without full context on the core program goal and original project intent. It becomes easy to rigidly follow the written plan even when the underlying logic would allow—or demand—that objectives shift.
Results from previously funded projects frequently are not housed in adequate data repositories. Instead, they get passed along through peer reviews or through the accumulated lessons of staff who have been around long enough to see several cycles, which makes institutional memory all the more sensitive to staffing changes.
Project outputs get checked; program outcomes don’t. There’s a clear process for verifying whether a project delivered its technical milestones. There’s much less infrastructure for tracking whether the portfolio of projects actually achieved the program’s intended market or system impact. DOE has historically funded some third-party retrospectives, but the data is fragmented and incomplete, limiting the conclusions that can be drawn. And outcome-based evaluation in the traditional sense calls for a counterfactual, which can be nearly impossible to construct: data privacy protections limit what can be learned about applicants that did not receive funding, and a comparable private-sector project that never received federal funds is rarely easy to find. (A brief sketch at the end of this section shows what such a comparison would require.)
Feedback loops are too slow to inform current decisions. Even when evaluation does happen, the signals arrive too late to meaningfully influence program design within a single administration. You’re often flying at least partially blind, making decisions based on incomplete data from programs that operated in different contexts.
This challenge became particularly visible during the implementation of major, time-sensitive efforts like those funded under the Bipartisan Infrastructure Law (BIL). Extensive review and approval processes, overlapping equities, and limited internal capacity slowed execution. Meanwhile, external conditions continued to change—sometimes faster than programs could respond. For these programs, like many prior DOE efforts, evaluation frameworks often assumed a kind learning environment where none existed.
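To make the counterfactual point above concrete, here is a minimal difference-in-differences sketch in Python. The numbers and the outcome measure are made up purely for illustration; the point is that the estimate cannot be computed without outcome data for comparable unfunded projects, which is precisely the data that is usually out of reach.

```python
# Minimal difference-in-differences sketch with made-up numbers.
# "Outcome" could be any measurable result (e.g., private capital raised);
# the estimate is impossible without the comparison group.

funded = {"before": 2.0, "after": 9.0}      # mean outcome, funded projects
comparison = {"before": 2.5, "after": 6.0}  # mean outcome, similar unfunded projects


def diff_in_diff(treated: dict, control: dict) -> float:
    """Change in the treated group minus change in the control group."""
    return (treated["after"] - treated["before"]) - (control["after"] - control["before"])


effect = diff_in_diff(funded, comparison)
print(f"Estimated program effect: {effect:+.1f}")  # +3.5 in this toy example

# Without the comparison group (unfunded applicants, private-sector analogs),
# only the funded group's raw change (+7.0) is observable, and that number
# conflates program effect with market shifts, policy changes, and luck.
```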
The Attribution Problem
Beyond the timing issues, there’s a deeper problem: attribution. When programs fail to achieve their desired market impact, the failures get blown out of proportion and attributed to poor decision-making or technology failure—not to market shifts, policy changes, or system-level dynamics that were outside anyone’s control.
This creates perverse incentives. Program managers may seek to avoid anything that might fail visibly, even if taking that risk would have been the right call given the information available at the time. Similarly, they may avoid drawing attention to ambivalent or negative results, even if they could meaningfully inform future decisions.
Project successes are often similarly misattributed or incompletely considered. Offices may be encouraged to claim credit for successes that had little to do with their decisions, or that were driven by multiple activities across multiple offices and entities. The role of luck and of favorable shifts in policy, market conditions, or the broader landscape is discounted. The feedback, such as it is, reinforces the wrong lessons.
Both situations disincentivize program evaluation: positive outcomes are perceived to be self-apparent and easy to tell a story around, while more detailed assessments of opportunities to improve invite criticism and negative conclusions. This turns end-of-award evaluation into a clearance exercise for execution and compliance—checking boxes—rather than an analytical one. It’s a stark contrast to the rigor applied during continuation reviews, where real scrutiny happens. Impact becomes a “who cares” issue: not because no one values it, but because no one is incentivized to own it at the end. This is compounded by the limited bandwidth of DOE staff and their need to move on to the next program or project to ensure the work the office is measured against gets done.
The Trust Connection
This connects to themes we’ve explored in prior posts. In our discussion of trust in federal energy programs, we noted that without strong relationships between program staff and performers, it’s difficult to weather the uncertainty inherent in long-duration technology development. And in our post on systems approaches to risk, we explored how DOE manages compliance and execution risk well but often incompletely addresses impact risk—the risk that programs fail to change the energy system in meaningful ways on a timeframe that matters.
The wicked learning environment compounds both problems. “Trust but verify” is the standard mantra. But verify what? Original targets may no longer be relevant. Technical performance requirements may have shifted with the market. The goalposts move, and without trusted relationships that allow for honest conversation about evolving value, everyone defaults to checking boxes against outdated plans.
The hard conversations—“this project isn’t working but we learned something important,” or “the market moved and we should pivot”—require trust to be productive. In its absence, performers tell program managers what they want to hear, program managers report upward what leadership wants to hear, and everyone loses the chance to actually learn.
What To Do About It
The practical question isn’t whether evaluation matters in the absolute—it does. The question is whether additional evaluation effort will actually change decisions in the timeframe that matters. Sometimes the honest answer is no. In those cases, the better investment is focusing on program design up front and then staying the course.
If you care about impacts in the next five years, you probably aren’t going to get new data that allows you to meaningfully change your approach in time.
The feedback loops are simply too slow. In that case, the right investment is in collaborative program design at the outset—bringing in diverse perspectives, pressure-testing assumptions, and making the best decisions you can with available information. This is why we’ve emphasized program design as a craft learned through apprenticeship—the upfront work matters precisely because mid-course correction based on impact data is often not possible.
In these cases, program evaluation still has a place, but it’s more to build a body of knowledge around technology and market progressions than to inform near-term decisions. The level of effort and burden should be set accordingly.
If you care about impacts on the decadal scale, different approaches are needed:
Establish an independent evaluation function. Program evaluation will always take a back seat to time-sensitive operational work. Existing staff will never find time to prioritize it. An independent group—within DOE or external to it—with dedicated responsibility for tracking long-term impacts is the only way to sustain attention on this work. The retrospectives this group produces can gradually adjust DOE and other government programs over time, even if they can’t inform any single administration’s decisions.
Establish different targets for different stages. Cost and performance targets are useful for early-stage work, where you’re trying to establish potential for competitiveness and demonstrate forward progress. They’re less useful for later-stage commercialization, where the cost-value proposition in real markets is what matters. Measure alignment with industry needs in the short to mid term—deals, follow-on financing, project announcements. Measure relative value, not just cost. Focus on system value, not just asset value. (One way to make this staging explicit is sketched after these recommendations.)
Build data collection into the relationship. The independent evaluation group needs a formalized relationship with project performers that gives them the ability to get the data they need over extended timeframes. Make this a condition of government support. Use the flexibility available in contracting mechanisms (like Other Transactions) to build in these requirements without excessive burden. And ensure the evaluation group has a formalized relationship with DOE offices to inform how program objectives are defined and what data gets collected.
Accept that some learning will be retrospective. The best we can do for many programs is enable good retrospective analysis that benefits future programs, even if it can’t benefit the current one. That’s still valuable—but only if someone is actually doing the retrospectives and the institution has mechanisms to absorb the lessons.
Communicate explicitly. Be clear from the start about how evaluation will be conducted, how it will be used, and who is accountable for carrying it through. Ambiguity about ownership is how evaluation becomes performative—generating activity without insight.
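As a small illustration of the staged-targets recommendation above, the mapping from program stage to metrics can simply be written down and treated as program-design configuration. The sketch below is hypothetical; the stage names and metric lists are placeholders, not a DOE taxonomy.

```python
# Hypothetical mapping from program stage to the metrics that matter at that stage.
STAGE_METRICS = {
    "early_rd": [
        "cost_target_progress",       # $/unit against interim targets
        "technical_performance",      # bench or pilot results vs. milestones
    ],
    "demonstration": [
        "performance_at_scale",
        "relative_competitiveness",   # vs. incumbent solutions in real conditions
    ],
    "commercialization": [
        "follow_on_investment",       # deals, financing, project announcements
        "system_value",               # value delivered, not just asset cost
        "workforce_and_capacity",     # organizational growth and spillovers
    ],
}


def metrics_for(stage: str) -> list[str]:
    """Return the metrics a program at the given stage should be evaluated against."""
    return STAGE_METRICS.get(stage, [])


print(metrics_for("commercialization"))
```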
The Honest Version
Here’s the uncomfortable truth: for most federal energy programs, we will never know with confidence whether they “worked” as well as they could have. The time horizons are too long, the confounding variables too numerous, the counterfactuals too unknowable. We can track outputs. We can measure some outcomes. But definitive, complete attribution of system-level impact to specific program decisions? That’s probably beyond reach.
This doesn’t mean evaluation is pointless. It means we should be realistic about what evaluation can and cannot deliver. It means we should invest in program design—in getting the logic right up front, in building trusted relationships that allow for adaptation, in constructing portfolios that can succeed across a range of scenarios. And it means we should be humble about claiming credit or assigning blame after the fact.
The learning environment is wicked. We can make it somewhat less wicked through better processes and sustained attention. But we can’t make it kind. The best we can do is acknowledge the wickedness and design our institutions and learning systems accordingly.
Innovation Waypoints is brought to you by Waypoint Strategy Group.