Reliability, Availability, and Serviceability (RAS) Extensions
**************************************************************

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual and `RAS Supplement`_. The rest of this document assumes
familiarity with the architecture and terminology.

**IMPORTANT NOTE**: The TF-A implementation assumes that if the RAS extension is
present, then FEAT_IESB is also implemented.

There are two philosophies for handling RAS errors from the Non-secure world's
point of view:

- :ref:`Firmware First Handling (FFH)`
- :ref:`Kernel First Handling (KFH)`

.. _Firmware First Handling (FFH):

Firmware First Handling (FFH)
=============================

Introduction
------------

External Aborts (EAs) and error interrupts corresponding to NS nodes are handled
first in firmware:

-  Errors are signaled back to the NS world via a suitable mechanism.
-  The kernel is prohibited from accessing the RAS error records directly.
-  Firmware creates CPER records for the kernel to navigate and process.
-  Firmware signals the error back to the kernel via SDEI.

Overview
--------

FFH works in conjunction with the `Exception Handling Framework`. Exceptions
resulting from errors in the Non-secure world are routed to and handled in EL3.
These errors are Synchronous External Aborts (SEA), Asynchronous External Aborts
(signalled as SErrors), and Fault Handling and Error Recovery interrupts.

The RAS framework in TF-A allows the platform to define an external abort handler
and to register RAS nodes and interrupts. It also provides `helpers`__ for
accessing Standard Error Records as introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

.. _Kernel First Handling (KFH):

Kernel First Handling (KFH)
===========================

Introduction
------------

EAs originating from, or attributed to, the NS world are handled first in the NS
world, and the kernel navigates the standard error records directly.

-  KFH is the default handling mode if the platform does not explicitly enable
   FFH mode.
-  KFH mode does not need any EL3 involvement except for the reflection of errors
   back to the lower EL. This happens when there is an error (EA) in the system
   which has not yet been signaled to the PE while executing at the lower EL.
   During entry into EL3 the errors (EAs) are synchronized, causing the async EA
   to pend at EL3.

Error Synchronization at EL3 entry
==================================

During entry to EL3 from a lower EL, any pending async EAs are either reflected
back to the lower EL (KFH) or handled in EL3 itself (FFH).

|Image 1|

TF-A build options
==================

- **ENABLE_FEAT_RAS**: Enable the RAS extension feature at EL3.
- **HANDLE_EA_EL3_FIRST_NS**: Required for FFH.
- **RAS_TRAP_NS_ERR_REC_ACCESS**: Trap Non-secure access of RAS error record registers.
- **RAS_EXTENSION**: Deprecated macro, equivalent to ENABLE_FEAT_RAS and
  HANDLE_EA_EL3_FIRST_NS put together.

RAS internal macros:

- **FFH_SUPPORT**: Gets enabled if **HANDLE_EA_EL3_FIRST_NS** is enabled.

The RAS feature depends on some other TF-A build flags:

- **EL3_EXCEPTION_HANDLING**: Required for FFH.
- **FAULT_INJECTION_SUPPORT**: Required for testing the RAS feature on the FVP platform.

TF-A Tests
==========

RAS functionality is regularly tested in TF-A CI using the `RAS test group`_,
which has multiple configurations for testing lower EL External Aborts.

All the tests are written in TF-A Tests, which runs as the NS-EL2 payload.

- **FFH without RAS extension**

  *fvp-ea-ffh,fvp-ea-ffh:fvp-tftf-fip.tftf-aemv8a-debug*

  A couple of tests, one each for sync EA and async EA from a lower EL, which
  get handled in EL3. The tests inject External Aborts (sync/async) which trap
  to EL3; FVP has a handler which gracefully handles these errors and returns
  to TF-A Tests.

  Build Configs: **HANDLE_EA_EL3_FIRST_NS**, **PLATFORM_TEST_EA_FFH**

- **FFH with RAS extension**

  Three tests:

  - *fvp-ras-ffh,fvp-single-fault:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject an unrecoverable RAS error, which gets handled in EL3.

  - *fvp-ras-ffh,fvp-uncontainable:fvp-tftf.fault-fip.tftf-aemv8a.fi-debug*

    Inject uncontainable RAS errors, which cause the platform to panic.

  - *fvp-ras-ffh,fvp-ras-ffh-nested:fvp-tftf-fip.tftf-ras_ffh_nested-aemv8a.fi-debug*

    Test nested exception handling at EL3 for synchronized async EAs. Inject an
    SError in a lower EL, which remains pending until EL3 is entered through an
    SMC call. On encountering a pending async EA at EL3 entry, EL3 handles the
    async EA first (nested exception) before handling the original SMC call.

- **KFH with RAS extension**

  A couple of tests in the group:

  - *fvp-ras-kfh,fvp-ras-kfh:fvp-tftf-fip.tftf-aemv8a.fi-debug*

    Inject and handle RAS errors in TF-A Tests (no EL3 involvement).

  - *fvp-ras-kfh,fvp-ras-kfh-reflect:fvp-tftf-fip.tftf-ras_kfh_reflection-aemv8a.fi-debug*

    Reflection of synchronized errors from EL3 to TF-A Tests; two tests, one
    each for reflection in the IRQ and SMC paths.

RAS Framework
=============


.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution. Please
refer to :ref:`RAS Porting Guide <External Abort handling and RAS Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

-  A handler to probe error records for errors;
-  When the probing identifies an error, a handler to handle it;
-  For a memory-mapped error record, its base address and size in KB; for a
   system register-accessed record, the start index of the record and the
   number of continuous records from that index;
-  Any node-specific auxiliary data.

With this information supplied, when the runtime firmware receives a
notification through one of these mechanisms, the RAS framework can iterate
through and probe error records for errors, and invoke the appropriate handler
to handle them.
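
The iterate, probe, and handle flow described above can be pictured with the
following sketch. It is not the framework's code: ``sketch_record``,
``sketch_dispatch`` and the ``demo_*`` helpers are hypothetical names introduced
for illustration, and the handler signature is reduced to two arguments for
brevity.

```c
#include <stddef.h>

struct sketch_record;

/* Hypothetical, simplified callback types (not the framework's typedefs). */
typedef int (*sketch_probe_t)(const struct sketch_record *rec, int *probe_data);
typedef int (*sketch_handler_t)(const struct sketch_record *rec, int probe_data);

/* Hypothetical stand-in for per-record information supplied by a platform. */
struct sketch_record {
    sketch_probe_t probe;     /* returns non-zero when an error is present */
    sketch_handler_t handler; /* called when the probe reports an error */
    void *aux;                /* node-specific auxiliary data */
};

/*
 * Walk every record, probe it, and call the handler of each record whose
 * probe reported an error. Returns the number of errors handled.
 */
static int sketch_dispatch(const struct sketch_record *records, size_t count)
{
    int handled = 0;

    for (size_t i = 0; i < count; i++) {
        const struct sketch_record *rec = &records[i];
        int probe_data = 0;

        if (rec->probe(rec, &probe_data) == 0)
            continue; /* nothing signalled by this record */

        rec->handler(rec, probe_data);
        handled++;
    }
    return handled;
}

/* Demo probe: treats aux as a pointer to an int that is non-zero on error. */
static int demo_probe(const struct sketch_record *rec, int *probe_data)
{
    const int *pending = rec->aux;

    if (*pending == 0)
        return 0;
    *probe_data = *pending; /* pass the value on to the handler */
    return 1;
}

static int demo_last_probe_data;

/* Demo handler: remembers what the probe passed along. */
static int demo_handler(const struct sketch_record *rec, int probe_data)
{
    (void)rec;
    demo_last_probe_data = probe_data;
    return 0;
}
```

With two records of which only the second has a pending error,
``sketch_dispatch()`` calls one handler and skips the clean record.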

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. Each
macro creates a structure of type ``struct err_record_info`` from its arguments;
the structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
               int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.
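
As an illustration of the two prototypes, here is a minimal sketch of a
platform-specific probe and handler pair. The structure layouts are simplified
stand-ins so that the example compiles on its own; the field names of
``struct err_handler_data``, the ``demo_node`` type, and the convention that the
handler returns 0 on success are assumptions, not the framework's definitions.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in; the framework's structure carries more fields. */
struct err_record_info {
    void *aux; /* node-specific auxiliary data */
};

/* Simplified stand-in; field names are assumptions for illustration. */
struct err_handler_data {
    unsigned int ea_reason;
    uint32_t syndrome;
    uint64_t flags;
    void *cookie;
    void *handle;
};

/* Prototypes as given in this document. */
typedef int (*err_record_probe_t)(const struct err_record_info *info,
                int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
           int probe_data, const struct err_handler_data *const data);

/* Hypothetical per-node state reached through aux. */
struct demo_node {
    int faulty_record;        /* index of the record in error, or -1 */
    unsigned int errors_seen; /* how many errors the handler processed */
};

/* Probe: report an error and pass the record index on via probe_data. */
static int demo_probe(const struct err_record_info *info, int *probe_data)
{
    const struct demo_node *node = info->aux;

    if (node->faulty_record < 0)
        return 0;
    *probe_data = node->faulty_record;
    return 1;
}

/* Handler: consume the index found by the probe and clear the error. */
static int demo_handler(const struct err_record_info *info, int probe_data,
                        const struct err_handler_data *const data)
{
    struct demo_node *node = info->aux;

    (void)data; /* a real handler would inspect the reason and syndrome */
    if (probe_data != node->faulty_record)
        return -1;
    node->errors_seen++;
    node->faulty_record = -1;
    return 0;
}

/* Compile-time check that both functions match the documented prototypes. */
static const err_record_probe_t demo_probe_check = demo_probe;
static const err_record_handler_t demo_handler_check = demo_handler;
```

Passing the record index through ``probe_data`` saves the handler from
repeating the search the probe has already done.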

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.
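
A sketch of what such a registration could look like follows. To keep it
self-contained, the macros and structures are redefined locally as simplified
stand-ins: only the macro names and argument lists come from this document,
while the field layout, ``struct err_record_mapping``, and the ``demo_*`` names
are assumptions. A real platform would use the framework's own definitions
instead.

```c
#include <stddef.h>

struct err_record_info;
struct err_handler_data;

typedef int (*err_record_probe_t)(const struct err_record_info *info,
                int *probe_data);
typedef int (*err_record_handler_t)(const struct err_record_info *info,
           int probe_data, const struct err_handler_data *const data);

/* Simplified stand-in for the record descriptor (layout is an assumption). */
struct err_record_info {
    unsigned int idx_start; /* first system register record index */
    unsigned int num_idx;   /* number of continuous records */
    err_record_probe_t probe;
    err_record_handler_t handler;
    void *aux;
};

/* Local stand-in with the argument list shown in this document. */
#define ERR_RECORD_SYSREG_V1(_idx_start, _num_idx, _probe, _handler, _aux) \
    { (_idx_start), (_num_idx), (_probe), (_handler), (_aux) }

/* Hypothetical container that the registration macro fills in. */
struct err_record_mapping {
    struct err_record_info *records;
    size_t num_records;
};

/*
 * Local stand-in: sizeof only works on the array itself, which is one
 * plausible reason the macro has to sit in the file defining the array.
 */
#define REGISTER_ERR_RECORD_INFO(_records) \
    const struct err_record_mapping err_record_mappings = { \
        (_records), sizeof(_records) / sizeof((_records)[0]) }

static int demo_probe(const struct err_record_info *info, int *probe_data)
{
    (void)info;
    (void)probe_data;
    return 0; /* placeholder: never reports an error */
}

static int demo_handler(const struct err_record_info *info, int probe_data,
                        const struct err_handler_data *const data)
{
    (void)info;
    (void)probe_data;
    (void)data;
    return 0;
}

/* Two groups of system register records: indices 0-1 and index 2. */
static struct err_record_info demo_err_records[] = {
    ERR_RECORD_SYSREG_V1(0, 2, demo_probe, demo_handler, NULL),
    ERR_RECORD_SYSREG_V1(2, 1, demo_probe, demo_handler, NULL),
};

REGISTER_ERR_RECORD_INFO(demo_err_records);
```

For records in the Standard Error Record format, the placeholder
``demo_probe`` could be replaced by one of the framework's probe helpers.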

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of the platform rolling
out its own. Both helpers above:

-  Return a non-zero value when an error is detected in a Standard Error Record;
-  Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt`` containing:

-  Interrupt number;
-  The associated error record information (pointer to the corresponding
   ``struct err_record_info``);
-  Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.
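
The sorting requirement is what makes a binary search possible. The sketch
below shows a sorted table and a lookup over it. The field names, the interrupt
numbers, and ``sketch_find_interrupt()`` are hypothetical and chosen only for
illustration; they are not the framework's definitions.

```c
#include <stddef.h>

/* Opaque stand-in for the error record information. */
struct err_record_info {
    int id;
};

/* Simplified stand-in holding the three items listed above. */
struct ras_interrupt {
    unsigned int intr_number;
    struct err_record_info *err_record;
    void *cookie; /* optional */
};

static struct err_record_info rec_a = { 1 };
static struct err_record_info rec_b = { 2 };
static struct err_record_info rec_c = { 3 };

/* Must be kept sorted by increasing interrupt number. */
static const struct ras_interrupt demo_ras_interrupts[] = {
    { 40U, &rec_a, NULL },
    { 45U, &rec_b, NULL },
    { 72U, &rec_c, NULL },
};

/*
 * Binary search over a table sorted by intr_number.
 * Returns the matching entry, or NULL if the interrupt is not registered.
 */
static const struct ras_interrupt *
sketch_find_interrupt(const struct ras_interrupt *table, size_t count,
                      unsigned int intr)
{
    size_t lo = 0;
    size_t hi = count; /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].intr_number == intr)
            return &table[mid];
        if (table[mid].intr_number < intr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}
```

A lookup costs a logarithmic number of comparisons, but an unsorted table
would make the search miss valid entries, hence the ordering rule.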

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register. These were introduced to assist EL3
software which runs part of its entry/exit routines with exceptions momentarily
masked. In such systems, External Aborts/SErrors are not handled immediately
when they occur, but only after the exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the whole of EL3 with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect a Double Fault condition in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: handling terminates at the platform double fault
handler, which does not return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``ENABLE_FEAT_RAS``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through the platform-supplied error records, probing them, and, when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.
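
A platform override could be shaped roughly as in the sketch below. The
parameter lists are an assumption modelled on the ``flags``, ``cookie``, and
``handle`` parameters mentioned earlier and should be checked against the
porting guide. ``ras_ea_handler()`` is replaced here by a local stand-in so the
example runs on its own, and the idea that a non-zero return means "handled" is
likewise an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical syndrome value that the stand-in below claims to handle. */
#define DEMO_KNOWN_SYNDROME 0xbe000011ULL

/* Illustration only: counts aborts that nobody claimed. */
static int demo_unhandled;

/*
 * Stand-in for the framework's top-level RAS handler. The real function
 * probes the registered error records; this one only recognises a single
 * syndrome. Non-zero return means the error was handled (assumption).
 */
static int ras_ea_handler(unsigned int ea_reason, uint64_t syndrome,
                          void *cookie, void *handle, uint64_t flags)
{
    (void)ea_reason;
    (void)cookie;
    (void)handle;
    (void)flags;
    return (syndrome == DEMO_KNOWN_SYNDROME) ? 1 : 0;
}

/* Platform override: give the RAS framework the first chance to handle it. */
void plat_ea_handler(unsigned int ea_reason, uint64_t syndrome, void *cookie,
                     void *handle, uint64_t flags)
{
    if (ras_ea_handler(ea_reason, syndrome, cookie, handle, flags) != 0)
        return; /* a registered RAS handler dealt with the error */

    /* Platform-specific fallback; a real platform might report and panic. */
    demo_unhandled++;
}
```

Calling ``ras_ea_handler()`` before any platform-specific fallback keeps the
registered probe and error handlers in the loop even when the default
``plat_ea_handler`` is replaced.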

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``. The RAS framework arranges for it to be invoked
when a RAS interrupt is taken at EL3. The function bisects the platform-supplied
sorted array of interrupts to look up the error record information associated
with the interrupt number. The error handler for that record is then invoked to
handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; but
for non-interrupt exceptions, it is explicit using :ref:`EHF APIs
<Activating and Deactivating priorities>`.
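
For the non-interrupt case, explicit priority management amounts to bracketing
the handling code with an activation and a deactivation of the RAS priority.
The sketch below models that pattern only. The ``ehf_activate_priority()`` and
``ehf_deactivate_priority()`` functions are local stand-ins with assumed
signatures, the value given to ``PLAT_RAS_PRI`` is hypothetical, and the
``demo_*`` names exist purely for illustration.

```c
/* Hypothetical priority value; a platform picks its own. */
#define PLAT_RAS_PRI 0x10U

/* Illustration only: marker for "no priority active". */
#define DEMO_NO_PRI 0xFFU

static unsigned int demo_active_pri = DEMO_NO_PRI;
static unsigned int demo_pri_seen_by_handler = DEMO_NO_PRI;

/* Stand-in: record the priority as active (the real API talks to EHF). */
static void ehf_activate_priority(unsigned int priority)
{
    demo_active_pri = priority;
}

/* Stand-in: drop the priority again if it is the active one. */
static void ehf_deactivate_priority(unsigned int priority)
{
    if (demo_active_pri == priority)
        demo_active_pri = DEMO_NO_PRI;
}

/* Placeholder for the actual error handling work. */
static void demo_handle_error(void)
{
    demo_pri_seen_by_handler = demo_active_pri;
}

/* Non-interrupt RAS exception: priority is managed explicitly. */
static void demo_handle_non_interrupt_ras_exception(void)
{
    ehf_activate_priority(PLAT_RAS_PRI);
    demo_handle_error();
    ehf_deactivate_priority(PLAT_RAS_PRI);
}
```

For interrupts no such bracketing appears in handler code, since the priority
management is implicit.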

--------------

*Copyright (c) 2018-2023, Arm Limited and Contributors. All rights reserved.*

.. _RAS Supplement: https://developer.arm.com/documentation/ddi0587/latest
.. _RAS Test group: https://git.trustedfirmware.org/ci/tf-a-ci-scripts.git/tree/group/tf-l3-boot-tests-ras?h=refs/heads/master

.. |Image 1| image:: ../resources/diagrams/bl31-exception-entry-error-synchronization.png