execution and comparison. Depending on which objects are replicated, transient error detection techniques during execution can be divided into instruction-level, thread-level, and application-level fault tolerance.
EDDI (Oh et al., 2002) and SWIFT (Reis et al., 2005) are typical representatives of instruction-level fault tolerance: they duplicate the instructions of the original program at compile time and insert comparison instructions at appropriate locations to detect errors. Thread-level fault tolerance methods, such as AR-SMT (Rotenberg, 1999), SRT (Reinhardt et al., 2000), and CRT (Mukherjee et al., 2002), use two or more hardware threads or cores to execute the same task; a dedicated buffer is added to the processor to store the execution results of the redundant threads, and errors are detected by comparing these results. Application-level fault tolerance methods, such as PLR (Shye et al., 2009), perform replication and comparison at a higher software level, for example by copying a process into multiple redundant processes that execute concurrently and then comparing the program outputs.
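To make the duplicate-and-compare principle behind these techniques concrete, the following is a minimal C++ sketch that runs the same computation in two redundant threads and compares the results. The checksum function, the thread count, and the use of std::thread are illustrative assumptions; the sketch does not reproduce the compile-time instrumentation of EDDI/SWIFT or the process replication of PLR.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>

// Hypothetical computation whose result we want to protect.
static uint64_t checksum(const uint32_t* data, std::size_t n) {
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}

int main() {
    uint32_t data[4] = {1, 2, 3, 4};
    uint64_t r1 = 0, r2 = 0;

    // Redundant execution: two threads run the same computation on the same input.
    std::thread t1([&] { r1 = checksum(data, 4); });
    std::thread t2([&] { r2 = checksum(data, 4); });
    t1.join();
    t2.join();

    // Comparison step: a mismatch signals a transient error in one copy.
    if (r1 != r2) {
        std::cerr << "transient error detected: results differ\n";
        return 1;
    }
    std::cout << "results match: " << r1 << "\n";
    return 0;
}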
Permanent error detection techniques can be divided into two categories. One is hardware module fault detection at the micro-architecture level, which is often used in the design of reconfigurable processors. The other is the detection of node faults in high performance systems; since the MTBF of such systems decreases sharply as they scale up, node faults are common.
However, repeated execution may cost too much time, so it is not adopted by high performance computing applications. High performance computing systems mainly screen out faulty nodes in advance by running screening-point programs, but such screening programs cannot cover all application scenarios, and errors that occur during application execution cannot be detected in this way.
  Error  recovery  techniques  can  be  divided  into 
two categories: forward error recovery and backward 
error recovery.   
Forward error recovery tries to correct the error after it is detected and to continue execution without rolling back to the state before the error occurred. Redundancy is the basic means of realizing forward error recovery. Triple Modular Redundancy (TMR) is a widely used FER technique: three modules perform the same operation, and a majority voter at the output selects the result to achieve fault tolerance. However, this method requires three times the computing resources and its overhead is large, so it is generally not used in high performance systems.
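As an illustration of the voting step, the sketch below assumes three redundant copies of a result and a simple majority voter; the compute function and the fallback behaviour when all three copies disagree are illustrative assumptions, not taken from any particular TMR implementation.

#include <cstdint>
#include <iostream>

// Majority voter over three redundant results: if at least two copies agree,
// their value is taken as the correct output.
static int64_t vote(int64_t a, int64_t b, int64_t c, bool& agreed) {
    agreed = true;
    if (a == b || a == c) return a;
    if (b == c) return b;
    agreed = false;          // all three disagree: the fault cannot be masked
    return a;
}

// Hypothetical computation executed by each redundant module.
static int64_t compute(int64_t x) { return x * x + 1; }

int main() {
    // Three modules perform the same operation on the same input.
    int64_t r1 = compute(7), r2 = compute(7), r3 = compute(7);

    bool agreed = false;
    int64_t result = vote(r1, r2, r3, agreed);
    std::cout << (agreed ? "voted result: " : "no majority, result unreliable: ")
              << result << "\n";
    return 0;
}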
Backward error recovery returns the system to a state saved before the error occurred once an error is detected. The most widely used backward error recovery method is checkpointing. According to the content stored in the checkpoint, checkpoint techniques can be divided into system-level and application-level checkpointing (Bronevetsky et al., 2004; Faisal et al., 2018). According to the storage medium, they can be divided into disk-based and diskless checkpointing (Chen, 2010; Alshboul et al., 2019).
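The following is a minimal sketch of application-level, disk-based checkpointing, under the assumptions that the application state fits in a trivially copyable struct and that a single checkpoint file (here named app.ckpt) suffices; real checkpoint libraries handle far more state and failure cases.

#include <cstdio>

// Application state that must survive a failure (contents are illustrative).
struct State {
    long iteration;
    double accumulator;
};

// Write the state to a checkpoint file on disk (application-level, disk-based).
static bool save_checkpoint(const State& s, const char* path) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    bool ok = std::fwrite(&s, sizeof(State), 1, f) == 1;
    std::fclose(f);
    return ok;
}

// Try to restore the state; returns false if no checkpoint exists yet.
static bool load_checkpoint(State& s, const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    bool ok = std::fread(&s, sizeof(State), 1, f) == 1;
    std::fclose(f);
    return ok;
}

int main() {
    const char* ckpt = "app.ckpt";       // hypothetical checkpoint file
    State s{0, 0.0};
    load_checkpoint(s, ckpt);            // backward recovery: resume from saved state

    for (; s.iteration < 1000; ++s.iteration) {
        s.accumulator += 0.5;            // placeholder for the real computation
        if (s.iteration % 100 == 0)
            save_checkpoint(s, ckpt);    // periodic checkpoint
    }
    std::printf("final accumulator = %f\n", s.accumulator);
    return 0;
}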
Usually, error detection and recovery techniques are combined to ensure the correctness of applications. A task-based parallel programming model is proposed in (Wang et al., 2016), in which a fault-tolerant work-stealing scheduling scheme is adopted to achieve dynamic load balancing while supporting fault tolerance.
2.2  Parallel Application Model and 
Task Scheduling 
Most parallel applications fall into two categories: data parallelism and task parallelism. Task parallel applications usually decompose the work into many subtasks, partition the data set, and execute the subtasks with their corresponding data in parallel on different computing resources. Task parallel applications are widely used in drug screening, genetic research, cryptanalysis, nuclear simulation, and other fields. There is no dependence between subtasks, but the amount of computation per subtask may vary significantly. In large-scale environments, an efficient load balancing mechanism is therefore the key to ensuring application performance, and the result of each subtask has an important impact on the overall result of the application.
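A minimal sketch of this decomposition is given below, assuming independent subtasks whose computation amounts differ and whose partial results are combined into an overall result; the subtask function, the work sizes, and the use of std::async are illustrative choices.

#include <future>
#include <iostream>
#include <vector>

// Hypothetical independent subtask: its cost grows with the assigned work size.
static long subtask(int work_units) {
    long acc = 0;
    for (int i = 0; i < work_units; ++i) acc += i;
    return acc;
}

int main() {
    // Independent subtasks whose computation amounts differ significantly.
    std::vector<int> work = {1000, 50000, 200, 750000, 3000};

    // Launch each subtask on its own asynchronous worker.
    std::vector<std::future<long>> results;
    for (int w : work)
        results.push_back(std::async(std::launch::async, subtask, w));

    // Each subtask's result contributes to the overall result.
    long total = 0;
    for (auto& r : results) total += r.get();
    std::cout << "combined result: " << total << "\n";
    return 0;
}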
For task parallel applications, task division can be either static or dynamic (Mohit et al., 2019). In static division, each computing node is statically assigned the same number of tasks and executes them separately. Dynamic partitioning adjusts the tasks assigned to each computing resource according to its load, including dynamic scheduling with management nodes and task stealing (Dinan et al., 2009). In high performance computing, dynamic task partitioning is generally used so that applications can more fully utilize computing resources (He et al., 2016).
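The sketch below illustrates one simple form of dynamic partitioning, in which workers claim tasks from a shared pool through an atomic counter so that faster workers automatically process more tasks; the task sizes and worker count are illustrative, and this scheme stands in for the management-node and task-stealing approaches cited above rather than reproducing them.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_tasks = 64;
    const int num_workers = 4;
    std::atomic<int> next_task{0};          // shared pool of unassigned task indices
    std::vector<long> results(num_tasks, 0);

    // Each worker repeatedly claims the next task until the pool is exhausted,
    // so lightly loaded workers naturally take on more tasks.
    auto worker = [&] {
        for (;;) {
            int t = next_task.fetch_add(1); // claim the next unassigned task
            if (t >= num_tasks) break;      // pool exhausted
            long acc = 0;                   // placeholder for the real subtask
            for (int i = 0; i <= t * 1000; ++i) acc += i;
            results[t] = acc;
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_workers; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    std::cout << "last task result: " << results[num_tasks - 1] << "\n";
    return 0;
}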