illumos Fault Management Architecture (FMA) Message Registry

Message: INTEL-8000-MQ

title Level 2 Data Cache Fault
description A level 2 Data Cache on this cpu is faulty.
severity Major
type Fault
keys fault.cpu.intel.l2dcache
details INTEL-8000-MQ
Message ID: INTEL-8000-MQ indicates that the Illumos Fault Manager has received an error report
indicating that a processor has experienced an error in the level 2 data cache.

If an uncorrectable or fatal processor error is reported by a machine check exception,
Illumos generates an ereport and then panics the system, which is followed by a warm reset.
Upon reboot, Illumos replays the error telemetry and diagnoses the faulty CPU,
which results in that CPU/processor being off-lined.
Performance of this system may be affected.

Your system may provide the capability to illuminate a service-required LED on or near a faulty component
as an aid in physically locating the faulty component. The Sun products listed below provide this feature.


      Sun Fire X2270, Sun Fire X4170, Sun Fire X4270, Sun Fire X4275, Sun Blade X6270, Sun Blade X6275

        
            The service-required LED indicator next to the faulty processor will be illuminated.
            However, on systems equipped with a "Fault Remind" button on the motherboard,
            the service-required LED will illuminate only when this button is pressed.
            The status of the processor's service-required LED is displayed from the ILOM CLI as: /SYS/MB/P0/SERVICE = On
            Refer to the service label located on the top cover for the location of the processor and the "Fault Remind" button.

            The chassis-wide service-required LED on the rackmount server or blade server will also be illuminated.
            The status of the chassis-wide service-required LED is displayed from the ILOM CLI as:  /SYS/SERVICE = On

 

The recommended service actions for this event are as follows:

      Section A -  Identify Faulty Component / FRU
      Section B -  Contact Authorized Service Provider
      Section C -  Clearing Fault after Replacement
      Section D -  Enable Off-lined Resource



Section A -  Identify Faulty Component / FRU

   
      LEGEND
      GREEN =  FRU
      BLUE    =  EVENT_ID
      RED      =  SUNW-MSG-ID

     Illumos provides commands that can be used to obtain information about faults present in the system.
     The "fmadm faulty" command is the preferred method by which fault information can be obtained.
     The "fmdump -av" command is an alternative method for obtaining similar fault information.

  •     Illumos numbering is zero-based, so all numbering begins with 0.
  •     The term "chip" is used to describe the physical CPU/Processor.
  •     The term "cpuid" is used to describe the logical/virtual CPU/Processor.
  •     A logical CPU is a strand/thread of a core contained within a physical CPU/Processor.
  •     The field replaceable unit is the physical CPU/Processor.
   
       Note:  The Event-ID shown in Step A1 and the UUID shown in Step A2 are one and the same,
                  as are the MSG-ID shown in Step A1 and the SUNW-MSG-ID shown in Step A2.



       Step A1.  How to Identify the faulty processor/chip using "fmadm faulty".
                
               Example:  Use fmadm (1M) faulty to list the faulty FRUs, unique Event_ID, and Sun Message_ID.

                     # fmadm faulty
                     ---------------    --------------------------------------------  ------------------  ------------
                     TIME                 EVENT-ID                                                  MSG-ID                SEVERITY
                     ---------------    --------------------------------------------  ------------------  -------------
                     Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac  INTEL-8000-MQ  Major    
 
                     Fault class : fault.cpu.intel.l2dcache

                     Affects : cpu:///cpuid=1
                                         degraded but still in service
 
                     FRU : hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1 
                                      faulty
 
                     Description : A level 2 data cache on this cpu is faulty.  Refer to https://illumos.org/msg/INTEL-8000-MQ for more information.
               
                     Response : The system will attempt to offline this cpu to remove it from service.
 
                     Impact : Performance of this system may be affected.
 
                     Action : Schedule a repair procedure to replace the affected CPU.  Use 'fmadm faulty' to identify the module.


               If your product does not provide the "fmadm faulty" output as shown above, proceed to Step A2.


               The example shown in Step A1:
  • Identifies the logical/virtual processor as "cpuid=1" in the line beginning with Affects.
  • Identifies the physical processor as "chip=1"  in the line beginning with FRU.
  • Identifies the location of the physical processor within the server as "/motherboard=0/chip=1"
               There is typically one motherboard in a server and it is commonly referred to as "motherboard=0".
               The physical processor, "/chip=1", refers to the second physical processor on the motherboard.
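
               The FRU string in the Step A1 output can also be taken apart programmatically. The following is a
               minimal Python sketch and is not part of the fault management tooling. The names parse_hc_fmri and
               EXAMPLE_FRU are hypothetical, and the sketch assumes that the authority portion of the FMRI contains
               no "/" and that every path component has the form name=integer, as in the example above.

```python
# Hypothetical helper: split an hc-scheme FMRI into authority and path parts.
EXAMPLE_FRU = (
    "hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000"
    ":server-id=wgs40-117/motherboard=0/chip=1"
)


def parse_hc_fmri(fmri):
    """Return {"authority": str, "path": {name: int}} for an hc:// FMRI."""
    fmri = fmri.strip()
    if not fmri.startswith("hc://"):
        raise ValueError("not an hc-scheme FMRI: " + fmri)
    rest = fmri[len("hc://"):]
    # Assumption: the authority runs up to the first "/" and contains none.
    authority, _, path = rest.partition("/")
    components = {}
    for part in path.split("/"):
        name, sep, value = part.partition("=")
        if sep:
            # Assumption: instance numbers are plain integers.
            components[name] = int(value)
    return {"authority": authority, "path": components}


if __name__ == "__main__":
    print(parse_hc_fmri(EXAMPLE_FRU)["path"])
```

               For the example FRU, the "path" entry holds motherboard 0 and chip 1, which matches the manual
               reading described above.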

                Proceed to Step A3.



       Step A2.  How to Identify the faulty processor/chip using "fmdump -av".
             
               Example:  Use fmdump (1M) to list the faulty FRUs, unique Event_ID, and Sun Message_ID.

                     # fmdump  -av
                     TIME                             UUID                                                          SUNW-MSG-ID
                       Sep 26 20:46:15.5823  8a47c7ba-8e66-c19f-f874-8f22a2b70cac  INTEL-8000-MQ
                           100%  fault.cpu.intel.l2dcache
 
                                      Problem in: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1/core=0/strand=0
                                            Affects: cpu:///cpuid=1
                                               FRU: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1
                                          Location: -


               The example shown in Step A2:
  • Identifies the logical/virtual processor as "cpuid=1" in the line beginning with Affects.
  • Identifies the physical processor as "chip=1"  in the line beginning with FRU.
  • Identifies the strand contained within the physical processor as "strand=0" in the line beginning with Problem.
  • Identifies the location of the physical processor within the server as "/motherboard=0/chip=1"
               There is typically one motherboard in a server and it is commonly referred to as "motherboard=0".
               The physical processor, "/chip=1", refers to the second physical processor on the motherboard.
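
               When "fmdump -av" output has been saved to a file, the fields discussed above can be pulled out
               with a short script. The sketch below is illustrative only: parse_fmdump_record and SAMPLE_RECORD
               are hypothetical names, the sample text is copied from the Step A2 example with its column spacing
               simplified, and the parser assumes a single fault record laid out as in that example.

```python
import re

# Sample record copied from the Step A2 example (spacing simplified).
SAMPLE_RECORD = """\
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 20:46:15.5823 8a47c7ba-8e66-c19f-f874-8f22a2b70cac INTEL-8000-MQ
  100%  fault.cpu.intel.l2dcache

        Problem in: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1/core=0/strand=0
           Affects: cpu:///cpuid=1
               FRU: hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000:server-id=wgs40-117/motherboard=0/chip=1
          Location: -
"""

UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")
LABELS = ("Problem in", "Affects", "FRU", "Location")


def parse_fmdump_record(text):
    """Collect UUID, message ID, fault class and labeled fields of one record."""
    record = {}
    for line in text.splitlines():
        stripped = line.strip()
        match = UUID_RE.search(stripped)
        if match and "uuid" not in record:
            record["uuid"] = match.group(0)
            # Assumption: the message ID is the last token on the UUID line.
            record["msg_id"] = stripped.split()[-1]
            continue
        for label in LABELS:
            if stripped.startswith(label + ":"):
                record[label] = stripped[len(label) + 1:].strip()
        percent = re.match(r"(\d+)%\s+(\S+)", stripped)
        if percent:
            record["certainty"] = int(percent.group(1))
            record["class"] = percent.group(2)
    return record


if __name__ == "__main__":
    for key, value in parse_fmdump_record(SAMPLE_RECORD).items():
        print(key, "=", value)
```

               The "Affects" value gives the logical CPU (cpuid=1) and the "Problem in" value ends with the
               chip, core and strand, as described in the bullets above.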

                Proceed to Step A3.



       Step A3.  How to Identify the "physical" location of the faulty processor/chip.


               Sun Platforms
               A service label on the top cover and silk screen on the motherboard can assist with identifying the
               physical location of the processor/chip.  Typically, there is a label in the proximity of the processor/chip.

               The label nomenclature for a CPU/Processor is Px, whereby 'x' represents the processor number.

                Sun platforms use zero-based numbering for their labeling scheme, and as such,
                the faulty processor/chip identified in our example above (e.g. "/chip=1") would be interpreted as "P1".
                         e.g.
                         chip=0 maps to the physical location labeled "P0".
                         chip=1 maps to the physical location labeled "P1".
                         chip=2 maps to the physical location labeled "P2".
                         chip=3 maps to the physical location labeled "P3".                       


               Non-Sun Platforms
               A service label on the top cover or silk screen on the motherboard may exist to assist with identifying the
               physical location of the processor/chip.  Typically, there is a label in the proximity of the processor/chip.

               The label nomenclature for a CPU/Processor can be in the form of:  Px, CPUx, CHIPx, SCKTx, etc;
               whereby 'x' represents the processor number.

                If the product uses zero-based numbering for its labeling scheme, as Sun's products do, the faulty
                processor identified in our example (e.g. "/chip=1") would be identified as P1, CPU1, CHIP1, SCKT1, etc.
                         e.g.
                         chip=0 maps to the physical location labeled:  P0, CPU0, CHIP0, SCKT0, etc.
                         chip=1 maps to the physical location labeled:  P1, CPU1, CHIP1, SCKT1, etc.
                         chip=2 maps to the physical location labeled:  P2, CPU2, CHIP2, SCKT2, etc.
                         chip=3 maps to the physical location labeled:  P3, CPU3, CHIP3, SCKT3, etc.
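
               The chip-to-label mapping described in Step A3 can be written as a one-line rule. The sketch below
               is a hypothetical helper (chip_to_label is not an existing command or API) and applies only to
               platforms whose labels are zero-based. The label prefix depends on the vendor and must be taken
               from the service label or silk screen.

```python
def chip_to_label(chip, prefix="P"):
    """Map a zero-based chip number to a zero-based physical label."""
    if chip < 0:
        raise ValueError("chip number must not be negative")
    # Assumption: labels are zero-based, so the number carries over unchanged.
    return "{}{}".format(prefix, chip)


if __name__ == "__main__":
    print(chip_to_label(1))          # Sun-style label for chip=1
    print(chip_to_label(3, "CPU"))   # one possible non-Sun prefix
```

               With the default prefix, chip=1 from the example yields "P1".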

     The recommended service action for this event is to replace the faulty processor.

     The processor is not customer serviceable and requires repair by an authorized service provider.



Section B -   Contact Authorized Service Provider


      Please contact your service provider in accordance with the terms and conditions of your service
      agreement to open a service request and to confirm and carry out the required repair actions specified
      by current service policy.


      Your service provider may ask for information displayed using the procedures outlined in Section A.

     
      If the product is covered by a current service agreement with Sun Microsystems, Inc.,
      please refer to the following  instructions for reporting the problem and opening a service request. 


            Auto Service Request (ASR) Activated for the Product

               If ASR has been activated for the product on which this problem was diagnosed, you have received,
               or will receive, a notification via e-mail confirming that a service request has been automatically
               opened, along with instructions for viewing the service request.


               All of the fault event telemetry required to open a service request has already been transmitted
               to Sun Services. Unless contacted and instructed otherwise by a Sun Service representative,
               no further action is required to report this problem to Sun Microsystems.

               The e-mail notification will provide a pointer back to this same article.
              
               If you are reading this article in response to a fault message or SNMP trap generated on the product,
               rather than in response to the ASR notification e-mail above, then you can check on the status of the
               associated service request by logging into the Members Support Center at  http://sun.com/support 

               For more information on Auto Service Request (ASR) and the currently supported products,
               please refer to http://sun.com/service/asr



             Submitting a Service Request via the Members Support Center Portal

                In cases where ASR has not been activated:

                      1.  Log in to the Members Support Center at  http://sun.com/support

                      2.  Create a Service Request

                      3.  Copy and paste the information displayed using the instructions provided in Section A
                           to identify the faulty FRU(s), into the Service Request notes.




Section C -  Clearing Fault after Replacement

    

   
     Illumos Command to Clear the Fault

     Once the processor/chip has been physically replaced and the system is rebooted, a fault management command is
     required to clear the processor/chip fault from the Illumos fault manager's resource cache to accurately reflect faults
     that are no longer present.

        Invoke the "fmadm repair" command along with the UUID (Universally Unique IDentifier) associated with the faulted processor.
        
           e.g.
           Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac  INTEL-8000-MQ  Major


           Note:  The Event-ID shown in Step A1 and the UUID shown in Step A2 are one and the same,
                     as are the MSG-ID shown in Step A1 and the SUNW-MSG-ID shown in Step A2.


                       Example:  Use fmadm (1M) repair to clear the fault using the UUID (Universally Unique Identifier).

                           # fmadm repair  8a47c7ba-8e66-c19f-f874-8f22a2b70cac
                           fmadm: recorded repair to  8a47c7ba-8e66-c19f-f874-8f22a2b70cac
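
     To avoid retyping the UUID by hand, the repair command can be assembled from the summary line shown above.
     The following Python sketch is illustrative only; repair_command and SUMMARY_LINE are hypothetical names,
     and the function only builds the argument list for "fmadm repair" without running it.

```python
import re

# Summary line copied from the "fmadm faulty" example.
SUMMARY_LINE = (
    "Sep 26 20:46:15 8a47c7ba-8e66-c19f-f874-8f22a2b70cac  INTEL-8000-MQ  Major"
)

UUID_RE = re.compile(r"\b[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}\b")


def repair_command(summary_line):
    """Return the argument list for 'fmadm repair <UUID>'."""
    match = UUID_RE.search(summary_line)
    if match is None:
        raise ValueError("no UUID found in: " + summary_line)
    return ["fmadm", "repair", match.group(0)]


if __name__ == "__main__":
    # Print the command for review instead of executing it.
    print(" ".join(repair_command(SUMMARY_LINE)))
```

     Printing the command rather than executing it leaves the decision to clear the fault with the administrator,
     who should run it only after the processor has been replaced.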



     ILOM Command to Clear the Fault

     On Sun platforms with ILOM 2.0 or later on the service processor, you may also need to clear the processor/chip
     fault from the ILOM fault manager's resource cache to accurately reflect faults that are no longer present and to extinguish
     the service-required LED for the faulty CPU/processor and the chassis-wide service-required LED.

         The nomenclature ILOM uses to describe a CPU/Processor is Px, where 'x' represents the processor number.

               Sun platforms use zero-based numbering for their labeling scheme, and as such,
               the faulty processor/chip identified in our example above (e.g. "/chip=1") would be interpreted as "P1".
                         e.g.
                         chip=0 maps to the physical location labeled "P0".
                         chip=1 maps to the physical location labeled "P1".
                         chip=2 maps to the physical location labeled "P2".
                         chip=3 maps to the physical location labeled "P3".

         Log in to the ILOM command line interface as 'root' and use the following command to clear the fault.

                       Example:
                       -> set /SYS/MB/P1 clear_fault_action=true
                       Are you sure you want to clear /SYS/MB/P1 (y/n)?  y
                       Set 'clear_fault_action' to 'true'

          NOTE:  The example above specifically clears the fault associated with "P1".  You will need to provide the CPU/processor
                        number in the command above as determined by the process defined in  Section A of this article.
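
          The ILOM command line for a given fault can be derived from the FRU string reported in Section A. The
          sketch below is a hypothetical helper (ilom_clear_fault_command is not an ILOM or illumos interface).
          It assumes the processor target is /SYS/MB/Px, as in the example above, which may not hold on every
          platform, and it only formats the command text.

```python
import re

EXAMPLE_FRU = (
    "hc://:product-id=ASSY,MOTHERBOARD,LYNX_SERVER:chassis-id=0000000000"
    ":server-id=wgs40-117/motherboard=0/chip=1"
)


def ilom_clear_fault_command(fru_fmri):
    """Format the ILOM 'set ... clear_fault_action=true' line for a FRU."""
    match = re.search(r"/chip=(\d+)(?:/|$)", fru_fmri.strip())
    if match is None:
        raise ValueError("no chip component found in FMRI")
    # Assumption: chip=N corresponds to the ILOM target /SYS/MB/PN.
    return "set /SYS/MB/P{} clear_fault_action=true".format(int(match.group(1)))


if __name__ == "__main__":
    print(ilom_clear_fault_command(EXAMPLE_FRU))
```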


 
Section D -  Enable Off-lined Resource



      The term "chip" is used to describe the physical CPU/Processor.
      The term "cpuid" is used to describe the logical/virtual CPU/Processor  ( e.g. Affects: cpu:///cpuid=1 )
      A logical CPU is a strand/thread of a core contained within a physical CPU/Processor.
      Illumos assigns CPU numbers based on threads/strands, not cores or chips.

      A chip that consists of four (4) dual-threaded cores would show up as eight (8) logical CPUs,
      and a chip that consists of two (2) dual-threaded cores would show up as four (4) logical CPUs.

      Verify the status of the processor/chip in the active configuration by identifying the logical/virtual
      CPU/Processor that was faulted by using the "psrinfo" command.

              Example:  Use psrinfo (1M) to obtain status of logical/virtual processors/chips.
                     # psrinfo
                     0       on-line   since 09/26/2008 18:04:57
                     1       faulted   since 09/26/2008 20:46:15
                     2       on-line   since 09/26/2008 18:05:19
                     3       on-line   since 09/26/2008 18:05:21
                     4       on-line   since 09/26/2008 18:05:23
                     5       on-line   since 09/26/2008 18:05:25
                     6       on-line   since 09/26/2008 18:05:27
                     7       on-line   since 09/26/2008 18:05:29
                     8       on-line   since 09/26/2008 18:05:31
                     9       on-line   since 09/26/2008 18:05:33
                     10      on-line   since 09/26/2008 18:05:35
                     11      on-line   since 09/26/2008 18:05:37
                     12      on-line   since 09/26/2008 18:05:39
                     13      on-line   since 09/26/2008 18:05:41
                     14      on-line   since 09/26/2008 18:05:43
                     15      on-line   since 09/26/2008 18:05:45
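
      On systems with many logical CPUs, it can be easier to filter the "psrinfo" listing than to read it line by
      line. The following Python sketch is illustrative only: parse_psrinfo, not_online and SAMPLE_PSRINFO are
      hypothetical names, the sample is the first four lines of the example above, and the parser assumes each
      line begins with the processor ID followed by its state.

```python
# First four lines of the psrinfo example above.
SAMPLE_PSRINFO = """\
0       on-line   since 09/26/2008 18:04:57
1       faulted   since 09/26/2008 20:46:15
2       on-line   since 09/26/2008 18:05:19
3       on-line   since 09/26/2008 18:05:21
"""


def parse_psrinfo(text):
    """Return {processor_id: state} from plain 'psrinfo' output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: field 0 is the numeric ID and field 1 is the state.
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def not_online(states):
    """List the processor IDs whose state is anything other than on-line."""
    return sorted(cpu for cpu, state in states.items() if state != "on-line")


if __name__ == "__main__":
    print(not_online(parse_psrinfo(SAMPLE_PSRINFO)))
```

      For the sample, the only processor that is not on-line is 1, which matches "cpuid=1" in the line beginning
      with Affects.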


    Enable the faulted processor/chip back into the active configuration by using the  "psradm" command.

              Example:  Use psradm (1M) to online a processor.
                      # psradm -F -n 1

                      In this command, '1' is the "cpuid" number from the line beginning with Affects.


    Verify the status of the processor/chip in the active configuration to confirm the logical/virtual CPU number is enabled.
   
              Example:  Use psrinfo (1M) to obtain status of processors/chips.
                     # psrinfo
                      0       on-line   since 09/26/2008 18:04:57
                      1       on-line   since 09/26/2008 20:57:15
                      2       on-line   since 09/26/2008 18:05:19
                      3       on-line   since 09/26/2008 18:05:21
                      4       on-line   since 09/26/2008 18:05:23
                      5       on-line   since 09/26/2008 18:05:25
                      6       on-line   since 09/26/2008 18:05:27
                      7       on-line   since 09/26/2008 18:05:29
                      8       on-line   since 09/26/2008 18:05:31
                      9       on-line   since 09/26/2008 18:05:33
                      10      on-line   since 09/26/2008 18:05:35
                      11      on-line   since 09/26/2008 18:05:37
                      12      on-line   since 09/26/2008 18:05:39
                      13      on-line   since 09/26/2008 18:05:41
                      14      on-line   since 09/26/2008 18:05:43
                      15      on-line   since 09/26/2008 18:05:45


   Optionally, use the "psrinfo -vp" command to show the logical/virtual CPU numbers assigned to each physical processor.

             e.g.  A chip that consists of four (4) dual-threaded cores would show up as eight (8) logical CPUs.

                     # psrinfo -vp
                    The physical processor has 4 cores and 8 virtual processors (0-3 8-11)
                       The core has 2 virtual processors (0 8)
                       The core has 2 virtual processors (1 9)
                       The core has 2 virtual processors (2 10)
                       The core has 2 virtual processors (3 11)
                          x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2533 MHz)
                            Intel(r) Xeon(r) CPU           E5540  @ 2.53GHz
                   The physical processor has 4 cores and 8 virtual processors (4-7 12-15)
                       The core has 2 virtual processors (4 12)
                       The core has 2 virtual processors (5 13)
                       The core has 2 virtual processors (6 14)
                       The core has 2 virtual processors (7 15)
                          x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2533 MHz)
                             Intel(r) Xeon(r) CPU           E5540  @ 2.53GHz
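
   The grouping printed by "psrinfo -vp" can be used to look up which physical processor contains a given logical
   CPU. The sketch below is illustrative only: expand_ranges, parse_psrinfo_vp, physical_index_of and SAMPLE_VP are
   hypothetical names, and the sample is abbreviated from the example above. The index returned is simply the
   position of the physical processor in the output; whether that position equals the chip number reported by the
   fault manager is an assumption that should be checked on the actual system.

```python
import re

# Abbreviated from the "psrinfo -vp" example above.
SAMPLE_VP = """\
The physical processor has 4 cores and 8 virtual processors (0-3 8-11)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (1 9)
The physical processor has 4 cores and 8 virtual processors (4-7 12-15)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (5 13)
"""

PHYSICAL_RE = re.compile(
    r"physical processor has .*virtual processors? \(([\d\s-]+)\)"
)


def expand_ranges(spec):
    """Expand a string such as '0-3 8-11' into a sorted list of integers."""
    ids = []
    for token in spec.split():
        low, _, high = token.partition("-")
        ids.extend(range(int(low), int(high or low) + 1))
    return sorted(ids)


def parse_psrinfo_vp(text):
    """Return one list of logical CPU IDs per physical processor, in order."""
    processors = []
    for line in text.splitlines():
        match = PHYSICAL_RE.search(line)
        if match:
            processors.append(expand_ranges(match.group(1)))
    return processors


def physical_index_of(cpuid, processors):
    """Return the output position of the processor holding cpuid, or None."""
    for index, ids in enumerate(processors):
        if cpuid in ids:
            return index
    return None


if __name__ == "__main__":
    groups = parse_psrinfo_vp(SAMPLE_VP)
    print(groups)
    print(physical_index_of(12, groups))
```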


impact Performance of this system may be affected.
response The system will attempt to offline this cpu to remove it from service.
action Schedule a repair procedure to replace the affected CPU. Use 'fmadm faulty' to identify the module.
