Op Log Alert


I started receiving a failed oplog episode count check. I followed KB 1541, but the output below does not include the NFS ID, which is what I am supposed to use to find the VM causing the issue. I am running NOS 5.0.1. How can I use the output below to trace back to the offending VM?

Thanks

 


################################################################################
SUMMARY RESULT
################################################################################
Detailed information for oplog_episode_count_check:
Node 10.221.1.105:
FAIL: Oplog episode count exceeds threshold (1200) for the following vdisks:
Id 13334737, episode count 1496
Refer to KB 1541 (http://portal.nutanix.com/kb/1541) for details on oplog_episode_count_check
+---------------+
| State | Count |
+---------------+
| Fail  | 1     |
| Total | 1     |
+---------------+

Accepted Solutions
Moderator

Re: Op Log Alert

There is usually some give-and-take in the flushing of oplog episodes for a vDisk, and the flush can take some time if the system (or that particular vDisk) is very busy.

So the first recommendation is to monitor for a while and re-run "ncc health_checks stargate_checks oplog_episode_count_check" to see whether the episode count is dropping.
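
To make repeated runs easier to compare, something like the following could pull the episode count for a single vDisk out of the NCC output. This is only a sketch: the function name episode_count_for is made up for illustration, and it assumes the "Id <id>, episode count <n>" line format shown in the output posted above.

```shell
# Hypothetical helper: read NCC check output on stdin and print the episode
# count reported for the vDisk ID given as $1.
# Assumes lines of the form "Id 13334737, episode count 1496".
# Exits non-zero when the ID is not listed (i.e. the vDisk is not flagged).
episode_count_for() {
  awk -v id="$1" '
    $1 == "Id" {
      gsub(",", "", $2)            # "13334737," -> "13334737"
      if ($2 == id) { print $NF; found = 1 }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Possible usage on a CVM (left commented out; the interval is arbitrary):
# while count=$(ncc health_checks stargate_checks oplog_episode_count_check \
#                 | episode_count_for 13334737); do
#   echo "$(date '+%F %T') episode count: $count"
#   sleep 300
# done
# echo "vDisk 13334737 is no longer flagged"
```

The loop stops on its own once the vDisk drops out of the FAIL list, which is the outcome you are waiting for.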

If that specific NCC check does not clear after a few attempts, say after several minutes or even hours, the issue could be due to one of the following causes:
1) As KB 1541 indicates, a very heavy load (bursts) on a vDisk
2) NCC itself can experience a hang/timeout, as seen in earlier NOS/AOS versions
3) The vDisk may have underlying issues that would require Support (and possibly Engineering) to be involved.

But first, let's look at finding the associated file:
 - Yes, KB 1541 says to identify the associated file(s) by running the following command on one of the CVMs:
    "ncli vm ls | grep -B 9 13334737"
 - That should show the VM "Name" and its associated vDisk ID(s) as a list, including the "NFS:" prefix on the vDisk ID(s).
 - The associated vDisk ID(s) may be long strings that include the UUID, but "NFS:13334737" is the important part. That ID can be used to monitor whether the episode count is reducing over time.
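
One caveat with a bare "grep 13334737" is that it would also match a longer ID that merely contains those digits. A slightly stricter variant might look like the sketch below. The function name vm_for_vdisk is hypothetical, and the pattern assumes the ID always appears with the "NFS:" prefix described above; if nothing matches, fall back to the plain grep from KB 1541.

```shell
# Hypothetical helper: filter `ncli vm ls` output (read from stdin) down to
# the entry that references the vDisk ID given as $1, keeping 9 lines of
# leading context as in the KB 1541 command.
# The ID must be followed by a non-digit or end of line, so 13334737 does
# not match a longer ID such as 133347370.
vm_for_vdisk() {
  grep -E -B 9 "NFS:$1([^0-9]|\$)"
}

# Possible usage on a CVM (left commented out):
# ncli vm ls | vm_for_vdisk 13334737
```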

 

If you know this VM is extremely busy, the load should also be investigated from within the guest OS and application.


Apart from the NCC check, I have used the following to count the "episodes" associated with a vDisk ID:
    "medusa_printer -lookup oplog -vdisk_id=13334737 | grep -c episode_seq"

Let's see if that produces a similar count to NCC, and hopefully, both will be decreasing over time.
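
If you collect a few counts over time, a small helper can state the trend so you are not eyeballing numbers. This is a rough sketch; episode_trend is a made-up name, and it only compares the first and last sample rather than doing anything statistical.

```shell
# Hypothetical helper: given two or more episode counts as arguments
# (oldest first), print whether the count is decreasing, increasing or flat.
# The body runs in a subshell, so its variables do not leak into the caller
# and `exit` only leaves the function.
episode_trend() (
  [ "$#" -ge 2 ] || { echo "need at least two samples" >&2; exit 2; }
  first="$1"
  for last in "$@"; do :; done     # after the loop, $last is the final argument
  if [ "$last" -lt "$first" ]; then
    echo "decreasing"
  elif [ "$last" -gt "$first" ]; then
    echo "increasing"
  else
    echo "flat"
  fi
)

# Possible usage on a CVM (left commented out; the 10-minute gap is arbitrary):
# a=$(medusa_printer -lookup oplog -vdisk_id=13334737 | grep -c episode_seq)
# sleep 600
# b=$(medusa_printer -lookup oplog -vdisk_id=13334737 | grep -c episode_seq)
# episode_trend "$a" "$b"
```

A "flat" or "increasing" result over a long enough window is the signal to escalate.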

 

If the count is not decreasing according to either of these checks, we may need to open a Support case to look further into the situation.

 

Hope this helps!

 

 
