ElasticSearch Slow Query Response on ISCSI Disk on Oracle Cloud


I'm migrating from Elasticsearch 7.1 on AWS to Elasticsearch 8 on Oracle Cloud (OCI). I took a snapshot and the index was restored successfully, but Elasticsearch takes a long time to respond when it has many simultaneous connections.

The machines on AWS work properly. Here are their details:

AWS 3x Nodes Machine

8 GB RAM, 2 CPUs, NVMe SSD disk, JVM heap size 5 GB
Elasticsearch version 7.1, *query time 500 ms*
iowait: 15.92% (AWS, NVMe SSD disk)


[root@es-4-node-1_subnet-1 ec2-user]# fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
Starting 1 process
Jobs: 1 (f=1): [R(1)] [18.4% done] [246.0MB/0KB/0KB /s] [246/0/0 iops] [eta 00m:31s]
Jobs: 1 (f=1): [R(1)] [30.0% done] [246.0MB/0KB/0KB /s] [246/0/0 iops] [eta 00m:28s]
Jobs: 1 (f=1): [R(1)] [41.5% done] [245.0MB/0KB/0KB /s] [245/0/0 iops] [eta 00m:24s]
Jobs: 1 (f=1): [R(1)] [53.7% done] [239.0MB/0KB/0KB /s] [239/0/0 iops] [eta 00m:19s]
Jobs: 1 (f=1): [R(1)] [65.9% done] [247.0MB/0KB/0KB /s] [247/0/0 iops] [eta 00m:14s]
Jobs: 1 (f=1): [R(1)] [78.0% done] [242.0MB/0KB/0KB /s] [242/0/0 iops] [eta 00m:09s]
Jobs: 1 (f=1): [R(1)] [88.1% done] [241.0MB/0KB/0KB /s] [241/0/0 iops] [eta 00m:05s]
Jobs: 1 (f=1): [R(1)] [100.0% done] [251.0MB/0KB/0KB /s] [251/0/0 iops] [eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=29174: Thu Apr 14 04:52:41 2022
  read : io=10240MB, bw=255246KB/s, iops=249, runt= 41081msec
    slat (usec): min=26, max=41738, avg=3994.68, stdev=6172.41
    clat (msec): min=9, max=181, avg=123.92, stdev=22.70
     lat (msec): min=9, max=189, avg=127.91, stdev=23.31
    clat percentiles (msec):
     |  1.00th=[   13],  5.00th=[   99], 10.00th=[  106], 20.00th=[  116],
     | 30.00th=[  123], 40.00th=[  126], 50.00th=[  128], 60.00th=[  129],
     | 70.00th=[  131], 80.00th=[  137], 90.00th=[  145], 95.00th=[  151],
     | 99.00th=[  159], 99.50th=[  167], 99.90th=[  180], 99.95th=[  180],
     | 99.99th=[  182]
    lat (msec) : 10=0.02%, 20=2.03%, 50=0.87%, 100=2.49%, 250=94.59%
  cpu          : usr=0.11%, sys=1.15%, ctx=9640, majf=0, minf=8204
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=10240/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: io=10240MB, aggrb=255245KB/s, minb=255245KB/s, maxb=255245KB/s, mint=41081msec, maxt=41081msec
Disk stats (read/write):
  nvme0n1: ios=46378/222, merge=0/30, ticks=1544352/5552, in_queue=1500556, util=99.15%

My problem is on the OCI machines, where the 7.1 snapshot was restored into Elasticsearch 8: the request response time is too long. I don't know whether the cause is the kind of virtualized disk that OCI uses, with slow reads and writes. My Elasticsearch cluster holds 2 TB of data, more than 5 billion documents.

Oracle 3x Nodes Machine

16 GB RAM, 4 CPUs, iSCSI disk, JVM heap size 10 GB
Elasticsearch version 8, *query time 10 seconds, 20 seconds, 30 seconds, up to more than 1 minute*
(The time only increases, and the response either never returns or takes too long.)
iowait: 39.71% (Oracle, iSCSI "network storage" disk)

Oracle ISCSI (Network Storage Disk)

root@es-master-1:/home# fio --name TEST --eta-newline=5s --filename=temp.file --rw=read --size=2g --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
Starting 1 process
Jobs: 1 (f=1): [R(1)][16.7%][r=239MiB/s][r=239 IOPS][eta 00m:35s]
Jobs: 1 (f=1): [R(1)][31.0%][r=234MiB/s][r=234 IOPS][eta 00m:29s] 
Jobs: 1 (f=1): [R(1)][45.2%][r=196MiB/s][r=196 IOPS][eta 00m:23s] 
Jobs: 1 (f=1): [R(1)][59.5%][r=237MiB/s][r=237 IOPS][eta 00m:17s] 
Jobs: 1 (f=1): [R(1)][73.8%][r=264MiB/s][r=264 IOPS][eta 00m:11s] 
Jobs: 1 (f=1): [R(1)][88.1%][r=251MiB/s][r=251 IOPS][eta 00m:05s] 
Jobs: 1 (f=1): [R(1)][100.0%][r=190MiB/s][r=190 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=14554: Thu Apr 14 04:52:48 2022
  read: IOPS=238, BW=239MiB/s (250MB/s)(10.0GiB/42923msec)
    slat (usec): min=12, max=275, avg=26.39, stdev=12.34
    clat (msec): min=15, max=350, avg=134.02, stdev=99.43
     lat (msec): min=15, max=350, avg=134.05, stdev=99.43
    clat percentiles (msec):
     |  1.00th=[   24],  5.00th=[   40], 10.00th=[   51], 20.00th=[   53],
     | 30.00th=[   55], 40.00th=[   58], 50.00th=[   73], 60.00th=[   94],
     | 70.00th=[  245], 80.00th=[  259], 90.00th=[  266], 95.00th=[  288],
     | 99.00th=[  313], 99.50th=[  330], 99.90th=[  347], 99.95th=[  347],
     | 99.99th=[  351]
   bw (  KiB/s): min=151552, max=417792, per=99.70%, avg=243557.27, stdev=53597.76, samples=85
   iops        : min=  148, max=  408, avg=237.84, stdev=52.35, samples=85
  lat (msec)   : 20=0.31%, 50=10.01%, 100=51.48%, 250=10.07%, 500=28.12%
  cpu          : usr=0.14%, sys=0.88%, ctx=8661, majf=0, minf=8203
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=0.4%, 16=0.8%, 32=98.5%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=10240,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
   READ: bw=239MiB/s (250MB/s), 239MiB/s-239MiB/s (250MB/s-250MB/s), io=10.0GiB (10.7GB), run=42923-42923msec
Disk stats (read/write):
  sda: ios=10521/236, merge=0/544, ticks=1399849/35181, in_queue=1435030, util=96.10%

What could be causing these slowdowns? Is the problem a slow disk on OCI? The response time increases with the number of connections. The AWS machines have lower specs, yet they return results very quickly, while on OCI queries take forever. How can I determine where the problem is? Is there an Elasticsearch setting I should change, or is the problem with the machine?
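As a sanity check on the two fio runs above, here is a minimal sketch that applies Little's law (mean latency ≈ queue depth / IOPS) to the reported numbers and compares the latency tails. The figures are copied from the fio output in this post; the function and variable names are only illustrative. If the predicted and reported mean latencies agree, the averages in this test mostly reflect the queue depth of 32 at roughly 240 to 250 MB/s of sequential 1 MiB reads, so the test alone may not show how either disk behaves under the small random reads that searches tend to issue.

```python
# Figures copied from the two fio runs in this post (sequential 1 MiB reads, iodepth=32).
RUNS = {
    "AWS NVMe": {
        "iodepth": 32, "iops": 249, "reported_lat_ms": 127.91,
        "clat_ms": {"p50": 128, "p90": 145, "p99": 159},
    },
    "OCI iSCSI": {
        "iodepth": 32, "iops": 238, "reported_lat_ms": 134.05,
        "clat_ms": {"p50": 73, "p90": 266, "p99": 313},
    },
}


def predicted_latency_ms(iodepth: int, iops: float) -> float:
    """Little's law: mean time in system = requests in flight / throughput."""
    return iodepth / iops * 1000.0


def tail_ratio(clat_ms: dict) -> float:
    """Ratio of p99 to median completion latency; higher means a longer tail."""
    return clat_ms["p99"] / clat_ms["p50"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        predicted = predicted_latency_ms(run["iodepth"], run["iops"])
        print(
            f"{name}: predicted mean {predicted:.1f} ms, "
            f"reported mean {run['reported_lat_ms']:.1f} ms, "
            f"p99/p50 = {tail_ratio(run['clat_ms']):.2f}"
        )
```

The mean latencies are nearly identical on both platforms, but the p99/p50 ratio is much larger on the iSCSI volume, which suggests its latency is far less consistent even at the same throughput.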

On OCI my benchmark runs fine with multiple connections, but when I redirect traffic from the old cluster on AWS to OCI, the application starts taking a long time to respond until Elasticsearch freezes completely, and it can take up to 10 minutes to return a response.
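One way to check whether searches are piling up under production traffic is to look at the search thread pool while the traffic is redirected. The sketch below parses the plain-text output of `GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected` and lists nodes with queued or rejected searches. The sample text is made up for illustration (it is not output from this cluster), the helper names are hypothetical, and the parser assumes node names contain no spaces.

```python
def parse_thread_pool(text: str) -> list:
    """Parse `_cat/thread_pool/search?v&h=node_name,active,queue,rejected` output."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        # The numeric columns arrive as strings; convert them for comparison.
        for key in ("active", "queue", "rejected"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


def saturated_nodes(rows: list) -> list:
    """Names of nodes where searches are waiting in the queue or were rejected."""
    return [r["node_name"] for r in rows if r["queue"] > 0 or r["rejected"] > 0]


# Made-up sample output, for illustration only.
SAMPLE = """
node_name   active queue rejected
es-master-1      7   412       35
es-master-2      7     0        0
es-master-3      6    18        0
"""

if __name__ == "__main__":
    for node in saturated_nodes(parse_thread_pool(SAMPLE)):
        print(f"search thread pool is backing up on {node}")
```

A queue that keeps growing while iowait is high would point toward storage latency rather than an Elasticsearch setting, though this is only one signal among several.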

This is the result from the Rally benchmark; I don't know whether it is good: https://pastebin.com/vjhDEtR4

