A Website Load Troubleshooting Record

Published: 2023-09-06 02:16 | Editor: Hu Xiaohai | Tags: none

Background:

At 9:39 one morning, Nagios suddenly fired an alert: the load on one of our mobile-business machines had climbed past 60, on a box with only 8 cores and 8 GB of RAM. At the same time, other monitors reported socket timeouts and connection failures. A problem like this clearly threatens the business, and it is bound to spread and affect other services. Sure enough, within ten minutes the other machines serving the same business were alerting en masse: nginx connections were timing out and the various status-code monitors went dark...
With this kind of problem, if you don't act quickly, customer complaints start coming in fast.

Investigation:

1. First, find the specific processes driving the load
Using top -c -d 1, check which processes are consuming the most CPU. Because the load was already so high, logging in through the bastion host took several minutes; I was almost tempted to just reboot the machine, since time was pressing.
The top output showed no single process with conspicuously high CPU usage.

top - 09:39:13 up 824 days, 9:39, 4 users, load average: 4.65, 6.32, 8.26
Tasks: 854 total, 5 running, 849 sleeping, 0 stopped, 0 zombie
Cpu(s): 52.4%us, 19.6%sy, 0.0%ni, 25.7%id, 0.2%wa, 0.1%hi, 1.9%si, 0.1%st
Mem:  8058948k total, 7111024k used,  947924k free,   55588k buffers
Swap: 8191992k total, 1906460k used, 6285532k free, 2126436k cached

  PID USER   PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19380 nobody 20  0 87800  32m 1700 R 18.3  0.4  87:25.36 nginx: worker process
19377 nobody 20  0 87512  32m 1704 S 18.9  0.4  87:56.77 nginx: worker process
19383 nobody 20  0 87788  32m 1700 S 13.3  0.4  88:46.35 nginx: worker process
19382 nobody 20  0 87584  32m 1700 S 18.6  0.4  86:51.17 nginx: worker process
19379 nobody 20  0 87364  32m 1700 S 24.8  0.4  89:36.12 nginx: worker process
19381 nobody 20  0 87964  32m 1700 S 31.0  0.4  90:51.07 nginx: worker process
19378 nobody 20  0 88144  32m 1700 R  9.2  0.4  90:24.09 nginx: worker process
19376 nobody 20  0 87760  32m 1704 S  3.5  0.4  89:19.43 nginx: worker process
  354 XXXXX  20  0  444m  21m 5828 R  3.8  0.3  13:09.36 douDiZhuServerWorker_0
  392 XXXXX  20  0  442m  20m 5084 S  2.8  0.3   5:20.78 douDiZhuServerWorker_22
32655 XXXXX  20  0 2802m  14m 2248 S  2.8  0.2  10:31.01 douDiZhuServerMaster
  356 XXXXX  20  0  443m  18m 4564 R  1.9  0.2   5:16.02 douDiZhuServerWorker_1
  358 XXXXX  20  0  442m  18m 4560 S  1.9  0.2   5:18.09 douDiZhuServerWorker_2
  369 XXXXX  20  0  445m  22m 4552 S  1.9  0.3   5:29.04 douDiZhuServerWorker_9
  373 XXXXX  20  0  443m  21m 4532 S  1.9  0.3   5:16.98 douDiZhuServerWorker_11
  376 XXXXX  20  0  443m  21m 4568 S  1.9  0.3   5:12.60 douDiZhuServerWorker_13
  383 XXXXX  20  0  444m  18m 4564 S  1.9  0.2   5:21.64 douDiZhuServerWorker_17
  387 XXXXX  20  0  442m  20m 4556 S  1.9  0.3   5:20.62 douDiZhuServerWorker_20
  388 XXXXX  20  0  444m  19m 4564 S  1.9  0.2   5:15.17 douDiZhuServerWorker_21
  400 XXXXX  20  0  443m  21m 4576 S  1.9  0.3   5:09.63 douDiZhuServerWorker_28
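When a loaded box makes interactive top painfully slow, a one-shot ps gives the same per-process CPU ranking without the refresh overhead (a generic technique, not from the original writeup):

```shell
# One-shot listing of the top CPU consumers, sorted descending.
# -e: every process; -o: pick output columns; --sort=-pcpu: highest CPU first.
ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 11
```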

2. Immediately find the processes with heavy IO
Use the iotop command to see which processes are doing the most IO; heavy IO usually comes with rising load. While chasing high load, also keep an eye on memory and swap usage: once swap starts being used, IO tends to spike rapidly, which ultimately drives the load up.

The IO view showed rsync and nginx processes with high IO. The problem seemed to point at rsync: judging by the command line alone it was syncing data, and syncing data means disk reads and writes, which pushes disk IO up. The nginx IO could largely be ruled out as a cause; most of it looked like collateral from rsync.

Total DISK READ: 16.52 K/s | Total DISK WRITE: 95.80 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
12420  be/4  root      3.30 K/s    0.00 B/s  0.00 %   2.67 %  [flush-8:0]
14203  be/4  XXXXX     0.00 B/s   76.51 K/s  0.00 %  99.99 %  rsync --server -svlogDtprze.iLs
 9999  be/4  root      0.00 B/s    3.48 K/s  0.00 %  99.99 %  [kswapd0]
14092  be/4  XXXXX    10.43 K/s    0.00 B/s  0.00 %  99.99 %  rsync --server -svlogDtprze.iLs
14252  be/4  nagios  358.22 K/s    0.00 B/s  0.00 %  90.98 %  expr 8191992 - 6286700
26729  be/4  nobody   24.34 K/s   31.30 K/s  0.00 %  86.79 %  nginx: worker process
14143  be/4  XXXXX    24.34 K/s    0.00 B/s  0.00 %  85.31 %  rsync --server -svlogDtprze.iLs
26731  be/4  nobody   17.39 K/s   38.26 K/s  0.00 %  85.23 %  nginx: worker process
26727  be/4  nobody    6.96 K/s   52.17 K/s  0.00 %  85.22 %  nginx: worker process
  705  be/4  nobody    6.96 K/s    0.00 B/s  0.00 %  77.00 %  php-fpm: pool www
12797  be/4  nobody    0.00 B/s    0.00 B/s  0.00 %  75.87 %  php-fpm: pool www
26728  be/4  nobody   24.34 K/s   62.60 K/s  0.00 %  59.75 %  nginx: worker process
26733  be/4  nobody   27.82 K/s   62.60 K/s  0.00 %  58.20 %  nginx: worker process
  353  be/4  XXXXX     0.00 B/s    6.61 K/s  0.00 %   0.00 %  douDiZhuServerTasker_7
  354  be/4  XXXXX     0.00 B/s    3.30 K/s  0.00 %   0.00 %  douDiZhuServerWorker_0
  358  be/4  XXXXX     0.00 B/s    6.61 K/s  0.00 %   0.00 %  douDiZhuServerWorker_2
  360  be/4  XXXXX     0.00 B/s    3.30 K/s  0.00 %   0.00 %  douDiZhuServerWorker_3
  361  be/4  XXXXX     0.00 B/s    3.30 K/s  0.00 %   0.00 %  douDiZhuServerWorker_4
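The swap-pressure pattern described above (swap in use, then IO, then load) can be confirmed quickly with two stock procps tools; this is a generic check, not the author's exact commands:

```shell
# Memory and swap at a glance; a large "used" value in the Swap row plus
# nonzero si/so columns in vmstat usually means the IO spike is swap-driven.
free -m
vmstat 1 3    # watch si/so (swap in/out, KB/s) and wa (CPU time waiting on IO)
```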

3. Trace the process and check with the business owners
The investigation above pointed at this rsync command. I didn't know what the process actually did, so I went straight to the business side (the developers) and asked.
What they told me:
The script is indeed used for data synchronization: whenever another business unit generates new data, it syncs that data over, because this business needs the latest data. It is therefore a resident script that runs every minute.
By normal business logic, if this script were really the problem, similar symptoms should have appeared long ago, not only today; moreover, the script is deployed on multiple machines serving the same business, and only this one machine was in trouble. The logic didn't hold.
Still unwilling to let it go, I checked whether the script had hit some anomaly, e.g. in some minute the upstream produced so much new data that the sync volume blew up. To verify this, I had the developers pull the script's sync volumes for the past 10 days. No time window showed unusually large volume; if anything, the volume synced during the alert window had dropped. Awkward.
The developers immediately turned it around on me: your machine is hurting my business......
To avoid taking the blame, I had to keep digging.

4、思考
整个问题的排查都是比较直观的,直接在报警的时间段内,检查排查,得到的都是线上一手数据,不可能出错,肯定是自己遗漏了那些内容。 想要找到遗漏的内容,要么重新负载重现,要么将历史数据拿出来,再次仔细的排查一遍。
寻找数据的规律,寻找报警异常的源头永远都是排查问题的重要手段及方向。

5、排查历史数据
因为线上出现问题,晚上一般无法做到立刻排查问题,或为了记录一些短暂性报警的实时状态,做了一个top脚本,将top、iotop等命令让其每分钟都执行,记录在文件中,方便查看历史状态。
我们报警是连续三分钟达到阈值才会报警,报警时间是在9点39分,那么实际出现问题的时间是在37分就开始了
排查历史记录发现,负载呈现上升趋势,是从36分开始增长的,因此问题的着重点要排查在36分之前出现过什么脚本在执行,36分又出现了什么新脚本在执行,或36分前后的异常对比
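The recorder described above might look roughly like this one-shot script scheduled from cron every minute (the path, filename, and line counts are illustrative assumptions, not the author's actual script):

```shell
#!/bin/sh
# Append one status snapshot per invocation; schedule from cron, e.g.:
#   * * * * * /usr/local/bin/record_load.sh
LOG=/tmp/load_history.log     # illustrative path
{
    date '+==== %F %T ===='
    top -b -n 1 | head -n 20                    # batch-mode top snapshot
    iotop -b -o -n 1 2>/dev/null | head -n 20   # needs root; ignore if absent
} >> "$LOG"
```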

(1) Load average, minute by minute

top - 09:32:02 up 824 days,  4:35,  1 user,  load average: 7.32, 5.57, 3.50
top - 09:33:03 up 824 days,  4:36,  1 user,  load average: 5.58, 5.47, 3.60
top - 09:34:02 up 824 days,  4:37,  1 user,  load average: 5.81, 5.65, 3.78
top - 09:35:03 up 824 days,  4:38,  1 user,  load average: 6.44, 5.84, 3.96
top - 09:36:03 up 824 days,  4:39,  1 user,  load average: 14.31, 8.58, 5.04
top - 09:36:14 up 824 days,  4:39,  1 user,  load average: 17.07, 9.36, 5.33
top - 09:37:03 up 824 days,  4:40,  1 user,  load average: 26.44, 13.17, 6.84
top - 09:37:14 up 824 days,  4:40,  1 user,  load average: 38.21, 16.13, 7.87
top - 09:37:39 up 824 days,  4:41,  1 user,  load average: 56.97, 22.84, 10.37
top - 09:37:49 up 824 days,  4:41,  1 user,  load average: 58.90, 24.39, 11.00
top - 09:37:57 up 824 days,  4:41,  1 user,  load average: 57.39, 24.65, 11.16
top - 09:38:05 up 824 days,  4:41,  1 user,  load average: 58.57, 25.94, 11.72
top - 09:38:06 up 824 days,  4:41,  1 user,  load average: 58.57, 25.94, 11.72
top - 09:38:14 up 824 days,  4:41,  1 user,  load average: 65.09, 28.41, 12.68
top - 09:38:23 up 824 days,  4:41,  2 users, load average: 67.81, 29.58, 13.14
top - 09:38:32 up 824 days,  4:42,  2 users, load average: 73.38, 32.03, 14.11
top - 09:38:37 up 824 days,  4:42,  2 users, load average: 76.15, 33.29, 14.62
top - 09:38:45 up 824 days,  4:42,  2 users, load average: 79.35, 35.39, 15.50
top - 09:38:51 up 824 days,  4:42,  2 users, load average: 82.12, 36.69, 16.03
top - 09:39:00 up 824 days,  4:42,  3 users, load average: 87.10, 39.25, 17.08
top - 09:39:03 up 824 days,  4:42,  3 users, load average: 87.10, 39.25, 17.08
top - 09:39:13 up 824 days,  4:42,  3 users, load average: 99.87, 43.56, 18.72
top - 09:39:24 up 824 days,  4:42,  3 users, load average: 111.69, 47.94, 20.41
top - 09:39:33 up 824 days,  4:43,  3 users, load average: 104.74, 48.55, 20.90

(2) State at 9:33
Why show only nginx's resource usage here? Because that is where I had already spotted the anomaly; processes showing nothing unusual are omitted. The nginx CPU figures show continuous growth from 9:33 right up to the alert.

26726 nobody 20  0 88296  32m 1820 S 35.8  0.4  17:52.92 nginx: worker process
26730 nobody 20  0 88428  33m 1820 R 34.1  0.4  16:58.21 nginx: worker process
26731 nobody 20  0 88692  33m 1820 S 27.6  0.4  17:34.17 nginx: worker process
26733 nobody 20  0 88600  33m 1820 R 27.6  0.4  17:00.97 nginx: worker process
26729 nobody 20  0 88824  33m 1820 S 26.0  0.4  17:25.51 nginx: worker process
26728 nobody 20  0 88612  33m 1820 S 24.4  0.4  16:09.17 nginx: worker process

(3) State at 9:34

26727 nobody 20  0 88828  33m 1820 R 55.8  0.4  17:14.88 nginx: worker process
26730 nobody 20  0 88428  33m 1820 R 46.7  0.4  16:34.33 nginx: worker process
26733 nobody 20  0 88600  33m 1820 R 42.2  0.4  16:36.15 nginx: worker process
26728 nobody 20  0 88612  33m 1820 R 40.7  0.4  15:40.15 nginx: worker process
26729 nobody 20  0 88824  33m 1820 R 40.7  0.4  16:48.89 nginx: worker process
26731 nobody 20  0 88948  33m 1820 R 37.7  0.4  17:00.28 nginx: worker process
26726 nobody 20  0 88172  32m 1820 R 36.2  0.4  17:25.98 nginx: worker process

(4) State at 9:35

26730 nobody 20  0 88816  33m 1816 D 57.0  0.4  17:34.14 nginx: worker process
26731 nobody 20  0 88696  33m 1816 D 52.6  0.4  18:04.44 nginx: worker process
26726 nobody 20  0 88556  33m 1816 D 48.7  0.4  18:24.47 nginx: worker process
26727 nobody 20  0 88824  33m 1816 D 46.9  0.4  18:11.50 nginx: worker process
26729 nobody 20  0 88828  33m 1816 D 46.7  0.4  17:54.94 nginx: worker process
26733 nobody 20  0 88604  33m 1816 D 46.7  0.4  17:36.34 nginx: worker process
26728 nobody 20  0 88616  33m 1816 D 33.8  0.4  16:36.52 nginx: worker process
26732 nobody 20  0 88232  32m 1816 D 32.3  0.4  17:50.30 nginx: worker process

(5) State at 9:36

26732 nobody 20  0 88672  33m 1720 R 80.4  0.4  18:19.66 nginx: worker process
26733 nobody 20  0 88728  33m 1720 R 77.5  0.4  18:01.45 nginx: worker process
26728 nobody 20  0 88612  33m 1720 D 53.9  0.4  16:58.75 nginx: worker process
26727 nobody 20  0 88824  33m 1720 D 44.3  0.4  18:41.69 nginx: worker process
26731 nobody 20  0 88692  33m 1720 D 44.3  0.4  18:31.20 nginx: worker process

(6) Nginx IO usage

09:35:12  5362 be/4 nobody   19.68 K/s    0.00 B/s  0.00 % 55.10 % php-fpm: pool www
09:35:12 26728 be/4 nobody   49.19 K/s   85.26 K/s  0.00 % 55.08 % nginx: worker process
09:35:12 26726 be/4 nobody   72.14 K/s   75.42 K/s  0.00 % 49.69 % nginx: worker process
09:35:12  4967 be/4 nobody   52.47 K/s    0.00 B/s  0.00 % 37.95 % php-fpm: pool www
09:35:12 26729 be/4 nobody   36.07 K/s   98.38 K/s  0.00 % 35.19 % nginx: worker process
09:35:12 12241 be/4 nobody    0.00 B/s    0.00 B/s  0.00 % 29.30 % php-fpm: pool www
09:35:12 26731 be/4 nobody   62.31 K/s  108.22 K/s  0.00 % 28.04 % nginx: worker process
09:35:12 26732 be/4 nobody  118.05 K/s  190.20 K/s  0.00 % 26.66 % nginx: worker process
09:35:12 24799 be/4 nobody    6.56 K/s    0.00 B/s  0.00 % 24.03 % php-fpm: pool www
09:35:12 11119 be/4 nobody    0.00 B/s    0.00 B/s  0.00 % 23.47 % php-fpm: pool www
09:35:12 26733 be/4 nobody   52.47 K/s  154.13 K/s  0.00 % 22.41 % nginx: worker process
09:35:12 26727 be/4 nobody   62.31 K/s   75.42 K/s  0.00 % 20.70 % nginx: worker process
09:35:12 26730 be/4 nobody   59.03 K/s  144.29 K/s  0.00 % 17.74 % nginx: worker process

(7) Cross-check with Zabbix

6. Looking for patterns in the nginx logs

The previous step showed nginx resource usage growing, and a comparison against the Zabbix monitoring graphs matched. So the problem was pinned on nginx.

(0) Log format

head -n 1 /tmp/load.log
access.log:[27/Sep/2018:09:34:00 +0800] [xxxxx] [222460266] [GET /tg_templates/doubleone/2016/servertime/getInfo.php HTTP/1.0] [200] [43] [-] [%E5%90%8C%E8%8A%B1%E9%A1%BA/10.00.33 CFNetwork/808.1.4 Darwin/16.1.0] [223.104.216.124, 10.99.11.9, 223.104.216.124] [0.002] [unix:/dev/shm/phpfpm5.4.socket] [0.000] [xxxxxx] [-] [/tg_templates/doubleone/2016/servertime/getInfo.php] [80] [-] []

(1) Check for abnormal request volume
Pull the logs from before and after the alert window and filter them separately. The URI prefix /img/ad shows up at very high frequency (only the prefix is kept, rather than the full URI, so that identical URIs group together): up to ten thousand requests per minute.

awk -F ']' '{print $1" "$4}' /tmp/load1.log | tr -d '[' | awk -F '[: ]' '{print $3":"$4" "$9}' | awk -F '/' '{print $1"/"$2"/"$3}' | sort | uniq -c | sort -nr | head
   8781 09:29 /img/ad
   8630 09:28 /img/ad
   8328 09:27 /img/ad
   3421 09:29 /interface/trade
   3004 09:28 /interface/trade
   2884 09:27 /interface/trade
   1542 09:29 /tg_templates/doubleone
   1492 09:28 /tg_templates/doubleone
   1446 09:29 /api/index.php?action=getadlist2&platform=gphone
   1393 09:27 /tg_templates/doubleone

awk -F ']' '{print $1" "$4}' /tmp/load.log | tr -d '[' | awk -F '[: ]' '{print $3":"$4" "$9}' | awk -F '/' '{print $1"/"$2"/"$3}' | sort | uniq -c | sort -nr | head
  11128 09:30 /img/ad
  10965 09:31 /img/ad
  10941 09:34 /img/ad
  10194 09:32 /img/ad
  10137 09:33 /img/ad
   8181 09:35 /img/ad
   6505 09:36 /img/ad
   5196 09:30 /interface/trade
   4783 09:34 /interface/trade
   4289 09:33 /interface/trade

(2) Check for anomalies in the slow-request counts
Extract every request with a response time above 1 second and sort by time. The slow entries turn out to peak between 9:35 and 9:36. For comparison, the overall volume at 9:35 and 9:36 was over 8,000 requests per minute.
Knowing the request is slow, however, does not fully prove that this slow request is the cause.

grep "/img/ad/indexNav/20180926" /tmp/load1.log ?|awk -F ‘]‘ ‘{print $1" "$(NF-9)}‘ |tr -d ‘[‘ | awk -F ‘[: ]‘ ‘{print $3":"$4" "$(NF)}‘|awk ‘($NF>1){print $1}‘ |sort |uniq -c |sort -nr |head ?????57 09:27 ?????8 09:28 ?????5 09:29grep "/img/ad/indexNav/20180926" /tmp/load.log ?|awk -F ‘]‘ ‘{print $1" "$(NF-9)}‘ |tr -d ‘[‘ | awk -F ‘[: ]‘ ‘{print $3":"$4" "$(NF)}‘|awk ‘($NF>1){print $1}‘ |sort |uniq -c |sort -nr |head ???2012 09:36 ??1197 09:35 ???565 09:30 ???200 09:34 ????70 09:33 ????31 09:32 ????19 09:31

(3) Check the other slow requests
Other slow requests were essentially absent, which rules out other slow requests as the cause. The problem narrows to the URI "/img/ad/indexNav/20180926".

grep "/interface/trade" /tmp/load.log ?|awk -F ‘]‘ ‘{print $1" "$(NF-9)}‘ |tr -d ‘[‘ | awk -F ‘[: ]‘ ‘{print $3":"$4" "$(NF)}‘|awk ‘($NF>1){print $1}‘ |sort |uniq -c |sort -nr |head ??????6 09:36 ?????3 09:35 ?????2 09:30 ?????1 09:33 

(4) Check the actual slow-request latencies

grep "/img/ad/indexNav/20180926" /tmp/load.log ?|awk -F ‘]‘ ‘{print $1" "$(NF-9)}‘ |tr -d ‘[‘ | awk -F ‘[: ]‘ ‘{print $3":"$4" "$(NF)}‘|awk ‘($NF>1){print $0}‘ |sort|uniq -c |sort -nr |head ?????38 09:36 2.478 ????34 09:36 1.576 ????33 09:36 4.484 ????32 09:35 7.072 ????30 09:36 3.138 ????29 09:36 3.845 ????29 09:36 3.548 ????29 09:36 2.677 ????24 09:35 3.508 ????23 09:36 2.843

(5) Check the full slow-request URLs
Look at the complete URLs to pin down the most frequently occurring URIs, and focus the investigation on those.

grep "/img/ad/indexNav/20180926" /tmp/load.log ?|awk -F ‘]‘ ‘{print $1" "$4" "$(NF-9)}‘|tr -d ‘[‘|awk -F ‘[: ]‘ ‘{print $3":"$9" "$8" "$(NF)}‘|awk ‘($NF>1){print $0}‘ |head09:/img/ad/indexNav/20180926/1537959962_9815.png GET 1.32509:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.32709:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.13609:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.13409:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.13409:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.57009:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.38009:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.80609:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.38009:/img/ad/indexNav/20180926/1537960001_6674.png GET 1.386

(6) Examine the full log entries
The point of looking at complete log lines is to spot anomalies in the individual fields. From the full entries, the body field stood out: response bodies of about 300,000 bytes, i.e. roughly 300 KB, whereas around 20 is normal for this field. 300 KB is clearly abnormal.
Do the math: 10,000 requests per minute, 300 KB each, is 3,000,000 KB per minute; divided by 60 seconds, that is about 50 MB of disk reads per second. Of course the IO went through the roof....
Root cause essentially located.
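The back-of-the-envelope arithmetic above can be checked directly in the shell:

```shell
# ~10,000 requests/min, each serving a ~300 KB body.
echo "$((10000 * 300)) KB per minute"        # 3000000 KB/min, i.e. ~3 GB/min
echo "$((10000 * 300 / 60)) KB per second"   # 50000 KB/s, i.e. ~50 MB/s
```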

adm.access.log:[27/Sep/2018:09:34:00 +0800] [192.168.215.91] [363346437] [GET /img/ad/indexNav/20180926/1537960001_6674.png HTTP/1.1] [200] [294880] [-] [%E5%90%8C%E8%8A%B1%E9%A1%BA/10.30.14 CFNetwork/808.1.4 Darwin/16.1.0] [122.238.114.23] [0.004] [-] [-] [XXXXX] [-] [/img/ad/indexNav/20180926/1537960001_6674.png] [80] [-] []
adm.access.log:[27/Sep/2018:09:34:00 +0800] [192.168.205.23] [-] [GET /img/ad/indexNav/20180926/1537959962_9815.png HTTP/1.1] [200] [294880] [-] [platform=gphone&version=G037.08.330.1.32] [223.104.215.36, 10.101.8.1] [0.000] [-] [-] [XXXXX] [-] [/img/ad/indexNav/20180926/1537959962_9815.png] [80] [-] []
adm.access.log:[27/Sep/2018:09:34:00 +0800] [192.168.210.142] [-] [GET /img/ad/indexNav/20180926/1537959962_9815.png HTTP/1.1] [200] [294880] [-] [platform=gphone&version=G037.08.338.1.32] [221.215.205.170, 10.99.2.9] [0.002] [-] [-] [XXXXX] [-] [/img/ad/indexNav/20180926/1537959962_9815.png] [80] [-] []

(7) Could a crawler be the cause?
A crawler hammering one page many times a minute can flatten a machine even when the body is small, let alone with an image as large as 300 KB.
In this case, though, a crawler is basically ruled out, because this machine runs an anti-crawler program.

grep "/img/ad/indexNav/20180926" /tmp/load.log ?|awk -F ‘]‘ ‘{print $1" "$(NF-10)}‘ |tr -d ‘[‘ | awk -F ‘[ ,]‘ ‘{print $4}‘ |sort |uniq -c |sort -nr |head -n 20 ?????63 117.136.8.233 ????62 42.101.64.192 ????61 36.102.236.184 ????54 117.136.8.69 ????52 112.20.12.199 ????46 117.136.8.74 ????46 112.20.12.205 ????45 112.20.12.194 ????44 117.136.8.78 ????42 117.136.8.66 ????42 117.136.8.65 ????38 117.136.8.224 ????35 223.104.6.27 ????35 223.104.247.7


7. Takeaways
1. Why does every request hit the local image?
This really asks why the image was not cached. Our frontend cache has a 100 KB limit: images larger than 100 KB are automatically excluded from caching, and that is exactly why this machine got flattened. At roughly 300 KB, the image far exceeded the limit, so none of the requests for it were served from cache.
Depending on actual requirements, the frontend could consider relaxing the cache size limit.
The image was a PNG; a test conversion to JPG brought it down to just 19 KB. Why the product team needed such a large image is unclear, perhaps for visual quality.
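The PNG-to-JPG test mentioned above could be as simple as an ImageMagick one-liner (the filename and quality setting are illustrative; this assumes ImageMagick's convert is installed):

```shell
# Re-encode the PNG as JPG and compare file sizes; quality 85 is usually
# visually indistinguishable for banner-style images.
convert banner.png -quality 85 banner.jpg
ls -lh banner.png banner.jpg
```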

2. Why were there no alerts at other times?
Due to the nature of the business, the market opens at 9:30 each morning, so traffic concentrates around that time. The image had been pushed by the product team only the previous day, hence today's alert; there was no afternoon alert because traffic had dropped off.

8. Remediation
1. Development enforces image size limits and reviews images before release.
2. Operations tunes the caching: cache size limits must be set according to the image sizes development provides, combined with the needs of the business.
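If the frontend cache in question were nginx's proxy cache, a size cutoff of the kind described above might be sketched roughly as follows (the zone name, paths, threshold, and upstream are illustrative assumptions; the article does not specify the actual cache layer):

```nginx
http {
    proxy_cache_path /var/cache/nginx keys_zone=img_cache:10m max_size=1g;

    # Content-Length with 6+ digits means >= 100,000 bytes: skip the cache.
    map $upstream_http_content_length $skip_cache {
        default      0;
        "~^\d{6,}$"  1;
    }

    server {
        location /img/ {
            proxy_cache    img_cache;
            proxy_no_cache $skip_cache;   # do not store oversized bodies
            proxy_pass     http://backend;
        }
    }
}
```

Raising the digit threshold in the map (or removing the rule) is how the "relax the cache size limit" option from section 7 would be expressed here.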


Original article: http://blog.51cto.com/blxueyuan/2286847
