Bug 41902
Summary: | CFQ freezing the platform | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Joan (jom) |
Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
Status: | RESOLVED INSUFFICIENT_DATA | ||
Severity: | high | CC: | akpm, alan, jom, vivek.goyal2008 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.0.3 | Subsystem: | |
Regression: | Yes | Bisected commit-id: |
Description
Joan
2011-08-28 18:53:57 UTC
Is this a regression? Was 2.6.39 OK? Thanks. - can you please capture the blktrace till the time of hang - Is it possible to do some bisection and point to culprit commit. I did not have to rebuilt anything under 2.6.x (maybe because it was actually working but I can not prove it) Because it hanged once with 3.0.x, I had to rebuild the entire RAID1 (1.5 TB) that is when I notived the problem. It is a production machine so I do not have much space to make tests. I can only tell you that "deadline" instead of CFQ solve completely the situation. I can read on the web many articles about "raid1 freeze" and a solution being "change IO scheduler to deadline". That is how I got to that, and in deed, the change of scheduler fixes the problems. Furthermore, as the machine completely hangs, there is nothing written on disk that can be traced afterward. The system reboot with broken file system (fsck, then RAID rebuilt), that is the situation. My 2 cents would be to put CFQ as "experimental" in the kernel until the bug is found. (i.e. avoid to have it in any production environement) I did not have to rebuilt anything under 2.6.x (maybe because it was actually working but I can not prove it) Because it hanged once with 3.0.x, I had to rebuild the entire RAID1 (1.5 TB) that is when I notived the problem. It is a production machine so I do not have much space to make tests. I can only tell you that "deadline" instead of CFQ solve completely the situation. I can read on the web many articles about "raid1 freeze" and a solution being "change IO scheduler to deadline". That is how I got to that, and in deed, the change of scheduler fixes the problems. Furthermore, as the machine completely hangs, there is nothing written on disk that can be traced afterward. The system reboot with broken file system (fsck, then RAID rebuilt), that is the situation. My 2 cents would be to put CFQ as "experimental" in the kernel until the bug is found. (i.e. avoid to have it in any production environment) |