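When debugging a program on Linux, developers often need a detailed picture of its I/O activity. The usual tools for watching I/O are vmstat and iostat; of the two, iostat is the more capable, with finer-grained, per-interval I/O statistics. Both, however, only show I/O system-wide, and aggregate numbers are no help in working out which files the I/O is going to. This is where block_dump proves useful.
Stop syslog first. The per-I/O details are emitted through printk, so if syslog is running it writes them to the messages log, which generates a large amount of I/O of its own and distorts the results.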
```
suse:~ # service syslog stop
Shutting down syslog services                                         done
```
Then enable block_dump:
```
suse:~ # echo 1 > /proc/sys/vm/block_dump
```
Here is the result first:
```
suse:~ # dmesg | tail
dmesg(3414): dirtied inode 9594 (LC_MONETARY) on sda1
dmesg(3414): dirtied inode 9238 (LC_COLLATE) on sda1
dmesg(3414): dirtied inode 9241 (LC_TIME) on sda1
dmesg(3414): dirtied inode 9606 (LC_NUMERIC) on sda1
dmesg(3414): dirtied inode 9350 (LC_CTYPE) on sda1
kjournald(506): WRITE block 3683672 on sda1
kjournald(506): WRITE block 3683680 on sda1
kjournald(506): WRITE block 3683688 on sda1
kjournald(506): WRITE block 3683696 on sda1
kjournald(506): WRITE block 3683704 on sda1
kjournald(506): WRITE block 3683712 on sda1
kjournald(506): WRITE block 3683720 on sda1
kjournald(506): WRITE block 3683728 on sda1
kjournald(506): WRITE block 3683736 on sda1
kjournald(506): WRITE block 3683744 on sda1
```
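The dmesg output shows which files are being touched, along with the process (name and PID), the inode number, the file name and the disk device. What it does not show is how much was written to each file: the "dirtied inode" lines alone cannot tell you that, so you also have to analyze the "WRITE block" lines. The number that follows is not the real filesystem block number but the sector number as seen by the kernel I/O layer. Dividing it by 8 (4 KB filesystem blocks, 512-byte sectors) gives the block number, and the icheck and ncheck commands of the debugfs tool can then tell you which file that filesystem block belongs to. Search online for details of their usage.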
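The conversion described above can be sketched in C. This is only an illustration under stated assumptions: the helper names (`parse_bio_line`, `sector_to_fs_block`, `format_icheck_cmd`) are hypothetical, the log lines are assumed to carry no kernel timestamp prefix, and the filesystem block size has to be supplied by the caller (4096 in the example output above).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: the kernel I/O layer counts in 512-byte sectors. */
#define SECTOR_SIZE 512u

/* One READ/WRITE record from the block_dump output (hypothetical type). */
struct bio_event {
    char comm[32];              /* process name */
    int pid;                    /* process ID */
    char op[8];                 /* "READ" or "WRITE" */
    unsigned long long sector;  /* sector number printed after "block" */
    char dev[32];               /* device name, e.g. "sda1" */
};

/*
 * Parse a line such as "kjournald(506): WRITE block 3683672 on sda1".
 * Returns 1 on success, 0 for anything else (e.g. "dirtied inode" lines).
 */
int parse_bio_line(const char *line, struct bio_event *ev)
{
    int n = sscanf(line, "%31[^(](%d): %7s block %llu on %31s",
                   ev->comm, &ev->pid, ev->op, &ev->sector, ev->dev);
    if (n != 5)
        return 0;
    return strcmp(ev->op, "READ") == 0 || strcmp(ev->op, "WRITE") == 0;
}

/* Convert a sector number to a filesystem block number. */
unsigned long long sector_to_fs_block(unsigned long long sector,
                                      unsigned int fs_block_size)
{
    unsigned int sectors_per_block = fs_block_size / SECTOR_SIZE;

    assert(sectors_per_block > 0);
    return sector / sectors_per_block;
}

/* Build the debugfs command that maps the block back to an inode. */
int format_icheck_cmd(const struct bio_event *ev, unsigned int fs_block_size,
                      char *buf, size_t len)
{
    return snprintf(buf, len, "debugfs -R \"icheck %llu\" /dev/%s",
                    sector_to_fs_block(ev->sector, fs_block_size), ev->dev);
}
```

The inode number reported by icheck can then be passed to `debugfs -R "ncheck <inode>" /dev/sda1` to recover the path name. The functions are meant to be called from your own `main`, for instance in a loop over the lines read from dmesg.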
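The principle behind block_dump is simple. At the point where I/O is submitted to the disk, the kernel checks the block_dump flag and, if it is set, intercepts every BIO passing through and prints its details: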
```c
void submit_bio(int rw, struct bio *bio)
{
	int count = bio_sectors(bio);

	bio->bi_rw |= rw;

	/*
	 * If it's a regular read/write or a barrier with data attached,
	 * go through the normal accounting stuff before submission.
	 */
	if (bio_has_data(bio) && !(rw & REQ_DISCARD)) {
		if (rw & WRITE) {
			count_vm_events(PGPGOUT, count);
		} else {
			task_io_account_read(bio->bi_size);
			count_vm_events(PGPGIN, count);
		}

		if (unlikely(block_dump)) {
			char b[BDEVNAME_SIZE];
			printk(KERN_DEBUG "%s(%d): %s block %Lu on %s (%u sectors)\n",
				current->comm, task_pid_nr(current),
				(rw & WRITE) ? "WRITE" : "READ",
				(unsigned long long)bio->bi_sector,
				bdevname(bio->bi_bdev, b),
				count);
		}
	}

	generic_make_request(bio);
}
```
The mapping between the number in a "WRITE block" line and the filesystem block number is determined in the submit_bh function:
```c
bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
```
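As a worked example of this relation, assume a buffer size (`b_size`) of 4096 bytes; that value is an assumption here, not something the log itself reports. Shifting right by 9 divides by the 512-byte sector size, so each filesystem block covers 8 sectors. Applied to the first WRITE line of the dmesg output above:

```latex
\begin{align*}
\text{bi\_sector} &= \text{b\_blocknr} \times (\text{b\_size} \gg 9) \\
                  &= \text{b\_blocknr} \times (4096 \gg 9)
                   = \text{b\_blocknr} \times 8 \\
\text{b\_blocknr} &= 3683672 / 8 = 460459
\end{align*}
```

This also accounts for the consecutive WRITE lines in that output differing by exactly 8: under the same assumption, kjournald is writing one adjacent filesystem block after another.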
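The inode side of block_dump is handled by block_dump___mark_inode_dirty. This time the checkpoint sits on the path where an inode is marked dirty, and the details of every inode passing through are printed: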
```c
/* Excerpt: only the block_dump hook of __mark_inode_dirty is shown. */
void __mark_inode_dirty(struct inode *inode, int flags)
{
	if (unlikely(block_dump))
		block_dump___mark_inode_dirty(inode);
}

static noinline void block_dump___mark_inode_dirty(struct inode *inode)
{
	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
		struct dentry *dentry;
		const char *name = "?";

		/* Look up a dentry for the inode so its name can be printed. */
		dentry = d_find_alias(inode);
		if (dentry) {
			spin_lock(&dentry->d_lock);
			name = (const char *) dentry->d_name.name;
		}
		printk(KERN_DEBUG
		       "%s(%d): dirtied inode %lu (%s) on %s\n",
		       current->comm, task_pid_nr(current), inode->i_ino,
		       name, inode->i_sb->s_id);
		if (dentry) {
			spin_unlock(&dentry->d_lock);
			dput(dentry);
		}
	}
}
```
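1. The kernel has many suitable checkpoints for intercepting I/O information. Even without modifying the kernel, jprobe can be used to hook a great deal.
2. debugfs is too slow when a large number of block-to-file conversions are needed. Writing your own tool on top of ext2fs should be considerably more efficient.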