Thursday, June 26, 2008

kernel concepts: confusing dentry & nameidata

At first, the dentry is not an easy concept to grasp within the VFS. On one hand, unlike the superblock, inode, and similar structures, it does not map directly to an on-disk image. On the other hand, it is also unlike the file and vfsmount structures, which, although they have no on-disk counterpart, correspond directly to process activities such as open or mount. That is to say, the dentry is not a surface-level concept that one can grasp at first sight.

The dentry is used for pathname resolution. However, during the pathname resolution procedure (path_lookup or do_path_lookup), there is actually another structure involved: struct nameidata. The two are indeed confusing at first.

Their major differences (not clearly stated in the books and articles I have read) are:

nameidata variables are always temporary and declared locally; they live only for the duration of a pathname resolution, even though we see their pointers being passed around. They are never allocated with kmalloc or kmem_cache_alloc.
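
To see what "temporary and locally declared" means in practice, here is a minimal sketch of how a caller typically uses nameidata (a hedged illustration of the common pattern in fs/namei.c, not a verbatim kernel excerpt; in 2.6.25 the dentry/vfsmount pair lives inside an embedded struct path, while older kernels expose nd.dentry / nd.mnt directly):

#include <linux/namei.h>        /* struct nameidata, path_lookup(), LOOKUP_FOLLOW */
#include <linux/err.h>

/* Hypothetical helper: resolve a pathname and hand back its dentry.
 * Reference counting is omitted for brevity.                              */
static struct dentry *lookup_example(const char *pathname)
{
        struct nameidata nd;    /* lives on the kernel stack only          */
        int err;

        err = path_lookup(pathname, LOOKUP_FOLLOW, &nd);
        if (err)
                return ERR_PTR(err);

        /* The nameidata dies with this function; the dentry it found
         * stays in the dcache.                                            */
        return nd.path.dentry;
}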

dentry objects, on the other hand, are global entities. Even though they are created dynamically during pathname lookup, they are not arbitrary. If we first look up the file /aa/bb/cc.txt and afterwards the file /aa/bb/dd.txt, the dentry for 'bb' will be exactly the same object in both lookups: obtained the first time with d_alloc, and found again the second time with d_lookup.
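
To illustrate this caching behaviour, here is a hedged sketch of the "check the dcache first, allocate only on a miss" pattern, a simplification of what real_lookup()/do_lookup() do in fs/namei.c (locking and error handling omitted; lookup_one_component is a hypothetical name):

#include <linux/dcache.h>       /* d_lookup(), d_alloc(), struct qstr */
#include <linux/fs.h>

/* Resolve one pathname component 'name' under the directory 'parent'. */
static struct dentry *lookup_one_component(struct dentry *parent,
                                           struct qstr *name)
{
        struct dentry *dentry;

        /* Fast path: 'bb' looked up a second time is found right here,
         * and the very same object is returned.                           */
        dentry = d_lookup(parent, name);
        if (dentry)
                return dentry;

        /* Slow path: allocate a fresh dentry and let the filesystem's
         * ->lookup() attach the inode.  (The real code also handles the
         * case where ->lookup() returns a different dentry.)              */
        dentry = d_alloc(parent, name);
        if (!dentry)
                return NULL;
        parent->d_inode->i_op->lookup(parent->d_inode, dentry, NULL);
        return dentry;
}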

Another important mission: since inode objects are unnamed entities, it is the dentry object that, inside the kernel, associates them with their proper names, through its d_inode field.
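
For reference, here is a trimmed view of the dentry fields relevant to this discussion (reordered and heavily abridged from include/linux/dcache.h around 2.6.25; not a verbatim excerpt):

struct dentry {
        struct dentry    *d_parent;   /* dentry of the parent directory          */
        struct qstr       d_name;     /* the name component this dentry carries  */
        struct inode     *d_inode;    /* the otherwise-nameless inode it names   */
        struct list_head  d_subdirs;  /* children of this dentry                 */
        int               d_mounted;  /* non-zero if a filesystem is mounted here */
        /* ... reference count, hash chains, d_op, d_sb, and so on ...           */
};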

kernel hacking with GDB: VFS mount

"In most traditional Unix-like kernels, each filesystem can be mounted only once. However, Linux is different: it is possible to mount the same filesystem several times. Of course, if a filesystem is mounted n times, its root directory can be accessed through n mount points, one per mount operation. Although the same filesystem can be accessed by using different mount points, it is really unique. Thus, there is only one superblock object for all of them, no matter of how many times it has been mounted." -- quoted from Understanding the Linux Kernel.

So, let's do some experiments to verify it:

First, we need GDB to watch the kernel's memory, typically:
gdb -x /dev/shm/gdb-script /usr/src/lki/vmlinux /proc/kcore
where 'gdb-script' is my tiny GDB script to facilitate these observations, and /usr/src/lki/vmlinux is the running kernel's ELF image file. Note that you need CONFIG_DEBUG_INFO set, which turns on GCC's '-g' flag in the Makefile.

Here are the related command lines:

[bshyu@vmBShyu gdb-scripts]$ sudo mount /dev/disk/by-uuid/448c9c30-4a09-40f6-8e01-d3c577cd250d /media/linux
[bshyu@vmBShyu gdb-scripts]$ sudo mount /dev/disk/by-label/FAT32 /media/FAT32

[bshyu@vmBShyu gdb-scripts]$ ls /media/FAT32
bsdbg data knoppix.img mypage setup-cygwin.exe VirtMachine zycova
BShyu linux Recycled temp www

[bshyu@vmBShyu gdb-scripts]$ sudo mount --bind /media/FAT32 /media/FAT32/temp
[bshyu@vmBShyu gdb-scripts]$ sudo mount --bind /media/linux /media/FAT32/linux/


Now we are going to play with GDB. But first, here is the aforementioned tiny GDB script, which will help us observe Linux's linked-list data structures (assuming you are already familiar with Linux's struct list_head, :-)):

define bsScan
    # get the address of the list_head
    set $head = (struct list_head *) $arg0
    # i.e. offsetof(struct $arg1, $arg2):
    # set $CStructOffset = (unsigned long)(&((struct $arg1 *)0)->$arg2)
    set $CStructOffset = &(((struct $arg1 *)0)->$arg2)
    # set $CStruct = $arg1

    printf "\n================================================================\n"
    printf "\tdisplay list of structures\n"
    print *($arg0)
    printf "================================================================\n"

    set $ptr = $head->next
    bsNext $arg0 $arg1 $arg2
end

define bsNext
    if $ptr != $head
        set $ent = ((struct $arg1 *)((char *)($ptr) - (unsigned)$CStructOffset))
        printf "\n\n#--------------- 0x%08x ---------------#\n", $ent
        print *$ent
        set $ptr = $ptr->next
        printf "\nUse print *$ent to show list entity\n"
        printf "Use set $a = $ent to store the entity\n"
    end
end
document bsScan
-----------------------------------------------------------
Bernard Shyu's LIST display script
-----------------------------------------------------------
bsScan
bsNext

Scan a list of elements one by one

Ex. bsScan &inode_in_use inode i_list
Ex. bsNext &inode_in_use inode i_list
end


So, we will start from the INIT process's namespace, which in more than 99% of cases is simply the namespace used by all processes. My kernel version is 2.6.25.4, in which the namespace structure has gone through a big change since the ULK book was written: the process's namespace field has been replaced by nsproxy :-) (You always need to note the kernel version you are using when you tell a story about the Linux kernel; otherwise readers will very likely miss the point.)
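
For orientation, here is a trimmed view of the pointer chain we are about to walk (field names as in 2.6.25, with most members omitted; not a verbatim excerpt):

struct task_struct {
        /* ... */
        struct nsproxy *nsproxy;        /* replaces the old ->namespace field */
};

struct nsproxy {
        /* ... */
        struct mnt_namespace *mnt_ns;   /* the mount namespace                */
};

struct mnt_namespace {
        struct vfsmount *root;          /* root vfsmount of this namespace    */
        struct list_head list;          /* flat list of every vfsmount in it  */
        /* ... */
};

/* Hence: init_task.nsproxy->mnt_ns->root is where we start below. */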

(gdb) p *init_task.nsproxy->mnt_ns->root
$114 = {mnt_hash = {next = 0xcf80c808, prev = 0xcf80c808}, mnt_parent = 0xcf80c808,
mnt_mountpoint = 0xcf4018e0, mnt_root = 0xcf4018e0, mnt_sb = 0xcf80dab8, mnt_mounts = {
next = 0xcf80cf98, prev = 0xcf80cf98}, mnt_child = {next = 0xcf80c828, prev = 0xcf80c828},
mnt_flags = 0, mnt_devname = 0xcf802440 "rootfs", mnt_list = {next = 0xcf80cfa8,
prev = 0xcf802410}, mnt_expire = {next = 0xcf80c840, prev = 0xcf80c840}, mnt_share = {
next = 0xcf80c848, prev = 0xcf80c848}, mnt_slave_list = {next = 0xcf80c850,
prev = 0xcf80c850}, mnt_slave = {next = 0xcf80c858, prev = 0xcf80c858}, mnt_master = 0x0,
mnt_ns = 0xcf802408, mnt_count = {counter = 2}, mnt_expiry_mark = 0, mnt_pinned = 0,
mnt_ghosts = 0}

(gdb) bsScan &init_task.nsproxy->mnt_ns->root->mnt_mounts vfsmount mnt_child

================================================================
display list of structures
$115 = {next = 0xcf80cf98, prev = 0xcf80cf98}
================================================================


#--------------- 0xcf80cf78 ---------------#
$116 = {mnt_hash = {next = 0xcf80b408, prev = 0xcf80b408}, mnt_parent = 0xcf80c808,
mnt_mountpoint = 0xcf4018e0, mnt_root = 0xcf421718, mnt_sb = 0xcf1a1318, mnt_mounts = {
next = 0xcf80cf10, prev = 0xcf80cbe0}, mnt_child = {next = 0xcf80c820, prev = 0xcf80c820},
mnt_flags = 0, mnt_devname = 0xcf183590 "/dev/root", mnt_list = {next = 0xcf80cf20,
prev = 0xcf80c838}, mnt_expire = {next = 0xcf80cfb0, prev = 0xcf80cfb0}, mnt_share = {
next = 0xcf80cfb8, prev = 0xcf80cfb8}, mnt_slave_list = {next = 0xcf80cfc0,
prev = 0xcf80cfc0}, mnt_slave = {next = 0xcf80cfc8, prev = 0xcf80cfc8}, mnt_master = 0x0,
mnt_ns = 0xcf802408, mnt_count = {counter = 755}, mnt_expiry_mark = 0, mnt_pinned = 0,
mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) set $root = $ent


The second vfsmount structure printed above is the real root of the filesystem directory tree we will play with shortly. It is the mount structure corresponding to '/', the root directory of every pathname we use (note that in most cases there is only one namespace).
What about the first one (mnt_devname = "rootfs")? It is the initramfs.

"For each mounted filesystem, a circular doubly linked list including all child mounted filesystems. The head of each list is stored in the mnt_mounts field of the mounted filesystem descriptor; moreover, the mnt_child field of the descriptor stores the pointers to the adjacent elements in the list." -- ULK

Note that we have set $root to the struct vfsmount for the root directory (/dev/root), so
$root->mnt_mounts is the head of the linked list of all its child mounts, which we will loop over via the mnt_child field.
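
In kernel C, the traversal that bsScan/bsNext perform by hand looks roughly like this (a hedged sketch using the standard list_for_each_entry helper; show_child_mounts is a hypothetical name):

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mount.h>        /* struct vfsmount */

/* Print the device name of every mount directly below 'parent'. */
static void show_child_mounts(struct vfsmount *parent)
{
        struct vfsmount *child;

        /* mnt_mounts is the list head kept in the parent; mnt_child is
         * the member that links each child into that list.                */
        list_for_each_entry(child, &parent->mnt_mounts, mnt_child)
                printk(KERN_INFO "child mount: %s\n", child->mnt_devname);
}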

(gdb) bsScan &$root->mnt_mounts vfsmount mnt_child

================================================================
display list of structures
$117 = {next = 0xcf80cf10, prev = 0xcf80cbe0}
================================================================


#--------------- 0xcf80cef0 ---------------#
$118 = {mnt_hash = {next = 0xcf80b310, prev = 0xcf80b310}, mnt_parent = 0xcf80cf78,
mnt_mountpoint = 0xcf420128, mnt_root = 0xcf417c70, mnt_sb = 0xcf1526c8, mnt_mounts = {
next = 0xcf80c938, prev = 0xcf80ca48}, mnt_child = {next = 0xcf80ce88, prev = 0xcf80cf90},
mnt_flags = 0, mnt_devname = 0xcf1fca60 "/dev", mnt_list = {next = 0xcf80ce98,
prev = 0xcf80cfa8}, mnt_expire = {next = 0xcf80cf28, prev = 0xcf80cf28}, mnt_share = {
next = 0xcf80cf30, prev = 0xcf80cf30}, mnt_slave_list = {next = 0xcf80cf38,
prev = 0xcf80cf38}, mnt_slave = {next = 0xcf80cf40, prev = 0xcf80cf40}, mnt_master = 0x0,
mnt_ns = 0xcf802408, mnt_count = {counter = 48}, mnt_expiry_mark = 0, mnt_pinned = 0,
mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) bsNext &$root->mnt_mounts vfsmount mnt_child


#--------------- 0xcf80ce68 ---------------#
$119 = {mnt_hash = {next = 0xcf80b3e0, prev = 0xcf80b3e0}, mnt_parent = 0xcf80cf78,
mnt_mountpoint = 0xcf420e38, mnt_root = 0xcf4017b0, mnt_sb = 0xcf80d688, mnt_mounts = {
next = 0xcf80c1c8, prev = 0xcf80cb58}, mnt_child = {next = 0xcf80ce00, prev = 0xcf80cf10},
mnt_flags = 0, mnt_devname = 0xcf1fca28 "/proc", mnt_list = {next = 0xcf80ce10,
prev = 0xcf80cf20}, mnt_expire = {next = 0xcf80cea0, prev = 0xcf80cea0}, mnt_share = {
next = 0xcf80cea8, prev = 0xcf80cea8}, mnt_slave_list = {next = 0xcf80ceb0,
prev = 0xcf80ceb0}, mnt_slave = {next = 0xcf80ceb8, prev = 0xcf80ceb8}, mnt_master = 0x0,
mnt_ns = 0xcf802408, mnt_count = {counter = 6}, mnt_expiry_mark = 0, mnt_pinned = 0,
mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity


(gdb)


#--------------- 0xcf80cde0 ---------------#
$130 = {mnt_hash = {next = 0xcf80b3e8, prev = 0xcf80b3e8}, mnt_parent = 0xcf80cf78,
mnt_mountpoint = 0xcf420ed0, mnt_root = 0xcf401978, mnt_sb = 0xcf80dcd0, mnt_mounts = {
next = 0xcf80cdf8, prev = 0xcf80cdf8}, mnt_child = {next = 0xcf80c9c0, prev = 0xcf80ce88},
mnt_flags = 0, mnt_devname = 0xcf1fc9f0 "/sys", mnt_list = {next = 0xcf80c1d8,
prev = 0xcf80ce98}, mnt_expire = {next = 0xcf80ce18, prev = 0xcf80ce18}, mnt_share = {
next = 0xcf80ce20, prev = 0xcf80ce20}, mnt_slave_list = {next = 0xcf80ce28,
prev = 0xcf80ce28}, mnt_slave = {next = 0xcf80ce30, prev = 0xcf80ce30}, mnt_master = 0x0,
mnt_ns = 0xcf802408, mnt_count = {counter = 1}, mnt_expiry_mark = 0, mnt_pinned = 0,
mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity


Then, to continue scanning through the list of child mounts, we can either repeat the command bsNext &$root->mnt_mounts vfsmount mnt_child, or simply press ENTER (GDB repeats the last command). We will find the mounted children, such as '/dev', '/proc', and '/sys', as shown on my machine.

Not shown here, but if we keep pressing ENTER to observe the child mounts of the root, we will notice that the BIND mounts (mount --bind /media/FAT32 /media/FAT32/temp and mount --bind /media/linux /media/FAT32/linux/) cannot be found here. Where did they go? This is because they are grandchildren, rather than direct children, of the root mount.
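
The same two fields let us walk the mount tree to any depth; here is a hedged recursive sketch (continuing the hypothetical helper above, headers as before):

/* Recursively print the whole mount tree below 'mnt'.  Bind mounts such
 * as the one on /media/FAT32/temp show up at depth 2, under the FAT32
 * mount, which is why a scan of the root's direct children misses them.   */
static void show_mount_tree(struct vfsmount *mnt, int depth)
{
        struct vfsmount *child;

        printk(KERN_INFO "%*s%s\n", depth * 2, "", mnt->mnt_devname);
        list_for_each_entry(child, &mnt->mnt_mounts, mnt_child)
                show_mount_tree(child, depth + 1);
}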

So, let's find the mount for /media/FAT32, and from there find its children.

...
...
(gdb)

#--------------- 0xcf80cab0 ---------------#
$143 = {mnt_hash = {next = 0xcf80b978, prev = 0xcf80b978}, mnt_parent = 0xcf80cf78,
mnt_mountpoint = 0xcf406848, mnt_root = 0xcf56a1c0, mnt_sb = 0xcf294470, mnt_mounts = {
next = 0xcf80cac8, prev = 0xcf80cac8}, mnt_child = {next = 0xcf80c140, prev = 0xcf80c9c0},
mnt_flags = 0,
mnt_devname = 0xcf26f238 "/dev/disk/by-uuid/448c9c30-4a09-40f6-8e01-d3c577cd250d",
mnt_list = {next = 0xcf80cb68, prev = 0xcf80ca58}, mnt_expire = {next = 0xcf80cae8,
prev = 0xcf80cae8}, mnt_share = {next = 0xcf80caf0, prev = 0xcf80caf0}, mnt_slave_list = {
next = 0xcf80caf8, prev = 0xcf80caf8}, mnt_slave = {next = 0xcf80cb00, prev = 0xcf80cb00},
mnt_master = 0x0, mnt_ns = 0xcf802408, mnt_count = {counter = 2}, mnt_expiry_mark = 0,
mnt_pinned = 0, mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) set $L1 = $ent


...
...

(gdb)

#--------------- 0xcf80cbc0 ---------------#
$137 = {mnt_hash = {next = 0xcf80be00, prev = 0xcf80be00}, mnt_parent = 0xcf80cf78,
mnt_mountpoint = 0xcf7098e0, mnt_root = 0xcf709ed0, mnt_sb = 0xcfa68080, mnt_mounts = {
next = 0xcf80cc68, prev = 0xcf80cd78}, mnt_child = {next = 0xcf80cf90, prev = 0xcf80c0b8},
mnt_flags = 0, mnt_devname = 0xc94cc750 "/dev/disk/by-label/FAT32", mnt_list = {
next = 0xcf80cc78, prev = 0xcf80c0c8}, mnt_expire = {next = 0xcf80cbf8,
prev = 0xcf80cbf8}, mnt_share = {next = 0xcf80cc00, prev = 0xcf80cc00}, mnt_slave_list = {
next = 0xcf80cc08, prev = 0xcf80cc08}, mnt_slave = {next = 0xcf80cc10, prev = 0xcf80cc10},
mnt_master = 0x0, mnt_ns = 0xcf802408, mnt_count = {counter = 3}, mnt_expiry_mark = 0,
mnt_pinned = 0, mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) set $F1 = $ent

So we have set two GDB variables, $L1 and $F1, to point to the mounts of interest. Since we did not mount anything under $L1, its mnt_mounts is an empty list, as we can see:

(gdb) p $L1->mnt_mounts
$153 = {next = 0xcf80cac8, prev = 0xcf80cac8}
(gdb) p &$L1->mnt_mounts
$154 = (struct list_head *) 0xcf80cac8


Let's go through the children of $F1.

(gdb) bsScan &$F1->mnt_mounts vfsmount mnt_child

================================================================
display list of structures
$158 = {next = 0xcf80cc68, prev = 0xcf80cd78}
================================================================


#--------------- 0xcf80cc48 ---------------#
$159 = {mnt_hash = {next = 0xcf80be30, prev = 0xcf80be30}, mnt_parent = 0xcf80cbc0,
mnt_mountpoint = 0xcf709f68, mnt_root = 0xcf709ed0, mnt_sb = 0xcfa68080, mnt_mounts = {
next = 0xcf80cc60, prev = 0xcf80cc60}, mnt_child = {next = 0xcf80cd78, prev = 0xcf80cbd8},
mnt_flags = 0, mnt_devname = 0xc94c61d8 "/dev/disk/by-label/FAT32", mnt_list = {
next = 0xcf80cd88, prev = 0xcf80cbf0}, mnt_expire = {next = 0xcf80cc80,
prev = 0xcf80cc80}, mnt_share = {next = 0xcf80cc88, prev = 0xcf80cc88}, mnt_slave_list = {
next = 0xcf80cc90, prev = 0xcf80cc90}, mnt_slave = {next = 0xcf80cc98, prev = 0xcf80cc98},
mnt_master = 0x0, mnt_ns = 0xcf802408, mnt_count = {counter = 1}, mnt_expiry_mark = 0,
mnt_pinned = 0, mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) set $F2 = $ent

(gdb) bsNext &$F1->mnt_mounts vfsmount mnt_child



#--------------- 0xcf80cd58 ---------------#
$160 = {mnt_hash = {next = 0xcf80bf38, prev = 0xcf80bf38}, mnt_parent = 0xcf80cbc0,
mnt_mountpoint = 0xc4eaf2f0, mnt_root = 0xcf56a1c0, mnt_sb = 0xcf294470, mnt_mounts = {
next = 0xcf80cd70, prev = 0xcf80cd70}, mnt_child = {next = 0xcf80cbd8, prev = 0xcf80cc68},
mnt_flags = 0,
mnt_devname = 0xcf26fbd8 "/dev/disk/by-uuid/448c9c30-4a09-40f6-8e01-d3c577cd250d",
mnt_list = {next = 0xcf802410, prev = 0xcf80cc78}, mnt_expire = {next = 0xcf80cd90,
prev = 0xcf80cd90}, mnt_share = {next = 0xcf80cd98, prev = 0xcf80cd98}, mnt_slave_list = {
next = 0xcf80cda0, prev = 0xcf80cda0}, mnt_slave = {next = 0xcf80cda8, prev = 0xcf80cda8},
mnt_master = 0x0, mnt_ns = 0xcf802408, mnt_count = {counter = 1}, mnt_expiry_mark = 0,
mnt_pinned = 0, mnt_ghosts = 0}

Use print *$ent to show list entity
Use set $a = $ent to store the entity

(gdb) set $L2 = $ent


So, we have $L1, $L2, $F1, $F2 for the 4 mounts we want to observe. Let's first check $F1 and $F2, the mounts of the FAT32 filesystem (see the sketch after this list for why some fields are shared).
  1. They are indeed different structures: $F1 = 0xcf80cbc0, $F2 = 0xcf80cc48.
  2. They have the same superblock: $F1->mnt_sb = $F2->mnt_sb = 0xcfa68080. This reflects what is described in ULK: only one superblock structure, no matter how many times the filesystem is mounted.
  3. Just as expected, $F2->mnt_parent = 0xcf80cbc0, the address of $F1.
  4. Even though they are different mounts, their mnt_root (a struct dentry) is the same: $F1->mnt_root = $F2->mnt_root = 0xcf709ed0. This indicates that mnt_root is bound to the filesystem, rather than to the mount.
  5. On the other hand, the vfsmount's other dentry, mnt_mountpoint, obviously differs, one per mount point: $F2->mnt_mountpoint = 0xcf709f68, $F1->mnt_mountpoint = 0xcf7098e0.
    • $F2->mnt_mountpoint->d_mounted = 1
    • $F2->mnt_mountpoint->d_parent->d_mounted = 0
  6. The reference counters: $F1->mnt_count = 3, $F2->mnt_count = 1.
For the mounts of the Linux ext3 filesystem, $L1 and $L2, the results are similar, except for the reference counters: $L1->mnt_count = 2, $L2->mnt_count = 1.
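
Points 2 and 4 fall directly out of how the kernel creates the extra vfsmount for a bind or repeated mount: it clones the existing one and copies mnt_sb and mnt_root instead of reading the superblock from disk again. Below is a hedged paraphrase of that logic, in the spirit of clone_mnt() in fs/namespace.c, not a verbatim excerpt (the real code uses a kmem_cache and does much more bookkeeping):

#include <linux/slab.h>
#include <linux/mount.h>
#include <linux/dcache.h>

static struct vfsmount *clone_mount_sketch(struct vfsmount *old,
                                           struct dentry *root)
{
        struct vfsmount *mnt = kzalloc(sizeof(*mnt), GFP_KERNEL);

        if (!mnt)
                return NULL;
        mnt->mnt_sb     = old->mnt_sb;  /* shared superblock  (point 2)     */
        mnt->mnt_root   = dget(root);   /* shared root dentry (point 4)     */
        mnt->mnt_parent = mnt;          /* fixed up later, when attached    */
        return mnt;
}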

That's it. Using my tiny script snippets bsScan and bsNext, you too can play around with Linux kernel internals. Here we have observed the tree of filesystem mounts by walking through mnt_mounts and mnt_child, starting from the namespace.

Note that another, much more straightforward, way to play with the system's mount structures is simply to loop through the namespace's list field. Below are the GDB commands to navigate all the mounts of the namespace:

(gdb) bsScan &init_task.nsproxy->mnt_ns->list vfsmount mnt_list
(gdb) bsNext &init_task.nsproxy->mnt_ns->list vfsmount mnt_list
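
For completeness, the C equivalent of that flat traversal (a hedged sketch reusing the headers from the earlier snippets, plus linux/mnt_namespace.h for struct mnt_namespace):

/* Every vfsmount in the namespace is also chained into one flat list,
 * ns->list, through its mnt_list member, regardless of tree depth.        */
static void show_all_mounts(struct mnt_namespace *ns)
{
        struct vfsmount *mnt;

        list_for_each_entry(mnt, &ns->list, mnt_list)
                printk(KERN_INFO "mount: %s\n", mnt->mnt_devname);
}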


Enjoy it.

Thursday, June 5, 2008

Elements of HereMyPage

HereMyPage.com contains the following technical ingredients:
  • On the client side
    • jQuery-based JavaScript framework
    • iGoogle-compatible gadget API library, written from scratch
  • On the server side
    • Smarty template engine
    • SimplePie RSS proxy
    • Joomla content management system for publishing content & user management
    • External WordPress weblog system (TODO)
    • External wiki system for online documentation (TODO)
    • External forum system for discussion (TODO)
    • External news system for the latest announcements (TODO)
The service elements it will contain are (TODO):
  • Personal home portal, front-end: http://www.heremypage.com. Tabbed pages of personalized content.
  • Personal home portal, back-end: http://www.heremypage.com/settings
  • Personal share page: http://www.heremypage.com/user or http://user.heremypage.com
  • Community page: http://www.heremypage.com/community. Tabbed pages managed and seen by community users.
  • World page: http://www.heremypage.com/world. Tabbed pages managed and seen by all users.
  • Blog: http://blog.heremypage.com/user
  • Special pagelets:
    • Feedburner pagelet, providing an RSS feed-burning service for any site
    • Designer pagelet, providing an easy place to design / modify a pagelet spec

...... to be continued

Hot links for Web Designers

Websites you need for programming ~

(ref: http://calos-tw.blogspot.com/2008/05/blog-post_17.html)

  • PHP: Zend Framework, CakePHP Framework
  • JavaScript: jQuery
  • CSS
  • Database: MySQL, Oracle
  • Service
  • Project
  • Software
  • SCM
  • Design
  • Web

HereMyPage.com preparation for on-line

This isn't a formal announcement, just a test of the rules of the web world.

I've created a website:
HereMyPage.com

It's just like netvibes.com or google.com/ig, providing a personal web portal service. I know it's already a latecomer in this field; therefore, a few differentiations lie here, in the slight hope of attracting some web fans.
  • Users get more space than on netvibes or iGoogle
  • The so-called pagelet, in my terminology, supports iGoogle's gadget API
  • Freedom to customize your home page: download one of the public themes, make your modifications, upload it to your personal or shared folders, and you have your own home page
  • ...
  • More to come