8.2 HFS Internal Structure
The High Performance FileSystem (HFS) is known by many names: the McKusick filesystem (after Marshall Kirk McKusick of the University of California at Berkeley, one of its authors), the Berkeley Fast File System (FFS), UFS, and the UNIX filesystem (although that accolade should really go to the original 512-byte block filesystem created at AT&T Bell research labs by the likes of Ken Thompson and Dennis Ritchie, who wrote the earliest versions of what we call UNIX). HFS has been around since the mid-1980s. It has stood the test of time because it does simple things really well. In its day, it was seen as a massive leap forward, especially in the way files were organized on disk, which led to massive improvements in I/O performance. However, it is something of a letdown when it has to check the underlying filesystem structure after a system crash. In that situation, it must check the entire filesystem for corruption, even filesystem structures written days ago, because it keeps no log of when updates to the structure took place or whether those updates actually completed. On large filesystems, this can be a serious problem after a system crash: an individual filesystem can take more than an hour to check itself before marking itself clean and allowing the system to continue booting. This is one of the main reasons that HFS has lost favor over the years.
The basic design ideas employed in the HFS filesystem can still be seen in today's advanced filesystems. The key was to localize the control information (inodes) and the data (data blocks) for individual files as close together as possible. By doing so, we minimize read/write head movement, and we avoid rotational delays waiting for platters to spin by switching access between read/write heads on adjacent platters. This led to splitting the filesystem into logical chunks known as cylinder groups (see Figure 8-1).
Figure 8-1. Basic disk topology.
If we store the data blocks for an individual file in a single cylinder group, the disk's read/write heads don't need to move very far to retrieve the data, because the drive can switch between individual read/write heads far quicker than it can seek the heads to a new location. The number of cylinder groups created is determined by the overall size of the filesystem (usually 16 cylinders per group). The drawback with this design is the code needed to understand and interface with all these structures. This was overcome with the superblock. The superblock is an 8KB block of data that details the size and configuration of the filesystem, as well as the location of all the cylinder groups, the location of the inode table within each cylinder group, and the free-space maps for inodes and disk blocks. The superblock is read into memory at mount time for easy access. If we lose the primary superblock, we can recover it from a backup superblock stored in each cylinder group. The backup superblocks contain the static information relating to the filesystem, e.g., size, number of inodes, and the location and number of cylinder groups. The dynamic information can be reconstructed by simply traversing the entire filesystem and calculating how many blocks and inodes are actually in use. If we think of the cylinder group as simply a chunk of disk space, we can view the overall filesystem structure as we see in Figure 8-2.
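The backup superblocks are what make recovery from a damaged primary superblock possible. A hedged sketch of how that looks in practice (the raw device name is illustrative; block 16 is the traditional location of the first backup superblock on HFS, but the actual backup locations are printed by newfs when the filesystem is created, and you should use one of those):

```shell
# The primary superblock is damaged, so point fsck at a backup
# superblock instead using the -b option. Block 16 is the
# traditional first backup location -- an assumption here; use a
# backup block number reported by newfs for your own filesystem.
fsck -F hfs -b 16 /dev/vg00/rhfs.data
```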
Figure 8-2. Basic HFS layout.
The idea of offsetting the structural elements of the filesystem on disk was to alleviate the problem of a head crash scrubbing a particular section of the disk.
The number of inodes in an HFS filesystem is fixed at newfs time. The default of having one inode for every 6KB of space makes it unlikely that we will run out of inodes. However, it means that we could be wasting a significant amount of disk space on inodes that we don't need (an HFS inode being 128 bytes in size). The inode density is one configuration parameter that we may choose to modify at filesystem creation time.
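The inode density is set with the -i (bytes per inode) option to newfs at filesystem creation time. A hedged sketch, assuming a filesystem dedicated to a small number of large files (the device name and the 64KB figure are illustrative only):

```shell
# Create an HFS filesystem with one inode per 64KB of space instead
# of the default one per 6KB. With 128-byte inodes, this reclaims
# the space that would otherwise be wasted on inodes we will never
# use -- appropriate only when the filesystem will hold few files.
newfs -F hfs -i 65536 /dev/vg00/rhfs.data
```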
McKusick and his colleagues were clever with how inodes were allocated. Files in the same directory probably mean something to each other, i.e., they belong to the same user or application, so the notion of locality of data was extended to allocating inodes. A new file (hence, a new inode) is ideally created in the same cylinder group as its parent directory. If that is not possible, then either a recently freed inode is reused or an inode is chosen from a cylinder group with the fewest allocated inodes and directory entries. Similarly, the data blocks for the file are placed in the same cylinder group. To avoid a single file swamping a cylinder group, a configuration parameter, maxbpg (maximum blocks per group), allows only 25 percent (by default) of the space in a cylinder group to be used by a single file. Where we have only one or two files in a filesystem, e.g., a large database file, it is worthwhile setting maxbpg = bpg (blocks per group). This makes sense only at filesystem creation time, before individual blocks are allocated.
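On HP-UX, maxbpg can be adjusted on an existing (unmounted) HFS filesystem with tunefs, although, as just noted, this only influences blocks allocated from then on. A hedged sketch (the device name and the value are illustrative; read the blocks-per-group figure for your own filesystem from the superblock details first):

```shell
# Display the superblock details, including blocks per group (bpg).
fstyp -v /dev/vg00/rhfs.data

# Raise maxbpg so a single large file may consume a whole cylinder
# group. The value 16384 is illustrative -- substitute the bpg
# figure reported for your filesystem.
tunefs -e 16384 /dev/vg00/rhfs.data
```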
Inodes are at the heart of the filesystem because an inode stores the characteristics of a file and points us to the data blocks where our data resides. An HFS inode is 128 bytes in size. This fixed size makes locating an individual inode a trivial exercise. From the superblock, the filesystem can calculate which cylinder group holds a specific inode and where the inode table begins within that cylinder group; then, using the inode number multiplied by 128 bytes as an offset, it can quickly locate the inode in order to determine whether we are allowed to read the data held within. File type, permissions, ownerships, timestamps, and ACL references are all listed in an inode, as well as a series of pointers that direct us to where our data is located within the filesystem. Figure 8-3 gives you an idea of the structure of an inode utilizing an 8KB block.
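The locate-an-inode arithmetic can be sketched in a few lines of shell. The inodes-per-group figure below is an assumption for illustration only; the real value comes from the superblock of the filesystem in question:

```shell
#!/bin/sh
# Sketch of why the fixed 128-byte inode size makes lookup trivial.
INODE=4      # inode number we want to locate
IPG=1856     # inodes per cylinder group -- assumed; read from the superblock
ISIZE=128    # size of an HFS inode in bytes (fixed)

CG=$(( INODE / IPG ))               # cylinder group holding the inode
OFF=$(( (INODE % IPG) * ISIZE ))    # byte offset into that group's inode table

echo "inode $INODE lives in cylinder group $CG at offset $OFF bytes"
```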
Figure 8-3. HFS inode.
To avoid wasting an entire block on a small file, the smallest addressable object is a fragment. A fragment is 1/8 of a block (1KB by default). The sizes of a block and a fragment are fixed at filesystem creation time, and the defaults have been shown to be adequate in most situations. Where you have only a few large files in an HFS filesystem, it makes sense to create the filesystem with larger block and fragment sizes (up to 64KB and 8KB, respectively). This cuts down the number of individual inode and disk reads needed to address any given quantity of data. The use of double and triple (seldom used) indirect pointers can cause a significant increase in the number of I/Os to a filesystem.
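Larger block and fragment sizes are selected with the -b and -f options to newfs at creation time. A hedged sketch (the device name is illustrative):

```shell
# Create an HFS filesystem with the maximum 64KB block size and 8KB
# fragments. Fewer, larger blocks mean fewer indirect-pointer reads
# for a given quantity of data -- appropriate for a filesystem
# holding a few large files, e.g., a large database file.
newfs -F hfs -b 65536 -f 8192 /dev/vg00/rhfs.data
```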
The root inode of a filesystem is always inode 2. Subsequent allocation is based on the allocation policy mentioned previously. Directories have data blocks like any other file; the difference is that the data for a directory is a list of filenames and inode numbers. When we locate a file, we implicitly read the inode number from the directory. This points us toward another inode, possibly another directory, until we finally locate the file of interest. Let's look at an example where we want to examine the content of a file called /hfs/system. Ignore the fact that we need to traverse the root filesystem first (which is VxFS). We need to locate the root inode in the /hfs filesystem; we know that it is always inode 2. I will use the fsdb command to traverse the inodes in the filesystem until I find the file I am looking for.

NOTE: The use of fsdb is dangerous. In this instance, we are looking only at filesystem structures. Should you happen to change a filesystem using fsdb, you could render the entire filesystem unusable. This is for demonstration purposes only:
root@hpeos003 bdf /hfs
Filesystem          kbytes    used   avail %used Mounted on
/dev/vg00/hfs.data  103637      15   93258    0% /hfs
root@hpeos003 fsdb -F hfs /dev/vg00/hfs.data
file system size = 106496(frags)    isize/cyl group=48(Kbyte blocks)
primary block size=8192(bytes)    fragment size=1024
no. of cyl groups = 44
2i
i#:2  md: d---rwxr-xr-x  ln: 3  uid: 0  gid: 0  sz: 1024  ci:0
a0 : 120   a1 : 0   a2 : 0   a3 : 0   a4 : 0
a5 : 0     a6 : 0   a7 : 0   a8 : 0   a9 : 0
a10: 0     a11: 0   a12: 0   a13: 0   a14: 0
at: Wed Nov 12 20:48:27 2003
mt: Wed Nov 12 20:48:23 2003
ct: Wed Nov 12 20:48:23 2003
We can see the file type and mode (d---rwxr-xr-x). Because we are dealing with a directory, we can use a formatting instruction to have fsdb print block 0 as a list of directory entries.
a0.fd
d0: 2    .
d1: 2    ..
d2: 3    lost+found
d3: 4    system
d4: 5    passwd
d5: 6    hosts
From this, we can see that the system file is located at inode 4 (directory slot d3). To make inode 4 our working inode, we simply use the command 4i:
4i
i#:4  md: f---r--r--r--  ln: 1  uid: 0  gid: 3  sz: 1146  ci:0
a0 : 107   a1 : 0   a2 : 0   a3 : 0   a4 : 0
a5 : 0     a6 : 0   a7 : 0   a8 : 0   a9 : 0
a10: 0     a11: 0   a12: 0   a13: 0   a14: 0
at: Wed Nov 12 20:48:23 2003
mt: Wed Nov 12 20:48:23 2003
ct: Wed Nov 12 20:48:23 2003
From here, we can examine the content of the file if we like:
107b.fc
326000 : * * * * * * * * * * * * * * * * * * * * * *
326040 : * * * * * * * \n * S o u r c e : / u x /
326100 : f i l e s e t s . i n f o / C O R E - K R N
326140 : @ ( # ) B . 1 1 . 1 1 _ L R \n * \n * * *
326200 : * * * * * * * * * * * * * * * * * * * * * *
326240 : i t i o n a l d r i v e r s r e q u i r
326300 : y m a c h i n e - t y p e t o c r e a
326340 : e t e \n * s y s t e m f i l e d u r i
326400 : s t a l l . T h i s l i s t i s e
326440 : r t h a t t h e \n * m a s t e r . d /
326500 : n o t f o r c e o n t h e s y s t e
326540 : t i d e n t i f i a b l e b y \n * i o
326600 : h e r C P U - t y p e s p e c i f i c
326640 : e x i s t f o r t h e i r s p e c i a
326700 : s e e c r e a t e _ s y s f i l e ( 1
326740 : * * * * * * * * * * * * * * * * * * * * * *
327000 : * \n * \n * D r i v e r s / S u b s y s t e
327040 : \n b t l a n \n c 7 2 0 \n s c t l \n s d i s k
327100 : t o P C I \n t d \n c d f s \n f c p \n f c T 1
327140 : r f \n o l a r _ p s m \n o l a r _ p s m _ i
327200 : \n d i a g 0 \n d i a g 1 \n d i a g 2 \n d m e
327240 : i g \n i o m e m \n n f s _ c o r e \n n f s _
327300 : _ s e r v e r \n n f s m \n r p c m o d \n a u
327340 : e f s c \n m a c l a n \n d l p i \n t o k e n
327400 : u i p c \n t u n \n t e l m \n t e l s \n n e t
327440 : h p s t r e a m s \n c l o n e \n s t r l o g
327500 : s c \n t i m o d \n t i r d w r \n p i p e d e
327540 : f f s \n l d t e r m \n p t e m \n p t s \n p t
327600 : i 4 \n g e l a n \n G S C t o P C I \n v x v m
327640 : \n v o l s \n i o p _ d r v \n b s _ o s m \n i
327700 : \n v x p o r t a l \n b t l a n 1 \n l v m \n l
327740 : a p e \n t a p e 2 \n C e n t I f \n l p r 2 \n
This traversal of the filesystem using primitive objects (i.e., inodes) demonstrates the amount of work the filesystem has to undertake when accessing files within an HFS filesystem.