Repair a thin pool

I use ProxMox in my home lab; it has been really helpful for spinning up VMs as needed for tests and experiments.

I recently ran into an issue with the LVM thin pool used by ProxMox: the metadata space was completely full, with lvs -a reporting metadata usage at 99.99%.
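For reference, the usage figures can be checked with something like the following (lv_name, data_percent and metadata_percent are standard lvs report fields; pve is the volume group in my setup):

lvs -a -o lv_name,lv_size,data_percent,metadata_percent pve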

After a quick search, I noticed I was not the first one to run into this. It seems some felt the default metadata pool size in LVM2 was not large enough.

I came up with steps to fix the issue, starting with resizing the metadata space:

root@pve1:/# lvresize --poolmetadatasize +1G pve/data

Although lvs -a showed the additional space, I was still experiencing issues. I assumed the metadata was corrupted, so I tried:

root@pve1:/# lvconvert --repair pve/data

This did not resolve the issue. Since the root of the tree had already been lost, lvconvert --repair was not able to recover anything, and I was left with no metadata and none of the thin volumes available. lvs -a still showed the thin volumes, but they remained unavailable:

root@pve1:/# lvchange -ay pve/vm-100-disk-5
device-mapper: reload ioctl on failed: No data available

I tried running vgmknodes -vvv pve but noticed those volumes got marked NODE_DEL:

Processing LV vm-100-disk-5 in VG pve.
   dm mknodes pve-vm--100--disk--5  NF   [16384] (*1)
   pve-vm--100--disk--5: Stacking NODE_DEL
   Syncing device names
   pve-vm--100--disk--5: Processing NODE_DEL

I reached out to Zdenek Kabelac and Ming-Hung Tsai, who are both extremely knowledgeable about LVM thin pools, and they provided much-needed and very useful assistance. Following advice from Ming-Hung, I grabbed the source code of thin-provisioning-tools from GitHub. To compile it properly on ProxMox, I had to install a number of packages:

apt-get install git
apt-get install autoconf
apt-get install g++
apt-get install libexpat
apt-get install libexpat1-dev
apt-get install libexpat1
apt-get install libaio-dev libaio1
apt-get install libboost1.55-all-dev
apt-get install make
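The build itself went roughly like this (a sketch; the --enable-dev-tools switch is what exposes thin_scan, thin_ll_dump and thin_ll_restore, and the exact autotools steps may vary between versions of the repository):

git clone https://github.com/jthornber/thin-provisioning-tools.git
cd thin-provisioning-tools
autoreconf -i                      # or ./autogen.sh, depending on the version
./configure --enable-dev-tools     # required for the thin_scan / thin_ll_* tools
make                               # if linking fails with "cannot find -lz", install zlib1g-dev as well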

Using this new set of tools, I started poking around with thin_check, thin_scan and thin_ll_dump:

root@pve1:/# ./pdata_tools thin_check /dev/mapper/pve-data_meta2
examining superblock
examining devices tree
examining mapping tree
  missing all mappings for devices: [0, -]
    bad checksum in btree node (block 688)
root@pve1:/# ./pdata_tools thin_scan /dev/mapper/pve-data_meta2 -o /tmp/thin_scan_meta2.xml
root@pve1:/# ./pdata_tools thin_ll_dump /dev/mapper/pve-data_meta2 -o /tmp/thin_ll_dump_meta2.xml

pve-data_meta2 was the oldest backup of the metadata created by lvconvert --repair and was the most likely to contain my metadata. But thin_check showed that all mappings were missing because the root was missing.

To fix this with thin_ll_restore, I needed to find the correct nodes. In the thin_ll_dump output created above, I was able to find candidates for the data-mapping-root:

root@pve1:/# grep "key_begin=\"5\" key_end=\"8\"" /tmp/thin_ll_dump_meta2.xml
  <node blocknr="6235" flags="2" key_begin="5" key_end="8" nr_entries="4" value_size="8"/>
  <node blocknr="20478" flags="2" key_begin="5" key_end="8" nr_entries="4" value_size="24"/>

In the thin_scan XML file created above, I was able to find candidates for the device-details-root:

root@pve1:/# grep value_size=\"24\" /tmp/thin_scan_meta2.xml
<single_block type="btree_leaf" location="20477" blocknr="20477" ref_count="0" is_valid="1" value_size="24"/>
<single_block type="btree_leaf" location="20478" blocknr="20478" ref_count="1" is_valid="1" value_size="24"/>

I used the 6235 and 20477 pair to start, which produced good metadata and far fewer orphans than before:

root@pve1:/# ./pdata_tools thin_ll_dump /dev/mapper/pve-data_meta2 --device-details-root=20477 --data-mapping-root=6235 -o /tmp/thin_ll_dump2.xml
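For clarity, /tmp/tmeta.bin used in the next command is simply a raw dd copy of the pool's metadata device, along these lines (the exact device name under /dev/mapper depends on your setup):

dd if=/dev/mapper/pve-data_meta of=/tmp/tmeta.bin bs=4K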

root@pve1:/# ./pdata_tools thin_ll_dump /tmp/tmeta.bin --device-details-root=20478 --data-mapping-root=6235
<superblock blocknr="0" data_mapping_root="6235" device_details_root="20478">
  <device dev_id="5">
    <node blocknr="7563" flags="1" key_begin="0" key_end="708527" nr_entries="6" value_size="8"/>
  </device>
  <device dev_id="6">
    <node blocknr="171" flags="1" key_begin="0" key_end="799665" nr_entries="51" value_size="8"/>
  </device>
  <device dev_id="7">
    <node blocknr="20413" flags="1" key_begin="0" key_end="1064487" nr_entries="68" value_size="8"/>
  </device>
  <device dev_id="8">
    <node blocknr="19658" flags="1" key_begin="0" key_end="920291" nr_entries="17" value_size="8"/>
  </device>
</superblock>
<orphans>
  <node blocknr="564" flags="2" key_begin="0" key_end="0" nr_entries="0" value_size="8"/>
  <node blocknr="677" flags="1" key_begin="0" key_end="1848" nr_entries="23" value_size="8"/>
  <node blocknr="2607" flags="1" key_begin="0" key_end="708527" nr_entries="6" value_size="8"/>
  <node blocknr="20477" flags="2" key_begin="5" key_end="8" nr_entries="4" value_size="24"/>
  <node blocknr="3020" flags="1" key_begin="370869" key_end="600885" nr_entries="161" value_size="8"/>
  <node blocknr="20472" flags="2" key_begin="379123" key_end="379268" nr_entries="126" value_size="8"/>
  <node blocknr="20476" flags="2" key_begin="379269" key_end="401330" nr_entries="127" value_size="8"/>
</orphans>

Armed with this modified XML file, and after making sure nothing was active or using the thin pool metadata, I was able to attempt a restore:

root@pve1:/# dmsetup remove pve-data-tpool
root@pve1:/# dmsetup remove pve-data_tdata
root@pve1:/# ./pdata_tools thin_ll_restore -i /tmp/thin_ll_dump_meta2_root_6235.xml -E /tmp/tmeta.bin -o /dev/mapper/pve-data_tmeta
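Before bringing anything back online, it is worth re-running thin_check against the restored metadata to confirm it is now consistent, for example:

./pdata_tools thin_check /dev/mapper/pve-data_tmeta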

Following the restore, my thin volumes ALL came back and I was able to activate every single volume.
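Reactivation is then just standard LVM activation, something like the following (assuming the pool is pve/data):

lvchange -an pve/data    # make sure the pool is fully deactivated first
lvchange -ay pve/data    # bring the thin pool back up
vgchange -ay pve         # then activate the thin volumes in the VG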

I learned a lot about LVM thin pools in the process AND learned to be more careful with metadata space. ProxMox creates a very small metadata space by default; when deploying a new server, the pool metadata size (--poolmetadatasize) should always be increased (or checked and monitored at the very least).
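For a new deployment, the pool can be created with a larger metadata LV up front, or an existing pool's metadata grown right away; a sketch with arbitrary names and sizes:

lvcreate --type thin-pool -L 500G --poolmetadatasize 2G -n data pve    # new pool with 2G of metadata
lvextend --poolmetadatasize +1G pve/data                               # grow the metadata of an existing pool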



11 thoughts on “Repair a thin pool”

  1. Thanks for documenting this; it seems very promising for helping me with a similar problem!

    Note for others who find this: I had to compile thin-provisioning-tools with

    ./configure --enable-dev-tools
    make

    Without that, the thin_ll tools weren’t available.

    Charles, could you clarify how you generated your /tmp/tmeta.bin file? I assume it’s a binary dump of your pool’s tmeta, but I’m not sure how to create that. So I’m stuck at trying to run thin_ll_restore, which requires the -E source-metadata input.

    And if it's convenient for you to share any more pointers on how you identified the correct entries for data-mapping-root and device-details-root, that'd be great. My situation may be different from yours: lvconvert --repair allows me to access 75% of my volumes, with only a few orphans shown in thin_ll_dump, but I'd like to experiment with data-mapping-root and device-details-root to see whether I can restore them all.

    Thanks!

    1. /tmp/tmeta.bin was simply a 'dd' copy of the metadata partition. I used something like 'dd if=/dev/mapper/pve-data_meta of=tmeta.bin bs=4K' to transfer the binary content into a file.

  2. Charles, it's really unbelievable! They (the Proxmox team) shouldn't use such a buggy technology at all!!!
    Now I'm experiencing the same issue, and I'm just going to format my LVM disk back to plain old ext4 and keep the images in the filesystem!
    Thank God that in my case it was just a backup VM… unbelievable…

  3. Hello, I ran into the same problem as you, but following your approach I got stuck further along, and my version of thin-provisioning-tools is different. Using thin_dump, I can't get past this step; please give me more advice.
    #thin_dump /dev/mapper/volumes-pool_meta0 -o /Temp/thin_dump_meta0.xml
    It reports: bad checksum in metadata index block
    Thanks!!!

    1. Sorry, I missed some of these comments because of a WordPress issue. /tmp/tmeta.bin was simply a 'dd' copy of the metadata partition. I used something like 'dd if=/dev/mapper/pve-data_meta of=tmeta.bin bs=4K' to transfer the binary content into a file.

  4. Hi

    I am trying to install thin-provisioning-tools, but I get errors in the make process:

    [CXX] thin-provisioning/shared_library_emitter.cc
    [LD] bin/pdata_tools
    /usr/bin/ld: cannot find -lz
    collect2: error: ld returned 1 exit status
    make: *** [Makefile:240: bin/pdata_tools] Error 1

    Can anybody help me?
    Thanks a lot.

  5. Well, here I am two years later, having overfilled my drive and corrupted my lvm metadata again. I think last time I eventually gave up, but this time I was able to recover, thanks to Charles explaining the origin of tmeta.bin. I want to fill in some gaps though, for my reference two years from now when I do this yet again…

    I had run lvconvert --repair multiple times, but the only metadata which showed anything useful was meta0. It showed many orphans. The others simply showed an empty (or mostly empty) superblock.

    In my thin_scan, grepping for value_size="24" found two block locations: 40 and 24989.

    In my thin_ll_dump, I had nothing with key_begin="5". I looked at the output for the first lines where key_begin="0" gave way to key_begin="1" (the two lines with that both had key_end="15788"). Those blocknr's were 2 and 24989.

    I ran the thin_ll_dump command twice, with --device-details-root=24989 --data-mapping-root=2 and with --device-details-root=40 --data-mapping-root=2. Both generated output with many superblock entries and many fewer orphans. (Trying --data-mapping-root=24989 didn't yield anything useful.)

    Because I didn’t fully understand the various _tmeta _tdata -tpool and _meta[x] entries under /dev/mapper, I followed instructions by Charl Joseph Mert to create a new metadata volume, so that I wouldn’t have to overwrite anything existing.

    lvcreate -an -Zn -L150M --name repaired volume_group_name
    vgchange -an (I may have had to do some manual tinkering to deactivate and activate specific volumes?)
    lvchange -ay -v volume_group_name/repaired
    thin_ll_restore -i /tmp/thin_ll_dump_fewer_orphans_device_40_data_2.xml -E /tmp/meta0.bin -o /dev/mapper/volume_group_name-repaired
    lvconvert --thinpool volume_group_name/pool_name --poolmetadata /dev/mapper/volume_group_name-repaired

    Then vgchange -ay actually brought up all my volumes, yay!! There are some problems with one being read-only (I had to mount with option -o ro,noload and am backing up the data, before I try lvchange -p w to make it writable and allow an fsck), and several with bad superblocks, but alternate versions of those volumes which are mountable. So it looks like I can at least recover all the data, even if after I reboot, the LVM structure won’t allow them all to open.

    Note that I used the output from --device-details-root=40. I didn't try the XML generated from --device-details-root=24989 since the one from 40 is working pretty well.

    Huge thanks to Charles for posting all this way back when, and hopefully my additions help someone else out of a panic situation. Good luck to anyone on this page!

  6. Well, I also broke a thin pool (pve/data), using Proxmox and LXC. I tried lvconvert --repair, and thin_check, and whatever; the first time I got an error that I didn't have enough space in my VG, so I added another HDD. After that all commands ran fine, but I could never activate or access pve/data; it was always claiming that a 'full examination' was necessary. Then I noticed that one of my disks had developed some bad sectors, so I 'fixed' that (dd_rescue to another disk…), but still couldn't access my pool; it was not present in /dev/mapper/pve or /dev/pve, and lvdisplay always said 'unavailable'.
    Finally I landed here and built the LVM tools from the GitHub repository. I had to install the libboost dev package in a different version (74 I think) and do 'make dev-tools' to get the thin_scan etc. tools.
    I was running thin_scan on a copy of my metadata and suddenly noticed that I had /dev/mapper/pve-data back before I actually did anything? I just tried to start my LXC container, and after some minutes of hard disk activity it came back online!
    So for me it looks like just compiling and installing those lvm tools got me my pool back. Backup is running now!
    So thanks for your blog entry!
