Cleaning Up My del.icio.us Links

On Monday, I posted some info about how I am thinking of posting my weekly links.

Today I want to make one correction to the process, go into the details of how to clean up the diff file, and then put together a quick script to do that part automatically. Once again, I am doing this for the first time as I write this. I will summarize the process below.

First, the correction. After my first use of this method, I discovered that one more quick edit to the HTML export makes parsing the diff file much easier. Before I move ~/delicious.htm to ~/delicious-old.htm, I need to add a line break just after <DL><p>. It may not seem like much, but it makes a big difference.
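That edit can be done by hand in any editor, but here is a minimal sketch of doing it with awk. The function name split_dl_line and the sample export are made up for illustration, and the sketch assumes the first bookmark sits on the same line as <DL><p>.

```shell
# Hypothetical helper: print the file named in $1 with a line break
# inserted right after the opening <DL><p>.
split_dl_line() {
    awk '{sub(/<DL><p>/, "<DL><p>\n")}; {print}' "$1"
}

# Made-up two-line export, for illustration only.
sample=$(mktemp)
printf '%s\n' '<DL><p><DT><A HREF="http://example.com/">Example</A>' '</DL><p>' > "$sample"

# The <DT> entry now starts on its own line.
split_dl_line "$sample"
rm -f "$sample"
```

Since the function writes to standard output, the result would have to be redirected to a new file rather than back onto the file being read.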

As it turns out, cleaning up the diff file is fairly easy to do with grep and awk. Let’s take a look at exactly what I want to do first.

I am only interested in lines that start with > and a space, so I start with

grep '^> ' < links.diff

I want to replace the <DL> with <dl> and I don’t need the <p> at all. So now I have

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{print}'

Now we get rid of the > and the space at the beginning of each line.

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};{print}'

Then we skip the last line, the closing </DL>, so it is not printed at all.

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}'
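To see what the pipeline does, here is a small sketch that runs it against a made-up diff. The sample content and the variable names sample_diff and result are assumptions for illustration; a real links.diff will look different.

```shell
# Made-up diff output, for illustration only.
sample_diff=$(mktemp)
cat > "$sample_diff" <<'EOF'
1c1,3
< <DL><p>
---
> <DL><p><DT><A HREF="http://example.com/a">First link</A>
> <DT><A HREF="http://example.com/b">Second link</A>
> </DL><p>
EOF

# Same pipeline as above, pointed at the sample file instead of links.diff.
result=$(grep '^> ' < "$sample_diff" |
    awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}')

# Prints:
# <dl><DT><A HREF="http://example.com/a">First link</A>
# <DT><A HREF="http://example.com/b">Second link</A>
printf '%s\n' "$result"
rm -f "$sample_diff"
```

The lines starting with < and the 1c1,3 header are dropped by grep, and the closing </DL><p> line is dropped by the last awk pattern.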

This gives me everything I need, but I still have uppercase tags and attributes, some attributes I don’t really care about, and none of the elements are closed. We can take care of closing the <dl> with a simple echo "</dl>" after it.

echo "</dl>"

So, if we want to save all this to a file we can do this.

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}' > foo.html;echo "</dl>" >> foo.html
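The one-liner can also be wrapped in a small shell function so the file names are not hard-coded. This is only a sketch: clean_diff is a hypothetical name, and taking the input and output files as arguments is an assumption about how the script would be used.

```shell
# clean_diff is a hypothetical name, not part of any existing tool.
# $1 = diff file to read, $2 = HTML file to write.
clean_diff() {
    {
        # Keep the added lines, fix the opening tag, strip the "> " prefix,
        # and skip the closing </DL> line.
        grep '^> ' < "$1" |
            awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}'
        # Close the list by hand, as in the one-liner.
        echo "</dl>"
    } > "$2"
}
```

With the function defined, clean_diff links.diff foo.html would write the same foo.html as the one-liner. Grouping both commands in braces means the output file is opened once, so there is no need for a separate >> append.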

Now all I need to do is clean up those uppercase letters and close all the other elements. I’ll take a look at that on Friday.

This is the second in a series of posts. The first post is here and the next one is here.
