Cleaning Up My del.icio.us Links
On Monday, I posted some info about how I am thinking of posting my weekly links.
Today I want to make one correction to the process, go into detail about how to clean up the diff file, and then put together a quick script to do that part automatically. Once again, I am doing this for the first time as I write. I will summarize the process below.
First, the correction. After my first use of this method, I discovered that one more quick edit to the HTML export makes parsing the diff file much easier. Before I move ~/delicious.htm to ~/delicious-old.htm, I need to add a line break just after <DL><p>. It may not seem like much, but it makes a big difference.
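That line break could also be added without opening an editor. Here is a minimal sketch, assuming the export has <DL><p> with more content following it on the same line. The name add_break is a made-up helper for illustration, not part of the process from Monday.

```shell
#!/bin/sh
# add_break: hypothetical helper name, not from the original process.
# Prints the given file with a line break inserted right after <DL><p>.
# Assumption: <DL><p> is followed by more content on the same line;
# lines where nothing follows the tag are passed through unchanged.
add_break() {
    awk '/<DL><p>./ {sub(/<DL><p>/, "<DL><p>\n")} {print}' "$1"
}
```

The function writes to standard output, so the result has to be redirected to a different file than the one being read. Redirecting onto the input file would truncate it before awk reads it.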
As it turns out, cleaning up the diff file is fairly easy to do with grep and awk. Let’s first look at exactly what I want to do.
I am only interested in lines that start with > and a space, so I start with:
grep '^> ' < links.diff
I want to replace the <DL> with <dl> and I don’t need the <p> at all. So now I have
grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{print}'
Now we get rid of the > and the space at the beginning of each line.
grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};{print}'
Then we skip the last line, the one with the closing </DL>, so it is not printed at all.
grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}'
This gives me everything I need, but I still have uppercase tags and attributes, some attributes I don’t really care about, and none of the elements are closed. We can take care of closing the <dl> by running a simple echo "</dl>" after it.
echo "</dl>"
So, if we want to save all this to a file we can do this.
grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}' > foo.html;echo "</dl>" >> foo.html
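For the quick script mentioned at the top, the one-liner above could be wrapped in a small shell function. This is only a sketch: the name clean_diff and its two arguments (the diff file and the output file) are my own assumptions, and the pipeline inside is the same grep and awk chain, unchanged.

```shell
#!/bin/sh
# clean_diff: hypothetical helper name, not from the original post.
# Usage: clean_diff links.diff foo.html
clean_diff() {
    diff_file=$1
    out_file=$2
    {
        # Keep only the lines diff marks with "> ", turn the opening
        # <DL><p> into <dl>, strip the "> " prefix, and skip the
        # line holding the closing </DL>.
        grep '^> ' "$diff_file" |
            awk '{sub(/<DL><p>/, "<dl>")}; {sub(/^> /, "")}; !/<\/DL>/ {print}'
        # Close the list with a lowercase tag.
        echo "</dl>"
    } > "$out_file"
}

# Run only when called with exactly two arguments,
# e.g. sh clean-diff.sh links.diff foo.html (script name is hypothetical).
if [ "$#" -eq 2 ]; then
    clean_diff "$1" "$2"
fi
```

Grouping both commands in braces means the output file is opened once, so there is no need for a separate >> append step.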
Now all I need to do is clean up those uppercase letters and close all the other elements. I’ll take a look at that on Friday.
This is the second in a series of posts. The first post is here and the next one is here.
