Cleaning Up My del.icio.us Links

On Monday, I posted some info about how I am thinking of posting my weekly links.

Today I want to make one correction to the process, go into detail about how to clean up the diff file, and then put together a quick script to do that part automatically. Once again, I am doing this for the first time as I write. I will summarize the process below.

First, the correction. After my first use of this method I discovered that one more quick edit to the html export will make the parsing of the diff file much easier. Before I move ~/delicious.htm to ~/delicious-old.htm I need to add a line break just after <DL><p>. It may not seem like much but it makes a big difference.
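The line break can be added by hand in an editor, but a small awk filter could do it as well. This is a minimal sketch, assuming the export has the first <DT> on the same line as <DL><p>; the function name split_dl is made up for illustration.

```shell
# split_dl: read an export on stdin and put a line break right after <DL><p>.
# (split_dl is a hypothetical helper name, not part of any existing tool.)
# In awk's sub(), "&" stands for the matched text and "\n" is a newline.
split_dl() {
    awk '{ sub(/<DL><p>/, "&\n") } { print }'
}

# Example with a made-up one-line export:
printf '%s\n' '<DL><p><DT><A HREF="http://example.com/">Example</A>' | split_dl
# To apply it to a real export, something like:
#   split_dl < ~/delicious.htm > ~/delicious.tmp && mv ~/delicious.tmp ~/delicious.htm
```

The example prints <DL><p> on one line and the <DT> entry on the next, which is the shape the later grep and awk steps expect.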

As it turns out, cleaning up the diff file is fairly easy to do with awk and grep. Let’s take a look at exactly what I want to do first.

I am only interested in lines that start with > and a space so I start with

grep '^> ' < links.diff
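The reason those lines matter: in diff's default output, a line that exists only in the second file is prefixed with > and a space. A tiny demonstration with made-up data, where the helper name new_lines and the sample "links" are placeholders only:

```shell
# new_lines OLD NEW: show only the lines diff reports as present in NEW
# but not in OLD. (new_lines is a hypothetical helper name.)
new_lines() {
    diff "$1" "$2" | grep '^> '
}

# Made-up sample data: the newer export has one extra link at the top.
old=$(mktemp)
new=$(mktemp)
printf '%s\n' 'link A' > "$old"
printf '%s\n' 'link B' 'link A' > "$new"

new_lines "$old" "$new"    # prints "> link B"
rm -f "$old" "$new"
```

The order of the two files matters here. With the older export first, the new links are the ones marked with >.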

I want to replace the <DL> with <dl> and I don’t need the <p> at all. So now I have

grep '^> ' < links.diff | awk '{sub(/<DL><p>/,"<dl>")};{print}'

Now we get rid of the > and the space at the beginning of each line.

grep '^> ' < links.diff | awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};{print}'

Then we skip the closing </DL> line, which is the last one, and print everything else.

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}'

This gives me everything I need, but I still have uppercase tags and attributes, some attributes I don’t really care about, and none of the elements are closed. We can take care of closing the <dl> by running a simple echo "</dl>" after it.

echo "</dl>"

So, if we want to save all this to a file we can do this.

grep '^> ' < links.diff |awk '{sub(/<DL><p>/,"<dl>")};{sub(/^> /, "")};!/<\/DL>/{print}' > foo.html;echo "</dl>" >> foo.html
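Wrapped up as a shell function, the same pipeline might look like the sketch below. The function name clean_diff and its two arguments are placeholders chosen for illustration, and it assumes the line break after <DL><p> has already been added to the export.

```shell
# clean_diff DIFF_FILE OUT_FILE   (hypothetical helper, same steps as above)
# Keep only the lines diff marked as new ("> "), swap <DL><p> for <dl>,
# strip the "> " marker, drop the closing </DL> line, then close the list.
clean_diff() {
    grep '^> ' < "$1" |
        awk '{ sub(/<DL><p>/, "<dl>") }
             { sub(/^> /, "") }
             !/<\/DL>/ { print }' > "$2"
    echo "</dl>" >> "$2"
}

# Usage, with the file names from this post:
#   clean_diff links.diff foo.html
```

Taking the file names as arguments keeps the function usable if the diff or the output ends up somewhere else.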

Now all I need to do is clean up those uppercase letters and close all the other elements. I’ll take a look at that on Friday.

This is the second in a series of posts. The first post is here and the next one is here.

Posting del.icio.us Links Weekly to WordPress

I’ve been using del.icio.us to share links since 2005. I’ve always used another method for bookmarking links for myself, but del.icio.us has been my favorite method for the sharing of interesting links. Before del.icio.us I had a separate linkblog so right away I wanted a way to display my shared links in a similar format. I started out by replacing my linkblog with an html rendering of the RSS feed from del.icio.us. I quickly realized that I wanted more than that so I set the blog back up and used a cron job to auto-post my links to the WP database. I’ve written about all of this before.

After a while, I gave up on the linkblog completely and just used a widget to show the links on my blog. Not quite what I wanted but good enough for a while. Recently I decided to set up the daily blog posting feature that del.icio.us provides. This is a very nice feature but doesn’t work well for me because my links come in waves. So, I turned that off earlier this week and set off in search of a way to post the links as a weekly roundup. I’ve seen other sites do this and I like it a lot.

After a few quick searches, I didn’t find anything I thought was worth spending time fooling with. It seems to me it’s just as easy to come up with something on my own. As a hacker I would prefer something as automatic as possible, but I don’t mind having to do something manually. I will probably want to tweak the weekly posting a touch anyway.

The first thing that came to mind was using the RSS feed, but I dismissed that because it will only show a maximum of 100 items. That would probably do for my purposes, but I’d like to set up something that never leaves me wondering whether I got all the links.

So I decided on a different approach. I haven’t done any of this yet. I am going to work on it while I write this.

Here is the plan:

  1. export the links as html
  2. grab out the html I need and paste it into a new post in WP
  3. post it

Simple, except for a few points.

I actually came up with this idea a few days ago and I grabbed an export then. I checked my blog and found that the latest link posted was the trash vortex page at greenpeace.org, so I simply removed all links above that and saved this file as ~/delicious.htm.

Remove all html above the last posted link

Now it’s time to grab the new links for this week, so I go to del.icio.us and export the html and save it to the desktop. Then,

mv ~/delicious.htm ~/delicious-old.htm
mv ~/Desktop/del*.htm ~/delicious.htm
diff ~/delicious-old.htm ~/delicious.htm > links.diff

The only thing to do now is clean it up and post it. Let’s start by doing it manually. I’ve stripped most of the new links out for demonstration. Take a look.

Diff file

First, I remove the first three lines and the last five lines. I’ve run a few tests now and it looks as though this will always be the case. This should make automation easier. This procedure is obviously going to require a bit of manual intervention so I should be able to notice when a problem crops up.
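That trimming step could be scripted along these lines. This is only a sketch: the function name trim_diff is invented here, and it assumes the three-line header and five-line footer described above hold for every diff.

```shell
# trim_diff FILE: print FILE minus its first three and last five lines.
# (trim_diff is a hypothetical helper name, not an existing tool.)
trim_diff() {
    # tr strips the padding some versions of wc put around the count.
    total=$(wc -l < "$1" | tr -d ' ')
    last=$((total - 5))
    # Nothing is left to print if the file has eight lines or fewer.
    [ "$last" -ge 4 ] || return 0
    sed -n "4,${last}p" "$1"
}

# Usage, with the file name from this post:
#   trim_diff links.diff > trimmed.diff
```

Counting lines first and then printing a fixed range avoids relying on options that only some versions of head support.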

After removing those lines I am left with a bunch of lines like those below.

> <DL><p><DT><A HREF="http://online.wsj.com/article/SB123731266862258869.html" LAST_VISIT="1238172267" ADD_DATE="1238172267" TAGS="fun,economics,games,culture,scrabble,words">Scrabble and Other Games Have Overvalued Points - WSJ.com</A>
> <DD>Scrabble is a great game and should be left alone.

The only thing necessary to make this “work” is to remove the > at the beginning of each line, but we will make it “right” by changing uppercase tags and attributes to lowercase, closing all elements, and wrapping all of it in

<dl></dl>

Then I copy and paste it into a new post in WP and I’m all set. It requires a bit of input but is not hard to do. I will see how much I can automate it on Wednesday.

This is the first in a series of posts. The next post is here.

Scraping The iTunes Store

I recently wrote a series of posts detailing the way I chose to scrape iTunes for ratings information for Bailout America, an iPhone game we released recently.

Read more at thedoedoeblog.

Google Maps API – Knoxvoice.com

I built a “Find Us” map for Knoxville Voice, an independent newspaper in Knoxville, TN.

Find Us

Data Extraction

This client wanted a list of all YouTube videos on a certain topic. I found some of the work and posted it. Like most scraping work, it may not work anymore. It’s here if you want to check it out.

YouTube Video Mining

Excellent to work with. Great customer service, and willing to do what is necessary to get you what you need. Would work with again. Highly recommended.


YouTube Video Mining

Karl Jackson

Worked hard to get my project functioning. Excellent programmer.

Perl Script Install and Customization

Script Install and Customization

Set up an off-the-shelf Perl script for automated web-based marketing and customized it for this client’s special needs.

Website Replication

A guru in every sense of the word! This guy should be given a government national asset award for services to the United States! Hire him now!

Website Replication

My work seems to come in themes and this was the first in a long line of automation for sales of websites.

This client sold new websites to clients – what has become known as splogs (spam blogs) – and he wanted an easy way to replicate them on demand.

That’s where I came in. :)

Scraping Real Estate Data

I created an application that would mine all the data on this website and return a complete list of all properties meeting certain criteria. I provided this to the client as a Windows executable which he could run whenever he wanted to search. He was looking for very specific properties, but the application allowed for different searches on demand.

I’ve actually sold derivatives of this application to several clients.

SiteReportCard.com

I wrote the application that powers this website. I did not design or build the website, but I provided all the code that grabs the data, parses it, tabulates the results and presents it.

I enjoy this kind of work very much and it is one of my specialties.

Rebuilding Pages

You sir are a genius. This is going to save SOOO much time and it looks like we’re down to the final bit.

Did a really good job getting my script put together with very little to work with! Easy to get ahold of via IM and replies to emails quickly. No complaints at all.

Parsing and Rebuilding Pages

I wrote a program that would parse and rebuild 1000s of existing webpages. This project involved programmatically “moving” pages to different sections of the website, changing names and several other details on every page.

I don’t have a link.

Data Extraction – Sports Scores

I extracted years of historical data – NBA box scores – for a private client. I also wrote a program that would extract current data as needed.

I found an unfinished version of this script over here.

Google Search API

Matthew hired me to use the Google search API to find the number of results (the real number, not the number Google reports) for each keyword in his database and then insert that number in another column in his database.

There is no web based user interface to link to.

ThirdSphere :: Web hosting for small business success!

I was hired to transfer several websites from one server to another.

PhysOrg.com: latest science and technology news

I was hired to write an application that would pull RSS feeds and parse them into PHP arrays.

Experience