Antenna (antennapedia) wrote,

Making a local archive with wget

Do you really want to back up your LJ locally? With icons and comments and all that happy stuff? Then you want to use wget. This is a GNU utility for grabbing web and FTP resources. It comes with most Linux distros, but not with OS X.

1. Get wget. This is the hardest step of the entire process, because Apple chose several releases ago to cease including wget with OS X. Your choices are:
- download from VersionTracker
- install through one of the common port package managers (I use DarwinPorts; many people happily use Fink)
- the GNU page has download URLs for Windows versions, e.g., this one
- build from source

2. Run your web browser. Log into LiveJournal if you're not already. This saves a session cookie in your browser's cookies file. We're about to piggyback on that login session.

3. In a handy text editor, build two wget command lines. Start with this template:
wget --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies /path/to/cookies.txt --header "X-LJ-Auth: cookie"
wget -r -l6 -H --page-requisites --convert-links -R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt --header "X-LJ-Auth: cookie"

Replace the bold parts with your information. Any spaces in the cookie path need to be escaped with backslash.

For Firefox, your cookie path will be something like:
/Users/yourlogin/Library/Application\ Support/Firefox/Profiles/default.something_random/cookies.txt

For Camino:
/Users/yourlogin/Library/Application\ Support/Camino/cookies.txt

For Safari... uh, yeah, the Safari cookie plist file is not in the Netscape cookie format, so SOL. Download Firefox just for the occasion. Or run this, but most non-programmers will find just using Firefox or Camino easier. I love Camino, by the way. It's a true Mac application with the Mozilla rendering engine.

Windows users, you're on your own for cookies.txt file locations. Tell me, and I'll amend these instructions. But Windows users should probably just use LJArchive and be done with it.

My command lines look like this:
wget --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies /Users/antenna/Library/Application\ Support/Camino/cookies.txt --header "X-LJ-Auth: cookie"
wget -r -l6 -H --page-requisites --convert-links -R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt --header "X-LJ-Auth: cookie"

4. Run Terminal, cd to whatever directory you want the archive in, then copy and paste your first command line.

5. Run the second command.

6. Go make a pot of coffee. If you have lots of LJ posts, two pots. You might be watching it chug for an hour or more.

7. Inside the folder wget created for your journal (it's named after the journal's host name), there's a file named index.html. Double-click it.

8. Enjoy.
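
Before step 4, a quick preflight check can save a wasted run. What follows is a minimal sketch and not part of the recipe above: have_tool and preflight are made-up helper names, and all they confirm is that wget is on your PATH, that the cookie file is readable, and that the archive directory exists. It also shows that double-quoting a path with spaces works as an alternative to backslash escapes.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; nothing here talks to LiveJournal.

# have_tool NAME: succeed if NAME is an executable found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# preflight COOKIE_FILE ARCHIVE_DIR [TOOL]
# Prints one line per problem found, or "ok" if there are none.
# Returns the number of problems as its exit status.
preflight() {
    cookie_file=$1
    archive_dir=$2
    tool=${3:-wget}          # tool to look for; defaults to wget
    problems=0

    if ! have_tool "$tool"; then
        echo "$tool not found on PATH (see step 1)"
        problems=$((problems + 1))
    fi
    if [ ! -r "$cookie_file" ]; then
        echo "cannot read cookie file: $cookie_file"
        problems=$((problems + 1))
    fi
    if [ ! -d "$archive_dir" ]; then
        echo "archive directory does not exist: $archive_dir"
        problems=$((problems + 1))
    fi

    if [ "$problems" -eq 0 ]; then
        echo "ok"
    fi
    return "$problems"
}

# Example call (commented out; adjust the path to your own browser profile):
# preflight "/Users/yourlogin/Library/Application Support/Camino/cookies.txt" .
```

If it prints "ok", carry on with steps 4 and 5 as written.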

What is this tool doing?

You're spidering your own LiveJournal site, essentially, and using cookies so the spidering tool is logged in as you.

wget is a tool for grabbing the contents of URLs and storing them locally. It has some extremely nice features for mirroring web sites: it can parse the contents of a page to find links within it, and it can then follow those links. You can control which links it chooses to follow with wildcard patterns. (The command line above skips reply pages, for instance.) You can also tell it to fetch page requisites, that is, things like style sheets and images required to render a page correctly. And then, to top it all off, wget can rewrite links in the local copies of pages so they point to your local mirror, not back out to the web.
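
To get a feel for how the three -R patterns in the command line read, here is a small sketch that tests them with the shell's own glob matching. It is only an approximation: would_skip is a made-up helper, the example.invalid URLs are hypothetical, and wget applies its reject list by its own rules (its manual describes the list as file name suffixes or patterns), so what wget actually skips can differ from this.

```shell
#!/bin/sh
# would_skip URL: succeed if URL matches one of the three reject patterns
# from the command line, using plain shell glob matching on the whole string.
# This illustrates the wildcard syntax; it is not wget's matching logic.
would_skip() {
    case $1 in
        *mode=*|*replyto=*|*/friends*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical URLs, chosen only to exercise each pattern.
for url in \
    "http://example.invalid/1234.html" \
    "http://example.invalid/1234.html?mode=reply" \
    "http://example.invalid/1234.html?replyto=5678" \
    "http://example.invalid/friends"
do
    if would_skip "$url"; then
        echo "skip  $url"
    else
        echo "fetch $url"
    fi
done
```

Each * stands for any run of characters, so *mode=* matches any string that contains mode= somewhere in it.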

So it's a grand way to get a complete local archive of a remote resource. Which is, in fact, what it was designed to do.

(By the way, you can't do this with curl, which does come with OS X, because curl doesn't do recursive gets. So wget is the tool of choice for mirroring web sites.)

Normally this would do no good for getting your LJ downloaded, because many people hide things behind the friends-lock, and therefore a spider can't see them. But wget will load your cookie file, if you tell it to, and therefore it will be treated by LJ as if it's logged in. Because it is logged in. So it will get your entire journal, including flocked and private posts.
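
If the download comes back with only your public posts, the cookie file is the first thing to check. Here is a rough sketch that counts the cookies a Netscape-format cookies.txt holds for a given domain. It assumes the usual layout of seven tab-separated fields per line with the domain in the first field; count_cookies is a made-up name, and a non-zero count does not prove the session is still valid.

```shell
#!/bin/sh
# count_cookies FILE DOMAIN
# Print how many cookie lines in FILE have DOMAIN somewhere in their
# first (domain) field. Assumes Netscape cookie format: seven
# tab-separated fields per cookie, with comment lines starting with '#'.
count_cookies() {
    awk -F '\t' -v dom="$2" '
        /^#/ { next }                         # skip comment lines
        NF == 7 && index($1, dom) { n++ }     # well-formed line for this domain
        END { print n + 0 }                   # print 0 when nothing matched
    ' "$1"
}

# Example call (commented out; adjust the path to your own browser profile):
# count_cookies "/Users/yourlogin/Library/Application Support/Camino/cookies.txt" livejournal.com
```

A count of 0 usually means you're pointing at the wrong file, or you weren't logged in when the browser last wrote it.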

What's the difference between this and using an LJ client, or using ljdump?

wget is just like a web browser, only it doesn't display pages. It merely fetches them. Clients like ljdump and Xjournal and so on use LiveJournal's API, that is, its programming interface. The API lets you fetch post contents & post metainformation in a compact manner. It's more efficient than spidering like this is, because it doesn't build full styled web pages. You just get the tiny amounts of text that are what you typed in when you originally wrote a post or a comment, plus a few numbers and strings that represent the filter or tags you used.

The wget method makes LJ render every single page, and is thus somewhat anti-social. The results are more like what people want when they want a local archive of their LJ, however. That is, if they want that archive so they can browse.

The XML documents that something like ljdump produces are more useful for further scripting. For instance, I could write an ljdump-variant tool that took everything I posted to LJ and reposted it to my new InsaneJournal account. It would be a PITA to do that with the results of this process. The API is also better for incremental downloads, because Danga's API designers thoughtfully included a way to request IDs of resources that have changed since the last fetch. In short, once I get around to writing an archiver that uses the API, it'll be better in most ways. (Though I'm not overjoyed by the work of duplicating LJArchive in non-proprietary programming languages and whatnot.)

But please note the corollary: Don't abuse this by running it over and over. Run it once.

I would appreciate hearing from a non-programmer how the wget download from VersionTracker went for you.
Tags: geek, meta, tools
