Making a local archive with wget

Do you really want to back up your LJ locally? With icons and comments and all that happy stuff? Then you want to use wget. This is a GNU utility for grabbing web and FTP resources. It comes with most Linux distros, but not with OS X.

1. Get wget. This is the hardest step of the entire process, because Apple chose several releases ago to cease including wget with OS X. Your choices are:
- download from VersionTracker
- install through one of the common port package managers (I use DarwinPorts; many people happily use Fink)
- the GNU page has download URLs for Windows versions, e.g., this one
- build from source
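
Before going through any of those options, it may be worth checking whether wget is already on your machine. A minimal sketch, assuming a POSIX shell in Terminal; the variable name wget_status is made up for this example:

```shell
# Report whether wget is on the PATH, and which version if so.
# wget_status is a hypothetical variable name used only in this sketch.
if command -v wget >/dev/null 2>&1; then
    wget_status="found: $(wget --version | head -n 1)"
else
    wget_status="not found"
fi
echo "wget $wget_status"
```

If this prints "wget not found", pick one of the install options above and run the check again afterwards.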

2. Run your web browser. Log into LiveJournal if you're not already. This saves a session cookie in your browser's cookies file. We're about to piggyback on that login session.

3. In a handy text editor, build two wget command lines. Start with this template:
wget --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies /path/to/cookies.txt --header "X-LJ-Auth: cookie" http://yourljname.livejournal.com/
wget -r -l6 -H -Dstat.livejournal.com -Duserpic.livejournal.com -Dyourljname.livejournal.com --page-requisites --convert-links -R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt --header "X-LJ-Auth: cookie" http://yourljname.livejournal.com/

Replace the placeholder parts (/path/to/cookies.txt and every yourljname) with your information. Any spaces in the cookie path need to be escaped with backslash.

For Firefox, your cookie path will be something like:
/Users/yourlogin/Library/Application\ Support/Firefox/Profiles/default.something_random/cookies.txt

For Camino:
/Users/yourlogin/Library/Application\ Support/Camino/cookies.txt

For Safari... uh, yeah, the Safari cookie plist file is not in the Netscape cookie format, so SOL. Download Firefox just for the occasion. Or run this, but most non-programmers will find just using Firefox or Camino easier. I love Camino, by the way. It's a true Mac application with the Mozilla rendering engine.

Windows users are on your own for cookies.txt file locations. Tell me, and I'll amend these instructions. But Windows users should probably just use LJArchive and be done with it.
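
Whichever browser you use, a quick sanity check on the cookie file can save a wasted run. The sketch below assumes the file is in the tab-separated Netscape cookies.txt layout, with the domain in the first field; has_lj_cookie is a made-up helper name, not part of wget:

```shell
# Succeeds if the given Netscape-format cookie file has at least one
# entry whose domain (first tab-separated field) ends in livejournal.com.
# has_lj_cookie is a hypothetical helper name for this sketch.
has_lj_cookie() {
    awk -F '\t' '$1 ~ /livejournal\.com$/ { found = 1 } END { exit !found }' "$1"
}
```

You could call it as has_lj_cookie /path/to/cookies.txt && echo "looks usable". It only confirms that some LiveJournal cookie is present; it cannot tell whether the session is still valid.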

My command lines look like this:
wget --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies /Users/antenna/Library/Application\ Support/Camino/cookies.txt --header "X-LJ-Auth: cookie" http://antennapedia.livejournal.com/
wget -r -l6 -H -Dstat.livejournal.com -Duserpic.livejournal.com -Dantennapedia.livejournal.com --page-requisites --convert-links -R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt --header "X-LJ-Auth: cookie" http://antennapedia.livejournal.com/2006/
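
If you would rather not hand-edit two long lines, the same pair of commands can be wrapped in a small script. This is only a sketch of the template above: LJ_USER, COOKIE_FILE, DRY_RUN and the function names are invented for the example, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrapper around the two template commands. LJ_USER,
# COOKIE_FILE, DRY_RUN, run, login_fetch and mirror_fetch are all
# names made up for this sketch.
LJ_USER="${LJ_USER:-yourljname}"
COOKIE_FILE="${COOKIE_FILE:-/path/to/cookies.txt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; 0 = run them

# Print the command in dry-run mode, otherwise execute it.
# The printed form drops the shell quoting, so it is for inspection only.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# First command: load the browser cookies and save them, session
# cookies included, to wget_cookies.txt.
login_fetch() {
    run wget --cookies=on --keep-session-cookies \
        --save-cookies wget_cookies.txt --load-cookies "$COOKIE_FILE" \
        --header "X-LJ-Auth: cookie" "http://$LJ_USER.livejournal.com/"
}

# Second command: the recursive fetch, reusing wget_cookies.txt.
mirror_fetch() {
    run wget -r -l6 -H \
        -Dstat.livejournal.com -Duserpic.livejournal.com \
        "-D$LJ_USER.livejournal.com" \
        --page-requisites --convert-links \
        -R "*mode=*,*replyto=*,*/friends*" \
        --cookies=on --keep-session-cookies \
        --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt \
        --header "X-LJ-Auth: cookie" "http://$LJ_USER.livejournal.com/"
}

login_fetch && mirror_fetch
```

Saved under a name of your choosing, it would be run with LJ_USER and COOKIE_FILE set in the environment, and with DRY_RUN=0 once the printed commands look right. Quoting "$COOKIE_FILE" means spaces in the path need no backslashes here.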

4. Run Terminal, cd yourself to whatever directory you want the archive in, copy and paste your first command line.

5. Run the second command.

6. Go make a pot of coffee. If you have lots of LJ posts, two pots. You might be watching it chug for an hour or more.

7. Inside a folder named yourljname.livejournal.com, there's a file named index.html. Double-click.

8. Enjoy.

What is this tool doing?

You're spidering your own LiveJournal site, essentially, and using cookies so the spidering tool is logged in as you.

wget is a tool for grabbing the contents of URLs and storing them locally. It has some extremely nice features for mirroring web sites: it can parse the contents of a page to find links within it, and it can then follow those links. You can control which links it chooses to follow with accept and reject patterns (shell-style wildcards). (The command line above skips reply pages, for instance.) You can also tell it to fetch page requisites, that is, things like style sheets and images required to render a page correctly. And then, to top it all off, wget can rewrite links in the local copies of pages so they point to your local mirror, not back out to the web.

So it's a grand way to get a complete local archive of a remote resource. Which is, in fact, what it was designed to do.

(By the way, you can't do this with curl, which does come with OS X, because curl doesn't do recursive gets. So wget is the tool of choice for mirroring web sites.)

Normally this would do no good for getting your LJ downloaded, because many people hide things behind the friends-lock, and therefore a spider can't see them. But wget will load your cookie file, if you tell it to, and therefore it will be treated by LJ as if it's logged in. Because it is logged in. So it will get your entire journal, including flocked and private posts.

What's the difference between this and using an LJ client, or using ljdump?

wget is just like a web browser, only it doesn't display pages. It merely fetches them. Clients like ljdump and Xjournal and so on use LiveJournal's API, that is, its programming interface. The API lets you fetch post contents & post metainformation in a compact manner. It's more efficient than spidering like this is, because it doesn't build full styled web pages. You just get the tiny amounts of text that are what you typed in when you originally wrote a post or a comment, plus a few numbers and strings that represent the filter or tags you used.

The wget method makes LJ render every single page, and is thus somewhat anti-social. The results are more like what people want when they want a local archive of their LJ, however. That is, if they want that archive so they can browse.

The XML documents that something like ljdump produces are more useful for further scripting. For instance, I could write an ljdump-variant tool that took everything I posted to LJ and reposted it to my new InsaneJournal account. It would be a PITA to do that with the results of this process. The API is also better for incremental downloads, because Danga's API designers thoughtfully included a way to request ids of resources that have changed since the last fetch. In short, once I get around to writing an archiver that uses the API, it'll be better in most ways. (Though I'm not overjoyed by the work of duplicating LJArchiver in non-proprietary programming languages and whatnot.)

But please note the corollary: Don't abuse this by running it over and over. Run it once.
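
One way to make that single run gentler on the server is to have wget pause between requests. The sketch below uses wget's --wait, --random-wait and --limit-rate options; the values are arbitrary examples, and polite_wget and the WGET override are names invented here:

```shell
# Run wget with a pause between requests and a bandwidth cap.
# polite_wget and the WGET override are hypothetical names for this
# sketch; the numbers are arbitrary, not anything LiveJournal asks for.
polite_wget() {
    "${WGET:-wget}" --wait=2 --random-wait --limit-rate=200k "$@"
}
```

Using polite_wget in place of wget in the second command line would slow the crawl down, so budget more coffee.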

I would appreciate hearing from a non-programmer how the wget download from VersionTracker went for you.
  • Current Music: Spider (Jon Hopkins Remix) : Leo Abrahams : EP1
Do you maybe know whether the ljArchive source code can really be made to run under mono instead of the .NET framework and be platform independent? Its site seems to imply it could, but didn't give any further instructions on how one would go about compiling or running the code. I mean, I could install mono on my Linux box, but I'm not a programmer, and there were no instructions in the source code that I could discern. It didn't come with a practical readme.txt or anything like the notes Linux programs include, the ones that tell you, for example, to just do the "configure, make, make install" thing when you download some C source code.
I dunno. It's a possibility that occurred to me to try, but... mono is a port of C# to various other platforms, not a port of the .NET frameworks. It would surprise me if it just ran without further work. I'd rather solve the problem in a self-contained way with python and its standard libraries, if I can. Or with the fork of iJournal I've been playing with, for a pure Cocoa/Objective-C solution.

Grrrr. What I really need is infinite time. Can you arrange that for me?
Seriously? But the mono project page says:
"What is Mono?
Mono provides the necessary software to develop and run .NET client and server applications on Linux, Solaris, Mac OS X, Windows, and Unix."

*is all confused now*

I thought it might be like with Java programs or something, that you could install the interpreter and then have at least a chance for it to run if it doesn't include too much windows specific crap.

And sorry, I'm right out of portable time dilation fields. :(
Hrm! Maybe, then. It's at least worth trying. What worries me is that "dot NET" is this really fuzzy Microsoft marketing term that was used for everything for a while.
You are my new best friend. For weeks now I've been trying to solve a particular problem: how to download a friend's LJ, including locked posts visible to me, without just going through and saving them one by one. I was able to make wget either ignore robots.txt or use cookies (so as to get the locked posts), but not both. And none of those programs like ljdump work for journals other than one's own, even though there's no reason they shouldn't -- I wasn't trying to get any information I couldn't already, I just wanted to automate it.

I haven't decoded yet why your lines worked and mine didn't. But if you ever run for office, I'm voting for you twice.
Heh, thanks.

LJdump and tools like my own ljarchive... hmm. I wonder if they could be updated to get everything that the authenticated user has privileges to see. A quick look at the LJ API docs seems to suggest that it might be possible (get an auth token for you, pass the user name of the target user, get that user's sync items). Might not be, either, because they might have limited the API, either for security reasons, or because they just didn't think about it, or because they don't want people writing non-web-browser tools for reading LJ.
Thank you for this. One question before I run it: Why the "2006"? That is, your second command line ends with "http://antennapedia.livejournal.com/2006/" (Looking at that URL, 2006 appears to be when your first LJ entry dates from — which makes me wonder if I should use "2002" there.)
Yes, 2006 is the year I started my LJ, so that's my starting point page. You should probably use your first year, as you deduced.
Thanks. Unfortunately it didn't work — it mirrored all my public entries, but none of my friends-locked ones.

Rather than retry and hammer LJ repeatedly, I've revised it to snag a particular entry, given on the command-line:


#!/bin/sh
# wglj.sh
# based on http://antennapedia.livejournal.com/239955.html

# antenna's first line, modified with my cookies path and LJ name:
wget --cookies=on --keep-session-cookies --save-cookies wget_cookies.txt \
--load-cookies /home/xela/.mozilla/default/1cje5y3t.slt/cookies.txt \
--header "X-LJ-Auth: cookie" http://yakshaver.livejournal.com/

# antenna's second line, modified with my LJ name and ending
# with the first year of my LJ:
# wget -r -l6 -H -Dstat.livejournal.com -Duserpic.livejournal.com \
# -Dyakshaver.livejournal.com --page-requisites --convert-links \
# -R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies \
# --save-cookies wget_cookies.txt --load-cookies wget_cookies.txt \
# --header "X-LJ-Auth: cookie" http://yakshaver.livejournal.com/2002/

# The above, modified to not be recursive (i.e. dropped the
# -r -l6 and -H flags), and to be given the URL of a particular
# entry on the command-line
wget -Dstat.livejournal.com -Duserpic.livejournal.com \
-Dyakshaver.livejournal.com --page-requisites --convert-links \
-R "*mode=*,*replyto=*,*/friends*" --cookies=on --keep-session-cookies \
--save-cookies wget_cookies.txt --load-cookies wget_cookies.txt \
--header "X-LJ-Auth: cookie" "$1"

If I run it on a public entry, e.g.

$ ./wglj.sh http://yakshaver.livejournal.com/132396.html

it creates a yakshaver.livejournal.com subdirectory and downloads the page. If I run it on a friends-locked entry, e.g.

$ ./wglj.sh http://yakshaver.livejournal.com/132243.html

it creates a www.livejournal.com subdirectory and places in it a file called "index.html?returnto=http:%2F%2Fyakshaver.livejournal.com%2F132243.html", which is essentially the page you (or anyone not on my friends list) would get if you tried to view that URL.

I realize you no doubt have better things to be doing than debugging this, but I'm stuck. If you're interested, I've put the stderr output from those two commands (there was, as you'd expect, no stdout) online at www.yakshavers.net/~xela/wglj/.

[Edit: Oh, btw, the URLs above were copy-pasted from open browser tabs in the browser whose cookie file I was using (Seamonkey on Linux), so it shouldn't be that I didn't have cookies. Also, I copied the script to my Mac and used it with Firefox's cookies file (the Firefox with which I am logged into LJ and typing this text), with the same results.]

Edited at 2008-10-13 09:01 pm (UTC)
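
The symptom described in this comment, a saved "index.html?returnto=..." page under www.livejournal.com, can be checked for mechanically after a run. A rough sketch, assuming the redirect page is always saved under a name of that shape; was_bounced_to_login is a made-up helper name:

```shell
# Succeeds if the directory tree contains a saved login-redirect page
# named like the one reported above.
# was_bounced_to_login is a hypothetical helper name for this sketch.
was_bounced_to_login() {
    # In find's -name pattern "?" means "any single character", which
    # also matches the literal "?" in the saved file name.
    [ -n "$(find "$1" -type f -name 'index.html?returnto=*' 2>/dev/null | head -n 1)" ]
}
```

For example, was_bounced_to_login . && echo "wget was not logged in" could be run from the archive directory after fetching a single locked entry, before committing to a full recursive run.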
Hrm. I wonder if LJ changed the way its session cookies work. I'll attempt to repro with my two browsers (Camino & Firefox 3 on OS X) and see what I can figure out.
FWIW, I tried the one thing I could think of — that LJ was checking the user-agent string against the browser it issued the cookie to. Using Firefox's cookies.txt on my Mac and adding
-U "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2008092414 Firefox/3.0.3"
to the command line had no effect.
super thread revival
Hey, I found your post via Google. Thanks for getting me on the right track, but unfortunately your method didn't work for me. FWIW this command "wget --load-cookies /path/to/cookies.txt -r -N -l inf -k http://yourUsername.livejournal.com/" grabbed everything, after I installed the "export cookies" add-on for Firefox.
Some corrections for the passage of time

  • I think you need to add -e robots=off nowadays, because wget wasn't fetching recursively without it. It seems LJ has added a disallow-all rule to its robots.txt. :-/

  • Here's the domain list I used (includes my own subdomain): --domains squirrelitude.livejournal.com,l-userpic.livejournal.com,l-userpic.livejournal.net,l-stat.livejournal.com,l-stat.livejournal.net
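
The domain list in that last correction can be generated for any journal name. A small sketch: lj_domains is a made-up helper, and the host names are simply the ones quoted in the comment above, which may well have changed again since:

```shell
# Print the comma-separated --domains value from the comment above for
# a given journal name. lj_domains is a hypothetical helper name; the
# host names are copied from that comment and are not verified.
lj_domains() {
    printf '%s.livejournal.com,l-userpic.livejournal.com,l-userpic.livejournal.net,l-stat.livejournal.com,l-stat.livejournal.net\n' "$1"
}

lj_domains yourljname
```

In the recursive command, the three -D options would then be replaced by --domains "$(lj_domains yourljname)", with -e robots=off added alongside.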