Converting a Blogger Blog to InDesign Tagged Text with Perl

Our family has a fairly sizable blog that we (actually, mostly my wife, Nancy) have kept updated for several years. Since it contains so much family history we wanted an easy way to preserve it in print form, just in case Blogger gets the boot from Google some day (not that that will ever really happen…).

Since we’re both hobbyist graphic designers (I taught a couple of print layout and design classes as an undergrad at BYU and have made several books at Lulu.com), we decided to lay out and print each year of our blog to keep for posterity.

A couple years ago Nancy attempted this with our smaller Jordan blog for a print publishing class she took at BYU. We spent the bulk of our time manually copying and pasting each post and the subsequent comments into a huge Word document. She then ran a long series of find/replaces to clean up the messy, inconsistent typography, and then finally placed it into Quark (that evil program). Through a series of unfortunate events, Quark crashed repeatedly and corrupted her file multiple times—she was lucky to get her first draft turned in for her final project (she got an A, though. Phew!).

I knew there had to be a faster, more efficient way to wrangle all the blog text, but this was back in 2006, before Blogger had an open API or options to back up a blog. Primitive, dark days indeed :) .

However, last year Blogger introduced a fantastic new option: the ability to back up and export your entire blog, comments and all. Blogger spits out an Atom-formatted XML file that you can use to recreate your blog later on (or possibly import onto other platforms, like WordPress, I think). This was the key to simplifying the daunting task of collecting the text for our blog books. All we needed was a way to mangle the text in the XML file to create an InDesign-ready file.

So, I whipped up a semi-complicated Perl script that can parse an Atom-formatted XML file from Blogger and create a text file using InDesign Tagged Text to preapply paragraph and character styles. It also cleans up the typographic elements of the text, adding em and en dashes, removing empty paragraphs, etc. Additionally, it can add hidden index entries for each tag, essentially creating a barebones index for your book. And it only takes 10ish seconds to run on a large blog. It’s not perfect and could stand some good optimization, but it works.
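To give a feel for the typographic cleanup, here is a minimal sketch in plain Perl. It is not the code from format_for_id.pl; the sub name clean_typography and the exact rules (double hyphens become em dashes, hyphens between digits become en dashes, empty HTML paragraphs are dropped) are my own assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from format_for_id.pl.
sub clean_typography {
    my ($text) = @_;

    # A double hyphen (with optional surrounding spaces) becomes an em dash.
    $text =~ s/\s*--\s*/\x{2014}/g;

    # A hyphen between two digits becomes an en dash (e.g. year ranges).
    # Note this would also hit phone numbers and ISO dates.
    $text =~ s/(?<=\d)\s*-\s*(?=\d)/\x{2013}/g;

    # Drop paragraphs that hold only whitespace, &nbsp;, or <br> tags.
    $text =~ s{<p>(?:\s|&nbsp;|<br\s*/?>)*</p>}{}gi;

    return $text;
}

binmode(STDOUT, ':encoding(UTF-8)');
print clean_typography('<p>&nbsp;</p><p>We lived there 2004 - 2006 -- mostly.</p>'), "\n";
```

Rules like these are order-sensitive: the double hyphen has to be handled before any single-hyphen rule, or it would never be seen.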

Additionally, since InDesign tagged text works with, well, text, it won’t place your images for you. Instead it will insert the location of the image (the src=whatever.jpg of the img tags) in between curly braces { }. You’ll then need to manually place all the images later, deleting the braced text.
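The image handling can be sketched with a single substitution. This is only an illustration under my own assumptions (a hypothetical sub named images_to_placeholders and a simple regex rather than a real HTML parser), not necessarily how the script does it.

```perl
use strict;
use warnings;

# Hypothetical helper: swap each <img> tag for its src wrapped in curly braces.
sub images_to_placeholders {
    my ($html) = @_;

    # Capture the quoted src value ($2) and replace the whole tag with {src}.
    $html =~ s/<img\b[^>]*?\bsrc\s*=\s*(["'])(.*?)\1[^>]*>/'{' . $2 . '}'/gie;

    return $html;
}

print images_to_placeholders('<p>Hi <img src="http://example.com/a.jpg" alt="x" /> there</p>'), "\n";
```

A regex like this is fine for Blogger's fairly regular markup, but it will miss unquoted src attributes.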

In the future, the script could be changed to output XML, which does let you include pictures, but you’d have to have all your images on your hard drive already. The script could go and download all the linked images, but it’s not really a good idea to place low resolution, web-optimized images in a print document. In our case we have high-res copies of all the pictures on the blog stored on an external hard drive, so we just have to go and find and place the images we want. It takes more time, but it makes better quality documents in the end.

Also, links are preserved as footnotes—all href="whatever.html"s show up as the footnote text.
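Roughly, the link conversion looks like the sketch below. I'm assuming the FootnoteStart and FootnoteEnd tags from Adobe's Tagged Text reference here; the sub name links_to_footnotes is hypothetical, and the real script may wrap the footnote in additional style tags.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the link text and move the href into a footnote.
sub links_to_footnotes {
    my ($html) = @_;

    # Assumed footnote tag pair; adjust if your InDesign version expects more.
    my ($open, $close) = ('<FootnoteStart:>', '<FootnoteEnd:>');

    # $2 is the href value, $3 is the visible link text.
    $html =~ s{<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a>}{$3$open$2$close}gis;

    return $html;
}

print links_to_footnotes('See <a href="http://example.com/">my site</a>.'), "\n";
```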

How to use the script

First, download the script and its supporting files from GitHub. If you’re using Mac OS or Linux, make sure the main script file, format_for_id.pl, is executable by typing chmod +x format_for_id.pl at the terminal.

Next, make sure you have Perl installed on your system. If you are using Linux or Mac OS X, you’re good to go. If you’re using Windows, download and install Strawberry Perl for Windows. You can also use ActivePerl, but installing modules is a little more difficult.

The script uses several additional CPAN modules that you’ll need to install. You’ll need to use the CPAN shell to do so.

  • On Windows with Strawberry Perl: open the packaged CPAN client in the Start Menu folder
  • On Windows with ActivePerl: Good luck. There is a large repository of specially compiled CPAN modules for ActiveState, and reportedly there is a kind of CPAN shell, but I haven’t gotten either to work too well. Stick with Strawberry Perl. It’s better :)
  • On Mac OS X: type perl -MCPAN -e shell at a terminal window
  • On Linux: type sudo cpan at a terminal window

(If it’s your first time running the CPAN shell you’ll be asked to configure the installation environment. Choose the option to automatically configure everything.)

Once everything is set up and you see the cpan> shell prompt, type install Package::Name (e.g. install Date::Format) for each of the dependent CPAN packages listed at the beginning of format_for_id.pl.
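If you want to double-check that the modules installed correctly before running the real script, a small check like this can help. It is a sketch of my own, not part of the project; missing_modules is a hypothetical name, and Date::Format is the only dependency I've listed here, so add the rest from the top of format_for_id.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of modules that cannot be loaded.
sub missing_modules {
    my @modules = @_;
    my @missing;

    for my $module (@modules) {
        # Turn Package::Name into Package/Name.pm, the form require expects.
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }

    return @missing;
}

my @missing = missing_modules('Date::Format');
print @missing ? "Missing: @missing\n" : "All modules found.\n";
```

Anything reported as missing can then be installed from the cpan> prompt as described above.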

Log in to your Blogger Dashboard and export your blog as an XML file by going to Settings > Basic > Export blog. Place the XML file in the script folder.

Open config.cfg with a text editor and change the settings as needed. Set the input file to your newly downloaded XML file, choose the year you want to extract, set an output file, and set the file header, either <UNICODE-MAC> or <UNICODE-WIN>, depending on what platform you use InDesign on.

For now, leave all the style tags as they are so you can place the text into the example InDesign file and see how everything works. You can change them later and rerun the script.
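In case you're curious what the output looks like, here is a minimal sketch of building tagged text by hand. The style names Post Title and Post Body are made up for the example (use whatever config.cfg and your InDesign document define), and the two subs are hypothetical, not lifted from the script. It relies on the header line coming first, each paragraph starting with a ParaStyle tag, and literal <, > and \ being escaped with a backslash.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash-escape the characters tagged text reserves.
sub escape_tagged_text {
    my ($text) = @_;
    $text =~ s/([\\<>])/\\$1/g;
    return $text;
}

# Hypothetical helper: one paragraph with a paragraph style applied.
sub tagged_paragraph {
    my ($style, $text) = @_;
    return "<ParaStyle:$style>" . escape_tagged_text($text) . "\n";
}

# Use <UNICODE-WIN> instead if you run InDesign on Windows.
my $header = '<UNICODE-MAC>';

my $output = "$header\n"
    . tagged_paragraph('Post Title', 'Our first post')
    . tagged_paragraph('Post Body',  'It was < 0 degrees today.');

print $output;
```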

Finally, using the terminal or command prompt, navigate to the folder with the script and run it by typing perl format_for_id.pl. If everything goes well you should have an output file at the location you specified, full of InDesign tags.

Open up Example.indd in InDesign CS3 or above and place the generated text file. All the text should come in perfectly with all the needed paragraph and character styles applied. Bravo!

Advanced usage

Obviously you’ll want to make some changes to the format of the output text. You might not want the post URL right after the tag—you might want it at the end, or not want it at all. With a little knowledge of Perl, you can edit the main script directly, mostly the combineSortClean() sub near the end of the script, to change the order of the output elements.

You can also disable tag indexing and allow the tags to be output with a paragraph style. Just comment and uncomment the appropriate sections in the code. The same goes for the author-specific character styles—comment and uncomment the needed lines in the script.

You can rename the styles and use your own. Just make sure the styles exist in your InDesign document before you place the output file. InDesign will throw away any tags that refer to styles that don’t already exist in the document.

I made the script for our specific blog, so it doesn’t take every possible paragraph or character style into account. If you want additional functionality, you’ll have to add it. Feel free to fork the project off of GitHub and add to/improve it. That’s why it’s open source :)

If you have any questions, ask in the comments. Report any issues at the project GitHub page. I’ll try to respond quickly—I generally do, as evidenced by my pdftk-php project :)

Good luck!
