Thursday, 27 May 2010

Optimizing for $money?

While some sites optimize for speed and bandwidth usage, others do the opposite. About a decade ago, when the Internet connection at the office (then at a pathetic 128-256 kbps) was experiencing a serious slowdown, I noticed that images and even flag icons on the Summer Olympics website were deliberately made totally uncacheable by setting the Expires header to a past date. Apparently IBM is still pulling the same trick on the Grand Slam sites.

$ cctrl() { perl -MLWP::UserAgent -E'$ua=LWP::UserAgent->new; $res=$ua->get($ARGV[0]); say $res->header("cache-control")' $1; }

RSS icon, only cacheable for several hours:

$ cctrl http://www.rolandgarros.com/images/nav/rgr_nv_00000g3.gif
max-age=15000

In fact, all content photos are also cacheable for only several hours, despite already having unique URLs.

The yellow button, which surely won't change often (and has a fairly unique URL anyway), is cacheable for only a little over 11 minutes!

$ cctrl http://www.rolandgarros.com/images/misc/rgr_ms_00000g2.gif
max-age=700

Compare to:

$ cctrl http://www.facebook.com/images/app_icons/newsfeed.gif
max-age=2592000

or even:

$ cctrl http://www.ibm.com/i/v16/t/ibm-logo.gif
max-age=2592000
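
To put those numbers in perspective, here is a small sketch that turns a Cache-Control value into a human-readable duration. The helper names max_age and human_duration are made up for this illustration; they are not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the max-age value (in seconds) from a
# Cache-Control header string, or undef if there is none.
sub max_age {
    my ($header) = @_;
    return undef unless defined $header && $header =~ /max-age=(\d+)/;
    return $1;
}

# Hypothetical helper: format a number of seconds as e.g. "4h 10m".
sub human_duration {
    my ($secs) = @_;
    my @parts;
    for my $unit ([d => 86400], [h => 3600], [m => 60], [s => 1]) {
        my ($label, $size) = @$unit;
        my $n = int($secs / $size);
        next unless $n;
        push @parts, "$n$label";
        $secs -= $n * $size;
    }
    return @parts ? join(' ', @parts) : '0s';
}

for my $header ('max-age=15000', 'max-age=700', 'max-age=2592000') {
    printf "%-16s => %s\n", $header, human_duration(max_age($header));
}
# max-age=15000    => 4h 10m
# max-age=700      => 11m 40s
# max-age=2592000  => 30d
```

So the Facebook and IBM icons may be cached for 30 days, while the Roland Garros button has to be revalidated or re-fetched after less than 12 minutes.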

Friday, 21 May 2010

Custom class dumping for Data::Dump

I'm using DateTime objects a lot these days: any time I get some date/time data from outside of Perl, the first thing I do is convert it to a DateTime object, to avoid calculation/formatting hassle later.

However, the dumps are not pretty.

% perl -MDateTime -MData::Dump -e'dd [DateTime->now]'
[
  bless({
    formatter => undef,
    local_c => {
      day => 21,
      day_of_quarter => 51,
      day_of_week => 5,
      day_of_year => 141,
      hour => 8,
      minute => 55,
      month => 5,
      quarter => 2,
      second => 36,
      year => 2010,
    },
    local_rd_days => 733913,
    local_rd_secs => 32136,
    locale => bless({
      "default_date_format_length" => "medium",
      "default_time_format_length" => "medium",
      en_complete_name => "English United States",
      en_language => "English",
      en_territory => "United States",
      id => "en_US",
      native_complete_name => "English United States",
      native_language => "English",
      native_territory => "United States",
    }, "DateTime::Locale::en_US"),
    offset_modifier => 0,
    rd_nanosecs => 0,
    tz => bless({ name => "UTC" }, "DateTime::TimeZone::UTC"),
    utc_rd_days => 733913,
    utc_rd_secs => 32136,
    utc_year => 2011,
  }, "DateTime"),
]

It gets worse when you have several records, each with a DateTime object in it.

That's why I added a couple of mechanisms that let us customize how a class is dumped.

$ perl -Ilib -MDateTime -MData::Dump -e'$Data::Dump::CUSTOM_CLASS_DUMPERS{"DateTime"} = sub { "$_[0]" }; dd [DateTime->now]'
[2010-05-21T08:57:45]

or:

$ perl -Ilib -MDateTime -MData::Dump -e'package DateTime; sub dump { "$_[0]" }; package main; dd [DateTime->now]'
[2010-05-21T08:58:09]
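
To illustrate the idea behind the first mechanism (a per-class table of dumper callbacks), here is a toy dumper using only core modules. It is a sketch of the concept, not the code in the patch: mini_dump, the My::Date class, and the lexical %CUSTOM_CLASS_DUMPERS table are all made up for this example.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical registry: class name => callback returning the dump string.
my %CUSTOM_CLASS_DUMPERS;

# Toy recursive dumper that consults the registry for blessed references.
sub mini_dump {
    my ($data) = @_;
    if (my $class = blessed $data) {
        my $dumper = $CUSTOM_CLASS_DUMPERS{$class};
        return $dumper->($data) if $dumper;
        return "bless(..., \"$class\")";    # no custom dumper registered
    }
    if (ref $data eq 'ARRAY') {
        return '[' . join(', ', map { mini_dump($_) } @$data) . ']';
    }
    if (ref $data eq 'HASH') {
        return '{ '
            . join(', ', map { "$_ => " . mini_dump($data->{$_}) } sort keys %$data)
            . ' }';
    }
    return defined $data ? "\"$data\"" : 'undef';
}

{
    package My::Date;    # stand-in for DateTime, to keep the example core-only
    sub new { my ($class, %args) = @_; return bless {%args}, $class }
    sub iso {
        my $self = shift;
        return sprintf '%04d-%02d-%02d', @$self{qw(year month day)};
    }
}

$CUSTOM_CLASS_DUMPERS{'My::Date'} = sub { $_[0]->iso };
print mini_dump([ My::Date->new(year => 2010, month => 5, day => 21) ]), "\n";
# [2010-05-21]
```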

I know some other dumpers on CPAN probably have this ability, but I like Data::Dump's output.

If you want to take a look at a couple of small patches to Data::Dump: http://github.com/sharyanto/data-dump

I've also contacted Gisle Aas to ask what he thinks of it.

Thursday, 13 May 2010

On RJBS's automatic version numbering scheme

Every time I browse through CPAN's recent uploads and see module versions using RJBS's automatic numbering scheme, like 2.100920 or 1.091200, I tend to read them as 2.(noise) and 1.(more noise).

The problem is that it doesn't look like a date at all (is there any country or region that uses day of year in its dates??). I've never been bothered enough by this, as I don't use the scheme myself, but I have always suspected that the obfuscation is deliberate, for some deep reason.

It turns out that it's just a matter of saving space and avoiding floating-point issues. I'm not convinced, though: is x.100513n (YYMMDD, 6 digits + 1-digit serial = 7 digits) really that much longer than x.10133n (YYDDD, 5 digits + 1-digit serial = 6 digits)? Is there a modern platform where Perl's numbers are represented as 32-bit single-precision floats (only about 7 decimal digits of precision), so that the YYMMDD scheme would become a problem once n reaches two digits?

Based on past experience, since it is unlikely that I will do more than 20 releases in one month (usually only one or two a month, or even fewer), if I were to adopt a date-based automatic versioning policy, perhaps I would pick x.YYMMn, where n is omitted for the first release, then runs 1..9, then 91..99 (then 991..999, and so on). This way most releases get the shortest possible number of digits, and the sequence still sorts correctly as a decimal number. I don't "incur a cost" for the first few releases (which will be all there is anyway, most of the time).
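
For comparison, the two encodings can be generated with strftime from the core POSIX module. This is only a sketch of the two formats discussed here, not RJBS's actual tooling; the function names are made up.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Fractional part of a date-based version: two-digit year, then either the
# day of year (%j) or month and day, followed by a one-digit serial number.
# @tm is (sec, min, hour, mday, mon (0-based), year - 1900).
sub version_yyddd  { my ($serial, @tm) = @_; return strftime('%y%j',   @tm) . $serial }
sub version_yymmdd { my ($serial, @tm) = @_; return strftime('%y%m%d', @tm) . $serial }

my @date = (0, 0, 0, 13, 4, 110);    # 13 May 2010
print '1.', version_yyddd(0, @date),  "\n";    # 1.101330
print '1.', version_yymmdd(0, @date), "\n";    # 1.1005130
```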

1.1005
1.10051
1.10052
...
1.10059
1.100591
1.100592
...
1.100599
1.1005991
...
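
The suffix rule can be sketched in a few lines. The function name release_suffix is hypothetical; the point is that a prefix of nines keeps every new version numerically larger than the previous one.

```perl
use strict;
use warnings;

# Suffix for the $k-th release within a period ($k = 0 is the first release):
# '' for the first, then 1..9, then 91..99, then 991..999, and so on.
sub release_suffix {
    my ($k) = @_;
    return '' if $k == 0;
    my $nines = int(($k - 1) / 9);    # how many leading 9s
    my $digit = ($k - 1) % 9 + 1;     # final digit, 1..9
    return ('9' x $nines) . $digit;
}

my @versions = map { '1.1005' . release_suffix($_) } 0 .. 20;
print join(' ', @versions[0, 1, 9, 10, 18, 19]), "\n";
# 1.1005 1.10051 1.10059 1.100591 1.100599 1.1005991
```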


In fact, I bet most modules have only a few releases per year. So how about this scheme, x.YYn:

1.10 - first release of the year
1.101 - second
1.102 - third
...
1.109 - tenth
1.1091 - eleventh
1.1092 - 12th
...
1.1099 - 19th
1.10991 - 20th


Or how about x.Dn (releases per decade) or even x.Cn (releases per century)? :-)

My brain prefers that I don't use long version numbers, except when the version number is long because of a date (e.g. to indicate the freshness of a release). But why torture ourselves with a date that takes several seconds to parse in our heads?


So I'll stick with 0.01, 0.02, 0.03, ... for now.

Guessing the gender of Indonesians based on first name

As promised in a blog post a few months ago, today I released Locale-ID-GuessGender-FromFirstName. The module name turned out rather long, didn't it? :-p

That's because in the future, along with the planned companion module Locale-ID-ParseName-Person, we will also be able to guess a person's gender from other attributes of the name, for example from salutations (Bapak/Ibu/Bung/Mbak), from religious titles (H/Hj), from regional naming patterns (e.g. I Ketut/Ni Ayu), and so on.
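
A prefix-based guess of that kind could look roughly like the sketch below. The lookup table and the function name guess_gender_from_prefix are hypothetical and are not the API of Locale-ID-GuessGender-FromFirstName or the planned Locale-ID-ParseName-Person.

```perl
use strict;
use warnings;

# Hypothetical table: leading word of a full name => likely gender.
my %GENDER_BY_PREFIX = (
    bapak => 'M', bung => 'M', ibu => 'F', mbak => 'F',    # salutations
    h     => 'M', hj   => 'F',                             # religious titles (Haji/Hajjah)
    i     => 'M', ni   => 'F',                             # Balinese name prefixes
);

# Return 'M', 'F', or undef when the name has no recognized leading word.
sub guess_gender_from_prefix {
    my ($full_name) = @_;
    # Take the first word (optionally followed by a dot), but only if
    # more of the name follows it.
    return undef unless $full_name =~ /^\s*(\w+)\.?\s/;
    return $GENDER_BY_PREFIX{ lc $1 };
}

for my $name ('Hj. Siti Aminah', 'I Ketut Suardana', 'Budi Santoso') {
    my $guess = guess_gender_from_prefix($name);
    printf "%-18s => %s\n", $name, defined $guess ? $guess : 'unknown';
}
# Hj. Siti Aminah    => F
# I Ketut Suardana   => M
# Budi Santoso       => unknown
```

A real implementation would need more care, since a leading "I" or "H" can also be an initial.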

The accuracy and completeness of this first release cannot be relied on yet, but it is ready to be tried out. I have added about 1000 common names from our office's client database (because it is hard to find a better database, unlike in the US, where one can take data from the census bureau). A (very) simple heuristic algorithm has also been added, along with an algorithm that looks names up on Google.

Does anyone have the spare time to write a simple CGI script, or a Facebook application, as a web interface for this module? It could also collect more data and corrections along the way. I'd like to do it myself, I'm just lazy :p

Wednesday, 12 May 2010

perlmv: Renaming files with Perl code

perlmv is a script that I have personally been using all the time for years, but it was only uploaded to CPAN today. The concept is very simple: rename files by manipulating $_ in the Perl code you specify. For example, to rename all .avi files to lowercase:

$ perlmv -de '$_=lc' *.avi

The -d option is for dry-run, so that we can test our code before actually renaming the files. If you are sure that the code is correct, remove the -d (or replace it with -v, for verbose).
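
The core idea fits in a few lines of Perl. The sketch below is not perlmv's actual implementation; plan_renames is a made-up helper that only computes what a dry run would report, leaving the real rename() call to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code (a string of Perl) with $_ set to each
# file name, and return [old, new] pairs for the names that changed.
sub plan_renames {
    my ($code, @files) = @_;
    my @plan;
    for my $old (@files) {
        local $_ = $old;
        eval $code;                         # the code edits $_ in place
        die "Code failed for $old: $@" if $@;
        push @plan, [$old, $_] if $_ ne $old;
    }
    return @plan;
}

# Dry run: only print what would happen.
my @plan = plan_renames('$_ = lc', 'Movie.AVI', 'clip.avi');
print "$_->[0] -> $_->[1]\n" for @plan;
# Movie.AVI -> movie.avi

# A real run would then do something like:
#   rename $_->[0], $_->[1] or warn "Can't rename $_->[0]: $!\n" for @plan;
```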

perlmv can also save your code into scriptlets (files in ~/.perlmv/scriptlets/), so if you do:

$ perlmv -e 's/\.(jpe?g|jpe)$/.jpg/i' -W normalize-jpeg

You can later do this:

$ perlmv -v normalize-jpeg *.JPG *.jpeg

In fact, perlmv comes with several scriptlets you can use (more useful scriptlets will be added in the future):

$ perlmv -L
lc
pinyin
remove-common-prefix
remove-common-suffix
uc
with-numbers


Let me know if you have tried out the script.

Wednesday, 05 May 2010

So is wantarray() bad or not?

The style of returning different things in list vs scalar context has been debated for a long time (for one example, see this thread on PerlMonks).

A few months ago I made a decision that all API functions in one of my projects should return this:

return wantarray ? ($status, $errmsg, $result) : $result;

That is, we can skip error checking when we don't want to do it.

Now, in the spirit of Fatal and autodie, I am changing the above to:

return wantarray ? ($status, $errmsg, $result) :
do { die "$status - $errmsg" unless $status == SUCCESS; $result };
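
Here is a self-contained sketch of that pattern in action. The SUCCESS value of 200 and the get_answer function are assumptions made up for this example; only the return statement follows the code above.

```perl
use strict;
use warnings;
use constant SUCCESS => 200;    # assumed status code, for illustration only

# Hypothetical API function using the context-sensitive return.
sub get_answer {
    my ($ok) = @_;
    my ($status, $errmsg, $result) =
        $ok ? (SUCCESS, '', 42) : (500, 'backend down', undef);
    return wantarray ? ($status, $errmsg, $result) :
        do { die "$status - $errmsg\n" unless $status == SUCCESS; $result };
}

my $value  = get_answer(1);    # scalar context: just the result
my @triple = get_answer(0);    # list context: caller checks the status, no die
my $died   = !eval { my $v = get_answer(0); 1 };    # scalar context: failure dies
my ($oops) = get_answer(1);    # list context by accident: this is the status!

print "value=$value status=$triple[0] died=", ($died ? 1 : 0), " oops=$oops\n";
# value=42 status=500 died=1 oops=200
```

The last assignment shows how easy it is to trip: a pair of parentheses around the variable silently switches to list context and yields the status instead of the result.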


But somehow I can still see myself and others tripping over this in the future, as I already have several times. It's bad enough that for each API function one already has to remember the arguments and their types, plus one kind of return value and its type.

Maybe I should just bite the bullet and admit that the adventure into wantarray() was a mistake, and that context-sensitive returns should be left to @foo, localtime(), and a few other classic Perl 5 builtins that are ingrained in every Perl programmer's mind.