Monotrematica: October 2007

Monday, October 29, 2007

Slow, interactive, but it works -- looking for scripts

I've got a lot of videos of my son. When imported from the camera they're AVI files (yes, I know that's just a container, but I don't know enough about video formats to know what the actual encoding is).

The videos are pretty large. I've found a good way to make the files smaller though. Using kino, I import them (I get a larger .dv file) and then convert that to MPEG4 AVI. That works pretty well. Large files (150MB or more) convert (at High Quality, full size, progressive, 2240 kb/s) to about 1/6th the original AVI size. Small files (30MB or so) only convert to half the original size. But that's pretty good, nevertheless. The proximate reason for looking for ways to make the files smaller is that I post some of them to YouTube, and original AVI files larger than 100MB get rejected.

Kino has a .FLV converter too. That's interesting. But I don't think I want to keep archives of .FLV files. I'm a bit conservative about file formats and I tend to go with standards (de facto or de jure) when given the choice between file formats which are at different points on the standard-to-non-standard continuum.

I've looked around for scripts to do this conversion. Haven't looked hard enough though, apparently. I could read the manpages and learn enough about video and audio encodings and transcoding so that I could hack up my own scripts. Too lazy to do that though. I'll wait until something comes along that's better than what I've got.

This is what it feels like to be a mere user :-). Normally I'd just totally geek and learn what I needed to get something working. Taking care of my son, and taking videos, and juggling disk space so that the videos will fit and can be backed up though, that takes all my time :-).

I've looked (via ps auxw) at the command line commands that kino spews to do the conversion. I may test some of that out. I don't know if everything I need is in there though. If it is, and a test works tomorrow, I'll hack something up and possibly post it here. But possibly not, too. Nonworking holiday tomorrow, I'll be playing with my son :-).
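A script along these lines might be a starting point. This is only a sketch: it assumes ffmpeg is installed and that its mpeg4 encoder at 2240 kb/s is close enough to what kino does, which I haven't confirmed against kino's actual command line. It also skips the intermediate .dv step, and the output naming scheme is made up for the example.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(source, bitrate_kbps=2240):
    """Return (command, output_path) for re-encoding `source` to MPEG-4 AVI."""
    source = Path(source)
    # Hypothetical naming scheme: clip.avi -> clip.mpeg4.avi, next to the original.
    output = source.with_name(source.stem + ".mpeg4.avi")
    command = [
        "ffmpeg",
        "-i", str(source),
        "-c:v", "mpeg4",             # MPEG-4 Part 2 video encoder
        "-b:v", f"{bitrate_kbps}k",  # target video bitrate
        str(output),
    ]
    return command, output


def convert_all(directory):
    """Convert every .avi file in `directory`, skipping files already converted."""
    # Note: the glob is case-sensitive on Linux, so clips named *.AVI need renaming
    # or a second pattern.
    for source in sorted(Path(directory).glob("*.avi")):
        if source.name.endswith(".mpeg4.avi"):
            continue  # this is one of our own outputs
        command, output = build_ffmpeg_command(source)
        if output.exists():
            continue  # converted on an earlier run
        subprocess.run(command, check=True)
```

Calling convert_all("/path/to/videos") would then walk one directory. The originals are left in place, so nothing is lost if the output turns out to be unusable.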

Sunday, October 21, 2007

Gmail Advanced

I just went and installed Better Gmail and Gmail Manager on my laptop, both Firefox add-ons. I'm going to have to install both of those everywhere.

The Mail.app mac-like skin from Better Gmail looks great. I first tried Super Clean but abandoned it quickly since I found it unusable.

Gmail Manager lets me have two or more tabs to Gmail, each tab being a different account. That's very useful and I may finally switch over to using web-based Gmail for my Gmail accounts because of that (I'll still have to keep evolution for backup and for access to my IMAP email at work). But finally Gmail is much more usable when I can have multiple tabs, one for each account.

I've also finally found a convenient way to download my spam (which I use to train a bogofilter adaptive filter, just for fun). I think Gmail must have fixed their POP3 support somehow. Previously, this technique didn't work but now it does. The solution is to:
A. go to the Spam folder
B. select all messages on the current page
C. add those messages to a label (e.g., ZSpamX)
D. mark the email not spam.
E. wait until they're downloaded by my fetchmail process running elsewhere.
F. once everything is downloaded, go into the label, select everything and mark them all Spam.
G. Go into Spam folder and delete all Spam.

Unfortunately, D doesn't work with the Gmail "select everything" link that pops up as an option when we've pressed the (Select) All link. So if I've got several pages of spam, I need to do A-D for each page. That's not bad though since I've got Gmail set up to show 100 emails at a time.

To be clear, that last is not a problem with either of the Gmail-related plugins. It's an issue with how Gmail works to begin with. Probably as a bar against people making a dumb mistake that does too global a job on the selected email, the "select everything" link that pops up when you click All does not allow you to assign the same label to all matching emails, nor does it allow you to unmark all matching emails as spam. I don't remember if it allows marking all matching email as spam. I've already collected and removed all my spam, so now I don't remember whether marking all the spam messages spam again was one step or ten.

In any case, the C and F steps are vital. I don't think they previously worked. Or if they did, I didn't think to try them back then. It's great though, now, to be finally able to grab all my spam so I can train my own spam classifier. Sometimes spam gets past all the spam classifiers between me and the internet, and then I'd like to have a personal spam classifier customized to my needs. Similarly, sometimes mail from friends or mailing lists is marked spam by mistake. Seeing all the spam and running it through my own classifier gives me a better chance of finding the false positives since at least I control the classifier.
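To make "train my own classifier" concrete, here is a toy word-count filter in Python. It is a plain naive Bayes sketch, not bogofilter's algorithm, and the sample messages are invented for the example. The point is only that a filter trained on my own spam and my own real mail scores new messages by my own history.

```python
import math
import re
from collections import Counter


class ToySpamFilter:
    """Tiny naive Bayes filter over word counts (illustration only)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Register one message as 'spam' or 'ham'."""
        self.words[label].update(self.tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        tokens = self.tokenize(text)
        vocabulary = len(set(self.words["spam"]) | set(self.words["ham"]))
        total = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            # Smoothed log prior, so an untrained class never hits log(0).
            score = math.log((self.messages[label] + 1) / (total + 2))
            word_total = sum(self.words[label].values())
            for word in tokens:
                # Add-one smoothing for words never seen under this label.
                count = self.words[label][word] + 1
                score += math.log(count / (word_total + vocabulary + 1))
            scores[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, scores["ham"] - scores["spam"]))
        return 1 / (1 + math.exp(diff))


# Made-up training data standing in for the downloaded spam and real mail.
demo = ToySpamFilter()
demo.train("cheap pills buy now", "spam")
demo.train("win money now cheap offer", "spam")
demo.train("meeting notes for the svn migration", "ham")
demo.train("lunch tomorrow with the team", "ham")
```

Any message sitting in the Spam folder that this kind of filter scores well below 0.5 is a candidate false positive worth a second look.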

Tuesday, October 16, 2007

On 40 tips for optimizing PHP code

I saw something like Reinhold Weber's 40 tips for optimizing PHP code maybe a year ago. Recently, this particular set of 40 tips popped up on reddit. I decided to test. Now, my tests are trivial, on the command line, on Ubuntu. YMMV. Particularly with respect to "trivial", since the tests are so small they're likely to stay in L1 or L2 cache, so they may not look like other people's more complex tests since they're not likely to trigger cache misses.

Alright, after the first two tests, I've decided that this is probably not worth doing. I may continue writing some tests, but:

1 - use static if possible - 1.25 percent difference.
2 - echo is faster than print - yeah, by 1.35 percent.

Some of those tips might actually be pretty useful (I'll run some more tests over the next few days), but these two aren't very useful. Sure, if you applied 100 optimizations, each giving a 1 percent improvement you'd double the speed of your program. But I routinely improve the speed of programs by orders of magnitude by improving SQL queries, creating appropriate indexes and simplifying complex code. Most of those 40 tips sound like micro-optimizations which aren't worth the trouble to implement.
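For a sense of scale, here is the arithmetic as a few lines of Python. It assumes each optimization multiplies speed by the same factor and that the gains compound, which is the generous reading; real optimizations rarely stack this cleanly.

```python
import math


def compounded_speedup(gain_per_step, steps):
    """Overall speedup if each of `steps` optimizations makes things `gain_per_step` faster."""
    return (1 + gain_per_step) ** steps


def steps_needed(gain_per_step, target_speedup):
    """How many compounding optimizations it takes to reach `target_speedup`."""
    return math.ceil(math.log(target_speedup) / math.log(1 + gain_per_step))


hundred_small_wins = compounded_speedup(0.01, 100)  # roughly 2.7x
steps_to_double = steps_needed(0.01, 2)             # 70 one-percent wins
steps_to_match_100x = steps_needed(0.01, 100)       # 463 one-percent wins
```

Under that assumption, 100 one-percent wins come to about 2.7x. Matching a single two-orders-of-magnitude win from a fixed query or a new index would take several hundred of them.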

To be sure, error suppression with @, apart from being slow, is usually a mistake (turn on error logging in development, push error logging to a file or syslog rather than stdout in production), and $row[id] is just stupid and the sign of an incompetent developer, as are incrementing uninitialized variables and #31. Using mysql is almost always a mistake (it has its uses, but everywhere I've worked, mysql has always been the wrong tool and anyone who implemented it [i.e., me] always regretted the decision later).

3 - avoid magic (__get, __set, __autoload) functions - OK. magic __get and __set [together]
are half the speed of accessing the variables directly. So there's certainly some cost
to the indirection. It may be that high level frameworks could (and should) use the
magic functions to abstract away details. I've avoided them for now though since they
don't improve maintainability or quality of the code I currently work with, although I'd
be open to using them if I were to work on something significantly more complex than what
I work with now. I don't see anything that complex on the horizon though, I'd probably
switch to another language for anything complex (perhaps java, perhaps python, it'd depend
on the project).
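The same kind of indirection can be poked at in Python, where __getattr__ plays a role roughly like PHP's __get: it only runs when normal attribute lookup fails. This is an analogy, not a reproduction of my PHP test, and the class names are made up. The ratio it shows says nothing about PHP's numbers.

```python
import timeit


class Direct:
    """Plain attribute stored on the instance."""

    def __init__(self):
        self.value = 42


class Magic:
    """Keeps data in a dict and serves it through __getattr__."""

    def __init__(self):
        self._data = {"value": 42}

    def __getattr__(self, name):
        # Only called when normal attribute lookup has already failed.
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None


def time_access(obj, number=200_000):
    """Seconds taken to read obj.value `number` times."""
    return timeit.timeit(lambda: obj.value, number=number)


if __name__ == "__main__":
    direct = time_access(Direct())
    magic = time_access(Magic())
    print(f"direct: {direct:.4f}s  magic: {magic:.4f}s  ratio: {magic / direct:.1f}x")
```

Both classes return the same value. The only difference is the extra failed lookup and method call on the magic path, which is where the cost comes from.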

Thursday, October 11, 2007

More RAM in my next job -- and multitasking limits (not enough RAM in my brain)

I've got a slow computer (800MHz) with a reasonable amount of RAM (640MB) at work. My laptop is twice as fast, with twice as much memory. But of course, the laptop hard drive is slower than the desktop hard drive.

I've been twiddling my thumbs quite a lot lately because a lot of my work has been very disk intensive. I could be speeding things up by not doing svn update of 300MB of source, instead going to just the directories that have been modified and doing the svn update there, but then that would require knowing intimately what the other developers are doing. And *they* don't necessarily remember all the directories they committed into.

So, in my preference for dumb solutions that work all the time (e.g., svn update at the root of the tree) versus smart solutions that can easily fail (svn update to targeted directories, possibly missing other directories with fixes), I just svn update at the root. I do similar things with very large databases too, and very large file copies (rsync). The dumb solution that works all the time but is not always optimal is more cost-effective than the smart solution that will take three weeks to shake all the bugs and corner cases out of.

I'll optimize (spending that three weeks later on), when the need arises.

If I'm going to be doing much the same in my next job, then I'm going to require my employer to get me the right hardware. So I'll have 4GB of RAM (or more), and the fastest SCSI drives money can buy, in RAID-0 with rdiff-backup to RAID-1 slow SATA drives. And then I'm going to do all my svn work in a 2GB ramdisk. Hahahahaha.

I was discussing the edges of this (in the context of multitasking) with the boss of my boss and he said, "well, so you can do other things while waiting for those long tasks to finish, right?". This was in the context of a query that ran for 22 hours (and yes, I got pissed off enough at that to set up an aggregated materialized view built via triggers [postgresql] so that it now runs in an hour). At the time I said, yes, so I can multitask and need not twiddle my thumbs.
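The idea behind the trigger-built aggregate can be sketched in a few lines. This uses Python's sqlite3 module only so the example is self-contained; the real thing was in PostgreSQL, and the table and column names here are hypothetical. The triggers keep a small totals table up to date on every write, so the report reads the totals instead of scanning everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT NOT NULL, amount INTEGER NOT NULL);

-- The "materialized view": one pre-aggregated row per region.
CREATE TABLE sales_totals (region TEXT PRIMARY KEY, total INTEGER NOT NULL);

CREATE TRIGGER sales_after_insert AFTER INSERT ON sales
BEGIN
    INSERT OR IGNORE INTO sales_totals (region, total) VALUES (NEW.region, 0);
    UPDATE sales_totals SET total = total + NEW.amount WHERE region = NEW.region;
END;

CREATE TRIGGER sales_after_delete AFTER DELETE ON sales
BEGIN
    UPDATE sales_totals SET total = total - OLD.amount WHERE region = OLD.region;
END;
""")
# An UPDATE trigger is left out to keep the sketch short; a real version needs one.


def slow_totals(connection):
    """Aggregate by scanning the whole base table."""
    return connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


def fast_totals(connection):
    """Read the totals the triggers have been maintaining."""
    return connection.execute(
        "SELECT region, total FROM sales_totals ORDER BY region"
    ).fetchall()


conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7), ("south", 3)],
)
conn.execute("DELETE FROM sales WHERE amount = 3")
```

The trade is that every insert and delete gets a little slower so that the report doesn't have to aggregate from scratch each time.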

I'm finding that there are limits to multitasking though. I get confused and forget to do important steps when the list of tasks hits more than 5 or so. So it's best to stick with 4-5 concurrent tasks. So if all 4 or 5 of those tasks take more than an hour, I'm back to twiddling my thumbs because I just don't do well if I add another task. As designated mentor at work, I don't actually have to twiddle my thumbs though. Instead, I can stick my nose into my colleagues' work and offer unsolicited advice. Sometimes it's even good advice :-).

Wednesday, October 03, 2007

Software is Hard

Kyle Wilson has a great article on software development, compromises, economics, complexity, lines of code, and why Software is Hard.

I use "economics" the way I normally do though (which is not the way it usually is meant in RL, but possibly how it would be meant among economists). It's not just about money but about compromises among competing values and how different weights among the different values affect the outcome (which is usually also measured along several dimensions).

Monday, October 01, 2007

vim on slow links (sshfs)

Sometimes (as today), I need to edit files remotely. If the link is slow, editing remotely is intolerable. (PLDT myDSL has some occasional bogosity where my route to the office, which is normally just 5 hops, bounces between two internal routers whose IP addresses differ only by 1 in the last octet. Then my VPN goes over the other links and the hop count goes to 18 or so because the route is overseas.)

Then I go with sshfs. I've got a script that mounts my working copy over sshfs. I always need to remember to set up vim to not use swap files though. So, from the man page,

-n No swap file will be used. Recovery after a crash will be impossible.
Handy if you want to edit a file on a very slow medium (e.g. floppy).
Can also be done with ":set uc=0". Can be undone with ":set uc=200".

I've got an annotation in my ~/.vimrc (normally commented out) to set uc=0,
but I forget that it's there and I always google for sshfs vim swap file. And
then I go to this page, which doesn't have the answer :-).
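Since the real problem is that I forget, a small wrapper could decide for me. The sketch below is in Python and assumes Linux, where sshfs mounts show up in /proc/mounts with the filesystem type fuse.sshfs. The function names are made up, and mount points containing spaces (which /proc/mounts escapes) are not handled.

```python
import os


def sshfs_mount_points(mounts_text):
    """Extract sshfs mount points from text in /proc/mounts format."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 3 and fields[2] == "fuse.sshfs":
            points.append(fields[1])
    return points


def vim_command(path, mount_points):
    """Build the vim command line, adding -n (no swap file) for files under sshfs."""
    path = os.path.abspath(path)
    for point in mount_points:
        if path == point or path.startswith(point.rstrip("/") + "/"):
            return ["vim", "-n", path]
    return ["vim", path]


def edit(path):
    """Replace the current process with vim, choosing -n automatically."""
    with open("/proc/mounts") as mounts:
        points = sshfs_mount_points(mounts.read())
    command = vim_command(path, points)
    os.execvp(command[0], command)
```

The detection is kept separate from the launch so it can be checked without starting an editor. As the man page says, -n means no recovery after a crash, so this only makes sense for files under the slow mount.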

I would post the solution so that the question doesn't stay unanswered indefinitely. If I could reply without joining the group I'd do that. But joining is required, and I'm no joiner. So the best I can do is point at it here.

Heheh, this will also help ME since I would normally query something like:

monotremetech sshfs vim swap

:-)