A couple of months ago I wrote about terrible performance in the SOLR/Lucene search engine and a workaround for it. I discovered that performance would drop off a cliff when using filter queries to narrow results for searches on common terms in large indexes. Fortunately, it looks like the issue has been addressed in some of the latest nightly SOLR builds and is scheduled for official release with SOLR v1.4. Prior to this version, filter queries were applied after the main query ran, which is all well and good but doesn’t speed your query up the way you would expect. The new version applies the filters in parallel with the main query, speeding up searches that combine common queries and query filters by 30% to 80%, along with a 40% smaller memory footprint.

However, even with this speed improvement you should still consider how you structure your queries. There is no need to query across every field if you know you really want to filter everything down with a single filter query. Try moving that filter query (fq) into the actual query (q) as <field>:<filter>. You might be surprised by the results.
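As a concrete sketch of that change, here are the two forms side by side in PHP. The host, core and the electronics-style category field are hypothetical placeholders rather than anything from my setup; adjust them to your own index.

<?php
// A minimal sketch, assuming a local SOLR instance with a "category"
// field and the standard /select handler. Host and field names are
// placeholders for illustration only.
$base = 'http://localhost:8983/solr/select';

// Form 1: filter query applied separately from the main query.
$withFq = $base . '?' . http_build_query(array(
    'q'  => 'ipod',
    'fq' => 'category:electronics',
    'wt' => 'json',
));

// Form 2: the filter folded into the main query as <field>:<filter>.
$folded = $base . '?' . http_build_query(array(
    'q'  => 'ipod AND category:electronics',
    'wt' => 'json',
));

$resultsWithFq = json_decode(file_get_contents($withFq), true);
$resultsFolded = json_decode(file_get_contents($folded), true);

Benchmark both forms against your own index before committing to either; which one wins depends on your schema and query mix.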

In the days since I began programming in PHP, the web has come a long way. With the 5.3 release of PHP, the OOP side of things is finally getting a much-needed polish. In the past year there has been a steady rise in the use of Ruby on Rails for web development, as programmers discover that coming up with an idea for a site is far more enjoyable than programming one. Rails offers a very rapid application development environment that takes you from concept to code in a minimal number of steps. Unfortunately, PHP has been lacking in this regard. PHP needs a standard framework and easy database interaction. Currently, the steps required to go from concept to code bring with them a huge overhead in scaffolding development.

It is precisely this overhead that needs to be eliminated. Time is money, and if you are starting a new application on a weekly or monthly basis, the scaffolding overhead eats up a great deal of that time. Not to mention that most web applications today spend their time shuttling data to and from a database, and that fetching and storing makes up a significant chunk of overall development. 2009 is a big step forward for PHP, and there are tools out there that tackle this problem.


We’ve been working hard to get this ready for people to start poking around in, and we’re happy to announce that it’s now ready for public beta testing! You can grab it from http://github.com/kla/php-activerecord/. Play with it… break it… and give us your feedback to help us make a better library for everyone! We want to hear from you.

Quick Start

We’ll start with a bare-bones example to show how little you need to get up and running. There’s very little to configure. We’ve adhered to the convention-over-configuration philosophy, so there are no code generators to run and no XML/YAML mapping files to maintain.
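To give a flavor of that, here is a minimal sketch of the kind of setup involved. The connection string, model directory and posts table are placeholders, and the exact file layout and API of the beta may differ slightly, so treat this as an outline rather than gospel.

<?php
// Minimal sketch: bootstrap the library and define one model.
// DSN, credentials and table name are placeholders.
require_once 'php-activerecord/ActiveRecord.php';

ActiveRecord\Config::initialize(function ($cfg) {
    $cfg->set_model_directory('models');
    $cfg->set_connections(array(
        'development' => 'mysql://user:password@localhost/blog',
    ));
});

// A model is just a class named after its table (posts); no mapping files.
class Post extends ActiveRecord\Model
{
}

// Basic usage.
$post = new Post();
$post->title = 'Hello ActiveRecord';
$post->save();

echo Post::first()->title;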


Update!

Find the latest here.

PHP 5.3 gets ActiveRecord!

A quick Google search for an implementation of active record for PHP is discouraging when you consider the state of ActiveRecord for Ruby on Rails. The top results are from very old posts, and the rest preview minimal implementations. Of course, PHP will eventually see a robust active record similar to RoR’s. Fortunately, that time is now, thanks to PHP 5.3 and its beneficial new features: closures, late static binding, and namespaces.
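To see why late static binding in particular matters here, consider a tiny illustrative sketch (not the library’s actual code): a finder defined once on a base class that still knows which model subclass it was called on.

<?php
// Illustrative only, not the library's implementation. Before PHP 5.3,
// self:: inside Model::find() would always mean Model; with late static
// binding, static:: resolves to the subclass the caller actually used.
abstract class Model
{
    public static function tableName()
    {
        // get_called_class() is also new in PHP 5.3.
        return strtolower(get_called_class()) . 's';
    }

    public static function find($id)
    {
        $sql = sprintf('SELECT * FROM %s WHERE id = %d', static::tableName(), $id);
        echo $sql, "\n";     // stand-in for actually running the query
        return new static(); // instantiates the calling subclass
    }
}

class Book extends Model {}

Book::find(1); // SELECT * FROM books WHERE id = 1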

My friend Kien and I have improved upon an earlier version of an ORM that he had written prior to PHP 5.3. The ActiveRecord we have created is inspired by Ruby on Rails, and we have tried to maintain its conventions, deviating mainly for convenience or necessity. Our main goal for this project has been to allow PHP developers to tackle larger projects with greater agility. We also hope that using this library will push the PHP community forward by exposing it to the wonderful benefits of the Ruby on Rails stack. Enough with the rambling, let’s get to the interesting piece!


Single vs. multi-core sharded index: which one is the right one? There is not a whole lot of information out there, especially when it comes to hard numbers and comparisons. There are a couple of reasons for this. The first that comes to mind is that the multi-core functionality offered by Apache SOLR is very new; it was only introduced with SOLR v1.3 and hasn’t had much time to be adopted by the SOLR community. Second, the results depend on your schema, index size, query types and user load, and these factors can produce very different performance numbers. As the following benchmarks show, a multi-core sharded SOLR index has the potential to speed up individual queries, but it can also cut throughput and scalability to roughly the inverse of the number of cores.

i.e. for n cores, the maximum throughput is roughly 1/n of a single index’s; for example, four cores can mean roughly a quarter of the requests per second.

With multi-core sharded indexes, the underlying assumption is that search performance improves by splitting your index into smaller chunks: smaller shards are faster and more efficient to search and index. However, you never get anything for free; the performance increase comes at the cost of higher CPU utilization. Breaking the index into multiple smaller pieces makes searching and indexing each subset faster, but every query must now hit every core individually. Whereas a single index runs one slightly slower query, a multi-core sharded query runs n queries in parallel and then combines the results.
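For reference, this is roughly what such a fan-out query looks like with SOLR 1.3’s distributed search, sketched in PHP; the host, port and core names are placeholders.

<?php
// Sketch: query one core and let SOLR fan the request out to every
// shard listed in the shards parameter, merging the results for us.
// Hosts and core names below are placeholders.
$shards = 'localhost:8983/solr/core0,localhost:8983/solr/core1';

$url = 'http://localhost:8983/solr/core0/select?' . http_build_query(array(
    'q'      => 'title:lucene',
    'shards' => $shards,
    'wt'     => 'json',
));

$response = json_decode(file_get_contents($url), true);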


Is your SOLR installation running slower than you think it should? Are performance, throughput and scalability not what you were expecting or hoping for? Do you constantly see others reporting much higher SOLR query performance and scalability than you get? All it might take to fix your woes is a simple schema or query change.

The scenario I am about to describe is proof positive that you should always take the time to understand the underlying functionality of whatever operating system, programming language or application you are using. Let my oversight and ‘quick fix solution’ be a lesson to you: it is almost always worth the upfront cost of doing something right the first time so you don’t have to keep revisiting the same issue.

Quality GIS data sometimes comes with a lot more precision than is usable for Google Maps (or other mapping software). The problem lies in the number of points representing a polygon that you want to overlay. A county layer for a state might include 100,000 points, which is not usable without some form of reduction. Luckily, there is an algorithm that solves this problem: Douglas-Peucker.

The algorithm simplifies a polyline by removing vertices that do not contribute (sufficiently) to the overall shape. It is a recursive process that finds the most important vertices at each level of reduction. First, the most basic reduction is assumed: a single segment connecting the beginning and end of the original polyline. Then the recursion starts. The most significant vertex (the one farthest from the segment) is found, and when its distance to the segment exceeds the reduction tolerance, the segment is split into two sub-segments, each inheriting a subset of the original vertex list. Each segment continues to subdivide until none of the vertices in its local list is farther away than the tolerance value.
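Here is a minimal sketch of that recursion in PHP for a polyline of [x, y] pairs; it measures distance to the line through the segment’s endpoints, which is the classic formulation. It is only meant to make the description above concrete (the class linked below is a more complete option).

<?php
// Perpendicular distance from point $p to the line through $a and $b.
function perpendicularDistance(array $p, array $a, array $b)
{
    $dx = $b[0] - $a[0];
    $dy = $b[1] - $a[1];
    if ($dx == 0 && $dy == 0) {
        return hypot($p[0] - $a[0], $p[1] - $a[1]);
    }
    // Twice the triangle area (a, b, p) divided by the base length |ab|.
    return abs($dy * $p[0] - $dx * $p[1] + $b[0] * $a[1] - $b[1] * $a[0])
         / hypot($dx, $dy);
}

function douglasPeucker(array $points, $tolerance)
{
    $count = count($points);
    if ($count < 3) {
        return $points;
    }

    // Find the vertex farthest from the segment joining the endpoints.
    $maxDist = 0.0;
    $index   = 0;
    for ($i = 1; $i < $count - 1; $i++) {
        $d = perpendicularDistance($points[$i], $points[0], $points[$count - 1]);
        if ($d > $maxDist) {
            $maxDist = $d;
            $index   = $i;
        }
    }

    // Split at that vertex and recurse if it exceeds the tolerance...
    if ($maxDist > $tolerance) {
        $left  = douglasPeucker(array_slice($points, 0, $index + 1), $tolerance);
        $right = douglasPeucker(array_slice($points, $index), $tolerance);
        return array_merge(array_slice($left, 0, -1), $right);
    }

    // ...otherwise the whole run collapses to its two endpoints.
    return array($points[0], $points[$count - 1]);
}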

There is a PHP class that does just this: Douglas-Peucker Polyline Simplification in PHP by Anthony Cartmell. Depending on the original quality of the data and the tolerance level, I was able to achieve a 90-93% reduction in size. This reduction lets me deliver significantly more data to clients at a reasonable performance level. Keep in mind that the reduction removes data from the coordinate array, so the quality of your representation goes down as the tolerance, and with it the reduction, goes up. I highly suggest you play around with the tolerance until you find a good balance between data size and image quality.

PHP GIS Functions

April 14, 2009

I have been doing a lot of PHP and GIS consulting work for CitySquares and the History Engine. I found that searching for everything I needed for basic processing and Google integration was tedious and painful, so here is a collection of common functions that helped me massage the data and get it ready for integration.

    pnPoly – Used to determine if a coordinate falls inside a polygon (see the sketch after this list).
    Centroid – Find the center of a polygon.
    Area – Calculate the area of a polygon.
    googleGeoCoder – Extracts GIS information from Google Maps for an address.
    PolylineEncoder – Takes a set of coordinates and encodes it for Google Maps.
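
As an example of the first of these, here is a sketch of the classic ray-casting test in the spirit of pnPoly; the parameter names are mine, not necessarily those of the original function.

<?php
// Ray-casting point-in-polygon test. $vertx/$verty are parallel arrays
// of polygon vertex coordinates; returns true if the test point is inside.
function pnPoly(array $vertx, array $verty, $testx, $testy)
{
    $n = count($vertx);
    $inside = false;
    for ($i = 0, $j = $n - 1; $i < $n; $j = $i++) {
        // Does the horizontal ray from the test point cross edge (j, i)?
        $crosses = (($verty[$i] > $testy) != ($verty[$j] > $testy))
            && ($testx < ($vertx[$j] - $vertx[$i]) * ($testy - $verty[$i])
                        / ($verty[$j] - $verty[$i]) + $vertx[$i]);
        if ($crosses) {
            $inside = !$inside; // each crossing toggles inside/outside
        }
    }
    return $inside;
}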

If you run into the problem I did, where a lot of the data arrives as shp/dbf files and needs to be parsed into something friendlier such as KML or CSV, there are a couple of solutions. You can parse out the data with shp2text if your source coordinates are already in lat/lng; if you have a different coordinate system and use ArcGIS, the Export to KML 2.5.3 plugin can help with exporting data from the ESRI suite of products.

Once your data is in SQL, the following query is an example of distance sorting with SQL. You can grab a copy of the zip_codes database here and play around with it.

SELECT *,
SQRT((69.1 * (37.6 - latitude)) * (69.1 * (37.6 - latitude)) +      -- 69.1 miles per degree of latitude
     (53.0 * (-77.6 - longitude)) * (53.0 * (-77.6 - longitude)))   -- ~53 miles per degree of longitude near 37.6° N
AS distance
FROM zip_codes
HAVING distance < 10
ORDER BY distance ASC
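
If you are calling this from PHP, a parameterized version avoids hard-coding the coordinates; here is a sketch with PDO, using POW to keep each bound value to a single placeholder (the DSN, credentials and radius are placeholders).

<?php
// Sketch: the same distance sort via PDO with bound coordinates.
$pdo = new PDO('mysql:host=localhost;dbname=gis', 'user', 'password');

$sql = 'SELECT *,
        SQRT(POW(69.1 * (:lat - latitude), 2) +
             POW(53.0 * (:lng - longitude), 2)) AS distance
        FROM zip_codes
        HAVING distance < :radius
        ORDER BY distance ASC';

$stmt = $pdo->prepare($sql);
$stmt->execute(array(':lat' => 37.6, ':lng' => -77.6, ':radius' => 10));
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);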

Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from previous occurrences of that event (link). A probability value is then assigned to each word or token; the probability is based on calculations that take into account how often that word occurs in one category versus another. The most common application of the filter is identifying words that appear in spam versus legitimate emails. A word by itself is often useless without the context it was used in.

There is a whole suite of tools that can break content down to improve the filter, supplementing it not only with a database mapping words to categories but also with sets of N-grams derived from the text. There are several scripts out there that help with this extraction, and it adds a few more layers of depth to Bayesian filtering. One such tool is the Ngram Statistics Package (NSP), which is easy to install and run.
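As a rough illustration of the per-word probability idea, here is a toy sketch with made-up counts; real filters add smoothing, rare-word corrections and much better tokenization.

<?php
// Toy Bayesian scoring with hypothetical token counts.
$spamCounts = array('viagra' => 120, 'free' => 90, 'meeting' => 2);
$hamCounts  = array('viagra' => 1,   'free' => 30, 'meeting' => 80);
$totalSpam  = 200; // number of spam messages in the training corpus
$totalHam   = 200; // number of legitimate messages

// Probability that a message containing $word is spam, ignoring priors.
function spamicity($word, $spamCounts, $hamCounts, $totalSpam, $totalHam)
{
    $pSpam = (isset($spamCounts[$word]) ? $spamCounts[$word] : 0) / $totalSpam;
    $pHam  = (isset($hamCounts[$word])  ? $hamCounts[$word]  : 0) / $totalHam;
    if ($pSpam + $pHam == 0) {
        return 0.5; // unseen word: no evidence either way
    }
    return $pSpam / ($pSpam + $pHam);
}

// Combine the per-word scores for a message naively.
$tokens = array('free', 'viagra');
$spam = 1.0;
$ham  = 1.0;
foreach ($tokens as $t) {
    $p = spamicity($t, $spamCounts, $hamCounts, $totalSpam, $totalHam);
    $spam *= $p;
    $ham  *= (1 - $p);
}
echo 'P(spam) ~ ' . $spam / ($spam + $ham) . "\n";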


In my experience the majority of web agencies and developers still do not take search seriously enough. Most businesses have very simple requests: “How do I show up for a keyword for people in my area?”, “How do I show up higher than my competitor?”, and “How do people find my site?”. The web is an economy, and driving consumers to businesses on the internet is a highly desired skill set. Consistently controlling Google’s results is impossible, but there is always room for improvement on every site.

Every developer will grow their own set of tools, but the core components are available for free. Google offers Analytics so you can take control of your traffic performance, sources, and patterns. There is also the AdWords Keyword Tool, which helps you target search phrases by volume and competition. Based on these factors and a list of similar keywords, you can identify good opportunities to compete for relevant traffic. Google also publishes Webmaster Guidelines that lay out general best practices for search engines.

This process requires a lot of patience. It takes time for changes to take shape and for results to be delivered. When making changes to any site, or even designing a new site with SEO built in, user traffic is not going to appear right away. Once the results do come in, you will find yourself compulsively checking Analytics, forever making improvements and identifying new markets and opportunities. The vast majority of web sites are there for user consumption; SEO became big business when a lot of people figured out, all at once, that users translate into consumers.

Google is the search leader and therefore offers the highest return; they control the flow of traffic on the internet. Luckily, they have also published a search engine optimization starter guide in PDF format. This is the 101 of SEO, and it is pointless to chase down every obscure reference and tip on the countless SEO sites out there when the components of their content analysis are available in one place. The document is a general overview, but it offers some very important best-practice rules that are easy to implement:

Title Tags

– Choose a title that effectively communicates the topic of the page’s content.
– Create unique title tags for each page
– Use brief, but descriptive titles (limit of 66 characters or 12 keywords)

Description Tags

– Accurately summarize the page’s content
– Use unique descriptions for each page
– Avoid filling the description with only keywords
– Avoid copying and pasting the entire content of the document into the description meta tag
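
A small PHP sketch of how unique titles and descriptions might be wired into a simple page template rather than hard-coded site-wide; the $page fields are hypothetical.

<?php
// Sketch: pull a per-page title and meta description from a hypothetical
// $page record so every page gets unique, descriptive tags.
$page = array(
    'title'       => 'Douglas-Peucker Polyline Reduction in PHP',
    'description' => 'Reducing GIS polygon data by over 90% for Google Maps overlays.',
);
?>
<head>
    <title><?php echo htmlspecialchars($page['title']); ?></title>
    <meta name="description"
          content="<?php echo htmlspecialchars($page['description']); ?>">
</head>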

URL structure

– Use words in URLs
– Create a simple directory structure
– Provide one version of a URL to reach a document
– Many users expect lower-case URLs and remember them better

Site Navigation

– Create a naturally flowing hierarchy
– Use mostly text for navigation
– Use “breadcrumb” navigation
– Put an HTML sitemap page on your site, and use an XML Sitemap file
– Consider what happens when a user removes part of your URL
– Have a useful 404 page

Anchor Text (Links)

– Choose descriptive text
– Write concise text
– Format links so they’re easy to spot

Heading Text

– There are six sizes of heading tags, beginning with <h1>, the most important, and ending with <h6>, the least important.
– Imagine you’re writing an outline
– Use headings sparingly across the page
– Avoid using heading tags only for styling text and not presenting structure
– Avoid excessively using heading tags throughout the page

Other Confirmed Ranking Factors

– Keyword in URL
– Keyword in Domain name
– Freshness of Pages
– Freshness – Amount of Content Change
– Freshness of Links
– Site Age
– Anchor text of inbound link
– Hilltop Algorithm
– Domain Registration Time

There is a lot of helpful content in the document, but it does not go as deep into the inner mechanics as other sites attempt to do. Several sites try to go beyond what has been published and into the details of generating traffic; just google “Google Ranking Factors”. A lot of this information surfaced when Google’s US Patent Application #20050071741 was published.

Use the above as a baseline for getting your site more traffic. This is a topic that is constantly evolving as search improves, and it requires a lot of time and research to do efficiently. Overhauling existing projects to meet the standards of today’s crawlers is tedious and boring and offers no immediate results. It is something I avoided in the past, but for a web site to stay competitive and, more importantly, be seen, it has to be found. I find that having some good rules in place for how to deal with SEO makes new projects going forward much easier to deal with.