Archive for Development

A Look Under The Hood

Until last week, the software architecture powering our core site at outside.in was a standard Ruby on Rails application stack sitting on top of a set of PostgreSQL databases, with various aggressive caching strategies in play on the front end. It was no surprise that the setup began showing its limitations as site traffic ramped up. True to our aspirations to become a high-traffic destination, we were becoming victims of our own success, and as a result our scaling challenge was real. Page load times, overall throughput, and latency for new content all began to degrade as databases grew. And we faced a queue of new products and enhancements our over-caffeinated product team was excited to launch, each a big driver of additional traffic. We endeavored to meet the challenge with a carefully planned re-imagining of the OI platform. Here’s a high-level look at what it entailed.

Earlier this year we had the opportunity to build a new product and business, Outside.in for Publishers (OIP), from scratch. We maximized that opportunity by rethinking the core data models and the way we handle geometric calculations, with primary goals of speed for users and minimal latencies for content acquisition. With the OIP launch, we also unleashed a new framework for content acquisition, code-named Feed Monster, which can happily overpower any mere mortal relational database with its light-speed content analysis and row insert pressure. We were very effectively adding capacity for crawling and content analysis, but not keeping pace in publishing new content because of relational bottlenecking. After the launch of OIP, we plotted a road map to bring the speed optimizations of OIP to the core site and handle the Feed Monster volume without breaking a sweat… with a few other tricks in the mix.

So, on to the specifics. The knowns we started with were:

  • Relational databases are slow when they become large.
  • For all its elegance and productivity, Ruby is slow to execute.
  • Geometry, even with a mature system like PostGIS, is slow and hugely resource-intensive at scale.

Output caching injects speed on top of any of these, at the cost of serving stale data. Because our success depends on the timeliness of data delivered to our users, latencies introduced with caching are largely unacceptable. We took a hard look at where we needed performance and came up with a three-faceted approach:

  • Denormalize content and metadata into a search-based structure.
  • Move the heavy lifting out of Ruby and into a faster stack by building a cluster of “datajoiners.”
  • Intelligently cache long-lived data in the datajoiner, where it is most flexibly utilized for various output types.

Denormalizing data into a set of search indices in a master/slave clustered environment enables very fast content retrieval without the overhead of relational integrity. We apply our computing resources at content processing time to make locating and displaying content very lightweight. We have accomplished this by embracing Apache Lucene and building some clever code to shard indexing across the farm.
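
To make the idea concrete, here is a minimal Python sketch of denormalizing at processing time and routing documents to index shards. It is an illustration only, not our production code: the table names, field names, shard count, and the CRC-based routing rule are all assumptions, and plain dictionaries stand in for the Lucene indices.

```python
import zlib

# Hypothetical relational rows; table and field names are illustrative only.
STORIES = {101: {"title": "New cafe opens", "source_id": 7, "place_id": 3}}
FEEDS = {7: {"name": "Brooklyn Blog"}}
PLACES = {3: {"name": "Park Slope", "city": "Brooklyn"}}

NUM_SHARDS = 4  # assumed shard count


def denormalize(story_id):
    """Resolve the joins once, at processing time, into one flat document."""
    story = STORIES[story_id]
    feed = FEEDS[story["source_id"]]
    place = PLACES[story["place_id"]]
    return {
        "id": story_id,
        "title": story["title"],
        "source_name": feed["name"],
        "place_name": place["name"],
        "city": place["city"],
    }


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable checksum of the document id."""
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


def index(doc, shards):
    """Store the flat document on its shard (a dict stands in for an index)."""
    shards[shard_for(doc["id"])][doc["id"]] = doc


shards = [{} for _ in range(NUM_SHARDS)]
index(denormalize(101), shards)
```

Because the joins are resolved before indexing, a read only has to look up one flat document on one shard, which is the trade described above: more work at processing time in exchange for lightweight retrieval.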

Datajoiners are a set of servers that power content delivery to our internal APIs and front-end applications. The new middleware tier is built on the Java Virtual Machine in Scala, where we can take full advantage of multiple cores for parallelization with minimal effort. The datajoiner’s purpose is to take data from disparate sources, like PostgreSQL, Lucene, key value stores, and memory caches, and to speedily produce an output suitable for any of our consumers.
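
The join step can be sketched roughly as follows. This is a simplified Python illustration rather than the Scala implementation: the three fetch functions are hypothetical stubs standing in for PostgreSQL, the search index, and a cache, and the payload shape is assumed.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stub fetchers; real ones would query a database, a search
# index, or a cache over the network.
def fetch_place(place_id):
    return {"place": {"id": place_id, "name": "Park Slope"}}


def fetch_stories(place_id):
    return {"stories": [{"title": "New cafe opens"}]}


def fetch_stats(place_id):
    return {"stats": {"story_count": 1}}


FETCHERS = (fetch_place, fetch_stories, fetch_stats)


def join(place_id):
    """Query every source concurrently and merge the results into one payload."""
    with ThreadPoolExecutor(max_workers=len(FETCHERS)) as pool:
        futures = [pool.submit(fetch, place_id) for fetch in FETCHERS]
        payload = {}
        for future in futures:
            # result() blocks until that source has answered.
            payload.update(future.result())
    return payload


print(join(3))
```

Note that Python threads mainly help with waiting on I/O, whereas the JVM lets the real datajoiner spread CPU-bound work across cores; the sketch only shows the fan-out and merge shape.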

In the area of language performance, we knew we would be introducing a compromise between the development speed of a dynamic language, Ruby, and the execution speed and threading of a JVM-based one. We chose Scala for the datajoiner because it offers a bit, although I won’t say the best, of both worlds. Scala is a young language with its share of warts, but it shows tremendous promise. With type inference, its myriad syntactic conveniences, and fast run-time performance, Scala served us well in this area of our architecture. Ultimately, performance handily exceeded our goals. Under massive load scenarios, each datajoiner is able to service 8 to 10 front-end Rails web servers without compromising page load time or requests per second.

Finally, we revisited caching and baked it into the datajoiner layer. Typically caching is done as close to the front end as possible to reduce resource usage, at the cost of serving stale data. In our case, caching policies can be applied at a very granular level such that content is served in near real time without staleness, while long-lived data structures like region containment are efficiently cached for long periods. With the data assembly work being done in Scala, we are able to push tons of data into the system and serve it out without sacrificing freshness.
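
A granular policy of this kind might look like the Python sketch below. The class name, the kind labels, and the TTL values are assumptions made for illustration; the point is only that each kind of data carries its own lifetime, so fresh content bypasses the cache while slow-changing structures stay cached.

```python
import time

# Assumed policy: seconds each kind of data may be served from cache.
# A TTL of 0 means "never cache", so that kind is reloaded on every request.
TTL_BY_KIND = {"region_containment": 24 * 60 * 60, "latest_stories": 0}


class PolicyCache:
    """Cache whose expiry is decided per kind of data, not globally."""

    def __init__(self, ttl_by_kind, clock=time.monotonic):
        self._ttl_by_kind = ttl_by_kind
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}  # (kind, key) -> (expires_at, value)

    def get(self, kind, key, loader):
        """Return a cached value if still valid, otherwise call loader(key)."""
        now = self._clock()
        entry = self._entries.get((kind, key))
        if entry is not None and now < entry[0]:
            return entry[1]
        value = loader(key)
        ttl = self._ttl_by_kind.get(kind, 0)  # unknown kinds are not cached
        if ttl > 0:
            self._entries[(kind, key)] = (now + ttl, value)
        return value
```

A caller would pass a loader that does the expensive work, for example `cache.get("region_containment", region_id, compute_containment)`, where `compute_containment` is a hypothetical function.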

So, as Lauren says, welcome to the new Outside.in! I hope this post provides a useful peek behind the scenes of our most recent effort. I’m extremely proud of the team for bringing this to life, and it is incredibly rewarding for us to see it working flawlessly in the wild. The only thing that excites me more right now is the bright future of possibility our new platform represents. Stay tuned for a host of new stuff that will be powered by our new engine!

Big Week in Development

Last week was a big milestone for us, with some major news on the OI technology front, and an exciting development in the hyperlocal space in general.

First, we released Outside.in for Publishers (OIP) in full beta, with a number of prominent launch partners. Besides being a major step forward for our business, the OIP suite of applications marks a major change in the design and scalability of our underlying platform. Since January, the technology team has been implementing a series of decoupled services, in concert comprising a v2 platform for building sites and APIs that sits on top of our massive collection of user-generated/aggregated content, natural language training data, and statistical metadata.

In a way, this is the second phase of our build-out: stage one pulled a ton of data in and generated output that is streamlined, useful, and end-user-centric. This phase of work builds on all of that, adding the notion of hyperlocal publishers who can curate their own pages, a distributed architecture for data and text processing, and numerous tricks to speed it up significantly. OIP also debuts our new Geometry engine. While we make use of PostGIS and the workhorse GEOS libraries fully for raw geometric heavy lifting, we are always striving for ways to make location, proximity, and relevance lightning fast without sacrificing accuracy or flexibility. We think we have accomplished this with OIP, and will have more to reveal in this area in the coming months.

Secondly, we’re beginning to see real feedback on our latest toolset from the field of bloggers and publishers who are using it daily. This is a rewarding and humbling experience, as any technologist knows; some assumptions about how users will interact with the systems are right on, others are dead wrong, and most are in between. The real litmus test in this arena is whether or not the underlying data and delivery models are flexible enough to react quickly to feedback, though not necessarily so flexible that the systems are over-engineered. I’m biased, but… I think we’ve done it, for now. There is so much still to build. The team is already hard at work on the next rev of our public API, building in the latest and greatest from the Google Maps team, and making geometry and text analysis cleaner and faster.

Finally, the release of EveryBlock’s source code under the GPL represents an important moment for hyperlocal technology. I commend the EveryBlock team for their accomplishment. It is one thing to build a system like theirs over 2 years, another entirely to prepare it for open-sourcing. They have done a terrific job. Making the code available is equally if not more important for the Python/Django communities: It’s a peek behind the curtain at how a successful Django system is conceived and architected (from Adrian Holovaty, the co-creator of Django itself), with many lessons for those starting out.

The Twittersphere and GitHub-sphere are abuzz with discussions around the EveryBlock source, particularly around the topic of location extraction from unstructured text. This is a topic that runs through the veins of everyone at OI, and it’s been quite interesting to compare approaches. While EveryBlock’s open-sourced app does a nice job with address extraction, we have taken a more holistic approach which uses publisher clues, our own training data based on years of human-powered quality control, and disambiguation of locations across the entire U.S. We have spent some time with the EveryBlock source and see a lot of value there for publishers of all kinds who have the bandwidth to spin up a development effort. For them, this is a remarkable head start with succinct, nicely packaged code that does a lot. We will continue to push forward with a plug-in model for our users, however. While we keep our APIs and approaches open, integration will be simple, fast, and low-impact for those who want to take their content local. At the API summit we hosted last month here at OI headquarters, we met with developers interested in local and heard some inspiring ideas about how geo-centric news and information could be used to enhance their own projects. Cool stuff. I encourage developers and publishers to use all the rich hyperlocal resources out there to maximum advantage!

We’re Hiring a Senior Developer

Outside.in is the perfect place to find out what’s going on in the neighborhood you live in. We provide our users with the best way to discover the people, places and conversations in their community. Oh, and we’re also building the future of hyperlocal news and media.

We created outside.in to be the best resource online for keeping up with news and opinions in your neighborhood, finding out the inside scoop on local places or events and meeting interesting new neighbors, and sharing your local knowledge with them. Our platform is the leader in aggregation and delivery of localized content for leading publications.

We have top tier investors and are experiencing major growth. The team is small and deeply talented. We seek a seasoned developer to join our engineering team to help drive conceiving and shipping the next generation of outside.in applications.

We are looking for an entrepreneurial engineer with the following skills:

  • Experience with all of the following: Ruby, Rails, REST, Open-Source messaging, Search (Lucene for example), Java and/or C/C++
  • Demonstrable track record building distributed and/or clustered, high-throughput applications
  • Large repository experience with Subversion, or preferably Git
  • At least 6 years of experience building large-scale web applications and service-oriented architectures
  • Proven experience deploying to high-traffic websites and supporting massive numbers of users
  • Text processing and Natural Language Processing experience a plus
  • Test-driven development
  • Deep SQL skills required, specifically with PostgreSQL
  • PostGIS and mapping/geometry knowledge a major plus
  • Demonstrable problem-solving and architecture skills. Be able to balance beauty and pragmatism in your designs and code!

If your skill set matches and you’d like to apply, email a cover note and resume to careers<at>outside.in

Tech Update: Juggling and Delivering

It has been a particularly busy and exciting couple of weeks here at Outside.in HQ. CEO Mark Josephson and co-founder/Chairman Steven Johnson both spoke about the future of news and media to key audiences; Mark at NAA and Steven at South by Southwest. Steven’s speech is available to read here. They each addressed the problems the news industry is facing head on, and not surprisingly our company has a lot to do with the future they portrayed in their messages. In this post, I’ll share some of the efforts the OI engineering team is executing to bring Steven’s and Mark’s notion of Aggregation/Curation/Networks to life.

Software development in a technology startup is a constant juggling act. One aspect you can always count on is the need to balance infrastructure/scaling work with development of new features and evolution of old ones.  We have key initiatives on all of these fronts going on right now.

Keeping infrastructure humming under loads that increase week over week as our user base grows is ideally what I call a “dollar solvable” problem, meaning we simply bring up some additional machines, watch work distribution kick in, and get back to our software engineering work. The reality is that in any large system, bottlenecks and hotspots develop such that adding hardware incrementally is not always effective, and at those junctures code is refactored or replaced. In both cases, it is critical to have a computing environment that is as agile as possible; it must be simple and cheap to bring new machines into clusters, even simpler to remove them, and trivial to spin up small networks and environments for testing and experimentation. To achieve this, we have moved to a “cloud” environment for much of our live product presence. About 2 weeks ago, on a calm sunny Sunday in New York, we seamlessly migrated from a fixed data center model for hosting to a very flexible virtual environment. By the end of that afternoon we officially flipped the switch and watched our traffic migrate to the new network, and we haven’t looked back.  Virtual hosting isn’t a panacea, and it is certainly no substitute for good architecture. But it does make many of the tasks around managing growth inexpensive and much faster.

Architecture is a daily focus. I’ll detail some of the innovations we’re making in subsequent posts, but the summary is we’re constructing new data models to account for curation of content. Data structures can generally be made fast for data capture or retrieval, but not for both simultaneously. We have some techniques in mind for optimizing the user experience for our partners and readers of content aggregated with our platform. It is a data design and transformation problem, and with our cloud environment we will be benchmarking and honing our approach. The goal is fast, responsive UI for all of our products built on top of an engine that maximizes throughput for very fast processing and publishing of news and local information. I think we have it all figured out, and you’ll be able to grade us in short order.

And this brings us to the salient point for you, our user. The product that brings the aggregation/curation/networks vision to the forefront of what we do is in active development, and it will be announced very soon. “Very soon” means the engineering and product teams are gunning for an aggressive release date, and the tone in the office shows that to be the case; average caffeine intake is on the rise and the whiteboards are full. The release will feature our best work on all fronts, and that work will also raise the bar for the central Outside.in website, our iPhone application, the Geo Toolkit (which just got a slick facelift), and our partner offerings.

Lastly in this installment, I want to applaud OI co-founder and VP of Engineering Cory Forsyth, who is off to the sold-out Scotland on Rails conference at the end of this month. Cory is presenting a talk on image processing in Ruby (read more here). Outside.in has become a fixture in the NYC Ruby community, fueled by our monthly Ruby Happy Hour meetups and events like Cory’s talks. Look for more from us in the Ruby and Open Source realms in the coming months.

Now, back to the juggling act! I’ll post again soon.

Ruby Gem for Radar

One of the great things about creating an API is that it gives developers a chance to use your service to do the things they want to do with it. When we launched our API, we did so hoping that it would give folks the opportunity to build the things that we haven’t had a chance to do on our own or haven’t thought of yet, and we’ve already seen some interesting stuff come out of it, like near.ly.

As of last week, it’s gotten even easier to consume our API if you’re using Ruby, thanks to a neat wrapper gem called Radarb that the folks at Viget Labs have created. You can check it out on GitHub.
