Data IO 2013 conference – my notes

These are my notes from today’s Data IO conference.

Next Generation Search with Lucene and Solr 4

Speaker’s slides

Lucene 4

  • near real time indexes (used by Twitter for 500 million new tweets/day)
  • can plug in your own scoring model
  • flexible index formats
  • much improved memory use, regexes are faster, etc.
  • new autocomplete suggester

Solr (Lucene server – managed by the same team as Lucene)

  • if someone chooses the red shirt, do we have a large in stock? (pivot faceting – anticipating the next question; see the sketch after this list)
  • improved geo-spatial search (all Mexican restaurants within 5 blocks, plus function queries to rank them)
  • distributed/sharded indexing and search
  • solr as nosql data store
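
A quick sketch of the two features above (pivot faceting and the geo filter) through Solr’s standard HTTP query API – the core name, field names, and coordinates are my own invention, not from the talk:

```python
import requests

params = {
    "q": "shirt",
    "wt": "json",
    # pivot faceting: for each color, how many of each size are in stock?
    "facet": "true",
    "facet.pivot": "color,size",
    # geo-spatial filter: only documents within 5 km of the given point
    "fq": "{!geofilt pt=39.74,-104.99 sfield=location d=5}",
}
resp = requests.get("http://localhost:8983/solr/products/select", params=params)
for color in resp.json()["facet_counts"]["facet_pivot"]["color,size"]:
    print(color["value"], [(s["value"], s["count"]) for s in color["pivot"]])
```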

Uses

  • recommendation engine (LinkedIn uses Lucene for people recommendations). Recommend content to people who exhibit certain behaviors
  • avoid flight delays – facet on origin airport, then pivot to destination airport (O’Hare to Newark) – origin, destination, carrier, flight, delay times – look at trends over time. Solr has a stats package that gives you averages, max, min, etc. (example after this list)
  • for local search, how to show only shops that are open? (Yelp also uses Lucene). 
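
A hedged sketch of that stats component on the flight-delay example (the core and field names are invented):

```python
import requests

params = {
    "q": "origin:ORD AND dest:EWR",
    "wt": "json",
    "stats": "true",
    "stats.field": "delay_minutes",  # numeric field to compute stats over
    "stats.facet": "carrier",        # break the stats down per carrier
}
resp = requests.get("http://localhost:8983/solr/flights/select", params=params)
stats = resp.json()["stats"]["stats_fields"]["delay_minutes"]
print(stats["min"], stats["max"], stats["mean"])
```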

You added Zookeeper to your stack, now what?

Old way of system management: active and backup servers, frantically switch to backup when active fails

Common challenges with big distributed systems

  • Outages
  • Coordination
  • Operational complexity

A common deficiency: sequential consistency (handling everything in the “right” order, when data is coming from multiple places)

  • Zookeeper is a distributed, consistent data store – strictly ordered access
  • Can keep running as long as only a minority of member nodes are lost (usually want to run 3 or 5 nodes)
  • all data stored in memory (50,000 ops/sec)
  • optimized for read performance (not write); also not optimized for giant pieces of data
  • it’s a coordination service
  • A leader node is elected by the members. The leader manages writes (it proposes each write to the followers; once they acknowledge it, the write is committed)
  • nodes can have data, and can have child nodes
  • has “ephemeral nodes” – created when a client connects, destroyed when the client disconnects (these never have child nodes)
  • watches: clients can be kept informed about data state changes (a watch just tells you the data has changed, not what it changed to – you have to read it again if you want the current value) – see the kazoo sketch after this list
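
A minimal sketch of ephemeral nodes and watches using kazoo, a popular Python client for Zookeeper (the paths and data here are made up):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# ephemeral node: disappears automatically when this client's session ends,
# which is what makes it useful for liveness and service discovery
zk.create("/services/worker-1", b"10.0.0.5:8080", ephemeral=True, makepath=True)

# watch: fires once when the data changes; it only tells you *that* something
# changed, so you read the node again to learn the new value
def on_change(event):
    data, stat = zk.get(event.path)
    print("new value:", data)

data, stat = zk.get("/services/worker-1", watch=on_change)
```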

Zookeeper is the open-source equivalent of Google’s Chubby

  • good for discovery services (like DNS)
  • Use cases: Storm and HBase, Redis – http://www.slideshare.net/ryanlecompte/handling-redis-failover-with-zookeeper
  • Distributed locking (sketch below)
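
The distributed-locking use case in miniature, again with kazoo’s built-in Lock recipe (the lock path and identifier are arbitrary):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

lock = zk.Lock("/locks/reindex", "worker-1")
with lock:  # blocks until this client holds the lock
    print("only one process across the cluster runs this at a time")
```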

Beware – Zookeeper can be your single point of failure if you don’t have appropriate monitoring and fallbacks in place

Graph Database Use Cases

  • nodes connected by relationships
  • no tables or rows
  • nodes are property containers
  • cypher is neo4j’s query language
  • http://github.com/maxdemarzi/neo_graph_search
  • http://maxdemarzi.com/2012/10/18/matches-are-the-new-hotness
  • also used for content management, access control, insurance risk analysis, geo routing, asset management, bioinformatics
  • “what drug will bind to protein X and not interact with drug Y?” (see the Cypher sketch after this list)
  • http://neovisualsearch.maxdemarzi.com
  • performance factors: graph size (doesn’t matter much), query degree (how many hops – this is what matters), and graph density. An RDBMS doesn’t scale well with data size; Neo4j does
  • the more connected the data, the better it fits a graph db
  • NoSQL – 4 categories – key value, column family, document db, graph db
  • a popular combo is, e.g., Mongo for the data and Neo4j for searching it (then hydrate the search results from Mongo)
  • optimized for graph traversal, not, e.g., aggregate analysis of all the nodes
  • top reasons to use it: problems with RDBMS join performance, evolving data set, domain shape is naturally a graph, open-ended business requirements
  • Gartner’s 5 graphs: interest, intent, mobile, payment
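
The drug/protein question above maps naturally onto a Cypher pattern. Here’s a sketch using the official Neo4j Python driver – the Drug/Protein labels and relationship types are my invention, not a real schema:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# find drugs that bind to protein X but don't interact with drug Y
query = """
MATCH (d:Drug)-[:BINDS_TO]->(:Protein {name: $protein})
WHERE NOT (d)-[:INTERACTS_WITH]-(:Drug {name: $other_drug})
RETURN d.name AS drug
"""

with driver.session() as session:
    for record in session.run(query, protein="X", other_drug="Y"):
        print(record["drug"])
```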

Parquet

I didn’t take notes during this one (a drop of water from the bottom of my glass got under my Mac trackpad, and my mouse went crazy for a while)

All the data and still not enough?

  • No matter how much data you have, it’s never enough or never seems like the right type
  • Predictive modeling – will someone default on a loan? Look at data for people who’ve had loans, and who defaulted and didn’t. Use data to make a predictive risk model
  • IID = independent and identically distributed

Example: IBM sales force optimization

  • Can we predict where the opportunities are – which companies have growing IT budgets?
  • Couldn’t know the most important thing – where these target companies were spending their IT budgets (not disclosed)
  • Companies who are spending with us are probably representative of similar-sized companies in the same market – use the “nearest neighbor” technique (see the sketch after this list)
  • Compared the model’s predictions to expert salesmen’s opinions; they matched, except for 15% of the companies, where the experts put the chances at zero. Why the difference? The model had mis-identified some of the companies (there’s no good way to cross-reference millions of internal customer records with independent sources)
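
The nearest-neighbor idea in miniature – estimate an unknown company’s IT spend from the k most similar companies whose spend we do know. The features and numbers are invented, not IBM’s actual model:

```python
import numpy as np

def knn_estimate(target, known_features, known_spend, k=2):
    # Euclidean distance from the target to every company with known spend
    dists = np.linalg.norm(known_features - target, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest
    return known_spend[nearest].mean()       # average their spend

# columns: employees (thousands), revenue ($M), same-market flag
known_features = np.array([[5, 800, 1], [6, 900, 1], [50, 9000, 0]])
known_spend = np.array([12.0, 15.0, 140.0])  # IT spend, $M

print(knn_estimate(np.array([5.5, 850, 1]), known_features, known_spend))
```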

Siemens – computer-aided detection of breast cancer

  • patient IDs ended up predicting odds for cancer. It turns out the ID was a proxy for location (whether they were at a treatment facility or a screening facility)

Display ad auctions – how do we decide who to target?

  • multi-armed bandit – exploration vs exploitation (sketch after this list)
  • what do we know? the URLs you’ve visited
  • for something like advertising luxury cars, very few positive examples (people don’t buy them online)
  • There is no correlation between ad clicks and purchases
  • Better to look at – did the person eventually end up at the company’s home page at some point after seeing the ad?
  • target people who the ad can actually influence (i.e. not people who already bought the product, or never will)
  • but there’s no way to get data for that
  • Counterfactuals – you can’t both show and not show someone an ad and observe subsequent behavior. You have to either show it or not show it
  • Ideally, build a predictive model for those who see the ad, and another model for those who don’t
  • But the industry doesn’t do that – it’s all about conversion rate
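
The exploration/exploitation trade-off in miniature: an epsilon-greedy policy that usually shows the ad with the best observed click rate, but tries a random one 10% of the time. The click rates here are invented:

```python
import random

true_rates = {"ad_a": 0.02, "ad_b": 0.05}   # true (unknown) click rates
counts = {a: 0 for a in true_rates}
clicks = {a: 0 for a in true_rates}

def choose(epsilon=0.1):
    if random.random() < epsilon or not any(counts.values()):
        return random.choice(list(true_rates))                      # explore
    return max(counts, key=lambda a: clicks[a] / max(counts[a], 1)) # exploit

for _ in range(10000):
    ad = choose()
    counts[ad] += 1
    clicks[ad] += random.random() < true_rates[ad]  # simulate a click

print(counts)  # most impressions should end up on the better ad
```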

Advertising fraud

  • Malware on sites generating http requests
  • Very difficult for ad auctions systems to detect
  • Detect it by looking at traffic between sites. For example, the malware site womenshealthbase generates massive traffic to lots of other sites that have nothing to do with women’s health
  • they make money when that traffic lands on sites with real ad auction systems – bid prices go up because of the inflated traffic, which drives up ad revenue on womenshealthbase
  • Auction systems now put visitors from these sites in a penalty box, until they start displaying normal behavior again

What’s new with Apache Mahout?

  • Amazon’s “customers who bought this item also bought that item” – the best-known Mahout example (see the sketch after this list)
  • Mahout implemented most of the algorithms in the Netflix recommendation contest
  • In general, finds similarities between different groupings of things (clustering)
  • Key features: classification, clustering, collaborative filtering
  • example: automatically tagging questions on Stack Overflow
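
The “also bought” idea at toy scale – item co-occurrence counting, the simplest form of the collaborative filtering Mahout does over huge datasets. The baskets are invented:

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "desk"},
]

# count how often each pair of items appears in the same basket
cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1
        cooccur[(b, a)] += 1

def also_bought(item, n=2):
    pairs = [(other, c) for (i, other), c in cooccur.items() if i == item]
    return sorted(pairs, key=lambda p: -p[1])[:n]

print(also_bought("book"))  # [('lamp', 2), ('desk', 2)]
```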

Uses

  • recommend friends, products, etc
  • classify content into groups
  • find similar content
  • find patterns in behavior
  • etc – general solution to machine learning problems

[I’m leaving out most of the details about performance improvements and the roadmap for upcoming refinements – below are other interesting points]

  • Often used with Hadoop, but it’s not necessary. Typically included in Hadoop distributions
  • Streaming K-means – given centroids (points at the center of a cluster), determine which cluster each new point belongs in (see the sketch after this list)
  • References: Mahout in Action (a bit out of date), Taming Text, and http://mahout.apache.org
  • Topic modeling – he wasn’t sure what the full feature set is, but he’s pretty sure it doesn’t generate topic labels for you
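
The assignment step described in the streaming K-means bullet above, at toy scale: given fixed centroids, put each incoming point in the cluster whose centroid is nearest. (Mahout does this in Java over real streams; the data here is invented.)

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # cluster centers

def assign(point):
    # index of the centroid closest to this point
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(dists))

for p in [np.array([1.0, 2.0]), np.array([9.0, 8.0])]:
    print(p, "-> cluster", assign(p))
```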
