
haroldcarr


2013-09-25

Last Modified : 2013 Sep 26 (Thu) 16:25:00 by carr.

Wednesday, Sep 25, 2013


8:30am - nuvos: the Universal SDK

  • Kevin McCarthy - President / CEO, IntraMeta

Useful as a checklist of things to consider when writing a thick-client/thin-server app, but not particularly interesting in terms of technology. They say they have an SDK, but the talk never really got to it; it just listed requirements and capabilities.

http://www.nuvos.com/

case-study

  • thick client
  • real-time data push collaboration tool 
    • each client runs apps
    • sync data between them
  • deloitte, chevron

requirements

  • real-time communication push
  • low bandwidth
  • works on bad networks (latency, loss, congestion)
  • offline with seamless data sync
  • leverage Java code/developers
  • desktop browsers and tablet
  • enterprise acceptable
  • run in cloud or on-premise
  • no leader fails

architecture

  • code generator 
    • javabean + serialization + storage
  • client-side MVC framework + local data
  • server-side object store and journals
  • guaranteed delivery msg bus (comet)
  • GWT/DHTML UI

backend

code gen

  • similar to WSDL
  • entity model 
    • javabean
  • message layer

server side

  • stores - save, delete, search
  • journals - strict numbering for order

message bus

  • msg Q
  • guaranteed delivery
  • server migration/failover
  • message handlers
  • rpc framework

frontend

gwt

  • created custom UI widgets
  • simpler CSS
  • lighter weight

features

  • works offline; bad networks
  • push data (no polling)
  • no plugins required (e.g., flash, silverlight, …)
  • much higher limits (# user, app, data,…)
  • headless java clients 
    • full unit testing (JUnit, TestNG) and CI (Hudson)
    • SMS gateway

browsers

  • inconsistencies: rendering, performance; features
  • performance: dom, hard to profile
  • security : XSS
  • legacy browsers won't die
  • runtime errors hard to debug/reproduce

service integration

  • definition: APIs (models, exceptions, msgs) in Java via annotations (no XML)
  • package unit test with def
  • loopback impl

client

  • local or remote
  • local: native code integration point
  • remote: Java or proxy

service

  • implement API via servlet


10:00am - OAuth 2.0 : A Standard is Coming of Age

  • Uwe Friedrichsen - CTO, codecentric AG

OAuth 1.0

  • 4/2010 - IETF RFC 5849
  • complex, limited scope, not extensible, not enterprise ready

players

  1. - you - user-agent 
    • have password for your resources at #2
    • you do not want to give #3 full access to #2
  2. - app with protected resources (e.g., photo server)
  3. - client - (e.g., photo book server) 
    • needs access to 2

OAuth 2.0

  • 10/2012 - IETF RFC 6749

players

  1. - you - user-agent 
    • have password for your resources at #2
    • you do not want to give #3 full access to #2
  2. a. authorization server b. resource server
  3. - client 
    • needs access to 2
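
To make the roles concrete, here is a minimal sketch of the first step of the RFC 6749 authorization code grant: the client (3) sends the user-agent (1) to the authorization server (2a) with an authorization request. The talk notes do not say which grant type was shown, so the choice of grant is an assumption, and the endpoint, client id, redirect URI, and scope are made-up placeholders. Only the parameter names come from the RFC.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the URL the client redirects the user-agent to so the
// authorization server can ask the user for consent. No password is shared
// with the client at any point.
final class AuthorizationRequestSketch {

    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String authorizationUrl(String endpoint, String clientId,
                                   String redirectUri, String scope, String state) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("response_type", "code"); // ask for an authorization code
        params.put("client_id", clientId);
        params.put("redirect_uri", redirectUri);
        params.put("scope", scope);          // limited access instead of full access
        params.put("state", state);          // opaque value echoed back to the client
        StringBuilder url = new StringBuilder(endpoint).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // All values below are hypothetical placeholders.
        System.out.println(authorizationUrl(
            "https://auth.example.com/authorize", "photo-book-client",
            "https://photobook.example.com/callback", "photos.read", "xyz123"));
    }
}
```

The authorization server later redirects back to the redirect URI with a code, which the client exchanges for an access token to present to the resource server (2b).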


11:30am - Java Flight Recorder Behind the Scenes

  • Staffan Larsen, Java SE Serviceability Architect, Oracle

overview

  • tracer and profiler
  • non-intrusive
  • built into JVM
  • modes 
    • on-demand profiling
    • after-the-fact capture and analysis 
      • circular buffers (or file)
  • 7u40

tracer and profiler

  • captures JVM and app data 
    • GC, synchronization, compiler, CPU usage, Exceptions, I/O
  • sample-based profiler 
    • low overhead; accurate data

non-intrusive

  • 2-3% overhead
  • can enable/disable
  • much info already available in JVM 
    • just available now

built into JVM

  • 50% optimized C++ code
  • 50% Java

on-demand profiling

  • start from Java Mission Control : jmc 
    • or CLI
  • higher overhead when on
  • GUI for analysis

after-the-fact

  • default mode
  • when SLA breach detected, dump current circular buffers
  • dump will have info leading up to breach
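
The after-the-fact idea can be sketched with a plain ring buffer: keep only the newest events in memory and copy them out when something goes wrong. This is an illustration of the concept only, not how JFR is implemented; the class and method names are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative ring buffer: holds the newest `capacity` events so that a dump
// taken when a problem is detected shows what led up to it.
final class EventRingBufferSketch {
    private final int capacity;
    private final Deque<String> events = new ArrayDeque<>();

    EventRingBufferSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Record an event, discarding the oldest one once the buffer is full.
    synchronized void record(String event) {
        if (events.size() == capacity) {
            events.removeFirst();
        }
        events.addLast(event);
    }

    // Snapshot of the buffered events, oldest first (the "dump").
    synchronized List<String> dump() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        EventRingBufferSketch buffer = new EventRingBufferSketch(3);
        for (int i = 1; i <= 5; i++) {
            buffer.record("event-" + i);
        }
        System.out.println(buffer.dump()); // only the three newest events remain
    }
}
```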

demo via jmc

  • name
  • duration
  • continuous | sampling | …

can connect over JMX

currently cannot compare recordings

demo via CLI

  • jcmd 
    • lists Java processes
> jcmd
3919 /opt/local/share/sbt/sbt-launch.jar
4072 sun.tools.jcmd.JCmd
  • jcmd <name of process> JFR.start

config

  • -XX:+UnlockCommercialFeatures -XX:+FlightRecorder 
    • free for development; cost $ for production
  • -XX:FlightRecorderOptions 
    • max stack trace depth
    • save recording on exit
    • logging
    • repository path

recording sessions

  • 80 events with 3 settings
  • two preconfigured settings: default | profile

simultaneous recording sessions

  • each session has own settings
  • each session sees union of all events

implementation

JMX       API
 |         |
 v         v
+------------------------------+
|   JVM                        |
|                              |
|JFR thread buffers --> global-|----> disk
|     ^   ^   ^         buffer |
|     |   |   |                |
|compiler GC runtime           |
+------------------------------+

event

  • Size
  • EventId
  • EndTime
  • StartTime
  • ThreadID
  • StackTraceID
  • Payload

event types

  • instant
  • duration
  • requestable

event metadata

  • name, path, description
  • payload 
    • name, type, duration, content type

content type

  • bytes, percentage, address, millis, nanos

XML event definitions (attribute oriented) processed into C++ classes

filter early

  • enable/disable events
  • thresholds
  • enable/disable stack trace
  • frequency

2013-09-24

Last Modified : 2013 Sep 24 (Tue) 22:39:06 by carr.

Tuesday, Sep 24, 2013


8:30am - Functional Reactive Programming with RxJava

  • Ben Christensen, Software Engineer - Edge Platform, Netflix

Excellent presentation - impossible to do justice with notes. Great slides, but really need to see/hear live.

To see someone else's live blogging of a similar talk at the 2013 International Conference on Functional Programming, see

Overview

Turning sync to async

  • callback : NO
  • futures : NO

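+------------------+-----------------------+
| Iterable         | Observable (from GoF) |
+------------------+-----------------------+
| pull             | push                  |
| T next()         | onNext(T)             |
| throws Exception | onError(Exception)    |
| returns          | onCompleted()         |
+------------------+-----------------------+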
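
A minimal sketch of the pull/push duality, assuming a hand-written `PushObserver` interface with the three callbacks named in the talk. It is not the RxJava API, only the same shape: the consumer pulls with `next()`, or the producer pushes values, errors, and completion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PullVersusPushSketch {

    // Pull: the consumer drives. Values arrive as return values, failures as
    // thrown exceptions, and the end of data as hasNext() == false.
    static <T> List<T> pullAll(Iterable<T> source) {
        List<T> out = new ArrayList<>();
        Iterator<T> it = source.iterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Push: the producer drives and reports the same three outcomes through
    // callbacks instead.
    static <T> void pushAll(Iterable<T> source, PushObserver<T> observer) {
        try {
            for (T value : source) {
                observer.onNext(value);
            }
        } catch (Exception e) {
            observer.onError(e);
            return;
        }
        observer.onCompleted();
    }

    // Records the callbacks received for a three-element source.
    static List<String> demo() {
        final List<String> log = new ArrayList<>();
        pushAll(Arrays.asList(1, 2, 3), new PushObserver<Integer>() {
            @Override public void onNext(Integer value) { log.add("onNext " + value); }
            @Override public void onError(Exception error) { log.add("onError " + error); }
            @Override public void onCompleted() { log.add("onCompleted"); }
        });
        return log;
    }

    public static void main(String[] args) {
        System.out.println(pullAll(Arrays.asList(1, 2, 3)));
        System.out.println(demo());
    }
}

// Hypothetical interface mirroring the push column; NOT the RxJava type.
interface PushObserver<T> {
    void onNext(T value);
    void onError(Exception error);
    void onCompleted();
}
```

Because the observer only sees callbacks, the producer is free to invoke them synchronously (as here) or from another thread, which is the point made in the notes that follow.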

like Akka but works on stream of values not just single value

client treats all interactions as async

API impl chooses (non)blocking and resources:

  • sync on same thread
  • async on different thread
  • on actor
  • on multiple threads
  • using NIO

Compose functions

  • transform
  • filter
  • combine
  • concurrency
  • error handling

10:00am - The JVM is dead! Long live the Polyglot VM!

  • Marcus Lagergren, Oracle

universal meta execution environment - multi-language runtime

history

  • first JIT compiler: LISP 1962 - John McCarthy 
    • also GC with ref counting
    • first modern adaptive runtime
  • Alan Kay : Smalltalk 
    • first class library
    • first GUI driven IDE
    • bytecode

dynamic languages

  • ActionScript, Clojure, JavaScript, Python, Ruby, PHP, Rebol, …
  • easy to use; no explicit compile stage, readability, fast dev for small projects, performance good enough
  • implementations 
    • native; metacircular; on top of (J)VM, CLR, …
  • characteristics 
    • resolve at runtime (instead of compiletime)
    • liberal redefinition policy
    • code equals data; eval/REPL, GC
    • things change at runtime

putting a language on top of JVM

  • get for free 
    • GC, JIT optimization; native threading; hybridization (JSR-223); person decades of high tech
  • 2 biggest problems 
    • loose types
    • dynamic redefinition

invokedynamic to punch thru indirection layer

  • since JDK7 - rewritten in u40
  • first new bytecode since 1996
  • more than a new type of call
  • breaks constraints of Java call/linkage
  • can impl calls that act like function pointers
  • can impl custom data access
  • the JVM can optimize invokedynamic

invokedynamic

java.lang.invoke.*

Java 8 uses invokedynamic delegator with lambda

Nashorn

  • 100% Java runtime for JavaScript
  • 2-10x better than Rhino
  • approaching V8 performance
  • generates bytecode; invokedynamic everywhere
  • in JDK 8 (replaces Rhino)
  • ECMAScript compliant
  • key to performance 
    • be optimistic - rollback if wrong
    • inlining; use assembly for math intrinsics;
  • used in 

Da Vinci Machine project


1pm - invokedynamic in 45 minutes

earliest adopter of invokedynamic

https://github.com/headius/indy_deep_dive

history

  • JVM authors mentioned non-Java languages
  • hundreds of JVM languages now
  • before invokedynamic, implementations needed tricks that defeated JVM optimizations
  • 2006 - JRuby team hired
  • JSR-292 invokedynamic rebooted in 2007
  • 2011 - invokedynamic in Java 7

invokedynamic

  • user-definable bytecode
  • + method pointers and adapters

java.lang.invoke

method handles

  • function/field/array pointers
  • arg manipulation
  • flow control
  • optimizable by JVM

java.lang.invoke

  • MethodHandle 
    • invokable target
  • MethodType
  • MethodHandles 
    • can get access to "things" in your class : e.g., method/field pointers

Adapters

  • methods on MethodHandle-s
  • arg manipulation 
    • insert, drop, permute, filter, fold, cast, splat (varargs), spread (unbox varargs)

Flow Control

  • guardWithTest to decide target
  • SwitchPoint on/off branch

Exception Handling

  • catchException
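
A small sketch of the pieces listed above, using only `java.lang.invoke` calls from the JDK (`findVirtual`, `findStatic`, `insertArguments`, `guardWithTest`). The helper methods `isShort`, `shout`, and `whisper` are made up for the example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class MethodHandleSketch {

    // Hypothetical helpers used as targets and as the guard test.
    static boolean isShort(String s) { return s.length() <= 3; }
    static String shout(String s) { return s.toUpperCase() + "!"; }
    static String whisper(String s) { return s.toLowerCase() + "..."; }

    static String demo() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();

            // Method pointer: String.concat(String) has handle type
            // (String, String)String because the receiver is argument 0.
            MethodHandle concat = lookup.findVirtual(
                String.class, "concat", MethodType.methodType(String.class, String.class));

            // Adapter: bind argument 1, leaving a handle of type (String)String.
            MethodHandle addSuffix = MethodHandles.insertArguments(concat, 1, "-suffix");

            MethodType stringToString = MethodType.methodType(String.class, String.class);
            MethodHandle test = lookup.findStatic(MethodHandleSketch.class, "isShort",
                MethodType.methodType(boolean.class, String.class));
            MethodHandle shout = lookup.findStatic(MethodHandleSketch.class, "shout", stringToString);
            MethodHandle whisper = lookup.findStatic(MethodHandleSketch.class, "whisper", stringToString);

            // Flow control: the test decides per call which target runs.
            MethodHandle guarded = MethodHandles.guardWithTest(test, shout, whisper);

            String a = (String) addSuffix.invokeExact("name");
            String b = (String) guarded.invokeExact("hey");
            String c = (String) guarded.invokeExact("HELLO");
            return a + " " + b + " " + c;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // name-suffix HEY! hello...
    }
}
```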

bytecode

invocation

  • static, virtual, interface, special
  • they mostly do same thing (check arg types, cache, …, invoke)
  • invokedynamic lets you define the flow

tools

  • ASM : byte code gen
  • Jitescript : DSL/fluent API around ASM
  • InvokeBinder : DSL/fluent API for MH chains

CallSite

  • holder for MethodHandle
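
A sketch of a call site as a holder whose target can be swapped, using the JDK's `MutableCallSite`. In real invokedynamic use the call site is returned by a bootstrap method and the JVM links the bytecode to it; here `dynamicInvoker()` stands in for that linkage so the example stays in plain Java. The `english` and `swedish` methods are made up.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

final class CallSiteSketch {

    // Hypothetical targets with the same type, (String)String.
    static String english(String name) { return "Hello, " + name; }
    static String swedish(String name) { return "Hej, " + name; }

    static String demo() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(String.class, String.class);
            MethodHandle english = lookup.findStatic(CallSiteSketch.class, "english", type);
            MethodHandle swedish = lookup.findStatic(CallSiteSketch.class, "swedish", type);

            // The call site holds the current target.
            MutableCallSite site = new MutableCallSite(english);
            // A handle that always calls whatever the site currently holds.
            MethodHandle invoker = site.dynamicInvoker();

            String before = (String) invoker.invokeExact("indy");
            site.setTarget(swedish); // relink the call site
            String after = (String) invoker.invokeExact("indy");
            return before + " / " + after;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // Hello, indy / Hej, indy
    }
}
```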

your turn

  • play with MethodHandle-s
  • try generating invokedynamic - see "JVM Bytecode for Dummies"

http://blog.headius.com


3pm - Wholly Graal - Java on GPU

  • Vasanth Venkatachalam, AMD
  • Christian Thalinger, Oracle
  • Christian Wimmer, Oracle Labs

graal overview

Graal is a new JIT compiler for HotSpot

research project

+----------+----------+-------+
| Server   | Client   | Graal |
| Compiler | Compiler |       |
+----------+----------+-------+
| compiler interface          |
+-----------------------------+
| Java HotSpot VM             |
+-----------------------------+

OpenJDK project

Sumatra OpenJDK project

backends for graal

truffle

bytecode ---------+--> graal -> cpu | ptx | hsail
                  |
ast interpreter ->+
truffle API

graal structure

bytecode

bytecode compiler

method inlining
global value numbering
escape analysis
loop optimizations

vm lowering

memory optimizations
architecture lowering

register allocation
code generation

machine code

graal intermediate representation

  • iterable and named fields
  • automatic def-use edges
  • lots of visualization tools

produces code better than client compiler but not as good as server compiler

GPU offload

why?

  • GPU has hundreds of cores
  • data-parallel parts of program 
    • e.g., squaring array elements
  • save power

considerations

  • java needs programming model to express data-parallel 
    • jdk 8: parallel streams
  • generate code for GPU while running on CPU
  • intermediate GPU target language then translate to specific GPU 
    • e.g., HSAIL

sumatra

  • goal: GPU enablement for Java
  • use stream API and sumatra does rest

heterogeneous system architecture (HSA)

  • standard interface to GPU
  • common intermediate language: HSAIL
  • common runtime: HSA stack
  • shared virtual memory 
    • direct access to Java heap object in main memory from GPU cores

execution model

  • grid based
  • programmer supplies "kernel" that is run on each work-item
  • kernel: written as single thread of execution
  • each work item has unique id
  • programmer specifies number of work-items
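
The execution model maps naturally onto the JDK 8 stream API mentioned earlier: the range size is the number of work-items, the lambda is the kernel, and the lambda argument is the work-item id. The sketch below squares array elements (the example from the talk). On a stock JVM it simply runs on CPU threads; whether such a lambda gets offloaded to a GPU depends on a Sumatra-style JVM and is not shown here.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class SquareKernelSketch {

    static int[] square(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length)   // number of work-items
                 .parallel()            // work-items may run concurrently
                 .forEach(id -> out[id] = in[id] * in[id]); // kernel, written for one id
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(square(new int[] {1, 2, 3, 4})));
    }
}
```

Each work-item writes only its own slot of `out`, so the kernel needs no synchronization.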

hsail

  • emitted by JVM
  • translated to GPU ISA by "finalizer"
  • mul_s32 $s3, $s0, $s1

examples

  • Math intrinsics (e.g., sqrt)
  • arrays, string manipulation
  • instanceof
  • increment point.x in array of points
  • atomic operations

5:30pm - Experimenting with the Boundaries of Static Typing

make rock solid via static types : good for mission-critical code

API - compiletime error if

  • move: calling on finished game
  • whoWon: calling on non-finished
  • takeBack: empty board
  • playerAt : …

carry "state/shape" in type

move type complexity into library so developer does not have to deal
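
A toy sketch of carrying state in the type, in Java. The names are hypothetical and the rules are deliberately trivial (the game ends after a fixed number of moves and "X" always wins); the speaker's actual API is not reproduced here. The point is only that `whoWon()` does not exist on a running game and `move()` does not exist on a finished one, so the misuse cases above fail to compile.

```java
import java.util.function.Function;

final class TypeStateDemo {

    static String play() {
        RunningGameSketch game = new RunningGameSketch(2);
        // Each move forces the caller to handle both possible outcomes.
        return game.<String>move(
            next -> next.<String>move(stillRunning -> "still running",
                                      finished -> finished.whoWon()),
            finished -> finished.whoWon());
    }

    public static void main(String[] args) {
        System.out.println(play());
    }
}

final class RunningGameSketch {
    private final int movesLeft;

    RunningGameSketch(int movesLeft) {
        if (movesLeft < 1) {
            throw new IllegalArgumentException("need at least one move");
        }
        this.movesLeft = movesLeft;
    }

    // A move yields either a running or a finished game.
    <R> R move(Function<RunningGameSketch, R> ifRunning,
               Function<FinishedGameSketch, R> ifFinished) {
        return movesLeft > 1
            ? ifRunning.apply(new RunningGameSketch(movesLeft - 1))
            : ifFinished.apply(new FinishedGameSketch("X"));
    }
    // No whoWon() here: asking for a winner too early is a compile-time error.
}

final class FinishedGameSketch {
    private final String winner;

    FinishedGameSketch(String winner) {
        this.winner = winner;
    }

    String whoWon() {
        return winner;
    }
    // No move() here: moving on a finished game is a compile-time error.
}
```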

examples in Groovy, Java, Scala


7:30pm - Developing Small Languages with Scala Parser Combinators

  • Travis Dazell, Systems Architect, Digi-Key Corporation

RegexParsers

small parsers COMBINed (combinator) to form a larger parser

use-case


8:30pm - GlassFish Community BoF

  • Anil Gaur - VP, Software Development, Oracle
  • Reza Rahman - Java EE/GlassFish Evangelist, Oracle
  • Santiago * - Avatar Lead, Oracle

EE 7 / GlassFish 4

Java EE 7

  • developer productivity
  • HTML5

Community contributions

  • FishCat
  • Adopt-a-JSR - 20+ JUGs
  • Tic Tac Toe sample

GlassFish 4

  • admin console : concurrency, jbatch
  • background exec of admin commands 
    • attach/detach : foreground/background cmd (just like unix&, jobs)
  • more REST/SSE support for admin
  • log format changes 
    • change from Uniform Log Formatter (ULF) to Oracle Diagnostics Logging (ODL)
  • more cmds
  • config defaults
  • OSGi admin 
    • integrate GoGo so works from asadmin
  • Concurrency : Managed*
  • JBatch : admin

Ecosystem for GlassFish 4

  • NetBeans
  • Eclipse
  • JetBrains
  • OSGi
  • Oracle Enterprise Manager

Project Avatar

  • http://avatar.java.net/
  • thin-server model
  • data communication - model on client
  • http/rest; websockets; SSE
  • services written in EE or nashorn/javascript

2013-09-23

Last Modified : 2013 Sep 24 (Tue) 11:59:38 by carr.

Monday September 23, 2013


10:00am - Looking into the JVM Crystal Ball

  • Mikael Vidstedt, JVM Architect, Oracle

JDK 7

JVM Convergence

  • HotSpot + JRockit + CDC (embedded)
  • merge around HotSpot
  • from JRockit 
    • serviceability : Java Flight Recorder
  • from embedded 
    • scalability
  • goal: same JVM from small devices to large hardware

Over the past year, most time was spent on security.

Java Flight Recorder - JDK 7u40 (from JRockit)

  • Event-based tracer and profiler 
    • vm internal info
    • jdk level events (e.g., I/O)
  • cyclic buffer in memory or optional store-to-disk
  • overhead: ~2-3%

Java Mission Control - JDK 7u40

  • $JAVA_HOME/bin/jmc
  • monitoring/mgmt
  • java heap/GC, hot methods, …
  • visualization of Java Flight Recorder data

Misc - JDK7u40

  • rewrite of invokedynamic impl 
    • from assembly language to Java impl
    • still needs work
  • G1 (garbage first) GC improvements 
    • tune default settings (based on feedback JDK-8001425)
    • humongous allocations prevent mixed GCs (JDK-8020155)
  • too many GC impls 
    • plan to converge around G1
  • String.intern() table performance 
    • bumped to 60013 entries on 64 bit systems (was: 1009 entries)
    • goal is to make size dynamic
  • 300+ bug fixes in HotSpot

JDK 8

Removing Permanent Generation (JEP 122)

Previous:

+-----------+---------+---------------+
| Java Heap | PermGen | Native Memory |
+-----------+---------+---------------+
  • PermGen 
    • where class metadata stored
    • set at startup - not dynamic
  • PermGen removed 
    • class metadata moves to native memory (Metaspace); interned strings and class statics move to the Java heap
    • perm gen setting now ignored

Tiered Compilation

  • compiler convergence 
    • interpreter
    • client compiler (C1) 
      • faster startup
    • server compiler (C2) 
      • top performance (but compile cost)
  • tiered 
    • collect profiling info using C1 
      • used to happen in interpreter
    • then use info in C2
    • leads to faster startup

Memory Footprint

  • optimizations of common data types 
    • no unused bits
  • class, method structures 
    • Class ~30-40 bytes
    • Method ~ 32 bytes
  • Example: 10k classes, 100k methods -> ~3.5MB/process 
    • critical for embedded

JVM Future

Security

Cloud

  • e.g.,: thousands of JVMs running (almost) the same app on same "machine"
  • manage resources carefully 
    • memory/cpu varies significantly
    • virtualization adds to this
  • adapt to resource changes between JVMs 
    • maximum density
  • maintain high-availability and isolation

Manageability and Observability

  • ergonomics 
    • good enough default settings for majority of workloads
  • provide visibility into Java processes 
    • low-level data + high-level aggregation

Multi-Language

  • JSR292 / invokedynamic 
    • performance*
    • lambda relies heavily on it
    • nashorn - javascript on JVM (JDK 8) 
      • instead of rhino
  • improved Java <-> native

Scalability

  • multi-core + data parallelism 
    • lambdas + fork/join -> .parallelStream()
    • synchronization/locks
  • memory 
    • huge heaps (40GB++)
    • NUMA
  • footprint + embedded 
    • streamline Java for small devices

JVM Components

Compiler

  • Sumatra : Java on GPUs
  • code memory management
  • compiler manageability and observability 
    • why/control over compiler decisions
  • compile time and warm-up time 
    • AOT?

GC

  • G1 (-XX:+UseG1GC)
    • works by dividing heap into many small regions
    • regions individually GCed
  • focus: big heaps, low/consistent pause times 
    • without needing excessive tuning
  • settings 
    • regions selected, sizes of generations, number of GC threads
  • goal: deprecate/remove CMS
  • feedback: mailto:hotspot-gc-use@openjdk.java.net

Runtime

  • modularization/jigsaw
  • dynamic resizing of string and symbol tables
  • class data sharing (CDS)
  • contended locking improvements

Serviceability

  • Java Flight Recorder 
    • additional events, enable dynamically
    • event sampling
    • auto analysis in Java Mission Control
  • jcmd continued 
    • goal: deprecate other j* serviceability commands over time 
      • jstack, jinfo, jmap, …
  • JMX 
    • annotations for defining MBeans, REST protocol, batched operations
  • deprecate/remove 
    • JConsole : move functionality to Java Mission Control/VisualVM
    • hprof agent 
      • what parts are people using - replace its functionality elsewhere

Misc

  • improved testability 
    • unit testing of JVM internals
  • clean up HotSpot OS code and Makefiles 
    • unify copy/pasted logic

11:30am - Purely Functional Data Structures

  • Dan Rosen, Twitter

Mutability

Problems

  • unstable identity (as viewed by containing objects) 
    • if internal data changes and is used by hashCode() then identity changes
  • difficult to satisfy superclass behavior contracts in subclasses
  • 3: prevent container contents from changing 
    • Collections.unmodifiableSet
    • sync + copy (yuck)
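
A minimal sketch of the unstable-identity problem, using a mutable `ArrayList` as the element of a `HashSet`. The list's `hashCode()` depends on its contents, so mutating it after insertion makes the container unable to find it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UnstableIdentitySketch {

    // Returns { found before mutation, found after mutation }.
    static boolean[] demo() {
        Set<List<String>> container = new HashSet<>();
        List<String> element = new ArrayList<>();
        element.add("a");
        container.add(element);          // stored under the hash of ["a"]

        boolean before = container.contains(element);
        element.add("b");                // hashCode() is now that of ["a", "b"]
        boolean after = container.contains(element);
        return new boolean[] { before, after };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("found before mutation: " + result[0]); // true
        System.out.println("found after mutation:  " + result[1]); // false
    }
}
```

The element is still physically inside the set; it just can no longer be looked up, which is the kind of bug immutable elements rule out.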

Solution for 3: Persistent data structures

  • final fields in objects
  • setters return new object with copies of fields
  • common substructure between copies

Example: Stack

  • implemented as singly-linked list
  • shares all substructure
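
A minimal sketch of such a persistent stack (class name made up). `push` allocates one node that points at the existing stack, and `pop` just returns the tail, so old and new versions share everything below the top.

```java
import java.util.NoSuchElementException;

final class PersistentStackSketch<T> {
    private final T head;                        // top element (unused when empty)
    private final PersistentStackSketch<T> tail; // rest of the stack; null marks the empty stack

    private PersistentStackSketch(T head, PersistentStackSketch<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    static <T> PersistentStackSketch<T> empty() {
        return new PersistentStackSketch<>(null, null);
    }

    boolean isEmpty() {
        return tail == null;
    }

    // O(1): the new stack reuses this one as its tail; nothing is copied.
    PersistentStackSketch<T> push(T value) {
        return new PersistentStackSketch<>(value, this);
    }

    T peek() {
        if (isEmpty()) {
            throw new NoSuchElementException("empty stack");
        }
        return head;
    }

    // O(1): the popped stack is the shared tail itself.
    PersistentStackSketch<T> pop() {
        if (isEmpty()) {
            throw new NoSuchElementException("empty stack");
        }
        return tail;
    }

    public static void main(String[] args) {
        PersistentStackSketch<Integer> s1 = PersistentStackSketch.<Integer>empty().push(1).push(2);
        PersistentStackSketch<Integer> s2 = s1.push(3);
        System.out.println(s1.peek());       // 2, s1 is unchanged by the push
        System.out.println(s2.peek());       // 3
        System.out.println(s2.pop() == s1);  // true, the tail is shared
    }
}
```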

Example: Queue

  • implemented as doubly-linked list
  • bad: no common substructure -> deep copy
  • solution: use two stacks: incoming/outgoing 
    • analysis 
      • enqueue: O(1)
      • dequeue: O(1), or O(n) when the outgoing stack is empty and must be refilled
    • amortized analysis 
      • assume each enqueued element will eventually be rotated
      • pay for cost of rotation for each element when enqueuing
      • enqueue: O(1) with 1 credit
      • dequeue: O(1), debiting when needed to rotate
      • but not suitable for real-time app
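
A sketch of the two-stack queue under the same assumptions as the notes (names made up). Enqueue pushes onto the incoming stack; dequeue pops the outgoing stack and, only when that is empty, first reverses the incoming stack onto it. That reversal is the O(n) rotation that the amortized analysis pays for in advance.

```java
import java.util.NoSuchElementException;

final class PersistentQueueSketch<T> {

    // Minimal immutable singly-linked list node, used as a stack.
    private static final class Node<T> {
        final T value;
        final Node<T> next;

        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final Node<T> incoming; // newest element on top
    private final Node<T> outgoing; // oldest element on top

    private PersistentQueueSketch(Node<T> incoming, Node<T> outgoing) {
        this.incoming = incoming;
        this.outgoing = outgoing;
    }

    static <T> PersistentQueueSketch<T> empty() {
        return new PersistentQueueSketch<>(null, null);
    }

    boolean isEmpty() {
        return incoming == null && outgoing == null;
    }

    // O(1): push onto the incoming stack; the outgoing stack is shared as-is.
    PersistentQueueSketch<T> enqueue(T value) {
        return new PersistentQueueSketch<>(new Node<>(value, incoming), outgoing);
    }

    // Rotation: when outgoing is empty, reverse incoming onto it. O(n) in that case.
    private PersistentQueueSketch<T> rotated() {
        if (outgoing != null) {
            return this;
        }
        Node<T> reversed = null;
        for (Node<T> n = incoming; n != null; n = n.next) {
            reversed = new Node<>(n.value, reversed);
        }
        return new PersistentQueueSketch<>(null, reversed);
    }

    T peek() {
        PersistentQueueSketch<T> q = rotated();
        if (q.outgoing == null) {
            throw new NoSuchElementException("empty queue");
        }
        return q.outgoing.value;
    }

    PersistentQueueSketch<T> dequeue() {
        PersistentQueueSketch<T> q = rotated();
        if (q.outgoing == null) {
            throw new NoSuchElementException("empty queue");
        }
        return new PersistentQueueSketch<>(q.incoming, q.outgoing.next);
    }

    public static void main(String[] args) {
        PersistentQueueSketch<Integer> q =
            PersistentQueueSketch.<Integer>empty().enqueue(1).enqueue(2).enqueue(3);
        PersistentQueueSketch<Integer> rest = q.dequeue();
        System.out.println(q.peek());    // 1, the original queue is unchanged
        System.out.println(rest.peek()); // 2
    }
}
```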

New invariant: stack can be at most 3 elements long

  • rotation now constant time
  • longer stacks then nested

1pm - Performance Tuning Where Java Meets the Hardware

  • Darryl Gove - Senior Principal Engineer, Oracle
  • Charlie Hunt - Architect Performance Engineering, Salesforce.com

Compiling

File.java -> File.class             -> JVM (optimization)
File.c    -> File.o (optimization)  -> File.exe

Optimization at runtime

  • use runtime info
  • finer-grained hardware info
  • optimize used code
  • super optimize important code paths

JIT compilation

  • sometimes see time move from expected location to different location 
    • because of inlining

Watching for GC

  • after Interpreter/JIT warmup then GC is majority of system time
  • -XX:+PrintGC -XX:+PrintGCDetails -Xloggc:gc.log

What hardware can tell you about software

Hardware performance counters

  • events 
    • instructions executed
    • cycles taken
    • loads from memory

instructions per cycle (IPC)

  • typical measure of performance
  • seemingly : low IPC/bad, high IPC/good
  • but high might mean doing something unnecessary

solaris

  • cputrack -ef -c instr_retired.any,cpu_clk_unhalted.core java shapelist

cause of low IPC

  • long latency instructions (e.g., divide)
  • fetch data from memory
  • "bad things" happening to processor

solaris collect -h PAPI_l2_tcm -p on -j on shapelist

pointers and memory

  • pointers are virtual address
  • processor uses TLB to get physical address
  • hardware uses physical address to access memory
  • memory returns value back to processor

memory and caches

  • cpu checks cache for data
  • if not in cache, cpu fetches from memory
  • memory returns value 
    • fetching memory is costly

object layout in memory

  • every link in a data structure is a potential cache miss
  • avoid pointers - colocate data
  • threads, cache lines and data
  • to update memory, thread needs exclusive access to data
  • data is shared at cache line boundaries 
    • two threads cannot simultaneously update same cache line

data proximity

  • doing timing on low core and/or low thread machine may give dramatically different numbers than high core/thread machine

memory access rules

  • increase useful data fetched from memory 
    • group used together…

impact of polymorphism

  • do not have to test type
  • just call virtual method
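
A small sketch of the two styles, with made-up shape classes (the talk's own `shapelist` program is not reproduced here). The first method relies on a virtual call; the second shows the explicit type tests it replaces.

```java
final class PolymorphismSketch {

    // Virtual call: no type test; the receiver's class picks the implementation.
    static double totalArea(ShapeSketch[] shapes) {
        double total = 0;
        for (ShapeSketch shape : shapes) {
            total += shape.area();
        }
        return total;
    }

    // For comparison: test the type of each element, then branch.
    static double totalAreaWithTypeTests(Object[] shapes) {
        double total = 0;
        for (Object shape : shapes) {
            if (shape instanceof SquareSketch) {
                SquareSketch s = (SquareSketch) shape;
                total += s.side * s.side;
            } else if (shape instanceof RectangleSketch) {
                RectangleSketch r = (RectangleSketch) shape;
                total += r.width * r.height;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ShapeSketch[] shapes = { new SquareSketch(2), new RectangleSketch(2, 3) };
        System.out.println(totalArea(shapes));              // 10.0
        System.out.println(totalAreaWithTypeTests(shapes)); // 10.0
    }
}

interface ShapeSketch {
    double area();
}

final class SquareSketch implements ShapeSketch {
    final double side;

    SquareSketch(double side) {
        this.side = side;
    }

    @Override
    public double area() {
        return side * side;
    }
}

final class RectangleSketch implements ShapeSketch {
    final double width;
    final double height;

    RectangleSketch(double width, double height) {
        this.width = width;
        this.height = height;
    }

    @Override
    public double area() {
        return width * height;
    }
}
```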

3pm - Type Inference in Java SE 8

  • Daniel Smith - Java Language Designer, Oracle


7:30 - SOAP over WebSocket and InfiniBand with JAX-WS Pluggable Transports

  • Harold Carr - Software Architect, Oracle

The JAX-WS standard includes APIs for using POJOs or XML for remote messages, but it does not include APIs that enable the user to control the transport. This BOF discusses adding pluggable transport APIs to the JAX-WS standard. It shows a candidate pluggable transport mechanism for JAX-WS that enables you to use other transports besides HTTP. In particular, it shows the benefits of using WebSocket and InfiniBand transports for SOAP message exchanges.

NOTE: I will post my slides soon.

Here are my slides on my "Remoting Retrospective" presentation at JavaOne 2012: Remoting Retrospective (pdf)

SOAP and/or REST and/or WebSockets

Table of Contents

1 Intro

1.1 BOF6984 : SOAP and/or REST and/or WebSockets

  • Harold Carr 
    • architect of SOAP Web Services Technology at Oracle

1.2 disclaimer

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle

haroldcarr

My OSCON 2012 Friday Blog

Posted by haroldcarr Jul 20, 2012

OSCON Friday July 20, 2012

live streaming

Speaker Slides and Video


1 10am Declarative web data visualization using ClojureScript

Kevin Lynagh

What is Visualization?

Doing it on the Internet

  • http://d3js.org/ : Data Driven Documents 
    • "jQuery" for data
    • declarative, familiar representation (HTML, CSS, SVG)
  • Clojure(Script) 
    • rich data structures; namespaces; deliberate state/mutation
    • better to have 100 functions operate on one data structure than 10 on 10 - alan perlis

His idea: D3 + ClojureScript with emphasis on data rather than mechanics

HIPMUNK

DATA:

[{:flight-no 2, :price 106, :carrier "Alaska", ...}
 ...
]

CODE:

(map
  (fn [{:keys [price carrier depart arrive]}]
      [:div.row
          [:button.price (str "$" price)]
          [:div.flight
              {:style {:left  (time-scale depart)
                       :width (time-scale (- arrive depart))}
               :carrier carrier}
              [:span carrier]]])
  flight-data)

2 11am Client/Server Apps with HTML5 and Java

James Ward

http://www.jamesward.com/

  • client/single code base -> PhoneGap -> deploy to multiple platforms
  • service : accessed from anything via API
  • stateless web tier
  • transparent real-time 

stateless web tier

  • server UN-affinity == scalability and upgrades
  • continuous delivery
  • browser back, forward, reload just work

client/server web and mobile apps

  • REST/JSON services
  • client on a CDN (most of app is static assets)

HTML5 : browser as an application platform

  • tags: video, section, article, header, nav
  • APIs: offline, drag and drop, web storage
  • CSS3
  • plus: faster javascript as an enabler

tools

  • jQuery, https://github.com/twitter/bootstrap, coffeescript, …
  • client asset compilers: play 2, rails, …
  • MVC frameworks: backbone, angular, …
  • client-side templating: mustache, dust.js, …
  • no client dependency management tools yet

example: https://github.com/jamesward/play2bars

CDN

  • static assets in CDN
  • dynamic data in location that deals with data

3 11:50am Hacking JavaFX with Groovy, Clojure, Scala, and Visage

Stephen Chin

Note: Only 7 people in audience (whereas the Instantly Better Vim talk at the same time was so packed they had to move it to another room).

  • JavaFX APIs now in Java 
    • FXML markup for tooling
    • all JVM languages work: groovy, scala, clojure, …
  • JavaFX script no longer supported by Oracle 
    • use Visage (open-source successor)

ScalaFX: wrapper APIs to make it easier to use from Scala

haroldcarr

My OSCON 2012 Thursday Blog

Posted by haroldcarr Jul 19, 2012

OSCON Wednesday July 18, 2012

live streaming


1 9:05am The Learning Map : Danny Hillis (Applied Minds, LLC)

      model of student
             ^
             |
             v
       learning map
        ^        ^
        |        |
        v        v
assessment    learning
resources     resources
  • common core (map of knowledge)
  • LRMI : Learning Resource Metadata Initiative

2 9:20am The Mudslide Hypothesis of Science : Kaitlin Thaney (Digital Science)

  • scientific publication is stuck in the 1600s
  • conventions/traditions last because of aversion to change
  • discovery tools are suboptimal
  • paper! notebooks
  • research managed by postit notes
  • maximize (re-)use
  • design decisions are key

3 9:35am Scaling OpenStack Technology. Lessons From The Field : Brian Aker (HP)

  • openstack is like amazon services
  • nova, glance, swift (core)
  • horizon, keystone, quantum
  • stages: ignore you, laugh at you, fight you, you win
  • openstack is currently at the "laugh at you" stage

4 9:50am The Clothesline Paradox and the Sharing Economy : Tim O'Reilly (O'Reilly Media, Inc.)

  • work on stuff that matters : Tim's blog January 2009
  • create more value than you capture
  • The Shareholder Value Myth - Lynn Stout
  • "Looting" (paper) by George Akerlof and Paul Romer
  • Ted talk by Nick Hanauer - capital does not create jobs - customers do
  • The Gardens of Democracy - Nick Hanauer
  • an economy is an ecosystem

5 The Java EE 7 Platform: Developing for the Cloud

Arun Gupta

Top Ten Features in Java EE 6

  • EJB packaging in WAR
  • Type-safe dependency injection
  • Optional web.xml
  • CDI Events (producer/observer within VM)
  • JSF standardizing on Facelets (MVC - view in CSS/HTML)
  • EJBContainer API
  • @Schedule
  • EJB No Interface View
  • Servlet and CDI extension points
  • Web Profile (subset)

proprietary cloud offerings

  • IaaS: AWS, Azul
  • PaaS: GAE, Cloud Foundry
  • SaaS: SalesForce

Java EE 7 and 8

  • cloud 
    • provisioning
    • elastic and autonomic scalability
    • multi-tenancy
  • modularity 
    • building on jigsaw
    • focus on OSGi interop
    • supporting profiles and modular apps
  • html5 
    • programming model
    • JSON, WebSockets, offline, APIs, DOM

Java EE 7 model

  • build app
  • deploy to cloud admin service (will provision DB, JMS, LDAP, etc)
@DataSourceDefinition(
  name="java:app/jdbc/myDB",
  className="oracle.jdbc.pool.OracleDataSource",
  isolationLevel=TRANSACTION_REPEATABLE_READ,
  initialPoolSize=5)

@JMSConnectionFactoryDefinition(...)
@JMSDestinationDefinition(...)

Elasticity

  • service levels
  • min/max instances
  • self adjustment, capacity on demand

Service Provisioning

  • init LB
  • init DB
  • init instances in cluster

Java EE 7 JSRs

  • minor updates 
    • JSF, JSP, Servlet, CDI, Interceptors
    • EJB, JPA, JTA, Bean Validation
  • major updates 
    • JAX-RS, EL, JMS
  • extension points 
    • CDI, Web Container, Managed Beans
  • candidate JSR 

Java EE 7 Early Draft

  • requires Java SE 7
  • default data source, default JMS connection factory, …
  • jax-rpc optional, …

JPA 2.1

  • stored procedures

JAX-RS 2.0

  • client-side API
  • filters and interceptors (pre/post processing)
  • bean validation integration

JMS 2.0

  • simplified API (with injection)

JSON 1.0

  • DOM-based APIs
  • Streaming APIs

WebSocket

  • bi-directional, full-duplex, single TCP connection
  • use: real-time games, collaboration, social networking

Project Avatar

  • solution for dynamic rich clients

6 11:30am JavaScript Libraries You Aren't Using…Yet

Nathaniel Schutta

  • jQuery getting a bit heavy
  • needs modularity
  • extra bits matter on mobile
  • microframeworks do one thing well (unix model)

zepto.js (subset of jQuery)

  • Thomas Fuchs
  • 7.4kb
  • mobile
  • uses jQuery syntax
  • does not work in IE (on purpose to reduce size)
  • works on safari, chrome, firefox, opera, webkit
  • tap, double tap, swipe, pinch
  • ajax
  • $os to query environment

backbone.js (MVC framework)

  • 1300 lines; 5kb compressed
  • not a UI framework
  • models, events, collections, views, controllers, persistence
  • influenced by Ruby on Rails
  • lots of sample apps

underscore.js

express.js (web server)


7 lunch Rogan Creswick at Galois

I have been teaching myself Haskell. I noticed Rogan Creswick listed in the OSCON attendee directory. He works at Galois - a company that uses Haskell. I contacted him and he suggested meeting for lunch at Galois in downtown Portland.

After getting food from the street vendors we sat around a table in their office along with his colleagues Jason Dagit, Iavor Diatchki and Eric Mertens, discussing benefits of Haskell along with useful resources for learning the language.

I hope that someone from Galois submits at least a couple of talks to OSCON 2013 about their open-source libraries and tools (particularly HaLVM that enables Haskell to run directly on Xen "bare metal") and their usage of Haskell.


8 1:40pm You Ain't SPDY

Chris Strom

HTTP

  • HTTP is essentially one-way in each direction 
    • must wait for request before response can start
  • we open 6 HTTP connections to each server

SPDY

  • built on SSL, binary, no redundant header info, compressed, single tube
  • use: SYNSTREAM, SYNREPLY, DATA
  • server push

9 2:30pm How to Write Compilers and Optimizers (and solve Data Transformation Problems)

Shevek

mini-language

  • describe your problem in some language
  • implement description or write compiler for that language
  • update the problem description

reasons against compiler

  • write a library instead
  • cognitive load on people that

10 4:10pm Hybrid Applications with MongoDB and RDBMS

Steve Francia

Why/What Mongo?

  • RDBMS fine for 40 years 
    • and still a great fit for many problems
    • stores things separately - e.g., MRI, XRAY separate from each other and patient
  • Mongo is document DB - patient document contains MRI, XRAY
  • Mongo: no joins, no TX

Why Hybrid?

  • need TX, rigid data
  • transition
  • right tool/job

Why not hybrid?

  • don't understand needs
  • do NOT: mongo as cache in front of SQL

Hybrid

  • partition data between mongo/RDBMS
  • model layer handles mapping
  • data does not expire
  • no data duplication

Use-cases

  • craigslist 
    • alter table statements took 2 months to run
    • MySQL for active data (100mil); MongoDB for archive (2+bil)
  • customInk.com 
    • reporting team does not want to learn MongoDB
    • MongoDB for active DB; replicate to MySQL for reporting
  • opensky.com 
    • multi-vertical catalog impossible to model in RDBMS
    • MySQL for orders; MongoDB for all else

How to build hybrid

Doctrine (ORM/ODM)

  • SQL: order; order/shipment; order/transaction; inventory
  • Mongo: user, product, address, cart, credit card, inventory, …
  • note: inventory in both : sync via listeners (in model of app)
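
The "sync via listeners" note can be sketched roughly as follows. This is only an illustration of the idea, not Doctrine's API: plain dicts stand in for MySQL and MongoDB, and the names `InventoryModel`, `on_change`, `write_sql` and `write_mongo` are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the two stores; real code would use DB drivers.
sql_inventory: Dict[str, int] = {}
mongo_inventory: Dict[str, dict] = {}


class InventoryModel:
    """Model-layer object that notifies listeners whenever stock changes."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str, int], None]] = []

    def on_change(self, listener: Callable[[str, int], None]) -> None:
        self._listeners.append(listener)

    def set_quantity(self, sku: str, quantity: int) -> None:
        # The model is the single entry point, so both stores see every write.
        for listener in self._listeners:
            listener(sku, quantity)


def write_sql(sku: str, quantity: int) -> None:
    sql_inventory[sku] = quantity


def write_mongo(sku: str, quantity: int) -> None:
    mongo_inventory[sku] = {"sku": sku, "quantity": quantity}


model = InventoryModel()
model.on_change(write_sql)
model.on_change(write_mongo)
model.set_quantity("tshirt-red-m", 12)  # made-up SKU
```

Because the app only ever writes through the model, neither store is treated as a cache of the other, which matches the "no data duplication" rule above except for the one deliberately shared entity.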

Read article at: http://jwage.com


11 5pm Building Functional Hybrid Apps For The iPhone And Android

Carlos Andreu, Michael Thompson

Main point: With the right tools, developing mobile apps is easy.

Service (independent of Mobile app impl)

  • IBM WAS Liberty Profile http://wasdev.net/
  • Eclipse with Liberty Developer Tools (from Eclipse Market Place)
  • CRUD via REST (JAX-RS)
  • JSON

Mobile app (independent of Service impl)

  • hybrid (instead of native)
  • dev tool: Eclipse + IBM Worklight
  • mobile UI: jQuery mobile (could have used dojo mobile, sencha touch)
  • apache cordova (aka PhoneGap): to get accelerometer, capture, geolocation, notifications, contacts, file system, events
  • WL.TabBar (iPhone); WL.OptionsMenu (Android)
  • Worklight adapters to do data (REST, DB, …)
  • testing: QUnit

12 8pm David Friesen Quintet at Jimmy Mak's

haroldcarr

My OSCON 2012 Tuesday Blog

Posted by haroldcarr Jul 17, 2012

OSCON Tuesday July 17, 2012

live streaming


2 1:30pm Build Social and Personal Data Apps using the Open Source Singly Platform

Jeremie Miller

Thomas Muldowney

https://singly.com/

https://github.com/singly

Personal Data Platform

examples:

Running an example:


doing node_express_skeleton (above)

https://singly.com/
create app
   URL          : localhost:8043
   callback URL : localhost:8043/callback

# for backpack (above):
# npm install express querystring request sprintf async node-native-zip

# for skeleton
npm install express querystring request sprintf oauth ejs

node app.js <your app client ID> <your app client SECRET>

# for backpack
# localhost:8442

# for skeleton
localhost:8043

3 6:00pm Exhibit Hall


4 8:00pm Party at Puppet Labs

haroldcarr

My OSCON 2012 Monday Blog

Posted by haroldcarr Jul 17, 2012

OSCON Monday July 16, 2012

live streaming


1 9:00am Building applications with MongoDB: An introduction

Steve Francia

slides: http://www.slideshare.net/spf13/oscon-2012

excerpts:

IDEAL

  • horizontal scaling 
    • cloud compatible, commodity hardware
  • fast
  • easy development 
    • data model, schema agility (N.B.: this means the schema is your code!)

mongoDB

  • key/value good but
  • non-relational (no joins) makes horizontal scaling practical
  • document models are good
  • run anywhere virtualized: cloud, metal
  • written in C++
  • data serialized to BSON
  • extensive use of mem-mapped files (read-through/write-through memory caching)
  • speed: memcached > mongoDB > RDBMS

document model

  • instead of separate tables, embed data inside docs they "belong" to
  • RDBMS: index -> MRIs, XRays, Invoices, … 
    • normalized data gives : efficiency, validity, …
  • Document: Patient (nested)
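
A small sketch of the difference, using made-up patient data: the relational layout keeps MRIs and X-rays in separate tables keyed by patient id, while the document layout nests them inside the patient. The function name `embed` and all field names are assumptions for illustration.

```python
# Relational style: each kind of record lives in its own table, keyed by patient id.
patients = [{"id": 1, "name": "Pat"}]
mris = [{"patient_id": 1, "date": "2012-07-01"}]
xrays = [{"patient_id": 1, "date": "2012-07-03"}]


def embed(patient: dict) -> dict:
    """Build one nested patient document from the separate tables (a manual join)."""
    pid = patient["id"]
    return {
        "name": patient["name"],
        "mris": [{"date": m["date"]} for m in mris if m["patient_id"] == pid],
        "xrays": [{"date": x["date"]} for x in xrays if x["patient_id"] == pid],
    }


patient_doc = embed(patients[0])
```

Once the data is stored in the nested form, reading a patient is a single lookup with no join.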

terminology

                                          
RDBMS         Mongo
table/view    collection
row           document
index         index
join          embedded document
foreign key   reference
partition     shard

start with an object (or array, hash, dict, …)

  • no create table, no create collection
place1 = {
    name : "10gen HQ",
 address : "dfdfd",
    city : "New York",
     zip : "10011",
    tags : [ "business", "awesome" ]
}

insert record

  • can't do SQL injection
db.places.insert(place1)

query

db.places.findOne({ zip : "10011", tags: "awesome"})
db.places.find({tags: "business"})

nested documents

  • array of objects
  • as deep as you want with consistent performance
{ _id : ObjectId("4c4ba5c...."),
    name : "10gen HQ",
 address : "dfdfd",
    city : "New York",
     zip : "10011",
    tags : [ "business", "awesome" ],
comments : [ { author : "Fred",
               date   : "Sat ...",
               text   : "best ..."
             } ]
}

ObjectId

  • | timestamp | mac address of machine created on | process id | incremented value |
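
A minimal sketch of pulling those four fields out of an id. The byte widths (4-byte timestamp, 3-byte machine part, 2-byte process id, 3-byte counter) are not in my notes; they are the 12-byte layout documented for ObjectIds at the time, so treat them as an assumption. The id used below is made up.

```python
from datetime import datetime, timezone


def split_object_id(hex_id: str) -> dict:
    """Split a 24-hex-digit ObjectId into the four fields listed above."""
    raw = bytes.fromhex(hex_id)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex digits)")
    seconds = int.from_bytes(raw[0:4], "big")  # seconds since the Unix epoch
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "machine": raw[4:7].hex(),
        "process_id": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }


# Made-up id for illustration only: timestamp, machine, pid, counter.
fields = split_object_id("4c4ba5c0" "a1b2c3" "04d2" "00002a")
```

Since the timestamp comes first, sorting by `_id` roughly sorts by creation time.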

updates

  • mongodb atomic operator 
    • $push : query and update in one atomic step 
      • instead of pulling the query result into your app and then updating (potential write conflict with other users)
    • $set, $unset, $rename, $push, $pop, $pull, $addToSet, $inc
db.places.update(
  { name : "10gen HQ"},
  { $push :
    { comments :
      { author : "...",
          date : "...",
          text : "..."
      }}})

index nested documents

db.posts.ensureIndex({"comments.author":1})
db.posts.find({'comments.author':'Fred'})

regular expressions

db.posts.find({'comments.author':/^Fr/})

index multiple values

db.posts.ensureIndex({tags:1})

geospatial

  • $near, $within, …
db.posts.ensureIndex({"location": "2d"})
db.posts.find({"location":{$near:[22,42]}})

cursors

$cursor = $c->find(array("foo" => "bar"));
foreach ($cursor as $id => $value) {
    echo "$id: ";
    var_dump($value);
}

paging

page_num = 3;
results_per_page = 10;
db.collection.find().sort...
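
I did not capture the rest of that query. A common way to page is to skip the earlier pages and limit to one page's worth of results; the sketch below shows only that arithmetic, on a plain Python list rather than a collection. The helper names `page_bounds` and `get_page` are hypothetical.

```python
def page_bounds(page_num: int, results_per_page: int) -> tuple:
    """Return (skip, limit) for a 1-based page number."""
    if page_num < 1 or results_per_page < 1:
        raise ValueError("page_num and results_per_page must be positive")
    return (page_num - 1) * results_per_page, results_per_page


def get_page(items: list, page_num: int, results_per_page: int) -> list:
    """Slice out one page; stands in for skipping and limiting a sorted cursor."""
    skip, limit = page_bounds(page_num, results_per_page)
    return items[skip:skip + limit]


# Stand-in for an already sorted result set.
page = get_page(list(range(100)), page_num=3, results_per_page=10)
```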

running mongoDB


cd /tmp
curl http://downloads.mongodb.org/osx/mongodb-osx-x86_64-x.y.z.tgz > mongo.tgz
tar -zxvf mongo.tgz

export PATH=/tmp/mongodb-osx-x86_64-2.0.6/bin:$PATH

mkdir  /tmp/mongodbdir
mongod --dbpath /tmp/mongodbdir &

curl -L http://j.mp/OSCONvenues | mongoimport -d milieu -c venues
mongo

use milieu
db.venues.count()
db.venues.findOne()
db.venues.ensureIndex({'location.geo':'2d'})
db.venues.getIndexes()

2 1:30pm Getting Started with OpenStack

Monty Taylor

slides: https://openstack-ci.github.com/publications/tutorial

left after 20 minutes of him messing with CLI …


3 1:30pm Scala Koans

… went here instead - but left at 2:15 when everyone was still trying to just get software installed …

… hallway conversations worked!

I'm at OSCON in Portland, Oregon this week. I will be posting my notes here.  
haroldcarr

My Strata Thursday Blog

Posted by haroldcarr Mar 1, 2012

Thursday March 1, 2012


Table of Contents

1 8:50am Democratization of Data Platforms, Jonathan Gosier (metaLayer Inc.)


2 9:05am 5 Big Questions about Big Data, Luke Lonergan (Greenplum, a division of EMC)

BigData opens door to new approach to engaging customers and making decisions.

Data that was formerly discarded is now mined.

Scale everything: storage, compute, analytics, interaction.


3 9:15am The Trouble with Taste, Coco Krumme (MIT Media Lab)

wine tasting experiment: social influence, content & value, expectation & perception

wine aroma wheel versus mass spectrometer versus human nose


4 9:25am Embrace the Chaos, Pete Warden (Jetpac)

hard to access structured data (e.g., Amazon's product catalog) in the wild (unless you pay for it)

what was previously data "exhaust" (i.e., unstructured data) is now valuable (and generally free or very low cost)

ask yourself: what would Google do?

look for patterns

"forgive" human errors, noise, poor grammar/spelling, …

what data sources do you have? (e.g., support emails, invoices, tweets)


5 9:35am Open Data and the Internet of Things, Usman Haque (Pachube.com)

machine to machine

  • gas company meter reading to compute bill (purpose specific)
  • closed

internet of things

  • light bulbs, printers, cars, people, smart phones, appliances
  • data open to be used in other contexts 
    • e.g., cell phone to tower info : tomtom traffic patterns

crowd-sourced data - e.g., radiation feeds

Internet of Things Bill of Rights (2011)


6 9:45am Big Data’s Next Step: Applications, Gary Lang (MarkLogic)

  • 2001 - MarkLogic founded - queries against (un)structured data in repository
  • 2003 - XML in CMS with search
  • 2006 - Hadoop
  • 2007 - poly-structured
  • 2011 - BigData (same thing MarkLogic has always done)
  • any data, volume, structure
  • analyze everything all the time in real-time
  • keep all data in commodity store and slurp into hadoop when needed

7 9:50am Dr. Richard Merkin, President and CEO of Heritage Provider Network, Announces the Winner of the Second Heritage Health Progress Prize, Richard Merkin (Heritage Provider Network)

healthcare data mining to improve people's life


8 Start-up showcase winners

judge winners

  • Tokutek - MySQL speed, scalability, agility
  • Lex Machina - data from patent litigation to interpret and make predictions
  • memsql - accelerated data
  • bitdeli - …

audience winner

  • metaLayer - drag and drop discovery 
    • delv - take real-time data streams and mashup

9 9:55am Using Google Data for Short-term Economic Forecasting, Hal Varian (Google)


10 10:40am Mining the Eventbrite Social Graph for Recommending Events, Vipul Sharma (Eventbrite)

Platform

Infrastructure

  • search - solr
  • recommendation - hadoop, native MapReduce, bash
  • persistence - MySQL, HDFS, HBase, MongoDB (investigating Cassandra and Riak)
  • stream - RabbitMQ (investigating Kafka)
  • offline - MapReduce, Streaming, Hive, Hue

Infrastructure - Sqoozie

  • workflow for mysql imports to HDFS 
    • generate sqoop commands, run imports in parallel
  • transparent to schema changes
  • include/exclude column, data types, table
  • data type casting
  • distributed table imports

Infrastructure - Blammo

  • raw logs import to HDFS via flume
  • 5 min latency
  • logs are key/value in json
  • each log publishes schema in yaml

Recommendation system

recommendation engines

  • item hierarchy - you bought camera, need batteries
  • collaborative - people who bought camera also bought…
  • collaborative item-item similarity - you like Godfather so would like …
  • social graph based - your friends liked …
  • interest graph based - your friends who like rock music like you are attending …

why interest?

  • events are social
  • interests are changing
  • dense graph is irrelevant (need segmentation)

how know interest?

  • ask you
  • based on activity (attended, browsed)
  • facebook
  • machine learning 
    • logistic regression using MLE
    • sparse matrix generated via MapReduce
    • model for each interest

recommendations

  • model based vs clustering
  • item-item vs user-user
  • building social graph is clustering step
  • social graph recommendation is a ranking problem

implicit social graph

u = user
e = event

              u1
             /  \
            e1   \
           /      \
          u2       u3
         /  \
       e2    e3
       /      \
     u4        u5
  • mixed features
  • series of map-reduce jobs
  • output on HDFS in flat files; input to subsequent jobs
  • orders - event -> attendees 
    • map eid: uid
    • reduce eid[uid]
  • attendees -> social graph 
    • input eid[uid]
    • map uid[uid]
    • reduce …
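
A rough in-memory sketch of those two jobs, with made-up order records. I did not capture the second job's reduce step, so collecting each user's co-attendees into a set is my assumption; the real jobs ran on Hadoop and passed flat files on HDFS between steps.

```python
from collections import defaultdict
from itertools import permutations

# Made-up order records: (event id, user id).
orders = [("e1", "u1"), ("e1", "u2"), ("e2", "u2"),
          ("e2", "u4"), ("e3", "u2"), ("e3", "u5")]


def attendees_by_event(orders):
    """Job 1: map emits eid -> uid; reduce collects eid -> [uid]."""
    grouped = defaultdict(set)
    for eid, uid in orders:
        grouped[eid].add(uid)
    return grouped


def social_graph(event_attendees):
    """Job 2: map emits uid -> co-attendee uid; reduce (assumed) collects uid -> {uid}."""
    graph = defaultdict(set)
    for uids in event_attendees.values():
        for a, b in permutations(sorted(uids), 2):
            graph[a].add(b)
    return graph


graph = social_graph(attendees_by_event(orders))
```

With these orders, `u2` ends up connected to `u1`, `u4` and `u5`, matching the diagram above.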

Hbase (single source of truth)

  • collect data from multiple MR jobs 
    • stores entire social graph
    • over one million writes per second
                     
rowid  neighbors  events  featureX
27110130.367879

tips and tricks

  • distributed cache as much as possible 
    • sped up some MR jobs by hours
    • be sure to use counters
  • hive (use as much as possible) 
    • "flip join" - join + processing/transformation
    • statistical functions using hive
    • UDF
  • memory memory memory
  • LZO, WAL
  • combiners are great until
  • shuffle and sorting stage
  • hadoop ecosystem is still new 
    • optimal level of spill on disk vs jvm memory
    • significant amount of time doing debugging of hadoop itself

(best talk so far - nitty gritty details)


11 11:30am Pretty Simple Data Privacy, Kaitlin Thaney (Digital Science), Betsy Masiello (Google), John Wilbanks (Kauffman Foundation for Entrepreneurship)

John:

  • designers making it easy to give away your data (especially from mobile apps)
  • privacy is about context, social situations and control
  • users are getting pissed
  • simplifying understanding of privacy for users (e.g., iconic representations)
  • genome sequencing : openSNP
  • 23andMe
  • design barriers to donate your data to science
  • Consent to Research - John Wilbanks' effort

Kaitlin:

  • make research more efficient vs. privacy
  • opt-in service A conflicting with opt-out service B
  • de-anonymizers

Betsy:

Google's new privacy policy (in effect starting today: March 1)

  • notification effort started Jan 24 - and home page promos
  • previously 70 different policies - now 1 (but there are still 6 other ones - e.g., google wallet)
  • treat you as single user across all products (HC: even if you don't want it)
  • to avoid: don't log in when doing search, etc (HC: so I have to logout of gmail before doing a search)
  • use "Privacy tools" 
    • dashboard, ads preferences mgr, data liberation front, opt-outs, encrypted search

Panel:


12 1:30pm Data Jujitsu: The Art of Turning Data into Product, DJ Patil (Greylock Partners)

http://www.linkedin.com/in/dpatil

what is a data product (facilitates end goal thru use of data)?

philosophy for data

  • Jujitsu: the art of softness - defeat armed opponents without using weapons (use their energy)
  • use light-weight data to try things out (instead of gigantic design in advance)

build data products as a progression

  • what data do you start with: (un)structured; can you switch that ratio
  • cleanup : un -> structured
  • disambiguate by asking user (move hard backend problem to easy frontend problem)
  • human augmentation is key
  • build easy products first
  • be opportunistic for wins - e.g., people you may know
  • put yourself into the role of a physical surrounding (e.g., physical retail space)
  • giving back data is driver - e.g., who's viewed your profile, viewers by geography
  • data vomit is bad - too much info causes click-through-rate (CTR) to drop off
  • exposing data challenges - e.g., type-casting user
  • set user expectations - set your users up for success (e.g., pandora)
  • hard to test outside of production - need humans to look
  • have to win within 500ms
  • know when to build the serious stuff 
    • e.g., post a job and get people recommended ("pandora for people")

13 2:20pm Connecting Millions of Mobile Devices to the Cloud, James Phillips (Couchbase, Inc.)

  • schema evolution
  • security
  • scalability of consolidated store
  • selection performance
  • bandwidth conservation
  • referential integrity on interruption
  • battery and memory conservation
  • delete propagation
  • which data - temporal, spatial, user?
  • new user provisioning
  • conflict detection and resolution

multi-tier, scalable type 2 sync architecture:

  ??
no sql ; search
data synchronization
load balancer
web app
data infrastructure

external data
   v
big data > analysis
              v
            no sql > web app
              v
            sync > mobile

14 4:00pm Open Source Ceph Storage– Scaling from Gigabytes to Exabytes with Intelligent Nodes, Sage Weil (new dream network / dreamhost)

why?

  • requirements, save time/money

requirements

  • diverse needs 
    • object storage, block devices for snapshots/cloning, shared file, structured data
  • scale 
    • heterogeneous hardware, reliability, fault tolerance

time

  • ease of admin
  • no manual data migration, load balancing
  • painless scaling (up/down)

money

  • low cost per gigabyte
  • no vendor lock-in 
    • software solution on commodity hardware
    • open source

what is ceph?

unified storage system : distributed storage system : data center scale, fault tolerant, commodity hardware

  • objects (big/small)
  • block devices
  • distributed file system

license

  • LGPLv2
  • no dual licensing

storage devices coordinate (i.e., intelligence) so clients do not have to

data distribution

  • objects replicated N times
  • auto placed, balanced, migrated
  • smart about physical infrastructure (e.g., no duplication on same rack)

crush

  • pseudo-random placement algorithm
  • rules: e.g., 3 replicas, same row, different racks
  • predictable bounded migration
  • map update (e.g., new nodes, failure, downgrade) potentially triggers data migration
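
To get a feel for rule-driven pseudo-random placement, here is a much simpler stand-in. It is NOT the CRUSH algorithm: it uses plain highest-random-weight (rendezvous) hashing plus a "one replica per rack" rule, and it ignores the "same row" part of the example rule. The cluster map and all names are hypothetical.

```python
import hashlib

# Hypothetical cluster map: storage node -> rack.
CLUSTER = {
    "osd0": "rack-a", "osd1": "rack-a",
    "osd2": "rack-b", "osd3": "rack-b",
    "osd4": "rack-c", "osd5": "rack-c",
}


def score(object_name: str, node: str) -> int:
    """Deterministic pseudo-random weight for an (object, node) pair."""
    digest = hashlib.sha256(f"{object_name}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, cluster: dict, replicas: int = 3) -> list:
    """Pick the highest-scoring nodes, at most one per rack."""
    chosen, used_racks = [], set()
    ranked = sorted(cluster, key=lambda n: score(object_name, n), reverse=True)
    for node in ranked:
        if cluster[node] not in used_racks:
            chosen.append(node)
            used_racks.add(cluster[node])
        if len(chosen) == replicas:
            break
    return chosen


before = place("photo.jpg", CLUSTER)
# Removing a node that held no replica leaves the placement unchanged.
spare = next(n for n in CLUSTER if n not in before)
after = place("photo.jpg", {n: r for n, r in CLUSTER.items() if n != spare})
```

Any client holding the same cluster map computes the same placement without asking a central lookup service, and a map change only moves the objects whose chosen nodes are affected.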

distributed file system

  • cluster-coherent
  • separate metadata and data paths
  • dynamic subtree partitioning
  • move work from busy to idle servers

best tool for the job

key/value

  • cassandra, riak, redis

object store

  • rados (ceph)

map/reduce

  • gfs, hdfs

posix

  • ceph, lustre, gluster

hbase/bigtable

  • tablets, logs

map/reduce

  • data flows

percolator

  • triggers, transactions

motivation

  • limited options for scalable open source storage 
    • orangefs, lustre, glusterfs, HDFS
  • proprietary 
    • hardware+ software
  • industry needs to change

who

  • created at UC Santa Cruz (2007) as his PhD work
  • supported by DreamHost 2008-2011
  • new company 2012
  • growing community
  • they are hiring: C/C++/Python, sysadmins, testing engineers
  • http://ceph.com

15 4:50pm Mapping social media networks (with no coding) using NodeXL, Marc Smith (Social Media Research Foundation)

NodeXL

  • firefox of GraphML
  • make network charts as easy as making a pie chart
  • connect researchers to social media sources

Open Tools

Open Data

Open Scholarship

social media is all about connections from people to people and exchanges between them

patterns are left behind

like, link, reply, rate, review, favorite, friend, follow, forward, edit, tag, comment, check-in, …

the strength of weak ties (in aggregate)

social networks: 1934: Jacob L. Moreno

  • sociogram of a football team
  • look for hubs
  • bridge : in what way is a person with only two connections more "important" than one with hundreds?
  • clusters : sub-communities
  • crowds :
  • isolates :
  • Gephi - photoshop for graphs
  • NodeXL - like MSPaint for graphs

Social Network Theory

http://www.cmu.edu/joss/content/articles/volume8/Welser/images/image040.jpg

article: Visualizing the signatures of social roles in online discussion groups

  • taxonomy of types of people


haroldcarr

My Strata Wednesday Blog

Posted by haroldcarr Feb 29, 2012

Wednesday February 29, 2012


Table of Contents

1 8:50am The Apache Hadoop Ecosystem Doug Cutting (Cloudera)

Context: exponential for decades

  • abundance of computing, storage, generated data
  • peta-scale now affordable
  • traditional data doesn't scale well
  • more data provides greater value
  • time for a new approach

New approach

  • traditional 
    • exotic/expensive hardware (SAN, RAID)
  • big data 
    • commodity/unreliable HW
    • reliability at software level
    • scales further

New software

  • traditional 
    • monolithic
    • schema first
    • proprietary
  • big data 
    • distributed (storage and compute)
    • raw data (optionally do schemas dynamically later)
    • open source

Ecosystem

  • Hadoop is kernel (de facto industry standard)
  • around it : Pig, Hive, Flume, …

Apache

  • no strategic agenda - quality emergent
  • community-based - decisions by consensus/transparent
  • allows competing projects
  • loose federation of projects - permits evolution
  • ensures against vendor lock-in - can't buy Apache

Typical adoption pattern

  • idea that is impractical without Hadoop
  • build Hadoop-based proof of concept
  • move initial app to production
  • add more datasets and users 
    • removing silos in organizations
    • permits easy experiments on real data
  • snowballs into institution's central repository 
    • analysis
    • data processing

Questions

  • what data are you ignoring?
  • how can you use it?
  • how can you combine your data with others?

2 9:00am Do We Have The Tools We Need To Navigate The New World Of Data? Dave Campbell (Microsoft)

Microsoft now supports Hadoop (since last September)

data as a platform

         ^
         |
         |             insights and action
value    |           knowledge
         |         information
         |      data
         |  signal
         |
         +---------------------------------->
                 refine

reduce the time to insight
         ^
         |
         |             insights and action
value    |           knowledge
         |         information
         |      data
         |  signal
     publish/share
         +---------------------------------->
       /          refine
     /
   / combine
  v
reduce the time to insight

http://datamarket.azure.com

Azure Labs: Data Explorer

Visualization

  • search/acquire
  • explore/analyze
  • explain/share

Powerview (will ship with new SQL Server)


3 9:10am Decoding the Great American ZIP myth Abhishek Mehta (Tresata)

Homogeneous

  • William Levitt - father of American suburb (levittown)
  • Henry Ford - any color you want as long as it's black
  • but people inside are different

The tools to build a better financial system are here

  • can be used for other purposes too
  • commoditization of data stack is complete (and free/open source)

Hal Varian

unlimited storage, bandwidth, processing

  • what problem will you solve?

common platform to store, process, analyze, visualize data

(most of) the best minds of our generation are thinking about how to make people click ads …


4 9:20am Guns, Drugs and Oil: Attacking Big Problems with Big Data Mike Olson (Cloudera)

Drugs


Single-Nucleotide Polymorphism

  • Tools: Bowtie and Crossbow (related SoapSNP, Contrail, CloudBurst)

Guns

  • Predictive Policing (Santa Cruz, CA: put cops where it matters) 
    • imagine adding Twitter feeds, etc.
  • Entity Analysis 
    • machine learning and social networking applied to drug trafficking and Terrorism
    • who knows whom? what do they talk about?

Oil

  • reflection seismology 
    • subsurface topography
    • data analysis/modeling to produce subsurface structure and reservoir maps

The ability to use data to solve important social problems.


5 9:30am Machine Learning and Big Data: Sustainable Value or Hype? Flavio Villanustre (LexisNexis Risk Solutions and HPCC Systems)

HPCC (LexisNexis' previously internal proprietary system)

  • open source distributed Big Data analytics platform

collection, ingest, discovery/cleanse, integration, analysis, delivery/visualization

how to extract value from data?

  • machine learning : ECL-ML: HPCC machine learning
  • correlation, classifiers, clustering, statistics, …

6 9:35am Learning Analytics: What Could You Do With Five Orders of Magnitude More Data About Learning? Steve Schoettler (Junyo)

EdTech (educational technology)

Immediate feedback is most important element in student achievement.

technology in classroom

  • apply big data to "click stream" from student to analyze and give feedback

7 9:40am A Big Data Imperative: Driving Big Action Avinash Kaushik (Market Motive)

Blog: occam's razor

how we use data will define us

data democracy: build tools so people can make data decisions directly (rather than placing "data intermediaries" between end users and data).

  • known knowns
  • known unknowns
  • unknown unknowns (biggest problem = e.g., Rumsfeld's excuse for screwing up)

math in the service of humanity


8 9:55am The Information Architecture of Medicine is Broken, Ben Goldacre (Bad Science)


9 10:40am Exploring Social Data: Use Cases for Real-World Application, Chris Moody (Gnip)

The decision that changed everything

  • twitter: Q3 2010 
    • enabled commercial-grade $ access to their data (rather than just previous consumer APIs)
  • now others too 
    • twitter, facebook, delicious, newsgator, youtube, g+, myspace, StockTwits, flickr, StumbleUpon, Dailymotion, WordPress, disqus, …

Reaction time

  • faster/slower 
    • twitter, facebook, g+, youtube, wordpress, disqus/intensedebate

Depth

  • deep/concise 
    • youtube, wordpress, Disqus, g+, facebook, twitter
        ^
deep    |                     Product Development
        | Customer Service
        |                    Brand Management
        |
        |   Supply Chain
concise | PR
        +----------------------------------->
         faster         slower

Expected vs Unexpected Events

  • expected: ramp up, peak, ramp down
  • unexpected: spike, ramp down

Examples

ESRI and Local Retailing

  • what are people saying in store
  • how is inventory impacted by social data?
  • want to know what photos people are taking inside store (e.g., dirty bathroom, bad/broken display)
  • when people steal they brag about it

NetBase, JD Power and Tropicana

  • sources: twitter, blogs, comments
  • characteristics: concise for signal, deep for insight
  • concise: baby boomers like orange juice
  • deep: Tropicana/orange juice associated with reward (in millennials)
  • action: put Tropicana vending machine outside gyms/health clubs

Industrial Parts Supplier

  • sources: twitter, blogs, comments
  • characteristics: coverage
  • where is the next factory/Walmart going to be built?
  • city council people tweeting, meeting minutes, …

VisionLink and Boulder fire

  • sources: twitter, flickr
  • characteristics: fast, geodata
  • new source of images/info about fire
  • gives responders a new view of where to focus attention

What is the right social cocktail to solve your business problem?

Mining a new source

  • Disqus
  • largest 3rd party commenting platform on web

10 11:30am Business Management Strategies for Big Data, Dave Rubin (Oracle)

runs NoSQL development team at Oracle

what is big data

  • velocity - high rates on incoming/temporal
  • volume : vast quantities
  • variety : (un)structured data

common bigdata tech

  • NoSQL DB 
    • dynamic/rapidly changing schema
    • predictable/bounded low latency store
  • MapReduce 
    • break up problem into smaller sub-problems
    • Hadoop
  • HDFS - Hadoop Distributed File System 
    • distributed, scalable storage
    • write once, read many times
  • Hive - query
  • HBase - non-relational DB like Google's BigTable
  • R - language for statistical analysis
  • Pig - program MapReduce
  • Sqoop - SQL<->Hadoop

use-cases

  • predictive ad and content generation
  • data warehousing at facebook 
    • hadoop/hive warehouse
    • 48000 cores
    • 12 TB per node, storage capacity of 5.5 PetaBytes
    • two level network topology
    • apps 
      • reporting : measures of user engagement, microstrategy dashboards
      • ad hoc analysis
      • machine learning : predictive advertising
  • banking 
    • hadoop as engineered shared service
    • lines of business using hadoop
    • two level usage 
      • increase revenue: customer intelligence, sentiment analysis
      • reduce cost: fraud intelligence, risk mgmt, …

web is loaded with predictive signals

Recorded Future - selling big data as a service

  • predict trading volumes
  • predict returns from sentiment
  • predict volatility

oracle products

  • bigdata appliance 
    • hadoop, NoSQL store, RDBMS<->hadoop loader
    • exalytics - fast analytics/visualization
    • R

stream -> big data appliance –(infiniband)-> exadata –(infiniband)-> exalytics


11 1:30pm Building a Data Narrative: Discovering Haight Street, Jesper Andersen (Bloom Studios)

data visualization design like game design

data narration (using statistics, programming, visualization, expression)

don't give users scores, give them stories

example: what is Haight Street like?

  • named after: wikipedia conflicts with SF municipal record

Voronoi tessellation

RapLeaf - name to gender service

Naive Bayes - predictor of positive/negative tweets

maximum entropy model to find what people are saying (with geodata)

instagram: what do people remember?


12 2:20pm Amazon DynamoDB: A seamlessly scalable NoSQL service, Swaminathan Sivasubramanian, (Amazon Web Services)

launched one month ago

scaling

  • app -> server -> DB
  • app -> load-balanced servers -> DB (keep making bigger box for single DB)
  • app -> load-balanced servers -> distributed DB

start with scale out as guiding principle

fully managed NoSQL data store

  • minimal admin
  • low latency SSDs
  • unlimited potential storage/throughput

provisioned throughput

  • how much read/write capacity 
    • not in terms of servers and disk IO
  • increase/decrease any time
  • while app online

consistency

  • default : eventually consistent
  • choice: strongly consistent (cost 2x eventually)

reliability

  • durable 
    • all writes to disk (not memory)
    • acked when it exists in two physical data centers
  • availability 
    • data replicated to multiple zones

table

  • specify primary key when create 
    • hash on single attribute
    • or, composite
  • add more columns/rows any time

performance

  • latency : single digit millisecond
  • SSD-backed
  • consistent as throughput/storage grows
  • no need to tune

Elastic Map Reduce

  • pay as you go Hadoop
  • Hive integration with EMR
  • filters pushed down to DB
  • built in table throughput aware query engine
  • use cases 
    • archive
    • data load
    • complex queries

customers

  • amazon cloud drive, elsevier, smugmug, amazon, shazam, formspring, tapjoy

13 4:00pm Roll Your Own Front End: A Survey of Creative Coding Frameworks, Michael Edgcumbe (Columbia University), Eric Mika (The Department of Objects)

WANCS : we are not computer scientists

frameworks

  • Processing 
    • built with Java
  • OpenFrameworks 
    • built in C++
  • Raphael
  • pocode 
    • built in C++
  • d3.js 
    • toolbox
  • flash/air
  • cinder 
    • in C++
  • PhiloGL 
    • wrapper over

unique

  • Dashboard vs Thumbprints
  • libraries
  • platform UI + Custom UI
  • novel IO

Existing

  • Data Sit
  • pentaho
  • tableau

Visualizations

  • oak ridge nat labs: flocking
  • movie fingerprints: cinemetrics

Stages of Viz

  • gather, clean, feed, group
  • access, iterate, calculate geometry, display, interact, update, repeat

http://github.com/voxels

above frameworks work with

  • linux, windows, mac, android, iOS
  • firefox, IE, chrome, safari

what is powering graphics

  • OpenGL (except flash)

libraries

  • best Processing, OpenFrameworks

recommendation

  • cinder ; if you know c++
  • know lots of stuff: openframeworks
  • don't know how to code : processing
  • canonical stuff to publish to web: d3.js
  • blending: processing

14 4:50pm Linked Data: Turning the Web into a Context Graph, Leigh Dodds, (Kasabi)


identity is hard

  • labeling
  • determining equivalence
  • e.g., street address vs lat/long

identifier used to "route" from one data set into another

use web (i.e., URI) for global identifiers

free ebook: Linked Data Patterns, by Leigh Dodds and Ian Davis


15 5:30pm Expo Hall Reception


16 6:30pm Strata 2012 Startup Showcase

 

 

haroldcarr

My Strata Tuesday Blog

Posted by haroldcarr Feb 28, 2012

Tuesday February 28, 2012


1 9:00am Large Scale Web Mining, Ken Krugler, Scale Unlimited

ken@scaleunlimited.com

@kkrugler

slides

Large scale means you need distributed processing framework.

  • crawl - find the good stuff 
    • still need to focus crawl
    • ethics - don't hit machines needlessly
  • extract - get right stuff out 
    • conflicting constraints: scale vs. precision
    • pages are noisy : ads, boilerplate (navigation), SEO
  • process - turn bytes into bucks 
    • reduction - i.e., pie charts
    • index - enable search
    • analytics - cluster, recommend, …

Web crawling

  • fetch pages
  • extract outlinks (means parse fetched content)
  • manage state of crawl
  • implicit rules 
    • robots exclusion protocol (robots.txt)
    • user agent (who you are, how to contact you)
    • request rate
  • broad (e.g., google)
  • focused 
    • page scoring -> outlinks; perhaps whitelist of domains to avoid traps/honeypots
    • whitelist: use top sites listed in Alexa, Quantcast
  • domain 
    • limit to certain domains; for precise extraction
  • don't crawl 
    • public datasets: leverage other people's crawl data
    • e.g., common crawl, wikipedia (data dump), Spinner, InfoChimps
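
A rough sketch of the implicit rules (robots.txt, user agent, request rate) using Python's standard `urllib.robotparser`. The user agent string and the robots.txt text are made-up examples; a real crawler would fetch robots.txt from each site and put real contact info in the user agent.

```python
from urllib import robotparser

# Hypothetical identity: say who you are and how to contact you.
USER_AGENT = "example-crawler/0.1 (+mailto:crawler-admin@example.com)"

# Example robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_rules(robots_txt):
    """Parse robots.txt text into a rule object."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url):
    """True if the robots exclusion rules let our user agent fetch url."""
    return rules.can_fetch(USER_AGENT, url)


def crawl_delay(rules, default=1.0):
    """Seconds to wait between requests; falls back to a default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = make_rules(ROBOTS_TXT)
```

The fetch loop would call `allowed()` before every request and sleep for `crawl_delay()` seconds between requests to the same host.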

Crawling solutions

Crawling is hard

  • mining breaks implicit contract with sites 
    • you are generally not creating an index that drives traffic to them
    • you are using their bandwidth and server cycles
  • "infinite" web means you will run into edge cases
  • not everybody plays nice: link farms/honeypots, malicious sites, angry webmasters 
    • use whitelist to avoid
  • risk : can do (perceived) damage to sites (and get sued)
  • work at scale 
    • Hadoop - Nutch, Bixo
    • custom queuing - Heritrix, Droids
    • scalable queuing - Storm

Focused crawling

  • only crawl pages likely to be good
  • seed URLs - starting point: put into URL state
  • URL state - DB of all known URLs
  • page score - "quality" of page
  • link score - page score divided by number of outlinks
  • fetched pages - saved results

Ethical crawling

  • send real, valid, info in user agent name 
    • contact info so they can send you email/phone,…
  • honor robots.txt
  • limit your crawl rate
  • comply with blacklisting and data removal requests
  • follow ethical guidelines
  • avoid javascript
  • don't follow form links
  • don't do CAPTCHA
  • grovel when they complain - they are right

1 seed
  -> URL state
    -> 2 sort
      -> 3 focus
        -> 4 fetch and parse
          -> 5 save fetched
            -> 6 page score
              -> 7 extract outlinks
                -> 8 score outlinks
  -> 9 update URL state (back to 2)
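
The loop above could be sketched roughly like this. It is not the speaker's implementation: `fetch_and_parse` and `score_page` are hypothetical callables supplied by the caller, the URL state is just an in-memory dict, and "focus" is approximated by always fetching the highest-scoring known URL next.

```python
import heapq


def focused_crawl(seeds, fetch_and_parse, score_page, max_pages=100):
    """Return {url: page_score} for every page fetched.

    fetch_and_parse(url) -> (text, outlinks) and score_page(text) -> float
    are assumptions of this sketch, not part of any library.
    """
    # URL state: best link score seen so far for each known URL.
    url_state = {url: 1.0 for url in seeds}
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-score, url) for url, score in url_state.items()]
    heapq.heapify(frontier)
    fetched = {}

    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)          # sort + focus
        if url in fetched:
            continue                              # stale duplicate entry
        text, outlinks = fetch_and_parse(url)     # fetch and parse
        page_score = score_page(text)             # page score
        fetched[url] = page_score                 # save fetched
        if not outlinks:
            continue
        # Link score: the page score split across its outlinks.
        link_score = page_score / len(outlinks)
        for link in outlinks:
            if link in fetched:
                continue
            if link_score > url_state.get(link, float("-inf")):
                url_state[link] = link_score      # update URL state
                heapq.heappush(frontier, (-link_score, link))
    return fetched
```

A real crawl would keep the URL state in a DB and add the robots.txt and rate-limit checks before each fetch.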

Seed URLs (listed broad to narrow)

  • list of registered domains
  • DMOZ - lots of spam/porn
  • Alexa/Quantcast "top sites" list
  • Wikipedia
  • Tweets - with filtering (e.g., Gnip, DataSift)
  • search 
    • manually enter URLs - slow, but curated
    • use API - faster, but limited, can have junk

Scoring pages

  • analyze text (tokenize)
  • term-based: count occurrences of all phrases, good phrases (manually picked), bad phrases 
    • calculate ratios of counts: score = good/all - bad/all
  • SVM : support vector machine 
    • train with documents
    • creates statistical model 
      • divides training docs into separate classes
      • used to give an unknown doc a class
  • deciding "good" 
    • minimum threshold for amount of real content (versus graphics, cruft, boilerplate, navigation, SEO, ads,…) 
      • use Boilerpipe and other cleaners
    • detect link farms with fake content
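
A small sketch of the term-based score (good/all - bad/all). It counts single-word terms rather than phrases to keep it short; the tokenizer and the term lists in the usage line are assumptions, not from the talk.

```python
import re


def term_score(text, good_terms, bad_terms):
    """Score = good/all - bad/all over the tokens of text.

    good_terms and bad_terms are manually picked sets of lowercase words.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    good = sum(1 for tok in tokens if tok in good_terms)
    bad = sum(1 for tok in tokens if tok in bad_terms)
    return good / len(tokens) - bad / len(tokens)


# Hypothetical term lists for a crawl focused on Hadoop content.
example = term_score("Hadoop crawl spam", {"hadoop", "crawl"}, {"spam"})
```

The score ranges from -1 (all bad terms) to 1 (all good terms); pages below some minimum threshold would be dropped from the crawl.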

Expand crawl frontier

  • normalize links 
    • link lengthening: expand shortened URLs (e.g., bitly)
    • www.(*)/
  • skipping links to low-value pages 
    • suffix filtering: images, pdf, binary types
    • DB generated pages
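
A sketch of link normalization and suffix filtering with the standard `urllib.parse`. The suffix list is an assumed example. Expanding shortened links and detecting DB-generated pages are not covered here, since the first needs a network request and the second is site-specific.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumed list of low-value suffixes: images, pdf, binary types.
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".exe")


def normalize(base_url, href):
    """Resolve href against base_url and reduce it to one canonical form."""
    parts = urlsplit(urljoin(base_url, href))
    host = (parts.hostname or "").lower()   # hostname is case-insensitive
    # Keep an explicit port if present; any userinfo is dropped.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = parts.path or "/"                # "" and "/" are the same page
    # The fragment never reaches the server, so drop it.
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))


def worth_fetching(url):
    """False for links whose path ends in a low-value suffix."""
    return not urlsplit(url).path.lower().endswith(SKIP_SUFFIXES)
```

Normalizing before adding links to the URL state keeps the same page from being fetched twice under different spellings.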

Focused domain crawl

  • one domain
  • generally discovery of target content pages
  • often uses URL patterns to synthesize links
  • two phases 
    • crawl : discovery of details pages
    • fetch/process : details pages
  • keep track of page type in URL state DB
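
A sketch of URL patterns for a single-domain crawl, assuming a made-up site layout where `/catalog/page/N` pages list items and `/item/N` pages hold the details. The patterns and the host name are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL layout for one site.
LISTING_RE = re.compile(r"^/catalog/page/\d+$")
DETAIL_RE = re.compile(r"^/item/\d+$")


def page_type(url):
    """Classify a URL so the URL state DB can record its page type."""
    path = urlsplit(url).path
    if DETAIL_RE.match(path):
        return "detail"
    if LISTING_RE.match(path):
        return "listing"
    return "other"


def synthesize_listing_urls(base, first_page, last_page):
    """Generate listing URLs from the pattern instead of discovering them."""
    return [f"{base}/catalog/page/{n}" for n in range(first_page, last_page + 1)]
```

Phase one crawls the listing pages to discover detail URLs; phase two fetches and processes only URLs typed as "detail".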

Extract

  • characteristics 
    • broad - lots of domains/formats
    • precise - very specific types of data
    • accurate - low error rate
  • types 
    • unstructured - broad/accurate
    • semi-structured - broad/precise
    • structured - precise/accurate
  • HTML -> XHTML (TagSoup, NekoHTML, HtmlCleaner)
  • detect Charset of page and turn bytes into characters (Tika, ICU)
  • link extraction
  • remove boilerplate
  • language (e.g., tokenizing, audience) 
    • HTTP response: Content-Language: es
    • HTML meta tag : <meta http-equiv
    • HTML tag attributes
    • analyze: ngram statistics, short words, …
  • Unstructured extraction from HTML 
    • title, description from meta
  • Semi-structured 
    • patterns: phone numbers, dates
    • microformats
    • NLP: named entities
  • Structured 
    • xpath (maybe regex)
    • use firebug - will show xpath for each element
    • beware: some browsers will rewrite HTML
    • dom often generated with JavaScript
  • dealing with JavaScript 
    • need to execute: HtmlUnit, qt-webkit, headless Mozilla
    • 10x slower
    • loads server, skews site's statistics (makes webmaster angry)
    • pages work in FF/IE but not HtmlUnit (uses Rhino)
    • pages cause HtmlUnit to hang
    • often a parallel page for web crawlers (look for site map)
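
A sketch of the unstructured and semi-structured cases using only the standard library: title, description, and language hints from the HTML, plus regex patterns for phone numbers and dates. The patterns are assumptions (US-style phone numbers, ISO dates) and are run over the raw HTML for brevity. The HTTP Content-Language header is not handled because no HTTP response is involved here.

```python
import re
from html.parser import HTMLParser

# Assumed patterns: US-style phone numbers and ISO dates only.
PHONE_RE = re.compile(r"\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


class MetaExtractor(HTMLParser):
    """Collects title, meta description, and declared language."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.language = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "html" and attrs.get("lang"):
            self.language = attrs["lang"]          # HTML tag attribute
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            equiv = (attrs.get("http-equiv") or "").lower()
            if name == "description":
                self.description = attrs.get("content")
            elif equiv == "content-language" and self.language is None:
                self.language = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html):
    """Return a dict of fields pulled from one HTML page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "language": parser.language,
        "phones": PHONE_RE.findall(html),
        "dates": DATE_RE.findall(html),
    }
```

Structured extraction (xpath against a cleaned DOM) would need a real HTML/XML library and is left out.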

Resources

  • hadoop.apache.org
  • www.cascading.org
  • openbixo.org

2 1:30pm Hands-on Visualization with Tableau, Jock Mackinlay, Ross Perez, Tableau Software

tableau_logo.gif

Jock:

Data story telling / conversations

poster_OrigMinard.gif

photo51.jpg

many-eyes-blog.jpg

Process

  • task that involves data
  • forage for data
  • search
  • translate to visual form
  • develop insight
  • act / repeat

Data modeling

  • data cleaning : ETL

Ross:

http://bit.ly/z9CWCp


3 5:30pm Get connected

meetups

  • meetup.com/Data-Mining
  • meetup.com/VisualizeMyData

Companies

  • Kaggle 
    • run data mining competitions
    • hiring data scientists, developers, product manager, …
    • mission: make data scientists more highly valued based on their models
  • Cloud Physics 
    • hiring big data expert
    • funding from tier one venture firm
  • Yummly 
    • recipe web site
    • hiring analytics, back-end, front-end, …
  • Netflix 
    • hiring data science / data engineering, product management, …
  • Edmodo 
    • social learning for K-12
    • in/out of class to develop, deliver, test curriculum
    • measure teacher performance
    • free secure platform, no ads
    • hiring: data scientists
  • GfK 
    • global market research
    • have lots of data sets
    • hiring data visualization
  • Uber 
    • fancy taxi in SF
    • hiring data team
  • Wealthfront 
    • online financial adviser
    • how to help people interact with investments?
    • how to engage in making a good decision with little capital
    • hiring : engage with large proprietary data sets, explain finance, lead designer, engineers
  • Accretive Health 
    • healthcare technology : 1/2 billion insurance claim sets
    • hiring data miners
  • Disqus 
    • commenting system / discussion network
    • quality of comments, discovery, monetization
    • hiring data team
  • Huawei 
    • telecom
    • hiring big data
  • Tango 
    • mobile video calling
    • hiring data architect, scientist, analyst, engineer
  • Amazon 
    • hiring engineering, system engs, PM, data scientists, …
  • StudyBreak 
    • social event discovery and recommendation
    • hiring data scientist
  • Mendeley 
    • archive of research documents
    • hiring Java/Hadoop/AWS data scientist/engineers
  • New York Times 
    • OpenPaths
    • hiring data scientist, linked data scientist
  • Data Without Borders 
    • connect data scientists with social organization to serve humanity
    • "hiring" data volunteers, managers
  • Trulia 
    • real estate search engine
    • hiring data mining (home price predictions), data scientist
  • Cloudera 
    • Hadoop
    • hiring all engineering roles, data scientist/visualization
  • Flurry 
    • mobile analytics for apps
    • hiring data science, visualization, machine learning
  • Ad Mobius 
    • hiring machine learning, time series, NLP
  • TenTenData 
    • analytical database engine in cloud
    • trillion row web-based spreadsheet
    • hiring …
  • (stealth - no name) 
    • mobile cloud intersection with security
    • hiring analytics

4 7:00pm Strata Mini Maker Faire & Data Crush

cheap wine and gadgets…

I'm at Strata Santa Clara again this year.  I'll post my raw notes.  Hope you find them helpful.