This blog post belong to a series called Gremlin Recipes. The purpose is to explain the internal of Gremlin and give people a deeper insight into the query language to master it.
I KillrVideo dataset
To illustrate this series of recipes, you need first to create the schema for KillrVideo and import the data.
The graph schema of this dataset is :
INSERTING DATA
First, open the Gremlin console or Datastax Studio (whichever works fine) and execute the following statements:
:remote connect tinkerpop.server conf/remote.yaml session-manage :remote config timeout max :remote console system.graph('KillrVideo').create() :remote config alias g KillrVideo.g
schema.clear(); // Property keys schema.propertyKey("genreId").Text().create(); schema.propertyKey("personId").Text().create(); schema.propertyKey("userId").Text().create(); schema.propertyKey("movieId").Text().create(); schema.propertyKey("name").Text().create(); schema.propertyKey("age").Int().create(); schema.propertyKey("gender").Text().create(); schema.propertyKey("title").Text().create(); schema.propertyKey("year").Int().create(); schema.propertyKey("duration").Int().create(); schema.propertyKey("country").Text().create(); schema.propertyKey("production").Text().multiple().create(); schema.propertyKey("rating").Int().create(); // Vertex labels schema.vertexLabel("genre").properties("genreId","name").create(); schema.vertexLabel("person").properties("personId","name").create(); schema.vertexLabel("user").properties("userId","age","gender").create(); schema.vertexLabel("movie").properties("movieId","title","year","duration","country","production").create(); // Edge labels schema.edgeLabel("knows").connection("user","user").create(); schema.edgeLabel("rated").single().properties("rating").connection("user","movie").create(); schema.edgeLabel("belongsTo").single().connection("movie","genre").create(); schema.edgeLabel("actor").connection("movie","person").create(); schema.edgeLabel("director").single().connection("movie","person").create(); // Vertex indexes schema.vertexLabel("genre").index("genresById").materialized().by("genreId").add(); schema.vertexLabel("genre").index("genresByName").materialized().by("name").add(); schema.vertexLabel("person").index("personsById").materialized().by("personId").add(); schema.vertexLabel("person").index("personsByName").materialized().by("name").add(); schema.vertexLabel("user").index("usersById").materialized().by("userId").add(); schema.vertexLabel("user").index("search").search().by("age").by("gender").asString().add(); schema.vertexLabel("movie").index("moviesById").materialized().by("movieId").add(); schema.vertexLabel("movie").index("moviesByTitle").materialized().by("title").add(); schema.vertexLabel("movie").index("search").search().by("year").by("country").asString().add(); // Edge indexes schema.vertexLabel("user").index("toMoviesByRating").outE("rated").by("rating").add(); schema.vertexLabel("movie").index("toUsersByRating").inE("rated").by("rating").add(); schema.config().option("tx_autostart").set(true); // Load data from file KillrVideo-small.kryo graph.io(IoCore.gryo()).readGraph("/path/to/KillrVideo-small.kryo");
To be able to perform full scan on all this small dataset, you need to add
schema.config().option('graph.allow_scan').set('true');
The file KillrVideo-small.kryo can be downloaded here
II Gremlin as a Stream of data
Usually, any graph traversal starts with g.V()
but what is the type of this expression ? To know that, just execute the below query
gremlin> g.V().getClass() ==>class org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal
The type of g.V()
is indeed a DefaultGraphTraversal
. According to the Gremlin API, DefaultGraphTraversal
implements Iterator<E>
. In fact DefaultGraphTraversal
is an Iterator
of Vertex
.
If we had done g.E().getClass()
it would be an Iterator
of Edge
To confirm that g.V()
is an iterator of vertex, let’s just execute this code
gremlin> g.V().next().getClass() ==>class com.datastax.bdp.graph.impl.element.vertex.DsegCachedVertexImpl
Since it’s an iterator, we can call method next()
on it and call getClass()
to get its type. As we can see DsegCachedVertexImpl
is just an implementation of Gremlin Vertex
interface. Of course since it’s an iterator, we can invoke as well other methods on Iterator
interface like hasNext()
, tryNext()
, next(n)
, limit(n)
, tail(n)
, toList()
, toSet()
…
So far so good, now let’s analyse a very simple traversal
gremlin>g. V(). hasLabel("person"). has("name", "Harrison Ford"). next() ==>v[{~label=person, community_id=1425165696, member_id=527}]
Below is a matrix giving the equivalent stream processing for each Gremlin step:
Gremlin step | Stream equivalent |
---|---|
g.V() | Iterator |
.hasLabel(“person”) | iterator.stream().filter(vertex -> vertex instanceof Person) == Iterator<Person> |
.has(“name”, “Harrison Ford”) | iterator.stream().filter(person -> person.getName().equals(“Harrison Ford”)) == Iterator<Person> |
.next() | iterator.findFirst().get() |
The above traversal just fetch the whole vertex Harrison Ford. What if we only want to retrieve the name instead of the complete vertex ?
gremlin>g. V(). hasLabel("person"). has("name", "Harrison Ford"). values("name"). // iterator.map(person -> person.getName()) == Iterator<String> next() ==>Harrison Ford
The step values("name")
is equivalent to a map operation of Java stream to extract only the name property of the vertex.
Now, if you wish to retrieve more properties from the vertex, you can either use values("property1", "property2", ...,"propertyN")
or use the step valueMap("property1", "property2", ..., "propertyN")
gremlin>g. V(). hasLabel("person"). has("name", "Harrison Ford"). valueMap("name", "personId"). // iterator.map(person -> ImmutableMap.of("name",person.getName(), "id", person.getPersonId())) next() ==>name=[Harrison Ford] ==>personId=[p3211]
The step valueMap("name", "personId")
will transform the vertex object into a Map<String, Object> structure. The key of the map is the property label, the value of the Map is the corresponding value.
To prove it, just call getClass()
again
gremlin>g. V(). hasLabel("person"). has("name", "Harrison Ford"). valueMap("name", "personId"). next(). getClass() ==>class java.util.HashMap
We get a java.util.HashMap
as expected.
III Grouping the stream
Now let’s do something crazier, let’s just count the number of movies released by year in our dataset. For that we use this traversal
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> groupCount().by("year") // Iterator<Map<Int == year of release, Long = count>> ==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}
So the interesting step here is the groupCount().by("year")
. This steps performs 2 actions in one: a grouping followed by a counting. The grouping is executed on the property year
as grouping key. The result of the grouping is Iterator<Map<Int == year of release, Collection<Movie>>>
. The counting step will transform this iterator into Iterator<Map<Int == year of release, Long == number of movies in this year>>
One interesting thing to notice here is that we have a nested collection structure. The outer collection is the
Iterator
and the inner collection is theMap<Int,Long>
.
Because all the grouping is result is stored inside the map, the iterator contains only a single element, which is the map itself. We can prove it with:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>> count() ==>1
Now, what it we want to access the content of the inner collection(the map here) inside the iterator ? Of course we can always invoke next()
but Gremlin also offers an interesting alternative: unfold()
. It is equivalent to the Java stream flatMap()
method. Let’s see it in action:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> groupCount().by("year"). // Iterator<Map<Int == year of release, Long = count>> unfold(). // iterator.stream().flatMap() == Iterator<MapEntry<Int,Long>> next() ==>1921=1
IV Advanced grouping in Gremlin
The grouping we have seen above (groupCount().by("year")
) is just a special case of a more general grouping form: group().by(xxx).by(yyy)
The first by(xxx)
defines the grouping key e.g. the property/value on which the grouping is done. It can be a property of the vertex/edge itself or something more complicated like a complete traversal.
The second by(yyy)
can be:
- either a reducing step like
count()
to reduce the collection of matching vertices for each grouping key - or a projection step to transform the matching collection of vertices into a collection of something else. Something else can be a scalar value or vertices/edges
Let’s put all this into practice. So first, the equivalent of our previous groupCount().by("year")
can be rewritten as:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> group(). by("year"). // iterator = Iterator<Map<Int == year of release, Collection<Movie>> by(count()) // iterator.stream().map(mapStructure -> mapStructure.stream().collect(Collectors.toMap(key -> key, value -> value.size()))) ==>{1921=1, 1925=1, 1926=1, 1927=2, 1931=1, 1933=1, 1934=1, 1935=2, 1936=1, 1937=2, 1938=2, 1939=4, 1940=6, 1941=3, 1942=3, 1943=1, 1944=2, 1945=2, 1946=3, 1948=8, 1949=4, 1950=8, 1951=6, 1952=8, 1953=8, 1954=9, 1955=5, 1956=5, 1957=7, 1958=4, 1959=8, 1960=7, 1961=8, 1962=6, 1963=3, 1964=8, 1965=4, 1966=6, 1967=6, 1968=8, 1969=6, 1970=9, 1971=4, 1972=6, 1973=5, 1974=4, 1975=7, 1976=4, 1977=5, 1978=2, 1979=8, 1980=8, 1981=3, 1982=10, 1983=4, 1984=13, 1985=8, 1986=7, 1987=6, 1988=12, 1989=12, 1990=13, 1991=10, 1992=8, 1993=16, 1994=21, 1995=27, 1996=16, 1997=28, 1998=29, 1999=26, 2000=24, 2001=30, 2002=23, 2003=36, 2004=45, 2005=35, 2006=40, 2007=32, 2008=39, 2009=28, 2010=20, 2011=19, 2012=19, 2013=12, 2014=4, 2015=1, 1903=1}
As expected the result is identical to what we get earlier.
Now let’s say we want to perform a projection of the collection of movies instead of a reduction, we want to have, for each year, a list of movies title instead of movies count, the previous traversal becomes:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> group(). by("year"). // Iterator<Map<Int == year of release, Collection<Movie>> by("title"). // Iterator<Map<Int == year of release, Collection<String == movie title>> unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>> take(10) ==>1921=[The Kid] ==>1925=[The Gold Rush] ==>1926=[The General] ==>1927=[Metropolis, Sunrise] ==>1931=[City Lights] ==>1933=[Duck Soup] ==>1934=[It Happened One Night] ==>1935=[A Night at the Opera, Top Hat] ==>1936=[Modern Times] ==>1937=[Snow White and the Seven Dwarfs, Captains Courageous]
So far, so good. Now instead of doing a simple projection on a Movie property, let’s do a traversal instead. We want to group the movies by year and for each year, we display the list of movie director name:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> group(). by("year"). // Iterator<Map<Int == year of release, Collection<Movie>> by(out("director").values("name")). // Iterator<Map<Int == year of release, Collection<String == director name>> unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>> take(10) ==>1921=Charles Chaplin ==>1925=Charles Chaplin ==>1926=Buster Keaton ==>1927=F.W. Murnau ==>1931=Charles Chaplin ==>1933=Leo McCarey ==>1934=Frank Capra ==>1935=Mark Sandrich ==>1936=Charles Chaplin ==>1937=Victor Fleming
Strangely enough, we only get 1 director name for each year, which is not correct. It looks like a bug in Gremlin but it isn’t.
When we traverse the edge “director” from the vertex “movie”, it is a 1-to-N relationship because we can have for each movie more than 1 directors so we’ll end up having Collection<Collection<String == director name>> and thus a combinatory explosion. By design, Gremlin only call
next()
on inner traversal, maybe to avoid such explosion (this is my assumption, to be confirmed).
To force Gremlin to be exhaustive on the inner traversal out("director").values("name")
, we can use the fold()
step
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> group(). by("year"). // Iterator<Map<Int == year of release, Collection<Movie>> by(out("director").values("name").fold()). // Iterator<Map<Int == year of release, String == director names>> unfold(). // Iterator<MapEntry<Int == year of release, Collection<String == movie title>> take(10) ==>1921=[Charles Chaplin] ==>1925=[Charles Chaplin] ==>1926=[Buster Keaton, Clyde Bruckman] ==>1927=[Fritz Lang, F.W. Murnau] ==>1931=[Charles Chaplin] ==>1933=[Leo McCarey] ==>1934=[Frank Capra] ==>1935=[Sam Wood, Mark Sandrich] ==>1936=[Charles Chaplin] ==>1937=[David Hand, Victor Fleming]
Here we are!
Finally, we can also replace the grouping step by a complete traversal instead of a simple vertex property. Let’s say we want to group movies by director name and display the movies title for each director:
gremlin>g. V(). // Iterator<Vertex> hasLabel("movie"). // Iterator<Movie> group(). by(out("director").values("name")). // Iterator<Map<String == director name, Collection<Movie>> by("title"). // Iterator<Map<String == director name, Collection<String> == movie title>> unfold(). // Iterator<MapEntry<String == director name, Collection<String == movie title>> take(10) ==>Kwak Jae-young=[My Sassy Girl] ==>Peter Jackson=[The Lord of the Rings: The Two Towers, The Lord of the Rings: The Fellowship of the Ring, The Hobbit: The Desolation of Smaug, The Lord of the Rings: The Return of the King, King Kong, The Hobbit: An Unexpected Journey] ==>Alejandro Agresti=[The Lake House] ==>George Pan Cosmatos=[Tombstone] ==>Pitof=[Catwoman] ==>Raman Hui=[Shrek the Third] ==>Michael Bay=[Transformers, Armageddon, The Island, Pearl Harbor, Bad Boys, The Rock] ==>Santiago Segura=[Torrente, el brazo tonto de la ley, Torrente 2: Mission in Marbella, Torrente 3] ==>Robert Stevenson=[Mary Poppins] ==>Clare Kilner=[The Wedding Date]
Again, putting out("director").values("name")
inside the first by()
instead of “year” will do the trick.
And that’s all folks! Do not miss the other Gremlin recipes in this series.
If you have any question about Gremlin, find me on the datastaxacademy.slack.com, channel dse-graph. My id is @doanduyhai