
Ingesting from dataframes

If you prefer to manipulate your data in a dataframe before converting it into a graph, you can hand the dataframe directly to Raphtory once your preprocessing is complete.

Creating a graph from dataframes

The all-in-one way to do this is via the load_from_pandas() function on the Graph, which takes a dataframe for your edges (and optionally one for your vertices) and returns a graph built from them. Its optional arguments cover everything we have seen in the prior direct updates tutorial.

In the example below we ingest some network traffic data describing different types of interactions between servers. In the first half of the code we read information about the servers and their interactions into two dataframes and convert the timestamp column so that it is handled in milliseconds rather than nanoseconds. The two dataframes are then printed so you can see the headers and values.

from raphtory import Graph
import pandas as pd

edges_df = pd.read_csv("data/network_traffic_edges.csv")
edges_df["timestamp"] = pd.to_datetime(edges_df["timestamp"]).astype(
    "datetime64[ms, UTC]"
)

vertices_df = pd.read_csv("data/network_traffic_vertices.csv")
vertices_df["timestamp"] = pd.to_datetime(vertices_df["timestamp"]).astype(
    "datetime64[ms, UTC]"
)

print("Edge Dataframe:")
print(f"{edges_df}\n")
print("Vertex Dataframe:")
print(f"{vertices_df}\n")

Output

Edge Dataframe:
                  timestamp   source  ...          transaction_type  is_encrypted
0 2023-09-01 08:00:00+00:00  ServerA  ...   Critical System Request          True
1 2023-09-01 08:05:00+00:00  ServerA  ...             File Transfer         False
2 2023-09-01 08:10:00+00:00  ServerB  ...  Standard Service Request          True
3 2023-09-01 08:15:00+00:00  ServerD  ...    Administrative Command         False
4 2023-09-01 08:20:00+00:00  ServerC  ...   Critical System Request          True
5 2023-09-01 08:25:00+00:00  ServerE  ...             File Transfer         False
6 2023-09-01 08:30:00+00:00  ServerD  ...  Standard Service Request          True

[7 rows x 6 columns]

Vertex Dataframe:
                  timestamp server_id  ...    primary_function uptime_days
0 2023-09-01 08:00:00+00:00   ServerA  ...            Database         120
1 2023-09-01 08:05:00+00:00   ServerB  ...          Web Server          45
2 2023-09-01 08:10:00+00:00   ServerC  ...        File Storage          90
3 2023-09-01 08:15:00+00:00   ServerD  ...  Application Server          60
4 2023-09-01 08:20:00+00:00   ServerE  ...              Backup          30

[5 rows x 7 columns]
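
Before handing dataframes like these to Raphtory, it can help to confirm that every column you intend to reference exists and that the ID and time columns have no missing values. The sketch below is one way to do that in plain pandas. check_columns is a hypothetical helper, not part of Raphtory, and the single sample row mirrors the ServerA to ServerB edge from this tutorial's data rather than reading the CSV.

```python
import pandas as pd


def check_columns(df: pd.DataFrame, required: list, non_null: list) -> list:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper for illustration only, not a Raphtory function.
    """
    problems = []
    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col in non_null:
        # Only test for nulls in columns that are actually present.
        if col in df.columns and df[col].isna().any():
            problems.append(f"null values in column: {col}")
    return problems


# One row mirroring the ServerA -> ServerB edge used in this tutorial.
sample_edges = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-09-01 08:00:00"], utc=True),
        "source": ["ServerA"],
        "destination": ["ServerB"],
        "data_size_MB": [5.6],
        "transaction_type": ["Critical System Request"],
        "is_encrypted": [True],
    }
)

edge_columns = [
    "timestamp",
    "source",
    "destination",
    "data_size_MB",
    "transaction_type",
    "is_encrypted",
]
id_and_time = ["timestamp", "source", "destination"]

edge_problems = check_columns(sample_edges, edge_columns, id_and_time)
print(edge_problems)  # []

# Dropping a column shows what a failed check looks like.
broken = sample_edges.drop(columns=["destination"])
broken_problems = check_columns(broken, edge_columns, id_and_time)
print(broken_problems)  # ['missing column: destination']
```

The same helper can be pointed at the vertex dataframe by passing its own column names.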

Next we call the load_from_pandas() function, specifying for the edges:

  • The dataframe we are ingesting (edges_df).
  • The source, destination and time columns within the dataframe (source, destination, timestamp).
  • The temporal properties (data_size_MB), constant properties (is_encrypted), and the layer (transaction_type).
  • An additional set of constant properties which will be added to all edges, specifying where these edges come from (useful for when we merge more data in).

This is followed by the information for the vertices:

  • The dataframe we are ingesting (vertices_df).
  • The vertex ID and time columns (server_id, timestamp).
  • The temporal properties (OS_version, primary_function, uptime_days) and constant properties (server_name, hardware_type).
  • A shared constant property labelling the source of this information.

The resulting graph and an example vertex/edge are then printed to show the data fully converted.

g = Graph.load_from_pandas(
    edges_df=edges_df,
    src="source",
    dst="destination",
    time="timestamp",
    props=["data_size_MB"],
    layer_in_df="transaction_type",
    const_props=["is_encrypted"],
    shared_const_props={"datasource": "data/network_traffic_edges.csv"},
    vertex_df=vertices_df,
    vertex_col="server_id",
    vertex_time_col="timestamp",
    vertex_props=["OS_version", "primary_function", "uptime_days"],
    vertex_const_props=["server_name", "hardware_type"],
    vertex_shared_const_props={"datasource": "data/network_traffic_edges.csv"},
)

print("The resulting graphs and example vertex/edge:")
print(g)
print(g.vertex("ServerA"))
print(g.edge("ServerA", "ServerB"))

Output

The resulting graphs and example vertex/edge:
Graph(number_of_edges=7, number_of_vertices=5, number_of_temporal_edges=7, earliest_time="1693555200000", latest_time="1693557000000")
Vertex(name=ServerA, earliest_time="1693555200000", latest_time="1693556400000", properties={OS_version: Ubuntu 20.04, primary_function: Database, uptime_days: 120, _id: ServerA, server_name: Alpha, hardware_type: Blade Server, datasource: data/network_traffic_edges.csv})
Edge(source=ServerA, target=ServerB, earliest_time=1693555200000, latest_time=1693555200000, properties={data_size_MB: 5.6, is_encrypted: {"Critical System Request": Bool(true)}, datasource: {"Critical System Request": Str("data/network_traffic_edges.csv")}})
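
The earliest_time and latest_time values in the output above are integers rather than datetimes. They line up with the dataframe timestamps when read as milliseconds since the Unix epoch, which the following standard-library sketch checks for the first and last edge rows. to_epoch_ms is a hypothetical helper written for this check, not a Raphtory function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(ts: datetime) -> int:
    """Whole milliseconds between the Unix epoch and a timezone-aware datetime."""
    # Dividing one timedelta by another with // gives an exact integer.
    return (ts - EPOCH) // timedelta(milliseconds=1)


# Timestamps of the first and last rows of the edge dataframe.
first_row = datetime(2023, 9, 1, 8, 0, tzinfo=timezone.utc)
last_row = datetime(2023, 9, 1, 8, 30, tzinfo=timezone.utc)

earliest_ms = to_epoch_ms(first_row)
latest_ms = to_epoch_ms(last_row)
print(earliest_ms)  # 1693555200000
print(latest_ms)    # 1693557000000
```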

Adding dataframes into an existing graph

You may already have a graph containing some data, or you may have several dataframes you wish to merge into one graph. To handle this, the graph has the load_vertices_from_pandas() and load_edges_from_pandas() functions, which can be called on an already established graph.

Below we break the above example into a two-stage process, first adding the edges and then adding the vertices. As you can see in the output, the same graph has been created, and it can now be updated with direct updates or further datasets.

g = Graph()
g.load_edges_from_pandas(
    edge_df=edges_df,
    src_col="source",
    dst_col="destination",
    time_col="timestamp",
    props=["data_size_MB"],
    layer_in_df="transaction_type",
    const_props=["is_encrypted"],
    shared_const_props={"datasource": "data/network_traffic_edges.csv"},
)

g.load_vertices_from_pandas(
    vertices_df=vertices_df,
    vertex_col="server_id",
    time_col="timestamp",
    props=["OS_version", "primary_function", "uptime_days"],
    const_props=["server_name", "hardware_type"],
    shared_const_props={"datasource": "data/network_traffic_edges.csv"},
)

print(g)
print(g.vertex("ServerA"))
print(g.edge("ServerA", "ServerB"))

Output

Graph(number_of_edges=7, number_of_vertices=5, number_of_temporal_edges=7, earliest_time="1693555200000", latest_time="1693557000000")
Vertex(name=ServerA, earliest_time="1693555200000", latest_time="1693556400000", properties={OS_version: Ubuntu 20.04, primary_function: Database, uptime_days: 120, _id: ServerA, server_name: Alpha, hardware_type: Blade Server, datasource: data/network_traffic_edges.csv})
Edge(source=ServerA, target=ServerB, earliest_time=1693555200000, latest_time=1693555200000, properties={data_size_MB: 5.6, is_encrypted: {"Critical System Request": Bool(true)}, datasource: {"Critical System Request": Str("data/network_traffic_edges.csv")}})
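
When several dataframes are merged into one graph, the same edge may appear in more than one source. The edge output above shows constant properties keyed by layer, so one check worth considering is whether each constant column takes a single value per edge and layer before loading. The pandas sketch below assumes that is the behaviour you want; how Raphtory itself resolves conflicting values is not covered here. inconsistent_const_props is a hypothetical helper, and the HostX/HostY rows are made up purely to trigger the check.

```python
import pandas as pd


def inconsistent_const_props(df, key_cols, const_cols):
    """Return the key combinations whose 'constant' columns take more than one value.

    Hypothetical helper for illustration only, not a Raphtory function.
    """
    # Count distinct values of each constant column within every key group.
    distinct = df.groupby(key_cols)[const_cols].nunique(dropna=False)
    conflicting = distinct[(distinct > 1).any(axis=1)]
    return conflicting.reset_index()[key_cols]


rows = pd.DataFrame(
    {
        # First row is taken from the tutorial; the HostX rows are invented.
        "source": ["ServerA", "HostX", "HostX"],
        "destination": ["ServerB", "HostY", "HostY"],
        "transaction_type": [
            "Critical System Request",
            "File Transfer",
            "File Transfer",
        ],
        "is_encrypted": [True, True, False],
    }
)

conflicts = inconsistent_const_props(
    rows,
    key_cols=["source", "destination", "transaction_type"],
    const_cols=["is_encrypted"],
)
print(conflicts)
```

Only the HostX to HostY edge is reported, because its two rows disagree on is_encrypted within the same layer.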

Adding constant properties via dataframes

As with the direct updates, there may be instances where you are adding a dataset which has no timestamps within it. To handle this when ingesting via dataframes, the graph has the load_edge_props_from_pandas() and load_vertex_props_from_pandas() functions.

Below we break the ingestion into a four-stage process, adding the constant properties at the end. For brevity, all four stages use the same two dataframes. In real use these would probably be four different dataframes, one for each function call.

Warning

Constant properties can only be added to vertices and edges that are already part of the graph. If you attempt to add a constant property without first adding the vertex/edge, an error will be thrown.

g = Graph()
g.load_edges_from_pandas(
    edge_df=edges_df,
    src_col="source",
    dst_col="destination",
    time_col="timestamp",
    props=["data_size_MB"],
    layer_in_df="transaction_type",
)

g.load_vertices_from_pandas(
    vertices_df=vertices_df,
    vertex_col="server_id",
    time_col="timestamp",
    props=["OS_version", "primary_function", "uptime_days"],
)

g.load_edge_props_from_pandas(
    edge_df=edges_df,
    src_col="source",
    dst_col="destination",
    layer_in_df="transaction_type",
    const_props=["is_encrypted"],
    shared_const_props={"datasource": "data/network_traffic_edges.csv"},
)

g.load_vertex_props_from_pandas(
    vertices_df=vertices_df,
    vertex_col="server_id",
    const_props=["server_name", "hardware_type"],
    shared_const_props={"datasource": "data/network_traffic_edges.csv"},
)

print(g)
print(g.vertex("ServerA"))
print(g.edge("ServerA", "ServerB"))

Output

Graph(number_of_edges=7, number_of_vertices=5, number_of_temporal_edges=7, earliest_time="1693555200000", latest_time="1693557000000")
Vertex(name=ServerA, earliest_time="1693555200000", latest_time="1693556400000", properties={OS_version: Ubuntu 20.04, primary_function: Database, uptime_days: 120, _id: ServerA, server_name: Alpha, hardware_type: Blade Server, datasource: data/network_traffic_edges.csv})
Edge(source=ServerA, target=ServerB, earliest_time=1693555200000, latest_time=1693555200000, properties={data_size_MB: 5.6, is_encrypted: {"Critical System Request": Bool(true)}, datasource: {"Critical System Request": Str("data/network_traffic_edges.csv")}})
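
Because constant properties can only be attached to vertices and edges already in the graph, it may be worth checking a property dataframe against the IDs you have loaded before calling the property functions. The sketch below does this in plain pandas for vertices. rows_with_unknown_ids is a hypothetical helper, and the ServerZ row is invented to show a failing case; in a real workflow the known IDs could come from vertices_df["server_id"].

```python
import pandas as pd


def rows_with_unknown_ids(props_df, id_col, known_ids):
    """Return the rows of props_df whose ID is not among known_ids.

    Hypothetical helper for illustration only, not a Raphtory function.
    """
    return props_df[~props_df[id_col].isin(known_ids)]


# The five server IDs from the vertex dataframe in this tutorial.
known_ids = {"ServerA", "ServerB", "ServerC", "ServerD", "ServerE"}

# ServerA/Alpha comes from the tutorial output; ServerZ/Zeta is made up.
props_df = pd.DataFrame(
    {
        "server_id": ["ServerA", "ServerZ"],
        "server_name": ["Alpha", "Zeta"],
    }
)

unknown = rows_with_unknown_ids(props_df, "server_id", known_ids)
print(unknown["server_id"].tolist())  # ['ServerZ']
```

Rows reported here can be dropped, or their vertices added first, before the constant properties are loaded.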