Pythonformer

non-profit

Activity Feed

dsinghvi

updated a dataset 8 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

Viewer • Updated 8 days ago • 184M • 933

dsinghvi

published a dataset 8 days ago

pythonformer/nemotron-cc-small-subset-decontaminated-4conditions

Viewer • Updated 8 days ago • 184M • 933

AutomatedScientist

updated 2 datasets 19 days ago

pythonformer/agents-learn-runtime-benchmarks

Updated 19 days ago • 23

pythonformer/agents-learn-runtime-tasks

Updated 19 days ago • 23

AutomatedScientist

updated a dataset 23 days ago

pythonformer/agents-learn-runtime-train

Viewer • Updated 23 days ago • 6k • 25

ajibawa-2023

updated a dataset 30 days ago

pythonformer/Trajectory-Stitching-Test-Small

Viewer • Updated 30 days ago • 128k • 72 • 1

ajibawa-2023

published a dataset 30 days ago

pythonformer/Trajectory-Stitching-Test-Small

Viewer • Updated 30 days ago • 128k • 72 • 1

ajibawa-2023

updated a dataset about 1 month ago

ontocord/Test-Data

Viewer • Updated May 9 • 500 • 66

ajibawa-2023

published a dataset about 1 month ago

ontocord/Test-Data

Viewer • Updated May 9 • 500 • 66

ajibawa-2023

updated a dataset about 1 month ago

pythonformer/Trajectory-Stitching-Test-7M

Viewer • Updated May 6 • 5.04M • 1.09k

ajibawa-2023

published a dataset about 1 month ago

pythonformer/Trajectory-Stitching-Test-7M

Viewer • Updated May 6 • 5.04M • 1.09k

ajibawa-2023

posted an update about 1 month ago

Post

2143

Stitched-Reasoning-Trajectories-7M

Dataset: ajibawa-2023/Stitched-Reasoning-Trajectories-7M
Stitched-Reasoning-Trajectories-7M is a massive-scale, synthetic multi-hop reasoning dataset. It was built by algorithmically "stitching" together discrete reasoning traces from the original glaiveai/reasoning-v1-20m dataset into continuous, coherent, and logically structured multi-agent trajectories.

By extracting internal sub-questions from <think> blocks and mapping high-information keyword overlaps, this dataset transforms single-turn Q&A pairs into deep, multi-step research plans. To ensure high quality and eliminate "topic drift," every trajectory has been verified using a dense semantic embedding model (BAAI/bge-large-en-v1.5).

The resulting dataset consists of 709 .jsonl files containing over 7.2 million entirely deduplicated, highly coherent reasoning chains.
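The first two steps described above (pulling sub-questions out of <think> blocks and linking traces by keyword overlap) can be sketched roughly as follows. This is a minimal illustration and not the dataset's actual pipeline: the function names, the stopword list, the Jaccard score, and the 0.2 threshold are all assumptions made for the example, and the embedding-based verification step is not shown.

```python
import re

# Hypothetical helpers for illustration only; not the dataset author's code.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
STOPWORDS = {"what", "does", "with", "this", "that", "from", "have", "when"}


def extract_subquestions(response: str) -> list[str]:
    """Return the sentences ending in '?' found inside <think> blocks."""
    questions = []
    for block in THINK_RE.findall(response):
        # Grab each run of text up to and including its terminator.
        for sentence in re.findall(r"[^.?!\n]+[.?!]", block):
            sentence = sentence.strip()
            if sentence.endswith("?"):
                questions.append(sentence)
    return questions


def keywords(text: str) -> set[str]:
    """Crude keyword set: lowercase words longer than 3 letters, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 3 and w not in STOPWORDS}


def overlap(a: str, b: str) -> float:
    """Jaccard overlap between the keyword sets of two texts."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_next(subquestion: str, candidates: list[str], threshold: float = 0.2):
    """Pick the candidate question that best matches a sub-question, if any."""
    scored = [(overlap(subquestion, c), c) for c in candidates]
    score, best = max(scored, default=(0.0, None))
    return best if score >= threshold else None
```

A trace whose <think> block asks "what causes ocean tides?" would be linked to a candidate question about ocean tides and not to an unrelated one, because only the former shares enough keywords to clear the threshold.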

huu-ontocord

updated a dataset about 2 months ago

pythonformer/codeact-agent-pretrain

Viewer • Updated May 2 • 254 • 19

huu-ontocord

published a dataset about 2 months ago

pythonformer/codeact-agent-pretrain

Viewer • Updated May 2 • 254 • 19

huu-ontocord

updated a dataset about 2 months ago

pythonformer/glaive-with-pseudo-search

Viewer • Updated May 1 • 662 • 31

huu-ontocord

published a dataset about 2 months ago

pythonformer/glaive-with-pseudo-search

Viewer • Updated May 1 • 662 • 31

huu-ontocord

updated a dataset about 2 months ago

pythonformer/glaive-with-papers-search

Viewer • Updated May 1 • 107k • 143

huu-ontocord

published a dataset about 2 months ago

pythonformer/glaive-with-papers-search

Viewer • Updated May 1 • 107k • 143

huu-ontocord

authored a paper about 2 months ago

Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics

Paper • 2603.01209 • Published Mar 1 • 1

ajibawa-2023

posted an update about 2 months ago

Post

1338

Ruby-Code-Large
Dataset: ajibawa-2023/Ruby-Code-Large

Ruby-Code-Large is a large-scale corpus of Ruby programming language source code comprising 331,743 code samples stored in .jsonl format. The dataset is designed to support research and development in large language model (LLM) pretraining, static analysis, web application development, and software engineering automation within the Ruby ecosystem.

By offering a substantial, language-focused dataset, Ruby-Code-Large enables targeted experimentation in dynamic programming, object-oriented design, and rapid application development—areas where Ruby is widely used, particularly in web frameworks and scripting.

Ruby-Code-Large addresses the lack of large, curated, Ruby-specific datasets, enabling focused research on expressive syntax, metaprogramming, and high-level abstractions.
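Since the samples are stored as .jsonl, a local shard can be streamed one record at a time. The sketch below assumes each line is a JSON object and uses a hypothetical field name, "code", for the Ruby source; the post does not state the actual schema, so check the dataset card before relying on it.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_samples(path, field: str = "code") -> int:
    """Count records with a non-empty value in `field` (field name is assumed)."""
    return sum(1 for record in iter_jsonl(path) if record.get(field))
```

Streaming line by line keeps memory use flat, which matters for a corpus of this size.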


Team members 11
