624c1d4e68
nuke doledb fixes
Matt Wells
2014-03-08 10:51:15 -07:00
29694f4efe
startup fixes
Matt Wells
2014-03-08 10:25:56 -07:00
8aa0662a27
Merge branch 'diffbot' into testing
Matt Wells
2014-03-08 09:38:44 -07:00
14817df7a9
new site patterns api stuff
Matt Wells
2014-03-08 09:23:32 -07:00
7cdd411ef1
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:26:47 -08:00
72fab5b61e
Do not end a crawl while urls are still being spidered because they might add more links to spiderdb when they finally complete.
Matt Wells
2014-03-07 09:30:12 -08:00
dcd42e455e
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 09:02:29 -08:00
c143ee1fba
fix core when creating a new collection because we incremented m_numRecs but did not grow the ptr buffer. also added support for localgb.conf so we can use that instead of gb.conf to avoid git push/pull conflicts.
Matt Wells
2014-03-07 09:05:14 -08:00
f777e6cccd
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-07 08:23:21 -08:00
d6177019ec
minor fix
Matt Wells
2014-03-07 08:07:09 -08:00
434dd182d4
fix mem leak. always harvest links for custom crawls.
Matt Wells
2014-03-06 21:24:39 -08:00
e351d2a6f1
get searching on token working
Matt Wells
2014-03-06 17:01:41 -08:00
27e8e810d2
use collnum instead of coll string. more stable since resetting collections keeps string the same but changes the collnum.
Matt Wells
2014-03-06 15:48:11 -08:00
d74f748e93
search all collections under a token if "&token" is given but not "&c=..."
Matt Wells
2014-03-06 11:00:43 -08:00
97e46dbf4e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:45 -08:00
ca2d307229
revert gb.conf
Matt Wells
2014-03-06 10:47:03 -08:00
efa92b16fd
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-06 10:45:35 -08:00
25cf0efdbf
first compiled stab at multi collection searching.
Matt Wells
2014-03-06 10:45:13 -08:00
451a092378
fix core from changing parms while evaluating a url.
Matt Wells
2014-03-06 07:47:43 -08:00
0962e243a4
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-03-05 07:43:25 -08:00
58a1feeea5
specify &header=1 explicitly to get json serp header lest we break our clients' parsers
Matt Wells
2014-03-05 07:41:59 -08:00
13e33bc261
fix jezebel crawl from hanging.
Matt Wells
2014-03-04 19:45:26 -08:00
1b62f1582b
print memtable when almost full so we can see where the leak is. more spiders for ethan. do not try to get diffbot reply if page is already json. likely it is an injected diffbot json reply.
Matt Wells
2014-03-04 18:19:50 -08:00
603cd67758
fix csv downloads some more
Matt Wells
2014-03-04 12:07:46 -08:00
2ab9aaeeaa
streaming csv fixes
Matt Wells
2014-03-04 11:04:26 -08:00
866b09d25e
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-03-04 10:46:28 -08:00
b1381cc610
make csv streamable, faster and take almost no memory.
Matt Wells
2014-03-04 10:45:57 -08:00
280dcb85cf
fix for passing testUpdatedContent smoketest
Matt Wells
2014-03-04 09:09:51 -08:00
ab9f2b33c1
definition updates
Matt Wells
2014-03-04 08:37:39 -07:00
1acb16b1ee
tweak empty doledb priority logic. anchor it more to m_doleIpTable for more reliability. seems like it was causing some slowdowns during spidering. seems more continuous now.
Matt Wells
2014-03-03 13:48:59 -08:00
48b5330d9c
only skip checking to spider a url if its doleip table is empty
Matt Wells
2014-03-03 13:22:27 -08:00
282dad6cef
deal with no coll recs when getting link text using msg25. do not share g_lineTable between collections.
Matt Wells
2014-03-03 08:04:24 -08:00
ff8a0b4ef1
do not let all collections share the same line table in linkdb.cpp
Matt Wells
2014-03-03 07:50:11 -08:00
a82abe8260
added ^ operator to url crawl patterns. good for tmz crawl.
Matt Wells
2014-03-02 14:57:59 -08:00
7fd6bbd7f5
added ^ support to url crawl expressions
Matt Wells
2014-03-02 14:41:25 -08:00
e4d425c18f
fix coll being deleted when getting link text.
Matt Wells
2014-03-02 14:24:49 -08:00
bb5016e88b
add the following fields to json search results: currentTimeUTC, responseTimeMS, docsInCollection, hits, moreResultsFollow, and docId. Changes structure of json so that now the results array is returned as an array within a dictionary (field name "results") as opposed to being the only object returned
Daniel Steinberg
2014-03-01 11:16:17 -08:00
aeb2833d20
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 11:46:44 -08:00
11efab9862
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-28 08:23:59 -08:00
c596d38e60
fix core from getting title of json object
Matt Wells
2014-02-28 08:18:09 -08:00
5f3aa24805
took out restrictDomain logic. now we always only follow links on the same domain as the seed UNLESS a url crawl pattern or a url crawl regex was specified.
Matt Wells
2014-02-27 19:53:17 -08:00
42f254125e
fix core in new link text logic. empty msg25 replies are ok if g_errno is set.
Matt Wells
2014-02-27 13:56:32 -08:00
365fc16606
fix core in "wait in line" logic when getting link info in Linkdb.cpp.
Matt Wells
2014-02-27 09:22:35 -08:00
af9eb8fb73
need to allow clients to not restrict to seed domains.
Matt Wells
2014-02-26 22:27:22 -08:00
927f4626ee
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 22:26:13 -08:00
eaca38cbfd
fix new result streaming logic some more
Matt Wells
2014-02-26 21:42:43 -08:00
0933884191
fix super fast and mem efficient search results streaming code.
Matt Wells
2014-02-26 21:18:08 -08:00
f11e25024a
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-26 20:34:06 -08:00
1030e6ada8
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-26 20:30:20 -08:00
b429f12346
add logic to save memory when streaming over 200 results back. should fix oom when streaming back hundreds of thousands of results.
Matt Wells
2014-02-26 20:33:35 -08:00
8208178c79
remove "Initial crawl request" dups from the urls.csv. do not count fake firstip spider request attempts in xmldoc.cpp as crawlbot page download attempts since we just re-add that request with the correct firstip and bail. it basically doubles this count from what users would expect.
Matt Wells
2014-02-26 15:48:52 -08:00
b450bfc2a6
do not show html column in csv. libreoffice and excel flub it if a cell is over 32k or so.
Matt Wells
2014-02-26 15:03:05 -08:00
a0697e1bb5
do not allow custom crawls to spider the web any more.
Matt Wells
2014-02-26 10:26:09 -08:00
6445b0572b
fix gb.conf
Matt Wells
2014-02-26 01:08:20 -08:00
a6b7e088f5
take out tfndb, unused. fix core from diffbot url too long.
Matt Wells
2014-02-26 01:07:13 -08:00
6716d8f21b
remove entry from linetable for linkinfo lookup
Matt Wells
2014-02-26 00:27:29 -08:00
8bb5d106db
fixes for query reindex/delete.
Matt Wells
2014-02-25 18:12:45 -08:00
33c8123288
more fixes for new link info code.
Matt Wells
2014-02-25 13:53:41 -08:00
9c486c77ed
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 12:32:40 -08:00
cf6695f625
speed up getNumTotalRecs() by caching it basically for 2 seconds since pingserver.cpp calls it all the time.
Matt Wells
2014-02-25 12:14:51 -08:00
b3ff7df904
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2014-02-25 11:05:46 -08:00
94a55bf9a6
fixes for new link info code so it doesn't bottleneck. got EFENCE_SIZE working so we can use efence on large allocs only so we don't go oom using it. might help finding some of the out of bounds writing going on.
Matt Wells
2014-02-25 10:55:05 -08:00
ceb623bb8f
do not dedup bulks. only respider urls if error is tmp. mess with msg1 in spider.cpp so niceness is MAX_NICENESS and not 0 because it was not able to trigger a doledb dump.
Matt Wells
2014-02-23 20:04:46 -08:00
72f1312652
new linkdb code compiling.
Matt Wells
2014-02-20 17:27:28 -08:00
9820f14066
checkpoint
Matt Wells
2014-02-20 14:54:21 -08:00
88dfa20cbe
docid based spider rec related fixes.
Matt Wells
2014-02-20 08:46:00 -08:00
e87b71caef
fix query reindex core
Matt Wells
2014-02-19 21:07:01 -08:00
b37b19ea4a
print comma before json item so we do not end in trailing comma ever
Matt Wells
2014-02-19 10:04:49 -08:00
dda7648333
try to fix problem of crawls stopping when they shouldn't. seems like it might be doing the trick.
Matt Wells
2014-02-19 00:51:46 -08:00
b48adc0542
try to fix crawls stopping too early
Matt Wells
2014-02-18 10:28:48 -08:00
ae2aed7066
try to fix a few cores from deleting collections. try to spider urls again if user changes certain crawling parms. like regex, patterns, etc.
Matt Wells
2014-02-18 09:44:15 -08:00
117f0ca4e8
more bio updates
Matt Wells
2014-02-17 15:32:45 -07:00
bc2b9d6179
bio.html updates
Matt Wells
2014-02-17 15:29:08 -07:00
f942183104
ignore maxtocrawl for bulk jobs too
Matt Wells
2014-02-16 22:24:17 -08:00
a4deb7ff08
exempt bulk jobs from maxtoprocess
Matt Wells
2014-02-16 22:14:43 -08:00
9c9d5fff98
print out content type in caps with maroon bg in serps. use empty site patterns to mean no restriction, not "*" anymore for simplicity.
Matt Wells
2014-02-16 22:47:02 -07:00
0b5cd6d3f9
more parm fixes
Matt Wells
2014-02-16 22:18:39 -07:00
48315f6dc3
parm fixes
Matt Wells
2014-02-16 22:13:27 -07:00
725b6189a7
show user's ip in master ips description so they can add it to the list easily.
Matt Wells
2014-02-16 21:56:31 -07:00
9d0dca71db
fix rapid coll delete bug some more.
Matt Wells
2014-02-16 20:13:06 -08:00
def7822a22
multiple red boxes for clarity
Matt Wells
2014-02-16 20:33:51 -07:00
ce652462b0
add color coded circles to coll nav bar. disk usage red box.
Matt Wells
2014-02-16 19:59:53 -07:00
88a151f1d9
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 16:02:00 -08:00
c691b2dd5f
hopcount precedence fix
Matt Wells
2014-02-16 16:01:29 -08:00
a4b6716623
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 15:57:46 -08:00
29c9e49935
do not do link loop detection if doing a custom crawl.
Matt Wells
2014-02-16 15:57:32 -08:00
f8135e628e
fall back to hop count if priority is tied (and both are due to be spidered). defaults back to breadth first like it was doing before.
Matt Wells
2014-02-16 15:52:08 -08:00
0a4963f597
do not allow spot/seeds to be added to collnum being repaired or rebuilt.
Matt Wells
2014-02-16 15:18:50 -08:00
fe63371622
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2014-02-16 13:39:02 -08:00
4930243de3
minor updates
Matt Wells
2014-02-16 13:38:54 -08:00
fe0f2d3537
allow coll delete if not the one being repaired
Matt Wells
2014-02-16 10:55:34 -08:00
32526a9b25
more checksum fixes for json. fixes for repair/rebuild procedure.
Matt Wells
2014-02-16 10:46:41 -08:00
df59d3946a
fix content hash issues for json. do not hash url/resolved_url/html fields. do exact order-independent hashes of remaining field/value pairs. used for setting EDOCUNCHANGED and doing spidertime/querytime deduping. also do not index "html" json field because it is huge, slow and redundant. convert "date" field into a number so we can sort/constrain by article pub date.
Matt Wells
2014-02-15 14:40:56 -08:00
734ce1fc55
fix core from a high priority injection inserting records at the same time as a lower priority spider.
Matt Wells
2014-02-14 10:51:02 -08:00
3271f22995
Merge branch 'diffbot-testing' into diffbot
Matt Wells
2014-02-13 11:25:59 -08:00
dc8b9090e8
fix out of alloc slots core
Matt Wells
2014-02-13 11:21:39 -08:00
08b103f3a4
Merge branch 'diffbot-testing' into diffbot
Matt Wells
2014-02-13 10:11:56 -08:00
c3d8a143be
fix bug of process regex being ignored when crawl regex was specified.
Matt Wells
2014-02-13 10:06:14 -08:00
4eee547391
do not do fuzzy deduping if &icc=1 (include cached copy) is true for search results.
Matt Wells
2014-02-13 08:51:03 -08:00